Structural Risk Minimization (SRM) Task
A Structural Risk Minimization (SRM) Task is a Minimization Task that uses Inductive Learning to select a generalized model from a training dataset while avoiding overfitting.
- Context:
- Task Input: Training Dataset.
- Task Output: a Generalized Model.
- Task Requirements: Structural Risk Minimization Function.
- It can be solved by a Structural Risk Minimization (SRM) System that implements a Structural Risk Minimization (SRM) Algorithm.
- It can also be defined as "tuning the capacity of a classifier to the available amount of Training Data".
- Example(s):
- Counter-Example(s):
- See: VC Dimension, Machine Learning, Overfitting, Loss Function, Maximum Entropy, Maximum Likelihood Estimation, Risk Modeling Task, Bound Minimization Algorithm, Rademacher Complexity.
References
2020
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Structural_risk_minimization Retrieved:2020-2-28.
- Structural risk minimization (SRM) is an inductive principle of use in machine learning. Commonly in machine learning, a generalized model must be selected from a finite data set, with the consequent problem of overfitting – the model becoming too strongly tailored to the particularities of the training set and generalizing poorly to new data. The SRM principle addresses this problem by balancing the model's complexity against its success at fitting the training data.
In practical terms, Structural Risk Minimization is implemented by minimizing [math]\displaystyle{ E_{train} + \beta H(W) }[/math] , where [math]\displaystyle{ E_{train} }[/math] is the training error, the function [math]\displaystyle{ H(W) }[/math] is called a regularization function, and [math]\displaystyle{ \beta }[/math] is a constant. [math]\displaystyle{ H(W) }[/math] is chosen such that it takes large values on parameters [math]\displaystyle{ W }[/math] that belong to high-capacity subsets of the parameter space. Minimizing [math]\displaystyle{ H(W) }[/math] in effect limits the capacity of the accessible subsets of the parameter space, thereby controlling the trade-off between minimizing the training error and minimizing the expected gap between the training error and test error.
The SRM principle was first set out in a 1974 paper by Vladimir Vapnik and Alexey Chervonenkis and uses the VC dimension.
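A minimal runnable sketch of this formulation (the linear model, the squared-error training term, and the quadratic choice $H(W)=\|W\|^2$ are illustrative assumptions, not prescribed by the article):

```python
import numpy as np

# Illustrative sketch: minimize E_train + beta * H(W) for a linear model,
# assuming a squared-error training term and the common quadratic choice
# H(W) = ||W||^2 (neither is mandated by the SRM principle itself).

def srm_objective(W, X, y, beta):
    """Regularized objective E_train(W) + beta * H(W)."""
    E_train = np.mean((X @ W - y) ** 2)   # training error
    H = np.sum(W ** 2)                    # regularization function H(W)
    return E_train + beta * H

def minimize_regularized_risk(X, y, beta, lr=0.01, steps=5000):
    """Plain gradient descent on the regularized objective."""
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ W - y) / n + 2 * beta * W
        W -= lr * grad
    return W

# A larger beta restricts the high-capacity parts of parameter space more
# strongly, trading training error against the training/test gap.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)
W_hat = minimize_regularized_risk(X, y, beta=0.1)
print(srm_objective(W_hat, X, y, beta=0.1))
```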
2017a
- (Zhang, 2017) ⇒ Xinhua Zhang. (2017). "Structural Risk Minimization". In: (Sammut & Webb, 2017). [https://doi.org/10.1007/978-1-4899-7687-1_799 DOI:10.1007/978-1-4899-7687-1_799]
- QUOTE: Structural risk minimization is an inductive principle used to combat overfitting. It seeks a tradeoff between model complexity and fitness of the model on the training data. ...
Empirical Risk can be defined as:
[math]\displaystyle{ \dfrac{1}{n}\displaystyle\sum_{i=1}^n \delta\left(f(X_i)\ne Y_i\right) }[/math]. Choosing a function $f$ by minimizing the empirical risk often leads to Overfitting. To alleviate this problem, the idea of structural risk minimization (SRM) is to employ an infinite sequence of models $\mathcal{F}_1, \mathcal{F}_2, \cdots$ with increasing capacity. Here each $\mathcal{F}_i$ is a set of functions, e.g., polynomials with degree 3. We minimize the empirical risk in each model with a penalty for the capacity of the model:
[math]\displaystyle{ f_n:=\displaystyle\underset{f\in \mathcal{F}_i, i\in \mathbb{N}}{argmin}\dfrac{1}{n}\sum_{i=1}^n \delta\left(f(X_i)\ne Y_i\right)+capacity\left(\mathcal{F}_i, n \right) }[/math], where capacity$(\mathcal{F}_i, n)$ quantifies the complexity of model $\mathcal{F}_i$ in the context of the given training set.
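A schematic sketch of the selection rule above. The per-model empirical-risk minimizers and the capacity$(\mathcal{F}_i, n)$ values are taken as given inputs; the toy candidates and penalty values below are purely illustrative:

```python
import numpy as np

# Schematic SRM selection: given the empirical-risk minimizer f_i found within
# each model F_i and a precomputed capacity(F_i, n) penalty, return the
# candidate minimizing empirical 0-1 risk plus the penalty.

def empirical_risk(f, X, Y):
    """(1/n) * sum of indicators that f(X_i) != Y_i (0-1 loss)."""
    return np.mean([f(x) != y for x, y in zip(X, Y)])

def srm_select(candidates, capacities, X, Y):
    """argmin over candidates of empirical risk + capacity penalty."""
    scores = [empirical_risk(f, X, Y) + c for f, c in zip(candidates, capacities)]
    best = int(np.argmin(scores))
    return candidates[best], scores[best]

# Toy usage: two threshold classifiers standing in for minimizers from F_1, F_2.
X = np.array([-2.0, -1.0, 1.0, 2.0])
Y = np.array([0, 0, 1, 1])
f1 = lambda x: int(x > 0.0)
f2 = lambda x: int(x > 1.5)
chosen, score = srm_select([f1, f2], capacities=[0.05, 0.10], X=X, Y=Y)
```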
2017b
- (Krishnamurthy, 2017) ⇒ Akshay Krishnamurthy (2017). "Lecture 8: Model Selection And Structural Risk Minimization".
- QUOTE: In the more general setting, imagine we have a possibly infinite sequence of hypothesis classes $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$ at our disposal. Further assume that each $\mathcal{H}_i$ has the uniform convergence property, but possibly with different functions $n_{\mathcal{H}_i}^{UC}$. As usual, if $n$ is big, we should choose a large class, which corresponds to $i$ big, but if $n$ is small, we should choose a smaller class to better balance estimation and approximation. Structural risk minimization is a way to basically do this for free, and get the best of both worlds. The first observation is that if we weight the classes appropriately by taking a union bound, we can get a convergence result that is fairly similar to the uniform convergence results we have used in the past. This is a form of non-uniform learnability. Let us define [math]\displaystyle{ \epsilon_i(n, \delta)=\min\{\epsilon \mid n_{\mathcal{H}_i}^{UC}(\epsilon,\delta)\leq n\} }[/math]
to be the smallest error that you can achieve in the uniform convergence bound with $n$ samples and failure probability $\delta$. We will also need a weighting function $w : \mathbb{N} \to [0, 1]$ such that $\sum_i w(i) \leq 1$. This function should be chosen so that $w(i)$ encodes how much we believe the best hypothesis is in $\mathcal{H}_i$. Some choices that encode minimal beliefs are
- 1. If there are finitely many classes, say just $N$, then set $w(i) = 1/N$.
- 2. Otherwise set $w(i) = \frac{6}{\pi^2 i^2}$, which is a convergent sequence. In the nested setting, this expresses a very mild preference for smaller classes.
- The basic lemma underlying the Structural Risk Minimization procedure is now easy for us to prove if we work through the definitions and use the union bound.
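That lemma is typically stated along the following lines (a standard form from treatments of non-uniform learnability; the exact statement in the lecture notes may differ in detail): with probability at least $1-\delta$ over a sample $S$ of size $n$, for every $i$ and every $h \in \mathcal{H}_i$,
[math]\displaystyle{ \left|L_{\mathcal{D}}(h) - L_{S}(h)\right| \leq \epsilon_i\left(n, w(i)\cdot\delta\right) }[/math]
where $L_{\mathcal{D}}$ denotes the true risk and $L_S$ the empirical risk on $S$. It follows by applying each class's uniform convergence guarantee with failure probability $w(i)\delta$ and taking a union bound over $i$.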
2012
- (Oneto et al., 2012) ⇒ Luca Oneto, Davide Anguita, Alessandro Ghio, and Sandro Ridella (2012, September). "Rademacher Complexity and Structural Risk Minimization: An Application to Human Gene Expression Datasets". In: International Conference on Artificial Neural Networks. DOI:10.1007/978-3-642-33266-1_61.
2010
- (Vieira et al., 2010) ⇒ D.A.G. Vieira, J.A. Vasconcelos and R.R. Saldanha (2010). "Recent Advances In Neural Networks Structural Risk Minimization Based On Multiobjective Complexity Control Algorithms". In: Machine Learning - Intech.
2008
- (Sewell, 2008) ⇒ Martin Sewell (2008). "Structural Risk Minimization". Published Online: https://www.svms.org/srm/
- QUOTE: Structural risk minimization (SRM) (Vapnik and Chervonenkis, 1974[1]) is an inductive principle for model selection used for learning from finite training data sets. It describes a general model of capacity control and provides a trade-off between hypothesis space complexity (the VC dimension of approximating functions) and the quality of fitting the training data (empirical error). The procedure is outlined below.
- 1. Using a priori knowledge of the domain, choose a class of functions, such as polynomials of degree $n$, neural networks having $n$ hidden layer neurons, a set of splines with $n$ nodes or fuzzy logic models having $n$ rules.
- 2. Divide the class of functions into a hierarchy of nested subsets in order of increasing complexity. For example, polynomials of increasing degree.
- 3. Perform empirical risk minimization on each subset (this is essentially parameter selection).
- 4. Select the model in the series whose sum of empirical risk and VC confidence is minimal.
- Figure 1 (page 2) gives a diagrammatic representation of SRM.
- ↑ Vapnik, V. N., and A. Ya. Chervonenkis, 1974. Teoriya Raspoznavaniya Obrazov: Statisticheskie Problemy Obucheniya. (Russian) [Theory of Pattern Recognition: Statistical Problems of Learning]. Moscow: Nauka
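The VC confidence term used in step 4 is not spelled out in the quote; one commonly cited form, for a hypothesis class of VC dimension $h$, sample size $n$, and confidence level $1-\eta$, is
[math]\displaystyle{ \sqrt{\dfrac{h\left(\ln\frac{2n}{h}+1\right)-\ln\frac{\eta}{4}}{n}} }[/math]
so that, with probability at least $1-\eta$, the expected risk of the selected function is bounded by its empirical risk plus this term.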
2003
- (Scott & Nowak, 2003) ⇒ Clayton Scott, and Robert Nowak (2003). "Dyadic Classification Trees Via Structural Risk Minimization". In: Advances in Neural Information Processing Systems.
- QUOTE: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
1992a
- (Vapnik, 1992) ⇒ V. Vapnik (1992). "Principles Of Risk Minimization For Learning Theory". In: Advances in Neural Information Processing Systems.
- QUOTE: The principle of structural risk minimization (SRM) requires a two-step process: the empirical risk has to be minimized for each element of the structure. The optimal element $S^*$ is then selected to minimize the guaranteed risk, defined as the sum of the empirical risk and the confidence interval. This process involves a trade-off: as $h$ increases the minimum empirical risk decreases, but the confidence interval increases.
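A self-contained sketch of this two-step process on synthetic data. The 1-D data, the polynomial sign classifiers, the least-squares fit standing in for empirical risk minimization, the choice $\eta = 0.05$, and the use of degree + 1 as a proxy for $h$ are all illustrative assumptions; the confidence term is the commonly cited form noted above:

```python
import numpy as np

# SRM as a two-step process: (1) approximate empirical risk minimization within
# each element of a nested structure (polynomial degrees), (2) select the
# element minimizing guaranteed risk = empirical risk + VC confidence term.

def vc_confidence(h, n, eta=0.05):
    """Commonly cited VC confidence term: sqrt((h(ln(2n/h)+1) - ln(eta/4)) / n)."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

def fit_sign_polynomial(x, y, degree):
    """Approximate ERM: least-squares fit to +/-1 labels, classify by sign."""
    coeffs = np.polyfit(x, y, degree)
    return lambda x_new: np.sign(np.polyval(coeffs, x_new))

def empirical_risk(clf, x, y):
    return np.mean(clf(x) != y)

# Synthetic binary problem: labels follow the sign of a cubic, with label noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
y = np.sign(x ** 3 - x + 0.2)
flip = rng.random(200) < 0.1
y[flip] = -y[flip]

# Sweep the nested structure; as the (proxy) capacity h grows, the minimum
# empirical risk tends to fall while the confidence term grows.
n = len(x)
best = None
for degree in range(1, 10):
    clf = fit_sign_polynomial(x, y, degree)
    guaranteed = empirical_risk(clf, x, y) + vc_confidence(h=degree + 1, n=n)
    if best is None or guaranteed < best[0]:
        best = (guaranteed, degree)
print("selected degree:", best[1], "guaranteed risk:", round(best[0], 3))
```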
1992b
- (Guyon et al., 1992) ⇒ I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. A. Solla (1992). "Structural Risk Minimization For Character Recognition". In: Advances in Neural Information Processing Systems. NIPS'91: Proceedings of the 4th International Conference on Neural Information Processing Systems.
- QUOTE: The method of Structural Risk Minimization refers to tuning the capacity of the classifier to the available amount of training data. This capacity is influenced by several factors, including: (1) properties of the input space, (2) nature and structure of the classifier, and (3) learning algorithm. Actions based on these three factors are combined here to control the capacity of linear classifiers and improve generalization on the problem of handwritten digit recognition.