Structural Risk Minimization (SRM) Task
A Structural Risk Minimization (SRM) Task is a Minimization Task that uses Inductive Learning to select a generalized model from a training dataset while avoiding overfitting.
- Context:
- Task Input: Training Dataset.
- Task Output: a Generalized Model.
- Task Requirements: Structural Risk Minimization Function.
- It can be solved by a Structural Risk Minimization (SRM) System that implements a Structural Risk Minimization (SRM) Algorithm.
- It can also be defined as "tuning the capacity of a classifier to the available amount of Training Data".
- Example(s):
- Counter-Example(s):
- See: VC Dimension, Machine Learning, Overfitting, Loss Function, Maximum Entropy, Maximum Likelihood Estimation, Risk Modeling Task, Bound Minimization Algorithm, Rademacher Complexity.
References
2020
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Structural_risk_minimization Retrieved:2020-2-28.
- Structural risk minimization (SRM) is an inductive principle of use in machine learning. Commonly in machine learning, a generalized model must be selected from a finite data set, with the consequent problem of overfitting – the model becoming too strongly tailored to the particularities of the training set and generalizing poorly to new data. The SRM principle addresses this problem by balancing the model's complexity against its success at fitting the training data.
In practical terms, Structural Risk Minimization is implemented by minimizing [math]\displaystyle{ E_{train} + \beta H(W) }[/math] , where [math]\displaystyle{ E_{train} }[/math] is the training error, the function [math]\displaystyle{ H(W) }[/math] is called a regularization function, and [math]\displaystyle{ \beta }[/math] is a constant. [math]\displaystyle{ H(W) }[/math] is chosen such that it takes large values on parameters [math]\displaystyle{ W }[/math] that belong to high-capacity subsets of the parameter space. Minimizing [math]\displaystyle{ H(W) }[/math] in effect limits the capacity of the accessible subsets of the parameter space, thereby controlling the trade-off between minimizing the training error and minimizing the expected gap between the training error and test error.
The SRM principle was first set out in a 1974 paper by Vladimir Vapnik and Alexey Chervonenkis and uses the VC dimension.
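A minimal runnable sketch of this formulation (the linear model, the squared-error training term, and the quadratic choice $H(W)=\|W\|^2$ are illustrative assumptions, not prescribed by the article):

```python
import numpy as np

# Illustrative sketch: minimize E_train + beta * H(W) for a linear model,
# assuming a squared-error training term and the common quadratic choice
# H(W) = ||W||^2 (neither is mandated by the SRM principle itself).

def srm_objective(W, X, y, beta):
    """Regularized objective E_train(W) + beta * H(W)."""
    E_train = np.mean((X @ W - y) ** 2)   # training error
    H = np.sum(W ** 2)                    # regularization function H(W)
    return E_train + beta * H

def minimize_regularized_risk(X, y, beta, lr=0.01, steps=5000):
    """Plain gradient descent on the regularized objective."""
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ W - y) / n + 2 * beta * W
        W -= lr * grad
    return W

# A larger beta restricts the high-capacity parts of parameter space more
# strongly, trading training error against the training/test gap.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)
W_hat = minimize_regularized_risk(X, y, beta=0.1)
print(srm_objective(W_hat, X, y, beta=0.1))
```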
2017a
- (Zhang, 2017) ⇒ Xinhua Zhang. (2017). "Structural Risk Minimization". In: (Sammut & Webb, 2017). [https://doi.org/10.1007/978-1-4899-7687-1_799 DOI:10.1007/978-1-4899-7687-1_799]
- QUOTE: Structural risk minimization is an inductive principle used to combat overfitting. It seeks a tradeoff between model complexity and fitness of the model on the training data. ...
Empirical Risk can be defined as:
[math]\displaystyle{ \dfrac{1}{n}\displaystyle\sum_{i=1}^n \delta\left(f(X_i)\ne Y_i\right) }[/math]. Choosing a function $f$ by minimizing the empirical risk often leads to Overfitting. To alleviate this problem, the idea of structural risk minimization (SRM) is to employ an infinite sequence of models $\mathcal{F}_1, \mathcal{F}_2, \cdots$ with increasing capacity. Here each $\mathcal{F}_i$ is a set of functions, e.g., polynomials with degree 3. We minimize the empirical risk in each model with a penalty for the capacity of the model:
[math]\displaystyle{ f_n:=\displaystyle\underset{f\in \mathcal{F}_i, i\in \mathbb{N}}{argmin}\dfrac{1}{n}\sum_{i=1}^n \delta\left(f(X_i)\ne Y_i\right)+capacity\left(\mathcal{F}_i, n \right) }[/math], where capacity$(\mathcal{F}_i, n)$ quantifies the complexity of model $\mathcal{F}_i$ in the context of the given training set.
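A schematic sketch of the selection rule above. The per-model empirical-risk minimizers and the capacity$(\mathcal{F}_i, n)$ values are taken as given inputs; the toy candidates and penalty values below are purely illustrative:

```python
import numpy as np

# Schematic SRM selection: given the empirical-risk minimizer f_i found within
# each model F_i and a precomputed capacity(F_i, n) penalty, return the
# candidate minimizing empirical 0-1 risk plus the penalty.

def empirical_risk(f, X, Y):
    """(1/n) * sum of indicators that f(X_i) != Y_i (0-1 loss)."""
    return np.mean([f(x) != y for x, y in zip(X, Y)])

def srm_select(candidates, capacities, X, Y):
    """argmin over candidates of empirical risk + capacity penalty."""
    scores = [empirical_risk(f, X, Y) + c for f, c in zip(candidates, capacities)]
    best = int(np.argmin(scores))
    return candidates[best], scores[best]

# Toy usage: two threshold classifiers standing in for minimizers from F_1, F_2.
X = np.array([-2.0, -1.0, 1.0, 2.0])
Y = np.array([0, 0, 1, 1])
f1 = lambda x: int(x > 0.0)
f2 = lambda x: int(x > 1.5)
chosen, score = srm_select([f1, f2], capacities=[0.05, 0.10], X=X, Y=Y)
```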
2017b
- (Krishnamurthy, 2017) ⇒ Akshay Krishnamurthy (2017). "Lecture 8: Model Selection And Structural Risk Minimization".
- QUOTE: In the more general setting, imagine we have a possibly infinite sequence of hypothesis classes $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$ at our disposal. Further assume that each $\mathcal{H}_i$ has the uniform convergence property, but possibly with different functions $n_{\mathcal{H}_i}^{UC}$. As usual, if $n$ is big, we should choose a large class, which corresponds to $i$ big, but if $n$ is small, we should choose a smaller class to better balance estimation and approximation. Structural risk minimization is a way to basically do this for free, and get the best of both worlds. The first observation is that if we weight the classes appropriately by taking a union bound, we can get a convergence result that is fairly similar to the uniform convergence results we have used in the past. This is a form of non-uniform learnability. Let us define [math]\displaystyle{ \epsilon_i(n, \delta)=\min\{\epsilon \mid n_{\mathcal{H}_i}^{UC}(\epsilon,\delta)\leq n\} }[/math]
to be the smallest error that you can achieve in the uniform convergence bound with $n$ samples and failure probability $\delta$. We will also need a weighting function $w : \mathbb{N} \to [0, 1]$ such that $\sum_i w(i) \leq 1$. This function should be chosen so that $w(i)$ encodes how much we believe the best hypothesis is in $\mathcal{H}_i$. Some choices that encode minimal beliefs are
- 1. If there are finitely many classes, say just $N$, then set $w(i) = 1/N$.
- 2. Otherwise set $w(i) = \frac{6}{\pi^2 i^2}$, which is a convergent sequence. In the nested setting, this expresses a very mild preference for smaller classes.
- The basic lemma underlying the Structural Risk Minimization procedure is now easy for us to prove if we work through the definitions and use the union bound.
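That lemma is typically stated along the following lines (a standard form from treatments of non-uniform learnability; the exact statement in the lecture notes may differ in detail): with probability at least $1-\delta$ over a sample $S$ of size $n$, for every $i$ and every $h \in \mathcal{H}_i$,
[math]\displaystyle{ \left|L_{\mathcal{D}}(h) - L_{S}(h)\right| \leq \epsilon_i\left(n, w(i)\cdot\delta\right) }[/math]
where $L_{\mathcal{D}}$ denotes the true risk and $L_S$ the empirical risk on $S$. It follows by applying each class's uniform convergence guarantee with failure probability $w(i)\delta$ and taking a union bound over $i$.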
2012
- (Oneto et al., 2012) ⇒ Luca Oneto, Davide Anguita, Alessandro Ghio, and Sandro Ridella (2012, September). "Rademacher Complexity and Structural Risk Minimization: An Application to Human Gene Expression Datasets". In: International Conference on Artificial Neural Networks. DOI:10.1007/978-3-642-33266-1_61.
2010
- (Vieira et al., 2010) ⇒ D.A.G. Vieira, J.A. Vasconcelos and R.R. Saldanha (2010). "Recent Advances In Neural Networks Structural Risk Minimization Based On Multiobjective Complexity Control Algorithms". In: Machine Learning - Intech.
2008
- (Sewell, 2008) ⇒ Martin Sewell (2008). "Structural Risk Minimization". Published Online: https://www.svms.org/srm/
- QUOTE: Structural risk minimization (SRM) (Vapnik and Chervonenkis, 1974[1]) is an inductive principle for model selection used for learning from finite training data sets. It describes a general model of capacity control and provides a trade-off between hypothesis space complexity (the VC dimension of approximating functions) and the quality of fitting the training data (empirical error). The procedure is outlined below.
- 1. Using a priori knowledge of the domain, choose a class of functions, such as polynomials of degree $n$, neural networks having $n$ hidden layer neurons, a set of splines with $n$ nodes or fuzzy logic models having $n$ rules.
- 2. Divide the class of functions into a hierarchy of nested subsets in order of increasing complexity. For example, polynomials of increasing degree.
- 3. Perform empirical risk minimization on each subset (this is essentially parameter selection).
- 4. Select the model in the series whose sum of empirical risk and VC confidence is minimal.
- Figure 1 (page 2) gives a diagrammatic representation of SRM.
- ↑ Vapnik, V. N., and A. Ya. Chervonenkis, 1974. Teoriya Raspoznavaniya Obrazov: Statisticheskie Problemy Obucheniya. (Russian) [Theory of Pattern Recognition: Statistical Problems of Learning]. Moscow: Nauka
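The VC confidence term used in step 4 is not spelled out in the quote; one commonly cited form, for a hypothesis class of VC dimension $h$, sample size $n$, and confidence level $1-\eta$, is
[math]\displaystyle{ \sqrt{\dfrac{h\left(\ln\frac{2n}{h}+1\right)-\ln\frac{\eta}{4}}{n}} }[/math]
so that, with probability at least $1-\eta$, the expected risk of the selected function is bounded by its empirical risk plus this term.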
2003
- (Scott & Nowak, 2003) ⇒ Clayton Scott, and Robert Nowak (2003). "Dyadic Classification Trees Via Structural Risk Minimization". In: Advances in Neural Information Processing Systems.
- QUOTE: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
1992a
- (Vapnik, 1992) ⇒ V. Vapnik (1992). "Principles Of Risk Minimization For Learning Theory". In: Advances in Neural Information Processing Systems.
- QUOTE: The principle of structural risk minimization (SRM) requires a two-step process: the empirical risk has to be minimized for each element of the structure. The optimal element $S^*$ is then selected to minimize the guaranteed risk, defined as the sum of the empirical risk and the confidence interval. This process involves a trade-off: as $h$ increases the minimum empirical risk decreases, but the confidence interval increases.
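A self-contained sketch of this two-step process on synthetic data. The 1-D data, the polynomial sign classifiers, the least-squares fit standing in for empirical risk minimization, the choice $\eta = 0.05$, and the use of degree + 1 as a proxy for $h$ are all illustrative assumptions; the confidence term is the commonly cited form noted above:

```python
import numpy as np

# SRM as a two-step process: (1) approximate empirical risk minimization within
# each element of a nested structure (polynomial degrees), (2) select the
# element minimizing guaranteed risk = empirical risk + VC confidence term.

def vc_confidence(h, n, eta=0.05):
    """Commonly cited VC confidence term: sqrt((h(ln(2n/h)+1) - ln(eta/4)) / n)."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

def fit_sign_polynomial(x, y, degree):
    """Approximate ERM: least-squares fit to +/-1 labels, classify by sign."""
    coeffs = np.polyfit(x, y, degree)
    return lambda x_new: np.sign(np.polyval(coeffs, x_new))

def empirical_risk(clf, x, y):
    return np.mean(clf(x) != y)

# Synthetic binary problem: labels follow the sign of a cubic, with label noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
y = np.sign(x ** 3 - x + 0.2)
flip = rng.random(200) < 0.1
y[flip] = -y[flip]

# Sweep the nested structure; as the (proxy) capacity h grows, the minimum
# empirical risk tends to fall while the confidence term grows.
n = len(x)
best = None
for degree in range(1, 10):
    clf = fit_sign_polynomial(x, y, degree)
    guaranteed = empirical_risk(clf, x, y) + vc_confidence(h=degree + 1, n=n)
    if best is None or guaranteed < best[0]:
        best = (guaranteed, degree)
print("selected degree:", best[1], "guaranteed risk:", round(best[0], 3))
```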
1992b
- (Guyon et al., 1992) ⇒ I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. A. Solla (1992). "Structural Risk Minimization For Character Recognition". In: Advances in Neural Information Processing Systems. NIPS'91: Proceedings of the 4th International Conference on Neural Information Processing Systems.
- QUOTE: The method of Structural Risk Minimization refers to tuning the capacity of the classifier to the available amount of training data. This capacity is influenced by several factors, including: (1) properties of the input space, (2) nature and structure of the classifier, and (3) learning algorithm. Actions based on these three factors are combined here to control the capacity of linear classifiers and improve generalization on the problem of handwritten digit recognition.