Empirical Risk Minimization (ERM) Algorithm
An Empirical Risk Minimization (ERM) Algorithm is a Supervised Learning Algorithm that selects the hypothesis minimizing the average loss (the empirical risk) on the training set; the ERM principle defines a family of learning algorithms and is used to derive theoretical bounds on a machine learning algorithm's performance.
- Example(s):
- …
- Counter-Example(s):
- See: PAC Learning Theory, Vladimir Vapnik, Empirical Inference, Unsupervised Learning Algorithm, Data Streaming Algorithm, M-Estimator, Loss Function.
References
2019
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Empirical_risk_minimization Retrieved:2019-5-27.
- Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance (...)
In general, the risk [math]\displaystyle{ R(h) }[/math] cannot be computed because the distribution [math]\displaystyle{ P(x, y) }[/math] is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called empirical risk, by averaging the loss function on the training set:
[math]\displaystyle{ \! R_\text{emp}(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i) }[/math].
The empirical risk minimization principle [1] states that the learning algorithm should choose a hypothesis [math]\displaystyle{ \hat{h} }[/math] which minimizes the empirical risk:
[math]\displaystyle{ \hat{h} = \underset{h \in \mathcal{H}}{\arg \min} R_{\text{emp}}(h) }[/math].
Thus the learning algorithm defined by the ERM principle consists in solving the above optimization problem.
- ↑ V. Vapnik (1992). Principles of Risk Minimization for Learning Theory.
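To make the principle concrete, here is a minimal Python sketch (illustrative, not from the quoted article): it evaluates the empirical risk of each hypothesis in a small finite class of threshold classifiers under 0-1 loss and returns the minimizer. The helper names `empirical_risk` and `erm`, the toy data, and the hypothesis grid are all assumptions of this sketch.

```python
import numpy as np

def empirical_risk(h, X, y, loss):
    # R_emp(h) = (1/n) * sum_i L(h(x_i), y_i)
    return np.mean([loss(h(x_i), y_i) for x_i, y_i in zip(X, y)])

def erm(hypotheses, X, y, loss):
    # ERM principle: pick the hypothesis in H with minimal empirical risk.
    risks = [empirical_risk(h, X, y, loss) for h in hypotheses]
    return hypotheses[int(np.argmin(risks))]

# Toy usage: threshold classifiers h_t(x) = 1[x >= t] under 0-1 loss.
zero_one_loss = lambda y_hat, y_true: float(y_hat != y_true)
X = np.array([0.10, 0.35, 0.40, 0.70, 0.80, 0.90])
y = np.array([0, 0, 0, 1, 1, 1])
H = [(lambda x, t=t: int(x >= t)) for t in np.linspace(0.0, 1.0, 21)]
h_hat = erm(H, X, y, zero_one_loss)  # attains empirical risk 0 on this data
```

With a finite hypothesis class the arg min can be found by enumeration, as above; in practice the class is usually parametric and the minimization is carried out by numerical optimization.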
2015
- (Frostig et al., 2015) ⇒ Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. (2015). “Competing with the Empirical Risk Minimizer in a Single Pass.” In: Proceedings of the Conference on Learning Theory (COLT 2015).
- QUOTE: Stochastic approximation algorithms, such as stochastic gradient descent (SGD) (Robbins and Monro, 1951), are the most widely used in practice, due to their ease of implementation and their efficiency with regards to runtime and memory. Without consideration for computational constraints, we often wish to compute the empirical risk minimizer (ERM; or, equivalently, the M-estimator): [math]\displaystyle{ \hat{w}^{ERM}_{N}\in \underset{w\in S}{\arg\min}\dfrac{1}{N}\sum^N_{i=1}\psi_i(w)\quad }[/math] (2)
In the context of statistical modeling, the ERM is the maximum likelihood estimator (MLE). Under certain regularity conditions, and under correct model specification[1], the MLE is asymptotically efficient, in that no unbiased estimator can have a lower variance in the limit (see Lehmann and Casella (1998); van der Vaart (2000))[2]. Analogous arguments have been made in the stochastic approximation setting, where we do not necessarily have a statistical model of the distribution [math]\displaystyle{ \mathcal{D} }[/math] (see Kushner and Yin (2003)). The regularity conditions we consider herein are standard; without such assumptions, the work in Shalev-Shwartz et al. (2010) shows that the ERM may be unstable.
- ↑ A well specified statistical model is one where the data is generated under some model in the parametric class. See the linear regression example in Section 3.1.
- ↑ Note that biased estimators, e.g., the James-Stein estimator, can outperform the MLE (Lehmann and Casella, 1998).
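As a hedged illustration of the quoted discussion (not code from the paper): in a well-specified linear-Gaussian model, the squared-loss ERM is ordinary least squares and coincides with the MLE, while a single pass of SGD gives the stochastic-approximation alternative. The sample size, noise level, and step-size schedule below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Well-specified model: y_i = <w*, x_i> + Gaussian noise, so the squared-loss
# ERM (ordinary least squares) coincides with the maximum likelihood estimator.
N, d = 10_000, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_star + 0.1 * rng.normal(size=N)

# ERM / MLE: minimize (1/N) * sum_i psi_i(w) with psi_i(w) = (y_i - <w, x_i>)^2.
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Single-pass stochastic gradient descent (Robbins-Monro style): one gradient
# step per example; the decaying step size is an illustrative choice.
w_sgd = np.zeros(d)
for i in range(N):
    step = 0.05 / (1.0 + 0.01 * i)
    grad_i = 2.0 * (X[i] @ w_sgd - y[i]) * X[i]  # gradient of psi_i at w_sgd
    w_sgd -= step * grad_i

print("ERM vs truth:", np.linalg.norm(w_erm - w_star))
print("SGD vs truth:", np.linalg.norm(w_sgd - w_star))
```

The single-pass SGD iterate lands close to the ERM here, which is the regime the quoted paper studies: competing with the empirical risk minimizer while touching each example only once.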