Supervised Ensemble-based Learning Algorithm
A Supervised Ensemble-based Learning Algorithm is a supervised learning algorithm that is also an ensemble-based learning algorithm.
- Context:
- It can be applied by a Supervised Ensemble-based Learning System (to solve a supervised ensemble-based learning task).
- It can range from being a Fully-Supervised Ensemble Algorithm to being a Semi-Supervised Ensemble Algorithm.
- It can range from (typically) being an Ensemble Classification Algorithm to being an Ensemble Ranking Algorithm to being an Ensemble Regression Algorithm.
- Example(s):
- a Boosting Algorithm, such as AdaBoost.
- a Bagging Algorithm.
- a Decision Tree Ensemble Learning Algorithm, such as Random Forests.
- …
- Counter-Example(s):
- See: Non-Parametric Supervised Learning, Weak Learner, Learning Algorithm, Bayesian Probability.
References
2011
- (Brown, 2011) ⇒ Gavin Brown. (2011). “Ensemble Learning.” In: (Sammut & Webb, 2011) p.312
2010
- (Seni & Elder, 2010) ⇒ Giovanni Seni, and John F. Elder. (2010). “Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions.” Morgan & Claypool. doi:10.2200/S00240ED1V01Y200912DMK002
2006
- (Bishop, 2006) ⇒ Christopher M. Bishop. (2006). “Pattern Recognition and Machine Learning.” Springer, Information Science and Statistics.
2005
- (Bühlmann, 2005) ⇒ Peter Bühlmann. (2005). “16.1 An Introduction to Ensemble Methods.” website
- QUOTE: Ensemble methods aim at improving the predictive performance of a given statistical learning or model fitting technique. The general principle of ensemble methods is to construct a linear combination of some model fitting method, instead of using a single fit of the method. More precisely, consider for simplicity the framework of function estimation. We are interested in estimating a real-valued function [math]\displaystyle{ \displaystyle g:\mathbb{R}^d \to \mathbb{R} }[/math] based on data [math]\displaystyle{ (X_1,Y_1),\ldots, (X_n,Y_n) }[/math] where [math]\displaystyle{ X }[/math] is a [math]\displaystyle{ d }[/math]-dimensional predictor variable and [math]\displaystyle{ Y }[/math] a univariate response. Generalizations to other functions [math]\displaystyle{ g(\cdot) }[/math] and other data-types are possible. We assume to have specified a base procedure which, given some input data (as above), yields an estimated function [math]\displaystyle{ \hat{g}(\cdot) }[/math]. For example, the base procedure could be a nonparametric kernel estimator (if [math]\displaystyle{ d }[/math] is small) or a nonparametric statistical method with some structural restrictions (for [math]\displaystyle{ d \ge 2 }[/math]) such as a regression tree (or class-probability estimates from a classification tree). We can run a base procedure many times when changing the input data: the original idea of ensemble methods is to use reweighted original data to obtain different estimates [math]\displaystyle{ \hat{g}_1(\cdot),\hat{g}_2(\cdot),\hat{g}_3(\cdot),\ldots }[/math] based on different reweighted input data. We can then construct an ensemble-based function estimate [math]\displaystyle{ g_{ens}(\cdot) }[/math] by taking linear combinations of the individual function estimates [math]\displaystyle{ \hat{g}_k(\cdot) }[/math]:
[math]\displaystyle{ \displaystyle \hat{g}_{ens}(\cdot) =\sum_{k=1}^M c_k \hat{g}_k(\cdot)\;, }[/math] (16.1)
where the [math]\displaystyle{ \hat{g}_k(\cdot) }[/math] are obtained from the base procedure based on the [math]\displaystyle{ k }[/math]th reweighted data-set. For some ensemble methods, e.g. for bagging (see Sect. 16.2), the linear combination coefficients [math]\displaystyle{ c_k \equiv 1/M }[/math] are averaging weights; for other methods, e.g. for boosting (see Sect. 16.3), [math]\displaystyle{ \sum_{k=1}^M c_k }[/math] increases as [math]\displaystyle{ M }[/math] gets larger.
Ensemble methods became popular as a relatively simple device to improve the predictive performance of a base procedure. There are different reasons for this: the bagging procedure turns out to be a variance reduction scheme, at least for some base procedures. On the other hand, boosting methods are primarily reducing the (model) bias of the base procedure. This already indicates that bagging and boosting are very different ensemble methods. We will argue in Sects. 16.3.1 and 16.3.6 that boosting may be even viewed as a non-ensemble method which has tremendous advantages over ensemble (or multiple prediction) methods in terms of interpretation.
Random forests (Breiman, 2001) is a very different ensemble method than bagging or boosting. The earliest random forest proposal is from Amit and Geman (Amit & Geman, 1997). From the perspective of prediction, random forests is about as good as boosting, and often better than bagging. For further details about random forests we refer to (Breiman, 2001).
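The combination in Eq. (16.1) with averaging weights [math]\displaystyle{ c_k \equiv 1/M }[/math] is exactly bagging, and can be sketched in a few lines of Python. The constant-mean base procedure below is a hypothetical stand-in (chosen only to keep the example short); any base procedure, such as a regression tree, could be substituted:

```python
import random
import statistics

def fit_base(data):
    """A deliberately trivial base procedure g_hat: predict the mean
    response.  A stand-in for a regression tree or kernel estimator."""
    return statistics.mean(y for _, y in data)

def bagged_ensemble(data, M=25, seed=0):
    """Refit the base procedure on M bootstrap resamples of the data
    and combine the fits with averaging weights c_k = 1/M, as in
    Eq. (16.1) for bagging."""
    rng = random.Random(seed)
    n = len(data)
    fits = [fit_base([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(M)]
    return sum(fits) / M  # g_ens = sum_k c_k * g_hat_k with c_k = 1/M

data = [(float(x), 2.0 * x) for x in range(10)]
print(bagged_ensemble(data, M=50))  # close to the sample mean of y, 9.0
```

For boosting, the same skeleton would instead reweight the data adaptively at each round and use non-uniform coefficients [math]\displaystyle{ c_k }[/math], as described in Sect. 16.3 of the quoted text.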
2000
- (Dietterich, 2000a) ⇒ Thomas G. Dietterich. (2000). “An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization.” In: Machine Learning Journal, 40(2). doi:10.1023/A:1007607513941
- (Valpola, 2000) ⇒ Harri Valpola. (2000). “Bayesian Ensemble Learning for Nonlinear Factor Analysis." PhD Dissertation, Helsinki University of Technology.
- QUOTE: ensemble learning: A technique for approximating the exact application of Bayesian probability theory.
Ensemble learning is a technique for parametric approximation of the posterior probability where fitting the parametric approximation to the actual posterior probability is achieved by minimising their misfit. The misfit is measured with Kullback-Leibler information [70], also known as relative or cross entropy. It is a measure suited for comparing probability distributions and, more importantly, it can be computed efficiently in practice if the approximation is chosen to be simple enough.
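The Kullback-Leibler misfit described above has a closed form when both the parametric approximation and the target posterior are univariate Gaussians. The following is a generic illustration of that measure, not Valpola's actual parameterisation:

```python
import math

def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for univariate Gaussians q = N(mu_q, sigma_q^2) and
    p = N(mu_p, sigma_p^2): the misfit that ensemble (variational)
    learning minimises when fitting q to the posterior p."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

# The misfit is zero exactly when the approximation matches the target:
print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0
```

The efficiency noted in the quote comes from such closed forms: for a simple enough approximating family, the divergence can be evaluated analytically instead of by sampling.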
1998
- (Barber & Bishop, 1998) ⇒ David Barber, and Christopher M. Bishop. (1998). “Ensemble Learning in Bayesian Neural Networks.” In: Christopher M. Bishop (editor), "Generalization in Neural Networks and Machine Learning." Springer. ISBN:354064928X
- QUOTE: A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993) and again involves finding a simple, analytically tractable, approximation to the true posterior distribution. Unlike Laplace's method, however, the approximating distribution is fitted globally, rather than locally, by minimizing a Kullback-Leibler divergence. Hinton and van Camp (1993) showed that, in the case of a Gaussian approximating distribution with a diagonal covariance, a deterministic learning algorithm could be derived.
1993
- (Hinton & van Camp, 1993) ⇒ G. E. Hinton, and D. van Camp. (1993). “Keeping Neural Networks Simple by Minimizing the Description Length of the Weights.” In: Proceedings of the Sixth Annual Conference on Computational Learning Theory.