Supervised Learning Algorithm
A supervised learning algorithm is a learning algorithm that can solve a supervised learning task.
- AKA: Supervised Learner, Supervised Machine Learning Algorithm, Data-Driven Algorithm, Training Algorithm, Classifier Training Algorithm.
- Context:
- It can produce a predictive function.
- It can, based on the Target Attribute datatype, range from being a Supervised Classification Algorithm (categorical attribute), to being a Supervised Ranking Algorithm (ordinal attribute), to being a Supervised Regression Algorithm (numeric attribute).
- It can, depending on the resources consumed during the training phase, range from being an Eager Learning Algorithm (which produces a Predictive Model in advance, ready for all possible Testing Records) to being a Lazy Learning Algorithm (which can iteratively produce a Predictive Model as Testing Records are processed).
- It can, depending on the model representation formalism used, range from being a Model-based Learning Algorithm to being an Instance-based Learning Algorithm.
- It can range from being a Discriminant Learning Algorithm to being a Generative Learning Algorithm.
- It can, depending on whether there is Unlabeled Data, range from being a Fully-Supervised Learning Algorithm to being a Semi-Supervised Learning Algorithm.
- It can, depending on whether there is feedback in the testing phase, range from being an Offline Supervised Learning Algorithm to being an Online Supervised Learning Algorithm.
- It can be implemented by a Supervised Learning System.
- Example(s):
- a Supervised Decision Tree Learning Algorithm, such as the C4.5 algorithm.
- a k-Nearest Neighbor Algorithm.
- a Statistical Regression Algorithm, such as logistic regression or linear regression.
- an Inductive Learning Algorithm.
- a Domain-Specific Supervised Learning Algorithm, such as a Supervised Named Entity Recognition Algorithm.
- Counter-Example(s):
- an Unsupervised Learning Algorithm.
- a Reinforcement Learning Algorithm.
- See: Model Selection Task, Function Fitting Algorithm, Target Attribute, Discriminatively Trained Model, Generatively Trained Model.
References
2011
- http://en.wikipedia.org/wiki/Supervised_learning
In order to solve a given problem of supervised learning, one has to perform the following steps:
- Determine the type of training examples. Before doing anything else, the engineer should decide what kind of data is to be used as an example. For instance, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.
- Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.
- Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output.
- Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support vector machines or decision trees.
- Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.
- Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
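These steps can be illustrated end to end. The following is a minimal sketch; the choice of scikit-learn and the synthetic dataset standing in for gathered training examples are assumptions made for illustration, not prescribed by the text above.

```python
# Minimal end-to-end sketch of the steps above. Assumptions: scikit-learn is
# available, and a synthetic dataset stands in for gathered training examples.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Steps 1-3: "gather" examples with a fixed feature-vector representation.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set now so the final evaluation is untouched by training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 4-5: choose a function class (here a decision tree) and tune its
# control parameters by cross-validation on the training set only.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8, None]}, cv=5)
search.fit(X_train, y_train)

# Step 6: measure accuracy of the learned function on the held-out test set.
print("test accuracy:", accuracy_score(y_test, search.predict(X_test)))
```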
A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).
There are four major issues to consider in supervised learning:
- Bias-variance tradeoff
A first issue is the tradeoff between bias and variance[1]. Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input [math]x[/math] if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for [math]x[/math]. A learning algorithm has high variance for a particular input [math]x[/math] if it predicts different output values when trained on different training sets. The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm[2]. Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be "flexible" so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).
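The tradeoff can be estimated directly by simulation. The sketch below is a minimal illustration (the use of numpy and scikit-learn, the "true" function, the noise level, and the probe input are all assumptions): it refits the same learner on many freshly drawn training sets and reports the squared bias and the variance of its prediction at a single input.

```python
# Estimate bias^2 and variance at one input x0 by refitting on many training
# sets. Assumptions: numpy and scikit-learn; the true function sin(3x), the
# noise level 0.3, and the probe input 0.5 are made up for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)   # assumed "true" regression function
x0, n_sets = 0.5, 200         # probe input; number of simulated training sets

preds = []
for _ in range(n_sets):
    X = rng.uniform(0, 1, size=(50, 1))                    # a fresh training set
    y = f(X.ravel()) + rng.normal(0, 0.3, 50)              # noisy targets
    model = DecisionTreeRegressor(max_depth=2).fit(X, y)   # inflexible learner
    preds.append(model.predict([[x0]])[0])

preds = np.array(preds)
print("bias^2  :", (preds.mean() - f(x0)) ** 2)  # systematic error at x0
print("variance:", preds.var())                  # spread across training sets
```

Raising max_depth makes the learner more flexible, which tends to shrink the bias term and inflate the variance term: the adjustable tradeoff described above.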
- Function complexity and amount of training data
The second issue is the amount of training data available relative to the complexity of the "true" function (classifier or regression function). If the true function is simple, then an "inflexible" learning algorithm with high bias and low variance will be able to learn it from a small amount of data. But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data and using a "flexible" learning algorithm with low bias and high variance. Good learning algorithms therefore automatically adjust the bias/variance tradeoff based on the amount of data available and the apparent complexity of the function to be learned.
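This interaction between data volume and model flexibility is often probed with a learning curve, which tracks held-out performance as the training set grows. A minimal sketch, assuming scikit-learn and a synthetic dataset:

```python
# Learning-curve sketch: validation accuracy as a function of training-set
# size. Assumptions: scikit-learn; the synthetic dataset is for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[0.1, 0.25, 0.5, 1.0], cv=5)

for n, s in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> CV accuracy {s:.3f}")
```

A flexible, low-bias learner typically needs a larger training set before its curve flattens than an inflexible, high-bias one does.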
- Dimensionality of the input space
A third issue is the dimensionality of the input space. If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features. This is because the many "extra" dimensions can confuse the learning algorithm and cause it to have high variance. Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias. In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function. In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower dimensional space prior to running the supervised learning algorithm.
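As an illustration of the feature selection strategy just mentioned, the sketch below (assuming scikit-learn; the synthetic dataset deliberately contains mostly uninformative dimensions) keeps only the features with the strongest univariate association to the label:

```python
# Univariate feature selection sketch. Assumptions: scikit-learn; the dataset
# has 50 input dimensions of which only 5 carry information about the label.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

# Keep the 5 features scoring highest under a per-feature ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)  # (300, 5)
```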
- Noise in the output values
A fourth issue is the degree of noise in the desired output values (the supervisory targets). If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples. This is another case where it is usually best to employ a high bias, low variance classifier.
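The effect can be demonstrated by corrupting some training labels. In the sketch below (assumptions: scikit-learn, a synthetic dataset, and a made-up 20% label-flip rate), the depth-limited tree cannot match the noisy training set exactly and therefore typically scores higher on clean test data than the unpruned tree:

```python
# Label-noise sketch: compare a low-bias (unpruned) and a high-bias (shallow)
# tree on noisy labels. Assumptions: scikit-learn; the data and the 20% flip
# rate are made up for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Flip 20% of the training labels to mimic human or sensor error.
rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.2
y_noisy = np.where(flip, 1 - y_tr, y_tr)

for depth in (None, 3):  # unpruned (low bias) vs. depth-limited (high bias)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = tree.fit(X_tr, y_noisy).score(X_te, y_te)
    print(f"max_depth={depth}: test accuracy {score:.3f}")
```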
- Other factors to consider
Other factors to consider when choosing and applying a learning algorithm include the following:
- Heterogeneity of the data. If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others. Many algorithms, including Support Vector Machines, linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [-1,1] interval); a preprocessing sketch appears after this list. Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees is that they easily handle heterogeneous data.
- Redundancy in the data. If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance-based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.
- Presence of interactions and non-linearities. If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, Support Vector Machines, naive Bayes) and distance functions (e.g., nearest neighbor methods, support vector machines with Gaussian kernels) generally perform well. However, if there are complex interactions among features, then algorithms such as decision trees and neural networks work better, because they are specifically designed to discover these interactions. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.
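As an illustration of the heterogeneity point above, the sketch below (assuming scikit-learn and pandas; the tiny mixed-type table is invented) one-hot encodes the discrete feature and rescales the numeric ones, so that scale-sensitive, distance-based methods become applicable:

```python
# Preprocessing sketch for heterogeneous features. Assumptions: scikit-learn
# and pandas; the tiny mixed-type table below is made up for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "color":  ["red", "blue", "red", "green"],  # discrete feature
    "count":  [3, 7, 1, 4],                     # count feature
    "weight": [1.2, 0.4, 3.1, 2.2],             # continuous feature
})

# Encode the discrete column and standardize the numeric ones so that
# distance-based methods see comparable ranges.
pre = ColumnTransformer([
    ("cat", OneHotEncoder(), ["color"]),
    ("num", StandardScaler(), ["count", "weight"]),
])
X = pre.fit_transform(df)
print(X.shape)  # 4 rows x (3 one-hot columns + 2 scaled columns) = (4, 5)
```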
When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross-validation). Tuning the performance of a learning algorithm can be very time-consuming. Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.
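Such an experimental comparison is commonly run with cross-validation. A minimal sketch, assuming scikit-learn and a synthetic dataset (the candidate set mirrors the algorithms listed below):

```python
# Compare several candidate learners by 5-fold cross-validation. Assumptions:
# scikit-learn; the synthetic dataset stands in for the problem at hand.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)":    SVC(),
    "decision tree":       DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} mean CV accuracy {scores.mean():.3f}")
```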
The most widely used learning algorithms are Support Vector Machines, linear regression, logistic regression, naive Bayes, linear discriminant analysis, decision trees, k-nearest neighbor algorithm, and Neural Networks (Multilayer perceptron).
2009
- http://www.nature.com/nrc/journal/v5/n11/glossary/nrc1739_glossary.html
- SUPERVISED ALGORITHM: A method of statistical or machine learning in which a model is fitted to observations. The algorithm, in effect, learns by example.
2006
- (Bishop, 2006) ⇒ Christopher M. Bishop. (2006). "Pattern Recognition and Machine Learning." Springer, Information Science and Statistics.
1998
- (Kohavi & Provost, 1998) ⇒ Ron Kohavi, and Foster Provost. (1998). "Glossary of Terms." In: Machine Leanring 30(2-3).
- Supervised learning: Techniques used to learn the relationship between independent attributes and a designated dependent attribute (the label). Most induction algorithms fall into the supervised learning category.
1997
- (Mitchell, 1997) ⇒ Tom M. Mitchell. (1997). "Machine Learning." McGraw-Hill.