Semi-Supervised Machine Learning System
A Semi-Supervised Learning System is a Machine Learning System that can learn a predictor from both labeled and unlabeled data.
- AKA: Semi-Supervised Learner, Semi-Supervised Learning System.
- Context:
- It implements a Semi-Supervised Algorithm to solve a Semi-Supervised Task, whose output is a learned predictor.
- It can range from being a Transductive Learning System to being an Inductive Semi-Supervised Learning System.
- Example(s):
- a Self-Training System.
- a Co-Training System.
- a Graph-Based Label Propagation System.
- Counter-Example(s):
- a Supervised Learning System.
- an Unsupervised Learning System.
- See: Vapnik's Principle, Independent Identically Distributed, Statistical Classification, Transduction (Machine Learning), Inductive Reasoning.
References
2017A
- (Zhu, 2017) ⇒ Zhu X. (2017) Semi-supervised Learning. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA
- QUOTE: Semi-supervised learning uses both labeled and unlabeled data to perform an otherwise supervised learning or unsupervised learning task.
In the former case, there is a distinction between inductive semi-supervised learning and transductive learning. In inductive semi-supervised learning, the learner has both labeled training data [math]\displaystyle{ \{(\mathbf{x}_{i},y_{i})\}_{i=1}^{l}\mathop{ \sim }\limits^{ iid}p(\mathbf{x},y) }[/math] and unlabeled training data [math]\displaystyle{ \{\mathbf{x}_{i}\}_{i=l+1}^{l+u}\mathop{ \sim }\limits^{ iid}p(\mathbf{x}) }[/math], and learns a predictor [math]\displaystyle{ f : \mathcal{X}\mapsto \mathcal{Y},f \in \mathcal{F} }[/math], where [math]\displaystyle{ \mathcal{F} }[/math] is the hypothesis space. Here [math]\displaystyle{ \mathbf{x} \in \mathcal{X} }[/math] is an input instance, [math]\displaystyle{ y \in \mathcal{Y} }[/math] its target label (discrete for classification or continuous for regression), [math]\displaystyle{ p(\mathbf{x}, y) }[/math] the unknown joint distribution and [math]\displaystyle{ p(\mathbf{x}) }[/math] its marginal, and typically [math]\displaystyle{ l \ll u }[/math]. The goal is to learn a predictor that predicts future test data better than the predictor learned from the labeled training data alone. In transductive learning, the setting is the same except that one is solely interested in the predictions on the unlabeled training data [math]\displaystyle{ \{\mathbf{x}_{i}\}_{i=l+1}^{l+u} }[/math], without any intention to generalize to future test data.
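The inductive setting above can be illustrated with a minimal self-training sketch in plain Python: a 1-nearest-neighbour base learner is fit on the [math]\displaystyle{ l }[/math] labeled points, confidently pseudo-labels some of the [math]\displaystyle{ u }[/math] unlabeled points, and repeats. The data, threshold, and function names are hypothetical, and self-training is only one of many semi-supervised algorithms.

```python
# Minimal self-training sketch (inductive semi-supervised learning).
# A 1-NN "base learner" is fit on the labeled set; sufficiently confident
# predictions on the unlabeled set are moved into the labeled pool, and
# the process repeats. All data and names here are illustrative.

def nearest_label(x, labeled):
    """Predict via 1-NN; return (label, distance to nearest labeled point)."""
    lab, dist = min(((y, abs(x - xi)) for xi, y in labeled), key=lambda t: t[1])
    return lab, dist

def self_train(labeled, unlabeled, threshold=1.0):
    """Repeatedly pseudo-label unlabeled points whose 1-NN distance is
    below `threshold`, adding them to the labeled pool."""
    labeled, pool = list(labeled), list(unlabeled)
    changed = True
    while changed and pool:
        changed, remaining = False, []
        for x in pool:
            y, d = nearest_label(x, labeled)
            if d < threshold:          # "confident" pseudo-label
                labeled.append((x, y))
                changed = True
            else:
                remaining.append(x)
        pool = remaining
    return labeled

# l = 2 labeled points, u = 4 unlabeled points from two 1-D clusters
model = self_train([(0.0, "a"), (10.0, "b")], [0.8, 1.5, 9.2, 8.6])
predict = lambda x: nearest_label(x, model)[0]
```

Because the pseudo-labeled points fill in the gap between the labeled seeds, the final predictor generalizes to unseen inputs such as 2.0 or 8.0, which is exactly the inductive goal.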
In the latter case, an unsupervised learning task is enhanced by labeled data. For example, in semi-supervised clustering (a.k.a. constrained clustering) one may have a few must-links (two instances must be in the same cluster) and cannot-links (two instances cannot be in the same cluster) in addition to the unlabeled instances to be clustered; in semi-supervised dimensionality reduction one might have the target low-dimensional coordinates on a few instances.
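The must-link/cannot-link idea can be sketched with a toy assignment step in the spirit of COP-KMeans (a minimal illustration under made-up data and names, not the published algorithm): each point goes to its nearest centroid unless that assignment would violate a pairwise constraint among the points assigned so far.

```python
# Illustrative sketch of constrained (semi-supervised) clustering:
# nearest-centroid assignment that respects must-link and cannot-link
# constraints. Data, constraints, and function names are hypothetical.

def violates(point, cluster, assignment, must_link, cannot_link):
    """True if putting `point` in `cluster` breaks any pairwise constraint
    with a point assigned so far."""
    for a, b in must_link:
        other = b if point == a else a if point == b else None
        if other in assignment and assignment[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if point == a else a if point == b else None
        if other in assignment and assignment[other] == cluster:
            return True
    return False

def constrained_assign(points, centroids, must_link=(), cannot_link=()):
    """One assignment pass: nearest non-violating centroid per point."""
    assignment = {}
    for p in points:
        # centroids ordered by distance; take the first legal one
        for c in sorted(range(len(centroids)),
                        key=lambda c: abs(p - centroids[c])):
            if not violates(p, c, assignment, must_link, cannot_link):
                assignment[p] = c
                break
    return assignment

points = [0.0, 1.0, 4.9, 6.0]
# 6.0 is nearest centroid 1 (5.5), but its cannot-link with 4.9
# (assigned to cluster 1 first) forces it into cluster 0 instead.
assignment = constrained_assign(points, centroids=[0.5, 5.5],
                                cannot_link=[(4.9, 6.0)])
```

The constraint information plays the role of the labeled side-information in the quote: without the cannot-link, plain nearest-centroid assignment would put 4.9 and 6.0 in the same cluster.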
2017B
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Semi-supervised_learning Retrieved:2017-12-24.
- Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.
As in the supervised learning framework, we are given a set of [math]\displaystyle{ l }[/math] independently identically distributed examples [math]\displaystyle{ x_1,\dots,x_l \in X }[/math] with corresponding labels [math]\displaystyle{ y_1,\dots,y_l \in Y }[/math]. Additionally, we are given [math]\displaystyle{ u }[/math] unlabeled examples [math]\displaystyle{ x_{l+1},\dots,x_{l+u} \in X }[/math]. Semi-supervised learning attempts to make use of this combined information to surpass the classification performance that could be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.
Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data [math]\displaystyle{ x_{l+1},\dots,x_{l+u} }[/math] only. The goal of inductive learning is to infer the correct mapping from [math]\displaystyle{ X }[/math] to [math]\displaystyle{ Y }[/math] .
Intuitively, we can think of the learning problem as an exam and labeled data as the few example problems that the teacher solved in class. The teacher also provides a set of unsolved problems. In the transductive setting, these unsolved problems are a take-home exam and you want to do well on them in particular. In the inductive setting, these are practice problems of the sort you will encounter on the in-class exam.
It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.
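A toy transductive example consistent with this distinction is graph-based label propagation: labels spread from the labeled nodes to the given unlabeled nodes along edges, and no classification rule over the whole input space is produced. The graph, voting rule, and names below are illustrative assumptions.

```python
# Toy transductive sketch: label propagation on a similarity graph.
# Labeled nodes stay clamped; unlabeled nodes repeatedly take the
# majority label of their already-labeled neighbours. Only the given
# unlabeled nodes receive predictions. All names are illustrative.

def propagate(edges, seed_labels, n_iter=20):
    """Majority-vote label propagation; seed labels never change."""
    labels = dict(seed_labels)
    nodes = {v for e in edges for v in e}
    for _ in range(n_iter):
        updated = dict(labels)
        for v in nodes - set(seed_labels):   # only unlabeled nodes change
            votes = {}
            for a, b in edges:
                if v in (a, b):
                    w = b if v == a else a
                    if w in labels:
                        votes[labels[w]] = votes.get(labels[w], 0) + 1
            if votes:
                updated[v] = max(votes, key=votes.get)
        labels = updated
    return labels

# two chains: A - B - C and D - E - F, with only A and D labeled
edges = [("A", "B"), ("B", "C"), ("D", "E"), ("E", "F")]
labels = propagate(edges, {"A": "+", "D": "-"})
```

Labels reach C and F in two hops without ever fitting a rule for points outside the graph, matching the "take-home exam" intuition above.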
2017C
- (Muslea, 2017) ⇒ Muslea I. (2017) Semi-supervised Text Processing. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA
- QUOTE: In contrast to supervised and unsupervised learners, which use solely labeled or unlabeled examples, respectively, semi-supervised learning systems exploit both labeled and unlabeled examples. In a typical semi-supervised framework, the system takes as input a (small) training set of labeled examples and a (larger) working set of unlabeled examples; the learner’s performance is evaluated on a test set that consists of unlabeled examples. Transductive learning is a particular case of semi-supervised learning in which the working set and the test set are identical.
Semi-supervised learners use the unlabeled examples to improve the performance of the system that could be learned solely from labeled data. Such learners typically exploit – directly or indirectly – the distribution of the available unlabeled examples. Text processing is an ideal application domain for semi-supervised learning because the abundance of text documents available on the Web makes it impossible for humans to label them all. We focus here on two related types of text processing tasks that were heavily studied in the semi-supervised framework: text classification and text clustering.