Semi-Supervised Learning Task
A semi-supervised learning task is a supervised learning task that, in addition to its labeled training set, has access to an unlabeled training set.
- Context:
- It can be solved by a Semi-Supervised Learning System (that implements a Semi-Supervised Learning algorithm).
- It can range from being a Semi-Supervised Classification Task to being a Semi-Supervised Ranking Task to being a Semi-Supervised Estimation Task.
- It can be:
- a Standard Semi-Supervised Learning Task.
- a Weakly-Supervised Learning Task (with a Small Training Set).
- a Self-Supervised Learning Task (with a Labeling Heuristic).
- a Semi-Supervised Learning Task with Constraints.
- a Transductive Learning Task (solvable by a Transductive Learning Algorithm).
- a Co-Training Learning Task (solvable by a Co-Training Learning Algorithm).
- Example(s):
- many NLP Tasks, because of the ready availability of Large Unannotated Corpora.
- …
- Counter-Example(s):
- a (fully) Supervised Learning Task, with a completely labeled training set.
- an Unsupervised Learning Task, with no labeled training data.
- See: Co-Training; Active Learning; Constrained Clustering; Dimensionality Reduction; Online Learning; Vapnik's Principle; Transduction (Machine Learning); Inductive Reasoning.
References
2016
- (Wikipedia, 2016) ⇒ http://wikipedia.org/wiki/semi-supervised_learning Retrieved:2016-2-6.
- Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.
As in the supervised learning framework, we are given a set of [math]\displaystyle{ l }[/math] independently identically distributed examples [math]\displaystyle{ x_1,\dots,x_l \in X }[/math] with corresponding labels [math]\displaystyle{ y_1,\dots,y_l \in Y }[/math]. Additionally, we are given [math]\displaystyle{ u }[/math] unlabeled examples [math]\displaystyle{ x_{l+1},\dots,x_{l+u} \in X }[/math]. Semi-supervised learning attempts to make use of this combined information to surpass the classification performance that could be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.
Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data [math]\displaystyle{ x_{l+1},\dots,x_{l+u} }[/math] only. The goal of inductive learning is to infer the correct mapping from [math]\displaystyle{ X }[/math] to [math]\displaystyle{ Y }[/math] .
Intuitively, we can think of the learning problem as an exam and labeled data as the few example problems that the teacher solved in class. The teacher also provides a set of unsolved problems. In the transductive setting, these unsolved problems are a take-home exam and you want to do well on them in particular. In the inductive setting, these are practice problems of the sort you will encounter on the in-class exam.
It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.
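The setup quoted above can be made concrete with a short sketch. The following is a minimal illustration (the dataset, the choice of l = 20 labeled points, and the scikit-learn estimators are illustrative assumptions, not part of the quoted text); it shows both the inductive goal (a mapping from X to Y usable on unseen points) and the transductive goal (labels for the given unlabeled points only):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation, SelfTrainingClassifier

rng = np.random.RandomState(0)
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Keep l = 20 labeled examples; mark the remaining u examples with -1,
# scikit-learn's convention for "unlabeled".
y_partial = np.full_like(y, -1)
labeled_idx = rng.choice(len(y), size=20, replace=False)
y_partial[labeled_idx] = y[labeled_idx]

# Inductive goal: infer a mapping from X to Y usable on unseen points.
inductive = SelfTrainingClassifier(LogisticRegression()).fit(X, y_partial)
print(inductive.predict(X[:5]))

# Transductive goal: infer labels for the given unlabeled points only.
transductive = LabelPropagation().fit(X, y_partial)
print(transductive.transduction_[y_partial == -1][:10])
```

scikit-learn's sentinel value -1 for unknown targets is used here; other libraries instead take the unlabeled pool as a separate argument.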
2011
- (Zhu, 2011) ⇒ Xiaojin Zhu. (2011). “Semi-Supervised Learning.” In: (Sammut & Webb, 2011) p.892
2009
- (Zhu & Goldberg, 2009) ⇒ Xiaojin Zhu, and Andrew B. Goldberg. (2009). “Introduction to Semi-Supervised Learning.” Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1).
- ABSTRACT: Semi-supervised learning is a learning paradigm concerned with the study of how computers and natural systems such as humans learn in the presence of both labeled and unlabeled data. Traditionally, learning has been studied either in the unsupervised paradigm (e.g., clustering, outlier detection) where all the data are unlabeled, or in the supervised paradigm (e.g., classification, regression) where all the data are labeled. The goal of semi-supervised learning is to understand how combining labeled and unlabeled data may change the learning behavior, and design algorithms that take advantage of such a combination. Semi-supervised learning is of great interest in machine learning and data mining because it can use readily available unlabeled data to improve supervised learning tasks when the labeled data are scarce or expensive. Semi-supervised learning also shows potential as a quantitative tool to understand human category learning, where most of the input is self-evidently unlabeled. In this introductory book, we present some popular semi-supervised learning models, including self-training, mixture models, co-training and multiview learning, graph-based methods, and semi-supervised support vector machines. For each model, we discuss its basic mathematical formulation. The success of semi-supervised learning depends critically on some underlying assumptions. We emphasize the assumptions made by each model and give counterexamples when appropriate to demonstrate the limitations of the different models. In addition, we discuss semi-supervised learning for cognitive psychology. Finally, we give a computational learning theoretic perspective on semi-supervised learning, and we conclude the book with a brief discussion of open questions in the field.
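The co-training model named in the abstract can be sketched in a few lines. In this hedged illustration the two "views" are simply a split of the feature columns, an assumption made for brevity; co-training proper assumes two genuinely independent, individually sufficient views:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
view1, view2 = X[:, :5], X[:, 5:]            # two (assumed) feature views
labeled = list(rng.choice(len(y), 20, replace=False))
pool = np.setdiff1d(np.arange(len(y)), labeled)
pseudo = {i: y[i] for i in labeled}          # index -> (pseudo-)label

for _ in range(5):                           # a few co-training rounds
    idx = np.array(sorted(pseudo))
    labels = np.array([pseudo[i] for i in idx])
    clf1 = GaussianNB().fit(view1[idx], labels)
    clf2 = GaussianNB().fit(view2[idx], labels)
    # Each classifier pseudo-labels its single most confident unlabeled
    # point; the shared labeled pool grows and both retrain next round.
    for clf, view in ((clf1, view1), (clf2, view2)):
        if len(pool) == 0:
            break
        proba = clf.predict_proba(view[pool])
        best = pool[int(np.argmax(proba.max(axis=1)))]
        pseudo[best] = int(clf.predict(view[[best]])[0])
        pool = pool[pool != best]
```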
2008
- (Zhu, 2008) ⇒ Xiaojin Zhu. (2008). “Semi-Supervised Learning Literature Survey (revised edition).” Technical Report 1530, Department of Computer Sciences, University of Wisconsin, Madison.
- Some often-used methods include: EM with generative mixture models, self-training, co-training, transductive support vector machines, and graph-based methods.
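Of the methods Zhu lists, self-training is the simplest to state in code. A bare-bones sketch, assuming a probabilistic base model and an arbitrary confidence threshold of 0.95 (both are illustrative choices, not Zhu's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.95, max_rounds=10):
    """Fit on the labeled set, then repeatedly absorb unlabeled points
    whose predicted class probability clears the threshold."""
    X_l, y_l, X_u = map(np.asarray, (X_l, y_l, X_u))
    model = LogisticRegression().fit(X_l, y_l)
    for _ in range(max_rounds):
        if len(X_u) == 0:
            break
        proba = model.predict_proba(X_u)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break                      # nothing left that the model trusts
        # Move confident points into the labeled set with their
        # predicted classes as pseudo-labels, then refit.
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate(
            [y_l, model.classes_[proba[confident].argmax(axis=1)]])
        X_u = X_u[~confident]
        model = LogisticRegression().fit(X_l, y_l)
    return model
```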
2006
- (Chapelle et al., 2006b) ⇒ Olivier Chapelle, Alexander Zien, and Bernhard Schölkopf (Editors). (2006). “Introduction to Semi-Supervised Learning.” In: (Chapelle et al., 2006a)
- Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. In addition to unlabeled data, the algorithm is provided with some supervision information – but not necessarily for all examples. Often, this information will be the targets associated with some of the examples. In this case, the data set [math]\displaystyle{ X = (x_i)_{i \in [n]} }[/math] can be divided into two parts: the points [math]\displaystyle{ X_l := (x_1, \dots, x_l) }[/math], for which labels [math]\displaystyle{ Y_l := (y_1, \dots, y_l) }[/math] are provided, and the points [math]\displaystyle{ X_u := (x_{l+1}, \dots, x_{l+u}) }[/math], the labels of which are not known. This is “standard” semi-supervised learning as investigated in this book; most chapters will refer to this setting.
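Chapelle et al.'s notation maps directly onto array indexing. A minimal sketch, under the assumption that the labeled points come first in the data set (the sizes l = 50 and u = 450 are arbitrary):

```python
import numpy as np

l, u = 50, 450                                   # arbitrary sizes
X = np.random.RandomState(0).randn(l + u, 3)     # the full set (x_i), i in [n]
Y_l = np.random.RandomState(1).randint(0, 2, size=l)

X_l, X_u = X[:l], X[l:]                          # labeled / unlabeled split

# Many libraries expect a single target vector with a sentinel value for
# "unknown"; scikit-learn, for instance, uses -1 for the unlabeled points.
y_combined = np.concatenate([Y_l, np.full(u, -1)])
```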