Expectation-Maximization (EM) Algorithm
An Expectation-Maximization (EM) Algorithm is a deterministic iterative maximum marginal likelihood estimation algorithm that alternates between performing an expectation (E) step (which computes the expected complete-data log-likelihood, treating the latent variables as if they were observed and using the current parameter estimates) and a maximization (M) step (which computes the maximum likelihood estimates of the parameters by maximizing the expected log-likelihood found in the E step).
- Context:
- It can be used to Train a Generative Classification Model when there is Unlabeled Data.
- It can be used as part of a Semi-Supervised Learning Algorithm.
- It can be used as a Latent Variable Model Training Algorithm.
- It can be implemented by an EM-based System (to solve an EM-based task).
- Example(s):
- a Regularized EM,
- a Baum-Welch Algorithm.
- …
- Counter-Example(s):
- See: Random Variable Expectation; Gaussian Mixture Model; co-EM Algorithm; Distribution Model; Latent Variable Models.
References
2013
- (Wikipedia, 2013) ⇒ http://en.wikipedia.org/wiki/Expectation-maximization_algorithm
- In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.
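The alternation described above can be written as a short driver loop. The following is a minimal, illustrative Python sketch; the e_step and m_step callables are hypothetical placeholders supplied by the specific model, not functions from any particular library, and the parameter vector is assumed to be a flat sequence of scalars.

```python
def expectation_maximization(data, theta, e_step, m_step, max_iter=100, tol=1e-6):
    """Alternate E and M steps until the parameter estimates stop changing.

    Illustrative sketch: `theta` is assumed to be a flat sequence of scalar
    parameters; `e_step` and `m_step` are model-specific callables.
    """
    for _ in range(max_iter):
        # E-step: distribution over (or expected statistics of) the latent
        # variables, given the data and the current parameter estimate theta.
        expectations = e_step(data, theta)
        # M-step: parameters that maximize the expected complete-data
        # log-likelihood computed in the E-step.
        new_theta = m_step(data, expectations)
        # Stop once successive parameter estimates agree to within tol.
        if all(abs(a - b) < tol for a, b in zip(new_theta, theta)):
            theta = new_theta
            break
        theta = new_theta
    return theta
```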
2011
- (Sammut & Webb, 2011) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2011). “Expectation-Maximization Algorithm.” In: (Sammut & Webb, 2011) p.387
2009
- (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Expectation-maximization_algorithm
- … Given a statistical model consisting of a set [math]\displaystyle{ \mathbf{X} }[/math] of observed data, a set of unobserved latent data or missing values [math]\displaystyle{ \mathbf{Z} }[/math], and a vector of unknown parameters [math]\displaystyle{ \boldsymbol\theta }[/math], along with a likelihood function [math]\displaystyle{ L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z}|\boldsymbol\theta) }[/math], the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data [math]\displaystyle{ L(\boldsymbol\theta; \mathbf{X}) = p(\mathbf{X}|\boldsymbol\theta) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\boldsymbol\theta) }[/math]. However, this quantity is often intractable.
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:
- Expectation step (E-step): Calculate the expected value of the log likelihood function, with respect to the conditional distribution of [math]\displaystyle{ \mathbf{Z} }[/math] given [math]\displaystyle{ \mathbf{X} }[/math] under the current estimate of the parameters
[math]\displaystyle{ \boldsymbol\theta^{(t)} }[/math]: [math]\displaystyle{ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta;\mathbf{X},\mathbf{Z}) \right] \, }[/math]
- Maximization step (M-step): Find the parameter that maximizes this quantity:
[math]\displaystyle{ \boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} \ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) \, }[/math]
- Note that in typical models to which EM is applied:
- The observed data points [math]\displaystyle{ \mathbf{X} }[/math] may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set). There may in fact be a vector of observations associated with each data point.
- The missing values (aka latent variables) [math]\displaystyle{ \mathbf{Z} }[/math] are discrete, drawn from a fixed number of values, and there is one latent variable per observed data point.
- The parameters are continuous, and are of two kinds: Parameters that are associated with all data points, and parameters associated with a particular value of a latent variable (i.e. associated with all data points whose corresponding latent variable has a particular value).
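For a concrete instance of the setting described in the notes above (one discrete latent variable per observation, continuous parameters), the E-step and M-step have closed forms for a Gaussian mixture. Below is a minimal NumPy sketch of EM for a two-component univariate Gaussian mixture; the function name em_gmm_1d and its initialization choices are illustrative assumptions, not taken from any library or from the sources cited here.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """EM for a two-component univariate Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    pi = np.array([0.5, 0.5])                  # mixing weights p(Z = k)
    mu = rng.choice(x, size=2, replace=False)  # component means
    var = np.array([x.var(), x.var()])         # component variances

    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(Z_i = k | x_i, theta^(t)).
        dens = np.stack(
            [np.exp(-(x - mu[k]) ** 2 / (2 * var[k])) / np.sqrt(2 * np.pi * var[k])
             for k in range(2)],
            axis=1,
        )
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)

        # M-step: closed-form maximizers of Q(theta | theta^(t)).
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

    return pi, mu, var

# Usage: recover the two components from synthetic data.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x))
```

Each pass computes the posterior responsibilities of the latent component indicators in the E-step and then re-estimates the mixing weights, means, and variances in closed form in the M-step, which for this model is exactly the arg max of Q(θ|θ^(t)).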
- (Wu & Kumar, 2009) ⇒ Xindong Wu, and Vipin Kumar, editors. (2009). “The Top Ten Algorithms in Data Mining.” Chapman & Hall. ISBN:1420089641
- http://www.ics.uci.edu/~smyth/courses/ics274/notes4.pdf
2006
- (Bishop, 2006) ⇒ Christopher M. Bishop. (2006). “Pattern Recognition and Machine Learning.” Springer, Information Science and Statistics.
2001
- (Luo & Hancock, 2001) ⇒ Bin Luo and Edwin R. Hancock. (2001). “Structural Graph Matching Using the EM Algorithm and Singular Value Decomposition.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10).
1981
- (Bock & Aitkin, 1981) ⇒ R Darrell Bock, and Murray Aitkin. (1981). “Marginal Maximum Likelihood Estimation of Item Parameters: Application of An EM Algorithm.” In: Psychometrika. doi:10.1007/BF02293801
- QUOTE: Maximum likelihood estimation of item parameters in the marginal distribution, integrating over the distribution of ability, becomes practical when computing procedures based on an EM algorithm are used.
1977
- (Dempster et al., 1977) ⇒ Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm.” In: Journal of the Royal Statistical Society, Series B, 39(1):1-38.