Expectation-Maximization (EM) Algorithm
An Expectation-Maximization (EM) Algorithm is a deterministic iterative maximum marginal likelihood estimation algorithm that alternates between performing an expectation (E) step (which computes the expected complete-data log-likelihood, treating the latent variables as if they were observed and using the current parameter estimates) and a maximization (M) step (which computes the maximum likelihood estimates of the parameters by maximizing the expected log-likelihood found in the E step).
- Context:
- It can be used to Train a Generative Classification Model when there is Unlabeled Data.
- It can be used as part of a Semi-Supervised Learning Algorithm.
- It can be used as a Latent Variable Model Training Algorithm.
- It can be implemented by an EM-based System (to solve an EM-based task).
- Example(s):
- a Regularized EM,
- a Baum-Welch Algorithm.
- …
- Counter-Example(s):
- See: Random Variable Expectation; Gaussian Mixture Model; co-EM Algorithm; Distribution Model; Latent Variable Models.
References
2013
- (Wikipedia, 2013) ⇒ http://en.wikipedia.org/wiki/Expectation-maximization_algorithm
- In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.
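The alternation described above can be written as a short driver loop. The following is a minimal, illustrative Python sketch; the e_step and m_step callables are hypothetical placeholders supplied by the specific model, not functions from any particular library, and the parameter vector is assumed to be a flat sequence of scalars.

```python
def expectation_maximization(data, theta, e_step, m_step, max_iter=100, tol=1e-6):
    """Alternate E and M steps until the parameter estimates stop changing.

    Illustrative sketch: `theta` is assumed to be a flat sequence of scalar
    parameters; `e_step` and `m_step` are model-specific callables.
    """
    for _ in range(max_iter):
        # E-step: distribution over (or expected statistics of) the latent
        # variables, given the data and the current parameter estimate theta.
        expectations = e_step(data, theta)
        # M-step: parameters that maximize the expected complete-data
        # log-likelihood computed in the E-step.
        new_theta = m_step(data, expectations)
        # Stop once successive parameter estimates agree to within tol.
        if all(abs(a - b) < tol for a, b in zip(new_theta, theta)):
            theta = new_theta
            break
        theta = new_theta
    return theta
```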
2011
- (Sammut & Webb, 2011) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2011). “Expectation-Maximization Algorithm.” In: (Sammut & Webb, 2011) p.387
2009
- (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Expectation-maximization_algorithm
- … Given a statistical model consisting of a set [math]\displaystyle{ \mathbf{X} }[/math] of observed data, a set of unobserved latent data or missing values [math]\displaystyle{ \mathbf{Z} }[/math], and a vector of unknown parameters [math]\displaystyle{ \boldsymbol\theta }[/math], along with a likelihood function [math]\displaystyle{ L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z}|\boldsymbol\theta) }[/math], the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data [math]\displaystyle{ L(\boldsymbol\theta; \mathbf{X}) = p(\mathbf{X}|\boldsymbol\theta) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\boldsymbol\theta) }[/math]. However, this quantity is often intractable.
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:
- Expectation step (E-step): Calculate the expected value of the log likelihood function, with respect to the conditional distribution of [math]\displaystyle{ \mathbf{Z} }[/math] given [math]\displaystyle{ \mathbf{X} }[/math] under the current estimate of the parameters
[math]\displaystyle{ \boldsymbol\theta^{(t)} }[/math]: [math]\displaystyle{ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta;\mathbf{X},\mathbf{Z}) \right] \, }[/math]
- Maximization step (M-step): Find the parameter that maximizes this quantity:
[math]\displaystyle{ \boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} \ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) \, }[/math]
- Note that in typical models to which EM is applied:
- The observed data points [math]\displaystyle{ \mathbf{X} }[/math] may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set). There may in fact be a vector of observations associated with each data point.
- The missing values (aka latent variables) [math]\displaystyle{ \mathbf{Z} }[/math] are discrete, drawn from a fixed number of values, and there is one latent variable per observed data point.
- The parameters are continuous, and are of two kinds: Parameters that are associated with all data points, and parameters associated with a particular value of a latent variable (i.e. associated with all data points whose corresponding latent variable has a particular value).
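For a concrete instance of the setting described in the notes above (one discrete latent variable per observation, continuous parameters), the E-step and M-step have closed forms for a Gaussian mixture. Below is a minimal NumPy sketch of EM for a two-component univariate Gaussian mixture; the function name em_gmm_1d and its initialization choices are illustrative assumptions, not taken from any library or from the sources cited here.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """EM for a two-component univariate Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    pi = np.array([0.5, 0.5])                  # mixing weights p(Z = k)
    mu = rng.choice(x, size=2, replace=False)  # component means
    var = np.array([x.var(), x.var()])         # component variances

    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(Z_i = k | x_i, theta^(t)).
        dens = np.stack(
            [np.exp(-(x - mu[k]) ** 2 / (2 * var[k])) / np.sqrt(2 * np.pi * var[k])
             for k in range(2)],
            axis=1,
        )
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)

        # M-step: closed-form maximizers of Q(theta | theta^(t)).
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

    return pi, mu, var

# Usage: recover the two components from synthetic data.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x))
```

Each pass computes the posterior responsibilities of the latent component indicators in the E-step and then re-estimates the mixing weights, means, and variances in closed form in the M-step, which for this model is exactly the arg max of Q(θ|θ^(t)).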
- (Wu & Kumar, 2009) ⇒ Xindong Wu, and Vipin Kumar, editors. (2009). “The Top Ten Algorithms in Data Mining.” Chapman & Hall. ISBN:1420089641
- http://www.ics.uci.edu/~smyth/courses/ics274/notes4.pdf
2006
- (Bishop, 2006) ⇒ Christopher M. Bishop. (2006). “Pattern Recognition and Machine Learning.” Springer, Information Science and Statistics.
2001
- (Luo & Hancock, 2001) ⇒ Bin Luo and Edwin R. Hancock. (2001). “Structural Graph Matching Using the EM Algorithm and Singular Value Decomposition.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10).
1981
- (Bock & Aitkin, 1981) ⇒ R Darrell Bock, and Murray Aitkin. (1981). “Marginal Maximum Likelihood Estimation of Item Parameters: Application of An EM Algorithm.” In: Psychometrika. doi:10.1007/BF02293801
- QUOTE: Maximum likelihood estimation of item parameters in the marginal distribution, integrating over the distribution of ability, becomes practical when computing procedures based on an EM algorithm are used.
1977
- (Dempster et al., 1977) ⇒ Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm.” In: Journal of the Royal Statistical Society, Series B, 39(1):1-38.