Variational Bayes Inference Algorithm
A variational Bayes inference algorithm is a deterministic approximate Bayesian inference algorithm that uses a global inference criteria.
- Context:
- It can (often) be a Tractable Algorithm (relative to deterministic exact Bayesian inference algorithms).
- It can have better computational performance than an MCMC Algorithm on large data sets.
- It can be implemented by a Variational Bayes Inference System (to solve a variational Bayesian inference task, such as intractable integral approximation).
- It can be applied to a Learning Task, both a Supervised Learning Task and an Unsupervised Learning Task.
- It can have difficulty with Transitivity (in which cases MCMC Algorithm may perform better).
- It can exchange Inference for optimization.
- Example(s):
- a Variational Expectation Maximization Algorithm.
- a Sampling-based Approximate Variational Inference.
- a Mean Field Algorithm, in which "your product is factorized".
- …
- Counter-Example(s):
- a Markov Chain Monte Carlo (MCMC) Sampling Algorithm.
- a Gibbs Sampling Algorithm.
- an Variational EM Algorithm.
- a Sampling Monte-Carlo Inference Algorithm, such as a Monte Carlo Markov Chain Algorithm.
- a Loopy Belief Propogation Algorithm.
- a Bounded Cutset Conditioning Algorithm.
- a Parametric Approximation Algorithm.
- See: Semi-Analytical Algorithm, Local Inference Criteria, Ensemble Learning Algorithm, Bayesian Inference, Integral, Posterior Probability, Statistical Inference, Marginal Likelihood, Marginal Probability.
References
2018
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Variational_Bayesian_methods Retrieved:2018-3-6.
- Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As is typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
- To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
- To derive a lower bound for the marginal likelihood (sometimes called the "evidence") of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the Bayes factor article.)
- In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods — particularly, Markov chain Monte Carlo methods such as Gibbs sampling — for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to directly evaluate or sample from. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, Variational Bayes provides a locally-optimal, exact analytical solution to an approximation of the posterior.
Variational Bayes can be seen as an extension of the EM (expectation-maximization) algorithm from maximum a posteriori estimation (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.
For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to iteratively update the parameters often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.
- Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As is typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
2017
- (Zhang et al., 2017) ⇒ Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. (2017). “Advances in Variational Inference.” In: arXiv preprint arXiv:1711.05597.
- QUOTE: Many modern unsupervised or semi-supervised machine learning algorithms rely on Bayesian probabilistic models. These models are usually intractable and thus require approximate inference. Variational inference (VI) lets us approximate a high-dimensional Bayesian posterior with a simpler variational distribution by solving an optimization problem. This approach has been successfully used in various models and large-scale applications. ...
2006
- (Bishop, 2006) ⇒ Christopher M. Bishop. (2006). “Pattern Recognition and Machine Learning." Springer, Information Science and Statistics. ISBN:0387310738
- QUOTE: Here we turn to a family of approximation techniques called variational inference or variational Bayes, which use more global criteria and which have been widely applied. ... Variational methods have their origins in the 18th century with the work of Euler, Lagrange, and others on the “calculus of variations”. Standard calculus is concerned with finding derivatives of functions. We can think of a function as a mapping that takes the value of a variable as the input and returns the value of the function as the output. ... We can then introduce the concept of a functional derivative, which expresses how the value of the functional changes in response to infinitesimal changes to the input function. ... Although there is nothing intrinsically approximate about variational methods, they do naturally lend themselves to finding approximate solutions. This is done by restring the range of functions over which the optimization is performed, for instance by considering only quadratic functions or by considering functions composed of a linear combination of fixed basis function in which only the coefficients of the linear combinations can vary. In the case of applications to probabilistic inference, the restriction may for example take the form of factorization assumptions (Jordan et al., 1999; Jaakkola, 2001).
We conclude this chapter by discussion an alternative form of deterministic approximation inference, known as expectation propagation or EP (Minka, 2001a; Minka, 2001b). As with the variational Bayes methods discussed so far, this too is based on the minimization of a Kullback-Leibler divergence but now of the reverse form, which gives the approximation rather different properties.
- QUOTE: Here we turn to a family of approximation techniques called variational inference or variational Bayes, which use more global criteria and which have been widely applied. ... Variational methods have their origins in the 18th century with the work of Euler, Lagrange, and others on the “calculus of variations”. Standard calculus is concerned with finding derivatives of functions. We can think of a function as a mapping that takes the value of a variable as the input and returns the value of the function as the output. ... We can then introduce the concept of a functional derivative, which expresses how the value of the functional changes in response to infinitesimal changes to the input function. ... Although there is nothing intrinsically approximate about variational methods, they do naturally lend themselves to finding approximate solutions. This is done by restring the range of functions over which the optimization is performed, for instance by considering only quadratic functions or by considering functions composed of a linear combination of fixed basis function in which only the coefficients of the linear combinations can vary. In the case of applications to probabilistic inference, the restriction may for example take the form of factorization assumptions (Jordan et al., 1999; Jaakkola, 2001).
2005
- (Blei & Jordan, 2005) ⇒ David M. Blei, and Michael I. Jordan (2005), "Variational Methods for Dirichlet Process Mixtures.” In: Bayesian Analysis, 1.
- QUOTE: Dirichlet process (DP) mixture models are the cornerstone of non-parametric Bayesian statistics, and the development of Monte-Carlo Markov chain (MCMC) sampling methods for DP mixtures has enabled the application of non-parametric Bayesian methods to a variety of practical data analysis problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization problems (Opper and Saad 2001; Wainwright and Jordan 2003). Thus far, variational methods have mainly been explored in the parametric setting, in particular within the formalism of the exponential family (Attias 2000; Ghahramani and Beal 2001; Blei et al. 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale image analysis problem.
One such class of alternatives is provided by variational inference methods (Ghahramani and Beal 2001; Jordan et al. 1999; Opper and Saad 2001;Wainwright and Jordan 2003; Wiegerinck 2000). Like MCMC, variational inference methods have their roots in statistical physics, and, in contradistinction to MCMC methods, they are deterministic. The basic idea of variational inference is to formulate the computation of a marginal or conditional probability in terms of an optimization problem. This (generally intractable) problem is then "relaxed," yielding a simplified optimization problem that depends on a number of free parameters, known as variational parameters. Solving for the variational parameters gives an approximation to the marginal or conditional probabilities of interest.
Variational inference methods have been developed principally in the context of the exponential family, where the convexity properties of the natural parameter space and the cumulant function yield an elegant general variational formalism (Wainwright and Jordan 2003). For example, variational methods have been developed for parametric hierarchical Bayesian models based on general exponential family specifications (Ghahramani and Beal 2001). MCMC methods have seen much wider application. In particular, the development of MCMC algorithms for nonparametric models such as the Dirichlet process has led to increased interest in nonparametric Bayesian methods. In the current paper, we aim to close this gap by developing variational methods for Dirichlet process mixtures.
In this paper, we present a variational inference algorithm for DP mixtures based on the stick-breaking representation of the underlying DP. The algorithm involves two probability distributions -- the posterior distribution [math]\displaystyle{ p }[/math] and a variational distribution q. ...
- QUOTE: Dirichlet process (DP) mixture models are the cornerstone of non-parametric Bayesian statistics, and the development of Monte-Carlo Markov chain (MCMC) sampling methods for DP mixtures has enabled the application of non-parametric Bayesian methods to a variety of practical data analysis problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization problems (Opper and Saad 2001; Wainwright and Jordan 2003). Thus far, variational methods have mainly been explored in the parametric setting, in particular within the formalism of the exponential family (Attias 2000; Ghahramani and Beal 2001; Blei et al. 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale image analysis problem.
2004
- Zoubin Ghahramani. (2004). “Bayesian methods in machine learning." Seminar Talk, Oct 18 2004 at University of Birmingham.
- QUOTE: Bayesian methods can be applied to a wide range of probabilistic models commonly used in machine learning and pattern recognition. The challenge is to discover approximate inference methods that can deal with complex models and large scale data sets in reasonable time. In the past few years Variational Bayesian (VB) approximations have emerged as an alternative to MCMC methods. I will review VB methods and demonstrate applications to clustering, dimensionality reduction, time series modelling with hidden Markov and state-space models, independent components analysis (ICA) and learning the structure of probablistic graphical models. Time permitting, I will discuss current and future directions in the machine learning community, including non-parametric Bayesian methods (e.g. Gaussian processes, Dirichlet processes, and extensions).
2003
- (Beal, 2003) ⇒ Matthew J. Beal. (2003). “Variational Algorithms for Approximate Bayesian Inference." Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London.
- QUOTE: This thesis presents a unified variational Bayesian (VB) framework which approximates these computations in models with latent variables using a lower bound on the marginal likelihood.
- (Wainwright & Jordan, 2003) ⇒ M. Wainwright, and Michael I. Jordan. (2003). “Graphical Models, Exponential Families, and Variational Inference.” Technical Report 649, U.C. Berkeley, Dept. of Statistics.
2001
- (Opper & Saad, 2001) ⇒ Manfred Opper, and David Saad, editors. (2001). “Advanced Mean Field Methods: theory and practice." MIT Press. ISBN:0-262-15054-9
- (Jaakkola, 2001) ⇒ Tommi S. Jaakkola. (2001). “Tutorial on Variational Approximation Methods.” In: (Opper & Saad, 2001)
- http://people.csail.mit.edu/tommi/papers/Jaa-nips00-tutorial.pdf
- Let [math]\displaystyle{ P(x_1,...,x_n) }[/math] be the distribution of interest over n variables
- We divide the set of variables into
- 1. “visible" variables xv whose marginal distribution P(xv) we are interested in computing
- 2. “hidden" variables xh whose posterior distribution P(xh|xv) we want
- Evaluating the marginal or posterior involves summing over all con figurations of the hidden variables xh ⇒ P=...
- The complexity of this operation depends on the structure or factorization of the joint distribution P
- We try to capture the factorization explicitly in terms of graphs
- http://people.csail.mit.edu/tommi/papers/Jaa-nips00-tutorial.pdf
2000
- http://www.cs.ubc.ca/~murphyk/Bayes/bayes.html
- Variational methods. The simplest example is the mean-field approximation, which exploits the law of large numbers to approximate large sums of random variables by their means. In particular, we essentially decouple all the nodes, and introduce a new parameter, called a variational parameter, for each node, and iteratively update these parameters so as to minimize the cross-entropy (KL distance) between the approximate and true probability distributions. Updating the variational parameters becomes a proxy for inference. The mean-field approximation produces a lower bound on the likelihood. More sophisticated methods are possible, which give tighter lower (and upper) bounds.
1998
- (Jordan, 1998) ⇒ Michael I. Jordan (ed). (1998). “Learning in Graphical Models. MIT Press. ISBN:0-262-60032-3