Variational Bayes Inference Algorithm

From GM-RKB
Jump to navigation Jump to search

A variational Bayes inference algorithm is a deterministic approximate Bayesian inference algorithm that uses a global inference criteria.



References

2018

  • (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Variational_Bayesian_methods Retrieved:2018-3-6.
    • Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As is typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
      1. To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
      2. To derive a lower bound for the marginal likelihood (sometimes called the "evidence") of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the Bayes factor article.)
    • In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods — particularly, Markov chain Monte Carlo methods such as Gibbs sampling — for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to directly evaluate or sample from. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, Variational Bayes provides a locally-optimal, exact analytical solution to an approximation of the posterior.

      Variational Bayes can be seen as an extension of the EM (expectation-maximization) algorithm from maximum a posteriori estimation (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.

      For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to iteratively update the parameters often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.

2017

2006

2005

  • (Blei & Jordan, 2005) ⇒ David M. Blei, and Michael I. Jordan (2005), "Variational Methods for Dirichlet Process Mixtures.” In: Bayesian Analysis, 1.
    • QUOTE: Dirichlet process (DP) mixture models are the cornerstone of non-parametric Bayesian statistics, and the development of Monte-Carlo Markov chain (MCMC) sampling methods for DP mixtures has enabled the application of non-parametric Bayesian methods to a variety of practical data analysis problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization problems (Opper and Saad 2001; Wainwright and Jordan 2003). Thus far, variational methods have mainly been explored in the parametric setting, in particular within the formalism of the exponential family (Attias 2000; Ghahramani and Beal 2001; Blei et al. 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale image analysis problem.

      One such class of alternatives is provided by variational inference methods (Ghahramani and Beal 2001; Jordan et al. 1999; Opper and Saad 2001;Wainwright and Jordan 2003; Wiegerinck 2000). Like MCMC, variational inference methods have their roots in statistical physics, and, in contradistinction to MCMC methods, they are deterministic. The basic idea of variational inference is to formulate the computation of a marginal or conditional probability in terms of an optimization problem. This (generally intractable) problem is then "relaxed," yielding a simplified optimization problem that depends on a number of free parameters, known as variational parameters. Solving for the variational parameters gives an approximation to the marginal or conditional probabilities of interest.

      Variational inference methods have been developed principally in the context of the exponential family, where the convexity properties of the natural parameter space and the cumulant function yield an elegant general variational formalism (Wainwright and Jordan 2003). For example, variational methods have been developed for parametric hierarchical Bayesian models based on general exponential family specifications (Ghahramani and Beal 2001). MCMC methods have seen much wider application. In particular, the development of MCMC algorithms for nonparametric models such as the Dirichlet process has led to increased interest in nonparametric Bayesian methods. In the current paper, we aim to close this gap by developing variational methods for Dirichlet process mixtures.

      In this paper, we present a variational inference algorithm for DP mixtures based on the stick-breaking representation of the underlying DP. The algorithm involves two probability distributions -- the posterior distribution [math]\displaystyle{ p }[/math] and a variational distribution q. ...

2004

  • Zoubin Ghahramani. (2004). “Bayesian methods in machine learning." Seminar Talk, Oct 18 2004 at University of Birmingham.
    • QUOTE: Bayesian methods can be applied to a wide range of probabilistic models commonly used in machine learning and pattern recognition. The challenge is to discover approximate inference methods that can deal with complex models and large scale data sets in reasonable time. In the past few years Variational Bayesian (VB) approximations have emerged as an alternative to MCMC methods. I will review VB methods and demonstrate applications to clustering, dimensionality reduction, time series modelling with hidden Markov and state-space models, independent components analysis (ICA) and learning the structure of probablistic graphical models. Time permitting, I will discuss current and future directions in the machine learning community, including non-parametric Bayesian methods (e.g. Gaussian processes, Dirichlet processes, and extensions).

2003

2001


  • (Jaakkola, 2001) ⇒ Tommi S. Jaakkola. (2001). “Tutorial on Variational Approximation Methods.” In: (Opper & Saad, 2001)
    • http://people.csail.mit.edu/tommi/papers/Jaa-nips00-tutorial.pdf
      • Let [math]\displaystyle{ P(x_1,...,x_n) }[/math] be the distribution of interest over n variables
      • We divide the set of variables into
      • 1. “visible" variables xv whose marginal distribution P(xv) we are interested in computing
      • 2. “hidden" variables xh whose posterior distribution P(xh|xv) we want
      • Evaluating the marginal or posterior involves summing over all con figurations of the hidden variables xh ⇒ P=...
      • The complexity of this operation depends on the structure or factorization of the joint distribution P
      • We try to capture the factorization explicitly in terms of graphs

2000

1998