Gibbs Sampling Algorithm
A Gibbs sampling algorithm is an MCMC algorithm that generates a sequence of random samples from the joint probability distribution of two or more random variables.
- AKA: Gibbs Sampling-based Inference Algorithm.
- Context:
- It is a Randomized Algorithm.
- It can support Approximate Inferencing, such as Bayesian Inference using Gibbs Sampling.
- It can be an Iterative Algorithm.
- It can be a Blocked Gibbs Sampling Algorithm (resampling groups of variables jointly) or a Collapsed Gibbs Sampling Algorithm (integrating some variables out analytically).
- Example(s):
- …
- Counter-Example(s): a deterministic inference algorithm, such as a Variational Bayes Algorithm or an Expectation-Maximization (EM) Algorithm.
- See: Gibbs Sweep; Gibbs Sampling Distribution; LDA Algorithm; Gibbs Random Field; Local Probability; Simulated Annealing.
References
2011
- (Sammut & Webb, 2011) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2011). “Gibbs Sampling.” In: Encyclopedia of Machine Learning. Springer. p.457
- http://en.wikipedia.org/wiki/Gibbs_sampling
- In statistics and in statistical physics, Gibbs sampling or Gibbs sampler is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables. The purpose of such a sequence is to approximate the joint distribution; to approximate the marginal distribution of one of the variables, or some subset of the variables (for example, the unknown parameters or latent variables); or to compute an integral (such as the expected value of one of the variables). Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled. Gibbs sampling is commonly used as a means of statistical inference, especially Bayesian inference. It is a randomized algorithm (i.e. an algorithm that makes use of random numbers, and hence produces different results each time it is run), and is an alternative to deterministic algorithms for statistical inference such as variational Bayes or the expectation-maximization algorithm (EM).
Gibbs sampling is an example of a Markov chain Monte Carlo algorithm. The algorithm is named after the physicist J. W. Gibbs, in reference to an analogy between the sampling algorithm and statistical physics. The algorithm was described by the brothers Stuart and Donald Geman in 1984, some eight decades after Gibbs's death (Geman & Geman, 1984). In its basic version, Gibbs sampling is a special case of the Metropolis–Hastings algorithm. However, in its extended versions (see below), it can be considered a general framework for sampling from a large set of variables by sampling each variable (or in some cases, each group of variables) in turn, and can incorporate the Metropolis–Hastings algorithm (or similar methods such as slice sampling) to implement one or more of the sampling steps. Gibbs sampling is applicable when the joint distribution is not known explicitly or is difficult to sample from directly, but the conditional distribution of each variable is known and is easy (or at least, easier) to sample from. The Gibbs sampling algorithm generates an instance from the distribution of each variable in turn, conditional on the current values of the other variables. It can be shown (see, for example, Gelman et al. 1995) that the sequence of samples constitutes a Markov chain, and the stationary distribution of that Markov chain is just the sought-after joint distribution. Gibbs sampling is particularly well-adapted to sampling the posterior distribution of a Bayesian network, since Bayesian networks are typically specified as a collection of conditional distributions.
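The core loop is short. The following is a minimal sketch (not taken from any of the cited sources): it samples a standard bivariate normal with correlation rho, a case where both full conditionals are themselves normal, so each coordinate can be resampled in turn exactly as the passage above describes. The function name and parameter values are illustrative.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples=10000, burn_in=1000):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is itself normal:
        X | Y=y ~ N(rho * y, 1 - rho^2)
        Y | X=x ~ N(rho * x, 1 - rho^2)
    so each sweep resamples one coordinate given the current other one.
    """
    x, y = 0.0, 0.0                       # arbitrary starting point
    sd = math.sqrt(1.0 - rho * rho)
    samples = []
    for t in range(n_samples + burn_in):
        x = random.gauss(rho * y, sd)     # draw x from p(x | y)
        y = random.gauss(rho * x, sd)     # draw y from p(y | x)
        if t >= burn_in:                  # discard pre-convergence draws
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
mean_x = sum(s[0] for s in samples) / len(samples)
print(f"E[X] ~= {mean_x:.3f} (true value 0)")
```

Because each update conditions on the freshest value of the other coordinate, the successive pairs form a Markov chain whose stationary distribution is the target joint distribution.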
2009
- (Yao et al., 2009) ⇒ Limin Yao, David Mimno, and Andrew McCallum. (2009). “Efficient Methods for Topic Model Inference on Streaming Document Collections.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). 10.1145/1557019.1557121
- … we empirically evaluate the performance of several methods for topic inference in previously unseen documents, including methods based on Gibbs sampling, variational inference, … we present SparseLDA, an algorithm and data structure for evaluating Gibbs sampling distributions … Gibbs sampling is an MCMC method that involves iterating over a set of variables z1, z2, …, zn, sampling each zi from P(zi | z\i, w). Each iteration over all variables is referred to as a Gibbs sweep. Given enough iterations, Gibbs sampling for LDA [4] produces samples from the posterior P(z|w). The efficiency of Gibbs sampling-based inference methods depends almost entirely on how fast we can evaluate the sampling distribution over topics for a given token. We therefore present SparseLDA, our new algorithm and data structure that substantially improves sampling performance.
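For concreteness, here is a sketch of the plain (dense) collapsed Gibbs sweep that SparseLDA accelerates: for each token it evaluates the standard LDA full conditional P(zi | z\i, w) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ) for every topic. The count-table names (n_dk, n_kw, n_k) are our own notation, and the SparseLDA factorization of this distribution into sparse buckets is not shown.

```python
import random

def gibbs_sweep_lda(docs, z, n_dk, n_kw, n_k, alpha, beta, V, K):
    """One collapsed Gibbs sweep over all token-topic assignments z.

    docs[d][i] is the word id of token i in document d; z[d][i] its topic.
    n_dk: per-document topic counts; n_kw: per-topic word counts;
    n_k: per-topic totals; V: vocabulary size; K: number of topics.
    """
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # remove token i from the sufficient statistics
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            # unnormalized full conditional P(z_i = t | z_{-i}, w) for each topic t
            weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                       for t in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            # record the new assignment and restore the counts
            z[d][i] = k
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
```

Evaluating all K weights per token is exactly the cost that SparseLDA's data structure is designed to avoid.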
2008
- (Porteous et al., 2008) ⇒ I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. (2008). “Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation.” In: Proceedings of ACM SIGKDD Conference (KDD-2008).
2006
- (Bhattacharya & Getoor, 2006) ⇒ Indrajit Bhattacharya, and Lise Getoor. (2006). “A Latent Dirichlet Model for Unsupervised Entity Resolution.” In: Proceedings of the Sixth SIAM International Conference on Data Mining (SIAM 2006).
- Another contribution of this paper is an unsupervised Gibbs sampling algorithm for collective entity resolution. It is unsupervised because we do not make use of a labeled training set and it is collective because the resolution decisions depend on each other through the group labels.
- (Bishop, 2006) ⇒ Christopher M. Bishop. (2006). “Pattern Recognition and Machine Learning." Springer, Information Science and Statistics. ISBN:0387310738
- 11.3 Gibbs Sampling
2005
- (Finkel et al., 2005) ⇒ Jenny Finkel, Trond Grenager, and Christopher D. Manning. (2005). “Incorporating Nonlocal Information into Information Extraction Systems by Gibbs Sampling.” In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005).
- QUOTE: A general method for solving this problem is to relax the requirement of exact inference, substituting approximate inference algorithms instead, thereby permitting tractable inference in models with non-local structure. One such algorithm is Gibbs sampling, a simple Monte Carlo algorithm that is appropriate for inference in any factored probabilistic model, including sequence models and probabilistic context free grammars (Geman and Geman, 1984). Although Gibbs sampling is widely used elsewhere, there has been extremely little use of it in natural language processing.[1] Here, we use it to add non-local dependencies to sequence models for information extraction.
- ↑ Prior uses in NLP of which we are aware include: Kim et al. (1995), Della Pietra et al. (1997) and Abney (1997).
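As a concrete instance of the setting Finkel et al. describe, the sketch below runs one Gibbs sweep over the hidden labels of a chain-factored sequence model (an HMM-like model, without the non-local factors the paper adds): each label is resampled from a distribution proportional to the product of the factors that touch its position. The data structures (trans, emit) are hypothetical.

```python
import random

def gibbs_sweep_sequence(obs, labels, trans, emit, states):
    """One Gibbs sweep over the hidden labels of a chain-factored model.

    P(y_i | rest) ∝ trans[y_{i-1}][y_i] * trans[y_i][y_{i+1}] * emit[y_i][x_i],
    i.e. only the factors adjacent to position i matter.
    """
    n = len(obs)
    for i in range(n):
        weights = []
        for s in states:
            w = emit[s].get(obs[i], 1e-9)      # emission factor (smoothed)
            if i > 0:
                w *= trans[labels[i - 1]][s]   # factor from the left neighbor
            if i < n - 1:
                w *= trans[s][labels[i + 1]]   # factor from the right neighbor
            weights.append(w)
        labels[i] = random.choices(states, weights=weights)[0]
    return labels
```

Adding a non-local factor, as in the paper, only changes which terms are multiplied into each weight; the sweep itself is unchanged.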
2004
- http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf
- ABSTRACT: A major limitation towards more widespread implementation of Bayesian approaches is that obtaining the posterior distribution often requires the integration of high-dimensional functions. This can be computationally very difficult, but several approaches short of direct integration have been proposed (reviewed by Smith 1991, Evans and Swartz 1995, Tanner 1996). We focus here on Markov Chain Monte Carlo (MCMC) methods, which attempt to simulate direct draws from some complex distribution of interest. MCMC approaches are so-named because one uses the previous sample values to randomly generate the next sample value, generating a Markov chain (as the transition probabilities between sample values are only a function of the most recent sample value).
- QUOTE: The realization in the early 1990’s (Gelfand and Smith 1990) that one particular MCMC method, the Gibbs sampler, is very widely applicable to a broad class of Bayesian problems has sparked a major increase in the application of Bayesian analysis, and this interest is likely to continue expanding for some time to come. MCMC methods have their roots in the Metropolis algorithm (Metropolis and Ulam 1949, Metropolis et al. 1953), an attempt by physicists to compute complex integrals by expressing them as expectations for some distribution and then estimate this expectation by drawing samples from that distribution. The Gibbs sampler (Geman and Geman 1984) has its origins in image processing. It is thus somewhat ironic that the powerful machinery of MCMC methods had essentially no impact on the field of statistics until rather recently. Excellent (and detailed) treatments of MCMC methods are found in Tanner (1996) and Chapter two of Draper (2000). Additional references are given in the particular sections below.
2003
- (Andrieu et al., 2003) ⇒ Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. (2003). “An Introduction to MCMC for Machine Learning.” In: Machine Learning, 50(1). doi:10.1023/A:1020281327116
1999
- (Jordan et al., 1999) ⇒ Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. (1999). “An Introduction to Variational Methods for Graphical Models.” In: Machine Learning, 37(2). doi:10.1023/A:1007665907178
- QUOTE: In this section we make a few remarks on the relationships between variational methods and stochastic methods, in particular the Gibbs sampler. In the setting of graphical models, both classes of methods rely on extensive message-passing. In Gibbs sampling, the message-passing is particularly simple: each node learns the current instantiation of its Markov blanket. With enough samples the node can estimate the distribution over its Markov blanket and (roughly speaking) determine its own statistics. The advantage of this scheme is that in the limit of very many samples, it is guaranteed to converge to the correct statistics. The disadvantage is that very many samples may be required.
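The Markov-blanket message-passing that Jordan et al. describe can be made concrete for a discrete Bayesian network: resampling a node requires only the current instantiation of its parents, children, and children's co-parents. The sketch below assumes a hypothetical interface in which cpts[n] evaluates P(n = assignment[n] | parents of n) under a full assignment; it is not code from the paper.

```python
import random

def gibbs_update_node(node, values, assignment, cpts, children):
    """Resample one node of a discrete Bayesian network given its Markov blanket.

    P(x | blanket) ∝ P(x | parents(x)) * prod_{c in children(x)} P(c | parents(c));
    every factor depends only on the node's Markov blanket.
    """
    weights = []
    for v in values:
        assignment[node] = v
        w = cpts[node](assignment)        # prior factor P(x | parents)
        for c in children[node]:          # likelihood factors from the children
            w *= cpts[c](assignment)
        weights.append(w)
    assignment[node] = random.choices(values, weights=weights)[0]
```

A full Gibbs sweep simply calls this update once per unobserved node; as the quote notes, many such sweeps may be needed before the node-level statistics are accurate.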
1993
- (Smith & Roberts, 1993) ⇒ A. F. M. Smith, and G. O. Roberts. (1993). “Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods.” In: Journal of the Royal Statistical Society, Series B, 55(1). http://www.jstor.org/stable/2346063
- ABSTRACT: The use of the Gibbs sampler for Bayesian computation is reviewed and illustrated in the context of some canonical examples. Other Markov chain Monte Carlo simulation methods are also briefly described, and comments are made on the advantages of sample-based approaches for Bayesian inference summaries.
1992
- (Casella & George, 1992) ⇒ George Casella, and Edward I. George. (1992). “Explaining the Gibbs Sampler.” In: The American Statistician 46 (3): 167–174. JSTOR: 2685208.
- ABSTRACT: Computer-intensive algorithms, such as the Gibbs sampler, have become increasingly popular statistical tools, both in applied and theoretical work. The properties of such algorithms, however, may sometimes not be obvious. Here we give a simple explanation of how and why the Gibbs sampler works. We analytically establish its properties in a simple case and provide insight for more complicated cases. There are also a number of examples.
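A sketch of the kind of two-variable example Casella & George analyze: a Binomial variable and a Beta variable whose joint distribution is never sampled directly, only through the two full conditionals. The parameter values below are illustrative.

```python
import random

def gibbs_beta_binomial(n, alpha, beta_, n_samples=5000, burn_in=500):
    """Gibbs sampler for the beta-binomial pair:
        X | Y=y ~ Binomial(n, y)
        Y | X=x ~ Beta(x + alpha, n - x + beta_)
    Only the conditionals are used; the marginal of X emerges from the chain.
    """
    y = random.random()                    # arbitrary starting point in (0, 1)
    xs = []
    for t in range(n_samples + burn_in):
        x = sum(random.random() < y for _ in range(n))     # Binomial(n, y) draw
        y = random.betavariate(x + alpha, n - x + beta_)   # Beta conditional draw
        if t >= burn_in:
            xs.append(x)
    return xs

xs = gibbs_beta_binomial(n=16, alpha=2.0, beta_=4.0)
print("E[X] ~=", sum(xs) / len(xs))   # true value n*alpha/(alpha+beta) = 16/3
```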
1991
- (Zeger & Karim, 1991) ⇒ S. L. Zeger, and M. R. Karim. (1991). “Generalized Linear Models with Random Effects; a Gibbs sampling approach.” In: Journal of the American Statistical Association, 86(413).
- ABSTRACT: Generalized linear models have unified the approach to regression for a wide variety of discrete, continuous, and censored response variables that can be assumed to be independent across experimental units. In applications such as longitudinal studies, genetic studies of families, and survey sampling, observations may be obtained in clusters. Responses from the same cluster cannot be assumed to be independent. With linear models, correlation has been effectively modeled by assuming there are cluster-specific random effects that derive from an underlying mixing distribution. Extensions of generalized linear models to include random effects has, thus far, been hampered by the need for numerical integration to evaluate likelihoods. In this article, we cast the generalized linear random effects model in a Bayesian framework and use a Monte Carlo method, the Gibbs sampler, to overcome the current computational limitations. The resulting algorithm is flexible to easily accommodate changes in the number of random effects and in their assumed distribution when warranted. The methodology is illustrated through a simulation study and an analysis of infectious disease data.
1990
- (Gelfand & Smith, 1990) ⇒ Alan E. Gelfand, and Adrian F. M. Smith. (1990). “Sampling-based Approaches to Calculating Marginal Densities.” In: Journal of the American Statistical Association 85 (410): 398–409. MR1141740, JSTOR: 2289776.
- ABSTRACT: Stochastic substitution, the Gibbs sampler, and the sampling-importance-resampling algorithm can be viewed as three alternative sampling- (or Monte Carlo- ) based approaches to the calculation of numerical estimates of marginal probability distributions. The three approaches will be reviewed, compared, and contrasted in relation to various joint probability structures frequently encountered in applications. In particular, the relevance of the approaches to calculating Bayesian posterior densities for a variety of structured models will be discussed and illustrated.
- KEY WORDS: Conditional probability structure; Data augmentation; Gibbs sampler; Hierarchical models; Importance sampling; Missing data; Monte Carlo sampling; Posterior distributions; Stochastic substitution; Variance components.
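One of the marginal-density estimators Gelfand & Smith discuss averages the known conditional density over the Gibbs draws of the other variable (a Rao-Blackwellized, or "mixture", estimate). The sketch below applies this idea to the bivariate-normal example given earlier, where p(x | y) is N(ρy, 1 − ρ²); the function name is our own.

```python
import math

def marginal_density_estimate(x, y_draws, rho):
    """Estimate the marginal density p(x) by averaging the known conditional
    p(x | y) over Gibbs draws of y:
        p_hat(x) = (1/M) * sum_m p(x | y_m)
    with p(x | y) = Normal(rho * y, 1 - rho^2).
    """
    var = 1.0 - rho * rho
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    return sum(norm * math.exp(-(x - rho * y) ** 2 / (2.0 * var))
               for y in y_draws) / len(y_draws)

# Usage with the earlier sketch: y_draws = [y for _, y in samples]
```

Averaging conditional densities rather than binning the raw x draws typically gives a lower-variance estimate, which is the point the paper develops.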
1984
- (Geman & Geman, 1984) ⇒ Stuart Geman, and Donald Geman. (1984). “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6). doi:10.1109/TPAMI.1984.4767596
- ABSTRACT: We make an analogy between images and statistical mechanics systems. Pixel gray levels and the presence and orientation of edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function in the physical system determines its Gibbs distribution. Because of the Gibbs distribution, Markov random field (MRF) equivalence, this assignment also determines an MRF image model. The energy function is a more convenient and natural mechanism for embodying picture attributes than are the local characteristics of the MRF. For a range of degradation mechanisms, including blurring, nonlinear deformations, and multiplicative or additive noise, the posterior distribution is an MRF with a structure akin to the image model. By the analogy, the posterior distribution defines another (imaginary) physical system. Gradual temperature reduction in the physical system isolates low energy states ("annealing"), or what is the same thing, the most probable states under the Gibbs distribution. The analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations. The result is a highly parallel relaxation algorithm for MAP estimation. We establish convergence properties of the algorithm and we experiment with some simple pictures, for which good restorations are obtained at low signal-to-noise ratios.
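In the spirit of the paper's stochastic relaxation with annealing, the sketch below restores a binary image under an Ising-like posterior: each pixel is resampled from its local Gibbs conditional while the temperature is gradually lowered toward the MAP estimate. The energy parameters beta (smoothness) and gamma (data fidelity) and the annealing schedule are illustrative assumptions, not values from the paper.

```python
import math
import random

def gibbs_denoise_ising(noisy, beta=0.8, gamma=1.5, temps=None):
    """Stochastic relaxation sketch for MAP restoration of a binary (+1/-1) image.

    Posterior energy (an Ising-like MRF, with y the observed noisy image):
        E(x) = -beta * sum_{neighbors s,t} x_s x_t - gamma * sum_s x_s y_s
    Each site is resampled from its local conditional; lowering the temperature
    T concentrates the Gibbs distribution on low-energy (near-MAP) states.
    """
    h, w = len(noisy), len(noisy[0])
    x = [row[:] for row in noisy]                          # start at the observed image
    temps = temps or [2.0 * 0.9 ** k for k in range(25)]   # simple cooling schedule
    for T in temps:
        for i in range(h):
            for j in range(w):
                # sum of the four lattice neighbors (with boundary checks)
                nb = sum(x[a][b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < h and 0 <= b < w)
                field = beta * nb + gamma * noisy[i][j]    # half the local energy gap
                p_plus = 1.0 / (1.0 + math.exp(-2.0 * field / T))
                x[i][j] = 1 if random.random() < p_plus else -1
    return x
```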