2006 HierarchicalDirichletProcesses


Subject Headings: Dirichlet Process, Clustering, Mixture Model, Nonparametric Bayesian Analysis, Hierarchical Bayesian Model, Markov-Chain Monte Carlo Algorithm.

Notes

Cited By

Quotes

Author Keywords

clustering, mixture models, nonparametric Bayesian statistics, hierarchical models, Markov chain Monte Carlo

Abstract

We consider problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well-known clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes in terms of a stick-breaking process, and a generalization of the Chinese restaurant process that we refer to as the “Chinese restaurant franchise.” We present Markov chain Monte Carlo algorithms for posterior inference in hierarchical Dirichlet process mixtures, and describe applications to problems in information retrieval and text modelling.
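The two-level construction described in the abstract can be simulated directly. Below is a minimal Python sketch, not code from the paper: it truncates both stick-breaking constructions at K atoms, takes a standard normal for the global base measure H, and uses the fact that a Dirichlet process whose base measure has finite support reduces to a finite-dimensional Dirichlet distribution. The function name truncated_hdp and all parameter values are illustrative assumptions.

<pre>
import numpy as np

def truncated_hdp(gamma, alpha0, J, K=100, rng=None):
    """Illustrative truncated draw from a hierarchical Dirichlet process.

    Top level:   G0 ~ DP(gamma, H), truncated stick-breaking with K atoms.
    Group level: Gj ~ DP(alpha0, G0); once G0 has only K atoms this law
    reduces to weights pi_j ~ Dirichlet(alpha0 * beta) over the SAME atoms.
    """
    rng = np.random.default_rng(rng)
    b = rng.beta(1.0, gamma, size=K)                   # stick fractions
    beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    phi = rng.normal(0.0, 1.0, size=K)                 # atoms drawn from H = N(0, 1)
    # Clip tiny tail weights so every Dirichlet parameter stays positive.
    pi = rng.dirichlet(np.maximum(alpha0 * beta, 1e-12), size=J)
    return beta, pi, phi

# Three groups: each row of pi re-weights the same atoms phi.
beta, pi, phi = truncated_hdp(gamma=5.0, alpha0=1.0, J=3, rng=0)
</pre>

Because every group-level measure is supported on the one global atom set phi, the group mixture models necessarily share components, which is exactly the sharing property the abstract emphasizes.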

1. Introduction

A recurring theme in statistics is the need to separate observations into groups, and yet allow the groups to remain linked — to “share statistical strength.” In the Bayesian formalism such sharing is achieved naturally via hierarchical modeling; parameters are shared among groups, and the randomness of the parameters induces dependencies among the groups. Estimates based on the posterior distribution exhibit “shrinkage.”

In the current paper we explore a hierarchical approach to the problem of model-based clustering of grouped data. We assume that the data are subdivided into a set of groups, and that within each group we wish to find clusters that capture latent structure in the data assigned to that group. The number of clusters within each group is unknown and is to be inferred. Moreover, in a sense that we make precise, we wish to allow clusters to be shared among the groups.

Moreover, we may want to extend the model to allow for multiple corpora. For example, documents in scientific journals are often grouped into themes (e.g., “empirical process theory,” “multivariate statistics,” “survival analysis”), and it would be of interest to discover to what extent the latent topics that are shared among documents are also shared across these groupings. Thus in general we wish to consider the sharing of clusters across multiple, nested groupings of data.

Our approach to the problem of sharing clusters among multiple, related groups is a nonparametric Bayesian approach, reposing on the Dirichlet process (Ferguson 1973). The Dirichlet process [math]\displaystyle{ \operatorname{DP}(\alpha_0,G_0) }[/math] is a measure on measures. It has two parameters, a scaling parameter [math]\displaystyle{ \alpha_0 \gt 0 }[/math] and a base probability measure [math]\displaystyle{ G_0 }[/math]. An explicit representation of a draw from a Dirichlet process (DP) was given by Sethuraman (1994), who showed that if [math]\displaystyle{ G \sim \operatorname{DP}(\alpha_0,G_0) }[/math], then with probability one:

[math]\displaystyle{ G = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}, \qquad \text{(1)} }[/math]

where the [math]\displaystyle{ \phi_k }[/math] are independent random variables distributed according to [math]\displaystyle{ G_0 }[/math], where [math]\displaystyle{ \delta_{\phi_k} }[/math] is an atom at [math]\displaystyle{ \phi_k }[/math], and where the “stick-breaking weights” [math]\displaystyle{ \beta_k }[/math] are also random and depend on the parameter [math]\displaystyle{ \alpha_0 }[/math] (the definition of the [math]\displaystyle{ \beta_k }[/math] is provided in Section 3.1).
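A draw from (1) can be approximated by truncating the infinite sum at a finite level. The following is a minimal sketch, assuming a truncation level K and a standard normal base measure [math]\displaystyle{ G_0 }[/math] (both illustrative choices, as is the function name); the weights follow the stick-breaking recursion [math]\displaystyle{ \beta_k = \beta'_k \prod_{l \lt k}(1 - \beta'_l) }[/math] with [math]\displaystyle{ \beta'_k \sim \operatorname{Beta}(1, \alpha_0) }[/math], anticipating the definition given in Section 3.1.

<pre>
import numpy as np

def stick_breaking_dp(alpha0, K=1000, rng=None):
    """Truncated draw G ~ DP(alpha0, G0), with G0 = N(0, 1) for illustration.

    Returns weights beta_k and atoms phi_k so that G is approximately
    the discrete measure sum_k beta_k * delta_{phi_k} of equation (1).
    """
    rng = np.random.default_rng(rng)
    b = rng.beta(1.0, alpha0, size=K)     # beta'_k ~ Beta(1, alpha0)
    # beta_k = beta'_k * prod_{l<k} (1 - beta'_l): successive stick breaks.
    beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    phi = rng.normal(0.0, 1.0, size=K)    # phi_k drawn i.i.d. from G0
    return beta, phi

beta, phi = stick_breaking_dp(alpha0=2.0, rng=0)
print(beta[:5], beta.sum())               # weights decay; partial sum near 1
</pre>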

The representation in (1) shows that draws from a DP are discrete (with probability one). The discrete nature of the DP makes it unsuitable for general applications in Bayesian nonparametrics, but it is well suited for the problem of placing priors on mixture components in mixture modeling. The idea is basically to associate a mixture component with each atom in [math]\displaystyle{ G }[/math]. Introducing indicator variables to associate data points with mixture components, the posterior distribution yields a probability distribution on partitions of the data. A number of authors have studied such Dirichlet process mixture models (Antoniak 1974; Escobar and West 1995; MacEachern and Müller 1998). These models provide an alternative to methods that attempt to select a particular number of mixture components, or methods that place an explicit parametric prior on the number of components.
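Marginalizing out [math]\displaystyle{ G }[/math] recovers the Pólya-urn scheme of Blackwell and MacQueen (1973), equivalently the Chinese restaurant process: each new observation joins an existing cluster with probability proportional to that cluster's size, and opens a new cluster with probability proportional to [math]\displaystyle{ \alpha_0 }[/math]. The sketch below (function name and parameter values are illustrative assumptions) simulates the induced random partition:

<pre>
import numpy as np

def crp_partition(n, alpha0, rng=None):
    """Sample a partition of n items from the Chinese restaurant process,
    the marginal clustering induced by a DP(alpha0, G0) mixture model."""
    rng = np.random.default_rng(rng)
    assignments = [0]            # the first item starts the first cluster
    counts = [1]                 # cluster sizes n_k
    for _ in range(1, n):
        probs = np.array(counts + [alpha0], dtype=float)
        probs /= probs.sum()     # P(join cluster k) prop. to n_k; P(new) prop. to alpha0
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):     # open a new cluster (new mixture component)
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

print(crp_partition(10, alpha0=1.0, rng=0))
</pre>

The number of occupied clusters grows only logarithmically with the number of observations in expectation, which is why the DP serves as a prior over partitions without fixing the number of components in advance.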

References

  • Aldous, D. (1985), “Exchangeability and Related Topics,” in École d’Été de Probabilités de Saint-Flour XIII–1983, Springer, Berlin, pp. 1–198.
  • Antoniak, C. E. (1974), “Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems,” Annals of Statistics, 2(6), pp. 1152–1174.
  • Matthew J. Beal (2003), “Variational Algorithms for Approximate Bayesian Inference,” Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London.
  • Matthew J. Beal, Ghahramani, Z., and Rasmussen, C. (2002), “The Infinite Hidden Markov Model,” in Thomas G. Dietterich, S. Becker, and Zoubin Ghahramani (eds.) Advances in Neural Information Processing Systems, Cambridge, MA: MIT Press, vol. 14, pp. 577–584.
  • Blackwell, D. and MacQueen, J. B. (1973), “Ferguson Distributions via Pólya Urn Schemes,” Annals of Statistics, 1, pp. 353–355.
  • (Blei & Jordan, 2005) ⇒ David M. Blei and Michael I. Jordan (2005), “Variational methods for Dirichlet process mixtures,” Bayesian Analysis, 1, pp. 121–144.
  • (Blei, Jordan & Ng, 2003) ⇒ David M. Blei, Michael I. Jordan, and Ng, A. Y. (2003). “Hierarchical Bayesian Models for Applications in Information Retrieval.” In: Bayesian Statistics, 7.
  • Carota, C. and Parmigiani, G. (2002), “Semiparametric Regression for Count Data,” Biometrika, 89(2), pp. 265–281.
  • Cifarelli, D. and Regazzini, E. (1978), “Problemi Statistici Non Parametrici in Condizioni di Scambiabilità Parziale e Impiego di Medie Associative,” Tech. rep., Quaderni Istituto Matematica Finanziaria dell’Università di Torino.
  • De Iorio, M., Müller, P., and Rosner, G. L. (2004), “An ANOVA Model for Dependent Random Measures,” Journal of the American Statistical Association, 99(465), pp. 205–215.
  • Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998), Biological Sequence Analysis, Cambridge, UK: Cambridge University Press.
  • Escobar, M. D. and West, M. (1995), “Bayesian Density Estimation and Inference Using Mixtures,” Journal of the American Statistical Association, 90, pp. 577–588.
  • Ferguson, T. S. (1973), “A Bayesian Analysis of Some Nonparametric Problems,” Annals of Statistics, 1(2), pp. 209–230.
  • Fong, D. K. H., Pammer, S. E., Arnold, S. F., and Bolton, G. E. (2002), “Reanalyzing Ultimatum Bargaining — Comparing Nondecreasing Curves Without Shape Constraints,” Journal of Business and Economic Statistics, 20, pp. 423–440.
  • Forsyth, D. A. and Ponce, J. (2002), Computer Vision — A Modern Approach, Upper Saddle River, NJ: Prentice-Hall.
  • Fraley, C. and Raftery, A. E. (2002), “Model-based Clustering, Discriminant Analysis, and Density Estimation,” Journal of the American Statistical Association, 97, pp. 611–631.
  • Gabriel, S. B. et al. (2002), “The Structure of Haplotype Blocks in the Human Genome,” Science, 296, pp. 2225–2229.
  • Green, P. and Richardson, S. (2001), “Modelling Heterogeneity with and without the Dirichlet Process,” Scandinavian Journal of Statistics, 28, pp. 355–377.
  • Huang, X., Acero, A., and Hon, H.-W. (2001), Spoken Language Processing, Upper Saddle River, NJ: Prentice-Hall.
  • Ishwaran, H. and James, L. F. (2004), “Computational Methods for Multiplicative Intensity Models using Weighted Gamma Processes: Proportional Hazards, Marked Point Processes and Panel Count Data,” Journal of the American Statistical Association, 99, pp. 175–190.
  • Ishwaran, H. and Zarepour, M. (2002), “Exact and Approximate Sum-Representations for the Dirichlet Process,” Canadian Journal of Statistics, 30, pp. 269–283.
  • Jain, S. and Neal, R. M. (2000), “A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model,” Tech. Rep. 2003, Department of Statistics, University of Toronto.
  • Kleinman, K. P. and Ibrahim, J. G. (1998), “A Semi-parametric Bayesian Approach to Generalized Linear Mixed Models,” Statistics in Medicine, 17, pp. 2579–2596.
  • MacEachern, S. N. (1999), “Dependent Nonparametric Processes,” In: Proceedings of the Section on Bayesian Statistical Science, American Statistical Association.
  • MacEachern, S. N., Kottas, A., and Gelfand, A. E. (2001), “Spatial Nonparametric Bayesian Models,” Tech. Rep. 01-10, Institute of Statistics and Decision Sciences, Duke University.
  • MacEachern, S. N. and Müller, P. (1998), “Estimating Mixture of Dirichlet Process Models,” Journal of Computational and Graphical Statistics, 7, pp. 223–238.
  • MacKay, D. J. C. (1997), “Ensemble Learning for Hidden Markov Models,” Tech. rep., Cavendish Laboratory, University of Cambridge.
  • MacKay, D. J. C. and Peto, L. C. B. (1994), “A Hierarchical Dirichlet Language Model,” Natural Language Engineering.
  • Mallick, B. K. and Walker, S. G. (1997), “Combining Information from Several Experiments with Nonparametric Priors,” Biometrika, 84, pp. 697–706.
  • Muliere, P. and Petrone, S. (1993), “A Bayesian Predictive Approach to Sequential Search for an Optimal Dose: Parametric and Nonparametric Models,” Journal of the Italian Statistical Society, 2, pp. 349–364.
  • Müller, P., Quintana, F., and Rosner, G. (2004), “A Method for Combining Inference Across Related Nonparametric Bayesian Models,” Journal of the Royal Statistical Society, 66, pp. 735–749.
  • Neal, R. M. (1992), “Bayesian Mixture Modeling,” In: Proceedings of the Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis, vol. 11, pp. 197–211.
  • —— (2000), “Markov Chain Sampling Methods for Dirichlet Process Mixture Models,” Journal of Computational and Graphical Statistics, 9, pp. 249–265.
  • Pitman, J. (1996), “Random Discrete Distributions Invariant under Size-Biased Permutation,” Advances in Applied Probability, 28, pp. 525–539.
  • —— (2002a), “Combinatorial Stochastic Processes,” Tech. Rep. 621, Department of Statistics, University of California at Berkeley, lecture notes for St. Flour Summer School.
  • —— (2002b), “Poisson-Dirichlet and GEM invariant distributions for split-and-merge transformations of an interval partition,” Combinatorics, Probability and Computing, 11, pp. 501–514.
  • Pritchard, J. K., Stephens, M., and Donnelly, P. (2000), “Inference of Population Structure using Multilocus Genotype Data,” Genetics, 155, pp. 945–959.
  • Rabiner, L. (1989), “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, 77, pp. 257–285.
  • Rasmussen, C. E. (2000), “The Infinite Gaussian Mixture Model,” in S. Solla, T. Leen, and K.-R. Müller (eds.) Advances in Neural Information Processing Systems, Cambridge, MA: MIT Press, vol. 12.
  • Gerard M. Salton and McGill, M. J. (1983), An Introduction to Modern Information Retrieval, New York: McGraw-Hill.
  • Sethuraman, J. (1994), “A Constructive Definition of Dirichlet Priors,” Statistica Sinica, 4, pp. 639–650.
  • Stephens, M., Smith, N., and Donnelly, P. (2001), “A New Statistical Method for Haplotype Reconstruction from Population Data,” American Journal of Human Genetics, 68, pp. 978–989.
  • Stolcke, A. and Omohundro, S. (1993), “Hidden Markov Model Induction by Bayesian Model Merging,” in C. Giles, S. Hanson, and J. Cowan (eds.) Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann, vol. 5, pp. 11–18.
  • Tomlinson, G. A. (1998), “Analysis of Densities,” Ph.D. thesis, Department of Public Health Sciences, University of Toronto.
  • Tomlinson, G. A. and Escobar, M. (2003), “Analysis of Densities,” Tech. rep., Department of Public Health Sciences, University of Toronto.

Yee Whye Teh, Matthew J. Beal, Michael I. Jordan, and David M. Blei (2006). “Hierarchical Dirichlet Processes.” Journal of the American Statistical Association. http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/jasa2006.pdf