2003 VariationalAlgsForApproxBayesInf

Subject Headings: Approximate Bayesian Inference, Variational Bayesian Algorithm, Latent Variable, Propagation Algorithm, EM Algorithm, Variational Bayesian EM Algorithm, Conjugate-Exponential Graphical Model.

Notes

Cited By

Quotes

Abstract

The Bayesian framework for machine learning allows for the incorporation of prior knowledge in a coherent way, avoids overfitting problems, and provides a principled basis for selecting between alternative models. Unfortunately the computations required are usually intractable. This thesis presents a unified variational Bayesian (VB) framework which approximates these computations in models with latent variables using a lower bound on the marginal likelihood.
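
The lower bound referred to above can be made explicit. In generic notation (a reconstruction for readability, not a verbatim quotation from the thesis: y denotes the observed data, x the latent variables, θ the parameters, and m the model), applying Jensen's inequality with a factorised approximation q(x, θ) = q_x(x) q_θ(θ) gives

\ln p(y \mid m) \;\ge\; \mathcal{F}(q_x, q_\theta) \;=\; \int d\theta \, q_\theta(\theta) \left[ \int dx \, q_x(x) \ln \frac{p(x, y \mid \theta, m)}{q_x(x)} \;+\; \ln \frac{p(\theta \mid m)}{q_\theta(\theta)} \right] ,

where the slack in the bound is the KL divergence between q and the true joint posterior over latent variables and parameters, so maximising F over the free distributions tightens the approximation.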

Chapter 1 presents background material on Bayesian inference, graphical models, and propagation algorithms. Chapter 2 forms the theoretical core of the thesis, generalising the expectation-maximisation (EM) algorithm for learning maximum likelihood parameters to the VB EM algorithm which integrates over model parameters. The algorithm is then specialised to the large family of conjugate-exponential (CE) graphical models, and several theorems are presented to pave the road for automated VB derivation procedures in both directed and undirected graphs (Bayesian and Markov networks, respectively).
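
The resulting VB EM iteration can be sketched as coordinate ascent on the bound F (again in generic notation rather than as quoted from the thesis), alternating a VBE step over the latent variables with a VBM step over the parameters:

q_x^{(t+1)}(x) \;\propto\; \exp\!\left( \int d\theta \, q_\theta^{(t)}(\theta) \ln p(x, y \mid \theta, m) \right) \qquad \text{(VBE step)}

q_\theta^{(t+1)}(\theta) \;\propto\; p(\theta \mid m) \, \exp\!\left( \int dx \, q_x^{(t+1)}(x) \ln p(x, y \mid \theta, m) \right) \qquad \text{(VBM step)}

For conjugate-exponential models these updates stay in tractable families: the VBM step returns a distribution conjugate to the complete-data likelihood, and the VBE step reduces to the standard E-step propagation run with expected natural parameters, which is what makes automated VB derivation procedures for Bayesian and Markov networks feasible.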

Chapters 3-5 derive and apply the VB EM algorithm to three commonly-used and important models: mixtures of factor analysers, linear dynamical systems, and hidden Markov models. It is shown how model selection tasks such as determining the dimensionality, cardinality, or number of variables are possible using VB approximations. Also explored are methods for combining sampling procedures with variational approximations, to estimate the tightness of VB bounds and to obtain more effective sampling algorithms. Chapter 6 applies VB learning to a long-standing problem of scoring discrete-variable directed acyclic graphs, and compares the performance to annealed importance sampling amongst other methods. Throughout, the VB approximation is compared to other methods including sampling, Cheeseman-Stutz, and asymptotic approximations such as BIC. The thesis concludes with a discussion of evolving directions for model selection including infinite models and alternative approximations to the marginal likelihood.
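
For context, the BIC approximation mentioned above is the standard asymptotic penalty on the maximised likelihood (stated here in its usual form rather than quoted from the thesis):

\ln p(y \mid m) \;\approx\; \ln p(y \mid \hat{\theta}) \;-\; \frac{d}{2} \ln n ,

where \hat{\theta} is the maximum likelihood parameter estimate, d is the number of free parameters of model m, and n is the number of data points; Chapter 2 shows that VB recovers this expression in the limit of large data.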

Table of Contents

1 Introduction 13
1.1 Probabilistic inference . . . 16
1.1.1 Probabilistic graphical models: directed and undirected networks . . . 17
1.1.2 Propagation algorithms . . . 19
1.2 Bayesian model selection . . . 24
1.2.1 Marginal likelihood and Occam’s razor . . . 25
1.2.2 Choice of priors . . . 27
1.3 Practical Bayesian approaches . . . 32
1.3.1 Maximum a posteriori (MAP) parameter estimates . . . 33
1.3.2 Laplace’s method . . . 34
1.3.3 Identifiability: aliasing and degeneracy . . . 35
1.3.4 BIC and [[Minimum Description Length|MDL]] . . . 36
1.3.5 Cheeseman & Stutz’s method . . . 37
1.3.6 Monte Carlo methods . . . 38
1.4 Summary of the remaining chapters . . . 42

2 Variational Bayesian Theory 44
2.1 Introduction . . . 44
2.2 Variational methods for ML / MAP learning . . . 46
2.2.1 The scenario for parameter learning . . . 46
2.2.2 EM for unconstrained (exact) optimisation . . . 48
2.2.3 EM with constrained (approximate) optimisation . . . 49
2.3 Variational methods for Bayesian learning . . . 53
2.3.1 Deriving the learning rules . . . 53
2.3.2 Discussion . . . 58
2.4 Conjugate-Exponential models . . . 64
2.4.1 Definition . . . 64
2.4.2 Variational Bayesian EM for CE models . . . 66
2.4.3 Implications . . . 69
2.5 Directed and undirected graphs . . . 73
2.5.1 Implications for directed networks . . . 73
2.5.2 Implications for undirected networks . . . 74
2.6 Comparisons of VB to other criteria . . . 75
2.6.1 BIC is recovered from VB in the limit of large data . . . 75
2.6.2 Comparison to Cheeseman-Stutz (CS) approximation . . . 76
2.7 Summary . . . 80
3 Variational Bayesian Hidden Markov Models 82
3.1 Introduction . . . 82
3.2 Inference and learning for maximum likelihood HMMs . . . 83
3.3 Bayesian HMMs . . . 88
3.4 Variational Bayesian formulation . . . 91
3.4.1 Derivation of the VBEM optimisation procedure . . . 92
3.4.2 Predictive probability of the VB model . . . 97
3.5 Experiments . . . 98
3.5.1 Synthetic: discovering model structure . . . 98
3.5.2 Forwards-backwards English discrimination . . . 99
3.6 Discussion . . . 104

4 Variational Bayesian Mixtures of Factor Analysers 106
4.1 Introduction . . . 106
4.1.1 Dimensionality reduction using factor analysis . . . 107
4.1.2 Mixture models for [[manifold learning]] . . . 109
4.2 Bayesian Mixture of Factor Analysers . . . 110
4.2.1 Parameter priors for MFA . . . 111
4.2.2 Inferring dimensionality using ARD . . . 114
4.2.3 Variational Bayesian derivation . . . 115
4.2.4 Optimising the lower bound . . . 119
4.2.5 Optimising the hyperparameters . . . 122
4.3 Model exploration: birth and death . . . 124
4.3.1 Heuristics for component death . . . 126
4.3.2 Heuristics for component birth . . . 127
4.3.3 Heuristics for the optimisation endgame . . . 130
4.4 Handling the predictive density . . . 130
4.5 Synthetic experiments . . . 132
4.5.1 Determining the number of components . . . 133
4.5.2 Embedded Gaussian clusters . . . 133
4.5.3 Spiral dataset . . . 135
4.6 Digit experiments . . . 138
4.6.1 Fully-unsupervised learning . . . 138
4.6.2 Classification performance of BIC and VB models . . . 141
4.7 Combining VB approximations with Monte Carlo . . . 144
4.7.1 Importance sampling with the variational approximation . . . 144
4.7.2 Example: Tightness of the lower bound for MFAs . . . 148
4.7.3 Extending simple importance sampling . . . 151
4.8 Summary . . . 157

5 Variational Bayesian Linear Dynamical Systems 159
5.1 Introduction . . . 159
5.2 The Linear Dynamical System model . . . 160
5.2.1 Variables and topology . . . 160
5.2.2 Specification of parameter and hidden state priors . . . 163
5.3 The variational treatment . . . 168
5.3.1 VBM step: Parameter distributions . . . 170
5.3.2 VBE step: The Variational Kalman Smoother . . . 173
5.3.3 Filter (forward recursion) . . . 174
5.3.4 Backward recursion: sequential and parallel . . . 177
5.3.5 Computing the single and joint marginals . . . 181
5.3.6 Hyperparameter learning . . . 184
5.3.7 Calculation of F . . . 185
5.3.8 Modifications when learning from multiple sequences . . . 186
5.3.9 Modifications for a fully hierarchical model . . . 189
5.4 Synthetic Experiments . . . 189
5.4.1 Hidden state space dimensionality determination (no inputs) . . . 189
5.4.2 Hidden state space dimensionality determination (input-driven) . . . 191
5.5 Elucidating gene expression mechanisms . . . 195
5.5.1 Generalisation errors . . . 198
5.5.2 Recovering gene-gene interactions . . . 200
5.6 Possible extensions and future research . . . 201
5.7 Summary . . . 204

6 Learning the structure of discrete-variable graphical models with hidden variables
6.1 Introduction . . . 206
6.2 Calculating marginal likelihoods of DAGs . . . 207
6.3 Estimating the marginal likelihood . . . 210
6.3.1 ML and MAP parameter estimation . . . 210
6.3.2 BIC . . . 212
6.3.3 Cheeseman-Stutz . . . 213
6.3.4 The VB lower bound . . . 215
6.3.5 Annealed Importance Sampling (AIS) . . . 218
6.3.6 Upper bounds on the marginal likelihood . . . 222
6.4 Experiments . . . 223
6.4.1 Comparison of scores to AIS . . . 226
6.4.2 Performance averaged over the parameter prior . . . 232
6.5 Open questions and directions . . . 236
6.5.1 AIS analysis, limitations, and extensions . . . 236
6.5.2 Estimating dimensionalities of the incomplete and complete-data models . . . 245


References



Matthew J. Beal. (2003). "Variational Algorithms for Approximate Bayesian Inference." PhD Thesis, Gatsby Computational Neuroscience Unit, University College London. http://www.cse.buffalo.edu/faculty/mbeal/papers/beal03.pdf