2008 ModelingScience
- (Blei, 2008) ⇒ David M. Blei. (2008). “Modeling Science.” Presentation. April 17, 2008
Subject Headings: Topic Modeling, Latent Dirichlet Allocation, Graphical Models.
Notes
Cited By
Quotes
Abstract
Modeling Science
- On-line archives of document collections require better organization. Manual organization is not practical.
- Our goal: To discover the hidden thematic structure with hierarchical probabilistic models called topic models.
- Use this structure for browsing, search, and similarity.
- Our data are the pages of Science from 1880-2002 (from JSTOR).
- No reliable punctuation, meta-data, or references.
- Note: this is just a subset of JSTOR’s archive.
Discover topics from a corpus
- Topic 1: human, genome, dna, genetic, genes, sequence, …
- Topic 2: evolution, evolutionary, species, organisms, life, origin, …
- Topic 3: disease, host, bacteria, diseases, resistance, bacteria, …
- Topic 4: computer, models, information, data, computers, …
Model the evolution of topics over time
Model connections between topics
Probabilistic modeling
- Treat data as observations that arise from a generative probabilistic process that includes hidden variables
- For documents, the hidden variables reflect the thematic structure of the collection.
- Infer the hidden structure using posterior inference
  - What are the topics that describe this collection?
- Situate new data into the estimated model.
  - How does this query or new document fit into the estimated topic structure?
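As a toy illustration of posterior inference, the sketch below applies Bayes' rule to a mixture of unigrams, a simpler model than LDA in which each document has exactly one topic. The two topics, their word probabilities, and the uniform prior are all hypothetical values chosen for the example, not estimates from the Science corpus.

```python
import math

# Hypothetical topics: each is a probability distribution over a tiny vocabulary.
TOPICS = {
    "genetics": {"gene": 0.5, "dna": 0.3, "data": 0.1, "computer": 0.1},
    "computing": {"gene": 0.1, "dna": 0.1, "data": 0.4, "computer": 0.4},
}
PRIOR = {"genetics": 0.5, "computing": 0.5}  # assumed uniform prior over topics


def topic_posterior(words):
    """Return p(topic | words) by Bayes' rule, computed in log space."""
    log_joint = {
        name: math.log(PRIOR[name]) + sum(math.log(dist[w]) for w in words)
        for name, dist in TOPICS.items()
    }
    # Normalize with the log-sum-exp trick to avoid underflow on long documents.
    top = max(log_joint.values())
    evidence = sum(math.exp(v - top) for v in log_joint.values())
    return {name: math.exp(v - top) / evidence for name, v in log_joint.items()}


# "Situating" a new document: which hidden topic best explains its words?
posterior = topic_posterior(["dna", "gene", "data"])
```

With these made-up numbers the posterior puts about 0.79 on the hypothetical genetics topic. In a real topic model the hidden structure is far larger, so the normalizing sum cannot be enumerated this way and approximate inference is needed.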
Graphical models (Aside)
- Nodes are random variables
- Edges denote possible dependence
- Observed variables are shaded
- Plates denote replicated structure
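One way to read this notation is to write a model's generative process as code: each plate becomes a loop, hidden nodes become intermediate variables, and the shaded nodes are the values that are actually returned. The sketch below does this for the LDA generative process as described in Blei et al. (2003). The vocabulary, the hyperparameter values, and the function names are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_length, num_topics, vocab, alpha, eta, seed=0):
    rng = random.Random(seed)
    # Plate over topics: each topic is a hidden distribution over the vocabulary.
    topics = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    # Plate over documents.
    for _ in range(num_docs):
        # Hidden per-document topic proportions.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        # Plate over word positions, nested inside the document plate.
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)      # hidden topic assignment
            w = sample_categorical(topics[z], rng)  # observed word (the shaded node)
            words.append(vocab[w])
        corpus.append(words)
    return corpus


VOCAB = ["gene", "dna", "species", "host", "data", "computer"]  # hypothetical vocabulary
docs = generate_corpus(num_docs=3, doc_length=8, num_topics=2,
                       vocab=VOCAB, alpha=0.5, eta=0.5)
```

Only `docs` would be visible in practice; `topics`, `theta`, and `z` are the hidden variables that posterior inference tries to recover.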
LDA summary
- LDA is a powerful model for
  - Visualizing the hidden thematic structure in large corpora
  - Generalizing new data to fit into that structure
- LDA is a mixed membership model (Erosheva, 2004) that builds on the work of Deerwester et al. (1990) and Hofmann (1999).
- For document collections and other grouped data, this might be more appropriate than a simple finite mixture
- Modular: It can be embedded in more complicated models.
  - E.g., syntax and semantics; authorship; word sense
- General: The data generating distribution can be changed.
  - E.g., images; social networks; population genetics data
- Variational inference is fast; it lets us analyze large data sets.
- See Blei et al., 2003 for details and a quantitative comparison.
- Code to play with LDA is freely available on my web-site, http://www.cs.princeton.edu/~blei.
- The Dirichlet is an exponential family distribution on the simplex, positive vectors that sum to one.
- However, the near independence of components makes it a poor choice for modeling topic proportions.
- An article about fossil fuels is more likely to also be about geology than about genetics.
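The limitation can be checked numerically. Under a Dirichlet, any two components have negative covariance, Cov(θ_i, θ_j) = -α_i α_j / (α_0² (α_0 + 1)) with α_0 the sum of the parameters, so the prior cannot say that two topics tend to appear together. The sketch below estimates that correlation by sampling and, purely as an assumed point of contrast, does the same for a logistic normal (the softmax of a correlated Gaussian, the kind of prior used in correlated topic models). The parameter values and the correlation `rho` are illustrative choices.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_logistic_normal(rng, rho=0.9):
    """Softmax of a 3-d Gaussian whose first two coordinates have correlation rho."""
    z1, z2, z3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    eta = [z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2, z3]
    weights = [math.exp(e) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
dirichlet_draws = [sample_dirichlet([1.0, 1.0, 1.0], rng) for _ in range(5000)]
logistic_draws = [sample_logistic_normal(rng) for _ in range(5000)]

# Correlation between the first two topic proportions under each prior.
dirichlet_corr = pearson([d[0] for d in dirichlet_draws], [d[1] for d in dirichlet_draws])
logistic_corr = pearson([d[0] for d in logistic_draws], [d[1] for d in logistic_draws])
```

For the symmetric three-component Dirichlet used here the theoretical correlation is -0.5, so `dirichlet_corr` should land near that value, while `logistic_corr` comes out positive because the first two coordinates were built to move together.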
Summary
- Topic models provide useful descriptive statistics for analyzing and understanding the latent structure of large text collections.
- Probabilistic graphical models are a useful way to express assumptions about the hidden structure of complicated data.
- Variational methods allow us to perform posterior inference to automatically infer that structure from large data sets.
- Current research
  - Choosing the number of topics
  - Continuous time dynamic topic models
  - Topic models for prediction
  - Inferring the impact of a document
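The variational methods named in the summary can be stated compactly. The bound below follows the notation of Blei et al. (2003), which is assumed here because the slides' own equations are not reproduced on this page: w are the words of one document, θ its topic proportions, z the per-word topic assignments, and γ, φ the free variational parameters.

```latex
\log p(w \mid \alpha, \beta)
  \;\ge\; \mathbb{E}_q\big[\log p(\theta, z, w \mid \alpha, \beta)\big]
        - \mathbb{E}_q\big[\log q(\theta, z \mid \gamma, \phi)\big],
\qquad
q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)
```

The gap between the two sides is the KL divergence from q to the true posterior, so maximizing the bound over γ and φ selects the member of the factorized family closest to that posterior.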
References
| | Author | title | titleUrl | year |
|---|---|---|---|---|
| 2008 ModelingScience | David M. Blei | Modeling Science | http://www.cs.princeton.edu/~blei/modeling-science.pdf | 2008 |