2008 ModelingScience


Subject Headings: Topic Modeling, Latent Dirichlet Allocation, Graphical Models.

Abstract

Modeling Science

  • On-line archives of document collections require better organization. Manual organization is not practical.
  • Our goal: To discover the hidden thematic structure with hierarchical probabilistic models called topic models.
  • Use this structure for browsing, search, and similarity.
  • Our data are the pages of Science from 1880-2002 (from JSTOR).
  • No reliable punctuation, meta-data, or references.
  • Note: this is just a subset of JSTOR’s archive.

Discover topics from a corpus

  • human genome dna genetic genes sequence
  • evolution evolutionary species organisms life origin
  • disease host bacteria diseases resistance bacteria
  • computer models information data computers …

Model the evolution of topics over time

Model connections between topics

Probabilistic modeling

  • Treat data as observations that arise from a generative probabilistic process that includes hidden variables
    • For documents, the hidden variables reflect the thematic structure of the collection.
  • Infer the hidden structure using posterior inference
    • What are the topics that describe this collection?
  • Situate new data into the estimated model.
    • How does this query or new document fit into the estimated topic structure?
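
The generative view above can be sketched in a few lines. This is a minimal illustration, assuming the standard LDA generative process from Blei et al. (2003); the toy topics, the Dirichlet parameter, and the helper names `sample_dirichlet` and `generate_document` are invented for the example and are not taken from the slides.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document from hidden proportions and hidden assignments."""
    # Hidden variable: this document's topic proportions.
    theta = sample_dirichlet(alpha, rng)
    assignments, words = [], []
    for _ in range(n_words):
        # Hidden variable: the topic that generates this word.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[z].items())
        # Observed variable: the word itself.
        w = rng.choices(vocab, weights=probs)[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


# Hypothetical toy topics: each maps words to probabilities that sum to one.
TOPICS = [
    {"genome": 0.4, "dna": 0.3, "genes": 0.3},
    {"evolution": 0.5, "species": 0.3, "life": 0.2},
]

rng = random.Random(0)
theta, z, words = generate_document(TOPICS, alpha=[0.5, 0.5], n_words=8, rng=rng)
print("topic proportions:", theta)
print("words:", words)
```

Only `words` would be observed in practice; posterior inference works backwards from the words to plausible values of `theta` and `z`.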

Graphical models (Aside)

  • Nodes are random variables
  • Edges denote possible dependence
  • Observed variables are shaded
  • Plates denote replicated structure
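
As an illustration of how nodes, edges, and plates translate into a formula, the joint distribution of LDA can be written as a product with one factor per node and one product per plate. This assumes the standard smoothed LDA model of Blei et al. (2003); the symbols are assumptions for this sketch: K topics \beta_k with prior parameter \eta, D documents with topic proportions \theta_d and prior parameter \alpha, and N_d words w_{d,n} with topic assignments z_{d,n}.

```latex
p(\beta_{1:K}, \theta_{1:D}, z, w \mid \alpha, \eta)
  = \prod_{k=1}^{K} p(\beta_k \mid \eta)
    \prod_{d=1}^{D} \Big[ p(\theta_d \mid \alpha)
      \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\,
                        p(w_{d,n} \mid z_{d,n}, \beta_{1:K}) \Big]
```

Each product corresponds to a plate (topics, documents, words within a document), and only the words w_{d,n} would be shaded as observed.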

LDA summary

  • LDA is a powerful model for
    • Visualizing the hidden thematic structure in large corpora
    • Generalizing new data to fit into that structure
  • LDA is a mixed membership model (Erosheva, 2004) that builds on the work of Deerwester et al. (1990) and Hofmann (1999).
  • For document collections and other grouped data, this might be more appropriate than a simple finite mixture


  • Modular: It can be embedded in more complicated models.
    • E.g., syntax and semantics; authorship; word sense
  • General: The data generating distribution can be changed.
    • E.g., images; social networks; population genetics data
  • Variational inference is fast; it lets us analyze large data sets.
  • See Blei et al., 2003 for details and a quantitative comparison.
  • Code to play with LDA is freely available on my web-site, http://www.cs.princeton.edu/~blei.
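
To make posterior inference concrete, here is a small sketch. It does not use the variational method the slides refer to. Instead it uses collapsed Gibbs sampling, a different and simpler inference technique for LDA, chosen only because it fits in a few lines. The function name `gibbs_lda`, the hyperparameter values, and the toy documents are assumptions made for this example.

```python
import random


def gibbs_lda(docs, n_topics, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]              # words per (doc, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics                            # words per topic
    z = []                                                  # topic of every token

    # Start from a random assignment of tokens to topics.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Hypothetical toy corpus built from words in the example topics.
DOCS = [
    ["genome", "dna", "genes", "dna", "genome"],
    ["evolution", "species", "life", "species"],
    ["genes", "genome", "dna", "evolution"],
]

doc_topic, topic_word = gibbs_lda(DOCS, n_topics=2)
for k, counts in enumerate(topic_word):
    top_words = sorted(counts, key=counts.get, reverse=True)[:3]
    print("topic", k, top_words)
```

The returned counts are a single sample from the posterior over assignments, so the topics printed can differ between seeds; a corpus this small is only meant to show the mechanics.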

The hidden assumptions of the Dirichlet distribution

  • The Dirichlet is an exponential family distribution on the simplex, the set of positive vectors that sum to one.
  • However, the near independence of components makes it a poor choice for modeling topic proportions.
  • An article about fossil fuels is more likely to also be about geology than about genetics.
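
The limitation can be checked numerically. Under a Dirichlet with parameters alpha and alpha0 = sum(alpha), the covariance between two different components is -alpha_i * alpha_j / (alpha0^2 * (alpha0 + 1)), which is always negative, so the distribution cannot express two topics that tend to appear together. The sketch below compares that closed form with an empirical estimate; the three components and the parameter values are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical proportions for three topics, e.g. fossil fuels, geology, genetics.
alpha = [1.0, 1.0, 1.0]
rng = random.Random(0)
samples = [sample_dirichlet(alpha, rng) for _ in range(20000)]

empirical = covariance([s[0] for s in samples], [s[1] for s in samples])
alpha0 = sum(alpha)
closed_form = -alpha[0] * alpha[1] / (alpha0 ** 2 * (alpha0 + 1))

print("empirical covariance:  ", empirical)
print("closed-form covariance:", closed_form)
```

Whatever positive parameters are chosen, the covariance between any pair of components stays negative, so a positive association such as fossil fuels with geology is out of reach for this prior.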

Summary

  • Topic models provide useful descriptive statistics for analyzing and understanding the latent structure of large text collections.
  • Probabilistic graphical models are a useful way to express assumptions about the hidden structure of complicated data.
  • Variational methods allow us to perform posterior inference to automatically infer that structure from large data sets.
  • Current research
    • Choosing the number of topics
    • Continuous time dynamic topic models
    • Topic models for prediction
    • Inferring the impact of a document

References



  • David M. Blei. (2008). "Modeling Science." http://www.cs.princeton.edu/~blei/modeling-science.pdf