Probabilistic LSI Model
A probabilistic LSI model is a generative probabilistic model that is used in the analysis of two-mode and co-occurrence data.
- AKA: pLSI, Probabilistic Latent Semantic Indexing Model, Probabilistic Latent Semantic Analysis.
- Context:
- It characterizes each topic as a multinomial distribution over a vocabulary of words (rather than as a cluster of documents).
- It treats topics as latent variables (Latent Topics).
- It represents the joint probability of terms and documents as a mixture of conditional probabilities over the latent topics.
- It (typically) infers latent topics by maximum likelihood estimation, fitted with a generalization of the Expectation Maximization Algorithm (Hofmann, 1999).
- It communicates a topic as the keywords that have the highest probability mass in that topic's learned word distribution.
- It can be induced by a PLSI Model Training Algorithm (that typically performs maximum likelihood estimation of model parameters).
- It can (typically) represent a Document Topic Model.
- …
- Example(s):
- Counter-Example(s):
- See: Topic Model, Generative Model, LSI Model.
References
2019
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis Retrieved:2019-6-1.
- Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles) is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.
Compared to standard latent semantic analysis which stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition), probabilistic latent semantic analysis is based on a mixture decomposition derived from a latent class model.
2003
- (Blei, Ng & Jordan, 2003) ⇒ David M. Blei, Andrew Y. Ng, and Michael I. Jordan. (2003). “Latent Dirichlet Allocation.” In: The Journal of Machine Learning Research, 3.
- QUOTE: A significant step forward in this regard was made by Hofmann (1999), who presented the probabilistic LSI (pLSI) model, also known as the aspect model, as an alternative to LSI. The pLSI approach, which we describe in detail in Section 4.3, models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of “topics.” Thus each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a list of mixing proportions for these mixture components and thereby reduced to a probability distribution on a fixed set of topics. This distribution is the “reduced description” associated with the document.
- QUOTE: While Hofmann’s work is a useful step toward probabilistic modeling of text, it is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. This leads to several problems: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting, and (2) it is not clear how to assign probability to a document outside of the training set. To see how to proceed beyond pLSI, let us consider the fundamental probabilistic assumptions underlying the class of dimensionality reduction methods that includes LSI and pLSI. All of these methods are based on the “bag-of-words” assumption — that the order of words in a document can be neglected. In the language of probability theory, this is an assumption of exchangeability for the words in a document (Aldous, 1985). Moreover, although less often stated formally, these methods also assume that documents are exchangeable; the specific ordering of the documents in a corpus can also be neglected.
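The first problem named in the quote above can be made concrete with a small count. This is a hedged sketch, assuming the model stores one word distribution per topic and one vector of mixing proportions per training document; the function name and the sizes used are hypothetical.

```python
def plsi_table_entries(n_topics, vocab_size, n_docs):
    """Count the entries in the P(w|z) and P(z|d) tables of a pLSI model."""
    topic_word = n_topics * vocab_size  # one word distribution per topic
    doc_topic = n_topics * n_docs       # one mixing vector per training document
    return topic_word + doc_topic


# Only the document-topic term depends on corpus size, so growth is linear in it.
for n_docs in (1_000, 10_000, 100_000):
    print(n_docs, plsi_table_entries(n_topics=50, vocab_size=20_000, n_docs=n_docs))
```

Because every training document contributes its own mixing vector, a document outside the training set has no corresponding entry, which is the second problem the quote raises.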
1999
- (Hofmann, 1999) ⇒ Thomas Hofmann. (1999). “Probabilistic Latent Semantic Indexing.” In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999). doi:10.1145/312624.312649
- QUOTE: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model.