Latent Semantic Indexing Algorithm
A Latent Semantic Indexing Algorithm is an automatic indexing and information retrieval algorithm that is based on the analysis of the distributional semantics of a set of documents.
- AKA: Latent Semantic Analysis (LSA) Algorithm, LSI.
- Context:
- It can be implemented by a Latent Semantic Indexing System to solve a Latent Semantic Indexing Task.
- It can be based on a global matrix factorization algorithm.
- It can (typically) be used as a Topic Indexing Algorithm (to capture implicit associations between words and text documents and to learn a latent vector representation of each word type).
- It can (often) be used as an Unsupervised Dimensionality Reduction Algorithm.
- It can use a Singular Value Decomposition of a Co-occurrence Matrix (e.g. between word forms and their documents); a minimal sketch appears just before the References section below.
- Example(s):
- Counter-Example(s):
- See: Neural Semantic Indexing Algorithm, Supervised Semantic Indexing (SSI) Algorithm, Latent Semantic Modeling Algorithm, Semantic Analysis, Vectorial Semantics, Artificial Neural Network, Synonymy, Polysemy.
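As a minimal illustration of the Context items above (not drawn from any of the references below), the following NumPy sketch builds a toy term-document co-occurrence matrix, factors it with a singular value decomposition, and keeps only the top-k singular dimensions to obtain latent vector representations of the word types. The vocabulary, counts, and choice of k = 2 are invented for illustration.

```python
# Minimal LSI/LSA sketch; the toy vocabulary and counts are invented.
import numpy as np

vocab = ["ship", "boat", "ocean", "wood", "tree"]
X = np.array([            # term-document count matrix (rows = word types)
    [1, 0, 1, 0, 0],      # ship
    [0, 1, 0, 0, 0],      # boat
    [1, 1, 1, 0, 0],      # ocean
    [0, 0, 0, 1, 1],      # wood
    [0, 0, 0, 0, 1],      # tree
], dtype=float)

# Factor X = U diag(s) Vt and keep only the top-k singular dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]     # latent vector per word type
doc_vectors = Vt[:k, :].T * s[:k]   # latent vector per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "ship" and "boat" never co-occur in the same document, yet the latent
# space places them close together because both co-occur with "ocean".
print(cosine(word_vectors[vocab.index("ship")],
             word_vectors[vocab.index("boat")]))
```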
References
2019
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Latent_semantic_analysis Retrieved:2019-6-2.
- Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Paragraphs are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two columns. Values close to 1 represent very similar paragraphs while values close to 0 represent very dissimilar paragraphs. An information retrieval technique using latent semantic structure was patented in 1988 (US Patent 4,839,853, now expired) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI).
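The paragraph-comparison procedure described in this passage can be sketched in a few lines with scikit-learn (a library not mentioned in the quoted article; the example paragraphs are invented). Note that scikit-learn arranges the count matrix as paragraphs-by-words, the transpose of the words-by-paragraphs layout above, which does not affect the cosine comparison.

```python
# Sketch of the LSA paragraph comparison described above, using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = [
    "the cat sat on the mat",
    "a cat and a kitten played near the mat",
    "stock prices fell sharply on monday",
]

counts = CountVectorizer().fit_transform(paragraphs)           # word counts per paragraph
reduced = TruncatedSVD(n_components=2).fit_transform(counts)   # rank-2 SVD projection

# Cosine values close to 1 indicate very similar paragraphs,
# values close to 0 very dissimilar ones.
print(cosine_similarity(reduced))
```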
2014a
- (code google, 2014) ⇒ https://code.google.com/p/semanticvectors/
- QUOTE: … The models are created by applying concept mapping algorithms to term-document matrices created using Apache Lucene. The concept mapping algorithms supported by the package include Random Projection, Latent Semantic Analysis (LSA) and Reflective Random Indexing. …
2014b
- (code google, 2014) ⇒ https://code.google.com/p/semanticvectors/wiki/LatentSemanticAnalysis
- QUOTE: Latent Semantic Analysis, or LSA, was one of the earliest methods for compressing distributional semantic models and learning implicit semantic information in the process (see http://lsa.colorado.edu/).
LSA usually refers specifically to the use of Singular Value Decomposition (SVD) to compress a term-document matrix, though the term has sometimes been used more generally to cover more types of distributional semantic methods.
Traditional LSA is available in SemanticVectors: instead of using java pitt.search.semanticvectors.BuildIndex, use java pitt.search.semanticvectors.LSA in just the same way. This directs the learning process to use SVD instead of Random Projection.
The Singular Value Decomposition itself is computed using the SVDLIBJ package written by Adrian Kuhn and David Erni at the University of Bern. The SemanticVectors authors are very grateful for this contribution.
2011a
- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Latent_semantic_indexing
- Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called Singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.
2011b
- (Osadchy, 2011) ⇒ Margarita Osadchy (2011). "Lecture: Latent Semantic Indexing (LSI)" (PowerPoint Presentation) - Unsupervised Learning 2011.
- QUOTE: Uses statistically derived conceptual indices instead of individual words for retrieval.
- Assumes that there is some underlying or latent structure in word usage that is obscured by variability in word choice.
- Key idea: instead of representing documents and queries as vectors in a t-dim space of terms, represent them (and terms themselves) as vectors in a lower-dimensional space whose axes are concepts that effectively group together similar words.
- These axes are the Principal Components from PCA.
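In the notation commonly used for LSI (following the usual textbook conventions rather than anything stated on these slides), the lower-dimensional "concept" representation is the rank-k truncated SVD of the t × d term-document matrix:

```latex
% Rank-k truncated SVD of the t x d term-document matrix X.
X \;\approx\; X_k \;=\; U_k \,\Sigma_k\, V_k^{\top},
\qquad U_k \in \mathbb{R}^{t \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{d \times k}
```

Here the k retained singular dimensions play the role of the concept axes: terms are commonly represented by the rows of U_k Σ_k and documents by the rows of V_k Σ_k.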
2007a
- (Landauer et al., 2007) ⇒ Thomas K. Landauer (editor), Walter Kintsch (editor), Danielle S. McNamara (editor), and Simon Dennis (editor). (2007). “Handbook of Latent Semantic Analysis.” Routledge. ISBN:0805854185
2007b
- (Steyvers & Griffiths, 2007) ⇒ Mark Steyvers, and Thomas L. Griffiths. (2007). “Probabilistic Topic Models.” In: (Landauer et al., 2007).
- QUOTE: The LSA approach makes three claims: that semantic information can be derived from a word-document co-occurrence matrix; that dimensionality reduction is an essential part of this derivation; and that words and documents can be represented as points in Euclidean space. In this chapter, we pursue an approach that is consistent with the first two of these claims, but differs in the third, describing a class of statistical models in which the semantic properties of words and documents are expressed in terms of probabilistic topics.
2003
- (Baldwin et al., 2003) ⇒ Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. (2003). “An Empirical Model of Multiword Expression Decomposability.” In: Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment. doi:10.3115/1119282.1119294
1998
- (Landauer et al., 1998) ⇒ Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. (1998). “An Introduction to Latent Semantic Analysis.” In: Discourse Processes, 25(2&3).
- ABSTRACT: Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text (Landauer and Dumais, 1997). The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA’s reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word–word and passage–word lexical priming data; and, as reported in 3 following articles in this issue, it accurately estimates passage coherence, learnability of passages by individual students, and the quality and quantity of knowledge contained in an essay.
1997a
- (Landauer & Dumais, 1997) ⇒ Thomas K. Landauer, and Susan T. Dumais. (1997). “A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” In: Psychological Review, 104(2).
1997b
- (Schütze & Silverstein, 1997) ⇒ Hinrich Schütze, and Craig Silverstein. (1997). “Projections for Efficient Document Clustering.” In: ACM SIGIR Forum.
1997c
- (Dumais et al., 1997) ⇒ Susan T. Dumais, Todd A. Letsche, Michael L. Littman, and Thomas K. Landauer. (1997). “Automatic Cross-language Retrieval Using Latent Semantic Indexing.” In: Proceedings of AAAI spring symposium on cross-language text and speech retrieval.
- QUOTE: LSI examines the similarity of the “contexts” in which words appear, and creates a reduced-dimension feature space in which words that occur in similar contexts are near each other. LSI uses a method from linear algebra, singular value decomposition (SVD), to discover the important associative relationships. It is not necessary to use any external dictionaries, thesauri, or knowledge bases to determine these word associations because they are derived from a numerical analysis of existing texts. The learned associations are specific to the domain of interest, and are derived completely automatically.
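One standard way to use the reduced-dimension feature space just described for retrieval is to "fold" a query's term-count vector into the latent space and rank documents by cosine similarity there. The projection below is the common textbook construction (one of several variants), not a formula quoted from Dumais et al.; the toy matrix repeats the invented example used earlier on this page.

```python
# Sketch of LSI retrieval: fold a query into the latent space, then rank
# documents by cosine similarity. Folding-in uses q_hat = Sigma_k^{-1} U_k^T q
# (a common textbook variant); the toy data are invented for illustration.
import numpy as np

terms = ["ship", "boat", "ocean", "wood", "tree"]
X = np.array([            # term-document count matrix
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]
doc_vectors = Vt[:k, :].T * sk          # documents as rows of V_k * Sigma_k

query = np.zeros(len(terms))            # query as a term-count vector: "boat ocean"
query[terms.index("boat")] = 1.0
query[terms.index("ocean")] = 1.0

q_hat = (Uk.T @ query) / sk             # fold the query into the latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

scores = [cosine(q_hat, d) for d in doc_vectors]
print(sorted(range(len(scores)), key=lambda i: -scores[i]))   # doc ids, best first
```

Other presentations compare the folded-in query against the rows of V_k using a Σ-weighted inner product instead; the variants differ only in how the singular values weight the comparison.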
1990
- (Deerwester et al., 1990) ⇒ Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. (1990). “Indexing by Latent Semantic Analysis.” In: Journal of the American Society for Information Science.
- QUOTE: We describe here a new approach to automatic indexing and retrieval. It is designed to overcome a fundamental problem that plagues existing retrieval techniques that try to match words of queries with words of documents. The problem is that users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user’s query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user’s query will literally match terms in documents that are not of interest to the user.
The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. We assume there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval. We use statistical techniques to estimate this latent structure, and get rid of the obscuring "noise". A description of terms and documents based on the latent semantic structure is used for indexing and retrieval [1].