1990 Indexing by Latent Semantic Analysis
- (Deerwester et al., 1990) ⇒ Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. (1990). “Indexing by Latent Semantic Analysis.” In: JASIS, 41(6).
Subject Headings: Latent Semantic Indexing, Unsupervised Dimensionality Reduction Algorithm, Lexical Semantic Similarity Function, Singular-Value Decomposition.
Notes
- ftp://ftp.cse.ucsc.edu/pub/darrell/deerwester-jasis90.pdf
- https://nats-www.informatik.uni-hamburg.de/pub/CrossLingIR/LiteraturListe/deerwester90indexing.pdf
- http://www.cs.csustan.edu/~mmartin/LDS/Deerwester-et-al.pdf
Cited By
Quotes
Abstract
A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
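The decomposition summarized in the abstract can be written compactly. The symbols below ($X$, $T_k$, $S_k$, $D_k$, $q$, $\tau$) are notation assumed here for illustration, and $k \approx 100$ is the figure the abstract gives:

```latex
% Term-by-document matrix X (t terms, d documents), truncated to k factors
X \;\approx\; \hat{X}_k \;=\; T_k \, S_k \, D_k^{\top},
\qquad T_k \in \mathbb{R}^{t \times k},\;
S_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\;
D_k \in \mathbb{R}^{d \times k}

% A query q (a weighted term vector) becomes a pseudo-document in the same space
\hat{q} \;=\; q^{\top} T_k \, S_k^{-1}

% Documents are returned when their cosine with the query exceeds a threshold
\cos\!\left(\hat{q} S_k,\; \hat{d}_j S_k\right) \;>\; \tau
```

Here $\hat{d}_j$ denotes row $j$ of $D_k$. Scaling both vectors by $S_k$ before taking the cosine is one common convention, not the only possible one.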
1. Introduction
The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. We assume there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval. We use statistical techniques to estimate this latent structure, and get rid of the obscuring "noise". A description of terms and documents based on the latent semantic structure is used for indexing and retrieval [1].
The particular “latent semantic indexing” (LSI) analysis that we have tried uses singular-value decomposition. We take a large matrix of term-document association data and construct a “semantic” space wherein terms and documents that are closely associated are placed near one another. Singular-value decomposition allows the arrangement of the space to reflect the major associative patterns in the data, and ignore the smaller, less important influences. As a result, terms that did not actually appear in a document may still end up close to the document, if that is consistent with the major patterns of association in the data. Position in the space then serves as the new kind of semantic indexing, and retrieval proceeds by using the terms in a query to identify a point in the space, and documents in its neighborhood are returned to the user.
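The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy matrix, the function names (`lsi_fit`, `fold_in_query`, `cosine_rank`), the choice of `k=2`, and the 0.5 threshold are all hypothetical, and raw term counts are used without any weighting.

```python
import numpy as np

# Toy term-by-document count matrix (rows = terms, columns = documents).
# The data are invented purely for illustration.
terms = ["car", "automobile", "engine", "flower", "petal"]
X = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 1],   # petal
], dtype=float)


def lsi_fit(X, k):
    """Return the k largest SVD factors of X as (T_k, s_k, D_k)."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    # T_k: terms x k, s_k: k singular values, D_k: documents x k
    return T[:, :k], s[:k], Dt[:k, :].T


def fold_in_query(q, T_k, s_k):
    """Map a term-space query vector to a pseudo-document: q^T T_k S_k^{-1}."""
    return (q @ T_k) / s_k


def cosine_rank(q_hat, D_k, s_k, threshold):
    """Rank documents by cosine with the query, keeping those above threshold.

    Both sides are scaled by the singular values before comparison
    (an assumed convention for this sketch).
    """
    docs = D_k * s_k
    qv = q_hat * s_k
    sims = docs @ qv / (np.linalg.norm(docs, axis=1) * np.linalg.norm(qv) + 1e-12)
    order = np.argsort(-sims)
    return [(int(i), float(sims[i])) for i in order if sims[i] > threshold]


# Query containing only the term "car".
query = np.array([1, 0, 0, 0, 0], dtype=float)

T_k, s_k, D_k = lsi_fit(X, k=2)
q_hat = fold_in_query(query, T_k, s_k)
results = cosine_rank(q_hat, D_k, s_k, threshold=0.5)
print(results)
```

With this toy matrix, the query "car" should retrieve documents 0 and 1 with cosines close to 1, even though document 1 contains "automobile" and "engine" but never "car". Keeping only two factors discards the minor pattern that separates the two vehicle documents, which is the effect the passage describes.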
References