1990 IndexingbyLatentSemanticAnalysi

Subject Headings: Latent Semantic Indexing, Unsupervised Dimensionality Reduction Algorithm, Lexical Semantic Similarity Function, Singular-Value Decomposition.

Notes

A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
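The steps named in the abstract can be written compactly as follows. This is a sketch in assumed notation: the symbols $X$, $T_k$, $S_k$, $D_k$, $q$, $\hat{q}$ and $\tau$ do not appear in the excerpt, and scaling both vectors by $S_k$ before taking the cosine is one common convention rather than something stated above.

```latex
% Rank-k approximation of the t x d term-by-document matrix X (k is ca. 100 in the abstract)
X \approx \hat{X}_k = T_k S_k D_k^{\top},
\qquad T_k^{\top} T_k = D_k^{\top} D_k = I_k,
\qquad S_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\ \sigma_1 \ge \dots \ge \sigma_k > 0

% Query as a pseudo-document: a weighted combination of the term vectors in T_k
\hat{q} = q^{\top} T_k S_k^{-1}

% Return document j (row d_j of D_k) when its cosine with the query exceeds a threshold \tau
\cos(\hat{q} S_k,\ d_j S_k)
  = \frac{(\hat{q} S_k) \cdot (d_j S_k)}{\lVert \hat{q} S_k \rVert \, \lVert d_j S_k \rVert} > \tau
```

Here $q$ is the query's vector of term weights over the same $t$ terms as the rows of $X$, so each document and each query ends up as a $k$-item vector of factor weights.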

We describe here a new approach to automatic indexing and retrieval. It is designed to overcome a fundamental problem that plagues existing retrieval techniques that try to match words of queries with words of documents. The problem is that users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user’s query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user’s query will literally match terms in documents that are not of interest to the user.

The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. We assume there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval. We use statistical techniques to estimate this latent structure, and get rid of the obscuring "noise". A description of terms and documents based on the latent semantic structure is used for indexing and retrieval ^[1].

The particular “latent semantic indexing” (LSI) analysis that we have tried uses singular-value decomposition. We take a large matrix of term-document association data and construct a “semantic” space wherein terms and documents that are closely associated are placed near one another. Singular-value decomposition allows the arrangement of the space to reflect the major associative patterns in the data, and to ignore the smaller, less important influences. As a result, terms that did not actually appear in a document may still end up close to the document, if that is consistent with the major patterns of association in the data. Position in the space then serves as the new kind of semantic indexing, and retrieval proceeds by using the terms in a query to identify a point in the space and returning the documents in its neighborhood to the user.
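The effect described above, where a term ends up close to a document it never appeared in, can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the five-term, four-document toy matrix, the choice of k = 2 factors, the 0.9 cosine threshold, and the helper names `lsi_fit`, `fold_in_query` and `cosine` are hypothetical and not taken from the paper. Scaling coordinates by the singular values before comparison is one common convention.

```python
import numpy as np

def lsi_fit(X, k):
    """Truncated SVD of a term-by-document matrix X, keeping the k largest factors."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    # Rows of T_k are term vectors, rows of D_k are document vectors.
    return T[:, :k], s[:k], Dt[:k, :].T

def fold_in_query(q, T_k, s_k):
    """Map a term-space query vector to a pseudo-document in the factor space."""
    return (q @ T_k) / s_k

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy data (hypothetical): rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
X = np.array([
    [1, 0, 0, 0],  # car:        only in document 0
    [0, 1, 0, 0],  # automobile: only in document 1
    [1, 1, 0, 0],  # engine:     in documents 0 and 1
    [0, 0, 1, 1],  # flower:     in documents 2 and 3
    [0, 0, 1, 0],  # petal:      only in document 2
], dtype=float)

T_k, s_k, D_k = lsi_fit(X, k=2)

# One-word query "car", expressed over the same term vocabulary.
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_hat = fold_in_query(q, T_k, s_k)

# Literal term matching: how many query terms each document contains.
raw_matches = q @ X

# LSI matching: cosine between the query and each document in the factor space.
lsi_scores = [cosine(q_hat * s_k, d * s_k) for d in D_k]

threshold = 0.9  # assumed cut-off for "supra-threshold" documents
retrieved = [j for j, c in enumerate(lsi_scores) if c > threshold]

print("raw term matches:", raw_matches)
print("LSI cosines:     ", np.round(lsi_scores, 3))
print("retrieved docs:  ", retrieved)
```

In this toy corpus, document 1 contains "automobile" and "engine" but not "car", so literal matching gives it no credit for the query. Because "car" and "automobile" both co-occur with "engine", the two-factor space places documents 0 and 1 at the same point, and the query retrieves both while leaving the flower documents out.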

[1] By “semantic structure” we mean here only the correlation structure in the way in which individual words appear in documents; “semantic” implies only the fact that terms in a document may be taken as referents to the document itself or to its topic.


Author: Susan T. Dumais; George W. Furnas; Thomas K. Landauer; Scott C. Deerwester; Richard A. Harshman
Title: Indexing by Latent Semantic Analysis
Year: 1990