1997 A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge

From GM-RKB

Subject Headings: Latent Semantic Analysis, SVD Task.

Notes

Cited By

Quotes

Abstract

How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena and problems are sketched.

Overview

In this paper we will report the results of using Latent Semantic Analysis (LSA), a high-dimensional linear associative model that embodies no human knowledge beyond its general learning mechanism, to analyze a large corpus of natural text and generate a representation that captures the similarity of words and text passages. The model's resulting knowledge was tested with a standard multiple-choice synonym test, and its learning power compared to the rate at which school-aged children improve their performance on similar tests as a result of reading. The model's improvement per paragraph of encountered text approximated the natural rate for school children, and most of its acquired knowledge was attributable to indirect inference rather than direct co-occurrence relations. This result can be interpreted in at least two ways. The more conservative interpretation is that it shows that, with the right analysis, a substantial portion of the information needed to answer common vocabulary test questions can be inferred from the contextual statistics of usage alone. This is not a trivial conclusion. As we alluded to above and will elaborate below, much theory in philosophy, linguistics, artificial intelligence research, and psychology has supposed that acquiring human knowledge, especially knowledge of language, requires more specialized primitive structures and processes, ones that presume the prior existence of special foundational knowledge rather than just a general purpose analytic device. This result questions the scope and necessity of such assumptions. Moreover, no previous model has been applied to simulate the acquisition of any large body of knowledge from the same kind of experience used by a human learner.

The other, more radical, interpretation of this result takes the mechanism of the model seriously as a possible theory about all human knowledge acquisition, as a homologue of an important underlying mechanism of human cognition in general. In particular, the model employs a means of induction-dimension matching-that greatly amplifies its learning ability, allowing it to correctly infer indirect similarity relations only implicit in the temporal correlations of experience. It exhibits human-like generalization that is based on learning and that does not rely on primitive perceptual or conceptual relations or representations. Similar induction processes are inherent in the mechanisms of certain other theories, e.g., some associative, semantic and neural network models. However, as we will show, substantial effects arise only if the body of knowledge to be learned contains appropriate structure and only when a sufficient-possibly quite large-quantity of it has been learned. As a result, the posited induction mechanism has not previously been credited with the significance it deserves or exploited to explain the many poorly understood phenomena to which it may be germane. The mechanism lends itself, among other things, to a deep reformulation of associational learning theory that appears to offer explanations and modeling directions for a wide variety of cognitive phenomena. One set of phenomena that we will discuss in detail, along with some auxiliary data and simulation results, is contextual disambiguation of words and passages in text comprehension.

Because readers with different theoretical interests may find these two interpretations differentially attractive, we will follow a slightly unorthodox manner of exposition. While we will present a general theory, or at least the outline of one, that incorporates and fleshes out the implications of the inductive mechanism of the formal model, we will try to keep this development somewhat independent of the report of our simulation studies. That is, we will eschew the conventional stance that the theory is primary and the simulation studies are tests of it. Indeed, the historical fact is that the mathematical text analysis technique came first, as a practical expedient for automatic information retrieval, the vocabulary acquisition simulations came next, and the theory arose last, as a result of observed empirical successes and discovery of the unsuspectedly important effects of the model's implicit inferential operations.

The Problem of Induction

One of the deepest, most persistent mysteries of cognition is how people acquire as much knowledge as they do on the basis of as little information as they get. Sometimes called "Plato's problem", "the poverty of the stimulus", or, in another guise, "the problem of the expert", the question is how observing a relatively small set of events results in beliefs that are usually correct or behaviors that are usually adaptive in a large, potentially infinite variety of situations. Following Plato, philosophers (e.g. Goodman, 1972; Quine, 1960), psychologists (e.g. Shepard, 1987; Vygotsky, 1968), linguists (e.g. Chomsky, 1991; Jackendoff, 1992; Pinker, 1990), computation scientists (e.g. Angluin & Smith, 1983; Michalski, 1983) and combinations thereof (Holland, Holyoak, Nisbett & Thagard, 1986) have wrestled with the problem in many guises. Quine (1960), following a tortured history of philosophical analysis of scientific truth, calls the problem "the scandal of induction", essentially concluding that purely experience-based objective truth cannot exist. Shepard (1987) has placed the problem at the heart of psychology, maintaining that a general theory of generalization and similarity is as necessary to psychology as Newton's laws are to physics. Perhaps the most well-advertised examples of the mystery lie in the acquisition of language. Chomsky (e.g. Chomsky, 1991) and followers assert that a child's exposure to adult language provides inadequate evidence from which to learn either grammar or lexicon. Gold, Osherson, Feldman and others (see Osherson, Weinstein, & Stob, 1986) have formalized this argument, showing mathematically that certain kinds of languages cannot be learned to certain criteria on the basis of finite data. The puzzle presents itself with quantitative clarity in the learning of vocabulary during the school years, the particular case that we will address most fully here.
School children learn to understand words at a rate that appears grossly inconsistent with the information about each word provided by the individual language samples to which they are exposed, and much faster than they can be made to by explicit tuition.

Recently Pinker (1994) has summarized the broad spectrum of evidence on the origins of language-in evolution, history, anatomy, physiology and development. In accord with Chomsky's dictum, he concludes that language learning must be based on a very strong and specific innate foundation, a set of general rules and predilections which need parameter-setting and filling in, but not acquisition as such, from experience. While this "language instinct" position is debatable as stated, it rests on an idea that is surely correct, that some powerful mechanism exists in the minds of children that can use the finite information they receive to turn them into competent users of human language. What we want to know, of course, is what this mechanism is, what it does, how it works. Unfortunately the rest of the instinctivist answers are as yet of limited help. The fact that the mechanism is given by biology or that it exists as an autonomous mental or physical "module" (if it does), tells us next to nothing about how the mind solves the basic inductive problem.

Shepard's answer to the induction problem in stimulus generalization is equally dependent on biological givens, but offers a more precise description of some parts of the proposed mechanism. He posits that the nervous system has evolved general functional relations between monotone transductions of input values and the similarity of central interpretive processes. On average, he maintains, the similarities generated by these functions are adaptive because they predict in what situations (consequential regions, in his terminology) the same behavioral cause-effect relations are likely to hold. Shepard's mathematical laws for stimulus generalization are empirically correct or nearly so for a considerable range of low-dimensional, psychophysical continua, and for certain functions computed on behaviorally measured relations such as choices between stimuli or judgments of inequality on some experiential dimension. However, his laws fall short of being able to predict whether cheetahs are considered more similar to zebras or tigers, whether friendship is thought to be more similar to love or hate, and are mute, or at least very incomplete, on the similarity of the meanings of the words "cheetah", "zebra", "tiger", "love", "hate" and "pode". Indeed, it is the generation of psychological similarity relations based solely on experience, the achievement of bridging inferences from experience about cheetahs and friendship to behavior about tigers and love, and from hearing conversations about one to knowledge about the other, that pose the most difficult and tantalizing puzzle.

Often the cognitive aspect of the induction puzzle is cast as the problem of categorization, of finding a mechanism by which a set of stimuli, words, or concepts (cheetahs, tigers) come to be treated as the same for some purposes (running away from, or using metaphorically to describe a friend or enemy). The most common attacks on this problem invoke similarity as the underlying relation among stimuli, concepts, or features (e.g. Rosch, 1978; Smith & Medin, 1981; Vygotsky, 1986). But as Goodman (1972) has trenchantly remarked, "similarity is an impostor", at least for the solution of the fundamental problem of induction. For example, the categorical status of a concept is often assumed to be determined by similarity to a prototype, or to some set of exemplars (e.g. Rosch, 1978; Smith & Medin, 1981). Similarity is either taken as primitive (e.g. Posner & Keele, 1968; Rosch, 1978) or as dependent on shared component features (e.g. Smith & Medin, 1981; Tversky, 1977; Tversky & Gati, 1978). But this throws us into an unpleasant regress; when is a feature a feature? Do bats have wings? When is a wing a wing? Apparently, the concept "wing" is also a category dependent on the similarity of features. Presumably, the regress ends when it grounds out in the primitive perceptual relations assumed, for example, by Shepard's theory. But only some basic perceptual similarities are relevant to any feature or category, others are not; a wing can be almost any color. The combining of disparate things into a common feature identity, or into a common category, must very often depend on experience. How does that work?

Crisp categories, logically defined on rules about feature combinations, such as those often used in category-learning, probability estimation, choice and judgment experiments, lend themselves to acquisition by logical rule-induction processes, although whether such processes are what humans always or usually use is questionable (Holland, Holyoak, Nisbett & Thagard, 1986; Medin, Goldstone & Gentner, 1993; Murphy & Medin, 1985; Smith & Medin, 1981). Surely, the natural acquisition of fuzzy or probabilistic features or categories relies on some other underlying process, some mechanism by which experience with examples can lead to treating new instances more-or-less equivalently, some mechanism by which common significance, common fate, or common context of encounter can generate acquired similarity. We seek a mechanism by which the experienced and functional similarity of concepts, especially complex, largely arbitrary ones, like the meaning of "concept", "component" or "feature", or, perhaps, the component features of which concepts might consist, are created from an interaction of experience with the logical (or mathematical or neural) machinery of mind.

Something of the sort is the apparent aim of Chomsky's program for understanding the acquisition of grammar. He supposes that the mind contains a prototypical framework, a set of kinds of rules, on which any natural language grammar can be built, and that being required to obey some one of the allowable sets of rules sufficiently constrains the problem that a child can solve it; a small amount of evidence will suffice to choose between the biologically possible alternative grammars. Of what the presumed primordial, universal, abstract grammar consists remains unsettled, although some of its gross features have been described. How experiential evidence is brought to bear in setting its options also has yet to be well specified, although developmental psycholinguists have provided a great deal of relevant evidence (see e.g. Slobin, 1982). Finally, the rules so far hypothesized for "universal grammar" are stated in sophisticated mentalistic terms, like "head noun", that beg for reduction to a level at which some logical or neural computation acting on observables or inferables can be imagined for their mechanism.

A similar tack has been taken in attempting to explain the astonishing rate of vocabulary learning, some seven to ten words per day, in children during the early years of preliterate language growth. Here, theorists such as Carey (1985), E. Clark (1987), Keil (1989) and Markman (1994) have hypothesized constraints on the assignment of meanings to words. For example, it has been proposed that early learners assume that most words are names for perceptually coherent objects, that any two words usually have two distinct meanings, that words containing common sounds have related meanings, that an unknown speech sound probably refers to something for which the child does not yet have a word, and that children obey certain strictures on the structure of relations among concept classes. Some theorists have supposed that the proposed constraints are biological givens, some have supposed that they derive from progressive logical derivation during development, some have allowed that constraints may have prior bases in experience; many have hedged on the issue of origins, which is probably not a bad thing, given our state of knowledge. For the most part, proposed constraints on lexicon learning have also been described in qualitative mentalistic terminology that fails to provide entirely satisfying causal explanations; exactly how, for example, does a child apply the idea that a new word has a new meaning?

What all modern theories of knowledge acquisition (as well as Plato's) have in common is the postulation of constraints that greatly (in fact, infinitely) narrow the solution space of the problem that is to be solved by induction, that is, by learning. This is the obvious, indeed the only, escape from the inductive paradox. The fundamental notion is to replace an intractably large or infinite set of possible solutions with a problem that is soluble on the data available. So, for example, if biology specifies a function on wavelength of light that is assumed to map the difference between two objects that differ only in color onto the probability that doing the same thing with them will have the same consequences, then a bear need sample only one color of a certain type of berry before knowing which others to pick. A syntax learner who can assume that verbs either always precede nouns, or always follow them, need only learn which; a word-referent learner who can assume that no two words refer to the same object, when presented with an as-yet unnamed object and an as-yet unknown word can guess with reasonable safety that they are related to each other.

There are several problematical aspects to constraint-based resolutions of the induction paradox. One is whether a particular constraint exists as supposed. For example, is it true that young children assume that the same object is given only one name, and if so is the assumption correct about the language to which they are exposed? (It is not in adult English usage; ask 100 people what to title a recipe or name a computer command and you will get almost 30 different answers on average-see Furnas, Landauer, Dumais & Gomez, 1983, 1987). These are empirical questions, and ones to which most of the research in early lexical acquisition has been addressed. One can also wonder about the origin of a particular constraint, and whether it is plausible to regard it as a primitive process with an evolutionary basis. For example, most of the constraints proposed for language learning are very specific and relevant only to human language, making their postulation consistent with a very strong instinctive and modular view of mental processes. In Pinker's (1994) recent pursuit of this reasoning he is led to postulating, albeit apparently with tongue somewhat in cheek, no less than 15 different domains of human knowledge, each with its own set of specific innate-knowledge constraints. Is it likely that such a panoply of domain-specific innate knowledge could have arisen over less than a million years of Homo Sapiens evolution? Or is some more general set of constraints, in spirit more like those proposed by Shepard, at work throughout cognition? One potential advantage of more general cognitive constraints is that they might make possible derived sets of higher-order constraints based on experience, which could then underwrite induction in relatively labile domains of knowledge such as those aspects of culture invented slowly by earlier generations but learned quickly by later ones.

The existence and origin of particular constraints is only one part of the problem. The existence of some set of constraints is a logical necessity, so that showing that some exist is good but not nearly enough. The rest of the problem involves three general issues. The first is whether a particular set of constraints is logically and pragmatically sufficient, that is, whether the problem space remaining after applying them is soluble. For example, suppose that young children do, in fact, assume that there are no synonyms. How much could that help them in learning the lexicon from the language to which they are exposed? Enough? Indeed, that particular constraint leaves the mapping problem potentially infinite; it could even exacerbate the problem by tempting the child to assign too much or the wrong difference to "our dog", "the collie" and "Fido". Add in the rest of the constraints that have been proposed: enough now?

The second issue is methodological, how to get an answer to the first question, how to determine whether a specified combination of constraints when applied to natural environmental input would solve the problem, or perhaps better, determine how much of the problem it would solve. We believe that the best available strategy for doing this is to specify a concrete computational model embodying the proposed constraints and to simulate as realistically as possible its application to the acquisition of some measurable and interesting properties of human knowledge. In particular, with respect to constraints supposed to allow the learning of language and other large bodies of complexly structured knowledge, domains in which there are very many facts each weakly related to very many others, effective simulation may require data sets of the same size and content as those encountered by human learners. Formally, that is because weak local constraints can combine to produce strong inductive effects in aggregate. A simple analog is the familiar example of a diagonal brace to produce rigidity in a structure made of three beams. Each connection between two beams can be a single bolt. Two such connections exert no constraint at all on the angle between the beams. However, when all three beams are so connected, all three angles are completely specified. In structures consisting of thousands of elements weakly connected (i.e. constrained) in hundreds of different ways (i.e. in hundreds of dimensions instead of two), the effects of constraints may emerge only in very large naturally generated ensembles. In other words, experiments with miniature or concocted subsets of language experience may not be sufficient to reveal or assess the forces that hold conceptual knowledge together. The relevant quantitative effects of such phenomena may only be ascertainable from experiments or simulations based on the same masses of input data encountered by people.

The third problem is to determine whether a postulated model corresponds to what people actually do, whether it is psychologically valid, whether the constraints it uses are the same ones upon which human achievement relies. As we said earlier, showing that a particular constraint, e.g. avoidance of synonyms, exists in a knowledge domain and is used by learners, is not enough unless we can show that it sufficiently helps to solve the overall inductive problem over a representative mass of input. Moreover, even if a model could solve the same difficult problem that a human does given the same data it would not prove that the model solves the problem in the same way. What to do? Apparently, one necessary test is to require a conjunction of both kinds of evidence, observational or experimental evidence that learners are exposed to and influenced by a certain set of constraints, and evidence that when embedded in a simulation model running over a natural body of data the same constraints approximate natural human learning and performance. However, in the case of effective but locally weak constraints, the first part of this two-pronged test, experimental or observational demonstration of their human use, might well fail. Such constraints might not be detectable by isolating experiments or in small samples of behavior. Thus, while an experiment or series of observational studies could prove that a particular constraint is used by people, it could not prove that it is not. A useful strategy for such a situation is to look for additional effects predicted by the postulated constraint system in other phenomena exhibited by learners after exposure to large amounts of data.

The Latent Semantic Analysis Model

The model we have used for simulation is a purely mathematical analysis technique. However, we want to interpret the model in a broader and more psychological manner. In doing so, we hope to show that the fundamental features of the theory that we will later describe are plausible, to reduce the otherwise magical appearance of its performance, and to suggest a variety of relations to psychological phenomena other than the ones to which we have as yet applied it.

We will explicate all of this in a somewhat spiral fashion. First, we will try to explain the underlying inductive mechanism of dimensionality matching upon which the model's power hinges. We will then sketch how the model's mathematical machinery operates and how it has been applied to data and prediction. Next, we will offer a psychological process interpretation of the model that shows how it maps onto but goes beyond familiar theoretical ideas, empirical principles, findings and conjectures. We will then, finally, return to a more detailed and rigorous presentation of the model and its applications.

An Informal Explanation of The Inductive Value of Dimensionality Matching

Suppose that Jack and Jill can only communicate by telephone. Jack, sitting high on a hill and looking down at the terrain below, estimates the distances separating three houses, A, B and C. He says that house A is 5 units from both house B and house C, and that houses B and C are separated by 8 units. Jill uses these estimates to plot the position of the three houses, as shown in the top portion of Figure 1. But then Jack says "Oh, by the way, they are all on the same straight, flat road". Now Jill knows that Jack's estimates must have contained errors, and revises her own in a way that uses all three together to improve each one, to 4.5, 4.5 and 9, as shown in the bottom portion of Figure 1.

Figure 1.

Three distances among three objects are always consistent in two dimensions so long as they obey the triangle inequality (the longest distance must be less than or equal to the sum of the other two). But, knowing that all three distances must be accommodated in one dimension strengthens the constraint (the longest must be exactly equal to the sum of the other two). If the dimensional constraint is not met, the apparent errors in the estimates must be resolved. One compromise is to adjust each distance by the same proportion so as to make two of the lengths add up to the third. The important point is that knowing the dimensionality improves the estimates. Of course, this works the other way around as well. Had the distances been generated from a two- or three-dimensional array (e.g., the road was curved, or curved and hilly), accommodating the estimates on a straight line would have distorted their original relations and added error rather than reducing it.
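Jill's revision can be sketched concretely. The particular scaling rule below is an assumption for illustration (the text says only that each distance is adjusted "by the same proportion"); it picks the common factor r so that, after scaling the two shorter estimates by r and the longest by 1/r, the longest fitted distance exactly equals the sum of the other two.

```python
import math

def fit_to_line(d1, d2, d3):
    """Rescale three pairwise distance estimates by a common
    proportion so they become consistent with a one-dimensional
    arrangement: the longest must equal the sum of the other two.
    Returns the adjusted distances in ascending order."""
    a, b, c = sorted([d1, d2, d3])      # c is the longest estimate
    r = math.sqrt(c / (a + b))          # chosen so r*(a + b) == c/r
    return a * r, b * r, c / r

# Jack's noisy estimates: A-B = 5, A-C = 5, B-C = 8.
ab, ac, bc = fit_to_line(5, 5, 8)
print(round(ab, 1), round(ac, 1), round(bc, 1))  # → 4.5 4.5 8.9
```

With Jack's estimates this gives roughly 4.5, 4.5 and 8.9, agreeing with the revised values quoted in the text up to rounding.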

Sometimes researchers have considered dimensionality reduction as a method to reduce computational complexity or for smoothing, that is, for simplifying the description of data or filling in missing points (e.g. Church & Hanks, 1990; Grefenstette, 1993; Schütze, 1992). However, as we will see, choosing the right dimensionality, when appropriate, can have a much more dramatic effect than these interpretations would seem to suggest.

Let us now construe the semantic similarity between two words in terms of distance: the closer the greater the similarity. Suppose we also assume that the likelihood of two words appearing in the same window of discourse (a phrase, a sentence, a paragraph or what have you) is inversely related to their semantic distance, that is, directly related to their semantic similarity. We could then estimate the relative similarity of any pair of words by observing the relative frequency of their joint occurrence in such windows.

Given a finite sample of language, such estimates would be quite noisy. Worse yet, estimates for most pairwise relations would be completely missing, not only because of thin sampling, but also because real language may use only one of several words of near-synonymous meaning in the same passage (just as only one view of the same object may be present in a given scene). If the internal representation of semantic similarity is constructed in as many dimensions as there are contexts, there would be little more we could do with the data. Putting this in linguistic terms, each time we encountered a word we could believe it to mean something entirely different. However, if the source of the discourse was a mind in which semantic similarities were represented in k dimensional space, then we might be able to improve our initial estimates of pairwise similarities, and to accurately estimate the similarities among pairs never observed together, by fitting them as best we could into a space of the same dimensionality. This is closely related to familiar uses of factor analysis and multi-dimensional scaling, and to unfolding (Carroll & Arabie, in press; Coombs, 1964), but using a particular kind of data and writ very large. Charles Osgood (1971) seems to have anticipated such a theoretical development when computational power eventually rose to the task, as it now has. How much improvement will result from optimal dimensionality choice depends on empirical issues: the distribution of inter-word distances, the frequency and composition of their contexts in natural discourse, the detailed structure of distances among words estimated with varying precision, and so forth.
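The kind of fitting alluded to here, forcing noisy pairwise distances into a space of chosen dimensionality, can be illustrated with classical multidimensional scaling. The sketch below is illustrative only: the house positions and noise level are invented, and LSA itself applies SVD to a word-by-context matrix rather than MDS to a distance matrix.

```python
import numpy as np

def classical_mds(D, k):
    """Embed n items as points in k dimensions, given an n x n
    symmetric matrix of pairwise distance estimates D (Torgerson's
    classical multidimensional scaling)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]         # keep the top k
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

# Toy data: three houses that truly lie on a line, observed through
# noisy, symmetric pairwise distance estimates.
rng = np.random.default_rng(0)
true_pos = np.array([0.0, 4.5, 9.0])
D_true = np.abs(true_pos[:, None] - true_pos[None, :])
noise = rng.normal(0.0, 0.3, D_true.shape)
D_noisy = D_true + (noise + noise.T) / 2
np.fill_diagonal(D_noisy, 0.0)

# Forcing a one-dimensional solution makes the fitted distances
# mutually consistent: the longest equals the sum of the other two.
X = classical_mds(D_noisy, 1)
D_fit = np.abs(X - X.T)
```

The useful property is exactly the one in the Jack-and-Jill example: the embedding pools all the noisy estimates and returns a set of distances that respect the assumed dimensionality.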

The scheme just outlined would make it possible to build a communication system in which two parties could come to agree on the usage of elementary components, e.g., words, at least up to the relative similarity among pairs of words. (The same process would presumably be used to reach agreement on similarities between words and perceptual inputs and perceptual inputs and each other, but for clarity and simplicity, and because the word domain is where we have data and have simulated the process, we concentrate here on word-word relations). Suppose that a communicator possesses a representation of a large number of words as points in a high dimensional space. In generating strings of words, the sender tends to choose words located near each other in some region of the space. Locally, over short time spans, contiguities among output words would reflect, at least weakly, their distances in the sender's semantic space. A receiver could make first order estimates of the distance between pairs by their relative frequency of occurrence in the same temporal contexts, e.g. a paragraph. However, since there are very many words in any natural language, and a relatively small amount of received discourse, such information would surely be inadequate. For example, it is quite likely that two words with frequencies of one in a million will never have been experienced near each other even though they have related meanings. However, if the receiving device sets out to represent the results of its statistical knowledge as points in a space of the same or nearly the same dimensionality as that from which it was generated, it is bound to do better. How much better will depend, as we've already said, on matters that can only be settled by observation.

Except for some technical matters, such as the similarity metric employed, our model works exactly as if the assumption of such a communicative process characterizes natural language (and, possibly, other domains of natural knowledge). In essence, and in detail, it assumes that the psychological similarity between any two words is reflected in the way they co-occur in small subsamples of language, that the source of language samples produces words in a way that ensures an orderly stochastic mapping between semantic similarity and output distance. It then fits all of the pairwise similarities into a common space of high but not unlimited dimensionality.

As in the house mapping and geometric examples, the assumed number of dimensions must be neither too great nor too small for such a trick to work. That is, in order to utilize the extra information inherent in the dimensional constraint, the receiver must be able to adopt an appropriate dimensionality in which to represent the joint set of observed relations. Because, as we will see, the model predicts what words should occur in the same contexts, an organism using such a mechanism could, either by evolution or learning, adjust the number of dimensions on the basis of trial and error. By the same token, not knowing this dimensionality a priori, in our studies we have varied the dimensionality of the simulation model to determine what produces the best results.

More conceptually or cognitively elaborate mechanisms for the representation of meaning also might generate dimensional constraints, and might correspond more closely to the mentalistic hypotheses of current linguistic and psycholinguistic theories. For example, theories that postulate meaningful semantic features could be effectively isomorphic to LSA given the identification of a sufficient number of sufficiently independent features and their accurate quantitative assignment to all the words of a large vocabulary. But suppose that it is not necessary to add such subjective interpretations or elaborations for the model to work. Then LSA could be a direct expression of the fundamental principles on which semantic similarity (as well as other perceptual and memorial relations) is built rather than being a reflection of some other system. It is too early to tell whether the model is merely a mathematical convenience that approximates the effects of "true" cognitive features and processes, or corresponds directly to the actual underlying mechanism of which more qualitative theories now current are themselves but partial approximations. The model we propose is at the computational level described by Marr (1982) (see also Anderson, 1990), that is, it specifies the natural problem that must be solved and an abstract computational method for its solution.

A Psychological Description of LSA as a Theory of Learning, Memory and Knowledge

We will give a more complete description of LSA as a mathematical model below when we use it to simulate lexical acquisition. However, an overall outline is necessary to understand a roughly equivalent psychological theory we wish to present first. The input to LSA is a matrix consisting of rows representing unitary event types by columns representing contexts in which instances of the event types appear. One example is a matrix of unique word types by many individual paragraphs in which the words are encountered, where a cell contains the number of times that a particular word type, say "model," appears in a particular paragraph, say this one. After an initial transformation of the cell entries, this matrix is analyzed by a statistical technique called Singular Value Decomposition (SVD), closely akin to factor analysis, which allows event types and individual contexts to be re-represented as points or vectors in a high-dimensional abstract space (Golub, Luk & Overton, 1981). The final output is a representation from which one can calculate similarity measures between all pairs consisting of either event types or contexts, e.g., word-word, word-paragraph, or paragraph-paragraph similarities.
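
The input matrix just described can be sketched concretely; the toy paragraphs below are invented for illustration.

```python
from collections import Counter

paragraphs = [
    "the model learns word meanings from context",
    "a model of semantic space",
    "children learn vocabulary from reading",
]

vocab = sorted({w for p in paragraphs for w in p.split()})
# rows are word types, columns are paragraphs; each cell holds a raw count
counts = [[Counter(p.split())[w] for p in paragraphs] for w in vocab]

print(counts[vocab.index("model")])  # "model" appears in paragraphs 0 and 1
```

Each row is one event type's profile of occurrence over contexts, the raw material that the transformation and SVD steps below operate on.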

Psychologically, the data that the model starts with are raw, first-order local associations between a stimulus and other temporally contiguous stimuli, or, equivalently, associations between stimuli and the contexts or episodes in which they occur. The stimuli or event types may be thought of as unitary chunks of perception or memory. (We will describe a hypothetical unitization process later that is, in essence, a hierarchical recursion of the LSA representation.)

The first-order process by which initial pairwise associations are entered and transformed in LSA resembles classical conditioning in that it depends on contiguity or co-occurrence, but weights the result first non-linearly with local co-occurrence frequency, then inversely with a function of the number of different contexts in which the particular component is encountered overall and the extent to which its occurrences are spread evenly over contexts. However, there are possibly important differences in the details as currently implemented; in particular, LSA associations are symmetrical: a context is associated with the individual events it contains by the same cell entry that associates those events with the context. This would not be a necessary feature of the model; it would be possible to make the initial matrix asymmetrical, with a cell indicating the association, for example, between a word and closely following words. Indeed, Lund & Burgess (1995, in press), and Schutze (1992), have explored related models in which such data are the input.

The first step of the LSA analysis is to transform each cell entry from the number of times that a word appeared in a particular context to the log of that frequency. This approximates the standard empirical growth functions of simple learning. The fact that this compressive function begins anew with each context also yields a kind of spacing effect; the association of A and B will be greater if both appear in two different contexts than if they each appear twice in the same context. In a second transformation each of these cell entries is divided by the entropy for the event type, −Σ p log p over all its contexts. Roughly speaking, this step accomplishes much the same thing as conditioning rules such as those described by Rescorla & Wagner (1972), in that it makes the association better represent the informative relation between the entities rather than the mere fact that they occurred together. Somewhat more formally, the inverse entropy measure estimates the degree to which observing the occurrence of a component specifies what context it is in; the larger the entropy of, say, a word, the less information its observation transmits about the places it has occurred, so the less usage-defined meaning it has, and conversely, the less a particular contextual occurrence tells about its meaning.
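
The two transformations can be sketched as follows. The tiny count rows are invented, and the natural logarithm is assumed (the paper's later description uses ln); the zero-entropy guard is an implementation detail added here, not part of the paper's formula.

```python
import math

def log_entropy(row):
    # row = one word type's counts across contexts
    total = sum(row)
    probs = [f / total for f in row if f > 0]
    entropy = -sum(p * math.log(p) for p in probs)   # -SUM p log p
    entropy = max(entropy, 1e-12)  # guard: a single-context word has zero entropy
    return [math.log(1 + f) / entropy for f in row]

uneven = log_entropy([2, 0, 1])  # word concentrated in few contexts
even = log_entropy([1, 1, 1])    # word spread evenly: maximal entropy
```

A word spread evenly over contexts has high entropy, so its transformed entries are damped relative to a word whose occurrences are concentrated and therefore informative about context.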

It is interesting to note that automatic information retrieval methods (including LSA when used for the purpose) are greatly improved by transformations of this general form, the present one usually appearing to be the best (Harman, 1986). It does not seem far-fetched to believe that the necessary transform for good information retrieval, retrieval that brings back text corresponding to what a person has in mind when the person offers one or more query words, corresponds to the functional relations in basic associative processes. Anderson (1990) has drawn attention to the analogy between information retrieval in external systems and those in the human mind. It is not clear which way the relationship goes. Does information retrieval in automatic systems work best when it mimics the circumstances that make people think two things are related, or is there a general logic that tends to make them have similar forms? In automatic information retrieval the logic is usually assumed to be that idealized searchers have in mind exactly the same text as they would like the system to find, and draw the words in their queries from that text (see Bookstein & Swanson, 1974). Then the system's challenge is to estimate the probability that each text in its store is the one that the searcher was thinking about. This characterization, then, comes full circle to the kind of communicative agreement model we outlined above; the sender issues a word chosen to express a meaning he or she has in mind, and the receiver tries to estimate the probability of each of the sender's possible messages.

Gallistel (1990) has argued persuasively for the need to separate local conditioning or associative processes from global representation of knowledge. The LSA model expresses such a separation in a very clear and precise way. The initial matrix after transformation to log frequency/entropy represents the product of the local or pairwise processes.3 The subsequent analysis and dimensionality reduction takes all of the previously acquired local information and turns it into a unified representation of knowledge.

Thus, the first processing step of the model, modulo its associational symmetry, is a rough approximation to a conditioning or associative process. However, the model's next steps, the singular value decomposition and dimensionality reduction, are not contained in any extant theory of learning, although something of the kind may be hinted at in some modern discussions of conditioning, and is latent in many neural net and spreading activation architectures. What this step does is to convert the transformed associative data into a condensed representation. The condensed representation can be seen as achieving several things, although they are at heart the result of only one mechanism. First, the re-representation captures indirect, higher-order associations. That is, if a particular stimulus, X, (e.g., a word) has been associated with some other stimulus, Y, by being frequently found in joint context (i.e., contiguity), and Y is associated with Z, then the condensation can cause X and Z to have similar representations. However, the strength of the indirect XZ association depends on much more than a combination of the strengths of XY and YZ. This is because the relation between X and Z also depends, in a well-specified manner, on the relation of each of the stimuli, X, Y and Z, to every other entity in the space. In the past, attempts to predict indirect associations by stepwise chaining rules have not been notably successful (see, e.g., Young, 1968; Pollio, 1968). If associations correspond to distances in space, as supposed by LSA, stepwise chaining rules would not be expected to work well; if X is two units from Y and Y is two units from Z, all we know about the distance from X to Z is that it must be between zero and four. But with data about the distances between X, Y, Z and other points, the estimate of XZ may be greatly improved by also knowing XY and YZ.
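
The X-Y-Z case can be demonstrated directly with an invented toy matrix: X never shares a context with Z, so their raw profiles are orthogonal, yet after dimension reduction via SVD they come out highly similar.

```python
import numpy as np

# rows: words X, Y, Z; columns: four contexts. X co-occurs with Y,
# and Y with Z, but X and Z never appear together.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],   # X
    [1.0, 1.0, 1.0, 1.0],   # Y
    [0.0, 0.0, 1.0, 1.0],   # Z
])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

raw_xz = cosine(A[0], A[2])        # 0.0: no direct co-occurrence

U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = U[:, :1] * s[:1]               # keep only the largest dimension
reduced_xz = cosine(W[0], W[2])    # near 1: an induced, indirect association
```

The induced similarity is not computed by chaining XY and YZ; it falls out of fitting all the pairwise relations into one low-dimensional space at once.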

An alternative view of LSA's effects is the one given earlier, the induction of a latent higher-order similarity structure (thus its name) among representations of a large collection of events. Imagine, for example, that every time a stimulus, e.g., a word, is encountered, the distance between its representation and that of every other stimulus that occurs in close proximity to it is adjusted to be slightly smaller. The adjustment is then allowed to percolate through the whole previously constructed structure of relations, each point pulling on its neighbors until all settle into a stable compromise configuration (physical objects, weather systems, and Hopfield nets do this too (Hopfield, 1982)). It is easy to see that the resulting relation between any two representations will depend not only on direct experience with them but also on everything else ever experienced. No single representation will be an island. Although the current mathematical implementation of LSA doesn't work in this incremental way, its effects are much the same. The question, then, is whether such a mechanism, when combined with the statistics of experience, will produce a faithful reflection of human knowledge.
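
The incremental picture can be caricatured in a few lines; the points, pairs, step size, and iteration count below are all invented, and, as the paragraph notes, LSA itself is computed in one batch rather than by such updates.

```python
import numpy as np

rng = np.random.default_rng(1)
pos = rng.normal(size=(4, 2))   # points A, B, C, D in a 2-D space
pairs = [(0, 1), (1, 2)]        # A-B and B-C co-occur; A-C never do

for _ in range(200):
    for i, j in pairs:
        step = 0.05 * (pos[j] - pos[i])  # pull each co-occurring pair together
        pos[i] += step
        pos[j] -= step

def dist(i, j):
    return float(np.linalg.norm(pos[i] - pos[j]))
```

After the configuration settles, A ends up near C through their shared neighbor B, while the never-encountered D stays wherever it started: no representation is an island.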

Finally, to anticipate what will be developed below, the computational scheme used by LSA for combining and condensing local information into a common representation captures multivariate correlational contingencies among all the events about which it has local knowledge. In a mathematically well-defined sense it optimizes the prediction of the presence of all other events from those currently identified in a given temporal context, and does so using all relevant information it has experienced.

Having thus cloaked the model in traditional memory and learning vestments, we will next reveal it as a bare mathematical formalism.

A Neural Net Analog of LSA

We will describe the matrix mathematics of Singular Value Decomposition used in LSA more fully, but still informally, shortly, and in somewhat greater detail in the appendix. But first, for those more familiar with neural net models, we offer a rough equivalent in that terminology. Conceptually, the LSA model can be viewed as a simple but rather large three-layer neural net. It has a layer-one node for every word type (event type) and a layer-three node for every text window (episode) ever encountered, several hundred layer-two nodes (the choice of number is presumed to be important), and complete connectivity between layers one and two and between layers two and three. (Obviously, one could substitute other identifications of the elements and episodes.) The network is symmetrical; it can be run in either direction. One finds an optimal large number of middle-layer nodes, then maximizes the accuracy (in a least squares sense) with which activating any layer-three node activates the layer-one nodes that are its elementary contents, and, simultaneously, vice versa. The conceptual representation of either kind of event, a unitary episode or a word, for example, is a pattern of activation across layer-two nodes. All activations and summations are linear.
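
In SVD terms, the two kinds of middle-layer patterns correspond to scaled rows of U (word types) and of V (episodes), and a new pattern of word-type activations can be projected, or "folded in," to the same space. The toy matrix below is invented, and the fold-in expression q^T U S^(-1) is the standard device from latent semantic indexing rather than something this passage spells out.

```python
import numpy as np

A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])  # 4 word types (rows) x 3 episodes (columns)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

word_vecs = U * s        # middle-layer coordinates for word types
episode_vecs = Vt.T * s  # middle-layer coordinates for episodes

# fold a pattern of word-type activations in as an "imaginary" episode
q = A[:, 0]              # here, an exact copy of episode 0's contents
folded = (q @ U) / s     # i.e. q^T U S^{-1}
```

Folding in a copy of an existing episode recovers that episode's coordinates exactly, which is the sense in which the network "runs in either direction."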

Note that the vector multiplication needed to generate the middle-layer activations from layer-three values is, in general, different from that needed to generate them from layer-one values; thus a different computation is required to assess the similarity between two episodes, two event types, or an event type and an episode, even though both kinds of entities can be represented as values in the same middle-layer space. Moreover, an event type or a set of event types could also be compared with another of the same kind, or with an episode or combination of episodes, by computing their activations on layer three. Thus the network can create artificial or "imaginary" episodes, and, by the inverse operations, episodes can generate "utterances" to represent themselves as patterns of event types with appropriately varying strengths. The same things are true in the equivalent Singular Value Decomposition matrix model of LSA.

The Singular Value Decomposition (SVD) Realization of LSA

The principal virtues of SVD for this work are that it embodies the kind of inductive mechanisms that we want to explore, that it provides a convenient way to vary dimensionality, and that it can fairly easily be applied to data of the amount and kind that a human learner encounters over many years of experience. Realized as a mathematical data analysis technique, however, the particular model studied should be considered only one case of a class of potential models that one would eventually wish to explore, a case which uses a very simplified parsing and representation of input, and makes use only of linear relations. In possible elaborations one might want to add features that make it more closely resemble what we know or think we know about the basic processes of perception, learning and memory. It is plausible that complicating the model appropriately might allow it to simulate phenomena to which it has not been applied and to which it currently seems unlikely to give a good account, for example certain aspects of grammar and syntax that involve ordered and hierarchical relations rather than unsigned distances. However, what is most interesting at this point is how much it does in its present form.

Singular Value Decomposition (SVD)

A brief overview of the mathematics of SVD is given in the appendix. For those who wish to skip it, we note that SVD is the general method for linear decomposition of a matrix into independent principal components, of which factor analysis is the special case for square matrices with the same entities as columns and rows. Factor analysis finds a parsimonious representation of all the intercorrelations between a set of variables in terms of a new set of variables, each of which is unrelated to any other but which can be combined to regenerate the original data. SVD does the same thing for an arbitrarily shaped rectangular matrix in which the columns and rows stand for different things; in the present case, one stands for words, the other for contexts in which the words appear. (For those with yet other vocabularies, SVD is a form of eigenvalue-eigenvector analysis or principal components decomposition and, in a more general sense, of multidimensional scaling. See Carroll & Arabie, in press.)
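
The decomposition itself is easy to check numerically: any rectangular matrix factors exactly into U·S·V^T, and it is the truncation of S that does the inductive work described above. The small matrix here is arbitrary.

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])        # any rectangular matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
reconstructed = (U * s) @ Vt           # U diag(S) V^T regenerates A exactly
```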

To implement the model concretely and simulate human word learning, SVD was used to analyze 4.6 million words of text taken from an electronic version of Grolier's Academic American Encyclopedia, a work intended for young students. This encyclopedia has 30,473 articles. From each article we took a sample consisting of (usually) the whole text or its first 2,000 characters, whichever was less, for a mean text sample length of 151 words, roughly the size of a rather long paragraph. The text data were cast into a matrix of 30,473 columns, each column representing one text sample, by 60,768 rows, each row representing a unique word type that appeared in at least two samples. The cells in the matrix contained the frequency with which a particular word type appeared in a particular text sample. The raw cell entries were first transformed to ln(1 + cell frequency) / entropy of the word over all contexts. This matrix was then submitted to SVD and the (for example) 300 most important dimensions were retained (those with the highest singular values, i.e., the ones that captured the greatest variance in the original matrix). The reduced dimensionality solution then generates a vector of 300 real values to represent each word. See Figure 2. The similarity of words was usually measured by the cosine between their vectors.4
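
The whole pipeline just described can be sketched end to end on an invented toy corpus, with k=2 standing in for the paper's roughly 300 retained dimensions; as in the paper, only word types appearing in at least two samples are kept.

```python
import numpy as np

docs = [
    "human machine interface computer",
    "computer user interface system",
    "system human system survey",
    "tree graph minors",
    "graph minors survey",
]
all_words = {w for d in docs for w in d.split()}
# keep only word types that appear in at least two samples
vocab = sorted(w for w in all_words
               if sum(w in d.split() for d in docs) >= 2)
A = np.array([[d.split().count(w) for d in docs] for w in vocab], float)

# transform each cell to ln(1 + frequency) / entropy of the word
for i, row in enumerate(A):
    p = row[row > 0] / row.sum()
    A[i] = np.log1p(row) / -(p * np.log(p)).sum()

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                      # the paper retained ~300 dimensions
W = U[:, :k] * s[:k]       # one k-dimensional vector per word

def sim(w1, w2):
    # similarity measured, as in the paper, by the cosine between vectors
    a, b = W[vocab.index(w1)], W[vocab.index(w2)]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(sim("graph", "minors"), 3))
```

Words with matching usage profiles, such as "graph" and "minors" here, come out maximally similar even in the reduced space.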

References

  1. Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Lawrence Erlbaum Associates.
  2. Anderson, R. C., & Freebody, P. (1981). Vocabulary Knowledge. In J. T. Guthrie (Eds.), Comprehension and Teaching: Research Reviews (pp. 77-117). Newark, DE: International Reading Association.
  3. Anderson, R. C., & Freebody, P. (1983). Reading comprehension and the assessment and acquisition of word knowledge. In B. Huston (Eds.), Advances in Reading/Language Research: A Research Annual. (pp. 231-256). Greenwich, CT: JAI Press.
  4. Anglin, J. M. (1993). Vocabulary development: A morphological analysis. Monographs of the Society for Research in Child Development., 58 (10, Serial No. 238).
  5. Anglin, J. M., Alexander, T. M., & Johnson, C. J. (1996). Word learning and the growth of potentially knowable vocabulary. Submitted for publication.
  6. Angluin, D., & Smith, C. H. (1983). Inductive inference: theory and methods. Computing Surveys, 15(3), 237-269.
  7. Berry, M. W. (1992). Large scale singular value computations. International Journal of Supercomputer Applications, 6(1), 13-49.
  8. Bickerton, D. (1981). Roots of language. Ann Arbor, MI: Karoma.
  9. Bookstein, A., & Swanson, D. R. (1974). Probabilistic models for automatic indexing. Journal of the American Association for Information Science, 25, 312-318.
  10. Carey, S. (1985). Conceptual change in childhood. Cambridge, MA.: MIT Press, Bradford Books.
  11. Carroll, J. B. (1971) Statistical analysis of the corpus. In J. B. Carroll, P. Davies, & B. Richman (Eds.), Word frequency book (pp. xxii-xl). New York: Houghton Mifflin Company & American Heritage Publishing Co.
  12. Carroll, J.D., & Arabie, P. (In press). Multidimensional scaling. In M. H. Birnbaum (Ed.), Handbook of perception and cognition, Volume 3: Measurement, judgment and decision making. San Diego, CA: Academic Press.
  13. Carver, R. P. (1990). Reading rate: A review of research and theory. San Diego CA: Academic Press.
  14. Charness, N. (1991). Expertise in chess: The balance between knowledge and search. In K. A. Ericsson & J. Smith (Eds.), Toward a general theory of expertise. Cambridge, England: Cambridge University Press.
  15. Chomsky, N. (1991). Linguistics and cognitive science: Problems and mysteries. In A. Kasher (Eds.), The Chomskyan turn. Cambridge, MA.: Blackwell.
  16. Choueka, Y., & Lusignan, S. (1985). Disambiguation by short contexts. Computers and the Humanities, 19, 147-157.
  17. Christie, Agatha (1942) The moving finger, London, Dodd, Mead
  18. Church, K. W., & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics, 16, 22-29.
  19. Clark, E. V. (1987). The principle of contrast: A constraint on language acquisition. In B. MacWhinney (Ed.), Mechanisms of language acquisition Hillsdale, NJ: Erlbaum.
  20. Coombs, C. H. (1964). A theory of data. New York: Wiley.
  21. Dahl, H. (1979). Word frequencies of spoken American English. Essex, CT: Verbatim.
  22. D'Andrade, R. G. (1993). Cultural cognition. In M. I. Posner (Eds.), Foundations of cognitive science. Cambridge, MA: MIT Press.
  23. Davies, P. (1971). New views of lexicon. In J. B. Carroll, P. Davies, & B. Richman (Eds.), Word frequency book (pp. xli-liv). New York: Houghton Mifflin Company & American Heritage Publishing Co.
  24. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society For Information Science, 41(6).
  25. Deese, J. (1965). The structure of associations in language and thought. Baltimore: Johns Hopkins Press.
      Donald, M. (1991). Origins of the modern mind. Cambridge, MA: Harvard University Press.
  26. Drum, P. A., & Konopak, B. C. (1987). Learning word meaning from written context. In M. C. McKeown & M. E. Curtis (Eds.), The nature of vocabulary acquisition. (pp. 73-87). Hillsdale, NJ: Lawrence Erlbaum Associates.
  27. Dumais, S. T. (1994). Latent semantic indexing (LSI) and TREC-2. In D. Harman (Ed.), National Institute of Standards and Technology Text Retrieval Conference, NIST special publication.
  28. Durkin, D. (1979). What classroom observations reveal about reading comprehension instruction. Reading Research Quarterly, 14, 481-533.
  29. Durkin, D. (1983). Teaching them to read. Boston: Allyn and Bacon.
  30. Elley, W. B. (1989). Vocabulary acquisition from listening to stories. Reading Research Quarterly, 24, 174-187.
  31. Ericsson, K. A., & Smith, J. (1991). Prospects and limits of the empirical study of expertise: an introduction. In K. A. Ericsson & J. Smith (Eds.), Toward a general theory of expertise. (pp. 1-38). Cambridge, England: Cambridge University Press.
  32. Ericsson, A., & Kintsch, W. (1995). Long-term working memory. Psychological Review.
  33. Fillenbaum, S., & Rapoport, A. (1971). Structures in the subjective lexicon. New York: Academic Press.
  34. Foltz, P. W., Kintsch, W., & Landauer, T. K. (1993, Jan). An analysis of textual coherence using Latent Semantic Indexing. Paper presented at the meeting of the Society for Text and Discourse, Jackson, WY.
  35. Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1983). Statistical semantics: Analysis of the potential performance of key-word information systems. The Bell System Technical Journal, 62, 1753-1804.
  36. Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 30 (11), 964-971.
  37. Gallistel, C. R. (1990). The organization of learning. Cambridge, MA.: MIT Press.
  38. Gernsbacher, M. A. (1990). Language comprehension as structure building. Hillsdale, NJ: Lawrence Erlbaum Associates.
  39. Golub, G. H., Luk, F. T., & Overton, M. L. (1981). A block Lanczos method for computing the singular values and corresponding singular vectors of a matrix. ACM Transactions on Mathematical Software, 7, 149-169.
  40. Goodman, N. (1972). Problems and projects. Indianapolis: Bobbs-Merrill.
      Grefenstette, G. (1994). Explorations in automatic thesaurus discovery. Boston: Kluwer Academic Press.
  41. Harman, D. (1986). An experimental study of the factors important in document ranking. In Association for Computing Machinery Conference on Research and Development in Information Retrieval, New York: Association for Computing Machinery.
  42. Hewes, G. W. (1974). Gesture language in culture contact. Sign Language Studies, 4 (1), 1-34.
  43. Hewes, G. W. (1994). The gestural origin of language and new neurological data. In J. Wind, A. Jonker, R. Allott, & L. Rolfe (Eds.), Studies in language origins, Volume 3 (pp. 294-307). Amsterdam: John Benjamins.
  44. Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: processes of inference, learning, and discovery. Cambridge, MA.: MIT Press.
  45. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA., 79, 2554-2558.
  46. Jackendoff, R. S. (1992). Languages of the mind. Cambridge, MA: MIT Press.
  47. Jenkins, J. R., Stein, M. L., & Wysocki, K. (1984). Learning vocabulary through reading. American Educational Research Journal, 21(4), 767-787.
  48. Keil, F. C. (1989). Concepts, kinds and cognitive development. Cambridge, MA.: MIT Press.
  49. Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95, 163-182.
  50. Kintsch, W., & Vipond, D. (1979). Reading comprehension and reading ability in educational practice and psychological theory. In L. G. Nilsson (Eds.), Perspectives of memory research (pp. 325-366). Hillsdale, NJ.: Erlbaum.
  51. Kucera, H., & Francis, W. N. (1967). Computational analysis of present-day English. Providence, RI: Brown University Press.
  52. Landauer, T. K. (1986). How much do people remember: some estimates of the quantity of learned information in long-term memory. Cognitive Science, 10 (4), 477-493.
  53. Landauer, T. K., & Dumais, S. T. (In press). How come you know so much? From practical problem to theory. In D. Hermann, C. Hertzog, C. McEvoy, P. Hertel, & M. Johnson (Eds.), Basic and Applied Memory: Memory in Context. Englewood Cliffs, NJ: Lawrence Erlbaum Associates.
  55. Levy, E., & Nelson, K. (1994). Words in discourse: a dialectical approach to the acquisition of meaning and use. Journal of Child Language, 21, 367-389.
  56. Lucy, J., & Shweder, R. (1979). Whorf and his critics: Linguistic and non-linguistic influences on color memory. American Anthropologist, 81, 113-128.
  57. Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and associative priming in high-dimensional semantic space. In J. D. Moore & J. F. Lehman (Ed.), Proceedings of the 17th annual meeting of the Cognitive Science Society, (pp. 660-665). Pittsburgh, PA: Lawrence Erlbaum Associates.
  58. Lund, K., & Burgess, C. (in press). Hyperspace analog to language (HAL): A general model of semantic representation (abstract). Brain and Cognition.
  59. McNamara, D. S., Kintsch, E., Butler-Songer, N., & Kintsch, W. (1993). Text coherence, background knowledge, and levels of understanding in learning from text. Manuscript submitted for publication.
  60. Markman, E. M. (1994). Constraints on word meaning in early language acquisition. Lingua, 92, 199-227.
  61. Marr, D. (1982). Vision. San Francisco: Freeman.
  62. Medin, D. L., Goldstone, R. L., & Gentner, D. (1993). Respects for similarity. Psychological Review, 100, 254-278.
  63. Michalski, R. (1983). A theory and methodology of inductive learning. Artificial Intelligence, 20, 111-161.
  64. Miller, G. A. (1978). Semantic relations among words. In M. Halle, J. Bresnan, & G. A. Miller (Eds.), Linguistic theory and psychological reality. (pp. 60-118). Cambridge, MA: MIT Press.
  65. Miller, G. A. (1991). The science of words. New York: Scientific American Library.
  66. Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.
  67. Moyer, R. S., & Landauer, T. K. (1967). The time required for judgements of numerical inequality. Nature, 216, 159-160.
  68. Nagy, W., & Anderson, R. (1984). The number of words in printed school English. Reading Research Quarterly, 19, 304-330.
  69. Nagy, W., Herman, P., & Anderson, R. (1985). Learning words from context. Reading Research Quarterly, 20, 223-253.
  70. Nagy, W. E., & Herman, P. A. (1987). Breadth and depth of vocabulary knowledge: Implications for acquisition and instruction. In M. C. McKeown & M. E. Curtis (Eds.), The nature of vocabulary acquisition. (pp. 19- 35). Hillsdale, NJ: Erlbaum.
  71. Quine, W. V. O. (1960). Word and object. Cambridge, MA: MIT Press.
  72. Osgood, C. E. (1971). Exploration in semantic space: A personal diary. Journal of Social Issues, 27, 5-64.
  73. Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning. Urbana, IL: University of Illinois Press.
  74. Osherson, D. N., Weinstein, S., & Stob, M. (1986). Systems that learn: An introduction to learning theory for cognitive and computer scientists. Cambridge, MA.: MIT Press.
  75. Pinker, S. (1990). The bootstrapping problem in language acquisition. In B. MacWhinney (Eds.), Mechanisms of Language Acquisition Hillsdale, NJ: Lawrence Erlbaum
  76. Pinker, S. (1994). The language instinct: how the mind creates language. New York, NY: William Morrow and Co.
  77. Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193.
  78. Pollio, H. R. (1968). Associative structure and verbal behavior. In T. R. Dixon & D. L. Horton (Eds.), Verbal behavior and general behavior theory. (pp. 37-66). Englewood Cliffs, NJ: Prentice-Hall.
  79. Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology., 77, 353-363.
  80. Rapoport, A., & Fillenbaum, S. (1972). An experimental study of semantic structure. In A. K. Romney, R. N. Shepard, & S. B. Nerlove (Eds.), Multidimensional scaling: Theory and applications in the behavioral sciences. New York: Seminar Press.
  81. Ratcliff, R., & McKoon, G. (1978). Priming in item recognition: Evidence for the propositional nature of sentences. Journal of Verbal Learning and Verbal Behavior, 17, 403-417.
  82. Rayner, K., Pacht, J. M., & Duffy, S. A. (1994). Effects of prior encounter and global discourse bias on the processing of lexically ambiguous words: evidence from eye fixations. Journal of Memory and Language, 33, 527-544.
  83. Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II. New York: Appleton-Century-Crofts.
  84. Rosch, E. (1978). Principles of categorization. In E. Rosch & B. B. Lloyd (Eds.), Cognition and categorization. Hillsdale, NJ: Erlbaum.
      Seashore, R. H. (1947). How many words do children know? The Packet, II, 3-17.
  85. Schutze, H. (1992). Context space. In Fall Symposium on probability and natural language. Cambridge, MA.: American Association for Artificial Intelligence.
  86. Shepard, R. N. (1987). Towards a universal law of generalization for psychological science. Science, 237, 1317-1323.
  87. Slobin, D. (1982). Universal and particular in the acquisition of language. In E. Wanner & L. R. Gleitman (Eds.), Language acquisition: The state of the art. Cambridge, Eng. Cambridge University Press.
  88. Smith, E. E. & Medin, D. L. (1981) Categories and concepts. Cambridge, MA: Harvard University Press.
  89. Smith, M. (1941). Measurement of the size of general English vocabulary through the elementary grades and high school. Genetic Psychology Monographs, 24, 311-345.
  90. Sternberg, R. J. (1987). Most vocabulary is learned from context. In M. G. McKeown & M. E. Curtis (Eds.), The Nature of Vocabulary Acquisition. (pp. 89-106). Hillsdale, NJ: Lawrence Erlbaum Associates.
  91. Swinney, D. A. (1979). Lexical access during sentence comprehension: (Re)consideration of context effects. Journal of Verbal Learning and Verbal Behavior, 18, 645-659.
  92. Taylor, B. M., Frye, B. J., & Maruyama, G. M. (1990). Time spent reading and reading growth. American Educational Research Journal, 27(2), 351-362.
  93. Till, R. E., Mross, E. F., & Kintsch, W. (1988). Time course of priming for associate and inference words in discourse context. Memory and Cognition, 16, 283-299.
  94. Tversky, A. (1977). Features of similarity. Psychological Review, 84, 327-352.
  95. Tversky, A., & Gati, I. (1978). Studies of similarity. In E. Rosch & B. Lloyd (Eds.), Cognition and categorization (pp. 79-98). Hillsdale, NJ: Lawrence Erlbaum Associates.
  96. Van Dijk, T. A., & Kintsch, W. (1983). Strategies of discourse comprehension. New York: Academic Press.
      Vygotsky, L. S. (1968). Thought and language (1934) (A. Kozulin, Trans.). Cambridge, MA: The MIT Press.
  97. Walker, D. E., & Amsler, R. A. (1986). The use of machine-readable dictionaries in sublanguage analysis. In R. Grisham (Eds.), Analyzing languages in restricted domains: Sublanguage description and processing. Hillsdale, N.J.: Lawrence Erlbaum Associates.
  98. Webster's third new international dictionary of the English language unabridged. (1964) G. & C. Merriam Company, Publishers, Springfield, MA.
  99. Young, R. K. (1968). Serial learning. In T. R. Dixon & D. L. Horton (Eds.), Verbal behavior and general behavior theory. (pp. 122-148). Englewood Cliffs, NJ: Prentice-Hall.


Landauer, T. K., & Dumais, S. T. (1997). A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. doi:10.1037/0033-295X.104.2.211