1999 FoundationsOfStatisticalNLP
- (Manning & Schütze, 1999) ⇒ Christopher D. Manning, and Hinrich Schütze. (1999). “Foundations of Statistical Natural Language Processing.” In: The MIT Press. ISBN:0262133601
Subject Headings: Statistical NLP, Word Sense Disambiguation, Collocation, Lexical Acquisition, Markov Model, Part-of-Speech Tagging, Probabilistic Context Free Grammar, Probabilistic Parsing, Machine Translation, Information Retrieval, Text Categorization.
Notes
- Book website http://nlp.stanford.edu/fsnlp/
- It was a popular NLP Textbook.
Cited By
Quotes
Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.
Table of Contents
- I Preliminaries
- 1 Introduction
- 1.1 Rationalist and Empiricist Approaches to Language
- 1.2 Scientific Content
- 1.3 The Ambiguity of Language: Why NLP Is Difficult
- 1.4 Dirty Hands
- 1.5 Further Reading
- 1.6 Exercises
- 2 Mathematical Foundations
- 2.1 Elementary Probability Theory
- 2.2 Essential Information Theory
- 2.3 Further Reading
- 3 Linguistic Essentials
- 3.1 Parts of Speech and Morphology
- 3.2 Phrase Structure
- 3.3 Semantics and Pragmatics
- 3.4 Other Areas
- 3.5 Further Reading
- 3.6 Exercises
- 4 Corpus-Based Work
- 4.1 Getting Set Up
- 4.2 Looking at Text
- 4.3 Marked-Up Data
- 4.4 Further Reading
- 4.5 Exercises
- II Words
- 5 Collocations
- 5.1 Frequency
- 5.2 Mean and Variance
- 5.3 Hypothesis Testing
- 5.4 Mutual Information
- 5.5 The Notion of Collocation
- 5.6 Further Reading
- 6 Statistical Inference: n-gram Models over Sparse Data
- 6.1 Bins: Forming Equivalence Classes
- 6.2 Statistical Estimators
- 6.3 Combining Estimators
- 6.4 Conclusions
- 6.5 Further Reading
- 6.6 Exercises
- 7 Word Sense Disambiguation
- 7.1 Methodological Preliminaries
- 7.2 Supervised Disambiguation
- 7.3 Dictionary-Based Disambiguation
- 7.4 Unsupervised Disambiguation
- 7.5 What Is a Word Sense?
- 7.6 Further Reading
- 7.7 Exercises
- 8 Lexical Acquisition
- 8.1 Evaluation Measures
- 8.2 Verb Subcategorization
- 8.3 Attachment Ambiguity
- 8.4 Selectional Preferences
- 8.5 Semantic Similarity
- 8.6 The Role of Lexical Acquisition in Statistical NLP
- 8.7 Further Reading
- III Grammar
- 9 Markov Models
- 9.1 Markov Models
- 9.2 Hidden Markov Models
- 9.3 The Three Fundamental Questions for HMMs
- 9.4 HMMs: Implementation, Properties, and Variants
- 9.5 Further Reading
- 10 Part-of-Speech Tagging
- 10.1 The Information Sources in Tagging
- 10.2 Markov Model Taggers
- 10.3 Hidden Markov Model Taggers
- 10.4 Transformation-Based Learning of Tags
- 10.5 Other Methods, Other Languages
- 10.6 Tagging Accuracy and Uses of Taggers
- 10.7 Further Reading
- 10.8 Exercises
- 11 Probabilistic Context Free Grammars
- 11.1 Some Features of PCFGs
- 11.2 Questions for PCFGs
- 11.3 The Probability of a String
- 11.4 Problems with the Inside-Outside Algorithm
- 11.5 Further Reading
- 11.6 Exercises
- 12 Probabilistic Parsing
- 12.1 Some Concepts
- 12.2 Some Approaches
- 12.3 Further Reading
- 12.4 Exercises
- IV Applications and Techniques
- 13 Statistical Alignment and Machine Translation
- 13.1 Text Alignment
- 13.2 Word Alignment
- 13.3 Statistical Machine Translation
- 13.4 Further Reading
- 14 Clustering
- 14.1 Hierarchical Clustering
- 14.2 Non-Hierarchical Clustering
- 14.3 Further Reading
- 14.4 Exercises
- 15 Topics in Information Retrieval
- 15.1 Some Background on Information Retrieval
- 15.2 The Vector Space Model
- 15.3 Term Distribution Models
- 15.4 Latent Semantic Indexing
- 15.5 Discourse Segmentation
- 15.6 Further Reading
- 15.7 Exercises
- 16 Text Categorization
- 16.1 Decision Trees
- 16.2 Maximum Entropy Modeling
- 16.3 Perceptrons
- 16.4 k Nearest Neighbor Classification
- 16.5 Further Reading
3. Linguistic Essentials
3.1 Parts of Speech and Morphology
Linguists group the words of a language into classes (sets) which show similar syntactic behavior, and often a typical semantic type. These word classes are otherwise called syntactic or grammatical categories, but more commonly still by the traditional name parts of speech (POS).
Normally the various parts of speech for a word are listed in an online dictionary, otherwise known as a lexicon.
Word categories are systematically related by morphological processes such as the formation of the plural form (dog-s) from the singular form of the noun (dog). Morphology is important in NLP because language is productive: in any given text we will encounter words and word forms that we haven't seen before and that are not in our precompiled dictionary. Many of these new words are morphologically related to known words. So if we understand morphological processes, we can infer a lot about the syntactic and semantic properties of new words.
The major types of morphological processes are inflection, derivation, and compounding. Inflections are the systematic modifications of a root form by means of prefixes and suffixes to indicate grammatical distinctions like singular and plural. Inflection does not change word class or meaning significantly, but varies features such as tense, number, and plurality. All the inflectional forms of a word are often grouped as manifestations of a single lexeme.
Derivation is less systematic. It usually results in a more radical change of syntactic category, and it often involves a change in meaning. An example is the derivation of the adverb widely from the adjective wide (by appending the suffix -ly). Widely in a phrase like it is widely believed means among a large, well-dispersed group of people, a shift from the core meaning of wide (extending over a vast area). Adverb formation is also less systematic than plural inflection. Some adjectives like old or difficult don't have adverbs: *oldly and *difficultly are not words of English. Here are some other examples of derivations: the suffix -en transforms adjectives into verbs (weak-en, soft-en), the suffix -able transforms verbs into adjectives (understand-able, accept-able), and the suffix -er transforms verbs into nouns (teach-er, lead-er).
Compounding refers to the merging of two or more words into a new word. English has many noun-noun compounds, nouns that are combinations of two other nouns. Examples are tea kettle, disk drive, or college degree. While these are (usually) written as separate words, they are pronounced as a single word, and denote a single semantic concept, which one would normally wish to list in the lexicon. There are also other compounds that involve parts of speech such as adjectives, verbs, and prepositions, such as downmarket, (to) overtake, and mad cow disease.
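The productivity argument above lends itself to a simple illustration. The following is a minimal sketch (not code from the book) of how suffix rules can relate an unknown word form to a known stem and suggest its likely part of speech; the rule list, the function name, and the tiny lexicon are illustrative assumptions only.
```python
# Minimal sketch: guess properties of an unknown word from its suffix and a known stem.
# SUFFIX_RULES and the tiny lexicon are illustrative assumptions, not from the book.
SUFFIX_RULES = [
    ("ly", "adverb, likely derived from an adjective"),
    ("able", "adjective, likely derived from a verb"),
    ("er", "noun, possibly an agent noun derived from a verb"),
    ("en", "verb, possibly derived from an adjective"),
    ("s", "plural noun or third-person-singular verb (inflection)"),
]

def guess_by_morphology(word, lexicon):
    """Return hypotheses about an unseen word based on known stems plus suffixes."""
    hypotheses = []
    for suffix, description in SUFFIX_RULES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in lexicon:
            hypotheses.append(f"{word} = {stem} + -{suffix}: {description}")
    return hypotheses

lexicon = {"wide", "weak", "understand", "teach", "dog"}
for w in ["widely", "weaken", "understandable", "teacher", "dogs"]:
    print(guess_by_morphology(w, lexicon))
```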
…
Tokenization
Hyphenation: Different forms representing the same word.
Do sequences of letters with a hyphen in between count as one word or two? Again, the intuitive answer seems to be sometimes one, sometimes two. This reflects the many sources of hyphens in texts.
One source is typographical. Words have traditionally been broken and hyphens inserted to improve justification of text.
Some things with hyphens are clearly best treated as single words, such as e-mail or co-operate, or A-1-plus (as in A-1-plus commercial paper, a financial rating). Other cases are more arguable, although we usually want to regard them as a single word, for example, non-lawyer, pro-Arab, and so-called. The hyphens here might be termed lexical hyphens. They are commonly inserted before or after small word formatives, sometimes for the purpose of splitting up vowel sequences.
The third class of hyphens is ones inserted to help indicate the correct grouping of words. A common copy-editing practice is to hyphenate compound pre-modifiers, as in the examples earlier in this section or in examples like these:
- 4.1a the once-quiet study of superconductivity.
- 4.1b a tough regime of business-conduct rules
- 4.1c the aluminum-export ban.
- 4.1d a text-based medium.
And hyphens occur in other places, where a phrase is seen as in some sense quotative or as expressing a quantity or rate:
- 4.2a the idea of a child-as-required-yuppie-possession must be motivating them.
- 4.2b a final "take-it-or-leave-it" offer.
- 4.2c the 90-cent-an-hour raise.
- 4.2d the 26-year-old
In these cases, we would probably want to treat the things joined by hyphens as separate words.
Note that this means that we will often have multiple forms, perhaps some treated as one word and others as two, for what is best thought of as a single lexeme (a single dictionary entry with a single meaning).
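One way to make the hyphenation issue concrete is the following minimal tokenizer sketch (an assumption for illustration, not the book's algorithm): hyphenated forms from a small whitelist of lexical-hyphen words are kept as single tokens, while other hyphenated sequences are split into their parts.
```python
import re

# Sketch of one possible tokenization policy (an assumption, not the book's algorithm):
# keep hyphenated forms from a small whitelist of "lexical hyphen" words as single tokens,
# and split other hyphenated sequences into their component words.
LEXICAL_HYPHEN_WORDS = {"e-mail", "co-operate", "non-lawyer", "so-called"}  # illustrative list

def tokenize(text):
    tokens = []
    for tok in re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text):
        if "-" in tok and tok.lower() not in LEXICAL_HYPHEN_WORDS:
            tokens.extend(tok.split("-"))   # grouping hyphen: treat the parts as separate words
        else:
            tokens.append(tok)              # plain word, or a lexical-hyphen word kept whole
    return tokens

print(tokenize("He sent an e-mail about the once-quiet, text-based medium."))
# ['He', 'sent', 'an', 'e-mail', 'about', 'the', 'once', 'quiet', 'text', 'based', 'medium']
```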
Word segmentation in other languages
Many languages do not put spaces in between words at all, and so the basic word division algorithm of breaking on whitespace is of no use at all. Such languages include major East-Asian languages/scripts such as Chinese, Japanese, and Thai. Ancient Greek was also written by Ancient Greeks without word spaces. Spaces were introduced (together with accent marks, etc.) by those who came afterwards. In such languages, word segmentation is a much more major and challenging task.
While maintaining most word spaces, German writes compound nouns as single words, for example Lebensversicherungsgesellschaftsangestellter 'life insurance company employee.' In many ways this makes linguistic sense, as compounds are a single word, at least phonologically. But for processing purposes one may wish to divide such a compound, or at least to be aware of the internal structure of the word, and this becomes a limited word segmentation task. While not the rule, joining of compounds sometimes also happens in English, especially when they are common and have a specialized meaning. We noted above that one finds both data base and database. As another example, while hard disk is more common, one sometimes finds harddisk in the computer press.
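As a rough illustration of what such a segmentation task involves, here is a sketch of greedy maximum matching, a common simple baseline for segmenting text written without word spaces; the toy dictionary and examples are assumptions made for this sketch, not taken from the book.
```python
# Sketch of greedy maximum matching (a simple segmentation baseline, not the book's method).
# The toy dictionary is an illustrative assumption.
def max_match(text, dictionary, max_word_len=25):
    """Greedily take the longest dictionary word starting at each position."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:      # nothing matched: back off to a single character
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

dictionary = {"hard", "disk", "drive", "data", "base", "database"}
print(max_match("harddiskdrive", dictionary))  # ['hard', 'disk', 'drive']
print(max_match("database", dictionary))       # ['database']; the longest match wins over 'data' + 'base'
```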
Whitespace not indicating a word break
Until now, the problems we have dealt with have mainly involved splitting apart sequences of characters where the word divisions are not shown by whitespace. But the opposite problem of wanting to lump things together also occurs: things are separated by whitespace but we may wish to regard them as a single word. One possible case is the reverse of the German compound problem. If one decides to treat database as one word, one may wish to treat it as one word even when it is written as data base. More common cases are things such as phone numbers, where we may wish to regard 9465 1873 as a single 'word,' or in the case of multi-part names such as New York or San Francisco. An especially difficult case is when this problem interacts with hyphenation, as in a phrase like this one:
- (4.3) the New York-New Haven railroad.
Here the hyphen does not express grouping of just the immediately adjacent graphic words: treating York-New as a semantic unit would be a big mistake.
Other cases are of more linguistic interest. For many purposes, one would want to regard phrasal verbs (make up, work out) as single lexemes (section 3.1.4), but this case is especially tricky since in many cases the particle is separable from the verb (I couldn't work the answer out), and so in general identification of possible phrasal verbs will have to be left to subsequent processing. One might also want to treat as single lexemes certain other fixed phrases, such as in spite of, in order to, and because of, but typically a tokenizer will regard them as separate words. A partial implementation of this approach occurs in the LOB corpus, where certain pairs of words such as because of are tagged with a single part of speech, here preposition, by means of so-called ditto tags.
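A minimal sketch of the lumping step described above, assuming a small hand-listed set of multiword expressions (the list and the function name are illustrative, not from the book):
```python
# Sketch: re-join whitespace-separated tokens that form known multi-part lexemes.
# The multiword list and function name are illustrative assumptions, not from the book.
MULTIWORDS = {("New", "York"), ("San", "Francisco"), ("because", "of"), ("in", "spite", "of")}
MAX_LEN = max(len(m) for m in MULTIWORDS)

def merge_multiwords(tokens):
    merged, i = [], 0
    while i < len(tokens):
        for n in range(MAX_LEN, 1, -1):            # prefer the longest expression
            if tuple(tokens[i:i + n]) in MULTIWORDS:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:                                      # no multiword expression starts here
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_multiwords("He moved to New York because of the job".split()))
# ['He', 'moved', 'to', 'New York', 'because of', 'the', 'job']
```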
Morphology
Another question is whether one wants to keep word forms like sit, sits, and sat separate or to collapse them. The issues here are similar to those in the discussion of capitalization, but have traditionally been regarded as more linguistically interesting. At first, grouping such forms together and working in terms of lexemes feels as if it is the right thing to do. Doing this is usually referred to in the literature as stemming, in reference to a process that strips off affixes and leaves you with a stem. Alternatively, one may want to find the lemma or lexeme of which one is looking at an inflected form. These latter terms imply disambiguation at the level of lexemes, such as whether a use of lying represents the verb lie-lay 'to prostrate oneself' or lie-lied 'to fib.'
Extensive empirical research within the Information Retrieval (IR) community has shown that doing stemming does not help the performance of classic IR systems when performance is measured as an average over queries (Salton 1989; Hull 1996). There are always some queries for which stemming helps a lot. But there are others where performance goes down. This is a somewhat surprising result, especially from the viewpoint of linguistic intuition, and so it is important to understand why that is. There are three main reasons for this.
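For concreteness, here is a small stemming illustration using NLTK's Porter stemmer; the use of NLTK and the example word list are assumptions made for this sketch and are not part of the book.
```python
# Illustration of suffix-stripping stemming using NLTK's Porter stemmer.
# Assumes the nltk package is installed; the word list is just an example.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["sits", "sitting", "sat", "operate", "operating", "operational"]:
    print(word, "->", stemmer.stem(word))
# Note that an irregular form like 'sat' is not reduced to 'sit' by suffix stripping,
# and unrelated words can be conflated to the same stem, which is one reason stemming
# does not reliably improve average retrieval performance.
```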
Collocations
A COLLOCATION is an expression consisting of two or more words that correspond to some conventional way of saying things. Or in the words of Firth (1957: 181): "Collocations of a given word are statements of the habitual or customary places of that word." Collocations include noun phrases like strong tea and weapons of mass destruction, phrasal verbs like to make up, and other stock phrases like the rich and powerful. Particularly interesting are the subtle and not-easily-explainable patterns of word usage that native speakers all know: why we say a stiff breeze but not ??a stiff wind (while either a strong breeze or a strong wind is okay), or why we speak of broad daylight (but not ?bright daylight or ??narrow darkness).
Collocations are characterized by limited compositionality. We call a natural language expression compositional if the meaning of the expression can be predicted from the meaning of the parts. Collocations are not fully compositional in that there is usually an element of meaning added to the combination. In the case of strong tea, strong has acquired the meaning rich in some active agent, which is closely related to, but slightly different from, the basic sense having great physical strength. Idioms are the most extreme examples of non-compositionality. Idioms like to kick the bucket or to hear it through the grapevine only have an indirect historical relationship to the meanings of the parts of the expression. We are not talking about buckets or grapevines literally when we use these idioms.
There is considerable overlap between the concept of collocation and notions like term, technical term, and terminological phrase. As these names suggest, the latter three are commonly used when collocations are extracted from technical domains (in a process called terminology extraction). The reader should be warned, though, that the word term has a different meaning in information retrieval. There, it refers to both words and phrases. So it subsumes the more narrow meaning that we will use in this chapter.
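Chapter 5 discusses frequency-, variance-, hypothesis-testing-, and mutual-information-based measures for finding collocations. The sketch below scores candidate bigrams with pointwise mutual information (cf. section 5.4); the toy corpus is an illustrative assumption, not data from the book.
```python
import math
from collections import Counter

# Sketch of pointwise mutual information (cf. section 5.4) for scoring candidate
# bigram collocations; the toy corpus below is an illustrative assumption.
corpus = ("strong tea is popular and strong tea with milk is also popular "
          "while powerful tea sounds odd and powerful computers are common").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def pmi(w1, w2):
    """log2 of the observed bigram probability over the probability expected under independence."""
    p_xy = bigrams[(w1, w2)] / (N - 1)
    p_x, p_y = unigrams[w1] / N, unigrams[w2] / N
    return math.log2(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

for pair in [("strong", "tea"), ("powerful", "tea"), ("powerful", "computers")]:
    print(pair, round(pmi(*pair), 2))
```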
8 Lexical Acquisition
8.1 Evaluation Measures
8.2 Verb Subcategorization
8.3 Attachment Ambiguity
8.4 Selectional Preferences
8.5 Semantic Similarity
8.5.2 Probabilistic Measures
... (Dis-)similarity measure … information radius (IRad) … [math]\displaystyle{ D(p \vert\vert \frac{p+q}{2}) + D(q \vert\vert \frac{p+q}{2}) }[/math] …
- KL Divergence
We are already familiar with …
- Information Radius
The second measure in table 8.9, information radius (or total divergence to the average, as Dagan et al. (1997b) call it), overcomes both these problems. It is symmetric ([math]\displaystyle{ \operatorname{IRad}(p,q) = \operatorname{IRad}(q,p) }[/math]) and there is no problem with infinite values since [math]\displaystyle{ \frac{p_i+q_i}{2} \ne 0 }[/math] if either [math]\displaystyle{ p_i \ne 0 }[/math] or [math]\displaystyle{ q_i \ne 0 }[/math]. The intuitive interpretation of IRad is that it answers the question: How much information is lost if we describe the two words (or random variables in the general case) that correspond to [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] with their average distribution? IRad ranges from 0 for identical distributions to [math]\displaystyle{ 2\log 2 }[/math] for maximally different distributions (see exercise 8.25). As usual we assume [math]\displaystyle{ 0\log 0 = 0 }[/math]. (A short computational sketch of IRad appears after this list.)
- [math]\displaystyle{ L_1 }[/math] Norm (Manhattan Norm)
A third …
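A minimal computational sketch of the information radius, following the definition quoted above (base-2 logarithms, so the maximum value [math]\displaystyle{ 2\log 2 }[/math] comes out as 2 bits); the example distributions are illustrative assumptions.
```python
import math

# Sketch of the information radius (total divergence to the average) for two discrete
# distributions p and q, following the definition quoted above; base-2 logarithms.
def kl(p, q):
    """KL divergence D(p || q), using the convention 0 log 0 = 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def irad(p, q):
    avg = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # the average distribution (p + q) / 2
    return kl(p, avg) + kl(q, avg)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.0, 1.0]
print(irad(p, q))   # maximally different distributions: 2 log 2 = 2.0 bits
print(irad(p, p))   # identical distributions: 0.0
```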
8.6 The Role of Lexical Acquisition in Statistical NLP
References
- Christopher D. Manning, and Hinrich Schütze. (1999). "Foundations of Statistical Natural Language Processing." The MIT Press. ISBN:0262133601. http://books.google.com/books?id=YiFDxbEX3SUC