2002 MultiwordExpressions
- (Sag et al., 2002) ⇒ Ivan A. Sag, T. Baldwin, F. Bond, Ann Copestake, and Dan Flickinger. (2002). “Multiword Expressions: A pain in the neck for NLP.” In: Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002).
Subject Headings: Multiword Expression, Compound Word.
Notes
- It defines Multiword Expressions (MWEs) as “idiosyncratic interpretations that cross word boundaries (or spaces)”.
- It classifies MWEs into Lexicalized Phrases and Institutionalized Phrases, with terminology adapted from (Bauer, 1983).
Quotes
Abstract
Multiword expressions are a key problem for the development of large-scale, linguistically sound natural language processing technology. This paper surveys the problem and some currently available analytic techniques. The various kinds of multiword expressions should be analyzed in distinct ways, including listing “words with spaces”, hierarchically organized lexicons, restricted combinatoric rules, lexical selection, “idiomatic constructions” and simple statistical affinity. An adequate comprehensive analysis of multiword expressions must employ both symbolic and statistical techniques.
The research reported here was conducted in part under the auspices of the LinGO project, an international collaboration centered around the lkb system and related resources (see http://lingo.stanford.edu). This research was supported in part by the Research Collaboration between NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation and CSLI, Stanford University. We would like to thank Emily Bender and Tom Wasow for their contributions to our thinking. However, we alone are responsible for any errors that remain.
1. Introduction
The first of these is disambiguation. Paradoxically, linguistic precision is inversely correlated with degree of sentence ambiguity. This is a fact of life encountered by every serious grammar development project. Knowledge representation, once thought to hold the key to the problem of disambiguation, has largely failed to provide completely satisfactory solutions. Most research communities we are aware of that are currently developing large scale, linguistically precise, computational grammars are now exploring the integration of stochastic methods for ambiguity resolution. The second key problem facing the deep processing program – the problem of multiword expressions – is underappreciated in the field at large. There is insufficient ongoing work investigating the nature of this problem or seeking computationally tractable techniques that will contribute to its solution.
We define multiword expressions (MWEs) very roughly as “idiosyncratic interpretations that cross word boundaries (or spaces)”. As Jackendoff (1997: 156) notes, the magnitude of this problem is far greater than has traditionally been realized within linguistics. He estimates that the number of MWEs in a speaker’s lexicon is of the same order of magnitude as the number of single words. In fact, it seems likely that this is an underestimate, even if we only include lexicalized phrases. In WordNet 1.7 (Fellbaum 1999), for example, 41% of the entries are multiword. For a wide coverage NLP system, this is almost certainly an underestimate. Specialized domain vocabulary, such as terminology, overwhelmingly consists of MWEs, and a system may have to handle arbitrarily many such domains. As each new domain adds more MWEs than simplex words, the proportion of MWEs will rise as the system adds vocabulary for new domains.
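The closing argument of this passage is simple arithmetic, and a minimal Python sketch can make it concrete. All counts below are hypothetical; only the 41% starting share comes from the quoted text. The function name `multiword_fraction` and the domain figures are assumptions made for illustration. The share rises at each step here because every hypothetical domain contributes a higher proportion of MWEs than the lexicon already has.

```python
def multiword_fraction(simplex: int, multiword: int) -> float:
    """Return the share of lexicon entries that are multiword."""
    return multiword / (simplex + multiword)


# Hypothetical general-purpose lexicon in which 41% of entries are multiword.
simplex, multiword = 59_000, 41_000
fractions = [multiword_fraction(simplex, multiword)]

# Hypothetical domain vocabularies as (new simplex words, new MWEs).
# Each domain adds more MWEs than simplex words, as the passage assumes.
domains = [(1_000, 4_000), (2_000, 5_000), (500, 3_000)]

for new_simplex, new_multiword in domains:
    simplex += new_simplex
    multiword += new_multiword
    fractions.append(multiword_fraction(simplex, multiword))

for step, share in enumerate(fractions):
    print(f"after {step} domain(s): {share:.1%} multiword")
```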
2. Some Kinds of MWE
MWEs can be broadly classified into lexicalized phrases and institutionalized phrases (terminology adapted from Bauer (1983)). Lexicalized phrases have at least partially idiosyncratic syntax or semantics, or contain ‘words’ which do not occur in isolation; they can be further broken down into fixed expressions, semi-fixed expressions and syntactically-flexible expressions, in roughly decreasing order of lexical rigidity. Institutionalized phrases are syntactically and semantically compositional, but occur with markedly high frequency (in a given context). Below, we examine instances of each category and discuss some of the peculiarities that pose problems for both words-with-spaces and fully compositional analyses.
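The classification described in this passage can be sketched as a small data structure. This is one possible encoding and is not taken from the paper: the names `MWEClass`, `LEXICALIZED`, `is_lexicalized`, `rigidity_rank` and `EXAMPLES` are hypothetical. The example expressions and their class assignments are illustrative assumptions and do not appear in the quoted text.

```python
from enum import Enum
from typing import Optional


class MWEClass(Enum):
    """The four leaf categories named in the passage."""
    FIXED = "fixed expression"
    SEMI_FIXED = "semi-fixed expression"
    SYNTACTICALLY_FLEXIBLE = "syntactically-flexible expression"
    INSTITUTIONALIZED = "institutionalized phrase"


# Lexicalized phrases, listed in roughly decreasing order of lexical rigidity.
LEXICALIZED = (
    MWEClass.FIXED,
    MWEClass.SEMI_FIXED,
    MWEClass.SYNTACTICALLY_FLEXIBLE,
)


def is_lexicalized(mwe_class: MWEClass) -> bool:
    """True for classes with at least partially idiosyncratic syntax or semantics."""
    return mwe_class in LEXICALIZED


def rigidity_rank(mwe_class: MWEClass) -> Optional[int]:
    """Position in the rigidity ordering (0 = most rigid); None if not lexicalized."""
    return LEXICALIZED.index(mwe_class) if is_lexicalized(mwe_class) else None


# Illustrative assignments only (assumptions, not from the quoted passage).
EXAMPLES = {
    "by and large": MWEClass.FIXED,
    "kick the bucket": MWEClass.SEMI_FIXED,
    "look up": MWEClass.SYNTACTICALLY_FLEXIBLE,
    "traffic light": MWEClass.INSTITUTIONALIZED,
}

for phrase, mwe_class in EXAMPLES.items():
    kind = "lexicalized" if is_lexicalized(mwe_class) else "institutionalized"
    print(f"{phrase!r}: {mwe_class.value} ({kind})")
```

Keeping institutionalized phrases outside the rigidity ordering reflects the passage's description of them as syntactically and semantically compositional, distinguished only by markedly high frequency.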
References
- (Bauer, 1983) ⇒ Laurie Bauer. (1983). “English Word-formation.” Cambridge University Press.
- (Charniak, 2001) ⇒ Eugene Charniak. (2001). “Immediate-Head Parsing for Language Models.” In: Proceedings of the 39th Annual Meeting of the ACL and 10th Conference of the EACL (ACL-EACL 2001).
- (Copestake, 1992) ⇒ Ann Copestake. (1992). “The Representation of Lexical Semantic Information.” Ph.D. thesis, University of Sussex.
- (Copestake, 1994) ⇒ Ann Copestake. (1994). “Representing Idioms.” Presentation at the HPSG Conference.