2004 FindingPredominantSensesInUntaggedText
- (McCarthy et al., 2004) ⇒ Diana McCarthy, Rob Koeling, Julie Weeds, John Carroll. (2004). “Finding Predominant Senses in Untagged Text.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004).
Subject Headings:
Notes
- It proposes an Unsupervised Algorithm for finding the predominant (first) sense of a noun from untagged text.
- It uses (Lin, 1998b) for Automated Thesaurus Creation.
- It uses (Patwardhan and Pedersen, 2003) as a WordNet Similarity Measure.
Cited By
- ~180 http://scholar.google.com/scholar?cites=254641737593843663
- (Mohammad & Hirst, 2006) ⇒ Saif Mohammad, and Graeme Hirst. (2006). “Determining Word Sense Dominance Using a Thesaurus.” In: Proceedings of EACL-2006.
Quotes
Abstract
- In word sense disambiguation (WSD), the heuristic of choosing the most common sense is extremely powerful because the distribution of the senses of a word is often skewed. The problem with using the predominant, or first sense heuristic, aside from the fact that it does not take surrounding context into account, is that it assumes some quantity of hand-tagged data. Whilst there are a few hand-tagged corpora available for some languages, one would expect the frequency distribution of the senses of words, particularly topical words, to depend on the genre and domain of the text under consideration. We present work on the use of a thesaurus acquired from raw textual corpora and the WordNet similarity package to find predominant noun senses automatically. The acquired predominant senses give a precision of 64% on the nouns of the SENSEVAL-2 English all-words task. This is a very promising result given that our method does not require any hand-tagged text, such as SemCor. Furthermore, we demonstrate that our method discovers appropriate predominant senses for words from two domain-specific corpora.
1 Introduction
- The first sense heuristic which is often used as a baseline for supervised WSD systems outperforms many of these systems which take surrounding context into account. This is shown by the results of the English all-words task in SENSEVAL-2 (Cotton et al., 1998) in figure 1 below, where the first sense is that listed in WordNet for the PoS given by the Penn TreeBank (Palmer et al., 2001). The senses in WordNet are ordered according to the frequency data in the manually tagged resource SemCor (Miller et al., 1993). Senses that have not occurred in SemCor are ordered arbitrarily and after those senses of the word that have occurred. The figure distinguishes systems which make use of hand-tagged data (using HTD) such as SemCor, from those that do not (without HTD). The high performance of the first sense baseline is due to the skewed frequency distribution of word senses. Even systems which show superior performance to this heuristic often make use of the heuristic where evidence from the context is not sufficient (Hoste et al., 2001). Whilst a first sense heuristic based on a sense-tagged corpus such as SemCor is clearly useful, there is a strong case for obtaining a first, or predominant, sense from untagged corpus data so that a WSD system can be tuned to the genre or domain at hand.
- SemCor comprises a relatively small sample of 250,000 words. There are words where the first sense in WordNet is counter-intuitive, because of the size of the corpus, and because where the frequency data does not indicate a first sense, the ordering is arbitrary. For example the first sense of tiger in WordNet is audacious person whereas one might expect that carnivorous animal is a more common usage. There are only a couple of instances of tiger within SemCor. Another example is embryo, which does not occur at all in SemCor and the first sense is listed as rudimentary plant rather than the anticipated fertilised egg meaning. We believe that an automatic means of finding a predominant sense would be useful for systems that use it as a means of backing-off (Wilks and Stevenson, 1998; Hoste et al., 2001) and for systems that use it in lexical acquisition (McCarthy, 1997; Merlo and Leybold, 2001; Korhonen, 2002) because of the limited size of hand-tagged resources. More importantly, when working within a specific domain one would wish to tune the first sense heuristic to the domain at hand. The first sense of star in SemCor is celestial body, however, if one were disambiguating popular news celebrity would be preferred.
- …
2 Method
- In order to find the predominant sense of a target word we use a thesaurus acquired from automatically parsed text based on the method of Lin (1998). This provides the $k$ nearest neighbours to each target word, along with the distributional similarity score between the target word and its neighbour. We then use the WordNet similarity package (Patwardhan and Pedersen, 2003) to give us a semantic similarity measure (hereafter referred to as the WordNet similarity measure) to weight the contribution that each neighbour makes to the various senses of the target word.
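The weighting scheme quoted above can be illustrated with a minimal sketch. Everything concrete here is hypothetical: the function name `rank_senses`, the sense labels, the neighbour list, and the toy similarity table are invented for illustration and do not come from the paper's data. The step that normalises each neighbour's WordNet similarity across the senses of the target word is not stated in the quoted passage and is an assumption of this sketch.

```python
def rank_senses(target_senses, neighbours, wn_sim):
    """Rank the senses of a target word by a prevalence-style score.

    target_senses: list of sense labels for the target word
    neighbours:    list of (neighbour_word, distributional_similarity) pairs,
                   i.e. the k nearest neighbours from the acquired thesaurus
    wn_sim:        callable (sense, neighbour_word) -> non-negative float,
                   standing in for a WordNet similarity measure
    """
    scores = {sense: 0.0 for sense in target_senses}
    for neighbour, dist_sim in neighbours:
        # Similarity of this neighbour to each sense of the target word.
        sims = {sense: wn_sim(sense, neighbour) for sense in target_senses}
        total = sum(sims.values())
        if total == 0:
            continue  # this neighbour gives no evidence for any sense
        for sense, sim in sims.items():
            # Assumption: each neighbour's vote is shared out between the
            # senses in proportion to sim, then weighted by dist_sim.
            scores[sense] += dist_sim * sim / total
    ranking = sorted(target_senses, key=lambda s: scores[s], reverse=True)
    return ranking, scores


# Hypothetical toy data for the noun "star" in a popular-news corpus.
TOY_WN_SIM = {
    ("celestial_body", "actor"): 0.1, ("celebrity", "actor"): 0.9,
    ("celestial_body", "planet"): 0.8, ("celebrity", "planet"): 0.2,
    ("celestial_body", "singer"): 0.1, ("celebrity", "singer"): 0.9,
}


def toy_wn_sim(sense, neighbour):
    # Unknown pairs contribute nothing.
    return TOY_WN_SIM.get((sense, neighbour), 0.0)


ranking, scores = rank_senses(
    ["celestial_body", "celebrity"],
    [("actor", 0.2), ("planet", 0.1), ("singer", 0.15)],
    toy_wn_sim,
)
print(ranking[0])
```

With these invented numbers the neighbours drawn from celebrity contexts outweigh the single astronomical neighbour, so the celebrity sense ranks first. In practice `wn_sim` would be backed by a real WordNet similarity measure and `neighbours` by the top $k$ entries of the automatically acquired thesaurus.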