Word Lemmatisation Task
A Word Lemmatisation Task is a lexical classification task that maps a word form to a lexeme lemma.
- AKA: Lemmatize, WLT.
- Context:
- Input: a Word Mention.
- Optional: a Word Mention String (the text in which the word occurs).
- Output: a Natural Language Lemma.
- It can be performed by a Word Lemmatisation System (that implements a word lemmatisation algorithm).
- It can range from being a Heuristic Word Lemmatisation Task to being a Data-Driven Word Lemmatisation Task.
- It can (typically) be performed after a Word Mention Segmentation Task.
- ...
- Example(s):
- Lemmatise("wolves") ⇒ "wolf".
- Lemmatise("went") ⇒ "go".
- Lemmatise("running", "He is running") ⇒ "run".
- Lemmatise("viewpoints", "There are many viewpoints to the Grand Canyon.") ⇒ "viewpoint".
- Lemmatise("executes", "The party executes the agreement.") ⇒ "execute".
- Lemmatise("leases", "The company leases the property.") ⇒ "lease".
- ...
- Counter-Example(s):
- a Word Stemming Task.
- a Word Morphology Parsing Task, such as:
- WMPT("running", "He is running") ⇒ <lemma=run, form=present participle>.
- a Part-of-Speech Tagging Task.
- See: Morphological Parsing Task, Morphological Lemma, Dictionary Record, Named Entity Recognizer.
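A Heuristic Word Lemmatisation Task of the kind described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the names `IRREGULAR_FORMS`, `SUFFIX_RULES`, `KNOWN_LEMMAS`, and `lemmatise` are hypothetical, and the tiny English-only lexicon is assumed so that the examples listed above can be reproduced.

```python
# Hypothetical toy resources, assumed for illustration only.
IRREGULAR_FORMS = {"went": "go", "wolves": "wolf"}
KNOWN_LEMMAS = {"go", "wolf", "run", "viewpoint", "execute", "lease"}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [("ing", ""), ("ing", "e"), ("es", "e"), ("es", ""), ("s", "")]


def candidate_stems(form):
    """Yield possible base forms produced by the suffix rules."""
    for suffix, replacement in SUFFIX_RULES:
        if form.endswith(suffix) and len(form) > len(suffix):
            stem = form[: -len(suffix)] + replacement
            yield stem
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                yield stem[:-1]


def lemmatise(word_mention):
    """Map a word form to a lemma using a lookup table, then suffix rules."""
    form = word_mention.lower()
    if form in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[form]
    if form in KNOWN_LEMMAS:
        return form
    for stem in candidate_stems(form):
        # Unlike a stemmer, only accept a candidate that is a known lemma.
        if stem in KNOWN_LEMMAS:
            return stem
    return form  # Fall back to the surface form when nothing matches.


for word in ["wolves", "went", "running", "viewpoints", "executes", "leases"]:
    print(word, "=>", lemmatise(word))
```

The check against `KNOWN_LEMMAS` is what separates this sketch from a Word Stemming Task: a stemmer would return the truncated string even when it is not a dictionary form.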
References
2018a
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Lemmatisation Retrieved:2018-9-23.
- Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. [1] In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research (Bergmanis & Goldwater, 2018; Muller et al., 2015; Green et al., 2009).
2018b
- (Bergmanis & Goldwater, 2018) ⇒ Toms Bergmanis, and Sharon Goldwater (2018). "Context Sensitive Neural Lemmatization with Lematus". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (Vol. 1, pp. 1391-1400).
- QUOTE: Lemmatization is the process of determining the dictionary form of a word (e.g. swim) given one of its inflected variants (e.g. swims, swimming, swam, swum). Data-driven lemmatizers face two main challenges: first, to generalize beyond the training data in order to lemmatize unseen words; and second, to disambiguate ambiguous wordforms from their sentence context. In Latvian, for example, the wordform “ceļu” is ambiguous when considered in isolation: it could be an inflected variant of the verb “celt” (to lift) or the nouns “celis” (knee) or “ceļš” (road); without context, the lemmatizer can only guess.
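The disambiguation challenge in this quote can be illustrated with a small Python sketch. It uses English analogues ("leaves", "saw") rather than the Latvian example, and it assumes that a coarse part-of-speech tag is supplied by some upstream step; the names `AMBIGUOUS_FORMS`, `candidate_lemmas`, and `lemmatise_in_context` are hypothetical. Part of speech alone would not separate two noun readings such as "celis" and "ceļš", so this shows only the simplest case.

```python
# Hypothetical toy lexicon: wordform -> {coarse POS tag: lemma}.
AMBIGUOUS_FORMS = {
    "leaves": {"NOUN": "leaf", "VERB": "leave"},
    "saw": {"NOUN": "saw", "VERB": "see"},
}


def candidate_lemmas(wordform):
    """Return every lemma the form could belong to when seen in isolation."""
    readings = AMBIGUOUS_FORMS.get(wordform.lower(), {})
    return sorted(set(readings.values()))


def lemmatise_in_context(wordform, pos_tag=None):
    """Pick a lemma using a POS tag assumed to come from sentence context.

    Returns None when the form is ambiguous and no usable tag is given,
    i.e. the situation in which a lemmatiser "can only guess".
    """
    form = wordform.lower()
    readings = AMBIGUOUS_FORMS.get(form)
    if readings is None:
        return form  # Not in the toy lexicon: return the form unchanged.
    return readings.get(pos_tag)


print(candidate_lemmas("leaves"))
print(lemmatise_in_context("leaves", "VERB"))
print(lemmatise_in_context("leaves"))
```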
2015
- (Muller et al., 2015) ⇒ Thomas Muller, Ryan Cotterell, Alexander Fraser, and Hinrich Schutze (2015). "Joint lemmatization and morphological tagging with lemming". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 2268-2274).
- ABSTRACT: We present LEMMING, a modular loglinear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech lemmatization, we reduce the error by 60%, from 4.05 to 1.58. We also give empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.
2009a
- (Green et al., 2009) ⇒ Nathan Green, Paul Breimyer, Vinay Kumar, and Nagiza F. Samatova (2009). "WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Languages".
- ABSTRACT: Annotated corpora are sets of structured text used to enable Natural Language Processing (NLP) tasks. Annotations may include tagged parts-of-speech, semantic concepts assigned to phrases, or semantic relationships between these concepts in text. Building annotated corpora is labor-intensive and presents a major obstacle to advancing machine translators, named entity recognizers (NER), part-of-speech taggers, etc. Annotated corpora are specialized for a particular language or NLP task. Hence, a majority of the world’s 6000+ languages lack NLP resources, and therefore remain minority, or under-resourced, languages in modern language technologies.
In this paper we present WebBANC, a framework for building Annotated NLP Corpora from user annotations on the Web. With WebBANC, a casual user can annotate parts of HTML or PDF text on any website and associate the text with semantic concepts specific to an NLP task. User annotations are combined by WebBANC to produce annotated corpora potentially comparable in diversity to corpora in English, minority languages, and human generated categories, such as those on Yahoo.com, with an average precision and recall of 0.80, which is comparable to automated NER tools on the CoNLL benchmark.
2009b
- (Wiktionary, 2009) ⇒ http://en.wiktionary.org/wiki/lemmatisation
- 1. (computing) The process of finding the lemma that corresponds to an inflected form of a word
2009c
- (IBM Knowledge Center, 2009) ⇒ https://www.ibm.com/support/knowledgecenter/bs/SS8NLW_12.0.0/com.ibm.discovery.es.ta.doc/iiysalgseg.html
- Lemmatization is a form of linguistic processing that determines the lemma for each word form that occurs in text. The lemma of a word encompasses its base form plus inflected forms that share the same part of speech. For example, the lemma for go encompasses go, goes, went, gone, and going. Lemmas for nouns group singular and plural forms (such as calf and calves). Lemmas for adjectives group comparative and superlative forms (such as good, better, and best). Lemmas for pronouns group different cases of the same pronoun (such as I, me, my, and mine).
Lemmatization requires a dictionary for both indexing and searching.
Watson Explorer Content Analytics indexes the lemmas and the inflected words and lemmatizes all inflected words in a query. Lemmatization enhances search quality by finding documents that contain variants of an inflected word in the query. For example, documents that contain the word mice are found when a query includes the word mouse.
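The indexing behaviour described here (index both the inflected word and its lemma, then lemmatise the query) can be sketched in Python as follows. This is a generic illustration under stated assumptions, not the product's actual implementation: `LEMMA_DICTIONARY`, `lemma_of`, `build_index`, and `search` are hypothetical names, and the dictionary holds only a few entries taken from the examples in the passage.

```python
from collections import defaultdict

# Hypothetical lemma dictionary covering only the examples in the passage.
LEMMA_DICTIONARY = {"mice": "mouse", "calves": "calf", "went": "go", "goes": "go"}


def normalise(token):
    """Lowercase a token and strip surrounding punctuation."""
    return token.strip(".,;:!?\"'()").lower()


def lemma_of(token):
    """Look up the lemma; unknown tokens are treated as their own lemma."""
    return LEMMA_DICTIONARY.get(token, token)


def build_index(documents):
    """Index every document under both its surface forms and their lemmas."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for raw in text.split():
            token = normalise(raw)
            if token:
                index[token].add(doc_id)
                index[lemma_of(token)].add(doc_id)
    return index


def search(index, query):
    """Lemmatise each query term and return the union of matching documents."""
    results = set()
    for raw in query.split():
        token = normalise(raw)
        if token:
            results |= index.get(lemma_of(token), set())
    return results


docs = {1: "The mice ate the cheese.", 2: "A mouse ran away.", 3: "The cat slept."}
index = build_index(docs)
print(search(index, "mouse"))
```

With this toy data, a query for "mouse" also retrieves the document that contains only "mice", which is the effect on search quality that the passage describes.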
2006
- (Airio, 2006) ⇒ Eija Airio. (2006). “Word Normalization and Decompounding in Mono- and Bilingual IR.” In: Journal of Information Retrieval, 9(3).
- ABSTRACT: The present research studies the impact of decompounding and two different word normalization methods, stemming and lemmatization, on monolingual and bilingual retrieval. The languages in the monolingual runs are English, Finnish, German and Swedish. The source language of the bilingual runs is English, and the target languages are Finnish, German and Swedish. In the monolingual runs, retrieval in a lemmatized compound index gives almost as good results as retrieval in a decompounded index, but in the bilingual runs differences are found: retrieval in a lemmatized decompounded index performs better than retrieval in a lemmatized compound index. The reason for the poorer performance of indexes without decompounding in bilingual retrieval is the difference between the source language and target languages: phrases are used in English, while compounds are used instead of phrases in Finnish, German and Swedish. No remarkable performance differences could be found between stemming and lemmatization.
- Keywords: Monolingual information retrieval - bilingual information retrieval - lemmatization - stemming - decompounding
2003
- (Mitkov, 2003) ⇒ Ruslan Mitkov, editor. (2003). “The Oxford Handbook of Computational Linguistics." Oxford University Press. ISBN:019927634X
- QUOTE: lemmatization: The process of grouping the inflected forms of a word together under a base form, or of recovering the base form from an inflected form, e.g. grouping the inflected forms 'run', 'runs', 'running', 'ran' under the base form 'run'.
1998
- (Lezius et al., 1998) ⇒ Wolfgang Lezius, Reinhard Rapp, Manfred Wettler. (1998). “A Freely Available Morphological Analyzer, Disambiguator and Context Sensitive Lemmatizer for German.” In: Proceedings of the 17th International Conference on Computational Linguistics.
- ↑ Collins English Dictionary, entry for "lemmatise"