Word Lemmatisation System
A Word Lemmatisation System is a Computing System that can solve a Word Lemmatisation Task.
- AKA: Lemmatizer.
- Context:
- ...
- Example(s):
- TextGrid Lemmatizer http://www.textgrid.de/en/beta/lemmatizer.html
- http://lemmatizer.org/
- CST Lemmatizer http://www.clarin.eu/tools/csts-lemmatizer
- mate-tools Lemmatizer http://code.google.com/p/mate-tools/source/browse/trunk/mate-tools/src/is2/lemmatizer/Lemmatizer.java?r=136
- LEMMING.
- a Word Embedding-based Lemmatizer.
- spaCy Lemmatizer.
- …
- Counter-Example(s):
- See: Word Tokenization System, WebBANC, Lemma, Natural Language Processing, Named Entity Recognizer, CoNLL.
References
2018a
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Lemmatisation Retrieved:2018-9-23.
- Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. [1] In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research (Bergmanis & Goldwater, 2018; Muller et al., 2015; Green et al., 2009).
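The contrast with stemming described above can be illustrated with a minimal sketch. The lexicon, the suffix list, and the function names `naive_stem` and `lemmatise` are assumptions made for illustration only; they are not taken from any of the systems listed on this page. The point is only that the lemmatiser consults the part of speech while the stemmer does not.

```python
# Toy lexicon mapping (word form, part of speech) -> lemma.
# The entries are illustrative and not drawn from any real resource.
LEXICON = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("better", "ADJ"): "good",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

# Hypothetical suffix list for the stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, ignoring part of speech and meaning."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return lowered[: -len(suffix)]
    return lowered


def lemmatise(word: str, pos: str) -> str:
    """Look up the lemma for a (form, POS) pair; fall back to the lowercased form."""
    lowered = word.lower()
    return LEXICON.get((lowered, pos), lowered)


if __name__ == "__main__":
    for form, pos in [("meeting", "NOUN"), ("meeting", "VERB"), ("better", "ADJ")]:
        print(form, pos, "stem:", naive_stem(form), "lemma:", lemmatise(form, pos))
```

In this sketch the stemmer maps "meeting" to "meet" regardless of usage and leaves "better" unchanged, whereas the lookup keeps the noun "meeting" intact and maps "better" to "good". A real system would obtain the part of speech from a tagger rather than from the caller.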
2018b
- (Bergmanis & Goldwater, 2018) ⇒ Toms Bergmanis, and Sharon Goldwater (2018). "Context Sensitive Neural Lemmatization with Lematus". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 1391-1400).
- QUOTE: Lemmatization is the process of determining the dictionary form of a word (e.g. swim) given one of its inflected variants (e.g. swims, swimming, swam, swum). Data-driven lemmatizers face two main challenges: first, to generalize beyond the training data in order to lemmatize unseen words; and second, to disambiguate ambiguous wordforms from their sentence context. In Latvian, for example, the wordform “ceļu” is ambiguous when considered in isolation: it could be an inflected variant of the verb “celt” (to lift) or the nouns “celis” (knee) or “ceļš” (road); without context, the lemmatizer can only guess.
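The ambiguity in the quoted Latvian example can be sketched as a candidate lookup. This is not the Lematus method; it is a hand-written table built only from the three readings named in the quote, and the function name `candidate_lemmas` and the POS labels are assumptions for illustration.

```python
from typing import List, Optional

# Candidate (lemma, part of speech) analyses for the ambiguous form in the quote.
CANDIDATES = {
    "ceļu": [("celt", "VERB"), ("celis", "NOUN"), ("ceļš", "NOUN")],
}


def candidate_lemmas(form: str, pos: Optional[str] = None) -> List[str]:
    """Return all lemmas for a form, optionally filtered by a POS tag from context."""
    analyses = CANDIDATES.get(form, [])
    return [lemma for lemma, tag in analyses if pos is None or tag == pos]


if __name__ == "__main__":
    print(candidate_lemmas("ceļu"))           # no context: all three candidates
    print(candidate_lemmas("ceļu", "VERB"))   # a verb reading narrows to one lemma
    print(candidate_lemmas("ceļu", "NOUN"))   # a noun reading still leaves two
```

Even with a part-of-speech tag, the noun reading remains ambiguous between two lemmas, which is why sentence context beyond the tag is needed.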
2015
- (Muller et al., 2015) ⇒ Thomas Müller, Ryan Cotterell, Alexander Fraser, and Hinrich Schütze (2015). "Joint Lemmatization and Morphological Tagging with Lemming". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 2268-2274).
- ABSTRACT: We present LEMMING, a modular log-linear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech lemmatization, we reduce the error by 60%, from 4.05 to 1.58. We also give empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.
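The Czech figure in the abstract can be checked with a short calculation. The sketch assumes that 4.05 and 1.58 are error rates in percent and that "reduce the error by 60%" refers to relative error reduction; the function name is introduced here for illustration.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error that the new system removes."""
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    reduction = relative_error_reduction(4.05, 1.58)
    print(f"{reduction:.1%}")  # prints 61.0%
```

Under these assumptions the relative reduction is about 61%, consistent with the rounded 60% reported in the abstract.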
2009a
- (Green et al., 2009) ⇒ Nathan Green, Paul Breimyer, Vinay Kumar, and Nagiza F. Samatova (2009). "WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Languages".
- ABSTRACT: Annotated corpora are sets of structured text used to enable Natural Language Processing (NLP) tasks. Annotations may include tagged parts-of-speech, semantic concepts assigned to phrases, or semantic relationships between these concepts in text. Building annotated corpora is labor-intensive and presents a major obstacle to advancing machine translators, named entity recognizers (NER), part-of-speech taggers, etc. Annotated corpora are specialized for a particular language or NLP task. Hence, a majority of the world’s 6000+ languages lack NLP resources, and therefore remain minority, or under-resourced, languages in modern language technologies.
In this paper we present WebBANC, a framework for building Annotated NLP Corpora from user annotations on the Web. With WebBANC, a casual user can annotate parts of HTML or PDF text on any website and associate the text with semantic concepts specific to an NLP task. User annotations are combined by WebBANC to produce annotated corpora potentially comparable in diversity to corpora in English, minority languages, and human generated categories, such as those on Yahoo.com, with an average precision and recall of 0.80, which is comparable to automated NER tools on the CoNLL benchmark.
2009b
- http://www.cis.upenn.edu/~cis639/docs/lookup.html
- Lexical Lookup: Lexical lookup requires a morphological analyzer to associate each token with one or more readings. Unknown words are handled by a guesser which provides potential part-of-speech categories based on affix patterns.
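The lookup-plus-guesser arrangement described above can be sketched as follows. The lexicon entries, the suffix patterns, the default category, and the names `lookup` and `guess_pos` are all hypothetical; a real morphological analyzer would be far richer. The sketch only shows the control flow: known tokens receive their stored readings, and unknown tokens receive part-of-speech guesses from affix patterns.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: token -> list of (lemma, part-of-speech) readings.
LEXICON: Dict[str, List[Tuple[str, str]]] = {
    "swims": [("swim", "VERB"), ("swim", "NOUN")],
    "swam": [("swim", "VERB")],
}

# Hypothetical affix patterns for the guesser, tried in order:
# suffix -> plausible part-of-speech categories.
SUFFIX_GUESSES: List[Tuple[str, List[str]]] = [
    ("ness", ["NOUN"]),
    ("ly", ["ADV"]),
    ("ing", ["VERB", "NOUN", "ADJ"]),
    ("ed", ["VERB", "ADJ"]),
    ("s", ["NOUN", "VERB"]),
]


def guess_pos(token: str) -> List[str]:
    """Guess POS categories for an unknown token from its suffix."""
    for suffix, tags in SUFFIX_GUESSES:
        if token.endswith(suffix):
            return list(tags)
    return ["NOUN"]  # assumed default when no pattern matches


def lookup(token: str) -> List[Tuple[str, str]]:
    """Return one or more (lemma, POS) readings for a token."""
    lowered = token.lower()
    if lowered in LEXICON:
        return list(LEXICON[lowered])
    # Unknown word: keep the surface form as the lemma and attach guessed tags.
    return [(lowered, tag) for tag in guess_pos(lowered)]


if __name__ == "__main__":
    for tok in ["swims", "swam", "blorfing", "quickly"]:
        print(tok, lookup(tok))
```

The guesser here returns only categories, as in the description; it leaves the surface form as the lemma, so a later component would still need to resolve the actual lemma for unknown words.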
- [1] Collins English Dictionary, entry for "lemmatise"