Word Lemmatisation System

AKA: Lemmatizer.
Context:
- ...
Example(s):
Counter-Example(s):
- a Word Stemming System.
- a Word Tokenization System.
- a Morphological Parsing System.
See: Word Tokenization System, WebBANC, Lemma, Natural Language Processing, Named Entity Recognizer, CoNLL.

References

(Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Lemmatisation Retrieved:2018-9-23.
- Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research (Bergmanis & Goldwater, 2018; Muller et al., 2015; Green et al., 2009).
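
The contrast with stemming drawn above can be sketched in a few lines. This is a toy illustration only, not any particular system's method: the lexicon entries, the POS tag names, and the helper names `lemmatize` and `naive_stem` are all assumptions made for the example.

```python
# Toy lexicon mapping (wordform, part of speech) to a lemma.
# Entries and tag names are illustrative assumptions, not a real resource.
LEXICON = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("saw", "NOUN"): "saw",
    ("saw", "VERB"): "see",
    ("better", "ADJ"): "good",
}


def lemmatize(wordform: str, pos: str) -> str:
    """Look up the lemma for a (wordform, POS) pair; fall back to the wordform."""
    word = wordform.lower()
    return LEXICON.get((word, pos), word)


def naive_stem(wordform: str) -> str:
    """Strip a few common English suffixes without using POS or a lexicon."""
    word = wordform.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(lemmatize("meeting", "NOUN"))  # lemma depends on the POS supplied
print(lemmatize("meeting", "VERB"))
print(naive_stem("meeting"))         # same output whatever the POS
```

The stemmer gives one answer per string, while the lemmatizer's answer changes with the part of speech it is given, which is why a lemmatizer usually sits downstream of a tagger or some other source of context.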

(Bergmanis & Goldwater, 2018) ⇒ Toms Bergmanis, and Sharon Goldwater (2018). "Context Sensitive Neural Lemmatization with Lematus". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1391-1400.
- QUOTE: Lemmatization is the process of determining the dictionary form of a word (e.g. swim) given one of its inflected variants (e.g. swims, swimming, swam, swum). Data-driven lemmatizers face two main challenges: first, to generalize beyond the training data in order to lemmatize unseen words; and second, to disambiguate ambiguous wordforms from their sentence context. In Latvian, for example, the wordform “ceļu” is ambiguous when considered in isolation: it could be an inflected variant of the verb “celt” (to lift) or the nouns “celis” (knee) or “ceļš” (road); without context, the lemmatizer can only guess.
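
The two challenges named in the quote can be made concrete with a small lookup table. The sketch below is not Lematus and does no learning; it only shows why a context-free table is insufficient. The wordforms and lemmas are taken from the quote, while the POS tag names and the function name `candidate_lemmas` are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative table: wordform -> candidate (lemma, part-of-speech) readings.
# Wordforms and lemmas come from the quoted examples; tag names are assumed.
CANDIDATES: Dict[str, List[Tuple[str, str]]] = {
    "swims": [("swim", "VERB")],
    "swimming": [("swim", "VERB")],
    "swam": [("swim", "VERB")],
    "swum": [("swim", "VERB")],
    "ceļu": [("celt", "VERB"), ("celis", "NOUN"), ("ceļš", "NOUN")],
}


def candidate_lemmas(wordform: str, pos: Optional[str] = None) -> List[str]:
    """Return every lemma compatible with the wordform and optional POS hint."""
    readings = CANDIDATES.get(wordform, [])
    if pos is not None:
        readings = [reading for reading in readings if reading[1] == pos]
    return [lemma for lemma, _ in readings]


print(candidate_lemmas("swam"))              # a single candidate
print(candidate_lemmas("ceļu"))              # three candidates without context
print(candidate_lemmas("ceļu", pos="NOUN"))  # a POS hint still leaves two
print(candidate_lemmas("swimmer"))           # unseen wordform: nothing to return
```

Even with a noun tag, "ceļu" remains ambiguous between "celis" and "ceļš", so part of speech alone does not settle the lemma; and a wordform absent from the table yields nothing, which is the generalization problem a data-driven lemmatizer has to address.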

(Green et al., 2009) ⇒ Nathan Green, Paul Breimyer, Vinay Kumar, and Nagiza F. Samatova (2009). "WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Languages".
- ABSTRACT: Annotated corpora are sets of structured text used to enable Natural Language Processing (NLP) tasks. Annotations may include tagged parts-of-speech, semantic concepts assigned to phrases, or semantic relationships between these concepts in text. Building annotated corpora is labor-intensive and presents a major obstacle to advancing machine translators, named entity recognizers (NER), part-of-speech taggers, etc. Annotated corpora are specialized for a particular language or NLP task. Hence, a majority of the world’s 6000+ languages lack NLP resources, and therefore remain minority, or under-resourced, languages in modern language technologies.
  In this paper we present WebBANC, a framework for building Annotated NLP Corpora from user annotations on the Web. With WebBANC, a casual user can annotate parts of HTML or PDF text on any website and associate the text with semantic concepts specific to an NLP task. User annotations are combined by WebBANC to produce annotated corpora potentially comparable in diversity to corpora in English, minority languages, and human generated categories, such as those on Yahoo.com, with an average precision and recall of 0.80, which is comparable to automated NER tools on the CoNLL benchmark.

http://www.cis.upenn.edu/~cis639/docs/lookup.html
- Lexical Lookup: Lexical lookup requires a morphological analyzer to associate each token with one or more readings. Unknown words are handled by a guesser which provides potential part-of-speech categories based on affix patterns.
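
A minimal sketch of the lookup-plus-guesser arrangement described above, under stated assumptions: the lexicon, the suffix table, the tag names, and the functions `guess_pos` and `lookup` are hypothetical, and the guesser handles suffixes only, which simplifies the "affix patterns" mentioned in the source.

```python
from typing import Dict, List, Optional, Tuple

Reading = Tuple[Optional[str], str]  # (lemma or None if unknown, POS category)

# Stand-in for a morphological analyzer: token -> one or more readings.
LEXICON: Dict[str, List[Reading]] = {
    "walks": [("walk", "NOUN"), ("walk", "VERB")],
    "dogs": [("dog", "NOUN")],
}

# Suffix patterns tried in order; longer suffixes come first on purpose.
SUFFIX_GUESSES: List[Tuple[str, List[str]]] = [
    ("ness", ["NOUN"]),
    ("ly", ["ADV"]),
    ("ing", ["VERB", "NOUN", "ADJ"]),
    ("ed", ["VERB", "ADJ"]),
    ("s", ["NOUN", "VERB"]),
]

# Assumed fallback when no suffix matches: the open word classes.
DEFAULT_GUESS = ["NOUN", "VERB", "ADJ"]


def guess_pos(token: str) -> List[str]:
    """Propose POS categories for an unknown token from its suffix."""
    for suffix, tags in SUFFIX_GUESSES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return list(tags)
    return list(DEFAULT_GUESS)


def lookup(token: str) -> List[Reading]:
    """Return lexicon readings if the token is known, else guessed categories."""
    word = token.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    return [(None, tag) for tag in guess_pos(word)]


print(lookup("walks"))      # known token: readings from the lexicon
print(lookup("glorpness"))  # unknown token: category guessed from the suffix
```

Guessed readings carry `None` as the lemma so that later stages can tell them apart from lexicon entries.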