Word Lemmatisation Algorithm

Example(s):
Counter-Example(s):
- Canonicalization Algorithm,
- Word Stemming Algorithm.
See: Lemmatisation, Latent Semantic Analysis.

References

(Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Lemmatisation Retrieved:2018-9-16.
- Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research (Green et al., 2009; Muller et al., 2015; Bergmanis & Goldwater, 2018).
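The contrast with stemming described above can be illustrated with a minimal sketch. It assumes a hand-written lookup table keyed by word form and part of speech; the table entries, the tag names (`VERB`, `NOUN`) and the functions `toy_lemmatize` and `toy_stem` are hypothetical and only serve the illustration. They do not correspond to any of the systems cited on this page.

```python
# Toy lookup table keyed by (word form, part of speech).
# The entries are illustrative assumptions, not a real lexicon.
LEMMA_TABLE = {
    ("swims", "VERB"): "swim",
    ("swimming", "VERB"): "swim",
    ("swam", "VERB"): "swim",
    ("swum", "VERB"): "swim",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


def toy_stem(word):
    """Naive suffix stripping that ignores part of speech and context."""
    word = word.lower()
    for suffix in ("ing", "s"):
        # Only strip when a stem of at least three characters remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for form, pos in [("meeting", "NOUN"), ("meeting", "VERB"), ("swam", "VERB")]:
        print(form, pos, "lemma:", toy_lemmatize(form, pos), "stem:", toy_stem(form))
```

In this sketch the lemmatiser returns different lemmas for "meeting" depending on the supplied part of speech, while the stemmer returns the same string in both cases and cannot relate the irregular form "swam" to "swim".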

(Bergmanis & Goldwater, 2018) ⇒ Toms Bergmanis and Sharon Goldwater. (2018). "Context Sensitive Neural Lemmatization with Lematus". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (Vol. 1, pp. 1391-1400).
- QUOTE: Lemmatization is the process of determining the dictionary form of a word (e.g. swim) given one of its inflected variants (e.g. swims, swimming, swam, swum). Data-driven lemmatizers face two main challenges: first, to generalize beyond the training data in order to lemmatize unseen words; and second, to disambiguate ambiguous wordforms from their sentence context. In Latvian, for example, the wordform “ceļu” is ambiguous when considered in isolation: it could be an inflected variant of the verb “celt” (to lift) or the nouns “celis” (knee) or “ceļš” (road); without context, the lemmatizer can only guess (...)
  This paper presents Lematus—a system that adapts the neural machine translation framework of Sennrich et al. (2017) to learn context sensitive lemmatization using an encoder-decoder model. Context is represented simply using the character contexts of each form to be lemmatized, meaning that our system requires fewer training resources than previous systems...
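The idea of representing context by the characters surrounding a word form can be sketched as follows. This is only an illustration under stated assumptions: the boundary markers `<lc>` and `<rc>`, the space symbol `<s>`, the default window size and the function name `char_context_input` are choices made for this example and are not confirmed by the quoted passage as the exact input format used by the authors.

```python
def _symbols(text):
    """Split text into character symbols, writing spaces as the marker <s>."""
    return ["<s>" if ch == " " else ch for ch in text]


def char_context_input(tokens, index, window=5):
    """Build a character-level input sequence for the token at `index`.

    The sequence holds up to `window` characters of left context, the marker
    <lc>, the characters of the target form, the marker <rc>, and up to
    `window` characters of right context (marker names are assumptions).
    """
    # Keep the space that separates the target form from its neighbours.
    left = "".join(tok + " " for tok in tokens[:index])
    right = "".join(" " + tok for tok in tokens[index + 1:])
    left_ctx = left[len(left) - min(window, len(left)):]
    right_ctx = right[:window]
    return (
        _symbols(left_ctx)
        + ["<lc>"]
        + list(tokens[index])
        + ["<rc>"]
        + _symbols(right_ctx)
    )


if __name__ == "__main__":
    sentence = ["she", "has", "swum", "today"]
    print(" ".join(char_context_input(sentence, 2, window=3)))
```

An encoder-decoder model would take such a sequence as its source side and be trained to emit the characters of the lemma as its target side. Setting `window=0` yields a context-free variant of the same input.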

(Muller et al., 2015) ⇒ Thomas Muller, Ryan Cotterell, Alexander Fraser, and Hinrich Schutze. (2015). "Joint lemmatization and morphological tagging with lemming". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 2268-2274).
- ABSTRACT: We present LEMMING, a modular loglinear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech lemmatization, we reduce the error by 60%, from 4.05 to 1.58. We also give empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.

(Ingason et al., 2008) ⇒ Anton Karl Ingason, Sigrun Helgadottir, Hrafn Loftsson, and Eirikur Rognvaldsson. (2008). "A mixed method lemmatization algorithm using a hierarchy of linguistic identities (HOLI)". In: Advances in Natural Language Processing (pp. 205-216). Springer, Berlin, Heidelberg.
- ABSTRACT: We present a new mixed method lemmatizer for Icelandic, Lemmald, which achieves good performance by relying on IceTagger (1) for tagging and The Icelandic Frequency Dictionary (2) corpus for training. We combine the advantages of data-driven machine learning with linguistic insights to maximize performance. To achieve this, we make use of a novel approach: Hierarchy of Linguistic Identities (HOLI), which involves organizing features and feature structures for the machine learning based on linguistic knowledge. Accuracy of the lemmatization is further improved using an add-on which connects to the Database of Modern Icelandic Inflections (3). Given correct tagging, our system lemmatizes Icelandic text with an accuracy of 99.55%. We believe our method can be fruitfully adapted to other morphologically rich languages.