Word Lemmatisation Algorithm
A Word Lemmatisation Algorithm is an NLP algorithm that can be implemented by a word lemmatisation system to solve a word lemmatisation task.
- Example(s):
- Counter-Example(s):
- See: Lemmatisation, Latent Semantic Analysis.
References
2018a
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Lemmatisation Retrieved:2018-9-16.
- Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. [1] In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research (Green et al., 2009; Muller et al., 2015; Bergmanis & Goldwater, 2018).
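To make the contrast with stemming concrete, here is a small illustrative sketch (not drawn from the cited sources) using NLTK's WordNetLemmatizer and PorterStemmer; it assumes the WordNet data has already been downloaded.

```python
# Illustrative sketch: POS-aware lemmatisation vs. stemming with NLTK.
# Requires: pip install nltk; then nltk.download('wordnet') and nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# The same surface form maps to different lemmas depending on part of speech.
print(lemmatizer.lemmatize("saw", pos="v"))      # 'see'  (verb reading)
print(lemmatizer.lemmatize("saw", pos="n"))      # 'saw'  (noun reading)

# Stemming ignores part of speech and meaning and may produce non-words.
print(stemmer.stem("studies"))                   # 'studi'
print(lemmatizer.lemmatize("studies", pos="n"))  # 'study'
```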
2018b
- (Bergmanis & Goldwater, 2018) ⇒ Toms Bergmanis, and Sharon Goldwater. (2018). "Context Sensitive Neural Lemmatization with Lematus". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (Vol. 1, pp. 1391-1400).
- QUOTE: Lemmatization is the process of determining the dictionary form of a word (e.g. swim) given one of its inflected variants (e.g. swims, swimming, swam, swum). Data-driven lemmatizers face two main challenges: first, to generalize beyond the training data in order to lemmatize unseen words; and second, to disambiguate ambiguous wordforms from their sentence context. In Latvian, for example, the wordform “ceļu” is ambiguous when considered in isolation: it could be an inflected variant of the verb “celt” (to lift) or the nouns “celis” (knee) or “ceļš” (road); without context, the lemmatizer can only guess (...)
This paper presents Lematus—a system that adapts the neural machine translation framework of Sennrich et al. (2017) to learn context sensitive lemmatization using an encoder-decoder model. Context is represented simply using the character contexts of each form to be lemmatized, meaning that our system requires fewer training resources than previous systems...
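As a rough sketch of the character-context input representation described above, the snippet below builds a source-side character sequence for one wordform that a generic encoder-decoder model could map to the characters of its lemma. The marker tokens <lc>/<rc> and the 15-character context window are illustrative assumptions, not the authors' exact format.

```python
# Minimal sketch (not the Lematus code): a wordform plus a fixed window of
# character context, serialized as one character-level source sequence for a
# seq2seq lemmatizer. Markers <lc>, <rc>, <s> are assumed special symbols.
def encode_example(sentence: str, start: int, end: int, window: int = 15) -> list[str]:
    """Build the source-side token sequence for the word sentence[start:end]."""
    left = sentence[max(0, start - window):start]
    target = sentence[start:end]
    right = sentence[end:end + window]
    tokens = []
    tokens += list(left) + ["<lc>"]        # left character context
    tokens += list(target)                  # the wordform to lemmatize
    tokens += ["<rc>"] + list(right)        # right character context
    return [t if t != " " else "<s>" for t in tokens]  # make spaces explicit

sent = "she swam across the lake"
src = encode_example(sent, start=4, end=8)  # the wordform "swam"
print(src)
# A trained encoder-decoder would emit the lemma characters: ['s', 'w', 'i', 'm']
```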
2017
- (Cancedda & Renders, 2017) ⇒ Nicola Cancedda, and Jean-Michel Renders (2017). "Cross-Lingual Text Mining". In: (Sammut & Webb, 2017). DOI: 10.1007/978-1-4899-7687-1_189
- QUOTE: In CLTM, Latent Semantic approaches rely on some interlingua (language-independent) representation. Most of the time, this interlingua representation is obtained by linear or non-linear statistical analysis techniques and more specifically dimensionality reduction methods with ad-hoc optimization criterion and constraints. But, others adopt a more manual approach by exploiting multilingual thesauri or even multilingual ontologies in order to map textual objects towards a list – possibly weighted – of interlingua concepts.
For any textual object (typically a document or a section of document), the interlingua concept representation is derived from a sequence of operations that encompass:
- 1. Linguistic preprocessing (as explained in previous sections, this step amounts to extract the relevant, normalized “terms” of the textual objects, by tokenisation, word segmentation/decompounding, lemmatisation/stemming, part-of-speech tagging, stopword removal, corpus-based term filtering, Noun-phrase extractions, etc.).
- 2. Semantic enrichment and/or monolingual dimensionality reduction.
- 3. Interlingua semantic projection.
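The linguistic preprocessing in step 1 can be sketched roughly as follows (this is not from the cited chapter): tokenisation, part-of-speech tagging, lemmatisation, and stopword removal, yielding the normalized "terms" that later steps enrich and project into an interlingua space. The spaCy model name "en_core_web_sm" is an assumption and must be installed separately.

```python
# Rough sketch of step 1 (linguistic preprocessing) using spaCy.
# Requires: pip install spacy; python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text: str) -> list[str]:
    """Tokenise, tag, lemmatise, and filter stopwords/non-words."""
    doc = nlp(text)
    return [
        tok.lemma_.lower()                    # lemmatised, case-folded term
        for tok in doc
        if tok.is_alpha and not tok.is_stop   # drop punctuation, digits, stopwords
    ]

print(preprocess("The swimmers were swimming across the lakes."))
# e.g. ['swimmer', 'swim', 'lake']
```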
2015
- (Muller et al., 2015) ⇒ Thomas Müller, Ryan Cotterell, Alexander Fraser, and Hinrich Schütze. (2015). "Joint Lemmatization and Morphological Tagging with Lemming". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 2268-2274).
- ABSTRACT: We present LEMMING, a modular log-linear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech lemmatization, we reduce the error by 60%, from 4.05 to 1.58. We also give empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.
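A conceptual sketch (not the LEMMING implementation) of what jointly scoring a (morphological tag, lemma) pair with a log-linear model can look like: each candidate pair is scored as a weighted sum of joint features, and the highest-scoring pair wins. The feature names, candidates, and weights below are illustrative assumptions.

```python
# Toy joint log-linear scoring of (tag, lemma) candidates for one wordform.
from itertools import product

def features(word: str, tag: str, lemma: str) -> dict[str, float]:
    # Joint features let the tag and the lemma constrain one another.
    return {
        f"tag={tag}": 1.0,
        f"suffix3={word[-3:]}|tag={tag}": 1.0,
        f"edit={'same' if word == lemma else 'changed'}|tag={tag}": 1.0,
    }

def score(weights: dict[str, float], word: str, tag: str, lemma: str) -> float:
    return sum(weights.get(k, 0.0) * v for k, v in features(word, tag, lemma).items())

# Toy weights a trained model might have learned.
weights = {"suffix3=ing|tag=VERB": 2.0, "edit=changed|tag=VERB": 1.0}

word = "swimming"
candidates = product(["NOUN", "VERB"], ["swimming", "swim"])
best = max(candidates, key=lambda tl: score(weights, word, *tl))
print(best)  # ('VERB', 'swim') under these toy weights
```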
2009
- (Green et al., 2009) ⇒ Nathan Green, Paul Breimyer, Vinay Kumar, and Nagiza F. Samatova (2009). "WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Language".
2008
- (Ingason et al., 2008) ⇒ Anton Karl Ingason, Sigrún Helgadóttir, Hrafn Loftsson, and Eiríkur Rögnvaldsson (2008). "A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI)". In: Advances in Natural Language Processing (pp. 205-216). Springer, Berlin, Heidelberg.
- ABSTRACT: We present a new mixed method lemmatizer for Icelandic, Lemmald, which achieves good performance by relying on IceTagger (1) for tagging and The Icelandic Frequency Dictionary (2) corpus for training. We combine the advantages of data-driven machine learning with linguistic insights to maximize performance. To achieve this, we make use of a novel approach: Hierarchy of Linguistic Identities (HOLI), which involves organizing features and feature structures for the machine learning based on linguistic knowledge. Accuracy of the lemmatization is further improved using an add-on which connects to the Database of Modern Icelandic Inflections (3). Given correct tagging, our system lemmatizes Icelandic text with an accuracy of 99.55%. We believe our method can be fruitfully adapted to other morphologically rich languages.
- ↑ Collins English Dictionary, entry for "lemmatise"