Morph Segmentation Task

AKA: Word Decompounding, WDT.
Context:
- Input: a Linguistic Expression.
- Output: a string of Morphs.
- The input can be an Unspaced Compound Word.
- It can support:
  - a Word Stemming Task.
  - a Word Lemmatisation Task.
Example(s):
- MST(“dogs”) ⇒ [dog] [s].
- MST(“wolves”) ⇒ [wolv] [es].
- MST(“arachnophobia”) ⇒ (arachno, phobia).
- MST(“vliegangst” ~ “fear of flying”) ⇒ (vlieg, angst).
- MST(“Lebensversicherungsgesellschaft” ~ “life insurance company”) ⇒ (Lebens, versicher, ungs, gesell, schaft).
- MST(“The wolves' den was empty.”) ⇒ [The] [wolv] [es] ['] [den] [was] [empty].
- …
Counter-Example(s):
- Morphological Parsing.
- Orthographic Segmentation.
- Lemmatization(“vliegangst”) ⇒ [vliegen] [angst].
- WST("日文章魚怎麼說") ⇒ 日文, 章魚, 怎麼, 說.
- Linguistic Translation("日文章魚怎麼說", English) ⇒ “How do you say octopus in Japanese?”.
See: Morpheme.

References

(Kraaij, 2004) ⇒ Wessel Kraaij. (2004). “Variations on Language Modeling for Information Retrieval.” PhD Thesis, University of Twente, June 2004.
- QUOTE: Compound analysis (also called decompounding or compound splitting) is an additional normalization technique for Germanic languages, since these have a productive compounding capacity. This means that new words can be formed by concatenating existing words. Decomposition of these compound words into their constituting morphological base forms is important for IR, since these compounds can usually be paraphrased by a noun-phrase construction, e.g., “vliegangst” and “angst om te vliegen” (fear of flying). Normalization of compounds will enable a match between both forms of the same composite concept and partial matches with related words after compound splitting, e.g., ‘luchtvervuiling’ will match with ‘vervuiling’. Several algorithms have been proposed for compound splitting. They either use a lexicon (e.g. Vosse, 1994) or a corpus (e.g. Hollink et al., 2003) as a resource for the identification of candidate base forms which can form compounds. We will discuss the results of several comparative studies concerning stemming algorithms in the rest of this section.
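
The lexicon-based approach mentioned in the quote can be illustrated with a minimal sketch. It is not the algorithm of Vosse (1994) or Hollink et al. (2003); the toy lexicon, the function name `split_compound`, and the `min_len` parameter are all assumptions made for illustration. The sketch tries the longest lexicon entry first and backtracks when the remainder of the word cannot be split, and it ignores linking elements such as the Dutch or German “s”.

```python
from typing import FrozenSet, List, Optional, Tuple

# Toy lexicon of base forms (an assumption for illustration, not a real resource).
LEXICON: FrozenSet[str] = frozenset({"lucht", "vervuiling", "vlieg", "angst"})


def split_compound(
    word: str,
    lexicon: FrozenSet[str] = LEXICON,
    min_len: int = 3,
) -> Optional[List[str]]:
    """Split `word` into lexicon entries that concatenate to it.

    Returns None when no complete split exists. A word that is itself
    in the lexicon is returned as a single-element list.
    """
    word = word.lower()

    def helper(start: int) -> Optional[Tuple[str, ...]]:
        if start == len(word):
            return ()  # consumed the whole word
        # Try the longest candidate part first, then shorter ones.
        for end in range(len(word), start + min_len - 1, -1):
            part = word[start:end]
            if part in lexicon:
                rest = helper(end)
                if rest is not None:
                    return (part,) + rest
        return None  # no lexicon entry fits at this position

    result = helper(0)
    return list(result) if result is not None else None


if __name__ == "__main__":
    for w in ("luchtvervuiling", "vliegangst", "fiets"):
        print(w, "=>", split_compound(w))
```

With this toy lexicon, “luchtvervuiling” is split into “lucht” and “vervuiling”, so a query for “vervuiling” could partially match it, which is the retrieval benefit the quote describes. A corpus-based variant would replace the fixed lexicon with word forms collected from the document collection.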

(Hollink et al., 2003) ⇒ Hollink, V., Kamps, J., Monz, C., & de Rijke, M. (2003). “Monolingual Document Retrieval for European Languages.” Information Retrieval.

(Vosse, 1994) ⇒ Vosse, T. G. (1994). “The Word Connection." PhD thesis, Rijksuniversiteit Leiden, Neslia Paniculata Uitgeverij, Enschede.