2004 VariationsOnLanguageModelingForIR

Subject Headings: Information Retrieval Task, Information Retrieval Algorithm, Survey.

Notes

Quotes

  • Keyword(s): Language Modeling, Cross-Language Information Retrieval, Topic Detection and Tracking, IR evaluation, Parallel Corpora.

3.3.1. Morphological normalization: stemming, lemmatization.

  • One of the techniques employed in Information Retrieval (IR) to improve effectiveness is normalization of document and query terms. By reducing the morphological variance of terms, e.g., by mapping singular and plural forms of the same word onto a single base form (stem), the query-document matching process can be improved. The normalization process generates so-called conflation classes. Members of a conflation class are treated as if they were equivalent terms. In practice, this means that during document indexing and query analysis, full word forms are replaced by an index term representing the conflation class. This is usually the normalized form to which all members can be reduced, but it is not necessarily a well-formed word, since it just acts as a placeholder for the class. Morphological normalization is usually called stemming in an IR context. Sometimes the term lemmatization is used, which is restricted to approaches that produce lemmas as base forms.
  • There are two main approaches to morphological normalization. One could attempt to reduce affixes (usually suffixes) by simple substring removal operations or even truncation. These simple methods usually do not produce morphologically well-formed base forms. A more principled approach is to apply morphological analysis grounded in linguistic theory about word formation. This method does produce well-formed base forms, which is important when showing feedback terms to the user or when accessing translation dictionaries in a cross-language setting. In addition, such a knowledge-rich approach will correctly cover irregular morphology. Three morphological phenomena are of particular interest to IR: inflection, derivation and compounding. The aim of normalization is to group morphological variants that have a similar meaning. Normalizing inflectional variants is usually a meaning-neutral operation. However, the semantic relationship between derivational variants can range from very close to quite distinct, e.g., like/likely, art/artist or unite/union.
  • Compound analysis (also called decompounding or compound splitting) is an additional normalization technique for Germanic languages, since these have a productive compounding capacity. This means that new words can be formed by concatenating existing words. Decomposition of these compound words into their constituent morphological base forms is important for IR, since these compounds can usually be paraphrased by a noun-phrase construction, e.g., “vliegangst” and “angst om te vliegen” (fear of flying). Normalization of compounds will enable a match between both forms of the same composite concept, and partial matches with related words after compound splitting, e.g., “luchtvervuiling” will match with “vervuiling”. Several algorithms have been proposed for compound splitting. They use either a lexicon (e.g. Vosse, 1994) or a corpus (e.g. Hollink et al., 2003) as a resource for the identification of candidate base forms which can form compounds. We will discuss the results of several comparative studies concerning stemming algorithms in the rest of this section.
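
The two ideas quoted above, conflation classes from simple suffix removal and lexicon-based compound splitting, can be illustrated with a minimal Python sketch. It is not the method of any system cited in the thesis: the suffix list, the tiny lexicon, and the function names (`crude_stem`, `conflation_classes`, `split_compound`) are hypothetical and chosen only for illustration. The splitter also ignores linking elements between compound parts, which a realistic decompounder would likely need to handle.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration; not taken from any published stemmer.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word, min_stem_len=3):
    """Strip at most one listed suffix; the result need not be a well-formed word."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem_len:
            return w[: -len(suffix)]
    return w


def conflation_classes(terms):
    """Group terms by their stem; each stem acts as the index term for its class."""
    classes = defaultdict(set)
    for term in terms:
        classes[crude_stem(term)].add(term.lower())
    return dict(classes)


def split_compound(word, lexicon, min_len=3):
    """Split a word into parts that all occur in the lexicon.

    Returns [word] unchanged when no full cover by lexicon entries exists.
    """
    def cover(rest):
        if rest in lexicon:
            return [rest]
        # Try every split point that leaves at least min_len characters on each side.
        for i in range(min_len, len(rest) - min_len + 1):
            head = rest[:i]
            if head in lexicon:
                tail = cover(rest[i:])
                if tail is not None:
                    return [head] + tail
        return None

    w = word.lower()
    return cover(w) or [w]


# Toy lexicon of base forms (an assumption; a real lexicon would be far larger).
LEXICON = {"lucht", "vervuiling", "vlieg", "angst"}

classes = conflation_classes(["document", "Documents", "index", "indexing", "indexed"])
parts = split_compound("luchtvervuiling", LEXICON)
```

With this toy input, `classes` groups "document" with "documents" and "index" with "indexing" and "indexed", and `parts` contains "lucht" and "vervuiling". Indexing the parts alongside the full compound is one way a query for “vervuiling” could then match a document containing “luchtvervuiling”.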

References

  • Hollink, V., Kamps, J., Monz, C., & de Rijke, M. (2003). Monolingual document retrieval for European languages. Information Retrieval.
  • Vosse, T. G. (1994). The Word Connection. PhD thesis, Rijksuniversiteit Leiden, Neslia Paniculata Uitgeverij, Enschede.

Author: Wessel Kraaij
Title: Variations on Language Modeling for Information Retrieval
URL: http://www.ctit.utwente.nl/library/phd/kraaij.pdf
Year: 2004