Word Stemming System
A Word Stemming System is a text processing system that can solve a word stemming task.
- AKA: Stemmer.
- Context:
- It can be used by a Text Vector Generator.
- It produces Stemmed Words.
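- Example(s):
- a Porter Stemmer, as described in (Porter, 1980).
- a Snowball Stemmer, as described in (Snowball, 2009).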
- Counter-Example(s):
- See: Bag-of-Words Representation.
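The context items above (a stemmer feeding a Text Vector Generator) can be illustrated with a minimal sketch. It assumes a hypothetical `toy_stem` rule set that strips only three English suffixes; it is not the Porter or Snowball algorithm, and the function names are illustrative only.

```python
import re
from collections import Counter

def toy_stem(word: str) -> str:
    """Hypothetical toy rule: strip one of a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_stems(text: str) -> Counter:
    """Tokenize, stem each token, and count stems (a bag-of-words over stems)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(token) for token in tokens)

counts = bag_of_stems("Jumping dogs jumped; the dog jumps.")
print(counts)
```

With this toy rule set, "jumping", "jumped" and "jumps" all contribute to the single feature `jump`, so the resulting vector has fewer dimensions than one built over raw tokens.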
References
2018
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Stemming Retrieved:2018-10-19.
- In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
A computer program or subroutine that stems words may be called a stemming program, stemming algorithm, or stemmer.
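As a rough illustration of conflation, the sketch below groups words by the output of a toy suffix stripper. The suffix list and the names `toy_stem` and `conflate` are assumptions made for this example and are not taken from any published stemmer.

```python
from collections import defaultdict

# Hypothetical suffix list, ordered longest-first so "ions" is tried before "s".
SUFFIXES = ("ations", "ation", "ings", "ions", "ing", "ion", "ed", "es", "s")

def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, leaving at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

def conflate(words):
    """Group words that share a stem; each group is treated as one term."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)

print(conflate(["connect", "connected", "connecting", "connection", "connections"]))
print(toy_stem("studies"), toy_stem("studied"))
```

All five "connect" variants fall into one group, while "studies" and "studied" both map to `studi`, a stem that is not itself a valid word but still serves to relate the two forms.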
2009a
- (Snowball, 2009) ⇒ http://snowball.tartarus.org/texts/stemmersoverview.html
- We present stemming algorithms, and Snowball stemmers, for English, for Russian, for the Romance languages French, Spanish, Portuguese and Italian, for German and Dutch, for Swedish, Norwegian (bokmål dialect) and Danish, and for Finnish.
- Snowball, and most of the current stemming algorithms were written by Dr Martin Porter, who also prepared the material for the Website. The Snowball to Java codegenerator, and supporting Java libraries, were contributed by Richard Boulton. Dr Andrew Macfarlane, of City University, London, gave much initial encouragement and proofreading assistance.
2009b
- (Richardson, 2009) ⇒ Jim Richardson (2009) http://search.cpan.org/~snowhare/Lingua-Stem-0.83/lib/Lingua/Stem/En.pm
- QUOTE: This routine applies the Porter Stemming Algorithm to its parameters, returning the stemmed words.
It is derived from the C program “stemmer.c” as found in freewais and elsewhere, which contains these notes:
Purpose: Implementation of the Porter stemming algorithm documented in: Porter, M.F., "An Algorithm For Suffix Stripping," Program 14 (3), July 1980, pp. 130-137. Provenance: Written by B. Frakes and C. Cox, 1986.
2006
- (Porter, 2006) ⇒ Martin Porter (2006). The Porter Stemming Algorithm.
- QUOTE: The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
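One way to picture stemming as a term normalisation step in an Information Retrieval system is the sketch below, which applies the same stemmer at indexing time and at query time. The `toy_stem` rules are a hypothetical stand-in, not the Porter algorithm, and `build_index` and `search` are illustrative names.

```python
import re
from collections import defaultdict

def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text: str):
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each normalised (stemmed) term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[toy_stem(token)].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every stemmed query term."""
    terms = [toy_stem(token) for token in tokenize(query)]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {1: "Indexing documents", 2: "The indexed document", 3: "Unrelated text"}
index = build_index(docs)
print(search(index, "index documents"))
```

Because both sides pass through the same normalisation, the query "index documents" matches documents 1 and 2 even though neither contains the literal word "index".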
1980
- (Porter, 1980) ⇒ Martin F. Porter. (1980). “An algorithm for suffix stripping” (PDF). In: Program, 14(3):130–137.
- QUOTE: In any suffix stripping program for IR work, two points must be borne in mind. Firstly, the suffixes are being removed simply to improve IR performance, and not as a linguistic exercise. This means that it would not be at all obvious under what circumstances a suffix should be removed, even if we could exactly determine the suffixes of a word by automatic means.
Perhaps the best criterion for removing suffixes from two words W1 and W2 to produce a single stem S, is to say that we do so if there appears to be no difference between the two statements `a document is about W1' and `a document is about W2'. So if W1=`CONNECTION' and W2=`CONNECTIONS' it seems very reasonable to conflate them to a single stem. But if W1=`RELATE' and W2=`RELATIVITY' it seems perhaps unreasonable, especially if the document collection is concerned with theoretical physics (...)
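Porter's criterion can be made concrete with a small sketch that compares two suffix sets. Both sets and the helper names are hypothetical choices for this example; they only show how a more aggressive rule set conflates RELATE with RELATIVITY while a conservative one does not.

```python
def strip_suffixes(word: str, suffixes) -> str:
    """Remove the longest matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def conflated(w1: str, w2: str, suffixes) -> bool:
    """True if both words reduce to the same stem under the given suffix set."""
    return strip_suffixes(w1, suffixes) == strip_suffixes(w2, suffixes)

# Hypothetical rule sets, chosen only to illustrate the trade-off.
CONSERVATIVE = ("s",)
AGGRESSIVE = ("ivity", "ions", "ion", "e", "s")

for name, rules in (("conservative", CONSERVATIVE), ("aggressive", AGGRESSIVE)):
    print(name,
          conflated("CONNECTION", "CONNECTIONS", rules),
          conflated("RELATE", "RELATIVITY", rules))
```

Under the conservative set only CONNECTION and CONNECTIONS share a stem; under the aggressive set RELATE and RELATIVITY both reduce to `relat`, which is the kind of conflation the quote calls perhaps unreasonable for a physics collection.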