Word Segmentation Algorithm
A Word Segmentation Algorithm is a sequence segmentation algorithm that can be applied by a word segmentation system (to solve a word segmentation task).
- AKA: Word Detection Algorithm.
- Context:
- It can, depending on the Written Languages:
- not use Whitespace Characters between Terminal Words (e.g. for Japanese language and Chinese language).
- not use a large number of Spaced Compound Content Words (e.g. for English language).
- It can range from being:
- …
- Counter-Example(s):
- See: Concept Mention.
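As an illustration of segmenting text that has no whitespace between words, the following is a minimal sketch of greedy forward maximum matching, a common lexicon-based baseline. It is not the method of any reference cited on this page. The function name `max_match`, the `max_len` parameter, and the toy lexicon are assumptions made for this example only.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching over an unsegmented string.

    `lexicon` is a set of known words (a toy stand-in for a real dictionary).
    """
    if max_len is None:
        # Longest lexicon entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i first, then shrink it.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit a single character.
            words.append(text[i])
            i += 1
    return words


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"北京", "大学", "北京大学", "生", "学生"}
print(max_match("北京大学生", toy_lexicon))  # ['北京大学', '生']
```

Because the matcher always commits to the longest word it can find, it may miss a better overall segmentation; with the lexicon `{"the", "table", "down", "there", "theta", "bled", "own"}` it splits `"thetabledownthere"` into `theta`, `bled`, `own`, `there`. Trainable and probabilistic segmenters aim to reduce this kind of error.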
References
2007
- (Schmid, 2007) ⇒ Helmut Schmid. (2007). “Tokenizing.” In: Corpus Linguistics: An International Handbook. Walter de Gruyter, Berlin.
1999
- (Brent, 1999) ⇒ Michael R. Brent. (1999). “An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery.” In: Machine Learning, 34(1-3). doi:10.1023/A:1007541817488.
1997
- (Palmer, 1997) ⇒ David D. Palmer. (1997). “A Trainable Rule-based Algorithm for Word Segmentation.” In: Proceedings of the ACL 1997 Conference. doi:10.3115/976909.979658.
- QUOTE: This paper presents a trainable rule-based algorithm for performing word segmentation. The algorithm provides a simple, language-independent alternative to large-scale lexical-based segmenters requiring large amounts of knowledge engineering. As a stand-alone segmenter, we show our algorithm to produce high performance Chinese segmentation. In addition, we show the transformation-based algorithm to be effective in improving the output of several existing word segmentation algorithms in three different languages.