Word Segmentation Algorithm

AKA: Word Detection Algorithm.
Context:
- It can be applied to Written Languages that:
  - do not use Whitespace Characters between Terminal Words (e.g. the Japanese language and the Chinese language).
  - do not use a large number of Spaced Compound Content Words (e.g. the English language).
- It can range from being:
  - a Dictionary-based Word Mention Segmentation Algorithm,
  - to being a Supervised Word Mention Segmentation Algorithm.
- …
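The dictionary-based end of this range can be illustrated with a minimal sketch of greedy forward maximum matching, one common dictionary-based approach. The function name `max_match`, its parameters, and the toy lexicon are hypothetical and chosen only for illustration; they are not taken from any of the referenced works.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching over an unspaced string.

    `lexicon` is a set of known words (an illustrative assumption).
    At each position the longest lexicon entry is taken; characters
    not covered by any entry are emitted as single-character words.
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            # No entry matched: fall back to a one-character word.
            j = i + 1
        words.append(text[i:j])
        i = j
    return words


toy_lexicon = {"研究", "研究生", "生命", "起源"}
print(max_match("研究生命起源", toy_lexicon))
```

With this toy lexicon the greedy strategy commits to the longer entry 研究生 first and cannot recover the reading 研究 / 生命 / 起源, which is the kind of ambiguity that motivates trainable and supervised segmenters.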
Counter-Example(s):
- a Mention Linking Algorithm.
- a Mention Resolution Algorithm.
See: Concept Mention.

References

(Schmid, 2007) ⇒ Helmut Schmid. (2007). “Tokenizing.” In: Corpus Linguistics: An International Handbook. Walter de Gruyter, Berlin.

(Brent, 1999) ⇒ Michael R. Brent. (1999). “An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery.” In: Machine Learning, 34(1-3). doi:10.1023/A:1007541817488.

(Palmer, 1997) ⇒ David D. Palmer. (1997). “A Trainable Rule-based Algorithm for Word Segmentation.” In: Proceedings of the ACL 1997 Conference. doi:10.3115/976909.979658.
- QUOTE: This paper presents a trainable rule-based algorithm for performing word segmentation. The algorithm provides a simple, language-independent alternative to large-scale lexical-based segmenters requiring large amounts of knowledge engineering. As a stand-alone segmenter, we show our algorithm to produce high performance Chinese segmentation. In addition, we show the transformation-based algorithm to be effective in improving the output of several existing word segmentation algorithms in three different languages.