Word Segmentation Task (WST)
A Word Segmentation Task (WST) is a text segmentation task that is restricted to the detection of word mentions within a text artifact.
- AKA: Surface Word Boundary Detection, Lexing.
- Context:
- Input: a Text Artifact (or a grapheme string).
- Output: a Word Mention String.
- It can range from being a Written Word Segmentation Task (for written expressions) to being a Spoken Word Segmentation Task (for spoken expressions).
- It can interface with a Morph Segmentation Task (MST).
- It can support a Part-of-Speech Tagging Task.
- It can be solved by a Word Mention Segmentation System (that implements a Word Mention Segmentation Algorithm).
- It can range from being a Heuristic Word Mention Identification Task to being a Data-Driven Word Mention Identification Task.
- It can be a Language-Dependent Task, such as a Chinese Word Segmentation Task.
- ...
- Example(s):
- WMST("I'm coming home”) ⇒ ([I] ['m] [coming] [home]).
- WMST("I bought a real time operating system”) ⇒ ([I] [bought] [a] [real time] [operating system]).
- WMST("日文章魚怎麼說") ⇒ ([日文] [章魚] [怎麼] [說]) (i.e. ~[Japanese] [octopus] [how] [say]).
- VWST("Famous notaries public include ex-attorney generals.") ⇒ ([Famous] [notaries public] [include] [ex-] [attorney generals]).
- WMST("Der Lebensversicherungsgesellschaftsangestellte kam gestern mit seinem Deutscher Schäferhund.” (~The life insurance company employee came yesterday with their German Shepherd) ⇒ ([Der], [Lebensversicherungs], [gesellschafts], [angestellte], [kam], [gestern], [mit], [seinem], [Deutscher Schäferhund])
- notice that both “life insurance" and "insurance company” may exist in a lexicon but "... [life] [insurance company]..” is incorrect.
- WMST("The ex-governor general's sisters-in-law saw the wolves' den near Mr. Smith's home in Sault Ste. Marie.”) ⇒ ([The] [ex-] [governor general's] [sisters-in-law] [saw] [the] [wolves'] [den] [near] [Mr. Smith's] [home] [in] [Sault Ste. Marie].”).
- any Term Mention Segmentation Task.
- any Entity Mention Detection Task.
- WST("#Imcominghome") ⇒ ([#] [I] [m] [coming] [home]) (a hashtag segmentation example).
- more Word Segmentation Task Examples.
- …
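The "real time operating system" example above can be reproduced with a minimal greedy longest-match (maximum matching) sketch in Python. The toy lexicon, its multi-word entries, and the two-token window are illustrative assumptions, not a real lexical resource:

```python
# A toy lexicon; real systems would consult a large lexical resource.
TOY_LEXICON = {"i", "bought", "a", "real", "time", "real time",
               "operating", "system", "operating system"}
MAX_SPAN = 2  # longest multi-token entry in the toy lexicon

def greedy_segment(tokens):
    """Scan left to right, preferring the longest lexicon match."""
    mentions, i = [], 0
    while i < len(tokens):
        for span in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + span]).lower()
            if candidate in TOY_LEXICON or span == 1:
                mentions.append(" ".join(tokens[i:i + span]))
                i += span
                break
    return mentions

print(greedy_segment("I bought a real time operating system".split()))
# ['I', 'bought', 'a', 'real time', 'operating system']
```

Greedy longest-match also illustrates the pitfall flagged in the German example: lexicon membership alone cannot decide between overlapping candidates such as [life insurance] and [insurance company], which motivates the statistical approaches in the References below.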
- Counter-Example(s):
- a Text Tokenization Task, such as:
TTT("I bought a real time operating system") ⇒ ([I] [bought] [a] [real] [time] [operating] [system]) (see the tokenizer sketch after this list).
- a Word Stemming Task, such as:
f("The ex-governor general's sisters-in-law saw the wolves' den near Mr. Smith's home in Sault Ste. Marie.") ⇒ (the, exgovernor, general, s, sistersinlaw, saw, the, wolv, den, near, mr, smith, s, home, in, sault, ste, mari).
- an Allomorph Segmentation Task, such as:
f("The wolves' den was empty.”) ⇒ ([The], [wolv], [es], ['], [den], [was], [empty]).
- a Phrase Chunking Task, such as:
f("Famous notaries public include ex-attorney generals.”)
⇒ ([Famous notaries public] [include] [ex-attorney generals]). - a Relation Mention Detection Task (e.g. a Semantic Relation Mention Detection Task).
- a Word Mention Reference Resolution Task.
- a Word Mention Coreference Resolution Task.
- a Sentence Boundary Detection Task.
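For contrast with the segmentation sketch above, here is a minimal whitespace-and-punctuation tokenizer of the kind the Text Tokenization counter-example describes; the regex is an illustrative choice, not a standard:

```python
import re

# A crude tokenizer: it always yields single-token mentions,
# so multi-word units like "real time" are split apart.
def tokenize(text):
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(tokenize("I bought a real time operating system"))
# ['I', 'bought', 'a', 'real', 'time', 'operating', 'system']
```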
- See: Text Chunking Task, Pattern Matching Task, Sentence (Linguistics), Turkish Word, German Word, Lexical Stress, Vowel Harmony, Word Separator, Space (Punctuation), Word Boundaries.
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Word#Word_boundaries Retrieved:2015-4-11.
- The task of defining what constitutes a "word" involves determining where one word ends and another word begins—in other words, identifying word boundaries. There are several ways to determine where the word boundaries of spoken language should be placed:
- Potential pause: A speaker is told to repeat a given sentence slowly, allowing for pauses. The speaker will tend to insert pauses at the word boundaries. However, this method is not foolproof: the speaker could easily break up polysyllabic words, or fail to separate two or more closely related words.
- Indivisibility: A speaker is told to say a sentence out loud, and then is told to say the sentence again with extra words added to it. Thus, I have lived in this village for ten years might become My family and I have lived in this little village for about ten or so years. These extra words will tend to be added in the word boundaries of the original sentence. However, some languages have infixes, which are put inside a word. Similarly, some have separable affixes; in the German sentence "Ich komme gut zu Hause an" (~"I arrive home fine"), the verb ankommen is separated.
- Phonetic boundaries: Some languages have particular rules of pronunciation that make it easy to spot where a word boundary should be. For example, in a language that regularly stresses the last syllable of a word, a word boundary is likely to fall after each stressed syllable. Another example can be seen in a language that has vowel harmony (like Turkish): the vowels within a given word share the same quality, so a word boundary is likely to occur whenever the vowel quality changes. Nevertheless, not all languages have such convenient phonetic rules, and even those that do present the occasional exceptions.
- Orthographic boundaries: See below.
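The phonetic-boundaries heuristic above can be illustrated with a toy vowel-harmony detector. This is not a real Turkish analyzer: the front/back vowel classes are simplified, and the run-together example ("evlerdekiler" + "arabada", ~"those in the houses" + "in the car") is just for illustration:

```python
# Toy front/back vowel classes (Turkish-like); a real analyzer would
# also handle rounding harmony, loanwords, and suffix exceptions.
FRONT, BACK = set("eiöü"), set("aıou")

def harmony_boundaries(text):
    """Return character indices where front/back vowel quality flips."""
    candidates, prev_class = [], None
    for idx, ch in enumerate(text.lower()):
        cls = "front" if ch in FRONT else "back" if ch in BACK else None
        if cls is not None:
            if prev_class is not None and cls != prev_class:
                candidates.append(idx)
            prev_class = cls
    return candidates

# "evlerdekiler" (all front vowels) + "arabada" (all back vowels):
# the single quality flip lands at the true word boundary.
print(harmony_boundaries("evlerdekilerarabada"))  # [12]
```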
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation Retrieved:2015-4-11.
- Word segmentation is the problem of dividing a string of written language into its component words.
In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). (Some examples where the space character alone may not be sufficient include contractions like can't for can not.)
However the equivalent to this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese, Japanese, where sentences but not words are delimited, Thai and Lao, where phrases and sentences but not words are delimited, and Vietnamese, where syllables but not words are delimited.
In some writing systems however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
The Unicode Consortium has published a Standard Annex on Text Segmentation, exploring the issues of segmentation in multiscript texts.
Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
Word splitting may also refer to the process of hyphenation.
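The "word splitting" sense above (inferring breaks in concatenated, space-free text, as in the "#Imcominghome" example in the Example(s) section) can be sketched with a small dynamic program; the word list is a toy assumption:

```python
from functools import lru_cache

# Toy word list; "i" and "m" cover the clitic split of "I'm".
WORDS = {"i", "m", "coming", "home"}

def split_words(text):
    """Return one segmentation of space-free text, or None."""
    @lru_cache(maxsize=None)
    def solve(i):
        if i == len(text):
            return ()
        for j in range(len(text), i, -1):  # prefer longer words first
            if text[i:j].lower() in WORDS:
                rest = solve(j)
                if rest is not None:
                    return (text[i:j],) + rest
        return None  # no word in the list covers position i

    return solve(0)

print(split_words("Imcominghome"))  # ('I', 'm', 'coming', 'home')
```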
2009
- (Lin, 2009) ⇒ Dekang Lin. (2009). “Combining Language Modeling and Discriminative Classification for Word Segmentation.” In: Proceedings of the CICLing Conference (CICLing 2009).
2007
- (Schmid, 2007) ⇒ Helmut Schmid. (2007). “Tokenizing.” In: Corpus Linguistics: An International Handbook. Walter de Gruyter, Berlin.
2005
- (McDonald et al., 2005) ⇒ Ryan McDonald, Koby Crammer, and Fernando Pereira. (2005). “Flexible Text Segmentation with Structured Multilabel Classification.” In: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-EMNLP, 2005).
2004
- (Sproat, 2004) ⇒ Richard Sproat. (2004). “Issue of Chinese Word Segmentation.” In: Journal of Chinese Language and Computing 14(3).
2003
- (Mikheev, 2003) ⇒ Andrei Mikheev. (2003). “Text Segmentation.” In: (Mitkov, 2003).
2000
- (Grover et al., 2000) ⇒ Claire Grover, Colin Matheson, Andrei Mikheev, and Marc Moens. (2000). “LT TTT - A flexible tokenisation tool.” In: Proceedings of LREC-2000.
1999
- (Ge et al., 1999) ⇒ Xianping Ge, Wanda Pratt, and Padhraic Smyth. (1999). “Discovering Chinese Words from Unsegmented Text.” In: Proceedings of SIGIR-1999.
- (Manning and Schütze, 1999) ⇒ Christopher D. Manning and Hinrich Schütze. (1999). “Foundations of Statistical Natural Language Processing." The MIT Press.
- QUOTE: Many languages do not put spaces in between words at all, and so the basic word division algorithm of breaking on whitespace is of no use at all. Such languages include major East-Asian languages/scripts such as Chinese, Japanese, and Thai. Ancient Greek was also written by Ancient Greeks without word spaces. Spaces were introduced (together with accent marks, etc.) by those who came afterwards. In such languages, word segmentation is a much more major and challenging task.
While maintaining most word spaces, in German compound nouns are written as single words, for example Lebensversicherungsgesellschaftsangestellter 'life insurance company employee.' In many ways this makes linguistic sense, as the compound is a single word, at least phonologically. But for processing purposes one may wish to divide such a compound, or at least to be aware of the internal structure of the word, and this becomes a limited word segmentation task. While not the rule, joining of compounds sometimes also happens in English, especially when they are common and have a specialized meaning. We noted above that one finds both data base and database. As another example, while hard disk is more common, one sometimes finds harddisk in the computer press.
Until now, the problems we have dealt with have mainly involved splitting apart sequences of characters where the word divisions are not shown by whitespace. But the opposite problem of wanting to lump things together also occurs. Here, things are separated by whitespace but we may wish to regard them as a single word. One possible case is the reverse of the German compound problem. If one decides to treat database as one word, one may wish to treat it as one word even when it is written as data base. More common cases are things such as phone numbers, where we may wish to regard 9465 1873 as a single 'word,' or in the cases of multi-part names such as New York or San Francisco. An especially difficult case is when this problem interacts with hyphenation as in a phrase like this one: "the New York-New Haven railroad." Here the hyphen does not express grouping of just the immediately adjacent graphic words - treating York-New as a semantic unit would be a big mistake.
Other cases are of more linguistic interest. For many purposes, one would want to regard phrasal verbs (make up, work out) as single lexemes (section 3.1.4), but this case is especially tricky since in many cases the particle is separable from the verb (I couldn't work the answer out), and so in general identification of possible phrasal verbs will have to be left to subsequent processing. One might also want to treat as a single lexeme certain other fixed phrases, such as in spite of, in order to, and because of, but typically a tokenizer will regard them as separate words. A partial implementation of this approach occurs in the LOB corpus where certain pairs of words such as because of are tagged with a single part of speech, here preposition, by means of using so-called ditto tags.
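The "lumping" direction Manning and Schütze describe can be sketched as a merge pass over whitespace tokens; the multiword list here is hypothetical, standing in for a real phrase lexicon:

```python
# Illustrative multiword entries (a real system would learn or look these up).
MULTIWORDS = {("New", "York"), ("San", "Francisco"), ("in", "spite", "of")}
MAX_LEN = max(len(m) for m in MULTIWORDS)

def lump(tokens):
    """Re-join whitespace-separated tokens that form a single lexeme."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in MULTIWORDS:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:  # no multiword matched; keep the single token
            out.append(tokens[i])
            i += 1
    return out

print(lump("I flew to New York in spite of the storm".split()))
# ['I', 'flew', 'to', 'New York', 'in spite of', 'the', 'storm']
```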
1997
- (Palmer, 1997) ⇒ David D. Palmer. (1997). “A Trainable Rule-based Algorithm for Word Segmentation.” In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics.
- ABSTRACT: This paper presents a trainable rule-based algorithm for performing word segmentation. The algorithm provides a simple, language-independent alternative to large-scale lexical-based segmenters requiring large amounts of knowledge engineering. As a stand-alone segmenter, we show our algorithm to produce high performance Chinese segmentation. In addition, we show the transformation-based algorithm to be effective in improving the output of several existing word segmentation algorithms in three different languages.
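This is not Palmer's published rule learner, only a minimal sketch of the transformation-based idea the abstract describes: a segmentation is a set of inter-character boundary positions, and contextual rules (hand-written here, but learned from data in the paper) insert or delete boundaries:

```python
# Hand-written rules standing in for learned transformations:
# (action, left-context character, right-context character).
RULES = [
    ("delete", "怎", "麼"),  # join 怎麼 ('how') into one word
]

def apply_rules(text, boundaries):
    bounds = set(boundaries)
    for action, left, right in RULES:
        for i in range(1, len(text)):
            if text[i - 1] == left and text[i] == right:
                (bounds.add if action == "insert" else bounds.discard)(i)
    return sorted(bounds)

def to_words(text, boundaries):
    cuts = [0, *boundaries, len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]

# Start from a character-by-character initial guess and let rules correct it.
text = "怎麼說"
print(to_words(text, apply_rules(text, [1, 2])))  # ['怎麼', '說']
```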
1996a
- (Sproat et al., 1996) ⇒ Richard Sproat, William A. Gale, Chilin Shih, and Nancy Chang. (1996). “A Stochastic Finite-state Word-Segmentation Algorithm for Chinese.” In: Computational Linguistics, 22(3).
- ABSTRACT: The initial stage of text analysis for any NLP task usually involves the tokenization of the input into words. For languages like English one can assume, to a first approximation, that word boundaries are given by whitespace or punctuation. In various Asian languages, including Chinese, on the other hand, whitespace is never used to delimit words, so one must resort to lexical information to "reconstruct" the word-boundary information. In this paper we present a stochastic finite-state model wherein the basic workhorse is the weighted finite-state transducer. The model segments Chinese text into dictionary entries and words derived by various productive lexical processes, and---since the primary intended application of this model is to text-to-speech synthesis --- provides pronunciations for these words. We evaluate the system's performance by comparing its segmentation "judgments" with the judgements of a pool of human segmenters, and the system is shown to perform quite well.
- It includes an example of correct and incorrect Word Segmentation of "日文章魚怎麼說" (How do you say octopus in Japanese?).
- [日文] [章魚] [怎麼] [說] ([Japanese] [octopus] [how] [say])
- [日] [文章] [魚] [怎麼] [說] ([Japan] [essay] [fish] [how] [say]).
- Distinguishes between: Orthographic Word, Syntactic Word, Dictionary Word, and Phonological Word.
- Orthographic Word: "I am" ⇒ [I] [am].
- Syntactic Word: "I'm" ⇒ [I] ['m].
- Dictionary Word: "show up" ⇒ [show up].
- Phonological Word: "ACL" ⇒ [A] [C] [L].
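The ambiguity shown above ([日文] [章魚] vs. [日] [文章] [魚]) can be resolved with a unigram dynamic program in the same stochastic spirit as the paper's weighted finite-state transducer, though this sketch is not the paper's model. The word costs are invented for illustration (lower cost = more probable):

```python
import math

# Made-up unigram costs; unknown single characters get a penalty.
COSTS = {"日文": 4.0, "章魚": 5.0, "怎麼": 3.0, "說": 3.5,
         "日": 3.0, "文章": 4.5, "魚": 4.0}
UNKNOWN = 12.0

def viterbi_segment(text):
    """Pick the lowest-cost cover of the text by dictionary words."""
    best = [(0.0, 0)] + [(math.inf, 0)] * len(text)  # (cost, backpointer)
    for j in range(1, len(text) + 1):
        for i in range(max(0, j - 4), j):  # consider words up to 4 chars
            word = text[i:j]
            cost = COSTS.get(word, UNKNOWN if len(word) == 1 else math.inf)
            if best[i][0] + cost < best[j][0]:
                best[j] = (best[i][0] + cost, i)
    words, j = [], len(text)
    while j > 0:
        i = best[j][1]
        words.append(text[i:j])
        j = i
    return words[::-1]

print(viterbi_segment("日文章魚怎麼說"))
# ['日文', '章魚', '怎麼', '說'] -- total cost 15.5 beats the
# [日] [文章] [魚] [怎麼] [說] reading, which costs 18.0
```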
1996b
- (Wall et al., 1996) ⇒ Larry Wall, Tom Christiansen, and Randal L. Schwartz. (1996). “Programming Perl, 2nd edition.” O'Reilly. ISBN:1565921496.
- QUOTE: tokenizing: Splitting up a program text into its separate words and symbols, each of which is called a token. Also known as "lexing", in which case you get "lexemes" instead of tokens.
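A toy lexer in this Programming Perl sense, written in Python; the token classes and the tiny grammar fragment are illustrative assumptions:

```python
import re

# Match an identifier, a number, or a single-character operator,
# skipping any leading whitespace.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<name>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>[=+\-*/();]))")

def lex(source):
    """Split program text into (kind, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

print(lex("total = price * 3"))
# [('name', 'total'), ('op', '='), ('name', 'price'), ('op', '*'), ('num', '3')]
```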
1994
- (Grefenstette & Tapanainen) ⇒ Gregory Grefenstette, and Pasi Tapanainen. (1994). “What is a Word, What is a Sentence? Problems of Tokenization.” In: Proceedings of 3rd Conference on Computational Lexicography and Text Research (COMPLEX 1994).
- (Sproat et al., 1994) ⇒ Richard Sproat, Chilin Shih, William A. Gale, Nancy Chang. (1994). “A Stochastic Finite-State Word-Segmentation Algorithm for Chinese.” In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics: Proceedings of the Conference (ACL 1994).
1990
- (Sproat & Shi, 1990) ⇒ Richard Sproat, and Chilin Shih. (1990). “A Statistical Method for Finding Word Boundaries in Chinese Text.” In: Computer Processing of Chinese and Oriental Languages, 4.
1987
- (Sproat & Brunson, 1987) ⇒ Richard Sproat, and Barbara Brunson. (1987). “Constituent-based Morphological Parsing: a new approach to the problem of word-recognition.” In: Proceedings of the 25th annual meeting on Association for Computational Linguistics.