Text Token Segmentation Task (TTT)
A Text Token Segmentation Task (TTT) is a text segmentation task that is restricted to the detection of text tokens within a text item's linguistic expressions.
- Context:
- Input: a Text Linguistic Expression (typically of a Segmented Written Language).
- Output: a Text Token String.
- It can be solved by a Text Tokenization System (that implements a text tokenization algorithm).
- It can range from being an Orthographic Word Segmentation (e.g. whitespace-based separation) to being a Syntactic Text Tokenization Task (see the tokenizer sketch after the examples below).
- It can (typically) be a Text Preprocessing Task.
- ...
- Example(s):
- TTT(“I'm coming home”)
⇒[I] ['m] [coming] [home]
- TTT(“I bought a real time operating system”)
⇒[I] [bought] [a] [real] [time] [operating] [system]
- TTT(“Famous notaries public include former ex-attorney generals.”)
⇒[Famous] [notaries] [public] [include] [former] [ex-attorney] [generals]
- TTT(“Der Lebensversicherungsgesellschaftsangestellte kam gestern mit seinem Deutscher Schäferhund.” (~The life insurance company employee came yesterday with their German Shepherd))
⇒[Der] [Lebensversicherungsgesellschaftsangestellte] [kam] [gestern] [mit] [seinem] [Deutscher] [Schäferhund]
- TTT(“The ex-governor general's sisters-in-law saw the wolves' den near Mr. Smith's home in Sault Ste. Marie.”)
⇒[The] [ex-] [governor] [general] ['s] [sisters-in-law] [saw] [the] [wolves] ['] [den] [near] [Mr.] [Smith] ['s] [home] [in] [Sault] [Ste.] [Marie]
- …
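The outputs above can be approximated with a small regular-expression tokenizer. The following is a minimal sketch only; the token pattern and the function name tokenize are illustrative assumptions, not a reference implementation. It splits off clitic contractions such as 'm and 's as their own tokens, keeps hyphenated words together, and otherwise splits on whitespace and punctuation.

```python
import re

# Illustrative token pattern (an assumption for this sketch, not a standard):
#  - clitic contractions such as 'm, 's, 're, 'll are split off as separate tokens
#  - hyphenated words (ex-attorney, sisters-in-law) are kept as single tokens
#  - everything else is split on the remaining non-word material (spaces, punctuation)
TOKEN_PATTERN = re.compile(r"'(?:m|s|re|ve|ll|d)\b|\w+(?:-\w+)*")

def tokenize(text: str) -> list[str]:
    """Return the text tokens of `text` (sketch only; abbreviation periods such as
    in Mr. and the bare genitive apostrophe in wolves' would need extra rules)."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("I'm coming home"))
# ['I', "'m", 'coming', 'home']
print(tokenize("Famous notaries public include former ex-attorney generals."))
# ['Famous', 'notaries', 'public', 'include', 'former', 'ex-attorney', 'generals']
```

A plain whitespace split, as in the Orthographic Segmentation counter-example below, would instead return ["I'm", "coming", "home"].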
- Counter-Example(s):
- Grapheme Segmentation("日文章魚怎麼說") ⇒
[日] [文] [章] [魚] [怎] [麼] [說]
- Orthographic Segmentation(“I'm coming home”) ⇒
[I'm] [coming] [home]
- WMST(“I bought a real time operating system”) ⇒
[I] [bought] [a] [real time] [operating system]
- PWST(“I'mcominghome”) ⇒
[I'm] [coming] [home]
. - [math]\displaystyle{ f }[/math]("日文章魚怎麼說") ⇒ ([日文] [章魚] [怎麼] [說]) (i.e. ~[Japanese] [octopus] [how] [say]).
- [math]\displaystyle{ f }[/math](“The ex-governor general's sisters-in-law saw the wolves' den near Mr. Smith's home in Sault Ste. Marie.”) ⇒ (the, exgovernor, general, s, sistersinlaw, saw, the, wolv, den, near, mr, smith, s, home, in, sault, ste, mari), likely a Word Stemming Task.
- Allomorph Segmentation(“The wolves' den was empty.”) ⇒
[The], [wolv], [es], ['], [den], [was], [empty]
- Phrase Chunking(“Famous notaries public include ex-attorney generals.”) ⇒
[Famous notaries public] [include] [ex-attorney generals]
- See: Text Chunking Task, Word Stemming Task, Lexical Analysis, Parsing, Text Segmentation.
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Tokenization_(lexical_analysis) Retrieved:2015-4-11.
- In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation), and in computer science, where it forms part of lexical analysis.
2009
- http://en.wiktionary.org/wiki/tokenization
- 1. The act or process of tokenizing.
- 2. Something tokenized. This was an unlikely tokenization of the input string.
- http://en.wikipedia.org/wiki/Tokenization
- Tokenization is the process of breaking a stream of text up into meaningful elements. This is useful in linguistics and in computer science.
2008
- (Manning et al., 2008) ⇒ Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. (2008). “Introduction to Information Retrieval." Cambridge University Press. ISBN:0521865719.
- QUOTE: Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. ... The major question of the tokenization phase is what are the correct text tokens to use?
- (Reiss et al., 2008) ⇒ Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. (2008). “An Algebraic Approach to Rule-Based Information Extraction.” In: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering (ICDE 2008). doi:10.1109/ICDE.2008.4497502
- QUOTE: Dictionary matching is a fairly expensive operation that involves tokenizing the current document’s text and looking for all occurrences of the set of words and phrases listed in a specified dictionary.
2007
- (Schmid, 2007) ⇒ Helmut Schmid. (2007). “Tokenizing.” In: Corpus Linguistics: An International Handbook. Walter de Gruyter, Berlin.
2003
- (Mikheev, 2003) ⇒ Andrei Mikheev. (2003). “Text Segmentation.” In: (Mitkov, 2003).
1999
- (Manning and Schütze, 1999) ⇒ Christopher D. Manning and Hinrich Schütze. (1999). “Foundations of Statistical Natural Language Processing." The MIT Press.
- QUOTE: Many languages do not put spaces in between words at all, and so the basic word division algorithm of breaking on whitespace is of no use at all. Such languages include major East-Asian languages/scripts such as Chinese, Japanese, and Thai. Ancient Greek was also written by Ancient Greeks without word spaces. Spaces were introduced (together with accent marks, etc.) by those who came afterwards. In such languages, word segmentation is a much more major and challenging task.
While maintaining most word spaces, in German compound nouns are written as single words, for example Lebensversicherungsgesellschaftsangestellter 'life insurance company employee.' In many ways this makes linguistic sense, as compounds are a single word, at least phonologically. But for processing purposes one may wish to divide such a compound, or at least to be aware of the internal structure of the word, and this becomes a limited word segmentation task. While not the rule, joining of compounds sometimes also happens in English, especially when they are common and have a specialized meaning. We noted above that one finds both data base and database. As another example, while hard disk is more common, one sometimes finds harddisk in the computer press.
Until now, the problems we have dealt with have mainly involved splitting apart sequences of characters where the word divisions are not shown by whitespace. But the opposite problem of wanting to lump things together also occurs. Here, things are separated by whitespace but we may wish to regard them as a single word. One possible case is the reverse of the German compound problem. If one decides to treat database as one word, one may wish to treat it as one word even when it is written as data base. More common cases are things such as phone numbers, where we may wish to regard 9465 1873 as a single 'word,' or in the case of multi-part names such as New York or San Francisco. An especially difficult case is when this problem interacts with hyphenation as in a phrase like this one: “the New York-New Haven railroad.” Here the hyphen does not express grouping of just the immediately adjacent graphic words - treating York-New as a semantic unit would be a big mistake.
Other cases are of more linguistic interest. For many purposes, one would want to regard phrasal verbs (make up, work out) as single lexemes (section 3.1.4), but this case is especially tricky since in many cases the particle is separable from the verb (I couldn't work the answer out), and so in general identification of possible phrasal verbs will have to be left to subsequent processing. One might also want to treat as a single lexeme certain other fixed phrases, such as in spite of, in order to, and because of, but typically a tokenizer will regard them as separate words. A partial implementation of this approach occurs in the LOB corpus where certain pairs of words such as because of are tagged with a single part of speech, here preposition, by means of using so-called ditto tags.
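As the quotation notes, whitespace provides no token boundaries at all for Chinese, Japanese, or Thai text. One classic baseline for such languages is greedy maximum matching against a word list; the sketch below is a hedged illustration of that baseline only, not the method of Manning & Schütze or of Sproat et al. (1994), and the toy lexicon is an assumption chosen to segment the running example 日文章魚怎麼說.

```python
def max_match(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy longest-match word segmentation (classic baseline, sketch only)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary entry first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Toy lexicon (an illustrative assumption, not a real dictionary).
toy_lexicon = {"日文", "文章", "章魚", "魚", "怎麼", "說"}
print(max_match("日文章魚怎麼說", toy_lexicon))
# ['日文', '章魚', '怎麼', '說']  ~ [Japanese] [octopus] [how] [say]
```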
1996
- (Wall et al., 1996) ⇒ Larry Wall, Tom Christiansen, and Randal L. Schwartz. (1996). “Programming Perl, 2nd edition." O'Reilly. ISBN:1565921496
- QUOTE: tokenizing: Splitting up a program text into its separate words and symbols, each of which is called a token. Also known as "lexing", in which case you get "lexemes" instead of tokens.
1994
- (Grefenstette & Tapanainen, 1994) ⇒ Gregory Grefenstette, and Pasi Tapanainen. (1994). “What is a Word, What is a Sentence? Problems of Tokenization.” In: Proceedings of 3rd Conference on Computational Lexicography and Text Research (COMPLEX 1994).
- (Sproat et al., 1994) ⇒ Richard Sproat, Chilin Shih, William A. Gale, and Nancy Chang. (1994). “A Stochastic Finite-State Word-Segmentation Algorithm for Chinese.” In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL 1994).
1992
- Penn Treebank Project tokenization
- http://www.cis.upenn.edu/~treebank/tokenization.html
- Our tokenization is fairly simple:
- most punctuation is split from adjoining words
- double quotes (") are changed to doubled single forward- and backward-quotes (`` and '')
- verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately.
- …
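A hedged sketch of the three rules quoted above, approximated with regular-expression substitutions followed by a whitespace split. The function ptb_like_tokenize and its patterns are illustrative assumptions, not the Treebank's own tokenizer script, which handles many more punctuation, abbreviation, and quoting cases.

```python
import re

def ptb_like_tokenize(text: str) -> list[str]:
    """Approximate the Penn Treebank tokenization rules quoted above (sketch only)."""
    # Double quotes become doubled forward/backward single quotes (`` and '').
    text = re.sub(r'"(?=\S)', r"`` ", text)      # opening quote
    text = re.sub(r'"', r" ''", text)            # closing quote
    # Split most punctuation from adjoining words.
    text = re.sub(r"([,;:?!()\[\]])", r" \1 ", text)
    text = re.sub(r"\.(\s|$)", r" . \1", text)   # sentence-final period only (crude)
    # Split verb contractions and the Anglo-Saxon genitive into separate tokens.
    text = re.sub(r"(\w)('s|'re|'ve|'ll|'d|'m|n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    return text.split()

print(ptb_like_tokenize('He said, "I\'m coming home."'))
# ['He', 'said', ',', '``', 'I', "'m", 'coming', 'home', '.', "''"]
```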