Tokenized Document
A Tokenized Document is a text document that is represented as a Text Token String.
- AKA: Tokenized Text Document.
- Context:
- It can range from being an Unannotated Tokenized Document to being an Annotated Tokenized Document.
- It is the output of a Tokenization Task.
- It can be produced by a Tokenization System (that applies a Tokenization Algorithm).
- Example(s):
- See: Annotated Document, CoNLL Format.
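The distinction between an unannotated and an annotated tokenized document can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: the `Token` class, the `tokenize` and `annotate` helpers, the regular expression (runs of word characters, or single punctuation marks), and the part-of-speech-style labels. A real Tokenization System may split text quite differently.

```python
import re
from dataclasses import dataclass, replace
from typing import List, Optional

@dataclass(frozen=True)
class Token:
    """One token plus its character offsets in the source text (hypothetical type)."""
    text: str
    start: int
    end: int
    label: Optional[str] = None  # None means the token is unannotated

# Assumed tokenization rule: a run of word characters, or one non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text: str) -> List[Token]:
    """Represent a text document as a string (sequence) of tokens."""
    return [Token(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]

def annotate(tokens: List[Token], labels: List[str]) -> List[Token]:
    """Attach one label per token, yielding an annotated tokenized document."""
    if len(tokens) != len(labels):
        raise ValueError("expected exactly one label per token")
    return [replace(tok, label=lab) for tok, lab in zip(tokens, labels)]

unannotated = tokenize("Dr. Smith arrived.")
annotated = annotate(unannotated, ["NOUN", "PUNCT", "NOUN", "VERB", "PUNCT"])
```

Keeping character offsets on each token lets later processing steps map a token back to its position in the original text.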
References
2008
- (Manning et al., 2008) ⇒ Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. (2008). “Introduction to Information Retrieval.” Cambridge University Press. ISBN:0521865719.
- QUOTE: Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
- (Reiss et al., 2008) ⇒ Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. (2008). “An Algebraic Approach to Rule-Based Information Extraction.” In: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering (ICDE 2008). doi:10.1109/ICDE.2008.4497502
- QUOTE: Dictionary matching is a fairly expensive operation that involves tokenizing the current document’s text and looking for all occurrences of the set of words and phrases listed in a specified dictionary. ... Even when documents are tokenized at the very beginning of the processing pipeline, an entire pass over these tokens for each Ed operator requires thousands of probes into the dictionary data structures. ... A DictEval that produces a set of matching spans given a dictionary and a tokenized document.
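The idea of producing matching spans from a dictionary and a tokenized document can be sketched as follows. This is not the operator described by Reiss et al.; the function name `dict_eval`, the case-insensitive matching, and the whitespace splitting of dictionary entries are all assumptions made for illustration. Spans are returned as half-open `(start, end)` token index pairs.

```python
from typing import Iterable, List, Sequence, Set, Tuple

def dict_eval(tokens: Sequence[str], dictionary: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans whose tokens match a dictionary entry.

    Hypothetical sketch: entries are split on whitespace and compared
    case-insensitively against contiguous runs of tokens.
    """
    entries: Set[Tuple[str, ...]] = {tuple(e.lower().split()) for e in dictionary}
    entries.discard(())  # ignore empty or whitespace-only entries
    max_len = max((len(e) for e in entries), default=0)
    lowered = [t.lower() for t in tokens]

    spans: List[Tuple[int, int]] = []
    for i in range(len(lowered)):
        # Probe every candidate phrase length that could start at token i.
        for n in range(1, min(max_len, len(lowered) - i) + 1):
            if tuple(lowered[i:i + n]) in entries:
                spans.append((i, i + n))
    return spans

tokens = ["IBM", "Research", "is", "in", "San", "Jose"]
spans = dict_eval(tokens, ["san jose", "ibm"])
```

Each token position triggers one set probe per candidate phrase length, which gives a sense of why the quoted passage describes repeated passes over the tokens as expensive.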