Text Token String
A Text Token String is a symbol string composed of text tokens.
- AKA: Text Token Span.
- Context:
- It can (typically) represent a Text Item (or text item substring).
- It can be produced by a Text Tokenization Task.
- It can be a Text Token Subsequence (a subsequence of a text item).
- It can be in a Substring Relation with another Text Token String.
- It can range from being a Whitespace-Separated Text Token String to being a Non-Whitespace-Separated Text Token String.
- ...
- Example(s):
- a Tagged String, such as a POS-tagged string.
- a Word Mention String, such as: ("I", "'m", "a", "notary public").
- ("I", "'m", "a", "notary", "public"), likely intended to represent a Written Sentence.
- ("keyword-based", "search"), likely intended to represent a Term Mention.
- ("task", "of", "keyword", "searching"), likely intended to represent a Concept Mention.
- a Tokenized Document;
- a Python Text String, Java Text String, ...
- …
- Counter-Example(s):
- a Grapheme String, or a Phoneme String;
- a DNA String, ...;
- a Classified String;
- a Tagged String, such as:
The/B government/I has/O other/B agencies/I and/I instruments/I for/O pursuing/O these/B other/I objectives/I .
- See: Text Token String Probability Function, Word Mention String, Text Token String Member Relation.
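The context statements and examples above can be illustrated with a minimal Python sketch. It assumes a text token string is modeled as a tuple of `str` tokens, that the Substring Relation means "occurs as a contiguous run of tokens", and that whitespace splitting stands in for a real Text Tokenization Task. The names `TokenString`, `whitespace_tokenize`, and `is_token_substring` are hypothetical and are not drawn from any cited source.

```python
from typing import Sequence, Tuple

# Assumption: a text token string is an ordered, immutable sequence of tokens.
TokenString = Tuple[str, ...]


def whitespace_tokenize(text: str) -> TokenString:
    """Split on runs of whitespace (a simple stand-in for a tokenization task)."""
    return tuple(text.split())


def is_token_substring(candidate: Sequence[str], container: Sequence[str]) -> bool:
    """Return True if `candidate` occurs as a contiguous token run in `container`."""
    candidate, container = tuple(candidate), tuple(container)
    n = len(candidate)
    # Slide a window of length n over the container and compare token by token.
    return any(
        container[i:i + n] == candidate
        for i in range(len(container) - n + 1)
    )


# A whitespace-separated text token string produced from a text item.
term_mention = whitespace_tokenize("keyword-based search")

# A tokenization that whitespace splitting alone would not produce ("I'm" -> "I", "'m").
sentence = ("I", "'m", "a", "notary", "public")
mention = ("notary", "public")

print(term_mention)
print(is_token_substring(mention, sentence))
print(is_token_substring(("public", "notary"), sentence))
```

Under these assumptions, token order matters: ("public", "notary") contains the same tokens as ("notary", "public") but is not a substring of the sentence.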
References
2018
- https://www.programiz.com/python-programming/string
- QUOTE: A string is a sequence of characters.
A character is simply a symbol. For example, the English language has 26 characters.
Computers do not deal with characters, they deal with numbers (binary). Even though you may see characters on your screen, internally it is stored and manipulated as a combination of 0's and 1's.
This conversion of character to a number is called encoding, and the reverse process is decoding. ASCII and Unicode are some of the popular encoding used.
2009
- (Kulkarni et al., 2009) ⇒ Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, Soumen Chakrabarti. (2009). “Collective Annotation of Wikipedia Entities in Web Text.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557073.
- QUOTE: To take the first step beyond keyword-based search toward entity-based search, suitable token spans ("spots") on documents must be identified as references to real-world entities from an entity catalog.