Word Instance
A Word Instance is an instance (a linguistic utterance) of a Linguistic Word, produced at some place and time.
- AKA: Word Token, Terminal Word Mention, Surface Word.
- Context:
- It can (typically) have a Word Sense.
- It can (typically) not be divided into smaller units without changing some of its Intension or Extension, e.g. “tea bag” vs. “bag of tea”.
- It can (typically) be associated with a Word Mention Boundary.
- It can (often) be within a Linguistic Expression Instance.
- It can range from being an Atomic Word Instance (e.g. “[fire]!”) to being an Embedded Word Instance within another linguistic utterance (e.g. “The [fire] is out.”).
- It can range from being an Ambiguous Word Mention to being an Unambiguous Word Mention.
- It can range from being a Written Word Mention to being a Spoken Word Utterance to being a Signed Word.
- It can be identified by a Word Segmentation Task (see the sketch after this description).
- It can be mapped to a Word Form (such as a word form record) by a Word Mention Normalization Task.
- It can have a Word Mention Lemma (and be mapped to a lexeme lemma).
- It can be mapped to a Word Part of Speech (verb mention, noun mention, other).
- ...
- Example(s):
- All of the word mentions written in this concept description.
- “New York” in “[New York]-based Jamaicans are racking up the minutes.”
- …
- Counter-Example(s):
- a Word Form, which is an un-instantiated Abstract Concept.
- a Concept Mention, which can be composed of more than one word mention.
- a Terminal Symbol.
- See: Lemmatisation Task, Word Type, Linguistic Agent, Orthographic Word Mention.
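The following sketch makes the relation between word instances, Word Mention Boundaries, and Word Forms concrete. It is a minimal illustration in Python: the regex-based segmenter and the lowercasing normalizer are simplifying assumptions, not the method of any particular Word Segmentation Task or Word Mention Normalization Task.

```python
import re
from collections import defaultdict

def segment_word_instances(text):
    """A naive regex-based word segmenter: each match is one word instance,
    reported with its mention boundary (start and end character offsets)."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)]

def normalize_to_word_form(surface):
    """A toy normalization step: map an instance's surface string to a
    word form (here simply its lowercased spelling)."""
    return surface.lower()

text = "The fire is out. Fire!"

# Each word instance is a distinct occurrence with its own boundary ...
instances = segment_word_instances(text)
for surface, start, end in instances:
    print(f"instance {surface!r} at [{start}:{end}) -> word form {normalize_to_word_form(surface)!r}")

# ... while several instances can map to the same (un-instantiated) word form.
by_form = defaultdict(list)
for surface, start, end in instances:
    by_form[normalize_to_word_form(surface)].append((start, end))
print(by_form["fire"])  # [(4, 8), (17, 21)] -- two instances of the word form "fire"
```

A subsequent step, such as POS tagging or word-sense disambiguation, would then attach a Word Part of Speech or Word Sense to each instance rather than to the shared word form.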
References
2003
- (Mikheev, 2003) ⇒ Andrei Mikheev. (2003). “Text Segmentation.” In: (Mitkov, 2003).
- QUOTE: The first step in the majority of text processing applications is to segment text into words. The term 'word', however, is ambiguous: a word from a language's vocabulary can occur many times in the text but it is still a single individual word of the language. So there is a distinction between words of vocabulary or word types and multiple occurrences of these words in the text which are called word tokens. This is why the process of segmenting word tokens in text is called tokenization. Although the distinction between word types and word tokens is important it is usual to refer to both as 'words' whenever the context unambiguously implies the interpretation.
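To make the type/token distinction in the quote concrete, here is a minimal illustrative sketch in Python (the regex tokenizer is an assumption made for brevity, not Mikheev's method):

```python
import re
from collections import Counter

text = "to be or not to be"

# Tokenization yields word tokens: every occurrence counts separately.
tokens = re.findall(r"[a-z]+", text.lower())
print(len(tokens))      # 6 word tokens

# Word types are the distinct vocabulary items behind those tokens.
types = set(tokens)
print(len(types))       # 4 word types: to, be, or, not

# Each type can be realized by several tokens in the text.
print(Counter(tokens))  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```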