Orthographic Word Mention
An Orthographic Word Mention is a character string in a written text that is delimited by whitespace characters.
- AKA: Word Token.
- Context:
- It can be a Written Word (in Handwritten Text or Typewritten Text).
- It can be a member of a Bag-of-Tokens.
- It can be more difficult to detect in a Handwritten Text than in a Typewritten Text.
- It can contain a Contraction, e.g. “I'm” and “wouldn't've”.
- It can be detected by an Orthographic Word Segmentation Task.
- It can be associated with an Orthographic Word Form.
- Example(s):
- The four in: “[in] [so] [far] [as]”
- The two in: “[insofar] [as]”
- The one in: “[insofaras]”
- The one in: “[wouldn't've]”
- The two in: “[I'm] [home]”
- The three in: “[They] [were] [sisters-in-law]”
- …
- Counter-Example(s):
- any Multi-Word Expression.
- any Grapheme Sequence in Written Languages that do not use Word Separators.
- “[日][文][章][魚][怎][麼][說]” (~[Japan], [?], [?], [fish], [?][?], [say]).
- “[日文] [章魚] [怎麼] [說]” (~[Japanese], [octopus], [how], [say]), whose segments are Word Mentions (Vocabulary Words?).
- See: Word, Grapheme.
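The whitespace-based definition and the examples above can be illustrated with a minimal Python sketch. The function name `orthographic_word_mentions` is hypothetical (it is not from any cited source), and the sketch assumes that splitting on runs of Unicode whitespace is an acceptable approximation of "demarcated by whitespace"; it does nothing about punctuation attached to a token.

```python
def orthographic_word_mentions(text: str) -> list:
    """Return the whitespace-delimited substrings of `text`.

    Hypothetical helper for illustration only: str.split() with no
    arguments splits on runs of whitespace and discards empty strings
    at the start and end.
    """
    return text.split()


# Strings taken from the Example(s) and Counter-Example(s) lists.
samples = [
    "in so far as",
    "insofar as",
    "insofaras",
    "wouldn't've",
    "I'm home",
    "They were sisters-in-law",
    "日文章魚怎麼說",  # no word separators: the whole string is one token
]

for sample in samples:
    mentions = orthographic_word_mentions(sample)
    print(len(mentions), mentions)
```

Under this sketch the last sample comes back as a single undivided string, which is why text in a writing system without word separators is listed as a counter-example: recovering its words requires a separate word segmentation step rather than whitespace splitting.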
References
2000
- (Bauer, 2000) ⇒ Laurie Bauer. (2000). “Word.” In: “Morphology”, edited by Geert Booij, Christian Lehmann, and Joachim Mugdan. ISBN:9783110111286
- QUOTE: An orthographic word is usually defined as a unit which, in writing, is bounded by spaces on both sides …
… Similarly, it is relatively easy to show that the orthographic word does not always coincide with speakers' intuitions about word-units. There are a number of cases even in highly codified languages like the European languages where native speakers are in doubt as to whether to write something as one or more words. Alright versus all right and insofaras versus insofar as versus in so far as provide simple examples from English...
1996
- (Sproat et al, 1996) ⇒ Richard Sproat, William A. Gale, Chilin Shih, and Nancy Chang. (1996). “A Stochastic Finite-state Word-Segmentation Algorithm for Chinese.” In: Computational Linguistics, 22(3).
- QUOTE: … A moment's reflection will reveal that things are not quite that simple. There are clearly eight orthographic words in the example given, but if one were doing syntactic analysis one would probably want to consider I'm to consist of two syntactic words, namely I and am. If one is interested in translation, one would probably want to consider show up as a single dictionary word since its semantic interpretation is not trivially derivable from the meanings of show and up.
Whether a language even has orthographic words is largely dependent on the writing system used to represent the language (rather than the language itself); the notion “orthographic word” is not universal. Most languages that use Roman, Greek, Cyrillic, Armenian, or Semitic scripts, and many that use Indian-derived scripts, mark orthographic word boundaries; however, languages written in a Chinese-derived writing system, including Chinese and Japanese, as well as Indian-derived writing systems of languages like Thai, do not delimit orthographic words.
Put another way, written Chinese simply lacks orthographic words.