Characters n-Gram

A Characters n-Gram is an n-gram composed of text characters.

Context:
- It can be a member of a Character N-gram Model.
- It can represent Adjacent Characters in a Text.
- …
Example(s):
- "TEX", a 3-gram Character N-gram from (Cavnar & Trenkle, 1994).
- …
Counter-Example(s):
- "ceramics collected by" ⇒ (52) a 3-gram Word N-gram from the Google N-gram Dataset.
- "serve as the independent" ⇒ (794) a 4-gram Word N-gram from the Google N-gram Dataset.
See: Token N-gram, Bag of Character n-Grams, Words n-Gram.

References

1994

(Cavnar & Trenkle, 1994) ⇒ William B. Cavnar, and John M. Trenkle. (1994). “N-gram-based Text Categorization.” In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.
- An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character (“_”) to represent blanks.) Thus, the word “TEXT” would be composed of the following N-grams:
  - bi-grams: _T, TE, EX, XT, T_
  - tri-grams: _TE, TEX, EXT, XT_, T_ _
  - quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _
- In general, a string of length k, padded with blanks, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.
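
The padding and slicing scheme quoted above can be sketched in Python. This is a minimal illustration, not code from (Cavnar & Trenkle, 1994): the function name `char_ngrams` and its `pad` parameter are hypothetical, and the rule of one leading blank plus n-1 trailing blanks is inferred from the "TEXT" example. Trailing blanks are emitted without the spaces shown in the quotation (e.g. `T__` rather than `T_ _`).

```python
def char_ngrams(word, n, pad="_"):
    """Return the contiguous, overlapping character n-grams of a padded word.

    Assumed padding (inferred from the "TEXT" example): one pad character
    before the word and n-1 pad characters after it.
    """
    if n < 2:
        raise ValueError("this sketch covers bi-grams and longer (n >= 2)")
    padded = pad + word + pad * (n - 1)
    # Slide a window of width n across the padded string, one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, char_ngrams("TEXT", n))
```

Under this padding rule the padded string has length k+n, so the sliding window yields (k+n)-n+1 = k+1 n-grams for every n, which agrees with the count stated in the quotation.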
