Characters n-Gram
A Characters n-Gram is an n-gram composed of text characters.
- Context:
- It can be a member of a Character N-gram Model.
- It can represent Adjacent Characters in a Text.
- …
- Example(s):
- "TEX", a 3-gram Character N-gram from (Cavnar & Trenkle, 1994).
- …
- Counter-Example(s):
- "ceramics collected by" ⇒ (52) a 3-gram Word N-gram from the Google N-gram Dataset.
- "serve as the independent" ⇒ (794) a 4-gram Word N-gram from the Google N-gram Dataset.
- See: Token N-gram, Bag of Character n-Grams, Words n-Gram.
References
1994
- (Cavnar & Trenkle, 1994) ⇒ William B. Cavnar, and John M. Trenkle. (1994). “N-gram-based Text Categorization.” In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.
- An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character (“_”) to represent blanks.) Thus, the word “TEXT” would be composed of the following N-grams:
- bi-grams: _T, TE, EX, XT, T_
- tri-grams: _TE, TEX, EXT, XT_, T_ _
- quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _
- In general, a string of length k, padded with blanks, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.
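The padding scheme quoted above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `char_ngrams` and its parameters are hypothetical, and the padding rule (one leading blank and n-1 trailing blanks) is inferred from the "TEXT" example in the excerpt. The excerpt prints trailing blanks with spaces between them ("T_ _"); the sketch emits them without spaces ("T__").

```python
def char_ngrams(word: str, n: int, pad: str = "_") -> list[str]:
    """Return the contiguous, overlapping character n-grams of a padded word.

    Assumption: one pad character is prepended and n-1 pad characters are
    appended, which reproduces the "TEXT" example quoted above.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    padded = pad + word + pad * (n - 1)
    # Slide a window of width n across the padded string, one step at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, char_ngrams("TEXT", n))
```

Under this padding rule the padded string has length k+n for a word of length k, so the sliding window yields k+1 n-grams for every n, which agrees with the count stated in the excerpt.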