Character n-Gram Model
(Redirected from Character N-gram Model)
Jump to navigation
Jump to search
A Character n-Gram Model is a n-gram model of the Character N-grams in a Corpus.
- See: Word N-gram Model.
References
1994
- (Cavnar & Trenkle, 1994) ⇒ William B. Cavnar, and John M. Trenkle. (1994). “N-gram-based Text Categorization.” In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.
- An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character (“_”) to represent blanks.) Thus, the word “TEXT” would be composed of the following N-grams:
- bi-grams: _T, TE, EX, XT, T_
- tri-grams: _TE, TEX, EXT, XT_, T_ _
- quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _
- In general, a string of length [math]\displaystyle{ k }[/math], padded with blanks, will have k+1 bi-grams, k+1tri-grams, k+1 quad-grams, and so on.
- An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character (“_”) to represent blanks.) Thus, the word “TEXT” would be composed of the following N-grams: