Characters n-Gram

From GM-RKB

A Characters n-Gram is an n-gram composed of text characters.



References

1994

  • (Cavnar & Trenkle, 1994) ⇒ William B. Cavnar, and John M. Trenkle. (1994). “N-gram-based Text Categorization.” In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.
    • An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character (“_”) to represent blanks.) Thus, the word “TEXT” would be composed of the following N-grams:
      • bi-grams: _T, TE, EX, XT, T_
      • tri-grams: _TE, TEX, EXT, XT_, T_ _
      • quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _
    • In general, a string of length k, padded with blanks, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.
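
The padding scheme in the quoted passage can be sketched in a few lines of Python. This is a minimal illustration, not code from Cavnar & Trenkle: the function name `char_ngrams` and the `pad` parameter are hypothetical, and the padding rule (one leading blank, n-1 trailing blanks) is inferred from the "TEXT" example above.

```python
def char_ngrams(text, n, pad="_"):
    """Return the contiguous character n-grams of `text`, padded with blanks.

    Assumption: `pad` is a single character standing in for a blank,
    as the underscore does in the quoted example.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # One leading blank and n-1 trailing blanks, inferred from the TEXT example.
    padded = pad + text + pad * (n - 1)
    # Windows start at positions 0..len(text), giving len(text) + 1 n-grams.
    return [padded[i:i + n] for i in range(len(text) + 1)]


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, char_ngrams("TEXT", n))
```

For the word "TEXT" (k = 4) this yields five n-grams for each of n = 2, 3, 4, consistent with the k+1 count stated above. Consecutive blanks appear as `T__` and `T___` here, where the quotation prints them with spaces between the underscores.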