n-Gram Tuple
An n-Gram Tuple is a tuple that represents a string subsequence.
- Context:
- It can range from being a Unigram, to being a Bigram, to being a Trigram, and so on, based on its n-Gram Length.
- It can range from (typically) being a Contiguous n-Gram to being a Noncontiguous n-Gram.
- It can range from (typically) being an Unordered n-Gram Tuple to being an Ordered n-Gram Tuple.
- It can range from being a Text Window-based n-Gram to being a Sentence-based n-Gram to being a Document-based n-Gram.
- It can be a k-Skip n-Gram, such as a 0-Skip n-Gram or a 1-Skip n-Gram or a 2-Skip n-Gram.
- It can be the output of an n-Gram Generation System.
- It can be a member of an n-Gram Dataset (which might represent an n-Gram Model).
- Example(s):
- a Text-Item n-Gram, such as:
- a Word N-gram, which represents Adjacent Words in a string.
- a Character N-gram.
- Counter-Example(s):
- a Substring.
- See: N-tuple, Co-occurrence Statistic, Base Pairs, Text Corpus.
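The contiguous and k-skip cases listed in the Context above can be illustrated with a minimal Python sketch. It assumes one common reading of a k-Skip n-Gram: an ordered tuple of n items in which at most k items in total are skipped between the first and the last chosen item, so that k = 0 yields the ordinary contiguous n-grams. The function name `skip_ngrams` is hypothetical, and the brute-force enumeration is meant for illustration, not for large inputs.

```python
from itertools import combinations


def skip_ngrams(tokens, n, k):
    """Return ordered n-gram tuples that skip at most k items in total.

    Assumption: "k-skip" bounds the total number of skipped items
    between the first and last chosen positions (k = 0 is contiguous).
    """
    result = []
    # combinations() yields increasing index tuples, so item order is preserved.
    for idx in combinations(range(len(tokens)), n):
        skipped = (idx[-1] - idx[0] + 1) - n  # items inside the span but not chosen
        if skipped <= k:
            result.append(tuple(tokens[i] for i in idx))
    return result


words = ["a", "b", "c", "d"]
contiguous_bigrams = skip_ngrams(words, n=2, k=0)
one_skip_bigrams = skip_ngrams(words, n=2, k=1)
```

Here `contiguous_bigrams` holds the three adjacent pairs, while `one_skip_bigrams` additionally contains the pairs that jump over one item, such as `("a", "c")`.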
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/n-gram Retrieved:2015-2-6.
- In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.
An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on.
2011
- http://lucene.apache.org/java/3_5_0/api/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html
- QUOTE: A ShingleFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.
For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
This filter handles position increments > 1 by inserting filler tokens (tokens with termtext "_"). It does not handle a position increment of 0.
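The shingling behaviour described in the quote can be approximated in a few lines of Python. This is only a sketch of the idea, not the Lucene API: the function name `shingles` and its parameters are hypothetical, whitespace splitting stands in for a real tokenizer, and position increments and filler tokens are not modelled.

```python
def shingles(tokens, size=2, sep=" "):
    """Join every run of `size` adjacent tokens into one shingle string.

    Assumption: `tokens` is a plain list of strings; there is no notion
    of position increments or filler tokens as in Lucene's filter.
    """
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


sentence = "please divide this sentence into shingles"
bigram_shingles = shingles(sentence.split(), size=2)
```

With `size=2`, `bigram_shingles` contains the five two-word shingles listed in the quote, from "please divide" through "into shingles".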
1994
- (Cavnar & Trenkle, 1994) ⇒ William B. Cavnar, and John M. Trenkle. (1994). “N-gram-based Text Categorization.” In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.
- QUOTE: An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character (“_”) to represent blanks.) Thus, the word “TEXT” would be composed of the following N-grams:
- bi-grams: _T, TE, EX, XT, T_
- tri-grams: _TE, TEX, EXT, XT_, T_ _
- quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _
- In general, a string of length k, padded with blanks, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.
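The padding scheme in this quote can be reproduced with a short Python sketch. The padding rule used here (one leading blank and N-1 trailing blanks) is inferred from the "TEXT" example rather than stated explicitly in the quote, and the function name `padded_char_ngrams` is hypothetical. Under that rule, a string of length k yields exactly k+1 N-grams for every N of 2 or more.

```python
def padded_char_ngrams(word, n, pad="_"):
    """Slice `word` into overlapping character n-grams after blank-padding.

    Assumption (inferred from the "TEXT" example): one pad character is
    prepended and n-1 pad characters are appended before slicing.
    """
    if n < 2:
        raise ValueError("this sketch only covers n >= 2")
    padded = pad + word + pad * (n - 1)
    # len(padded) == len(word) + n, so this yields len(word) + 1 slices.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


bi_grams = padded_char_ngrams("TEXT", 2)    # _T, TE, EX, XT, T_
tri_grams = padded_char_ngrams("TEXT", 3)   # _TE, TEX, EXT, XT_, T__
quad_grams = padded_char_ngrams("TEXT", 4)  # _TEX, TEXT, EXT_, XT__, T___
```

Each of the three lists has five entries, matching the k+1 count for the four-character word "TEXT".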