k-Skip n-Gram
A k-Skip n-Gram is a subset of an unordered n-gram that can involve a skip operation of size [math]\displaystyle{ k }[/math].
- Context:
- It can be used to circumvent Data Sparsity.
- Example(s):
[fighting, in, insurgents]
is a 2-skip tri-gram of the string “Insurgents killed in ongoing fighting.”
- Counter-Example(s):
- See: Skip-Gram Algorithm, Text Window, String Edit Operation, Levenshtein Distance.
References
2017
- http://multithreaded.stitchfix.com/blog/2017/10/25/word-tensors/
- QUOTE: In skipgram matrix factorization, we SVD factorize a matrix like: ...
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/N-gram#Skip-Gram Retrieved:2015-2-6.
- In the field of computational linguistics, in particular language modeling, skip-grams[1] are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over.[2] They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis.
Formally, an n-gram is a consecutive subsequence of length n of some sequence of tokens w₁ … wₙ. A k-skip-n-gram is a length-n subsequence where the components occur at distance at most k from each other.
For example, in the input text:
the rain in Spain falls mainly on the plain
the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences
the in, rain Spain, in falls, Spain mainly, mainly the and on plain.
Recently, Mikolov et al. (2013) have demonstrated that skip-gram language models can be trained so that it is possible to do ″word arithmetic". In their model, for example the expression
king - man + woman
evaluates very close to queen.[3]
- In the field of computational linguistics, in particular language modeling, skip-grams[1] are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over.[2] They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis.
- ↑ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.1629
- ↑ David Guthrie et al. (2006). "A Closer Look at Skip-gram Modelling". http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf.
- ↑ Tomáš Mikolov et al. (2013). "Efficient Estimation of Word Representations in Vector Space". http://arxiv.org/abs/1301.3781.
2013
- (Mikolov et al., 2013a) ⇒ Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (2013). “Efficient Estimation of Word Representations in Vector Space.” In: Proceedings of International Conference of Learning Representations Workshop.
2006
- (Guthrie et al., 2006) ⇒ David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. (2006). “A Closer Look at Skip-gram Modelling.” In: Proceedings of the 5th international Conference on Language Resources and Evaluation (LREC-2006).
- QUOTE: We define k-skip-n-grams for a sentence [math]\displaystyle{ w_1 ... w_m }[/math] to be the set : [math]\displaystyle{ \{ w_{i_1}, w_{i_2}, ... w_{i_n} \mid \Sigma_{j=1}^n i_j - i_{j-1} \lt k \} }[/math] Skip-grams reported for a certain skip distance [math]\displaystyle{ k }[/math] allow a total of [math]\displaystyle{ k }[/math] or less skips to construct the n-gram. As such, “4-skip-n-gram” results include 4 skips, 3 skips, 2 skips, 1 skip, and 0 skips (typical n-grams formed from adjacent words).
Here is an actual sentence example showing 2-skip-bigrams and tri-grams compared to standard bi-grams and trigrams consisting of adjacent words for the sentence:
“Insurgents killed in ongoing fighting.”
Bi-grams = {insurgents killed, killed in, in ongoing, ongoing fighting}.
2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}
Tri-grams = {insurgents killed in, killed in ongoing, in ongoing fighting}.
2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.
In this example, over three times as many 2-skip-tri-grams were produced than adjacent tri-grams and this trend continues the more skips that are allowed. A typical sentence of ten words, for example, will produce 8 trigrams, but 80 4-skip-tri-grams. Sentences that are 20 words long have 18 tri-grams and 230 4-skip-tri-grams (see Table 1).
- QUOTE: We define k-skip-n-grams for a sentence [math]\displaystyle{ w_1 ... w_m }[/math] to be the set : [math]\displaystyle{ \{ w_{i_1}, w_{i_2}, ... w_{i_n} \mid \Sigma_{j=1}^n i_j - i_{j-1} \lt k \} }[/math] Skip-grams reported for a certain skip distance [math]\displaystyle{ k }[/math] allow a total of [math]\displaystyle{ k }[/math] or less skips to construct the n-gram. As such, “4-skip-n-gram” results include 4 skips, 3 skips, 2 skips, 1 skip, and 0 skips (typical n-grams formed from adjacent words).