k-Shingle Set
Jump to navigation
Jump to search
See: Shingle, k-Shingle Generation, n-Gram, n-Gram Set, n-Gram Generation.
References
2011
- http://lucene.apache.org/java/3_5_0/api/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html
- QUOTE:A ShingleFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.
For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
This filter handles position increments > 1 by inserting filler tokens (tokens with termtext "_"). It does not handle a position increment of 0.
- QUOTE:A ShingleFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.
2008
- (Manning et al., 2008) ⇒ Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. (2008). “Introduction to Information Retrieval." Cambridge University Press. ISBN:0521865719.
- http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
- QUOTE: We now describe a solution to the problem of detecting near-duplicate web pages. The answer lies in a technique known as shingling. Given a positive integer [math]\displaystyle{ k }[/math] and a sequence of terms in a document [math]\displaystyle{ d }[/math], define the [math]\displaystyle{ k }[/math]-shingles of [math]\displaystyle{ d }[/math] to be the set of all consecutive sequences of [math]\displaystyle{ k }[/math] terms in [math]\displaystyle{ d }[/math]. As an example, consider the following text: a rose is a rose is a rose. The 4-shingles for this text ([math]\displaystyle{ k=4 }[/math] is a typical value used in the detection of near-duplicate web pages) are a rose is a, rose is a rose and is a rose is. The first two of these shingles each occur twice in the text. Intuitively, two documents are near duplicates if the sets of shingles generated from them are nearly the same. We now make this intuition precise, then develop a method for efficiently computing and comparing the sets of shingles for all web pages. Let [math]\displaystyle{ S(d_j) }[/math] denote the set of shingles of document [math]\displaystyle{ d_j }[/math]. Recall the Jaccard coefficient ..., which measures the degree of overlap between the sets [math]\displaystyle{ S(d_1) }[/math] and [math]\displaystyle{ S(d_2) }[/math] as [math]\displaystyle{ \vert S(d_1) \cap S(d_2)\vert/\vert S(d_1) \cup S(d_2)\vert }[/math]; denote this by [math]\displaystyle{ J(S(d_1),S(d_2)) }[/math]. Our test for near duplication between [math]\displaystyle{ d_1 }[/math] and [math]\displaystyle{ d_2 }[/math] is to compute this Jaccard coefficient; if it exceeds a preset threshold (say, [math]\displaystyle{ 0.9 }[/math]), we declare them near duplicates and eliminate one from indexing. However, this does not appear to have simplified matters: we still have to compute Jaccard coefficients pairwise.
1997
- (Broder et al., 1997) ⇒ Andrei Z Broder, Steven C. Glassman, Mark Manasse, and Geoffrey Zweig. (1997). “Syntactic Clustering of the Web.” In: Computer Networks and ISDN Systems, 29(8-13).
- QUOTE: A contiguous subsequence contained in D is called a shingle. Given a document D we define its w-shingling [math]\displaystyle{ S(D,w) }[/math] as the set of all unique shingles of size [math]\displaystyle{ w }[/math] contained in D. So for instance the 4-shingling of (a,rose,is,a,rose,is,a,rose) is the set { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) } ... For a given shingle size, the resemblance r of two documents A and B is defined as
1981
- (Rabin, 1981) ⇒ M. O. Rabin. (1981). “Fingerprinting by Random Polynomials." Technical Report, TR-15-81. Center for Research in Computing Technology, Harvard University.