Jaccard Set Similarity Measure
A Jaccard set similarity measure is a set similarity function that is based on the ratio of the size of the set intersection to the size of the set union.
- AKA: Jaccard Index/Coefficient/Distance, J.
- Context:
- Output: a Rational Number in [0,1].
- Definition: Jaccard Similarity[math]\displaystyle{ (A,B) = }[/math] Divide(|Intersection[math]\displaystyle{ (A,B) }[/math]|, |Union[math]\displaystyle{ (A,B) }[/math]|) [math]\displaystyle{ = {{|A \cap B|}\over{|A \cup B|}}. }[/math]
- It can be converted to a Jaccard Dissimilarity Function (1-Jaccard).
- It can (typically) be used as a Set Similarity Measure.
- …
- Example(s):
- Counter-Example(s):
- See: Overlap Coefficient, String Distance Function, Intersection over Union.
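The Definition and Output items above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `jaccard_similarity` and `jaccard_dissimilarity` are hypothetical, `fractions.Fraction` is used only to keep the result an exact rational number in [0,1], and returning 1 for two empty sets is an assumed convention (the one given in the Wikipedia excerpt in the References).

```python
from fractions import Fraction


def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| as an exact rational number in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return Fraction(1)
    return Fraction(len(a & b), len(union))


def jaccard_dissimilarity(a, b):
    """Return the complementary dissimilarity, 1 - Jaccard similarity."""
    return 1 - jaccard_similarity(a, b)


A = {1, 2, 3, 4}
B = {3, 4, 5}
print(jaccard_similarity(A, B))      # 2/5  (|{3, 4}| / |{1, 2, 3, 4, 5}|)
print(jaccard_dissimilarity(A, B))   # 3/5
```

The two values always sum to 1, which is the conversion described in the Context items.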
References
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Jaccard_index Retrieved:2017-6-2.
- The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets: : [math]\displaystyle{ J(A,B) = { {|A \cap B|} \over {|A \cup B|} } = { {|A \cap B|} \over {|A| + |B| - |A \cap B|} }. }[/math] (If A and B are both empty, we define J(A,B) = 1.) : [math]\displaystyle{ 0\le J(A,B)\le 1. }[/math] The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union: : [math]\displaystyle{ d_J(A,B) = 1 - J(A,B) = { { |A \cup B| - |A \cap B| } \over |A \cup B| }. }[/math] An alternate interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference [math]\displaystyle{ A \triangle B = (A \cup B) - (A \cap B) }[/math] to the union.
This distance is a metric on the collection of all finite sets. [1]
There is also a version of the Jaccard distance for measures, including probability measures. If [math]\displaystyle{ \mu }[/math] is a measure on a measurable space [math]\displaystyle{ X }[/math], then we define the Jaccard coefficient by [math]\displaystyle{ J_\mu(A,B) = { {\mu(A \cap B)} \over {\mu(A \cup B)} } }[/math], and the Jaccard distance by [math]\displaystyle{ d_\mu(A,B) = 1 - J_\mu(A,B) = { {\mu(A \triangle B)} \over {\mu(A \cup B)} } }[/math]. Care must be taken if [math]\displaystyle{ \mu(A \cup B) = 0 }[/math] or [math]\displaystyle{ \infty }[/math], since these formulas are not well defined in that case.
The MinHash min-wise independent permutations locality sensitive hashing scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity coefficient of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function.
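The MinHash idea described above can be illustrated with a small Python sketch. It is a simplified approximation under stated assumptions, not a production scheme: true min-wise independent permutations are replaced by salted SHA-1 hashes, and the names `_salted_hash`, `minhash_signature`, `estimate_jaccard`, and the default of 256 hash functions are all hypothetical choices made for this example.

```python
import hashlib


def _salted_hash(seed, item):
    """Hash an item under a given seed (stand-in for a random permutation)."""
    data = f"{seed}:{item!r}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=256):
    """Return a fixed-size signature: the minimum hash value per seed."""
    items = set(items)
    if not items:
        raise ValueError("a MinHash signature needs a non-empty set")
    return [min(_salted_hash(seed, x) for x in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


A = set(range(0, 100))
B = set(range(50, 150))
estimate = estimate_jaccard(minhash_signature(A), minhash_signature(B))
exact = len(A & B) / len(A | B)   # 50 / 150
print(f"estimate={estimate:.3f} exact={exact:.3f}")
```

Each signature has constant size regardless of the size of the set, and for an idealized hash family the probability that two sets agree at a signature position equals their Jaccard similarity, so longer signatures give tighter estimates.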
2009
- http://alias-i.com/lingpipe/demos/tutorial/stringCompare/read-me.html
- QUOTE: Jaccard Distance: Another common method for comparing strings, which is actually much more efficient to implement, is the so-called "Jaccard distance". The Jaccard distance implementation in spell.JaccardDistance operates at a token level, comparing two strings by first tokenizing them and then dividing the number of tokens shared by the strings by the total number of tokens.
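A token-level comparison of this kind can be sketched in Python as follows. This is not the LingPipe API: the function names are hypothetical, and lowercasing plus whitespace splitting is an assumed, deliberately naive tokenizer. The sketch reads "the total number of tokens" as the number of distinct tokens across both strings and reports the distance as one minus the shared-token ratio.

```python
def token_jaccard_similarity(s, t):
    """Jaccard similarity between the sets of whitespace tokens of s and t."""
    tokens_s = set(s.lower().split())
    tokens_t = set(t.lower().split())
    union = tokens_s | tokens_t
    if not union:
        # Assumed convention: two token-free strings are treated as identical.
        return 1.0
    return len(tokens_s & tokens_t) / len(union)


def token_jaccard_distance(s, t):
    """Jaccard distance: one minus the token-level similarity."""
    return 1.0 - token_jaccard_similarity(s, t)


s = "the quick brown fox"
t = "the lazy brown dog"
# Shared tokens: {the, brown}; distinct tokens overall: 6.
print(token_jaccard_similarity(s, t))   # 2/6, about 0.333
print(token_jaccard_distance(s, t))     # 4/6, about 0.667
```

Because the strings are reduced to sets, word order and repeated tokens do not affect the result.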
2003
- (Cohen et al., 2003) ⇒ William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. (2003). “A Comparison of String Distance Metrics for Name-Matching Tasks.” In: Workshop on Information Integration on the Web (IIWeb-03).
- QUOTE: Two strings [math]\displaystyle{ s }[/math] and [math]\displaystyle{ t }[/math] can also be considered as multisets (or bags) of words (or tokens). We also considered several token-based distance metrics. The Jaccard similarity between the word sets [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] is simply [math]\displaystyle{ { {|S \cap T|} \over {|S \cup T|} } }[/math]. TFIDF or cosine similarity, which is widely used in the information retrieval community … (Footnote 1 in the source: Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions.)
- ↑ Sven Kosub, "A note on the triangle inequality for the Jaccard distance" arXiv:1612.02696