Set Distance Function
A set distance function is a distance function between two finite sets.
- AKA: Set Overlap Measure, Set Similarity Function.
- Context:
- Input: (set [math]\displaystyle{ A }[/math], set [math]\displaystyle{ B }[/math]).
- It can produce a high value for Dissimilar Sets and a low value for Similar Sets.
- Example(s):
- Counter-Example(s):
- See: Intersection Set Operation, Bag-of-Words Vector.
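The input/output behavior listed in the Context above can be illustrated with a minimal sketch. It assumes the Jaccard distance as one possible set distance function; the function name `jaccard_distance` and the convention that two empty sets have distance 0 are choices made for this illustration, not part of the definition above.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two finite sets.

    Assumption for this sketch: two empty sets are treated as identical
    (distance 0.0), which avoids a division by zero.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    # Similar sets give a low value, dissimilar sets a high value.
    print(jaccard_distance({"a", "b", "c"}, {"a", "b", "c"}))
    print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))
    print(jaccard_distance({"a", "b"}, {"x", "y"}))
```

The value ranges from 0 (identical sets) to 1 (disjoint sets), so it fits the "high for dissimilar, low for similar" behavior described above.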
References
2008
- http://alias-i.com/lingpipe/demos/tutorial/stringCompare/read-me.html
- Jaccard Distance: Another common method for comparing strings, which is actually much more efficient to implement, is the so-called "Jaccard distance". The Jaccard distance implementation in spell.JaccardDistance operates at a token level, comparing two strings by first tokenizing them and then dividing the number of tokens shared by the strings by the total number of tokens.
- TF/IDF Distance: LingPipe implements a second kind of token-based distance in the class spell.TfIdfDistance. By varying tokenizers, different behaviors may be had with the same underlying implementation. TF/IDF distance is based on vector similarity (using the cosine measure of angular similarity) over dampened and discriminatively weighted term frequencies. The basic idea is that two strings are more similar if they contain many of the same tokens with the same relative number of occurrences of each. Tokens are weighted more heavily if they occur in few documents. See the class documentation for a full definition of TF/IDF distance.
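The two token-based distances quoted above can be sketched in Python. This is not LingPipe's code or API: the whitespace tokenizer, the square-root damping of term counts, the smoothed inverse document frequency, and all function names are assumptions made for illustration. The ratio of shared tokens to total tokens described in the quote is an overlap score, so the sketch returns one minus that ratio to obtain a distance.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for a
    # configurable tokenizer.
    return text.lower().split()


def token_jaccard_distance(s1, s2):
    """1 minus (shared distinct tokens / total distinct tokens)."""
    a, b = set(tokenize(s1)), set(tokenize(s2))
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return 1.0 - len(a & b) / len(a | b)


def tfidf_vector(text, doc_freq, num_docs):
    """Map each token to sqrt(count) * idf.

    Assumptions: sqrt damping, and a smoothed idf so that unseen or
    ubiquitous tokens never receive a zero or undefined weight.
    """
    vec = {}
    for tok, count in Counter(tokenize(text)).items():
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(tok, 0))) + 1.0
        vec[tok] = math.sqrt(count) * idf
    return vec


def tfidf_distance(s1, s2, corpus):
    """1 minus the cosine of the two TF/IDF-weighted token vectors."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count documents, not occurrences
    v1 = tfidf_vector(s1, doc_freq, len(corpus))
    v2 = tfidf_vector(s2, doc_freq, len(corpus))
    dot = sum(w * v2.get(tok, 0.0) for tok, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0 if n1 == n2 else 1.0
    return 1.0 - dot / (n1 * n2)


if __name__ == "__main__":
    corpus = ["the cat", "the dog", "the bird", "the fish"]
    print(token_jaccard_distance("the cat sat", "the cat ran"))
    print(tfidf_distance("the cat", "a cat", corpus))
    print(tfidf_distance("the cat", "the dog", corpus))
```

In this toy corpus "the" occurs in every document and "cat" in only one, so sharing "cat" brings two strings closer than sharing "the", which mirrors the statement that tokens are weighted more heavily if they occur in few documents.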