Jaro-Winkler Distance Measure
A Jaro-Winkler Distance Measure is a unit edit distance measure that modifies the weights of poorly matching string pairs that share a common string prefix.
- AKA: Jaro String Distance Function, Jaro Distance Measure, Jaro-Winkler String Comparator.
- Context:
- It can adjust agreement weights when two strings do not agree on a Character.
- It can be well suited for short strings (such as person names) when there are large amount of Typographical Errors.
- It can lose its semantic meaning when the strings include multiple tokens or words and different words may carry different meanings.
- It can adjust for prefixes.
- It can adjust for long-strings.
- …
- Counter-Example(s):
- See: Distance Function, Record Linkage, Edit Distance.
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Jaro–Winkler_distance Retrieved:2015-3-24.
- In computer science and statistics, the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995), a type of string edit distance, and was developed in the area of record linkage (duplicate detection) (Winkler, 1990). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.
2009
- http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#jarowinkler
- QUOTE: This is an extension of the Jaro distance metric, from the work of Winkler in (1999). This extension modifies the weights of poorly matching pairs s, t that share a common prefix.
- http://alias-i.com/lingpipe/demos/tutorial/stringCompare/read-me.html
- QUOTE: String comparison attempts to measure the similarity between strings. This is useful for applications ranging from database deduplication and record linkage to terminology extraction, spell checking, and k-nearest-neighbors classifiers. In this tutorial, we demonstrate the ways in which string comparisons are used in LingPipe.
… Jaro-Winkler Distance There are a family of distance measures defined by the U.S. Census Bureau for comparing single person names. The original metric was defined by Matt Jaro and later refined by Bill Winkler.
- QUOTE: String comparison attempts to measure the similarity between strings. This is useful for applications ranging from database deduplication and record linkage to terminology extraction, spell checking, and k-nearest-neighbors classifiers. In this tutorial, we demonstrate the ways in which string comparisons are used in LingPipe.
2006
- William E. Winkler. (2006). “Overview of Record Linkage and Current Research Directions". Research Report Series, RRS.
- QUOTE: Jaro (1989) introduced a string comparator that accounts for insertions, deletions, and transpositions. The basic Jaro algorithm has three components: (1) compute the string lengths, (2) find the number of common characters in the two strings, and (3) find the number of transpositions. The definition of common is that the agreeing character must be within half the length of the shorter string. The definition of transposition is that the character from one string is out of order with the corresponding common character from the other string.
2003
- (Cohen et al., 2003) ⇒ William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. (2003). “A Comparison of String Distance Metrics for Name-Matching Tasks.” In: Workshop on Information Integration on the Web (IIWeb-03).
- QUOTE: A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings. Given strings [math]\displaystyle{ s = a_1\; \cdots \; a_K }[/math] and [math]\displaystyle{ t= b_1 \;\cdots\;b_L }[/math], define a character [math]\displaystyle{ a_i }[/math] in [math]\displaystyle{ s }[/math] to be common with [math]\displaystyle{ t }[/math] there is a [math]\displaystyle{ b_j = a_i }[/math] in [math]\displaystyle{ t }[/math] such that [math]\displaystyle{ i − H \leq j \leq i + H }[/math], where [math]\displaystyle{ H = \frac{min(|s|\cdot|t|)}{2} }[/math] . Let [math]\displaystyle{ s' = a_1' \; \cdots \; a_{K'}' }[/math] be the characters in [math]\displaystyle{ s }[/math] which are common with [math]\displaystyle{ t }[/math] (in the same order they appear in [math]\displaystyle{ s }[/math]) and let [math]\displaystyle{ t'= b_1' \;\cdots\;b_{L'}' }[/math] be analogous; now define a transposition for [math]\displaystyle{ s' }[/math] , [math]\displaystyle{ t' }[/math] to be a position [math]\displaystyle{ i }[/math] such that [math]\displaystyle{ a'_i \neq b'_i }[/math]. Let [math]\displaystyle{ T_{s',t'} }[/math] be half the number of transpositions for [math]\displaystyle{ s' }[/math] and [math]\displaystyle{ t' }[/math] . The Jaro similarity metric for [math]\displaystyle{ s }[/math] and [math]\displaystyle{ t }[/math] is
[math]\displaystyle{ Jaro(s, t) = \frac{1}{3}\cdot \left( \frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| − T_{s',t'}}{|s'|}\right) }[/math]
A variant of this due to Winkler (1999) also uses the length [math]\displaystyle{ P }[/math] of the longest common prefix of [math]\displaystyle{ s }[/math] and [math]\displaystyle{ t }[/math]. Letting [math]\displaystyle{ P'= max(P, 4) }[/math] we define
[math]\displaystyle{ Jaro-Winkler(s,t) = Jaro(s, t) + \frac{P'}{10}\cdot (1 − Jaro(s, t)) }[/math].
The Jaro and Jaro-Winkler metrics seem to be intended primarily for short strings (e.g., personal first or last names.)
- QUOTE: A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings. Given strings [math]\displaystyle{ s = a_1\; \cdots \; a_K }[/math] and [math]\displaystyle{ t= b_1 \;\cdots\;b_L }[/math], define a character [math]\displaystyle{ a_i }[/math] in [math]\displaystyle{ s }[/math] to be common with [math]\displaystyle{ t }[/math] there is a [math]\displaystyle{ b_j = a_i }[/math] in [math]\displaystyle{ t }[/math] such that [math]\displaystyle{ i − H \leq j \leq i + H }[/math], where [math]\displaystyle{ H = \frac{min(|s|\cdot|t|)}{2} }[/math] . Let [math]\displaystyle{ s' = a_1' \; \cdots \; a_{K'}' }[/math] be the characters in [math]\displaystyle{ s }[/math] which are common with [math]\displaystyle{ t }[/math] (in the same order they appear in [math]\displaystyle{ s }[/math]) and let [math]\displaystyle{ t'= b_1' \;\cdots\;b_{L'}' }[/math] be analogous; now define a transposition for [math]\displaystyle{ s' }[/math] , [math]\displaystyle{ t' }[/math] to be a position [math]\displaystyle{ i }[/math] such that [math]\displaystyle{ a'_i \neq b'_i }[/math]. Let [math]\displaystyle{ T_{s',t'} }[/math] be half the number of transpositions for [math]\displaystyle{ s' }[/math] and [math]\displaystyle{ t' }[/math] . The Jaro similarity metric for [math]\displaystyle{ s }[/math] and [math]\displaystyle{ t }[/math] is
1999
- (Winkler, 1999) ⇒ William E. Winkler. (1999). “The State of Record Linkage and Current Research Problems.” In: Statistics of Income Division, Internal Revenue Service Publication R99/04.
1997
- Edward H. Porter, and William E. Winkler. (1997). “Approximate String Comparison and its Effect on an Advanced Record Linkage Systems. U.S. Bureau of the Census, Research Report.
1995
- Matthew A. Jaro. (1995). “Probabilistic Linkage of Large Public Health Data File.” In: Statistics in Medicine, 14(5-7).
1990
- (Winkler, 1990) ⇒ William E. Winkler. (1990). “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.” In: Proceedings of the Section on Survey Research Methods, American Statistical Association.
1989
- (Jaro, 1989) ⇒ Matthew A. Jaro. (1989). “Advances in record linking methodology as applied to the 1985 census of Tampa Florida.” In: Journal of the American Statistical Society 84 (406): 414–20.