String Edit Distance Function
A String Edit Distance Function is a string metric that is a edit distance function (based on string edit operations).
- AKA: String Edit Distance Measure.
- Context:
- It can produce a four-tuple consisting of a sequence of edit operations, two sequences of string positions, and a sequence of FSM states. (McCallum, Bellare & Pereira, 2005).
- It can use an n-gram Index to speed up performance. This requires the enumeration of all query n-grams and Lexicon n-grams.
- Example(s):
- Counter-Example(s):
- See: Duplicate Record Detection Algorithm.
References
2012
- http://en.wikipedia.org/wiki/Edit_distance
- QUOTE: In information theory and computer science, the edit distance between two strings of characters generally refers to the Levenshtein distance[1]. However, according to Nico Jacobs, “The term ‘edit distance’ is sometimes used to refer to the distance in which insertions and deletions have equal cost and replacements have twice the cost of an insertion”[2].
It may also refer to the whole class of string metrics that measure distance as the (weighted or unweighted) number of operations required to transform a string into another. There are several different ways to define an edit distance, depending on which edit operations are allowed: replace, delete, insert, transpose, and so on. There are algorithms to calculate its value under various definitions:
- Hamming distance.
- Levenshtein distance (the most common definition, calculated by Hirschberg's algorithm or the Wagner–Fischer algorithm)
- Damerau–Levenshtein distance.
- Jaro–Winkler distance
- QUOTE: In information theory and computer science, the edit distance between two strings of characters generally refers to the Levenshtein distance[1]. However, according to Nico Jacobs, “The term ‘edit distance’ is sometimes used to refer to the distance in which insertions and deletions have equal cost and replacements have twice the cost of an insertion”[2].
- ↑ Algorithms and Theory of Computation Handbook, CRC Press LLC, 1999, "Levenshtein distance", in Dictionary of Algorithms and Data Structures, Paul E. Black, ed., U.S. National Institute of Standards and Technology, 14 August 2008 (accessed 31 October 2011).
- ↑ See p. 190 in Jacobs, Nico “Relational Sequence Learning and User Modelling”. Katholieke Universiteit Leuven, Faculteit Wetenschappen, Faculteit Toegepaste Wetenschappen; Departement Computerwetenschappen, Leuven, Oct. 2004, xvi + 235 pp.
2009
- Wiktionary.
- String distance: (computer science) Any of several metrics that represent the degree of similarity betwen two strings of characters. The string distance between 'here' and 'there' is 1 (insertion of a t).
- www.stanford.edu/class/cs276/handouts/lecture3-tolerantretrieval.ppt
- Given two strings S1 and S2, the minimum number of operations to convert one to the other Operations are typically character-level Insert, Delete, Replace, (Transposition) E.g., the edit distance
from dof to dog is 1
From cat to act is 2 (Just 1 with transpose.)
from cat to dog is 3.
- Generally found by dynamic programming.
Weighted edit distance can encode background knowledge such as that the letter "m" is more likely to be mis-typed as the letter "n" than as the letter "q".
- Given two strings S1 and S2, the minimum number of operations to convert one to the other Operations are typically character-level Insert, Delete, Replace, (Transposition) E.g., the edit distance
2005
- (McCallum, Bellare & Pereira, 2005) ⇒ Andrew McCallum, Kedar Bellare, and Fernando Pereira. (2005). “A conditional random field for discriminatively-trained finite-state string edit distance.” In: Proceedings of the Conference on Uncertainty in AI (UAI 2005).
- QUOTE: Let [math]\displaystyle{ x = x_1 ... x_m }[/math] and [math]\displaystyle{ y = y_1 ... y_n }[/math] be two strings or symbol sequences. This pair of input strings is associated with an output label [math]\displaystyle{ z ∈ {0, 1} }[/math] indicating whether or not the strings should be considered a match (1) or a mismatch (0). [1] As we now explain, our model scores alignments between [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] as to whether they are a match or a mismatch. An alignment a is a four-tuple consisting of a sequence of edit operations, two sequences of string positions, and a sequence of FSM states.
- ↑ One could also straight-forwardly imagine a different regression-based scenario in which [math]\displaystyle{ z }[/math] is real-valued, or also a ranking-based criteria, in which two pairs are provided and [math]\displaystyle{ z }[/math] indicates which pair of strings should be considered closer.
2003
- (Cohen et al., 2003) ⇒ William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. (2003). “A Comparison of String Distance Metrics for Name-Matching Tasks.” In: Workshop on Information Integration on the Web (IIWeb-03).
- We will consider two edit-distance functions. The simple Levenstein distance assigns a unit cost to all edit operations. As an example of a more complex well-tuned distance function, we also consider the Monger-Elkan distance function (Monge & Elkan 1996), which is an affine1 variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters, and scaled to the interval [0,1].
2000
- (Zhu & Ungar, 2000) ⇒ J. J. Zhu, and Lyle H. Ungar. (2000). “String Edit Analysis for Merging Databases].” In: KDD 2000 Workshop on Text Mining.
1997
- (Porter & Winkler, 1997) ⇒ Edward H. Porter, and William E. Winkler. (1997). “Approximate String Comparison and its Effect on an Advanced Record Linkage Systems. U.S. Bureau of the Census, Research Report.
1990
- (Winkler, 1990) ⇒ William E. Winkler. (1990). “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.” In: Proceedings of the Section on Survey Research Methods, American Statistical Association, 354-359.
1981
- (Smith & Waterman, 1981) ⇒ T.F. Smith, and M.S. Waterman. (1981). “Identification of Common Molecular Subsequences.” In: Journal of Molecular Biology, 147.