String Edit Distance Function

AKA: String Edit Distance Measure.
Context:
- It can produce a four-tuple consisting of a sequence of edit operations, two sequences of string positions, and a sequence of FSM states. (McCallum, Bellare & Pereira, 2005).
- It can use an n-gram Index to speed up performance. This requires the enumeration of all query n-grams and Lexicon n-grams.
Example(s):
Counter-Example(s):
- Graph Edit Distance Measure.
See: Duplicate Record Detection Algorithm.

References

↑ Algorithms and Theory of Computation Handbook, CRC Press LLC, 1999, "Levenshtein distance", in Dictionary of Algorithms and Data Structures, Paul E. Black, ed., U.S. National Institute of Standards and Technology, 14 August 2008 (accessed 31 October 2011).
↑ See p. 190 in Jacobs, Nico “Relational Sequence Learning and User Modelling”. Katholieke Universiteit Leuven, Faculteit Wetenschappen, Faculteit Toegepaste Wetenschappen; Departement Computerwetenschappen, Leuven, Oct. 2004, xvi + 235 pp.

Wiktionary.
- String distance: (computer science) Any of several metrics that represent the degree of similarity betwen two strings of characters. The string distance between 'here' and 'there' is 1 (insertion of a t).

(McCallum, Bellare & Pereira, 2005) ⇒ Andrew McCallum, Kedar Bellare, and Fernando Pereira. (2005). “A conditional random field for discriminatively-trained finite-state string edit distance.” In: Proceedings of the Conference on Uncertainty in AI (UAI 2005).
- QUOTE: Let [math]\displaystyle{ x = x_1 ... x_m }[/math] and [math]\displaystyle{ y = y_1 ... y_n }[/math] be two strings or symbol sequences. This pair of input strings is associated with an output label [math]\displaystyle{ z ∈ {0, 1} }[/math] indicating whether or not the strings should be considered a match (1) or a mismatch (0). ^[1] As we now explain, our model scores alignments between [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] as to whether they are a match or a mismatch. An alignment a is a four-tuple consisting of a sequence of edit operations, two sequences of string positions, and a sequence of FSM states.

↑ One could also straight-forwardly imagine a different regression-based scenario in which [math]\displaystyle{ z }[/math] is real-valued, or also a ranking-based criteria, in which two pairs are provided and [math]\displaystyle{ z }[/math] indicates which pair of strings should be considered closer.

(Cohen et al., 2003) ⇒ William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. (2003). “A Comparison of String Distance Metrics for Name-Matching Tasks.” In: Workshop on Information Integration on the Web (IIWeb-03).
- We will consider two edit-distance functions. The simple Levenstein distance assigns a unit cost to all edit operations. As an example of a more complex well-tuned distance function, we also consider the Monger-Elkan distance function (Monge & Elkan 1996), which is an affine1 variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters, and scaled to the interval [0,1].

(Zhu & Ungar, 2000) ⇒ J. J. Zhu, and Lyle H. Ungar. (2000). “String Edit Analysis for Merging Databases].” In: KDD 2000 Workshop on Text Mining.

(Porter & Winkler, 1997) ⇒ Edward H. Porter, and William E. Winkler. (1997). “Approximate String Comparison and its Effect on an Advanced Record Linkage Systems. U.S. Bureau of the Census, Research Report.

(Winkler, 1990) ⇒ William E. Winkler. (1990). “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.” In: Proceedings of the Section on Survey Research Methods, American Statistical Association, 354-359.

(Smith & Waterman, 1981) ⇒ T.F. Smith, and M.S. Waterman. (1981). “Identification of Common Molecular Subsequences.” In: Journal of Molecular Biology, 147.