String Distance Measure
A string distance measure is a distance measure that is restricted to string items.
- AKA: String Similarity, String Comparison, String Metric.
- Context:
- It can range from being an Equal-Length String Distance Measure to being a Non-Equal-Length String Distance Measure.
- It can be applied by a String Distance Software System.
- Example(s):
- a String Edit Distance Function.
- Hamming("David L. Edwards","David L. Emerson")=6
- Soundex("Dave","David") = ??, a Word Distance Function.
- Levenshtein("David L. Edwards","David L. Emerson")=??
- Damerau-Levenshtein("David L. Edwards","David L. Emerson")=??
- General Levenshtein("David L. Edwards","David L. Emerson")=??
- Jaro-Winkler("David L. Edwards","David L. Emerson")=??
- Counter-Example(s):
- See: String Length, Approximate String Matching Algorithm, Knowledge Integration, Approximate String Matching.
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/String_metric Retrieved:2015-3-24.
- In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. Necessary requirement for a string metric (e.g. in contrast to string matching) is fulfillment of the triangle inequality. For example the strings "Sam" and "Samuel" can be considered to be close. A string metric provides a number indicating an algorithm-specific indication of distance.
The most widely known string metric is a rudimentary one called the Levenshtein Distance (also known as Edit Distance). It operates between two input strings, returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.
A widespread example of a string metric is DNA sequence analysis and RNA analysis, which are performed by optimized string metrics to identify matching sequences.
String metrics are used heavily in information integration and are currently used in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database data deduplication, data mining, Web interfaces, e.g. Ajax-style suggestions as you type, data integration, and semantic knowledge integration.
- In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. Necessary requirement for a string metric (e.g. in contrast to string matching) is fulfillment of the triangle inequality. For example the strings "Sam" and "Samuel" can be considered to be close. A string metric provides a number indicating an algorithm-specific indication of distance.
- http://en.wikipedia.org/wiki/String_metric#List_of_string_metrics
- Sørensen–Dice coefficient.
- Hamming distance.
- Levenshtein distance and Damerau–Levenshtein distance.
- Block distance or L1 distance or City block distance.
- Simple matching coefficient (SMC)
- Jaccard similarity or Jaccard coefficient or Tanimoto coefficient.
- Most frequent k characters.
- Tversky index.
- Overlap coefficient.
- Variational distance.
- Hellinger distance or Bhattacharyya distance.
- Information radius (Jensen–Shannon divergence)
- Skew divergence.
- Confusion probability.
- Tau metric, an approximation of the Kullback–Leibler divergence.
- Fellegi and Sunters metric (SFS)
- Maximal matches.
- Lee distance
2009
- Wiktionary.
- String distance: (computer science) Any of several metrics that represent the degree of similarity betwen two strings of characters. The string distance between 'here' and 'there' is 1 (insertion of a t).
- http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
- In my investigations into string metrics, similarity metrics and the like I have developed an open source library of Similarity metrics called SimMetrics. SimMetrics is an open source java library of Similarity or Distance Metrics, e.g. Levenshtein distance, that provide float based similarity measures between String Data. All metrics return consistant measures rather than unbounded similarity scores. This open source library is hosted at http://sourceforge.net/projects/simmetrics/. The JavaDoc's of SimMetrics are detailed here.
- http://alias-i.com/lingpipe/demos/tutorial/stringCompare/read-me.html
- String comparison attempts to measure the similarity between strings. This is useful for applications ranging from database deduplication and record linkage to terminology extraction, spell checking, and k-nearest-neighbors classifiers. In this tutorial, we demonstrate the ways in which string comparisons are used in LingPipe.
- Damerau-Levenstein Distance The simplest form of edit distances were introduced in the by Damerau (1964) and Levenshtein (1965). The distance between two strings is defined as the minimal number of edits required to convert one into the other. In the simpler Levenshtein distance, the allowable edit operations are the deletion of a single character, the insertion of a single character and the substitutione of one character for another. In Damerau's version, transposition is also an allowable edit.
- Jaccard Distance Another common method for comparing strings, which is actually much more efficient to implement, is the so-called "Jaccard distance". The Jaccard distance implementation in spell.JaccardDistance operates at a token level, comparing two strings by first tokenizing them and then dividing the number of tokens shared by the strings by the total number of tokens.
- Jaro-Winkler Distance There are a family of distance measures defined by the U.S. Census Bureau for comparing single person names. The original metric was defined by Matt Jaro and later refined by Bill Winkler.
- TF/IDF Distance LingPipe implements a second kind of token-based distance in the class spell.TfIdfDistance. By varying tokenizers, different behaviors may be had with the same underlying implementation. TF/IDF distance is based on vector similarity (using the cosine measure of angular similarity) over dampened and discriminatively weighted term frequencies. The basic idea is that two strings are more similar if they contain many of the same tokens with the same relative number of occurrences of each. Tokens are weighted more heavily if they occur in few documents. See the class documentation for a full definition of TF/IDF distance.
- http://www.algorithmic-solutions.info/leda_manual/distance.html
- An instance D of the data type distance is an object maintaining two strings and providing several string distance functions.
- The Hamming distance between two strings of equal length is the number of positions for which the corresponding symbols are different.
- The Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
- The General Levenshtein distance is given by the minimum sum of the costs over a sequence of operations needed to transform one string into the other, where operation is an insertion, deletion, or substitution and a cost is assigned to each of the operations.
- The Damerau-Levenshtein distance is an extension of Levenshtein distance that counts transposition as a single edit operation. Strictly speaking, the Damerau-Levenshtein distance is equal to the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other.
2003
- (Cohen et al., 2003) ⇒ William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. (2003). “A Comparison of String Distance Metrics for Name-Matching Tasks.” In: Proceedings of IJCAI 2003.
2002
- (Navarro & Raffinot, 2002) ⇒ Gonzalo Navarro, and Mathieu Raffinot. (2002). “Flexible Pattern Matching in Strings." Cambridge University Press.
2001
- (Navarro, 2001) ⇒ Gonzalo Navarro. (2001). “A Guided Tour to Approximate String Matching.” In: ACM Computing Surveys (CSUR), 33(1). doi:10.1145/375360.375365
1974
- (Wagner & Fischer, 1974) ⇒ Robert A. Wagner, and Michael J. Fischer. (1974). “The string to string correction problem.” In: Journal of the ACM (JACM), 21(1).
1970
- (Needleman & Wunsch, 1970) ⇒ Saul Needleman, and Christian Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” In: J. Mol Biol., 48(3).
- (Morgan, 1970) ⇒ Howard L. Morgan. (1970). “Spelling Correction in Systems Programs.” In: Communications of the ACM, 13(2).
1966
- (Levenshtein, 1966) ⇒ V. I. Levenshtein. (1966). “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals.” In: Soviet Physics Doklady.
1964
- (Damerau, 1964) ⇒ Fred J. Damerau. (1964). “A Technique for Computer Detection and Correction of Spelling Errors.” In: Communications of the ACM, 7(3).