2003 AComparisonOfStringEditDistMetrics

From GM-RKB

Subject Headings: String Edit Distance Function, String Distance Function, Levenshtein Distance Function, Monge-Elkan Distance Function, TF-IDF Distance Function, Jaro Distance Function, SecondString System, Entity Reference Normalization Algorithm.

Notes

Cited By

Quotes

Abstract

Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community.

Introduction

The task of matching entity names has been explored by a number of communities, including statistics, databases, and artificial intelligence. Each community has formulated the problem differently, and different techniques have been proposed. In statistics, a long line of research has been conducted in probabilistic record linkage, largely based on the seminal paper by Fellegi and Sunter (1969). This paper formulates entity matching as a classification problem, where the basic goal is to classify entity pairs as matching or non-matching. Fellegi and Sunter propose using largely unsupervised methods for this task, based on a feature-based representation of pairs which is manually designed and to some extent problem-specific. These proposals have been, by and large, adopted by subsequent researchers, although often with elaborations of the underlying statistical model (Jaro 1989; 1995; Winkler 1999; Larsen 1999; Belin & Rubin 1997). These methods have been used to match individuals and/or families between samples and censuses, e.g., in evaluation of the coverage of the U.S. decennial census; or between administrative records and survey data bases, e.g., in the creation of an anonymized research data base combining tax information from the Internal Revenue Service and data from the Current Population Survey.

In the database community, some work on record matching has been based on knowledge-intensive approaches (Hernandez & Stolfo 1995; Galhardas et al. 2000; Raman & Hellerstein 2001). However, the use of string-edit distances as a general-purpose record matching scheme was proposed by Monge and Elkan (Monge & Elkan 1997; 1996), and in previous work, we proposed use of the TFIDF distance metric for the same purpose (Cohen 2000). In the AI community, supervised learning has been used for learning the parameters of string-edit distance metrics (Ristad & Yianilos 1998; Bilenko & Mooney 2002) and combining the results of different distance functions (Tejada, Knoblock, & Minton 2001; Cohen & Richman 2002; Bilenko & Mooney 2002). More recently, probabilistic object identification methods have been adapted to matching tasks (Pasula et al. 2002). In these communities there has been more emphasis on developing “autonomous” matching techniques which require little or no configuration for a new task, and less emphasis on developing “toolkits” of methods that can be applied to new tasks by experts.

Recently, we have begun implementing an open-source, Java toolkit of name-matching methods which includes a variety of different techniques. In this paper we use this toolkit to conduct a comparison of several string distances on the tasks of matching and clustering lists of entity names.

This experimental use of string distance metrics, while similar to previous experiments in the database and AI communities, is a substantial departure from their usual use in statistics. In statistics, databases tend to have more structure and specification, by design. Thus the statistical literature on probabilistic record linkage represents pairs of entities not by pairs of strings, but by vectors of “match features” such as names and categories for variables in survey databases. By developing appropriate match features, and appropriate statistical models of matching and non-matching pairs, this approach can achieve better matching performance (at least potentially).

The use of string distances considered here is most useful for matching problems with little prior knowledge, or ill-structured data. Better string distance metrics might also be useful in the generation of “match features” in more structured database situations.

Edit-distance like functions

Distance functions map a pair of strings [math]\displaystyle{ s }[/math] and [math]\displaystyle{ t }[/math] to a real number [math]\displaystyle{ r }[/math], where a smaller value of [math]\displaystyle{ r }[/math] indicates greater similarity between [math]\displaystyle{ s }[/math] and [math]\displaystyle{ t }[/math]. Similarity functions are analogous, except that larger values indicate greater similarity; at some risk of confusion to the reader, we will use these terms interchangeably, depending on which interpretation is most natural. One important class of distance functions are edit distances, in which distance is the cost of the best sequence of edit operations that converts [math]\displaystyle{ s }[/math] to [math]\displaystyle{ t }[/math]. Typical edit operations are character insertion, deletion, and substitution, and each operation must be assigned a cost.

We will consider two edit-distance functions. The simple Levenshtein distance assigns a unit cost to all edit operations. As an example of a more complex, well-tuned distance function, we also consider the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durbin et al. 1998) with particular cost parameters, scaled to the interval [0,1]. (Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions.)
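As an illustration of the unit-cost scheme, the Levenshtein distance can be computed with the standard dynamic program over prefixes. This is a minimal sketch, not the SecondString toolkit's implementation:

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance between s and t via dynamic programming."""
    # prev[j] holds the distance between s[:i-1] and t[:j] (the previous row)
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions plus one insertion). The Monge-Elkan variant replaces the unit costs with tuned parameters and an affine gap penalty, and rescales the result to [0,1].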

A broadly similar metric, which is not based on an edit distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings.

A variant of this due to Winkler (1999) also uses the length P of the longest common prefix of s and t.
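The Jaro metric and Winkler's prefix adjustment can be sketched as below. This follows the commonly cited formulation (matches counted within a sliding window, transpositions as half the out-of-order matched pairs, Winkler's usual scaling factor of 0.1 and prefix cap of 4); it is not taken from the SecondString code:

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity: based on the number and order of common characters."""
    if not s or not t:
        return 0.0
    # characters count as common only within this distance of each other
    window = max(0, max(len(s), len(t)) // 2 - 1)
    s_match = [False] * len(s)
    t_match = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_match[j] and t[j] == c:
                s_match[i] = t_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # transpositions: half the matched characters that are out of order
    s_chars = [s[i] for i in range(len(s)) if s_match[i]]
    t_chars = [t[j] for j in range(len(t)) if t_match[j]]
    transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) / 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Winkler's variant: boost the score by the common-prefix length P."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:  # P is conventionally capped at 4
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

On the classic record-linkage example, `jaro("MARTHA", "MARHTA")` is about 0.944 and `jaro_winkler("MARTHA", "MARHTA")` about 0.961, reflecting the three-character common prefix.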

Two strings [math]\displaystyle{ s }[/math] and [math]\displaystyle{ t }[/math] can also be considered as multisets (or bags) of words (or tokens), and we also considered several token-based distance metrics. The Jaccard similarity between the word sets [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] is simply [math]\displaystyle{ \frac{|S \cap T|}{|S \cup T|} }[/math]. Another token-based metric is TFIDF or cosine similarity, which is widely used in the information retrieval community.
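These token-based measures can be sketched as follows. The TF-IDF weighting below is a generic tf·idf cosine, not necessarily the exact weighting used in the paper, and the toy corpus in the usage note is an assumption for illustration:

```python
import math
from collections import Counter

def jaccard(s: str, t: str) -> float:
    """Jaccard similarity |S & T| / |S | T| over whitespace-split word sets."""
    S, T = set(s.split()), set(t.split())
    if not S and not T:
        return 0.0
    return len(S & T) / len(S | T)

def tfidf_cosine(s: str, t: str, corpus: list[str]) -> float:
    """Cosine similarity of TF-IDF vectors; idf is estimated from `corpus`."""
    n = len(corpus)
    # document frequency of each word in the corpus
    df = Counter(w for doc in corpus for w in set(doc.split()))
    def weights(text: str) -> dict[str, float]:
        tf = Counter(text.split())
        v = {w: tf[w] * math.log(n / df[w]) for w in tf if w in df}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        return {w: x / norm for w, x in v.items()}
    vs, vt = weights(s), weights(t)
    return sum(vs[w] * vt.get(w, 0.0) for w in vs)
```

For example, `jaccard("a b c", "b c d")` is 0.5 (two shared tokens out of four distinct ones). With a hypothetical corpus like `["william cohen", "william smith", "jane smith"]`, the idf weighting makes rarer tokens such as "cohen" count for more than common ones such as "william".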

References

  • Belin, T. R., and Rubin, D. B. (1997). A method for calibrating false-match rates in record linkage. In Record Linkage – 1997: Proceedings of an International Workshop and Exposition, 81–94. U.S. Office of Management and Budget (Washington).
  • Bilenko, M., and Mooney, R. (2002). Learning to combine trained distance metrics for duplicate detection in databases. Technical Report Technical Report AI 02-296, Artificial Intelligence Lab, University of Texas at Austin. Available from http://www.cs.utexas.edu/users/ml/papers/marlin-tr-02.pdf.
  • William W. Cohen, and Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002).
  • William W. Cohen (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems 18(3):288–321.
  • Dagan, I.; Lee, L.; and Fernando Pereira (1999). Similarity-based models of word cooccurrence probabilities. Machine Learning 34(1-3).
  • Durbin, R.; Eddy, S. R.; Krogh, A.; and Mitchison, G. (1998). Biological sequence analysis - Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press.
  • Ivan P. Fellegi, and Allan B. Sunter (1969). A theory for record linkage. Journal of the American Statistical Association 64:1183–1210.
  • Galhardas, H.; Florescu, D.; Shasha, D.; and Simon, E. (2000). An extensible framework for data cleaning. In ICDE, 312.
  • Hernandez, M., and Stolfo, S. (1995). The merge/purge problem for large databases. In: Proceedings of the 1995 ACM SIGMOD.
  • Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414–420.
  • Jaro, M. A. (1995). Probabilistic linkage of large public health data files (disc: P687-689). Statistics in Medicine 14:491–498.
  • Thorsten Joachims (2002). Learning to Classify Text Using Support Vector Machines. Kluwer.
  • John D. Lafferty, and Zhai, C. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In 2001 ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
  • Larsen, M. (1999). Multiple imputation analysis of records linked using mixture models. In Statistical Society of Canada Proceedings of the Survey Methods Section, 65–71. Statistical Society of Canada (McGill University, Montreal).
  • Andrew McCallum; Nigam, K.; and Lyle H. Ungar (2000). Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In Knowledge Discovery and Data Mining, 169–178.
  • Monge, A., and Elkan, C. (1996). The field-matching problem: algorithm and applications. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
  • Monge, A., and Elkan, C. (1997). An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery.
  • Pasula, H.; Marthi, B.; Milch, B.; Russell, S.; and Shpitser, I. (2002). Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15. Vancouver, British Columbia: MIT Press.
  • Raman, V., and Hellerstein, J. (2001). Potter’s wheel: An interactive data cleaning system. In The VLDB Journal, 381–390.
  • Ristad, E. S., and Yianilos, P. N. (1998). Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5):522–532.
  • Tejada, S.; Knoblock, C. A.; and Minton, S. (2001). Learning object identification rules for information integration. Information Systems 26(8):607–633.
  • Winkler, W. E. (1999). The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from http://www.census.gov/srd/www/byname.html.

BibTeX

@inproceedings{DBLP:conf/ijcai/CohenRF03,
  author    = {William W. Cohen and
               Pradeep Ravikumar and
               Stephen E. Fienberg},
  title     = {A Comparison of String Distance Metrics for Name-Matching
               Tasks},
  booktitle = {IIWeb},
  year      = {2003},
  pages     = {73-78},
  ee        = {http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf},
  crossref  = {DBLP:conf/ijcai/2003iiweb},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

@proceedings{DBLP:conf/ijcai/2003iiweb,
  editor    = {Subbarao Kambhampati and
               Craig A. Knoblock},
  title     = {Proceedings of IJCAI-03 Workshop on Information Integration
               on the Web (IIWeb-03), August 9-10, 2003, Acapulco, Mexico},
  booktitle = {IIWeb},
  year      = {2003},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}


William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg (2003). “A Comparison of String Distance Metrics for Name-Matching Tasks.” In: Workshop on Information Integration on the Web (IIWeb-03). http://secondstring.sourceforge.net/doc/iiweb03.pdf