Entity Record Deduplication System

From GM-RKB

An Entity Record Deduplication System is a System that can solve an Entity Record Deduplication Task by means of an Entity Record Deduplication Algorithm.



References

  • http://infolab.stanford.edu/serf/#soft
    • Our first release of the SERF software can be downloaded here.
    • This package provides an implementation of the R-Swoosh algorithm described in reference [1]. The algorithm takes as input a dataset of records (in XML) and a "MatcherMerger" class that implements functions to match and merge pairs of records, and returns a dataset of resolved records.
    • A sample dataset of product records, along with a simple MatcherMerger implementation, is provided as an example. Products are matched based on the similarity of their titles and prices.
  • (BenjellounGSW, 2008) ⇒ Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. (2008). “Swoosh: A generic approach to entity resolution.” In: VLDB Journal, (2008).
    • http://infolab.stanford.edu/serf/swoosh_vldbj.pdf
    • Deciding if records match is often computationally expensive and application specific. For instance, a customer information management solution from a company we have been interacting with uses a combination of nickname algorithms, edit distance algorithms, fuzzy logic algorithms, and trainable engines to match customer records. On the latest hardware, the speed of matching records ranges from 10M to 100M comparisons per hour (single threaded), depending on the parsing and data cleansing options executed. A record comparison can thus take up to about 0.36 ms, greatly exceeding the runtime of any simple string/numeric value comparison. How to match and combine records is also application specific. For instance, the functions used by that company to match customers are different from those used by others to match, say, products or DNA sequences.
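
Example (illustrative sketch)

The SERF description quoted above pairs a generic resolution loop with an application-specific "MatcherMerger". The following is a minimal Python sketch of that idea, not the SERF code itself: the loop follows the R-Swoosh outline from (BenjellounGSW, 2008), while the Product record, the TitlePriceMatcherMerger class, and both thresholds are hypothetical names and values chosen for illustration. Title similarity is approximated with the standard library's difflib, which is an assumption rather than what SERF uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import FrozenSet, List


@dataclass(frozen=True)
class Product:
    """Hypothetical record type; a merged record keeps every title and price seen."""
    titles: FrozenSet[str]
    prices: FrozenSet[float]


class TitlePriceMatcherMerger:
    """Hypothetical matcher/merger; both thresholds are illustrative assumptions."""

    def __init__(self, min_title_sim: float = 0.85, max_price_gap: float = 0.05):
        self.min_title_sim = min_title_sim    # minimum difflib similarity ratio
        self.max_price_gap = max_price_gap    # maximum relative price difference

    def match(self, a: Product, b: Product) -> bool:
        # Two records match if some pair of titles is similar enough
        # and some pair of prices is close enough.
        titles_close = any(
            SequenceMatcher(None, s.lower(), t.lower()).ratio() >= self.min_title_sim
            for s in a.titles for t in b.titles
        )
        prices_close = any(
            abs(p - q) <= self.max_price_gap * max(p, q)
            for p in a.prices for q in b.prices
        )
        return titles_close and prices_close

    def merge(self, a: Product, b: Product) -> Product:
        # Merging takes the union of the values of both records.
        return Product(a.titles | b.titles, a.prices | b.prices)


def r_swoosh(records: List[Product], mm: TitlePriceMatcherMerger) -> List[Product]:
    """Sketch of the R-Swoosh loop: compare each pending record to the resolved set."""
    pending = list(records)
    resolved: List[Product] = []
    while pending:
        current = pending.pop()
        # Look for the first already-resolved record that matches the current one.
        buddy = next((r for r in resolved if mm.match(current, r)), None)
        if buddy is None:
            resolved.append(current)
        else:
            # The merged record goes back to the pending list so that it is
            # compared again against everything resolved so far.
            resolved.remove(buddy)
            pending.append(mm.merge(current, buddy))
    return resolved
```

Calling r_swoosh(records, TitlePriceMatcherMerger()) on a list of Product records returns the resolved records. Because each call to match may run several string comparisons, the per-comparison cost discussed in the quoted passage is dominated by the matcher rather than by the loop.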