Entity Record Deduplication System
An Entity Record Deduplication System is a System that can solve an Entity Record Deduplication Task by means of an Entity Record Deduplication Algorithm.
- AKA: Record Deduplication System, Deduplication System.
- …
- Example(s):
- Febrl System http://datamining.anu.edu.au/projects/linkage.html#project_description - an open-source record linkage (RL) application from the Australian National University.
- RP Link Plus System - developed at the US Centers for Disease Control and Prevention.
- FRIL System (Fine-Grained Record Integration and Linkage Tool) http://www.mathcs.emory.edu/Research/Area/datainfo/FRIL/ - developed at Emory University and the US Centers for Disease Control and Prevention.
- SimMetrics System http://sourceforge.net/projects/simmetrics/ - an open-source library of String Similarity techniques by Sam Chapman at the University of Sheffield.
- Link King System http://www.the-link-king.com - a SAS System application developed at Washington State's Division of Alcohol and Substance Abuse (DASA) by Kevin M. Campbell.
- D-Dupe System (Data Deduplication and Integration) http://www.cs.umd.edu/projects/linqs/ddupe/ - developed by the Computer Science department at the University of Maryland.
- SERF System.
- See: Entity Mention Deduplication System.
References
- http://infolab.stanford.edu/serf/#soft
- Our first release of the SERF software can be downloaded here.
- This package provides an implementation of the R-Swoosh algorithm described in reference [1]. The algorithm takes as input a dataset of records (in XML) and a "MatcherMerger" class that implements functions to match and merge pairs of records, and returns a dataset of resolved records.
- A sample dataset of product records, along with a simple MatcherMerger implementation are provided as an example. Products are matched based on the similarity of their titles and prices.
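The quoted description (a dataset of records plus a "MatcherMerger" with match and merge functions, returning resolved records) can be illustrated with a minimal Python sketch. This is not the SERF Java code: the names `Product`, `ProductMatcherMerger`, and `r_swoosh`, the use of `difflib` for title similarity, the two thresholds, and the sample records are all assumptions made for illustration. The loop follows the general R-Swoosh idea of comparing each pending record against the resolved set and feeding merged records back into the pending set.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import FrozenSet, List


@dataclass(frozen=True)
class Product:
    """A product record; merged records keep every title and price seen."""
    titles: FrozenSet[str]
    prices: FrozenSet[float]


class ProductMatcherMerger:
    """Hypothetical matcher/merger based on title similarity and price closeness."""

    def __init__(self, title_threshold: float = 0.8, price_tolerance: float = 0.1):
        self.title_threshold = title_threshold  # assumed minimum similarity ratio
        self.price_tolerance = price_tolerance  # assumed relative price difference

    def match(self, a: Product, b: Product) -> bool:
        # Two records match if some pair of titles is similar
        # and some pair of prices is close.
        titles_close = any(
            SequenceMatcher(None, s.lower(), t.lower()).ratio() >= self.title_threshold
            for s in a.titles for t in b.titles
        )
        prices_close = any(
            abs(p - q) <= self.price_tolerance * max(p, q)
            for p in a.prices for q in b.prices
        )
        return titles_close and prices_close

    def merge(self, a: Product, b: Product) -> Product:
        # Merging takes the union of the values of both records.
        return Product(a.titles | b.titles, a.prices | b.prices)


def r_swoosh(records: List[Product], mm: ProductMatcherMerger) -> List[Product]:
    """Resolve records by repeatedly matching and merging until nothing changes."""
    pending = list(records)
    resolved: List[Product] = []
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if mm.match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record may match other records, so it is re-queued.
            resolved.remove(partner)
            pending.append(mm.merge(current, partner))
    return resolved


sample = [
    Product(frozenset({"Apple iPod Nano 8GB"}), frozenset({149.0})),
    Product(frozenset({"apple ipod nano 8 gb"}), frozenset({145.0})),
    Product(frozenset({"Canon PowerShot A590"}), frozenset({329.99})),
]
resolved = r_swoosh(sample, ProductMatcherMerger())
for record in resolved:
    print(sorted(record.titles), sorted(record.prices))
```

Representing each record as sets of values keeps the merge a plain union, so a merged record still matches everything its source records matched. A real matcher would likely need domain-specific similarity functions and tuned thresholds.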
- (BenjellounGSW, 2008) ⇒ Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. (2008). “Swoosh: A generic approach to entity resolution.” In: VLDB Journal, (2008).
- http://infolab.stanford.edu/serf/swoosh_vldbj.pdf
- Deciding if records match is often computationally expensive and application specific. For instance, a customer information management solution from a company we have been interacting with uses a combination of nickname algorithms, edit distance algorithms, fuzzy logic algorithms, and trainable engines to match customer records. On the latest hardware, the speed of matching records ranges from 10M to 100M comparisons per hour (single threaded), depending on the parsing and data cleansing options executed. A record comparison can thus take up to about 0.36ms, greatly exceeding the runtime of any simple string/numeric value comparison. How to match and combine records is also application specific. For instance, the functions used by that company to match customers are different from those used by others to match say products or DNA sequences.
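The 0.36ms figure in the passage above follows directly from the quoted throughput: one hour is 3,600,000 ms, so 10M comparisons per hour gives 0.36 ms each, and 100M per hour gives 0.036 ms each. The short Python sketch below reproduces that arithmetic and, as an added illustration not taken from the paper, estimates the cost of naively comparing every pair in a dataset. The function names and the dataset size of 100,000 records are hypothetical.

```python
MS_PER_HOUR = 3_600_000


def ms_per_comparison(comparisons_per_hour: float) -> float:
    """Average time of a single record comparison, in milliseconds."""
    return MS_PER_HOUR / comparisons_per_hour


def naive_pairwise_hours(num_records: int, comparisons_per_hour: float) -> float:
    """Hours needed to compare every unordered pair of records once."""
    num_pairs = num_records * (num_records - 1) // 2
    return num_pairs / comparisons_per_hour


slowest = ms_per_comparison(10_000_000)    # lower end of the quoted throughput
fastest = ms_per_comparison(100_000_000)   # upper end of the quoted throughput
print(f"per comparison: {fastest:.3f} ms to {slowest:.3f} ms")

# Hypothetical dataset size, chosen only to show how quickly pairs add up.
hours = naive_pairwise_hours(100_000, 10_000_000)
print(f"all pairs of 100,000 records at 10M/hour: {hours:.1f} hours")
```

Under these assumptions, 100,000 records produce about 5 billion pairs, or roughly 500 hours at the slower rate, which suggests why reducing the number of match calls matters for a deduplication system.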