Record Linkage Task

A Record Linkage Task is a coreference resolution task that requires the clustering of entity records with the same referent.

AKA: Database Matching, Coreferent Record Detection, Record Matching.
Context:
- Input: two or more Populated Data Structures.
- Output: Coreferent Record Clusters.
- It can be solved by a Record Linkage System (that implements a record linkage algorithm).
- It can range from being a Heuristic Record Linkage to being a Data-Driven Record Linkage (such as unsupervised record linkage or supervised record linkage).
- It can range from being a Tabular Record Linkage Task to being a Graph Record Linkage Task to being ...
- It can range from being an Inter-Database Record Linkage Task to being an Intra-Database Record Linkage Task.
- It can range from being a One-Directional Record Linkage Task to being a Two-Directional Record Linkage Task.
- It can range from being a Record Deduplication Task (single database) to being a Coreference Resolution Task (multiple databases).
- It can range from being a Deterministic Record Linkage Task to being a Probabilistic Record Linkage Task.
- It can support a Duplicate Record Merging Task or, if one of the databases is a canonical database, a Record Normalization Task.
- ...
Example(s):
- a Citation Record Linkage Task, for citation databases.
- a Customer Record Linkage Task, for customer databases.
- a Product Record Linkage Task, for product databases.
- a Graph Mapping Task, such as an ontology mapping task or a taxonomy matching task.
- a Webpage Coreference Resolution Task, for webpage databases.
- a Terminology Matching Task, for terminology databases.
- …
Counter-Example(s):
- an Entity Mention Coreference Resolution Task.
- a Record Canonicalization Task (e.g. to support database merging).
- an Information Extraction Task.
See: Entity Database, Data Pre-Processing, Expectation Maximization Clustering, Link Prediction Task, Similarity Measure, Unsupervised Machine Learning, Entity Resolution, Identity Resolution.
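
The input/output contract listed under Context (records in, coreferent record clusters out) can be sketched as follows. This is a minimal illustration of a heuristic, deterministic linkage, assuming records are tuples of strings and that two records corefer when their normalized fields are identical. The names `normalize`, `link_records`, and `same_name_and_city` are hypothetical, and the sample records are invented. Matching pairs are merged with a union-find structure so that the result is a clustering rather than a list of pairs.

```python
from itertools import combinations


def normalize(record):
    """Lowercase and strip every field so trivially different values compare equal."""
    return tuple(str(value).strip().lower() for value in record)


def link_records(records, is_match):
    """Group record indices into coreferent clusters via union-find over matching pairs."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the cluster root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every unordered pair; merge the clusters of pairs that match.
    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def same_name_and_city(a, b):
    """Assumed match rule: all normalized fields agree exactly."""
    return normalize(a) == normalize(b)


# Invented sample records pooled from two hypothetical databases.
records = [
    ("Ann Lee", "Boston"),
    ("ann lee ", "boston"),
    ("Bob Ray", "Austin"),
]
clusters = link_records(records, same_name_and_city)
```

Comparing all pairs is quadratic in the number of records, so this sketch only suits small inputs.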

References

2017a

(Christen & Winkler, 2017) ⇒ Peter Christen and William E. Winkler (2017) "Record Linkage". In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
- QUOTE: Identifying and linking records that correspond to the same real-world entity in one or more databases is an increasingly important task in many data mining and machine learning projects. The aim of record linkage is to compare records within one (known as deduplication) or across two databases and classify the compared pairs of records as matches (pairs where both records are assumed to refer to the same real-world entity) and non-matches (pairs where the two records are assumed to refer to different entities). Formally, let us consider two databases (or files), A and B, and record pairs in the product space A × B (for the deduplication of a single database A, the product space is A × A). The aim of record linkage is to classify these record pairs into the classes of matches (links) and non-matches (non-links) (Christen 2012^[1]). Depending upon the decision model used (Fellegi and Sunter 1969^[2]; Herzog et al. 2007^[3]), a third class of potential matches (potential links) might be used. These are difficult to classify record pairs that will need to be manually assessed and classified as matches or non-matches in a manual clerical review process.
  Each record pair in A × B is assumed to correspond to either a true match or a true non-match. The space A × B is therefore partitioned into the set M of true matches and the set U of true non-matches. The objective of record linkage is to correctly classify record pairs from M into the class of matches and pairs from U into the class of non-matches.
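
The pairwise formulation quoted above can be sketched as a classifier over the product space A × B with three outcomes. This is a rough illustration under stated assumptions, not the decision model of any cited author: the similarity function (the standard library's `difflib.SequenceMatcher`), the two thresholds `UPPER` and `LOWER`, and the sample records are all chosen here for demonstration only.

```python
from difflib import SequenceMatcher
from itertools import product

# Assumed thresholds, chosen only for illustration.
UPPER = 0.85  # scores at or above this are classified as matches
LOWER = 0.60  # scores below this are classified as non-matches


def similarity(rec_a, rec_b):
    """Mean character-level similarity over aligned fields, in [0, 1]."""
    scores = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(rec_a, rec_b)
    ]
    return sum(scores) / len(scores)


def classify_pairs(db_a, db_b):
    """Label every pair in A x B as match, potential match, or non-match."""
    decisions = {}
    for (i, rec_a), (j, rec_b) in product(enumerate(db_a), enumerate(db_b)):
        score = similarity(rec_a, rec_b)
        if score >= UPPER:
            decisions[(i, j)] = "match"
        elif score >= LOWER:
            decisions[(i, j)] = "potential match"  # left for clerical review
        else:
            decisions[(i, j)] = "non-match"
    return decisions


# Invented sample databases A and B.
db_a = [("john smith", "boston")]
db_b = [("John Smith", "Boston"), ("xyz qqq", "zzzz")]
decisions = classify_pairs(db_a, db_b)
```

Pairs that land between the two thresholds are the ones a clerical review would have to resolve by hand.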

2017b

(Bhattacharya & Getoor, 2017) ⇒ Indrajit Bhattacharya and Lise Getoor (2017) "Entity Resolution". In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
- QUOTE: A fundamental problem in data cleaning and integration (see Data Preparation) is dealing with uncertain and imprecise references to real-world entities. The goal of entity resolution is to take a collection of uncertain entity references (or references, in short) from a single data source or multiple data sources, discover the unique set of underlying entities, and map each reference to its corresponding entity. This typically involves two subproblems – identification of references with different attributes to the same entity and disambiguation of references with identical attributes by assigning them to different entities.

2012

(Wikipedia, 2012) ⇒ http://en.wikipedia.org/wiki/Record_linkage
- QUOTE: Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. In mathematical graph theory, record linkage can be seen as a technique of resolving bipartite graphs.
http://en.wikipedia.org/wiki/Record_linkage#Naming_conventions
- QUOTE: "Record linkage" is the term used by statisticians, epidemiologists, and historians, among others, to describe the process of joining records from one data source with another that describe the same entity. Commercial mail and database applications refer to it as "merge/purge processing" or "list washing". Computer scientists often refer to it as "data matching" or as the "object identity problem". Other names used to describe the same concept include "entity resolution", “identity resolution", "entity disambiguation", "duplicate detection", "record matching", "instance identification", "deduplication", “coreference resolution", "reference reconciliation", "data alignment", and "database hardening". This profusion of terminology has led to few cross-references between these research communities.

2008

(BenjellounGSW, 2008) ⇒ Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. (2008). “Swoosh: A generic approach to entity resolution." VLDB Journal, (2008).
- QUOTE: Entity Resolution (ER) (sometimes referred to as deduplication) is the process of identifying and merging records judged to represent the same real-world entity. ER is a well-known problem that arises in many applications. For example, mailing lists may contain multiple entries representing the same physical address, but each record may be slightly different, e.g., containing different spellings or missing some information. As a second example, consider a company that has different customer databases (e.g., one for each subsidiary), and would like to consolidate them. Identifying matching records is challenging because there are no unique identifiers across databases. A given customer may appear in different ways in each database, and there is a fair amount of guesswork in determining which customers match.

2009

(Dalvi et al., 2009) ⇒ Nilesh Dalvi, Ravi Kumar, Bo Pang, and Andrew Tomkins. (2009). “Matching Reviews to Objects Using a Language Model.” In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009).
- QUOTE: Entity matching is a well-studied topic in databases.

2007a

(Elmagarmid et al., 2007) ⇒ Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios (2007). “Duplicate Record Detection: A Survey." IEEE Transactions on Knowledge and Data Engineering 19 (1).
- It uses the term Duplicate Record Detection for Record Coreference Resolution.
- It defines the Record Coreference Resolution Task as the "detection of duplicate database records".
- It distinguishes between:
  - a Structural Heterogeneity Relation (with different Data Structure Representation)
  - a Lexical Heterogeneity Relation (with different Value Representations).
- It is focused on Entity Databases where Data Records refer to "real-world objects" (e.g., StreetAddress=44 W. 4th St. vs. StreetAddress=44 West Fourth Street).
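
The lexical heterogeneity in the street-address example above can be illustrated with a small normalization step that maps both spellings to one canonical string before comparison. This is only a sketch: the `EXPANSIONS` table and the function `normalize_address` are hypothetical and far smaller than anything a real system would need.

```python
import re

# Hypothetical abbreviation table, assumed for this example only.
# Real address data is ambiguous (e.g. "st" can also mean "saint").
EXPANSIONS = {
    "w": "west",
    "e": "east",
    "n": "north",
    "s": "south",
    "st": "street",
    "ave": "avenue",
    "4th": "fourth",
}


def normalize_address(value):
    """Lowercase, drop punctuation, and expand known abbreviations token by token."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(EXPANSIONS.get(token, token) for token in tokens)


a = normalize_address("44 W. 4th St.")
b = normalize_address("44 West Fourth Street")
```

After normalization the two values are identical strings, so an exact comparison is enough to treat them as the same address.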

2007b

(Bhattacharya & Getoor, 2007) ⇒ Indrajit Bhattacharya, and Lise Getoor. (2007). “Collective entity resolution in relational data.” In: ACM Transactions on Knowledge Discovery from Data (TKDD).
- QUOTE: Entity resolution is a common problem that comes in different guises (and is given different names) in many computer science domains. Examples include computer vision, where we need to figure out when regions in two different images refer to the same underlying object (the correspondence problem); natural language processing when we would like to determine which noun phrases refer to the same underlying entity (coreference resolution); and databases, where, when merging two databases or cleaning a database, we would like to determine when two tuple records are referring to the same real-world object (deduplication and data integration). Deduplication [Hernández and Stolfo 1995; Monge and Elkan 1996] is important for both accurate analysis, for example, determining the number of customers, and for cost-effectiveness, for example, removing duplicates from mailing lists. In information integration, determining approximate joins [Cohen 2000] is important for consolidating information from multiple sources; most often there will not be a unique key that can be used to join tables in distributed databases, and we must infer when two records from different databases, possibly with different structures, refer to the same entity. In many of these examples, co-occurrence information in the input can be naturally represented as a graph.

2006

(Winkler, 2006) ⇒ William E. Winkler. (2006). “Overview of record linkage and current research directions." Technical Report Statistical Research Report Series RRS2006/02, U.S. Bureau of the Census.
- QUOTE: Record linkage is the means of combining information from a variety of computerized files. It is also referred to as data cleaning (McCallum and Wellner 2003) or object identification (Tejada et al. 2002).
  If a number of files are combined into a data warehouse, then Fayad and Uthurusamy (1996, 2002) and Fayad et al. (1996) have stated that the majority (possibly above 90%) of the work is associated with cleaning up the duplicates. Winkler (1995) has shown that computerized record linkage procedures can significantly reduce the resources needed for identifying duplicates in comparison with methods that are primarily manual. Newcombe and Smith (1975) have demonstrated the purely computerized duplicate detection in high quality person lists can often identify duplicates at greater level of accuracy than duplicate detection that involves a combination of computerized procedures and review by highly trained clerks. The reason is that the computerized procedures can make use of overall information from large parts of a list. For instance, the purely computerized procedure can make use of the relative rarity of various names and combinations of information in identifying duplicates. The relative rarity is computed as the files are being matched. Winkler (1995, 1999a) observed that the automated frequency-based (or value-specific) procedures could account for the relative rarity of a name such as ‘Martinez’ in cities such as Minneapolis, Minnesota in the US in comparison with the relatively high frequency of ‘Martinez’ in Los Angeles, California.
  Record linkage of files (Fellegi and Sunter 1969) is used to identify duplicates when unique identifiers are unavailable. It relies primarily on matching of names, addresses, and other fields that are typically not unique identifiers of entities. Matching businesses using business names and other information can be particularly difficult (Winkler 1995). Record linkage is also called object identification (Tejada et al. 2001, 2002), data cleaning (Do and Rahm 2000), approximate matching or approximate joins (Gravano et al. 2001, Guha et al. 2004), fuzzy matching (Ananthakrishna et al. 2002), and entity resolution (Benjelloun et al. 2005).
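
The frequency-based (value-specific) idea in the Winkler (2006) quote above can be illustrated with a simplified agreement weight of the form log2(m / u), where u is estimated as the relative frequency of the agreeing value in a file. This is a sketch, not Winkler's actual procedure. The function name `agreement_weights`, the constant `m=0.95`, and the surname counts are all assumptions made up for this example and are not census figures.

```python
import math
from collections import Counter


def agreement_weights(values, m=0.95):
    """Per-value agreement weight log2(m / u), with u estimated as relative frequency.

    m is an assumed probability that true matches agree on the field.
    """
    counts = Counter(values)
    total = len(values)
    return {value: math.log2(m / (count / total)) for value, count in counts.items()}


# Made-up toy surname files; the proportions are illustrative only.
minneapolis = ["Martinez"] * 1 + ["Olson"] * 99
los_angeles = ["Martinez"] * 20 + ["Olson"] * 80

w_minneapolis = agreement_weights(minneapolis)
w_los_angeles = agreement_weights(los_angeles)
```

In this toy data, agreement on the rarer value earns the larger weight, so two records that both say 'Martinez' count as stronger evidence of a match in the file where that name is uncommon.
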
Nick Koudas, editor. (2006). “Issue on Data Quality.” IEEE Data Engineering Bulletin, volume 29.

2005

http://www.cs.utexas.edu/users/ml/riddle/
- RIDDLE: Repository of Information on Duplicate Detection, Record Linkage, and Identity Uncertainty

1962

(Newcombe and Kennedy, 1962) ⇒ H. B. Newcombe and J. M. Kennedy. (1962). “Record linkage: making maximum use of the discriminating power of identifying information.” Communications of the ACM, 5:11.

1959

(Newcombe et al., 1959) ⇒ Howard B. Newcombe, James M. Kennedy, S. J. Axford, and A. P. James. (1959). “Automatic linkage of vital records.” In: Science, 130:954-959.

1946

Halbert L. Dunn. (1946). “Record Linkage." American Journal of Public Health 36 (12).

[1] Christen P (2012) Data matching – concepts and techniques for record linkage, entity resolution, and duplicate detection. Data-centric systems and applications. Springer, Berlin/New York

[2] Fellegi IP, Sunter AB (1969) A theory for record linkage. J Am Stat Assoc 64(328):1183–1210

[3] Herzog TN, Scheuren FJ, Winkler WE (2007) Data Quality and Record Linkage Techniques. Springer, New York/London
