1969 ATheoryForRecordLinkage
- (Fellegi & Sunter, 1969) ⇒ Ivan P. Fellegi, Allan B. Sunter. (1969). “A Theory for Record Linkage.” In: Journal of the American Statistical Association.
Subject Headings: Entity Record Deduplication Algorithm.
Notes
- It formalizes the research of (Newcombe et al., 1959)
Cited By
2009
- (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Record_linkage
- In 1969, Ivan Fellegi and Allan Sunter formalized these ideas and proved that the probabilistic decision rule they described was optimal when the comparison attributes are conditionally independent. Their pioneering work, "A Theory for Record Linkage", remains the mathematical foundation for many record linkage applications today.
2006
- (Winkler, 2006) ⇒ William E. Winkler. (2006). “Overview of record linkage and current research directions.” Technical Report Statistical Research Report Series RRS2006/02, U.S. Bureau of the Census.
- The basic ideas are based on statistical concepts such as odds ratios, hypothesis testing, and relative frequency.
Quotes
Abstract
- A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events (said to be matched).
- A comparison is to be made between the recorded characteristics and values in two records (one from each file) and a decision made as to whether or not the members of the comparison pair represent the same person or event, or whether there is insufficient evidence to justify either of these decisions at stipulated levels of error. These three decisions are referred to as a link $(A_1)$, a non-link $(A_3)$, and a possible link $(A_2)$. The first two decisions are called positive dispositions.
- The two types of error are defined as the error of the decision $A_1$ when the members of the comparison pair are in fact unmatched, and the error of the decision $A_3$ when the members of the comparison pair are in fact matched. The probabilities of these errors are defined as $$\mu = \sum_{\gamma \in \Gamma} u(\gamma)P(A_1 \mid \gamma)$$ and $$\lambda = \sum_{\gamma \in \Gamma} m(\gamma)P(A_3 \mid \gamma)$$ respectively, where $u(\gamma), m(\gamma)$ are the probabilities of realizing $\gamma$ (a comparison vector whose components are the coded agreements and disagreements on each characteristic) for unmatched and matched record pairs respectively. The summation is over the whole comparison space $\Gamma$ of possible realizations. A linkage rule assigns probabilities $P(A_1 \mid \gamma)$, $P(A_2 \mid \gamma)$, and $P(A_3 \mid \gamma)$ to each possible realization of $\gamma \in \Gamma$. An optimal linkage rule $L(\mu, \lambda, \Gamma)$ is defined for each value of $(\mu, \lambda)$ as the rule that minimizes $P(A_2)$ at those error levels. In other words, for fixed levels of error, the rule minimizes the probability of failing to make positive dispositions. A theorem describing the construction and properties of the optimal linkage rule, and two corollaries to the theorem which make it a practical working tool, are given.
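The decision rule sketched in the abstract can be illustrated in code. The following is a minimal Python sketch, not the paper's own procedure: the field names and the $m$/$u$ probabilities are hypothetical (in practice they would be estimated from the data), and the thresholds standing in for the stipulated error levels $(\mu, \lambda)$ are illustrative. It scores a comparison vector $\gamma$ by its log-likelihood ratio under the conditional-independence assumption and maps the score to the three decisions $A_1$, $A_2$, $A_3$.

```python
import math

# Hypothetical per-field probabilities (would be estimated in practice):
# M[f] = P(agreement on field f | records matched)
# U[f] = P(agreement on field f | records unmatched)
M = {"surname": 0.95, "first_name": 0.90, "birth_year": 0.85}
U = {"surname": 0.01, "first_name": 0.05, "birth_year": 0.10}

def match_weight(gamma):
    """Log-likelihood ratio log(m(gamma)/u(gamma)) for a comparison
    vector gamma (dict: field -> True/False agreement), assuming the
    fields are conditionally independent."""
    w = 0.0
    for field, agrees in gamma.items():
        m, u = M[field], U[field]
        w += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return w

def classify(gamma, upper, lower):
    """Three-way decision: link (A1), possible link (A2), non-link (A3).
    The thresholds upper/lower play the role of the cutoffs fixed by
    the stipulated error levels (mu, lambda)."""
    w = match_weight(gamma)
    if w >= upper:
        return "A1 (link)"
    if w <= lower:
        return "A3 (non-link)"
    return "A2 (possible link)"

# Full agreement scores well above the upper cutoff; full disagreement
# falls below the lower one; mixed patterns land in the A2 band.
print(classify({"surname": True, "first_name": True, "birth_year": True}, 5.0, -5.0))
print(classify({"surname": False, "first_name": False, "birth_year": False}, 5.0, -5.0))
```

Widening the gap between the two thresholds lowers both error rates at the cost of a larger $P(A_2)$, the quantity the optimal rule minimizes for fixed $(\mu, \lambda)$.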