Record Linkage Blocking Algorithm
A Record Linkage Blocking Algorithm is a record linkage algorithm that ...
- See: Positive Predictive Value, Jaro-Winkler Distance, Levenshtein Distance, Privacy Preserving Record Linkage.
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/record_linkage#Probabilistic_record_linkage Retrieved:2015-11-8.
- … Determining where to set the match/non-match thresholds is a balancing act between obtaining an acceptable sensitivity (or recall, the proportion of truly matching records that are linked by the algorithm) and positive predictive value (or precision, the proportion of records linked by the algorithm that truly do match). Various manual and automated methods are available to predict the best thresholds, and some record linkage software packages have built-in tools to help the user find the most acceptable values. Because this can be a very computationally demanding task, particularly for large data sets, a technique known as blocking is often used to improve efficiency. Blocking attempts to restrict comparisons to just those records for which one or more particularly discriminating identifiers agree, which has the effect of increasing the positive predictive value (precision) at the expense of sensitivity (recall). For example, blocking based on a phonetically coded surname and ZIP code would reduce the total number of comparisons required and would improve the chances that linked records would be correct (since two identifiers already agree), but would potentially miss records referring to the same person whose surname or ZIP code was different (due to marriage or relocation, for instance). Blocking based on birth month, a more stable identifier that would be expected to change only in the case of data error, would provide a more modest gain in positive predictive value and loss in sensitivity, but would create only twelve distinct groups which, for extremely large data sets, may not provide much net improvement in computation speed. Thus, robust record linkage systems often use multiple blocking passes to group data in various ways in order to come up with groups of records that should be compared to each other.
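The multi-pass blocking idea in the passage above can be sketched as follows. This is only an illustrative sketch, not the method of any particular record linkage package: the helper names (`crude_phonetic`, `candidate_pairs`), the record fields, and the sample data are all hypothetical, and `crude_phonetic` is a deliberately crude stand-in for a real phonetic code such as Soundex.

```python
from collections import defaultdict
from itertools import combinations


def crude_phonetic(name):
    """Crude stand-in for a phonetic code (NOT Soundex): keep the first
    letter, then drop vowels (and Y) from the rest of the name."""
    name = name.upper()
    return name[:1] + "".join(
        c for c in name[1:] if c.isalpha() and c not in "AEIOUY"
    )


def candidate_pairs(records, key_funcs):
    """Run one blocking pass per key function and return the union of
    record-id pairs that share a block in at least one pass."""
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for rec_id, rec in records.items():
            blocks[key_func(rec)].append(rec_id)
        for ids in blocks.values():
            # Only records inside the same block are ever compared.
            pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample data (assumed for illustration only).
records = {
    1: {"surname": "Smith", "zip": "02139", "birth_month": 4},
    2: {"surname": "Smyth", "zip": "02139", "birth_month": 4},
    3: {"surname": "Jones", "zip": "02139", "birth_month": 4},
    4: {"surname": "Brown", "zip": "94110", "birth_month": 9},
}


def by_name_zip(rec):
    # Discriminating key: phonetic-style surname code plus ZIP code.
    return (crude_phonetic(rec["surname"]), rec["zip"])


def by_birth_month(rec):
    # Stable but coarse key: at most twelve blocks.
    return rec["birth_month"]


single_pass = candidate_pairs(records, [by_name_zip])
multi_pass = candidate_pairs(records, [by_name_zip, by_birth_month])
```

With this sample data, the surname-plus-ZIP pass proposes only the Smith/Smyth pair, while adding the birth-month pass also proposes comparing record 3 with records 1 and 2. The second pass would recover a match missed because of a surname change, at the cost of more comparisons. The surviving candidate pairs would then be scored by a comparison step (for instance, a string distance with match/non-match thresholds), which this sketch omits.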
2009
- (Whang et al., 2009) ⇒ Steven Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. (2009) "Entity Resolution with Iterative Blocking." In: Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD). doi:10.1145/1559845.1559870