2008 FRILaToolForComparRecordLinkage
- (Jurczyk et al., 2008) ⇒ Pawel Jurczyk, James J. Lu, Li Xiong, Janet D. Cragan, Adolfo Correa. (2008). “FRIL: A Tool for Comparative Record Linkage.” In: Proceedings of the AMIA Annual Symposium (AMIA 2008).
Subject Headings: Duplicate Record Detection Task, Duplicate Record Detection Algorithm.
Quotes
Abstract
- A fine-grained record integration and linkage tool (FRIL) is presented. The tool extends traditional record linkage tools with a richer set of parameters. Users may systematically and iteratively explore the optimal combination of parameter values to enhance linking performance and accuracy. Results of linking a birth defects monitoring program and birth certificate data using FRIL show 99% precision and 95% recall rates when compared to results obtained through handcrafted algorithms, and the process took significantly less time to complete. Experience and experimental results suggest that FRIL has the potential to increase the accuracy of data linkage across all studies involving record linkage. In particular, FRIL will enable researchers to assess objectively the quality of linked data.
Introduction
- The goal of record linkage is to find syntactically distinct data entries that refer to the same entity in two or more input files. The process is important for both data cleaning and integration in birth defects surveillance and research. Traditional interactive tools for record linkage provide users with a small number of parameters, consisting mostly of user options for selecting similarity measures and decision models. In some cases, the user may also pick the search algorithm. The combination of choices typically does not provide sufficient granularities to produce results that are easily discernible. Hence for most research involving record linkage, the accuracy of the linked data is not well-understood, and often not discussed in the evaluation of the study.
- As part of a surveillance program to monitor birth defects in the metropolitan Atlanta area, we have developed a fine-grained record integration and linkage tool (FRIL) to link a 12,700 record database from the Metropolitan Atlanta Congenital Defects Program (MACDP) with a 1.25 million record birth certificate database. The objectives of MACDP are to monitor births of infants with malformations for changes in incidence over time or patterns suggestive of environmental influences, to maintain a case registry for epidemiologic studies, to quantify the morbidity and mortality associated with birth defects, and to provide data for education and health policy decisions related to prevention [1]. Towards these objectives, MACDP conducts data linkages to enhance the completeness of birth defects surveillance data.
Background
- The problem of record linkage is defined as follows. Given sets A and B of records, find a partition of A×B consisting of sets M (matched), U (unmatched), and P (possibly matched) that satisfy M = {(a, b) | a = b} and U = {(a, b) | a ≠ b}. A widely adopted record linkage approach is the probabilistic approach by Fellegi and Sunter [2]. First, a vector of similarity scores (or agreement values) is computed for each pair. Then, the pair is classified as either a match or non-match (or possibly matched) based on an aggregate of the similarity scores. Among methods used for classification we find rule-based methods that allow human experts to specify matching rules, unsupervised learning methods such as Expectation-Maximization (EM) that learn the weights or thresholds without relying on labeled data, and supervised learning methods that use labeled data to train a model, such as a decision tree, naïve Bayesian classifier, or SVM. For detailed descriptions of those methods we refer readers to [3,4,5]. For computing similarities, various distance functions are used and studied. Complete descriptions of these methods can be found in [3,6,7], and several comparative evaluations of those methods have been performed [8,9].
- FRIL adopts the probabilistic linkage approach. Its strength is the amount of control that the user has for tuning the accuracy and performance of linkages. In the remainder of the paper, we describe the full spectrum of user-tunable parameters available in FRIL and discuss their importance in the context of birth defect surveillance (BDS).
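- The two-step scheme described in this section (compute a vector of similarity scores for a pair, then classify the pair from an aggregate of those scores) can be illustrated with a minimal Python sketch. It is not FRIL's implementation: the field names, weights, and thresholds are hypothetical, the string similarity is the standard library's `difflib.SequenceMatcher` ratio rather than any distance function named in the paper, and the aggregate is a simple weighted sum rather than the likelihood-ratio weights of the Fellegi-Sunter model.

```python
from difflib import SequenceMatcher

# Hypothetical attribute weights and thresholds, chosen only for illustration.
WEIGHTS = {"last_name": 0.5, "first_name": 0.3, "birth_date": 0.2}
UPPER = 0.85  # aggregate score at or above this value: matched (M)
LOWER = 0.60  # aggregate score below this value: unmatched (U)


def similarity(x: str, y: str) -> float:
    """Return a similarity score in [0, 1] for two strings."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def similarity_vector(a: dict, b: dict) -> dict:
    """Compute one similarity score per compared attribute."""
    return {field: similarity(a[field], b[field]) for field in WEIGHTS}


def classify(a: dict, b: dict) -> str:
    """Classify a record pair as 'M', 'P' (possibly matched), or 'U'."""
    vector = similarity_vector(a, b)
    score = sum(WEIGHTS[field] * vector[field] for field in WEIGHTS)
    if score >= UPPER:
        return "M"
    if score < LOWER:
        return "U"
    return "P"


if __name__ == "__main__":
    record_a = {"last_name": "Smith", "first_name": "Anna", "birth_date": "2001-03-14"}
    record_b = {"last_name": "Smyth", "first_name": "Anna", "birth_date": "2001-03-14"}
    print(classify(record_a, record_b))
```

Pairs that fall between the two thresholds land in P, which is where manual review or further parameter tuning would typically be applied.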
Conclusion and Ongoing Development
- FRIL facilitates efficient and accurate record linkage over large data sources. The great flexibility of FRIL comes from the large number of fine-grained parameters that the user may tune, and it allowed us to link MACDP and birth certificate data efficiently and accurately (99% precision and 95% recall). By exploiting all the features of FRIL, we presented a process which enabled us to find a good join condition. The benefits of FRIL extend beyond the results of linking. By revealing key algorithmic decision points for user inputs, the tool forces researchers to consider computational issues that impact accuracy and performance of the linkage process. As a result, researchers are able to judge the quality of the linked data scientifically and quantitatively. For already linked data, FRIL may also serve as a validation tool. Work on extending FRIL with several automated tools is ongoing. They include machine learning techniques to suggest values of certain parameters (e.g., attribute selection and weight). Borrowing query optimization techniques from databases, window size and sort ordering may also be suggested. We are optimistic that FRIL will facilitate many future projects based on birth defects surveillance data and other public health surveillance projects.
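- The precision and recall figures quoted above compare a tool's linkage against a reference linkage (in the paper, results obtained through handcrafted algorithms). A minimal sketch of that comparison follows, assuming each linkage is represented as a set of (record id in A, record id in B) pairs. The representation, function name, and sample ids are hypothetical and are not taken from the paper.

```python
def precision_recall(predicted: set, reference: set) -> tuple:
    """Return (precision, recall) of predicted links against reference links."""
    true_positives = len(predicted & reference)
    # Precision: share of predicted links that are also in the reference.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference links that were recovered.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical record-id pairs, for illustration only.
    predicted_links = {(1, "a"), (2, "b"), (3, "c")}
    reference_links = {(1, "a"), (2, "b"), (4, "d"), (5, "e")}
    precision, recall = precision_recall(predicted_links, reference_links)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because both measures are relative to the reference set, they describe agreement with that reference rather than absolute correctness.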
References
- 1. A. Correa, J.D. Cragan, M.E. Kucik, C.J. Alverson, S.M. Gilboa, R. Balakrishnan, M.J. Strickland, C.W. Duke, L.A. O'Leary, T. Riehle-Colarusso, C. Siffel, D. Gambrell, D. Thompson, M. Atkinson, J. Chitra. Metropolitan Atlanta Congenital Defects Program 40th Anniversary Edition Surveillance Report. Birth Defects Research Part A: Clinical and Molecular Teratology 79(2): 65-186, 2007.
- 2. I. P. Fellegi and A.B. Sunter. A Theory for Record Linkage. JASA, 64(328): 1183-1210, 1969.
- 3. A. K. Elmagarmid, Panagiotis G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE TKDE, 19(1):1–16, 2007.
- 4. A. Halevy, A. Rajaraman, and J. Ordille. Data integration: the teenage years. In: Proceedings of the VLDB, 2006.
- 5. W. Winkler. Overview of record linkage and current research directions. U.S. Census Bureau, Technical Report, 2006.
- 6. Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.
- 7. E. S. Ristad and P. N. Yianilos. Learning string edit distance. Technical Report CS-TR-532-96, Department of Computer Science, Princeton University, 1996.
- 8. W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string metrics for matching names and records. In: Proceedings of KDD Workshop on Data Cleaning, 2003.
- 9. E. Porter and W. Winkler. Approximate string comparison and its effect on an advanced record linkage system. U.S. Bureau of the Census, Research Report, 1997.
- 10. K.M. Campbell. Rule Your Data with The Link King (a SAS/AF application for record linkage and unduplication). SUGI 30, 2005.
- 11. K.K. Thoburn, D. Gu and T. Rawson. Link Plus: Probabilistic Record Linkage Software. 2nd Probabilistic Record Linkage Conference Call, 2007.
- 12. Record linkage software. Version 5.0. LinkageWiz Inc. Available from http://www.linkagewiz.com/.
- 13. M. G. Elfeky, V. S. Verykios, and A. K. Elmagarmid. TAILOR: A record linkage toolbox. In: Proceedings of the ICDE, 2002.