2009 EntityResolutinwithIterBlock
- (Whang et al., 2009) ⇒ Steven Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. (2009) "Entity Resolution with Iterative Blocking.” In: Proceedings of the 35th SIGMOD International Conference on Management of data (SIGMOD). doi:10.1145/1559845.1559870
Subject Headings: Entity Record, Record Disambiguation.
Notes
Cited By
Quotes
Author Keywords
entity resolution, blocking, iterative blocking.
Abstract
Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. Various blocking techniques can be used to enhance the performance of ER by dividing the records into blocks in multiple ways and only comparing records within the same block. However, most blocking techniques process blocks separately and do not exploit the results of other blocks. In this paper, we propose an iterative blocking framework where the ER results of blocks are reflected to subsequently processed blocks. Blocks are now iteratively processed until no block contains any more matching records. Compared to simple blocking, iterative blocking may achieve higher accuracy because reflecting the ER results of blocks to other blocks may generate additional record matches. Iterative blocking may also be more efficient because processing a block now saves the processing time for other blocks. We implement a scalable iterative blocking system and demonstrate that iterative blocking can be more accurate and efficient than blocking for large datasets.
References
- Benjelloun, O., Garcia-Molina, H., Kawai, H., Larson, T.E., Menestrina, D., Thavisomboon, S.: D-swoosh: A family of algorithms for generic, distributed entity resolution. In: ICDCS (2007)
- Benjelloun, O., Garcia-Molina, H., Menestrina, D., Whang, S.E., Su, Q., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. (2008). doi:10.1007/s00778-008-0098-x
- Bhattacharya, I., Getoor, L.: Relational clustering for multi-type entity resolution. In: MRDM ’05: Proceedings of the 4th International workshop on multi-relational mining, pp. 3–12. ACM Press, New York (2005). doi:10.1145/1090193.1090195
- Bohannon, P., Flaster, M., Fan, W., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD Conference, pp. 143–154 (2005)
- Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: Proceedings of ICDE. Tokyo, Japan (2005)
- Chaudhuri, S., Sarma, A.D., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: SIGMOD’07: Proceedings of the 2007 ACM SIGMOD International Conference on management of data, pp. 437–448. ACM Press, New York (2007). doi:10.1145/1247480.1247530
- Chomicki, J., Marcinkowski, J.: On the computational complexity of minimal-change integrity maintenance in relational databases. In: Inconsistency Tolerance, pp. 119–150 (2005)
- Doan, A., Lu, Y., Lee, Y., Han, J.: Object matching for information integration: A profiler-based approach. In: IIWeb, pp. 53–58 (2003)
- Doan, A., Lu, Y., Lee, Y., Han, J.: Profile-based object matching for information integration. IEEE Intell. Syst. 18(5), 54–59 (2003)
- Dong, X., Halevy, A.Y., Madhavan, J.: Reference reconciliation in complex information spaces. In: SIGMOD Conference, pp. 85–96 (2005)
- Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
- Eswaran, K.P., Chamberlin, D.D.: Functional specifications of subsystem for database integrity. In: VLDB, pp. 48–68 (1975)
- Fellegi, I.P., Holt, D.: A systematic approach to automatic edit and imputation. J Am. Stat. Assoc. 71(353), 17–35 (1976). http://www.jstor.org/stable/2285726
- Franconi, E., Palma, A.L., Leone, N., Perri, S., Scarcello, F.: Census data repair: A challenging application of disjunctive logic programming. In: LPAR’01: Proceedings of the Artificial Intelligence on Logic for Programming, pp. 561–578. Springer, London (2001)
- Genesereth, M.R., Nilsson, N.J.: Logical Foundations of Artificial Intelligence. Morgan Kaufmann, Palo Alto (1988)
- Ginsberg, M.L.: Readings in Nonmonotonic Reasoning. Morgan Kaufmann, Los Altos (1987)
- Gu, L., Baxter, R., Vickers, D., Rainsford, C.: Record linkage: Current practice and future directions. Tech. Rep. 03/83, CSIRO Mathematical and Information Sciences (2003)
- Hammer, M., McLeod, D.: Semantic integrity in a relational data base system. In: VLDB, pp. 25–47 (1975)
- Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Proceedings of ACM SIGMOD, pp. 127–138 (1995)
- Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)
- McCallum, A.K., Nigam, K., Ungar, L.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of KDD, pp. 169–178. Boston, MA (2000)
- Menestrina, D., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with data confidences. In: First International VLDB Workshop on Clean Databases. Seoul, Korea (2006)
- Monge, A.E., Elkan, C.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 23–29 (1997)
- Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)
- Nilsson N.J.: Artificial Intelligence A New Synthesis. Morgan Kaufmann, San Francisco (1998)
- Rowland, T.: Connected component. http://mathworld.wolfram. com/ConnectedComponent.html
- Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of ACM SIGKDD. Edmonton, Alberta (2002)
- Shen, W., Li, X., Doan, A.: Constraint-based entity matching. In: AAAI, pp. 862–867 (2005)
- Tejada, S., Knoblock, C.A., Minton, S.: Learning object identification rules for information integration. Inf. Syst. J. 26(8), 635–656 (2001)
- Whang, S.E., Benjelloun, O., Garcia-Molina, H.: Additional experiments on negative rules. Tech. rep., Stanford University. http://dbpubs.stanford.edu/pub/2005-5
- Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. Tech. rep., Stanford University (2008). http://dbpubs.stanford.edu/pub/2008-19
- Widom, J., Ceri, S. (eds.): Active Database Systems: Triggers and Rules For Advanced Database Processing. Morgan Kaufmann, San Francisco (1996)
- Winkler, W.: Overview of record linkage and current research directions. Tech. rep., Statistical Research Division, US Bureau of the Census, Washington, DC (2006)
- Winkler, W.E.: State of statistical data editing and current research problems. In: UN/ECE Work Session on Statistical Data Editing, Working Paper n.29, pp. 2–4 (1999),