1995 TheMergePurgeProblemForLargeDBs
- (Hernández and Stolfo, 1995) ⇒ Mauricio A. Hernández, Salvatore J. Stolf. (1995). “The Merge/Purge Problem for Large Databases.” In: Proceedings of ACM SIGMOD (SIGMOD 1995). doi:10.1145/223784.223807
Subject Headings: Merge-Purge Algorithm, Sorted Neighborhood Algorithm, Clustering Algorithm.
Notes
Cited By
2007
- (Bhattacharya and Getoor, 2007) ⇒ Indrajit Bhattacharya, and Lise Getoor. (2007). “Collective Entity Resolution in Relational Data.” In: Proceedings for ACM Transactions on Knowledge Discovery from Data (TKDD 2007).
Quotes
Abstract
Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results that demonstrates this approach may work well in practice but at great expense. An alternative method based upon clustering is also presented with a comparative evaluation to the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the Transitive Closure over the results of independent runs considering alternative primary key attributes in each pass.
References
- Rakesh Agrawal, H. V. Jagadish, Multiprocessor transitive closure algorithms, Proceedings of the first international symposium on Databases in parallel and distributed systems, p.56-66, December 05-07, 1988, Austin, Texas, United States.
- Michael Allen Bickel, Automatic correction to misspelled names: a fourth-generation language approach, Communications of the ACM, v.30 n.3, p.224-228, March 1987 doi:10.1145/214748.214756.
- Dina Bitton, David J. DeWitt, Duplicate record elimination in large data files, ACM Transactions on Database Systems (TODS), v.8 n.2, p.255-265, June 1983 doi:10.1145/319983.319987
- Hasanat M. Dewan, Kui W. Mok, Mauricio Hernández, Salvatore J. Stolfo, Predictive dynamic load balancing of parallel hash-joins over heterogeneous processors in the presence of data skew, Proceedings of the third International Conference on on Parallel and distributed information systems, p.40-49, October 1994, Autin, Texas, United States
- David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, An Evaluation of Non-Equijoin Algorithms, Proceedings of the 17th International Conference on Very Large Data Bases, p.443-452, September 03-06, 1991
- C. L. Forgy. OPS5 User's Manual. Technical Report CMU-CS-81-135, Carnegie Mellon University, July 1981.
- Shahram Ghandeharizadeh, David J. Dewitt, Physical database design in multiprocessor database systems, 1990
- R. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal o} Computzng, 17:416-429, 1969.
- Mauricio A. Hernández. A Generalization of Band-Joins and the Merge/Purge Problem. Technical Report CUCS- 005-1995, Department of Computer Science, Columbia University, February 1995..
- William Kent, The breakdown of the information model in multi-database systems, ACM SIGMOD Record, v.20 n.4, p.10-15, Dec. 1991 doi:10.1145/141356.141358.
- Karen Kukich, Technique for automatically correcting words in text, ACM Computing Surveys (CSUR), v.24 n.4, p.377-439, Dec. 1992 doi:10.1145/146370.146380
- D. P. Miranker, B. Lofaso, G. Farmer, A. Chandra, and D. Brant. On a TREAT-based Production System Compiler. In: Proceedings of l Oth Int'l Conference on Expert Systems, pages 617-630, 1990..
- Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, Dave Lomet, AlphaSort: a RISC machine sort, Proceedings of the 1994 ACM SIGMOD Conference, p.233-242, May 24-27, 1994, Minneapolis, Minnesota, United States
- Y. Richard Wang, Stuart E. Madnick, The Inter-Database Instance Identification Problem in Integrating Autonomous Systems, Proceedings of the Fifth International Conference on Data Engineering, p.46-55, February 06-10, 1989,
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year | |
---|---|---|---|---|---|---|---|---|---|---|
1995 TheMergePurgeProblemForLargeDBs | Mauricio A. Hernández Salvatore J. Stolf | The Merge/Purge Problem for Large Databases | Proceedings of ACM SIGMOD | 10.1145/223784.223807 | 1995 |