Data Merge-Purge Task
A Data Merge-Purge Task is a data processing task that requires the merging of two or more data record sets with coreferent records (with possibly different data structures) into a single canonical record set (with no duplicate records); a minimal code sketch of the task appears below, after the See section.
- Context:
- Input: One or more Data Record Sets, where each set may contain Duplicate Records.
- Output: One Data Record Set with no Duplicate Records.
- It can be decomposed into a Record Coreference Resolution Task (detecting the coreferent records) and a Record Canonicalization Task (producing one canonical record per group).
- It can support a Data Cleaning Task.
- It can be solved by a Merge-Purge System (that implements a merge/purge algorithm).
- Example(s):
- Given two or more Person Tables (where each table may contain duplicate records), create a Person Table with no Duplicate Records, a Person Record Deduplication Task.
- Given two or more Citation Tables (where each table may contain duplicate records), create a single Citation Table with no Duplicate Records, a Citation Record Deduplication Task.
- Given two or more Customer Tables, return the sets of Records that Refer to the same Person.
- Given a Table with Product Records, return the sets of Records that Refer to the same Product.
- Given a Table with Protein Records, return the sets of Records that Refer to the same Protein.
- Given a Table with Citation Records, return the sets of Records that Refer to the same Citation.
- …
- Counter-Example(s):
- A Record Normalization Task, where a Canonical Record Set is provided and no Canonicalization is required.
- A Data Normalization Task, which decomposes a Record Set into Record Sets with less Redundant Data.
- Automatically detecting whether two or more Record Sets Refer to the same type of Thing.
- See: Summarization Task, Coreference Resolution Task, Information Extraction Task, Approximate Matching Task.
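The following is a minimal, illustrative Python sketch of the task under simple assumptions: the input record sets are pooled, coreferent records are grouped by a pairwise match rule combined with a union-find (transitive) closure, and one canonical record is kept per group. The record fields, the same_person rule, and the longest_name canonicalization policy are assumptions made for the example, not part of any particular Merge-Purge System.

```python
from itertools import combinations

def merge_purge(record_sets, match, canonicalize):
    """Merge several record sets and purge coreferent duplicates.

    record_sets  -- iterable of lists of records (dicts)
    match        -- function (r1, r2) -> bool deciding coreference
    canonicalize -- function (group of records) -> one canonical record
    """
    # 1. Merge: pool every record from every input set.
    records = [r for rs in record_sets for r in rs]

    # 2. Group coreferent records: union-find over pairwise matches,
    #    which also gives the transitive closure of the match relation.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if match(records[i], records[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)

    # 3. Purge: emit one canonical record per coreference group.
    return [canonicalize(group) for group in groups.values()]


# Toy usage: two person tables that share one coreferent record.
table_a = [{"name": "J. Smith", "city": "Boston"},
           {"name": "Ann Lee", "city": "Austin"}]
table_b = [{"name": "John Smith", "city": "Boston"}]

same_person = lambda a, b: (a["city"] == b["city"] and
                            a["name"].split()[-1] == b["name"].split()[-1])
longest_name = lambda group: max(group, key=lambda r: len(r["name"]))

print(merge_purge([table_a, table_b], same_person, longest_name))
# [{'name': 'John Smith', 'city': 'Boston'}, {'name': 'Ann Lee', 'city': 'Austin'}]
```

The all-pairs comparison above does not scale to large repositories; the multi-pass sorted-neighborhood approach described under Hernández & Stolfo (1998) below avoids it by comparing only records that fall close together under several different sort keys.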
References
2006
- (Koudas, 2006) ⇒ Nick Koudas, editor. (2006). “Issue on Data Quality.” IEEE Data Engineering Bulletin, Volume 29.
1998
- (Hernández & Stolfo, 1998) ⇒ Mauricio A. Hernández, and Salvatore J. Stolfo. (1998). “Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem.” In: Data Mining and Knowledge Discovery, 2(1).
- QUOTE: The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent “equational theory” that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates especially in an environment with massive amounts of data.
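A compact Python sketch of the multi-pass sorted-neighborhood idea described in this abstract, under stated assumptions: records are sorted on a key, only records inside a small sliding window of the sorted order are compared, several independent passes use different keys, and the matched pairs are then combined by transitive closure. The window size, key functions, and match rule below are assumptions standing in for the paper's domain-dependent “equational theory”.

```python
def sorted_neighborhood_pass(records, key, match, window=3):
    """One pass: sort on a key, then compare only records that fall
    inside a small sliding window of the sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if match(records[i], records[j]):
                pairs.add((i, j))
    return pairs

def multi_pass_merge_purge(records, keys, match, window=3):
    """Run independent passes with different sort keys, then take the
    transitive closure of all matched pairs to form duplicate groups."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood_pass(records, key, match, window)

    # Transitive closure via union-find: if a~b and b~c, then a~c.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


# Toy usage: two passes with different sort keys over a small person table.
people = [{"first": "John", "last": "Smith", "zip": "02134"},
          {"first": "John", "last": "Smyth", "zip": "02134"},
          {"first": "Ann", "last": "Lee", "zip": "73301"}]
keys = [lambda r: (r["last"], r["first"]), lambda r: (r["zip"], r["last"])]
match = lambda a, b: a["first"] == b["first"] and a["zip"] == b["zip"]
print(multi_pass_merge_purge(people, keys, match))
# -> two groups: the two "John S..." records together, and "Ann Lee" alone.
```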
1995
- (Hernández & Stolfo, 1995) ⇒ Mauricio A. Hernández, and Salvatore J. Stolfo. (1995). “The Merge/Purge Problem for Large Databases.” In: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD 1995).