Hardening Soft Databases
Hardening Soft Databases is a Hardening Task that constructs a hard database from a given soft database.
- Example(s):
- …
- Counter-Example(s):
- See: Data Deduplication, Database, Optimization Task, Time Complexity, Heuristic, World Wide Web.
References
2002
- (Bilenko & Mooney, 2002) ⇒ Bilenko, M., & Mooney, R. J. (2002). "Learning to combine trained distance metrics for duplicate detection in databases" (PDF). Submitted to CIKM-2002, 1-19.
- ABSTRACT: The problem of identifying approximately duplicate records in databases has previously been studied as record linkage, the merge/purge problem, hardening soft databases, and field matching. Most existing approaches have focused on efficient algorithms for locating potential duplicates rather than precise similarity metrics for comparing records. In this paper, we present a domain-independent method for improving duplicate detection accuracy using machine learning. First, trainable distance metrics are learned for each field, adapting to the specific notion of similarity that is appropriate for the field's domain. Second, a classifier is employed that uses several diverse metrics for each field as distance features and classifies pairs of records as duplicates or non-duplicates. We also propose an extended model of learnable string distance which improves over an existing approach. Experimental results on real and synthetic datasets show that our method outperforms traditional techniques.
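The following is a minimal sketch of the pairwise duplicate-detection setup the abstract describes, not the authors' implementation: difflib's sequence ratio stands in for their trainable per-field string distance, scikit-learn's LogisticRegression stands in for their pair classifier, and the record schema and labeled pairs are hypothetical.

```python
# Sketch only: per-field similarity features + a learned pair classifier,
# in the spirit of Bilenko & Mooney (2002). Stand-ins, not the paper's metrics.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

FIELDS = ["author", "title", "venue"]  # hypothetical record schema

def field_similarities(rec_a, rec_b):
    """One similarity feature per field for a pair of records (dicts)."""
    return [SequenceMatcher(None, rec_a[f], rec_b[f]).ratio() for f in FIELDS]

# Hypothetical labeled pairs: (record_a, record_b, is_duplicate)
training_pairs = [
    ({"author": "W. Cohen", "title": "Hardening soft information sources", "venue": "KDD"},
     {"author": "William W. Cohen", "title": "Hardening Soft Information Sources", "venue": "SIGKDD"}, 1),
    ({"author": "W. Cohen", "title": "Hardening soft information sources", "venue": "KDD"},
     {"author": "R. Mooney", "title": "Learning to combine trained distance metrics", "venue": "CIKM"}, 0),
]

X = [field_similarities(a, b) for a, b, _ in training_pairs]
y = [label for _, _, label in training_pairs]

# Classifier over per-field distance features decides duplicate vs. non-duplicate.
classifier = LogisticRegression().fit(X, y)

candidate = ({"author": "William Cohen", "title": "Hardening soft information sources", "venue": "SIGKDD"},
             {"author": "W. W. Cohen", "title": "Hardening Soft Information Sources", "venue": "KDD"})
print(classifier.predict([field_similarities(*candidate)]))  # 1 => predicted duplicate
```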
2000
- (Cohen et al., 2000) ⇒ Cohen, W. W., Kautz, H., & McAllester, D. (2000, August). "Hardening Soft Information Sources" (PDF). In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 255-259). ACM.
- ABSTRACT: The web contains a large quantity of unstructured information. In many cases, it is possible to heuristically extract structured information, but the resulting databases are "soft": they contain inconsistencies and duplication, and lack unique, consistently-used object identifiers. Examples include large bibliographic databases harvested from raw scientific papers or databases constructed by merging heterogeneous "hard" databases. Here we formally model a soft database as a noisy version of some unknown hard database. We then consider the hardening problem, i.e., the problem of inferring the most likely underlying hard database given a particular soft database. A key feature of our approach is that hardening is global - many sources of evidence for a given hard fact are taken into account. We formulate hardening as an optimization problem and give a nontrivial nearly linear time algorithm for finding a local optimum.
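As an illustration of the hardening idea described above, the sketch below collapses co-referent soft records into one hard record each. It is not Cohen et al.'s nearly linear-time local-search algorithm: a greedy union-find merge over an assumed similarity threshold replaces their global cost optimization, and the soft records are hypothetical.

```python
# Sketch only: "hardening" a soft database by merging records that appear to
# denote the same underlying hard object. Greedy threshold merge, not the
# paper's optimization algorithm.
from difflib import SequenceMatcher

soft_db = [  # hypothetical soft bibliographic records with inconsistent identifiers
    "Cohen W., Hardening soft information sources, KDD 2000",
    "W. W. Cohen, Hardening Soft Information Sources, SIGKDD-2000",
    "Bilenko M., Learning to combine trained distance metrics, CIKM 2002",
]

parent = list(range(len(soft_db)))  # union-find forest: one cluster per hard object

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

THRESHOLD = 0.6  # assumed merge threshold; the paper optimizes a global cost instead
for i in range(len(soft_db)):
    for j in range(i + 1, len(soft_db)):
        if SequenceMatcher(None, soft_db[i], soft_db[j]).ratio() > THRESHOLD:
            union(i, j)  # evidence that records i and j refer to the same hard fact

# One representative record per cluster approximates the inferred hard database.
hard_db = {find(i): soft_db[i] for i in range(len(soft_db))}
print(list(hard_db.values()))
```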