2005 ExploitingRelsForDomIndepDataCleaning

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Record Coreference Disambiguation Algorithm, RelDC, Relational Similarity Function.

Notes

Cited By

2008

2007

2006

Quotes

Abstract

In this paper we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the [[Proposed Algorithm|approach we propose]] (called RelDC) and the traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality. Our extensive experiments over two real datasets and also over synthetic datasets show that analysis of relationships significantly improves quality of the result.

6 Related Work

Many research challenges have been explored in the context of data cleaning in the literature: dealing with missing data, handling erroneous data, record linkage, and so on. The closest to the problem of reference disambiguation addressed in this paper is the problem of record linkage. The importance of record linkage is underscored by the large number of companies, such as Trillium, Vality, FirstLogic, DataFlux, which have developed (domain-specific) record linkage solutions.

Researchers have also explored domain-independent techniques, e.g. [23, 12, 14, 5, 22]. Their work can be viewed as addressing two challenges: (1) improving similarity function, as in [6]; and (2) improving efficiency of linkage, as in [7]. Typically two-level similarity functions are employed to compare two records. First, such a function computes attribute-level similarities by comparing values in the same attributes of two records. Next the function combines the attribute-level similarity measures to compute the overall similarity of two records. A recent trend has been to employ machine learning techniques, e.g. SVM, to learn the best similarity function for a given domain [6]. Many techniques have been proposed to address the efficiency challenge as well: e.g. using specialized indexes [7], sortings, etc.

Those domain-independent techniques deal only with attributes. To the best of our knowledge, RelDC, which was first publicly released in [15], is the first domain-independent data cleaning framework which exploits relationships for cleaning. Recently, in parallel to our work, other researchers have also proposed using relationships for cleaning. In [5] Ananthakrishna et al. employ similarity of directly linked entities, for the case of hierarchical relationships, to solve the record de-duplication challenge. In [19] Lee et al. develop an association-rules mining based method to disambiguate references using similarity of the context attributes: the proposed technique is still an FBS method, but [19] also discusses concept hierarchies which are related to relationships. Getoor et al. in DKDM04 use similarity of attributes of directly linked objects, like in [5], for the purpose of object consolidation. However, the challenge of applying that technique in practice on real-world datasets was identified as future workin that paper. In contrast to the above described techniques, RelDC utilize the CAP principle to automatically discover and analyze relationship chains, thereby establishing a framework that employs systematic relationship analysis for the purpose of cleaning.

7 Conclusion

In this paper we have shown that analysis of interobject relationships is important for data cleaning and demonstrated one approach that utilizes relationships. As future workwe plan to apply similar techniques to the problem of record linkage. This paper outlines only the core of the RelDC approach, for more details the interested reader is referred to [16].

Another interesting follow-up work[18] addresses the challenge of automatically adapting RelDC to datasets at hand by learning how to weigh different connections directly from the data. Solving this challenge, in general, not only makes the approach to be a plug-and-play solution but also can improve the accuracy as well as efficiency of the approach as discussed in [18]

References

  • CiteSeer. http://citeseer.nj.nec.com/cs.
  • HomePage Search. http://hpsearch.uni-trier.de.
  • Knowledge Discovery. http://www.kdnuggets.com/polls/2003/data preparation.htm.
  • SNOPT solver. http://www.gams.com/solvers/.
  • Ananthakrishna, Chaudhuri, and Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, 2002.
  • (Bilenko and Mooney, 2003) ⇒ Mikhail Bilenko, and Raymond Mooney. (2003). “Adaptive duplicate detection using learnable string similarity measures.” In: Proceedings of [[KDD]-2003.
  • S. Chaudhuri, K. Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of ACM SIGMOD Conf., 2003.
  • R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In: Proceedings of ACM SIGMOD Conf., 2003.
  • R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Querying imprecise data in moving object environments. IEEE TKDE, 16(9), Sept. 2004.
  • R. Cheng, S. Prabhakar, and D. Kalashnikov. Querying imprecise data in moving object environments. In: Proceedings of IEEE ICDE Conf., 2003.
  • Christos Faloutsos, K. McCurley, and A. Tomkins. Fast discovery of connection subgraphs. In SIGKDD, 2004.
  • I. Fellegi and A. Sunter. A theory for record linkage. J. of Amer. Stat. Assoc., 64(328):1183–1210, 1969.
  • H. Garcia-Molina, J. Ullman, and J. Widom. Database systems: the complete book. Prentice Hall, 2002.
  • M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In: Proceedings of SIGMOD, 1995.
  • D. Kalashnikov, and S. Mehrotra. “Exploiting relationships for data cleaning.” TR-RESCUE-03-02, Nov. 2003.
  • D. Kalashnikov, and S. Mehrotra. “Exploiting relationships for domain-independent data cleaning.” Extended Version of SIAM Data Mining 2005 publication, http://www.ics.uci.edu/∼dvk/pub/sdm05.pdf.
  • D. V. Kalashnikov, and S. Mehrotra. “RelDC project.http://www.ics.uci.edu/∼dvk/RelDC/.
  • D. V. Kalashnikov, and S. Mehrotra. “Learning importance of relationships for reference disambiguation.” Submitted for Publication, Dec. (2004). http://www.ics.uci.edu/∼dvk/RelDC/TR/TR-RESCUE-04-23.pdf.
  • M. Lee, W. Hsu, and V. Kothari. Cleaning the spurious links in data. IEEE Intelligent Systems, Mar-Apr 2004.
  • R. Little and D. Rubin. Statistical Analysis with Missing Data. John Wiley and Sons, 1986.
  • J. Maletic and A. Marcus. Data cleansing: Beyond integrity checking. In Conference on Inf. Quality, 2000.
  • A. K. McCallum, K. Nigam, and L. Ungar. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In ACM SIGKDD, 2000.
  • Newcombe, Kennedy, Axford, and James. Automatic linkage of vital records. Science, 130:954–959, 1959.
  • S. White and Padhraic Smyth. Algorithms for estimating relative importance in networks. In SIGKDD, 2003.
  • G. Wiederhold. www-db.stanford.edu/pub/movies/,


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2005 ExploitingRelsForDomIndepDataCleaningDmitri V. Kalashnikov
Sharad Mehrotra
Zhaoqi Chen
Exploiting Relationships for Domain-Independent Data CleaningProceedings of the SIAM International Conference on Data Mininghttp://www.siam.org/proceedings/datamining/2005/dm05 24Kalashnikovd.pdf2005