2003 RobustEfficientFuzzyMatch
- (Chaudhuri et al., 2003) ⇒ Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani. (2003). “Robust and Efficient Fuzzy Match for Online Data Cleaning.” In: Proceedings of the 2003 ACM SIGMOD Conference (SIGMOD 2003). doi:10.1145/872757.872796
Subject Headings: Entity Record Deduplication Algorithm, Data Cleansing Task
Notes
Cited By
~276 http://scholar.google.com/scholar?cites=1027889780569772575
- (Chaudhuri et al., 2005) ⇒ Surajit Chaudhuri, Venkatesh Ganti, Rajeev Motwani. (2005). “Robust Identification of Fuzzy Duplicates.” Proceedings of the 21st International Conference on Data Engineering (ICDE'05), p.865-876, April 05-08, 2005 doi:10.1109/ICDE.2005.125
- (Winkler, 2006) ⇒ William E. Winkler. (2006). “Overview of record linkage and current research directions.” Technical Report Statistical Research Report Series RRS2006/02, U.S. Bureau of the Census.
- Chaudhuri et al (2003) showed how to create index structures that allow for certain types of typographical error in matching within a database. In their application, their methods reduced computation by a factor of three in comparison with naïve methods that compare every pair of records in two files.
Quotes
Abstract
- To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation.A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
References
- R. Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In: Proceedings of VLDB, Hong Kong, 2002.
- Ricardo Baeza-Yates and Gonzalo Navarro. A practical index for text retrieval allowing errors. In R. Monge, editor, Proceedings of the XXIII Latin American Conference on Informatics (CLEI'97), Valparaiso, Chile, 1997.
- Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
- A. Broder, On the Resemblance and Containment of Documents, Proceedings of the Compression and Complexity of Sequences 1997, p.21, June 11-13, 1997
- Paolo Ciaccia, Marco Patella, Pavel Zezula, M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, Proceedings of the 23rd International Conference on Very Large Data Bases, p.426-435, August 25-29, 1997
- Edith Cohen, Size-estimation framework with applications to transitive closure and reachability, Journal of Computer and System Sciences, v.55 n.3, p.441-453, Dec. 1997 doi:10.1006/jcss.1997.1534 7 William W. Cohen, Integration of heterogeneous databases without common domains using queries based on textual similarity, Proceedings of the 1998 ACM SIGMOD Conference, p.201-212, June 01-04, 1998, Seattle, Washington, United States 8 William W. Cohen, Data integration using similarity joins and a word-based information representation language, ACM Transactions on Information Systems (TOIS), v.18 n.3, p.288-321, July 2000 doi:10.1145/352595.352598
- Edith Cohen, David D. Lewis, Approximating Matrix Multiplication for Pattern Recognition Tasks, Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, p.682-691, January 05-07, 1997, New Orleans, Louisiana, United States
- 10 W. Cohen and J. Richman. Learning to match and cluster entity names. In: Proceedings of SIGKDD, Edmonton, July (2002). 11 Volker Gaede, Oliver Günther, Multidimensional access methods, ACM Computing Surveys (CSUR), v.30 n.2, p.170-231, June 1998 doi:10.1145/280277.280279
- Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001 13 Mauricio A. Hernández, Salvatore J. Stolfo, The merge/purge problem for large databases, Proceedings of the 1995 ACM SIGMOD Conference, p.127-138, May 22-25, 1995, San Jose, California, United States 14 Piotr Indyk, Rajeev Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.604-613, May 24-26, 1998, Dallas, Texas, United States doi:10.1145/276698.276876
- P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In A. Tarlecki, editor, Mathematical Foundations of Computer Science, 1991.
- Rajeev Motwani, Prabhakar Raghavan, Randomized algorithms, Cambridge University Press, New York, NY, 1995
- Gonzalo Navarro, Ricardo Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin, 24(4):19--27, 2001.
- Gonzalo Navarro, Erkki Sutinen, Jani Tanninen, Jorma Tarhio, Indexing Text with Approximate q-Grams, Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, p.350-363, June 21-23, 2000
- Gonzalo Navarro, Searching in metric spaces by spatial approximation, The VLDB Journal — The International Journal on Very Large Data Bases, v.11 n.1, p.28-46, August 2002 doi:10.1007/s007780200060 20 Sunita Sarawagi, Anuradha Bhamidipaty, Interactive deduplication using active learning, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775087
- B. Schneier. Applied Cryptography John Wiley, 1996.
- T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.
- Trillium Software. http://www.trilliumsoft.com
- W. Winkler. The state of record linkage and current research problems. http://www.census.gov/srd/papers/pdf/rr99-04.pdf
,
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year | |
---|---|---|---|---|---|---|---|---|---|---|
2003 RobustEfficientFuzzyMatch | Rajeev Motwani Surajit Chaudhuri Venkatesh Ganti Kris Ganjam | Robust and Efficient Fuzzy Match for Online Data Cleaning | Proceedings of the 2003 ACM SIGMOD Conference | http://research.microsoft.com/pubs/68895/fm sigmod03.pdf | 10.1145/872757.872796 | 2003 |