2011 EfficientSimilarityJoinsforNear
- (Xiao et al., 2011) ⇒ Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, and Guoren Wang. (2011). “Efficient Similarity Joins for Near-duplicate Detection.” In: ACM Transactions on Database Systems (TODS) Journal, 36(3). doi:10.1145/2000824.2000825
Subject Headings:
Notes
Cited By
- http://scholar.google.com/scholar?q=%22Efficient+similarity+joins+for+near-duplicate+detection%22+2011
- http://dl.acm.org/citation.cfm?id=2000824.2000825&preflayout=flat#citedby
Quotes
Author Keywords
Abstract
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near-duplicate records efficiently. In this article, we focus on efficient algorithms to find a pair of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show our proposed algorithms can outperform previous algorithms on several real datasets.
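The prefix filtering principle mentioned in the abstract can be illustrated with a minimal sketch. It assumes Jaccard similarity over token sets and a global token order by increasing document frequency; the function names `jaccard` and `prefix_filter_join` are hypothetical and chosen for illustration only. This is the generic baseline idea, not the article's proposed algorithms, which add further filters based on token ordering information.

```python
from collections import Counter, defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(records, threshold):
    """Return (i, j, sim) for all i < j with Jaccard(records[i], records[j]) >= threshold.

    Illustrative sketch: assumes 0 < threshold <= 1 and Jaccard similarity.
    """
    # Global order: rare tokens first, ties broken by the token itself.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    results = []
    for rid, tokens in enumerate(ordered):
        if not tokens:
            continue
        # Two records that meet the threshold must share a token in these prefixes.
        # The small epsilon guards against floating-point round-up in ceil().
        prefix_len = len(tokens) - ceil(threshold * len(tokens) - 1e-9) + 1
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)
        # Verify only the candidates instead of every earlier record.
        for cid in sorted(candidates):
            sim = jaccard(tokens, ordered[cid])
            if sim >= threshold:
                results.append((cid, rid, sim))
    return results


if __name__ == "__main__":
    docs = ["a b c d e".split(), "a b c d f".split(), "x y z".split()]
    for i, j, sim in prefix_filter_join(docs, 0.6):
        print(i, j, round(sim, 3))
```

Only records that share a prefix token are ever compared, so the third record in the usage example is never verified against the first two. Shorter prefixes and rarer prefix tokens both shrink the candidate set, which is why the token order matters.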
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2011 EfficientSimilarityJoinsforNear | Wei Wang, Jeffrey Xu Yu, Xuemin Lin, Chuan Xiao, Guoren Wang | | | Efficient Similarity Joins for Near-duplicate Detection | | | | 10.1145/2000824.2000825 | | 2011 |