2009 ApplyingSyntacticSimilarityAlgo
- (Cherkasova et al., 2009) ⇒ Ludmila Cherkasova, Kave Eshghi, Charles B. Morrey, Joseph Tucek, and Alistair Veitch. (2009). “Applying Syntactic Similarity Algorithms for Enterprise Information Management.” In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2009). doi:10.1145/1557019.1557137
Subject Headings:
Notes
Cited By
- http://scholar.google.com/scholar?q=%22Applying+syntactic+similarity+algorithms+for+enterprise+information+management%22+2009
- http://portal.acm.org/citation.cfm?doid=1557019.1557137&preflayout=flat#citedby
Quotes
Author Keywords
Syntactic Similarity, Enterprise Information Management, Performance Modeling, Shingling algorithms, Content-based chunking algorithms.
Abstract
- For implementing content management solutions and enabling new applications associated with data retention, regulatory compliance, and litigation issues, enterprises need to develop advanced analytics to uncover relationships among the documents, e.g., content similarity, provenance, and clustering. In this paper, we evaluate the performance of four syntactic similarity algorithms. Three algorithms are based on Broder's "shingling" technique while the fourth algorithm employs a more recent approach, "content-based chunking". For our experiments, we use a specially designed corpus of documents that includes a set of "similar" documents with a controlled number of modifications. Our performance study reveals that the similarity metric of all four algorithms is highly sensitive to settings of the algorithms' parameters: sliding window size and fingerprint sampling frequency. We identify a useful range of these parameters for achieving good practical results, and compare the performance of the four algorithms in a controlled environment. We validate our results by applying these algorithms to finding near-duplicates in two large collections of HP technical support documents.
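The two parameters named in the abstract, sliding window size and fingerprint sampling frequency, can be illustrated with a minimal sketch of basic word-level shingling. This is not the paper's implementation and does not reproduce any of its four algorithms: the function names, the use of SHA-1 as a stand-in fingerprint function, and the "keep fingerprints divisible by `modulus`" sampling rule are all assumptions chosen for illustration. Content-based chunking is not covered.

```python
import hashlib


def shingles(text, window=4):
    """Return the set of word-level shingles: contiguous runs of `window` words."""
    words = text.split()
    if len(words) < window:
        # A document shorter than the window yields one shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + window]) for i in range(len(words) - window + 1)}


def fingerprint(shingle):
    """64-bit integer fingerprint of a shingle (SHA-1 prefix; an assumed stand-in)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sampled_fingerprints(text, window=4, modulus=1):
    """Fingerprints of all shingles, keeping only those divisible by `modulus`.

    modulus=1 keeps every fingerprint; a larger modulus samples more sparsely.
    """
    return {fp for fp in map(fingerprint, shingles(text, window)) if fp % modulus == 0}


def resemblance(doc_a, doc_b, window=4, modulus=1):
    """Jaccard similarity of the two documents' sampled fingerprint sets."""
    fa = sampled_fingerprints(doc_a, window, modulus)
    fb = sampled_fingerprints(doc_b, window, modulus)
    union = fa | fb
    # With aggressive sampling both sets can be empty; report 0.0 in that case.
    return len(fa & fb) / len(union) if union else 0.0


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumps over the lazy cat"
    for w in (2, 4):
        print(w, resemblance(a, b, window=w))
```

In this toy example a single changed word gives a resemblance of 7/9 with a window of 2 and 5/7 with a window of 4, because each modified word invalidates every shingle that overlaps it. That is one simple way to see why the similarity score depends on the window size.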
References
| Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
---|---|---|---|---|---|---|---|---|---|---|
2009 ApplyingSyntacticSimilarityAlgo | Ludmila Cherkasova, Alistair Veitch, Kave Eshghi, Charles B. Morrey, Joseph Tucek | | | Applying Syntactic Similarity Algorithms for Enterprise Information Management | | KDD-2009 Proceedings | | 10.1145/1557019.1557137 | | 2009 |