2008 DeDupingURLsviaRewriteRules

(Dasgupta et al., 2008) ⇒ Anirban Dasgupta, Ravi Kumar, and Amit Sasturkar. (2008). “De-duping URLs via Rewrite Rules.” In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008). doi:10.1145/1401890.1401917

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Abstract

A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content.

In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with similar text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm by performing a variety of extensive large-scale experiments.
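The idea in the abstract can be illustrated with a toy sketch: learn rewrite rules from URLs already grouped by content, then canonicalize unseen URLs without fetching them. This is not the paper's framework or algorithm. It assumes a far narrower rule language (only "drop query parameter k on a given scheme, host, and path"), and the function names (`parse`, `canonicalize`, `is_safe`, `learn_rules`) and the `example.com` URLs are hypothetical, chosen for illustration.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def parse(url):
    """Split a URL into a (scheme, host, path) key and a dict of query parameters."""
    p = urlsplit(url)
    params = dict(parse_qsl(p.query, keep_blank_values=True))
    return (p.scheme, p.netloc.lower(), p.path), params


def canonicalize(url, rules):
    """Drop the parameters named by the rules for this key and sort the rest."""
    key, params = parse(url)
    drop = rules.get(key, set())
    kept = sorted((k, v) for k, v in params.items() if k not in drop)
    scheme, host, path = key
    return urlunsplit((scheme, host, path, urlencode(kept), ""))


def is_safe(rules, clusters):
    """A rule set is safe if it never maps URLs from two clusters to one form."""
    owner = {}
    for idx, cluster in enumerate(clusters):
        for url in cluster:
            if owner.setdefault(canonicalize(url, rules), idx) != idx:
                return False
    return True


def learn_rules(clusters):
    """Propose drop-parameter rules from duplicate clusters (toy assumption)."""
    candidates = defaultdict(set)
    for cluster in clusters:
        seen = defaultdict(set)  # (key, parameter name) -> values in this cluster
        for url in cluster:
            key, params = parse(url)
            for name, value in params.items():
                seen[(key, name)].add(value)
        for (key, name), values in seen.items():
            if len(values) > 1:  # the value changes but the content does not
                candidates[key].add(name)

    # Greedily accept a candidate only if the accumulated rules stay safe.
    rules = {}
    for key in sorted(candidates):
        for name in sorted(candidates[key]):
            trial = {k: set(v) for k, v in rules.items()}
            trial.setdefault(key, set()).add(name)
            if is_safe(trial, clusters):
                rules = trial
    return rules


# Hypothetical training data: each inner list is one equivalence class.
clusters = [
    ["http://example.com/item?id=1&sid=abc", "http://example.com/item?id=1&sid=xyz"],
    ["http://example.com/item?id=2&sid=abc", "http://example.com/item?id=2&sid=def"],
]
rules = learn_rules(clusters)

# A URL seen for the first time is canonicalized without fetching its content.
print(canonicalize("http://example.com/item?sid=new&id=3", rules))
# → http://example.com/item?id=3
```

The safety check stands in, very loosely, for the abstract's "appropriate notion of correctness": here a rule is rejected if it would merge URLs known to have different content. The paper's actual definition may differ.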

References

