2011 ClusteringVeryLargeMultiDimensi
- (Ferreira Cordeiro et al., 2011) ⇒ Robson Leonardo Ferreira Cordeiro, Caetano Traina Junior, Agma Juci Machado Traina, Julio López, U. Kang, and Christos Faloutsos. (2011). “Clustering Very Large Multi-dimensional Datasets with MapReduce.” In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2011). ISBN: 978-1-4503-0813-7. doi:10.1145/2020408.2020516
Subject Headings: MapReduce, Parallel Clustering, Very Large Multi-dimensional Dataset.
Notes
Cited By
- http://scholar.google.com/scholar?q=%222011%22+Clustering+Very+Large+Multi-dimensional+Datasets+with+MapReduce
- http://dl.acm.org/citation.cfm?id=2020408.2020516&preflayout=flat#citedby
Quotes
Author Keywords
- Algorithms; clustering; data mining; design; experimentation; mapreduce; parallel programming; performance; very large multi-dimensional datasets
Abstract
Given a very large moderate-to-high dimensionality dataset, how could one cluster its points? For datasets that don't fit even on a single disk, parallelism is a first-class option. In this paper we explore MapReduce for clustering this kind of data. The main questions are (a) how to minimize the I/O cost, taking into account the already existing data partition (e.g., on disks), and (b) how to minimize the network cost among processing nodes. Either of them may be a bottleneck. Thus, we propose the Best of both Worlds (BoW) method, which automatically spots the bottleneck and chooses a good strategy. Our main contributions are: (1) We propose BoW and carefully derive its cost functions, which dynamically choose the best strategy; (2) We show that BoW has numerous desirable features: it can work with most serial clustering methods as a plugged-in clustering subroutine; it balances the cost for disk accesses and network accesses, achieving a very good tradeoff between the two; it uses no user-defined parameters (thanks to our reasonable defaults); it matches the clustering quality of the serial algorithm; and it has near-linear scale-up; and finally, (3) We report experiments on real and synthetic data with billions of points, using up to 1,024 cores in parallel. To the best of our knowledge, our Yahoo! web is the largest real dataset ever reported in the database subspace clustering literature. Spanning 0.2 TB of multi-dimensional data, it took only 8 minutes to be clustered, using 128 cores.
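The abstract's central mechanism, estimating whether disk I/O or network traffic will dominate and then picking the cheaper execution plan, can be illustrated with a small sketch. The Python below is a toy cost-model chooser under a simplified two-strategy setup; the `ClusterJob` class, the cost formulas, and the strategy names are illustrative assumptions, not the paper's actual cost functions.

```python
# Minimal sketch (not the authors' implementation) of BoW's core idea:
# estimate the dominant cost of two parallel-clustering strategies and
# pick the cheaper one before dispatching the MapReduce job.
# All names, parameters, and formulas below are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class ClusterJob:
    n_points: int          # total points in the dataset
    n_cores: int           # parallel workers (e.g., 128 or 1,024)
    disk_rate: float       # bytes/s a node reads from its local disk
    net_rate: float        # bytes/s sustained across the network
    bytes_per_point: int   # record size on disk


def estimate_costs(job: ClusterJob) -> tuple[float, float]:
    """Hypothetical cost model: one strategy re-partitions the data over
    the network first (network-bound); the other clusters the data where
    it already sits on disk and merges the results (disk-bound)."""
    total_bytes = job.n_points * job.bytes_per_point
    shuffle_cost = total_bytes / job.net_rate                  # network-heavy plan
    local_cost = (total_bytes / job.n_cores) / job.disk_rate   # disk-heavy plan
    return shuffle_cost, local_cost


def choose_strategy(job: ClusterJob) -> str:
    """Spot the bottleneck and choose a strategy, in the spirit of BoW's
    (much more careful) analytical cost functions."""
    shuffle_cost, local_cost = estimate_costs(job)
    if shuffle_cost < local_cost:
        return "repartition-then-cluster"
    return "cluster-in-place-then-merge"


if __name__ == "__main__":
    # Made-up rates for a billion-point job on 128 cores.
    job = ClusterJob(n_points=10**9, n_cores=128,
                     disk_rate=10**8, net_rate=10**7, bytes_per_point=200)
    print(choose_strategy(job))  # network dominates here -> cluster in place
```

With these made-up rates the network is the bottleneck, so the chooser keeps the data where it already sits, mirroring BoW's stated goal of exploiting the pre-existing data partition on disks.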
References
Page | Author | Title | DOI | Year
---|---|---|---|---
2011 ClusteringVeryLargeMultiDimensi | Christos Faloutsos; U. Kang; Robson Leonardo Ferreira Cordeiro; Caetano Traina Junior; Agma Juci Machado Traina; Julio López | Clustering Very Large Multi-dimensional Datasets with MapReduce | 10.1145/2020408.2020516 | 2011