2009 FastApproximateSpectralClusteri

From GM-RKB
Jump to navigation Jump to search

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Abstract

Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of O(n3) in general, with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nystrom method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.

References

  • 1. David Arthur, Sergei Vassilvitskii, K-means++: The Advantages of Careful Seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, p.1027-1035, January 07-09, 2007, New Orleans, Louisiana
  • 2. Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, Angela Y. Wu, An Optimal Algorithm for Approximate Nearest Neighbor Searching Fixed Dimensions, Journal of the ACM (JACM), v.45 n.6, p.891-923, Nov. 1998 doi:10.1145/293347.293348
  • 3. A. Asuncion and D. Newman. UCI Machine Learning Repository, Department of Information and Computer Science. Http://www.ics.uci.edu/ Mlearn/MLRepository.html, 2007.
  • 4. Francis R. Bach, Michael I. Jordan, Learning Spectral Clustering, With Application To Speech Separation, The Journal of Machine Learning Research, 7, p.1963-2001, 12/1/2006
  • 5. Mihai Bādoiu, Sariel Har-Peled, Piotr Indyk, Approximate Clustering via Core-sets, Proceedings of the Thiry-fourth Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montreal, Quebec, Canada doi:10.1145/509907.509947
  • 6. Paul S. Bradley, Usama M. Fayyad, Refining Initial Points for K-Means Clustering, Proceedings of the Fifteenth International Conference on Machine Learning, p.91-99, July 24-27, 1998
  • 7. Leo Breiman, Random Forests, Machine Learning, v.45 n.1, p.5-32, October 1 2001 doi:10.1023/A:1010933404324
  • 8. Sanjoy Dasgupta, Yoav Freund, Random Projection Trees and Low Dimensional Manifolds, Proceedings of the 40th Annual ACM Symposium on Theory of Computing, May 17-20, 2008, Victoria, British Columbia, Canada doi:10.1145/1374376.1374452
  • 9. Inderjit S. Dhillon, Yuqiang Guan, Brian Kulis, Weighted Graph Cuts Without Eigenvectors A Multilevel Approach, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.29 n.11, p.1944-1957, November 2007 doi:10.1109/TPAMI.2007.1115
  • 10. P. Drineas and M. W. Mahoney. On the Nystrom Method for Approximating a Gram Matrix for Improved Kernel-based Learning. In: Proceedings of COLT, Pages 323--337, 2005.
  • 11. Shai Fine, Katya Scheinberg, Efficient Svm Training Using Low-rank Kernel Representations, The Journal of Machine Learning Research, 2, 3/1/2002
  • 12. Charless Fowlkes, Serge Belongie, Fan Chung, Jitendra Malik, Spectral Grouping Using the Nyström Method, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.26 n.2, p.214-225, January 2004 doi:10.1109/TPAMI.2004.1262185
  • 13. R. M. Gray and D. L. Neuhoff. Quantization. IEEE Transactions of Information Theory, 44(6):2325--2383, 1998.
  • 14. Simon Günter, Nicol N. Schraudolph, S. V. N. Vishwanathan, Fast Iterative Kernel Principal Component Analysis, The Journal of Machine Learning Research, 8, p.1893-1918, 12/1/2007
  • 15. John A. Hartigan, Clustering Algorithms, John Wiley & Sons, Inc., New York, NY, 1975
  • 16. J. A. Hartigan and M. A. Wong. A K-means Clustering Algorithm. Applied Statistics, 28(1):100--108, 1979.
  • 17. Bruce Hendrickson, Robert Leland, A Multilevel Algorithm for Partitioning Graphs, Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), p.28-es, December 04-08, 1995, San Diego, California, United States doi:10.1145/224170.224228
  • 18. L. Huang, D. Yan, M. I. Jordan, and N. Taft. Spectral Clustering with Perturbed Data. In Advances in Neural Information Processing Systems (NIPS), December 2008.
  • 19. A. K. Jain, M. N. Murty, P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys (CSUR), v.31 n.3, p.264-323, Sept. 1999 doi:10.1145/331499.331504
  • 20. Ravi Kannan, Santosh Vempala, Adrian Vetta, On Clusterings: Good, Bad and Spectral, Journal of the ACM (JACM), v.51 n.3, p.497-515, May 2004 doi:10.1145/990308.990313
  • 21. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu, An Efficient K-Means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.24 n.7, p.881-892, July 2002 doi:10.1109/TPAMI.2002.1017616
  • 22. George Karypis, Vipin Kumar, A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, SIAM Journal on Scientific Computing, v.20 n.1, p.359-392, Aug. 1998 doi:10.1137/S1064827595287997
  • 23. Amit Kumar, Yogish Sabharwal, Sandeep Sen, A Simple Linear Time (1+ ") -Approximation Algorithm for K-Means Clustering in Any Dimensions, Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, p.454-462, October 17-19, 2004 doi:10.1109/FOCS.2004.7
  • 24. J. F. Lu, J. B. Tang, Z. M. Tang, J. Y. Yang, Hierarchical Initialization Approach for K-Means Clustering, Pattern Recognition Letters, v.29 n.6, p.787-795, April, 2008 doi:10.1016/j.patrec.2007.12.009
  • 25. David Madigan, Nandini Raghavan, William Dumouchel, Martha Nason, Christian Posse, Greg Ridgeway, Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction, Data Mining and Knowledge Discovery, v.6 n.2, p.173-190, April 2002 doi:10.1023/A:1014095614948
  • 26. M. Meila and J. Shi. Learning Segmentation with Random Walk. In Advances in Neural Information Processing Systems (NIPS), 2001.
  • 27. Pabitra Mitra, C. A. Murthy, Sankar K. Pal, Density-Based Multiscale Data Condensation, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.24 n.6, p.734-747, June 2002 doi:10.1109/TPAMI.2002.1008381
  • 28. A. Y. Ng, M. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and An Algorithm. In Advances in Neural Information Processing Systems (NIPS), 2002.
  • 29. J. M. Peña, J. A. Lozano, P. Larrañaga, An Empirical Comparison of Four Initialization Methods for the <italic>K</italic>-Means Algorithm, Pattern Recognition Letters, v.20 n.10, p.1027-1040, Oct. 1999 doi:10.1016/S0167-8655(99)00069-0
  • 30. Stephen J. Redmond, Conor Heneghan, A Method for Initialising the K-means Clustering Algorithm Using Kd-trees, Pattern Recognition Letters, v.28 n.8, p.965-973, June, 2007 doi:10.1016/j.patrec.2007.01.001
  • 31. Jianbo Shi, Jitendra Malik, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.22 n.8, p.888-905, August 2000 doi:10.1109/34.868688
  • 32. Alex J. Smola, Bernhard Schökopf, Sparse Greedy Matrix Approximation for Machine Learning, Proceedings of the Seventeenth International Conference on Machine Learning, p.911-918, June 29-July 02, 2000
  • 33. G. Stewart. Introduction to Matrix Computation. Academic Press, 1973.
  • 34. U. Von Luxburg, M. Belkin, and O. Bousquet. Consistency of Spectral Clustering. Annals of Statistics, 36(2):555--586, 2008.
  • 35. C. Williams and M. Seeger. Using the Nyström Method to Speed Up Kernel Machines. In Advances in Neural Information Processing Systems, 2001.
  • 36. D. Yan, L. Huang, and M. I. Jordan. Fast Approximate Spectral Clustering. Technical Report, Department of Statistics, UC Berkeley, 2009.
  • 37. Stella X. Yu, Jianbo Shi, Multiclass Spectral Clustering, Proceedings of the Ninth IEEE International Conference on Computer Vision, p.313, October 13-16, 2003
  • 38. P. L. Zador. Asymptotic Quantization Error of Continuous Signals and the Quantization Dimension. IEEE Transactions of Information Theory, 28:139--148, 1982.

}},


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2009 FastApproximateSpectralClusteriDonghui Yan
Ling Huang
Michael I. Jordan
Fast Approximate Spectral ClusteringKDD-2009 Proceedings10.1145/1557019.15571182009