k-Means Clustering Algorithm
A k-Means Clustering Algorithm is a top-down clustering algorithm that iteratively relocates cluster centroids so as to minimize the mean within-cluster sum of squares (the mean squared error).
- AKA: Forgy's Method, MacQueen's Algorithm.
- Context:
- It can be implemented by a k-Means Clustering System (to solve a k-means clustering task).
- It requires that k be provided.
- It can start by creating k initial sets, either at random or using some heuristic.
- It can iteratively refine the clusters (as sketched in the code example following this Context list) by:
- 1. Each Instance d_i is assigned to its closest Cluster Centroid.
- 2. Each Cluster Centroid C_j is updated to be the Mean of its Members.
- It can terminate after an iteration that does not change Cluster composition.
- It can (often) be applied to Metric Spaces because of the requirement to compute the Mean.
- It can (often) have a Computational Complexity of [math]\displaystyle{ \mathcal{O} (n \times K \times I \times d) }[/math], where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes.
- It can be considered a special case of the EM Algorithm, that assumes:
- Each cluster is modeled by a spherical Gaussian distribution;
- Each data item is assigned to a single cluster;
- The (prior probability) mixture weights are equal (each cluster is similarly sized).
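For a concrete illustration, the following is a minimal Python/NumPy sketch of the iterative refinement loop described in the Context items above (a Lloyd-style k-means with Forgy-style random initialization); the function and variable names are illustrative and not taken from any particular library.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # X: (n, d) array of n instances with d attributes; k: number of clusters.
    rng = np.random.default_rng(seed)
    # Create k initial centroids by sampling k instances at random (Forgy initialization).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 1: assign each instance to its closest cluster centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Terminate after an iteration that does not change cluster composition.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: update each cluster centroid to be the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels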
- Example(s):
- See: Vector Quantization, Voronoi Cell, NP-Hard, Heuristic Algorithm, Mixture Model, Gaussian Distribution.
References
2014
- (Wikipedia, 2014) ⇒ http://en.wikipedia.org/wiki/k-means_clustering Retrieved:2014-10-20.
- k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.
- (Wikipedia, 2014) ⇒ http://en.wikipedia.org/wiki/K-means_clustering#Description Retrieved:2014-10-20.
- Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find: [math]\displaystyle{ \underset{\mathbf{S}} {\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x \in S_i} \left\| \mathbf x - \boldsymbol\mu_i \right\|^2 }[/math] where μi is the mean of points in Si.
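For concreteness, a small Python/NumPy sketch of this WCSS objective follows, assuming X is an n×d array of observations, labels holds each observation's cluster index, and means holds the k cluster means μi; all names are illustrative.

import numpy as np

def within_cluster_sum_of_squares(X, labels, means):
    # Sum of squared Euclidean distances from each observation to its cluster mean.
    return sum(
        float(np.sum((X[labels == i] - means[i]) ** 2))
        for i in range(len(means))
    )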
2009
- (Ghosh & Liu, 2009) ⇒ Joydeep Ghosh and Alexander Liu. (2009). “K-Means.” In: (Wu & Kumar, 2009)
2004
- (Ding & He, 2004) ⇒ Chris Ding, and Xiaofeng He. (2004). “K-means Clustering via Principal Component Analysis.” In: Proceedings of Int'l Conf. Machine Learning (ICML 2004).
- http://www.nature.com/nrg/journal/v5/n6/glossary/nrg1349_glossary.html
- An unsupervised clustering technique. Data points are partitioned into a predetermined number of non-hierarchical clusters based on similarity.
2001
- (Zha et al., 2001) ⇒ H. Zha, Chris Ding, M. Gu, Xiaofeng He, and H.D. Simon. (2001). “Spectral Relaxation for K-means Clustering.” In: Neural Information Processing Systems vol.14 (NIPS 2001).
- (Wagstaff et al., 2001) ⇒ Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schrödl. (2001). “Constrained K-means Clustering with Background Knowledge.” In: Proceedings of the Eighteenth International Conference on Machine Learning.
1999
- (Larsen & Aone, 1999) ⇒ B. Larsen, and C. Aone. (1999). “Fast and Effective Text Mining using Linear-time Document Clustering.” In: Proceedings of KDD 1999.
1998
- (Neal & Hinton, 1998) ⇒ R. Neal, and G. Hinton. (1998). “A View of the EM Algorithm That Justifies Incremental, Sparse, and Other Variants.” In: Michael I. Jordan (Ed.), Learning in Graphical Models. Kluwer, 1998.
- Allows for application of k-Means/EM to non-continuous metric spaces
1992
- (Cutting et al., 1992) ⇒ D. R. Cutting, D. R. Karger, J. O. Pedersen, and John W. Tukey. (1992). “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections.” In: Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1992).
1990
- (Kaufman & Rousseeuw, 1990) ⇒ L. Kaufman, and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, March 1990.
1979
- (Hartigan & Wong, 1979) ⇒ J. A. Hartigan, and M. A. Wong. (1979). “A K-means Clustering Algorithm.” In: Journal of the Royal Statistical Society, Series C (Applied Statistics).
1973
- (Duda & Hart, 1973) ⇒ R. O. Duda, and P. E. Hart. (1973). “Pattern Classification and Scene Analysis.” New York: John Wiley and Sons.
1967
- (MacQueen, 1967) ⇒ J. B. MacQueen. (1967). “Some Methods for Classification and Analysis of Multivariate Observations.” In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (pp. 281-297).
1965
- (Forgy, 1965) ⇒ E. Forgy. (1965). “Cluster Analysis of Multivariate Data: Efficiency vs. interpretability of classifications.” In: Biometrics 21:768.
1956
- (Steinhaus, 1956) ⇒ H. Steinhaus. (1956). “Sur la division des corps matériels en parties.” In: Bull. Acad. Polon. Sci., Cl. III, vol. IV.