Clustering Algorithm
A clustering algorithm is an unsupervised learning algorithm that can solve a clustering task.
- Context:
- It can range from being a Hard Clustering Algorithm to being a Fuzzy Clustering Algorithm.
- It can range from being a Generic Clustering Algorithm to being a Domain-Specific Clustering Algorithm.
- It can create a Clustering Model, that will place an unseen Data Record in a Cluster.
- It can be a k Clustering Algorithm that requires the number of Clusters to be specified in advance.
- It can use a Similarity Function as task input.
- It can be implemented into a Clustering System.
- It can be challenged by Vectors with a large number of Dimensions.
- It can range from being a High-Dimensionality Clustering Algorithm to being a Low-Dimensionality Clustering Algorithm.
- It can be a Constrained Clustering Algorithm (that can make use of Background Knowledge).
- Example(s):
- k-Means Clustering Algorithm.
- KD-Tree Clustering Algorithm (efficient for low-dimensional data).
- DBSCAN Clustering Algorithm.
- E-M Algorithm.
- …
- Counter-Example(s):
- See: Unsupervised Learning Task.
References
2013
- (Wikipedia, 2013) ⇒ http://en.wikipedia.org/wiki/Cluster_analysis#Clustering_algorithms
- Clustering algorithms can be categorized based on their cluster model, as listed above. The following overview will only list the most prominent examples of clustering algorithms, as there are probably a few dozen (if not over 100) published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized. An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms.
There is no objectively "correct" clustering algorithm; as has been noted, "clustering is in the eye of the beholder." The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. An algorithm designed for one kind of model will generally fail on a data set containing a radically different kind of model: for example, k-means cannot find non-convex clusters.
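Why k-means fails on non-convex clusters can be illustrated with a small, hypothetical example (not from the source): for two concentric rings, the mean of each ring collapses to the shared centre, so the two ideal centroids coincide and nearest-centroid assignment cannot tell the rings apart.

```python
import math

# Two concentric rings: the "true" clusters are non-convex.
inner = [(math.cos(2 * math.pi * i / 12), math.sin(2 * math.pi * i / 12)) for i in range(12)]
outer = [(5 * math.cos(2 * math.pi * i / 12), 5 * math.sin(2 * math.pi * i / 12)) for i in range(12)]

def mean(pts):
    """Centroid of a list of 2-D points."""
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

# Both ring means collapse to the shared centre (0, 0), so a centroid-based
# method like k-means has no way to separate the two rings.
print(mean(inner))  # ≈ (0, 0)
print(mean(outer))  # ≈ (0, 0)
```

A density-based method such as DBSCAN, by contrast, would recover each ring as a separate cluster.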
2009
- (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Cluster_analysis#Hierarchical_clustering
- Data clustering algorithms can be hierarchical. Hierarchical algorithms find successive clusters using previously established clusters. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
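The agglomerative ("bottom-up") scheme described above can be sketched in a few lines. The following is a minimal single-linkage illustration on 1-D points, an assumption made for brevity, not an implementation from the source:

```python
def agglomerate(points, k):
    """Single-linkage agglomerative clustering: repeatedly merge the
    two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]          # start: every point is its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest inter-point distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]            # merge cluster j into cluster i
        del clusters[j]
    return clusters

# e.g. agglomerate([1, 2, 10, 11, 12], 2) groups {1, 2} and {10, 11, 12}
```

A divisive algorithm would run the same idea in reverse, starting from one all-encompassing cluster and splitting.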
- Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering.
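The canonical partitional method is k-means (Lloyd's algorithm), which alternates an assignment step and a centroid-update step. The sketch below is a minimal illustration assuming 2-D points as tuples; it is not a production implementation:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid recomputation. A minimal sketch for 2-D tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)         # pick k initial centroids from the data
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, c in enumerate(clusters):
            if c:                             # guard against empty clusters
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters
```

Note that k must be specified in advance, which is exactly the property flagged in the Context section above.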
- Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.
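The density threshold idea behind DBSCAN can be sketched compactly: a point with at least `min_pts` neighbours within radius `eps` is a core point, and clusters grow by expanding from core points. This is a simplified illustration (quadratic neighbour search, points as hashable tuples), not the reference implementation:

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id >= 0, or -1 for noise.
    A compact sketch of DBSCAN for 2-D tuples."""
    labels = {p: None for p in points}

    def neighbors(p):                         # brute-force eps-neighbourhood (includes p)
        return [q for q in points
                if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps * eps]

    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1                    # provisionally noise
            continue
        labels[p] = cid                       # p is a core point; start a cluster
        queue = [q for q in nbrs if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cid               # noise reachable from a core point becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cid
            qn = neighbors(q)
            if len(qn) >= min_pts:            # q is also a core point: keep expanding
                queue.extend(qn)
        cid += 1
    return labels
```

Because clusters grow along density-connected chains rather than around centroids, arbitrary-shaped clusters (including the non-convex ones k-means misses) can be recovered, and sparse points are reported as noise.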
- Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously.
- Another important distinction is whether the clustering uses symmetric or asymmetric distances. A property of Euclidean space is that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case.
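As a concrete illustration of an asymmetric measure (chosen here for brevity; the source's example is sequence alignment), the Kullback-Leibler divergence between two probability distributions depends on direction:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete
    distributions; a standard example of an asymmetric dissimilarity."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl(p, q), kl(q, p))  # the two directions differ
```

A clustering algorithm built on such a measure cannot assume d(A, B) = d(B, A) the way Euclidean-distance methods do.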
2009
- (Rajaraman & Ullman, 2009) ⇒ Anand Rajaraman, and Jeffrey D. Ullman. (2009). “Clustering." Stanford Course - CS345A, Winter 2009: Data Mining.
2008
- (Xiang et al., 2008) ⇒ Shiming Xiang, Feiping Nie, and Changshui Zhang. (2008). “Learning a Mahalanobis Distance Metric for Data Clustering and Classification.” In: Pattern Recognition 41.
2005
- (Xu & Wunsch II, 2005) ⇒ Rui Xu, and Donald Wunsch II. (2005). “Survey of Clustering Algorithms.” In: IEEE Transactions on Neural Networks, 16(3). doi:10.1109/TNN.2005.845141
2004
- (Basu et al., 2004) ⇒ Sugato Basu, Mikhail Bilenko, and Raymond Mooney. (2004). “A Probabilistic Framework for Semi-Supervised Clustering.” In: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004).
2002
- (Berkhin, 2002) ⇒ Pavel Berkhin. (2002). “A Survey of Clustering Data Mining Techniques." Technical Report, Accrue Software.
- For the reader’s convenience we provide a classification of clustering algorithms closely followed by this survey:
- Hierarchical methods: Agglomerative algorithms, Divisive algorithms
- Partitioning relocation methods: Probabilistic clustering, k-medoids methods, k-means methods.
- Density-based partitioning methods: Density-based connectivity clustering, Density functions clustering, Grid-based methods
- Methods based on co-occurrence of categorical data
- Other clustering techniques: Constraint-based clustering, Graph partitioning, Clustering algorithms and supervised learning, Clustering algorithms in machine learning
- Scalable clustering algorithms.
- Algorithms for high-dimensional data: Subspace clustering, Coclustering techniques
- The properties of clustering algorithms of concern in data mining include:
- Type of attributes an algorithm can handle
- Scalability to large datasets
- Ability to work with high-dimensional data
- Ability to find clusters of irregular shape
- Handling outliers
- Time complexity (we often simply use the term complexity)
- Data order dependency
- Labeling or assignment (hard or strict vs. soft or fuzzy)
- Reliance on a priori knowledge and user-defined parameters
- Interpretability of results
1998
- (Hair et al., 1998) ⇒ J. F. Hair et al. (1998). “Multivariate Data Analysis, 5th edition." Chapter 9, pages 469-517.
1990
- (Kaufman & Rousseeuw, 1990) ⇒ L. Kaufman, and P. J. Rousseeuw. (1990). “Finding Groups in Data: An Introduction to Cluster Analysis." John Wiley and Sons.
1979
- (Hartigan & Wong, 1979) ⇒ J. A. Hartigan, and M. A. Wong. (1979). “A K-means Clustering Algorithm". In: Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1).
1975
- (Hartigan, 1975) ⇒ John A. Hartigan. (1975). “Clustering Algorithms." John Wiley & Sons.
1973
- (Duda & Hart, 1973) ⇒ Richard O. Duda, and Peter E. Hart. (1973). “Pattern Classification and Scene Analysis." John Wiley & Sons, New York, NY.
1967
- (MacQueen, 1967) ⇒ J. B. MacQueen. (1967). “Some Methods for Classification and Analysis of Multivariate Observations.” In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (pp. 281-297).
1965
- (Forgy, 1965) ⇒ Edward W. Forgy. (1965). “Cluster Analysis of Multivariate Data: Efficiency Versus Interpretability of Classification.” In: Biometrics, 21.
1956
- (Steinhaus, 1956) ⇒ H. Steinhaus. (1956). “Sur la division des corps matériels en parties.” In: Bull. Acad. Polon. Sci., Cl. III, vol. IV.