Clustering Task
A Clustering Task is an Unsupervised Classification Task that divides the data into cluster sets.
- AKA: Grouping, Partitioning, Cluster Analysis, Segmentation Analysis, Taxonomy Analysis.
- Context:
- It creates a cluster set whose item clusters have members with low in-group variation and high out-group variation according to some similarity function.
- Input:
- a Data Record Set.
- optional: the number [math]\displaystyle{ k }[/math] of Clusters to be generated; a Similarity Function.
- Output: a Clustering Result/Cluster Set.
- Clustering Performance Metric: explainability.
- It can be solved by a Clustering System (that implements a clustering algorithm).
- It can range from being a Heuristic Clustering Task to being a Data-Driven Clustering Task (such as unsupervised clustering).
- It can range from being a Crisp-Clusters Clustering Task to being a Fuzzy Clustering Task.
- It can range from being a Low-Dimensional Clustering Task to being a High-Dimensional Clustering Task.
- It can range from being a Small-Dataset Clustering Task to being a Large-Dataset Clustering Task.
- It can range from being a One-Sided Clustering Task, to being a Two-Sided Clustering Task, to being an n-Sided Clustering Task.
- It can range from being a Generative Clustering Task to being a Discriminative Clustering Task.
- It can range from being a Partitional Clustering Task to being an Agglomerative Clustering Task.
- It can be the focus of a Clustering Discipline.
- It can range from being a Constrained Clustering Task to being an Unconstrained Clustering Task.
- It can be an Algorithm-specific Clustering Task, such as: k-Means Clustering, k-Medoids Clustering, ...
- It can provide information to other Data Mining Tasks (rather than provide actionable information).
- Example(s):
- Categorical Data Clustering,
- Clustering from Data Streams,
- Correlation Clustering,
- Cross-Language Document Clustering,
- Density-Based Clustering,
- Model-Based Clustering,
- Evolutionary Clustering,
- Sublinear Clustering,
- an Iris Dataset Clustering Task,
- a Customer Clustering Task,
- a Document Clustering Task, such as text clustering,
- a Reference Resolution Task,
- a Graph Clustering Task,
- a Random Clustering Task.
- …
- Counter-Example(s):
- See: Mapping Task, Unsupervised Learning Task, Distance Function, Cluster Set; Cluster Editing; Cluster Ensembles; Consensus Clustering; Dirichlet Process; Graph Clustering; Projective Clustering.
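The input/output contract described in the Context above (a data record set and an optional number of clusters in, a cluster set with low in-group and high out-group variation out) can be sketched in Python. This is only an illustration under stated assumptions: records are numeric tuples, squared Euclidean distance stands in for the similarity function, and a naive Lloyd-style k-means stands in for the clustering algorithm. The names `cluster_records` and `within_between_variation` are hypothetical and do not come from any cited source.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (assumed dissimilarity function)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def cluster_records(records: List[Point], k: int, iters: int = 20) -> List[List[Point]]:
    """Naive k-means: returns a cluster set as a list of k lists of records."""
    # Assumption: the first k records serve as initial centers (deterministic, not robust).
    centers = list(records[:k])
    clusters: List[List[Point]] = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in records:
            nearest = min(range(k), key=lambda i: sq_dist(r, centers[i]))
            clusters[nearest].append(r)
        # An empty cluster keeps its previous center.
        new_centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters


def within_between_variation(clusters: List[List[Point]]) -> Tuple[float, float]:
    """Within-cluster and between-cluster sums of squared distances."""
    non_empty = [c for c in clusters if c]
    grand = centroid([p for c in non_empty for p in c])
    within = sum(sq_dist(p, centroid(c)) for c in non_empty for p in c)
    between = sum(len(c) * sq_dist(centroid(c), grand) for c in non_empty)
    return within, between


# Toy data record set with two visually obvious groups.
records = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
clusters = cluster_records(records, k=2)
within, between = within_between_variation(clusters)
print(clusters)
print(within, between)
```

On this toy input the within-cluster variation comes out far smaller than the between-cluster variation, which is the property the Context item describes. Other similarity functions or algorithms would slot into the same contract.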
References
2017a
- (Sammut & Webb, 2017) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2017). “Clustering.” In: (Sammut & Webb, 2017). DOI:10.1007/978-1-4899-7687-1_943
- QUOTE: Clustering is a type of unsupervised learning in which the goal is to partition a set of examples into groups called clusters. Intuitively, the examples within a cluster are more similar to each other than to examples from other clusters. In order to measure the similarity between examples, clustering algorithms use various distortion or distance measures. There are two major types clustering approaches: generative and discriminative. The former assumes a parametric form of the data and tries to find the model parameters that maximize the probability that the data was generated by the chosen model. The latter represents graph-theoretic approaches that compute a similarity matrix defined over the input data.
2017b
- (Czumaj & Sohler, 2017) ⇒ Artur Czumaj, and Christian Sohler. (2017). “Sublinear Clustering.” In: (Sammut & Webb, 2017). DOI:10.1007/978-1-4899-7687-1_798
- QUOTE: Clustering is the process of partitioning a set of objects into subsets of similar objects. In machine learning, it is, for example, used in unsupervised learning to fit input data to a density model. In many modern applications of clustering, the input sets consist of billions of objects to be clustered. Typical examples include web search, analysis of web traffic, and spam detection.
2013
- (Wikipedia, 2013) ⇒ http://en.wikipedia.org/wiki/Cluster_analysis
- QUOTE: Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It will often be necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek βότρυς "grape") and typological analysis. The subtle differences are often in the usage of the results: while in data mining, the resulting groups are the matter of interest, in automatic classification primarily their discriminative power is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.
2009a
- (WordNet, 2009) ⇒ http://wordnetweb.princeton.edu/perl/webwn?s=clustering
- S: (v) cluster, constellate, flock, clump (come together as in a cluster or flock) "The poets constellate in this town every summer"
2009b
- (NCSU, 2009) ⇒ http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm
- QUOTE: Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to identify homogeneous subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation. Other techniques, such as Q-mode factor analysis, multidimensional scaling, and latent class analysis also perform clustering and are discussed separately.
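For numeric data under squared Euclidean distance, the within-group and between-group variation named in this quote are commonly formalized through a sum-of-squares decomposition. The notation below is introduced here for illustration and is an assumption, not the quoted source's: [math]\displaystyle{ C_1, \dots, C_k }[/math] are the clusters, [math]\displaystyle{ \mu_j }[/math] is the mean of [math]\displaystyle{ C_j }[/math], and [math]\displaystyle{ \bar{x} }[/math] is the mean of all records.

```latex
\underbrace{\sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \bar{x} \rVert^2}_{T \text{ (total)}}
=
\underbrace{\sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2}_{W \text{ (within-group)}}
+
\underbrace{\sum_{j=1}^{k} \lvert C_j \rvert \, \lVert \mu_j - \bar{x} \rVert^2}_{B \text{ (between-group)}}
```

Because [math]\displaystyle{ T }[/math] depends only on the data and not on the grouping, minimizing [math]\displaystyle{ W }[/math] and maximizing [math]\displaystyle{ B }[/math] are the same objective in this setting. Other similarity functions need not admit such a decomposition.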
2009c
- (Statsoft, 2009) ⇒ http://www.statsoft.com/textbook/stcluan.html
- QUOTE: In the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy. In archeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques. In general, whenever one needs to classify a "mountain" of information into manageable meaningful piles, cluster analysis is of great utility.
2003
- (Nature, 2003) ⇒ http://www.nature.com/nrg/journal/v4/n9/glossary/nrg1155_glossary.html
- QUOTE: CLUSTER ANALYSIS A mathematical algorithm that organizes a set of items according to their similarity. For example, genes can be clustered according to their similarity in pattern of expression.
2002
- (Berkhin, 2002) ⇒ Pavel Berkhin. (2002). “A Survey of Clustering Data Mining Techniques." Technical Report, Accrue Software.
- QUOTE: Clustering is the division of data into groups of similar objects. In clustering, some details are disregarded in exchange for data simplification. Clustering can be viewed as a data modeling technique that provides for concise summaries of the data.
- The goal of clustering is to assign data points to a finite system of [math]\displaystyle{ k }[/math] subsets (clusters).
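Under a crisp (non-fuzzy) reading, which is an assumption here rather than something the quote states, this goal means every data point lands in exactly one of the [math]\displaystyle{ k }[/math] subsets. A minimal Python check of that property follows; the helper name `is_crisp_partition` is hypothetical, points are assumed hashable, and requiring every cluster to be non-empty is a further assumption.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def is_crisp_partition(points: Iterable[Hashable],
                       clusters: List[List[Hashable]],
                       k: int) -> bool:
    """True if `clusters` has exactly k non-empty subsets that together
    contain each data point exactly as often as it occurs in `points`."""
    if len(clusters) != k or any(len(c) == 0 for c in clusters):
        return False
    assigned = Counter(p for c in clusters for p in c)
    return assigned == Counter(points)


points = ["a", "b", "c", "d"]
print(is_crisp_partition(points, [["a", "b"], ["c", "d"]], k=2))       # valid partition
print(is_crisp_partition(points, [["a", "b"], ["b", "c", "d"]], k=2))  # "b" assigned twice
```

A fuzzy clustering result would fail this check by design, since points there carry graded membership in several clusters.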
2000
- (Witten & Frank, 2000) ⇒ Ian H. Witten, and Eibe Frank. (2000). “Data Mining: Practical Machine Learning Tools and Techniques with Java implementations." Morgan Kaufmann.
- QUOTE: In clustering, groups of examples that belong together are sought.
1999
- (Jain et al., 1999) ⇒ Anil K. Jain, M. N. Murty, and P. J. Flynn. (1999). “Data Clustering: A Review.” In: ACM Computing Surveys (CSUR) Journal, 31(3). doi:10.1145/331499.331504
- QUOTE: Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters).