Text Document Clustering Task

Context:
- Input: a Text Document Set; a Text Document Similarity Function.
- Output: a set of Text Document Clusters.
- Performance:
- It can be solved by a Text Clustering System (that implements a Text Clustering algorithm).
- It can (often) be a High-Dimensional Clustering Task.
- It can support: a Corpus Browsing Task, a Topic Modeling Task, an Information Retrieval Task (though there is scant evidence of performance improvement to support this application).
- …
Counter-Example(s):
- a Text Document Classification Task.
- a Text Document Search Task.
- a Text Document Retrieval Task.
- a Word Clustering Task, such as word vector clustering.
See: Information Retrieval, Passage Clustering, Phrase Clustering.
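
As a rough illustration of the task as defined above (a document set and a similarity function in, a set of clusters out), the sketch below uses only the Python standard library. The bag-of-words cosine similarity, the connected-components grouping rule, the threshold of 0.3, and the function names are all assumptions chosen for brevity; they stand in for whatever similarity function and clustering algorithm a real Text Clustering System would use.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase a document and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the term-count vectors of two documents."""
    a, b = bag_of_words(doc_a), bag_of_words(doc_b)
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_documents(documents, similarity, threshold=0.3):
    """Return clusters as sorted lists of document indices.

    Hypothetical grouping rule: two documents share a cluster when a chain
    of pairs with similarity >= threshold connects them (connected
    components, tracked with a small union-find structure).
    """
    n = len(documents)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity(documents[i], documents[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


documents = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
clusters = cluster_documents(documents, cosine_similarity)
```

On this toy input the first two documents fall into one cluster and the last two into another. The pairwise loop is quadratic in the number of documents, so this is only suitable for small sets; it also ignores the high dimensionality that real document collections usually have.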

References

(Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/document_clustering Retrieved:2015-2-23.
- Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering.

(Zhao & Karypis, 2011) ⇒ Ying Zhao; George Karypis. (2011). “Document Clustering.” In: (Sammut & Webb, 2011) p.293

(Hu et al., 2009) ⇒ Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. (2009). “Exploiting Wikipedia as External Knowledge for Document Clustering.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557066
- QUOTE: ... Cluster quality is evaluated by three metrics, purity [14], F-score [10], and normalized mutual information (NMI) [15].
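
Two of the metrics named in the quote, purity and NMI, can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the evaluation code of the cited paper: the function names are hypothetical, and NMI has several normalizations in the literature, so the arithmetic mean of the two entropies used here is an assumption. F-score is omitted because its clustering variants differ more widely.

```python
import math
from collections import Counter


def purity(cluster_labels, class_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(class_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(cluster_labels, class_labels):
    """Mutual information divided by the mean of the two entropies.

    Assumption: arithmetic-mean normalization; other papers divide by the
    geometric mean, the minimum, or the maximum of the entropies instead.
    """
    n = len(class_labels)
    joint = Counter(zip(cluster_labels, class_labels))
    cluster_counts = Counter(cluster_labels)
    class_counts = Counter(class_labels)
    mutual_information = sum(
        (n_kc / n) * math.log(n * n_kc / (cluster_counts[k] * class_counts[c]))
        for (k, c), n_kc in joint.items()
    )
    denominator = (entropy(cluster_labels) + entropy(class_labels)) / 2
    # Convention assumed here: two trivial one-group labelings agree perfectly.
    return mutual_information / denominator if denominator > 0 else 1.0


# Six documents, two predicted clusters, two gold classes.
predicted = [0, 0, 0, 1, 1, 1]
gold = ["a", "a", "b", "b", "b", "b"]
purity_score = purity(predicted, gold)
nmi_score = normalized_mutual_information(predicted, gold)
```

In this example five of the six documents sit in the majority class of their cluster, so purity is 5/6. Both metrics require gold class labels, which is why they are described as external evaluation measures.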

(Yoo et al., 2006) ⇒ Illhoi Yoo, Xiaohua Hu, and Il-Yeol Song. (2006). “Integration of Semantic-based Bipartite Graph Representation and Mutual Refinement Strategy for Biomedical Literature Clustering.” In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2006).
- Document clustering was initially investigated for improving information retrieval (IR) performance because similar documents grouped by document clustering tend to be relevant to the same user queries [20] [21]. Document clustering, however, has not been widely used in IR systems [7] because document clustering algorithms were too slow or infeasible for very large document sets in the early days. As faster clustering algorithms have been introduced, they have been adopted in document clustering. Document clustering has been recently used to facilitate the nearest-neighbor search [3], to support an interactive document browsing paradigm [7] [10] [26] and to construct hierarchical topic structures [14]. Thus, as information grows exponentially, document clustering plays an important role for IR and text mining.