Text Document Clustering Task
A Text Document Clustering Task is a text-item clustering task for text documents.
- Context:
- Input: a Text Document Set; Text Document Similarity Function.
- Output: a set of Text Document Clusters.
- Performance:
- It can be solved by a Text Clustering System (that implements a Text Clustering algorithm).
- It can (often) be a High-Dimensional Clustering Task.
- It can support: a Corpus Browsing Task, a Topic Modeling Task, an Information Retrieval Task (though there is scant evidence of performance improvement to support this application).
- …
- Counter-Example(s):
- See: Information Retrieval; Passage Clustering, Phrase Clustering.
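The input/output contract above (a document set plus a similarity function in, a set of clusters out) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `cosine_similarity` over whitespace-tokenized term counts stands in for the Text Document Similarity Function, and `cluster_documents` uses simple single-link thresholding, which is only one of many possible Text Clustering algorithms. The threshold value and the toy documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between bag-of-words term-count vectors (assumed similarity function)."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_documents(docs, similarity, threshold=0.3):
    """Single-link clustering: clusters are the connected components of the graph
    that links every document pair whose similarity reaches the threshold."""
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the component root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if similarity(docs[i], docs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())  # each cluster is a list of document indices


# Hypothetical toy document set.
docs = [
    "protein gene expression in yeast cells",
    "gene expression and protein folding",
    "stock market prices fell sharply",
    "market prices and stock trading volume",
]
clusters = cluster_documents(docs, cosine_similarity, threshold=0.3)
print(clusters)  # → [[0, 1], [2, 3]]
```

The pairwise loop is quadratic in the number of documents, so this sketch only suits small sets; it is meant to show the shape of the task, not a scalable system.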
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/document_clustering Retrieved:2015-2-23.
- Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering.
2011
- (Zhao & Karypis, 2011) ⇒ Ying Zhao; George Karypis. (2011). “Document Clustering.” In: (Sammut & Webb, 2011) p.293
2009
- (Hu et al., 2009) ⇒ Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. (2009). “Exploiting Wikipedia as External Knowledge for Document Clustering.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557066
- QUOTE: ... Cluster quality is evaluated by three metrics, purity [14], F-score [10], and normalized mutual information (NMI) [15].
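Two of the metrics named in the quote, purity and NMI, can be sketched as follows. This is an illustrative sketch, not the paper's evaluation code: the function names are hypothetical, and the NMI normalization shown (dividing mutual information by the arithmetic mean of the two entropies) is an assumption, since several normalizations are in use and the quote does not say which one the authors applied. F-score is omitted.

```python
import math
from collections import Counter


def purity(true_labels, cluster_ids):
    """Fraction of documents that fall in the majority class of their assigned cluster."""
    by_cluster = {}
    for label, cid in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cid, Counter())[label] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(true_labels)


def nmi(true_labels, cluster_ids):
    """Mutual information normalized by the mean of the class and cluster entropies
    (assumed normalization; other variants use min, max, or geometric mean)."""
    n = len(true_labels)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

    joint = Counter(zip(true_labels, cluster_ids))
    pt, pc = Counter(true_labels), Counter(cluster_ids)
    mi = sum((c / n) * math.log(c * n / (pt[t] * pc[k])) for (t, k), c in joint.items())
    denom = (entropy(true_labels) + entropy(cluster_ids)) / 2
    return mi / denom if denom else 1.0  # both partitions trivial: treat as perfect agreement


# Hypothetical gold classes and cluster assignments for six documents.
gold = ["bio", "bio", "bio", "fin", "fin", "fin"]
pred = [0, 0, 1, 1, 1, 1]
print(purity(gold, pred))  # 5 of 6 documents sit in their cluster's majority class
print(nmi(gold, pred))
```

Both scores range from 0 to 1, with higher values indicating closer agreement with the gold classes. Purity alone can be inflated by producing many small clusters, which is one reason it is usually reported alongside NMI.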
2006
- (Yoo et al., 2006) ⇒ Illhoi Yoo, Xiaohua Hu, and Il-Yeol Song. (2006). “Integration of Semantic-based Bipartite Graph Representation and Mutual Refinement Strategy for Biomedical Literature Clustering.” In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2006).
- Document clustering was initially investigated for improving information retrieval (IR) performance because similar documents grouped by document clustering tend to be relevant to the same user queries [20] [21]. Document clustering, however, has not been widely used in IR systems [7] because document clustering algorithms were too slow or infeasible for very large document sets in the early days. As faster clustering algorithms have been introduced, they have been adopted in document clustering. Document clustering has been recently used to facilitate the nearest-neighbor search [3], to support an interactive document browsing paradigm [7] [10] [26] and to construct hierarchical topic structures [14]. Thus, as information grows exponentially, document clustering plays an important role for IR and text mining.
1992
- (Cutting et al., 1992) ⇒ Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. (1992). “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections.” In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1992).