Text-Document Clustering Algorithm
A Text-Document Clustering Algorithm is a domain-specific clustering algorithm that can be implemented by a text-document clustering system to solve the text-document clustering task.
- Context:
- It can (often) make use of a Document Vectorizer.
- It can range from being a Knowledge-based Text Clustering Algorithm (making use of a knowledge base) to being a Knowledge-Free Text Clustering Algorithm.
- It can range from being a Heuristic Text Clustering Algorithm to being a Data-Driven Text Clustering Algorithm.
- Example(s):
- A Webpage Clustering Algorithm can be used to cluster webpages into different categories, such as news articles, blog posts, and product reviews.
- A Text Embedding-based Clustering Algorithm can be used to cluster text documents based on the similarity of their word embeddings.
- A Topic Modeling Algorithm can be used to cluster text documents based on the latent topics that they contain.
- …
- Counter-Example(s):
- See: Information Retrieval Algorithm, Text Classification Task.
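As an illustration of the context items above, the following is a minimal sketch of a knowledge-free, data-driven text clustering algorithm: a document vectorizer feeding a cosine-based k-means loop. The TF-IDF weighting, the whitespace tokenizer, the seed-based initialization, and all function names (`tfidf_vectors`, `cosine`, `kmeans_cosine`) are assumptions chosen for the example, not a method taken from any of the references below.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a unit-length sparse TF-IDF vector (a dict)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant) keeps every weight strictly positive.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans_cosine(vectors, seeds, n_iter=10):
    """Cosine k-means; `seeds` are indices of documents used as initial centroids."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each document joins its most similar centroid.
        labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        # Update step: each centroid becomes the normalized sum of its members.
        for k in range(len(centroids)):
            total = Counter()
            for v, lab in zip(vectors, labels):
                if lab == k:
                    for t, w in v.items():
                        total[t] += w
            norm = math.sqrt(sum(w * w for w in total.values()))
            if norm > 0:
                centroids[k] = {t: w / norm for t, w in total.items()}
    return labels


docs = [
    "stock market shares fall",
    "market shares rise as stock rally continues",
    "football team wins league match",
    "league match ends as football team loses",
]
labels = kmeans_cosine(tfidf_vectors(docs), seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Fixed seed indices keep the toy run deterministic; a practical system would more likely use randomized or k-means++-style initialization and a proper tokenizer. A knowledge-based variant would differ mainly in the vectorizer, for example by mapping terms to concepts from a knowledge base before weighting.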
References
2009
- (Hu et al., 2009) ⇒ Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. (2009). “Exploiting Wikipedia as External Knowledge for Document Clustering.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557066
2008
- (Li et al., 2008) ⇒ Yanjun Li, Soon M. Chung, and John D. Holt. (2008). “Text Document Clustering Based on Frequent Word Meaning Sequences.” In: Data & Knowledge Engineering 64(1). doi:10.1016/j.datak.2007.08.001
2007
- (Recupero, 2007) ⇒ Diego R. Recupero. (2007). “A New Unsupervised Method for Document Clustering by using WordNet Lexical and Conceptual Relations.” In: Information Retrieval (2007) 10:563–579.
2006
- (Yoo et al., 2006) ⇒ Illhoi Yoo, Xiaohua Hu, and Il-Yeol Song. (2006). “Integration of Semantic-based Bipartite Graph Representation and Mutual Refinement Strategy for Biomedical Literature Clustering.” In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2006).
2005
- (Ferragina & Gulli, 2005) ⇒ Paolo Ferragina, and Antonio Gulli. (2005). “A Personalized Search Engine Based on Web-Snippet Hierarchical Clustering.” In: Proceedings of International World Wide Web Conference (WWW 2005).
- (Surdeanu et al., 2005) ⇒ Mihai Surdeanu, Jordi Turmo, and Alicia Ageno. (2005). “A Hybrid Unsupervised Approach for Document Clustering.” In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD-2005).
- (Zhong & Ghosh, 2005) ⇒ S. Zhong, and Joydeep Ghosh. (2005). “Generative Model-based Document Clustering: A comparative study.” In: Journal of Knowledge and Information Systems, 8(3).
2004
- (Sedding & Kazakov, 2004) ⇒ Julian Sedding, and Dimitar Kazakov. (2004). “Wordnet-based Text Document Clustering.” In: COLING-2004 Workshop on Robust Methods in Analysis of Natural Language Data (ROMAND).
2003
- (Xu et al., 2003) ⇒ Wei Xu, Xin Liu, and Yihong Gong. (2003). “Document Clustering Based on Non-Negative Matrix Factorization.” In: Proceedings of the 26th ACM SIGIR Conference (SIGIR 2003). doi:10.1145/860435.860485
- (Fung et al., 2003) ⇒ Benjamin C. M. Fung, Ke Wang, and Martin Ester. (2003). “Hierarchical Document Clustering using Frequent Itemsets.” In: Proceedings of the SIAM International Conference on Data Mining 2003 (SDM 2003).
- (Hotho et al., 2003) ⇒ Andreas Hotho, Steffen Staab, and Gerd Stumme. (2003). “Wordnet Improves Text Document Clustering.” In: Proceedings of the Semantic Web Workshop (at SIGIR 2003).
2002
- (Zhao & Karypis, 2002) ⇒ Ying Zhao, and George Karypis. (2002). “Evaluation of Hierarchical Clustering Algorithms for Document Datasets.” In: Conference on Information and Knowledge Management (CIKM 2002). doi:10.1145/584792.584877
- (Beil et al., 2002) ⇒ Florian Beil, Martin Ester, and Xiaowei Xu. (2002). “Frequent Term-based Text Clustering.” In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002). doi:10.1145/775047.775110
2001
- (Hotho et al., 2001) ⇒ Andreas Hotho, Alexander Maedche, and Steffen Staab. (2001). “Ontology-based Text Clustering.” In: Proceedings of the IJCAI-2001 Workshop on Text Learning: Beyond Supervision.
- (Zhao & Karypis, 2001) ⇒ Ying Zhao, and George Karypis. (2001). “Criterion Functions for Document Clustering: Experiments and analysis.” Technical Report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis, MN.
2000
- (Steinbach et al., 2000) ⇒ Michael Steinbach, George Karypis, and Vipin Kumar. (2000). “A Comparison of Document Clustering Techniques.” In: Proceedings of Workshop at KDD-2000 on Text Mining.
- We use two metrics for evaluating cluster quality: entropy, which provides a measure of “goodness” for un-nested clusters or for the clusters at one level of a hierarchical clustering, and the F-measure, which measures the effectiveness of a hierarchical clustering. (The F measure was recently extended to document hierarchies in [5].)
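The two metrics named in the quote above can be sketched as follows. The quoted passage does not give formulas, so this sketch assumes commonly used definitions: entropy is the cluster-size-weighted average of each cluster's class-label entropy (base-2 logarithm assumed here), and the F-measure takes, for each class, the best F score over all clusters at any level of the hierarchy, weighted by class size. The function names and the toy labels are illustrative only.

```python
import math
from collections import Counter


def cluster_entropy(labels_true, labels_pred):
    """Size-weighted average of per-cluster class entropy (lower is better)."""
    n = len(labels_true)
    total = 0.0
    for cluster in set(labels_pred):
        members = [c for c, p in zip(labels_true, labels_pred) if p == cluster]
        counts = Counter(members)
        # Entropy of the class distribution inside this cluster.
        e = -sum((k / len(members)) * math.log2(k / len(members))
                 for k in counts.values())
        total += len(members) / n * e
    return total


def f_measure(labels_true, clusters):
    """Class-size-weighted best F score over `clusters`.

    `clusters` is a list of sets of document indices; for a hierarchy,
    pass every node of the tree (all levels), not just the leaves.
    """
    n = len(labels_true)
    score = 0.0
    for cls, n_i in Counter(labels_true).items():
        best = 0.0
        for members in clusters:
            n_ij = sum(1 for d in members if labels_true[d] == cls)
            if n_ij == 0:
                continue
            precision, recall = n_ij / len(members), n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        score += n_i / n * best
    return score


truth = ["a", "a", "b", "b"]
pure = cluster_entropy(truth, [0, 0, 1, 1])    # every cluster holds one class
mixed = cluster_entropy(truth, [0, 1, 0, 1])   # every cluster is a 50/50 mix
tree = [{0, 1, 2, 3}, {0, 1}, {2, 3}]          # root plus two child clusters
tree_f = f_measure(truth, tree)
root_f = f_measure(truth, [{0, 1, 2, 3}])
print(pure, mixed, tree_f, root_f)
```

On this toy data the pure clustering has zero entropy and the mixed one has an entropy of one bit, while the hierarchy scores higher on the F-measure than the root cluster alone because each class finds a perfectly matching node.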
1999
- (Larsen & Aone, 1999) ⇒ Bjornar Larsen, and Chinatsu Aone. (1999). “Fast and Effective Text Mining Using Linear-time Document Clustering.” In: Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-1999). doi:10.1145/312129.312186
1997
- (Schütze & Silverstein, 1997) ⇒ Hinrich Schütze, and Craig Silverstein. (1997). “Projections for Efficient Document Clustering.” In: ACM SIGIR Forum.
- (Zamir et al., 1997) ⇒ O. Zamir, Oren Etzioni, O. Madani, and R. Karp. (1997). “Fast and Intuitive Clustering of Web Documents.” In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining.
1992
- (Cutting et al., 1992) ⇒ Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. (1992). “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections.” In: Proceedings of the 15th ACM SIGIR Conference (SIGIR 1992).