2010 StatisticalSemaforEnhancDocClas
- (Farahat & Kamel, 2010) ⇒ Ahmed K. Farahat and Mohamed S. Kamel. (2010). “Statistical Semantics for Enhancing Document Clustering.” In: International Journal on Knowledge and Information Systems (KAIS), (28:2). doi:10.1007/s10115-010-0367-z
Subject Headings: Document clustering, Statistical semantics, Semantic similarity, Term–term correlations.
Notes
Cited By
Quotes
Abstract
Document clustering algorithms usually use vector space model (VSM) as their underlying model for document representation. VSM assumes that terms are independent and accordingly ignores any semantic relations between them. This results in mapping documents to a space where the proximity between document vectors does not reflect their true semantic similarity. This paper proposes new models for document representation that capture semantic similarity between documents based on measures of correlations between their terms. The paper uses the proposed models to enhance the effectiveness of different algorithms for document clustering. The proposed representation models define a corpus-specific semantic similarity by estimating measures of term–term correlations from the documents to be clustered. The corpus of documents accordingly defines a context in which semantic similarity is calculated. Experiments have been conducted on thirteen benchmark data sets to empirically evaluate the effectiveness of the proposed models and compare them to VSM and other well-known models for capturing semantic similarity.
References
- Cai D, He X, Han J (2005) Document clustering using locality preserving indexing. IEEE Trans Knowl Data Eng 17(12):1624–1637
- Carbonell J, Yang Y, Frederking R, Brown R, Geng Y, Lee D (1997) Translingual information retrieval: A comparative evaluation. In: Proceedings of the fifteenth international joint conference on artificial intelligence. Morgan Kaufmann, San Mateo, pp 708–715
- Cristianini N, Shawe-Taylor J, Lodhi H (2002) Latent semantic kernels. J Intell Inf Syst 18(2):127–152
- Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci Technol 41(6):391–407
- Dhillon I (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the seventh ACM SIGKDD International Conference on knowledge discovery and data mining. ACM, New York, pp 269–274
- Dhillon I, Kogan J, Nicholas C (2003) Feature selection and document clustering. In: Berry M (ed) Survey of Text Mining. Springer, New York, pp 73–100
- Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1/2):143–175
- Ding C, Li T, Jordan MI (2010) Convex and semi-nonnegative matrix factorizations. IEEE Trans Pattern Anal Mach Intell 32:45–55
- Dongen S (2000) Performance criteria for graph clustering and Markov cluster experiments. Technical report, CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands
- Drineas P, Kannan R, Mahoney M (2007) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36(1):132–157
- Farahat AK, Kamel MS (2009) Document clustering using semantic kernels based on term–term correlations. In: Proceedings of the 2009 IEEE International Conference on data mining workshops. IEEE Computer Society, Washington, DC, pp 459–464 123A. K. Farahat, M. S. Kamel
- Farahat AK, Kamel MS (2010) Enhancing document clustering using hybrid models for semantic similarity. In: Proceedings of the eighth workshop on text mining at the tenth SIAM International Conference on data mining. SIAM, Philadelphia, pp 83–92
- Fung B, Wang K, Ester M (2003) Hierarchical document clustering using frequent itemsets. In: Proceedings of the third SIAM International Conference on data mining. SIAM, Philadelphia, pp 59–70
- Furnas G, Landauer T, Gomez L, Dumais S (1983) Statistical semantics: analysis of the potential performance of keyword information systems. Bell Syst Tech J 62(6):1753–1806
- Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the twentieth international joint conference on artificial intelligence. Morgan Kaufmann, San Mateo, pp 6–12
- Han E, Boley D, Gini M, Gross R, Hastings K, Karypis G, Kumar V, Mobasher B, Moore J (1998) WebACE: a web agent for document categorization and exploration. In: Proceedings of the second International Conference on autonomous agents. ACM, New York, pp 408–415
- He X, Zha H, Ding C, Simon H (2002) Web document clustering using hyperlink structures. Comput Stat Data Anal 41(1):19–45
- Hotho A, Staab S, Stumme G (2003) WordNet improves text document clustering. In: Proceedings of the SIGIR 2003 semantic web workshop. ACM, New York, pp 541–544
- Hu X, Zhang X, Lu C, Park EK, Zhou X (2009) Exploiting wikipedia as external knowledge for document clustering. In: Proceedings of the fifteenth ACM SIGKDD International Conference on knowledge discovery and data mining. ACM, New York, pp 389–396
- Huang A, Milne D, Frank E, Witten I (2009) Clustering documents using a Wikipedia-based concept representation. In: Proceedings of the thirteenth Pacific-Asia conference on advances in knowledge discovery and data mining. Springer, Berlin, pp 628–636
- Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
- Jing L, Ng M, Huang J (2010) Knowledge-based vector space model for text clustering. Knowl Inf Syst 25:35–55
- Jolliffe I (2002) Principal component analysis. Springer, New York
- Karypis G (2003) CLUTO — a clustering toolkit. Technical Report #02-017, University of Minnesota, Department of Computer Science, Minnesota, MN, USA
- Lee D, Seung H (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401:788–791
- Lewis D (1999) Reuters-21578 text categorization test collection distribution 1.0
- Meila M (2003) Comparing clusterings by the variation of information. In: Learning theory and Kernel Machines. Springer, Berlin, pp 173–187
- Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41
- Pessiot J-F, Kim Y-M, Amini MR, Gallinari P (2010) Improving document clustering in a learned concept space. Inf Process Manage 46(2):180–192
- Salton G, Wong A, Yang CS (1975) A Vector Space Model for Automatic Indexing. Commun ACM 18(11):613–620
- Scholkopf B, Smola A, Muller K (1997) Kernel principal component analysis. Lect Notes Comput Sci 1327:583–588
- Schütze H, Silverstein C (1997) Projections for efficient document clustering. In: Proceedings of the twentieth annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’97. ACM, New York, pp 74–81
- Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge
- Slonim N, Tishby N (2000) Document clustering using word clusters via the information bottleneck method. In: Proceedings of the twenty-third annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 208–215
- Luxburg U von (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416
- Wang P, Hu J, Zeng H, Chen Z (2009) Using wikipedia knowledge to improve text classification. Knowl Inf Syst 19(3):265–281
- Wong SKM, Ziarko W, Wong PCN (1985) Generalized vector spaces model in information retrieval. In: Proceedings of the eighth annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 18–25
- Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the fifteenth ACM SIGKDD International Conference on knowledge discovery and data mining. ACM, New York, pp 877–886 123Statistical semantics for enhancing document clustering
- Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: Proceedings of the twenty-sixth annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 267–273
- Zhao Y, Karypis G (2004) Empirical and theoretical comparisons of selected criterion functions for document clustering. Mach Learn 55(3):311–331
- Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10(2):141–168,