2005 IntelligentFusOfStructAndCitatBasedEvidenceForTextClass
- (Zhang et al., 2005) ⇒ Baoping Zhang, Marcos André Gonçalves, Weiguo Fan, Yuxin Chen, Edward A. Fox, Pavel Calado, Marco Cristo. (2005). “Intelligent Fusion of Structural and Citation-Based Evidence for Text Classification.” In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005).
Subject Headings: Classification; Document Similarity; Citation Analysis; Genetic Programming.
Notes
Cited By
Quotes
Abstract
This paper investigates how citation-based information and structural content (e.g., title, abstract) can be combined to improve classification of text documents into predefined categories. We evaluate different measures of similarity, five derived from the citation structure of the collection, and three measures derived from the structural content, and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our empirical experiments using documents from the ACM digital library and the ACM classification scheme show that we can discover similarity functions that work better than any evidence in isolation and whose combined performance through a simple majority voting is comparable to that of Support Vector Machine classifiers.
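The fusion idea in the abstract can be illustrated with a minimal sketch. The abstract does not name the individual measures or the form of the GP-discovered functions, so everything here is an assumption made for illustration: co-citation and bibliographic coupling stand in for the citation-based measures, cosine similarity over title words stands in for a structural-content measure, a fixed weighted sum stands in for a function that GP would evolve, and the toy collection (`data`, document ids, labels) is hypothetical.

```python
from collections import Counter
from math import sqrt

def cocitation(a, b, cited_by):
    """Overlap (Jaccard) of the sets of papers that cite a and b."""
    ca, cb = cited_by.get(a, set()), cited_by.get(b, set())
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0

def bib_coupling(a, b, cites):
    """Overlap (Jaccard) of the reference lists of a and b."""
    ra, rb = cites.get(a, set()), cites.get(b, set())
    union = ra | rb
    return len(ra & rb) / len(union) if union else 0.0

def cosine(text_a, text_b):
    """Cosine similarity over raw term counts of two strings."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def majority_vote(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def fused(a, b, data, weights):
    """Assumed stand-in for a GP-discovered function: a weighted sum of evidence."""
    w_cc, w_bc, w_title = weights
    return (w_cc * cocitation(a, b, data["cited_by"])
            + w_bc * bib_coupling(a, b, data["cites"])
            + w_title * cosine(data["title"][a], data["title"][b]))

def classify(doc, data, weights, k=3):
    """Label doc by majority vote over its k most similar labelled documents."""
    ranked = sorted(data["label"], key=lambda d: fused(doc, d, data, weights), reverse=True)
    return majority_vote([data["label"][d] for d in ranked[:k]])

# Hypothetical toy collection: four labelled documents and one query "q".
data = {
    "cites": {"d1": {"r1", "r2"}, "d2": {"r1", "r2", "r3"},
              "d3": {"r4"}, "d4": {"r4", "r5"}, "q": {"r1", "r3"}},
    "cited_by": {"d1": {"x1", "x2"}, "d2": {"x1"},
                 "d3": {"x3"}, "d4": {"x3", "x4"}, "q": {"x1", "x2"}},
    "title": {"d1": "genetic programming for ranking",
              "d2": "ranking functions for web search",
              "d3": "query optimization in databases",
              "d4": "database index structures",
              "q": "genetic programming for text ranking"},
    "label": {"d1": "IR", "d2": "IR", "d3": "DB", "d4": "DB"},
}

# Three assumed similarity functions, each relying on a single evidence type.
weight_sets = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
predictions = [classify("q", data, w) for w in weight_sets]
final = majority_vote(predictions)  # simple majority voting across classifiers
print(predictions, final)
```

In this sketch each weight vector plays the role of one discovered similarity function; a GP run would instead search over expression trees built from the individual measures and keep those that classify a validation set best.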