2003 WordnetImprovesTextDocumentClustering

(Hotho et al., 2003) ⇒ Andreas Hotho, Steffen Staab, Gerd Stumme. (2003). “Wordnet Improves Text Document Clustering.” In: Proceedings of the SIGIR 2003 Semantic Web Workshop.

Subject Headings: Text Clustering Algorithm, WordNet, Bisecting k-Means Clustering Algorithm.

Notes

It proposes a Text Clustering Algorithm.
It analyzes the benefits of using WordNet Synsets and up to five levels of Hypernyms.
It uses the Bisecting k-Means Algorithm.
It analyzes the addition of Part of Speech tags.
It analyzes word sense disambiguation (WSD) by context, which returns the concept that maximizes a function of the concept's semantic vicinity. Given a concept c, its semantic vicinity is defined as the set of all its direct subconcepts and superconcepts.
It relates to their prior work: (Hotho et al., 2001).
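
The hypernym and WSD-by-context notes above can be illustrated with a minimal sketch. It is not the authors' implementation: the taxonomy `PARENT`, the lexicon `SENSES`, and the function names are hypothetical stand-ins for WordNet, and the scoring function (summed frequency of document terms that map into a candidate concept's vicinity) is only one plausible reading of "a function of the semantic vicinity".

```python
from collections import Counter

# Hypothetical toy taxonomy (concept -> direct superconcept); a stand-in for WordNet.
PARENT = {
    "beef": "meat",
    "meat": "food",
    "river_bank": "slope",
    "savings_bank": "financial_institution",
}

# Hypothetical lexicon: term -> candidate concepts, listed in sense order.
SENSES = {
    "beef": ["beef"],
    "bank": ["river_bank", "savings_bank"],
    "institution": ["financial_institution"],
    "slope": ["slope"],
}


def vicinity(concept):
    """Direct superconcept and direct subconcepts of `concept`."""
    near = {child for child, parent in PARENT.items() if parent == concept}
    if concept in PARENT:
        near.add(PARENT[concept])
    return near


def disambiguate(term, term_freq):
    """Pick the candidate concept whose vicinity is best supported by the document."""
    def support(concept):
        near = vicinity(concept)
        return sum(n for t, n in term_freq.items()
                   if near.intersection(SENSES.get(t, [])))
    # max() keeps the first candidate on ties, i.e. falls back to sense order.
    return max(SENSES[term], key=support)


def concept_vector(tokens, levels):
    """Count disambiguated concepts and add each count to up to `levels` hypernyms."""
    term_freq = Counter(tokens)
    counts = Counter()
    for term, n in term_freq.items():
        if term not in SENSES:
            continue  # terms without a known concept are skipped in this sketch
        concept = disambiguate(term, term_freq)
        counts[concept] += n
        for _ in range(levels):
            concept = PARENT.get(concept)
            if concept is None:
                break
            counts[concept] += n
    return counts


doc = ["bank", "institution", "institution", "beef", "the"]
print(concept_vector(doc, levels=2))
```

In this toy document, "bank" resolves to the financial sense because "institution" occurs in the vicinity of that sense, and raising `levels` propagates each concept's count further up the hierarchy. The resulting concept counts could then be weighted and passed to a clustering algorithm such as Bisecting k-Means.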

Cited By

~157 http://scholar.google.com/scholar?cites=8528179124131283614

2007

(Recupero, 2007) ⇒ Diego Reforgiato Recupero. (2007). “A New Unsupervised Method for Document Clustering by using WordNet Lexical and Conceptual Relations.” In: Information Retrieval (2007) 10:563–579.

2004

(Sedding and Kazakov, 2004) ⇒ Julian Sedding and Dimitar Kazakov. (2004). “Wordnet-based Text Document Clustering.” In: COLING-2004 Workshop on Robust Methods in Analysis of Natural Language Data (ROMAND).

Quotes

Abstract

Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. The bag of words representation used for these clustering methods is often unsatisfactory as it ignores relationships between important terms that do not co-occur literally. In order to deal with the problem, we integrate background knowledge — in our application Wordnet — into the process of clustering text documents. We cluster the documents by a standard partitional algorithm. Our experimental evaluation on Reuters newsfeeds compares clustering results with pre-categorizations of news. In the experiments, improvements of results by background knowledge compared to the baseline can be shown for many interesting tasks.

References

Eneko Agirre and G. Rigau. Word sense disambiguation using conceptual density. In: Proceedings of COLING’96, 1996.
G. Amati, C. Carpineto, and G. Romano. FUB at TREC-10 web track: A probabilistic framework for topic relevance term weighting. In The Tenth Text Retrieval Conference (TREC 2001). National Institute of Standards and Technology (NIST), online publication, 2001.
E. Bozsak et al. KAON - towards a large scale semantic web. In: Proceedings of EC-Web, pages 304–313, Aix-en-Provence, France, 2002. LNCS 2455, Springer.
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.
M. de Buenaga Rodríguez, J. M. G. Hidalgo, and B. Díaz-Agudo. Using WordNet to complement training information in text categorization. In Recent Advances in Natural Language Processing II, volume 189. John Benjamins, 2000.
B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer, Berlin – Heidelberg, 1999.
J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarrán. Indexing with WordNet synsets can improve text retrieval. In: Proceedings ACL/COLING Workshop on Usage of WordNet for Natural Language Processing, 1998.
S. J. Green. Building hypertext links in newspaper articles using semantic similarity. In: Proceedings of third Workshop on Applications of Natural Language to Information Systems (NLDB ’97), 1997.
S. J. Green. Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering (TKDE), 11(5):713–730, 1999.
(Hofmann, 1999) ⇒ Thomas Hofmann. (1999). “Probabilistic Latent Semantic Indexing.” In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999) doi:10.1145/312624.312649
Andreas Hotho, Steffen Staab, and G. Stumme. Explaining text clustering results using semantic structures. In Principles of Data Mining and Knowledge Discovery, 7th European Conference, PKDD 2003, Dubrovnik, Croatia, September 22-26, 2003, LNCS. Springer, 2003.
Andreas Hotho, Steffen Staab, and G. Stumme. Text clustering based on background knowledge. Technical report, University of Karlsruhe, Institute AIFB, 2003. 36 pages.
N. Ide and J. Véronis. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1):1–40, 1998.
G. Karypis and E. Han. Fast supervised dimensionality reduction algorithm with applications to document categorization and retrieval. In: Proceedings of CIKM-00, pages 12–19. ACM Press, 2000.
Kushal Dave, Steve Lawrence, and David M. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: Proceedings of the Twelfth International World Wide Web Conference, WWW2003. ACM, 2003.
D. Lewis. Reuters-21578 text categorization test collection, 1997.
George A. Miller. WordNet: A lexical database for English. CACM, 38(11):39–41, 1995.
Dan Moldovan and Rada Mihalcea. Using WordNet and lexical operators to improve internet searches. IEEE Internet Computing, 4(1):34–43, 2000.
Patrick Pantel and Dekang Lin. Document clustering with committees. In: Proceedings of SIGIR’02, Tampere, Finland, 2002.
M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
Gerard M. Salton. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, 1989.
F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2003 WordnetImprovesTextDocumentClustering	Steffen Staab Andreas Hotho Gerd Stumme			Wordnet Improves Text Document Clustering		Proceedings of the SIGIR Workshop on Semantic Web Workshop	http://www.uni-koblenz.de/~staab/Research/Publications/sw sigir2003 submit.pdf			2003