2001 MiningTheWebForSynonyms
- (Turney, 2001) ⇒ Peter D. Turney. (2001). “Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL.” In: Proceedings of the 12th European Conference on Machine Learning (ECML 2001). doi:10.1007/3-540-44795-4_42
Subject Headings: Pointwise Mutual Information and Information Retrieval, Synonym Extraction Algorithm, Lexical Semantic Similarity Function.
Notes
Cited By
- ~613 http://scholar.google.com/scholar?q=%22Mining+the+Web+for+Synonyms:+PMI-IR+versus+LSA+on+TOEFL%22+2001
- ~44 http://portal.acm.org/citation.cfm?id=645328.650004&preflayout=flat#citedby
2002
- (Turney, 2002) ⇒ Peter D. Turney. (2002). “Thumbs up or Thumbs Down?: Semantic orientation applied to unsupervised classification of reviews.” In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL 2002). doi:10.3115/1073083.1073153
- QUOTE: The PMI-IR algorithm is employed to estimate the semantic orientation of a phrase (Turney, 2001). PMI-IR uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words or phrases.
Quotes
Abstract
This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
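The scoring idea in the abstract can be sketched as follows. PMI between two words is log2(p(a, b) / (p(a) p(b))); when ranking candidate synonyms for a fixed problem word, p(problem) is constant and the logarithm is monotonic, so ranking by hits(problem AND choice) / hits(choice) gives the same order. This is a minimal sketch under that assumption, not the paper's implementation: the function names (`pmi_ir_score`, `best_choice`) are hypothetical, and the hit counts are invented for illustration rather than taken from any search engine.

```python
from typing import Dict, List, Tuple


def pmi_ir_score(problem: str, choice: str,
                 hits: Dict[str, int],
                 co_hits: Dict[Tuple[str, str], int]) -> float:
    """Co-occurrence ratio: hits(problem AND choice) / hits(choice).

    `hits` and `co_hits` stand in for search-engine hit counts
    (hypothetical lookup tables, not a real search API).
    """
    alone = hits.get(choice, 0)
    if alone == 0:
        return 0.0  # no evidence for this choice, avoid division by zero
    return co_hits.get((problem, choice), 0) / alone


def best_choice(problem: str, choices: List[str],
                hits: Dict[str, int],
                co_hits: Dict[Tuple[str, str], int]) -> str:
    """Pick the choice with the highest co-occurrence ratio."""
    return max(choices, key=lambda c: pmi_ir_score(problem, c, hits, co_hits))


# Invented counts for a TOEFL-style question (illustrative only).
hits = {"imposed": 1000, "believed": 5000, "requested": 4000, "correlated": 800}
co_hits = {
    ("levied", "imposed"): 50,
    ("levied", "believed"): 10,
    ("levied", "requested"): 8,
    ("levied", "correlated"): 1,
}
choices = ["imposed", "believed", "requested", "correlated"]
answer = best_choice("levied", choices, hits, co_hits)
```

With these made-up counts, "imposed" has the highest ratio (50 / 1000), so it is selected as the synonym. Dividing by hits(choice) keeps very frequent words from winning purely because they co-occur with everything.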
References
- 1. Kenneth W. Church, Hanks, P.: Word Association Norms, Mutual Information and Lexicography. In: Proceedings of the 27th Annual Conference of the Association of Computational Linguistics, (1989) 76-83.
- 2. Kenneth W. Church, Gale, W., Hanks, P., Hindle, D.: Using Statistics in Lexical Analysis. In: Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. New Jersey: Lawrence Erlbaum (1991) 115-164.
- 3. AltaVista, AltaVista Company, Palo Alto, California, http://www.altavista.com/.
- 4. Test of English as a Foreign Language (TOEFL), Educational Testing Service, Princeton, New Jersey, http://www.ets.org/.
- 5. Tatsuki, D.: Basic 2000 Words - Synonym Match 1. In: Interactive JavaScript Quizzes for ESL Students, http://www.aitech.ac.jp/~iteslj/quizzes/js/dt/mc-2000-01syn.html (1998).
- 6. Thomas K. Landauer, Dumais, S.T.: A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104 (1997) 211-240.
- 7. Deerwester, S., Dumais, S.T., Furnas, G.W., Thomas K. Landauer, Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41 (1990) 391-407.
- 8. Berry, M.W., Dumais, S.T., Letsche, T.A.: Computational Methods for Intelligent Information Access. Proceedings of Supercomputing ’95, San Diego, California, (1995).
- 9. Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press (1999).
- 10. John Rupert Firth: A Synopsis of Linguistic Theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32. Oxford: Philological Society (1957). Reprinted in F.R. Palmer (ed.), Selected Papers of John Rupert Firth 1952-1959, London: Longman (1968).
- 11. AltaVista: AltaVista Advanced Search Cheat Sheet, AltaVista Company, Palo Alto, California, http://doc.altavista.com/adv_search/syntax.html (2001).
- 12. Christiane Fellbaum (ed.): WordNet: An Electronic Lexical Database. Cambridge, Massachusetts: MIT Press (1998). For more information: http://www.cogsci.princeton.edu/~wn/.
- 13. Haase, K.: Interlingual BRICO. IBM Systems Journal, 39 (2000) 589-596. For more information: http://www.framerd.org/brico/.
- 14. Vossen, P. (ed.): EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Dordrecht, Netherlands: Kluwer (1998). See: http://www.hum.uva.nl/~ewn/.
- 15. Turney, P.D.: Learning Algorithms for Keyphrase Extraction. Information Retrieval, 2 (2000) 303-336.
- 16. Gregory Grefenstette: Finding Semantic Similarity in Raw Text: The Deese Antonyms. In: R. Goldman, P. Norvig, Eugene Charniak and B. Gale (eds.), Working Notes of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language. AAAI Press (1992) 61-65.
- 17. Hinrich Schütze: Word Space. In: S.J. Hanson, J.D. Cowan, and C.L. Giles (eds.), Advances in Neural Information Processing Systems 5, San Mateo California: Morgan Kaufmann (1993) 895-902.
- 18. Dekang Lin: Automatic Retrieval and Clustering of Similar Words. In: Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, Montreal (1998) 768-773.
- 19. Richardson, R., Smeaton, A., Murphy, J.: Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words. In: Proceedings of AICS Conference. Trinity College, Dublin (1994).
- 20. Lee, J.H., Kim, M.H., Lee, Y.J.: Information Retrieval Based on Conceptual Distance in ISA Hierarchies. Journal of Documentation, 49 (1993) 188-207.
- 21. Philip Resnik: Semantic Similarity in a Taxonomy: An Information-based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11 (1998) 95-130.
- 22. Jiang, J., Conrath, D.: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of the 10th International Conference on Research on Computational Linguistics, Taiwan, (1997).
- 23. Brin, S., Rajeev Motwani, Ullman, J., Tsur, S.: Dynamic Itemset Counting and Implication Rules for Market Basket Data. In: Proceedings of the 1997 ACM-SIGMOD International Conference on the Management of Data (1997) 255-264.
- 24. Sullivan, D.: Search Engine Sizes. SearchEngineWatch.com, internet.com Corporation, Darien, Connecticut, http://searchenginewatch.com/reports/sizes.html (2000).
- 25. (Papadimitriou et al., 1998) ⇒ C. H. Papadimitriou, Prabhakar Raghavan, H. Tamaki, and S. Vempala. (1998). “Latent Semantic Indexing: A Probabilistic Analysis.” In: Proceedings of the Seventeenth ACM-SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems.
- 26. Karen Spärck Jones: Comparison Between TREC2 and TREC3. In: D. Harman (ed.), The Third Text REtrieval Conference (TREC3), National Institute of Standards and Technology Special Publication 500-226, Gaithersburg, Maryland (1994) C1-C4.
- 27. Buckley, C., Gerard M. Salton, Allan, J., Singhal, A.: Automatic Query Expansion Using SMART: TREC 3. In: The Third Text REtrieval Conference (TREC3), D. Harman (ed.), National Institute of Standards and Technology Special Publication 500-226, Gaithersburg, Maryland (1994) 69-80.
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2001 MiningTheWebForSynonyms | Peter D. Turney | | | Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL | | ECML 2001 | http://arxiv.org/ftp/cs/papers/0212/0212033.pdf | 10.1007/3-540-44795-4_42 | | 2001 |