2009 WebScaleDistrSimAndEntitySetExp

From GM-RKB

Subject Headings: Entity Mention Set Expansion Task

Notes

  • It presents a scalable, MapReduce-based implementation of distributional similarity that handles a Web-scale corpus.

Cited By

Quotes

Abstract

Computing the pairwise semantic similarity between all words on the Web is a computationally challenging task. Parallelization and optimizations are necessary. We propose a highly scalable implementation based on distributional similarity, implemented in the MapReduce framework and deployed over a 200 billion word crawl of the Web. The pairwise similarity between 500 million terms is computed in 50 hours using 200 quad-core nodes. We apply the learned similarity matrix to the task of automatic set expansion and present a large empirical study to quantify the effect on expansion performance of corpus size, corpus quality, seed composition and seed size. We make public an experimental testbed for set expansion analysis that includes a large collection of diverse entity sets extracted from Wikipedia.
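As a rough illustration of the two ideas named in the abstract, distributional similarity and seed-based set expansion, the following single-machine sketch may help. It is not the authors' MapReduce implementation. The window-based contexts, positive PMI weighting, cosine measure, mean-similarity ranking, and every function name here are assumptions chosen for illustration; the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Count, for each term, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors


def pmi_weight(vectors):
    """Replace raw counts with positive pointwise mutual information (assumed weighting)."""
    total = sum(sum(ctx.values()) for ctx in vectors.values())
    term_totals = {t: sum(ctx.values()) for t, ctx in vectors.items()}
    ctx_totals = Counter()
    for ctx in vectors.values():
        ctx_totals.update(ctx)
    weighted = {}
    for t, ctx in vectors.items():
        weighted[t] = {}
        for c, n in ctx.items():
            pmi = math.log(n * total / (term_totals[t] * ctx_totals[c]))
            if pmi > 0:  # keep only positively associated contexts
                weighted[t][c] = pmi
    return weighted


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def expand(seeds, weighted, k=3):
    """Rank non-seed terms by their mean similarity to the seed terms."""
    if not seeds:
        return []
    scores = {}
    for cand, vec in weighted.items():
        if cand in seeds:
            continue
        sims = [cosine(vec, weighted[s]) for s in seeds if s in weighted]
        scores[cand] = sum(sims) / len(seeds)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpus (hypothetical data, far smaller than any real crawl).
sentences = [
    "visit paris today".split(),
    "visit london today".split(),
    "visit berlin today".split(),
    "eat apple now".split(),
    "eat pear now".split(),
]
weighted = pmi_weight(context_vectors(sentences))
top = expand({"paris", "london"}, weighted, k=1)
```

In this toy corpus, "berlin" shares both of its contexts with the seeds while the fruit terms share none, so it ranks first. At the scale described in the abstract, the all-pairs comparison is the expensive step, which is why the paper distributes it over a cluster.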




Author: Patrick Pantel, Ana-Maria Popescu, Eric Crestan, Arkady Borkovsky, Vishnu Vyas
Title: Web-Scale Distributional Similarity and Entity Set Expansion
Journal: Proceedings of EMNLP Conference
Title URL: http://www.aclweb.org/anthology/D/D09/D09-1098.pdf
Year: 2009