2009 TheUnreasonableEffOfData

From GM-RKB

Subject Headings: Data-Driven Algorithm, Very Large Database, Semantic Web.

Quotes

Abstract

At Brown University, there was excitement about having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus - along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions - captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks - if only we knew how to extract the model from the data.
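To make "frequency counts for all sequences up to five words long" concrete, here is a minimal sketch of that kind of table, built over a toy sentence. The function name `ngram_counts`, the `max_n` parameter, and the example text are illustrative assumptions; this is not how the Google corpus was actually produced, which would require distributed processing over Web-scale text.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every contiguous word sequence of length 1..max_n.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy stand-in for a corpus (assumption: whitespace tokenization).
tokens = "the cat sat on the mat and the cat slept".split()
counts = ngram_counts(tokens)

print(counts[("the",)])        # frequency of a single word
print(counts[("the", "cat")])  # frequency of a two-word sequence
```

Even in this toy case, rarer sequences only become visible as the text grows, which is the intuition behind preferring a much larger, noisier corpus.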

References

  • 1. E. Wigner, "The Unreasonable Effectiveness of Mathematics in the Natural Sciences," Comm. Pure and Applied Mathematics, vol. 13, no. 1, 1960, pp. 1–14.
  • 2. R. Quirk et al., A Comprehensive Grammar of the English Language, Longman, 1985.
  • 3. H. Kucera, W.N. Francis, and J.B. Carroll, Computational Analysis of Present-Day American English, Brown Univ. Press, 1967.
  • 4. T. Brants and A. Franz, Web 1T 5-Gram Version 1, Linguistic Data Consortium, 2006.
  • 5. S. Riezler, Y. Liu, and A. Vasserman, "Translating Queries into Snippets for Improved Query Expansion," Proceedings of 22nd Int'l Conference Computational Linguistics (Coling 08), Assoc. Computational Linguistics, 2008, pp. 737–744.
  • 6. P.P. Talukdar et al., "Learning to Create Data-Integrating Queries," Proceedings of 34th Int'l Conference Very Large Databases (VLDB 08), Very Large Database Endowment, 2008, pp. 785–796.
  • 7. J. Hays and A.A. Efros, "Scene Completion Using Millions of Photographs," Comm. ACM, vol. 51, no. 10, 2008, pp. 87–94.
  • 8. L. Getoor and B. Taskar, Introduction to Statistical Relational Learning, MIT Press, 2007.
  • 9. B. Taskar et al., "Max-Margin Parsing," Proceedings of Conference Empirical Methods in Natural Language Processing (EMNLP 04), Assoc. for Computational Linguistics, 2004, pp. 1–8.
  • 10. S. Schoenmackers, O. Etzioni, and D.S. Weld, "Scaling Textual Inference to the Web," Proceedings of 2008 Conference Empirical Methods in Natural Language Processing (EMNLP 08), Assoc. for Computational Linguistics, 2008, pp. 79–88.
  • 11. T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific Am., 17 May 2001.
  • 12. P. Friedland et al., "Towards a Quantitative, Platform-Independent Analysis of Knowledge Systems," Proceedings of Int'l Conference Principles of Knowledge Representation, AAAI Press, 2004, pp. 507–514.
  • 13. "Interview of Tom Gruber," AIS SIGSEMIS Bull., vol. 1, no. 3, 2004.
  • 14. M.J. Cafarella et al., "WebTables: Exploring the Power of Tables on the Web," Proceedings of Very Large Data Base Endowment (VLDB 08), ACM Press, 2008, pp. 538–549.
  • 15. M. Paşca, "Organizing and Searching the World Wide Web of Facts. Step Two: Harnessing the Wisdom of the Crowds," Proceedings of 16th Int'l World Wide Web Conf., ACM Press, 2007, pp. 101–110.

Citation: 2009 TheUnreasonableEffOfData
Author: Peter Norvig, Fernando Pereira, Alon Y. Halevy
Title: The Unreasonable Effectiveness of Data
Journal: IEEE Intelligent Systems
DOI: 10.1109/MIS.2009.36
Year: 2009