There is no Data Like More Data Heuristic
The There is no Data Like More Data Heuristic is a Heuristic from the NLP Community which holds that increasing the amount of training data typically improves system performance more than further fine-tuning of the learning algorithm.
References
2006
- Google Research Blog. (2006). “All Our N-gram are Belong to You.” Blog post, August 3, 2006.
- Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.
- We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That's why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.
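For illustration, here is a minimal single-machine sketch of the thresholded n-gram counting the post quoted above describes (count five-word sequences, keep those occurring at least 40 times, and keep unigrams occurring at least 200 times). The `count_ngrams` helper and the toy token stream are assumptions for this sketch, not Google's distributed pipeline.

```python
# Sketch: count word 5-grams and unigrams over a token stream, then apply
# frequency cutoffs like those in the blog post (>= 40 for 5-grams,
# >= 200 for unigrams). A single-machine stand-in, not a distributed pipeline.
from collections import Counter, deque
from typing import Dict, Iterable, Tuple

def count_ngrams(tokens: Iterable[str], n: int = 5,
                 ngram_cutoff: int = 40,
                 unigram_cutoff: int = 200) -> Tuple[Dict[str, int], Dict[tuple, int]]:
    unigram_counts: Counter = Counter()
    ngram_counts: Counter = Counter()
    window: deque = deque(maxlen=n)
    for tok in tokens:
        unigram_counts[tok] += 1
        window.append(tok)
        if len(window) == n:
            ngram_counts[tuple(window)] += 1
    kept_unigrams = {w: c for w, c in unigram_counts.items() if c >= unigram_cutoff}
    kept_ngrams = {g: c for g, c in ngram_counts.items() if c >= ngram_cutoff}
    return kept_unigrams, kept_ngrams

# Toy usage; a real run would stream tokenized Web-scale text instead.
tokens = "the cat sat on the mat".split() * 1000
unigrams, fivegrams = count_ngrams(tokens)
print(len(unigrams), "unigrams kept,", len(fivegrams), "5-grams kept")
```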
2003
- (Kilgarriff & Grefenstette, 2003) ⇒ Adam Kilgarriff, and Gregory Grefenstette. (2003). “Introduction to the Special Issue on the Web as Corpus.” In: Computational Linguistics, 29(3). doi:10.1162/089120103322711569
- Another argument is made vividly by Banko and Brill (2001). They explore the performance of a number of machine learning algorithms (on a representative disambiguation task) as the size of the training corpus grows from a million to a billion words. All the algorithms steadily improve in performance, though the question “Which is best?” gets different answers for different data sizes. The moral: Performance improves with data size, and getting more data will make more difference than fine-tuning algorithms.
2002
- (Van den Bosch & Buchholz, 2002) ⇒ Antal van den Bosch, and Sabine Buchholz. (2002). “Shallow Parsing on the Basis of Words Only: A Case Study.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002).
- They train memory-based shallow parsers on input representations of differing granularity and informational richness, and compare the resulting learning curves.
- ABSTRACT: A memory-based learning algorithm is trained to simultaneously chunk sentences and assign grammatical function tags to these chunks. We compare the algorithm’s performance on this parsing task with varying training set sizes (yielding learning curves) and different input representations. In particular we compare input consisting of words only, a variant that includes word form information for low-frequency words, gold-standard POS only, and combinations of these. The word-based shallow parser displays an apparently log-linear increase in performance, and surpasses the flatter POS-based curve at about 50,000 sentences of training data. The low-frequency variant performs even better, and the combinations are best. Comparative experiments with a real POS tagger produce lower results. We argue that we might not need an explicit intermediate POS-tagging step for parsing when a sufficient amount of training material is available and word form information is used for low-frequency words.
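As a rough illustration of the comparison described in the abstract, the sketch below builds windowed per-token instances from words only versus gold POS tags only and trains a memory-based (1-nearest-neighbour) classifier on each. It assumes NLTK's CoNLL-2000 chunking corpus and scikit-learn as stand-ins for the paper's data and its memory-based learner, and the `windowed_instances` helper is invented for this example.

```python
# Sketch: compare words-only vs. gold-POS-only input representations for
# chunking with a memory-based (k-NN) learner, loosely in the spirit of the paper.
import nltk
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

nltk.download("conll2000", quiet=True)
from nltk.corpus import conll2000

def windowed_instances(sentences, field):
    """Build per-token instances from a +/-1 token window over either the
    word (field=0) or the gold POS tag (field=1); labels are IOB chunk tags."""
    X, y = [], []
    for sent in sentences:
        for i, (_, _, chunk) in enumerate(sent):
            feats = {}
            for offset in (-1, 0, 1):
                j = i + offset
                feats[str(offset)] = sent[j][field] if 0 <= j < len(sent) else "<PAD>"
            X.append(feats)
            y.append(chunk)
    return X, y

train = conll2000.iob_sents("train.txt")[:1000]   # small slices keep the sketch fast
test = conll2000.iob_sents("test.txt")[:200]

for name, field in (("words only", 0), ("gold POS only", 1)):
    vec = DictVectorizer()
    X_train, y_train = windowed_instances(train, field)
    X_test, y_test = windowed_instances(test, field)
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(vec.fit_transform(X_train), y_train)
    acc = accuracy_score(y_test, knn.predict(vec.transform(X_test)))
    print(f"{name}: per-token chunk-tag accuracy {acc:.3f}")
```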
2001
- (Banko & Brill, 2001) ⇒ Michele Banko, and Eric D. Brill. (2001). “Scaling to Very Very Large Corpora for Natural Language Disambiguation.” In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001).
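For illustration, a minimal sketch of the kind of learning-curve experiment the paper reports (the same learner trained on increasing amounts of data, evaluated on a fixed test set), assuming scikit-learn; the 20 Newsgroups binary classification task and logistic regression learner below are placeholders for the paper's confusion-set disambiguation task and its learners, not a reproduction of it.

```python
# Sketch: hold the learner fixed and grow the training set, tracing the kind
# of learning curve discussed by Banko and Brill (illustrative placeholders).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder corpus and task; the paper used confusion-set disambiguation
# on corpora ranging from one million to one billion words.
data = fetch_20newsgroups(subset="all", categories=["sci.space", "rec.autos"])
X_train_text, X_test_text, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Same algorithm, increasing amounts of training data.
for n in (200, 500, 1000, X_train.shape[0]):
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(f"{n:>5} training examples -> test accuracy "
          f"{accuracy_score(y_test, model.predict(X_test)):.3f}")
```

If the heuristic holds even in this toy setting, the reported accuracy should climb as the training size grows, which is the shape of curve the 2003 quotation above summarizes.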