2016 EmbracingDataAbundanceBookTestD


Subject Headings: BookTest Dataset; Reading Comprehension Dataset.

Notes

Cited By

Quotes

Abstract

There is a practically unlimited amount of natural language data available. Still, recent work in text comprehension has focused on datasets which are small relative to current computing possibilities. This article is making a case for the community to move to larger data and as a step in that direction it is proposing the BookTest, a new dataset similar to the popular Children's Book Test (CBT), however more than 60 times larger. We show that training on the new data improves the accuracy of our Attention-Sum Reader model on the original CBT test data by a much larger margin than many recent attempts to improve the model architecture. On one version of the dataset our ensemble even exceeds the human baseline provided by Facebook. We then show in our own human study that there is still space for further improvement.

1. Introduction

2. Task Description

3. Current Landscape

                 CNN       Daily Mail   CBT CN    CBT NE    BookTest
# queries        380,298   879,450      120,769   108,719   14,140,825
Max # options    527       371          10        10        10
Avg # options    26.4      26.5         10        10        10
Avg # tokens     762       813          470       433       522
Vocab. size      118,497   208,045      53,185    53,063    1,860,394
Table 1: Statistics on the 4 standard text comprehension datasets and our new BookTest dataset introduced in this paper. CBT CN stands for CBT Common Nouns and CBT NE stands for CBT Named Entities. Statistics were taken from (Hermann et al., 2015) and the statistics provided with the CBT data set.
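
The abstract's "more than 60 times larger" claim can be checked against the query counts in Table 1. The short sketch below assumes that the comparison is between BookTest and the two CBT subsets (CN and NE) combined; the page does not state which baseline the figure refers to, so treat that pairing as an assumption.

```python
# Query counts copied from Table 1.
cbt_cn_queries = 120_769
cbt_ne_queries = 108_719
booktest_queries = 14_140_825

# Assumption: the "60 times" comparison is against CBT CN and CBT NE combined.
cbt_total = cbt_cn_queries + cbt_ne_queries
ratio = booktest_queries / cbt_total

print(f"CBT CN + NE queries: {cbt_total:,}")   # 229,488
print(f"BookTest / CBT ratio: {ratio:.1f}")    # 61.6
```

Under that assumption the ratio comes out at roughly 61.6, which is consistent with the abstract's wording.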

4. BookTest

Similarly to the CBT, our BookTest dataset[1] is derived from books available through Project Gutenberg. We used 3,555 copyright-free books to extract CN examples and 10,507 books for NE examples; for comparison, the CBT dataset was extracted from just 108 books.
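
To make the example format concrete, here is a minimal sketch of how one CBT-style cloze example might be assembled from book text. It assumes the usual CBT layout (20 context sentences, a 21st sentence with one word blanked out, and a list of candidate answers, at most 10 per Table 1). The function name `make_cloze_example`, its parameters, and the returned dictionary keys are hypothetical and are not taken from the paper; how the answer word and the candidates are selected (for instance by tagging nouns or named entities) is left to the caller.

```python
import re

# Assumption: the blank is marked with "XXXXX", as in the CBT data files.
PLACEHOLDER = "XXXXX"


def make_cloze_example(sentences, answer, candidates, context_len=20):
    """Build one cloze example from context_len + 1 consecutive sentences.

    Hypothetical helper for illustration only, not the authors' pipeline.
    """
    if len(sentences) < context_len + 1:
        raise ValueError("need at least context_len + 1 sentences")
    if answer not in candidates:
        raise ValueError("answer must be among the candidates")

    context = sentences[:context_len]
    query_sentence = sentences[context_len]

    # Blank out the first whole-word occurrence of the answer in the query.
    pattern = r"\b" + re.escape(answer) + r"\b"
    query, n_subs = re.subn(pattern, PLACEHOLDER, query_sentence, count=1)
    if n_subs == 0:
        raise ValueError("answer does not occur in the query sentence")

    return {
        "context": context,
        "query": query,
        "candidates": list(candidates),
        "answer": answer,
    }


# Toy usage with made-up sentences.
toy_sentences = ["Alice met Bob ."] * 20 + ["Then Alice smiled ."]
example = make_cloze_example(toy_sentences, "Alice", ["Alice", "Bob"])
print(example["query"])
```

A reader model is then asked to pick the candidate that fills the blank in `query`, given `context`.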

(...)

5. Baselines

6. Discussion

7. Human Study

8. Conclusion

Footnotes

References

BibTeX

@article{2016_EmbracingDataAbundanceBookTestD,
  author    = {Ondrej Bajgar and
               Rudolf Kadlec and
               Jan Kleindienst},
  title     = {Embracing data abundance: BookTest Dataset for Reading Comprehension},
  journal   = {CoRR},
  volume    = {abs/1610.00956},
  year      = {2016},
  url       = {http://arxiv.org/abs/1610.00956},
  archivePrefix = {arXiv},
  eprint    = {1610.00956},
}


Author: Ondrej Bajgar, Rudolf Kadlec, Jan Kleindienst
Title: Embracing Data Abundance: BookTest Dataset for Reading Comprehension
Year: 2016