2016 EmbracingDataAbundanceBookTestD

(Bajgar et al., 2016) ⇒ Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. (2016). “Embracing Data Abundance: BookTest Dataset for Reading Comprehension.” In: ePrint: abs/1610.00956.

Subject Headings: BookTest Dataset; Reading Comprehension Dataset.

Notes

Cited By

Google Scholar: ~ 36 Citations, Retrieved: 2020-12-13.

Quotes

Abstract

There is a practically unlimited amount of natural language data available. Still, recent work in text comprehension has focused on datasets which are small relative to current computing possibilities. This article is making a case for the community to move to larger data and as a step in that direction it is proposing the BookTest, a new dataset similar to the popular Children's Book Test (CBT), however more than 60 times larger. We show that training on the new data improves the accuracy of our Attention-Sum Reader model on the original CBT test data by a much larger margin than many recent attempts to improve the model architecture. On one version of the dataset our ensemble even exceeds the human baseline provided by Facebook. We then show in our own human study that there is still space for further improvement.

1. Introduction

2. Task Description

3. Current Landscape

**Table 1:** Statistics on the 4 standard text comprehension datasets and our new BookTest dataset introduced in this paper. CBT CN stands for CBT Common Nouns and CBT NE stands for CBT Named Entities. Statistics were taken from (Hermann et al., 2015) and the statistics provided with the CBT data set.
| | CNN | Daily Mail | CBT CN | CBT NE | BookTest |
|---|---|---|---|---|---|
| # queries | 380,298 | 879,450 | 120,769 | 108,719 | 14,140,825 |
| Max # options | 527 | 371 | 10 | 10 | 10 |
| Avg # options | 26.4 | 26.5 | 10 | 10 | 10 |
| Avg # tokens | 762 | 813 | 470 | 433 | 522 |
| Vocab. size | 118,497 | 208,045 | 53,185 | 53,063 | 1,860,394 |
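
The abstract's claim that BookTest is "more than 60 times larger" than the CBT can be sanity-checked against the query counts in Table 1. The sketch below is only an illustration: it assumes the comparison is between BookTest and the combined CBT CN and CBT NE query counts, which the excerpt here does not state explicitly, and the variable names are hypothetical.

```python
# Query counts copied from Table 1.
queries = {
    "CBT CN": 120_769,
    "CBT NE": 108_719,
    "BookTest": 14_140_825,
}

# Assumption: "CBT" in the size comparison means CN and NE combined.
cbt_total = queries["CBT CN"] + queries["CBT NE"]
ratio = queries["BookTest"] / cbt_total

print(f"CBT total queries: {cbt_total:,}")
print(f"BookTest / CBT:    {ratio:.1f}x")
```

Under that assumption the ratio comes out at roughly 61.6, which is consistent with the "more than 60 times" figure.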

4. BookTest

Similarly to the CBT, our BookTest dataset^[1] is derived from books available through Project Gutenberg. We used 3,555 copyright-free books to extract CN examples and 10,507 books for NE examples; for comparison, the CBT dataset was extracted from just 108 books.

(...)

5. Baselines

6. Discussion

7. Human Study

8. Conclusion

Footnotes

[1] The BookTest dataset can be downloaded from https://ibm.biz/booktest-v1.

References

BibTeX

@article{2016_EmbracingDataAbundanceBookTestD,
  author    = {Ondrej Bajgar and
               Rudolf Kadlec and
               Jan Kleindienst},
  title     = {Embracing data abundance: BookTest Dataset for Reading Comprehension},
  journal   = {CoRR},
  volume    = {abs/1610.00956},
  year      = {2016},
  url       = {http://arxiv.org/abs/1610.00956},
  archivePrefix = {arXiv},
  eprint    = {1610.00956},
}

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2016 EmbracingDataAbundanceBookTestD	Ondrej Bajgar Rudolf Kadlec Jan Kleindienst			Embracing Data Abundance: BookTest Dataset for Reading Comprehension						2016
