2016 EmbracingDataAbundanceBookTestD
- (Bajgar et al., 2016) ⇒ Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. (2016). “Embracing Data Abundance: BookTest Dataset for Reading Comprehension.” In: ePrint: abs/1610.00956.
Subject Headings: BookTest Dataset; Reading Comprehension Dataset.
Notes
Cited By
- Google Scholar: ~ 36 Citations, Retrieved: 2020-12-13.
Quotes
Abstract
There is a practically unlimited amount of natural language data available. Still, recent work in text comprehension has focused on datasets which are small relative to current computing possibilities. This article is making a case for the community to move to larger data and as a step in that direction it is proposing the BookTest, a new dataset similar to the popular Children's Book Test (CBT), however more than 60 times larger. We show that training on the new data improves the accuracy of our Attention-Sum Reader model on the original CBT test data by a much larger margin than many recent attempts to improve the model architecture. On one version of the dataset our ensemble even exceeds the human baseline provided by Facebook. We then show in our own human study that there is still space for further improvement.
1. Introduction
2. Task Description
3. Current Landscape
Statistic | CNN | Daily Mail | CBT CN | CBT NE | BookTest |
---|---|---|---|---|---|
# queries | 380,298 | 879,450 | 120,769 | 108,719 | 14,140,825 |
Max # options | 527 | 371 | 10 | 10 | 10 |
Avg # options | 26.4 | 26.5 | 10 | 10 | 10 |
Avg # tokens | 762 | 813 | 470 | 433 | 522 |
Vocab. size | 118,497 | 208,045 | 53,185 | 53,063 | 1,860,394 |
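
The abstract's claim that BookTest is "more than 60 times larger" than the CBT can be checked against the query counts in the table above. The following is a minimal sketch, assuming the comparison is between the BookTest query count and the CBT CN and CBT NE counts combined; the variable names are illustrative and do not come from the paper.

```python
# Query counts copied from the "# queries" row of the statistics table.
queries = {
    "CBT CN": 120_769,
    "CBT NE": 108_719,
    "BookTest": 14_140_825,
}

# Assumption: BookTest is compared against CBT CN and CBT NE combined.
cbt_total = queries["CBT CN"] + queries["CBT NE"]
ratio = queries["BookTest"] / cbt_total

print(f"CBT (CN + NE) queries: {cbt_total:,}")
print(f"BookTest / CBT ratio:  {ratio:.1f}")
```

Under that assumption the ratio comes out at roughly 61.6, which is consistent with the "more than 60 times" figure quoted in the abstract.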
4. BookTest
Similarly to the CBT, our BookTest dataset[1] is derived from books available through Project Gutenberg. We used 3,555 copyright-free books to extract CN examples and 10,507 books for NE examples; for comparison, the CBT dataset was extracted from just 108 books.
(...)
5. Baselines
6. Discussion
7. Human Study
8. Conclusion
Footnotes
- [1] BookTest dataset can be downloaded from https://ibm.biz/booktest-v1.
References
BibTeX
@article{2016_EmbracingDataAbundanceBookTestD,
  author = {Ondrej Bajgar and Rudolf Kadlec and Jan Kleindienst},
  title = {Embracing data abundance: BookTest Dataset for Reading Comprehension},
  journal = {CoRR},
  volume = {abs/1610.00956},
  year = {2016},
  url = {http://arxiv.org/abs/1610.00956},
  archivePrefix = {arXiv},
  eprint = {1610.00956},
}
Page | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
---|---|---|---|---|---|---|---|---|---|---|
2016 EmbracingDataAbundanceBookTestD | Ondrej Bajgar, Rudolf Kadlec, Jan Kleindienst | | | Embracing Data Abundance: BookTest Dataset for Reading Comprehension | | | | | | 2016 |