BookTest Dataset

From GM-RKB


A BookTest Dataset is a reading comprehension dataset that is similar to the Children's Book Test (CBT) Dataset but roughly 60 times larger.

Context:
- Datasets available at: https://ibm.biz/booktest-v1
Example(s):
- …
Counter-Example(s):
- a Children's Book Test (CBT) Dataset,
- a CNN-Daily Mail Dataset,
- an ImageNet Dataset,
- a MS-MARCO Dataset,
- a MS COCO Dataset,
- a MC-Test Dataset,
- a RACE Dataset,
- a Question-Answer Dataset.
See: Reading Comprehension Task, Question-Answering System, Natural Language Processing Task, Natural Language Understanding Task, Natural Language Generation Task.

References

2016

(Bajgar et al., 2016) ⇒ Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. (2016). “Embracing Data Abundance: BookTest Dataset for Reading Comprehension.” In: ePrint: abs/1610.00956.
- QUOTE: Similarly to the CBT, our BookTest dataset^[1] is derived from books available through project Gutenberg. We used 3,555 copyright-free books to extract CN examples and 10,507 books for NE examples, for comparison the CBT dataset was extracted from just 108 books.

**Table 1:** Statistics on the 4 standard text comprehension datasets and our new BookTest dataset introduced in this paper. CBT CN stands for CBT Common Nouns and CBT NE stands for CBT Named Entities. Statistics were taken from (Hermann et al., 2015) and the statistics provided with the CBT data set.
	CNN	Daily Mail	CBT CN	CBT NE	BookTest
# queries	380,298	879,450	120,769	108,719	14,140,825
Max # options	527	371	10	10	10
Avg # options	26.4	26.5	10	10	10
Avg # tokens	762	813	470	433	522
Vocab. size	118,497	208,045	53,185	53,063	1,860,394
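
As a rough check of the "about 60 times larger" description, the query counts in Table 1 can be compared directly. The sketch below assumes that the relevant CBT baseline is the sum of the CBT CN and CBT NE query counts, which the quoted passage does not state explicitly; the constant and function names are illustrative only.

```python
# Query counts copied from Table 1 (Bajgar et al., 2016).
CBT_CN_QUERIES = 120_769
CBT_NE_QUERIES = 108_719
BOOKTEST_QUERIES = 14_140_825


def size_ratio(larger: int, smaller: int) -> float:
    """Return how many times `larger` exceeds `smaller`."""
    return larger / smaller


# Assumption: the CBT baseline is the CN and NE subsets combined.
cbt_total = CBT_CN_QUERIES + CBT_NE_QUERIES
ratio = size_ratio(BOOKTEST_QUERIES, cbt_total)

print(f"CBT total queries: {cbt_total:,}")    # 229,488
print(f"BookTest / CBT ratio: {ratio:.1f}")   # 61.6
```

Under that assumption the ratio comes out slightly above 60, which is consistent with the definition at the top of this page.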

[1] The BookTest dataset can be downloaded from https://ibm.biz/booktest-v1.
