Children's Book Test (CBT) Dataset
A Children's Book Test (CBT) Dataset is a cloze-style reading comprehension dataset built from children's books that are freely available through Project Gutenberg.
- AKA: CBT Dataset.
- Context:
- Datasets available at: https://research.fb.com/downloads/babi/
- Benchmarking Task: CBT Benchmark Task.
- …
- Example(s):
- Counter-Example(s):
- See: Reading Comprehension Task, bAbI Project, Question-Answering System, Natural Language Processing Task, Natural Language Understanding Task, Natural Language Generation Task.
References
2016
- (Hill et al., 2016) ⇒ Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. (2016). “The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations.” In: Proceedings of the 4th International Conference on Learning Representations (ICLR 2016) Conference Track.
- QUOTE: The experiments in this paper are based on a new resource, the Children's Book Test, designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg [1]. Using children's books guarantees a clear narrative structure, which can make the role of context more salient. After allocating books to either training, validation or test sets, we formed example '$questions$' (denoted $x$) from chapters in the book by enumerating 21 consecutive sentences.
In each question, the first 20 sentences form the context (denoted $S$), and a word (denoted $a$) is removed from the 21st sentence, which becomes the query (denoted $q$). Models must identify the answer word a among a selection of 10 candidate answers (denoted $C$) appearing in the context sentences and the query. Thus, for a question answer pair $(x, a): x = (q, S, C);\; S$ is an ordered list of sentences; $q$ is a sentence (an ordered list $q = q_1,\cdots, q_l$ of words) containing a missing word symbol; $C$ is a bag of unique words such that $a \in C$, its cardinality $\vert C\vert$ is 10 and every candidate word $w \in C$ is such that $w \in q \cup S$. An example question is given in Figure 1.
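The construction described in the quote can be sketched in code. The following is a minimal, hypothetical illustration (function name, placeholder token `XXXXX`, and distractor-sampling strategy are assumptions, not the authors' released tooling): take 21 consecutive sentences, use the first 20 as the context $S$, remove one word $a$ from the 21st sentence to form the query $q$, and build a 10-word candidate set $C$ from words appearing in $S$ and $q$.

```python
import random

def make_cbt_question(sentences, answer_index=None, num_candidates=10, seed=0):
    """Build one (query, context, candidates, answer) tuple from 21 sentences.

    A sketch of the CBT construction: S = first 20 sentences, q = 21st
    sentence with one word removed, C = 10 candidates drawn from q and S.
    """
    assert len(sentences) == 21, "CBT enumerates 21 consecutive sentences"
    rng = random.Random(seed)
    context = sentences[:20]                 # S: the first 20 sentences
    query_words = sentences[20].split()      # the 21st sentence supplies q
    if answer_index is None:
        answer_index = rng.randrange(len(query_words))
    answer = query_words[answer_index]       # a: the removed word
    # q: the query sentence with the answer replaced by a missing-word symbol
    query = query_words[:answer_index] + ["XXXXX"] + query_words[answer_index + 1:]
    # C: a bag of unique words from q and S that must contain the answer
    vocab = {w for s in context for w in s.split()} | set(query_words)
    vocab.discard(answer)
    distractors = rng.sample(sorted(vocab), num_candidates - 1)
    candidates = distractors + [answer]
    rng.shuffle(candidates)
    return " ".join(query), context, candidates, answer
```

A model is then scored on whether it picks $a$ out of the 10 candidates in $C$ given $q$ and $S$; the real dataset additionally controls the word class of the removed word, which this sketch does not.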