Children's Book Test (CBT) Dataset
A Children's Book Test (CBT) Dataset is a cloze-style reading comprehension dataset built from children's books that are freely available through Project Gutenberg.
- AKA: CBT Dataset.
- Context:
- Datasets available at: https://research.fb.com/downloads/babi/
- Benchmarking Task: CBT Benchmark Task.
- …
- Example(s):
- Counter-Example(s):
- See: Reading Comprehension Task, bAbI Project, Question-Answering System, Natural Language Processing Task, Natural Language Understanding Task, Natural Language Generation Task.
References
2016
- (Hill et al., 2016) ⇒ Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. (2016). “The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations.” In: Proceedings of the 4th International Conference on Learning Representations (ICLR 2016) Conference Track.
- QUOTE: The experiments in this paper are based on a new resource, the Children's Book Test, designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg [1]. Using children's books guarantees a clear narrative structure, which can make the role of context more salient. After allocating books to either training, validation or test sets, we formed example '$questions$' (denoted $x$) from chapters in the book by enumerating 21 consecutive sentences.
In each question, the first 20 sentences form the context (denoted $S$), and a word (denoted $a$) is removed from the 21st sentence, which becomes the query (denoted $q$). Models must identify the answer word a among a selection of 10 candidate answers (denoted $C$) appearing in the context sentences and the query. Thus, for a question answer pair $(x, a): x = (q, S, C);\; S$ is an ordered list of sentences; $q$ is a sentence (an ordered list $q = q_1,\cdots, q_l$ of words) containing a missing word symbol; $C$ is a bag of unique words such that $a \in C$, its cardinality $\vert C\vert$ is 10 and every candidate word $w \in C$ is such that $w \in q \cup S$. An example question is given in Figure 1.
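The construction described in the quote can be sketched in code. The following is a minimal, hypothetical illustration (function name, placeholder token `XXXXX`, and distractor-sampling strategy are assumptions, not the authors' released tooling): take 21 consecutive sentences, use the first 20 as the context $S$, remove one word $a$ from the 21st sentence to form the query $q$, and build a 10-word candidate set $C$ from words appearing in $S$ and $q$.

```python
import random

def make_cbt_question(sentences, answer_index=None, num_candidates=10, seed=0):
    """Build one (query, context, candidates, answer) tuple from 21 sentences.

    A sketch of the CBT construction: S = first 20 sentences, q = 21st
    sentence with one word removed, C = 10 candidates drawn from q and S.
    """
    assert len(sentences) == 21, "CBT enumerates 21 consecutive sentences"
    rng = random.Random(seed)
    context = sentences[:20]                 # S: the first 20 sentences
    query_words = sentences[20].split()      # the 21st sentence supplies q
    if answer_index is None:
        answer_index = rng.randrange(len(query_words))
    answer = query_words[answer_index]       # a: the removed word
    # q: the query sentence with the answer replaced by a missing-word symbol
    query = query_words[:answer_index] + ["XXXXX"] + query_words[answer_index + 1:]
    # C: a bag of unique words from q and S that must contain the answer
    vocab = {w for s in context for w in s.split()} | set(query_words)
    vocab.discard(answer)
    distractors = rng.sample(sorted(vocab), num_candidates - 1)
    candidates = distractors + [answer]
    rng.shuffle(candidates)
    return " ".join(query), context, candidates, answer
```

A model is then scored on whether it picks $a$ out of the 10 candidates in $C$ given $q$ and $S$; the real dataset additionally controls the word class of the removed word, which this sketch does not.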