LAMBADA Dataset
A LAnguage Modeling Broadened to Account for Discourse Aspects (LAMBADA) Dataset is a reading comprehension dataset for evaluating word prediction tasks that consists of passages, each composed of a context and a target sentence.
- AKA: LAMBADA Corpus.
- Context:
- It is available online at: https://zenodo.org/record/2630551
- It consists of "10,022 passages, divided into 4,869 development and 5,153 test passages (extracted from 1,331 and 1,332 disjoint novels, respectively)".
- Example(s):
- Counter-Example(s):
- See: Question-Answering System, Natural Language Processing Task, Natural Language Understanding Task, Natural Language Generation Task.
References
2016
- (Paperno et al., 2016) ⇒ Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. (2016). “The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context.” In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016) Volume 1: Long Papers.
- QUOTE: LAMBADA consists of passages composed of a context (on average 4.6 sentences) and a target sentence. The context size is the minimum number of complete sentences before the target sentence such that they cumulatively contain at least 50 tokens (this size was chosen in a pilot study). The task is to guess the last word of the target sentence (the target word). The constraint that the target word be the last word of the sentence, while not necessary for our research goal, makes the task more natural for human subjects.
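The context-size rule and the last-word task described in the quote can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' extraction code: the function names `select_context` and `split_target` are hypothetical, tokens are assumed to be whitespace-separated, and the target sentence is assumed to end in the target word rather than in punctuation.

```python
from typing import List, Tuple


def select_context(sentences: List[str], target_index: int,
                   min_tokens: int = 50) -> List[str]:
    """Return the fewest complete sentences directly before the target
    sentence whose combined token count reaches min_tokens.

    Assumption: tokens are counted by whitespace splitting. If the
    preceding text holds fewer than min_tokens tokens, all preceding
    sentences are returned.
    """
    context: List[str] = []
    total = 0
    # Walk backwards from the sentence just before the target.
    for i in range(target_index - 1, -1, -1):
        context.insert(0, sentences[i])
        total += len(sentences[i].split())
        if total >= min_tokens:
            break
    return context


def split_target(target_sentence: str) -> Tuple[str, str]:
    """Split the target sentence into the visible prefix and the
    target word (its last whitespace-separated token)."""
    tokens = target_sentence.split()
    return " ".join(tokens[:-1]), tokens[-1]


# Toy passage: four 30-token filler sentences followed by the target.
sentences = [" ".join(["w"] * 30) for _ in range(4)]
sentences.append("he picked up the phone")

context = select_context(sentences, target_index=4)
prefix, target_word = split_target(sentences[4])
```

In this toy passage one preceding sentence gives 30 tokens and two give 60, so the context is the two sentences closest to the target. A model would then be shown the context plus `prefix` and asked to predict `target_word`.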