TriviaQA Dataset
A TriviaQA Dataset is a large-scale QA-reading comprehension dataset for training and evaluating reading comprehension and question answering systems.
- Context:
- It contains over 650K question-answer-evidence triples.
- ...
- Example(s):
- …
- Counter-Example(s):
- a CoQA Dataset,
- a CNN-Daily Mail Dataset,
- a FastQA Dataset,
- a MS COCO Dataset,
- a NarrativeQA Dataset,
- a NewsQA Dataset,
- a RACE Dataset,
- a SearchQA Dataset,
- a SQuAD Dataset.
- See: QA from Corpus, MS MARCO.
References
2017
- (Joshi et al., 2017) ⇒ Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. (2017). “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension.” In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Volume 1: Long Papers.
- QUOTE: TriviaQA contains over 650K question-answer-evidence triples, that are derived by combining 95K Trivia enthusiast authored question-answer pairs with on average six supporting evidence documents per question. To our knowledge, TriviaQA is the first dataset where full-sentence questions are authored organically (i.e. independently of an NLP task) and evidence documents are collected retrospectively from Wikipedia and the Web. This decoupling of question generation from evidence collection allows us to control for potential bias in question style or content, while offering organically generated questions from various topics. Designed to engage humans, TriviaQA presents a new challenge for RC models. They should be able to deal with large amounts of text from various sources such as news articles, encyclopedic entries and blog articles, and should handle inference over multiple sentences.
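To make the question-answer-evidence triple structure described above concrete, the sketch below shows one way to inspect a record, assuming the Hugging Face `datasets` library's `trivia_qa` loader; the "rc" configuration name and the field names (`entity_pages`, `search_results`, etc.) follow that loader's dataset card and are not part of the original TriviaQA release.

```python
# Minimal sketch: inspecting one TriviaQA question-answer-evidence triple.
# Assumes the Hugging Face `datasets` library; config and field names
# below come from its "trivia_qa" dataset card, not the original release.
from datasets import load_dataset

# The "rc" (reading comprehension) config pairs each question with its
# evidence documents.
ds = load_dataset("trivia_qa", "rc", split="validation")

ex = ds[0]
print(ex["question"])            # organically authored, full-sentence question
print(ex["answer"]["value"])     # canonical answer string
# Evidence documents were collected retrospectively from Wikipedia and
# from Web search results (about six per question on average).
print(len(ex["entity_pages"]["wiki_context"]))        # Wikipedia evidence texts
print(len(ex["search_results"]["search_context"]))    # Web evidence texts
```

Note that each record bundles the question and answer with all of its supporting evidence documents, mirroring the decoupled question-generation and evidence-collection process the quote describes.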