NewsQA Dataset
A NewsQA Dataset is a large-scale QA dataset that supports machine reading comprehension tasks over news articles.
- Context:
- It contains 119,633 natural language questions posed by crowdworkers on 12,744 news articles from CNN.
- Online repository: https://github.com/Maluuba/newsqa
- Datasets available at: https://www.microsoft.com/en-us/research/project/newsqa-dataset/
- Benchmark Task: NewsQA Machine Comprehension Challenge.
- Example(s):
- Counter-Example(s):
- a CoQA Dataset,
- a FigureQA Dataset,
- a Frames Dataset,
- a MS COCO Dataset,
- a NarrativeQA Dataset,
- a RACE Dataset,
- a SearchQA Dataset,
- a SQuAD Dataset,
- a TriviaQA Dataset.
- See: Question-Answering System, Natural Language Processing Task, Natural Language Understanding Task, Natural Language Generation Task.
References
2020
- (MS Research Montreal, 2020) ⇒ https://www.microsoft.com/en-us/research/project/newsqa-dataset/ Retrieved: 2020-12-27.
- QUOTE: With massive volumes of written text being produced every second, how do we make sure that we have the most recent and relevant information available to us? Microsoft research Montreal is tackling this problem by building AI systems that can read and comprehend large volumes of complex text in real-time.
The purpose of the NewsQA dataset is to help the research community build algorithms that are capable of answering questions requiring human-level comprehension and reasoning skills.
Leveraging CNN articles from the DeepMind Q&A Dataset, we prepared a crowd-sourced machine reading comprehension dataset of 120K Q&A pairs.
2017
- (Trischler et al., 2017) ⇒ Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. (2017). “NewsQA: A Machine Comprehension Dataset.” In: Proceedings of the 2nd Workshop on Representation Learning for NLP (Rep4NLP@ACL 2017).
- QUOTE: In this paper, we present a challenging new large-scale dataset for machine comprehension: NewsQA. It contains 119,633 natural language questions posed by crowdworkers on 12,744 news articles from CNN. In SQuAD, crowdworkers are tasked with both asking and answering questions given a paragraph. In contrast, NewsQA was built using a collection process designed to encourage exploratory, curiosity-based questions that may better reflect realistic information-seeking behaviors. Particularly, a set of crowdworkers were tasked to answer questions given a summary of the article, i.e. the CNN article highlights. A separate set of crowdworkers selects answers given the full article, which consist of word spans in the corresponding articles. This gives rise to interesting patterns such as questions that may not be answerable by the original article.
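The span-based answer format described above (answers are word spans in the article, and some questions have no answer in the article) can be illustrated with a minimal sketch. The `QARecord` class, its field names, and the sample sentence are hypothetical and chosen only for illustration; they are not the schema of the released NewsQA files.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class QARecord:
    """Hypothetical record layout (an assumption, not the official NewsQA schema)."""
    article: str
    question: str
    # Word indices [start, end) into the whitespace-tokenized article,
    # or None when the article does not contain an answer.
    span: Optional[Tuple[int, int]]


def answer_text(record: QARecord) -> Optional[str]:
    """Return the answer span as text, or None for an unanswerable question."""
    if record.span is None:
        return None
    start, end = record.span
    words = record.article.split()
    return " ".join(words[start:end])


# Invented example sentence, not taken from the dataset.
article = "The storm reached the coast on Monday and forced thousands to evacuate ."
answerable = QARecord(article, "When did the storm reach the coast?", (6, 7))
unanswerable = QARecord(article, "How many homes were destroyed?", None)

print(answer_text(answerable))
print(answer_text(unanswerable))
```

Representing the span as an optional pair keeps unanswerable questions in the same record type as answerable ones, so a model or evaluation script has to handle both cases explicitly.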