Microsoft Machine Reading Comprehension (MS MARCO) Dataset
A Microsoft Machine Reading Comprehension (MS MARCO) Dataset is a large-scale, real-world dataset for machine reading comprehension and question-answering tasks.
- Context:
- It was developed by Nguyen et al. (2016).
- ...
- Example(s):
- …
- Counter-Example(s):
- See: QA from Corpus, Question-Answer Dataset.
References
2016
- (Nguyen et al., 2016) ⇒ Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. (2016). “MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.” In: Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016).
- QUOTE: In this paper we introduce Microsoft Machine Reading Comprehension (MS MARCO) - a large scale real-world reading comprehension dataset that addresses the shortcomings of the existing datasets for RC and QA discussed above. The questions in the dataset are real anonymized queries issued through Bing or Cortana and the documents are related web pages which may or may not be enough to answer the question. For every question in the dataset, we have asked a crowdsourced worker to answer it, if they can, and to mark relevant passages which provide supporting information for the answer. If they can’t answer it we consider the question unanswerable and we also include a sample of those in MS MARCO. We believe a characteristic of reading comprehension is to understand when there is not enough information or even conflicting information so a question is unanswerable. The answer is strongly encouraged to be in the form of a complete sentence, so the workers may write a longform passage on their own. MS MARCO includes 100,000 questions, 1 million passages, and links to over 200,000 documents. Compared to previous publicly available datasets, this dataset is unique in the sense that (a) all questions are real user queries, (b) the context passages, which answers are derived from, are extracted from real web documents, (c) all the answers to the queries are human generated, (d) a subset of these queries has multiple answers, (e) all queries are tagged with segment information.
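The record structure described in the quote (a real user query, candidate web passages, worker-marked supporting passages, zero or more human-written answers, and a segment tag) could be modeled roughly as follows. This is a minimal sketch, not the dataset's actual schema: the class names, field names (`query`, `segment`, `is_selected`, and so on), and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Passage:
    """One context passage extracted from a web document (hypothetical fields)."""
    text: str
    url: str
    is_selected: bool = False  # True if a worker marked it as supporting the answer


@dataclass
class MarcoRecord:
    """One question with its passages and answers (hypothetical layout)."""
    query: str    # the anonymized user query
    segment: str  # the segment tag attached to the query
    passages: List[Passage] = field(default_factory=list)
    answers: List[str] = field(default_factory=list)  # empty list = unanswerable

    def is_answerable(self) -> bool:
        # A question counts as answerable if at least one answer was written.
        return len(self.answers) > 0

    def supporting_passages(self) -> List[Passage]:
        # Only the passages that a worker marked as supporting information.
        return [p for p in self.passages if p.is_selected]


# Made-up sample record, for illustration only.
example = MarcoRecord(
    query="what is the boiling point of water",
    segment="numeric",
    passages=[
        Passage("Water boils at 100 degrees Celsius at sea level.",
                "https://example.com/a", is_selected=True),
        Passage("Water is a transparent liquid.", "https://example.com/b"),
    ],
    answers=["Water boils at 100 degrees Celsius at sea level."],
)
```

Representing an unanswerable question as an empty `answers` list keeps answerable and unanswerable records in one type, which matches the quote's point that unanswerable questions are included in the dataset alongside answered ones.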