CNN-Daily Mail Dataset
A CNN-Daily Mail Dataset is a reading comprehension dataset that consists of context–query–answer triples derived from news stories on the CNN and Daily Mail websites.
- AKA: CNN-Daily Mail Corpus.
- Context:
- It was developed by (Hermann et al., 2015).
- …
- Example(s):
- Counter-Example(s):
- See: Question-Answering System, Question-Answer Dataset, Natural Language Processing Task, Natural Language Understanding Task, Natural Language Generation Task.
References
2023
- GBard
- The CNN/Daily Mail Corpus is a large text dataset for text summarization. It was created by researchers at the University of Washington and Allen Institute for Artificial Intelligence in 2017. The corpus contains 286,817 training pairs, 13,368 validation pairs, and 11,487 test pairs. Each pair consists of a news article and a corresponding abstractive summary of the article.
The CNN/Daily Mail Corpus is one of the most popular datasets for text summarization research. It is used to train and evaluate a wide variety of text summarization models, including both extractive and abstractive models. Extractive models select sentences from the original article to create a summary, while abstractive models generate new sentences to create a summary.
The CNN/Daily Mail Corpus is also used for other natural language processing tasks, such as machine reading comprehension and question answering.
2016
- Nallapati, Ramesh, Bowen Zhou, Caglar Gulcehre, and Bing Xiang. “Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond.” arXiv preprint arXiv:1602.06023 (2016).
2015
- (Hermann et al., 2015) ⇒ Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. (2015). “Teaching Machines to Read and Comprehend.” In: Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15). ePrint: arXiv:1506.03340v3
- QUOTE: In this work we seek to directly address the lack of real natural language training data by introducing a novel approach to building a supervised reading comprehension data set. We observe that summary and paraphrase sentences, with their associated documents, can be readily converted to context–query–answer triples using simple entity detection and anonymisation algorithms. Using this approach we have collected two new corpora of roughly a million news stories with associated queries from the CNN and Daily Mail websites.
We demonstrate the efficacy of our new corpora by building novel deep learning models for reading comprehension. These models draw on recent developments for incorporating attention mechanisms into recurrent neural network architectures (...)
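The conversion described in the quote above, from a document and a summary sentence to context–query–answer triples, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the entity list is assumed to be supplied by some upstream entity detector, and the names `ClozeTriple`, `anonymise`, `build_triples`, and the `@entityN` / `@placeholder` marker strings are hypothetical choices made for this example. Matching is done by exact string comparison, which a real pipeline would likely replace with proper entity and coreference resolution.

```python
import re
from dataclasses import dataclass


@dataclass
class ClozeTriple:
    context: str  # anonymised article text
    query: str    # anonymised summary sentence with one entity masked
    answer: str   # entity marker that fills the placeholder


def anonymise(text, entity_ids):
    """Replace each known entity mention with its marker, longest names first."""
    for name in sorted(entity_ids, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, entity_ids[name], text)
    return text


def build_triples(article, summary_sentence, entities):
    """Build one triple per entity found in both the article and the summary sentence."""
    # Assumption: `entities` comes from an external entity detector.
    entity_ids = {name: f"@entity{i}" for i, name in enumerate(entities)}
    context = anonymise(article, entity_ids)
    query_full = anonymise(summary_sentence, entity_ids)

    triples = []
    for marker in entity_ids.values():
        # The negative lookahead keeps "@entity1" from matching inside "@entity10".
        pattern = re.escape(marker) + r"(?!\d)"
        if re.search(pattern, query_full) and re.search(pattern, context):
            # Mask the first mention of this entity in the summary sentence.
            query = re.sub(pattern, "@placeholder", query_full, count=1)
            triples.append(ClozeTriple(context=context, query=query, answer=marker))
    return triples


# Invented example data, for illustration only.
article = (
    "Alice Smith joined Acme Corp on Monday. "
    "Acme Corp said Alice Smith will lead its research team."
)
summary_sentence = "Alice Smith will lead research at Acme Corp"
entities = ["Alice Smith", "Acme Corp"]

for triple in build_triples(article, summary_sentence, entities):
    print(triple.query, "->", triple.answer)
```

Each summary sentence can yield several queries, one per shared entity, which is how a modest number of articles can produce a much larger number of training examples. Replacing names with markers is meant to make a model rely on the context document rather than on background knowledge about the entities.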