CNN-Daily Mail Dataset


A CNN-Daily Mail Dataset is a reading comprehension dataset that consists of context–query–answer triples derived from news articles on the CNN and Daily Mail websites.
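
As a rough illustration of the triple structure described above, the following Python sketch models a single record. The class name, the field names, and the toy record are assumptions made for this example; they are not taken from the dataset's actual file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadingComprehensionTriple:
    """One context-query-answer record (hypothetical field names)."""
    context: str  # the news article text
    query: str    # a question about the article
    answer: str   # the expected answer


def is_answer_in_context(triple: ReadingComprehensionTriple) -> bool:
    """Check whether the answer string appears verbatim in the context."""
    return triple.answer.lower() in triple.context.lower()


# Toy record written for this example, not real dataset content.
example = ReadingComprehensionTriple(
    context="The city council approved the new library budget on Monday.",
    query="What did the city council approve?",
    answer="the new library budget",
)
```

A check such as `is_answer_in_context(example)` is one simple way to confirm that a record can be answered from its context alone.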



  • GBard
    • The CNN/Daily Mail Corpus is a large text dataset for text summarization. Its summarization version was introduced by Nallapati et al. (2016), who adapted the question-answering corpus that Hermann et al. (2015) built from CNN and Daily Mail news articles. The corpus contains 286,817 training pairs, 13,368 validation pairs, and 11,487 test pairs. Each pair consists of a news article and a corresponding abstractive summary of the article.

      The CNN/Daily Mail Corpus is one of the most popular datasets for text summarization research. It is used to train and evaluate a wide variety of text summarization models, including both extractive and abstractive models. Extractive models select sentences from the original article to create a summary, while abstractive models generate new sentences to create a summary.

      The CNN/Daily Mail Corpus is also used for other natural language processing tasks, such as machine reading comprehension and question answering.


  • Nallapati, Ramesh, Bowen Zhou, Caglar Gulcehre, and Bing Xiang. “Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond.” arXiv preprint arXiv:1602.06023 (2016).
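
The distinction drawn above between extractive and abstractive summaries can be sketched in Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the sentence splitter is deliberately naive, the "first k sentences" rule is a simple extractive baseline rather than a method described in the source, and the article text is a toy example rather than dataset content.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def lead_k_summary(article: str, k: int = 3) -> str:
    """Extractive baseline: return the first k sentences unchanged."""
    return " ".join(split_sentences(article)[:k])


def is_extractive(article: str, summary: str) -> bool:
    """True if every summary sentence is copied verbatim from the article."""
    article_sentences = set(split_sentences(article))
    return all(s in article_sentences for s in split_sentences(summary))


# Toy article written for this example, not real dataset content.
article = (
    "Storms hit the coast on Friday. Thousands lost power. "
    "Crews restored service by Sunday. Officials praised the response."
)
extractive = lead_k_summary(article, k=2)
abstractive = "A storm cut power to thousands before weekend repairs."
```

Here `is_extractive(article, extractive)` holds because both sentences are copied from the article, while the hand-written `abstractive` string uses new wording and fails the same check.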