Automated Summarization Task
An Automated Summarization Task is a summarization task that is an automated writing task.
- Context:
- It can be measured by an Automated Summarization Performance Evaluation.
- It can range from being a Heuristic Automated Summarization Task to being a Data-Driven Automated Summarization Task.
- It can be solved by an Automated Summarization System.
- …
- Example(s):
- an Automated Text-Item Summarization Task, such as:
- a Dialog Summarization Task, such as customer support dialogue summarization.
- a Summarization Benchmark Task, such as: a CNN/DailyMail Summarization Task or an XSUM Summarization Task.
- Counter-Example(s):
- a Manual Summarization Task.
- See: Automated IE Task.
References
2024
- (Wikipedia, 2024) ⇒ https://en.wikipedia.org/wiki/Automatic_summarization Retrieved:2024-1-5.
- QUOTE: Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.
Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document. On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection. Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion. Video summaries simply retain a carefully selected subset of the original video frames and, therefore, are not identical to the output of video synopsis algorithms, where new video frames are being synthesized based on the original video content.
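The sentence-selection idea described above can be illustrated with a minimal sketch in the spirit of Luhn-style frequency scoring; the stopword list and scoring function here are simplifying assumptions, not a reference implementation.

```python
import re
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 2) -> str:
    """Luhn-style sketch: score sentences by the frequency of their content words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stopword filter; a real system would use a curated list.
    stopwords = {"the", "a", "an", "of", "to", "in", "and", "is", "it", "that", "for"}
    freq = Counter(w for w in words if w not in stopwords)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    # Keep the top-scoring sentences, presented in their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

doc = ("Automatic summarization shortens a set of data computationally. "
       "It keeps the most important or relevant information from the original content. "
       "The weather in Boston was pleasant that day.")
print(extractive_summary(doc, num_sentences=2))
```

Scoring sentences by content-word frequency and keeping the top few in document order is the simplest form of extractive summarization; modern systems replace the frequency heuristic with learned relevance models.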
2022
- (Liang, Bommasani et al., 2022) ⇒ Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. (2022). “Holistic Evaluation of Language Models.” doi:10.48550/arXiv.2211.09110
- QUOTE: Summarization. CNN/DailyMail and XSUM have been standard benchmarks for summarization for many years, but the official reference summaries in these datasets are outperformed by generated model summaries in human evaluations, especially for faithfulness (Table 8). Overall, we believe summarization benchmarks (along with metrics) must be improved by incorporating high-quality, human-written summaries (§10.1: missing-scenarios), so that we can draw meaningful conclusions on the effect of in-context learning, instruction tuning, and fine-tuning (see §8.5.1: human-evaluation-summarization).
- …
Text summarization is an established research direction in NLP (Luhn, 1958; Mani, 1999; Spärck Jones, 1999; Nenkova and McKeown, 2012), with growing practical importance given the ever-increasing volume of text that would benefit from summarization. To effectively summarize, systems must identify and yield the core relevant and informative content in the source document while removing less critical information and avoiding redundancy (Peyrard, 2019). The rise of language models in recent years has dramatically improved summarization capabilities: the ability to generate fluent and coherent human-like text serves as a core primitive towards building better summarization systems (Lewis et al., 2020b; Zhang et al., 2019b).
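The relevance-versus-redundancy trade-off mentioned above can be sketched with a greedy maximal-marginal-relevance (MMR) style selector; the Jaccard similarity below is a stand-in assumption for whatever similarity model a real system would use.

```python
def mmr_select(candidates, query_terms, k=3, lam=0.7):
    """Greedy MMR sketch: favor sentences relevant to the query terms while
    penalizing overlap with sentences that were already selected."""
    def sim(a: str, b: str) -> float:
        # Jaccard overlap of word sets; a stand-in for a learned similarity model.
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / (len(sa | sb) or 1)

    query = " ".join(query_terms)
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(s: str) -> float:
            relevance = sim(s, query)
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

sentences = ["The marathon ended in terror.",
             "The bombing altered the lives of runners.",
             "Runners' lives were altered by the bombing.",
             "One Boston Day honors acts of valor."]
print(mmr_select(sentences, ["marathon", "bombing", "boston"], k=2))
```

The `lam` parameter trades off relevance against redundancy: at 1.0 the selector ignores overlap entirely, while lower values push it toward diverse, non-repetitive selections.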
- Scenario: CNN/DailyMail
Input: Two years ago, the storied Boston Marathon ended in terror and altered the lives of runners … Many bombing survivors … celebrating "One Boston Day," which was created to recognize acts of valor and to encourage kindness among Bostonians.
Reference: Citizens gather to honor victims on One Boston Day, two years after the marathon bombings.
- Fig. 13. Example of summarization. An example instance for summarization from CNN/DailyMail. Different summarization scenarios can have significantly different properties, but this example captures the overall structure of summarization.
- Problem setting. We formulate text summarization as an unstructured sequence-to-sequence problem, where a document (e.g. a CNN news article) is the input and the LM is tasked with generating a summary that resembles the reference summary (e.g. the bullet point summary provided by CNN with their article). Figure 13 provides an example. This evaluation tests the abstractive summarization capabilities of the model, where the model is directly required to generate the summary rather than being explicitly constrained to copying words or larger extracts from the input document.
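A minimal sketch of this abstractive sequence-to-sequence setup, assuming the Hugging Face transformers library and the public facebook/bart-large-cnn checkpoint (an illustrative choice, not the specific language models evaluated here):

```python
# Sketch only: requires `pip install transformers` plus a PyTorch backend.
from transformers import pipeline

# facebook/bart-large-cnn is a BART checkpoint fine-tuned on CNN/DailyMail;
# any seq2seq summarization checkpoint fits the same document -> summary formulation.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = ("Two years ago, the storied Boston Marathon ended in terror and "
           "altered the lives of runners. Many bombing survivors gathered to "
           "celebrate One Boston Day, created to recognize acts of valor and "
           "to encourage kindness among Bostonians.")  # stand-in for a full news document
result = summarizer(article, max_length=60, min_length=15, do_sample=False)
print(result[0]["summary_text"])  # the model-generated abstractive summary
```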
To evaluate model performance, the model-generated summary is compared against a human-authored reference summary using automated metrics for overall quality (ROUGE-2; BERTScore; Lin, 2004; Zhang et al., 2020b), faithfulness (Laban et al., 2022; Fabbri et al., 2022), and extractiveness (Grusky et al., 2018). Faithfulness refers to whether all the information in the model summary is supported by the article (Cao et al., 2018; Durmus et al., 2020; Maynez et al., 2020). Extractiveness refers to the extent to which model summaries involve copying from the input document: the distinction between extractive and abstractive approaches has been widely discussed in the summarization literature (see Nenkova and McKeown, 2012). We compute extractiveness since prior work has shown that current summarization systems tend to be less faithful, on average, whenever they extract less (Durmus et al., 2020; Mrini et al., 2021; Ladhak et al., 2022).
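The ROUGE-2 component of this evaluation can be sketched with Google's rouge-score package (an illustrative tool choice; BERTScore, the faithfulness metrics, and the extractiveness statistics each need their own tooling):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
reference = "Citizens gather to honor victims on One Boston Day, two years after the marathon bombings."
generated = "Bostonians mark One Boston Day to honor marathon bombing victims two years later."
scores = scorer.score(reference, generated)  # argument order: (target, prediction)
print(scores["rouge2"].fmeasure)  # bigram-overlap F1 between reference and model summary
```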
We pay special attention to faithfulness as neural models in particular often hallucinate content that diverges from what appears in the document being summarized. Consequently, it is important to measure and improve the faithfulness of these systems since unfaithful systems may be harmful by potentially spreading misinformation, including dangerous yet hard-to-detect errors, when deployed in real-world settings. We first evaluate the LMs using recently proposed reference-free evaluation metrics that have been shown to correlate highly with human faithfulness judgments (Laban et al., 2022; Fabbri et al., 2022). Recent work has shown that some reference-free evaluation metrics may rely mostly on spurious correlations (Durmus et al., 2022). Given this, we further conducted a human user study to validate and supplement the automated evaluation.
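A reference-free faithfulness check in the spirit of NLI-based metrics such as SummaC (Laban et al., 2022) can be sketched as an entailment test between the source document and a summary sentence; the MNLI checkpoint below is an illustrative assumption, not the exact metric used here.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any MNLI-trained model can serve as the entailment backbone;
# this public checkpoint is one convenient choice.
NAME = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME)

def entailment_prob(document: str, summary_sentence: str) -> float:
    """Probability that the document (premise) entails the summary sentence."""
    inputs = tokenizer(document, summary_sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # For this checkpoint the label order is: contradiction, neutral, entailment.
    return logits.softmax(dim=-1)[0, 2].item()

doc = "The marathon ended in terror two years ago; the city now marks One Boston Day."
print(entailment_prob(doc, "The city observes One Boston Day."))      # high = faithful
print(entailment_prob(doc, "The marathon was cancelled this year."))  # low = hallucinated
```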
Datasets. There is a growing collection of summarization datasets, including datasets that capture finer-grained and more specific summarization functions (e.g. summarizing multiple documents or conditional on a user query). Bommasani and Cardie (2020) show that there is significant diversity in summarization datasets along several axes, which makes selecting a few datasets to represent summarization rather challenging. Since we are especially interested in model faithfulness in this work (as this is a known failure mode of other neural approaches to summarization), we select the CNN/DailyMail (Hermann et al., 2015a) and XSUM (Narayan et al., 2018) datasets, which are the most well-studied datasets in the literature on summarization faithfulness. This also ensures domain coverage of news-type data. Importantly, these datasets differ along a central axis studied in summarization: XSUM is a dataset with largely abstractive reference summaries (meaning the string overlap between the document and its summary in the dataset is relatively small on average), whereas CNN/DailyMail is a dataset with largely extractive reference summaries. However, these datasets do not suffice in representing the full diversity of summarization, and we encourage future work to expand on our benchmark along this axis (e.g. add datasets from domains beyond news), particularly towards domains where there is greater demand for summaries (see Reiter, 2022). And we especially highlight that these two datasets have been the subject of critique, and that broader change is required for dataset and evaluation design in summarization and natural language generation (Gehrmann et al., 2022b; Reiter, 2022).
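Both datasets are distributed through the Hugging Face datasets hub, so a minimal loading sketch looks as follows (field names follow the public dataset cards; depending on the datasets library version, extra flags such as trust_remote_code may be needed):

```python
from datasets import load_dataset  # pip install datasets

# CNN/DailyMail: largely extractive reference summaries (the "highlights" field).
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="validation[:5]")
print(cnn_dm[0]["article"][:200], "->", cnn_dm[0]["highlights"])

# XSUM: single-sentence, largely abstractive reference summaries (the "summary" field).
xsum = load_dataset("xsum", split="validation[:5]")
print(xsum[0]["document"][:200], "->", xsum[0]["summary"])
```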