2007 ASurveyOnAutomaticTextSummarization
- (Das & Martins, 2007) ⇒ Dipanjan Das, and André F. T. Martins. (2007). “A Survey on Automatic Text Summarization.” Literature Survey for the Language and Statistics II course at CMU, November, 2007 (unpublished).
Subject Headings: Automatic Text Summarization
Notes
Cited By
Quotes
Abstract
The increasing availability of online information has necessitated intensive research in the area of automatic text summarization within the Natural Language Processing (NLP) community. Over the past half a century, the problem has been addressed from many different perspectives, in varying domains and using various paradigms. This survey intends to investigate some of the most relevant approaches both in the areas of single-document and multiple-document summarization, giving special emphasis to empirical methods and extractive techniques. Some promising approaches that concentrate on specific details of the summarization problem are also discussed. Special attention is devoted to automatic evaluation of summarization systems, as future research on summarization is strongly dependent on progress in this area.
1 Introduction
The subfield of summarization has been investigated by the NLP community for nearly the last half century. Radev et al. (2002) define a summary as "a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that". This simple definition captures three important aspects that characterize research on automatic summarization:
- Summaries may be produced from a single document or multiple documents,
- Summaries should preserve important information,
- Summaries should be short.
Even if we agree unanimously on these points, it seems from the literature that any attempt to provide a more elaborate definition for the task would result in disagreement within the community. In fact, many approaches differ in how they formulate the problem. We start by introducing some common terms from the summarization literature: extraction is the procedure of identifying important sections of the text and producing them verbatim; abstraction aims to produce the important material in a new way; fusion combines extracted parts coherently; and compression aims to throw out unimportant sections of the text (Radev et al., 2002).
The earliest instances of research on summarizing scientific documents proposed paradigms for extracting salient sentences from text using features like word and phrase frequency (Luhn, 1958), position in the text (Baxendale, 1958), and key phrases (Edmundson, 1969). Much of the work published since then has concentrated on other domains, mostly newswire data. Many approaches address the problem by building systems that depend on the type of summary required. While extractive summarization is mainly concerned with what the summary content should be, usually relying solely on extraction of sentences, abstractive summarization puts a strong emphasis on form, aiming to produce a grammatical summary, which usually requires advanced language generation techniques. In a paradigm more tuned to information retrieval (IR), one can also consider topic-driven summarization, which assumes that the summary content depends on the preferences of the user and can be assessed via a query, making the final summary focused on a particular topic.
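To make the early extractive paradigm concrete, here is a minimal sketch of Luhn-style frequency-based sentence extraction in Python. It is an illustrative simplification, not Luhn's actual system: the tokenizer, the tiny stopword list, and the averaging heuristic are all assumptions introduced for this example.

```python
# A minimal sketch of Luhn-style (1958) extractive summarization:
# sentences containing many high-frequency content words are assumed salient.
# All names and heuristics here are illustrative, not the survey's method.
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "for", "on"}

def tokenize(text):
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def summarize(document, num_sentences=2):
    """Return the top-scoring sentences, in original document order."""
    # Naive sentence splitting on terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    # Term frequencies over content words approximate Luhn's "significant words".
    freqs = Counter(w for w in tokenize(document) if w not in STOPWORDS)

    def score(sentence):
        words = [w for w in tokenize(sentence) if w not in STOPWORDS]
        return sum(freqs[w] for w in words) / (len(words) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)

if __name__ == "__main__":
    doc = ("Automatic summarization condenses a text. "
           "Extractive summarization selects sentences verbatim from the text. "
           "The weather was pleasant that day. "
           "Frequency of content words is one signal of sentence salience in a text.")
    print(summarize(doc, num_sentences=2))
```

Position and key-phrase features (Baxendale, 1958; Edmundson, 1969) would enter the same framework as additional terms in the sentence-scoring function.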
A crucial issue that will certainly drive future research on summarization is evaluation. During the last fifteen years, many system evaluation competitions like TREC, DUC, and MUC have created sets of training material and have established baselines for performance levels. However, a universal strategy to evaluate summarization systems is still absent.
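As an illustration of what automatic evaluation can look like, the sketch below computes a simple n-gram-overlap recall in the spirit of ROUGE-N (Lin, 2004), the family of measures later adopted in DUC evaluations. The function names and whitespace tokenization are assumptions of this example, not a protocol endorsed by the survey.

```python
# A minimal sketch of n-gram-overlap evaluation in the spirit of ROUGE-N:
# recall is the fraction of reference n-grams also found in the system summary.
# This is an illustrative simplification, not the official ROUGE toolkit.
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams over a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system, reference, n=1):
    """Clipped n-gram recall of a system summary against one reference."""
    sys_counts = ngrams(system.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, sys_counts[g]) for g, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

if __name__ == "__main__":
    ref = "the cat sat on the mat"
    sys = "the cat lay on the mat"
    print(rouge_n_recall(sys, ref, n=1))  # 5 of 6 reference unigrams recalled
```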