Long Document Summarization Algorithm
A Long Document Summarization Algorithm is a document summarization algorithm that can be implemented by a long document summarization system (to solve a long document summarization task).
- Context:
- It can (typically) handle various document types, such as academic papers, legal documents, technical reports, and lengthy articles, characterized by their substantial length and complex structure.
- It may incorporate techniques from both Extractive Summarization and Abstractive Summarization to optimize the balance between accuracy and coherence in the generated summaries.
- It can (often) be essential in areas where quick assimilation of large volumes of information is necessary, enabling users to grasp the gist of a document without reading it in its entirety.
- ...
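The chunk-then-score strategy commonly used by such algorithms to cope with document length can be sketched as follows. This is a minimal illustrative extractive summarizer, assuming a naive regex sentence splitter and word-frequency scoring; the function names and parameters are illustrative, not those of any specific published system.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def score_sentences(sentences):
    """Score each sentence by the average document-level frequency of its words."""
    words = [w.lower() for s in sentences for w in re.findall(r'\w+', s)]
    freq = Counter(words)
    scores = []
    for s in sentences:
        toks = re.findall(r'\w+', s.lower())
        scores.append(sum(freq[t] for t in toks) / max(len(toks), 1))
    return scores

def summarize_long_document(text, chunk_size=20, per_chunk=2):
    """Extractive summary of a long document: split into fixed-size chunks
    of sentences, keep the top-scoring sentences from each chunk, and
    preserve the original document order."""
    sentences = split_sentences(text)
    summary = []
    for i in range(0, len(sentences), chunk_size):
        chunk = sentences[i:i + chunk_size]
        scores = score_sentences(chunk)
        ranked = sorted(range(len(chunk)), key=lambda j: scores[j], reverse=True)
        keep = sorted(ranked[:per_chunk])  # restore original order within the chunk
        summary.extend(chunk[j] for j in keep)
    return ' '.join(summary)
```

Chunking bounds the amount of text scored at once, which is the same motivation that drives the sparse-attention and divide-and-conquer designs surveyed in the references below.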
- Example(s):
- HMSumm, a hybrid extractive-abstractive method that selects sentences with a determinantal point process and then abstracts them with BART or T5 pre-trained models (Ghadimi & Beigy, 2022).
- ...
- Counter-Example(s):
- Simple Keyword Extraction Algorithms that only highlight prominent terms without generating cohesive summaries.
- Short Document Summarization Algorithms tailored for brief texts like news articles or blog posts, where the challenge of managing extensive content does not apply.
- See: Extractive Summarization, Abstractive Summarization, Natural Language Processing (NLP), Text Summarization.
References
2022a
- (Koh et al., 2022) ⇒ Huan Yee Koh, Jiaxin Ju, Ming Liu, and Shirui Pan. (2022). “An Empirical Survey on Long Document Summarization: Datasets, Models, and Metrics.” In: ACM Computing Surveys, 55(8).
- ABSTRACT: Long documents such as academic articles and business reports have been the standard format to detail out important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short and concise texts to encapsulate the most important information would thus be significant in aiding the reader’s comprehension. Recently, with the advent of neural architectures, significant research efforts have been made to advance automatic text summarization systems, and numerous studies on the challenges of extending these systems to the long document domain have emerged. In this survey, we provide a comprehensive overview of the research on long document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study on the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of the summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.
2022b
- (Ghadimi & Beigy, 2022) ⇒ Alireza Ghadimi, and Hamid Beigy. (2022). “Hybrid Multi-document Summarization Using Pre-trained Language Models.” In: Expert Systems with Applications, 192.
- ABSTRACT: Abstractive multi-document summarization is a type of automatic text summarization. It obtains information from multiple documents and generates a human-like summary from them. In this paper, we propose an abstractive multi-document summarization method called HMSumm. The proposed method is a combination of extractive and abstractive summarization approaches. First, it constructs an extractive summary from multiple input documents, and then uses it to generate the abstractive summary. Redundant information, which is a global problem in multi-document summarization, is managed in the first step. Specifically, the determinantal point process (DPP) is used to deal with redundancy. This step also controls the length of the input sequence for the abstractive summarization process. This step has two effects: The first is to reduce the computational time, and the second is to preserve the important parts of the input documents for an abstractive summarizer. We employ a deep submodular network (DSN) to determine the quality of the sentences in the extractive summary, and use BERT-based similarities to compute the redundancy. The obtained extractive summary is fed into BART and T5 pre-trained models to generate two abstractive summaries. We use the diversity of sentences in each summary to select one of them as the final abstractive summary. To evaluate the performance of HMSumm, we use both human evaluations and ROUGE-based assessments, and compare it with several state-of-the-art methods. We use DUC 2002, DUC 2004, Multi-News, and CNN/DailyMail datasets to evaluate the algorithms. The experimental results show that HMSumm outperforms the related state-of-the-art algorithms.
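The redundancy-aware sentence selection that HMSumm performs with a determinantal point process can be approximated by a much simpler greedy scheme. The sketch below uses Maximal Marginal Relevance (MMR) over bag-of-words cosine similarity as a stand-in for the DPP step; this is an illustrative simplification, not HMSumm's actual implementation, and the `mmr_select` name and `lam` trade-off parameter are assumptions.

```python
import re
from collections import Counter
from math import sqrt

def bow(sentence):
    """Bag-of-words count vector for a sentence."""
    return Counter(re.findall(r'\w+', sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(sentences, k=3, lam=0.7):
    """Greedy Maximal Marginal Relevance: at each step, pick the sentence
    that best trades off relevance to the whole document (weight lam)
    against redundancy with the sentences already selected (weight 1-lam)."""
    vecs = [bow(s) for s in sentences]
    doc_vec = sum(vecs, Counter())  # document-level word counts
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            rel = cosine(vecs[i], doc_vec)
            red = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in sorted(selected)]
```

Like the DPP in HMSumm, this selection step both removes redundant content and bounds the length of the input passed to a downstream abstractive model.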