2024 BenchmarkingLargeLanguageModels
- (Zhang, Ladhak et al., 2024) ⇒ Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. (2024). “Benchmarking Large Language Models for News Summarization.” In: Transactions of the Association for Computational Linguistics, 12. doi:10.1162/tacl_a_00632
Subject Headings: Automated Summarization, Instruction-Tuning.
Notes
- It systematically evaluates ten LLMs for news summarization across different pretraining methods, prompts, and model scales, covering zero-shot, few-shot, and instruction-tuned settings (a minimal zero-shot prompting sketch follows these notes).
- It identifies instruction tuning as a more significant factor than model scale for enhancing LLMs' zero-shot summarization capabilities.
- It highlights issues with the quality of reference summaries in popular datasets, which have led to the underestimation of human performance and difficulties in model evaluation.
- It employs high-quality summaries generated by freelance writers for a more accurate comparison with LLM-generated summaries, finding comparable performance.
- It challenges the assumption that human-written summaries are superior to those generated by LLMs, showcasing the potential of instruction-tuned models.
- It calls for the development and use of high-quality reference summaries for better benchmarking and evaluation of summarization models.
- It contributes to the computational linguistics field by providing insights into improving LLM training and evaluation, and it makes available high-quality summaries and evaluation data for future research.
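The zero-shot setting examined in the paper amounts to giving an instruction-tuned model the article text plus a plain-language instruction and taking its output as the summary. The following is a minimal sketch of such a setup, assuming the Hugging Face transformers pipeline and Flan-T5 as an illustrative instruction-tuned model; neither the model choice nor the prompt wording is taken from the paper.

```python
# Minimal zero-shot summarization sketch (illustrative, not the paper's code).
# Assumes the Hugging Face `transformers` library; the model and prompt wording
# are stand-ins, not the systems or prompts evaluated in the paper.
from transformers import pipeline

summarizer = pipeline("text2text-generation", model="google/flan-t5-base")

article = "..."  # a news article, e.g., from CNN/DailyMail or XSum
prompt = f"Summarize the following news article in three sentences:\n\n{article}"
result = summarizer(prompt, max_new_tokens=128)
print(result[0]["generated_text"])
```

In the paper's human evaluation, outputs of this kind from instruction-tuned models were compared against both dataset reference summaries and summaries written by freelance writers.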
Cited By
Quotes
Abstract
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 BenchmarkingLargeLanguageModels | Kathleen R. McKeown, Percy Liang, Tianyi Zhang, Esin Durmus, Faisal Ladhak, Tatsunori B. Hashimoto | 12 | 2024 | Benchmarking Large Language Models for News Summarization | | Transactions of the Association for Computational Linguistics | | 10.1162/tacl_a_00632 | | 2024 |