2024 BenchmarkingLargeLanguageModels
- (Zhang, Ladhak et al., 2024) ⇒ Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. (2024). “Benchmarking Large Language Models for News Summarization.” In: Transactions of the Association for Computational Linguistics, 12. doi:10.1162/tacl_a_00632
Subject Headings: Automated Summarization, Instruction-Tuning.
Notes
- It systematically evaluates ten LLMs for news summarization across different pretraining methods, prompts, and model scales, covering zero-shot, few-shot, and instruction-tuned settings (a minimal zero-shot prompting sketch follows these notes).
- It identifies instruction tuning as a more significant factor than model scale for enhancing LLMs' zero-shot summarization capabilities.
- It highlights issues with the quality of reference summaries in popular datasets, which have led to the underestimation of human performance and difficulties in model evaluation.
- It employs high-quality summaries generated by freelance writers for a more accurate comparison with LLM-generated summaries, finding comparable performance.
- It challenges the assumption that human-written summaries are superior to those generated by LLMs, showcasing the potential of instruction-tuned models.
- It calls for the development and use of high-quality reference summaries for better benchmarking and evaluation of summarization models.
- It contributes to the computational linguistics field by providing insights into improving LLM training and evaluation, and it makes available high-quality summaries and evaluation data for future research.
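The zero-shot setting examined in the paper amounts to giving an instruction-tuned model the article text plus a plain-language instruction and taking its output as the summary. The following is a minimal sketch of such a setup, assuming the Hugging Face transformers pipeline and Flan-T5 as an illustrative instruction-tuned model; neither the model choice nor the prompt wording is taken from the paper.

```python
# Minimal zero-shot summarization sketch (illustrative, not the paper's code).
# Assumes the Hugging Face `transformers` library; the model and prompt wording
# are stand-ins, not the systems or prompts evaluated in the paper.
from transformers import pipeline

summarizer = pipeline("text2text-generation", model="google/flan-t5-base")

article = "..."  # a news article, e.g., from CNN/DailyMail or XSum
prompt = f"Summarize the following news article in three sentences:\n\n{article}"
result = summarizer(prompt, max_new_tokens=128)
print(result[0]["generated_text"])
```

In the paper's human evaluation, outputs of this kind from instruction-tuned models were compared against both dataset reference summaries and summaries written by freelance writers.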
Cited By
Quotes
Abstract
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 BenchmarkingLargeLanguageModels | Kathleen R. McKeown, Percy Liang, Tianyi Zhang, Esin Durmus, Faisal Ladhak, Tatsunori B. Hashimoto | 12 | 2024 | Benchmarking Large Language Models for News Summarization | | Transactions of the Association for Computational Linguistics | | 10.1162/tacl_a_00632 | | 2024 |