2024 BenchmarkingLargeLanguageModels

Subject Headings: Automated Summarization, Instruction-Tuning.

Notes

  • It evaluates ten LLMs for news summarization through human evaluation, varying pretraining method, prompt, and model scale across zero-shot, few-shot, and instruction-tuned settings.
  • It identifies instruction tuning as a more significant factor than model scale for enhancing LLMs' zero-shot summarization capabilities.
  • It highlights issues with the quality of reference summaries in popular datasets, which have led to the underestimation of human performance and difficulties in model evaluation.
  • It employs high-quality summaries generated by freelance writers for a more accurate comparison with LLM-generated summaries, finding comparable performance.
  • It challenges the assumption that human-written summaries are superior to those generated by LLMs, showcasing the potential of instruction-tuned models.
  • It calls for the development and use of high-quality reference summaries for better benchmarking and evaluation of summarization models.
  • It contributes to the computational linguistics field by providing insights into improving LLM training and evaluation, and it makes available high-quality summaries and evaluation data for future research.

Cited By

Quotes

Abstract

Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, not model size, is the key to the LLM’s zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.

References

2024 BenchmarkingLargeLanguageModels

  • Author: Kathleen R. McKeown, Percy Liang, Tianyi Zhang, Esin Durmus, Faisal Ladhak, Tatsunori B Hashimoto
  • Title: Benchmarking Large Language Models for News Summarization
  • DOI: 10.1162/tacl_a_00632
  • Year: 2024