2024 EvaluatingTextSummariesGenerate
- (Shakil et al., 2024) ⇒ Hassan Shakil, Atqiya Munawara Mahi, Phuoc Nguyen, Zeydy Ortiz, and Mamoun T. Mardini. (2024). “Evaluating Text Summaries Generated by Large Language Models Using OpenAI's GPT.” In: arXiv preprint arXiv:2405.04053. doi:10.48550/arXiv.2405.04053
Subject Headings: Text Summary Evaluation, LLM-Supported Evaluation, Content Assessment.
Notes
- The paper examines the effectiveness of using OpenAI's GPT models to evaluate text summaries generated by six transformer-based models from Hugging Face: DistilBART, BERT, ProphetNet, T5, BART, and PEGASUS.
- The paper assesses the quality of summaries based on four key attributes: conciseness, relevance, coherence, and readability.
- The paper employs both traditional summary evaluation metrics (ROUGE, Latent Semantic Analysis, and Flesch-Kincaid) and GPT-based evaluation to provide a comprehensive assessment of summary quality (both approaches are sketched after this list).
- The paper finds significant correlations between GPT evaluations and traditional metrics, particularly in assessing relevance and coherence.
- The paper demonstrates GPT's potential as a robust tool for evaluating text summaries, offering insights that complement established metrics.
- The paper suggests that integrating AI-driven tools like GPT with traditional metrics could lead to more comprehensive and nuanced evaluation methods in natural language processing.
- The paper proposes future research directions, including expanding the evaluation framework to diverse NLP tasks, exploring other transformer-based models, and refining the integration of AI-driven evaluations with traditional metrics.
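The notes above describe GPT acting as an independent evaluator rather than as a summarizer. Below is a minimal sketch of such a setup, assuming the `openai` Python client (v1+); the rating prompt, model choice, and JSON output format are hypothetical, since the paper's exact prompt is not reproduced here:

```python
import json
from openai import OpenAI  # assumes openai>=1.0 is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt: the paper's exact wording is not given here.
RATING_PROMPT = (
    "Rate the following summary of the source text on four attributes: "
    "conciseness, relevance, coherence, and readability. "
    "Score each attribute from 1 (poor) to 10 (excellent) and reply with "
    'JSON only, e.g. {"conciseness": 7, "relevance": 8, '
    '"coherence": 6, "readability": 9}.'
)

def gpt_evaluate(source_text: str, summary: str, model: str = "gpt-4") -> dict:
    """Ask GPT to score a summary on the paper's four quality attributes."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic scoring across repeated calls
        messages=[
            {"role": "system", "content": RATING_PROMPT},
            {"role": "user",
             "content": f"Source text:\n{source_text}\n\nSummary:\n{summary}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Example:
# scores = gpt_evaluate(article, candidate_summary)
# print(scores["coherence"])
```

Pinning the temperature to 0 keeps repeated evaluations of the same summary comparable, which matters when the scores are later correlated against deterministic metrics.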
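The traditional side of the comparison can be reproduced with standard libraries. A minimal sketch, assuming the `rouge-score`, `scikit-learn`, `textstat`, and `scipy` packages; the paper does not name its implementations, and these helper functions are illustrative:

```python
import numpy as np
import textstat                                   # pip install textstat
from rouge_score import rouge_scorer              # pip install rouge-score
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rouge1_f1(reference: str, summary: str) -> float:
    """ROUGE-1 F1 overlap between a reference text and a candidate summary."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    return scorer.score(reference, summary)["rouge1"].fmeasure

def lsa_similarity(source: str, summary: str, k: int = 2) -> float:
    """Cosine similarity of source and summary after a rank-k LSA projection.

    LSA applies an SVD to TF-IDF vectors; with only two documents the
    usable rank is at most 2, so in practice the SVD would be fit over a
    larger corpus.
    """
    tfidf = TfidfVectorizer().fit_transform([source, summary]).toarray()
    u, s, _ = np.linalg.svd(tfidf, full_matrices=False)
    k = min(k, len(s))
    reduced = u[:, :k] * s[:k]          # documents in the latent space
    return float(cosine_similarity(reduced[:1], reduced[1:])[0, 0])

def readability(summary: str) -> float:
    """Flesch-Kincaid grade level of the summary text."""
    return textstat.flesch_kincaid_grade(summary)

def correlate(gpt_scores, metric_scores):
    """Pearson r and p-value between GPT ratings and a traditional metric,
    mirroring the paper's correlation analysis."""
    return pearsonr(gpt_scores, metric_scores)
```

Scores from `gpt_evaluate` and from these metric functions, collected over many summaries, can then be passed to `correlate` to examine the kind of relevance and coherence agreement the paper reports.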
Cited By
Quotes
Abstract
This research examines the effectiveness of OpenAI's GPT models as independent evaluators of text summaries generated by six transformer-based models from Hugging Face: DistilBART, BERT, ProphetNet, T5, BART, and PEGASUS. We evaluated these summaries based on essential properties of a high-quality summary - conciseness, relevance, coherence, and readability - using traditional metrics such as ROUGE and Latent Semantic Analysis (LSA). Uniquely, we also employed GPT not as a summarizer but as an evaluator, allowing it to independently assess summary quality without predefined metrics. Our analysis revealed significant correlations between GPT evaluations and traditional metrics, particularly in assessing relevance and coherence. The results demonstrate GPT's potential as a robust tool for evaluating text summaries, offering insights that complement established metrics and providing a basis for comparative analysis of transformer-based models in natural language processing tasks.
References
| Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|
| Hassan Shakil; Atqiya Munawara Mahi; Phuoc Nguyen; Zeydy Ortiz; Mamoun T. Mardini | | 2024 | Evaluating Text Summaries Generated by Large Language Models Using OpenAI's GPT | | arXiv preprint arXiv:2405.04053 | | 10.48550/arXiv.2405.04053 | | 2024 |