HaluEval Benchmark
A HaluEval Benchmark is an NLP benchmark that evaluates hallucinated content recognition and hallucinated content avoidance.
- Context:
- It can (typically) include Question Answering Tasks, Dialogue Generation Tasks, and Text Summarization Tasks.
- It can (typically) aim to benchmark Hallucination Detection Algorithms.
- ...
- It can leverage Human-Annotated Data and Automatically-Generated Data.
- It can evaluate knowledge-grounded dialogue systems and QA systems.
- It can use a two-step framework to sample and filter hallucinated content, ensuring that the most challenging examples are included for evaluation.
- It can be used to study the conditions under which large language models are most likely to produce hallucinated content.
- It can assess improvements in hallucination recognition through external knowledge integration or reasoning steps.
- It can range from identifying factual hallucinations to recognizing stylistic inconsistencies in generated text.
- It can be a large dataset (e.g., 35,000 samples).
- It can contain human-labeled and machine-generated hallucinations.
- ...
- Example(s):
- ...
- Counter-Example(s):
- An NLP Task Benchmark, such as SuperGLUE or SQuAD, that focuses on general NLP tasks without a specific emphasis on hallucination evaluation.
- Evaluation metrics that focus solely on fluency or relevance without assessing factual correctness or hallucinations.
- ...
- See: Hallucination, AI Model Evaluation, Factuality in NLP, ChatGPT, HotpotQA
References
2024
- LLM
- HaluEval provides a comprehensive collection of generated and human-annotated hallucinated samples to evaluate LLMs' performance in recognizing hallucinations[1][2]. It contains 5,000 general user queries with ChatGPT responses and 30,000 task-specific examples from three tasks: question answering, knowledge-grounded dialogue, and text summarization[2].
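- The released data can be inspected directly with a short loading script. The sketch below assumes a local clone of the RUCAIBox/HaluEval repository, a JSON-lines task file, and field names such as knowledge, question, right_answer, and hallucinated_answer; these file and field names are assumptions and may differ from the current repository layout.
```python
import json

# Hypothetical local path to one HaluEval task file; the file name is
# assumed from the RUCAIBox/HaluEval repository layout and may differ.
QA_FILE = "HaluEval/data/qa_data.json"

def load_samples(path):
    """Load one JSON object per line (a JSON-lines layout is assumed)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

samples = load_samples(QA_FILE)
print(len(samples))               # number of QA samples in the file
print(sorted(samples[0].keys()))  # assumed fields: hallucinated_answer,
                                  # knowledge, question, right_answer
```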
- Data Generation: The benchmark data was created through two main approaches:
1. Automatic generation: Using a two-stage framework called "sampling-then-filtering" that leverages ChatGPT to generate hallucinated samples based on existing datasets[4].
2. Human annotation: Hiring human labelers to annotate hallucinations in ChatGPT responses[4].
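- A minimal sketch of the sampling-then-filtering idea follows. The prompt wording, the plausibility-scoring heuristic, and the complete() helper are illustrative assumptions, not the exact instructions used in the paper.
```python
def complete(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., to ChatGPT);
    swap in a real API client before running."""
    raise NotImplementedError

def sample_hallucinated_answers(question: str, right_answer: str, n: int = 2) -> list[str]:
    """Step 1 (sampling): ask the model for plausible but wrong answers."""
    prompt = (
        "Write an answer that sounds plausible but is factually incorrect.\n"
        f"Question: {question}\nCorrect answer: {right_answer}\nIncorrect answer:"
    )
    return [complete(prompt) for _ in range(n)]

def filter_most_plausible(question: str, candidates: list[str]) -> str:
    """Step 2 (filtering): keep the candidate the model rates as most
    plausible, so only the hardest example enters the benchmark."""
    def plausibility(answer: str) -> float:
        prompt = (
            f"Question: {question}\nCandidate answer: {answer}\n"
            "Rate how plausible this answer sounds on a 0-10 scale; reply with a number only."
        )
        return float(complete(prompt).strip())
    return max(candidates, key=plausibility)
```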
- Key Findings: Experiments conducted with HaluEval revealed several important insights:
- ChatGPT tends to generate hallucinated content by fabricating unverifiable information in about 19.5% of its responses[4].
- Existing LLMs face significant challenges in identifying hallucinations, with ChatGPT achieving only 62.59% accuracy in question answering tasks[4].
- Providing external knowledge or adding intermediate reasoning steps can improve LLMs' ability to recognize hallucinations[4].
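- Recognition accuracy figures such as the 62.59% above are proportions of correct yes/no judgments about whether a shown response contains hallucinated content. A minimal scoring sketch, assuming the judgments were already collected as free-text "Yes"/"No" strings:
```python
def recognition_accuracy(judgments, labels):
    """judgments: model outputs such as 'Yes'/'No';
    labels: ground truth (True = the shown response is hallucinated)."""
    correct = 0
    for judgment, is_hallucinated in zip(judgments, labels):
        predicted = judgment.strip().lower().startswith("yes")
        correct += predicted == is_hallucinated
    return correct / len(labels)

# Toy example: three of four judgments match the labels -> 0.75.
print(recognition_accuracy(["Yes", "No", "Yes", "No"],
                           [True, False, False, False]))
```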
- Applications: HaluEval serves as a valuable tool for:
1. Evaluating LLMs' propensity to hallucinate
2. Analyzing what types of content, and to what extent, LLMs tend to generate hallucinations
3. Developing strategies to improve hallucination detection and prevention in LLMs
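- For the third application, the finding that external knowledge and intermediate reasoning improve recognition suggests a knowledge-augmented, step-by-step judging prompt. In the sketch below, the retrieve_knowledge() helper and the prompt wording are hypothetical placeholders rather than part of the HaluEval release.
```python
def retrieve_knowledge(question: str) -> str:
    """Hypothetical retriever (e.g., a Wikipedia search) returning
    supporting passages for the question."""
    raise NotImplementedError

def judge_with_knowledge(question: str, answer: str, complete) -> str:
    """Ask a judging model for a yes/no hallucination verdict, grounding
    the judgment in retrieved knowledge and requesting step-by-step
    reasoning before the final verdict."""
    knowledge = retrieve_knowledge(question)
    prompt = (
        f"Knowledge: {knowledge}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reason step by step about whether the answer is supported by the "
        "knowledge, then end with 'Verdict: Yes' if the answer contains "
        "hallucinated content or 'Verdict: No' otherwise."
    )
    return complete(prompt)
```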
- Citations:
[1] https://paperswithcode.com/dataset/halueval
[2] https://github.com/RUCAIBox/HaluEval
[3] https://home.nomic.ai/blog/posts/evaluating-llm-hallucination-benchmarks-with-embeddings
[4] https://ar5iv.labs.arxiv.org/html/2305.11747
[5] https://aclanthology.org/2023.emnlp-main.397.pdf
[6] https://www.semanticscholar.org/paper/HaluEval:-A-Large-Scale-Hallucination-Evaluation-Li-Cheng/e0384ba36555232c587d4a80d527895a095a9001
[7] https://arxiv.org/abs/2305.11747
[8] https://www.statista.com/statistics/1465328/halueval-hallucination-benchmark/