HaluEval Benchmark


A HaluEval Benchmark is an NLP benchmark that evaluates hallucinated content recognition and hallucinated content avoidance.

References

2024

  • LLM
    • HaluEval provides a comprehensive collection of generated and human-annotated hallucinated samples to evaluate LLMs' performance in recognizing hallucinations[1][2]. It contains 5,000 general user queries with ChatGPT responses and 30,000 task-specific examples from three tasks: question answering, knowledge-grounded dialogue, and text summarization[2].
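As a rough illustration of the dataset's shape, the sketch below loads one of the task files. The `data/qa_data.json` path, the JSON Lines layout, and the field names are assumptions based on the RUCAIBox/HaluEval repository's published layout; adjust them to the actual release.

```python
import json

# Minimal sketch: load one HaluEval task file. The path, the JSON Lines
# layout (one JSON object per line), and the field names are assumptions
# based on the RUCAIBox/HaluEval repository; adjust to the actual release.
def load_samples(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

qa_samples = load_samples("data/qa_data.json")
print(len(qa_samples))        # 10,000 per task in the paper's split
print(sorted(qa_samples[0]))  # e.g. hallucinated_answer, knowledge, question, right_answer
```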
    • Data Generation: The benchmark data was created through two main approaches (the automatic framework is sketched below):
      1. Automatic generation: a two-stage "sampling-then-filtering" framework that prompts ChatGPT to generate hallucinated samples from existing task datasets[4].
      2. Human annotation: hiring human labelers to mark hallucinations in ChatGPT responses to general user queries[4].
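The following is a minimal sketch of the sampling-then-filtering idea, not HaluEval's actual generation code: `chat` stands in for any LLM completion call, and the prompt wording and index-based selection step are illustrative assumptions.

```python
# Hedged sketch of the two-stage "sampling-then-filtering" framework.
# `chat` stands in for any LLM completion call (str -> str); the prompt
# wording and the index-based selection are illustrative assumptions,
# not HaluEval's actual prompts.
def generate_hallucinated_sample(chat, knowledge, question, right_answer,
                                 n_candidates=4):
    # Stage 1 (sampling): elicit several distinct hallucinated answers.
    candidates = [
        chat(f"Knowledge: {knowledge}\nQuestion: {question}\n"
             "Write a plausible but factually incorrect answer.")
        for _ in range(n_candidates)
    ]
    # Stage 2 (filtering): keep the candidate judged hardest to detect,
    # i.e. most plausible relative to the correct answer.
    choice = chat(
        f"Correct answer: {right_answer}\n"
        + "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
        + "\nWhich incorrect answer is most plausible? Reply with the index only."
    )
    return candidates[int(choice.strip())]
```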
    • Key Findings: Experiments conducted with HaluEval revealed several important insights (the recognition metric is sketched below):
      - ChatGPT tends to generate hallucinated content by fabricating unverifiable information in about 19.5% of its responses[4].
      - Existing LLMs face significant challenges in identifying hallucinations; for example, ChatGPT achieves only 62.59% accuracy at recognizing hallucinations in question answering[4].
      - Providing external knowledge or adding intermediate reasoning steps can improve LLMs' ability to recognize hallucinations[4].
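The accuracy figures above come from a binary recognition task: the model sees a sample and must judge whether it contains hallucinated content. A minimal sketch of that metric, using hypothetical judgments:

```python
# Minimal sketch of the recognition metric behind figures like the
# 62.59% QA accuracy: each sample carries a gold "Yes"/"No" hallucination
# label, and accuracy is the fraction of model judgments that match it.
def recognition_accuracy(judgments, gold_labels):
    correct = sum(j == g for j, g in zip(judgments, gold_labels))
    return correct / len(gold_labels)

# Hypothetical judgments: 3 of 4 match the gold labels.
print(recognition_accuracy(["Yes", "No", "Yes", "No"],
                           ["Yes", "No", "No", "No"]))  # 0.75
```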
    • Applications: HaluEval serves as a valuable tool for (an illustrative recognition prompt follows this list):
      1. Evaluating LLMs' propensity to hallucinate
      2. Analyzing what types of content, and to what extent, LLMs tend to hallucinate
      3. Developing strategies to improve hallucination detection and prevention in LLMs
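Tying these applications to the finding above that external knowledge and intermediate reasoning help detection, here is an illustrative recognition prompt builder; the wording is an assumption, not the benchmark's official evaluation prompt.

```python
# Illustrative recognition prompt, reflecting the finding above that
# supplying external knowledge and asking for intermediate reasoning
# improves detection. The wording is an assumption, not the benchmark's
# official evaluation prompt.
def build_recognition_prompt(question, answer, knowledge=None):
    parts = []
    if knowledge:  # optional external knowledge, per the finding above
        parts.append(f"Background knowledge: {knowledge}")
    parts.append(f"Question: {question}")
    parts.append(f"Answer to judge: {answer}")
    parts.append("Reason step by step, then reply 'Yes' if the answer "
                 "contains hallucinated content and 'No' otherwise.")
    return "\n".join(parts)
```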
    • Citations:
[1] https://paperswithcode.com/dataset/halueval
[2] https://github.com/RUCAIBox/HaluEval
[3] https://home.nomic.ai/blog/posts/evaluating-llm-hallucination-benchmarks-with-embeddings
[4] https://ar5iv.labs.arxiv.org/html/2305.11747
[5] https://aclanthology.org/2023.emnlp-main.397.pdf
[6] https://www.semanticscholar.org/paper/HaluEval:-A-Large-Scale-Hallucination-Evaluation-Li-Cheng/e0384ba36555232c587d4a80d527895a095a9001
[7] https://arxiv.org/abs/2305.11747
[8] https://www.statista.com/statistics/1465328/halueval-hallucination-benchmark/