AlpacaEval 2.0 Leaderboard
An AlpacaEval 2.0 Leaderboard is an LLM benchmark leaderboard that evaluates instruction-following language models.
- Context:
- It can (typically) measure performance on Instruction Understanding and Instruction Execution.
- It can (often) be influenced by factors such as output length and model tuning, with a noted bias towards models that produce longer outputs or that are fine-tuned on outputs from the judge model.
- It can focus on a Simple Instruction Set.
- ...
- Example(s):
- ...
- Counter-Example(s):
- A Language Model Evaluation based solely on creative writing tasks.
- MMLU.
- See: Instruction-Following, Benchmark Evaluation, Benchmarking Task, AI System.
References
2024
- (GitHub, 2024) ⇒ https://github.com/tatsu-lab/alpaca_eval
- QUOTE: 🎉 AlpacaEval 2.0 is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 turbo as baseline. More details here. For the old version, set your environment variable IS_ALPACA_EVAL_2=False.
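- The quoted note implies the evaluator version can be toggled via an environment variable. The following is a minimal sketch of invoking the evaluator from Python, assuming the pip-installable alpaca-eval package, its alpaca_eval command-line entry point, and an illustrative outputs.json file of model generations (file name and path are assumptions, not from the quote):
```python
# Minimal sketch: run the AlpacaEval annotator over a file of model outputs.
# Assumes `pip install alpaca-eval` and an OpenAI API key in the environment.
import os
import subprocess

# Per the quoted note, AlpacaEval 2.0 is the default; setting this variable
# to "False" would switch back to the old (1.0) evaluation.
os.environ["IS_ALPACA_EVAL_2"] = "True"

# `--model_outputs` points at a JSON file of generations to be compared
# against the GPT-4 turbo baseline (illustrative path).
subprocess.run(
    ["alpaca_eval", "--model_outputs", "outputs.json"],
    check=True,
)
```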
2024
- (Yuan et al., 2024) ⇒ Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. (2024). “Self-Rewarding Language Models.” doi:10.48550/arXiv.2401.10020
- QUOTE: ... Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.