AlpacaEval 2.0 Leaderboard
An AlpacaEval 2.0 Leaderboard is an LLM benchmark leaderboard that evaluates instruction-following language models.
- Context:
- It can (typically) measure performance on Instruction Understanding and Instruction Execution.
- It can (often) be influenced by factors such as output length and model tuning, with a noted bias towards models that produce longer outputs or that are fine-tuned on outputs from the judge model.
- It can focus on a Simple Instruction Set.
- ...
- Example(s):
- ...
- Counter-Example(s):
- A Language Model Evaluation based solely on creative writing tasks.
- MMLU.
- See: Instruction-Following, Benchmark Evaluation, Benchmarking Task, AI System.
References
2024
- (GitHub, 2024) ⇒ https://github.com/tatsu-lab/alpaca_eval
- QUOTE: 🎉 AlpacaEval 2.0 is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 turbo as baseline. More details here. For the old version, set your environment variable IS_ALPACA_EVAL_2=False.
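- The quoted note implies the evaluator version can be toggled via an environment variable. The following is a minimal sketch of invoking the evaluator from Python, assuming the pip-installable alpaca-eval package, its alpaca_eval command-line entry point, and an illustrative outputs.json file of model generations (file name and path are assumptions, not from the quote):
```python
# Minimal sketch: run the AlpacaEval annotator over a file of model outputs.
# Assumes `pip install alpaca-eval` and an OpenAI API key in the environment.
import os
import subprocess

# Per the quoted note, AlpacaEval 2.0 is the default; setting this variable
# to "False" would switch back to the old (1.0) evaluation.
os.environ["IS_ALPACA_EVAL_2"] = "True"

# `--model_outputs` points at a JSON file of generations to be compared
# against the GPT-4 turbo baseline (illustrative path).
subprocess.run(
    ["alpaca_eval", "--model_outputs", "outputs.json"],
    check=True,
)
```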
2024
- (Yuan et al., 2024) ⇒ Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. (2024). “Self-Rewarding Language Models.” doi:10.48550/arXiv.2401.10020
- QUOTE: ... Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.