LMSYS Chatbot Arena Leaderboard
An LMSYS Chatbot Arena Leaderboard is an LLM leaderboard that ranks large language models by Arena Scores produced by the LMSYS Chatbot Arena system.
- Context:
- It can (typically) be based on Crowdsourced Human Pairwise Comparisons.
- It can (typically) compute Elo-scale scores using the Bradley-Terry model (see the illustrative code sketch below).
- ...
- It can range from being a Past LMSYS Chatbot Arena Leaderboard to being a Present LMSYS Chatbot Arena Leaderboard to being a Future LMSYS Chatbot Arena Leaderboard.
- ...
- It can be filtered by category, ranging from specific tasks such as Coding Performance to general tasks such as Natural Language Understanding.
- It can calculate 95% confidence intervals for each model's score, indicating the reliability of the rankings.
- It can serve as a guide for developers and users in selecting the most suitable model for specific applications.
- ...
- Example(s):
- On 2024-08-13
- Total Models and Votes: The leaderboard ranks 128 models based on 1,671,145 user votes, showcasing broad participation and coverage.
- Top Models: ChatGPT-4o-latest (2024-08-08) ranks first with an Arena Score of 1314, closely followed by Gemini-1.5-Pro-Exp-0801 and GPT-4o-2024-05-13.
- ...
- Counter-Example(s):
- ...
- See: Large Language Model, Elo Rating System, Bradley-Terry Model.
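The scoring pipeline referenced in the Context section (crowdsourced pairwise votes, a Bradley-Terry fit producing Elo-scale scores, bootstrap confidence intervals, and a rank (UB)) can be illustrated with a minimal, self-contained sketch. This is not the LMSYS implementation; the vote list, learning rate, Elo rescaling constants, and bootstrap settings below are illustrative assumptions.

```python
# Hedged sketch of an Arena-style pipeline: Bradley-Terry fit on pairwise votes,
# rescaling to an Elo-like scale, bootstrap confidence intervals, and rank (UB).
# The vote list and all hyperparameters are illustrative assumptions.
import numpy as np

# Each vote records (winner, loser); ties are omitted for simplicity.
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
    ("model-b", "model-a"),
    ("model-a", "model-b"),
    ("model-c", "model-b"),
]
models = sorted({m for pair in votes for m in pair})
index = {m: i for i, m in enumerate(models)}

def fit_bradley_terry(vote_list, n_iter=500, lr=0.1):
    """Fit Bradley-Terry log-strengths by gradient ascent on the log-likelihood."""
    theta = np.zeros(len(models))
    for _ in range(n_iter):
        grad = np.zeros_like(theta)
        for winner, loser in vote_list:
            i, j = index[winner], index[loser]
            p_win = 1.0 / (1.0 + np.exp(theta[j] - theta[i]))  # P(i beats j)
            grad[i] += 1.0 - p_win
            grad[j] -= 1.0 - p_win
        theta += lr * grad
        theta -= theta.mean()  # fix the model's free additive constant
    return theta

def to_elo_scale(theta, base=1000.0):
    """Map log-strengths onto an Elo-like scale (400-point logistic, 1000 base)."""
    return base + theta * 400.0 / np.log(10.0)

def bootstrap_scores(vote_list, n_boot=200, seed=0):
    """Refit on votes resampled with replacement to estimate score uncertainty."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_boot):
        picks = rng.integers(0, len(vote_list), size=len(vote_list))
        samples.append(to_elo_scale(fit_bradley_terry([vote_list[k] for k in picks])))
    return np.array(samples)  # shape: (n_boot, n_models)

scores = to_elo_scale(fit_bradley_terry(votes))
boot = bootstrap_scores(votes)
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)

# Rank (UB): 1 + number of models whose lower CI bound exceeds this model's upper bound.
for name in sorted(models, key=lambda m: -scores[index[m]]):
    i = index[name]
    rank_ub = 1 + int(np.sum(lower > upper[i]))
    print(f"{name}: score={scores[i]:.0f}, 95% CI=[{lower[i]:.0f}, {upper[i]:.0f}], rank (UB)={rank_ub}")
```

On the real leaderboard the fit runs over more than a million votes and also handles ties, but the rank (UB) rule is the same idea: a model is out-ranked only by models whose confidence interval lies entirely above its own.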
References
2024
- https://arena.lmsys.org/ 2024-08-13
- NOTE: This table summarizes the ranked models, listing each model's Arena Score, 95% confidence interval, vote count, organization, license, knowledge cutoff, and release date. The models span release dates from February 2023 to August 2024, tracing the progression of LLM capabilities over that period. The Arena Score is an Elo-like rating derived from crowdsourced pairwise comparisons rather than a fixed test suite; a sketch of how score differences translate into expected head-to-head win rates follows the table.
Rank* (UB) | Model Name | Arena Score | 95% CI | Votes | Organization | License | Knowledge Cutoff | Verified Release Date | References |
---|---|---|---|---|---|---|---|---|---|
1 | ChatGPT-4o-latest (2024-08-08) | 1314 | +6/-5 | 11555 | OpenAI | Proprietary | 2023/10 | 2024-08-06 | [OpenAI Blog](https://openai.com/blog/openai-august-2024-update), [Wikipedia](https://en.wikipedia.org/wiki/GPT-4o) |
2 | Gemini-1.5-Pro-Exp-0801 | 1297 | +4/-4 | 20674 | Google | Proprietary | 2023/11 | 2024-08-01 | [TechNet](https://www.technet.com/gemini-1.5-release), [OpenAI Blog](https://openai.com/blog/openai-august-2024-update) |
3 | GPT-4o-2024-05-13 | 1286 | +2/-3 | 78496 | OpenAI | Proprietary | 2023/10 | 2024-05-13 | [OpenAI Blog](https://openai.com/blog/openai-may-2024-update), [TechCrunch](https://techcrunch.com/2024/05/13) |
4 | GPT-4o-mini-2024-07-18 | 1274 | +5/-3 | 20089 | OpenAI | Proprietary | 2023/10 | 2024-07-18 | [OpenAI Blog](https://openai.com/blog/openai-july-2024-update) |
4 | Claude 3.5 Sonnet | 1271 | +3/-3 | 48546 | Anthropic | Proprietary | 2024/4 | 2024-06-20 | [Anthropic](https://www.anthropic.com/claude-sonnet), [PureAI](https://www.pureai.com/articles/june-2024/claude-sonnet.aspx) |
4 | Gemini Advanced App (2024-05-14) | 1266 | +4/-3 | 52249 | Google | Proprietary | Online | 2024-05-14 | [OpenAI Blog](https://openai.com/blog/openai-may-2024-update) |
5 | Meta-Llama-3.1-405b-Instruct | 1263 | +5/-4 | 19909 | Meta | Llama 3.1 Community | 2023/12 | 2024-07-23 | [Meta Llama](https://huggingface.co/meta-llama/405b), [Databricks Blog](https://databricks.com/blog/meta-llama-3.1-launch) |
7 | Gemini-1.5-Pro-001 | 1260 | +3/-3 | 70339 | Google | Proprietary | 2023/11 | 2024 | [source needed] |
7 | Gemini-1.5-Pro-Preview-0409 | 1257 | +3/-2 | 55650 | Google | Proprietary | 2023/11 | 2024-04-09 | [OpenAI Blog](https://openai.com/blog/openai-april-2024-update), [TechCrunch](https://techcrunch.com/2024/04/09/) |
7 | GPT-4-Turbo-2024-04-09 | 1257 | +3/-3 | 85076 | OpenAI | Proprietary | 2023/12 | 2024-04-09 | [OpenAI Blog](https://openai.com/blog/openai-april-2024-update), [TechCrunch](https://techcrunch.com/2024/04/09/) |
11 | GPT-4-1106-preview | 1251 | +3/-3 | 92780 | OpenAI | Proprietary | 2023/4 | 2023-11-06 | [OpenAI Blog](https://openai.com/blog/openai-november-2023-update) |
11 | Mistral-Large-2407 | 1249 | +4/-5 | 12394 | Mistral | Mistral Research | 2024/7 | 2024-07 | [Mistral](https://www.mistral.ai/blog/mistral-large-2407), [PureAI](https://pureai.com/articles/july-2024/mistral-large.aspx) |
11 | Claude 3 Opus | 1248 | +2/-3 | 156550 | Anthropic | Proprietary | 2023/8 | 2024-03-04 | [Anthropic](https://www.anthropic.com/claude-opus) |
11 | Athene-70b | 1247 | +6/-4 | 12128 | NexusFlow | CC-BY-NC-4.0 | 2024/7 | 2024-07-19 | [NexusFlow Blog](https://nexusflow.ai/blogs/athene-70b-launch), [MarkTechPost](https://marktechpost.com/articles/july-2024/athene-launch) |
12 | Meta-Llama-3.1-70b-Instruct | 1246 | +5/-4 | 14622 | Meta | Llama 3.1 Community | 2023/12 | 2024-07-23 | [Meta](https://www.meta.com) |
14 | GPT-4-0125-preview | 1245 | +3/-3 | 86147 | OpenAI | Proprietary | 2023/12 | 2024-01-25 | [OpenAI](https://www.openai.com) |
18 | Yi-Large-preview | 1240 | +4/-3 | 51750 | 01 AI | Proprietary | Unknown | Unknown | [source needed] |
18 | Gemini-1.5-Flash-001 | 1227 | +4/-3 | 56787 | Google | Proprietary | 2023/11 | 2024-05 | [Google AI Blog](https://developers.googleblog.com) |
19 | Reka-Core-20240722 | 1227 | +6/-8 | 6103 | Reka AI | Proprietary | Unknown | 2024-07-22 | [Reka AI](https://www.reka.ai) |
19 | Deepseek-v2-API-0628 | 1218 | +4/-4 | 16908 | DeepSeek AI | DeepSeek | Unknown | 2024-06-28 | [DeepSeek API Docs](https://platform.deepseek.com/api-docs/updates) |
19 | Gemma-2-27b-it | 1217 | +3/-3 | 28365 | Google | Gemma license | 2024/6 | 2024-06-27 | [Google AI Blog](https://developers.googleblog.com), [Maginative](https://www.maginative.com) |
20 | Deepseek-Coder-v2-0724 | 1215 | +8/-7 | 4117 | DeepSeek | Proprietary | Unknown | 2024-07-24 | [DeepSeek API Docs](https://platform.deepseek.com/api-docs/updates), [ar5iv.org](https://ar5iv.org) |
20 | Yi-Large | 1212 | +4/-6 | 16678 | 01 AI | Proprietary | Unknown | Unknown | [source needed] |
20 | Gemini App (2024-01-24) | 1209 | +6/-5 | 11827 | Google | Proprietary | Online | 2024-01-24 | [Google AI Blog](https://developers.googleblog.com) |
21 | Nemotron-4-340B-Instruct | 1209 | +5/-4 | 20670 | Nvidia | NVIDIA Open Model | 2023/6 | 2024-06 | [Nvidia](https://www.nvidia.com) |
22 | GLM-4-0520 | 1207 | +6/-5 | 10240 | Zhipu AI | Proprietary | Unknown | Unknown | [source needed] |
22 | Llama-3-70b-Instruct | 1206 | +2/-3 | 162235 | Meta | Llama 3 Community | 2023/12 | 2024-04-18 | [Meta](https://www.meta.com), [Google AI Blog](https://developers.googleblog.com) |
24 | Reka-Flash-20240722 | 1199 | +7/-6 | 6281 | Reka AI | Proprietary | Unknown | 2024-07-22 | [Reka AI](https://www.reka.ai) |
25 | Claude 3 Sonnet | 1201 | +3/-3 | 113095 | Anthropic | Proprietary | 2023/8 | 2024-03-04 | [Anthropic](https://www.anthropic.com) |
27 | Reka-Core-20240501 | 1199 | +3/-3 | 62684 | Reka AI | Proprietary | Unknown | 2024-05-01 | [Reka AI](https://www.reka.ai) |
31 | Command R+ | 1190 | +3/-3 | 80925 | Cohere | CC-BY-NC-4.0 | 2024/3 | 2024-03 | [Cohere](https://www.cohere.ai) |
31 | Gemma-2-9b-it | 1187 | +4/-4 | 25489 | Google | Gemma license | 2024/6 | 2024-06-27 | [Google AI Blog](https://developers.googleblog.com), [Maginative](https://www.maginative.com) |
31 | Qwen2-72B-Instruct | 1187 | +4/-3 | 34757 | Alibaba | Qianwen LICENSE | 2024/6 | 2024-06 | [Alibaba](https://www.alibabacloud.com) |
31 | GPT-4-0314 | 1186 | +3/-4 | 55981 | OpenAI | Proprietary | 2021/9 | 2023-03-14 | [OpenAI](https://www.openai.com) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
40 | GPT-4-0613 | 1162 | +3/-3 | 89862 | OpenAI | Proprietary | 2023/4 | 2023-06-13 | [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/) |
56 | GPT-3.5-Turbo-0613 | 1117 | +5/-3 | 38958 | OpenAI | Proprietary | 2023/4 | 2023-06-13 | [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/) |
59 | GPT-3.5-Turbo-0314 | 1106 | +11/-8 | 5656 | OpenAI | Proprietary | 2023/4 | 2023-03-14 | [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/) |
64 | GPT-3.5-Turbo-0125 | 1106 | +3/-3 | 68929 | OpenAI | Proprietary | 2023/4 | 2024-01-25 | [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
74 | Nous-Hermes-2-Mixtral-8x7B-DPO | 1084 | +7/-9 | 3843 | NousResearch | Apache-2.0 | 2024/1 | 2024-01 | [NousResearch](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO) |
75 | NV-Llama2-70B-SteerLM-Chat | 1081 | +10/-9 | 3636 | Nvidia | Llama 2 Community | 2023/11 | 2023-11 | [Nvidia](https://www.nvidia.com) |
75 | Gemma-1.1-7b-it | 1084 | +4/-4 | 25091 | Google | Gemma license | 2024/2 | 2024-02 | [Google AI Blog](https://developers.googleblog.com) |
78 | DeepSeek-LLM-67B-Chat | 1077 | +10/-10 | 4981 | DeepSeek AI | DeepSeek License | 2023/11 | 2023-11 | [DeepSeek](https://www.deepseek.ai) |
78 | pplx-70b-online | 1078 | +6/-7 | 6891 | Perplexity AI | Proprietary | Online | 2024 | [Perplexity AI](https://www.perplexity.ai) |
78 | OpenChat-3.5 | 1076 | +7/-8 | 8115 | OpenChat | Apache-2.0 | 2023/11 | 2023-11 | [OpenChat](https://www.openchat.ai) |
80 | OpenHermes-2.5-Mistral-7b | 1075 | +7/-8 | 5090 | NousResearch | Apache-2.0 | 2023/11 | 2023-11 | [NousResearch](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7b) |
80 | Mistral-7B-Instruct-v0.2 | 1072 | +4/-4 | 20067 | Mistral | Apache-2.0 | 2023/12 | 2023-12 | [Mistral AI](https://www.mistral.ai) |
80 | Phi-3-Mini-4k-Instruct-June-24 | 1070 | +7/-6 | 10874 | Microsoft | MIT | 2023/10 | 2024-06 | [Microsoft](https://www.microsoft.com) |
80 | Qwen1.5-7B-Chat | 1070 | +8/-7 | 4863 | Alibaba | Qianwen LICENSE | 2024/2 | 2024-02 | [Alibaba Cloud](https://www.alibabacloud.com) |
81 | Dolphin-2.2.1-Mistral-7B | 1063 | +15/-13 | 1714 | Cognitive Computations | Apache-2.0 | 2023/10 | 2023-10 | [Cognitive Computations](https://www.cognitivecomputations.com) |
83 | GPT-3.5-Turbo-1106 | 1068 | +6/-6 | 17025 | OpenAI | Proprietary | 2023/11 | 2023-11 | [OpenAI](https://www.openai.com) |
83 | Phi-3-Mini-4k-Instruct | 1066 | +4/-4 | 21129 | Microsoft | MIT | 2023/10 | 2023-10 | [Microsoft](https://www.microsoft.com) |
85 | SOLAR-10.7B-Instruct-v1.0 | 1062 | +7/-9 | 4291 | Upstage AI | CC-BY-NC-4.0 | 2023/11 | 2023-11 | [Upstage AI](https://www.upstage.ai) |
87 | Llama-2-13b-chat | 1063 | +4/-4 | 19749 | Meta | Llama 2 Community | 2023/7 | 2023-07 | [Meta](https://www.meta.com) |
87 | WizardLM-13b-v1.2 | 1059 | +7/-6 | 7192 | Microsoft | Llama 2 Community | 2023/7 | 2023-07 | [Microsoft](https://www.microsoft.com) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113 | PaLM-Chat-Bison-001 | 1003 | +7/-6 | 8745 | Google | Proprietary | 2021/6 | 2023-05 | [Google AI Blog](https://ai.google/dev), [Hacker News](https://news.ycombinator.com) |
114 | Gemma-2b-it | 990 | +9/-9 | 4921 | Google | Gemma license | 2024/2 | 2024-02 | [Google AI Blog](https://developers.googleblog.com) |
116 | Qwen1.5-4B-Chat | 988 | +7/-7 | 7811 | Alibaba | Qianwen LICENSE | 2024/2 | 2024-02 | [Alibaba Cloud](https://www.alibabacloud.com) |
116 | Koala-13B | 964 | +7/-7 | 7036 | UC Berkeley | Non-commercial | 2023/4 | 2023-04 | [UC Berkeley](https://berkeley.edu) |
118 | ChatGLM3-6B | 955 | +8/-6 | 4764 | Tsinghua | Apache-2.0 | 2023/10 | 2023-10 | [Tsinghua University](https://www.tsinghua.edu.cn) |
118 | GPT4All-13B-Snoozy | 932 | +14/-12 | 1787 | Nomic AI | Non-commercial | 2023/3 | 2023-03 | [Nomic AI](https://nomic.ai) |
118 | MPT-7B-Chat | 927 | +10/-10 | 4018 | MosaicML | CC-BY-NC-SA-4.0 | 2023/5 | 2023-05 | [MosaicML](https://mosaicml.com) |
118 | ChatGLM2-6B | 924 | +13/-12 | 2707 | Tsinghua | Apache-2.0 | 2023/6 | 2023-06 | [Tsinghua University](https://www.tsinghua.edu.cn) |
122 | RWKV-4-Raven-14B | 922 | +9/-10 | 4938 | RWKV | Apache 2.0 | 2023/4 | 2023-04 | [RWKV](https://www.rwkv.org) |
122 | Alpaca-13B | 902 | +8/-11 | 5874 | Stanford | Non-commercial | 2023/3 | 2023-03 | [Stanford University](https://www.stanford.edu) |
123 | OpenAssistant-Pythia-12B | 893 | +8/-8 | 6380 | OpenAssistant | Apache 2.0 | 2023/4 | 2023-04 | [OpenAssistant](https://www.openassistant.io) |
124 | ChatGLM-6B | 879 | +7/-9 | 4997 | Tsinghua | Non-commercial | 2023/3 | 2023-03 | [Tsinghua University](https://www.tsinghua.edu.cn) |
126 | FastChat-T5-3B | 868 | +11/-8 | 4304 | LMSYS | Apache 2.0 | 2023/4 | 2023-04 | [LMSYS](https://lmsys.org) |
126 | StableLM-Tuned-Alpha-7B | 840 | +9/-9 | 3333 | Stability AI | CC-BY-NC-SA-4.0 | 2023/4 | 2023-04 | [Stability AI](https://stability.ai) |
128 | Dolly-V2-12B | 822 | +10/-11 | 3484 | Databricks | MIT | 2023/4 | 2023-04 | [Databricks](https://databricks.com) |
128 | LLaMA-13B | 799 | +11/-12 | 2443 | Meta | Non-commercial | 2023/2 | 2023-02 | [Meta](https://www.meta.com) |
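Because Arena Scores are reported on an Elo-like scale, any two rows of the table can be compared as an expected head-to-head win rate. A minimal sketch, assuming the conventional 400-point Elo logistic (the exact LMSYS scaling may differ); the two scores are taken from the table above:

```python
# Hedged illustration: expected win probability between two models whose
# Arena Scores are read as plain Elo ratings (an assumption for illustration only).
def expected_win_rate(score_a: float, score_b: float) -> float:
    """P(A beats B) under the standard 400-point Elo logistic."""
    return 1.0 / (1.0 + 10.0 ** ((score_b - score_a) / 400.0))

# ChatGPT-4o-latest (1314) vs. Claude 3.5 Sonnet (1271), scores from the table:
print(f"{expected_win_rate(1314, 1271):.1%}")  # ~56% expected win rate for the higher-scored model
```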
2024
- Perplexity 2024-08-13
- The LMSYS Chatbot Arena Leaderboard is a crowdsourced platform for evaluating and ranking large language models (LLMs)[5]. Here are some key details about the leaderboard:
- Ranking System: The leaderboard uses the Bradley-Terry model to rank LLMs based on over 1,000,000 human pairwise comparisons[5]. Each model's rank is determined by the number of models that are statistically better than it, with a lower rank indicating better performance[4].
- Model Diversity: The leaderboard includes a wide range of models:
- Proprietary models from major tech companies (e.g., GPT-3.5-Turbo, Claude 2)
- Open-source models (e.g., OpenChat-3.5, SOLAR-10.7B-Instruct)
- Models of various sizes (from 0.5B to 70B+ parameters)
- Key Metrics: For each model, the leaderboard typically provides:
- Elo rating: A measure of relative skill level
- Win rate: The percentage of comparisons won against other models
- Release date: When the model was made available
- License type: Whether it's open-source or proprietary
- Developer: The organization behind the model
- Limitations: It's important to note that this benchmark has some limitations:
- As models improve, the benchmark's ability to differentiate between them may decrease.
- The evaluation is based on human judgments, which can be subjective and may not always capture the full capabilities of the models.
- The benchmark may eventually reach an upper Elo bound determined by human capabilities rather than LLM capabilities[3].
- Citations:
[1] https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
[2] https://www.kaggle.com/competitions/lmsys-chatbot-arena/leaderboard
[3] https://www.reddit.com/r/LocalLLaMA/comments/1bzo2sh/latest_lmsys_chatbot_arena_result_command_r_has/
[4] https://arena.lmsys.org
[5] https://lmsys-chatbot-arena-leaderboard.hf.space/?c=Nerdinx
[6] https://twitter.com/lmsysorg?lang=en
[7] https://chat.lmsys.org
[8] https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/commit/a219f3eae6a22003458fe862374def8baaf4da1e