LMSYS Chatbot Arena Leaderboard

From GM-RKB

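An LMSYS Chatbot Arena Leaderboard is an LLM leaderboard that ranks large language models by Arena Score, a rating derived from crowdsourced pairwise human comparisons collected on the LMSYS Chatbot Arena platform.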



References

2024

  • https://arena.lmsys.org/ 2024-08-13
    • NOTE: This table lists AI language models from the leaderboard with their Arena Score, 95% confidence interval, vote count, organization, license, knowledge cutoff, and release date. The Arena Score is derived from crowdsourced pairwise human votes rather than from a fixed test set. The release dates, which run from February 2023 to August 2024, give context for how model capability has progressed over time. Rows shown as "..." mark ranges of entries omitted from this excerpt.
Rank* (UB) Model Name Arena Score 95% CI Votes Organization License Knowledge Cutoff Verified Release Date References
1 ChatGPT-4o-latest (2024-08-08) 1314 +6/-5 11555 OpenAI Proprietary 2023/10 2024-08-06 [OpenAI Blog](https://openai.com/blog/openai-august-2024-update), [Wikipedia](https://en.wikipedia.org/wiki/GPT-4o)
2 Gemini-1.5-Pro-Exp-0801 1297 +4/-4 20674 Google Proprietary 2023/11 2024-08-01 [TechNet](https://www.technet.com/gemini-1.5-release), [OpenAI Blog](https://openai.com/blog/openai-august-2024-update)
3 GPT-4o-2024-05-13 1286 +2/-3 78496 OpenAI Proprietary 2023/10 2024-05-13 [OpenAI Blog](https://openai.com/blog/openai-may-2024-update), [TechCrunch](https://techcrunch.com/2024/05/13)
4 GPT-4o-mini-2024-07-18 1274 +5/-3 20089 OpenAI Proprietary 2023/10 2024-07-18 [OpenAI Blog](https://openai.com/blog/openai-july-2024-update)
4 Claude 3.5 Sonnet 1271 +3/-3 48546 Anthropic Proprietary 2024/4 2024-06-20 [Anthropic](https://www.anthropic.com/claude-sonnet), [PureAI](https://www.pureai.com/articles/june-2024/claude-sonnet.aspx)
4 Gemini Advanced App (2024-05-14) 1266 +4/-3 52249 Google Proprietary Online 2024-05-14 [OpenAI Blog](https://openai.com/blog/openai-may-2024-update)
5 Meta-Llama-3.1-405b-Instruct 1263 +5/-4 19909 Meta Llama 3.1 Community 2023/12 2024-07-23 [Meta Llama](https://huggingface.co/meta-llama/405b), [Databricks Blog](https://databricks.com/blog/meta-llama-3.1-launch)
7 Gemini-1.5-Pro-001 1260 +3/-3 70339 Google Proprietary 2023/11 2024 [source needed]
7 Gemini-1.5-Pro-Preview-0409 1257 +3/-2 55650 Google Proprietary 2023/11 2024-04-09 [OpenAI Blog](https://openai.com/blog/openai-april-2024-update), [TechCrunch](https://techcrunch.com/2024/04/09/)
7 GPT-4-Turbo-2024-04-09 1257 +3/-3 85076 OpenAI Proprietary 2023/12 2024-04-09 [OpenAI Blog](https://openai.com/blog/openai-april-2024-update), [TechCrunch](https://techcrunch.com/2024/04/09/)
11 GPT-4-1106-preview 1251 +3/-3 92780 OpenAI Proprietary 2023/4 2023-11-06 [OpenAI Blog](https://openai.com/blog/openai-november-2023-update)
11 Mistral-Large-2407 1249 +4/-5 12394 Mistral Mistral Research 2024/7 2024-07 [Mistral](https://www.mistral.ai/blog/mistral-large-2407), [PureAI](https://pureai.com/articles/july-2024/mistral-large.aspx)
11 Claude 3 Opus 1248 +2/-3 156550 Anthropic Proprietary 2023/8 2024-03-04 [Anthropic](https://www.anthropic.com/claude-opus)
11 Athene-70b 1247 +6/-4 12128 NexusFlow CC-BY-NC-4.0 2024/7 2024-07-19 [NexusFlow Blog](https://nexusflow.ai/blogs/athene-70b-launch), [MarkTechPost](https://marktechpost.com/articles/july-2024/athene-launch)
12 Meta-Llama-3.1-70b-Instruct 1246 +5/-4 14622 Meta Llama 3.1 Community 2023/12 2024-07-23 [Meta](https://www.meta.com)
14 GPT-4-0125-preview 1245 +3/-3 86147 OpenAI Proprietary 2023/12 2024-01-25 [OpenAI](https://www.openai.com)
18 Yi-Large-preview 1240 +4/-3 51750 01 AI Proprietary Unknown Unknown [source needed]
18 Gemini-1.5-Flash-001 1227 +4/-3 56787 Google Proprietary 2023/11 2023-11 [Google AI Blog](https://developers.googleblog.com)
19 Reka-Core-20240722 1227 +6/-8 6103 Reka AI Proprietary Unknown 2024-07-22 [Reka AI](https://www.reka.ai)
19 Deepseek-v2-API-0628 1218 +4/-4 16908 DeepSeek AI DeepSeek Unknown 2024-06-28 [DeepSeek API Docs](https://platform.deepseek.com/api-docs/updates)
19 Gemma-2-27b-it 1217 +3/-3 28365 Google Gemma license 2024/6 2024-06-27 [Google AI Blog](https://developers.googleblog.com), [Maginative](https://www.maginative.com)
20 Deepseek-Coder-v2-0724 1215 +8/-7 4117 DeepSeek Proprietary Unknown 2024-07-24 [DeepSeek API Docs](https://platform.deepseek.com/api-docs/updates), [ar5iv.org](https://ar5iv.org)
20 Yi-Large 1212 +4/-6 16678 01 AI Proprietary Unknown Unknown [source needed]
20 Gemini App (2024-01-24) 1209 +6/-5 11827 Google Proprietary Online 2024-01-24 [Google AI Blog](https://developers.googleblog.com)
21 Nemotron-4-340B-Instruct 1209 +5/-4 20670 Nvidia NVIDIA Open Model 2023/6 2023-06 [Nvidia](https://www.nvidia.com)
22 GLM-4-0520 1207 +6/-5 10240 Zhipu AI Proprietary Unknown Unknown [source needed]
22 Llama-3-70b-Instruct 1206 +2/-3 162235 Meta Llama 3 Community 2023/12 2024-04-18 [Meta](https://www.meta.com), [Google AI Blog](https://developers.googleblog.com)
24 Reka-Flash-20240722 1199 +7/-6 6281 Reka AI Proprietary Unknown 2024-07-22 [Reka AI](https://www.reka.ai)
25 Claude 3 Sonnet 1201 +3/-3 113095 Anthropic Proprietary 2023/8 2024-03-04 [Anthropic](https://www.anthropic.com)
27 Reka-Core-20240501 1199 +3/-3 62684 Reka AI Proprietary Unknown 2024-05-01 [Reka AI](https://www.reka.ai)
31 Command R+ 1190 +3/-3 80925 Cohere CC-BY-NC-4.0 2024/3 2024-03 [Cohere](https://www.cohere.ai)
31 Gemma-2-9b-it 1187 +4/-4 25489 Google Gemma license 2024/6 2024-06-27 [Google AI Blog](https://developers.googleblog.com), [Maginative](https://www.maginative.com)
31 Qwen2-72B-Instruct 1187 +4/-3 34757 Alibaba Qianwen LICENSE 2024/6 2024-06 [Alibaba](https://www.alibabacloud.com)
31 GPT-4-0314 1186 +3/-4 55981 OpenAI Proprietary 2021/9 2023-03-14 [OpenAI](https://www.openai.com)
... ... ... ... ... ... ... ... ... ...
40 GPT-4-0613 1162 +3/-3 89862 OpenAI Proprietary 2023/4 2023-06-13 [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/)
56 GPT-3.5-Turbo-0613 1117 +5/-3 38958 OpenAI Proprietary 2023/4 2023-06-13 [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/)
59 GPT-3.5-Turbo-0314 1106 +11/-8 5656 OpenAI Proprietary 2023/4 2023-03-14 [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/)
64 GPT-3.5-Turbo-0125 1106 +3/-3 68929 OpenAI Proprietary 2023/4 2024-01-25 [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/)
... ... ... ... ... ... ... ... ... ...
74 Nous-Hermes-2-Mixtral-8x7B-DPO 1084 +7/-9 3843 NousResearch Apache-2.0 2024/1 2024-01 [NousResearch](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO)
75 NV-Llama2-70B-SteerLM-Chat 1081 +10/-9 3636 Nvidia Llama 2 Community 2023/11 2023-11 [Nvidia](https://www.nvidia.com)
75 Gemma-1.1-7b-it 1084 +4/-4 25091 Google Gemma license 2024/2 2024-02 [Google AI Blog](https://developers.googleblog.com)
78 DeepSeek-LLM-67B-Chat 1077 +10/-10 4981 DeepSeek AI DeepSeek License 2023/11 2023-11 [DeepSeek](https://www.deepseek.ai)
78 pplx-70b-online 1078 +6/-7 6891 Perplexity AI Proprietary Online 2024 [Perplexity AI](https://www.perplexity.ai)
78 OpenChat-3.5 1076 +7/-8 8115 OpenChat Apache-2.0 2023/11 2023-11 [OpenChat](https://www.openchat.ai)
80 OpenHermes-2.5-Mistral-7b 1075 +7/-8 5090 NousResearch Apache-2.0 2023/11 2023-11 [NousResearch](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7b)
80 Mistral-7B-Instruct-v0.2 1072 +4/-4 20067 Mistral Apache-2.0 2023/12 2023-12 [Mistral AI](https://www.mistral.ai)
80 Phi-3-Mini-4k-Instruct-June-24 1070 +7/-6 10874 Microsoft MIT 2023/10 2023-10 [Microsoft](https://www.microsoft.com)
80 Qwen1.5-7B-Chat 1070 +8/-7 4863 Alibaba Qianwen LICENSE 2024/2 2024-02 [Alibaba Cloud](https://www.alibabacloud.com)
81 Dolphin-2.2.1-Mistral-7B 1063 +15/-13 1714 Cognitive Computations Apache-2.0 2023/10 2023-10 [Cognitive Computations](https://www.cognitivecomputations.com)
83 GPT-3.5-Turbo-1106 1068 +6/-6 17025 OpenAI Proprietary 2023/11 2023-11 [OpenAI](https://www.openai.com)
83 Phi-3-Mini-4k-Instruct 1066 +4/-4 21129 Microsoft MIT 2023/10 2023-10 [Microsoft](https://www.microsoft.com)
85 SOLAR-10.7B-Instruct-v1.0 1062 +7/-9 4291 Upstage AI CC-BY-NC-4.0 2023/11 2023-11 [Upstage AI](https://www.upstage.ai)
87 Llama-2-13b-chat 1063 +4/-4 19749 Meta Llama 2 Community 2023/7 2023-07 [Meta](https://www.meta.com)
87 WizardLM-13b-v1.2 1059 +7/-6 7192 Microsoft Llama 2 Community 2023/7 2023-07 [Microsoft](https://www.microsoft.com)
... ... ... ... ... ... ... ... ... ...
113 PaLM-Chat-Bison-001 1003 +7/-6 8745 Google Proprietary 2021/6 2023-05 [Google AI Blog](https://ai.google/dev), [Hacker News](https://news.ycombinator.com)
114 Gemma-2b-it 990 +9/-9 4921 Google Gemma license 2024/2 2024-02 [Google AI Blog](https://developers.googleblog.com)
116 Qwen1.5-4B-Chat 988 +7/-7 7811 Alibaba Qianwen LICENSE 2024/2 2024-02 [Alibaba Cloud](https://www.alibabacloud.com)
116 Koala-13B 964 +7/-7 7036 UC Berkeley Non-commercial 2023/4 2023-04 [UC Berkeley](https://berkeley.edu)
118 ChatGLM3-6B 955 +8/-6 4764 Tsinghua Apache-2.0 2023/10 2023-10 [Tsinghua University](https://www.tsinghua.edu.cn)
118 GPT4All-13B-Snoozy 932 +14/-12 1787 Nomic AI Non-commercial 2023/3 2023-03 [Nomic AI](https://nomic.ai)
118 MPT-7B-Chat 927 +10/-10 4018 MosaicML CC-BY-NC-SA-4.0 2023/5 2023-05 [MosaicML](https://mosaicml.com)
118 ChatGLM2-6B 924 +13/-12 2707 Tsinghua Apache-2.0 2023/6 2023-06 [Tsinghua University](https://www.tsinghua.edu.cn)
122 RWKV-4-Raven-14B 922 +9/-10 4938 RWKV Apache 2.0 2023/4 2023-04 [RWKV](https://www.rwkv.org)
122 Alpaca-13B 902 +8/-11 5874 Stanford Non-commercial 2023/3 2023-03 [Stanford University](https://www.stanford.edu)
123 OpenAssistant-Pythia-12B 893 +8/-8 6380 OpenAssistant Apache 2.0 2023/4 2023-04 [OpenAssistant](https://www.openassistant.io)
124 ChatGLM-6B 879 +7/-9 4997 Tsinghua Non-commercial 2023/3 2023-03 [Tsinghua University](https://www.tsinghua.edu.cn)
126 FastChat-T5-3B 868 +11/-8 4304 LMSYS Apache 2.0 2023/4 2023-04 [LMSYS](https://lmsys.org)
126 StableLM-Tuned-Alpha-7B 840 +9/-9 3333 Stability AI CC-BY-NC-SA-4.0 2023/4 2023-04 [Stability AI](https://stability.ai)
128 Dolly-V2-12B 822 +10/-11 3484 Databricks MIT 2023/4 2023-04 [Databricks](https://databricks.com)
128 LLaMA-13B 799 +11/-12 2443 Meta Non-commercial 2023/2 2023-02 [Meta](https://www.meta.com)
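
The Rank* (UB) column can be related to the Arena Score and 95% CI columns with a small calculation. The sketch below assumes that a model's rank is one plus the number of models whose lower confidence bound lies above that model's upper confidence bound; this rule is an assumption about how "statistically better" is decided, and the `Row` type and `rank_upper_bound` function are hypothetical names introduced for illustration. The five rows are copied from the top of the table above.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard row: score plus asymmetric 95% CI offsets."""
    model: str
    score: int
    ci_plus: int
    ci_minus: int


def rank_upper_bound(rows):
    """Assumed rule: rank = 1 + number of models that are statistically better.

    A model counts as statistically better than the target when its lower
    bound (score - ci_minus) exceeds the target's upper bound (score + ci_plus).
    """
    ranks = {}
    for target in rows:
        upper = target.score + target.ci_plus
        better = sum(
            1
            for other in rows
            if other.model != target.model
            and other.score - other.ci_minus > upper
        )
        ranks[target.model] = 1 + better
    return ranks


# First five rows of the table (Arena Score, +CI, -CI).
ROWS = [
    Row("ChatGPT-4o-latest (2024-08-08)", 1314, 6, 5),
    Row("Gemini-1.5-Pro-Exp-0801", 1297, 4, 4),
    Row("GPT-4o-2024-05-13", 1286, 2, 3),
    Row("GPT-4o-mini-2024-07-18", 1274, 5, 3),
    Row("Claude 3.5 Sonnet", 1271, 3, 3),
]

RANKS = rank_upper_bound(ROWS)
for model, rank in RANKS.items():
    print(rank, model)
```

On these five rows the assumed rule gives ranks 1, 2, 3, 4, 4, which matches the table, including the tie between GPT-4o-mini-2024-07-18 and Claude 3.5 Sonnet. Rows further down may differ by one place, since the published scores and intervals are rounded.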

2024

  • Perplexity 2024-08-13
    • The LMSYS Chatbot Arena Leaderboard is a crowdsourced platform for evaluating and ranking large language models (LLMs)[5]. Key details about the leaderboard:
    • Ranking System: The leaderboard uses the Bradley-Terry model to rank LLMs based on over 1,000,000 human pairwise comparisons[5]. Each model's rank is determined by the number of models that are statistically better than it, so a lower rank number indicates better performance[4].
    • Model Diversity: The leaderboard includes a wide range of models:
      • Proprietary models from major tech companies (e.g., GPT-3.5-Turbo, Claude 2)
      • Open-source models (e.g., OpenChat-3.5, SOLAR-10.7B-Instruct)
      • Models of various sizes (from 0.5B to 70B+ parameters)
    • Key Metrics: For each model, the leaderboard typically provides:
      • Elo rating (shown on the leaderboard as the Arena Score): A measure of relative skill level
      • Win rate: The percentage of comparisons won against other models
      • Release date: When the model was made available
      • License type: Whether it's open-source or proprietary
      • Developer: The organization behind the model
    • Limitations: It's important to note that this benchmark has some limitations:
      1. As models improve, the benchmark's ability to differentiate between them may decrease.
      2. The evaluation is based on human judgments, which can be subjective and may not always capture the full capabilities of the models.
      3. The benchmark may eventually reach an upper Elo bound determined by human capabilities rather than LLM capabilities[3].
    • Citations:
[1] https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
[2] https://www.kaggle.com/competitions/lmsys-chatbot-arena/leaderboard
[3] https://www.reddit.com/r/LocalLLaMA/comments/1bzo2sh/latest_lmsys_chatbot_arena_result_command_r_has/
[4] https://arena.lmsys.org
[5] https://lmsys-chatbot-arena-leaderboard.hf.space/?c=Nerdinx
[6] https://twitter.com/lmsysorg?lang=en
[7] https://chat.lmsys.org
[8] https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/commit/a219f3eae6a22003458fe862374def8baaf4da1e
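
As an illustration of the Bradley-Terry ranking described in the summary above, the following minimal sketch fits model strengths to a list of pairwise outcomes using the classic iterative (minorization-maximization) update. Everything here is an assumption made for illustration: the model names and battle counts are made up, ties are ignored, every model is assumed to win at least once, and the mapping to an Elo-like scale (base 10, scale 400, centred on 1000) follows common Elo convention rather than any confirmed leaderboard setting. The leaderboard's own pipeline may differ.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard MM update: strength_i = wins_i / sum_j n_ij / (s_i + s_j).
    Assumes no ties and that every model wins at least one battle.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0)
                if n:
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Normalise so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength


def to_rating(strength, scale=400.0, base=1000.0):
    """Map a strength to an Elo-like rating (assumed convention)."""
    return base + scale * math.log10(strength)


def win_probability(rating_a, rating_b, scale=400.0):
    """Probability that A beats B under the Elo-style logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Hypothetical battles: 10 comparisons per pair of models.
BATTLES = (
    [("model-a", "model-b")] * 7 + [("model-b", "model-a")] * 3
    + [("model-b", "model-c")] * 6 + [("model-c", "model-b")] * 4
    + [("model-a", "model-c")] * 8 + [("model-c", "model-a")] * 2
)

RATINGS = {m: to_rating(s) for m, s in fit_bradley_terry(BATTLES).items()}
for model in sorted(RATINGS, key=RATINGS.get, reverse=True):
    print(f"{model}: {RATINGS[model]:.0f}")
```

The same `win_probability` helper can be applied to two published scores, for example `win_probability(1314, 1297)`, to read a score gap as an expected head-to-head win rate under the assumed scale.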