LMSYS Chatbot Arena Leaderboard
An LMSYS Chatbot Arena Leaderboard is an LLM leaderboard that ranks large language models by Arena Scores produced by the LMSYS Chatbot Arena system.
- Context:
- It can (typically) be based on Crowdsourced Human Pairwise Comparisons.
- It can (typically) compute Elo-scale scores using the Bradley-Terry model (see the illustrative code sketch below).
- ...
- It can range from being a Past LMSYS Chatbot Arena Leaderboard to being a Present LMSYS Chatbot Arena Leaderboard to being a Future LMSYS Chatbot Arena Leaderboard.
- ...
- It can be filtered by category, ranging from specific tasks such as Coding Performance to general tasks such as Natural Language Understanding.
- It can calculate 95% confidence intervals for each model's score, indicating the reliability of the rankings.
- It can serve as a guide for developers and users in selecting the most suitable model for specific applications.
- ...
- Example(s):
- On 2024-08-13
- Total Models and Votes: The leaderboard ranks 128 models based on 1,671,145 user votes, showcasing broad participation and coverage.
- Top Models: ChatGPT-4o-latest (2024-08-08) ranks first with an Arena Score of 1314, closely followed by Gemini-1.5-Pro-Exp-0801 and GPT-4o-2024-05-13.
- ...
- Counter-Example(s):
- ...
- See: Large Language Model, Elo Rating System, Bradley-Terry Model.
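The scoring pipeline referenced in the Context section (crowdsourced pairwise votes, a Bradley-Terry fit producing Elo-scale scores, bootstrap confidence intervals, and a rank (UB)) can be illustrated with a minimal, self-contained sketch. This is not the LMSYS implementation; the vote list, learning rate, Elo rescaling constants, and bootstrap settings below are illustrative assumptions.

```python
# Hedged sketch of an Arena-style pipeline: Bradley-Terry fit on pairwise votes,
# rescaling to an Elo-like scale, bootstrap confidence intervals, and rank (UB).
# The vote list and all hyperparameters are illustrative assumptions.
import numpy as np

# Each vote records (winner, loser); ties are omitted for simplicity.
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
    ("model-b", "model-a"),
    ("model-a", "model-b"),
    ("model-c", "model-b"),
]
models = sorted({m for pair in votes for m in pair})
index = {m: i for i, m in enumerate(models)}

def fit_bradley_terry(vote_list, n_iter=500, lr=0.1):
    """Fit Bradley-Terry log-strengths by gradient ascent on the log-likelihood."""
    theta = np.zeros(len(models))
    for _ in range(n_iter):
        grad = np.zeros_like(theta)
        for winner, loser in vote_list:
            i, j = index[winner], index[loser]
            p_win = 1.0 / (1.0 + np.exp(theta[j] - theta[i]))  # P(i beats j)
            grad[i] += 1.0 - p_win
            grad[j] -= 1.0 - p_win
        theta += lr * grad
        theta -= theta.mean()  # fix the model's free additive constant
    return theta

def to_elo_scale(theta, base=1000.0):
    """Map log-strengths onto an Elo-like scale (400-point logistic, 1000 base)."""
    return base + theta * 400.0 / np.log(10.0)

def bootstrap_scores(vote_list, n_boot=200, seed=0):
    """Refit on votes resampled with replacement to estimate score uncertainty."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_boot):
        picks = rng.integers(0, len(vote_list), size=len(vote_list))
        samples.append(to_elo_scale(fit_bradley_terry([vote_list[k] for k in picks])))
    return np.array(samples)  # shape: (n_boot, n_models)

scores = to_elo_scale(fit_bradley_terry(votes))
boot = bootstrap_scores(votes)
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)

# Rank (UB): 1 + number of models whose lower CI bound exceeds this model's upper bound.
for name in sorted(models, key=lambda m: -scores[index[m]]):
    i = index[name]
    rank_ub = 1 + int(np.sum(lower > upper[i]))
    print(f"{name}: score={scores[i]:.0f}, 95% CI=[{lower[i]:.0f}, {upper[i]:.0f}], rank (UB)={rank_ub}")
```

On the real leaderboard the fit runs over more than a million votes and also handles ties, but the rank (UB) rule is the same idea: a model is out-ranked only by models whose confidence interval lies entirely above its own.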
References
2024
- https://arena.lmsys.org/ 2024-08-13
- NOTE: This table summarizes the ranked models, listing each model's Arena Score, 95% confidence interval, vote count, organization, license, knowledge cutoff, and release date. The models span release dates from February 2023 to August 2024, tracing the progression of LLM capabilities over that period. The Arena Score is an Elo-like rating derived from crowdsourced pairwise comparisons rather than a fixed test suite; a sketch of how score differences translate into expected head-to-head win rates follows the table.
Rank* (UB) | Model Name | Arena Score | 95% CI | Votes | Organization | License | Knowledge Cutoff | Verified Release Date | References |
---|---|---|---|---|---|---|---|---|---|
1 | ChatGPT-4o-latest (2024-08-08) | 1314 | +6/-5 | 11555 | OpenAI | Proprietary | 2023/10 | 2024-08-06 | [OpenAI Blog](https://openai.com/blog/openai-august-2024-update), [Wikipedia](https://en.wikipedia.org/wiki/GPT-4o) |
2 | Gemini-1.5-Pro-Exp-0801 | 1297 | +4/-4 | 20674 | Google | Proprietary | 2023/11 | 2024-08-01 | [TechNet](https://www.technet.com/gemini-1.5-release), [OpenAI Blog](https://openai.com/blog/openai-august-2024-update) |
3 | GPT-4o-2024-05-13 | 1286 | +2/-3 | 78496 | OpenAI | Proprietary | 2023/10 | 2024-05-13 | [OpenAI Blog](https://openai.com/blog/openai-may-2024-update), [TechCrunch](https://techcrunch.com/2024/05/13) |
4 | GPT-4o-mini-2024-07-18 | 1274 | +5/-3 | 20089 | OpenAI | Proprietary | 2023/10 | 2024-07-18 | [OpenAI Blog](https://openai.com/blog/openai-july-2024-update) |
4 | Claude 3.5 Sonnet | 1271 | +3/-3 | 48546 | Anthropic | Proprietary | 2024/4 | 2024-06-20 | [Anthropic](https://www.anthropic.com/claude-sonnet), [PureAI](https://www.pureai.com/articles/june-2024/claude-sonnet.aspx) |
4 | Gemini Advanced App (2024-05-14) | 1266 | +4/-3 | 52249 | Google | Proprietary | Online | 2024-05-14 | [OpenAI Blog](https://openai.com/blog/openai-may-2024-update) |
5 | Meta-Llama-3.1-405b-Instruct | 1263 | +5/-4 | 19909 | Meta | Llama 3.1 Community | 2023/12 | 2024-07-23 | [Meta Llama](https://huggingface.co/meta-llama/405b), [Databricks Blog](https://databricks.com/blog/meta-llama-3.1-launch) |
7 | Gemini-1.5-Pro-001 | 1260 | +3/-3 | 70339 | Google | Proprietary | 2023/11 | 2024 | [source needed] |
7 | Gemini-1.5-Pro-Preview-0409 | 1257 | +3/-2 | 55650 | Google | Proprietary | 2023/11 | 2024-04-09 | [OpenAI Blog](https://openai.com/blog/openai-april-2024-update), [TechCrunch](https://techcrunch.com/2024/04/09/) |
7 | GPT-4-Turbo-2024-04-09 | 1257 | +3/-3 | 85076 | OpenAI | Proprietary | 2023/12 | 2024-04-09 | [OpenAI Blog](https://openai.com/blog/openai-april-2024-update), [TechCrunch](https://techcrunch.com/2024/04/09/) |
11 | GPT-4-1106-preview | 1251 | +3/-3 | 92780 | OpenAI | Proprietary | 2023/4 | 2023-11-06 | [OpenAI Blog](https://openai.com/blog/openai-november-2023-update) |
11 | Mistral-Large-2407 | 1249 | +4/-5 | 12394 | Mistral | Mistral Research | 2024/7 | 2024-07 | [Mistral](https://www.mistral.ai/blog/mistral-large-2407), [PureAI](https://pureai.com/articles/july-2024/mistral-large.aspx) |
11 | Claude 3 Opus | 1248 | +2/-3 | 156550 | Anthropic | Proprietary | 2023/8 | 2024-03-04 | [Anthropic](https://www.anthropic.com/claude-opus) |
11 | Athene-70b | 1247 | +6/-4 | 12128 | NexusFlow | CC-BY-NC-4.0 | 2024/7 | 2024-07-19 | [NexusFlow Blog](https://nexusflow.ai/blogs/athene-70b-launch), [MarkTechPost](https://marktechpost.com/articles/july-2024/athene-launch) |
12 | Meta-Llama-3.1-70b-Instruct | 1246 | +5/-4 | 14622 | Meta | Llama 3.1 Community | 2023/12 | 2024-07-23 | [Meta](https://www.meta.com) |
14 | GPT-4-0125-preview | 1245 | +3/-3 | 86147 | OpenAI | Proprietary | 2023/12 | 2024-01-25 | [OpenAI](https://www.openai.com) |
18 | Yi-Large-preview | 1240 | +4/-3 | 51750 | 01 AI | Proprietary | Unknown | Unknown | [source needed] |
18 | Gemini-1.5-Flash-001 | 1227 | +4/-3 | 56787 | Google | Proprietary | 2023/11 | 2024-05 | [Google AI Blog](https://developers.googleblog.com) |
19 | Reka-Core-20240722 | 1227 | +6/-8 | 6103 | Reka AI | Proprietary | Unknown | 2024-07-22 | [Reka AI](https://www.reka.ai) |
19 | Deepseek-v2-API-0628 | 1218 | +4/-4 | 16908 | DeepSeek AI | DeepSeek | Unknown | 2024-06-28 | [DeepSeek API Docs](https://platform.deepseek.com/api-docs/updates) |
19 | Gemma-2-27b-it | 1217 | +3/-3 | 28365 | Google | Gemma license | 2024/6 | 2024-06-27 | [Google AI Blog](https://developers.googleblog.com), [Maginative](https://www.maginative.com) |
20 | Deepseek-Coder-v2-0724 | 1215 | +8/-7 | 4117 | DeepSeek | Proprietary | Unknown | 2024-07-24 | [DeepSeek API Docs](https://platform.deepseek.com/api-docs/updates), [ar5iv.org](https://ar5iv.org) |
20 | Yi-Large | 1212 | +4/-6 | 16678 | 01 AI | Proprietary | Unknown | Unknown | [source needed] |
20 | Gemini App (2024-01-24) | 1209 | +6/-5 | 11827 | Google | Proprietary | Online | 2024-01-24 | [Google AI Blog](https://developers.googleblog.com) |
21 | Nemotron-4-340B-Instruct | 1209 | +5/-4 | 20670 | Nvidia | NVIDIA Open Model | 2023/6 | 2024-06 | [Nvidia](https://www.nvidia.com) |
22 | GLM-4-0520 | 1207 | +6/-5 | 10240 | Zhipu AI | Proprietary | Unknown | Unknown | [source needed] |
22 | Llama-3-70b-Instruct | 1206 | +2/-3 | 162235 | Meta | Llama 3 Community | 2023/12 | 2024-04-18 | [Meta](https://www.meta.com), [Google AI Blog](https://developers.googleblog.com) |
24 | Reka-Flash-20240722 | 1199 | +7/-6 | 6281 | Reka AI | Proprietary | Unknown | 2024-07-22 | [Reka AI](https://www.reka.ai) |
25 | Claude 3 Sonnet | 1201 | +3/-3 | 113095 | Anthropic | Proprietary | 2023/8 | 2024-03-04 | [Anthropic](https://www.anthropic.com) |
27 | Reka-Core-20240501 | 1199 | +3/-3 | 62684 | Reka AI | Proprietary | Unknown | 2024-05-01 | [Reka AI](https://www.reka.ai) |
31 | Command R+ | 1190 | +3/-3 | 80925 | Cohere | CC-BY-NC-4.0 | 2024/3 | 2024-03 | [Cohere](https://www.cohere.ai) |
31 | Gemma-2-9b-it | 1187 | +4/-4 | 25489 | Google | Gemma license | 2024/6 | 2024-06-27 | [Google AI Blog](https://developers.googleblog.com), [Maginative](https://www.maginative.com) |
31 | Qwen2-72B-Instruct | 1187 | +4/-3 | 34757 | Alibaba | Qianwen LICENSE | 2024/6 | 2024-06 | [Alibaba](https://www.alibabacloud.com) |
31 | GPT-4-0314 | 1186 | +3/-4 | 55981 | OpenAI | Proprietary | 2021/9 | 2023-03-14 | [OpenAI](https://www.openai.com) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
40 | GPT-4-0613 | 1162 | +3/-3 | 89862 | OpenAI | Proprietary | 2023/4 | 2023-06-13 | [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/) |
56 | GPT-3.5-Turbo-0613 | 1117 | +5/-3 | 38958 | OpenAI | Proprietary | 2023/4 | 2023-06-13 | [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/) |
59 | GPT-3.5-Turbo-0314 | 1106 | +11/-8 | 5656 | OpenAI | Proprietary | 2023/4 | 2023-03-14 | [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/) |
64 | GPT-3.5-Turbo-0125 | 1106 | +3/-3 | 68929 | OpenAI | Proprietary | 2023/4 | 2024-01-25 | [OpenAI Platform](https://platform.openai.com/docs/gpts/release-notes/release-notes), [Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
74 | Nous-Hermes-2-Mixtral-8x7B-DPO | 1084 | +7/-9 | 3843 | NousResearch | Apache-2.0 | 2024/1 | 2024-01 | [NousResearch](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO) |
75 | NV-Llama2-70B-SteerLM-Chat | 1081 | +10/-9 | 3636 | Nvidia | Llama 2 Community | 2023/11 | 2023-11 | [Nvidia](https://www.nvidia.com) |
75 | Gemma-1.1-7b-it | 1084 | +4/-4 | 25091 | Google | Gemma license | 2024/2 | 2024-02 | [Google AI Blog](https://developers.googleblog.com) |
78 | DeepSeek-LLM-67B-Chat | 1077 | +10/-10 | 4981 | DeepSeek AI | DeepSeek License | 2023/11 | 2023-11 | [DeepSeek](https://www.deepseek.ai) |
78 | pplx-70b-online | 1078 | +6/-7 | 6891 | Perplexity AI | Proprietary | Online | 2024 | [Perplexity AI](https://www.perplexity.ai) |
78 | OpenChat-3.5 | 1076 | +7/-8 | 8115 | OpenChat | Apache-2.0 | 2023/11 | 2023-11 | [OpenChat](https://www.openchat.ai) |
80 | OpenHermes-2.5-Mistral-7b | 1075 | +7/-8 | 5090 | NousResearch | Apache-2.0 | 2023/11 | 2023-11 | [NousResearch](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7b) |
80 | Mistral-7B-Instruct-v0.2 | 1072 | +4/-4 | 20067 | Mistral | Apache-2.0 | 2023/12 | 2023-12 | [Mistral AI](https://www.mistral.ai) |
80 | Phi-3-Mini-4k-Instruct-June-24 | 1070 | +7/-6 | 10874 | Microsoft | MIT | 2023/10 | 2024-06 | [Microsoft](https://www.microsoft.com) |
80 | Qwen1.5-7B-Chat | 1070 | +8/-7 | 4863 | Alibaba | Qianwen LICENSE | 2024/2 | 2024-02 | [Alibaba Cloud](https://www.alibabacloud.com) |
81 | Dolphin-2.2.1-Mistral-7B | 1063 | +15/-13 | 1714 | Cognitive Computations | Apache-2.0 | 2023/10 | 2023-10 | [Cognitive Computations](https://www.cognitivecomputations.com) |
83 | GPT-3.5-Turbo-1106 | 1068 | +6/-6 | 17025 | OpenAI | Proprietary | 2023/11 | 2023-11 | [OpenAI](https://www.openai.com) |
83 | Phi-3-Mini-4k-Instruct | 1066 | +4/-4 | 21129 | Microsoft | MIT | 2023/10 | 2023-10 | [Microsoft](https://www.microsoft.com) |
85 | SOLAR-10.7B-Instruct-v1.0 | 1062 | +7/-9 | 4291 | Upstage AI | CC-BY-NC-4.0 | 2023/11 | 2023-11 | [Upstage AI](https://www.upstage.ai) |
87 | Llama-2-13b-chat | 1063 | +4/-4 | 19749 | Meta | Llama 2 Community | 2023/7 | 2023-07 | [Meta](https://www.meta.com) |
87 | WizardLM-13b-v1.2 | 1059 | +7/-6 | 7192 | Microsoft | Llama 2 Community | 2023/7 | 2023-07 | [Microsoft](https://www.microsoft.com) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113 | PaLM-Chat-Bison-001 | 1003 | +7/-6 | 8745 | Google | Proprietary | 2021/6 | 2023-05 | [Google AI Blog](https://ai.google/dev), [Hacker News](https://news.ycombinator.com) |
114 | Gemma-2b-it | 990 | +9/-9 | 4921 | Google | Gemma license | 2024/2 | 2024-02 | [Google AI Blog](https://developers.googleblog.com) |
116 | Qwen1.5-4B-Chat | 988 | +7/-7 | 7811 | Alibaba | Qianwen LICENSE | 2024/2 | 2024-02 | [Alibaba Cloud](https://www.alibabacloud.com) |
116 | Koala-13B | 964 | +7/-7 | 7036 | UC Berkeley | Non-commercial | 2023/4 | 2023-04 | [UC Berkeley](https://berkeley.edu) |
118 | ChatGLM3-6B | 955 | +8/-6 | 4764 | Tsinghua | Apache-2.0 | 2023/10 | 2023-10 | [Tsinghua University](https://www.tsinghua.edu.cn) |
118 | GPT4All-13B-Snoozy | 932 | +14/-12 | 1787 | Nomic AI | Non-commercial | 2023/3 | 2023-03 | [Nomic AI](https://nomic.ai) |
118 | MPT-7B-Chat | 927 | +10/-10 | 4018 | MosaicML | CC-BY-NC-SA-4.0 | 2023/5 | 2023-05 | [MosaicML](https://mosaicml.com) |
118 | ChatGLM2-6B | 924 | +13/-12 | 2707 | Tsinghua | Apache-2.0 | 2023/6 | 2023-06 | [Tsinghua University](https://www.tsinghua.edu.cn) |
122 | RWKV-4-Raven-14B | 922 | +9/-10 | 4938 | RWKV | Apache 2.0 | 2023/4 | 2023-04 | [RWKV](https://www.rwkv.org) |
122 | Alpaca-13B | 902 | +8/-11 | 5874 | Stanford | Non-commercial | 2023/3 | 2023-03 | [Stanford University](https://www.stanford.edu) |
123 | OpenAssistant-Pythia-12B | 893 | +8/-8 | 6380 | OpenAssistant | Apache 2.0 | 2023/4 | 2023-04 | [OpenAssistant](https://www.openassistant.io) |
124 | ChatGLM-6B | 879 | +7/-9 | 4997 | Tsinghua | Non-commercial | 2023/3 | 2023-03 | [Tsinghua University](https://www.tsinghua.edu.cn) |
126 | FastChat-T5-3B | 868 | +11/-8 | 4304 | LMSYS | Apache 2.0 | 2023/4 | 2023-04 | [LMSYS](https://lmsys.org) |
126 | StableLM-Tuned-Alpha-7B | 840 | +9/-9 | 3333 | Stability AI | CC-BY-NC-SA-4.0 | 2023/4 | 2023-04 | [Stability AI](https://stability.ai) |
128 | Dolly-V2-12B | 822 | +10/-11 | 3484 | Databricks | MIT | 2023/4 | 2023-04 | [Databricks](https://databricks.com) |
128 | LLaMA-13B | 799 | +11/-12 | 2443 | Meta | Non-commercial | 2023/2 | 2023-02 | [Meta](https://www.meta.com) |
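Because Arena Scores are reported on an Elo-like scale, any two rows of the table can be compared as an expected head-to-head win rate. A minimal sketch, assuming the conventional 400-point Elo logistic (the exact LMSYS scaling may differ); the two scores are taken from the table above:

```python
# Hedged illustration: expected win probability between two models whose
# Arena Scores are read as plain Elo ratings (an assumption for illustration only).
def expected_win_rate(score_a: float, score_b: float) -> float:
    """P(A beats B) under the standard 400-point Elo logistic."""
    return 1.0 / (1.0 + 10.0 ** ((score_b - score_a) / 400.0))

# ChatGPT-4o-latest (1314) vs. Claude 3.5 Sonnet (1271), scores from the table:
print(f"{expected_win_rate(1314, 1271):.1%}")  # ~56% expected win rate for the higher-scored model
```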
2024
- Perplexity 2024-08-13
- The LMSYS Chatbot Arena Leaderboard is a crowdsourced platform for evaluating and ranking large language models (LLMs)[5]. Here are some key details about the leaderboard:
- Ranking System: The leaderboard uses the Bradley-Terry model to rank LLMs based on over 1,000,000 human pairwise comparisons[5]. Each model's rank is determined by the number of models that are statistically better than it, with a lower rank indicating better performance[4].
- Model Diversity: The leaderboard includes a wide range of models:
- Proprietary models from major tech companies (e.g., GPT-3.5-Turbo, Claude 2)
- Open-source models (e.g., OpenChat-3.5, SOLAR-10.7B-Instruct)
- Models of various sizes (from 0.5B to 70B+ parameters)
- Key Metrics: For each model, the leaderboard typically provides:
- Elo rating: A measure of relative skill level
- Win rate: The percentage of comparisons won against other models
- Release date: When the model was made available
- License type: Whether it's open-source or proprietary
- Developer: The organization behind the model
- Limitations: It's important to note that this benchmark has some limitations:
- As models improve, the benchmark's ability to differentiate between them may decrease.
- The evaluation is based on human judgments, which can be subjective and may not always capture the full capabilities of the models.
- The benchmark may eventually reach an upper Elo bound determined by human capabilities rather than LLM capabilities[3].
- Citations:
[1] https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
[2] https://www.kaggle.com/competitions/lmsys-chatbot-arena/leaderboard
[3] https://www.reddit.com/r/LocalLLaMA/comments/1bzo2sh/latest_lmsys_chatbot_arena_result_command_r_has/
[4] https://arena.lmsys.org
[5] https://lmsys-chatbot-arena-leaderboard.hf.space/?c=Nerdinx
[6] https://twitter.com/lmsysorg?lang=en
[7] https://chat.lmsys.org
[8] https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/commit/a219f3eae6a22003458fe862374def8baaf4da1e