LMArena leaderboard

noun

LM Arena (originally known as Chatbot Arena) is a popular leaderboard that ranks the performance of the latest large language models across a range of tasks and categories.

The leaderboard uses a crowd-sourced, side-by-side "taste test": real human users compare the responses of two anonymous models and vote for the one they prefer. The leaderboard then aggregates those votes into a rating for each model, which determines the rankings.
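The mechanics of that aggregation are worth understanding: pairwise votes are converted into ratings with a statistical model (LM Arena has used Elo-style ratings and, later, a Bradley-Terry model). Below is a minimal, illustrative Python sketch of an Elo-style update, with made-up model names and votes; it is not LM Arena's actual implementation.

```python
# Illustrative sketch only: a simplified Elo-style update showing how
# pairwise "taste test" votes can be turned into a ranking. Model names
# and votes are hypothetical; LM Arena's real pipeline differs.

K = 32  # update step size (a common Elo default)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed vote outcome."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

# Hypothetical votes: (preferred model, other model)
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]

ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
for winner, loser in votes:
    update(ratings, winner, loser)

for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

The intuition: an upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result, so over thousands of votes the scores converge toward each model's true win rate against the field.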

Journalists should be aware that the AI industry watches these leaderboards closely, and "mystery models" sometimes show up disguised by code names. Last year, Meta faced widespread criticism after an internal "experimental chat version" of its Llama 4 model received unusually high scores on LM Arena, leading to accusations that the company was trying to manipulate the results.

"One of the things we've generally tried to do over the last year is anchor more of our models in our Meta AI product north star use cases. The issue with open source benchmarks, and any given thing like the LM Arena stuff, is that they're often skewed toward a very specific set of use cases, which are often not actually what any normal person does in your product."
Mark Zuckerberg (via Simon Willison)

"Internally, OpenAI paid close attention to LM Arena, people familiar with the matter said. It also closely tracked 4o's contribution to ChatGPT's daily active user counts, which were visible internally on dashboards and touted to employees in town-hall meetings and in Slack."
The Wall Street Journal
Entry by Jon Keegan · Last updated: March 4, 2026