
Benchmarks

noun
Reporting on AI

Benchmarks are tools used to measure the capabilities of AI models. Some test general knowledge, while others are specialized for specific purposes and domains, such as coding, medicine, finance, and law.

Other benchmarks aim to measure performance on specific tasks, such as visual reasoning, language translation, generating images and video from text, and executing computer-based tasks.

There are no official standards for these tools, and AI model builders are quick to tout their models' latest high scores. The industry has embraced a set of benchmarks that appear on nearly every model card, but many become obsolete quickly as the technology advances and new ones emerge.

Benchmarks are created by independent researchers, AI labs, companies, and some government agencies, and this absence of shared standards makes true, fair comparisons between models difficult.

Journalists should read the research papers behind these benchmarks to learn how they were built, what source material they used, and what a high score actually indicates. It is also important to know who built a benchmark, and whether a company is measuring itself by its own yardstick.
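At its simplest, a benchmark score is just accuracy: the share of a fixed question set a model answers correctly. A minimal sketch, using entirely made-up questions and a stand-in for a real model API:

```python
# Sketch of how a simple benchmark score is computed:
# accuracy = correct answers / total questions.
# The questions and "model" below are hypothetical placeholders.
questions = [
    {"prompt": "2 + 2 = ?", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
    {"prompt": "Chemical symbol for gold?", "answer": "Au"},
]

def fake_model(prompt: str) -> str:
    # Stand-in for calling a real AI model; returns canned answers.
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "unknown")

correct = sum(fake_model(q["prompt"]) == q["answer"] for q in questions)
score = correct / len(questions)
print(f"Score: {score:.0%}")  # 2 of 3 correct -> 67%
```

Real benchmarks differ in what counts as "correct" (exact match, a passing test suite, a human or model grader), which is one reason reading the underlying paper matters.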

Some popular benchmarks:

Humanity's Last Exam (HLE): A multimodal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind, with broad subject coverage.

ARC-AGI: A reasoning test of abstract visual puzzles designed to be "easy for humans, hard for AI," and structured to avoid reliance on memorized training knowledge.

SWE-bench Pro: A rigorous, realistic evaluation of AI agents for software engineering, measuring model performance on real-world coding tasks.

BrowseComp: A benchmark for web browsing that is challenging for models and easy to verify, measuring an AI model's ability to find information through web search.

MMMU-Pro: A multimodal benchmark testing expert-level knowledge across many topics, requiring interpretation of text alongside images, diagrams, maps, and scientific figures.
"While imperfect, the industry has embraced the use of 'benchmarks' — tests designed to measure an AI model's knowledge and reasoning ability." (Sherwood News)

"The rapid pace of AI product releases — and a lack of governmental oversight — increases the likelihood that tech companies continue to use the same benchmarks, regardless of their shortcomings." (The Markup)
Entry by Jon Keegan · Last updated: Feb. 26, 2026