Benchmark

A standardized test measuring how well an AI model performs on specific tasks.

An AI benchmark is a curated dataset and scoring rubric for evaluating model capability. Common benchmarks:

- MMLU: multiple-choice questions spanning 57 subjects (math, science, history, law, and more); tests general knowledge.
- HumanEval: 164 Python coding problems; tests coding ability.
- SQuAD: question answering over Wikipedia passages; tests reading comprehension.
- GLUE: 9 text classification and reasoning tasks.
- BIG-bench: 200+ tasks covering reasoning, knowledge, and language understanding.
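
To make the "curated dataset plus scoring rubric" idea concrete, here is a minimal sketch of a multiple-choice evaluation loop in the style of MMLU. The two items and the `ask_model` callable are hypothetical placeholders, not part of any real benchmark or model API; an actual harness would load thousands of items and query a real model.

```python
from typing import Callable

# A couple of MMLU-style items: question, answer options, correct letter.
# These items are illustrative only, not drawn from any real benchmark.
ITEMS = [
    {"question": "What is the derivative of x^2?",
     "choices": {"A": "x", "B": "2x", "C": "x^2", "D": "2"},
     "answer": "B"},
    {"question": "Which planet is closest to the Sun?",
     "choices": {"A": "Venus", "B": "Earth", "C": "Mercury", "D": "Mars"},
     "answer": "C"},
]

def score_benchmark(ask_model: Callable[[str], str]) -> float:
    """Return accuracy: the fraction of items the model answers correctly."""
    correct = 0
    for item in ITEMS:
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in item["choices"].items()
        ) + "\nAnswer with a single letter."
        # Take only the first character of the reply as the chosen letter.
        prediction = ask_model(prompt).strip().upper()[:1]
        if prediction == item["answer"]:
            correct += 1
    return correct / len(ITEMS)

if __name__ == "__main__":
    # Toy "model" that always answers B: gets 1 of 2 items right.
    always_b = lambda prompt: "B"
    print(f"Accuracy: {score_benchmark(always_b):.0%}")  # Accuracy: 50%
```

Reported benchmark scores are essentially this loop at scale: a fixed question set, a fixed scoring rule, and a single aggregate number (often accuracy) that lets different models be compared on the same footing.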

Benchmarks let researchers compare models objectively ("GPT-4 scores 86% on MMLU, Claude 3.5 scores 88%"). But benchmarks have limitations—they don't capture every real-world use case, and models can overfit to popular benchmarks.

Why benchmarks matter: They serve three distinct purposes—marketing (companies cite high scores), research (measuring progress over time), and safety evaluation (testing for harmful capabilities).

Example

GPT-4 achieves 86.4% on MMLU, ranking it among the top LLMs. A new startup model scores 72%, suggesting weaker general knowledge, at least as measured by this benchmark.