Benchmark

Evaluation
Simple Definition

A standardized test used to measure and compare AI model capabilities across specific tasks.

Full Explanation

Common benchmarks include MMLU (Massive Multitask Language Understanding, college-level knowledge), HumanEval (coding ability), MATH-500 (competition math), MT-Bench (multi-turn conversation), and HellaSwag (commonsense reasoning). Benchmarks enable objective comparisons, but they can be 'gamed' when models are trained on similar data, and real-world performance often differs from benchmark scores.
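
To make the idea concrete, here is a minimal sketch of how a multiple-choice benchmark score is typically computed: run each question through the model and report the fraction answered correctly. The sample questions and the dummy_model stand-in below are hypothetical placeholders, not the actual MMLU data or any official evaluation harness.

```python
from typing import Callable

# Tiny illustration of benchmark scoring: each item has a prompt and a gold
# answer, and the score is the fraction of items the model answers correctly.
# These questions are made up for illustration, not drawn from a real benchmark.
ITEMS = [
    {"question": "Which planet is closest to the Sun? (A) Venus (B) Mercury (C) Mars", "answer": "B"},
    {"question": "What is 7 * 8? (A) 54 (B) 56 (C) 64", "answer": "B"},
]

def score(model: Callable[[str], str], items: list[dict]) -> float:
    """Return accuracy: correct answers divided by total questions."""
    correct = sum(1 for item in items if model(item["question"]).strip() == item["answer"])
    return correct / len(items)

def dummy_model(prompt: str) -> str:
    # Stand-in for a real model call (e.g. an API request); always guesses "B".
    return "B"

if __name__ == "__main__":
    print(f"Accuracy: {score(dummy_model, ITEMS):.0%}")
```

Published benchmark scores are produced the same way in principle, just over thousands of questions and with careful prompt formatting and answer parsing, which is one reason small methodological differences can shift reported numbers.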

Example

GPT-4o scores 86% on MMLU. Claude Opus 4 scores 88%. Gemini 2.5 Pro scores 90%.

Last verified: 2026-03-30