Evaluation & Benchmarks
How we measure whether AI actually works: the science, and real difficulty, of knowing if a model is any good.
How do you know if an AI is actually good, or just sounds good? You test it, the same way a school gives students exams. Evaluation is the science of measuring whether an AI really works. We hand the model thousands of questions with known answers and count how many it gets right. A shared set of these questions is called a benchmark; because every model answers the same questions, rival AIs can be compared fairly, like everyone sitting the identical exam and posting their marks on a leaderboard. But there is a catch. If a student secretly memorised the answer key the night before, a perfect score says nothing about how clever they are. AI hits exactly this problem: test questions sometimes end up in the data a model is trained on, so the model has already 'seen' the exam and a sky-high score can be misleading. That is why honest evaluation is surprisingly hard, and why researchers keep inventing tougher, fresher tests.
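The "count how many it gets right" step can be sketched in a few lines of Python. This is a minimal illustration, not how any particular benchmark is scored: the questions, the model replies, and the helper names `normalize` and `accuracy` are all made up for the example, and real benchmarks often use more careful answer matching than a lowercase comparison.

```python
def normalize(text: str) -> str:
    """Lowercase and trim so that ' Paris ' and 'paris' count as the same answer."""
    return text.strip().lower()


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the prediction matches the known answer."""
    if len(predictions) != len(answers):
        raise ValueError("need exactly one prediction per question")
    if not answers:
        raise ValueError("benchmark is empty")
    correct = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical toy benchmark: the known answers, and what an imagined model replied.
answers = ["4", "Paris", "blue", "7"]
model_outputs = ["4", "paris ", "green", "7"]

score = accuracy(model_outputs, answers)
print(f"accuracy = {score:.2f}")  # 3 of 4 match, so this prints "accuracy = 0.75"
```

Note that this score cannot tell a model that worked the answers out from one that memorised them, which is exactly the catch described above.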
The main ideas
- Metrics: Accuracy, precision/recall, F1, BLEU/ROUGE, perplexity, and when each misleads.
- Benchmarks & leaderboards: Standardized tests (MMLU, GSM8K, HumanEval, MMMU and friends), and how they get gamed.
- LLM-as-judge: Using strong models to grade outputs, with their biases and calibration issues.
- Human evaluation: Preference ratings, head-to-head arenas (Elo), and inter-annotator agreement.
- Red-teaming & safety evals: Probing for harmful, jailbroken, or unsafe behavior before release.
- Contamination & validity: Test-set leakage, overfitting to benchmarks, and building evals you can trust.
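To make the "when each misleads" point from the Metrics item concrete, here is a small sketch comparing accuracy with precision, recall, and F1 on lopsided data. Everything in it is an assumption for illustration: the 95/5 split, the lazy "always say safe" classifier, and the helper name `binary_metrics`. Treating an undefined precision or recall as 0.0 is one common convention, not the only one.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, recall and F1 for labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

    # Assumed convention: a ratio with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical data: 95 safe examples (0) and 5 harmful ones (1).
y_true = [0] * 95 + [1] * 5
# A lazy classifier that labels everything as safe.
always_safe = [0] * 100

metrics = binary_metrics(y_true, always_safe)
print(metrics)
```

In this made-up case accuracy comes out at 0.95 even though the classifier never catches a single harmful example, while recall and F1 are both 0.0. That gap is why a single headline number is rarely enough.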
Related areas
NLP & Large Language Models · AI Safety, Alignment & Ethics · Building with AI