Interpretability & Explainability

Opening the black box — understanding why a model made a prediction, and what it has actually learned inside.

Modern AI systems can be astonishingly good at their jobs, yet they usually can't tell you why they answered the way they did. Interpretability is the effort to expose that hidden reasoning: to trace a prediction back to the inputs and internal computations that produced it.

Think of a maths teacher who refuses to accept just the final answer on a test — she asks the student to show their working. A right answer reached for the wrong reasons is fragile: change the numbers next time and it falls apart. Interpretability tools try to make an AI show its working — which parts of the input it paid attention to, and which internal 'switches' fired on the way to a decision.

Why bother? Because if we can see how a model thinks, we can catch mistakes, spot hidden bias, and trust it in the places where being wrong really matters — like medicine, money, or the law.

The main ideas

Feature attribution — Which inputs mattered? SHAP, LIME, integrated gradients, and saliency maps.
Probing representations — Testing what information is encoded in a model's internal activations.
Mechanistic interpretability — Reverse-engineering circuits and features inside networks — induction heads, superposition, sparse autoencoders.
Concept-based explanations — Explaining models in terms of human-understandable concepts.
Global vs local — Explaining a model's overall behavior vs explaining a single prediction.
Faithfulness — The hard question of whether an explanation reflects the true reason for a decision.
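
To make feature attribution and the faithfulness question a little more concrete, here is a minimal sketch of integrated gradients in Python with NumPy. Everything in it is an illustrative assumption rather than something taken from this page: the toy logistic-regression model, its weights `w` and bias `b`, the all-zeros baseline, and the helper names `model`, `model_grad`, and `integrated_gradients`. A real model would normally get its gradients from an autodiff framework rather than a hand-written formula.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def model(x, w, b):
    """Toy logistic-regression model (hypothetical): probability of the positive class."""
    return sigmoid(x @ w + b)


def model_grad(x, w, b):
    """Analytic gradient of the model output with respect to the input x."""
    p = model(x, w, b)
    return p * (1.0 - p) * w


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum.

    Gradients are averaged along the straight line from `baseline` to `x`,
    then scaled by how far each feature moved away from the baseline.
    """
    alphas = (np.arange(steps) + 0.5) / steps              # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)     # shape: (steps, n_features)
    grads = np.stack([model_grad(point, w, b) for point in path])
    return (x - baseline) * grads.mean(axis=0)


# Illustrative weights and input (assumed values, not from a trained model).
w = np.array([2.0, -1.0, 0.0])
b = 0.1
x = np.array([1.0, 0.5, 3.0])
baseline = np.zeros_like(x)

attributions = integrated_gradients(x, baseline, w, b)
output_change = model(x, w, b) - model(baseline, w, b)

print("attributions:       ", attributions)
print("sum of attributions:", attributions.sum())
print("change in output:   ", output_change)
```

Integrated gradients has a completeness property: the attributions sum to the difference between the model's output at the input and at the baseline, up to numerical error. Comparing the last two printed values is therefore a simple sanity check on the attribution, although passing it does not by itself show that an explanation is faithful. In this sketch the third feature has zero weight, so it receives zero attribution however large its value is.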

Related areas

Deep Learning · AI Safety, Alignment & Ethics · AI Ethics & Governance
