AI Safety, Alignment & Ethics

Making AI systems reliable, fair, and aligned with human values — and governing their use.

Imagine you ask a super-eager new helper to "keep the kitchen clean," and they throw out your half-finished dinner because, technically, that makes it clean. They did exactly what you said, not what you meant. Powerful AI systems have the same problem: they chase whatever goal we hand them, even when our instructions are fuzzy or we forgot to mention something obvious. AI safety and alignment is the work of making sure these systems actually pursue what humans intend, stay reliable when someone tries to trick them, treat people fairly, and protect private information. It also includes writing sensible rules for how AI may be used. As AI grows more capable, tiny gaps between what we say and what we want can cause real harm, so getting this right matters more every year.
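The kitchen story can be written down as a tiny toy model. The sketch below is purely illustrative: the action names, the numbers, and the penalty of 10 are made-up assumptions, not taken from any real system. It shows how a helper that scores actions only by the stated goal picks the unwanted action, while scoring by the intended goal picks the sensible one.

```python
# Hypothetical toy model of the kitchen example; every name and number is illustrative.
ACTIONS = {
    # action name: (items left on the counter, wanted items thrown away)
    "do_nothing": (5, 0),
    "wash_dishes_only": (1, 0),
    "throw_everything_out": (0, 1),
}


def stated_score(outcome):
    """What we said: fewer items on the counter is better."""
    items_left, _wanted_discarded = outcome
    return -items_left


def intended_score(outcome):
    """What we meant: tidy up, but never discard something we still want."""
    items_left, wanted_discarded = outcome
    # The penalty of 10 is an assumed weight, chosen so that discarding outweighs any tidiness gain.
    return -items_left - 10 * wanted_discarded


def best_action(score):
    """Return the action whose outcome scores highest under the given objective."""
    return max(ACTIONS, key=lambda name: score(ACTIONS[name]))


if __name__ == "__main__":
    print("Stated goal picks:  ", best_action(stated_score))
    print("Intended goal picks:", best_action(intended_score))
```

The helper itself is the same in both cases. Only the objective changes, which is why so much alignment work focuses on specifying and checking the goal rather than on making the system try harder.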

The main ideas

Alignment — Ensuring systems pursue intended goals, including RLHF and scalable oversight.
Interpretability — Understanding what models learn and why they behave as they do.
Robustness & security — Adversarial examples, jailbreaks, prompt injection, and defending deployed systems.
Fairness, bias & privacy — Detecting and mitigating harm; protecting personal data.
Governance & policy — Regulation, standards, and responsible-AI practice.

Related areas

Foundations of AI · AI Agents & Autonomy · Knowledge & Reasoning
