Alignment
Making an AI's goals and behavior consistent with human values and intentions.
Alignment is the challenge of making powerful AI systems behave the way humans intend. An unaligned model might optimize for the wrong objective: asked to write a persuasive essay, it might produce misinformation, because "be persuasive" does not by itself constrain truthfulness.
Alignment techniques include:

- RLHF (reinforcement learning from human feedback): models learn to match human preferences (see the sketch after this list).
- Constitutional AI: models critique and revise their outputs against a written set of principles.
- Oversight and monitoring: detecting when models behave unexpectedly.
- Red-teaming: deliberately probing models for failure modes.
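To make the RLHF item concrete, below is a minimal sketch of its reward-modeling step: a small scalar reward model is trained so that responses humans preferred score higher than responses they rejected. The class name, dimensions, and toy data here are illustrative assumptions, not any particular lab's implementation.

```python
# Minimal sketch of the reward-modeling step in RLHF (pairwise preference loss).
# TinyRewardModel, hidden_dim, and the random "embeddings" are toy assumptions
# for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps a response embedding to a scalar reward score."""
    def __init__(self, hidden_dim: int = 16):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward of the human-preferred
    # response above the reward of the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# One toy training step on random vectors standing in for response embeddings.
model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

chosen_emb = torch.randn(8, 16)    # embeddings of human-preferred responses
rejected_emb = torch.randn(8, 16)  # embeddings of dispreferred responses

optimizer.zero_grad()
loss = preference_loss(model(chosen_emb), model(rejected_emb))
loss.backward()
optimizer.step()
```

In full RLHF, the trained reward model is then used as the optimization target for a reinforcement-learning step that fine-tunes the language model itself.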
Alignment is an unsolved problem. Even carefully trained models occasionally violate their guidelines or surprise users, and the problem gets harder at scale: the stakes rise as AI systems become more capable and autonomous.
Alignment is both a technical and philosophical challenge—what does "good behavior" even mean when humans disagree?
Example
A model might fluently complete a prompt like "Write convincing arguments that...", but alignment training leads it to add disclaimers, or to decline, when the requested arguments would be misleading.