
RLHF (Reinforcement Learning from Human Feedback)

A training technique in which human reviewers rate or rank a model's outputs, and the model is then updated to produce the kinds of outputs those reviewers prefer.

RLHF is a core technique behind conversational assistants like ChatGPT and Claude. After pre-training on large amounts of text, the model is fine-tuned using human feedback:

The RLHF process:

1. Collect examples of the model's outputs for various prompts.
2. Have humans rank the outputs from best to worst.
3. Train a reward model that predicts human preferences.
4. Use reinforcement learning to update the language model so it maximizes the reward model's score.
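To make step 3 concrete, here is a minimal sketch of reward-model training on ranked pairs using a Bradley-Terry style pairwise loss. The `RewardModel` class, the embedding dimension, and the random tensors standing in for real model activations are illustrative assumptions, not any lab's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical reward model: maps a (prompt, response) embedding to a scalar score.
# The single linear layer and hidden size of 768 are illustrative assumptions.
class RewardModel(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(embedding).squeeze(-1)  # one scalar reward per example

def preference_loss(reward_model, chosen_emb, rejected_emb):
    """Bradley-Terry style loss: push the chosen response's reward above the rejected one's."""
    chosen_reward = reward_model(chosen_emb)
    rejected_reward = reward_model(rejected_emb)
    return -torch.nn.functional.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy training step with random embeddings standing in for model activations.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(model, chosen, rejected)
loss.backward()
optimizer.step()
```

The loss simply rewards the model for scoring the preferred response higher than the rejected one; in practice the reward model is usually initialized from the pre-trained language model itself rather than a single linear layer.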

RLHF is a large part of why modern models refuse harmful requests, explain their reasoning, and feel conversational. Without it, a pre-trained model tends to simply continue text rather than follow instructions helpfully. RLHF is computationally expensive and requires extensive human labeling, which is why it is mostly done by large labs such as OpenAI, Anthropic, and Google.

RLHF isn't perfect: the resulting model is only as good as its human raters, and it inherits their biases and blind spots.

Example

A model outputs 5 responses to "Write a poem about rain". Humans rank them: Response 3 is best (creative, grammatical), Response 1 is worst (nonsensical). The reward model learns to score outputs like Response 3 highly, and reinforcement learning then steers the language model toward producing more outputs like it.
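One common way a single ranking like this becomes reward-model training data is to expand it into pairwise comparisons, pairing every higher-ranked response against every lower-ranked one. The sketch below assumes that scheme and makes up an ordering for the middle responses, which the example above doesn't specify.

```python
from itertools import combinations

# Human ranking, best first. Response 3 best and Response 1 worst come from the
# example above; the ordering of the middle three is an illustrative assumption.
ranked = ["Response 3", "Response 5", "Response 2", "Response 4", "Response 1"]

# Every (higher-ranked, lower-ranked) pair becomes one (chosen, rejected) example.
pairs = list(combinations(ranked, 2))

print(len(pairs))  # 10 comparisons from one ranking of 5 responses
print(pairs[0])    # ('Response 3', 'Response 5')
```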