
Attention Mechanism

A neural network technique that lets a model weigh how relevant each input token is when producing each output token.

The attention mechanism is the core innovation that powers Transformers. When the model generates each output token, it computes attention weights—scores indicating how much each input token should influence that output. Tokens with high attention weights matter more; irrelevant tokens get low weights.
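The weight computation described above can be sketched in a few lines. This is a minimal pure-Python version of scaled dot-product attention for a single query; real implementations operate on batched tensors and learned projections, and the vectors here are illustrative.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Scores each key against the query, normalizes the scores into
    attention weights, and returns the weighted sum of the values.
    """
    d = len(query)
    # Dot product of query with each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output is the values blended according to the attention weights.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights
```

A query that points in the same direction as one key receives the largest weight for that key's value, which is exactly the "relevant tokens matter more" behavior described above.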

For example, given the input "Alice went to the store", when generating "she" the attention mechanism assigns a high weight to "Alice" and a low weight to "store", because "Alice" is the subject that "she" refers back to. This lets the model capture long-range dependencies and understand context.

Attention comes in variants: self-attention (tokens attending to other tokens in the same sequence), cross-attention (one sequence attending to another), and multi-head attention (multiple attention patterns computed in parallel). All modern language models rely heavily on attention.
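The self-attention and multi-head variants can be sketched as follows. This is a simplified illustration: each token's vector acts directly as its own query, key, and value, whereas real models apply learned projection matrices per head; the function names are illustrative.

```python
import math

def self_attention(xs):
    # Self-attention: every token attends to every token in the same
    # sequence. Each vector serves as its own query, key, and value
    # (a simplification -- real models use learned projections).
    d = len(xs[0])
    outs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        ws = [e / total for e in exps]
        outs.append([sum(w * v[i] for w, v in zip(ws, xs))
                     for i in range(d)])
    return outs

def multi_head(xs, heads):
    # Multi-head attention: split each vector into `heads` equal slices,
    # run self-attention on each slice independently (in parallel in a
    # real implementation), then concatenate the per-head outputs.
    d = len(xs[0])
    hd = d // heads
    slices = [[x[h * hd:(h + 1) * hd] for x in xs] for h in range(heads)]
    per_head = [self_attention(s) for s in slices]
    return [sum((per_head[h][t] for h in range(heads)), [])
            for t in range(len(xs))]
```

Each head sees only its own slice of the vector, so the heads can learn different attention patterns, e.g. one tracking syntax and another tracking coreference.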

Example

Processing "The cat sat on the mat and licked its paw"—when generating the pronoun "its", attention weights concentrate on "cat", the likely antecedent, rather than "mat", resolving what "its" refers to.