Throughput / Tokens per Second
The speed at which a model generates output tokens, measured in tokens per second.
Throughput measures output generation speed: how many tokens a model produces per second. If a model generates 50 tokens/second and your response is 200 tokens, that's ~4 seconds of generation time.
Key insight: throughput is generation speed, not total latency. Total time = initial latency + (tokens generated / throughput).
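A minimal sketch of that formula in code, assuming a fixed initial latency and a steady generation rate (the function name and the 0.5-second latency value below are illustrative, not from any particular provider):

```python
def estimate_total_seconds(initial_latency_s: float,
                           output_tokens: int,
                           tokens_per_second: float) -> float:
    """Estimate end-to-end response time: time before the first token
    plus the time spent generating the output."""
    return initial_latency_s + output_tokens / tokens_per_second

# Numbers from the example above: 200 output tokens at 50 tokens/sec,
# with an assumed 0.5 s initial latency.
print(estimate_total_seconds(0.5, 200, 50))  # 4.5 seconds total (~4 s of generation)
```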
Throughput varies by:
- Model size (larger models are generally slower).
- Hardware (GPUs are much faster than CPUs).
- Batching (generating multiple responses in parallel).
- Optimization techniques (quantization, distillation).
For streaming responses (where you see the answer appear word-by-word), throughput directly affects perceived speed. High throughput (100+ tokens/sec) feels fast; low throughput (<20 tokens/sec) feels glacial.
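If you want to measure perceived streaming speed yourself, a rough sketch looks like the following. It assumes a hypothetical `token_stream` iterable that yields tokens as they arrive (substitute your provider's streaming API) and excludes the wait before the first token:

```python
import time

def measure_throughput(token_stream) -> float:
    """Count tokens as they arrive and return tokens per second,
    measured from the arrival of the first token."""
    first_token_time = None
    count = 0
    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.monotonic()
        count += 1
    if first_token_time is None or count < 2:
        return 0.0
    elapsed = time.monotonic() - first_token_time
    # count - 1 inter-token gaps occur after the first token arrives
    return (count - 1) / elapsed if elapsed > 0 else 0.0

# Usage with a hypothetical streaming client:
# rate = measure_throughput(client.stream_tokens("Explain throughput"))
# print(f"{rate:.1f} tokens/sec")
```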
Throughput is a key metric for AI service providers to optimize—it directly affects cost and user satisfaction.
Example
If a model generates 80 tokens/second and your response needs 400 tokens, generation takes 5 seconds.
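The same arithmetic spelled out, covering only generation time (any initial latency would be added on top):

```python
tokens_needed = 400
tokens_per_second = 80

generation_seconds = tokens_needed / tokens_per_second
print(generation_seconds)  # 5.0
```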