Throughput / Tokens per Second
The speed at which a model generates output tokens, measured in tokens per second.
Throughput measures output generation speed: how many tokens a model produces per second. If a model generates 50 tokens/second and your response is 200 tokens, that's ~4 seconds of generation time.
Key insight: throughput is generation speed, not total latency. Total time = initial latency + (tokens generated / throughput).
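A minimal sketch of that formula in code, assuming a fixed initial latency and a steady generation rate (the function name and the 0.5-second latency value below are illustrative, not from any particular provider):

```python
def estimate_total_seconds(initial_latency_s: float,
                           output_tokens: int,
                           tokens_per_second: float) -> float:
    """Estimate end-to-end response time: time before the first token
    plus the time spent generating the output."""
    return initial_latency_s + output_tokens / tokens_per_second

# Numbers from the example above: 200 output tokens at 50 tokens/sec,
# with an assumed 0.5 s initial latency.
print(estimate_total_seconds(0.5, 200, 50))  # 4.5 seconds total (~4 s of generation)
```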
Throughput varies by:
- Model size (larger models are generally slower).
- Hardware (GPUs are much faster than CPUs).
- Batching (generating multiple responses in parallel).
- Optimization techniques (quantization, distillation).
For streaming responses (where you see the answer appear word-by-word), throughput directly affects perceived speed. High throughput (100+ tokens/sec) feels fast; low throughput (<20 tokens/sec) feels glacial.
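If you want to measure perceived streaming speed yourself, a rough sketch looks like the following. It assumes a hypothetical `token_stream` iterable that yields tokens as they arrive (substitute your provider's streaming API) and excludes the wait before the first token:

```python
import time

def measure_throughput(token_stream) -> float:
    """Count tokens as they arrive and return tokens per second,
    measured from the arrival of the first token."""
    first_token_time = None
    count = 0
    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.monotonic()
        count += 1
    if first_token_time is None or count < 2:
        return 0.0
    elapsed = time.monotonic() - first_token_time
    # count - 1 inter-token gaps occur after the first token arrives
    return (count - 1) / elapsed if elapsed > 0 else 0.0

# Usage with a hypothetical streaming client:
# rate = measure_throughput(client.stream_tokens("Explain throughput"))
# print(f"{rate:.1f} tokens/sec")
```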
Throughput is a key metric for AI service providers to optimize—it directly affects cost and user satisfaction.
Example
If a model generates 80 tokens/second and your response needs 400 tokens, generation takes 5 seconds.
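The same arithmetic spelled out, covering only generation time (any initial latency would be added on top):

```python
tokens_needed = 400
tokens_per_second = 80

generation_seconds = tokens_needed / tokens_per_second
print(generation_seconds)  # 5.0
```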