
Latency

The time it takes an AI model to return a response after you send a prompt.

Latency is the delay between input and output: how long you wait for the model to respond. In real-world AI:

- ChatGPT: ~500 ms to 2 seconds for a streaming response.
- Claude: ~800 ms to 3 seconds.
- Latency varies based on prompt length, model size, server load, and network.
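To make the measurement concrete, here is a minimal sketch in Python, assuming a hypothetical `stream_completion` generator that stands in for a real provider's streaming client. It records both time to first token (what streaming UIs show) and total latency:

```python
import time

def stream_completion(prompt: str):
    # Hypothetical stand-in for a provider's streaming client;
    # the sleeps simulate network and generation delay.
    for token in ["Latency", " is", " the", " wait."]:
        time.sleep(0.1)
        yield token

start = time.perf_counter()
first_token_at = None
for token in stream_completion("Define latency"):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # time to first token
total = time.perf_counter() - start

print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
print(f"total latency:       {total * 1000:.0f} ms")
```

Time to first token is usually what makes a streaming chat feel fast, even when generating the full response takes seconds.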

Latency matters for user experience. A 100 ms response feels snappy; 5 seconds feels slow. For real-time applications (chat, autocomplete, search), latency is critical. For batch jobs (daily reports), latency is less important.
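For interactive applications, a common pattern is to enforce a hard latency budget and fall back when it is exceeded. A minimal sketch, assuming a hypothetical async `call_model` function in place of a real client:

```python
import asyncio

async def call_model(prompt: str) -> str:
    # Hypothetical model call; the sleep simulates ~400 ms of latency.
    await asyncio.sleep(0.4)
    return "response"

async def main() -> None:
    try:
        # Enforce a 2-second budget; interactive UIs typically fall back
        # to a spinner, a cached answer, or a smaller model on timeout.
        print(await asyncio.wait_for(call_model("hi"), timeout=2.0))
    except asyncio.TimeoutError:
        print("fallback: request exceeded the 2 s budget")

asyncio.run(main())
```

Batch jobs usually skip budgets like this entirely and optimize for throughput instead.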

Optimizations for latency:

- Smaller models (faster but less capable).
- Quantization (compress model weights).
- Caching (reuse recent computations; see the sketch after this list).
- Distillation (train a tiny model to mimic a large one).
- Edge deployment (run locally, not on servers).
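Of these, caching is the easiest to illustrate. Below is a minimal sketch of an exact-match response cache, assuming a hypothetical `generate` function standing in for a slow model call:

```python
import time
from functools import lru_cache

def generate(prompt: str) -> str:
    # Hypothetical model call; the sleep simulates a slow API request.
    time.sleep(1.0)
    return f"answer to: {prompt}"

@lru_cache(maxsize=1024)
def cached_generate(prompt: str) -> str:
    # Exact-match cache: a repeated prompt skips the slow call entirely.
    return generate(prompt)

cached_generate("What is latency?")  # ~1 s: cache miss
cached_generate("What is latency?")  # ~0 ms: cache hit
```

Real systems often go further, with semantic caches (matching similar prompts) or provider-side prompt caching, but the latency win is the same idea: avoid recomputing what you already have.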

Latency is a key SLA (service-level agreement) metric for AI products.
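SLA targets are usually stated as percentiles rather than averages, because tail latency is what users notice. A small sketch of computing p50 and p95 from latency samples (the numbers here are made up for illustration):

```python
import math

samples_ms = [420, 470, 480, 505, 510, 530, 610, 630, 950, 1800]

def percentile(values: list[int], pct: float) -> int:
    # Nearest-rank percentile over observed latency samples.
    ordered = sorted(values)
    k = math.ceil(pct / 100 * len(ordered)) - 1
    return ordered[k]

print(f"p50: {percentile(samples_ms, 50)} ms")  # median experience
print(f"p95: {percentile(samples_ms, 95)} ms")  # common SLA target
```

Note how the single 1.8-second outlier leaves p50 untouched but drives p95; that is why SLAs target the tail.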

Example

ChatGPT might have 1-second latency for a simple prompt but 5+ seconds for a complex reasoning task.