Inference
The process by which a trained model generates a response to an input. When you chat with ChatGPT, that's inference.
Inference is what happens after a model is trained. Training teaches a model how to respond; inference is the actual moment of use, when the model reads your prompt and generates its output one token at a time. During inference the model's weights are frozen (no learning happens); the model is simply applying what it already learned.
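To make that loop concrete, here is a minimal sketch of token-by-token generation in Python. The Hugging Face transformers library, the small gpt2 model, and plain greedy decoding are all illustrative assumptions; real chat systems layer much more on top of this.

```python
# Minimal sketch of token-by-token inference with frozen weights.
# Assumes the Hugging Face "transformers" library and the small "gpt2"
# model purely for illustration; real chat products add much more.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: disables training-only behavior like dropout

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():  # no gradients are computed, so no learning happens
    for _ in range(10):  # generate 10 tokens, one at a time
        logits = model(input_ids).logits            # scores over the vocabulary
        next_id = logits[:, -1, :].argmax(dim=-1)   # greedy: take the top token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```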
Inference speed matters because it determines how quickly you get your answer. Larger models are usually more capable but slower to run. Optimization techniques such as quantization and distillation make inference faster without sacrificing too much accuracy. Inference cost is a large part of what AI companies charge users for, because serving inference at scale (millions of simultaneous requests) is computationally expensive.
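As a toy illustration of one of those techniques, here is the core idea behind weight quantization: storing weights as 8-bit integers plus a scale factor instead of 32-bit floats. The matrix size and the simple symmetric per-tensor scheme are illustrative assumptions; production quantization is considerably more sophisticated.

```python
# Toy sketch of symmetric 8-bit weight quantization. The matrix size and
# per-tensor scheme are illustrative; real systems use more refined methods.
import torch

weights = torch.randn(4096, 4096)   # float32 weight matrix: 64 MB

scale = weights.abs().max() / 127   # map the largest magnitude into int8's range
q_weights = torch.round(weights / scale).to(torch.int8)   # 16 MB: 4x smaller

dequantized = q_weights.float() * scale   # approximate weights used at runtime
error = (weights - dequantized).abs().mean()
print(f"mean absolute error: {error:.6f}")
```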
Example
When you type into ChatGPT and see the response stream in, that's inference happening in real time.
Related terms
Token
The smallest unit of text an AI model processes, usually a subword, a whole word, or a single character.
Temperature
A parameter that controls how random or deterministic an AI's responses are (near 0 is almost deterministic; above 1 is increasingly random). See the sketch after this list.
Latency
The time between sending a prompt and receiving the AI model's response.
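To make the Temperature entry above concrete, here is a minimal sketch of temperature scaling in Python. The logit values are made up for illustration: dividing logits by a low temperature sharpens the probability distribution toward the top token, while a high temperature flattens it.

```python
# Minimal sketch of temperature scaling: divide logits by the temperature
# before softmax. Low T sharpens the distribution; high T flattens it.
# The logit values below are made up for illustration.
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # raw scores for four tokens

for temperature in (0.2, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p.item(), 3) for p in probs]}")
```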