Prompt Caching
A technique that stores repeated sections of a prompt server-side so they do not need to be reprocessed on every API call, reducing latency and cost.
Prompt caching allows API users to mark static sections of their prompt — system instructions, document context, few-shot examples — as cacheable. The first call processes and caches that section. Subsequent calls reuse the cache rather than reprocessing the tokens, which reduces input token costs (typically 90% discount on cached tokens) and speeds up response time.
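A minimal sketch of how a static section can be marked cacheable with Anthropic's Messages API in Python (the model ID, file name, and prompt text here are illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical long, static instruction set reused across many requests.
LONG_SYSTEM_PROMPT = open("instructions.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is cached; later calls
            # with an identical prefix reuse the cache instead of reprocessing it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarise clause 4.2 of the contract."}],
)
print(response.content[0].text)
```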
When prompt caching is most valuable:

- Long system prompts used across many requests (e.g. a 10,000-token instruction set)
- Documents or codebases included as context in every call
- Repeated few-shot examples used for consistent output formatting
- RAG pipelines that inject the same large knowledge base into each request
Anthropic supports prompt caching on Claude models with a 5-minute cache TTL that refreshes each time the cached content is reused. The cache expires after 5 minutes of inactivity, and any change to the cached section results in a cache miss, so that section is reprocessed and cached again.
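One way to check whether a call wrote to or read from the cache is to inspect the usage block on the response (field names as exposed by the Anthropic Python SDK; `response` is the object from the sketch above):

```python
# First call within a TTL window writes to the cache; later calls read from it.
usage = response.usage
print("tokens written to cache:", usage.cache_creation_input_tokens)
print("tokens read from cache: ", usage.cache_read_input_tokens)
print("uncached input tokens:  ", usage.input_tokens)
```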
For production applications making many API calls with a consistent system prompt, prompt caching is one of the highest-leverage cost optimisations available.
Example
A legal AI tool includes a 50,000-token contract template in every prompt. With prompt caching enabled, only the first call pays full input price. Subsequent calls cost 90% less for that section — turning a $5 average call cost into $0.60.
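A rough sketch of the same discount logic, using placeholder per-token pricing and a hypothetical 1,000-token query on top of the cached template; actual figures depend on the model's rates and on output tokens, which caching does not affect:

```python
# Illustrative cost arithmetic (assumed prices; check current pricing for your model).
BASE_INPUT_PRICE = 3.00 / 1_000_000   # assumed $ per input token
CACHE_WRITE_MULT = 1.25               # cache writes typically cost ~25% more
CACHE_READ_MULT = 0.10                # cache reads typically cost ~90% less

def input_cost(cached_tokens: int, uncached_tokens: int, cache_hit: bool) -> float:
    """Input-token cost of one call, with the static section cached or not."""
    cached_mult = CACHE_READ_MULT if cache_hit else CACHE_WRITE_MULT
    return (cached_tokens * cached_mult + uncached_tokens) * BASE_INPUT_PRICE

template_tokens = 50_000   # the contract template from the example
query_tokens = 1_000       # assumed per-request prompt on top of the template

print(f"first call:  ${input_cost(template_tokens, query_tokens, cache_hit=False):.4f}")
print(f"later calls: ${input_cost(template_tokens, query_tokens, cache_hit=True):.4f}")
```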
Related terms
Token
The smallest unit of text an AI processes—usually a word fragment, character, or subword.
Inference
The process of a trained model generating a response to an input. When you chat with ChatGPT, that's inference.
Context Window
The maximum amount of text (in tokens) an AI model can process in a single conversation.