Prompt Caching
A technique that stores repeated sections of a prompt server-side so they do not need to be reprocessed on every API call, reducing latency and cost.
Prompt caching allows API users to mark static sections of their prompt — system instructions, document context, few-shot examples — as cacheable. The first call processes and caches that section. Subsequent calls reuse the cache rather than reprocessing the tokens, which reduces input token costs (typically 90% discount on cached tokens) and speeds up response time.
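A minimal sketch of how a static section can be marked cacheable with Anthropic's Messages API in Python (the model ID, file name, and prompt text here are illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical long, static instruction set reused across many requests.
LONG_SYSTEM_PROMPT = open("instructions.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is cached; later calls
            # with an identical prefix reuse the cache instead of reprocessing it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarise clause 4.2 of the contract."}],
)
print(response.content[0].text)
```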
When prompt caching is most valuable:

- Long system prompts used across many requests (e.g. a 10,000-token instruction set)
- Documents or codebases included as context in every call
- Repeated few-shot examples used for consistent output formatting
- RAG pipelines that inject the same large knowledge base into each request
Anthropic supports prompt caching on Claude models with a 5-minute cache TTL that refreshes each time the cached content is reused. The cache expires after 5 minutes of inactivity, and any change to the cached section results in a cache miss, so that section is reprocessed and cached again.
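One way to check whether a call wrote to or read from the cache is to inspect the usage block on the response (field names as exposed by the Anthropic Python SDK; `response` is the object from the sketch above):

```python
# First call within a TTL window writes to the cache; later calls read from it.
usage = response.usage
print("tokens written to cache:", usage.cache_creation_input_tokens)
print("tokens read from cache: ", usage.cache_read_input_tokens)
print("uncached input tokens:  ", usage.input_tokens)
```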
For production applications making many API calls with a consistent system prompt, prompt caching is one of the highest-leverage cost optimisations available.
Example
A legal AI tool includes a 50,000-token contract template in every prompt. With prompt caching enabled, only the first call pays full input price. Subsequent calls cost 90% less for that section — turning a $5 average call cost into $0.60.
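A rough sketch of the same discount logic, using placeholder per-token pricing and a hypothetical 1,000-token query on top of the cached template; actual figures depend on the model's rates and on output tokens, which caching does not affect:

```python
# Illustrative cost arithmetic (assumed prices; check current pricing for your model).
BASE_INPUT_PRICE = 3.00 / 1_000_000   # assumed $ per input token
CACHE_WRITE_MULT = 1.25               # cache writes typically cost ~25% more
CACHE_READ_MULT = 0.10                # cache reads typically cost ~90% less

def input_cost(cached_tokens: int, uncached_tokens: int, cache_hit: bool) -> float:
    """Input-token cost of one call, with the static section cached or not."""
    cached_mult = CACHE_READ_MULT if cache_hit else CACHE_WRITE_MULT
    return (cached_tokens * cached_mult + uncached_tokens) * BASE_INPUT_PRICE

template_tokens = 50_000   # the contract template from the example
query_tokens = 1_000       # assumed per-request prompt on top of the template

print(f"first call:  ${input_cost(template_tokens, query_tokens, cache_hit=False):.4f}")
print(f"later calls: ${input_cost(template_tokens, query_tokens, cache_hit=True):.4f}")
```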
Related terms
Token
The smallest unit of text an AI processes—usually a word fragment, character, or subword.
Inference
The process of a trained model generating a response to an input. When you chat with ChatGPT, that's inference.
Context Window
The maximum amount of text (in tokens) an AI model can process in a single conversation.