- gpt-4o (excludes gpt-4o-2024-05-13)
- gpt-4o-mini
- o1-preview
- o1-mini
What can be cached
- Messages: The complete messages array, encompassing system, user, and assistant interactions.
- Images: Images included in user messages, either as links or as base64-encoded data; multiple images can be sent. Ensure the detail parameter is set identically across requests, as it affects image tokenization.
- Tool use: Both the messages array and the list of available tools can be cached, contributing to the minimum 1024-token requirement.
- Structured outputs: The structured output schema serves as a prefix to the system message and can be cached. (A request sketch showing how to structure a cacheable prefix follows this list.)
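As a sketch of how this plays out in practice, the request below keeps a long system prompt and tool list as a fixed prefix and puts only the changing user message at the end, so the shared prefix can be served from the cache. It assumes the official openai Python SDK (v1.x); the lookup_order tool and the ask helper are hypothetical names for illustration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Static prefix: keep the long system prompt and the tool definitions
# identical across requests so the first 1024+ tokens can be cache hits.
SYSTEM_PROMPT = "You are a support assistant. " + "<long, stable instructions> " * 200

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order",  # hypothetical tool, for illustration only
            "description": "Look up an order by its ID.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]

def ask(question: str):
    # Only the final user message varies; everything before it is cacheable.
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        tools=TOOLS,
    )
```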
What’s Not Supported
- Completions API (only Chat Completions API is supported)
- Streaming responses (caching still works when streaming, since only the prompt is cached and the streamed output is unaffected; a streaming sketch follows this list)
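To illustrate that caching composes with streaming, the sketch below (reusing the client and SYSTEM_PROMPT from the earlier example) passes stream_options={"include_usage": True}, which makes the API append a final chunk with empty choices and a populated usage object, so cache statistics can still be inspected:

```python
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # same cacheable prefix
        {"role": "user", "content": "Summarize my open orders."},
    ],
    stream=True,
    stream_options={"include_usage": True},  # final chunk carries usage stats
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage:  # last chunk: choices is empty, usage is populated
        details = chunk.usage.prompt_tokens_details
        print("\ncached_tokens:", details.cached_tokens if details else 0)
```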
Monitoring Cache Performance
Per OpenAI's documentation, cache usage is reported in the cached_tokens field of usage.prompt_tokens_details on the chat completion object, which indicates how many of the prompt tokens were a cache hit. For requests under 1024 tokens, cached_tokens will be zero.
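A minimal sketch of reading this field, reusing the hypothetical ask() helper from the earlier example (SDK versions that predate prompt_tokens_details may return None for it, hence the guard):

```python
# Two calls sharing the same static prefix, sent within a few minutes.
first = ask("Where is order 1234?")
second = ask("Where is order 5678?")

for resp in (first, second):
    details = resp.usage.prompt_tokens_details
    cached = (details.cached_tokens or 0) if details else 0
    print(f"prompt_tokens={resp.usage.prompt_tokens} cached_tokens={cached}")

# Expect cached_tokens == 0 on the first call and > 0 on the second,
# provided the shared prefix is at least 1024 tokens long.
```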
Benefits of Prompt Caching
- Reduced Latency: Especially significant for longer prompts.
- Lower Costs: Cached portions of prompts are billed at a discounted rate (a rough cost sketch follows this list).
- Improved Efficiency: Allows for more context in prompts without increasing costs proportionally.
- Zero Data Retention: No data is stored during the caching process, making it eligible for zero data retention policies.
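As a rough illustration of the cost effect, the helper below estimates prompt cost under two assumptions that should be checked against current OpenAI pricing: $2.50 per million input tokens (gpt-4o at the time of writing) and a 50% discount on cached tokens.

```python
def prompt_cost(prompt_tokens: int, cached_tokens: int,
                usd_per_m: float = 2.50, cache_discount: float = 0.50) -> float:
    # Assumed pricing: $2.50 / 1M input tokens, cached tokens billed at 50%.
    uncached = prompt_tokens - cached_tokens
    return (uncached + cached_tokens * (1 - cache_discount)) * usd_per_m / 1_000_000

# Example: a 10,000-token prompt where 8,192 tokens were cache hits.
print(f"with cache:    ${prompt_cost(10_000, 8_192):.6f}")  # $0.014760
print(f"without cache: ${prompt_cost(10_000, 0):.6f}")      # $0.025000
```

With most of the prompt cached, the input cost in this example drops by roughly 40%; the savings grow with the share of the prompt that is a stable, cacheable prefix.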