OpenAI now offers prompt caching, a feature that can significantly reduce both latency and costs for your API requests. Caching applies automatically to prompts exceeding 1024 tokens and can cut latency by up to 80% for long prompts (over 10,000 tokens).
Prompt caching is enabled for the following models:
gpt-4o (excludes gpt-4o-2024-05-13)
gpt-4o-mini
o1-preview
o1-mini
Portkey supports OpenAI’s prompt caching feature out of the box. Here is an example of how to use it:
```python
import json

from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="@OPENAI_PROVIDER",
)

# Define tools (for the function calling example)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

# Example: function calling with caching. Caching kicks in automatically once
# the prompt prefix (messages + tools) exceeds 1024 tokens, so keep the static
# content at the start of the prompt.
response = portkey.chat.completions.create(
    model="gpt-4o",  # use a caching-enabled model (see the list above)
    messages=[
        {"role": "system", "content": "You are a helpful assistant that can check the weather."},
        {"role": "user", "content": "What's the weather like in San Francisco?"},
    ],
    tools=tools,
    tool_choice="auto",
)

print(json.dumps(response.model_dump(), indent=2))
```
The following parts of a request can be cached:
Messages: The complete messages array, encompassing system, user, and assistant interactions.
Images: Images included in user messages, either as links or as base64-encoded data; multiple images can also be sent. Ensure the detail parameter is set identically across requests, as it impacts image tokenization (see the sketch after this list).
Tool use: Both the messages array and the list of available tools can be cached, contributing to the minimum 1024 token requirement.
Structured outputs: The structured output schema serves as a prefix to the system message and can be cached.
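For instance, an image input and a structured output schema can both sit in the cached prefix. The sketch below reuses the portkey client from the earlier example; the image URL, JSON schema, and prompt contents are illustrative assumptions rather than part of the Portkey API.

```python
# Sketch: a request whose cacheable prefix includes an image and a
# structured-output schema (URL and schema below are illustrative).
response = portkey.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Describe products for a catalog."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this product image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/product.png",
                        # Keep `detail` identical across requests, since it
                        # changes how the image is tokenized.
                        "detail": "high",
                    },
                },
            ],
        },
    ],
    # The structured output schema is prefixed to the system message,
    # so it is cached along with the rest of the prompt.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "product_description",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "summary": {"type": "string"},
                },
                "required": ["name", "summary"],
                "additionalProperties": False,
            },
        },
    },
)
```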
Cache usage for requests and responses is reported based on OpenAI’s calculations:
All requests, including those with fewer than 1024 tokens, will display a cached_tokens field within usage.prompt_tokens_details on the chat completions object, indicating how many of the prompt tokens were a cache hit (see the example below).
For requests under 1024 tokens, cached_tokens will be zero.
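To verify that a request hit the cache, inspect the usage block on the response. A minimal sketch, assuming the portkey client defined earlier (the dict-style access simply follows OpenAI’s response shape):

```python
# Sketch: check how much of the prompt hit the cache.
# Assumes the `portkey` client from the earlier example.
response = portkey.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the weather tooling we discussed."}],
)

usage = response.model_dump().get("usage", {}) or {}
details = usage.get("prompt_tokens_details") or {}
cached = details.get("cached_tokens", 0)

print(f"prompt tokens: {usage.get('prompt_tokens')}, cached tokens: {cached}")
# cached will be 0 for prompts under 1024 tokens.
```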
Key Features:
Reduced Latency: Especially significant for longer prompts.
Lower Costs: Cached portions of prompts are billed at a discounted rate (see the rough cost sketch after this list).
Improved Efficiency: Allows for more context in prompts without increasing costs proportionally.
Zero Data Retention: No data is stored during the caching process, making it eligible for zero data retention policies.
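As a rough illustration of the cost effect: suppose a 12,000-token prompt has 10,000 tokens served from the cache. The numbers below, including the 50% discount on cached input tokens, are illustrative assumptions; check OpenAI’s current pricing for exact rates.

```python
# Rough cost sketch (prices and discount are illustrative assumptions).
input_price_per_1m = 2.50   # hypothetical $ per 1M input tokens
cached_discount = 0.50      # assumed discount on cached input tokens

prompt_tokens = 12_000
cached_tokens = 10_000      # as reported in usage.prompt_tokens_details
uncached_tokens = prompt_tokens - cached_tokens

cost_with_cache = (
    uncached_tokens * input_price_per_1m
    + cached_tokens * input_price_per_1m * (1 - cached_discount)
) / 1_000_000

cost_without_cache = prompt_tokens * input_price_per_1m / 1_000_000
print(f"with caching: ${cost_with_cache:.6f}, without: ${cost_without_cache:.6f}")
```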