OpenAI now offers prompt caching, a feature that can significantly reduce both latency and costs for your API requests. Caching applies automatically to prompts exceeding 1024 tokens and can cut latency by up to 80% for long prompts (over 10,000 tokens).
Prompt caching is enabled for the following models:
gpt-4o (excludes gpt-4o-2024-05-13)
gpt-4o-mini
o1-preview
o1-mini
Portkey supports OpenAI’s prompt caching feature out of the box. Here is an example of how to use it:
```python
import json

from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="@OPENAI_PROVIDER",
)

# Define tools (for the function calling example)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

# Example: function calling with caching. Caching kicks in automatically once
# the prompt prefix (messages + tools) exceeds 1024 tokens, so keep the static
# content at the start of the prompt.
response = portkey.chat.completions.create(
    model="gpt-4o",  # use a caching-enabled model (see the list above)
    messages=[
        {"role": "system", "content": "You are a helpful assistant that can check the weather."},
        {"role": "user", "content": "What's the weather like in San Francisco?"},
    ],
    tools=tools,
    tool_choice="auto",
)

print(json.dumps(response.model_dump(), indent=2))
```
The following parts of a request can be cached:
Messages: The complete messages array, encompassing system, user, and assistant interactions.
Images: Images included in user messages, either as links or as base64-encoded data; multiple images can also be sent. Ensure the detail parameter is set identically across requests, as it impacts image tokenization (see the sketch after this list).
Tool use: Both the messages array and the list of available tools can be cached, contributing to the minimum 1024 token requirement.
Structured outputs: The structured output schema serves as a prefix to the system message and can be cached.
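For instance, an image input and a structured output schema can both sit in the cached prefix. The sketch below reuses the portkey client from the earlier example; the image URL, JSON schema, and prompt contents are illustrative assumptions rather than part of the Portkey API.

```python
# Sketch: a request whose cacheable prefix includes an image and a
# structured-output schema (URL and schema below are illustrative).
response = portkey.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Describe products for a catalog."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this product image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/product.png",
                        # Keep `detail` identical across requests, since it
                        # changes how the image is tokenized.
                        "detail": "high",
                    },
                },
            ],
        },
    ],
    # The structured output schema is prefixed to the system message,
    # so it is cached along with the rest of the prompt.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "product_description",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "summary": {"type": "string"},
                },
                "required": ["name", "summary"],
                "additionalProperties": False,
            },
        },
    },
)
```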
Cache usage for requests and responses is reported based on OpenAI’s calculations:
All requests, including those with fewer than 1024 tokens, will display a cached_tokens field within usage.prompt_tokens_details on the chat completions object, indicating how many of the prompt tokens were a cache hit (see the example below).
For requests under 1024 tokens, cached_tokens will be zero.
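To verify that a request hit the cache, inspect the usage block on the response. A minimal sketch, assuming the portkey client defined earlier (the dict-style access simply follows OpenAI’s response shape):

```python
# Sketch: check how much of the prompt hit the cache.
# Assumes the `portkey` client from the earlier example.
response = portkey.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the weather tooling we discussed."}],
)

usage = response.model_dump().get("usage", {}) or {}
details = usage.get("prompt_tokens_details") or {}
cached = details.get("cached_tokens", 0)

print(f"prompt tokens: {usage.get('prompt_tokens')}, cached tokens: {cached}")
# cached will be 0 for prompts under 1024 tokens.
```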
Key Features:
Reduced Latency: Especially significant for longer prompts.
Lower Costs: Cached portions of prompts are billed at a discounted rate (see the rough cost sketch after this list).
Improved Efficiency: Allows for more context in prompts without increasing costs proportionally.
Zero Data Retention: No data is stored during the caching process, making it eligible for zero data retention policies.
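As a rough illustration of the cost effect: suppose a 12,000-token prompt has 10,000 tokens served from the cache. The numbers below, including the 50% discount on cached input tokens, are illustrative assumptions; check OpenAI’s current pricing for exact rates.

```python
# Rough cost sketch (prices and discount are illustrative assumptions).
input_price_per_1m = 2.50   # hypothetical $ per 1M input tokens
cached_discount = 0.50      # assumed discount on cached input tokens

prompt_tokens = 12_000
cached_tokens = 10_000      # as reported in usage.prompt_tokens_details
uncached_tokens = prompt_tokens - cached_tokens

cost_with_cache = (
    uncached_tokens * input_price_per_1m
    + cached_tokens * input_price_per_1m * (1 - cached_discount)
) / 1_000_000

cost_without_cache = prompt_tokens * input_price_per_1m / 1_000_000
print(f"with caching: ${cost_with_cache:.6f}, without: ${cost_without_cache:.6f}")
```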