Prompt Caching Across LLM Providers: What Translates and What Doesn't

Prompt caching changes the economics of LLM applications with long contexts. Anthropic charges roughly 1.25x the input price to write to cache and 0.1x to read from it. OpenAI automatically caches prompt prefixes of 1024 tokens or more and bills the cached portion at roughly 0.5x of the input price. RAG applications with 20KB system prompts can see effective input cost drop by 8x to 10x.

But the rules of who-caches-what differ enough across providers that a gateway can’t just pass the feature through — it has to actively translate, and some of the translation is lossy.

This post catalogs what works when you bridge prompt cache semantics across OpenAI, Anthropic, and Gemini, and the edge cases that get billing wrong if you skip them.

How each provider does caching

Anthropic. Explicit cache_control: { type: "ephemeral" } markers on message blocks (system, tools, individual messages). Two TTL options: 5 minutes (~1.25x write, ~0.1x read) and 1 hour (~2x write, ~0.1x read). Up to 4 cache breakpoints per request. Breakpoint placement is deterministic: a cache hit requires the prefix to match exactly up to the breakpoint.

Usage is reported with separate fields: cache_creation_input_tokens (cache write), cache_read_input_tokens (cache hit), and TTL-split sub-fields (cache_creation_5m_input_tokens, cache_creation_1h_input_tokens) when both TTLs are in play.
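A minimal sketch of the request shape; the model name and the LONG_SYSTEM_PROMPT placeholder are illustrative, not anything prescribed by the API:

# Anthropic Messages API: mark the end of the stable prefix with cache_control.
# Setting "ttl": "1h" selects the 1-hour tier; omit it for the default 5 minutes.
request_body = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # e.g. the 20KB RAG preamble
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    "messages": [{"role": "user", "content": "latest user turn goes here"}],
}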

OpenAI. Automatic caching for prompts ≥1024 tokens. No manual control. Single pricing tier (cached portion at roughly 0.5x of regular input price, varies slightly by model). There is no hard TTL guarantee: in practice, caches are evicted after roughly 5–10 minutes of inactivity, occasionally persisting up to an hour.

Usage is reported as prompt_tokens_details.cached_tokens — a single number with no write/read distinction (OpenAI charges no premium for cache writes, so there is nothing to split).
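Reading the hit back out of the Python SDK is one field access (the client and messages objects are assumed to already exist):

resp = client.chat.completions.create(model="gpt-4o", messages=messages)
details = resp.usage.prompt_tokens_details
cached = details.cached_tokens if details else 0  # 0 on a cache miss
uncached = resp.usage.prompt_tokens - cached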

Gemini / Vertex. Explicit cached_content API: clients pre-create a cache object, get an ID, and reference it in subsequent generate calls. TTL configurable, minimum 60 seconds, max 1 hour (extends on each use). Pricing runs about 0.25x of regular input price for the cached portion, plus a per-hour storage fee proportional to the cached size.

Different API surface entirely — separate management endpoints, separate billing line.
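A sketch of that lifecycle as REST-style payloads; the field names follow the cachedContents endpoint, while LONG_CONTEXT and the returned cache name are illustrative:

# Step 1: create the cache object (POST .../v1beta/cachedContents).
create_cache = {
    "model": "models/gemini-1.5-flash-001",  # explicit caching needs a pinned model version
    "contents": [{"role": "user", "parts": [{"text": LONG_CONTEXT}]}],
    "ttl": "300s",
}
# The response returns a handle, e.g. {"name": "cachedContents/abc123", ...}.

# Step 2: reference the handle on each generateContent call.
generate = {
    "cachedContent": "cachedContents/abc123",
    "contents": [{"role": "user", "parts": [{"text": "question about the cached context"}]}],
}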

The translation matrix

When a gateway bridges, here’s what actually carries across in each direction:

From → To | What carries | What's lost
--- | --- | ---
Anthropic → OpenAI Chat | cache_read aggregated into cached_tokens; cache_creation merged into prompt_tokens | TTL distinction; multi-block breakpoints
OpenAI → Anthropic | Nothing automatic — clients can opt in via custom header | OpenAI's automatic cache becomes nothing
Anthropic ↔ Gemini | Nothing direct (lifecycle is too different) | Everything — would require a Gemini cache management call

The biggest leaky abstraction: OpenAI clients hitting Anthropic backends. The OpenAI SDK has no cache_control concept. Two ways to bridge:

  1. Auto-detect. Gateway examines system prompt length and inserts cache_control: ephemeral when above a threshold. Simple but heavy-handed — caches things the client didn’t intend to.
  2. Explicit header. Clients pass x-anthropic-cache-control signaling intent. More work for the client but precise.

Option 2 makes intent explicit and avoids silent over-caching. It also makes the client's cost model predictable — auto-detect can cost more than no caching when the prompt changes between requests, because every request then pays the ~1.25x write premium without ever earning the 0.1x read discount.
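A sketch of the header-driven bridge; the header value "ephemeral" and the helper name are this post's assumptions, not an OpenAI or Anthropic convention:

def inject_cache_control(headers, anthropic_body):
    # Only cache when the client explicitly opted in via the custom header.
    if headers.get("x-anthropic-cache-control") != "ephemeral":
        return anthropic_body
    # Mark the last system block as the breakpoint: everything up to and
    # including it becomes the cacheable prefix on the Anthropic side.
    system = anthropic_body.get("system")
    if isinstance(system, list) and system:
        system[-1]["cache_control"] = {"type": "ephemeral"}
    return anthropic_body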

Cache affinity across pooled provider keys

A gateway-specific concern that’s invisible if you’ve only used direct API access. Most gateways maintain a pool of upstream provider keys for rate-limit headroom. Round-robin across the pool is the default.

But each upstream key has its own cache state. If request 1 lands on key A and creates a cache entry, request 2 needs to also land on key A to get a hit. Round-robin breaks this.

The fix: when a request has cache_control set, hash the workspace ID and deterministically pick a key. Subsequent requests from the same workspace land on the same key, preserving cache continuity.

import hashlib

def pick_key(workspace_id, request, key_pool):
    if has_cache_control(request):
        # Stable hash: Python's built-in hash() is salted per process,
        # which would break affinity across gateway workers and restarts.
        digest = hashlib.sha256(workspace_id.encode()).digest()
        idx = int.from_bytes(digest[:8], "big") % len(key_pool)
        return key_pool[idx]
    return round_robin(key_pool)

Two failure modes if you skip this:

  • Cache miss rate stays high even though clients do everything right
  • Billing shows lots of cache_creation_input_tokens and almost no cache_read_input_tokens

If you ever debug a “why am I always paying cache write cost?” issue on a multi-key gateway, this is the cause in 80% of cases.

Cost accounting edge cases

A few corners where naive cost calculation goes wrong.

TTL pricing forks

Anthropic prices 5-minute and 1-hour caches differently. If your billing logic uses one price for “cache write” regardless of TTL, you’re silently mis-charging.

The fix: store separate columns (cache_write_5m_input_tokens, cache_write_1h_input_tokens), apply different price multipliers, and surface the breakdown in audit logs.
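A minimal settlement sketch, assuming the column names above and a per-token base_input_price; the multipliers mirror Anthropic's published ratios:

# Per-token multipliers relative to the model's base input price.
MULTIPLIERS = {
    "input_tokens": 1.0,
    "cache_write_5m_input_tokens": 1.25,
    "cache_write_1h_input_tokens": 2.0,
    "cache_read_input_tokens": 0.10,
}

def cache_aware_cost(usage, base_input_price):
    # usage maps each column above to a token count; missing columns count as zero.
    return sum(usage.get(col, 0) * mult * base_input_price
               for col, mult in MULTIPLIERS.items())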

BYOK + cache hit = $0

Covered in BYOK Billing Invariants. Cache hits in BYOK mode never reach upstream, so the gateway charges no surcharge.

Failed attempts with partial cache reads

If a streaming response 5xx’s after some cache_read_input_tokens have already been recorded in upstream usage, those tokens were charged by Anthropic. Your retry logic should:

  • Track the partial usage from the failed attempt
  • Merge it into the final settlement (not just the successful attempt’s usage)

Otherwise you accumulate untracked upstream spend: in platform mode the customer never sees usage they should have been charged for, and in BYOK mode the customer silently pays upstream for tokens that never appear in the gateway's usage reports.
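One way to keep the failed attempt on the books; a sketch where the usage dicts follow Anthropic's field names and the attempt variables are illustrative:

def settle(attempt_usages):
    # attempt_usages: usage dicts from every upstream attempt, failed or successful.
    fields = ("input_tokens", "output_tokens",
              "cache_creation_input_tokens", "cache_read_input_tokens")
    return {f: sum(u.get(f, 0) for u in attempt_usages) for f in fields}

# A retried stream: the 5xx'd attempt already consumed cache reads upstream.
total_usage = settle([failed_attempt_usage, successful_attempt_usage])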

First request after key rotation

When an upstream key rotates (admin replaces it in the gateway pool), the new key has no cache state. The first request after rotation pays cache-write cost even if the client expected hits. This is unavoidable, but worth surfacing in admin tooling so the cost spike isn’t surprising.

Testing strategy

Six test cases catch most cache-related bugs:

  1. Anthropic + cache_control + repeat request → second request’s usage includes non-zero cache_read_input_tokens
  2. Anthropic 5m cache + wait 6 minutes + repeat → cache_read_input_tokens = 0 (cache expired)
  3. Multi-key pool + cache_control + 10 sequential same-workspace requests → all land on the same upstream key
  4. OpenAI client → Anthropic backend with x-anthropic-cache-control → upstream request body contains the injected cache_control block
  5. BYOK + cache hit → quota_charged = 0
  6. Mid-stream 5xx with cache reads → failed-attempt usage is merged into settlement (not silently lost)

Each is mechanical to write and asserts a one-line invariant.
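For example, test 3 as a pytest sketch; gateway, key_tracker, and make_request are stand-ins for whatever fixtures your harness provides:

def test_cache_affinity_pins_workspace_to_one_key(gateway, key_tracker):
    # Ten sequential cached requests from one workspace...
    for _ in range(10):
        gateway.handle(make_request(workspace_id="ws_1", cache_control=True))
    # ...should all be routed to the same upstream key.
    assert len(set(key_tracker.keys_used)) == 1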

A note on API design

The honest read: prompt caching APIs across providers are still in flux. OpenAI’s automatic-only design will likely gain manual controls; Anthropic’s manual-only design may add automatic fallbacks; Gemini’s separate-endpoint model is the outlier and may converge toward inline cache_control.

A gateway’s job is to absorb those changes so client code doesn’t need to. If you’re picking a gateway, ask how they handle a new TTL tier appearing — the answer should involve a config change, not a code release.

Closing

Prompt caching is the highest-ROI optimization available to anyone running expensive LLM workloads. The savings only materialize if the gateway correctly translates intent across providers and accurately accounts for the resulting cost split.

Most of that work isn’t visible to clients — that’s exactly what a gateway is for. But it has to actually do the work, not just pass the cache_control flag through and hope for the best.
