BYOK Pricing for LLM Gateways: The Billing Invariants That Matter
When you’re shopping for an LLM gateway, the “BYOK supported” checkbox looks self-explanatory. Users plug in their OpenAI key, requests flow through using that key, the gateway takes a cut. Done.
In practice, BYOK billing is where the largest class of correctness bugs in production gateways lives. The financial impact of getting it wrong compounds, because mistakes are rarely caught until someone audits a customer invoice three months later.
This post catalogs the five billing invariants our production gateway enforces with automated tests, and what breaks when each one is violated.
Recap: the two ends of the BYOK spectrum
Platform-keyed. Gateway holds upstream keys, pays providers, charges users the upstream price plus margin. Standard SaaS resale.
BYOK (Bring Your Own Key). User holds upstream keys, pays providers directly. Gateway charges only for the value it adds — usually one of:
- A percentage surcharge on each request (e.g. 5% of upstream list price)
- A flat monthly subscription independent of usage
- A combination (subscription + per-request)
- Free below some monthly request count, surcharge above
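To make these models concrete, here is a minimal Go sketch of what the gateway bills a BYOK customer in one month under each of them. The PricingPlan type, integer-cent amounts, and basis-point rates (500 = 5%) are illustrative assumptions, not any real gateway's schema. The sketch also assumes the input slice holds successful requests only.

```go
package main

import "fmt"

// PricingPlan is a hypothetical BYOK pricing plan. Amounts are integer cents;
// SurchargeBps is the surcharge in basis points of upstream list price.
type PricingPlan struct {
	MonthlyFeeCents int64 // flat subscription, 0 if none
	SurchargeBps    int64 // per-request surcharge rate, 0 if none
	FreeRequests    int64 // requests per month exempt from the surcharge
}

// MonthlyCharge returns the gateway's bill for one month, given the upstream
// list price (in cents) of each successful request in order. The upstream
// cost itself is paid by the customer directly and is never added here.
func MonthlyCharge(p PricingPlan, listPriceCents []int64) int64 {
	total := p.MonthlyFeeCents
	for i, price := range listPriceCents {
		if int64(i) < p.FreeRequests {
			continue // still inside the free tier
		}
		total += price * p.SurchargeBps / 10000 // integer division rounds down
	}
	return total
}

func main() {
	reqs := []int64{200, 200, 200, 200} // four requests at $2.00 list price
	plans := []PricingPlan{
		{SurchargeBps: 500},                        // percentage surcharge
		{MonthlyFeeCents: 2900},                    // flat subscription
		{MonthlyFeeCents: 1000, SurchargeBps: 500}, // combination
		{SurchargeBps: 500, FreeRequests: 2},       // free tier, then surcharge
	}
	for _, p := range plans {
		fmt.Println(MonthlyCharge(p, reqs))
	}
}
```

With these inputs the four plans bill 40, 2900, 1040, and 20 cents respectively.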
Most gateways support both modes. The five invariants below apply to the BYOK path specifically.
Invariant 1: Cache hits in BYOK mode cost zero
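When the gateway’s own response cache hits, no upstream API call happens. The gateway never forwards the request to the provider; it serves the stored response itself. Provider-side prompt caching is a different thing: with Anthropic’s prompt caching, the request still reaches Anthropic and cached input tokens are billed at a reduced rate, so for settlement purposes it is an ordinary upstream call.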
The natural mistake: settle billing as if it were a normal request, applying the BYOK surcharge to the upstream price.
```python
quota_debit = base_cost * surcharge_rate  # wrong on a cache hit
```
This charges the customer for a cost that never occurred. Ask what they would be paying for:
- The upstream provider, for the cached request? No. The provider never saw it.
- A gateway surcharge on… what? There’s no upstream cost to surcharge.
The correct behavior:
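```python
if is_cache_hit:
    quota_debit = 0  # no upstream call, so no base cost to surcharge
else:
    quota_debit = base_cost * surcharge_rate
```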
The gateway did provide real value (the cache infrastructure), but the BYOK contract is “you pay upstream cost + surcharge”. On a cache hit the upstream cost is zero, so the surcharge computed from it is zero too.
Testing this is mechanical: send the same request twice with the gateway’s response cache enabled, confirm the second one never reached upstream, and verify its billing log has quota_charged = 0.
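The rule fits in a few lines. A minimal Go sketch, assuming quota is tracked in integer units and the surcharge rate in basis points (500 = 5%); the Request type and its fields are hypothetical.

```go
package main

import "fmt"

// Request holds the billing-relevant facts of one BYOK request (hypothetical).
type Request struct {
	BaseCost   int64 // upstream list price, in quota units
	IsCacheHit bool  // served from the gateway's own response cache
}

// SurchargeDebit returns the gateway's surcharge for a BYOK request.
// A cache hit never reached upstream, so there is no base cost to surcharge.
func SurchargeDebit(r Request, surchargeBps int64) int64 {
	if r.IsCacheHit {
		return 0
	}
	return r.BaseCost * surchargeBps / 10000 // integer division rounds down
}

func main() {
	miss := Request{BaseCost: 20000}
	hit := Request{BaseCost: 20000, IsCacheHit: true}
	fmt.Println(SurchargeDebit(miss, 500)) // 1000
	fmt.Println(SurchargeDebit(hit, 500))  // 0
}
```

Using integer quota units rather than floats keeps the debit exact and reproducible, which matters once customers start reconciling invoices.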
Invariant 2: Failed upstream attempts are tracked, but billed differently
If a streaming request gets a 5xx from upstream after 200 tokens have been generated, the customer’s app sees an error. From the customer’s perspective they got nothing. But the upstream provider did process and charge for those 200 tokens.
You have three options:
- Charge the customer for the failed tokens. Honest, but feels bad to the customer who saw nothing useful.
- Eat the cost yourself. Generous but unsustainable at scale — bad actors can intentionally cause failures to drain you.
- Track failed-attempt token usage separately from success. Don’t bill the customer for it, but record the underlying upstream cost so you can detect abuse.
Mature gateways go with option 3. The implementation: a separate failed_count counter in the byok_usage table, parallel to request_count, with the failed attempts’ token usage recorded alongside it. The customer dashboard shows request_count only (their billable usage); operations gets failed_count visibility for abuse detection.
The pathological case to guard against: a customer scripting “always fail immediately after first token” to get effectively free inference. Without tracking failed-attempt tokens, this attack is invisible.
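A minimal Go sketch of option 3: successes and failures go to separate counters, and only the failure side feeds an abuse check. The ByokUsage struct, the FailedTokens field, and the threshold values are hypothetical; a real system would pick thresholds from its own traffic.

```go
package main

import "fmt"

// ByokUsage mirrors request_count and failed_count, plus the token usage of
// failed attempts. The struct and field names are hypothetical.
type ByokUsage struct {
	RequestCount int64 // successful requests: billable, shown to the customer
	FailedCount  int64 // failed upstream attempts: not billed, ops-only
	FailedTokens int64 // tokens upstream processed before each failure
}

// RecordOutcome updates the counters after an upstream attempt finishes.
// Token usage of successful requests is billed through the normal path.
func (u *ByokUsage) RecordOutcome(ok bool, tokensGenerated int64) {
	if ok {
		u.RequestCount++
		return
	}
	u.FailedCount++
	u.FailedTokens += tokensGenerated
}

// LooksAbusive flags a customer whose failures dominate their traffic.
// minAttempts and maxFailRatio are illustrative placeholders.
func (u *ByokUsage) LooksAbusive(minAttempts int64, maxFailRatio float64) bool {
	total := u.RequestCount + u.FailedCount
	if total < minAttempts {
		return false
	}
	return float64(u.FailedCount)/float64(total) > maxFailRatio
}

func main() {
	var u ByokUsage
	for i := 0; i < 90; i++ {
		u.RecordOutcome(false, 1) // "fail right after the first token"
	}
	for i := 0; i < 10; i++ {
		u.RecordOutcome(true, 500)
	}
	fmt.Println(u.RequestCount, u.FailedCount, u.FailedTokens, u.LooksAbusive(50, 0.5))
}
```

The customer-facing number, RequestCount, never includes the 90 failures; only operations sees FailedCount and FailedTokens, which is exactly what makes the scripted-failure pattern visible.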
Invariant 3: Surcharge and base cost live in different ledgers
The gateway charges a surcharge. The customer pays upstream directly. These are two separate flows of money that must never be mixed.
The temptation is code like:
```python
total = base_cost + surcharge
ws.used_quota += total  # wrong
```
base_cost is what upstream charges, paid by the customer to OpenAI / Anthropic
directly. It must never appear in the gateway’s quota ledger.
The correct accounting:
```python
# ws.used_quota is not touched at all (see Invariant 5)
billing_log.is_byok = True                      # explicit flag, not heuristic
billing_log.list_price_quota = base_cost        # for the user's audit display
billing_log.quota = base_cost * surcharge_rate  # what the gateway actually charged
```
Three values, three ledgers. The customer invoice shows the gateway’s surcharge separately from a memo line: “you also paid ~$X to upstream directly (for your records)”.
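A minimal Go sketch of the billing record and the invoice lines it feeds. The field names follow the pseudocode above; the int64 quota units, the basis-point rate, and treating one quota unit as one cent when rendering the invoice are assumptions for illustration.

```go
package main

import "fmt"

// BillingLog mirrors the billing_log fields named above.
type BillingLog struct {
	IsBYOK         bool  // explicit flag, never inferred from amounts
	ListPriceQuota int64 // base cost, paid upstream directly; display only
	Quota          int64 // surcharge: what the gateway actually charged
}

// NewBYOKLog records base cost and surcharge in separate fields.
// They are never summed into a single charge.
func NewBYOKLog(baseCost, surchargeBps int64) BillingLog {
	return BillingLog{
		IsBYOK:         true,
		ListPriceQuota: baseCost,
		Quota:          baseCost * surchargeBps / 10000,
	}
}

// dollars formats a non-negative cent amount as $D.CC.
func dollars(cents int64) string {
	return fmt.Sprintf("$%d.%02d", cents/100, cents%100)
}

// InvoiceLines renders the gateway's charge and the upstream memo line.
func InvoiceLines(l BillingLog) (charge, memo string) {
	charge = "Gateway surcharge: " + dollars(l.Quota)
	memo = "You also paid ~" + dollars(l.ListPriceQuota) + " to upstream directly (for your records)"
	return charge, memo
}

func main() {
	charge, memo := InvoiceLines(NewBYOKLog(2000, 500))
	fmt.Println(charge) // Gateway surcharge: $1.00
	fmt.Println(memo)   // You also paid ~$20.00 to upstream directly (for your records)
}
```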
If your gateway’s billing log doesn’t have an explicit is_byok boolean and instead relies on a heuristic like quota_charged > 0 ? platform : byok, you have a latent bug. Free models, free-tier usage, and cache hits all set quota_charged = 0 in platform mode and will be misclassified as BYOK; conversely, any BYOK request that carries a surcharge will be misclassified as platform.
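A short Go sketch of the failure mode, using a hypothetical LogEntry row: the heuristic gets a zero-charge platform entry wrong, and it also gets a surcharged BYOK entry wrong, while reading the explicit flag classifies both correctly.

```go
package main

import "fmt"

type Mode string

const (
	Platform Mode = "platform"
	BYOK     Mode = "byok"
)

// LogEntry is a hypothetical billing log row.
type LogEntry struct {
	IsBYOK       bool
	QuotaCharged int64
}

// classifyHeuristic is the latent bug: it infers the mode from the amount.
func classifyHeuristic(e LogEntry) Mode {
	if e.QuotaCharged > 0 {
		return Platform
	}
	return BYOK
}

// classifyExplicit reads the flag written at settlement time.
func classifyExplicit(e LogEntry) Mode {
	if e.IsBYOK {
		return BYOK
	}
	return Platform
}

func main() {
	entries := []LogEntry{
		{IsBYOK: false, QuotaCharged: 0},  // platform cache hit or free model
		{IsBYOK: true, QuotaCharged: 150}, // BYOK request with a surcharge
	}
	for _, e := range entries {
		fmt.Println(classifyHeuristic(e), classifyExplicit(e))
	}
}
```

Each printed line shows the heuristic's answer first and the flag's answer second; they disagree on both rows.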
Invariant 4: Free tier counts successful upstream calls only
A common BYOK monetization model: free for the first N requests per month,
surcharge afterward. The counter that drives this — byok_usage.request_count
— must only increment on successful upstream calls.
What gets this wrong: incrementing the counter at request entry, before knowing whether upstream succeeded. When a flaky upstream triggers a retry storm, the customer’s counter blows past the free tier, and they start getting surcharged for failures that delivered them nothing.
The correct sequence:
1. Pre-flight: check whether BYOK is eligible for this request.
2. Dispatch to upstream.
3. Settle on the upstream outcome:
   - Upstream returns 200: increment byok_usage.request_count (consumes free tier)
   - Upstream returns 5xx: increment byok_usage.failed_count (does NOT consume free tier)
Cache hits are the one deliberate exception to this rule: a cache hit means no upstream call, but the customer perceives a successful BYOK request. We increment request_count (consuming free tier) and apply Invariant 1 (zero surcharge).
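A minimal Go sketch of this sequence, including the cache-hit case. It assumes the outcome is already known when Settle runs (the pre-flight eligibility check is omitted), and the Outcome and Usage types, integer quota units, and basis-point rate are hypothetical.

```go
package main

import "fmt"

// Outcome of a BYOK request, known only after dispatch or the cache lookup.
type Outcome int

const (
	UpstreamOK  Outcome = iota // upstream returned 200
	Upstream5xx                // upstream returned a 5xx
	CacheHit                   // served from the gateway cache, no upstream call
)

// Usage holds the byok_usage counters (field names are illustrative).
type Usage struct {
	RequestCount int64
	FailedCount  int64
}

// Settle runs after the outcome is known, never at request entry, so failed
// retries cannot consume the free tier. It returns the surcharge to record.
func Settle(u *Usage, o Outcome, baseCost, surchargeBps, freeRequests int64) int64 {
	switch o {
	case Upstream5xx:
		u.FailedCount++ // tracked, not billed, free tier untouched
		return 0
	case CacheHit:
		u.RequestCount++ // counts as a successful BYOK request...
		return 0         // ...but carries zero surcharge (Invariant 1)
	case UpstreamOK:
		u.RequestCount++
		if u.RequestCount <= freeRequests {
			return 0 // still inside the free tier
		}
		return baseCost * surchargeBps / 10000
	default:
		panic(fmt.Sprintf("unknown outcome %d", o))
	}
}

func main() {
	var u Usage
	// Free tier of 2 requests, base cost 20000 units, 5% surcharge.
	fmt.Println(Settle(&u, Upstream5xx, 20000, 500, 2)) // 0: failure
	fmt.Println(Settle(&u, UpstreamOK, 20000, 500, 2))  // 0: free request 1
	fmt.Println(Settle(&u, CacheHit, 20000, 500, 2))    // 0: free request 2
	fmt.Println(Settle(&u, UpstreamOK, 20000, 500, 2))  // 1000: over the tier
	fmt.Println(u.RequestCount, u.FailedCount)          // 3 1
}
```

The trace shows why ordering matters: the failed first attempt never takes a free-tier slot, so the surcharge starts only on the third successful (or cached) request.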
Invariant 5: Workspace quota never moves in BYOK mode
This is the headline invariant: ws.used_quota and ws.quota must never move
for BYOK requests.
If you violate this, the customer’s “remaining balance” in the gateway’s UI silently decreases despite them paying upstream directly. The double-charge manifests not as a wrong invoice, but as a wrong balance display — arguably worse, because customers find out months later when they wonder why their prepaid balance vanished.
Enforcement is a one-line invariant check in the settle function:
```go
// settlement.go
if req.IsBYOK {
	// The surcharge is recorded on the billing log; the workspace
	// quota is never adjusted for BYOK requests.
	return // skip quota adjustment entirely
}
```
Pair with a test that fires a BYOK request and asserts
ws.used_quota_before == ws.used_quota_after. This single test catches more
billing bugs than any other check in the suite.
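A minimal Go sketch of that test, written as a plain program so it runs standalone (in a real suite it would be a go test function). The Workspace, Request, and Settle names are hypothetical; Settle follows the headline rule that a BYOK request yields a surcharge for the billing log but never touches the workspace.

```go
package main

import "fmt"

// Workspace is the gateway-side balance (hypothetical field names).
type Workspace struct {
	Quota     int64 // remaining prepaid balance
	UsedQuota int64 // cumulative usage
}

// Request carries the mode and price of one request. For platform requests
// Price already includes the gateway's margin; for BYOK it is the upstream
// list price.
type Request struct {
	IsBYOK bool
	Price  int64
}

// Settle debits the workspace for platform requests only. For BYOK requests
// it returns the surcharge to record on the billing log and returns early.
func Settle(ws *Workspace, req Request, surchargeBps int64) (surcharge int64) {
	if req.IsBYOK {
		return req.Price * surchargeBps / 10000 // workspace never touched
	}
	ws.Quota -= req.Price
	ws.UsedQuota += req.Price
	return 0
}

// checkBYOKLeavesWorkspace fires one BYOK request and asserts that every
// workspace field is unchanged, whatever the surcharge rate.
func checkBYOKLeavesWorkspace(surchargeBps int64) error {
	ws := Workspace{Quota: 1000000}
	before := ws
	Settle(&ws, Request{IsBYOK: true, Price: 20000}, surchargeBps)
	if ws != before {
		return fmt.Errorf("workspace moved: before %+v, after %+v", before, ws)
	}
	return nil
}

func main() {
	for _, bps := range []int64{0, 500} {
		fmt.Println(bps, checkBYOKLeavesWorkspace(bps)) // <nil> means pass
	}
}
```

Comparing the whole struct rather than one field catches a stray write to either Quota or UsedQuota, and running it at both a zero and a nonzero rate guards against a check that only holds on the free tier.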
How to validate these invariants in practice
You can’t trust invariants that aren’t tested. The minimal test matrix for BYOK billing:
| Test | Asserts |
|---|---|
| BYOK + successful upstream | ws.quota unchanged; byok_usage.request_count += 1; surcharge logged |
| BYOK + cache hit | ws.quota unchanged; quota_charged = 0; request_count += 1 |
| BYOK + upstream 5xx mid-stream | ws.quota unchanged; failed_count += 1; request_count unchanged |
| BYOK + free tier (under limit) | ws.quota unchanged; surcharge = 0 |
| BYOK + free tier (over limit) | ws.quota unchanged; surcharge applied |
| Platform + cache hit | quota_charged = 0; BYOK counters untouched |
Six cases. If you can’t enumerate the expected values for every cell, you don’t understand your own billing.
Why this matters commercially
A 1% billing inaccuracy at $100k MRR is $1k/month of mis-attributed revenue. Per year that’s $12k that either belongs to you (and you didn’t collect) or to customers (and they’re being silently overcharged). Either side of the error erodes trust faster than feature gaps.
The gateway space is also one where customers compare prices in spreadsheets across providers. A BYOK customer doing audit math on their OpenAI invoice against the gateway dashboard will find any inconsistency. The gateways that get the most enterprise traction are usually the ones whose math reconciles to the cent.
Closing
BYOK is not a feature checkbox. It’s a different commercial contract with the customer that demands a distinct billing path through the gateway. The five invariants above are the minimum bar for that path to be correct.
If you’re building or evaluating a gateway, ask to see the test that enforces Invariant 5 specifically. If it doesn’t exist, the BYOK path is almost certainly buggy in production.