What is an LLM API Gateway? A Practical Introduction

In 2023, adding “an AI feature” meant one API call to OpenAI. By 2026, the median production application talks to three to five different model providers — OpenAI for general chat, Anthropic for long-context reasoning, Vertex for Gemini multimodal, Bedrock for compliance-bound workloads, and one or two open-source models served from a self-hosted endpoint.

That shift surfaced a problem familiar to anyone who’s worked with traditional API integrations: every additional provider brings its own variant of authentication, billing, observability, retry logic, rate limiting, and error mapping. An LLM API gateway is the middleware that absorbs that complexity.

This post explains what an LLM API gateway actually does, when adding one is worth the operational overhead, and what to look for if you’re evaluating options.

The problem an LLM gateway solves

Consider what direct integration with three providers looks like in a moderately serious application:

[your app] ───┬─→ [OpenAI SDK]      Bearer sk-...           $/M tokens
              ├─→ [Anthropic SDK]   x-api-key + version     $/M tokens, different schema
              └─→ [Vertex SDK]      GCP service account     usage in characters

Six concerns get duplicated three times:

  1. Authentication. Bearer tokens, custom headers, SigV4, and GCP service-account JWTs are four different auth schemes. Every key rotation has to touch every service that uses that provider.
  2. Request/response schema. OpenAI’s messages array isn’t quite Anthropic’s messages array. Vertex contents is structured differently again. Streaming chunk formats differ.
  3. Error handling. Rate-limit headers, retry semantics, and error response shapes are all provider-specific.
  4. Token accounting. Each provider reports usage differently — and most of them omit or rename fields under streaming, cache hits, and tool calls.
  5. Cost tracking. Mapping usage.input_tokens to dollars-per-million requires keeping price tables current per model, per provider, per region.
  6. Routing and fallback. If OpenAI is down, can you fail over to Anthropic? Not without a translation layer between their API shapes.

A purpose-built gateway centralizes all six. Your application calls one endpoint with one schema, and the gateway handles the provider-specific machinery.
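To make “one endpoint, one schema” concrete, here is a minimal sketch of the application side, assuming the gateway exposes an OpenAI-compatible API, as most do. The gateway URL, API key, and model names are placeholders, not any specific product’s values:

```python
# Sketch: the application talks to one endpoint with one schema, regardless of
# which upstream provider ultimately serves the model. All values are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-gateway.internal/v1",   # hypothetical gateway endpoint
    api_key="gw-key-issued-by-the-gateway",       # a gateway key, not a provider key
)

# The same request shape fans out to three different providers; the gateway
# handles auth, schema translation, retries, fallback, and cost attribution.
for model in ("gpt-4o", "claude-sonnet-4", "gemini-2.0-flash"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize this ticket in one line."}],
    )
    print(model, resp.usage.total_tokens, resp.choices[0].message.content)
```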

What “API gateway” means in this context

The term is overloaded. In traditional microservice architecture, an API gateway (Kong, Nginx, AWS API Gateway) handles ingress concerns: routing, auth, rate limiting, request transformation. That’s still part of the picture, but LLM gateways add a layer specific to model inference:

Traditional API gateway       LLM API gateway
Routes by URL path            Routes by model name and capability
Auth pass-through             Multi-tenant key management with per-user quota
Generic rate limiting         Token-based rate limiting (TPM, RPM, $/min)
HTTP retry logic              Streaming-aware retry with partial-response handling
Generic logging               Token usage and cost attribution per request
Plain caching                 Prompt-prefix caching with provider-specific TTL
n/a                           Content safety scanning of inputs and outputs
n/a                           Cross-protocol translation (OpenAI ↔ Anthropic ↔ Gemini)

The bottom seven rows are what makes an LLM gateway non-trivial. They can’t be implemented by gluing Kong plugins together — each one requires understanding the actual content of an LLM request, not just routing bytes.
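Token-based rate limiting is a good illustration of why. A request-count limit can be enforced without reading the request; a tokens-per-minute limit cannot. Here is a minimal sketch of the idea, with illustrative names and numbers (real gateways estimate prompt tokens pre-flight and reconcile against the provider’s reported usage afterwards):

```python
# Sketch: a per-tenant tokens-per-minute (TPM) budget, as opposed to the
# request-count limits a traditional gateway applies. Names and numbers are illustrative.
import time
from dataclasses import dataclass, field

@dataclass
class TokenBudget:
    tokens_per_minute: int
    window_start: float = field(default_factory=time.monotonic)
    used: int = 0

    def try_consume(self, estimated_tokens: int) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:        # roll the one-minute window
            self.window_start, self.used = now, 0
        if self.used + estimated_tokens > self.tokens_per_minute:
            return False                         # reject: request would exceed the TPM budget
        self.used += estimated_tokens            # reserve the estimate against the budget
        return True

budget = TokenBudget(tokens_per_minute=200_000)
print(budget.try_consume(1_500))                 # True while the per-minute budget holds
```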

A reference architecture

A typical LLM gateway handles a request through this pipeline:

client request

[Authentication]            verify API key, map to tenant

[Protocol parsing]          detect input format (OpenAI / Anthropic / Gemini)

[Routing decision]          pick channel based on model + load + cost

[Pre-flight billing]        estimate cost, reserve quota

[Guardrail (input)]         scan for PII, prompt injection, key leaks

[Protocol translation]      convert to upstream provider's format

[Upstream dispatch]         forward request, handle streaming

[Guardrail (output)]        scan response if non-streaming

[Usage accounting]          parse upstream usage, settle billing

[Audit logging]             persist request metadata

client response

Most of these steps are skippable for simple deployments. A minimal gateway is auth + routing + dispatch + logging. A production gateway tends to have all of them, plus a few you can’t see from outside the request path:

  • Channel health monitoring — passive failure tracking so a flaky provider key gets temporarily benched
  • Background quota sync — reconciling Redis quota counters with durable storage
  • Inflight reconciliation — recovering pre-reserved quota when a request crashes mid-flight
  • Price table updates — pulling latest cost data from each provider

The complexity ramps up quickly past the minimal version, which is why most teams use a gateway rather than build their own.
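As a point of reference, here is the minimal version (auth + routing + dispatch + logging) collapsed into one function. This is an illustrative sketch, not a production design; every key, tenant, URL, and model name in it is made up, and each step stands in for a real subsystem from the pipeline above:

```python
# Sketch of the minimal gateway: authenticate, route by model, dispatch, log.
# All keys, tenants, URLs, and model names below are illustrative placeholders.
import json
import time
import urllib.request

TENANTS = {"gw-key-abc": "team-search"}          # auth: gateway key -> tenant
ROUTES = {                                       # routing: model -> (upstream URL, upstream key)
    "claude-sonnet-4": ("https://upstream.example.com/v1/messages", "upstream-provider-key"),
}

def handle(api_key: str, body: dict) -> dict:
    tenant = TENANTS.get(api_key)                # 1. authentication
    if tenant is None:
        return {"error": "invalid gateway key"}

    url, upstream_key = ROUTES[body["model"]]    # 2. routing decision
    started = time.monotonic()

    req = urllib.request.Request(                # 3. upstream dispatch
        url,
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {upstream_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        upstream = json.loads(resp.read())

    print(json.dumps({                           # 4. audit logging
        "tenant": tenant,
        "model": body["model"],
        "latency_ms": int((time.monotonic() - started) * 1000),
        "usage": upstream.get("usage"),
    }))
    return upstream
```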

When you actually need one

Adding a gateway is operational overhead, so it has to earn its place. The usual justifications:

You probably need a gateway if:

  • You use two or more model providers in production
  • You have multiple internal teams or users sharing a budget
  • You need per-feature cost attribution (“how much is feature X costing?”)
  • Compliance requires audit logs of all LLM input/output
  • You want to give users bring-your-own-key (BYOK) options
  • You need graceful fallback when a provider has an outage

You probably don’t need a gateway yet if:

  • You’re a solo developer prototyping
  • You use exactly one provider and don’t care about vendor lock-in
  • Your monthly LLM spend is under a few hundred dollars
  • You’re an internal tool with a handful of users

The threshold is usually crossed faster than people expect. Once you’re running real production traffic, the absence of central observability becomes painful within weeks.

Bring Your Own Key (BYOK): the commercial wrinkle

A subtle but important architectural question: who pays the upstream provider?

There are two ends of the spectrum.

Platform-keyed. The gateway operator holds all upstream API keys. Users pay the gateway, which pays the providers. The gateway sets prices and captures the margin.

BYOK. Users plug in their own provider keys. The gateway forwards requests using the user’s key — the user is billed directly by OpenAI / Anthropic. The gateway charges only for the value it adds (observability, routing, guardrails), typically as a surcharge percentage or a flat subscription.

Mature gateways support both. The implementation has subtle invariants that are easy to get wrong:

  • Cache hits in BYOK mode shouldn’t be billed. No upstream call happened, so there’s no upstream cost to surcharge. Charging anyway is a double-charge.
  • The user’s surcharge quota is separate from their cumulative upstream usage. Two different ledgers, two different audit trails.
  • Failed attempts may still consume upstream tokens. Mid-stream 5xx errors often leave partial token usage that the upstream still bills for. Those need to be tracked but billed differently from successful attempts.
  • Key rotation must not break in-flight requests or invalidate prompt caches.

Each of these is a billing bug if implemented naively — the kind that results in either silently charging users for free operations or silently giving away paid features.
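A minimal sketch of the surcharge decision makes the first three invariants concrete. It assumes a percentage surcharge on upstream cost; the field names, the 5% rate, and the failed-attempt ledger hook are illustrative:

```python
# Sketch: the BYOK surcharge decision, encoding the invariants above.
# Field names, the 5% rate, and the ledger hook are illustrative.
from dataclasses import dataclass

@dataclass
class Attempt:
    upstream_cost_usd: float   # what the provider bills against the user's own key
    cache_hit: bool            # served from the gateway cache; no upstream call made
    succeeded: bool            # False for mid-stream 5xx, timeouts, etc.

SURCHARGE_RATE = 0.05

def record_failed_attempt_usage(attempt: Attempt) -> None:
    """Placeholder for the separate failed-attempt ledger."""

def byok_surcharge(attempt: Attempt) -> float:
    if attempt.cache_hit:
        # Invariant: no upstream call happened, so there is nothing to surcharge.
        return 0.0
    if not attempt.succeeded:
        # Invariant: partial usage is tracked (the provider may still bill it),
        # but it lands in a separate ledger rather than the surcharge.
        record_failed_attempt_usage(attempt)
        return 0.0
    return attempt.upstream_cost_usd * SURCHARGE_RATE

print(byok_surcharge(Attempt(0.0, cache_hit=True, succeeded=True)))     # 0.0
print(byok_surcharge(Attempt(0.02, cache_hit=False, succeeded=True)))   # ~0.001, i.e. 5% of $0.02
```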

Cross-protocol translation: what makes it hard

One advertised feature you’ll see is “use the OpenAI SDK with any model.” This is the cross-protocol translation problem. It sounds simple — convert one JSON shape to another — but it’s actually one of the harder parts of a gateway to get right.

Some of the leaky edges:

  • Streaming chunk shapes differ. OpenAI emits choices[].delta, Anthropic emits typed events (message_start, content_block_delta, message_stop), Gemini emits candidates[].content.parts. A streaming bridge has to synthesize matching events for the inbound protocol from outbound chunks of a different shape, including correctly handling tool calls split across chunks.
  • Tool calling schemas are similar but not identical. OpenAI uses tools[].function, Anthropic uses tools[] with input_schema, Gemini uses function_declarations. Tool result blocks are reported differently too.
  • System prompts are top-level in Anthropic, role-based in OpenAI. Multiple system messages in OpenAI have to be concatenated for Anthropic.
  • Vision inputs are encoded as image_url (OpenAI), image block with media_type (Anthropic), or inline_data (Gemini). Translation needs to fetch URLs and re-encode appropriately.
  • Reasoning models (OpenAI o-series, Anthropic extended-thinking) expose hidden tokens via different fields. Cost accounting needs to track these separately from visible output.

These details only matter if you actually care about correctness. Many “translation layers” pass through the obvious 80% and silently lose information in the remaining 20%. That 20% is where SDK errors and cost bugs hide.
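To give a feel for the streaming bridge, here is a simplified sketch of one step: mapping an Anthropic content_block_delta event onto an OpenAI-style chat.completion.chunk. The structures are abbreviated; a real bridge also has to map message_start/message_stop, tool-use deltas, stop reasons, and usage fields:

```python
# Sketch: translate one Anthropic streaming event into an OpenAI-style chunk.
# Abbreviated shapes; a real bridge covers many more event types and edge cases.
import json
import time

def anthropic_text_delta_to_openai_chunk(event: dict, model: str, chunk_id: str) -> dict | None:
    if event.get("type") != "content_block_delta":
        return None                       # other event types need their own mapping
    delta = event["delta"]
    if delta.get("type") != "text_delta":
        return None                       # e.g. input_json_delta carries tool-call arguments
    return {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {"content": delta["text"]},
            "finish_reason": None,
        }],
    }

event = {"type": "content_block_delta", "index": 0,
         "delta": {"type": "text_delta", "text": "Hello"}}
print(json.dumps(anthropic_text_delta_to_openai_chunk(event, "claude-sonnet-4", "chatcmpl-demo")))
```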

What to look for if you’re evaluating a gateway

Whether you’re picking an OSS gateway, a SaaS vendor, or deciding to build your own, the criteria that separate good from bad:

  1. Streaming correctness across protocols. Send a streaming request to an Anthropic-served model through an OpenAI-compatible client and verify the events come through intact. This is the single most common failure mode.
  2. Cost accuracy under cache hits and failures. Force a cache hit, force a 5xx mid-stream, force a timeout. Verify the billing log matches reality in each case.
  3. Multi-key key rotation. Add a new upstream key, deprecate an old one. Verify in-flight requests aren’t broken and new requests immediately pick up the change.
  4. Observability primitives. Per-request cost in response headers. Per-user usage queryable in <100ms. Per-channel error rates visible without grepping logs.
  5. Graceful degradation. When the primary upstream is down, what happens? Hard failure is acceptable; silent garbage responses are not.
  6. Test coverage of the billing layer. The gateway’s invariants — “BYOK cache hits cost 0”, “failed attempt usage is tracked separately” — should be enforced by automated tests, not by commit-message conventions (a sketch of such a test follows this list).
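What that looks like in practice, as a pytest-style sketch reusing the hypothetical Attempt and byok_surcharge names from the BYOK section above:

```python
# Sketch: invariant tests for the billing layer, written pytest-style.
# The imported module and names are the hypothetical ones from the BYOK sketch.
from billing import Attempt, byok_surcharge

def test_byok_cache_hit_is_free():
    attempt = Attempt(upstream_cost_usd=0.0, cache_hit=True, succeeded=True)
    assert byok_surcharge(attempt) == 0.0

def test_failed_attempt_is_not_surcharged():
    attempt = Attempt(upstream_cost_usd=0.01, cache_hit=False, succeeded=False)
    assert byok_surcharge(attempt) == 0.0
```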

Closing

An LLM API gateway is not exotic infrastructure. It’s the same consolidation move that produced the original API gateway fifteen years ago, applied to a domain where the per-request semantics are richer than plain HTTP.

If you’re operating any non-trivial volume of LLM traffic across two or more providers, you’re already maintaining the equivalent of a gateway in scattered application code. Centralizing it is mostly a question of when, not whether.

In future posts we’ll dig into specific subsystems: how to design the billing layer to be crash-safe under partial failure, how to implement provider-aware prompt caching, and how to make BYOK pricing work without double-charging the customer.
