LLM middleware: what it is, why you need it, and how to implement it
A practical guide to the server-side layer between your app and the model — auth, rate limits, routing, logging, and the patterns that keep AI features production-ready.
Most AI feature POCs start the same way: the frontend calls OpenAI or Anthropic directly, streams the response into a chat widget, and ships to a demo audience by Friday.
That works until someone asks about tenant isolation, the monthly token bill, or what happens when Anthropic rate-limits you during peak traffic. The gap is almost never the model. It is the missing layer between your product and the provider.
That layer is LLM middleware — a server-side gateway every AI request passes through before it reaches a model.
What LLM middleware is
LLM middleware is a controlled server-side proxy that sits between your authenticated application and one or more model providers.
Your UI does not hold API keys. It does not assemble privileged context. It sends a request to your API, which validates the session, applies policy, calls the model, logs the result, and returns a response.
[Diagram: Client UI (copilot, search, actions) → Your API (existing auth session) → LLM middleware (auth, rate limits, logging) → Model provider (OpenAI, Anthropic, etc.). Caption: Every model call passes through your stack, not around it.]
The flow is always:
- Client UI — copilot panel, search box, draft button
- Your existing API — session, JWT, or API key your app already uses
- LLM middleware — the dedicated layer for model-specific concerns
- Model provider — OpenAI, Anthropic, Gemini, Bedrock, or a self-hosted endpoint
Every feature — copilot, RAG, classification, agents — should share this path. Otherwise each team reinvents auth checks, rate limits, and logging in slightly incompatible ways.
What it does
Think of middleware as the operating system for AI traffic in your product. Responsibilities vary by maturity, but production teams consistently need the following.
Identity and context assembly
Middleware runs after your normal auth. It knows who the user is, which tenant they belong to, and what data they can access — then builds the prompt from that scope.
The client sends intent ("summarize this ticket"), not raw privileged data the browser assembled unsupervised.
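A minimal sketch of that split, assuming an in-memory ticket list in place of your database. The `Session`, `Ticket`, and `Intent` shapes and the `buildSummaryPrompt` function are hypothetical names chosen for illustration, not part of any library.

```typescript
// Hypothetical shapes: your auth layer and data model will differ.
type Session = { userId: string; tenantId: string };
type Ticket = { id: string; tenantId: string; body: string };
type Intent = { action: "summarize_ticket"; ticketId: string };

// Stand-in for a database lookup.
const tickets: Ticket[] = [
  { id: "t1", tenantId: "acme", body: "Login fails after password reset." },
  { id: "t2", tenantId: "globex", body: "Invoice total is wrong." },
];

// The server resolves the ticket itself and checks tenant scope.
// The client only names what it wants; it never supplies the ticket body.
function buildSummaryPrompt(session: Session, intent: Intent): string {
  const ticket = tickets.find((t) => t.id === intent.ticketId);
  if (!ticket || ticket.tenantId !== session.tenantId) {
    // Same error for "missing" and "other tenant" so existence is not leaked.
    throw new Error("ticket not found in tenant scope");
  }
  return `Summarize this support ticket:\n${ticket.body}`;
}
```

A request for `t2` from an `acme` session fails the same way as a request for a ticket that does not exist, so the response does not reveal which tickets other tenants have.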
Rate limiting and abuse control
LLM calls are expensive and slow relative to CRUD APIs. Middleware enforces per-user, per-tenant, or per-IP limits before a token is spent — the same way you would throttle password resets or export endpoints.
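One simple way to do this is a fixed-window counter. The sketch below keeps counters in process memory, which is an assumption that only holds for a single instance; `createLimiter` and the key format are illustrative names.

```typescript
// Fixed-window limiter kept in process memory. A multi-instance
// deployment would need a shared store; that part is out of scope here.
function createLimiter(limit: number, windowMs: number) {
  const windows = new Map<string, { startedAt: number; count: number }>();

  // Returns true when the call may proceed, false when it should be
  // rejected (for example with HTTP 429) before the provider is called.
  return function allow(key: string, now: number = Date.now()): boolean {
    const current = windows.get(key);
    if (!current || now - current.startedAt >= windowMs) {
      // First call for this key, or the previous window has expired.
      windows.set(key, { startedAt: now, count: 1 });
      return true;
    }
    if (current.count >= limit) return false;
    current.count += 1;
    return true;
  };
}

// Example: 2 requests per minute, keyed by tenant and user.
const allowRequest = createLimiter(2, 60000);
```

Calling `allowRequest("acme:user-1")` before the provider call is enough for a first feature. A fixed window allows short bursts at window boundaries; a token bucket or sliding window is a common next step if that matters.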
Routing, streaming, and failover
Middleware chooses which model and provider to use, handles streaming responses back to the client, and can fail over to a secondary provider or a degraded response when the primary is down or rate-limited.
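The failover part can be sketched as an ordered list of providers tried in turn. The `Provider` type here is a hypothetical adapter interface, not a real SDK, and the sketch ignores streaming: once tokens have reached the client, switching providers mid-response needs more care than this shows.

```typescript
// Hypothetical adapter interface: each provider exposes one async call.
type Provider = { name: string; complete: (prompt: string) => Promise<string> };
type Completion = { text: string; provider: string; failedProviders: string[] };

// Try providers in priority order; fall back to a degraded message
// if every one of them throws (outage, rate limit, timeout).
async function completeWithFailover(
  providers: Provider[],
  prompt: string,
  degradedMessage: string,
): Promise<Completion> {
  const failedProviders: string[] = [];
  for (const p of providers) {
    try {
      const text = await p.complete(prompt);
      return { text, provider: p.name, failedProviders };
    } catch {
      // Record the failure and move on to the next provider.
      failedProviders.push(p.name);
    }
  }
  return { text: degradedMessage, provider: "none", failedProviders };
}
```

Returning `failedProviders` alongside the result lets the caller log which providers were skipped without changing the response shape.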
Token accounting and cost visibility
Every request should record input tokens, output tokens, latency, and model ID — tagged by feature and tenant. Without this layer, finance discovers the copilot cost three months after launch.
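As a rough sketch, the record and a per-tenant rollup might look like this. The model name and prices are made up for the example; real prices vary by provider and change over time.

```typescript
type UsageRecord = {
  tenantId: string;
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
};

// Hypothetical prices in USD per million tokens, for illustration only.
const pricePerMillionTokens = new Map<string, { input: number; output: number }>([
  ["example-small", { input: 1, output: 2 }],
]);

function costUsd(r: UsageRecord): number {
  const price = pricePerMillionTokens.get(r.model);
  if (!price) return 0; // unknown model: tokens are still recorded, cost is not
  return (r.inputTokens * price.input + r.outputTokens * price.output) / 1000000;
}

// Sum cost per "tenant/feature" so a single dashboard query answers
// "what does the copilot cost for this customer?".
function costByTenantAndFeature(records: UsageRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    const key = `${r.tenantId}/${r.feature}`;
    totals.set(key, (totals.get(key) ?? 0) + costUsd(r));
  }
  return totals;
}
```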
Policy and safety
Middleware is where you enforce prompt structure, separate system instructions from untrusted user input, validate tool calls, and require confirmation before destructive actions. Prompt injection is an integration problem, not something you fix only in the system prompt.
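Two of those controls are easy to show in code. This is a sketch under stated assumptions: the tool names, the `Decision` values, and both functions are hypothetical, and keeping roles separate reduces prompt-injection risk without eliminating it.

```typescript
type Message = { role: "system" | "user"; content: string };

// System instructions come only from server code. Untrusted text is
// passed as its own user message and never concatenated into them.
function buildMessages(systemPrompt: string, untrustedInput: string): Message[] {
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: untrustedInput },
  ];
}

type ToolCall = { name: string; args: Record<string, unknown> };
type Decision = "allow" | "needs_confirmation" | "reject";

// Allowlist of tools the model may request. A Map avoids accidental
// matches on inherited object properties such as "constructor".
const allowedTools = new Map<string, { destructive: boolean }>([
  ["get_ticket", { destructive: false }],
  ["close_ticket", { destructive: true }],
]);

function reviewToolCall(call: ToolCall, userConfirmed: boolean): Decision {
  const tool = allowedTools.get(call.name);
  if (!tool) return "reject";
  if (tool.destructive && !userConfirmed) return "needs_confirmation";
  return "allow";
}
```

The decision is made by server code from a fixed allowlist, so a model that has been talked into requesting an unknown tool gets a rejection rather than an execution.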
Audit logging
When the model invokes a tool — update a record, send an email, fetch billing data — middleware (and your tool handlers) write audit events with the same retention and access controls as the rest of your security logs.
| Concern | Without middleware | With middleware |
|---|---|---|
| API keys | Exposed in browser or duplicated per feature | Single server-side secret |
| Auth | Client trusts itself to scope context | Server validates session before model call |
| Cost | Unknown until the invoice arrives | Per-tenant token metrics from day one |
| Failures | Spinner forever or opaque 500 | Timeouts, fallbacks, user-visible recovery |
| Rollout | All-or-nothing deploy | Feature flags at the middleware boundary |
Why you need it
The direct-to-provider anti-pattern
Browser → model provider API
This pattern fails production review for predictable reasons:
- Secrets in the client — even "proxy keys" leak; users can bypass UI limits
- No tenant boundary — the browser can request context the user should not see
- No central observability — each feature logs differently, or not at all
- No consistent failure behavior — one copilot handles rate limits; another crashes
- No kill switch — turning off AI means hunting down every direct integration
Middleware converts AI from a scattered set of demos into infrastructure your team can operate.
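The kill switch in particular becomes a one-line check once every call shares a path. A minimal sketch, assuming flags are loaded from wherever your configuration already lives; the `AiFlags` shape is hypothetical.

```typescript
// Hypothetical flag shape: in practice this would come from your
// existing feature-flag or config system.
type AiFlags = {
  globalEnabled: boolean;
  disabledFeatures: Set<string>;
  disabledTenants: Set<string>;
};

// Checked once at the middleware boundary, before any provider call.
function isAiEnabled(flags: AiFlags, feature: string, tenantId: string): boolean {
  return (
    flags.globalEnabled &&
    !flags.disabledFeatures.has(feature) &&
    !flags.disabledTenants.has(tenantId)
  );
}
```

Setting `globalEnabled` to `false` turns off every AI feature at once, and the two sets allow narrower shutdowns for a single feature or tenant.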
Middleware before RAG, before agents
Teams often jump to vector databases and agent frameworks before they have a stable request path. That inverts the order.
Middleware first gives every later feature — RAG, tool-calling, batch jobs — the same auth, logging, and cost envelope. See When not to use RAG for why retrieval is not always the next step, and What production-ready LLM integration actually means for the full readiness checklist.
Use that checklist as a gate before calling an AI feature GA, not as a post-launch backlog.
How it is implemented
There is no single vendor product called "LLM middleware." It is a pattern you implement in your stack — usually as a module, service, or set of API routes in code you own.
Layer 1: Minimal (one feature, one route)
Enough for a first production feature behind a feature flag:
- One authenticated API route (e.g. POST /api/copilot/chat)
- Server-side provider call with streaming
- Basic rate limiting (IP or user ID)
- Structured log line per request: latency, tokens, user ID, feature name
This site’s live demo assistant follows this shape: the browser talks to /api/chat, the server holds the model key, applies rate limits, and streams a grounded response. It is not a full multi-tenant proxy — but it is middleware in the architectural sense.
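For the structured log line, one JSON object per request is usually enough at this stage. The field names below are an assumption for illustration; use whatever your log pipeline already expects.

```typescript
type LlmLogEntry = {
  feature: string;
  userId: string;
  model: string;
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
  status: "ok" | "error";
};

// One JSON object per line, so the output can be parsed and
// aggregated by standard log tooling without custom regexes.
function formatLogLine(entry: LlmLogEntry, at: Date = new Date()): string {
  return JSON.stringify({ ts: at.toISOString(), event: "llm_request", ...entry });
}
```

Writing the result to standard output with `console.log(formatLogLine(entry))` is sufficient if your platform already collects process logs.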
Layer 2: Shared module (multiple features)
Extract a reusable internal module or small service once a second AI feature appears:
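```typescript
// lib/llm/middleware.ts: illustrative, not a product SDK
// The helpers imported below are placeholders for code you own.
import {
  enforceRateLimit, selectModel, provider,
  logLlmRequest, logLlmError, fallbackResponse,
} from "./internals";

type LlmRequest = {
  feature: "copilot" | "classify" | "search-assist";
  tenantId: string;
  userId: string;
  messages: Array<{ role: "user" | "assistant"; content: string }>;
};

export async function complete(req: LlmRequest) {
  // Assumed to throw when the caller is over its limit, before any tokens are spent.
  await enforceRateLimit(req.tenantId, req.userId);
  const model = selectModel(req.feature, req.tenantId);
  const startedAt = Date.now();

  try {
    // Assumes the adapter resolves once the stream has finished,
    // so result.usage is populated by the time it is logged.
    const result = await provider.complete({
      model,
      messages: req.messages,
      stream: true,
    });
    logLlmRequest({
      ...req,
      model,
      latencyMs: Date.now() - startedAt,
      inputTokens: result.usage.inputTokens,
      outputTokens: result.usage.outputTokens,
    });
    return result;
  } catch (error) {
    // Provider failure: record it and degrade instead of surfacing a raw 500.
    logLlmError({ ...req, model, error });
    return fallbackResponse(req.feature);
  }
}
```

Feature routes stay thin: they validate HTTP input, load tenant-scoped context, and delegate to complete().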
Layer 3: Dedicated service (platform scale)
At higher volume or stricter compliance, teams split middleware into its own deployable service:
- Provider adapters — normalize OpenAI, Anthropic, and Bedrock behind one interface
- Queue or async path — for long-running jobs that should not block HTTP workers
- Central config — model routing rules, per-tenant budgets, prompt version registry
- Shared eval hooks — shadow traffic, golden-set comparison before prompt rollout
The implementation changes; the request flow does not. Client → your API → middleware → provider.
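A small piece of the provider-adapter idea is normalizing token usage into one shape. The snake_case field names below follow the OpenAI Chat Completions and Anthropic Messages response formats as commonly documented; treat them as an assumption and verify against the current API references. The `RawUsage` wrapper and `normalizeUsage` are hypothetical.

```typescript
// The shape the rest of your stack sees, whatever the provider.
type Usage = { inputTokens: number; outputTokens: number };

// Provider-specific usage objects (assumed field names, see lead-in).
type OpenAiUsage = { prompt_tokens: number; completion_tokens: number };
type AnthropicUsage = { input_tokens: number; output_tokens: number };

type RawUsage =
  | { provider: "openai"; usage: OpenAiUsage }
  | { provider: "anthropic"; usage: AnthropicUsage };

// Each adapter maps its provider's field names onto the shared shape,
// so logging and budgeting code never branches on provider.
function normalizeUsage(raw: RawUsage): Usage {
  switch (raw.provider) {
    case "openai":
      return {
        inputTokens: raw.usage.prompt_tokens,
        outputTokens: raw.usage.completion_tokens,
      };
    case "anthropic":
      return {
        inputTokens: raw.usage.input_tokens,
        outputTokens: raw.usage.output_tokens,
      };
  }
}
```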
Context assembly belongs in middleware (or right behind it)
For a copilot, middleware orchestrates:
- Validate session and tenant
- Fetch allowed context from your databases and APIs (not from the client)
- Build the system prompt and message list
- Call the model
- Post-process (citations, tool calls, schema validation)
- Return to client
For RAG, retrieval runs inside this boundary — after auth, before the model call — so users only embed and retrieve documents they are permitted to see. See RAG without the platform rewrite for that integration shape.
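The retrieval-side check can be sketched as a filter applied before anything reaches the prompt. The `Chunk` and `Caller` shapes are hypothetical, and filtering after the search is a simplification: where your vector store supports it, applying the same constraint inside the query avoids fetching forbidden chunks at all.

```typescript
type Chunk = {
  docId: string;
  tenantId: string;
  allowedRoles: string[];
  text: string;
  score: number;
};
type Caller = { tenantId: string; roles: string[] };

// Keep only chunks the caller may see, then take the best-scoring ones.
// filter() returns a new array, so the caller's input is not reordered.
function permittedChunks(caller: Caller, candidates: Chunk[], topK: number): Chunk[] {
  const roles = new Set(caller.roles);
  return candidates
    .filter((c) => c.tenantId === caller.tenantId && c.allowedRoles.some((r) => roles.has(r)))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```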
Tool-calling and agents
When the model invokes product APIs, middleware (or a dedicated orchestration layer it calls) must:
- Expose a narrow tool surface — not raw database access
- Re-check permissions on every tool call
- Audit log inputs and outcomes
- Gate destructive actions behind explicit user confirmation
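The permission re-check and audit steps from this list can be sketched as a wrapper around every tool handler. All names here (`Tool`, `executeTool`, the `auditLog` array) are hypothetical, the handlers are synchronous to keep the example short, and the confirmation gate is left out.

```typescript
type Caller = { userId: string; tenantId: string; permissions: Set<string> };
type Tool = { requiredPermission: string; run: (args: Record<string, unknown>) => unknown };
type AuditEvent = {
  userId: string;
  tenantId: string;
  tool: string;
  args: Record<string, unknown>;
  outcome: "ok" | "denied" | "error";
};

// Stand-in for your audit sink (database table, log stream, and so on).
const auditLog: AuditEvent[] = [];

// Every tool call goes through here: permission is checked per call,
// and every outcome is recorded, including denials and failures.
function executeTool(
  tools: Map<string, Tool>,
  caller: Caller,
  name: string,
  args: Record<string, unknown>,
): unknown {
  const record = (outcome: AuditEvent["outcome"]): void => {
    auditLog.push({ userId: caller.userId, tenantId: caller.tenantId, tool: name, args, outcome });
  };
  const tool = tools.get(name);
  if (!tool || !caller.permissions.has(tool.requiredPermission)) {
    record("denied");
    throw new Error(`tool ${name} not permitted`);
  }
  try {
    const result = tool.run(args);
    record("ok");
    return result;
  } catch (error) {
    record("error");
    throw error;
  }
}
```

The model only ever sees the names registered in `tools`, which is what keeps the tool surface narrow.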
Agent frameworks like LangChain sit above provider adapters and below your auth boundary — they orchestrate steps; middleware enforces policy. See Build an agent with LangChain for how that fits together.
Common mistakes
Calling the model from the client. Even with a "public" key, users can extract it, bypass UI limits, and exfiltrate context.
One middleware per feature. Three copilots means three slightly different rate limiters. Extract shared logic early.
Logging prompts but not decisions. You need tokens, latency, model ID, tenant, feature, and tool-call outcomes — not just "request succeeded."
Skipping middleware because the POC "works". Production readiness is defined by failure behavior, cost, and permissions — not happy-path demo quality.
A sensible rollout order
- Middleware route — auth, rate limit, logging, one model, streaming
- First workflow-bound feature — copilot or classifier tied to a real user action
- Eval baseline — golden inputs, regression gate on prompt changes
- Retrieval or tools — only when metrics show the simpler path is insufficient
- Provider routing and failover — when uptime or cost optimization requires it
Measure quality, cost, and support load at each stage before expanding.
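For the eval baseline step, a regression gate can start as small as the sketch below. The substring check is a deliberately crude stand-in for whatever quality measure suits your feature, `generate` is synchronous only to keep the example short, and the 0.02 tolerance is an arbitrary placeholder.

```typescript
// A golden case pairs an input with phrases the output must contain.
type GoldenCase = { input: string; mustInclude: string[] };

// Fraction of golden cases whose output contains every required phrase.
function passRate(cases: GoldenCase[], generate: (input: string) => string): number {
  if (cases.length === 0) return 0;
  const passed = cases.filter((c) => {
    const output = generate(c.input).toLowerCase();
    return c.mustInclude.every((phrase) => output.includes(phrase.toLowerCase()));
  }).length;
  return passed / cases.length;
}

// Block a prompt change when the candidate's pass rate drops more than
// `tolerance` below the recorded baseline.
function passesGate(baseline: number, candidate: number, tolerance: number = 0.02): boolean {
  return candidate >= baseline - tolerance;
}
```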
Putting it together
LLM middleware is not optional infrastructure for production AI — it is the integration. The model generates text; middleware makes that text safe, observable, permissioned, and operable inside your product.
If you are planning a copilot, RAG feature, or agent workflow, start by drawing the request path on a whiteboard: where auth runs, where context is assembled, where logs land, and what the user sees when the provider fails. If the arrow goes from browser to OpenAI, you have work to do before you scale traffic.
