Monitoring LLM costs in production: tokens, tenants, and alerts
A practical guide to LLM cost observability: structured logging, Langfuse dashboards, OpenTelemetry metrics, per-tenant budgets, and the unit economics finance actually needs.
The copilot launched. Usage climbed. Three weeks later finance forwards the OpenAI invoice and asks a simple question nobody can answer: which customers, which features, and which workflows drove the spend?
Provider dashboards show account totals. They do not show that tenant ACME's support copilot costs $0.42 per resolved ticket while tenant Beta burns $3.10 per draft because retrieval returns forty chunks nobody trimmed. Invoice-level visibility is too late and too coarse. Cost monitoring belongs in your LLM middleware, tagged like any other multi-tenant metric.
What to measure (and what to ignore)
Total monthly tokens is a lagging indicator. Production teams track unit economics tied to user outcomes:
| Metric | Why it matters |
|---|---|
| Tokens per successful action | Cost per draft accepted, ticket summarized, or search answered |
| Spend by tenantId | Multi-tenant fairness, abuse detection, customer profitability |
| Spend by feature | Copilot vs classifier vs RAG assist: where to optimize first |
| Input vs output tokens | Bloated context assembly shows up as high input; rambling models as high output |
| Retrieval + embed cost | RAG has a bill before the LLM runs; trace it separately |
| Failed / timeout requests | You still pay for many partial generations; track outcome |
Example: support copilot unit economics
Suppose last week your middleware logged:
- 12,400 copilot requests across 85 tenants
- 48.2M total tokens ($612 estimated at blended rates)
- 9,100 requests where the agent clicked "insert draft" (outcome: success with downstream accept event)
Blended cost per request: $612 ÷ 12,400 ≈ $0.049
Cost per accepted draft: $612 ÷ 9,100 ≈ $0.067
That second number is what product and finance can reason about. If ACME alone ran 2,200 requests but only 180 accepted drafts, then at the blended $0.049 per request it spent roughly $109, so its effective cost per useful action is about $0.60, roughly nine times the fleet-wide $0.067. Without tenant and outcome dimensions, you would never see the gap.
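The fleet-wide arithmetic above can be reproduced in a few lines. This is a minimal sketch using the illustrative weekly figures from this example; the helper name `cost_per_unit` is hypothetical. The same function applies per tenant once spend and outcomes are tagged with a tenant ID.

```python
from typing import Optional


def cost_per_unit(total_cost_usd: float, units: int) -> Optional[float]:
    """Return cost per unit, or None when there is nothing to divide by."""
    if units <= 0:
        return None
    return total_cost_usd / units


# Illustrative weekly figures from the example above.
weekly_cost_usd = 612.0
requests = 12_400
accepted_drafts = 9_100

per_request = cost_per_unit(weekly_cost_usd, requests)
per_accepted = cost_per_unit(weekly_cost_usd, accepted_drafts)

print(f"per request: ${per_request:.3f}")          # per request: $0.049
print(f"per accepted draft: ${per_accepted:.3f}")  # per accepted draft: $0.067
```

Returning None for zero outcomes keeps a tenant with no accepted drafts from showing up as a division error or as a misleading zero cost.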
Where to instrument
Every model call should pass through one server-side path (middleware) that already handles auth and rate limits. Cost data is recorded there, on the same boundary as tracing.
```
Client → your API → LLM middleware → provider
                          │
                          log tokens + estimated cost
                          enforce tenant budget
                          emit trace (Langfuse) + metrics (OTel)
```

Never rely on the browser to report usage. Never infer spend from unstructured console.log lines without consistent fields.
Request path: Client UI (copilot, search, actions) → Your API (existing auth session) → LLM middleware (auth, rate limits, logging) → Model provider (OpenAI, Anthropic, etc.). Every model call passes through your stack, not around it.
Layer 1: Structured logging (minimum viable)
Before dashboards, emit one structured log line per model call with fixed fields. This alone answers most "who spent what" questions if you ship to Datadog, CloudWatch, Grafana Loki, or similar.
Required fields:
- feature, tenantId, userId (hashed if policy requires)
- model, inputTokens, outputTokens
- latencyMs, outcome (success, timeout, rate_limited, error)
- requestId or trace ID for correlation with support tickets

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("llm.middleware")


@dataclass(frozen=True)
class LlmUsage:
    input_tokens: int
    output_tokens: int
    model: str


def log_llm_request(
    *,
    feature: str,
    tenant_id: str,
    user_id: str,
    request_id: str,
    usage: LlmUsage,
    latency_ms: int,
    outcome: str,
) -> None:
    # Fields in `extra` only reach your log backend if the handler's
    # formatter emits them (for example, a JSON formatter).
    logger.info(
        "llm.request",
        extra={
            "feature": feature,
            "tenant_id": tenant_id,
            "user_id": user_id,
            "request_id": request_id,  # correlate with traces and tickets
            "model": usage.model,
            "input_tokens": usage.input_tokens,
            "output_tokens": usage.output_tokens,
            "total_tokens": usage.input_tokens + usage.output_tokens,
            "latency_ms": latency_ms,
            "outcome": outcome,  # success | timeout | rate_limited | error
        },
    )
```

Datadog example query (adapt field names to your schema):

```
sum:llm.tokens{feature:copilot} by {tenant_id}.as_count()
```

Grafana Loki / LogQL example (rate only counts log lines, so the token field is unwrapped and summed instead):

```
sum by (tenant_id) (
  sum_over_time({app="api"} |= "llm.request" | json | unwrap total_tokens [1h])
)
```

Set an alert when a single tenant_id exceeds 2× its seven-day baseline. That catches runaway agents and abuse before the invoice closes.
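The 2× baseline rule can also be checked in code, for example in a scheduled job that reads daily totals per tenant. This is a minimal sketch: the function name `is_spend_spike` is hypothetical, and the baseline is assumed to be the plain mean of the previous seven daily totals.

```python
from statistics import mean
from typing import Sequence


def is_spend_spike(
    today_tokens: float,
    previous_days: Sequence[float],
    factor: float = 2.0,
) -> bool:
    """True when today's usage exceeds `factor` times the trailing average."""
    if not previous_days:
        return False  # no baseline yet, so do not alert on brand-new tenants
    baseline = mean(previous_days)
    # A zero baseline is treated like a missing one to avoid alerting on first use.
    return baseline > 0 and today_tokens > factor * baseline


last_week = [100_000] * 7
print(is_spend_spike(250_000, last_week))  # True
print(is_spend_spike(150_000, last_week))  # False
```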
Layer 2: Estimated cost per request
Providers bill in tokens; finance thinks in dollars. Middleware should compute an estimated costUsd on every completion using a pricing table you refresh when providers change rates.

```python
# USD per 1M tokens; refresh from provider pricing pages
MODEL_PRICING = {
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = MODEL_PRICING[model]  # raises KeyError for models missing from the table
    return (
        (input_tokens / 1_000_000) * rates["input"]
        + (output_tokens / 1_000_000) * rates["output"]
    )
```

Store the estimate on the log line and trace:

```json
{
  "feature": "copilot",
  "tenantId": "tenant_acme",
  "model": "claude-sonnet-4",
  "inputTokens": 4200,
  "outputTokens": 380,
  "costUsd": 0.0183,
  "outcome": "success"
}
```

Langfuse can display cost on generations when you pass usage and cost details. That lets you filter traces by expensive tenants and compare prompt versions side by side.

```python
cost_usd = estimate_cost_usd(
    model=response.model,
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
)
# Split the estimate using the real input rate rather than a fixed ratio,
# so dashboards show where the spend actually comes from.
input_rate = MODEL_PRICING[response.model]["input"]
input_cost_usd = (response.usage.input_tokens / 1_000_000) * input_rate
output_cost_usd = cost_usd - input_cost_usd

with langfuse.start_as_current_observation(
    as_type="generation",
    name="copilot-completion",
    model=response.model,
    input=messages,
) as generation:
    generation.update(
        output=response.text,
        usage_details={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
        },
        cost_details={
            "input": input_cost_usd,
            "output": output_cost_usd,
            "total": cost_usd,
        },
        metadata={"feature": "copilot", "tenantId": tenant_id},
    )
```

In the Langfuse UI:
- Open Traces → filter metadata.tenantId = tenant_acme and tags contains copilot
- Switch to Analytics → group by userId or custom metadata feature
- Compare cost per trace before and after a prompt change tagged promptVersion: v4
See Langfuse for LLM observability for full tracing setup.
Layer 3: OpenTelemetry metrics (SRE-friendly)
If your platform team already runs OpenTelemetry into Datadog, Honeycomb, Grafana Cloud, or Prometheus, export counters from middleware rather than building LLM-only silos.

```python
from opentelemetry import metrics

meter = metrics.get_meter("llm.middleware")

token_counter = meter.create_counter(
    "llm.tokens",
    description="LLM tokens by tenant and feature",
)
cost_counter = meter.create_counter(
    "llm.cost_usd",
    description="Estimated LLM spend in USD",
)


def record_otel_cost(attrs: dict, usage, cost_usd: float) -> None:
    # Attributes become the dimensions you group by in dashboards and monitors.
    labels = {
        "feature": attrs["feature"],
        "tenant_id": attrs["tenant_id"],
        "model": usage.model,
        "outcome": attrs["outcome"],
    }
    token_counter.add(usage.input_tokens + usage.output_tokens, labels)
    cost_counter.add(cost_usd, labels)
```

Example Datadog monitor:
- Metric: sum:llm.cost_usd{*} by {tenant_id}.rollup(sum, 86400)
- Alert: daily spend > $50 for any tenant not on the enterprise AI tier

Example Grafana panel (Prometheus exporters typically append _total to counter names; adjust the metric names to match what your backend stores):
- Stacked bar: sum(rate(llm_cost_usd[1h])) by (feature)
- Table: top 10 tenant_id by sum(increase(llm_tokens[24h]))
Langfuse's OpenTelemetry exporter can dual-write: LLM-native traces in Langfuse for debugging, aggregated llm.cost_usd in your existing stack for paging and executive dashboards.
Layer 4: Per-tenant budgets and enforcement
Observability without enforcement is a report card nobody acts on. For multi-tenant SaaS, add budget checks in middleware before the provider call.
Pattern:
- Read today's accumulated spend for tenantId from Redis (or your rate-limit store)
- Compare against plan limit (starter: $5/day, pro: $50/day, etc.)
- If over budget: return 429 or a degraded response ("AI assist temporarily unavailable")
- After a successful call: increment spend by costUsd

```python
import os
from datetime import date

import redis

redis_client = redis.Redis.from_url(os.environ["REDIS_URL"])


class TenantBudgetExceeded(Exception):
    """Raised when a tenant has used up its daily LLM budget."""


def daily_spend_key(tenant_id: str) -> str:
    # date.today() uses the server's local date; keep servers on one timezone.
    return f"llm:spend:{tenant_id}:{date.today().isoformat()}"


def check_tenant_budget(tenant_id: str, budget_usd: float) -> None:
    spent = float(redis_client.get(daily_spend_key(tenant_id)) or 0)
    if spent >= budget_usd:
        raise TenantBudgetExceeded(tenant_id, spent, budget_usd)


def record_tenant_spend(tenant_id: str, cost_usd: float) -> float:
    key = daily_spend_key(tenant_id)
    return float(redis_client.incrbyfloat(key, cost_usd))
```

Pair soft limits with rate limits (requests per minute) to stop runaway loops and agent retries from burning budget in minutes. See Prompt injection and LLM security for SaaS for abuse patterns that inflate token spend.
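To show how the budget check and the spend increment wrap a single model call, here is a minimal sketch. It swaps Redis for an in-memory dictionary so it runs without a server; `InMemorySpendStore`, `BudgetExceeded`, `call_with_budget`, and the `call_model` callable returning `(text, estimated cost)` are all hypothetical names for illustration.

```python
from typing import Callable, Dict, Tuple


class BudgetExceeded(Exception):
    """Raised before the provider call when a tenant is over its daily budget."""


class InMemorySpendStore:
    """Stand-in for Redis so the sketch runs without a server."""

    def __init__(self) -> None:
        self._spend: Dict[str, float] = {}

    def get(self, tenant_id: str) -> float:
        return self._spend.get(tenant_id, 0.0)

    def add(self, tenant_id: str, cost_usd: float) -> float:
        self._spend[tenant_id] = self.get(tenant_id) + cost_usd
        return self._spend[tenant_id]


def call_with_budget(
    store: InMemorySpendStore,
    tenant_id: str,
    budget_usd: float,
    call_model: Callable[[], Tuple[str, float]],
) -> str:
    """Check the budget, run the call, then record what it cost."""
    if store.get(tenant_id) >= budget_usd:
        raise BudgetExceeded(tenant_id)
    text, cost_usd = call_model()  # hypothetical: returns (text, estimated cost)
    store.add(tenant_id, cost_usd)
    return text


def fake_call() -> Tuple[str, float]:
    return "draft reply", 3.0  # pretend every call costs $3


store = InMemorySpendStore()
print(call_with_budget(store, "tenant_acme", 5.0, fake_call))  # draft reply
print(call_with_budget(store, "tenant_acme", 5.0, fake_call))  # draft reply
try:
    call_with_budget(store, "tenant_acme", 5.0, fake_call)
except BudgetExceeded:
    print("over budget")  # spend reached $6 against a $5 limit
```

Note that the check and the increment are separate steps, so concurrent requests can overshoot the limit by the cost of the calls in flight. That is usually acceptable for a soft daily budget, which is one more reason to pair it with a request rate limit.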
Provider-side guardrails (belt and suspenders)
Use these in addition to middleware, not instead of it:
| Tool | What it does |
|---|---|
| OpenAI | Project-level budgets, separate API keys per environment, usage limits per key |
| Anthropic | Workspaces with distinct keys; monitor usage in Console |
| Google Vertex / Gemini | Cloud Billing budgets and alerts on the GCP project |
| Langfuse Cloud | Trace volume and cost analytics; self-hosted for data residency |
Provider limits catch catastrophes. Middleware limits catch tenant-level fairness your provider will never enforce for you.
Dashboards worth building on day one
You do not need twenty charts. Ship these four before GA:
1. Spend by feature (stacked daily)
Answers: "Is the copilot or the classifier driving growth?"
2. Top tenants by cost (table, 24h / 7d)
Answers: "Who would we call if we need to throttle or upsell?"
3. Tokens per successful action (trend)
Requires joining middleware logs with a product event (draft_accepted, ticket_resolved). Answers: "Are we getting more efficient or just busier?"
4. Cost by model (pie or bar)
Answers: "Did last week's routing change actually shift traffic to the cheaper model?"
Use these four dashboards as a gate before calling an AI feature GA, not as a post-launch backlog.
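The join behind the third dashboard, tokens per successful action, can be sketched as follows. This assumes each middleware log row carries a request_id, feature, and total_tokens, and that the product emits a set of request IDs for accepted outcomes; the function name `tokens_per_success` and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def tokens_per_success(
    llm_logs: Iterable[dict], accepted_request_ids: Set[str]
) -> Dict[str, float]:
    """All tokens spent per feature divided by distinct accepted requests."""
    tokens: Dict[str, int] = defaultdict(int)
    accepted: Dict[str, Set[str]] = defaultdict(set)
    for row in llm_logs:
        feature = row["feature"]
        # Failed and abandoned calls still count toward the numerator.
        tokens[feature] += row["total_tokens"]
        if row["request_id"] in accepted_request_ids:
            accepted[feature].add(row["request_id"])
    # Features with no accepted outcome are omitted rather than divided by zero.
    return {f: total / len(accepted[f]) for f, total in tokens.items() if accepted[f]}


logs = [
    {"request_id": "r1", "feature": "copilot", "total_tokens": 4000},
    {"request_id": "r1", "feature": "copilot", "total_tokens": 2000},  # second call, same draft
    {"request_id": "r2", "feature": "copilot", "total_tokens": 6000},  # draft never accepted
    {"request_id": "r3", "feature": "classifier", "total_tokens": 500},
]
print(tokens_per_success(logs, {"r1"}))  # {'copilot': 12000.0}
```

Counting distinct request IDs rather than log rows matters when one draft takes several model calls: the draft is one success, but every call adds to the token total.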
Alerts that prevent surprises
| Alert | Condition | Action |
|---|---|---|
| Tenant spike | Tenant daily cost > 2× 7-day avg | Notify CS + auto-throttle AI features |
| Feature regression | tokens_per_success up 40% WoW after deploy | Roll back prompt version; check Langfuse promptVersion filter |
| Error burn | outcome:error rate > 5% and tokens still rising | Often retry storms; tighten timeouts and max retries |
| Budget threshold | 80% of monthly org budget in week one | Page platform; enable kill switch for non-critical features |
| Embed surge | RAG embed tokens > generation tokens | Chunking or re-index job run amok; check batch pipelines |
Wire critical alerts to the same channel as payment or export failures. LLM spend is a reliability and revenue issue, not only a finance report.
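Two of the conditions in the table, feature regression and error burn, reduce to small predicates. This is a sketch under assumptions: the function names are hypothetical, the thresholds are the ones from the table, and "tokens still rising" is read as a higher token total in the current window than in the previous one.

```python
from typing import Sequence


def feature_regressed(
    prev_tokens_per_success: float,
    curr_tokens_per_success: float,
    threshold: float = 0.40,
) -> bool:
    """True when tokens per success rose by more than `threshold` week over week."""
    if prev_tokens_per_success <= 0:
        return False  # no usable baseline
    change = (curr_tokens_per_success - prev_tokens_per_success) / prev_tokens_per_success
    return change > threshold


def error_burn(
    outcomes: Sequence[str],
    tokens_prev: int,
    tokens_now: int,
    max_error_rate: float = 0.05,
) -> bool:
    """True when the error rate is above the limit while token volume still grows."""
    if not outcomes:
        return False
    error_rate = sum(1 for o in outcomes if o == "error") / len(outcomes)
    return error_rate > max_error_rate and tokens_now > tokens_prev


print(feature_regressed(5_000, 7_500))  # True: +50% week over week
print(feature_regressed(5_000, 6_000))  # False: +20%
print(error_burn(["success"] * 90 + ["error"] * 10, 1_000_000, 1_400_000))  # True
```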
Reducing cost without guessing
Once you can see spend by tenant and feature, optimization is measurable:
Route by task complexity
Use smaller models for classification and extraction; reserve large models for multi-turn copilots. Middleware selectModel(feature, tenantId) centralizes this. See When not to use RAG: many "AI" tasks do not need the biggest model or retrieval at all.
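One way to centralize routing is a lookup table with per-tenant overrides. This is a minimal sketch, not a prescribed design: the table contents, the default, and the override for tenant_acme are hypothetical, and the model names simply reuse those from the pricing example.

```python
from typing import Dict, Tuple

# Hypothetical routing table: feature -> model.
FEATURE_MODELS: Dict[str, str] = {
    "classifier": "gpt-4.1-mini",
    "extraction": "gpt-4.1-mini",
    "copilot": "claude-sonnet-4",
}
DEFAULT_MODEL = "gpt-4.1-mini"

# Hypothetical per-tenant overrides, e.g. a tenant on a cheaper plan.
TENANT_OVERRIDES: Dict[Tuple[str, str], str] = {
    ("tenant_acme", "copilot"): "gpt-4.1-mini",
}


def select_model(feature: str, tenant_id: str) -> str:
    """Pick a model by tenant override first, then by feature, then the default."""
    override = TENANT_OVERRIDES.get((tenant_id, feature))
    if override is not None:
        return override
    return FEATURE_MODELS.get(feature, DEFAULT_MODEL)


print(select_model("copilot", "tenant_beta"))  # claude-sonnet-4
print(select_model("copilot", "tenant_acme"))  # gpt-4.1-mini
print(select_model("search", "tenant_beta"))   # gpt-4.1-mini
```

Because the chosen model name ends up on every log line, the cost-by-model dashboard shows whether a routing change actually moved traffic.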
Trim context assembly
High input token counts usually mean the context builder sends too much: full thread history, unstripped HTML, duplicate records. Fix the builder; do not only switch models.
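A context builder can enforce an input budget by keeping only the best-scoring chunks that fit. The sketch below is illustrative: `trim_chunks` and `approx_tokens` are hypothetical names, and the four-characters-per-token estimate is a rough assumption that should be replaced with your provider's tokenizer.

```python
from typing import List, Tuple


def approx_tokens(text: str) -> int:
    # Rough assumption: about four characters per token. Use a real tokenizer in production.
    return max(1, len(text) // 4)


def trim_chunks(chunks: List[Tuple[float, str]], max_tokens: int) -> List[str]:
    """Keep the highest-scoring (score, text) chunks that fit within max_tokens."""
    kept: List[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost > max_tokens:
            continue  # skip chunks that would exceed the budget
        kept.append(text)
        used += cost
    return kept


retrieved = [(0.9, "a" * 400), (0.5, "b" * 400), (0.8, "c" * 40)]
selected = trim_chunks(retrieved, max_tokens=120)
print([len(t) for t in selected])  # [400, 40]
```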
Cache stable work
Cache embeddings for unchanged documents. Cache FAQ or policy answers keyed by (tenantId, questionHash) with short TTL. Log cache_hit: true so dashboards separate fresh spend from cache savings.
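The (tenantId, questionHash) key with a short TTL can be sketched with an in-memory dictionary. The class name `AnswerCache` is hypothetical, the normalization (lowercase, collapsed whitespace) is an assumption about what counts as the same question, and a production version would more likely live in Redis.

```python
import hashlib
import time
from typing import Dict, Optional, Tuple


class AnswerCache:
    """In-memory cache keyed by (tenant_id, question hash) with a TTL."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._entries: Dict[Tuple[str, str], Tuple[float, str]] = {}

    @staticmethod
    def _key(tenant_id: str, question: str) -> Tuple[str, str]:
        # Normalize case and whitespace so trivial variations share an entry.
        normalized = " ".join(question.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        return tenant_id, digest

    def put(self, tenant_id: str, question: str, answer: str,
            now: Optional[float] = None) -> None:
        stored_at = time.monotonic() if now is None else now
        self._entries[self._key(tenant_id, question)] = (stored_at, answer)

    def get(self, tenant_id: str, question: str,
            now: Optional[float] = None) -> Optional[str]:
        current = time.monotonic() if now is None else now
        entry = self._entries.get(self._key(tenant_id, question))
        if entry is None or current - entry[0] > self._ttl:
            return None  # miss or expired: the caller pays for a fresh completion
        return entry[1]


cache = AnswerCache(ttl_seconds=300)
cache.put("tenant_acme", "What is your refund policy?", "30 days.", now=0.0)
hit = cache.get("tenant_acme", "what is your  refund policy?", now=10.0)
print({"cache_hit": hit is not None})  # {'cache_hit': True}
```

Including the tenant in the key keeps one tenant's policy answer from ever being served to another.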
Cap agent loops
Agents that retry tools or loop on errors can multiply cost per session. Enforce maxSteps, per-session token budgets, and timeouts. See Build an agent with LangChain.
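A step cap and a per-session token budget can both live in the loop that drives the agent. This is a minimal sketch, not tied to LangChain or any framework: `run_agent` is a hypothetical driver, and `step` is assumed to be a callable that performs one agent step and returns whether it finished plus the tokens it used.

```python
from typing import Callable, Tuple


def run_agent(
    step: Callable[[], Tuple[bool, int]],
    max_steps: int,
    max_tokens: int,
) -> str:
    """Run `step` until it reports done or a cap is hit; return the stop reason."""
    tokens_used = 0
    for _ in range(max_steps):
        done, tokens = step()
        tokens_used += tokens
        if done:
            return "done"
        if tokens_used >= max_tokens:
            return "token_budget"
    return "max_steps"


def stuck_step() -> Tuple[bool, int]:
    return False, 100  # never finishes, burns 100 tokens per step


print(run_agent(stuck_step, max_steps=5, max_tokens=10_000))  # max_steps
print(run_agent(stuck_step, max_steps=50, max_tokens=250))    # token_budget
print(run_agent(lambda: (True, 80), max_steps=5, max_tokens=10_000))  # done
```

Logging the stop reason as the request outcome makes capped sessions visible in the same dashboards as timeouts and errors.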
Eval before expensive architecture
Running evals costs tokens, but that is cheaper than shipping retrieval or a larger model to every tenant without proof. See Eval pipelines for LLM features.
Common mistakes
Waiting for the invoice. By then the damage is done and you cannot attribute it.
Logging tokens without tenantId or feature. Aggregates are useless in SaaS.
Cost per API call only. A copilot that takes three model calls to produce one draft looks cheap per call and expensive per outcome.
Ignoring embed and re-rank costs. RAG bills start at the vector store and embedding API, not only the final chat completion.
Budgets in config files nobody updates. Tie limits to plan tier in your billing system or tenant settings table.
Alerts with no kill switch. Observability plus a runbook entry "disable feature=copilot via flag" beats a Slack message nobody can act on at 2 a.m.
Rollout order
- Structured log with tokens, model, tenant, feature, outcome
- Cost estimate on every completion; reconcile monthly with provider billing
- Langfuse (or equivalent) for trace-level drill-down and prompt version comparison
- OTel metrics into your existing observability stack for alerts
- Per-tenant budgets enforced in middleware before GA to external tenants
- Unit economics dashboard joined to product success events
Measure quality, cost, and support load at each stage before expanding.
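The monthly reconciliation step in the rollout list amounts to comparing your summed estimates against the provider invoice and flagging drift. This is a sketch under assumptions: the function name `reconcile`, the 5% tolerance, and the sample figures are all hypothetical.

```python
from typing import Tuple


def reconcile(
    estimated_usd: float, invoiced_usd: float, tolerance: float = 0.05
) -> Tuple[float, bool]:
    """Return (relative drift of the estimate, whether it is within tolerance)."""
    if invoiced_usd <= 0:
        raise ValueError("invoiced_usd must be positive")
    drift = (estimated_usd - invoiced_usd) / invoiced_usd
    return drift, abs(drift) <= tolerance


drift, ok = reconcile(estimated_usd=612.0, invoiced_usd=640.0)
print(f"{drift:+.1%} within tolerance: {ok}")  # -4.4% within tolerance: True
```

A persistent negative drift usually means the pricing table is stale or some calls bypass the middleware; either way it is worth finding before the estimates feed customer-facing budgets.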
Putting it together
LLM cost monitoring is not a finance spreadsheet exercise. It is middleware instrumentation: the same boundary where you already enforce auth, assemble context, and route models. Log consistently, estimate cost per request, aggregate by tenant and feature, alert on spikes, and enforce budgets before the provider charges you.
If you cannot answer "what did tenant X spend on copilot yesterday" from your own metrics, you are not ready to scale AI traffic, regardless of how good the demo looks.
Related resources
More on observability:
- Langfuse for LLM observability: where it fits in your middleware stack. How to trace model calls, debug prompts, and run evals with Langfuse, integrated into server-side LLM middleware, not bolted onto a frontend demo.
- LLM middleware: what it is, why you need it, and how to implement it. A practical guide to the server-side layer between your app and the model: auth, rate limits, routing, logging, and the patterns that keep AI features production-ready.
- What production-ready LLM integration actually means. A practical checklist for engineering leaders, beyond the demo and before you call an AI feature shipped.
