475Cumulus
Guide

Monitoring LLM costs in production: tokens, tenants, and alerts

A practical guide to LLM cost observability: structured logging, Langfuse dashboards, OpenTelemetry metrics, per-tenant budgets, and the unit economics finance actually needs.

The copilot launched. Usage climbed. Three weeks later finance forwards the OpenAI invoice and asks a simple question nobody can answer: which customers, which features, and which workflows drove the spend?

Provider dashboards show account totals. They do not show that tenant ACME's support copilot costs $0.42 per resolved ticket while tenant Beta burns $3.10 per draft because retrieval returns forty chunks nobody trimmed. Invoice-level visibility is too late and too coarse. Cost monitoring belongs in your LLM middleware, tagged like any other multi-tenant metric.

What to measure (and what to ignore)

Total monthly tokens is a lagging indicator. Production teams track unit economics tied to user outcomes:

| Metric | Why it matters |
| --- | --- |
| Tokens per successful action | Cost per draft accepted, ticket summarized, or search answered |
| Spend by tenantId | Multi-tenant fairness, abuse detection, customer profitability |
| Spend by feature | Copilot vs classifier vs RAG assist: where to optimize first |
| Input vs output tokens | Bloated context assembly shows up as high input; rambling models as high output |
| Retrieval + embed cost | RAG has a bill before the LLM runs; trace it separately |
| Failed / timeout requests | You still pay for many partial generations; track outcome |

Example: support copilot unit economics

Suppose last week your middleware logged:

  • 12,400 copilot requests across 85 tenants
  • 48.2M total tokens ($612 estimated at blended rates)
  • 9,100 requests where the agent clicked "insert draft" (outcome: success with downstream accept event)

Blended cost per request: $612 ÷ 12,400 ≈ $0.049

Cost per accepted draft: $612 ÷ 9,100 ≈ $0.067

That second number is what product and finance can reason about. Now suppose ACME alone ran 2,200 requests but only 180 ended in an accepted draft. At the blended $0.049 per request, that is about $108 of spend, or roughly $0.60 per accepted draft: about nine times the fleet-wide $0.067. Without tenant and outcome dimensions, you would never see the gap.
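The two fleet-wide figures above can be reproduced from a single rollup record. This is a minimal sketch, not the author's implementation: `UsageRollup` and both helper functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UsageRollup:
    """Hypothetical weekly aggregate built from middleware logs."""
    requests: int
    cost_usd: float
    accepted: int  # requests followed by a downstream accept event


def cost_per_request(rollup: UsageRollup) -> float:
    return rollup.cost_usd / rollup.requests


def cost_per_accepted(rollup: UsageRollup) -> Optional[float]:
    # None when nothing was accepted: the unit cost is undefined, not zero.
    if rollup.accepted == 0:
        return None
    return rollup.cost_usd / rollup.accepted


# Fleet-wide figures from the example week above.
fleet = UsageRollup(requests=12_400, cost_usd=612.0, accepted=9_100)
print(f"per request:  ${cost_per_request(fleet):.3f}")   # $0.049
print(f"per accepted: ${cost_per_accepted(fleet):.3f}")  # $0.067
```

Running the same two functions on one rollup per tenant is what exposes outliers that the fleet average hides.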

Where to instrument

Every model call should pass through one server-side path (middleware) that already handles auth and rate limits. Cost data is recorded there, on the same boundary as tracing.

Client  →  your API  →  LLM middleware  →  provider

                    log tokens + estimated cost
                    enforce tenant budget
                    emit trace (Langfuse) + metrics (OTel)

Never rely on the browser to report usage. Never infer spend from unstructured console.log lines without consistent fields.

Figure: Request flow through LLM middleware

  • Client UI: copilot, search, actions
  • Your API: existing auth session
  • LLM middleware: auth, rate limits, logging; injects tenant-scoped context, enforces tool permissions, records tokens and latency
  • Model provider: OpenAI, Anthropic, etc.

Every model call passes through your stack, not around it.

Layer 1: Structured logging (minimum viable)

Before dashboards, emit one structured log line per model call with fixed fields. This alone answers most "who spent what" questions if you ship to Datadog, CloudWatch, Grafana Loki, or similar.

Required fields:

  • feature, tenantId, userId (hashed if policy requires)
  • model, inputTokens, outputTokens
  • latencyMs, outcome (success, timeout, rate_limited, error)
  • requestId or trace ID for correlation with support tickets

import logging
from dataclasses import dataclass

logger = logging.getLogger("llm.middleware")

@dataclass(frozen=True)
class LlmUsage:
    input_tokens: int
    output_tokens: int
    model: str

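def log_llm_request(
    *,
    request_id: str,
    feature: str,
    tenant_id: str,
    user_id: str,
    usage: LlmUsage,
    latency_ms: int,
    outcome: str,
) -> None:
    # One structured line per model call. The `extra` fields only show up as
    # JSON keys once a JSON formatter is attached to this logger.
    logger.info(
        "llm.request",
        extra={
            "request_id": request_id,  # correlate with traces and support tickets
            "feature": feature,
            "tenant_id": tenant_id,
            "user_id": user_id,  # hash upstream if policy requires
            "model": usage.model,
            "input_tokens": usage.input_tokens,
            "output_tokens": usage.output_tokens,
            "total_tokens": usage.input_tokens + usage.output_tokens,
            "latency_ms": latency_ms,
            "outcome": outcome,  # success | timeout | rate_limited | error
        },
    )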

Datadog example query (assumes a metric named llm.tokens, for example a log-based metric generated from these lines; adapt field names to your schema):

sum:llm.tokens{feature:copilot} by {tenant_id}.as_count()

Grafana Loki / LogQL example:

sum by (tenant_id) (
  sum_over_time({app="api"} |= "llm.request" | json | unwrap total_tokens [1h])
)

Set an alert when a single tenant_id exceeds 2× its seven-day baseline. That catches runaway agents and abuse before the invoice closes.
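The baseline rule can be expressed in a few lines if you evaluate it yourself rather than in your monitoring tool. A minimal sketch, assuming you already have one usage total per tenant per day; `is_spend_spike` and its parameters are hypothetical names, and treating a tenant with no history as "no alert" is a design choice rather than something the rule dictates.

```python
from statistics import mean
from typing import List


def is_spend_spike(today: float, previous_days: List[float], factor: float = 2.0) -> bool:
    """True when today's usage exceeds `factor` times the mean of the previous days."""
    if not previous_days:
        return False  # no baseline yet, so do not alert on a brand-new tenant
    baseline = mean(previous_days)
    # A zero baseline would flag any usage at all; leave that case to budgets.
    return baseline > 0 and today > factor * baseline


week = [1_000.0] * 7  # seven prior days of token totals for one tenant
print(is_spend_spike(2_500.0, week))  # True
print(is_spend_spike(1_900.0, week))  # False
```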

Layer 2: Estimated cost per request

Providers bill in tokens; finance thinks in dollars. Middleware should compute an estimated costUsd on every completion using a pricing table you refresh when providers change rates.

# USD per 1M tokens — refresh from provider pricing pages
MODEL_PRICING = {
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = MODEL_PRICING[model]
    return (
        (input_tokens / 1_000_000) * rates["input"]
        + (output_tokens / 1_000_000) * rates["output"]
    )

Store the estimate on the log line and trace:

{
  "feature": "copilot",
  "tenantId": "tenant_acme",
  "model": "claude-sonnet-4",
  "inputTokens": 4200,
  "outputTokens": 380,
  "costUsd": 0.0183,
  "outcome": "success"
}

Langfuse can display cost on generations when you pass usage and cost details. That lets you filter traces by expensive tenants and compare prompt versions side by side.

cost_usd = estimate_cost_usd(
    model=response.model,
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
)

with langfuse.start_as_current_observation(
    as_type="generation",
    name="copilot-completion",
    model=response.model,
    input=messages,
) as generation:
    generation.update(
        output=response.text,
        usage_details={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
        },
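        cost_details={
            # Split the estimate by direction using the same pricing table.
            "input": estimate_cost_usd(response.model, response.usage.input_tokens, 0),
            "output": estimate_cost_usd(response.model, 0, response.usage.output_tokens),
            "total": cost_usd,
        },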
        metadata={"feature": "copilot", "tenantId": tenant_id},
    )

In the Langfuse UI:

  1. Open Traces → filter metadata.tenantId = tenant_acme and tags contains copilot
  2. Switch to Analytics → group by userId or custom metadata feature
  3. Compare cost per trace before and after a prompt change tagged promptVersion: v4

See Langfuse for LLM observability for full tracing setup.

Layer 3: OpenTelemetry metrics (SRE-friendly)

If your platform team already runs OpenTelemetry into Datadog, Honeycomb, Grafana Cloud, or Prometheus, export counters from middleware rather than building LLM-only silos.

from opentelemetry import metrics

meter = metrics.get_meter("llm.middleware")
token_counter = meter.create_counter(
    "llm.tokens",
    description="LLM tokens by tenant and feature",
)
cost_counter = meter.create_counter(
    "llm.cost_usd",
    description="Estimated LLM spend in USD",
)

def record_otel_cost(attrs: dict, usage, cost_usd: float) -> None:
    labels = {
        "feature": attrs["feature"],
        "tenant_id": attrs["tenant_id"],
        "model": usage.model,
        "outcome": attrs["outcome"],
    }
    token_counter.add(usage.input_tokens + usage.output_tokens, labels)
    cost_counter.add(cost_usd, labels)

Example Datadog monitor:

  • Metric: sum:llm.cost_usd{*} by {tenant_id}.rollup(sum, 86400)
  • Alert: daily spend > $50 for any tenant not on the enterprise AI tier

Example Grafana panel:

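  • Stacked bar: sum by (feature) (rate(llm_cost_usd_total[1h]))
  • Table: topk(10, sum by (tenant_id) (increase(llm_tokens_total[24h])))

Exact metric names depend on your exporter. Prometheus-style exporters typically append _total to counters, so check the names your backend actually shows.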

Langfuse's OpenTelemetry exporter can dual-write: LLM-native traces in Langfuse for debugging, aggregated llm.cost_usd in your existing stack for paging and executive dashboards.

Layer 4: Per-tenant budgets and enforcement

Observability without enforcement is a report card nobody acts on. For multi-tenant SaaS, add budget checks in middleware before the provider call.

Pattern:

  1. Read today's accumulated spend for tenantId from Redis (or your rate-limit store)
  2. Compare against plan limit (starter: $5/day, pro: $50/day, etc.)
  3. If over budget: return 429 or a degraded response ("AI assist temporarily unavailable")
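  4. After a successful call: increment spend by costUsd

import os
from datetime import date

import redis

redis_client = redis.Redis.from_url(os.environ["REDIS_URL"])

class TenantBudgetExceeded(Exception):
    """Raised with (tenant_id, spent_usd, budget_usd) when the daily budget is used up."""

def daily_spend_key(tenant_id: str) -> str:
    # One counter per tenant per calendar day (server-local date).
    return f"llm:spend:{tenant_id}:{date.today().isoformat()}"

def check_tenant_budget(tenant_id: str, budget_usd: float) -> None:
    # Call before the provider request; a missing key means nothing spent yet.
    spent = float(redis_client.get(daily_spend_key(tenant_id)) or 0)
    if spent >= budget_usd:
        raise TenantBudgetExceeded(tenant_id, spent, budget_usd)

def record_tenant_spend(tenant_id: str, cost_usd: float) -> float:
    # Call after the provider responds; returns the new running total.
    key = daily_spend_key(tenant_id)
    total = float(redis_client.incrbyfloat(key, cost_usd))
    redis_client.expire(key, 2 * 24 * 60 * 60)  # let old daily keys age out
    return total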

Pair soft limits with rate limits (requests per minute) to stop runaway loops and agent retries from burning budget in minutes. See Prompt injection and LLM security for SaaS for abuse patterns that inflate token spend.
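To show how the four steps of the pattern fit around a single model call, here is a minimal sketch that swaps Redis for an in-memory store so it runs anywhere. Everything named here is hypothetical: `InMemorySpendStore`, `BudgetExceeded`, `guarded_call`, and the `call_model` callback, which stands in for the provider request and returns its estimated cost in USD. The plan limits are the ones quoted in the pattern.

```python
from typing import Callable, Dict


class BudgetExceeded(Exception):
    """Raised when a tenant is already at or over its daily budget."""


class InMemorySpendStore:
    """Stand-in for Redis so the sketch runs without a server."""

    def __init__(self) -> None:
        self._spent: Dict[str, float] = {}

    def get(self, tenant_id: str) -> float:
        return self._spent.get(tenant_id, 0.0)

    def add(self, tenant_id: str, cost_usd: float) -> float:
        self._spent[tenant_id] = self.get(tenant_id) + cost_usd
        return self._spent[tenant_id]


PLAN_BUDGET_USD = {"starter": 5.0, "pro": 50.0}  # daily limits per plan


def guarded_call(
    store: InMemorySpendStore,
    tenant_id: str,
    plan: str,
    call_model: Callable[[], float],
) -> float:
    # Steps 1-3: read today's spend and refuse before the provider is called.
    if store.get(tenant_id) >= PLAN_BUDGET_USD[plan]:
        raise BudgetExceeded(tenant_id)
    # Step 4: only a completed call adds to the running total.
    cost_usd = call_model()
    return store.add(tenant_id, cost_usd)
```

Because the check happens before the call, a tenant can overshoot its limit by at most one request's cost. A production version would also need to handle concurrent requests racing past the check, which this sketch ignores.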

Provider-side guardrails (belt and suspenders)

Use these in addition to middleware, not instead of it:

| Tool | What it does |
| --- | --- |
| OpenAI | Project-level budgets, separate API keys per environment, usage limits per key |
| Anthropic | Workspaces with distinct keys; monitor usage in Console |
| Google Vertex / Gemini | Cloud Billing budgets and alerts on the GCP project |
| Langfuse Cloud | Trace volume and cost analytics; self-hosted for data residency |

Provider limits catch catastrophes. Middleware limits catch tenant-level fairness your provider will never enforce for you.

Dashboards worth building on day one

You do not need twenty charts. Ship these four before GA:

1. Spend by feature (stacked daily)

Answers: "Is the copilot or the classifier driving growth?"

2. Top tenants by cost (table, 24h / 7d)

Answers: "Who would we call if we need to throttle or upsell?"

3. Tokens per successful action (trend)

Requires joining middleware logs with a product event (draft_accepted, ticket_resolved). Answers: "Are we getting more efficient or just busier?"
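The join itself is small once both sides share a request ID. A minimal sketch, assuming log rows are dicts with `request_id`, `feature`, and `total_tokens` keys and that the product side can supply the set of request IDs that ended in an accept event; the function name and row shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def tokens_per_success(
    llm_logs: Iterable[dict],
    accepted_request_ids: Set[str],
) -> Dict[str, float]:
    """Total tokens per feature divided by the number of accepted requests."""
    tokens: Dict[str, int] = defaultdict(int)
    accepted: Dict[str, Set[str]] = defaultdict(set)
    for row in llm_logs:
        # Every call counts toward cost, whether or not it led to an accept.
        tokens[row["feature"]] += row["total_tokens"]
        if row["request_id"] in accepted_request_ids:
            # A set keeps multi-call requests from counting as several successes.
            accepted[row["feature"]].add(row["request_id"])
    return {
        feature: total / len(accepted[feature])
        for feature, total in tokens.items()
        if accepted[feature]
    }


logs = [
    {"request_id": "r1", "feature": "copilot", "total_tokens": 1_000},
    {"request_id": "r1", "feature": "copilot", "total_tokens": 2_000},
    {"request_id": "r2", "feature": "copilot", "total_tokens": 3_000},
]
print(tokens_per_success(logs, {"r1"}))  # {'copilot': 6000.0}
```

Features with no accepted requests are left out of the result rather than reported as zero, since their unit cost is undefined.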

4. Cost by model (pie or bar)

Answers: "Did last week's routing change actually shift traffic to the cheaper model?"

Production readiness checklist

  • Server-side auth
  • Tenant-scoped context
  • Structured logging
  • Cost per action
  • Eval pipeline
  • Provider fallback
  • Feature flags
  • Audit on tool calls

Use this as a gate before calling an AI feature GA, not as a post-launch backlog.

Alerts that prevent surprises

| Alert | Condition | Action |
| --- | --- | --- |
| Tenant spike | Tenant daily cost > 2× 7-day avg | Notify CS + auto-throttle AI features |
| Feature regression | tokens_per_success up 40% WoW after deploy | Roll back prompt version; check Langfuse promptVersion filter |
| Error burn | outcome:error rate > 5% and tokens still rising | Often retry storms; tighten timeouts and max retries |
| Budget threshold | 80% of monthly org budget in week one | Page platform; enable kill switch for non-critical features |
| Embed surge | RAG embed tokens > generation tokens | Chunking or re-index job run amok; check batch pipelines |

Wire critical alerts to the same channel as payment or export failures. LLM spend is a reliability and revenue issue, not only a finance report.

Reducing cost without guessing

Once you can see spend by tenant and feature, optimization is measurable:

Route by task complexity

Use smaller models for classification and extraction; reserve large models for multi-turn copilots. A single middleware function such as selectModel(feature, tenantId) centralizes this decision. See When not to use RAG: many "AI" tasks do not need the biggest model or retrieval at all.
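A routing function of this kind might look like the following minimal sketch. The model names are the two from the pricing table earlier in this guide; the feature set, the per-tenant override map, and the snake_case name `select_model` are assumptions made for illustration.

```python
from typing import Dict, Optional

SMALL_MODEL = "gpt-4.1-mini"
LARGE_MODEL = "claude-sonnet-4"

# Hypothetical feature names: tasks that rarely need the larger model.
SMALL_MODEL_FEATURES = {"classifier", "extraction"}


def select_model(
    feature: str,
    tenant_id: str,
    tenant_overrides: Optional[Dict[str, str]] = None,
) -> str:
    # An explicit per-tenant override (for example, a contract term) wins.
    if tenant_overrides and tenant_id in tenant_overrides:
        return tenant_overrides[tenant_id]
    if feature in SMALL_MODEL_FEATURES:
        return SMALL_MODEL
    return LARGE_MODEL


print(select_model("classifier", "tenant_acme"))  # gpt-4.1-mini
print(select_model("copilot", "tenant_acme"))     # claude-sonnet-4
```

Keeping the rule in one place means a routing change shows up as a single diff and a visible shift in the cost-by-model chart.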

Trim context assembly

High input token counts usually mean the context builder sends too much: full thread history, unstripped HTML, duplicate records. Fix the builder; do not only switch models.
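One way to fix the builder is to cap retrieved context at a token budget and drop duplicates before the prompt is assembled. A minimal sketch, assuming chunks arrive already ranked by relevance; `estimate_tokens` uses a rough four-characters-per-token heuristic, which is an assumption and should be replaced with your provider's tokenizer for real counts.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough heuristic only (about 4 characters per token).
    return max(1, len(text) // 4)


def trim_chunks(chunks: List[str], max_input_tokens: int) -> List[str]:
    """Keep chunks in ranked order, skip exact duplicates, stop at the budget."""
    kept: List[str] = []
    seen = set()
    used = 0
    for chunk in chunks:
        if chunk in seen:
            continue  # duplicate record, adds tokens but no information
        cost = estimate_tokens(chunk)
        if used + cost > max_input_tokens:
            break  # lower-ranked chunks are dropped once the budget is full
        kept.append(chunk)
        seen.add(chunk)
        used += cost
    return kept


ranked = ["a" * 40, "a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
print(len(trim_chunks(ranked, max_input_tokens=20)))  # 2
```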

Cache stable work

Cache embeddings for unchanged documents. Cache FAQ or policy answers keyed by (tenantId, questionHash) with short TTL. Log cache_hit: true so dashboards separate fresh spend from cache savings.
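The answer cache can be sketched as a small TTL map keyed by tenant and a hash of the normalized question. This is an in-process illustration under stated assumptions: `AnswerCache` is a hypothetical class, the normalization (lowercase, collapsed whitespace) is one possible choice, and a shared store would replace the dict in a multi-instance deployment.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class AnswerCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Tuple[str, str], Tuple[float, str]] = {}

    @staticmethod
    def question_hash(question: str) -> str:
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, tenant_id: str, question: str) -> Optional[str]:
        key = (tenant_id, self.question_hash(question))
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return answer

    def put(self, tenant_id: str, question: str, answer: str) -> None:
        key = (tenant_id, self.question_hash(question))
        self._entries[key] = (self._clock() + self._ttl, answer)
```

Including the tenant in the key keeps one tenant's cached answer from ever being served to another. The caller would log `cache_hit: true` whenever `get` returns a value.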

Cap agent loops

Agents that retry tools or loop on errors can multiply cost per session. Enforce maxSteps, per-session token budgets, and timeouts. See Build an agent with LangChain.
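Both caps can live in the loop that drives the agent, independent of any framework. A minimal sketch: `run_agent` and the `step` callback are hypothetical, with `step(i)` standing in for one agent iteration that returns a final answer (or `None` to continue) plus the tokens it used.

```python
from typing import Callable, Optional, Tuple

# (final answer or None to keep going, tokens used by this step)
StepResult = Tuple[Optional[str], int]


def run_agent(
    step: Callable[[int], StepResult],
    max_steps: int,
    max_session_tokens: int,
) -> Tuple[Optional[str], str, int]:
    """Return (answer, stop_reason, tokens_used) for one agent session."""
    tokens_used = 0
    for i in range(max_steps):
        answer, step_tokens = step(i)
        tokens_used += step_tokens
        if answer is not None:
            return answer, "done", tokens_used
        if tokens_used >= max_session_tokens:
            # Stop before starting another step once the budget is spent.
            return None, "token_budget", tokens_used
    return None, "max_steps", tokens_used


# An agent that never finishes and burns 500 tokens per step.
print(run_agent(lambda i: (None, 500), max_steps=10, max_session_tokens=1_200))
# (None, 'token_budget', 1500)
```

Logging the stop reason alongside the usual cost fields makes it easy to see how often sessions end on a cap rather than on an answer.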

Eval before expensive architecture

Running evals costs tokens, but that is cheaper than shipping retrieval or a larger model to every tenant without proof. See Eval pipelines for LLM features.

Common mistakes

Waiting for the invoice. By then the damage is done and you cannot attribute it.

Logging tokens without tenantId or feature. Aggregates are useless in SaaS.

Cost per API call only. A copilot that takes three model calls to produce one draft looks cheap per call and expensive per outcome.

Ignoring embed and re-rank costs. RAG bills start at the vector store and embedding API, not only the final chat completion.

Budgets in config files nobody updates. Tie limits to plan tier in your billing system or tenant settings table.

Alerts with no kill switch. Observability plus a runbook entry "disable feature=copilot via flag" beats a Slack message nobody can act on at 2 a.m.

Rollout order

  1. Structured log with tokens, model, tenant, feature, outcome
  2. Cost estimate on every completion; reconcile monthly with provider billing
  3. Langfuse (or equivalent) for trace-level drill-down and prompt version comparison
  4. OTel metrics into your existing observability stack for alerts
  5. Per-tenant budgets enforced in middleware before GA to external tenants
  6. Unit economics dashboard joined to product success events

Incremental rollout phases

  • Phase 1: Internal (Eng team + CS)
  • Phase 2: Canary (5–10% of tenants)
  • Phase 3: Gradual (25% → 50% → 100%)
  • Phase 4: GA (default on)

Measure quality, cost, and support load at each stage before expanding.

Putting it together

LLM cost monitoring is not a finance spreadsheet exercise. It is middleware instrumentation: the same boundary where you already enforce auth, assemble context, and route models. Log consistently, estimate cost per request, aggregate by tenant and feature, alert on spikes, and enforce budgets before the provider charges you.

If you cannot answer "what did tenant X spend on copilot yesterday" from your own metrics, you are not ready to scale AI traffic, regardless of how good the demo looks.

