LLM observability beyond Langfuse — the full production stack
Langfuse covers traces and evals. Here is what else production teams need: structured logging, OpenTelemetry metrics, quality signals, sampling, canaries, and when to add Braintrust, Phoenix, or your existing APM.
You wired Langfuse into middleware. Traces show prompt versions, tool calls, and token usage. Support can finally answer "what did the model see on this request?"
Then SRE asks why there is no PagerDuty alert when copilot latency doubles. Finance wants a weekly spend dashboard in Grafana. Engineering wants evals to block a bad prompt merge. Langfuse is one layer of LLM observability, not the whole stack.
The four questions production teams ask
Every LLM feature generates four classes of questions. No single product answers all of them well.
| Question | Example | Best layer |
|---|---|---|
| What happened on this request? | Wrong citation on ticket #8842 | Trace (Langfuse or equivalent) |
| Is the system healthy right now? | p95 latency up 40% since deploy | Metrics + alerts (OTel → Datadog, Grafana, Honeycomb) |
| Should we ship this change? | New prompt scores 12 points lower on golden set | Eval pipeline (eval guide) |
| Who spent what, on what outcome? | Tenant ACME costs $3 per accepted draft | Structured logs + cost metrics (cost guide) |
Langfuse is strongest on the first row and supports the third via datasets and scores. Rows two and four usually need your existing platform stack or deliberate middleware instrumentation.
Where everything sits
The integration point is unchanged: server-side LLM middleware, not the browser.
```
Client → your API → LLM middleware → model provider
                          │
          ┌───────────────┼───────────────┐
          │               │               │
   structured log   Langfuse trace   OTel metrics
    (Loki / DD)     (debug + evals)  (alerts + dashboards)
```
Request path: Client UI (copilot, search, actions) → Your API (existing auth session) → LLM middleware (auth, rate limits, logging) → Model provider (OpenAI, Anthropic, etc.). Every model call passes through your stack, not around it.
Middleware should emit all three from one code path. If each copilot logs differently, you will migrate observability twice.
Layer 1: Structured logging (minimum viable)
Before traces or dashboards, emit one JSON log line per model call with fixed fields. This alone answers most "who spent what" and "which feature is noisy" questions.
Required fields:
- feature, tenantId, userId (hashed if policy requires)
- model, inputTokens, outputTokens, latencyMs
- outcome: success, timeout, rate_limited, error, user_rejected
- requestId or trace ID for correlation with support tickets
```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("llm.middleware")


@dataclass(frozen=True)
class LlmUsage:
    input_tokens: int
    output_tokens: int
    model: str


def log_llm_request(
    *,
    feature: str,
    tenant_id: str,
    user_id: str,
    request_id: str,
    usage: LlmUsage,
    latency_ms: int,
    outcome: str,
) -> None:
    """Emit one structured log record per model call."""
    # `extra` attaches these fields to the LogRecord; a JSON formatter on the
    # handler is what turns them into a single JSON line.
    logger.info(
        "llm.request",
        extra={
            "feature": feature,
            "tenant_id": tenant_id,
            "user_id": user_id,  # hash upstream if policy requires
            "request_id": request_id,  # correlates with support tickets and traces
            "model": usage.model,
            "input_tokens": usage.input_tokens,
            "output_tokens": usage.output_tokens,
            "total_tokens": usage.input_tokens + usage.output_tokens,
            "latency_ms": latency_ms,
            "outcome": outcome,  # success | timeout | rate_limited | error | user_rejected
        },
    )
```
Ship logs to whatever you already run: Datadog, CloudWatch, Grafana Loki, Elasticsearch. You do not need an LLM-specific vendor for this layer.
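Fields passed through `extra` only show up in your log backend if the handler's formatter serializes them. The sketch below shows one way to do that with the standard library alone. `JsonLineFormatter`, `LLM_FIELDS`, and `build_llm_logger` are hypothetical names for illustration; many teams use an existing JSON logging library instead.

```python
import io
import json
import logging

# Hypothetical allow-list; mirror whatever fields your middleware passes via `extra`.
LLM_FIELDS = (
    "feature", "tenant_id", "user_id", "request_id", "model", "input_tokens",
    "output_tokens", "total_tokens", "latency_ms", "outcome",
)


class JsonLineFormatter(logging.Formatter):
    """Serialize the message plus allow-listed `extra` fields as one JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname}
        for name in LLM_FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload, sort_keys=True)


def build_llm_logger(stream) -> logging.Logger:
    """Attach the JSON formatter to the middleware logger, writing to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    llm_logger = logging.getLogger("llm.middleware")
    llm_logger.handlers = [handler]
    llm_logger.setLevel(logging.INFO)
    llm_logger.propagate = False
    return llm_logger


# Demo: write one record to an in-memory buffer and parse it back.
buffer = io.StringIO()
demo_logger = build_llm_logger(buffer)
demo_logger.info(
    "llm.request",
    extra={"feature": "copilot", "tenant_id": "acme", "total_tokens": 1200,
           "latency_ms": 840, "outcome": "success"},
)
logged = json.loads(buffer.getvalue())
```

The allow-list is a deliberate choice in this sketch: a field nobody declared never leaks into logs by accident.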
Query examples:
```
# Datadog: tokens by tenant
# (assumes llm.tokens exists as a metric, for example one generated from these logs)
sum:llm.tokens{feature:copilot} by {tenant_id}.as_count()

# Loki: tokens per tenant over the last hour
sum by (tenant_id) (
  sum_over_time({app="api"} |= "llm.request" | json | unwrap total_tokens [1h])
)
```
Structured logging is the first thing we add on engagements when there is no observability at all. Langfuse comes next, not instead.
Layer 2: OpenTelemetry metrics (SRE-friendly)
Traces debug one bad answer. Metrics tell you whether the feature is degrading for everyone.
Export counters and histograms from middleware:
| Metric | Type | Use |
|---|---|---|
| llm.requests | Counter | Volume by feature, tenant, model |
| llm.tokens | Counter | Input vs output dimensions |
| llm.cost_usd | Counter | Estimated spend (reconcile to invoice monthly) |
| llm.latency_ms | Histogram | p50 / p95 / p99 for paging |
| llm.errors | Counter | Provider timeouts, schema failures, budget exceeded |
```python
from opentelemetry import metrics

meter = metrics.get_meter("llm.middleware")

token_counter = meter.create_counter(
    "llm.tokens",
    description="LLM tokens by tenant and feature",
)
cost_counter = meter.create_counter(
    "llm.cost_usd",
    description="Estimated LLM spend in USD",
)


def record_otel_cost(attrs: dict, usage, cost_usd: float) -> None:
    """Record token and cost counters for one model call."""
    labels = {
        "feature": attrs["feature"],
        "tenant_id": attrs["tenant_id"],
        "model": usage.model,
        "outcome": attrs["outcome"],
    }
    # Split input and output so dashboards can break tokens down by direction.
    # The "token_type" attribute name is illustrative.
    token_counter.add(usage.input_tokens, {**labels, "token_type": "input"})
    token_counter.add(usage.output_tokens, {**labels, "token_type": "output"})
    cost_counter.add(cost_usd, labels)
```
Langfuse can dual-write via its OpenTelemetry exporter: LLM-native traces in Langfuse for debugging, aggregated metrics in Datadog or Grafana for alerts. Many platform teams already have on-call runbooks there. Do not build a parallel paging system inside Langfuse unless your SRE team lives there.
Alert on symptoms, not vibes:
- p95 llm.latency_ms > 8s for 10 minutes
- llm.errors rate > 2% for a single feature
- llm.cost_usd per tenant > daily budget (see cost monitoring)
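In practice these thresholds live in your alerting platform as monitors or alert rules. The sketch below only makes the arithmetic explicit for a single window of requests from one feature and tenant. `Sample`, `p95`, and `firing_alerts` are hypothetical names, and the "for 10 minutes" duration is assumed to be handled by whoever builds the window.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    latency_ms: int
    is_error: bool
    cost_usd: float


def p95(values: list[int]) -> int:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def firing_alerts(window: list[Sample], daily_budget_usd: float) -> list[str]:
    """Return the names of the thresholds breached in this window."""
    if not window:
        return []
    alerts = []
    if p95([s.latency_ms for s in window]) > 8_000:  # p95 latency above 8s
        alerts.append("latency_p95")
    error_rate = sum(s.is_error for s in window) / len(window)
    if error_rate > 0.02:  # more than 2% of calls failed
        alerts.append("error_rate")
    if sum(s.cost_usd for s in window) > daily_budget_usd:
        alerts.append("tenant_budget")
    return alerts


healthy = [Sample(900, False, 0.01)] * 100
degraded = [Sample(900, False, 0.01)] * 90 + [Sample(12_000, True, 0.01)] * 10
```

Here `firing_alerts(degraded, 5.0)` flags both latency and error rate, because ten slow failures out of a hundred are enough to move the p95.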
Layer 3: Evals and quality signals
Tracing tells you what happened on one request. Evals tell you whether a prompt or retrieval change is safe to ship across dozens of cases.
The loop:
- Baseline production or staging traces (redacted)
- Golden dataset with property-based expectations (refusal, citation, correct tool)
- Run on every meaningful change in CI
- Ship behind a feature flag; compare live metrics by promptVersion
See Eval pipelines for LLM features for runners, scorers, and gates.
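To make the CI step of that loop concrete, here is a minimal sketch of a gate over a golden set with property-based checks. Everything in it is an assumption for illustration: the `GoldenCase` shape, the `[doc:...]` citation marker, the refusal phrase, and `stub_model`, which stands in for a real call through your middleware.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # property the answer must satisfy


def run_gate(
    cases: list[GoldenCase],
    generate: Callable[[str], str],
    min_pass_rate: float,
) -> tuple[bool, list[str]]:
    """Run every case; return (gate_passed, names_of_failed_cases)."""
    failed = [c.name for c in cases if not c.check(generate(c.prompt))]
    pass_rate = 1 - len(failed) / len(cases)
    return pass_rate >= min_pass_rate, failed


GOLDEN = [
    GoldenCase("cites_source", "How do I reset my password?",
               lambda answer: "[doc:" in answer),
    GoldenCase("refuses_no_context", "What is our Q3 revenue?",
               lambda answer: "I don't know" in answer),
]


def stub_model(prompt: str) -> str:
    # Stand-in for the real middleware call; always answers with a citation.
    return "Go to Settings > Security. [doc:kb-12]"


passed, failures = run_gate(GOLDEN, stub_model, min_pass_rate=1.0)
```

In a CI job, the runner would exit non-zero when `passed` is false and print `failures`, which is what blocks the merge.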
Online quality complements offline evals:
| Signal | Source |
|---|---|
| Thumbs down / "report incorrect" | Product UI → Langfuse score or your DB |
| User edited the draft before sending | Implicit negative |
| Session abandoned mid-flow | Possible quality or latency issue |
| Tool confirmation rejected | Agent overreach |
A support engineer marking ten traces "wrong citation" in Langfuse weekly beats an unused automated metric nobody maintains.
Layer 4: Cost and unit economics
Invoice totals are too coarse. Production teams track cost per successful outcome: accepted draft, resolved ticket, classified label applied.
That requires the same tenantId and feature tags on logs, traces, and metrics, plus an outcome dimension tied to product events. Full walkthrough: Monitoring LLM costs in production.
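Once logs carry those shared tags, cost per outcome is a small aggregation. A sketch, assuming each record already has an estimated `cost_usd` and that `success` marks the product event you care about; the function name and record shape are hypothetical.

```python
from collections import defaultdict


def cost_per_outcome(records, success_outcome="success"):
    """Map (tenant_id, feature) to total spend divided by successful outcomes.

    Returns None for a key that has spend but no successes.
    """
    spend = defaultdict(float)
    wins = defaultdict(int)
    for record in records:
        key = (record["tenant_id"], record["feature"])
        spend[key] += record["cost_usd"]  # failed calls still cost money
        if record["outcome"] == success_outcome:
            wins[key] += 1
    return {key: (spend[key] / wins[key] if wins[key] else None) for key in spend}


records = [
    {"tenant_id": "acme", "feature": "draft", "cost_usd": 1.5, "outcome": "user_rejected"},
    {"tenant_id": "acme", "feature": "draft", "cost_usd": 1.0, "outcome": "timeout"},
    {"tenant_id": "acme", "feature": "draft", "cost_usd": 0.5, "outcome": "success"},
]
unit_costs = cost_per_outcome(records)
```

With these made-up records, ACME spends $3.00 for one accepted draft even though the successful call itself cost $0.50. That gap is what invoice totals hide.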
Langfuse alternatives and complements
Langfuse is our default recommendation for trace + prompt version + eval dataset in one place. Other tools fill adjacent niches. Pick one LLM-native platform in production; do not run three.
| Tool | Strength | When to consider |
|---|---|---|
| Langfuse | Traces, prompts, scores, datasets, self-host | Default for middleware-integrated observability (setup guide) |
| Braintrust | Evals, regression runs, CI integration | Team thinks in test cases first; eval-heavy workflow |
| Arize Phoenix | Open-source tracing, embedding visualization | RAG debugging, retrieval quality, drift exploration |
| LangSmith | LangChain / LangGraph integration | Orchestration is already LangChain-native |
| Helicone / Portkey | Gateway proxy + request logging | Gateway is your middleware boundary |
| Datadog LLM Observability | Managed GenAI tracing | Already standardized on Datadog for everything |
Generic APM (Honeycomb, Sentry, Grafana Tempo) stays valuable for exceptions, HTTP latency, and infrastructure. Use it alongside Langfuse, not as a replacement for prompt-version debugging.
Techniques that matter as much as tools
Consistent metadata schema
Every layer should share dimensions: feature, tenantId, sessionId, promptVersion, model, outcome. Add RAG fields (retrievalChunkCount, topDocIds) and agent fields (toolsInvoked, permissionDenied) as features mature. A trace you cannot filter by customer is useless in multi-tenant SaaS.
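One way to keep the schema consistent is to define it once and derive each layer's view from it. A minimal sketch follows; `LlmCallContext` and its methods are hypothetical, and dropping the session ID from metric attributes is our assumption for keeping label cardinality low.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LlmCallContext:
    feature: str
    tenant_id: str
    session_id: str
    prompt_version: str
    model: str
    outcome: str

    def log_fields(self) -> dict:
        """All dimensions, for the structured log line and trace metadata."""
        return asdict(self)

    def metric_attributes(self) -> dict:
        """Subset for metrics; per-session IDs stay out of metric labels."""
        fields = asdict(self)
        fields.pop("session_id")
        return fields


ctx = LlmCallContext(
    feature="copilot",
    tenant_id="acme",
    session_id="s-1",
    prompt_version="v7",
    model="example-model",
    outcome="success",
)
```

Because both views come from the same object, a renamed or added dimension changes in every layer at once.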
Trace the full chain
The bug is rarely the final streamed token. Instrument retrieval, tool selection, permission checks, and prompt assembly as child spans. Logging only the assistant reply hides 80% of copilot failures.
Sampling at scale
| Environment | Guidance |
|---|---|
| Staging | Trace 100% of requests |
| Production — low volume copilot | Trace 100% initially |
| Production — high volume classifier | Sample 1–10%; always trace errors and budget breaches |
Sampling keeps storage cost sane without flying blind.
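A sampling decision along those lines can be a few lines in middleware. This sketch assumes any non-success outcome counts as an error worth keeping, and it hashes the request ID so the same request always gets the same decision. `should_trace` is a hypothetical helper, not part of any SDK.

```python
import hashlib


def should_trace(request_id: str, outcome: str, over_budget: bool, sample_rate: float) -> bool:
    """Always keep errors and budget breaches; sample the rest deterministically."""
    if outcome != "success" or over_budget:
        return True
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Hash-based sampling has one practical advantage over a random draw: every service that sees the same request ID reaches the same keep-or-drop decision, so a sampled trace is never missing its children.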
Synthetic canaries
A cron job runs five golden prompts against production middleware every hour. Alert if latency, token count, or eval score drifts. Catches provider outages and silent prompt regressions before users report them.
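The drift check itself can be simple. The sketch below compares one canary run against a stored baseline using a relative tolerance. The `CanaryResult` shape, the 25% default, and the choice to flag token drift in either direction are assumptions to tune for your feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryResult:
    latency_ms: float
    total_tokens: int
    eval_score: float


def drifted(baseline: CanaryResult, current: CanaryResult, tolerance: float = 0.25) -> list[str]:
    """Return the signals that moved more than `tolerance` relative to baseline."""
    signals = []
    # Latency only matters when it gets worse.
    if current.latency_ms > baseline.latency_ms * (1 + tolerance):
        signals.append("latency")
    # Token count drifting either way can indicate a silent prompt change.
    if abs(current.total_tokens - baseline.total_tokens) > baseline.total_tokens * tolerance:
        signals.append("tokens")
    # Eval score only matters when it drops.
    if current.eval_score < baseline.eval_score * (1 - tolerance):
        signals.append("eval_score")
    return signals


baseline = CanaryResult(latency_ms=1000, total_tokens=400, eval_score=0.9)
```

The cron job would call `drifted` once per golden prompt and page or post to a channel when the returned list is non-empty.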
Session-level grouping
For multi-turn copilots, tag sessionId on every turn. Metrics like cost per resolved thread or turns until abandonment beat per-message token counts for product decisions.
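With the session ID on every turn, those session-level metrics fall out of a group-by. A sketch under assumed field names (`session_id`, `cost_usd`, and a `resolved` flag set by a product event):

```python
from collections import defaultdict


def session_summary(turns):
    """Group per-turn records into per-session turn counts, cost, and resolution."""
    sessions = defaultdict(lambda: {"turns": 0, "cost_usd": 0.0, "resolved": False})
    for turn in turns:
        session = sessions[turn["session_id"]]
        session["turns"] += 1
        session["cost_usd"] += turn["cost_usd"]
        session["resolved"] = session["resolved"] or turn.get("resolved", False)
    return dict(sessions)


def cost_per_resolved_thread(turns):
    """Total spend across all sessions divided by the number that resolved."""
    sessions = list(session_summary(turns).values())
    resolved = sum(1 for s in sessions if s["resolved"])
    total = sum(s["cost_usd"] for s in sessions)
    return total / resolved if resolved else None


turns = [
    {"session_id": "s1", "cost_usd": 0.25},
    {"session_id": "s1", "cost_usd": 0.25, "resolved": True},
    {"session_id": "s2", "cost_usd": 0.5},  # abandoned after one turn
]
```

Abandoned sessions stay in the numerator on purpose: their spend is part of what each resolved thread really costs.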
Redaction and retention
Traces are debugging artifacts, not product data. Mask emails, strip secrets, truncate retrieved chunks, set TTL by data classification. Observability that violates privacy policy gets shut down.
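A redaction pass before anything reaches the trace store might look like the sketch below. The email pattern is deliberately simple, and `SECRET_KEYS` is a hypothetical deny-list. Treat both as starting points, not a complete PII policy.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_KEYS = {"authorization", "api_key", "password"}  # hypothetical deny-list


def redact_text(text: str, max_chars: int = 200) -> str:
    """Mask email addresses, then truncate long text such as retrieved chunks."""
    masked = EMAIL_RE.sub("[email]", text)
    if len(masked) > max_chars:
        return masked[:max_chars] + "...[truncated]"
    return masked


def redact_payload(payload: dict) -> dict:
    """Drop secret-bearing keys and redact every string value."""
    clean = {}
    for key, value in payload.items():
        if key.lower() in SECRET_KEYS:
            continue  # strip secrets entirely rather than masking them
        clean[key] = redact_text(value) if isinstance(value, str) else value
    return clean
```

Running this inside the shared middleware module, not per feature, means a new copilot cannot forget it.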
RAG-specific observability
Generic generation traces miss retrieval failures. Add spans and metrics for:
- Query rewrite and filters applied
- Chunk count returned vs injected into prompt
- Embed latency and cost (separate from generation)
- Citation present in output vs retrieved doc IDs
- "Low confidence" fallback path taken
Phoenix or custom dashboards on embedding distributions help when answers drift after a doc corpus update. Pair with property-based evals: "answer mentions retrieved doc title", "refuses when no chunks above threshold".
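Several of those retrieval signals can be computed per request and attached to the trace. The sketch below assumes answers cite sources with a `[doc:<id>]` marker, a hypothetical format, and reports chunk counts plus any citation that does not match a retrieved document.

```python
import re

CITATION_RE = re.compile(r"\[doc:([\w-]+)\]")  # hypothetical citation format


def retrieval_report(answer: str, retrieved_ids: list[str], injected_ids: list[str]) -> dict:
    """Summarize retrieval signals for one RAG request."""
    cited = set(CITATION_RE.findall(answer))
    return {
        "chunks_returned": len(retrieved_ids),
        "chunks_injected": len(injected_ids),
        "has_citation": bool(cited),
        # Cited IDs that retrieval never returned point at hallucinated sources.
        "unsupported_citations": sorted(cited - set(retrieved_ids)),
    }


report = retrieval_report(
    "Reset it in Settings [doc:kb-12], see also [doc:kb-99].",
    retrieved_ids=["kb-12", "kb-40", "kb-41"],
    injected_ids=["kb-12", "kb-40"],
)
```

The same function works as a property-based eval scorer: a golden case fails when `unsupported_citations` is non-empty.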
Agent and tool-calling observability
Agents add steps, not just tokens. Per tool invocation, record:
- Tool name and redacted arguments
- Permission outcome (allowed, denied, needs confirmation)
- Downstream API latency and error
- Human approval granted or rejected
When a customer says "the copilot tried to close the wrong ticket," you need which tool, which ID, which policy gate — not a generic "agent error" in logs.
Connect to prompt injection and tool security: security-relevant denials should appear in traces and metrics, not only in stderr.
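A per-invocation record covering those fields could be as small as the sketch below. `ToolInvocation`, the flattening helper, and the rule for what counts as security-relevant are all assumptions for illustration; the attribute names are not a standard.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolInvocation:
    tool: str
    redacted_args: dict
    permission: str  # allowed | denied | needs_confirmation
    downstream_latency_ms: Optional[int] = None
    downstream_error: Optional[str] = None
    human_approved: Optional[bool] = None  # None when no approval was requested


def to_span_attributes(inv: ToolInvocation) -> dict:
    """Flatten to scalar attributes for a trace span, dropping unset fields."""
    attrs = {}
    for key, value in asdict(inv).items():
        if value is None:
            continue
        if isinstance(value, dict):
            for arg, arg_value in value.items():
                attrs[f"args.{arg}"] = arg_value
        else:
            attrs[key] = value
    return attrs


def is_security_relevant(inv: ToolInvocation) -> bool:
    """Denials and rejected approvals should also increment a metric."""
    return inv.permission == "denied" or inv.human_approved is False


inv = ToolInvocation("close_ticket", {"ticket_id": "8842"}, "denied")
```

With this on the span, "the copilot tried to close the wrong ticket" becomes a filter on tool name, ticket ID, and permission outcome.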
Maturity model: what to add when
| Stage | Minimum stack |
|---|---|
| POC / demo | Structured log: feature, model, tokens, latency |
| First production feature | Langfuse (or equivalent) on every middleware call; tenant + feature tags |
| On-call ownership | OTel metrics + alerts in existing APM |
| Prompt iteration | Golden dataset + CI eval gate |
| Multi-tenant scale | Per-tenant budgets, sampling, cost-per-outcome dashboards |
| Second AI feature | Shared tracing module — one schema, all features |
Adding observability after three features each instrument differently means a migration project. Wire the shared module when you extract LLM middleware.
Common mistakes
Tool sprawl. Langfuse + Braintrust + Helicone + a custom spreadsheet. Pick one LLM-native layer and integrate it deeply.
Tracing from the browser. Keys stay server-side. Client traces are incomplete and insecure.
APM-only. Datadog shows /api/copilot is slow. It does not show which promptVersion regressed.
Langfuse-only. No alerts, no budgets, no CI eval gate. You debug well but ship regressions and cost surprises.
No owner. Dashboards nobody opens do not count. Assign someone to review traces and eval failures weekly during rollout.
100% trace volume forever. Storage cost explodes on classifiers running on every row.
How the pieces connect at rollout
```
Prompt / retrieval change
        ↓
CI eval on golden set ──fail──► block merge
        │
       pass
        ↓
deploy behind feature flag (promptVersion tagged)
        ↓
monitor: OTel alerts + cost per outcome + Langfuse trace sampling
        ↓
production feedback → new golden cases → repeat
```
Use this as a gate before calling an AI feature GA, not as a post-launch backlog.
How 475 Cumulus approaches the full stack
We do not sell observability licenses. On integration projects we typically:
- Define one metadata schema across logs, traces, and metrics
- Instrument middleware once with Langfuse or OTel-compatible tracing
- Stand up eval datasets from real workflow boundaries, not lorem ipsum
- Connect tracing to rollout — feature flags, canary prompt versions, per-tenant cost alerts
- Dual-write to existing APM when platform teams already run Datadog, Honeycomb, or Grafana
The outcome is LLM features that behave like the rest of your production systems: permissioned, measurable, and improvable without guessing what the model saw.
Langfuse is the right center of gravity for LLM-native debugging. Round it out with structured logs, OTel alerts, eval gates, and unit economics — then you have observability, not just traces. Describe your copilot or agent and we will map the full stack for your middleware and auth model.
Related resources
More on observability
- Langfuse for LLM observability — where it fits in your middleware stack
How to trace model calls, debug prompts, and run evals with Langfuse — integrated into server-side LLM middleware, not bolted onto a frontend demo.
- Monitoring LLM costs in production: tokens, tenants, and alerts
A practical guide to LLM cost observability: structured logging, Langfuse dashboards, OpenTelemetry metrics, per-tenant budgets, and the unit economics finance actually needs.
- Eval pipelines for LLM features — what they are and how to build one
A practical guide to golden sets, property-based scoring, and CI gates — so prompt and retrieval changes do not silently break production copilots.
