Guide · June 18, 2026

LLM observability beyond Langfuse — the full production stack

Langfuse covers traces and evals. Here is what else production teams need: structured logging, OpenTelemetry metrics, quality signals, sampling, canaries, and when to add Braintrust, Phoenix, or your existing APM.

Topics: observability, middleware, integration, evals

You wired Langfuse into middleware. Traces show prompt versions, tool calls, and token usage. Support can finally answer "what did the model see on this request?"

Then SRE asks why there is no PagerDuty alert when copilot latency doubles. Finance wants a weekly spend dashboard in Grafana. Engineering wants evals to block a bad prompt merge. Langfuse is one layer of LLM observability, not the whole stack.

The four questions production teams ask

Every LLM feature generates four classes of questions. No single product answers all of them well.

Question	Example	Best layer
What happened on this request?	Wrong citation on ticket #8842	Trace (Langfuse or equivalent)
Is the system healthy right now?	p95 latency up 40% since deploy	Metrics + alerts (OTel → Datadog, Grafana, Honeycomb)
Should we ship this change?	New prompt scores 12 points lower on golden set	Eval pipeline (eval guide)
Who spent what, on what outcome?	Tenant ACME costs $3 per accepted draft	Structured logs + cost metrics (cost guide)

Langfuse is strongest on the first row and supports the third via datasets and scores. Rows two and four usually need your existing platform stack or deliberate middleware instrumentation.

Where everything sits

The integration point is unchanged: server-side LLM middleware, not the browser.

Client  →  your API  →  LLM middleware  →  model provider
                              │
              ┌───────────────┼───────────────┐
              │               │               │
         structured log    Langfuse trace   OTel metrics
         (Loki / DD)      (debug + evals)   (alerts + dashboards)

Request flow through LLM middleware: Client UI (copilot, search, actions) → your API (existing auth session) → LLM middleware (auth, rate limits, logging) → model provider (OpenAI, Anthropic, etc.). The middleware injects tenant-scoped context, enforces tool permissions, and records tokens and latency, so every model call passes through your stack, not around it.

Middleware should emit all three from one code path. If each copilot logs differently, you will migrate observability twice.

Layer 1: Structured logging (minimum viable)

Before traces or dashboards, emit one JSON log line per model call with fixed fields. This alone answers most "who spent what" and "which feature is noisy" questions.

Required fields:

feature, tenantId, userId (hashed if policy requires)
model, inputTokens, outputTokens, latencyMs
outcome — success, timeout, rate_limited, error, user_rejected
requestId or trace ID for correlation with support tickets

import logging
from dataclasses import dataclass

logger = logging.getLogger("llm.middleware")

@dataclass(frozen=True)
class LlmUsage:
    input_tokens: int
    output_tokens: int
    model: str

def log_llm_request(
    *,
    request_id: str,
    feature: str,
    tenant_id: str,
    user_id: str,
    usage: LlmUsage,
    latency_ms: int,
    outcome: str,
) -> None:
    """Emit one structured record per model call.

    The `extra` fields reach the output only if the handler's formatter
    serializes them; the default logging formatter does not.
    """
    logger.info(
        "llm.request",
        extra={
            "request_id": request_id,  # correlates with traces and support tickets
            "feature": feature,
            "tenant_id": tenant_id,
            "user_id": user_id,  # hash upstream if policy requires
            "model": usage.model,
            "input_tokens": usage.input_tokens,
            "output_tokens": usage.output_tokens,
            "total_tokens": usage.input_tokens + usage.output_tokens,
            "latency_ms": latency_ms,
            # success | timeout | rate_limited | error | user_rejected
            "outcome": outcome,
        },
    )
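
The standard library's default formatter ignores fields passed through `extra`, so "one JSON log line per model call" needs a formatter that serializes them. A minimal sketch, assuming the field names from the required-fields list above; `JsonLineFormatter`, `LLM_FIELDS`, and `configure_llm_logging` are hypothetical names, not part of any library:

```python
import json
import logging

# Assumed field whitelist, taken from the required-fields list; adjust to your schema.
LLM_FIELDS = (
    "request_id", "feature", "tenant_id", "user_id", "model",
    "input_tokens", "output_tokens", "total_tokens", "latency_ms", "outcome",
)

class JsonLineFormatter(logging.Formatter):
    """Hypothetical formatter: renders each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "event": record.getMessage(),
            "level": record.levelname,
            "logger": record.name,
        }
        # Keys passed via `extra` become attributes on the record.
        for field in LLM_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload, sort_keys=True)

def configure_llm_logging() -> logging.Logger:
    """Attach the JSON formatter to the middleware logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonLineFormatter())
    llm_logger = logging.getLogger("llm.middleware")
    llm_logger.addHandler(handler)
    llm_logger.setLevel(logging.INFO)
    return llm_logger
```

A whitelist keeps the output schema fixed, which is what makes the log lines queryable later; fields outside the list are dropped rather than leaked.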

Ship logs to whatever you already run: Datadog, CloudWatch, Grafana Loki, Elasticsearch. You do not need an LLM-specific vendor for this layer.

Query examples:

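# Datadog: tokens by tenant (metric query against llm.tokens)
sum:llm.tokens{feature:copilot} by {tenant_id}.as_count()

# Loki: tokens per tenant over the last hour (unwrap reads the numeric log field)
sum by (tenant_id) (
  sum_over_time({app="api"} |= "llm.request" | json | unwrap total_tokens [1h])
)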

Structured logging is the first thing we add on engagements when there is no observability at all. Langfuse comes next, not instead.

Layer 2: OpenTelemetry metrics (SRE-friendly)

Traces debug one bad answer. Metrics tell you whether the feature is degrading for everyone.

Export counters and histograms from middleware:

Metric	Type	Use
`llm.requests`	Counter	Volume by feature, tenant, model
`llm.tokens`	Counter	Input vs output dimensions
`llm.cost_usd`	Counter	Estimated spend (reconcile to invoice monthly)
`llm.latency_ms`	Histogram	p50 / p95 / p99 for paging
`llm.errors`	Counter	Provider timeouts, schema failures, budget exceeded

from opentelemetry import metrics

meter = metrics.get_meter("llm.middleware")
token_counter = meter.create_counter(
    "llm.tokens",
    description="LLM tokens by tenant, feature, and direction",
)
cost_counter = meter.create_counter(
    "llm.cost_usd",
    description="Estimated LLM spend in USD",
)

def record_otel_cost(attrs: dict, usage, cost_usd: float) -> None:
    """Record token counts and estimated cost for one model call."""
    labels = {
        "feature": attrs["feature"],
        "tenant_id": attrs["tenant_id"],
        "model": usage.model,
        "outcome": attrs["outcome"],
    }
    # Record input and output separately so dashboards can split by direction.
    token_counter.add(usage.input_tokens, {**labels, "direction": "input"})
    token_counter.add(usage.output_tokens, {**labels, "direction": "output"})
    cost_counter.add(cost_usd, labels)

Langfuse can dual-write via its OpenTelemetry exporter: LLM-native traces in Langfuse for debugging, aggregated metrics in Datadog or Grafana for alerts. Many platform teams already have on-call runbooks there. Do not build a parallel paging system inside Langfuse unless your SRE team lives there.

Alert on symptoms, not vibes:

p95 llm.latency_ms > 8s for 10 minutes
llm.errors rate > 2% for a single feature
llm.cost_usd per tenant > daily budget (see cost monitoring)
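
These rules normally live in your alerting backend, not in application code. The sketch below only illustrates the logic of the three conditions over one evaluation window. `WindowStats`, `p95`, and `breached_rules` are hypothetical names, and "for 10 minutes" is simplified to "over whatever window the caller passes in":

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class WindowStats:
    latencies_ms: Sequence[float]  # one entry per request in the window
    error_count: int               # failed requests in the same window
    tenant_cost_usd: float         # one tenant's spend so far today

def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile; 0.0 for an empty window."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def breached_rules(
    stats: WindowStats,
    *,
    daily_budget_usd: float,
    max_p95_ms: float = 8000.0,
    max_error_rate: float = 0.02,
) -> list[str]:
    """Return the names of the rules this window violates."""
    breached = []
    if p95(stats.latencies_ms) > max_p95_ms:
        breached.append("latency_p95")
    total = len(stats.latencies_ms)
    if total and stats.error_count / total > max_error_rate:
        breached.append("error_rate")
    if stats.tenant_cost_usd > daily_budget_usd:
        breached.append("tenant_budget")
    return breached
```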

Layer 3: Evals and quality signals

Tracing tells you what happened on one request. Evals tell you whether a prompt or retrieval change is safe to ship across dozens of cases.

The loop:

Baseline production or staging traces (redacted)
Golden dataset with property-based expectations (refusal, citation, correct tool)
Run on every meaningful change in CI
Ship behind a feature flag; compare live metrics by promptVersion

See Eval pipelines for LLM features for runners, scorers, and gates.
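
A minimal sketch of the CI gate in that loop, assuming each golden case is a prompt plus a property the answer must satisfy. `GoldenCase`, `run_gate`, and the `generate` callable (your middleware's model call, stubbed in tests) are hypothetical names:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class GoldenCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # property the answer must satisfy

def run_gate(
    cases: Sequence[GoldenCase],
    generate: Callable[[str], str],
    min_pass_rate: float = 0.9,
) -> tuple[bool, list[str]]:
    """Run every case; return (gate_passed, names_of_failed_cases)."""
    if not cases:
        raise ValueError("golden set is empty; refusing to pass the gate")
    failures = [c.name for c in cases if not c.check(generate(c.prompt))]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate >= min_pass_rate, failures
```

In CI, the job would exit non-zero when the first element is `False` and print the failed case names so the author can see which property regressed.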

Online quality complements offline evals:

Signal	Source
Thumbs down / "report incorrect"	Product UI → Langfuse score or your DB
User edited the draft before sending	Implicit negative
Session abandoned mid-flow	Possible quality or latency issue
Tool confirmation rejected	Agent overreach

A support engineer marking ten traces "wrong citation" in Langfuse weekly beats an unused automated metric nobody maintains.

Layer 4: Cost and unit economics

Invoice totals are too coarse. Production teams track cost per successful outcome: accepted draft, resolved ticket, classified label applied.

That requires the same tenantId and feature tags on logs, traces, and metrics, plus an outcome dimension tied to product events. Full walkthrough: Monitoring LLM costs in production.
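
As a sketch of the arithmetic, assuming each record carries `tenant_id`, `feature`, `cost_usd`, and `outcome` (the record shape and the function name `cost_per_success` are illustrative): total spend, including failed and rejected calls, divided by the number of successful outcomes.

```python
from collections import defaultdict
from typing import Iterable, Optional

def cost_per_success(
    records: Iterable[dict],
) -> dict[tuple[str, str], Optional[float]]:
    """Map (tenant_id, feature) to spend per successful outcome.

    Returns None for a key that spent money but had no successes yet.
    """
    spend: dict[tuple[str, str], float] = defaultdict(float)
    wins: dict[tuple[str, str], int] = defaultdict(int)
    for rec in records:
        key = (rec["tenant_id"], rec["feature"])
        spend[key] += rec["cost_usd"]  # failed calls still cost money
        if rec["outcome"] == "success":
            wins[key] += 1
    return {
        key: (spend[key] / wins[key] if wins[key] else None)
        for key in spend
    }
```

Counting the cost of rejected and failed calls in the numerator is the point of the metric: a feature that needs three attempts per accepted draft is three times as expensive as its per-call cost suggests.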

Langfuse alternatives and complements

Langfuse is our default recommendation for trace + prompt version + eval dataset in one place. Other tools fill adjacent niches. Pick one LLM-native platform in production; do not run three.

Tool	Strength	When to consider
Langfuse	Traces, prompts, scores, datasets, self-host	Default for middleware-integrated observability (setup guide)
Braintrust	Evals, regression runs, CI integration	Team thinks in test cases first; eval-heavy workflow
Arize Phoenix	Open-source tracing, embedding visualization	RAG debugging, retrieval quality, drift exploration
LangSmith	LangChain / LangGraph integration	Orchestration is already LangChain-native
Helicone / Portkey	Gateway proxy + request logging	Gateway is your middleware boundary
Datadog LLM Observability	Managed GenAI tracing	Already standardized on Datadog for everything

Generic APM (Honeycomb, Sentry, Grafana Tempo) stays valuable for exceptions, HTTP latency, and infrastructure. Use it alongside Langfuse, not as a replacement for prompt-version debugging.

Techniques that matter as much as tools

Consistent metadata schema

Every layer should share dimensions: feature, tenantId, sessionId, promptVersion, model, outcome. Add RAG fields (retrievalChunkCount, topDocIds) and agent fields (toolsInvoked, permissionDenied) as features mature. A trace you cannot filter by customer is useless in multi-tenant SaaS.
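
One way to enforce a shared schema is a single record type that every layer derives its fields from. The sketch below assumes the dimensions named above in snake_case. `LlmCallMeta` and its methods are hypothetical, and dropping `session_id` from metric attributes is an assumption made here to keep metric cardinality bounded:

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class LlmCallMeta:
    feature: str
    tenant_id: str
    session_id: str
    prompt_version: str
    model: str
    outcome: str

    def log_fields(self) -> dict:
        """Full record for structured logs and trace metadata."""
        return asdict(self)

    def metric_attrs(self) -> dict:
        """Bounded-cardinality subset for metric attributes."""
        fields = asdict(self)
        fields.pop("session_id")  # effectively unbounded; keep it on logs and traces only
        return fields
```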

Trace the full chain

The bug is rarely the final streamed token. Instrument retrieval, tool selection, permission checks, and prompt assembly as child spans. Logging only the assistant reply hides most copilot failures, because the cause usually sits in one of those earlier steps.
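
To illustrate what "child spans" means structurally, here is a standard-library sketch of nested span recording. A real deployment would use the Langfuse or OpenTelemetry SDK instead; `SpanRecorder` is a hypothetical stand-in and is not thread-safe:

```python
import time
from contextlib import contextmanager

class SpanRecorder:
    """Hypothetical in-memory recorder showing parent/child span structure."""

    def __init__(self) -> None:
        self.spans: list[dict] = []   # finished spans, children before parents
        self._stack: list[str] = []   # names of currently open spans

    @contextmanager
    def span(self, name: str, **attrs):
        record = {
            "name": name,
            "parent": self._stack[-1] if self._stack else None,
            "attrs": attrs,
        }
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)

# Usage: one request with retrieval and prompt assembly as child spans.
recorder = SpanRecorder()
with recorder.span("copilot.request", tenant_id="ACME"):
    with recorder.span("retrieval", chunk_count=4):
        pass  # call the retriever here
    with recorder.span("prompt_assembly", prompt_version="v7"):
        pass  # build the prompt here
```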

Sampling at scale

Environment	Guidance
Staging	Trace 100% of requests
Production — low volume copilot	Trace 100% initially
Production — high volume classifier	Sample 1–10%; always trace errors and budget breaches

Sampling keeps storage cost sane without flying blind.
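
A sketch of that sampling decision, assuming a string trace ID and the outcome values used in the structured log. `should_trace` and `ALWAYS_TRACE` are hypothetical names. Hashing the trace ID makes the decision deterministic, so every service that sees the same ID makes the same choice:

```python
import hashlib

# Assumed set of outcomes that bypass sampling.
ALWAYS_TRACE = frozenset({"error", "timeout", "rate_limited"})

def should_trace(
    trace_id: str,
    *,
    sample_rate: float,
    outcome: str,
    over_budget: bool,
) -> bool:
    """Keep all errors and budget breaches; sample the rest by trace ID hash."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if outcome in ALWAYS_TRACE or over_budget:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)
```

Because the outcome is only known once the call finishes, this check belongs at the end of the request, when the middleware decides whether to keep or drop the trace it has buffered.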

Synthetic canaries

A cron job runs five golden prompts against production middleware every hour. Alert if latency, token count, or eval score drifts. Catches provider outages and silent prompt regressions before users report them.
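
The drift check itself can be small. A sketch, assuming each canary run yields latency, token count, and an eval score between 0 and 1. `CanaryResult`, `drifted`, and the default thresholds are illustrative, not recommendations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CanaryResult:
    prompt_id: str
    latency_ms: float
    total_tokens: int
    eval_score: float  # 0.0 to 1.0, higher is better

def drifted(
    baseline: CanaryResult,
    current: CanaryResult,
    *,
    max_latency_ratio: float = 1.5,
    max_token_ratio: float = 1.3,
    max_score_drop: float = 0.1,
) -> list[str]:
    """Return which signals moved past their threshold relative to baseline."""
    signals = []
    if current.latency_ms > baseline.latency_ms * max_latency_ratio:
        signals.append("latency")
    if current.total_tokens > baseline.total_tokens * max_token_ratio:
        signals.append("tokens")
    if baseline.eval_score - current.eval_score > max_score_drop:
        signals.append("eval_score")
    return signals
```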

Session-level grouping

For multi-turn copilots, tag sessionId on every turn. Metrics like cost per resolved thread or turns until abandonment beat per-message token counts for product decisions.

Redaction and retention

Traces are debugging artifacts, not product data. Mask emails, strip secrets, truncate retrieved chunks, set TTL by data classification. Observability that violates privacy policy gets shut down.
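
A minimal redaction sketch to run before text reaches a trace or log. The patterns are assumptions: the email regex is deliberately simple, and `SECRET_RE` only matches keys shaped like `sk-...` or `pk-...`, so real deployments need patterns for their own credential formats:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumed key shape: "sk-" or "pk-" followed by 16 or more key characters.
SECRET_RE = re.compile(r"\b(?:sk|pk)-[A-Za-z0-9_-]{16,}\b")

def redact(text: str, max_chars: int = 500) -> str:
    """Mask emails and key-like strings, then truncate long payloads."""
    text = EMAIL_RE.sub("[email]", text)
    text = SECRET_RE.sub("[secret]", text)
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text
```

Masking runs before truncation so a secret that straddles the cut-off point is never partially written to storage.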

RAG-specific observability

Generic generation traces miss retrieval failures. Add spans and metrics for:

Query rewrite and filters applied
Chunk count returned vs injected into prompt
Embed latency and cost (separate from generation)
Citation present in output vs retrieved doc IDs
"Low confidence" fallback path taken

Phoenix or custom dashboards on embedding distributions help when answers drift after a doc corpus update. Pair with property-based evals: "answer mentions retrieved doc title", "refuses when no chunks above threshold".
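
The "citation present in output vs retrieved doc IDs" check can be sketched as a set comparison. The `[doc:ID]` citation format and the name `citation_report` are assumptions; adapt the pattern to however your prompts ask the model to cite:

```python
import re
from typing import Iterable

# Assumed citation format: [doc:some-id]
CITATION_RE = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")

def citation_report(answer: str, retrieved_doc_ids: Iterable[str]) -> dict:
    """Compare doc IDs cited in the answer with the IDs actually retrieved."""
    cited = set(CITATION_RE.findall(answer))
    retrieved = set(retrieved_doc_ids)
    return {
        "has_citation": bool(cited),
        "grounded": sorted(cited & retrieved),     # cited and retrieved
        "unsupported": sorted(cited - retrieved),  # cited but never retrieved
    }
```

A non-empty `unsupported` list is worth recording as a span attribute and a counter, since it flags citations the model could not have read.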

Agent and tool-calling observability

Agents add steps, not just tokens. Per tool invocation, record:

Tool name and redacted arguments
Permission outcome (allowed, denied, needs confirmation)
Downstream API latency and error
Human approval granted or rejected

When a customer says "the copilot tried to close the wrong ticket," you need which tool, which ID, which policy gate — not a generic "agent error" in logs.
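
A sketch of a per-invocation record that captures those fields. `ToolInvocation`, `to_log_line`, and the tool name in the test are hypothetical; the permission values are the three listed above:

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

PERMISSIONS = frozenset({"allowed", "denied", "needs_confirmation"})

@dataclass(frozen=True)
class ToolInvocation:
    trace_id: str
    tool_name: str
    redacted_args: dict                      # arguments after redaction
    permission: str                          # one of PERMISSIONS
    downstream_latency_ms: Optional[int] = None
    downstream_error: Optional[str] = None
    human_approved: Optional[bool] = None    # None when no approval was requested

    def __post_init__(self) -> None:
        if self.permission not in PERMISSIONS:
            raise ValueError(f"unknown permission outcome: {self.permission!r}")

def to_log_line(invocation: ToolInvocation) -> str:
    """Serialize one tool call as a JSON line for logs or span attributes."""
    return json.dumps(asdict(invocation), sort_keys=True)
```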

Connect to prompt injection and tool security: security-relevant denials should appear in traces and metrics, not only in stderr.

Maturity model: what to add when

Stage	Minimum stack
POC / demo	Structured log: feature, model, tokens, latency
First production feature	Langfuse (or equivalent) on every middleware call; tenant + feature tags
On-call ownership	OTel metrics + alerts in existing APM
Prompt iteration	Golden dataset + CI eval gate
Multi-tenant scale	Per-tenant budgets, sampling, cost-per-outcome dashboards
Second AI feature	Shared tracing module — one schema, all features

Adding observability after three features each instrument differently means a migration project. Wire the shared module when you extract LLM middleware.

Common mistakes

Tool sprawl. Langfuse + Braintrust + Helicone + a custom spreadsheet. Pick one LLM-native layer and integrate it deeply.

Tracing from the browser. Keys stay server-side. Client traces are incomplete and insecure.

APM-only. Datadog shows /api/copilot is slow. It does not show which promptVersion regressed.

Langfuse-only. No alerts, no budgets, no CI eval gate. You debug well but ship regressions and cost surprises.

No owner. Someone reviews traces and eval failures weekly during rollout. Dashboards nobody opens do not count.

100% trace volume forever. Storage cost explodes on classifiers running on every row.

How the pieces connect at rollout

Prompt / retrieval change
        ↓
   CI eval on golden set ──fail──► block merge
        │
       pass
        ↓
   deploy behind feature flag (promptVersion tagged)
        ↓
   monitor: OTel alerts + cost per outcome + Langfuse trace sampling
        ↓
   production feedback → new golden cases → repeat

Production readiness checklist

Server-side auth

Tenant-scoped context

Structured logging

Cost per action

Eval pipeline

Provider fallback

Feature flags

Audit on tool calls

Use this as a gate before calling an AI feature GA — not as a post-launch backlog.

How 475 Cumulus approaches the full stack

We do not sell observability licenses. On integration projects we typically:

Define one metadata schema across logs, traces, and metrics
Instrument middleware once with Langfuse or OTel-compatible tracing
Stand up eval datasets from real workflow boundaries, not lorem ipsum
Connect tracing to rollout — feature flags, canary prompt versions, per-tenant cost alerts
Dual-write to existing APM when platform teams already run Datadog, Honeycomb, or Grafana

The outcome is LLM features that behave like the rest of your production systems: permissioned, measurable, and improvable without guessing what the model saw.

Langfuse is the right center of gravity for LLM-native debugging. Round it out with structured logs, OTel alerts, eval gates, and unit economics — then you have observability, not just traces. Describe your copilot or agent and we will map the full stack for your middleware and auth model.
