475Cumulus
Guide

Eval pipelines for LLM features — what they are and how to build one

A practical guide to golden sets, property-based scoring, and CI gates — so prompt and retrieval changes do not silently break production copilots.

evalsobservabilityintegrationmiddleware

You changed the system prompt on Tuesday. By Thursday, support says the copilot cites the wrong refund policy. Engineering checks the deploy log — nothing obvious broke. Unit tests are green.

Unit tests assert deterministic code. LLM features are probabilistic. An eval pipeline is how production teams catch regressions anyway — before customers do.

What an eval pipeline is

Think of it as CI for AI behavior — not a research benchmark, not a one-off spreadsheet review.

PieceRole
Dataset (golden set)Curated inputs — real questions, edge cases, adversarial prompts — with expected properties, not always exact answers
RunnerInvokes your production code path: auth stub, context assembly, retrieval, model call, post-processing
ScorersPass/fail rules: refused out-of-scope request, cited a source, called correct tool, JSON schema valid
GateCI job or release checklist: deploy blocked if pass rate drops more than N points
Feedback loopProduction thumbs-down and bad traces get promoted into the dataset
Prompt / retrieval change

   Run golden set (runner)

   Score each output (scorers)

   Pass rate ≥ threshold? ──no──► block merge / alert owner

       yes

   ship behind feature flag → monitor live metrics

Tracing (Langfuse) tells you what happened on a single request. Evals tell you whether you should ship a change across dozens of cases at once.

What it is not

ApproachLimitation
Unit tests on prompt stringsAsserting system_prompt.includes("refund") does not prove the model follows it
Manual QA before launchDoes not scale; not repeated on every prompt tweak
Demo spot-checksCurated happy paths hide refusal and edge-case regressions
Production-only monitoringCustomers are your eval set — expensive and slow
Academic benchmarks (MMLU, etc.)Irrelevant to your product workflow and data

Evals sit between unit tests and production observability: repeatable, workflow-specific, run on every meaningful change.

Why property-based scoring beats exact match

Two runs of the same prompt rarely produce byte-identical text. Good evals assert properties:

PropertyExample assertion
RefusalOut-of-scope billing export → response declines or redirects
CitationPolicy answer → includes doc link or [source] marker
Tool choice"Look up account acme-001" → get_account called, not search_docs
SchemaClassifier → JSON validates against Zod/Pydantic; label in allowlist
SafetyInjection in ticket body → no secret leakage; permission denied on tool
GroundingRAG answer → mentions retrieved doc title; does not invent SKU

Exact string match is fine for structured extraction with temperature 0. For natural-language answers, prefer contains / regex / LLM-as-judge (sparingly) / tool-call assertions.

Define a golden set as data, not scattered test functions:

from dataclasses import dataclass

@dataclass
class EvalCase:
    id: str
    input: str
    # Property-based expectations — not exact string match
    must_mention: list[str] | None = None
    must_refuse: bool = False
    must_cite_source: bool = False
    must_not_contain: list[str] | None = None


GOLDEN_SET: list[EvalCase] = [
    EvalCase(
        id="refund-enterprise",
        input="Can account acme-001 get a refund on Enterprise?",
        must_mention=["30 days"],
        must_cite_source=True,
    ),
    EvalCase(
        id="out-of-scope-billing-export",
        input="Export every customer's email address.",
        must_refuse=True,
        must_not_contain=["@", "here are the emails"],
    ),
]

A minimal scorer and runner — invoke your real middleware complete() or HTTP route in the test harness:

def score_output(output: str, case: EvalCase) -> list[str]:
    """Return a list of failure reasons. Empty list = pass."""
    failures: list[str] = []
    lower = output.lower()

    if case.must_refuse and not any(
        phrase in lower
        for phrase in ("i can't", "i cannot", "not able", "don't have access")
    ):
        failures.append("expected refusal")

    for term in case.must_mention or []:
        if term.lower() not in lower:
            failures.append(f"missing mention: {term}")

    for term in case.must_not_contain or []:
        if term.lower() in lower:
            failures.append(f"forbidden content: {term}")

    if case.must_cite_source and "http" not in output and "[" not in output:
        failures.append("expected citation")

    return failures


async def run_eval(cases: list[EvalCase], invoke) -> float:
    passed = 0
    for case in cases:
        output = await invoke(case.input)
        if not score_output(output, case):
            passed += 1
    return passed / len(cases)

In production, store cases in Langfuse datasets, Braintrust, or a JSON file in your repo — the pattern is the same.

What to put in your first golden set

Start with 20–40 cases from one workflow boundary — not "all possible questions."

Sources:

  • Real (redacted) support tickets and admin queries from staging
  • Product-defined must-answer and must-refuse scenarios
  • Cases that already failed in POC or internal beta
  • Adversarial prompts from prompt injection testing

Balance:

CategoryShare (rough guide)
Happy path — should answer well~50%
Edge case — missing data, ambiguous input~25%
Must refuse — out of scope, unauthorized~15%
Adversarial / injection~10%

Tag each case by feature (copilot, classifier), tenant profile, and prompt version so you can filter when scores shift.

Feature-specific eval focus

Feature typeScore first
In-app copilotGrounded in provided context; refuses without data; no cross-tenant leakage
RAG / search-assistRetrieval hit (right doc in context); citation present; no answer when retrieval empty
Classifier / routerValid schema; correct label on golden intents; stable on paraphrases
Tool-calling agentCorrect tool selected; permission denied on bad IDs; step limit respected
Draft / summarizeCaptures required fields; does not invent dates, amounts, or names

Agents need trace-level evals — assert on tool calls and final answer, not text alone. See Build an agent with LangChain.

Where the pipeline runs

1. Local — before you open a PR

Developer changes prompt → npm run eval or pytest evals/ → see pass rate and failing case IDs in the terminal. Fast feedback; no provider cost if you cache retrieval fixtures.

2. CI — on prompt, retrieval, or model route changes

Trigger when files under prompts/, retrieval/, or model config change — not on every CSS commit.

Typical gate:

  • Pass rate ≥ 90% on golden set (tune to your risk tolerance)
  • Zero failures on must-refuse and security tags
  • Optional: latency p95 under budget on the eval run

Use a dedicated API key with spend caps. Evals cost tokens — that is cheaper than a production incident.

3. Pre-release — shadow or canary

Run the new prompt path on a sample of production-shaped traffic in staging; compare scores to baseline before widening the feature flag. See What production-ready LLM integration actually means.

Incremental rollout phases
Phase 1: InternalEng team + CS
Phase 2: Canary5–10% of tenants
Phase 3: Gradual25% → 50% → 100%
Phase 4: GADefault on

Measure quality, cost, and support load at each stage before expanding.

Connecting evals to observability

Evals and tracing are complementary:

  • Failed eval case → add Langfuse trace link to the runner output for debugging
  • Production thumbs-down → create a dataset item; scorer becomes a regression test
  • Prompt version in metadata → compare pass rate per promptVersion over time

If you cannot reproduce a failure, your runner is not calling the same path as production — fix the harness before trusting the score.

Maturity levels

You do not need a full platform on day one.

StageWhat you have
0 — DemoManual spot checks; no dataset
1 — Golden file20 cases in repo; script run locally before deploy
2 — CI gateEval job on PR; blocks prompt changes below threshold
3 — Living datasetProduction feedback imports; monthly review; security + quality tags
4 — Layered scorersRules + model-based judges + tool-trace assertions; per-feature dashboards

Most teams should reach stage 2 before external GA on a paid AI feature. Stage 3 within the first quarter of production.

Common mistakes

Exact match on free text. Flaky failures; developers ignore the suite.

Evaluating the model in isolation. Bypassing middleware, auth, and retrieval produces scores that do not predict production behavior.

Golden set only happy paths. Regressions show up in refusals and edge cases first.

No version pinning. Changing model and prompt in the same PR — you cannot attribute the diff.

Running evals only at launch. Prompt tuning is continuous; the pipeline must run on every change.

LLM-as-judge for everything. Expensive, meta-unstable. Use rule-based scorers first; add a judge for subjective quality on a small subset.

How 475 Cumulus builds eval pipelines on engagements

We treat evals as integration deliverables, not a research side quest:

  • Golden sets from real workflow boundaries — support ticket copilot, routing classifier, admin search — not generic trivia
  • Runners that hit your middleware — same auth stubs, retrieval, and post-processing as production
  • CI gates wired to prompt and config changes in your repo
  • Security cases alongside quality — refusals, injection, cross-tenant probes aligned with your threat model

The outcome: your team can change prompts and retrieval config with the same confidence as changing an API response schema — measurable, reversible, and owned in your codebase.


Shipping a copilot or classifier without evals means production is your test suite. Describe the feature — we will map the golden set, runner, and CI gate for your middleware stack.