Technical · 2026-06-26 · Last verified 2026-06-26

AI Agent Evals: How to Test AI Agents Before and After Production

A practical guide to AI agent evals: build golden datasets from real traces, score trajectories and outcomes, implement LLM-as-judge without its biases, and wire everything into CI so regressions never ship twice.

Deep · ML Architect & Full Stack Engineer

10+ years shipping production ML across TensorFlow, PyTorch, AWS, and GCP. Ships every A8gent agent before it becomes a lesson.

Key takeaways

Agents fail silently: the API returns 200, the response reads fluently, and the answer is wrong. Without evals you find out from customers, not from your pipeline.
Use the eval pyramid: cheap deterministic assertions on every run, LLM-as-judge on a sampled or offline set, and human review only for the ambiguous slice. Most teams invert this and burn budget on judges for checks a regex could do.
Score trajectories, not just outcomes. An agent that reaches the right answer through five redundant tool calls or an unsafe path will fail on the next input. Outcome-only metrics miss this entirely.
Your first eval set needs 10-50 real cases pulled from production traces or realistic pilots, not 500 synthetic ones. Every production failure should become a permanent regression case.
LLM-as-judge works, but only with a rubric-based prompt, randomized ordering, and periodic human calibration. Unmitigated judges show position bias, verbosity bias, and self-preference bias.
Wire evals into CI as a merge-blocking gate that runs on every prompt, model, or tool change. A prompt edit without an eval run is an untested deploy.

Why Agents Fail Silently Without Evals

Traditional software fails loudly. A null pointer throws, a bad deploy 500s, an alert fires. AI agents fail quietly. The LLM always returns something, the HTTP status is 200, the response is grammatically perfect, and it is wrong. No exception, no stack trace, no alert. Your monitoring dashboard is green while your agent confidently tells a customer the wrong refund policy.

Here are three failure patterns I have seen in real deployments, all invisible to standard monitoring:

The prompt-edit regression. A team tweaks their support agent's system prompt to make it more concise. It works great on the two examples they test by hand. What they do not notice: the shorter prompt dropped an instruction about checking order status before promising delivery dates. For three weeks the agent promises dates it never verified. The team finds out when a customer escalates.

The silent tool drift. An internal API the agent calls changes its response schema. The tool call still succeeds, but the agent now parses an empty field and answers "I could not find that information" for a category of queries it used to handle perfectly. Volume on that intent quietly shifts to human support. Nobody connects the dots for a month.

The model-swap surprise. A provider deprecates a model version, the team swaps in the successor, and tool-calling behavior shifts subtly: the new model calls a destructive tool without asking for confirmation first. This is the class of failure that turns into a real incident, the kind our agent risk scorer is designed to flag before launch.

The common thread: every one of these would have been caught by a modest eval suite running on every change. Evals are to agents what unit tests are to code, except the system under test is probabilistic, so you need a layered strategy instead of simple assertions. This is also one of the top items on our list of AI agent mistakes businesses make: shipping an agent with zero automated quality checks and treating "we tried a few prompts and it looked fine" as testing.

This guide covers the full lifecycle: building your first eval set, scoring trajectories versus outcomes, implementing LLM-as-judge correctly, deterministic tool-call assertions, picking a framework, wiring evals into CI, and running online evals after launch.

The Eval Pyramid: Deterministic Checks, LLM Judges, Human Review

Just like the classic testing pyramid, agent evals have layers with very different costs and coverage. The mistake most teams make is starting at the expensive top (humans reading transcripts) or over-relying on the middle (LLM judges for everything). Build bottom-up.

Layer	What it checks	Cost per case	When to run	Examples
Deterministic assertions	Structure and hard constraints	~$0, milliseconds	Every CI run, every production trace	Correct tool called, valid JSON output, required fields present, no PII in response, latency and token budgets, forbidden phrases absent
LLM-as-judge	Semantic quality	$0.001-0.05, seconds	Full eval set in CI, sampled traces in production	Factual correctness vs reference, helpfulness, adherence to policy, trajectory efficiency
Human review	Ground truth and judge calibration	$1-10, minutes	Weekly sample, all judge disagreements, all escalations	Labeling new golden cases, auditing judge scores, reviewing edge cases the judge flagged as uncertain

The layers feed each other. Deterministic checks filter out the obviously broken runs so judges only score plausible outputs. Judge scores flag the ambiguous cases so humans only review where their time matters. Human labels then calibrate the judge and grow the golden dataset. A single engineer with this pipeline can evaluate volume that used to require a review team.
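
The hand-offs between layers can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `EvalResult`, `evaluate_case`, the judge's confidence value, and the 0.7 threshold are all assumptions introduced here, and the judge is a stub standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    case_id: str
    layer: str               # "deterministic", "judge", or "human_queue"
    passed: Optional[bool]   # None means "needs a human decision"
    reason: str


def evaluate_case(
    case_id: str,
    output: str,
    deterministic_checks: list[Callable[[str], Optional[str]]],
    judge: Callable[[str], tuple[bool, float]],
    uncertainty_threshold: float = 0.7,  # assumed cut-off, tune for your judge
) -> EvalResult:
    """Run one case through the pyramid, cheapest layer first."""
    # Layer 1: each check returns an error message, or None if it passes.
    for check in deterministic_checks:
        error = check(output)
        if error is not None:
            return EvalResult(case_id, "deterministic", False, error)

    # Layer 2: the judge only sees outputs that are structurally sound.
    passed, confidence = judge(output)
    if confidence < uncertainty_threshold:
        # Layer 3: ambiguous verdicts go to a human instead of being trusted.
        return EvalResult(case_id, "human_queue", None, "judge uncertain")
    return EvalResult(case_id, "judge", passed, "judge verdict")


# Hypothetical check and stub judge, for illustration only.
def no_forbidden_phrase(output: str) -> Optional[str]:
    return "forbidden phrase" if "guaranteed refund" in output.lower() else None


def stub_judge(output: str) -> tuple[bool, float]:
    return ("14-day" in output, 0.9 if "policy" in output else 0.5)
```

Returning early from the deterministic layer is the point of the design: a structurally broken output never costs judge tokens, and only low-confidence verdicts cost human time.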

A useful rule of thumb for coverage: roughly 60-70% of your eval signal should come from deterministic checks, 25-35% from LLM judges, and under 5% from humans. If your ratio is inverted, you either have not extracted the deterministic checks hiding inside your judge prompts (many "quality" criteria are actually structural), or you are paying humans to do what a regex can.

Human review is also where human-in-the-loop patterns and evals converge: every case a human approves or rejects at runtime is a free labeled example. Pipe those decisions into your dataset instead of letting them evaporate.

Building Your First Eval Set: 10-50 Real Cases

You do not need 1,000 test cases to start. You need 10-50 cases that represent what your agent actually faces, with expected outcomes a domain expert has signed off on. A small, real dataset beats a large synthetic one every time, because synthetic cases inherit the blind spots of whoever generated them.

Where to source cases, in order of value:

1. Production traces. If you are already live, pull real conversations: the ones that went well (positive cases), the ones users abandoned or thumbed down (failure cases), and the ones support escalated (hard cases). Every production failure becomes a permanent regression case: the trace becomes a test, the test joins the golden dataset, and that failure can never silently ship again.

2. Pilot and dogfood sessions. Pre-launch, run structured sessions where real users (or you, wearing the user hat honestly) work through actual tasks. Record everything.

3. Expert-written edge cases. Ask the domain expert: "What question would a new hire get wrong?" Those go straight in.

4. Synthetic variations, last. Once you have 20 real cases, use an LLM to generate paraphrases and adversarial variants of them. Synthetic-first datasets are how teams end up with 95% eval pass rates and unhappy users.

Store the dataset as versioned data in your repo. Here is a minimal, practical schema:

[
  {
    "id": "refund-policy-basic-001",
    "input": "I bought the annual plan 3 days ago, can I get my money back?",
    "context": {"user_plan": "annual", "purchase_days_ago": 3},
    "expected_output": "Confirms eligibility for full refund under the 14-day policy and offers to initiate it.",
    "expected_tools": ["lookup_subscription", "check_refund_policy"],
    "must_not": ["promises a refund without checking eligibility"],
    "source": "production_trace_8842",
    "tags": ["refunds", "happy_path"]
  },
  {
    "id": "refund-policy-edge-017",
    "input": "Refund me now or I dispute the charge with my bank.",
    "context": {"user_plan": "annual", "purchase_days_ago": 45},
    "expected_output": "Stays calm, explains the 14-day window has passed, escalates to a human agent per policy.",
    "expected_tools": ["lookup_subscription", "check_refund_policy", "escalate_to_human"],
    "must_not": ["issues refund outside policy", "threatens the user"],
    "source": "support_escalation_2026-05-12",
    "tags": ["refunds", "adversarial", "escalation"]
  }
]

Notice the fields. expected_output is a description of correct behavior, not an exact string, because agents phrase things differently run to run. expected_tools enables deterministic trajectory checks. must_not captures the failures that actually hurt. source keeps the provenance so you can revisit the original trace. Both LangSmith and Langfuse can import datasets in roughly this shape, and if you built your agent following our LangGraph tutorial or OpenAI Agents SDK tutorial, the trace exports map onto it directly.
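
Because the dataset is versioned data, it is worth validating it the same way you would validate code. Here is a minimal sketch of a loader that rejects malformed cases before any eval runs. The function names are hypothetical, and treating every field except context as required is my assumption, not a rule of the schema.

```python
import json

# Field name -> expected JSON type. "context" is left optional here (assumption).
REQUIRED_FIELDS = {
    "id": str,
    "input": str,
    "expected_output": str,
    "expected_tools": list,
    "must_not": list,
    "source": str,
    "tags": list,
}


def validate_dataset(cases: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the dataset is usable."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<case #{index}>")
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in case:
                problems.append(f"{label}: missing field '{field}'")
            elif not isinstance(case[field], expected_type):
                problems.append(f"{label}: '{field}' should be {expected_type.__name__}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems


def load_dataset(path: str) -> list[dict]:
    """Load the golden dataset and fail loudly if any case is malformed."""
    with open(path, encoding="utf-8") as handle:
        cases = json.load(handle)
    problems = validate_dataset(cases)
    if problems:
        raise ValueError("Invalid golden dataset:\n" + "\n".join(problems))
    return cases
```

Running `validate_dataset` as its own pytest test means a typo in a newly added case fails the PR that introduced it, instead of silently skipping a check later.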

Review cadence matters more than size: add every new production failure within a week, prune cases that no longer reflect the product, and have a human re-verify expected outputs quarterly. A stale golden dataset is quietly worse than none, because it manufactures false confidence.

Trajectory vs Outcome Scoring

An agent is not a function that maps input to output. It is a process: a chain of reasoning steps, tool calls, and observations that unfolds before any final answer appears. That means you have two distinct things to score, and they fail independently.

Outcome scoring asks: was the final answer correct, complete, and safe? This is what most teams measure, and it is necessary but not sufficient.

Trajectory scoring asks: was the path sound? Did the agent call the right tools, in a sensible order, without loops, retries, or detours? Did it check before it acted?

Why you need both: outcome-only metrics miss what researchers call corrupt success, where the agent lands on the right answer through an unsafe or illogical path. Concrete example from a data agent I reviewed: asked for last month's revenue, it ran a query against the wrong table, got an error, retried against three more tables, and eventually found a number that happened to be right. Outcome eval: pass. Trajectory eval: four wasted tool calls, one near-miss on a table it had no business reading, and a path that fails the moment the lucky table gets renamed. That agent was one schema change away from confidently reporting garbage.

The reverse also happens. A RAG agent retrieves the right documents, reasons correctly, and then fumbles the final synthesis. Trajectory: fine. Outcome: fail. Knowing which half broke tells you whether to fix retrieval or the answer prompt, which is exactly the debugging loop we walk through in our guide to building a RAG agent.

In practice, three trajectory metrics cover most needs:

# Trajectory metrics worth computing on every eval run

# 1. Tool precision: did it avoid tools it should not have called?
tool_precision = correct_tool_calls / total_tool_calls

# 2. Tool recall: did it call everything it needed?
tool_recall = expected_tools_called / expected_tools_total

# 3. Efficiency: how much longer than the reference path?
step_ratio = actual_steps / reference_steps
# step_ratio > 1.5 usually means looping or flailing
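
A runnable sketch of those three formulas, working from tool names only. The conventions are assumptions on my part: a call counts as "correct" when its tool name appears in the expected list (so a redundant repeat of an expected tool does not hurt precision, but does hurt the step ratio), and the reference path length is taken to be the number of expected tools.

```python
def trajectory_metrics(actual_tools: list[str], expected_tools: list[str]) -> dict[str, float]:
    """Compute tool precision, tool recall, and step ratio for one agent run.

    actual_tools:   tool names in the order the agent called them
    expected_tools: tool names on the reference path
    """
    expected = set(expected_tools)

    # Calls whose tool is on the reference path.
    correct_calls = sum(1 for name in actual_tools if name in expected)
    # Distinct expected tools that were called at least once.
    expected_called = len(expected & set(actual_tools))

    return {
        # No calls at all means no wrong calls, so precision defaults to 1.0.
        "tool_precision": correct_calls / len(actual_tools) if actual_tools else 1.0,
        # Nothing expected means nothing was missed, so recall defaults to 1.0.
        "tool_recall": expected_called / len(expected) if expected else 1.0,
        # With an empty reference path, report the raw number of steps taken.
        "step_ratio": (
            len(actual_tools) / len(expected_tools)
            if expected_tools
            else float(len(actual_tools))
        ),
    }


run = ["lookup_subscription", "search_docs", "lookup_subscription", "check_refund_policy"]
reference = ["lookup_subscription", "check_refund_policy"]
metrics = trajectory_metrics(run, reference)
# tool_precision 0.75, tool_recall 1.0, step_ratio 2.0
```

In that example the agent reached every required tool, yet the step ratio of 2.0 flags the detour and the repeated lookup that an outcome-only score would never see.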

For LangGraph and OpenAI-format agents, the open source agentevals package (from the LangChain team) ships ready-made trajectory-match evaluators that compare a run against a reference trajectory in strict, unordered, or superset mode, so you can allow extra harmless steps while still failing on missing required ones. Trajectory data also feeds capacity planning once you deploy to production: an agent averaging 9 LLM calls per task when the reference path takes 4 is burning more than twice your GPU or API budget.

LLM-as-Judge: Implementation, Biases, and Mitigations

LLM-as-judge means using a strong model to score your agent's outputs against a rubric. It is the only scalable way to evaluate semantic quality (correctness, tone, policy adherence), and it is also the most misused technique in agent evals. Let's do it properly.

The core rule: judges grade against a rubric and a reference, never against vibes. "Rate this response 1-10 for quality" produces noise. A rubric with binary criteria produces signal. Here is a judge prompt structure that holds up in practice:

JUDGE_PROMPT = """You are evaluating an AI support agent's response.

## User request
{input}

## Reference behavior (written by a domain expert)
{expected_output}

## Hard constraints (any violation = automatic fail)
{must_not}

## Agent's actual response
{actual_output}

## Instructions
Evaluate ONLY against the criteria below. Do not reward length,
politeness, or confident tone. A short correct answer beats a
long partially-correct one.

For each criterion answer true or false with a one-line reason:
1. factually_consistent: Every claim matches the reference
   behavior or the provided context. No invented details.
2. policy_compliant: No hard constraint is violated.
3. task_complete: The user's actual question is resolved,
   not deflected.
4. appropriately_scoped: The response does not promise actions
   the agent did not verify or perform.

Return JSON only:
{{"factually_consistent": bool, "policy_compliant": bool,
  "task_complete": bool, "appropriately_scoped": bool,
  "reasons": {{...}}, "overall_pass": bool}}
"""

Binary criteria plus required reasons plus JSON output: each of these choices fights a specific failure mode. And there are real failure modes. The research on judge reliability is consistent and sobering:

Position bias. In pairwise comparisons, the answer shown first wins 10-15 percentage points more often, independent of quality. Mitigation: evaluate both orderings and average, or avoid pairwise judging entirely and score each output against a rubric independently (as above).

Verbosity bias. Judges score longer answers higher even when the extra length adds nothing. This one is baked into the model's training, so prompt instructions alone ("do not reward length") reduce but do not eliminate it. Mitigation: explicit anti-length instruction plus periodic human audits comparing judge scores on short-vs-long pairs.

Self-preference bias. Models rate their own outputs (and same-provider outputs) higher. Mitigation: use a different model family for judging than the one powering your agent whenever practical.

Calibration drift. A judge that agreed with humans 92% of the time in March may not in September, because your data distribution shifted or the judge model was updated. Mitigation: keep a small human-labeled calibration set (30-50 cases) and re-measure judge-human agreement monthly. If agreement drops below roughly 85%, fix the rubric before trusting new scores.
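
The monthly calibration check is simple arithmetic. A minimal sketch, assuming judge and human verdicts are stored as pass/fail booleans keyed by case id (that storage shape and both function names are hypothetical):

```python
def judge_human_agreement(judge_labels: dict[str, bool], human_labels: dict[str, bool]) -> float:
    """Fraction of shared cases where the judge and the human gave the same verdict."""
    shared = judge_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no overlapping case ids to compare")
    matches = sum(judge_labels[case_id] == human_labels[case_id] for case_id in shared)
    return matches / len(shared)


def needs_rubric_review(agreement: float, threshold: float = 0.85) -> bool:
    """True when agreement has dropped below the level at which scores are trusted."""
    return agreement < threshold


judge = {"a": True, "b": True, "c": False, "d": True}
human = {"a": True, "b": False, "c": False, "d": True}
agreement = judge_human_agreement(judge, human)  # 3 of 4 verdicts match
```

Keep the disagreeing case ids as well as the rate: they are usually the fastest pointer to which rubric criterion has gone vague.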

Two more operational rules. First, pin the judge model version; an unpinned judge is an eval suite that rewrites itself. Second, for high-stakes criteria, run three judge calls and take the majority vote; it costs 3x per case on a dataset of 50, which is pennies, and it noticeably cuts variance.
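
Majority voting can be sketched without committing to any provider. In the code below, `call_judge` is a hypothetical zero-argument callable that sends the filled-in judge prompt to your pinned judge model and returns its raw reply. Recomputing the verdict from the four criteria instead of trusting the judge's own overall_pass field, and counting unparseable replies as fails, are both design assumptions of this sketch.

```python
import json
from collections import Counter
from typing import Callable

# The four binary criteria from the judge prompt above.
CRITERIA = (
    "factually_consistent",
    "policy_compliant",
    "task_complete",
    "appropriately_scoped",
)


def parse_verdict(raw: str) -> bool:
    """Parse one judge reply; pass only if every criterion is literally true."""
    data = json.loads(raw)
    return all(data.get(name) is True for name in CRITERIA)


def majority_verdict(call_judge: Callable[[], str], votes: int = 3) -> bool:
    """Call the judge `votes` times and return the majority pass/fail verdict."""
    if votes % 2 == 0:
        raise ValueError("use an odd number of votes to avoid ties")
    results = []
    for _ in range(votes):
        try:
            results.append(parse_verdict(call_judge()))
        except (json.JSONDecodeError, AttributeError):
            # Malformed JSON, or JSON that is not an object: count as a fail.
            results.append(False)
    return Counter(results)[True] > votes // 2
```

Log the individual votes alongside the majority result. A case that splits 2-1 run after run is a good candidate for the human review queue.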

Deterministic Tool-Call Assertions in Pytest

Before any LLM judge runs, deterministic checks should have already verified everything that has a right answer. Tool calls are the best example: either the agent called lookup_subscription with the right customer ID or it did not. No judge needed, no tokens spent, no flakiness.

Here is a self-contained pytest example asserting on an agent's tool calls. It works with any agent whose message history is available in OpenAI chat format (a tool_calls list on assistant messages); agents built with LangGraph, the OpenAI Agents SDK, or MCP-based tool setups may need a small adapter to export their history in that shape:

import json
import pytest
from my_agent import run_agent  # returns final state incl. messages

def extract_tool_calls(messages: list[dict]) -> list[dict]:
    """Flatten all tool calls from an agent run into
    [{"name": ..., "args": {...}}, ...] in call order."""
    calls = []
    for msg in messages:
        for tc in (msg.get("tool_calls") or []):
            calls.append({
                "name": tc["function"]["name"],
                "args": json.loads(tc["function"]["arguments"]),
            })
    return calls

def test_refund_checks_policy_before_promising():
    result = run_agent(
        "I bought the annual plan 3 days ago, can I get my money back?",
        context={"customer_id": "cus_123"},
    )
    calls = extract_tool_calls(result["messages"])
    names = [c["name"] for c in calls]

    # Required tools were called
    assert "lookup_subscription" in names
    assert "check_refund_policy" in names

    # Order matters: look up the subscription before checking the policy
    assert names.index("lookup_subscription") < names.index(
        "check_refund_policy"
    )

    # Exact-match on arguments for the critical call
    lookup = next(c for c in calls if c["name"] == "lookup_subscription")
    assert lookup["args"] == {"customer_id": "cus_123"}

    # Forbidden tools were NOT called
    assert "issue_refund" not in names, (
        "Agent must never issue refunds without human approval"
    )

def test_agent_stays_within_step_budget():
    result = run_agent("What is my current plan?")
    calls = extract_tool_calls(result["messages"])
    assert 1 <= len(calls) <= 2, f"Expected 1-2 tool calls, got {len(calls)}"

Three patterns in that snippet are worth stealing. Ordering assertions encode process safety: check before you promise, verify before you write. Negative assertions (the issue_refund check) encode your blast-radius rules; they are the automated version of the controls in our agent risk checklist. Step budgets catch looping the moment it appears, long before it shows up as a latency or cost graph.

If you use LangSmith, decorate these same tests with @pytest.mark.langsmith (SDK 0.3.4+) and every run gets logged as a tracked experiment with pass rates over time, while still running as plain pytest locally and in CI. Because LLM outputs vary, expect some legitimate nondeterminism: for assertions on tool choice, run each test 3-5 times and require all passes for safety-critical checks, or 4 of 5 for softer ones. Flaky agent tests are data, not noise; a test that passes 60% of the time is telling you the agent's behavior is unreliable on that input.
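
The "4 of 5" rule needs nothing more than a small helper. A minimal sketch, assuming each check is a zero-argument function that raises AssertionError on failure; `pass_count` and `assert_reliable` are names invented here, not part of pytest or LangSmith.

```python
from typing import Callable


def pass_count(check: Callable[[], None], runs: int = 5) -> int:
    """Run an assertion-style check several times and count how often it passes."""
    passes = 0
    for _ in range(runs):
        try:
            check()
            passes += 1
        except AssertionError:
            pass  # a failed run is data; keep going and count the rest
    return passes


def assert_reliable(check: Callable[[], None], runs: int = 5, min_passes: int = 5) -> None:
    """Fail unless the check passes at least `min_passes` times out of `runs`.

    Use min_passes == runs for safety-critical checks, a lower bar for softer ones.
    """
    passes = pass_count(check, runs)
    assert passes >= min_passes, f"passed {passes}/{runs} runs, need {min_passes}"
```

Inside a test you would wrap the body of an existing check in a local function and call `assert_reliable(that_function, runs=5, min_passes=4)`. The failure message then reports the pass count, which is the number you want to track over time.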

Framework Comparison: LangSmith vs Langfuse vs DeepEval vs OpenAI Evals

You can build everything above with pytest and a JSON file, and honestly, that is a fine start. Frameworks earn their place when you want experiment tracking over time, dataset management in a UI, and production trace scoring. Here is how the four main options compare as of mid-2026:

Criterion	LangSmith	Langfuse	DeepEval	OpenAI Evals
Type	Managed platform (observability + evals)	Open source platform (MIT), cloud or self-host	Open source Python eval library (+ Confident AI cloud)	Eval API + dashboard inside OpenAI platform
Pricing	Free dev tier, then per-trace; heavy tracing gets expensive (500K traces/mo lands around $1,400/mo on Plus)	Self-host free; cloud from free tier to ~$199/mo Pro flat	Library is free forever; cloud platform optional, usage-based	Free tooling; you pay normal token costs for runs and judges
Self-host	Enterprise plan only	First-class, Docker Compose to Kubernetes	Fully local by design	No, tied to OpenAI's platform
CI fit	Excellent: pytest integration, experiments, regression comparison views	Good: SDK-driven, scores via API, more DIY assembly	Excellent: literally pytest-style, built for CI from day one	Moderate: API-triggerable but built around OpenAI models
Trajectory evals	Strong (plus open source agentevals/openevals)	Good via trace scoring on any step	Good: tool-correctness and task-completion metrics	Basic, response-focused
Best for	LangChain/LangGraph teams who want one integrated platform	Teams that need self-hosting or framework neutrality	Python teams who want evals as code, no platform required	Teams all-in on OpenAI models and SDK

My honest defaults: if you are on LangGraph and fine with a managed service, LangSmith gives you the shortest path from trace to dataset to CI gate. If data residency matters or you run a self-hosted LLM agent stack, Langfuse is the obvious pick; it is MIT-licensed and self-hosting is genuinely first-class, not an enterprise upsell. If you want zero platform dependency and evals that live entirely in your repo, DeepEval (or the similar Promptfoo, worth a look for its CLI and red-teaming features) is the eval-as-code option. Also worth knowing: RAGAS remains the standard add-on for RAG-specific metrics like faithfulness and context precision, and it plugs into all of the above rather than competing with them.

Whatever you pick, the dataset format from earlier in this guide imports into any of them. The framework is the easy, swappable part; the golden dataset and the rubrics are the assets you actually accumulate.

Wiring Evals into CI: Run on Every Prompt Change

Here is the discipline that separates teams with working agents from teams with agent incidents: every change to a prompt, model, tool schema, or retrieval config triggers an eval run, and a score drop blocks the merge. Prompts are code. A prompt edit without an eval run is an untested deploy, no matter how small it looks; the horror stories in the first section were all "small" edits.

A GitHub Actions workflow for this is short. The trick is in the trigger paths and the gate:

name: agent-evals
on:
  pull_request:
    paths:
      - "agent/prompts/**"
      - "agent/tools/**"
      - "agent/graph.py"
      - "evals/**"

jobs:
  evals:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r evals/requirements.txt

      # Layer 1: deterministic assertions (fast, free, must pass)
      - name: Tool-call and structure assertions
        run: pytest evals/test_deterministic.py -x -q

      # Layer 2: LLM-as-judge on the golden dataset
      - name: Judge evals on golden dataset
        env:
          JUDGE_API_KEY: ${{ secrets.JUDGE_API_KEY }}
        run: |
          python evals/run_judge.py --dataset evals/golden.json \
            --min-pass-rate 0.90 --output eval-report.json

      # Gate: fail the PR if pass rate dropped vs main
      - name: Compare against baseline
        run: |
          python evals/compare_baseline.py \
            --report eval-report.json \
            --baseline .eval-baseline.json \
            --max-regression 0.02

      - uses: actions/upload-artifact@v4
        if: always()  # upload the report even when a gate above fails
        with:
          name: eval-report
          path: eval-report.json

Design decisions worth copying. Path filters keep eval runs off unrelated PRs, so nobody gets trained to ignore them. Deterministic tests run first and fail fast, so you never pay judge tokens to score an agent that is structurally broken. The regression gate compares against main's baseline on top of the 90% absolute floor, so a PR that drops pass rate from 96% to 92% fails even though 92% clears the floor and might sound acceptable in isolation. And the report is uploaded as an artifact so reviewers see which cases regressed, not just a red X.
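
The workflow calls evals/compare_baseline.py without showing it. Here is a minimal sketch of what such a script could look like. The report shape (a top-level pass_rate plus a cases map of case id to pass/fail) is an assumption of mine, so adapt it to whatever your judge runner actually writes.

```python
import argparse
import json


def find_regressions(report: dict, baseline: dict, max_regression: float) -> tuple[bool, list[str]]:
    """Return (gate_passed, ids of cases that passed on the baseline but fail now)."""
    drop = baseline["pass_rate"] - report["pass_rate"]
    regressed = sorted(
        case_id
        for case_id, passed in baseline["cases"].items()
        if passed and report["cases"].get(case_id) is False
    )
    # Small tolerance so a drop of exactly max_regression is not failed
    # by floating-point rounding.
    return drop <= max_regression + 1e-9, regressed


def main(argv=None) -> int:
    """CLI entry point; the return value is the process exit code."""
    parser = argparse.ArgumentParser(description="Fail if pass rate regressed vs baseline")
    parser.add_argument("--report", required=True)
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--max-regression", type=float, default=0.02)
    args = parser.parse_args(argv)

    with open(args.report, encoding="utf-8") as handle:
        report = json.load(handle)
    with open(args.baseline, encoding="utf-8") as handle:
        baseline = json.load(handle)

    gate_passed, regressed = find_regressions(report, baseline, args.max_regression)
    for case_id in regressed:
        print(f"regressed: {case_id}")
    return 0 if gate_passed else 1
```

To use it as a script, end the file with a call to `sys.exit(main())` so a failed gate produces a non-zero exit code and blocks the merge. Printing the regressed case ids is what turns a red X into something a reviewer can act on.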

Two practical notes. Judge runs on a 50-case dataset with a mid-tier judge model cost well under a dollar per PR; this is the cheapest insurance in your entire stack. And cache aggressively: if the diff only touches one tool, you can skip judge evals on cases that never exercise that tool, cutting run time from minutes to seconds. LangSmith's pytest integration and DeepEval both support exactly this style of pipeline out of the box if you would rather not maintain the compare-baseline script yourself.

After Launch: Online Evals and Production Monitoring

Offline evals tell you the agent works on the cases you thought of. Online evals tell you whether it works on the cases users actually bring, which is a different and larger set. Once you are live, your monitoring stack should be running evals continuously, not just counting errors and latency (though you need those too; see our production deployment guide for the infrastructure side).

A production online-eval setup has three tiers:

Tier 1: deterministic checks on every trace. These are cheap enough to run on 100% of traffic: output parsed correctly, no forbidden tool called, no PII leaked, step count within budget, no policy phrase violations. Emit them as scores on each trace (Langfuse scores or LangSmith feedback both work) and alert on rate changes, exactly like error-rate alerting.

Tier 2: sampled LLM-as-judge. Judging every production turn is too expensive and unnecessary. Sample 1-5% of traces (plus 100% of traces that failed a Tier 1 check or got negative user feedback) and run your offline judge rubric on them asynchronously. Now you have a continuous quality score with a trend line. A three-point drop in judged correctness over a week is your earliest warning that something upstream changed: a model update, a data drift, a tool API behaving differently.

Tier 3: implicit and explicit user signals. Thumbs up/down, "did this resolve your issue," task abandonment mid-conversation, escalation-to-human rate, and retry phrasing ("no, I meant...") are all eval signals users generate for free. They are noisy individually and reliable in aggregate.
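
The Tier 2 sampling rule fits in one function. A minimal sketch: `should_judge` and its parameters are hypothetical names, and hashing the trace id (instead of calling a random number generator) is a design choice of mine so that the same trace always gets the same decision on reprocessing.

```python
import hashlib


def should_judge(
    trace_id: str,
    failed_tier1: bool,
    negative_feedback: bool,
    sample_rate: float = 0.02,
) -> bool:
    """Decide whether a production trace goes to the LLM judge."""
    # Flagged traces are always judged, regardless of the sample rate.
    if failed_tier1 or negative_feedback:
        return True

    # Map the trace id to a stable, roughly uniform number in [0, 1).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the decision depends only on the trace id, you can raise the sample rate later and every previously sampled trace stays in the sample, which keeps trend lines comparable.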

For riskier changes, add A/B evals: route 5-10% of traffic to the new prompt or model, run identical Tier 1 and Tier 2 scoring on both arms, and compare before rolling out. This catches the failures that only appear under real input diversity, the ones no golden dataset covers.

The loop that makes all of this compound: every online failure becomes an offline case. A trace that fails a judge check or gets a thumbs-down goes into a review queue; a human confirms whether it is a real failure; confirmed failures get added to the golden dataset with the corrected expected behavior. Six months in, your eval suite is a curated museum of every way your agent has ever failed, and none of those failures can ship twice. Teams that run this loop are the ones whose agents get measurably better each month, which is the operating rhythm we teach in the AI Agents for Operators course.
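
The last step of that loop, turning a confirmed failure into a golden case, can be sketched against the dataset schema from earlier. The trace shape (id, input, context) and the function name are assumptions. The reviewer supplies the corrected expected behavior, because the failing trace by definition does not contain it.

```python
from typing import Optional


def add_regression_case(
    dataset: list[dict],
    trace: dict,
    expected_output: str,
    expected_tools: list[str],
    must_not: list[str],
    tags: list[str],
) -> Optional[dict]:
    """Append a reviewed production failure to the golden dataset.

    Returns the new case, or None if this trace was already captured.
    """
    source = f"production_trace_{trace['id']}"
    if any(case["source"] == source for case in dataset):
        return None  # already a regression case; do not duplicate it

    case = {
        "id": f"regression-{trace['id']}",
        "input": trace["input"],
        "context": trace.get("context", {}),
        "expected_output": expected_output,
        "expected_tools": expected_tools,
        "must_not": must_not,
        "source": source,
        "tags": sorted(set(tags) | {"regression"}),
    }
    dataset.append(case)
    return case
```

Tagging every such case with "regression" lets you report separately on how the agent does against its own past failures.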

A Realistic Eval Budget: How Much Eval per Feature

"How much should we invest in evals?" is really a risk question. Here is the budget heuristic I give consulting clients, scaled by blast radius:

Agent risk level	Example	Golden cases	Eval effort	CI gate
Low (read-only, internal)	Internal docs Q&A bot	10-20	~10% of build time	Deterministic checks only
Medium (customer-facing, no actions)	Support answer agent	30-60	~20% of build time	Deterministic + judge, 90% pass gate
High (takes actions)	Agent that issues refunds, sends emails, writes to DBs	60-150	~30-40% of build time	Full pyramid, regression gate, human sign-off on eval changes

Per feature, the rhythm looks like this: when you add a new tool or capability, budget 5-15 new golden cases covering its happy path, its failure modes, and at least two adversarial inputs. Write the deterministic assertions for it before you write the tool integration itself; agent TDD is unreasonably effective, because defining "correct tool usage" up front forces you to design a tool schema the model can actually use.

In dollar terms, evals are almost embarrassingly cheap relative to what they protect. A 100-case golden dataset judged on every PR, at 30 PRs a month, is 3,000 judge calls; at the $0.001-0.05 per case from the pyramid table, that is roughly $3-150 a month depending on judge model. Ongoing online sampling at 2% of, say, 100K monthly conversations adds maybe $50-150 depending on judge model. Compare that to one production incident where an action-taking agent misfires. If you are unsure which risk tier your agent sits in, run it through our free AI agent risk scorer; the score maps directly onto the table above.

The one place not to economize: human calibration of your judge. Two hours a month of a domain expert reviewing 30-50 judge decisions is what keeps every automated score above it trustworthy. Skip it, and within a quarter you are steering by an instrument nobody has checked.

Common Mistakes in Agent Evals

Patterns I see repeatedly in eval setups that are not earning their keep:

1. Testing the demo, not the distribution. The eval set contains the ten inputs from the sales demo, all phrased politely and completely. Real users write fragments, typos, and threats. Source cases from real traffic or honest pilots, always.

2. Vibes-based judging. "Score 1-10 for helpfulness" with no rubric and no reference. The scores are noise, the trend lines are noise, and the team learns to ignore them. Binary rubric criteria with required reasons, every time.

3. Grading only outcomes. Covered above, but it bears repeating: agents that succeed via broken trajectories are pre-failures. Score the path.

4. An eval set that never grows. The dataset was built in week one and never touched again. Six months later it tests an agent that no longer exists against users who never existed. Add every production failure; prune quarterly.

5. Unpinned judge models. Your judge silently updates, scores shift 4 points, and you spend two days debugging an agent regression that is actually a judge regression. Pin versions; re-baseline deliberately when you upgrade.

6. 100% pass rate as the target. If every case passes, your eval set is too easy, and it has stopped generating information. A healthy golden dataset sits around 85-95% with the failures pointing at known hard cases you are actively working on.

7. Evals as a launch artifact instead of a living gate. The team runs a big eval push before launch, ships, and never runs them again. All the value of evals is in the loop: every change gated, every failure captured. A one-time eval is a screenshot; a CI-wired eval suite is a security camera.

8. No human anywhere in the loop. Fully automated eval pipelines drift. Judges need calibration, datasets need curation, and ambiguous cases need adjudication. The goal is to make human review surgical, not to eliminate it; the human-in-the-loop patterns that protect your runtime also protect your evals.

Most of these mistakes share a root cause: treating evals as a QA checkbox rather than as the feedback system that makes agent iteration possible at all. Without evals you cannot tell whether a change helped, so you either stop changing things (stagnation) or change them blind (roulette).

Next Steps: From Eval Basics to Production Discipline

Here is the whole guide compressed into a starting checklist you can execute this week:

Week 1: Foundations
[ ] Pull 20-30 real cases from traces, pilots, or expert interviews
[ ] Write the dataset JSON with expected_output, expected_tools, must_not
[ ] Write deterministic pytest assertions for tool calls and structure

Week 2: Judging and CI
[ ] Write a rubric-based judge prompt with binary criteria
[ ] Human-label 30 cases and measure judge agreement (target 85%+)
[ ] Add the CI workflow: assertions first, judge second, regression gate

After launch
[ ] Tier 1 deterministic checks on 100% of traces
[ ] Judge sampling on 1-5% of traffic plus all flagged traces
[ ] Weekly: convert confirmed production failures into golden cases
[ ] Monthly: re-calibrate the judge against fresh human labels

None of this requires a big platform commitment on day one. A JSON file, pytest, one judge prompt, and a GitHub Action gets you 80% of the value; graduate to LangSmith, Langfuse, or DeepEval when you want experiment history and trace-linked datasets.

If you want the full production discipline, evals are one pillar of a larger system that includes deployment, observability, guardrails, and human oversight. Our Production-Grade Agent Engineering course covers this entire lifecycle hands-on: you build an agent, build its eval suite, wire the CI gates, deploy it (using the same stack as our LangGraph + vLLM deployment guide), and run the online eval loop against real traffic. It is the course we wish existed when we shipped our first agents and learned this material via incidents instead.

And if you are staring at an agent that is already in production without any of this, that is a common and fixable situation: start with the online deterministic checks (they need no dataset), harvest your first golden cases from the traces you already have, and work backward to the CI gate. If you would rather have experienced hands on it, work with us; retrofitting eval discipline onto live agent systems is a large part of what we do.

FAQ

What are AI agent evals?

Agent evals are automated tests that measure an AI agent's quality: whether it produces correct final answers (outcome evals) and whether it takes a sound path to get there (trajectory evals, covering tool calls, ordering, and efficiency). They combine deterministic assertions, LLM-as-judge scoring against rubrics, and human review, and they run both offline against a golden dataset and online against sampled production traffic.

How many test cases do I need to start evaluating an agent?

Start with 10-50 real cases sourced from production traces, pilot sessions, or domain-expert interviews. A small dataset of real inputs beats a large synthetic one because synthetic cases inherit their author's blind spots. Grow it continuously: every confirmed production failure should become a permanent regression case, so a mature agent typically accumulates 100-300 cases within its first year.

Is LLM-as-judge reliable enough to trust?

Yes, with guardrails. Unmitigated judges show position bias (the first-listed answer wins pairwise comparisons 10-15 percentage points more often), verbosity bias (longer answers score higher), and self-preference bias (models favor their own outputs). Mitigate with binary rubric criteria instead of 1-10 scores, randomized answer order in pairwise comparisons, a different model family for judging than for the agent, pinned judge versions, and a monthly calibration check against 30-50 human-labeled cases targeting 85%+ agreement.
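The calibration check is plain arithmetic: compare the judge's binary verdicts with human labels on the same cases and see whether the match rate clears the target. The sketch below assumes one pass/fail verdict per case; the function names and the 0.85 default are illustrative, not part of any library.

```python
def judge_agreement(human_labels, judge_labels):
    """Fraction of cases where the judge's verdict matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled case")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def needs_recalibration(human_labels, judge_labels, target=0.85):
    """True when agreement falls below the target (85% by default)."""
    return judge_agreement(human_labels, judge_labels) < target
```

On a 30-case calibration set, 26 matching verdicts (about 86.7%) clears the bar, while 25 (about 83.3%) does not, so a single disagreement can flip the result. That is one reason to label closer to 50 cases when you can.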

What is the difference between trajectory evals and outcome evals?

Outcome evals score only the final answer. Trajectory evals score the process: which tools the agent called, in what order, with what arguments, and how many steps it took. You need both because they fail independently. An agent can reach the right answer through an unsafe or lucky path (which outcome evals miss), or execute a perfect trajectory and fumble the final synthesis (which trajectory evals miss).
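A minimal sketch of the two checks side by side makes the independence concrete. The `Run` record, the substring-based outcome check, and the tool names are assumptions for illustration; a real suite would read these from traces and would usually score the outcome with a judge rather than a substring match.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    final_answer: str
    tool_calls: list = field(default_factory=list)  # tool names, in call order


def outcome_pass(run, expected_substring):
    """Outcome eval: looks only at the final answer."""
    return expected_substring.lower() in run.final_answer.lower()


def trajectory_pass(run, expected_tools, forbidden_tools, max_steps):
    """Trajectory eval: step budget, forbidden tools, and tool ordering."""
    if len(run.tool_calls) > max_steps:
        return False
    if any(tool in forbidden_tools for tool in run.tool_calls):
        return False
    # Expected tools must appear in order (as a subsequence of the calls).
    remaining = iter(run.tool_calls)
    return all(tool in remaining for tool in expected_tools)


expected = ["lookup_order", "get_refund_policy"]
forbidden = {"issue_refund"}

# Right answer, unsafe path: the outcome check passes, the trajectory check fails.
lucky = Run("Refund denied: the order is past 30 days.",
            ["issue_refund", "lookup_order", "get_refund_policy"])

# Clean path, wrong synthesis: the trajectory check passes, the outcome check fails.
fumbled = Run("Sure, your refund is approved.",
              ["lookup_order", "get_refund_policy"])
```

Reporting the two results separately tells you whether to fix the tool-use logic or the final synthesis prompt.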

LangSmith vs Langfuse: which should I use for agent evals?

LangSmith if you are on LangChain/LangGraph and want the most mature integrated path from traces to datasets to CI-gated experiments; note that self-hosting is enterprise-only and per-trace pricing grows with volume. Langfuse if you need self-hosting (it is MIT-licensed with first-class Docker deployment), framework neutrality, or flat predictable pricing; its eval workflow is more DIY. Your golden dataset ports between them, so the choice is not a lock-in decision.

How do I run agent evals in CI without huge costs or slow builds?

Run deterministic pytest assertions first (free, fast, fail-fast), then LLM-as-judge on your golden dataset only when the structural checks pass. Trigger the workflow only on paths that affect agent behavior (prompts, tools, graph code), and gate merges on regression versus the main-branch baseline rather than an absolute score. A 50-100 case judged run costs well under a dollar per PR and finishes in a few minutes.
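Gating on regression rather than an absolute score can be sketched as a per-case comparison against the main-branch results. The dict shape (case id mapped to pass/fail) and the function names below are assumptions; the point is that a case which passed on main and fails on the PR blocks the merge even if the overall pass rate is unchanged.

```python
def regressions(baseline, candidate):
    """Case ids that passed on the baseline but fail (or are missing) now."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


def gate(baseline, candidate, max_regressions=0):
    """True when the candidate may merge."""
    return len(regressions(baseline, candidate)) <= max_regressions


main_results = {"refund-001": True, "refund-002": False, "billing-001": True}

# Same 2/3 pass rate as main, but a previously passing case now fails.
pr_results = {"refund-001": True, "refund-002": True, "billing-001": False}
```

In a CI script, exiting with a non-zero status when `gate(...)` returns `False` is enough to make the check merge-blocking.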

How do I test AI agents after they are already in production?

Use a three-tier online setup: deterministic checks on 100% of traces (structure, forbidden tools, PII, step budgets), asynchronous LLM-as-judge scoring on a 1-5% sample plus every flagged or thumbed-down trace, and aggregated user signals like escalation and abandonment rates. For risky changes, A/B the new prompt or model on 5-10% of traffic with identical scoring on both arms before full rollout.
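The routing decision for the judge tier can be sketched as below. It assumes each trace has a stable string id and hashes that id so the same trace always lands in or out of the sample; the function names and the 2% default are illustrative choices within the 1-5% range above.

```python
import hashlib


def in_sample(trace_id, rate):
    """Deterministically place a trace in or out of a rate-sized sample."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64


def should_judge(trace_id, flagged=False, thumbs_down=False, rate=0.02):
    """Judge every flagged or thumbed-down trace, plus a random sample."""
    return flagged or thumbs_down or in_sample(trace_id, rate)
```

Hashing instead of calling a random number generator means a replayed or re-ingested trace gets the same decision, which keeps sampled metrics comparable across reprocessing runs.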

Should agent tests be allowed to be flaky?

Some run-to-run variance is inherent to LLMs, but treat flakiness as signal rather than noise. For safety-critical assertions (a forbidden tool is never called), run the case 3-5 times and require every run to pass. For softer behavioral checks, accept 4 of 5. A test that passes 60% of the time is telling you the agent is unreliable on that input, which is exactly the kind of case that belongs in your golden dataset with a fix behind it.
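One way to encode those thresholds is a small repeat-and-count helper. This is a sketch: `check` is assumed to be a zero-argument callable that runs the agent once and returns `True` on pass, and the helper names are hypothetical.

```python
def passes_k_of_n(check, n, required):
    """Run check() n times and require at least `required` passes."""
    passes = sum(1 for _ in range(n) if check())
    return passes >= required


def safety_check(check, n=5):
    """Safety-critical: every one of the n runs must pass."""
    return passes_k_of_n(check, n, required=n)


def behavioral_check(check, n=5, required=4):
    """Softer behavioral check: tolerate one miss in five."""
    return passes_k_of_n(check, n, required)
```

Logging the raw pass count alongside the verdict is worthwhile, since a case that sits at 4 of 5 week after week is a candidate for a fix rather than a permanent tolerance.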
