AI Agent Observability: Tracing, Metrics, and Alerting for Production Agents
How to monitor AI agents in production: what a good trace contains, LangSmith vs Langfuse vs Helicone vs Phoenix, a hands-on self-hosted Langfuse setup for LangGraph, the six dashboard charts that matter, and alert rules with real thresholds.
10+ years shipping production ML across TensorFlow, PyTorch, AWS, and GCP. Ships every A8gent agent before it becomes a lesson.
- AI agents are unobservable by default. A failed conversation with no trace cannot be debugged, reproduced, or fixed - instrument before you ship, not after the first incident.
- Think in three layers: traces answer 'what happened in this one conversation', metrics answer 'how is the system trending', and production evals answer 'is quality degrading'. You need all three.
- LangSmith is the fastest path for LangGraph teams that accept a hosted platform; Langfuse is the strongest open source option and self-hosts on your own infrastructure with no per-trace fees; Helicone fits multi-provider cost tracking; Phoenix fits eval-heavy ML teams.
- Six charts cover 90% of agent monitoring needs: latency p95, cost per conversation, tool error rate, loop count distribution, escalation rate, and token usage by model.
- Alert on symptoms users feel, with concrete thresholds: p95 latency above 30s, tool error rate above 5%, cost per conversation above 3x baseline, and any conversation turn exceeding 15 LLM calls.
- Your traces are your best eval dataset. Curate failed and interesting production traces into a golden dataset, then run regression evals against it before every prompt or model change.
Why AI Agents Are Unobservable by Default
Here is a support ticket you will eventually receive: "Your AI assistant told a customer we offer refunds on custom orders. We don't. The customer has a screenshot." Your first question is obvious - what exactly did the agent see, think, and do in that conversation? If you did not set up tracing, the honest answer is: you have no idea, and you never will.
Traditional software fails loudly and reproducibly. A null pointer throws a stack trace, the same input produces the same crash, and your APM tool points at the line number. Agents fail differently. The process returns HTTP 200, the logs show "request completed", and yet the output was wrong, expensive, or harmful. The failure lives inside a nondeterministic sequence of LLM calls, tool invocations, and retrieved context that vanished the moment the response was sent.
This is what makes agents unobservable by default. An agent conversation is not one request - it is a trajectory. A single user message might trigger 4 LLM calls, 3 tool executions, 2 retrieval queries, and a conditional loop, and any step in that chain can be the root cause. Standard logging captures none of this structure. You see "POST /chat 200 8.4s" and nothing else.
The stakes compound in production. Without observability you cannot answer basic operational questions: Which conversations cost the most? Which tool fails most often? Did yesterday's prompt change make the agent slower? Is the agent stuck in retry loops for a subset of users? Every one of these questions has burned a real team we have worked with, and the answer usually arrived weeks late, through a billing surprise or an angry customer rather than an alert.
The good news is that agent observability in 2026 is a solved problem at the tooling level. OpenTelemetry has GenAI semantic conventions, mature platforms exist on both the hosted and open source side, and instrumenting a LangGraph agent takes about ten lines of code. What is not solved is knowing what to capture, what to chart, and what to alert on. That is what this guide covers, end to end, with a working self-hosted setup you can deploy today.
The Three Layers: Traces, Metrics, and Production Evals
Teams new to agent monitoring usually conflate three different jobs into one vague goal of "observability". Separating them clarifies what tooling you need and what questions each layer answers.
| Layer | Question it answers | Granularity | Consumed by | Typical tooling |
|---|---|---|---|---|
| Traces | What exactly happened in this one conversation? | Per request, per span | Engineers debugging incidents | Langfuse, LangSmith, Phoenix |
| Metrics | How is the system trending over hours and days? | Aggregated time series | On-call, dashboards, alerts | Platform dashboards, Prometheus, Grafana |
| Production evals | Is answer quality degrading, independent of errors? | Sampled and scored outputs | Product and ML owners | LLM-as-judge scoring, golden datasets |
Traces are the foundation. Every conversation gets a tree of spans: one span per LLM call, one per tool execution, one per retrieval step, nested under a root span for the whole interaction. Traces are what you open when something went wrong for a specific user. Without them, the other two layers have nothing to aggregate.
Metrics are aggregations computed over traces: p95 latency, cost per conversation, tool error rate, tokens per day. Metrics power dashboards and alerts. Crucially, agent metrics are different from infrastructure metrics - your GPU utilization can be perfectly healthy while your agent burns $400/day looping on a broken tool. If you deployed the stack from our LangGraph + vLLM production guide, the Grafana setup there covers infrastructure; this guide covers the agent layer above it.
Production evals catch the failures that never throw errors. An agent that confidently gives outdated policy answers produces clean traces and healthy metrics. The only way to catch it is to score a sample of production outputs continuously - LLM-as-judge on helpfulness and groundedness, plus regression runs against a golden dataset. We cover how to bootstrap this from your traces in a later section.
Most teams should build in this order: traces first (day one, before launch), metrics and alerts second (first week), production evals third (once you have real traffic to sample). Skipping straight to evals without traces is a common mistake - you end up knowing quality dropped without being able to see why.
What a Good Agent Trace Contains
A useful agent trace is more than a log of LLM inputs and outputs. It has to reconstruct the agent's full trajectory: the decision at each step, the evidence available for that decision, and the cost of every hop. Here is what a well-instrumented trace of a single support conversation turn looks like conceptually:
At the root sits the conversation turn span carrying user id, session id, agent version, and total duration. Nested under it, in order: an LLM span for the initial reasoning call (model, prompt, completion, token counts, finish reason), a tool span for the CRM lookup it decided to make (tool name, arguments, result, latency, error status), a retrieval span for the policy documents fetched (query, returned chunks, similarity scores), a second LLM span where the agent synthesizes the answer, and finally the output attached to the root. The trajectory view - the ordered sequence of decisions - is what separates agent tracing from plain LLM logging.
In JSON, a simplified trace following the OpenTelemetry GenAI semantic conventions looks like this:
{
  "trace_id": "a3f8c2e1",
  "name": "support-agent-turn",
  "user_id": "usr_4821",
  "session_id": "sess_190bd2",
  "metadata": { "agent_version": "v2.3.1", "env": "prod" },
  "spans": [
    {
      "name": "chat meta-llama/Llama-3.1-70B-Instruct",
      "type": "generation",
      "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "meta-llama/Llama-3.1-70B-Instruct",
        "gen_ai.usage.input_tokens": 1842,
        "gen_ai.usage.output_tokens": 96,
        "gen_ai.response.finish_reasons": ["tool_calls"]
      },
      "latency_ms": 2140,
      "cost_usd": 0.0031
    },
    {
      "name": "tool:crm_lookup_order",
      "type": "tool",
      "input": { "order_id": "ORD-99231" },
      "output": { "status": "shipped", "custom": true },
      "latency_ms": 412,
      "error": null
    },
    {
      "name": "chat meta-llama/Llama-3.1-70B-Instruct",
      "type": "generation",
      "attributes": {
        "gen_ai.usage.input_tokens": 2510,
        "gen_ai.usage.output_tokens": 187,
        "gen_ai.response.finish_reasons": ["stop"]
      },
      "latency_ms": 3890,
      "cost_usd": 0.0044
    }
  ],
  "total_cost_usd": 0.0075,
  "total_latency_ms": 6442
}
Note the gen_ai.* attribute names. The OpenTelemetry GenAI semantic conventions standardize these fields (model, token usage, finish reasons, operation name) so that traces are portable across backends. The conventions are still marked experimental as of 2026, but Langfuse, LangSmith, Phoenix, and the major cloud vendors all consume them. Emitting OTel-compatible spans means you are never locked into one vendor's SDK.
Three fields deserve special attention because teams routinely omit them and regret it: user_id and session_id (without them you cannot answer "show me everything that happened to this customer"), cost per span (computed from token counts and your price sheet - the only way to find expensive conversation patterns), and agent_version (without it you cannot attribute a regression to a specific prompt or graph change). If your agent runs on the OpenAI Agents SDK instead of LangGraph, the same fields apply - see our OpenAI Agents SDK tutorial for that framework's built-in tracing hooks.
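As a rough illustration of how cost per span falls out of token counts and a price sheet, here is a minimal sketch. The `PRICE_PER_MTOK` table, its rates, and the `GenerationSpan` type are hypothetical placeholders, not real pricing and not part of any SDK, so the results will not reproduce the cost_usd values in the JSON above.

```python
from dataclasses import dataclass

# Hypothetical price sheet in USD per 1M tokens. Replace with your real rates.
PRICE_PER_MTOK = {
    "meta-llama/Llama-3.1-70B-Instruct": {"input": 1.00, "output": 3.00},
}


@dataclass
class GenerationSpan:
    model: str
    input_tokens: int
    output_tokens: int


def span_cost_usd(span: GenerationSpan) -> float:
    """Cost of one LLM span: token counts times the per-token rate for its model."""
    rates = PRICE_PER_MTOK[span.model]
    weighted_tokens = (
        span.input_tokens * rates["input"] + span.output_tokens * rates["output"]
    )
    return weighted_tokens / 1_000_000


# Token counts taken from the two generation spans in the example trace.
spans = [
    GenerationSpan("meta-llama/Llama-3.1-70B-Instruct", 1842, 96),
    GenerationSpan("meta-llama/Llama-3.1-70B-Instruct", 2510, 187),
]
turn_cost = sum(span_cost_usd(s) for s in spans)
print(f"turn cost: ${turn_cost:.6f}")
```

Keeping the price sheet in one place means a price change or a model swap only touches one table, and an unknown model raises a KeyError instead of silently reporting zero cost.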
LangSmith vs Langfuse vs Helicone vs Phoenix
Four platforms cover the realistic shortlist for most teams in 2026. Here is the honest comparison:
| | LangSmith | Langfuse | Helicone | Phoenix (Arize) |
|---|---|---|---|---|
| Open source | No | Yes (MIT core) | Yes (core) | Yes |
| Self-host | Enterprise plan only (Kubernetes) | Yes, free, Docker Compose | Yes (core) | Yes, free |
| Pricing (hosted) | Free 5K traces/mo; $39/seat with 10K traces, then ~$2.50 per 1K traces | Free tier; paid cloud from ~$29/mo; self-host unlimited at $0 | Free 10K requests/mo; paid from ~$79/mo | Free tier ~25K spans; managed cloud from ~$50/mo |
| LangGraph integration | Native, two env vars, zero code | First-class via LangChain callback handler | Proxy-based, one base URL change | OpenInference auto-instrumentation |
| OpenAI SDK integration | SDK wrapper | Drop-in client wrapper + OTel | Proxy, no SDK change | Auto-instrumentor |
| Eval support | Strong: datasets, LLM-as-judge, annotations | Strong: datasets, experiments, LLM-as-judge | Basic scoring | Strongest eval primitives, drift analysis |
| Standout trait | Deepest LangGraph trajectory views | Full platform, your infrastructure, no per-trace fees | Sub-millisecond proxy, cost tracking across providers | ML-grade evals and embedding analysis |
Recommendations by team type, based on deployments we have run:
Small team on LangGraph, no data constraints: LangSmith. Two environment variables and you have full trajectory tracing. The free tier covers prototyping, and $39/seat is cheap relative to engineering time. The catch: self-hosting is gated behind Enterprise pricing, and per-trace overage costs grow with traffic, so know your volume before committing.
Any team with data residency, compliance, or cost-at-scale concerns: Langfuse self-hosted. It is the only option that combines a full feature set (tracing, prompt management, datasets, LLM-as-judge evals, dashboards) with genuinely free self-hosting. Traces never leave your network, which matters enormously if your agent handles customer data - see our agent security and privacy guide for why trace data is often the most sensitive data store in your whole stack.
Multi-provider gateway users who mainly want cost visibility: Helicone. Change one base URL and every request across OpenAI, Anthropic, and Google is logged with cost attribution. Weakest on agent trajectory views and evals, so pair it with something else if you need deep debugging.
Eval-heavy ML teams: Phoenix. If your bottleneck is measuring quality rather than debugging trajectories, Phoenix's evaluation primitives are the most rigorous of the four, and it self-hosts for free.
For the rest of this guide we build on Langfuse self-hosted. The reasoning: it is the choice that works for the most teams (no vendor bill that scales with traffic, no data leaving your infrastructure, full eval support), and it fits naturally alongside the self-hosted agent stack we recommend elsewhere. If you choose LangSmith instead, every concept below maps one-to-one - only the setup section differs.
Hands-On: Self-Hosting Langfuse
Langfuse v3 is a six-service stack: the web UI/API, a background worker, ClickHouse for analytics, Postgres for relational data, Redis for queueing, and MinIO for S3-compatible blob storage. That sounds heavy compared to the v2 days of a single Postgres container, but the official Docker Compose file handles all of it, and the payoff is that ClickHouse keeps trace queries fast even at tens of millions of spans.
# Clone and start Langfuse v3
git clone https://github.com/langfuse/langfuse.git
cd langfuse
# Edit secrets before first start (do not run defaults in prod)
# In docker-compose.yml or a .env file, set at minimum:
# NEXTAUTH_SECRET, SALT, ENCRYPTION_KEY (openssl rand -hex 32)
# POSTGRES_PASSWORD, CLICKHOUSE_PASSWORD, REDIS_AUTH
# MINIO_ROOT_PASSWORD
docker compose up -d
# Web UI comes up on port 3000
curl http://localhost:3000/api/public/health
Sizing guidance from real deployments: a 4 vCPU / 16GB RAM VM comfortably handles a few hundred thousand traces per month. ClickHouse is the memory-hungry component; give it headroom. Put nginx with SSL in front of port 3000 exactly as described in our production deployment guide, and never expose ClickHouse, Redis, or MinIO ports publicly.
After first start, open the UI, create an organization and a project, and generate API keys. You get a pk-lf-... public key and sk-lf-... secret key - these go into your agent's environment:
# Agent service environment
LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxx
LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxx
LANGFUSE_HOST=https://langfuse.yourdomain.com
Two operational notes before we instrument anything. First, trace ingestion is asynchronous by design: the web container writes incoming batches to blob storage and the worker ingests them into ClickHouse, so a worker outage delays trace visibility but does not lose data or slow your agent. Second, plan retention early - add a scheduled job or use Langfuse's built-in retention settings to expire raw traces after 30-90 days, keeping only aggregates and curated dataset items. Trace stores grow faster than teams expect, especially with full prompts attached.
Instrumenting a LangGraph Agent
Langfuse integrates with LangGraph through the LangChain callback system, which means instrumentation is a config change, not a rewrite. Starting from an agent like the one in our LangGraph tutorial:
pip install langfuse langchain-core
import os

from langfuse import Langfuse, get_client
from langfuse.langchain import CallbackHandler
from langchain_core.messages import HumanMessage

# Initialize once at startup (reads LANGFUSE_* env vars)
langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ["LANGFUSE_HOST"],
)

# One handler per process is fine; it is thread-safe
langfuse_handler = CallbackHandler()


async def run_agent_turn(graph, message: str, user_id: str,
                         session_id: str) -> str:
    result = await graph.ainvoke(
        {"messages": [HumanMessage(content=message)]},
        config={
            "callbacks": [langfuse_handler],
            "configurable": {"thread_id": session_id},
            "metadata": {
                # Langfuse picks these up as first-class fields
                "langfuse_user_id": user_id,
                "langfuse_session_id": session_id,
                "langfuse_tags": ["support-agent", "prod"],
                "agent_version": os.getenv("AGENT_VERSION", "dev"),
            },
        },
    )
    return result["messages"][-1].content
That is the entire integration. Every graph invocation now produces a trace with one span per node execution, one generation span per LLM call (with token counts and model name captured automatically), and one span per tool call with arguments and results. Because your agent's turns share a session_id, Langfuse groups them into a session view - the full multi-turn conversation, which is exactly what you open when a user reports a bad interaction.
Two details matter for correctness. First, spans are flushed asynchronously in batches; in short-lived processes (scripts, serverless) call get_client().flush() before exit or you will silently lose the last traces. Long-running FastAPI services flush automatically. Second, if you self-host your LLM behind vLLM, token counts come from the OpenAI-compatible response's usage block, but cost is not inferred for custom models - define your model and its per-token price in Langfuse's model settings so cost columns populate. For hosted models (OpenAI, Anthropic), pricing is built in.
If parts of your agent live outside LangGraph - a custom retrieval function, a post-processing step, an MCP tool server call - wrap them with the @observe decorator so they appear as spans in the same trace:
from langfuse import observe


@observe(name="rerank-results")
def rerank(query: str, chunks: list[str]) -> list[str]:
    # custom logic outside the graph, still traced
    ...
Deploy this, send a few test conversations, and open the Langfuse UI. You should see the full trajectory: agent node, tool calls, second agent pass, final output, with latency and token counts on every hop. If tool spans are missing, the usual cause is tools invoked outside the graph's ToolNode without callbacks propagated - pass the config through.
Metadata That Pays Rent: Users, Sessions, Cost, Versions
Raw traces answer "what happened". Metadata answers "to whom, under which configuration, and at what cost" - and it is what turns a trace store into an operational tool. Four fields are non-negotiable:
user_id. Pseudonymous, not an email address (more on that in the PII section). This enables the single most common production query: "show me every conversation for the customer who just complained." It also unlocks per-user cost analysis, which is how you discover that 2% of users generate 40% of your LLM spend.
session_id. Set it to your LangGraph thread_id so trace sessions align with checkpointed conversations. Multi-turn failures (the agent forgot context, contradicted itself, looped) are only visible at the session level, never in a single trace.
agent_version. Stamp every trace with a version that changes whenever the prompt, graph structure, model, or tool set changes. When p95 latency jumps on Tuesday, filtering traces by version tells you in seconds whether Tuesday's deploy is the culprit. Use a git SHA or a semantic version, but make it automatic - manual version strings drift.
Cost. Token counts are captured automatically; cost requires model pricing configuration as noted above. Once populated, cost per trace and cost per session become filterable columns. Sort your traces by cost descending once a week - the top 20 is a reliable catalog of your worst agent behaviors: runaway loops, bloated system prompts, tools returning 50KB of JSON that gets stuffed into context.
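The weekly cost review and the per-user spend analysis can be sketched in a few lines once traces are exported as plain records. The record shape here (`trace_id`, `user_id`, `cost_usd`) is an assumption for illustration, not the Langfuse export schema; adapt the keys to whatever your export returns.

```python
from collections import defaultdict


def top_spenders(traces, n=20):
    """Return the n most expensive traces, highest cost first."""
    return sorted(traces, key=lambda t: t["cost_usd"], reverse=True)[:n]


def cost_share_by_user(traces):
    """Map each user_id to its fraction of total spend."""
    totals = defaultdict(float)
    for trace in traces:
        totals[trace["user_id"]] += trace["cost_usd"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {user: cost / grand_total for user, cost in totals.items()}


# Illustrative records only; real data would come from your trace store.
traces = [
    {"trace_id": "t1", "user_id": "usr_a", "cost_usd": 0.01},
    {"trace_id": "t2", "user_id": "usr_b", "cost_usd": 0.50},
    {"trace_id": "t3", "user_id": "usr_a", "cost_usd": 0.02},
    {"trace_id": "t4", "user_id": "usr_c", "cost_usd": 0.47},
]

for trace in top_spenders(traces, n=2):
    print(trace["trace_id"], trace["cost_usd"])
print(cost_share_by_user(traces))
```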
Beyond these four, add whatever dimensions your product slices by: tenant id for B2B, conversation channel (web, Slack, email), experiment arm for A/B tests, and a boolean for whether a human was involved. That last one matters if you run human-in-the-loop approval flows - tagging traces that hit an approval gate lets you measure escalation rate, which is one of the six core charts in the next section.
One anti-pattern to avoid: do not dump entire application state objects into metadata. Metadata should be low-cardinality, filterable dimensions. Large payloads belong in span inputs/outputs where they are stored efficiently and subject to retention policies.
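Returning to agent_version: one way to make the stamp automatic is to resolve it at startup from the environment, falling back to the current git SHA. This is a sketch under assumptions: the helper name `resolve_agent_version` is hypothetical, and it presumes either that your CI sets `AGENT_VERSION` or that the process runs inside a git checkout.

```python
import os
import subprocess


def resolve_agent_version(default: str = "dev") -> str:
    """Prefer an explicit AGENT_VERSION, then the short git SHA, then a default."""
    explicit = os.getenv("AGENT_VERSION")
    if explicit:
        return explicit
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
            timeout=5,
        )
        return completed.stdout.strip() or default
    except (OSError, subprocess.SubprocessError):
        # git is missing, this is not a repository, or the call timed out.
        return default


AGENT_VERSION = resolve_agent_version()
print(AGENT_VERSION)
```

Container images often ship without a .git directory, so baking the SHA into `AGENT_VERSION` at build time is usually the more reliable path; the git call is mainly a convenience for local runs.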
The Agent Monitoring Dashboard: Six Charts Worth Having
Every observability platform lets you build unlimited charts, and most teams respond by building dashboards nobody reads. In practice, six charts cover 90% of what you need to see daily. Build these in Langfuse's custom dashboards (or Grafana against the Langfuse API if you want everything in one pane):
1. End-to-end latency p95, by agent version. Not the mean - the mean hides the slow tail your users actually complain about. Split by version so regressions are attributable. Watch for step changes after deploys and slow drift as context windows grow with feature creep.
2. Cost per conversation (session), daily. Plot the median and p95 together. Median tells you the baseline economics; p95 catches the pathological sessions. A rising p95 with a flat median almost always means a loop or tool-retry problem affecting a subset of conversations.
3. Tool error rate, per tool. The fraction of tool spans ending in an error, one line per tool. Tools are where agents touch the messy real world - APIs time out, schemas drift, auth tokens expire. A tool going from 1% to 20% errors is your most common production incident, and the agent often masks it by retrying or hallucinating around the failure.
4. Loop count distribution. Histogram of LLM calls per conversation turn. A healthy ReAct-style agent settles into a tight distribution (typically 2-5 calls per turn). A fattening right tail means the agent is spinning: a tool returning unusable output, a prompt change that broke stop conditions, or a model swap that follows instructions differently.
5. Escalation rate. The fraction of conversations handed to a human, hitting an approval gate, or ending with the agent admitting it cannot help. Rising escalation is an early quality signal that precedes user complaints. Falling escalation is not automatically good - verify the agent is not just answering things it should be escalating.
6. Token usage by model, daily. Input and output tokens per model. This is your capacity planning and budget chart. It also catches silent misconfiguration - a fallback to an expensive model, or a prompt cache that stopped working, shows up here days before it shows up on the invoice.
Notice what is not on this list: GPU utilization, request throughput, memory. Those are infrastructure metrics and belong in the Grafana stack from the deployment guide. Keeping agent behavior charts and infrastructure charts separate keeps both dashboards legible. During an incident you glance at both: infra healthy plus agent metrics bad means the problem is in prompts, tools, or the model itself.
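If you compute these charts yourself rather than in a platform dashboard, three of them reduce to short aggregations over exported trace data. The sketch below assumes simplified record shapes (`tool`, `error`, `llm_calls`) that are illustrative, not a real export format, and it uses the nearest-rank definition of a percentile, which is one of several common definitions.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence; pct is in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def tool_error_rates(tool_spans):
    """Fraction of spans ending in an error, per tool name."""
    calls, errors = Counter(), Counter()
    for span in tool_spans:
        calls[span["tool"]] += 1
        if span["error"] is not None:
            errors[span["tool"]] += 1
    return {tool: errors[tool] / calls[tool] for tool in calls}


def loop_count_histogram(turns):
    """How many turns needed 1, 2, 3, ... LLM calls."""
    return Counter(turn["llm_calls"] for turn in turns)


# Illustrative data only.
latencies_s = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
tool_spans = [
    {"tool": "crm_lookup_order", "error": None},
    {"tool": "crm_lookup_order", "error": "timeout"},
    {"tool": "policy_search", "error": None},
]
turns = [{"llm_calls": 2}, {"llm_calls": 2}, {"llm_calls": 3}, {"llm_calls": 17}]

print(percentile(latencies_s, 95))
print(tool_error_rates(tool_spans))
print(sorted(loop_count_histogram(turns).items()))
```

The single 40-second request dominates the p95 while barely moving the mean, which is the reason the latency chart plots the tail.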
Alert Rules with Thresholds That Actually Work
Alerts should fire on symptoms users feel, with thresholds loose enough to avoid pager fatigue. These are the rules we deploy by default, tuned over multiple production agents. Adjust the numbers to your baseline, but start here rather than from zero:
| Alert | Condition | Window | Severity | Usual root cause |
|---|---|---|---|---|
| Latency spike | p95 end-to-end latency > 30s | 10 min | Page | Model overload, slow tool, context bloat |
| Tool failure | Any single tool error rate > 5% | 15 min | Page | Downstream API broken, expired credentials |
| Cost anomaly | Mean cost per conversation > 3x trailing 7-day baseline | 1 hour | Ticket | Loops, prompt growth, wrong model routed |
| Runaway loop | Any conversation > 15 LLM calls in one turn | Immediate | Ticket + auto-kill | Broken stop condition, adversarial input |
| Escalation surge | Escalation rate > 2x 7-day baseline | 1 hour | Ticket | Quality regression, new query pattern |
| Trace ingestion silence | Zero traces received despite traffic | 10 min | Page | Instrumentation broke - you are flying blind |
| Eval score drop | Mean LLM-as-judge score < 0.7 on sampled prod traffic | Daily | Ticket | Model drift, prompt regression, stale knowledge |
Three of these deserve commentary. The runaway loop alert should be paired with a hard limit in the agent itself - LangGraph's recursion_limit config - so the alert is a signal that the safety net caught something, not the only line of defense. Never rely on monitoring alone to stop a loop that costs money per iteration.
The trace ingestion silence alert is the one everybody forgets. Your observability pipeline is itself a production system; when the callback handler breaks after a dependency upgrade, you lose visibility exactly when you need it. Alert on the absence of data, not just bad data.
The cost anomaly alert uses a relative threshold (3x baseline) rather than an absolute dollar figure because traffic varies. But add one absolute backstop too: a daily spend ceiling that pages someone regardless of ratios. Teams have been saved by a blunt "we spent $500 today and it is 11am" alert more often than by any sophisticated anomaly detection.
Wire delivery through whatever your team already watches - Slack webhook, PagerDuty, email. Langfuse supports webhook-based alerting on metric thresholds in recent versions; alternatively, a 50-line cron script hitting the Langfuse metrics API and posting to Slack is unglamorous and completely sufficient.
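The core of such a cron script is a pure function that turns a metrics snapshot into a list of alerts. The sketch below covers five rules from the table plus the absolute spend backstop; the `Snapshot` fields and function names are hypothetical, and fetching the numbers from your metrics source and POSTing the payload to a Slack incoming webhook are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    p95_latency_s: float
    tool_error_rates: dict  # tool name -> error fraction in the window
    mean_cost_usd: float  # mean cost per conversation in the window
    baseline_cost_usd: float  # trailing 7-day mean cost per conversation
    max_llm_calls_in_turn: int
    traces_received: int
    requests_served: int
    spend_today_usd: float


def evaluate_alerts(s: Snapshot, daily_spend_ceiling_usd: float = 500.0):
    """Return (severity, message) pairs for every rule that fires."""
    alerts = []
    if s.p95_latency_s > 30:
        alerts.append(("page", f"p95 latency {s.p95_latency_s:.1f}s > 30s"))
    for tool, rate in sorted(s.tool_error_rates.items()):
        if rate > 0.05:
            alerts.append(("page", f"tool {tool} error rate {rate:.0%} > 5%"))
    if s.baseline_cost_usd > 0 and s.mean_cost_usd > 3 * s.baseline_cost_usd:
        alerts.append(("ticket", "mean cost per conversation > 3x baseline"))
    if s.max_llm_calls_in_turn > 15:
        alerts.append(("ticket", f"turn with {s.max_llm_calls_in_turn} LLM calls"))
    if s.requests_served > 0 and s.traces_received == 0:
        alerts.append(("page", "no traces received despite traffic"))
    if s.spend_today_usd > daily_spend_ceiling_usd:
        alerts.append(("page", f"daily spend ${s.spend_today_usd:.0f} over ceiling"))
    return alerts


def slack_payload(alerts):
    """Body for a Slack incoming webhook: one line per alert."""
    return {"text": "\n".join(f"[{sev.upper()}] {msg}" for sev, msg in alerts)}


bad = Snapshot(41.0, {"crm_lookup_order": 0.2}, 0.10, 0.02, 22, 0, 100, 600.0)
print(slack_payload(evaluate_alerts(bad))["text"])
```

Keeping the rule evaluation free of network calls makes the thresholds easy to unit test, which matters because an alert rule that never fires looks identical to a healthy system.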
PII in Traces: Redaction Without Losing Debuggability
Here is the uncomfortable truth about agent tracing: you have just built a database containing every message your users ever sent, every document your retrieval layer fetched, and every record your tools returned. Your trace store is frequently the most sensitive data system in your company, and it was created by a monitoring decision, not a data governance one.
Self-hosting Langfuse solves the "third party holds my data" problem but not the internal one - support engineers debugging traces should not see customer emails, health details, or payment fragments. The practical approach has three tiers:
Tier 1: Pseudonymize identifiers at the source. Never put emails or names in user_id. Use your internal opaque id and keep the mapping in your application database, where existing access controls and deletion workflows already apply. This also makes GDPR deletion tractable: Langfuse supports deleting all traces for a given user id via API, which only works if the id is consistent.
Tier 2: Mask span inputs and outputs before ingestion. The Langfuse SDK accepts a masking function applied to every input and output before data leaves your process:
import re

from langfuse import Langfuse

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
CARD = re.compile(r"\b(?:\d[ -]*?){13,16}\b")


def mask_pii(data, **kwargs):
    """Recursively mask emails, card numbers, and phone numbers."""
    if isinstance(data, str):
        data = EMAIL.sub("[EMAIL]", data)
        # Mask cards before phones: the PHONE pattern also matches
        # 13-16 digit card numbers and would label them [PHONE].
        data = CARD.sub("[CARD]", data)
        data = PHONE.sub("[PHONE]", data)
        return data
    if isinstance(data, dict):
        return {k: mask_pii(v) for k, v in data.items()}
    if isinstance(data, list):
        return [mask_pii(v) for v in data]
    return data


langfuse = Langfuse(mask=mask_pii)
Regex catches the structured 80%. For unstructured PII (names, addresses in free text), run a lightweight NER pass (Presidio is the standard choice) if your domain demands it - but be honest about the latency and false-positive tradeoff. Many teams settle on regex masking plus strict access control rather than aggressive NER that mangles traces.
Tier 3: Retention and access. Expire raw trace content on a schedule (30-90 days is typical), keep aggregates indefinitely, and gate trace UI access by role. Debuggability and privacy trade off directly: over-redact and your traces become useless for the exact incident investigations you built them for. The workable compromise is masked-by-default with a documented, audited break-glass path for active incidents. For a fuller treatment of data handling in agent systems, including what regulators expect, see our AI agent security and privacy guide, and run your setup through the agent risk scorer to find the gaps you have not thought about.
Closing the Loop: From Traces to Production Evals
Traces and metrics tell you when things break loudly. Evals tell you when quality erodes quietly. The best part of having a trace store is that your eval dataset builds itself - you just have to curate it.
Step 1: Build a golden dataset from real traces. Every week, review a slice of production traces: the highest-cost sessions, everything flagged by a user thumbs-down, everything that escalated, and a random sample of normal traffic. In Langfuse, add the interesting ones to a dataset with one click - the trace input becomes the test input, and you write (or correct) the expected output. Within a month you have 100-300 real cases that reflect what your users actually ask, which beats any synthetic benchmark. If your agent is retrieval-heavy, capture the retrieved context too, following the dataset patterns from our RAG agent guide.
Step 2: Run regression evals on every change. Before any prompt edit, model swap, or graph restructure ships, run the candidate against the golden dataset and score outputs - exact-match or rubric checks where possible, LLM-as-judge for open-ended quality. Langfuse's experiments feature stores each run against the dataset so you can diff versions side by side. The rule we enforce with clients: no agent change merges without an eval run attached, the same way no code change merges without tests.
Step 3: Score live traffic continuously. Sample 5-10% of production traces and run LLM-as-judge scoring on dimensions that matter for your product: groundedness (did the answer come from the retrieved context?), resolution (did the user's problem actually get solved?), and tone compliance. Langfuse runs these as managed LLM-as-judge evaluators on incoming traces, writing scores back onto the trace. Those scores then feed the eval-drop alert from the alerting section and give you a quality time series next to your latency and cost charts.
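If you run the judge yourself instead of relying on managed evaluators, the sampling decision in Step 3 can be made deterministic by hashing the trace id, so a retried or re-exported trace always gets the same answer. This is a sketch of one common approach; the function name `should_score` is hypothetical.

```python
import hashlib


def should_score(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly sample_rate of traces by hashing the id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


selected = sum(should_score(f"trace-{i}", 0.05) for i in range(10_000))
print(f"selected {selected} of 10000 traces")
```

Hash-based sampling also makes it possible to raise the rate later without re-scoring: every trace selected at 5% is still selected at 10%.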
This loop - traces feed datasets, datasets gate changes, judges score live traffic, scores feed alerts - is the difference between monitoring an agent and actually controlling one. It is also the single biggest maturity gap we see in teams that come to us: nearly everyone has some tracing by 2026, almost nobody has closed the loop to evals. The full workflow, including judge prompt design and calibrating judges against human labels, is covered in depth in our Production-Grade Agent Engineering course.
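The regression gate from Step 2 can be sketched as a function that runs a candidate over the golden dataset and fails when the mean score drops below a floor. Everything here is illustrative: the dataset shape, the `candidate` callable standing in for your agent, and the `exact_match` scorer are assumptions, and open-ended outputs would swap the scorer for a rubric or an LLM-as-judge call.

```python
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """1.0 when the strings match after trimming and lowercasing, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_regression(
    dataset: list[dict],
    candidate: Callable[[str], str],
    scorer: Callable[[str, str], float],
    min_mean_score: float,
) -> tuple[bool, float, list[str]]:
    """Score the candidate on every item; return (passed, mean score, failed ids)."""
    if not dataset:
        raise ValueError("golden dataset is empty")
    scores = {
        item["id"]: scorer(candidate(item["input"]), item["expected"])
        for item in dataset
    }
    mean_score = sum(scores.values()) / len(scores)
    failed = [item_id for item_id, score in scores.items() if score < 1.0]
    return mean_score >= min_mean_score, mean_score, failed


# Tiny illustrative dataset and a stub in place of the real agent.
golden = [
    {"id": "g1", "input": "refund on custom order?", "expected": "No"},
    {"id": "g2", "input": "refund on standard order?", "expected": "Yes"},
]


def stub_agent(question: str) -> str:
    return "no" if "custom" in question else "maybe"


passed, mean_score, failed = run_regression(golden, stub_agent, exact_match, 0.9)
print(passed, mean_score, failed)
```

Returning the failed ids alongside the pass or fail verdict gives the reviewer a direct list of cases to open in the trace UI.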
Common Mistakes (and What to Do Instead)
Instrumenting after the incident. The most common mistake by far. Teams launch, something goes wrong in week two, and only then does tracing get added - meaning the incident that motivated it can never be diagnosed. Tracing is ten lines of code; add it before the first real user.
Logging LLM calls but not tool calls. Tools are where most production failures originate, yet proxy-only setups capture just the LLM traffic. If your trace shows the model receiving garbage but not which tool produced it, you have half a trace. Instrument at the framework level (callbacks/OTel), not just the API gateway level.
No session grouping. Single-trace views cannot show multi-turn failures - lost context, contradictions, users rephrasing the same question four times. If you take one thing from the metadata section, take session_id.
Alerting on averages. Mean latency and mean cost are smoothed into uselessness by volume. Alert on p95s, per-tool rates, and distribution tails, where the actual user pain lives.
Treating the trace store as exempt from data governance. Full prompts and tool outputs are production data. Apply masking, retention, and access control from day one - retrofitting redaction onto six months of stored traces is miserable.
Building dashboards instead of habits. A dashboard nobody opens is decoration. The minimum viable ritual: one person opens the six charts every morning (two minutes), and the team reviews the week's worst traces every Friday (thirty minutes). The Friday review is where golden dataset items, prompt fixes, and new alert rules actually come from.
Skipping the observability stack because "we are still small". Small traffic is precisely when tracing is cheapest and reviews are most tractable. You can read every conversation at 50/day; you cannot at 5,000/day - and by then your habits are set.
If you are standing up agent observability for a real production system and want it reviewed - or want the whole loop from tracing through evals built with you - work with us. We have set this stack up enough times to know where each team's version will break. And before you ship, run through the agent risk checklist: observability gaps are the first section for a reason.
FAQ
What is AI agent observability?
AI agent observability is the practice of capturing what an agent actually did in production: a trace of every LLM call, tool execution, and retrieval step per conversation, aggregated metrics like latency p95 and cost per conversation, and continuous quality scoring of live outputs. It differs from standard APM because agent failures are usually semantic (wrong or costly behavior with HTTP 200 responses) rather than crashes, so you need trajectory-level visibility, not just request logs.
LangSmith vs Langfuse: which should I choose?
Choose LangSmith if you use LangGraph, want zero-setup tracing (two environment variables), and are fine with a hosted platform - the free tier covers 5K traces/month and paid plans start at $39 per seat per month. Choose Langfuse if you want open source, free self-hosting on your own infrastructure, no per-trace fees at scale, or you have data residency requirements. Feature-wise they are close: both offer trajectory tracing, sessions, datasets, and LLM-as-judge evals. The real decision is hosted convenience versus ownership and cost control.
Can I self-host LangSmith?
Only on the Enterprise plan, which is custom-priced and deploys to your own Kubernetes cluster on AWS, GCP, or Azure. Realistic total cost for a small self-hosted LangSmith deployment lands around $1,000+/month including licensing and infrastructure. If self-hosting on a budget is the requirement, Langfuse (free, Docker Compose) or Phoenix (free) are the practical choices.
What metrics should I monitor for an AI agent in production?
Six cover most needs: end-to-end latency p95 (not the mean), cost per conversation (median and p95), per-tool error rate, loop count distribution (LLM calls per turn), escalation-to-human rate, and daily token usage by model. Keep these separate from infrastructure metrics like GPU utilization - during an incident, healthy infrastructure plus degraded agent metrics tells you the problem is in prompts, tools, or the model.
How do I trace a LangGraph agent?
With Langfuse, install the SDK, create a CallbackHandler, and pass it in the callbacks list of your graph invocation config along with langfuse_user_id and langfuse_session_id in metadata - about ten lines of code, no graph changes. With LangSmith, set LANGSMITH_TRACING=true and LANGSMITH_API_KEY environment variables and tracing is automatic. Either way you get one span per graph node, per LLM call (with token counts), and per tool execution.
How do I keep PII out of LLM traces?
Use three tiers. First, pseudonymize user identifiers at the source (opaque ids, never emails). Second, apply a masking function in the tracing SDK that redacts emails, phone numbers, and card numbers from every span input and output before ingestion. Third, enforce retention (expire raw traces after 30-90 days) plus role-based access to the trace UI. Regex masking catches the structured majority; add NER-based redaction like Presidio only if your domain requires it, since it trades debuggability for coverage.
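The regex masking tier might look roughly like the sketch below. The patterns and the `mask_pii` name are illustrative assumptions, not a vetted PII detector: the phone pattern in particular will over-match other long digit runs, and all three need tuning against your own traffic. The function accepts extra keyword arguments so it can be handed to an SDK masking hook that passes them; check your SDK's documentation for the exact signature it expects.

```python
import re

# Illustrative patterns only; tune them for your locale and data before relying on them.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")   # 13-16 digits, optional separators
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")     # loose; over-matches long digit runs

def mask_pii(data, **kwargs):
    """Recursively redact emails, card numbers, and phone numbers from a span payload."""
    if isinstance(data, str):
        data = EMAIL_RE.sub("[EMAIL]", data)
        data = CARD_RE.sub("[CARD]", data)    # cards first, so the phone pattern cannot eat them
        data = PHONE_RE.sub("[PHONE]", data)
        return data
    if isinstance(data, dict):
        return {key: mask_pii(value) for key, value in data.items()}
    if isinstance(data, (list, tuple)):
        return [mask_pii(item) for item in data]
    return data  # numbers, booleans, None pass through unchanged

# Hypothetical span input, for illustration.
span_input = {
    "messages": [
        "Reach me at jane.doe@example.com or +1 (415) 555-0100.",
        "My card is 4111 1111 1111 1111.",
    ],
    "turn": 3,
}
masked = mask_pii(span_input)
print(masked)
```

Masking before ingestion means the raw values never reach the trace store, which is also what makes the retention and access-control tier simpler to reason about.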
How much does agent observability cost at scale?
Self-hosted Langfuse costs only infrastructure: roughly $30-80/month for a VM that handles hundreds of thousands of traces monthly, flat regardless of volume. Hosted platforms scale with traffic: LangSmith runs about $2.50 per 1,000 traces beyond plan quotas, so an agent doing 500K traces/month costs over $1,200/month in overages alone. The crossover point where self-hosting wins is typically 50K-100K traces/month, assuming you can spare a few hours a month for maintenance.
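The arithmetic behind those figures can be sketched as follows. The $2.50 overage rate and the $30-80 infrastructure range come from the answer above; the maintenance hours and hourly rate are hypothetical placeholders, and included plan quotas and seat fees are ignored for simplicity, so treat the output as a rough estimate rather than a quote.

```python
OVERAGE_USD_PER_1K_TRACES = 2.50  # hosted overage rate quoted above

def hosted_overage_cost(traces_per_month):
    """Monthly hosted cost if every trace were billed at the overage rate."""
    return traces_per_month / 1000 * OVERAGE_USD_PER_1K_TRACES

def self_hosted_cost(infra_usd, maintenance_hours, hourly_rate_usd):
    """Monthly self-hosted cost: flat infrastructure plus engineer time."""
    return infra_usd + maintenance_hours * hourly_rate_usd

def break_even_traces(infra_usd, maintenance_hours, hourly_rate_usd):
    """Trace volume at which hosted overages equal the self-hosted cost."""
    monthly = self_hosted_cost(infra_usd, maintenance_hours, hourly_rate_usd)
    return monthly / OVERAGE_USD_PER_1K_TRACES * 1000

# Hypothetical maintenance load: 2 hours/month at $50/hour.
low = break_even_traces(infra_usd=30, maintenance_hours=2, hourly_rate_usd=50)
high = break_even_traces(infra_usd=80, maintenance_hours=2, hourly_rate_usd=50)

print(f"500K traces hosted: ${hosted_overage_cost(500_000):,.0f}/month")
print(f"break-even: {low:,.0f} to {high:,.0f} traces/month")
```

Under these assumptions 500K traces comes to $1,250/month and the break-even falls between 52,000 and 72,000 traces/month, inside the 50K-100K range. A higher hourly rate or more maintenance time pushes the crossover up.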
What is the difference between LLM tracing and LLM evals?
Tracing records what happened: the inputs, outputs, latency, cost, and errors of every step in a conversation. Evals judge whether what happened was good: scoring outputs for correctness, groundedness, and helpfulness, either against a golden dataset before deploys or on sampled live traffic via LLM-as-judge. They are complementary and connected - your best eval dataset comes from curating real production traces, and eval scores written back onto traces give you a quality time series to alert on.
