AI Agent Learning Path
A staged roadmap for going from beginner concepts to useful business agents and client-ready automations.
What is an eval set
An eval set is a list of cases your model is supposed to handle, with a definition of what "correct" looks like for each one. Think of it as the unit-test suite for an LLM-shaped system, except the assertions are messier and the failures are more interesting.
The simplest version is a CSV with two columns: input and expected behavior. In practice you'll grow it to four or five columns and a couple of hundred rows. The real value is that you can run it on every change and get a number that tells you whether the change is an improvement or a regression.
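A minimal sketch of that two-column version, to show where the number comes from. Everything named here is an assumption for illustration: the column names `input` and `expected`, the `toy_agent` stand-in, and the substring-based `substring_judge`. A real judge is usually messier than a substring match.

```python
import csv
import io
from typing import Callable


def run_eval(csv_text: str,
             agent: Callable[[str], str],
             judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of cases the agent passes (0.0 to 1.0)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("eval set is empty")
    passed = sum(judge(agent(row["input"]), row["expected"]) for row in rows)
    return passed / len(rows)


# Hypothetical two-row eval set: input, expected behavior.
CASES = """input,expected
What is your refund window?,30 days
Do you ship to Canada?,yes
"""


def toy_agent(query: str) -> str:
    # Stand-in for the real agent call (assumption, not a real API).
    if "refund" in query:
        return "Refunds are accepted within 30 days."
    return "I am not sure."


def substring_judge(output: str, expected: str) -> bool:
    # Crude check: the expected text must appear in the output.
    return expected.lower() in output.lower()


score = run_eval(CASES, toy_agent, substring_judge)
print(f"pass rate: {score:.0%}")
```

Run it before and after each change; the difference between the two pass rates is the signal you are after.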
Why they matter
Three reasons that turn into one:
- Models change. Either you change them, or the provider does. Without an eval, you have no idea whether yesterday's prompt still works on today's model.
- Prompts drift. A "small tweak" that helps one case often hurts five you forgot about. You'll find out at 2am from a customer.
- Vibes don't scale. Reading 20 outputs and going "yeah that's better" stops working when you have 200,000 outputs.
An eval set turns subjective judgement into a number. The number can be wrong; it can be a low-quality proxy for what you actually care about. But "I have a number" is a strictly better state than "I have a vibe".
How to build one
Start with 20 cases. Not 200. Twenty. Pick them from real traffic if you have any; if not, pick them from the most common questions you expect to get. Write the expected behavior in plain English for each.
Then run it. The number will be bad. Improve the agent until the number is good. Then add 20 more cases that the agent now fails on. Repeat until your eval set is between 100 and 500 cases.
If you're tempted to make it bigger than 500, you're modeling something you should be measuring in production instead.
The four columns
Once you graduate past the toy version, the schema we use looks like this:
- query · the input.
- must_cite · if the agent should ground its answer in a specific doc, name it.
- must_not_say · the thing the agent must not invent. This column is the unlock.
- refusal_ok · can the agent legitimately refuse this query? Sometimes yes; sometimes the refusal is the failure.
The third column, must_not_say, is what catches the hallucinations you actually fear. If you can't write down what the agent shouldn't say, you don't yet understand the failure mode.
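The schema above can be sketched as a small record plus a checker. This is an illustration under assumptions, not the course's implementation: `EvalCase`, `check_case`, the `cited_docs` and `refused` arguments, and the example file name are all hypothetical, and the substring test for must_not_say is the crudest possible stand-in for whatever matching you actually use. It needs Python 3.9 or later for the `list[str]` annotations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalCase:
    query: str                    # the input
    must_cite: Optional[str]      # doc the answer should be grounded in, if any
    must_not_say: Optional[str]   # the claim the agent must not invent
    refusal_ok: bool              # is refusing a legitimate outcome?


def check_case(case: EvalCase, answer: str,
               cited_docs: list[str], refused: bool) -> list[str]:
    """Return a list of failure reasons; an empty list means the case passed."""
    failures: list[str] = []
    if refused:
        # A refusal is only a failure when the case does not allow it.
        if not case.refusal_ok:
            failures.append("refused a query it should have answered")
        return failures
    if case.must_cite and case.must_cite not in cited_docs:
        failures.append(f"missing citation: {case.must_cite}")
    if case.must_not_say and case.must_not_say.lower() in answer.lower():
        failures.append(f"said forbidden text: {case.must_not_say}")
    return failures


case = EvalCase(
    query="Can I get a refund after 60 days?",
    must_cite="refund-policy.md",   # hypothetical doc name
    must_not_say="90 days",
    refusal_ok=False,
)
failures = check_case(case, "Yes, refunds are available for 90 days.",
                      cited_docs=[], refused=False)
print(failures)
```

Returning reasons instead of a bare pass/fail keeps the failures readable when you review a run.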
Confidence intervals
An eval score is a sample. A 2-point improvement that fits inside the confidence interval is noise. Bootstrap the score; report the mean and the 95% CI. Refuse to merge anything where the new lower bound does not clear the previous mean.
The math is in module 08. The summary: resample the per-case results with replacement 200 times, score each resample, and take the 2.5th and 97.5th percentiles of those scores. That's your CI. Add it to the PR comment template and stop arguing about whether things are better.
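A minimal sketch of the bootstrap described above, assuming each case yields a 1 (pass) or 0 (fail) and that each of the 200 rounds resamples those per-case results with replacement. The names `bootstrap_ci` and `should_merge` are hypothetical, and the percentiles use a simple nearest-rank lookup rather than interpolation.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(results: Sequence[float], n_resamples: int = 200,
                 seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for a 95% percentile bootstrap CI."""
    if not results:
        raise ValueError("no eval results to resample")
    rng = random.Random(seed)  # fixed seed so the PR comment is reproducible
    n = len(results)
    # Each resample draws n cases with replacement and scores them.
    means = sorted(
        sum(rng.choices(results, k=n)) / n
        for _ in range(n_resamples)
    )
    # Nearest-rank positions of the 2.5th and 97.5th percentiles.
    lower = means[round(0.025 * (n_resamples - 1))]
    upper = means[round(0.975 * (n_resamples - 1))]
    return sum(results) / n, lower, upper


def should_merge(new_lower: float, previous_mean: float) -> bool:
    # Merge gate from the text: the new lower bound must clear the old mean.
    return new_lower > previous_mean


# Hypothetical run: 80 passes and 20 failures out of 100 cases.
new_results = [1] * 80 + [0] * 20
mean, lower, upper = bootstrap_ci(new_results)
print(f"score {mean:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
print("merge" if should_merge(lower, previous_mean=0.70) else "hold")
```

With only 100 cases the interval is wide, which is the point: small eval sets cannot resolve small improvements.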
What to read next
The eval set is a prerequisite for everything else, so pick what's painful for you: