Workflow Architecture · Lesson 2 of 16

State And Handoffs

How agents remember work across steps, resume after a failure, and hand off to a person without losing context.

State is what makes an agent resumable

A single model call is stateless. A production agent almost never is. It runs multiple steps, may pause for approval, and must survive a crash without redoing irreversible actions. State is the record of where a run is and what has already happened, and getting it right is what separates a robust workflow from one that corrupts data on the first restart.

What to persist

  • Run inputs and identifiers: the original trigger payload and a stable run ID so you can find and replay any execution.
  • Step results: the output of each completed step, so a resumed run does not repeat a step that already ran.
  • Pending actions: anything waiting on approval, with enough context that a human can decide days later.
  • Status: running, waiting, failed, or done, so the system knows whether to resume, escalate, or leave it alone.
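The four items above map naturally onto a single record per run. The following is a minimal sketch, not a prescribed schema: the names RunState, save_state, and load_state are hypothetical, and a JSON file per run stands in for whatever durable store the system actually uses.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum
from pathlib import Path
from typing import Any


class Status(str, Enum):
    RUNNING = "running"
    WAITING = "waiting"
    FAILED = "failed"
    DONE = "done"


@dataclass
class RunState:
    """Hypothetical per-run record covering the four items in the list above."""
    run_id: str                                   # stable ID for finding and replaying the run
    inputs: dict[str, Any]                        # original trigger payload
    status: Status = Status.RUNNING
    step_results: dict[str, Any] = field(default_factory=dict)       # step name -> output
    pending_actions: list[dict[str, Any]] = field(default_factory=list)  # awaiting approval


def save_state(state: RunState, directory: Path) -> Path:
    """Write to a temp file, then rename, so a crash mid-write leaves the old file intact."""
    path = directory / f"{state.run_id}.json"
    tmp = directory / f"{state.run_id}.json.tmp"
    data = asdict(state)
    data["status"] = state.status.value           # store the plain string, not the Enum
    tmp.write_text(json.dumps(data))
    tmp.replace(path)
    return path


def load_state(run_id: str, directory: Path) -> RunState:
    """Read a saved run back into a RunState."""
    data = json.loads((directory / f"{run_id}.json").read_text())
    data["status"] = Status(data["status"])
    return RunState(**data)
```

Saving after every completed step, rather than only at the end, is what lets a restarted process see which steps already ran.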

Idempotency is not optional

If a run crashes after sending an email but before recording that the email was sent, a naive resume sends it twice. Give every external action an idempotency key derived from the run and step, so a retried action is recognized and skipped rather than duplicated. Assume any step can be attempted more than once and design so that a second attempt is safe.
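As an illustration of the email scenario, here is a minimal sketch of a key derived from the run and step. FakeEmailProvider is a hypothetical stand-in for an external service that deduplicates on a caller-supplied key; whether and how a real provider accepts such a key varies, so treat that part as an assumption.

```python
import hashlib


def idempotency_key(run_id: str, step: str) -> str:
    """The same run and step always produce the same key, so a retry is recognizable."""
    return hashlib.sha256(f"{run_id}:{step}".encode()).hexdigest()


class FakeEmailProvider:
    """Hypothetical service that remembers keys it has already processed."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []     # messages actually delivered
        self._seen: dict[str, str] = {}           # idempotency key -> message id

    def send(self, to: str, body: str, key: str) -> str:
        if key in self._seen:
            return self._seen[key]                # retry: return the earlier result, send nothing
        self.sent.append((to, body))
        message_id = f"msg-{len(self.sent)}"
        self._seen[key] = message_id
        return message_id


provider = FakeEmailProvider()
key = idempotency_key("run-1", "send_welcome_email")

first = provider.send("a@example.com", "Welcome", key)
# Simulate a crash before the run recorded the send, followed by a retry of the same step.
second = provider.send("a@example.com", "Welcome", key)
```

The key is checked on the receiving side because the run's own record is exactly what was lost in the crash. A ledger kept only by the agent cannot close that window by itself.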

Handoffs to people

A handoff is a controlled transfer of a run from the agent to a human, usually for approval or because the agent hit an escalation trigger. A good handoff carries three things: the original input, what the agent did and why it stopped, and the specific decision needed. The human should be able to act in seconds without opening five tabs. When they respond, the run resumes from its saved state rather than starting over.
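The three parts of a handoff can be sketched as a small record built from state that is already saved. This is an illustrative sketch only: Handoff, request_approval, and resume_with_decision are hypothetical names, and a plain dict stands in for the persisted run state.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Handoff:
    run_id: str
    original_input: dict[str, Any]    # 1. the original input
    actions_taken: list[str]          # 2a. what the agent did
    stop_reason: str                  # 2b. why it stopped
    decision_needed: str              # 3. the specific decision needed
    options: list[str] = field(default_factory=lambda: ["approve", "reject"])


def request_approval(state: dict[str, Any], stop_reason: str, decision_needed: str) -> Handoff:
    """Mark the run as waiting and build the handoff from its saved state."""
    handoff = Handoff(
        run_id=state["run_id"],
        original_input=state["inputs"],
        actions_taken=list(state["step_results"]),   # completed step names, in order
        stop_reason=stop_reason,
        decision_needed=decision_needed,
    )
    state["status"] = "waiting"
    return handoff


def resume_with_decision(state: dict[str, Any], handoff: Handoff, decision: str) -> dict[str, Any]:
    """Record the human's answer as a step result and continue from saved state."""
    if decision not in handoff.options:
        raise ValueError(f"decision must be one of {handoff.options}, got {decision!r}")
    state["step_results"]["approval"] = decision
    state["status"] = "running"
    return state


state = {
    "run_id": "run-7",
    "inputs": {"customer": "c-19", "refund": 120},
    "step_results": {"classify": "refund_request", "draft": "Refund of 120 prepared"},
    "status": "running",
}
handoff = request_approval(state, "amount exceeds auto-approve limit", "Approve refund of 120?")
resume_with_decision(state, handoff, "approve")
```

Earlier step results are untouched by the resume, so the run continues from where it paused rather than starting over.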

Handoffs between agents

Larger systems split work across specialized agents: one classifies, another drafts, another checks policy. Each handoff is a boundary where you should validate the payload and log the transfer, because a malformed object passed silently between agents is a common source of confusing failures. Keep the contract between agents explicit, like an API.
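One way to make the contract explicit is to parse the payload into a typed object at the boundary and log the transfer. The sketch below assumes a classifier handing off to a drafting agent; the field names, the category set, and the function names are all hypothetical.

```python
import logging
from dataclasses import dataclass
from typing import Any

logger = logging.getLogger("handoff")

ALLOWED_CATEGORIES = {"billing", "technical", "other"}   # hypothetical label set
REQUIRED_FIELDS = ("run_id", "category", "text")


@dataclass(frozen=True)
class ClassifiedTicket:
    """The contract between the classifying agent and the drafting agent."""
    run_id: str
    category: str
    text: str


def parse_classified_ticket(payload: dict[str, Any]) -> ClassifiedTicket:
    """Reject a malformed payload at the boundary instead of passing it along silently."""
    missing = set(REQUIRED_FIELDS) - set(payload)
    if missing:
        raise ValueError(f"handoff payload missing fields: {sorted(missing)}")
    for name in REQUIRED_FIELDS:
        if not isinstance(payload[name], str) or not payload[name]:
            raise ValueError(f"field {name!r} must be a non-empty string")
    if payload["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {payload['category']!r}")
    return ClassifiedTicket(payload["run_id"], payload["category"], payload["text"])


def hand_off(payload: dict[str, Any], source: str, target: str) -> ClassifiedTicket:
    """Validate first, then log the transfer so every handoff leaves a trace."""
    ticket = parse_classified_ticket(payload)
    logger.info("handoff %s -> %s run=%s category=%s",
                source, target, ticket.run_id, ticket.category)
    return ticket
```

Calling hand_off with a payload that lacks "text" raises ValueError at the boundary, which points at the sending agent rather than at whichever downstream step first trips over the gap.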

What good looks like

You can kill the process mid-run, restart it, and the workflow resumes correctly without duplicating any action. A run waiting on approval can sit for days and still resume with full context. Every handoff, to a person or another agent, is logged and validated.
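The kill-and-restart check can be rehearsed in a few lines. This sketch assumes a hypothetical run_workflow that checkpoints to a JSON file after each step; the crash is simulated with an exception raised right after a checkpoint.

```python
import json
import tempfile
from pathlib import Path


def run_workflow(run_id, steps, state_dir, crash_after=None):
    """Run steps in order, skipping any whose result is already in the saved state."""
    path = state_dir / f"{run_id}.json"
    if path.exists():
        state = json.loads(path.read_text())          # resume an earlier attempt
    else:
        state = {"run_id": run_id, "status": "running", "step_results": {}}
    for name, fn in steps:
        if name in state["step_results"]:
            continue                                   # completed before the restart
        state["step_results"][name] = fn()
        path.write_text(json.dumps(state))            # checkpoint before moving on
        if name == crash_after:
            raise RuntimeError("simulated crash")
    state["status"] = "done"
    path.write_text(json.dumps(state))
    return state


calls = []                                            # records every real step execution


def make_step(name, result):
    def step():
        calls.append(name)
        return result
    return name, step


steps = [
    make_step("classify", "billing"),
    make_step("draft", "Dear customer"),
    make_step("send", "sent"),
]
state_dir = Path(tempfile.mkdtemp())

try:
    run_workflow("run-1", steps, state_dir, crash_after="draft")
except RuntimeError:
    pass                                              # the process "died" here

final = run_workflow("run-1", steps, state_dir)       # restart with the same run ID
```

After the restart, calls contains each step name exactly once. Note that the simulated crash lands after a checkpoint; a crash between an action and its checkpoint is the case idempotency keys exist for.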

Common mistakes

  • Keeping state only in memory, so a crash loses the run or replays actions.
  • Handing off with no context, forcing the human to reconstruct what happened.
  • Skipping idempotency keys, then duplicating emails, charges, or records on retry.