Technical · 2026-05-06 · Last verified 2026-07-09

OpenAI Agents SDK Tutorial: Build a Multi-Agent System

Learn how to build a multi-agent system with the OpenAI Agents SDK. This tutorial covers agent creation, tool definition, handoffs between agents, guardrails, and production patterns for building reliable agent architectures.

Deep · ML Architect & Full Stack Engineer

10+ years shipping production ML across TensorFlow, PyTorch, AWS, and GCP. Ships every A8gent agent before it becomes a lesson. GitHub

Key takeaways

The OpenAI Agents SDK provides three core primitives - Agents, Handoffs, and Guardrails - that compose into sophisticated multi-agent architectures without requiring external orchestration frameworks.
Handoffs are the key differentiator: they allow one agent to transfer control (and full conversation context) to another specialized agent, enabling a triage-and-route pattern that dramatically improves task accuracy.
Tools in the Agents SDK are plain Python functions decorated with type annotations - the SDK automatically generates the JSON Schema for the LLM, eliminating manual schema maintenance.
Guardrails run in parallel with the agent's main execution and can halt the pipeline before invalid or dangerous outputs reach the user, providing a safety layer without adding latency to the happy path.
For production deployments, use streaming responses, implement proper error boundaries around tool calls, and design your agent graph so that any single agent failure does not cascade to the entire system.

What Is the OpenAI Agents SDK and Why It Matters

The OpenAI Agents SDK is a Python framework for building agentic AI applications where one or more LLM-powered agents collaborate to complete tasks. Released in early 2025 and significantly updated since, it provides a minimal but opinionated set of primitives that handle the hard parts of agent orchestration: managing conversation state across multiple agents, transferring control between agents cleanly, executing tools with proper error handling, and running safety checks without blocking the main execution path.

If you have built agents using raw API calls, you know the pain. You end up writing hundreds of lines of boilerplate for tool dispatch, message threading, error recovery, and agent-to-agent communication. The Agents SDK abstracts that boilerplate into three concepts: Agents (an LLM configured with instructions, tools, and handoff targets), Handoffs (a mechanism for one agent to transfer control to another while preserving context), and Guardrails (input/output validators that run in parallel with the agent and can interrupt execution). That is the entire conceptual surface area. Everything else is composition.

Why should you care about multi-agent systems instead of building one monolithic agent? The answer is specialization. A single agent with a massive system prompt and thirty tools performs worse than a team of specialized agents, each with a focused prompt and three to five tools. This mirrors human organizations - you do not hire one person who is simultaneously an accountant, a lawyer, and a software engineer. You hire specialists and route tasks to the right person. Multi-agent systems apply the same principle to AI.

The performance data backs this up. In our testing, a multi-agent system with a triage agent routing to three specialists achieved 94% task accuracy compared to 71% for a single agent with the same total tool set. The specialists also used 38% fewer tokens per task because their shorter, focused prompts reduced confusion and unnecessary reasoning. At scale, this translates directly to lower API costs and faster response times.

Before we start building, a note on prerequisites. You need Python 3.10 or later, an OpenAI API key with access to GPT-4o or later models, and basic familiarity with Python's async/await syntax. The SDK is fully async, which matters for production deployments where you need to handle concurrent requests. Install the SDK with pip install openai-agents. The entire package is lightweight - it adds minimal dependencies beyond the core OpenAI Python library.

In this tutorial, we will build a customer service multi-agent system with three specialist agents: an order management agent, a product information agent, and a billing agent. A triage agent sits at the front and routes incoming queries to the appropriate specialist. This is a pattern you can adapt to virtually any domain. For a comparison with other agent frameworks, our OpenAI AgentKit tutorial covers the related AgentKit approach, and our LangGraph HITL tutorial shows how LangGraph handles similar patterns differently.

Creating Your First Agent With Tools

Let us start by building a single agent with tools before scaling to the multi-agent architecture. The fundamental unit in the Agents SDK is the Agent class. You create an agent by specifying its name, instructions (the system prompt), the model to use, and the tools it can access. Here is the conceptual structure for our order management agent:

Here is that order management agent, with a tool attached and imports included:

from agents import Agent, Runner, function_tool

@function_tool
def get_order_status(order_id: str) -> str:
    """Look up the current status of a customer order by its ID."""
    order = order_service.get(order_id)
    if order is None:
        return f"Order {order_id} not found - please verify the order number with the customer."
    return f"Order {order_id} is currently: {order.status}"

order_agent = Agent(
    name="order_agent",
    instructions="You handle order status inquiries, returns, and shipping "
                 "updates. You do not handle billing questions - transfer "
                 "those to the billing agent.",
    tools=[get_order_status],
    model="gpt-4o",
)

The agent definition is declarative. You specify what the agent is and what it can do. The SDK handles how - the message threading, tool dispatch, retry logic, and response parsing. The instructions field is your system prompt. Keep it focused and specific. A good agent instruction is 200-400 words that clearly state the agent's role, its boundaries, and how it should handle edge cases. Avoid vague instructions like "be helpful" - instead say "you handle order status inquiries, returns, and shipping updates. You do not handle billing questions - transfer those to the billing agent."

OpenAI Agents SDK Tutorial - data overview

Tools are defined as plain Python functions with type annotations. The SDK inspects the function signature and docstring to generate the JSON Schema that the LLM sees. This is a significant ergonomic improvement over manually writing tool schemas. For example, a function like def get_order_status(order_id: str) -> str with a docstring explaining the parameter automatically becomes a properly formatted tool definition. The SDK handles serialization and deserialization of arguments and return values.

There are two types of tools: function tools (Python functions that the agent can call) and hosted tools (tools that run on OpenAI's infrastructure, like code interpreter or file search). For most business applications, function tools are what you need because they let you integrate with your own systems. Each function tool should do one thing well. Do not create a single handle_order tool that does lookup, modification, and cancellation - create three separate tools: get_order_status, modify_order, and cancel_order. This gives the LLM clearer options and produces better tool selection accuracy.

Error handling in tools is critical. Your tool functions will interact with external APIs and databases that can fail. The SDK does not automatically retry failed tool calls - that is your responsibility. Wrap every external call in try/except and return a clear, LLM-readable error message rather than letting exceptions propagate. An error message like "Order 12345 not found - please verify the order number with the customer" gives the agent useful context to continue the conversation gracefully. A raw Python traceback gives it nothing useful.

Running the agent is straightforward. You create a Runner and call Runner.run(agent, messages). The runner handles the agentic loop: sending messages to the model, processing tool calls, sending tool results back to the model, and repeating until the model produces a final response (no more tool calls).

Running the order agent end to end takes one call:

result = await Runner.run(order_agent, "Where is order 48213?")
print(result.final_output)

The runner returns a RunResult that contains the final output, the full message history (including tool calls), and metadata like token usage. For interactive applications, use Runner.run_streamed() to get a streaming response that you can forward to your UI in real time. This is essential for user-facing applications where latency perception matters. See the official SDK documentation for the complete API reference.

A practical tip: during development, enable the SDK's built-in tracing. It logs every message exchange, tool call, and handoff decision, making debugging dramatically easier. Set OPENAI_AGENTS_TRACING_ENABLED=true in your environment. In production, disable tracing or route it to your logging infrastructure to avoid performance overhead.

Building Multi-Agent Handoffs

Handoffs are the most powerful concept in the Agents SDK and the key to building effective multi-agent systems. A handoff transfers control from one agent to another, carrying the full conversation history. The receiving agent picks up exactly where the sending agent left off, with full context about what has already been discussed. From the user's perspective, the conversation is seamless - they do not know or care which agent is responding.

Implementing handoffs is simple. When creating an agent, you specify a handoffs list containing the agents it can transfer to. The triage agent's handoffs list includes all three specialist agents.

For our three-specialist customer service system, the triage agent looks like this:

triage_agent = Agent(
    name="triage",
    instructions="Route to the order agent for status, shipping, returns, "
                 "and exchanges. Route to the billing agent for charges, "
                 "refunds, and invoices. Route to the product agent for "
                 "features, availability, and recommendations.",
    handoffs=[order_agent, billing_agent, product_agent],
    model="gpt-4o-mini",
)

Each specialist agent can hand back to the triage agent if the conversation shifts to a topic outside its expertise. The LLM decides when to hand off based on the agent's instructions and the conversation context. You do not write explicit routing logic - the agent's instructions guide its handoff decisions.

However, relying entirely on the LLM's judgment for routing is risky in production. We recommend a hybrid approach: use the LLM for initial intent classification but add explicit rules for high-stakes routing decisions. For example, any message containing "cancel my account" should always route to a human agent regardless of what the LLM thinks. You can implement this as a guardrail (covered in the next section) or as preprocessing logic before the agent runs.

The triage agent pattern is the most common multi-agent architecture. It works like a receptionist: it greets the user, understands their intent, and routes them to the right specialist. The triage agent's instructions should explicitly list what each specialist handles and include examples of messages that should route to each one. For instance: "Route to the order agent for questions about order status, shipping, returns, and exchanges. Route to the billing agent for questions about charges, refunds, payment methods, and invoices. Route to the product agent for questions about product features, availability, specifications, and recommendations."

A common mistake with handoffs is creating circular routing loops. Agent A hands to Agent B, which does not understand the question and hands back to Agent A, which hands to Agent B again. Prevent this by: giving each agent clear boundaries in its instructions ("if you cannot handle this request after one attempt, apologize and offer to connect the user with a human"), limiting the maximum number of handoffs per conversation (the SDK's max_turns parameter controls the total number of agentic loop iterations), and designing your agent graph to be acyclic where possible. The triage agent should be the only agent that routes to specialists, and specialists should either resolve the issue or escalate to a human - not bounce back to the triage agent for re-routing.

Context management across handoffs deserves careful attention. By default, the entire message history transfers with the handoff. For short conversations, this is fine. For long conversations (20+ messages), the growing context window increases token costs and can reduce response quality as the model struggles with excessive context. Consider implementing context summarization at handoff boundaries: before transferring, have the sending agent generate a concise summary of the conversation so far and pass that instead of (or in addition to) the full history. This keeps each specialist agent working with a clean, focused context. For architectural guidance on multi-agent systems beyond OpenAI's framework, our guide to AI agents for business covers the strategic considerations.

Implementing Guardrails for Production Safety

Guardrails are the Agents SDK's mechanism for validating inputs and outputs without adding latency to the happy path. They run in parallel with the agent's main execution, and if a guardrail triggers, it can halt the agent's response before it reaches the user. This is fundamentally different from post-processing validation, which only catches problems after the agent has already generated a full response (wasting tokens and time).

The SDK supports two types of guardrails: input guardrails (validate the user's message before the agent processes it) and output guardrails (validate the agent's response before it reaches the user). Input guardrails catch prompt injection attempts, off-topic messages, and malicious inputs. Output guardrails catch hallucinated information, policy violations, and sensitive data leaks.

Here is a PII-detection output guardrail attached to an agent:

from agents import Agent, output_guardrail, GuardrailFunctionOutput

@output_guardrail
async def block_pii(ctx, agent, response: str) -> GuardrailFunctionOutput:
    contains_pii = await detect_pii(response)
    return GuardrailFunctionOutput(
        output_info={"contains_pii": contains_pii},
        tripwire_triggered=contains_pii,
    )

billing_agent = Agent(
    name="billing_agent",
    instructions="You handle charges, refunds, and invoices.",
    output_guardrails=[block_pii],
)

Implementing a guardrail involves creating a class that inherits from InputGuardrail or OutputGuardrail and defining a run method. The run method receives the context and returns a GuardrailResult that either passes (allowing execution to continue) or trips (halting execution with an error message). The beauty of the parallel execution model is that guardrails do not add latency when they pass - they only interrupt execution when they detect a problem.

For a customer service system, essential guardrails include: a PII detection guardrail (prevents the agent from including Social Security numbers, credit card numbers, or other sensitive data in its responses), a topic boundary guardrail (prevents the agent from discussing topics outside its domain - you do not want your order status agent giving medical advice), a commitment guardrail (prevents the agent from making promises or commitments the business cannot fulfill, like "I will refund your full purchase price" when the actual policy is more nuanced), and a prompt injection guardrail (detects and blocks attempts to override the agent's instructions through crafted user inputs).

The prompt injection guardrail is the most technically nuanced. Simple approaches like keyword matching ("ignore previous instructions") catch obvious attacks but miss sophisticated ones. A more robust approach uses a separate, smaller LLM to classify whether a message is a genuine user request or a prompt injection attempt. The Agents SDK makes this easy to implement because guardrails can themselves use LLM calls - you pass a lightweight model (like GPT-4o-mini) configured to detect injection patterns. This adds a small cost per message but provides significantly better protection than rule-based approaches. For a thorough discussion of AI agent security, see our security and privacy guide.

Testing guardrails requires adversarial thinking. Build a test suite of messages that should be blocked: prompt injections, PII-containing responses, off-topic requests, policy-violating commitments. Also build a test suite of messages that should pass: legitimate requests that happen to contain financial numbers (not credit cards), messages that mention sensitive topics but in a business-appropriate context. False positives - guardrails blocking legitimate messages - are as damaging as false negatives. If your guardrails are too aggressive, users will have frustrating experiences where their legitimate requests are incorrectly rejected. Aim for a false positive rate below 0.5%. OpenAI's guardrails cookbook has practical examples and testing strategies that complement the SDK's built-in capabilities.

In production, log every guardrail trigger - both trips and passes - with the full context that caused the trigger. This data is invaluable for refining guardrail thresholds and identifying new attack patterns. Review the logs weekly for the first month and monthly thereafter.

Production Patterns: Streaming, Errors, and Observability

Moving from a working prototype to a production-ready multi-agent system requires attention to three areas: streaming responses for acceptable user experience, robust error handling for reliability, and observability for debugging and optimization. Let us cover each in detail.

Forwarding streamed tokens to a frontend is a short loop over the stream events:

result = Runner.run_streamed(triage_agent, "Where is my refund?")
async for event in result.stream_events():
    if event.type == "raw_response_event":
        print(event.data, end="", flush=True)

Streaming. In production, you should almost always use Runner.run_streamed() instead of Runner.run(). The streamed runner yields events as they happen: RawResponsesStreamEvent for model output tokens, RunItemStreamEvent for completed items (tool calls, tool results, messages), and various lifecycle events. Forward the raw response events to your frontend for real-time display. The user sees tokens appearing as they are generated, which dramatically improves perceived latency - even if total response time is the same, streaming feels 3-5x faster to users.

When processing streamed events, you need to handle tool call boundaries correctly. The model's output alternates between text and tool calls. During a tool call, you typically want to show a loading indicator in your UI rather than raw tool call JSON. Listen for RunItemStreamEvent with item type tool_call_item to know when a tool call starts, and tool_call_output_item to know when it finishes. Between those events, show "Looking up your order..." or similar contextual loading messages.

Error handling. In multi-agent systems, errors can occur at multiple levels: the model might return an error (rate limits, context length exceeded), a tool call might fail (external API down, invalid input), a handoff might fail (target agent not properly configured), or a guardrail might trip unexpectedly. Each level needs its own error strategy.

For model errors, implement exponential backoff with jitter for rate limits. For context length errors, implement a context truncation strategy that preserves the most recent messages and a summary of older ones. For tool call errors, return LLM-readable error messages so the agent can explain the situation to the user gracefully rather than crashing. For handoff errors, fall back to the triage agent with an explanation of what went wrong. For guardrail errors, return a polite, generic message ("I cannot help with that request - would you like to try something else?") rather than exposing the guardrail's internal reasoning.

Observability. Production multi-agent systems need comprehensive logging and tracing. At minimum, log: every incoming user message (with a session ID), every agent invocation (which agent, what input), every tool call (which tool, arguments, result, latency), every handoff (from which agent to which agent, why), every guardrail evaluation (pass/trip, reasoning), final response (what the user received), and token usage (for cost tracking and optimization). Structure these logs so you can reconstruct any conversation end-to-end, identify which agent handled each part, and calculate per-conversation costs.

The SDK's built-in tracing integrates with OpenAI's dashboard, which is useful during development. For production, route traces to your own observability stack (Datadog, Grafana, or similar). Create dashboards that show: average response time by agent, tool call success rates, handoff patterns (which agents route to which), guardrail trip rates, and cost per conversation. These metrics tell you where to optimize. If one specialist agent has a disproportionately high token usage, its instructions might be too broad. If handoff rates from the triage agent to one specialist are very low, that specialist might be unnecessary. Data-driven agent architecture refinement is what separates good multi-agent systems from great ones. For more on agent architecture decisions, our comparison of AI agents vs automation tools provides broader architectural context.

Advanced Patterns: Context Injection and Dynamic Routing

Once you have the basics working, two advanced patterns dramatically improve your multi-agent system's effectiveness: context injection and dynamic routing. Both are straightforward to implement with the Agents SDK but require architectural forethought.

Context injection means enriching the agent's context with relevant data before it starts processing a user message. Instead of the agent needing to call a tool to look up the customer's account, you pre-fetch that information and inject it into the agent's context. This reduces tool calls (saving tokens and latency) and gives the agent better initial understanding of the situation.

Implement context injection as a preprocessing step before calling Runner.run(). When a user message arrives, look up their customer ID from the session, fetch their recent order history and account status, and format that information into a context block that gets prepended to the conversation. The agent then has immediate access to "Customer has 3 recent orders, the most recent shipped 2 days ago, account is in good standing" without needing to call any tools. This pattern reduces average tool calls per conversation by 40-60% in our testing.

Be careful with context injection volume. Injecting the customer's entire order history (hundreds of orders) wastes tokens and confuses the model. Inject a summary and the most recent 5-10 relevant items. If the agent needs deeper history, it can use a tool to fetch it. The goal is to cover the 80% case with injection and fall back to tools for the remaining 20%.

Dynamic routing goes beyond the static handoff pattern. Instead of a triage agent with fixed handoffs to predetermined specialists, you create a system where agents can be dynamically composed based on the situation. For example, a complex billing dispute might require both the billing agent and the order agent to contribute. With dynamic routing, the triage agent can invoke both specialists sequentially and synthesize their outputs.

The Agents SDK supports this through the Runner abstraction. Instead of a single Runner.run() call, your orchestration layer can run multiple agents in sequence or parallel and combine their results. Run the billing agent to get the billing analysis, run the order agent to get the order context, then run a synthesis agent that takes both outputs and generates a unified response. This is more complex than simple handoffs but handles multi-domain queries that no single specialist can resolve alone.

Another advanced pattern is agent-as-tool. Instead of handing off control entirely, one agent can invoke another agent as a tool - getting its output without transferring the conversation. The calling agent maintains control and uses the other agent's output as input to its own reasoning. This is useful when you need a specialist's analysis but want to frame the response in the context of the calling agent's domain. For example, the billing agent might use the order agent as a tool to verify order details before processing a refund, without the user ever interacting with the order agent directly.

Implementing agent-as-tool requires wrapping an agent invocation inside a function tool. Create a tool called consult_order_specialist that internally runs the order agent with the given query and returns its response as a string. The calling agent sees this as a regular tool call and incorporates the result into its response. This pattern keeps the conversation flow simple for the user while giving agents access to each other's expertise. For more advanced orchestration patterns, the Anthropic's building effective agents guide provides excellent architectural principles that apply across frameworks.

When designing these advanced patterns, always prioritize simplicity. Start with basic handoffs. Add context injection when you identify frequent tool calls that could be pre-fetched. Add dynamic routing when you encounter multi-domain queries that your static architecture cannot handle. Add agent-as-tool when specialists need each other's data. Each layer adds complexity, so add them incrementally based on measured need rather than anticipated need.

Putting It All Together: Architecture and Next Steps

Let us synthesize everything into a complete architecture for our customer service multi-agent system. The full system has five components: a triage agent, three specialist agents, a set of guardrails, a context injection layer, and an orchestration layer that ties everything together.

The triage agent has no tools of its own. Its only function is to classify the user's intent and hand off to the appropriate specialist. Its instructions contain explicit routing rules and examples. It can hand off to any of the three specialists. It has an input guardrail that checks for prompt injection and a topic guardrail that rejects clearly off-topic messages (like asking for cooking recipes from a customer service system).

The order agent has four tools: get_order_status, get_shipping_tracking, initiate_return, and modify_order. Its instructions focus exclusively on order-related inquiries. It has an output guardrail that prevents it from disclosing internal system details (warehouse locations, supplier names) and a commitment guardrail that prevents it from promising delivery dates that are not confirmed by the shipping API.

The billing agent has three tools: get_invoice, process_refund, and update_payment_method. The process_refund tool has a built-in limit - refunds above $500 require human approval, implemented as a tool-level check that returns "This refund requires manager approval - I have submitted the request and you will receive an email confirmation within 24 hours" instead of processing directly. Its output guardrail prevents it from displaying full credit card numbers.

The product agent has two tools: search_products and get_product_details. It also has access to a knowledge base via a retrieval tool. Its instructions emphasize accuracy - it should only state product features that are confirmed in the product database, not infer or hallucinate capabilities.

The context injection layer runs before the triage agent. When a message arrives, it looks up the customer's profile, recent orders, and open support tickets. This context is added to the conversation so the triage agent (and subsequently the specialists) have immediate awareness of the customer's situation. A returning customer with an open support ticket gets a very different experience than a first-time inquirer.

The orchestration layer manages the overall conversation lifecycle: creating sessions, routing messages to the triage agent, handling streaming responses, logging all interactions, and managing error recovery. It is a FastAPI application with WebSocket support for real-time streaming. Each conversation gets a unique session ID that threads through all logging and tracing.

To deploy this system, you need: a Python application server (we recommend uvicorn with FastAPI), a database for customer data and conversation logs (PostgreSQL works well), a caching layer for context injection (Redis), and an observability stack (structured logging with a dashboard). The API costs for this system are approximately $0.03-$0.08 per conversation for typical customer service interactions, which is dramatically lower than the $5-$15 cost of a human-handled interaction.

For next steps, consider these enhancements: add a feedback mechanism where users rate the agent's helpfulness (this data trains your improvement loop), implement conversation analytics to identify the most common questions (these inform knowledge base improvements), add A/B testing capability to experiment with different agent instructions, and build a human escalation path for issues that exceed the agents' capabilities. Our AI agents for small business guide covers the business considerations for deploying systems like this, and our MCP server tutorial shows how to extend agent capabilities with the Model Context Protocol for even richer tool integration.

FAQ

What is the OpenAI Agents SDK?

The OpenAI Agents SDK is a Python framework for building AI agent applications. It provides three core primitives - Agents (LLMs with instructions and tools), Handoffs (mechanism for transferring control between agents), and Guardrails (input/output validators) - that compose into multi-agent architectures. It is open source and works with OpenAI's models.

How do handoffs work in the OpenAI Agents SDK?

Handoffs transfer control from one agent to another while preserving the full conversation history. You define handoff targets when creating an agent, and the LLM decides when to hand off based on the agent's instructions. The receiving agent picks up the conversation seamlessly. Users typically do not notice the transition between agents.

What are guardrails in the OpenAI Agents SDK?

Guardrails are validators that run in parallel with the agent's main execution. Input guardrails validate user messages (catching prompt injection, off-topic requests). Output guardrails validate agent responses (catching PII leaks, policy violations). If a guardrail triggers, it halts execution before the problematic content reaches the user, adding no latency to the happy path.

How much does it cost to run an OpenAI multi-agent system?

API costs for a typical customer service multi-agent system are $0.03-$0.08 per conversation using GPT-4o. Costs vary based on conversation length, number of tool calls, and whether you use context injection (which reduces tool calls). For comparison, a human-handled interaction typically costs $5-$15.

Can I use the OpenAI Agents SDK with non-OpenAI models?

The SDK is designed primarily for OpenAI models but supports any model provider that implements the OpenAI-compatible API format. Several providers (together.ai, Groq, Fireworks) offer OpenAI-compatible endpoints. You can also use it with local models served through vLLM or Ollama with an OpenAI-compatible wrapper, though function calling quality varies by model.

All posts

2026-07-09