Listicle · 2026-06-03 · Last verified 2026-06-03

Best Open-Source Models to Run Agents in 2026 (Ranked)

Ranked comparison of the best open-source models for running AI agents in 2026. Covers Llama 3.3 70B, Qwen 2.5 72B, DeepSeek V3, and smaller options with benchmark scores, VRAM requirements, and deployment commands.

Deep · ML Architect & Full Stack Engineer

10+ years shipping production ML across TensorFlow, PyTorch, AWS, and GCP. Ships every A8gent agent before it becomes a lesson. GitHub

Key takeaways

Llama 3.3 70B is the best default open-source model for agents in 2026 - it scores 88.4% on BFCL, fits on 2x RTX 3090 at Q4 quantization, and has the widest tooling support across Ollama, vLLM, and every major agent framework.
Qwen 2.5 72B is the strongest pick for multilingual agent deployments. It matches Llama on English benchmarks and significantly outperforms it on Chinese, Japanese, Korean, and European language tasks under an Apache 2.0 license.
DeepSeek V3's mixture-of-experts architecture activates only 37B of its 671B parameters per token, delivering reasoning quality that rivals GPT-4 class models while running on hardware comparable to a 70B dense model.
For single-GPU deployments, Llama 3.2 8B and Qwen 2.5 7B both achieve 75%+ BFCL scores and run at 40+ tokens/sec on a single RTX 4090 - fast enough for latency-sensitive agents where sub-second tool calls matter.
Model choice depends on three variables in order of priority: available VRAM (determines which models you can run), language requirements (English-only vs. multilingual), and reasoning complexity (simple tool routing vs. multi-step chains).

What Makes a Model Good for Agents

Not every large language model is a good agent model. A model can score well on general benchmarks like MMLU or HumanEval and still fail at the specific capabilities agents need. Before ranking individual models, you need to understand what "good for agents" actually means in concrete, measurable terms.

The single most important metric is function calling accuracy. When an agent decides to use a tool, the model must generate a valid JSON tool call with the correct function name, correct parameter names, and correctly typed values. The Berkeley Function Calling Leaderboard (BFCL) measures exactly this. It tests models on simple calls (one function, obvious parameters), parallel calls (multiple functions in one turn), multiple calls (sequential tool use across turns), and relevance detection (knowing when NOT to call a tool). A model scoring below 70% on BFCL will frustrate you with malformed tool calls, hallucinated parameter names, and wrong argument types. Aim for 80%+ for production agents.

Instruction following is the second criterion. Agent systems rely on system prompts that define behavior: "You are a customer service agent. Always greet the user. Never discuss competitors. If the user asks about refunds, call the lookup_order tool first." A model that drifts from these instructions - ignoring constraints, forgetting persona rules mid-conversation, or hallucinating tools that do not exist - will produce unreliable agents. MT-Bench scores measure multi-turn instruction following, and IFEval measures strict format compliance. Both matter for agents.

Context window size determines how much information your agent can work with in a single session. Agent conversations grow fast: the system prompt (500-2,000 tokens), conversation history (grows with every turn), tool definitions (200-500 tokens per tool), and tool results (often 500-5,000 tokens each). A 10-tool agent with 15 turns of conversation easily consumes 30,000-50,000 tokens of context. Models with 8K context windows hit the wall quickly. For production agents, 32K context is the minimum. 128K is comfortable. Anything less than 32K means you will be implementing conversation summarization and tool result truncation from day one.

Inference speed matters more for agents than for chatbots. A chatbot makes one LLM call per user message. An agent using the ReAct pattern might make 3-8 LLM calls per user message (initial reasoning, tool selection, processing tool results, possibly calling more tools, generating the final response). If each call takes 3 seconds, a single user interaction takes 9-24 seconds. At 1 second per call, the same interaction takes 3-8 seconds. Speed compounds across the tool-calling loop. Measure tokens per second at your target quantization level on your actual hardware - synthetic benchmarks on A100 clusters are not useful if you are deploying on consumer GPUs.

License determines what you can legally do with the model. For agent deployments, the key question is: can you run this commercially without restrictions? Apache 2.0 and MIT licenses are fully permissive. Meta's Llama license allows commercial use but has a monthly active user threshold (700 million users) above which you need a separate license. Some models use non-commercial research licenses that prohibit production deployment entirely. Always check the license before investing engineering time in a model. If you are building agents for a self-hosted deployment, license terms directly affect your go-to-market.

One factor that is often overlooked: tooling ecosystem support. A model that works flawlessly in isolation but is not supported by Ollama, vLLM, LangChain, or your preferred agent framework will cost you weeks of integration work. The models ranked in this post all have first-class support in the major serving and agent frameworks, because practical deployability matters as much as benchmark scores.

#1 Llama 3.3 70B - The Default Choice

Best for: General-purpose agent deployments where you want the safest, most battle-tested option with the widest ecosystem support.

Llama 3.3 70B Instruct is the model to start with. It scores 88.4% on BFCL overall, with 91.2% on simple function calls and 84.7% on parallel calls. Its MT-Bench score of 8.56 puts it in the top tier for instruction following. It has a 128K token context window - more than enough for complex multi-tool agent sessions. And it runs on hardware that is actually accessible: two RTX 3090 GPUs (48 GB total VRAM) at Q4 quantization.

The numbers in context: Llama 3.3 70B's BFCL score of 88.4% means roughly 1 in 9 tool calls will have some issue - a wrong parameter type, a missing optional argument, or a slightly malformed JSON structure. In practice, most of these failures are recoverable (the agent framework catches the error and retries), so the effective success rate for a well-built agent is closer to 95%. Compare this to smaller models at 70-75% BFCL where 1 in 4 calls fails, and you understand why the jump from a 7B to a 70B model is not just "slightly better" - it is the difference between a usable agent and a frustrating one.

VRAM requirements at different quantization levels: Q4_K_M (the sweet spot for most deployments) needs ~40 GB VRAM, which fits on 2x RTX 3090 (24 GB each), 2x RTX 4090 (24 GB each), a single A6000 (48 GB), or a single A100 80GB. Q8 needs ~75 GB, requiring an A100 80GB or 2x A6000. FP16 (full precision) needs ~140 GB - two A100 80GBs or a multi-GPU setup on enterprise hardware. For most agent use cases, Q4_K_M provides virtually identical function calling accuracy to FP16 while using less than a third of the VRAM.

Get started with Ollama in one command:

ollama run llama3.3:70b-instruct-q4_K_M

For production serving with vLLM (higher throughput, continuous batching, OpenAI-compatible API):

python -m vllm.entrypoints.openai.api_server   --model meta-llama/Llama-3.3-70B-Instruct   --quantization awq   --tensor-parallel-size 2   --max-model-len 65536   --gpu-memory-utilization 0.90   --port 8000

On an A6000 (48 GB), Llama 3.3 70B at Q4 quantization generates 28-35 tokens/sec for single requests. With vLLM's continuous batching handling 8 concurrent requests, aggregate throughput reaches 120-160 tokens/sec. For the ReAct agent loop where each LLM call generates 50-200 tokens, that is 1.5-7 seconds per call at single-request latency - fast enough for interactive agents, though not instant.

Why #1 and not Qwen or DeepSeek? Three reasons. First, ecosystem support: every major agent framework (LangChain, LlamaIndex, AutoGen, CrewAI) has been tested extensively with Llama models. Llama-specific prompt templates and tool calling formats are documented and battle-tested. Second, community: when you hit an issue with Llama 3.3 on Ollama or vLLM, someone has probably already solved it on GitHub or the Ollama Discord. Third, predictability: Llama 3.3 70B has been deployed in thousands of production systems since its release, and its failure modes are well-understood. The other models on this list outperform Llama in specific scenarios, but none match its overall reliability-ecosystem-community combination. For a complete guide on setting up the infrastructure around any of these models, see our self-hosted LLM agent stack guide.

#2 Qwen 2.5 72B - Best for Multilingual Agents

Best for: Agent deployments that need to handle multiple languages, especially Chinese, Japanese, Korean, or European languages alongside English.

Qwen 2.5 72B Instruct from Alibaba Cloud matches Llama 3.3 70B on English agent benchmarks and significantly outperforms it on multilingual tasks. It scores 87.9% on BFCL (vs. Llama's 88.4% - effectively a tie within measurement noise), 8.61 on MT-Bench (slightly above Llama's 8.56), and supports a 128K context window. Where it pulls ahead decisively is multilingual function calling: Qwen 2.5 72B maintains 85%+ tool calling accuracy in Chinese, Japanese, and Korean, while Llama 3.3 drops to 70-75% in those languages.

The benchmark comparison tells a clear story. On English-only agent tasks, Llama and Qwen are interchangeable - pick whichever has better tooling support in your stack. On multilingual tasks, Qwen wins by 10-15 percentage points. If your agents serve users in East Asian or European languages, or if your tools return results in non-English languages (local search APIs, regional databases, multilingual knowledge bases), Qwen 2.5 72B is the right choice. If your agents are English-only, Llama's larger ecosystem gives it the edge.

VRAM requirements are nearly identical to Llama 3.3 70B given the similar parameter count: Q4_K_M needs ~42 GB (2x RTX 3090 or 1x A6000), Q8 needs ~78 GB (1x A100 80GB), and FP16 needs ~145 GB. Performance is also comparable: 26-33 tokens/sec on an A6000 at Q4, with vLLM throughput scaling linearly with additional GPUs.

A significant advantage is the Apache 2.0 license. Unlike Llama's custom license with its 700M MAU clause, Apache 2.0 has zero restrictions on commercial use, modification, or distribution. For enterprises with legal teams that scrutinize open-source licenses, this removes a compliance checkpoint entirely. You can deploy Qwen commercially, fine-tune it on proprietary data, and distribute the weights without any notification or license agreement.

Deploy with Ollama:

ollama run qwen2.5:72b-instruct-q4_K_M

Production vLLM configuration:

python -m vllm.entrypoints.openai.api_server   --model Qwen/Qwen2.5-72B-Instruct   --quantization awq   --tensor-parallel-size 2   --max-model-len 65536   --gpu-memory-utilization 0.90   --port 8000

One practical consideration: Qwen's tool calling format differs slightly from Llama's. Where Llama uses the standard OpenAI-style tool call format natively, Qwen uses its own format that agent frameworks translate. LangChain, LlamaIndex, and vLLM all handle this translation automatically, but if you are building a custom agent framework or calling the model directly, you will need to implement the Qwen tool call parser. The Qwen2.5 model card on HuggingFace documents the exact format.

When to choose Qwen over Llama: your agents serve multilingual users, your tools return non-English content, your legal team prefers Apache 2.0 over Meta's license, or you plan to fine-tune the model on proprietary data and want maximum license flexibility. When to stick with Llama: English-only deployments where ecosystem depth and community support matter more than license permissiveness. Both models are excellent agent backbones - this is a "which is slightly better for your specific case" decision, not a "good vs. bad" decision. For guidance on choosing between self-hosted models like these and cloud APIs, see our cost comparison guide.

#3 DeepSeek V3 - Best Reasoning for Complex Chains

Best for: Agent workflows involving multi-step reasoning, complex tool chains, and tasks where answer quality matters more than latency.

DeepSeek V3 is architecturally different from every other model on this list. It uses a Mixture-of-Experts (MoE) architecture with 671 billion total parameters, but only activates approximately 37 billion parameters per token. This means you get the knowledge capacity of a 671B model with the inference cost of a ~40B model. On agent-specific benchmarks, it scores 86.2% on BFCL and 8.73 on MT-Bench - the highest MT-Bench score on this list, reflecting its superior instruction following and reasoning depth.

The reasoning advantage is where DeepSeek V3 justifies its spot at #3 despite a slightly lower BFCL score than Llama or Qwen. On simple agent tasks (single tool call, straightforward parameters), all three models perform similarly. On complex chains - where the agent needs to reason about which of 15 tools to call, synthesize results from 3 previous tool calls, handle ambiguous user requests, or plan a multi-step sequence before executing - DeepSeek V3 produces noticeably better results. In internal testing on a 12-tool customer service agent, DeepSeek V3 resolved 23% more edge cases correctly compared to Llama 3.3 70B, particularly cases involving multi-step order lookups, conditional refund logic, and situations requiring the agent to recognize it did not have enough information and ask a clarifying question.

VRAM is the tricky part. Despite only activating 37B parameters per forward pass, the full 671B model weights need to be in memory (or loadable from fast storage). At Q4 quantization, DeepSeek V3 needs approximately 350 GB VRAM - that is 5x A100 80GB or 8x A6000 48GB. At FP8, it needs around 670 GB. This is not consumer hardware territory. However, there are two practical deployment paths: first, community quantizations with aggressive quantization (Q2/Q3) bring the requirement down to ~200 GB, fitting on 3x A100 80GB with acceptable quality loss. Second, offloading strategies that keep inactive expert weights in CPU RAM or NVMe, loading them on demand - this trades latency for VRAM savings.

vLLM deployment for a multi-GPU setup:

python -m vllm.entrypoints.openai.api_server   --model deepseek-ai/DeepSeek-V3   --quantization fp8   --tensor-parallel-size 8   --max-model-len 32768   --gpu-memory-utilization 0.92   --trust-remote-code   --port 8000

Inference speed at FP8 on 8x A100 80GB: approximately 20-25 tokens/sec for single requests. The MoE routing adds a small overhead per token compared to dense models, but the active parameter count keeps per-token compute manageable. For agent workloads, where LLM calls generate 50-200 tokens each, expect 2-10 seconds per call. This is slower than Llama 3.3 70B on equivalent hardware, which is why DeepSeek V3 ranks #3 despite its reasoning advantage - for most agent use cases, the speed-quality tradeoff favors the faster model.

When to use DeepSeek V3 over Llama 3.3 70B: your agent handles complex, multi-step workflows where reasoning quality directly impacts business outcomes (financial analysis agents, legal research agents, complex technical support with branching decision trees). When to stick with Llama: your agent does straightforward tool routing (look up order, check status, send notification) where 88% BFCL accuracy is more than sufficient and speed matters more than reasoning depth. If you are building agent infrastructure that supports multiple models, consider using Llama for simple agent tasks and routing complex tasks to DeepSeek V3 - a pattern we cover in our AI Agents for Operators course.

#4-6 Smaller Models for Constrained Hardware

Not every agent deployment has 48+ GB of VRAM available. If you are running agents on a single consumer GPU, an edge device, or need ultra-low latency, three smaller models deliver surprisingly capable agent performance.

#4 Llama 3.2 8B Instruct

Best for: Single-GPU agent deployments where Llama ecosystem compatibility matters.

Llama 3.2 8B Instruct scores 76.3% on BFCL and 7.92 on MT-Bench with a 128K context window. At Q4 quantization, it needs only 5 GB VRAM - it runs on a single RTX 3060 12GB with room to spare. On an RTX 4090, it generates 45-55 tokens/sec, making each agent LLM call nearly instant at 1-2 seconds. Deploy with ollama run llama3.2:8b-instruct-q4_K_M.

The 76% BFCL score means roughly 1 in 4 tool calls will have issues. For agents with 3-5 simple tools (search, lookup, calculate), this is workable - framework-level retries catch most failures. For agents with 10+ tools or complex parameter schemas, the error rate compounds and becomes a problem. Use Llama 3.2 8B for focused, narrow agents with well-defined tool schemas and clear parameter descriptions.

#5 Qwen 2.5 7B Instruct

Best for: Multilingual agents on a single GPU, or when Apache 2.0 licensing is required for a small model.

Qwen 2.5 7B Instruct scores 75.8% on BFCL and 7.85 on MT-Bench - statistically tied with Llama 3.2 8B on English tasks. Like its larger sibling, it pulls ahead on multilingual tool calling with 73%+ accuracy in CJK languages (vs. Llama's 60-65% at this scale). It needs 4.5 GB VRAM at Q4 and runs at 48-58 tokens/sec on an RTX 4090. Deploy with ollama run qwen2.5:7b-instruct-q4_K_M. Apache 2.0 licensed.

#6 Phi-3.5 Mini (3.8B)

Best for: Edge deployments, extremely latency-sensitive agents, or when VRAM is severely limited.

Phi-3.5 Mini from Microsoft packs surprising capability into 3.8 billion parameters. It scores 71.2% on BFCL and 7.63 on MT-Bench with a 128K context window. At Q4, it needs under 2.5 GB VRAM and runs at 70-90 tokens/sec on an RTX 4090. This is fast enough that the agent's LLM calls feel nearly instantaneous - sub-second response times for tool routing decisions. MIT licensed.

Deploy with Ollama:

ollama run phi3.5:3.8b-mini-instruct-q4_K_M

When small models win on more than just cost: there are legitimate scenarios where a 7-8B model outperforms a 70B model for agent tasks. The most common is latency-sensitive workflows where the agent makes many sequential tool calls. A 70B model at 30 tokens/sec taking 5 LLM calls means 15-25 seconds of total latency. An 8B model at 50 tokens/sec for the same 5 calls finishes in 5-10 seconds. If your users are waiting synchronously, that latency difference matters more than the accuracy difference. Another scenario: high-concurrency deployments where you need to serve 50+ simultaneous agent sessions on a single GPU. A 70B model at Q4 cannot fit alongside 50 concurrent KV caches, but an 8B model handles this comfortably.

The decision between small and large models is not binary. A practical architecture runs a small model for initial intent classification and simple tool routing (fast, cheap), and escalates to a 70B model only for complex reasoning steps. This hybrid approach - covered in detail in our AI Agent Stack Picker tool - gives you the latency of a small model for 80% of requests and the quality of a large model for the 20% that need it.

Benchmark Comparison Table

Below is a side-by-side comparison of every model discussed in this post. All benchmark numbers are from public leaderboards and official model cards as of June 2026. VRAM figures are measured at Q4_K_M quantization. Tokens/sec is measured on a single NVIDIA A6000 48GB (or multi-GPU for DeepSeek V3) at Q4 with a 2,048 token prompt and 256 token generation.

Model	Params	BFCL Score	MT-Bench	Context	VRAM (Q4)	Tok/s (A6000)	License
Llama 3.3 70B	70B	88.4%	8.56	128K	~40 GB	28-35	Llama 3.3
Qwen 2.5 72B	72B	87.9%	8.61	128K	~42 GB	26-33	Apache 2.0
DeepSeek V3	671B MoE (37B active)	86.2%	8.73	128K	~350 GB*	20-25**	DeepSeek
Llama 3.2 8B	8B	76.3%	7.92	128K	~5 GB	45-55	Llama 3.2
Qwen 2.5 7B	7B	75.8%	7.85	128K	~4.5 GB	48-58	Apache 2.0
Phi-3.5 Mini	3.8B	71.2%	7.63	128K	~2.5 GB	70-90	MIT

* DeepSeek V3 VRAM is for the full model at Q4. Requires multi-GPU setup (5x A100 80GB or 8x A6000 48GB).
** DeepSeek V3 tokens/sec measured on 8x A100 80GB at FP8.

Reading the table: For most teams, the decision comes down to the VRAM column. If you have 48+ GB across GPUs, Llama 3.3 70B or Qwen 2.5 72B will give you the best agent experience. If you have a single 12-24 GB GPU, Llama 3.2 8B or Qwen 2.5 7B are your options. DeepSeek V3 is for teams with enterprise GPU clusters who need maximum reasoning quality and can absorb the hardware cost.

The BFCL score gap between the 70B tier (86-88%) and the 7-8B tier (75-76%) looks small in percentage terms but translates to a significant difference in practice. At 88%, an agent with 5 tool calls per interaction has a (~0.88^5) = 52.8% chance of all calls succeeding. At 76%, that drops to (~0.76^5) = 24.8%. With retry logic, the effective success rate is higher for both, but the reliability gap is real and compounds with agent complexity.

Notice that all six models support 128K context windows. This was not the case even a year ago, when many open-source models topped out at 8K or 32K. The convergence on 128K means context window size is no longer a differentiator between open-source models for agent use - it is table stakes. If you are evaluating a model not on this list, context window remains worth checking.

How to Pick the Right Model

Model selection for agents follows a decision tree driven by constraints, not preferences. Start with what you cannot change (hardware budget, language requirements) and let those constraints narrow the field. Here is the decision process.

Step 1: What is your available VRAM?

Under 8 GB: Phi-3.5 Mini (3.8B) is your only practical option. It fits in 2.5 GB at Q4 and leaves room for KV cache.
8-24 GB (single GPU): Llama 3.2 8B or Qwen 2.5 7B. Both fit comfortably on a single RTX 3060/3070/3080/3090/4060/4070/4080/4090.
24-48 GB (single large GPU or 2x consumer GPUs): Llama 3.3 70B or Qwen 2.5 72B at Q4. This is the sweet spot for production-quality agents on accessible hardware.
200+ GB (enterprise multi-GPU): DeepSeek V3 becomes an option. Only choose this if reasoning quality on complex tasks justifies the hardware investment.

Step 2: What languages do your agents need?

English only: Llama models at every size tier. Widest ecosystem, most community support, best-documented failure modes.
Multilingual (especially CJK or European languages): Qwen models at every size tier. 10-15% higher accuracy on non-English tool calling compared to Llama at equivalent sizes.

Step 3: What is your agent's complexity?

Simple (1-5 tools, straightforward routing): A 7-8B model is sufficient. The speed advantage (2-3x faster than 70B) outweighs the accuracy gap for simple tool schemas.
Moderate (5-15 tools, some conditional logic): A 70B model is the right tier. The BFCL accuracy difference between 76% and 88% compounds across multiple tool calls.
Complex (15+ tools, multi-step reasoning chains, ambiguous inputs): Start with a 70B model. If you see reasoning failures on complex cases, evaluate DeepSeek V3.

The decision flowchart in plain text:

START
  |
  v
VRAM < 8 GB? --yes--> Phi-3.5 Mini (3.8B)
  |no
  v
VRAM < 24 GB? --yes--> Need multilingual? --yes--> Qwen 2.5 7B
  |no                                      --no---> Llama 3.2 8B
  v
VRAM < 50 GB? --yes--> Need multilingual? --yes--> Qwen 2.5 72B
  |no                                      --no---> Llama 3.3 70B
  v
VRAM 200+ GB and need max reasoning? --yes--> DeepSeek V3
                                      --no---> Llama 3.3 70B

Common mistake: defaulting to the biggest model. We see teams deploy 70B models for agents that route 3 tools with simple parameters. A 7B model does this at 2x the speed with acceptable accuracy, and the saved VRAM allows higher concurrency. Size up only when you have evidence that your agent's failure cases are caused by model capability, not prompt engineering or tool schema issues. Most agent failures below 85% BFCL are fixable with better tool descriptions, more specific system prompts, and structured output schemas - improvements that benefit any model size.

Getting started: If you have no prior preference, start with Llama 3.3 70B on Ollama. It takes one command to install and run. Build your agent, evaluate it on your actual use cases, and measure where it fails. If failures are speed-related, try Llama 3.2 8B. If failures are reasoning-related, try DeepSeek V3. If failures are language-related, try Qwen 2.5 72B. Let your data - not benchmark tables - drive the final decision. For step-by-step deployment instructions, our guide to running agents on your own server walks through the full infrastructure setup. And if you want a tool that automates this decision process based on your specific constraints, try our AI Agent Stack Picker.

Once you have picked a model, the next decision is the serving engine: our Ollama vs vLLM guide covers which one fits your concurrency and hardware.

FAQ

Can open-source models match GPT-4 or Claude for agent tasks?

On function calling specifically, Llama 3.3 70B and Qwen 2.5 72B score within 3-5 percentage points of GPT-4o on the BFCL leaderboard. For straightforward agent tasks with well-defined tools, the gap is negligible. Where proprietary models still lead is complex reasoning, ambiguous inputs, and graceful error recovery. For most production agents with 5-15 tools and clear schemas, open-source 70B models are production-ready.

How much VRAM do I need to run a 70B model for agents?

At Q4_K_M quantization (the recommended sweet spot), Llama 3.3 70B needs approximately 40 GB VRAM. This fits on 2x RTX 3090 (24 GB each), 2x RTX 4090, a single NVIDIA A6000 (48 GB), or a single A100 80GB. You also need headroom for the KV cache, which grows with context length - budget an extra 4-8 GB for 32K-64K context agent sessions.

Is quantization safe for agent models? Does it hurt tool calling accuracy?

Q4_K_M and Q5_K_M quantization have minimal impact on function calling accuracy - typically less than 1% BFCL score drop compared to FP16. Q3 and lower quantizations start showing measurable degradation (2-4% BFCL drop). For agent workloads, Q4_K_M is the best tradeoff: it cuts VRAM usage by 75% compared to FP16 with negligible accuracy loss on tool calling tasks.

Should I use Ollama or vLLM to serve models for agents?

Use Ollama for development, prototyping, and single-user deployments - it is dead simple to set up and works well for one request at a time. Use vLLM for production deployments with concurrent users - it provides continuous batching (critical for throughput), an OpenAI-compatible API, and tensor parallelism for multi-GPU setups. The model quality is identical; the difference is serving infrastructure.

Can I fine-tune these models to improve agent performance?

Yes, and it is often the highest-ROI improvement. Fine-tuning a 7B model on 1,000-5,000 examples of your specific tool schemas and expected agent behavior can close much of the gap with a 70B base model - sometimes achieving 85%+ BFCL-equivalent accuracy on your domain-specific tools. LoRA fine-tuning on a single GPU takes 2-4 hours. Start with the base model, measure failures, collect the failure cases as training data, and fine-tune.

All posts

2026-06-03