Technical · 2026-06-03 · Last verified 2026-06-03

How to Run AI Agents on Your Own Server (Production Setup, 2026)

The definitive guide to self-hosting AI agents in production. Covers hardware selection, vLLM inference, LangGraph orchestration, Docker deployment, nginx hardening, and real cost comparisons against cloud APIs at every scale.

Deep · ML Architect & Full Stack Engineer

10+ years shipping production ML across TensorFlow, PyTorch, AWS, and GCP. Ships every A8gent agent before it becomes a lesson.

Key takeaways

Self-hosting AI agents becomes cost-effective at roughly 50,000+ requests per month - below that threshold, cloud APIs are cheaper when you factor in hardware depreciation, electricity, and maintenance time.
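A single RTX 4090 (24 GB VRAM) comfortably serves Llama 3.1 8B at FP16, but a 70B model needs about 35 GB for weights alone even at INT4, so it calls for a 48 GB card or two 4090s with tensor parallelism. A single-GPU setup like this is sufficient for internal tools but not for customer-facing products at scale.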
The production stack that works is vLLM for inference, LangGraph for agent orchestration, PostgreSQL for state persistence, and Redis for caching - all running in Docker Compose with a single docker compose up command.
Production hardening requires nginx as a reverse proxy with TLS, API key authentication, rate limiting, and health checks - without these, your server will be exploited within hours of being exposed to the internet.
The biggest hidden cost of self-hosting is not hardware - it is operational burden. Someone on your team needs to handle model updates, driver patches, OOM crashes at 3 AM, and CUDA version conflicts after every Ubuntu upgrade.

Why Self-Host AI Agents (And Why Most Teams Should Not Start Here)

There are four legitimate reasons to run AI agents on your own server: data sovereignty, cost at scale, latency requirements, and regulatory compliance. Every other reason - "I want to learn," "I don't trust OpenAI," "I like building infrastructure" - is valid for side projects but not for production decisions. Let's be precise about when self-hosting actually makes sense.

Data sovereignty

If your agents process medical records, financial data, legal documents, or classified information, sending that data to a third-party API may violate your compliance requirements. HIPAA, SOC 2 Type II, GDPR Article 28, and FedRAMP all have provisions about data processing locations and third-party access. When your data cannot leave your network boundary, self-hosting is not optional - it is the only option. Note that OpenAI and Anthropic both offer enterprise agreements with data processing addendums, so "compliance" alone does not always mandate self-hosting. Read the actual requirements before making infrastructure decisions based on vibes.

Cost at scale

Cloud API pricing is per-token. At low volumes (under 10,000 requests per month), the cost is trivial - maybe $50-200/month depending on your prompt sizes. At 100,000+ requests per month, you might be spending $2,000-10,000/month on API calls alone. A self-hosted GPU server costs $1,500-5,000/month all-in (hardware amortization, electricity, bandwidth, maintenance time). These are ballpark ranges for heavy agent workloads; the worked example in the cost comparison section uses a lighter 500-input, 300-output request on single-GPU hardware, so its absolute figures are lower. The crossover point depends on your specific workload, but for most teams it lands somewhere between 50,000 and 100,000 requests per month. We will break down exact numbers in the cost comparison section.

Latency

Cloud API calls add 100-500ms of network latency before the first token arrives. For internal tools where users tolerate a loading spinner, this is fine. For real-time applications - voice agents, trading systems, interactive coding assistants - that latency is unacceptable. A local vLLM instance on the same network returns first tokens in 20-50ms. If your agent makes 3-5 LLM calls per request (common in ReAct loops), the latency savings compound: 300-2,500ms saved per request.
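The compounding is plain arithmetic. A minimal sketch, using the ranges quoted above and a hypothetical helper name, that also subtracts the local first-token latency instead of ignoring it:

```python
def latency_saved_ms(calls: int, cloud_ms: float, local_ms: float) -> float:
    """Per-request latency saved when every LLM call moves from cloud to local."""
    return calls * (cloud_ms - local_ms)


# Smallest saving: 3 calls, fastest cloud case (100 ms) vs slowest local case (50 ms)
low = latency_saved_ms(3, cloud_ms=100, local_ms=50)
# Largest saving: 5 calls, slowest cloud case (500 ms) vs fastest local case (20 ms)
high = latency_saved_ms(5, cloud_ms=500, local_ms=20)

print(f"Saved per request: {low:.0f}-{high:.0f} ms")
```

Net of local latency the range is a little narrower than the headline figure, which counts the cloud overhead alone.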

Compliance and audit requirements

Some industries require full audit trails of every inference request and response, with guaranteed data retention and deletion capabilities. Self-hosting gives you complete control over logging, storage, and data lifecycle. You can prove exactly which model version processed which data, when it was deleted, and that no third party had access. Try getting that level of audit granularity from an API provider.

Now the honest part: self-hosting is operationally expensive. You are signing up to maintain GPU drivers, handle CUDA version conflicts, manage model updates, monitor VRAM usage, restart crashed inference servers at 3 AM, and debug cryptic NCCL errors. If you do not have someone on your team who is comfortable with Linux system administration and GPU infrastructure, you will spend more time fighting your stack than building your product. For teams that are early stage or have low request volumes, start with cloud APIs and use our ROI calculator to model when self-hosting becomes cost-effective for your specific workload.

With the caveats out of the way, let's build the infrastructure. The rest of this guide assumes you have decided that self-hosting is right for your situation and walks you through the exact setup we use in production.

Hardware Requirements: GPUs, VRAM, and the Math That Matters

The single most important number in self-hosted inference is VRAM - the amount of memory on your GPU. The model must fit entirely in VRAM (or be split across multiple GPUs), and you need headroom for the KV cache that stores attention states during inference. Get the VRAM math wrong and your server either refuses to load the model or crashes mid-inference when a long context exhausts the cache.

VRAM math for common models

A model's VRAM requirement depends on its parameter count and the precision (quantization level) you use:

Model	Parameters	FP16 (no quant)	INT8 (8-bit)	INT4 (4-bit GPTQ/AWQ)
Llama 3.1 8B	8B	16 GB	8 GB	5 GB
Llama 3.1 70B	70B	140 GB	70 GB	35 GB
Mistral 7B	7B	14 GB	7 GB	4 GB
Mixtral 8x7B	46.7B (active 12.9B)	93 GB	47 GB	24 GB
Qwen 2.5 72B	72B	144 GB	72 GB	36 GB

The formula: VRAM (GB) = Parameters (billions) x Bytes per parameter. FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes. Then add 2-6 GB for the KV cache depending on your max context length and batch size. For a 70B model at INT4 quantization with 4K context, you need roughly 35 + 4 = 39 GB of VRAM.
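The formula is easy to turn into a quick fit check. This is a rough sketch, not a sizing tool: the function names are made up for illustration, the 4 GB KV cache default is just the example figure from the paragraph above, and real deployments also need some memory for activations and framework overhead.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_vram_gb(params_billions: float, precision: str, kv_cache_gb: float = 4.0) -> float:
    """Weights (parameters x bytes per parameter) plus a KV cache allowance, in GB."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb + kv_cache_gb


def fits(params_billions: float, precision: str, gpu_vram_gb: float, kv_cache_gb: float = 4.0) -> bool:
    """True if the estimate fits in the given amount of GPU memory."""
    return estimate_vram_gb(params_billions, precision, kv_cache_gb) <= gpu_vram_gb


for gpu_name, vram in [("24 GB card", 24), ("48 GB card", 48), ("80 GB card", 80)]:
    for precision in ("fp16", "int8", "int4"):
        verdict = "fits" if fits(70, precision, vram) else "does not fit"
        print(f"70B at {precision} on a {gpu_name}: {verdict}")
```

Running it for a 70B model shows why precision and card size have to be chosen together: only INT4 fits in 48 GB, and FP16 does not fit on any single card in the list.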

GPU options and what they actually cost

Here are the GPUs that make sense for self-hosted inference in 2026, ordered by value:

NVIDIA RTX 4090 (24 GB VRAM, ~$1,600 used) - The workhorse for small deployments. Runs Llama 3.1 8B at FP16 with room for the KV cache. A 70B model does not fit on one card even at INT4 (about 35 GB of weights per the table above); for that, pair two 4090s. Throughput: 70-90 tokens/second for 8B models. Best for internal tools with 1-5 concurrent users.
NVIDIA RTX A6000 (48 GB VRAM, ~$3,500 used) - The sweet spot for 70B models. Runs Llama 3.1 70B at INT4 (about 35 GB) with comfortable KV cache headroom; INT8 (about 70 GB) does not fit. Throughput: 20-30 tokens/second for a quantized 70B model. Professional-grade reliability with ECC memory. Best for production workloads up to 10-15 concurrent users.
NVIDIA A100 80GB (80 GB VRAM, ~$8,000 used) - The gold standard. Runs 70B models at INT8 (about 70 GB, leaving a modest KV cache) or at INT4 with large KV caches, or serves multiple smaller models simultaneously. 70B at FP16 (about 140 GB) needs two cards. Throughput: 40-60 tokens/second for a 70B model. Best for high-throughput production or when you need multiple models.
2x RTX 4090 with tensor parallelism (48 GB total, ~$3,200) - Budget alternative to the A6000. vLLM supports tensor parallelism across multiple GPUs, so two 4090s give you 48 GB total VRAM. Slightly lower throughput than a single A6000 due to inter-GPU communication overhead, but significantly cheaper.

The rest of the hardware

CPU: Not the bottleneck for inference, but you need enough cores for data preprocessing and running your application server alongside vLLM. A modern 8-core CPU (AMD Ryzen 7 or Intel i7) is sufficient. For multi-GPU setups, ensure enough PCIe lanes (at least 16 per GPU).

RAM: You need enough system RAM to load the model weights from disk before transferring them to GPU VRAM. Rule of thumb: system RAM should be at least 1.5x the model size on disk. For a 70B model at INT4 (~35 GB on disk), you want at least 64 GB of system RAM. Running PostgreSQL and Redis alongside needs another 8-16 GB. Recommendation: 64 GB minimum, 128 GB preferred.

Storage: Model weights are large (35-140 GB per model). Use NVMe SSDs for fast model loading - loading a 70B model from a SATA SSD takes 3-4 minutes, from NVMe it takes 45-60 seconds. Plan for 1 TB NVMe minimum to store multiple model versions and quantizations. Postgres data and logs go on separate storage to avoid I/O contention during inference.

Network: For a single-server setup, standard 1 Gbps ethernet is fine. If you are serving external traffic, ensure your ISP provides a static IP and sufficient upload bandwidth. For cloud-hosted GPU servers (Hetzner, OVH, Lambda Labs), you get 1-10 Gbps connectivity included. Rental prices for a dedicated GPU server vary widely by provider and card; the cost comparison later in this guide budgets $500-1,200/month all-in for a cloud GPU. Renting is often cheaper than owning hardware once you include electricity and cooling costs.

The Production Stack: vLLM, LangGraph, Postgres, Redis

After evaluating dozens of inference servers and orchestration frameworks, the stack we have settled on for production self-hosted agents is: vLLM for model inference, LangGraph for agent orchestration, PostgreSQL for state persistence, and Redis for caching and rate limiting. Each component is battle-tested, open source, and replaceable if something better comes along. Here is why each piece exists and what it does.

vLLM: the inference engine

vLLM is one of the fastest open-source LLM inference servers. Its key innovation is PagedAttention, which manages the KV cache like an operating system manages virtual memory - dynamically allocating and freeing cache blocks instead of reserving a fixed block per request. This means vLLM can serve 2-4x more concurrent requests than naive implementations with the same GPU memory. In practice, a 70B model on an A100 serves 20+ concurrent requests with vLLM versus 5-8 with a basic HuggingFace setup.
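To see why block-based allocation admits more requests, here is a toy capacity model. It is only an illustration of the idea, not vLLM's scheduler: the function names, the 32,768-token cache budget, the request lengths, and the 16-token block size are all assumptions chosen for the example, and it takes a static snapshot rather than modelling requests that grow as they generate.

```python
import math


def max_concurrent_fixed(kv_budget_tokens: int, max_model_len: int) -> int:
    """Naive scheme: every request reserves a full max-length slot up front."""
    return kv_budget_tokens // max_model_len


def max_concurrent_paged(kv_budget_tokens: int, request_lens: list[int], block_size: int = 16) -> int:
    """Paged scheme: each request takes only the blocks it needs, in arrival order."""
    free_blocks = kv_budget_tokens // block_size
    admitted = 0
    for length in request_lens:
        needed = math.ceil(length / block_size)
        if needed > free_blocks:
            break  # cache is full; later requests would wait in a queue
        free_blocks -= needed
        admitted += 1
    return admitted


budget = 32_768  # hypothetical KV cache capacity, in tokens
lens = [1_000, 3_000, 500, 2_000, 8_000, 1_500, 700, 4_000, 900, 2_500, 6_000, 1_200]

fixed = max_concurrent_fixed(budget, max_model_len=8_192)
paged = max_concurrent_paged(budget, lens)
print(f"fixed reservation: {fixed} requests, paged allocation: {paged} requests")
```

With these made-up lengths the paged scheme admits three times as many requests, because most requests use far less than the maximum context. The actual gain depends entirely on how your request lengths are distributed.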

vLLM exposes an OpenAI-compatible API, which means code written for the OpenAI API works with vLLM by changing the base URL and the model name. Your LangGraph agents, your existing scripts, your evaluation harnesses - they all keep the same request and response shapes. This compatibility is crucial because it means you can develop against OpenAI's API (fast iteration, no GPU needed) and deploy against vLLM (cost savings, data sovereignty) with configuration changes only.
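The compatibility is easiest to see at the HTTP level. The sketch below uses only the standard library and builds the request without sending it; build_chat_request is a hypothetical helper, the API key is a placeholder, and the local URL assumes vLLM is listening on port 8000 as in the rest of this guide.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


# Same helper, same payload shape; only the base URL, key, and model name differ
cloud = build_chat_request("https://api.openai.com/v1", "sk-placeholder", "gpt-4o-mini", "ping")
local = build_chat_request("http://localhost:8000/v1", "not-needed", "meta-llama/Llama-3.1-8B-Instruct", "ping")

print(cloud.full_url)
print(local.full_url)
```

Passing either object to urllib.request.urlopen would send it. In practice you would read the base URL and model name from environment variables so the same code runs in both places.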

LangGraph: the orchestration layer

LangGraph handles everything above the inference layer: agent state management, tool execution, conditional routing, multi-step reasoning loops, and workflow orchestration. If vLLM is the engine, LangGraph is the transmission. It decides when to call the LLM, what tools to invoke, when to loop versus terminate, and how to persist state across requests. For a deep dive into LangGraph itself, see our comprehensive LangGraph tutorial.

Why LangGraph and not just raw Python? Because production agents need state persistence (resume conversations across server restarts), streaming (send tokens to the frontend as they generate), checkpointing (recover from crashes mid-execution), and conditional routing (branch agent behavior based on runtime decisions). You can build all of this yourself, but you will end up reinventing LangGraph poorly. The framework handles the plumbing so you can focus on agent logic.

PostgreSQL: durable state

LangGraph's PostgresSaver checkpointer stores agent state - conversation history, tool results, intermediate reasoning - in PostgreSQL. Every time the agent completes a step, the state is checkpointed to Postgres. If vLLM crashes, if the app server restarts, if the power goes out - the agent resumes from the last checkpoint. For production agents that handle customer data, this durability is non-negotiable.
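The checkpoint-and-resume idea can be shown without any framework. The sketch below is not LangGraph's API or schema: it uses an in-memory SQLite database as a stand-in for Postgres, and the table layout, function names, and step names are invented for illustration.

```python
import json
import sqlite3


def save_checkpoint(db, thread_id: str, step: int, state: dict) -> None:
    """Persist the full state after a step completes."""
    db.execute(
        "INSERT INTO checkpoints (thread_id, step, state) VALUES (?, ?, ?)",
        (thread_id, step, json.dumps(state)),
    )
    db.commit()


def load_latest(db, thread_id: str):
    """Return (last completed step, state), or a fresh state for a new thread."""
    row = db.execute(
        "SELECT step, state FROM checkpoints WHERE thread_id = ? ORDER BY step DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, {"messages": []})


def run_steps(db, thread_id: str, steps: list, crash_after=None) -> dict:
    """Run the remaining steps, checkpointing after each; optionally simulate a crash."""
    done, state = load_latest(db, thread_id)
    for i, step_name in enumerate(steps[done:], start=done + 1):
        state["messages"].append(step_name)
        save_checkpoint(db, thread_id, i, state)
        if crash_after == i:
            raise RuntimeError("simulated crash")
    return state


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checkpoints (thread_id TEXT, step INTEGER, state TEXT)")
```

Calling run_steps a second time with the same thread_id skips the steps that were already checkpointed, which is the behaviour the Postgres checkpointer gives an agent after a restart.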

Postgres also stores your application data: user accounts, API keys, agent configurations, audit logs, and analytics. Running a separate database for each concern is overengineering for most self-hosted setups. One Postgres instance with separate schemas handles it all until you reach significant scale (thousands of concurrent agents).

Redis: caching and rate limiting

Redis serves three purposes in the stack. First, response caching: if two users ask the same question, there is no reason to run inference again. An exact-match cache (hash the prompt, store the response) can reduce your inference load by 10-30% depending on your workload; a true semantic cache, which also matches paraphrased prompts by embedding similarity, is a separate and more involved addition. Second, rate limiting: track request counts per API key per time window to prevent abuse. Third, queue management: when more requests arrive than vLLM can handle concurrently, Redis-backed queues buffer them with fair ordering.
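Hashing the prompt gives an exact-match cache: identical requests hit, anything reworded misses. A minimal sketch of that pattern follows. InMemoryStore is a stand-in written for this example so the code runs without a Redis server; it mimics only the get and setex calls, which redis-py clients also provide (note that a real client returns bytes unless created with decode_responses=True). The key prefix and TTL are arbitrary choices.

```python
import hashlib
import json
import time


class InMemoryStore:
    """Tiny stand-in for the two Redis calls used here: get and setex."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires_at = self._data.get(key, (None, 0.0))
        return value if time.monotonic() < expires_at else None

    def setex(self, key, ttl_seconds, value):
        self._data[key] = (value, time.monotonic() + ttl_seconds)


def cache_key(model: str, messages: list) -> str:
    """Stable hash of everything that determines the response."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return "llm_cache:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_completion(store, model: str, messages: list, generate, ttl_seconds: int = 3600):
    """Return (response, was_cache_hit); call generate() only on a miss."""
    key = cache_key(model, messages)
    hit = store.get(key)
    if hit is not None:
        return hit, True
    response = generate(messages)
    store.setex(key, ttl_seconds, response)
    return response, False
```

Include the model name, and any sampling parameters you vary, in the hashed payload. Otherwise a response generated by one configuration can be served for another.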

How the components connect

The request flow: a client sends a request to your application server (FastAPI). The server authenticates the request, checks Redis for a cached response, and if no cache hit, invokes the LangGraph agent. LangGraph runs the agent's reasoning loop, calling vLLM for inference and executing tools as needed. After each step, LangGraph checkpoints state to Postgres. The final response streams back through FastAPI to the client. All four components run on the same server in Docker containers, communicating over Docker's internal network with sub-millisecond latency.

This stack runs comfortably on a single machine with one GPU. For scaling beyond one machine, you add more app server containers behind a load balancer (all sharing the same Postgres and Redis), and optionally add more vLLM instances on additional GPU machines. But start with one machine - most teams overestimate their concurrency needs early on.

Step-by-Step Setup: Docker Compose, vLLM, and Model Download

This section gives you a working self-hosted agent stack from zero. By the end, you will have vLLM serving a model, a FastAPI application server, Postgres, Redis, and nginx all running in Docker Compose. Every command is copy-paste ready. We are using Llama 3.1 8B as the example model because it runs on a single RTX 4090 at FP16 - swap the model name for larger models if your hardware supports it.

Prerequisites

Ubuntu 22.04 or 24.04 (other Linux distributions work but NVIDIA driver setup differs)
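NVIDIA GPU with at least 24 GB VRAM for the FP16 example (the Llama 3.1 8B weights alone take about 16 GB, and the KV cache needs headroom on top)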
NVIDIA driver 535+ installed (nvidia-smi should show your GPU)
Docker Engine 24+ with the NVIDIA Container Toolkit installed
At least 100 GB free disk space for model weights
A Hugging Face account with access to Llama 3.1 (request access at meta-llama on HF)

Project structure

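agent-server/
  docker-compose.yml
  app/
    Dockerfile
    main.py
    agent.py
    requirements.txt
  nginx/
    nginx.conf
  certs/
  .env

The app/Dockerfile is required by the build: ./app entry in docker-compose.yml; a standard Python image that installs requirements.txt and starts uvicorn on port 8080 is enough. The certs/ directory holds the TLS certificate files that are mounted into the nginx container.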

Docker Compose file

This is the heart of the deployment. Save this as docker-compose.yml:

services:
  vllm:
    image: vllm/vllm-openai:v0.7.3
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
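    ports:
      # Bind to localhost only: vLLM has no authentication in this setup,
      # and a plain "8000:8000" mapping would expose it on every interface
      - "127.0.0.1:8000:8000"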
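    # The last two flags enable tool calling, which llm.bind_tools() in agent.py relies on
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --enable-prefix-caching
      --dtype auto
      --enable-auto-tool-choice
      --tool-call-parser llama3_json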
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: agent_db
      POSTGRES_USER: agent
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U agent"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
    volumes:
      - redisdata:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  app:
    build: ./app
    environment:
      - VLLM_BASE_URL=http://vllm:8000/v1
      - POSTGRES_URL=postgresql://agent:${POSTGRES_PASSWORD}@postgres:5432/agent_db
      - REDIS_URL=redis://redis:6379
      - API_KEY=${API_KEY}
    ports:
      # Localhost only, for testing; external traffic should arrive through nginx
      - "127.0.0.1:8080:8080"
    depends_on:
      vllm:
        condition: service_healthy
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - app

volumes:
  model-cache:
  pgdata:
  redisdata:

Environment variables

Create a .env file in the project root:

HF_TOKEN=hf_your_huggingface_token_here
POSTGRES_PASSWORD=your_secure_password_here
API_KEY=your_api_key_for_clients

Starting the stack

The first startup takes 10-30 minutes because vLLM downloads the model weights (roughly 16 GB for Llama 3.1 8B). Subsequent startups take 60-90 seconds for model loading:

# Start everything
docker compose up -d

# Watch vLLM model loading progress
docker compose logs -f vllm

# You will see output like:
# INFO: Loading model weights...
# INFO: Model loaded in 47.3 seconds
# INFO: Using PagedAttention with 8192 max seq len
# INFO: Uvicorn running on http://0.0.0.0:8000

# Verify vLLM is serving
curl http://localhost:8000/v1/models
# Returns: {"data": [{"id": "meta-llama/Llama-3.1-8B-Instruct", ...}]}

# Test inference
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, are you running?"}],
    "max_tokens": 50
  }'

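Common startup errors

"CUDA out of memory" - Your model is too large for your GPU. Reduce --max-model-len to 4096 to shrink the KV cache, or switch to a pre-quantized AWQ or GPTQ checkpoint. Quantized weights live in their own Hugging Face repositories, so you need the full name of one that actually exists; appending -AWQ or -GPTQ to the official model name will not work. If another process is using the same GPU, also lower --gpu-memory-utilization to 0.85.
"Could not find model" - Check your HF_TOKEN is valid and you have accepted the Llama license on Hugging Face.
"nvidia-container-cli: initialization error" - The NVIDIA Container Toolkit is not installed or Docker is not configured to use it. Run sudo apt install nvidia-container-toolkit, then sudo nvidia-ctk runtime configure --runtime=docker, and restart Docker.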
Health check keeps failing - vLLM needs 60-120 seconds to load the model. The start_period: 120s in the healthcheck accounts for this, but larger models may need more time. Increase it to 300s for 70B models.
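When scripting deployments it helps to wait for the model to finish loading instead of guessing a sleep duration. A small sketch using only the standard library; the function names are hypothetical, and the URL in the usage note assumes the /health endpoint and port used in the Compose healthcheck above.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with HTTP 200; False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(url, max_wait_s=300.0, interval_s=5.0, probe=http_ok, sleep=time.sleep) -> bool:
    """Poll until the probe succeeds or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For example, wait_until_healthy("http://localhost:8000/health", max_wait_s=300) blocks for up to five minutes, which matches the longer start period suggested for 70B models. The probe and sleep parameters exist so the loop can be tested without a running server.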

Building Your First Agent on the Self-Hosted Stack

With vLLM running and serving inference, you now build the agent layer on top. This agent uses LangGraph for orchestration and calls your local vLLM instance instead of a cloud API. The code is identical to what you would write against OpenAI's API - the only difference is the base URL pointing to your local vLLM server.

Application requirements

Save this as app/requirements.txt:

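langgraph==0.3.25
# Provides langgraph.checkpoint.postgres, which agent.py imports.
# Not pinned here: choose releases compatible with your langgraph version.
langgraph-checkpoint-postgres
psycopg[binary,pool]
langchain-openai==0.3.12
langchain-core==0.3.40
fastapi==0.115.6
uvicorn==0.34.0
psycopg2-binary==2.9.10
redis==5.2.1
sse-starlette==2.2.1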

The agent code

Save this as app/agent.py. This implements a ReAct agent with tool calling that runs entirely on your local infrastructure:

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode
from langgraph.checkpoint.postgres import PostgresSaver
from langchain_core.tools import tool
import os

# Point to your local vLLM instance - this is the only
# line that differs from an OpenAI-backed agent
llm = ChatOpenAI(
    base_url=os.environ["VLLM_BASE_URL"],
    api_key="not-needed",  # vLLM does not require an API key
    model="meta-llama/Llama-3.1-8B-Instruct",
    temperature=0.1,
    max_tokens=2048,
)

# Define tools - these run locally on your server
@tool
def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base for relevant information.
    Use this when the user asks a question that requires specific
    domain knowledge or documentation lookup."""
    # Replace with your actual search implementation
    # (vector DB, Elasticsearch, SQL query, etc.)
    return f"Knowledge base results for: {query}"

@tool
def execute_sql_query(query: str) -> str:
    """Execute a read-only SQL query against the analytics database.
    Use this when the user needs data, metrics, or reports.
    Only SELECT queries are allowed."""
    # Replace with your actual database query logic
    # IMPORTANT: validate and sanitize the query first
    return f"Query results for: {query}"

@tool
def create_ticket(title: str, description: str, priority: str) -> str:
    """Create a support ticket in the ticketing system.
    Use this when the user reports an issue that needs follow-up.
    Priority must be 'low', 'medium', or 'high'."""
    # Replace with your actual ticketing API call
    return f"Ticket created: {title} (priority: {priority})"

tools = [search_knowledge_base, execute_sql_query, create_ticket]
llm_with_tools = llm.bind_tools(tools)

# Build the LangGraph agent
def agent_node(state: MessagesState):
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

def should_continue(state: MessagesState):
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return END

# Assemble the graph
graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
graph.add_edge("tools", "agent")

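# Compile with Postgres checkpointing for durable state.
# main.py streams with astream_events(), which needs the async checkpointer;
# the sync PostgresSaver does not implement the async methods.
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from psycopg.rows import dict_row
from psycopg_pool import AsyncConnectionPool

async def create_agent(postgres_url: str):
    # The pool must outlive this function, so it is opened here and kept
    # for the life of the process instead of being used as a context manager
    pool = AsyncConnectionPool(
        conninfo=postgres_url,
        kwargs={"autocommit": True, "prepare_threshold": 0, "row_factory": dict_row},
        open=False,
    )
    await pool.open()
    checkpointer = AsyncPostgresSaver(pool)
    await checkpointer.setup()  # Creates tables if they don't exist
    return graph.compile(checkpointer=checkpointer)

The FastAPI server

Save this as app/main.py. This exposes your agent as an HTTP API with streaming support:

from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel
from sse_starlette.sse import EventSourceResponse
from agent import create_agent
from langchain_core.messages import HumanMessage
import json
import os
import redis
import secrets

agent = None
redis_client = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once at startup; replaces the deprecated @app.on_event("startup")
    global agent, redis_client
    agent = await create_agent(os.environ["POSTGRES_URL"])
    redis_client = redis.from_url(os.environ["REDIS_URL"])
    yield

app = FastAPI(lifespan=lifespan)

async def verify_api_key(x_api_key: str = Header(...)):
    # Constant-time comparison avoids leaking key contents through timing
    expected = os.environ["API_KEY"].encode("utf-8")
    if not secrets.compare_digest(x_api_key.encode("utf-8"), expected):
        raise HTTPException(status_code=401, detail="Invalid API key")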

class ChatRequest(BaseModel):
    message: str
    thread_id: str

@app.post("/chat")
async def chat(req: ChatRequest, _=Depends(verify_api_key)):
    config = {"configurable": {"thread_id": req.thread_id}}
    input_msg = {"messages": [HumanMessage(content=req.message)]}

    async def event_generator():
        async for event in agent.astream_events(
            input_msg, config, version="v2"
        ):
            if event["event"] == "on_chat_model_stream":
                token = event["data"]["chunk"].content
                if token:
                    yield {"data": json.dumps({"token": token})}
            elif event["event"] == "on_tool_start":
                yield {"data": json.dumps({
                    "tool": event["name"],
                    "status": "started"
                })}

    return EventSourceResponse(event_generator())

@app.get("/health")
async def health():
    return {"status": "ok"}

Testing the agent

Once the stack is running, test the full loop:

# Send a chat message
curl -N http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: your_api_key_for_clients" \
  -d '{
    "message": "What were our top 5 customers by revenue last quarter?",
    "thread_id": "test-session-001"
  }'

# You will see streaming SSE events:
# data: {"tool": "execute_sql_query", "status": "started"}
# data: {"token": "Based"}
# data: {"token": " on"}
# data: {"token": " the"}
# ...

The key insight: this code is nearly identical to what you would write against OpenAI's API. The ChatOpenAI class accepts a base_url parameter, and everything downstream - tools, LangGraph, streaming - works the same way. This means you can develop locally against OpenAI (no GPU needed), run tests against OpenAI (fast iteration), and deploy to your self-hosted vLLM in production. The switchover comes down to configuration: the base URL, which is already read from VLLM_BASE_URL, and the model name, which is hardcoded in agent.py above and is worth moving to an environment variable too. For a more detailed walkthrough of LangGraph agent patterns, see our LangGraph tutorial.
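On the client side, the stream is a sequence of "data:" lines carrying the JSON payloads that main.py emits. A minimal parsing sketch, assuming one JSON object per data line as in the sample output above; the function names are hypothetical, and multi-line SSE events are not handled.

```python
import json
from typing import Iterable, Iterator


def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each 'data:' line; skip blanks, comments, other fields."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue
        yield json.loads(line[len("data:"):].strip())


def collect_answer(lines: Iterable[str]):
    """Join streamed tokens into the final text and list the tools that were started."""
    tokens, tools = [], []
    for event in parse_sse_events(lines):
        if "token" in event:
            tokens.append(event["token"])
        elif "tool" in event:
            tools.append(event["tool"])
    return "".join(tokens), tools
```

Any iterable of text lines works as input, for example the decoded lines of a streaming HTTP response. Leading spaces inside tokens survive because only the outer whitespace around the JSON payload is stripped.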

Production Hardening: Nginx, Auth, Rate Limiting, Monitoring

Running the stack from the previous section gives you a working agent server. Exposing it to the internet without hardening gives you a compromised server within hours. This section covers the minimum production hardening required before your agent server handles real traffic: TLS termination, authentication, rate limiting, and monitoring. Skip none of these.

Nginx reverse proxy with TLS

Save this as nginx/nginx.conf. It handles TLS termination, rate limiting at the connection level, and proxying to your app server:

worker_processes auto;

events {
    worker_connections 1024;
}

http {
    # Rate limiting zone: 10 requests per second per IP
    limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

    # Connection limiting: max 20 concurrent connections per IP
    limit_conn_zone $binary_remote_addr zone=conn:10m;

    upstream app {
        server app:8080;
    }

    # Redirect HTTP to HTTPS
    server {
        listen 80;
        server_name your-domain.com;
        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl;
        http2 on;  # nginx 1.25.1+; the "http2" flag on listen is deprecated there
        server_name your-domain.com;

        ssl_certificate /etc/nginx/certs/fullchain.pem;
        ssl_certificate_key /etc/nginx/certs/privkey.pem;
        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_ciphers HIGH:!aNULL:!MD5;

        # Security headers
        add_header X-Content-Type-Options nosniff;
        add_header X-Frame-Options DENY;
        add_header X-XSS-Protection "1; mode=block";

        # Request size limit (prevents abuse with huge prompts)
        client_max_body_size 1m;

        # Timeouts for long-running agent requests
        proxy_read_timeout 120s;
        proxy_send_timeout 120s;

        location /chat {
            limit_req zone=api burst=20 nodelay;
            limit_conn conn 20;

            proxy_pass http://app;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # SSE support for streaming
            proxy_buffering off;
            proxy_cache off;
            chunked_transfer_encoding on;
        }

        location /health {
            proxy_pass http://app;
        }

        # Block access to everything else
        location / {
            return 404;
        }
    }
}

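For TLS certificates, use Let's Encrypt with certbot. Run certbot certonly --standalone -d your-domain.com before starting nginx. Certbot writes the files under /etc/letsencrypt/live/your-domain.com/ as symlinks, so copy fullchain.pem and privkey.pem into the ./certs directory that docker-compose.yml mounts into the nginx container. Standalone mode needs port 80 to itself, which means renewal has to stop nginx briefly. A cron job along these lines handles it (replace /path/to/agent-server with your project directory): 0 3 * * * cd /path/to/agent-server && certbot renew --quiet --pre-hook "docker compose stop nginx" --post-hook "cp -L /etc/letsencrypt/live/your-domain.com/*.pem certs/ && docker compose start nginx". The hooks only run when a certificate is actually due for renewal.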

Application-level rate limiting with Redis

Nginx rate limiting is per-IP, which is a good first layer but insufficient for API key-based access. Add per-key rate limiting in your FastAPI middleware using Redis. This lets you set different limits for different clients:

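# Add to app/main.py
from fastapi import Request
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware

class RateLimitMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        api_key = request.headers.get("x-api-key", "anonymous")
        key = f"rate_limit:{api_key}"
        # redis_client is the synchronous client created at startup; each call
        # briefly blocks the event loop, which is tolerable for a local Redis
        current = redis_client.incr(key)
        if current == 1:
            redis_client.expire(key, 60)  # 60-second window
        if current > 100:  # 100 requests per minute per key
            # Return the response directly: an HTTPException raised inside
            # middleware bypasses FastAPI's exception handlers and surfaces
            # as a 500 instead of a 429
            return JSONResponse(
                status_code=429,
                content={"detail": "Rate limit exceeded. Max 100 requests per minute."},
            )
        return await call_next(request)

app.add_middleware(RateLimitMiddleware)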

Monitoring with Prometheus and Grafana

Add Prometheus and Grafana to your Docker Compose for visibility into GPU utilization, inference latency, request throughput, and error rates. vLLM exposes Prometheus metrics natively at /metrics. The key metrics to monitor:

vllm:num_requests_running - Current concurrent requests. If this consistently equals your max batch size, you need more GPU capacity.
vllm:avg_generation_throughput_toks_per_s - Tokens per second. A sudden drop indicates GPU thermal throttling or VRAM pressure.
vllm:gpu_cache_usage_perc - KV cache utilization. Above 95% means requests are getting queued waiting for cache space.
Request latency (P50, P95, P99) - Track at the nginx and application layers. P99 above 30 seconds usually means your batch size or context length is too aggressive.
Error rate by type - Separate CUDA OOM errors (need less memory pressure), timeout errors (need faster inference or longer timeouts), and application errors (bugs in your agent code).
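Before wiring up Grafana, you can sanity-check these metrics with a few lines of code. The sketch below parses the Prometheus text format and applies the 95% cache threshold from the list above. The sample text is hypothetical, written for this example and not captured from a real server, and the code assumes the cache usage gauge is reported as a fraction between 0 and 1.

```python
def parse_prometheus_text(text: str) -> dict:
    """Parse 'name{labels} value' lines into {name: value}; labels are dropped."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        metrics[name] = float(value)
    return metrics


def check_alerts(metrics: dict, cache_limit: float = 0.95) -> list:
    """Return human-readable alerts for thresholds that are exceeded."""
    alerts = []
    if metrics.get("vllm:gpu_cache_usage_perc", 0.0) > cache_limit:
        alerts.append("KV cache above 95%: requests are likely queuing")
    return alerts


# Hypothetical scrape output, for illustration only
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.97
"""

parsed = parse_prometheus_text(SAMPLE)
print(parsed, check_alerts(parsed))
```

Because labels are dropped, this only suits a single-model server; with several models behind one endpoint you would key the dictionary on the label set as well.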

Set up alerts for: GPU temperature above 85C (thermal throttling imminent), VRAM usage above 95% (OOM crash imminent), error rate above 5% (something is broken), and disk usage above 85% (logs or checkpoints are filling the disk). Use Grafana's alerting to send notifications to Slack, PagerDuty, or email. A GPU that crashes at 3 AM and stays down until someone notices at 9 AM is 6 hours of downtime that proper monitoring prevents.

Log aggregation

Structure your logs as JSON so they are parseable. Include a correlation ID in every request that propagates through all components - when debugging a bad agent response, you want to trace from the nginx access log through the FastAPI handler through LangGraph execution through the vLLM inference call. Without correlation IDs, debugging production agent issues is like reading four different books and trying to find the connecting plot thread.
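One way to get both properties with the standard library is a JSON formatter plus a context variable that holds the current request's ID. This is a sketch under assumptions: the field names and helper functions are invented for illustration, and the incoming ID is assumed to arrive from the proxy (for example in an X-Request-ID header that you configure nginx to set).

```python
import contextvars
import io
import json
import logging
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the current correlation ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def make_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> str:
    """Reuse the caller's ID if one was passed in, otherwise mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request received")
        logger.info("calling inference")
    finally:
        correlation_id.reset(token)
    return cid


buffer = io.StringIO()
log = make_logger("agent", buffer)
request_id = handle_request(log, incoming_id="req-123")
print(buffer.getvalue(), end="")
```

Context variables are isolated per asyncio task, so concurrent requests in the same process do not overwrite each other's IDs.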

Cost Comparison: Self-Hosted vs. Cloud APIs at Every Scale

This is the section that actually matters for most teams. The technical setup is straightforward - what makes or breaks the self-hosting decision is economics. We calculated costs for four scenarios: 1,000 requests/month (hobby project), 10,000 (internal tool), 100,000 (production application), and 1,000,000 (high-traffic product). All prices are as of June 2026 and assume an average request of 500 input tokens and 300 output tokens.

Cloud API costs

Provider / Model	1K req/mo	10K req/mo	100K req/mo	1M req/mo
OpenAI GPT-4o	$7	$70	$700	$7,000
OpenAI GPT-4o-mini	$0.30	$3	$30	$300
Anthropic Claude Sonnet 4	$5	$50	$500	$5,000
Anthropic Claude Haiku	$0.50	$5	$50	$500
Google Gemini 2.5 Flash	$0.20	$2	$20	$200
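Each cell in a table like this comes from the same calculation: tokens per request times the per-token price, times the request count. A sketch with hypothetical rates follows. The $5 and $15 per million tokens are illustrative placeholders and not any provider's quoted price, so substitute the current rates from the provider's pricing page.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of one request, given prices per million tokens."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def monthly_cost(requests: int, input_tokens: int = 500, output_tokens: int = 300,
                 usd_per_m_input: float = 5.0, usd_per_m_output: float = 15.0) -> float:
    """Monthly API bill for a given request volume and request shape."""
    return requests * cost_per_request(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output)


for volume in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{volume:>9,} requests/month: ${monthly_cost(volume):,.2f}")
```

Output tokens usually cost several times more than input tokens, so an agent that writes long answers shifts these numbers much more than one that reads long prompts.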

Self-hosted costs (amortized monthly)

Setup	Hardware (amortized 3yr)	Electricity	Maintenance (eng hours)	Total/month
RTX 4090, Llama 8B	$110	$40	$200	$350
A6000, Llama 70B INT4	$200	$60	$200	$460
A100, Llama 70B INT8	$350	$80	$200	$630
Cloud GPU (Lambda Labs)	-	-	$200	$500-1,200

The maintenance column deserves explanation. We estimate 2-4 hours per month of engineering time for a well-running self-hosted setup: applying security patches, updating model versions, investigating occasional OOM crashes, checking monitoring dashboards, rotating certificates. At $50-100/hour for an ML engineer, that is $100-400/month. We use $200 as a conservative middle estimate. Some months it is zero; other months a CUDA driver update breaks everything and you burn 10 hours fixing it.

The crossover analysis

Comparing self-hosted (RTX 4090 with Llama 8B at $350/month fixed cost) against cloud APIs:

Requests/month	Self-hosted	GPT-4o	Claude Sonnet 4	GPT-4o-mini	Winner
1,000	$350	$7	$5	$0.30	Cloud API (50x+ cheaper)
10,000	$350	$70	$50	$3	Cloud API (5-100x cheaper)
100,000	$350	$700	$500	$30	Mixed - beats GPT-4o and Claude Sonnet 4, loses to mini
1,000,000	$350	$7,000	$5,000	$300	Self-hosted beats premium models, still loses to mini

The honest takeaway

If you are comparing self-hosted open models against premium cloud models (GPT-4o, Claude Sonnet), self-hosting wins at roughly 50,000+ requests per month: $350 divided by $0.007 per GPT-4o request is 50,000, and the same calculation against Claude Sonnet 4 gives 70,000. But here is what the comparison misses: model quality. Llama 3.1 8B is not GPT-4o. It is closer to GPT-4o-mini in capability. When you compare self-hosted Llama 8B against GPT-4o-mini ($0.30 per 1K requests), the break-even is about 1.17 million requests per month, so the cloud API is cheaper at every volume in the table above.

The fair comparison is self-hosted Llama 70B against GPT-4o (similar capability tier). At 70B, you need at least an A6000 ($460/month), and the crossover point is around 65,000-70,000 requests per month. Below that, pay for the API. Above that, run your own. Use our ROI calculator to plug in your exact request volume, average token counts, and preferred model to get a personalized break-even analysis.
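The break-even points quoted in this section come from dividing the fixed monthly self-hosting cost by the per-request API price. A minimal sketch of that arithmetic, using the per-1K-request prices and fixed costs from the tables above; the function name is illustrative:

```python
def break_even_requests(fixed_monthly_usd: float, api_usd_per_1k: float) -> float:
    """Monthly request volume at which the API bill equals the fixed self-hosting cost."""
    if api_usd_per_1k <= 0:
        raise ValueError("API price must be positive")
    return fixed_monthly_usd * 1000 / api_usd_per_1k


# (setup, fixed $/month, API model, API $ per 1K requests), all taken from the tables above.
SCENARIOS = [
    ("RTX 4090 + Llama 8B", 350, "GPT-4o", 7.00),
    ("RTX 4090 + Llama 8B", 350, "Claude Sonnet 4", 5.00),
    ("RTX 4090 + Llama 8B", 350, "GPT-4o-mini", 0.30),
    ("A6000 + Llama 70B", 460, "GPT-4o", 7.00),
]

if __name__ == "__main__":
    for setup, fixed, model, price in SCENARIOS:
        volume = break_even_requests(fixed, price)
        print(f"{setup} vs {model}: break-even at about {volume:,.0f} requests/month")
```

With these inputs the break-even volumes are 50,000 (4090 vs. GPT-4o), 70,000 (4090 vs. Claude Sonnet 4), about 1.17 million (4090 vs. GPT-4o-mini), and about 65,700 (A6000 vs. GPT-4o). The calculation treats self-hosting as a pure fixed cost and ignores the throughput ceiling of a single GPU.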

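One factor often overlooked: throughput limits. At 30 tokens/second and 300 output tokens, a request takes about 10 seconds, so a single RTX 4090 running Llama 8B tops out near 8,600 requests per day if they run back to back. Traffic is never that even, so a realistic sustained figure is 3,000-5,000 requests per day, or about 90,000-150,000 per month. If you need more, you add another GPU - and the hardware and electricity costs double. The 1,000,000-request row in the crossover table therefore needs several GPUs, not a single $350/month setup. Cloud APIs scale automatically with no hardware decisions.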
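The capacity estimate can be reproduced with a few lines of arithmetic. At 30 tokens/second and 300 output tokens, one request takes about 10 seconds, so a single sequential stream completes at most 8,640 requests per day; the 3,000-5,000 figure corresponds to roughly 35-58% of that ceiling. In the sketch below the 50% utilization is an assumption chosen to land inside that range, and the model ignores prompt processing time and any gain from batching.

```python
import math

SECONDS_PER_DAY = 86_400


def requests_per_day(tokens_per_second: float, output_tokens: int,
                     utilization: float = 1.0) -> float:
    """Requests one sequential stream completes per day at a given utilization."""
    seconds_per_request = output_tokens / tokens_per_second
    return SECONDS_PER_DAY / seconds_per_request * utilization


def gpus_needed(monthly_requests: int, per_gpu_daily: float, days: int = 30) -> int:
    """Number of identical GPUs needed to serve a monthly request volume."""
    return math.ceil(monthly_requests / (per_gpu_daily * days))


if __name__ == "__main__":
    ceiling = requests_per_day(30, 300)                    # theoretical maximum
    planned = requests_per_day(30, 300, utilization=0.5)   # assumed 50% utilization
    print(f"ceiling: {ceiling:,.0f}/day, planned: {planned:,.0f}/day")
    print(f"GPUs for 1M requests/month: {gpus_needed(1_000_000, planned)}")
```

Under these assumptions one GPU covers about 130,000 requests per month, and the 1,000,000 requests/month scenario needs 8 of them.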

For teams still evaluating whether self-hosting makes sense for their specific use case, our AI Agents for Operators course covers the full decision framework including factors beyond pure cost: team capability, compliance requirements, latency needs, and vendor risk tolerance.

When NOT to Self-Host (And What to Do Instead)

After 5,000 words about how to self-host, here is the honest conclusion: most teams should not self-host AI agents. Not yet, and maybe not ever. Self-hosting is the right choice for a specific set of conditions, and if those conditions do not apply to you, you are trading engineering velocity for infrastructure headaches. Here are the cases where cloud APIs are the better answer.

Low request volume (under 50,000 requests/month)

The math does not work. Even with the cheapest GPU setup, you are paying $350+/month in fixed costs. At 10,000 requests per month with GPT-4o, you are paying $70. That is 5x cheaper with zero operational burden. You do not need to think about CUDA drivers, model updates, GPU monitoring, or 3 AM crash recovery. You make an API call and it works. The only exception: if your data absolutely cannot leave your network (compliance requirement, not preference), the premium is worth it.

Rapid prototyping and experimentation

When you are building a new agent and iterating on prompts, tools, and architecture, you need fast feedback loops. Self-hosted models have slower token throughput than cloud APIs (15-20 tok/s on a 4090 vs. 80-100 tok/s from OpenAI), and every model swap requires downloading 10-100 GB of weights and restarting vLLM. With cloud APIs, switching from GPT-4o to Claude Sonnet to Gemini is a one-line change. During prototyping, optimize for iteration speed, not infrastructure cost.

Multi-model requirements

Production agents often need multiple models: a large model for complex reasoning, a small model for classification and routing, an embedding model for RAG retrieval, and possibly a vision model for image understanding. Running all of these on your own hardware requires multiple GPUs or constant model swapping (loading a new model takes 60-120 seconds). Cloud APIs let you call any model on any request with no resource management. If your agent architecture requires 3+ different models, cloud APIs are simpler until your scale justifies multi-GPU infrastructure.

Teams without GPU/Linux expertise

Self-hosting requires someone who can debug NVIDIA driver failures, interpret CUDA error messages, tune vLLM parameters for optimal throughput, and handle the operational incidents that GPU infrastructure produces. If nobody on your team has this skillset, you will spend more hours fighting infrastructure than building product. Hiring or training for this expertise is a legitimate investment - but it needs to be a conscious decision, not an accidental time sink.

When you need frontier model capabilities

The best open models (Llama 3.1 70B, Qwen 2.5 72B) are excellent but still lag behind the latest frontier models (GPT-4o, Claude Opus, Gemini Ultra) on complex reasoning, nuanced instruction following, and multi-step planning. If your agent requires frontier-level capabilities - complex analysis, creative writing, subtle judgment calls - self-hosted open models may produce noticeably worse results. Test your specific use case with open models before committing to self-hosting. The capability gap is narrowing rapidly, but it still exists for demanding tasks.

The hybrid approach

The best architecture for many teams is hybrid: self-host a small model for high-volume, latency-sensitive tasks (classification, routing, simple Q&A) and use cloud APIs for complex, low-volume tasks (multi-step reasoning, content generation, analysis). This gives you cost savings where volume is high, quality where it matters, and data sovereignty for the sensitive operations. Your LangGraph agent can call different models per node - the routing node calls local Llama 8B (fast, cheap), and the reasoning node calls Claude Sonnet via API (smart, reliable).

If you decide to start with cloud APIs and migrate to self-hosting later, design your agent code for portability from day one. Use the OpenAI-compatible API format everywhere (which vLLM supports natively), parameterize your model names and base URLs, and keep your business logic decoupled from your inference provider. When the migration day comes, it should be a configuration change, not a rewrite.
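A minimal sketch of what that portability can look like, using only the standard library. It assumes every provider exposes an OpenAI-compatible `/chat/completions` endpoint; the environment variable names (`ROUTER_BASE_URL`, `REASONER_MODEL`, and so on), the role names, and the `Provider` type are hypothetical. The request is built but not sent, so the example runs without a server.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping
from urllib import request


@dataclass(frozen=True)
class Provider:
    base_url: str
    model: str
    api_key: str


def provider_for(role: str, env: Mapping[str, str] = os.environ) -> Provider:
    """Look up base URL, model, and key for a role such as 'router' or 'reasoner'."""
    prefix = role.upper()
    return Provider(
        base_url=env[f"{prefix}_BASE_URL"].rstrip("/"),
        model=env[f"{prefix}_MODEL"],
        # A local server may not check the key, so a placeholder is used if unset.
        api_key=env.get(f"{prefix}_API_KEY", "not-needed"),
    )


def build_chat_request(provider: Provider, user_message: str) -> request.Request:
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    body = {
        "model": provider.model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return request.Request(
        url=f"{provider.base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {provider.api_key}",
        },
        method="POST",
    )
```

With this shape, moving the routing node from a cloud endpoint to a local vLLM server means changing `ROUTER_BASE_URL` and `ROUTER_MODEL`; the node code that calls `build_chat_request` stays the same.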

The infrastructure should serve the product, not the other way around. Start with the simplest thing that works, measure actual costs and latency, and add complexity only when the data justifies it. That is not a platitude - it is the lesson from every team we have seen that self-hosted too early and spent six months debugging GPU driver issues instead of building features.

FAQ

What is the minimum GPU I need to run AI agents on my own server?

For small models (7-8B parameters) like Llama 3.1 8B or Mistral 7B, the FP16 weights alone take about 16 GB of VRAM - an RTX 4090 (24 GB) is the practical minimum once you leave room for the KV cache. For larger models like Llama 3.1 70B with INT4 quantization, the weights take about 35 GB, so you need 40+ GB of VRAM, which means an RTX A6000 (48 GB) or two RTX 4090s with tensor parallelism. An A100 (80 GB) is recommended for running 70B models at higher precision: INT8 weights take about 70 GB, which is a tight fit. Unquantized FP16 needs about 140 GB for the weights, which takes two 80 GB GPUs.
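The sizing rule behind these numbers is parameters times bytes per parameter, plus headroom for the KV cache. A rough sketch of that estimate follows; the 0.9 usable fraction is an assumption modelled on the share of GPU memory an inference server is typically allowed to claim, and the function names are illustrative. Real usage also depends on context length, batch size, and the quantization format.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Approximate VRAM for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def kv_headroom_gb(params_billion: float, precision: str, gpu_vram_gb: float,
                   usable_fraction: float = 0.9) -> float:
    """VRAM left for KV cache and activations; negative means the model does not fit."""
    usable = gpu_vram_gb * usable_fraction
    return usable - weight_vram_gb(params_billion, precision)


if __name__ == "__main__":
    for label, params, precision, vram in [
        ("Llama 8B FP16 on RTX 4090", 8, "fp16", 24),
        ("Llama 70B INT4 on A6000", 70, "int4", 48),
        ("Llama 70B INT8 on A6000", 70, "int8", 48),
        ("Llama 70B FP16 on A100", 70, "fp16", 80),
    ]:
        headroom = kv_headroom_gb(params, precision, vram)
        status = "fits" if headroom > 0 else "does not fit"
        print(f"{label}: {headroom:+.1f} GB headroom ({status})")
```

A small positive headroom means the model loads but can hold only a few concurrent sequences, which is why a model that technically fits can still be a poor choice for serving.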

How many concurrent users can a self-hosted AI agent server handle?

It depends on the model size and GPU. A single RTX 4090 running Llama 3.1 8B with vLLM can handle 5-10 concurrent requests thanks to PagedAttention's efficient memory management. An A6000 running Llama 70B at INT4 handles 3-5 concurrent requests. An A100 (80 GB) running the same model at INT8 handles 8-15 concurrent requests. Beyond these limits, requests queue and latency grows roughly linearly with queue depth. Add more GPU instances behind a load balancer to scale horizontally.
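The queueing can also be enforced on the application side, so that excess requests wait in your API layer instead of piling up on the GPU. A minimal sketch using `asyncio.Semaphore`: `InflightLimiter` and `fake_inference` are hypothetical names, the sleep stands in for a real HTTP call to the inference server, and the limit of 5 is simply the lower bound quoted above for the RTX 4090.

```python
import asyncio


class InflightLimiter:
    """Cap concurrent calls to the inference backend; extra callers wait in line."""

    def __init__(self, max_concurrent: int) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, useful for dashboards

    async def run(self, coro_fn, *args):
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_fn(*args)
            finally:
                self.in_flight -= 1


async def fake_inference(prompt: str) -> str:
    """Stand-in for a call to the inference server; sleeps instead of generating."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def demo(n_requests: int = 20, max_concurrent: int = 5):
    """Fire n_requests at once and report the results and peak concurrency."""
    limiter = InflightLimiter(max_concurrent)
    results = await asyncio.gather(
        *(limiter.run(fake_inference, f"q{i}") for i in range(n_requests))
    )
    return results, limiter.peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(f"{len(results)} requests served, peak concurrency {peak}")
```

All 20 requests complete, but never more than 5 run at the same time; the rest wait their turn, which is the linear latency growth described above.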

Is the quality of self-hosted open models comparable to GPT-4o or Claude?

For many tasks, yes. Llama 3.1 70B and Qwen 2.5 72B perform comparably to GPT-4o on structured tasks like data extraction, classification, SQL generation, and tool calling. They lag behind on complex multi-step reasoning, nuanced creative writing, and ambiguous instructions. The 8B models are closer to GPT-4o-mini in capability. Always benchmark your specific use case with open models before committing to self-hosting - quality varies significantly by task type.

How do I keep my self-hosted models up to date?

Model updates are manual: you download the new weights from Hugging Face, update the model name in your Docker Compose configuration, and restart vLLM. The download can happen while the current model is still serving traffic. To avoid downtime, use a blue-green deployment pattern: start a second vLLM instance with the new model, run your evaluation suite against it, and switch traffic over once validated. This needs a second GPU or enough free VRAM to hold both models at once; on a single fully loaded GPU, plan a short maintenance window instead. Budget 1-2 hours per model update including testing.
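The "switch traffic over once validated" step is easier to trust when the decision is automated. The sketch below shows one possible gate, under loud assumptions: the URLs, the substring-match scoring, and the 95% threshold are all hypothetical, and `generate` is any callable that sends a prompt to the candidate instance and returns its text. A real evaluation suite would use task-specific checks.

```python
from typing import Callable, List, Tuple

# Hypothetical addresses of the current (blue) and candidate (green) instances.
BLUE = "http://vllm-blue:8000/v1"
GREEN = "http://vllm-green:8000/v1"


def pass_rate(generate: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("evaluation suite is empty")
    passed = sum(
        1 for prompt, expected in cases
        if expected.lower() in generate(prompt).lower()
    )
    return passed / len(cases)


def choose_backend(green_generate: Callable[[str], str],
                   cases: List[Tuple[str, str]],
                   threshold: float = 0.95) -> str:
    """Return the URL that should receive traffic after evaluating green."""
    return GREEN if pass_rate(green_generate, cases) >= threshold else BLUE
```

The returned URL is what you would write into the nginx upstream or load balancer configuration. If the candidate fails the suite, traffic stays on blue and nothing changes for users.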

Can I run multiple different models on the same server?

Yes, but with constraints. Each vLLM server instance serves a single base model, so in practice you run one model per GPU. If you have multiple GPUs, you can run a different model on each (one vLLM instance per GPU). Two small models can share one GPU as separate instances if you split memory between them with --gpu-memory-utilization, but each then gets a smaller KV cache. Alternatively, some frameworks support model swapping, but swap time is 60-120 seconds, making it impractical for real-time serving. The practical approach for multi-model setups is either multiple GPUs or a hybrid architecture where you self-host your primary high-volume model and use cloud APIs for secondary models.

What happens if my self-hosted server goes down during an agent task?

If you use LangGraph with PostgreSQL checkpointing (as described in this guide), the agent state is persisted after every step. When the server comes back online, that state is still in Postgres, but nothing resumes on its own: the client reconnects and re-sends the last request with the same thread ID, and the agent continues from the last checkpoint instead of starting over. Without checkpointing, the entire conversation state is lost and the user must start over. This is why Postgres checkpointing is non-negotiable for production deployments.
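The mechanism is easier to see in a stripped-down form. The sketch below is not LangGraph's API: it uses `sqlite3` as a stand-in for Postgres and a hypothetical three-step agent, purely to illustrate the pattern of saving state after every step under a thread ID and picking up from the saved step on the next call.

```python
import json
import sqlite3

STEPS = ["plan", "search", "summarize"]  # hypothetical agent steps


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the checkpoint store (SQLite here, Postgres in the real setup)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
    )
    return conn


def load_checkpoint(conn: sqlite3.Connection, thread_id: str):
    """Return (next_step, state) for a thread, or a fresh start if none exists."""
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE thread_id = ?",
        (thread_id,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, {"done": []})


def save_checkpoint(conn: sqlite3.Connection, thread_id: str,
                    next_step: int, state: dict) -> None:
    """Persist the state reached so far for this thread."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
        (thread_id, next_step, json.dumps(state)),
    )
    conn.commit()


def run_agent(conn: sqlite3.Connection, thread_id: str, stop_after=None) -> dict:
    """Run the remaining steps, saving a checkpoint after each one.

    stop_after simulates a crash: the call returns after that many steps,
    leaving the later steps for the next call with the same thread ID.
    """
    next_step, state = load_checkpoint(conn, thread_id)
    executed = 0
    for i in range(next_step, len(STEPS)):
        if stop_after is not None and executed >= stop_after:
            break
        state["done"].append(STEPS[i])
        save_checkpoint(conn, thread_id, i + 1, state)
        executed += 1
    return state
```

Calling `run_agent(conn, "thread-42", stop_after=2)` and then `run_agent(conn, "thread-42")` runs `plan` and `search` once, then only `summarize` on the second call. A different thread ID starts from the beginning, which is why the client must re-send the same one.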

