Technical · 2026-06-03 · Last verified 2026-06-03

Self-Hosted LLM Agent Stack: vLLM + LangGraph + Postgres in Production

Complete guide to building a self-hosted LLM agent stack with vLLM for inference, LangGraph for orchestration, and Postgres for state, vectors, and history. Includes docker-compose, benchmarks, and production monitoring.

Deep · ML Architect & Full Stack Engineer

10+ years shipping production ML across TensorFlow, PyTorch, AWS, and GCP. Ships every A8gent agent before it becomes a lesson.

Key takeaways

A self-hosted LLM agent stack built on vLLM, LangGraph, and Postgres gives you full control over inference costs, data privacy, and latency - with no per-token API fees after the initial GPU investment.
vLLM with AWQ Q4 quantization runs Llama 3.1 70B on a single A100 80GB at 35-45 tokens/second, which is sufficient for most production agent workloads without requiring multi-GPU setups.
Postgres serves triple duty as vector store (pgvector), checkpoint backend (LangGraph state persistence), and conversation history database - eliminating the need for a separate key-value state store, Pinecone, or other dedicated vector DB service. Redis stays in the stack only as an optional cache.
LangGraph connects to vLLM through the OpenAI-compatible API endpoint, meaning you can swap between local vLLM and a cloud OpenAI-compatible API by changing environment variables (base URL, model name, API key) - no code changes required.
Horizontal scaling works by running multiple vLLM replicas behind an Nginx load balancer, with Prometheus and Grafana providing the visibility needed to right-size GPU allocation based on actual request patterns.

The Reference Architecture

The self-hosted LLM agent stack we are building has five components, each handling a specific responsibility. Understanding how they connect before writing any code prevents the kind of integration problems that burn days in production.

vLLM is the inference server. It loads the model weights into GPU memory, accepts prompt requests via an OpenAI-compatible HTTP API, and returns completions. vLLM handles batching, KV-cache management, and PagedAttention internally - you send it prompts, it returns tokens. It exposes a /v1/chat/completions endpoint that any OpenAI SDK client can hit directly. This is the most resource-intensive component and the one you will spend the most time tuning.

LangGraph is the orchestrator. It runs as a Python FastAPI service that receives user requests, constructs the agent graph (nodes, edges, conditional routing), calls vLLM for inference, executes tools, and manages state transitions. LangGraph connects to vLLM using the standard OpenAI Python client pointed at your local vLLM URL instead of api.openai.com. This is where your agent logic lives - the reasoning loops, tool calls, and decision branching.

Postgres is the unified data backend. A single Postgres instance (with the pgvector extension) handles three jobs: storing LangGraph checkpoints for state persistence and crash recovery, storing conversation history for multi-turn interactions, and serving as a vector database for RAG retrieval. Using one database for all three eliminates operational complexity - one backup strategy, one connection pool, one set of credentials. For most agent workloads under 10M vectors, Postgres with pgvector performs within 10-15% of dedicated vector databases like Pinecone or Weaviate.

Redis handles caching and rate limiting. Frequently requested embeddings, tool call results, and session metadata live here. Redis is optional for small deployments but becomes important at scale - caching repeated RAG lookups alone can cut vLLM load by 20-30% in production agents that handle similar queries.

Nginx sits in front as the reverse proxy and load balancer. It routes /api/agent requests to the LangGraph service and /v1 requests to vLLM (useful for direct model access during development). When you scale to multiple vLLM replicas, Nginx distributes inference requests across them. It also handles TLS termination, request buffering, and basic rate limiting at the network layer.

The request flow: a client sends a message to Nginx, which routes it to the LangGraph service. LangGraph loads the conversation state from Postgres, constructs the agent graph, and calls vLLM for the first inference. If the model decides to use a tool, LangGraph executes the tool (possibly querying Postgres for RAG vectors or cached data from Redis), appends the result to the state, and calls vLLM again. This loop continues until the model produces a final response. LangGraph checkpoints the state to Postgres and returns the response through Nginx to the client.

Everything runs in Docker containers managed by docker-compose. The entire stack starts with a single docker-compose up command. For teams evaluating whether self-hosting makes financial sense for their specific workload, our ROI calculator can help model the GPU cost vs. API cost tradeoff.

Why vLLM + LangGraph + Postgres

Choosing infrastructure components is about understanding the tradeoffs. Here is why each piece of this stack earns its place, with specific numbers rather than marketing claims.

vLLM vs. alternatives. vLLM consistently benchmarks at the top for throughput on production workloads. On an A100 80GB running Llama 3.1 70B with AWQ Q4 quantization, vLLM delivers 35-45 tokens/second for single requests and 800-1200 tokens/second aggregate throughput under concurrent load (8-16 simultaneous requests). Compare this to: llama.cpp at 15-25 tok/s single-request (optimized for CPU and low-memory scenarios, not throughput), TGI (Text Generation Inference from HuggingFace) at 30-40 tok/s (close to vLLM but with less flexible batching), and Ollama at 20-30 tok/s (excellent developer experience but not designed for production concurrency). vLLM's PagedAttention algorithm is the key differentiator - it manages GPU memory like an operating system manages RAM, enabling much higher batch sizes without OOM errors.

vLLM also provides an OpenAI-compatible API out of the box. This is not a nice-to-have; it is architecturally critical. The major LLM frameworks (LangChain, LangGraph, LlamaIndex, CrewAI) have first-class support for the OpenAI API format. By running vLLM with its OpenAI-compatible server, your orchestration code works identically whether hitting your local vLLM instance or the actual OpenAI API. Switching between self-hosted and cloud inference is a configuration change (base URL, plus model name and API key), not a code change.
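A minimal sketch of how that switch can live entirely in configuration. VLLM_BASE_URL and MODEL_NAME are the variable names used later in this guide; LLM_API_KEY and the resolve_llm_settings helper are hypothetical names introduced here for illustration. In practice the model name and API key usually change together with the base URL.

```python
import os
from typing import Dict, Mapping


def resolve_llm_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect OpenAI-compatible client settings from the environment."""
    return {
        # Defaults point at the local vLLM container
        "base_url": env.get("VLLM_BASE_URL", "http://vllm:8000/v1"),
        "model": env.get("MODEL_NAME", "TheBloke/Llama-3.1-70B-AWQ"),
        # LLM_API_KEY is a hypothetical variable name used for illustration
        "api_key": env.get("LLM_API_KEY", "not-needed"),
    }


# Local vLLM: nothing set, so the defaults apply
local = resolve_llm_settings({})

# Hosted API: same code path, different configuration
hosted = resolve_llm_settings({
    "VLLM_BASE_URL": "https://api.openai.com/v1",
    "MODEL_NAME": "gpt-4o",
    "LLM_API_KEY": "sk-example",
})
```

The resulting dictionary can be unpacked into whichever OpenAI-compatible client you use, so the orchestration code never branches on where inference runs.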

LangGraph vs. alternatives. LangGraph models agent execution as a directed graph with explicit state management. The advantage over linear chain frameworks (vanilla LangChain, LlamaIndex pipelines) is that agent behavior is inherently non-linear - an agent reasons, acts, observes, and loops. Representing this as a graph with conditional edges makes the control flow explicit, testable, and debuggable. You can visualize the graph, test individual nodes in isolation, and reason about every possible execution path. For a detailed walkthrough of LangGraph fundamentals, see our LangGraph tutorial.

The state management model is what separates LangGraph from simpler agent loops. State is a typed dictionary with reducer annotations that control how updates merge. Combined with PostgresSaver checkpointing, you get durable state persistence that survives process restarts, enables human-in-the-loop workflows, and supports crash recovery. An agent that crashes mid-execution resumes from the last checkpoint rather than starting over. For production agents handling real user conversations, this is non-negotiable.
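To make the reducer idea concrete, here is a simplified model in plain Python. It is not LangGraph's implementation: REDUCERS and apply_update are hypothetical names, and in LangGraph itself the reducer is attached to a state key through a type annotation such as Annotated[list, operator.add].

```python
import operator
from typing import Any, Callable, Dict

# A reducer decides how an update to a key is merged into existing state.
REDUCERS: Dict[str, Callable[[Any, Any], Any]] = {
    "messages": operator.add,  # append new messages to the history
}


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the current state."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value  # no reducer: last write wins
    return merged


state = {"messages": ["user: hi"], "step": 0}
state = apply_update(state, {"messages": ["assistant: hello"], "step": 1})
# messages were appended, step was overwritten
```

Keys with a reducer accumulate across nodes, while keys without one are simply replaced, which is why message history grows but scalar fields do not.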

Postgres vs. separate databases. The conventional approach uses a dedicated vector database (Pinecone, Weaviate, Qdrant) for RAG, a key-value store for state, and a relational database for history. That is three databases to provision, monitor, back up, and secure. Postgres with pgvector consolidates all three into one system. The pgvector extension supports HNSW and IVFFlat indexes on vector columns, handles cosine similarity and L2 distance searches, and integrates with standard SQL - meaning you can join vector search results with relational data in a single query.

The performance tradeoff is real but manageable. Dedicated vector databases are 2-5x faster for pure vector search at scale (100M+ vectors). But most agent workloads deal with 100K-10M vectors, where pgvector with HNSW indexes delivers sub-50ms search latency. At that scale, the operational simplicity of one database far outweighs the raw search speed advantage of a dedicated solution. If your workload grows beyond 10M vectors, you can add a dedicated vector DB later without changing the rest of the stack.

The cost argument is straightforward. Running Llama 3.1 70B on a single A100 80GB instance costs roughly $1.50-2.00/hour on major cloud providers. At a sustained 1000 tokens/second of aggregate throughput, that is 3.6M tokens per hour, or about $0.0004-0.0006 per 1000 tokens at full utilization - roughly 10x cheaper than GPT-4o API pricing. Real traffic rarely sustains peak throughput, so the effective cost rises as utilization falls: an always-on instance costs $36-48 per day regardless of load. The breakeven point therefore depends on your volume, but most teams processing more than 5M tokens per day save money self-hosting. For teams that want the benefits of self-hosting with managed infrastructure, our guide on running AI agents on your own server covers the full operational picture.
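The arithmetic is worth re-running with your own numbers. The sketch below derives cost per 1000 tokens from an hourly GPU price and sustained throughput; the function names and the utilization parameter are assumptions for illustration, and the inputs are the figures quoted above.

```python
def cost_per_1k_tokens(gpu_usd_per_hour: float,
                       tokens_per_second: float,
                       utilization: float = 1.0) -> float:
    """Cost of 1000 generated tokens at a given average utilization."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1000


def daily_cost_per_1k(gpu_usd_per_hour: float, tokens_per_day: float) -> float:
    """Effective cost when the instance runs 24h for a fixed daily volume."""
    return gpu_usd_per_hour * 24 / tokens_per_day * 1000


full = cost_per_1k_tokens(2.00, 1000)          # about $0.00056 at full load
tenth = cost_per_1k_tokens(2.00, 1000, 0.10)   # about $0.0056 at 10% load
breakeven = daily_cost_per_1k(2.00, 5_000_000)  # about $0.0096 at 5M tokens/day
```

Comparing the last figure against your current API price per 1000 tokens tells you which side of the breakeven your workload sits on.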

vLLM Setup: Docker, Models, and GPU Memory Planning

Getting vLLM running correctly is the foundation of the entire stack. This section covers the practical decisions: which model, which quantization, how much GPU memory, and the exact Docker commands to get inference serving.

Model selection. For a self-hosted agent stack, you want a model that handles tool calling reliably, follows system prompts consistently, and fits in your GPU memory budget. The practical choices in mid-2026 are: Llama 3.1 70B (strong general performance, excellent tool calling), Llama 3.1 8B (fits on consumer GPUs, good for development and lower-complexity tasks), Mistral Large (competitive with 70B Llama at similar resource requirements), and Qwen2.5 72B (strong multilingual support). For production agent workloads, the 70B class models are the sweet spot - 8B models work but produce noticeably more tool-calling errors and require more careful prompt engineering.

Quantization choices. Quantization reduces model size and memory requirements at the cost of some quality degradation. The options that matter:

FP16 (no quantization): Llama 3.1 70B requires ~140GB VRAM. You need 2x A100 80GB GPUs with tensor parallelism. Best quality, highest cost. Use this only if you have the hardware and need maximum quality.

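8-bit quantization (for example INT8): ~70GB VRAM. Fits on a single A100 80GB, but at gpu_memory_utilization=0.90 that leaves only about 2GB for KV cache, so on one GPU it is practical only with short contexts and low concurrency. Quality loss is minimal - less than 1% degradation on standard benchmarks. AWQ checkpoints are 4-bit, so 8-bit weights need a different quantization format. Choose this when quality matters more than batch size, or when you can spread the model across two GPUs.

AWQ Q4 (4-bit quantization): ~35GB VRAM. Fits on an A6000 48GB with room for KV cache; on an A100 40GB the weights load but leave almost nothing for the cache. Quality loss is measurable but acceptable for most agent tasks - about 2-4% degradation on reasoning benchmarks. Tool calling accuracy remains high. This is the recommended choice for cost-sensitive deployments.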

GPTQ Q4: Similar memory to AWQ Q4 but with slightly slower inference on vLLM. AWQ is generally preferred for vLLM deployments.

GPU memory planning. The model weights are only part of the memory equation. vLLM also needs memory for the KV cache (which grows with context length and batch size). A rough formula: total VRAM = model weights + (max_batch_size * max_context_length * kv_cache_per_token). For Llama 3.1 70B Q4 on an A100 80GB with the default gpu_memory_utilization=0.90, vLLM claims 72GB: 35GB for weights and up to about 37GB for KV cache and activations, leaving 8GB of headroom outside vLLM's budget. vLLM pre-allocates that KV cache and manages it with PagedAttention, but you should understand the tradeoff: a higher utilization setting means more KV cache, larger batches, and more throughput, while less headroom raises the risk of OOM under spike loads.
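A worked version of that formula, as a rough sketch. It assumes the published Llama 3.1 70B architecture (80 layers, 8 key/value heads through grouped-query attention, head dimension 128) and an FP16 KV cache, and it ignores activation memory and CUDA overhead. The helper name is hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2 covers the separate key and value tensors in every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# Worst case for the flags used in this guide:
# --max-num-seqs 16, each sequence at --max-model-len 8192
worst_case_tokens = 16 * 8192
worst_case_gib = per_token * worst_case_tokens / GIB

# Budget on an A100 80GB: 90% of memory minus ~35GB of Q4 weights
budget_gib = 80 * 0.90 - 35

print(f"{per_token / 1024:.0f} KiB per token")           # 320 KiB per token
print(f"worst case {worst_case_gib:.0f} GiB, budget {budget_gib:.0f} GiB")
```

Under these assumptions the worst case (16 sequences all at 8K context) needs about 40 GiB, slightly more than the budget, so vLLM would queue or preempt sequences in that situation rather than run all of them at full length.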

The vLLM Docker startup command with all the important flags:

docker run -d \
  --name vllm-server \
  --gpus all \
  --shm-size 16g \
  -p 8000:8000 \
  -v /data/models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=hf_your_token \
  vllm/vllm-openai:latest \
  --model TheBloke/Llama-3.1-70B-AWQ \
  --quantization awq \
  --dtype half \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --port 8000

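Key flags explained: --model takes a Hugging Face repository ID, so confirm that the AWQ checkpoint you name actually exists on the Hub and substitute your own if it does not. --max-model-len 8192 caps the context window at 8K tokens, which is sufficient for most agent interactions and saves significant KV cache memory versus the full 128K context. Increase this if your RAG pipeline injects large documents. --max-num-seqs 16 limits each batch to 16 concurrent sequences; additional requests wait in vLLM's queue instead of triggering OOM under burst traffic. --enable-prefix-caching caches common prompt prefixes (like your system prompt) across requests, improving throughput by 10-20% for agents that use the same system prompt repeatedly. --shm-size 16g is a Docker flag rather than a vLLM one: it enlarges the container's shared memory, which PyTorch uses to pass data between processes.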

Verify the server is running and responding:

curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Llama-3.1-70B-AWQ",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'

If the model takes more than 5 minutes to load, check docker logs vllm-server for progress. First load downloads the model weights from HuggingFace (roughly 35GB for Q4), which can take 10-30 minutes depending on bandwidth. Subsequent starts load from the local cache in /data/models and take 2-4 minutes. For development, start with the 8B model (--model meta-llama/Llama-3.1-8B-Instruct, without --quantization awq because that checkpoint is not quantized), which loads in under 30 seconds and runs on a single RTX 4090.

LangGraph Orchestrator: Connecting to Local vLLM

The LangGraph service is a Python FastAPI application that implements your agent logic and connects to vLLM as if it were the OpenAI API. The key architectural insight: LangGraph does not care whether inference comes from OpenAI, Anthropic, or your local vLLM server. You point the OpenAI client at http://vllm:8000/v1 instead of https://api.openai.com/v1, and everything works.

The agent code with vLLM integration and Postgres checkpointing:

import os

from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

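# Connect to vLLM using the OpenAI-compatible client
llm = ChatOpenAI(
    base_url=os.getenv("VLLM_BASE_URL", "http://vllm:8000/v1"),
    # vLLM ignores the key unless started with --api-key; hosted APIs need
    # a real one. LLM_API_KEY is an illustrative variable name.
    api_key=os.getenv("LLM_API_KEY", "not-needed"),
    model=os.getenv("MODEL_NAME", "TheBloke/Llama-3.1-70B-AWQ"),
    temperature=0.1,
    max_tokens=2048,
    timeout=60,  # fail fast when vLLM is saturated (see Error handling)
)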


@tool
def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base for relevant documents.
    Use this when the user asks a question that requires specific
    domain knowledge or factual information from our documentation."""
    import json

    # Placeholder: in production, embed the query and run the pgvector
    # similarity search shown in the Postgres section (async, via a pool).
    results = []
    return json.dumps(results)


@tool
def execute_sql_query(query: str) -> str:
    """Execute a read-only SQL query against the analytics database.
    Use this when the user asks about metrics, counts, or data analysis.
    Only SELECT queries are allowed."""
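    statement = query.strip().rstrip(";")
    # A prefix check is only a first filter, not a security boundary:
    # also run this tool with a read-only database role.
    if not statement.upper().startswith("SELECT") or ";" in statement:
        return "Error: only single SELECT queries are permitted."
    # Execute against read replica
    return "Query results would appear here"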


# Bind tools to the LLM
tools = [search_knowledge_base, execute_sql_query]
llm_with_tools = llm.bind_tools(tools)


# Define the agent node
async def agent_node(state: MessagesState):
    response = await llm_with_tools.ainvoke(state["messages"])
    return {"messages": [response]}


# Define routing logic
def should_continue(state: MessagesState):
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "__end__"


# Build the graph
workflow = StateGraph(MessagesState)
workflow.add_node("agent", agent_node)
workflow.add_node("tools", ToolNode(tools))

workflow.add_edge(START, "agent")
workflow.add_conditional_edges("agent", should_continue, {
    "tools": "tools",
    "__end__": END,
})
workflow.add_edge("tools", "agent")

# Compile with Postgres checkpointing
DB_URI = os.getenv(
    "DATABASE_URL",
    "postgresql://agent:agent@postgres:5432/agent_db"
)


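# Keep one checkpointer (and its connection pool) alive for the whole
# process. Returning from inside `async with ...from_conn_string(...)`
# would close the connection before the compiled graph is ever used.
from psycopg.rows import dict_row
from psycopg_pool import AsyncConnectionPool

_pool = None
_checkpointer = None
_app = None


async def get_app():
    global _pool, _checkpointer, _app
    if _app is None:
        _pool = AsyncConnectionPool(
            DB_URI,
            max_size=20,
            open=False,
            kwargs={"autocommit": True, "prepare_threshold": 0,
                    "row_factory": dict_row},
        )
        await _pool.open()
        _checkpointer = AsyncPostgresSaver(_pool)
        await _checkpointer.setup()  # creates checkpoint tables if missing
        _app = workflow.compile(checkpointer=_checkpointer)
    return _app, _checkpointer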

The FastAPI wrapper that exposes the agent as an HTTP endpoint:

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from langchain_core.messages import HumanMessage
import json

api = FastAPI(title="Agent API")


class ChatRequest(BaseModel):
    message: str
    thread_id: str


@api.post("/api/agent/chat")
async def chat(request: ChatRequest):
    app, _ = await get_app()
    config = {"configurable": {"thread_id": request.thread_id}}

    result = await app.ainvoke(
        {"messages": [HumanMessage(content=request.message)]},
        config=config,
    )

    last_msg = result["messages"][-1]
    return {"response": last_msg.content, "thread_id": request.thread_id}


@api.get("/health")
async def health():
    return {"status": "ok"}

State management details. Each conversation is identified by a thread_id. When a request comes in with an existing thread_id, LangGraph loads the full state from the Postgres checkpoint - including all previous messages, tool results, and custom state fields. The agent sees the complete conversation history and responds in context. New thread_ids start fresh. This is the mechanism that gives your agent memory across interactions without any additional code.

Connecting to vLLM specifics. The ChatOpenAI class from langchain-openai works directly with vLLM's OpenAI-compatible API. Set base_url to your vLLM service URL and api_key to any non-empty string (vLLM does not validate keys by default). The model name must match the model you loaded in vLLM. Tool calling uses the OpenAI function calling format and needs two things: a model trained with tool-calling capabilities (Llama 3.1, Mistral, Qwen2.5) and a vLLM server started with --enable-auto-tool-choice plus a matching --tool-call-parser (for example llama3_json for Llama 3.1). Without those flags, requests that let the model choose a tool are rejected.

One important caveat: tool calling quality varies by model. Llama 3.1 70B handles tool calling reliably for 2-3 tools. With 5+ tools, accuracy drops and the model occasionally generates malformed tool call arguments. If you need many tools, consider the tool grouping pattern described in our LangGraph tutorial - use a routing step to select relevant tools before binding them to the LLM.
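A minimal sketch of that routing step, assuming a keyword-based router. TOOL_GROUPS, GROUP_KEYWORDS, and select_tools are hypothetical names, and the keyword lists are placeholders; in a real agent the routing decision could just as well come from a small classifier or a cheap LLM call.

```python
from typing import Dict, List, Tuple

# Tool names match the two tools defined earlier in this section
TOOL_GROUPS: Dict[str, List[str]] = {
    "knowledge": ["search_knowledge_base"],
    "analytics": ["execute_sql_query"],
}

# Placeholder keywords per group; tune or replace for your domain
GROUP_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "knowledge": ("docs", "documentation", "how do", "what is"),
    "analytics": ("metric", "average", "how many", "total"),
}


def select_tools(message: str, max_tools: int = 3) -> List[str]:
    """Return the names of the tools worth binding for this message."""
    text = message.lower()
    selected: List[str] = []
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            selected.extend(TOOL_GROUPS[group])
    if not selected:
        # Nothing matched: fall back to the general-purpose group
        selected = list(TOOL_GROUPS["knowledge"])
    return selected[:max_tools]


analytics_tools = select_tools("How many signups did we get last week?")
knowledge_tools = select_tools("What is our refund policy?")
```

The agent node would then bind only the selected tools before calling the model, keeping the tool list short enough for reliable argument generation.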

Error handling. The agent should handle vLLM timeouts gracefully. vLLM can be slow to respond under heavy load (especially when the KV cache is full and requests are queued). Set request_timeout=60 on the ChatOpenAI client and catch timeout exceptions in your agent node. Return a user-friendly message rather than crashing the graph. With checkpointing enabled, the user can retry and the conversation state is preserved.
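The pattern can be sketched with plain asyncio. call_with_timeout, FALLBACK, and the two fake model coroutines are hypothetical stand-ins; in the agent node the awaited call would be the LLM invocation, and the client library may raise its own timeout exception type that you would catch alongside this one.

```python
import asyncio
from typing import Awaitable, Callable

FALLBACK = "The model is taking longer than expected. Please try again."


async def call_with_timeout(call: Callable[[], Awaitable[str]],
                            timeout_s: float,
                            fallback: str = FALLBACK) -> str:
    """Await the call, returning a friendly message if it runs too long."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The pending call is cancelled by wait_for; the graph keeps running
        return fallback


async def slow_model() -> str:
    await asyncio.sleep(0.2)  # simulates a saturated inference server
    return "done"


async def fast_model() -> str:
    return "done"


slow_result = asyncio.run(call_with_timeout(slow_model, timeout_s=0.05))
fast_result = asyncio.run(call_with_timeout(fast_model, timeout_s=0.05))
```

Because the fallback is returned as a normal value, the node can wrap it in an assistant message and let the checkpoint record the turn, so a retry continues the same thread.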

Postgres as Unified Backend: Vectors, State, and History

Running Postgres as the single backend for your agent stack is one of the highest-leverage decisions in this architecture. You get vector search, state persistence, and relational data in one system with one operational footprint. Here is the schema and configuration that makes it work.

First, enable the pgvector extension and create the schema:

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- RAG document embeddings
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    metadata JSONB DEFAULT '{}',
    embedding vector(1536),  -- dimension matches your embedding model
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for fast similarity search
CREATE INDEX idx_documents_embedding ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

-- Conversation history (separate from LangGraph checkpoints)
CREATE TABLE conversations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    thread_id TEXT NOT NULL,
    role TEXT NOT NULL,  -- 'user', 'assistant', 'tool'
    content TEXT NOT NULL,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_conversations_thread ON conversations(thread_id, created_at);

-- LangGraph checkpoint tables are created automatically
-- by AsyncPostgresSaver.setup(), but here is what they look like:
--
-- checkpoints: stores serialized graph state per thread
-- checkpoint_blobs: stores large channel values referenced by checkpoints
-- checkpoint_writes: stores individual node outputs
-- checkpoint_migrations: tracks schema versioning

-- Application-specific tables
CREATE TABLE tool_call_log (
    id SERIAL PRIMARY KEY,
    thread_id TEXT NOT NULL,
    tool_name TEXT NOT NULL,
    input_args JSONB,
    output TEXT,
    duration_ms INTEGER,
    success BOOLEAN DEFAULT true,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_tool_calls_thread ON tool_call_log(thread_id, created_at);
CREATE INDEX idx_tool_calls_name ON tool_call_log(tool_name, created_at);

pgvector for RAG. The vector search query for your agent's knowledge base tool looks like this:

SELECT content, metadata, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE 1 - (embedding <=> $1::vector) > 0.7
ORDER BY embedding <=> $1::vector
LIMIT 5;

The <=> operator computes cosine distance. The WHERE clause filters out low-similarity results (below 0.7 threshold). The HNSW index makes this query return in under 50ms for tables with up to 5M rows. For tables larger than 5M rows, consider partitioning by document category or date, and running the similarity search against the relevant partition.
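To see what the distance operator and the 0.7 threshold mean numerically, here is a small pure-Python sketch. It reproduces the cosine distance formula and the text format pgvector accepts for a vector parameter; the helper names and the sample vectors are made up for illustration.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, the quantity the <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def to_pgvector_literal(vector: Sequence[float]) -> str:
    """Format a vector as the '[x,y,z]' text that can be cast to vector."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"


query = [1.0, 0.0]
docs = {
    "same direction": [2.0, 0.0],   # similarity 1.0
    "45 degrees": [1.0, 1.0],       # similarity about 0.707
    "orthogonal": [0.0, 1.0],       # similarity 0.0
}

# Same filter as the WHERE clause: keep similarity above 0.7
kept = [name for name, vec in docs.items()
        if 1 - cosine_distance(query, vec) > 0.7]
```

Note that magnitude does not matter for cosine distance, only direction, which is why the first document scores a perfect match despite being twice as long.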

Embedding generation. You have two options for generating embeddings: use vLLM (if it supports your embedding model) or run a separate lightweight embedding service. For most setups, a separate embedding model is simpler. The sentence-transformers/all-MiniLM-L6-v2 model runs on CPU, generates 384-dimension embeddings, and processes 1000 documents per second. For higher quality, use BAAI/bge-large-en-v1.5 (1024 dimensions) or OpenAI's text-embedding-3-small (1536 dimensions). If you use a non-1536-dimension model, update the vector column dimension in the schema above.

Connection pooling. Both the LangGraph checkpointer and your application code open Postgres connections. Without pooling, each concurrent agent execution holds a database connection for the duration of the request (potentially 10-30 seconds for multi-step agents). With 16 concurrent agent executions, that is 16 connections just for checkpointing, plus connections for RAG queries and tool calls. Postgres defaults to max 100 connections. Use PgBouncer in transaction pooling mode to multiplex many application connections over a smaller number of Postgres connections:

# pgbouncer.ini
[databases]
agent_db = host=postgres port=5432 dbname=agent_db

[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20
max_client_conn = 200
server_idle_timeout = 300

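Point your application at PgBouncer (port 6432) instead of Postgres directly (port 5432). PgBouncer maintains 20 actual Postgres connections and multiplexes up to 200 client connections across them. For the LangGraph checkpointer, set the connection string to use the PgBouncer port. Transaction pooling does not preserve session state between transactions, so either disable server-side prepared statements in the client or run PgBouncer 1.21+ with max_prepared_statements set.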

Checkpoint cleanup. LangGraph stores a checkpoint after every node execution. For an agent that averages 5 nodes per invocation handling 1000 conversations per day, that is 5000 checkpoints daily. Each checkpoint is a few KB of serialized JSON. Over months, this accumulates. Run a daily cleanup job that removes checkpoints older than your retention period:

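-- LangGraph's checkpoints table has no created_at column; the creation
-- time is stored inside the checkpoint JSON under 'ts' (verify against
-- the schema of your installed langgraph-checkpoint-postgres version).
DELETE FROM checkpoints
WHERE (checkpoint->>'ts')::timestamptz < NOW() - INTERVAL '30 days'
AND thread_id NOT IN (
    SELECT DISTINCT thread_id FROM conversations
    WHERE created_at > NOW() - INTERVAL '30 days'
);
-- Rows for the same threads in checkpoint_writes and checkpoint_blobs
-- should be removed in the same job.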

This preserves checkpoints for threads that have had recent activity while cleaning up abandoned conversations. For deeper patterns on using Postgres for AI agent backends, our RAG agent guide covers the retrieval pipeline in more detail.

Putting It Together: docker-compose and Startup Sequence

Here is the complete docker-compose.yml that defines the full stack. Every service, volume, network, and health check is included. Copy this, set your environment variables, and run docker-compose up -d.

version: "3.8"

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-server
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: "16gb"
    volumes:
      - model-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model ${MODEL_NAME:-TheBloke/Llama-3.1-70B-AWQ}
      --quantization awq
      --dtype half
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --max-num-batched-tokens 16384
      --max-num-seqs 16
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser llama3_json
      --port 8000
    ports:
      - "8000:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 300s  # model loading takes time
    networks:
      - agent-net

  postgres:
    image: pgvector/pgvector:pg16
    container_name: agent-postgres
    environment:
      POSTGRES_USER: agent
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-agent_secret}
      POSTGRES_DB: agent_db
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U agent -d agent_db"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - agent-net

  redis:
    image: redis:7-alpine
    container_name: agent-redis
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks:
      - agent-net

  agent:
    build: ./agent
    container_name: agent-service
    environment:
      - VLLM_BASE_URL=http://vllm:8000/v1
      - MODEL_NAME=${MODEL_NAME:-TheBloke/Llama-3.1-70B-AWQ}
      - DATABASE_URL=postgresql://agent:${POSTGRES_PASSWORD:-agent_secret}@postgres:5432/agent_db
      - REDIS_URL=redis://redis:6379/0
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      vllm:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - agent-net

  nginx:
    image: nginx:alpine
    container_name: agent-nginx
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    ports:
      - "80:80"
      - "443:443"
    depends_on:
      agent:
        condition: service_healthy
    networks:
      - agent-net

volumes:
  model-cache:
  pgdata:
  redis-data:

networks:
  agent-net:
    driver: bridge

The Nginx configuration for routing and load balancing:

events {
    worker_connections 1024;
}

http {
    upstream agent_backend {
        server agent:8080;
    }

    upstream vllm_backend {
        server vllm:8000;
    }

    server {
        listen 80;
        server_name _;

        # Agent API
        location /api/ {
            proxy_pass http://agent_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_read_timeout 120s;  # agents can take time
            proxy_buffering off;      # for streaming responses
        }

        # Direct vLLM access (dev/testing only)
        location /v1/ {
            proxy_pass http://vllm_backend;
            proxy_set_header Host $host;
            proxy_read_timeout 120s;
        }

        # Health checks
        location /health {
            proxy_pass http://agent_backend/health;
        }
    }
}

Startup sequence. The depends_on conditions with health checks enforce the correct startup order: Postgres and Redis start first and must pass their health checks before the agent service starts. The agent service must be healthy before Nginx starts accepting traffic. vLLM starts in parallel, and the agent service waits for its health check too. The start_period: 300s on the vLLM health check gives the model 5 minutes to load, during which failed probes are not counted; after that, 5 retries at 30-second intervals add about 2.5 minutes before the container is marked unhealthy. A first run that still has to download the weights will exceed this window, so pre-pull the model or raise start_period for the initial start.
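The same wait-until-healthy logic is handy in deploy scripts and smoke tests that run outside docker-compose. The sketch below is one way to write it with the standard library; wait_until_healthy and http_ok are hypothetical helpers, and the probe and sleep functions are injectable so the loop can be exercised without a running server.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """True when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


def wait_until_healthy(url: str,
                       deadline_s: float = 300.0,
                       interval_s: float = 5.0,
                       probe: Callable[[str], bool] = http_ok,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the health endpoint until it passes or the deadline expires."""
    waited = 0.0
    while waited <= deadline_s:
        if probe(url):
            return True
        sleep(interval_s)
        waited += interval_s
    return False


# Dry run with a fake probe that turns healthy on the third attempt
attempts = []


def fake_probe(url: str) -> bool:
    attempts.append(url)
    return len(attempts) >= 3


ready = wait_until_healthy("http://vllm:8000/health", deadline_s=60,
                           probe=fake_probe, sleep=lambda seconds: None)
```

In a real script you would call wait_until_healthy with the default probe and a deadline long enough to cover model loading.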

Environment variables. Create a .env file in the same directory as docker-compose.yml:

HF_TOKEN=hf_your_huggingface_token
MODEL_NAME=TheBloke/Llama-3.1-70B-AWQ
POSTGRES_PASSWORD=your_secure_password_here

First run checklist. Before running docker-compose up -d: verify NVIDIA Docker runtime is installed (docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi should show your GPU), ensure the init.sql file with the Postgres schema from the previous section exists in the project root, and confirm your HuggingFace token has access to the model you are downloading. The first startup takes 15-30 minutes as it downloads model weights. Subsequent starts take 3-5 minutes.

Verify everything is running:

# Check all services are up
docker-compose ps

# Test vLLM directly
curl http://localhost:8000/v1/models

# Test agent endpoint
curl -X POST http://localhost:80/api/agent/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello, what can you help me with?", "thread_id": "test-1"}'

# Check Postgres
docker exec agent-postgres psql -U agent -d agent_db -c "SELECT count(*) FROM documents;"

Monitoring and Scaling: Prometheus, Grafana, and Multi-Replica vLLM

Running a self-hosted LLM agent stack without monitoring is flying blind. vLLM exposes Prometheus metrics natively, and wiring those into Grafana gives you the visibility needed to identify bottlenecks, plan capacity, and catch problems before users notice.

vLLM Prometheus metrics. vLLM exposes metrics on /metrics by default. The key metrics to track:

vllm:num_requests_running - currently processing requests. If this consistently equals max_num_seqs, you are at capacity and requests are queueing.

vllm:num_requests_waiting - requests in the queue waiting for GPU resources. Anything above 0 for sustained periods means you need more capacity.

vllm:avg_generation_throughput_toks_per_s - aggregate tokens per second. Baseline this number and alert if it drops more than 20%, which indicates GPU thermal throttling or memory pressure.

vllm:gpu_cache_usage_perc - KV cache utilization. Above 95% means the cache is full and new requests will be delayed until cache slots free up.

vllm:avg_prompt_throughput_toks_per_s - prompt processing speed, important for RAG agents that inject large context windows.

Metric names have changed between vLLM releases (newer versions expose token counters such as vllm:generation_tokens_total in place of the averaged throughput gauges), so confirm the exact names by fetching /metrics from your running server.

Add Prometheus and Grafana to your docker-compose, and declare prom-data and grafana-data under the top-level volumes key:

  prometheus:
    image: prom/prometheus:latest
    container_name: agent-prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom-data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - agent-net

  grafana:
    image: grafana/grafana:latest
    container_name: agent-grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}  # override in .env before exposing Grafana
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    networks:
      - agent-net

The Prometheus configuration to scrape vLLM and the agent service:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "vllm"
    static_configs:
      - targets: ["vllm:8000"]
    metrics_path: /metrics

  - job_name: "agent"
    static_configs:
      - targets: ["agent:8080"]
    metrics_path: /metrics

Grafana dashboard essentials. Create a dashboard with these panels:

(1) vLLM throughput - tokens/second over time, split by prompt vs. generation.
(2) Request queue depth - running + waiting requests.
(3) GPU cache utilization - percentage over time with an alert threshold at 90%.
(4) Request latency - P50, P95, P99 from your agent service.
(5) Tool call success rate - percentage of tool calls that succeed vs. fail, broken down by tool name.
(6) Active conversations - count of unique thread_ids with activity in the last hour.
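Panels (4) and (5) depend on numbers your agent service has to record itself. The sketch below shows, under simple assumptions, what those two aggregates mean: a nearest-rank percentile over raw latency samples and a per-tool success ratio. The function names and the in-memory sample lists are hypothetical. A real deployment would more likely export histograms and counters and let Prometheus compute quantiles.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct%
    of all samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tool_success_rates(calls: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each tool name to its fraction of successful calls."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for tool, ok in calls:
        totals[tool] += 1
        if ok:
            successes[tool] += 1
    return {tool: successes[tool] / totals[tool] for tool in totals}


if __name__ == "__main__":
    latencies_s = [float(i) for i in range(1, 11)]  # illustrative samples
    for pct in (50, 95, 99):
        print(f"P{pct}: {percentile(latencies_s, pct)}s")
    print(tool_success_rates([("search", True), ("search", False), ("sql", True)]))
```

With only a handful of samples, P95 and P99 collapse onto the slowest request, which is one reason to look at these panels over windows with enough traffic.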

Horizontal scaling with multiple vLLM replicas. When a single vLLM instance cannot handle your traffic, add replicas. The approach: run multiple vLLM containers (each on its own GPU) behind Nginx load balancing. Update the Nginx config:

upstream vllm_backend {
    least_conn;  # route to the replica with fewest active connections
    server vllm-1:8000;
    server vllm-2:8000;
    server vllm-3:8000;
}

Each vLLM replica is an independent inference server. They do not share state - this is fine because inference is stateless (the state lives in LangGraph and Postgres). The LangGraph agent service does not need to change at all. It sends requests to the Nginx upstream, which distributes them across replicas. With least_conn balancing, requests go to the replica with the fewest in-flight connections, which naturally balances load based on actual processing time rather than round-robin.
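To make the least_conn behaviour concrete, here is a small simulation of the selection rule: each new request goes to the replica with the fewest in-flight requests. This is a simplified model, not Nginx's implementation. In particular, ties here go to the first-listed replica, and the class name is hypothetical.

```python
from typing import Dict, Iterable


class LeastConnBalancer:
    """Toy model of least-connections routing across replicas."""

    def __init__(self, replicas: Iterable[str]) -> None:
        # Track in-flight requests per replica, preserving listing order.
        self.in_flight: Dict[str, int] = {name: 0 for name in replicas}

    def acquire(self) -> str:
        """Pick the replica with the fewest in-flight requests."""
        # min() returns the first minimal key, so ties go to the
        # earliest-listed replica in this simplified model.
        name = min(self.in_flight, key=lambda n: self.in_flight[n])
        self.in_flight[name] += 1
        return name

    def release(self, name: str) -> None:
        """Mark one request on the given replica as finished."""
        self.in_flight[name] -= 1


if __name__ == "__main__":
    lb = LeastConnBalancer(["vllm-1", "vllm-2", "vllm-3"])
    first_three = [lb.acquire() for _ in range(3)]
    lb.release("vllm-2")  # vllm-2 finishes a short request early
    print(first_three, "->", lb.acquire())
```

A replica that finishes its requests quickly frees slots sooner and therefore receives more traffic, which is the property that makes this policy a better fit than round-robin when generation lengths vary widely.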

Scaling decision framework. Monitor these signals to decide when to scale: if vllm:num_requests_waiting is consistently above 0, add a vLLM replica. If agent response latency P95 exceeds your SLA (typically 10-15 seconds for agent interactions), check whether the bottleneck is vLLM inference time or tool execution time. If vLLM inference dominates, add replicas. If tool execution dominates, optimize your tools (cache repeated results, tune Postgres queries, add connection pooling). If Postgres query latency increases, check for missing indexes, stale planner statistics (run ANALYZE documents), or connection exhaustion.
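The decision rules above can be written down as a small function, which is useful as a checklist or as the basis for an alert annotation. This is only a sketch: the Signals fields, the 15-second default SLA, and the 50% cutoff for "dominates" are assumptions made for the example, not values defined elsewhere in this guide.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Averaged monitoring inputs over a sustained window (hypothetical shape)."""
    avg_requests_waiting: float    # mean of vllm:num_requests_waiting
    p95_latency_s: float           # agent response latency, P95
    llm_time_share: float          # fraction of request time spent in vLLM
    postgres_latency_rising: bool  # query latency trending upward


def scaling_action(s: Signals, sla_p95_s: float = 15.0) -> str:
    """Return the first recommended action, checked in priority order."""
    if s.avg_requests_waiting > 0:
        return "add a vLLM replica"
    if s.p95_latency_s > sla_p95_s:
        # Assumption: "dominates" means at least half of request time.
        if s.llm_time_share >= 0.5:
            return "add a vLLM replica"
        return "optimize tools"
    if s.postgres_latency_rising:
        return "check Postgres indexes, statistics, and connections"
    return "no action"


if __name__ == "__main__":
    print(scaling_action(Signals(0.0, 20.0, 0.3, False)))
```

Feeding the function averages over several minutes rather than instantaneous values avoids reacting to a single burst.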

For the agent service itself, scale by running multiple container replicas behind Nginx in their own upstream block, separate from vllm_backend. Since all state is in Postgres, agent service replicas are stateless and can scale horizontally without coordination. A typical production setup: 2-4 agent service replicas (each running 4 uvicorn workers) handling 50-100 concurrent conversations, backed by 1-3 vLLM replicas depending on throughput requirements.

Cost monitoring. Track GPU utilization hours and correlate with request volume to calculate your effective per-token cost. If your A100 instance runs at $2/hour and processes 800 tokens/second on average, it produces about 2.88M tokens per hour, so your cost is roughly $0.0007 per 1,000 tokens (about $0.69 per million). That figure only holds at sustained throughput: an idle GPU still bills by the hour, so the effective cost rises as utilization falls. Compare this monthly against what equivalent API calls would cost. Most teams find that self-hosting breaks even at around 3-5M tokens per day and saves 60-80% at 20M+ tokens per day. For teams evaluating the financial case, our ROI calculator models this tradeoff for your specific workload profile. For teams that want to start with this stack and expand to more sophisticated agent workflows, our AI Agents for Operators course covers production patterns beyond the basics covered here.
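The per-token arithmetic is easy to get wrong by a unit, so a few lines of code are worth keeping next to your dashboard. The sketch below computes cost per 1,000 tokens two ways: from sustained throughput, and from the tokens you actually served in a day on an instance billed around the clock. The function names are hypothetical, and the inputs in the example are the illustrative figures from this section rather than measured prices.

```python
def cost_per_1k_at_throughput(hourly_usd: float, tokens_per_second: float) -> float:
    """Cost per 1,000 tokens when the GPU sustains the given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1000


def cost_per_1k_at_daily_volume(hourly_usd: float, tokens_per_day: float) -> float:
    """Effective cost per 1,000 tokens when the instance is billed 24 hours
    a day but only serves tokens_per_day tokens."""
    daily_cost = hourly_usd * 24
    return daily_cost / tokens_per_day * 1000


if __name__ == "__main__":
    # $2/hour at a sustained 800 tokens/second
    print(f"{cost_per_1k_at_throughput(2.0, 800):.6f} USD per 1k tokens")
    # The same instance serving only 5M tokens per day
    print(f"{cost_per_1k_at_daily_volume(2.0, 5_000_000):.6f} USD per 1k tokens")
```

The gap between the two results is the cost of idle GPU time, which is why the break-even point depends on daily volume rather than on peak throughput.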

Two follow-ups worth reading before you buy hardware: Ollama vs vLLM for production agents and how to cut agent LLM costs even if you stay on managed APIs.

FAQ

How much GPU memory do I need to run a self-hosted LLM agent stack?

It depends on the model and quantization. Llama 3.1 70B with AWQ Q4 quantization requires about 35GB VRAM for weights plus 30-40GB for KV cache, fitting on a single A100 80GB. The 8B variant runs on a 24GB RTX 4090 with Q4 quantization. For development, start with the 8B model on consumer hardware. For production, an A100 80GB with Q4 or Q8 quantization handles most workloads, though for the 70B model Q8 weights alone take about 70GB, which leaves little room for KV cache on a single card.
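The weight and KV cache figures in the answer above follow from simple arithmetic, sketched here. The estimate ignores quantization scales, activations, and runtime overhead, so treat it as a lower bound. The architecture numbers for the 70B model (80 layers, 8 KV heads, head dimension 128) and the 16-bit KV cache are assumptions taken from the published model configuration, not from this guide, so check them against the config.json of the checkpoint you deploy.

```python
def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache bytes for one token: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_token_capacity(kv_budget_bytes: float, per_token_bytes: int) -> int:
    """How many tokens of context fit in the given KV cache budget."""
    return int(kv_budget_bytes // per_token_bytes)


if __name__ == "__main__":
    weights_gb = weight_bytes(70e9, 4) / 1e9     # nominal 70B parameters at 4 bits
    per_token = kv_bytes_per_token(80, 8, 128)   # assumed 70B architecture
    print(f"weights: ~{weights_gb:.0f} GB")
    print(f"KV cache per token: {per_token} bytes")
    print(f"tokens in a 30 GB KV budget: {kv_token_capacity(30e9, per_token)}")
```

The token capacity is shared across all concurrent requests, so dividing it by your typical context length gives a rough ceiling on how many conversations one GPU can hold in cache at once.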

Can I use this stack with models other than Llama, like Mistral or Qwen?

Yes. vLLM supports most popular open-weight models including Mistral, Qwen2.5, Gemma 2, and many others. Change the --model flag in the vLLM startup command and update the MODEL_NAME environment variable. The LangGraph orchestrator code does not change because it connects through the OpenAI-compatible API, which is model-agnostic.

Is pgvector good enough for production RAG, or do I need a dedicated vector database?

For most agent workloads with under 10M vectors, pgvector with HNSW indexes delivers sub-50ms search latency, which is well within acceptable range. Dedicated vector databases like Pinecone or Weaviate are 2-5x faster for pure vector search at scale (100M+ vectors), but the operational simplicity of using Postgres for everything outweighs the performance gap at smaller scales. Start with pgvector and migrate to a dedicated solution only if you hit measurable performance limits.

How do I switch between self-hosted vLLM and cloud APIs like OpenAI?

Change the VLLM_BASE_URL environment variable. For local vLLM, set it to http://vllm:8000/v1. For OpenAI, set it to https://api.openai.com/v1 and provide a valid OPENAI_API_KEY. MODEL_NAME must also name a model that the target endpoint actually serves. The LangGraph code uses ChatOpenAI with base_url, so no code changes are needed. This makes it easy to use cloud APIs during development and self-hosted inference in production.
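One way to keep the switch in configuration is to resolve all endpoint settings from the environment in a single place. The sketch below is a hypothetical helper, not the code from this guide: it reads VLLM_BASE_URL, OPENAI_API_KEY, and the MODEL_NAME variable mentioned in the model-swap answer above, and fails early on obvious misconfiguration. The "not-needed" placeholder key is an assumption that works only when vLLM runs without API key enforcement.

```python
import os
from typing import Dict, Mapping, Optional

PLACEHOLDER_KEY = "not-needed"  # assumed placeholder for an unauthenticated vLLM


def resolve_llm_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build client settings from environment variables."""
    env = os.environ if env is None else env
    base_url = env.get("VLLM_BASE_URL", "http://vllm:8000/v1")
    api_key = env.get("OPENAI_API_KEY", PLACEHOLDER_KEY)
    model = env.get("MODEL_NAME", "")

    if not model:
        raise ValueError("MODEL_NAME must name a model the endpoint serves")
    # A hosted endpoint rejects the placeholder, so fail before the first call.
    if "api.openai.com" in base_url and api_key == PLACEHOLDER_KEY:
        raise ValueError("OPENAI_API_KEY is required for the OpenAI endpoint")

    return {"base_url": base_url, "api_key": api_key, "model": model}


if __name__ == "__main__":
    local = {"MODEL_NAME": "my-local-model"}  # illustrative model name
    print(resolve_llm_settings(local))
```

The returned dictionary uses the keyword names that ChatOpenAI from langchain_openai accepts (model, base_url, api_key), so it can be unpacked into the constructor.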

What happens if vLLM crashes or runs out of GPU memory during an agent execution?

LangGraph checkpoints state to Postgres after every node execution. If vLLM crashes mid-inference, the agent node raises a timeout or connection error, and the failed step is not committed, so the last successful checkpoint stays intact in Postgres. When the user retries with the same thread_id (or vLLM restarts and the request is replayed), the agent resumes from that checkpoint rather than starting the conversation over. This is one of the key advantages of the LangGraph + Postgres checkpointing architecture.
