How-To · 2026-06-03 · Last verified 2026-06-03

Deploy LangGraph to Production with vLLM in 30 Minutes

Step-by-step guide to deploying a production LangGraph agent backed by a local vLLM inference server on a single GPU server with Docker Compose, nginx, and systemd.

Deep · ML Architect & Full Stack Engineer

10+ years shipping production ML across TensorFlow, PyTorch, AWS, and GCP. Ships every A8gent agent before it becomes a lesson.

Key takeaways

A single GPU server with 48GB VRAM can run both a vLLM inference server and a LangGraph agent service, serving production traffic for Llama 70B Q4 at reasonable latency.
vLLM exposes an OpenAI-compatible API, so your LangGraph agent connects to it using ChatOpenAI with a custom base_url - no code changes needed if you later switch to a cloud API.
Docker Compose orchestrates vLLM, your FastAPI agent, Postgres for checkpointing, and Redis for rate limiting in a single reproducible stack with GPU passthrough.
Production hardening requires nginx with SSL termination, systemd for process management, health check endpoints, and log rotation - all achievable in under 30 minutes.
Monitor GPU memory utilization, inference latency p95, and request error rates from day one. A simple Grafana dashboard with alerts on these three metrics catches most production issues before users notice.

What We're Building

By the end of this guide, you will have a production-ready AI agent running on a single GPU server. The stack has four components: vLLM serves a Llama 70B model with an OpenAI-compatible API, LangGraph runs your agent logic with state persistence, Docker Compose orchestrates everything, and nginx handles SSL termination and reverse proxying.

The architecture is straightforward. Your LangGraph agent runs as a FastAPI service. When it needs to call the LLM, it sends requests to the local vLLM server instead of OpenAI's API. vLLM handles batching, KV cache management, and GPU scheduling. Postgres stores LangGraph checkpoints for conversation persistence. Redis handles rate limiting. nginx sits in front with SSL and basic auth.

This setup costs roughly $1-2/hour on a Hetzner or RunPod GPU server, compared to $50-200/day in API costs for equivalent throughput on OpenAI. You own the entire stack, your data never leaves your server, and you can swap models by changing one Docker environment variable. For background on why self-hosting your LLM matters, see our guide on running AI agents on your own server.
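The comparison is easier to check once the hourly rate is converted to a daily figure. The sketch below uses only the numbers quoted above; the function name is illustrative.

```python
HOURS_PER_DAY = 24


def daily_cost(hourly_rate_usd: float) -> float:
    """Cost of keeping one server running around the clock for a day."""
    return hourly_rate_usd * HOURS_PER_DAY


low, high = daily_cost(1.0), daily_cost(2.0)
print(f"Self-hosted GPU server: ${low:.0f}-{high:.0f}/day")
```

At $24-48/day the server undercuts the quoted API range only if it stays busy enough to replace that API spend, since the server bills whether or not it is serving requests.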

Here is the final directory structure you will end up with:

langgraph-vllm-prod/
├── compose.yml
├── agent/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── main.py
│   └── graph.py
├── nginx/
│   ├── nginx.conf
│   └── ssl/
├── systemd/
│   └── langgraph-agent.service
└── monitoring/
    └── grafana-dashboard.json

Total deployment time: about 30 minutes if your server is already provisioned, or 45 minutes if you are starting from scratch. Every command in this guide is copy-pasteable. Let's start.

Prerequisites

You need a GPU server with enough VRAM to run your target model. For Llama 3.1 70B at 4-bit quantization (Q4), you need approximately 40-48GB of VRAM. Two common options are Hetzner (a dedicated GPU server with a 48GB card; check the VRAM of the specific GEX model before ordering, because the lineup and pricing change) and RunPod (RTX 4090 24GB for smaller models, A6000 48GB for 70B). If you already have a server with an NVIDIA GPU, that works too.


Verify your GPU is detected and has sufficient VRAM:

nvidia-smi
# Should show your GPU with available memory
# For 70B Q4: need ~40GB free VRAM
# For 8B models: need ~8GB free VRAM

Install Docker with NVIDIA GPU support. This is the single most error-prone step - get it right before proceeding:

# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker

# Install NVIDIA Container Toolkit
# Add NVIDIA's signing key and the generic "stable" apt repository
# (the per-distribution repository paths are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU is accessible from Docker
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If that last command shows your GPU, you are ready. If it fails, the issue is almost always the NVIDIA Container Toolkit installation. Check that nvidia-ctk --version returns a version number.

You also need Python 3.11+ on the host for local testing (optional - everything runs in Docker), a domain name pointed at your server's IP (for SSL), and ports 80 and 443 open in your firewall. If you are following along on RunPod, ports are open by default. On Hetzner, configure the firewall in the Cloud Console.

Software versions this guide is tested with: Docker 27.x, NVIDIA Driver 550+, CUDA 12.4, Python 3.11, vLLM 0.8.x, LangGraph 0.4.x. For a broader overview of the self-hosted LLM agent stack, see our complete stack guide.

Step 1: vLLM Inference Server

vLLM is an inference engine that serves LLMs with high throughput using PagedAttention and continuous batching. The key feature for our use case: it exposes an OpenAI-compatible API out of the box, so your LangGraph agent can connect to it using the standard ChatOpenAI client with zero code changes.

Pull and run the vLLM Docker image with your model. This command downloads the model from Hugging Face and starts serving:

# Create a directory for model cache (persistent across restarts)
mkdir -p ~/vllm-models

# Run vLLM with Llama 3.1 70B Instruct (4-bit AWQ quantization)
docker run -d \
  --name vllm-server \
  --gpus all \
  --shm-size=16g \
  -p 8000:8000 \
  -v ~/vllm-models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=your_hf_token_here \
  vllm/vllm-openai:latest \
  --model TheBloke/Llama-3.1-70B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enforce-eager \
  --dtype half

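The first run takes 5-10 minutes to download the model (about 35GB for 70B AWQ). Subsequent starts take 30-60 seconds. Confirm the model repository ID on Hugging Face before running, and substitute whichever AWQ build of Llama 3.1 70B you have access to. Key flags explained: --shm-size=16g enlarges the container's shared memory, which PyTorch uses for inter-process communication (this matters most once you add tensor parallelism). --gpu-memory-utilization 0.90 lets vLLM claim up to 90% of VRAM for weights, activations, and the preallocated KV cache, leaving the remaining 10% for other processes. --max-model-len 8192 caps context length so the KV cache fits in the memory left after the weights. --enforce-eager disables CUDA graph capture, which saves some memory and avoids issues on some GPU configurations at the cost of some throughput.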
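To get a feel for why --max-model-len matters, here is a back-of-the-envelope KV cache estimate. The architecture numbers (80 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Llama 3.1 70B configuration, and a 16-bit cache is assumed; treat the result as an approximation, not as the figure vLLM will report.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # One key vector and one value vector per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
per_sequence_gib = per_token * 8192 / 2**30

print(f"{per_token / 1024:.0f} KiB per token, "
      f"{per_sequence_gib:.1f} GiB for one 8192-token sequence")
```

Under these assumptions a single full-length sequence needs about 2.5 GiB of cache, so on a 48GB card that already holds roughly 35GB of weights, only a handful of maximum-length requests fit at once.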

If you are using a smaller GPU (24GB RTX 4090), use an 8B model instead:

# For 24GB GPUs - Llama 3.1 8B Instruct
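# meta-llama repositories are gated, so the Hugging Face token is required here too
docker run -d \
  --name vllm-server \
  --gpus all \
  --shm-size=8g \
  -p 8000:8000 \
  -v ~/vllm-models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=your_hf_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --dtype half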

Wait for the server to become ready, then verify it works:

# Check logs - wait for "Uvicorn running on http://0.0.0.0:8000"
docker logs -f vllm-server

# Test the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Llama-3.1-70B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 50,
    "temperature": 0.7
  }'

# Check available models
curl http://localhost:8000/v1/models

# Check server health
curl http://localhost:8000/health

If the chat completion returns a response, your vLLM server is ready. Stop this standalone container - we will run it through Docker Compose in Step 3: docker stop vllm-server && docker rm vllm-server.
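If you script this check in a deploy pipeline instead of watching the logs by hand, the readiness wait can be automated. The sketch below polls an HTTP endpoint until it answers with 200; the function names, attempt count, and delay are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     delay_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A deploy script could call wait_until_ready(lambda: http_ok("http://localhost:8000/health")) and abort if it returns False. With the defaults that allows about ten minutes, which covers the first-run model download described above.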

Note the model name in the response from /v1/models - you will need it in the next step when configuring the LangGraph agent. The vLLM server handles request queuing, batching, and KV cache management automatically. For production throughput, expect 20-40 tokens/second for 70B Q4 on a single A6000, and 80-120 tokens/second for 8B on an RTX 4090.


Step 2: LangGraph Agent Service

Now build the LangGraph agent as a FastAPI service that connects to your local vLLM server. The agent uses ChatOpenAI pointed at http://vllm:8000/v1 (the Docker network address) instead of OpenAI's API. This means you can develop against OpenAI during development and switch to vLLM in production by changing one environment variable. For a deeper dive into LangGraph fundamentals, see our LangGraph tutorial.

Create the project structure:

mkdir -p langgraph-vllm-prod/agent
cd langgraph-vllm-prod

First, the requirements file at agent/requirements.txt:

langgraph==0.4.1
langchain-openai==0.3.12
langchain-core==0.3.45
fastapi==0.115.0
uvicorn[standard]==0.32.0
psycopg[binary]==3.2.0
langgraph-checkpoint-postgres==2.0.0
redis==5.2.0
pydantic==2.10.0

Now the graph definition at agent/graph.py. This defines a ReAct agent with two tools - a web search stub and a calculator:

import math
import os
from contextlib import AsyncExitStack

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode

# Connect to vLLM via its OpenAI-compatible API
llm = ChatOpenAI(
    base_url=os.getenv("VLLM_BASE_URL", "http://vllm:8000/v1"),
    # vLLM does not require a key by default; OPENAI_API_KEY is only read so
    # the same code works when VLLM_BASE_URL points at OpenAI in development
    api_key=os.getenv("OPENAI_API_KEY", "not-needed"),
    model=os.getenv("VLLM_MODEL", "TheBloke/Llama-3.1-70B-Instruct-AWQ"),
    temperature=0.7,
    max_tokens=2048,
)

@tool
def search(query: str) -> str:
    """Search the web for current information. Use this when you need
    up-to-date facts, news, or data not in your training set."""
    # Replace with your actual search API (SearXNG, Tavily, etc.)
    return f"Search results for: {query} - [implement your search here]"

@tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression. Input should be a valid Python
    math expression like '2 + 2' or 'math.sqrt(144)'."""
    # Caution: eval() with empty builtins is not a sandbox. Replace it with a
    # real expression parser before exposing this tool to untrusted input.
    try:
        result = eval(expression, {"__builtins__": {}}, {"math": math})
        return str(result)
    except Exception as e:
        return f"Calculation error: {e}"

tools = [search, calculate]
llm_with_tools = llm.bind_tools(tools)

async def agent_node(state: MessagesState):
    # Ask the model for the next step: a final answer or one or more tool calls
    response = await llm_with_tools.ainvoke(state["messages"])
    return {"messages": [response]}

def should_continue(state: MessagesState):
    # Route to the tool node while the model keeps requesting tool calls
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return END

# Keeps the Postgres connection open for the lifetime of the process.
# Call `await _exit_stack.aclose()` on shutdown to release it.
_exit_stack = AsyncExitStack()

async def build_graph(db_uri: str):
    # from_conn_string() returns an async context manager rather than a saver,
    # so enter it before calling setup()
    checkpointer = await _exit_stack.enter_async_context(
        AsyncPostgresSaver.from_conn_string(db_uri)
    )
    await checkpointer.setup()  # creates the checkpoint tables if missing

    graph = StateGraph(MessagesState)
    graph.add_node("agent", agent_node)
    graph.add_node("tools", ToolNode(tools))
    graph.add_edge(START, "agent")
    graph.add_conditional_edges("agent", should_continue, ["tools", END])
    graph.add_edge("tools", "agent")

    return graph.compile(checkpointer=checkpointer)
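The calculate tool passes model-generated text to eval. Emptying __builtins__ narrows what an expression can reach, but it is not a sandbox. One possible replacement, sketched here under the assumption that the tool only needs basic arithmetic and a few math functions, walks the parsed syntax tree and rejects anything not explicitly allowed. The safe_eval name and the whitelist are illustrative.

```python
import ast
import math
import operator

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv,
           ast.Mod: operator.mod}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_FUNCS = {"sqrt": math.sqrt, "log": math.log, "sin": math.sin, "cos": math.cos}


def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def safe_eval(expression: str) -> float:
    """Evaluate arithmetic by walking the AST; unlisted nodes are rejected."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and _is_number(node.value):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        # Allow only calls of the form math.<whitelisted name>(...)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "math"
                and node.func.attr in _FUNCS
                and not node.keywords):
            return _FUNCS[node.func.attr](*[walk(arg) for arg in node.args])
        raise ValueError(f"Unsupported element: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))


print(safe_eval("42 * 17"))
print(safe_eval("math.sqrt(144)"))
```

Exponentiation is left out on purpose, because an expression like 9 ** 9 ** 9 can tie up a worker. Inside calculate, the eval line could call safe_eval(expression) instead and keep the existing try/except.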

Now the FastAPI application at agent/main.py:

import os
import uuid
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from langchain_core.messages import HumanMessage

from graph import build_graph

app_graph = None

DB_URI = os.getenv(
    "DATABASE_URL",
    "postgresql://agent:agent@postgres:5432/langgraph"
)

@asynccontextmanager
async def lifespan(app: FastAPI):
    global app_graph
    app_graph = await build_graph(DB_URI)
    yield

app = FastAPI(title="LangGraph Agent", lifespan=lifespan)

class ChatRequest(BaseModel):
    message: str
    thread_id: str | None = None

class ChatResponse(BaseModel):
    response: str
    thread_id: str

@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    thread_id = req.thread_id or str(uuid.uuid4())
    config = {"configurable": {"thread_id": thread_id}}
    try:
        result = await app_graph.ainvoke(
            {"messages": [HumanMessage(content=req.message)]},
            config=config,
        )
        ai_message = result["messages"][-1]
        return ChatResponse(response=ai_message.content, thread_id=thread_id)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "ok", "graph_loaded": app_graph is not None}

And the Dockerfile at agent/Dockerfile:

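FROM python:3.11-slim
WORKDIR /app
# curl is needed by the Compose health check; the slim image does not ship it
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "2"]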

The critical line is base_url=os.getenv("VLLM_BASE_URL", "http://vllm:8000/v1"). This points to the vLLM container on the Docker network. The api_key is set to a dummy value because vLLM does not require authentication by default. To develop locally against OpenAI, set VLLM_BASE_URL=https://api.openai.com/v1, VLLM_MODEL to an OpenAI model name, and OPENAI_API_KEY accordingly.
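The stack includes Redis for rate limiting, but main.py as shown never touches it. A minimal fixed-window limiter, sketched here under the assumption that a per-client request count per minute is enough, needs only the INCR and EXPIRE commands. All names below are hypothetical, and the in-memory store exists only so the logic can be tried without a Redis server.

```python
import time
from typing import Optional, Protocol


class CounterStore(Protocol):
    """The two Redis commands the limiter relies on."""
    def incr(self, name: str) -> int: ...
    def expire(self, name: str, seconds: int) -> object: ...


class InMemoryStore:
    """Stand-in for local experiments; it never expires keys."""
    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def incr(self, name: str) -> int:
        self.counts[name] = self.counts.get(name, 0) + 1
        return self.counts[name]

    def expire(self, name: str, seconds: int) -> object:
        return True


def allow_request(store: CounterStore, client_id: str, limit: int = 30,
                  window_s: int = 60, now: Optional[float] = None) -> bool:
    """Return True while client_id is within its quota for the current window."""
    now = time.time() if now is None else now
    window_id = int(now // window_s)
    key = f"ratelimit:{client_id}:{window_id}"
    count = store.incr(key)
    if count == 1:
        # First hit in this window: let the key clean itself up afterwards
        store.expire(key, window_s)
    return count <= limit
```

In the agent, store would be a Redis client such as redis.Redis.from_url(os.environ["REDIS_URL"]), whose incr and expire methods match the calls above; with the asyncio client both calls would need await. One option is to call allow_request at the top of the /chat handler and return HTTP 429 when it comes back False.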

Step 3: Docker Compose for the Full Stack

Docker Compose ties all four services together: vLLM for inference, the FastAPI agent, Postgres for checkpointing, and Redis for rate limiting. Create compose.yml in your project root:

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
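    command:
      - --model
      - TheBloke/Llama-3.1-70B-Instruct-AWQ
      - --quantization
      - awq
      - --max-model-len
      - "8192"
      - --gpu-memory-utilization
      - "0.90"
      - --enforce-eager
      - --dtype
      - half
      # Tool calling: without these flags vLLM rejects requests that carry
      # tools with automatic tool choice, which is what bind_tools() sends
      - --enable-auto-tool-choice
      - --tool-call-parser
      - llama3_json
      - --host
      - "0.0.0.0"
      - --port
      - "8000"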
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - vllm-models:/root/.cache/huggingface
    ports:
      # Loopback only: the agent reaches vLLM over the Docker network
      - "127.0.0.1:8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "16g"
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 300s

  postgres:
    image: postgres:16-alpine
    container_name: postgres
    environment:
      POSTGRES_USER: agent
      POSTGRES_PASSWORD: agent
      POSTGRES_DB: langgraph
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      # Loopback only: do not expose Postgres to the internet
      - "127.0.0.1:5432:5432"
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U agent -d langgraph"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    container_name: redis
    ports:
      # Loopback only: Redis has no authentication configured here
      - "127.0.0.1:6379:6379"
    volumes:
      - redisdata:/data
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  agent:
    build:
      context: ./agent
      dockerfile: Dockerfile
    container_name: agent
    environment:
      - VLLM_BASE_URL=http://vllm:8000/v1
      - VLLM_MODEL=TheBloke/Llama-3.1-70B-Instruct-AWQ
      - DATABASE_URL=postgresql://agent:agent@postgres:5432/langgraph
      - REDIS_URL=redis://redis:6379/0
    ports:
      # Loopback only: public traffic should arrive through nginx
      - "127.0.0.1:8080:8080"
    depends_on:
      vllm:
        condition: service_healthy
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3

volumes:
  vllm-models:
  pgdata:
  redisdata:

Key details in this Compose file. The deploy.resources.reservations.devices block passes all GPUs to the vLLM container. The start_period: 300s on the vLLM health check gives the model 5 minutes to download and load before Docker considers it unhealthy. The agent service uses depends_on with health check conditions, so it only starts after vLLM, Postgres, and Redis are confirmed healthy.
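It helps to work out how long each health check tolerates a slow start. As a rough model (it ignores the per-probe timeout and scheduling jitter), failures during start_period are not counted, and after that a container is marked unhealthy once retries consecutive probes fail. The function name is illustrative.

```python
def seconds_until_unhealthy(start_period_s: int, interval_s: int, retries: int) -> int:
    """Approximate startup time a service has before Docker marks it unhealthy."""
    return start_period_s + interval_s * retries


vllm_budget = seconds_until_unhealthy(start_period_s=300, interval_s=30, retries=5)
postgres_budget = seconds_until_unhealthy(start_period_s=0, interval_s=10, retries=5)

print(f"vLLM: {vllm_budget}s, Postgres: {postgres_budget}s")
```

The vLLM budget comes to about 450 seconds. A first-run download at the slow end of the 5-10 minute range can exceed that, in which case raising start_period for the first start, or pre-pulling the model into the vllm-models volume, avoids a false unhealthy status.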

Create a .env file for secrets:

# .env
HF_TOKEN=hf_your_huggingface_token_here

Start the full stack:

# Build and start all services
docker compose up -d --build

# Watch logs (vLLM takes a few minutes on first start)
docker compose logs -f vllm

# Once vLLM shows "Uvicorn running", check all services
docker compose ps

# Test the agent endpoint
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is 42 * 17?"}'

If the curl returns a JSON response with the answer and a thread_id, your entire stack is working. The agent received your message, sent it to vLLM for inference, the LLM decided to use the calculator tool, the tool executed, the LLM generated a final response, and the conversation state was persisted to Postgres.

Test conversation persistence by sending a follow-up with the same thread_id:

# Use the thread_id from the previous response
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Now divide that result by 3", "thread_id": "YOUR_THREAD_ID"}'

The agent should remember the previous calculation and divide 714 by 3. If it does, checkpointing is working correctly.
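The same two-step check can be driven from Python, which avoids copying the thread_id by hand. This is a minimal standard-library sketch: the chat and post_json helpers are hypothetical names, and the URL assumes the agent is reachable on localhost:8080 as in the curl commands.

```python
import json
import urllib.request
from typing import Callable, Optional


def post_json(url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def chat(message: str, thread_id: Optional[str] = None,
         url: str = "http://localhost:8080/chat",
         send: Callable[[str, dict], dict] = post_json) -> dict:
    """Send one message; pass thread_id to continue an earlier conversation."""
    payload = {"message": message}
    if thread_id is not None:
        payload["thread_id"] = thread_id
    return send(url, payload)


# Example against a running stack:
# first = chat("What is 42 * 17?")
# follow_up = chat("Now divide that result by 3", thread_id=first["thread_id"])
```

The send parameter exists so the request logic can be exercised without a live server by passing a stub.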

Step 4: Production Hardening

The stack is functional but not production-ready. You need SSL termination, process management, authentication, and log rotation. This step covers all four.

First, set up nginx as a reverse proxy with SSL. Install nginx and Certbot on the host (not in Docker - it is simpler for SSL cert management):

# Install nginx and certbot
sudo apt-get install -y nginx certbot python3-certbot-nginx

# Get SSL certificate (replace the domain and email address with your own)
sudo certbot --nginx -d agent.yourdomain.com --non-interactive --agree-tos -m you@yourdomain.com

Create the nginx configuration at /etc/nginx/sites-available/langgraph-agent:

upstream agent_backend {
    server 127.0.0.1:8080;
}

server {
    listen 80;
    server_name agent.yourdomain.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name agent.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/agent.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/agent.yourdomain.com/privkey.pem;

    # Basic auth - remove if using API key auth in the app
    auth_basic "Agent API";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://agent_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 120s;  # Agent responses can take time
        proxy_send_timeout 120s;
    }

    # Block direct access to vLLM
    location /v1/ {
        deny all;
    }

    # Health check without auth
    location /health {
        auth_basic off;
        proxy_pass http://agent_backend/health;
    }
}

Enable the site and set up basic auth:

# Create basic auth credentials
sudo apt-get install -y apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd apiuser

# Enable the site
sudo ln -sf /etc/nginx/sites-available/langgraph-agent /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
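Clients that do not go through curl have to send the credentials themselves. HTTP Basic auth is a single header carrying the base64-encoded username:password pair; the helper name below is illustrative.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header that nginx's auth_basic expects."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


headers = basic_auth_header("apiuser", "yourpassword")
```

Base64 is an encoding, not encryption, so these credentials are only protected because nginx terminates TLS in front of the agent.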

Next, create a systemd service so the stack starts on boot and restarts on failure. Create /etc/systemd/system/langgraph-agent.service:

[Unit]
Description=LangGraph Agent Stack
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/home/deploy/langgraph-vllm-prod
ExecStart=/usr/bin/docker compose up -d --build
ExecStop=/usr/bin/docker compose down
ExecReload=/usr/bin/docker compose restart agent
TimeoutStartSec=600

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable langgraph-agent
sudo systemctl start langgraph-agent
sudo systemctl status langgraph-agent

Set up log rotation for Docker logs. Create /etc/logrotate.d/docker-containers:

/var/lib/docker/containers/*/*.log {
    daily
    rotate 7
    compress
    missingok
    delaycompress
    copytruncate
    maxsize 100M
}

Finally, add Docker daemon log limits in /etc/docker/daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  },
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

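Restart Docker to apply: sudo systemctl restart docker, then bring the stack back up: sudo systemctl start langgraph-agent. The log limits only apply to containers created after the change, so recreate any older containers with docker compose up -d --force-recreate. Your agent is now accessible via HTTPS with basic auth, managed by systemd, and logs are rotated automatically.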

Step 5: Verify and Monitor

With everything deployed, verify the full request path and set up monitoring. Start with end-to-end verification through the public endpoint:

# Test through nginx with SSL and basic auth
curl -u apiuser:yourpassword https://agent.yourdomain.com/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain what Docker Compose does in two sentences."}'

# Verify health endpoint (no auth required)
curl https://agent.yourdomain.com/health

# Check vLLM metrics directly (only accessible from the server)
curl http://localhost:8000/metrics | head -20

# Verify Postgres has checkpoint data
docker exec postgres psql -U agent -d langgraph -c "SELECT count(*) FROM checkpoints;"

The vLLM /metrics endpoint exposes Prometheus-format metrics including KV cache usage, request latency, running and waiting request counts, and token throughput. These are the foundation of your monitoring setup.
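Before wiring up Prometheus, it can be useful to read a few values straight from that endpoint. The parser below is a minimal sketch of the Prometheus text format that assumes one sample per line with no trailing timestamp. The sample text is made up for illustration and is not captured vLLM output.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Map each series (name plus labels) to its value, skipping comments."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        # The value is the last space-separated field on the line
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 2.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.37
"""

metrics = parse_metrics(SAMPLE)
print(metrics['vllm:num_requests_running{model_name="demo"}'])
```

Against a live server, the text would come from urllib.request.urlopen("http://localhost:8000/metrics").read().decode().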

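Set up a basic Prometheus + Grafana monitoring stack. Add these services under services: in your compose.yml, and declare promdata and grafanadata under the top-level volumes: key alongside the existing volumes: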

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - promdata:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    volumes:
      - grafanadata:/var/lib/grafana
    ports:
      - "3000:3000"
    restart: unless-stopped

Create monitoring/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ["vllm:8000"]
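  # The agent in main.py does not expose /metrics, so this target shows as
  # down until you add a metrics endpoint to the FastAPI app
  - job_name: agent
    static_configs:
      - targets: ["agent:8080"]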

After running docker compose up -d prometheus grafana, open Grafana at http://your-server:3000, add Prometheus as a data source (http://prometheus:9090), and create a dashboard with these panels:

# Key metrics to track in Grafana

# 1. Inference latency (p95) - alert if > 5s
histogram_quantile(0.95, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))

# 2. KV cache usage (reported as a 0-1 fraction) - alert if > 0.9
vllm:gpu_cache_usage_perc

# 3. Active requests (concurrent load)
vllm:num_requests_running

# 4. Request throughput (requests/min)
rate(vllm:request_success_total[5m]) * 60

# 5. Token generation speed (tokens/sec)
rate(vllm:generation_tokens_total[5m])
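Latency percentiles deserve a closer look, because Prometheus histograms store cumulative bucket counts rather than individual request times, so a p95 is an estimate interpolated within a bucket. The sketch below mirrors the linear interpolation that PromQL's histogram_quantile performs, in simplified form; the bucket counts are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list is expected to end with a +Inf bucket holding the total count.
    """
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Falls in the overflow bucket: report the last finite bound
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds
latency_buckets = [(0.5, 10), (1.0, 60), (2.5, 95), (5.0, 100), (math.inf, 100)]
p50 = histogram_quantile(0.5, latency_buckets)
print(f"p50 is roughly {p50:.2f}s")
```

The practical consequence is that the estimate can never be more precise than the bucket boundaries, so a 5-second alert threshold works best when it coincides with a bucket edge.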

Set up alerts for three critical conditions:

# Alert rules (add to Grafana or Prometheus alerting)
# 1. Latency spike: p95 inference latency > 5 seconds for 2 minutes
# 2. GPU memory critical: cache usage > 90% for 5 minutes
# 3. Error rate: agent /chat returning 5xx > 5% of requests over 5 minutes

These three alerts catch the most common production issues. High latency usually means the model is overloaded: either too many requests are running concurrently or prompts are using too much of the context window. Cache usage above 90% means the KV cache is nearly full, so new requests will start queuing and running ones may be preempted. A spike in 5xx errors from the agent service usually means vLLM is down or Postgres is unreachable.

For ongoing operations: check docker compose logs agent daily for the first week, review Grafana dashboards for latency trends, and monitor disk usage on the Postgres volume. Checkpoint data grows with usage - implement a cleanup job that deletes threads inactive for 30+ days. For teams building on this foundation, our AI Agents for Operators course covers the operational patterns for running agent systems at scale.

You now have a fully self-hosted, production-grade LangGraph agent backed by vLLM. The entire stack runs on a single server, costs a fraction of API-based deployments, keeps your data private, and can be reproduced on any GPU server by cloning the repo and running docker compose up -d. For the official LangGraph deployment docs, check the LangGraph Platform self-hosted guide for additional deployment options including Kubernetes and cloud-managed setups.

Shipping is only half the job. Round out your production setup with our guides on testing agents with evals, observability and monitoring, cutting LLM costs, and choosing between Ollama and vLLM for serving.

FAQ

How much VRAM do I need to run Llama 70B with vLLM?

Llama 3.1 70B at 4-bit quantization (AWQ or GPTQ) requires approximately 35-40GB of VRAM for the model weights, plus additional VRAM for the KV cache. A single NVIDIA A6000 (48GB) handles this comfortably. For 8-bit quantization, you need around 70GB - either two A6000s with tensor parallelism or a single A100 80GB. For 8B models, a 24GB RTX 4090 is sufficient.
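These figures follow from simple arithmetic on the parameter count. The sketch below covers the weights only and ignores quantization metadata, activations, and the KV cache, so real usage lands somewhat higher. The function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for label, params, bits in [("70B at 4-bit", 70, 4),
                            ("70B at 8-bit", 70, 8),
                            ("8B at 16-bit", 8, 16)]:
    print(f"{label}: about {weight_memory_gb(params, bits):.0f} GB")
```

That gives about 35 GB, 70 GB, and 16 GB respectively, which is why the 70B model at 4-bit fits a 48GB card with room for cache while the 8-bit variant does not.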

Can I use a different model besides Llama with this setup?

Yes. vLLM supports most popular open models including Mistral, Mixtral, Qwen, Yi, DeepSeek, and Falcon. Change the --model flag in your compose.yml to the Hugging Face model ID. The LangGraph agent code does not need any changes because it communicates with vLLM through the OpenAI-compatible API.

How do I scale this beyond a single GPU server?

For higher throughput, run multiple vLLM instances across servers and load-balance between them using nginx upstream blocks. For tensor parallelism across GPUs on the same server, add --tensor-parallel-size 2 to the vLLM command. For the agent service, increase the uvicorn --workers count and ensure all workers share the same Postgres checkpointer.

What is the latency difference between vLLM and the OpenAI API?

On a single A6000 with Llama 70B Q4, expect time-to-first-token of 0.5-2 seconds and generation speed of 20-40 tokens/second. OpenAI GPT-4o typically delivers 50-80 tokens/second with 0.3-1 second TTFT. vLLM is slower per-request but eliminates per-token costs and network latency. For batch workloads, vLLM's continuous batching can match or exceed API throughput.
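Those two numbers combine into a rough per-response estimate: time to first token plus output length divided by generation speed. The sketch assumes a single LLM call with no tool round trips, and the 500-token response length is an arbitrary example.

```python
def response_time_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Rough wall-clock time for one completion."""
    return ttft_s + output_tokens / tokens_per_s


best = response_time_s(ttft_s=0.5, output_tokens=500, tokens_per_s=40)
worst = response_time_s(ttft_s=2.0, output_tokens=500, tokens_per_s=20)
full_length = response_time_s(ttft_s=2.0, output_tokens=2048, tokens_per_s=20)

print(f"500 tokens: {best:.1f}-{worst:.1f}s, 2048 tokens: up to {full_length:.1f}s")
```

A response that uses the full max_tokens=2048 budget at the slow end takes about 104 seconds, which is close to the 120-second proxy_read_timeout in the nginx config. An agent turn that makes several LLM calls through tools can exceed it, so that timeout may need raising.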

How do I update the model without downtime?

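Use a blue-green approach. Start a second vLLM container on a different port with the new model and wait for it to pass health checks. In this stack nginx fronts the agent rather than vLLM, so the cutover happens wherever the agent's VLLM_BASE_URL resolves: either move the vllm network alias to the new container, or put a small internal proxy between the agent and vLLM and switch its upstream. After confirming traffic is flowing correctly, stop the old container. The LangGraph agent does not need a restart as long as the hostname it uses stays the same and the new container serves the model under the name set in VLLM_MODEL (vLLM's --served-model-name flag can keep that name stable).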
