Technical · 2026-06-20 · Last verified 2026-06-20

AI Agent Memory: The Complete Guide to Short-Term and Long-Term Memory

How AI agent memory actually works: short-term buffers, trimming vs summarization, LangGraph checkpointers and Store, vector-store long-term memory, and when to reach for Mem0, Zep, or Letta instead of building your own.

Deep · ML Architect & Full Stack Engineer

10+ years shipping production ML across TensorFlow, PyTorch, AWS, and GCP. Ships every A8gent agent before it becomes a lesson.

Key takeaways

LLMs are stateless - every 'memory' your agent has is context you assembled and paid for on that request. Memory engineering is deciding what earns a place in the prompt.
Short-term memory is a thread-scoped message history. Trim it for cheap recency, summarize it for long conversations, and let LangGraph checkpointers handle persistence so a server restart does not wipe a conversation.
Long-term memory is cross-session and needs explicit writes: a LangGraph Store or a vector database keyed by user, searched at the start of each turn, and updated when the agent learns something durable.
The semantic / episodic / procedural framing tells you what to build: facts about the user (semantic), past interactions (episodic), and learned instructions (procedural) each need different storage and retrieval.
Mem0 is the fastest path to cross-session personalization, Zep's Graphiti wins when facts change over time, Letta fits self-managing autonomous agents, and DIY on pgvector is fine when your memory needs are simple and you want zero new vendors.
Memory is a liability as well as an asset: keep PII out of long-term stores, set retention policies, and evaluate recall quality with real test cases, not vibes.

Why AI Agents Forget Everything

Every LLM call is stateless. The model does not remember your last request, your user's name, or the fact that it already answered this exact question twenty minutes ago. When ChatGPT appears to remember you, that is not the model remembering - it is an application layer retrieving stored text and stuffing it back into the prompt before inference.

This is the single most important mental model for agent memory: memory is just context you chose to include. Your agent has exactly the memory you assemble for it on each request, and you pay input-token prices for every byte of it.

The context window makes this a hard engineering constraint rather than a philosophical one. Even with 200K or 1M-token windows, three things break long before you hit the limit:

Cost. A 150K-token prompt on a frontier model costs real money on every single turn. A support agent handling 10,000 conversations a day with bloated context can burn thousands of dollars a month on tokens the model never needed.

Latency. Prefill time scales with input length. Users notice when turn twelve of a conversation takes four seconds longer than turn two.

Quality. Models demonstrably lose the middle of very long contexts. Stuffing 80 turns of raw history into the prompt does not make the agent smarter - it buries the two facts that matter under 78 turns of noise. Retrieval quality degrades exactly when you need it most.

So memory engineering splits into two problems with very different solutions. Short-term memory: what does the agent need to remember within this conversation? Long-term memory: what should survive across sessions - next week, next month, on a different device? The rest of this guide covers both, with working code, and finishes with an honest comparison of the managed memory platforms (Mem0, Zep, Letta) versus building it yourself. If you are new to agent architecture in general, start with our LangGraph tutorial and come back.

The Three Types of Agent Memory

The agent community borrowed a taxonomy from cognitive science that turns out to be genuinely useful for system design: semantic, episodic, and procedural memory. Each answers a different question and each maps to a different storage pattern.

Type	What it stores	Business example	Typical implementation
Semantic	Facts about the world and the user	"Customer is on the Enterprise plan, prefers email over phone, renewal date is March 15"	Key-value store or user profile document, updated when facts change
Episodic	Specific past events and interactions	"On June 3 this customer reported a billing bug, we issued a $40 credit, they were satisfied"	Vector store of summarized interactions, retrieved by similarity
Procedural	How to do things - learned rules and instructions	"This team wants ticket summaries in bullet points, never paragraphs, and always CC the account manager"	Editable system prompt sections or instruction store the agent can update

Why does the taxonomy matter in practice? Because teams that skip it usually build one giant vector store, dump everything into it, and then wonder why retrieval is mediocre. The failure is architectural: semantic facts want overwrite semantics (the customer's plan changed, the old fact is now wrong), episodic memories want append semantics (past events never change, you just accumulate them), and procedural memories want versioned edit semantics (instructions evolve and you want to know when and why).
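To make the three write semantics concrete, here is a minimal in-memory sketch. The class and method names (MemoryStores, set_fact, log_episode, edit_instructions) are hypothetical and exist only for illustration; a real system would back each store with a database.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStores:
    """Three stores, one per memory type, each with its own write rule."""
    semantic: dict[str, str] = field(default_factory=dict)                 # key -> current fact
    episodic: list[str] = field(default_factory=list)                      # append-only events
    procedural: list[tuple[int, str, str]] = field(default_factory=list)   # (version, text, reason)

    def set_fact(self, key: str, value: str) -> None:
        # Overwrite: once a fact changes, the old value is simply wrong.
        self.semantic[key] = value

    def log_episode(self, summary: str) -> None:
        # Append: past events never change, they only accumulate.
        self.episodic.append(summary)

    def edit_instructions(self, text: str, reason: str) -> int:
        # Versioned edit: keep every revision plus why it changed.
        version = len(self.procedural) + 1
        self.procedural.append((version, text, reason))
        return version

    def current_instructions(self) -> str:
        return self.procedural[-1][1] if self.procedural else ""


stores = MemoryStores()
stores.set_fact("plan", "Starter")
stores.set_fact("plan", "Enterprise")  # replaces the stale value
stores.log_episode("June 3: billing bug reported, $40 credit issued")
stores.edit_instructions("Summaries in bullet points.", "team preference")
stores.edit_instructions("Summaries in bullet points; CC the account manager.", "new rule")

print(stores.semantic)                # {'plan': 'Enterprise'}
print(stores.current_instructions())  # Summaries in bullet points; CC the account manager.
```

The point of the split is that each store can enforce its own rule at write time, instead of asking retrieval to sort out which chunk is current.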

A vector store handles episodic memory well because "find past interactions similar to the current situation" is exactly what embedding search does. It handles semantic facts badly because similarity search can happily return the stale "customer is on the Starter plan" chunk alongside the current one, and the model has no way to know which is true today. This is precisely the gap that temporal knowledge graph systems like Zep's Graphiti exist to fill, which we cover below.
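A rough sketch of the idea behind validity intervals, assuming a plain list of facts rather than a graph. This is not Graphiti's API; Fact, assert_fact, and fact_as_of are hypothetical names used only to show why "when was this true" needs more than similarity search.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still true"


def assert_fact(facts: list[Fact], key: str, value: str, on: date) -> None:
    """Close the currently open fact for this key, then record the new one."""
    for f in facts:
        if f.key == key and f.valid_to is None:
            f.valid_to = on
    facts.append(Fact(key, value, valid_from=on))


def fact_as_of(facts: list[Fact], key: str, when: date) -> Optional[str]:
    """Return the value that was true on a given date, if any."""
    for f in facts:
        if f.key == key and f.valid_from <= when and (f.valid_to is None or when < f.valid_to):
            return f.value
    return None


facts: list[Fact] = []
assert_fact(facts, "plan", "Starter", date(2025, 9, 1))
assert_fact(facts, "plan", "Enterprise", date(2026, 2, 10))

print(fact_as_of(facts, "plan", date(2026, 1, 15)))  # Starter
print(fact_as_of(facts, "plan", date(2026, 6, 1)))   # Enterprise
```

The old fact is kept but closed, so the agent can answer both "what is true now" and "what was true in January" without guessing between two similar chunks.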

Keep this table in your head as we go through implementations. Short-term memory (the current conversation) sits alongside these three as working memory - it is a fourth thing, scoped to a single thread, and it is where we start.

Short-Term Memory Done Right: Trim vs Summarize

Short-term memory is the message list for the current conversation. The naive version - append every message forever and send the whole thing each turn - works fine for ten turns and falls over at a hundred. You have two real strategies, and most production agents use both.

Strategy 1: Trimming. Keep the system prompt plus the most recent N tokens of conversation, drop the rest. Cheap, deterministic, zero extra LLM calls. The trade-off is a hard amnesia cliff: anything older than the window is gone completely.

from langchain_core.messages.utils import trim_messages, count_tokens_approximately

def trim_history(messages):
    return trim_messages(
        messages,
        strategy="last",              # keep the most recent messages
        token_counter=count_tokens_approximately,
        max_tokens=4000,              # budget for history
        start_on="human",             # never start mid-exchange
        include_system=True,          # always keep the system prompt
        allow_partial=False,
    )

Strategy 2: Summarization. When the history exceeds a threshold, use an LLM call to compress the older messages into a running summary, then keep the summary plus recent turns. You retain the gist of turn 5 at turn 90, at the cost of one extra (cheap, small-model) call per compression and some lossy detail.

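from langchain_core.messages import HumanMessage, RemoveMessage
from langchain_core.messages.utils import count_tokens_approximately

# Assumes `small_llm` is a cheap chat model and that the graph state has a
# "summary" key next to "messages". Your agent node must also prepend
# state["summary"] to the prompt, or the summary is never seen.
async def summarize_if_needed(state):
    messages = state["messages"]
    if count_tokens_approximately(messages) < 6000:
        return {}

    # Compress everything except the last 4 messages
    to_summarize = messages[:-4]
    prior = state.get("summary", "")
    prompt = (
        f"Current summary:\n{prior}\n\n"
        "Extend the summary with the new messages. Preserve names, "
        "numbers, decisions, and unresolved questions. Max 300 words."
    )
    # Send the instruction as a final human turn: some providers reject
    # system messages that are not the first message.
    summary = await small_llm.ainvoke(
        to_summarize + [HumanMessage(content=prompt)]
    )
    return {
        "summary": summary.content,
        # RemoveMessage deletes by ID through the messages reducer
        "messages": [RemoveMessage(id=m.id) for m in to_summarize],
    }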

Which one? A pragmatic rule: trim by default, summarize when conversations are long and stateful. A code-review bot that answers one question per thread only needs trimming. A sales agent running 60-turn negotiations needs summarization, because "the customer said budget is $50K in turn 3" must survive to turn 55. Instruct the summarizer explicitly to preserve entities, numbers, and open decisions - a generic "summarize this" prompt will smooth away exactly the details that matter.

One more production note: summarization changes what the model sees, which changes behavior. Any time you alter your compression strategy, run your eval suite before shipping - our agent evals guide covers how to build regression tests for exactly this.
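A small sketch of what such a regression check can look like for summarization, assuming you hand-list the details each test conversation must preserve. missing_details and the case dictionary are hypothetical; the summaries here are hard-coded where a real suite would call your summarizer.

```python
def missing_details(summary: str, required: list[str]) -> list[str]:
    """Return the required details that the summary dropped (case-insensitive)."""
    lowered = summary.lower()
    return [d for d in required if d.lower() not in lowered]


case = {
    "required": ["$50K", "Priya", "Q3"],
    "good_summary": "Priya confirmed a $50K budget; migration agreed for Q3.",
    "bad_summary": "The customer discussed budget and a migration timeline.",
}

print(missing_details(case["good_summary"], case["required"]))  # []
print(missing_details(case["bad_summary"], case["required"]))   # ['$50K', 'Priya', 'Q3']
```

Substring matching is deliberately crude. It catches the common failure, a summary that smooths away names and numbers, without needing an LLM judge.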

Persisting Short-Term Memory: LangGraph Checkpointers

Trimming and summarization manage the size of short-term memory. Checkpointers manage its durability. Without persistence, your agent's memory lives in a Python process, and every deploy, crash, or scale-down event wipes active conversations.

LangGraph's answer is the checkpointer: after every node execution, the entire graph state (including the message list) is snapshotted to a database, keyed by a thread_id. Invoke the graph again with the same thread ID and it resumes exactly where it left off - on any server, after any restart.

from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

DB_URI = "postgresql://agent:agent@localhost:5432/langgraph"

# Assumes `llm` is your chat model and trim_history is defined as above.
async def agent_node(state: MessagesState):
    response = await llm.ainvoke(trim_history(state["messages"]))
    return {"messages": [response]}

def build_graph(checkpointer):
    graph = StateGraph(MessagesState)
    graph.add_node("agent", agent_node)
    graph.add_edge(START, "agent")
    graph.add_edge("agent", END)
    return graph.compile(checkpointer=checkpointer)

async def main():
    # from_conn_string is an async context manager that owns the connection,
    # so the graph must be used inside the `async with` block.
    async with AsyncPostgresSaver.from_conn_string(DB_URI) as checkpointer:
        await checkpointer.setup()  # creates checkpoint tables
        app = build_graph(checkpointer)

        # Every conversation is a thread. Same thread_id = same memory.
        config = {"configurable": {"thread_id": "user-42-session-7"}}
        result = await app.ainvoke(
            {"messages": [("user", "My name is Priya, I run ops at Kestrel.")]},
            config=config,
        )
        # Later, even after a server restart:
        result = await app.ainvoke(
            {"messages": [("user", "What's my name?")]},  # -> "Priya"
            config=config,
        )

Three practical rules from running this in production:

Use Postgres, not MemorySaver, for anything real. MemorySaver (in-memory) is for notebooks and tests. AsyncPostgresSaver gives you durability plus the ability to run multiple agent workers against shared state. Redis and SQLite savers exist for other trade-offs.

Design your thread ID scheme deliberately. thread_id = user_id gives one endless conversation per user (memory grows forever). thread_id = session_id gives clean per-session memory but no continuity. Most products want session-scoped threads plus long-term memory for continuity - the next section.

Clean up old checkpoints. Checkpoint tables grow with every turn of every conversation. Add a job that deletes threads inactive for 30+ days, or your Postgres volume becomes the surprise line item. We saw exactly this pattern when covering Postgres checkpointing in our LangGraph + vLLM production deployment guide.
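The thread ID scheme and the cleanup rule can be sketched without touching the database. This assumes you track a last-activity timestamp per thread somewhere you control; make_thread_id and inactive_threads are hypothetical helpers, and the actual deletion depends on your checkpointer and schema.

```python
from datetime import datetime, timedelta, timezone


def make_thread_id(user_id: str, session_id: str) -> str:
    """Session-scoped thread ID that still records which user owns it."""
    return f"user-{user_id}-session-{session_id}"


def inactive_threads(
    last_active: dict[str, datetime], now: datetime, max_age_days: int = 30
) -> list[str]:
    """Return thread IDs with no activity inside the retention window."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(tid for tid, seen in last_active.items() if seen < cutoff)


now = datetime(2026, 6, 20, tzinfo=timezone.utc)
last_active = {
    make_thread_id("42", "7"): now - timedelta(days=2),
    make_thread_id("42", "3"): now - timedelta(days=45),
    make_thread_id("17", "1"): now - timedelta(days=31),
}
print(inactive_threads(last_active, now))
# ['user-17-session-1', 'user-42-session-3']
```

Embedding the user ID in the thread ID also makes "delete everything for user 42" a prefix match rather than a join.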

Checkpointers also unlock more than memory: because state is snapshotted at every step, you get time travel (replay from any checkpoint) and interrupts for free, which is the foundation of human-in-the-loop agent workflows.

Cross-Session Memory: The LangGraph Store

Checkpointers are thread-scoped by design. When Priya starts a new session tomorrow with a new thread ID, everything she told the agent yesterday is invisible. Cross-session memory needs a second mechanism, and in LangGraph that is the Store.

The asymmetry between the two is deliberate. Checkpointing is automatic because conversation history is structural - every agent needs it. The Store requires explicit reads and writes in your nodes because what to remember long-term is a product decision. Nobody but you can decide that "prefers invoices in EUR" is worth persisting and "asked what time it is" is not.

import uuid

from langchain_core.messages import SystemMessage
from langchain_core.runnables import RunnableConfig
from langgraph.graph import MessagesState
from langgraph.store.base import BaseStore
from langgraph.store.postgres.aio import AsyncPostgresStore

def memory_namespace(config: RunnableConfig) -> tuple[str, str]:
    # The store is namespaced - typically by user. Assumes the caller
    # passes user_id in config["configurable"] next to thread_id.
    return ("memories", config["configurable"]["user_id"])

async def agent_node(state: MessagesState, config: RunnableConfig, *, store: BaseStore):
    # 1. RECALL: search long-term memory for relevant facts
    namespace = memory_namespace(config)
    query = state["messages"][-1].content
    memories = await store.asearch(namespace, query=query, limit=5)
    memory_block = "\n".join(f"- {m.value['fact']}" for m in memories)

    system = SystemMessage(content=(
        "You are a helpful assistant.\n"
        f"Known facts about this user:\n{memory_block or 'None yet.'}"
    ))
    response = await llm.ainvoke([system] + state["messages"])
    return {"messages": [response]}

async def write_memory_node(state: MessagesState, config: RunnableConfig, *, store: BaseStore):
    # 2. EXTRACT: ask a small model what is worth remembering
    namespace = memory_namespace(config)
    extraction = await small_llm.ainvoke(
        state["messages"][-4:] + [SystemMessage(content=(
            "Extract durable facts about the user worth remembering "
            "across sessions (preferences, role, constraints). "
            "Return one fact per line, or NONE."
        ))]
    )
    for fact in extraction.content.splitlines():
        if fact.strip() and fact.strip() != "NONE":
            await store.aput(
                namespace, str(uuid.uuid4()), {"fact": fact.strip()}
            )
    return {}

async def main():
    # from_conn_string is an async context manager. `graph` and
    # `checkpointer` are built as in the checkpointer section.
    async with AsyncPostgresStore.from_conn_string(DB_URI) as store:
        await store.setup()
        app = graph.compile(checkpointer=checkpointer, store=store)

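Configure the store with an embedding index (an embedding model plus its vector dimensions) and store.asearch() ranks results by semantic similarity to the query. Without an index the store has nothing to rank by, so the query cannot drive semantic retrieval and you are left with namespace and filter matching. Semantic ranking is what you want once a user has more than a handful of memories.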

The pattern above - recall at the start of a turn, extract and write after the turn - is the core loop of every long-term memory system, whether you build it on the LangGraph Store, a raw vector database, or a managed platform. The write side is where the engineering lives: deduplication (do not store "likes EUR invoices" fifteen times), contradiction handling (new fact supersedes old), and importance filtering (not everything a user says is a fact about them). Doing extraction in a background job rather than inline also keeps it off your response latency path.
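Here is one way the three write-side checks can fit together, as a sketch under simplifying assumptions: facts arrive with a key chosen by the extraction step (hypothetical), deduplication is exact match after normalization, and the importance filter is a stand-in word-count heuristic. Production systems usually replace each of these with something smarter.

```python
import re


def normalize(fact: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical facts match."""
    return re.sub(r"[^a-z0-9]+", " ", fact.lower()).strip()


def write_fact(store: dict[str, str], seen: set[str], key: str, fact: str) -> str:
    """Apply the three write-side checks and report what happened."""
    norm = normalize(fact)
    if len(norm.split()) < 2:
        return "rejected"      # importance filter (stand-in heuristic)
    if norm in seen:
        return "duplicate"     # deduplication
    seen.add(norm)
    outcome = "superseded" if key in store else "stored"
    store[key] = fact          # contradiction handling: newest value wins per key
    return outcome


store: dict[str, str] = {}
seen: set[str] = set()
print(write_fact(store, seen, "invoice_currency", "Prefers invoices in EUR"))   # stored
print(write_fact(store, seen, "invoice_currency", "prefers invoices in EUR."))  # duplicate
print(write_fact(store, seen, "invoice_currency", "Prefers invoices in USD"))   # superseded
print(write_fact(store, seen, "misc", "ok"))                                    # rejected
print(store)  # {'invoice_currency': 'Prefers invoices in USD'}
```

Returning the outcome rather than silently writing makes the write path easy to log and to assert on in tests.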

DIY Long-Term Memory with a Vector Store

If you are not on LangGraph, or you want full control, the same recall/extract loop works directly against a vector database. This is essentially RAG where the corpus is "things this user's agent has learned" instead of documents - if you have built a RAG agent, you already know 80% of this.

import time
import uuid

import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.PersistentClient(path="./memory_db")
memories = chroma.get_or_create_collection("user_memories")

def remember(user_id: str, fact: str, kind: str = "semantic"):
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=fact
    ).data[0].embedding
    memories.add(
        ids=[str(uuid.uuid4())],
        embeddings=[emb],
        documents=[fact],
        metadatas=[{
            "user_id": user_id,
            "kind": kind,                    # semantic | episodic
            "created_at": time.time(),
        }],
    )

def recall(user_id: str, query: str, k: int = 5) -> list[str]:
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    hits = memories.query(
        query_embeddings=[emb],
        n_results=k,
        where={"user_id": user_id},          # hard tenant isolation
    )
    return hits["documents"][0]

Swap Chroma for pgvector if you already run Postgres (one less service, and you can join memories against your application tables), or Qdrant/Weaviate at larger scale. The database choice matters far less than three design decisions:

Always filter by user before searching. The where={"user_id": ...} clause is not an optimization, it is a security boundary. A memory system that can leak one user's facts into another user's context is a data breach with extra steps - see our agent security and privacy guide for the full threat model.

Store timestamps and use them. When two memories conflict, recency is your cheapest arbitration signal. Include created_at in what you show the model ("[2026-05-12] Customer upgraded to Enterprise") so it can reason about staleness.

Summarize episodes before storing them. Do not embed raw transcripts. Embed a 3-5 sentence summary of each session ("User debugged a webhook timeout, root cause was a 5s limit, resolved by moving to async processing"). Summaries retrieve better and cost less to re-inject.
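A short sketch of the timestamp advice, assuming each recalled memory is a dict with "fact" and "created_at" (epoch seconds) fields, as in the metadata stored above. render_memories is a hypothetical helper, not part of any library.

```python
from datetime import datetime, timezone


def render_memories(hits: list[dict]) -> str:
    """Format recalled memories newest-first, each prefixed with its UTC date."""
    ordered = sorted(hits, key=lambda h: h["created_at"], reverse=True)
    lines = []
    for h in ordered:
        day = datetime.fromtimestamp(h["created_at"], tz=timezone.utc).strftime("%Y-%m-%d")
        lines.append(f"[{day}] {h['fact']}")
    return "\n".join(lines)


hits = [
    {
        "fact": "Customer is on the Starter plan",
        "created_at": datetime(2026, 1, 1, tzinfo=timezone.utc).timestamp(),
    },
    {
        "fact": "Customer upgraded to Enterprise",
        "created_at": datetime(2026, 5, 12, tzinfo=timezone.utc).timestamp(),
    },
]
print(render_memories(hits))
# [2026-05-12] Customer upgraded to Enterprise
# [2026-01-01] Customer is on the Starter plan
```

Putting the newest memory first and showing the date gives the model two independent cues for resolving a conflict.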

If your stack is n8n rather than Python, the same architecture maps onto n8n's vector store nodes and memory sub-nodes - our n8n RAG agent guide walks through the no-code equivalent.

Mem0 vs Zep vs Letta vs DIY: The Managed Options

At some point every team asks: should we keep maintaining our extraction prompts, dedup logic, and contradiction handling, or pay someone whose entire product is memory? Here is the honest 2026 landscape.

	Mem0	Zep / Graphiti	Letta (MemGPT)	DIY (pgvector)
Core model	Vector-first memory layer with LLM extraction and optional graph	Temporal knowledge graph - facts carry validity intervals	Agent runtime that self-manages memory like an OS pages RAM	Whatever you build
Best for	Cross-session personalization, fastest integration	Facts that change over time ("what was true in Q1?"), enterprise scale	Long-running autonomous agents that edit their own memory	Simple needs, zero new vendors, full control
Pricing model	Free tier (10K memories), paid platform tiers, Pro graph features around $249/mo	Free tier, Flex from ~$25-125/mo usage-based, enterprise plans	Open source core, Letta Cloud usage-based	Your infra + your engineering time
Self-host	Yes - open source (Apache 2.0), Docker stack with pgvector/Qdrant + optional Neo4j	Graphiti engine is open source (needs Neo4j or FalkorDB); Zep platform is managed/enterprise	Yes - fully open source server	By definition
Integrations	Python/JS SDKs, LangGraph, CrewAI, AutoGen, REST API, MCP	Python/JS/Go SDKs, LangGraph, REST API, MCP	Own runtime + REST API, Agent File format, MCP	Anything, you write it
Watch out for	Extraction adds an LLM call per write; vector-first recall can miss temporal reasoning	Graph ingestion has latency and cost; heavier ops if self-hosting Graphiti	Opinionated - you adopt Letta's agent model, not just its memory	You own dedup, contradictions, retention, and evals forever

On benchmarks: on LongMemEval, a long-term memory benchmark whose question types include temporal reasoning and knowledge updates and which has become the de facto stress test for this category, Zep's Graphiti approach has posted meaningfully higher scores than vector-first systems on questions where facts change over time, while Mem0's own published results (ECAI 2025 paper) show strong accuracy-per-dollar on general conversational recall. Read vendor benchmarks skeptically in both directions, then test on your own data.

A minimal Mem0 integration, to show how thin the API is:

from mem0 import Memory

m = Memory()  # or MemoryClient(api_key=...) for the hosted platform

# After each exchange, hand Mem0 the messages - it extracts,
# deduplicates, and resolves contradictions itself
m.add(
    [{"role": "user", "content": "I'm vegetarian and allergic to nuts."}],
    user_id="priya",
)

# Before each turn, recall
hits = m.search("what should I cook for dinner?", user_id="priya")
context = "\n".join(h["memory"] for h in hits["results"])
# -> e.g. "Is vegetarian", "Allergic to nuts" (exact wording is model-generated)

Our recommendation logic: start DIY if your memory needs fit in one sentence ("remember user preferences and past ticket summaries") - the LangGraph Store or 200 lines on pgvector will carry you a long way. Reach for Mem0 when extraction quality and dedup start eating engineering weeks. Reach for Zep when your domain has facts with lifespans (CRM state, subscriptions, org charts) and "when was this true" matters. Reach for Letta when you are building persistent autonomous agents rather than adding memory to a chat product. And if you want a second pair of eyes on that call for your stack, we do this for client teams regularly.

What to Store, and What NOT To

Every memory system eventually stores something it should not have. Design the policy before the incident, not after.

Store: stable preferences (language, format, channel), role and context (job title, team, plan tier), durable decisions ("agreed to migrate in Q3"), interaction summaries with outcomes, and explicit corrections the user gave the agent - those are gold, because a correction remembered is a mistake never repeated.

Do not store:

PII you do not need. Names and emails are usually fine (and already in your user table). Health details, financial account numbers, government IDs, and anything about third parties the user mentioned should never land in a memory store. The nasty failure mode is the extraction LLM being too good: a user vents "my daughter's diagnosis has me distracted this week" and a naive extractor faithfully writes down a child's health information. Add explicit exclusion rules to your extraction prompt and a PII scrubbing pass on writes.

Secrets and credentials. Users paste API keys and passwords into chats constantly. Pattern-match and redact before anything reaches persistent storage.

Raw transcripts as memory. Beyond retrieval quality problems, full transcripts maximize your data liability surface. Summaries are both better memories and smaller breach blast radius.
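A minimal redaction pass for the secrets rule, to run on every write before persistence. The three patterns are illustrative assumptions only (an "sk-" style API key, an AWS access key ID, and password-style assignments); extend them for the credential formats you actually see, and treat this as a backstop rather than a guarantee.

```python
import re

# Illustrative patterns only; real deployments need a broader, maintained list.
SECRET_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{16,}\b"),                     # API keys with an "sk-" prefix
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                          # AWS access key IDs
    re.compile(r"(?i)\b(password|passwd|secret)\s*[:=]\s*\S+"),   # password=... style assignments
]


def redact_secrets(text: str) -> str:
    """Replace anything matching a secret pattern before it is persisted."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


print(redact_secrets("my key is sk-abc123def456ghi789jkl and password: hunter2"))
# my key is [REDACTED] and [REDACTED]
```

Run the same function over checkpointed messages if you can, since pasted credentials land in thread state before any extraction happens.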

Then set a retention policy and enforce it:

# Retention policy, enforced by a scheduled job
POLICY = {
    "semantic":   None,          # keep until superseded or user deletes
    "episodic":   90 * 86400,    # interaction summaries: 90 days
    "checkpoint": 30 * 86400,    # raw thread state: 30 days
}
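The scheduled job that enforces a policy like this can be very small. The sketch below assumes each record carries a "kind" and a "created_at" epoch timestamp; is_expired and sweep are hypothetical names, and a real job would issue deletes against your store instead of filtering a list.

```python
import time
from typing import Optional

DAY = 86400
POLICY: dict[str, Optional[int]] = {
    "semantic": None,        # keep until superseded or user deletes
    "episodic": 90 * DAY,    # interaction summaries: 90 days
    "checkpoint": 30 * DAY,  # raw thread state: 30 days
}


def is_expired(kind: str, created_at: float, now: float) -> bool:
    """True when a record has outlived the retention window for its kind."""
    max_age = POLICY[kind]
    return max_age is not None and now - created_at > max_age


def sweep(records: list[dict], now: Optional[float] = None) -> list[dict]:
    """Return only the records that should be kept."""
    now = time.time() if now is None else now
    return [r for r in records if not is_expired(r["kind"], r["created_at"], now)]


now = 1_800_000_000.0
records = [
    {"id": "a", "kind": "semantic", "created_at": now - 400 * DAY},
    {"id": "b", "kind": "episodic", "created_at": now - 91 * DAY},
    {"id": "c", "kind": "episodic", "created_at": now - 10 * DAY},
    {"id": "d", "kind": "checkpoint", "created_at": now - 31 * DAY},
]
print([r["id"] for r in sweep(records, now)])  # ['a', 'c']
```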

Under GDPR and similar regimes, memories about a user are personal data: you need deletion on request (namespacing every memory by user ID makes "delete user 42's memories" a one-liner - this is another reason the namespace discipline matters), a defensible answer to "why are you holding this," and honest disclosure that the assistant remembers across sessions. Users also simply deserve to know. A visible "what I remember about you" view with delete buttons is both a trust feature and your GDPR access-request implementation. The broader compliance picture is in our AI agent security and privacy guide.

Memory and Cost: The Token Economics

Memory decisions are cost decisions wearing a different hat. Work the numbers on a concrete example: a support agent, 40-turn average conversation, ~200 tokens per message, 1,000 conversations per day.

No management (full history every turn): by turn 40 you are sending ~8K tokens of history per call. Summed across a conversation, history alone costs roughly 160K input tokens, and at $3 per million input tokens that is about $0.50 per conversation, ~$500/day, just for memory the model mostly ignores.

Summarization (300-token summary + last 6 turns): steady-state history is ~1.5K tokens per call, or roughly $0.18/conversation including the summarizer calls on a small model - roughly a 60-65% cut with better answer quality on long threads, because the signal is no longer buried.
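The arithmetic behind those two figures, worked as a short script. It assumes one message per turn (which is what makes turn 40 come out at ~8K tokens) and counts only history tokens, so the summarized number excludes the small-model summarizer calls and lands slightly under the $0.18 quoted above.

```python
TURNS = 40
TOKENS_PER_MESSAGE = 200
PRICE_PER_MTOK = 3.00            # USD per million input tokens
CONVERSATIONS_PER_DAY = 1_000

# Full history: turn i resends all i messages so far.
full_tokens = sum(TOKENS_PER_MESSAGE * i for i in range(1, TURNS + 1))

# Summarized: history is capped at a 300-token summary plus the last 6 messages.
cap = 300 + 6 * TOKENS_PER_MESSAGE
capped_tokens = sum(min(TOKENS_PER_MESSAGE * i, cap) for i in range(1, TURNS + 1))


def cost(tokens: int) -> float:
    """Input-token cost in USD."""
    return tokens * PRICE_PER_MTOK / 1_000_000


print(full_tokens, round(cost(full_tokens), 3))           # 164000 0.492
print(capped_tokens, round(cost(capped_tokens), 3))       # 55100 0.165
print(round(cost(full_tokens) * CONVERSATIONS_PER_DAY))   # 492
print(round(1 - capped_tokens / full_tokens, 2))          # 0.66
```

Swap in your own turn counts and prices; the shape of the result (quadratic growth versus a flat cap) matters more than the exact dollar amounts.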

Two more levers compound with this:

Prompt caching rewards stable prefixes. Providers bill cached input tokens at a steep discount, commonly 10-25% of the normal rate (some also charge a premium to write the cache, so check your provider's pricing), but only for prompt prefixes that repeat exactly. Structure your prompt as [static system prompt] then [long-term memory block] then [summary] then [recent messages], and never interleave volatile content (timestamps, request IDs) into the stable sections. A memory block that changes only when memories change stays cache-warm across turns; one that is re-rendered with a fresh timestamp each call never caches at all.

Retrieval replaces context. Injecting the 5 most relevant memories (say 400 tokens) instead of "everything we know about the user" (4,000 tokens) is a 10x reduction on that block with zero quality loss when retrieval is good. This is the same argument as RAG versus stuffing whole documents into context.
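A sketch of cache-friendly prompt assembly, assuming plain strings for each section. assemble_prompt and prefix_fingerprint are hypothetical helpers; the fingerprint is only a debugging aid that shows when the stable prefix has changed, not something a provider requires. The running summary is grouped with the volatile part here because it changes on every compression.

```python
import hashlib


def assemble_prompt(
    system: str, memories: list[str], summary: str, recent: list[str]
) -> tuple[str, str]:
    """Return (stable_prefix, volatile_suffix) in cache-friendly order."""
    # Sort memories so the block is byte-identical regardless of retrieval order.
    memory_block = "\n".join(f"- {m}" for m in sorted(memories))
    stable = f"{system}\n\nKnown facts:\n{memory_block}"
    volatile = f"Summary so far:\n{summary}\n\n" + "\n".join(recent)
    return stable, volatile


def prefix_fingerprint(stable: str) -> str:
    """Short hash of the stable prefix; a change between turns means a cold cache."""
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()[:12]


turn1, _ = assemble_prompt("You are a support agent.", ["On Enterprise", "Prefers email"], "", ["hi"])
turn2, _ = assemble_prompt("You are a support agent.", ["Prefers email", "On Enterprise"], "s", ["hi", "ok"])
stamped, _ = assemble_prompt("You are a support agent. Time: 12:01", ["On Enterprise"], "", ["hi"])

print(prefix_fingerprint(turn1) == prefix_fingerprint(turn2))    # True
print(prefix_fingerprint(turn1) == prefix_fingerprint(stamped))  # False
```

Logging the fingerprint per turn is a cheap way to spot the commit that accidentally put a timestamp in the system prompt.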

The one cost that surprises teams: memory writes have their own LLM bill. Extraction, summarization, and dedup are all inference. Run them on a small cheap model (extraction is not a frontier-model task), batch them, and do them asynchronously. For the full toolkit - model routing, caching strategy, batch APIs - see our guide on reducing AI agent LLM costs.

Evaluating Memory Quality

"It seems to remember stuff" is not a metric. Memory systems fail quietly - wrong fact retrieved, right fact missed, stale fact trusted - and users experience it as "the agent is dumb," not as a memory bug. You need eval coverage on four axes:

1. Recall. Given a conversation where a fact was stated, does the agent use it N sessions later? Build test cases as (setup conversation, later question, expected answer): tell the agent "our deploy freeze is Fridays" in session 1, ask "can we ship this Friday?" in session 3, assert the answer reflects the freeze.

2. Precision. Does retrieval inject irrelevant memories? Pollution is worse than omission - an irrelevant memory in context actively misleads the model. Measure the fraction of injected memories that were actually pertinent to the turn (an LLM judge scores this well).

3. Freshness. After a fact changes ("we moved from Slack to Teams"), does the agent use the new value? This is where vector-only systems get caught: both facts are in the store, both are similar to the query, and only timestamps or supersession logic saves you. Write explicit contradiction test cases.

4. Boundary safety. Adversarial cases: does user A's memory ever surface for user B? Does the agent refuse to "remember" injected instructions from tool outputs? These belong in CI, not in a quarterly audit.
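The boundary-safety axis is the easiest to automate. Below is a sketch of a CI test for cross-user leakage, using a tiny in-memory stand-in (TinyMemoryStore, hypothetical) with naive keyword matching. In a real suite you would run the same assertions against your actual recall function.

```python
class TinyMemoryStore:
    """In-memory stand-in for a real store; recall always requires a user_id."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, str]] = []  # (user_id, fact)

    def remember(self, user_id: str, fact: str) -> None:
        self._rows.append((user_id, fact))

    def recall(self, user_id: str, query: str) -> list[str]:
        # Filter by owner first, then do a naive keyword match.
        words = set(query.lower().split())
        return [
            fact
            for uid, fact in self._rows
            if uid == user_id and words & set(fact.lower().split())
        ]


def test_no_cross_user_leak() -> None:
    store = TinyMemoryStore()
    store.remember("user-a", "deploy freeze is Fridays")
    store.remember("user-b", "prefers invoices in EUR")
    # The same query must never return another user's memory.
    assert store.recall("user-b", "deploy freeze") == []
    assert store.recall("user-a", "deploy freeze") == ["deploy freeze is Fridays"]


test_no_cross_user_leak()
```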

# Sketch: a memory eval case
{
  "setup": [
    {"session": 1, "user": "We switched our deploy freeze to Fridays."},
    {"session": 2, "user": "Actually, freeze moved to Thursdays now."}
  ],
  "probe": {"session": 3, "user": "Can we deploy Thursday afternoon?"},
  "assert": "answer reflects Thursday freeze, not Friday"
}

Run these like any other agent eval: a fixed suite in CI, plus scored sampling of production traces. The mechanics - LLM judges, trajectory checks, regression gating - are covered in our agent evals and testing guide, and wiring memory-hit-rate metrics into your traces is covered in the observability and monitoring guide. If you instrument one thing, make it this: log which memories were retrieved and injected on every turn. When an answer is wrong, the first debugging question is always "what did the agent think it knew?"
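A sketch of that retrieval log, assuming each hit has an "id" and a similarity "score". log_retrieval and the field names are hypothetical. It records memory IDs rather than memory text, so the log itself does not become a second copy of user data.

```python
import json
import logging

logger = logging.getLogger("agent.memory")


def log_retrieval(thread_id: str, turn: int, query: str, hits: list[dict]) -> str:
    """Emit one JSON line per turn describing what was injected into the prompt."""
    record = {
        "event": "memory_retrieval",
        "thread_id": thread_id,
        "turn": turn,
        "query": query,
        "injected": [{"id": h["id"], "score": round(h["score"], 3)} for h in hits],
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


line = log_retrieval(
    "user-42-session-7", 3, "can we ship this Friday?",
    [{"id": "mem-17", "score": 0.91234}],
)
print(json.loads(line)["injected"])  # [{'id': 'mem-17', 'score': 0.912}]
```

If the query can contain sensitive text in your product, log a hash of it instead.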

Common Mistakes (We Have Made Most of These)

Storing everything. The most common failure. Teams treat memory as a log and persist every exchange "in case it's useful." Result: retrieval precision collapses, costs climb, and the privacy surface balloons. Memory is curation. If your write path has no filter that rejects most candidate memories, you do not have a memory system, you have a landfill with an embedding index.

Stale memories with no supersession. "Customer is evaluating us" retrieved eight months after they became a paying customer makes the agent look worse than no memory at all. Every semantic fact needs a path to being updated or invalidated: overwrite by key, supersession links, or a temporal graph. Append-only is only correct for episodic memory.

Unbounded growth. Checkpoint tables, memory stores, and vector collections all grow monotonically unless you decide otherwise. We have seen a checkpointer database quietly reach tens of gigabytes because nobody owned deletion. TTLs, archival jobs, and per-user memory caps (keep the top N by recency and usage) are day-one features, not future work.

Extraction on the hot path. Running memory extraction synchronously before returning the response adds 500-2,000ms to every turn for work the user never sees. Extract in the background after responding.

Trusting memory as ground truth. Memories are model-extracted paraphrases, and extraction has an error rate. Frame injected memories as context ("Previously noted about this user: ..."), not as verified fact, and let the agent confirm consequential details ("You mentioned you're on the Enterprise plan - is that still right?") before acting on them irreversibly.

Skipping the memory UX. If users cannot see, correct, or delete what the agent remembers, your first wrong memory becomes a trust incident. "Forget that" should work as a command.

Building the fancy version first. Teams reach for a temporal knowledge graph before they have shipped "remember the user's name across sessions." Sequence it: trimming, then checkpointer persistence, then a simple cross-session store, then evaluate whether your failures actually need graphs. Most do not.
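For the per-user memory cap mentioned under unbounded growth, here is one possible ranking, assuming each memory tracks a "use_count" and a "last_used_at" epoch timestamp. The scoring formula (usage decayed with a 30-day half-life) is an illustrative choice, not a standard; enforce_cap is a hypothetical helper.

```python
def enforce_cap(
    memories: list[dict], max_items: int, now: float, half_life_days: float = 30.0
) -> list[dict]:
    """Keep the top `max_items` memories ranked by usage, decayed by age."""

    def score(m: dict) -> float:
        age_days = (now - m["last_used_at"]) / 86400
        return (1 + m["use_count"]) * 0.5 ** (age_days / half_life_days)

    return sorted(memories, key=score, reverse=True)[:max_items]


now = 1_000 * 86400.0
memories = [
    {"id": "a", "use_count": 10, "last_used_at": now - 1 * 86400},    # used often, recently
    {"id": "b", "use_count": 0, "last_used_at": now},                 # brand new
    {"id": "c", "use_count": 20, "last_used_at": now - 300 * 86400},  # popular long ago
]
print([m["id"] for m in enforce_cap(memories, max_items=2, now=now)])  # ['a', 'b']
```

Whatever falls outside the cap is a candidate for archival or deletion by the retention job, not necessarily an immediate delete.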

Putting It Together: A Reference Architecture

Here is the memory stack we deploy for most production agents, in the order you should build it:

┌──────────────────────────────────────────────────────┐
│ Per-turn prompt assembly                             │
│  1. System prompt (static, cache-friendly)           │
│  2. Long-term memories (top-5 from Store / Mem0)     │
│  3. Running summary (if conversation > threshold)    │
│  4. Recent messages (trimmed to token budget)        │
├──────────────────────────────────────────────────────┤
│ During turn:  checkpointer persists thread state     │
├──────────────────────────────────────────────────────┤
│ After turn (async): extract -> dedup -> PII scrub    │
│                     -> write to long-term store      │
├──────────────────────────────────────────────────────┤
│ Scheduled: retention jobs, checkpoint cleanup,       │
│            memory eval suite in CI                   │
└──────────────────────────────────────────────────────┘

Week 1: Postgres checkpointer + message trimming. Conversations survive restarts, costs are bounded. Week 2: summarization for long threads, LangGraph Store (or pgvector) for cross-session facts, recall/extract loop with a small model. Week 3: retention policies, PII scrubbing on writes, memory eval cases in CI, retrieval logging in your traces. Then, and only if your evals show temporal or relational failures: evaluate Mem0 or Zep against your own test set.

That sequencing matters because each layer exposes whether you actually need the next one. Plenty of successful production agents stop at week 2.

If you want to build this end to end with guidance - checkpointers, stores, memory evals, cost controls, and deployment - our Production-Grade Agent Engineering course walks through the full stack with the exact patterns from this guide, taken from systems we run for real clients. If your team works in n8n instead of Python, the n8n AI Agents course covers the equivalent memory patterns with n8n's native memory and vector nodes. And if you would rather have it built for you, work with us directly.

Memory is the difference between an agent that performs a task and an agent that gets better at working with you. Build it deliberately, keep it small, and measure it like everything else.

FAQ

What is the difference between short-term and long-term memory in AI agents?

Short-term memory is the message history of the current conversation, scoped to a single thread and assembled into the prompt on every turn. Long-term memory is information that survives across sessions - user preferences, past interaction summaries, learned instructions - stored externally (vector store, key-value store, or knowledge graph) and retrieved selectively when relevant. Short-term memory is managed with trimming and summarization; long-term memory needs an explicit write path with extraction, deduplication, and retention policies.

Do LLMs have built-in memory?

No. Every LLM API call is stateless - the model retains nothing between requests. All apparent memory (including ChatGPT's memory feature) is an application layer that stores text externally and injects it back into the prompt. This means memory is fully your design responsibility, and also that you pay input-token prices for every piece of memory you include on each call.

Should I use trimming or summarization for conversation memory?

Trim by default: keep the system prompt plus the most recent messages within a token budget. It is free, fast, and deterministic. Add summarization when conversations are long and stateful - when a fact from turn 3 must influence turn 50. Summarize older messages into a running summary with a small, cheap model, and explicitly instruct it to preserve names, numbers, and decisions. Most production agents combine both: summary plus a trimmed window of recent turns.
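The combined approach can be sketched in a few lines. This is a minimal illustration, not a LangGraph API: `count_tokens` is a crude whitespace counter standing in for a real tokenizer, and `summarize` is a hypothetical callable standing in for the small-model call.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a rough stand-in for a real tokenizer.
    return len(text.split())


def build_context(
    system_prompt: str,
    summary: str,
    history: list[Message],
    budget: int,
    summarize: Callable[[str, list[Message]], str],
) -> tuple[list[Message], str, list[Message]]:
    """Return (messages to send, updated summary, recent window to keep)."""
    # Walk backwards from the newest message, keeping turns until the budget is spent.
    # Simplification: the summary's own tokens are not counted against the budget.
    kept: list[Message] = []
    used = count_tokens(system_prompt)
    for msg in reversed(history):
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()

    # Everything that fell out of the window is folded into the running summary.
    dropped = history[: len(history) - len(kept)]
    if dropped:
        summary = summarize(summary, dropped)

    messages = [{"role": "system", "content": system_prompt}]
    if summary:
        messages.append({"role": "system", "content": f"Summary so far: {summary}"})
    return messages + kept, summary, kept
```

After each turn, persist the returned summary and the kept window in place of the full history, so dropped messages are summarized only once.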

What is a LangGraph checkpointer and how is it different from the Store?

A checkpointer persists the graph state of a single thread - it makes one conversation durable across restarts and enables resuming, time travel, and human-in-the-loop interrupts. It is automatic once you compile with checkpointer= and pass a thread_id in the run config. The Store is for cross-thread, long-term memory: you explicitly put() facts and search() them in future sessions, typically namespaced by user ID. Production apps generally use both: Postgres checkpointer for threads, Store (or an external memory service) for durable facts.

Mem0 vs Zep vs Letta - which should I choose?

Mem0 is a vector-first memory layer with LLM-powered extraction and dedup - the fastest path to cross-session personalization, open source with a hosted platform. Zep is built on Graphiti, a temporal knowledge graph that timestamps facts, so it wins when facts change over time and 'when was this true' matters. Letta (formerly MemGPT) is a full agent runtime where the agent manages its own memory tiers - best for long-running autonomous agents. If your needs are simple, DIY on the LangGraph Store or pgvector is a legitimate fourth option. Benchmark on your own data before committing.

How does agent memory affect LLM costs?

Directly and heavily. Sending full conversation history on every turn can make memory the largest line item in a long conversation - summarization typically cuts that by 60% or more. Retrieval-based long-term memory (injecting the top 5 relevant facts instead of everything known) can shrink the memory block by roughly 10x. Structuring prompts so memory blocks are stable also unlocks prompt caching discounts of 75-90% on those tokens. Remember that memory writes (extraction, summarization) are LLM calls too - run them on small models, asynchronously.
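A back-of-the-envelope calculation shows why full history dominates. The turn count, tokens per turn, window size, and summary size below are illustrative assumptions, not measurements, and the sketch ignores the cost of the summarization calls themselves and any prompt caching.

```python
def full_history_input_tokens(turns: int, tokens_per_turn: int) -> int:
    # On turn k the prompt carries all k turns so far, so the total grows quadratically.
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def summarized_input_tokens(
    turns: int, tokens_per_turn: int, window: int, summary_tokens: int
) -> int:
    # On turn k the prompt carries at most `window` recent turns, plus a
    # fixed-size summary once older turns have fallen out of the window.
    total = 0
    for k in range(1, turns + 1):
        total += min(k, window) * tokens_per_turn
        if k > window:
            total += summary_tokens
    return total


full = full_history_input_tokens(turns=50, tokens_per_turn=200)
short = summarized_input_tokens(turns=50, tokens_per_turn=200, window=10, summary_tokens=300)
print(full, short, f"{1 - short / full:.0%} fewer input tokens")
# prints: 255000 103000 60% fewer input tokens
```

The gap widens as conversations get longer, because the full-history total grows with the square of the turn count while the summarized total grows linearly.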

Is it safe to store user information in agent memory under GDPR?

It can be, but memories about users are personal data and must be treated as such. Namespace every memory by user ID so deletion requests are trivial, exclude sensitive categories (health, financial, third-party information) at extraction time, scrub credentials and PII patterns before writes, set retention periods per memory type, and disclose to users that the assistant remembers across sessions. A user-facing view of stored memories with delete controls covers both trust and access-request obligations.
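Two of those controls, scrubbing before writes and per-user namespacing, can be sketched as follows. `UserMemoryStore` is a hypothetical in-memory stand-in for your real store, and the two regex patterns are illustrative only; production PII detection needs a broader, tested rule set.

```python
import re
from collections import defaultdict

# Assumption: two illustrative patterns only (email addresses, 13-16 digit card numbers).
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),
]


def scrub(text: str) -> str:
    # Replace every match of each pattern with its placeholder.
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


class UserMemoryStore:
    """Hypothetical store: every record lives under its user's namespace."""

    def __init__(self) -> None:
        self._by_user: dict[str, list[str]] = defaultdict(list)

    def write(self, user_id: str, fact: str) -> None:
        # Scrub before anything is persisted.
        self._by_user[user_id].append(scrub(fact))

    def read(self, user_id: str) -> list[str]:
        return list(self._by_user.get(user_id, []))

    def delete_user(self, user_id: str) -> int:
        # A deletion request is a single namespace drop; returns records removed.
        return len(self._by_user.pop(user_id, []))
```

Because nothing is stored outside a user's namespace, `delete_user` is the whole erasure path, and `read` doubles as the data behind a user-facing memory view.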

How do I test whether my agent's memory actually works?

Build eval cases with a setup phase and a probe phase: state a fact in session 1, ask a question that depends on it in session 3, and assert the answer uses it. Cover four axes - recall (facts are found), precision (irrelevant memories are not injected), freshness (updated facts supersede stale ones), and boundary safety (no cross-user leakage). Run the suite in CI on every change to extraction prompts or retrieval logic, and log retrieved memories on every production turn so you can debug wrong answers by inspecting what the agent thought it knew.
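A setup-and-probe harness along those lines might look like this. It is a sketch under assumptions: the `Memory` interface and `KeywordMemory` toy store are hypothetical, and to stay self-contained it asserts on retrieved memories rather than on the final model answer.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Memory(Protocol):
    # Hypothetical interface; adapt to whatever your memory layer exposes.
    def remember(self, user_id: str, text: str) -> None: ...
    def recall(self, user_id: str, query: str) -> list[str]: ...


class KeywordMemory:
    """Toy stand-in: per-user fact list with word-overlap retrieval."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = {}

    def remember(self, user_id: str, text: str) -> None:
        self._facts.setdefault(user_id, []).append(text)

    def recall(self, user_id: str, query: str) -> list[str]:
        words = set(query.lower().split())
        return [f for f in self._facts.get(user_id, []) if words & set(f.lower().split())]


@dataclass
class EvalCase:
    name: str
    setup: list[tuple[str, str]]  # (user_id, fact) written in earlier sessions
    probe: tuple[str, str]        # (user_id, query) asked in a later session
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def run_case(memory: Memory, case: EvalCase) -> bool:
    # Setup phase: write the facts. Probe phase: retrieve and check.
    for user_id, fact in case.setup:
        memory.remember(user_id, fact)
    retrieved = " ".join(memory.recall(*case.probe)).lower()
    found = all(s.lower() in retrieved for s in case.must_contain)
    leaked = any(s.lower() in retrieved for s in case.must_not_contain)
    return found and not leaked


CASES = [
    EvalCase("recall", [("alice", "Alice is allergic to peanuts")],
             ("alice", "what is alice allergic to"), must_contain=["peanuts"]),
    EvalCase("boundary", [("bob", "Bob lives in Lisbon")],
             ("alice", "where does bob live"), must_not_contain=["lisbon"]),
    EvalCase("freshness", [("alice", "Alice lives in Paris"), ("alice", "Alice moved to Berlin")],
             ("alice", "where does alice live"),
             must_contain=["berlin"], must_not_contain=["paris"]),
]

# Fresh store per case so cases cannot contaminate each other.
results = {case.name: run_case(KeywordMemory(), case) for case in CASES}
print(results)
```

The toy store passes the recall and boundary cases but fails freshness, because it never supersedes the stale fact. That is the kind of failure this suite exists to surface before it reaches production.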
