
"AI Agent Memory Patterns: Building Stateful AI Applications with Long-Term Memory in 2026"
AI Agent Memory Patterns: Building Stateful AI Applications with Long-Term Memory in 2026#
The biggest limitation of LLMs isn't intelligence—it's amnesia. Every API call starts fresh. Your model doesn't remember the conversation from five minutes ago, let alone last week.
AI agent memory patterns solve this. They give your AI applications the ability to maintain context across conversations, remember user preferences, learn from interactions, and behave like an assistant that actually knows who it's talking to.
This guide covers the five core memory patterns every AI developer should know, with production-ready code examples.
Why AI Agents Need Memory#
Without memory, every interaction is a first meeting:
User: "I prefer dark mode and Python code examples."
AI: "Got it! I'll use dark mode and Python."
[New session]
User: "Show me how to make an API call."
AI: "Sure! Here's an example in JavaScript..." // Forgot everything
Memory patterns fix this by persisting context between API calls.
The Five Memory Patterns#
| Pattern | Context Window Usage | Best For | Complexity |
|---|---|---|---|
| Buffer Memory | Full history | Short conversations | Low |
| Sliding Window | Last N messages | Medium conversations | Low |
| Summary Memory | Compressed summary | Long conversations | Medium |
| Vector Memory | Semantic retrieval | Knowledge-heavy agents | High |
| Hybrid Memory | Combined approach | Production agents | High |
Pattern 1: Buffer Memory (Full History)#
The simplest approach: store every message and send the full history with each request.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.crazyrouter.com/v1",
    api_key="your-crazyrouter-key"
)

class BufferMemory:
    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user_message(self, content: str):
        self.messages.append({"role": "user", "content": content})

    def add_assistant_message(self, content: str):
        self.messages.append({"role": "assistant", "content": content})

    def chat(self, user_input: str, model: str = "gpt-5.2") -> str:
        self.add_user_message(user_input)
        response = client.chat.completions.create(
            model=model,
            messages=self.messages,
            temperature=0.7
        )
        reply = response.choices[0].message.content
        self.add_assistant_message(reply)
        return reply

# Usage
memory = BufferMemory("You are a helpful coding assistant.")
print(memory.chat("I'm working on a FastAPI project."))
print(memory.chat("How do I add authentication?"))  # Remembers FastAPI context
```
Pros: Full context, simple implementation
Cons: Token costs grow linearly, hits context window limits
Best for: Chatbots with <20 message conversations
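To see why buffer memory gets expensive, note that each turn resends the entire history, so cumulative input tokens billed over a conversation grow quadratically even though per-turn history grows linearly. A back-of-the-envelope sketch (the 50-tokens-per-message figure is an assumption for illustration, not a measured value):

```python
# Rough illustration of buffer-memory cost growth: every turn resends the
# whole history, so total input tokens across a conversation grow
# quadratically. Assumes a constant 50 tokens per message for simplicity.

def tokens_sent_per_turn(turn: int, tokens_per_message: int = 50) -> int:
    """Tokens in the request for turn N (1-indexed): system prompt,
    every prior user/assistant pair, plus the new user message."""
    messages = 1 + 2 * (turn - 1) + 1
    return messages * tokens_per_message

def cumulative_tokens(turns: int) -> int:
    """Total input tokens billed across the whole conversation."""
    return sum(tokens_sent_per_turn(t) for t in range(1, turns + 1))
```

Doubling a conversation from 10 to 20 turns nearly quadruples the total input tokens billed, which is why the patterns below bound the history instead.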
Pattern 2: Sliding Window Memory#
Keep only the last N messages, discarding older ones.
```python
from collections import deque

class SlidingWindowMemory:
    def __init__(self, system_prompt: str, window_size: int = 20):
        self.system_prompt = {"role": "system", "content": system_prompt}
        self.window = deque(maxlen=window_size)

    def get_messages(self) -> list:
        return [self.system_prompt] + list(self.window)

    def chat(self, user_input: str, model: str = "claude-sonnet-4-5") -> str:
        self.window.append({"role": "user", "content": user_input})
        response = client.chat.completions.create(
            model=model,
            messages=self.get_messages(),
            temperature=0.7
        )
        reply = response.choices[0].message.content
        self.window.append({"role": "assistant", "content": reply})
        return reply

# Usage - keeps last 20 messages
memory = SlidingWindowMemory(
    "You are a data analysis assistant.",
    window_size=20
)
```
Pros: Bounded token costs, prevents context overflow
Cons: Loses early conversation context
Best for: Support bots, general-purpose assistants
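A message-count window can still overflow the context limit if individual messages are long. A common variant trims by an estimated token budget instead of a message count. The 4-characters-per-token ratio below is a rough heuristic; a production version would use a real tokenizer such as tiktoken:

```python
# Sketch of a sliding window bounded by an estimated token budget rather
# than a message count. The 4-chars-per-token estimate is a rough
# heuristic, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```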
Pattern 3: Summary Memory#
Compress old conversations into summaries, keeping recent messages in full.
```python
class SummaryMemory:
    def __init__(self, system_prompt: str, summary_threshold: int = 10):
        self.system_prompt = system_prompt
        self.summary = ""
        self.recent_messages = []
        self.summary_threshold = summary_threshold

    def _summarize(self) -> str:
        """Compress old messages into a summary."""
        messages_to_summarize = self.recent_messages[:-4]  # Keep last 4
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in messages_to_summarize
        )
        response = client.chat.completions.create(
            model="gemini-2.5-flash",  # Cheap model for summarization
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation concisely. Preserve key facts, decisions, and user preferences."
                },
                {"role": "user", "content": conversation_text}
            ],
            max_tokens=300,
            temperature=0
        )
        return response.choices[0].message.content

    def get_messages(self) -> list:
        messages = [{"role": "system", "content": self.system_prompt}]
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{self.summary}"
            })
        messages.extend(self.recent_messages)
        return messages

    def chat(self, user_input: str, model: str = "gpt-5.2") -> str:
        self.recent_messages.append({"role": "user", "content": user_input})
        # Summarize when threshold reached
        if len(self.recent_messages) > self.summary_threshold:
            new_summary = self._summarize()
            self.summary = (
                f"{self.summary}\n{new_summary}" if self.summary else new_summary
            )
            self.recent_messages = self.recent_messages[-4:]  # Keep recent
        response = client.chat.completions.create(
            model=model,
            messages=self.get_messages(),
            temperature=0.7
        )
        reply = response.choices[0].message.content
        self.recent_messages.append({"role": "assistant", "content": reply})
        return reply
```
Pros: Preserves key context indefinitely, bounded costs
Cons: Summary may lose nuance, extra API call for summarization
Best for: Long-running assistants, multi-session agents
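For multi-session use, the summary and recent messages have to outlive the process. A minimal sketch of persisting that state as JSON between sessions (the file path and field names are illustrative; they mirror the `SummaryMemory` attributes above):

```python
# Minimal JSON persistence for summary-memory state so the summary
# survives restarts. Field names mirror the SummaryMemory class above;
# the storage path is whatever the caller chooses.
import json
from pathlib import Path

def save_state(path: str, summary: str, recent_messages: list[dict]) -> None:
    Path(path).write_text(
        json.dumps({"summary": summary, "recent_messages": recent_messages})
    )

def load_state(path: str) -> tuple[str, list[dict]]:
    """Return (summary, recent_messages); empty state if no file exists."""
    if not Path(path).exists():
        return "", []
    data = json.loads(Path(path).read_text())
    return data["summary"], data["recent_messages"]
```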
Pattern 4: Vector Memory (Semantic Retrieval)#
Store all memories as embeddings, retrieve only relevant ones per query.
```python
import numpy as np
from datetime import datetime

class VectorMemory:
    def __init__(self, system_prompt: str, top_k: int = 5):
        self.system_prompt = system_prompt
        self.memories = []  # List of {text, embedding, timestamp, metadata}
        self.top_k = top_k

    def _get_embedding(self, text: str) -> list:
        """Get embedding using a text embedding model."""
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: list, b: list) -> float:
        a, b = np.array(a), np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def store(self, text: str, metadata: dict = None):
        """Store a memory with its embedding."""
        embedding = self._get_embedding(text)
        self.memories.append({
            "text": text,
            "embedding": embedding,
            "timestamp": datetime.now().isoformat(),
            "metadata": metadata or {}
        })

    def retrieve(self, query: str) -> list[str]:
        """Retrieve top-k most relevant memories."""
        if not self.memories:
            return []
        query_embedding = self._get_embedding(query)
        scored = [
            (m["text"], self._cosine_similarity(query_embedding, m["embedding"]))
            for m in self.memories
        ]
        scored.sort(key=lambda x: x[1], reverse=True)
        return [text for text, score in scored[:self.top_k] if score > 0.3]

    def chat(self, user_input: str, model: str = "claude-opus-4-6") -> str:
        # Retrieve relevant memories
        relevant = self.retrieve(user_input)
        # Build context
        messages = [{"role": "system", "content": self.system_prompt}]
        if relevant:
            memory_context = "\n".join(f"- {m}" for m in relevant)
            messages.append({
                "role": "system",
                "content": f"Relevant memories from past interactions:\n{memory_context}"
            })
        messages.append({"role": "user", "content": user_input})
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.7
        )
        reply = response.choices[0].message.content
        # Store both sides of the exchange
        self.store(f"User asked: {user_input}")
        self.store(f"Assistant replied: {reply[:200]}")
        return reply
```
Pros: Infinite memory capacity, retrieves only relevant context
Cons: Requires embedding model, more complex, retrieval may miss context
Best for: Knowledge-intensive agents, personal assistants, RAG systems
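The retrieval step can be understood independently of any embedding API: score every stored vector against the query vector by cosine similarity, sort, and keep the top-k above a threshold. A self-contained sketch, with toy 3-dimensional vectors standing in for real embeddings:

```python
# The retrieval core of vector memory, isolated from any embedding API.
# Toy 3-d vectors stand in for real embeddings; the 0.3 threshold matches
# the VectorMemory class above.
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_memories(query_vec, memories, k=2, threshold=0.3):
    """memories: list of (text, vector) pairs; returns best-matching texts."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in memories]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, score in scored[:k] if score > threshold]
```

The threshold matters as much as k: without it, a query unrelated to anything stored still drags in the k "least irrelevant" memories.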
Pattern 5: Hybrid Memory (Production Pattern)#
The gold standard: combine short-term buffer, long-term summaries, and semantic retrieval.
```python
import json

class HybridMemory:
    """Production-grade memory combining all patterns."""

    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        # Short-term: recent messages (buffer)
        self.short_term = []
        self.short_term_limit = 10
        # Medium-term: conversation summary
        self.summary = ""
        # Long-term: vector store for facts and preferences
        self.vector_store = VectorMemory(system_prompt)
        # Entity memory: key facts about the user
        self.entities = {}

    def extract_entities(self, text: str):
        """Extract and store key entities from conversation."""
        response = client.chat.completions.create(
            model="gemini-2.5-flash",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract key facts as JSON. Categories: "
                        "name, preferences, projects, skills, goals. "
                        "Return {} if none found."
                    )
                },
                {"role": "user", "content": text}
            ],
            response_format={"type": "json_object"},
            max_tokens=200,
            temperature=0
        )
        entities = json.loads(response.choices[0].message.content)
        self.entities.update(entities)

    def _compress_short_term(self):
        """Fold older short-term messages into the running summary."""
        to_compress = self.short_term[:-4]
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in to_compress
        )
        response = client.chat.completions.create(
            model="gemini-2.5-flash",  # Cheap model for summarization
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation concisely. Preserve key facts, decisions, and user preferences."
                },
                {"role": "user", "content": conversation_text}
            ],
            max_tokens=300,
            temperature=0
        )
        new_summary = response.choices[0].message.content
        self.summary = (
            f"{self.summary}\n{new_summary}" if self.summary else new_summary
        )
        self.short_term = self.short_term[-4:]  # Keep recent messages

    def get_context(self, query: str) -> list:
        """Build optimized context from all memory layers."""
        messages = [{"role": "system", "content": self.system_prompt}]
        # Add entity memory (user facts)
        if self.entities:
            entity_str = "\n".join(f"- {k}: {v}" for k, v in self.entities.items())
            messages.append({
                "role": "system",
                "content": f"Known facts about the user:\n{entity_str}"
            })
        # Add conversation summary
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"Conversation summary:\n{self.summary}"
            })
        # Add relevant long-term memories
        relevant = self.vector_store.retrieve(query)
        if relevant:
            memory_str = "\n".join(f"- {m}" for m in relevant)
            messages.append({
                "role": "system",
                "content": f"Relevant past context:\n{memory_str}"
            })
        # Add recent messages
        messages.extend(self.short_term)
        return messages

    def chat(self, user_input: str, model: str = "gpt-5.2") -> str:
        # Build context
        messages = self.get_context(user_input)
        messages.append({"role": "user", "content": user_input})
        # Generate response
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.7
        )
        reply = response.choices[0].message.content
        # Update short-term memory
        self.short_term.append({"role": "user", "content": user_input})
        self.short_term.append({"role": "assistant", "content": reply})
        # Compress older messages into the summary if needed
        if len(self.short_term) > self.short_term_limit:
            self._compress_short_term()
        # Extract entities (synchronous here; offload to a background task in production)
        self.extract_entities(user_input)
        # Store the exchange in vector memory
        self.vector_store.store(f"Q: {user_input}\nA: {reply[:200]}")
        return reply
```
Memory Pattern Comparison#
| Feature | Buffer | Sliding | Summary | Vector | Hybrid |
|---|---|---|---|---|---|
| Token efficiency | ❌ | ✅ | ✅ | ✅ | ✅ |
| Context preservation | ✅ | ❌ | ⚠️ | ✅ | ✅ |
| Implementation effort | Low | Low | Medium | High | High |
| Cost per request | High | Medium | Medium | Medium | Medium |
| Infinite conversations | ❌ | ⚠️ | ✅ | ✅ | ✅ |
| Semantic recall | ❌ | ❌ | ❌ | ✅ | ✅ |
Cost Optimization: Memory with Crazyrouter#
Memory-heavy agents make multiple API calls per interaction. Cost matters.
| Operation | Model | Official Price | Crazyrouter Price |
|---|---|---|---|
| Main response | GPT-5.2 | $2.50/1M input | $1.25/1M |
| Summarization | Gemini 2.5 Flash | $0.15/1M input | $0.075/1M |
| Entity extraction | Gemini 2.5 Flash | $0.15/1M input | $0.075/1M |
| Embeddings | text-embedding-3-small | $0.02/1M tokens | $0.01/1M |
Total per 1K hybrid-memory interactions: ~$8.40 at official prices, or roughly half that through Crazyrouter (every rate in the table is 50% off).
Production Tips#
- Persist memory to disk/database: Use Redis for short-term, PostgreSQL + pgvector for long-term
- Separate user memories: Never mix memory between users
- Set memory expiry: Old memories lose relevance—add decay scoring
- Use cheap models for memory operations: Flash/Haiku for summarization and extraction
- Batch embedding calls: Embed multiple texts in one API call
- Test memory retrieval quality: Wrong memories are worse than no memories
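The decay-scoring tip above can be implemented by multiplying each memory's similarity score by an exponential time-decay factor, so stale memories rank lower even when semantically similar. A sketch (the 30-day half-life is an arbitrary assumption to tune per application):

```python
# Exponential time decay for memory scoring: a memory's effective score
# is its cosine similarity scaled down by age. The 30-day half-life is
# an arbitrary assumption; tune it for your domain.

def decayed_score(similarity: float, age_days: float,
                  half_life_days: float = 30.0) -> float:
    """Down-weight a memory's similarity score by its age."""
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay
```

Sort retrieval candidates by `decayed_score` instead of raw similarity and a month-old preference only outranks a fresh one when it is substantially more relevant.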
FAQ#
What is the best memory pattern for AI agents?#
The hybrid pattern is best for production applications, combining short-term buffer, summarized medium-term, and vector-based long-term memory. For simpler use cases, sliding window memory with periodic summarization works well.
How much does AI agent memory cost?#
Memory operations add 20-40% to base API costs due to summarization, entity extraction, and embedding calls. Using affordable models through Crazyrouter (Gemini Flash for summarization, text-embedding-3-small for vectors) keeps overhead under $0.005 per interaction.
Can I use memory patterns with any LLM?#
Yes. Memory patterns are model-agnostic—they work with GPT-5, Claude, Gemini, DeepSeek, and any model accessible via the OpenAI-compatible API format. Crazyrouter provides access to 300+ models through a single endpoint.
How do I persist AI agent memory across server restarts?#
Use a database. Redis is ideal for short-term conversation state, while PostgreSQL with the pgvector extension handles long-term vector memory. For simpler setups, JSON files or SQLite work for prototyping.
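For the SQLite route, here is a minimal sketch with one table keyed by user, which also enforces the per-user separation tip from earlier (the schema and function names are illustrative, not a fixed API):

```python
# Minimal SQLite persistence for conversation state, keyed by user so
# memories are never mixed between users. Schema and names are
# illustrative; sqlite3 is in the Python standard library.
import sqlite3

def init_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "user_id TEXT, role TEXT, content TEXT)"
    )
    return conn

def save_message(conn: sqlite3.Connection, user_id: str,
                 role: str, content: str) -> None:
    conn.execute(
        "INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)",
        (user_id, role, content)
    )
    conn.commit()

def load_messages(conn: sqlite3.Connection, user_id: str) -> list[dict]:
    """Return one user's messages in insertion order, ready for the API."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE user_id = ? ORDER BY id",
        (user_id,)
    ).fetchall()
    return [{"role": role, "content": content} for role, content in rows]
```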
What's the difference between RAG and agent memory?#
RAG (Retrieval-Augmented Generation) retrieves from a static knowledge base. Agent memory retrieves from the agent's own interaction history. Both use vector search, but agent memory is dynamic and grows with each conversation.
Summary#
AI agent memory is what separates a stateless chatbot from a genuinely useful assistant. The five patterns—buffer, sliding window, summary, vector, and hybrid—give you a toolkit for any use case.
Start with buffer memory for prototypes, graduate to sliding window for production chatbots, and implement hybrid memory for personal assistants and complex agents. With Crazyrouter's access to 300+ models, you can optimize each memory operation with the ideal model for the job. Get started at crazyrouter.com.


