"AI Agent Memory Patterns: Building Stateful AI Applications with Long-Term Memory in 2026"

"AI Agent Memory Patterns: Building Stateful AI Applications with Long-Term Memory in 2026"

C
Crazyrouter Team
March 13, 2026
AI Agent Memory Patterns: Building Stateful AI Applications with Long-Term Memory in 2026#

The biggest limitation of LLMs isn't intelligence—it's amnesia. Every API call starts fresh. Your model doesn't remember the conversation from five minutes ago, let alone last week.

AI agent memory patterns solve this. They give your AI applications the ability to maintain context across conversations, remember user preferences, learn from interactions, and behave like an assistant that actually knows who it's talking to.

This guide covers the five core memory patterns every AI developer should know, with production-ready code examples.

Why AI Agents Need Memory#

Without memory, every interaction is a first meeting:

code
User: "I prefer dark mode and Python code examples."
AI: "Got it! I'll use dark mode and Python."

[New session]

User: "Show me how to make an API call."
AI: "Sure! Here's an example in JavaScript..."  // Forgot everything

Memory patterns fix this by persisting context between API calls.

The Five Memory Patterns#

| Pattern | Context Window Usage | Best For | Complexity |
| --- | --- | --- | --- |
| Buffer Memory | Full history | Short conversations | Low |
| Sliding Window | Last N messages | Medium conversations | Low |
| Summary Memory | Compressed summary | Long conversations | Medium |
| Vector Memory | Semantic retrieval | Knowledge-heavy agents | High |
| Hybrid Memory | Combined approach | Production agents | High |

Pattern 1: Buffer Memory (Full History)#

The simplest approach: store every message and send the full history with each request.

python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.crazyrouter.com/v1",
    api_key="your-crazyrouter-key"
)

class BufferMemory:
    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]
    
    def add_user_message(self, content: str):
        self.messages.append({"role": "user", "content": content})
    
    def add_assistant_message(self, content: str):
        self.messages.append({"role": "assistant", "content": content})
    
    def chat(self, user_input: str, model: str = "gpt-5.2") -> str:
        self.add_user_message(user_input)
        
        response = client.chat.completions.create(
            model=model,
            messages=self.messages,
            temperature=0.7
        )
        
        reply = response.choices[0].message.content
        self.add_assistant_message(reply)
        return reply

# Usage
memory = BufferMemory("You are a helpful coding assistant.")
print(memory.chat("I'm working on a FastAPI project."))
print(memory.chat("How do I add authentication?"))  # Remembers FastAPI context

Pros: Full context, simple implementation
Cons: Token costs grow linearly, hits context window limits
Best for: Chatbots with <20 message conversations
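To see why buffer memory's costs climb, you can track the growing request size turn by turn. This sketch uses a crude words-times-1.3 heuristic as a stand-in for a real tokenizer (tiktoken's `cl100k_base` encoding would be the usual choice):

```python
def estimate_tokens(messages: list) -> int:
    """Very rough estimate: ~1.3 tokens per whitespace-separated word."""
    words = sum(len(m["content"].split()) for m in messages)
    return int(words * 1.3)

history = [{"role": "system", "content": "You are a helpful coding assistant."}]
costs = []
for turn in range(5):
    history.append({"role": "user", "content": "question " * 50})
    history.append({"role": "assistant", "content": "answer " * 150})
    costs.append(estimate_tokens(history))

# Every request re-sends the whole history, so per-request size keeps climbing.
print(costs)
```

With real traffic, plot this against your model's context limit to find the turn count where buffer memory stops being viable.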

Pattern 2: Sliding Window Memory#

Keep only the last N messages, discarding older ones.

python
from collections import deque

class SlidingWindowMemory:
    def __init__(self, system_prompt: str, window_size: int = 20):
        self.system_prompt = {"role": "system", "content": system_prompt}
        self.window = deque(maxlen=window_size)
    
    def get_messages(self) -> list:
        return [self.system_prompt] + list(self.window)
    
    def chat(self, user_input: str, model: str = "claude-sonnet-4-5") -> str:
        self.window.append({"role": "user", "content": user_input})
        
        response = client.chat.completions.create(
            model=model,
            messages=self.get_messages(),
            temperature=0.7
        )
        
        reply = response.choices[0].message.content
        self.window.append({"role": "assistant", "content": reply})
        return reply

# Usage - keeps last 20 messages
memory = SlidingWindowMemory(
    "You are a data analysis assistant.",
    window_size=20
)

Pros: Bounded token costs, prevents context overflow
Cons: Loses early conversation context
Best for: Support bots, general-purpose assistants
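A common refinement, sketched here with the same crude word-count heuristic in place of a real tokenizer, is to trim by token budget rather than message count, so a few very long messages can't blow past the context limit:

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: ~1.3 tokens per word."""
    return max(1, int(len(text.split()) * 1.3))

class TokenBudgetWindow:
    def __init__(self, budget: int = 2000):
        self.budget = budget
        self.messages: list = []

    def append(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Drop the oldest messages until the window fits the budget.
        while (sum(estimate_tokens(m["content"]) for m in self.messages)
               > self.budget and len(self.messages) > 1):
            self.messages.pop(0)

window = TokenBudgetWindow(budget=100)
for i in range(20):
    window.append("user", f"message {i} " * 10)
print(len(window.messages))  # stays bounded regardless of conversation length
```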

Pattern 3: Summary Memory#

Compress old conversations into summaries, keeping recent messages in full.

python
class SummaryMemory:
    def __init__(self, system_prompt: str, summary_threshold: int = 10):
        self.system_prompt = system_prompt
        self.summary = ""
        self.recent_messages = []
        self.summary_threshold = summary_threshold
    
    def _summarize(self) -> str:
        """Compress old messages into a summary."""
        messages_to_summarize = self.recent_messages[:-4]  # Keep last 4
        
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in messages_to_summarize
        )
        
        response = client.chat.completions.create(
            model="gemini-2.5-flash",  # Cheap model for summarization
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation concisely. Preserve key facts, decisions, and user preferences."
                },
                {"role": "user", "content": conversation_text}
            ],
            max_tokens=300,
            temperature=0
        )
        
        return response.choices[0].message.content
    
    def get_messages(self) -> list:
        messages = [{"role": "system", "content": self.system_prompt}]
        
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{self.summary}"
            })
        
        messages.extend(self.recent_messages)
        return messages
    
    def chat(self, user_input: str, model: str = "gpt-5.2") -> str:
        self.recent_messages.append({"role": "user", "content": user_input})
        
        # Summarize when threshold reached
        if len(self.recent_messages) > self.summary_threshold:
            new_summary = self._summarize()
            self.summary = (
                f"{self.summary}\n{new_summary}" if self.summary else new_summary
            )
            self.recent_messages = self.recent_messages[-4:]  # Keep recent
        
        response = client.chat.completions.create(
            model=model,
            messages=self.get_messages(),
            temperature=0.7
        )
        
        reply = response.choices[0].message.content
        self.recent_messages.append({"role": "assistant", "content": reply})
        return reply

Pros: Preserves key context indefinitely, bounded costs
Cons: Summary may lose nuance, extra API call for summarization
Best for: Long-running assistants, multi-session agents
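For multi-session agents, the summary and recent messages need to survive between runs. A minimal sketch using a JSON file (the path and schema here are illustrative; Redis or a database is the usual production choice):

```python
import json
import os
import tempfile

def save_state(path: str, summary: str, recent_messages: list) -> None:
    """Persist summary-memory state to disk between sessions."""
    with open(path, "w") as f:
        json.dump({"summary": summary, "recent_messages": recent_messages}, f)

def load_state(path: str):
    """Restore state, or start fresh if no file exists yet."""
    if not os.path.exists(path):
        return "", []
    with open(path) as f:
        data = json.load(f)
    return data["summary"], data["recent_messages"]

path = os.path.join(tempfile.gettempdir(), "summary_memory_state.json")
save_state(path, "User prefers Python.", [{"role": "user", "content": "hi"}])
summary, recent = load_state(path)
print(summary)  # User prefers Python.
```

On startup you would load this state into the `SummaryMemory` attributes shown above and save after each exchange.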

Pattern 4: Vector Memory (Semantic Retrieval)#

Store all memories as embeddings, retrieve only relevant ones per query.

python
import numpy as np
from datetime import datetime

class VectorMemory:
    def __init__(self, system_prompt: str, top_k: int = 5):
        self.system_prompt = system_prompt
        self.memories = []  # List of {text, embedding, timestamp, metadata}
        self.top_k = top_k
    
    def _get_embedding(self, text: str) -> list:
        """Get embedding using a text embedding model."""
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
    
    def _cosine_similarity(self, a: list, b: list) -> float:
        a, b = np.array(a), np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    def store(self, text: str, metadata: dict = None):
        """Store a memory with its embedding."""
        embedding = self._get_embedding(text)
        self.memories.append({
            "text": text,
            "embedding": embedding,
            "timestamp": datetime.now().isoformat(),
            "metadata": metadata or {}
        })
    
    def retrieve(self, query: str) -> list[str]:
        """Retrieve top-k most relevant memories."""
        if not self.memories:
            return []
        
        query_embedding = self._get_embedding(query)
        
        scored = [
            (m["text"], self._cosine_similarity(query_embedding, m["embedding"]))
            for m in self.memories
        ]
        
        scored.sort(key=lambda x: x[1], reverse=True)
        return [text for text, score in scored[:self.top_k] if score > 0.3]
    
    def chat(self, user_input: str, model: str = "claude-opus-4-6") -> str:
        # Retrieve relevant memories
        relevant = self.retrieve(user_input)
        
        # Build context
        messages = [{"role": "system", "content": self.system_prompt}]
        
        if relevant:
            memory_context = "\n".join(f"- {m}" for m in relevant)
            messages.append({
                "role": "system",
                "content": f"Relevant memories from past interactions:\n{memory_context}"
            })
        
        messages.append({"role": "user", "content": user_input})
        
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.7
        )
        
        reply = response.choices[0].message.content
        
        # Store both sides of the exchange for future retrieval
        self.store(f"User asked: {user_input}")
        self.store(f"Assistant replied: {reply[:200]}")
        
        return reply

Pros: Infinite memory capacity, retrieves only relevant context
Cons: Requires embedding model, more complex, retrieval may miss context
Best for: Knowledge-intensive agents, personal assistants, RAG systems
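The one-memory-at-a-time loop in `retrieve` is fine for small stores, but it gets slow as memories accumulate. A vectorized sketch that scores every stored embedding in a single NumPy matrix product (toy 3-d vectors stand in for real embeddings):

```python
import numpy as np

def top_k(query: np.ndarray, memory: np.ndarray, k: int = 2) -> list:
    """Return indices of the k rows most similar to the query (cosine)."""
    memory_norm = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = memory_norm @ query_norm          # one matrix-vector product
    return [int(i) for i in np.argsort(scores)[::-1][:k]]

memory = np.array([
    [1.0, 0.0, 0.0],   # memory 0
    [0.0, 1.0, 0.0],   # memory 1
    [0.9, 0.1, 0.0],   # memory 2
])
print(top_k(np.array([1.0, 0.0, 0.0]), memory))  # [0, 2]
```

Past a few hundred thousand memories, you would swap this for a vector database or an approximate-nearest-neighbor index.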

Pattern 5: Hybrid Memory (Production Pattern)#

The gold standard: combine short-term buffer, long-term summaries, and semantic retrieval.

python
class HybridMemory:
    """Production-grade memory combining all patterns."""
    
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        
        # Short-term: recent messages (buffer)
        self.short_term = []
        self.short_term_limit = 10
        
        # Medium-term: conversation summary
        self.summary = ""
        
        # Long-term: vector store for facts and preferences
        self.vector_store = VectorMemory(system_prompt)
        
        # Entity memory: key facts about the user
        self.entities = {}
    
    def extract_entities(self, text: str):
        """Extract and store key entities from conversation."""
        response = client.chat.completions.create(
            model="gemini-2.5-flash",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract key facts as JSON. Categories: "
                        "name, preferences, projects, skills, goals. "
                        "Return {} if none found."
                    )
                },
                {"role": "user", "content": text}
            ],
            response_format={"type": "json_object"},
            max_tokens=200,
            temperature=0
        )
        
        import json
        entities = json.loads(response.choices[0].message.content)
        self.entities.update(entities)

    def _compress_short_term(self):
        """Fold older short-term messages into the running summary."""
        overflow = self.short_term[:-4]  # keep the 4 most recent in full
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in overflow
        )
        response = client.chat.completions.create(
            model="gemini-2.5-flash",  # cheap model for compression
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation concisely. Preserve key facts, decisions, and user preferences."
                },
                {"role": "user", "content": conversation_text}
            ],
            max_tokens=300,
            temperature=0
        )
        new_summary = response.choices[0].message.content
        self.summary = (
            f"{self.summary}\n{new_summary}" if self.summary else new_summary
        )
        self.short_term = self.short_term[-4:]
    
    def get_context(self, query: str) -> list:
        """Build optimized context from all memory layers."""
        messages = [{"role": "system", "content": self.system_prompt}]
        
        # Add entity memory (user facts)
        if self.entities:
            entity_str = "\n".join(f"- {k}: {v}" for k, v in self.entities.items())
            messages.append({
                "role": "system",
                "content": f"Known facts about the user:\n{entity_str}"
            })
        
        # Add conversation summary
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"Conversation summary:\n{self.summary}"
            })
        
        # Add relevant long-term memories
        relevant = self.vector_store.retrieve(query)
        if relevant:
            memory_str = "\n".join(f"- {m}" for m in relevant)
            messages.append({
                "role": "system",
                "content": f"Relevant past context:\n{memory_str}"
            })
        
        # Add recent messages
        messages.extend(self.short_term)
        
        return messages
    
    def chat(self, user_input: str, model: str = "gpt-5.2") -> str:
        # Build context
        messages = self.get_context(user_input)
        messages.append({"role": "user", "content": user_input})
        
        # Generate response
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.7
        )
        
        reply = response.choices[0].message.content
        
        # Update short-term memory
        self.short_term.append({"role": "user", "content": user_input})
        self.short_term.append({"role": "assistant", "content": reply})
        
        # Compress if needed
        if len(self.short_term) > self.short_term_limit:
            self._compress_short_term()
        
        # Extract entities (synchronous here; offload to a background task in production)
        self.extract_entities(user_input)
        
        # Store in vector memory
        self.vector_store.store(f"Q: {user_input}\nA: {reply[:200]}")
        
        return reply

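One thing the hybrid sketch leaves out is memory aging: old facts lose relevance over time. A simple approach is to damp each retrieval score by an exponential decay over the memory's stored timestamp (which `VectorMemory` already records); the 7-day half-life here is an illustrative assumption, not a recommendation:

```python
from datetime import datetime, timedelta

def decayed_score(similarity: float, stored_at: datetime,
                  now: datetime, half_life_days: float = 7.0) -> float:
    """Damp a cosine-similarity score by the memory's age."""
    age_days = (now - stored_at).total_seconds() / 86400
    return similarity * 0.5 ** (age_days / half_life_days)

now = datetime(2026, 3, 13)
fresh = decayed_score(0.9, now - timedelta(days=1), now)
stale = decayed_score(0.9, now - timedelta(days=30), now)
print(round(fresh, 3), round(stale, 3))  # the fresh memory outranks the stale one
```

Ranking by decayed score instead of raw similarity lets equally relevant but more recent memories win ties.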
Memory Pattern Comparison#

| Feature | Buffer | Sliding | Summary | Vector | Hybrid |
| --- | --- | --- | --- | --- | --- |
| Token efficiency | Low | High | High | High | High |
| Context preservation | Full | Recent only | Compressed | Relevant only | Layered |
| Implementation effort | Low | Low | Medium | High | High |
| Cost per request | High | Medium | Medium | Medium | Medium |
| Infinite conversations | No | Partial | Yes | Yes | Yes |
| Semantic recall | No | No | No | Yes | Yes |

Cost Optimization: Memory with Crazyrouter#

Memory-heavy agents make multiple API calls per interaction. Cost matters.

| Operation | Model | Official Price | Crazyrouter Price |
| --- | --- | --- | --- |
| Main response | GPT-5.2 | $2.50/1M input | $1.25/1M |
| Summarization | Gemini 2.5 Flash | $0.15/1M input | $0.075/1M |
| Entity extraction | Gemini 2.5 Flash | $0.15/1M input | $0.075/1M |
| Embeddings | text-embedding-3-small | $0.02/1M tokens | $0.01/1M |

Total per 1K hybrid-memory interactions: ~$4.20 via [Crazyrouter](https://crazyrouter.com) vs ~$8.40 at official prices.

Production Tips#

  1. Persist memory to disk/database: Use Redis for short-term, PostgreSQL + pgvector for long-term
  2. Separate user memories: Never mix memory between users
  3. Set memory expiry: Old memories lose relevance—add decay scoring
  4. Use cheap models for memory operations: Flash/Haiku for summarization and extraction
  5. Batch embedding calls: Embed multiple texts in one API call
  6. Test memory retrieval quality: Wrong memories are worse than no memories
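Tip 5 in practice: the OpenAI-style embeddings endpoint accepts a list of inputs, so one call can embed a whole batch. A minimal chunking helper (the batch size of 100 is an illustrative choice; check your provider's per-request limits):

```python
def chunk(texts: list, batch_size: int = 100) -> list:
    """Split texts into batches for fewer, larger embedding calls."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

texts = [f"memory {i}" for i in range(250)]
batches = chunk(texts)
print([len(b) for b in batches])  # [100, 100, 50]
```

Each batch then goes through a single call like `client.embeddings.create(model="text-embedding-3-small", input=batch)` instead of 100 separate requests.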

FAQ#

What is the best memory pattern for AI agents?#

The hybrid pattern is best for production applications, combining short-term buffer, summarized medium-term, and vector-based long-term memory. For simpler use cases, sliding window memory with periodic summarization works well.

How much does AI agent memory cost?#

Memory operations add 20-40% to base API costs due to summarization, entity extraction, and embedding calls. Using affordable models through Crazyrouter (Gemini Flash for summarization, text-embedding-3-small for vectors) keeps overhead under $0.005 per interaction.

Can I use memory patterns with any LLM?#

Yes. Memory patterns are model-agnostic—they work with GPT-5, Claude, Gemini, DeepSeek, and any model accessible via the OpenAI-compatible API format. Crazyrouter provides access to 300+ models through a single endpoint.

How do I persist AI agent memory across server restarts?#

Use a database. Redis is ideal for short-term conversation state, while PostgreSQL with the pgvector extension handles long-term vector memory. For simpler setups, JSON files or SQLite work for prototyping.

What's the difference between RAG and agent memory?#

RAG (Retrieval-Augmented Generation) retrieves from a static knowledge base. Agent memory retrieves from the agent's own interaction history. Both use vector search, but agent memory is dynamic and grows with each conversation.

Summary#

AI agent memory is what separates a stateless chatbot from a genuinely useful assistant. The five patterns—buffer, sliding window, summary, vector, and hybrid—give you a toolkit for any use case.

Start with buffer memory for prototypes, graduate to sliding window for production chatbots, and implement hybrid memory for personal assistants and complex agents. With Crazyrouter's access to 300+ models, you can optimize each memory operation with the ideal model for the job. Get started at crazyrouter.com.
