"AI Agent Memory Patterns: Building Stateful AI Applications with Long-Term Memory in 2026"

"AI Agent Memory Patterns: Building Stateful AI Applications with Long-Term Memory in 2026"

C
Crazyrouter Team
March 13, 2026
AI Agent Memory Patterns: Building Stateful AI Applications with Long-Term Memory in 2026#

The biggest limitation of LLMs isn't intelligence—it's amnesia. Every API call starts fresh. Your model doesn't remember the conversation from five minutes ago, let alone last week.

AI agent memory patterns solve this. They give your AI applications the ability to maintain context across conversations, remember user preferences, learn from interactions, and behave like an assistant that actually knows who it's talking to.

This guide covers the five core memory patterns every AI developer should know, with production-ready code examples.

Why AI Agents Need Memory#

Without memory, every interaction is a first meeting:

code
User: "I prefer dark mode and Python code examples."
AI: "Got it! I'll use dark mode and Python."

[New session]

User: "Show me how to make an API call."
AI: "Sure! Here's an example in JavaScript..."  // Forgot everything

Memory patterns fix this by persisting context between API calls.

The Five Memory Patterns#

| Pattern | Context Window Usage | Best For | Complexity |
| --- | --- | --- | --- |
| Buffer Memory | Full history | Short conversations | Low |
| Sliding Window | Last N messages | Medium conversations | Low |
| Summary Memory | Compressed summary | Long conversations | Medium |
| Vector Memory | Semantic retrieval | Knowledge-heavy agents | High |
| Hybrid Memory | Combined approach | Production agents | High |

Pattern 1: Buffer Memory (Full History)#

The simplest approach: store every message and send the full history with each request.

python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.crazyrouter.com/v1",
    api_key="your-crazyrouter-key"
)

class BufferMemory:
    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]
    
    def add_user_message(self, content: str):
        self.messages.append({"role": "user", "content": content})
    
    def add_assistant_message(self, content: str):
        self.messages.append({"role": "assistant", "content": content})
    
    def chat(self, user_input: str, model: str = "gpt-5.2") -> str:
        self.add_user_message(user_input)
        
        response = client.chat.completions.create(
            model=model,
            messages=self.messages,
            temperature=0.7
        )
        
        reply = response.choices[0].message.content
        self.add_assistant_message(reply)
        return reply

# Usage
memory = BufferMemory("You are a helpful coding assistant.")
print(memory.chat("I'm working on a FastAPI project."))
print(memory.chat("How do I add authentication?"))  # Remembers FastAPI context

Pros: Full context, simple implementation
Cons: Token costs grow linearly, hits context window limits
Best for: Chatbots with <20 message conversations
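To see why buffer memory's costs climb, you can track the growing request size turn by turn. This sketch uses a crude words-times-1.3 heuristic as a stand-in for a real tokenizer (tiktoken's `cl100k_base` encoding would be the usual choice):

```python
def estimate_tokens(messages: list) -> int:
    """Very rough estimate: ~1.3 tokens per whitespace-separated word."""
    words = sum(len(m["content"].split()) for m in messages)
    return int(words * 1.3)

history = [{"role": "system", "content": "You are a helpful coding assistant."}]
costs = []
for turn in range(5):
    history.append({"role": "user", "content": "question " * 50})
    history.append({"role": "assistant", "content": "answer " * 150})
    costs.append(estimate_tokens(history))

# Every request re-sends the whole history, so per-request size keeps climbing.
print(costs)
```

With real traffic, plot this against your model's context limit to find the turn count where buffer memory stops being viable.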

Pattern 2: Sliding Window Memory#

Keep only the last N messages, discarding older ones.

python
from collections import deque

class SlidingWindowMemory:
    def __init__(self, system_prompt: str, window_size: int = 20):
        self.system_prompt = {"role": "system", "content": system_prompt}
        self.window = deque(maxlen=window_size)
    
    def get_messages(self) -> list:
        return [self.system_prompt] + list(self.window)
    
    def chat(self, user_input: str, model: str = "claude-sonnet-4-5") -> str:
        self.window.append({"role": "user", "content": user_input})
        
        response = client.chat.completions.create(
            model=model,
            messages=self.get_messages(),
            temperature=0.7
        )
        
        reply = response.choices[0].message.content
        self.window.append({"role": "assistant", "content": reply})
        return reply

# Usage - keeps last 20 messages
memory = SlidingWindowMemory(
    "You are a data analysis assistant.",
    window_size=20
)

Pros: Bounded token costs, prevents context overflow
Cons: Loses early conversation context
Best for: Support bots, general-purpose assistants
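A common refinement, sketched here with the same crude word-count heuristic in place of a real tokenizer, is to trim by token budget rather than message count, so a few very long messages can't blow past the context limit:

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: ~1.3 tokens per word."""
    return max(1, int(len(text.split()) * 1.3))

class TokenBudgetWindow:
    def __init__(self, budget: int = 2000):
        self.budget = budget
        self.messages: list = []

    def append(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Drop the oldest messages until the window fits the budget.
        while (sum(estimate_tokens(m["content"]) for m in self.messages)
               > self.budget and len(self.messages) > 1):
            self.messages.pop(0)

window = TokenBudgetWindow(budget=100)
for i in range(20):
    window.append("user", f"message {i} " * 10)
print(len(window.messages))  # stays bounded regardless of conversation length
```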

Pattern 3: Summary Memory#

Compress old conversations into summaries, keeping recent messages in full.

python
class SummaryMemory:
    def __init__(self, system_prompt: str, summary_threshold: int = 10):
        self.system_prompt = system_prompt
        self.summary = ""
        self.recent_messages = []
        self.summary_threshold = summary_threshold
    
    def _summarize(self) -> str:
        """Compress old messages into a summary."""
        messages_to_summarize = self.recent_messages[:-4]  # Keep last 4
        
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in messages_to_summarize
        )
        
        response = client.chat.completions.create(
            model="gemini-2.5-flash",  # Cheap model for summarization
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation concisely. Preserve key facts, decisions, and user preferences."
                },
                {"role": "user", "content": conversation_text}
            ],
            max_tokens=300,
            temperature=0
        )
        
        return response.choices[0].message.content
    
    def get_messages(self) -> list:
        messages = [{"role": "system", "content": self.system_prompt}]
        
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{self.summary}"
            })
        
        messages.extend(self.recent_messages)
        return messages
    
    def chat(self, user_input: str, model: str = "gpt-5.2") -> str:
        self.recent_messages.append({"role": "user", "content": user_input})
        
        # Summarize when threshold reached
        if len(self.recent_messages) > self.summary_threshold:
            new_summary = self._summarize()
            self.summary = (
                f"{self.summary}\n{new_summary}" if self.summary else new_summary
            )
            self.recent_messages = self.recent_messages[-4:]  # Keep recent
        
        response = client.chat.completions.create(
            model=model,
            messages=self.get_messages(),
            temperature=0.7
        )
        
        reply = response.choices[0].message.content
        self.recent_messages.append({"role": "assistant", "content": reply})
        return reply

Pros: Preserves key context indefinitely, bounded costs
Cons: Summary may lose nuance, extra API call for summarization
Best for: Long-running assistants, multi-session agents
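For multi-session agents, the summary and recent messages need to survive between runs. A minimal sketch using a JSON file (the path and schema here are illustrative; Redis or a database is the usual production choice):

```python
import json
import os
import tempfile

def save_state(path: str, summary: str, recent_messages: list) -> None:
    """Persist summary-memory state to disk between sessions."""
    with open(path, "w") as f:
        json.dump({"summary": summary, "recent_messages": recent_messages}, f)

def load_state(path: str):
    """Restore state, or start fresh if no file exists yet."""
    if not os.path.exists(path):
        return "", []
    with open(path) as f:
        data = json.load(f)
    return data["summary"], data["recent_messages"]

path = os.path.join(tempfile.gettempdir(), "summary_memory_state.json")
save_state(path, "User prefers Python.", [{"role": "user", "content": "hi"}])
summary, recent = load_state(path)
print(summary)  # User prefers Python.
```

On startup you would load this state into the `SummaryMemory` attributes shown above and save after each exchange.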

Pattern 4: Vector Memory (Semantic Retrieval)#

Store all memories as embeddings, retrieve only relevant ones per query.

python
import numpy as np
from datetime import datetime

class VectorMemory:
    def __init__(self, system_prompt: str, top_k: int = 5):
        self.system_prompt = system_prompt
        self.memories = []  # List of {text, embedding, timestamp, metadata}
        self.top_k = top_k
    
    def _get_embedding(self, text: str) -> list:
        """Get embedding using a text embedding model."""
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
    
    def _cosine_similarity(self, a: list, b: list) -> float:
        a, b = np.array(a), np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    def store(self, text: str, metadata: dict = None):
        """Store a memory with its embedding."""
        embedding = self._get_embedding(text)
        self.memories.append({
            "text": text,
            "embedding": embedding,
            "timestamp": datetime.now().isoformat(),
            "metadata": metadata or {}
        })
    
    def retrieve(self, query: str) -> list[str]:
        """Retrieve top-k most relevant memories."""
        if not self.memories:
            return []
        
        query_embedding = self._get_embedding(query)
        
        scored = [
            (m["text"], self._cosine_similarity(query_embedding, m["embedding"]))
            for m in self.memories
        ]
        
        scored.sort(key=lambda x: x[1], reverse=True)
        return [text for text, score in scored[:self.top_k] if score > 0.3]
    
    def chat(self, user_input: str, model: str = "claude-opus-4-6") -> str:
        # Retrieve relevant memories
        relevant = self.retrieve(user_input)
        
        # Build context
        messages = [{"role": "system", "content": self.system_prompt}]
        
        if relevant:
            memory_context = "\n".join(f"- {m}" for m in relevant)
            messages.append({
                "role": "system",
                "content": f"Relevant memories from past interactions:\n{memory_context}"
            })
        
        messages.append({"role": "user", "content": user_input})
        
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.7
        )
        
        reply = response.choices[0].message.content
        
        # Store both sides of the exchange for future retrieval
        self.store(f"User asked: {user_input}")
        self.store(f"Assistant replied: {reply[:200]}")
        
        return reply

Pros: Infinite memory capacity, retrieves only relevant context
Cons: Requires embedding model, more complex, retrieval may miss context
Best for: Knowledge-intensive agents, personal assistants, RAG systems
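The one-memory-at-a-time loop in `retrieve` is fine for small stores, but it gets slow as memories accumulate. A vectorized sketch that scores every stored embedding in a single NumPy matrix product (toy 3-d vectors stand in for real embeddings):

```python
import numpy as np

def top_k(query: np.ndarray, memory: np.ndarray, k: int = 2) -> list:
    """Return indices of the k rows most similar to the query (cosine)."""
    memory_norm = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = memory_norm @ query_norm          # one matrix-vector product
    return [int(i) for i in np.argsort(scores)[::-1][:k]]

memory = np.array([
    [1.0, 0.0, 0.0],   # memory 0
    [0.0, 1.0, 0.0],   # memory 1
    [0.9, 0.1, 0.0],   # memory 2
])
print(top_k(np.array([1.0, 0.0, 0.0]), memory))  # [0, 2]
```

Past a few hundred thousand memories, you would swap this for a vector database or an approximate-nearest-neighbor index.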

Pattern 5: Hybrid Memory (Production Pattern)#

The gold standard: combine short-term buffer, long-term summaries, and semantic retrieval.

python
class HybridMemory:
    """Production-grade memory combining all patterns."""
    
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        
        # Short-term: recent messages (buffer)
        self.short_term = []
        self.short_term_limit = 10
        
        # Medium-term: conversation summary
        self.summary = ""
        
        # Long-term: vector store for facts and preferences
        self.vector_store = VectorMemory(system_prompt)
        
        # Entity memory: key facts about the user
        self.entities = {}
    
    def extract_entities(self, text: str):
        """Extract and store key entities from conversation."""
        response = client.chat.completions.create(
            model="gemini-2.5-flash",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract key facts as JSON. Categories: "
                        "name, preferences, projects, skills, goals. "
                        "Return {} if none found."
                    )
                },
                {"role": "user", "content": text}
            ],
            response_format={"type": "json_object"},
            max_tokens=200,
            temperature=0
        )
        
        import json
        entities = json.loads(response.choices[0].message.content)
        self.entities.update(entities)

    def _compress_short_term(self):
        """Fold older short-term messages into the running summary."""
        overflow = self.short_term[:-4]  # keep the 4 most recent in full
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in overflow
        )
        response = client.chat.completions.create(
            model="gemini-2.5-flash",  # cheap model for compression
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation concisely. Preserve key facts, decisions, and user preferences."
                },
                {"role": "user", "content": conversation_text}
            ],
            max_tokens=300,
            temperature=0
        )
        new_summary = response.choices[0].message.content
        self.summary = (
            f"{self.summary}\n{new_summary}" if self.summary else new_summary
        )
        self.short_term = self.short_term[-4:]
    
    def get_context(self, query: str) -> list:
        """Build optimized context from all memory layers."""
        messages = [{"role": "system", "content": self.system_prompt}]
        
        # Add entity memory (user facts)
        if self.entities:
            entity_str = "\n".join(f"- {k}: {v}" for k, v in self.entities.items())
            messages.append({
                "role": "system",
                "content": f"Known facts about the user:\n{entity_str}"
            })
        
        # Add conversation summary
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"Conversation summary:\n{self.summary}"
            })
        
        # Add relevant long-term memories
        relevant = self.vector_store.retrieve(query)
        if relevant:
            memory_str = "\n".join(f"- {m}" for m in relevant)
            messages.append({
                "role": "system",
                "content": f"Relevant past context:\n{memory_str}"
            })
        
        # Add recent messages
        messages.extend(self.short_term)
        
        return messages
    
    def chat(self, user_input: str, model: str = "gpt-5.2") -> str:
        # Build context
        messages = self.get_context(user_input)
        messages.append({"role": "user", "content": user_input})
        
        # Generate response
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.7
        )
        
        reply = response.choices[0].message.content
        
        # Update short-term memory
        self.short_term.append({"role": "user", "content": user_input})
        self.short_term.append({"role": "assistant", "content": reply})
        
        # Compress if needed
        if len(self.short_term) > self.short_term_limit:
            self._compress_short_term()
        
        # Extract entities (synchronous here; offload to a background task in production)
        self.extract_entities(user_input)
        
        # Store in vector memory
        self.vector_store.store(f"Q: {user_input}\nA: {reply[:200]}")
        
        return reply

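One thing the hybrid sketch leaves out is memory aging: old facts lose relevance over time. A simple approach is to damp each retrieval score by an exponential decay over the memory's stored timestamp (which `VectorMemory` already records); the 7-day half-life here is an illustrative assumption, not a recommendation:

```python
from datetime import datetime, timedelta

def decayed_score(similarity: float, stored_at: datetime,
                  now: datetime, half_life_days: float = 7.0) -> float:
    """Damp a cosine-similarity score by the memory's age."""
    age_days = (now - stored_at).total_seconds() / 86400
    return similarity * 0.5 ** (age_days / half_life_days)

now = datetime(2026, 3, 13)
fresh = decayed_score(0.9, now - timedelta(days=1), now)
stale = decayed_score(0.9, now - timedelta(days=30), now)
print(round(fresh, 3), round(stale, 3))  # the fresh memory outranks the stale one
```

Ranking by decayed score instead of raw similarity lets equally relevant but more recent memories win ties.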
Memory Pattern Comparison#

| Feature | Buffer | Sliding | Summary | Vector | Hybrid |
| --- | --- | --- | --- | --- | --- |
| Token efficiency | Low | High | High | High | High |
| Context preservation | Full | Recent only | Compressed | Relevant only | Layered |
| Implementation effort | Low | Low | Medium | High | High |
| Cost per request | High | Medium | Medium | Medium | Medium |
| Infinite conversations | No | Partial | Yes | Yes | Yes |
| Semantic recall | No | No | No | Yes | Yes |

Cost Optimization: Memory with Crazyrouter#

Memory-heavy agents make multiple API calls per interaction. Cost matters.

| Operation | Model | Official Price | Crazyrouter Price |
| --- | --- | --- | --- |
| Main response | GPT-5.2 | $2.50/1M input | $1.25/1M |
| Summarization | Gemini 2.5 Flash | $0.15/1M input | $0.075/1M |
| Entity extraction | Gemini 2.5 Flash | $0.15/1M input | $0.075/1M |
| Embeddings | text-embedding-3-small | $0.02/1M tokens | $0.01/1M |

Total per 1K hybrid-memory interactions: ~$4.20 via [Crazyrouter](https://crazyrouter.com) vs ~$8.40 at official prices.

Production Tips#

  1. Persist memory to disk/database: Use Redis for short-term, PostgreSQL + pgvector for long-term
  2. Separate user memories: Never mix memory between users
  3. Set memory expiry: Old memories lose relevance—add decay scoring
  4. Use cheap models for memory operations: Flash/Haiku for summarization and extraction
  5. Batch embedding calls: Embed multiple texts in one API call
  6. Test memory retrieval quality: Wrong memories are worse than no memories
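Tip 5 in practice: the OpenAI-style embeddings endpoint accepts a list of inputs, so one call can embed a whole batch. A minimal chunking helper (the batch size of 100 is an illustrative choice; check your provider's per-request limits):

```python
def chunk(texts: list, batch_size: int = 100) -> list:
    """Split texts into batches for fewer, larger embedding calls."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

texts = [f"memory {i}" for i in range(250)]
batches = chunk(texts)
print([len(b) for b in batches])  # [100, 100, 50]
```

Each batch then goes through a single call like `client.embeddings.create(model="text-embedding-3-small", input=batch)` instead of 100 separate requests.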

FAQ#

What is the best memory pattern for AI agents?#

The hybrid pattern is best for production applications, combining short-term buffer, summarized medium-term, and vector-based long-term memory. For simpler use cases, sliding window memory with periodic summarization works well.

How much does AI agent memory cost?#

Memory operations add 20-40% to base API costs due to summarization, entity extraction, and embedding calls. Using affordable models through Crazyrouter (Gemini Flash for summarization, text-embedding-3-small for vectors) keeps overhead under $0.005 per interaction.

Can I use memory patterns with any LLM?#

Yes. Memory patterns are model-agnostic—they work with GPT-5, Claude, Gemini, DeepSeek, and any model accessible via the OpenAI-compatible API format. Crazyrouter provides access to 300+ models through a single endpoint.

How do I persist AI agent memory across server restarts?#

Use a database. Redis is ideal for short-term conversation state, while PostgreSQL with the pgvector extension handles long-term vector memory. For simpler setups, JSON files or SQLite work for prototyping.

What's the difference between RAG and agent memory?#

RAG (Retrieval-Augmented Generation) retrieves from a static knowledge base. Agent memory retrieves from the agent's own interaction history. Both use vector search, but agent memory is dynamic and grows with each conversation.

Summary#

AI agent memory is what separates a stateless chatbot from a genuinely useful assistant. The five patterns—buffer, sliding window, summary, vector, and hybrid—give you a toolkit for any use case.

Start with buffer memory for prototypes, graduate to sliding window for production chatbots, and implement hybrid memory for personal assistants and complex agents. With Crazyrouter's access to 300+ models, you can optimize each memory operation with the ideal model for the job. Get started at crazyrouter.com.
