Intelligent context management for LLM conversations using semantic similarity.
Reduce AI costs by 42% with <12ms semantic filtering, verified across 10 real-world scenarios.
Demo • Quick Start • Features • Architecture • API Docs
Memory-Optimized Agent is a production-ready AI context management system that uses semantic similarity to intelligently filter conversation history. Instead of sending all messages to your LLM (expensive, slow, and often irrelevant), it retrieves only the semantically relevant context—delivering faster responses at a fraction of the cost.
- High costs: Sending full conversation history to LLMs is expensive (quadratic token growth)
- Slow responses: Processing irrelevant context wastes time
- Poor quality: Irrelevant context confuses the model
- 42% cost reduction through intelligent context filtering (verified)
- <12ms semantic search - real-time filtering at scale (500+ memories)
- 1.5KB per conversation - minimal memory footprint
- Multi-model support - works with any LLM provider (Ollama, OpenAI, Anthropic)
- Production-ready with monitoring, metrics, and REST API
- Semantic Context Filtering: Uses embeddings (all-MiniLM-L6-v2) to find relevant conversation history
- Adaptive Thresholds: Automatically adjusts similarity threshold based on retrieval quality
- Topic-Aware Memory: Intelligently handles conversation topic switches
- Multi-User Support: Namespace isolation for different users and sessions
- LangGraph Workflow: Production-grade orchestration with state management
- REST API: FastAPI with async support and OpenAPI docs
- Database: PostgreSQL with pgvector for scalable vector search
- Monitoring: Prometheus metrics + structured logging
- CLI Interface: Interactive terminal UI with real-time context visualization
- Docker: Full stack deployment with Docker Compose
- Fast semantic search: HNSW indexing for sub-100ms vector search
- Scalable: Handles 500+ conversation memories with linear performance
- Minimal overhead: <10ms filtering adds no perceptible latency
- Efficient storage: 1.5KB per conversation with 384-dimensional embeddings
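The core filtering step can be sketched in a few lines. This is a simplified illustration with hand-rolled cosine similarity and toy 2-dimensional vectors; the real system uses sentence-transformers embeddings and pgvector, so `filter_context` and the memory dict layout here are hypothetical stand-ins.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_context(query_vec, memories, threshold=0.5):
    """Keep only memories whose similarity to the query clears the
    threshold, most similar first."""
    scored = [(cosine_similarity(query_vec, m["embedding"]), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored if score >= threshold]
```

With the default threshold of 0.5, a grocery-shopping memory scores far below an interview-related query and is dropped before the LLM ever sees it.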
- Python 3.12+
- Docker & Docker Compose (recommended)
- Ollama (for local LLM) - Install
- UV (Python package manager) - Install
# Clone repository
git clone https://github.com/yourusername/memory-optimized-agent
cd memory-optimized-agent
# Start services (PostgreSQL, Ollama, API)
docker-compose up -d
# Pull LLM model (first time only)
docker exec memory-agent-ollama ollama pull deepseek-r1:1.5b
# Run database migrations
uv run python migrations/run_migration.py

Access the API at http://localhost:8000/docs for interactive API documentation.
# Install dependencies
uv sync
# Start PostgreSQL (if not using Docker)
# Setup connection in .env file
# Pull Ollama model
ollama pull deepseek-r1:1.5b
# Run migrations
uv run python migrations/run_migration.py
# Start API server
uv run python -m src.api.app

Access the API at http://localhost:8000/docs
# Run the demo (uses in-memory store)
uv run python main.py

This demonstrates:
- Conversation about interview preparation
- Topic switch to grocery shopping
- Return to interviews → retrieves only interview context (not groceries!)
flowchart TD
Query[User Query] --> Agent[LangGraph Agent]
Agent --> Filter[1. Filter Context]
Filter --> Generate[2. Generate Response]
Generate --> Save[3. Save Memory]
Filter --> FilterDetails["- Generate embedding<br/>- Vector search<br/>- Filter by threshold"]
Generate --> GenDetails["- Relevant context only<br/>- LLM inference"]
Save --> SaveDetails["- Save messages<br/>- Generate embeddings"]
FilterNote["Search past messages<br/>by semantic similarity"] -.-> Filter
GenNote["Send filtered context<br/>to LLM for response"] -.-> Generate
SaveNote["Store conversation<br/>for future retrieval"] -.-> Save
style Query fill:#1976d2,stroke:#0d47a1,stroke-width:3px,color:#fff
style Agent fill:#f57c00,stroke:#e65100,stroke-width:3px,color:#fff
style Filter fill:#388e3c,stroke:#1b5e20,stroke-width:3px,color:#fff
style Generate fill:#d32f2f,stroke:#b71c1c,stroke-width:3px,color:#fff
style Save fill:#7b1fa2,stroke:#4a148c,stroke-width:3px,color:#fff
- PostgreSQL + pgvector: Scalable vector storage with HNSW indexing
- sentence-transformers: Embedding generation (all-MiniLM-L6-v2)
- Ollama / OpenAI / Anthropic: Flexible LLM provider support
- LangGraph: Workflow orchestration with checkpointing
- FastAPI: Async REST API with automatic OpenAPI docs
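The three workflow steps in the diagram above (Filter Context → Generate Response → Save Memory) can be sketched as a single turn function. This is an illustrative stand-in, not the project's LangGraph code: `embed` and `llm` are caller-supplied stubs, and an in-memory list stands in for the pgvector store.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def run_turn(query, store, embed, llm, threshold=0.5):
    # 1. Filter Context: embed the query, keep only relevant memories
    qvec = embed(query)
    relevant = [m for m in store if cosine(qvec, m["vec"]) >= threshold]
    # 2. Generate Response: the LLM sees only the filtered context
    response = llm(query, relevant)
    # 3. Save Memory: persist the turn with its embedding for future retrieval
    store.append({"text": query, "vec": qvec})
    return response, relevant
```

Run three turns with a topic switch in the middle and the third turn retrieves only the first (same-topic) memory, mirroring the interview/groceries demo.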
The interactive CLI (cli.py) demonstrates the semantic filtering in action:
Features:
- Live Status Panel: Shows current similarity threshold and adaptive mode status
- Context Visualization: Color-coded similarity scores for retrieved memories
- Real-time Statistics: Memory count and filtering effectiveness
- Message History: User and AI messages with timestamps
Run the CLI:
uv run python cli.py

Try switching topics mid-conversation to see how the agent filters context intelligently!
Verified across 10 real-world conversation scenarios with topic switches.
| Metric | Result | Why It Matters |
|---|---|---|
| Token Reduction | 42% | Direct cost savings, works with any LLM |
| Filtering Speed | <10ms average | Adds no perceptible latency |
| Memory Efficiency | 1.5KB/conversation | Scales to millions of messages |
| Context Quality | 0.3 memories/query | Highly selective, low noise |
Note: Total response time depends on your LLM choice (Ollama, OpenAI, Claude, etc.). Semantic filtering adds only ~10ms overhead regardless of which model you use.
Quantitative Performance:
- Token savings: 40-43% reduction across all memory sizes (5-100 conversations)
- Search performance: Scales linearly from 7.8ms (10 memories) to 9.9ms (500 memories)
- Memory efficiency: Consistent 1.5KB per conversation with 384-dim embeddings
- Threshold sensitivity: Tested 7 thresholds (0.2-0.8), recommended default: 0.5
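The token-savings figure is straightforward arithmetic over prompt sizes. The numbers below are hypothetical, chosen only to show how a 42% reduction is computed from baseline vs. filtered prompt tokens.

```python
def token_reduction_pct(baseline_tokens, filtered_tokens):
    """Percent of prompt tokens saved by sending filtered context
    instead of the full conversation history."""
    return 100.0 * (baseline_tokens - filtered_tokens) / baseline_tokens

# Hypothetical turn: full history costs 1200 prompt tokens,
# filtered context costs 696.
print(token_reduction_pct(1200, 696))  # prints 42.0
```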
Qualitative Scenarios (Real Conversations):
- Technical Interview Preparation
- Travel Planning (with work interruptions)
- Home Renovation (with fitness topic switches)
- SaaS Startup Launch
- Career Transition PM→Engineering
- Parenting & Child Development
- Learning Spanish for Travel
- Photography Fundamentals
- Freelance Writing Business
- Managing Stress & Anxiety
Key Finding: Smart filtering correctly retrieves relevant context even after topic switches, using only 0.3 memories per query on average (highly selective).
# Full benchmark suite (30-60 minutes)
uv run python benchmark.py
# View results
cat benchmark_results/reference/BENCHMARK_RESULTS.md
# Resume interrupted runs
uv run python benchmark.py --resume --timestamp=YYYYMMDD_HHMMSS

View detailed benchmark report →
- Example: Perplexity AI uses semantic memory for context-aware responses across conversation threads
- Solution: Semantic filtering of conversation history reduces token costs while maintaining context quality
- Impact: 42% token reduction (verified), <10ms filtering overhead, scales to 500+ conversation memories

- Example: Microsoft's Chat Copilot integrates semantic memory for contextual code assistance
- Solution: Retrieve only relevant code snippets and previous Q&A from conversation history
- Impact: 42% fewer tokens per query, maintains conversation continuity across multi-turn sessions

- Industry Context: Smart memory systems reduce token costs by 80-90% while improving response quality (Mem0, October 2025)
- Solution: Semantic retrieval of relevant past interactions instead of sending full conversation history
- Impact: 42% token reduction (verified), <10ms filtering, educational applications report 40% lower costs (2025 data)
Threshold Selection Challenge
- Optimal similarity threshold varies by conversation type and domain
- Default 0.5 threshold may be too strict or too loose for specific use cases
- Adaptive threshold helps but requires calibration over multiple queries
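One way such calibration can work is a bounded nudge toward a target retrieval count. This is an illustrative heuristic, not the project's actual adaptive logic; the target range, step size, and bounds are assumed values.

```python
def adapt_threshold(threshold, n_retrieved, target=(1, 5),
                    step=0.05, bounds=(0.2, 0.8)):
    """Nudge the similarity threshold toward a target retrieval count:
    stricter when too many memories come back, looser when too few."""
    low, high = target
    if n_retrieved > high:
        threshold += step   # too much noise -> raise the bar
    elif n_retrieved < low:
        threshold -= step   # too little context -> loosen
    lo, hi = bounds
    return min(max(threshold, lo), hi)
```

Because each query only moves the threshold one step, the calibration stabilizes over multiple queries rather than reacting to a single outlier.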
Semantic Similarity vs. Conversational Relevance
- Cosine similarity captures semantic meaning but not conversational flow
- Temporally distant but relevant context may score lower than recent similar-but-irrelevant messages
- Topic boundaries are implicit in similarity scores, not explicitly detected
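Explicit boundary detection is one possible mitigation, not something this project implements: flag a topic switch when consecutive turn embeddings diverge sharply, and use that as a signal alongside the retrieval threshold. The boundary value below is an assumed illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_topic_switch(prev_vec, curr_vec, boundary=0.3):
    """Flag an explicit topic boundary when the similarity between
    consecutive turns falls below the boundary value."""
    return cosine(prev_vec, curr_vec) < boundary
```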
Evaluation Challenges
- Difficult to establish ground truth for "correct" context retrieval
- Token reduction is measurable (42%), but semantic quality is subjective
- Benchmarks validate topic-switching scenarios qualitatively, not quantitatively
Summarization: Compresses conversation history 5x while preserving key information, but requires additional LLM calls and depends on summarization quality
Sliding Window: Simplest approach that retains recent N messages, but completely loses earlier context regardless of relevance
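For comparison, the sliding-window baseline fits in one line; the point is that it keeps recency, not relevance.

```python
def sliding_window(history, n=10):
    """Baseline: keep only the most recent n messages,
    regardless of semantic relevance."""
    return history[-n:]
```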
Hybrid Methods: Emerging solutions (e.g., Mem0's smart memory) combine semantic filtering with summarization and explicit memory formation, achieving 80-90% token reduction
Context filtering for LLMs remains an active research area in 2025. No consensus exists on the "best" approach - each method involves trade-offs between computational cost, context quality, and implementation complexity. This project demonstrates semantic filtering as a practical solution with verified 42% token reduction and <10ms overhead.
from src.core.agent import MemoryOptimizedAgent
from config import MemoryConfig, LLMConfig
# Initialize agent
agent = MemoryOptimizedAgent(
memory_config=MemoryConfig(similarity_threshold=0.5),
llm_config=LLMConfig(model="deepseek-r1:1.5b")
)
# Chat with semantic filtering
response = agent.chat(
message="How do I prepare for interviews?",
user_id="user_123",
session_id="session_456"
)
print(f"Response: {response['response']}")
print(f"Used {response['relevant_context_count']} relevant memories")

Full API documentation: docs/API.md
memory-optimized-agent/
├── src/
│ ├── core/ # LangGraph agent + memory stores
│ ├── api/ # FastAPI REST API
│ └── db/ # PostgreSQL connection
├── migrations/ # Database migrations
├── benchmark_results/ # Verified benchmark data
├── cli.py # Interactive terminal UI
├── main.py # Simple demo script
├── docker-compose.yml # Full stack deployment
└── docs/ # Technical documentation
- Vector Search: pgvector with HNSW indexing for sub-100ms search on 100K vectors
- Adaptive Thresholds: Auto-adjusts similarity threshold based on retrieval quality
- Production Monitoring: Prometheus metrics + Grafana dashboards for observability
- Multi-Model Support: Swap between Ollama, OpenAI, Anthropic with simple config changes
Full technical details: docs/DEPLOYMENT.md
# Start all services with Docker Compose
docker-compose up -d
# Check service health
docker-compose ps
# View logs
docker-compose logs -f api

Full deployment guide including Kubernetes, monitoring setup, and configuration: docs/DEPLOYMENT.md
The agent supports multiple LLM and embedding providers:
LLM Options:
- Ollama (local): deepseek-r1, deepseek-v3, llama-3.3, mistral
- OpenAI: gpt-5, gpt-4.1, gpt-4o
- Anthropic: claude-sonnet-4.5, claude-opus-4.1
Embedding Options:
- sentence-transformers: all-MiniLM-L6-v2 (fast), all-mpnet-base-v2 (accurate)
- OpenAI: text-embedding-3-small, text-embedding-3-large
Swap models with simple config changes - see docs/DEPLOYMENT.md
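Building on the `LLMConfig` shown in the quick-start example, a provider swap can be as small as changing the model id. Whether the provider is inferred from the model name is an assumption here; see docs/DEPLOYMENT.md for the authoritative configuration options.

```python
from config import LLMConfig

# Local inference via Ollama
llm = LLMConfig(model="deepseek-r1:1.5b")

# Hosted alternatives: pass the provider's model id instead
llm = LLMConfig(model="gpt-4o")             # OpenAI
llm = LLMConfig(model="claude-sonnet-4.5")  # Anthropic
```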
This system applies semantic similarity-based filtering to multi-turn conversation history - addressing the token cost problem through intelligent context retrieval before LLM inference. Key aspects:
- Pre-inference filtering: Selects relevant conversation history before sending to LLM
- Semantic retrieval: Uses embeddings and cosine similarity to find contextually relevant messages
- Adaptive thresholds: Dynamic calibration based on retrieval patterns
- Topic-aware memory: Handles conversation topic switches while maintaining relevant context
- Sentence-BERT (Reimers & Gurevych, 2019): Semantic sentence embeddings
- LangGraph: Workflow orchestration framework
- pgvector: Vector similarity search at scale
After development, we found similar problems being addressed through different approaches:
- LazyLLM (Apple, 2024): Prunes tokens during inference using attention scores (complementary approach)
- Mem0 (2025): Hybrid memory systems combining filtering with summarization
- Semantic Kernel (Microsoft): Chat history management patterns for enterprise applications
This project demonstrates that semantic filtering alone, applied before inference, achieves significant cost reduction (42%) with minimal latency overhead (<10ms).
Built to demonstrate production-grade AI/ML engineering skills