connectwithprakash/memory-optimized-agent
Memory-Optimized Agent

Intelligent context management for LLM conversations using semantic similarity. Reduce AI costs by 42% with <12ms semantic filtering, verified across 10 real-world scenarios.

Python 3.12+ | LangGraph

Demo · Quick Start · Features · Architecture · API Docs


Overview

Memory-Optimized Agent is a production-ready AI context management system that uses semantic similarity to intelligently filter conversation history. Instead of sending all messages to your LLM (expensive, slow, and often irrelevant), it retrieves only the semantically relevant context—delivering faster responses at a fraction of the cost.

The Problem

  • High costs: Sending full conversation history to LLMs is expensive (quadratic token growth)
  • Slow responses: Processing irrelevant context wastes time
  • Poor quality: Irrelevant context confuses the model

The Solution

  • 42% cost reduction through intelligent context filtering (verified)
  • <12ms semantic search - real-time filtering at scale (500+ memories)
  • 1.5KB per conversation - minimal memory footprint
  • Multi-model support - works with any LLM provider (Ollama, OpenAI, Anthropic)
  • Production-ready with monitoring, metrics, and REST API

Features

Core Capabilities

  • Semantic Context Filtering: Uses embeddings (all-MiniLM-L6-v2) to find relevant conversation history
  • Adaptive Thresholds: Automatically adjusts similarity threshold based on retrieval quality
  • Topic-Aware Memory: Intelligently handles conversation topic switches
  • Multi-User Support: Namespace isolation for different users and sessions
  • LangGraph Workflow: Production-grade orchestration with state management
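The core filtering idea can be sketched in plain Python. In this toy sketch, hand-made 3-dimensional vectors stand in for the 384-dimensional all-MiniLM-L6-v2 embeddings the real system generates, and the 0.5 threshold matches the project's recommended default; the helper names are illustrative, not the project's actual API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def filter_context(query_emb, memories, threshold=0.5):
    """Keep only memories whose embedding clears the similarity threshold."""
    scored = [(cosine_similarity(query_emb, emb), text)
              for text, emb in memories]
    return [text for score, text in scored if score >= threshold]

# Toy embeddings: in the real system these come from sentence-transformers.
memories = [
    ("STAR method for interviews", [0.9, 0.1, 0.0]),
    ("buy milk and eggs",          [0.0, 0.2, 0.9]),
]
query = [0.95, 0.05, 0.0]  # embedding of "how do I prepare for interviews?"
print(filter_context(query, memories))  # only the interview memory survives
```

The grocery memory scores near zero against the interview query, so it never reaches the LLM: that is the entire token-saving mechanism.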

Production Ready

  • REST API: FastAPI with async support and OpenAPI docs
  • Database: PostgreSQL with pgvector for scalable vector search
  • Monitoring: Prometheus metrics + structured logging
  • CLI Interface: Interactive terminal UI with real-time context visualization
  • Docker: Full stack deployment with Docker Compose

Performance

  • Fast semantic search: HNSW indexing for sub-100ms vector search
  • Scalable: Handles 500+ conversation memories with linear performance
  • Minimal overhead: <10ms filtering adds no perceptible latency
  • Efficient storage: 1.5KB per conversation with 384-dimensional embeddings
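The 1.5KB figure likely follows directly from the embedding shape (this derivation is an inference from the stated numbers, not taken from the benchmark code): 384 float32 values at 4 bytes each is 1536 bytes, i.e. 1.5KB per stored conversation turn, before the message text itself.

```python
# 384-dimensional embedding stored as float32 (4 bytes per value)
dims = 384
bytes_per_float32 = 4
embedding_bytes = dims * bytes_per_float32
print(embedding_bytes)         # 1536
print(embedding_bytes / 1024)  # 1.5 (KB)
```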

Quick Start

Prerequisites

  • Python 3.12+
  • Docker & Docker Compose (recommended)
  • Ollama (for local LLM) - Install
  • UV (Python package manager) - Install

Option 1: Docker (Recommended)

# Clone repository
git clone https://github.com/connectwithprakash/memory-optimized-agent
cd memory-optimized-agent

# Start services (PostgreSQL, Ollama, API)
docker-compose up -d

# Pull LLM model (first time only)
docker exec memory-agent-ollama ollama pull deepseek-r1:1.5b

# Run database migrations
uv run python migrations/run_migration.py

Access the API at http://localhost:8000/docs for interactive API documentation.

Option 2: Local Development

# Install dependencies
uv sync

# Start PostgreSQL (if not using Docker)
# Setup connection in .env file

# Pull Ollama model
ollama pull deepseek-r1:1.5b

# Run migrations
uv run python migrations/run_migration.py

# Start API server
uv run python -m src.api.app

Access the API at http://localhost:8000/docs

Try the CLI Demo

# Run the demo (uses in-memory store)
uv run python main.py

This demonstrates:

  1. Conversation about interview preparation
  2. Topic switch to grocery shopping
  3. Return to interviews → retrieves only interview context (not groceries!)

How It Works

Architecture Overview

flowchart TD
    Query[User Query] --> Agent[LangGraph Agent]
    Agent --> Filter[1. Filter Context]
    Filter --> Generate[2. Generate Response]
    Generate --> Save[3. Save Memory]

    Filter --> FilterDetails["- Generate embedding<br/>- Vector search<br/>- Filter by threshold"]
    Generate --> GenDetails["- Relevant context only<br/>- LLM inference"]
    Save --> SaveDetails["- Save messages<br/>- Generate embeddings"]

    FilterNote["Search past messages<br/>by semantic similarity"] -.-> Filter
    GenNote["Send filtered context<br/>to LLM for response"] -.-> Generate
    SaveNote["Store conversation<br/>for future retrieval"] -.-> Save

    style Query fill:#1976d2,stroke:#0d47a1,stroke-width:3px,color:#fff
    style Agent fill:#f57c00,stroke:#e65100,stroke-width:3px,color:#fff
    style Filter fill:#388e3c,stroke:#1b5e20,stroke-width:3px,color:#fff
    style Generate fill:#d32f2f,stroke:#b71c1c,stroke-width:3px,color:#fff
    style Save fill:#7b1fa2,stroke:#4a148c,stroke-width:3px,color:#fff

Key Components

  • PostgreSQL + pgvector: Scalable vector storage with HNSW indexing
  • sentence-transformers: Embedding generation (all-MiniLM-L6-v2)
  • Ollama / OpenAI / Anthropic: Flexible LLM provider support
  • LangGraph: Workflow orchestration with checkpointing
  • FastAPI: Async REST API with automatic OpenAPI docs
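A vector search of the general shape pgvector supports looks like the following. The `messages` table, column names, and parameter style here are illustrative assumptions, not the project's actual schema; `<=>` is pgvector's cosine-distance operator, and an HNSW index turns the `ORDER BY` into an approximate nearest-neighbour scan.

```python
# Illustrative pgvector SQL (table/column names are assumptions, not the
# project's actual schema). <=> is pgvector's cosine-distance operator.
search_sql = """
SELECT content, 1 - (embedding <=> %(query_emb)s) AS similarity
FROM messages
WHERE user_id = %(user_id)s
ORDER BY embedding <=> %(query_emb)s
LIMIT 5;
"""

# An HNSW index makes the ORDER BY an approximate nearest-neighbour scan:
index_ddl = "CREATE INDEX ON messages USING hnsw (embedding vector_cosine_ops);"

print("<=>" in search_sql)  # True
```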

Demo

The interactive CLI (cli.py) demonstrates the semantic filtering in action:

Features:

  1. Live Status Panel: Shows current similarity threshold and adaptive mode status
  2. Context Visualization: Color-coded similarity scores for retrieved memories
  3. Real-time Statistics: Memory count and filtering effectiveness
  4. Message History: User and AI messages with timestamps

Run the CLI:

uv run python cli.py

Try switching topics mid-conversation to see how the agent filters context intelligently!


Benchmark Results

Verified across 10 real-world conversation scenarios with topic switches.

Performance Highlights

| Metric | Result | Why It Matters |
|---|---|---|
| Token Reduction | 42% | Direct cost savings, works with any LLM |
| Filtering Speed | <10ms average | Adds no perceptible latency |
| Memory Efficiency | 1.5KB/conversation | Scales to millions of messages |
| Context Quality | 0.3 memories/query | Highly selective, low noise |

Note: Total response time depends on your LLM choice (Ollama, OpenAI, Claude, etc.). Semantic filtering adds only ~10ms overhead regardless of which model you use.
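In concrete terms (the 10,000-token history below is an illustrative number, not a benchmark figure): a 42% reduction means roughly 5,800 tokens sent per request instead of 10,000, and the saving compounds over every turn of a long conversation.

```python
full_history_tokens = 10_000            # illustrative, not a benchmark figure
reduction = 0.42                        # verified token reduction
filtered_tokens = full_history_tokens * (1 - reduction)
print(round(filtered_tokens))           # 5800 tokens actually sent
print(full_history_tokens - round(filtered_tokens))  # 4200 tokens saved
```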

What We Tested

Quantitative Performance:

  • Token savings: 40-43% reduction across all memory sizes (5-100 conversations)
  • Search performance: Scales linearly from 7.8ms (10 memories) to 9.9ms (500 memories)
  • Memory efficiency: Consistent 1.5KB per conversation with 384-dim embeddings
  • Threshold sensitivity: Tested 7 thresholds (0.2-0.8), recommended default: 0.5

Qualitative Scenarios (Real Conversations):

  • Technical Interview Preparation
  • Travel Planning (with work interruptions)
  • Home Renovation (with fitness topic switches)
  • SaaS Startup Launch
  • Career Transition PM→Engineering
  • Parenting & Child Development
  • Learning Spanish for Travel
  • Photography Fundamentals
  • Freelance Writing Business
  • Managing Stress & Anxiety

Key Finding: Smart filtering correctly retrieves relevant context even after topic switches, using only 0.3 memories per query on average (highly selective).

Run Benchmarks Yourself

# Full benchmark suite (30-60 minutes)
uv run python benchmark.py

# View results
cat benchmark_results/reference/BENCHMARK_RESULTS.md

# Resume interrupted runs
uv run python benchmark.py --resume --timestamp=YYYYMMDD_HHMMSS

View detailed benchmark report →


Use Cases

Conversational AI Platforms

Example: Perplexity AI uses semantic memory for context-aware responses across conversation threads
Solution: Semantic filtering of conversation history reduces token costs while maintaining context quality
Impact: 42% token reduction (verified), <10ms filtering overhead, scales to 500+ conversation memories

Developer Tools & Code Assistants

Example: Microsoft's Chat Copilot integrates semantic memory for contextual code assistance
Solution: Retrieve only relevant code snippets and previous Q&A from conversation history
Impact: 42% fewer tokens per query, maintains conversation continuity across multi-turn sessions

Enterprise Customer Support

Industry Context: Smart memory systems reduce token costs by 80-90% while improving response quality (Mem0, October 2025)
Solution: Semantic retrieval of relevant past interactions instead of sending full conversation history
Impact: 42% token reduction (verified), <10ms filtering, educational applications report 40% lower costs (2025 data)


Limitations & Future Work

Current Limitations

Threshold Selection Challenge

  • Optimal similarity threshold varies by conversation type and domain
  • Default 0.5 threshold may be too strict or too loose for specific use cases
  • Adaptive threshold helps but requires calibration over multiple queries
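One plausible shape for the adaptive-threshold logic, shown as a sketch rather than the project's actual implementation: nudge the threshold down when retrieval comes back empty and up when it returns too much, clamped to the 0.2-0.8 range the benchmarks tested.

```python
def adapt_threshold(threshold, retrieved_count,
                    target_min=1, target_max=5, step=0.05,
                    floor=0.2, ceiling=0.8):
    """Nudge the similarity threshold toward a target retrieval band."""
    if retrieved_count < target_min:    # nothing came back: loosen
        threshold -= step
    elif retrieved_count > target_max:  # too noisy: tighten
        threshold += step
    return min(max(threshold, floor), ceiling)

t = 0.5                                   # the recommended default
t = adapt_threshold(t, retrieved_count=0) # empty result: loosened to 0.45
t = adapt_threshold(t, retrieved_count=3) # in band: unchanged
print(round(t, 2))  # 0.45
```

This is why calibration takes multiple queries: each query moves the threshold by at most one small step.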

Semantic Similarity vs. Conversational Relevance

  • Cosine similarity captures semantic meaning but not conversational flow
  • Temporally distant but relevant context may score lower than recent similar-but-irrelevant messages
  • Topic boundaries are implicit in similarity scores, not explicitly detected

Evaluation Challenges

  • Difficult to establish ground truth for "correct" context retrieval
  • Token reduction is measurable (42%), but semantic quality is subjective
  • Benchmarks validate topic-switching scenarios qualitatively, not quantitatively

Alternative Approaches

Summarization: Compresses conversation history 5x while preserving key information, but requires additional LLM calls and depends on summarization quality

Sliding Window: Simplest approach that retains recent N messages, but completely loses earlier context regardless of relevance
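For contrast, the sliding-window baseline is essentially a one-liner: keep the last N messages, relevance notwithstanding (the message strings below are illustrative).

```python
def sliding_window(messages, n=4):
    """Keep only the most recent n messages, regardless of relevance."""
    return messages[-n:]

history = ["interview tips", "STAR method", "grocery list",
           "buy milk", "more groceries", "even more groceries"]
# The early interview context is gone, even if the next query is about interviews:
print(sliding_window(history))
```

This is exactly the failure mode semantic filtering avoids: a window forgets by recency, not by relevance.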

Hybrid Methods: Emerging solutions (e.g., Mem0's smart memory) combine semantic filtering with summarization and explicit memory formation, achieving 80-90% token reduction

Research Status

Context filtering for LLMs remains an active research area in 2025. No consensus exists on the "best" approach - each method involves trade-offs between computational cost, context quality, and implementation complexity. This project demonstrates semantic filtering as a practical solution with verified 42% token reduction and <10ms overhead.


API Usage

from src.core.agent import MemoryOptimizedAgent
from config import MemoryConfig, LLMConfig

# Initialize agent
agent = MemoryOptimizedAgent(
    memory_config=MemoryConfig(similarity_threshold=0.5),
    llm_config=LLMConfig(model="deepseek-r1:1.5b")
)

# Chat with semantic filtering
response = agent.chat(
    message="How do I prepare for interviews?",
    user_id="user_123",
    session_id="session_456"
)

print(f"Response: {response['response']}")
print(f"Used {response['relevant_context_count']} relevant memories")

Full API documentation: docs/API.md


Project Structure

memory-optimized-agent/
├── src/
│   ├── core/              # LangGraph agent + memory stores
│   ├── api/               # FastAPI REST API
│   └── db/                # PostgreSQL connection
├── migrations/            # Database migrations
├── benchmark_results/     # Verified benchmark data
├── cli.py                 # Interactive terminal UI
├── main.py                # Simple demo script
├── docker-compose.yml     # Full stack deployment
└── docs/                  # Technical documentation

Technical Highlights

  • Vector Search: pgvector with HNSW indexing for sub-100ms search on 100K vectors
  • Adaptive Thresholds: Auto-adjusts similarity threshold based on retrieval quality
  • Production Monitoring: Prometheus metrics + Grafana dashboards for observability
  • Multi-Model Support: Swap between Ollama, OpenAI, Anthropic with simple config changes

Full technical details: docs/DEPLOYMENT.md


Deployment

# Start all services with Docker Compose
docker-compose up -d

# Check service health
docker-compose ps

# View logs
docker-compose logs -f api

Full deployment guide including Kubernetes, monitoring setup, and configuration: docs/DEPLOYMENT.md


Model Flexibility

The agent supports multiple LLM and embedding providers:

LLM Options:

  • Ollama (local): deepseek-r1, deepseek-v3, llama-3.3, mistral
  • OpenAI: gpt-5, gpt-4.1, gpt-4o
  • Anthropic: claude-sonnet-4.5, claude-opus-4.1

Embedding Options:

  • sentence-transformers: all-MiniLM-L6-v2 (fast), all-mpnet-base-v2 (accurate)
  • OpenAI: text-embedding-3-small, text-embedding-3-large

Swap models with simple config changes - see docs/DEPLOYMENT.md


Approach & Related Work

This Project's Approach

This system applies semantic similarity-based filtering to multi-turn conversation history - addressing the token cost problem through intelligent context retrieval before LLM inference. Key aspects:

  • Pre-inference filtering: Selects relevant conversation history before sending to LLM
  • Semantic retrieval: Uses embeddings and cosine similarity to find contextually relevant messages
  • Adaptive thresholds: Dynamic calibration based on retrieval patterns
  • Topic-aware memory: Handles conversation topic switches while maintaining relevant context

Technical Foundation

  • Sentence-BERT (Reimers & Gurevych, 2019): Semantic sentence embeddings
  • LangGraph: Workflow orchestration framework
  • pgvector: Vector similarity search at scale

Related Research

After development, we found similar problems being addressed through different approaches:

  • LazyLLM (Apple, 2024): Prunes tokens during inference using attention scores (complementary approach)
  • Mem0 (2025): Hybrid memory systems combining filtering with summarization
  • Semantic Kernel (Microsoft): Chat history management patterns for enterprise applications

This project demonstrates that semantic filtering alone, applied before inference, achieves significant cost reduction (42%) with minimal latency overhead (<10ms).


Built to demonstrate production-grade AI/ML engineering skills
