Intelligent context management for LLM conversations using semantic similarity.
Reduce AI costs by 42% with <12ms semantic filtering, verified across 10 real-world scenarios.
Demo • Quick Start • Features • Architecture • API Docs
Memory-Optimized Agent is a production-ready AI context management system that uses semantic similarity to intelligently filter conversation history. Instead of sending all messages to your LLM (expensive, slow, and often irrelevant), it retrieves only the semantically relevant context—delivering faster responses at a fraction of the cost.
- High costs: Sending full conversation history to LLMs is expensive (quadratic token growth)
- Slow responses: Processing irrelevant context wastes time
- Poor quality: Irrelevant context confuses the model
- 42% cost reduction through intelligent context filtering (verified)
- <12ms semantic search - real-time filtering at scale (500+ memories)
- 1.5KB per conversation - minimal memory footprint
- Multi-model support - works with any LLM provider (Ollama, OpenAI, Anthropic)
- Production-ready with monitoring, metrics, and REST API
- Semantic Context Filtering: Uses embeddings (all-MiniLM-L6-v2) to find relevant conversation history
- Adaptive Thresholds: Automatically adjusts similarity threshold based on retrieval quality
- Topic-Aware Memory: Intelligently handles conversation topic switches
- Multi-User Support: Namespace isolation for different users and sessions
- LangGraph Workflow: Production-grade orchestration with state management
- REST API: FastAPI with async support and OpenAPI docs
- Database: PostgreSQL with pgvector for scalable vector search
- Monitoring: Prometheus metrics + structured logging
- CLI Interface: Interactive terminal UI with real-time context visualization
- Docker: Full stack deployment with Docker Compose
- Fast semantic search: HNSW indexing for sub-100ms vector search
- Scalable: Handles 500+ conversation memories with linear performance
- Minimal overhead: <10ms filtering adds no perceptible latency
- Efficient storage: 1.5KB per conversation with 384-dimensional embeddings
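The core filtering step can be sketched in a few lines. This is a simplified illustration with hand-rolled cosine similarity and toy 2-dimensional vectors; the real system uses sentence-transformers embeddings and pgvector, so `filter_context` and the memory dict layout here are hypothetical stand-ins.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_context(query_vec, memories, threshold=0.5):
    """Keep only memories whose similarity to the query clears the
    threshold, most similar first."""
    scored = [(cosine_similarity(query_vec, m["embedding"]), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored if score >= threshold]
```

With the default threshold of 0.5, a grocery-shopping memory scores far below an interview-related query and is dropped before the LLM ever sees it.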
- Python 3.12+
- Docker & Docker Compose (recommended)
- Ollama (for local LLM) - Install
- UV (Python package manager) - Install
# Clone repository
git clone https://github.com/yourusername/memory-optimized-agent
cd memory-optimized-agent
# Start services (PostgreSQL, Ollama, API)
docker-compose up -d
# Pull LLM model (first time only)
docker exec memory-agent-ollama ollama pull deepseek-r1:1.5b
# Run database migrations
uv run python migrations/run_migration.py

Access the API at http://localhost:8000/docs for interactive API documentation.
# Install dependencies
uv sync
# Start PostgreSQL (if not using Docker)
# Setup connection in .env file
# Pull Ollama model
ollama pull deepseek-r1:1.5b
# Run migrations
uv run python migrations/run_migration.py
# Start API server
uv run python -m src.api.app

Access the API at http://localhost:8000/docs
# Run the demo (uses in-memory store)
uv run python main.py

This demonstrates:
- Conversation about interview preparation
- Topic switch to grocery shopping
- Return to interviews → retrieves only interview context (not groceries!)
flowchart TD
Query[User Query] --> Agent[LangGraph Agent]
Agent --> Filter[1. Filter Context]
Filter --> Generate[2. Generate Response]
Generate --> Save[3. Save Memory]
Filter --> FilterDetails["- Generate embedding<br/>- Vector search<br/>- Filter by threshold"]
Generate --> GenDetails["- Relevant context only<br/>- LLM inference"]
Save --> SaveDetails["- Save messages<br/>- Generate embeddings"]
FilterNote["Search past messages<br/>by semantic similarity"] -.-> Filter
GenNote["Send filtered context<br/>to LLM for response"] -.-> Generate
SaveNote["Store conversation<br/>for future retrieval"] -.-> Save
style Query fill:#1976d2,stroke:#0d47a1,stroke-width:3px,color:#fff
style Agent fill:#f57c00,stroke:#e65100,stroke-width:3px,color:#fff
style Filter fill:#388e3c,stroke:#1b5e20,stroke-width:3px,color:#fff
style Generate fill:#d32f2f,stroke:#b71c1c,stroke-width:3px,color:#fff
style Save fill:#7b1fa2,stroke:#4a148c,stroke-width:3px,color:#fff
- PostgreSQL + pgvector: Scalable vector storage with HNSW indexing
- sentence-transformers: Embedding generation (all-MiniLM-L6-v2)
- Ollama / OpenAI / Anthropic: Flexible LLM provider support
- LangGraph: Workflow orchestration with checkpointing
- FastAPI: Async REST API with automatic OpenAPI docs
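The three workflow steps in the diagram above (Filter Context → Generate Response → Save Memory) can be sketched as a single turn function. This is an illustrative stand-in, not the project's LangGraph code: `embed` and `llm` are caller-supplied stubs, and an in-memory list stands in for the pgvector store.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def run_turn(query, store, embed, llm, threshold=0.5):
    # 1. Filter Context: embed the query, keep only relevant memories
    qvec = embed(query)
    relevant = [m for m in store if cosine(qvec, m["vec"]) >= threshold]
    # 2. Generate Response: the LLM sees only the filtered context
    response = llm(query, relevant)
    # 3. Save Memory: persist the turn with its embedding for future retrieval
    store.append({"text": query, "vec": qvec})
    return response, relevant
```

Run three turns with a topic switch in the middle and the third turn retrieves only the first (same-topic) memory, mirroring the interview/groceries demo.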
The interactive CLI (cli.py) demonstrates the semantic filtering in action:
Features:
- Live Status Panel: Shows current similarity threshold and adaptive mode status
- Context Visualization: Color-coded similarity scores for retrieved memories
- Real-time Statistics: Memory count and filtering effectiveness
- Message History: User and AI messages with timestamps
Run the CLI:
uv run python cli.py

Try switching topics mid-conversation to see how the agent filters context intelligently!
Verified across 10 real-world conversation scenarios with topic switches.
| Metric | Result | Why It Matters |
|---|---|---|
| Token Reduction | 42% | Direct cost savings, works with any LLM |
| Filtering Speed | <10ms average | Adds no perceptible latency |
| Memory Efficiency | 1.5KB/conversation | Scales to millions of messages |
| Context Quality | 0.3 memories/query | Highly selective, low noise |
Note: Total response time depends on your LLM choice (Ollama, OpenAI, Claude, etc.). Semantic filtering adds only ~10ms overhead regardless of which model you use.
Quantitative Performance:
- Token savings: 40-43% reduction across all memory sizes (5-100 conversations)
- Search performance: Scales linearly from 7.8ms (10 memories) to 9.9ms (500 memories)
- Memory efficiency: Consistent 1.5KB per conversation with 384-dim embeddings
- Threshold sensitivity: Tested 7 thresholds (0.2-0.8), recommended default: 0.5
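The token-savings figure is straightforward arithmetic over prompt sizes. The numbers below are hypothetical, chosen only to show how a 42% reduction is computed from baseline vs. filtered prompt tokens.

```python
def token_reduction_pct(baseline_tokens, filtered_tokens):
    """Percent of prompt tokens saved by sending filtered context
    instead of the full conversation history."""
    return 100.0 * (baseline_tokens - filtered_tokens) / baseline_tokens

# Hypothetical turn: full history costs 1200 prompt tokens,
# filtered context costs 696.
print(token_reduction_pct(1200, 696))  # prints 42.0
```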
Qualitative Scenarios (Real Conversations):
- Technical Interview Preparation
- Travel Planning (with work interruptions)
- Home Renovation (with fitness topic switches)
- SaaS Startup Launch
- Career Transition PM→Engineering
- Parenting & Child Development
- Learning Spanish for Travel
- Photography Fundamentals
- Freelance Writing Business
- Managing Stress & Anxiety
Key Finding: Smart filtering correctly retrieves relevant context even after topic switches, using only 0.3 memories per query on average (highly selective).
# Full benchmark suite (30-60 minutes)
uv run python benchmark.py
# View results
cat benchmark_results/reference/BENCHMARK_RESULTS.md
# Resume interrupted runs
uv run python benchmark.py --resume --timestamp=YYYYMMDD_HHMMSS

View detailed benchmark report →
- Example: Perplexity AI uses semantic memory for context-aware responses across conversation threads
- Solution: Semantic filtering of conversation history reduces token costs while maintaining context quality
- Impact: 42% token reduction (verified), <10ms filtering overhead, scales to 500+ conversation memories

- Example: Microsoft's Chat Copilot integrates semantic memory for contextual code assistance
- Solution: Retrieve only relevant code snippets and previous Q&A from conversation history
- Impact: 42% fewer tokens per query, maintains conversation continuity across multi-turn sessions

- Industry Context: Smart memory systems reduce token costs by 80-90% while improving response quality (Mem0, October 2025)
- Solution: Semantic retrieval of relevant past interactions instead of sending full conversation history
- Impact: 42% token reduction (verified), <10ms filtering, educational applications report 40% lower costs (2025 data)
Threshold Selection Challenge
- Optimal similarity threshold varies by conversation type and domain
- Default 0.5 threshold may be too strict or too loose for specific use cases
- Adaptive threshold helps but requires calibration over multiple queries
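One way such calibration can work is a bounded nudge toward a target retrieval count. This is an illustrative heuristic, not the project's actual adaptive logic; the target range, step size, and bounds are assumed values.

```python
def adapt_threshold(threshold, n_retrieved, target=(1, 5),
                    step=0.05, bounds=(0.2, 0.8)):
    """Nudge the similarity threshold toward a target retrieval count:
    stricter when too many memories come back, looser when too few."""
    low, high = target
    if n_retrieved > high:
        threshold += step   # too much noise -> raise the bar
    elif n_retrieved < low:
        threshold -= step   # too little context -> loosen
    lo, hi = bounds
    return min(max(threshold, lo), hi)
```

Because each query only moves the threshold one step, the calibration stabilizes over multiple queries rather than reacting to a single outlier.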
Semantic Similarity vs. Conversational Relevance
- Cosine similarity captures semantic meaning but not conversational flow
- Temporally distant but relevant context may score lower than recent similar-but-irrelevant messages
- Topic boundaries are implicit in similarity scores, not explicitly detected
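Explicit boundary detection is one possible mitigation, not something this project implements: flag a topic switch when consecutive turn embeddings diverge sharply, and use that as a signal alongside the retrieval threshold. The boundary value below is an assumed illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_topic_switch(prev_vec, curr_vec, boundary=0.3):
    """Flag an explicit topic boundary when the similarity between
    consecutive turns falls below the boundary value."""
    return cosine(prev_vec, curr_vec) < boundary
```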
Evaluation Challenges
- Difficult to establish ground truth for "correct" context retrieval
- Token reduction is measurable (42%), but semantic quality is subjective
- Benchmarks validate topic-switching scenarios qualitatively, not quantitatively
Summarization: Compresses conversation history 5x while preserving key information, but requires additional LLM calls and depends on summarization quality
Sliding Window: Simplest approach that retains recent N messages, but completely loses earlier context regardless of relevance
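For comparison, the sliding-window baseline fits in one line; the point is that it keeps recency, not relevance.

```python
def sliding_window(history, n=10):
    """Baseline: keep only the most recent n messages,
    regardless of semantic relevance."""
    return history[-n:]
```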
Hybrid Methods: Emerging solutions (e.g., Mem0's smart memory) combine semantic filtering with summarization and explicit memory formation, achieving 80-90% token reduction
Context filtering for LLMs remains an active research area in 2025. No consensus exists on the "best" approach - each method involves trade-offs between computational cost, context quality, and implementation complexity. This project demonstrates semantic filtering as a practical solution with verified 42% token reduction and <10ms overhead.
from src.core.agent import MemoryOptimizedAgent
from config import MemoryConfig, LLMConfig
# Initialize agent
agent = MemoryOptimizedAgent(
memory_config=MemoryConfig(similarity_threshold=0.5),
llm_config=LLMConfig(model="deepseek-r1:1.5b")
)
# Chat with semantic filtering
response = agent.chat(
message="How do I prepare for interviews?",
user_id="user_123",
session_id="session_456"
)
print(f"Response: {response['response']}")
print(f"Used {response['relevant_context_count']} relevant memories")

Full API documentation: docs/API.md
memory-optimized-agent/
├── src/
│ ├── core/ # LangGraph agent + memory stores
│ ├── api/ # FastAPI REST API
│ └── db/ # PostgreSQL connection
├── migrations/ # Database migrations
├── benchmark_results/ # Verified benchmark data
├── cli.py # Interactive terminal UI
├── main.py # Simple demo script
├── docker-compose.yml # Full stack deployment
└── docs/ # Technical documentation
- Vector Search: pgvector with HNSW indexing for sub-100ms search on 100K vectors
- Adaptive Thresholds: Auto-adjusts similarity threshold based on retrieval quality
- Production Monitoring: Prometheus metrics + Grafana dashboards for observability
- Multi-Model Support: Swap between Ollama, OpenAI, Anthropic with simple config changes
Full technical details: docs/DEPLOYMENT.md
# Start all services with Docker Compose
docker-compose up -d
# Check service health
docker-compose ps
# View logs
docker-compose logs -f api

Full deployment guide including Kubernetes, monitoring setup, and configuration: docs/DEPLOYMENT.md
The agent supports multiple LLM and embedding providers:
LLM Options:
- Ollama (local): deepseek-r1, deepseek-v3, llama-3.3, mistral
- OpenAI: gpt-5, gpt-4.1, gpt-4o
- Anthropic: claude-sonnet-4.5, claude-opus-4.1
Embedding Options:
- sentence-transformers: all-MiniLM-L6-v2 (fast), all-mpnet-base-v2 (accurate)
- OpenAI: text-embedding-3-small, text-embedding-3-large
Swap models with simple config changes - see docs/DEPLOYMENT.md
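Building on the `LLMConfig` shown in the quick-start example, a provider swap can be as small as changing the model id. Whether the provider is inferred from the model name is an assumption here; see docs/DEPLOYMENT.md for the authoritative configuration options.

```python
from config import LLMConfig

# Local inference via Ollama
llm = LLMConfig(model="deepseek-r1:1.5b")

# Hosted alternatives: pass the provider's model id instead
llm = LLMConfig(model="gpt-4o")             # OpenAI
llm = LLMConfig(model="claude-sonnet-4.5")  # Anthropic
```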
This system applies semantic similarity-based filtering to multi-turn conversation history - addressing the token cost problem through intelligent context retrieval before LLM inference. Key aspects:
- Pre-inference filtering: Selects relevant conversation history before sending to LLM
- Semantic retrieval: Uses embeddings and cosine similarity to find contextually relevant messages
- Adaptive thresholds: Dynamic calibration based on retrieval patterns
- Topic-aware memory: Handles conversation topic switches while maintaining relevant context
- Sentence-BERT (Reimers & Gurevych, 2019): Semantic sentence embeddings
- LangGraph: Workflow orchestration framework
- pgvector: Vector similarity search at scale
After development, we found similar problems being addressed through different approaches:
- LazyLLM (Apple, 2024): Prunes tokens during inference using attention scores (complementary approach)
- Mem0 (2025): Hybrid memory systems combining filtering with summarization
- Semantic Kernel (Microsoft): Chat history management patterns for enterprise applications
This project demonstrates that semantic filtering alone, applied before inference, achieves significant cost reduction (42%) with minimal latency overhead (<10ms).
Built to demonstrate production-grade AI/ML engineering skills