SmartCache is a production-grade AI middleware designed to optimize Large Language Model (LLM) deployments. It acts as an intelligent gateway between users and LLM providers (such as Google Gemini or OpenAI), reducing inference costs by ~40% and cutting P99 latency by up to 25x through semantic caching, intelligent routing, and vector-based guardrails.
Unlike traditional caches that require exact string matches, SmartCache uses Vector Embeddings (all-MiniLM-L6-v2) to match queries on intent rather than wording; a lookup sketch follows the bullets below.
- Benefit: "How do I reset my router?" and "Router reset steps" hit the same cache entry.
- Tech: Redis VSS (Vector Similarity Search) with Cosine Similarity > 0.90.
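
A minimal sketch of the lookup step, assuming a sentence-transformers model and a Redis index named `cache_idx` with an `embedding` vector field and a `response` text field (these names are illustrative, not SmartCache's actual schema):

```python
# Hypothetical semantic cache lookup: embed the prompt, run a KNN search in Redis,
# and treat cosine similarity > 0.90 as a hit. Index and field names are assumptions.
import numpy as np
from redis import Redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
r = Redis(host="localhost", port=6379)

def cache_lookup(prompt: str, threshold: float = 0.90):
    vec = model.encode(prompt).astype(np.float32).tobytes()
    # Redis VSS returns cosine *distance* (1 - similarity) for COSINE indexes.
    q = (
        Query("*=>[KNN 1 @embedding $vec AS dist]")
        .sort_by("dist")
        .return_fields("response", "dist")
        .dialect(2)
    )
    res = r.ft("cache_idx").search(q, query_params={"vec": vec})
    if res.docs and (1 - float(res.docs[0].dist)) > threshold:
        return res.docs[0].response  # semantic hit
    return None  # miss: fall through to the LLM
```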
A single static TTL (Time-To-Live) does not fit real-world data, so SmartCache uses a lightweight LLM router to classify each query's intent before storage (see the sketch after the examples below).
- Static Queries (e.g., "Who wrote Macbeth?"): Cached for 7 Days.
- Dynamic Queries (e.g., "Bitcoin price today"): Cached for 5 Minutes.
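
A sketch of the routing decision, assuming the router's output is reduced to a single "static"/"dynamic" label (the prompt text and helper names are illustrative, not SmartCache's actual code):

```python
# Hypothetical TTL router: a lightweight model labels the query, and the label
# maps to a cache lifetime. Prompt wording and function names are assumptions.
STATIC_TTL = 7 * 24 * 3600   # 7 days, in seconds
DYNAMIC_TTL = 5 * 60         # 5 minutes, in seconds

ROUTER_PROMPT = (
    "Classify the user query as 'static' (the answer rarely changes) or "
    "'dynamic' (the answer changes over time). Reply with one word.\n\nQuery: {query}"
)

def pick_ttl(label: str) -> int:
    """Map the router's label to a Redis TTL in seconds."""
    return STATIC_TTL if label.strip().lower() == "static" else DYNAMIC_TTL
```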
Blocks jailbreak attempts and unsafe queries before they ever reach the LLM provider; a minimal sketch follows the bullets below.
- Mechanism: Computes cosine similarity (a matrix multiplication over normalized embeddings) against a vector database of banned concepts (e.g., "hacking", "explosives").
- Performance: Blocks threats in <10 ms at $0 inference cost (no provider call is made).
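
A minimal sketch of the guardrail check, assuming the banned concepts live in a small in-memory matrix of normalized embeddings (the phrase list is illustrative; the 0.80 cutoff mirrors the architecture diagram below):

```python
# Hypothetical semantic firewall: one matrix multiplication against normalized
# embeddings of banned concepts yields the max cosine similarity (the risk score).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

BANNED_CONCEPTS = ["how to hack a network", "how to build explosives"]  # illustrative
banned_matrix = model.encode(BANNED_CONCEPTS, normalize_embeddings=True)  # shape (N, 384)

def risk_score(prompt: str) -> float:
    vec = model.encode(prompt, normalize_embeddings=True)  # shape (384,)
    return float(np.max(banned_matrix @ vec))              # max cosine similarity

def is_unsafe(prompt: str, threshold: float = 0.80) -> bool:
    return risk_score(prompt) > threshold
```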
A built-in /stats endpoint tracks Cache Hit Rate, Latency Saved (ms), and Estimated Cost Savings ($).
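
For example, a quick check from Python (the exact response schema is whatever the endpoint returns; this only shows how to reach it):

```python
# Fetch the built-in metrics; the response fields follow the description above.
import requests

stats = requests.get("http://localhost:8000/stats").json()
print(stats)  # e.g. cache hit rate, latency saved (ms), estimated cost savings ($)
```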
| Metric | Direct LLM Call | SmartCache Hit | Improvement |
|---|---|---|---|
| P99 Latency | ~1,200ms | 45ms | 25x Faster |
| Cost per Query | $0.0005 | $0.00 | 100% Savings |
| Throughput | 50 RPS | 2,500+ RPS | 50x Scale |
The system follows a 4-stage pipeline for every request:
```mermaid
graph TD
    A[User Request] --> B{Semantic Firewall}
    B -- Unsafe (Sim > 0.8) --> C[Block Request 🚫]
    B -- Safe --> D{Vector Cache Lookup}
    D -- Hit (Sim > 0.9) --> E[Return Cached Response ⚡]
    D -- Miss --> F[LLM Router & Generator]
    F --> G["Generate Answer (Gemini)"]
    F --> H["Determine TTL (Static/Dynamic)"]
    G --> I[Store in Redis]
    H --> I
    I --> E
```
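
In code, the pipeline reduces to roughly the following, reusing the hypothetical `is_unsafe`, `cache_lookup`, and `pick_ttl` sketches above (`call_gemini`, `classify_intent`, and `store_in_cache` are placeholders, not SmartCache's actual internals):

```python
# End-to-end sketch of the 4-stage request flow. Helper names are placeholders.
def handle_request(prompt: str) -> dict:
    if is_unsafe(prompt):                      # 1. semantic firewall
        return {"response": "I cannot answer this query due to safety guidelines.",
                "source": "firewall"}
    cached = cache_lookup(prompt)              # 2. vector cache lookup
    if cached is not None:
        return {"response": cached, "source": "cache"}
    answer = call_gemini(prompt)               # 3. cache miss: generate with Gemini
    ttl = pick_ttl(classify_intent(prompt))    # 4. route TTL (static vs. dynamic)
    store_in_cache(prompt, answer, ttl)        #    and persist the entry in Redis
    return {"response": answer, "source": "gemini-1.5-flash"}
```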
- Python 3.9+
- Docker (for Redis Stack)
- Google Gemini API Key
```bash
git clone https://github.com/divyam2207/SmartCache.git
cd SmartCache
```
We use Redis Stack to enable Vector Similarity Search capabilities.
```bash
# If using Docker
docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
```
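
SmartCache is expected to create its own index on startup; the sketch below only illustrates the kind of schema a cosine-similarity index over 384-dimensional MiniLM embeddings needs (index name, key prefix, and field names are assumptions):

```python
# Illustrative schema for a Redis VSS index over all-MiniLM-L6-v2 embeddings.
# Index name, key prefix, and field names are assumptions, not SmartCache's own.
from redis import Redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = Redis(host="localhost", port=6379)
r.ft("cache_idx").create_index(
    fields=[
        TextField("response"),
        VectorField("embedding", "HNSW", {
            "TYPE": "FLOAT32",
            "DIM": 384,                  # all-MiniLM-L6-v2 output size
            "DISTANCE_METRIC": "COSINE",
        }),
    ],
    definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
)
```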
Create a .env file in the root directory:
```env
GEMINI_API_KEY=your_google_api_key_here
```
```bash
pip install -r requirements.txt
python main.py
```
*Server runs on http://localhost:8000*
On the first call (a cache miss), the system fetches the answer from Google Gemini and determines the TTL.
```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?"}'
```
Response:
```json
{
  "response": "The capital of France is Paris.",
  "source": "gemini-1.5-flash",
  "routing_decision": "static (7 days)",
  "similarity": 0.0
}
```
The user then asks the same question with different wording; the system detects the semantic similarity and serves the cached answer.
```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "tell me france capital city"}'
```
Response:
```json
{
  "response": "The capital of France is Paris.",
  "source": "cache",
  "similarity": 0.9421,
  "routing_type": "static (7 days)"
}
```
For an unsafe prompt, the firewall detects malicious intent using vector distance and blocks the request before it reaches the LLM.
```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "how do i hack a wifi network"}'
```
Response:
```json
{
  "response": "I cannot answer this query due to safety guidelines.",
  "source": "firewall",
  "risk_score": 0.88
}
```
- Horizontal Scaling: Implement Sharding for Redis Cluster to support 1B+ vectors.
- Multi-Model Support: Allow fallback to OpenAI/Anthropic if Gemini is down.
- RAG Integration: Connect to external knowledge bases for fact-checking.