
SmartCache: Intelligent Semantic Layer for LLMs

SmartCache is production-grade AI middleware designed to optimize Large Language Model (LLM) deployments. It acts as an intelligent gateway between users and LLM providers (such as Google Gemini or OpenAI), cutting inference costs by 40% and P99 latency by 25x through semantic caching, intelligent TTL routing, and vector-based guardrails.


🚀 Key Features

1. 🧠 Semantic Caching (Not Just Key-Value)

Unlike traditional caches that require exact string matches, SmartCache uses Vector Embeddings (all-MiniLM-L6-v2) to understand user intent.

  • Benefit: "How do I reset my router?" and "Router reset steps" hit the same cache entry.
  • Tech: Redis VSS (Vector Similarity Search) with Cosine Similarity > 0.90.
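
Under the hood, a semantic hit is just "embed the prompt, compare against stored vectors." A minimal sketch of that lookup in Python, assuming L2-normalized embeddings; the function and variable names are illustrative, not the project's actual API:

from sentence_transformers import SentenceTransformer
import numpy as np

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

def cache_lookup(prompt, cached_entries, threshold=0.90):
    # cached_entries: list of (embedding, response) pairs whose embeddings
    # are L2-normalized, so a dot product equals cosine similarity.
    query = model.encode(prompt, normalize_embeddings=True)
    for emb, response in cached_entries:
        if float(np.dot(query, emb)) > threshold:
            return response  # semantic hit
    return None  # miss: fall through to the LLM

In production the Python loop is replaced by a Redis VSS KNN query, which performs the same comparison index-side.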

2. 🚦 Intelligent TTL Routing (Agentic Workflow)

A single static TTL (Time-To-Live) fails for real-world data: stable facts and fast-moving data go stale at very different rates. SmartCache uses a lightweight LLM router to classify intent before storage.

  • Static Queries (e.g., "Who wrote Macbeth?"): Cached for 7 Days.
  • Dynamic Queries (e.g., "Bitcoin price today"): Cached for 5 Minutes.
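
A sketch of how such a router can map the classification to a TTL; the classifier prompt and helper names below are assumptions, not taken from this repo:

STATIC_TTL = 7 * 24 * 3600   # "Who wrote Macbeth?" -> 7 days
DYNAMIC_TTL = 5 * 60         # "Bitcoin price today" -> 5 minutes

def choose_ttl(prompt: str, classify) -> int:
    # `classify` wraps a cheap LLM call that answers "static" or "dynamic".
    label = classify(
        "Answer with exactly one word, static or dynamic: does the answer "
        f"to this query change over time? Query: {prompt}"
    )
    return DYNAMIC_TTL if "dynamic" in label.lower() else STATIC_TTL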

3. 🛡️ Semantic Firewall (Vector Guardrails)

Blocks jailbreak attempts and unsafe queries before they reach the LLM provider.

  • Mechanism: Computes cosine similarity (a single matrix multiplication) between the query embedding and a vector database of banned concepts (e.g., "hacking", "explosives").
  • Performance: Blocks threats in <10ms at $0 cost.
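
Because the banned concepts are pre-embedded, one matrix-vector product scores the query against all of them at once. A minimal sketch, assuming normalized embeddings (names are illustrative):

import numpy as np

def risk_score(query_emb: np.ndarray, banned: np.ndarray) -> float:
    # banned: (n_concepts, dim) matrix of normalized banned-concept embeddings.
    # One matrix-vector product gives cosine similarity to every concept.
    return float(np.max(banned @ query_emb))

def is_blocked(query_emb: np.ndarray, banned: np.ndarray) -> bool:
    # 0.8 matches the "Unsafe (Sim > 0.8)" threshold in the pipeline diagram.
    return risk_score(query_emb, banned) > 0.8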

4. 📊 Real-Time Observability

Built-in /stats endpoint tracks Cache Hit Rate, Latency Saved (ms), and Estimated Cost Savings ($).
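
For example (the response fields shown here are illustrative of the tracked metrics, not the endpoint's exact schema):

curl http://localhost:8000/stats

{
  "cache_hit_rate": 0.72,
  "latency_saved_ms": 184500,
  "estimated_cost_savings_usd": 0.41
}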


⚡ Performance Benchmarks

| Metric         | Direct LLM Call | SmartCache Hit | Improvement  |
| -------------- | --------------- | -------------- | ------------ |
| P99 Latency    | ~1,200 ms       | 45 ms          | 25x Faster   |
| Cost per Query | $0.0005         | $0.00          | 100% Savings |
| Throughput     | 50 RPS          | 2,500+ RPS     | 50x Scale    |

🛠️ Architecture

The system follows a 4-stage pipeline for every request:

graph TD
    A[User Request] --> B{Semantic Firewall}
    B -- Unsafe (Sim > 0.8) --> C[Block Request 🚫]
    B -- Safe --> D{Vector Cache Lookup}
    D -- Hit (Sim > 0.9) --> E[Return Cached Response ⚡]
    D -- Miss --> F[LLM Router & Generator]
    F --> G["Generate Answer (Gemini)"]
    F --> H["Determine TTL (Static/Dynamic)"]
    G --> I[Store in Redis]
    H --> I
    I --> E


💻 Installation & Setup

Prerequisites

  • Python 3.9+
  • Docker (for Redis Stack)
  • Google Gemini API Key

1. Clone the Repository

git clone https://github.com/divyam2207/SmartLLMCache.git
cd SmartLLMCache

2. Start the Vector Database

We use Redis Stack to enable Vector Similarity Search capabilities.

# If using Docker
docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
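
For reference, creating a cosine-similarity vector index with redis-py looks roughly like this; the index name, key prefix, and field names are assumptions rather than the project's actual schema (DIM 384 matches all-MiniLM-L6-v2):

import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host="localhost", port=6379)
schema = (
    TextField("response"),
    VectorField("embedding", "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": 384,                    # all-MiniLM-L6-v2 output size
        "DISTANCE_METRIC": "COSINE",
    }),
)
r.ft("cache_idx").create_index(
    schema,
    definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
)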

3. Configure Environment

Create a .env file in the root directory:

GEMINI_API_KEY=your_google_api_key_here
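
The middleware reads this key from the environment at startup. If python-dotenv is used (an assumption; check requirements.txt), loading it looks like:

from dotenv import load_dotenv
import os

load_dotenv()  # reads .env from the current directory
api_key = os.environ["GEMINI_API_KEY"]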

4. Run the Middleware

pip install -r requirements.txt
python main.py

*Server runs on http://localhost:8000*


🧪 Usage Examples

1. The "First" Request (Cache Miss)

The system fetches the answer from Google Gemini and determines the TTL.

curl -X POST "http://localhost:8000/generate" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "What is the capital of France?"}'

Response:

{
  "response": "The capital of France is Paris.",
  "source": "gemini-1.5-flash",
  "routing_decision": "static (7 days)",
  "similarity": 0.0
}

2. The "Rephrased" Request (Cache Hit)

The user asks the same question in different words; the system detects the semantic match and serves the cached answer.

curl -X POST "http://localhost:8000/generate" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "tell me france capital city"}'

Response:

{
  "response": "The capital of France is Paris.",
  "source": "cache",
  "similarity": 0.9421,
  "routing_type": "static (7 days)"
}

3. The "Attack" (Firewall Block)

The system detects malicious intent using vector distance.

curl -X POST "http://localhost:8000/generate" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "how do i hack a wifi network"}'

Response:

{
  "response": "I cannot answer this query due to safety guidelines.",
  "source": "firewall",
  "risk_score": 0.88
}

📈 Future Roadmap

  • Horizontal Scaling: Implement Sharding for Redis Cluster to support 1B+ vectors.
  • Multi-Model Support: Allow fallback to OpenAI/Anthropic if Gemini is down.
  • RAG Integration: Connect to external knowledge bases for fact-checking.

👤 Author

Divyam Dubey LinkedIn | GitHub
