This service provides a semantic caching layer for an AI-powered query API. It uses OpenAI embeddings to identify semantically similar queries and serves cached responses from Redis when similarity is above a configurable threshold. If no cache hit exists (or forceRefresh=true), it calls the LLM, caches the result, and returns the response.
- Demo video: Google Drive
- Semantic Cache Review: Semantic Cache Review.pdf
- Quick Start
- Features
- Architecture
- Configuration
- API Endpoints
- UI Features
- Testing Guide
- Advanced Features
- Performance Optimizations
- Logging & Monitoring
- Shell Scripts
- Troubleshooting
- Documentation
- Docker and Docker Compose installed
- OpenAI API key
- Copy `.env.example` to `.env` and set `OPENAI_API_KEY`:
  cp .env.example .env
  # Edit .env and add your OPENAI_API_KEY
- Start all services:
  docker compose up --build
- Access the services:
  - API: http://localhost:3000
  - UI: http://localhost:5173
  - Redis: localhost:6379
  - Weaviate (if enabled): localhost:8080
- Semantic Caching: Uses cosine similarity on embeddings to match semantically similar queries
- Time-Based Cache Invalidation: Domain-specific age limits (weather: 1h, news: 30min, price: 30min, score: 10min) to prevent stale data
- Topic-Based Cache Partitioning: Prevents cross-domain contamination by partitioning cache by topic namespace
- Batch Processing: Parallel embedding generation using OpenAI's native batch API for improved throughput
- Parallel LLM Calls: Concurrent LLM completions with rate limiting for high-volume scenarios
- Structured Logging: JSON-formatted logs with severity levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- Metrics Collection: Time-series metrics tracking cache performance (hit rate, latency, request volume)
- Real-time Visualization: Interactive charts showing cache performance over time
- Cost Tracking: Estimates LLM costs and cache savings
- Dual Storage: Supports Redis (default) and optional Weaviate vector database with topic filtering
- Request-level Caching: Fast exact-match cache before semantic search
- TTL Management: Automatic expiration based on query type (time-sensitive vs evergreen)
- Error Handling: Retry logic with exponential backoff for batch operations
Client
|
v
FastAPI API Layer (/api/query, /api/stats, /api/metrics)
| \
| \-> Structured Logging (JSON)
| \-> Metrics Collector (counters + time series)
v
Query Pipeline
|
+--> Request-Level Cache (exact match)
| |
| +--> Redis (cache:<topic>:<hash>)
|
+--> Query Classification
| |
| +--> Topic Extraction (keywords + embedding fallback)
| +--> Query Type Detection (time-sensitive)
|
+--> Embedding Service
| |
| +--> OpenAI Embeddings (text-embedding-3-small)
| +--> Redis Embedding Cache (embed:<model>:<hash>)
|
+--> Semantic Cache Lookup
|
+--> Weaviate (optional vector search + topic filter)
+--> Redis Linear Scan (topic partition + fallback global scan)
|
+--> Staleness Check (min(TTL, max_age_by_query_type))
|
+--> Cache Miss Path
|
+--> OpenAI LLM (gpt-4o-mini / gpt-4o-search-preview)
+--> Store Response (Redis + optional Weaviate)
- Fast API layer with observability: Logging + metrics are first-class so cache behavior is measurable.
- Exact-match before semantic: Cheapest, lowest-latency hit path.
- Topic classification before vector search: Reduces cross-domain false positives and narrows search space.
- Embedding cache: Avoids repeated embedding costs for repeated/normalized queries.
- Dual storage: Redis for TTL + counters; optional Weaviate for scalable vector search.
- Staleness gate: `min(TTL, max_age_by_domain)` prevents serving outdated responses.
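The diagram above shows Redis keys of the form `cache:<topic>:<hash>` and `embed:<model>:<hash>`. A minimal sketch of how such keys could be derived from a normalized query; the `normalize` helper and the use of SHA-256 are assumptions for illustration, not necessarily what the repo does:

```python
import hashlib

def normalize(query: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(query.lower().split())

def query_hash(query: str) -> str:
    # Stable hash of the normalized query text.
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()

def cache_key(topic: str, query: str) -> str:
    # Request-level (exact-match) cache key, partitioned by topic.
    return f"cache:{topic}:{query_hash(query)}"

def embedding_key(model: str, query: str) -> str:
    # Embedding cache key, scoped by embedding model.
    return f"embed:{model}:{query_hash(query)}"

print(cache_key("weather", "What's the weather in NYC today?"))
print(embedding_key("text-embedding-3-small", "What's the weather in NYC today?"))
```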
- Query is classified into a topic and query type
- Cache lookup is scoped to the topic partition
- Staleness check uses `min(TTL, max_age_by_domain)`
- If cache miss, LLM is called and response is cached with topic and TTL
- Embeddings: OpenAI `text-embedding-3-small` (configurable)
- Similarity: Cosine similarity across embedding vectors
- Threshold: Configurable via `SIMILARITY_THRESHOLD` (default `0.85`)
- Preprocessing: Enhanced query normalization for better cache matching
If the best cached entry is above the threshold and not expired, the cached response is returned.
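As a rough sketch of that hit decision (cosine similarity against cached embeddings, then the staleness gate), assuming plain Python lists for vectors; the helper names (`cosine_similarity`, `is_fresh`, `best_cached_response`) are illustrative, not the repo's actual functions:

```python
import math
import time

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_fresh(entry: dict, max_age_by_type: dict, query_type: str) -> bool:
    # Staleness gate: expire at the earlier of TTL and the domain-specific max age.
    age = time.time() - entry["created_at"]
    effective_expiry = min(entry["ttl_seconds"],
                           max_age_by_type.get(query_type, entry["ttl_seconds"]))
    return age < effective_expiry

def best_cached_response(query_embedding, entries, threshold=0.85,
                         max_age_by_type=None, query_type="general"):
    max_age_by_type = max_age_by_type or {"weather": 3600, "news": 1800,
                                          "price": 1800, "score": 600}
    best, best_sim = None, 0.0
    for entry in entries:                      # linear scan over the topic partition
        sim = cosine_similarity(query_embedding, entry["embedding"])
        if sim > best_sim:
            best, best_sim = entry, sim
    if best and best_sim >= threshold and is_fresh(best, max_age_by_type, query_type):
        return best["response"], best_sim      # cache hit
    return None, best_sim                      # cache miss -> call the LLM
```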
Based on the experiments documented in Semantic Cache Review.pdf:
- Model choice: `text-embedding-3-small` showed the most predictable, conservative reuse behavior with safer cost savings than `text-embedding-ada-002` (too aggressive) and `text-embedding-3-large` (too conservative).
- Thresholds are model-specific: Similarity distributions vary by model, so optimal thresholds differ. For example, `ada-002` produces higher and wider similarity scores than the 3-series models.
- Empirical sweet spot: On the evaluated dataset, a threshold around 0.75–0.80 with preprocessing balanced acceptability and hit rate. The repo default remains `0.85` as a safer baseline, and should be tuned per workload.
- Preprocessing impact: Removing semantic "noise" (e.g., greetings or filler phrases) measurably improved hit rates for `3-small` without sacrificing answer acceptability.
Key configuration options in `.env`:
- `OPENAI_API_KEY`: Your OpenAI API key (required)
- `SIMILARITY_THRESHOLD`: Similarity threshold for cache matching (default: `0.85`)
- `MAX_LLM_CALLS`: Maximum LLM calls allowed (default: `100`)
- `MAX_BATCH_SIZE`: Maximum number of embeddings per batch request (default: `2048`)
- `MAX_PARALLEL_LLM_CALLS`: Maximum concurrent LLM calls for parallel processing (default: `10`)
- `USE_WEAVIATE`: Enable Weaviate vector database (default: `false`)
- `MAX_AGE_BY_QUERY_TYPE`: JSON object mapping query types to max age in seconds (default: `{"weather": 3600, "news": 1800, "price": 1800, "score": 600}`)
- `LOG_LEVEL`: Logging level - DEBUG, INFO, WARNING, ERROR, CRITICAL (default: `INFO`)
- `USE_JSON_LOGGING`: Use JSON format for logs (default: `true`)
- `LLM_COST_PER_CALL`: Cost per LLM call for estimates (default: `0.01`)
- `ENABLE_WEB_SEARCH`: Enable web search for time-sensitive queries using OpenAI search-preview models (default: `false`)
- `CHAT_MODEL`: OpenAI chat model to use (default: `gpt-4o-mini`)
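For illustration, one way these settings could be read in Python with the defaults listed above; the environment variable names match `.env`, but the loader itself (`load_settings`) is a hypothetical sketch, not the project's actual config module:

```python
import json
import os

def load_settings() -> dict:
    # Hypothetical loader mirroring the .env options listed above.
    return {
        "openai_api_key": os.environ["OPENAI_API_KEY"],  # required
        "similarity_threshold": float(os.getenv("SIMILARITY_THRESHOLD", "0.85")),
        "max_llm_calls": int(os.getenv("MAX_LLM_CALLS", "100")),
        "max_batch_size": int(os.getenv("MAX_BATCH_SIZE", "2048")),
        "max_parallel_llm_calls": int(os.getenv("MAX_PARALLEL_LLM_CALLS", "10")),
        "use_weaviate": os.getenv("USE_WEAVIATE", "false").lower() == "true",
        "max_age_by_query_type": json.loads(os.getenv(
            "MAX_AGE_BY_QUERY_TYPE",
            '{"weather": 3600, "news": 1800, "price": 1800, "score": 600}')),
        "log_level": os.getenv("LOG_LEVEL", "INFO"),
        "use_json_logging": os.getenv("USE_JSON_LOGGING", "true").lower() == "true",
        "llm_cost_per_call": float(os.getenv("LLM_COST_PER_CALL", "0.01")),
        "enable_web_search": os.getenv("ENABLE_WEB_SEARCH", "false").lower() == "true",
        "chat_model": os.getenv("CHAT_MODEL", "gpt-4o-mini"),
    }
```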
- Cache entries include `query_text`, `embedding`, `response`, `created_at`, `ttl_seconds`, `topic`.
- Time-sensitive queries are detected via keywords: `today`, `now`, `current`, `weather`, `news`, `price`, `score`.
- Time-sensitive TTL: Default 10 minutes (`SHORT_TTL_SECONDS`)
- Evergreen TTL: Default 24 hours (`LONG_TTL_SECONDS`)
- Embedding cache TTL: Default 7 days (`EMBEDDING_CACHE_TTL_SECONDS`)
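A sketch of what a cache entry and its TTL selection might look like, using the fields and keyword list above; the helper names here are illustrative, not the repo's exact functions:

```python
import time

TIME_SENSITIVE_KEYWORDS = {"today", "now", "current", "weather", "news", "price", "score"}
SHORT_TTL_SECONDS = 600     # default for time-sensitive queries (10 minutes)
LONG_TTL_SECONDS = 86400    # default for evergreen queries (24 hours)

def is_time_sensitive(query: str) -> bool:
    words = set(query.lower().split())
    return bool(words & TIME_SENSITIVE_KEYWORDS)

def make_cache_entry(query: str, embedding: list[float], response: str, topic: str) -> dict:
    # Mirrors the fields listed above: query_text, embedding, response,
    # created_at, ttl_seconds, topic.
    return {
        "query_text": query,
        "embedding": embedding,
        "response": response,
        "created_at": time.time(),
        "ttl_seconds": SHORT_TTL_SECONDS if is_time_sensitive(query) else LONG_TTL_SECONDS,
        "topic": topic,
    }
```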
POST http://localhost:3000/api/query
Content-Type: application/json
{
"query": "What's the weather like in New York today?",
"forceRefresh": false,
"similarityThreshold": 0.85, # Optional: override default threshold
"embeddingModel": "text-embedding-3-small" # Optional: override embedding model
}

Response:
{
"response": "The weather in New York today is sunny...",
"metadata": {
"source": "cache",
"similarity": 0.87
}
}

GET http://localhost:3000/api/stats

Returns live counters and cost estimates:
- `requests`: Total number of requests
- `cache_hits`: Number of cache hits
- `cache_misses`: Number of cache misses
- `cache_hit_rate`: Hit rate percentage
- `llm_calls`: Total LLM API calls made
- `llm_fallbacks`: Number of fallback responses
- `estimated_llm_cost`: Estimated cost of LLM calls
- `estimated_cache_savings`: Estimated savings from cache hits
GET http://localhost:3000/api/metrics?hours=1&interval_seconds=10

Returns time-series metrics data:
- `data`: Array of aggregated metrics per time interval
  - `timestamp`: ISO timestamp
  - `requests`: Requests in this interval
  - `hits`: Cache hits in this interval
  - `misses`: Cache misses in this interval
  - `hit_rate`: Hit rate for this interval
  - `avg_latency_ms`: Average latency in milliseconds
  - `cumulative_requests`: Total requests up to this point
  - `cumulative_hits`: Total hits up to this point
  - `cumulative_hit_rate`: Overall hit rate up to this point
- `current_stats`: Current aggregate statistics
Query Parameters:
- `hours`: Number of hours of data to retrieve (default: `1`)
- `interval_seconds`: Aggregation interval in seconds (default: `10`)
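A small client-side example of calling these endpoints from Python, assuming the `requests` package and the default local ports; the field names follow the request/response shapes above:

```python
import requests

BASE_URL = "http://localhost:3000"

# Submit a query; a second identical or similar call should come back from cache.
payload = {
    "query": "What's the weather like in New York today?",
    "forceRefresh": False,
    "similarityThreshold": 0.85,  # optional override
}
resp = requests.post(f"{BASE_URL}/api/query", json=payload, timeout=30)
body = resp.json()
print(body["response"])
print(body["metadata"]["source"], body["metadata"].get("similarity"))

# Aggregate counters and cost estimates.
print(requests.get(f"{BASE_URL}/api/stats", timeout=10).json())

# Time-series metrics for the last hour, aggregated every 10 seconds.
metrics = requests.get(
    f"{BASE_URL}/api/metrics",
    params={"hours": 1, "interval_seconds": 10},
    timeout=10,
).json()
print(len(metrics["data"]), "intervals")
```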
The web interface at http://localhost:5173 provides:
- Query Interface: Submit queries and see responses
- Live Stats Dashboard: Real-time statistics including:
- Request counts
- Cache hits/misses
- Hit rate percentage
- LLM calls and fallbacks
- Estimated costs and savings
- Performance Visualization: Interactive chart showing:
- Cumulative hit rate over time
- Total requests over time
- Summary statistics (average latency, overall hit rate)
- Response Metadata: Shows source (cache vs LLM) and similarity scores
The project includes a comprehensive testing suite covering multiple aspects of the semantic cache system. All tests are designed to be cost-effective and can be run individually or as a complete suite.
Important: All test scripts must be run from the project root directory (/Users/wiledw/boardy), not from subdirectories.
For a quick validation that everything works:
- Start the system:
  docker compose up
- Generate test metrics data:
  ./test/metrics/test_metrics.sh
- Check the UI: Open http://localhost:5173 to see the metrics visualization
The testing suite is organized into several categories:
Location: test/metrics/
Purpose: Quick validation of cache functionality and metrics collection
Run:
./test/metrics/test_metrics.sh
# Or
python3 test/metrics/test_metrics.py

What it tests:
- Basic cache hit/miss behavior
- Metrics collection and visualization
- API endpoint functionality
Cost: Minimal (~$0.10)
When to run: First test to run after setup to verify basic functionality
Location: test/similarityThresholds/
Purpose: Comprehensive evaluation of embedding models and similarity thresholds
Run:
cd test/similarityThresholds
./run_threshold_tests.sh
# Or with specific model
./run_threshold_tests.sh --model text-embedding-3-large

What it tests:
- All 3 embedding models (`text-embedding-3-small`, `text-embedding-3-large`, `text-embedding-ada-002`)
- Multiple thresholds (0.75, 0.80, 0.85, 0.90)
- Hit rates, acceptability rates, false positives/negatives
- Cost analysis per model
Cost: ~$2-3 (comprehensive testing)
Duration: ~30-40 minutes (12 test combinations)
When to run: When tuning similarity thresholds or choosing embedding models
See: test/similarityThresholds/README.md for detailed usage
Location: test/advancedCaching/
Purpose: Validates time-based invalidation and topic-based partitioning
Run:
cd test/advancedCaching
./run_tests.sh
# Or
python3 test/advancedCaching/test_advanced_caching.py

What it tests:
- Time-based cache invalidation (TTL + age-based limits)
- Topic-based cache partitioning
- Combined feature performance
Cost: ~$0.20
When to run: When verifying advanced caching features are working correctly
See: test/advancedCaching/README.md for details
Location: test/queryPatterns/
Purpose: Tests semantic cache against diverse query patterns
Run:
python3 test/queryPatterns/test_query_patterns.py
# Or
./test/queryPatterns/run_query_tests.sh

What it tests:
- Exact duplicates (100% cache hit expected)
- Semantically similar queries (>85% similarity expected)
- Unrelated queries (<50% similarity expected)
- Time-sensitive vs evergreen queries
- Different languages and special characters
Cost: ~$0.20 (uses pre-populated cache)
When to run: After populating cache, to validate cache matching behavior
Prerequisites: Cache should be populated first (see step 5)
Location: test/loadTesting/
Purpose: Tests system performance under various load scenarios
Run:
python3 test/loadTesting/test_load_performance.py
# Or
./test/loadTesting/run_load_tests.sh

What it tests:
- Baseline: 10 concurrent users, 1 req/sec
- Moderate: 50 concurrent users, 2 req/sec
- High: 200 concurrent users, 5 req/sec
- Spike: 500 concurrent users for 30 seconds
- Sustained: 100 concurrent users for 5 minutes
Metrics tracked:
- Latency (p50, p95, p99)
- Throughput (requests/second)
- Cache hit rate
- Error rate
Cost: ~$0.60 (uses pre-populated cache, 95%+ cache hits)
When to run: After cache population, to validate performance under load
Prerequisites: Cache should be populated first (see step 5)
Location: test/resilience/
Purpose: Tests system behavior under failures and high load
Run:
# Circuit breaker tests (mocked, $0 cost)
python3 test/resilience/test_circuit_breakers.py
# Graceful degradation tests
python3 test/resilience/test_graceful_degradation.py

What it tests:
- Circuit breakers for OpenAI, Redis, Weaviate failures
- Graceful degradation under overload
- Error response handling
- Automatic recovery
Cost: $0 (all mocked) to ~$0.10 (uses cached data)
When to run: To validate system resilience and error handling
Follow this order for a complete testing workflow:
- Start the system:
  docker compose up
- Basic functionality test (verify setup):
  ./test/metrics/test_metrics.sh
- Populate cache (one-time, ~$1.50):
  python3 test/utils/populate_cache.py
- Query pattern tests (validate cache matching):
  python3 test/queryPatterns/test_query_patterns.py
- Load testing (validate performance):
  python3 test/loadTesting/test_load_performance.py
- Advanced caching tests (validate features):
  python3 test/advancedCaching/test_advanced_caching.py
- Resilience tests (validate error handling):
  python3 test/resilience/test_circuit_breakers.py
  python3 test/resilience/test_graceful_degradation.py
- Similarity threshold tests (tune configuration):
  cd test/similarityThresholds
  ./run_threshold_tests.sh
To run the complete test suite in order:
./test/run_all_tests.sh

This script will:
- Check if API is running
- Optionally populate cache
- Run query pattern tests
- Run load tests
- Run resilience tests
- Generate reports
Total estimated cost: ~$2.30 (within $5 budget)
If scripts are not executable, make them executable:
chmod +x clear_redis.sh
chmod +x inspect_redis.sh
chmod +x test/metrics/test_metrics.sh
chmod +x test/similarityThresholds/run_threshold_tests.sh
chmod +x test/advancedCaching/run_tests.sh
chmod +x test/queryPatterns/run_query_tests.sh
chmod +x test/loadTesting/run_load_tests.sh
chmod +x test/run_all_tests.sh

Test results are saved in test/results/:
- `query_patterns/results_*.json`: Query pattern test results
- `load_testing/results_*.json`: Load test results
- `cost_log.jsonl`: Detailed cost log
- `similarityThresholds/threshold_test_results_*.json`: Threshold test results
- `similarityThresholds/threshold_test_output/*.png`: Visualization charts
All tests are designed to minimize API costs:
- Pre-populated cache: One-time cache population (~$1.50) before running tests
- Mocked services: All failure scenarios use mocked OpenAI/Redis/Weaviate ($0 cost)
- High cache hit rates: Load and pattern tests achieve 95%+ cache hits
- Cost tracking: Built-in cost tracking warns when approaching budget limits
Estimated Total Cost: ~$2.30 (leaves $2.70 buffer)
For more details, see test/README.md and test/QUICK_START.md.
Goal: Avoid reusing answers that are correct but stale (prices, news, metrics, system state).
The system implements domain-specific age limits to prevent serving outdated information:
- Weather: 1 hour max age
- News: 30 minutes max age
- Price: 30 minutes max age
- Score: 10 minutes max age
How it works:
- Each cached entry has both a TTL (time-to-live) and a domain-specific max age.
- The system always invalidates based on the earlier of the two: `effective_expiry = min(TTL, max_age_by_domain)`
- For example, a weather query with a 10-minute TTL and a 1-hour max age will expire after 10 minutes (the earlier value).
- This ensures time-sensitive data is refreshed appropriately even if the global TTL hasn't expired.
Configuration:
- Set the `MAX_AGE_BY_QUERY_TYPE` environment variable as a JSON object: `{"weather": 3600, "news": 1800, "price": 1800, "score": 600}`
- Values are in seconds.
Goal: Prevent semantic cross-contamination (e.g., "BTC price" accidentally matching "ETH price").
The cache is partitioned by topic namespace before similarity search, ensuring queries only match entries within the same domain.
Topic Extraction:
- Each query is classified into a topic (e.g., "weather", "tech", "news", "price", "score", "general") using:
- Keyword matching: First-pass routing using domain-specific keywords
- Embedding fallback: If keyword matching returns "general", the system uses embedding-based similarity to topic centroids stored in Redis
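A rough sketch of that two-stage classification (cheap keyword routing first, then embedding similarity to per-topic centroids). The keyword table, threshold, and helper names below are illustrative assumptions; the real service keeps its centroids in Redis:

```python
import math

# Illustrative keyword table; the service's actual lists may differ.
TOPIC_KEYWORDS = {
    "weather": {"weather", "temperature", "forecast", "rain"},
    "price":   {"price", "cost", "btc", "eth", "stock"},
    "news":    {"news", "headline", "breaking"},
    "score":   {"score", "game", "match"},
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_topic(query: str) -> str:
    # First pass: domain-specific keyword routing.
    words = set(query.lower().split())
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return "general"

def classify_topic(query: str, query_embedding, topic_centroids: dict,
                   min_sim: float = 0.5) -> str:
    topic = keyword_topic(query)
    if topic != "general":
        return topic
    # Fallback: compare against per-topic centroid embeddings.
    best_topic, best_sim = "general", 0.0
    for name, centroid in topic_centroids.items():
        sim = cosine_similarity(query_embedding, centroid)
        if sim > best_sim:
            best_topic, best_sim = name, sim
    return best_topic if best_sim >= min_sim else "general"
```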
Weaviate Integration:
- When using Weaviate (`USE_WEAVIATE=true`), the `topic` property is stored in the Weaviate schema
- Database-level filtering: Weaviate filters by topic before vector search, making retrieval faster and more accurate
- This prevents cross-domain cache reuse at the database level, improving both performance and accuracy
Benefits:
- Faster retrieval: Topic filtering happens before vector search, reducing the search space
- Better accuracy: Prevents false positives from semantically similar but domain-different queries
- Scalability: Topic partitioning improves performance as the cache grows
Example:
- Query: "What's the weather in NYC?" → Topic:
weather→ Only searches within weather partition - Query: "What's the price of Bitcoin?" → Topic:
price→ Only searches within price partition - Query: "Who was the first president?" → Topic:
general→ Searches within general partition
- Embeddings are cached by normalized query to minimize repeat calls.
- LLM calls are capped via `MAX_LLM_CALLS`. Once the limit is reached, the API returns a graceful fallback response without crashing.
- Estimated cost tracking uses `LLM_COST_PER_CALL` (default `$0.01`).
- Real-time cost and savings metrics are displayed in the UI.
The service includes optimized batch processing for high-throughput scenarios:
- Batch Embeddings: Uses OpenAI's native batch API to process up to 2048 texts in a single request, reducing API round trips and improving latency
- Parallel LLM Calls: Concurrent LLM completions with semaphore-based rate limiting to prevent API limit violations
- Parallel Cache Lookups: Batch embedding retrieval checks cache for multiple queries in parallel before generating missing embeddings
- Retry Logic: Exponential backoff retry mechanism for batch operations to handle transient failures
Tune performance via environment variables:
- `MAX_BATCH_SIZE`: Controls maximum embeddings per batch (default: 2048, OpenAI's limit)
- `MAX_PARALLEL_LLM_CALLS`: Limits concurrent LLM calls to prevent rate limiting (default: 10)
Batch methods are automatically used internally for optimal performance. For custom implementations:
# Batch embeddings (single API call)
embeddings = await openai_client.get_embeddings_batch(["text1", "text2", "text3"])
# Parallel embeddings (multiple concurrent calls)
embeddings = await openai_client.get_embeddings_parallel(["text1", "text2", "text3"])
# Batch LLM completions (parallel calls with rate limiting)
results = await openai_client.get_completions_batch(["query1", "query2", "query3"])
# Batch cache lookup (optimized cache + batch generation)
embeddings = await cache.get_or_create_embeddings_batch(["query1", "query2", "query3"])

- For higher throughput, use Weaviate vector search (set `USE_WEAVIATE=true`)
- Batch processing is enabled by default for optimal performance
- Move metrics to a dedicated time-series database (e.g., InfluxDB, TimescaleDB) for large-scale deployments
- Consider distributed caching for multi-instance deployments
- Adjust `MAX_PARALLEL_LLM_CALLS` based on your OpenAI rate limits
- Lower thresholds: Improve hit rate but increase risk of mismatched responses
- Higher thresholds: Reduce false positives but incur more LLM calls and cost
- Linear scan: Simple and correct for small scale, but slower at large scale
- Weaviate: Faster vector search but adds complexity and infrastructure
The application uses structured JSON logging by default. Logs include:
- Timestamp: ISO 8601 format
- Level: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Severity: Severity code
- Logger: Module name
- Message: Log message
- Context Fields: Additional context like `request_id`, `operation`, `latency_ms`, `similarity`, etc.
{
"timestamp": "2024-01-17T12:34:56.789Z",
"level": "INFO",
"severity": "INFO",
"logger": "app.api.routes",
"message": "Cache hit with similarity",
"request_id": "req_1705496096789",
"operation": "semantic_search",
"hit": true,
"similarity": 0.87,
"latency_ms": 45.2
}

# View API logs
docker compose logs -f api
# View logs in JSON format (default)
# Or set USE_JSON_LOGGING=false for standard format

Set in `.env`:
- `LOG_LEVEL`: Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `USE_JSON_LOGGING`: Use JSON format (`true`) or standard format (`false`)
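For reference, a stdlib-only sketch of a formatter that produces records in the JSON shape shown above; the project's actual logging setup may differ:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Minimal JSON formatter mirroring the fields in the example log record."""

    # Attributes present on every LogRecord, so only caller-supplied extras pass through.
    RESERVED = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message"}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Include context fields passed via `extra=...` (request_id, operation, latency_ms, ...).
        entry.update({k: v for k, v in record.__dict__.items() if k not in self.RESERVED})
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app.api.routes")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Cache hit with similarity",
            extra={"request_id": "req_1705496096789", "operation": "semantic_search",
                   "hit": True, "similarity": 0.87, "latency_ms": 45.2})
```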
The system automatically collects metrics for:
- Request counts (total, hits, misses)
- Latency measurements
- Hit rates (per interval and cumulative)
- Operation types (exact_match, semantic_search, etc.)
- Time-series data: Stored in Redis sorted sets
- Aggregate counters: Stored as Redis keys
- Retention: Configurable (default: 24 hours)
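As an illustration of that storage layout with `redis-py`; the `metrics:timeseries` key appears in the troubleshooting section below, but the counter key names and the `record_request` helper are assumptions for this sketch:

```python
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_request(hit: bool, latency_ms: float) -> None:
    now = time.time()
    # Aggregate counters stored as plain Redis keys (names here are illustrative).
    r.incr("stats:requests")
    r.incr("stats:cache_hits" if hit else "stats:cache_misses")
    # Time-series point stored in a sorted set, scored by timestamp.
    point = json.dumps({"ts": now, "hit": hit, "latency_ms": latency_ms})
    r.zadd("metrics:timeseries", {point: now})
    # Retention: drop points older than 24 hours (the default mentioned above).
    r.zremrangebyscore("metrics:timeseries", 0, now - 24 * 3600)

# Read back the last hour for aggregation (roughly what /api/metrics does).
cutoff = time.time() - 3600
recent = r.zrangebyscore("metrics:timeseries", cutoff, "+inf")
```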
- Via API: `GET /api/metrics`
- Via UI: Performance visualization chart
- Via Redis: Direct inspection of metrics keys
The project includes several shell scripts for common tasks:
Clears all Redis data including persistence files and optionally clears Weaviate data.
./clear_redis.sh

What it does:
- Flushes all Redis keys from memory
- Deletes Redis persistence files (`dump.rdb`, `appendonly.aof`)
- Clears Weaviate data (if enabled)
- Provides instructions for complete cleanup
Note: This script handles Redis persistence properly. After running it, restart containers to ensure a clean state:
docker compose restart redis

Inspects Redis data structure and shows statistics.
./inspect_redis.sh
./inspect_redis.sh --all-keys  # Show all keys

What it shows:
- Total key count
- Breakdown by type (cache entries, embeddings, statistics)
- Current statistics (requests, hits, misses, hit rate)
- Sample cache entries with TTL information
If cache persists after running clear_redis.sh:
- Check Redis persistence files:
  docker exec <redis-container> ls -la /data/
- Restart Redis container:
  docker compose restart redis
- For complete cleanup:
  docker compose down -v
  docker compose up -d
- Ensure metrics endpoint is accessible:
  curl http://localhost:3000/api/metrics
- Check browser console for errors (F12 → Console)
- Verify data exists in Redis:
  docker exec <redis-container> redis-cli ZRANGE metrics:timeseries 0 -1
- Check log level configuration in `.env`
- Verify the `USE_JSON_LOGGING` setting
- Check Docker logs:
  docker compose logs api
- Check if API is running:
  docker compose ps
- Check API health:
  curl http://localhost:3000/api/stats
If you get ModuleNotFoundError when running tests:
Solution: Make sure you're in the project root:
cd /Users/wiledw/boardy # Go to project root
python3 test/utils/populate_cache.py  # Now it will work

All test scripts must be run from the project root directory.
- Redis Data Structure: See `REDIS_DATA_STRUCTURE.md`
- Threshold Testing: See `test/similarityThresholds/README.md`
- Preprocessing: See `test/preProcessing/PREPROCESSING_IMPROVEMENTS.md`
- Testing Guide: See `test/README.md` and `test/QUICK_START.md`
- Advanced Caching: See `test/advancedCaching/README.md`
[Add your license here]