An advanced AI-powered search engine built with Vespa, FastAPI, and Groq, featuring neural query intelligence, hybrid search (BM25 + semantic), multi-model embeddings, result diversification, and comprehensive analytics.
NeuralSearch is a production-ready, AI-powered search engine that combines lexical (BM25) and semantic (vector) search with neural query intelligence. Built with modern Python frameworks and cutting-edge AI models, it delivers intelligent search capabilities with enterprise-grade performance.
🧠 Neural Query Intelligence - Leverages Groq's high-performance LLM to automatically understand, improve, and expand user queries, making searches more intuitive and effective.
🔍 Hybrid Search Architecture - Seamlessly combines BM25 lexical search with semantic vector search, ensuring both precision and recall for optimal results.
🎯 Multi-Model Embeddings - Supports multiple state-of-the-art embedding models (MiniLM, MPNet, BGE) allowing you to choose the perfect balance between speed and accuracy.
📊 Comprehensive Analytics - Built-in analytics dashboard tracks search performance, user behavior, and query patterns to continuously improve search quality.
- Enterprise Search - Deploy as an internal search solution for documentation, knowledge bases, or content management systems
- E-commerce - Power product search with intelligent query understanding and semantic matching
- Research & Development - Build custom search solutions for academic papers, research databases, or technical documentation
- Content Discovery - Enable semantic content discovery in media platforms, blogs, or news aggregators
- Backend: FastAPI (Python) - High-performance async API framework
- Search Engine: Vespa - Production-grade open-source search engine
- AI/ML: Groq LLM, Sentence Transformers - State-of-the-art NLP models
- Frontend: Vanilla JavaScript, HTML5, CSS3 - Modern, lightweight web interface
- Analytics: SQLite - Efficient data storage and querying
- Deployment: Docker, Uvicorn - Containerized, scalable deployment
- Hybrid Search: Combines BM25 (lexical) and semantic (dense vector) search
- Query Intelligence: AI-powered query expansion, improvement, and autocomplete using Groq LLM
- Multi-Model Support: Multiple embedding models to choose from (MiniLM, MPNet, BGE)
- Result Diversification: MMR-based diversification to reduce redundancy
- Analytics Dashboard: Track searches, clicks, query performance, and user behavior
- Modern UI: Clean, responsive web interface with real-time search
- Python 3.10+
- Docker (for Vespa deployment)
- Groq API Key (for query intelligence features)
# Using pip
pip install -r requirements.txt
# Or using the project file
pip install -e .

Vespa needs to be running in Docker. The application expects Vespa at http://localhost:8080.
Important: The current main.py defines a BM25-only schema. For full functionality (semantic/fusion search), you'll need to add a text_embedding field to your Vespa schema. The application will work with BM25-only mode if you select "bm25" ranking.
Option A: Manual Vespa Setup (Recommended)
- Start Vespa Docker container:
docker run -m 8G --detach --name vespa --hostname vespa-container \
--publish 8080:8080 --publish 19071:19071 \
  vespaengine/vespa:latest

- Wait for Vespa to be ready (check health):
curl http://localhost:8080/state/v1/health

- Deploy your Vespa application package:
  - Use main.py as a reference for the package definition
  - For semantic search, add a text_embedding tensor field to the schema
  - Feed documents with embeddings to Vespa
Note: The application can run in BM25-only mode without semantic embeddings. Simply use the "bm25" ranking profile.
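Before feeding documents, it helps to wait until the health endpoint reports that Vespa is up. A minimal polling sketch using only the standard library — the `{"status": {"code": "up"}}` response shape is what Vespa's `/state/v1/health` endpoint returns; adjust if your version differs:

```python
import json
import time
import urllib.error
import urllib.request


def vespa_is_ready(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if Vespa's health endpoint reports status 'up'."""
    try:
        with urllib.request.urlopen(f"{base_url}/state/v1/health", timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
        return payload.get("status", {}).get("code") == "up"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or malformed response: not ready yet.
        return False


def wait_for_vespa(base_url: str = "http://localhost:8080",
                   retries: int = 30, delay: float = 2.0) -> bool:
    """Poll the health endpoint until Vespa is ready or retries run out."""
    for _ in range(retries):
        if vespa_is_ready(base_url):
            return True
        time.sleep(delay)
    return False
```

Run `wait_for_vespa()` after `docker run` and before deploying the application package or feeding documents.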
Create a .env file or set environment variables:
# Vespa Configuration
VESPA_URL=http://localhost
VESPA_PORT=8080
VESPA_RESULT_LIMIT=10
VESPA_MAX_RESULT_LIMIT=100
VESPA_DEFAULT_RANKING=fusion
# Embedding Configuration
VESPA_EMBEDDING_MODEL=all-MiniLM-L6-v2
VESPA_EMBEDDING_DEVICE= # Leave empty for auto, or set to 'cpu', 'cuda', 'mps'
# Query Intelligence
GROQ_API_KEY=your_groq_api_key_here
GROQ_MODEL=openai/gpt-oss-120b
ENABLE_QUERY_INTELLIGENCE=true
ENABLE_DIVERSIFICATION=true

Important: You must set the GROQ_API_KEY environment variable.
Quick Setup:
# Windows PowerShell
$env:GROQ_API_KEY = "your_groq_api_key_here"
# Or create .env file
echo GROQ_API_KEY=your_groq_api_key_here > .env

Start the server:

python start_server.py

Or using uvicorn directly:
uvicorn ui_e2e:app --reload --port 8000

The server will start at: http://localhost:8000
- Open your browser and navigate to http://localhost:8000
- Enter a search query, e.g., "python machine learning tutorial"
- Select ranking mode (fusion, semantic, or bm25)
- Click "Search"
- View results with relevance scores, snippets, and URLs
# Basic search
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "python machine learning",
"limit": 10,
"ranking": "fusion"
}'

Response:
{
"query": "python machine learning",
"hits": [
{
"id": "doc123",
"url": "https://example.com/python-ml",
"text": "Complete guide to Python machine learning...",
"snippet": "Complete guide to Python machine learning...",
"relevance": 0.8542
}
],
"returned": 10,
"total_available": 150,
"latency_ms": 45.2,
"ranking_profile": "fusion",
"embedding_model": "minilm"
}

# Search with query improvement and expansion
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "how learn python",
"limit": 5,
"use_query_improvement": true,
"use_query_expansion": true,
"ranking": "fusion"
}'

Response includes:
- query_improvement: Improved query with explanation
- expanded_queries: List of query variations used
- query_intent: Classified intent (FACTUAL, INFORMATIONAL, etc.)
curl -X POST http://localhost:8000/suggestions \
-H "Content-Type: application/json" \
-d '{
"partial_query": "python",
"num_suggestions": 5
}'

Response:
{
"suggestions": [
"python tutorial",
"python programming",
"python for beginners",
"python data science",
"python web development"
]
}

# Get analytics for last 7 days
curl http://localhost:8000/analytics?days=7
# Get analytics for specific query
curl http://localhost:8000/analytics/query/python%20tutorial
# Get top clicked documents
curl http://localhost:8000/analytics/top-documents?limit=10

curl -X POST http://localhost:8000/compare-models \
-H "Content-Type: application/json" \
-d '{
"query": "machine learning algorithms"
}'

Response:
{
"query": "machine learning algorithms",
"models": {
"minilm": {
"embedding": [0.123, -0.456, ...],
"dimension": 384,
"model_info": {
"name": "all-MiniLM-L6-v2",
"dim": 384,
"description": "Fast, lightweight (default)"
}
},
"mpnet": {
"embedding": [0.234, -0.567, ...],
"dimension": 768,
"model_info": {
"name": "all-mpnet-base-v2",
"dim": 768,
"description": "Higher quality, slower"
}
}
}
}

curl -X POST http://localhost:8000/improve-query \
-H "Content-Type: application/json" \
-d '{
"query": "how learn python fast"
}'

Response:
{
"original": "how learn python fast",
"improved": "how to learn python quickly",
"explanation": "Added 'to' for proper grammar and changed 'fast' to 'quickly' for better clarity"
}

curl -X POST http://localhost:8000/click \
-H "Content-Type: application/json" \
-d '{
"search_id": 123,
"document_id": "doc-456",
"document_url": "https://example.com/article",
"position": 1,
"relevance_score": 0.95
}'

Response:
{
"status": "success"
}

# Export as JSON
curl -X POST "http://localhost:8000/export/results?format=json" \
-H "Content-Type: application/json" \
-d '{
"query": "python tutorial",
"limit": 10
}' --output results.json
# Export as CSV
curl -X POST "http://localhost:8000/export/results?format=csv" \
-H "Content-Type: application/json" \
-d '{
"query": "python tutorial",
"limit": 10
}' --output results.csv

curl http://localhost:8000/models

Response:
{
"models": {
"minilm": {
"name": "all-MiniLM-L6-v2",
"dim": 384,
"description": "Fast, lightweight (default)"
},
"mpnet": {
"name": "all-mpnet-base-v2",
"dim": 768,
"description": "Higher quality, slower"
},
"bge": {
"name": "BAAI/bge-base-en-v1.5",
"dim": 768,
"description": "State-of-the-art quality"
}
},
"default": "minilm"
}

curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "deep learning neural networks",
"limit": 5,
"ranking": "semantic",
"embedding_model": "mpnet"
}'

curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "python web frameworks",
"limit": 10,
"use_diversification": true,
"diversification_method": "mmr",
"ranking": "fusion"
}'

import requests
# Basic search
response = requests.post(
"http://localhost:8000/search",
json={
"query": "machine learning",
"limit": 10,
"ranking": "fusion"
}
)
data = response.json()
print(f"Found {data['returned']} results")
for hit in data['hits']:
    print(f"- {hit['url']} (relevance: {hit['relevance']})")

import requests
response = requests.post(
"http://localhost:8000/search",
json={
"query": "how to use python",
"limit": 5,
"use_query_improvement": True,
"use_query_expansion": True,
"ranking": "fusion"
}
)
data = response.json()
# Check if query was improved
if "query_improvement" in data:
    print(f"Original: {data['query_improvement']['original']}")
    print(f"Improved: {data['query_improvement']['improved']}")
    print(f"Explanation: {data['query_improvement']['explanation']}")

# Check expanded queries
if "expanded_queries" in data:
    print(f"Expanded queries: {data['expanded_queries']}")

# Check query intent
if "query_intent" in data:
    print(f"Intent: {data['query_intent']['intent']}")
    print(f"Confidence: {data['query_intent']['confidence']}")

import requests
# Get overall analytics
analytics = requests.get("http://localhost:8000/analytics?days=7").json()
print(f"Total searches: {analytics['total_searches']}")
print(f"Total clicks: {analytics['total_clicks']}")
print(f"CTR: {analytics['click_through_rate']}%")
print(f"Avg latency: {analytics['avg_latency_ms']}ms")
# Get top queries
for query in analytics['top_queries'][:5]:
    print(f"- {query['query']}: {query['search_count']} searches")
# Get query-specific analytics
query_stats = requests.get(
"http://localhost:8000/analytics/query/python%20tutorial"
).json()
print(f"\nQuery 'python tutorial':")
print(f" Searches: {query_stats['total_searches']}")
print(f" Clicks: {query_stats['total_clicks']}")
print(f"  Avg latency: {query_stats['avg_latency_ms']}ms")

import requests
# Perform a search first
search_response = requests.post(
"http://localhost:8000/search",
json={"query": "python tutorial", "limit": 5}
)
search_data = search_response.json()
search_id = search_data['search_id']
# Track a click on the first result
click_response = requests.post(
"http://localhost:8000/click",
json={
"search_id": search_id,
"document_id": search_data['hits'][0]['id'],
"document_url": search_data['hits'][0]['url'],
"position": 1,
"relevance_score": search_data['hits'][0]['relevance']
}
)
print(f"Click tracked: {click_response.json()}")

import requests
response = requests.post(
"http://localhost:8000/suggestions",
json={
"partial_query": "python",
"num_suggestions": 5
}
)
suggestions = response.json()['suggestions']
print("Suggestions:")
for suggestion in suggestions:
    print(f"  - {suggestion}")

import requests
response = requests.post(
"http://localhost:8000/compare-models",
json={"query": "natural language processing"}
)
data = response.json()
print(f"Query: {data['query']}\n")
for model_key, model_data in data['models'].items():
    if 'error' not in model_data:
        print(f"{model_key}:")
        print(f"  Model: {model_data['model_info']['name']}")
        print(f"  Dimension: {model_data['dimension']}")
        print(f"  Description: {model_data['model_info']['description']}\n")

- Start the server:
  python start_server.py
- Open your browser: Navigate to http://localhost:8000
- Basic Search:
  - Type your query in the search box
  - Select ranking mode (fusion, semantic, or bm25)
  - Click "Search"
  - Results appear with relevance scores and snippets
- Advanced Features:
  - Query Improvement: Check "✨ Improve query with AI" to automatically fix grammar and improve your query
  - Query Expansion: Check "🔍 Expand query" to generate query variations
  - Diversification: Check "🎯 Diversify results" to reduce redundant results
  - Model Selection: Choose different embedding models from the dropdown
- Analytics Dashboard: Click "📊 Analytics Dashboard" to view:
  - Total searches and clicks
  - Click-through rate
  - Top queries
  - Query performance metrics
  - Popular documents
- Export Results: Click "📥 Export Results" to download search results as JSON or CSV
| Method | Endpoint | Description |
|---|---|---|
| GET | / | Web UI homepage |
| GET | /health | Health check endpoint |
| POST | /search | Perform search query |
| POST | /suggestions | Get autocomplete suggestions |
| POST | /improve-query | Improve query using AI |
| POST | /click | Track click on search result |
| GET | /analytics | Get search analytics |
| GET | /analytics/query/{query} | Get query-specific analytics |
| GET | /analytics/top-documents | Get top clicked documents |
| POST | /export/results | Export search results (JSON/CSV) |
| GET | /models | Get available embedding models |
| POST | /compare-models | Compare embeddings from different models |
NeuralSearch/
├── main.py                      # Vespa application package definition
├── start_server.py              # Server startup script
├── ui_e2e.py                    # FastAPI application with all endpoints
├── analytics.py                 # Analytics and tracking module
├── query_intelligence.py        # AI-powered query features (Groq)
├── multi_model_ensemble.py      # Multi-model embedding support
├── result_diversification.py    # Result diversification algorithms
├── pyproject.toml               # Project configuration
├── requirements.txt             # Python dependencies
├── search_analytics.db          # SQLite database for analytics
├── templates/
│   └── index_e2e.html           # Web UI template
└── static/
    ├── app_e2e.js               # Frontend JavaScript
    └── styles_e2e.css           # Stylesheet
NeuralSearch follows a layered microservices architecture with clear separation of concerns:
┌─────────────────────────────────────────────────────────────┐
│                   Frontend Layer (Web UI)                   │
│   ┌────────────┐   ┌────────────┐   ┌────────────┐          │
│   │  HTML/CSS  │   │ JavaScript │   │ Templates  │          │
│   └────────────┘   └────────────┘   └────────────┘          │
└─────────────────────────────────────────────────────────────┘
                            │ HTTP/REST
┌─────────────────────────────────────────────────────────────┐
│                     API Layer (FastAPI)                     │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ ui_e2e.py - Main Application Server                 │   │
│   │   - Request Routing & Validation                    │   │
│   │   - Response Formatting                             │   │
│   │   - Error Handling                                  │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                            │
┌─────────────────────────────────────────────────────────────┐
│            Business Logic Layer (Python Modules)            │
│   ┌────────────┐   ┌────────────┐   ┌────────────┐          │
│   │   Query    │   │   Multi-   │   │   Result   │          │
│   │Intelligence│   │   Model    │   │ Diversifi- │          │
│   │   (Groq)   │   │  Ensemble  │   │   cation   │          │
│   └────────────┘   └────────────┘   └────────────┘          │
│   ┌────────────┐   ┌────────────┐                           │
│   │ Analytics  │   │ Embedding  │                           │
│   │  (SQLite)  │   │   Models   │                           │
│   └────────────┘   └────────────┘                           │
└─────────────────────────────────────────────────────────────┘
                            │
┌─────────────────────────────────────────────────────────────┐
│                 Data Layer (Vespa + SQLite)                 │
│   ┌────────────┐   ┌────────────┐                           │
│   │   Vespa    │   │   SQLite   │                           │
│   │   Search   │   │ Analytics  │                           │
│   │   Engine   │   │  Database  │                           │
│   └────────────┘   └────────────┘                           │
└─────────────────────────────────────────────────────────────┘
Technology: FastAPI (Python 3.10+)
Key Features:
- Async Request Handling: Non-blocking I/O for concurrent requests
- Pydantic Models: Type-safe request/response validation
- Session Management: Cached Vespa connections with lazy initialization
- Error Handling: Graceful fallback to mock mode when Vespa unavailable
- CORS Support: Cross-origin resource sharing enabled
Core Functions:
# Connection Management
get_vespa_client() -> Vespa # Creates Vespa client (cached)
get_vespa_session() -> VespaSync # Returns active session (lazy-loaded)
close_vespa_session() -> None # Cleanup on shutdown
# Query Processing
run_vespa_query() -> Dict # Executes search with Vespa
_get_mock_results() -> Dict # Fallback when Vespa unavailable
_format_hit() -> Dict               # Normalizes Vespa response format

API Endpoints:
- POST /search - Main search endpoint with query intelligence
- POST /suggestions - Autocomplete suggestions
- POST /improve-query - Query improvement service
- POST /click - Click tracking for analytics
- GET /analytics - Analytics dashboard data
- GET /models - Available embedding models
- POST /compare-models - Model comparison utility
- GET /health - Health check with Vespa status
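The real `_format_hit` lives in ui_e2e.py; the sketch below shows what such normalization might look like, assuming raw hits shaped like Vespa's `{"id", "relevance", "fields": {...}}` and the `/search` response fields shown earlier. Field names outside that are illustrative:

```python
def format_hit(raw_hit: dict) -> dict:
    """Normalize a raw Vespa hit into the flat shape returned by /search.

    The exact keys in `raw_hit["fields"]` depend on your Vespa schema;
    this assumes the id/text/url fields from the README's schema.
    """
    fields = raw_hit.get("fields", {})
    text = fields.get("text", "")
    return {
        "id": fields.get("id", raw_hit.get("id", "")),
        "url": fields.get("url", ""),
        "text": text,
        "snippet": text[:200],  # simple truncation snippet for display
        "relevance": round(float(raw_hit.get("relevance", 0.0)), 4),
    }
```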
Technology: Groq LLM API (openai/gpt-oss-120b)
Architecture:
- Cached Client: @lru_cache decorator for singleton Groq client
- Prompt Engineering: Structured prompts for consistent outputs
- Error Resilience: Fallback to original query on API failures
- Response Parsing: Robust text parsing with cleanup logic
Functions:
expand_query(query, num_variations=3):
- Purpose: Generate query variations for better coverage
- Algorithm: LLM-based synonym and rephrasing generation
- Output: List of query variations including original
- Parameters:
  - temperature=0.8 (creativity)
  - max_tokens=200 (response length)
  - top_p=0.9 (nucleus sampling)
improve_query(query):
- Purpose: Fix grammar, add keywords, clarify intent
- Algorithm: LLM-based query rewriting
- Output: {original, improved, explanation}
- Parameters: temperature=0.3 (more deterministic)
get_query_suggestions(partial_query, num_suggestions=5):
- Purpose: Autocomplete suggestions
- Algorithm: Context-aware completion generation
- Output: List of complete query suggestions
classify_query_intent(query):
- Purpose: Classify search intent
- Categories: FACTUAL, NAVIGATIONAL, INFORMATIONAL, TRANSACTIONAL
- Output: {intent, confidence, reasoning}
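Since the LLM's reply arrives as free text, the classifier has to parse defensively and fall back to a safe default. A sketch under the assumption of a pipe-separated `INTENT | confidence | reasoning` reply format — the actual prompt/response contract lives in query_intelligence.py:

```python
VALID_INTENTS = {"FACTUAL", "NAVIGATIONAL", "INFORMATIONAL", "TRANSACTIONAL"}


def parse_intent_reply(reply: str) -> dict:
    """Parse an 'INTENT | confidence | reasoning' style LLM reply.

    The pipe-separated format is an assumption for illustration; malformed
    replies fall back to INFORMATIONAL with zero confidence.
    """
    parts = [p.strip() for p in reply.split("|", 2)]
    intent = parts[0].upper() if parts else ""
    if intent not in VALID_INTENTS:
        return {"intent": "INFORMATIONAL", "confidence": 0.0,
                "reasoning": "unparseable reply"}
    try:
        confidence = float(parts[1]) if len(parts) > 1 else 0.5
    except ValueError:
        confidence = 0.5
    reasoning = parts[2] if len(parts) > 2 else ""
    return {"intent": intent, "confidence": confidence, "reasoning": reasoning}
```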
Technology: Sentence Transformers
Supported Models:
| Model Key | Model Name | Dimensions | Speed | Quality |
|---|---|---|---|---|
| minilm | all-MiniLM-L6-v2 | 384 | Fast (~50ms) | Good |
| mpnet | all-mpnet-base-v2 | 768 | Moderate (~150ms) | High |
| bge | BAAI/bge-base-en-v1.5 | 768 | Moderate (~150ms) | SOTA |
Implementation Details:
- Model Caching: @lru_cache(maxsize=3) prevents reloading
- Device Support: Automatic CPU/GPU/MPS detection
- Encoding: encode_query(query, model_key) returns a normalized vector
- Ensemble Methods: Average or concatenate multiple model outputs
Functions:
encode_query(query, model_key) -> List[float] # Single model encoding
ensemble_encode(query, models, method) -> Dict # Multi-model ensemble
compare_models(query, models) -> Dict # Side-by-side comparison
get_model_info() -> Dict                      # Model metadata

Algorithm: Maximal Marginal Relevance (MMR)
MMR Formula:
MMR = argmax[λ × Sim(d, q) - (1-λ) × max(Sim(d, di))]
Where:
- λ (lambda_param): Balance between relevance (1.0) and diversity (0.0)
- Sim(d, q): Relevance score of document d to query q
- Sim(d, di): Similarity between document d and selected document di
Implementation:
- Text Similarity: Jaccard similarity (word overlap) for efficiency
- Relevance Normalization: Normalized by top result's relevance
- Greedy Selection: Iteratively selects best MMR-scored candidates
Functions:
diversify_results_mmr(hits, lambda_param, max_results) -> List
diversify_by_domain(hits, max_per_domain, max_results) -> List
combine_diversification(hits, method, lambda_param) -> List

Diversification Methods:
- MMR: Reduces redundancy based on content similarity
- Domain: Ensures no single domain dominates results
- Both: Applies domain filtering then MMR
Technology: SQLite with row factory for dict-like access
Database Schema:
searches Table:
CREATE TABLE searches (
id INTEGER PRIMARY KEY,
query TEXT NOT NULL,
query_hash TEXT NOT NULL, -- MD5 hash for normalization
ranking_profile TEXT,
limit_results INTEGER,
use_query_expansion BOOLEAN,
use_query_improvement BOOLEAN,
query_intent TEXT,
num_results INTEGER,
latency_ms REAL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
session_id TEXT
)

clicks Table:
CREATE TABLE clicks (
id INTEGER PRIMARY KEY,
search_id INTEGER, -- Foreign key to searches
document_id TEXT NOT NULL,
document_url TEXT,
position INTEGER,
relevance_score REAL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
)

query_metrics Table (Aggregated):
CREATE TABLE query_metrics (
id INTEGER PRIMARY KEY,
query_hash TEXT UNIQUE,
total_searches INTEGER DEFAULT 0,
total_clicks INTEGER DEFAULT 0,
avg_latency_ms REAL,
avg_results INTEGER,
last_searched DATETIME
)

Indexes:
- idx_searches_timestamp - Time-based queries
- idx_searches_query_hash - Query lookup
- idx_clicks_search_id - Join optimization
- idx_clicks_timestamp - Time-based analytics
Analytics Functions:
log_search(...) -> int # Logs search, returns search_id
log_click(...) -> None # Tracks user clicks
get_search_analytics(days) -> Dict # Overall analytics
get_query_performance(query) -> Dict # Query-specific metrics
get_top_clicked_documents(limit) -> List

Query Normalization:
- Uses MD5 hash of lowercased, stripped query
- Enables case-insensitive query grouping
- Prevents duplicate metrics for similar queries
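The normalization step is small enough to show whole; this mirrors the description above (MD5 of the lowercased, stripped query):

```python
import hashlib


def query_hash(query: str) -> str:
    """MD5 of the lowercased, stripped query.

    Groups 'Python Tutorial' and '  python tutorial ' under one
    query_metrics row; MD5 is used for grouping, not security.
    """
    normalized = query.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```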
Vespa Application Package:
Schema Definition:
Schema(
name="doc",
document=Document(
fields=[
Field(name="id", type="string", indexing=["summary"]),
Field(name="text", type="string",
indexing=["index", "summary"], index="enable-bm25"),
Field(name="url", type="string",
indexing=["index", "summary"], index="enable-bm25"),
]
)
)

Ranking Profiles:
- bm25_text_only: bm25(text)
- bm25_url_only: bm25(url)
- bm25: bm25(text) + 0.1 × bm25(url) (default k1=1.2, b=0.75)
- bm25_comb_tuned: Same expression with tuned parameters:
  - Text: k1=1.8, b=0.40
  - URL: k1=0.9, b=0.30
BM25 Formula:
BM25(q, d) = Σ IDF(qi) × (f(qi, d) × (k1 + 1)) / (f(qi, d) + k1 × (1 - b + b × |d|/avgdl))
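To build intuition for the k1 and b parameters, a single term's contribution can be computed directly. A sketch using the common Lucene-style IDF — Vespa's exact IDF variant may differ slightly:

```python
import math


def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    n_docs: int, df: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """One term's BM25 contribution, matching the formula above.

    tf: term frequency in the document; df: number of documents
    containing the term; n_docs: collection size.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf
```

Lowering b (as in the tuned text profile, b=0.40) weakens the length penalty, so long documents are punished less than with the default b=0.75.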
Query Construction:
query_body = {
"yql": "select * from doc where userQuery()",
"hits": limit,
"query": query_text,
"ranking": {"profile": ranking_profile},
"input.query(q)": query_embedding, # For semantic search
"ranking.features.query(q)": query_embedding
}

Connection Management:
- Lazy Loading: Session created on first query
- Connection Test: 2-second socket timeout before attempting connection
- Caching: Global _vespa_session variable for reuse
- Error Handling: Falls back to mock mode on connection failure
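The lazy-loading pattern described above can be sketched generically. `factory` below stands in for the real pyvespa session constructor and is an assumption for illustration; only the TCP pre-check and caching logic are shown:

```python
import socket

_session = None  # module-level cache, mirroring the global described above


def vespa_reachable(host: str = "localhost", port: int = 8080,
                    timeout: float = 2.0) -> bool:
    """Cheap TCP pre-check before attempting a real pyvespa connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def get_session(factory, host: str = "localhost", port: int = 8080):
    """Lazily create and cache a session; None signals mock mode."""
    global _session
    if _session is None and vespa_reachable(host, port):
        _session = factory()  # e.g. a VespaSync(app) call in the real code
    return _session
```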
1. User Query
   ↓
2. FastAPI Endpoint (/search)
   ├── Request Validation (Pydantic)
   └── Query Intelligence (if enabled)
       ├── Query Improvement (Groq)
       ├── Query Expansion (Groq)
       └── Intent Classification (Groq)
   ↓
3. Embedding Generation (if semantic/fusion)
   ├── Model Selection (minilm/mpnet/bge)
   ├── Sentence Transformer Encoding
   └── Vector Normalization
   ↓
4. Vespa Query Construction
   ├── YQL Query Building
   ├── Ranking Profile Selection
   ├── Embedding Injection (if needed)
   └── Connection Check (2s timeout)
   ↓
5. Vespa Execution
   ├── BM25 Scoring (lexical)
   ├── Vector Similarity (semantic, if enabled)
   └── Result Ranking
   ↓
6. Result Processing
   ├── Hit Formatting
   ├── Diversification (MMR, if enabled)
   └── Limit Application
   ↓
7. Analytics Logging
   ├── Search Log Entry
   ├── Query Metrics Update
   └── Popular Queries Update
   ↓
8. Response Formatting
   ├── JSON Serialization
   ├── Metadata Addition
   └── HTTP Response
1. User Click Event
   ↓
2. Frontend JavaScript
   ├── Extract: search_id, document_id, position, relevance
   └── POST /click
   ↓
3. Backend Processing
   ├── Click Log Entry
   ├── Query Metrics Update (total_clicks++)
   └── Popular Queries Update
   ↓
4. Analytics Aggregation
   └── Real-time CTR calculation
Purpose: Lexical term-based ranking
Parameters:
- k1: Term frequency saturation (default: 1.2, tuned: 1.8 for text, 0.9 for URL)
- b: Length normalization (default: 0.75, tuned: 0.40 for text, 0.30 for URL)
Implementation: Native Vespa BM25 scoring
Purpose: Meaning-based similarity matching
Process:
- Query β Embedding (Sentence Transformer)
- Embedding β Vespa vector search
- Cosine similarity calculation
- Result ranking by similarity score
Models: MiniLM (384d), MPNet (768d), BGE (768d)
Purpose: Combine lexical and semantic signals
Scoring:
final_score = α × BM25_score + β × semantic_score

Where α and β are tunable weights (typically α=0.5, β=0.5).
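The blend itself is a one-liner; the important caveat is that the two scores must be on comparable scales (e.g. both normalized to [0, 1]) before blending, or one signal dominates:

```python
def fusion_score(bm25: float, semantic: float,
                 alpha: float = 0.5, beta: float = 0.5) -> float:
    """Weighted blend of lexical and semantic signals, as in the formula above.

    Assumes both inputs are already normalized to comparable ranges.
    """
    return alpha * bm25 + beta * semantic
```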
Purpose: Reduce result redundancy
Algorithm:
selected = [top_result]          # Always include best result
remaining = sorted_results[1:]
while len(selected) < max_results:
    best_candidate = None
    best_mmr = -inf
    for candidate in remaining:
        relevance = normalize(candidate.relevance)
        max_sim = max(similarity(candidate, s) for s in selected)
        mmr = λ × relevance - (1-λ) × max_sim
        if mmr > best_mmr:
            best_mmr = mmr
            best_candidate = candidate
    selected.append(best_candidate)
    remaining.remove(best_candidate)

Parameters:
- λ (lambda_param): Default 0.6 (60% relevance, 40% diversity)
- Similarity Metric: Jaccard similarity (word overlap)
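The pseudocode above translates into a short runnable version, using the same Jaccard word-overlap proxy for Sim(d, di). The hit shape `{'text', 'relevance'}` is assumed for illustration:

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity used as the cheap Sim(d, di) proxy."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def mmr_diversify(hits: list[dict], lambda_param: float = 0.6,
                  max_results: int = 10) -> list[dict]:
    """Greedy MMR selection over hits shaped like {'text': ..., 'relevance': ...}."""
    if not hits:
        return []
    ranked = sorted(hits, key=lambda h: h["relevance"], reverse=True)
    top = ranked[0]["relevance"] or 1.0  # normalize by the best score
    selected, remaining = [ranked[0]], ranked[1:]
    while remaining and len(selected) < max_results:
        def mmr(h):
            rel = h["relevance"] / top
            max_sim = max(jaccard(h["text"], s["text"]) for s in selected)
            return lambda_param * rel - (1 - lambda_param) * max_sim
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With λ=0.5, a near-duplicate of the top hit scores below a less relevant but novel result, which is exactly the redundancy reduction the section describes.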
| Operation | Typical Latency | Notes |
|---|---|---|
| Mock Mode | < 1ms | No external dependencies |
| BM25 Search | 10-50ms | Vespa only, no embeddings |
| Embedding Generation | 30-150ms | Depends on model (MiniLM fastest) |
| Semantic Search | 50-200ms | Embedding + Vespa vector search |
| Fusion Search | 50-250ms | Both BM25 and semantic |
| Query Intelligence | 200-800ms | Groq API call (network dependent) |
| Analytics Logging | 1-5ms | SQLite write operation |
- Concurrent Requests: Limited by Vespa and embedding model capacity
- Recommended: 10-50 requests/second (depending on ranking mode)
- Bottlenecks:
- Embedding generation (CPU-bound, single-threaded)
- Vespa query processing (multi-threaded, scales with cores)
- Groq API rate limits (external dependency)
Memory:
- Base application: ~200MB
- Each embedding model: ~100-400MB (cached after first load)
- Vespa container: ~2-4GB
- SQLite database: Grows with usage (~1MB per 1000 searches)
CPU:
- Embedding generation: Moderate (single-threaded per request)
- Vespa: Multi-threaded, benefits from multiple cores
- Query intelligence: Network I/O bound
Disk:
- Models: ~500MB-2GB per model (downloaded once, cached)
- Analytics DB: Grows linearly with usage
- API Keys: Environment variables (not hardcoded)
- Input Validation: Pydantic models validate all inputs
- SQL Injection: Parameterized queries in SQLite
- CORS: Enabled for development (configure for production)
- Authentication: Add API key or OAuth for protected endpoints
- Rate Limiting: Implement request throttling (e.g., slowapi)
- HTTPS: Use TLS/SSL certificates
- Secrets Management: Use secure secret management services
- Database Encryption: Encrypt analytics database at rest
- Error Handling: Avoid exposing internal errors to clients
- Logging: Structured logging without sensitive data
┌─────────────┐
│   Browser   │
└──────┬──────┘
       │ HTTP
┌──────▼───────────────────┐
│  FastAPI (localhost)     │
│  Port: 8000              │
└──────┬───────────────────┘
       │
       ├── Vespa (localhost:8080)  [Optional]
       ├── Groq API (cloud)
       └── SQLite (local file)
services:
  vespa:
    image: vespaengine/vespa:latest
    ports: ["8080:8080", "19071:19071"]
    mem_limit: 8g

  neuralsearch-api:
    build: .
    ports: ["8000:8000"]
    environment:
      - GROQ_API_KEY=${GROQ_API_KEY}
      - VESPA_URL=http://vespa
      - VESPA_PORT=8080
    depends_on: [vespa]

Horizontal Scaling:
- FastAPI: Stateless, can run multiple instances behind load balancer
- Vespa: Supports clustering for distributed search
- Embedding Models: Can be offloaded to separate service (e.g., TensorFlow Serving)
Vertical Scaling:
- Increase Vespa memory for larger indexes
- GPU acceleration for embedding generation
- Faster CPUs for BM25 processing
NeuralSearch is built on a microservices architecture with the following components:
- FastAPI Backend (ui_e2e.py): RESTful API server handling all search requests
- Vespa Search Engine: Distributed search engine for indexing and retrieval
- Sentence Transformers: Local embedding generation for semantic search
- Groq LLM API: Cloud-based query intelligence (expansion, improvement, suggestions)
- SQLite Analytics: Local database for tracking searches, clicks, and performance
- Modern Web UI: Vanilla JavaScript frontend with real-time search
User Query
    ↓
[Query Intelligence] → Query Improvement/Expansion/Intent Classification
    ↓
[Embedding Generation] → Multi-model semantic embeddings (optional)
    ↓
[Vespa Query] → Hybrid search (BM25 + semantic or BM25-only)
    ↓
[Result Diversification] → MMR-based deduplication (optional)
    ↓
[Analytics Logging] → Track search, clicks, performance
    ↓
Formatted Results
Fusion Mode combines:
- BM25 (Lexical): Term frequency-based scoring on text and url fields
- Semantic (Vector): Cosine similarity between query embedding and document embeddings
- Combined Score: relevance = α × BM25_score + β × semantic_score
BM25 Formula:
BM25(q, d) = Σ IDF(qi) × (f(qi, d) × (k1 + 1)) / (f(qi, d) + k1 × (1 - b + b × |d|/avgdl))
Where:
- k1: Term frequency saturation parameter (default: 1.2, tuned: 1.8 for text, 0.9 for URL)
- b: Length normalization parameter (default: 0.75, tuned: 0.40 for text, 0.30 for URL)
- f(qi, d): Frequency of term qi in document d
- IDF(qi): Inverse document frequency of term qi
Maximal Marginal Relevance (MMR) algorithm:
MMR = argmax[λ × Sim(d, q) - (1-λ) × max(Sim(d, di))]
Where:
- λ (lambda_param): Balance between relevance and diversity (default: 0.6)
- Sim(d, q): Relevance score of document d to query q
- Sim(d, di): Similarity between document d and already selected document di
When Vespa is unavailable, the system automatically switches to mock mode:
- Generates realistic search results with working URLs (Wikipedia, Stack Overflow, GitHub, etc.)
- Query-specific URLs that redirect to real search engines
- Maintains API compatibility for frontend testing
- Visual indicators in UI to distinguish mock results
- Fast response times (< 1ms) for development/testing
Current Schema (defined in main.py):
The application uses a document schema with the following structure:
Schema(
name="doc",
document=Document(
fields=[
Field(name="id", type="string", indexing=["summary"]),
Field(name="text", type="string", indexing=["index", "summary"], index="enable-bm25"),
Field(name="url", type="string", indexing=["index", "summary"], index="enable-bm25"),
]
),
fieldsets=[FieldSet(name="default", fields=["text", "url"])],
rank_profiles=[...] # 4 BM25 profiles defined
)

Field Descriptions:
- id: Document identifier (summary only, not indexed for search)
- text: Main content field (indexed with BM25, used for lexical search)
- url: URL field (indexed with BM25, weighted 0.1x in combined profiles)
For Semantic/Fusion Search (requires schema modification):
To enable semantic search, add an embedding field to the schema:
Field(
name="text_embedding",
type="tensor<float>(x[384])", # 384 for MiniLM, 768 for MPNet/BGE
indexing=["attribute", "index"],
attribute=["distance-metric: euclidean"]
)

Then update rank profiles to include semantic scoring:
RankProfile(
name="fusion",
functions=[
Function(name="bm25score", expression="bm25(text) + 0.1 * bm25(url)"),
Function(name="semanticscore", expression="closeness(field, text_embedding)")
],
first_phase="bm25score + 0.7 * semanticscore"
)

Analytics Database (search_analytics.db):
-- Searches table
CREATE TABLE searches (
id INTEGER PRIMARY KEY AUTOINCREMENT,
query TEXT NOT NULL,
query_hash TEXT NOT NULL,
ranking_profile TEXT,
limit_results INTEGER,
use_query_expansion BOOLEAN,
use_query_improvement BOOLEAN,
query_intent TEXT,
num_results INTEGER,
latency_ms REAL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
session_id TEXT
);
-- Clicks table
CREATE TABLE clicks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
search_id INTEGER,
document_id TEXT NOT NULL,
document_url TEXT,
position INTEGER,
relevance_score REAL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);

The system supports multiple ranking profiles defined in main.py:
Application-Level Profiles:
- fusion: Hybrid search combining BM25 + semantic (default, requires embeddings)
- semantic: Dense vector search only (requires embeddings)
- bm25: Lexical search only (works with current schema)
Vespa Schema Profiles (defined in main.py):
- `bm25_text_only`: BM25 scoring on the `text` field only
  - Expression: `bm25(text)`
- `bm25_url_only`: BM25 scoring on the `url` field only
  - Expression: `bm25(url)`
- `bm25`: Combined BM25 with default parameters
  - Expression: `bm25(text) + 0.1 * bm25(url)`
  - Default k1=1.2, b=0.75
- `bm25_comb_tuned`: Combined BM25 with tuned parameters
  - Expression: `bm25(text) + 0.1 * bm25(url)`
  - Text field: k1=1.8, b=0.40
  - URL field: k1=0.9, b=0.30
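The effect of the tuned k1 and b values follows from the standard BM25 term-scoring formula. A small illustrative implementation — the textbook formula, not the code Vespa actually runs:

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 score contribution of a single term.

    tf: term frequency in the document; df: number of documents containing
    the term; n_docs: corpus size; doc_len/avg_doc_len: length normalization.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    saturation = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * saturation

# Higher k1 lets repeated terms keep adding score; lower b reduces the
# penalty on long documents -- the direction of the tuned text-field values.
default = bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=300, avg_doc_len=200)
tuned = bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=300, avg_doc_len=200,
                        k1=1.8, b=0.40)
print(tuned > default)  # -> True (long doc penalized less, repeats count more)
```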
Three pre-trained models are supported via Sentence Transformers:
- `minilm` (`all-MiniLM-L6-v2`):
  - Dimensions: 384
  - Speed: Fast (~50ms per query)
  - Quality: Good for general purpose
  - Use case: Default, production workloads
- `mpnet` (`all-mpnet-base-v2`):
  - Dimensions: 768
  - Speed: Moderate (~150ms per query)
  - Quality: Higher accuracy
  - Use case: Quality-critical applications
- `bge` (`BAAI/bge-base-en-v1.5`):
  - Dimensions: 768
  - Speed: Moderate (~150ms per query)
  - Quality: State-of-the-art
  - Use case: Research, high-accuracy requirements
Powered by Groq LLM (`openai/gpt-oss-120b`):

- Query Improvement:
  - Fixes grammar errors
  - Adds missing keywords
  - Clarifies ambiguous intent
  - Returns improved query with explanation
- Query Expansion:
  - Generates 3-5 alternative query variations
  - Uses synonyms and related terms
  - Maintains original intent
  - Used for multi-query retrieval
- Query Suggestions (Autocomplete):
  - Generates 5-10 completion suggestions
  - Context-aware completions
  - Popular query patterns
- Intent Classification:
  - FACTUAL: Seeking specific facts or data
  - NAVIGATIONAL: Looking for a specific website/page
  - INFORMATIONAL: Learning about a topic
  - TRANSACTIONAL: Wanting to perform an action
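Since the classifier returns one of four labels plus a confidence, a small validation step guards against malformed LLM output. A sketch of that parsing — the JSON reply shape is an assumption about how `main.py` prompts the model:

```python
import json

VALID_INTENTS = {"FACTUAL", "NAVIGATIONAL", "INFORMATIONAL", "TRANSACTIONAL"}

def parse_intent(raw: str) -> dict:
    """Parse and validate an intent-classification reply from the LLM.

    Falls back to INFORMATIONAL with zero confidence on malformed output,
    so a flaky LLM response never breaks the search request.
    """
    try:
        data = json.loads(raw)
        intent = str(data.get("intent", "")).upper()
        confidence = float(data.get("confidence", 0.0))
    except (ValueError, TypeError, AttributeError):
        return {"intent": "INFORMATIONAL", "confidence": 0.0}
    if intent not in VALID_INTENTS or not 0.0 <= confidence <= 1.0:
        return {"intent": "INFORMATIONAL", "confidence": 0.0}
    return {"intent": intent, "confidence": confidence}

print(parse_intent('{"intent": "FACTUAL", "confidence": 0.92}'))
```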
Core Dependencies (from `requirements.txt`):

- `fastapi>=0.111.0`: Web framework for building APIs
- `uvicorn>=0.30.0`: ASGI server for FastAPI
- `groq>=0.4.0`: Groq LLM API client
- `sentence-transformers>=5.1.2`: Embedding model library
- `pyvespa>=0.62.0`: Vespa Python client
- `datasets>=4.4.1`: Dataset loading utilities
- `pandas>=2.3.3`: Data manipulation (for analytics)
System Requirements:
- Python 3.10+
- Docker (for Vespa)
- 4GB+ RAM (8GB recommended for Vespa)
- Internet connection (for Groq API and model downloads)
```bash
# Check if Vespa is running
curl http://localhost:8080/state/v1/health

# Check Vespa container
docker ps | grep vespa

# Reinstall dependencies
pip install -r requirements.txt --upgrade
```

The analytics database (`search_analytics.db`) is created automatically. If you encounter issues, delete it and restart the server.
```json
{
  "query": "string (required)",
  "limit": "integer (optional, default: 10, max: 100)",
  "ranking": "string (optional: 'fusion'|'semantic'|'bm25', default: 'fusion')",
  "use_query_expansion": "boolean (optional, default: false)",
  "use_query_improvement": "boolean (optional, default: false)",
  "use_diversification": "boolean (optional, default: false)",
  "diversification_method": "string (optional: 'mmr', default: 'mmr')",
  "embedding_model": "string (optional: 'minilm'|'mpnet'|'bge', default: 'minilm')",
  "session_id": "string (optional, for analytics tracking)"
}
```

Response:

```json
{
  "query": "string",
  "hits": [
    {
      "id": "string",
      "document_id": "string",
      "vespa_document_id": "string",
      "url": "string",
      "text": "string",
      "snippet": "string",
      "relevance": "float (0.0-1.0)",
      "fields": {}
    }
  ],
  "returned": "integer",
  "limit": "integer",
  "total_available": "integer",
  "latency_ms": "float",
  "coverage": {},
  "ranking_profile": "string",
  "embedding_model": "string",
  "search_id": "integer (for click tracking)",
  "query_improvement": {
    "original": "string",
    "improved": "string",
    "explanation": "string"
  },
  "expanded_queries": ["string"],
  "query_intent": {
    "intent": "FACTUAL|NAVIGATIONAL|INFORMATIONAL|TRANSACTIONAL",
    "confidence": "float"
  },
  "mock_mode": "boolean (true if Vespa unavailable)"
}
```

400 Bad Request:

```json
{
  "detail": "Error message describing the issue"
}
```

503 Service Unavailable (Vespa connection):

- Automatically falls back to mock mode
- Response includes `"mock_mode": true`
- Mock Mode: < 1ms (no external dependencies)
- BM25 Search: 10-50ms (Vespa only)
- Semantic Search: 50-200ms (embedding generation + Vespa)
- Fusion Search: 50-250ms (embedding + hybrid retrieval)
- Query Intelligence: 200-800ms (Groq API call)
- Concurrent Requests: Limited by Vespa and embedding model capacity
- Recommended: 10-50 requests/second (depending on ranking mode)
- Bottlenecks:
- Embedding generation (CPU-bound)
- Vespa query processing
- Groq API rate limits
- Memory:
- Base application: ~200MB
- Each embedding model: ~100-400MB
- Vespa container: ~2-4GB
- CPU:
- Embedding generation: Moderate (single-threaded)
- Vespa: Multi-threaded, scales with cores
- Disk:
- Models: ~500MB-2GB per model (cached after first download)
- Analytics DB: Grows with usage (~1MB per 1000 searches)
- Mock Mode: The application automatically works in mock mode when Vespa is not running, returning demo results with working URLs
- BM25 Mode: Works immediately with the current schema in `main.py` (no embeddings required)
- Semantic/Fusion Mode: Requires a `text_embedding` tensor field in the Vespa schema (not included in the current `main.py`)
- Model Downloads: First-time embedding model downloads may take 2-5 minutes (models are cached locally)
- Groq API Key: Must set the `GROQ_API_KEY` environment variable (raises an error if missing)
- Analytics: Data stored in an SQLite database (`search_analytics.db`), created automatically on first run
- Connection Handling: Vespa connection is lazy-loaded with a 2-second timeout to prevent hangs
- Caching: Embedding models and the Groq client are cached using `@lru_cache` for performance
- Thread Safety: SQLite connection uses `check_same_thread=False` for FastAPI compatibility
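The searches/clicks tables support simple quality metrics such as click-through rate. A minimal `sqlite3` sketch against a trimmed in-memory copy of the schema shown earlier — the actual dashboard queries in `main.py` may differ:

```python
import sqlite3

# Trimmed in-memory copy of the two analytics tables documented above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE searches (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query TEXT NOT NULL,
    num_results INTEGER,
    latency_ms REAL
);
CREATE TABLE clicks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    search_id INTEGER,
    document_id TEXT NOT NULL,
    position INTEGER
);
""")
conn.execute("INSERT INTO searches (query, num_results, latency_ms) VALUES ('vespa', 10, 35.0)")
conn.execute("INSERT INTO searches (query, num_results, latency_ms) VALUES ('bm25', 10, 28.0)")
conn.execute("INSERT INTO clicks (search_id, document_id, position) VALUES (1, 'doc-1', 1)")

# Click-through rate: fraction of searches that received at least one click.
ctr = conn.execute("""
SELECT CAST(COUNT(DISTINCT c.search_id) AS REAL) / COUNT(DISTINCT s.id)
FROM searches s LEFT JOIN clicks c ON c.search_id = s.id
""").fetchone()[0]
print(f"CTR: {ctr:.2f}")  # one of two searches clicked -> CTR: 0.50
```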
Check if the server is running:

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "healthy",
  "vespa_available": false,
  "version": "0.3.0"
}
```

- API Keys: Stored in environment variables (not hardcoded)
- Input Validation: Pydantic models validate all API inputs
- SQL Injection: Protected via parameterized queries in SQLite
- CORS: Enabled for development (configure for production)
- Authentication: Add API key or OAuth for protected endpoints
- Rate Limiting: Implement request throttling (e.g., using `slowapi`)
- HTTPS: Use TLS/SSL certificates for encrypted connections
- Input Sanitization: Additional validation for user queries
- Secrets Management: Use secure secret management (e.g., AWS Secrets Manager, HashiCorp Vault)
- Database Security: Encrypt analytics database, implement access controls
- Error Handling: Avoid exposing internal errors to clients
- Logging: Implement structured logging without sensitive data
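To make the rate-limiting recommendation concrete, here is a generic pure-Python token bucket illustrating the idea behind request throttling — in production you would wire `slowapi` (or similar middleware) into FastAPI instead of rolling your own:

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle: allow roughly `rate` requests per second
    with bursts up to `capacity`. Illustrative only; use middleware such as
    slowapi for a production FastAPI deployment."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=10.0, capacity=5)  # ~10 req/s, burst of 5
burst = [bucket.allow() for _ in range(6)]
print(burst)  # the 6th immediate request is rejected
```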
```bash
# Test health endpoint
curl http://localhost:8000/health

# Test search endpoint
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "limit": 5}'

# Test query intelligence
curl -X POST http://localhost:8000/improve-query \
  -H "Content-Type: application/json" \
  -d '{"query": "how learn python"}'
```

The application supports testing in mock mode without Vespa:
- All endpoints work with mock data
- Query intelligence requires Groq API key
- Analytics database is created automatically
- Build Docker image:

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "ui_e2e:app", "--host", "0.0.0.0", "--port", "8000"]
```

- Docker Compose (with Vespa):
```yaml
version: '3.8'
services:
  vespa:
    image: vespaengine/vespa:latest
    ports:
      - "8080:8080"
      - "19071:19071"
    mem_limit: 8g
  search-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - GROQ_API_KEY=${GROQ_API_KEY}
      - VESPA_URL=http://vespa
      - VESPA_PORT=8080
    depends_on:
      - vespa
```

- AWS: Deploy on EC2, ECS, or Lambda (with container support)
- Google Cloud: Cloud Run or Compute Engine
- Azure: Container Instances or App Service
- Heroku: Use buildpacks or container registry
Environment Variables (set in cloud platform):
- `GROQ_API_KEY`: Required for query intelligence
- `VESPA_URL`: Vespa endpoint URL
- `VESPA_PORT`: Vespa port (default: 8080)
- `VESPA_EMBEDDING_MODEL`: Embedding model selection
- `ENABLE_QUERY_INTELLIGENCE`: Enable/disable AI features
- `ENABLE_DIVERSIFICATION`: Enable/disable result diversification
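Reading these variables with sensible defaults can be centralized in one place. A small sketch — the `VESPA_URL` default, the flag defaults, and the truthy-string convention are assumptions, not taken from `main.py`:

```python
import os

def _flag(name: str, default: bool) -> bool:
    """Interpret common truthy strings; this convention is an assumption."""
    return os.environ.get(name, str(default)).strip().lower() in {"1", "true", "yes", "on"}

def load_config() -> dict:
    """Collect the documented environment variables with illustrative defaults."""
    return {
        "groq_api_key": os.environ.get("GROQ_API_KEY"),  # required for query intelligence
        "vespa_url": os.environ.get("VESPA_URL", "http://localhost"),  # default assumed
        "vespa_port": int(os.environ.get("VESPA_PORT", "8080")),  # documented default
        "embedding_model": os.environ.get("VESPA_EMBEDDING_MODEL", "minilm"),
        "query_intelligence": _flag("ENABLE_QUERY_INTELLIGENCE", True),
        "diversification": _flag("ENABLE_DIVERSIFICATION", False),
    }

os.environ["VESPA_PORT"] = "8081"
print(load_config()["vespa_port"])  # -> 8081
```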
This project is provided as-is for educational and development purposes.
Contributions are welcome! Areas for improvement:
- Additional embedding models
- More ranking profiles
- Enhanced analytics dashboard
- Performance optimizations
- Test coverage
- Documentation improvements