An advanced AI-powered search engine built with Vespa, FastAPI, and Groq, featuring neural query intelligence, hybrid search (BM25 + semantic), multi-model embeddings, result diversification, and comprehensive analytics.
NeuralSearch is a production-ready, AI-powered search engine that combines lexical (BM25) and semantic (vector) search with neural query intelligence. Built with modern Python frameworks and cutting-edge AI models, it delivers intelligent search capabilities with enterprise-grade performance.
🧠 Neural Query Intelligence - Leverages Groq's high-performance LLM to automatically understand, improve, and expand user queries, making searches more intuitive and effective.
🔍 Hybrid Search Architecture - Seamlessly combines BM25 lexical search with semantic vector search, ensuring both precision and recall for optimal results.
🎯 Multi-Model Embeddings - Supports multiple state-of-the-art embedding models (MiniLM, MPNet, BGE) allowing you to choose the perfect balance between speed and accuracy.
📊 Comprehensive Analytics - Built-in analytics dashboard tracks search performance, user behavior, and query patterns to continuously improve search quality.
- Enterprise Search - Deploy as an internal search solution for documentation, knowledge bases, or content management systems
- E-commerce - Power product search with intelligent query understanding and semantic matching
- Research & Development - Build custom search solutions for academic papers, research databases, or technical documentation
- Content Discovery - Enable semantic content discovery in media platforms, blogs, or news aggregators
- Backend: FastAPI (Python) - High-performance async API framework
- Search Engine: Vespa - Production-grade open-source search engine
- AI/ML: Groq LLM, Sentence Transformers - State-of-the-art NLP models
- Frontend: Vanilla JavaScript, HTML5, CSS3 - Modern, lightweight web interface
- Analytics: SQLite - Efficient data storage and querying
- Deployment: Docker, Uvicorn - Containerized, scalable deployment
- Hybrid Search: Combines BM25 (lexical) and semantic (dense vector) search
- Query Intelligence: AI-powered query expansion, improvement, and autocomplete using Groq LLM
- Multi-Model Support: Multiple embedding models to choose from (MiniLM, MPNet, BGE)
- Result Diversification: MMR-based diversification to reduce redundancy
- Analytics Dashboard: Track searches, clicks, query performance, and user behavior
- Modern UI: Clean, responsive web interface with real-time search
- Python 3.10+
- Docker (for Vespa deployment)
- Groq API Key (for query intelligence features)
# Using pip
pip install -r requirements.txt
# Or using the project file
pip install -e .

Vespa needs to be running in Docker. The application expects Vespa at http://localhost:8080.
Important: The current main.py defines a BM25-only schema. For full functionality (semantic/fusion search), you'll need to add a text_embedding field to your Vespa schema. The application will work with BM25-only mode if you select "bm25" ranking.
Option A: Manual Vespa Setup (Recommended)
- Start Vespa Docker container:
docker run -m 8G --detach --name vespa --hostname vespa-container \
--publish 8080:8080 --publish 19071:19071 \
  vespaengine/vespa:latest

- Wait for Vespa to be ready (check health):
curl http://localhost:8080/state/v1/health

- Deploy your Vespa application package:
  - Use main.py as a reference for the package definition
  - For semantic search, add a text_embedding tensor field to the schema
  - Feed documents with embeddings to Vespa
Note: The application can run in BM25-only mode without semantic embeddings. Simply use the "bm25" ranking profile.
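Before feeding documents, it helps to wait until the health endpoint reports that Vespa is up. A minimal polling sketch using only the standard library — the `{"status": {"code": "up"}}` response shape is what Vespa's `/state/v1/health` endpoint returns; adjust if your version differs:

```python
import json
import time
import urllib.error
import urllib.request


def vespa_is_ready(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if Vespa's health endpoint reports status 'up'."""
    try:
        with urllib.request.urlopen(f"{base_url}/state/v1/health", timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
        return payload.get("status", {}).get("code") == "up"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or malformed response: not ready yet.
        return False


def wait_for_vespa(base_url: str = "http://localhost:8080",
                   retries: int = 30, delay: float = 2.0) -> bool:
    """Poll the health endpoint until Vespa is ready or retries run out."""
    for _ in range(retries):
        if vespa_is_ready(base_url):
            return True
        time.sleep(delay)
    return False
```

Run `wait_for_vespa()` after `docker run` and before deploying the application package or feeding documents.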
Create a .env file or set environment variables:
# Vespa Configuration
VESPA_URL=http://localhost
VESPA_PORT=8080
VESPA_RESULT_LIMIT=10
VESPA_MAX_RESULT_LIMIT=100
VESPA_DEFAULT_RANKING=fusion
# Embedding Configuration
VESPA_EMBEDDING_MODEL=all-MiniLM-L6-v2
VESPA_EMBEDDING_DEVICE= # Leave empty for auto, or set to 'cpu', 'cuda', 'mps'
# Query Intelligence
GROQ_API_KEY=your_groq_api_key_here
GROQ_MODEL=openai/gpt-oss-120b
ENABLE_QUERY_INTELLIGENCE=true
ENABLE_DIVERSIFICATION=true

Important: You must set the GROQ_API_KEY environment variable.
Quick Setup:
# Windows PowerShell
$env:GROQ_API_KEY = "your_groq_api_key_here"
# Or create .env file
echo GROQ_API_KEY=your_groq_api_key_here > .env

Start the server:

python start_server.py

Or using uvicorn directly:
uvicorn ui_e2e:app --reload --port 8000

The server will start at: http://localhost:8000
- Open your browser and navigate to http://localhost:8000
- Enter a search query, e.g., "python machine learning tutorial"
- Select ranking mode (fusion, semantic, or bm25)
- Click "Search"
- View results with relevance scores, snippets, and URLs
# Basic search
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "python machine learning",
"limit": 10,
"ranking": "fusion"
}'

Response:
{
"query": "python machine learning",
"hits": [
{
"id": "doc123",
"url": "https://example.com/python-ml",
"text": "Complete guide to Python machine learning...",
"snippet": "Complete guide to Python machine learning...",
"relevance": 0.8542
}
],
"returned": 10,
"total_available": 150,
"latency_ms": 45.2,
"ranking_profile": "fusion",
"embedding_model": "minilm"
}

# Search with query improvement and expansion
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "how learn python",
"limit": 5,
"use_query_improvement": true,
"use_query_expansion": true,
"ranking": "fusion"
}'

Response includes:
- query_improvement: Improved query with explanation
- expanded_queries: List of query variations used
- query_intent: Classified intent (FACTUAL, INFORMATIONAL, etc.)
curl -X POST http://localhost:8000/suggestions \
-H "Content-Type: application/json" \
-d '{
"partial_query": "python",
"num_suggestions": 5
}'

Response:
{
"suggestions": [
"python tutorial",
"python programming",
"python for beginners",
"python data science",
"python web development"
]
}

# Get analytics for last 7 days
curl http://localhost:8000/analytics?days=7
# Get analytics for specific query
curl http://localhost:8000/analytics/query/python%20tutorial
# Get top clicked documents
curl http://localhost:8000/analytics/top-documents?limit=10

curl -X POST http://localhost:8000/compare-models \
-H "Content-Type: application/json" \
-d '{
"query": "machine learning algorithms"
}'

Response:
{
"query": "machine learning algorithms",
"models": {
"minilm": {
"embedding": [0.123, -0.456, ...],
"dimension": 384,
"model_info": {
"name": "all-MiniLM-L6-v2",
"dim": 384,
"description": "Fast, lightweight (default)"
}
},
"mpnet": {
"embedding": [0.234, -0.567, ...],
"dimension": 768,
"model_info": {
"name": "all-mpnet-base-v2",
"dim": 768,
"description": "Higher quality, slower"
}
}
}
}

curl -X POST http://localhost:8000/improve-query \
-H "Content-Type: application/json" \
-d '{
"query": "how learn python fast"
}'

Response:
{
"original": "how learn python fast",
"improved": "how to learn python quickly",
"explanation": "Added 'to' for proper grammar and changed 'fast' to 'quickly' for better clarity"
}

curl -X POST http://localhost:8000/click \
-H "Content-Type: application/json" \
-d '{
"search_id": 123,
"document_id": "doc-456",
"document_url": "https://example.com/article",
"position": 1,
"relevance_score": 0.95
}'

Response:
{
"status": "success"
}

# Export as JSON
curl -X POST "http://localhost:8000/export/results?format=json" \
-H "Content-Type: application/json" \
-d '{
"query": "python tutorial",
"limit": 10
}' --output results.json
# Export as CSV
curl -X POST "http://localhost:8000/export/results?format=csv" \
-H "Content-Type: application/json" \
-d '{
"query": "python tutorial",
"limit": 10
}' --output results.csv

curl http://localhost:8000/models

Response:
{
"models": {
"minilm": {
"name": "all-MiniLM-L6-v2",
"dim": 384,
"description": "Fast, lightweight (default)"
},
"mpnet": {
"name": "all-mpnet-base-v2",
"dim": 768,
"description": "Higher quality, slower"
},
"bge": {
"name": "BAAI/bge-base-en-v1.5",
"dim": 768,
"description": "State-of-the-art quality"
}
},
"default": "minilm"
}

curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "deep learning neural networks",
"limit": 5,
"ranking": "semantic",
"embedding_model": "mpnet"
}'

curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "python web frameworks",
"limit": 10,
"use_diversification": true,
"diversification_method": "mmr",
"ranking": "fusion"
}'

import requests
# Basic search
response = requests.post(
"http://localhost:8000/search",
json={
"query": "machine learning",
"limit": 10,
"ranking": "fusion"
}
)
data = response.json()
print(f"Found {data['returned']} results")
for hit in data['hits']:
    print(f"- {hit['url']} (relevance: {hit['relevance']})")

import requests
response = requests.post(
"http://localhost:8000/search",
json={
"query": "how to use python",
"limit": 5,
"use_query_improvement": True,
"use_query_expansion": True,
"ranking": "fusion"
}
)
data = response.json()
# Check if query was improved
if "query_improvement" in data:
    print(f"Original: {data['query_improvement']['original']}")
    print(f"Improved: {data['query_improvement']['improved']}")
    print(f"Explanation: {data['query_improvement']['explanation']}")

# Check expanded queries
if "expanded_queries" in data:
    print(f"Expanded queries: {data['expanded_queries']}")

# Check query intent
if "query_intent" in data:
    print(f"Intent: {data['query_intent']['intent']}")
    print(f"Confidence: {data['query_intent']['confidence']}")

import requests
# Get overall analytics
analytics = requests.get("http://localhost:8000/analytics?days=7").json()
print(f"Total searches: {analytics['total_searches']}")
print(f"Total clicks: {analytics['total_clicks']}")
print(f"CTR: {analytics['click_through_rate']}%")
print(f"Avg latency: {analytics['avg_latency_ms']}ms")
# Get top queries
for query in analytics['top_queries'][:5]:
    print(f"- {query['query']}: {query['search_count']} searches")
# Get query-specific analytics
query_stats = requests.get(
"http://localhost:8000/analytics/query/python%20tutorial"
).json()
print(f"\nQuery 'python tutorial':")
print(f" Searches: {query_stats['total_searches']}")
print(f" Clicks: {query_stats['total_clicks']}")
print(f"  Avg latency: {query_stats['avg_latency_ms']}ms")

import requests
# Perform a search first
search_response = requests.post(
"http://localhost:8000/search",
json={"query": "python tutorial", "limit": 5}
)
search_data = search_response.json()
search_id = search_data['search_id']
# Track a click on the first result
click_response = requests.post(
"http://localhost:8000/click",
json={
"search_id": search_id,
"document_id": search_data['hits'][0]['id'],
"document_url": search_data['hits'][0]['url'],
"position": 1,
"relevance_score": search_data['hits'][0]['relevance']
}
)
print(f"Click tracked: {click_response.json()}")

import requests
response = requests.post(
"http://localhost:8000/suggestions",
json={
"partial_query": "python",
"num_suggestions": 5
}
)
suggestions = response.json()['suggestions']
print("Suggestions:")
for suggestion in suggestions:
    print(f"  - {suggestion}")

import requests
response = requests.post(
"http://localhost:8000/compare-models",
json={"query": "natural language processing"}
)
data = response.json()
print(f"Query: {data['query']}\n")
for model_key, model_data in data['models'].items():
    if 'error' not in model_data:
        print(f"{model_key}:")
        print(f"  Model: {model_data['model_info']['name']}")
        print(f"  Dimension: {model_data['dimension']}")
        print(f"  Description: {model_data['model_info']['description']}\n")

- Start the server:
  python start_server.py
- Open your browser: Navigate to http://localhost:8000
- Basic Search:
  - Type your query in the search box
  - Select ranking mode (fusion, semantic, or bm25)
  - Click "Search"
  - Results appear with relevance scores and snippets
- Advanced Features:
  - Query Improvement: Check "✨ Improve query with AI" to automatically fix grammar and improve your query
  - Query Expansion: Check "🔍 Expand query" to generate query variations
  - Diversification: Check "🎯 Diversify results" to reduce redundant results
  - Model Selection: Choose different embedding models from the dropdown
- Analytics Dashboard: Click "📊 Analytics Dashboard" to view:
  - Total searches and clicks
  - Click-through rate
  - Top queries
  - Query performance metrics
  - Popular documents
- Export Results: Click "📥 Export Results" to download search results as JSON or CSV
| Method | Endpoint | Description |
|---|---|---|
| GET | / | Web UI homepage |
| GET | /health | Health check endpoint |
| POST | /search | Perform search query |
| POST | /suggestions | Get autocomplete suggestions |
| POST | /improve-query | Improve query using AI |
| POST | /click | Track click on search result |
| GET | /analytics | Get search analytics |
| GET | /analytics/query/{query} | Get query-specific analytics |
| GET | /analytics/top-documents | Get top clicked documents |
| POST | /export/results | Export search results (JSON/CSV) |
| GET | /models | Get available embedding models |
| POST | /compare-models | Compare embeddings from different models |
NeuralSearch/
├── main.py                      # Vespa application package definition
├── start_server.py              # Server startup script
├── ui_e2e.py                    # FastAPI application with all endpoints
├── analytics.py                 # Analytics and tracking module
├── query_intelligence.py        # AI-powered query features (Groq)
├── multi_model_ensemble.py      # Multi-model embedding support
├── result_diversification.py    # Result diversification algorithms
├── pyproject.toml               # Project configuration
├── requirements.txt             # Python dependencies
├── search_analytics.db          # SQLite database for analytics
├── templates/
│   └── index_e2e.html           # Web UI template
└── static/
    ├── app_e2e.js               # Frontend JavaScript
    └── styles_e2e.css           # Stylesheet
NeuralSearch follows a layered microservices architecture with clear separation of concerns:
┌─────────────────────────────────────────────────────────────┐
│                   Frontend Layer (Web UI)                   │
│   ┌────────────┐   ┌────────────┐   ┌────────────┐          │
│   │  HTML/CSS  │   │ JavaScript │   │ Templates  │          │
│   └────────────┘   └────────────┘   └────────────┘          │
└─────────────────────────────────────────────────────────────┘
                            │ HTTP/REST
┌─────────────────────────────────────────────────────────────┐
│                     API Layer (FastAPI)                     │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ ui_e2e.py - Main Application Server                 │   │
│   │   - Request Routing & Validation                    │   │
│   │   - Response Formatting                             │   │
│   │   - Error Handling                                  │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                            │
┌─────────────────────────────────────────────────────────────┐
│            Business Logic Layer (Python Modules)            │
│   ┌────────────┐   ┌────────────┐   ┌────────────┐          │
│   │   Query    │   │   Multi-   │   │   Result   │          │
│   │Intelligence│   │   Model    │   │ Diversifi- │          │
│   │   (Groq)   │   │  Ensemble  │   │   cation   │          │
│   └────────────┘   └────────────┘   └────────────┘          │
│   ┌────────────┐   ┌────────────┐                           │
│   │ Analytics  │   │ Embedding  │                           │
│   │  (SQLite)  │   │   Models   │                           │
│   └────────────┘   └────────────┘                           │
└─────────────────────────────────────────────────────────────┘
                            │
┌─────────────────────────────────────────────────────────────┐
│                 Data Layer (Vespa + SQLite)                 │
│   ┌────────────┐   ┌────────────┐                           │
│   │   Vespa    │   │   SQLite   │                           │
│   │   Search   │   │ Analytics  │                           │
│   │   Engine   │   │  Database  │                           │
│   └────────────┘   └────────────┘                           │
└─────────────────────────────────────────────────────────────┘
Technology: FastAPI (Python 3.10+)
Key Features:
- Async Request Handling: Non-blocking I/O for concurrent requests
- Pydantic Models: Type-safe request/response validation
- Session Management: Cached Vespa connections with lazy initialization
- Error Handling: Graceful fallback to mock mode when Vespa unavailable
- CORS Support: Cross-origin resource sharing enabled
Core Functions:
# Connection Management
get_vespa_client() -> Vespa # Creates Vespa client (cached)
get_vespa_session() -> VespaSync # Returns active session (lazy-loaded)
close_vespa_session() -> None # Cleanup on shutdown
# Query Processing
run_vespa_query() -> Dict # Executes search with Vespa
_get_mock_results() -> Dict # Fallback when Vespa unavailable
_format_hit() -> Dict               # Normalizes Vespa response format

API Endpoints:
- POST /search - Main search endpoint with query intelligence
- POST /suggestions - Autocomplete suggestions
- POST /improve-query - Query improvement service
- POST /click - Click tracking for analytics
- GET /analytics - Analytics dashboard data
- GET /models - Available embedding models
- POST /compare-models - Model comparison utility
- GET /health - Health check with Vespa status
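The real `_format_hit` lives in ui_e2e.py; the sketch below shows what such normalization might look like, assuming raw hits shaped like Vespa's `{"id", "relevance", "fields": {...}}` and the `/search` response fields shown earlier. Field names outside that are illustrative:

```python
def format_hit(raw_hit: dict) -> dict:
    """Normalize a raw Vespa hit into the flat shape returned by /search.

    The exact keys in `raw_hit["fields"]` depend on your Vespa schema;
    this assumes the id/text/url fields from the README's schema.
    """
    fields = raw_hit.get("fields", {})
    text = fields.get("text", "")
    return {
        "id": fields.get("id", raw_hit.get("id", "")),
        "url": fields.get("url", ""),
        "text": text,
        "snippet": text[:200],  # simple truncation snippet for display
        "relevance": round(float(raw_hit.get("relevance", 0.0)), 4),
    }
```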
Technology: Groq LLM API (openai/gpt-oss-120b)
Architecture:
- Cached Client: @lru_cache decorator for singleton Groq client
- Prompt Engineering: Structured prompts for consistent outputs
- Error Resilience: Fallback to original query on API failures
- Response Parsing: Robust text parsing with cleanup logic
Functions:
expand_query(query, num_variations=3):
- Purpose: Generate query variations for better coverage
- Algorithm: LLM-based synonym and rephrasing generation
- Output: List of query variations including original
- Parameters:
  - temperature=0.8 (creativity)
  - max_tokens=200 (response length)
  - top_p=0.9 (nucleus sampling)
improve_query(query):
- Purpose: Fix grammar, add keywords, clarify intent
- Algorithm: LLM-based query rewriting
- Output: {original, improved, explanation}
- Parameters: temperature=0.3 (more deterministic)
get_query_suggestions(partial_query, num_suggestions=5):
- Purpose: Autocomplete suggestions
- Algorithm: Context-aware completion generation
- Output: List of complete query suggestions
classify_query_intent(query):
- Purpose: Classify search intent
- Categories: FACTUAL, NAVIGATIONAL, INFORMATIONAL, TRANSACTIONAL
- Output: {intent, confidence, reasoning}
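Since the LLM's reply arrives as free text, the classifier has to parse defensively and fall back to a safe default. A sketch under the assumption of a pipe-separated `INTENT | confidence | reasoning` reply format — the actual prompt/response contract lives in query_intelligence.py:

```python
VALID_INTENTS = {"FACTUAL", "NAVIGATIONAL", "INFORMATIONAL", "TRANSACTIONAL"}


def parse_intent_reply(reply: str) -> dict:
    """Parse an 'INTENT | confidence | reasoning' style LLM reply.

    The pipe-separated format is an assumption for illustration; malformed
    replies fall back to INFORMATIONAL with zero confidence.
    """
    parts = [p.strip() for p in reply.split("|", 2)]
    intent = parts[0].upper() if parts else ""
    if intent not in VALID_INTENTS:
        return {"intent": "INFORMATIONAL", "confidence": 0.0,
                "reasoning": "unparseable reply"}
    try:
        confidence = float(parts[1]) if len(parts) > 1 else 0.5
    except ValueError:
        confidence = 0.5
    reasoning = parts[2] if len(parts) > 2 else ""
    return {"intent": intent, "confidence": confidence, "reasoning": reasoning}
```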
Technology: Sentence Transformers
Supported Models:
| Model Key | Model Name | Dimensions | Speed | Quality |
|---|---|---|---|---|
| minilm | all-MiniLM-L6-v2 | 384 | Fast (~50ms) | Good |
| mpnet | all-mpnet-base-v2 | 768 | Moderate (~150ms) | High |
| bge | BAAI/bge-base-en-v1.5 | 768 | Moderate (~150ms) | SOTA |
Implementation Details:
- Model Caching: @lru_cache(maxsize=3) prevents reloading
- Device Support: Automatic CPU/GPU/MPS detection
- Encoding: encode_query(query, model_key) returns a normalized vector
- Ensemble Methods: Average or concatenate multiple model outputs
Functions:
encode_query(query, model_key) -> List[float] # Single model encoding
ensemble_encode(query, models, method) -> Dict # Multi-model ensemble
compare_models(query, models) -> Dict # Side-by-side comparison
get_model_info() -> Dict                      # Model metadata

Algorithm: Maximal Marginal Relevance (MMR)
MMR Formula:
MMR = argmax[λ × Sim(d, q) - (1-λ) × max(Sim(d, di))]
Where:
- λ (lambda_param): Balance between relevance (1.0) and diversity (0.0)
- Sim(d, q): Relevance score of document d to query q
- Sim(d, di): Similarity between document d and selected document di
Implementation:
- Text Similarity: Jaccard similarity (word overlap) for efficiency
- Relevance Normalization: Normalized by top result's relevance
- Greedy Selection: Iteratively selects best MMR-scored candidates
Functions:
diversify_results_mmr(hits, lambda_param, max_results) -> List
diversify_by_domain(hits, max_per_domain, max_results) -> List
combine_diversification(hits, method, lambda_param) -> List

Diversification Methods:
- MMR: Reduces redundancy based on content similarity
- Domain: Ensures no single domain dominates results
- Both: Applies domain filtering then MMR
Technology: SQLite with row factory for dict-like access
Database Schema:
searches Table:
CREATE TABLE searches (
id INTEGER PRIMARY KEY,
query TEXT NOT NULL,
query_hash TEXT NOT NULL, -- MD5 hash for normalization
ranking_profile TEXT,
limit_results INTEGER,
use_query_expansion BOOLEAN,
use_query_improvement BOOLEAN,
query_intent TEXT,
num_results INTEGER,
latency_ms REAL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
session_id TEXT
)

clicks Table:
CREATE TABLE clicks (
id INTEGER PRIMARY KEY,
search_id INTEGER, -- Foreign key to searches
document_id TEXT NOT NULL,
document_url TEXT,
position INTEGER,
relevance_score REAL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
)

query_metrics Table (Aggregated):
CREATE TABLE query_metrics (
id INTEGER PRIMARY KEY,
query_hash TEXT UNIQUE,
total_searches INTEGER DEFAULT 0,
total_clicks INTEGER DEFAULT 0,
avg_latency_ms REAL,
avg_results INTEGER,
last_searched DATETIME
)

Indexes:
- idx_searches_timestamp - Time-based queries
- idx_searches_query_hash - Query lookup
- idx_clicks_search_id - Join optimization
- idx_clicks_timestamp - Time-based analytics
Analytics Functions:
log_search(...) -> int # Logs search, returns search_id
log_click(...) -> None # Tracks user clicks
get_search_analytics(days) -> Dict # Overall analytics
get_query_performance(query) -> Dict # Query-specific metrics
get_top_clicked_documents(limit) -> List

Query Normalization:
- Uses MD5 hash of lowercased, stripped query
- Enables case-insensitive query grouping
- Prevents duplicate metrics for similar queries
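The normalization step is small enough to show whole; this mirrors the description above (MD5 of the lowercased, stripped query):

```python
import hashlib


def query_hash(query: str) -> str:
    """MD5 of the lowercased, stripped query.

    Groups 'Python Tutorial' and '  python tutorial ' under one
    query_metrics row; MD5 is used for grouping, not security.
    """
    normalized = query.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```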
Vespa Application Package:
Schema Definition:
Schema(
name="doc",
document=Document(
fields=[
Field(name="id", type="string", indexing=["summary"]),
Field(name="text", type="string",
indexing=["index", "summary"], index="enable-bm25"),
Field(name="url", type="string",
indexing=["index", "summary"], index="enable-bm25"),
]
)
)

Ranking Profiles:
- bm25_text_only: bm25(text)
- bm25_url_only: bm25(url)
- bm25: bm25(text) + 0.1 × bm25(url) (default k1=1.2, b=0.75)
- bm25_comb_tuned: Same expression with tuned parameters:
  - Text: k1=1.8, b=0.40
  - URL: k1=0.9, b=0.30
BM25 Formula:
BM25(q, d) = Σ IDF(qi) × (f(qi, d) × (k1 + 1)) / (f(qi, d) + k1 × (1 - b + b × |d|/avgdl))
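To build intuition for the k1 and b parameters, a single term's contribution can be computed directly. A sketch using the common Lucene-style IDF — Vespa's exact IDF variant may differ slightly:

```python
import math


def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    n_docs: int, df: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """One term's BM25 contribution, matching the formula above.

    tf: term frequency in the document; df: number of documents
    containing the term; n_docs: collection size.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf
```

Lowering b (as in the tuned text profile, b=0.40) weakens the length penalty, so long documents are punished less than with the default b=0.75.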
Query Construction:
query_body = {
"yql": "select * from doc where userQuery()",
"hits": limit,
"query": query_text,
"ranking": {"profile": ranking_profile},
"input.query(q)": query_embedding, # For semantic search
"ranking.features.query(q)": query_embedding
}

Connection Management:
- Lazy Loading: Session created on first query
- Connection Test: 2-second socket timeout before attempting connection
- Caching: Global _vespa_session variable for reuse
- Error Handling: Falls back to mock mode on connection failure
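The lazy-loading pattern described above can be sketched generically. `factory` below stands in for the real pyvespa session constructor and is an assumption for illustration; only the TCP pre-check and caching logic are shown:

```python
import socket

_session = None  # module-level cache, mirroring the global described above


def vespa_reachable(host: str = "localhost", port: int = 8080,
                    timeout: float = 2.0) -> bool:
    """Cheap TCP pre-check before attempting a real pyvespa connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def get_session(factory, host: str = "localhost", port: int = 8080):
    """Lazily create and cache a session; None signals mock mode."""
    global _session
    if _session is None and vespa_reachable(host, port):
        _session = factory()  # e.g. a VespaSync(app) call in the real code
    return _session
```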
1. User Query
   ↓
2. FastAPI Endpoint (/search)
   ├── Request Validation (Pydantic)
   └── Query Intelligence (if enabled)
       ├── Query Improvement (Groq)
       ├── Query Expansion (Groq)
       └── Intent Classification (Groq)
   ↓
3. Embedding Generation (if semantic/fusion)
   ├── Model Selection (minilm/mpnet/bge)
   ├── Sentence Transformer Encoding
   └── Vector Normalization
   ↓
4. Vespa Query Construction
   ├── YQL Query Building
   ├── Ranking Profile Selection
   ├── Embedding Injection (if needed)
   └── Connection Check (2s timeout)
   ↓
5. Vespa Execution
   ├── BM25 Scoring (lexical)
   ├── Vector Similarity (semantic, if enabled)
   └── Result Ranking
   ↓
6. Result Processing
   ├── Hit Formatting
   ├── Diversification (MMR, if enabled)
   └── Limit Application
   ↓
7. Analytics Logging
   ├── Search Log Entry
   ├── Query Metrics Update
   └── Popular Queries Update
   ↓
8. Response Formatting
   ├── JSON Serialization
   ├── Metadata Addition
   └── HTTP Response
1. User Click Event
   ↓
2. Frontend JavaScript
   ├── Extract: search_id, document_id, position, relevance
   └── POST /click
   ↓
3. Backend Processing
   ├── Click Log Entry
   ├── Query Metrics Update (total_clicks++)
   └── Popular Queries Update
   ↓
4. Analytics Aggregation
   └── Real-time CTR calculation
Purpose: Lexical term-based ranking
Parameters:
- k1: Term frequency saturation (default: 1.2, tuned: 1.8 for text, 0.9 for URL)
- b: Length normalization (default: 0.75, tuned: 0.40 for text, 0.30 for URL)
Implementation: Native Vespa BM25 scoring
Purpose: Meaning-based similarity matching
Process:
- Query β Embedding (Sentence Transformer)
- Embedding β Vespa vector search
- Cosine similarity calculation
- Result ranking by similarity score
Models: MiniLM (384d), MPNet (768d), BGE (768d)
Purpose: Combine lexical and semantic signals
Scoring:
final_score = α × BM25_score + β × semantic_score

Where α and β are tunable weights (typically α=0.5, β=0.5).
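The blend itself is a one-liner; the important caveat is that the two scores must be on comparable scales (e.g. both normalized to [0, 1]) before blending, or one signal dominates:

```python
def fusion_score(bm25: float, semantic: float,
                 alpha: float = 0.5, beta: float = 0.5) -> float:
    """Weighted blend of lexical and semantic signals, as in the formula above.

    Assumes both inputs are already normalized to comparable ranges.
    """
    return alpha * bm25 + beta * semantic
```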
Purpose: Reduce result redundancy
Algorithm:
selected = [top_result]          # Always include best result
remaining = sorted_results[1:]
while len(selected) < max_results:
    best_candidate = None
    best_mmr = -inf
    for candidate in remaining:
        relevance = normalize(candidate.relevance)
        max_sim = max(similarity(candidate, s) for s in selected)
        mmr = λ × relevance - (1-λ) × max_sim
        if mmr > best_mmr:
            best_mmr = mmr
            best_candidate = candidate
    selected.append(best_candidate)
    remaining.remove(best_candidate)

Parameters:
- λ (lambda_param): Default 0.6 (60% relevance, 40% diversity)
- Similarity Metric: Jaccard similarity (word overlap)
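The pseudocode above translates into a short runnable version, using the same Jaccard word-overlap proxy for Sim(d, di). The hit shape `{'text', 'relevance'}` is assumed for illustration:

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity used as the cheap Sim(d, di) proxy."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def mmr_diversify(hits: list[dict], lambda_param: float = 0.6,
                  max_results: int = 10) -> list[dict]:
    """Greedy MMR selection over hits shaped like {'text': ..., 'relevance': ...}."""
    if not hits:
        return []
    ranked = sorted(hits, key=lambda h: h["relevance"], reverse=True)
    top = ranked[0]["relevance"] or 1.0  # normalize by the best score
    selected, remaining = [ranked[0]], ranked[1:]
    while remaining and len(selected) < max_results:
        def mmr(h):
            rel = h["relevance"] / top
            max_sim = max(jaccard(h["text"], s["text"]) for s in selected)
            return lambda_param * rel - (1 - lambda_param) * max_sim
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With λ=0.5, a near-duplicate of the top hit scores below a less relevant but novel result, which is exactly the redundancy reduction the section describes.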
| Operation | Typical Latency | Notes |
|---|---|---|
| Mock Mode | < 1ms | No external dependencies |
| BM25 Search | 10-50ms | Vespa only, no embeddings |
| Embedding Generation | 30-150ms | Depends on model (MiniLM fastest) |
| Semantic Search | 50-200ms | Embedding + Vespa vector search |
| Fusion Search | 50-250ms | Both BM25 and semantic |
| Query Intelligence | 200-800ms | Groq API call (network dependent) |
| Analytics Logging | 1-5ms | SQLite write operation |
- Concurrent Requests: Limited by Vespa and embedding model capacity
- Recommended: 10-50 requests/second (depending on ranking mode)
- Bottlenecks:
- Embedding generation (CPU-bound, single-threaded)
- Vespa query processing (multi-threaded, scales with cores)
- Groq API rate limits (external dependency)
Memory:
- Base application: ~200MB
- Each embedding model: ~100-400MB (cached after first load)
- Vespa container: ~2-4GB
- SQLite database: Grows with usage (~1MB per 1000 searches)
CPU:
- Embedding generation: Moderate (single-threaded per request)
- Vespa: Multi-threaded, benefits from multiple cores
- Query intelligence: Network I/O bound
Disk:
- Models: ~500MB-2GB per model (downloaded once, cached)
- Analytics DB: Grows linearly with usage
- API Keys: Environment variables (not hardcoded)
- Input Validation: Pydantic models validate all inputs
- SQL Injection: Parameterized queries in SQLite
- CORS: Enabled for development (configure for production)
- Authentication: Add API key or OAuth for protected endpoints
- Rate Limiting: Implement request throttling (e.g., slowapi)
- HTTPS: Use TLS/SSL certificates
- Secrets Management: Use secure secret management services
- Database Encryption: Encrypt analytics database at rest
- Error Handling: Avoid exposing internal errors to clients
- Logging: Structured logging without sensitive data
┌─────────────┐
│   Browser   │
└──────┬──────┘
       │ HTTP
┌──────▼───────────────────┐
│  FastAPI (localhost)     │
│  Port: 8000              │
└──────┬───────────────────┘
       │
       ├── Vespa (localhost:8080)  [Optional]
       ├── Groq API (cloud)
       └── SQLite (local file)
services:
  vespa:
    image: vespaengine/vespa:latest
    ports: ["8080:8080", "19071:19071"]
    mem_limit: 8g

  neuralsearch-api:
    build: .
    ports: ["8000:8000"]
    environment:
      - GROQ_API_KEY=${GROQ_API_KEY}
      - VESPA_URL=http://vespa
      - VESPA_PORT=8080
    depends_on: [vespa]

Horizontal Scaling:
- FastAPI: Stateless, can run multiple instances behind load balancer
- Vespa: Supports clustering for distributed search
- Embedding Models: Can be offloaded to separate service (e.g., TensorFlow Serving)
Vertical Scaling:
- Increase Vespa memory for larger indexes
- GPU acceleration for embedding generation
- Faster CPUs for BM25 processing
NeuralSearch is built on a microservices architecture with the following components:
- FastAPI Backend (ui_e2e.py): RESTful API server handling all search requests
- Vespa Search Engine: Distributed search engine for indexing and retrieval
- Sentence Transformers: Local embedding generation for semantic search
- Groq LLM API: Cloud-based query intelligence (expansion, improvement, suggestions)
- SQLite Analytics: Local database for tracking searches, clicks, and performance
- Modern Web UI: Vanilla JavaScript frontend with real-time search
User Query
    ↓
[Query Intelligence] → Query Improvement/Expansion/Intent Classification
    ↓
[Embedding Generation] → Multi-model semantic embeddings (optional)
    ↓
[Vespa Query] → Hybrid search (BM25 + semantic or BM25-only)
    ↓
[Result Diversification] → MMR-based deduplication (optional)
    ↓
[Analytics Logging] → Track search, clicks, performance
    ↓
Formatted Results
Fusion Mode combines:
- BM25 (Lexical): Term frequency-based scoring on text and url fields
- Semantic (Vector): Cosine similarity between query embedding and document embeddings
- Combined Score: relevance = α × BM25_score + β × semantic_score
BM25 Formula:
BM25(q, d) = Σ IDF(qi) × (f(qi, d) × (k1 + 1)) / (f(qi, d) + k1 × (1 - b + b × |d|/avgdl))
Where:
- k1: Term frequency saturation parameter (default: 1.2, tuned: 1.8 for text, 0.9 for URL)
- b: Length normalization parameter (default: 0.75, tuned: 0.40 for text, 0.30 for URL)
- f(qi, d): Frequency of term qi in document d
- IDF(qi): Inverse document frequency of term qi
Maximal Marginal Relevance (MMR) algorithm:
MMR = argmax[λ × Sim(d, q) - (1-λ) × max(Sim(d, di))]
Where:
- λ (lambda_param): Balance between relevance and diversity (default: 0.6)
- Sim(d, q): Relevance score of document d to query q
- Sim(d, di): Similarity between document d and already selected document di
When Vespa is unavailable, the system automatically switches to mock mode:
- Generates realistic search results with working URLs (Wikipedia, Stack Overflow, GitHub, etc.)
- Query-specific URLs that redirect to real search engines
- Maintains API compatibility for frontend testing
- Visual indicators in UI to distinguish mock results
- Fast response times (< 1ms) for development/testing
Current Schema (defined in main.py):
The application uses a document schema with the following structure:
Schema(
name="doc",
document=Document(
fields=[
Field(name="id", type="string", indexing=["summary"]),
Field(name="text", type="string", indexing=["index", "summary"], index="enable-bm25"),
Field(name="url", type="string", indexing=["index", "summary"], index="enable-bm25"),
]
),
fieldsets=[FieldSet(name="default", fields=["text", "url"])],
rank_profiles=[...] # 4 BM25 profiles defined
)

Field Descriptions:
- id: Document identifier (summary only, not indexed for search)
- text: Main content field (indexed with BM25, used for lexical search)
- url: URL field (indexed with BM25, weighted 0.1x in combined profiles)
For Semantic/Fusion Search (requires schema modification):
To enable semantic search, add an embedding field to the schema:
Field(
name="text_embedding",
type="tensor<float>(x[384])", # 384 for MiniLM, 768 for MPNet/BGE
indexing=["attribute", "index"],
attribute=["distance-metric: euclidean"]
)

Then update rank profiles to include semantic scoring:
RankProfile(
name="fusion",
functions=[
Function(name="bm25score", expression="bm25(text) + 0.1 * bm25(url)"),
Function(name="semanticscore", expression="closeness(field, text_embedding)")
],
first_phase="bm25score + 0.7 * semanticscore"
)

Analytics Database (search_analytics.db):
-- Searches table
CREATE TABLE searches (
id INTEGER PRIMARY KEY AUTOINCREMENT,
query TEXT NOT NULL,
query_hash TEXT NOT NULL,
ranking_profile TEXT,
limit_results INTEGER,
use_query_expansion BOOLEAN,
use_query_improvement BOOLEAN,
query_intent TEXT,
num_results INTEGER,
latency_ms REAL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
session_id TEXT
);
-- Clicks table
CREATE TABLE clicks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
search_id INTEGER,
document_id TEXT NOT NULL,
document_url TEXT,
position INTEGER,
relevance_score REAL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);

The system supports multiple ranking profiles defined in main.py:
Application-Level Profiles:
- fusion: Hybrid search combining BM25 + semantic (default, requires embeddings)
- semantic: Dense vector search only (requires embeddings)
- bm25: Lexical search only (works with current schema)
Vespa Schema Profiles (defined in main.py):
- `bm25_text_only`: BM25 scoring on the `text` field only
  - Expression: `bm25(text)`
- `bm25_url_only`: BM25 scoring on the `url` field only
  - Expression: `bm25(url)`
- `bm25`: Combined BM25 with default parameters
  - Expression: `bm25(text) + 0.1 * bm25(url)`
  - Default k1=1.2, b=0.75
- `bm25_comb_tuned`: Combined BM25 with tuned parameters
  - Expression: `bm25(text) + 0.1 * bm25(url)`
  - Text field: k1=1.8, b=0.40
  - URL field: k1=0.9, b=0.30
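The effect of the tuned k1 and b values follows from the standard BM25 term-scoring formula. A small illustrative implementation — the textbook formula, not the code Vespa actually runs:

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 score contribution of a single term.

    tf: term frequency in the document; df: number of documents containing
    the term; n_docs: corpus size; doc_len/avg_doc_len: length normalization.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    saturation = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * saturation

# Higher k1 lets repeated terms keep adding score; lower b reduces the
# penalty on long documents -- the direction of the tuned text-field values.
default = bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=300, avg_doc_len=200)
tuned = bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=300, avg_doc_len=200,
                        k1=1.8, b=0.40)
print(tuned > default)  # -> True (long doc penalized less, repeats count more)
```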
Three pre-trained models are supported via Sentence Transformers:
- `minilm` (`all-MiniLM-L6-v2`):
  - Dimensions: 384
  - Speed: Fast (~50ms per query)
  - Quality: Good for general purpose
  - Use case: Default, production workloads
- `mpnet` (`all-mpnet-base-v2`):
  - Dimensions: 768
  - Speed: Moderate (~150ms per query)
  - Quality: Higher accuracy
  - Use case: Quality-critical applications
- `bge` (`BAAI/bge-base-en-v1.5`):
  - Dimensions: 768
  - Speed: Moderate (~150ms per query)
  - Quality: State-of-the-art
  - Use case: Research, high-accuracy requirements
Powered by Groq LLM (`openai/gpt-oss-120b`):

- Query Improvement:
  - Fixes grammar errors
  - Adds missing keywords
  - Clarifies ambiguous intent
  - Returns improved query with explanation
- Query Expansion:
  - Generates 3-5 alternative query variations
  - Uses synonyms and related terms
  - Maintains original intent
  - Used for multi-query retrieval
- Query Suggestions (Autocomplete):
  - Generates 5-10 completion suggestions
  - Context-aware completions
  - Popular query patterns
- Intent Classification:
  - FACTUAL: Seeking specific facts or data
  - NAVIGATIONAL: Looking for a specific website/page
  - INFORMATIONAL: Learning about a topic
  - TRANSACTIONAL: Wanting to perform an action
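Since the classifier returns one of four labels plus a confidence, a small validation step guards against malformed LLM output. A sketch of that parsing — the JSON reply shape is an assumption about how `main.py` prompts the model:

```python
import json

VALID_INTENTS = {"FACTUAL", "NAVIGATIONAL", "INFORMATIONAL", "TRANSACTIONAL"}

def parse_intent(raw: str) -> dict:
    """Parse and validate an intent-classification reply from the LLM.

    Falls back to INFORMATIONAL with zero confidence on malformed output,
    so a flaky LLM response never breaks the search request.
    """
    try:
        data = json.loads(raw)
        intent = str(data.get("intent", "")).upper()
        confidence = float(data.get("confidence", 0.0))
    except (ValueError, TypeError, AttributeError):
        return {"intent": "INFORMATIONAL", "confidence": 0.0}
    if intent not in VALID_INTENTS or not 0.0 <= confidence <= 1.0:
        return {"intent": "INFORMATIONAL", "confidence": 0.0}
    return {"intent": intent, "confidence": confidence}

print(parse_intent('{"intent": "FACTUAL", "confidence": 0.92}'))
```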
Core Dependencies (from `requirements.txt`):

- `fastapi>=0.111.0`: Web framework for building APIs
- `uvicorn>=0.30.0`: ASGI server for FastAPI
- `groq>=0.4.0`: Groq LLM API client
- `sentence-transformers>=5.1.2`: Embedding model library
- `pyvespa>=0.62.0`: Vespa Python client
- `datasets>=4.4.1`: Dataset loading utilities
- `pandas>=2.3.3`: Data manipulation (for analytics)
System Requirements:
- Python 3.10+
- Docker (for Vespa)
- 4GB+ RAM (8GB recommended for Vespa)
- Internet connection (for Groq API and model downloads)
```bash
# Check if Vespa is running
curl http://localhost:8080/state/v1/health

# Check Vespa container
docker ps | grep vespa

# Reinstall dependencies
pip install -r requirements.txt --upgrade
```

The analytics database (`search_analytics.db`) is created automatically. If you encounter issues, delete it and restart the server.
```json
{
  "query": "string (required)",
  "limit": "integer (optional, default: 10, max: 100)",
  "ranking": "string (optional: 'fusion'|'semantic'|'bm25', default: 'fusion')",
  "use_query_expansion": "boolean (optional, default: false)",
  "use_query_improvement": "boolean (optional, default: false)",
  "use_diversification": "boolean (optional, default: false)",
  "diversification_method": "string (optional: 'mmr', default: 'mmr')",
  "embedding_model": "string (optional: 'minilm'|'mpnet'|'bge', default: 'minilm')",
  "session_id": "string (optional, for analytics tracking)"
}
```

Response:

```json
{
  "query": "string",
  "hits": [
    {
      "id": "string",
      "document_id": "string",
      "vespa_document_id": "string",
      "url": "string",
      "text": "string",
      "snippet": "string",
      "relevance": "float (0.0-1.0)",
      "fields": {}
    }
  ],
  "returned": "integer",
  "limit": "integer",
  "total_available": "integer",
  "latency_ms": "float",
  "coverage": {},
  "ranking_profile": "string",
  "embedding_model": "string",
  "search_id": "integer (for click tracking)",
  "query_improvement": {
    "original": "string",
    "improved": "string",
    "explanation": "string"
  },
  "expanded_queries": ["string"],
  "query_intent": {
    "intent": "FACTUAL|NAVIGATIONAL|INFORMATIONAL|TRANSACTIONAL",
    "confidence": "float"
  },
  "mock_mode": "boolean (true if Vespa unavailable)"
}
```

400 Bad Request:

```json
{
  "detail": "Error message describing the issue"
}
```

503 Service Unavailable (Vespa connection):

- Automatically falls back to mock mode
- Response includes `"mock_mode": true`
- Mock Mode: < 1ms (no external dependencies)
- BM25 Search: 10-50ms (Vespa only)
- Semantic Search: 50-200ms (embedding generation + Vespa)
- Fusion Search: 50-250ms (embedding + hybrid retrieval)
- Query Intelligence: 200-800ms (Groq API call)
- Concurrent Requests: Limited by Vespa and embedding model capacity
- Recommended: 10-50 requests/second (depending on ranking mode)
- Bottlenecks:
- Embedding generation (CPU-bound)
- Vespa query processing
- Groq API rate limits
- Memory:
- Base application: ~200MB
- Each embedding model: ~100-400MB
- Vespa container: ~2-4GB
- CPU:
- Embedding generation: Moderate (single-threaded)
- Vespa: Multi-threaded, scales with cores
- Disk:
- Models: ~500MB-2GB per model (cached after first download)
- Analytics DB: Grows with usage (~1MB per 1000 searches)
- Mock Mode: The application automatically works in mock mode when Vespa is not running, returning demo results with working URLs
- BM25 Mode: Works immediately with the current schema in `main.py` (no embeddings required)
- Semantic/Fusion Mode: Requires a `text_embedding` tensor field in the Vespa schema (not included in the current `main.py`)
- Model Downloads: First-time embedding model downloads may take 2-5 minutes (models are cached locally)
- Groq API Key: Must set the `GROQ_API_KEY` environment variable (raises an error if missing)
- Analytics: Data stored in an SQLite database (`search_analytics.db`), created automatically on first run
- Connection Handling: Vespa connection is lazy-loaded with a 2-second timeout to prevent hangs
- Caching: Embedding models and the Groq client are cached using `@lru_cache` for performance
- Thread Safety: SQLite connection uses `check_same_thread=False` for FastAPI compatibility
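The searches/clicks tables support simple quality metrics such as click-through rate. A minimal `sqlite3` sketch against a trimmed in-memory copy of the schema shown earlier — the actual dashboard queries in `main.py` may differ:

```python
import sqlite3

# Trimmed in-memory copy of the two analytics tables documented above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE searches (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query TEXT NOT NULL,
    num_results INTEGER,
    latency_ms REAL
);
CREATE TABLE clicks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    search_id INTEGER,
    document_id TEXT NOT NULL,
    position INTEGER
);
""")
conn.execute("INSERT INTO searches (query, num_results, latency_ms) VALUES ('vespa', 10, 35.0)")
conn.execute("INSERT INTO searches (query, num_results, latency_ms) VALUES ('bm25', 10, 28.0)")
conn.execute("INSERT INTO clicks (search_id, document_id, position) VALUES (1, 'doc-1', 1)")

# Click-through rate: fraction of searches that received at least one click.
ctr = conn.execute("""
SELECT CAST(COUNT(DISTINCT c.search_id) AS REAL) / COUNT(DISTINCT s.id)
FROM searches s LEFT JOIN clicks c ON c.search_id = s.id
""").fetchone()[0]
print(f"CTR: {ctr:.2f}")  # one of two searches clicked -> CTR: 0.50
```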
Check if the server is running:

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "healthy",
  "vespa_available": false,
  "version": "0.3.0"
}
```

- API Keys: Stored in environment variables (not hardcoded)
- Input Validation: Pydantic models validate all API inputs
- SQL Injection: Protected via parameterized queries in SQLite
- CORS: Enabled for development (configure for production)
- Authentication: Add API key or OAuth for protected endpoints
- Rate Limiting: Implement request throttling (e.g., using `slowapi`)
- HTTPS: Use TLS/SSL certificates for encrypted connections
- Input Sanitization: Additional validation for user queries
- Secrets Management: Use secure secret management (e.g., AWS Secrets Manager, HashiCorp Vault)
- Database Security: Encrypt analytics database, implement access controls
- Error Handling: Avoid exposing internal errors to clients
- Logging: Implement structured logging without sensitive data
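To make the rate-limiting recommendation concrete, here is a generic pure-Python token bucket illustrating the idea behind request throttling — in production you would wire `slowapi` (or similar middleware) into FastAPI instead of rolling your own:

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle: allow roughly `rate` requests per second
    with bursts up to `capacity`. Illustrative only; use middleware such as
    slowapi for a production FastAPI deployment."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=10.0, capacity=5)  # ~10 req/s, burst of 5
burst = [bucket.allow() for _ in range(6)]
print(burst)  # the 6th immediate request is rejected
```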
```bash
# Test health endpoint
curl http://localhost:8000/health

# Test search endpoint
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "limit": 5}'

# Test query intelligence
curl -X POST http://localhost:8000/improve-query \
  -H "Content-Type: application/json" \
  -d '{"query": "how learn python"}'
```

The application supports testing in mock mode without Vespa:
- All endpoints work with mock data
- Query intelligence requires Groq API key
- Analytics database is created automatically
- Build Docker image:

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "ui_e2e:app", "--host", "0.0.0.0", "--port", "8000"]
```

- Docker Compose (with Vespa):
```yaml
version: '3.8'
services:
  vespa:
    image: vespaengine/vespa:latest
    ports:
      - "8080:8080"
      - "19071:19071"
    mem_limit: 8g
  search-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - GROQ_API_KEY=${GROQ_API_KEY}
      - VESPA_URL=http://vespa
      - VESPA_PORT=8080
    depends_on:
      - vespa
```

- AWS: Deploy on EC2, ECS, or Lambda (with container support)
- Google Cloud: Cloud Run or Compute Engine
- Azure: Container Instances or App Service
- Heroku: Use buildpacks or container registry
Environment Variables (set in cloud platform):
- `GROQ_API_KEY`: Required for query intelligence
- `VESPA_URL`: Vespa endpoint URL
- `VESPA_PORT`: Vespa port (default: 8080)
- `VESPA_EMBEDDING_MODEL`: Embedding model selection
- `ENABLE_QUERY_INTELLIGENCE`: Enable/disable AI features
- `ENABLE_DIVERSIFICATION`: Enable/disable result diversification
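Reading these variables with sensible defaults can be centralized in one place. A small sketch — the `VESPA_URL` default, the flag defaults, and the truthy-string convention are assumptions, not taken from `main.py`:

```python
import os

def _flag(name: str, default: bool) -> bool:
    """Interpret common truthy strings; this convention is an assumption."""
    return os.environ.get(name, str(default)).strip().lower() in {"1", "true", "yes", "on"}

def load_config() -> dict:
    """Collect the documented environment variables with illustrative defaults."""
    return {
        "groq_api_key": os.environ.get("GROQ_API_KEY"),  # required for query intelligence
        "vespa_url": os.environ.get("VESPA_URL", "http://localhost"),  # default assumed
        "vespa_port": int(os.environ.get("VESPA_PORT", "8080")),  # documented default
        "embedding_model": os.environ.get("VESPA_EMBEDDING_MODEL", "minilm"),
        "query_intelligence": _flag("ENABLE_QUERY_INTELLIGENCE", True),
        "diversification": _flag("ENABLE_DIVERSIFICATION", False),
    }

os.environ["VESPA_PORT"] = "8081"
print(load_config()["vespa_port"])  # -> 8081
```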
This project is provided as-is for educational and development purposes.
Contributions are welcome! Areas for improvement:
- Additional embedding models
- More ranking profiles
- Enhanced analytics dashboard
- Performance optimizations
- Test coverage
- Documentation improvements