
imcoza/NeuralSearch


NeuralSearch - AI-Powered Hybrid Search Engine

An advanced AI-powered search engine built with Vespa, FastAPI, and Groq, featuring neural query intelligence, hybrid search (BM25 + semantic), multi-model embeddings, result diversification, and comprehensive analytics.

πŸ“– About

NeuralSearch is a production-ready, AI-powered search engine that combines lexical (BM25) and semantic (vector) search with neural query intelligence. Built with modern Python frameworks and cutting-edge AI models, it delivers intelligent search capabilities with enterprise-grade performance.

What Makes NeuralSearch Unique?

🧠 Neural Query Intelligence - Leverages Groq's high-performance LLM to automatically understand, improve, and expand user queries, making searches more intuitive and effective.

πŸ”€ Hybrid Search Architecture - Seamlessly combines BM25 lexical search with semantic vector search, ensuring both precision and recall for optimal results.

🎯 Multi-Model Embeddings - Supports multiple state-of-the-art embedding models (MiniLM, MPNet, BGE) allowing you to choose the perfect balance between speed and accuracy.

πŸ“Š Comprehensive Analytics - Built-in analytics dashboard tracks search performance, user behavior, and query patterns to continuously improve search quality.

Use Cases

  • Enterprise Search - Deploy as an internal search solution for documentation, knowledge bases, or content management systems
  • E-commerce - Power product search with intelligent query understanding and semantic matching
  • Research & Development - Build custom search solutions for academic papers, research databases, or technical documentation
  • Content Discovery - Enable semantic content discovery in media platforms, blogs, or news aggregators

Technology Stack

  • Backend: FastAPI (Python) - High-performance async API framework
  • Search Engine: Vespa - Production-grade open-source search engine
  • AI/ML: Groq LLM, Sentence Transformers - State-of-the-art NLP models
  • Frontend: Vanilla JavaScript, HTML5, CSS3 - Modern, lightweight web interface
  • Analytics: SQLite - Efficient data storage and querying
  • Deployment: Docker, Uvicorn - Containerized, scalable deployment

πŸš€ Features

  • Hybrid Search: Combines BM25 (lexical) and semantic (dense vector) search
  • Query Intelligence: AI-powered query expansion, improvement, and autocomplete using Groq LLM
  • Multi-Model Support: Support for multiple embedding models (MiniLM, MPNet, BGE)
  • Result Diversification: MMR-based diversification to reduce redundancy
  • Analytics Dashboard: Track searches, clicks, query performance, and user behavior
  • Modern UI: Clean, responsive web interface with real-time search

πŸ“‹ Prerequisites

  • Python 3.10+
  • Docker (for Vespa deployment)
  • Groq API Key (for query intelligence features)

πŸ› οΈ Installation

1. Install Python Dependencies

# Using pip
pip install -r requirements.txt

# Or using the project file
pip install -e .

2. Set Up Vespa Search Engine

Vespa needs to be running in Docker. The application expects Vespa at http://localhost:8080.

Important: The current main.py defines a BM25-only schema. For full functionality (semantic/fusion search), you'll need to add a text_embedding field to your Vespa schema. The application will work with BM25-only mode if you select "bm25" ranking.

Option A: Manual Vespa Setup (Recommended)

  1. Start Vespa Docker container:
docker run -m 8G --detach --name vespa --hostname vespa-container \
  --publish 8080:8080 --publish 19071:19071 \
  vespaengine/vespa:latest
  2. Wait for Vespa to be ready (check health):
curl http://localhost:8080/state/v1/health
  3. Deploy your Vespa application package:
    • Use main.py as a reference for the package definition
    • For semantic search, add a text_embedding tensor field to the schema
    • Feed documents with embeddings to Vespa
Note: The application can run in BM25-only mode without semantic embeddings. Simply use the "bm25" ranking profile.

3. Configure Environment Variables (Optional)

Create a .env file or set environment variables:

# Vespa Configuration
VESPA_URL=http://localhost
VESPA_PORT=8080
VESPA_RESULT_LIMIT=10
VESPA_MAX_RESULT_LIMIT=100
VESPA_DEFAULT_RANKING=fusion

# Embedding Configuration
VESPA_EMBEDDING_MODEL=all-MiniLM-L6-v2
VESPA_EMBEDDING_DEVICE=  # Leave empty for auto, or set to 'cpu', 'cuda', 'mps'

# Query Intelligence
GROQ_API_KEY=your_groq_api_key_here
GROQ_MODEL=openai/gpt-oss-120b
ENABLE_QUERY_INTELLIGENCE=true
ENABLE_DIVERSIFICATION=true

Important: The GROQ_API_KEY environment variable must be set for the query intelligence features (improvement, expansion, suggestions, intent classification). Plain search works without it.

Quick Setup:

# Windows PowerShell
$env:GROQ_API_KEY = "your_groq_api_key_here"

# Or create .env file
echo GROQ_API_KEY=your_groq_api_key_here > .env

πŸƒ Running the Application

Start the Server

python start_server.py

Or using uvicorn directly:

uvicorn ui_e2e:app --reload --port 8000

The server will start at: http://localhost:8000

πŸ“– Example Usage

Example 1: Basic Search via Web UI

  1. Open your browser and navigate to http://localhost:8000
  2. Enter a search query, e.g., "python machine learning tutorial"
  3. Select ranking mode (fusion, semantic, or bm25)
  4. Click "Search"
  5. View results with relevance scores, snippets, and URLs

Example 2: Search via API

# Basic search
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "python machine learning",
    "limit": 10,
    "ranking": "fusion"
  }'

Response:

{
  "query": "python machine learning",
  "hits": [
    {
      "id": "doc123",
      "url": "https://example.com/python-ml",
      "text": "Complete guide to Python machine learning...",
      "snippet": "Complete guide to Python machine learning...",
      "relevance": 0.8542
    }
  ],
  "returned": 10,
  "total_available": 150,
  "latency_ms": 45.2,
  "ranking_profile": "fusion",
  "embedding_model": "minilm"
}

Example 3: Search with Query Intelligence

# Search with query improvement and expansion
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "how learn python",
    "limit": 5,
    "use_query_improvement": true,
    "use_query_expansion": true,
    "ranking": "fusion"
  }'

Response includes:

  • query_improvement: Improved query with explanation
  • expanded_queries: List of query variations used
  • query_intent: Classified intent (FACTUAL, INFORMATIONAL, etc.)

Example 4: Get Autocomplete Suggestions

curl -X POST http://localhost:8000/suggestions \
  -H "Content-Type: application/json" \
  -d '{
    "partial_query": "python",
    "num_suggestions": 5
  }'

Response:

{
  "suggestions": [
    "python tutorial",
    "python programming",
    "python for beginners",
    "python data science",
    "python web development"
  ]
}

Example 5: View Analytics

# Get analytics for last 7 days
curl http://localhost:8000/analytics?days=7

# Get analytics for specific query
curl http://localhost:8000/analytics/query/python%20tutorial

# Get top clicked documents
curl http://localhost:8000/analytics/top-documents?limit=10

Example 6: Compare Embedding Models

curl -X POST http://localhost:8000/compare-models \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning algorithms"
  }'

Response:

{
  "query": "machine learning algorithms",
  "models": {
    "minilm": {
      "embedding": [0.123, -0.456, ...],
      "dimension": 384,
      "model_info": {
        "name": "all-MiniLM-L6-v2",
        "dim": 384,
        "description": "Fast, lightweight (default)"
      }
    },
    "mpnet": {
      "embedding": [0.234, -0.567, ...],
      "dimension": 768,
      "model_info": {
        "name": "all-mpnet-base-v2",
        "dim": 768,
        "description": "Higher quality, slower"
      }
    }
  }
}

Example 7: Improve Query with AI

curl -X POST http://localhost:8000/improve-query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "how learn python fast"
  }'

Response:

{
  "original": "how learn python fast",
  "improved": "how to learn python quickly",
  "explanation": "Added 'to' for proper grammar and changed 'fast' to 'quickly' for better clarity"
}

Example 8: Track Click on Search Result

curl -X POST http://localhost:8000/click \
  -H "Content-Type: application/json" \
  -d '{
    "search_id": 123,
    "document_id": "doc-456",
    "document_url": "https://example.com/article",
    "position": 1,
    "relevance_score": 0.95
  }'

Response:

{
  "status": "success"
}

Example 9: Export Search Results

# Export as JSON
curl -X POST "http://localhost:8000/export/results?format=json" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "python tutorial",
    "limit": 10
  }' --output results.json

# Export as CSV
curl -X POST "http://localhost:8000/export/results?format=csv" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "python tutorial",
    "limit": 10
  }' --output results.csv

Example 10: Get Available Embedding Models

curl http://localhost:8000/models

Response:

{
  "models": {
    "minilm": {
      "name": "all-MiniLM-L6-v2",
      "dim": 384,
      "description": "Fast, lightweight (default)"
    },
    "mpnet": {
      "name": "all-mpnet-base-v2",
      "dim": 768,
      "description": "Higher quality, slower"
    },
    "bge": {
      "name": "BAAI/bge-base-en-v1.5",
      "dim": 768,
      "description": "State-of-the-art quality"
    }
  },
  "default": "minilm"
}

Example 11: Search with Custom Embedding Model

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "deep learning neural networks",
    "limit": 5,
    "ranking": "semantic",
    "embedding_model": "mpnet"
  }'

Example 12: Search with Result Diversification

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "python web frameworks",
    "limit": 10,
    "use_diversification": true,
    "diversification_method": "mmr",
    "ranking": "fusion"
  }'

πŸ’» Python Code Examples

Example 13: Using the Search API with Python

import requests

# Basic search
response = requests.post(
    "http://localhost:8000/search",
    json={
        "query": "machine learning",
        "limit": 10,
        "ranking": "fusion"
    }
)

data = response.json()
print(f"Found {data['returned']} results")
for hit in data['hits']:
    print(f"- {hit['url']} (relevance: {hit['relevance']})")

Example 14: Search with Query Intelligence

import requests

response = requests.post(
    "http://localhost:8000/search",
    json={
        "query": "how to use python",
        "limit": 5,
        "use_query_improvement": True,
        "use_query_expansion": True,
        "ranking": "fusion"
    }
)

data = response.json()

# Check if query was improved
if "query_improvement" in data:
    print(f"Original: {data['query_improvement']['original']}")
    print(f"Improved: {data['query_improvement']['improved']}")
    print(f"Explanation: {data['query_improvement']['explanation']}")

# Check expanded queries
if "expanded_queries" in data:
    print(f"Expanded queries: {data['expanded_queries']}")

# Check query intent
if "query_intent" in data:
    print(f"Intent: {data['query_intent']['intent']}")
    print(f"Confidence: {data['query_intent']['confidence']}")

Example 15: Get Analytics Data

import requests

# Get overall analytics
analytics = requests.get("http://localhost:8000/analytics?days=7").json()
print(f"Total searches: {analytics['total_searches']}")
print(f"Total clicks: {analytics['total_clicks']}")
print(f"CTR: {analytics['click_through_rate']}%")
print(f"Avg latency: {analytics['avg_latency_ms']}ms")

# Get top queries
for query in analytics['top_queries'][:5]:
    print(f"- {query['query']}: {query['search_count']} searches")

# Get query-specific analytics
query_stats = requests.get(
    "http://localhost:8000/analytics/query/python%20tutorial"
).json()
print(f"\nQuery 'python tutorial':")
print(f"  Searches: {query_stats['total_searches']}")
print(f"  Clicks: {query_stats['total_clicks']}")
print(f"  Avg latency: {query_stats['avg_latency_ms']}ms")

Example 16: Track User Clicks

import requests

# Perform a search first
search_response = requests.post(
    "http://localhost:8000/search",
    json={"query": "python tutorial", "limit": 5}
)
search_data = search_response.json()
search_id = search_data['search_id']

# Track a click on the first result
click_response = requests.post(
    "http://localhost:8000/click",
    json={
        "search_id": search_id,
        "document_id": search_data['hits'][0]['id'],
        "document_url": search_data['hits'][0]['url'],
        "position": 1,
        "relevance_score": search_data['hits'][0]['relevance']
    }
)
print(f"Click tracked: {click_response.json()}")

Example 17: Get Autocomplete Suggestions

import requests

response = requests.post(
    "http://localhost:8000/suggestions",
    json={
        "partial_query": "python",
        "num_suggestions": 5
    }
)

suggestions = response.json()['suggestions']
print("Suggestions:")
for suggestion in suggestions:
    print(f"  - {suggestion}")

Example 18: Compare Different Embedding Models

import requests

response = requests.post(
    "http://localhost:8000/compare-models",
    json={"query": "natural language processing"}
)

data = response.json()
print(f"Query: {data['query']}\n")

for model_key, model_data in data['models'].items():
    if 'error' not in model_data:
        print(f"{model_key}:")
        print(f"  Model: {model_data['model_info']['name']}")
        print(f"  Dimension: {model_data['dimension']}")
        print(f"  Description: {model_data['model_info']['description']}\n")

🌐 Web UI Examples

Example 19: Using the Web Interface

  1. Start the server:

    python start_server.py
  2. Open your browser: Navigate to http://localhost:8000

  3. Basic Search:

    • Type your query in the search box
    • Select ranking mode (fusion, semantic, or bm25)
    • Click "Search"
    • Results appear with relevance scores and snippets
  4. Advanced Features:

    • Query Improvement: Check "✨ Improve query with AI" to automatically fix grammar and improve your query
    • Query Expansion: Check "πŸ” Expand query" to generate query variations
    • Diversification: Check "🎯 Diversify results" to reduce redundant results
    • Model Selection: Choose different embedding models from the dropdown
  5. Analytics Dashboard:

    • Click "πŸ“Š Analytics Dashboard" to view:
      • Total searches and clicks
      • Click-through rate
      • Top queries
      • Query performance metrics
      • Popular documents
  6. Export Results:

    • Click "πŸ“₯ Export Results" to download search results as JSON or CSV

πŸ”Œ API Endpoints

| Method | Endpoint | Description |
| ------ | -------- | ----------- |
| GET | / | Web UI homepage |
| GET | /health | Health check endpoint |
| POST | /search | Perform search query |
| POST | /suggestions | Get autocomplete suggestions |
| POST | /improve-query | Improve query using AI |
| POST | /click | Track click on search result |
| GET | /analytics | Get search analytics |
| GET | /analytics/query/{query} | Get query-specific analytics |
| GET | /analytics/top-documents | Get top clicked documents |
| POST | /export/results | Export search results (JSON/CSV) |
| GET | /models | Get available embedding models |
| POST | /compare-models | Compare embeddings from different models |

πŸ“ Project Structure

NeuralSearch/
β”œβ”€β”€ main.py                      # Vespa application package definition
β”œβ”€β”€ start_server.py              # Server startup script
β”œβ”€β”€ ui_e2e.py                    # FastAPI application with all endpoints
β”œβ”€β”€ analytics.py                 # Analytics and tracking module
β”œβ”€β”€ query_intelligence.py        # AI-powered query features (Groq)
β”œβ”€β”€ multi_model_ensemble.py      # Multi-model embedding support
β”œβ”€β”€ result_diversification.py    # Result diversification algorithms
β”œβ”€β”€ pyproject.toml              # Project configuration
β”œβ”€β”€ requirements.txt            # Python dependencies
β”œβ”€β”€ search_analytics.db         # SQLite database for analytics
β”œβ”€β”€ templates/
β”‚   └── index_e2e.html         # Web UI template
└── static/
    β”œβ”€β”€ app_e2e.js             # Frontend JavaScript
    └── styles_e2e.css         # Stylesheet

πŸ”¬ Technical Overview

System Architecture

NeuralSearch follows a layered microservices architecture with clear separation of concerns:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Frontend Layer (Web UI)                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚   HTML/CSS   β”‚  β”‚  JavaScript  β”‚  β”‚   Templates  β”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            ↕ HTTP/REST
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  API Layer (FastAPI)                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  ui_e2e.py - Main Application Server                    β”‚ β”‚
β”‚  β”‚  - Request Routing & Validation                         β”‚ β”‚
β”‚  β”‚  - Response Formatting                                  β”‚ β”‚
β”‚  β”‚  - Error Handling                                      β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            ↕
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Business Logic Layer (Python Modules)            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚   Query      β”‚  β”‚   Multi-     β”‚  β”‚   Result     β”‚     β”‚
β”‚  β”‚ Intelligence β”‚  β”‚   Model      β”‚  β”‚ Diversifi-   β”‚     β”‚
β”‚  β”‚   (Groq)     β”‚  β”‚  Ensemble    β”‚  β”‚   cation     β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚  β”‚   Analytics   β”‚  β”‚   Embedding  β”‚                       β”‚
β”‚  β”‚   (SQLite)    β”‚  β”‚   Models     β”‚                       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            ↕
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Data Layer (Vespa + SQLite)                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚    Vespa     β”‚              β”‚   SQLite     β”‚            β”‚
β”‚  β”‚  Search      β”‚              β”‚  Analytics   β”‚            β”‚
β”‚  β”‚  Engine      β”‚              β”‚   Database   β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Component Breakdown

1. API Server (ui_e2e.py)

Technology: FastAPI (Python 3.10+)

Key Features:

  • Async Request Handling: Non-blocking I/O for concurrent requests
  • Pydantic Models: Type-safe request/response validation
  • Session Management: Cached Vespa connections with lazy initialization
  • Error Handling: Graceful fallback to mock mode when Vespa unavailable
  • CORS Support: Cross-origin resource sharing enabled

Core Functions:

# Connection Management
get_vespa_client() -> Vespa          # Creates Vespa client (cached)
get_vespa_session() -> VespaSync     # Returns active session (lazy-loaded)
close_vespa_session() -> None       # Cleanup on shutdown

# Query Processing
run_vespa_query() -> Dict            # Executes search with Vespa
_get_mock_results() -> Dict          # Fallback when Vespa unavailable
_format_hit() -> Dict                # Normalizes Vespa response format

API Endpoints:

  • POST /search - Main search endpoint with query intelligence
  • POST /suggestions - Autocomplete suggestions
  • POST /improve-query - Query improvement service
  • POST /click - Click tracking for analytics
  • GET /analytics - Analytics dashboard data
  • GET /models - Available embedding models
  • POST /compare-models - Model comparison utility
  • GET /health - Health check with Vespa status

2. Query Intelligence (query_intelligence.py)

Technology: Groq LLM API (openai/gpt-oss-120b)

Architecture:

  • Cached Client: @lru_cache decorator for singleton Groq client
  • Prompt Engineering: Structured prompts for consistent outputs
  • Error Resilience: Fallback to original query on API failures
  • Response Parsing: Robust text parsing with cleanup logic
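
The cleanup logic above can be sketched as follows. This is a minimal illustration of the kind of parsing an LLM list response needs, not the repo's actual code; the function name `clean_llm_lines` is hypothetical.

```python
import re

def clean_llm_lines(raw: str) -> list[str]:
    """Strip numbering, bullets, and quotes that LLMs often add to list output."""
    lines = []
    for line in raw.splitlines():
        # Drop leading "1." / "2)" / "-" / "*" / "•" prefixes
        line = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line)
        line = line.strip().strip('"').strip()
        if line:
            lines.append(line)
    return lines
```

A response like `1. "python tutorial"\n- learn python` then yields clean query variations ready to send to Vespa.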

Functions:

expand_query(query, num_variations=3):

  • Purpose: Generate query variations for better coverage
  • Algorithm: LLM-based synonym and rephrasing generation
  • Output: List of query variations including original
  • Parameters:
    • temperature=0.8 (creativity)
    • max_tokens=200 (response length)
    • top_p=0.9 (nucleus sampling)

improve_query(query):

  • Purpose: Fix grammar, add keywords, clarify intent
  • Algorithm: LLM-based query rewriting
  • Output: {original, improved, explanation}
  • Parameters: temperature=0.3 (more deterministic)

get_query_suggestions(partial_query, num_suggestions=5):

  • Purpose: Autocomplete suggestions
  • Algorithm: Context-aware completion generation
  • Output: List of complete query suggestions

classify_query_intent(query):

  • Purpose: Classify search intent
  • Categories: FACTUAL, NAVIGATIONAL, INFORMATIONAL, TRANSACTIONAL
  • Output: {intent, confidence, reasoning}

3. Multi-Model Embeddings (multi_model_ensemble.py)

Technology: Sentence Transformers

Supported Models:

| Model Key | Model Name | Dimensions | Speed | Quality |
| --------- | ---------- | ---------- | ----- | ------- |
| minilm | all-MiniLM-L6-v2 | 384 | Fast (~50ms) | Good |
| mpnet | all-mpnet-base-v2 | 768 | Moderate (~150ms) | High |
| bge | BAAI/bge-base-en-v1.5 | 768 | Moderate (~150ms) | SOTA |

Implementation Details:

  • Model Caching: @lru_cache(maxsize=3) prevents reloading
  • Device Support: Automatic CPU/GPU/MPS detection
  • Encoding: encode_query(query, model_key) returns normalized vector
  • Ensemble Methods: Average or concatenate multiple model outputs
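
The averaging variant of the ensemble can be sketched as below (illustrative only, not the module's actual code). Note that averaging assumes all vectors share one dimension; mixing MiniLM (384) with MPNet (768) would require the concatenation method instead.

```python
from math import sqrt

def ensemble_average(embeddings: list[list[float]]) -> list[float]:
    """Average same-dimension embeddings and re-normalize to unit length."""
    dim = len(embeddings[0])
    avg = [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]
    norm = sqrt(sum(x * x for x in avg))
    return [x / norm for x in avg] if norm > 0 else avg
```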

Functions:

encode_query(query, model_key) -> List[float]      # Single model encoding
ensemble_encode(query, models, method) -> Dict     # Multi-model ensemble
compare_models(query, models) -> Dict              # Side-by-side comparison
get_model_info() -> Dict                            # Model metadata

4. Result Diversification (result_diversification.py)

Algorithm: Maximal Marginal Relevance (MMR)

MMR Formula:

MMR = argmax_{d ∈ R \ S} [ Ξ» Γ— Sim(d, q) βˆ’ (1βˆ’Ξ») Γ— max_{di ∈ S} Sim(d, di) ]

Where:

  • R: candidate results; S: results already selected
  • Ξ» (lambda_param): Balance between relevance (1.0) and diversity (0.0)
  • Sim(d, q): Relevance score of document d to query q
  • Sim(d, di): Similarity between candidate d and selected document di

Implementation:

  • Text Similarity: Jaccard similarity (word overlap) for efficiency
  • Relevance Normalization: Normalized by top result's relevance
  • Greedy Selection: Iteratively selects best MMR-scored candidates

Functions:

diversify_results_mmr(hits, lambda_param, max_results) -> List
diversify_by_domain(hits, max_per_domain, max_results) -> List
combine_diversification(hits, method, lambda_param) -> List

Diversification Methods:

  1. MMR: Reduces redundancy based on content similarity
  2. Domain: Ensures no single domain dominates results
  3. Both: Applies domain filtering then MMR
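
The domain method can be sketched as below. This mirrors the diversify_by_domain signature listed above under stated assumptions: hits are dicts with a "url" key and arrive sorted by relevance.

```python
from urllib.parse import urlparse
from collections import Counter

def diversify_by_domain(hits, max_per_domain=2, max_results=10):
    """Keep at most max_per_domain results from any single host."""
    seen = Counter()
    out = []
    for hit in hits:  # hits assumed sorted by descending relevance
        domain = urlparse(hit["url"]).netloc
        if seen[domain] < max_per_domain:
            seen[domain] += 1
            out.append(hit)
        if len(out) >= max_results:
            break
    return out
```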

5. Analytics System (analytics.py)

Technology: SQLite with row factory for dict-like access

Database Schema:

searches Table:

CREATE TABLE searches (
    id INTEGER PRIMARY KEY,
    query TEXT NOT NULL,
    query_hash TEXT NOT NULL,          -- MD5 hash for normalization
    ranking_profile TEXT,
    limit_results INTEGER,
    use_query_expansion BOOLEAN,
    use_query_improvement BOOLEAN,
    query_intent TEXT,
    num_results INTEGER,
    latency_ms REAL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    session_id TEXT
)

clicks Table:

CREATE TABLE clicks (
    id INTEGER PRIMARY KEY,
    search_id INTEGER,                 -- Foreign key to searches
    document_id TEXT NOT NULL,
    document_url TEXT,
    position INTEGER,
    relevance_score REAL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
)

query_metrics Table (Aggregated):

CREATE TABLE query_metrics (
    id INTEGER PRIMARY KEY,
    query_hash TEXT UNIQUE,
    total_searches INTEGER DEFAULT 0,
    total_clicks INTEGER DEFAULT 0,
    avg_latency_ms REAL,
    avg_results INTEGER,
    last_searched DATETIME
)

Indexes:

  • idx_searches_timestamp - Time-based queries
  • idx_searches_query_hash - Query lookup
  • idx_clicks_search_id - Join optimization
  • idx_clicks_timestamp - Time-based analytics

Analytics Functions:

log_search(...) -> int                # Logs search, returns search_id
log_click(...) -> None                # Tracks user clicks
get_search_analytics(days) -> Dict    # Overall analytics
get_query_performance(query) -> Dict  # Query-specific metrics
get_top_clicked_documents(limit) -> List
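
A minimal sketch of how log_search might persist a row and hand back the search_id (the schema is abbreviated here, and this is illustrative rather than the module's actual code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")      # the app uses search_analytics.db on disk
conn.row_factory = sqlite3.Row          # dict-like row access, as described above
conn.execute("""CREATE TABLE IF NOT EXISTS searches (
    id INTEGER PRIMARY KEY,
    query TEXT NOT NULL,
    latency_ms REAL,
    num_results INTEGER)""")

def log_search(query: str, latency_ms: float, num_results: int) -> int:
    """Insert one search event; the returned rowid serves as search_id."""
    cur = conn.execute(  # parameterized placeholders guard against SQL injection
        "INSERT INTO searches (query, latency_ms, num_results) VALUES (?, ?, ?)",
        (query, latency_ms, num_results))
    conn.commit()
    return cur.lastrowid
```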

Query Normalization:

  • Uses MD5 hash of lowercased, stripped query
  • Enables case-insensitive query grouping
  • Prevents duplicate metrics for similar queries
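
The normalization step amounts to one line; the helper name here is illustrative:

```python
import hashlib

def normalize_query_hash(query: str) -> str:
    """MD5 of the lowercased, stripped query, for case-insensitive grouping."""
    return hashlib.md5(query.strip().lower().encode("utf-8")).hexdigest()
```

"Python Tutorial" and "  python tutorial  " therefore map to the same query_hash and share one row in query_metrics.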

6. Vespa Integration (main.py + ui_e2e.py)

Vespa Application Package:

Schema Definition:

Schema(
    name="doc",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary"]),
            Field(name="text", type="string", 
                  indexing=["index", "summary"], index="enable-bm25"),
            Field(name="url", type="string",
                  indexing=["index", "summary"], index="enable-bm25"),
        ]
    )
)

Ranking Profiles:

  1. bm25_text_only: bm25(text)
  2. bm25_url_only: bm25(url)
  3. bm25: bm25(text) + 0.1 Γ— bm25(url) (default k1=1.2, b=0.75)
  4. bm25_comb_tuned: Same expression with tuned parameters:
    • Text: k1=1.8, b=0.40
    • URL: k1=0.9, b=0.30

BM25 Formula:

BM25(q, d) = Ξ£ IDF(qi) Γ— (f(qi, d) Γ— (k1 + 1)) / (f(qi, d) + k1 Γ— (1 βˆ’ b + b Γ— |d| / avgdl))
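
One summand of that formula, written out (Vespa computes this natively; this sketch just makes the parameter behavior concrete, with the term's IDF taken as given):

```python
def bm25_term(tf: float, doc_len: int, avgdl: float, idf: float,
              k1: float = 1.2, b: float = 0.75) -> float:
    """Score contribution of one query term: idf Γ— saturated, length-normalized tf."""
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))
```

With b=0 length normalization is disabled entirely, which is why the tuned profiles lower b for URLs (short fields) more than for body text.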

Query Construction:

query_body = {
    "yql": "select * from doc where userQuery()",
    "hits": limit,
    "query": query_text,
    "ranking": {"profile": ranking_profile},
    "input.query(q)": query_embedding,  # For semantic search
    "ranking.features.query(q)": query_embedding
}

Connection Management:

  • Lazy Loading: Session created on first query
  • Connection Test: 2-second socket timeout before attempting connection
  • Caching: Global _vespa_session variable for reuse
  • Error Handling: Falls back to mock mode on connection failure
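
The 2-second connection test can be sketched as a plain TCP probe (illustrative; the helper name is hypothetical):

```python
import socket

def vespa_reachable(host: str = "localhost", port: int = 8080,
                    timeout: float = 2.0) -> bool:
    """Cheap TCP probe before attempting to build a Vespa session."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

When this returns False, the application skips session creation and serves mock results instead.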

Data Flow

Search Request Flow

1. User Query
   ↓
2. FastAPI Endpoint (/search)
   β”œβ”€β†’ Request Validation (Pydantic)
   β”œβ”€β†’ Query Intelligence (if enabled)
   β”‚   β”œβ”€β†’ Query Improvement (Groq)
   β”‚   β”œβ”€β†’ Query Expansion (Groq)
   β”‚   └─→ Intent Classification (Groq)
   ↓
3. Embedding Generation (if semantic/fusion)
   β”œβ”€β†’ Model Selection (minilm/mpnet/bge)
   β”œβ”€β†’ Sentence Transformer Encoding
   └─→ Vector Normalization
   ↓
4. Vespa Query Construction
   β”œβ”€β†’ YQL Query Building
   β”œβ”€β†’ Ranking Profile Selection
   β”œβ”€β†’ Embedding Injection (if needed)
   └─→ Connection Check (2s timeout)
   ↓
5. Vespa Execution
   β”œβ”€β†’ BM25 Scoring (lexical)
   β”œβ”€β†’ Vector Similarity (semantic, if enabled)
   └─→ Result Ranking
   ↓
6. Result Processing
   β”œβ”€β†’ Hit Formatting
   β”œβ”€β†’ Diversification (MMR, if enabled)
   └─→ Limit Application
   ↓
7. Analytics Logging
   β”œβ”€β†’ Search Log Entry
   β”œβ”€β†’ Query Metrics Update
   └─→ Popular Queries Update
   ↓
8. Response Formatting
   β”œβ”€β†’ JSON Serialization
   β”œβ”€β†’ Metadata Addition
   └─→ HTTP Response

Click Tracking Flow

1. User Click Event
   ↓
2. Frontend JavaScript
   β”œβ”€β†’ Extract: search_id, document_id, position, relevance
   └─→ POST /click
   ↓
3. Backend Processing
   β”œβ”€β†’ Click Log Entry
   β”œβ”€β†’ Query Metrics Update (total_clicks++)
   └─→ Popular Queries Update
   ↓
4. Analytics Aggregation
   └─→ Real-time CTR calculation
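
The CTR shown in the dashboard is a simple ratio; a guard against zero searches is the only subtlety (sketch, not the repo's code):

```python
def click_through_rate(total_clicks: int, total_searches: int) -> float:
    """CTR as a percentage, guarding against division by zero."""
    return round(100.0 * total_clicks / total_searches, 2) if total_searches else 0.0
```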

Algorithms & Methods

1. BM25 (Best Matching 25)

Purpose: Lexical term-based ranking

Parameters:

  • k1: Term frequency saturation (default: 1.2, tuned: 1.8 for text, 0.9 for URL)
  • b: Length normalization (default: 0.75, tuned: 0.40 for text, 0.30 for URL)

Implementation: Native Vespa BM25 scoring

2. Semantic Search (Dense Vector)

Purpose: Meaning-based similarity matching

Process:

  1. Query β†’ Embedding (Sentence Transformer)
  2. Embedding β†’ Vespa vector search
  3. Cosine similarity calculation
  4. Result ranking by similarity score

Models: MiniLM (384d), MPNet (768d), BGE (768d)
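
Step 3 above, cosine similarity, reduces to a dot product once vectors are unit-normalized; written out in full:

```python
from math import sqrt

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```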

3. Hybrid Search (Fusion)

Purpose: Combine lexical and semantic signals

Scoring:

final_score = Ξ± Γ— BM25_score + Ξ² Γ— semantic_score

Where Ξ± and Ξ² are tunable weights (typically Ξ±=0.5, Ξ²=0.5)
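
A minimal sketch of the blend, assuming BM25 is first scaled into [0, 1] (BM25 scores are unbounded, while cosine similarity is already in [βˆ’1, 1]); the weights and normalization shown are illustrative, not the repo's exact fusion profile:

```python
def fusion_score(bm25: float, semantic: float,
                 alpha: float = 0.5, beta: float = 0.5,
                 bm25_max: float = 1.0) -> float:
    """Weighted blend of a scaled BM25 score and a semantic similarity score."""
    bm25_norm = bm25 / bm25_max if bm25_max else 0.0
    return alpha * bm25_norm + beta * semantic
```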

4. Maximal Marginal Relevance (MMR)

Purpose: Reduce result redundancy

Algorithm:

selected = [hits[0]]                    # always include the top-ranked result
remaining = list(hits[1:])
top_relevance = hits[0]["relevance"] or 1.0

def jaccard(a, b):                      # word-overlap similarity, per above
    wa, wb = set(a["text"].lower().split()), set(b["text"].lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

while remaining and len(selected) < max_results:
    best_candidate, best_mmr = None, float("-inf")
    for candidate in remaining:
        relevance = candidate["relevance"] / top_relevance
        max_sim = max(jaccard(candidate, s) for s in selected)
        mmr = lambda_param * relevance - (1 - lambda_param) * max_sim
        if mmr > best_mmr:
            best_mmr, best_candidate = mmr, candidate
    selected.append(best_candidate)
    remaining.remove(best_candidate)

Parameters:

  • Ξ» (lambda_param): Default 0.6 (60% relevance, 40% diversity)
  • Similarity Metric: Jaccard similarity (word overlap)

Performance Characteristics

Latency Breakdown

| Operation | Typical Latency | Notes |
| --------- | --------------- | ----- |
| Mock Mode | < 1 ms | No external dependencies |
| BM25 Search | 10-50 ms | Vespa only, no embeddings |
| Embedding Generation | 30-150 ms | Depends on model (MiniLM fastest) |
| Semantic Search | 50-200 ms | Embedding + Vespa vector search |
| Fusion Search | 50-250 ms | Both BM25 and semantic |
| Query Intelligence | 200-800 ms | Groq API call (network dependent) |
| Analytics Logging | 1-5 ms | SQLite write operation |

Throughput

  • Concurrent Requests: Limited by Vespa and embedding model capacity
  • Recommended: 10-50 requests/second (depending on ranking mode)
  • Bottlenecks:
    • Embedding generation (CPU-bound, single-threaded)
    • Vespa query processing (multi-threaded, scales with cores)
    • Groq API rate limits (external dependency)

Resource Usage

Memory:

  • Base application: ~200MB
  • Each embedding model: ~100-400MB (cached after first load)
  • Vespa container: ~2-4GB
  • SQLite database: Grows with usage (~1MB per 1000 searches)

CPU:

  • Embedding generation: Moderate (single-threaded per request)
  • Vespa: Multi-threaded, benefits from multiple cores
  • Query intelligence: Network I/O bound

Disk:

  • Models: ~500MB-2GB per model (downloaded once, cached)
  • Analytics DB: Grows linearly with usage

Security Considerations

Current Implementation

  1. API Keys: Environment variables (not hardcoded)
  2. Input Validation: Pydantic models validate all inputs
  3. SQL Injection: Parameterized queries in SQLite
  4. CORS: Enabled for development (configure for production)

Production Recommendations

  1. Authentication: Add API key or OAuth for protected endpoints
  2. Rate Limiting: Implement request throttling (e.g., slowapi)
  3. HTTPS: Use TLS/SSL certificates
  4. Input Sanitization: Apply additional validation to user queries
  5. Secrets Management: Use secure secret management services (e.g., AWS Secrets Manager, HashiCorp Vault)
  6. Database Encryption: Encrypt the analytics database at rest and restrict access
  7. Error Handling: Avoid exposing internal errors to clients
  8. Logging: Use structured logging without sensitive data

Deployment Architecture

Development Setup

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Browser   β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚ HTTP
β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  FastAPI (localhost)    β”‚
β”‚  Port: 8000             β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β”œβ”€β†’ Vespa (localhost:8080) [Optional]
       β”œβ”€β†’ Groq API (cloud)
       └─→ SQLite (local file)

Production Setup (Docker Compose)

services:
  vespa:
    image: vespaengine/vespa:latest
    ports: ["8080:8080", "19071:19071"]
    mem_limit: 8g
  
  neuralsearch-api:
    build: .
    ports: ["8000:8000"]
    environment:
      - GROQ_API_KEY=${GROQ_API_KEY}
      - VESPA_URL=http://vespa
      - VESPA_PORT=8080
    depends_on: [vespa]

Scalability

Horizontal Scaling:

  • FastAPI: Stateless, can run multiple instances behind load balancer
  • Vespa: Supports clustering for distributed search
  • Embedding Models: Can be offloaded to separate service (e.g., TensorFlow Serving)

Vertical Scaling:

  • Increase Vespa memory for larger indexes
  • GPU acceleration for embedding generation
  • Faster CPUs for BM25 processing

πŸ—οΈ Technical Architecture

System Overview

NeuralSearch is built on a microservices architecture with the following components:

  1. FastAPI Backend (ui_e2e.py): RESTful API server handling all search requests
  2. Vespa Search Engine: Distributed search engine for indexing and retrieval
  3. Sentence Transformers: Local embedding generation for semantic search
  4. Groq LLM API: Cloud-based query intelligence (expansion, improvement, suggestions)
  5. SQLite Analytics: Local database for tracking searches, clicks, and performance
  6. Modern Web UI: Vanilla JavaScript frontend with real-time search

Search Pipeline

User Query
    ↓
[Query Intelligence] β†’ Query Improvement/Expansion/Intent Classification
    ↓
[Embedding Generation] β†’ Multi-model semantic embeddings (optional)
    ↓
[Vespa Query] β†’ Hybrid search (BM25 + semantic or BM25-only)
    ↓
[Result Diversification] β†’ MMR-based deduplication (optional)
    ↓
[Analytics Logging] β†’ Track search, clicks, performance
    ↓
Formatted Results
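The stages above can be wired together as a plain-Python orchestration. All step functions below are hypothetical stand-ins for the real components, shown only to make the control flow concrete:

```python
def run_pipeline(query, improve, embed, vespa_search, diversify, log_search,
                 use_intelligence=True, use_diversification=True):
    """Hypothetical orchestration of the search pipeline stages."""
    if use_intelligence:
        query = improve(query)             # Query Intelligence
    embedding = embed(query)               # Embedding Generation (optional)
    hits = vespa_search(query, embedding)  # Hybrid Vespa query
    if use_diversification:
        hits = diversify(hits)             # MMR-based deduplication
    log_search(query, hits)                # Analytics Logging
    return hits                            # Formatted Results

# Toy stand-ins wire the stages together for a dry run
result = run_pipeline(
    "how learn python",
    improve=lambda q: "how to learn python",
    embed=lambda q: [0.0] * 384,
    vespa_search=lambda q, e: [{"id": "doc1"}, {"id": "doc1"}, {"id": "doc2"}],
    diversify=lambda hits: list({h["id"]: h for h in hits}.values()),
    log_search=lambda q, hits: None,
)
print(result)
```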

Hybrid Search Implementation

Fusion Mode combines:

  • BM25 (Lexical): Term frequency-based scoring on text and url fields
  • Semantic (Vector): Cosine similarity between query embedding and document embeddings
  • Combined Score: relevance = Ξ± Γ— BM25_score + Ξ² Γ— semantic_score

BM25 Formula:

BM25(q, d) = Ξ£ IDF(qi) Γ— (f(qi, d) Γ— (k1 + 1)) / (f(qi, d) + k1 Γ— (1 - b + b Γ— |d|/avgdl))

Where:

  • k1: Term frequency saturation parameter (default: 1.2, tuned: 1.8 for text, 0.9 for URL)
  • b: Length normalization parameter (default: 0.75, tuned: 0.40 for text, 0.30 for URL)
  • f(qi, d): Frequency of term qi in document d
  • IDF(qi): Inverse document frequency of term qi
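For intuition, the formula can be evaluated for a single term on a toy document. The smoothed IDF variant below is a common convention and an assumption here (the formula above leaves IDF unspecified); Vespa computes all of this natively:

```python
import math

def bm25_term(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    """One term's BM25 contribution:
    IDF(qi) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * |d| / avgdl))."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # smoothed IDF (assumption)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

# Term appears 3 times in a 100-word doc; corpus: 1000 docs, term occurs in 50
score_default = bm25_term(tf=3, doc_len=100, avg_doc_len=120, n_docs=1000, df=50)
score_tuned = bm25_term(tf=3, doc_len=100, avg_doc_len=120, n_docs=1000, df=50,
                        k1=1.8, b=0.40)
print(score_default, score_tuned)
```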

Result Diversification (MMR)

Maximal Marginal Relevance (MMR) algorithm:

MMR = argmax_{d ∈ R∖S} [Ξ» Γ— Sim(d, q) - (1-Ξ») Γ— max_{di ∈ S} Sim(d, di)]

Where:

  • Ξ» (lambda_param): Balance between relevance and diversity (default: 0.6)
  • R: Full candidate set; S: set of documents already selected
  • Sim(d, q): Relevance score of document d to query q
  • Sim(d, di): Similarity between document d and already selected document di

Mock Mode

When Vespa is unavailable, the system automatically switches to mock mode:

  • Generates realistic search results with working URLs (Wikipedia, Stack Overflow, GitHub, etc.)
  • Query-specific URLs that redirect to real search engines
  • Maintains API compatibility for frontend testing
  • Visual indicators in UI to distinguish mock results
  • Fast response times (< 1ms) for development/testing

Vespa Schema Structure

Current Schema (defined in main.py):

The application uses a document schema with the following structure:

Schema(
    name="doc",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary"]),
            Field(name="text", type="string", indexing=["index", "summary"], index="enable-bm25"),
            Field(name="url", type="string", indexing=["index", "summary"], index="enable-bm25"),
        ]
    ),
    fieldsets=[FieldSet(name="default", fields=["text", "url"])],
    rank_profiles=[...]  # 4 BM25 profiles defined
)

Field Descriptions:

  • id: Document identifier (summary only, not indexed for search)
  • text: Main content field (indexed with BM25, used for lexical search)
  • url: URL field (indexed with BM25, weighted 0.1x in combined profiles)

For Semantic/Fusion Search (requires schema modification):

To enable semantic search, add an embedding field to the schema:

Field(
    name="text_embedding",
    type="tensor<float>(x[384])",  # 384 for MiniLM, 768 for MPNet/BGE
    indexing=["attribute", "index"],
    attribute=["distance-metric: angular"]  # angular distance ranks by cosine similarity
)

Then update rank profiles to include semantic scoring:

RankProfile(
    name="fusion",
    functions=[
        Function(name="bm25score", expression="bm25(text) + 0.1 * bm25(url)"),
        Function(name="semanticscore", expression="closeness(field, text_embedding)")
    ],
    first_phase="bm25score + 0.7 * semanticscore"
)
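A query-side request against such a fusion profile might be assembled as below. The YQL string and parameter names follow Vespa's documented nearestNeighbor operator conventions, but they are an assumption here, not taken from the project code:

```python
def build_fusion_query(query_text, query_embedding, hits=10):
    """Hypothetical request body for a hybrid (BM25 + vector) Vespa query."""
    return {
        # nearestNeighbor retrieves vector candidates; userQuery() adds lexical matches
        "yql": "select * from doc where "
               "({targetHits:100}nearestNeighbor(text_embedding, q)) or userQuery()",
        "query": query_text,
        "ranking.profile": "fusion",
        "input.query(q)": query_embedding,  # must match the schema's tensor dimension
        "hits": hits,
    }

body = build_fusion_query("hybrid search", [0.0] * 384, hits=5)
```

The resulting dict can be passed to the pyvespa client as the query body.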

Database Schema

Analytics Database (search_analytics.db):

-- Searches table
CREATE TABLE searches (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query TEXT NOT NULL,
    query_hash TEXT NOT NULL,
    ranking_profile TEXT,
    limit_results INTEGER,
    use_query_expansion BOOLEAN,
    use_query_improvement BOOLEAN,
    query_intent TEXT,
    num_results INTEGER,
    latency_ms REAL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    session_id TEXT
);

-- Clicks table
CREATE TABLE clicks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    search_id INTEGER,
    document_id TEXT NOT NULL,
    document_url TEXT,
    position INTEGER,
    relevance_score REAL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);
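Given these two tables, a per-query click-through rate can be computed with a join. A sketch using Python's sqlite3, with the schemas trimmed to only the columns the query needs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE searches (id INTEGER PRIMARY KEY AUTOINCREMENT, query TEXT NOT NULL,
                       num_results INTEGER);
CREATE TABLE clicks (id INTEGER PRIMARY KEY AUTOINCREMENT, search_id INTEGER,
                     document_id TEXT NOT NULL, position INTEGER);
INSERT INTO searches (query, num_results) VALUES ('vespa', 10), ('vespa', 10), ('fastapi', 5);
INSERT INTO clicks (search_id, document_id, position) VALUES (1, 'doc1', 1);
""")

# CTR = searches that received at least one click / total searches, per query
rows = conn.execute("""
    SELECT s.query,
           COUNT(DISTINCT c.search_id) * 1.0 / COUNT(DISTINCT s.id) AS ctr
    FROM searches s LEFT JOIN clicks c ON c.search_id = s.id
    GROUP BY s.query
""").fetchall()
print(dict(rows))
```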

βš™οΈ Configuration

Ranking Profiles

The system supports multiple ranking profiles defined in main.py:

Application-Level Profiles:

  • fusion: Hybrid search combining BM25 + semantic (default, requires embeddings)
  • semantic: Dense vector search only (requires embeddings)
  • bm25: Lexical search only (works with current schema)

Vespa Schema Profiles (defined in main.py):

  1. bm25_text_only: BM25 scoring on text field only
    • Expression: bm25(text)
  2. bm25_url_only: BM25 scoring on url field only
    • Expression: bm25(url)
  3. bm25: Combined BM25 with default parameters
    • Expression: bm25(text) + 0.1 Γ— bm25(url)
    • Default k1=1.2, b=0.75
  4. bm25_comb_tuned: Combined BM25 with tuned parameters
    • Expression: bm25(text) + 0.1 Γ— bm25(url)
    • Text field: k1=1.8, b=0.40
    • URL field: k1=0.9, b=0.30

Embedding Models

Three pre-trained models are supported via Sentence Transformers:

  • minilm (all-MiniLM-L6-v2):

    • Dimensions: 384
    • Speed: Fast (~50ms per query)
    • Quality: Good for general purpose
    • Use case: Default, production workloads
  • mpnet (all-mpnet-base-v2):

    • Dimensions: 768
    • Speed: Moderate (~150ms per query)
    • Quality: Higher accuracy
    • Use case: Quality-critical applications
  • bge (BAAI/bge-base-en-v1.5):

    • Dimensions: 768
    • Speed: Moderate (~150ms per query)
    • Quality: State-of-the-art
    • Use case: Research, high-accuracy requirements

Query Intelligence Features

Powered by Groq LLM (openai/gpt-oss-120b):

  • Query Improvement:

    • Fixes grammar errors
    • Adds missing keywords
    • Clarifies ambiguous intent
    • Returns improved query with explanation
  • Query Expansion:

    • Generates 3-5 alternative query variations
    • Uses synonyms and related terms
    • Maintains original intent
    • Used for multi-query retrieval
  • Query Suggestions (Autocomplete):

    • Generates 5-10 completion suggestions
    • Context-aware completions
    • Popular query patterns
  • Intent Classification:

    • FACTUAL: Seeking specific facts or data
    • NAVIGATIONAL: Looking for a specific website/page
    • INFORMATIONAL: Learning about a topic
    • TRANSACTIONAL: Wanting to perform an action

Dependencies

Core Dependencies (from requirements.txt):

  • fastapi>=0.111.0: Web framework for building APIs
  • uvicorn>=0.30.0: ASGI server for FastAPI
  • groq>=0.4.0: Groq LLM API client
  • sentence-transformers>=5.1.2: Embedding model library
  • pyvespa>=0.62.0: Vespa Python client
  • datasets>=4.4.1: Dataset loading utilities
  • pandas>=2.3.3: Data manipulation (for analytics)

System Requirements:

  • Python 3.10+
  • Docker (for Vespa)
  • 4GB+ RAM (8GB recommended for Vespa)
  • Internet connection (for Groq API and model downloads)

πŸ› Troubleshooting

Vespa Connection Issues

# Check if Vespa is running
curl http://localhost:8080/state/v1/health

# Check Vespa container
docker ps | grep vespa

Missing Dependencies

# Reinstall dependencies
pip install -r requirements.txt --upgrade

Database Issues

The analytics database (search_analytics.db) is created automatically. If you encounter issues, delete it and restart the server.

πŸ“Š API Request/Response Schemas

Search Request

{
  "query": "string (required)",
  "limit": "integer (optional, default: 10, max: 100)",
  "ranking": "string (optional: 'fusion'|'semantic'|'bm25', default: 'fusion')",
  "use_query_expansion": "boolean (optional, default: false)",
  "use_query_improvement": "boolean (optional, default: false)",
  "use_diversification": "boolean (optional, default: false)",
  "diversification_method": "string (optional: 'mmr', default: 'mmr')",
  "embedding_model": "string (optional: 'minilm'|'mpnet'|'bge', default: 'minilm')",
  "session_id": "string (optional, for analytics tracking)"
}

Search Response

{
  "query": "string",
  "hits": [
    {
      "id": "string",
      "document_id": "string",
      "vespa_document_id": "string",
      "url": "string",
      "text": "string",
      "snippet": "string",
      "relevance": "float (0.0-1.0)",
      "fields": {}
    }
  ],
  "returned": "integer",
  "limit": "integer",
  "total_available": "integer",
  "latency_ms": "float",
  "coverage": {},
  "ranking_profile": "string",
  "embedding_model": "string",
  "search_id": "integer (for click tracking)",
  "query_improvement": {
    "original": "string",
    "improved": "string",
    "explanation": "string"
  },
  "expanded_queries": ["string"],
  "query_intent": {
    "intent": "FACTUAL|NAVIGATIONAL|INFORMATIONAL|TRANSACTIONAL",
    "confidence": "float"
  },
  "mock_mode": "boolean (true if Vespa unavailable)"
}

Error Responses

400 Bad Request:

{
  "detail": "Error message describing the issue"
}

503 Service Unavailable (Vespa connection):

  • Automatically falls back to mock mode
  • Response includes "mock_mode": true

πŸ“ Notes

  • Mock Mode: The application automatically works in mock mode when Vespa is not running, returning demo results with working URLs
  • BM25 Mode: Works immediately with the current schema in main.py (no embeddings required)
  • Semantic/Fusion Mode: Requires text_embedding tensor field in Vespa schema (not included in current main.py)
  • Model Downloads: First-time embedding model downloads may take 2-5 minutes (models are cached locally)
  • Groq API Key: Must set GROQ_API_KEY environment variable (raises error if missing)
  • Analytics: Data stored in SQLite database (search_analytics.db), created automatically on first run
  • Connection Handling: Vespa connection is lazy-loaded with 2-second timeout to prevent hangs
  • Caching: Embedding models and Groq client are cached using @lru_cache for performance
  • Thread Safety: SQLite connection uses check_same_thread=False for FastAPI compatibility
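The @lru_cache note can be illustrated with a small sketch. The loader below is a stand-in for the project's actual model-loading code, with a counter added to show the cache working:

```python
from functools import lru_cache

LOAD_COUNT = {"n": 0}

@lru_cache(maxsize=None)
def get_model(name: str):
    """Load a (stand-in) embedding model once; later calls reuse the cached object."""
    LOAD_COUNT["n"] += 1
    return {"model": name}  # placeholder for SentenceTransformer(name)

a = get_model("all-MiniLM-L6-v2")
b = get_model("all-MiniLM-L6-v2")  # cache hit: loader does not run again
```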

πŸ₯ Health Check

Check if the server is running:

curl http://localhost:8000/health

Response:

{
  "status": "healthy",
  "vespa_available": false,
  "version": "0.3.0"
}

πŸ§ͺ Testing

Manual Testing

# Test health endpoint
curl http://localhost:8000/health

# Test search endpoint
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "limit": 5}'

# Test query intelligence
curl -X POST http://localhost:8000/improve-query \
  -H "Content-Type: application/json" \
  -d '{"query": "how learn python"}'

Integration Testing

The application supports testing in mock mode without Vespa:

  • All endpoints work with mock data
  • Query intelligence requires Groq API key
  • Analytics database is created automatically

πŸš€ Deployment

Docker Deployment (Recommended)

  1. Build Docker image:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "ui_e2e:app", "--host", "0.0.0.0", "--port", "8000"]
  2. Docker Compose (with Vespa):
version: '3.8'
services:
  vespa:
    image: vespaengine/vespa:latest
    ports:
      - "8080:8080"
      - "19071:19071"
    mem_limit: 8g
  
  search-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - GROQ_API_KEY=${GROQ_API_KEY}
      - VESPA_URL=http://vespa
      - VESPA_PORT=8080
    depends_on:
      - vespa

Cloud Deployment

  • AWS: Deploy on EC2, ECS, or Lambda (with container support)
  • Google Cloud: Cloud Run or Compute Engine
  • Azure: Container Instances or App Service
  • Heroku: Use buildpacks or container registry

Environment Variables (set in cloud platform):

  • GROQ_API_KEY: Required for query intelligence
  • VESPA_URL: Vespa endpoint URL
  • VESPA_PORT: Vespa port (default: 8080)
  • VESPA_EMBEDDING_MODEL: Embedding model selection
  • ENABLE_QUERY_INTELLIGENCE: Enable/disable AI features
  • ENABLE_DIVERSIFICATION: Enable/disable result diversification

πŸ“„ License

This project is provided as-is for educational and development purposes.

🀝 Contributing

Contributions are welcome! Areas for improvement:

  • Additional embedding models
  • More ranking profiles
  • Enhanced analytics dashboard
  • Performance optimizations
  • Test coverage
  • Documentation improvements
