OpenVault Search System - Whoosh Integration

Overview

The OpenVault search system has been upgraded from a basic TF-IDF implementation to Whoosh, a pure-Python search library. The upgrade adds query parsing and BM25F ranking, and the index is held in memory so that it runs in serverless environments such as Vercel.

Key Features

🔍 Advanced Search Capabilities

Multi-field search: Searches across title, description, author, and content fields
Field boosting: Title matches are weighted higher than other fields
Query parsing: Supports advanced search syntax (AND, OR, quotes for phrases, etc.)
Fuzzy matching: Better handling of typos and similar terms

🚀 Serverless Compatibility

In-memory indexing: Uses RAM storage instead of disk files (perfect for Vercel)
Dynamic index rebuilding: Automatically rebuilds index when content changes
No persistent files: No need to worry about filesystem write permissions

🔄 Automatic Content Updates

Change detection: Automatically detects when records have changed
Fresh data fetching: Always fetches latest data from GitHub for searches
Hash-based caching: Only rebuilds index when actual content changes

How It Works

Schema Design

The search index includes these fields:

title (TEXT, boosted): Main title field with higher search weight
description (TEXT): Detailed descriptions
author (TEXT): Author names
content (TEXT): Combined searchable content
team_number, years_used (KEYWORD): Structured data
language, awards_won, used_in_comp (KEYWORD/TEXT): Category-specific fields

Search Process

Index Building: Creates in-memory index from current records
Change Detection: Uses MD5 hash to detect if records have changed
Query Parsing: Parses user queries using Whoosh's MultifieldParser
Result Ranking: Returns results ranked by Whoosh's BM25F scoring algorithm
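
The change-detection step can be sketched with the standard library alone. The names `records_fingerprint` and `IndexCache` are hypothetical, and serializing the records as sorted JSON before hashing is an assumption about how the MD5 input is built.

```python
import hashlib
import json


def records_fingerprint(records):
    """Return an MD5 hex digest of the records in a canonical JSON form."""
    canonical = json.dumps(records, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


class IndexCache:
    """Remembers the fingerprint of the records the index was last built from."""

    def __init__(self):
        self.last_hash = None
        self.rebuild_count = 0

    def ensure_fresh(self, records, rebuild):
        """Call rebuild(records) only when the records differ from last time."""
        current = records_fingerprint(records)
        if current == self.last_hash:
            return False  # content unchanged, keep the existing index
        rebuild(records)
        self.last_hash = current
        self.rebuild_count += 1
        return True
```

Here `rebuild` stands in for whatever function recreates the in-memory index; passing it as a callable keeps the hashing logic independent of Whoosh.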

API Endpoints

Search API

POST /api/search
{
  "query": "search terms here"
}

Automatically fetches fresh data from GitHub
Rebuilds index if content has changed
Returns HTML template with filtered results
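
For scripted access, the request above might be built like this. It is a sketch under assumptions: the helper name `build_search_request` and the base URL are hypothetical, and the `Content-Type: application/json` header is inferred from the JSON body shown above rather than confirmed by the server code.

```python
import json
import urllib.request


def build_search_request(base_url, query):
    """Build (but do not send) a POST request for the search endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/search",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


# Sending the request would return the rendered HTML page, for example:
# with urllib.request.urlopen(build_search_request(base_url, "robot")) as resp:
#     html = resp.read().decode("utf-8")
```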

Refresh Index API

POST /api/refresh-search-index

Manually refreshes the search index
Useful after new content is added
Forces complete index rebuild

Search Query Examples

Basic Search

robot design

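Searches for documents containing "robot" OR "design". Whoosh's parser joins bare terms with AND by default, so this behavior relies on the parser being configured with an OR group.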

Phrase Search

"intake mechanism"

Searches for the exact phrase "intake mechanism"

Field-Specific Search

title:drivetrain

Searches only in the title field for "drivetrain"

Advanced Queries

(robot OR drivetrain) AND author:teamname

Complex boolean queries with field specifications

Benefits Over Previous System

Performance

Faster searches: Optimized indexing and BM25F scoring
Better relevance: More sophisticated ranking algorithm
Efficient rebuilds: The index is rebuilt only when the content has changed

Serverless Compatibility

No disk I/O: Works perfectly on Vercel's read-only filesystem
Stateless: Each request can rebuild the index independently
Scalable: Memory usage scales with content size

Search Quality

Relevance ranking: Better results ordering
Query flexibility: Supports complex search syntax
Typo tolerance: Better handling of misspelled terms

Technical Implementation

Key Classes

WhooshSearchEngine: Main search engine class
Schema: Defines searchable fields and their types
MultifieldParser: Handles complex query parsing

Memory Management

Uses RamStorage for in-memory indexing
Automatic garbage collection when index is rebuilt
Hash-based change detection to minimize rebuilds

Error Handling

Graceful fallback to showing all records on search errors
Multiple parser fallbacks for malformed queries
Comprehensive exception handling throughout
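
The fallback behavior described above can be sketched independently of Whoosh. Everything here is hypothetical: `search_with_fallback` and the two demo strategies only illustrate the pattern of trying parsers in order and returning all records when every attempt fails.

```python
def search_with_fallback(query_text, records, strategies):
    """Try each search strategy in order; if all fail, return every record."""
    for strategy in strategies:
        try:
            return strategy(query_text, records)
        except Exception:
            continue  # malformed query or engine error: try the next one
    return list(records)


def strict_search(query_text, records):
    """Demo strategy: rejects unbalanced quotes, then matches on the title."""
    if query_text.count('"') % 2:
        raise ValueError("unbalanced quote in query")
    needle = query_text.strip('"').lower()
    return [r for r in records if needle in r["title"].lower()]


def broken_search(query_text, records):
    """Demo strategy that always fails, standing in for an engine error."""
    raise RuntimeError("index unavailable")
```

Returning the full record list on failure means a broken query degrades to an unfiltered listing rather than an error page.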

Content Updates

Automatic Updates

The system automatically handles content updates in three steps:

Fresh Data Fetching: Every search fetches the latest data from GitHub
Change Detection: Compares content hash to detect updates
Index Rebuilding: Rebuilds index only when content actually changes

Manual Refresh

You can manually refresh the search index using:

fetch('/api/refresh-search-index', { method: 'POST' })

This is particularly useful:

After adding new content through the contribute page
After making changes to existing content
When you want to ensure the search index is completely up to date

Deployment Notes

Vercel Compatibility

✅ No file system writes required
✅ Pure Python implementation
✅ Memory-efficient indexing
✅ Stateless operation

Dependencies

Added to requirements.txt:

Whoosh==2.7.4

Environment Variables

No additional environment variables required. The system works out-of-the-box with the existing GitHub API integration.
