The OpenVault search system has been upgraded from a basic TF-IDF implementation to Whoosh, a pure-Python search library. The upgrade adds query parsing and better relevance ranking, and it is well suited to serverless environments like Vercel.
- Multi-field search: Searches across title, description, author, and content fields
- Field boosting: Title matches are weighted higher than other fields
- Query parsing: Supports advanced search syntax (AND, OR, quotes for phrases, etc.)
- Fuzzy matching: Better handling of typos and similar terms
- In-memory indexing: Uses RAM storage instead of disk files (perfect for Vercel)
- Dynamic index rebuilding: Automatically rebuilds index when content changes
- No persistent files: No need to worry about filesystem write permissions
- Change detection: Automatically detects when records have changed
- Fresh data fetching: Always fetches latest data from GitHub for searches
- Hash-based caching: Only rebuilds index when actual content changes
The search index includes these fields:
- title (TEXT, boosted): Main title field with higher search weight
- description (TEXT): Detailed descriptions
- author (TEXT): Author names
- content (TEXT): Combined searchable content
- team_number, years_used (KEYWORD): Structured data
- language, awards_won, used_in_comp (KEYWORD/TEXT): Category-specific fields
- Index Building: Creates in-memory index from current records
- Change Detection: Uses MD5 hash to detect if records have changed
- Query Parsing: Parses user queries using Whoosh's MultifieldParser
- Result Ranking: Returns results ranked by Whoosh's BM25F scoring algorithm
POST /api/search
{
"query": "search terms here"
}
- Automatically fetches fresh data from GitHub
- Rebuilds index if content has changed
- Returns HTML template with filtered results
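As a rough illustration, the endpoint could be called from a Python script as shown below. The host in `BASE_URL` is a placeholder, the JSON `Content-Type` header is an assumption based on the request body shown above, and both helper functions are hypothetical.

```python
import json
import urllib.request

# Placeholder host: replace with the real deployment URL.
BASE_URL = "https://example.invalid"


def build_search_request(query, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/search."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/search",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def search_html(query, base_url=BASE_URL):
    """Send the request and return the rendered HTML as text."""
    with urllib.request.urlopen(build_search_request(query, base_url)) as response:
        return response.read().decode("utf-8")


request = build_search_request("intake mechanism")
print(request.get_method(), request.full_url)
```

Because the endpoint returns a rendered HTML template rather than JSON, `search_html` returns the response body as plain text.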
POST /api/refresh-search-index
- Manually refreshes the search index
- Useful after new content is added
- Forces complete index rebuild
robot design
Searches for documents containing "robot" OR "design"
"intake mechanism"
Searches for the exact phrase "intake mechanism"
title:drivetrain
Searches only in the title field for "drivetrain"
(robot OR drivetrain) AND author:teamname
Complex boolean queries with field specifications
- Faster searches: Optimized indexing and BM25F scoring
- Better relevance: More sophisticated ranking algorithm
- Efficient memory usage: Only rebuilds when necessary
- No disk I/O: Works perfectly on Vercel's read-only filesystem
- Stateless: Each request can rebuild index independently
- Scalable: Memory usage scales with content size
- Relevance ranking: Better results ordering
- Query flexibility: Supports complex search syntax
- Typo tolerance: Better handling of misspelled terms
- WhooshSearchEngine: Main search engine class
- Schema: Defines searchable fields and their types
- MultifieldParser: Handles complex query parsing
- Uses `RamStorage` for in-memory indexing
- Automatic garbage collection when index is rebuilt
- Hash-based change detection to minimize rebuilds
- Graceful fallback to showing all records on search errors
- Multiple parser fallbacks for malformed queries
- Comprehensive exception handling throughout
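The fallback chain described above might look something like the following sketch. It deliberately uses plain Python stand-ins instead of Whoosh parsers: `strict_search` and `lenient_search` are hypothetical strategies, and matching on `title` only is a simplification.

```python
def search_with_fallback(records, query_text, strategies):
    """Try each search strategy in order; show every record if all fail."""
    for strategy in strategies:
        try:
            return strategy(records, query_text)
        except Exception:
            # A real handler would log the error before moving on.
            continue
    return list(records)


def strict_search(records, query_text):
    """Hypothetical strict parser that rejects unbalanced quotes."""
    if query_text.count('"') % 2:
        raise ValueError("unbalanced quote")
    phrase = query_text.strip('"').lower()
    return [r for r in records if phrase in r["title"].lower()]


def lenient_search(records, query_text):
    """Hypothetical fallback that ignores quotes and matches any word."""
    words = query_text.replace('"', " ").lower().split()
    if not words:
        raise ValueError("empty query")
    return [r for r in records if any(w in r["title"].lower() for w in words)]


records = [{"title": "Swerve drivetrain"}, {"title": "Intake mechanism"}]
strategies = [strict_search, lenient_search]
print(search_with_fallback(records, '"intake mech', strategies))
print(len(search_with_fallback(records, '"', strategies)))
```

Ordering the strategies from strictest to most lenient means a well-formed query gets the most precise handling, while a malformed one still returns something useful.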
The system automatically handles content updates in three steps:
- Fresh Data Fetching: Every search fetches the latest data from GitHub
- Change Detection: Compares content hash to detect updates
- Index Rebuilding: Rebuilds index only when content actually changes
You can manually refresh the search index using:
fetch('/api/refresh-search-index', { method: 'POST' })

This is particularly useful:
- After adding new content through the contribute page
- After making changes to existing content
- When you want to ensure the search index is completely up to date
- ✅ No file system writes required
- ✅ Pure Python implementation
- ✅ Memory-efficient indexing
- ✅ Stateless operation
Added to requirements.txt:
Whoosh==2.7.4
No additional environment variables required. The system works out-of-the-box with the existing GitHub API integration.