EngineDB is a high-performance semantic search engine that processes Wikipedia articles and enables fast similarity-based search using neural embeddings and approximate nearest-neighbor indexing. A working demo is available, with limited usage; please allow the backend server a few minutes to warm up and come online. You can view the project here.
EngineDB combines modern NLP techniques with efficient database technologies to create a searchable knowledge base:
- Text Embedding: Uses the all-MiniLM-L6-v2 model (384-dimensional embeddings) via ONNX Runtime for fast CPU inference
- Vector Storage: PostgreSQL with HNSW (Hierarchical Navigable Small World) indexing for efficient approximate nearest-neighbor search (a minimal sketch follows this list)
- Data Pipeline: Automated parsing of Wikipedia XML dumps to JSON, with batch processing and asynchronous embedding
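As a rough illustration of the nearest-neighbor piece, here is a minimal sketch using the vendored hnswlib; the vector values are placeholders, and inner-product distance is assumed because the embeddings are L2-normalized:

```cpp
#include "hnswlib/hnswlib.h"
#include <vector>

int main() {
    const int dim = 384;              // all-MiniLM-L6-v2 embedding size
    const size_t maxElements = 1000;  // capacity for this toy index

    // On L2-normalized vectors, inner-product distance ranks by cosine similarity.
    hnswlib::InnerProductSpace space(dim);
    hnswlib::HierarchicalNSW<float> index(&space, maxElements);

    // Insert one placeholder embedding under label 0.
    std::vector<float> vec(dim, 0.0f);
    vec[0] = 1.0f;
    index.addPoint(vec.data(), 0);

    // k-NN search returns a max-heap of (distance, label) pairs.
    auto result = index.searchKnn(vec.data(), 1);
    auto [dist, label] = result.top();  // best match: distance ~0, label 0
    return 0;
}
```

Inner-product space is used here because, on already-normalized embeddings, it is equivalent to ranking by cosine similarity.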
Project structure:
```
EngineDB/
├── main.cpp # Entry point with interactive CLI
├── ArticleParser.cpp/h # Parses JSON files and coordinates batch processing
├── VectorStorage.cpp/h # Manages PostgreSQL storage and HNSW indexing
├── ONNXEmbedder.cpp/h # Text embedding using ONNX models
├── WordPieceTokenizer.cpp/h # Tokenization for embedding models
├── PageItem.h # Data structure for articles
├── Embedding.py # Script to export ONNX models
├── models/ # Pre-trained model files
│ ├── model.onnx # All-MiniLM-L6-v2 in ONNX format
│ └── vocab.txt # Tokenizer vocabulary
├── Data/ # Data pipeline and processing
│ ├── WikipediaParse.py # Wikipedia XML dump parser
│ ├── output/ # Parsed JSON output (created by WikipediaParse.py)
│ └── README.md # Data preparation instructions
├── hnswlib/ # HNSW library for nearest-neighbor search
├── packages/ # NuGet packages (ONNX Runtime)
├── vcpkg_installed/ # C++ dependencies (vcpkg)
└── venv310/ # Python virtual environment
```
ArticleParser
- Reads JSON files containing parsed Wikipedia articles
- Processes articles in configurable batch sizes (default: 250)
- Uses multi-threaded processing for efficient embedding (see the sketch after this list)
- Coordinates with VectorStorage to store embeddings
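A hypothetical sketch of that batching pattern (not the project's actual code): workers claim fixed-size batches through an atomic cursor, so each batch is embedded exactly once.

```cpp
#include <algorithm>
#include <atomic>
#include <string>
#include <thread>
#include <vector>

void processAll(const std::vector<std::string>& articles,
                size_t batchSize, size_t maxThreads) {
    std::atomic<size_t> next{0};
    auto worker = [&] {
        for (;;) {
            // Claim the next batch atomically; no lock needed.
            size_t start = next.fetch_add(batchSize);
            if (start >= articles.size()) break;
            size_t end = std::min(start + batchSize, articles.size());
            // Here the real code would run ONNXEmbedder over
            // articles[start..end) and hand the vectors to VectorStorage.
            (void)end;
        }
    };
    std::vector<std::thread> pool;
    for (size_t i = 0; i < maxThreads; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}
```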
VectorStorage
- Manages all interaction with PostgreSQL database
- Handles HNSW index creation and management
- Provides semantic search functionality
- Stores article metadata (title, description, link)
- Supports configurable embedding dimensions (384-dim by default)
- Can handle up to 2 million vectors in the HNSW index (a storage sketch follows below)
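For the metadata side, a minimal libpqxx sketch; the `articles` table and its columns are assumptions, not the project's actual schema. Returning the row id lets it double as the HNSW label, so search results map back to rows:

```cpp
#include <pqxx/pqxx>
#include <string>

int storeArticle(pqxx::connection& conn, const std::string& title,
                 const std::string& description, const std::string& link) {
    pqxx::work txn(conn);
    // Insert metadata and get the generated id back in one round trip.
    pqxx::row r = txn.exec_params1(
        "INSERT INTO articles (title, description, link) "
        "VALUES ($1, $2, $3) RETURNING id",
        title, description, link);
    txn.commit();
    return r[0].as<int>();
}
```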
ONNXEmbedder
- Loads and runs the all-MiniLM-L6-v2 model via ONNX Runtime
- Processes variable-length text inputs
- Handles tokenization, padding, and truncation
- Returns normalized 384-dimensional vectors
- Supports batch embedding for efficiency (a single-pass sketch follows below)
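A sketch of a single forward pass through ONNX Runtime's C++ API; the tensor names (`input_ids`, `attention_mask`, `token_type_ids`, `last_hidden_state`) depend on how Embedding.py exported the model and should be treated as assumptions:

```cpp
#include <onnxruntime_cxx_api.h>
#include <array>
#include <vector>

std::vector<float> runOnce(Ort::Session& session,
                           std::vector<int64_t>& ids,
                           std::vector<int64_t>& mask,
                           std::vector<int64_t>& types) {
    auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::array<int64_t, 2> shape{1, static_cast<int64_t>(ids.size())};
    std::array<Ort::Value, 3> inputs{
        Ort::Value::CreateTensor<int64_t>(mem, ids.data(), ids.size(),
                                          shape.data(), shape.size()),
        Ort::Value::CreateTensor<int64_t>(mem, mask.data(), mask.size(),
                                          shape.data(), shape.size()),
        Ort::Value::CreateTensor<int64_t>(mem, types.data(), types.size(),
                                          shape.data(), shape.size())};
    const char* in[]  = {"input_ids", "attention_mask", "token_type_ids"};
    const char* out[] = {"last_hidden_state"};
    auto outputs = session.Run(Ort::RunOptions{nullptr}, in, inputs.data(), 3, out, 1);
    float* data = outputs[0].GetTensorMutableData<float>();
    // last_hidden_state is [1, seq_len, 384]; for brevity this copies the
    // first token's row. Real code would mean-pool over the attention mask
    // and L2-normalize to get the final 384-dim embedding.
    return std::vector<float>(data, data + 384);
}
```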
WordPieceTokenizer
- Implements WordPiece tokenization algorithm
- Loads vocabulary from `vocab.txt`
- Handles special tokens (CLS, SEP, PAD, UNK)
- Supports configurable max sequence length (default: 256); a sketch of the core algorithm follows below
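The core algorithm is greedy longest-match WordPiece; a minimal sketch for a single pre-normalized word (real code also enforces max input length and character cleanup):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Repeatedly take the longest vocabulary entry that prefixes the
// remaining word, marking continuations with "##"; fall back to
// [UNK] when no piece matches at some position.
std::vector<std::string> wordpiece(const std::string& word,
                                   const std::unordered_set<std::string>& vocab) {
    std::vector<std::string> pieces;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        std::string match;
        while (end > start) {
            std::string piece = word.substr(start, end - start);
            if (start > 0) piece = "##" + piece;  // continuation marker
            if (vocab.count(piece)) { match = piece; break; }
            --end;
        }
        if (match.empty()) return {"[UNK]"};
        pieces.push_back(match);
        start = end;
    }
    return pieces;
}
```

For example, with a vocabulary containing `un`, `##aff`, and `##able`, the word "unaffable" tokenizes to `un ##aff ##able`.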
WikipediaParse.py
- Extracts articles from compressed Wikipedia XML dumps
- Outputs structured JSON files (up to 10,000 articles per file)
- Skips redirect pages automatically
- Runs in the `Data/` directory
Embedding.py
- Exports the all-MiniLM-L6-v2 model to ONNX format
- Exports tokenizer vocabulary in the format expected by WordPieceTokenizer
- Creates the model files needed by the C++ application (see Data/README.md)
- Windows 10+ (or Linux/macOS with modifications)
- Visual Studio 2022 with C++ support (for building)
- PostgreSQL server running locally or accessible via network
- At least 4GB RAM for embedding processing
- ~10GB disk space for models and indexes
C++ Dependencies (managed via vcpkg):
- ONNX Runtime
- libpqxx (PostgreSQL C++ client)
- cpp-httplib
Python Dependencies:
- Python 3.10+
- sentence-transformers
- transformers
- pqxx (PostgreSQL connector)
See Data/README.md for complete data preparation instructions.
Summary:
```
cd Data
# Download Wikipedia dump to Data/wikiarticles.xml.bz2
# Then run:
python WikipediaParse.py
```
This generates JSON files in Data/output/ containing parsed articles.
Build C++ Project:
```
# In Visual Studio
# Build → Build Solution (or Ctrl+Shift+B)
```
Configure and Run:
Edit main.cpp to match your setup:
```cpp
pqxx::connection conn("host=localhost port=5432 dbname=vectorstore user=postgres password=YOUR_PASSWORD");

// Adjust these parameters as needed:
size_t batchSize  = 250;  // Articles processed per batch
size_t maxThreads = 8;    // Concurrent workers
int    maxPages   = 5000; // Total articles to process (-1 for all)
```
Run Application:
```
cd EngineDB\x64\Release
EngineDB.exe
```
The application presents an interactive menu:
```
Select an option:
1. Parse JSON files and store vectors
2. Search
3. Exit
```
Option 1 - Parse and Store:
- Reads JSON files from `Data/output/`
- Embeds each article using the ONNX model
- Stores embeddings and metadata in PostgreSQL
- Creates/updates HNSW index for fast search
- Shows progress and timing information
Option 2 - Search:
```
Search query (or 'exit'): neural networks in deep learning
```
- Enter search text (natural language)
- System finds most similar articles using semantic similarity
- Returns top results with scores
- `parsedJSONpath`: Path to JSON files from WikipediaParse.py (default: `./Data/output`)
- `batchSize`: Articles per processing batch (default: 250; higher is faster but uses more memory)
- `maxThreads`: Concurrent embedding workers (default: 8; adjust to your CPU core count)
- `maxPages`: Limit on total articles processed; -1 for all (default: 5000)
- `DIM`: Embedding dimension (default: 384, matching all-MiniLM-L6-v2 output)
- `MAX_ELEMENTS`: Maximum HNSW index capacity (default: 2,000,000); a construction sketch follows below
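These two constants feed directly into the hnswlib index construction; a sketch, where the `M` and `ef_construction` values shown are hnswlib's defaults rather than necessarily what the project uses:

```cpp
#include "hnswlib/hnswlib.h"

constexpr size_t DIM = 384;              // all-MiniLM-L6-v2 output size
constexpr size_t MAX_ELEMENTS = 2000000; // index capacity

hnswlib::InnerProductSpace space(DIM);
hnswlib::HierarchicalNSW<float> index(&space, MAX_ELEMENTS,
                                      /*M=*/16, /*ef_construction=*/200);
```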
```cpp
pqxx::connection conn("host=localhost port=5432 dbname=vectorstore user=postgres password=YOUR_PASSWORD");
```
- Wikipedia XML Dump → WikipediaParse.py → JSON Files (10K articles/file)
- JSON Files → ArticleParser → Embedding Queue
- Embedding Queue → ONNXEmbedder → 384-dim Vectors
- Vectors → VectorStorage → PostgreSQL + HNSW Index
- User enters search query
- Query text is embedded using the same ONNX model
- HNSW index performs approximate nearest neighbor search
- Top K most similar articles are retrieved from PostgreSQL
- Results are displayed with similarity scores (a sketch of this flow follows below)
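Hypothetical glue for this flow (the `articles` table and column names are assumptions, and the query vector is assumed to come from the same ONNXEmbedder path): take the top-K labels from the HNSW index, then resolve each one in PostgreSQL.

```cpp
#include "hnswlib/hnswlib.h"
#include <pqxx/pqxx>
#include <iostream>
#include <vector>

// Note: searchKnn returns a max-heap, so results pop worst-first;
// real code would collect and reverse them before display.
void search(hnswlib::HierarchicalNSW<float>& index, pqxx::connection& conn,
            const std::vector<float>& queryVec, size_t k) {
    auto heap = index.searchKnn(queryVec.data(), k);
    pqxx::work txn(conn);
    while (!heap.empty()) {
        auto [dist, label] = heap.top();
        heap.pop();
        pqxx::row r = txn.exec_params1(
            "SELECT title, link FROM articles WHERE id = $1",
            static_cast<long long>(label));
        // Inner-product "distance" is 1 - cosine; convert back to a score.
        std::cout << (1.0f - dist) << "  " << r["title"].c_str()
                  << "  " << r["link"].c_str() << "\n";
    }
}
```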
- Embedding: ~100-200 articles/second (batch processing)
- Search Latency: <100ms for approximate nearest-neighbor search over 2M vectors
- Memory: ~4GB for 2M articles' embeddings (~1.5GB for the HNSW index + ~2.5GB for PostgreSQL)
- Model Size: 91MB (all-MiniLM-L6-v2 ONNX format)
- Verify PostgreSQL is running: `psql -U postgres`
- Check the credentials in the connection string
- Ensure the database and table exist
- Test connectivity: `psql -h localhost -U postgres -d vectorstore`
- Verify `models/model.onnx` and `models/vocab.txt` exist
- Run Embedding.py to regenerate the model files (see Data/README.md)
- Check relative paths in ONNXEmbedder constructor
- Reduce `batchSize` in main.cpp
- Process in multiple runs using the `maxPages` limit
- Reduce `maxThreads` to lower peak memory usage
- Ensure batch processing is enabled
- Check that the CPU isn't being starved by disk I/O (SSD recommended)
- Verify ONNX Runtime isn't set to GPU mode