
EngineDB - Semantic Search Engine

EngineDB is a high-performance semantic search engine that processes Wikipedia articles and enables fast similarity-based searching using neural embeddings and approximate nearest neighbor indexing. A working demo with limited usage is available; please allow the backend server a few minutes to warm up and come online. You can view the project here.

Overview

EngineDB combines modern NLP techniques with efficient database technologies to create a searchable knowledge base:

  • Text Embedding: Uses the all-MiniLM-L6-v2 model (384-dimensional embeddings) via ONNX Runtime for fast CPU inference
  • Vector Storage: PostgreSQL with HNSW (Hierarchical Navigable Small World) indexing for efficient approximate nearest-neighbor search
  • Data Pipeline: Automated parsing of Wikipedia XML dumps to JSON, with batch processing and asynchronous embedding

Project Structure

EngineDB/
├── main.cpp                    # Entry point with interactive CLI
├── ArticleParser.cpp/h         # Parses JSON files and coordinates batch processing
├── VectorStorage.cpp/h         # Manages PostgreSQL storage and HNSW indexing
├── ONNXEmbedder.cpp/h          # Text embedding using ONNX models
├── WordPieceTokenizer.cpp/h    # Tokenization for embedding models
├── PageItem.h                  # Data structure for articles
├── Embedding.py                # Script to export ONNX models
├── models/                     # Pre-trained model files
│   ├── model.onnx              # all-MiniLM-L6-v2 in ONNX format
│   └── vocab.txt               # Tokenizer vocabulary
├── Data/                       # Data pipeline and processing
│   ├── WikipediaParse.py       # Wikipedia XML dump parser
│   ├── output/                 # Parsed JSON output (created by WikipediaParse.py)
│   └── README.md               # Data preparation instructions
├── hnswlib/                    # HNSW library for nearest-neighbor search
├── packages/                   # NuGet packages (ONNX Runtime)
├── vcpkg_installed/            # C++ dependencies (vcpkg)
└── venv310/                    # Python virtual environment

Components

Core C++ Components

ArticleParser

  • Reads JSON files containing parsed Wikipedia articles
  • Processes articles in configurable batch sizes (default: 250)
  • Uses multi-threaded processing for efficient embedding
  • Coordinates with VectorStorage to store embeddings
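
A minimal sketch of how this batch/thread coordination might look. PageItem is reduced to the fields named in this README, and embedAndStoreBatch is a hypothetical stand-in for the real ONNXEmbedder/VectorStorage calls:

#include <algorithm>
#include <atomic>
#include <string>
#include <thread>
#include <vector>

struct PageItem {                        // reduced to the fields named above
    std::string title, description, link;
};

// Hypothetical stand-in for the ONNXEmbedder + VectorStorage pipeline.
void embedAndStoreBatch(const std::vector<PageItem>& batch) { /* ... */ }

void processAll(const std::vector<PageItem>& items,
                size_t batchSize = 250, size_t maxThreads = 8) {
    // Pre-slice the articles into batches of batchSize.
    std::vector<std::vector<PageItem>> batches;
    for (size_t i = 0; i < items.size(); i += batchSize)
        batches.emplace_back(items.begin() + i,
                             items.begin() + std::min(i + batchSize, items.size()));

    // Each worker claims the next unprocessed batch until none remain.
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (size_t t = 0; t < maxThreads; ++t)
        workers.emplace_back([&] {
            for (size_t b; (b = next.fetch_add(1)) < batches.size(); )
                embedAndStoreBatch(batches[b]);
        });
    for (auto& w : workers) w.join();
}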

VectorStorage

  • Manages all interaction with PostgreSQL database
  • Handles HNSW index creation and management
  • Provides semantic search functionality
  • Stores article metadata (title, description, link)
  • Supports configurable embedding dimensions (384-dim by default)
  • Can handle up to 2 million vectors in HNSW index
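
Since the repository vendors hnswlib/, index construction and querying presumably resemble the sketch below. DIM and MAX_ELEMENTS match the defaults listed under Configuration; the choice of inner-product space (equivalent to cosine similarity on normalized vectors) and the M/ef_construction values are assumptions:

#include <cstdio>
#include <vector>
#include "hnswlib/hnswlib.h"

int main() {
    const int DIM = 384;
    const size_t MAX_ELEMENTS = 2'000'000;

    // With L2-normalized embeddings, inner product equals cosine
    // similarity; hnswlib reports distance as 1 - ip.
    hnswlib::InnerProductSpace space(DIM);
    hnswlib::HierarchicalNSW<float> index(&space, MAX_ELEMENTS,
                                          /*M=*/16, /*ef_construction=*/200);

    std::vector<float> vec(DIM, 0.0f);        // stand-in for a real embedding
    index.addPoint(vec.data(), /*label=*/0);  // label = article id in PostgreSQL

    // searchKnn returns a max-heap of (distance, label) pairs.
    auto results = index.searchKnn(vec.data(), /*k=*/1);
    while (!results.empty()) {
        auto [dist, label] = results.top();
        results.pop();
        std::printf("id=%zu score=%.3f\n", (size_t)label, 1.0f - dist);
    }
}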

ONNXEmbedder

  • Loads and runs the all-MiniLM-L6-v2 model via ONNX Runtime
  • Processes variable-length text inputs
  • Handles tokenization, padding, and truncation
  • Returns normalized 384-dimensional vectors
  • Supports batch embedding for efficiency
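
A condensed sketch of a single forward pass with the ONNX Runtime C++ API. The tensor names (input_ids, attention_mask, token_type_ids, last_hidden_state) and the mean-pooling step match the usual all-MiniLM-L6-v2 export but should be verified against the exported model; L2 normalization is omitted for brevity:

#include <array>
#include <vector>
#include <onnxruntime_cxx_api.h>

std::vector<float> embed(Ort::Session& session,
                         std::vector<int64_t> ids,    // token ids from the tokenizer
                         std::vector<int64_t> mask) { // 1 = real token, 0 = padding
    const int64_t seq = (int64_t)ids.size();
    std::vector<int64_t> types(seq, 0);               // single-segment input
    std::array<int64_t, 2> shape{1, seq};

    auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::array<Ort::Value, 3> inputs{
        Ort::Value::CreateTensor<int64_t>(mem, ids.data(), ids.size(), shape.data(), 2),
        Ort::Value::CreateTensor<int64_t>(mem, mask.data(), mask.size(), shape.data(), 2),
        Ort::Value::CreateTensor<int64_t>(mem, types.data(), types.size(), shape.data(), 2)};

    const char* in_names[] = {"input_ids", "attention_mask", "token_type_ids"};
    const char* out_names[] = {"last_hidden_state"};
    auto out = session.Run(Ort::RunOptions{nullptr}, in_names, inputs.data(), 3,
                           out_names, 1);

    // Mean-pool the [1, seq, 384] hidden states over non-padding tokens.
    const float* h = out[0].GetTensorData<float>();
    std::vector<float> v(384, 0.0f);
    float n = 0.0f;
    for (int64_t t = 0; t < seq; ++t) {
        if (!mask[t]) continue;
        n += 1.0f;
        for (int d = 0; d < 384; ++d) v[d] += h[t * 384 + d];
    }
    if (n > 0.0f) for (float& x : v) x /= n;
    return v;
}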

WordPieceTokenizer

  • Implements WordPiece tokenization algorithm
  • Loads vocabulary from vocab.txt
  • Handles special tokens (CLS, SEP, PAD, UNK)
  • Supports configurable max sequence length (default: 256)
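
The heart of WordPiece is a greedy longest-match-first loop over each whitespace-split word, with ## marking continuation pieces. A minimal sketch, with the vocabulary simplified to an in-memory set:

#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> wordpiece(const std::string& word,
                                   const std::unordered_set<std::string>& vocab) {
    std::vector<std::string> pieces;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        std::string piece;
        // Longest-match-first: shrink the candidate until it is in vocab.
        while (end > start) {
            std::string cand = word.substr(start, end - start);
            if (start > 0) cand = "##" + cand;   // continuation piece
            if (vocab.count(cand)) { piece = cand; break; }
            --end;
        }
        if (piece.empty()) return {"[UNK]"};     // unmatchable word
        pieces.push_back(piece);
        start = end;
    }
    return pieces;
}
// e.g. the BERT uncased vocab splits "embeddings" into
// {"em", "##bed", "##ding", "##s"}
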

Python Components

WikipediaParse.py

  • Extracts articles from compressed Wikipedia XML dumps
  • Outputs structured JSON files (up to 10,000 articles per file)
  • Skips redirect pages automatically
  • Runs in the Data/ directory

Embedding.py

  • Exports the all-MiniLM-L6-v2 model to ONNX format
  • Exports tokenizer vocabulary in the format expected by WordPieceTokenizer
  • Creates the model files needed by the C++ application (see Data/README.md)

Prerequisites

System Requirements

  • Windows 10+ (or Linux/macOS with modifications)
  • Visual Studio 2022 with C++ support (for building)
  • PostgreSQL server running locally or accessible via network
  • At least 4GB RAM for embedding processing
  • ~10GB disk space for models and indexes

Software Dependencies

C++ Dependencies (managed via vcpkg):

  • ONNX Runtime
  • libpqxx (PostgreSQL C++ client)
  • cpp-httplib

Python Dependencies:

  • Python 3.10+
  • sentence-transformers
  • transformers

Quick Start

1. Prepare Data

See Data/README.md for complete data preparation instructions.

Summary:

cd Data
# Download Wikipedia dump to Data/wikiarticles.xml.bz2
# Then run:
python WikipediaParse.py

This generates JSON files in Data/output/ containing parsed articles.

2. Build and Run

Build C++ Project:

# In Visual Studio
# Build → Build Solution (or Ctrl+Shift+B)

Configure and Run:

Edit main.cpp to match your setup:

pqxx::connection conn("host=localhost port=5432 dbname=vectorstore user=postgres password=YOUR_PASSWORD");

// Adjust these parameters as needed:
size_t batchSize = 250;        // Articles processed per batch
size_t maxThreads = 8;         // Concurrent workers
int maxPages = 5000;           // Total articles to process (-1 for all)

Run Application:

cd EngineDB\x64\Release
EngineDB.exe

3. Usage

The application presents an interactive menu:

Select an option:
1. Parse JSON files and store vectors
2. Search
3. Exit

Option 1 - Parse and Store:

  • Reads JSON files from Data/output/
  • Embeds each article using the ONNX model
  • Stores embeddings and metadata in PostgreSQL
  • Creates/updates HNSW index for fast search
  • Shows progress and timing information

Option 2 - Search:

  • Enter search text (natural language)
  • System finds most similar articles using semantic similarity
  • Returns top results with scores

Example:

Search query (or 'exit'): neural networks in deep learning

Configuration

ArticleParser Configuration (main.cpp)

  • parsedJSONpath: Path to JSON files from WikipediaParse.py (default: ./Data/output)
  • batchSize: Articles per processing batch (default: 250, higher = faster but more memory)
  • maxThreads: Concurrent embedding workers (default: 8, adjust based on CPU cores)
  • maxPages: Limit total articles processed, -1 for all (default: 5000)

VectorStorage Configuration (main.cpp)

  • DIM: Embedding dimension (default: 384, matches all-MiniLM-L6-v2 output)
  • MAX_ELEMENTS: Maximum HNSW index capacity (default: 2,000,000)

Database Connection (main.cpp)

pqxx::connection conn("host=localhost port=5432 dbname=vectorstore user=postgres password=YOUR_PASSWORD");
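
For reference, a minimal libpqxx round trip against such a connection might look like the sketch below; the articles table and its columns are assumptions for illustration, not necessarily the project's actual schema:

#include <iostream>
#include <pqxx/pqxx>

int main() {
    pqxx::connection conn(
        "host=localhost port=5432 dbname=vectorstore user=postgres password=YOUR_PASSWORD");
    pqxx::work tx(conn);

    // Hypothetical metadata table matching the fields this README lists.
    tx.exec("CREATE TABLE IF NOT EXISTS articles ("
            "id BIGSERIAL PRIMARY KEY, title TEXT, description TEXT, link TEXT)");
    tx.exec_params("INSERT INTO articles (title, description, link) VALUES ($1, $2, $3)",
                   "Example title", "Example description", "https://example.org/article");

    for (auto const& row : tx.exec("SELECT id, title FROM articles LIMIT 5"))
        std::cout << row["id"].as<long>() << " " << row["title"].c_str() << "\n";
    tx.commit();
}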

How It Works

Data Pipeline

  1. Wikipedia XML Dump → WikipediaParse.py → JSON Files (10K articles/file)
  2. JSON Files → ArticleParser → Embedding Queue
  3. Embedding Queue → ONNXEmbedder → 384-dim Vectors
  4. Vectors → VectorStorage → PostgreSQL + HNSW Index

Search Process

  1. User enters search query
  2. Query text is embedded using the same ONNX model
  3. HNSW index performs approximate nearest neighbor search
  4. Top K most similar articles are retrieved from PostgreSQL
  5. Results displayed with similarity scores
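
Putting the steps together, the query path reduces to roughly the following sketch. It reuses the hnswlib index and the hypothetical articles table from the sketches above, and assumes the query text has already been embedded (step 2):

#include <iostream>
#include <string>
#include <vector>
#include <pqxx/pqxx>
#include "hnswlib/hnswlib.h"

void search(hnswlib::HierarchicalNSW<float>& index, pqxx::work& tx,
            const std::vector<float>& queryVec, size_t k = 10) {
    // Step 3: approximate nearest-neighbor lookup. The heap pops the
    // *worst* hit first, so results print in reverse relevance order.
    auto hits = index.searchKnn(queryVec.data(), k);
    while (!hits.empty()) {
        auto [dist, id] = hits.top();
        hits.pop();
        // Step 4: fetch metadata for this label from PostgreSQL.
        auto row = tx.exec_params1(
            "SELECT title, link FROM articles WHERE id = $1", (long)id);
        std::cout << row["title"].c_str() << "  score=" << (1.0f - dist) << "\n";
    }
}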

Performance Characteristics

  • Embedding: ~100-200 articles/second (batch processing)
  • Search Latency: <100ms for approximate nearest-neighbor (HNSW) search against 2M vectors
  • Memory: ~4GB for 2M articles (~1.5GB for the HNSW index + ~2.5GB for PostgreSQL)
  • Model Size: 91MB (all-MiniLM-L6-v2 ONNX format)

Troubleshooting

Database Connection Failures

  • Verify PostgreSQL is running: psql -U postgres
  • Check credentials in connection string
  • Ensure database and table exist
  • Test connectivity: psql -h localhost -U postgres -d vectorstore

Embedding Model Not Found

  • Verify models/model.onnx and models/vocab.txt exist
  • Run Embedding.py to regenerate model files (see Data/README.md)
  • Check relative paths in ONNXEmbedder constructor

Memory Issues During Processing

  • Reduce batchSize in main.cpp
  • Process in multiple runs with maxPages limit
  • Reduce maxThreads to lower peak memory usage

Slow Embedding Speed

  • Ensure batch processing is enabled
  • Check that disk I/O isn't the bottleneck (SSD recommended)
  • Verify ONNX Runtime isn't set to GPU mode
