GitHub - PranavBhatP/velox-db: An hobby project to construct a fully functioning vector database from scratch using first principles. Built to be fast. Available as a python package.

High-Performance Embedded Vector Database

Overview

VeloxDB is a production-ready vector database engineered for high-performance similarity search operations. The system leverages native C++ implementation with AVX2 SIMD instruction sets to deliver optimal performance for vector-intensive workloads. With support for datasets exceeding available memory through memory-mapped file I/O and intelligent indexing via Inverted File structures, VeloxDB provides enterprise-grade scalability while maintaining a simple developer experience.

Key Features

High Performance: Native C++ implementation with AVX2 SIMD instructions for vectorized distance calculations
Scalable Architecture: Memory-mapped file support enables handling datasets larger than available RAM
Intelligent Indexing: Inverted File Index (IVF) with K-Means clustering for efficient approximate nearest neighbor (ANN) search
Python Integration: Simple, intuitive Python API via pybind11 bindings
Persistent Storage: Save and load vector databases and indices in the efficient .fvecs format
Flexible Metrics: Support for Euclidean and Cosine distance metrics
REST API: Optional FastAPI server for remote access and microservice deployments

Installation

From PyPI

pip install veloxdb

From Source

git clone https://github.com/PranavBhatP/velox-db.git
cd velox-db
pip install .

Requirements:

Python 3.9+
CMake 3.15+
C++17 compatible compiler
AVX2-capable CPU (for SIMD optimizations)

Quick Start

Basic Usage

import veloxdb

# Create a new vector index
db = veloxdb.VectorIndex()

# Add vectors
db.add_vector([1.0, 2.0, 3.0, 4.0, 5.0])
db.add_vector([2.0, 3.0, 4.0, 5.0, 6.0])
db.add_vector([10.0, 20.0, 30.0, 40.0, 50.0])

# Build IVF index for fast search
db.build_index(num_clusters=2, max_iters=10, metric="eucl")

# Search for nearest neighbor
query = [1.1, 2.1, 3.1, 4.1, 5.1]
result_id = db.search(query, metric="eucl")
print(f"Nearest neighbor ID: {result_id}")

# Retrieve the matched vector
matched_vector = db.get_vector(result_id)
print(f"Matched vector: {matched_vector}")

Persistence

# Save vectors and index to disk
db.write_fvecs("vectors.fvecs")
db.save_index("index.ivf")

# Load from disk
db_loaded = veloxdb.VectorIndex()
db_loaded.load_fvecs("vectors.fvecs")
db_loaded.load_index("index.ivf")

Using the REST API Server

Start the FastAPI server:

python server/main.py

The server will be available at http://localhost:8000. Example API calls:

# Add a vector
curl -X POST http://localhost:8000/add_vectors \
  -H "Content-Type: application/json" \
  -d '{"vector": [1.0, 2.0, 3.0, 4.0, 5.0]}'

# Train the index
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{"num_clusters": 10, "max_iters": 20, "metric": "eucl"}'

# Search for similar vectors
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query_vector": [1.1, 2.1, 3.1, 4.1, 5.1], "metric": "eucl"}'

# Save state to disk
curl -X POST http://localhost:8000/save

Architecture & Implementation

High-Level Design

VeloxDB is architected as a three-layer system:

Core Engine (C++): Implements the core vector storage, indexing algorithms, and distance calculations
Python Bindings: Exposes C++ functionality to Python via pybind11
REST API Layer: Optional FastAPI server for network-accessible deployments

Core Components

Vector Storage

In-Memory Storage: Vectors stored as std::vector<std::vector<float>> for fast access
Memory-Mapped Files: Large datasets can be loaded from .fvecs files using mmap, enabling efficient access to datasets exceeding RAM capacity
Binary Format: Uses the standard .fvecs format for efficient serialization

Distance Metrics

VeloxDB supports multiple distance metrics with optional SIMD acceleration:

Euclidean Distance: L2 distance with AVX2 vectorized implementation
Cosine Distance: Angular similarity with SIMD optimization

SIMD acceleration can be toggled via the set_simd() method.

IVF Indexing

The Inverted File Index (IVF) dramatically accelerates similarity search:

Clustering: K-Means algorithm partitions vectors into clusters
Inverted Lists: Each cluster maintains a list of vector IDs
Search: Query is compared only to centroids, then searched within the nearest cluster(s)

Parameters:

num_clusters: Number of K-Means clusters (more clusters = faster search, but may reduce recall)
max_iters: Maximum iterations for K-Means convergence
metric: Distance metric ("eucl" for Euclidean, "cos" for Cosine)

Performance Optimizations

AVX2 SIMD: Vectorized distance calculations process 8 floats per instruction
Memory-Mapped I/O: Efficient disk access without loading entire datasets into RAM
Cache-Friendly Data Structures: Contiguous memory layouts for optimal CPU cache utilization

API Reference

Python API

`VectorIndex`

The main class for interacting with the vector database.

class VectorIndex:
    def __init__(self) -> None:
        """Initialize a new vector index."""
    
    def add_vector(self, vector: list[float]) -> None:
        """Add a vector to the index.
        
        Args:
            vector: A list of float values representing the vector.
        """
    
    def load_fvecs(self, filename: str) -> None:
        """Load vectors from a .fvecs file.
        
        Args:
            filename: Path to the .fvecs file.
        """
    
    def get_vector(self, index: int) -> list[float]:
        """Retrieve a vector by its ID.
        
        Args:
            index: The integer ID of the vector.
        
        Returns:
            The vector as a list of floats.
        """
    
    def build_index(self, num_clusters: int, max_iters: int = 10, 
                   metric: str = "eucl") -> None:
        """Build the IVF index using K-Means clustering.
        
        Args:
            num_clusters: Number of clusters for K-Means.
            max_iters: Maximum iterations for K-Means (default: 10).
            metric: Distance metric - "eucl" or "cos" (default: "eucl").
        """
    
    def search(self, query: list[float], metric: str = "eucl") -> int:
        """Search for the nearest neighbor.
        
        Args:
            query: Query vector as a list of floats.
            metric: Distance metric - "eucl" or "cos" (default: "eucl").
        
        Returns:
            The ID of the nearest neighbor vector.
        """
    
    def write_fvecs(self, filename: str) -> None:
        """Save vectors to a .fvecs file.
        
        Args:
            filename: Path where the .fvecs file will be saved.
        """
    
    def save_index(self, filename: str) -> None:
        """Save the IVF index to disk.
        
        Args:
            filename: Path where the index file will be saved.
        """
    
    def load_index(self, filename: str) -> None:
        """Load a saved IVF index from disk.
        
        Args:
            filename: Path to the index file.
        """
    
    def set_simd(self, enable: bool) -> None:
        """Enable or disable SIMD acceleration.
        
        Args:
            enable: True to enable AVX2 SIMD, False to use standard implementation.
        """

REST API Endpoints

Endpoint	Method	Description
`/`	GET	Health check
`/add_vectors`	POST	Add a new vector
`/train`	POST	Build/train the IVF index
`/search`	POST	Search for nearest neighbors
`/save`	POST	Persist database and index to disk

Advanced Configuration

Optimizing Index Parameters

# For small datasets (<10K vectors)
db.build_index(num_clusters=10, max_iters=20)

# For medium datasets (10K-1M vectors)
db.build_index(num_clusters=100, max_iters=15)

# For large datasets (>1M vectors)
db.build_index(num_clusters=1000, max_iters=10)

Rule of thumb: num_clusters ≈ sqrt(num_vectors) provides a good balance between speed and accuracy.

Memory-Mapped Files

For datasets larger than RAM:

db = veloxdb.VectorIndex()
db.load_fvecs("large_dataset.fvecs")  # Uses mmap automatically
db.load_index("large_index.ivf")

The database will efficiently page data from disk as needed.

Development

Building from Source

# Clone the repository
git clone https://github.com/PranavBhatP/velox-db.git
cd velox-db

# Install in editable mode
pip install -e .

# Run tests
python tests/api/test_script.py

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built with pybind11 for seamless C++/Python integration
Inspired by FAISS and Annoy

Contact

Pranav Bhat - pranavbhat2004@gmail.com

Project Link: https://github.com/PranavBhatP/velox-db

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.github		.github
.vscode		.vscode
bindings		bindings
data		data
docs/source/_static		docs/source/_static
include		include
server		server
src		src
tests		tests
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

Key Features

Installation

From PyPI

From Source

Quick Start

Basic Usage

Persistence

Using the REST API Server

Architecture & Implementation

High-Level Design

Core Components

Vector Storage

Distance Metrics

IVF Indexing

Performance Optimizations

API Reference

Python API

`VectorIndex`

REST API Endpoints

Advanced Configuration

Optimizing Index Parameters

Memory-Mapped Files

Development

Building from Source

Contributing

License

Acknowledgments

Contact

About

Uh oh!

Releases

Packages

Languages

License

PranavBhatP/velox-db

Folders and files

Latest commit

History

Repository files navigation

Overview

Key Features

Installation

From PyPI

From Source

Quick Start

Basic Usage

Persistence

Using the REST API Server

Architecture & Implementation

High-Level Design

Core Components

Vector Storage

Distance Metrics

IVF Indexing

Performance Optimizations

API Reference

Python API

VectorIndex

REST API Endpoints

Advanced Configuration

Optimizing Index Parameters

Memory-Mapped Files

Development

Building from Source

Contributing

License

Acknowledgments

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`VectorIndex`

Packages