MuopDB - A vector database for AI memories

Introduction

MuopDB is a vector database for machine learning. Currently, it supports:

  • Hybrid search: text search (with stemming support) and vector search with filtering.
  • Index types: HNSW, IVF, SPANN, Multi-user SPANN. All on-disk.
  • I/O models: mmap, async I/O (with optional io_uring support on Linux).
  • Quantization: product quantization.
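
As a rough illustration of the product quantization idea listed above, here is a minimal Python sketch. It is not MuopDB's implementation: the function names and the hand-written codebooks are assumptions made for this example, and real codebooks are normally learned from data (for instance with k-means). A vector is split into subvectors, and each subvector is stored as the index of its nearest centroid.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vector, num_subspaces):
    """Split a vector into num_subspaces contiguous, equal-width subvectors."""
    if len(vector) % num_subspaces != 0:
        raise ValueError("dimension must be divisible by the number of subspaces")
    width = len(vector) // num_subspaces
    return [vector[i * width:(i + 1) * width] for i in range(num_subspaces)]


def encode(vector, codebooks):
    """Replace each subvector with the index of its nearest centroid."""
    codes = []
    for sub, centroids in zip(split(vector, len(codebooks)), codebooks):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(sub, centroids[c]))
        codes.append(nearest)
    return codes


def decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    approx = []
    for code, centroids in zip(codes, codebooks):
        approx.extend(centroids[code])
    return approx


# Hypothetical codebooks: 2 subspaces, 2 centroids each (hand-picked, not trained).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
codes = encode([9.0, 11.0, 1.0, -1.0], codebooks)
approx = decode(codes, codebooks)
print(codes, approx)
```

The four floats are stored as two small integers, at the cost of reconstructing only an approximation of the original vector.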

Why MuopDB?

MuopDB supports multiple users by default: each user has their own vector index within the same collection. The main use case is building memory for LLMs. Think of it as:

  • Each user will have its own memory
  • Each user can still search a shared knowledge base.

All users' indices will be stored in a few files, reducing operational complexity.
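
The multi-user model can be pictured with a toy Python sketch. Everything here is an assumption made for illustration: `ToyCollection`, the string user ids, and the idea of treating the shared knowledge base as one more user id are hypothetical, and the brute-force scan stands in for MuopDB's actual on-disk indices.

```python
import heapq
import math
from collections import defaultdict


class ToyCollection:
    """One collection that keeps a separate brute-force index per user."""

    def __init__(self, num_features):
        self.num_features = num_features
        self.indices = defaultdict(dict)  # user_id -> {doc_id: vector}

    def insert(self, user_id, doc_id, vector):
        if len(vector) != self.num_features:
            raise ValueError("vector length does not match num_features")
        self.indices[user_id][doc_id] = list(vector)

    def search(self, user_ids, query, top_k):
        """Return the top_k closest (distance, user_id, doc_id) across the given users."""
        candidates = []
        for user_id in user_ids:
            for doc_id, vector in self.indices.get(user_id, {}).items():
                candidates.append((math.dist(query, vector), user_id, doc_id))
        return heapq.nsmallest(top_k, candidates)


SHARED = "shared"  # hypothetical id standing in for a shared knowledge base

collection = ToyCollection(num_features=2)
collection.insert("alice", "a1", [0.0, 0.0])
collection.insert("bob", "b1", [0.1, 0.0])
collection.insert(SHARED, "kb1", [1.0, 1.0])

# Alice searches her own memory plus the shared knowledge base; Bob's data is never scanned.
alice_hits = collection.search(["alice", SHARED], [0.2, 0.0], top_k=2)
print([doc_id for _, _, doc_id in alice_hits])
```

In this sketch, isolation comes from which user ids are passed to `search`, which mirrors the `user_ids` field in the request examples later in this README.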

Quick Start

  • Build MuopDB. See the Building section below.
  • Prepare the data and indices directories. On macOS the root directory is read-only, so you may want to use a different location, e.g. ~/mnt/muopdb/.
mkdir -p /mnt/muopdb/indices
mkdir -p /mnt/muopdb/data
  • Start MuopDB index_server with the directories we just prepared using one of these methods:
# Start server locally. This is recommended for Mac.
cd target/release
RUST_LOG=info ./index_server --node-id 0 --index-config-path /mnt/muopdb/indices --index-data-path /mnt/muopdb/data --port 9002

# Start server with Docker. Only use this option on Linux.
docker-compose up --build
  • Now you have an up and running MuopDB index_server.
    • You can send gRPC requests to this server (possibly with Postman).
    • You can use Server Reflection in Postman - it will automatically detect the RPCs for MuopDB.

Examples using Postman

  1. Create collection
{
    "collection_name": "test-collection-2",
    "num_features": 10,
    "wal_file_size": 1024000000,
    "max_time_to_flush_ms": 5000,
    "max_pending_ops": 10,
    "attribute_schema": {
        "attributes": [
            {
                "name": "title",
                "type": "ATTRIBUTE_TYPE_TEXT",
                "language": "english"
            },
            {
                "name": "content",
                "type": "ATTRIBUTE_TYPE_TEXT",
                "language": "english"
            }
        ]
    }
}
  2. Insert some data
{
    "collection_name": "test-collection-2",
    "doc_ids": [
        {
            "uuid": "00000000-0000-0000-0000-000000000064"
        }
    ],
    "user_ids": [
        {
            "uuid": "00000000-0000-0000-0000-000000000000"
        }
    ],
    "vectors": [
        100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0
    ],
    "attributes": {
        "values": [
            {
                "value": {
                    "title": {
                        "text_value": "Example Document"
                    },
                    "content": {
                        "text_value": "This is an example document for search demonstration"
                    }
                }
            }
        ]
    }
}
  3. Search
{
    "collection_name": "test-collection-2",
    "params": {
        "ef_construction": 200,
        "record_metrics": false,
        "top_k": 1
    },
    "user_ids": [
        {
            "uuid": "00000000-0000-0000-0000-000000000000"
        }
    ],
    "vector": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0]
}
  4. Remove
{
    "collection_name": "test-collection-2",
    "doc_ids": [
        {
            "uuid": "00000000-0000-0000-0000-000000000064"
        }
    ],
    "user_ids": [
        {
            "uuid": "00000000-0000-0000-0000-000000000000"
        }
    ]
}
  5. Search again
{
    "collection_name": "test-collection-2",
    "params": {
        "ef_construction": 200,
        "record_metrics": false,
        "top_k": 1
    },
    "user_ids": [
        {
            "uuid": "00000000-0000-0000-0000-000000000000"
        }
    ],
    "vector": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0]
}

This time the result should be different, because the document removed in the previous step is no longer returned.

  6. TermSearch only
{
    "collection_name": "test-collection-2",
    "user_ids": [
        {
            "uuid": "00000000-0000-0000-0000-000000000000"
        }
    ],
    "limit": 10,
    "filter": {
        "contains": {
            "path": "content",
            "value": "search"
        }
    }
}

This performs a text-only search without requiring a vector, returning documents where the content field contains the term "search". You can also search the title field or combine multiple filters using and/or operators.
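
The ids in the requests above are UUID strings, and the document id ending in `...0064` is the integer 100 written in hexadecimal. The Python sketch below builds an Insert request body from integer ids. The helper names are hypothetical, the `attributes` field is omitted for brevity, and it is an assumption (not confirmed by this README) that vectors for several documents are concatenated into one flat `vectors` list. For a single document the output matches the Insert example above.

```python
import json
import uuid


def id_to_uuid(value):
    """Encode an integer id as the {"uuid": "..."} object used in the requests."""
    return {"uuid": str(uuid.UUID(int=value))}


def build_insert_request(collection_name, num_features, user_id, docs):
    """Build an Insert request body; docs is a list of (doc_id, vector) pairs."""
    flat_vectors = []
    for doc_id, vector in docs:
        if len(vector) != num_features:
            raise ValueError(f"doc {doc_id}: expected {num_features} features")
        # Assumption: multiple vectors are laid out back to back in one flat list.
        flat_vectors.extend(vector)
    return {
        "collection_name": collection_name,
        "doc_ids": [id_to_uuid(doc_id) for doc_id, _ in docs],
        "user_ids": [id_to_uuid(user_id)],
        "vectors": flat_vectors,
    }


request = build_insert_request(
    "test-collection-2",
    num_features=10,
    user_id=0,
    docs=[(100, [100.0 + i for i in range(10)])],
)
print(json.dumps(request, indent=4))
```

The printed JSON can be pasted into Postman as the request message. Checking the vector length against `num_features` on the client side catches a mismatch before the request is sent.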

Plans

Phase 0 (Done)

  • Query path
    • Vector similarity search
    • Hierarchical Navigable Small Worlds (HNSW)
    • Product Quantization (PQ)
  • Indexing path
    • Support periodic offline indexing
  • Database Management
    • Doc-sharding & query fan-out with aggregator-leaf architecture
    • In-memory & disk-based storage with mmap

Phase 1 (Done)

  • Query & Indexing
    • Inverted File (IVF)
    • Improve locality for HNSW
    • SPANN

Phase 2 (Done)

  • Query
    • Multiple index segments
    • L2 distance
  • Index
    • Optimizing index build time
    • Elias-Fano encoding for IVF
    • Multi-user SPANN index

Phase 3 (Done)

  • Features
    • Delete vector from collection
  • Database Management
    • Segment optimizer framework
    • Write-ahead-log
    • Segments merger
    • Segments vacuum

Phase 4 (Done)

  • Features
    • Hybrid search
    • Term search only (without vector)
  • Database Management
    • Optimizing deletion with bloom filter
    • Optimizing WAL write with thread-safe write group
    • Automatic segment optimizer
    • Non-mmap implementation of SPANN and Term index (with io_uring support)

Phase 5 (Ongoing)

  • Features
    • Search relevance score (BM25, TF/IDF)
  • Database management / Optimization
    • MuopDB with consensus protocol (Raft)
    • Cloud MuopDB (native on object store)
    • Improve skip_to performance on Elias-Fano encoding

Building

  • Install dependencies:
# macOS (using Homebrew)
brew install protobuf openblas

# Linux (Arch-based, including derivatives such as EndeavourOS and CachyOS)
sudo pacman -Syu protobuf openblas

# Linux (Debian-based)
sudo apt-get install libprotobuf-dev libopenblas-dev
  • Build from Source:
git clone https://github.com/hicder/muopdb.git
cd muopdb

# Build
cargo build --release

# Run tests
cargo test --release

Contributions

Main contributors:

This project is developed with TechCare Coaching. I mentor the mentees who contribute to it.
