MuopDB is a vector database for machine learning. Currently, it supports:
- Hybrid search: text search (with stemming support) and vector search with filtering.
- Index types: HNSW, IVF, SPANN, multi-user SPANN. All on-disk.
- Multiple I/O models: mmap and async I/O (with optional `io_uring` support on Linux).
- Quantization: product quantization.
MuopDB supports multiple users by default: each user gets its own vector index within the same collection. The use case for this is building memory for LLMs. Think of it as:
- Each user has its own memory.
- Each user can still search a shared knowledge base.
All users' indices are stored in a few files, reducing operational complexity.
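For example, a single search request could cover both a user's personal memory and a shared knowledge base by listing both user IDs. The request below is illustrative: the UUIDs are made up, and the assumption that the server searches every listed user index in one request is based on `user_ids` being a list in the Search example later in this document.

```json
{
  "collection_name": "test-collection-2",
  "params": {
    "ef_construction": 200,
    "record_metrics": false,
    "top_k": 5
  },
  "user_ids": [
    { "uuid": "00000000-0000-0000-0000-000000000001" },
    { "uuid": "00000000-0000-0000-0000-000000000000" }
  ],
  "vector": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0]
}
```

Here the first UUID would be the user's personal index and the second a shared knowledge-base index.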
- Build MuopDB. Refer to the build instructions below.
- Prepare the necessary `data` and `indices` directories. On macOS, you might want to use different directories, since the root directory is read-only, e.g. `~/mnt/muopdb/`:
```shell
mkdir -p /mnt/muopdb/indices
mkdir -p /mnt/muopdb/data
```
- Start the MuopDB `index_server` with the directories we just prepared, using one of these methods:

```shell
# Start server locally. This is recommended for Mac.
cd target/release
RUST_LOG=info ./index_server --node-id 0 --index-config-path /mnt/muopdb/indices --index-data-path /mnt/muopdb/data --port 9002

# Start server with Docker. Only use this option on Linux.
docker-compose up --build
```

- Now you have an up-and-running MuopDB `index_server`.
- You can send gRPC requests to this server (for example, with Postman).
- You can use Server Reflection in Postman - it will automatically detect the RPCs for MuopDB.
- Create collection

```json
{
  "collection_name": "test-collection-2",
  "num_features": 10,
  "wal_file_size": 1024000000,
  "max_time_to_flush_ms": 5000,
  "max_pending_ops": 10,
  "attribute_schema": {
    "attributes": [
      {
        "name": "title",
        "type": "ATTRIBUTE_TYPE_TEXT",
        "language": "english"
      },
      {
        "name": "content",
        "type": "ATTRIBUTE_TYPE_TEXT",
        "language": "english"
      }
    ]
  }
}
```
- Insert some data

```json
{
  "collection_name": "test-collection-2",
  "doc_ids": [
    { "uuid": "00000000-0000-0000-0000-000000000064" }
  ],
  "user_ids": [
    { "uuid": "00000000-0000-0000-0000-000000000000" }
  ],
  "vectors": [
    100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0
  ],
  "attributes": {
    "values": [
      {
        "value": {
          "title": { "text_value": "Example Document" },
          "content": { "text_value": "This is an example document for search demonstration" }
        }
      }
    ]
  }
}
```
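The request above inserts a single document. Since `doc_ids` is a list while `vectors` is one flat float array, a batch insert plausibly packs `num_features` floats per document into that array. The helper below is a plain-Python sketch of building such a payload; the flat multi-document layout is an assumption extrapolated from the single-document example, so check the proto definitions before relying on it.

```python
import json

# Must match the collection's num_features (10 in the create-collection
# example above).
NUM_FEATURES = 10

def build_insert_request(collection, user_id, docs):
    """Build an insert request body.

    docs is a list of (doc_uuid, vector) pairs. Vectors are concatenated
    into one flat array, NUM_FEATURES floats per document (assumed layout).
    """
    flat = []
    for _, vec in docs:
        assert len(vec) == NUM_FEATURES, "vector dimension must match num_features"
        flat.extend(vec)
    return {
        "collection_name": collection,
        "doc_ids": [{"uuid": d} for d, _ in docs],
        "user_ids": [{"uuid": user_id}],
        "vectors": flat,
    }

# Rebuild the single-document example payload from above.
req = build_insert_request(
    "test-collection-2",
    "00000000-0000-0000-0000-000000000000",
    [("00000000-0000-0000-0000-000000000064",
      [float(100 + i) for i in range(NUM_FEATURES)])],
)
print(json.dumps(req, indent=2))
```

The dimension assertion catches the most common mistake: sending a vector whose length does not match the collection's `num_features`.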
- Search

```json
{
  "collection_name": "test-collection-2",
  "params": {
    "ef_construction": 200,
    "record_metrics": false,
    "top_k": 1
  },
  "user_ids": [
    { "uuid": "00000000-0000-0000-0000-000000000000" }
  ],
  "vector": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0]
}
```
- Remove

```json
{
  "collection_name": "test-collection-2",
  "doc_ids": [
    { "uuid": "00000000-0000-0000-0000-000000000064" }
  ],
  "user_ids": [
    { "uuid": "00000000-0000-0000-0000-000000000000" }
  ]
}
```
- Search again
```json
{
  "collection_name": "test-collection-2",
  "params": {
    "ef_construction": 200,
    "record_metrics": false,
    "top_k": 1
  },
  "user_ids": [
    { "uuid": "00000000-0000-0000-0000-000000000000" }
  ],
  "vector": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0]
}
```
This time the search should return a different result, since the document was removed.
- Term search only

```json
{
  "collection_name": "test-collection-2",
  "user_ids": [
    { "uuid": "00000000-0000-0000-0000-000000000000" }
  ],
  "limit": 10,
  "filter": {
    "contains": {
      "path": "content",
      "value": "search"
    }
  }
}
```
This performs a text-only search without requiring a vector, returning documents whose `content` field contains the term "search". You can also search the `title` field, or combine multiple filters using `and`/`or` operators.
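For instance, a combined filter might look like the following. The exact shape of the `and` node is an assumption based on the `and`/`or` operators mentioned above; check the proto definitions for the authoritative structure.

```json
{
  "collection_name": "test-collection-2",
  "user_ids": [
    { "uuid": "00000000-0000-0000-0000-000000000000" }
  ],
  "limit": 10,
  "filter": {
    "and": {
      "filters": [
        { "contains": { "path": "content", "value": "search" } },
        { "contains": { "path": "title", "value": "example" } }
      ]
    }
  }
}
```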
- Query path
  - Vector similarity search
  - Hierarchical Navigable Small Worlds (HNSW)
  - Product Quantization (PQ)
- Indexing path
  - Support periodic offline indexing
- Database Management
  - Doc-sharding & query fan-out with aggregator-leaf architecture
  - In-memory & disk-based storage with mmap
- Query & Indexing
  - Inverted File (IVF)
  - Improve locality for HNSW
  - SPANN
- Query
  - Multiple index segments
  - L2 distance
- Index
  - Optimizing index build time
  - Elias-Fano encoding for IVF
  - Multi-user SPANN index
- Features
  - Delete vector from collection
- Database Management
  - Segment optimizer framework
  - Write-ahead-log
  - Segments merger
  - Segments vacuum
- Features
  - Hybrid search
  - Term search only (without vector)
- Database Management
  - Optimizing deletion with bloom filter
  - Optimizing WAL write with thread-safe write group
  - Automatic segment optimizer
  - Non-mmap implementation of SPANN and Term index (with `io_uring` support)
- Features
  - Search relevance score (BM25, TF/IDF)
- Database management / Optimization
  - MuopDB with consensus protocol (Raft)
  - Cloud MuopDB (native on object store)
  - Improve skip_to performance on Elias-Fano encoding
- Install prerequisites:
  - Rust: https://www.rust-lang.org/tools/install
  - Make sure you're on nightly: `rustup toolchain install nightly`
  - Libraries:
```shell
# MacOS (using Homebrew)
brew install protobuf openblas

# On Arch Linux (and its derivatives, such as EndeavourOS, CachyOS)
sudo pacman -Syu protobuf openblas

# Linux (Debian-based)
sudo apt-get install libprotobuf-dev libopenblas-dev
```

- Build from source:
```shell
git clone https://github.com/hicder/muopdb.git
cd muopdb

# Build
cargo build --release

# Run tests
cargo test --release
```

Main contributors:
This project is developed with TechCare Coaching. I mentor mentees who have made contributions to this project.