[Enhance]: Concurrent multi-process access support without requiring a FileLock for all operations #247

@Rm1n90

Description


Affected Component

Codebase

Current Behavior

Collection.query() is not safe for concurrent multi-process access; users must serialize all reads with an external lock.

As I understand it, Zvec is an embedded, file-based vector database (similar in design to SQLite). Because there is no server process mediating access, when the same collection is opened by multiple OS processes simultaneously, for example in a multi-worker web server (Gunicorn/Uvicorn) or across multiple Docker containers sharing a volume, there are no internal concurrency guarantees. Users are forced to wrap every operation, including read-only queries, in an external FileLock, which serializes all search operations and creates a hard throughput ceiling.

Concrete Setup That Reproduces This

Three processes all mount the same collection path:

Process A: API server worker 1 (collection.query(...))
Process B: API server worker 2 (collection.query(...))
Process C: Celery embedding worker (collection.upsert(...))

Without an external lock, Process A and B opening the same collection simultaneously for parallel reads is undefined behavior. The HNSW graph file could be in a partially-flushed state from Process C's flush(). There is no documented guarantee that Zvec's file format is safe for concurrent readers.

Desired Improvement

Concurrent read safety guarantee: document that multiple processes can safely call collection.query() simultaneously on the same collection path, as long as no writer is active. This is the standard MRSW (Multiple Readers, Single Writer) model that SQLite's WAL mode implements. If the file format already supports this, simply documenting it would let users switch from an exclusive lock to a readers-writer lock.
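To illustrate what users could do with a documented MRSW guarantee: POSIX flock supports exactly this split, with LOCK_SH for readers and LOCK_EX for the writer. A minimal sketch of the pattern (POSIX-only; the lock file name is a hypothetical placeholder, and the actual collection.query()/upsert() calls are elided since Zvec's API is not confirmed here):

```python
import fcntl
import os
import tempfile

# Hypothetical lock path; in practice something like f"{ZVEC_PATH}.lock".
LOCK_PATH = os.path.join(tempfile.gettempdir(), "zvec_demo.lock")

def read_locked(lock_path):
    """Take a SHARED lock: many readers may hold it at once."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_SH)  # blocks only while a writer holds LOCK_EX
    return fd                        # ... collection.query(...) would go here ...

def unlock(fd):
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)

# Two readers hold the shared lock at the same time (separate file
# descriptors behave like separate processes for flock purposes):
r1 = read_locked(LOCK_PATH)
r2 = read_locked(LOCK_PATH)

# A writer attempting a non-blocking EXCLUSIVE lock is refused while
# any reader is active -- the single-writer half of MRSW:
fd_w = os.open(LOCK_PATH, os.O_CREAT | os.O_RDWR)
try:
    fcntl.flock(fd_w, fcntl.LOCK_EX | fcntl.LOCK_NB)
    writer_blocked = False
except BlockingIOError:
    writer_blocked = True
os.close(fd_w)

unlock(r1)
unlock(r2)
print(writer_blocked)  # True: the writer had to wait, but readers ran in parallel
```

With a documented guarantee, this would replace the current exclusive-for-everything FileLock and let reads proceed in parallel.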

In-process thread-safety guarantee: if a single process opens the collection once at startup and multiple threads call query() concurrently, is that safe? If so, document it explicitly. Users deploying with --workers 1 and async I/O (FastAPI/asyncio) could then load the collection once and share it across coroutines without any locking.
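The --workers 1 pattern that guarantee would enable looks roughly like the sketch below. Since Zvec's real API is not confirmed here, a trivial in-memory StubCollection stands in for the real collection; the point is the sharing pattern: open once at startup, fan queries out to a thread pool, no locks.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class StubCollection:
    """Stand-in for a zvec collection; the real API is not confirmed by this issue."""
    def query(self, vector):
        # A real query would search the HNSW index; here we just echo the input.
        return {"matches": vector}

# Opened ONCE at process startup, then shared by every coroutine.
_collection = StubCollection()
_pool = ThreadPoolExecutor(max_workers=4)

async def handle_request(vector):
    # Off-load the query to a worker thread so the event loop stays free.
    # Safe without locking only if Zvec documents in-process thread safety.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, _collection.query, vector)

async def main():
    # Eight concurrent "requests" against the single shared handle.
    return await asyncio.gather(*(handle_request([i]) for i in range(8)))

results = asyncio.run(main())
print(len(results))  # 8
```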

Impact

With 4 Uvicorn workers and a query that holds the lock for ~30 ms, the 4th worker waits up to 90 ms purely on lock acquisition before any search work starts. Throughput is capped at roughly 1 / lock_hold_time regardless of how many CPU cores or workers are added. Scaling horizontally is impossible; more processes just mean a longer queue at the lock.
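The ceiling above is simple queueing arithmetic; a quick check using the figures from this issue (30 ms hold time, 4 workers):

```python
lock_hold_ms = 30  # ~30 ms per query while the exclusive lock is held
workers = 4

# Worst case, the last worker queues behind the other three:
worst_wait_ms = (workers - 1) * lock_hold_ms
print(worst_wait_ms)  # 90

# The lock serializes everything, so aggregate throughput is capped at:
ceiling_qps = 1000 / lock_hold_ms
print(round(ceiling_qps, 1))  # 33.3 queries/s, no matter how many workers
```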

Current Workaround

import zvec  # the library under discussion
from filelock import FileLock

# settings comes from the application's own config module
_zvec_lock = FileLock(f"{settings.ZVEC_PATH}.lock")

class ZvecClient:
    def connect(self):
        # Exclusive lock: blocks ALL other processes, including pure readers
        _zvec_lock.acquire()
        self._collection = zvec.open(settings.ZVEC_PATH)

    def disconnect(self):
        self._collection.flush()
        _zvec_lock.release()

Zvec Version: 0.2.0
Python: 3.12
Platform: Linux (AWS EC2, ARM Graviton4), macOS (development)

Labels

enhancement (Improve an existing feature or component)

Type

No type

Projects

Status

Backlog

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions