VectorVault

Lightweight local semantic search over plain text documents using sentence embeddings.

Description

VectorVault is a small project that extracts semantic embeddings from text documents and provides two ways to interact with them:

  • a Streamlit web UI for interactive exploration and demos (fast, visual), and
  • a FastAPI-based HTTP API for programmatic queries and integration.

The repository contains simple modules to preprocess text, compute and cache embeddings, and perform approximate nearest-neighbor search.

Purpose

This project is designed as a local, easy-to-run demo and development platform for semantic search over a small corpus of documents. Its goals are:

  • Make it trivial to index a small document set and query semantically.
  • Provide both an interactive UI (Streamlit) and an API (FastAPI) so developers and non-developers can explore results.
  • Use a compact, fast embedding model (MiniLM) so it runs well on modest hardware.

Why use the MiniLM model ("all-MiniLM-L6-v2")?

Reasons for choosing the MiniLM (L6) family:

  • Performance vs. size: all-MiniLM-L6-v2 is small and fast while delivering strong semantic quality for many search tasks.
  • Low latency: great for interactive UIs and local/edge environments.
  • Lower resource requirements: works on a laptop or small VM without needing a GPU.
  • Easy to scale: because embeddings are compact and quick to compute, the system is cheaper and faster to run.

Tradeoffs: larger models (e.g., MPNet or large transformer models) can capture subtler semantics more accurately, but they need more memory and compute and add latency. For a lightweight local project, MiniLM is an excellent default.
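
For a sense of how little code this takes, here is a minimal sketch (assuming sentence-transformers is installed; the snippet is illustrative, not a module from this repo):

from sentence_transformers import SentenceTransformer

# Load the compact MiniLM model and embed a sentence.
model = SentenceTransformer("all-MiniLM-L6-v2")
vec = model.encode("lightweight local semantic search")
print(vec.shape)  # (384,) -- all-MiniLM-L6-v2 produces 384-dimensional vectors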

Architecture

High-level flow:

  1. Preprocess: read raw text files from data/docs/, clean and split them into chunks.
  2. Embed: compute dense vector embeddings for each chunk using the MiniLM sentence-transformer model.
  3. Cache: store embeddings and minimal metadata in cache/embeddings.json (managed by src/cache_manager.py).
  4. Search: for a user query, compute its embedding and perform nearest-neighbor search across cached vectors (src/search_engine.py).
  5. Serve: expose functionality via the Streamlit UI (app.py) and the FastAPI endpoints (src/api.py).

Components and where to find them:

  • data/docs/ — source documents (text files).
  • cache/ — embedding cache and manifest.
  • src/preprocess.py — helpers to load and prepare text.
  • src/embedder.py — code that talks to the sentence-transformers model to compute embeddings.
  • src/cache_manager.py — read/write cached vectors and metadata (see the sketch after this list for a plausible file layout).
  • src/search_engine.py — nearest-neighbor search logic.
  • src/api.py — FastAPI app exposing endpoints for search and metadata.
  • app.py — Streamlit UI front-end for interactive exploration.
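
The exact layout of cache/embeddings.json is whatever src/cache_manager.py writes; as a rough illustration, a cache entry might look like this (the field names below are assumptions, not the repo's actual schema):

import json

# Hypothetical cache shape -- the real schema lives in src/cache_manager.py.
cache = {
    "model": "all-MiniLM-L6-v2",
    "entries": [
        {
            "source": "data/docs/example.txt",  # assumed field names
            "chunk_id": 0,
            "text": "first chunk of the document",
            "vector": [0.012, -0.034, 0.056],   # truncated; real vectors hold 384 floats
        }
    ],
}

with open("cache/embeddings.json", "w", encoding="utf-8") as f:
    json.dump(cache, f)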

Simplified diagram:

Text Files (data/docs/) --> Preprocess --> Embedder (MiniLM) --> Cache (cache/embeddings.json)
                                                                      |
                                                                      v
                                     Query Embedding  -->  Search Engine
                                                               |
                                         +---------------------+---------------------+
                                         |                                           |
                                Streamlit UI (app.py)                      FastAPI (src/api.py)
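
The whole flow fits in a few lines of Python. The sketch below uses naive fixed-size chunking and a NumPy dot-product search; the repo's own modules (src/preprocess.py, src/search_engine.py) may differ in detail:

from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Preprocess: naive fixed-size chunking (src/preprocess.py may be smarter)
chunks = []
for doc in Path("data/docs").glob("*.txt"):
    text = doc.read_text(encoding="utf-8")
    chunks += [text[i:i + 500] for i in range(0, len(text), 500)]

# 2. Embed: normalized vectors make the dot product equal cosine similarity
vectors = model.encode(chunks, normalize_embeddings=True)

# 3. Search: embed the query and rank chunks by cosine similarity
query = model.encode(["how do I rebuild the cache?"], normalize_embeddings=True)[0]
scores = vectors @ query
for idx in np.argsort(scores)[::-1][:3]:
    print(f"{scores[idx]:.3f}  {chunks[idx][:80]!r}")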

How to use (Windows -- PowerShell)

  1. Create & activate a virtual environment, then install dependencies:
python -m venv .venv
# Activate the venv in PowerShell
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
  2. Run the Streamlit UI (interactive demo):
# from the project root
streamlit run app.py

The Streamlit app typically opens at http://localhost:8501. Use it to upload documents, run queries and inspect results.

  3. Run the FastAPI server (programmatic access):
# from the project root
# serve the API on port 8000
uvicorn src.api:app --reload --port 8000

The API root will be at http://127.0.0.1:8000, and FastAPI serves interactive OpenAPI docs at http://127.0.0.1:8000/docs by default.

  4. Running both at the same time

Run the Streamlit UI and the FastAPI server in separate terminals. They use different default ports (Streamlit 8501, FastAPI 8000), so they do not conflict.

Why run both? You get the best of both worlds:

  • Streamlit: great for manual inspection, demos, and iterating on UI/UX for search and retrieval.
  • FastAPI: exposes endpoints for automated tests, integrations, or multiple concurrent clients.

Example: run the API in one terminal and the Streamlit demo in another; the Streamlit app can call the local API for queries, or other programs can hit the API directly.

Quickstart script

A small PowerShell helper script quickstart.ps1 is provided to automate setup and optionally start both servers in new PowerShell windows.

From the project root you can:

# Install dependencies only
.\quickstart.ps1 -InstallOnly

# Create venv, install deps and start both Streamlit and FastAPI
.\quickstart.ps1 -RunBoth

# Start only Streamlit (after installing)
.\quickstart.ps1 -RunStreamlit

# Start only FastAPI (after installing)
.\quickstart.ps1 -RunAPI

If PowerShell blocks script execution, relax the policy for the current process only:

Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass

The script will create a .venv virtual environment (if missing), install packages from requirements.txt into that venv, and open new PowerShell windows to run Streamlit and/or Uvicorn so both services can run concurrently.

Usage examples

  • Interactive: open Streamlit, point it at data/docs/ and click the button to build embeddings and query.
  • Programmatic: POST a JSON payload to /search (or whichever endpoint exists in src/api.py) with a query field and receive nearest-neighbor results as JSON.

Note: the exact API routes and function names depend on src/api.py and may be extended. Open src/api.py if you want to add or inspect endpoints.
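
For illustration, a client call might look like this (the route name, query field, and top_k parameter are assumptions; verify them against src/api.py):

import requests

# Hypothetical request -- the actual route, payload, and response shape
# depend on what src/api.py defines.
resp = requests.post(
    "http://127.0.0.1:8000/search",
    json={"query": "semantic similarity", "top_k": 5},
)
resp.raise_for_status()
print(resp.json())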

Troubleshooting

  • If embeddings seem missing or stale, inspect cache/embeddings.json; deleting it forces a rebuild on the next run.
  • If the model fails to load, ensure sentence-transformers and its dependencies are installed in the active virtual environment.
  • Port conflicts: if port 8000 or 8501 is already in use, pick alternatives (e.g., streamlit run app.py --server.port 8502 or uvicorn src.api:app --port 8001).

About

A complete end-to-end semantic search system with document preprocessing, embedding generation, caching, FAISS/NumPy similarity search, FastAPI backend, and a polished Streamlit UI. Implements efficient embeddings, caching logic, and ranking explanations for transparent retrieval.
