
Archive Brain

License: MIT

Archive Brain is a local-first document archive assistant.
It ingests your personal files, enriches them with metadata, and enables semantic search and Retrieval-Augmented Generation (RAG) — all running on your own machine.

This project is designed for people who want to understand and explore their archives, not ship their data to the cloud.


✨ What Archive Brain Does

  • Ingests documents from local folders automatically
  • Extracts text from PDFs, images (OCR), and plain text
  • Segments large documents into meaningful chunks
  • Uses local LLMs to generate:
    • Titles
    • Summaries
    • Tags
  • Builds vector embeddings for semantic search
  • Provides a gallery view for browsing and analyzing images with vision models
  • Shows real-time pipeline progress and the current processing phase on a dashboard
  • Lets you ask natural-language questions over your archive with RAG

[Screenshot: Archive Brain search UI, the main semantic search interface]

All processing runs locally via Docker and Ollama.


🔐 Data & Privacy Model

Archive Brain is local-first by default.

  • Files are read from your local filesystem
  • All processing happens inside Docker containers on your machine
  • LLM inference runs via Ollama (local or self-hosted)
  • No data is sent to external services unless you explicitly configure it

If you point the system at a remote LLM or external API, you control that tradeoff.


🚀 Quick Start

docker compose -f docker-compose.yml --profile prod up -d --build

On first run, the system will download required LLM models (several GB). You can monitor progress with:

docker compose -f docker-compose.yml --profile prod logs -f ollama-init

Once running, open:

  • Web UI: http://localhost:3000 - Search your archive with semantic queries
    • Browse files and images in gallery view
    • Analyze images with AI vision models
    • Monitor pipeline progress on the dashboard
  • API: http://localhost:8000

That’s it — the ingestion pipeline starts automatically.
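
To sanity-check that the API container is up, you can hit it directly. FastAPI serves interactive API documentation at /docs by default, so assuming this app keeps that route enabled:

curl -I http://localhost:8000/docs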

➡️ New here? Read docs/first-run.md for what to expect on first startup.


📂 Adding Your Documents

Archive Brain runs in Docker, so folders from your host system must be explicitly mounted before they can be indexed.

If search works but clicking a document shows empty content, or "Open source" returns {"detail":"File not found on disk"}, the file path most likely exists in the database while the underlying folder is not mounted into the containers. Set STORY_SOURCE / KNOWLEDGE_SOURCE in .env to point at your real folders (see .env.example), then restart the stack.

This is a one-time setup step and is required before your files will appear in the UI.
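
For example, a minimal .env pointing at two host folders might look like this (the paths are illustrative; use your own):

# Illustrative paths: replace with your real folders (see .env.example)
STORY_SOURCE=/home/you/Documents/stories
KNOWLEDGE_SOURCE=/home/you/Documents/reference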

➡️ Read: Adding Folders to Archive Brain


🧠 How It Works (High Level)

Archive Brain runs a background pipeline:

  1. Ingest – Scan folders and extract raw content
  2. Segment – Split content into logical chunks
  3. Enrich Documents – Generate metadata for full documents
  4. Enrich Chunks – Optionally enrich chunks (configurable mode)
  5. Embed Documents – Create vector embeddings for document-level search
  6. Embed Chunks – Create vector embeddings for chunk-level search
  7. Retrieve & Generate – Power semantic search and Q&A

The dashboard shows real-time progress, including which phase the worker is currently processing and estimated completion times for each stage.

[Screenshot: dashboard showing pipeline status with real-time phase tracking]

➡️ For details, see docs/architecture.md.


📦 Supported File Types

Type       Extensions                      Notes
Text       .txt, .md, .html                Full text extraction
PDF        .pdf                            Text + OCR fallback
Images     .jpg, .png, .gif, .webp, .tiff  OCR + vision descriptions
Documents  .docx                           Planned

🖼️ Image Gallery & Analysis

Archive Brain includes a dedicated gallery view for browsing and analyzing images:

  • Grid and list views for browsing all extracted images
  • Lightbox viewer with full-resolution display
  • OCR text extraction from images
  • AI-powered descriptions using vision models (LLaVA)
  • On-demand analysis - generate descriptions for any image with a single click
  • Sortable views - sort by date, filename, or file size

Images are automatically extracted during ingestion and can be analyzed individually or in batch.


⚙️ Configuration

Copy the example environment file and edit the values:

cp .env.example .env

Key settings:

  • OLLAMA_MODEL – Chat model for enrichment and Q&A
  • OLLAMA_EMBEDDING_MODEL – Embedding model for vector search
  • OLLAMA_VISION_MODEL – Vision model for image analysis
  • DB_PASSWORD – Database password
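
A typical .env might look like the following; the model names are common Ollama models used purely as examples, so substitute whichever models you have pulled:

OLLAMA_MODEL=llama3
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
OLLAMA_VISION_MODEL=llava
DB_PASSWORD=change-me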

Source folders and file types are defined in:

config/config.yaml
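
The exact schema is defined by the project, but as a purely illustrative sketch (these keys are hypothetical; consult the shipped config/config.yaml for the real ones):

# Hypothetical keys, for illustration only
sources:
  - path: /data/stories
    extensions: [".txt", ".md", ".pdf"]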

Performance Tuning

Chunk Enrichment Mode

Control how chunks are processed to balance speed vs. metadata richness:

  • none – Skip chunk enrichment entirely (fastest)
  • embed_only – Only create embeddings, no LLM enrichment (recommended default)
  • full – Full LLM enrichment with titles, summaries, and tags (slowest)

Change this in Settings via the web UI or by calling the API. The embed_only mode provides excellent search quality while dramatically reducing processing time.
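
As an API example, a request along these lines would switch the mode; the endpoint shown is hypothetical, so check the live API documentation at http://localhost:8000 for the real route:

# Hypothetical endpoint: verify the actual route in the API docs
curl -X POST http://localhost:8000/api/settings \
  -H "Content-Type: application/json" \
  -d '{"chunk_enrichment_mode": "embed_only"}'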

Multi-Provider LLM Support

Archive Brain can distribute load across multiple LLM providers for faster processing:

  • Configure additional Ollama servers or cloud providers
  • Worker automatically balances requests across available providers
  • Improves throughput for large archives
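
As a sketch only (this variable name is hypothetical; the actual mechanism is defined by the project's configuration), a multi-provider setup might look like:

# Hypothetical: comma-separated Ollama endpoints for the worker to balance across
OLLAMA_URLS=http://localhost:11434,http://gpu-box:11434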

🖥️ Deployment Options

Default (Recommended)

Self-contained Docker setup with Ollama included:

docker compose -f docker-compose.yml --profile prod up -d

NVIDIA GPU Acceleration

Requires NVIDIA Container Toolkit:

docker compose -f docker-compose.yml -f docker-compose.gpu.yml --profile prod up -d
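
To confirm the toolkit is installed correctly before launching, a standard smoke test is to run nvidia-smi inside a container:

docker run --rm --gpus all ubuntu nvidia-smi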

External Ollama (Advanced)

Run Ollama on the host or another machine:

export OLLAMA_URL=http://host.docker.internal:11434
docker compose -f docker-compose.yml -f docker-compose.external-llm.yml --profile prod up -d
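
To verify the Ollama instance is reachable before starting the stack, query its model list; Ollama answers at /api/tags (swap in the host where Ollama actually runs):

curl http://localhost:11434/api/tags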

🔄 Re-running & Iteration

  • Pipeline steps are designed to be idempotent
  • Re-running ingestion will skip unchanged files
  • Metadata and embeddings are reused when possible

To reset everything:

docker compose -f docker-compose.yml --profile prod down -v
docker compose -f docker-compose.yml --profile prod up -d --build

🚧 Current Limitations

  • Single-user only - no multi-tenancy support
  • No authentication or access control
  • Not optimized for real-time ingestion - designed for batch processing
  • Large archives may require extended processing time
    • Use embed_only mode for faster processing
    • GPU acceleration recommended for 1M+ document archives
  • Worker cycles through phases - processes one type of task at a time (documents → chunks → embeddings)

🧬 Embeddings Visualization

[Screenshot: embeddings visualization for documents and chunks]

🛠️ Tech Stack

  • PostgreSQL + pgvector
  • Python + FastAPI
  • React + Vite
  • Apache Tika, Tesseract OCR
  • Ollama (LLMs)
  • Docker Compose

📚 Documentation

  • docs/first-run.md - what to expect on first startup
  • Adding Folders to Archive Brain - mounting your source folders
  • docs/architecture.md - pipeline and system details


📄 License

MIT