Archive Brain is a local-first document archive assistant.
It ingests your personal files, enriches them with metadata, and enables semantic search and Retrieval-Augmented Generation (RAG) — all running on your own machine.
This project is designed for people who want to understand and explore their archives, not ship their data to the cloud.
- Ingests documents from local folders automatically
- Extracts text from PDFs, images (OCR), and plain text
- Segments large documents into meaningful chunks
- Uses local LLMs to generate:
  - Titles
  - Summaries
  - Tags
- Builds vector embeddings for semantic search
- Gallery view for browsing and analyzing images with vision models
- Real-time dashboard showing pipeline progress and current processing phase
- Lets you ask natural-language questions over your archive with RAG
Main semantic search interface
All processing runs locally via Docker and Ollama.
Archive Brain is local-first by default.
- Files are read from your local filesystem
- All processing happens inside Docker containers on your machine
- LLM inference runs via Ollama (local or self-hosted)
- No data is sent to external services unless you explicitly configure it
If you point the system at a remote LLM or external API, you control that tradeoff.
```bash
docker compose -f docker-compose.yml --profile prod up -d --build
```

On first run, the system will download required LLM models (several GB). You can monitor progress with:

```bash
docker compose -f docker-compose.yml --profile prod logs -f ollama-init
```

Once running, open:
- Web UI: http://localhost:3000
  - Search your archive with semantic queries
  - Browse files and images in gallery view
  - Analyze images with AI vision models
  - Monitor pipeline progress on the dashboard
- API: http://localhost:8000
That’s it — the ingestion pipeline starts automatically.
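To verify that the stack came up, you can list the running services; a minimal check using only standard Docker Compose and curl commands (the backend is FastAPI, which serves interactive API docs at /docs by default unless the project disables them):

```bash
# Show the state of all services in the prod profile
docker compose -f docker-compose.yml --profile prod ps

# Quick liveness check against the API's interactive docs page
curl -I http://localhost:8000/docs
```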
➡️ New here?
Read docs/first-run.md for what to expect on first startup.
Archive Brain runs in Docker, so folders from your host system must be explicitly mounted before they can be indexed.
If you can search but clicking a document shows empty content, or "Open source" returns `{"detail":"File not found on disk"}`, it usually means the file path exists in the database but the underlying folder is not mounted into the containers.
Set STORY_SOURCE / KNOWLEDGE_SOURCE in .env to point at your real folders (see .env.example), then restart the stack.
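For example, a minimal `.env` entry might look like this (the paths below are placeholders; substitute the absolute paths to your own folders):

```bash
# Illustrative values only - see .env.example for the full variable list
STORY_SOURCE=/home/yourname/Documents/stories
KNOWLEDGE_SOURCE=/home/yourname/Documents/notes
```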
This is a one-time setup step and is required before your files will appear in the UI.
➡️ Read: Adding Folders to Archive Brain
Archive Brain runs a background pipeline:
- Ingest – Scan folders and extract raw content
- Segment – Split content into logical chunks
- Enrich Documents – Generate metadata for full documents
- Enrich Chunks – Optionally enrich chunks (configurable mode)
- Embed Documents – Create vector embeddings for document-level search
- Embed Chunks – Create vector embeddings for chunk-level search
- Retrieve & Generate – Power semantic search and Q&A
The dashboard shows real-time progress, including which phase the worker is currently processing and estimated completion times for each stage.
Dashboard: pipeline status with real-time phase tracking
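If you prefer the terminal to the dashboard, you can also follow the worker's logs as it moves through the phases. A sketch, assuming the background worker runs as a compose service named `worker` (check `docker compose ps` for the actual service name):

```bash
# Follow the pipeline worker in real time
# ("worker" is an assumed service name; list services with `docker compose ps`)
docker compose -f docker-compose.yml --profile prod logs -f worker
```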
➡️ For details, see docs/architecture.md.
| Type | Extensions | Notes |
|---|---|---|
| Text | .txt, .md, .html | Full text extraction |
| PDF | .pdf | Text + OCR fallback |
| Images | .jpg, .png, .gif, .webp, .tiff | OCR + vision descriptions |
| Documents | .docx | Planned |
Archive Brain includes a dedicated gallery view for browsing and analyzing images:
- Grid and list views for browsing all extracted images
- Lightbox viewer with full-resolution display
- OCR text extraction from images
- AI-powered descriptions using vision models (LLaVA)
- On-demand analysis - generate descriptions for any image with a single click
- Sortable views - sort by date, filename, or file size
Images are automatically extracted during ingestion and can be analyzed individually or in batch.
```bash
cp .env.example .env
```

Key settings:

- `OLLAMA_MODEL` – Chat model for enrichment and Q&A
- `OLLAMA_EMBEDDING_MODEL` – Embedding model for vector search
- `OLLAMA_VISION_MODEL` – Vision model for image analysis
- `DB_PASSWORD` – Database password
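As a sketch, a working configuration could look like the following; the model names are illustrative choices of models available through Ollama (the project uses LLaVA-style vision models for image analysis), not values mandated by Archive Brain:

```bash
# Illustrative .env values - use whichever Ollama models you have pulled
OLLAMA_MODEL=llama3
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
OLLAMA_VISION_MODEL=llava
DB_PASSWORD=change-me
```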
Source folders and file types are defined in:
config/config.yaml
Chunk Enrichment Mode
Control how chunks are processed to balance speed vs. metadata richness:
- `none` – Skip chunk enrichment entirely (fastest)
- `embed_only` – Only create embeddings, no LLM enrichment (recommended default)
- `full` – Full LLM enrichment with titles, summaries, and tags (slowest)

Change this in Settings via the web UI or by calling the API. The `embed_only` mode provides excellent search quality while dramatically reducing processing time.
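If you want to script the change rather than use the UI, a call shaped like the one below would do it; the route and field name here are hypothetical, so check the live FastAPI docs (http://localhost:8000/docs by default) for the actual endpoint and payload:

```bash
# HYPOTHETICAL route and field name - consult the API docs for the real endpoint
curl -X POST http://localhost:8000/api/settings \
  -H "Content-Type: application/json" \
  -d '{"chunk_enrichment_mode": "embed_only"}'
```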
Multi-Provider LLM Support
Archive Brain can distribute load across multiple LLM providers for faster processing:
- Configure additional Ollama servers or cloud providers
- Worker automatically balances requests across available providers
- Improves throughput for large archives
Self-contained Docker setup with Ollama included:
```bash
docker compose -f docker-compose.yml --profile prod up -d
```

GPU acceleration requires the NVIDIA Container Toolkit:

```bash
docker compose -f docker-compose.yml -f docker-compose.gpu.yml --profile prod up -d
```

To use an external LLM, run Ollama on the host or another machine:

```bash
export OLLAMA_URL=http://host.docker.internal:11434
docker compose -f docker-compose.yml -f docker-compose.external-llm.yml --profile prod up -d
```

- Pipeline steps are designed to be idempotent
- Re-running ingestion will skip unchanged files
- Metadata and embeddings are reused when possible
To reset everything:
```bash
docker compose -f docker-compose.yml --profile prod down -v
docker compose -f docker-compose.yml --profile prod up -d --build
```

- Single-user only - no multi-tenancy support
- No authentication or access control
- Not optimized for real-time ingestion - designed for batch processing
- Large archives may require extended processing time
  - Use `embed_only` mode for faster processing
  - GPU acceleration recommended for 1M+ document archives
- The worker cycles through phases, processing one type of task at a time (documents → chunks → embeddings)
Visualize document and chunk embeddings
- PostgreSQL + pgvector
- Python + FastAPI
- React + Vite
- Apache Tika, Tesseract OCR
- Ollama (LLMs)
- Docker Compose
- First Run Guide: docs/first-run.md
- Adding Source Folders: docs/ADDING_FOLDERS.md
- Architecture Overview: docs/architecture.md