MobileRAG is a self-contained Retrieval-Augmented Generation (RAG) system that pairs a FastAPI backend with a lightweight browser UI and CLI for chat-based workflows. RAG data is indexed on disk, model responses stream over WebSockets, and chat history is kept in a local SQLite store for fast replay and inspection.
- API Server: `src/api/server.py` exposes REST endpoints for listing chats and a WebSocket endpoint (`/v1/chat/ws`) that streams tokens, thinking hints, and metadata while delegating retrieval and generation to the registered components.
- RAG Pipeline: `src/rag/pipeline.py` scans the document globs defined in `configs/mobile_rag.yaml`, chunks and embeds every document, stores metadata in `data/rag/rag_meta.db`, and keeps vectors in `data/rag/chunks.index.faiss`. Retrieval reranks candidates before feeding them into the prompt slot that `build_llm_messages()` prepares for the LLM.
- Persistence: `src/storage/history_db.py` keeps every chat and assistant turn in `data/history/history.db` so the UI and CLI can replay multi-turn conversations with the original timestamps, thinking traces, and auxiliary metadata that `src/storage/persist.py` appends.
- Clients: The browser UI (`src/api/static/*`) connects to the WebSocket endpoint, renders Markdown + KaTeX, exposes a thinking drawer, and mirrors the chat list stored in the database. The CLI (`src/chat/cli.py`) is a thin WebSocket client that can list chats, load archives, delete sessions, and stream both thinking tokens and answers.
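Because the history store is plain SQLite, you can inspect it directly from a REPL. The snippet below only lists tables and row counts rather than assuming a schema; the real table and column layout is defined in `src/storage/history_db.py`.

```python
# Peek inside the chat history store without assuming its schema;
# the authoritative table definitions live in src/storage/history_db.py.
import sqlite3

conn = sqlite3.connect("data/history/history.db")
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
for name in tables:
    count = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
    print(f"{name}: {count} rows")
conn.close()
```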
- Install requirements (any Python 3.11+ virtualenv is fine): `pip install -r requirements.txt`
- Adjust the configuration in `configs/mobile_rag.yaml` to point at your document globs, desired model backend, and logging level. The defaults target a local Ollama instance with simple hashing embeddings.
- (Optional) Populate the RAG index by running a quick Python script or REPL that calls `RagPipeline(load_config()).build_or_update_index()` so that chunks exist before the first chat (a minimal sketch follows below). Without it, the pipeline simply builds the index on demand at the first query.
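A one-off indexing script might look like the sketch below. The import paths for `load_config` and `RagPipeline` are assumptions based on the module layout described above, so adjust them to match the actual definitions.

```python
# build_index.py -- one-off RAG index build before the first chat.
# Assumes load_config() is importable alongside the pipeline; adjust the
# import if your tree defines it elsewhere (e.g. a dedicated config module).
from src.rag.pipeline import RagPipeline, load_config

def main() -> None:
    pipeline = RagPipeline(load_config())
    pipeline.build_or_update_index()  # scan DOCS_GLOBS, chunk, embed, write data/rag/
    print("RAG index is up to date.")

if __name__ == "__main__":
    main()
```

Run it from the repository root so that the relative `configs/` and `data/` paths resolve, assuming the config uses repository-relative paths.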
- Start the backend: `uvicorn src.api.server:app --host 0.0.0.0 --port 8000 --reload`. The server serves the chat UI at `/` and mounts `/static` for the supporting assets.
- Use the browser UI: open `http://localhost:8000/` to reach the built-in chat interface. Messages are sent over the WebSocket to `/v1/chat/ws`, tokens stream back as `think_token`/`answer_token` events, and the thinking drawer captures the hidden reasoning trace (a raw WebSocket client is sketched after this list).
- CLI access: run `python -m src.chat.cli --server http://localhost:8000`. The CLI keeps a live WebSocket session, prints assistant thinking durations, and lets you `/list`, `/load <chat_id>`, or `/del <chat_id>` without leaving the terminal.
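If you want to script against the streaming endpoint directly, a raw client can be as small as the sketch below. The outbound payload shape (a JSON object with a `message` field) and the exact event field names are assumptions; check `src/api/server.py` and `src/chat/cli.py` for the real protocol before relying on it.

```python
# ws_probe.py -- minimal raw client for the /v1/chat/ws streaming endpoint.
# Requires: pip install websockets
# The request payload ({"message": ...}) and event field names are guesses;
# the authoritative protocol lives in src/api/server.py and src/chat/cli.py.
import asyncio
import json

import websockets

async def main() -> None:
    async with websockets.connect("ws://localhost:8000/v1/chat/ws") as ws:
        await ws.send(json.dumps({"message": "What documents are indexed?"}))
        async for raw in ws:
            event = json.loads(raw)
            kind = event.get("type")
            if kind in ("think_token", "answer_token"):
                print(event.get("token", ""), end="", flush=True)
            elif kind == "done":  # hypothetical terminal event
                break

asyncio.run(main())
```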
| Section | Purpose |
|---|---|
| `MODEL` | Controls the model backend (Ollama by default), streaming behavior, temperature, and maximum output tokens for `create_chat_model()`. |
| `RAG` | Enables/disables retrieval and tunes chunk size/overlap, the embedding backend (hashing or Ollama), reranker order, and the output budget for the prompt-injection guard. |
| `DOCS_GLOBS` / `DOCS_EXTS` | Define where `src/rag/fs_scan.py` looks for sources when rebuilding the Faiss index. |
| `HISTORY` | Path for chat persistence; the API expects this directory to exist and writes `history.db` automatically. |
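If you are unsure how your copy of the file is laid out, a quick way to confirm the top-level sections is to load it with PyYAML; the nested keys under each section are whatever `configs/mobile_rag.yaml` actually defines.

```python
# inspect_config.py -- print the top-level sections of configs/mobile_rag.yaml.
# Requires: pip install pyyaml
import yaml

with open("configs/mobile_rag.yaml") as fh:
    cfg = yaml.safe_load(fh)

for section in ("MODEL", "RAG", "DOCS_GLOBS", "DOCS_EXTS", "HISTORY"):
    print(f"{section}: {cfg.get(section)!r}")
```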
- `data/raw/`: place source documents (text, Markdown, PDF, etc.) for chunking.
- `data/rag/`: vector index files (`chunks.index.faiss`) and metadata (`rag_meta.db`).
- `data/history/`: chat history SQLite database with assistant/user turns, reasoning traces, and meta payloads.
- The FastAPI app initializes `HistoryDB`, `RagPipeline`, and the model loader once at import time, so changes to `configs/mobile_rag.yaml` require restarting the server.
- The WebSocket handler in `src/api/server.py` buffers LLM output through `split_think_stream()` so thinking tokens can be surfaced independently of final answers.
- Extend `src/rag/embedder.py` to add new embedding backends, or swap in a different reranker via `src/rag/rerank.py` (a hypothetical backend sketch follows below).
- Improve ingestion tooling so that `RagPipeline` can be driven from a CLI command instead of relying on first-query laziness.
- Add automated tests for `src/rag/pipeline.py` and the persistence layer to guard regressions during refactors.
- Implement file uploads/drag-and-drop in the browser UI and emit metadata to the server when new assets are attached.
- Document deployment steps (containerized, cloud, or mobile build) and expose a health-check endpoint for readiness probes.