Ask questions against your industrial equipment manuals. Get cited, engineering-grade answers. Fully offline. No cloud. No API keys. Runs on an AMD or NVIDIA GPU, or CPU-only.
ManualIQ is a local Retrieval-Augmented Generation (RAG) system built for engineers and technicians who need fast, reliable answers from dense technical documentation — service manuals, maintenance guides, wiring diagrams, and more. Every answer is grounded in your documents and cites the exact chunk it came from.
Persistence & Multi-Session
- 💾 Chat history saved to disk — survives browser refresh, replayed on notebook select
- ⚙ Per-notebook custom instructions — prepended to every prompt, auto-saved
- ⬇ Export full conversation as Markdown with one click
- 🗓 Source timestamps — "Last updated" shown next to each uploaded document
Notebook Intelligence Layer
- 🗺 Per-notebook query map — tracks which chunks and sections actually answer questions
- 📈 Learned relevance boost — frequently-cited chunks get a rank nudge on future queries (2× cap, 90-day decay)
- 🧠 Notebook memory summary — LLM auto-generates a profile every 20 queries; injected into system prompt; viewable in sidebar
- ⚖ Source load balancing — every selected source guaranteed ≥1 chunk in context; large manuals can no longer silence small bulletins
UX Fixes
- 🔧 Chat scroll freeze fixed — incremental DOM append instead of full text replace per token
- ⧉ Copy button on every assistant answer
| Capability | Detail |
|---|---|
| 100% Offline | Ollama for local LLM inference — air-gap capable after model download |
| Docling OCR Pipeline | Handles scanned/image-heavy PDFs that other tools fail on |
| Hybrid Retrieval | Vector (semantic) + BM25 lexical search merged for best precision |
| Multi-Notebook | Isolate different equipment manuals into separate workspaces |
| DeepSeek-R1 Reasoning | Chain-of-thought model with live 🧠 Reasoning… indicator |
| Vision Captioning | Optional: extract & caption diagrams via multimodal model |
| AMD GPU Support | Vulkan compute via OLLAMA_VULKAN=1 — works on AMD Radeon iGPU/dGPU |
| FastAPI SPA | Custom dark-mode single-page app — real-time SSE streaming; no framework dependency |
| Anti-Hallucination | Every answer must cite [Chunk N] or return a fixed refusal string |
| Skip Re-indexing | Already-embedded documents are detected and skipped automatically |
| Typed Chunks | Chunks tagged as text / figure / table for richer retrieval |
| Persistent Chat History | Chat turns saved to data/chats/{nb_id}.jsonl; replayed on notebook select; survives browser refresh |
| Per-Notebook Instructions | Custom system prompt per notebook (⚙ gear panel); prepended to every prompt; auto-saved |
| Export Conversation | One-click Markdown download of full chat session (⬇ Export in title bar) |
| Ingest Timestamps | "Last updated" date shown next to each source in the sidebar |
| Notebook Intelligence | Per-notebook query map tracks which chunks and sections answer questions; drives learned boost |
| Learned Relevance Boost | Frequently-cited chunks get a rank nudge on future queries (max 2×, 90-day decay) |
| Notebook Memory Summary | LLM auto-generates a 2–4 sentence profile every 20 queries; injected into system prompt; shown in 🧠 sidebar panel |
| Source Load Balancing | Every selected source guaranteed ≥1 chunk in context — prevents large manuals silencing small bulletins |
| Smooth Streaming UX | Incremental DOM append (no full replace per token); rAF-throttled scroll; per-answer copy button (⧉) |
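The anti-hallucination rule above can be pictured as a simple post-check on the model's output. This is an illustrative sketch only: the function name and refusal wording are assumptions, not the project's actual code.

```python
import re

# Assumed refusal wording -- the project uses its own fixed string.
REFUSAL = "I cannot answer this from the provided documents."

def enforce_citations(answer: str) -> str:
    """Accept an answer only if it cites at least one [Chunk N] marker;
    otherwise fall back to the fixed refusal string."""
    if re.search(r"\[Chunk \d+\]", answer):
        return answer
    return REFUSAL
```

A check like this runs after generation, so even a model that ignores the system prompt cannot return an uncited answer.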
```
                      PDF Upload
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ ingest.py — Docling Python API                      │
│ • do_ocr=True (RapidOCR for scanned pages)          │
│ • Table structure extraction → markdown tables      │
│ • Figure extraction → minicpm-v captions (optional) │
│ Output: {notebook_id}/(unknown).md                  │
│ Sidecar: .ingest_timestamps.json (per source)       │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ index.py — Chunking & Embedding                     │
│ • LangChain MarkdownHeaderTextSplitter              │
│ • Section-prefix baked into vectors                 │
│   "[Section: A > B]\n{chunk text}"                  │
│ • nomic-embed-text via Ollama                       │
│ • Stored in LanceDB (serverless)                    │
│ • Lexical chunks saved to chunks.jsonl              │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ query.py — Hybrid Retrieval                         │
│ • Vector search (LanceDB ANN)                       │
│ • BM25 lexical search (chunks.jsonl)                │
│ • Fetch 2× pool → load-balance (≥1 per source)      │
│ • Apply learned relevance boost (notebook_map.py)   │
│ • Trim to top-k                                     │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ app.py / server.py — Prompt Builder                 │
│ • Inject notebook memory summary (if available)     │
│ • Inject per-notebook custom instructions           │
│ • Build [Chunk N] context + history block           │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ DeepSeek-R1:8b via Ollama                           │
│ • System prompt enforces citation + no hallucination│
│ • <think> blocks stripped; 🧠 indicator shown live  │
│ • Streamed token-by-token to UI                     │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ notebook_map.py — Notebook Intelligence             │
│ • Append turn to data/chats/{nb_id}.jsonl           │
│ • Update data/maps/{nb_id}_query_map.json           │
│   – chunk_usage, section_usage, query_log           │
│ • Every 20 queries: regenerate memory summary       │
│     data/maps/{nb_id}_memory.md                     │
└─────────────────────────────────────────────────────┘
```
- Python 3.11+
- Ollama installed and running
Pull the required models:
```shell
ollama pull deepseek-r1:8b
ollama pull nomic-embed-text
```

For AMD GPU acceleration (optional but highly recommended):

```shell
# Windows — set as a System Environment Variable:
OLLAMA_VULKAN=1
# Then restart Ollama
```

Then clone the repo and install dependencies:

```shell
git clone https://github.com/Ashwathama2024/manual-diagnostic-ai.git
cd manual-diagnostic-ai
python -m venv .venv
# Windows:
.venv\Scripts\activate
# Linux/Mac:
source .venv/bin/activate
pip install -r requirements.txt
```

**Option A — FastAPI SPA (recommended)**

```shell
python scripts/server.py --port 7860
# Opens http://127.0.0.1:7860
```

**Option B — Gradio Web UI (legacy)**

```shell
python scripts/app.py
# Opens http://127.0.0.1:7860
```

**Option C — Terminal REPL (power users)**

```shell
python scripts/chat.py --retrieval-mode hybrid
```

```
┌──────────────────────────┬────────────────────────────────────────────┐
│ SIDEBAR                  │ TITLE BAR                                  │
│ ─────────────────────    │ 📓 Main Engine · 3 sources · 412 chunks    │
│ [Select notebook ▼]      │ [⬇ Export] [✕ Clear]                       │
│ [+ New]                  ├────────────────────────────────────────────┤
│                          │ MESSAGES                                   │
│ SOURCES                  │                                            │
│ ─────────────────────    │ You: What is the FO pump pressure?         │
│ ☑ 📄 ME_manual.md        │                                            │
│   Last updated: Apr 1    │ ManualIQ: Based on [Chunk 3] the normal    │
│ ☑ 📄 FO_system.md        │   operating pressure is 8–10 bar… [⧉ Copy] │
│ ☑ Core: propulsion       │                                            │
│ [⬆ Upload PDF / DOCX]    │ [Type your question…] [Send ➤]             │
│ [🗑 Remove Selected]     ├────────────────────────────────────────────┤
│                          │ CITATIONS                                  │
│ ⚙ NOTEBOOK SETTINGS      │ [1] FO_system.md › Pressure Specs          │
│ ┌──────────────────┐     │ [2] ME_manual.md › Fuel System             │
│ │ Custom prompt…   │     └────────────────────────────────────────────┘
│ └──────────────────┘
│
│ 🧠 NOTEBOOK INTELLIGENCE
│ "Covers MAN B&W 6S60MC-C.
│  Strong: fuel injection,
│  turbocharger maintenance."
│ [↺ Regenerate]
│
│ Docs: 3 · Chunks: 412
└──────────────────────────
```
```yaml
paths:
  raw_dir: data/raw
  processed_dir: data/processed
  lancedb_dir: data/lancedb
  lancedb_table: manual_chunks
  notebooks_registry: data/notebooks.json
  chats_dir: data/chats          # per-notebook chat history JSONL
  maps_dir: data/maps            # per-notebook query maps + memory summaries

models:
  embedding: nomic-embed-text    # Ollama embedding model
  reasoning: deepseek-r1:8b      # Reasoning/chat LLM
  vision: minicpm-v              # Vision model for diagram captions

indexing:
  min_chunk_chars: 400
  max_chunk_chars: 1800
  overlap_chars: 200
  batch_size: 16

retrieval:
  mode: hybrid                   # vector | vectorless | hybrid
  top_k: 6
  load_balance: true             # guarantee ≥1 chunk per source (default on)

chat:
  show_thinking: false           # Show DeepSeek <think> blocks
  history_turns: 6

vision:
  enabled: false                 # Enable diagram captioning (GPU recommended)
  timeout: 180                   # Seconds per image
  max_per_doc: 10

runtime:
  ollama_num_thread: 12
  ollama_ctx_num: 8192
```

| Script | Purpose | Usage |
|---|---|---|
| `scripts/server.py` | FastAPI backend + SPA (primary) | `python scripts/server.py` |
| `scripts/app.py` | Gradio web UI (legacy) + shared helpers | `python scripts/app.py` |
| `scripts/notebook_map.py` | Query map, learned boost, memory summary | (library — imported by server.py) |
| `scripts/ingest.py` | PDF → Markdown | `python scripts/ingest.py file.pdf` |
| `scripts/index.py` | Markdown → LanceDB | `python scripts/index.py` |
| `scripts/query.py` | One-shot retrieval + answer | `python scripts/query.py "question"` |
| `scripts/chat.py` | Terminal REPL | `python scripts/chat.py` |
| `scripts/preflight.py` | Validate environment | `python scripts/preflight.py` |
| `scripts/common.py` | Shared utilities | (library — not run directly) |
```
data/
├── notebooks.json                  # Notebook registry (id, name, category, custom_prompt)
├── raw/{nb_id}/                    # Original PDFs/DOCX per notebook
├── processed/{nb_id}/
│   ├── *.md                        # Docling-converted markdown
│   ├── chunks.jsonl                # BM25 lexical index
│   └── .ingest_timestamps.json     # Per-source ingest dates
├── chats/{nb_id}.jsonl             # Chat history (one turn per line)
├── maps/
│   ├── {nb_id}_query_map.json      # Usage tracking + boost table
│   └── {nb_id}_memory.md           # LLM-generated notebook summary
└── lancedb/
    ├── manual_chunks.lance/        # Notebook chunk vectors
    └── core_knowledge.lance/       # Core knowledge base vectors
```
- Upload — PDF dropped into the UI or placed in `data/raw/`
- Ingest (`ingest.py`) — Docling converts to structured Markdown with:
  - Full OCR on scanned/image pages (RapidOCR)
  - Tables preserved as Markdown tables
  - Figures optionally captioned by multimodal model
  - Timestamp written to `.ingest_timestamps.json`
- Chunk (`index.py`) — LangChain splits on Markdown headers; each chunk records its `section_path` (e.g., `Engine Room > Fuel System > Pump Specs`)
- Embed (`index.py`) — `nomic-embed-text` embeds `"[Section: X > Y]\nchunk text"` so section context is baked into the vector
- Store — Vectors go to LanceDB; raw text goes to `chunks.jsonl` for lexical search
- Query (`query.py` + `server.py`) — BM25 + vector results fetched (2× pool) → load-balanced → learned boost applied → top-k returned
- Reason — DeepSeek-R1 receives chunks + memory summary + custom instructions + history; streams cited answer
- Learn (`notebook_map.py`) — Turn written to chat JSONL; query map updated with chunk usage; memory regenerated every 20 queries
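The Embed step bakes the section breadcrumb into the string handed to the embedder. A minimal sketch of that formatting (the helper name is illustrative):

```python
def embedding_text(section_path: list[str], chunk_text: str) -> str:
    """Prefix the chunk with its section breadcrumb so the section context
    is part of the embedded vector, e.g. "[Section: A > B]\\n{chunk text}"."""
    return f"[Section: {' > '.join(section_path)}]\n{chunk_text}"
```

Because the breadcrumb is inside the embedded text itself, a query like "fuel pump specs" can match a chunk whose body never repeats the word "fuel" but whose section path does.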
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 32 GB |
| CPU | 4 cores | 12+ cores |
| GPU | None (CPU-only) | AMD Radeon (4GB+ VRAM via BIOS UMA) or NVIDIA 8GB+ |
| Storage | 5 GB (models) | 20 GB (models + docs) |
AMD iGPU tip: If your laptop has an AMD Radeon integrated GPU with low dedicated VRAM, increase the UMA Frame Buffer Size in BIOS to 4–8 GB. Set OLLAMA_VULKAN=1 in System Environment Variables. This can yield 10–15× faster inference than CPU-only.
| Mode | How it works | Best for |
|---|---|---|
| `vector` | Semantic similarity via nomic-embed-text | Conceptual / paraphrased questions |
| `vectorless` | BM25 keyword scoring on `chunks.jsonl` | Exact part numbers, codes, spec lookups |
| `hybrid` | Both merged + deduplicated + load-balanced (default) | All-around best results |
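The table doesn't spell out how `hybrid` merges the two ranked lists; one simple interleave-and-deduplicate scheme looks like this (a sketch under assumptions, not necessarily what query.py does):

```python
from itertools import zip_longest

def merge_hybrid(vector_ids: list[str], bm25_ids: list[str]) -> list[str]:
    """Interleave two ranked chunk-id lists, keeping only the first
    occurrence of each id so duplicates from both searches collapse."""
    merged: list[str] = []
    seen: set[str] = set()
    for v_id, b_id in zip_longest(vector_ids, bm25_ids):
        for cid in (v_id, b_id):
            if cid is not None and cid not in seen:
                seen.add(cid)
                merged.append(cid)
    return merged
```

Interleaving keeps both signals near the top of the merged list, so a strong lexical hit can't be buried under a long run of semantic matches (and vice versa).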
Q: Does this need internet? No. After pulling Ollama models once, everything runs offline. No telemetry, no API calls.
Q: What PDF types work best? Both text-based and scanned PDFs work. Docling's OCR handles image-heavy manuals that tools like PyMuPDF can't extract from.
Q: Can I use a different LLM?
Yes — change models.reasoning in config.yaml to any model you've pulled via ollama pull. Tested with llama3.2:3b, qwen2.5:7b, and deepseek-r1:8b.
Q: How do I enable vision diagram captioning?
Set vision.enabled: true in config.yaml. Requires minicpm-v pulled via Ollama. A GPU is strongly recommended; CPU inference takes roughly 3 minutes per image.
Q: Does chat history persist after a browser refresh?
Yes (Phase 3). Every turn is saved to data/chats/{nb_id}.jsonl and replayed automatically when you select a notebook.
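The JSONL format makes this persistence pattern simple; a minimal sketch of the append-and-replay cycle (the field names are assumptions, not the project's actual schema):

```python
import json
from pathlib import Path

def append_turn(chats_dir: Path, nb_id: str, role: str, content: str) -> None:
    """Append one chat turn as a single JSON line to {nb_id}.jsonl."""
    chats_dir.mkdir(parents=True, exist_ok=True)
    with (chats_dir / f"{nb_id}.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps({"role": role, "content": content}) + "\n")

def load_history(chats_dir: Path, nb_id: str) -> list[dict]:
    """Replay all saved turns for a notebook (empty list if none yet)."""
    path = chats_dir / f"{nb_id}.jsonl"
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]
```

Append-only JSONL means each turn is a single atomic write, so a crash mid-session loses at most the turn in flight.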
Q: What is the Notebook Intelligence panel? The 🧠 panel shows a natural-language summary of what the notebook is good at answering. It is generated by the LLM after every 20 queries and injected into the system prompt so answers become more contextually aware over time.
Q: How does load balancing work?
When you have multiple sources selected, the retriever fetches a 2× candidate pool, then guarantees at least one chunk from every source before trimming to top_k. Set retrieval.load_balance: false in config.yaml to disable.
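In code terms, the guarantee works roughly like this; a sketch over a score-sorted candidate pool (the real query.py logic may differ in details):

```python
def balance_and_trim(candidates: list[dict], top_k: int = 6) -> list[dict]:
    """Given a best-first sorted 2x pool of {"id", "source", "score"} dicts,
    reserve the top chunk from every source, fill remaining slots by score,
    then trim to top_k."""
    best_per_source: dict[str, dict] = {}
    for cand in candidates:                   # pool is sorted best-first
        best_per_source.setdefault(cand["source"], cand)
    picked = list(best_per_source.values())
    picked_ids = {c["id"] for c in picked}
    for cand in candidates:                   # fill remaining slots by score
        if len(picked) >= top_k:
            break
        if cand["id"] not in picked_ids:
            picked.append(cand)
            picked_ids.add(cand["id"])
    return sorted(picked, key=lambda c: -c["score"])[:top_k]
```

With this shape, a five-chunk run of high-scoring hits from a large manual can fill at most top_k − 1 slots when a smaller bulletin is also selected.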
MIT — see LICENSE