nim-opensansad

RAG pipeline for Indian parliamentary data (Lok Sabha Q&A), built on the NVIDIA NIM stack.

Architecture

opensansad/lok-sabha-qa (HuggingFace or local parquet)
    │
    ▼
┌─────────────────────────────────────────────────────┐
│  INGEST (local, no API calls)                       │
│                                                     │
│  full_text (pre-extracted markdown)                 │
│    → SentenceSplitter (512 tokens, 64 overlap)      │
│    → e5-large-v2 embedding (local HuggingFace, CPU) │
│    → Milvus standalone (Docker)                     │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│  QUERY                                              │
│                                                     │
│  user query                                         │
│    → e5-large-v2 embedding (local, same model)      │
│    → Milvus ANN search (top_k=10)                   │
│    → NVIDIA NIM reranker (llama-3.2-nv-rerankqa)    │
│    → NVIDIA NIM LLM (llama-3.1-70b-instruct)        │
│    → synthesized answer with sources                │
└─────────────────────────────────────────────────────┘

Embeddings are fully local — no NVIDIA API calls during ingestion. The NVIDIA API key is only used at query time for reranking and LLM synthesis (2 calls per query, well within the free tier's 40 RPM limit).
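
The chunking step in the INGEST box (512 tokens, 64 overlap) can be pictured with the minimal sketch below. It is not LlamaIndex's SentenceSplitter, which also tries to respect sentence boundaries; it only illustrates the sliding-window arithmetic. The function name `chunk_tokens` and the whitespace-style placeholder tokens are assumptions made for illustration.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size sharing `overlap` tokens.

    Illustrative only: the real pipeline uses LlamaIndex's SentenceSplitter.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 448 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Placeholder tokens stand in for real tokenizer output.
tokens = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
chunk_lengths = [len(c) for c in chunks]
```

With these defaults, consecutive chunks share their boundary 64 tokens, so a sentence that straddles a chunk boundary is still seen whole in at least one chunk in most cases.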

Stack

| Component | Technology | Local / API |
| --- | --- | --- |
| Data source | opensansad/lok-sabha-qa | HuggingFace download |
| Document parsing | Docling + EasyOCR (done upstream in lok-sabha-dataset) | Pre-computed |
| Embeddings | `intfloat/e5-large-v2` via LlamaIndex | Local (CPU) |
| Vector DB | Milvus standalone | Local (Docker) |
| Reranker | NVIDIA NIM `nvidia/llama-3.2-nv-rerankqa-1b-v2` | API |
| LLM | NVIDIA NIM `meta/llama-3.1-70b-instruct` | API |
| Orchestration | LlamaIndex | Local |

Quick start

# 1. Clone and install
git clone <repo-url>
cd nim-opensansad
uv sync

# 2. Start Milvus
docker compose up -d

# 3. Configure
cp .env.example .env
# Edit .env — add your NVIDIA_API_KEY from https://build.nvidia.com

# 4. Ingest (downloads from HuggingFace by default)
uv run opensansad ingest --limit 20 --overwrite    # test with 20 docs
uv run opensansad ingest                            # full dataset

# Or from a local parquet:
uv run opensansad ingest --parquet /path/to/lok_sabha_qa.parquet --limit 100

# 5. Build metadata DB (for aggregate stats)
uv run opensansad build-db

# 6. Search
uv run opensansad search "What did the Home Minister say about border security?"

# Search with MP/ministry stats injection
uv run opensansad search "what issues have been raised" --mp "KANGNA RANAUT"
uv run opensansad search "recent questions" --min "TRIBAL AFFAIRS"
uv run opensansad search "education questions" --mp "SHASHI THAROOR" --min "EDUCATION"

# Discover canonical names
uv run opensansad list-mps --search "kangna"
uv run opensansad list-ministries --search "education"

# Collection stats
uv run opensansad stats

System requirements

Python 3.11+
Docker (for Milvus)
~1.3 GB disk for the embedding model (auto-downloaded on first run)
NVIDIA API key (free tier at build.nvidia.com)
No GPU required

Design decisions

Local embeddings over NIM API: nv-embedqa-e5-v5 is not downloadable — NVIDIA only serves it via API/NGC containers requiring NVIDIA GPUs. Using intfloat/e5-large-v2 (same E5 family, 1024-dim) locally removes the rate limit bottleneck for bulk ingestion and keeps the embedding step completely free. MPS (Apple Silicon GPU) is used automatically when available.
Pre-parsed data: Document extraction (OCR, layout detection) is handled upstream by lok-sabha-dataset using a Docling + EasyOCR two-pass pipeline. This project consumes the already-extracted markdown, keeping the codebase focused on retrieval.
Milvus standalone (not Lite): Full Docker deployment with etcd + minio. Same architecture scales from development to production — switching to Milvus cluster requires no code changes.
Aggregate stats via SQLite: Pure chunk retrieval can't answer analytical queries like "What kind of questions has MP X raised?" A separate SQLite metadata DB provides pre-computed aggregates (by ministry, session, type) that are injected into the LLM context as an evidence packet alongside retrieved chunks. MP names are canonicalised across Lok Sabhas using mpNo from supplementary data.
Ministry name caveats: Ministry names are not canonicalised across Lok Sabhas because some ministries were genuinely renamed (e.g. "Human Resource Development" → "Education") while others had their codes reassigned to different ministries. Whitespace is normalised, but renames are kept as-is. The LLM prompt should note that ministries may have historical alternate names.
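
To make the "Aggregate stats via SQLite" decision concrete, here is a minimal sketch of how pre-computed counts might be turned into an evidence packet for the LLM context. The table name `questions`, its columns, the function `mp_evidence_packet`, and the placeholder rows are all assumptions for illustration; the project's actual schema may differ.

```python
import sqlite3

# In-memory database with a hypothetical schema and placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE questions (mp TEXT, ministry TEXT, session TEXT, qtype TEXT)"
)
conn.executemany(
    "INSERT INTO questions VALUES (?, ?, ?, ?)",
    [
        ("MP A", "EDUCATION", "S1", "STARRED"),
        ("MP A", "EDUCATION", "S1", "UNSTARRED"),
        ("MP A", "TRIBAL AFFAIRS", "S2", "UNSTARRED"),
        ("MP B", "EDUCATION", "S1", "UNSTARRED"),
    ],
)


def mp_evidence_packet(conn, mp):
    """Summarise one MP's questions by ministry as plain text for the prompt."""
    total = conn.execute(
        "SELECT COUNT(*) FROM questions WHERE mp = ?", (mp,)
    ).fetchone()[0]
    by_ministry = conn.execute(
        "SELECT ministry, COUNT(*) AS n FROM questions WHERE mp = ? "
        "GROUP BY ministry ORDER BY n DESC, ministry",
        (mp,),
    ).fetchall()
    lines = [f"Aggregate stats for {mp}: {total} questions"]
    lines += [f"- {ministry}: {n}" for ministry, n in by_ministry]
    return "\n".join(lines)


packet = mp_evidence_packet(conn, "MP A")
```

The resulting text block would be placed in the prompt next to the retrieved chunks, so the LLM can answer "how many" and "which ministries" questions that chunk retrieval alone cannot.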

Evaluation

A metadata-driven retrieval eval lives in eval/. It runs queries through the real pipeline (embed → Milvus → filter), skipping only the NIM reranker and LLM, and scores retrieved chunks against expected metadata (MP name, ministry, topic keywords).

Run it:

uv run opensansad eval           # 18 curated queries, top_k=10
uv run opensansad eval --debug   # also prints per-chunk pass/fail

Results — 2026-03-27 (18 queries, top_k=10):

| Mode | Queries | Avg P@k | Avg Hit@1 | Avg Hit@k | Avg Latency |
| --- | --- | --- | --- | --- | --- |
| unfiltered | 18 | 0.69 | 0.72 | 1.00 | 344ms |
| filtered | 13 | 1.00 | 1.00 | 1.00 | 414ms |

Key takeaways:

Filtered P@k=1.00 on all 13 filtered queries: Milvus LIKE filters with MP alias expansion reliably scope results to the right MP/ministry
Unfiltered Hit@k=1.00: semantic search found at least one relevant chunk in the top 10 for all 18 queries, even without filters
Unfiltered P@k=0.69: without filters, ~3 of every 10 chunks are off-target; this is the gap metadata filtering closes
Next step: extend with LLM-as-judge scoring (faithfulness, answer relevance) via RAGAS
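
For reference, the retrieval metrics reported above (P@k, Hit@1, Hit@k) can be computed as in this sketch. It assumes each query yields a rank-ordered list of booleans saying whether a retrieved chunk matched the expected metadata, and that P@k divides by k; the function names and the sample flags are illustrative and not taken from eval/.

```python
def precision_at_k(relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevant[:k]) / k


def hit_at_k(relevant, k):
    """1.0 if any of the top-k retrieved chunks is relevant, else 0.0."""
    return 1.0 if any(relevant[:k]) else 0.0


# Made-up relevance flags for two queries, in rank order (top_k=10).
per_query = [
    [True, False, True, True, False, True, True, False, True, True],
    [False, False, True, False, False, False, False, False, False, False],
]

k = 10
avg_p_at_k = sum(precision_at_k(r, k) for r in per_query) / len(per_query)
avg_hit_at_1 = sum(hit_at_k(r, 1) for r in per_query) / len(per_query)
avg_hit_at_k = sum(hit_at_k(r, k) for r in per_query) / len(per_query)
```

Hit@k saturates as soon as one relevant chunk appears, which is why it can read 1.00 while P@k is much lower.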

Roadmap

Completed

Metadata-filtered vector search: --mp and --min flags scope Milvus ANN search using native filter expressions (members LIKE "%name%", ministry == "X"), with alias expansion across Lok Sabhas via data/mp_aliases.json
Retrieval evaluation: 18-query metadata-driven eval (uv run opensansad eval) with per-query and summary metrics (P@k, Hit@1, Hit@k, latency)
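
The filter construction behind --mp and --min might look roughly like the sketch below. The field names `members` and `ministry` come from the expressions quoted above; the helper names, the placeholder MP name, and the alias mapping shape (canonical name to a list of alternate spellings) are assumptions, since the layout of data/mp_aliases.json is not described here.

```python
def _escape(value):
    """Escape backslashes and double quotes for use inside a string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_filter(mp=None, ministry=None, aliases=None):
    """Build a Milvus boolean filter expression from optional MP/ministry."""
    clauses = []
    if mp:
        # Expand the canonical name with any known alternate spellings.
        names = [mp] + list((aliases or {}).get(mp, []))
        likes = [f'members LIKE "%{_escape(name)}%"' for name in names]
        clauses.append("(" + " or ".join(likes) + ")")
    if ministry:
        clauses.append(f'ministry == "{_escape(ministry)}"')
    return " and ".join(clauses)


# Hypothetical alias table with a placeholder name.
aliases = {"JANE DOE": ["DOE, JANE"]}
expr = build_filter(mp="JANE DOE", ministry="EDUCATION", aliases=aliases)
```

An empty string is returned when neither flag is set, which corresponds to the unfiltered mode in the evaluation.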

Ahead

LLM-as-judge eval: Extend current retrieval eval with RAGAS-style faithfulness and answer relevance scoring. Requires curated reference answers for a subset of test queries.
Hybrid search + RRF: Add BM25 sparse retrieval alongside dense vectors, fused with Reciprocal Rank Fusion. Milvus supports this natively. Important for exact matches on MP names, ministry names, bill numbers.
Agentic query decomposition: Auto-detect when a query needs aggregate stats vs chunk retrieval, replacing the explicit --mp/--min flags. Small LLM call to extract structured intent from natural language.
Scale to 500k+ docs: Switch Milvus index from HNSW to IVF_PQ for memory efficiency. Export pre-built Milvus snapshots for distribution.
Ministry name canonicalisation: Curate a manual mapping of known ministry renames across Lok Sabhas (e.g. HRD → Education, Shipping → Ports Shipping and Waterways). Complex because some minCode values were reassigned to entirely different ministries.
System prompt: Add parliamentary domain context to the LLM via system_prompt.
Temperature tuning: Set LLM temperature to 0.1-0.2 for deterministic factual answers.
Frontend: Web UI with typeahead search for MP/ministry names (populates --mp/--min filters automatically).
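
As a rough illustration of the planned "Hybrid search + RRF" item, the sketch below fuses a dense and a sparse ranking with Reciprocal Rank Fusion, where each document scores the sum of 1 / (k + rank) over the rankings it appears in. This is a standalone sketch, not the planned Milvus-native implementation; the function name `rrf`, the constant k=60, and the chunk ids are assumptions.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Made-up chunk ids: one ranking from dense vectors, one from BM25.
dense = ["c3", "c1", "c2"]
sparse = ["c1", "c4", "c3"]
fused = rrf([dense, sparse])
```

Chunks that appear in both rankings rise to the top, which is the behaviour wanted for exact matches on MP names, ministry names, and bill numbers.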
