A minimal, research-friendly RAG (Retrieval-Augmented Generation) chatbot that ingests local documents (PDF/TXT/MD), stores embeddings in ChromaDB, and answers questions with citations.
This repository is intentionally CLI-first to keep the architecture transparent and easy to evaluate.
Priority order:

1. Minimizing hallucinations
2. Accuracy
3. Speed
Key design decisions:
- Strict evidence gating (score-based) to prevent generation when the retrieved context is likely irrelevant.
- Citations are printed in a deterministic format (`[filename p.N]` when available).
- Client-side embeddings (OpenAI) for portability across Windows and Ubuntu.
Features:

- Ingest documents from a directory:
  - `.pdf` (via LangChain `PDFLoader`)
  - `.txt`, `.md` (via `TextLoader`)
- Chunking with `RecursiveCharacterTextSplitter`
- Vector storage in ChromaDB
- Chat CLI:
  - Retrieval → evidence scoring → (pass) generation
  - (fail) hard refusal without calling the LLM
- Debug query CLI for inspecting retrieved chunks
Project structure:

```
src/
  cli/
    ingest.ts         # Load -> split -> embed -> upsert into Chroma
    debug-query.ts    # Inspect similarity search output quickly
    chat.ts           # Chat loop with evidence gating + citations
  core/
    env.ts            # Zod-validated environment config
    loaders.ts        # DirectoryLoader + metadata normalization
    splitter.ts       # Chunking strategy
    vectorstore.ts    # Chroma vector store wiring
    chromaClient.ts   # ChromaClient wiring (admin ops / injection)
    metadata.ts       # Chroma-safe metadata sanitizer
```

Requirements:

- Node.js (recommended: 20+)
- Yarn (classic v1 is OK)
- A running ChromaDB server (default: `http://localhost:8000`)
- An OpenAI API key for embeddings and chat generation
Install dependencies:

```bash
yarn install
```

Create a `.env` file in the project root:

```env
OPENAI_API_KEY=YOUR_OPENAI_API_KEY
CHROMA_URL=http://localhost:8000
CHROMA_COLLECTION=rag_docs
DATA_DIR=./data
RESET_COLLECTION=false
```

Notes:

- `DATA_DIR` should contain your `.pdf`, `.txt`, `.md` files.
- If you want ingestion to delete the collection before re-ingesting, set `RESET_COLLECTION=true`.
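For reference, `src/core/env.ts` validates these variables with Zod. A minimal sketch, assuming defaults that match the example above (the repo's actual schema may differ):

```ts
// Illustrative sketch of src/core/env.ts (actual schema may differ).
import "dotenv/config";
import { z } from "zod";

const EnvSchema = z.object({
  OPENAI_API_KEY: z.string().min(1, "OPENAI_API_KEY is required"),
  CHROMA_URL: z.string().url().default("http://localhost:8000"),
  CHROMA_COLLECTION: z.string().default("rag_docs"),
  DATA_DIR: z.string().default("./data"),
  // Stored as a string in .env; coerced to a boolean here.
  RESET_COLLECTION: z
    .enum(["true", "false"])
    .default("false")
    .transform((v) => v === "true"),
});

export const env = EnvSchema.parse(process.env);
```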
You must have ChromaDB running before ingestion or chat.
Example (Docker):
```bash
docker run -p 8000:8000 chromadb/chroma
```

If you use a different host/port, update `CHROMA_URL` in `.env`.
Place your documents in `DATA_DIR`, for example:

```
data/
  attention is all you need.pdf
  book.txt
  notes.md
```

Run ingestion:

```bash
yarn ingest
```

Expected output includes:
- number of loaded documents
- number of chunks
- target Chroma collection name
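For orientation, here is a self-contained sketch of the load → split → embed → upsert pipeline behind `yarn ingest`. Import paths, chunk sizes, and the hardcoded collection settings are illustrative; the repo wires these through `src/core/env.ts` and its loader/splitter/vectorstore modules:

```ts
// Illustrative sketch only; the actual CLI reads config from src/core/env.ts.
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";

async function main() {
  // Load .pdf/.txt/.md files from the data directory.
  const loader = new DirectoryLoader("./data", {
    ".pdf": (path) => new PDFLoader(path),
    ".txt": (path) => new TextLoader(path),
    ".md": (path) => new TextLoader(path),
  });
  const docs = await loader.load();

  // Chunk sizes are placeholders; tune for your corpus.
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });
  const chunks = await splitter.splitDocuments(docs);

  // Embed client-side (OpenAI) and upsert into Chroma.
  await Chroma.fromDocuments(chunks, new OpenAIEmbeddings(), {
    collectionName: "rag_docs",
    url: "http://localhost:8000",
  });

  console.log(`loaded ${docs.length} documents`);
  console.log(`created ${chunks.length} chunks`);
  console.log(`upserted into collection "rag_docs"`);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```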
Start the chat:

```bash
yarn chat
```

Flow:

- Retrieve top-K chunks from Chroma
- Compute an evidence score (lexical overlap)
- If evidence is insufficient:
  - refuse without calling the LLM
- Otherwise:
  - generate an answer grounded in retrieved sources
  - print citations
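A minimal sketch of the shape of one chat turn, assuming a hypothetical `passesEvidenceGate` helper (a concrete version appears in the gating section below), an illustrative model name, and a placeholder refusal message:

```ts
// Illustrative shape of one chat turn; not the repo's exact code.
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import type { Document } from "@langchain/core/documents";

// Stand-in for the gate described in the gating section below.
declare function passesEvidenceGate(question: string, docs: Document[]): boolean;

async function answer(question: string): Promise<string> {
  const store = await Chroma.fromExistingCollection(new OpenAIEmbeddings(), {
    collectionName: "rag_docs",
    url: "http://localhost:8000",
  });

  // 1. Retrieve top-K chunks (K = 5 is a placeholder).
  const hits = await store.similaritySearch(question, 5);

  // 2. Evidence gate: on failure, refuse without calling the LLM.
  //    The refusal wording here is a placeholder.
  if (!passesEvidenceGate(question, hits)) {
    return "I don't have enough evidence in the ingested documents to answer that.";
  }

  // 3. Generate an answer grounded only in the retrieved chunks.
  const context = hits.map((d) => d.pageContent).join("\n---\n");
  const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
  const res = await llm.invoke([
    ["system", "Answer strictly from the provided context. If it is insufficient, say so."],
    ["human", `Context:\n${context}\n\nQuestion: ${question}`],
  ]);
  return String(res.content);
}
```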
Inspect retrieval directly:

```bash
yarn debug:query
```

This prints:
- number of retrieved chunks
- short snippets with citations
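A sketch of what `src/cli/debug-query.ts` might look like, assuming normalized `source`/`page` metadata fields (the actual field names may differ):

```ts
// Illustrative sketch of src/cli/debug-query.ts.
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";

async function main() {
  const query = process.argv.slice(2).join(" ") || "test query";
  const store = await Chroma.fromExistingCollection(new OpenAIEmbeddings(), {
    collectionName: "rag_docs",
    url: "http://localhost:8000",
  });

  const results = await store.similaritySearchWithScore(query, 5);
  console.log(`${results.length} chunks retrieved for: "${query}"\n`);

  for (const [doc, score] of results) {
    const { source, page } = doc.metadata as { source?: string; page?: number };
    const cite = page != null ? `[${source} p.${page}]` : `[${source}]`;
    console.log(`score=${score.toFixed(3)} ${cite}`);
    console.log(`  ${doc.pageContent.slice(0, 120)}…\n`);
  }
}

main().catch(console.error);
```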
The chat CLI applies a conservative “evidence gate” before calling the LLM.
- It computes a lexical overlap score between the question tokens and the top retrieved chunks.
- If the score fails the thresholds, the system prints a fixed refusal message and stops.
Key constants (in `src/cli/chat.ts`), illustrated in the sketch below:

- `MIN_OVERLAP_RATIO`
- `MIN_MATCHED_TOKENS`
- `TOP_SOURCES_FOR_SCORING`
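A minimal sketch of such a lexical-overlap gate; the threshold values and tokenization below are placeholder assumptions, not the repo's actual numbers:

```ts
// Illustrative lexical-overlap gate; thresholds and tokenization are
// assumptions, not the repo's exact values.
import type { Document } from "@langchain/core/documents";

const MIN_OVERLAP_RATIO = 0.3;     // fraction of question tokens that must match
const MIN_MATCHED_TOKENS = 2;      // absolute floor on matched tokens
const TOP_SOURCES_FOR_SCORING = 3; // how many top chunks participate in scoring

const tokenize = (s: string): string[] =>
  s.toLowerCase().split(/\W+/).filter((t) => t.length > 2);

export function passesEvidenceGate(question: string, docs: Document[]): boolean {
  const qTokens = new Set(tokenize(question));
  if (qTokens.size === 0) return false;

  // Pool tokens from only the top-ranked chunks.
  const contextTokens = new Set(
    docs.slice(0, TOP_SOURCES_FOR_SCORING).flatMap((d) => tokenize(d.pageContent))
  );

  let matched = 0;
  for (const t of qTokens) if (contextTokens.has(t)) matched++;

  // Both the absolute and relative thresholds must pass.
  return matched >= MIN_MATCHED_TOKENS && matched / qTokens.size >= MIN_OVERLAP_RATIO;
}
```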
Tuning guidance:
- If the bot refuses too often (false negatives):
  - lower `MIN_OVERLAP_RATIO`
  - or lower `MIN_MATCHED_TOKENS`
- If the bot answers irrelevant questions (false positives):
  - raise `MIN_OVERLAP_RATIO`
  - or raise `MIN_MATCHED_TOKENS`
  - reduce `TOP_SOURCES_FOR_SCORING` for stricter gating
Citations are printed based on normalized metadata:
- With page number: `[filename p.N]`
- Without page number: `[filename]`
Metadata normalization happens in `src/core/loaders.ts`, and values are sanitized for Chroma in `src/core/metadata.ts`.
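As an illustration, the format above can be produced by a small formatter; the `filename`/`page` field names are assumptions about the normalized metadata:

```ts
// Illustrative citation formatter; assumes normalized metadata fields
// `filename` and optional `page` (actual field names may differ).
interface ChunkMetadata {
  filename: string;
  page?: number;
}

export function formatCitation(meta: ChunkMetadata): string {
  return meta.page != null
    ? `[${meta.filename} p.${meta.page}]`
    : `[${meta.filename}]`;
}

// formatCitation({ filename: "book.txt" })           -> "[book.txt]"
// formatCitation({ filename: "paper.pdf", page: 3 }) -> "[paper.pdf p.3]"
```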
Common scripts:
```bash
yarn lint
yarn ingest
yarn chat
yarn debug:query
```

If `debug:query` is missing, ensure `package.json` includes:

```json
{
  "scripts": {
    "debug:query": "tsx src/cli/debug-query.ts"
  }
}
```

Depending on your dependency versions, you may see warnings from Chroma / LangChain related to:
- deprecated arguments
- missing server-side embedding configuration
This project uses client-side embeddings by design, so server-side embedding configuration is not required for correctness. If you want completely clean console output, consider adding a small warning suppression layer in a CLI bootstrap (optional).
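A sketch of such an optional bootstrap (imported first from the CLI entry points); the regex patterns are placeholders to adapt to the warnings you actually see:

```ts
// Optional, illustrative warning filter; patterns are placeholders.
const NOISY_PATTERNS = [/deprecated/i, /embedding function/i];

const originalWarn = console.warn.bind(console);
console.warn = (...args: unknown[]) => {
  const text = args.map(String).join(" ");
  // Drop warnings matching a known-noisy pattern; pass everything else through.
  if (NOISY_PATTERNS.some((re) => re.test(text))) return;
  originalWarn(...args);
};
```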
License: MIT (or choose your preferred license).
PRs are welcome. If you change chunking, retrieval, or gating logic, please include:
- a short rationale
- a minimal eval set (even a small manual list of queries) showing the impact on:
  - hallucination rate
  - refusal rate
  - accuracy