A minimal, research-friendly RAG (Retrieval-Augmented Generation) chatbot that ingests local documents (PDF/TXT/MD), stores embeddings in ChromaDB, and answers questions with citations.
This repository is intentionally CLI-first to keep the architecture transparent and easy to evaluate.
Priority order:

1. Minimizing hallucinations
2. Accuracy
3. Speed
Key design decisions:
- Strict evidence gating (score-based) to prevent generation when the retrieved context is likely irrelevant.
- Citations are printed in a deterministic format (`[filename p.N]` when available).
- Client-side embeddings (OpenAI) for portability across Windows and Ubuntu.
Features:

- Ingest documents from a directory:
  - `.pdf` (via LangChain `PDFLoader`)
  - `.txt`, `.md` (via `TextLoader`)
- Chunking with `RecursiveCharacterTextSplitter`
- Vector storage in ChromaDB
- Chat CLI:
  - Retrieval → evidence scoring → (pass) generation
  - (fail) hard refusal without calling the LLM
- Debug query CLI for inspecting retrieved chunks
Project structure:

```
src/
  cli/
    ingest.ts         # Load -> split -> embed -> upsert into Chroma
    debug-query.ts    # Inspect similarity search output quickly
    chat.ts           # Chat loop with evidence gating + citations
  core/
    env.ts            # Zod-validated environment config
    loaders.ts        # DirectoryLoader + metadata normalization
    splitter.ts       # Chunking strategy
    vectorstore.ts    # Chroma vector store wiring
    chromaClient.ts   # ChromaClient wiring (admin ops / injection)
    metadata.ts       # Chroma-safe metadata sanitizer
```

Requirements:

- Node.js (recommended: 20+)
- Yarn (classic v1 is OK)
- A running ChromaDB server (default: `http://localhost:8000`)
- An OpenAI API key for embeddings and chat generation
Install dependencies:

```bash
yarn install
```

Create a `.env` file in the project root:

```env
OPENAI_API_KEY=YOUR_OPENAI_API_KEY
CHROMA_URL=http://localhost:8000
CHROMA_COLLECTION=rag_docs
DATA_DIR=./data
RESET_COLLECTION=false
```

Notes:

- `DATA_DIR` should contain your `.pdf`, `.txt`, `.md` files.
- If you want ingestion to delete the collection before re-ingesting, set `RESET_COLLECTION=true`.
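For reference, `src/core/env.ts` validates these variables with Zod. A minimal sketch, assuming defaults that match the example above (the repo's actual schema may differ):

```ts
// Illustrative sketch of src/core/env.ts (actual schema may differ).
import "dotenv/config";
import { z } from "zod";

const EnvSchema = z.object({
  OPENAI_API_KEY: z.string().min(1, "OPENAI_API_KEY is required"),
  CHROMA_URL: z.string().url().default("http://localhost:8000"),
  CHROMA_COLLECTION: z.string().default("rag_docs"),
  DATA_DIR: z.string().default("./data"),
  // Stored as a string in .env; coerced to a boolean here.
  RESET_COLLECTION: z
    .enum(["true", "false"])
    .default("false")
    .transform((v) => v === "true"),
});

export const env = EnvSchema.parse(process.env);
```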
You must have ChromaDB running before ingestion or chat.
Example (Docker):
```bash
docker run -p 8000:8000 chromadb/chroma
```

If you use a different host/port, update `CHROMA_URL` in `.env`.
Place your documents in `DATA_DIR`, for example:

```
data/
  attention is all you need.pdf
  book.txt
  notes.md
```

Run ingestion:

```bash
yarn ingest
```

Expected output includes:
- number of loaded documents
- number of chunks
- target Chroma collection name
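For orientation, here is a self-contained sketch of the load → split → embed → upsert pipeline behind `yarn ingest`. Import paths, chunk sizes, and the hardcoded collection settings are illustrative; the repo wires these through `src/core/env.ts` and its loader/splitter/vectorstore modules:

```ts
// Illustrative sketch only; the actual CLI reads config from src/core/env.ts.
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";

async function main() {
  // Load .pdf/.txt/.md files from the data directory.
  const loader = new DirectoryLoader("./data", {
    ".pdf": (path) => new PDFLoader(path),
    ".txt": (path) => new TextLoader(path),
    ".md": (path) => new TextLoader(path),
  });
  const docs = await loader.load();

  // Chunk sizes are placeholders; tune for your corpus.
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150,
  });
  const chunks = await splitter.splitDocuments(docs);

  // Embed client-side (OpenAI) and upsert into Chroma.
  await Chroma.fromDocuments(chunks, new OpenAIEmbeddings(), {
    collectionName: "rag_docs",
    url: "http://localhost:8000",
  });

  console.log(`loaded ${docs.length} documents`);
  console.log(`created ${chunks.length} chunks`);
  console.log(`upserted into collection "rag_docs"`);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```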
Start the chat:

```bash
yarn chat
```

Flow:

- Retrieve top-K chunks from Chroma
- Compute an evidence score (lexical overlap)
- If evidence is insufficient:
  - refuse without calling the LLM
- Otherwise:
  - generate an answer grounded in retrieved sources
  - print citations
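A minimal sketch of the shape of one chat turn, assuming a hypothetical `passesEvidenceGate` helper (a concrete version appears in the gating section below), an illustrative model name, and a placeholder refusal message:

```ts
// Illustrative shape of one chat turn; not the repo's exact code.
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import type { Document } from "@langchain/core/documents";

// Stand-in for the gate described in the gating section below.
declare function passesEvidenceGate(question: string, docs: Document[]): boolean;

async function answer(question: string): Promise<string> {
  const store = await Chroma.fromExistingCollection(new OpenAIEmbeddings(), {
    collectionName: "rag_docs",
    url: "http://localhost:8000",
  });

  // 1. Retrieve top-K chunks (K = 5 is a placeholder).
  const hits = await store.similaritySearch(question, 5);

  // 2. Evidence gate: on failure, refuse without calling the LLM.
  //    The refusal wording here is a placeholder.
  if (!passesEvidenceGate(question, hits)) {
    return "I don't have enough evidence in the ingested documents to answer that.";
  }

  // 3. Generate an answer grounded only in the retrieved chunks.
  const context = hits.map((d) => d.pageContent).join("\n---\n");
  const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
  const res = await llm.invoke([
    ["system", "Answer strictly from the provided context. If it is insufficient, say so."],
    ["human", `Context:\n${context}\n\nQuestion: ${question}`],
  ]);
  return String(res.content);
}
```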
Inspect retrieval directly:

```bash
yarn debug:query
```

This prints:
- number of retrieved chunks
- short snippets with citations
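A sketch of what `src/cli/debug-query.ts` might look like, assuming normalized `source`/`page` metadata fields (the actual field names may differ):

```ts
// Illustrative sketch of src/cli/debug-query.ts.
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";

async function main() {
  const query = process.argv.slice(2).join(" ") || "test query";
  const store = await Chroma.fromExistingCollection(new OpenAIEmbeddings(), {
    collectionName: "rag_docs",
    url: "http://localhost:8000",
  });

  const results = await store.similaritySearchWithScore(query, 5);
  console.log(`${results.length} chunks retrieved for: "${query}"\n`);

  for (const [doc, score] of results) {
    const { source, page } = doc.metadata as { source?: string; page?: number };
    const cite = page != null ? `[${source} p.${page}]` : `[${source}]`;
    console.log(`score=${score.toFixed(3)} ${cite}`);
    console.log(`  ${doc.pageContent.slice(0, 120)}…\n`);
  }
}

main().catch(console.error);
```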
The chat CLI applies a conservative “evidence gate” before calling the LLM.
- It computes a lexical overlap score between the question tokens and the top retrieved chunks.
- If the score fails the thresholds, the system prints a fixed refusal message and stops.
Key constants (in `src/cli/chat.ts`), illustrated in the sketch below:

- `MIN_OVERLAP_RATIO`
- `MIN_MATCHED_TOKENS`
- `TOP_SOURCES_FOR_SCORING`
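A minimal sketch of such a lexical-overlap gate; the threshold values and tokenization below are placeholder assumptions, not the repo's actual numbers:

```ts
// Illustrative lexical-overlap gate; thresholds and tokenization are
// assumptions, not the repo's exact values.
import type { Document } from "@langchain/core/documents";

const MIN_OVERLAP_RATIO = 0.3;     // fraction of question tokens that must match
const MIN_MATCHED_TOKENS = 2;      // absolute floor on matched tokens
const TOP_SOURCES_FOR_SCORING = 3; // how many top chunks participate in scoring

const tokenize = (s: string): string[] =>
  s.toLowerCase().split(/\W+/).filter((t) => t.length > 2);

export function passesEvidenceGate(question: string, docs: Document[]): boolean {
  const qTokens = new Set(tokenize(question));
  if (qTokens.size === 0) return false;

  // Pool tokens from only the top-ranked chunks.
  const contextTokens = new Set(
    docs.slice(0, TOP_SOURCES_FOR_SCORING).flatMap((d) => tokenize(d.pageContent))
  );

  let matched = 0;
  for (const t of qTokens) if (contextTokens.has(t)) matched++;

  // Both the absolute and relative thresholds must pass.
  return matched >= MIN_MATCHED_TOKENS && matched / qTokens.size >= MIN_OVERLAP_RATIO;
}
```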
Tuning guidance:
- If the bot refuses too often (false negatives):
  - lower `MIN_OVERLAP_RATIO`
  - or lower `MIN_MATCHED_TOKENS`
- If the bot answers irrelevant questions (false positives):
  - raise `MIN_OVERLAP_RATIO`
  - or raise `MIN_MATCHED_TOKENS`
  - reduce `TOP_SOURCES_FOR_SCORING` for stricter gating
Citations are printed based on normalized metadata:
- With page number: `[filename p.N]`
- Without page number: `[filename]`
Metadata normalization happens in `src/core/loaders.ts`, and values are sanitized for Chroma in `src/core/metadata.ts`.
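As an illustration, the format above can be produced by a small formatter; the `filename`/`page` field names are assumptions about the normalized metadata:

```ts
// Illustrative citation formatter; assumes normalized metadata fields
// `filename` and optional `page` (actual field names may differ).
interface ChunkMetadata {
  filename: string;
  page?: number;
}

export function formatCitation(meta: ChunkMetadata): string {
  return meta.page != null
    ? `[${meta.filename} p.${meta.page}]`
    : `[${meta.filename}]`;
}

// formatCitation({ filename: "book.txt" })           -> "[book.txt]"
// formatCitation({ filename: "paper.pdf", page: 3 }) -> "[paper.pdf p.3]"
```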
Common scripts:
```bash
yarn lint
yarn ingest
yarn chat
yarn debug:query
```

If `debug:query` is missing, ensure `package.json` includes:

```json
{
  "scripts": {
    "debug:query": "tsx src/cli/debug-query.ts"
  }
}
```

Depending on your dependency versions, you may see warnings from Chroma / LangChain related to:
- deprecated arguments
- missing server-side embedding configuration
This project uses client-side embeddings by design, so server-side embedding configuration is not required for correctness. If you want completely clean console output, consider adding a small warning suppression layer in a CLI bootstrap (optional).
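A sketch of such an optional bootstrap (imported first from the CLI entry points); the regex patterns are placeholders to adapt to the warnings you actually see:

```ts
// Optional, illustrative warning filter; patterns are placeholders.
const NOISY_PATTERNS = [/deprecated/i, /embedding function/i];

const originalWarn = console.warn.bind(console);
console.warn = (...args: unknown[]) => {
  const text = args.map(String).join(" ");
  // Drop warnings matching a known-noisy pattern; pass everything else through.
  if (NOISY_PATTERNS.some((re) => re.test(text))) return;
  originalWarn(...args);
};
```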
License: MIT (or choose your preferred license).
PRs are welcome. If you change chunking, retrieval, or gating logic, please include:
- a short rationale
- a minimal eval set (even a small manual list of queries) showing the impact on:
  - hallucination rate
  - refusal rate
  - accuracy