A Retrieval-Augmented Generation (RAG) system built with FastAPI, Kafka, background workers, and a vector database (Chroma).
The system ingests documents, chunks them asynchronously, generates embeddings, and serves context-aware answers via an API.
## Most common commands
```bash
# Build and run locally
dkr_pvt build
dkr_pvt up -d
```

The EC2 server is Linux/AMD64 while my machine is Apple Silicon (ARM64), so I use Docker bake to build the images on my machine and push them to Docker Hub, then pull and run them on EC2. EC2 is only for running the containers; no builds happen there.
### Prerequisites
- Docker ≥ 24
- Docker Buildx enabled
- Docker Hub account
```bash
# clean up Docker junk
docker system prune -af
```
### 📦 Build & Push Images to Docker Hub

```bash
# Build and push images for EC2 (Linux/AMD64)
docker buildx bake --push
# or build and push individually
docker buildx bake fastapi --push
docker buildx bake chunker --push
docker buildx bake embedder --push
```
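The bake targets above are defined in a bake file. A minimal sketch of what it might look like — the Dockerfile paths and group layout are assumptions; only the target names, image tags, and the linux/amd64 platform come from this README:

```hcl
group "default" {
  targets = ["fastapi", "chunker", "embedder"]
}

target "fastapi" {
  context    = "."
  dockerfile = "Dockerfile.fastapi"    # assumed path
  platforms  = ["linux/amd64"]         # build for EC2 from an ARM64 Mac
  tags       = ["discreteflow/rag-agent:fastapi"]
}

target "chunker" {
  inherits   = ["fastapi"]
  dockerfile = "Dockerfile.chunker"    # assumed path
  tags       = ["discreteflow/rag-agent:chunker"]
}

target "embedder" {
  inherits   = ["fastapi"]
  dockerfile = "Dockerfile.embedder"   # assumed path
  tags       = ["discreteflow/rag-agent:embedder"]
}
```

With a default group like this, plain `docker buildx bake --push` builds and pushes all three images in one go.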
```bash
# Pull images from Docker Hub and run on EC2
dkr_pvt pull
# pull individually -> to avoid space limitations
docker pull discreteflow/rag-agent:fastapi
docker pull discreteflow/rag-agent:chunker
docker pull discreteflow/rag-agent:embedder
```
`dkr_pvt` is an alias that overlays `docker-compose.private.yml` on top of `docker-compose.yml`.
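A sketch of how that alias could be defined — the exact definition isn't in this README; it's written as a shell function here, which also works in scripts where aliases don't expand:

```shell
# Hypothetical definition for ~/.zshrc or ~/.bashrc:
# layers the private override on top of the base compose file.
dkr_pvt() {
  docker compose -f docker-compose.yml -f docker-compose.private.yml "$@"
}
```

After that, `dkr_pvt up -d` behaves like the full two-file `docker compose` invocation.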
```bash
docker compose logs -f fastapi
docker compose logs -f chunker
docker compose logs -f embedder
docker compose logs -f kafka
docker compose logs -f zookeeper
```

## Architecture overview
```mermaid
flowchart TD
    Client --> FastAPI["FastAPI API Layer"]
    FastAPI --> K1["Kafka topic: documents-uploaded"]
    K1 --> Chunker["Chunking Worker"]
    Chunker --> K2["Kafka topic: embedding-requests"]
    K2 --> Embedder["Embedding Worker"]
    Embedder --> Chroma["Vector Store (Chroma)"]
```
Key idea: document ingestion, chunking, and embedding are fully decoupled using Kafka so each stage can scale independently.
### FastAPI API layer

- Handles document uploads and chat requests
- Publishes document events to Kafka
- Performs retrieval + final LLM response generation
### Chunking worker

- Consumes uploaded documents
- Splits content into overlapping chunks
- Publishes chunk events to Kafka
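A minimal sketch of overlapping chunking — the chunk size, overlap, and function name are illustrative assumptions, not the worker's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous one, so context that straddles a
    boundary still appears intact in some chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=100)
print(len(chunks))     # 3 chunks for 1200 chars with step 400
print(len(chunks[0]))  # 500
```

The overlap trades a little storage for better retrieval: a sentence cut in half at a chunk boundary still shows up whole in the neighbouring chunk.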
### Embedding worker

- Consumes chunk events
- Generates embeddings
- Stores them in a persistent vector database
### Kafka

- Event backbone between services
- Enables asynchronous, fault-tolerant processing
### Vector store

- Chroma-based persistent embedding storage
- Supports semantic similarity search for RAG
Build images locally, push them to Docker Hub, then pull and run them on EC2.
```bash
# Build images locally (these are not Linux/AMD64-compatible, so they won't run on EC2)
dkr_pvt build
# Run the containers
dkr_pvt down
dkr_pvt up -d
```
```bash
# Tag images for Docker Hub (one time)
docker tag ragagent-fastapi:latest discreteflow/rag-agent:fastapi
docker tag ragagent-chunker:latest discreteflow/rag-agent:chunker
docker tag ragagent-embedder:latest discreteflow/rag-agent:embedder
```
Public mode runs without external credentials and uses a demo LLM backend:

```bash
docker compose up --build
```
## Running the app

> [!NOTE]
> This app can be run in two ways:

1. Docker Compose
2. Makefile
Private mode overrides the Docker config with my private LLM settings:

```bash
# build the images
docker compose -f docker-compose.yml -f docker-compose.private.yml build
# run the containers
docker compose -f docker-compose.yml -f docker-compose.private.yml up -d
```
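The private override file isn't shown in this README; a hypothetical sketch of its shape — the `fastapi` service name comes from the compose commands elsewhere in this doc, but the environment variable names are assumptions:

```yaml
# docker-compose.private.yml (hypothetical) — merged over docker-compose.yml,
# so only the fields that differ need to appear here.
services:
  fastapi:
    environment:
      LLM_BACKEND: private         # assumed flag selecting the private LLM config
      LLM_API_KEY: ${LLM_API_KEY}  # read from the host environment, never committed
```

Passing both files with `-f` makes Compose deep-merge them, with the later file winning on conflicts.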
```bash
# Single command to start everything with Docker
docker compose up
# if the Dockerfile changed, rebuild
docker compose up --build
# to run in the background, add -d
docker compose up -d
# check everything is running
docker compose ps
# to stop
docker compose down
# see logs
docker compose logs -f
# specific service logs
docker compose logs -f fastapi
docker compose logs -f chunker
docker compose logs -f embedder
```

### Makefile targets

- `make start` -> start Kafka + API + workers
- `make private` -> start everything in PRIVATE mode (OCI)
- `make stop` -> stop everything cleanly
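The Makefile itself isn't shown here; a hypothetical sketch consistent with those targets, reusing the compose commands from this README:

```makefile
.PHONY: start private stop

start:    ## start Kafka + API + workers (public mode)
	docker compose up -d

private:  ## start everything in PRIVATE mode
	docker compose -f docker-compose.yml -f docker-compose.private.yml up -d

stop:     ## stop everything cleanly
	docker compose down
```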
### Running manually (4 terminals: Kafka, app, chunking worker, embedding worker)

- Start Kafka (from the directory containing `quick-kafka.yml`): `docker compose -f quick-kafka.yml up -d`
- Verify status: `docker compose -f quick-kafka.yml ps`
- Activate the venv: `source .venv/bin/activate`
- Start the app + workers:
  - Chunking worker: `python3 -m workers.chunking_worker`
  - Embedding worker: `python3 -m workers.embedding_worker`
  - FastAPI app: `uvicorn app.main:app --reload`
All config parameters live in two files: `config_public.py` (public vars) and `config_private.py` (private vars).
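A common pattern for such a public/private split, sketched below — the variable names and fallback logic are assumptions; only the two file names come from this README (the topic names match the architecture diagram):

```python
# config_public.py — safe-to-commit defaults (variable names are illustrative)
KAFKA_BOOTSTRAP_SERVERS = "localhost:9092"
UPLOAD_TOPIC = "documents-uploaded"
EMBEDDING_TOPIC = "embedding-requests"

# Private values override the public ones when the private file is present:
try:
    from config_private import *  # noqa: F401,F403 — not committed to version control
except ImportError:
    pass  # public defaults stay in effect
```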
Call chain: `main.py` -> `Lchain.py` -> `kafka_producer.py` -> `kafka_consumer.py` -> `chunking_worker.py` -> `embedding_worker.py`

- `kafka_producer.py` -> wrapper class that saves messages to a Kafka topic
- `kafka_consumer.py` -> wrapper for consuming messages from a Kafka topic
- `chunking_worker.py` -> consumes from the uploaded-docs topic, splits documents into chunks, saves them to the embedding topic
- `embedding_worker.py` -> consumes from the embedding topic, embeds each chunk, saves it to the Chroma vector DB
Frontend files: `templates/`, `static/`

- `templates/chatbox.html` -> serves the UI for interacting with the RAG tool

Pipeline: upload → chunk → embed → store.
| Old design | New design |
| --- | --- |
| One big process | Multiple lightweight Kafka workers |
| Blocking | Async, streaming, real-time |
| No separation of concerns | Clearly defined responsibilities |
| No retry/failure management | Kafka lets you retry + log errors |
### Old design: monolithic and synchronous (didn't scale well)
1. Manually drop `.md` files into a `templates/` folder.
2. Run `main()` to:
   - Load all markdown files.
   - Split them into chunks.
   - Embed and store them in ChromaDB via `Chroma.from_documents(...)`.
When using local Kafka, the consumer was unable to read from a topic without providing partition number 0 (even in the CLI command), so the `get_kafka_consumer` method uses manual partition assignment.
Running the FastAPI app with `uvicorn app.main:app --reload` sometimes fails with `ERROR: [Errno 48] Address already in use`.

Why this keeps happening:

- `--reload` spawns a watcher process
- If you Ctrl+C at a bad moment, the child process survives
- macOS is especially good at leaving zombies

Local fix: kill the process running on port 8000: `kill -9 $(lsof -ti :8000)`

Prod fix: use gunicorn with uvicorn workers: `gunicorn -w 4 -k uvicorn.workers.UvicornWorker app.main:app`
Dependency conflicts: solved via trial and error; had to downgrade some packages so their versions matched: langchain-core, langchain-chroma, transformers, torch, sentence-transformers.
Using a heavy embedding model like Cohere's was bloating the images and choking disk space on EC2; switched to a practical choice while not compromising on result quality.