A Retrieval-Augmented Generation (RAG) system built with FastAPI, Kafka, background workers, and a vector database (Chroma).
The system ingests documents, chunks them asynchronously, generates embeddings, and serves context-aware answers via an API.
## Most common commands
```bash
# Build and run locally
dkr_pvt build
dkr_pvt up -d
```

The EC2 server is Linux/AMD64 while my machine is Apple Silicon (ARM64), so I use Docker bake to build the images on my machine and push them to Docker Hub, then pull and run them on EC2. EC2 is only for running the containers; no builds happen there.
### Prerequisites
- Docker ≥ 24
- Docker Buildx enabled
- Docker Hub account
```bash
# clean up Docker junk
docker system prune -af
```
### 📦 Build & Push Images to Docker Hub

```bash
# Build and push images for EC2 (Linux/AMD64)
docker buildx bake --push
# or build and push individually
docker buildx bake fastapi --push
docker buildx bake chunker --push
docker buildx bake embedder --push
```
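The bake targets above are defined in a bake file. A minimal sketch of what it might look like — the Dockerfile paths and group layout are assumptions; only the target names, image tags, and the linux/amd64 platform come from this README:

```hcl
group "default" {
  targets = ["fastapi", "chunker", "embedder"]
}

target "fastapi" {
  context    = "."
  dockerfile = "Dockerfile.fastapi"    # assumed path
  platforms  = ["linux/amd64"]         # build for EC2 from an ARM64 Mac
  tags       = ["discreteflow/rag-agent:fastapi"]
}

target "chunker" {
  inherits   = ["fastapi"]
  dockerfile = "Dockerfile.chunker"    # assumed path
  tags       = ["discreteflow/rag-agent:chunker"]
}

target "embedder" {
  inherits   = ["fastapi"]
  dockerfile = "Dockerfile.embedder"   # assumed path
  tags       = ["discreteflow/rag-agent:embedder"]
}
```

With a default group like this, plain `docker buildx bake --push` builds and pushes all three images in one go.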
```bash
# Pull images from Docker Hub and run on EC2
dkr_pvt pull
# pull individually -> to avoid space limitations
docker pull discreteflow/rag-agent:fastapi
docker pull discreteflow/rag-agent:chunker
docker pull discreteflow/rag-agent:embedder
```
`dkr_pvt` is an alias that overlays `docker-compose.private.yml` on top of `docker-compose.yml`.
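A sketch of how that alias could be defined — the exact definition isn't in this README; it's written as a shell function here, which also works in scripts where aliases don't expand:

```shell
# Hypothetical definition for ~/.zshrc or ~/.bashrc:
# layers the private override on top of the base compose file.
dkr_pvt() {
  docker compose -f docker-compose.yml -f docker-compose.private.yml "$@"
}
```

After that, `dkr_pvt up -d` behaves like the full two-file `docker compose` invocation.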
```bash
docker compose logs -f fastapi
docker compose logs -f chunker
docker compose logs -f embedder
docker compose logs -f kafka
docker compose logs -f zookeeper
```

## Architecture overview
```mermaid
flowchart TD
    Client --> FastAPI["FastAPI API Layer"]
    FastAPI --> K1["Kafka topic: documents-uploaded"]
    K1 --> Chunker["Chunking Worker"]
    Chunker --> K2["Kafka topic: embedding-requests"]
    K2 --> Embedder["Embedding Worker"]
    Embedder --> Chroma["Vector Store (Chroma)"]
```
Key idea: document ingestion, chunking, and embedding are fully decoupled using Kafka so each stage can scale independently.
### FastAPI API layer

- Handles document uploads and chat requests
- Publishes document events to Kafka
- Performs retrieval + final LLM response generation
### Chunking worker

- Consumes uploaded documents
- Splits content into overlapping chunks
- Publishes chunk events to Kafka
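A minimal sketch of overlapping chunking — the chunk size, overlap, and function name are illustrative assumptions, not the worker's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous one, so context that straddles a
    boundary still appears intact in some chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=100)
print(len(chunks))     # 3 chunks for 1200 chars with step 400
print(len(chunks[0]))  # 500
```

The overlap trades a little storage for better retrieval: a sentence cut in half at a chunk boundary still shows up whole in the neighbouring chunk.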
### Embedding worker

- Consumes chunk events
- Generates embeddings
- Stores them in a persistent vector database
### Kafka

- Event backbone between services
- Enables asynchronous, fault-tolerant processing
### Vector store

- Chroma-based persistent embedding storage
- Supports semantic similarity search for RAG
Build images locally, push them to Docker Hub, then pull and run them on EC2.
```bash
# Build images locally (these are not Linux/AMD64-compatible, so they won't run on EC2)
dkr_pvt build
# Run the containers
dkr_pvt down
dkr_pvt up -d
```
```bash
# Tag images for Docker Hub (one time)
docker tag ragagent-fastapi:latest discreteflow/rag-agent:fastapi
docker tag ragagent-chunker:latest discreteflow/rag-agent:chunker
docker tag ragagent-embedder:latest discreteflow/rag-agent:embedder
```
Public mode runs without external credentials and uses a demo LLM backend:

```bash
docker compose up --build
```
## Running the app

> [!NOTE]
> This app can be run in two ways:

1. Docker Compose
2. Makefile
Private mode overrides the Docker config with my private LLM settings:

```bash
# build the images
docker compose -f docker-compose.yml -f docker-compose.private.yml build
# run the containers
docker compose -f docker-compose.yml -f docker-compose.private.yml up -d
```
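The private override file isn't shown in this README; a hypothetical sketch of its shape — the `fastapi` service name comes from the compose commands elsewhere in this doc, but the environment variable names are assumptions:

```yaml
# docker-compose.private.yml (hypothetical) — merged over docker-compose.yml,
# so only the fields that differ need to appear here.
services:
  fastapi:
    environment:
      LLM_BACKEND: private         # assumed flag selecting the private LLM config
      LLM_API_KEY: ${LLM_API_KEY}  # read from the host environment, never committed
```

Passing both files with `-f` makes Compose deep-merge them, with the later file winning on conflicts.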
```bash
# Single command to start everything with Docker
docker compose up
# if the Dockerfile changed, rebuild
docker compose up --build
# to run in the background, add -d
docker compose up -d
# check everything is running
docker compose ps
# to stop
docker compose down
# see logs
docker compose logs -f
# specific service logs
docker compose logs -f fastapi
docker compose logs -f chunker
docker compose logs -f embedder
```

### Makefile targets

- `make start` -> start Kafka + API + workers
- `make private` -> start everything in PRIVATE mode (OCI)
- `make stop` -> stop everything cleanly
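The Makefile itself isn't shown here; a hypothetical sketch consistent with those targets, reusing the compose commands from this README:

```makefile
.PHONY: start private stop

start:    ## start Kafka + API + workers (public mode)
	docker compose up -d

private:  ## start everything in PRIVATE mode
	docker compose -f docker-compose.yml -f docker-compose.private.yml up -d

stop:     ## stop everything cleanly
	docker compose down
```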
### Running manually (4 terminals: Kafka, app, chunking worker, embedding worker)

- Start Kafka (from the directory containing `quick-kafka.yml`): `docker compose -f quick-kafka.yml up -d`
- Verify status: `docker compose -f quick-kafka.yml ps`
- Activate the venv: `source .venv/bin/activate`
- Start the app + workers:
  - Chunking worker: `python3 -m workers.chunking_worker`
  - Embedding worker: `python3 -m workers.embedding_worker`
  - FastAPI app: `uvicorn app.main:app --reload`
All config parameters live in two files: `config_public.py` (public vars) and `config_private.py` (private vars).
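A common pattern for such a public/private split, sketched below — the variable names and fallback logic are assumptions; only the two file names come from this README (the topic names match the architecture diagram):

```python
# config_public.py — safe-to-commit defaults (variable names are illustrative)
KAFKA_BOOTSTRAP_SERVERS = "localhost:9092"
UPLOAD_TOPIC = "documents-uploaded"
EMBEDDING_TOPIC = "embedding-requests"

# Private values override the public ones when the private file is present:
try:
    from config_private import *  # noqa: F401,F403 — not committed to version control
except ImportError:
    pass  # public defaults stay in effect
```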
Call chain: `main.py` -> `Lchain.py` -> `kafka_producer.py` -> `kafka_consumer.py` -> `chunking_worker.py` -> `embedding_worker.py`

- `kafka_producer.py` -> wrapper class that saves messages to a Kafka topic
- `kafka_consumer.py` -> wrapper for consuming messages from a Kafka topic
- `chunking_worker.py` -> consumes from the uploaded-docs topic, splits documents into chunks, saves them to the embedding topic
- `embedding_worker.py` -> consumes from the embedding topic, embeds each chunk, saves it to the Chroma vector DB
Frontend files: `templates/`, `static/`

- `templates/chatbox.html` -> serves the UI for interacting with the RAG tool

Pipeline: upload → chunk → embed → store.
| Old design | New design |
| --- | --- |
| One big process | Multiple lightweight Kafka workers |
| Blocking | Async, streaming, real-time |
| No separation of concerns | Clearly defined responsibilities |
| No retry/failure management | Kafka lets you retry + log errors |
### Old design: monolithic and synchronous (didn't scale well)
1. Manually drop `.md` files into a `templates/` folder.
2. Run `main()` to:
   - Load all markdown files.
   - Split them into chunks.
   - Embed and store them in ChromaDB via `Chroma.from_documents(...)`.
When using local Kafka, the consumer was unable to read from a topic without providing partition number 0 (even in the CLI command), so the `get_kafka_consumer` method uses manual partition assignment.
Running the FastAPI app with `uvicorn app.main:app --reload` sometimes fails with `ERROR: [Errno 48] Address already in use`.

Why this keeps happening:

- `--reload` spawns a watcher process
- If you Ctrl+C at a bad moment, the child process survives
- macOS is especially good at leaving zombies

Local fix: kill the process running on port 8000: `kill -9 $(lsof -ti :8000)`

Prod fix: use gunicorn with uvicorn workers: `gunicorn -w 4 -k uvicorn.workers.UvicornWorker app.main:app`
Dependency conflicts: solved via trial and error; had to downgrade some packages so their versions matched: langchain-core, langchain-chroma, transformers, torch, sentence-transformers.
Using a heavy embedding model like Cohere's was bloating the images and choking disk space on EC2; switched to a practical choice while not compromising on result quality.