This repository contains materials for the Building RAG (Retrieval-Augmented Generation) Applications Workshop. The workshop covers both naive RAG implementations and advanced RAG techniques to help you understand and build effective RAG systems.
This repository is used for the O'Reilly training course: Building Reliable RAG Applications: From PoC to Production.
- API keys (OpenAI, Cohere)
- Qdrant database (Cloud or Docker)
This project uses uv for dependency management. uv installs the required Python version and manages the virtual environment automatically.
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
git clone https://github.com/Sarangk90/building-rag-app-workshop.git
cd building-rag-app-workshop
# Install Python 3.11 and all dependencies
uv sync
📖 Follow the complete setup guide: SETUP.md
The setup guide covers:
- Qdrant database setup (Cloud or Docker options)
- Environment variable configuration
- Data ingestion process
- Troubleshooting common issues
uv run jupyter lab
After completing the setup, run the notebooks in this order:
naive-rag/01-naive-rag.ipynb - Basic RAG implementation
naive-rag/02-naive-rag-challenges.ipynb - RAG limitations and evaluation
advanced-rag/01-advanced-rag-rerank.ipynb - Advanced RAG with reranking
advanced-rag/scifact/01-data-indexing.ipynb - Data indexing techniques
advanced-rag/scifact/02-advanced-rag.ipynb - Advanced techniques
Note: Each notebook automatically detects your setup (Cloud vs Docker) and connects appropriately.
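The detection described in the note can be pictured with a minimal sketch like the one below. It is an illustration, not the notebooks' actual code: the environment variable names `QDRANT_URL` and `QDRANT_API_KEY` and the helper `resolve_qdrant_settings` are assumptions, and the fallback address is Qdrant's default local port.

```python
import os
from typing import Mapping, Optional


def resolve_qdrant_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick Qdrant connection settings from environment variables.

    QDRANT_URL and QDRANT_API_KEY are assumed variable names; check
    SETUP.md for the names the workshop actually uses.
    """
    env = os.environ if env is None else env
    url = env.get("QDRANT_URL", "").strip()
    api_key = env.get("QDRANT_API_KEY", "").strip()

    if url and api_key:
        # A URL together with an API key suggests a Qdrant Cloud cluster.
        return {"mode": "cloud", "url": url, "api_key": api_key}

    # Otherwise assume a local Docker container on Qdrant's default port.
    return {"mode": "docker", "url": url or "http://localhost:6333", "api_key": None}
```

The returned `url` and `api_key` values could then be passed to `QdrantClient(url=..., api_key=...)` from qdrant-client.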
- Basic RAG implementation using OpenAI embeddings
- Vector storage with Qdrant
- Simple retrieval and generation pipeline
- Hybrid search combining dense and sparse embeddings
- Reranking with cross-encoders for improved relevance
- Evaluation using standard metrics and benchmarks
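To make the hybrid search idea concrete, here is a small sketch of one common way to merge a dense and a sparse result list: reciprocal rank fusion (RRF). This is an assumption for illustration; the notebooks may combine results differently, for example through Qdrant's own query features. The function name, the document IDs, and the constant `k = 60` are all hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense and a sparse retriever.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF uses only rank positions, so it avoids having to normalize dense similarity scores against sparse scores on a different scale. A reranker such as a cross-encoder would then rescore the top fused candidates.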
The workshop uses:
- Wikipedia articles on machine learning topics (Deep learning, Transformers, etc.)
- BeIR SciFact dataset for demonstrations and evaluations
The repository includes pre-downloaded Wikipedia articles in data/wiki_articles/ to avoid repeated API calls during workshops. Use the following scripts to manage articles:
uv run python scripts/fetch_additional_articles.py --list-available
# Fetch specific articles
uv run python scripts/fetch_additional_articles.py "Machine learning" "Computer vision"
# Fetch from extended list (30+ ML/AI topics)
uv run python scripts/fetch_additional_articles.py
# View the extended article list
uv run python scripts/fetch_additional_articles.py --list-extended
uv run python scripts/fetch_additional_articles.py --force "Deep learning"
Available Pre-downloaded Articles:
- Artificial neural network
- BERT (language model)
- Deep learning
- Generative pre-trained transformer
- Overfitting
- Transformer (machine learning model)
- openai: For embeddings and completions
- qdrant-client: For vector storage and retrieval
- wikipedia, beautifulsoup4: For data collection and cleaning
- FlagEmbedding: For reranking functionality
- cohere: For additional reranking options
- ragas: For comprehensive RAG evaluation
- Various utilities: tqdm, python-dotenv, etc.
The workshop includes evaluation scripts using RAGAS metrics to assess the quality of RAG outputs across dimensions like relevance, faithfulness, and answer quality.