A corrective RAG pipeline that embeds raw PDF bytes using Gemini Embedding 2 — no text extraction needed for embedding. Uses Gemini 2.5 Pro for vision-model OCR (text for search and LLM context), Claude for grading and generation, and LangGraph for the corrective retrieval loop. All steps are traced with LangSmith.
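
The corrective loop described above can be sketched without any framework. This is only an illustration of the control flow (retrieve, grade, retry with a rewritten query, then generate), not the repository's LangGraph graph: `RagState`, `corrective_rag`, and the four callables are hypothetical names, and query rewriting is an assumed correction step that may differ from what the pipeline actually does.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RagState:
    """State carried through the loop (hypothetical shape)."""
    question: str
    documents: list[str] = field(default_factory=list)
    attempts: int = 0
    answer: str = ""


def corrective_rag(
    question: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    generate: Callable[[str, list[str]], str],
    max_attempts: int = 3,
) -> RagState:
    state = RagState(question=question)
    query = question
    while state.attempts < max_attempts:
        state.attempts += 1
        candidates = retrieve(query)
        # Keep only the documents the grader judges relevant to the question.
        state.documents = [d for d in candidates if grade(question, d)]
        if state.documents:
            break
        # Nothing relevant came back: correct the query and try again.
        query = rewrite(query)
    # Generate from whatever relevant context survived (possibly none).
    state.answer = generate(question, state.documents)
    return state


def demo() -> RagState:
    """Run the loop with stand-in functions instead of real model calls."""
    corpus = {"refund policy": ["Refunds are issued within 30 days."]}
    return corrective_rag(
        "How do refunds work?",
        retrieve=lambda q: corpus.get(q, []),
        grade=lambda question, doc: "refund" in doc.lower(),
        rewrite=lambda q: "refund policy",
        generate=lambda question, docs: docs[0] if docs else "No relevant context found.",
    )
```

In the real pipeline the stand-ins would be replaced by the embedding search, the Claude grader, and the Claude generator; capping the loop with `max_attempts` keeps a question with no relevant documents from retrying forever.
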
- Python 3.11+
- Google API key (Gemini Embedding 2 + Gemini 2.5 Pro)
- Anthropic API key
- LangSmith API key

Create a virtual environment, install the dependencies, and copy the example environment file:

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    cp .env.example .env
    # Edit .env with your API keys

| Variable | Required | Description |
|---|---|---|
| `GOOGLE_API_KEY` | Yes | Gemini Embedding 2 + Gemini 2.5 Pro OCR |
| `ANTHROPIC_API_KEY` | Yes | Claude for grading + generation |
| `LANGSMITH_API_KEY` | Yes | LangSmith tracing and evals |
| `LANGSMITH_TRACING` | Yes | Set to `true` to enable |

- Add PDF files to the `docs/` directory
- Run the pipeline:

      python pipeline.py

To run evaluations:

    python evals.py

Stop Extracting Text from PDFs — Embed the Document Directly