Native PDF RAG Pipeline

A corrective RAG pipeline that embeds raw PDF bytes with Gemini Embedding 2, so no text extraction is needed for the embedding step. Gemini 2.5 Pro performs vision-model OCR to produce the text used for search and LLM context, Claude handles grading and generation, and LangGraph drives the corrective retrieval loop. All steps are traced with LangSmith.
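The corrective loop described above can be sketched in plain Python. This is only an illustration of the control flow (retrieve, grade, rewrite the query, retry, then generate), not the code in pipeline.py and not LangGraph's API. The four callables and the `max_retries` parameter are hypothetical stand-ins for the embedding search, the Claude grader, a query rewriter, and the Claude generator.

```python
from typing import Callable, List


def corrective_rag(
    question: str,
    retrieve: Callable[[str], List[str]],        # hypothetical: vector search over embedded PDFs
    grade: Callable[[str, str], bool],           # hypothetical: LLM relevance grader
    rewrite: Callable[[str], str],               # hypothetical: query reformulation
    generate: Callable[[str, List[str]], str],   # hypothetical: answer generation
    max_retries: int = 2,
) -> str:
    """Retrieve, grade, and retry with a rewritten query before generating."""
    query = question
    relevant: List[str] = []
    for _ in range(max_retries + 1):
        docs = retrieve(query)
        # Keep only documents the grader judges relevant to the original question.
        relevant = [doc for doc in docs if grade(question, doc)]
        if relevant:
            break
        # Nothing useful came back: reformulate the query and try again.
        query = rewrite(query)
    # Generate from whatever survived grading (possibly nothing).
    return generate(question, relevant)
```

If every attempt fails grading, this sketch still calls `generate` with an empty context. Whether the real pipeline answers, refuses, or falls back at that point is a design choice this example does not settle.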

Prerequisites

Python 3.11+
Google API key (Gemini Embedding 2 + Gemini 2.5 Pro)
Anthropic API key
LangSmith API key

Setup

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your API keys

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `GOOGLE_API_KEY` | Yes | Gemini Embedding 2 + Gemini 2.5 Pro OCR |
| `ANTHROPIC_API_KEY` | Yes | Claude for grading + generation |
| `LANGSMITH_API_KEY` | Yes | LangSmith tracing and evals |
| `LANGSMITH_TRACING` | Yes | Set to `true` to enable |
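Since all four variables are required, it can help to fail fast when one is missing. The sketch below is a hypothetical helper (`missing_env_vars` is not part of this repository) and assumes the variables are already present in the process environment. How pipeline.py loads the `.env` file is not shown here.

```python
import os
from typing import List, Mapping

# Names taken from the Environment Variables table.
REQUIRED_VARS = (
    "GOOGLE_API_KEY",
    "ANTHROPIC_API_KEY",
    "LANGSMITH_API_KEY",
    "LANGSMITH_TRACING",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` before starting the pipeline and stopping when the returned list is non-empty gives a clearer error than a failed API call several steps in.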

Running

1. Add PDF files to the `docs/` directory.
2. Run the pipeline:

python pipeline.py
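Before running, it can be useful to confirm which files in `docs/` will actually be treated as PDFs. The following is a minimal sketch with a hypothetical `find_pdfs` helper. It assumes PDFs sit directly in `docs/` with a lowercase `.pdf` extension, and it checks the `%PDF-` signature that PDF files start with. The repository's own discovery logic may differ.

```python
from pathlib import Path
from typing import List


def find_pdfs(docs_dir: str = "docs") -> List[Path]:
    """List files in `docs_dir` that end in .pdf and begin with the PDF signature."""
    root = Path(docs_dir)
    if not root.is_dir():
        return []
    pdfs: List[Path] = []
    for path in sorted(root.glob("*.pdf")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            # Read only the first five bytes to check for the "%PDF-" header.
            if fh.read(5) == b"%PDF-":
                pdfs.append(path)
    return pdfs
```

The raw bytes that get embedded can then be read with `path.read_bytes()` for each returned path.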

To run evaluations:

python evals.py

Article

Stop Extracting Text from PDFs — Embed the Document Directly

