focused-dot-io/17-native-pdf-rag

Native PDF RAG Pipeline

A corrective RAG pipeline that embeds raw PDF bytes with Gemini Embedding 2, so the embedding step needs no text extraction. Gemini 2.5 Pro runs vision-model OCR to produce the text used for search and LLM context, Claude handles grading and generation, and LangGraph drives the corrective retrieval loop. All steps are traced with LangSmith.
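The corrective retrieval loop described above can be sketched in plain Python. This is a minimal illustration of the general pattern (retrieve, grade, rewrite the query and retry when nothing relevant survives, then generate), not the code in pipeline.py. Every name here (`corrective_rag`, `LoopResult`, the `toy_*` stand-ins, `max_retries`) is hypothetical, and the toy callables replace the Gemini, Claude, and LangGraph pieces so the sketch runs without API keys.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    answer: str
    queries: List[str] = field(default_factory=list)  # every query that was tried


def corrective_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    generate: Callable[[str, List[str]], str],
    max_retries: int = 2,
) -> LoopResult:
    """Retrieve, keep passages graded relevant, retry with a rewritten query if none survive."""
    query = question
    tried: List[str] = []
    relevant: List[str] = []
    for _ in range(max_retries + 1):
        tried.append(query)
        relevant = [p for p in retrieve(query) if grade(question, p)]
        if relevant:
            break
        query = rewrite(query)  # the corrective step
    return LoopResult(answer=generate(question, relevant), queries=tried)


# Toy stand-ins (hypothetical) so the sketch runs offline.
CORPUS = {"invoice total": ["The invoice total is 420 USD."]}


def toy_retrieve(query: str) -> List[str]:
    return CORPUS.get(query, [])


def toy_grade(question: str, passage: str) -> bool:
    return "total" in passage


def toy_rewrite(query: str) -> str:
    return "invoice total"


def toy_generate(question: str, passages: List[str]) -> str:
    return passages[0] if passages else "No relevant context found."


result = corrective_rag(
    "How much do we owe?", toy_retrieve, toy_grade, toy_rewrite, toy_generate
)
print(result.queries)
print(result.answer)
```

In a real pipeline the four callables would wrap the vector search, the grading model, a query rewriter, and the generation model. Passing them in as parameters keeps the loop logic testable on its own.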

Prerequisites

Setup

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your API keys

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| GOOGLE_API_KEY | Yes | Gemini Embedding 2 + Gemini 2.5 Pro OCR |
| ANTHROPIC_API_KEY | Yes | Claude for grading + generation |
| LANGSMITH_API_KEY | Yes | LangSmith tracing and evals |
| LANGSMITH_TRACING | Yes | Set to true to enable |
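Since all four variables are required, a quick check before running can save a failed API call. The sketch below only inspects the process environment. Whether the project loads .env on its own is not stated here, so export the values or load the file first. The helper name `missing_env_vars` is hypothetical.

```python
import os
from typing import List, Mapping

# Variable names taken from the table above.
REQUIRED_VARS = (
    "GOOGLE_API_KEY",
    "ANTHROPIC_API_KEY",
    "LANGSMITH_API_KEY",
    "LANGSMITH_TRACING",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_env_vars()
print("Missing:", ", ".join(missing) if missing else "none")
```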

Running

  1. Add PDF files to the docs/ directory
  2. Run the pipeline:
python pipeline.py
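Step 1 hands the pipeline whole PDF files, not extracted text. As a rough sketch of what collecting those raw bytes could look like, assuming a flat docs/ directory, the hypothetical helper below reads each .pdf file and keeps only files that begin with the standard `%PDF-` header. How pipeline.py itself discovers and validates files may differ.

```python
from pathlib import Path
from typing import Dict


def load_pdf_bytes(docs_dir: str = "docs") -> Dict[str, bytes]:
    """Map each PDF filename in docs_dir to its raw bytes.

    Files that do not start with the %PDF- header are skipped.
    """
    root = Path(docs_dir)
    if not root.is_dir():
        return {}
    pdfs: Dict[str, bytes] = {}
    for path in sorted(root.glob("*.pdf")):
        data = path.read_bytes()
        if data.startswith(b"%PDF-"):  # simple sanity check, not full validation
            pdfs[path.name] = data
    return pdfs


for name, data in load_pdf_bytes().items():
    print(f"{name}: {len(data)} bytes")
```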

To run evaluations:

python evals.py

Article

Stop Extracting Text from PDFs — Embed the Document Directly
