AI-powered FDA document intelligence system using Retrieval-Augmented Generation (RAG) to analyze FDA 483 inspection reports and provide accurate, compliance-grade answers through a Streamlit interface.
This project demonstrates an end-to-end RAG (Retrieval-Augmented Generation) pipeline designed to work with FDA inspection and quality documents (FDA 483 reports).
Instead of relying on generic AI knowledge, the system:
- Retrieves relevant content directly from uploaded FDA PDFs
- Uses a strict Quality & Compliance expert prompt
- Generates fact-based, non-hallucinated answers
- Supports querying across all documents or a single selected PDF
The application is built as a professional Streamlit web app, suitable for demos, learning, and portfolio use.
- 📄 Query across all FDA PDFs or a single selected document
- 🔍 Accurate document-grounded answers using RAG
- 🧠 Strict FDA Quality, Compliance & R&D expert behavior
- 🚫 No hallucination (answers only from provided PDFs)
- 📊 Displays total PDFs and knowledge chunks
- 🎨 Clean, enterprise-style Streamlit UI
- ⚡ Powered by OpenAI embeddings + Pinecone vector DB
RAG combines:
- Retrieval – finding relevant document chunks from a vector database
- Generation – using an LLM to answer based only on retrieved content
This ensures:
- High factual accuracy
- No guessing
- Enterprise-ready AI behavior
FDA PDFs ↓ Text Extraction + Chunking ↓ Embeddings (OpenAI) ↓ Vector Database (Pinecone) ↓ Retriever ↓ LLM (Strict Quality Expert Prompt) ↓ Streamlit UI
- Language: Python
- Frontend: Streamlit
- LLM & Embeddings: OpenAI
- Vector Database: Pinecone
- RAG Architecture
- PDF Handling: PyMuPDF + OCR (in notebook)
fda-rag-document-intelligence/
│
├── Streamlit_app.py # Streamlit web application
├── requirements.txt # Python dependencies
├── README.md # Project documentation
├── .gitignore # Ignored files
│
├── notebooks/
│ └── FDA_BOT_20_01.ipynb # Original RAG development notebook
│
├── visuals/
│ └── app_screenshots.png # UI screenshots (optional)
│
└── .streamlit/
└── secrets.example.toml # Example secrets file
git clone https://github.com/MK1404/fda-rag-document-intelligence.git
cd fda-rag-document-intelligencepip install -r requirements.txtCreate a folder:
mkdir .streamlitCreate a file: .streamlit/secrets.toml
OPENAI_API_KEY = "your-openai-api-key"
PINECONE_API_KEY = "your-pinecone-api-key"streamlit run Streamlit_app.py-
By default, the app searches across all PDFs
-
Use the sidebar dropdown to select a specific PDF
-
Ask questions such as:
- “List all observations for this site”
- “Return common FDA observations across all PDFs”
- “What quality issues were identified?”
-
The system responds using only document content
- Return all FDA observations from the selected report
- What repeated quality issues appear across inspections?
- List CAPA-related observations
- Identify compliance gaps mentioned in the document
- PDF upload directly from UI
- Source citation highlighting
- Observation severity tagging
- Compliance dashboards
- Export responses as reports
- This project is for learning and demonstration purposes
- No confidential or proprietary FDA data is included
- Users must provide their own API keys
Mohit Data Analytics & AI (Learning Project)
Consider starring ⭐ the repo to support the project.