Clinera MDT - Medical Document Extraction (Production)

Production-ready medical document OCR and PHI anonymization using Google Cloud.

🚀 Features

  • High-accuracy OCR - Google Cloud Document AI
  • PHI Detection - Automatic detection of patient data (DLP API)
  • HIPAA Compliance - Anonymize PHI with one click
  • Production-ready - Enterprise-grade, scalable
  • Smart extraction - Structured medical data (patient info, meds, labs, diagnoses)
  • Cost-effective - Pay per use, ~$1.50 per 1,000 pages of OCR after the free tier (see Pricing)

📋 Quick Start

1. Install Dependencies

npm install

2. Set Up Google Cloud

⚡ AUTOMATED (Recommended) - One command:

./setup-google-cloud.sh

This script automatically:

  • ✅ Creates Google Cloud project
  • ✅ Enables all required APIs
  • ✅ Creates Document AI processor
  • ✅ Sets up service account
  • ✅ Generates .env.local

Time: ~5 minutes

📝 MANUAL - Step-by-step guide:

Follow GOOGLE_CLOUD_SETUP.md for manual setup.

3. Run

npm run dev

Open http://localhost:3000

🎯 How It Works

Extraction Flow

Upload Document (PDF/Image)
    ↓
1. Try direct text extraction (PDF only) - Fast & Free
    ↓
2. If scanned → Document AI OCR - High accuracy
    ↓
3. Extract structured medical data
    ↓
4. Optional: Detect & anonymize PHI (DLP)
    ↓
Download JSON
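
The fallback between steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the code in route.ts: the function name extractText, the injected extractor functions, the "direct-text" method label, and the 50-character threshold for deciding that a PDF is scanned are all assumptions made for the example. Only the "document-ai-ocr" label appears in this README.

```typescript
// Hypothetical result shape; "document-ai-ocr" matches the `method` value shown
// in the API response, "direct-text" is an assumed label for the fast path.
type ExtractionResult = {
  text: string;
  method: "direct-text" | "document-ai-ocr";
};

// Both extractors are injected so the sketch does not depend on any real SDK.
type Extractor = (file: Uint8Array) => Promise<string>;

// Assumed heuristic: a PDF with fewer than `minChars` characters of embedded
// text is treated as scanned and sent to OCR.
async function extractText(
  file: Uint8Array,
  isPdf: boolean,
  directExtract: Extractor,
  ocrExtract: Extractor,
  minChars = 50,
): Promise<ExtractionResult> {
  if (isPdf) {
    try {
      const text = (await directExtract(file)).trim();
      if (text.length >= minChars) {
        return { text, method: "direct-text" };
      }
    } catch {
      // Direct extraction failed; fall through to OCR.
    }
  }
  // Images, scanned PDFs, and failed direct extractions all end up here.
  const text = await ocrExtract(file);
  return { text, method: "document-ai-ocr" };
}
```

Passing the extractors in as parameters keeps the decision logic easy to test without Google Cloud credentials.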

PHI Anonymization

The DLP API detects and anonymizes:

  • 👤 Patient names
  • 📅 Dates of birth
  • 🏥 Medical record numbers (MRN)
  • 📞 Phone numbers
  • 📧 Email addresses
  • 🏠 Addresses
  • 🆔 Social Security Numbers
  • And more...
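
To illustrate what anonymization does to the text, here is a minimal sketch that replaces each detected span with a bracketed info-type placeholder and counts replacements per type, similar to the stats object shown under Usage Examples. The Finding shape and the redact function are hypothetical; the actual DLP response format and the logic in dlp-service.ts are not shown in this README.

```typescript
// Hypothetical finding shape: an info type plus character offsets into the text.
interface Finding {
  infoType: string; // e.g. "PERSON_NAME"
  start: number; // inclusive offset
  end: number; // exclusive offset
}

interface RedactionResult {
  text: string;
  replacements: number;
  stats: Record<string, number>;
}

function redact(text: string, findings: Finding[]): RedactionResult {
  // Replace from the end of the string so earlier offsets stay valid.
  const ordered = [...findings].sort((a, b) => b.start - a.start);
  const stats: Record<string, number> = {};
  let result = text;
  let replacements = 0;
  let lastStart = Infinity;

  for (const f of ordered) {
    // Skip a finding that overlaps one already replaced.
    if (f.end > lastStart) continue;
    result = result.slice(0, f.start) + `[${f.infoType}]` + result.slice(f.end);
    stats[f.infoType] = (stats[f.infoType] ?? 0) + 1;
    replacements += 1;
    lastStart = f.start;
  }
  return { text: result, replacements, stats };
}
```

Working backwards through the findings avoids having to recompute offsets after each replacement changes the string length.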

📊 Usage Examples

Basic OCR

// Upload PDF/image → Get structured data
{
  "patient_info": {
    "name": "John Doe",
    "mrn": "MRN-123456",
    "date_of_birth": "1980-05-15"
  },
  "medications": [...],
  "lab_results": [...]
}

With PHI Anonymization

// Same upload, but with "Anonymize PHI" enabled
{
  "patient_info": {
    "name": "[PERSON_NAME]",
    "mrn": "[MEDICAL_RECORD_NUMBER]",
    "date_of_birth": "[DATE_OF_BIRTH]"
  },
  "phi_anonymization": {
    "replacements": 15,
    "stats": {
      "PERSON_NAME": 3,
      "DATE_OF_BIRTH": 2,
      "MEDICAL_RECORD_NUMBER": 1,
      ...
    }
  }
}

🔧 API Reference

POST /api/extract

Extract data from medical documents.

Request:

FormData {
  file: File,           // PDF or image
  mode: 'structured' | 'raw',  // Extraction mode
  anonymize: 'true' | 'false'  // Enable PHI anonymization
}

Response:

{
  success: true,
  mode: 'structured',
  data: {...},          // Extracted medical data
  method: 'document-ai-ocr',
  phi_anonymization: {  // If anonymize=true
    found: 15,
    replacements: 15,
    stats: {...}
  }
}
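
A client call to this endpoint might look like the sketch below. The helper names buildExtractForm and extractDocument are hypothetical, and the ExtractResponse type only mirrors the fields in the example response above; data and stats are typed loosely because their full shape is not specified here. The sketch assumes a runtime with global fetch, FormData, and Blob (a browser or a recent Node.js).

```typescript
// Hypothetical helper: builds the multipart body described in the Request section.
function buildExtractForm(
  file: Blob,
  fileName: string,
  mode: "structured" | "raw" = "structured",
  anonymize = false,
): FormData {
  const form = new FormData();
  form.append("file", file, fileName);
  form.append("mode", mode);
  // The endpoint takes the flag as the string 'true' or 'false'.
  form.append("anonymize", anonymize ? "true" : "false");
  return form;
}

// Fields copied from the example response; not a complete schema.
interface ExtractResponse {
  success: boolean;
  mode: "structured" | "raw";
  data: unknown;
  method: string;
  phi_anonymization?: {
    found: number;
    replacements: number;
    stats: Record<string, number>;
  };
}

async function extractDocument(baseUrl: string, form: FormData): Promise<ExtractResponse> {
  const res = await fetch(`${baseUrl}/api/extract`, { method: "POST", body: form });
  if (!res.ok) {
    throw new Error(`Extraction failed with HTTP ${res.status}`);
  }
  return (await res.json()) as ExtractResponse;
}
```

With the dev server running, extractDocument("http://localhost:3000", buildExtractForm(pdfBlob, "report.pdf", "structured", true)) would request structured output with PHI anonymization enabled.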

💰 Pricing

Service      | Free Tier          | After Free Tier
Document AI  | 1,000 pages/month  | $1.50 per 1,000 pages
DLP API      | 1 GB text/month    | $0.20 per GB

Example costs:

  • 1,000 docs/month: ~$2-5/month
  • 10,000 docs/month: ~$15-20/month
  • 100,000 docs/month: ~$150-200/month
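
The table prices Document AI per page while the examples are per document, so the monthly total depends on how many pages each document has and on how many pages actually need OCR. A rough estimator using only the rates listed above is sketched below; the function name is hypothetical, the rates should be checked against current Google Cloud pricing, and the sketch assumes every page counted goes through OCR.

```typescript
// Rates copied from the pricing table in this README.
const DOC_AI_FREE_PAGES = 1000;
const DOC_AI_USD_PER_1000_PAGES = 1.5;
const DLP_FREE_GB = 1;
const DLP_USD_PER_GB = 0.2;

// Estimates the monthly bill in USD for a number of OCR pages and a volume of
// text (in GB) sent to the DLP API. Usage inside the free tier costs nothing.
function estimateMonthlyCostUsd(ocrPages: number, dlpGb: number): number {
  const billablePages = Math.max(0, ocrPages - DOC_AI_FREE_PAGES);
  const billableGb = Math.max(0, dlpGb - DLP_FREE_GB);
  return (
    (billablePages / 1000) * DOC_AI_USD_PER_1000_PAGES +
    billableGb * DLP_USD_PER_GB
  );
}
```

For example, 10,000 OCR pages with under 1 GB of DLP text leaves 9,000 billable pages, or 9 × $1.50 = $13.50.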

🔒 Security

  • ✅ Service account authentication
  • ✅ API keys never exposed to the frontend
  • ✅ DLP API for PHI detection
  • ✅ HIPAA-compliant infrastructure (Google Cloud)
  • ✅ Encryption at rest and in transit

Never commit:

  • google-cloud-key.json
  • .env.local

🐛 Troubleshooting

See GOOGLE_CLOUD_SETUP.md for common issues.

Quick fixes:

# Check environment variables
cat .env.local

# Verify service account key exists
ls -la google-cloud-key.json

# Test API access
gcloud auth application-default print-access-token
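
The first check can also be done in code at startup, so that a missing variable fails fast with a clear message. The sketch below is an assumption about how that could look: missingEnvVars is a hypothetical helper, and the two variable names are the ones passed to the Cloud Run deploy command in this README. The project may need others that are listed in .env.example.

```typescript
// Variable names taken from the Cloud Run deploy command in this README.
const REQUIRED_ENV_VARS = ["GOOGLE_CLOUD_PROJECT_ID", "DOCUMENT_AI_PROCESSOR_ID"];

// Returns the names of required variables that are unset or blank.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV_VARS.filter((name) => (env[name] ?? "").trim() === "");
}
```

On the server this would be called as missingEnvVars(process.env), throwing an error that lists the returned names if the array is not empty.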

🚀 Deploy to Production

Cloud Run (Recommended)

gcloud run deploy clinera-mdt \
  --source . \
  --platform managed \
  --region us-central1 \
  --set-env-vars GOOGLE_CLOUD_PROJECT_ID=your-project-id \
  --set-env-vars DOCUMENT_AI_PROCESSOR_ID=your-processor-id

App Engine

gcloud app deploy

📁 Project Structure

clinera-mdt-pdf-extraction-poc/
├── src/
│   ├── app/
│   │   ├── api/extract/route.ts    # Main API endpoint
│   │   └── page.tsx                # UI
│   └── lib/
│       ├── google-cloud-config.ts  # Config
│       ├── document-ai-service.ts  # Document AI integration
│       └── dlp-service.ts          # DLP API integration
├── .env.example                    # Environment template
├── GOOGLE_CLOUD_SETUP.md          # Complete setup guide
└── package.json

📝 License

Private - Clinera MDT


Built with ❤️ for healthcare professionals
