MedOCR-Vision: Medical OCR with PaddleOCR-VL

ERNIE AI Developer Challenge Submission

A fine-tuned PaddleOCR-VL model specialized for medical document OCR, trained on a domain-balanced dataset of 2,462 medical and general documents.


Overview

MedOCR-Vision is a production-ready OCR solution that combines:

  • Fine-tuned Model: PaddleOCR-VL optimized for medical documents
  • Curated Dataset: 2,462 samples balancing medical and general documents
  • Complete Pipeline: From data processing to model deployment

Key Features

  • Specialized for Medical Documents: High accuracy on prescriptions, lab reports, and medical forms
  • Domain-Balanced Training: Maintains general OCR capabilities while specializing in the medical domain
  • Production-Ready: Full merged model (float16) ready for deployment
  • Comprehensive Documentation: Complete training pipeline and inference examples

Project Components

1. Fine-tuned Model

URL: https://huggingface.co/naazimsnh02/medocr-vision

A PaddleOCR-VL model fine-tuned with LoRA on medical documents:

  • Base Model: unsloth/PaddleOCR-VL (1B parameters)
  • Training: 3 epochs with domain-balanced data
  • Format: Full merged model (float16)
  • Ready for: Production deployment

2. Training Dataset

URL: https://huggingface.co/datasets/naazimsnh02/medocr-vision-dataset

Curated dataset with high-quality annotations:

  • 2,462 total samples (80/10/10 train/val/test split)
  • Domain Balance: 59.4% medical, 40.6% general
  • Document Types: Handwritten, scanned, printed
  • PaddleOCR Compatible: Ready-to-use label format
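
The PaddleOCR label format pairs each image path with its annotation on a single tab-separated line. A minimal parsing sketch, assuming the common PaddleOCR convention of "image path, tab, JSON payload" (the JSON field names below are illustrative, not the dataset's exact schema):

```python
import json

def parse_label_line(line: str):
    """Split a PaddleOCR-style label line into (image_path, annotation).

    Assumes the usual convention: image path, a tab character,
    then a JSON payload describing the text content.
    """
    image_path, payload = line.rstrip("\n").split("\t", 1)
    return image_path, json.loads(payload)

# Hypothetical example line; the real dataset's JSON fields may differ.
line = 'images/rx_0001.jpg\t{"text": "Amoxicillin 500mg, twice daily"}'
path, annotation = parse_label_line(line)
print(path, annotation["text"])
```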

3. Training Notebook

File: medocr_paddle_training_notebook.ipynb

Complete training pipeline including:

  • Dataset loading from Hugging Face
  • LoRA fine-tuning configuration
  • Pre/post-training evaluation
  • Model saving and deployment
  • Automatic push to Hugging Face Hub

4. Evaluation Notebook

File: medocr-vision-evaluation-notebook.ipynb

Comprehensive model evaluation including:

  • Side-by-side comparison with base model
  • Medical information extraction metrics
  • Content coverage analysis
  • Performance improvement quantification

Quick Start

Using the Pre-trained Model

from unsloth import FastVisionModel
from transformers import AutoProcessor
from PIL import Image

# Load model
model, tokenizer = FastVisionModel.from_pretrained(
    "naazimsnh02/medocr-vision"
)
FastVisionModel.for_inference(model)  # switch to inference mode
processor = AutoProcessor.from_pretrained(
    "naazimsnh02/medocr-vision",
    trust_remote_code=True
)

# Prepare input
image = Image.open("medical_document.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract all text from this medical document:"}
    ]
}]

# Generate
text_prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(image, text_prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = output[0][inputs["input_ids"].shape[1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(text)

Training Your Own Model

  1. Clone the repository:

git clone https://github.com/Naazimsnh02/medocr-vision
cd medocr-vision

  2. Open the training notebook:
    • Upload medocr_paddle_training_notebook.ipynb to Google Colab or Modal
    • Set your HF_TOKEN environment variable
    • Run all cells

The notebook will:

  • Load the dataset from Hugging Face
  • Fine-tune PaddleOCR-VL for 3 epochs
  • Save checkpoints every 100 steps
  • Push the final model to Hugging Face Hub

Dataset Details

Composition

Dataset               | Samples | Domain  | Type
----------------------|---------|---------|----------------
Medical Prescriptions | 1,000   | Medical | Handwritten
OMR Scanned Documents | 36      | Medical | Scanned Forms
Medical Lab Reports   | 426     | Medical | Printed Reports
Invoices & Receipts   | 1,000   | General | Business Docs
Total                 | 2,462   | -       | -

Statistics

  • Training: 1,969 samples (80%)
  • Validation: 246 samples (10%)
  • Test: 247 samples (10%)
  • Domain Balance: 59.4% Medical / 40.6% General
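
The reported counts are one plausible outcome of an 80/10/10 split over 2,462 samples, with the test split absorbing the rounding remainder; a quick sanity check (the actual split script may round differently):

```python
total = 2462
train = int(total * 0.8)    # floor of 1969.6 -> 1969
val = int(total * 0.1)      # floor of 246.2  -> 246
test = total - train - val  # remainder       -> 247
print(train, val, test)     # 1969 246 247
```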

Text Characteristics

  • Short (200-400 chars): Prescriptions, Invoices
  • Medium (400-1,000 chars): OMR Documents
  • Long (1,000-5,000 chars): Medical Lab Reports

Data Sources

  1. Medical Prescriptions: chinmays18/medical-prescription-dataset
  2. OMR Documents: saurabh1896/OMR-scanned-documents
  3. Medical Lab Reports: dikshaasinghhh/bajaj
  4. Invoices & Receipts: mychen76/invoices-and-receipts_ocr_v1

Processing Pipeline

The dataset preparation involves:

  1. Medical Prescriptions: Direct extraction from JSON annotations
  2. OMR Documents: LLM-based text extraction
  3. Medical Lab Reports: Parallel LLM processing
  4. Invoices & Receipts: Extraction from existing annotations
  5. Dataset Combination: Merging and splitting
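
The final combination step can be sketched as merging the per-source sample lists, shuffling with a fixed seed, and splitting 80/10/10. This is an assumption-laden sketch of what combine_final_dataset.py might do, not the script itself (seed and exact rounding are illustrative):

```python
import random

def combine_and_split(datasets, seed=42):
    """Merge per-source sample lists, shuffle deterministically,
    and split 80/10/10 into (train, val, test)."""
    samples = [s for source in datasets.values() for s in source]
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# Toy example: two sources with 10 samples each
toy = {"medical": list(range(10)), "general": list(range(10, 20))}
train, val, test = combine_and_split(toy)
print(len(train), len(val), len(test))  # 16 2 2
```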

Training Strategy

Hyperparameters

# Training Duration
num_train_epochs: 3

# Batch Configuration
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
effective_batch_size: 8

# Learning Rate
learning_rate: 5e-5
warmup_steps: 50
lr_scheduler_type: linear

# Optimization
optimizer: adamw_8bit
weight_decay: 0.001

# LoRA Configuration
lora_r: 64
lora_alpha: 64
lora_dropout: 0

Training Process

  1. Checkpointing: Every 100 steps, keeping best 5
  2. Evaluation: Every 100 steps on validation set
  3. Best Model Selection: Based on lowest evaluation loss
  4. Mixed Precision: BF16/FP16 for efficiency
  5. Total Steps: ~738 steps (3 epochs)
  6. Training Time: ~3-4 hours on NVIDIA L4
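
The step count follows directly from the batch configuration: 1,969 training samples at an effective batch size of 8 give 246 steps per epoch, and 3 epochs give ~738 total steps. The arithmetic, assuming the trainer drops the final partial batch:

```python
train_samples = 1969
per_device_batch = 4
grad_accum = 2
epochs = 3

effective_batch = per_device_batch * grad_accum    # 4 * 2 = 8
steps_per_epoch = train_samples // effective_batch  # 1969 // 8 = 246
total_steps = steps_per_epoch * epochs              # 246 * 3 = 738
print(effective_batch, steps_per_epoch, total_steps)
```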

Repository Structure

medocr-vision/
├── README.md                                    # This file
├── ERNIE_CHALLENGE_SUBMISSION.md                # Challenge submission details
├── medocr_paddle_training_notebook.ipynb        # Training notebook
├── medocr-vision-evaluation-notebook.ipynb      # Model evaluation notebook
├── requirements.txt                             # Python dependencies
├── .env.template                                # Environment template
│
├── scripts/                                     # Data processing pipeline
│   ├── step1_process_prescriptions.py
│   ├── step2_process_omr.py
│   ├── step3_process_bajaj_parallel.py
│   ├── step4_process_invoices.py
│   ├── combine_final_dataset.py
│   ├── verify_all_datasets.py
│   └── run_all.py
│
├── docs/                                        # Documentation
│   ├── DATASET.md                               # Dataset details
│   └── SETUP.md                                 # Setup guide
│
└── data/                                        # Data directory (gitignored)
    ├── README.md                                # Data structure and sources
    ├── final_dataset/                           # Training-ready dataset
    └── medical_ocr_dataset/                     # Processed datasets

Data Preparation (Optional)

If you want to recreate the dataset from scratch:

  1. Configure API keys:

cp .env.template .env
# Edit .env with your API credentials

  2. Download source datasets (see docs/SETUP.md)

  3. Run the processing pipeline:

cd scripts
python run_all.py

  4. Verify the dataset:

python verify_all_datasets.py

Performance Highlights

Model Improvements Over Base Model

Our fine-tuned model demonstrates significant improvements across multiple metrics:

  • Enhanced Information Extraction: Captures more complete medical information including headers, test values, and reference ranges
  • Better Document Understanding: Improved coverage of document structure and context
  • Continuous Text Output: Produces natural, flowing text that preserves semantic relationships
  • Medical Domain Specialization: Superior performance on medical terminology and clinical data
  • Comprehensive Coverage: Extracts significantly more relevant content from medical documents

For detailed performance metrics, see the evaluation notebook: medocr-vision-evaluation-notebook.ipynb

Key Advantages

  • Specialized for medical documents (prescriptions, lab reports, forms)
  • Maintains general OCR capabilities (invoices, receipts, business docs)
  • Domain-balanced training prevents catastrophic forgetting
  • Production-ready merged model (no adapter loading required)
  • Comprehensive documentation and examples

Use Cases

  • Medical prescription digitization
  • Lab report data extraction
  • Medical form processing
  • Healthcare document management
  • Medical records digitization
  • General business document OCR

Requirements

  • Python 3.8+
  • CUDA-capable GPU (for training/inference)
  • Dependencies: transformers, datasets, unsloth, einops
  • Hugging Face account (for dataset/model access)

Citation

If you use this project, please cite:

@misc{medocr-vision,
  title={MedOCR-Vision: Medical OCR with PaddleOCR-VL},
  author={Naazim},
  year={2025},
  publisher={GitHub},
  url={https://github.com/Naazimsnh02/medocr-vision}
}

License

This project combines multiple sources. Please refer to individual dataset licenses for usage terms.

Acknowledgments

  • Base Model: unsloth/PaddleOCR-VL
  • Framework: Unsloth for efficient training
  • Dataset Creators: For making their data publicly available
  • LLM Providers: Nebius and Novita for API access
  • PaddleOCR Team: For the excellent OCR framework

ERNIE AI Developer Challenge

This project is submitted for the ERNIE AI Developer Challenge.

Submission Components:

  1. Fine-tuned Model: https://huggingface.co/naazimsnh02/medocr-vision
  2. Code Repository: https://github.com/Naazimsnh02/medocr-vision
  3. Training Dataset: https://huggingface.co/datasets/naazimsnh02/medocr-vision-dataset

For detailed submission information, see ERNIE_CHALLENGE_SUBMISSION.md.


Version: 1.0
Last Updated: December 2025
Challenge: ERNIE AI Developer Challenge
