ERNIE AI Developer Challenge Submission
A fine-tuned PaddleOCR-VL model specialized for medical document OCR, trained on a domain-balanced dataset of 2,462 medical and general documents.
MedOCR-Vision is a production-ready OCR solution that combines:
- Fine-tuned Model: PaddleOCR-VL optimized for medical documents
- Curated Dataset: 2,462 samples balancing medical and general documents
- Complete Pipeline: From data processing to model deployment
- Specialized for Medical Documents: High accuracy on prescriptions, lab reports, and medical forms
- Domain-Balanced Training: Maintains general OCR capabilities while specializing in medical domain
- Production-Ready: Full merged model (float16) ready for deployment
- Comprehensive Documentation: Complete training pipeline and inference examples
URL: https://huggingface.co/naazimsnh02/medocr-vision
A PaddleOCR-VL model fine-tuned with LoRA on medical documents:
- Base Model: unsloth/PaddleOCR-VL (1B parameters)
- Training: 3 epochs with domain-balanced data
- Format: Full merged model (float16)
- Ready for: Production deployment
URL: https://huggingface.co/datasets/naazimsnh02/medocr-vision-dataset
Curated dataset with high-quality annotations:
- 2,462 total samples (80/10/10 train/val/test split)
- Domain Balance: 59.4% medical, 40.6% general
- Document Types: Handwritten, scanned, printed
- PaddleOCR Compatible: Ready-to-use label format
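PaddleOCR's recognition label format pairs an image path with its ground-truth text, separated by a tab. A minimal sketch of parsing such a label file, assuming that tab-separated layout (the file paths and texts below are illustrative, not taken from the actual dataset):

```python
# Sketch: parse a PaddleOCR-style recognition label file, where each line
# pairs an image path with its ground-truth text, separated by a tab.
# (Paths and texts here are illustrative; the dataset's exact layout may differ.)

def parse_label_lines(lines):
    """Return a list of (image_path, text) pairs from tab-separated label lines."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        path, text = line.split("\t", 1)
        samples.append((path, text))
    return samples

labels = [
    "images/rx_0001.jpg\tAmoxicillin 500mg, twice daily for 7 days",
    "images/lab_0042.jpg\tHemoglobin: 13.5 g/dL (ref 12.0-15.5)",
]
for path, text in parse_label_lines(labels):
    print(path, "->", text)
```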
File: medocr_paddle_training_notebook.ipynb
Complete training pipeline including:
- Dataset loading from Hugging Face
- LoRA fine-tuning configuration
- Pre/post-training evaluation
- Model saving and deployment
- Automatic push to Hugging Face Hub
File: medocr-vision-evaluation-notebook.ipynb
Comprehensive model evaluation including:
- Side-by-side comparison with base model
- Medical information extraction metrics
- Content coverage analysis
- Performance improvement quantification
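Content coverage can be approximated as token-level recall: the fraction of reference tokens that also appear in the model's output. The sketch below is an illustrative stand-in for the evaluation notebook's actual metrics, not a reproduction of them:

```python
# Sketch: token-level content coverage -- the fraction of reference tokens
# that appear in the model output. An illustrative stand-in for the
# evaluation notebook's metrics, not a reproduction of them.
from collections import Counter

def content_coverage(reference: str, prediction: str) -> float:
    ref = Counter(reference.lower().split())
    pred = Counter(prediction.lower().split())
    if not ref:
        return 0.0
    matched = sum(min(count, pred[tok]) for tok, count in ref.items())
    return matched / sum(ref.values())

ref = "Hemoglobin 13.5 g/dL reference range 12.0 15.5"
base_out = "Hemoglobin 13.5"                                # partial extraction
tuned_out = "Hemoglobin 13.5 g/dL reference range 12.0 15.5"  # full extraction

print(content_coverage(ref, base_out))   # partial coverage
print(content_coverage(ref, tuned_out))  # 1.0
```

A higher coverage score indicates that more of the reference content (headers, test values, reference ranges) was captured in the output.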
```python
from unsloth import FastVisionModel
from transformers import AutoProcessor
from PIL import Image

# Load the fine-tuned model
model, tokenizer = FastVisionModel.from_pretrained(
    "naazimsnh02/medocr-vision"
)
processor = AutoProcessor.from_pretrained(
    "naazimsnh02/medocr-vision",
    trust_remote_code=True
)

# Prepare the input
image = Image.open("medical_document.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract all text from this medical document:"},
    ],
}]

# Generate
text_prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(image, text_prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=256)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```

- Clone the repository:
```bash
git clone https://github.com/Naazimsnh02/medocr-vision
cd medocr-vision
```

- Open the training notebook:
  - Upload `medocr_paddle_training_notebook.ipynb` to Google Colab or Modal
  - Set your `HF_TOKEN` environment variable
  - Run all cells
The notebook will:
- Load the dataset from Hugging Face
- Fine-tune PaddleOCR-VL for 3 epochs
- Save checkpoints every 100 steps
- Push the final model to Hugging Face Hub
| Dataset | Samples | Domain | Type |
|---|---|---|---|
| Medical Prescriptions | 1,000 | Medical | Handwritten |
| OMR Scanned Documents | 36 | Medical | Scanned Forms |
| Medical Lab Reports | 426 | Medical | Printed Reports |
| Invoices & Receipts | 1,000 | General | Business Docs |
| Total | 2,462 | - | - |
- Training: 1,969 samples (80%)
- Validation: 246 samples (10%)
- Test: 247 samples (10%)
- Domain Balance: 59.4% Medical / 40.6% General
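The split sizes follow directly from an 80/10/10 cut of 2,462 samples; a quick check of the arithmetic:

```python
# Quick check: an 80/10/10 split of 2,462 samples yields the counts above,
# with the remainder after flooring train/val assigned to the test split.
total = 2_462

train = int(total * 0.80)   # 1,969
val = int(total * 0.10)     # 246
test = total - train - val  # 247

print(train, val, test)
```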
- Short (200-400 chars): Prescriptions, Invoices
- Medium (400-1,000 chars): OMR Documents
- Long (1,000-5,000 chars): Medical Lab Reports
- Medical Prescriptions: chinmays18/medical-prescription-dataset
- OMR Documents: saurabh1896/OMR-scanned-documents
- Medical Lab Reports: dikshaasinghhh/bajaj
- Invoices & Receipts: mychen76/invoices-and-receipts_ocr_v1
The dataset preparation involves:
- Medical Prescriptions: Direct extraction from JSON annotations
- OMR Documents: LLM-based text extraction
- Medical Lab Reports: Parallel LLM processing
- Invoices & Receipts: Extraction from existing annotations
- Dataset Combination: Merging and splitting
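The final combination step can be sketched as merging the per-source sample lists and cutting deterministic splits. This illustrates the idea, assuming a seeded shuffle; it is not a reproduction of `scripts/combine_final_dataset.py`:

```python
# Sketch: combine per-source sample lists, shuffle with a fixed seed, and
# cut train/val/test splits. Illustrative only -- not a reproduction of
# scripts/combine_final_dataset.py.
import random

def combine_and_split(sources, seed=42, train_frac=0.8, val_frac=0.1):
    samples = [s for source in sources for s in source]
    rng = random.Random(seed)      # fixed seed keeps splits reproducible
    rng.shuffle(samples)
    n = len(samples)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# Hypothetical miniature sources with the dataset's domain labels
prescriptions = [{"image": f"rx_{i}.jpg", "domain": "medical"} for i in range(8)]
invoices = [{"image": f"inv_{i}.jpg", "domain": "general"} for i in range(2)]
train, val, test = combine_and_split([prescriptions, invoices])
print(len(train), len(val), len(test))  # 8 1 1
```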
```yaml
# Training Duration
num_train_epochs: 3

# Batch Configuration
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
effective_batch_size: 8

# Learning Rate
learning_rate: 5e-5
warmup_steps: 50
lr_scheduler_type: linear

# Optimization
optimizer: adamw_8bit
weight_decay: 0.001

# LoRA Configuration
lora_r: 64
lora_alpha: 64
lora_dropout: 0
```

- Checkpointing: Every 100 steps, keeping the best 5
- Evaluation: Every 100 steps on the validation set
- Best Model Selection: Based on lowest evaluation loss
- Mixed Precision: BF16/FP16 for efficiency
- Total Steps: ~738 (3 epochs)
- Training Time: ~3-4 hours on an NVIDIA L4
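The step count follows from the batch configuration; a quick check, assuming incomplete batches are dropped:

```python
# Quick check: ~738 total steps follows from 1,969 training samples,
# an effective batch size of 8 (4 per device x 2 accumulation steps),
# and 3 epochs, assuming incomplete batches are dropped.
train_samples = 1_969
per_device_batch = 4
grad_accum = 2
epochs = 3

effective_batch = per_device_batch * grad_accum    # 8
steps_per_epoch = train_samples // effective_batch # 246
total_steps = steps_per_epoch * epochs             # 738
print(effective_batch, steps_per_epoch, total_steps)
```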
```
medocr-vision/
├── README.md                               # This file
├── ERNIE_CHALLENGE_SUBMISSION.md           # Challenge submission details
├── medocr_paddle_training_notebook.ipynb   # Training notebook
├── medocr-vision-evaluation-notebook.ipynb # Model evaluation notebook
├── requirements.txt                        # Python dependencies
├── .env.template                           # Environment template
│
├── scripts/                                # Data processing pipeline
│   ├── step1_process_prescriptions.py
│   ├── step2_process_omr.py
│   ├── step3_process_bajaj_parallel.py
│   ├── step4_process_invoices.py
│   ├── combine_final_dataset.py
│   ├── verify_all_datasets.py
│   └── run_all.py
│
├── docs/                                   # Documentation
│   ├── DATASET.md                          # Dataset details
│   └── SETUP.md                            # Setup guide
│
└── data/                                   # Data directory (gitignored)
    ├── README.md                           # Data structure and sources
    ├── final_dataset/                      # Training-ready dataset
    └── medical_ocr_dataset/                # Processed datasets
```
If you want to recreate the dataset from scratch:
- Configure API keys:

```bash
cp .env.template .env
# Edit .env with your API credentials
```

- Download the source datasets (see docs/SETUP.md)
- Run the processing pipeline:

```bash
cd scripts
python run_all.py
```

- Verify the dataset:

```bash
python verify_all_datasets.py
```

Our fine-tuned model demonstrates significant improvements across multiple metrics:
- ✅ Enhanced Information Extraction: Captures more complete medical information including headers, test values, and reference ranges
- ✅ Better Document Understanding: Improved coverage of document structure and context
- ✅ Continuous Text Output: Produces natural, flowing text that preserves semantic relationships
- ✅ Medical Domain Specialization: Superior performance on medical terminology and clinical data
- ✅ Comprehensive Coverage: Extracts significantly more relevant content from medical documents
For detailed performance metrics, see the evaluation notebook: medocr-vision-evaluation-notebook.ipynb
- Specialized for medical documents (prescriptions, lab reports, forms)
- Maintains general OCR capabilities (invoices, receipts, business docs)
- Domain-balanced training prevents catastrophic forgetting
- Production-ready merged model (no adapter loading required)
- Comprehensive documentation and examples
- Medical prescription digitization
- Lab report data extraction
- Medical form processing
- Healthcare document management
- Medical records digitization
- General business document OCR
- Python 3.8+
- CUDA-capable GPU (for training/inference)
- Dependencies: `transformers`, `datasets`, `unsloth`, `einops`
- Hugging Face account (for dataset/model access)
If you use this project, please cite:
```bibtex
@misc{medocr-vision,
  title={MedOCR-Vision: Medical OCR with PaddleOCR-VL},
  author={Naazim},
  year={2025},
  publisher={GitHub},
  url={https://github.com/Naazimsnh02/medocr-vision}
}
```

This project combines multiple data sources. Please refer to individual dataset licenses for usage terms.
- Base Model: unsloth/PaddleOCR-VL
- Framework: Unsloth for efficient training
- Dataset Creators: For making their data publicly available
- LLM Providers: Nebius and Novita for API access
- PaddleOCR Team: For the excellent OCR framework
This project is submitted for the ERNIE AI Developer Challenge.
Submission Components:
- Fine-tuned Model: https://huggingface.co/naazimsnh02/medocr-vision
- Code Repository: https://github.com/Naazimsnh02/medocr-vision
- Training Dataset: https://huggingface.co/datasets/naazimsnh02/medocr-vision-dataset
For detailed submission information, see ERNIE_CHALLENGE_SUBMISSION.md.
Version: 1.0
Last Updated: December 2025
Challenge: ERNIE AI Developer Challenge