This project implements a medical Named Entity Recognition (NER) system using two approaches:
- Custom spaCy Model Training: Training a custom NER model from scratch using annotated medical text data.
- Pre-trained Transformer Model: Using a pre-trained biomedical NER model from Hugging Face.
The system is designed to identify and extract medical entities from clinical text, including:
- Medications: Drug names and prescriptions (e.g., Aspirin, Metformin, Warfarin)
- Diseases: Medical conditions and diagnoses (e.g., diabetes, pneumonia, COPD)
- Treatments: Medical procedures and therapies (e.g., surgery, inhaler therapy)
The project is organized into the following steps:
- Data Loading and Exploration
  - Upload the annotated JSON dataset.
  - Explore the data structure and annotation format.
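The loading step can be sketched as follows. The record layout shown here (a list of `[text, {"entities": [[start, end, label]]}]` pairs, a common spaCy-style annotation format) is an assumption — the actual dataset may be structured differently.

```python
import json

# Hypothetical annotation format: each record pairs a text with
# character-offset entity spans. Adjust to the real dataset layout.
sample = """
[
  ["Patient was prescribed Aspirin for chest pain.",
   {"entities": [[23, 30, "MEDICATION"]]}]
]
"""

records = json.loads(sample)  # in practice: json.load(open("annotations.json"))
for text, annotations in records:
    print(text)
    for start, end, label in annotations["entities"]:
        # slice the text to see what each span actually covers
        print(f"  {label}: {text[start:end]!r} ({start}-{end})")
```

Printing the sliced spans next to their labels is a quick sanity check that the offsets in the file line up with the text.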
- Data Preprocessing
  - Convert JSON annotations into spaCy-compatible format (`start_char`, `end_char`, `label`).
  - Verify entity alignment with the text content.
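The alignment check above can be implemented with plain string slicing; the helper below is an illustrative sketch (the function name and the off-by-one example are hypothetical), flagging spans whose offsets fall on whitespace or outside the text.

```python
def verify_alignment(text, entities):
    """Check that each (start, end, label) span matches a clean slice of text."""
    mismatches = []
    for start, end, label in entities:
        span = text[start:end]
        # flag empty spans and spans with leading/trailing whitespace
        if not span or span != span.strip():
            mismatches.append((start, end, label, span))
    return mismatches

text = "Patient was prescribed Aspirin for chest pain."
good = [(23, 30, "MEDICATION")]
bad = [(22, 30, "MEDICATION")]  # off-by-one: includes the leading space

print(verify_alignment(text, good))  # -> []
print(verify_alignment(text, bad))   # flags the misaligned span
```

Catching these mismatches before training matters: spaCy silently drops entities whose character offsets do not align to token boundaries.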
- spaCy Training Data Preparation
  - Convert processed data into `DocBin` format for spaCy.
  - Handle overlapping entities and alignment issues.
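Overlap handling can be done before building the `DocBin`, since spaCy rejects overlapping entity spans outright. The filter below is one possible heuristic (keep the longer span when two overlap) — the function name and the preference rule are assumptions, not the project's confirmed logic.

```python
def drop_overlaps(entities):
    """Return a non-overlapping subset of (start, end, label) spans,
    preferring longer spans when two spans collide."""
    kept = []
    # sort by start position, longest span first at each position
    for ent in sorted(entities, key=lambda e: (e[0], -(e[1] - e[0]))):
        if all(ent[1] <= k[0] or ent[0] >= k[1] for k in kept):
            kept.append(ent)
    return sorted(kept)

ents = [(0, 8, "DISEASE"), (0, 15, "DISEASE"), (20, 27, "MEDICATION")]
print(drop_overlaps(ents))  # the shorter (0, 8) span is dropped
```

The surviving spans can then be turned into `Doc` objects via `doc.char_span(start, end, label=...)` and serialized with `spacy.tokens.DocBin`.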
- Model Training
  - Initialize the spaCy NER configuration.
  - Train the custom spaCy NER model on the prepared data.
  - Save `model-best/` as the best checkpoint.
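A minimal training sketch is shown below, using spaCy's Python API rather than the `spacy train` CLI the project likely uses with a config file. The two training examples and the epoch count are illustrative only; a real run needs far more data.

```python
import random
import spacy
from spacy.training import Example

# Illustrative training data -- offsets verified against the text.
TRAIN_DATA = [
    ("Patient was prescribed Aspirin for chest pain.",
     {"entities": [(23, 30, "MEDICATION")]}),
    ("Metformin controls diabetes in most patients.",
     {"entities": [(0, 9, "MEDICATION"), (19, 27, "DISEASE")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer, losses=losses)

nlp.to_disk("model-best")  # mirrors the model-best/ checkpoint above
```

With the CLI workflow, the equivalent steps are `spacy init config` followed by `spacy train`, which writes `model-best/` automatically.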
- Model Inference (spaCy)
  - Load the trained model and test it on sample medical text.
  - Visualize entities using spaCy's `displacy`.
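Inference and visualization can be sketched as below. In the notebook this would start from `spacy.load("model-best")`; here a rule-based `entity_ruler` stands in for the trained model so the snippet runs on its own, and the patterns are made up for the example.

```python
import spacy
from spacy import displacy

# Stand-in pipeline; replace with: nlp = spacy.load("model-best")
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MEDICATION", "pattern": "Warfarin"},
    {"label": "DISEASE", "pattern": "atrial fibrillation"},
])

doc = nlp("Warfarin was started for atrial fibrillation.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# outside a notebook, render returns the HTML markup as a string
html = displacy.render(doc, style="ent", jupyter=False)
```

In Colab or Jupyter, `displacy.render(doc, style="ent")` draws the highlighted entities inline instead of returning a string.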
- Transformer Model Inference
  - Load the pre-trained biomedical NER model from Hugging Face (`d4data/biomedical-ner-all`).
  - Use `pipeline("ner")` for entity extraction.
  - Aggregate subword tokens into complete entities.
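The subword-aggregation step can be sketched without loading the model. The token dicts below imitate raw `pipeline("ner")` output (the `"Drug"` label and the helper name are illustrative); note that the real pipeline can do this for you via `aggregation_strategy="simple"`.

```python
def merge_subwords(tokens):
    """Merge WordPiece fragments (##...) back into whole-word entities,
    stripping the B-/I- prefix from the entity tag."""
    entities = []
    for tok in tokens:
        word = tok["word"]
        if word.startswith("##") and entities:
            prev = entities[-1]
            prev["word"] += word[2:]   # append fragment without the ## marker
            prev["end"] = tok["end"]
        else:
            entities.append({
                "word": word,
                "entity": tok["entity"].split("-")[-1],  # B-Drug -> Drug
                "start": tok["start"],
                "end": tok["end"],
            })
    return entities

# imitation of raw pipeline("ner") output for the word "Metformin"
sample = [
    {"word": "Met", "entity": "B-Drug", "start": 0, "end": 3},
    {"word": "##formin", "entity": "I-Drug", "start": 3, "end": 10},
]
print(merge_subwords(sample))
```

In practice, `pipeline("ner", model="d4data/biomedical-ner-all", aggregation_strategy="simple")` performs equivalent merging inside the library.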
Potential applications include:
- Automated extraction of medical information from clinical notes.
- Medical record processing and analysis.
- Drug-disease relationship extraction.
- Clinical decision support systems.
The project is built on the following technologies:
- spaCy: For custom NER model training and inference.
- Transformers (Hugging Face): For pre-trained biomedical NER.
- Pandas: For data manipulation.
- PyTorch: Backend for transformer models.
- Google Colab / Local Python Environment: Development and testing.
Advantages of the pre-trained transformer approach:
- No Training Required: Ready to use without annotated data.
- Broad Coverage: Recognizes Chemicals/Drugs, Diseases, Genes, Species, Cell Types.
- High Accuracy: Transformer-based models handle context and ambiguity better than simple models.
| Aspect | Custom spaCy Model | Pre-trained Transformer |
|---|---|---|
| Training | Requires annotated data | Ready to use |
| Speed | Fast inference | Slower (but more accurate) |
| Customization | Fully customizable | Limited without fine-tuning |
| Entity types | Only what you train | Pre-defined broad set |
| Memory | Small footprint | Large (100s of MBs) |
Other pre-trained biomedical models worth exploring include:
- `dmis-lab/biobert-base-cased-v1.1`
- `allenai/scibert_scivocab_uncased`
- `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`
Each model is specialized for different biomedical domains, including clinical notes and research articles.