This project implements a medical Named Entity Recognition (NER) system using two approaches:
- Custom spaCy Model Training: Training a custom NER model from scratch using annotated medical text data.
- Pre-trained Transformer Model: Using a pre-trained biomedical NER model from Hugging Face.
The system is designed to identify and extract medical entities from clinical text, including:
- Medications: Drug names and prescriptions (e.g., Aspirin, Metformin, Warfarin)
- Diseases: Medical conditions and diagnoses (e.g., diabetes, pneumonia, COPD)
- Treatments: Medical procedures and therapies (e.g., surgery, inhaler therapy)
The project is organized into the following steps:
- Data Loading and Exploration
  - Upload the annotated JSON dataset.
  - Explore the data structure and annotation format.
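The loading step can be sketched as follows. The record layout shown here (a list of `[text, {"entities": [[start, end, label]]}]` pairs, a common spaCy-style annotation format) is an assumption — the actual dataset may be structured differently.

```python
import json

# Hypothetical annotation format: each record pairs a text with
# character-offset entity spans. Adjust to the real dataset layout.
sample = """
[
  ["Patient was prescribed Aspirin for chest pain.",
   {"entities": [[23, 30, "MEDICATION"]]}]
]
"""

records = json.loads(sample)  # in practice: json.load(open("annotations.json"))
for text, annotations in records:
    print(text)
    for start, end, label in annotations["entities"]:
        # slice the text to see what each span actually covers
        print(f"  {label}: {text[start:end]!r} ({start}-{end})")
```

Printing the sliced spans next to their labels is a quick sanity check that the offsets in the file line up with the text.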
- Data Preprocessing
  - Convert JSON annotations into spaCy-compatible format (`start_char`, `end_char`, `label`).
  - Verify entity alignment with the text content.
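The alignment check above can be implemented with plain string slicing; the helper below is an illustrative sketch (the function name and the off-by-one example are hypothetical), flagging spans whose offsets fall on whitespace or outside the text.

```python
def verify_alignment(text, entities):
    """Check that each (start, end, label) span matches a clean slice of text."""
    mismatches = []
    for start, end, label in entities:
        span = text[start:end]
        # flag empty spans and spans with leading/trailing whitespace
        if not span or span != span.strip():
            mismatches.append((start, end, label, span))
    return mismatches

text = "Patient was prescribed Aspirin for chest pain."
good = [(23, 30, "MEDICATION")]
bad = [(22, 30, "MEDICATION")]  # off-by-one: includes the leading space

print(verify_alignment(text, good))  # -> []
print(verify_alignment(text, bad))   # flags the misaligned span
```

Catching these mismatches before training matters: spaCy silently drops entities whose character offsets do not align to token boundaries.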
- spaCy Training Data Preparation
  - Convert processed data into `DocBin` format for spaCy.
  - Handle overlapping entities and alignment issues.
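Overlap handling can be done before building the `DocBin`, since spaCy rejects overlapping entity spans outright. The filter below is one possible heuristic (keep the longer span when two overlap) — the function name and the preference rule are assumptions, not the project's confirmed logic.

```python
def drop_overlaps(entities):
    """Return a non-overlapping subset of (start, end, label) spans,
    preferring longer spans when two spans collide."""
    kept = []
    # sort by start position, longest span first at each position
    for ent in sorted(entities, key=lambda e: (e[0], -(e[1] - e[0]))):
        if all(ent[1] <= k[0] or ent[0] >= k[1] for k in kept):
            kept.append(ent)
    return sorted(kept)

ents = [(0, 8, "DISEASE"), (0, 15, "DISEASE"), (20, 27, "MEDICATION")]
print(drop_overlaps(ents))  # the shorter (0, 8) span is dropped
```

The surviving spans can then be turned into `Doc` objects via `doc.char_span(start, end, label=...)` and serialized with `spacy.tokens.DocBin`.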
- Model Training
  - Initialize the spaCy NER configuration.
  - Train the custom spaCy NER model on the prepared data.
  - Save `model-best/` as the best checkpoint.
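A minimal training sketch is shown below, using spaCy's Python API rather than the `spacy train` CLI the project likely uses with a config file. The two training examples and the epoch count are illustrative only; a real run needs far more data.

```python
import random
import spacy
from spacy.training import Example

# Illustrative training data -- offsets verified against the text.
TRAIN_DATA = [
    ("Patient was prescribed Aspirin for chest pain.",
     {"entities": [(23, 30, "MEDICATION")]}),
    ("Metformin controls diabetes in most patients.",
     {"entities": [(0, 9, "MEDICATION"), (19, 27, "DISEASE")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer, losses=losses)

nlp.to_disk("model-best")  # mirrors the model-best/ checkpoint above
```

With the CLI workflow, the equivalent steps are `spacy init config` followed by `spacy train`, which writes `model-best/` automatically.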
- Model Inference (spaCy)
  - Load the trained model and test it on sample medical text.
  - Visualize entities using spaCy's `displacy`.
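Inference and visualization can be sketched as below. In the notebook this would start from `spacy.load("model-best")`; here a rule-based `entity_ruler` stands in for the trained model so the snippet runs on its own, and the patterns are made up for the example.

```python
import spacy
from spacy import displacy

# Stand-in pipeline; replace with: nlp = spacy.load("model-best")
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MEDICATION", "pattern": "Warfarin"},
    {"label": "DISEASE", "pattern": "atrial fibrillation"},
])

doc = nlp("Warfarin was started for atrial fibrillation.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# outside a notebook, render returns the HTML markup as a string
html = displacy.render(doc, style="ent", jupyter=False)
```

In Colab or Jupyter, `displacy.render(doc, style="ent")` draws the highlighted entities inline instead of returning a string.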
- Transformer Model Inference
  - Load the pre-trained biomedical NER model from Hugging Face (`d4data/biomedical-ner-all`).
  - Use `pipeline("ner")` for entity extraction.
  - Aggregate subword tokens into complete entities.
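The subword-aggregation step can be sketched without loading the model. The token dicts below imitate raw `pipeline("ner")` output (the `"Drug"` label and the helper name are illustrative); note that the real pipeline can do this for you via `aggregation_strategy="simple"`.

```python
def merge_subwords(tokens):
    """Merge WordPiece fragments (##...) back into whole-word entities,
    stripping the B-/I- prefix from the entity tag."""
    entities = []
    for tok in tokens:
        word = tok["word"]
        if word.startswith("##") and entities:
            prev = entities[-1]
            prev["word"] += word[2:]   # append fragment without the ## marker
            prev["end"] = tok["end"]
        else:
            entities.append({
                "word": word,
                "entity": tok["entity"].split("-")[-1],  # B-Drug -> Drug
                "start": tok["start"],
                "end": tok["end"],
            })
    return entities

# imitation of raw pipeline("ner") output for the word "Metformin"
sample = [
    {"word": "Met", "entity": "B-Drug", "start": 0, "end": 3},
    {"word": "##formin", "entity": "I-Drug", "start": 3, "end": 10},
]
print(merge_subwords(sample))
```

In practice, `pipeline("ner", model="d4data/biomedical-ner-all", aggregation_strategy="simple")` performs equivalent merging inside the library.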
Potential applications include:
- Automated extraction of medical information from clinical notes.
- Medical record processing and analysis.
- Drug-disease relationship extraction.
- Clinical decision support systems.
The project is built on the following technologies:
- spaCy: For custom NER model training and inference.
- Transformers (Hugging Face): For pre-trained biomedical NER.
- Pandas: For data manipulation.
- PyTorch: Backend for transformer models.
- Google Colab / Local Python Environment: Development and testing.
Advantages of the pre-trained transformer approach:
- No Training Required: Ready to use without annotated data.
- Broad Coverage: Recognizes Chemicals/Drugs, Diseases, Genes, Species, Cell Types.
- High Accuracy: Transformer-based models handle context and ambiguity better than simple models.
| Aspect | Custom spaCy Model | Pre-trained Transformer |
|---|---|---|
| Training | Requires annotated data | Ready to use |
| Speed | Fast inference | Slower (but more accurate) |
| Customization | Fully customizable | Limited without fine-tuning |
| Entity types | Only what you train | Pre-defined broad set |
| Memory | Small footprint | Large (100s of MBs) |
Other pre-trained biomedical models worth exploring include:
- `dmis-lab/biobert-base-cased-v1.1`
- `allenai/scibert_scivocab_uncased`
- `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`
Each model is specialized for different biomedical domains, including clinical notes and research articles.