OHCA Classifier v3.0 - Clinical Ready

🏥 BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) in medical discharge notes

Hugging Face Model | Python 3.8+ | License: MIT


🚀 Quick Start (5 Minutes) - Use Pre-trained Model

Want to test OHCA detection immediately? No training required!

1. Install Dependencies

pip install transformers torch pandas

Option 1: Download Single Script (Easiest!)

Want a single file that does everything?

  1. Download: quick_test.py
  2. Install: pip install transformers torch pandas
  3. Run: python quick_test.py

This script will:

  • ✅ Download the model automatically
  • ✅ Test with realistic examples
  • ✅ Show threshold effects
  • ✅ Let you test your own text
  • ✅ Analyze your CSV files

Option 2: Copy-Paste Code

2. Download and Test

Create a file called test_ohca.py:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load pre-trained model (downloads automatically)
model_name = "monajm36/ohca-classifier-v3-trained"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def predict_ohca(text, threshold=0.90):  # Using practical 90% threshold
    inputs = tokenizer(text, truncation=True, padding=True, 
                      max_length=512, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        ohca_prob = probs[0][1].item()
    
    prediction = "OHCA" if ohca_prob >= threshold else "Non-OHCA"
    
    if ohca_prob >= 0.996:
        priority = "🔴 Immediate Review"
    elif ohca_prob >= 0.95:
        priority = "🔴 High Priority"
    elif ohca_prob >= 0.90:
        priority = "🟡 Priority Review"
    elif ohca_prob >= 0.80:
        priority = "🟠 Consider Review"
    else:
        priority = "🟢 Routine"
    
    return {
        "prediction": prediction,
        "probability": round(ohca_prob, 4),
        "confidence": f"{ohca_prob*100:.1f}%",
        "clinical_priority": priority
    }

# Test with realistic examples
test_cases = {
    "Clear OHCA": """HISTORY OF PRESENT ILLNESS: This is a 67-year-old male with a history of coronary artery disease who presented after out-of-hospital cardiac arrest. The patient was at home when he suddenly collapsed. His wife witnessed the event and called 911. EMS arrived and found the patient in ventricular fibrillation. CPR was initiated immediately with defibrillation. Return of spontaneous circulation was achieved after 15 minutes.""",
    
    "Non-OHCA": """HISTORY OF PRESENT ILLNESS: This is a 45-year-old female presenting with acute onset chest pain. The patient was at work when she developed sudden onset substernal chest pain, described as pressure-like, 8/10 in intensity. No loss of consciousness. Vital signs stable on arrival."""
}

print("🏥 Testing OHCA Classifier")
print("=" * 50)

for case_name, text in test_cases.items():
    result = predict_ohca(text)
    print(f"🔍 {case_name}")
    print(f"   Prediction: {result['prediction']}")
    print(f"   Confidence: {result['confidence']}")
    print(f"   Priority: {result['clinical_priority']}")
    print()

3. Run the Test

python test_ohca.py

Expected Output:

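  • OHCA case: ~98% confidence, High Priority (0.98 falls in the 0.95 to 0.996 band of predict_ohca)
  • Non-OHCA case: ~63% confidence, Routine (the reported confidence is the OHCA probability, which is below the 90% threshold)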
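
Note that predict_ohca above truncates its input at 512 tokens, so text beyond that point in a long discharge note is never scored. One common workaround, which is not confirmed to be part of this repository, is to split the note into overlapping windows, score each window, and keep the highest probability. The sketch below shows only the windowing logic. The helper names split_into_windows and max_window_probability are hypothetical, and score_fn stands in for any function that returns an OHCA probability for one window of tokens.

```python
from typing import Callable, List, Sequence


def split_into_windows(tokens: Sequence[str], window: int = 510,
                       stride: int = 255) -> List[List[str]]:
    """Split a token sequence into overlapping windows (hypothetical helper).

    The default of 510 assumes the caller passes model tokens and wants to
    leave room for the two special tokens BERT adds to each input.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window reached the end of the note
        start += stride
    return windows


def max_window_probability(tokens: Sequence[str],
                           score_fn: Callable[[List[str]], float],
                           window: int = 510, stride: int = 255) -> float:
    """Score every window and return the highest OHCA probability."""
    return max(score_fn(w) for w in split_into_windows(tokens, window, stride))


# Toy scorer for illustration only; a real scorer would call the model.
def toy_score(window_tokens: List[str]) -> float:
    return 0.99 if "arrest" in window_tokens else 0.05


note = ["stable"] * 600 + ["cardiac", "arrest", "at", "home"]
print(len(split_into_windows(note)), "windows")
print("max probability:", max_window_probability(note, toy_score))
```

Taking the maximum favors sensitivity: a single window that mentions an arrest is enough to flag the note. Averaging the window scores would be a more conservative choice.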

📊 Analyze Your Data

Process Your Discharge Notes CSV

import pandas as pd

def analyze_discharge_notes(csv_file, text_column='clean_text', threshold=0.90):
    """Score every note in a CSV; text_column names the column holding the note text"""
    
    # Load data
    df = pd.read_csv(csv_file)
    print(f"📋 Loaded {len(df)} records")
    
    # Analyze each note
    results = []
    for idx, text in enumerate(df[text_column]):
        if idx % 100 == 0:
            print(f"   Processed {idx}/{len(df)}...")
            
        result = predict_ohca(str(text), threshold)
        results.append(result)
    
    # Add results to your data
    df['ohca_prediction'] = [r['prediction'] for r in results]
    df['ohca_probability'] = [r['probability'] for r in results] 
    df['clinical_priority'] = [r['clinical_priority'] for r in results]
    
    # Save results
    output_file = "ohca_analysis_results.csv"
    df.to_csv(output_file, index=False)
    
    # Clinical summary
    total = len(df)
    ohca_cases = len(df[df['ohca_prediction'] == 'OHCA'])
    immediate = len(df[df['clinical_priority'].str.contains('Immediate')])
    ohca_pct = ohca_cases / total * 100 if total else 0.0  # avoid dividing by zero on an empty file
    
    print(f"\n🏥 SUMMARY:")
    print(f"   Total cases: {total:,}")
    print(f"   Predicted OHCA: {ohca_cases:,} ({ohca_pct:.1f}%)")
    print(f"   🔴 Immediate review: {immediate:,}")
    print(f"   📁 Results saved: {output_file}")
    
    return df

# Usage
results = analyze_discharge_notes('your_data.csv', threshold=0.90)

Your CSV just needs:

  • Text column with discharge notes
  • Any column name works (adjust text_column parameter)

⚠️ Important: Choose Your Threshold

The model was trained with a 99.6% threshold, but this may be too conservative:

# Test different thresholds on your data
text = "Your discharge note here..."
thresholds = [0.996, 0.95, 0.90, 0.85]

for threshold in thresholds:
    result = predict_ohca(text, threshold)
    print(f"{threshold*100:.1f}%: {result['prediction']} ({result['confidence']})")

Recommended thresholds:

  • 90%: Good balance for clinical screening (recommended)
  • 95%: More conservative, fewer false positives
  • 99.6%: Ultra-conservative (original), may miss obvious cases
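
To see what a threshold choice means for review workload, it can help to count how many notes each threshold would flag. The sketch below assumes you already have a list of OHCA probabilities, for example the probability values returned by predict_ohca. The helper name flagged_counts is hypothetical and the probabilities are made-up illustrative values, not model output.

```python
from typing import Dict, Iterable


def flagged_counts(probabilities: Iterable[float],
                   thresholds: Iterable[float]) -> Dict[float, int]:
    """Count how many probabilities meet or exceed each threshold."""
    probs = list(probabilities)
    return {t: sum(p >= t for p in probs) for t in thresholds}


# Illustrative values only
example_probs = [0.999, 0.97, 0.93, 0.88, 0.63, 0.12]

counts = flagged_counts(example_probs, [0.996, 0.95, 0.90, 0.85])
for threshold, count in counts.items():
    print(f"threshold {threshold:.3f}: {count} of {len(example_probs)} notes flagged as OHCA")
```

Lowering the threshold can only add flagged notes, never remove them, so the counts show directly how much extra review each step down would cost.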

🏥 Clinical Workflow

Recommended Process

  1. Batch analyze all discharge notes

  2. Triage by priority:

    • 🔴 Immediate/High Priority: Medical review within 24h
    • 🟡 Priority Review: Clinical team review within 48h
    • 🟠 Consider Review: Weekly review process
    • 🟢 Routine: Standard processing
  3. Quality assurance: Validate on sample of your data
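
The triage step above can be sketched as a small lookup table that pairs each probability band used by predict_ohca with its review window. The names PRIORITY_TIERS, triage, and worklist_summary are hypothetical. The cutoffs and review windows are copied from this README, and the sample probabilities are illustrative.

```python
from collections import Counter
from typing import Iterable, Tuple

# (minimum probability, priority label, review window), highest tier first
PRIORITY_TIERS = [
    (0.996, "Immediate Review", "medical review within 24h"),
    (0.95, "High Priority", "medical review within 24h"),
    (0.90, "Priority Review", "clinical team review within 48h"),
    (0.80, "Consider Review", "weekly review process"),
    (0.0, "Routine", "standard processing"),
]


def triage(probability: float) -> Tuple[str, str]:
    """Return (priority label, review window) for one OHCA probability."""
    for cutoff, label, action in PRIORITY_TIERS:
        if probability >= cutoff:
            return label, action
    # Fallback for unexpected input such as a negative value
    return "Routine", "standard processing"


def worklist_summary(probabilities: Iterable[float]) -> Counter:
    """Count how many notes fall into each priority tier."""
    return Counter(triage(p)[0] for p in probabilities)


# Illustrative probabilities only
summary = worklist_summary([0.999, 0.98, 0.91, 0.85, 0.20, 0.10])
for _, label, action in PRIORITY_TIERS:
    print(f"{label}: {summary[label]} notes ({action})")
```

Keeping the tiers in one table means the cutoffs and the review windows are changed in a single place if your site adopts different bands.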

Large Dataset Processing

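def process_large_dataset(csv_file, chunk_size=1000, text_column='clean_text', threshold=0.90):
    """Read a large CSV in chunks of chunk_size rows and score each chunk"""
    # Requires `import pandas as pd` and predict_ohca from the sections above
    chunk_results = []
    
    for chunk_num, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
        print(f"Processing chunk {chunk_num + 1}...")
        
        # Score each note in the chunk; text_column names the column with the note text
        results = [predict_ohca(str(text), threshold) for text in chunk[text_column]]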
        
        # Add predictions to chunk
        chunk['ohca_prediction'] = [r['prediction'] for r in results]
        chunk['clinical_priority'] = [r['clinical_priority'] for r in results]
        chunk_results.append(chunk)
    
    # Combine and save
    final_results = pd.concat(chunk_results, ignore_index=True)
    final_results.to_csv('large_dataset_results.csv', index=False)
    return final_results
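
The function above reads the input in chunks but still keeps every processed chunk in memory until the final concat. If the output itself is too large for that, one alternative is to write each row as soon as it is scored. The sketch below does this with the standard-library csv module rather than pandas. The name stream_predictions is hypothetical, and predict_fn is assumed to be any function that returns a dictionary shaped like the one from predict_ohca.

```python
import csv
from typing import Callable, Dict


def stream_predictions(input_csv: str, output_csv: str,
                       predict_fn: Callable[[str], Dict],
                       text_column: str = "clean_text") -> int:
    """Score a CSV row by row and write results immediately.

    Returns the number of rows written.
    """
    rows_written = 0
    with open(input_csv, newline="", encoding="utf-8") as src, \
         open(output_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        if reader.fieldnames is None or text_column not in reader.fieldnames:
            raise KeyError(f"column {text_column!r} not found in {input_csv}")

        fieldnames = list(reader.fieldnames) + [
            "ohca_prediction", "ohca_probability", "clinical_priority"]
        # extrasaction="ignore" drops stray fields from malformed rows
        writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()

        for row in reader:
            result = predict_fn(row[text_column] or "")
            row["ohca_prediction"] = result["prediction"]
            row["ohca_probability"] = result["probability"]
            row["clinical_priority"] = result["clinical_priority"]
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

A call would look like stream_predictions('your_data.csv', 'large_dataset_results.csv', predict_ohca). Only one row is held in memory at a time, at the cost of losing the DataFrame returned by process_large_dataset.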

📁 Repository Contents

This repository contains:

🎯 For Immediate Use:

  • Pre-trained model: Available on Hugging Face
  • Quick test scripts: Copy-paste examples above
  • Batch processing: Analyze large datasets

🔧 For Advanced Users:

  • Training pipeline: Train custom models on your data
  • Methodology improvements: Patient-level splits, optimal thresholds
  • Research tools: Complete development workflow

📂 Structure:

ohca-classifier-3.0/
├── src/                     # Core training modules
├── scripts/                 # User-friendly scripts
├── examples/                # Usage examples
├── docs/                    # Documentation
└── requirements.txt         # Dependencies

🔬 Model Details

  • Base: PubMedBERT (medical text optimized)
  • Task: Binary classification (OHCA vs Non-OHCA)
  • Training: 330 MIMIC-III discharge notes
  • Performance: 100% sensitivity, 74% specificity (at 99.6% threshold)
  • Validation: Patient-level splits prevent data leakage
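
A patient-level split means that all notes from one patient land on the same side of the train/validation boundary, so the model is never validated on a patient it saw during training. The sketch below illustrates the idea and is not the repository's implementation. The name patient_level_split is hypothetical, and records are assumed to be dictionaries carrying a subject_id key, matching the column the training pipeline expects.

```python
import random
from typing import Dict, List, Tuple


def patient_level_split(records: List[Dict], val_fraction: float = 0.2,
                        seed: int = 42,
                        patient_key: str = "subject_id") -> Tuple[List[Dict], List[Dict]]:
    """Split records so that no patient appears in both train and validation."""
    # Shuffle patients, not notes, so each patient's notes stay together
    patients = sorted({r[patient_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)

    n_val = max(1, round(len(patients) * val_fraction)) if patients else 0
    val_patients = set(patients[:n_val])

    train = [r for r in records if r[patient_key] not in val_patients]
    val = [r for r in records if r[patient_key] in val_patients]
    return train, val


# Five illustrative patients with two admissions each
records = [{"subject_id": pid, "hadm_id": f"{pid}-{n}"}
           for pid in range(5) for n in range(2)]
train, val = patient_level_split(records)
print(len(train), "training notes,", len(val), "validation notes")
```

A note-level random split on the same data could place one admission of a patient in training and another in validation, which tends to inflate validation scores.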

🚨 Important Considerations

Clinical Use

  • Screening tool: Assists, doesn't replace clinical judgment
  • Validation recommended: Test performance on your specific data
  • Human oversight: All predictions should be clinically reviewed
  • HIPAA compliance: Ensure proper data handling

Limitations

  • English medical text only
  • Trained on specific documentation style
  • Performance may vary across different hospital systems
  • Text-based analysis only

🚀 Advanced: Train Your Own Model

If the pre-trained model doesn't work well on your data, you can train a custom version:

Installation for Training

git clone https://github.com/monajm36/ohca-classifier-3.0.git
cd ohca-classifier-3.0
pip install -r requirements.txt
pip install -e .

Training Process

from src.ohca_training_pipeline import complete_improved_training_pipeline

# Create training samples (requires manual annotation)
results = complete_improved_training_pipeline(
    data_path="your_discharge_notes.csv",  # needs: hadm_id, subject_id, clean_text
    annotation_dir="./annotation_v3",
    train_sample_size=800,
    val_sample_size=200
)

# Then manually annotate the Excel files generated
# Finally, train the model (see full documentation in examples/)

Note: Training requires manually labeling 800-1000 discharge notes. Most users should start with the pre-trained model.


📊 Performance & Validation

Benchmark Performance

  • AUC-ROC: 0.85-0.95
  • Sensitivity: 85-95% (threshold dependent)
  • Specificity: 85-95% (threshold dependent)
  • F1-Score: 0.7-0.9

Validate on Your Data

def validate_model(labeled_test_data_csv):
    """Test model performance on your labeled data"""
    df = pd.read_csv(labeled_test_data_csv)  # needs: text, true_label columns
    
    correct = 0
    total = len(df)
    
    for _, row in df.iterrows():
        result = predict_ohca(row['text'], threshold=0.90)
        predicted = 1 if result['prediction'] == 'OHCA' else 0
        if predicted == row['true_label']:
            correct += 1
    
    accuracy = correct / total
    print(f"Accuracy on your data: {accuracy:.3f}")
    return accuracy
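
Accuracy alone can be misleading if OHCA cases are a small share of your notes, because a model that never predicts OHCA would still score well. Sensitivity and specificity, the metrics listed under Benchmark Performance, are usually more informative for a screening tool. The sketch below computes them from true and predicted labels (1 for OHCA, 0 for Non-OHCA). The name screening_metrics is hypothetical and the labels are illustrative.

```python
from typing import Dict, Sequence


def screening_metrics(true_labels: Sequence[int],
                      predicted_labels: Sequence[int]) -> Dict[str, float]:
    """Compute sensitivity, specificity, precision, and F1 from binary labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios (for example, no positive cases) are reported as NaN
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # share of true OHCA cases caught
        "specificity": ratio(tn, tn + fp),   # share of Non-OHCA cases cleared
        "precision": ratio(tp, tp + fp),     # share of flagged notes that are OHCA
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Illustrative labels only
true_labels = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
for name, value in screening_metrics(true_labels, predicted).items():
    print(f"{name}: {value:.3f}")
```

The predicted labels can be collected inside the loop of validate_model in the same way the accuracy counter is updated.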

🤝 Support & Contributing

Getting Help

  • 🐛 Issues: GitHub Issues
  • 💬 Questions: GitHub Discussions
  • 📖 Documentation: Check examples/ folder

Contributing

  1. Fork the repository
  2. Create feature branch
  3. Test your changes
  4. Submit pull request

📚 Citation & License

Citation

@software{ohca_classifier_v3,
  title={OHCA Classifier v3.0: Clinical-Ready BERT for Cardiac Arrest Detection},
  author={Mona Moukaddem},
  year={2025},
  url={https://github.com/monajm36/ohca-classifier-3.0}
}

License

MIT License - Free for clinical and research use


🎉 Quick Links

  • 🤗 Try the Model on Hugging Face
  • 📋 Copy-paste the test script above to get started
  • 📊 Process your data with the batch analysis code
  • 🔧 Advanced users: Explore training pipeline in src/

Ready to detect OHCA cases? Start with the Quick Start section above! 🚀
