🏥 BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) in medical discharge notes
Want to test OHCA detection immediately? No training required!

pip install transformers torch pandas

Want a single file that does everything?
- Download: quick_test.py
- Install: pip install transformers torch pandas
- Run: python quick_test.py
This script will:
- ✅ Download the model automatically
- ✅ Test with realistic examples
- ✅ Show threshold effects
- ✅ Let you test your own text
- ✅ Analyze your CSV files
Create a file called test_ohca.py:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load pre-trained model (downloads automatically)
model_name = "monajm36/ohca-classifier-v3-trained"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def predict_ohca(text, threshold=0.90):  # Using practical 90% threshold
    # Notes longer than 512 tokens are truncated
    inputs = tokenizer(text, truncation=True, padding=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
    ohca_prob = probs[0][1].item()  # index 1 is the OHCA class
    prediction = "OHCA" if ohca_prob >= threshold else "Non-OHCA"
    # The priority tier depends only on the probability, not on the threshold argument
    if ohca_prob >= 0.996:
        priority = "🔴 Immediate Review"
    elif ohca_prob >= 0.95:
        priority = "🔴 High Priority"
    elif ohca_prob >= 0.90:
        priority = "🟡 Priority Review"
    elif ohca_prob >= 0.80:
        priority = "🟠 Consider Review"
    else:
        priority = "🟢 Routine"
    return {
        "prediction": prediction,
        "probability": round(ohca_prob, 4),
        "confidence": f"{ohca_prob*100:.1f}%",
        "clinical_priority": priority
    }

# Test with realistic examples
test_cases = {
    "Clear OHCA": """HISTORY OF PRESENT ILLNESS: This is a 67-year-old male with a history of coronary artery disease who presented after out-of-hospital cardiac arrest. The patient was at home when he suddenly collapsed. His wife witnessed the event and called 911. EMS arrived and found the patient in ventricular fibrillation. CPR was initiated immediately with defibrillation. Return of spontaneous circulation was achieved after 15 minutes.""",
    "Non-OHCA": """HISTORY OF PRESENT ILLNESS: This is a 45-year-old female presenting with acute onset chest pain. The patient was at work when she developed sudden onset substernal chest pain, described as pressure-like, 8/10 in intensity. No loss of consciousness. Vital signs stable on arrival."""
}
print("🏥 Testing OHCA Classifier")
print("=" * 50)
for case_name, text in test_cases.items():
    result = predict_ohca(text)
    print(f"{case_name}")
    print(f" Prediction: {result['prediction']}")
    print(f" Confidence: {result['confidence']}")
    print(f" Priority: {result['clinical_priority']}")
    print()

Run it with:

python test_ohca.py

Expected Output (the reported confidence is the OHCA probability):
- OHCA case: ~98% confidence, High Priority
- Non-OHCA case: ~63% confidence (below the 90% threshold), Routine
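
The priority label depends only on the OHCA probability, so the tier boundaries can be checked without downloading the model. The sketch below mirrors the cut-offs used in predict_ohca; the name priority_tier and the plain-text labels (emoji omitted) are illustrative assumptions, not part of the repository.

```python
# Tier boundaries copied from predict_ohca; checked from highest to lowest.
PRIORITY_TIERS = [
    (0.996, "Immediate Review"),
    (0.95, "High Priority"),
    (0.90, "Priority Review"),
    (0.80, "Consider Review"),
]

def priority_tier(ohca_prob: float) -> str:
    """Return the review tier for an OHCA probability in [0, 1]."""
    for cutoff, label in PRIORITY_TIERS:
        if ohca_prob >= cutoff:
            return label
    return "Routine"

for p in (0.98, 0.63):
    print(f"{p:.2f} -> {priority_tier(p)}")
```

A probability of 0.98 falls between 0.95 and 0.996, so it maps to High Priority; 0.63 is below every cut-off and maps to Routine.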

import pandas as pd

def analyze_discharge_notes(csv_file, text_column='clean_text', threshold=0.90):
    """Analyze your discharge notes from any CSV that has a text column"""
    # Load data
    df = pd.read_csv(csv_file)
    print(f"Loaded {len(df)} records")
    # Analyze each note (uses predict_ohca defined above)
    results = []
    for idx, text in enumerate(df[text_column]):
        if idx % 100 == 0:
            print(f" Processed {idx}/{len(df)}...")
        result = predict_ohca(str(text), threshold)
        results.append(result)
    # Add results to your data
    df['ohca_prediction'] = [r['prediction'] for r in results]
    df['ohca_probability'] = [r['probability'] for r in results]
    df['clinical_priority'] = [r['clinical_priority'] for r in results]
    # Save results
    output_file = "ohca_analysis_results.csv"
    df.to_csv(output_file, index=False)
    # Clinical summary
    total = len(df)
    ohca_cases = len(df[df['ohca_prediction'] == 'OHCA'])
    immediate = len(df[df['clinical_priority'].str.contains('Immediate')])
    ohca_pct = ohca_cases / total * 100 if total else 0.0  # avoid dividing by zero on an empty file
    print(f"\n🏥 SUMMARY:")
    print(f" Total cases: {total:,}")
    print(f" Predicted OHCA: {ohca_cases:,} ({ohca_pct:.1f}%)")
    print(f" 🔴 Immediate review: {immediate:,}")
    print(f" Results saved: {output_file}")
    return df

# Usage
results = analyze_discharge_notes('your_data.csv', threshold=0.90)

Your CSV just needs:
- A text column with discharge notes
- Any column name works (adjust the text_column parameter)
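
Before starting a long batch run, it can help to confirm that the text column exists and to count how many rows are empty. This is a minimal sketch using only the standard library; the helper name check_text_column is hypothetical, and clean_text is simply the default column name used above.

```python
import csv
import io

def check_text_column(csv_file, text_column="clean_text"):
    """Return (total_rows, empty_rows) for text_column, or raise if it is missing.

    csv_file is any open text-mode file object.
    """
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    if text_column not in columns:
        raise KeyError(f"Column {text_column!r} not found; available: {columns}")
    total = empty = 0
    for row in reader:
        total += 1
        # Missing trailing fields come back as None, so normalise before stripping
        if not (row[text_column] or "").strip():
            empty += 1
    return total, empty

# Tiny in-memory example standing in for a real file
sample = io.StringIO("hadm_id,clean_text\n1,Patient collapsed at home.\n2,\n")
print(check_text_column(sample))
```

With a real file, open it with open('your_data.csv', newline='', encoding='utf-8') and pass the file object in. Empty rows matter because the batch code converts every value with str(), so a missing note would be scored as the literal text "nan".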
The model's original decision threshold is 99.6%, but this may be too conservative in practice:

# Test different thresholds on your data
text = "Your discharge note here..."
thresholds = [0.996, 0.95, 0.90, 0.85]
for threshold in thresholds:
    result = predict_ohca(text, threshold)
    print(f"{threshold*100:.1f}%: {result['prediction']} ({result['confidence']})")

Recommended thresholds:
- 90%: Good balance for clinical screening (recommended)
- 95%: More conservative, fewer false positives
- 99.6%: Ultra-conservative (original), may miss obvious cases
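
If you have a small labeled validation set, one way to choose among these thresholds is to take the highest one that still reaches a target sensitivity. The sketch below assumes you already have OHCA probabilities (for example from predict_ohca) and true labels; the function names and the toy numbers are illustrative only and are not results from the model.

```python
def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity when probs >= threshold is called OHCA."""
    tp = sum(p >= threshold and y == 1 for p, y in zip(probs, labels))
    fn = sum(p < threshold and y == 1 for p, y in zip(probs, labels))
    tn = sum(p < threshold and y == 0 for p, y in zip(probs, labels))
    fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

def highest_threshold_meeting(probs, labels, candidates, min_sensitivity):
    """Highest candidate whose sensitivity is at least min_sensitivity, else None."""
    for t in sorted(candidates, reverse=True):
        sens, _ = sensitivity_specificity(probs, labels, t)
        if sens >= min_sensitivity:
            return t
    return None

# Toy data (made up for illustration)
probs = [0.999, 0.97, 0.92, 0.85, 0.40, 0.10]
labels = [1, 1, 1, 0, 0, 0]
candidates = [0.996, 0.95, 0.90, 0.85]
for t in candidates:
    print(t, sensitivity_specificity(probs, labels, t))
print("chosen:", highest_threshold_meeting(probs, labels, candidates, 1.0))
```

On this toy data the 99.6% threshold catches only one of the three positive cases, while 90% catches all three without adding a false positive, so 0.90 is chosen.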
- Batch analyze all discharge notes
- Triage by priority:
  - 🔴 Immediate/High Priority: Medical review within 24h
  - 🟡 Priority Review: Clinical team review within 48h
  - 🟠 Consider Review: Weekly review process
  - 🟢 Routine: Standard processing
- Quality assurance: Validate on a sample of your data
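
For the quality-assurance step, one option is to draw a small, reproducible random sample from each predicted class for manual chart review. This is a sketch under that assumption; the function name qa_sample, the note_id field, and the record layout (dicts with a prediction key, as returned by predict_ohca) are illustrative.

```python
import random
from collections import defaultdict

def qa_sample(records, per_class=2, seed=0):
    """Pick up to per_class records from each predicted class for manual review."""
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec["prediction"]].append(rec)
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    sample = []
    for label in sorted(by_class):
        group = by_class[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample

# Placeholder records; in practice these come from the batch analysis
records = [{"note_id": i, "prediction": "OHCA" if i % 4 == 0 else "Non-OHCA"}
           for i in range(12)]
for rec in qa_sample(records):
    print(rec)
```

Sampling from the predicted Non-OHCA group as well as the OHCA group matters here: reviewing only flagged notes can reveal false positives, but missed cases can only be found among the notes the model did not flag.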

def process_large_dataset(csv_file, chunk_size=1000, text_column='clean_text'):
    """Process very large datasets efficiently by reading the CSV in chunks"""
    chunk_results = []
    for chunk_num, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
        print(f"Processing chunk {chunk_num + 1}...")
        # Process chunk (same as above); text_column must match your CSV
        results = [predict_ohca(str(text)) for text in chunk[text_column]]
        # Add predictions to chunk
        chunk['ohca_prediction'] = [r['prediction'] for r in results]
        chunk['clinical_priority'] = [r['clinical_priority'] for r in results]
        chunk_results.append(chunk)
    # Combine and save
    final_results = pd.concat(chunk_results, ignore_index=True)
    final_results.to_csv('large_dataset_results.csv', index=False)
    return final_results

This repository contains:
- Pre-trained model: Available on Hugging Face
- Quick test scripts: Copy-paste examples above
- Batch processing: Analyze large datasets
- Training pipeline: Train custom models on your data
- Methodology improvements: Patient-level splits, optimal thresholds
- Research tools: Complete development workflow
ohca-classifier-3.0/
├── src/               # Core training modules
├── scripts/           # User-friendly scripts
├── examples/          # Usage examples
├── docs/              # Documentation
└── requirements.txt   # Dependencies
- Base: PubMedBERT (medical text optimized)
- Task: Binary classification (OHCA vs Non-OHCA)
- Training: 330 MIMIC-III discharge notes
- Performance: 100% sensitivity, 74% specificity (at 99.6% threshold)
- Validation: Patient-level splits prevent data leakage
- Screening tool: Assists, doesn't replace clinical judgment
- Validation recommended: Test performance on your specific data
- Human oversight: All predictions should be clinically reviewed
- HIPAA compliance: Ensure proper data handling
- English medical text only
- Trained on specific documentation style
- Performance may vary across different hospital systems
- Text-based analysis only
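
The patient-level split mentioned in the model details keeps every note from the same patient on one side of the train/validation boundary, so the model is never evaluated on a patient it saw during training. The following is a minimal sketch of that idea, assuming each note carries a subject_id; it is not the repository's actual implementation, and patient_level_split is a hypothetical name.

```python
import random

def patient_level_split(notes, val_fraction=0.2, seed=42):
    """Split notes into (train, val) so that no subject_id appears in both."""
    patients = sorted({n["subject_id"] for n in notes})
    rng = random.Random(seed)
    rng.shuffle(patients)  # shuffle patients, not individual notes
    n_val = max(1, round(len(patients) * val_fraction))
    val_patients = set(patients[:n_val])
    train = [n for n in notes if n["subject_id"] not in val_patients]
    val = [n for n in notes if n["subject_id"] in val_patients]
    return train, val

# Toy data: 5 patients with 2 notes each
notes = [{"hadm_id": 100 + i, "subject_id": i // 2} for i in range(10)]
train, val = patient_level_split(notes)
train_ids = {n["subject_id"] for n in train}
val_ids = {n["subject_id"] for n in val}
print(len(train), len(val), train_ids & val_ids)  # prints: 8 2 set()
```

Splitting by note instead of by patient could place two admissions of the same patient on opposite sides of the split, which tends to inflate validation scores.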
If the pre-trained model doesn't work well on your data, you can train a custom version:

git clone https://github.com/monajm36/ohca-classifier-v3.0.git ohca-classifier-3.0
cd ohca-classifier-3.0
pip install -r requirements.txt
pip install -e .

from src.ohca_training_pipeline import complete_improved_training_pipeline

# Create training samples (requires manual annotation)
results = complete_improved_training_pipeline(
    data_path="your_discharge_notes.csv",  # needs: hadm_id, subject_id, clean_text
    annotation_dir="./annotation_v3",
    train_sample_size=800,
    val_sample_size=200
)
# Then manually annotate the Excel files generated
# Finally, train the model (see full documentation in examples/)

Note: Training requires manually labeling 800-1000 discharge notes. Most users should start with the pre-trained model.
- AUC-ROC: 0.85-0.95
- Sensitivity: 85-95% (threshold dependent)
- Specificity: 85-95% (threshold dependent)
- F1-Score: 0.7-0.9
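
Two of the metrics listed above, F1 and AUC-ROC, can be computed on your own labeled data with a few lines of plain Python. This is a sketch under the assumption that you have OHCA probabilities and 0/1 labels; the function names and toy numbers are illustrative and are not the model's reported results.

```python
def f1_score(preds, labels):
    """F1 = 2*TP / (2*TP + FP + FN) for binary 0/1 predictions and labels."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auc_roc(probs, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (made up for illustration)
probs = [0.99, 0.93, 0.60, 0.95, 0.30, 0.05]
labels = [1, 1, 1, 0, 0, 0]
preds = [1 if p >= 0.90 else 0 for p in probs]
print(f"F1:  {f1_score(preds, labels):.3f}")   # 0.667
print(f"AUC: {auc_roc(probs, labels):.3f}")    # 0.778
```

The pairwise AUC loop is quadratic in the number of examples, so for larger validation sets a library routine such as scikit-learn's roc_auc_score is more practical.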

def validate_model(labeled_test_data_csv):
    """Test model performance on your labeled data"""
    # Assumes pandas (pd) and predict_ohca from the sections above
    df = pd.read_csv(labeled_test_data_csv)  # needs: text, true_label columns
    correct = 0
    total = len(df)
    for _, row in df.iterrows():
        result = predict_ohca(str(row['text']), threshold=0.90)
        predicted = 1 if result['prediction'] == 'OHCA' else 0
        if predicted == row['true_label']:
            correct += 1
    accuracy = correct / total if total else 0.0  # avoid dividing by zero on an empty file
    print(f"Accuracy on your data: {accuracy:.3f}")
    return accuracy

- Issues: GitHub Issues
- Questions: GitHub Discussions
- Documentation: Check the examples/ folder
- Fork the repository
- Create feature branch
- Test your changes
- Submit pull request
@software{ohca_classifier_v3,
title={OHCA Classifier v3.0: Clinical-Ready BERT for Cardiac Arrest Detection},
author={Mona Moukaddem},
year={2025},
url={https://github.com/monajm36/ohca-classifier-3.0}
}

MIT License - Free for clinical and research use
- Try the Model on Hugging Face
- Copy-paste the test script above to get started
- Process your data with the batch analysis code
- Advanced users: Explore the training pipeline in src/
Ready to detect OHCA cases? Start with the Quick Start section above!