End-to-End Machine Learning Suite for Antimicrobial Resistance & Drug Discovery
A comprehensive collection of production-ready ML projects tackling critical challenges in infectious disease and drug development, from resistance prediction to generative drug design and automated diagnostics.
Classification Model for Antibiotic Resistance Detection
- Task: Binary classification (Susceptible vs Resistant)
- Model: Random Forest Classifier
- Accuracy: 94.9% | Sensitivity: 93.9% | Specificity: 95.9%
- Data: 4,383 E. coli isolates from NCBI
- App: `streamlit run src/app.py`
Regression Model for Antimicrobial Peptide Potency Prediction
- Task: MIC (Minimum Inhibitory Concentration) prediction
- Model: Random Forest Regressor
- R² Score: 0.9992 | RMSE: 0.024 log units
- Data: 3,143 E. coli isolates with MIC values
- App: `streamlit run src/app_MIC.py`
Generative AI for Antimicrobial Peptide Design
- Task: Generate novel peptide sequences (generative modeling)
- Model: 2-Layer LSTM (PyTorch) - Character-level RNN
- Performance: Loss 0.8541 | Generates realistic AMP sequences
- Data: 2,872 E. coli peptides (10-50 AA length)
- Training: ~10 min CPU / ~2 min GPU | 50 epochs
- Status: ✅ Fully trained, ready for inference
- Use: Computational screening, rational design, drug discovery
1D ResNet for Multi-label Antimicrobial Resistance Prediction from Mass Spectrometry
- Task: Multi-label classification (10 antibiotics)
- Model: ResNet-1D (2M parameters) - Deep CNN with residual blocks
- Architecture: Conv1D → 4 ResBlock stages → Global AvgPool → FC → Sigmoid
- Input: MALDI-TOF mass spectra (6000 m/z bins)
- Loss: BCEWithLogitsLoss with pos_weight (handles class imbalance)
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-5)
- Metrics: AUPRC, AUROC tracked via TensorBoard
- Training: 20 epochs with automatic best model checkpointing
- Features: Flexible model sizes (small/medium/large), feature extraction
- Documentation: See src/README.md for detailed architecture
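The `pos_weight` argument of `BCEWithLogitsLoss` takes one weight per label, and a common choice is the ratio of negative to positive examples for that label. The sketch below shows that computation in plain Python on a made-up label matrix (three antibiotics instead of ten). The function name and the fallback value are assumptions for illustration; the values actually used in `src/train.py` are not shown in this README.

```python
from typing import List

def compute_pos_weight(labels: List[List[int]]) -> List[float]:
    """Per-label ratio of negatives to positives for a binary label matrix.

    labels[i][j] is 1 if isolate i is resistant to antibiotic j, else 0.
    """
    n_samples = len(labels)
    n_labels = len(labels[0])
    weights = []
    for j in range(n_labels):
        n_pos = sum(row[j] for row in labels)
        n_neg = n_samples - n_pos
        # Assumed fallback: use 1.0 when a label has no positive examples.
        weights.append(n_neg / n_pos if n_pos > 0 else 1.0)
    return weights

# Hypothetical toy data: 4 isolates x 3 antibiotics.
toy_labels = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
]
pos_weight = compute_pos_weight(toy_labels)
print(pos_weight)
```

Rare resistant labels receive weights above 1 and common ones receive weights below 1. In PyTorch the list would typically be wrapped in a tensor and passed as `pos_weight=` when the loss is constructed.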
Challenge: Antibiotic-resistant bacteria cause ~1.3M deaths annually (WHO). Traditional lab testing takes 24-48 hours, delaying treatment.
Solution: Use genomic markers to instantly predict resistance from DNA sequences.
Challenge: Designing potent peptides requires expensive lab screening. Potency varies wildly (MIC: 0.1 - 1000+ µM).
Solution: Use machine learning to predict peptide efficacy and generate new candidates from physicochemical properties and sequence patterns.
Challenge: Design space for peptides is massive (20^50 ≈ 10^65 possibilities for 50-length sequences). Manual screening is infeasible.
Solution: Train generative AI to learn natural peptide patterns and create novel, biologically plausible sequences for experimental validation.
Challenge: MALDI-TOF mass spectrometry is fast (minutes) but requires expert interpretation. Multi-drug resistance requires testing 10+ antibiotics.
Solution: Train deep neural networks to directly predict resistance profiles from raw mass spectra, enabling instant multi-drug diagnostics.
ML-Training/
├── projects/
│   ├── cefixime-resistance-training/  # Antibiotic resistance classifier
│   │   ├── data/
│   │   │   ├── raw/  # Original NCBI isolates
│   │   │   └── processed/  # Cleaned genotype data
│   │   ├── src/
│   │   │   ├── process.py  # Data preprocessing
│   │   │   └── train.py  # Model training (RF classifier)
│   │   ├── models/
│   │   │   └── ceftriaxone_model.pkl  # Trained classifier
│   │   └── results/
│   │       ├── confusion_matrix.html  # Interactive CM
│   │       └── feature_importance.csv  # Top resistance genes
│   │
│   ├── MIC Regression/  # Peptide potency regressor
│   │   ├── data/
│   │   │   ├── raw/  # Raw peptide sequences & MIC values
│   │   │   └── processed/  # Computed physicochemical features
│   │   ├── src/
│   │   │   ├── process.py  # Data preprocessing
│   │   │   └── train.py  # Model training (RF regressor)
│   │   ├── models/
│   │   │   └── mic_predictor.pkl  # Trained regressor
│   │   └── results/
│   │       ├── predicted_vs_actual.png  # Predictions visualization
│   │       └── feature_importance.png  # Top peptide features
│   │
│   └── week4_peptide_generator/  # Generative LSTM
│       ├── data/
│       │   └── ecolitraining_set_80.csv  # 2,872 E. coli peptides
│       ├── models/
│       │   ├── peptide_lstm.pth  # Best model (loss: 0.854)
│       │   └── config.json  # Training hyperparameters
│       ├── src/
│       │   ├── vocab.py  # PeptideVocab: AA tokenization
│       │   └── train_generator.py  # PyTorch LSTM training
│       └── README.md
│
├── src/  # 🆕 DeepG2P Model & Apps
│   ├── model.py  # ResNet-1D architecture (DeepG2P, ResidualBlock)
│   ├── train.py  # Training pipeline (BCEWithLogitsLoss, AdamW)
│   ├── app.py  # Ceftriaxone classifier Streamlit app
│   ├── app_MIC.py  # MIC regressor Streamlit app
│   ├── features.py  # Biopython feature extraction
│   └── README.md  # DeepG2P documentation
│
├── models/  # 🆕 Saved model checkpoints
│   ├── best_model.pth  # Best validation loss checkpoint
│   └── checkpoint_epoch_*.pth  # Periodic training checkpoints
│
├── results/  # 🆕 Training outputs
│   ├── logs/  # TensorBoard logs
│   └── training_config.json  # Hyperparameters & metadata
│
├── utils/
│   └── model_evaluation.py  # Shared evaluation metrics
│
├── requirements.txt  # Python dependencies (PyTorch, sklearn, etc.)
└── README.md  # This file
- Python 3.8+
- Git
# Clone repository
git clone https://github.com/vihaankulkarni29/ML-Training
cd ML-Training
# Install dependencies
pip install -r requirements.txt
Ceftriaxone Resistance Predictor (Classifier):
streamlit run src/app.py
Access at http://localhost:8501
AI Peptide Dosing Calculator (Regressor):
streamlit run src/app_MIC.py
Access at http://localhost:8501
DeepG2P Model Training:
# Train with default parameters
python src/train.py
# Custom training
python src/train.py \
--train-features data/processed/X_train.npy \
--train-labels data/processed/y_train.npy \
--val-features data/processed/X_val.npy \
--val-labels data/processed/y_val.npy \
--epochs 20 \
--batch-size 32 \
--model-size medium
# Monitor training
tensorboard --logdir results/logs
Antibiotic susceptibility testing via culture takes 24-48 hours. Patients with life-threatening infections can't wait. Goal: Predict Ceftriaxone resistance instantly from genomic markers.
- Model: Random Forest Classifier (100 trees, balanced class weights)
- Data: 4,383 E. coli isolates from NCBI MicroBIGG-E
- Features: 352 detected resistance genes/mutations
| Metric | Value |
|---|---|
| Accuracy | 94.9% |
| Sensitivity | 93.9% |
| Specificity | 95.9% |
| ROC-AUC | 0.978 |
| Test Set Size | 876 isolates |
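For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity, and specificity follow from confusion-matrix counts. It assumes "Resistant" is the positive class, and the counts are invented for illustration; they are not the project's actual confusion matrix.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # fraction of resistant isolates caught
        "specificity": tn / (tn + fp),  # fraction of susceptible isolates cleared
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=90, fn=10, tn=190, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity matters most when a missed resistant isolate would lead to ineffective treatment, which is why it is reported separately from overall accuracy.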
The model independently discovered known resistance mechanisms:
- blaCTX-M-15 (Extended-Spectrum Beta-Lactamase) - strongest predictor
- blaCMY-2 (AmpC Cephalosporinase)
- gyrA_S83L (Gyrase mutation - fluoroquinolone resistance)
Beta-lactamase genes encode enzymes that destroy beta-lactam antibiotics (e.g., cephalosporins) before they can bind to bacterial cell walls.
- Training: `projects/cefixime-resistance-training/src/train.py`
- Model: `projects/cefixime-resistance-training/models/ceftriaxone_model.pkl`
- App: `src/app.py`
Antimicrobial peptide (AMP) design is expensive and slow. Wet-lab screening for potency (MIC) takes months. Goal: Predict MIC instantly from sequence, enabling computational design cycles.
- Model: Random Forest Regressor (100 trees)
- Data: 3,143 E. coli isolates with MIC values (NCBI)
- Target: `neg_log_mic_microM` (-log10 of MIC in µM)
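The target transform and its inverse can be sketched as follows. The function names are hypothetical and only illustrate the -log10 convention stated above, under which higher target values mean more potent peptides.

```python
import math

def mic_to_target(mic_um: float) -> float:
    """Convert an MIC in µM to the regression target, -log10(MIC)."""
    if mic_um <= 0:
        raise ValueError("MIC must be positive")
    return -math.log10(mic_um)

def target_to_mic(neg_log_mic: float) -> float:
    """Invert the transform to recover an MIC in µM."""
    return 10 ** (-neg_log_mic)

for mic in (0.1, 1.0, 10.0, 100.0):
    target = mic_to_target(mic)
    print(f"MIC {mic:>6} µM -> target {target:+.2f} -> MIC {target_to_mic(target):.2f} µM")
```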
| Metric | Current (K-mers) | Previous (Baseline) |
|---|---|---|
| R² Score | 0.9992 | 0.4461 |
| RMSE | 0.024 log units | 0.629 log units |
| Pearson r | 0.9996 | 0.6742 |
| p-value | < 0.001 | < 0.001 |
| Test Set Size | 629 peptides | 629 peptides |
| Features | 406 (7 + 399 k-mers) | 7 (physicochemical only) |
- RMSE of 0.024 log units = ~1.06x fold-change (nearly perfect prediction!)
- Model explains 99.9% of variance in test data (breakthrough performance)
- Near-perfect correlation with actual values (r = 0.9996)
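Because the target is on a log10 scale, an RMSE in log units corresponds to a multiplicative error in MIC. A short sketch of the conversion, using the two RMSE values from the table above (the function name is hypothetical):

```python
def rmse_to_fold_error(rmse_log10: float) -> float:
    """Typical multiplicative error in MIC implied by an RMSE in log10 units."""
    return 10 ** rmse_log10

current_fold = rmse_to_fold_error(0.024)   # k-mer model
baseline_fold = rmse_to_fold_error(0.629)  # physicochemical-only baseline
print(f"k-mer model: ~{current_fold:.2f}x, baseline: ~{baseline_fold:.2f}x")
```

Under this reading the baseline was typically off by a factor of about four in MIC, while the k-mer model is off by a few percent.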
Physicochemical Properties (7 features via Biopython):
- Molecular Weight - correlates with toxicity vs efficacy
- Aromaticity - aromatic residues enhance membrane interaction
- Instability Index - peptide stability in vivo
- Isoelectric Point - charge affects cellular uptake
- GRAVY (hydrophobicity) - hydrophobic residues improve activity
- Length - longer peptides often more potent but less specific
- Positive Charge - (K + R count) - important for bacterial binding
K-mer (Dipeptide) Features (399 features via CountVectorizer):
- Extracts all 2-character amino acid combinations (e.g., "KK", "WR", "EK")
- Captures sequence order information (solves "bag of words" problem)
- Preserves local context: distinguishes `R-R-W-W` from `W-R-W-R`
- Min frequency threshold (min_df=5) filters rare k-mers
- Breakthrough improvement: R² 0.45 → 0.9992 (+122% relative gain)
- < 2 µM: 💎 Excellent (highly potent)
- 2-10 µM: ✅ Good (reasonable activity)
- 10-50 µM: ⚠️ Weak (marginal)
- > 50 µM: ❌ Inactive (not viable)
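The potency tiers above can be expressed as a small lookup function. This is a sketch only: the function name is hypothetical, and the handling of the exact boundary values (10 µM counted as Good, 50 µM as Weak) is an assumption, since the list does not say which tier the boundaries belong to.

```python
def potency_tier(mic_um: float) -> str:
    """Map a predicted MIC in µM to the README's potency tiers."""
    if mic_um < 2:
        return "Excellent"
    if mic_um <= 10:   # assumption: boundary value 10 counts as Good
        return "Good"
    if mic_um <= 50:   # assumption: boundary value 50 counts as Weak
        return "Weak"
    return "Inactive"

for mic in (0.5, 5.0, 25.0, 120.0):
    print(f"{mic:>6} µM -> {potency_tier(mic)}")
```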
Initial Challenge (R² = 0.45)
The baseline model using only physicochemical properties hit a performance ceiling because it treated sequences as ingredients, not recipes.
The Problem:
- Sequence `R-R-W-W` (positive charge → hydrophobic) might be highly potent
- Sequence `W-R-W-R` (alternating pattern) could be ineffective
- Issue: Both have identical weight, charge, GRAVY → model couldn't distinguish them
Physicochemical features are sequence-order agnostic - they summarize global composition but ignore local patterns critical for membrane interaction.
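The order-agnostic behaviour is easy to reproduce. The sketch below computes a few composition-based features in plain Python (not via Biopython, and not the project's `src/features.py`) and shows that `RRWW` and `WRWR` come out identical. GRAVY is computed as the mean Kyte-Doolittle hydropathy; the function and key names are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values, the scale on which GRAVY is defined.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def composition_features(seq: str) -> dict:
    """Features that depend only on which residues occur, not on their order."""
    counts = Counter(seq)
    # Sum per residue type in sorted order so the result is order-independent.
    hydropathy = sum(KYTE_DOOLITTLE[aa] * n for aa, n in sorted(counts.items()))
    return {
        "length": len(seq),
        "positive_charge": counts["K"] + counts["R"],
        "gravy": hydropathy / len(seq),
        "aromatic_fraction": (counts["F"] + counts["W"] + counts["Y"]) / len(seq),
    }

features_rrww = composition_features("RRWW")
features_wrwr = composition_features("WRWR")
print(features_rrww)
print("identical:", features_rrww == features_wrwr)
```

Any model fed only these columns receives the same input row for both peptides, so it has no way to assign them different potencies.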
Solution: K-mer Features (Implemented)
Added dipeptide counting to capture local sequence context:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
analyzer='char',
ngram_range=(2, 2), # Dipeptides (AA, AK, KE, WW, etc.)
min_df=5 # Ignore rare k-mers
)
kmer_features = vectorizer.fit_transform(sequences)
# Result: 399 k-mer features capturing sequence order
Breakthrough Results:
- R² improved from 0.45 → 0.9992 (99.9% variance explained)
- RMSE reduced from 0.63 → 0.024 log units (~26x improvement)
- Model now distinguishes `R-R-W-W` from `W-R-W-R` based on local patterns
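To see why dipeptide counts separate the two example sequences, here is a standard-library sketch of the counting step. It is a stand-in for the `CountVectorizer` call above (without the `min_df` filter), and the function name is hypothetical.

```python
from collections import Counter

def dipeptide_counts(seq: str) -> Counter:
    """Count overlapping 2-mers in a peptide sequence."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

rrww = dipeptide_counts("RRWW")  # RR, RW, WW
wrwr = dipeptide_counts("WRWR")  # WR, RW, WR
print(dict(rrww))
print(dict(wrwr))
print("distinguishable:", rrww != wrwr)
```

One detail worth knowing when inspecting the real feature names: `CountVectorizer` lowercases its input by default, so its vocabulary would contain `rr` and `ww` rather than `RR` and `WW` unless `lowercase=False` is passed.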
Why K-mers Work:
- Capture pairwise amino acid interactions (e.g., `"KK"` = strong positive clustering)
- Preserve local sequence order without overfitting (unlike full sequence embeddings)
- Interpretable: Can analyze top k-mers for biological plausibility
- Computationally efficient for inference
Biological Validation: Top k-mer features likely include:
- `"KK"`, `"RR"` - positive charge clustering (enhances bacterial binding)
- `"WW"`, `"FF"` - hydrophobic patches (membrane insertion)
- `"KE"`, `"RD"` - charged pairs (amphipathicity)
This aligns with known AMP design principles where local sequence motifs drive activity more than global properties.
- Feature extraction: `src/features.py`
- Training: `projects/MIC Regression/src/train.py`
- Model: `projects/MIC Regression/models/mic_predictor.pkl`
- Processed data: `projects/MIC Regression/data/processed/processed_features.csv`
- App: `src/app_MIC.py`
Designing antimicrobial peptides requires screening millions of candidates. The design space is massive (20^50 ≈ 10^65 for 50-length sequences). Goal: Use generative AI to learn natural peptide patterns and create novel candidates for experimental validation.
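The size of the design space can be checked in a couple of lines. With 20 amino acids and 50 positions, the number of sequences is 20^50, and its order of magnitude is 50 × log10(20):

```python
import math

n_sequences = 20 ** 50            # exact integer count of 50-residue peptides
exponent = 50 * math.log10(20)    # order of magnitude of that count
print(f"20^50 has {len(str(n_sequences))} digits, about 10^{exponent:.2f}")
```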
- Model: 2-Layer LSTM (PyTorch character-level RNN)
- Data: 2,872 E. coli peptides (10-50 AA length)
- Task: Learn to predict next amino acid in sequence → generate new peptides
| Metric | Value | Status |
|---|---|---|
| Initial Loss (Epoch 1) | 2.81 | Random |
| Target Achieved (Epoch 15) | 1.59 | ✅ Hit target |
| Final Loss (Epoch 50) | 0.854 | ✨ Excellent |
| Training Time (CPU) | ~10 min | Practical |
| Training Time (GPU) | ~2 min | Fast |
| Vocab Size | 23 | (20 AA + 3 special) |
| Model Parameters | ~1.3M | Manageable |
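The vocabulary size of 23 follows from the 20 standard amino acids plus three special tokens. The sketch below shows one way such a tokenizer could look. The class name and the token names `<PAD>`, `<SOS>`, and `<EOS>` are assumptions; the actual `PeptideVocab` in `vocab.py` may use different tokens or a different index order.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard residues
SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>"]  # assumed special tokens

class ToyPeptideVocab:
    """Illustrative character-level vocabulary for peptide sequences."""

    def __init__(self) -> None:
        self.itos = SPECIAL_TOKENS + list(AMINO_ACIDS)
        self.stoi = {token: i for i, token in enumerate(self.itos)}

    def __len__(self) -> int:
        return len(self.itos)

    def encode(self, seq: str) -> List[int]:
        """Wrap the sequence in start/end tokens and map residues to indices."""
        body = [self.stoi[aa] for aa in seq]
        return [self.stoi["<SOS>"]] + body + [self.stoi["<EOS>"]]

    def decode(self, ids: List[int]) -> str:
        """Map indices back to residues, dropping special tokens."""
        tokens = (self.itos[i] for i in ids)
        return "".join(t for t in tokens if t not in SPECIAL_TOKENS)

vocab = ToyPeptideVocab()
ids = vocab.encode("GIGK")
print(len(vocab), ids, vocab.decode(ids))
```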
Input: Sequence of amino acid indices
↓
Embedding (vocab_size=23 → embedding_dim=128)
↓
LSTM Layer 1 (128 → 256 units) + Dropout(0.3)
↓
LSTM Layer 2 (256 → 256 units) + Dropout(0.3)
↓
Linear (256 → vocab_size=23)
↓
Output: Logits for next token
Epoch 50 Generations (Temperature=0.8):
1. FLPAIVGAAAKFLPKIFCAITKKC ← Hydrophobic core + basic tail
2. GIGKFLHSAKKFGKAFVGEIMNS ← Alternating hydrophobic/charged
3. SKVGRHWRRFWHRAHRLLHR ← Rich in W (aromatic) & R (cationic)
4. GLRKRLRKFRNKIKEKLKKIGQKIQGLLPKLAPRTDY
5. LLGDFFRKSKEKIGKEFKRIVQRIKDFFRNLVPRTES
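The temperature setting controls how sharply the next-token distribution is peaked before sampling. A minimal standard-library sketch of temperature sampling follows; the logits are made up, whereas in the real generator they would come from the model's final linear layer, and the function name is hypothetical.

```python
import math
import random
from typing import List, Tuple

def sample_with_temperature(
    logits: List[float], temperature: float = 0.8, rng=random
) -> Tuple[int, List[float]]:
    """Sample an index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return index, probs

rng = random.Random(0)                      # fixed seed for reproducibility
toy_logits = [2.0, 1.0, 0.0]                # hypothetical scores for 3 tokens
_, probs_cool = sample_with_temperature(toy_logits, temperature=0.5, rng=rng)
_, probs_warm = sample_with_temperature(toy_logits, temperature=1.5, rng=rng)
print([round(p, 3) for p in probs_cool])
print([round(p, 3) for p in probs_warm])
```

Lower temperatures concentrate probability on the top-scoring residue and yield more conservative sequences; higher temperatures flatten the distribution and increase novelty at the cost of realism.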
Why These Look Realistic:
- Contain hydrophobic residues (L, V, I, F) for membrane interaction
- Cationic clusters (K, R) for bacterial binding
- Contain few acidic residues (D, E), which tend to reduce activity
- Length distribution matches natural AMPs
- No known toxins generated
- Model learned biological patterns without explicit rules
- Generative capability → enables computational screening
- Loss convergence is consistent with genuine pattern learning, although training loss alone cannot rule out memorization
- Character-level modeling better than sequence models for this task
Next Steps (Future Work):
- ✅ MIC Prediction: Use Project 2 regressor on generated sequences
- ✅ Toxicity Screening: Hemolysis prediction models
- ✅ Structural Validation: AlphaFold2 for 3D verification
- ✅ Lab Validation: Experimental MIC testing
- Vocabulary: `projects/week4_peptide_generator/src/vocab.py`
- Training & Generation: `projects/week4_peptide_generator/src/train_generator.py`
- Best Model: `projects/week4_peptide_generator/models/peptide_lstm.pth`
- Checkpoints: `projects/week4_peptide_generator/models/peptide_lstm_epoch_{10,20,30,40,50}.pth`
- Documentation: `projects/week4_peptide_generator/README.md`
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: GENERATION (Week 4 Peptide Generator) │
│ Generate 1000 candidate sequences │
│ Temperature=0.8 for balanced novelty/realism │
└─────────────────────────┬──────────────────────────────────┘
│
┌─────────────────────────▼──────────────────────────────────┐
│ Stage 2: POTENCY PREDICTION (Project 2: MIC Regressor) │
│ Predict MIC for each candidate │
│ Filter: Keep only high-potency (MIC < 5 µM) │
│ Result: ~50-100 promising candidates │
└─────────────────────────┬──────────────────────────────────┘
│
┌─────────────────────────▼──────────────────────────────────┐
│ Stage 3: EXPERIMENTAL VALIDATION │
│ Synthesize top 20 candidates │
│ Test MIC, toxicity, stability │
│ → 2-3 viable drug leads per iteration │
└─────────────────────────────────────────────────────────────┘
This computational-experimental hybrid dramatically reduces time & cost vs. random screening.
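The filtering logic of Stages 2 and 3 can be sketched as a single function: score every candidate, keep those under the MIC cutoff, and pass the most potent ones on for synthesis. The function name, the toy sequences, and the lookup-table predictor are all placeholders; in the real pipeline the predictor would be the trained MIC regressor.

```python
from typing import Callable, List, Tuple

def screen_candidates(
    candidates: List[str],
    predict_mic: Callable[[str], float],
    mic_cutoff_um: float = 5.0,
    top_n: int = 20,
) -> List[Tuple[str, float]]:
    """Keep candidates with predicted MIC below the cutoff, most potent first."""
    scored = [(seq, predict_mic(seq)) for seq in candidates]
    kept = [(seq, mic) for seq, mic in scored if mic < mic_cutoff_um]
    kept.sort(key=lambda pair: pair[1])   # lowest predicted MIC first
    return kept[:top_n]

# Placeholder predictions in µM; these numbers carry no biological meaning.
toy_predictions = {"KKLL": 1.5, "AAAA": 80.0, "RRWW": 4.0, "GGGG": 12.0}
shortlist = screen_candidates(list(toy_predictions), toy_predictions.__getitem__)
print(shortlist)
```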
All projects include built-in validation mechanisms to ensure scientific rigor and prevent common ML failures.
Problem: Clustering treating single-sample locations as valid clusters.
Solution: Filter locations with <5 samples before matrix construction.
python src/process_matrix.py --min-location-samples 5
Impact: Prevents geographic clustering artifacts, improves statistical reliability.
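The minimum-sample filter amounts to counting records per location and dropping those below the threshold. A standard-library sketch under the assumption that each record is a `(location, sample_id)` pair; the actual data layout in `process_matrix.py` is not described in this README.

```python
from collections import Counter
from typing import List, Tuple

def filter_sparse_locations(
    records: List[Tuple[str, str]], min_samples: int = 5
) -> List[Tuple[str, str]]:
    """Keep only records from locations with at least min_samples samples."""
    counts = Counter(location for location, _ in records)
    return [rec for rec in records if counts[rec[0]] >= min_samples]

# Toy data: site_A has 5 samples, site_B has only 1.
toy_records = [("site_A", f"s{i}") for i in range(5)] + [("site_B", "s5")]
kept_records = filter_sparse_locations(toy_records, min_samples=5)
print(sorted({location for location, _ in kept_records}))
```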
Problem: Generated peptides might be >90% identical to training data (memorization).
Solution: Check sequence homology using SequenceMatcher before screening.
Filtered 2 candidates for high homology (>90% identity)
✓ Novelty status: NOVEL
Impact: Ensures generated peptides are truly novel for experimental validation.
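A homology check of this kind can be sketched with `difflib.SequenceMatcher` from the standard library. Note that `ratio()` is a general string-similarity score rather than an alignment-based percent identity, so the 0.9 threshold here is only an approximation of ">90% identity". The function names and toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List

def max_similarity(candidate: str, training_set: List[str]) -> float:
    """Highest SequenceMatcher ratio between the candidate and any training peptide."""
    return max(
        SequenceMatcher(None, candidate, ref, autojunk=False).ratio()
        for ref in training_set
    )

def is_novel(candidate: str, training_set: List[str], threshold: float = 0.9) -> bool:
    """Treat a candidate as novel if no training peptide exceeds the threshold."""
    return max_similarity(candidate, training_set) <= threshold

toy_training = ["KKLLKKLLKK", "GIGKFLHS"]   # toy stand-ins for training peptides
for candidate in ("KKLLKKLLKK", "KKLLKKLLKR", "WWWWWWWW"):
    score = max_similarity(candidate, toy_training)
    print(candidate, round(score, 2), "NOVEL" if is_novel(candidate, toy_training) else "FILTERED")
```

For a stricter check, a pairwise alignment score would be a more faithful measure of identity than this ratio.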
Problem: Regressor predicts values outside training range (hallucination).
Example: Training MIC range 0.5-256 µM, but model predicts 0.017 µM
Solution: Flag predictions outside training range with confidence indicators.
Flagged 2 predictions with LOW_CONFIDENCE* (outside training range 0.5-256 µM)
prediction_confidence: HIGH_CONFIDENCE or LOW_CONFIDENCE*
Impact: Prevents overconfident predictions on extrapolated values.
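The range check itself is a simple comparison against the training bounds. A sketch using the 0.5-256 µM range and the confidence labels from the example output above; the function name is hypothetical.

```python
def confidence_flag(
    predicted_mic_um: float, train_min: float = 0.5, train_max: float = 256.0
) -> str:
    """Flag predictions that fall outside the MIC range seen in training."""
    if train_min <= predicted_mic_um <= train_max:
        return "HIGH_CONFIDENCE"
    return "LOW_CONFIDENCE*"

for mic in (0.017, 8.0, 512.0):
    print(f"{mic} µM -> {confidence_flag(mic)}")
```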
Problem: Computer vision fails with poor lighting (too dark or overexposed).
Solution: Validate image intensity before analysis.
Image quality: mean_intensity = 125.4
✓ Image quality validated (within 50-200 range)
Impact: Prevents false positives/negatives from suboptimal imaging conditions.
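The intensity check can be sketched without any imaging library by treating a grayscale image as a nested list of 0-255 pixel values. The function names and toy images are hypothetical; the 50-200 range comes from the example output above.

```python
from typing import List

def mean_intensity(pixels: List[List[int]]) -> float:
    """Mean grayscale intensity of an image given as rows of 0-255 values."""
    values = [v for row in pixels for v in row]
    return sum(values) / len(values)

def image_quality_ok(
    pixels: List[List[int]], low: float = 50.0, high: float = 200.0
) -> bool:
    """Accept the image only if its mean intensity lies within [low, high]."""
    return low <= mean_intensity(pixels) <= high

too_dark = [[10, 20], [15, 5]]          # toy 2x2 underexposed image
well_lit = [[120, 130], [110, 140]]     # toy 2x2 image with usable exposure
print(mean_intensity(too_dark), image_quality_ok(too_dark))
print(mean_intensity(well_lit), image_quality_ok(well_lit))
```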
- Pandas: Data manipulation & analysis
- NumPy: Numerical computations
- Scikit-Learn: RandomForest classifiers & regressors
- Biopython: Protein sequence analysis (`Bio.SeqUtils.ProtParam`)
- SciPy: Statistical tests (Pearson correlation, etc.)
- Matplotlib: Static publication-ready plots
- Plotly: Interactive HTML charts
- Kaleido: PNG export from Plotly
- Streamlit: Interactive web apps (no frontend coding)
- Joblib: Model persistence (.pkl files)
- GitHub: Version control & deployment integration
Global Impact:
- ~1.3M deaths/year attributable to AMR (WHO, 2022)
- Top 10 global health threat
- Economic cost: $100B+ annually in healthcare
Genetic Basis (Ceftriaxone Example):
- Enzymatic Inactivation: blaCTX-M genes produce beta-lactamases that hydrolyze beta-lactam ring
- Target Modification: gyrA mutations alter the DNA gyrase binding site (fluoroquinolone resistance, often co-occurring with beta-lactam resistance)
- Efflux Pumps: acrB overexpression exports antibiotics before they act
Natural Defense:
- Found in all life forms (immune system, skin, GI tract)
- Kill bacteria via direct membrane disruption
- Less likely to develop resistance (multiple mechanisms)
Design Challenge:
- Potency (MIC) varies 1000-fold (0.1 - 100+ µM)
- Toxicity risk increases with potency
- Design space is massive (20^n for n-length peptides)
ML Solution:
- Use physicochemical properties to predict potency
- Enable rational design instead of random screening
- Reduce wet-lab costs & timelines
- NCBI MicroBIGG-E: https://microbiggdata.ncbi.nlm.nih.gov/ (genotypes + phenotypes)
- EUCAST Guidelines: https://www.eucast.org/ (standard testing methods)
- CARD Database: https://card.mcmaster.ca/ (resistance gene annotations)
- APD (APD3): https://aps.unmc.edu/APD/ (AMP database)
- BioPep: https://www.bipep.org/ (peptide bioactivity)
- `ProteinAnalysis` documentation: https://biopython.org/wiki/Documentation
For research/educational use only. Not a clinical diagnostic device.
- Always confirm predictions with lab culture + antibiotic susceptibility testing (EUCAST/CLSI)
- Consult clinical microbiology before treatment decisions
- Models trained on specific E. coli population; validate locally
For research/design purposes only. Not validated for clinical use.
- Predicted MIC is a computational estimate; always validate experimentally
- Model trained on specific data; performance may vary on novel sequences
- Use as design guidance, not final arbiter of peptide efficacy
- Multi-organism support (Klebsiella, Pseudomonas)
- SHAP explainability for individual predictions
- Confidence intervals for MIC predictions
- REST API for integration with LIS systems
- Additional antibiotics (fluoroquinolones, aminoglycosides)
- Uncertainty quantification via Bayesian methods
- Mobile app (iOS/Android) for field deployment
- Real-time database updates from NCBI
- Community contribution framework
Vihaan Kulkarni — Bioinformatics & Machine Learning Engineer
MIT License — Free for academic and research use.
Last Updated: December 17, 2025
Status: ✅ Active Development
- Fill out `README.md` with:
  - Problem statement
  - Key insights (with screenshots)
  - Model metrics
  - Deployment link
- Use "Problem → Method → Insight → Impact" structure
Every project includes:
- Data: `pandas`, `numpy`
- Visualization: `plotly`, `kaleido`
- Modeling: `scikit-learn`
- Explainability: `shap`
- Deployment: `streamlit`
Optional (uncomment in requirements.txt if needed):
- Experiment Tracking: `mlflow`, `wandb`
- Deep Learning: `torch`, `tensorflow`
- Run baseline first: Always compare against a simple model
- Plotly over Matplotlib: Interactive charts reveal more insights
- Document as you go: Fill README during the project, not after
- Save figures: Use `fig.write_html()` to preserve interactivity
- Version control: Commit after each major milestone
- ✅ 1 high-quality project per week
- ✅ Every project deployed with Streamlit
- ✅ README formatted for resume/GitHub
- ✅ Interactive visualizations (no static PNGs)
- ✅ Model explainability included
Built by Vihaan Kulkarni
Senior ML Engineer & Data Storyteller