Welcome to ARCADIAMP! This repository provides all the code, notebooks, and trained models needed to:
- Generate novel antimicrobial peptide (AMP) candidates using a discrete denoising diffusion probabilistic model (D3PM)
- Predict AMP activity with coarse-grained and fine-grained ESM-2 classifier ensembles for screening AMPs and strong AMPs, respectively
- Filter out generated AMP candidates with low sequence novelty by comparing them against known AMPs
The workflow is fully notebook-driven and runs in JupyterLab. While CPU-only execution is supported, a GPU is strongly recommended for optimal training and inference speed.
Quick Start: For immediate results, run `FullPipeline.ipynb` to generate AMP candidates end-to-end, or follow the module-by-module workflow for custom training and inference.
- Self-Learning EvoDiff-D3PM AMP Generator – Iteratively retrain models using selected generated sequences with low predicted MIC values
- Two-Stage ESM-2 Classifiers – Combines coarse-grained and fine-grained activity classifiers that first screen for AMPs, then for strong AMPs (low MIC threshold, e.g., <8 μg/ml)
- Sequence Novelty Filter – Align generated candidates to well-known AMPs and keep only highly novel sequences, using customizable similarity and identity thresholds (see the sketch after this list)
- Ensemble Averaging – Provides stable activity predictions by averaging across the five models from 5-fold cross-validation
- Flexible Configuration – Easily adjust thresholds, model parameters, and output specifications through notebook parameters
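To make the novelty filter concrete, here is a toy sketch of the idea: each generated candidate is compared against the reference AMPs and kept only if its best match stays below the similarity threshold. It uses `difflib` string similarity as a stand-in for the alignment-based similarity and identity scores computed by `NoveltyEvaluation.py`, and the `sequence` column name is an assumption about `ReferenceAMPs.csv`.

```python
# Toy illustration of the novelty filter, NOT the repository's implementation
# (NoveltyEvaluation.py uses sequence alignment; difflib stands in for it here).
from difflib import SequenceMatcher

import pandas as pd

def is_novel(candidate: str, references: list[str], sim_threshold: float = 0.45) -> bool:
    """Keep a candidate only if its best match against the references stays below the threshold."""
    best_match = max(SequenceMatcher(None, candidate, ref).ratio() for ref in references)
    return best_match < sim_threshold

references = pd.read_csv("ReferenceAMPs.csv")["sequence"].tolist()  # column name assumed
candidates = ["GLFDIVKKVVGALGSL", "KWKLFKKIEKVGQNIRDGIIK"]           # example generated peptides
novel_candidates = [seq for seq in candidates if is_novel(seq, references)]
```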
- Python: 3.11.9 (recommended)
- Hardware: GPU recommended for training (CPU-only supported for inference)
- Storage: ~5 GB for models and datasets
git clone https://github.com/IBPA/ARCADIAMP.git
cd ARCADIAMP

# Create and activate conda environment
conda create -n arcadiamp python=3.11.9
conda activate arcadiamp

# Install required packages
pip install -r requirements.txt

# Quick test to verify installation
python -c "import torch; import evodiff; import pandas as pd; print('Installation successful!')"

Note: On first run, some models may need to download pretrained weights (~500MB), which will be cached automatically.
ARCADIAMP/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
├── FullPipeline.ipynb                 # Main notebook: Full pipeline workflow
├── NoveltyEvaluation.py               # Sequence novelty evaluation functions
├── ReferenceAMPs.csv                  # Reference AMPs for novelty evaluation
│
├── GenerativeModel/                   # D3PM generative model components
│   ├── config/
│   │   └── config38M.json             # Model configuration file
│   │
│   ├── data/                          # Training data directory
│   │   ├── DatasetPreparation/        # Scripts for preparing training datasets
│   │   │   └── create_peptide_dataset_complete_with_toxicity.ipynb
│   │   │
│   │   ├── consensus.fasta            # AMP sequences for training
│   │   ├── sample_weights.json        # Sample weights for training sequences
│   │   ├── lengths_and_offsets.npz    # Sequence processing indices
│   │   └── splits.json                # Train/validation/test split indices
│   │
│   ├── notebooks/
│   │   ├── Train_GenerativeModel.ipynb  # Train the generative model
│   │   └── GenerateCandidates.ipynb     # Generate AMP candidates
│   ├── src/                           # Core generative model source code
│   │   ├── D3PM_collaters.py          # Data collation utilities
│   │   ├── D3PM_datasets.py           # Dataset handling
│   │   ├── D3PM_losses.py             # Loss functions
│   │   ├── D3PM_train.py              # Training utilities
│   │   └── D3PM_utils.py              # General utilities
│   └── models/                        # Directory for trained models
│       └── trained_generative_model.tar  # Pre-trained generative model
│
├── AMPClassifier/                     # AMP activity classification
│   ├── notebooks/
│   │   ├── Train_Coarse_Grained.ipynb   # Train coarse-grained AMP classifier
│   │   ├── Train_Fine_Grained.ipynb     # Train fine-grained strong AMP classifier (MIC <8 μg/ml)
│   │   ├── Predict_Coarse_Grained.ipynb # Predict using coarse-grained model
│   │   └── Predict_Fine_Grained.ipynb   # Predict using fine-grained model
│   └── models/                        # Trained classifier models
│       ├── 90K_WEIGHTED_CLASS_AND_INSTANCES/  # Coarse-grained model weights (samplewise cross-entropy weights applied)
│       │   ├── fold_1.pth             # 5-fold CV model files
│       │   ├── fold_2.pth
│       │   ├── fold_3.pth
│       │   ├── fold_4.pth
│       │   └── fold_5.pth
│       ├── 90K_WEIGHTED_CLASS/        # Coarse-grained model weights (equal sample cross-entropy weights)
│       └── FINE_GRAINED_8/            # Fine-grained classifier weights (MIC threshold of 8 μg/ml)
│
├── PreliniminaryPipeline(Old)/        # Archived components of the preliminary pipeline
└── results/                           # Default output directory for results
For end-to-end AMP candidate generation, run `FullPipeline.ipynb`. This notebook orchestrates all components and is the easiest way to get started.
Key Configuration Parameters:
- `N_FINAL_PEPTIDES`: Number of final AMP candidates to generate
- `SIM_THRESHOLD`: Similarity threshold for novelty filtering (0.0-1.0, default: 0.45, lower = stricter)
- `IDENTITY_THRESHOLD`: Identity threshold for novelty filtering (0.0-1.0, default: 0.65, lower = stricter)
- `COARSE_THRESHOLD`: Coarse-grained classifier probability cutoff (default: 0.9998)
- `FINE_AMP_PROB`: Fine-grained classifier probability cutoff (default: 0.8)
- `MIN_LENGTH`/`MAX_LENGTH`: Generated peptide length range

Required File Paths:
- `TRAINED_GENERATIVE_MODEL_PATH`: Trained generative model file (e.g., `GenerativeModel/models/trained_generative_model.tar`)
- `COARSE_GRAINED_MODEL_DIR`: Coarse-grained classifier directory (e.g., `AMPClassifier/models/90K_WEIGHTED_CLASS_AND_INSTANCES/`)
- `FINE_GRAINED_MODEL_DIR`: Fine-grained classifier directory (e.g., `AMPClassifier/models/FINE_GRAINED_8/`)
- `OUTPUT_FILE`: Output file path (e.g., `results/final_novel_peptides.csv`)

- The pipeline generates candidates iteratively until enough meet all criteria
- If too few candidates pass the filters, consider loosening `SIM_THRESHOLD` or `IDENTITY_THRESHOLD` (i.e., raising them)
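For reference, a configuration cell in `FullPipeline.ipynb` might look like the following. The threshold values are the documented defaults and the paths are the examples listed above; `N_FINAL_PEPTIDES` and the length range are illustrative choices, not prescribed values.

```python
# Example configuration cell (defaults and example paths from this README;
# N_FINAL_PEPTIDES and the length range are illustrative values).
N_FINAL_PEPTIDES = 100            # number of final AMP candidates to keep
SIM_THRESHOLD = 0.45              # novelty filtering: similarity cutoff (lower = stricter)
IDENTITY_THRESHOLD = 0.65         # novelty filtering: identity cutoff (lower = stricter)
COARSE_THRESHOLD = 0.9998         # coarse-grained classifier probability cutoff
FINE_AMP_PROB = 0.8               # fine-grained classifier probability cutoff
MIN_LENGTH, MAX_LENGTH = 10, 50   # generated peptide length range

TRAINED_GENERATIVE_MODEL_PATH = "GenerativeModel/models/trained_generative_model.tar"
COARSE_GRAINED_MODEL_DIR = "AMPClassifier/models/90K_WEIGHTED_CLASS_AND_INSTANCES/"
FINE_GRAINED_MODEL_DIR = "AMPClassifier/models/FINE_GRAINED_8/"
OUTPUT_FILE = "results/final_novel_peptides.csv"
```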
For custom training or detailed control over each component, use individual notebooks:
Notebook: GenerativeModel/notebooks/Train_GenerativeModel.ipynb
Required Parameters:
- `TRAINING_DATA_DIR`: Directory containing training data (e.g., `../data/`). Must contain these four files:
  - `consensus.fasta` – AMP sequences for training
  - `sample_weights.json` – Sample weights for training sequences
  - `lengths_and_offsets.npz` – Sequence processing indices
  - `splits.json` – Train/validation/test split indices
- `OUTPUT_FILE`: Path for saving the trained model (e.g., `../models/model_epoch5.tar`)

Optional Parameters:
- `LAMBDA_LOSS_CROSS_ENTROPY` (default: 0.0002) – Contribution factor of the samplewise cross-entropy loss (see the original D3PM paper for details)
- `LAMBDA_LOSS_LVB` (default: 1) – Contribution factor of the LVB loss (see the original D3PM paper for details)
- `DIFFUSION_TIMESTEPS` (default: 500) – Number of diffusion timesteps
- `N_EPOCHS` (default: 500) – Training epochs
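The two lambda parameters weight the terms of the D3PM hybrid objective from Austin et al.: the variational bound (LVB) term and an auxiliary samplewise cross-entropy term. Conceptually they combine as in the sketch below; see `GenerativeModel/src/D3PM_losses.py` for the actual implementation.

```python
# Sketch of how the loss weights enter the training objective (illustrative only;
# the real computation of the two terms lives in GenerativeModel/src/D3PM_losses.py).
def hybrid_loss(loss_lvb, loss_ce, lambda_lvb=1.0, lambda_ce=0.0002):
    """total = LAMBDA_LOSS_LVB * L_vb + LAMBDA_LOSS_CROSS_ENTROPY * L_ce (D3PM hybrid loss)."""
    return lambda_lvb * loss_lvb + lambda_ce * loss_ce
```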
Dataset Preparation: Use `GenerativeModel/data/DatasetPreparation/create_peptide_dataset_complete_with_toxicity.ipynb` to prepare training datasets (especially samplewise cross-entropy weights) from AMP sequences and MIC/toxicity tables.
Notebook: GenerativeModel/notebooks/GenerateCandidates.ipynb
Required Parameters:
- `MODAL_PATH`: Path to the trained generative model (e.g., `../models/model_epoch5.tar`)
- `OUTPUT_AMP_CANDIDATE_FILE`: Output CSV file path (e.g., `../results/AMP_Candidates.csv`)
- `N_CANDIDATES`: Number of AMP candidates to generate
- `LENGTH_CANDIDATE`: Peptide length of generated AMP candidates
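A parameter cell for this notebook might look like the sketch below; the paths are the examples from the list above, the counts are placeholders, and inspecting the output with pandas assumes the candidates are written as a plain CSV table.

```python
import pandas as pd

# Example parameter cell (paths from the README examples; counts are placeholders).
MODAL_PATH = "../models/model_epoch5.tar"                    # trained generative model
OUTPUT_AMP_CANDIDATE_FILE = "../results/AMP_Candidates.csv"  # where candidates are written
N_CANDIDATES = 1000                                          # how many sequences to sample
LENGTH_CANDIDATE = 20                                        # length of each generated peptide

# After the notebook has run, the generated candidates can be inspected with pandas.
candidates = pd.read_csv(OUTPUT_AMP_CANDIDATE_FILE)
print(candidates.head())
```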
Notebook: AMPClassifier/notebooks/Train_Coarse_Grained.ipynb
Required Parameters:
- `NON_AMP_FILE`: Non-AMP sequences in FASTA format (e.g., `../data/negative_AMP_90K.fa`)
- `AMP_FILE`: AMP sequences in FASTA format (e.g., `../data/positive_AMP_ecoli.fa`)
- `TRAIN_MIC_VALUES_FILE`: CSV file with MIC values for sequences (e.g., `../data/main_ecoli_plus_90K.csv`). Assign large MIC values to non-AMP sequences.
- `OUT_DIR`: Output directory for trained models (e.g., `../trained_coarse_grained_predictors/`)

Optional Parameters:
- `NON_AMP_CLASS_WEIGHT` (default: 12) – Sample weight for the non-AMP class
- `USE_INSTANCE_WEIGHTING` (default: True) – Enable sample weighting during training
- `BATCH_SIZE` (default: 64) – Training batch size
- `NUM_EPOCHS` (default: 100) – Number of training epochs
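To show how `NON_AMP_CLASS_WEIGHT` and `USE_INSTANCE_WEIGHTING` interact conceptually, the sketch below applies both a class weight and per-instance weights to a per-sample cross-entropy loss. This illustrates the weighting technique only; it is not the notebook's actual training loop, and the assumption that label 0 denotes non-AMP is mine.

```python
import torch
import torch.nn.functional as F

def weighted_ce(logits, labels, instance_weights, non_amp_class_weight=12.0):
    """Per-sample cross-entropy where the non-AMP class (label 0, assumed) is up-weighted
    and per-instance weights further scale each sample's contribution."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")  # one loss value per sample
    class_w = torch.ones_like(per_sample)
    class_w[labels == 0] = non_amp_class_weight                     # up-weight non-AMP samples
    return (per_sample * class_w * instance_weights).mean()

# Tiny example: 4 samples, 2 classes (0 = non-AMP, 1 = AMP)
logits = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = weighted_ce(logits, labels, instance_weights=torch.ones(4))
```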
Notebook: AMPClassifier/notebooks/Predict_Coarse_Grained.ipynb
Required Parameters:
- `COARSE_MODEL_DIRECTORY`: Directory containing the trained model (e.g., `../trained_coarse_grained_predictors/`)
- `TEST_FILE`: Input CSV file with sequences to predict (e.g., `../data/Testset.csv`)
- `OUTPUT_FILE`: Output CSV file for results (e.g., `../prediction_results/Coarse_Prediction_Results.csv`)
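Coarse-grained predictions are averaged over the five cross-validation checkpoints (`fold_1.pth` through `fold_5.pth`) in `COARSE_MODEL_DIRECTORY`. The sketch below shows only the averaging step; `score_fn` is a hypothetical stand-in for the notebook's own "build model from checkpoint and predict probabilities" code.

```python
import glob

import numpy as np
import torch

def ensemble_average(model_dir, sequences, score_fn):
    """Average predicted probabilities over the fold_*.pth checkpoints in model_dir.

    score_fn(checkpoint, sequences) is a hypothetical callable standing in for the
    notebook's model construction and scoring; only the averaging is shown here.
    """
    fold_probs = []
    for path in sorted(glob.glob(f"{model_dir}/fold_*.pth")):
        checkpoint = torch.load(path, map_location="cpu")   # one CV fold's weights
        fold_probs.append(score_fn(checkpoint, sequences))  # per-sequence probabilities
    return np.mean(fold_probs, axis=0)                      # element-wise mean across folds
```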
Notebook: AMPClassifier/notebooks/Train_Fine_Grained.ipynb
Required Parameters:
- `TRAIN_MIC_VALUES_FILE`: CSV file with MIC values for AMP sequences only (e.g., `../data/main_ecoli.csv`)
  - ⚠️ Important: Do NOT include non-AMP sequences in fine-grained training
- `OUT_DIR`: Output directory for trained models (e.g., `../trained_fine_grained_predictors/`)

Optional Parameters:
- `THRESHOLD` (default: 8) – MIC threshold (μg/ml) for positive/negative classification
- `USE_INSTANCE_WEIGHTING` (default: True) – Enable sample weighting during training
- `BATCH_SIZE` (default: 32) – Training batch size
- `NUM_EPOCHS` (default: 100) – Number of training epochs
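The fine-grained labels follow directly from the MIC threshold: peptides with MIC below `THRESHOLD` μg/ml are treated as strong AMPs (positives). A minimal labeling sketch, assuming the CSV has `sequence` and `MIC` columns (the column names are assumptions):

```python
import pandas as pd

THRESHOLD = 8  # μg/ml; MIC values below this count as strong AMPs

df = pd.read_csv("../data/main_ecoli.csv")         # AMP sequences with MIC values only
df["label"] = (df["MIC"] < THRESHOLD).astype(int)  # 1 = strong AMP, 0 = other AMP
```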
Notebook: AMPClassifier/notebooks/Predict_Fine_Grained.ipynb
Required Parameters:
- `FINE_MODEL_DIRECTORY`: Directory containing the trained model (e.g., `../trained_fine_grained_predictors/`)
- `TEST_FILE`: Input CSV file with sequences to predict (e.g., `../data/Testset.csv`)
- `OUTPUT_FILE`: Output CSV file for results (e.g., `../prediction_results/Fine_Prediction_Results.csv`)
- CPU-only usage: Set `device = 'cpu'` in all notebooks. Training will be slow, but inference remains feasible for small batches
- GPU memory issues: Reduce `batch_size` if you encounter out-of-memory errors, or enable mixed precision with `torch.cuda.amp`
- Multi-GPU setup: The notebooks are configured for single-GPU usage. Modify device assignment for multi-GPU setups
- Threshold tuning: Examine ROC/PR curves from the test notebooks to select operating points that balance precision vs. recall
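One convenient pattern is to fall back to CPU automatically whenever no GPU is visible; a line like the following could replace a hard-coded device assignment in the notebooks (a suggestion, not the notebooks' existing code):

```python
import torch

# Use the GPU when one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}")
```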
- Too few candidates: If the pipeline generates insufficient candidates, try:
  - Raising `SIM_THRESHOLD` or `IDENTITY_THRESHOLD` (less strict novelty filtering)
  - Lowering `COARSE_THRESHOLD` or `FINE_AMP_PROB` (less strict activity prediction)
Installation Issues:
- If `evodiff` fails to install: Try `pip install --no-deps evodiff`, then install missing dependencies manually
- For `fair-esm` compatibility issues: Ensure PyTorch version compatibility
Training Issues:
- Long training times: Use GPU acceleration and consider reducing dataset size for initial tests
- Memory errors during training: Reduce batch size or use gradient accumulation
- Poor model performance: Check data quality, class balance, and consider adjusting sample weights
Generation Issues:
- Generated sequences are too similar: Increase temperature or reduce conditioning strength
- Low novelty scores: The reference database may need updating, or the similarity thresholds may need adjustment
File Path Issues:
- Always use forward slashes (`/`) in paths, even on Windows
- Ensure all required data files exist before running notebooks
- Check that output directories are writable
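Alternatively, `pathlib` builds paths with the correct separator on any platform, which sidesteps the issue entirely; a small example using the default results directory:

```python
from pathlib import Path

# pathlib joins components with the right separator for the current platform.
results_dir = Path("results")
results_dir.mkdir(parents=True, exist_ok=True)          # make sure the output directory exists
output_file = results_dir / "final_novel_peptides.csv"  # e.g. results/final_novel_peptides.csv
print(output_file)
```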
- Issues & Bug Reports: Open an issue on GitHub
- General Questions: Email us at tagkopouloslab@ucdavis.edu
- Collaboration Inquiries: Contact the Tagkopoulos Lab at UC Davis
- D3PM Generative Model: Austin et al. "Structured Denoising Diffusion Models in Discrete State-Spaces" NeurIPS 2021 (arXiv:2107.03006)
- EvoDiff-D3PM: Alamdari et al. "Protein generation with evolutionary diffusion: sequence is all you need" bioRxiv 2023
- ESM-2 Protein Model: Lin et al. "Language models of protein sequences at the scale of evolution enable accurate structure prediction" Science 2022 (doi:10.1126/science.ade2574)
- Markakis et al. "Discovery of potent low-toxicity antimicrobial peptides through diffusion modeling" (under review)
Licensed under the Apache 2.0 License.
Happy peptide designing! 🧬✨
