ARCADIAMP: a computational platform based on discrete denoising diffusion probabilistic models (D3PM) and two-stage ESM-2 binary classifiers that generates, classifies, and recommends potent AMPs with high activity, low toxicity, and high bioavailability.

License

Notifications You must be signed in to change notification settings

IBPA/ARCADIAMP

Repository files navigation

ARCADIAMP: Generative AI Workflow for Screening Highly Active Antimicrobial Peptides (AMPs)

Figure 1. Overview of ARCADIAMP

Welcome to ARCADIAMP! This repository provides all the code, notebooks, and trained models needed to:

  1. Generate novel antimicrobial peptide (AMP) candidates using a discrete denoising diffusion probabilistic model (D3PM)
  2. Predict AMP activity with coarse-grained and fine-grained ESM-2 classifier ensembles for screening AMPs and strong AMPs, respectively
  3. Filter out generated AMP candidates with low sequence novelty by comparing them against known AMPs

The workflow is fully notebook-driven and runs in JupyterLab. While CPU-only execution is supported, a GPU is strongly recommended for optimal training and inference speed.

🚀 Quick Start: For immediate results, run FullPipeline.ipynb to generate AMP candidates end-to-end, or follow the module-by-module workflow for custom training and inference.


✨ Features

  • Self-Learning EvoDiff-D3PM AMP Generator – Iteratively retrains the generator on selected generated sequences with low predicted MIC values
  • Two-Stage ESM-2 Classifiers – Combines coarse-grained and fine-grained activity classifiers that first screen for AMPs, then for strong AMPs (low MIC threshold, e.g., <8 μg/ml)
  • Sequence Novelty Filter – Aligns generated candidates to well-known AMPs and filters for high novelty using customizable similarity and identity thresholds (see the sketch after this list)
  • Ensemble Averaging – Provides stable activity predictions by averaging across the five models from 5-fold cross-validation
  • Flexible Configuration – Thresholds, model parameters, and output specifications are easily adjusted through notebook parameters
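
For intuition, the snippet below is a minimal stand-in for the novelty filter using Python's built-in difflib. The actual filter lives in NoveltyEvaluation.py and uses alignment-based similarity and identity scores against ReferenceAMPs.csv, so treat the scoring here as illustrative only.

from difflib import SequenceMatcher

SIM_THRESHOLD = 0.45  # default from FullPipeline.ipynb; lower = stricter

def is_novel(candidate, reference_amps, sim_threshold=SIM_THRESHOLD):
    """Keep a candidate only if its best similarity to any reference AMP
    stays below the threshold."""
    best = max(SequenceMatcher(None, candidate, ref).ratio() for ref in reference_amps)
    return best < sim_threshold

# Toy usage with placeholder sequences
references = ["GIGKFLHSAKKFGKAFVGEIMNS", "KWKLFKKIGAVLKVL"]
print(is_novel("ACDEFGHIKLMNPQRSTVWY", references))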

🛠 Getting Started

Prerequisites

  • Python: 3.11.9 (recommended)
  • Hardware: GPU recommended for training (CPU-only supported for inference)
  • Storage: ~5 GB for models and datasets

1. Clone the Repository

git clone https://github.com/IBPA/ARCADIAMP.git
cd ARCADIAMP

2. Set Up the Conda Environment & Install Dependencies

# Create and activate conda environment
conda create -n arcadiamp python=3.11.9
conda activate arcadiamp

# Install required packages
pip install -r requirements.txt

3. Verify Installation

# Quick test to verify installation
python -c "import torch; import evodiff; import pandas as pd; print('Installation successful!')"

Note: On first run, some models may need to download pretrained weights (~500MB), which will be cached automatically.


🗂 Project Structure

ARCADIAMP/
├── README.md                           # This file
├── requirements.txt                    # Python dependencies
├── FullPipeline.ipynb                  # 🚀 Main notebook: Full pipeline workflow
├── NoveltyEvaluation.py                # Sequence novelty evaluation functions
├── ReferenceAMPs.csv                   # Reference AMPs for novelty evaluation
│
├── GenerativeModel/                    # D3PM generative model components
│   ├── config/
│   │   └── config38M.json              # Model configuration file
│   │
│   ├── data/                           # Training data directory
│   │   ├── DatasetPreparation/         # Scripts for preparing training datasets
│   │   │   └── create_peptide_dataset_complete_with_toxicity.ipynb
│   │   │
│   │   ├── consensus.fasta             # AMP sequences for training
│   │   ├── sample_weights.json         # Sample weights for training sequences
│   │   ├── lengths_and_offsets.npz     # Sequence processing indices
│   │   └── splits.json                 # Train/validation/test split indices
│   │
│   ├── notebooks/
│   │   ├── Train_GenerativeModel.ipynb # Train the generative model
│   │   └── GenerateCandidates.ipynb    # Generate AMP candidates
│   ├── src/                            # Core generative model source code
│   │   ├── D3PM_collaters.py           # Data collation utilities
│   │   ├── D3PM_datasets.py            # Dataset handling
│   │   ├── D3PM_losses.py              # Loss functions
│   │   ├── D3PM_train.py               # Training utilities
│   │   └── D3PM_utils.py               # General utilities
│   └── models/                         # Directory for trained models
│       └── trained_generative_model.tar # Pre-trained generative model
│
├── AMPClassifier/                      # AMP activity classification
│   ├── notebooks/
│   │   ├── Train_Coarse_Grained.ipynb  # Train coarse-grained AMP classifier
│   │   ├── Train_Fine_Grained.ipynb    # Train fine-grained strong AMP classifier (MIC <8 μg/ml)
│   │   ├── Predict_Coarse_Grained.ipynb # Predict using coarse-grained model
│   │   └── Predict_Fine_Grained.ipynb  # Predict using fine-grained model
│   └── models/                         # Trained classifier models
│       ├── 90K_WEIGHTED_CLASS_AND_INSTANCES/  # Trained weights of the coarse-grained model (samplewise cross-entropy weights applied)
│       │   ├── fold_1.pth              # 5-fold CV model files
│       │   ├── fold_2.pth
│       │   ├── fold_3.pth
│       │   ├── fold_4.pth
│       │   └── fold_5.pth
│       ├── 90K_WEIGHTED_CLASS/         # Trained weights of the coarse-grained model (equal sample cross-entropy weights applied)
│       └── FINE_GRAINED_8/             # Trained weights of the fine-grained classifier (MIC threshold of 8 μg/ml)
│
├── PreliniminaryPipeline(Old)/         # Archived components of the preliminary pipeline
└── results/                            # Default output directory for results

⚡ Training & Generation Workflows

🚀 A. Full Pipeline Execution (Recommended)

For end-to-end AMP candidate generation, run FullPipeline.ipynb. This notebook orchestrates all components and is the easiest way to get started.

Key Configuration Parameters:

  • N_FINAL_PEPTIDES: Number of final AMP candidates to generate
  • SIM_THRESHOLD: Similarity threshold for novelty filtering (0.0-1.0, default: 0.45, lower = stricter)
  • IDENTITY_THRESHOLD: Identity threshold for novelty filtering (0.0-1.0, default: 0.65, lower = stricter)
  • COARSE_THRESHOLD: Coarse-grained classifier probability cutoff (default: 0.9998)
  • FINE_AMP_PROB: Fine-grained classifier probability cutoff (default: 0.8)
  • MIN_LENGTH/MAX_LENGTH: Generated peptide length range

Required File Paths:

  • TRAINED_GENERATIVE_MODEL_PATH: Trained generative model file (e.g., GenerativeModel/models/trained_generative_model.tar)
  • COARSE_GRAINED_MODEL_DIR: Coarse-grained classifier directory (e.g., AMPClassifier/models/90K_WEIGHTED_CLASS_AND_INSTANCES/)
  • FINE_GRAINED_MODEL_DIR: Fine-grained classifier directory (e.g., AMPClassifier/models/FINE_GRAINED_8/)
  • OUTPUT_FILE: Output file path (e.g., results/final_novel_peptides.csv)
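
As a concrete starting point, a parameter cell near the top of FullPipeline.ipynb might look like the sketch below. Variable names follow the lists above; the values for N_FINAL_PEPTIDES and the length range are illustrative, not recommendations.

# Illustrative FullPipeline.ipynb parameter cell -- adjust values to your needs
N_FINAL_PEPTIDES = 100            # illustrative target count
SIM_THRESHOLD = 0.45              # lower = stricter novelty filtering
IDENTITY_THRESHOLD = 0.65         # lower = stricter novelty filtering
COARSE_THRESHOLD = 0.9998         # coarse-grained probability cutoff
FINE_AMP_PROB = 0.8               # fine-grained probability cutoff
MIN_LENGTH, MAX_LENGTH = 10, 50   # illustrative length range

TRAINED_GENERATIVE_MODEL_PATH = "GenerativeModel/models/trained_generative_model.tar"
COARSE_GRAINED_MODEL_DIR = "AMPClassifier/models/90K_WEIGHTED_CLASS_AND_INSTANCES/"
FINE_GRAINED_MODEL_DIR = "AMPClassifier/models/FINE_GRAINED_8/"
OUTPUT_FILE = "results/final_novel_peptides.csv"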

⚠️ Important Notes:

  • The pipeline generates candidates iteratively until enough meet all criteria
  • If too few candidates pass filters, consider loosening SIM_THRESHOLD or IDENTITY_THRESHOLD

🔧 B. Module-by-Module Execution (Advanced)

For custom training or detailed control over each component, use individual notebooks:

1. Generative Model (D3PM) – Training

Notebook: GenerativeModel/notebooks/Train_GenerativeModel.ipynb

Required Parameters:

  • TRAINING_DATA_DIR: Directory containing training data (e.g., ../data/)

    Must contain these four files:

    • consensus.fasta - AMP sequences for training
    • sample_weights.json - Sample weights for training sequences
    • lengths_and_offsets.npz - Sequence processing indices
    • splits.json - Train/validation/test split indices
  • OUTPUT_FILE: Path for saving trained model (e.g., ../models/model_epoch5.tar)

Optional Parameters:

  • LAMBDA_LOSS_CROSS_ENTROPY (default: 0.0002) - The contribution factor of the samplewise cross-entropy loss (see the original D3PM paper for details)
  • LAMBDA_LOSS_LVB (default: 1) - The contribution factor of the variational lower bound (VB) loss (see the original D3PM paper for details and the sketch after this list)
  • DIFFUSION_TIMESTEPS (default: 500) - Number of diffusion timesteps
  • N_EPOCHS (default: 500) - Training epochs
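
For context, the original D3PM paper defines a hybrid objective that adds an auxiliary cross-entropy term to the variational bound, L_λ = L_vb + λ·L_CE. Assuming the two parameters above act as the weights of those terms (the exact combination in GenerativeModel/src/D3PM_losses.py may differ), the weighting amounts to something like:

def combined_d3pm_loss(loss_vb, loss_ce, lambda_lvb=1.0, lambda_ce=0.0002):
    """Hedged sketch: weighted sum of the variational-bound and cross-entropy
    terms, in the spirit of the D3PM hybrid objective (Austin et al., 2021)."""
    return lambda_lvb * loss_vb + lambda_ce * loss_ce

print(combined_d3pm_loss(loss_vb=2.31, loss_ce=3.05))  # toy values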

📋 Dataset Preparation: Use GenerativeModel/data/DatasetPreparation/create_peptide_dataset_complete_with_toxicity.ipynb to prepare training datasets (especially samplewise cross-entropy weights) from AMP sequences and MIC/toxicity tables.

2. Generative Model (D3PM) – AMP Candidate Generation

Notebook: GenerativeModel/notebooks/GenerateCandidates.ipynb

Required Parameters:

  • MODAL_PATH: Path to trained generative model (e.g., ../models/model_epoch5.tar)
  • OUTPUT_AMP_CANDIDATE_FILE: Output CSV file path (e.g., ../results/AMP_Candidates.csv)
  • N_CANDIDATES: Number of AMP candidates to generate
  • LENGTH_CANDIDATE: Peptide length of generated AMP candidates
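
A minimal parameter cell for this notebook might look like the following; variable names are as listed above and the numeric values are purely illustrative.

# Illustrative GenerateCandidates.ipynb parameter cell
MODAL_PATH = "../models/model_epoch5.tar"
OUTPUT_AMP_CANDIDATE_FILE = "../results/AMP_Candidates.csv"
N_CANDIDATES = 1000       # illustrative
LENGTH_CANDIDATE = 30     # illustrative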

3. Coarse-Grained AMP Classifier – Training

Notebook: AMPClassifier/notebooks/Train_Coarse_Grained.ipynb

Required Parameters:

  • NON_AMP_FILE: Non-AMP sequences in FASTA format (e.g., ../data/negative_AMP_90K.fa)
  • AMP_FILE: AMP sequences in FASTA format (e.g., ../data/positive_AMP_ecoli.fa)
  • TRAIN_MIC_VALUES_FILE: CSV file with MIC values for sequences (e.g., ../data/main_ecoli_plus_90K.csv)
    • Assign large MIC values for non-AMP sequences
  • OUT_DIR: Output directory for trained models (e.g., ../trained_coarse_grained_predictors/)

Optional Parameters:

  • NON_AMP_CLASS_WEIGHT (default: 12) - Sample weight for non-AMP class
  • USE_INSTANCE_WEIGHTING (default: True) - Enable sample weighting during training
  • BATCH_SIZE (default: 64) - Training batch size
  • NUM_EPOCHS (default: 100) - Number of training epochs
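
To illustrate how NON_AMP_CLASS_WEIGHT could enter the objective, the snippet below applies a class-weighted cross-entropy loss in PyTorch. The class ordering and the exact weighting scheme used in Train_Coarse_Grained.ipynb are assumptions here, not verified details.

import torch

NON_AMP_CLASS_WEIGHT = 12.0
# Assumed class order: index 0 = non-AMP, index 1 = AMP
class_weights = torch.tensor([NON_AMP_CLASS_WEIGHT, 1.0])
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 2)           # toy batch: 4 sequences, 2 classes
labels = torch.tensor([0, 1, 1, 0])  # toy labels
print(criterion(logits, labels).item())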

4. Coarse-Grained AMP Classifier – Prediction

Notebook: AMPClassifier/notebooks/Predict_Coarse_Grained.ipynb

Required Parameters:

  • COARSE_MODEL_DIRECTORY: Directory containing trained model (e.g., ../trained_coarse_grained_predictors/)
  • TEST_FILE: Input CSV file with sequences to predict (e.g., ../data/Testset.csv)
  • OUTPUT_FILE: Output CSV file for results (e.g., ../prediction_results/Coarse_Prediction_Results.csv)
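
As noted under Ensemble Averaging above, the reported probability is the mean over the five fold models. A toy illustration of that averaging and cutoff step, assuming per-fold probabilities have already been computed (the notebook itself handles loading the fold_*.pth models and scoring sequences):

import numpy as np

COARSE_THRESHOLD = 0.9998  # default cutoff used in FullPipeline.ipynb

# Toy per-fold AMP probabilities: shape (5 folds, 3 candidate sequences)
fold_probs = np.array([
    [0.9999, 0.42, 0.97],
    [0.9997, 0.39, 0.99],
    [0.9999, 0.45, 0.95],
    [0.9998, 0.40, 0.98],
    [0.9999, 0.41, 0.96],
])

mean_probs = fold_probs.mean(axis=0)   # ensemble average per sequence
keep = mean_probs >= COARSE_THRESHOLD  # candidates passed on for further screening
print(mean_probs, keep)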

5. Fine-Grained AMP Classifier – Training

Notebook: AMPClassifier/notebooks/Train_Fine_Grained.ipynb

Required Parameters:

  • TRAIN_MIC_VALUES_FILE: CSV file with MIC values for AMP sequences only (e.g., ../data/main_ecoli.csv)
    • ⚠️ Important: Do NOT include non-AMP sequences in fine-grained training
  • OUT_DIR: Output directory for trained models (e.g., ../trained_fine_grained_predictors/)

Optional Parameters:

  • THRESHOLD (default: 8) - MIC threshold (μg/ml) for positive/negative classification
  • USE_INSTANCE_WEIGHTING (default: True) - Enable sample weighting during training
  • BATCH_SIZE (default: 32) - Training batch size
  • NUM_EPOCHS (default: 100) - Number of training epochs
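
The THRESHOLD parameter turns continuous MIC values into binary labels for training. Below is a hedged sketch of that labeling step with pandas; the column names 'sequence' and 'MIC' are assumptions, and the notebook's expected schema may differ.

import pandas as pd

THRESHOLD = 8  # MIC threshold in μg/ml

# Toy MIC table; in practice this comes from TRAIN_MIC_VALUES_FILE (AMP sequences only)
df = pd.DataFrame({
    "sequence": ["PEPTIDEA", "PEPTIDEC", "PEPTIDEK"],
    "MIC": [2.0, 16.0, 6.5],
})
df["label"] = (df["MIC"] < THRESHOLD).astype(int)  # 1 = strong AMP (MIC below threshold)
print(df)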

6. Fine-Grained AMP Classifier – Prediction

Notebook: AMPClassifier/notebooks/Predict_Fine_Grained.ipynb

Required Parameters:

  • FINE_MODEL_DIRECTORY: Directory containing trained model (e.g., ../trained_fine_grained_predictors/)
  • TEST_FILE: Input CSV file with sequences to predict (e.g., ../data/Testset.csv)
  • OUTPUT_FILE: Output CSV file for results (e.g., ../prediction_results/Fine_Prediction_Results.csv)

βš™οΈ Configuration Tips & Troubleshooting

Hardware Configuration

  • CPU-only usage: Set device = 'cpu' in all notebooks. Training will be slow but inference remains feasible for small batches
  • GPU memory issues: Reduce batch_size if you encounter Out-of-Memory errors, or enable mixed precision with torch.cuda.amp
  • Multi-GPU setup: The notebooks are configured for single GPU usage. Modify device assignment for multi-GPU setups
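
In practice, device selection and the torch.cuda.amp pattern mentioned above look roughly like this (toy model for illustration; adapt it to the actual training loops in the notebooks):

import torch

# Pick the device used by the notebooks (set device = 'cpu' to force CPU-only runs)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Minimal mixed-precision training step with torch.cuda.amp
model = torch.nn.Linear(16, 2).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 16, device=device)
y = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=(device == "cuda")):
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
print(f"device={device}, loss={loss.item():.4f}")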

Threshold Optimization

  • Threshold tuning: Examine ROC/PR curves from the test notebooks to select operating points that balance precision vs. recall
  • Too few candidates: If the pipeline generates insufficient candidates, try:
    • Lowering SIM_THRESHOLD or IDENTITY_THRESHOLD (less strict novelty filtering)
    • Lowering COARSE_THRESHOLD or FINE_AMP_PROB (less strict activity prediction)
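
To pick a precision/recall operating point as described in the threshold-tuning bullet above (assuming scikit-learn is available in your environment), you can scan the precision-recall curve of held-out predictions:

import numpy as np
from sklearn.metrics import precision_recall_curve  # assumes scikit-learn is installed

# Toy ground-truth labels and predicted AMP probabilities from a held-out set
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.20, 0.70, 0.60])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    print(f"cutoff={t:.2f}  precision={p:.2f}  recall={r:.2f}")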

Common Issues & Solutions

🔧 Installation Issues:

  • If evodiff fails to install: try pip install --no-deps evodiff, then install the missing dependencies manually
  • For fair-esm compatibility issues: Ensure PyTorch version compatibility

📊 Training Issues:

  • Long training times: Use GPU acceleration and consider reducing dataset size for initial tests
  • Memory errors during training: Reduce batch size or use gradient accumulation
  • Poor model performance: Check data quality, class balance, and consider adjusting sample weights
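
When memory is the bottleneck, gradient accumulation keeps the effective batch size while using smaller per-step batches. A toy illustration of the pattern (adapt it to the training loops in the notebooks):

import torch

# Effective batch size = per-step batch size * ACCUM_STEPS
ACCUM_STEPS = 4
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

optimizer.zero_grad()
for _ in range(ACCUM_STEPS):
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y) / ACCUM_STEPS
    loss.backward()  # gradients accumulate across the micro-batches
optimizer.step()
optimizer.zero_grad()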

🧬 Generation Issues:

  • Generated sequences are too similar: Increase temperature or reduce conditioning strength
  • Low novelty scores: The reference database may need updating, or the similarity/identity thresholds may need adjustment

💾 File Path Issues:

  • Always use forward slashes (/) in paths, even on Windows
  • Ensure all required data files exist before running notebooks
  • Check that output directories are writable
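
A quick pre-flight check along these lines (using the default paths from this README) can catch missing inputs before a long notebook run:

from pathlib import Path

# Verify that required model files/directories exist before launching a notebook
required = [
    Path("GenerativeModel/models/trained_generative_model.tar"),
    Path("AMPClassifier/models/90K_WEIGHTED_CLASS_AND_INSTANCES"),
    Path("AMPClassifier/models/FINE_GRAINED_8"),
]
missing = [str(p) for p in required if not p.exists()]
print("Missing:", missing if missing else "none")

# Create the output directory if it does not exist yet
Path("results").mkdir(parents=True, exist_ok=True)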

📫 Contact & Support


📚 References & Citations

Core Technologies

  • D3PM Generative Model: Austin et al., "Structured Denoising Diffusion Models in Discrete State-Spaces," NeurIPS 2021 (arXiv:2107.03006)
  • EvoDiff-D3PM: Alamdari et al., "Protein generation with evolutionary diffusion: sequence is all you need," bioRxiv 2023 (GitHub: https://github.com/microsoft/evodiff)
  • ESM-2 Protein Model: Lin et al., "Evolutionary-scale prediction of atomic-level protein structure with a language model," Science 2023 (doi:10.1126/science.ade2574; GitHub: https://github.com/facebookresearch/esm)

Citation

  • Markakis et al. "Discovery of potent low-toxicity antimicrobial peptides through diffusion modeling" (under review)

🪪 License

Licensed under the Apache 2.0 License.


Happy peptide designing! 🧬✨

