ARCADIAMP: a computational platform based on discrete denoising diffusion probabilistic models (D3PM) and two-stage ESM-2 binary classifiers that generates, classifies, and recommends potent AMPs with high activity, low toxicity, and high bioavailability.

License

Notifications You must be signed in to change notification settings

IBPA/ARCADIAMP

Repository files navigation

ARCADIAMP: Generative AI Workflow for Screening Highly Active Antimicrobial Peptides (AMPs)

Figure 1. Overview of ARCADIAMP

Welcome to ARCADIAMP! This repository provides all the code, notebooks, and trained models needed to:

  1. Generate novel antimicrobial peptide (AMP) candidates using a discrete denoising diffusion probabilistic model (D3PM)
  2. Predict AMP activity with coarse-grained and fine-grained ESM-2 classifier ensembles for screening AMPs and strong AMPs, respectively
  3. Filter out generated AMP candidates with low sequence novelty by comparing them against known AMPs

The workflow is fully notebook-driven and runs in JupyterLab. While CPU-only execution is supported, a GPU is strongly recommended for optimal training and inference speed.

🚀 Quick Start: For immediate results, run FullPipeline.ipynb to generate AMP candidates end-to-end, or follow the module-by-module workflow for custom training and inference.


✨ Features

  • Self-Learning EvoDiff-D3PM AMP Generator – Iteratively retrains the generator on selected generated sequences with low predicted MIC values
  • Two-Stage ESM-2 Classifiers – Combines coarse-grained and fine-grained activity classifiers that first screen for AMPs, then for strong AMPs (low MIC threshold, e.g., <8 μg/ml)
  • Sequence Novelty Filter – Aligns generated candidates to well-known AMPs and filters for high novelty using customizable similarity and identity thresholds (see the sketch after this list)
  • Ensemble Averaging – Provides stable activity predictions by averaging across the five models from 5-fold cross-validation
  • Flexible Configuration – Thresholds, model parameters, and output specifications are easily adjusted through notebook parameters
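
For intuition, the snippet below is a minimal stand-in for the novelty filter using Python's built-in difflib. The actual filter lives in NoveltyEvaluation.py and uses alignment-based similarity and identity scores against ReferenceAMPs.csv, so treat the scoring here as illustrative only.

from difflib import SequenceMatcher

SIM_THRESHOLD = 0.45  # default from FullPipeline.ipynb; lower = stricter

def is_novel(candidate, reference_amps, sim_threshold=SIM_THRESHOLD):
    """Keep a candidate only if its best similarity to any reference AMP
    stays below the threshold."""
    best = max(SequenceMatcher(None, candidate, ref).ratio() for ref in reference_amps)
    return best < sim_threshold

# Toy usage with placeholder sequences
references = ["GIGKFLHSAKKFGKAFVGEIMNS", "KWKLFKKIGAVLKVL"]
print(is_novel("ACDEFGHIKLMNPQRSTVWY", references))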

🛠 Getting Started

Prerequisites

  • Python: 3.11.9 (recommended)
  • Hardware: GPU recommended for training (CPU-only supported for inference)
  • Storage: ~5 GB for models and datasets

1. Clone the Repository

git clone https://github.com/IBPA/ARCADIAMP.git
cd ARCADIAMP

2. Set Up the Conda Environment & Install Dependencies

# Create and activate conda environment
conda create -n arcadiamp python=3.11.9
conda activate arcadiamp

# Install required packages
pip install -r requirements.txt

3. Verify Installation

# Quick test to verify installation
python -c "import torch; import evodiff; import pandas as pd; print('Installation successful!')"

Note: On first run, some models may need to download pretrained weights (~500MB), which will be cached automatically.


🗂 Project Structure

ARCADIAMP/
├── README.md                           # This file
├── requirements.txt                    # Python dependencies
├── FullPipeline.ipynb                  # 🚀 Main notebook: Full pipeline workflow
├── NoveltyEvaluation.py                # Sequence novelty evaluation functions
├── ReferenceAMPs.csv                   # Reference AMPs for novelty evaluation
│
├── GenerativeModel/                    # D3PM generative model components
│   ├── config/
│   │   └── config38M.json              # Model configuration file
│   │
│   ├── data/                           # Training data directory
│   │   ├── DatasetPreparation/         # Scripts for preparing training datasets
│   │   │   └── create_peptide_dataset_complete_with_toxicity.ipynb
│   │   │
│   │   ├── consensus.fasta             # AMP sequences for training
│   │   ├── sample_weights.json         # Sample weights for training sequences
│   │   ├── lengths_and_offsets.npz     # Sequence processing indices
│   │   └── splits.json                 # Train/validation/test split indices
│   │
│   ├── notebooks/
│   │   ├── Train_GenerativeModel.ipynb # Train the generative model
│   │   └── GenerateCandidates.ipynb    # Generate AMP candidates
│   ├── src/                            # Core generative model source code
│   │   ├── D3PM_collaters.py           # Data collation utilities
│   │   ├── D3PM_datasets.py            # Dataset handling
│   │   ├── D3PM_losses.py              # Loss functions
│   │   ├── D3PM_train.py               # Training utilities
│   │   └── D3PM_utils.py               # General utilities
│   └── models/                         # Directory for trained models
│       └── trained_generative_model.tar # Pre-trained generative model
│
├── AMPClassifier/                      # AMP activity classification
│   ├── notebooks/
│   │   ├── Train_Coarse_Grained.ipynb  # Train coarse-grained AMP classifier
│   │   ├── Train_Fine_Grained.ipynb    # Train fine-grained strong AMP classifier (MIC <8 μg/ml)
│   │   ├── Predict_Coarse_Grained.ipynb # Predict using coarse-grained model
│   │   └── Predict_Fine_Grained.ipynb  # Predict using fine-grained model
│   └── models/                         # Trained classifier models
│       ├── 90K_WEIGHTED_CLASS_AND_INSTANCES/  # Trained weights of the coarse-grained model (samplewise cross-entropy weights applied)
│       │   ├── fold_1.pth              # 5-fold CV model files
│       │   ├── fold_2.pth
│       │   ├── fold_3.pth
│       │   ├── fold_4.pth
│       │   └── fold_5.pth
│       ├── 90K_WEIGHTED_CLASS/         # Trained weights of the coarse-grained model (equal sample cross-entropy weights applied)
│       └── FINE_GRAINED_8/             # Trained weights of the fine-grained classifier (MIC threshold of 8 μg/ml)
│
├── PreliniminaryPipeline(Old)/         # Archived components of the preliminary pipeline
└── results/                            # Default output directory for results

⚡ Training & Generation Workflows

🚀 A. Full Pipeline Execution (Recommended)

For end-to-end AMP candidate generation, run FullPipeline.ipynb. This notebook orchestrates all components and is the easiest way to get started.

Key Configuration Parameters:

  • N_FINAL_PEPTIDES: Number of final AMP candidates to generate
  • SIM_THRESHOLD: Similarity threshold for novelty filtering (0.0-1.0, default: 0.45, lower = stricter)
  • IDENTITY_THRESHOLD: Identity threshold for novelty filtering (0.0-1.0, default: 0.65, lower = stricter)
  • COARSE_THRESHOLD: Coarse-grained classifier probability cutoff (default: 0.9998)
  • FINE_AMP_PROB: Fine-grained classifier probability cutoff (default: 0.8)
  • MIN_LENGTH/MAX_LENGTH: Generated peptide length range

Required File Paths:

  • TRAINED_GENERATIVE_MODEL_PATH: Trained generative model file (e.g., GenerativeModel/models/trained_generative_model.tar)
  • COARSE_GRAINED_MODEL_DIR: Coarse-grained classifier directory (e.g., AMPClassifier/models/90K_WEIGHTED_CLASS_AND_INSTANCES/)
  • FINE_GRAINED_MODEL_DIR: Fine-grained classifier directory (e.g., AMPClassifier/models/FINE_GRAINED_8/)
  • OUTPUT_FILE: Output file path (e.g., results/final_novel_peptides.csv)
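
As a concrete starting point, a parameter cell near the top of FullPipeline.ipynb might look like the sketch below. Variable names follow the lists above; the values for N_FINAL_PEPTIDES and the length range are illustrative, not recommendations.

# Illustrative FullPipeline.ipynb parameter cell -- adjust values to your needs
N_FINAL_PEPTIDES = 100            # illustrative target count
SIM_THRESHOLD = 0.45              # lower = stricter novelty filtering
IDENTITY_THRESHOLD = 0.65         # lower = stricter novelty filtering
COARSE_THRESHOLD = 0.9998         # coarse-grained probability cutoff
FINE_AMP_PROB = 0.8               # fine-grained probability cutoff
MIN_LENGTH, MAX_LENGTH = 10, 50   # illustrative length range

TRAINED_GENERATIVE_MODEL_PATH = "GenerativeModel/models/trained_generative_model.tar"
COARSE_GRAINED_MODEL_DIR = "AMPClassifier/models/90K_WEIGHTED_CLASS_AND_INSTANCES/"
FINE_GRAINED_MODEL_DIR = "AMPClassifier/models/FINE_GRAINED_8/"
OUTPUT_FILE = "results/final_novel_peptides.csv"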

⚠️ Important Notes:

  • The pipeline generates candidates iteratively until enough meet all criteria
  • If too few candidates pass filters, consider loosening SIM_THRESHOLD or IDENTITY_THRESHOLD

🔧 B. Module-by-Module Execution (Advanced)

For custom training or detailed control over each component, use individual notebooks:

1. Generative Model (D3PM) – Training

Notebook: GenerativeModel/notebooks/Train_GenerativeModel.ipynb

Required Parameters:

  • TRAINING_DATA_DIR: Directory containing training data (e.g., ../data/)

    Must contain these four files:

    • consensus.fasta - AMP sequences for training
    • sample_weights.json - Sample weights for training sequences
    • lengths_and_offsets.npz - Sequence processing indices
    • splits.json - Train/validation/test split indices
  • OUTPUT_FILE: Path for saving trained model (e.g., ../models/model_epoch5.tar)

Optional Parameters:

  • LAMBDA_LOSS_CROSS_ENTROPY (default: 0.0002) - The contribution factor of the samplewise cross-entropy loss (see the original D3PM paper for details)
  • LAMBDA_LOSS_LVB (default: 1) - The contribution factor of the variational lower bound (VB) loss (see the original D3PM paper for details and the sketch after this list)
  • DIFFUSION_TIMESTEPS (default: 500) - Number of diffusion timesteps
  • N_EPOCHS (default: 500) - Training epochs
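
For context, the original D3PM paper defines a hybrid objective that adds an auxiliary cross-entropy term to the variational bound, L_λ = L_vb + λ·L_CE. Assuming the two parameters above act as the weights of those terms (the exact combination in GenerativeModel/src/D3PM_losses.py may differ), the weighting amounts to something like:

def combined_d3pm_loss(loss_vb, loss_ce, lambda_lvb=1.0, lambda_ce=0.0002):
    """Hedged sketch: weighted sum of the variational-bound and cross-entropy
    terms, in the spirit of the D3PM hybrid objective (Austin et al., 2021)."""
    return lambda_lvb * loss_vb + lambda_ce * loss_ce

print(combined_d3pm_loss(loss_vb=2.31, loss_ce=3.05))  # toy values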

📋 Dataset Preparation: Use GenerativeModel/data/DatasetPreparation/create_peptide_dataset_complete_with_toxicity.ipynb to prepare training datasets (especially samplewise cross-entropy weights) from AMP sequences and MIC/toxicity tables.

2. Generative Model (D3PM) – AMP Candidate Generation

Notebook: GenerativeModel/notebooks/GenerateCandidates.ipynb

Required Parameters:

  • MODAL_PATH: Path to trained generative model (e.g., ../models/model_epoch5.tar)
  • OUTPUT_AMP_CANDIDATE_FILE: Output CSV file path (e.g., ../results/AMP_Candidates.csv)
  • N_CANDIDATES: Number of AMP candidates to generate
  • LENGTH_CANDIDATE: Peptide length of generated AMP candidates
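
A minimal parameter cell for this notebook might look like the following; variable names are as listed above and the numeric values are purely illustrative.

# Illustrative GenerateCandidates.ipynb parameter cell
MODAL_PATH = "../models/model_epoch5.tar"
OUTPUT_AMP_CANDIDATE_FILE = "../results/AMP_Candidates.csv"
N_CANDIDATES = 1000       # illustrative
LENGTH_CANDIDATE = 30     # illustrative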

3. Coarse-Grained AMP Classifier – Training

Notebook: AMPClassifier/notebooks/Train_Coarse_Grained.ipynb

Required Parameters:

  • NON_AMP_FILE: Non-AMP sequences in FASTA format (e.g., ../data/negative_AMP_90K.fa)
  • AMP_FILE: AMP sequences in FASTA format (e.g., ../data/positive_AMP_ecoli.fa)
  • TRAIN_MIC_VALUES_FILE: CSV file with MIC values for sequences (e.g., ../data/main_ecoli_plus_90K.csv)
    • Assign large MIC values for non-AMP sequences
  • OUT_DIR: Output directory for trained models (e.g., ../trained_coarse_grained_predictors/)

Optional Parameters:

  • NON_AMP_CLASS_WEIGHT (default: 12) - Sample weight for non-AMP class
  • USE_INSTANCE_WEIGHTING (default: True) - Enable sample weighting during training
  • BATCH_SIZE (default: 64) - Training batch size
  • NUM_EPOCHS (default: 100) - Number of training epochs
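
To illustrate how NON_AMP_CLASS_WEIGHT could enter the objective, the snippet below applies a class-weighted cross-entropy loss in PyTorch. The class ordering and the exact weighting scheme used in Train_Coarse_Grained.ipynb are assumptions here, not verified details.

import torch

NON_AMP_CLASS_WEIGHT = 12.0
# Assumed class order: index 0 = non-AMP, index 1 = AMP
class_weights = torch.tensor([NON_AMP_CLASS_WEIGHT, 1.0])
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 2)           # toy batch: 4 sequences, 2 classes
labels = torch.tensor([0, 1, 1, 0])  # toy labels
print(criterion(logits, labels).item())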

4. Coarse-Grained AMP Classifier – Prediction

Notebook: AMPClassifier/notebooks/Predict_Coarse_Grained.ipynb

Required Parameters:

  • COARSE_MODEL_DIRECTORY: Directory containing trained model (e.g., ../trained_coarse_grained_predictors/)
  • TEST_FILE: Input CSV file with sequences to predict (e.g., ../data/Testset.csv)
  • OUTPUT_FILE: Output CSV file for results (e.g., ../prediction_results/Coarse_Prediction_Results.csv)
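
As noted under Ensemble Averaging above, the reported probability is the mean over the five fold models. A toy illustration of that averaging and cutoff step, assuming per-fold probabilities have already been computed (the notebook itself handles loading the fold_*.pth models and scoring sequences):

import numpy as np

COARSE_THRESHOLD = 0.9998  # default cutoff used in FullPipeline.ipynb

# Toy per-fold AMP probabilities: shape (5 folds, 3 candidate sequences)
fold_probs = np.array([
    [0.9999, 0.42, 0.97],
    [0.9997, 0.39, 0.99],
    [0.9999, 0.45, 0.95],
    [0.9998, 0.40, 0.98],
    [0.9999, 0.41, 0.96],
])

mean_probs = fold_probs.mean(axis=0)   # ensemble average per sequence
keep = mean_probs >= COARSE_THRESHOLD  # candidates passed on for further screening
print(mean_probs, keep)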

5. Fine-Grained AMP Classifier – Training

Notebook: AMPClassifier/notebooks/Train_Fine_Grained.ipynb

Required Parameters:

  • TRAIN_MIC_VALUES_FILE: CSV file with MIC values for AMP sequences only (e.g., ../data/main_ecoli.csv)
    • ⚠️ Important: Do NOT include non-AMP sequences in fine-grained training
  • OUT_DIR: Output directory for trained models (e.g., ../trained_fine_grained_predictors/)

Optional Parameters:

  • THRESHOLD (default: 8) - MIC threshold (μg/ml) for positive/negative classification
  • USE_INSTANCE_WEIGHTING (default: True) - Enable sample weighting during training
  • BATCH_SIZE (default: 32) - Training batch size
  • NUM_EPOCHS (default: 100) - Number of training epochs
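
The THRESHOLD parameter turns continuous MIC values into binary labels for training. Below is a hedged sketch of that labeling step with pandas; the column names 'sequence' and 'MIC' are assumptions, and the notebook's expected schema may differ.

import pandas as pd

THRESHOLD = 8  # MIC threshold in μg/ml

# Toy MIC table; in practice this comes from TRAIN_MIC_VALUES_FILE (AMP sequences only)
df = pd.DataFrame({
    "sequence": ["PEPTIDEA", "PEPTIDEC", "PEPTIDEK"],
    "MIC": [2.0, 16.0, 6.5],
})
df["label"] = (df["MIC"] < THRESHOLD).astype(int)  # 1 = strong AMP (MIC below threshold)
print(df)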

6. Fine-Grained AMP Classifier – Prediction

Notebook: AMPClassifier/notebooks/Predict_Fine_Grained.ipynb

Required Parameters:

  • FINE_MODEL_DIRECTORY: Directory containing trained model (e.g., ../trained_fine_grained_predictors/)
  • TEST_FILE: Input CSV file with sequences to predict (e.g., ../data/Testset.csv)
  • OUTPUT_FILE: Output CSV file for results (e.g., ../prediction_results/Fine_Prediction_Results.csv)

βš™οΈ Configuration Tips & Troubleshooting

Hardware Configuration

  • CPU-only usage: Set device = 'cpu' in all notebooks. Training will be slow but inference remains feasible for small batches
  • GPU memory issues: Reduce batch_size if you encounter Out-of-Memory errors, or enable mixed precision with torch.cuda.amp
  • Multi-GPU setup: The notebooks are configured for single GPU usage. Modify device assignment for multi-GPU setups
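
In practice, device selection and the torch.cuda.amp pattern mentioned above look roughly like this (toy model for illustration; adapt it to the actual training loops in the notebooks):

import torch

# Pick the device used by the notebooks (set device = 'cpu' to force CPU-only runs)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Minimal mixed-precision training step with torch.cuda.amp
model = torch.nn.Linear(16, 2).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 16, device=device)
y = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=(device == "cuda")):
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
print(f"device={device}, loss={loss.item():.4f}")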

Threshold Optimization

  • Threshold tuning: Examine ROC/PR curves from the test notebooks to select operating points that balance precision vs. recall
  • Too few candidates: If the pipeline generates insufficient candidates, try:
    • Lowering SIM_THRESHOLD or IDENTITY_THRESHOLD (less strict novelty filtering)
    • Lowering COARSE_THRESHOLD or FINE_AMP_PROB (less strict activity prediction)
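
To pick a precision/recall operating point as described in the threshold-tuning bullet above (assuming scikit-learn is available in your environment), you can scan the precision-recall curve of held-out predictions:

import numpy as np
from sklearn.metrics import precision_recall_curve  # assumes scikit-learn is installed

# Toy ground-truth labels and predicted AMP probabilities from a held-out set
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.20, 0.70, 0.60])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    print(f"cutoff={t:.2f}  precision={p:.2f}  recall={r:.2f}")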

Common Issues & Solutions

🔧 Installation Issues:

  • If evodiff fails to install: try pip install --no-deps evodiff, then install the missing dependencies manually
  • For fair-esm compatibility issues: Ensure PyTorch version compatibility

📊 Training Issues:

  • Long training times: Use GPU acceleration and consider reducing dataset size for initial tests
  • Memory errors during training: Reduce batch size or use gradient accumulation
  • Poor model performance: Check data quality, class balance, and consider adjusting sample weights
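
When memory is the bottleneck, gradient accumulation keeps the effective batch size while using smaller per-step batches. A toy illustration of the pattern (adapt it to the training loops in the notebooks):

import torch

# Effective batch size = per-step batch size * ACCUM_STEPS
ACCUM_STEPS = 4
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

optimizer.zero_grad()
for _ in range(ACCUM_STEPS):
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y) / ACCUM_STEPS
    loss.backward()  # gradients accumulate across the micro-batches
optimizer.step()
optimizer.zero_grad()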

🧬 Generation Issues:

  • Generated sequences are too similar: Increase temperature or reduce conditioning strength
  • Low novelty scores: The reference database may need updating, or the similarity/identity thresholds may need adjustment

💾 File Path Issues:

  • Always use forward slashes (/) in paths, even on Windows
  • Ensure all required data files exist before running notebooks
  • Check that output directories are writable
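
A quick pre-flight check along these lines (using the default paths from this README) can catch missing inputs before a long notebook run:

from pathlib import Path

# Verify that required model files/directories exist before launching a notebook
required = [
    Path("GenerativeModel/models/trained_generative_model.tar"),
    Path("AMPClassifier/models/90K_WEIGHTED_CLASS_AND_INSTANCES"),
    Path("AMPClassifier/models/FINE_GRAINED_8"),
]
missing = [str(p) for p in required if not p.exists()]
print("Missing:", missing if missing else "none")

# Create the output directory if it does not exist yet
Path("results").mkdir(parents=True, exist_ok=True)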

📫 Contact & Support


📚 References & Citations

Core Technologies

  • D3PM Generative Model: Austin et al., "Structured Denoising Diffusion Models in Discrete State-Spaces," NeurIPS 2021 (arXiv:2107.03006)
  • EvoDiff-D3PM: Alamdari et al., "Protein generation with evolutionary diffusion: sequence is all you need," bioRxiv 2023 (GitHub: https://github.com/microsoft/evodiff)
  • ESM-2 Protein Model: Lin et al., "Evolutionary-scale prediction of atomic-level protein structure with a language model," Science 2023 (doi:10.1126/science.ade2574; GitHub: https://github.com/facebookresearch/esm)

Citation

  • Markakis et al. "Discovery of potent low-toxicity antimicrobial peptides through diffusion modeling" (under review)

🪪 License

Licensed under the Apache 2.0 License.


Happy peptide designing! 🧬✨

