GPU-Accelerated MCMC Sampling for Protein Sequences with Codon-Level Mutations
Genie 2.0 is a high-performance tool for generating protein sequences using Direct Coupling Analysis (DCA) models combined with biologically realistic codon substitution dynamics. It implements efficient MCMC sampling on GPUs with two variants:
- Genie: DNA codon-aware evolution with Metropolis-Gibbs sampling
- Genie-AA: Amino acid-only evolution with standard Gibbs sampling
- Features
- Installation
- Quick Start
- Usage
- Command-Line Arguments
- Output Files
- Algorithm Overview
- Performance
- Requirements
- Examples
- Citation
- License
- Codon-Aware Sampling: Biologically realistic single-nucleotide mutations at DNA level
- Hybrid MCMC: Combined Metropolis-Hastings and Gibbs sampling for better mixing
- Reference-Based: Optional convergence tracking against real sequence data
- GPU-Accelerated: Full CUDA support with PyTorch JIT compilation (2-3x speedup)
- Flexible Input: Start from existing sequences or random initialization
- Fully vectorized GPU kernels with zero CPU loops
- Pre-computed codon mutation networks for O(1) neighbor lookups
- Batched random number generation for improved GPU efficiency
- Real-time Pearson correlation tracking for convergence monitoring
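To illustrate what a pre-computed codon mutation network could look like, here is a minimal sketch that tabulates, for each of the 64 codons, the codons reachable by a single-nucleotide change. The names `build_neighbor_table` and `NEIGHBORS`, and the choice to exclude stop codons, are assumptions made for this example and are not taken from the Genie source.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))


def build_neighbor_table(exclude_stops=True):
    """Return a list where entry i holds the indices of the codons that
    differ from CODONS[i] by exactly one nucleotide."""
    index = {codon: i for i, codon in enumerate(CODONS)}
    table = []
    for codon in CODONS:
        neighbors = []
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                # Assumption: mutations into stop codons are not allowed.
                if exclude_stops and CODON_TO_AA[mutant] == "*":
                    continue
                neighbors.append(index[mutant])
        table.append(neighbors)
    return table


# Built once up front; afterwards a neighbor lookup is a single list index.
NEIGHBORS = build_neighbor_table()
```

With such a table, the single-nucleotide neighbors of a codon are available as `NEIGHBORS[CODONS.index("ATG")]` without recomputing any mutations during sampling.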
    pip install genie-dca

    git clone https://github.com/spqb/Genie.py.git
    cd Genie.py
    pip install .

This installs two command-line tools:
- `genie` - Codon-aware evolution
- `genie-aa` - Amino acid evolution
    genie -p params.dat -n 1000 --num_iterations 50000 -o output_folder

    genie-aa -p params.dat -n 1000 --num_iterations 50000 -o output_folder

    # Generate sequences from scratch
    genie -p params.dat -n 1000 --num_iterations 50000 -o results/
    # Start from existing sequences
    genie -c init_sequences.fasta -p params.dat --num_iterations 50000 -o results/

    # Generate sequences from scratch
    genie-aa -p params.dat -n 1000 --num_iterations 50000 -o results/
    # Start from existing sequences
    genie-aa -c init_sequences.fasta -p params.dat --num_iterations 50000 -o results/

    # Reconstruct final sequences from mutation log
    reconstruct_chains results/
    # Reconstruct sequences at specific timesteps
    reconstruct_at_timesteps results/ --timesteps "0,100,500,1000"

    from Genie import reconstruct_at_timesteps, reconstruct_chains_from_log
    from adabmDCA.fasta import get_tokens

    # Reconstruct sequences at specific timesteps
    sequences = reconstruct_at_timesteps(
        initial_chains_file="results/initial_chains.fasta",
        mutation_log_file="results/mutation_log.csv",
        timesteps=[0, 100, 500, 1000],
        alphabet="protein"
    )
    # Returns: torch.Tensor of shape (len(timesteps), n_chains, L)

    # Reconstruct and validate final sequences
    tokens = get_tokens(alphabet="protein")
    reconstructed_seqs, headers = reconstruct_chains_from_log(
        initial_chains_file="results/initial_chains.fasta",
        mutation_log_file="results/mutation_log.csv",
        tokens=tokens
    )

| Argument | Short | Description |
|---|---|---|
| `--path_params` | `-p` | DCA model parameters file (.dat) |
| `--num_iterations` | | Number of MCMC iterations |
| Argument | Short | Default | Description |
|---|---|---|---|
| `--output` | `-o` | `DCA_evolution` | Output directory |
| `--num_chains` | `-n` | None | Number of sequences (required if not using `-c`) |
| `--path_chains` | `-c` | None | Initial sequences (FASTA format) |
| `--seq_index` | | None | Replicate single sequence from `-c` file |
| `--save_steps` | | 100 | Checkpoint interval or comma-separated list (e.g., "100,500,1000") |
| `--device` | | auto | Device: 'cuda' or 'cpu' |
| `--dtype` | | float32 | Data type: float32 or float64 |
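Since `--save_steps` accepts either a checkpoint interval or a comma-separated list, the two forms can be thought of as expanding to the same kind of object: a sorted list of checkpoint iterations. The sketch below shows one way to do that expansion. The function `parse_save_steps` is hypothetical, and details such as whether iteration 0 or steps beyond `--num_iterations` are kept are assumptions; the README does not describe how Genie itself parses the flag.

```python
def parse_save_steps(value, num_iterations):
    """Expand a --save_steps style value into sorted checkpoint iterations.

    A value containing commas is read as an explicit list; anything else
    is read as a fixed interval (assumption for this sketch).
    """
    text = str(value).strip()
    if "," in text:
        steps = {int(tok) for tok in text.split(",") if tok.strip()}
    else:
        interval = int(text)
        if interval <= 0:
            raise ValueError("checkpoint interval must be positive")
        steps = set(range(interval, num_iterations + 1, interval))
    # Assumption: checkpoints outside (0, num_iterations] are dropped.
    return sorted(s for s in steps if 0 < s <= num_iterations)


interval_steps = parse_save_steps(100, 300)
explicit_steps = parse_save_steps("100,500,1000", 1000)
```

Here `interval_steps` holds every 100th iteration up to 300, and `explicit_steps` holds exactly the three listed iterations.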
| Argument | Default | Description |
|---|---|---|
| `--p_metropolis` | 0.5 | Metropolis vs Gibbs ratio (0.0-1.0) |
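One plausible reading of `--p_metropolis` is that each update is a Metropolis move with that probability and a Gibbs move otherwise. The toy sketch below implements that reading for a single site of an integer-encoded sequence. It works directly on states rather than codons, takes the energy as an arbitrary callable, and uses the hypothetical name `hybrid_step`; none of this is taken from the Genie implementation.

```python
import math
import random


def hybrid_step(seq, energy, q, p_metropolis, rng):
    """Update one random site of `seq` in place and return the move type.

    seq: list of ints in range(q); energy: callable mapping seq to a float
    (lower is more probable); rng: a random.Random instance.
    """
    i = rng.randrange(len(seq))
    if rng.random() < p_metropolis:
        # Metropolis: propose a different state, accept with min(1, exp(-dE)).
        old = seq[i]
        e_old = energy(seq)
        seq[i] = rng.choice([a for a in range(q) if a != old])
        d_e = energy(seq) - e_old
        if d_e > 0 and rng.random() >= math.exp(-d_e):
            seq[i] = old  # proposal rejected, restore the old state
        return "metropolis"
    # Gibbs: draw the new state from its conditional distribution.
    energies = []
    for a in range(q):
        seq[i] = a
        energies.append(energy(seq))
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min)) for e in energies]
    seq[i] = rng.choices(range(q), weights=weights)[0]
    return "gibbs"


# Toy energy that strongly favors state 0 at every site.
def toy_energy(s):
    return 20.0 * sum(x != 0 for x in s)


rng = random.Random(0)
chain = [2, 1, 2, 1, 2]
for _ in range(2000):
    hybrid_step(chain, toy_energy, q=3, p_metropolis=0.5, rng=rng)
```

Setting `p_metropolis=0.0` reduces the sketch to pure Gibbs sampling and `1.0` to pure Metropolis, which matches the 0.0-1.0 range given in the table.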
| Argument | Default | Description |
|---|---|---|
| `--alphabet` | protein | Alphabet type: 'protein', 'rna', 'dna', or custom |
- `reconstruct_chains`: Takes output folder as positional argument
- `reconstruct_at_timesteps`:
  - `folder` - Output folder (positional)
  - `--timesteps` - Comma-separated list (e.g., "0,100,500,1000")
All files are saved in the output directory specified by -o.
| File | Description |
|---|---|
| `initial_chains.fasta` | Starting sequences (before evolution) |
| `final_chains.fasta` | Final sequences (after all iterations) |
| `mutation_log.csv` | Log of all mutations at checkpoints |
File: mutation_log.csv
CSV file tracking mutations at checkpoints:
| Column | Description |
|---|---|
| `iteration` | Checkpoint iteration number |
| `chain_id` | Sequence identifier |
| `position` | Position in sequence (0-indexed) |
| `new_aa` | New amino acid at this position |
Example:

    iteration,chain_id,position,new_aa
    100,seq_0,15,A
    100,seq_0,42,G
    100,seq_1,23,L
    200,seq_0,15,V
    ...

Real-time progress showing:
- Iteration number and speed (iter/sec)
- Elapsed time
- Compilation status (first iteration)
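The `mutation_log.csv` format described above lends itself to a simple replay: start from the initial sequences and apply every logged change up to the checkpoint of interest. The sketch below shows the idea using only the standard library. It assumes that `chain_id` values match the keys of the initial sequences and that `new_aa` is a single character; the packaged `reconstruct_chains` and `reconstruct_at_timesteps` tools remain the supported way to do this, and `replay_mutations` is a hypothetical name.

```python
import csv
import io


def replay_mutations(initial, log_text, up_to):
    """Apply logged mutations with iteration <= up_to to the initial chains.

    initial: dict mapping chain_id to sequence string.
    log_text: contents of a mutation log with the four columns shown above.
    """
    chains = {cid: list(seq) for cid, seq in initial.items()}
    rows = list(csv.DictReader(io.StringIO(log_text)))
    # Stable sort by checkpoint so later mutations overwrite earlier ones.
    rows.sort(key=lambda r: int(r["iteration"]))
    for row in rows:
        if int(row["iteration"]) > up_to:
            break
        chains[row["chain_id"]][int(row["position"])] = row["new_aa"]
    return {cid: "".join(seq) for cid, seq in chains.items()}


# Toy data: two placeholder chains and the example log rows.
initial = {"seq_0": "M" * 50, "seq_1": "M" * 50}
log_text = (
    "iteration,chain_id,position,new_aa\n"
    "100,seq_0,15,A\n"
    "100,seq_0,42,G\n"
    "100,seq_1,23,L\n"
    "200,seq_0,15,V\n"
)
at_100 = replay_mutations(initial, log_text, up_to=100)
at_200 = replay_mutations(initial, log_text, up_to=200)
```

In this toy run, position 15 of `seq_0` holds `A` at checkpoint 100 and is overwritten by `V` at checkpoint 200.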
Genie (codon-aware):
- Initialization: Load DCA model, build codon mutation network
- Sequence Translation: Convert amino acids to codons
- MCMC Sampling: Hybrid Metropolis-Gibbs with codon mutations
- Convergence Tracking: Optional Pearson correlation monitoring

Genie-AA (amino acid only):
- Initialization: Load DCA model
- Gibbs Sampling: Standard position-wise sampling
- Convergence Tracking: Optional correlation monitoring
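As an illustration of the convergence-tracking step, the sketch below compares single-site frequencies of sampled sequences against those of a reference set using the Pearson correlation. Which statistics Genie actually correlates (for example single-site or pairwise) is not stated in this README, so the choice of single-site frequencies and the helper names `site_frequencies` and `pearson` are assumptions for this example.

```python
import math
from collections import Counter


def site_frequencies(seqs, alphabet):
    """Flattened per-position symbol frequencies for equal-length sequences."""
    freqs = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        freqs.extend(counts[a] / len(seqs) for a in alphabet)
    return freqs


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data: a reference set and a sample with slightly different statistics.
alphabet = "ACG"
reference = ["AC", "AC", "AG"]
sampled = ["AC", "AG", "AG"]
score = pearson(
    site_frequencies(sampled, alphabet),
    site_frequencies(reference, alphabet),
)
```

A score approaching 1 as iterations proceed would indicate that the sampled statistics are approaching those of the reference data.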
Hardware: NVIDIA RTX 4090, 1000 sequences, L=100
| Mode | Iterations/sec | Speedup |
|---|---|---|
| Genie (compiled) | ~45-50 | 2.5x |
| Genie (eager) | ~18-20 | 1.0x |
| Genie-AA (compiled) | ~120-140 | 6.5x |
Note: First iteration includes ~10-30s JIT compilation overhead
torch>=2.0.0
numpy>=1.20.0
pandas>=1.3.0
adabmDCA>=1.0.0

Hardware:
- Minimum: CPU with 4GB RAM
- Recommended: NVIDIA GPU (8GB+ VRAM) with CUDA 11.7+
    # Generate 1000 sequences with codon awareness
    genie -p example_data/pf76/params.dat -n 1000 --num_iterations 50000 -o results/pf76
    # Generate amino acid sequences only
    genie-aa -p example_data/pf76/params.dat -n 1000 --num_iterations 50000 -o results/pf76_aa

    # Save mutations at specific iterations
    genie -p params.dat -n 1000 --num_iterations 10000 --save_steps "100,500,1000,5000,10000" -o results/
    # Reconstruct sequences at those timesteps
    reconstruct_at_timesteps results/ --timesteps "0,100,500,1000,5000,10000"

This software is based on the following article:
    @article{doi:10.1073/pnas.2406807121,
      author = {Leonardo Di Bari and Matteo Bisardi and Sabrina Cotogno and Martin Weigt and Francesco Zamponi},
      title = {Emergent time scales of epistasis in protein evolution},
      journal = {Proceedings of the National Academy of Sciences},
      volume = {121},
      number = {40},
      pages = {e2406807121},
      year = {2024},
      doi = {10.1073/pnas.2406807121},
      URL = {https://www.pnas.org/doi/abs/10.1073/pnas.2406807121},
      eprint = {https://www.pnas.org/doi/pdf/10.1073/pnas.2406807121},
    }

A Julia version of Genie is also available: Genie.jl
This project is licensed under the MIT License - see the LICENSE file for details.
- Built on the adabmDCA library
- PyTorch team for excellent GPU optimization tools