Skip to content

Latest commit

 

History

History
367 lines (275 loc) · 10 KB

File metadata and controls

367 lines (275 loc) · 10 KB

OpenMM Generate

A Python-based molecular dynamics simulation toolkit using OpenMM for generating all-atom protein trajectories at scale.

Overview

This repository provides a module for generating standard molecular dynamics simulations using OpenMM for all-atom protein representations. It is part of a larger pipeline for molecular dynamics and machine learning research, designed to run efficiently on GPU-accelerated HPC systems.

Status

⚠️ IMPORTANT: This code is currently being ported and refactored from private repositories for public release. Full documentation and tutorials will be provided within 1-2 weeks.

Features

  • Automated Protein Preparation: Download PDB files, add missing residues/atoms, solvate systems
  • Ligand Parameterization: Automatic parameterization of small molecules using GAFF
  • Flexible Simulation Options: Configurable timesteps, temperatures, and integrators
  • Efficient Data Storage: HDF5-based trajectory storage with forces and energies
  • GPU Acceleration: Multi-GPU support with automatic distribution
  • Resumable Simulations: Checkpoint-based continuation of interrupted runs
  • Batch Processing: Process multiple proteins in parallel across GPUs
  • HPC Integration: SLURM scripts for Perlmutter, Delta, and other HPC systems

Installation

Prerequisites

  • Python 3.11 or 3.12
  • CUDA 12.1+ (for GPU acceleration)
  • Conda or Mamba package manager

Environment Setup

Create the conda environment using the provided configuration:

conda env create -f conda_envs/delta.yml -n openmm_generate
conda activate openmm_generate

Or for a minimal setup:

conda env create -f conda/openmm_cu121_py312.yml -n openmm
conda activate openmm

Dependencies

Key dependencies include:

  • OpenMM 8.1+
  • OpenMM Forcefields
  • OpenFF Toolkit
  • PDBFixer
  • RDKit
  • MDTraj
  • h5py

Quick Start

Basic Usage

Prepare and simulate a single protein:

cd scripts
python batch_generate.py <pdb_id> --prepare --steps 10000 --report-steps 10

Example Commands

Run a single PDB for 1000 frames, removing ligands:

./batch_generate.py 1ABC --prepare --remove-ligands --data-dir=../data --steps 1000

Run all PDBs from a JSON file for 10,000 frames with ligands:

./batch_generate.py openmm_2023.12.20_N1.json --prepare --data-dir=../data --steps 10000

Run with custom integrator (4fs timestep at 350K):

./batch_generate.py pdb_list.json --prepare --data-dir=../data --steps 10000 \
  --integrator=integrator_350k_4fs.json

Process local PDB files from a directory:

./batch_generate.py --prepare --steps 10000 --report-steps=10 \
  --input-dir=../data/input/ --data-dir=../data/output

Multi-GPU parallel processing:

./batch_generate.py pdb_list.json --prepare --steps 100000 --report-steps=100 \
  --gpus=0,1,2,3 --pool-size=8 --data-dir=../data

Command Line Options

Core Options

  • pdbid_list: PDB IDs to process, or path to JSON file with PDB ID array
  • --prepare: Run preparation (add solvent, fix residues) if not already done
  • --prepare-implicit: Use implicit solvent model instead of explicit
  • --remove-ligands: Remove ligands instead of parameterizing them
  • --steps: Total number of simulation steps (default: 10000)
  • --report-steps: Save data every n frames (default: 1)
  • --data-dir: Output data directory (default: ../data/)
  • --input-dir: Input directory for local PDB files

Advanced Options

  • --integrator: Path to JSON file with integrator parameters (dt, temperature, etc.)
  • --gpus: Comma-separated list of GPU IDs to use (e.g., "0,1,2,3")
  • --pool-size: Number of simultaneous simulations (default: 10)
  • --batch-size: Split input list into batches of this size
  • --batch-index: Select which batch to run (for array jobs)
  • --timeout: Maximum runtime in hours (e.g., 2.5 for 2.5 hours)
  • --force: Force re-run of completed simulations

Integrator Configuration

Custom integrator parameters can be specified via JSON files:

integrator_4fs.json (4 femtosecond timestep):

{
    "dt": "0.004*picoseconds"
}

integrator_350k_4fs.json (4fs timestep at 350K):

{
    "dt": "0.004*picoseconds",
    "temperature": "350*kelvin"
}

Available parameters:

  • dt: Timestep (default: 0.002 ps)
  • temperature: Temperature (default: 300 K)
  • friction: Friction coefficient (default: 1.0/ps)
  • pressure: Pressure for barostat (default: 1.0 atm)
  • barostatInterval: Barostat update frequency (default: 25 steps)

HPC Usage

SLURM Job Submission

The repository includes SLURM scripts for various HPC systems:

Perlmutter (NERSC):

sbatch perlmutter1x4_N1000.slurm

Delta (NCSA):

sbatch delta.slurm

Example SLURM Configuration

#!/bin/bash
#SBATCH --gpus-per-task=4
#SBATCH --cpus-per-task=32
#SBATCH --time=2:00:00

export OMP_NUM_THREADS=32

srun python3 batch_generate.py pdb_list.json \
  --batch-index $SLURM_ARRAY_TASK_ID \
  --batch-size 25 \
  --pool-size 8 \
  --gpus="0,1,2,3" \
  --prepare \
  --steps=1000000 \
  --report-steps=100 \
  --timeout 1.8 \
  --data-dir="${OUTPUT_DIR}"

Output Structure

For each processed protein, the following directory structure is created:

data/
└── <pdb_id>/
    ├── raw/                      # Original PDB file
    │   └── <pdb_id>.pdb
    ├── processed/                # Prepared system
    │   ├── <pdb_id>_processed.pdb
    │   ├── <pdb_id>_processed_ligands_smiles.json
    │   ├── <pdb_id>_processed_ligands_cache.json
    │   ├── forcefield.json
    │   ├── <pdb_id>_process.log
    │   └── finished.txt
    ├── simulation/               # Simulation files
    │   ├── system.xml
    │   ├── integrator.xml
    │   ├── checkpoint.chk
    │   ├── log.txt
    │   ├── <pdb_id>_simulation.log
    │   ├── final_state.xml
    │   ├── final_state.pdb
    │   └── finished.txt
    └── result/                   # Trajectory data
        └── output_<pdb_id>.h5    # HDF5 trajectory file

HDF5 Output Format

The HDF5 trajectory files contain:

  • coordinates: Atomic positions (nm)
  • forces: Atomic forces (kJ/mol/nm)
  • cell_lengths: Periodic box dimensions (nm)
  • cell_angles: Periodic box angles (degrees)
  • time: Simulation time (ps)
  • potentialEnergy: Potential energy (kJ/mol)
  • kineticEnergy: Kinetic energy (kJ/mol)
  • topology: System topology in JSON format

Module Structure

module/function.py

Utility functions for file management and topology operations.

module/preprocess.py

Protein preparation pipeline:

  • Download PDB files from RCSB
  • Fix missing residues and atoms
  • Add hydrogens at specified pH
  • Add explicit or implicit solvent
  • Parameterize ligands

module/simulation.py

Core simulation engine:

  • System setup and minimization
  • Equilibration
  • Production MD
  • Checkpoint management
  • Resumable simulations

module/ligands.py

Ligand handling:

  • RCSB ligand template retrieval
  • SMILES-based parameterization
  • GAFF force field generation
  • Topology insertion

module/reporters.py

Custom HDF5 reporter with:

  • Threaded I/O for performance
  • Force trajectory storage
  • H5MD-compatible format
  • Compression support

module/pdb_lookup.py

RCSB database query utilities for ligand SMILES strings.

Force Field Configuration

Default force fields:

  • Protein: AMBER14
  • Water: TIP3P-FB
  • Ligands: GAFF (via OpenFF)
  • Implicit Solvent: GBN2

Force field selection is automatic based on preparation options and stored in forcefield.json.

Data Processing Workflow

Cecilia Data Cleaning

The cecilia_data_cleaning/ directory contains tools for processing existing MD datasets:

Fix Tyrosine Residues:

python cecilia_data_cleaning/fix_cecilia_tyrosol.py

Generate Starting Poses via TICA:

python cecilia_data_cleaning/tica_starting_pos/main.py

This performs time-lagged independent component analysis (TICA) on bond distances to identify diverse starting conformations for enhanced sampling.

Performance Considerations

GPU Utilization

  • Each simulation requires ~2-4GB GPU memory
  • Pool size should be a multiple of GPU count for efficiency
  • Recommended: 2 simulations per GPU (pool_size = 2 × num_gpus)

CPU Threading

  • Set OMP_NUM_THREADS to match allocated cores
  • Ligand parameterization temporarily limits threads to 2
  • Typical: 32 threads for 4-GPU nodes

I/O Performance

  • HDF5 reporter uses threaded writes to minimize overhead
  • report_steps controls output frequency and file size
  • Compression can be enabled via use_gzip=True

Troubleshooting

Common Issues

CUDA out of memory:

  • Reduce pool size
  • Decrease number of atoms (remove water with atomSubset)

Ligand parameterization fails:

  • Check RCSB connectivity
  • Verify ligand is in PDB Chemical Component Dictionary
  • Use --remove-ligands if parameterization not needed

Simulation doesn't resume:

  • Ensure checkpoint files exist
  • Check that HDF5 file isn't corrupted
  • Verify simulation time matches last recorded frame

Force field errors:

  • Check for non-standard residues
  • Verify all ligands have templates
  • Review preparation logs

Contributing

We welcome contributions! Please use GitHub Issues to:

  • Report bugs
  • Request features
  • Ask questions
  • Suggest improvements

License

License information will be added soon.

Acknowledgments

This work was performed on:

  • Perlmutter at NERSC (National Energy Research Scientific Computing Center)
  • Delta at NCSA (National Center for Supercomputing Applications)

Development supported by computational resources and expertise from these facilities.

Contact

For questions and support, please open an issue on GitHub.

References