This guide covers running AlphaFast on HPC clusters using Singularity/Apptainer and SLURM.
- NVIDIA GPU with CUDA compute capability >= 6.0
- CUDA driver >= 560.28.03 (required for CUDA 12.6 runtime)
- Singularity >= 3.7 or Apptainer >= 1.0
- SLURM workload manager
Convert the Docker image to a Singularity SIF file:
singularity pull alphafast.sif docker://romerolabduke/alphafast:latestSet SINGULARITY_CACHEDIR to scratch storage to avoid filling your home directory:
export SINGULARITY_CACHEDIR=/scratch/$USER/.singularity
mkdir -p $SINGULARITY_CACHEDIR
singularity pull alphafast.sif docker://romerolabduke/alphafast:latestThe resulting alphafast.sif file is approximately 10 GB.
HPC filesystems vary in performance characteristics. The data pipeline writes many small temporary files during MMseqs2 searches, which can be very slow on network-attached storage.
| Filesystem | Temp Files | Databases | Recommendations |
|---|---|---|---|
| Lustre | Slow (metadata-heavy) | Good (large sequential reads) | Use --temp_dir on local storage |
| GPFS | Slow (metadata-heavy) | Good | Use --temp_dir on local storage |
| NFS | Slow | Adequate | Use --temp_dir on local storage |
| Local SSD / /scratch | Fast | Fast | Preferred for temp files |
Recommendation: Always set --temp_dir to fast local storage (e.g., /scratch or /tmp) when databases reside on network-attached storage. This can provide a 10-13x speedup for MMseqs2 searches. Typical temporary space usage is 1-5 GB per batch.
Singularity requires explicit bind mounts for directories outside the container. The run_alphafast.sh script handles this automatically for --db_dir, --weights_dir, --input_dir, --output_dir, --temp_dir, and --jax_compilation_cache_dir. If you need to pass additional paths, you may need to add bind mounts manually or ensure the paths are under an already-bound parent directory.
AlphaFast uses CUDA 12.6 inside the container. The host system must have a compatible NVIDIA driver:
| CUDA Version | Minimum Driver |
|---|---|
| CUDA 12.6 | >= 560.28.03 |
Check your driver version:
nvidia-smi | head -3If your driver is too old, contact your HPC system administrator.
#!/bin/bash
#SBATCH --job-name=alphafast
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=4:00:00
#SBATCH --output=alphafast_%j.out
#SBATCH --error=alphafast_%j.err
# Set paths
INPUT_DIR=/path/to/inputs
OUTPUT_DIR=/path/to/outputs
DB_DIR=/path/to/databases
WEIGHTS_DIR=/path/to/weights
CONTAINER=/path/to/alphafast.sif
# Use local scratch for temporary files (recommended)
export SINGULARITY_CACHEDIR=/scratch/$USER/.singularity
mkdir -p /scratch/$USER/alphafast_tmp /scratch/$USER/alphafast_jax_cache
./scripts/run_alphafast.sh \
--input_dir $INPUT_DIR \
--output_dir $OUTPUT_DIR \
--db_dir $DB_DIR \
--weights_dir $WEIGHTS_DIR \
--temp_dir /scratch/$USER/alphafast_tmp \
--jax_compilation_cache_dir /scratch/$USER/alphafast_jax_cache \
--container $CONTAINERMulti-GPU mode uses a phase-separated parallel architecture: all N GPUs run MSA search in parallel (Phase 1), then all N GPUs run inference in parallel (Phase 2).
#!/bin/bash
#SBATCH --job-name=alphafast-multi
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --time=8:00:00
#SBATCH --output=alphafast_multi_%j.out
#SBATCH --error=alphafast_multi_%j.err
# Set paths
INPUT_DIR=/path/to/inputs
OUTPUT_DIR=/path/to/outputs
DB_DIR=/path/to/databases
WEIGHTS_DIR=/path/to/weights
CONTAINER=/path/to/alphafast.sif
# Use local scratch for temporary files (recommended)
export SINGULARITY_CACHEDIR=/scratch/$USER/.singularity
mkdir -p /scratch/$USER/alphafast_tmp /scratch/$USER/alphafast_jax_cache
./scripts/run_alphafast.sh \
--input_dir $INPUT_DIR \
--output_dir $OUTPUT_DIR \
--db_dir $DB_DIR \
--weights_dir $WEIGHTS_DIR \
--container $CONTAINER \
--temp_dir /scratch/$USER/alphafast_tmp \
--jax_compilation_cache_dir /scratch/$USER/alphafast_jax_cache \
--gpu_devices 0,1,2,3With 4 GPUs, all 4 GPUs run MSA search simultaneously in Phase 1. After all MSA jobs complete, the results are re-distributed and all 4 GPUs run inference simultaneously in Phase 2.
| Workload | GPUs | CPUs | Memory | Wall Time (est.) |
|---|---|---|---|---|
| 1 protein, single-GPU | 1 | 8 | 64 GB | 30-60 min |
| 32 proteins, single-GPU | 1 | 8 | 64 GB | 4-8 hours |
| 32 proteins, 4 GPUs | 4 | 32 | 256 GB | 1-3 hours |
| 512 proteins, 4 GPUs | 4 | 32 | 256 GB | 12-24 hours |
Memory requirements depend on protein size. For sequences longer than 2000 residues, increase to 128 GB (single-GPU) or 512 GB (multi-GPU).
Container fails to start with GPU errors: Verify the CUDA driver version is >= 560.28.03. Run nvidia-smi to check.
Slow MSA searches: Set --temp_dir to local scratch storage. Network filesystems add significant overhead for the many small I/O operations in MMseqs2.
Out of memory during inference: Reduce XLA_CLIENT_MEM_FRACTION (default 0.95) or use --flash_attention_implementation xla for lower memory usage on older GPUs.
First inference batch is much slower: Use --jax_compilation_cache_dir on fast storage so later runs can reuse compiled JAX executables. A fully cold multi-GPU launch can still spend extra time compiling on the first worker batch before the cache is populated.
Singularity permission errors: Ensure SINGULARITY_CACHEDIR is writable and on a filesystem that supports file locking.