MutationScan is a Snakemake-orchestrated AMR analytics pipeline that transforms local bacterial genome assemblies into:
- Mutation call reports
- Biochemical epistasis network rankings
- Optional structure-guided docking deltas (WT vs mutant)
The repository is structured for production use with deterministic workflow steps, job-scoped output directories, and strict separation of source code vs runtime state.
MutationScan executes a staged workflow:
- Sequence extraction and variant calling from local .fna genomes
- Biochemical scoring and co-occurrence epistasis network generation
- Optional biophysics docking against a provided protein structure
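To make the co-occurrence idea concrete, here is a minimal sketch that counts how many genomes carry each pair of mutations. It is not the scoring used by 03_biochemical_epistasis.py; the function name, the input shape, and the mutation labels are hypothetical placeholders chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(mutations_by_genome):
    """Count how many genomes carry each unordered pair of mutations.

    mutations_by_genome maps a genome ID to a list of mutation labels
    (hypothetical input shape, not the pipeline's actual report format).
    """
    pair_counts = Counter()
    for mutations in mutations_by_genome.values():
        # De-duplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(mutations)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Placeholder labels, not real calls from any genome.
example = {
    "genome_A": ["geneA:A10V", "geneB:K20E"],
    "genome_B": ["geneA:A10V", "geneB:K20E", "geneC:S30N"],
    "genome_C": ["geneA:A10V"],
}
counts = cooccurrence_counts(example)
```

Pairs seen in many genomes are candidates for network edges; any weighting or ranking applied on top of raw counts is outside this sketch.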
Current design principles:
- Local genomes are the input source (no built-in metadata download stage in the production DAG).
- Every run is namespaced by job_name and writes to data/output/{job_name}/.
The current main branch includes several pipeline correctness and quality upgrades:
- Variant-calling identity filter to suppress weak-homology mutation inflation.
- MVBM docking refinements with fixed-pocket targeting and flexible-residue mutant docking.
- Fast steric quality control with explicit FAILED_QC status for non-physical mutant models.
- Confidence and interpretation annotations in biophysics outputs for easier triage.
These updates are now the documented baseline behavior for new runs.
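As an illustration of what a fast steric check can look like, the sketch below flags a mutant model when any mutated side-chain atom sits too close to a surrounding atom. This is an assumption about the general technique, not the check implemented in 04_htvs_biophysics.py: the function name, the 2.0 Å cutoff, and the PASSED label are hypothetical; only the FAILED_QC status comes from the pipeline.

```python
import math


def steric_qc(mutant_atoms, neighbor_atoms, min_distance=2.0):
    """Return "FAILED_QC" if any mutant atom clashes with a neighbor atom.

    Both arguments are lists of (x, y, z) coordinates in angstroms.
    The 2.0 angstrom default cutoff is an assumed value for illustration.
    """
    for atom in mutant_atoms:
        for neighbor in neighbor_atoms:
            # math.dist gives the Euclidean distance between two points.
            if math.dist(atom, neighbor) < min_distance:
                return "FAILED_QC"
    return "PASSED"  # hypothetical label for models without clashes


side_chain = [(0.0, 0.0, 0.0)]
status_ok = steric_qc(side_chain, [(5.0, 0.0, 0.0)])
status_clash = steric_qc(side_chain, [(1.0, 0.0, 0.0)])
```

A pairwise loop like this is quadratic in atom count, which is acceptable when only the atoms near the mutated residue are compared.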
The active workflow in Snakefile calls exactly these scripts:
- src/scripts/02a_extract_proteins.py
- src/scripts/02b_call_variants.py
- src/scripts/03_biochemical_epistasis.py
- src/scripts/04_htvs_biophysics.py
No legacy acquisition script is used in the current production DAG.
Required inputs:
- Local genomes directory (default data/local_genomes)
- Target gene list (default config/acr_targets.txt)
- Optional reference PDB for Phase 3 biophysics (default data/5o66.pdb in config)
- Optional ligand path from config (ligand)
Primary outputs for a run:
- data/output/{job_name}/1_genomics_report.csv
- data/output/{job_name}/2_epistasis_networks.csv
- data/output/{job_name}/ControlScan_Networks/
- data/output/{job_name}/3_biophysics_docking.csv
- data/output/{job_name}/Mutated_Structures/
- data/output/{job_name}/README_Biophysics.txt
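The output layout above can be checked after a run with a small helper. This is a sketch only: the helper names are hypothetical and not part of the repository, and the output names are copied from the list above.

```python
from pathlib import Path

# Output names as listed in this README for data/output/{job_name}/.
OUTPUT_NAMES = [
    "1_genomics_report.csv",
    "2_epistasis_networks.csv",
    "ControlScan_Networks",
    "3_biophysics_docking.csv",
    "Mutated_Structures",
    "README_Biophysics.txt",
]


def expected_outputs(job_name, root="data/output"):
    """Build the expected output paths for one job namespace."""
    job_dir = Path(root) / job_name
    return [job_dir / name for name in OUTPUT_NAMES]


def missing_outputs(job_name, root="data/output"):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(job_name, root) if not p.exists()]


paths = expected_outputs("trial_001")
```

Because the biophysics stage is optional, missing docking outputs do not necessarily indicate a failed run.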
Edit config/config.yaml to control run behavior.
Minimum important keys:
- job_name: output namespace for this run
- local_genomes: folder containing .fna files
- targets_file: target genes list
- variant_min_identity_percent: minimum alignment identity threshold for variant emission (default 80)
- default_pdb: structure file for biophysics stage
- ligand: optional ligand file path for docking
- pocket_center_x / pocket_center_y / pocket_center_z: optional override for docking pocket center (default AcrB center)
- exhaustiveness: docking search exhaustiveness (default 16)
Example:
job_name: "trial_001"
local_genomes: "data/local_genomes"
targets_file: "config/acr_targets.txt"
variant_min_identity_percent: 80
default_pdb: "data/5o66.pdb"
ligand: "data/ligands/ligand.sdf"
exhaustiveness: 16
Identity filtering note:
- Alignments below variant_min_identity_percent are skipped before mutation emission.
- If you want a broader but noisier search, reduce to 75; for stricter calls, keep 80 or raise it.
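The effect of the threshold can be sketched as follows. This is an illustrative sketch, not the code in 02b_call_variants.py: the function names are hypothetical, and identity is assumed here to be identical columns divided by alignment length, which may differ from how the pipeline derives it.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns where both sequences share a residue.

    Both strings must come from the same alignment (equal length, with
    "-" marking gaps). Gap columns never count as matches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        return 0.0
    matches = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    return 100.0 * matches / len(aligned_query)


def passes_identity_filter(identity_percent, variant_min_identity_percent=80):
    """Keep an alignment only if it is at or above the threshold."""
    return identity_percent >= variant_min_identity_percent


identity = percent_identity("MKTAYLLLLL", "MKTAYLLLAA")  # 8 of 10 columns match
keep = passes_identity_filter(identity)
```

Lowering the threshold to 75 admits alignments that the default of 80 rejects, which is the broader but noisier search described above.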
Use the project Conda environment definition:
conda env create -f environment.yml
conda activate mutationscan
pip install -e .
Dry-run the DAG:
python -m snakemake -n --cores 1 --config job_name="smoke_test"
Run the workflow:
python -m snakemake --cores 4 --config job_name="run_2026_03_18"
The same dry run and full run with Docker Compose:
docker compose build
docker compose run --rm mutationscan python -m snakemake -n --cores 1 --config job_name="docker_smoke"
docker compose run --rm mutationscan python -m snakemake --cores 4 --config job_name="docker_run"
Repository CI validates:
- Unit tests
- Snakemake DAG buildability
Runtime data/state folders are intentionally quarantined via ignore rules, and .snakemake/ is not tracked.
This pipeline is intended for research and engineering triage workflows.
- Not a clinical diagnostic device.
- Mutation-to-phenotype inference is model- and rule-dependent, not ground truth.
- Docking outputs are best-effort relative estimates, not absolute binding free-energy truth.
- Fast local docking does not fully model large conformational changes, explicit solvent, long-timescale dynamics, or complete thermodynamic integration.
- For high-confidence mechanistic conclusions, use full molecular dynamics and dedicated free-energy methods.
Tracked assets should remain source/config/documentation only.
Not shipped as production code or tracked outputs:
- .snakemake/ runtime state
- Generated output under data/output/*
- Downloaded genome payloads under data/local_genomes/*
- Ad hoc local experiment files
Keep placeholders only (.gitkeep) in runtime data folders.
Common causes of failed runs:
- Missing .fna files in local_genomes
- Missing/incorrect target genes file
- Missing PDB when biophysics stage is enabled
- Missing external binaries in local environment (tblastn, docking dependencies)
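The causes above can be screened before launching Snakemake with a short preflight script. This is a sketch under stated assumptions: the preflight function is hypothetical and not shipped with the repository, and only tblastn is checked by default because the docking dependencies are not named here.

```python
import shutil
from pathlib import Path


def preflight(local_genomes, targets_file, pdb_path=None, binaries=("tblastn",)):
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []

    genomes_dir = Path(local_genomes)
    # any() stops at the first .fna file, so large folders stay cheap to scan.
    if not genomes_dir.is_dir() or not any(genomes_dir.glob("*.fna")):
        problems.append(f"no .fna files found in {genomes_dir}")

    if not Path(targets_file).is_file():
        problems.append(f"targets file not found: {targets_file}")

    # Only check the structure when the biophysics stage will be used.
    if pdb_path is not None and not Path(pdb_path).is_file():
        problems.append(f"PDB file not found: {pdb_path}")

    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(name) is None:
            problems.append(f"required binary not on PATH: {name}")

    return problems


if __name__ == "__main__":
    for problem in preflight(
        "data/local_genomes", "config/acr_targets.txt", "data/5o66.pdb"
    ):
        print(problem)
```

The paths in the usage block are the documented defaults; pass the values from config/config.yaml if they differ.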
Recommended first check:
python -m snakemake -n --cores 1 --config job_name="debug_run"
See LICENSE.