A Snakemake pipeline that bundles the running of PyClone-VI and PhyClone, for pre-clustering and phylogenetic reconstruction of multi-sample bulk-sequencing data.
This pipeline requires that conda and Snakemake be installed; the Bioconda package channel must also be configured.
- Ensure that you have a working conda installation; you can do this by installing Miniforge.
- Configure the Bioconda channel and set strict channel priority:
    conda config --add channels bioconda
    conda config --add channels conda-forge
    conda config --set channel_priority strict
- Install Snakemake:
    conda create -c conda-forge -c bioconda --name snakemake snakemake'>=9.14.8'
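As a quick sanity check after installation, the sketch below compares the Snakemake version in the active environment against the minimum given in the install command. The helper names `version_ge` and `check_snakemake_version` are hypothetical and not part of the workflow, and the comparison assumes `sort -V` from GNU coreutils is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): succeeds when version $1 >= version $2.
# Assumes `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compare the active environment's Snakemake against the minimum from the install step.
check_snakemake_version() {
    local required="9.14.8"
    local installed
    installed="$(snakemake --version)" || return 1
    if version_ge "$installed" "$required"; then
        echo "Snakemake $installed satisfies >=$required"
    else
        echo "Snakemake $installed is older than $required" >&2
        return 1
    fi
}
```

After sourcing these functions, one might run `conda activate snakemake && check_snakemake_version`.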
- Create a working directory for the workflow:
    mkdir -p path/to/project-workdir
    cd path/to/project-workdir
- Clone the workflow repository through git:
  - To clone the latest code:
      git clone --depth 1 https://github.com/Roth-Lab/PhyClone-Workflow.git
  - To clone a specific version of the workflow:
      git clone --branch <version_tag> --depth 1 https://github.com/Roth-Lab/PhyClone-Workflow.git
For a full description of all available pipeline options, please refer to the pipeline schema.
- Modify the configuration file, config.yaml, to suit your dataset.
- The following configuration fields must be set per experiment:
  - input_file: A valid filepath to the input file for the pipeline, the format of which can be found in both the PyClone-VI and PhyClone repositories.
  - out_directory: Path to the desired output directory.
- The default program options listed under pyclone-vi and phyclone in the configuration schema should suit most cases. However, the following values may be worth adjusting depending on the data being analysed and the computing resources available:
  - pyclone-vi options of interest:
    - num_threads: Number of threads (compute cores) to use during inference.
    - seed: Can be used to seed the random number generator for reproducible results.
  - phyclone options of interest:
    - num_chains: Number of independent parallel PhyClone sampling chains to use; each chain will use a CPU core. PhyClone will benefit from running multiple chains; we recommend ≥4 chains, if the compute cores can be spared.
    - seed: Can be used to seed the random number generator for reproducible results.
- The remaining configuration options have been named to mirror the options of their respective programs. To read more on the available options and their use cases, see the PyClone-VI and PhyClone repositories.
Tip
An example input file can be found in the PyClone-VI repository, here.
Tip
A basic workflow-profile has been set up here; adjust as needed.
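To make the configuration fields described above more concrete, here is a minimal sketch that writes an illustrative config file. The key names are the ones mentioned in this README, but the nesting, the file name, the function name `write_example_config`, and every value are assumptions made for illustration; the pipeline schema remains the authoritative reference.

```shell
#!/usr/bin/env bash
# Writes an illustrative config file. Key names come from the options described
# above; the nesting, file name, and every value are assumptions for illustration.
# Check the pipeline schema before using anything like this.
write_example_config() {
    local target="${1:-example_config.yaml}"
    cat > "$target" <<'EOF'
# Required per experiment (placeholder paths)
input_file: path/to/input.tsv
out_directory: path/to/output-directory

pyclone-vi:
  num_threads: 4   # threads (compute cores) used during inference
  seed: 12345      # fixed seed for reproducible results

phyclone:
  num_chains: 4    # one CPU core per chain; >=4 chains recommended if cores allow
  seed: 12345      # fixed seed for reproducible results
EOF
    echo "$target"
}
```

Calling `write_example_config my_config.yaml` would produce a file that could then be edited and passed to Snakemake via `--configfile`, assuming its layout matches the schema.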
1. Navigate to the project directory and activate the snakemake environment:
    cd path/to/project-workdir/PhyClone-Workflow
    conda activate snakemake
2. Run a dry-run of the pipeline to confirm the ruleset and outputs are as you expect:
    snakemake --cores <number-of-CPU-cores-to-use> --configfile <path/to/config-file> -n
3. Run the pipeline:
    snakemake --cores <number-of-CPU-cores-to-use> --configfile <path/to/config-file>
4. Following the pipeline run, you can additionally create an interactive visual HTML report that bundles together and reports on the pipeline results. (Note: the report file must have the .zip extension.)
   To create this report archive, run:
    snakemake --configfile <path/to/config-file> --report <path/to/report.zip>
Steps 3 and 4 can also be combined with a command like the following:
snakemake --cores <number-of-CPU-cores-to-use> --configfile <path/to/config-file> --report <path/to/report.zip> --report-after-run
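The dry-run, run, and report steps above can be chained in a small wrapper. The following is only a sketch: the function name `run_phyclone_workflow` and its argument order are hypothetical, and it uses only the Snakemake flags already shown in this README.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of the workflow): dry-run first, then the real
# run with the report generated afterwards. Only flags shown above are used.
run_phyclone_workflow() {
    local cores="$1" config="$2" report="$3"

    # The report path must end in .zip, as noted above.
    case "$report" in
        *.zip) ;;
        *) echo "report path must end in .zip: $report" >&2; return 2 ;;
    esac

    # Dry-run: stop here if the ruleset cannot be resolved.
    snakemake --cores "$cores" --configfile "$config" -n || return 1

    # Real run, followed by report creation.
    snakemake --cores "$cores" --configfile "$config" \
        --report "$report" --report-after-run
}
```

For example, `run_phyclone_workflow 8 config.yaml report.zip` would abort before any computation if the dry-run fails.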
The main outputs of the pipeline are point estimate PhyClone clonal phylogenies and/or the PhyClone topology report/archive. More on the contents of these output files can be found in the PhyClone repository.
<output-directory>
├── benchmarks
│   ├── phyclone
│   │   ├── run_phyclone.benchmark.txt
│   │   ├── write_Consensus_results_phyclone.benchmark.txt
│   │   ├── write_MAP_results_phyclone.benchmark.txt
│   │   └── write_phyclone_topology_archive_and_report.benchmark.txt
│   └── pyclone-vi
│       ├── run_pyclone_vi.benchmark.txt
│       └── write_results_pyclone_vi.benchmark.txt
├── logs
│   ├── main_snakefile_logs
│   │   ├── correct_input.stderr.log
│   │   └── correct_input.stdout.log
│   ├── phyclone_logs
│   │   ├── get_phyclone_version.log
│   │   ├── plot_Consensus_tree.log
│   │   ├── plot_MAP_tree.log
│   │   ├── run_phyclone.log
│   │   ├── write_Consensus_results_phyclone.log
│   │   ├── write_MAP_results_phyclone.log
│   │   └── write_phyclone_topology_archive_and_report.log
│   └── pyclone-vi_logs
│       ├── get_pyclone_version.log
│       ├── run_pyclone_vi.log
│       └── write_results_pyclone_vi.log
└── pipeline_outputs
    ├── input
    │   ├── cleaned_input.tsv.gz
    │   └── removed_variants.tsv.gz
    ├── phyclone
    │   ├── Consensus
    │   │   ├── Consensus_results_table.tsv.gz
    │   │   ├── Consensus_sample_prevalence_table.tsv.gz
    │   │   ├── Consensus_tree.nwk
    │   │   └── Consensus_tree.svg
    │   ├── MAP
    │   │   ├── MAP_results_table.tsv.gz
    │   │   ├── MAP_sample_prevalence_table.tsv.gz
    │   │   ├── MAP_tree.nwk
    │   │   └── MAP_tree.svg
    │   ├── phyclone.version.txt
    │   ├── Topology_Report
    │   │   ├── sampled_topologies.tar.gz
    │   │   └── topology_report.tsv.gz
    │   └── trace.h5
    └── pyclone-vi
        ├── clusters.tsv.gz
        ├── pyclone-vi.version.txt
        └── trace.h5
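After a run, it can be handy to confirm that the main result files from the tree above are in place. The sketch below checks a handful of them; the function name `check_main_outputs` is hypothetical, the selection of files is an illustrative choice, and which outputs are actually produced may depend on your configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): reports which of a few main
# result files from the tree above are present under a given output directory.
check_main_outputs() {
    local out_dir="$1" missing=0 rel
    for rel in \
        pipeline_outputs/pyclone-vi/clusters.tsv.gz \
        pipeline_outputs/phyclone/MAP/MAP_tree.nwk \
        pipeline_outputs/phyclone/Consensus/Consensus_tree.nwk \
        pipeline_outputs/phyclone/Topology_Report/topology_report.tsv.gz
    do
        if [ -f "$out_dir/$rel" ]; then
            echo "found: $rel"
        else
            echo "missing: $rel"
            missing=1
        fi
    done
    # Non-zero exit status when at least one file is absent.
    return "$missing"
}
```

For instance, `check_main_outputs path/to/output-directory` prints one line per file and returns a non-zero status if any of them is absent.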