this repo contains the OncoScanner pipeline for somatic WGS and WES analysis.
the implementation is inspired by various nf-core modules and the nf-core/sarek pipeline.
this pipeline requires about 5 times the input (batch) size as temporary (workdir) storage per WGS sample.
the output is normally about 1.2 times the input size (level 5 bgzip compressed fq.gz).
more than 256 GB of RAM is recommended.
a WGS sample is estimated to take about 1800 CPU-hours (on EPYC Milan).
a batch of 4 WGS samples takes about 5 days to compute.
note: it's possible to run with lower resources, but it would slow development and research.
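the rules of thumb above can be turned into a rough planning estimate. this is only a sketch: the function name and the 400 GB example batch are hypothetical, "input size" is read here as the total size of the batch's fq.gz files, and the factors (5x workdir, 1.2x output, 1800 CPU-hours per WGS sample) are simply the estimates stated above.

```python
def estimate_resources(input_gb, n_wgs_samples):
    """Rough resource estimate from the rules of thumb in this README."""
    return {
        # temporary (workdir) storage: about 5 times the input size
        "workdir_gb": 5 * input_gb,
        # output: about 1.2 times the input size
        "output_gb": 1.2 * input_gb,
        # about 1800 CPU-hours per WGS sample
        "cpu_hours": 1800 * n_wgs_samples,
    }


if __name__ == "__main__":
    # hypothetical batch: 4 WGS samples with 400 GB of fq.gz input in total
    for key, value in estimate_resources(input_gb=400, n_wgs_samples=4).items():
        print(f"{key}: {value:.0f}")
```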
for now we recommend running on a single host with conda.
there are multiple ways of running the workflow.
the simplest way is a portable setup that runs from a single folder.
dependencies:
- nextflow
- conda
then clone the pipeline:
git clone --recursive https://github.com/feiloo/nextflow_workflows.git
mkdir reference nextflow_calldir nextflow_workdir nextflow_outputdir cache
export NEXTFLOW_MODULES="$(pwd)/nextflow_workflows/modules"
export NGS_REFERENCE_DIR="$(pwd)/reference"
export NEXTFLOW_CALLDIR="$(pwd)/nextflow_calldir"
export NEXTFLOW_WORKDIR_CUSTOM="$(pwd)/nextflow_workdir"
export NEXTFLOW_OUTPUTDIR_CUSTOM="$(pwd)/nextflow_outputdir"
export NEXTFLOW_STOREDIR="$(pwd)/cache"
export NAS_IMPORT_DIR=''
export NAS_EXPORT_DIR=''
for details, see the environment variables above and the nextflow configs such as modules/oncoscanner/user.config.
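before starting a run, it can help to check that the exported variables are set and point to existing directories. the following is a minimal sketch, not part of the pipeline; the function name is hypothetical, and NAS_IMPORT_DIR and NAS_EXPORT_DIR are skipped because the example above sets them to empty strings.

```python
import os

# directory variables exported in the setup above
# (NAS_IMPORT_DIR and NAS_EXPORT_DIR are skipped: the example leaves them empty)
DIR_VARS = [
    "NEXTFLOW_MODULES",
    "NGS_REFERENCE_DIR",
    "NEXTFLOW_CALLDIR",
    "NEXTFLOW_WORKDIR_CUSTOM",
    "NEXTFLOW_OUTPUTDIR_CUSTOM",
    "NEXTFLOW_STOREDIR",
]


def check_environment(env=None):
    """Return a list of problems; an empty list means the check passed."""
    env = os.environ if env is None else env
    problems = []
    for name in DIR_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```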
pip install boto3
python nextflow_workflows/scripts/download_data.py reference $NGS_REFERENCE_DIR/oncoscanner_reference
the pipeline has multiple variations; the main one is "align_interpret".
this variation requires fastq input files named like this example:
X123-25_N_FFPE_1.fq.gz
X123-25_N_FFPE_2.fq.gz
X123-25_T_FFPE_1.fq.gz
X123-25_T_FFPE_2.fq.gz
where X123-25 is the sample name, in which 25 is the year.
N or T stands for normal or tumor.
FFPE, FF or BLOOD stands for formalin-fixed paraffin-embedded, fresh frozen or blood source material, respectively.
1 or 2 is the read direction, and .fq.gz stands for bgzf/bgzip-compressed fastq files.
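the naming scheme can be checked with a regular expression before a run. this is a sketch under assumptions: the sample name pattern (letters and digits, a dash, a two-digit year) is generalized from the single example above, and the function and group names are hypothetical. the pipeline's own parsing may differ.

```python
import re

# <sample name>_<N|T>_<FFPE|FF|BLOOD>_<1|2>.fq.gz
# the sample name pattern is an assumption derived from the example X123-25
FASTQ_NAME = re.compile(
    r"(?P<sample_id>[A-Za-z0-9]+-\d{2})"
    r"_(?P<sample_type>[NT])"
    r"_(?P<modality>FFPE|FF|BLOOD)"
    r"_(?P<read>[12])\.fq\.gz"
)


def parse_fastq_name(filename):
    """Return the name components as a dict, or None if the name does not match."""
    match = FASTQ_NAME.fullmatch(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for name in ["X123-25_N_FFPE_1.fq.gz", "X123-25_T_FFPE_3.fq.gz"]:
        print(name, "->", parse_fastq_name(name))
```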
it also requires a samplesheet.csv like this:
sample_id,normal_modality,tumor_modality
X123-25,FFPE,FFPE
X123-25,BLOOD,FFPE
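assuming the modality columns of the samplesheet correspond to the modality part of the fastq names (an inference from the examples above, not something the pipeline documents here), the fastq files expected for each row can be listed like this. the function name is hypothetical.

```python
import csv
import io


def expected_fastqs(samplesheet_text):
    """List the fastq names implied by each samplesheet row (assumed mapping)."""
    names = []
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        pairs = (("N", row["normal_modality"]), ("T", row["tumor_modality"]))
        for sample_type, modality in pairs:
            for read in (1, 2):
                names.append(
                    f"{row['sample_id']}_{sample_type}_{modality}_{read}.fq.gz"
                )
    return names


if __name__ == "__main__":
    sheet = "sample_id,normal_modality,tumor_modality\nX123-25,FFPE,FFPE\n"
    for name in expected_fastqs(sheet):
        print(name)
```

comparing this list with the contents of the input directory shows missing or misnamed files before the pipeline is started.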
lastly, a md5sum.txt that includes the hashes for the input files is required.
the pipeline looks for the fastq files, samplesheet.csv and md5sum.txt relative to the specified --input_dir.
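a md5sum.txt can be generated from the fastq files in the input directory. the sketch below assumes the pipeline accepts the usual md5sum text format (hash, two spaces, file name); that format is an assumption and is not confirmed by this README. the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5sums(input_dir):
    """Write md5sum.txt for all *.fq.gz files in input_dir and return its lines."""
    input_dir = Path(input_dir)
    lines = [
        f"{md5_of_file(path)}  {path.name}"
        for path in sorted(input_dir.glob("*.fq.gz"))
    ]
    (input_dir / "md5sum.txt").write_text("".join(line + "\n" for line in lines))
    return lines
```

for example, write_md5sums(os.environ["NEXTFLOW_CALLDIR"]) (after import os) would cover the setup above, where the input directory is the call directory.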
now start the pipeline.
cd $NEXTFLOW_CALLDIR && nextflow \
-log nextflow.log \
run $NEXTFLOW_MODULES/oncoscanner/ \
--workflow_variation align_interpret \
--samplesheet samplesheet.csv \
--hash_db md5sum.txt \
-c $NEXTFLOW_MODULES/oncoscanner/user.config \
--library_type wgs \
--input_dir $NEXTFLOW_CALLDIR \
-profile standard \
-offline \
-resume \
--tag routine_establish,wgs
for development, you can set the NEXTFLOW_MODULES variable to this repo's modules directory:
export NEXTFLOW_MODULES="$(pwd)/modules"
development dependencies:
- meson
the meson buildsystem enables additional steps for building, testing and packaging.
meson uses out-of-source build directories, which keeps build artifacts separate from the source tree.
create one with:
meson setup --wipe -D test_datadir=path_to_datadir/ -D external_deps=path_to_external_deps/ /path_to_builddir/
cd /path_to_builddir/
meson compile
test setup: todo
now configure the tests with the test data directory:
cd /path_to_builddir
meson configure -Dtest_datadir=/path_to_test_samplesheet_dir
run the tests:
meson test
instead of conda or containers, the required tools within the pipeline can be installed locally, see:
scripts/build_deps.sh
scripts/install_deps.sh
another way is to use a single container for all processes or for the whole pipeline, see:
scripts/gen_container.sh
Paolo Di Tommaso, Maria Chatzou, Evan Floden, Pablo Prieto Barja, Emilio Palumbo, Cedric Notredame, "Nextflow enables reproducible computational workflows", 2017
Friederike Hanssen, Maxime U. Garcia, Lasse Folkersen, Anders Sune Pedersen, Francesco Lescai, Susanne Jodoin, Edmund Miller, Oskar Wacker, Nicholas Smith, nf-core community, Gisela Gabernet, Sven Nahnsen, "Scalable and efficient DNA sequencing analysis on different compute infrastructures aiding variant discovery", bioRxiv, 2023
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li, "Twelve years of SAMtools and BCFtools", GigaScience, vol. 10, no. 2, pp. giab008, 2021
Mark A. DePristo, Eric Banks, Ryan Poplin, Kiran V. Garimella, Jared R. Maguire, Christopher Hartl, Anthony A. Philippakis, Guillermo del Angel, Manuel A. Rivas, Matt Hanna, Aaron McKenna, Tim J. Fennell, Andrew M. Kernytsky, Andrey Y. Sivachenko, Kristian Cibulskis, Stacey B. Gabriel, David Altshuler, Mark J. Daly, "A framework for variation discovery and genotyping using next-generation DNA sequencing data", Nature genetics, vol. 43, no. 5, pp. 491-498, 2011
Peng Jia, Xiaofei Yang, Li Guo, Bowen Liu, Jiadong Lin, Hao Liang, Jianyong Sun, Chengsheng Zhang, Kai Ye, "MSIsensor-pro: Fast, Accurate, and Matched-normal-sample-free Detection of Microsatellite Instability", Genomics, Proteomics & Bioinformatics, vol. 18, no. 1, pp. 65-71, 2020
Shifu Chen, "fastp 1.0: An ultra-fast all-round tool for FASTQ data quality control and preprocessing", iMeta, vol. 4, no. 5, pp. e70078, 2025