Snakemake pipeline for trackplot to make coverage track plots across a series of intervals
Install the execution pipeline using conda/mamba:
conda env create -f workflow/envs/trackplot_snakemake.yaml
conda activate trackplot_snakemakeThis minimal environment installs snakemake (>=8.0.0) and samtools (>=1.21.0) to be able to run bgzip and tabix on input interval files prior to running the pipeline. All snakemake rules themselves are executed inside Singularity containers.
Trackplot would do this under the hood anyway. But as we want to immediately parallelise across input intervals, we want to avoid race conditions between different invocations of trackplot. The safest way to do this is to pre-validate the input interval files (GTF file and intervals.txt file). Use workflow/scripts/check_interval_inputs.py to do this. Requires samtools installation and python 3.9+ (satisfied by the trackplot_snakemake conda environment):
python workflow/scripts/check_interval_inputs.py -g data/gencode.v40.annotation.gtf.gz -i data/intervals.txt -o data/intervals.validated.txt
Running: tabix -p gff data/gencode.v40.annotation.gtf.gz
Error in indexing gff file: Command '['tabix', '-p', 'gff', 'data/gencode.v40.annotation.gtf.gz']' returned non-zero exit status 1.
STDERR: [tabix] the compression of 'data/gencode.v40.annotation.gtf.gz' is not BGZF
Direct indexing failed. The file might not be properly bgzipped. Attempting to decompress and re-bgzip...
Running: sort -k1,1 -k4,4n -o /tmp/tmpx2p9hxyz.sorted /tmp/tmp1rlkdrh7.uncompressed
Created properly bgzipped file at: data/gencode.v40.annotation.gtf.bgz
Running: tabix -p gff data/gencode.v40.annotation.gtf.bgz
Successfully indexed re-bgzipped file: data/gencode.v40.annotation.gtf.bgz
GTF file: data/gencode.v40.annotation.gtf.bgz
Running: sort -k1V -k2n -k3n -o /tmp/tmpdrpwhm2j.sorted data/le_ids.merged.sorted.bed
Created bgzipped file: data/le_ids.merged.sorted.bed.bgz
Running: tabix -p bed data/le_ids.merged.sorted.bed.bgz
Successfully indexed file: data/le_ids.merged.sorted.bed.bgz
Updated intervals file saved to: data/intervals.validated.txt
Intervals file: data/intervals.validated.txt
All input files have now been validated or re-processed- Only checks file extensions and not the actual file contents themselves.
- Uses temporary files via the
tempfilelibrary. Use environment variables to modify the temporary directory if necessary. - Supports BED or GFF/GTF input interval files only (via the tabix presets).
The pipeline requires as input:
- BED files for the 'view' regions (plotting window, passed to
trackplot -e) and a 'focus' region (highlighted window, passed totrackplot --focus). Focus regions are matched to view regions by the 'name' (4th) field of the BED file. The 'focus' region BED file path must be provided, but it is not required to provide a focus region for any of the input regions- 'view' BED should ideally be BED6 (to include the strand column), the 'focus' bed only needs to be BED4 format (to include the name field)
- density.txt (
trackplot --density) and intervals.txt (trackplot --interval) files as specified by trackplot. This pipeline does not validate the format of the density.txt file in any way. The interval.txt file is checked to ensure the file extensions imply bgzipped and tabix-indexed
Consult trackplot documentation for instructions on how to generate trackplot-specific input formats.
It's possible to specify multiple datasets per pipeline run/set of input intervals. Simply add an additional dataset_name and density.txt file to the density_lists option. Trackplot is run once for each input region and density.txt file. Output PDF files are then concatenated separately per interval across datasets (<region_name>.all_datasets.pdf) and per dataset across intervals (<dataset_name>.all_regions.pdf)
To perform a dry run on the test data:
snakemake -p --configfile config/config.test.tdp43-apa.yaml --cores 2 --use-singularity --singularity-args="--bind /home/sam" -nIf everything looks ok, remove the -n command to execute the pipeline. On the first run this will pull the singularity containers, which can take a bit of extra time:
snakemake -p --configfile config/config.test.tdp43-apa.yaml --cores 2 --use-singularity --singularity-args="--bind /home/sam"On my middle-grade laptop this takes approximately 5 mins to run. Feel free to bump the number of cores to speed up execution. TODO: subsample the test BAM files and other input files to speed up execution and allow shipping with the repository.
$ tree processed/test.tdp43-apa/
processed/test.tdp43-apa/
├── annotations
├── benchmarks
│ ├── regions
│ │ ├── ENSG00000104435.14_1.combine_pdfs.benchmark.txt
│ │ ├── ENSG00000107807.13_2.combine_pdfs.benchmark.txt
│ │ ├── ENSG00000126767.18_2.combine_pdfs.benchmark.txt
│ │ ├── ENSG00000127511.10_2.combine_pdfs.benchmark.txt
│ │ ├── ENSG00000137161.18_1.combine_pdfs.benchmark.txt
│ │ └── ENSG00000138083.5_2.combine_pdfs.benchmark.txt
│ └── seddighi_i3_cortical_test
│ ├── ENSG00000104435.14_1.benchmark.txt
│ ├── ENSG00000107807.13_2.benchmark.txt
│ ├── ENSG00000126767.18_2.benchmark.txt
│ ├── ENSG00000127511.10_2.benchmark.txt
│ ├── ENSG00000137161.18_1.benchmark.txt
│ ├── ENSG00000138083.5_2.benchmark.txt
│ └── combine_pdfs_within_dataset.benchmark.txt
├── logs
│ ├── regions
│ │ ├── ENSG00000104435.14_1.combine_pdfs.log
│ │ ├── ENSG00000107807.13_2.combine_pdfs.log
│ │ ├── ENSG00000126767.18_2.combine_pdfs.log
│ │ ├── ENSG00000127511.10_2.combine_pdfs.log
│ │ ├── ENSG00000137161.18_1.combine_pdfs.log
│ │ └── ENSG00000138083.5_2.combine_pdfs.log
│ └── seddighi_i3_cortical_test
│ ├── ENSG00000104435.14_1.log
│ ├── ENSG00000107807.13_2.log
│ ├── ENSG00000126767.18_2.log
│ ├── ENSG00000127511.10_2.log
│ ├── ENSG00000137161.18_1.log
│ ├── ENSG00000138083.5_2.log
│ └── combine_pdfs.log
└── plots
├── combined
│ ├── regions
│ │ ├── ENSG00000104435.14_1.all_datasets.pdf
│ │ ├── ENSG00000107807.13_2.all_datasets.pdf
│ │ ├── ENSG00000126767.18_2.all_datasets.pdf
│ │ ├── ENSG00000127511.10_2.all_datasets.pdf
│ │ ├── ENSG00000137161.18_1.all_datasets.pdf
│ │ └── ENSG00000138083.5_2.all_datasets.pdf
│ └── seddighi_i3_cortical_test.all_regions.pdf
└── individual
└── seddighi_i3_cortical_test
├── ENSG00000104435.14_1.pdf
├── ENSG00000107807.13_2.pdf
├── ENSG00000126767.18_2.pdf
├── ENSG00000127511.10_2.pdf
├── ENSG00000137161.18_1.pdf
└── ENSG00000138083.5_2.pdf
12 directories, 39 files- If only one dataset is provided, the
combine_pdfs_by_regionrule is unnecessary. Consider skipping this rule if only one dataset/density.txt file is provided. - Only tested with bulk RNA-seq BAM files and standard (BED/GFF) interval formats. Not all trackplot arguments are exposed by the pipeline.