Yet another ATACseq pipeline written in Nextflow, but this one is different (so said everyone). See the DAG; if you like the components in this pipeline, give it a spin!

nf-atac

Pipeline Stats

  • Modules: 5+ (PRE_BAM, QC_REPORT, PEAK_CALL, DAR_PATH, SHIFT_TSS, TFF_PLOTS)
  • Languages: Nextflow DSL2, Groovy, R
  • Supported Platforms: Local, HPC, Cloud
  • Execution Profiles: Local, Docker, Apptainer, Conda, Arm, UPPMAX
  • Peak Caller: Genrich (default)
  • Genome: hg38 (default), currently supports 'hg38' and 'mm10'
  • Custom Entry Points: 6+
  • Output Formats: BAM, BED, QC Reports, Pathway Analysis
  • Retries: 3 (default)
  • Memory: 20 GB (default per process)
  • CPUs: 14 (default)
  • Container: Defined per module, hosted at Seqera
  • FDR Threshold: 0.05 (default)
  • TSS Analysis: Enabled with --ss or --tff
  • Differential Analysis: Supports custom contrasts
  • Annotation: TxDb, BSgenome, org.Hs/Mm.eg.db
  • Blacklist Filtering: Enabled by default
  • QC Tools: FastQC, MultiQC, ATACseqQC
  • DAR Analysis: DESeq2
  • Footprinting: Gene-specific TFF analysis
  • Pathway Analysis: Integrated
  • Skip Options: 10+ customizable flags

nf-atac is a bioinformatics pipeline built using Nextflow for analyzing ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data. It automates the preprocessing, peak calling, and downstream analysis of ATAC-seq datasets.

DAG

nf-atac is currently in development; more modules will be added and optimizations made over time. This DAG will keep evolving.

flowchart TB
    subgraph " "
    v0["Channel.fromPath"]
    v7["blacklist"]
    v12["tag"]
    v19["tag"]
    v48["ref"]
    v62["goi"]
    end
    subgraph PRE_BAM
    v6([ADD_GROUP])
    v8([BAM_BLKLIST])
    v9([BAM_FL_RAW])
    v13([COMBINE_INSERT_METS_RAW])
    v15([BAMQC])
    v16([BAM_FL_QC])
    v20([COMBINE_INSERT_METS_QC])
    v1(( ))
    v11(( ))
    v18(( ))
    end
    subgraph " "
    v10[" "]
    v14[" "]
    v17[" "]
    v21[" "]
    v24[" "]
    v27[" "]
    v36[" "]
    v37[" "]
    v38[" "]
    v39[" "]
    v41[" "]
    v42[" "]
    v43[" "]
    v44[" "]
    v45[" "]
    v46[" "]
    v50[" "]
    v52[" "]
    v53[" "]
    v54[" "]
    v55[" "]
    v56[" "]
    v57[" "]
    v59["group_input"]
    v60[" "]
    v61[" "]
    v64[" "]
    end
    subgraph QC_REPORT
    v23([FASTQC])
    v26([MULTIQC])
    v25(( ))
    end
    subgraph PEAK_CALL
    v28([SORT_N])
    v31([GENRICH])
    v35([FEAT_COUNTS])
    v40([PEAK_ANNOT])
    v29(( ))
    v32(( ))
    v33(( ))
    v47(( ))
    end
    subgraph DAR_PATH
    v49([DAR])
    v51([PATHWAY])
    end
    subgraph SHIFT_TSS
    v58([SHIFT_SPLIT])
    end
    subgraph TFF_PLOTS
    v63([TFF])
    end
    v0 --> v1
    v1 --> v6
    v6 --> v8
    v7 --> v8
    v8 --> v9
    v8 --> v15
    v9 --> v10
    v9 --> v11
    v12 --> v13
    v11 --> v13
    v13 --> v14
    v15 --> v16
    v15 --> v28
    v15 --> v58
    v15 --> v1
    v16 --> v17
    v16 --> v18
    v19 --> v20
    v18 --> v20
    v20 --> v21
    v1 --> v23
    v23 --> v24
    v23 --> v25
    v25 --> v26
    v26 --> v27
    v28 --> v29
    v28 --> v33
    v29 --> v31
    v31 --> v40
    v31 --> v32
    v32 --> v35
    v33 --> v35
    v35 --> v39
    v35 --> v38
    v35 --> v37
    v35 --> v36
    v35 --> v47
    v40 --> v46
    v40 --> v45
    v40 --> v44
    v40 --> v43
    v40 --> v42
    v40 --> v41
    v48 --> v49
    v47 --> v49
    v49 --> v50
    v49 --> v51
    v51 --> v57
    v51 --> v56
    v51 --> v55
    v51 --> v54
    v51 --> v53
    v51 --> v52
    v58 --> v63
    v58 --> v61
    v58 --> v60
    v58 --> v59
    v62 --> v63
    v63 --> v64

Features

  • Automated Workflow: Streamlines the ATAC-seq analysis process.
  • Reproducibility: Ensures consistent results across runs using Nextflow.
  • Scalability: Supports execution on local machines, HPC clusters, and cloud platforms.
  • Customizable: Easily configurable to suit specific experimental needs.

Requirements

  • Nextflow (DSL2) and a compatible Java runtime
  • A container engine (Docker, Apptainer/Singularity) or Conda for software environments

Installation

  1. Clone the repository:

    git clone https://github.com/addityea/nf-atac.git
    cd nf-atac
  2. Install Nextflow:

    curl -s https://get.nextflow.io | bash
    # Make sure it's executable and move to PATH
    chmod +x nextflow
    sudo mv nextflow /usr/local/bin/

    If you don't have sudo access, you can move the nextflow binary to a directory on your PATH (e.g., ~/bin) or run it from the current directory by prefixing the command with ./.

  3. Ensure dependencies are installed (e.g., Docker/Singularity, Java).

Offline setup

If you need to run the pipeline in an offline environment, you can cache all the Singularity containers and use the offline profile. In this example, we download all the images into the conts directory. You can do this by running the following command from the nf-atac directory:

cd conts/
bash get_all.sh

After this, move the whole nf-atac directory to the offline environment and run the pipeline with the offline profile in addition to your preferred profile(s).

Keep in mind that when running offline the celldex download will fail, so make sure you provide an already compressed celldex reference in the sample sheet, or download the references beforehand and provide their paths in the sample sheet.
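
An offline profile of this kind typically just points the container engine at the pre-downloaded image cache. A minimal sketch of what such a profile might look like in nextflow.config (illustrative only, not the pipeline's actual configuration):

// Sketch of an offline profile: reuse images already downloaded into conts/
profiles {
    offline {
        singularity.enabled    = true
        singularity.autoMounts = true
        // point Nextflow at the pre-fetched images instead of pulling over the network
        singularity.cacheDir   = "${projectDir}/conts"
    }
}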

HPC profiles (SUPR NAISS Sweden only)

  • pdc_kth: profile for the PDC Dardel cluster
  • uppmax: profile for UPPMAX clusters, with auto-detection of the miarka, pelle, snowy, and rackham queues

If you are running the pipeline on these HPC clusters, you can use the corresponding profile. These profiles are configured to use Singularity containers and have settings specific to each HPC environment. The profiles were originally written by Pontus Freyhult (@pontus) and adapted for this pipeline by Aditya Singh (@addityea).

Usage

Run the pipeline with the following command:

nextflow run main.nf --samplesheet <sample sheet> -resume  

You may customise the pipeline by modifying the nextflow.config file or by passing additional parameters on the command line. Several profiles are available for different environments:

  • 'local' for running on a local machine
  • 'docker' for running with Docker
  • 'arm' for specific flags for ARM architecture
  • 'apptainer' for running with Apptainer
  • 'UPPMAX' for running on UPPMAX cluster

Custom Entry points

The pipeline provides several custom entry points to execute specific parts of the workflow without running the entire pipeline. These entry points correspond to the sub-workflows defined in the main.nf file:

  • PREBAM: Preprocess BAM files, including blacklist filtering, fragment length evaluation, and quality control.
  • TSS: Perform TSS (Transcription Start Site) shift analysis.
  • PEAKS: Execute peak calling and generate annotated peaks.
  • TFF: Perform transcription factor footprinting analysis.
  • DIFF: Conduct differential accessibility analysis.
  • QC_REP: Generate quality control reports.

To use a specific entry point, run the pipeline with the -entry flag followed by the desired entry point name. For example:

nextflow run main.nf -entry PREBAM --samplesheet <sample sheet>

The skip options can be used to bypass individual steps, even within the sub-workflows. For example, if you want to run the PREBAM entry point but skip the BAMQC step, you can use:

nextflow run main.nf -entry PREBAM --samplesheet <sample sheet> --noqc

The pipeline also makes certain decisions based on the parameters provided.

For example:

If you provide a gene list for transcription factor footprinting analysis using the --tff parameter but do not specify that the supplied BAM file(s) are already shifted (via the --ssBam flag), the pipeline will automatically switch on SHIFT_TSS by setting the --ss parameter to true, since alignment shifting is a required step for TFF analysis.

This allows users to focus on specific steps of the analysis as needed.
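
In Nextflow/Groovy terms, this amounts to a simple parameter check before the sub-workflows are invoked. A rough sketch of the idea (not the pipeline's actual code):

// Sketch only: decide whether TSS shifting must run before footprinting
def runShift = params.ss || (params.tff && !params.ssBam)
if (runShift && !params.ss) {
    log.info "TFF requested on unshifted BAMs -- enabling the SHIFT_TSS step"
}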

Parameters

The pipeline accepts the following parameters, which can be customized to suit your analysis needs:

  • --sampleSheet: Path to the sample sheet file. (Default: null)
  • --outdir: Directory where output files will be saved. (Default: results)
  • --cpus: Number of threads to use for computation. (Default: 14)
  • --account: UPPMAX account to use for job submission. (Default: null)
  • --memory: Amount of memory allocated per process. (Default: 20 GB)
  • --time: Maximum runtime for processes. (Default: 10-00:00:00)
  • --retry: Number of retries for failed processes. (Default: 3)
  • --highMemForks: Maximum number of high-memory forks. (Default: 2)
  • --fatMemForks: Maximum number of fat-memory forks. (Default: 1)

Example of a sample sheet:

file,group
/path/to/BAM/sample_A.bam,Control
/path/to/BAM/sample_B.bam,Control
/path/to/BAM/sample_C.bam,Treatment
/path/to/BAM/sample_D.bam,Treatment

The group column is used to define groups for peak calling with Genrich.
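
For reference, a sample sheet of this shape can be consumed in Nextflow roughly as follows; this is a sketch of the general pattern, not necessarily the pipeline's exact channel logic:

// Sketch: read the sample sheet and group BAM files by their 'group' column
workflow {
    Channel
        .fromPath(params.sampleSheet)
        .splitCsv(header: true)
        .map { row -> tuple(row.group, file(row.file)) }
        .groupTuple()
        .view()   // e.g. [Control, [sample_A.bam, sample_B.bam]]
}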

Genome Details

  • --bsgenome: Reference genome package. (Default: BSgenome.Hsapiens.UCSC.hg38)
  • --genome_name: Name of the genome. (Default: Hsapiens)
  • --txdb: Transcript database package. (Default: TxDb.Hsapiens.UCSC.hg38.knownGene)
  • --hsdb: Annotation database package. (Default: org.Hs.eg.db)
  • --blk: Path to the blacklist file. (Default: /Users/xsinad/Documents/Codes/dockers/omics/test/hg38-blacklist.v2.bed)
  • --ref: Path to the reference genome file. (Default: /Users/xsinad/Documents/NBIS/refs/hg38.fa)
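
Since mm10 is also supported, switching genomes is a matter of overriding these parameters. An illustrative mouse configuration (the package names are the standard Bioconductor identifiers; the paths are placeholders):

// Illustrative mm10 overrides (paths are placeholders)
params {
    bsgenome    = 'BSgenome.Mmusculus.UCSC.mm10'
    genome_name = 'Mmusculus'
    txdb        = 'TxDb.Mmusculus.UCSC.mm10.knownGene'
    hsdb        = 'org.Mm.eg.db'
    blk         = '/path/to/mm10-blacklist.v2.bed'
    ref         = '/path/to/mm10.fa'
}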

Optional Flags

  • --ss: Perform Split Shift analysis. (Default: false)
  • --ssBam: Use pre-shifted BAM files. (Default: false)
  • --noqc: Skip BAM quality control. (Default: false)
  • --nobl: Skip blacklist filtering. (Default: false)
  • --nofl: Skip Picard fragment length evaluation. (Default: false)
  • --noanno: Skip annotation steps. (Default: false)
  • --nofc: Skip featureCounts analysis. (Default: false)
  • --nodiff: Skip differential analysis. (Default: false)
  • --tff: Perform transcription factor footprinting for a gene list. (Default: null)
  • --nopc: Skip peak calling. (Default: false)
  • --nopw: Skip pathway analysis. (Default: false)
  • --nogp: Skip adding group information. (Default: false)
  • --norep: Skip QC report generation. (Default: false)
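
As an example, a run that only needs preprocessing, peak calling, and annotation could switch off the downstream steps in a config file; flag names as listed above, values illustrative:

// Illustrative: skip differential accessibility, pathway analysis and the QC report
params {
    nodiff = true
    nopw   = true
    norep  = true
}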

Special note when using the --justdar flag:

  • The merged peak counts file(s) should be in multi-sample BED file format.

Shift Split Plot Parameters

  • --ss_tss_filter: TSS filter threshold. (Default: 0.5)
  • --ss_ntile: Number of tiles for analysis. (Default: 101)
  • --ss_ups: Upstream region size. (Default: 1010)
  • --ss_dws: Downstream region size. (Default: 1010)
  • --fdr: False discovery rate threshold. (Default: 0.05)

Peak Calling

  • --peakCaller: Peak calling tool to use. (Default: genrich) Currently only genrich is supported.
  • --peakCallerOpts: Additional options for the peak caller. (Default: -j)
  • --model: Specify a model for DAR, e.g. "Age + Sex + Condition". (Default: null)
  • --meta: Metadata CSV file for DAR analysis. (Default: null)
  • --conts: Contrasts to calculate DAR for. (Default: null)

Special note on the --conts parameter:

The --conts parameter is used to specify the contrasts for differential analysis. It should be a comma-separated list of contrasts, such as Condition__Control__Treatment,Condition__Control__Placebo. The first value is the column name where the annotation is stored, while the second and third values are the two levels to compare. The pipeline will automatically generate the necessary comparisons based on the provided contrasts. Please note that the separator for the contrasts is __ (a double underscore) rather than a single underscore _; this avoids clashes with underscores that sometimes appear in column names.
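
For clarity, the string format can be unpacked in Groovy as shown below; a sketch of the expected structure, not the pipeline's actual parsing code:

// Sketch: unpack --conts "Condition__Control__Treatment,Condition__Control__Placebo"
def contrasts = params.conts.split(',').collect { spec ->
    def parts = spec.split('__')
    [column: parts[0], levelA: parts[1], levelB: parts[2]]
}
// => [[column:Condition, levelA:Control, levelB:Treatment],
//     [column:Condition, levelA:Control, levelB:Placebo]]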

Output

The pipeline generates the following outputs:

  • Aligned BAM files
  • Peak calling results
  • Quality control reports
  • Summary statistics
  • Transcription factor footprinting results (if --tff is specified)
  • Differential Accessibility results
  • Pathway analysis results

Programs used

Program          Purpose
SAMtools         Manipulating SAM/BAM files
BEDTools         Manipulating genomic intervals
FastQC           Quality control for raw sequencing data
MultiQC          Aggregates results from multiple tools into a single report
Picard           Tools for manipulating high-throughput sequencing data
Genrich          Peak calling for ATAC-seq data
featureCounts    Counting reads in genomic features
R:DESeq2         Differential expression analysis
R:TxDb           Annotation database for transcript features
R:BSgenome       Provides access to genome sequences
R:ATACseqQC      Quality control for ATAC-seq data
R:EDAseq         Normalization and quality control for RNA-seq data

Contributing

Contributions are welcome! Please fork the repository and submit a pull request.

Acknowledgments

  • Built with Nextflow
  • Inspired by the NBIS Epigenomics Workshop repo
  • This pipeline is a work in progress and may not cover all edge cases or be fully optimized for all environments. Testing and validation are ongoing.
  • Disclosure: The author is affiliated with NBIS and GU; hence a probable bias exists towards the use of NBIS course materials/SOPs.

For questions or issues, please open an issue in the repository.
