- Modules: 5+ (PRE_BAM, QC_REPORT, PEAK_CALL, DAR_PATH, SHIFT_TSS, TFF_PLOTS)
- Languages: Nextflow DSL2, Groovy, R
- Supported Platforms: Local, HPC, Cloud
- Execution Profiles: Local, Docker, Apptainer, UPPMAX
- Peak Caller: Genrich (default)
- Genome: hg38 (default), currently supports 'hg38' and 'mm10'
- Custom Entry Points: 6+
- Output Formats: BAM, BED, QC Reports, Pathway Analysis
- Retries: 3 (default)
- Memory: 20 GB (default per process)
- CPUs: 14 (default)
- Container: Defined per module, hosted at Seqera
- FDR Threshold: 0.05 (default)
- TSS Analysis: Enabled with `--ss` or `--tff`
- Differential Analysis: Supports custom contrasts
- Annotation: TxDb, BSgenome, org.Hs/Mm.eg.db
- Blacklist Filtering: Enabled by default
- QC Tools: FastQC, MultiQC, ATACseqQC
- DAR Analysis: DESeq2
- Footprinting: Gene-specific TFF analysis
- Pathway Analysis: Integrated
- Skip Options: 10+ customizable flags
- Profiles: Docker, Apptainer, Conda, Arm, UPPMAX, Local
- Default Genome: hg38 (configurable)
nf-atac is a bioinformatics pipeline built using Nextflow for analyzing ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data. It automates the preprocessing, peak calling, and downstream analysis of ATAC-seq datasets.
nf-atac is currently under development; more modules will be added and further optimizations made over time.
The DAG below will keep evolving.
flowchart TB
subgraph " "
v0["Channel.fromPath"]
v7["blacklist"]
v12["tag"]
v19["tag"]
v48["ref"]
v62["goi"]
end
subgraph PRE_BAM
v6([ADD_GROUP])
v8([BAM_BLKLIST])
v9([BAM_FL_RAW])
v13([COMBINE_INSERT_METS_RAW])
v15([BAMQC])
v16([BAM_FL_QC])
v20([COMBINE_INSERT_METS_QC])
v1(( ))
v11(( ))
v18(( ))
end
subgraph " "
v10[" "]
v14[" "]
v17[" "]
v21[" "]
v24[" "]
v27[" "]
v36[" "]
v37[" "]
v38[" "]
v39[" "]
v41[" "]
v42[" "]
v43[" "]
v44[" "]
v45[" "]
v46[" "]
v50[" "]
v52[" "]
v53[" "]
v54[" "]
v55[" "]
v56[" "]
v57[" "]
v59["group_input"]
v60[" "]
v61[" "]
v64[" "]
end
subgraph QC_REPORT
v23([FASTQC])
v26([MULTIQC])
v25(( ))
end
subgraph PEAK_CALL
v28([SORT_N])
v31([GENRICH])
v35([FEAT_COUNTS])
v40([PEAK_ANNOT])
v29(( ))
v32(( ))
v33(( ))
v47(( ))
end
subgraph DAR_PATH
v49([DAR])
v51([PATHWAY])
end
subgraph SHIFT_TSS
v58([SHIFT_SPLIT])
end
subgraph TFF_PLOTS
v63([TFF])
end
v0 --> v1
v1 --> v6
v6 --> v8
v7 --> v8
v8 --> v9
v8 --> v15
v9 --> v10
v9 --> v11
v12 --> v13
v11 --> v13
v13 --> v14
v15 --> v16
v15 --> v28
v15 --> v58
v15 --> v1
v16 --> v17
v16 --> v18
v19 --> v20
v18 --> v20
v20 --> v21
v1 --> v23
v23 --> v24
v23 --> v25
v25 --> v26
v26 --> v27
v28 --> v29
v28 --> v33
v29 --> v31
v31 --> v40
v31 --> v32
v32 --> v35
v33 --> v35
v35 --> v39
v35 --> v38
v35 --> v37
v35 --> v36
v35 --> v47
v40 --> v46
v40 --> v45
v40 --> v44
v40 --> v43
v40 --> v42
v40 --> v41
v48 --> v49
v47 --> v49
v49 --> v50
v49 --> v51
v51 --> v57
v51 --> v56
v51 --> v55
v51 --> v54
v51 --> v53
v51 --> v52
v58 --> v63
v58 --> v61
v58 --> v60
v58 --> v59
v62 --> v63
v63 --> v64
- Automated Workflow: Streamlines the ATAC-seq analysis process.
- Reproducibility: Ensures consistent results across runs using Nextflow.
- Scalability: Supports execution on local machines, HPC clusters, and cloud platforms.
- Customizable: Easily configurable to suit specific experimental needs.
- Clone the repository:
  `git clone https://github.com/addityea/nf-atac.git`
  `cd nf-atac`
- Install Nextflow:
  `curl -s https://get.nextflow.io | bash`
  `# Make sure it's executable and move it to your PATH`
  `chmod +x nextflow`
  `sudo mv nextflow /usr/local/bin/`
  In case you don't have `sudo` access, you can move the `nextflow` binary to a directory in your `PATH` (e.g., `~/bin`) or run it from the current directory by prefixing the command with `./`.
- Ensure dependencies are installed (e.g., Docker/Singularity, Java); a quick check is sketched below.
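As an optional sanity check, the commands below confirm that Nextflow and a container engine are available on your `PATH`; nothing here is pipeline-specific:

```bash
# Verify the Nextflow installation
nextflow -version

# Verify the container engine that matches the profile you plan to use
docker --version      # if you plan to run with -profile docker
apptainer --version   # if you plan to run with -profile apptainer
```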
If you need to run the pipeline in an offline environment, you can cache all the Singularity containers and use the offline profile.
In this example, we download all the images to the `conts` directory. You can do this by running the following commands from the nf-xen directory:
cd conts/
bash get_all.sh
After this, move the whole nf-xen directory to the offline environment and run the pipeline with the `offline` profile in addition to your preferred profile(s).
Keep in mind that when running offline, the cellDex download will fail; hence, make sure you provide an already compressed celldex reference in the sample sheet, or download the references beforehand and provide their paths in the sample sheet.
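Once the containers are cached, a run in the offline environment might look like the sketch below; `samples.csv` is a placeholder for your own sample sheet, and combining `offline` with `apptainer` is just one possible profile combination:

```bash
# Offline run using the cached container images (illustrative)
nextflow run main.nf -profile offline,apptainer --samplesheet samples.csv -resume
```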
- `pdc_kth` profile for the PDC Dardel cluster
- `uppmax` profile for the UPPMAX cluster, including auto-detection of the miarka, pelle, snowy, and rackham queues
If you are running the pipeline on these HPC clusters, you can use the corresponding profile. These profiles are configured to use Singularity containers and have specific settings for the HPC environments.
Profiles originally written by Pontus Freyhult (@pontus) and adapted for the pipeline by Aditya Singh (@addityea).
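As an illustration, a run on UPPMAX could look like the sketch below; `<your-project-id>` and `samples.csv` are placeholders:

```bash
# Run on UPPMAX, charging jobs to your compute project via --account
nextflow run main.nf -profile uppmax --account <your-project-id> --samplesheet samples.csv -resume
```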
Run the pipeline with the following command:
`nextflow run main.nf --samplesheet <sample sheet> -resume`
You may customise the pipeline by modifying the `nextflow.config` file or by passing additional parameters on the command line.
There are some profiles available for different environments:
- 'local' for running on a local machine
- 'docker' for running with Docker
- 'arm' for specific flags for ARM architecture
- 'apptainer' for running with Apptainer
- 'UPPMAX' for running on the UPPMAX cluster
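Profiles can be combined with a comma. A minimal sketch, assuming a local ARM machine (e.g., Apple Silicon) with Docker installed; `samples.csv` is a placeholder:

```bash
# Combine the docker and arm profiles on an ARM workstation
nextflow run main.nf -profile docker,arm --samplesheet samples.csv -resume
```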
The pipeline provides several custom entry points to execute specific parts of the workflow without running the entire pipeline. These entry points correspond to the sub-workflows defined in the main.nf file:
- `PREBAM`: Preprocess BAM files, including blacklist filtering, fragment length evaluation, and quality control.
- `TSS`: Perform TSS (Transcription Start Site) shift analysis.
- `PEAKS`: Execute peak calling and generate annotated peaks.
- `TFF`: Perform transcription factor footprinting analysis.
- `DIFF`: Conduct differential accessibility analysis.
- `QC_REP`: Generate quality control reports.
To use a specific entry point, run the pipeline with the -entry flag followed by the desired entry point name. For example:
`nextflow run main.nf -entry PREBAM --samplesheet <sample sheet>`
The skip options can be used to skip other steps in the pipeline, even within the sub-workflows. For example, if you want to run the PREBAM entry point but skip the BAMQC step, you can use:
`nextflow run main.nf -entry PREBAM --samplesheet <sample sheet> --noqc`
There are certain decisions the pipeline will make based on the parameters provided.
For example:
If you provide a gene list for transcription factor footprinting analysis using the `--tff` parameter but do not specify that the supplied BAM file(s) are already shifted by providing the `--ssBam` flag, the pipeline will automatically switch on SHIFT_TSS by setting the `--ss` parameter to true, as alignment shifting is a required step for TFF analysis.
This allows users to focus on specific steps of the analysis as needed.
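For example, the two invocations below illustrate the difference; `genes.txt` and `samples.csv` are placeholders:

```bash
# BAM files are NOT pre-shifted: the pipeline sets --ss to true and runs SHIFT_TSS first
nextflow run main.nf --samplesheet samples.csv --tff genes.txt -resume

# BAM files are already shifted: pass --ssBam so SHIFT_TSS is not triggered automatically
nextflow run main.nf --samplesheet samples.csv --tff genes.txt --ssBam -resume
```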
The pipeline accepts the following parameters, which can be customized to suit your analysis needs:
- `--sampleSheet`: Path to the sample sheet file. (Default: `null`)
- `--outdir`: Directory where output files will be saved. (Default: `results`)
- `--cpus`: Number of threads to use for computation. (Default: `14`)
- `--account`: UPPMAX account to use for job submission. (Default: `null`)
- `--memory`: Amount of memory allocated per process. (Default: `20 GB`)
- `--time`: Maximum runtime for processes. (Default: `10-00:00:00`)
- `--retry`: Number of retries for failed processes. (Default: `3`)
- `--highMemForks`: Maximum number of high-memory forks. (Default: `2`)
- `--fatMemForks`: Maximum number of fat-memory forks. (Default: `1`)
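Any of these defaults can be overridden on the command line; the values in the sketch below are purely illustrative:

```bash
# Scale the pipeline down for a smaller machine (illustrative values)
nextflow run main.nf \
  --samplesheet samples.csv \
  --outdir results_run1 \
  --cpus 8 \
  --memory '12 GB' \
  --retry 2 \
  -resume
```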
Example of a sample sheet:
file,group
/path/to/BAM/sample_A.bam,Control
/path/to/BAM/sample_B.bam,Control
/path/to/BAM/sample_C.bam,Treatment
/path/to/BAM/sample_D.bam,Treatment
The group column is used to define group-based peak calling in the Genrich peak caller.
- `--bsgenome`: Reference genome package. (Default: `BSgenome.Hsapiens.UCSC.hg38`)
- `--genome_name`: Name of the genome. (Default: `Hsapiens`)
- `--txdb`: Transcript database package. (Default: `TxDb.Hsapiens.UCSC.hg38.knownGene`)
- `--hsdb`: Annotation database package. (Default: `org.Hs.eg.db`)
- `--blk`: Path to the blacklist file. (Default: `/Users/xsinad/Documents/Codes/dockers/omics/test/hg38-blacklist.v2.bed`)
- `--ref`: Path to the reference genome file. (Default: `/Users/xsinad/Documents/NBIS/refs/hg38.fa`)
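Switching to mouse (mm10) data would mean swapping these packages and files. The sketch below uses the standard Bioconductor mm10 package names and placeholder paths, and assumes the pipeline accepts them unchanged:

```bash
# Run against mm10 (standard Bioconductor package names; paths are placeholders)
nextflow run main.nf \
  --samplesheet samples.csv \
  --bsgenome BSgenome.Mmusculus.UCSC.mm10 \
  --genome_name Mmusculus \
  --txdb TxDb.Mmusculus.UCSC.mm10.knownGene \
  --hsdb org.Mm.eg.db \
  --blk /path/to/mm10-blacklist.v2.bed \
  --ref /path/to/mm10.fa \
  -resume
```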
- `--ss`: Perform Split Shift analysis. (Default: `false`)
- `--ssBam`: Use pre-shifted BAM files. (Default: `false`)
- `--noqc`: Skip BAM quality control. (Default: `false`)
- `--nobl`: Skip blacklist filtering. (Default: `false`)
- `--nofl`: Skip Picard fragment length evaluation. (Default: `false`)
- `--noanno`: Skip annotation steps. (Default: `false`)
- `--nofc`: Skip featureCounts analysis. (Default: `false`)
- `--nodiff`: Skip differential analysis. (Default: `false`)
- `--tff`: Perform transcription factor footprinting for a gene list. (Default: `null`)
- `--nopc`: Skip peak calling. (Default: `false`)
- `--nopw`: Skip pathway analysis. (Default: `false`)
- `--nogp`: Skip adding group information. (Default: `false`)
- `--norep`: Skip QC report generation. (Default: `false`)
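These flags can be freely combined. For example, the sketch below runs preprocessing and peak calling but skips the downstream differential, pathway, and report steps; `samples.csv` is a placeholder:

```bash
# Skip differential analysis, pathway analysis, and QC report generation
nextflow run main.nf --samplesheet samples.csv --nodiff --nopw --norep -resume
```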
Special note when using the `--justdar` flag:
- The merged peak counts file(s) should be in multi-sample `BED` file format.
- `--ss_tss_filter`: TSS filter threshold. (Default: `0.5`)
- `--ss_ntile`: Number of tiles for analysis. (Default: `101`)
- `--ss_ups`: Upstream region size. (Default: `1010`)
- `--ss_dws`: Downstream region size. (Default: `1010`)
- `--fdr`: False discovery rate threshold. (Default: `0.05`)
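As an illustration, the sketch below widens the TSS window and tightens the FDR threshold; the values are arbitrary examples, not recommendations:

```bash
# Shift/TSS analysis with a wider window and stricter FDR (illustrative values)
nextflow run main.nf --samplesheet samples.csv --ss --ss_ups 2000 --ss_dws 2000 --fdr 0.01 -resume
```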
- `--peakCaller`: Peak calling tool to use. (Default: `genrich`) Currently only `genrich` is supported.
- `--peakCallerOpts`: Additional options for the peak caller. (Default: `-j`)
- `--model`: Specify a model for DAR, e.g. "Age + Sex + Condition". (Default: `null`)
- `--meta`: Metadata CSV file for DAR analysis. (Default: `null`)
- `--conts`: Contrasts to calculate DAR for. (Default: `null`)
Special note on the --conts parameter:
The `--conts` parameter is used to specify the contrasts for differential analysis. It should be a comma-separated list of contrasts, such as `Condition__Control__Treatment,Condition__Control__Placebo`. The first value is the metadata column that holds the annotation, while the second and third values are the two levels to contrast. The pipeline will automatically generate the necessary comparisons based on the provided contrasts. Please note that the separator within each contrast is `__` (double underscore), not a single underscore `_`; this avoids clashes with underscores that sometimes appear in column names.
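Putting the DAR options together, a run with a custom model and the two contrasts from the example above might look like this; `metadata.csv` and `samples.csv` are placeholders:

```bash
# Differential accessibility with a custom model and two contrasts
nextflow run main.nf \
  --samplesheet samples.csv \
  --model "Age + Sex + Condition" \
  --meta metadata.csv \
  --conts Condition__Control__Treatment,Condition__Control__Placebo \
  -resume
```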
The pipeline generates the following outputs:
- Aligned BAM files
- Peak calling results
- Quality control reports
- Summary statistics
- Transcription factor footprinting results (if `--tff` is specified)
- Differential Accessibility results
- Pathway analysis results
| Program | Purpose | Link |
|---|---|---|
| SAMtools | Manipulating SAM/BAM files | SAMtools |
| BEDTools | Manipulating genomic intervals | BEDTools |
| FastQC | Quality control for raw sequencing data | FastQC |
| MultiQC | Aggregates results from multiple tools into a single report | MultiQC |
| Picard | Tools for manipulating high-throughput sequencing data | Picard |
| Genrich | Peak calling for ATAC-seq data | Genrich |
| FeatureCounts | Counting reads in genomic features | FeatureCounts |
| R:DESeq2 | Differential expression analysis | DESeq2 |
| R:TxDb | Annotation database for transcript features | TxDb |
| R:BSgenome | Provides access to genome sequences | BSgenome |
| R:ATACseqQC | Quality control for ATAC-seq data | ATACseqQC |
| R:EDAseq | Normalization and quality control for RNA-seq data | EDAseq |
Contributions are welcome! Please fork the repository and submit a pull request.
- Built with Nextflow
- Inspired by the NBIS Epigenomics Workshop repo
- This pipeline is a work in progress and may not cover all edge cases or be fully optimized for all environments. Testing and validation are ongoing.
- Disclosure: The author is affiliated with NBIS and GU; hence, a probable bias exists towards the use of NBIS course materials/SOPs.
For questions or issues, please open an issue in the repository.