Yet another ATACseq pipeline written in Nextflow, but this one is different (so said everyone). See the DAG; if you like the components in this pipeline, give it a spin!

nf-atac

Pipeline Stats

  • Modules: 5+ (PRE_BAM, QC_REPORT, PEAK_CALL, DAR_PATH, SHIFT_TSS, TFF_PLOTS)
  • Languages: Nextflow DSL2, Groovy, R
  • Supported Platforms: Local, HPC, Cloud
  • Execution Profiles: Local, Docker, Apptainer, Conda, Arm, UPPMAX
  • Peak Caller: Genrich (default)
  • Genome: hg38 (default), currently supports 'hg38' and 'mm10'
  • Custom Entry Points: 6+
  • Output Formats: BAM, BED, QC Reports, Pathway Analysis
  • Retries: 3 (default)
  • Memory: 20 GB (default per process)
  • CPUs: 14 (default)
  • Container: Defined per module, hosted at Seqera
  • FDR Threshold: 0.05 (default)
  • TSS Analysis: Enabled with --ss or --tff
  • Differential Analysis: Supports custom contrasts
  • Annotation: TxDb, BSgenome, org.Hs/Mm.eg.db
  • Blacklist Filtering: Enabled by default
  • QC Tools: FastQC, MultiQC, ATACseqQC
  • DAR Analysis: DESeq2
  • Footprinting: Gene-specific TFF analysis
  • Pathway Analysis: Integrated
  • Skip Options: 10+ customizable flags

nf-atac is a bioinformatics pipeline built using Nextflow for analyzing ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data. It automates the preprocessing, peak calling, and downstream analysis of ATAC-seq datasets.

DAG

nf-atac is currently in development; more modules will be added and optimizations made over time. This DAG will keep evolving.

flowchart TB
    subgraph " "
    v0["Channel.fromPath"]
    v7["blacklist"]
    v12["tag"]
    v19["tag"]
    v48["ref"]
    v62["goi"]
    end
    subgraph PRE_BAM
    v6([ADD_GROUP])
    v8([BAM_BLKLIST])
    v9([BAM_FL_RAW])
    v13([COMBINE_INSERT_METS_RAW])
    v15([BAMQC])
    v16([BAM_FL_QC])
    v20([COMBINE_INSERT_METS_QC])
    v1(( ))
    v11(( ))
    v18(( ))
    end
    subgraph " "
    v10[" "]
    v14[" "]
    v17[" "]
    v21[" "]
    v24[" "]
    v27[" "]
    v36[" "]
    v37[" "]
    v38[" "]
    v39[" "]
    v41[" "]
    v42[" "]
    v43[" "]
    v44[" "]
    v45[" "]
    v46[" "]
    v50[" "]
    v52[" "]
    v53[" "]
    v54[" "]
    v55[" "]
    v56[" "]
    v57[" "]
    v59["group_input"]
    v60[" "]
    v61[" "]
    v64[" "]
    end
    subgraph QC_REPORT
    v23([FASTQC])
    v26([MULTIQC])
    v25(( ))
    end
    subgraph PEAK_CALL
    v28([SORT_N])
    v31([GENRICH])
    v35([FEAT_COUNTS])
    v40([PEAK_ANNOT])
    v29(( ))
    v32(( ))
    v33(( ))
    v47(( ))
    end
    subgraph DAR_PATH
    v49([DAR])
    v51([PATHWAY])
    end
    subgraph SHIFT_TSS
    v58([SHIFT_SPLIT])
    end
    subgraph TFF_PLOTS
    v63([TFF])
    end
    v0 --> v1
    v1 --> v6
    v6 --> v8
    v7 --> v8
    v8 --> v9
    v8 --> v15
    v9 --> v10
    v9 --> v11
    v12 --> v13
    v11 --> v13
    v13 --> v14
    v15 --> v16
    v15 --> v28
    v15 --> v58
    v15 --> v1
    v16 --> v17
    v16 --> v18
    v19 --> v20
    v18 --> v20
    v20 --> v21
    v1 --> v23
    v23 --> v24
    v23 --> v25
    v25 --> v26
    v26 --> v27
    v28 --> v29
    v28 --> v33
    v29 --> v31
    v31 --> v40
    v31 --> v32
    v32 --> v35
    v33 --> v35
    v35 --> v39
    v35 --> v38
    v35 --> v37
    v35 --> v36
    v35 --> v47
    v40 --> v46
    v40 --> v45
    v40 --> v44
    v40 --> v43
    v40 --> v42
    v40 --> v41
    v48 --> v49
    v47 --> v49
    v49 --> v50
    v49 --> v51
    v51 --> v57
    v51 --> v56
    v51 --> v55
    v51 --> v54
    v51 --> v53
    v51 --> v52
    v58 --> v63
    v58 --> v61
    v58 --> v60
    v58 --> v59
    v62 --> v63
    v63 --> v64

Features

  • Automated Workflow: Streamlines the ATAC-seq analysis process.
  • Reproducibility: Ensures consistent results across runs using Nextflow.
  • Scalability: Supports execution on local machines, HPC clusters, and cloud platforms.
  • Customizable: Easily configurable to suit specific experimental needs.

Requirements

  • Nextflow (DSL2) and a compatible Java runtime
  • A container engine (Docker, Apptainer/Singularity) or Conda for software environments

Installation

  1. Clone the repository:

    git clone https://github.com/addityea/nf-atac.git
    cd nf-atac
  2. Install Nextflow:

    curl -s https://get.nextflow.io | bash
    # Make sure it's executable and move to PATH
    chmod +x nextflow
    sudo mv nextflow /usr/local/bin/

    If you don't have sudo access, you can move the nextflow binary to a directory on your PATH (e.g., ~/bin) or run it from the current directory by prefixing the command with ./.

  3. Ensure dependencies are installed (e.g., Docker/Singularity, Java).

Offline setup

If you need to run the pipeline in an offline environment, you can cache all the Singularity containers and use the offline profile. In this example, we download all the images into the conts directory. You can do this by running the following command from the nf-atac directory:

cd conts/
bash get_all.sh

After this, move the whole nf-atac directory to the offline environment and run the pipeline with the offline profile in addition to your preferred profile(s).

Keep in mind that when running offline the celldex download will fail, so make sure you provide an already compressed celldex reference in the sample sheet, or download the references beforehand and provide their paths in the sample sheet.
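
An offline profile of this kind typically just points the container engine at the pre-downloaded image cache. A minimal sketch of what such a profile might look like in nextflow.config (illustrative only, not the pipeline's actual configuration):

// Sketch of an offline profile: reuse images already downloaded into conts/
profiles {
    offline {
        singularity.enabled    = true
        singularity.autoMounts = true
        // point Nextflow at the pre-fetched images instead of pulling over the network
        singularity.cacheDir   = "${projectDir}/conts"
    }
}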

HPC profiles (SUPR NAISS Sweden only)

  • pdc_kth: profile for the PDC Dardel cluster
  • uppmax: profile for UPPMAX clusters, with auto-detection of the miarka, pelle, snowy, and rackham queues

If you are running the pipeline on these HPC clusters, you can use the corresponding profile. These profiles are configured to use Singularity containers and have settings specific to each HPC environment. The profiles were originally written by Pontus Freyhult (@pontus) and adapted for this pipeline by Aditya Singh (@addityea).

Usage

Run the pipeline with the following command:

nextflow run main.nf --samplesheet <sample sheet> -resume  

You may customise the pipeline by modifying the nextflow.config file or by passing additional parameters on the command line. Several profiles are available for different environments:

  • 'local' for running on a local machine
  • 'docker' for running with Docker
  • 'arm' for specific flags for ARM architecture
  • 'apptainer' for running with Apptainer
  • 'UPPMAX' for running on UPPMAX cluster

Custom Entry points

The pipeline provides several custom entry points to execute specific parts of the workflow without running the entire pipeline. These entry points correspond to the sub-workflows defined in the main.nf file:

  • PREBAM: Preprocess BAM files, including blacklist filtering, fragment length evaluation, and quality control.
  • TSS: Perform TSS (Transcription Start Site) shift analysis.
  • PEAKS: Execute peak calling and generate annotated peaks.
  • TFF: Perform transcription factor footprinting analysis.
  • DIFF: Conduct differential accessibility analysis.
  • QC_REP: Generate quality control reports.

To use a specific entry point, run the pipeline with the -entry flag followed by the desired entry point name. For example:

nextflow run main.nf -entry PREBAM --samplesheet <sample sheet>

The skip options can be used to bypass individual steps, even within the sub-workflows. For example, if you want to run the PREBAM entry point but skip the BAMQC step, you can use:

nextflow run main.nf -entry PREBAM --samplesheet <sample sheet> --noqc

The pipeline also makes certain decisions based on the parameters provided.

For example:

If you provide a gene list for transcription factor footprinting analysis using the --tff parameter but do not specify that the supplied BAM file(s) are already shifted (via the --ssBam flag), the pipeline will automatically switch on SHIFT_TSS by setting the --ss parameter to true, since alignment shifting is a required step for TFF analysis.

This allows users to focus on specific steps of the analysis as needed.
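
In Nextflow/Groovy terms, this amounts to a simple parameter check before the sub-workflows are invoked. A rough sketch of the idea (not the pipeline's actual code):

// Sketch only: decide whether TSS shifting must run before footprinting
def runShift = params.ss || (params.tff && !params.ssBam)
if (runShift && !params.ss) {
    log.info "TFF requested on unshifted BAMs -- enabling the SHIFT_TSS step"
}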

Parameters

The pipeline accepts the following parameters, which can be customized to suit your analysis needs:

  • --sampleSheet: Path to the sample sheet file. (Default: null)
  • --outdir: Directory where output files will be saved. (Default: results)
  • --cpus: Number of threads to use for computation. (Default: 14)
  • --account: UPPMAX account to use for job submission. (Default: null)
  • --memory: Amount of memory allocated per process. (Default: 20 GB)
  • --time: Maximum runtime for processes. (Default: 10-00:00:00)
  • --retry: Number of retries for failed processes. (Default: 3)
  • --highMemForks: Maximum number of high-memory forks. (Default: 2)
  • --fatMemForks: Maximum number of fat-memory forks. (Default: 1)

Example of a sample sheet:

file,group
/path/to/BAM/sample_A.bam,Control
/path/to/BAM/sample_B.bam,Control
/path/to/BAM/sample_C.bam,Treatment
/path/to/BAM/sample_D.bam,Treatment

The group column is used to define groups for peak calling with Genrich.
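
For reference, a sample sheet of this shape can be consumed in Nextflow roughly as follows; this is a sketch of the general pattern, not necessarily the pipeline's exact channel logic:

// Sketch: read the sample sheet and group BAM files by their 'group' column
workflow {
    Channel
        .fromPath(params.sampleSheet)
        .splitCsv(header: true)
        .map { row -> tuple(row.group, file(row.file)) }
        .groupTuple()
        .view()   // e.g. [Control, [sample_A.bam, sample_B.bam]]
}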

Genome Details

  • --bsgenome: Reference genome package. (Default: BSgenome.Hsapiens.UCSC.hg38)
  • --genome_name: Name of the genome. (Default: Hsapiens)
  • --txdb: Transcript database package. (Default: TxDb.Hsapiens.UCSC.hg38.knownGene)
  • --hsdb: Annotation database package. (Default: org.Hs.eg.db)
  • --blk: Path to the blacklist file. (Default: /Users/xsinad/Documents/Codes/dockers/omics/test/hg38-blacklist.v2.bed)
  • --ref: Path to the reference genome file. (Default: /Users/xsinad/Documents/NBIS/refs/hg38.fa)
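
Since mm10 is also supported, switching genomes is a matter of overriding these parameters. An illustrative mouse configuration (the package names are the standard Bioconductor identifiers; the paths are placeholders):

// Illustrative mm10 overrides (paths are placeholders)
params {
    bsgenome    = 'BSgenome.Mmusculus.UCSC.mm10'
    genome_name = 'Mmusculus'
    txdb        = 'TxDb.Mmusculus.UCSC.mm10.knownGene'
    hsdb        = 'org.Mm.eg.db'
    blk         = '/path/to/mm10-blacklist.v2.bed'
    ref         = '/path/to/mm10.fa'
}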

Optional Flags

  • --ss: Perform Split Shift analysis. (Default: false)
  • --ssBam: Use pre-shifted BAM files. (Default: false)
  • --noqc: Skip BAM quality control. (Default: false)
  • --nobl: Skip blacklist filtering. (Default: false)
  • --nofl: Skip Picard fragment length evaluation. (Default: false)
  • --noanno: Skip annotation steps. (Default: false)
  • --nofc: Skip featureCounts analysis. (Default: false)
  • --nodiff: Skip differential analysis. (Default: false)
  • --tff: Perform transcription factor footprinting for a gene list. (Default: null)
  • --nopc: Skip peak calling. (Default: false)
  • --nopw: Skip pathway analysis. (Default: false)
  • --nogp: Skip adding group information. (Default: false)
  • --norep: Skip QC report generation. (Default: false)
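
As an example, a run that only needs preprocessing, peak calling, and annotation could switch off the downstream steps in a config file; flag names as listed above, values illustrative:

// Illustrative: skip differential accessibility, pathway analysis and the QC report
params {
    nodiff = true
    nopw   = true
    norep  = true
}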

Special note when using the --justdar flag:

  • The merged peak counts file(s) should be in multi-sample BED file format.

Shift Split Plot Parameters

  • --ss_tss_filter: TSS filter threshold. (Default: 0.5)
  • --ss_ntile: Number of tiles for analysis. (Default: 101)
  • --ss_ups: Upstream region size. (Default: 1010)
  • --ss_dws: Downstream region size. (Default: 1010)
  • --fdr: False discovery rate threshold. (Default: 0.05)

Peak Calling

  • --peakCaller: Peak calling tool to use. (Default: genrich) Currently only genrich is supported.
  • --peakCallerOpts: Additional options for the peak caller. (Default: -j)
  • --model: Specify a model for DAR, e.g. "Age + Sex + Condition". (Default: null)
  • --meta: Metadata CSV file for DAR analysis. (Default: null)
  • --conts: Contrasts to calculate DAR for. (Default: null)

Special note on the --conts parameter:

The --conts parameter is used to specify the contrasts for differential analysis. It should be a comma-separated list of contrasts, such as Condition__Control__Treatment,Condition__Control__Placebo. The first value is the column name where the annotation is stored, while the second and third values are the two levels to compare. The pipeline will automatically generate the necessary comparisons based on the provided contrasts. Please note that the separator for the contrasts is __ (a double underscore) rather than a single underscore _; this avoids clashes with underscores that sometimes appear in column names.
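
For clarity, the string format can be unpacked in Groovy as shown below; a sketch of the expected structure, not the pipeline's actual parsing code:

// Sketch: unpack --conts "Condition__Control__Treatment,Condition__Control__Placebo"
def contrasts = params.conts.split(',').collect { spec ->
    def parts = spec.split('__')
    [column: parts[0], levelA: parts[1], levelB: parts[2]]
}
// => [[column:Condition, levelA:Control, levelB:Treatment],
//     [column:Condition, levelA:Control, levelB:Placebo]]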

Output

The pipeline generates the following outputs:

  • Aligned BAM files
  • Peak calling results
  • Quality control reports
  • Summary statistics
  • Transcription factor footprinting results (if --tff is specified)
  • Differential Accessibility results
  • Pathway analysis results

Programs used

Program          Purpose
SAMtools         Manipulating SAM/BAM files
BEDTools         Manipulating genomic intervals
FastQC           Quality control for raw sequencing data
MultiQC          Aggregates results from multiple tools into a single report
Picard           Tools for manipulating high-throughput sequencing data
Genrich          Peak calling for ATAC-seq data
featureCounts    Counting reads in genomic features
R:DESeq2         Differential expression analysis
R:TxDb           Annotation database for transcript features
R:BSgenome       Provides access to genome sequences
R:ATACseqQC      Quality control for ATAC-seq data
R:EDAseq         Normalization and quality control for RNA-seq data

Contributing

Contributions are welcome! Please fork the repository and submit a pull request.

Acknowledgments

  • Built with Nextflow
  • Inspired by the NBIS Epigenomics Workshop repo
  • This pipeline is a work in progress and may not cover all edge cases or be fully optimized for all environments. Testing and validation are ongoing.
  • Disclosure: The author is affiliated with NBIS and GU; hence a probable bias exists towards the use of NBIS course materials/SOPs.

For questions or issues, please open an issue in the repository.
