This is an automated workflow pipeline for analyzing and processing ATAC-seq data, implemented primarily in bash and wrapped in a NextFlow workflow, to characterize the chromatin landscape in bulk ATAC-seq samples. Here are the steps for data processing:
- [Completed] Running Trim Galore to cut the adapters
- [Completed] Running alignment to the reference genome using Bowtie2
- [Completed] Running filtering using Samtools
- [Completed] Running mark duplicates using Picard
- [Completed] Running peak calling using MACS2
- [In-progress] Calculating TSSe score
- [Completed] Generating bigWig and heatmap using deepTools
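The completed steps above can be sketched as plain shell commands for a single sample. This is only a hedged illustration of the workflow, not the pipeline's actual NextFlow processes: the sample name, intermediate file names, the MAPQ filter, the `-g hs` genome flag, and the index prefix are all placeholder assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail
# Illustrative per-sample ATAC-seq commands (placeholder names and paths).
# DRYRUN=1 (the default) only prints each command instead of running it.
DRYRUN=${DRYRUN:-1}
run() { LAST_CMD="$*"; if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

SAMPLE=sample1                               # placeholder sample base name
IDX=/path/to/reference_genome/index/hg38     # Bowtie2 index prefix from pipeline.config

run trim_galore --paired "${SAMPLE}_R1.fastq.gz" "${SAMPLE}_R2.fastq.gz"       # adapter trimming
run bowtie2 -x "$IDX" -1 "${SAMPLE}_R1_val_1.fq.gz" -2 "${SAMPLE}_R2_val_2.fq.gz" -S "${SAMPLE}.sam"
run samtools sort -o "${SAMPLE}.sorted.bam" "${SAMPLE}.sam"                    # coordinate sort
run samtools view -b -q 30 -o "${SAMPLE}.filtered.bam" "${SAMPLE}.sorted.bam"  # filter low-MAPQ reads
run java -jar picard.jar MarkDuplicates I="${SAMPLE}.filtered.bam" O="${SAMPLE}.dedup.bam" M="${SAMPLE}.dup_metrics.txt"
run macs2 callpeak -t "${SAMPLE}.dedup.bam" -f BAMPE -g hs -n "$SAMPLE"        # peak calling
run bamCoverage -b "${SAMPLE}.dedup.bam" -o "${SAMPLE}.bw"                     # deepTools bigWig
```

By default the script only prints each command; set `DRYRUN=0` to execute it on a system with the tools installed.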
This tool is used to process bulk ATAC-seq data by mapping paired-end reads to a reference genome and identifying areas of open chromatin after peak calling. This tool generates files that can be visualized on a genome browser.
Running the tool is straightforward; however, a good understanding of bash is recommended. Please familiarize yourself with basic bash syntax, data types, and data structures.
There are two ways to run the ATAC-seq pipeline: either by installing the necessary packages manually on your local system, or by using a Docker container, where everything is pre-installed. If you choose to use Docker, skip ahead to the section Running the Tool in Docker.
You can install the ATAC-Seq NextFlow Pipeline via git:
git clone https://github.com/utdal/ATACSeq-NextFlow-Pipeline
To execute the tool, essential modifications need to be made to the following files:
a) `pipeline.config`
b) `atac_seq_samples.txt`
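As an illustration of the expected format, `atac_seq_samples.txt` lists one sample base name per line, with no `_R1`/`_R2` paired-end suffixes (the names below are placeholders):

```
sample1
sample2
```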
Note:
- Install the FastQC, MultiQC, Cutadapt, Trim Galore, and MACS2 packages.
- To run MarkDuplicates, you will need the Picard Java archive file `picard.jar`, which can be downloaded from the Broad Institute's website. Make sure to update the file path to this archive in the `pipeline.config` file.
- Before executing the pipeline, you must build the Bowtie2 index from the reference genome and place it in the config directory: `params.config_directory = '/path/to/config'`. Download the reference genome `hg38canon.fa` and build the index by executing `bowtie2-build hg38canon.fa /path/to/reference_genome/index/hg38`. Here `/path/to/reference_genome/index/hg38` is the Bowtie2 human genome index prefix, which needs to be updated in the `pipeline.config` file.
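The index-building step can be wrapped in a small guard so it fails clearly when inputs are missing. A hedged sketch, assuming the placeholder paths above; the index prefix must match whatever is set in `pipeline.config`:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Hedged sketch of building the Bowtie2 index; both paths are placeholders.
REF=hg38canon.fa
IDX=/path/to/reference_genome/index/hg38
CMD="bowtie2-build $REF $IDX"
if command -v bowtie2-build >/dev/null 2>&1 && [ -s "$REF" ]; then
  $CMD   # build the index for real
else
  echo "bowtie2-build or $REF not available; would run: $CMD"
fi
```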
Here is an example of how to run the pipeline:
- Command to run the pipeline: `nextflow run atac_seq_nextflow_pipeline.nf -c pipeline.config`
- Command to re-run from the fail-point: `nextflow run atac_seq_nextflow_pipeline.nf -c pipeline.config -resume`
The results generated are stored in the directory specified by `params.config_directory = '/path/to/config'` in the `pipeline.config` file.
Running the ATAC-seq pipeline in Docker is straightforward. Here is an example of how to run it using Docker.
- Check if Docker is already installed: `docker --version`
Below are the required input and configuration files needed to run the tool. Place all the necessary files in the data directory, i.e., `/mnt/Working/ATACSeq-NextFlow-Pipeline/data`, using a Docker volume.
Note: The config directory in the Docker image is `/mnt/Working/ATACSeq-NextFlow-Pipeline`, and all data added via a Docker volume mount is accessible from the `data` directory (`/mnt/Working/ATACSeq-NextFlow-Pipeline/data`). Modify the `pipeline.config` file accordingly.
- Paired-end FASTQ files in a `fastq_files` directory.
- Bowtie2 genome index files in a directory (e.g., `hg38`).
- Reference genome from NCBI in the `refdata-gex-GRCh38-2020-A` directory.
- `atac_seq_samples.txt` containing sample names without paired-end information.
- `pipeline.config` file containing paths to all the necessary files and the genome reference.
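Before launching the container, it can help to verify that the mounted directory actually contains the required files. A hedged sketch that checks a subset of the list above; it uses a throwaway demo directory, so point `DATA` at the real mounted volume in practice:

```shell
#!/usr/bin/env bash
set -eu
# Hedged pre-flight check sketch. A throwaway demo directory stands in for
# the real mounted volume; pipeline.config is deliberately absent here.
DATA=$(mktemp -d)
mkdir -p "$DATA/fastq_files"
touch "$DATA/atac_seq_samples.txt"

MISSING=0
for f in fastq_files atac_seq_samples.txt pipeline.config; do
  [ -e "$DATA/$f" ] || { echo "missing: $DATA/$f"; MISSING=$((MISSING+1)); }
done
echo "$MISSING required item(s) missing from $DATA"
```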
- Run the Docker image by setting up a working directory and mounting a volume where the input and configuration files are located: `docker run -it -v C:\Users\NXI220005\Documents\docker_atac_mount_testing:/mnt/Working/ATACSeq-NextFlow-Pipeline/data -w /mnt/Working/ATACSeq-NextFlow-Pipeline unikill066/atac_seq_nextflow_pipeline:latest /bin/bash`
After entering the container, run the following commands:
- Activate the working environment: `conda activate atac_seq`
- Run the NextFlow pipeline: `nextflow run atac_seq_nextflow_pipeline.nf -c data/pipeline.config`
- If the pipeline encounters errors, don't worry: fix the issues and resume the process from the last checkpoint with `nextflow run atac_seq_nextflow_pipeline.nf -c data/pipeline.config -resume`
Once the run is completed, all output files will be copied back to the mounted volume.
