ATAC-seq Snakemake Pipeline

A comprehensive, reproducible Snakemake workflow for processing and analyzing ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) data. This pipeline performs end-to-end analysis from raw FASTQ files to peak calling, annotation, and reproducibility assessment.

Features

Quality Control: Fastp adapter trimming, alignment quality metrics, TSS enrichment analysis
Flexible Alignment: Supports both Bowtie2 and BWA-MEM2 aligners
Prealignment Filtering: Optional prealignment to filter reads (e.g., mitochondrial DNA)
Peak Calling: MACS2 peak calling with Tn5 shift correction
Reproducibility Analysis:
- IDR (Irreproducible Discovery Rate) analysis for replicate consistency
- Replicate Correlation: Pearson and Spearman correlation metrics based on read counts in merged peaks
- Self-consistency (pseudo-replicates) and true replicate comparisons
- Optimal peak selection following ENCODE standards
Annotation: UROPA-based peak annotation to genomic features
Quantification: Read counts in consensus peaks, promoter regions, and TSS regions
Motif Analysis: HOMER motif enrichment analysis
Visualization: MultiQC reports, correlation scatter plots, BigWig tracks

Requirements

Snakemake >= 8.20.1
Conda or Mamba (for environment management)
Python 3.8+
PEP (Portable Encapsulated Projects) for sample metadata management

Installation

Clone the repository:

git clone <repository-url>
cd ATAC-sm

Install Snakemake (if not already installed):

conda install -c bioconda snakemake
# or
mamba install -c bioconda snakemake

Install PEP (if not already installed):

pip install peppy

The pipeline uses conda environments defined in workflow/envs/. These will be automatically created when you run the pipeline.

Quick Start

Prepare your configuration files:
- Edit config/config.yaml with your project settings
- Edit pep/project_config.yaml with your path variables
- Prepare your sample annotation CSV file (see Configuration section)
Run the pipeline:

cd workflow
snakemake --use-conda --cores <number_of_cores>

For cluster execution (SLURM example):

snakemake --use-conda --profile profiles/default --cores <number_of_cores>

Configuration

Project Configuration (`config/config.yaml`)

Key configuration sections:

Project Settings

project:
  name: your_project_name
  genome: hg38  # or mm10, etc.
  out_dir: "{process_dir}/your_output_directory"
  read_type: paired  # or "single"

Alignment Configuration

alignment:
  tool: bwa-mem2  # or "bowtie2"
  sequencing_platform: illumina
  
  # Optional prealignment (e.g., filter mitochondrial reads)
  prealign:
    enabled: true
    indices:
      - name: chrM
        path: "{database_dir}/hg38_rcrds/rcrds"
  
  # BWA-MEM2 specific
  bwa:
    index: "{shared_genome_dir}/bwa-mem2/GRCh38"
    min_score: null
    extra_args: ""
  
  # Bowtie2 specific
  bowtie2:
    index: "{database_dir}/hg38/indices_for_Bowtie2/hg38"

Peak Calling Configuration

peaks:
  macs2_keep_dup: "all"
  macs2_shift: -75      # Tn5 shift correction
  macs2_extsize: 150
  macs2_pval: 1e-4
  disable_tn5_shift: false

Reproducibility Analysis

replicates:
  pval_thresh: 1e-3
  smooth_win: 150
  cap_num_peak: 300000
  peak_combination_method: "idr"  # or "overlap"
  idr_thresh: 0.05
  idr_rank: "signal.value"

PEP Configuration (`pep/project_config.yaml`)

Define path variables that can be referenced in the main config:

paths:
  data_dir: "/path/to/raw/data"
  process_dir: "/path/to/output"
  database_dir: "/path/to/reference/genomes"
  shared_genome_dir: "/path/to/shared/genomes"

Sample Annotation File

The sample annotation CSV should contain at minimum:

sample_name: Unique identifier for each sample
run: Run number (integer)
R1: Path to forward read FASTQ file
R2: Path to reverse read FASTQ file (for paired-end)
read_type: "paired" or "single"

Optional columns:

replicate_sample_name: Group identifier for biological replicates (enables reproducibility analysis)
passqc: QC flag (1 = pass, 0 = fail)

Example:

sample_name,run,R1,R2,read_type,replicate_sample_name
sample1,1,/path/to/sample1_R1.fq.gz,/path/to/sample1_R2.fq.gz,paired,group1
sample1,2,/path/to/sample1_run2_R1.fq.gz,/path/to/sample1_run2_R2.fq.gz,paired,group1
sample2,1,/path/to/sample2_R1.fq.gz,/path/to/sample2_R2.fq.gz,paired,group1

Pipeline Workflow

The pipeline consists of several main phases:

1. Preprocessing

Adapter Trimming: Fastp removes adapters and performs quality filtering
Prealignment (optional): Filters reads against specified indices (e.g., mitochondrial DNA)

2. Alignment

Alignment using Bowtie2 or BWA-MEM2
Duplicate marking with Samblaster
Quality filtering and sorting

3. Peak Calling

MACS2 peak calling per sample
Tn5 shift correction for ATAC-seq
Peak merging across samples

4. Reproducibility Analysis (for replicate groups)

Pseudo-replicate Generation: Splits each sample into two pseudo-replicates
Self-Consistency: IDR between pseudo-replicates (N1, N2, ...)
True Replicate Comparison: IDR between biological replicates (Nt)
Pooled Pseudo-Replicates: IDR on pooled data (Np)
Replicate Correlation: Pearson and Spearman correlation based on read counts in group-merged peaks
Optimal Peak Selection: ENCODE-standard selection (max(Nt, Np))

5. Annotation & Quantification

Peak annotation to genomic features (promoters, TSS, etc.)
Read counting in consensus peaks
Promoter and TSS region quantification

6. Downstream Analysis

HOMER motif enrichment
BigWig track generation
MultiQC report generation

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

Name		Name	Last commit message	Last commit date
Latest commit History 229 Commits
config		config
pep		pep
schemas		schemas
workflow		workflow
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ATAC-seq Snakemake Pipeline

Features

Requirements

Installation

Quick Start

Configuration

Project Configuration (`config/config.yaml`)

Project Settings

Alignment Configuration

Peak Calling Configuration

Reproducibility Analysis

PEP Configuration (`pep/project_config.yaml`)

Sample Annotation File

Pipeline Workflow

1. Preprocessing

2. Alignment

3. Peak Calling

4. Reproducibility Analysis (for replicate groups)

5. Annotation & Quantification

6. Downstream Analysis

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ATAC-seq Snakemake Pipeline

Features

Requirements

Installation

Quick Start

Configuration

Project Configuration (config/config.yaml)

Project Settings

Alignment Configuration

Peak Calling Configuration

Reproducibility Analysis

PEP Configuration (pep/project_config.yaml)

Sample Annotation File

Pipeline Workflow

1. Preprocessing

2. Alignment

3. Peak Calling

4. Reproducibility Analysis (for replicate groups)

5. Annotation & Quantification

6. Downstream Analysis

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Project Configuration (`config/config.yaml`)

PEP Configuration (`pep/project_config.yaml`)

Packages