map_adna_linear

Snakemake workflow for mapping ancient DNA reads to reference genomes, with damage estimation and summary statistics. It supports single-end, paired-end, and collapsed reads.

Repository layout

workflow/Snakefile - main Snakemake workflow
workflow/rules/mapping.smk - mapping and BAM processing rules
workflow/rules/damage.smk - metaDMG and mapDamage rules
workflow/rules/stats.smk - summary statistics and plots
dataset_example/config.yml - example configuration
dataset_example/units.tsv - example sample sheet
run_dataset.sh - run locally
run_dataset_slurm.sh - submit to SLURM

Requirements

snakemake (9+ recommended)
bowtie2, samtools, picard
mapDamage, metaDMG-cpp
R (for plotting/stats scripts)
R packages: argparse, tidyverse, ggrepel, viridis, scales, Rsamtools, cowplot, cigarillo, vroom
Python with pandas

Configuration

Create a dataset directory with a config/ folder that contains:

config/config.yml
config/units.tsv

An example dataset lives at dataset_example/.

Example config/config.yml (pairs with the units.tsv example below):

prefix: demo
out_dir: results
units: config/units.tsv
ref:
  ref_genome_a: /path/to/bowtie2_prefix
bt_param: "-D 20 -R 3 -N 1 -L 20 -i S,1,0.50 --end-to-end --no-unal"
bam_filter:
  mq: 20
  qlen: 25
  rlen: 25
tmp_dir: /path/to/tmp/dir
metaDMG-cpp: /path/to/metaDMG-cpp

config/units.tsv columns:

sample_id: sample identifier used in output directory names
lib_id: library identifier (used in library-level outputs)
unit_id: lane/run identifier
fq_se: single-end or collapsed FASTQ (optional if paired-end only)
fq_pe1: paired-end read 1 FASTQ (optional)
fq_pe2: paired-end read 2 FASTQ (optional)

Example:

sample_id	lib_id	unit_id	fq_se	fq_pe1	fq_pe2
SAMPLE01	SAMPLE01_LIB01	SAMPLE01_LIB01_L001	/data/fastq/SAMPLE01_LIB01_L001.fq.gz	/data/fastq/SAMPLE01_LIB01_L001_R1.fq.gz	/data/fastq/SAMPLE01_LIB01_L001_R2.fq.gz
SAMPLE01	SAMPLE01_LIB01	SAMPLE01_LIB01_L002	/data/fastq/SAMPLE01_LIB01_L002.fq.gz	/data/fastq/SAMPLE01_LIB01_L002_R1.fq.gz	/data/fastq/SAMPLE01_LIB01_L002_R2.fq.gz
SAMPLE02	SAMPLE02_LIB01	SAMPLE02_LIB01_L001	/data/fastq/SAMPLE02_LIB01_L001.fq.gz	/data/fastq/SAMPLE02_LIB01_L001_R1.fq.gz	/data/fastq/SAMPLE02_LIB01_L001_R2.fq.gz

Run locally

From the repo root:

bash run_dataset.sh /path/to/dataset
bash run_dataset.sh dataset_example

You can pass extra Snakemake args after the dataset path:

bash run_dataset.sh /path/to/dataset -- --cores 8 --rerun-incomplete

Run on SLURM

bash run_dataset_slurm.sh /path/to/dataset --jobs 50 --partition general
bash run_dataset_slurm.sh dataset_example --dry-run

See full options:

bash run_dataset_slurm.sh --help

Outputs

For each sample_id, outputs are written under ${out_dir}/{sample_id}/:

bam/ - merged and duplicate-marked BAMs
metadamage/ - sample- and library-level damage estimates
mapdamage/ - mapDamage plots
tables/ - per-sample statistics tables
plots/ - per-sample summary plots (reference similarity, read length distribution and aDNA damage)
logs/ - mapping and damage logs

Summary outputs across all samples are written under ${out_dir}/summary/.

Example output structure:

${out_dir}/
├── SAMPLE01/
│   ├── bam/
│   │   ├── SAMPLE01.SAMPLE01_LIB01.SAMPLE01_LIB01_L001.demo.ref_genome_a.init.bam
│   │   ├── SAMPLE01.SAMPLE01_LIB01.SAMPLE01_LIB01_L002.demo.ref_genome_a.init.bam
│   │   ├── SAMPLE01.demo.ref_genome_a.mrkdup.bam
│   │   ├── SAMPLE01.demo.ref_genome_a.mrkdup.bam.bai
│   │   ├── SAMPLE01.demo.ref_genome_a.filter.bam
│   │   └── SAMPLE01.demo.ref_genome_a.filter.bam.bai
│   ├── logs/
│   │   ├── SAMPLE01.SAMPLE01_LIB01.SAMPLE01_LIB01_L001.demo.ref_genome_a.map_init.log
│   │   ├── SAMPLE01.SAMPLE01_LIB01.SAMPLE01_LIB01_L002.demo.ref_genome_a.map_init.log
│   │   ├── SAMPLE01.SAMPLE01_LIB01.demo.ref_genome_a.mrkdup_metrics.txt
│   │   ├── ref_genome_a.get_damage_global_sample.log
│   │   ├── ref_genome_a.get_damage_local_sample.log
│   │   ├── SAMPLE01_LIB01.ref_genome_a.get_damage_global_library.log
│   │   └── SAMPLE01_LIB01.ref_genome_a.get_damage_local_library.log
│   ├── metadamage/
│   │   ├── sample_level/
│   │   │   ├── ref_genome_a.demo.filter.global.tsv
│   │   │   ├── ref_genome_a.demo.filter.global.bdamage.gz
│   │   │   ├── ref_genome_a.demo.filter.global.dfit.gz
│   │   │   ├── ref_genome_a.demo.filter.local.tsv
│   │   │   ├── ref_genome_a.demo.filter.local.bdamage.gz
│   │   │   └── ref_genome_a.demo.filter.local.dfit.gz
│   │   └── library_level/
│   │       └── SAMPLE01_LIB01/
│   │           ├── ref_genome_a.demo.filter.global.tsv
│   │           ├── ref_genome_a.demo.filter.global.bdamage.gz
│   │           ├── ref_genome_a.demo.filter.global.dfit.gz
│   │           ├── ref_genome_a.demo.filter.local.tsv
│   │           ├── ref_genome_a.demo.filter.local.bdamage.gz
│   │           └── ref_genome_a.demo.filter.local.dfit.gz
│   ├── mapdamage/
│   │   └── SAMPLE01.demo.ref_genome_a.filter/
│   │       ├── Fragmisincorporation_plot.pdf
│   │       └── Length_plot.pdf
│   ├── tables/
│   │   ├── SAMPLE01.demo.ref_genome_a.filter.bam_stats.tsv
│   │   └── SAMPLE01.demo.ref_genome_a.filter.lib_stats.tsv
│   └── plots/
│       └── SAMPLE01.demo.ref_genome_a.filter.summary.pdf
└── summary/
    ├── tables/
    │   ├── demo.ref_genome_a.filter.bam_stats.tsv
    │   └── demo.ref_genome_a.filter.lib_stats.tsv
    └── plots/
        ├── demo.ref_genome_a.filter.bam_stats_coverage.pdf
        ├── demo.ref_genome_a.filter.lib_stats_n_map_uniq_p_dup.pdf
        ├── demo.ref_genome_a.filter.lib_stats_n_reads_n_map_uniq.pdf
        ├── demo.ref_genome_a.filter.lib_stats_n_reads_p_dup.pdf
        └── demo.ref_genome_a.filter.lib_stats_p_dup.pdf

Output tables

bam_stats.tsv (${out_dir}/{sample}/tables/{sample}.{PREFIX}.{ref}.filter.bam_stats.tsv) columns:

sample_id: sample identifier
library_id: library identifier (all indicates sample-level aggregate)
contig_id: reference contig ID (genome indicates genome-wide aggregate)
contig_l: contig length
n_reads: number of reads mapped to the contig
dp_avg: average depth of coverage
dp_sd: depth standard deviation
coverage_bp: covered bases (depth > 0)
coverage_p: fraction of covered bases
coverage_p_exp: expected covered fraction (Poisson)
coverage_p_ratio: observed/expected coverage fraction
dp_cv: depth coefficient of variation
coverage_evenness_score: evenness metric for coverage
cov_pos_rel_entropy_100: relative coverage position entropy (100 bp bins)
cov_pos_rel_entropy_1000: relative coverage position entropy (1000 bp bins)
read_l_avg: average read length
mq_avg: average mapping quality
n_soft_clip_avg: average soft-clipped bases per read
ani: average nucleotide identity proxy (1 - mismatches/read length)
edit_dist_mode: modal edit distance (NM)
edit_dist_avg: mean edit distance
edit_dist_avg_decay: mean decay of edit distance histogram
edit_dist_decay_end: edit distance bin where counts begin to increase
dmg_f_5p: 5' damage frequency (metaDMG)
dmg_f_3p: 3' damage frequency (metaDMG)
dmg_amp: damage amplitude (metaDMG A)
dmg_decay: damage decay (metaDMG q)
dmg_noise: damage noise (metaDMG c)
dmg_Zfit: metaDMG Z-fit statistic
dmg_ntot_5p: total 5' counts used for damage fit
dmg_ntot_3p: total 3' counts used for damage fit

lib_stats.tsv (${out_dir}/{sample}/tables/{sample}.{PREFIX}.{ref}.filter.lib_stats.tsv) columns:

sample_id: sample identifier
library_id: library identifier
num_reads_total: total reads from fastq logs (paired reads counted as two)
num_reads_mapped: reads examined after mapping
prop_mapped: fraction mapped (num_reads_mapped/num_reads_total)
num_reads_dup: duplicate reads
prop_dup: fraction duplicates (num_reads_dup/num_reads_mapped)
num_reads_final: final mapped reads after filtering
prop_final: fraction retained (num_reads_final/num_reads_total)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

map_adna_linear

Repository layout

Requirements

Configuration

Run locally

Run on SLURM

Outputs

Output tables

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
dataset_example		dataset_example
workflow		workflow
README.md		README.md
run_dataset.sh		run_dataset.sh
run_dataset_slurm.sh		run_dataset_slurm.sh

Folders and files

Latest commit

History

Repository files navigation

map_adna_linear

Repository layout

Requirements

Configuration

Run locally

Run on SLURM

Outputs

Output tables

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages