A Nextflow workflow for comprehensive evolutionary analysis using CodeML to detect selection pressure through dN/dS ratio calculations and likelihood ratio tests.
This NEXTFLOW workflow performs custom dN/dS analysis by building codon-aware alignments from orthologous sequences and testing multiple evolutionary models using CodeML. It implements branch-specific and subtree-specific models inspired by Zhang et al. 2019 to detect positive selection across phylogenetic trees.
- outdir = name of the results directory
- A = focal nucleotide FASTA file with the query sequences
- B = subject nucleotides FASTA file with the orthologs
- includeA = boolean flag (default: false, unnecessary)
- A_seqlist = a subset list of the focal FASTA queries
- ortho = a directory with 2-column TSV files containing pairs of orthologs (should be named "{A}_vs_{anyB}_orthologs.tsv")
- tree = a NEWICK tree of the genomes
- ctl_file = should be "${projectDir}/ctl/codeml_one_omega_model.ctl" (others not tested)
- style = should be 'branch_models' (others not tested)
- fasta_batch_size = batch size for FASTA processing (default: 10000)
- alignment_batch_size = batch size for alignment processing (default: 100)
- codeml_batch_size = batch size for CodeML analysis (default: 10)
nextflow run <repository>/dNdS -profile <SINGULARITY-APPTAINER-DOCKER> --outdir <OUTDIR> --A <FOCAL_FASTA> --B <SUBJECT_FASTA> --ortho <ORTHO_DIR> --tree <NEWICK_TREE>
The workflow is expected to be run with a container manager (Singularity, Apptainer, or Docker).
The pipeline builds codon alignments using the following strategy:
- Dual alignment with MAFFT: Align both nucleotides and amino acids using translatorX.pl with '-p F' option
- Codon alignment: Generate codon-aligned sequences using pal2nal.pl respecting amino acid alignment
- Gap removal: Remove mostly gapped positions using remove_mostly_gapped_positions.py
- Format conversion: Convert FASTA alignment to PHYLIP format for CodeML compatibility
# Align both nucleotides and AAs with MAFFT
translatorX.pl -i $ortho_fna -p F -o $orf
# Get codons aligned with respect to amino acids
pal2nal.pl ${orf}.aa_ali.fasta ${orf}.nt_ali.fasta -output fasta > ${orf}_pal2nal.aln
# Remove mostly gapped positions
remove_mostly_gapped_positions.py ${orf}_pal2nal.aln ${orf}_clean.aln
# Convert to PHYLIP format for CodeML
FASTAtoPHYL.pl ${orf}_clean.aln $(grep -c ">" ${orf}_clean.aln) $(awk 'FNR==2 {print length}' ${orf}_clean.aln)For each alignment, the workflow tests multiple evolutionary models:
- 1omega: Single omega (dN/dS) across entire tree
- 1omega_fixed1: Single omega fixed to 1
- 2omega: Two-omega model with foreground branch having different omega
- 2omega_fixed1: Two-omega model with foreground omega fixed to 1
- free_omega: Free omega model allowing different omega for each branch
- Likelihood ratio tests: Compare nested models to detect significant differences
- AIC/BIC criteria: Evaluate model fit and complexity
- Parameter estimation: Extract dN/dS ratios and confidence intervals
- P-value thresholds: Default significance level of 0.05
- Bonferroni correction: Multiple testing correction available
- Branch-specific selection: Identify specific lineages under selection
- *_models.tsv_bonferroni_corrected
- under_selection_bonferroni_corrected.txt
- Nextflow >= 23.10.0
- Singularity OR Apptainer OR Docker
The branch-specific and subtree-specific analysis approach is inspired by: Zhang et al. 2019. "The evolution of gene expression in cis and trans." Nature Ecology & Evolution. https://doi.org/10.1038/s41559-019-0822-5
If you use this pipeline in your research, please cite the associated publication and the Zhang et al. 2019 methodology reference.