Ancestral Open Reading Frame Reconstruction Pipeline for de novo genes

A Nextflow workflow for identifying ancestral ORFs that gave birth to present-day de novo genes through phylogenetic reconstruction and sequence analysis.

Overview

This Nextflow workflow takes a list of genes that emerged de novo in a focal genome, together with their homologous sequences or regions in neighbor genomes. Using PRANK or PREQUEL, it reconstructs the ancestral DNA locus, which is assumed to be non-coding. From this locus it extracts every stop-to-stop ORF of at least 20 codons and aligns each one to the present-day de novo CDS. It thus identifies the ancestral ORFs that gave rise to today's de novo genes.

Input Parameters

focal = the name of the focal genome
gendir = the directory containing the GFF3 and FASTA files of the genomes (focal and neighbors)
tree = a Newick tree including the focal and neighbor genomes
trg_table = a DENSE workflow output (TSV file) with the precomputed matches of the de novo CDSs to CDSs or genome regions of the neighbor genomes
queries = a text file listing the de novo CDSs to process
mode = the method used to reconstruct the ancestral locus, 'prank' or 'prequel' (default: 'prank')
outdir = name of the results directory

Usage

nextflow run proginski/ancorf -profile <SINGULARITY-APPTAINER-DOCKER> --gendir <DIR_WITH_GFF_AND_FASTA> --focal <FOCAL_GENOME_NAME> --tree <NEWICK_WITH_FOCAL_AND_NEIGHBORS> --trg_table <TRG_TABLE> --queries <QUERIES> --mode <PRANK-PREQUEL> --outdir <OUTDIR>
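When the pipeline is launched from a script, the usage line above can be assembled from a dictionary of parameters. The following is a minimal sketch only: the helper name build_ancorf_command is hypothetical and not part of the pipeline. It relies on the Nextflow convention that engine options take a single dash (-profile, -resume) while pipeline parameters take two (--gendir, --focal, ...).

```python
import shlex


def build_ancorf_command(params, profile="singularity", resume=False):
    """Assemble the `nextflow run` argument list for the ancorf pipeline.

    `params` maps pipeline parameter names (gendir, focal, tree, trg_table,
    queries, mode, outdir) to their values. This helper is hypothetical and
    only mirrors the usage line of the README.
    """
    cmd = ["nextflow", "run", "proginski/ancorf", "-profile", profile]
    for name, value in params.items():
        # Pipeline parameters are passed with a double dash.
        cmd += [f"--{name}", str(value)]
    if resume:
        # Nextflow engine option, single dash.
        cmd.append("-resume")
    return cmd


if __name__ == "__main__":
    test_params = {
        "gendir": "data/test/genomes/",
        "focal": "Scer_NCBI",
        "tree": "data/test/saccharomyces.nwk",
        "trg_table": "data/test/TRG_table.tsv",
        "queries": "data/test/SCER_NCOS_CDS_without_distant_homologs.txt",
    }
    # Print a shell-safe version of the command without running it.
    print(shlex.join(build_ancorf_command(test_params, resume=True)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues with paths that contain spaces.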

Test

nextflow run proginski/ancorf -profile singularity --gendir data/test/genomes/ --focal Scer_NCBI --tree data/test/saccharomyces.nwk --trg_table data/test/TRG_table.tsv --queries data/test/SCER_NCOS_CDS_without_distant_homologs.txt -resume

Note

The test command, including the first (and only) pull of the Docker Hub image, took about 7 minutes on a computer with 10 CPUs and 20 GB of RAM.

Warning

The PRANK algorithm has an intrinsic variability, so results may differ slightly from one run to the next.

Container Requirements

The workflow is expected to be run with a container manager (Singularity, Apptainer, Docker). It automatically pulls the DockerHub image proginski/ancorf.

Workflow Architecture

main.nf is a generic entry point to the workflows/ancorf.nf workflow
workflows/ancorf.nf orchestrates the execution of the different modules
modules/local/ancorf_modules.nf contains the different modules
bin/ contains the utility scripts
containers/ contains the pulled images
nextflow.config is the general configuration file
nextflow_schema.json is the helper JSON file describing the parameters

Processes

Core Modules

CHECK_INPUTS: check the integrity of the gendir, tree, and trg_table inputs
EXTRACT_CDS: extract the CDSs from the genome GFF3 and FASTA files
ELONGATE_CDS: extend each CDS by 99 nucleotides upstream and downstream to add local genomic context for the reconstruction
ALIGNMENT_FASTA: build an alignment FASTA file for each query de novo CDS with its CDS or genome matches across the neighbor genomes
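The elongation step of ELONGATE_CDS can be illustrated with a short coordinate calculation. This is a sketch under assumptions, not the pipeline's code: the function name elongate_interval is hypothetical, coordinates are taken as 0-based and half-open, and the flanks are clamped at the contig ends, which the description above does not specify.

```python
def elongate_interval(start, end, contig_length, flank=99):
    """Extend a CDS interval by `flank` nucleotides on each side.

    Coordinates are 0-based, half-open [start, end) on the contig.
    Clamping at the contig boundaries is an assumption of this sketch.
    """
    if not 0 <= start < end <= contig_length:
        raise ValueError("interval must lie within the contig")
    new_start = max(0, start - flank)
    new_end = min(contig_length, end + flank)
    return new_start, new_end


if __name__ == "__main__":
    contig = "ACGT" * 250            # toy contig of 1000 nt
    s, e = elongate_interval(200, 500, len(contig))
    elongated = contig[s:e]          # CDS plus 99 nt of context on each side
    print(s, e, len(elongated))
```

Because the flank is symmetric, the same calculation applies to CDSs on either strand; only the reverse-complement step (not shown) depends on the strand.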

Ancestral ORF Reconstruction

ANCESTRAL_ORFS_PRANK (mode: "prank")

Aligns the entries of the alignment FASTA file with macse_v2.07, fixes the format, and runs prank with the genomes' tree, without iteration (-once), preserving the input alignment (-keep), and reporting the inferred ancestors in the tree (-showanc). It then selects the most recent common ancestor of the last CDS match and the first outgroup (see https://github.com/i2bc/dense). From this ancestral sequence it extracts all the ORFs (stop-to-stop) of at least 20 codons across the three frames. Finally, it tries to align the query de novo CDS to these ancestral ORFs using ssearch36 (e-value threshold 1e-2).
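The ORF extraction described above can be sketched in a few lines. This is an illustrative sketch rather than the pipeline's own script: the function name stop_to_stop_orfs is hypothetical, the standard stop codons TAA, TAG and TGA are assumed, the sequence ends are treated as ORF boundaries, and the stop codons themselves are excluded from the reported ORFs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def stop_to_stop_orfs(seq, min_codons=20):
    """Return stop-free stretches of at least `min_codons` codons.

    Scans the three forward frames of `seq`. Each ORF is reported as
    (frame, start, end, sequence) with 0-based, half-open coordinates.
    Treating the sequence ends as boundaries is an assumption.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        # End of the last complete codon in this frame.
        last = frame + 3 * ((len(seq) - frame) // 3)
        orf_start = frame
        for pos in range(frame, last, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - orf_start) // 3 >= min_codons:
                    orfs.append((frame, orf_start, pos, seq[orf_start:pos]))
                orf_start = pos + 3  # next ORF starts after the stop
        # Stretch running up to the end of the sequence.
        if (last - orf_start) // 3 >= min_codons:
            orfs.append((frame, orf_start, last, seq[orf_start:last]))
    return orfs


if __name__ == "__main__":
    ancestral = "TAA" + "GCT" * 20 + "TAG" + "GCT" * 5 + "TGA"
    for frame, start, end, orf in stop_to_stop_orfs(ancestral):
        print(frame, start, end, len(orf) // 3)
```

A reconstructed ancestral sequence may contain gap or ambiguity characters; how the pipeline handles them is not described here, and this sketch simply treats any non-stop triplet as a codon.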

ANCESTRAL_ORFS_PREQUEL (mode: "prequel")

Same as ANCESTRAL_ORFS_PRANK, except that the ancestral locus is reconstructed with PREQUEL. It first builds a model with phyloFit from the alignment and the provided genomes' tree. The input tree is preserved in the model that is passed to prequel.

Output Generation

ANCORFS_FASTA

Build FASTA files with the selected ancestral ORFs:

best: only the best-matching ORF (lowest e-value) for each query
1e-2: all matches with an e-value < 1e-2
1e-3: all matches with an e-value < 1e-3
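The three selections listed above can be expressed as simple e-value filters. The sketch below is illustrative only: the function name select_ancestral_orfs and the (query, ORF, e-value) tuple layout are hypothetical. It assumes that the best hit is chosen among matches below 1e-2 and that ties are resolved in favor of the first hit encountered.

```python
def select_ancestral_orfs(hits):
    """Split alignment hits into the 'best', '1e-2' and '1e-3' sets.

    `hits` is an iterable of (query_id, orf_id, evalue) tuples, a
    hypothetical layout chosen for this sketch.
    """
    selected = {"best": [], "1e-2": [], "1e-3": []}
    best_per_query = {}
    for hit in hits:
        query, _orf, evalue = hit
        if evalue >= 1e-2:
            continue  # above the loosest threshold, never reported
        selected["1e-2"].append(hit)
        if evalue < 1e-3:
            selected["1e-3"].append(hit)
        # Keep the lowest e-value per query (first hit wins on ties).
        if query not in best_per_query or evalue < best_per_query[query][2]:
            best_per_query[query] = hit
    selected["best"] = list(best_per_query.values())
    return selected


if __name__ == "__main__":
    example_hits = [
        ("query1", "anc_orf_a", 5e-4),
        ("query1", "anc_orf_b", 5e-3),
        ("query2", "anc_orf_c", 2e-2),
    ]
    for name, subset in select_ancestral_orfs(example_hits).items():
        print(name, [orf for _, orf, _ in subset])
```

By construction every hit in the 1e-3 set also belongs to the 1e-2 set, and each query contributes at most one entry to the best set.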

Dependencies

Nextflow >= 23.10.0

DENSE workflow >= v1.0.8 (https://github.com/i2bc/dense)

Singularity OR Apptainer OR Docker

Tested on release v1.0.0

Citation

If you use this pipeline in your research, please cite the associated publication and the DENSE workflow.
