gene2phylo

gene2phylo is a Snakemake pipeline for batch phylogenetic analysis of a given set of input genes.

Setup

The pipeline is written in Snakemake and uses conda to install the necessary tools.

It is strongly recommended to install conda using Mambaforge. See the Snakemake installation guide for details: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html

Once conda is installed, you can clone the GitHub repository and set up the base conda environment.

# get github repo
git clone https://github.com/o-william-white/gene2phylo

# change dir
cd gene2phylo

# setup conda env
conda env create -n gene2phylo_env -f workflow/envs/conda_env.yaml

# set channel priority to avoid warning messages from snakemake
conda config --set channel_priority strict

If you need to install the conda environment to a specific location, use the --prefix argument instead of -n (conda does not accept both together) and update the path as required. An environment created this way is activated by its path, e.g. conda activate /your_path/gene2phylo_env.

conda env create --prefix /your_path/gene2phylo_env -f workflow/envs/conda_env.yaml

↥ back to top

Example data

Before running your own data, it is recommended to run the example dataset provided. This confirms that there are no user-specific issues with the setup and installs all of the dependencies. The example data includes mitochondrial and ribosomal genes from 25 different butterfly species.

To run the example data, use the code below. The first time you run the pipeline, it will take some time to install each of the conda environments, so it is a good time to take a tea break :).

conda activate gene2phylo_env

snakemake --profile workflow/profiles/test

↥ back to top

Example data with SLURM

If you have access to a High Performance Computing (HPC) facility with the SLURM job scheduler, you can run the pipeline so that each rule is submitted as a separate job with resources you specify. The advantage of this approach is that you often have access to larger computational resources and many jobs can run simultaneously.

snakemake --profile workflow/profiles/slurm

↥ back to top

Input

Snakemake requires a config.yaml to define input parameters.

For the example data provided, the config file is located at config/config.yaml and looks like this:

# name of input directory containing genes
input_dir: .test

# realign (True or False)
realign: True

# missing data threshold for alignments (0.0 - 1.0), only required if realign == True
missing_threshold: 0.5

# alignment trimming method to use (gblocks or clipkit), only required if realign == True
alignment_trim: gblocks

# name of outgroup sample (optional)
# use "NA" if there is no obvious outgroup
# if more than one outgroup use a comma-separated list e.g. "sampleA,sampleB"
outgroup: Eurema_blanda

# plot dimensions (cm)
plot_height: 20
plot_width: 20
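
Before launching a run with an edited config, it can help to sanity-check the values against the comments above. The following is a minimal sketch in Python, not part of the pipeline: the function names check_config and parse_outgroup are hypothetical, the config is assumed to have already been loaded into a dictionary, and the rules encoded here are only those stated in the comments of the example config.

```python
ALLOWED_TRIM_METHODS = {"gblocks", "clipkit"}


def _is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_outgroup(value):
    """Split the outgroup setting into a list of sample names ("NA" means none)."""
    if value is None or str(value).strip() in ("", "NA"):
        return []
    return [name.strip() for name in str(value).split(",") if name.strip()]


def check_config(config):
    """Return a list of problems found in a config dictionary (empty if none)."""
    problems = []
    if not config.get("input_dir"):
        problems.append("input_dir is not set")

    realign = config.get("realign")
    if not isinstance(realign, bool):
        problems.append("realign should be True or False")

    # These two settings are only required when realign is True.
    if realign is True:
        threshold = config.get("missing_threshold")
        if not _is_number(threshold) or not 0.0 <= threshold <= 1.0:
            problems.append("missing_threshold should be between 0.0 and 1.0")
        if config.get("alignment_trim") not in ALLOWED_TRIM_METHODS:
            problems.append("alignment_trim should be gblocks or clipkit")

    for key in ("plot_height", "plot_width"):
        value = config.get(key)
        if not _is_number(value) or value <= 0:
            problems.append(f"{key} should be a positive number (cm)")
    return problems
```

For example, check_config applied to the values shown above returns an empty list, and parse_outgroup("sampleA,sampleB") returns the two sample names as a list.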

↥ back to top

Output

All output files are saved to the results directory. Below is a table summarising all of the output files generated by the pipeline.

Directory	Description
mafft	Optional: Mafft aligned fasta files of all genes
mafft_filtered	Optional: Mafft aligned fasta files after the removal of sequences based on a missing data threshold
alignment_trim	Optional: Ambiguous parts of alignment removed using either gblocks or clipkit
iqtree	Iqtree phylogenetic analysis for each gene
iqtree_plots	Plots of Iqtree phylogenetic tree for each gene
concatenate_alignments	Partitioned alignment of all genes
iqtree_partitioned	Iqtree partitioned phylogenetic analysis
iqtree_partitioned_plot	Plot of Iqtree partitioned tree
astral	Astral phylogenetic analysis of all gene trees
astral_plot	Plot of Astral tree
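
After a run, a quick way to confirm that every step produced output is to check for the directories listed in the table. The sketch below is illustrative only and rests on two assumptions not confirmed by the pipeline itself: that each directory sits directly inside the results directory, and that the three directories marked "Optional" are created only when realign is True. The function name missing_output_dirs is hypothetical.

```python
from pathlib import Path

# Directories described in the output table.
REALIGN_ONLY = ["mafft", "mafft_filtered", "alignment_trim"]
ALWAYS_EXPECTED = [
    "iqtree",
    "iqtree_plots",
    "concatenate_alignments",
    "iqtree_partitioned",
    "iqtree_partitioned_plot",
    "astral",
    "astral_plot",
]


def missing_output_dirs(results_dir="results", realign=True):
    """Return the expected output directories that are absent from results_dir."""
    expected = (REALIGN_ONLY if realign else []) + ALWAYS_EXPECTED
    root = Path(results_dir)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    for name in missing_output_dirs("results", realign=True):
        print(f"missing: results/{name}")
```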

↥ back to top

Running your own data

For the pipeline to function properly, the input gene alignments must:

be in a single directory
have file names ending in ".fasta"
be named after the aligned gene (e.g. "cox1.fasta" or "28S.fasta")
use identical sample names across alignments (e.g. all genes from sample A share the same name)

See the .test/ directory for an example of correctly formatted input.
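
The requirements above can be checked before a run. The following is a minimal sketch in Python, not part of the pipeline: the function names read_sample_names and check_input_dir are hypothetical, and the sample name is assumed to be the first whitespace-delimited word of each FASTA header. Samples reported as absent from some alignments are not necessarily errors (a gene may simply be missing for that sample), but they are a useful prompt to look for spelling mismatches such as "sampleB" versus "sample_B".

```python
from pathlib import Path


def read_sample_names(fasta_path):
    """Return the set of sample names found in the '>' header lines of a FASTA file."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    # Assumption: the sample name is the first word of the header.
                    names.add(fields[0])
    return names


def check_input_dir(input_dir):
    """Summarise the genes, unexpected files and partially sampled names in input_dir."""
    directory = Path(input_dir)
    fasta_files = sorted(directory.glob("*.fasta"))

    # Files in the directory that do not end with ".fasta".
    unexpected = sorted(
        p.name for p in directory.iterdir()
        if p.is_file() and p.suffix != ".fasta"
    )

    # The gene name is taken from the file name, e.g. cox1.fasta -> cox1.
    names_by_gene = {p.stem: read_sample_names(p) for p in fasta_files}
    all_names = set().union(*names_by_gene.values())

    # For each sample, list the genes whose alignment does not contain it.
    partial = {}
    for name in sorted(all_names):
        absent = [gene for gene, names in names_by_gene.items() if name not in names]
        if absent:
            partial[name] = absent

    return {
        "genes": list(names_by_gene),
        "unexpected_files": unexpected,
        "partial_samples": partial,
    }
```

Calling check_input_dir(".test") on the example data and inspecting the returned dictionary gives a quick overview before you point input_dir at your own directory.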

You then need to create your own config.yaml file, using the example config/config.yaml as a template.

↥ back to top

Getting help

If you have any questions, please get in touch via the GitHub issues or by email at o.william.white@gmail.com

↥ back to top

Citations

If you use the pipeline, please cite our bioRxiv preprint: https://doi.org/10.1101/2023.08.11.552985

Since the pipeline is a wrapper for several other bioinformatic tools we also ask that you cite the tools used by the pipeline:

Gblocks (default) https://doi.org/10.1093/oxfordjournals.molbev.a026334
Clipkit (optional) https://doi.org/10.1371/journal.pbio.3001007
Mafft (optional) https://doi.org/10.1093/molbev/mst010
Iqtree https://doi.org/10.1093/molbev/msu300
Ete3 https://doi.org/10.1093/molbev/msw046
Ggtree https://doi.org/10.1111/2041-210X.12628
Astral https://doi.org/10.1186/s12859-018-2129-y

↥ back to top
