BAD2matrix

A Python script for merging and translating FASTA alignments into TNT (extended XREAD), RAxML-NG/IQ-Tree (extended PHYLIP) and FastTree (fasta) input matrices. Optionally, it encodes indel characters using the simple gap coding method of Simmons and Ochoterena (2000; Gaps as characters in sequence-based phylogenetic analysis. Systematic Biology 49: 369-381. DOI 10.1080/10635159950173889), and gene content as binary characters (absence/presence). This script is slower than 2matrix.pl due to more disk use, but will not run out of RAM (hopefully).

Installation

Simply clone the GitHub repository or download the main script (bad2matrix.py). A Python 3 interpreter is required.

Usage

python bad2matrix.py -d <directory> -n <root-name> [-a 2|3|4|5|6|6dso|6kgb|6sr|8|10|11|12|15|18|20] [-f] [-g] [-i] [-m int] [-t <directory>]

option	description
-a	Number of amino acid states (default = 20). Reduction with option `6dso` follows Dayhoff et al. (1978); option `6kgb` follows Kosiol et al. (2004); option `6sr` follows Susko and Roger (2007); option `11` follows Buchfink et al. (2015); and all other options follow Murphy et al. (2000).
-d	The input directory of aligned FASTA files. The default behavior aggregates sequences of the same species across partitions, in which case names should use the following convention: `>species#sequenceID`. This implies that a species can be represented by only one read within each FASTA file. Characters other than letters, numbers, periods, and underscores will be deleted. Use `-f` for an alternate naming convention.
-f	Use full FASTA names rather than default settings (see `-d` description for default). Sequences from the same species but different reads will not be aggregated and will be considered distinct OTUs. Characters other than letters, numbers, periods, and underscores will be deleted.
-g	Do not code gene content (absence/presence). If this flag is not set, gene content is coded.
-i	Do not code indels. If this flag is not set, indels are coded using simple indel coding following Simmons and Ochoterena (2000).
-m	Retain the upper `x` percentile of genes in the distribution of missing sequences. By default `x` = 1 (i.e. include all genes with four or more sequences).
-n	Specify the root-name for output files.
-r	Folder containing a morphological matrix or a set of ortholog duplication matrices. Datasets should be saved a .tsv tables. Multiple states (polymorphic characters) should be separated by pipes (`\|`).

Input specification

The program only accepts alignments in fasta format, and their file extensions should be fa, fas, or fasta. Alignments should also be located in a single folder, which path is parsed with the option -d.

Other kinds of data (e.g. a morphological matrix or multiple ortholog encoding tables) can also be concatenated. The first row of each table should contain the character names, and the first column the names of the terminals. These matrices should be saved as tab-separated values: simple text tables with tabs as column separators and tsv extension. These tables should be located in a single directory, which is parsed with the option -r. Polymorphisms should be separated by pipes (|): leaves_pinnate|leaves_bipinnate. Be aware that polymorphisms are only fully supported for the TNT output, they will be encoded as missing data in the rest of output datasets.

Sample input/output

python bad2matrix.py -d test-data/fastas -f -g -n test

Citation

Little, D. P. & N. R. Salinas. 2023. BAD2matrix: better phylogenomic matrix concatenation, indel coding, gene content coding, reduced amino acid alphabets, and occupancy filtering. Software distributed by the authors. DOI: 10.5281/zenodo.10028408.

Additional files

Python scripts used for development.

make_readme.py: Updates README.md file from source documentation.

sim_data.py: Generates fasta alignments, executes BAD2matrix contatenation, and verifies gene content encoding.

test.py: Preliminary test functions.

License

GPL2

Related repositories

2matrix

Contact

Nelson R. Salinas
nrsalinas@gmail.com
New York Botanical Garden

Damon P. Little
dlittle@nybg.org
New York Botanical Garden

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BAD2matrix

Installation

Usage

Input specification

Sample input/output

Citation

Additional files

License

Related repositories

Contact

About

Uh oh!

Releases 1

Packages

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 72 Commits
test-data		test-data
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
bad2matrix.py		bad2matrix.py
make_readme.py		make_readme.py
sim_data.py		sim_data.py
test.py		test.py

License

dpl10/BAD2matrix

Folders and files

Latest commit

History

Repository files navigation

BAD2matrix

Installation

Usage

Input specification

Sample input/output

Citation

Additional files

License

Related repositories

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Uh oh!

Languages

Packages