This tool finds tandem duplications of domains using the method described here. It takes as input a gene tree and a species tree, as well as the relative ordering of domains on extant sequences, and outputs a list of tandem duplications. Below are instructions on how to install dependencies, test the code, and run your own examples.
This tool has been tested on macOS and Ubuntu using Python 2.7, 3.5, and 3.9, and requires the ete3 Tree package and the Gurobi ILP solver. The ete3 Tree package can be installed using pip or easy_install:
pip install --upgrade ete3
or
easy_install -U ete3
For full instructions, see the ete3 download page.
We use Gurobi's Python API as our ILP solver. Gurobi is free to use for academic purposes; in the future, we hope to add support for FOSS ILP solvers as well. See this page for full instructions. Briefly, the setup requires four steps:
- Download Gurobi Optimizer from the downloads page
- Get a Gurobi academic license here
- Install your license using
grbgetkey <YOUR_KEY_HERE>
- Link Gurobi with your Python installation. Navigate to Gurobi's installation directory and run the setup script with
python setup.py install
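Before moving on, it can help to confirm that both dependencies are visible to the interpreter you plan to use. The following is a minimal sketch, not part of the tool: the helper name `check_dependencies` is hypothetical, and it assumes Python 3.4 or later (it relies on `importlib.util.find_spec`). It only checks that the `ete3` and `gurobipy` modules can be located; it does not validate the Gurobi license.

```python
import importlib.util


def check_dependencies(modules=("ete3", "gurobipy")):
    """Return {module_name: bool}, True when the module can be located.

    Hypothetical helper for illustration; it does not import the modules
    and does not check that a Gurobi license is installed.
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in modules
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print("{}: {}".format(name, "found" if found else "MISSING"))
```

If either module is reported as missing, repeat the corresponding installation step above with the same Python interpreter.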
You can test the code using our test script on the simulated examples in the data folder. Our test script takes two optional arguments, type and edist. If run with no arguments, the default options are used.
- type: The type of solver to use. Choose between an exact ILP solution to the TDL Reconciliation Problem, or a fast heuristic. (Default: heuristic)
- edist: The average event distance between events in the simulated examples. The smaller the event distance, the larger the number of domain-level events per domain tree. Examples are given for event distances of 0.1, 0.075, 0.05, 0.025, and 0.01. (Default: 0.1)
To run the tests, run
python test.py -type <exact/heuristic> -edist <0.1/0.075/0.05/0.025/0.01>
When running with the exact solver, note that event distances less than 0.05 may take a very long time to run.
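To run the test script over every provided event distance, a small driver along these lines may be convenient. This is a sketch under assumptions: the helper names `build_test_commands` and `run_all` are hypothetical, and it assumes `test.py` is in the current working directory and accepts the `-type` and `-edist` flags exactly as shown above.

```python
import subprocess
import sys

# Event distances for which simulated examples are provided.
EDISTS = ["0.1", "0.075", "0.05", "0.025", "0.01"]


def build_test_commands(solver="heuristic", edists=EDISTS):
    """Build one test.py command line per event distance."""
    return [
        [sys.executable, "test.py", "-type", solver, "-edist", edist]
        for edist in edists
    ]


def run_all(solver="heuristic"):
    """Run every command in turn, stopping at the first failure."""
    for cmd in build_test_commands(solver):
        print("Running:", " ".join(cmd))
        subprocess.check_call(cmd)
```

Calling `run_all()` from the repository root runs the heuristic on all five settings. With `solver="exact"`, consider restricting `edists` to 0.05 and above, given the running times noted above.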
To infer tandem duplications and single losses in a domain tree, the tool requires a gene tree, a domain tree, a mapping from domains to genes, and the relative positions of domains within each gene. Only the gene tree and domain tree files are supplied as input; the mapping and positions are inferred from the names of the leaf nodes in the two trees. Note that the tool can instead be used to infer events at the gene level by supplying a species tree in place of the gene tree and a gene tree in place of the domain tree. The input formats are as follows:
- Gene Tree: The input gene tree in Newick format. Names are required for all nodes in the tree. Leaf names must be of the form h<x>, where <x> is a unique integer identifier for each gene. Internal node names must be unique but may otherwise be in any form.
Example: '((h1,h2)A,h3)B;'
- Domain Tree: The input domain tree in Newick format. Names are required for all nodes in the tree. Leaf names must be of the form g<x>_<y>, where <x> is the integer identifier corresponding to the gene in which the domain exists, and <y> is the relative, 0-indexed position of the domain with respect to the other domains on the sequence. Internal node names must be unique but may otherwise be in any form.
Example: '((g1_0,g2_0)a,(g3_0,g3_1)b)c;'
With the example inputs, the tool would infer three extant sequences with the following domain layouts:
| Sequence | Domains |
|---|---|
| g1 | ---g1_0--- |
| g2 | ---g2_0--- |
| g3 | ---g3_0---g3_1--- |
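The way layouts follow from leaf names can be sketched in a few lines of Python. This is an illustration only, not the tool's own parsing code: the function name `domain_layouts` is hypothetical, and leaf names are extracted with a simple regular expression rather than ete3 so that the snippet has no dependencies. It assumes an unquoted Newick string in which leaf names directly follow `(` or `,`.

```python
import re
from collections import defaultdict

# A leaf name is a token that directly follows "(" or ",";
# internal node names follow ")" and are therefore skipped.
LEAF_TOKEN = re.compile(r"(?<=[(,])\s*([^(),:;\s]+)")
# Domain leaves look like g<x>_<y>: gene identifier, then 0-indexed position.
DOMAIN_LEAF = re.compile(r"g(\d+)_(\d+)$")


def domain_layouts(dtree_newick):
    """Map each gene identifier to its domain names, ordered by position."""
    by_gene = defaultdict(list)
    for name in LEAF_TOKEN.findall(dtree_newick):
        match = DOMAIN_LEAF.match(name)
        if match is None:
            raise ValueError("unexpected domain leaf name: {!r}".format(name))
        gene, position = int(match.group(1)), int(match.group(2))
        by_gene[gene].append((position, name))
    return {
        gene: [name for _, name in sorted(items)]
        for gene, items in by_gene.items()
    }


print(domain_layouts("((g1_0,g2_0)a,(g3_0,g3_1)b)c;"))
# {1: ['g1_0'], 2: ['g2_0'], 3: ['g3_0', 'g3_1']}
```

The result matches the table above: genes 1 and 2 each carry one domain, and gene 3 carries two domains in positions 0 and 1.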
To run the tool on your own inputs, use the run.py script. It takes two positional arguments and one optional argument:
- gtree: The input gene tree in the correct format
- dtree: The input domain tree in the correct format
- type: The type of solver to use. Choose between an exact ILP solution to the TDL Reconciliation Problem, or a fast heuristic. (Optional, default: heuristic)
The tool outputs two files:
tandem_duplications.out includes all tandem duplications found in the example. Each line contains a list of all domains participating in a single tandem duplication. Single duplications are represented as lines with only one domain.
Example:
[a,b]
[c,d,e]
[f]
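For downstream analysis, each line of this file can be turned into a Python list. The sketch below assumes the file looks exactly like the example above (one bracketed, comma-separated list per line); the function names are hypothetical, and stripping quotes around names is a defensive assumption rather than documented behavior.

```python
def parse_duplication_line(line):
    """Parse one line such as "[a,b]" into ["a", "b"]."""
    text = line.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected a bracketed list, got: {!r}".format(line))
    inner = text[1:-1].strip()
    if not inner:
        return []
    # Tolerate optional spaces and quotes around each domain name.
    return [name.strip().strip("'\"") for name in inner.split(",")]


def read_tandem_duplications(lines):
    """Parse all non-empty lines; each result is one duplication event."""
    return [parse_duplication_line(line) for line in lines if line.strip()]


events = read_tandem_duplications(["[a,b]", "[c,d,e]", "[f]"])
singles = [event for event in events if len(event) == 1]
print(len(events), "events,", len(singles), "single duplication(s)")
```

To read the real output, pass an open file handle instead of the list, for example `read_tandem_duplications(open("tandem_duplications.out"))`.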
mapfile.out contains a full mapping of nodes in the domain tree to nodes in the gene tree. Each line contains the mapping of a single domain, in the form
<domain_name> <gene_name>
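The mapping file can be loaded into a dictionary in a similar way. This is a minimal sketch assuming whitespace-separated pairs, one per line, as described above; the function name `read_mapfile` and the node names in the usage lines are illustrative only.

```python
def read_mapfile(lines):
    """Return {domain_name: gene_name} from "<domain_name> <gene_name>" lines."""
    mapping = {}
    for lineno, line in enumerate(lines, 1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(
                "line {}: expected two fields, got {!r}".format(lineno, line)
            )
        domain, gene = fields
        mapping[domain] = gene
    return mapping


# Illustrative input only; real node names come from your own trees.
example = ["g1_0 h1", "g2_0 h2", "a A"]
print(read_mapfile(example)["g1_0"])
```

As with the duplications file, an open file handle such as `open("mapfile.out")` can be passed directly.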