fDOG Assembly

Installation and Setup

We strongly recommend installing the program in a fresh conda environment to ensure compatibility.

fDOG-Assembly is part of the fdog Python package and can be installed via pip.

python3 -m pip install fdog

For more detailed installation instructions, please check this wiki page.

After installation, you need to run the setup for both fDOG and FAS (FAS is optional, but recommended to take full advantage of all features).

#setup fDOG
fdog.setup -d <directory_for_fDOG_data>

#setup FAS
fas.setup -t <directory_for_annotation_tools>

For more information about setting up FAS and fDOG, please refer to their respective wikis.

Test run

We strongly recommend testing fDOG and fDOG-Assembly before running them on your own data.

First, test whether fDOG is working correctly:

fdog.run --seqFile infile.fa --jobName test --refspec HUMAN@9606@qfo24_02

Next, use the fDOG output to test fDOG-Assembly:

fdog.assembly --gene test --refSpec HUMAN@9606@qfo24_02 --augustus --augustusRefSpec human --coregroupPath core_orthologs/ --out test_assembly

The results of fDOG-Assembly will be stored in the test_assembly folder.

Data preparation

Core-ortholog group

fDOG-Assembly can directly use core-ortholog groups calculated by fDOG. Alternatively, existing core-ortholog groups can be downloaded from databases such as OMA, and the module fdog.addCoreGroup can be used to build the required data structure from them. Because specific headers are required, we recommend calculating core-ortholog groups with fDOG. Each fDOG-Assembly run requires one reference species. The reference species must be part of the core-ortholog group, and a protein set of the reference species must be available in the fDOG data structure, which is described in the next section and in more detail in the fDOG section.
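Before starting a run, it can help to confirm that a core-ortholog group folder is complete. The following is a minimal sketch, not part of fDOG: the helper name missing_core_group_files is hypothetical, and it assumes that the group lives in a folder named after the gene and that the fasta, hmm and aln files mentioned in the --gene help text end in .fa, .hmm and .aln. Because the exact sub-folder layout is not specified here, the sketch searches the group folder recursively.

```python
import tempfile
from pathlib import Path

# Assumed file endings for the fasta, hmm and alignment files of a group.
REQUIRED_SUFFIXES = (".fa", ".hmm", ".aln")


def missing_core_group_files(coregroup_path, gene):
    """Return the required suffixes for which no file exists in the group folder."""
    group_dir = Path(coregroup_path) / gene
    if not group_dir.is_dir():
        return list(REQUIRED_SUFFIXES)
    # Search recursively: the sub-folder layout is an unknown in this sketch.
    found = {p.suffix for p in group_dir.rglob("*") if p.is_file()}
    return [suffix for suffix in REQUIRED_SUFFIXES if suffix not in found]


# Demo on a throwaway folder that lacks the hmm file.
with tempfile.TemporaryDirectory() as tmp:
    group = Path(tmp) / "test"
    group.mkdir()
    (group / "test.fa").touch()
    (group / "test.aln").touch()
    demo_missing = missing_core_group_files(tmp, "test")

print(demo_missing)
```

An empty list means that all three file types were found; it does not check the sequence headers.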

Species data

fDOG-Assembly uses the same data structure as fDOG. The genomic data of the reference species have to be located in the folders annotation_dir, searchTaxa_dir and coreTaxa_dir. The data can be prepared with the script fdog.addTaxon, or for multiple species with fdog.addTaxa. Additionally, a folder containing the genome assemblies that should be searched for orthologs is required. The helper script fdog.addAssembly can be used for easy data preparation. As an example, if the reference species is human (HUMAN@9606@2209) and we want to search in a mouse assembly (MOUSE@10090@2209), the required data structure is the following:

├── annotation_dir/
│   └── HUMAN@9606@2209.json
├── assembly_dir/
│   └── MOUSE@10090@2209/
│       └── MOUSE@10090@2209.fa
├── coreTaxa_dir/
│   └── HUMAN@9606@2209/
│       ├── HUMAN@9606@2209.fa
│       ├── HUMAN@9606@2209.fa.fai
│       ├── HUMAN@9606@2209.pdb
│       ├── HUMAN@9606@2209.phr
│       ├── HUMAN@9606@2209.pin
│       ├── HUMAN@9606@2209.pjs
│       ├── HUMAN@9606@2209.pot
│       ├── HUMAN@9606@2209.psq
│       ├── HUMAN@9606@2209.ptf
│       └── HUMAN@9606@2209.pto
└── searchTaxa_dir/
    └── HUMAN@9606@2209/
        ├── HUMAN@9606@2209.fa.fai
        ├── HUMAN@9606@2209.fa.checked
        └── HUMAN@9606@2209.fa

The path to the prepared data should be passed to fDOG-Assembly with the parameter --dataPath.
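The layout above can be checked before a run. The following is a minimal sketch, not part of fDOG: the helper names expected_paths and missing_paths are hypothetical, and it assumes that assembly_dir sits in the same folder as the other three directories, as in the tree above. It only checks the main files (the .json annotation and the .fa files), not the index or BLAST database files.

```python
import tempfile
from pathlib import Path


def expected_paths(data_path, ref_spec, assemblies):
    """List the main files expected for one reference species and its search assemblies."""
    data = Path(data_path)
    paths = [
        data / "annotation_dir" / f"{ref_spec}.json",
        data / "coreTaxa_dir" / ref_spec / f"{ref_spec}.fa",
        data / "searchTaxa_dir" / ref_spec / f"{ref_spec}.fa",
    ]
    # Assumption: assembly_dir is located next to the other folders.
    paths += [data / "assembly_dir" / name / f"{name}.fa" for name in assemblies]
    return paths


def missing_paths(data_path, ref_spec, assemblies):
    """Return the expected files that do not exist."""
    return [p for p in expected_paths(data_path, ref_spec, assemblies) if not p.is_file()]


# Demo: create the reference species files only, so the mouse assembly is reported.
with tempfile.TemporaryDirectory() as tmp:
    ref, asm = "HUMAN@9606@2209", "MOUSE@10090@2209"
    for path in expected_paths(tmp, ref, []):
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    demo_missing = [p.relative_to(tmp).as_posix() for p in missing_paths(tmp, ref, [asm])]

print(demo_missing)
```

If the list is empty, the folder passed as data_path is a plausible value for --dataPath.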

Annotation tool

The user can choose between two annotation methods. By default, fDOG-Assembly uses MetaEuk for gene prediction. MetaEuk offers precalculated databases that can be downloaded with the command mmseqs databases <database name>; alternatively, you can compute your own database. Please have a look at the general MetaEuk GitHub page or directly at the download instructions for more information. As a second option, the user can use Augustus for gene prediction with the parameter --augustus. In this case, a precomputed Augustus species model has to be selected and passed with --augustusRefSpec. Available Augustus gene models are listed on the Augustus GitHub page.

How to run fDOG-Assembly

The most basic run uses one of the following commands:

MetaEuk:

fdog.assembly --gene <gene name> --refSpec <reference species> --metaeukDb </path/to/metaeukDb>

Augustus:

fdog.assembly --gene <gene name> --refSpec <reference species> --augustus --augustusRefSpec <Augustus reference Species>

If fDOG-Assembly does not automatically find your fDOG data, give the path to your data with --dataPath. Please use the folder names as described above. You can move your assembly_dir to the same folder and it will be recognised automatically. By default, fDOG-Assembly searches all assemblies located in assembly_dir. If you want to search in only a subset, use the parameter --searchTaxa and give the search species as a space-separated list.
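When fDOG-Assembly is called from a script, the options described above can be assembled into an argument list. The following is a minimal sketch: the helper build_assembly_command is hypothetical and not part of fDOG, and it assumes that MetaEuk and Augustus are used as alternatives, as in the two basic commands. It only uses flags that appear in this document.

```python
import shlex


def build_assembly_command(gene, ref_spec, metaeuk_db=None, augustus_ref_spec=None,
                           data_path=None, search_taxa=None):
    """Build the argument list for one fdog.assembly call."""
    # Assumption: exactly one gene prediction mode is configured per call.
    if (metaeuk_db is None) == (augustus_ref_spec is None):
        raise ValueError("give exactly one of metaeuk_db or augustus_ref_spec")
    cmd = ["fdog.assembly", "--gene", gene, "--refSpec", ref_spec]
    if augustus_ref_spec is not None:
        cmd += ["--augustus", "--augustusRefSpec", augustus_ref_spec]
    else:
        cmd += ["--metaeukDb", metaeuk_db]
    if data_path:
        cmd += ["--dataPath", data_path]
    if search_taxa:
        # --searchTaxa takes the species as separate, space-separated arguments.
        cmd += ["--searchTaxa", *search_taxa]
    return cmd


cmd = build_assembly_command(
    "test",
    "HUMAN@9606@qfo24_02",
    augustus_ref_spec="human",
    search_taxa=["MOUSE@10090@2209"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once fdog is installed; printing it first makes it easy to review the call.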

Parameters

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Required arguments:
  --gene GENE           Core_ortholog group name. Folder including the fasta file, hmm file and aln file has to be located in core_orthologs/
  --refSpec REFSPEC [REFSPEC ...]
                        Reference taxon/taxa for fDOG.

Optional arguments:
  --avIntron AVINTRON   average intron length of the assembly species in bp (default: 50000)
  --lengthExtension LENGTHEXTENSION
                        length extension of the candidate regions in bp (default:20000)
  --assemblyPath ASSEMBLYPATH
                        Path for the assembly directory, (default dataPath)
  --tmp                 tmp files will not be deleted
  --out OUT             Output directory
  --dataPath DATAPATH   fDOG data directory containing searchTaxa_dir, coreTaxa_dir and annotation_dir
  --coregroupPath COREGROUPPATH
                        core_ortholog directory containing ortholog groups of gene of interest
  --evalBlast EVALBLAST
                        E-value cut-off for the Blast search. (default: 0.00001)
  --strict              An ortholog is only then accepted when the reciprocity is fulfilled for each sequence in the core set
  --msaTool {mafft-linsi,muscle}
                        Choose between mafft-linsi or muscle for the multiple sequence alignment. (default:muscle)
  --checkCoorthologsRefOff
                        Turn off that during the final ortholog search, an ortholog is accepted also when its best hit in the reverse search is not the core ortholog
                        itself, but a co-ortholog of it
  --scoringmatrix {identity,blastn,trans,benner6,benner22,benner74,blosum100,blosum30,blosum35,blosum40,blosum45,blosum50,blosum55,blosum60,blosum62,blosum65,blosum70,blosum75,blosum80,blosum85,blosum90,blosum95,feng,fitch,genetic,gonnet,grant,ident,johnson,levin,mclach,miyata,nwsgappep,pam120,pam180,pam250,pam30,pam300,pam60,pam90,rao,risler,structure}
                        Choose a scoring matrix for the distance criteria used by the option --checkCoorthologsRef. (default: blosum62)
  --coreTaxa CORETAXA [CORETAXA ...]
                        List of core taxa used during --strict
  --fasoff              Turn off FAS support
  --pathFile PATHFILE   Config file contains paths to data folder (in yaml format)
  --searchTaxa SEARCHTAXA [SEARCHTAXA ...]
                        List of Taxa to search in, (default: all species located in assembly_dir)
  --debug               Stdout and Stderr from fdog.assembly and every used tool will be printed, caution: using --parallel can result in messy output
  --force               Overwrite existing output files
  --append              Append the output to existing output files, caution: reference species must be identical
  --parallel            The ortholog search of multiple species will be done in parallel
  --augustus            Gene prediction is done by using the tool Augustus PPX
  --augustusRefSpec AUGUSTUSREFSPEC
                        Augustus reference species identifier (use command: augustus --species=help to get precomputed augustus gene models)
  --augustusRefSpecFile AUGUSTUSREFSPECFILE
                        Mapping file tab separated containing Assembly Names and augustus reference species that should be used
  --metaeukDb METAEUKDB
                        Path to MetaEuk reference database
  --isoforms            All Isoforms of a gene passing the ortholog verification will be included in the output
  --gff                 GFF files will be included in output
  --cpus CPUS           The number of CPUs fDOG-Assembly is allowed to use. The maximal number of species in which orthologs will be searched in parallel.
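The mapping file for --augustusRefSpecFile can be written with a few lines of Python. The following is a minimal sketch under stated assumptions: the help text only says that the file is tab separated and contains assembly names and Augustus reference species, so the column order (assembly name first) and the absence of a header line are assumptions, and the helper write_augustus_mapping is hypothetical. The second assembly name is invented for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_augustus_mapping(path, mapping):
    """Write one 'assembly name <TAB> Augustus species' line per assembly."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for assembly, species in mapping.items():
            # Assumed column order: assembly name, then Augustus species.
            writer.writerow([assembly, species])


# Demo: write a mapping for two assemblies and read it back.
with tempfile.TemporaryDirectory() as tmp:
    mapping_file = Path(tmp) / "augustus_mapping.tsv"
    write_augustus_mapping(
        mapping_file,
        {"MOUSE@10090@2209": "human", "DROME@7227@2209": "fly"},
    )
    demo_text = mapping_file.read_text()

print(demo_text, end="")
```

The species identifiers should be taken from the output of augustus --species=help, as noted in the help text for --augustusRefSpec.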