fDOG Assembly
- Installation and Setup
- Test run
- Data preparation
- Annotation tool
- How to run fDOG-Assembly
- Parameters
- Errors
We strongly recommend installing the program in a fresh conda environment to ensure compatibility.
fDOG-Assembly is part of the fdog Python package and can be installed via pip.
python3 -m pip install fdog
For more detailed installation instructions, please check this wiki page.
After installation, you need to run the setup for both fDOG and FAS (FAS is optional, but recommended to take full advantage of all features).
#setup fDOG
fdog.setup -d <directory_for_fDOG_data>
#setup FAS
fas.setup -t <directory_for_annotation_tools>
For more information about setting up FAS and fDOG, please refer to their respective wikis.
We strongly recommend testing fDOG and fDOG-Assembly before running them on your own data.
First, test whether fDOG is working correctly:
fdog.run --seqFile infile.fa --jobName test --refspec HUMAN@9606@qfo24_02
Next, use the fDOG output to test fDOG-Assembly:
fdog.assembly --gene test --refSpec HUMAN@9606@qfo24_02 --augustus --augustusRefSpec human --coregroupPath core_orthologs/ --out test_assembly
The results of fDOG-Assembly will be stored in the test_assembly folder.
fDOG-Assembly can directly use core-ortholog groups calculated by fDOG. Alternatively, existing core-ortholog groups can be downloaded from databases like OMA, and the module fdog.addCoreGroup can be used to create the required data structure. Because specific headers are required, we recommend calculating core-ortholog groups with fDOG. Each fDOG-Assembly run requires one reference species. The reference species has to be part of the core-ortholog group, and a protein set of the reference species must be available in the fDOG data structure, which is described in the next paragraph and in more detail in the fDOG section.
fDOG-Assembly uses the same data structure as fDOG. The data of the reference species have to be located in the folders annotation_dir, searchTaxa_dir and coreTaxa_dir. The data can be prepared with the script fdog.addTaxon or, for multiple species, with fdog.addTaxa. Additionally, a folder containing the genome assemblies that should be searched for orthologs is required. The helper script fdog.addAssembly can be used for easy data preparation. As an example, if we have the reference species Human (HUMAN@9606@2209) and want to search in a mouse assembly (MOUSE@10090@2209), the required data structure is the following:
├── annotation_dir/
│   └── HUMAN@9606@2209.json
├── assembly_dir/
│   └── MOUSE@10090@2209/
│       └── MOUSE@10090@2209.fa
├── coreTaxa_dir/
│   └── HUMAN@9606@2209/
│       ├── HUMAN@9606@2209.fa
│       ├── HUMAN@9606@2209.fa.fai
│       ├── HUMAN@9606@2209.pdb
│       ├── HUMAN@9606@2209.phr
│       ├── HUMAN@9606@2209.pin
│       ├── HUMAN@9606@2209.pjs
│       ├── HUMAN@9606@2209.pot
│       ├── HUMAN@9606@2209.psq
│       ├── HUMAN@9606@2209.ptf
│       └── HUMAN@9606@2209.pto
└── searchTaxa_dir/
    └── HUMAN@9606@2209/
        ├── HUMAN@9606@2209.fa.fai
        ├── HUMAN@9606@2209.fa.checked
        └── HUMAN@9606@2209.fa
The path to the prepared data should be passed to fDOG-Assembly with the parameter --dataPath.
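Before starting a run, it can help to confirm that the folders described above exist. The following is a minimal sketch, not part of fDOG: the function name check_fdog_layout is hypothetical, and it only checks that the four directories are present, not their contents. The optional second argument covers the case where the assemblies live outside the data folder (see --assemblyPath).

```shell
# Hypothetical helper (not shipped with fDOG): report missing data folders.
# Usage: check_fdog_layout <dataPath> [<assemblyPath>]
# Runs in a subshell so its variables do not leak into the caller.
check_fdog_layout() (
    data_path=$1
    # Assumption: assemblies default to <dataPath>/assembly_dir.
    assembly_path=${2:-$data_path/assembly_dir}
    status=0
    for dir in "$data_path/annotation_dir" "$data_path/coreTaxa_dir" \
               "$data_path/searchTaxa_dir" "$assembly_path"; do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"
            status=1
        fi
    done
    exit "$status"
)
```

Calling check_fdog_layout </path/to/data> prints one line per missing folder and returns a non-zero status if any folder is absent.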
The user can choose between two annotation methods. By default, fDOG-Assembly uses MetaEuk for gene prediction. MetaEuk offers precalculated databases that can be downloaded with the command mmseqs databases <database name>, or you can build your own database. Please have a look at the general MetaEuk GitHub page or directly at the download instructions for more information. As a second option, the user can use Augustus for gene prediction with the parameter --augustus. In that case, a precomputed Augustus species model has to be selected and passed with --augustusRefSpec. Available Augustus gene models are listed on the Augustus GitHub page.
The most basic run looks like this:
MetaEuk:
fdog.assembly --gene <gene name> --refSpec <reference species> --metaeukDb </path/to/metaeukDb>
Augustus:
fdog.assembly --gene <gene name> --refSpec <reference species> --augustus --augustusRefSpec <Augustus reference Species>
If fDOG-Assembly does not find your fDOG data automatically, provide the path to your data with --dataPath. Please use the folder names described above. If you move your assembly_dir into the same folder, it will be recognised automatically. By default, fDOG-Assembly searches all assemblies located in assembly_dir. To search only a subset, use the parameter --searchTaxa and give the search species as a space-separated list.
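To see which assemblies would be searched by default, you can list the subfolders of assembly_dir and then pick a subset for --searchTaxa. The sketch below assumes the folder layout described above; list_assemblies is a hypothetical helper, and RATNO@10116@2209 is a made-up second assembly used only for illustration.

```shell
# Hypothetical helper (not shipped with fDOG): print the names of all
# assemblies found below <dataPath>/assembly_dir, one per line.
list_assemblies() (
    for dir in "$1"/assembly_dir/*/; do
        # Skips the unexpanded pattern when no subfolder exists.
        [ -d "$dir" ] && basename "$dir"
    done
    exit 0
)

# Example call restricted to two assemblies (placeholders as in the text):
# fdog.assembly --gene <gene name> --refSpec <reference species> \
#     --metaeukDb </path/to/metaeukDb> --dataPath </path/to/data> \
#     --searchTaxa MOUSE@10090@2209 RATNO@10116@2209
```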
options:
-h, --help show this help message and exit
--version show program's version number and exit
Required arguments:
--gene GENE Core_ortholog group name. Folder including the fasta file, hmm file and aln file has to be located in core_orthologs/
--refSpec REFSPEC [REFSPEC ...]
Reference taxon/taxa for fDOG.
Optional arguments:
--avIntron AVINTRON average intron length of the assembly species in bp (default: 50000)
--lengthExtension LENGTHEXTENSION
length extension of the candidate regions in bp (default:20000)
--assemblyPath ASSEMBLYPATH
Path for the assembly directory, (default dataPath)
--tmp tmp files will not be deleted
--out OUT Output directory
--dataPath DATAPATH fDOG data directory containing searchTaxa_dir, coreTaxa_dir and annotation_dir
--coregroupPath COREGROUPPATH
core_ortholog directory containing ortholog groups of gene of interest
--evalBlast EVALBLAST
E-value cut-off for the Blast search. (default: 0.00001)
--strict An ortholog is only then accepted when the reciprocity is fulfilled for each sequence in the core set
--msaTool {mafft-linsi,muscle}
Choose between mafft-linsi or muscle for the multiple sequence alignment. (default:muscle)
--checkCoorthologsRefOff
Turn off that during the final ortholog search, an ortholog is accepted also when its best hit in the reverse search is not the core ortholog
itself, but a co-ortholog of it
--scoringmatrix {identity,blastn,trans,benner6,benner22,benner74,blosum100,blosum30,blosum35,blosum40,blosum45,blosum50,blosum55,blosum60,blosum62,blosum65,blosum70,blosum75,blosum80,blosum85,blosum90,blosum95,feng,fitch,genetic,gonnet,grant,ident,johnson,levin,mclach,miyata,nwsgappep,pam120,pam180,pam250,pam30,pam300,pam60,pam90,rao,risler,structure}
Choose a scoring matrix for the distance criteria used by the option --checkCoorthologsRef. (default: blosum62)
--coreTaxa CORETAXA [CORETAXA ...]
List of core taxa used during --strict
--fasoff Turn off FAS support
--pathFile PATHFILE Config file contains paths to data folder (in yaml format)
--searchTaxa SEARCHTAXA [SEARCHTAXA ...]
List of Taxa to search in, (default: all species located in assembly_dir)
--debug Stdout and Stderr from fdog.assembly and every used tool will be printed, caution: using --parallel can result in messy output
--force Overwrite existing output files
--append Append the output to existing output files, caution: reference species must be identical
--parallel The ortholog search of multiple species will be done in parallel
--augustus Gene prediction is done by using the tool Augustus PPX
--augustusRefSpec AUGUSTUSREFSPEC
Augustus reference species identifier (use command: augustus --species=help to get precomputed augustus gene models)
--augustusRefSpecFile AUGUSTUSREFSPECFILE
Tab-separated mapping file containing assembly names and the Augustus reference species that should be used
--metaeukDb METAEUKDB
Path to MetaEuk reference database
--isoforms All Isoforms of a gene passing the ortholog verification will be included in the output
--gff GFF files will be included in output
--cpus CPUS The number of CPUs fDOG-Assembly is allowed to use. The maximal number of species in which orthologs will be searched in parallel.
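The help text for --augustusRefSpecFile only states that the mapping file is tab-separated and contains assembly names and Augustus reference species. The sketch below shows one way to write such a file. The column order (assembly name first, Augustus species second) and the absence of a header line are assumptions, and add_augustus_mapping is a hypothetical helper name.

```shell
# Hypothetical helper (not shipped with fDOG): append one mapping line.
# Usage: add_augustus_mapping <mapping file> <assembly name> <Augustus species>
# Assumption: columns are <assembly name><TAB><Augustus species>, no header.
add_augustus_mapping() {
    printf '%s\t%s\n' "$2" "$3" >> "$1"
}

# Example (assembly name taken from the data structure shown earlier):
# add_augustus_mapping augustus_map.tsv MOUSE@10090@2209 human
# fdog.assembly --gene <gene name> --refSpec <reference species> \
#     --augustus --augustusRefSpecFile augustus_map.tsv
```

Run augustus --species=help to look up valid species identifiers for the second column.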
augustus: error while loading shared libraries: libboost_iostreams.so.1.85.0: cannot open shared object file: No such file or directory
This error means that libboost cannot be found or is not the correct version. Please install it again via conda:
conda install -c conda-forge boost=1.85
Check afterwards that all libraries are found with:
ldd $(which augustus)
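In the ldd output, unresolved libraries are marked with "not found". As a small convenience, the sketch below filters the output down to those entries. The function name missing_libs is hypothetical, and it simply reads ldd output from standard input.

```shell
# Hypothetical helper: read ldd output on stdin and print the names of
# libraries that could not be resolved (lines containing "not found").
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Usage:
# ldd "$(command -v augustus)" | missing_libs
```

If the command prints nothing, all shared libraries required by augustus were resolved.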