Input and Output Files
Besides the given input file, fDOG needs the following three directories, covering every taxon included in your study, in order to function:
- searchTaxa_dir: Contains one sub-directory of proteome FASTA files for each taxon. All taxa in this folder will be used for the ortholog search.
- coreTaxa_dir: Contains sub-directories for the BLAST databases made with makeblastdb out of your proteomes. Not every taxon in searchTaxa_dir needs a BLAST database; only taxa that should be included in the core ortholog group compilation must be present in this folder.
- annotation_dir: Contains feature annotation files for all taxa present in searchTaxa_dir and coreTaxa_dir. These annotation files are optional. However, to use all the features of fDOG, including the FAS score calculations, we recommend having these data available.
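The relationship between the three directories can be checked with a short script before a run. The following is only a sketch, not part of fDOG: the function name `taxa_missing_data` is hypothetical, and it assumes that each annotation file is named after its taxon (for example `<taxon>.json`), which this page does not state.

```python
from pathlib import Path


def taxa_missing_data(search_dir, core_dir, anno_dir):
    """Report taxa that lack a core (BLAST) folder or an annotation file.

    Hypothetical helper, not an fDOG function.
    """
    # Taxa are taken from the sub-directory names of each folder.
    search = {p.name for p in Path(search_dir).iterdir() if p.is_dir()}
    core = {p.name for p in Path(core_dir).iterdir() if p.is_dir()}
    # Assumption: an annotation file is named <taxon>.<extension>.
    anno = {p.stem for p in Path(anno_dir).iterdir() if p.is_file()}
    return {
        # Search taxa without a BLAST database (allowed, but worth knowing).
        "no_core_db": sorted(search - core),
        # Taxa from either folder without an annotation file.
        "no_annotation": sorted((search | core) - anno),
    }
```

A taxon listed under `no_core_db` can still be searched for orthologs, but it cannot contribute to the core ortholog group compilation.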
fDOG comes with pre-calculated data for 81 QFO reference species (data set 2024_02). If you want to work with other taxa, you can add them to fDOG by following this instruction.
NOTE: you can rename searchTaxa_dir, coreTaxa_dir and annotation_dir to anything you like and place them anywhere you want.
NOTE 2: we recommend checking your own data for validity before running fDOG.
While fDOG runs, an additional folder, core_orthologs, is created to store the core ortholog groups. By default, this directory is created inside the current directory. All four folders can be specified manually using the corresponding command parameters --hmmpath, --searchpath, --corepath and --annopath, or read from a YAML file with the option --pathFile. An example pathConfig.yml file looks like this:
hmmpath: /home/yourname/working_dir/core_orthologs
searchpath: /home/yourname/fdog_data/searchTaxa_dir
corepath: /home/yourname/fdog_data/coreTaxa_dir
annopath: /home/yourname/fdog_data/annotation_dir
If all of those folders are located in the same directory, a single line in the pathConfig.yml file is enough:
dataPath: /home/yourname/fdog_data
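To make the two config styles concrete, here is a minimal sketch of how such a file could be turned into explicit paths. It is not fDOG's own implementation: the function `resolve_paths` is hypothetical, it parses only plain `key: value` lines rather than full YAML, and it assumes that the folders under `dataPath` keep their default names (searchTaxa_dir, coreTaxa_dir, annotation_dir). How fDOG places core_orthologs when only `dataPath` is given is not covered here, so `hmmpath` is returned only when it is set explicitly.

```python
def resolve_paths(config_text):
    """Turn pathConfig.yml-style text into a dict of explicit paths.

    Hypothetical helper; handles simple 'key: value' lines only.
    """
    cfg = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        cfg[key.strip()] = value.strip()

    # Assumption: default folder names are used below dataPath.
    defaults = {
        "searchpath": "searchTaxa_dir",
        "corepath": "coreTaxa_dir",
        "annopath": "annotation_dir",
    }
    data = cfg.get("dataPath")
    paths = {}
    for key, folder in defaults.items():
        if key in cfg:
            # An explicit entry wins over the dataPath shortcut.
            paths[key] = cfg[key]
        elif data:
            paths[key] = data.rstrip("/") + "/" + folder
    if "hmmpath" in cfg:
        paths["hmmpath"] = cfg["hmmpath"]
    return paths
```

For example, `resolve_paths("dataPath: /home/yourname/fdog_data")` yields the same search, core and annotation paths as the four-line example above.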
The input (or seed sequence) for fDOG is a single FASTA file. For example:
>HUMAN@9606@qfo24_02|P83876
MSYMLPHLHNGWQVDQAILSEEDRVVVIRFGHDWDPTCMKMDEVLYSIAEKVKNFAVIYL
VDITEVPDFNKMYELYDPCTVMFFFRNKHIMIDLGTGNNNKINWAMEDKQEMVDIIETVY
RGARKGRGLVVSPKDYSTKYRY
The taxon of this seed sequence, which is called the reference taxon and is specified with the option --refspec, must be present in the BLAST database directory (coreTaxa_dir) of fDOG.
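A seed file can be inspected before a run with a few lines of Python. This is a sketch under one assumption: that the header follows the `<taxon>|<protein id>` layout of the example above. This page does not say that fDOG requires this layout, and `read_seed` is a hypothetical helper, not an fDOG function.

```python
def read_seed(fasta_text):
    """Return (taxon, protein_id, sequence) from a single-entry FASTA text.

    Hypothetical helper; assumes a '<taxon>|<protein id>' header.
    """
    header, seq = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected a single seed sequence")
            header = line[1:]
        elif line:
            # Sequence lines are joined into one string.
            seq.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    taxon, _, protein_id = header.partition("|")
    return taxon, protein_id, "".join(seq)
```

For the example above, the first returned value is `HUMAN@9606@qfo24_02`, which is the name that would have to exist in coreTaxa_dir.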
For one seed sequence, the fDOG output consists of these text files (note: test is the job name you define with the --jobName parameter):
- test.extended.fa: a multiple FASTA file containing the seed and its ortholog sequences
- test.phyloprofile: an input file for analysing the phylogenetic profile of the query gene with the PhyloProfile tool
- test_forward.domains and, optionally, test_reverse.domains: protein domain annotation files for all the sequences present in the orthologous group. The _forward or _reverse suffix indicates the direction of the feature architecture comparison (FAS): _forward means that the query gene is used as seed and its orthologs as targets for the comparison, while _reverse is vice versa. These files can be submitted to PhyloProfile for visualisation.
For a rich visualisation of the information in the fDOG outputs, you can load them into the PhyloProfile tool.
The main input file for PhyloProfile is test.phyloprofile, which contains a list of all orthologous gene names and the taxonomy IDs of their taxa, together with the FAS scores (if available). To analyse further information such as the FASTA sequences or the domain annotations, you can optionally load test.extended.fa and test_forward.domains (or test_reverse.domains) into PhyloProfile.
You can combine multiple fDOG runs into a single phylogenetic profile input using the fdog.mergeOutput function:
fdog.mergeOutput -i /path/to/fdog/single/output/files/ -o output_name
in which /path/to/fdog/single/output/files/ is a directory where all single *.phyloprofile, *.domains and *.extended.fa files can be found.
The resulting files output_name.phyloprofile, output_name.extended.fa, output_name_forward.domains and output_name_backward.domains are saved in the current directory.
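Before merging, it can help to see which single-run files are actually present in the input directory. The sketch below only lists files by the three suffixes named above. It does not call or reproduce fdog.mergeOutput, and `collect_outputs` is a hypothetical name.

```python
from pathlib import Path


def collect_outputs(directory):
    """Group the file names in a directory by fDOG output suffix.

    Hypothetical helper; it inspects file names only.
    """
    suffixes = (".phyloprofile", ".extended.fa", ".domains")
    found = {s: [] for s in suffixes}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        for s in suffixes:
            if path.name.endswith(s):
                found[s].append(path.name)
    return found
```

If one of the lists comes back empty, the corresponding merged file has nothing to be built from.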