Description
Mikado pick generates a metrics file for the final models (mikado.loci.metrics.tsv) and for the input models (mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv). The mono and subloci files describe spliced and single exon models respectively. NOTE: we are not currently generating the monosubloci file, so this option needs to be added to the pick command that is run.
It would be useful to create a file that gives counts of transcripts with metrics that might suggest a problematic or incorrectly annotated gene model, i.e. one that is biologically unusual or lacks evidence support for its junctions (based on the portcullis results).
Below are the metrics which I feel would be useful to extract and summarise.
Oddities (derived from the mikado monoloci, subloci and loci metrics files): provide a count of transcripts with
five_utr_length >=10000
five_utr_num >=5
three_utr_length >=10000
three_utr_num >=4
is_complete = False
has_start_codon = False
has_stop_codon = False
max_exon_length >=10000
max_intron_length >=500000
min_exon_length <=5
min_intron_length <=5
selected_cds_fraction <=0.3
canonical_intron_proportion != 1
non_verified_introns_num >=1
only_non_canonical_splicing = False
proportion_verified_introns <=0.5
suspicious_splicing = True
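The thresholds above could be applied to a metrics file along these lines. This is only a minimal sketch using the standard library: it assumes the metrics file is tab-separated with one column per metric name as written above, and that boolean metrics are stored as the strings `True`/`False`. `ODDITIES`, `parse_value` and `count_oddities` are hypothetical names for illustration, not part of Mikado.

```python
import csv
import operator

# (metric column, comparison, threshold), copied from the list above.
ODDITIES = [
    ("five_utr_length", ">=", 10000),
    ("five_utr_num", ">=", 5),
    ("three_utr_length", ">=", 10000),
    ("three_utr_num", ">=", 4),
    ("is_complete", "==", False),
    ("has_start_codon", "==", False),
    ("has_stop_codon", "==", False),
    ("max_exon_length", ">=", 10000),
    ("max_intron_length", ">=", 500000),
    ("min_exon_length", "<=", 5),
    ("min_intron_length", "<=", 5),
    ("selected_cds_fraction", "<=", 0.3),
    ("canonical_intron_proportion", "!=", 1),
    ("non_verified_introns_num", ">=", 1),
    ("only_non_canonical_splicing", "==", False),
    ("proportion_verified_introns", "<=", 0.5),
    ("suspicious_splicing", "==", True),
]

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq, "!=": operator.ne}


def parse_value(raw, threshold):
    """Convert a raw TSV cell to the same type as the threshold."""
    if isinstance(threshold, bool):
        # Assumption: booleans are written as "True"/"False".
        return raw.strip().lower() == "true"
    return float(raw)


def count_oddities(handle):
    """Return {"metric op threshold": number of transcripts matching}."""
    counts = {f"{m} {op} {t}": 0 for m, op, t in ODDITIES}
    for row in csv.DictReader(handle, delimiter="\t"):
        for metric, op, threshold in ODDITIES:
            raw = row.get(metric)
            # Skip metrics that are absent or empty in this file.
            if raw is None or raw in ("", "NA"):
                continue
            if OPS[op](parse_value(raw, threshold), threshold):
                counts[f"{metric} {op} {threshold}"] += 1
    return counts
```

Calling `count_oddities(open("mikado.loci.metrics.tsv"))` would give one count per row of the proposed table. One thing to decide is how single exon models should be treated for the intron-based checks, since they may report zero-valued intron metrics and would then be counted by `min_intron_length <=5`.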
These counts would be produced for the final set of models using mikado.loci.metrics.tsv. Note that this file will contain some models we have excluded from the final gene set through the classification, so we should exclude those models when determining these counts.
From mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv we would need to break this down based on label (similar to the busco results) so that we can generate a table with the above metrics as rows and the different gene sets (i.e. final and input gene sets) as columns.
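A rough sketch of how the per-gene-set table could be assembled is given below. It rests on several assumptions that would need checking against our actual files: that the transcript ID column is called `tid`, that input transcript IDs carry the label as a prefix in the form `<label>_<id>`, and that the excluded models are available as a set of transcript IDs. `label_of`, `count_by_gene_set` and `format_table` are hypothetical helper names, and only two of the checks are shown for brevity.

```python
import csv
from collections import defaultdict


def label_of(tid):
    # Assumption: input transcript IDs look like "<label>_<original id>".
    return tid.split("_", 1)[0]


def count_by_gene_set(handle, checks, gene_set_of, excluded=frozenset()):
    """Count transcripts passing each check, grouped by gene set.

    checks      -- {row name: predicate taking a row dict}
    gene_set_of -- function mapping a transcript ID to a column name
    excluded    -- transcript IDs to skip (e.g. models removed by classification)
    """
    counts = defaultdict(lambda: dict.fromkeys(checks, 0))
    for row in csv.DictReader(handle, delimiter="\t"):
        tid = row["tid"]  # assumption: ID column is named "tid"
        if tid in excluded:
            continue
        column = counts[gene_set_of(tid)]
        for name, predicate in checks.items():
            if predicate(row):
                column[name] += 1
    return counts


def format_table(counts, checks):
    """Render rows = checks, columns = gene sets, as tab-separated text."""
    gene_sets = sorted(counts)
    lines = ["\t".join(["metric"] + gene_sets)]
    for name in checks:
        cells = [str(counts[g][name]) for g in gene_sets]
        lines.append("\t".join([name] + cells))
    return "\n".join(lines)


# Two example checks; the full list would follow the thresholds above.
CHECKS = {
    "max_intron_length >= 500000": lambda r: float(r["max_intron_length"]) >= 500000,
    "suspicious_splicing == True": lambda r: r["suspicious_splicing"] == "True",
}
```

For the loci file, `gene_set_of` could simply be `lambda tid: "final"`, with `excluded` holding the models dropped by the classification. For the subloci and monoloci files, `label_of` would split the counts by input gene set, and the resulting dictionaries can be merged before calling `format_table`.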
The values chosen for the metrics should highlight potential issues, though there will be genuine exceptions. Intron size is variable between species: for example, there will be genuine mammalian introns over 500000 bp, but you would still expect these to be small in number, and for many species they will be artefacts.