Generate a metric oddities file based on the Mikado mikado.loci.metrics.tsv, mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv files #28

@swarbred

Mikado pick generates a metrics file for the final models (mikado.loci.metrics.tsv) and for the input models (mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv). The mono and subloci files describe spliced and single exon models respectively. NOTE: we are not currently generating the monosubloci file, so the corresponding option needs to be added to the pick command that is run.

It would be useful to create a file that gives counts of transcripts with metrics that might suggest a problematic or incorrectly annotated gene model, i.e. one that is biologically unusual or lacks evidence support for its junctions (based on the portcullis results).

Below are the metrics which I feel would be useful to extract and summarise:

Oddities (derived from the Mikado monoloci, subloci and loci metrics files): provide a count of transcripts with

five_utr_length >=10000
five_utr_num >=5
three_utr_length >=10000
three_utr_num >=4
is_complete = False
has_start_codon = False
has_stop_codon = False
max_exon_length >=10000
max_intron_length >=500000
min_exon_length <=5
min_intron_length <=5
selected_cds_fraction <=0.3
canonical_intron_proportion != 1
non_verified_introns_num >=1
only_non_canonical_splicing = False
proportion_verified_introns <=0.5
suspicious_splicing = True
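
A minimal sketch of how these counts could be produced, using only the Python standard library. It assumes the metrics files are tab-separated with one row per transcript, that the column names match the metric names above, and that booleans are written as `True`/`False`; none of that is confirmed here. The names `ODDITIES`, `read_metrics`, `parse_value` and `count_oddities` are made up for illustration and are not part of Mikado. The thresholds are copied verbatim from the list.

```python
import csv
import operator

# (metric, comparison, threshold), copied from the list in this issue.
ODDITIES = [
    ("five_utr_length", operator.ge, 10000),
    ("five_utr_num", operator.ge, 5),
    ("three_utr_length", operator.ge, 10000),
    ("three_utr_num", operator.ge, 4),
    ("is_complete", operator.eq, False),
    ("has_start_codon", operator.eq, False),
    ("has_stop_codon", operator.eq, False),
    ("max_exon_length", operator.ge, 10000),
    ("max_intron_length", operator.ge, 500000),
    ("min_exon_length", operator.le, 5),
    ("min_intron_length", operator.le, 5),
    ("selected_cds_fraction", operator.le, 0.3),
    ("canonical_intron_proportion", operator.ne, 1),
    ("non_verified_introns_num", operator.ge, 1),
    ("only_non_canonical_splicing", operator.eq, False),
    ("proportion_verified_introns", operator.le, 0.5),
    ("suspicious_splicing", operator.eq, True),
]


def read_metrics(handle):
    """Read a tab-separated metrics file into a list of dicts (assumed layout)."""
    return list(csv.DictReader(handle, delimiter="\t"))


def parse_value(text):
    """Convert a cell to bool or float; return None if it is neither."""
    if text in ("True", "False"):
        return text == "True"
    try:
        return float(text)
    except (TypeError, ValueError):
        return None  # missing column, "NA", empty cell, etc.


def count_oddities(rows):
    """Count, per metric, the transcripts that meet the oddity threshold."""
    counts = {metric: 0 for metric, _, _ in ODDITIES}
    for row in rows:
        for metric, compare, threshold in ODDITIES:
            value = parse_value(row.get(metric))
            if value is not None and compare(value, threshold):
                counts[metric] += 1
    return counts
```

Cells that cannot be parsed are skipped rather than counted, so a missing column yields a count of 0 instead of an error. Whether that is the right behaviour for the report is a design choice left open.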

This would be for the final set of models using mikado.loci.metrics.tsv. Note that this file will contain some models we have excluded from the final gene set through the classification, so we should exclude those models when determining these counts.
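
The exclusion step could look something like the sketch below. It assumes the transcript identifier sits in a column called `tid` and that the list of transcripts retained after classification is available as a set of IDs. Both are assumptions, and `keep_final_models` is a hypothetical helper name.

```python
def keep_final_models(rows, kept_ids, id_column="tid"):
    """Return only the metric rows whose transcript ID is in the final gene set.

    rows      : list of dicts, one per transcript (e.g. from csv.DictReader)
    kept_ids  : set of transcript IDs retained after classification (assumed input)
    id_column : name of the ID column; "tid" is an assumption
    """
    return [row for row in rows if row.get(id_column) in kept_ids]


# Example with made-up IDs: only t1 survives the filter.
example_rows = [{"tid": "t1"}, {"tid": "t2"}]
filtered = keep_final_models(example_rows, kept_ids={"t1"})
```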

From mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv we would need to break this down by label (similar to the busco results) so that we can generate a table with the above metrics as rows and the different gene sets (i.e. final and input gene sets) as columns.
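
As a rough illustration of the table layout, the sketch below groups rows by label and writes one row per metric and one column per gene set. How the label is encoded in the metrics files is not stated here, so the caller supplies a `label_of` function. `group_rows` and `format_table` are hypothetical names, and the per-set counts are assumed to be plain dicts of metric name to count.

```python
from collections import defaultdict


def group_rows(rows, label_of):
    """Split metric rows into groups keyed by label_of(row) (caller-supplied)."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return dict(groups)


def format_table(counts_by_set, metrics):
    """Render a TSV table: metrics as rows, gene sets as columns.

    counts_by_set : {gene_set_name: {metric: count}}
    metrics       : row order for the table
    """
    gene_sets = list(counts_by_set)
    lines = ["\t".join(["metric"] + gene_sets)]
    for metric in metrics:
        cells = [str(counts_by_set[name].get(metric, 0)) for name in gene_sets]
        lines.append("\t".join([metric] + cells))
    return "\n".join(lines)
```

For example, `format_table({"final": {"five_utr_num": 3}, "input_A": {"five_utr_num": 7}}, ["five_utr_num"])` gives a header line followed by `five_utr_num`, `3` and `7` separated by tabs. The set names `final` and `input_A` are invented for the example.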

The values chosen for the metrics should highlight potential issues, though there will be genuine exceptions. Intron size varies between species: for example, there will be genuine mammalian introns over 500000 bp, but you would still expect these to be few in number, and for many species they will be artefacts.

Labels: enhancement (New feature or request)