
generate_augustus_test_and_train default --target_mono_exonic_pct #34

@swarbred


I suggest we lower the default --target_mono_exonic_pct from 20% to 5%.

For some species with smaller gene sets, finding enough single-exon genes to make up 20% of the 1200 train and test genes (240 genes) won't be possible; this was the case for a recent fungal genome.

REAT Failed, the following file might contain information with the reasons behind the failure
/ei/.project-scratch/e/e701c73c-45b1-4784-9385-6c69cf3272cf/CB-GENANNO-508_ERGA_Spongipellis_delectans/Analysis/reat-dev-issue25/Prediction/cromwell-executions/ei_prediction/d18b476e-faa4-4c2f-98a7-b5797c30ddde/call-SelectAugustusTestAndTrain/execution/stderr
+ generate_augustus_test_and_train /ei/.project-scratch/e/e701c73c-45b1-4784-9385-6c69cf3272cf/CB-GENANNO-508_ERGA_Spongipellis_delectans/Analysis/reat-dev-issue25/Prediction/cromwell-executions/ei_prediction/d18b476e-faa4-4c2f-98a7-b5797c30ddde/call-SelectAugustusTestAndTrain/inputs/-1046222641/with_utr.extra.gff --train_min 400 --train_max 1000 --test_max 200 --target_mono_exonic_pct 20
+ gff2gbSmallDNA.pl test.gff /ei/.project-scratch/e/e701c73c-45b1-4784-9385-6c69cf3272cf/CB-GENANNO-508_ERGA_Spongipellis_delectans/Analysis/reat-dev-issue25/Prediction/cromwell-executions/ei_prediction/d18b476e-faa4-4c2f-98a7-b5797c30ddde/call-SelectAugustusTestAndTrain/inputs/1001504700/gfSpoDele1_1.curated_primary.softmasked.fa 200 test.gb
Couldn't open test.gff.

On examination, I could see that we simply don't have 240 single-exon genes. The generate_augustus_test_and_train script then generates no output and writes nothing to an error log, so it's not transparent to a user what caused the failure.
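To illustrate the kind of message that would have made this failure obvious, here is a minimal sketch of a pre-flight check. It is not the script's actual code: the function names are hypothetical, and it assumes the requested count is the target percentage of train_max + test_max, which matches the 240 reported by the script for 1000 + 200 at 20%.

```python
def requested_mono_exonic(train_max: int, test_max: int, target_pct: float) -> int:
    """Mono-exonic models implied by the target percentage (assumed formula)."""
    return int((train_max + test_max) * target_pct / 100)


def check_mono_exonic_target(available: int, train_max: int,
                             test_max: int, target_pct: float) -> None:
    """Exit with an explanatory message if the target cannot be met."""
    requested = requested_mono_exonic(train_max, test_max, target_pct)
    if available < requested:
        raise SystemExit(
            f"Only {available} mono-exonic models available, but "
            f"--target_mono_exonic_pct {target_pct} requires {requested}; "
            "lower the percentage or supply more models."
        )
```

With the values from this run, check_mono_exonic_target(6, 1000, 200, 20) would stop with a message naming both the 6 available and the 240 requested models, while a 5% target would ask for 60.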

Note that the -f (force) option does not override the --target_mono_exonic_pct 20 requirement, though it does at least give an indication of the error:

generate_augustus_test_and_train /ei/.project-scratch/e/e701c73c-45b1-4784-9385-6c69cf3272cf/CB-GENANNO-508_ERGA_Spongipellis_delectans/Analysis/reat-dev-issue25/Prediction/cromwell-executions/ei_prediction/2482d9fe-d7e9-42dc-bbaa-8259e9e25fb8/call-SelectAugustusTestAndTrain/inputs/-578101069/with_utr.extra.gff  --train_min 400 --train_max 1000 --test_max 200 --target_mono_exonic_pct 20 -f
Requested minimum number of mono-exonic models: 240
Real possible minimum number of mono-exonic models: 6
Number of train models: 32
Number of mono-exonic models in train set: 6
Traceback (most recent call last):
  File "/ei/software/cb/reat/dev-issue32/x86_64/bin/generate_augustus_test_and_train", line 138, in <module>
    main()
  File "/ei/software/cb/reat/dev-issue32/x86_64/bin/generate_augustus_test_and_train", line 101, in main
    test_models = random.sample(train_models, args.test_max)
  File "/ei/software/cb/reat/dev-issue32/x86_64/lib/python3.9/random.py", line 449, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative
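The traceback above comes from random.sample being asked for 200 test models when only 32 train models exist. A sketch of a guard that would turn this into a readable error is below; sample_test_models and its seed parameter are hypothetical names for illustration, not part of the existing script.

```python
import random


def sample_test_models(models, test_max, seed=None):
    """Draw test_max models, failing with a clear message if too few exist."""
    if test_max > len(models):
        raise ValueError(
            f"Cannot draw {test_max} test models from only {len(models)} "
            "candidate models; lower --test_max or relax the selection filters"
        )
    # A dedicated Random instance keeps the draw reproducible when seeded.
    return random.Random(seed).sample(models, test_max)
```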

The idea was that target_mono_exonic_pct would set a maximum percentage of single-exon genes, but as coded it works as a target. That being the case, I would just lower the default to 5%.
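For comparison, the originally intended "maximum" behaviour could look roughly like the sketch below. This is an assumption about how a cap might be applied, not how the script currently selects models; select_models and its parameters are hypothetical.

```python
import random


def select_models(mono, multi, total, max_mono_pct, seed=None):
    """Pick up to `total` models with at most max_mono_pct% mono-exonic."""
    rng = random.Random(seed)
    # Upper bound on mono-exonic models; having fewer available is not an error.
    mono_cap = int(total * max_mono_pct / 100)
    n_mono = min(len(mono), mono_cap)
    # Fill the remainder with multi-exonic models, as far as they are available.
    n_multi = min(len(multi), total - n_mono)
    return rng.sample(mono, n_mono) + rng.sample(multi, n_multi)
```

Under this reading, a genome with only 6 single-exon genes would simply contribute those 6 and fill the rest from multi-exonic models, instead of failing to reach 240.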


Labels: enhancement
