feat(prepro): add custom lineage functions #6174
anna-parker wants to merge 14 commits into fieldPresets
Claude finished @anna-parker's task in 3m 10s
Code Review
Good addition overall. The new
Tests
preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py
    "uppercase": True,
    "is_insdc_ingest_group": True,
    "submittedAt": 1774128802,
    "ACCESSION_VERSION": "LOC_00020W3.1",
is_insdc_ingest_group, submittedAt, and ACCESSION_VERSION appear unrelated to lineage assignment and look like copy-paste residue from another test helper. If assign_custom_lineage doesn't use them, they can be removed to keep the test fixture minimal and self-documenting.
    def test_h7n9_returns_none():
        input_data = make_flu_input(
            ha_subtype="H7",
            na_subtype="N9",
There's no test covering the case where some segments are missing reference_segN entirely (partial genome). This is the scenario that could trigger a false reassortant flag (see comment on processing_functions.py line 1310). Worth adding a test like:
    def test_partial_genome_no_false_reassortant():
        input_data = make_flu_input(
            ha_subtype="H1",
            na_subtype="N1",
            seg4_ref="h1_h1n1pdm",
            seg6_ref="n1_h1n1pdm",
            other_ref="h1n1pdm",
        )
        del input_data["reference_seg2"]  # simulate missing segment
        assert call(input_data) == "H1N1pdm"

    inputs: {input: "nextclade.totalSubstitutions"} # custom nextclade dataset does not have private mutations, so using total substitutions as a proxy for distance from reference
    mu: 0.004
    inputs: {numMutations: "nextclade.totalSubstitutions", length: processed.length_S}
    - name: reference
The old config had a comment explaining why totalSubstitutions is used instead of privateNucMutations for this organism (the custom nextclade dataset doesn't have private mutations). That context was lost when the config was rewritten. Consider adding it back:
    - name: reference
    inputs: {numMutations: "nextclade.totalSubstitutions", length: processed.length_S} # custom nextclade dataset does not have private mutations, so using total substitutions as a proxy for distance from reference
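On the partial-genome comment for test_h7n9_returns_none above: a minimal sketch of the kind of guard that test would exercise, assuming segment references arrive under reference_segN keys as in the test fixture. The names present_segment_references, is_reassortant, and the expected mapping are hypothetical and are not taken from processing_functions.py; the actual logic in assign_custom_lineage may differ.

```python
from collections.abc import Mapping
from typing import Optional

SEGMENT_COUNT = 8  # influenza A genomes have eight segments


def present_segment_references(record: Mapping[str, Optional[str]]) -> dict[int, str]:
    """Collect the reference for each segment that is actually present.

    A missing key, None, or an empty string is treated as "segment absent"
    (an assumption of this sketch).
    """
    refs: dict[int, str] = {}
    for seg in range(1, SEGMENT_COUNT + 1):
        value = record.get(f"reference_seg{seg}")
        if value:
            refs[seg] = value
    return refs


def is_reassortant(record: Mapping[str, Optional[str]], expected: Mapping[int, str]) -> bool:
    """Flag a reassortant only if a present segment disagrees with `expected`.

    Absent segments are skipped, so a partial genome cannot raise the flag
    on its own.
    """
    present = present_segment_references(record)
    return any(ref != expected[seg] for seg, ref in present.items())


# Hypothetical expected references for H1N1pdm, mirroring the values used
# in the suggested test (seg4_ref, seg6_ref, other_ref).
h1n1pdm = {seg: "h1n1pdm" for seg in range(1, SEGMENT_COUNT + 1)}
h1n1pdm[4] = "h1_h1n1pdm"
h1n1pdm[6] = "n1_h1n1pdm"

record = {f"reference_seg{seg}": ref for seg, ref in h1n1pdm.items()}
del record["reference_seg2"]  # simulate missing segment
assert not is_reassortant(record, h1n1pdm)
```

The point of the sketch is that the comparison iterates over present segments only, rather than over all eight and treating a missing value as a mismatch.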
resolves #
thanks to claude for writing the tests!
Screenshot
PR Checklist
🚀 Preview: https://add-custom-lineage-functi.loculus.org