Skip to content

feat(prepro): add custom lineage functions#6174

Open
anna-parker wants to merge 14 commits intofieldPresetsfrom
add_custom_lineage_functions
Open

feat(prepro): add custom lineage functions#6174
anna-parker wants to merge 14 commits intofieldPresetsfrom
add_custom_lineage_functions

Conversation

@anna-parker
Copy link
Contributor

@anna-parker anna-parker commented Mar 20, 2026

resolves #

thanks to claude for writing the tests!

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://add-custom-lineage-functi.loculus.org

@claude claude bot added the preprocessing Issues related to the preprocessing component label Mar 20, 2026
@anna-parker anna-parker added the preview Triggers a deployment to argocd label Mar 20, 2026
@anna-parker anna-parker force-pushed the add_custom_lineage_functions branch from 1457a0f to 67227c6 Compare March 20, 2026 14:10
@anna-parker anna-parker changed the base branch from main to fieldPresets March 20, 2026 14:11
@anna-parker anna-parker marked this pull request as ready for review March 22, 2026 11:40
@claude
Copy link
Contributor

claude bot commented Mar 22, 2026

Claude finished @anna-parker's task in 3m 10s —— View job


Code Review

  • Gather context and read changed files
  • Review processing_functions.py changes
  • Review test_assign_custom_lineage.py
  • Review values.yaml changes
  • Review test_metadata_processing_functions.py changes
  • Post final review

Good addition overall. The new is_variant function is clean and well-tested. A few issues worth addressing:

is_variant

  • Lines 1218–1219: direct input_data["key"] access will raise KeyError (not caught by the local except (ValueError, TypeError)) if the key is absent rather than None. Use .get(). → inline comment
  • Line 1241: error message says "non-numeric threshold value" but the threshold is computed — the actual non-numeric value is length or numMutations. → inline comment

assign_custom_lineage

  • Line 1310: if a segment has no reference_segN key in input (partial genome), extract_regex returns None, which ends up in extracted_subtypes. The reassortant check len(set(...)) > 1 would then count None as a distinct subtype and false-positive. Filtering out Nones fixes it. → inline comment

Tests

  • ARGS constant contains unrelated fields (is_insdc_ingest_group, submittedAt, ACCESSION_VERSION). → inline comment
  • No test for partial genomes (missing reference for some segments) — the most direct way to exercise the false-reassortant bug above. → inline comment

values.yaml

  • The explanatory comment about why totalSubstitutions is used (no private mutations in custom nextclade dataset) was dropped when rewriting the SARS-CoV-2 variant fields. → inline comment

Comment on lines +9 to +12
"uppercase": True,
"is_insdc_ingest_group": True,
"submittedAt": 1774128802,
"ACCESSION_VERSION": "LOC_00020W3.1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_insdc_ingest_group, submittedAt, and ACCESSION_VERSION appear unrelated to lineage assignment and look like copy-paste residue from another test helper. If assign_custom_lineage doesn't use them they can be removed to keep the test fixture minimal and self-documenting.

Fix this →

def test_h7n9_returns_none():
input_data = make_flu_input(
ha_subtype="H7",
na_subtype="N9",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no test covering the case where some segments are missing reference_segN entirely (partial genome). This is the scenario that could trigger a false reassortant flag (see comment on processing_functions.py line 1310). Worth adding a test like:

def test_partial_genome_no_false_reassortant():
    input_data = make_flu_input(ha_subtype="H1", na_subtype="N1",
                                 seg4_ref="h1_h1n1pdm", seg6_ref="n1_h1n1pdm", other_ref="h1n1pdm")
    del input_data["reference_seg2"]  # simulate missing segment
    assert call(input_data) == "H1N1pdm"

inputs: {input: "nextclade.totalSubstitutions"} #custom nextclade dataset does not have private mutations, so using total substitutions as a proxy for distance from reference
mu: 0.004
inputs: {numMutations: "nextclade.totalSubstitutions", length: processed.length_S}
- name: reference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old config had a comment explaining why totalSubstitutions is used instead of privateNucMutations for this organism (the custom nextclade dataset doesn't have private mutations). That context was lost when the config was rewritten. Consider adding it back:

Suggested change
- name: reference
inputs: {numMutations: "nextclade.totalSubstitutions", length: processed.length_S} # custom nextclade dataset does not have private mutations, so using total substitutions as a proxy for distance from reference

anna-parker and others added 3 commits March 22, 2026 12:44
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

preprocessing Issues related to the preprocessing component preview Triggers a deployment to argocd

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant