feat(prepro): add custom lineage functions by anna-parker · Pull Request #6174 · loculus-project/loculus

anna-parker · 2026-03-20T11:40:45Z

resolves #

thanks to claude for writing the tests!

Screenshot

PR Checklist

All necessary documentation has been adapted.
The implemented feature is covered by appropriate, automated tests.
Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://add-custom-lineage-functi.loculus.org

claude · 2026-03-22T11:40:55Z

Claude finished @anna-parker's task in 3m 10s

Code Review

Gather context and read changed files
Review processing_functions.py changes
Review test_assign_custom_lineage.py
Review values.yaml changes
Review test_metadata_processing_functions.py changes
Post final review

Good addition overall. The new is_variant function is clean and well-tested. A few issues worth addressing:

is_variant

Lines 1218–1219: direct input_data["key"] access will raise KeyError (not caught by the local except (ValueError, TypeError)) if the key is absent rather than None. Use .get(). → inline comment
Line 1241: error message says "non-numeric threshold value" but the threshold is computed — the actual non-numeric value is length or numMutations. → inline comment
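The difference between the two access styles can be sketched as follows. This is a minimal illustration, not the code in processing_functions.py: the helper names `parse_counts_strict` and `parse_counts_lenient` are hypothetical, and only the input keys `length` and `numMutations` and the `except (ValueError, TypeError)` clause are taken from the review above.

```python
def parse_counts_strict(input_data):
    """Direct indexing: an absent key raises KeyError, which the except
    clause below does not catch, so the error propagates to the caller."""
    try:
        return float(input_data["length"]), float(input_data["numMutations"])
    except (ValueError, TypeError):
        return None


def parse_counts_lenient(input_data):
    """.get() maps an absent key to None, so absent and explicitly-None
    values take the same early-return path."""
    length = input_data.get("length")
    num_mutations = input_data.get("numMutations")
    if length is None or num_mutations is None:
        return None
    try:
        return float(length), float(num_mutations)
    except (ValueError, TypeError):
        # Here the non-numeric value is length or numMutations, not the threshold.
        return None
```

With the strict variant, an input such as `{"length": "100"}` raises `KeyError`, while `{"length": None, "numMutations": "3"}` returns `None`. The lenient variant returns `None` in both cases.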

assign_custom_lineage

Line 1310: if a segment has no reference_segN key in input (partial genome), extract_regex returns None, which ends up in extracted_subtypes. The reassortant check len(set(...)) > 1 would then count None as a distinct subtype and false-positive. Filtering out Nones fixes it. → inline comment
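A small sketch of the false positive and the proposed fix, assuming `extracted_subtypes` is a list holding one entry per segment, with `None` where no reference was available. The function names below are hypothetical and are not taken from the PR.

```python
def is_reassortant_naive(extracted_subtypes):
    """Counts None as its own subtype, so a partial genome looks mixed."""
    return len(set(extracted_subtypes)) > 1


def is_reassortant_filtered(extracted_subtypes):
    """Ignores segments without an extracted subtype before comparing."""
    known = {subtype for subtype in extracted_subtypes if subtype is not None}
    return len(known) > 1


# One segment is missing its reference, the rest agree.
partial = ["h1n1pdm", None, "h1n1pdm"]
# Two segments carry genuinely different subtypes.
mixed = ["h1n1pdm", "h3n2", "h1n1pdm"]
```

For `partial`, the naive check reports a reassortant and the filtered check does not. Both checks agree on `mixed`.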

Tests

ARGS constant contains unrelated fields (is_insdc_ingest_group, submittedAt, ACCESSION_VERSION). → inline comment
No test for partial genomes (missing reference for some segments) — the most direct way to exercise the false-reassortant bug above. → inline comment

values.yaml

The explanatory comment about why totalSubstitutions is used (no private mutations in custom nextclade dataset) was dropped when rewriting the SARS-CoV-2 variant fields. → inline comment

preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py

claude · 2026-03-22T11:43:39Z

preprocessing/nextclade/tests/test_assign_custom_lineage.py

+    "uppercase": True,
+    "is_insdc_ingest_group": True,
+    "submittedAt": 1774128802,
+    "ACCESSION_VERSION": "LOC_00020W3.1",


is_insdc_ingest_group, submittedAt, and ACCESSION_VERSION appear unrelated to lineage assignment and look like copy-paste residue from another test helper. If assign_custom_lineage doesn't use them they can be removed to keep the test fixture minimal and self-documenting.


claude · 2026-03-22T11:43:52Z

preprocessing/nextclade/tests/test_assign_custom_lineage.py

+    def test_h7n9_returns_none():
+        input_data = make_flu_input(
+            ha_subtype="H7",
+            na_subtype="N9",


There's no test covering the case where some segments are missing reference_segN entirely (partial genome). This is the scenario that could trigger a false reassortant flag (see comment on processing_functions.py line 1310). Worth adding a test like:

def test_partial_genome_no_false_reassortant():
    input_data = make_flu_input(
        ha_subtype="H1",
        na_subtype="N1",
        seg4_ref="h1_h1n1pdm",
        seg6_ref="n1_h1n1pdm",
        other_ref="h1n1pdm",
    )
    del input_data["reference_seg2"]  # simulate missing segment
    assert call(input_data) == "H1N1pdm"

claude · 2026-03-22T11:43:58Z

kubernetes/loculus/values.yaml

-            inputs: {input: "nextclade.totalSubstitutions"} #custom nextclade dataset does not have private mutations, so using total substitutions as a proxy for distance from reference
+              mu: 0.004
+            inputs: {numMutations: "nextclade.totalSubstitutions", length: processed.length_S}
        - name: reference


The old config had a comment explaining why totalSubstitutions is used instead of privateNucMutations for this organism (the custom nextclade dataset doesn't have private mutations). That context was lost when the config was rewritten. Consider adding it back:

Suggested change

- name: reference

inputs: {numMutations: "nextclade.totalSubstitutions", length: processed.length_S} # custom nextclade dataset does not have private mutations, so using total substitutions as a proxy for distance from reference

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

claude bot added the "preprocessing" label (Issues related to the preprocessing component) Mar 20, 2026

anna-parker added the "preview" label (Triggers a deployment to argocd) Mar 20, 2026

anna-parker added 3 commits March 20, 2026 15:10

feat(prepro): add custom lineage functions

8665745

test on preview

e8d9c02

add values for multi seg, multi ref

67227c6

anna-parker force-pushed the add_custom_lineage_functions branch from 1457a0f to 67227c6 March 20, 2026 14:10

anna-parker changed the base branch from main to fieldPresets March 20, 2026 14:11

anna-parker added 8 commits March 20, 2026 15:12

format

514f485

improve

f1faaf9

wupps

e66fea6

require less processed fields

b82af98

add more logging

769b789

feat: add unit tests

e185ccd

improve docs

bb470bd

add is_variant tests

18049ff

anna-parker marked this pull request as ready for review March 22, 2026 11:40

anna-parker requested review from maverbiest and theosanderson March 22, 2026 11:40

claude bot reviewed Mar 22, 2026

preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py (outdated, resolved)

claude bot reviewed Mar 22, 2026

preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py (outdated, resolved)

claude bot reviewed Mar 22, 2026

preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py (outdated, resolved)

claude bot reviewed Mar 22, 2026

anna-parker and others added 3 commits March 22, 2026 12:44

Apply suggestions from code review

905a000

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

fix types

9c59c1e

format

2da239d

anna-parker wants to merge 14 commits into fieldPresets from add_custom_lineage_functions
