feat(preprocessing): move maxSequencesPerEntry validation from backend to preprocessing #6166

Open
theosanderson-agent wants to merge 10 commits into main from feat/move-max-sequences-validation-to-prepro

Conversation

@theosanderson-agent commented Mar 19, 2026

Summary

Moves the maxSequencesPerEntry validation from the backend submission endpoint to the preprocessing pipeline. This addresses #6165.

Problem: When an entry exceeded the allowed number of sequences per entry, the backend threw an UnprocessableEntityException (HTTP 422) that failed the entire batch submission — causing valid entries in the same batch to also be rejected. This was particularly problematic for the ingest pipeline, where one malformed entry would block all other valid entries submitted in the same batch.

Solution: The validation now happens per-entry during preprocessing. Entries that exceed the limit receive a processing error that only affects that specific entry, while other entries in the batch proceed normally.

Changes

Backend (Kotlin):

  • Removed the maxSequencesPerEntry count check from extractAndValidateFastaIds() in MetadataEntry.kt — the backend still validates FASTA ID parsing and duplicate detection, just no longer enforces the count limit
  • Removed the maxSequencesPerEntry parameter from metadataEntryStreamAsSequence() and revisionEntryStreamAsSequence()
  • Removed the config retrieval of maxSequencesPerEntry from SubmitModel.uploadMetadata()
  • Updated unit tests: removed rejection tests, kept acceptance and duplicate detection tests

Preprocessing (Python):

  • Added max_sequences_per_entry: int | None = None config option to Config class
  • Added check_max_sequences_per_entry() function in prepro.py that creates a ProcessingAnnotation error when the limit is exceeded
  • Validation runs after process_all() completes, checking each entry's unalignedNucleotideSequences count
  • Added 3 new tests: exceeding limit produces error, within limit passes, None (unlimited) passes

Kubernetes (Helm):

  • Updated loculus-preprocessing-config.yaml template to pass maxSequencesPerEntry from the organism's submissionDataTypes schema config to the preprocessing config as max_sequences_per_entry
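A minimal sketch of what that template mapping might look like; only the two key names come from this PR, and the context path (`.organism.schema.submissionDataTypes`) is an assumption, not the actual template:

```yaml
# Hypothetical fragment of loculus-preprocessing-config.yaml.
{{- with .organism.schema.submissionDataTypes.maxSequencesPerEntry }}
max_sequences_per_entry: {{ . }}
{{- end }}
# Note: `with` skips zero/empty values, so an explicit
# maxSequencesPerEntry: 0 would be silently dropped.
```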

Tested by submitting a CCHF batch where one entry has 4 sequences and the other only 1, and confirming that both can be processed and only the entry with too many sequences is flagged as having errors.


Test plan

  • Backend unit tests pass (MetadataEntryTest — no longer rejects on count)
  • Preprocessing unit tests pass (new test_max_sequences_per_entry_* tests)
  • Helm lint passes
  • Integration tests pass (submission with too many sequences should now be accepted by backend but flagged as error by preprocessing)
  • Verify that the ingest pipeline can submit batches where some entries exceed the limit without the entire batch failing

🤖 Generated with Claude Code

🚀 Preview: https://feat-move-max-sequences-v.loculus.org

feat(preprocessing): move maxSequencesPerEntry validation from backend to preprocessing

Move the validation of maxSequencesPerEntry from the backend submission
endpoint to the preprocessing pipeline. Previously, when an entry
exceeded the allowed number of sequences, the entire batch submission
would fail with an HTTP 422 error — causing valid entries in the same
batch to also be rejected.

By moving this validation to preprocessing, entries that exceed the
limit now receive a per-entry processing error instead. This means
other valid entries in the same batch are unaffected, which is
particularly important for the ingest pipeline where one bad entry
should not block the rest.

Changes:
- Backend: Remove maxSequencesPerEntry check from extractAndValidateFastaIds(),
  metadataEntryStreamAsSequence(), and revisionEntryStreamAsSequence()
- Backend: Remove maxSequencesPerEntry config retrieval from SubmitModel.uploadMetadata()
- Preprocessing: Add max_sequences_per_entry config option to Config
- Preprocessing: Add check_max_sequences_per_entry() validation in process_all()
- Kubernetes: Pass maxSequencesPerEntry from Helm values to preprocessing config
- Tests: Update backend tests to reflect removed validation, add preprocessing tests

Closes #6165

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude claude bot added backend related to the loculus backend component preprocessing Issues related to the preprocessing component deployment Code changes targetting the deployment infrastructure labels Mar 19, 2026
@theosanderson theosanderson added the preview Triggers a deployment to argocd label Mar 19, 2026
@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Mar 20, 2026
@anna-parker anna-parker marked this pull request as ready for review March 23, 2026 16:08
@anna-parker anna-parker self-requested a review March 23, 2026 16:08

claude bot commented Mar 23, 2026

Claude finished @anna-parker's task in 2m 33s


Code Review


The overall approach is sound — moving the validation from a batch-level HTTP exception to a per-entry preprocessing error is clearly the right fix for #6165. The Kotlin cleanup is clean, the Helm template handles the optional config correctly, and the new Python logic is straightforward. A few things worth addressing before merging:

Unused parameter (prepro.py:618) — accession_version is passed to check_max_sequences_per_entry but never referenced. Either drop it or include it in the error message so it's clear which entry triggered the error. (See inline comment.)

Annotation type (prepro.py:627) — ProcessingAnnotationAlignment ("alignment") is used as the annotation name, but this error isn't an alignment failure. The resulting annotation will have unprocessedFields/processedFields pointing to "alignment", which doesn't map to a real field. (See inline comment.)

Sequence count timing (prepro.py:563) — in process_single_unaligned, the count check runs after assign_segment_using_header has modified the sequences. The post-assignment count may differ from the original submission. This may be intentional but deserves a comment. (See inline comment.)

Missing boundary test (test_nextclade_preprocessing.py:1221) — there's no at-the-limit test case (count == max). Worth adding to confirm the > comparison is correct and there's no off-by-one. (See inline comment.)

Helm falsy-zero edge case (loculus-preprocessing-config.yaml:19) — maxSequencesPerEntry: 0 would be silently dropped because 0 is falsy in Helm. Minor, but worth documenting. (See inline comment.)

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Mar 26, 2026
@anna-parker

@theosanderson-agent can you check that integration tests exist and pass showing that if one entry in a batch has too many sequences, it alone will be flagged while the other entries in the batch succeed?

Add test_max_sequences_per_entry_batch_isolation that verifies when one
entry in a batch exceeds the sequence limit, only that entry receives
errors while other entries in the same batch succeed without
max-sequence errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@theosanderson-agent

I checked and there were preprocessing unit tests covering maxSequencesPerEntry but no test that specifically verified batch isolation (one bad entry in a batch doesn't affect other entries).

Existing tests (in preprocessing/nextclade/tests/test_nextclade_preprocessing.py):

  • test_max_sequences_per_entry_rejects_exceeding_limit — single entry with too many sequences gets error
  • test_max_sequences_per_entry_allows_within_limit — single entry within limit, no error
  • test_max_sequences_per_entry_not_set_allows_any — no limit configured, no error

Added: test_max_sequences_per_entry_batch_isolation (0e36509) — processes a batch of 2 entries via process_all() where one entry has 2 sequences (exceeding limit of 1) and another has 1 sequence (within limit). Asserts the first entry gets a max-sequence error while the second entry has no max-sequence errors.

Note: I couldn't run these tests locally because they require nextclade3 (installed via conda). All 4 max_sequences tests fail identically with FileNotFoundError: nextclade3 in this environment. They should pass in CI where the full conda environment is available.

There are no end-to-end integration tests (Playwright) for this feature: those would require deploying with a maxSequencesPerEntry config and submitting via the UI/CLI, which is beyond what the current test infrastructure covers.
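The batch-isolation property the new test asserts can be illustrated with a small self-contained sketch; `sequence_count_errors` and the entry shapes here are purely illustrative, while the real test drives `process_all()` in test_nextclade_preprocessing.py:

```python
def sequence_count_errors(
    entries: dict[str, dict[str, str]], limit: int
) -> dict[str, list[str]]:
    """Return per-entry error lists; entries over the limit get one error each.

    Uses a strict `>` comparison, so an entry exactly at the limit passes
    (the at-the-limit boundary case flagged in the review).
    """
    errors: dict[str, list[str]] = {av: [] for av in entries}
    for accession_version, sequences in entries.items():
        if len(sequences) > limit:
            errors[accession_version].append(
                f"{accession_version}: {len(sequences)} sequences exceed limit {limit}"
            )
    return errors


# One entry exceeds a limit of 1, the other is within it.
batch = {
    "LOC_A.1": {"seg1": "ACGT", "seg2": "ACGT"},  # 2 sequences: over the limit
    "LOC_B.1": {"seg1": "ACGT"},                  # 1 sequence: within the limit
}
errors = sequence_count_errors(batch, limit=1)
# Isolation: only the over-limit entry is flagged.
assert errors["LOC_A.1"] and not errors["LOC_B.1"]
```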
