feat(preprocessing): move maxSequencesPerEntry validation from backend to preprocessing #6166

Open
theosanderson-agent wants to merge 10 commits into main from feat/move-max-sequences-validation-to-prepro

Conversation

@theosanderson-agent commented Mar 19, 2026

Summary

Moves the maxSequencesPerEntry validation from the backend submission endpoint to the preprocessing pipeline. This addresses #6165.

Problem: When an entry exceeded the allowed number of sequences per entry, the backend threw an UnprocessableEntityException (HTTP 422) that failed the entire batch submission — causing valid entries in the same batch to also be rejected. This was particularly problematic for the ingest pipeline, where one malformed entry would block all other valid entries submitted in the same batch.

Solution: The validation now happens per-entry during preprocessing. Entries that exceed the limit receive a processing error that only affects that specific entry, while other entries in the batch proceed normally.

Changes

Backend (Kotlin):

  • Removed the maxSequencesPerEntry count check from extractAndValidateFastaIds() in MetadataEntry.kt — the backend still validates FASTA ID parsing and duplicate detection, just no longer enforces the count limit
  • Removed the maxSequencesPerEntry parameter from metadataEntryStreamAsSequence() and revisionEntryStreamAsSequence()
  • Removed the config retrieval of maxSequencesPerEntry from SubmitModel.uploadMetadata()
  • Updated unit tests: removed rejection tests, kept acceptance and duplicate detection tests

Preprocessing (Python):

  • Added max_sequences_per_entry: int | None = None config option to Config class
  • Added check_max_sequences_per_entry() function in prepro.py that creates a ProcessingAnnotation error when the limit is exceeded
  • Validation runs after process_all() completes, checking each entry's unalignedNucleotideSequences count
  • Added 3 new tests: exceeding limit produces error, within limit passes, None (unlimited) passes

Kubernetes (Helm):

  • Updated loculus-preprocessing-config.yaml template to pass maxSequencesPerEntry from the organism's submissionDataTypes schema config to the preprocessing config as max_sequences_per_entry
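A minimal sketch of what that template mapping might look like; only the two key names come from this PR, and the context path (`.organism.schema.submissionDataTypes`) is an assumption, not the actual template:

```yaml
# Hypothetical fragment of loculus-preprocessing-config.yaml.
{{- with .organism.schema.submissionDataTypes.maxSequencesPerEntry }}
max_sequences_per_entry: {{ . }}
{{- end }}
# Note: `with` skips zero/empty values, so an explicit
# maxSequencesPerEntry: 0 would be silently dropped.
```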

Tested by submitting a CCHF batch where one entry has 4 sequences and the other only 1, and confirming that both can be processed and only the entry with too many sequences is flagged as having errors.


Test plan

  • Backend unit tests pass (MetadataEntryTest — no longer rejects on count)
  • Preprocessing unit tests pass (new test_max_sequences_per_entry_* tests)
  • Helm lint passes
  • Integration tests pass (submission with too many sequences should now be accepted by backend but flagged as error by preprocessing)
  • Verify that the ingest pipeline can submit batches where some entries exceed the limit without the entire batch failing

🤖 Generated with Claude Code

🚀 Preview: https://feat-move-max-sequences-v.loculus.org

feat(preprocessing): move maxSequencesPerEntry validation from backend to preprocessing

Move the validation of maxSequencesPerEntry from the backend submission
endpoint to the preprocessing pipeline. Previously, when an entry
exceeded the allowed number of sequences, the entire batch submission
would fail with an HTTP 422 error — causing valid entries in the same
batch to also be rejected.

By moving this validation to preprocessing, entries that exceed the
limit now receive a per-entry processing error instead. This means
other valid entries in the same batch are unaffected, which is
particularly important for the ingest pipeline where one bad entry
should not block the rest.

Changes:
- Backend: Remove maxSequencesPerEntry check from extractAndValidateFastaIds(),
  metadataEntryStreamAsSequence(), and revisionEntryStreamAsSequence()
- Backend: Remove maxSequencesPerEntry config retrieval from SubmitModel.uploadMetadata()
- Preprocessing: Add max_sequences_per_entry config option to Config
- Preprocessing: Add check_max_sequences_per_entry() validation in process_all()
- Kubernetes: Pass maxSequencesPerEntry from Helm values to preprocessing config
- Tests: Update backend tests to reflect removed validation, add preprocessing tests

Closes #6165

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude claude bot added backend related to the loculus backend component preprocessing Issues related to the preprocessing component deployment Code changes targetting the deployment infrastructure labels Mar 19, 2026
@theosanderson theosanderson added the preview Triggers a deployment to argocd label Mar 19, 2026
@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Mar 20, 2026
@anna-parker anna-parker marked this pull request as ready for review March 23, 2026 16:08
@anna-parker anna-parker self-requested a review March 23, 2026 16:08

claude bot commented Mar 23, 2026

Claude finished @anna-parker's task in 2m 33s


Code Review


The overall approach is sound — moving the validation from a batch-level HTTP exception to a per-entry preprocessing error is clearly the right fix for #6165. The Kotlin cleanup is clean, the Helm template handles the optional config correctly, and the new Python logic is straightforward. A few things worth addressing before merging:

Unused parameter (prepro.py:618) — accession_version is passed to check_max_sequences_per_entry but never referenced. Either drop it or include it in the error message so it's clear which entry triggered the error. (See inline comment.)

Annotation type (prepro.py:627) — ProcessingAnnotationAlignment ("alignment") is used as the annotation name, but this error isn't an alignment failure. The resulting annotation will have unprocessedFields/processedFields pointing to "alignment", which doesn't map to a real field. (See inline comment.)

Sequence count timing (prepro.py:563) — in process_single_unaligned, the count check runs after assign_segment_using_header has modified the sequences. The post-assignment count may differ from the original submission. This may be intentional but deserves a comment. (See inline comment.)

Missing boundary test (test_nextclade_preprocessing.py:1221) — there's no at-the-limit test case (count == max). Worth adding to confirm the > comparison is correct and there's no off-by-one. (See inline comment.)

Helm falsy-zero edge case (loculus-preprocessing-config.yaml:19) — maxSequencesPerEntry: 0 would be silently dropped because 0 is falsy in Helm. Minor, but worth documenting. (See inline comment.)

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Mar 26, 2026
@anna-parker

@theosanderson-agent can you check that integration tests exist and pass showing that if one entry in a batch has too many sequences, it alone will be flagged while the other entries in the batch succeed?

Add test_max_sequences_per_entry_batch_isolation that verifies when one
entry in a batch exceeds the sequence limit, only that entry receives
errors while other entries in the same batch succeed without
max-sequence errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@theosanderson-agent

I checked and there were preprocessing unit tests covering maxSequencesPerEntry but no test that specifically verified batch isolation (one bad entry in a batch doesn't affect other entries).

Existing tests (in preprocessing/nextclade/tests/test_nextclade_preprocessing.py):

  • test_max_sequences_per_entry_rejects_exceeding_limit — single entry with too many sequences gets error
  • test_max_sequences_per_entry_allows_within_limit — single entry within limit, no error
  • test_max_sequences_per_entry_not_set_allows_any — no limit configured, no error

Added: test_max_sequences_per_entry_batch_isolation (0e36509) — processes a batch of 2 entries via process_all() where one entry has 2 sequences (exceeding limit of 1) and another has 1 sequence (within limit). Asserts the first entry gets a max-sequence error while the second entry has no max-sequence errors.

Note: I couldn't run these tests locally because they require nextclade3 (installed via conda). All 4 max_sequences tests fail identically with FileNotFoundError: nextclade3 in this environment. They should pass in CI where the full conda environment is available.

There are no end-to-end integration tests (Playwright) for this feature: those would require deploying with a maxSequencesPerEntry config and submitting via the UI/CLI, which is beyond what the current test infrastructure covers.
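The batch-isolation property the new test asserts can be illustrated with a small self-contained sketch; `sequence_count_errors` and the entry shapes here are purely illustrative, while the real test drives `process_all()` in test_nextclade_preprocessing.py:

```python
def sequence_count_errors(
    entries: dict[str, dict[str, str]], limit: int
) -> dict[str, list[str]]:
    """Return per-entry error lists; entries over the limit get one error each.

    Uses a strict `>` comparison, so an entry exactly at the limit passes
    (the at-the-limit boundary case flagged in the review).
    """
    errors: dict[str, list[str]] = {av: [] for av in entries}
    for accession_version, sequences in entries.items():
        if len(sequences) > limit:
            errors[accession_version].append(
                f"{accession_version}: {len(sequences)} sequences exceed limit {limit}"
            )
    return errors


# One entry exceeds a limit of 1, the other is within it.
batch = {
    "LOC_A.1": {"seg1": "ACGT", "seg2": "ACGT"},  # 2 sequences: over the limit
    "LOC_B.1": {"seg1": "ACGT"},                  # 1 sequence: within the limit
}
errors = sequence_count_errors(batch, limit=1)
# Isolation: only the over-limit entry is flagged.
assert errors["LOC_A.1"] and not errors["LOC_B.1"]
```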
