feat(preprocessing): move maxSequencesPerEntry validation from backend to preprocessing#6166
Conversation
…d to preprocessing

Move the validation of maxSequencesPerEntry from the backend submission endpoint to the preprocessing pipeline.

Previously, when an entry exceeded the allowed number of sequences, the entire batch submission would fail with an HTTP 422 error — causing valid entries in the same batch to also be rejected. By moving this validation to preprocessing, entries that exceed the limit now receive a per-entry processing error instead. This means other valid entries in the same batch are unaffected, which is particularly important for the ingest pipeline, where one bad entry should not block the rest.

Changes:

- Backend: Remove maxSequencesPerEntry check from extractAndValidateFastaIds(), metadataEntryStreamAsSequence(), and revisionEntryStreamAsSequence()
- Backend: Remove maxSequencesPerEntry config retrieval from SubmitModel.uploadMetadata()
- Preprocessing: Add max_sequences_per_entry config option to Config
- Preprocessing: Add check_max_sequences_per_entry() validation in process_all()
- Kubernetes: Pass maxSequencesPerEntry from Helm values to preprocessing config
- Tests: Update backend tests to reflect removed validation, add preprocessing tests

Closes #6165

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude finished @anna-parker's task in 2m 33s — Code Review
The overall approach is sound — moving the validation from a batch-level HTTP exception to a per-entry preprocessing error is clearly the right fix for #6165. The Kotlin cleanup is clean, the Helm template handles the optional config correctly, and the new Python logic is straightforward. A few things worth addressing before merging:

- Unused parameter (…)
- Annotation type (…)
- Sequence count timing (…)
- Missing boundary test (…)
- Helm falsy-zero edge case (…)
@theosanderson-agent can you check that integration tests exist and pass showing that if one entry in a batch has too many sequences, it alone will be flagged while the other entries in the batch succeed?
Add test_max_sequences_per_entry_batch_isolation, which verifies that when one entry in a batch exceeds the sequence limit, only that entry receives errors while other entries in the same batch succeed without max-sequence errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I checked, and there were preprocessing unit tests covering …

Existing tests (in …):

Added: …

Note: I couldn't run these tests locally because they require …

There are no end-to-end integration tests (Playwright) for this feature — those would require deploying with a …
Summary
Moves the `maxSequencesPerEntry` validation from the backend submission endpoint to the preprocessing pipeline. This addresses #6165.

**Problem:** When an entry exceeded the allowed number of sequences per entry, the backend threw an `UnprocessableEntityException` (HTTP 422) that failed the entire batch submission — causing valid entries in the same batch to also be rejected. This was particularly problematic for the ingest pipeline, where one malformed entry would block all other valid entries submitted in the same batch.

**Solution:** The validation now happens per-entry during preprocessing. Entries that exceed the limit receive a processing error that only affects that specific entry, while other entries in the batch proceed normally.
Changes
Backend (Kotlin):
- Removed the `maxSequencesPerEntry` count check from `extractAndValidateFastaIds()` in `MetadataEntry.kt` — the backend still validates FASTA ID parsing and duplicate detection, just no longer enforces the count limit
- Removed the `maxSequencesPerEntry` parameter from `metadataEntryStreamAsSequence()` and `revisionEntryStreamAsSequence()`
- Removed `maxSequencesPerEntry` retrieval from `SubmitModel.uploadMetadata()`

Preprocessing (Python):
- Added a `max_sequences_per_entry: int | None = None` config option to the `Config` class
- Added a `check_max_sequences_per_entry()` function in `prepro.py` that creates a `ProcessingAnnotation` error when the limit is exceeded
- The check runs after `process_all()` completes, checking each entry's `unalignedNucleotideSequences` count
- A limit of `None` (unlimited) always passes

Kubernetes (Helm):

- Updated the `loculus-preprocessing-config.yaml` template to pass `maxSequencesPerEntry` from the organism's `submissionDataTypes` schema config to the preprocessing config as `max_sequences_per_entry`

Manual check: submitted a CCHF batch where one entry has 4 sequences and the other only 1, and confirmed both can be processed and only the entry with too many sequences is flagged as having errors.
Test plan
- Updated backend tests (`MetadataEntryTest` — no longer rejects on count)
- Added preprocessing tests (`test_max_sequences_per_entry_*` tests)

🤖 Generated with Claude Code
🚀 Preview: https://feat-move-max-sequences-v.loculus.org