Description
Note: This issue was drafted by an LLM (Claude) based on a Slack conversation and a read of the relevant codebase. It has only been lightly reviewed by a human — please check the details before acting on it.
Background
When ingest runs and a sequence already exists in Loculus with a non-APPROVED_FOR_RELEASE status, the current code in ingest/scripts/compare_hashes.py skips updating it entirely:
    # compare_hashes.py ~line 112
    if status != "APPROVED_FOR_RELEASE":
        update_manager.blocked[status][metadata_id] = corresponding_loculus_accession
        return update_manager

This means that if upstream data is corrected (e.g. author name formatting changed), sequences that are stuck with preprocessing errors will never receive the fix. They remain in their broken state indefinitely, even across preprocessing version bumps.
This was discovered during a PPX rollout: sequences had author names in the old comma-separated format (A. Marcello, B.M. Marycelin, ...) rather than the semicolon-separated format required by current validation. These sequences had been erroring since a formatting change over a year ago and were never updated because ingest skipped them.
The statuses
Sequences can be in four states (SubmissionTypes.kt):
- RECEIVED: submitted, not yet sent to preprocessing
- IN_PROCESSING: currently being preprocessed
- PROCESSED: preprocessing complete; this includes sequences with errors awaiting user correction
- APPROVED_FOR_RELEASE: released
Sequences stuck with errors live in PROCESSED status.
Proposed fix
The backend already has a /submit-edited-data endpoint (SubmissionController.kt) specifically for editing sequences in PROCESSED status — this is what users use to correct their own errors. Ingest should use this same endpoint for sequences where:
- The sequence is in PROCESSED status (i.e. has errors / awaiting release)
- The upstream hash has changed
- The sequence has not been curated
In compare_hashes.py, this means adding an edit path alongside the existing submit/revise/noop/blocked paths:
    if status == "PROCESSED" and not previously_submitted_entry.curated:
        update_manager.edit[metadata_id] = corresponding_loculus_accession
        return update_manager

Sequences in RECEIVED or IN_PROCESSING don't need special handling: they will be reprocessed with the latest data naturally.
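To make the three conditions concrete, here is a minimal, self-contained sketch of how the routing could look. It is not the actual code: `Entry`, the simplified `UpdateManager`, and `route_processed` are hypothetical stand-ins, and the real `compare_hashes.py` structures carry more fields. Curated sequences are sent to `blocked["CURATION_ISSUE"]`, as this issue proposes; unchanged hashes are left to the existing paths.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical stand-in for a previously submitted Loculus entry."""
    accession: str
    status: str
    hash: str
    curated: bool


@dataclass
class UpdateManager:
    """Simplified stand-in; the real class in compare_hashes.py has more fields."""
    edit: dict = field(default_factory=dict)
    blocked: dict = field(default_factory=dict)


def route_processed(manager: UpdateManager, metadata_id: str,
                    new_hash: str, entry: Entry) -> UpdateManager:
    """Route a PROCESSED entry whose upstream hash changed (assumed logic)."""
    if entry.status != "PROCESSED" or entry.hash == new_hash:
        # Other statuses and unchanged hashes are left to the existing paths.
        return manager
    if entry.curated:
        # Curated entries are never edited automatically.
        manager.blocked.setdefault("CURATION_ISSUE", {})[metadata_id] = entry.accession
    else:
        manager.edit[metadata_id] = entry.accession
    return manager


manager = UpdateManager()
route_processed(manager, "id1", "h2", Entry("LOC_1", "PROCESSED", "h1", curated=False))
route_processed(manager, "id2", "h2", Entry("LOC_2", "PROCESSED", "h1", curated=True))
route_processed(manager, "id3", "h1", Entry("LOC_3", "PROCESSED", "h1", curated=False))
```

After these calls only `id1` lands in `edit`, `id2` lands in `blocked["CURATION_ISSUE"]`, and `id3` is untouched because its hash did not change.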
Curated sequence safety
The codebase already detects curation in compare_hashes.py:
    # A sequence is considered curated if it has ever been submitted by anyone
    # other than insdc_ingest_user
    latest["curated"] = {v["submitter"] for v in sorted_versions} != {"insdc_ingest_user"}

Curated sequences in PROCESSED state should remain in the existing blocked["CURATION_ISSUE"] path and trigger a notification, as they do today for APPROVED_FOR_RELEASE curated sequences. This is important to avoid the problem described in #3084.
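The curation check can be tried in isolation. The sketch below wraps the same set comparison in a hypothetical helper, `is_curated`, and assumes each version is a dict with a `submitter` key; the shape of the real version records may differ.

```python
INGEST_USER = "insdc_ingest_user"


def is_curated(versions: list[dict]) -> bool:
    """True if the set of submitters is anything other than exactly {INGEST_USER}."""
    return {v["submitter"] for v in versions} != {INGEST_USER}


# Every version came from ingest: not curated.
ingest_only = [{"submitter": INGEST_USER}, {"submitter": INGEST_USER}]

# One version came from a (hypothetical) curator account: curated.
with_curator = [{"submitter": INGEST_USER}, {"submitter": "curator_a"}]

print(is_curated(ingest_only))
print(is_curated(with_curator))
```

Note that under this expression an empty version list also counts as curated, because the empty set is not equal to `{"insdc_ingest_user"}`. That errs on the side of blocking rather than overwriting.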
Workaround used
Sequences with errors due to the old author format were deleted from staging and production so that ingest would re-ingest them fresh. This wastes accessions and requires manual intervention.
Related
- #3084: Ingest should not treat curator revisions as latest and revise (the curated-sequence caveat in our fix directly addresses this)
- #3085: How to maintain curation changes across ingest revisions (broader context on curated sequence handling)
- Original author formatting change: feat(prepro, ingest, deposition): Enforce author formatting in prepro, map authors accordingly in ingest and deposition #2986
- Ingest skip logic: ingest/scripts/compare_hashes.py, process_hashes() function
- Edit endpoint: backend/.../SubmissionController.kt → submitEditedData()
- Edit status precondition: requires Status.PROCESSED (SubmissionDatabaseService.kt)