Ingest should edit sequences in PROCESSED (erroring) state when upstream data changes #6072

@theosanderson-agent

Note: This issue was drafted by an LLM (Claude) based on a Slack conversation and a read of the relevant codebase. It has only been lightly reviewed by a human — please check the details before acting on it.

Background

When ingest runs and a sequence already exists in Loculus with any status other than APPROVED_FOR_RELEASE, the current code in ingest/scripts/compare_hashes.py records it as blocked and skips the update entirely:

# compare_hashes.py ~line 112
if status != "APPROVED_FOR_RELEASE":
    update_manager.blocked[status][metadata_id] = corresponding_loculus_accession
    return update_manager

This means that if upstream data is corrected (e.g. author name formatting changed), sequences that are stuck with preprocessing errors will never receive the fix — they remain in their broken state indefinitely, even across preprocessing version bumps.

This was discovered during a PPX rollout: sequences had author names in the old comma-separated format (A. Marcello, B.M. Marycelin, ...) rather than the semicolon-separated format required by current validation. These sequences had been erroring since a formatting change over a year ago and were never updated because ingest skipped them.
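
A minimal sketch of why such an upstream correction is detectable at all, assuming ingest compares a deterministic hash of each metadata record. The metadata_hash helper, the SHA-256 choice, and the semicolon-separated author string are illustrative assumptions, not the actual scheme or values used by ingest:

```python
import hashlib
import json


def metadata_hash(record: dict) -> str:
    """Hash a metadata record deterministically (hypothetical scheme)."""
    # sort_keys makes the serialisation independent of dict insertion order
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Old comma-separated author string as quoted in this issue (truncated there)
old_record = {"authors": "A. Marcello, B.M. Marycelin"}
# Illustrative semicolon-separated counterpart (assumed, not from upstream)
new_record = {"authors": "A. Marcello; B.M. Marycelin"}

# A changed hash is the signal that upstream data differs from what is stored
hash_changed = metadata_hash(old_record) != metadata_hash(new_record)
print(hash_changed)
```

The change is visible to ingest; the problem described here is only that the status check discards it before any update is attempted.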

The statuses

Sequences can be in four states (SubmissionTypes.kt):

  • RECEIVED — submitted, not yet sent to preprocessing
  • IN_PROCESSING — currently being preprocessed
  • PROCESSED — preprocessing complete; this includes sequences with errors awaiting user correction
  • APPROVED_FOR_RELEASE — released

Sequences stuck with errors live in PROCESSED status.

Proposed fix

The backend already has a /submit-edited-data endpoint (SubmissionController.kt) specifically for editing sequences in PROCESSED status — this is what users use to correct their own errors. Ingest should use this same endpoint for sequences where:

  1. The sequence is in PROCESSED status (i.e. it has errors or is awaiting release)
  2. The upstream hash has changed
  3. The sequence has not been curated

In compare_hashes.py, this means adding an edit path alongside the existing submit/revise/noop/blocked paths:

if status == "PROCESSED" and not previously_submitted_entry.curated:
    update_manager.edit[metadata_id] = corresponding_loculus_accession
    return update_manager

Sequences in RECEIVED or IN_PROCESSING don't need special handling — they will be reprocessed with the latest data naturally.
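
The resulting decision logic for an entry that already exists in Loculus could be sketched as follows. This is a self-contained illustration, not the real compare_hashes.py: the UpdateManager dataclass, the route_existing_entry function, the order of the checks, and the example IDs are all hypothetical, and the submit path for brand-new entries is left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UpdateManager:
    """Hypothetical stand-in for the update manager in compare_hashes.py."""

    revise: dict = field(default_factory=dict)
    edit: dict = field(default_factory=dict)  # proposed new path
    noop: dict = field(default_factory=dict)
    blocked: defaultdict = field(default_factory=lambda: defaultdict(dict))


def route_existing_entry(manager, metadata_id, accession, status, hash_changed, curated):
    """Decide what ingest does with an entry that already exists in Loculus."""
    if not hash_changed:
        # Upstream data is unchanged: nothing to do
        manager.noop[metadata_id] = accession
    elif status not in ("PROCESSED", "APPROVED_FOR_RELEASE"):
        # RECEIVED / IN_PROCESSING: keep today's behaviour and skip
        manager.blocked[status][metadata_id] = accession
    elif curated:
        # Never overwrite curated data automatically
        manager.blocked["CURATION_ISSUE"][metadata_id] = accession
    elif status == "PROCESSED":
        # Proposed: edit in place instead of skipping
        manager.edit[metadata_id] = accession
    else:
        # APPROVED_FOR_RELEASE and not curated: existing revise path
        manager.revise[metadata_id] = accession
    return manager


manager = UpdateManager()
route_existing_entry(manager, "meta_1", "LOC_1", "PROCESSED", True, False)
route_existing_entry(manager, "meta_2", "LOC_2", "PROCESSED", True, True)
route_existing_entry(manager, "meta_3", "LOC_3", "IN_PROCESSING", True, False)
route_existing_entry(manager, "meta_4", "LOC_4", "APPROVED_FOR_RELEASE", True, False)
print(manager.edit, dict(manager.blocked), manager.revise)
```

Only meta_1 ends up in the edit path; the curated and in-flight entries stay blocked, and the released entry is revised as before.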

Curated sequence safety

The codebase already detects curation in compare_hashes.py:

# A sequence is considered curated if it has ever been submitted by anyone
# other than insdc_ingest_user
latest["curated"] = {v["submitter"] for v in sorted_versions} != {"insdc_ingest_user"}

Curated sequences in PROCESSED state should remain in the existing blocked["CURATION_ISSUE"] path and trigger a notification, as they do today for APPROVED_FOR_RELEASE curated sequences. This is important to avoid the problem described in #3084.
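
To make the curation check concrete, here is a small runnable sketch built around the expression quoted above. The is_curated wrapper, the version records, and the curator_a submitter are made up for illustration; only the set comparison itself comes from compare_hashes.py.

```python
INGEST_USER = "insdc_ingest_user"


def is_curated(sorted_versions):
    """Return True if any version was submitted by someone other than ingest."""
    return {v["submitter"] for v in sorted_versions} != {INGEST_USER}


# All versions submitted by ingest: safe to edit automatically
ingest_only = [
    {"version": 1, "submitter": "insdc_ingest_user"},
    {"version": 2, "submitter": "insdc_ingest_user"},
]

# One version submitted by a human curator (hypothetical user name)
with_curator = [
    {"version": 1, "submitter": "insdc_ingest_user"},
    {"version": 2, "submitter": "curator_a"},
]

print(is_curated(ingest_only))
print(is_curated(with_curator))

# Edge case: an empty version list also compares unequal, so it counts as curated
print(is_curated([]))
```

Because the comparison is over the whole submitter set, a single curated version anywhere in the history is enough to keep the sequence out of the automatic edit path.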

Workaround used

Sequences with errors due to the old author format were deleted from staging and production so that ingest would re-ingest them fresh. This wastes accessions and requires manual intervention.
