Skip to content

feat(prepro): refactor prepro to split up backend internal metadata from customizable metadata fields#6071

Open
anna-parker wants to merge 13 commits intomainfrom
refactor_prepro_metadata
Open

feat(prepro): refactor prepro to split up backend internal metadata from customizable metadata fields#6071
anna-parker wants to merge 13 commits intomainfrom
refactor_prepro_metadata

Conversation

@anna-parker
Copy link
Contributor

@anna-parker anna-parker commented Mar 3, 2026

When reviewing #6066 I realized the prepro doesn't properly distinguish between customizable metadata (received in a dictionary) and other internal metadata fields such as submissionId, submitter, and groupId. We inconsistently added these fields to the inputMetadata or the function args. I decided to make this distinction more clear in the code.

Input from the extract-unprocessed-data endpoint:

{
  "accession": "LOC_000S01D",
  "version": 1,
  "data": {
    "metadata": {
      "date": "2020-01-01",
      "country": "Germany"
    },
    "unalignedNucleotideSequences": {
      "segment1": "ACTG",
      "segment2": "GTCA"
    },
    "files": "{\"raw_reads\": [{\"fileId\": \"f1le-uuId-asdf\", \"name\": \"myfile.fastaq\"]}"
  },
  "submissionId": "string",
  "submitter": "string",
  "groupId": 42,
  "submittedAt": 1720304713
}

We now use pydantic to validate the response and convert to a type BackendEntry and then restructure for easier use in the code as:

dataclass
class InternalMetadata:
    accession_version: AccessionVersion  # {accession}.{version}
    submitter: str
    group_id: int
    submitted_at: int  # timestamp
    submission_id: str

@dataclass
class UnprocessedData:
    metadata: InputMetadata
    internal_metadata: InternalMetadata
    unalignedNucleotideSequences: dict[SequenceName, NucleotideSequence | None]

I also improved the typing of the processing function args.

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

…rom otehr metadata fields to make code cleaner and clearer
@claude claude bot added the preprocessing Issues related to the preprocessing component label Mar 3, 2026
@claude
Copy link
Contributor

claude bot commented Mar 3, 2026

This PR may be related to: #2283 (Make Loculus internal metadata available to common preprocessing functions) — the refactoring of backend internal metadata fields in prepro may address or contribute to that issue.

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Mar 3, 2026
@anna-parker anna-parker marked this pull request as ready for review March 3, 2026 16:37
@claude
Copy link
Contributor

claude bot commented Mar 3, 2026

Claude finished @anna-parker's task in 3m 28s —— View job


Code Review

  • Read changed files and understand the refactoring
  • Review datatypes changes
  • Review processing_functions changes
  • Review prepro.py changes
  • Review backend.py and nextclade.py changes
  • Review test changes
  • Post final review

Good refactoring overall. Moving backend internal metadata into a dedicated InternalMetadata dataclass is a clear improvement — it removes the awkward pattern of storing internal fields (group_id, submitter, submittedAt) inside the generic inputMetadata dict and passing them as implicit args to processing functions.

A few issues found (see inline comments for details):

  1. Mutation of call_args.args in parse_date_into_range (line 312) — call_args.args is reassigned directly on the input dataclass. This is safe today since a fresh ProcessingFunctionCallArgs is created per call, but it's a confusing side effect. Use a local variable instead.

  2. group_id is typed int but still has a None guard in prepro.py:590 — the check is now unreachable and should either be removed or the type changed to int | None.

  3. accession_version is duplicated — it's stored in both UnprocessedEntry.accessionVersion and UnprocessedData.internal_metadata.accession_version. These could drift out of sync. Worth considering whether UnprocessedEntry.accessionVersion can be removed in favour of always reading from internal_metadata.

  4. submission_id is stored but never read — set in parse_ndjson but not consumed anywhere. Presumably for issue Make Loculus internal metadata available to common preprocessing functions #2283; worth a comment.

  5. Stale comment in concatenate (line 689) refers to _call_preprocessing_function() providing ACCESSION_VERSION, but it now comes via internal_metadata.

  6. Missing space in log message at processing_functions.py:749–750 — two adjacent string literals produce "...in input_dataThis is probably...".

@anna-parker anna-parker changed the title feat(prepro): refactor prepro to split up backend internal metadata f… feat(prepro): refactor prepro to split up backend internal metadata from customizable metadata fields Mar 3, 2026
@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Mar 10, 2026
@theosanderson
Copy link
Member

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. 👍

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

preprocessing Issues related to the preprocessing component

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants