
Refactor: Simplify storage format for predictions and target sequences #138

@JemmaLDaniel

Description

Summary:

Prediction and target sequences in preds_and_fdr_metrics are currently stored as stringified Python lists (e.g., '[C, P, Q, ...]').

This adds friction for downstream analysis: after loading the CSV output, each value must be post-processed (e.g., with ast.literal_eval) to parse the string back into a usable list of residue tokens.
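
To illustrate the friction, here is a minimal sketch of the round trip. It assumes the lists are written with str(list), so the elements are quoted in the file; the column names and values are made up for illustration and are not the real schema.

```python
import ast
import csv
import io

# Hypothetical row: column names and tokens are illustrative assumptions.
current_rows = [
    {"predictions": str(["C", "P", "Q"]), "targets": str(["C", "P", "K"])}
]

# Write the row to an in-memory CSV, as the pipeline would write to disk.
current_buffer = io.StringIO()
writer = csv.DictWriter(current_buffer, fieldnames=["predictions", "targets"])
writer.writeheader()
writer.writerows(current_rows)

# Read it back the way a downstream user would.
current_buffer.seek(0)
current_loaded = list(csv.DictReader(current_buffer))

raw_prediction = current_loaded[0]["predictions"]  # a str, not a list
parsed_tokens = ast.literal_eval(raw_prediction)   # extra parsing step
```

Every consumer of the CSV has to repeat the ast.literal_eval step for each sequence column before the values are usable.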

Proposed Solution

Change the default storage format to a single concatenated string (e.g., 'CPQ...'). The value can then be read from the CSV and used directly as a standard sequence string, with no parsing step.
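
A rough sketch of what the proposed format could look like on write and read. The helper name join_tokens and the column name are hypothetical, chosen only for this example.

```python
import csv
import io


def join_tokens(tokens):
    """Hypothetical helper: concatenate residue tokens into one string."""
    return "".join(tokens)


# Hypothetical row using the proposed concatenated format.
proposed_rows = [{"predictions": join_tokens(["C", "P", "Q"])}]

proposed_buffer = io.StringIO()
writer = csv.DictWriter(proposed_buffer, fieldnames=["predictions"])
writer.writeheader()
writer.writerows(proposed_rows)

proposed_buffer.seek(0)
proposed_loaded = list(csv.DictReader(proposed_buffer))

# The value is usable as a sequence string with no parsing.
sequence = proposed_loaded[0]["predictions"]
```

One thing to keep in mind: if any token is longer than one character (for example a modified residue), plain concatenation only stays reversible when the token boundaries can be recovered from the string itself.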

Optional Extension

Consider adding a column that formats predictions so they are compatible with InstaNovo's _split_peptide function, allowing the tokenised residues to be recovered.
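
As a sketch of the round trip this extension is after, the snippet below re-tokenises a concatenated string. The function split_tokens is a stand-in written for this example, not InstaNovo's _split_peptide, and it assumes each token is one uppercase letter optionally followed by a bracketed modification such as [UNIMOD:4]. The real token format may differ.

```python
import re

# Assumed token shape: one uppercase residue letter, optionally followed
# by a bracketed modification. This is an assumption for illustration.
TOKEN_PATTERN = re.compile(r"[A-Z](?:\[[^\]]*\])?")


def split_tokens(peptide):
    """Stand-in tokeniser (not InstaNovo's _split_peptide)."""
    return TOKEN_PATTERN.findall(peptide)


original_tokens = ["C[UNIMOD:4]", "P", "Q"]
stored = "".join(original_tokens)     # what the CSV column would hold
recovered_tokens = split_tokens(stored)
```

If the stored string follows a format the tokeniser understands, the token list can be rebuilt on demand and does not need to be stored in the CSV.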

Description & Purpose:

No response

Additional Notes:

No response

Metadata

Assignees

No one assigned

    Labels

    refactor: Code cleanup and architectural improvements

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
