Basic LitData optimization of unpaired OAS; dataloader#214

Open
everyday847 wants to merge 8 commits into main from w/oas_litdata

Conversation

@everyday847
Collaborator

Description

Unpaired OAS is a huge, relevant antibody sequence dataset. Depending on the source you read it from, it is distributed either as a large number of gzipped CSVs (each with an extra header line of metadata describing all the sequences in that CSV) or as a collection of Parquet files (where those metadata fields have been expanded into ordinary columns). Either way, you probably want a DataLoader/DataModule that streams it from S3, and you probably want to be able to subset it in useful ways based on that metadata before splitting (e.g., by organism, vaccine, isotype, etc.).
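For readers unfamiliar with the format described above, here is a minimal sketch of what reading one such gzipped CSV might look like. The exact quoting of the metadata line varies between OAS distributions, and `read_oas_csv` is a hypothetical helper, not code from this PR:

```python
import csv
import gzip
import json

def read_oas_csv(path: str) -> tuple[dict, list[dict]]:
    """Read one unpaired-OAS unit file.

    First line: a JSON record of file-level metadata (species, isotype, ...)
    describing every sequence in the file.
    Remaining lines: an ordinary CSV of per-sequence rows.
    """
    with gzip.open(path, "rt") as fh:
        first = fh.readline().strip()
        # Assumption: some distributions CSV-quote the JSON line with
        # doubled inner quotes; undo that before parsing.
        if first.startswith('"'):
            first = first[1:-1].replace('""', '"')
        metadata = json.loads(first)
        rows = list(csv.DictReader(fh))
    return metadata, rows
```

The Parquet variant needs none of this, since the metadata is already materialized as columns.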

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring

logger = logging.getLogger(__name__)

# Metadata columns that can be used for filtering OAS files.
FILTERABLE_METADATA_COLUMNS = ("Species", "Vaccine", "Disease", "Chain", "Isotype")
Collaborator

it would make sense to me to have a general optimization script/module (for any file, not even sequences) and then a specific one for sequences/OAS
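To illustrate what filtering on `FILTERABLE_METADATA_COLUMNS` could look like, here is a hedged sketch; the predicate format (`{column: allowed values}`) is an assumption for illustration, not the PR's actual API:

```python
FILTERABLE_METADATA_COLUMNS = ("Species", "Vaccine", "Disease", "Chain", "Isotype")

def matches_filters(metadata: dict, filters: dict[str, set[str]]) -> bool:
    """Return True if a file's metadata satisfies every requested filter.

    A file is kept only if, for each filtered column, its metadata value is
    one of the allowed values for that column.
    """
    for column, allowed in filters.items():
        if column not in FILTERABLE_METADATA_COLUMNS:
            raise ValueError(f"{column!r} is not a filterable metadata column")
        if metadata.get(column) not in allowed:
            return False
    return True
```

Because OAS metadata describes a whole file rather than individual rows, filtering at this level cheaply prunes entire files before any sequences are read.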


# Hash modulus for split assignment. Using 10 000 gives 0.01% resolution
# on val_fraction, which is more than enough.
_HASH_MODULUS = 10_000
Collaborator

could live in lobster.constants
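For context on how a hash modulus like this is typically used, here is a sketch of deterministic split assignment, assuming each record has a stable string key; `assign_split` is a hypothetical helper, not the PR's implementation:

```python
import hashlib

_HASH_MODULUS = 10_000  # 1/10_000 gives 0.01% resolution on val_fraction

def assign_split(key: str, val_fraction: float) -> str:
    """Deterministically assign a record to "train" or "val".

    The same key always lands in the same split, independent of process,
    seed, or iteration order, because the bucket comes from a stable hash.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % _HASH_MODULUS
    return "val" if bucket < val_fraction * _HASH_MODULUS else "train"
```

A stable hash (rather than Python's salted `hash()`) matters here: it keeps splits reproducible across runs and machines.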

return 0


def _sort_files_by_size(files: list[str], num_workers: int = 1) -> list[str]:
Collaborator

ideally we'd have some of the utils live in lobster.data or somewhere so that cmdline is relatively simple and clean
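A sketch of what a helper with this signature might do, under the assumption that the size ordering exists for worker load balancing and that `num_workers` parallelizes the `stat()` calls (both assumptions, since only the signature is visible in the diff):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def sort_files_by_size(files: list[str], num_workers: int = 1) -> list[str]:
    """Return files ordered largest-first.

    Processing big files first tends to balance load across optimization
    workers. The stat() calls dominate for thousands of files, so the size
    lookups are parallelized; the sort itself is cheap.
    """
    with ThreadPoolExecutor(max_workers=max(1, num_workers)) as pool:
        sizes = list(pool.map(os.path.getsize, files))
    return [f for _, f in sorted(zip(sizes, files), reverse=True)]
```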

return raw_bytes.decode("utf-8")


def _convert_oas_csv(
Collaborator

^ same comment as above about separating OAS specific code

dict[str, str]
Dictionary with a ``"sequence"`` key for each qualifying row.
"""
import pandas as pd
Collaborator

@karinazad karinazad Mar 8, 2026

import would be better top level

# ---------------------------------------------------------------------------


class _SplitCSVConverter:
Collaborator

do these need to be classes?

logger = logging.getLogger(__name__)


class StreamingSequenceDataset(StreamingDataset):
Collaborator

that one just bakes in assumptions about the tokenizers (which it shouldn't) and assumes S3 (not local)

so the two could definitely be combined
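One way to address both points in the comment above is to keep the dataset tokenizer-free and source-agnostic. The sketch below is hypothetical (it wraps any indexable source of `{"sequence": ...}` records rather than subclassing litdata's `StreamingDataset`), but it illustrates the separation: the dataset yields raw strings, and tokenization is injected later via the DataLoader's `collate_fn`:

```python
class SequenceDataset:
    """Yield raw sequence strings from any indexable record source.

    Because no tokenizer is baked in and the source is an arbitrary
    indexable (e.g. a litdata StreamingDataset pointed at a local dir or
    an s3:// URI), the same class serves any tokenizer and any backend.
    """

    def __init__(self, source):
        self.source = source  # any sequence-like of dict records

    def __len__(self) -> int:
        return len(self.source)

    def __getitem__(self, index: int) -> str:
        return self.source[index]["sequence"]
```

With this shape, swapping tokenizers is a change to the collate function only; the dataset and its streaming backend never need to know.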
