Basic LitData optimization of unpaired OAS; dataloader #214

Open

everyday847 wants to merge 8 commits into main

Collaborator

everyday847 commented Feb 23, 2026

Description

Unpaired OAS (Observed Antibody Space) is a very large and relevant antibody sequence dataset. Depending on where you read it from, it is either a large number of gzipped CSVs (each with an extra header line of metadata describing all the sequences in that CSV) or a set of parquet files (where that metadata has been expanded into new columns). Either way, you probably want a DataLoader/DataModule that streams it from s3, and you probably want to subset it by that metadata before splitting (e.g., by organism, vaccine, isotype, etc.).
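As a rough illustration of the CSV layout described above, here is a minimal sketch that reads one gzipped file, parses the extra metadata line, and applies file-level filters before emitting sequences. It assumes the metadata line is a single CSV-quoted JSON object and that the sequence lives in a column named `sequence_alignment_aa`. Both are assumptions about the file format. `read_oas_csv` is a hypothetical helper, not the function in this PR.

```python
from __future__ import annotations

import csv
import gzip
import io
import json

# Same tuple as in the PR; used here only to validate filter keys.
FILTERABLE_METADATA_COLUMNS = ("Species", "Vaccine", "Disease", "Chain", "Isotype")


def read_oas_csv(
    raw_gz: bytes,
    filters: dict[str, str] | None = None,
    sequence_column: str = "sequence_alignment_aa",  # assumed column name
) -> list[dict[str, str]]:
    """Return {"sequence": ...} dicts for one file, or [] if the file is filtered out."""
    filters = filters or {}
    unknown = set(filters) - set(FILTERABLE_METADATA_COLUMNS)
    if unknown:
        raise ValueError(f"Cannot filter on: {sorted(unknown)}")

    lines = io.StringIO(gzip.decompress(raw_gz).decode("utf-8"))

    # Assumption: line 1 is a JSON object stored as one quoted CSV field.
    metadata_fields = next(csv.reader([lines.readline()]))
    metadata = json.loads(",".join(metadata_fields))

    # Metadata applies to every row, so a mismatch skips the whole file.
    if any(metadata.get(key) != wanted for key, wanted in filters.items()):
        return []

    # The remaining lines are an ordinary CSV with its own header row.
    return [
        {"sequence": row[sequence_column]}
        for row in csv.DictReader(lines)
        if row.get(sequence_column)
    ]


# Build a tiny in-memory file in the assumed layout.
_buf = io.StringIO()
_writer = csv.writer(_buf, lineterminator="\n")
_writer.writerow([json.dumps({"Species": "human", "Chain": "Heavy"})])
_writer.writerow(["sequence_alignment_aa", "v_call"])
_writer.writerow(["EVQLVESGG", "IGHV3-23"])
_writer.writerow(["QVQLQESGP", "IGHV4-34"])
sample = gzip.compress(_buf.getvalue().encode("utf-8"))

human_rows = read_oas_csv(sample, {"Species": "human"})
mouse_rows = read_oas_csv(sample, {"Species": "mouse"})
```

Because the metadata describes the whole file, filtering happens once per file rather than once per row, which keeps the per-row work to a dictionary lookup.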

everyday847 added 6 commits

February 21, 2026 17:52


          Streaming .csv.gz unpaired OAS sequences from s3

37e8c4e


          let's not make bizarre uv.lock changes

f0ceefc


          parquet

8a37f09


          iid per sequence

43c42d1


          progress

ce21056


          split without loading everything into memory

3f07e21

karinazad reviewed

src/lobster/cmdline/optimize_sequences.py Outdated

+              logger = logging.getLogger(__name__)
+              # Metadata columns that can be used for filtering OAS files.
+              FILTERABLE_METADATA_COLUMNS = ("Species", "Vaccine", "Disease", "Chain", "Isotype")

Collaborator

karinazad Mar 8, 2026

it would make sense to me to have a general optimization script/module (for any file, not even sequences) and then a specific one for sequences/OAS

karinazad reviewed

src/lobster/cmdline/optimize_sequences.py Outdated

+              # Hash modulus for split assignment.  Using 10 000 gives 0.01% resolution
+              # on val_fraction, which is more than enough.
+              _HASH_MODULUS = 10_000

Collaborator

karinazad Mar 8, 2026

could live in lobster.constants
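For context on the hash modulus quoted above, a minimal sketch of how a per-sequence hash can assign splits without holding the dataset in memory. The hash function (SHA-256) and the helper name `assign_split` are assumptions for illustration; the PR's actual implementation is not shown in this thread.

```python
import hashlib

# Same value as in the quoted diff: 1 / 10_000 = 0.01% resolution.
_HASH_MODULUS = 10_000


def assign_split(sequence: str, val_fraction: float = 0.01) -> str:
    """Deterministically map a sequence to "train" or "val" (hypothetical helper)."""
    # A stable hash is needed; Python's built-in hash() is salted per process.
    digest = hashlib.sha256(sequence.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % _HASH_MODULUS
    threshold = round(val_fraction * _HASH_MODULUS)
    return "val" if bucket < threshold else "train"
```

Each sequence is assigned independently and identically across workers and runs, so files can be streamed one row at a time and duplicate sequences always land in the same split.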

karinazad reviewed

src/lobster/cmdline/optimize_sequences.py Outdated

		return 0


		def _sort_files_by_size(files: list[str], num_workers: int = 1) -> list[str]:

Collaborator

karinazad Mar 8, 2026

ideally we'd have some of the utils live in lobster.data or somewhere so that cmdline is relatively simple and clean

karinazad reviewed

src/lobster/cmdline/optimize_sequences.py Outdated

		return raw_bytes.decode("utf-8")


		def _convert_oas_csv(

Collaborator

karinazad Mar 8, 2026

^ same comment as above about separating OAS specific code

karinazad reviewed

src/lobster/cmdline/optimize_sequences.py Outdated

+                  dict[str, str]
+                      Dictionary with a ``"sequence"`` key for each qualifying row.
+                  """
+                  import pandas as pd

Collaborator

karinazad Mar 8, 2026 (edited)

this import would be better at the top level

karinazad reviewed

src/lobster/cmdline/optimize_sequences.py Outdated

		# ---------------------------------------------------------------------------


		class _SplitCSVConverter:

Collaborator

karinazad Mar 8, 2026

do these need to be classes?

karinazad reviewed

src/lobster/datasets/_streaming_sequence_dataset.py Outdated

		logger = logging.getLogger(__name__)


		class StreamingSequenceDataset(StreamingDataset):

Collaborator

karinazad Mar 8, 2026

a lot of similarity with https://github.com/prescient-design/lobster/blob/main/src/lobster/datasets/s3_datasets/base.py

we can refactor into one

Collaborator

karinazad Mar 8, 2026

that one just makes assumptions about the tokenizers (which it shouldn't) and assumes s3 (not local)

so they could definitely be combined

watkina6 added 2 commits

March 9, 2026 08:03


          Merge remote-tracking branch 'origin/main' into w/oas_litdata

f92dd31


          Address reviews via refactors

34eb42c
