Basic LitData optimization of unpaired OAS; dataloader #214
everyday847 wants to merge 8 commits into main
logger = logging.getLogger(__name__)

# Metadata columns that can be used for filtering OAS files.
FILTERABLE_METADATA_COLUMNS = ("Species", "Vaccine", "Disease", "Chain", "Isotype")
it would make sense to me to have a general optimization script/module (for any file, not just sequences) and then a specific one for sequences/OAS

# Hash modulus for split assignment. Using 10 000 gives 0.01% resolution
# on val_fraction, which is more than enough.
_HASH_MODULUS = 10_000
could live in lobster.constants
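
For context, a minimal sketch of how a hash modulus like this can drive deterministic split assignment. The function name `assign_split`, the choice of SHA-256, and the use of a string key are all assumptions for illustration; the PR's actual implementation is not shown in this hunk.

```python
import hashlib

HASH_MODULUS = 10_000  # same value as _HASH_MODULUS in the diff above


def assign_split(key: str, val_fraction: float) -> str:
    """Deterministically map a key (e.g. a file name) to 'train' or 'val'."""
    # hashlib digests are stable across processes, unlike built-in hash() on str.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % HASH_MODULUS
    # Buckets below the threshold go to validation; 10 000 buckets means
    # val_fraction is honoured in steps of 0.01%.
    threshold = round(val_fraction * HASH_MODULUS)
    return "val" if bucket < threshold else "train"
```

Because the assignment depends only on the key, rerunning the optimization or adding new files does not move existing items between splits.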
    return 0


def _sort_files_by_size(files: list[str], num_workers: int = 1) -> list[str]:
ideally we'd have some of the utils live in lobster.data or somewhere so that cmdline is relatively simple and clean
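
As an illustration of the kind of helper that could move out of the command-line module, here is a sketch with the same signature as `_sort_files_by_size`. The body is an assumption: it handles local paths only via `os.path.getsize` and sorts largest first, whereas the PR's version may query S3 and may order differently.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def sort_files_by_size_sketch(files: list[str], num_workers: int = 1) -> list[str]:
    """Return local paths ordered largest first (the ordering is an assumption)."""
    # Stat calls are I/O bound, so threads are enough to overlap them.
    with ThreadPoolExecutor(max_workers=max(1, num_workers)) as pool:
        sizes = list(pool.map(os.path.getsize, files))
    # sorted() is stable, so equal-sized files keep their input order.
    ranked = sorted(zip(files, sizes), key=lambda pair: pair[1], reverse=True)
    return [path for path, _ in ranked]
```

A helper like this has no dependency on argument parsing, so it can be imported and unit-tested on its own.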
    return raw_bytes.decode("utf-8")


def _convert_oas_csv(
^ same comment as above about separating OAS-specific code
    dict[str, str]
        Dictionary with a ``"sequence"`` key for each qualifying row.
    """
    import pandas as pd
import would be better top level
# ---------------------------------------------------------------------------


class _SplitCSVConverter:
do these need to be classes?
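
One possible alternative, sketched under assumptions: if a converter class only stores configuration and implements `__call__`, a module-level function bound with `functools.partial` can play the same role. The function `convert_row`, its `split` parameter, and the row layout are hypothetical and are not taken from the PR.

```python
from functools import partial
from typing import Optional


def convert_row(row: dict[str, str], split: str) -> Optional[dict[str, str]]:
    """Hypothetical converter: keep a row only if it belongs to `split`."""
    if row.get("split") != split:
        return None
    return {"sequence": row["sequence"]}


# Binding the configuration with partial replaces a class whose only job
# is to hold `split` and forward calls.
train_converter = partial(convert_row, split="train")
val_converter = partial(convert_row, split="val")
```

Whether this is enough depends on what the classes in the PR actually hold; state that changes between calls would still favour a class.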
logger = logging.getLogger(__name__)


class StreamingSequenceDataset(StreamingDataset):
a lot of similarity with https://github.com/prescient-design/lobster/blob/main/src/lobster/datasets/s3_datasets/base.py
we can refactor into one
that one just makes assumptions about the tokenizers (which it shouldn't) and assumes s3 (not local)
so they could definitely be combined
Description
Unpaired OAS is a huge, relevant antibody sequence dataset. Depending on where you read it from, it might be a large number of gzipped CSVs (each with an extra header line of metadata describing all the sequences in that CSV) or a set of parquet files (where that metadata has been expanded into new columns). Either way, you probably want a DataLoader/DataModule that streams it from S3, and you probably want to be able to subset it in useful ways based on that metadata before splitting (e.g., by organism, vaccine, or isotype).
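
To make the gzipped-CSV layout concrete, here is a minimal sketch of reading one such file and filtering on its file-level metadata. It assumes the extra header line is a single CSV field containing a JSON object, and it uses a made-up `sequence` column; the function name `read_oas_csv_sketch` and the `filters` argument are hypothetical and do not reflect this PR's API.

```python
import csv
import gzip
import io
import json


def read_oas_csv_sketch(raw_bytes: bytes, filters: dict[str, str]) -> list[dict[str, str]]:
    """Parse one gzipped OAS-style CSV; return rows only if metadata matches."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    reader = csv.reader(io.StringIO(text))
    # Assumption: line 1 is a single CSV field holding a JSON metadata object.
    metadata = json.loads(next(reader)[0])
    # File-level filter: skip the whole file when any requested value differs.
    if any(metadata.get(col) != want for col, want in filters.items()):
        return []
    # Line 2 is the real column header; the remaining lines are data rows.
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]


# Build a tiny in-memory file to exercise the sketch.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([json.dumps({"Species": "human", "Chain": "Heavy"})])
writer.writerow(["sequence"])
writer.writerow(["EVQLVESGG"])
example = gzip.compress(buf.getvalue().encode("utf-8"))

print(read_oas_csv_sketch(example, {"Species": "human"}))
print(read_oas_csv_sketch(example, {"Species": "mouse"}))
```

Since the metadata applies to every sequence in a file, filtering happens once per file rather than once per row, which is why it is cheap to do before splitting.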
Type of Change