Updated PTM features - Improved TF Data pipeline - Prosit Model refactoring - Feature Extraction fixes #86
…ument values in datasets (#82)
* dataset guide and minor doc additions
* changed default alphabet value to None to trigger learning the tokens and be more explicit
* comments

NOTE: this PR breaks previous usage if the user implicitly assumed the alphabet to be ALPHABET_UNMOD. We nevertheless chose to move to a more explicit approach for better transparency and reproducibility.

* cache - shuffle - batch - prefetch order
* fixes related to termini - hf label column future warning
* pr comments + tests

…mproved tests (#84)
* prosit model changes and tests
* fix in feature extraction + tests
* refined run scripts for intensity
* refactored tests
Pull request overview
This PR refactors the Prosit intensity models (TensorFlow + PyTorch) to support configurable PTM and metadata branches, improves the TensorFlow dataset pipeline behavior, fixes lookup-based feature extraction for overlength sequences, and expands tests/docs around datasets and processors.
Changes:
- Refactor Prosit intensity predictors (TF + Torch) with explicit configuration/validation for PTM features, metadata, and optional instrument embeddings.
- Update the PeptideDatasetTF pipeline to control shuffling/batching/prefetching outside of to_tf_dataset, and improve padding/feature-extraction lengths when termini are included.
- Add/expand test coverage for processors, datasets, and feature extractors; add a datasets guide to the docs.
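
The overview mentions a fix to lookup-based feature extraction for overlength sequences, which the file summary describes as capping the lookup at max_length before padding. A minimal sketch of that idea follows. The function name `lookup_features`, the zero-vector fallback for unknown tokens, and the toy lookup table are assumptions made for illustration; this is not the code in feature_extractors.py.

```python
from typing import Dict, List


def lookup_features(
    tokens: List[str],
    table: Dict[str, List[float]],
    max_length: int,
) -> List[List[float]]:
    """Hypothetical helper: look up per-token features, then pad to max_length."""
    # Feature width is taken from any entry of the (non-empty) lookup table.
    width = len(next(iter(table.values())))
    zero = [0.0] * width

    # Cap first, so an overlength sequence can never yield more than max_length rows.
    capped = tokens[:max_length]

    # Unknown tokens fall back to a zero vector (an assumption of this sketch).
    rows = [list(table.get(tok, zero)) for tok in capped]

    # Pad afterwards, so short sequences also end up with exactly max_length rows.
    rows += [list(zero) for _ in range(max_length - len(rows))]
    return rows


toy_table = {"A": [1.0, 0.0], "C": [0.0, 2.0]}
overlength = lookup_features(list("ACAC"), toy_table, max_length=3)
short = lookup_features(["A"], toy_table, max_length=3)
```

With this ordering, both the overlength and the short input produce exactly `max_length` rows, which is the property the new LookupFeatureExtractor padding/truncation tests appear to target.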
Reviewed changes
Copilot reviewed 27 out of 31 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_torch_models.py | Updates TF↔Torch equivalence test to pass explicit seq_length. |
| tests/test_torch_dataset.py | Moves test data wiring to fixtures; updates expected batch shapes and termini behavior. |
| tests/test_processors.py | Adds comprehensive tests for encoding, PTM removal, function processor, and edge cases. |
| tests/test_models.py | Aligns expected error behavior for missing PTM/metadata inputs; adds “no metadata” case. |
| tests/test_feature_extractors.py | Adds tests for LookupFeatureExtractor padding/truncation behavior. |
| tests/test_datasets.py | Switches to fixtures for assets/data; adds TF dataset label-shape test coverage. |
| tests/conftest.py | Replaces global pytest variables with explicit fixtures; adds alphabets and helper fixtures. |
| src/dlomix/models/prosit_torch.py | Major refactor of Torch Prosit intensity predictor (configurable inputs, PTM/meta branches). |
| src/dlomix/models/prosit.py | Major refactor of TF Prosit intensity predictor (configurable inputs, PTM/meta branches). |
| src/dlomix/data/retention_time.py | Changes default alphabet behavior to allow alphabet learning (alphabet=None). |
| src/dlomix/data/processing/processors.py | Updates docs example output for parsing to use []- (docstring change). |
| src/dlomix/data/processing/pickled_feature_dicts/saved_loss_atoms.pkl | Adds/updates serialized lookup data for PTM-related feature extraction. |
| src/dlomix/data/processing/pickled_feature_dicts/saved_gained_atoms.pkl | Adds/updates serialized lookup data for PTM-related feature extraction. |
| src/dlomix/data/processing/pickled_feature_dicts/mz_diff.pkl | Adds/updates serialized lookup data for PTM-related feature extraction. |
| src/dlomix/data/processing/feature_extractors.py | Fixes lookup feature extraction to cap lookup at max_length before padding. |
| src/dlomix/data/ion_mobility.py | Changes default alphabet behavior to allow alphabet learning (alphabet=None). |
| src/dlomix/data/fragment_ion_intensity.py | Changes default alphabet behavior to allow alphabet learning (alphabet=None). |
| src/dlomix/data/detectability.py | Changes default alphabet behavior to allow alphabet learning (alphabet=None). |
| src/dlomix/data/dataset_config.py | Adds post-init validation and normalizes label_column to list form. |
| src/dlomix/data/dataset.py | Adjusts padding/feature lengths with termini; revises TF dataset shuffle/batch/prefetch flow. |
| src/dlomix/data/charge_state.py | Changes default alphabet behavior to allow alphabet learning (alphabet=None). |
| src/dlomix/_metadata.py | Bumps version to 0.2.4 and updates copyright year. |
| run_scripts/run_prosit_intensity_torch.py | Updates example script to use learned alphabet from dataset. |
| run_scripts/run_prosit_intensity_ptms_torch.py | Updates PTM Torch example script (seq len, features, termini, learned alphabet). |
| run_scripts/run_prosit_intensity_ptms.py | Updates PTM TF example script to use dataset-learned alphabet and meta branch. |
| run_scripts/run_prosit_intensity.py | Updates TF training example to use meta branch, termini, learned alphabet, and .keras checkpoint path. |
| docs/notes/dataset_guide.rst | Adds a comprehensive datasets guide (new). |
| docs/index.rst | Adds the dataset guide into the docs structure/navigation. |
| docs/dlomix.rst | Updates package docs TOC to include callbacks; removes some module listings. |
| docs/dlomix.callbacks.rst | Adds Sphinx page for dlomix.callbacks (new). |
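
The dataset_config.py entry above mentions post-init validation and normalizing label_column to list form. As a rough sketch of that pattern, assuming the config is a dataclass: the class name `ExampleDatasetConfig`, the `sequence_column` and `max_seq_len` fields, and the specific checks are hypothetical and only stand in for whatever the real config validates.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ExampleDatasetConfig:
    """Hypothetical config; only label_column is named in the PR summary."""

    sequence_column: str
    label_column: Union[str, List[str]]
    max_seq_len: int = 30

    def __post_init__(self) -> None:
        # Normalize: a single column name becomes a one-element list,
        # any other iterable of names is copied into a list.
        if isinstance(self.label_column, str):
            self.label_column = [self.label_column]
        else:
            self.label_column = list(self.label_column)

        # Validate after normalization, so every check sees a list.
        if not self.label_column:
            raise ValueError("label_column must name at least one column")
        if self.max_seq_len <= 0:
            raise ValueError("max_seq_len must be a positive integer")


single = ExampleDatasetConfig(sequence_column="sequence", label_column="irt")
multi = ExampleDatasetConfig(sequence_column="sequence", label_column=("a", "b"))
```

Normalizing once at construction time means downstream code can always iterate over `label_column` without branching on its type.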