Updated PTM featured - Improved TF Data pipeline - Prosit Model refactoring - Feature Extraction fixes by omsh · Pull Request #86 · wilhelm-lab/dlomix

omsh · 2026-02-09T08:49:13Z

This PR refactors the Prosit intensity models (TensorFlow + PyTorch) to support configurable PTM and metadata branches, improves the TensorFlow dataset pipeline behavior, fixes lookup-based feature extraction for overlength sequences, and expands tests/docs around datasets and processors.

Changes:

Refactor Prosit intensity predictors (TF + Torch) with explicit configuration/validation for PTM features, metadata, and optional instrument embeddings.
Update PeptideDataset TF pipeline to control shuffling/batching/prefetching outside of to_tf_dataset, and improve padding/feature-extraction lengths when termini are included.
Add/expand test coverage for processors, datasets, and feature extractors; add a datasets guide to the docs.

…ument values in datasets (#82)
* dataset guide and minor doc additions
* changed default alphabet value to None to trigger learning the tokens and be more explicit
* comments

NOTE: this PR breaks previous usage if the user implicitly assumed the alphabet to be ALPHABET_UNMOD. We nevertheless choose to move to a more explicit approach for better transparency and reproducibility.
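To illustrate what "learning the tokens" when `alphabet=None` could look like, here is a minimal sketch. The function name `learn_alphabet`, the reserved padding index 0, and the sorted token order are assumptions made for illustration; they are not taken from the dlomix source.

```python
def learn_alphabet(sequences, alphabet=None):
    """Return a token -> index mapping (hypothetical helper, not the dlomix API).

    If `alphabet` is given it is used as-is. If it is None, the mapping is
    learned from the tokens present in `sequences`. Index 0 is left free for
    padding (an assumption of this sketch).
    """
    if alphabet is not None:
        return dict(alphabet)
    # Collect every distinct token and sort for a reproducible mapping.
    tokens = sorted({token for seq in sequences for token in seq})
    return {token: idx for idx, token in enumerate(tokens, start=1)}


# Tokenized peptides; modified residues are kept as single tokens.
sequences = [["A", "C", "M[UNIMOD:35]"], ["C", "K"]]
learned = learn_alphabet(sequences)
explicit = learn_alphabet(sequences, alphabet={"A": 1, "C": 2})
```

With an explicit alphabet the data never changes the mapping, which is the reproducibility argument made in the note above; with `None`, the mapping depends on the tokens actually present in the dataset.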

Copilot

Reviewed changes

Copilot reviewed 27 out of 31 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/test_torch_models.py | Updates TF↔Torch equivalence test to pass explicit `seq_length`. |
| tests/test_torch_dataset.py | Moves test data wiring to fixtures; updates expected batch shapes and termini behavior. |
| tests/test_processors.py | Adds comprehensive tests for encoding, PTM removal, function processor, and edge cases. |
| tests/test_models.py | Aligns expected error behavior for missing PTM/metadata inputs; adds “no metadata” case. |
| tests/test_feature_extractors.py | Adds tests for `LookupFeatureExtractor` padding/truncation behavior. |
| tests/test_datasets.py | Switches to fixtures for assets/data; adds TF dataset label-shape test coverage. |
| tests/conftest.py | Replaces global pytest variables with explicit fixtures; adds alphabets and helper fixtures. |
| src/dlomix/models/prosit_torch.py | Major refactor of Torch Prosit intensity predictor (configurable inputs, PTM/meta branches). |
| src/dlomix/models/prosit.py | Major refactor of TF Prosit intensity predictor (configurable inputs, PTM/meta branches). |
| src/dlomix/data/retention_time.py | Changes default alphabet behavior to allow alphabet learning (`alphabet=None`). |
| src/dlomix/data/processing/processors.py | Updates docs example output for parsing to use `[]-` (docstring change). |
| src/dlomix/data/processing/pickled_feature_dicts/saved_loss_atoms.pkl | Adds/updates serialized lookup data for PTM-related feature extraction. |
| src/dlomix/data/processing/pickled_feature_dicts/saved_gained_atoms.pkl | Adds/updates serialized lookup data for PTM-related feature extraction. |
| src/dlomix/data/processing/pickled_feature_dicts/mz_diff.pkl | Adds/updates serialized lookup data for PTM-related feature extraction. |
| src/dlomix/data/processing/feature_extractors.py | Fixes lookup feature extraction to cap lookup at `max_length` before padding. |
| src/dlomix/data/ion_mobility.py | Changes default alphabet behavior to allow alphabet learning (`alphabet=None`). |
| src/dlomix/data/fragment_ion_intensity.py | Changes default alphabet behavior to allow alphabet learning (`alphabet=None`). |
| src/dlomix/data/detectability.py | Changes default alphabet behavior to allow alphabet learning (`alphabet=None`). |
| src/dlomix/data/dataset_config.py | Adds post-init validation and normalizes `label_column` to list form. |
| src/dlomix/data/dataset.py | Adjusts padding/feature lengths with termini; revises TF dataset shuffle/batch/prefetch flow. |
| src/dlomix/data/charge_state.py | Changes default alphabet behavior to allow alphabet learning (`alphabet=None`). |
| src/dlomix/_metadata.py | Bumps version to 0.2.4 and updates copyright year. |
| run_scripts/run_prosit_intensity_torch.py | Updates example script to use learned alphabet from dataset. |
| run_scripts/run_prosit_intensity_ptms_torch.py | Updates PTM Torch example script (seq len, features, termini, learned alphabet). |
| run_scripts/run_prosit_intensity_ptms.py | Updates PTM TF example script to use dataset-learned alphabet and meta branch. |
| run_scripts/run_prosit_intensity.py | Updates TF training example to use meta branch, termini, learned alphabet, and `.keras` checkpoint path. |
| docs/notes/dataset_guide.rst | Adds a comprehensive datasets guide (new). |
| docs/index.rst | Adds the dataset guide into the docs structure/navigation. |
| docs/dlomix.rst | Updates package docs TOC to include callbacks; removes some module listings. |
| docs/dlomix.callbacks.rst | Adds Sphinx page for `dlomix.callbacks` (new). |
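The `dataset_config.py` entry mentions post-init validation and normalizing `label_column` to list form. A minimal sketch of that pattern with a standard-library dataclass follows. The class name `ExampleDatasetConfig`, its fields other than `label_column`, and the specific checks are hypothetical and only illustrate the idea, not the actual dlomix config.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ExampleDatasetConfig:
    """Hypothetical config used only to illustrate __post_init__ handling."""

    sequence_column: str
    label_column: Union[str, List[str]]
    max_seq_len: int = 30

    def __post_init__(self):
        # Normalize: a single column name becomes a one-element list,
        # so downstream code can always iterate over label columns.
        if isinstance(self.label_column, str):
            self.label_column = [self.label_column]
        else:
            self.label_column = list(self.label_column)

        # Validate eagerly so misconfiguration fails at construction time.
        if not self.label_column:
            raise ValueError("label_column must name at least one column")
        if self.max_seq_len <= 0:
            raise ValueError("max_seq_len must be a positive integer")


single = ExampleDatasetConfig("sequence", "intensities")
multi = ExampleDatasetConfig("sequence", ("rt", "ccs"))
```

Normalizing once in `__post_init__` means no other code path needs to branch on whether the user passed a string or a list.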

src/dlomix/data/dataset.py

victorgiurcoiu and others added 8 commits December 9, 2025 17:32

Update pickled ptm feature dicts (#81)

0420d08

Feature/revisit tf dataset pipeline (#83)

c51f9a2

* cache - shuffle - batch - prefetch order
* fixes related to termini - hf label column future warning
* pr comments + tests
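The "cache - shuffle - batch - prefetch" order matters mainly because of where the shuffle sits relative to batching. In `tf.data` terms, that order corresponds to a chain such as `dataset.cache().shuffle(buffer_size).batch(batch_size).prefetch(tf.data.AUTOTUNE)`. The dlomix implementation itself is not shown on this page, so the following is only a dependency-free sketch of the underlying reasoning, with hypothetical helper names.

```python
import random


def shuffle_then_batch(items, batch_size, rng):
    """Shuffle individual examples, then cut them into batches."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


def batch_then_shuffle(items, batch_size, rng):
    """Cut examples into fixed batches, then shuffle only the batch order."""
    batches = [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]
    rng.shuffle(batches)
    return batches


def distinct_batches(pipeline, items, batch_size, epochs, seed=0):
    """Collect the distinct batch memberships seen over several epochs."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(epochs):
        for batch in pipeline(items, batch_size, rng):
            seen.add(frozenset(batch))
    return seen


examples = list(range(8))
# Batching first: the same two groups of examples recur every epoch.
fixed = distinct_batches(batch_then_shuffle, examples, batch_size=4, epochs=50)
# Shuffling first: batch membership changes from epoch to epoch.
mixed = distinct_batches(shuffle_then_batch, examples, batch_size=4, epochs=50)
```

Caching before the shuffle keeps the expensive preprocessing result while still allowing a fresh order each epoch, and prefetching last overlaps the preparation of the next batch with the training step.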

dev version for better version management

01dadc6

Fix/prosit dynamic creation refactoring - fixes feature extractor - i…

44609ea

…mproved tests (#84)
* prosit model changes and tests
* fix in feature extraction + tests
* refined run scripts for intensity
* refactored tests

version 0.2.4.dev0

5b6d846

feature array subsetting fixed for correct padding after np.empty + t…

890fd8e

…ests (#85)
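The feature-extraction fixes concern capping the lookup at `max_length` before padding, and padding correctly after allocating with `np.empty` (which returns uninitialized memory, so every position has to be written explicitly). A minimal list-based sketch of the cap-then-pad logic follows. The function name, the `pad_value` and `unknown_value` parameters, and the scalar features are assumptions for illustration and do not reproduce `LookupFeatureExtractor`.

```python
def extract_lookup_features(tokens, lookup, max_length, pad_value=0.0, unknown_value=0.0):
    """Look up one feature per token and return exactly `max_length` values.

    Hypothetical helper: overlength sequences are truncated before the lookup,
    short sequences are padded afterwards with `pad_value`.
    """
    # Cap first, so no features are computed for positions that would be dropped.
    capped = tokens[:max_length]
    features = [lookup.get(token, unknown_value) for token in capped]
    # Pad explicitly; nothing is left to whatever the buffer initially contained.
    features.extend([pad_value] * (max_length - len(features)))
    return features


# Toy per-residue lookup values (made up for the example).
lookup = {"A": 1.0, "C": 2.0}
overlength = extract_lookup_features(["A", "C", "A", "C"], lookup, max_length=3)
short = extract_lookup_features(["A"], lookup, max_length=3)
```

Both calls return a list of length 3, so downstream tensors keep a fixed shape regardless of the input sequence length.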

version 0.2.4

6443de3

omsh requested a review from Copilot February 9, 2026 08:49

Copilot started reviewing on behalf of omsh February 9, 2026 08:49

Copilot AI reviewed Feb 9, 2026

src/dlomix/data/dataset.py (outdated, resolved)

comment - lazy TF import refactor

20ab379

omsh merged commit 3631902 into main Feb 9, 2026
3 checks passed

omsh merged 9 commits into main from develop

3 participants