@JemmaLDaniel
Collaborator

Summary

This PR implements the Typer + Hydra hybrid architecture proposed in #146, refactoring Winnow's configuration management from flat CLI signatures to a flexible, hierarchical system that enables scalable configuration of complex nested components and automatic object instantiation.

Implementation Details

1. Typer + Hydra Hybrid Architecture

Typer now acts as a thin command dispatcher, passing all configuration to Hydra:

def train(ctx: typer.Context) -> None:
    """Passes control directly to the Hydra training pipeline."""
    overrides = ctx.args if ctx.args else None
    train_entry_point(overrides)

Pipeline logic moved to `train_entry_point()` and `predict_entry_point()` functions that handle Hydra initialization, configuration composition and pipeline execution.

2. Structured Configuration with Composition

Created modular configuration structure in config/:

  • train.yaml / predict.yaml - Main pipeline configurations
  • calibrator.yaml - Model architecture and features
  • residues.yaml - Amino acid masses and modifications (shared via composition)
  • data_loader/ - Pluggable dataset format loaders (InstaNovo, MZTab, PointNovo, Winnow)
  • fdr_method/ - Pluggable FDR methods (nonparametric, database-grounded)

Configuration files use Hydra's defaults mechanism to compose shared components.

3. Hydra-Based Object Instantiation

Used Hydra's _target_ field for automatic instantiation:

  • Data loaders instantiated from configuration without manual if/elif logic
  • FDR methods selected and configured via YAML
  • Users can inject custom implementations by creating YAML configs with _target_ pointing to their classes
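To illustrate the idea behind `_target_`, here is a minimal standard-library sketch of what such instantiation amounts to: import the dotted path, then call it with the remaining keys. It is not Hydra's implementation (`hydra.utils.instantiate` also handles nested configs, among other things), `instantiate_from_config` is a hypothetical helper name, and `fractions.Fraction` merely stands in for a data loader or FDR method class.

```python
from importlib import import_module
from typing import Any, Mapping


def instantiate_from_config(config: Mapping[str, Any]) -> Any:
    """Build an object from a mapping with a dotted `_target_` path (hypothetical helper)."""
    kwargs = dict(config)
    target = kwargs.pop("_target_")  # e.g. "package.module.ClassName"
    module_path, _, attr_name = target.rpartition(".")
    cls = getattr(import_module(module_path), attr_name)
    # All remaining keys are passed to the constructor as keyword arguments.
    return cls(**kwargs)


# Stand-in config: in Winnow this would point at a data loader or FDR method class.
config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate_from_config(config)
print(obj)
```

Because the class is named in the config rather than in code, swapping implementations needs no if/elif dispatch in the pipeline.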

4. Configuration Inspection Commands

Added winnow config command group:

  • winnow config train - Display resolved training configuration
  • winnow config predict - Display resolved prediction configuration

Implemented custom ConfigFormatter class with hierarchical colour-coding based on YAML nesting depth for improved terminal readability.
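As a rough illustration of depth-based colouring (not the PR's `ConfigFormatter`, whose details are not shown here), the sketch below derives the nesting depth from leading spaces and cycles through a small ANSI palette. The palette, the `colour_by_depth` name, the two-space indent and the sample keys other than `calibrator.seed` are all assumptions.

```python
# ANSI colour codes cycled by nesting depth (the palette is an assumption).
PALETTE = ["\033[36m", "\033[33m", "\033[35m"]
RESET = "\033[0m"


def colour_by_depth(yaml_text: str, indent_width: int = 2) -> str:
    """Colour each YAML line according to its indentation depth."""
    coloured = []
    for line in yaml_text.splitlines():
        if not line.strip():
            coloured.append(line)  # leave blank lines untouched
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent_width
        colour = PALETTE[depth % len(PALETTE)]
        coloured.append(f"{colour}{line}{RESET}")
    return "\n".join(coloured)


sample = "calibrator:\n  seed: 42\n  features:\n    mass_error: true"
print(colour_by_depth(sample))
```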

5. Lazy Imports for CLI Performance

Implemented lazy import pattern using TYPE_CHECKING to defer heavy dependencies (PyTorch, InstaNovo, etc.) until command execution. This makes --help and config commands respond instantly whilst pipeline commands still have access to all required dependencies.

Added module-level docstring in main.py explaining the rationale.
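The pattern can be sketched as follows, assuming the usual combination of a `TYPE_CHECKING` guard (so annotations still type-check) and a function-level import (so the dependency is loaded only when the command runs). `decimal` is a lightweight stand-in for a heavy dependency such as PyTorch, and both function names are hypothetical.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by static type checkers only, never at runtime.
    from decimal import Decimal


def show_help() -> str:
    """Cheap command: touches no heavy dependency."""
    return "usage: tool [train|predict]"


def run_pipeline(value: str) -> "Decimal":
    """Expensive command: imports its dependency only when called."""
    from decimal import Decimal  # deferred import

    return Decimal(value) * 2


print(show_help())
print(run_pipeline("1.5"))
```

The return annotation is a string so that it is never evaluated at runtime, where the guarded import has not happened.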

6. Documentation Updates

Minor improvements to CLI help text and documentation to reflect the new Hydra-based configuration system with examples of dot-notation overrides.

Migration Notes

Existing users will need to:

  • Use configuration files in config/ instead of passing all parameters via CLI flags
  • Override parameters using dot notation: winnow train calibrator.seed=42
  • Consult winnow config <pipeline> to inspect resolved configurations
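For readers new to dot-notation overrides, the following simplified sketch shows how a `section.key=value` string maps onto a nested config. It is not Hydra's override parser, which supports a much richer grammar; `apply_override` is a hypothetical helper, and the rule that a leading `+` is needed to add a key that does not exist yet is modelled on Hydra's behaviour.

```python
import ast
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> None:
    """Apply one `a.b.c=value` override in place (simplified, hypothetical helper)."""
    allow_new = override.startswith("+")  # "+key=value" may add a new key
    dotted_key, _, raw_value = override.lstrip("+").partition("=")
    *parents, leaf = dotted_key.split(".")

    node = config
    for name in parents:
        node = node[name]  # KeyError if an intermediate section is missing

    if leaf not in node and not allow_new:
        raise KeyError(f"Could not override '{dotted_key}'; use +{dotted_key}=... to append")

    try:
        node[leaf] = ast.literal_eval(raw_value)  # numbers, lists, quoted strings
    except (ValueError, SyntaxError):
        node[leaf] = raw_value  # anything else is kept as a plain string


config = {"calibrator": {"seed": 0, "hidden_layer_sizes": [50]}}
apply_override(config, "calibrator.seed=42")
apply_override(config, "calibrator.hidden_layer_sizes=[100,50,25]")
print(config)
```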

@JemmaLDaniel
Collaborator Author

JemmaLDaniel commented Nov 26, 2025

Commits 20ee8b3 and 2529582 also address #143 and #140

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek November 26, 2025 18:18
@JemmaLDaniel JemmaLDaniel self-assigned this Nov 26, 2025
@JemmaLDaniel JemmaLDaniel added the enhancement (New feature or request) and documentation (Improvements or additions to documentation) labels Nov 26, 2025
winnow train data_loader=mztab model_output_dir=models/my_model

# Specify dataset paths
winnow train dataset.spectrum_path_or_directory=data/spectra.parquet dataset.predictions_path=data/preds.csv
Contributor

When I try this:

winnow train dataset.spectrum_path_or_directory=/home/j-vangoey/code/InstaNovo-internal/data/nine-species-balanced/parquet/mouse/dataset-mus-musculus-train-0000-0001.parquet dataset.predictions=/home/j-vangoey/code/InstaNovo-internal/data/nine-species-balanced/instanovo_af3456d3_9to1/mouse.csv

I get:

[...]
ConfigCompositionException: Could not override 'dataset.predictions'.
To append to your config use +dataset.predictions=/home/j-vangoey/code/InstaNovo-internal/data/nine-species-balanced/instanovo_af3456d3_9to1/mouse.csv

When I try that:

winnow train dataset.spectrum_path_or_directory=/home/j-vangoey/code/InstaNovo-internal/data/nine-species-balanced/parquet/mouse/dataset-mus-musculus-train-0000-0001.parquet +dataset.predictions=/home/j-vangoey/code/InstaNovo-internal/data/nine-species-balanced/instanovo_af3456d3_9to1/mouse.csv

I get:

[...]
│ /home/j-vangoey/code/winnow/winnow/datasets/data_loaders.py:62 in _load_beam_preds                                                                                                          │
│                                                                                                                                                                                             │
│    59 │   │   Returns:                                                                         ╭───────────────── locals ──────────────────╮                                                │
│    60 │   │   │   Tuple[pl.DataFrame, pl.DataFrame]: A tuple containing the predictions and be │ predictions_path = 'data/predictions.csv' │                                                │
│    61 │   │   """                                                                              ╰───────────────────────────────────────────╯                                                │
│ ❱  62 │   │   if predictions_path.suffix != ".csv":                                                                                                                                         │
│    63 │   │   │   raise ValueError(                                                                                                                                                         │
│    64 │   │   │   │   f"Unsupported file format for InstaNovo beam predictions: {predictions_p                                                                                              │
│    65 │   │   │   )                                                                                                                                                                         │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'str' object has no attribute 'suffix'

Collaborator Author

It should be dataset.predictions_path, not dataset.predictions.

Collaborator Author

I see predictions_path was coming in from the config as a string, so I'll convert to a Path before file loading.
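A minimal sketch of that kind of fix, assuming the loader should accept either a string or a `Path`: values read from YAML or the command line arrive as `str`, so they are coerced before any `Path` attribute such as `.suffix` is used. `check_predictions_path` and its error message are hypothetical, not the code in `data_loaders.py`.

```python
from pathlib import Path
from typing import Union


def check_predictions_path(predictions_path: Union[str, Path]) -> Path:
    """Coerce a config value to Path and validate its extension (hypothetical helper)."""
    path = Path(predictions_path)  # converts str inputs, leaves Path inputs equivalent
    if path.suffix != ".csv":
        raise ValueError(f"Expected a .csv predictions file, got {path.suffix!r}")
    return path


print(check_predictions_path("data/preds.csv"))
```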

--output-folder ./predictions
```bash
# Change MLP architecture
winnow train calibrator.hidden_layer_sizes=[100,50,25]
Contributor

When I try this

winnow train calibrator.hidden_layer_sizes=[100,50,25]

I get:

zsh: no matches found: calibrator.hidden_layer_sizes=[100,50,25]

Collaborator Author

Hmm, strange. This works fine for me
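A likely explanation for the difference is the shell rather than Winnow: zsh treats square brackets as a glob pattern and, by default, aborts with "no matches found" when nothing matches, whereas bash by default passes an unmatched pattern through unchanged. Quoting the override sidesteps this. The sketch below uses the standard library's `shlex.quote` to show which overrides need quoting; `quote_override` is a hypothetical helper name.

```python
import shlex


def quote_override(override: str) -> str:
    """Return the override quoted so a POSIX-style shell passes it through verbatim."""
    return shlex.quote(override)


# Brackets are shell metacharacters, so this override gets single quotes.
print("winnow train", quote_override("calibrator.hidden_layer_sizes=[100,50,25]"))
# Nothing special here, so the override is returned unchanged.
print("winnow train", quote_override("calibrator.seed=42"))
```

In documentation examples, writing list overrides in single quotes keeps them portable across bash and zsh.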


### InstaNovo Configuration
# Specify dataset paths
winnow predict dataset.spectrum_path_or_directory=data/spectra.parquet dataset.predictions_path=data/preds.csv
Contributor

Same comment as earlier. When I try this:

winnow predict dataset.spectrum_path_or_directory=/home/j-vangoey/code/InstaNovo-internal/data/nine-species-balanced/parquet/mouse/dataset-mus-musculus-train-0000-0001.parquet dataset.predictions=/home/j-vangoey/code/InstaNovo-internal/data/nine-species-balanced/instanovo_af3456d3_9to1/mouse.csv

I get:

[...]
ConfigCompositionException: Could not override 'dataset.predictions'.
To append to your config use +dataset.predictions=/home/j-vangoey/code/InstaNovo-internal/data/nine-species-balanced/instanovo_af3456d3_9to1/mouse.csv

and when I try that:

winnow predict dataset.spectrum_path_or_directory=/home/j-vangoey/code/InstaNovo-internal/data/nine-species-balanced/parquet/mouse/dataset-mus-musculus-train-0000-0001.parquet +dataset.predictions=/home/j-vangoey/code/InstaNovo-internal/data/nine-species-balanced/instanovo_af3456d3_9to1/mouse.csv

I get

[...]
AttributeError: 'str' object has no attribute 'suffix'

docs/cli.md Outdated

Winnow supports multiple input formats:

- **InstaNovo**: Parquet spectra + CSV predictions (beam search format)
Contributor

Most people will have their input data in *.MGF, so I think it would be good either to point to instructions on how to use instanovo convert to convert *.MGF to *.parquet, or to add functionality to do that in winnow on the fly.

Collaborator Author

@JemmaLDaniel JemmaLDaniel Dec 4, 2025

Very good point! I will address MGF file loading in a new PR, and I can add a note on this to the docs as a stopgap for now.

# Predict using pretrained model, InstaNovo predictions and default settings
winnow predict \
dataset.spectrum_path_or_directory=data/test_spectra.parquet \
dataset.predictions_path=data/test_predictions.csv
Contributor

Maybe add small sample test_spectra.parquet and test_predictions.csv files to the repo (or attach them to a new release as assets and add a script to download them) so that people quickly have some sample files to play around with.


```bash
# Train with default settings
winnow train
Contributor

Training with default settings gives me:

[...]
FileNotFoundError: No such file or directory (os error 2): data/spectra.ipc

winnow train

# Predict with default settings
winnow predict
Contributor

Same here.

FileNotFoundError: No such file or directory (os error 2): data/spectra.ipc

from hydra.utils import instantiate

with initialize(
config_path="../../config", version_base="1.3", job_name="winnow_train"
Contributor

This config_path="../../config" won't work when we distribute the package via PyPI. To confirm:

  1. Build the winnow-fdr package.
$ uv build
Building source distribution...
[...]
Successfully built dist/winnow_fdr-1.0.3.tar.gz
Successfully built dist/winnow_fdr-1.0.3-py3-none-any.whl
  2. Install this wheel
$ cd /tmp
$ uv init winnow_demo
Initialized project `winnow-demo` at `/tmp/winnow_demo`
$ cd winnow_demo
$ uv add ~/code/winnow/dist/winnow_fdr-1.0.3-py3-none-any.whl
Using CPython 3.13.6
Creating virtual environment at: .venv
Resolved 167 packages in 1.94s
Prepared 23 packages in 2m 23s
[...]
$ source .venv/bin/activate
$ winnow config train
[...]
MissingConfigException: Primary config directory not found.
Check that the config directory '/tmp/winnow_demo/.venv/lib/python3.13/site-packages/config' exists
and readable

We have had the same problem in InstaNovo. The solution is to move your config folder inside the winnow folder

winnow/
    config/
        data_loader
            instanovo.yaml
            [...]

and then use importlib:

from importlib.resources import files

def train_entry_point(overrides=None, execute=True):
    from hydra import initialize, compose
    from hydra.utils import instantiate
    from hydra.core.global_hydra import GlobalHydra

    # Reset Hydra if called multiple times in same process
    GlobalHydra.instance().clear()

    # Resolve config directory inside package
    config_dir = files("winnow").joinpath("config")

    with initialize(
        config_path=str(config_dir),
        version_base="1.3",
        job_name="winnow_train",
    ):
        cfg = compose(config_name="train", overrides=overrides)

    if not execute:
        print_config(cfg)
        return

Collaborator Author

In response to the config path issue, I have made the following changes:

  1. Moved configs inside package - Configs are now in winnow/configs/ as suggested
  2. Used importlib.resources.files() - Implemented get_config_dir() in winnow/scripts/config_path_utils.py that uses files("winnow").joinpath("configs") for package mode
  3. Added package data - Updated pyproject.toml to include configs in the built package
  4. Switched to initialize_config_dir() - Changed from initialize(config_path=...) to initialize_config_dir(config_dir=...) to handle absolute paths correctly

The solution includes a fallback to dev mode when running from a cloned repo, and also adds support for custom config directories with partial overrides.

Tested and confirmed working when installed from a wheel from my side. Let me know what you think!
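The lookup order described above can be sketched as follows. This is an assumption-laden illustration, not the PR's `get_config_dir()`: `resolve_config_dir` and its parameters are hypothetical, and the standard-library `email` package stands in for `winnow` so the example runs anywhere.

```python
from importlib.resources import files
from pathlib import Path
from typing import Optional


def resolve_config_dir(package: str, subdir: str, dev_fallback: Optional[Path] = None) -> Path:
    """Locate a config directory shipped inside a package, else a dev fallback."""
    # Package mode: the directory sits next to the installed package's modules.
    packaged = Path(str(files(package).joinpath(subdir)))
    if packaged.is_dir():
        return packaged.resolve()
    # Dev mode: fall back to a directory in a cloned repository, if given.
    if dev_fallback is not None and dev_fallback.is_dir():
        return dev_fallback.resolve()
    raise FileNotFoundError(f"Could not locate '{subdir}' for package '{package}'")


# Demonstrated on a standard-library package; Winnow would use ("winnow", "configs").
print(resolve_config_dir("email", "mime"))
```

The function returns an absolute path, which is the form `initialize_config_dir(config_dir=...)` expects.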

Contributor

Configs are now in winnow/configs/

The winnow/configs folder is not checked in yet, so running winnow train gives:

╭───────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────────────────────────╮
│ /home/j-vangoey/code/winnow/winnow/scripts/main.py:352 in train                                                                                                                             │
│                                                                                                                                                                                             │
│   349 │   """Passes control directly to the Hydra training pipeline."""                        ╭────────────────────────── locals ──────────────────────────╮                               │
│   350 │   # Capture extra arguments as Hydra overrides (--config-dir already parsed out by Typ │ config_dir = None                                          │                               │
│   351 │   overrides = ctx.args if ctx.args else None                                           │        ctx = <click.core.Context object at 0x76d835d88cd0> │                               │
│ ❱ 352 │   train_entry_point(overrides, config_dir=config_dir)                                  │  overrides = None                                          │                               │
│   353                                                                                          ╰────────────────────────────────────────────────────────────╯                               │
│   354                                                                                                                                                                                       │
│   355 @app.command(                                                                                                                                                                         │
│                                                                                                                                                                                             │
│ /home/j-vangoey/code/winnow/winnow/scripts/main.py:165 in train_entry_point                                                                                                                 │
│                                                                                                                                                                                             │
│   162 │   from winnow.scripts.config_path_utils import get_primary_config_dir                  ╭───── locals ──────╮                                                                        │
│   163 │                                                                                        │ config_dir = None │                                                                        │
│   164 │   # Get primary config directory (custom if provided, otherwise package/dev)           │    execute = True │                                                                        │
│ ❱ 165 │   primary_config_dir = get_primary_config_dir(config_dir)                              │  overrides = None │                                                                        │
│   166 │                                                                                        ╰───────────────────╯                                                                        │
│   167 │   # Initialise Hydra with primary config directory                                                                                                                                  │
│   168 │   with initialize_config_dir(                                                                                                                                                       │
│                                                                                                                                                                                             │
│ /home/j-vangoey/code/winnow/winnow/scripts/config_path_utils.py:190 in get_primary_config_dir                                                                                               │
│                                                                                                                                                                                             │
│   187 │   │   │   f"package: {package_path}) -> {merged_dir}"                                  ╭───────── locals ─────────╮                                                                 │
│   188 │   │   )                                                                                │ custom_config_dir = None │                                                                 │
│   189 │   │   return merged_dir                                                                ╰──────────────────────────╯                                                                 │
│ ❱ 190 │   return get_config_dir().resolve()                                                                                                                                                 │
│   191                                                                                                                                                                                       │
│                                                                                                                                                                                             │
│ /home/j-vangoey/code/winnow/winnow/scripts/config_path_utils.py:66 in get_config_dir                                                                                                        │
│                                                                                                                                                                                             │
│    63 │   if alt_dev_configs.exists() and alt_dev_configs.is_dir():                            ╭───────────────────────────────── locals ──────────────────────────────────╮                │
│    64 │   │   return alt_dev_configs                                                           │ alt_dev_configs = PosixPath('/home/j-vangoey/code/winnow/configs')        │                │
│    65 │                                                                                        │     config_path = PosixPath('/home/j-vangoey/code/winnow/winnow/configs') │                │
│ ❱  66 │   raise FileNotFoundError(                                                             │     dev_configs = PosixPath('/home/j-vangoey/code/winnow/winnow/configs') │                │
│    67 │   │   f"Could not locate configs directory. Tried:\n"                                  │       repo_root = PosixPath('/home/j-vangoey/code/winnow')                │                │
│    68 │   │   f"  - Package configs: winnow.configs\n"                                         │      script_dir = PosixPath('/home/j-vangoey/code/winnow/winnow/scripts') │                │
│    69 │   │   f"  - Dev configs: {dev_configs}\n"                                              ╰───────────────────────────────────────────────────────────────────────────╯                │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: Could not locate configs directory. Tried:
 - Package configs: winnow.configs
 - Dev configs: /home/j-vangoey/code/winnow/winnow/configs
 - Alt dev configs: /home/j-vangoey/code/winnow/configs

docs/cli.md Outdated
1. **Model checkpoints** (in `--model-output-folder`):
- `calibrator.pkl`: Complete trained calibrator with all features and parameters

2. **Training results** (`--dataset-output-path`):
Contributor

This still references the old CLI style --dataset-output-path instead of the Hydra style dataset_output_path.

Collaborator Author

Ah good catch, thanks

@JemmaLDaniel JemmaLDaniel force-pushed the feat-hydra-config branch 4 times, most recently from f0bafe0 to b459d60 Compare December 4, 2025 18:59
@JemmaLDaniel JemmaLDaniel requested a review from BioGeek December 4, 2025 19:03
@github-actions

github-actions bot commented Dec 4, 2025

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| __init__.py | 0 | 0 | 100% | |
| data_types.py | 4 | 0 | 100% | |
| calibration | | | | |
|    __init__.py | 0 | 0 | 100% | |
|    calibration_features.py | 265 | 11 | 95% | 162–163, 326–328, 565–567, 1043–1045 |
|    calibrator.py | 91 | 15 | 83% | 69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195 |
| datasets | | | | |
|    __init__.py | 0 | 0 | 100% | |
|    calibration_dataset.py | 86 | 12 | 86% | 137, 190, 192–193, 199–202, 204–207 |
|    data_loaders.py | 205 | 205 | 0% | 7–18, 20–21, 27, 30, 43, 50–51, 62–64, 67, 69, 74, 79, 81–82, 91, 93–96, 98, 102–103, 105, 107, 109–110, 122, 128, 137–138, 142, 144, 160–161, 163–165, 167–169, 171–172, 174, 176, 188, 191–192, 194–195, 200, 206–207, 216, 219, 229, 234, 247–249, 258, 260, 272, 277–279, 281, 285–286, 288–289, 294, 297, 304, 310, 312, 324–325, 328, 331–332, 341, 350, 353, 377, 390, 397–398, 407–408, 410–413, 415, 419–420, 422, 424–425, 434–436, 439–440, 442, 458–459, 461–462, 465–467, 472, 475–476, 479–480, 486–487, 489, 491, 501, 504, 516–517, 524, 530, 534, 536, 539, 543, 545, 561, 573, 586–587, 589–590, 593–595, 605–606, 608, 610–612, 614, 625, 627, 638–639, 643, 647, 649, 661, 668, 681, 691–692, 703, 738, 744, 764, 767, 780, 787, 800–801, 803–805, 811–815, 820–821, 824, 827, 832, 838–844, 850–851 |
|    interfaces.py | 3 | 3 | 0% | 6–8 |
|    psm_dataset.py | 25 | 0 | 100% | |
| fdr | | | | |
|    __init__.py | 0 | 0 | 100% | |
|    base.py | 58 | 15 | 74% | 81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186 |
|    database_grounded.py | 25 | 0 | 100% | |
|    nonparametric.py | 25 | 4 | 84% | 62, 68–69, 72 |
| scripts | | | | |
|    __init__.py | 0 | 0 | 100% | |
|    config_formatter.py | 53 | 53 | 0% | 3–5, 8, 16, 27, 29, 31, 37–38, 40–42, 44, 46, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 82, 91, 93, 102, 104, 113, 115, 127–128, 130–132, 134, 145–147, 150, 153–154, 157–158, 160 |
|    config_path_utils.py | 76 | 5 | 93% | 24–26, 117–118 |
|    main.py | 136 | 136 | 0% | 8, 10–14, 17–21, 24–25, 27–29, 33, 40, 45, 48, 54, 56–57, 60, 69, 77, 80, 87, 89–91, 93, 95–100, 103, 105–106, 111, 126, 129, 135–136, 138–140, 143–144, 147, 160–162, 165, 168, 173, 175–177, 179, 181–182, 185–186, 189, 191–192, 194, 196–197, 200–201, 204–205, 208–209, 212–213, 215, 218, 232–234, 237, 240, 245, 247–249, 251–252, 254–255, 258–259, 262, 264–265, 267, 269–270, 273–274, 280–281, 284–285, 288–289, 292–293, 301–302, 305–308, 312, 315, 338, 351–352, 355, 380, 393–394, 397, 412, 424–425, 428, 443, 455–456 |
| TOTAL | 1052 | 459 | 56% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 137 | 0 💤 | 0 ❌ | 0 🔥 | 36.657s ⏱️ |

…nstalled as a package

chore: fix pre-commit on main script

chore: remove testing Make commands

fix: correct the path for config_path_utils

fix: correct the path for config_path_utils

chore: pre-commit formatting fixes for test_config_paths
