Feat: Migrate Winnow's argument management to a Typer + Hydra yaml architecture #146

@JemmaLDaniel

Feature Request: Scalable Configuration Management and Improved CLI Reporting

Summary

This issue proposes migrating Winnow's configuration management from its current flat Typer structure to a hierarchical Typer + Hydra architecture. This refactoring step is necessary to enable built-in experiment management (hyperparameter tuning) and scalable configuration of complex, nested components. It also introduces a self-documenting CLI command (winnow config show).

1. Current Limitations (The Problem)

The current project structure relies on two inflexible methods for defining critical values: flat Python function signatures and hard-coded global variables/dictionaries.

Current Typer Command Signatures

# predict command:
def predict(
    data_source: Annotated[...],
    dataset_config_path: Annotated[...],  # Path to config file, but params aren't directly configurable
    method: Annotated[...],
    fdr_threshold: Annotated[...],
    confidence_column: Annotated[...],
    # ...
):
    ...  # body omitted

# train command:
def train(
    data_source: Annotated[...],
    dataset_config_path: Annotated[...],  # Same issue here
    model_output_dir: Annotated[...],
    dataset_output_path: Annotated[...],
    learn_prosit_missing: Annotated[bool, ...],  # Flat boolean flag
    learn_chimeric_missing: Annotated[bool, ...],  # Another flat boolean flag
    # ...
):
    ...  # body omitted

This structure leads to significant scalability and maintenance limitations:

  • Rigid, Flat Structure: Every new configuration option must be added as a separate, mandatory, top-level argument to the function signature, making the command line cumbersome and difficult to read.

  • Difficulty with Nested Configuration: It is difficult to configure nested objects or pass complex structures like a dictionary of column renames, forcing the use of separate, poorly integrated config files (dataset_config_path) alongside CLI arguments.

  • Hard-Coded Constants and Global Variables: Shared values like RESIDUE_MASSES (from winnow/constants.py) and simple hyperparameters like SEED and MZ_TOLERANCE (globals in main.py) are currently hard-coded in Python files. This prevents:

    • CLI Overrides: Users cannot easily override these values (e.g., change MZ_TOLERANCE) via the command line.
    • Composition: We cannot swap in an alternative set of values (e.g., a different set of residue masses) by selecting a different configuration file.
    • Documentation: These values are not documented or validated by the configuration system.
  • Lack of Experiment Management: The current system cannot run hyperparameter sweeps via Hydra's multirun mode, nor does it automatically log the exact configuration used for a given run, which matters for reproducible scientific computing.

  • Poor Argument Discovery: Users lack a single, reliable command to view all configurable parameters, their types, defaults and descriptions.

2. Proposed Solution: Typer + Hydra Hybrid Architecture

We will adopt Hydra as the single source of truth for all configuration data, responsible for parameter defaulting, overriding and object instantiation. Typer will be reduced to a thin command dispatcher for winnow train, winnow predict and the new reporting command.

A. Architectural Components

  1. Configuration Composition: We will split configuration into small, logical YAML files. The main pipeline files will include the constants file via Hydra's defaults list, ensuring constants like RESIDUE_MASSES are defined only once but are available everywhere in the configuration object.

  2. Hydra Instantiation and Extensibility: Hydra will use the _target_ field in the configuration to automatically instantiate complex objects (like the ProbabilityCalibrator and its base features), eliminating manual initialisation logic in our Python code.

    • Extending Feature Sets: If a power user creates a new feature class (e.g., NewCalibrationFeature), they simply define a new YAML file that points the _target_ field to that class (your_module.calibration_features.NewCalibrationFeature). They can then inject this new feature into the calibrator list via a simple command-line override, without changing any core Python code.

    • Extending Data Loading: The old --data-source flag, which required manual if/elif logic, is replaced by configuration. A user implements a new NewDataLoader class and defines its parameters in a YAML file using _target_. They can then swap the data source for a run: winnow predict data_source=new_source.

  3. Typer Dispatch: The train and predict Typer functions will be stripped of their configuration arguments and configured to pass all command-line input directly to Hydra for processing.
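To make the composition and instantiation steps above concrete, here is a minimal, library-free sketch of the mechanism. It is not Hydra itself: the merge function only models what the defaults list achieves, and the instantiate function is a simplified stand-in for hydra.utils.instantiate. The config keys, the numeric values and the use of types.SimpleNamespace as a target are all hypothetical placeholders; in Winnow the _target_ entries would point at real classes such as the calibrator and its features.

```python
import importlib
from typing import Any

# Hypothetical stand-ins for two YAML files after parsing.
CONSTANTS = {"residue_masses": {"G": 57.021464, "A": 71.037114}}
PREDICT = {
    "fdr_threshold": 0.05,  # illustrative value, not Winnow's actual default
    "calibrator": {
        "_target_": "types.SimpleNamespace",  # placeholder for a real class path
        "seed": 42,
        "features": [
            {
                "_target_": "types.SimpleNamespace",  # placeholder feature class
                "name": "mass_error",
                "mz_tolerance": 0.02,
            },
        ],
    },
}


def merge(base: dict, extra: dict) -> dict:
    """Return a new dict with `extra` merged into `base`; values in `extra` win."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def instantiate(node: Any) -> Any:
    """Recursively build objects from mappings that carry a `_target_` key."""
    if isinstance(node, list):
        return [instantiate(item) for item in node]
    if not isinstance(node, dict):
        return node
    kwargs = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
    if "_target_" not in node:
        return kwargs
    # Split 'package.module.Class' into module path and attribute name.
    module_name, _, attr = node["_target_"].rpartition(".")
    factory = getattr(importlib.import_module(module_name), attr)
    return factory(**kwargs)


# Constants are defined once and end up in the composed pipeline config.
cfg = merge(CONSTANTS, PREDICT)
calibrator = instantiate(cfg["calibrator"])
print(sorted(cfg))
print(calibrator.features[0].mz_tolerance)
```

The point of the sketch is that adding a new feature or data loader only changes the data (a new mapping with a different _target_ path), while the code that builds the objects stays the same.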

B. New Feature: Configuration Inspection

We will introduce a new command, winnow config show, that displays the fully resolved configuration for the specified pipeline (e.g., train or predict), showing all parameter values after defaults, composition and any command-line overrides have been applied.
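As a rough illustration of what "fully resolved" means here, the sketch below applies dotted key=value overrides on top of per-pipeline defaults and renders the result. It is a library-free model under stated assumptions: the default values, the key names and the exact argument form (something like winnow config show predict fdr_threshold=0.01) are hypothetical, and the real command would presumably delegate to Hydra's Compose API and print YAML rather than JSON.

```python
import ast
import copy
import json
from typing import Any, List

# Hypothetical defaults per pipeline; in the proposal these come from YAML files.
DEFAULTS = {
    "predict": {"fdr_threshold": 0.05, "calibrator": {"seed": 42}},
    "train": {"model_output_dir": "outputs/model", "calibrator": {"seed": 42}},
}


def parse_value(text: str) -> Any:
    """Interpret '0.01' or 'true' as typed values; fall back to the raw string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def resolve(pipeline: str, overrides: List[str]) -> dict:
    """Apply dotted key=value overrides on top of a copy of the pipeline defaults."""
    cfg = copy.deepcopy(DEFAULTS[pipeline])
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = cfg
        for part in parents:
            node = node[part]
        # Reject keys that are not already defined, so typos fail loudly.
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        node[leaf] = parse_value(raw)
    return cfg


def show(pipeline: str, overrides: List[str]) -> str:
    """Render the resolved configuration as text (JSON in this sketch)."""
    return json.dumps(resolve(pipeline, overrides), indent=2, sort_keys=True)


print(show("predict", ["fdr_threshold=0.01", "calibrator.seed=7"]))
```

Because the train and predict commands would resolve their configuration through the same path, the output of the inspection command should match what a run actually receives.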

Labels: enhancement
