Feature Request: Scalable Configuration Management and Improved CLI Reporting
Summary
This issue proposes migrating Winnow's configuration management from its current flat Typer structure to a robust, hierarchical Typer + Hydra architecture. This refactoring step is necessary to enable built-in experiment management (hyperparameter tuning) and scalable configuration of complex, nested components, while introducing a powerful, self-documenting CLI command (`winnow config show`).
1. Current Limitations (The Problem)
The current project structure relies on two inflexible methods for defining critical values: flat Python function signatures and hard-coded global variables/dictionaries.
Current Typer Command Signatures
```python
# predict command:
def predict(
    data_source: Annotated[...],
    dataset_config_path: Annotated[...],  # Path to config file, but params aren't directly configurable
    method: Annotated[...],
    fdr_threshold: Annotated[...],
    confidence_column: Annotated[...],
    # ...
):

# train command:
def train(
    data_source: Annotated[...],
    dataset_config_path: Annotated[...],  # Same issue here
    model_output_dir: Annotated[...],
    dataset_output_path: Annotated[...],
    learn_prosit_missing: Annotated[bool, ...],  # Flat boolean flag
    learn_chimeric_missing: Annotated[bool, ...],  # Another flat boolean flag
    # ...
):
```
This structure leads to significant scalability and maintenance limitations:
- Rigid, Flat Structure: Every new configuration option must be added as a separate, mandatory, top-level argument to the function signature, making the command line cumbersome and difficult to read.
- Difficulty with Nested Configuration: It is difficult to configure nested objects or pass complex structures like a dictionary of column renames, forcing the use of separate, poorly integrated config files (`dataset_config_path`) alongside CLI arguments.
- Hard-Coded Constants and Global Variables: Shared values like `RESIDUE_MASSES` (from `winnow.constants.py`) and simple hyperparameters like `SEED` and `MZ_TOLERANCE` (from `main.py` globals) are currently hard-coded in Python files. This prevents:
  - CLI Overrides: Users cannot easily override these values (e.g., change `MZ_TOLERANCE`) via the command line.
  - Composition: We cannot easily swap out large datasets (e.g., a different set of residue masses) via Hydra configuration files.
  - Documentation: These values are not documented or validated by the configuration system.
- Lack of Experiment Management: The current system cannot support built-in hyperparameter tuning via multirun or automatically log the exact configuration used for any given run, which is useful for reproducible scientific computing.
- Poor Argument Discovery: Users lack a single, reliable command to view all configurable parameters, their types, defaults and descriptions.
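To make the CLI-override limitation concrete, here is a minimal, hypothetical sketch (not Winnow code) of Hydra-style `key=value` overrides applied to a defaults mapping; the `seed` and `mz_tolerance` keys mirror the hard-coded globals mentioned above, and the values shown are invented for illustration:

```python
# Minimal sketch of Hydra-style "key=value" CLI overrides (illustrative only;
# real Hydra also supports nesting, type checking, interpolation, etc.).
def apply_overrides(defaults: dict, overrides: list[str]) -> dict:
    """Return a copy of `defaults` with each "key=value" override applied."""
    resolved = dict(defaults)
    for item in overrides:
        key, _, raw = item.partition("=")
        if key not in resolved:
            raise KeyError(f"Unknown config key: {key}")
        # Coerce the string to the type of the default value.
        resolved[key] = type(resolved[key])(raw)
    return resolved

# Hypothetical defaults standing in for the main.py globals.
DEFAULTS = {"seed": 42, "mz_tolerance": 0.02}

print(apply_overrides(DEFAULTS, ["mz_tolerance=0.05"]))
```

With the current hard-coded globals, changing `mz_tolerance` for a single run requires editing Python source; with a config system, it is a one-token override on the command line.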
2. Proposed Solution: Typer + Hydra Hybrid Architecture
We will adopt Hydra as the single source of truth for all configuration data, responsible for parameter defaulting, overriding and object instantiation. Typer will be reduced to a thin command dispatcher for `winnow train`, `winnow predict` and the new reporting command.
A. Architectural Components
- Configuration Composition: We will split configuration into small, logical YAML files. The main pipeline files will include the constants file via Hydra's `defaults` keyword, ensuring constants like `RESIDUE_MASSES` are defined only once but are available everywhere in the configuration object.
- Hydra Instantiation and Extensibility: Hydra will use the `_target_` field in the configuration to automatically instantiate complex objects (like the `ProbabilityCalibrator` and its base features) based on configuration, eliminating manual initialisation logic in our Python code.
  - Extending Feature Sets: If a power user creates a new feature class (e.g., `NewCalibrationFeature`), they simply define a new YAML file that points the `_target_` field to that class (`your_module.calibration_features.NewCalibrationFeature`). They can then inject this new feature into the calibrator list via a simple command-line override, without changing any core Python code.
  - Extending Data Loading: The old `--data-source` flag, which required manual `if`/`elif` logic, is replaced by configuration. A user implements a new `NewDataLoader` class and defines its parameters in a YAML file using `_target_`. They can then swap the data source for a run: `winnow predict data_source=new_source`.
- Typer Dispatch: The `train` and `predict` Typer functions will be stripped of their configuration arguments and configured to pass all command-line input directly to Hydra for processing.
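As a rough illustration of the `_target_` mechanism, here is a simplified sketch of what Hydra-style instantiation does (in real code, Hydra's `hydra.utils.instantiate` handles this; the `fractions.Fraction` target below is just a stdlib stand-in for a class like `ProbabilityCalibrator` or a data loader):

```python
import importlib

def instantiate(config: dict):
    """Build the object named by `_target_`, passing the remaining keys as kwargs.

    A simplified sketch of Hydra's instantiate; real Hydra additionally
    handles recursive instantiation, positional args, partials, etc.
    """
    module_path, _, class_name = config["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    kwargs = {k: v for k, v in config.items() if k != "_target_"}
    return cls(**kwargs)

# The dict mirrors what a YAML file containing a _target_ field resolves to.
config = {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 3}
print(instantiate(config))  # builds fractions.Fraction(numerator=1, denominator=3)
```

Swapping the instantiated class is then purely a configuration change: point `_target_` at a different import path and the Python code stays untouched.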
B. New Feature: Configuration Inspection
We will introduce a new `winnow config show` command for viewing the resolved configuration. This command will display the fully resolved configuration for the specified pipeline (e.g., `train` or `predict`), showing all parameter values after defaults, composition and any command-line overrides have been applied.
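A minimal sketch of what such a command could do under the hood (a hypothetical helper, not the actual implementation; in real Hydra the composed config is rendered with `OmegaConf.to_yaml`, whereas this sketch prints JSON to stay dependency-free, and all keys and values shown are invented):

```python
import json

def show_config(defaults: dict, *layers: dict) -> str:
    """Merge config layers left to right (later layers win) and pretty-print.

    Mimics the resolution order described above: defaults first, then
    composed config files, then command-line overrides.
    """
    resolved = dict(defaults)
    for layer in layers:
        resolved.update(layer)
    return json.dumps(resolved, indent=2, sort_keys=True)

defaults = {"mz_tolerance": 0.02, "seed": 42}
composed = {"fdr_threshold": 0.01}   # e.g. from a hypothetical predict.yaml
overrides = {"mz_tolerance": 0.05}   # e.g. from the command line
print(show_config(defaults, composed, overrides))
```

The value of the command is exactly this transparency: the user sees the final values the pipeline will run with, not the scattered inputs they came from.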