Refactor Plan for Running-Optimizer

1. Executive Summary

The repository currently suffers from a "script-heavy" architecture: core business logic (training loops, feature orchestration) is embedded in scripts/ or app.py rather than in reusable library modules. This causes code duplication, makes testing difficult, and forces fragile import hacks (e.g., sys.argv manipulation).

This refactor will move all reusable logic into src/, treating scripts/ strictly as CLI entry points. It will also standardize path handling and configuration.

2. Current Architecture & Smells

Path: Data -> Features -> Train -> Eval -> Predict

  • Data: data/raw/ -> scripts/ingest_local_csv.py (implied) or app.py loading directly.
  • Features: scripts/make_features.py calls src.features but also contains filtering logic. Output: data/processed/features_*.csv.
  • Train: scripts/train_model.py defines pipelines, performs grid search, and saves to models/.
  • Predict/Eval: scripts/predict.py and scripts/evaluate_model.py.

Architectural Smells

  1. Fat Scripts: scripts/train_model.py contains ~200 lines of model definitions, cross-validation loops, and plotting logic.
  2. Duplicated Logic: app.py reimplements run filtering/cleaning logic found partly in src/features.py.
  3. Import Hacks: src/pipeline.py mocks sys.argv to call scripts.make_features.main.
  4. Hardcoded Paths: REPO_ROOT is redefined in almost every file.
  5. Hidden Dependencies: src/models.py hardcodes paths to data/processed/.

3. Target Architecture

Running-Optimizer/
├── archive/                 # Deprecated scripts/modules
├── configs/                 # YAML configs
├── data/                    # Data artifacts (ignored by git)
├── scripts/                 # Thin CLI wrappers
│   ├── make_features.py
│   ├── train_model.py
│   ├── evaluate_model.py
│   └── predict.py
├── src/
│   ├── __init__.py
│   ├── config.py            # Centralized config & path definitions
│   ├── data/
│   │   ├── __init__.py
│   │   ├── io.py            # Loaders/Savers (abstract CSV/Parquet paths)
│   │   └── clean.py         # Domain cleaning (moving time, pause ratio)
│   ├── features/
│   │   ├── __init__.py
│   │   ├── generator.py     # Orchestration (was make_features.py)
│   │   └── transformations.py # Core math (rolling windows, etc.)
│   ├── models/
│   │   ├── __init__.py
│   │   ├── registry.py      # Model pipeline definitions
│   │   ├── training.py      # CV loops, Grid Search logic
│   │   └── evaluation.py    # Metrics, plots
│   └── visualization/       # Plotting helpers
├── tests/
└── app.py                   # Streamlit app (imports from src)

4. Migration Plan

Phase 1: Foundation (Paths & Config)

Goal: Remove REPO_ROOT duplication and centralize constants.

  • Action: Create src/config.py defining REPO_ROOT, DATA_DIR, MODELS_DIR.
  • Refactor: Update src/utils.py and every other module that redefines REPO_ROOT to import paths from src/config.py.
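
A minimal sketch of the path resolution that src/config.py could use, assuming the file lives at <repo>/src/config.py. The helper name find_repo_root is hypothetical and not taken from the repository.

```python
from pathlib import Path


def find_repo_root(config_file) -> Path:
    """Return the repo root, assuming config_file is <repo>/src/config.py."""
    # resolve() yields an absolute path, so the result does not depend on
    # the current working directory of the calling process.
    # parents[0] is src/, parents[1] is the repository root.
    return Path(config_file).resolve().parents[1]
```

Inside src/config.py the constants would then be defined once, for example REPO_ROOT = find_repo_root(__file__), DATA_DIR = REPO_ROOT / "data", and MODELS_DIR = REPO_ROOT / "models", and every other module would import them instead of recomputing the root.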

Phase 2: Data & Features

Goal: Decouple feature generation from the script.

  • Action: Move app.py cleaning logic to src/data/clean.py (function: clean_raw_runs).
  • Action: Move scripts/make_features.py logic to src/features/generator.py (function: generate_features_dataset).
  • Update: scripts/make_features.py becomes a 10-line wrapper.
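
As an illustration of what such a wrapper might look like, here is a sketch assuming the --dataset and --inp flags from the runbook. The body of generate_features_dataset is a stand-in so the example runs on its own; in the repo it would be imported from src/features/generator.py, and its real signature and the features_<dataset>.csv naming are assumptions.

```python
import argparse
from pathlib import Path


def generate_features_dataset(dataset: str, inp: Path) -> Path:
    """Stand-in for src.features.generator.generate_features_dataset.

    The real function would read `inp`, build features, and write a CSV.
    This placeholder only returns the path it would write to.
    """
    return Path("data/processed") / f"features_{dataset}.csv"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Generate the features dataset.")
    parser.add_argument("--dataset", required=True, help="Dataset name")
    parser.add_argument("--inp", required=True, type=Path, help="Raw runs CSV")
    return parser


def main(argv=None) -> int:
    # Accepting argv lets other code and tests pass arguments directly
    # instead of patching sys.argv.
    args = build_parser().parse_args(argv)
    out_path = generate_features_dataset(args.dataset, args.inp)
    print(f"Wrote {out_path}")
    return 0

# In the script itself, finish with the usual guard:
#   if __name__ == "__main__":
#       raise SystemExit(main())
```

Because main takes an explicit argument list, a caller can run main(["--dataset", "dhruva", "--inp", "data/raw/runs.csv"]) without touching sys.argv, although library code would normally call generate_features_dataset directly.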

Phase 3: Training & Models

Goal: Make training testable and importable.

  • Action: Move model definitions (Ridge, RF, pipelines) from scripts/train_model.py to src/models/registry.py.
  • Action: Move the CV/GridSearch loop to src/models/training.py (function: run_training_job).
  • Update: scripts/train_model.py becomes a wrapper calling run_training_job.
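
The split between registry.py and training.py could be sketched as follows. To stay dependency-free, this example registers a toy MeanBaseline model rather than the Ridge and RF pipelines, uses a simple interleaved k-fold loop, and omits grid search. The names MODEL_REGISTRY and register_model and the signature of run_training_job are assumptions for illustration.

```python
from statistics import mean
from typing import Callable, Dict

# registry.py: model factories keyed by name.
MODEL_REGISTRY: Dict[str, Callable[[], object]] = {}


def register_model(name: str):
    """Decorator that records a model factory under `name`."""
    def decorator(factory):
        MODEL_REGISTRY[name] = factory
        return factory
    return decorator


class MeanBaseline:
    """Toy model (not from the repo): always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


@register_model("mean_baseline")
def make_mean_baseline():
    return MeanBaseline()


# training.py: a CV loop that only depends on the registry.
def run_training_job(X, y, n_splits: int = 3) -> Dict[str, float]:
    """Return the mean absolute error per registered model, averaged over folds."""
    scores = {}
    for name, factory in MODEL_REGISTRY.items():
        fold_errors = []
        for k in range(n_splits):
            test_idx = list(range(k, len(y), n_splits))  # every n-th row, offset k
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            model = factory().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            preds = model.predict([X[i] for i in test_idx])
            fold_errors.append(mean(abs(p - y[i]) for p, i in zip(preds, test_idx)))
        scores[name] = mean(fold_errors)
    return scores
```

With this shape, a unit test can call run_training_job on a handful of rows, and scripts/train_model.py only needs to parse arguments, load the data, and pass it in.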

Phase 4: App Integration

Goal: Ensure App uses the same logic as the pipeline.

  • Action: Update app.py to import clean_raw_runs from src/data/clean.py.

Phase 5: Cleanup

  • Action: Move scripts/train_baseline.py and scripts/convert_strava_activities.py to archive/ if unused.
  • Action: Remove src/pipeline.py (replaced by Makefile or simple script chaining).

5. Golden Path (Runbook)

1. Install

make install
source venv/bin/activate

2. Generate Features

# Uses src/features/generator.py
python scripts/make_features.py --dataset dhruva --inp data/raw/runs.csv

3. Train Model

# Uses src/models/training.py
python scripts/train_model.py --name dhruva --table 5k

4. Predict/Eval

python scripts/evaluate_model.py --name dhruva --split test

6. Risks & Mitigations

  • Risk: app.py breakage.
    • Mitigation: Run streamlit run app.py locally after Phase 2 and Phase 4.
  • Risk: Circular imports (e.g., models importing features importing config).
    • Mitigation: Keep config.py dependency-free and enforce a strict import hierarchy: models -> features -> data -> config. Each layer imports only from the layers to its right.
  • Risk: Path resolution in Streamlit vs CLI.
    • Mitigation: Use pathlib relative to __file__ in src/config.py to robustly find the repo root.
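
One way to keep the import hierarchy honest is a small static check that could run in the test suite. The sketch below is an assumption about how such a check might look, not existing repo code. It inspects only absolute imports of the form src.<layer>..., so relative imports and `from src import models` would need extra handling.

```python
import ast

# Lower number = lower layer. A module may import from its own layer
# or from layers below it (models -> features -> data -> config).
LAYERS = {"config": 0, "data": 1, "features": 2, "models": 3}


def layer_violations(package: str, source: str) -> list:
    """Return the src.* imports in `source` that point to a higher layer.

    `package` is the layer the source file belongs to, e.g. "data".
    """
    own_level = LAYERS[package]
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        else:
            continue
        for name in names:
            parts = name.split(".")
            # Unknown packages map to -1 and are never flagged.
            if len(parts) >= 2 and parts[0] == "src":
                if LAYERS.get(parts[1], -1) > own_level:
                    violations.append(name)
    return violations
```

A test could read each file under src/, pass its text with the name of its layer, and assert that the returned list is empty.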

7. Verification Checklist

  • pytest passes (existing tests).
  • python scripts/make_features.py ... produces identical output to before.
  • python scripts/train_model.py ... runs without error and saves models.
  • streamlit run app.py loads successfully.
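
The "identical output" item can be checked mechanically by hashing the feature file produced before and after the refactor. This is a sketch with hypothetical helper names. It compares bytes exactly, so a change in row order or float formatting would count as a difference even if the values are equivalent.

```python
import hashlib


def file_digest(path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(before, after) -> bool:
    """True if the two files have exactly the same bytes."""
    return file_digest(before) == file_digest(after)
```

In practice, copy the existing features CSV aside before starting Phase 2, rerun scripts/make_features.py after the move, and compare the two files with outputs_identical.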