The repository currently suffers from a "script-heavy" architecture in which core business logic (training loops, feature orchestration) is embedded in scripts/ or app.py rather than in reusable library modules. This leads to code duplication, testing difficulties, and fragile import hacks (e.g., sys.argv manipulation).
This refactor will move all reusable logic into src/, treating scripts/ strictly as CLI entry points. It will also standardize path handling and configuration.
- Data: `data/raw/` -> `scripts/ingest_local_csv.py` (implied) or `app.py` loading directly.
- Features: `scripts/make_features.py` calls `src.features` but also contains filtering logic. Output: `data/processed/features_*.csv`.
- Train: `scripts/train_model.py` defines pipelines, performs grid search, and saves to `models/`.
- Predict/Eval: `scripts/predict.py` and `scripts/evaluate_model.py`.
- Fat Scripts: `scripts/train_model.py` contains ~200 lines of model definitions, cross-validation loops, and plotting logic.
- Duplicated Logic: `app.py` reimplements run filtering/cleaning logic found partly in `src/features.py`.
- Import Hacks: `src/pipeline.py` mocks `sys.argv` to call `scripts.make_features.main`.
- Hardcoded Paths: `REPO_ROOT` is redefined in almost every file.
- Hidden Dependencies: `src/models.py` hardcodes paths to `data/processed/`.
Running-Optimizer/
├── archive/ # Deprecated scripts/modules
├── configs/ # YAML configs
├── data/ # Data artifacts (ignored by git)
├── scripts/ # Thin CLI wrappers
│ ├── make_features.py
│ ├── train_model.py
│ ├── evaluate_model.py
│ └── predict.py
├── src/
│ ├── __init__.py
│ ├── config.py # Centralized config & path definitions
│ ├── data/
│ │ ├── __init__.py
│ │ ├── io.py # Loaders/Savers (abstract CSV/Parquet paths)
│ │ └── clean.py # Domain cleaning (moving time, pause ratio)
│ ├── features/
│ │ ├── __init__.py
│ │ ├── generator.py # Orchestration (was make_features.py)
│ │ └── transformations.py # Core math (rolling windows, etc.)
│ ├── models/
│ │ ├── __init__.py
│ │ ├── registry.py # Model pipeline definitions
│ │ ├── training.py # CV loops, Grid Search logic
│ │ └── evaluation.py # Metrics, plots
│ └── visualization/ # Plotting helpers
├── tests/
└── app.py # Streamlit app (imports from src)
Goal: Remove REPO_ROOT duplication and centralize constants.
- Action: Create `src/config.py` defining `REPO_ROOT`, `DATA_DIR`, `MODELS_DIR`.
- Refactor: Update `src/utils.py` and others to import paths from `src/config.py`.
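
A minimal sketch of what `src/config.py` could contain, assuming the file sits exactly one directory below the repo root as in the layout above. The `RepoPaths` dataclass and `resolve_paths` helper are illustrative names, not existing code; the real module might simply expose module-level constants.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RepoPaths:
    """Hypothetical container for the shared path constants."""
    repo_root: Path
    data_dir: Path
    models_dir: Path


def resolve_paths(config_file) -> RepoPaths:
    """Derive repo paths from the location of src/config.py.

    Assumes config.py lives exactly one directory below the repo root.
    """
    repo_root = Path(config_file).resolve().parent.parent
    return RepoPaths(
        repo_root=repo_root,
        data_dir=repo_root / "data",
        models_dir=repo_root / "models",
    )


# Inside src/config.py this would be evaluated once at import time, e.g.:
#   PATHS = resolve_paths(__file__)
#   REPO_ROOT = PATHS.repo_root
#   DATA_DIR = PATHS.data_dir
#   MODELS_DIR = PATHS.models_dir
```

Because the root is derived from the file's own location rather than the working directory, the same constants should resolve identically whether the code is started from a CLI script or from Streamlit.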
Goal: Decouple feature generation from the script.
- Action: Move `app.py` cleaning logic to `src/data/clean.py` (function: `clean_raw_runs`).
- Action: Move `scripts/make_features.py` logic to `src/features/generator.py` (function: `generate_features_dataset`).
- Update: `scripts/make_features.py` becomes a 10-line wrapper.
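
The thin wrapper could look roughly like the sketch below. The `--dataset` and `--inp` flags come from the usage commands in this plan; the signature of `generate_features_dataset` and the output filename pattern are assumptions, and the function is replaced here by a stand-in so the example runs on its own.

```python
import argparse
from pathlib import Path


def generate_features_dataset(dataset: str, inp: Path) -> Path:
    """Stand-in for src.features.generator.generate_features_dataset.

    The signature and return value are assumptions; this stub only
    computes an output path so the sketch is self-contained.
    """
    return Path("data/processed") / f"features_{dataset}.csv"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Generate the features table.")
    parser.add_argument("--dataset", required=True, help="Dataset name")
    parser.add_argument("--inp", required=True, type=Path, help="Raw runs CSV")
    return parser


def main(argv=None) -> int:
    # Accepting argv explicitly lets tests and other modules call main()
    # without patching sys.argv.
    args = build_parser().parse_args(argv)
    out_path = generate_features_dataset(args.dataset, args.inp)
    print(f"Wrote {out_path}")
    return 0


# The real script would end with:
#   if __name__ == "__main__":
#       raise SystemExit(main())
```

Keeping argument parsing in the script and everything else in `src/` means the library function can be imported and tested directly, with no CLI involved.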
Goal: Make training testable and importable.
- Action: Move model definitions (Ridge, RF, pipelines) from `scripts/train_model.py` to `src/models/registry.py`.
- Action: Move the CV/GridSearch loop to `src/models/training.py` (function: `run_training_job`).
- Update: `scripts/train_model.py` becomes a wrapper calling `run_training_job`.
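
One possible shape for `src/models/registry.py` is a name-to-factory mapping, sketched below. The registry keys, the `register` decorator, `build_model`, and the hyperparameters are all illustrative assumptions; only the model families (Ridge, random forest) come from this plan. scikit-learn is imported inside the factories, so importing the registry itself stays cheap.

```python
from typing import Callable, Dict

# Maps a short model name to a zero-argument factory (hypothetical design).
MODEL_REGISTRY: Dict[str, Callable[[], object]] = {}


def register(name: str):
    """Decorator that records a factory under the given name."""
    def decorator(factory):
        MODEL_REGISTRY[name] = factory
        return factory
    return decorator


@register("ridge")
def make_ridge():
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    # Scaling before Ridge so the penalty treats features comparably.
    return make_pipeline(StandardScaler(), Ridge())


@register("rf")
def make_rf():
    from sklearn.ensemble import RandomForestRegressor
    return RandomForestRegressor(n_estimators=200, random_state=0)


def build_model(name: str):
    """Instantiate a registered model, failing loudly on unknown names."""
    try:
        factory = MODEL_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"Unknown model {name!r}; known models: {known}")
    return factory()
```

With this layout, `run_training_job` could iterate over registry names instead of hardcoding pipelines, and a unit test can check the registry contents without training anything.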
Goal: Ensure App uses the same logic as the pipeline.
- Action: Update `app.py` to import `clean_raw_runs` from `src/data/clean.py`.
- Action: Move `scripts/train_baseline.py` and `scripts/convert_strava_activities.py` to `archive/` if unused.
- Action: Remove `src/pipeline.py` (replaced by Makefile or simple script chaining).
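
If script chaining is preferred over a Makefile, a small runner along these lines could replace `src/pipeline.py` without touching `sys.argv`. This is only a sketch: `build_commands` and `run_pipeline` are hypothetical names, and the flags mirror the usage commands in this plan.

```python
import subprocess
import sys


def build_commands(name: str, inp: str, table: str, split: str = "test"):
    """Return the three stage commands in execution order."""
    py = sys.executable  # reuse the interpreter of the active environment
    return [
        [py, "scripts/make_features.py", "--dataset", name, "--inp", inp],
        [py, "scripts/train_model.py", "--name", name, "--table", table],
        [py, "scripts/evaluate_model.py", "--name", name, "--split", split],
    ]


def run_pipeline(name: str, inp: str, table: str, split: str = "test") -> None:
    """Run each stage as a separate process, from the repo root."""
    for cmd in build_commands(name, inp, table, split):
        # check=True aborts the chain as soon as one stage fails.
        subprocess.run(cmd, check=True)
```

Separating command construction from execution keeps the chaining logic testable without actually running the stages.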
1. Install

       make install
       source venv/bin/activate

2. Generate Features

       # Uses src/features/generator.py
       python scripts/make_features.py --dataset dhruva --inp data/raw/runs.csv

3. Train Model

       # Uses src/models/training.py
       python scripts/train_model.py --name dhruva --table 5k

4. Predict/Eval

       python scripts/evaluate_model.py --name dhruva --split test

- Risk: `app.py` breakage.
  - Mitigation: Run `streamlit run app.py` locally after Phase 2 and Phase 4.
- Risk: Circular imports (e.g., `models` importing `features` importing `config`).
  - Mitigation: Keep `config.py` dependency-free. Strict hierarchy: `models` -> `features` -> `data` -> `config`.
- Risk: Path resolution in Streamlit vs CLI.
  - Mitigation: Use `pathlib` relative to `__file__` in `src/config.py` to robustly find the repo root.
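
The import hierarchy named in the circular-import mitigation could be enforced with a small check like the one below. It is a sketch under assumptions: the import map is written by hand here, whereas a real check would need to derive it from the source files (for example by parsing them with `ast`), and `find_layer_violations` is a hypothetical helper.

```python
# Layers ordered from lowest (no internal dependencies) to highest.
LAYERS = ["config", "data", "features", "models"]


def find_layer_violations(imports):
    """Return (importer, imported) pairs that point up the hierarchy.

    `imports` maps a package name to the list of packages it imports.
    A package may only import packages from lower layers.
    """
    rank = {name: i for i, name in enumerate(LAYERS)}
    violations = []
    for importer, imported_names in imports.items():
        for imported in imported_names:
            if rank[imported] > rank[importer]:
                violations.append((importer, imported))
    return violations
```

Since every allowed import points strictly downward, a module graph with no violations cannot contain a cycle across these packages.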
- `pytest` passes (existing tests).
- `python scripts/make_features.py ...` produces identical output to before.
- `python scripts/train_model.py ...` runs without error and saves models.
- `streamlit run app.py` loads successfully.
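
The "identical output" check for feature generation could be automated by hashing the file written before the refactor and the one written after. A minimal sketch, assuming both files are kept on disk; `file_sha256` and `outputs_identical` are illustrative helper names.

```python
import hashlib


def file_sha256(path) -> str:
    """Hash a file in chunks so large CSVs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(before, after) -> bool:
    """True if the two files are byte-for-byte identical."""
    return file_sha256(before) == file_sha256(after)
```

A byte-level comparison is deliberately strict. If the refactor changes only column order or float formatting, a tolerance-based comparison of the loaded tables may be more appropriate.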