This repository contains the data processing and model training code used for ML studies on MINERvA events.
The processed dataset used by this project is available on Hugging Face:
- gregorkrzmanc/minerva-ml
- It is a preprocessed version of the MINERvA open data release for ML/physics tasks such as available-energy estimation and event tagging.
- Source data comes from the MINERvA Open Data release.
- This is a derived dataset and is not an official MINERvA collaboration product.
For detailed data fields and semantics, see DATASET.md. For model architecture details, see MODELS.md.
The typical workflow is:
- Download raw playlists
- Preprocess ROOT files into ML-ready tensors
- Split into train/val/test
- Train models locally or submit SLURM jobs
- Run test evaluation (`eval`) on checkpoints, then produce figures with `src.eval` (or notebooks)
Use the gkrz/minerva_ml:v1 container (see the provided Dockerfile).
Choose one of the following:
- Option A (from scratch): Download raw MINERvA playlists, then preprocess locally.
- Option B (quick start): Download the already preprocessed dataset from Hugging Face and skip local preprocessing.
Set SCRATCH first, then run:
```bash
# Monte Carlo playlists
python -m src.scripts.download_data

# Recorded data playlists
python -m src.scripts.download_data --prefix MediumEnergy_FHC_Data_Playlist
```

If you want to skip raw playlist processing, download the preprocessed dataset snapshot:
```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download gregorkrzmanc/minerva-ml \
    --repo-type dataset \
    --local-dir <HF_DATA_DIR>
```

After download, point your training/splitting commands to the downloaded folder structure.
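The same snapshot can be fetched from Python with the `huggingface_hub` API (assumes `pip install huggingface_hub`; `<HF_DATA_DIR>` is a placeholder path you choose). A minimal sketch:

```python
# Sketch: the Python-API equivalent of the huggingface-cli call above.

def snapshot_kwargs(local_dir: str) -> dict:
    """Arguments mirroring the CLI invocation above."""
    return {
        "repo_id": "gregorkrzmanc/minerva-ml",
        "repo_type": "dataset",  # --repo-type dataset
        "local_dir": local_dir,  # --local-dir
    }

# To actually download (network access required):
# from huggingface_hub import snapshot_download
# snapshot_download(**snapshot_kwargs("<HF_DATA_DIR>"))
```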
Skip this section if you used Option B and already have the preprocessed files you need.
Minimal invocation (creates `.pb` files with event-wise particle tensors and labels):

```bash
python -m src.scripts.preprocess_dataset --output-dir <OUTPUT_DIR>
```

For the full pipeline on this project's layout (preprocess, split playlists 1A/1B, and extract baselines), edit paths in the script if needed, then run:

```bash
bash src/scripts/preprocess.sh
```

`src/scripts/preprocess.sh` sets `DATA_DIR`, runs `preprocess_dataset` with blob/prong limits and playlist selection, runs `split_dataset` per playlist (with different val/test ratios for 1B vs 1A), and runs `extract_baselines` against the raw playlist directories under scratch.
```bash
python -m src.scripts.split_dataset \
    --input-dir <PREPROCESSED_DIR> \
    --output-dir <SPLIT_OUTPUT_DIR>
```

To inspect the created features quickly, see `notebooks/stats.ipynb`.
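The splitting itself is handled by `src.scripts.split_dataset`; as an illustration only (the ratios, seed handling, and file naming here are assumptions, not the script's actual behavior), a deterministic ratio-based split looks like this:

```python
import random

def split_files(files, val_frac=0.1, test_frac=0.1, seed=0):
    """Illustrative sketch: deterministically shuffle a file list and
    partition it into train/val/test by the given fractions."""
    files = sorted(files)          # fixed starting order
    rng = random.Random(seed)      # seeded for reproducibility
    rng.shuffle(files)
    n_val = int(len(files) * val_frac)
    n_test = int(len(files) * test_frac)
    return {
        "val": files[:n_val],
        "test": files[n_val:n_val + n_test],
        "train": files[n_val + n_test:],
    }

splits = split_files([f"events_{i:03d}.pb" for i in range(100)])
print({k: len(v) for k, v in splits.items()})  # {'val': 10, 'test': 10, 'train': 80}
```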
Use src/scripts/train.py for both regression and classification.
```bash
python -m src.scripts.train \
    -bs 2048 \
    --mode regression \
    -E-available-no-muon \
    -name Run_debug \
    --d_model 128 --depth 4 --n_heads 8 \
    --max_steps 500000 \
    --data_path <SPLIT_OUTPUT_DIR>
```

`src/jobs/submit_train_jobs.py` builds training commands, writes SLURM scripts, and submits them with `sbatch`.
Current defaults in that script:
- loops over `seed`, `data_cap`, `task`, and `model`
- uses `task in {regression, classifier}` (these map directly to `--mode` values)
- supports `model in {Transformer1, OLS, OLS_RW, OLM}`
- maps each `(data_cap, model)` to a SLURM walltime
- writes `.slurm`, `.log`, and `.error.log` files under fixed NERSC paths

Before running submission:
- Create a `.env` file in the repo root with the environment variables needed in your cluster job.
- Update hardcoded paths in `submit_train_jobs.py` if you are not using the default NERSC layout (for example `--data_path` in `generate_cmd`, `CKPT_DIR` in resume mode, and the `log_dir`/`error_dir`/`slurm_file` paths in `__main__`).
- Optionally edit `get_cmds_and_slurm_times()` to choose your model/task/data-cap sweep.
Then submit:
```bash
python src/jobs/submit_train_jobs.py
```

The script also includes `get_cmds_and_slurm_times_continue()` for checkpoint resume runs.
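The sweep structure described above (a loop over `seed`, `data_cap`, `task`, and `model` that produces one training command per combination) can be sketched as follows. The flag names `--model`, `--seed`, and `--data_cap`, and the example values, are assumptions for illustration; only `--mode` is confirmed by the training CLI above.

```python
from itertools import product

# Hypothetical sweep values, not the script's actual defaults.
SEEDS = [0, 1]
DATA_CAPS = [100000, 1000000]
TASKS = ["regression", "classifier"]   # map directly to --mode
MODELS = ["Transformer1", "OLS"]

def build_cmd(seed, data_cap, task, model):
    """Build one training command for a sweep point."""
    return (
        f"python -m src.scripts.train --mode {task} "
        f"--model {model} --seed {seed} --data_cap {data_cap}"
    )

cmds = [build_cmd(*combo) for combo in product(SEEDS, DATA_CAPS, TASKS, MODELS)]
print(len(cmds))  # 2 * 2 * 2 * 2 = 16
```

In the real script each command is additionally wrapped in a SLURM batch script with a walltime chosen per `(data_cap, model)` pair and submitted with `sbatch`.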
After training, group the runs you want to compare in Weights & Biases by assigning the same tag to each run (in the run’s overview or via the API). The tools below use that tag against the `minerva-models` project under your W&B entity (set `WANDB_ENTITY` and run `wandb login` as needed).
Generate the `python -m src.scripts.eval ...` commands for checkpoints that still need `test_results` (skipped if an `.npz` for that dataset already exists):

```bash
python -m src.scripts.print_eval_commands --wandb-flag <TAG>
```

`--wandb-flag` only lists runs whose checkpoint folder name matches a wandb run name with that tag; omit it to consider every run under `--ckpt-dir` (default: see `--help`). Run each printed line locally. Evaluation is very small and fast; it is fine to run on login nodes without a GPU job.
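The skip rule above (only emit an eval command when no `.npz` result exists yet) can be sketched like this; the `test_results_<dataset>.npz` filename is an assumption, and the real logic lives in `src.scripts.print_eval_commands`:

```python
from pathlib import Path
import tempfile

def needs_eval(ckpt_dir: str, dataset: str) -> bool:
    """Illustrative skip rule: an eval command is needed only if no
    cached .npz result exists for that dataset (filename assumed)."""
    return not (Path(ckpt_dir) / f"test_results_{dataset}.npz").exists()

with tempfile.TemporaryDirectory() as d:
    assert needs_eval(d, "1A")              # no cached results: emit command
    Path(d, "test_results_1A.npz").touch()  # simulate a finished evaluation
    assert not needs_eval(d, "1A")          # cached: skip
```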
Offline plotting reads cached pickles under out/ (by default) and writes PDFs under plots/ (by default). Run from the repository root with PYTHONPATH set to the repo (or install the package in editable mode) so imports resolve.
1. Cache eval inputs (loads checkpoints / W&B histories; paths are configurable in `src/eval/_constants.py`):

```bash
export PYTHONPATH="$PWD"
python -m src.eval.collect_eval_data --flag Run_2703 --out-dir /global/cfs/cdirs/m3246/gregork/Minerva/runs/
```

This writes `out/classification_<TAG>.pkl` and `out/regression_<TAG>.pkl`. The `--out-dir` flag is optional.
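Given the naming above, the cached pickle paths for a tag can be derived directly; a small sketch (the `cache_paths` helper is hypothetical, not part of the repository):

```python
from pathlib import Path

def cache_paths(tag: str, out_dir: str = "out") -> dict:
    """Paths of the pickles written by src.eval.collect_eval_data,
    following the naming stated above (helper is illustrative)."""
    out = Path(out_dir)
    return {
        "classification": out / f"classification_{tag}.pkl",
        "regression": out / f"regression_{tag}.pkl",
    }

print(cache_paths("Run_2703")["regression"])  # out/regression_Run_2703.pkl
```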
Optional: download the latest cached model outputs
- You can download the latest evaluation-data files from https://huggingface.co/datasets/gregorkrzmanc/minerva-ml-eval and set `$OUT` to the path where you downloaded them.
2. Generate PDFs (each script accepts `--flag`, `--out-dir`, and `--plots-dir`; defaults match the layout below). Plotting code lives under `src.eval.classification_plots` and `src.eval.e_available_plots` (several modules each); notebooks may still use `from eval_*_plots import …` via thin shims in `notebooks/`.

```bash
export FLAG=Run_2703
export OUT=/global/cfs/cdirs/m3246/gregork/Minerva/runs/

python -m src.eval.plot_steps --flag $FLAG --out-dir $OUT                 # training curves → plots/steps_combined/ (1×2 clf|reg, one legend); add --separate-panels for plots/{classification,regression}/steps/
python -m src.eval.plot_regression --flag $FLAG --out-dir $OUT            # energy / q₃ / scaling → plots/regression/
python -m src.eval.plot_classification_W --flag $FLAG --out-dir $OUT      # vs hadronic W → plots/classification/w_bins/
python -m src.eval.plot_classification_q3 --flag $FLAG --out-dir $OUT     # vs q₃, CCNπ, light appendix → plots/classification/q3/ and .../light/
python -m src.eval.plot_classification_Pions --flag $FLAG --out-dir $OUT  # pion kinematics, CC1π⁰, light appendix → plots/classification/pions/ and .../light/
python -m src.eval.plot_classification_light --flag $FLAG --out-dir $OUT  # light PDFs only → plots/classification/light/ (see --components)
```

3. Figures for a LaTeX paper (copies or single-page extracts into `figures_latex/`):
```bash
python -m src.scripts.copy_figures_for_paper
# or: python -m src.scripts.copy_figures_for_paper --dry-run
```

To render event displays from a ROOT file:

```bash
python -m src.scripts.make_event_displays \
    --input_file <PATH_TO_ROOT_FILE> \
    --output_dir <PATH_TO_OUTPUT_DIR> \
    --n_events 10
```