This repository contains the data processing and model training code used for ML studies on MINERvA events.
The processed dataset used by this project is available on Hugging Face:
- gregorkrzmanc/minerva-ml
- It is a preprocessed version of the MINERvA open data release for ML/physics tasks such as available-energy estimation and event tagging.
- Source data comes from the MINERvA Open Data release.
- This is a derived dataset and is not an official MINERvA collaboration product.
For detailed data fields and semantics, see DATASET.md. For model architecture details, see MODELS.md.
The typical workflow is:
- Download raw playlists
- Preprocess ROOT files into ML-ready tensors
- Split into train/val/test
- Train models locally or submit SLURM jobs
- Run test evaluation (`eval`) on checkpoints, then analyze with notebooks
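The workflow above can be sketched as a dry run that only prints each stage's command line. The directory names here are hypothetical placeholders, not paths the repository defines:

```python
# Dry-run sketch of the workflow: builds and prints each stage's command line.
# Directory names are hypothetical placeholders; swap in your own paths.
stages = [
    ["python", "-m", "src.scripts.download_data"],                                   # download raw playlists
    ["python", "-m", "src.scripts.preprocess_dataset", "--output-dir", "data/pre"],  # ROOT -> ML-ready tensors
    ["python", "-m", "src.scripts.split_dataset",
     "--input-dir", "data/pre", "--output-dir", "data/split"],                       # train/val/test split
    ["python", "-m", "src.scripts.train", "--mode", "regression",
     "--data_path", "data/split"],                                                   # train a model
    ["python", "-m", "src.scripts.print_eval_commands"],                             # list pending eval commands
]
for cmd in stages:
    print(" ".join(cmd))  # swap print for subprocess.run(cmd, check=True) to execute
```

Replacing `print` with `subprocess.run` turns the sketch into an actual pipeline driver.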
Choose one of the following:
- Option A (from scratch): Download raw MINERvA playlists, then preprocess locally.
- Option B (quick start): Download the already preprocessed dataset from Hugging Face and skip local preprocessing.
Set `SCRATCH` first, then run:

```bash
# Monte Carlo playlists
python -m src.scripts.download_data

# Recorded data playlists
python -m src.scripts.download_data --prefix MediumEnergy_FHC_Data_Playlist
```

If you want to skip raw playlist processing, download the preprocessed dataset snapshot:
```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download gregorkrzmanc/minerva-ml \
    --repo-type dataset \
    --local-dir <HF_DATA_DIR>
```

After download, point your training/splitting commands to the downloaded folder structure.
Skip this section if you used Option B and already have the preprocessed files you need.
Minimal invocation (creates `.pb` files with event-wise particle tensors and labels):

```bash
python -m src.scripts.preprocess_dataset --output-dir <OUTPUT_DIR>
```

For a full pipeline on this project's layout (preprocess, split playlists 1A/1B, and extract baselines), edit paths in the script if needed, then run:

```bash
bash src/scripts/preprocess.sh
```

`src/scripts/preprocess.sh` sets `DATA_DIR`, runs `preprocess_dataset` with blob/prong limits and playlist selection, runs `split_dataset` per playlist (with different val/test ratios for 1B vs 1A), and runs `extract_baselines` against the raw playlist directories under scratch.
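The per-playlist splitting the script performs can be pictured with a small sketch. The ratios and flag names below are placeholders for illustration, not the values hard-coded in `preprocess.sh`:

```python
# Hypothetical per-playlist split settings; preprocess.sh hard-codes its own.
PLAYLIST_SPLITS = {
    "1A": {"val": 0.10, "test": 0.10},
    "1B": {"val": 0.05, "test": 0.50},  # 1B uses different val/test ratios than 1A
}
for playlist, frac in PLAYLIST_SPLITS.items():
    print(
        f"python -m src.scripts.split_dataset"
        f" --input-dir data/pre/{playlist} --output-dir data/split/{playlist}"
        f" --val-frac {frac['val']} --test-frac {frac['test']}"  # flag names assumed
    )
```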
```bash
python -m src.scripts.split_dataset \
    --input-dir <PREPROCESSED_DIR> \
    --output-dir <SPLIT_OUTPUT_DIR>
```

To inspect the created features quickly, see `notebooks/stats.ipynb`.
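A minimal sketch of what a deterministic file-level split can look like; the real `src.scripts.split_dataset` may use different logic and ratios:

```python
import random

def split_files(files, val_frac=0.1, test_frac=0.1, seed=0):
    """Deterministically shuffle files, then carve off val/test slices.

    A hypothetical sketch only; not the project's actual split logic.
    """
    files = sorted(files)
    random.Random(seed).shuffle(files)  # seeded shuffle -> reproducible split
    n = len(files)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    return {
        "val": files[:n_val],
        "test": files[n_val:n_val + n_test],
        "train": files[n_val + n_test:],
    }

splits = split_files([f"events_{i:03d}.pb" for i in range(100)])
print({k: len(v) for k, v in splits.items()})  # -> {'val': 10, 'test': 10, 'train': 80}
```

Seeding the shuffle keeps the split stable across reruns, which matters when checkpoints are evaluated against a fixed test set later.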
Use `src/scripts/train.py` for both regression and classification.

```bash
python -m src.scripts.train \
    -bs 2048 \
    --mode regression \
    -E-available-no-muon \
    -name Run_debug \
    --d_model 128 --depth 4 --n_heads 8 \
    --max_steps 500000 \
    --data_path <SPLIT_OUTPUT_DIR>
```

`src/jobs/submit_train_jobs.py` builds training commands, writes SLURM scripts, and submits them with `sbatch`.
Current defaults in that script:

- loops over `seed`, `data_cap`, `task`, and `model`
- uses `task in {regression, classifier}` (these map directly to `--mode` values)
- supports `model in {Transformer1, OLS, OLS_RW, OLM}`
- maps each `(data_cap, model)` to a SLURM walltime
- writes `.slurm`, `.log`, and `.error.log` files under fixed NERSC paths
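The sweep the script performs can be sketched as a product over those axes. The concrete seeds, caps, walltime rule, and extra flag names below are placeholders, not the script's actual defaults:

```python
import itertools

# Hypothetical sweep values; submit_train_jobs.py defines its own.
SEEDS = [0, 1]
DATA_CAPS = [100_000, 1_000_000]
TASKS = ["regression", "classifier"]              # map directly to --mode
MODELS = ["Transformer1", "OLS", "OLS_RW", "OLM"]

def walltime(data_cap, model):
    # Placeholder rule: big caps on the transformer get a longer walltime.
    hours = 24 if (model == "Transformer1" and data_cap >= 1_000_000) else 4
    return f"{hours:02d}:00:00"

jobs = []
for seed, cap, task, model in itertools.product(SEEDS, DATA_CAPS, TASKS, MODELS):
    cmd = (f"python -m src.scripts.train --mode {task}"
           f" --data_cap {cap} --seed {seed} --model {model}")  # flag names assumed
    jobs.append((cmd, walltime(cap, model)))

print(len(jobs))  # 2 seeds * 2 caps * 2 tasks * 4 models = 32 jobs
```

Each `(cmd, walltime)` pair would then be wrapped in a SLURM script and handed to `sbatch`.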
Before running submission:

- Create a `.env` file in the repo root with the environment variables needed in your cluster job.
- Update hardcoded paths in `submit_train_jobs.py` if you are not using the default NERSC layout (for example `--data_path` in `generate_cmd`, `CKPT_DIR` in resume mode, and the `log_dir`/`error_dir`/`slurm_file` paths in `__main__`).
- Optionally edit `get_cmds_and_slurm_times()` to choose your model/task/data-cap sweep.
Then submit:

```bash
python src/jobs/submit_train_jobs.py
```

The script also includes `get_cmds_and_slurm_times_continue()` for checkpoint resume runs.
After training, group the runs you want to compare in Weights & Biases by assigning the same tag to each run (in the run's overview or via the API). The tools below use that tag against the `minerva-models` project under your W&B entity (set `WANDB_ENTITY` and run `wandb login` as needed).
Generate the `python -m src.scripts.eval ...` commands for checkpoints that still need `test_results` (a checkpoint is skipped if an `.npz` for that dataset already exists):

```bash
python -m src.scripts.print_eval_commands --wandb-flag <TAG>
```

`--wandb-flag` only lists runs whose checkpoint folder name matches a wandb run name carrying that tag; omit it to consider every run under `--ckpt-dir` (default: see `--help`). Run each printed line locally. Evaluation is small and fast, so it is fine to run on login nodes without a GPU job.
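The skip rule described above can be sketched as follows; the directory layout and checkpoint filename here are hypothetical, not the project's actual structure:

```python
import tempfile
from pathlib import Path

# Hypothetical sketch: emit an eval command only when no .npz result exists yet.
def pending_eval_commands(ckpt_dir, results_dir):
    cmds = []
    for ckpt in sorted(Path(ckpt_dir).glob("*/checkpoint.pt")):
        if (Path(results_dir) / f"{ckpt.parent.name}.npz").exists():
            continue  # test_results already exist for this run -> skip
        cmds.append(f"python -m src.scripts.eval --ckpt {ckpt}")
    return cmds

with tempfile.TemporaryDirectory() as tmp:
    ckpts, results = Path(tmp, "ckpts"), Path(tmp, "results")
    results.mkdir()
    for run in ("run_a", "run_b"):
        (ckpts / run).mkdir(parents=True)
        (ckpts / run / "checkpoint.pt").touch()
    (results / "run_a.npz").touch()  # run_a is already evaluated
    cmds = pending_eval_commands(ckpts, results)
print(len(cmds))  # only run_b still needs evaluation -> 1
```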
- Open `notebooks/Eval_Classification.ipynb` or `notebooks/Eval_Regression.ipynb`.
- Set the `WANDB_TAG` variable at the top to match your tag (both notebooks use this to query runs).
- Run all cells.
To run the same notebooks headlessly from the repo root (requires `nbconvert`; e.g. `pip install nbconvert`):
```bash
cd notebooks

# Execute (updates notebooks in place with fresh outputs)
jupyter nbconvert --to notebook --execute Eval_Regression.ipynb --inplace
jupyter nbconvert --to notebook --execute Eval_Classification.ipynb --inplace
jupyter nbconvert --to notebook --execute Eval_Classification_Light.ipynb --inplace

# Export static copies (HTML; use --to pdf instead if pandoc/LaTeX are available, or --to webpdf with Chromium)
jupyter nbconvert --to html Eval_Regression.ipynb
jupyter nbconvert --to html Eval_Classification.ipynb
```

Classification evaluation covers tagging and related metrics; regression evaluation covers energy-scale and scaling plots. Figures and PDFs are written under the paths configured in each notebook (typically under `out/`).
To render event displays from a raw ROOT file:

```bash
python -m src.scripts.make_event_displays \
    --input_file <PATH_TO_ROOT_FILE> \
    --output_dir <PATH_TO_OUTPUT_DIR> \
    --n_events 10
```