This repository contains the Jupyter Notebook and supplementary materials for the paper "Analyzing GPU Utilization in HPC Workloads: Insights from Large-Scale Systems." This study examines GPU resource utilization patterns on the Perlmutter supercomputer, highlighting temporal and spatial imbalances and their impact on workload efficiency.
All datasets are stored under the directory `/pscratch/sd/e/esencan/`:

- Zipped Raw DCGM Time Series Data: `dcgm_extended_july_all.tar.gz` is under `/pscratch/sd/e/esencan/`.
- Unzipped Raw DCGM Time Series Data: unzipped files are under `/pscratch/sd/e/esencan/extracted_dcgm_jobs_data_additional_metrics/dcgm_2/`. This directory contains `.pkl` files for each day of July 2024 (one file per day).
- SLURM Job Data: `perlmutter_gpu_jobs_july_2024.csv` is located under `/pscratch/sd/e/esencan/`.
- Feature-Extracted DCGM Data: `tsfresh_feature_extracted_all_jobs_minimal_features_corrected_node_id_mem_util.parquet` is located under `/pscratch/sd/e/esencan/`.
To request access to these datasets, please contact Efe Sencan.
- 📁 `scripts/`: Scripts for submitting jobs, extracting features, and workload analysis.
  - `submit_jobs.sh`: Bash script for daily feature extraction.
  - `extract_features_day.py`: Python script for time-series feature extraction.
  - `workload_analysis.py`: Script for analyzing workload patterns from extracted features.
- 📁 `notebooks/`:
  - `NERSC_data_analysis.ipynb`: Jupyter Notebook with core analysis and results.
  - `inspect_dcgm_metrics.ipynb`: Unzips `.tar.gz` files, loads `.pkl` files, and inspects columns.
  - `feature_extraction.ipynb`: Combines daily extracted features into a single Parquet file.
- 📄 `requirements.txt`: Python package dependencies.
- 📄 `README.md`: This file.
- Python 3.9+
- Jupyter Notebook
- pandas
- numpy
- matplotlib
- seaborn
- tsfresh (for time-series feature extraction)
Install dependencies using:
```bash
pip install -r requirements.txt
```

The study uses data from the Perlmutter supercomputer collected during July 2024:
- DCGM Telemetry: Metrics collected every 10 seconds from four NVIDIA A100 GPUs per node.
- SLURM Metadata: Job-level details including submission times, durations, and allocated resources.
- Filtering: Only jobs under the 'regular' QoS were included, totaling 118,276 jobs.
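The QoS filtering step can be sketched with pandas. The column names below (`qos`, `num_nodes`) are assumptions about the CSV schema for illustration; the real input is `perlmutter_gpu_jobs_july_2024.csv`.

```python
import pandas as pd

# Hypothetical subset of the SLURM job table; the real column
# names in perlmutter_gpu_jobs_july_2024.csv may differ.
jobs = pd.DataFrame({
    "job_id": [101, 102, 103, 104],
    "qos": ["regular", "debug", "regular", "preempt"],
    "num_nodes": [2, 1, 8, 4],
})

# Keep only jobs submitted under the 'regular' QoS, as in the study.
regular_jobs = jobs[jobs["qos"] == "regular"].reset_index(drop=True)
print(len(regular_jobs))  # number of retained jobs
```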
Since the raw time series dataset was provided as a compressed `.tar.gz` file, `inspect_dcgm_metrics.ipynb` was created to:

- Unzip `dcgm_extended_july_all.tar.gz` into `extracted_dcgm_jobs_data_additional_metrics/`.
- Load daily `.pkl` files from `dcgm_2/`.
- Print available column names and inspect the data structure for one sample job.
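The load-and-inspect portion can be sketched as follows (the untar step is omitted, and the column names are illustrative assumptions, not the actual DCGM schema):

```python
import os
import tempfile
import pandas as pd

# Stand-in for one daily DCGM pickle; real files live under
# extracted_dcgm_jobs_data_additional_metrics/dcgm_2/.
day = pd.DataFrame({
    "job_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2024-07-01 00:00:00",
                                 "2024-07-01 00:00:10",
                                 "2024-07-01 00:00:00"]),
    "gpu_util": [55.0, 60.0, 90.0],
})

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "dcgm_2024-07-01.pkl")
    day.to_pickle(path)             # mimic one extracted daily file
    loaded = pd.read_pickle(path)   # what the notebook does per day

print(list(loaded.columns))         # inspect available columns
sample = loaded[loaded["job_id"] == 1]  # one sample job's time series
```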
- Used `tsfresh` to extract 17 statistical features from each DCGM metric.
- Generated 408 feature columns from time-series telemetry.
- Stored results in Parquet format for efficient analysis.
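`tsfresh` performs the actual extraction; as a rough stand-in, the same kind of per-job, per-metric statistics can be computed directly with pandas. The statistics and `gpu_util__<stat>` naming below are illustrative examples, not the exact 17 features used.

```python
import pandas as pd

# Illustrative per-job DCGM samples; column names are assumptions.
ts = pd.DataFrame({
    "job_id": [1, 1, 1, 2, 2, 2],
    "gpu_util": [50.0, 60.0, 70.0, 90.0, 90.0, 90.0],
})

# A few of the kinds of per-metric statistics tsfresh derives;
# applied across all DCGM metrics this yields one wide feature row per job.
feats = ts.groupby("job_id")["gpu_util"].agg(
    ["mean", "std", "min", "max", "median"]
)
feats.columns = [f"gpu_util__{c}" for c in feats.columns]
print(feats)
```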
Because extracting features from the concatenated time series of all jobs is time-consuming, the extraction process is split by day.
- Step 1: Navigate to the `scripts/` directory.
- Step 2: Make the script executable:

  ```bash
  chmod +x submit_jobs.sh
  ```

- Step 3: Submit jobs:

  ```bash
  bash submit_jobs.sh
  ```

The `submit_jobs.sh` script submits Slurm jobs for daily feature extraction. Each job:

- Loads the required Python environment (`feature_extraction_env` by default; update as needed).
- Runs `extract_features_day.py` on daily `.pkl` files.
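The per-day split amounts to one extraction job per July 2024 date. A minimal sketch of enumerating the daily inputs (the file-name pattern is a hypothetical example, not the repository's actual naming):

```python
from datetime import date, timedelta

# July 2024 has 31 days; submit_jobs.sh submits one Slurm job per day.
days = [date(2024, 7, 1) + timedelta(days=i) for i in range(31)]

# Hypothetical daily input files handed to extract_features_day.py.
daily_pkls = [f"dcgm_{d.isoformat()}.pkl" for d in days]
print(len(daily_pkls))  # → 31
```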
The notebook `feature_extraction.ipynb`:

- Reads the extracted feature files.
- Merges them into a single Parquet file for analysis.
- Outputs the combined file to the specified directory.
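The merge step can be sketched with `pandas.concat`; the tiny frames below stand in for the per-day feature tables, and the output file name is hypothetical.

```python
import pandas as pd

# Stand-ins for two daily feature tables produced by the extraction jobs.
day1 = pd.DataFrame({"job_id": [1, 2], "gpu_util__mean": [60.0, 90.0]})
day2 = pd.DataFrame({"job_id": [3], "gpu_util__mean": [75.0]})

# Concatenate the daily tables into one combined frame.
merged = pd.concat([day1, day2], ignore_index=True)

# Final step in the notebook (requires pyarrow or fastparquet):
# merged.to_parquet("all_jobs_features.parquet")
print(len(merged))
```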
- Calculated temporal imbalance factors using:
  - Coefficient of Variation (CV)
  - Linear Trend (via linear regression slope)
  - Mean Absolute Change (MAC)
- Combined these metrics into a merged temporal imbalance factor.
- Computed intra-node and inter-node imbalance using:
  - Normalized Range (NR)
  - Variance of Utilization
- Derived merged spatial imbalance factors for both intra-node and inter-node imbalances.
- Analyzed relationships between temporal and spatial imbalances, maximum utilization metrics, and job node hours.
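The individual imbalance factors above can be sketched as plain NumPy computations on synthetic data; the exact normalization and merging used in the paper may differ.

```python
import numpy as np

# --- Temporal imbalance for one GPU-utilization time series ---
util = np.array([50.0, 55.0, 60.0, 58.0, 62.0])

cv = util.std() / util.mean()                         # Coefficient of Variation
slope = np.polyfit(np.arange(len(util)), util, 1)[0]  # Linear Trend (slope)
mac = np.abs(np.diff(util)).mean()                    # Mean Absolute Change

# --- Spatial (intra-node) imbalance across the four A100 GPUs of a node ---
per_gpu = np.array([80.0, 80.0, 80.0, 80.0])          # mean utilization per GPU

nr = (per_gpu.max() - per_gpu.min()) / per_gpu.mean() # Normalized Range
var = per_gpu.var()                                   # Variance of Utilization

print(cv, slope, mac, nr, var)  # perfectly balanced GPUs give NR = var = 0
```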
- GPU Utilization: 62.75% of jobs reach at least 75% GPU utilization, contributing to 80.01% of node hours.
- Memory Utilization: Low utilization observed, with 37.12% of jobs never exceeding 15% memory usage.
- Temporal Imbalance: Minimal for most jobs, but a subset shows significant fluctuations.
- Spatial Imbalance: More pronounced intra-node than inter-node disparities.
- Correlation Findings: Stable memory utilization correlates with higher GPU efficiency.
```bash
jupyter notebook notebooks/NERSC_data_analysis.ipynb
```

Follow the steps in the notebook to reproduce the analysis and generate figures.
- Figure 1: Distribution of jobs and node hours for GPU and memory utilization.
- Figure 2: CDF and PDF of temporal imbalance factors for GPU metrics.
- Figure 3: Spatial imbalance CDFs and PDFs.
- Figure 4: Correlation matrix showing relationships between imbalance factors and utilization metrics.
If you use this repository for your research, please cite our paper:
```bibtex
@incollection{sencan2025analyzing,
  title={Analyzing GPU Utilization in HPC Workloads: Insights from Large-Scale Systems},
  author={Sencan, Efe and Kulkarni, Dhruva and Coskun, Ayse and Konate, Kadidia},
  booktitle={Practice and Experience in Advanced Research Computing 2025: The Power of Collaboration},
  pages={1--8},
  year={2025}
}
```

For any questions, please contact:
- Author: Efe Sencan
- Email: esencan@bu.edu
- Affiliation: Boston University / National Energy Research Scientific Computing Center (NERSC)
License: This repository is licensed under the MIT License. See LICENSE for details.