This repository contains the Jupyter Notebook and supplementary materials for the paper "Analyzing GPU Utilization in HPC Workloads: Insights from Large-Scale Systems." This study examines GPU resource utilization patterns on the Perlmutter supercomputer, highlighting temporal and spatial imbalances and their impact on workload efficiency.
All datasets are stored under the directory `/pscratch/sd/e/esencan/`:

- Zipped Raw DCGM Time Series Data: `dcgm_extended_july_all.tar.gz` is under `/pscratch/sd/e/esencan/`.
- Unzipped Raw DCGM Time Series Data: unzipped files are under `/pscratch/sd/e/esencan/extracted_dcgm_jobs_data_additional_metrics/dcgm_2/`. This directory contains `.pkl` files for each day of July 2024 (one file per day).
- SLURM Job Data: `perlmutter_gpu_jobs_july_2024.csv` is located under `/pscratch/sd/e/esencan/`.
- Feature-Extracted DCGM Data: `tsfresh_feature_extracted_all_jobs_minimal_features_corrected_node_id_mem_util.parquet` is located under `/pscratch/sd/e/esencan/`.
To request access to these datasets, please contact Efe Sencan.
- 📁 `scripts/`: Scripts for submitting jobs, extracting features, and workload analysis.
  - `submit_jobs.sh`: Bash script for daily feature extraction.
  - `extract_features_day.py`: Python script for time-series feature extraction.
  - `workload_analysis.py`: Script for analyzing workload patterns from extracted features.
- 📁 `notebooks/`:
  - `NERSC_data_analysis.ipynb`: Jupyter Notebook with core analysis and results.
  - `inspect_dcgm_metrics.ipynb`: Unzips `.tar.gz` files, loads `.pkl` files, and inspects columns.
  - `feature_extraction.ipynb`: Combines daily extracted features into a single Parquet file.
- 📄 `requirements.txt`: Python package dependencies.
- 📄 `README.md`: This file.
- Python 3.9+
- Jupyter Notebook
- pandas
- numpy
- matplotlib
- seaborn
- tsfresh (for time-series feature extraction)
Install dependencies using:
```bash
pip install -r requirements.txt
```

The study uses data from the Perlmutter supercomputer collected during July 2024:
- DCGM Telemetry: Metrics collected every 10 seconds from four NVIDIA A100 GPUs per node.
- SLURM Metadata: Job-level details including submission times, durations, and allocated resources.
- Filtering: Only jobs under the 'regular' QoS were included, totaling 118,276 jobs.
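The QoS filtering step can be sketched with pandas. The column names below (`qos`, `num_nodes`) are assumptions about the CSV schema for illustration; the real input is `perlmutter_gpu_jobs_july_2024.csv`.

```python
import pandas as pd

# Hypothetical subset of the SLURM job table; the real column
# names in perlmutter_gpu_jobs_july_2024.csv may differ.
jobs = pd.DataFrame({
    "job_id": [101, 102, 103, 104],
    "qos": ["regular", "debug", "regular", "preempt"],
    "num_nodes": [2, 1, 8, 4],
})

# Keep only jobs submitted under the 'regular' QoS, as in the study.
regular_jobs = jobs[jobs["qos"] == "regular"].reset_index(drop=True)
print(len(regular_jobs))  # number of retained jobs
```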
Since the raw time series dataset was provided as a compressed `.tar.gz` file, `inspect_dcgm_metrics.ipynb` was created to:

- Unzip `dcgm_extended_july_all.tar.gz` into `extracted_dcgm_jobs_data_additional_metrics/`.
- Load daily `.pkl` files from `dcgm_2/`.
- Print available column names and inspect the data structure for one sample job.
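The load-and-inspect portion can be sketched as follows (the untar step is omitted, and the column names are illustrative assumptions, not the actual DCGM schema):

```python
import os
import tempfile
import pandas as pd

# Stand-in for one daily DCGM pickle; real files live under
# extracted_dcgm_jobs_data_additional_metrics/dcgm_2/.
day = pd.DataFrame({
    "job_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2024-07-01 00:00:00",
                                 "2024-07-01 00:00:10",
                                 "2024-07-01 00:00:00"]),
    "gpu_util": [55.0, 60.0, 90.0],
})

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "dcgm_2024-07-01.pkl")
    day.to_pickle(path)             # mimic one extracted daily file
    loaded = pd.read_pickle(path)   # what the notebook does per day

print(list(loaded.columns))         # inspect available columns
sample = loaded[loaded["job_id"] == 1]  # one sample job's time series
```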
- Used `tsfresh` to extract 17 statistical features from each DCGM metric.
- Generated 408 feature columns from time-series telemetry.
- Stored results in Parquet format for efficient analysis.
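`tsfresh` performs the actual extraction; as a rough stand-in, the same kind of per-job, per-metric statistics can be computed directly with pandas. The statistics and `gpu_util__<stat>` naming below are illustrative examples, not the exact 17 features used.

```python
import pandas as pd

# Illustrative per-job DCGM samples; column names are assumptions.
ts = pd.DataFrame({
    "job_id": [1, 1, 1, 2, 2, 2],
    "gpu_util": [50.0, 60.0, 70.0, 90.0, 90.0, 90.0],
})

# A few of the kinds of per-metric statistics tsfresh derives;
# applied across all DCGM metrics this yields one wide feature row per job.
feats = ts.groupby("job_id")["gpu_util"].agg(
    ["mean", "std", "min", "max", "median"]
)
feats.columns = [f"gpu_util__{c}" for c in feats.columns]
print(feats)
```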
Because extracting features from the concatenated time series of all jobs is time-consuming, the extraction process is split by day.
- Step 1: Navigate to the `scripts/` directory.
- Step 2: Make the script executable:

  ```bash
  chmod +x submit_jobs.sh
  ```

- Step 3: Submit jobs:

  ```bash
  bash submit_jobs.sh
  ```

The `submit_jobs.sh` script submits Slurm jobs for daily feature extraction. Each job:

- Loads the required Python environment (`feature_extraction_env` by default; update as needed).
- Runs `extract_features_day.py` on daily `.pkl` files.
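The per-day split amounts to one extraction job per July 2024 date. A minimal sketch of enumerating the daily inputs (the file-name pattern is a hypothetical example, not the repository's actual naming):

```python
from datetime import date, timedelta

# July 2024 has 31 days; submit_jobs.sh submits one Slurm job per day.
days = [date(2024, 7, 1) + timedelta(days=i) for i in range(31)]

# Hypothetical daily input files handed to extract_features_day.py.
daily_pkls = [f"dcgm_{d.isoformat()}.pkl" for d in days]
print(len(daily_pkls))  # → 31
```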
The notebook `feature_extraction.ipynb`:

- Reads the extracted feature files.
- Merges them into a single Parquet file for analysis.
- Outputs the combined file to the specified directory.
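The merge step can be sketched with `pandas.concat`; the tiny frames below stand in for the per-day feature tables, and the output file name is hypothetical.

```python
import pandas as pd

# Stand-ins for two daily feature tables produced by the extraction jobs.
day1 = pd.DataFrame({"job_id": [1, 2], "gpu_util__mean": [60.0, 90.0]})
day2 = pd.DataFrame({"job_id": [3], "gpu_util__mean": [75.0]})

# Concatenate the daily tables into one combined frame.
merged = pd.concat([day1, day2], ignore_index=True)

# Final step in the notebook (requires pyarrow or fastparquet):
# merged.to_parquet("all_jobs_features.parquet")
print(len(merged))
```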
- Calculated temporal imbalance factors using:
  - Coefficient of Variation (CV)
  - Linear Trend (via linear regression slope)
  - Mean Absolute Change (MAC)
- Combined these metrics into a merged temporal imbalance factor.
- Computed intra-node and inter-node imbalance using:
  - Normalized Range (NR)
  - Variance of Utilization
- Derived merged spatial imbalance factors for both intra-node and inter-node imbalances.
- Analyzed relationships between temporal and spatial imbalances, maximum utilization metrics, and job node hours.
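The individual imbalance factors above can be sketched as plain NumPy computations on synthetic data; the exact normalization and merging used in the paper may differ.

```python
import numpy as np

# --- Temporal imbalance for one GPU-utilization time series ---
util = np.array([50.0, 55.0, 60.0, 58.0, 62.0])

cv = util.std() / util.mean()                         # Coefficient of Variation
slope = np.polyfit(np.arange(len(util)), util, 1)[0]  # Linear Trend (slope)
mac = np.abs(np.diff(util)).mean()                    # Mean Absolute Change

# --- Spatial (intra-node) imbalance across the four A100 GPUs of a node ---
per_gpu = np.array([80.0, 80.0, 80.0, 80.0])          # mean utilization per GPU

nr = (per_gpu.max() - per_gpu.min()) / per_gpu.mean() # Normalized Range
var = per_gpu.var()                                   # Variance of Utilization

print(cv, slope, mac, nr, var)  # perfectly balanced GPUs give NR = var = 0
```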
- GPU Utilization: 62.75% of jobs reach at least 75% GPU utilization, contributing to 80.01% of node hours.
- Memory Utilization: Low utilization observed, with 37.12% of jobs never exceeding 15% memory usage.
- Temporal Imbalance: Minimal for most jobs, but a subset shows significant fluctuations.
- Spatial Imbalance: More pronounced intra-node than inter-node disparities.
- Correlation Findings: Stable memory utilization correlates with higher GPU efficiency.
```bash
jupyter notebook notebooks/NERSC_data_analysis.ipynb
```

Follow the steps in the notebook to reproduce the analysis and generate figures.
- Figure 1: Distribution of jobs and node hours for GPU and memory utilization.
- Figure 2: CDF and PDF of temporal imbalance factors for GPU metrics.
- Figure 3: Spatial imbalance CDFs and PDFs.
- Figure 4: Correlation matrix showing relationships between imbalance factors and utilization metrics.
If you use this repository for your research, please cite our paper:
```bibtex
@incollection{sencan2025analyzing,
  title={Analyzing GPU Utilization in HPC Workloads: Insights from Large-Scale Systems},
  author={Sencan, Efe and Kulkarni, Dhruva and Coskun, Ayse and Konate, Kadidia},
  booktitle={Practice and Experience in Advanced Research Computing 2025: The Power of Collaboration},
  pages={1--8},
  year={2025}
}
```

For any questions, please contact:
- Author: Efe Sencan
- Email: esencan@bu.edu
- Affiliation: Boston University / National Energy Research Scientific Computing Center (NERSC)
License: This repository is licensed under the MIT License. See LICENSE for details.