GPUMemNet: GPU Memory estimator and Neural Network training dataset

GPUMemNet logo

This repository contains the artifacts for our work on building a deep learning–based GPU memory estimator for training deep learning models. Since data is central to this effort, we structured the workflow in several key stages:

  • Data Generation: We developed scripts to automatically generate diverse deep learning training configurations and monitor GPU behavior during training.

  • Data Cleaning: After collecting raw logs, we processed and cleaned the data using dedicated scripts included here.

  • Analysis & Modeling: With the cleaned data, we performed exploratory analysis and trained various models to estimate GPU memory usage.

  • Ensembles & Overheads: We explored ensemble methods, reviewed related work, and analyzed the overhead introduced by both the data parsers and model inference.

How to Use GPUMemNet

TODO: add a clear description and an easy, fast-to-use script for running estimations

Data Generation Scripts

For each neural network type (MLP, CNN, Transformer), we provide two key files: one that defines the network architecture, and a launcher script that spawns multiple training instances with varying architectural parameters. During training, GPU usage (alongside other metrics) is monitored using dcgmi and nvidia-smi, while system metrics are tracked with top.

Note 1: Each deep learning configuration is trained for one minute, one at a time. This sequential execution avoids interference from simultaneous training jobs, which could affect system performance due to shared CPU and DRAM usage.
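As an illustration of this workflow, the sketch below launches one training configuration for about 60 seconds while logging GPU memory usage with nvidia-smi in the background. The script name (train_mlp.py), its arguments, and the log file naming are assumptions for illustration, not the repository's actual launcher.

```python
# Minimal launcher sketch (assumed script names and arguments, not the repo's actual files).
import subprocess
import time

def run_config(depth: int, width: int, batch_size: int, duration_s: int = 60) -> None:
    # Start a background GPU memory logger (1-second sampling) using nvidia-smi.
    log_name = f"gpu_mem_d{depth}_w{width}_b{batch_size}.csv"
    with open(log_name, "w") as log:
        monitor = subprocess.Popen(
            ["nvidia-smi", "--query-gpu=timestamp,memory.used,utilization.gpu",
             "--format=csv", "-l", "1"],
            stdout=log,
        )
        # Launch one training instance (hypothetical script and flags).
        trainer = subprocess.Popen(
            ["python", "train_mlp.py", "--depth", str(depth),
             "--width", str(width), "--batch-size", str(batch_size)],
        )
        time.sleep(duration_s)   # each configuration runs for ~1 minute
        trainer.terminate()
        trainer.wait()
        monitor.terminate()      # stop sampling once training ends
        monitor.wait()

# Sequential execution: one configuration at a time to avoid interference.
for depth in (2, 4, 8):
    for width in (256, 1024):
        run_config(depth, width, batch_size=64)
```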

Future/Possible Contributions at This Level

  1. Refactoring the launcher script to read parameters from a YAML configuration file.
  2. Extending the Transformer model to support architectures with 1D convolutional layers (e.g., GPT-style models), as it currently supports only linear-layer-based designs.

Data Cleaning Script

Future/Possible Contributions at This Level

  • Extend the Transformer data cleaning script to support models that include Conv1D and other types of layers

Data

Visualization, Analysis, and Training Notebooks

We explored the cleaned data by examining its distribution across selected features, visualized through PCA and t-SNE projections. We also trained MLP- and Transformer-based models on the data to validate the idea of using deep learning to estimate GPU memory usage. To dive deeper into this, check here.
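A minimal sketch of that kind of projection is shown below, assuming the cleaned data is a CSV with numeric architecture features and a gpu_mem_used column; the file name and column names are placeholders, not the repository's actual schema.

```python
# Illustrative PCA / t-SNE projection of the cleaned dataset (assumed file and column names).
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

df = pd.read_csv("cleaned_mlp_dataset.csv")            # hypothetical path
features = df.drop(columns=["gpu_mem_used"])           # hypothetical target column
X = StandardScaler().fit_transform(features)

pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, proj, title in zip(axes, (pca_2d, tsne_2d), ("PCA", "t-SNE")):
    sc = ax.scatter(proj[:, 0], proj[:, 1], c=df["gpu_mem_used"], s=5, cmap="viridis")
    ax.set_title(title)
fig.colorbar(sc, ax=axes, label="GPU memory used (MiB)")
plt.show()
```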

Training, Validation, and Testing with Ensemble Models

To train and test ensemble models, ensure that you are using the correct dataset. When running training or evaluation, specify both the dataset and the model type using the appropriate command-line arguments.

Training:

python train.py --d [mlp, cnn, transformer] --m [mlp, transformer]

Validation:

python kfold_cross_validation.py --d [mlp, cnn, transformer] --m [mlp, transformer]

Testing:

python test.py --d [mlp, cnn, transformer] --m [mlp, transformer]

To visualize the results, including the confusion matrix and other statistics, see the visualization notebook.
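For reference, a confusion matrix like the one in the notebook can be produced with scikit-learn as in the sketch below; the label arrays here are placeholder values standing in for the estimator's outputs on the test split.

```python
# Illustrative confusion matrix plot (labels are placeholders for real test outputs).
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

y_true = [0, 2, 1, 3, 2, 0, 1, 3]   # true memory-bucket classes (example values)
y_pred = [0, 2, 1, 2, 2, 0, 1, 3]   # estimator predictions (example values)

ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
print(classification_report(y_true, y_pred))
plt.show()
```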

Overheads of the parser and the models' inference

We also characterized the overheads of the parsers and the estimator models' inference, since one of the primary purposes of these estimators is to inform schedulers/resource managers so they can make more efficient decisions.
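As an illustration, the latency of a parser and of one estimator forward pass can be characterized with a simple wall-clock loop like the one below; parse_config and the estimator call are placeholders for the actual components, not functions from this repository.

```python
# Illustrative overhead measurement (parse_config and estimator are placeholders).
import time
import statistics

def time_it(fn, repeats: int = 100):
    """Return mean and p95 wall-clock latency of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return statistics.mean(samples), samples[int(0.95 * (len(samples) - 1))]

# Example usage (hypothetical call sites):
# mean_ms, p95_ms = time_it(lambda: parse_config("model.py"))
# mean_ms, p95_ms = time_it(lambda: estimator(features_tensor))
```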

Related Work data and sources

We designed experiments to evaluate the effectiveness of the Horus formula and the Fake Tensor library in estimating the GPU memory requirements of deep learning training tasks. Read more here.
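As a rough illustration of what a formula-based baseline computes, the sketch below sums parameter, gradient, and Adam-style optimizer-state memory for a PyTorch model. This is a generic analytic estimate under stated assumptions, not the Horus formula or the Fake Tensor mechanism themselves, and it deliberately ignores activations and framework overhead.

```python
# Generic analytic memory estimate (illustrative only; not the Horus formula itself).
import torch
import torch.nn as nn

def static_memory_estimate_mib(model: nn.Module, optimizer_states_per_param: int = 2) -> float:
    """Estimate weights + gradients + Adam-style optimizer state, in MiB.

    Ignores activations, CUDA context, and allocator overhead, which are exactly
    the parts that learned estimators such as GPUMemNet try to capture.
    """
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    total_bytes = param_bytes * (1 + 1 + optimizer_states_per_param)  # weights + grads + states
    return total_bytes / (1024 ** 2)

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
print(f"~{static_memory_estimate_mib(model):.1f} MiB of static training state")
```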

Vision

In the discussion section of our paper, we draw a roadmap for how contributors can contribute. As GPUMemNet is a deep learning-based estimator, potential contributions and improvements include more data points, data points from different GPU models and with a broader range of arguments, and innovations in how the GPU memory estimation problem is framed.

License & Citation

© 2025 Ehsan Yousefzadeh-Asl-Miandoab. Affiliated with the RAD, IT University of Copenhagen. All rights reserved.

This repository is released for non-commercial academic research purposes only under the following terms:

  • 📦 Code and Notebooks: Custom research-only license. You may use, modify, and share for academic research, but commercial use is prohibited.
  • 🧠 Trained Models: Provided for academic evaluation only. Do not use in commercial products or services without explicit permission.
  • 📊 Dataset: Licensed under CC BY-NC 4.0.
  • 📈 Figures and Visualizations: Also under CC BY-NC 4.0.

📚 Citation

If you use this repository (code, models, data, or ideas), you must cite the following:

GitHub Repository
Ehsan Yousefzadeh-Asl-Miandoab. GPUMemNet: Estimating GPU Memory Requirements for Deep Learning Training Tasks. GitHub Repository: https://github.com/ehsanyousefzadehasl/gpumemnet

@misc{yousefzadeh2025gpumemnet,
  author       = {Ehsan Yousefzadeh-Asl-Miandoab},
  title        = {GPUMemNet: Estimating GPU Memory Requirements for Deep Learning Training Tasks},
  year         = {2025},
  howpublished = {\url{https://github.com/ehsanyousefzadehasl/gpumemnet}},
}

Academic Paper

@article{yousefzadeh2025carma,
  title={CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator},
  author={Yousefzadeh-Asl-Miandoab, Ehsan and Karimzadeh, Reza and Ibragimov, Bulat and Ciorba, Florina M and Tozun, Pinar},
  journal={arXiv preprint arXiv:2508.19073},
  year={2025}
}
