This repository contains the artifacts for our work on building a deep learning–based GPU memory estimator for training deep learning models. Since data is central to this effort, we structured the workflow in several key stages:
- Data Generation: We developed scripts to automatically generate diverse deep learning training configurations and monitor GPU behavior during training.
- Data Cleaning: After collecting raw logs, we processed and cleaned the data using dedicated scripts included here.
- Analysis & Modeling: With the cleaned data, we performed exploratory analysis and trained various models to estimate GPU memory usage.
- We explored ensemble methods, reviewed related work, and analyzed the overhead introduced by both the data parsers and model inference.
For each neural network type (MLP, CNN, Transformer), we provide two key files: one that defines the network architecture, and a launcher script that spawns multiple training instances with varying architectural parameters. During training, GPU usage (along with other metrics) is monitored using dcgmi and nvidia-smi, while system metrics are tracked with top.
Note 1: Each deep learning configuration is trained for one minute, one at a time. This sequential execution avoids interference from simultaneous training jobs, which could affect system performance due to shared CPU and DRAM usage.
- MLP: MLP model | MLP model launcher
- CNN: CNN model | CNN model launcher
- Transformer: Transformer model | Transformer model launcher
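The launchers above follow a launch-and-monitor pattern; the following is a minimal sketch of that pattern, not the actual repository code. The `mlp_model.py` entry point and the `--hidden-dim`/`--num-layers` flags are hypothetical placeholders, and the real scripts define their own parameter grids and also invoke dcgmi and top.

```python
import itertools
import subprocess
import time

# Hypothetical parameter grid; the real launchers define their own
# grids and command-line flags.
hidden_dims = [256, 512, 1024]
num_layers = [2, 4, 8]

for hd, nl in itertools.product(hidden_dims, num_layers):
    tag = f"mlp_h{hd}_l{nl}"
    with open(f"{tag}_gpu.csv", "w") as gpu_log:
        # Sample GPU memory and utilization once per second during training.
        monitor = subprocess.Popen(
            ["nvidia-smi",
             "--query-gpu=memory.used,utilization.gpu",
             "--format=csv", "-l", "1"],
            stdout=gpu_log,
        )
        # One configuration at a time, for roughly one minute (see Note 1).
        train = subprocess.Popen(
            ["python", "mlp_model.py",
             "--hidden-dim", str(hd), "--num-layers", str(nl)]
        )
        time.sleep(60)
        train.terminate()
        train.wait()
        monitor.terminate()
        monitor.wait()
```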
- Refactoring the launcher script to read parameters from a YAML configuration file.
- Extending the Transformer model to support architectures with 1D convolutional layers (e.g., GPT-style models), as it currently supports only linear-layer-based designs.
- Extending the Transformer data cleaning script to support models that include Conv1D and other layer types.
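As a sketch of the YAML refactoring item above, a launcher could read its sweep from a configuration file like the one embedded below. The keys and schema are hypothetical, since no such file exists in the repository yet.

```python
import itertools
import yaml  # pip install pyyaml

# Hypothetical config layout; the keys are illustrative, not an actual schema.
config_text = """
model: mlp
train_seconds: 60
sweep:
  hidden_dim: [256, 512, 1024]
  num_layers: [2, 4, 8]
"""

config = yaml.safe_load(config_text)

# Expand the sweep grid into individual training configurations.
keys = list(config["sweep"])
for values in itertools.product(*(config["sweep"][k] for k in keys)):
    params = dict(zip(keys, values))
    print(config["model"], config["train_seconds"], params)
```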
We examined the cleaned data by studying its distribution across selected features, visualized through PCA and t-SNE projections. We also trained MLP- and Transformer-based models on it to validate the idea of using deep learning for estimating GPU memory usage. For more details, check here.
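The actual analysis lives in the notebooks referenced above; the following is a minimal sketch of the PCA/t-SNE view, assuming the cleaned data is a CSV of numeric features with a memory-usage label column (the file and column names are placeholders).

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Hypothetical file and column names; substitute the repository's cleaned dataset.
df = pd.read_csv("cleaned_mlp_data.csv")
features = StandardScaler().fit_transform(df.drop(columns=["gpu_mem_mb"]))

pca_2d = PCA(n_components=2).fit_transform(features)
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(features)

# Color each point by its GPU memory label to see whether the
# embedding separates low- from high-memory configurations.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in [(axes[0], pca_2d, "PCA"), (axes[1], tsne_2d, "t-SNE")]:
    sc = ax.scatter(emb[:, 0], emb[:, 1], c=df["gpu_mem_mb"], s=5)
    ax.set_title(title)
fig.colorbar(sc, ax=axes, label="GPU memory (MB)")
plt.show()
```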
To train and test ensemble models, ensure that you are using the correct dataset. When running training or evaluation, specify both the dataset and the model type using the appropriate command-line arguments.
Training:
```bash
python train.py --d [mlp, cnn, transformer] --m [mlp, transformer]
```

Validation:

```bash
python kfold_cross_validation.py --d [mlp, cnn, transformer] --m [mlp, transformer]
```

Testing:

```bash
python test.py --d [mlp, cnn, transformer] --m [mlp, transformer]
```

To visualize the results, including the confusion matrix and other statistics, see the visualization notebook.
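The notebook holds the real plots; as a quick sketch, a confusion matrix over predicted versus actual memory classes can be rendered as follows (the label arrays are placeholders standing in for the test script's outputs).

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Placeholder class labels; in practice these come from the test run.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()
```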
We also characterized the overhead of the parsers and the estimator models themselves, since one of the primary purposes of these estimators is to inform schedulers/resource managers so they can make more efficient decisions.
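A minimal sketch of how such overhead can be measured, assuming hypothetical `parse_logs` and `estimate_memory` callables standing in for the repository's parser and estimator entry points:

```python
import statistics
import time

def measure_overhead(fn, arg, repeats=100):
    """Time a callable over several runs and report the median latency."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# parse_logs and estimate_memory are placeholders for the actual
# parser and estimator entry points in this repository.
# print(f"parser: {measure_overhead(parse_logs, 'run.log'):.6f} s")
# print(f"model:  {measure_overhead(estimate_memory, config):.6f} s")
```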
We designed experiments to evaluate the effectiveness of the Horus formula and the Fake Tensor library in estimating the GPU memory requirements of deep learning training tasks. Read more here.
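For context, PyTorch's fake tensors carry shapes and dtypes without allocating real storage, which makes cheap size accounting possible. Below is a simplified illustration, not the repository's evaluation code; it only totals parameter bytes and one forward output, which is a rough lower bound rather than the full training footprint.

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    # Build the model and a batch without allocating real GPU memory.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 10),
    )
    x = torch.randn(64, 1024)
    out = model(x)

    # Rough lower bound: parameter bytes plus the final activation.
    param_bytes = sum(p.element_size() * p.nelement() for p in model.parameters())
    act_bytes = out.element_size() * out.nelement()

print(f"params: {param_bytes / 2**20:.1f} MiB, output: {act_bytes / 2**20:.1f} MiB")
```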
In the discussion section of our paper, we sketch a roadmap for contributors. Since this is a deep learning–based estimator, potential contributions and improvements include more data points, data from different GPU models covering a broader range of arguments, and new ways of framing the GPU memory estimation problem.
© 2025 Ehsan Yousefzadeh-Asl-Miandoab. Affiliated with the RAD, IT University of Copenhagen. All rights reserved.
This repository is released for non-commercial academic research purposes only under the following terms:
- 📦 Code and Notebooks: Custom research-only license. You may use, modify, and share for academic research, but commercial use is prohibited.
- 🧠 Trained Models: Provided for academic evaluation only. Do not use in commercial products or services without explicit permission.
- 📊 Dataset: Licensed under CC BY-NC 4.0.
- 📈 Figures and Visualizations: Also under CC BY-NC 4.0.
If you use this repository (code, models, data, or ideas), you must cite the following:
GitHub Repository
Ehsan Yousefzadeh-Asl-Miandoab. GPUMemNet: Estimating GPU Memory Requirements for Deep Learning Training Tasks. GitHub Repository: https://github.com/ehsanyousefzadehasl/gpumemnet
```bibtex
@misc{yousefzadeh2025gpumemnet,
  author = {Ehsan Yousefzadeh-Asl-Miandoab},
  title = {GPUMemNet: Estimating GPU Memory Requirements for Deep Learning Training Tasks},
  year = {2025},
  howpublished = {\url{https://github.com/ehsanyousefzadehasl/gpumemnet}},
}
```

Academic Paper

```bibtex
@article{yousefzadeh2025carma,
  title = {CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator},
  author = {Yousefzadeh-Asl-Miandoab, Ehsan and Karimzadeh, Reza and Ibragimov, Bulat and Ciorba, Florina M. and Tozun, Pinar},
  journal = {arXiv preprint arXiv:2508.19073},
  year = {2025}
}
```