radT (Resource Aware Data science Tracker) is an extension to MLFlow that simplifies collecting and exploring hardware metrics for machine learning and deep learning applications. Collecting and processing all the required metrics for these workloads is usually a hassle. In contrast, RADT is easy to deploy and use, with minimal impact on both performance and time investment. The RADT codebase is documented and easily extensible.
This work has been published at DEEM 2023, the SIGMOD workshop on data management for end-to-end machine learning: *Data Management and Visualization for Benchmarking Deep Learning Training Systems*.
```
pip install radt
```

The current release is 0.2.23. radT was released recently and receives frequent updates.
If you find any issues or bugs, feel free to message titr (at) itu.dk or open an issue in this repository.
- 0.2.23: Added external scheduling, removed `max_epoch` and `max_time`.
- 0.2.22: Removed default conda dependency.
- 0.2.21: Listeners now export system metrics, added name column.
- 0.2.20: Resolved runs being closed when listeners exit.
- 0.2.19: Added free listener, added PyTorch data workers to top.
- 0.2.18: Resolved issue of listeners duplicating runs under new MLFlow versions.
- 0.2.17: Removed sudo requirement for iostat, renamed iostat fields.
- 0.2.16: Fixed an issue that could lock the process under extreme levels of collocation.
- 0.2.15: RADT now runs correctly on machines that have a corrupt DCGMI installation.
- 0.2.14: Automatically disable the DCGMI listener when DCGMI is not found.
- 0.2.13: Enable RADT on systems without DCGMI.
- 0.2.12: Fixed an issue with dependencies.
- 0.2.11: Workloads are now nested to group them together. Run names include the workload and letter. Improved flexibility of param passthrough.
- 0.2.10: Workload listeners now upload logs when `file` points to a different folder. `rerun` argument now works correctly.
- 0.2.9: Allow text printing while the environment is setting up.
- 0.2.8: Resolved issue preventing logs from being collected.
- 0.2.7: Resolved race condition that could sometimes disrupt collocated model execution.
- 0.2.6: Resolved synchronisation issues with `.csv` runs.
- 0.2.5: Automatically log `pip` and `conda` package lists and `nvidia-smi` driver info for reproducibility.
- 0.2.4: Fixed `rerun` flag, added run names to status.
- 0.2.3: Reintroduced manual mode, fixed issue with context attributes; `max_epoch`, `max_time`, and `manual` are now logged as parameters.
- 0.2.2: Reintroduced contexts, fixed issue of not having `migedit` as a formal requirement.
- 0.2.1: Removed legacy print statements.
- 0.2.0: Moved `radt run` to be a subcommand in `radt`, reintroduced workload listeners, use `migedit` for MIG management, local mode.
- 0.1.4: Fixed several minor issues.
- 0.1.3: Fixed several bugs that prevented correct logging.
- 0.1.0: Initial release.
- Wide configuration support including collocation
- Track hardware and software metrics, including Nsight
- Handle continuous streams of data
- Support multiple visualization use-cases
- Filter large amounts of inconsequential data
- Minimal code impact
Replace `python` in your training script invocation by `radt`, e.g.:

```
radt train.py --batch-size 256
```

or, when using virtual environments/conda:

```
python -m radt train.py --batch-size 256
```

For a complete getting-started guide and examples, please visit the Examples.
radT will automatically track hardware metrics for your application. The listeners will start tracking your application on invocation.
As radT extends MLFlow, you can either use radT's advanced tracking or use MLFlow directly to track software metrics (e.g., loss).
If you want to have more control over what is logged, you can encapsulate your training loop in the RADT context. This allows for logging of ML metrics among other MLFlow functions:
```python
import radt

with radt.run.RADTBenchmark() as run:
    # training loop
    run.log_metric("Metric A", amount)
    run.log_artifact("artifact.file")
```

All methods and functions under mlflow are accessible this way. These functions are disabled when running the codebase without radt, ensuring code flexibility.
RADT can take the hassle out of large experiments by training multiple models in succession. Models can even be trained simultaneously on different GPUs, or on the same GPU using a range of collocation schemes.
```
Experiment,Workload,Status,Run,Devices,Collocation,File,Listeners,Params
0,1,no sharing,,,0,-,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,2,shared gpu 1,,,0,-,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,2,shared gpu 2,,,0,-,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,3,MPS shared gpu 1,,,0,MPS,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,3,MPS shared gpu 2,,,0,MPS,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,4,MIG shared gpu 1,,,2,3g.20gb,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,4,MIG shared gpu 2,,,2,3g.20gb,../pytorch/cifar10_context.py,smi+top,--batch-size 128
```
If interrupted for any reason, a CSV experiment can be rescheduled and will continue from where it left off.
- Linux
If you need to cite this repository in academic research:
```bibtex
@inproceedings{robroek2023data,
  title={Data Management and Visualization for Benchmarking Deep Learning Training Systems},
  author={Robroek, Ties and Duane, Aaron and Yousefzadeh-Asl-Miandoab, Ehsan and Tozun, Pinar},
  booktitle={Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning},
  pages={1--5},
  year={2023}
}
```

Thank You!
Contributions are welcome. (Please add yourself to the list)
