radT (Resource Aware Data science Tracker) is an extension to MLFlow that simplifies collecting and exploring hardware metrics for machine learning and deep learning applications. Collecting and processing all the required metrics for these workloads is usually a hassle. In contrast, RADT is easy to deploy and use, with minimal impact on both performance and time investment. The RADT codebase is documented and easily extensible.
This work has been published at DEEM 2023, the SIGMOD workshop on data management for end-to-end machine learning: *Data Management and Visualization for Benchmarking Deep Learning Training Systems*.
```
pip install radt
```

The current release is 0.2.23. radT was released recently and receives frequent updates.
If you find any issues or bugs, feel free to message titr (at) itu.dk or open an issue in this repository.
- 0.2.23: Added external scheduling, removed `max_epoch` and `max_time`.
- 0.2.22: Removed default conda dependency.
- 0.2.21: Listeners now export system metrics, added name column.
- 0.2.20: Resolved runs being closed when listeners exit.
- 0.2.19: Added free listener, added PyTorch data workers to top.
- 0.2.18: Resolved issue of listeners duplicating runs under new MLFlow versions.
- 0.2.17: Removed sudo requirement for iostat, renamed iostat fields.
- 0.2.16: Fixed an issue that could lock the process under extreme levels of collocation.
- 0.2.15: RADT now runs correctly on machines that have a corrupt DCGMI installation.
- 0.2.14: Automatically disable the DCGMI listener when DCGMI is not found.
- 0.2.13: Enable RADT on systems without DCGMI.
- 0.2.12: Fixed an issue with dependencies.
- 0.2.11: Workloads are now nested to group them together. Run names include the workload and letter. Improved flexibility of param passthrough.
- 0.2.10: Workload listeners now upload logs when `file` points to a different folder. `rerun` argument now works correctly.
- 0.2.9: Allow text printing while the environment is setting up.
- 0.2.8: Resolved issue preventing logs from being collected.
- 0.2.7: Resolved race condition that could sometimes disrupt collocated model execution.
- 0.2.6: Resolved synchronisation issues with `.csv` runs.
- 0.2.5: Automatically log `pip` and `conda` package lists and `nvidia-smi` driver info for reproducibility.
- 0.2.4: Fixed `rerun` flag, added run names to status.
- 0.2.3: Reintroduced manual mode, fixed issue with context attributes; `max_epoch`, `max_time`, and `manual` are now logged as parameters.
- 0.2.2: Reintroduced contexts, fixed issue of not having `migedit` as a formal requirement.
- 0.2.1: Removed legacy print statements.
- 0.2.0: Moved `radt run` to be a subcommand in `radt`, reintroduced workload listeners, use `migedit` for MIG management, local mode.
- 0.1.4: Fixed several minor issues.
- 0.1.3: Fixed several bugs that prevented correct logging.
- 0.1.0: Initial release.
- Wide configuration support including collocation
- Track hardware and software metrics, including Nsight
- Handle continuous streams of data
- Support multiple visualization use-cases
- Filter large amounts of inconsequential data
- Minimal code impact
Replace `python` in your training script invocation by `radt`, e.g.:

```
radt train.py --batch-size 256
```

or, when using virtual environments/conda:

```
python -m radt train.py --batch-size 256
```

For a complete getting-started guide and examples, please visit the Examples.
radT will automatically track hardware metrics for your application. The listeners will start tracking your application on invocation.
As radT extends MLFlow, you can either use radT's advanced tracking or use MLFlow directly to track software metrics (e.g., loss).
If you want to have more control over what is logged, you can encapsulate your training loop in the RADT context. This allows for logging of ML metrics among other MLFlow functions:
```python
import radt

with radt.run.RADTBenchmark() as run:
    # training loop
    run.log_metric("Metric A", amount)
    run.log_artifact("artifact.file")
```

All methods and functions under mlflow are accessible this way. These functions are disabled when running the codebase without radt, ensuring code flexibility.
RADT can take the hassle out of large experiments by training multiple models in succession. Models can even be trained simultaneously on different GPUs, or on the same GPU using a range of collocation schemes.
```
Experiment,Workload,Status,Run,Devices,Collocation,File,Listeners,Params
0,1,no sharing,,,0,-,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,2,shared gpu 1,,,0,-,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,2,shared gpu 2,,,0,-,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,3,MPS shared gpu 1,,,0,MPS,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,3,MPS shared gpu 2,,,0,MPS,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,4,MIG shared gpu 1,,,2,3g.20gb,../pytorch/cifar10_context.py,smi+top,--batch-size 128
0,4,MIG shared gpu 2,,,2,3g.20gb,../pytorch/cifar10_context.py,smi+top,--batch-size 128
```
If interrupted for any reason, a CSV experiment can be rescheduled and will continue from where it left off.
- Linux
If you need to cite this repository in academic research:
```bibtex
@inproceedings{robroek2023data,
  title={Data Management and Visualization for Benchmarking Deep Learning Training Systems},
  author={Robroek, Ties and Duane, Aaron and Yousefzadeh-Asl-Miandoab, Ehsan and Tozun, Pinar},
  booktitle={Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning},
  pages={1--5},
  year={2023}
}
```

Thank You!
Contributions are welcome. (Please add yourself to the list)
