Open
Description
A goal for 2026 is to build out enough tutorial material on deployment to fill a 3-hour tutorial slot at a conference such as GTC, PyCon, or SciPy.
The content should also be organized into self-contained chapters that can be extracted and reused in other material, so that we can extend our reach further.
Topics
The material should broadly cover the following topics:
- The software stack from driver, through CUDA to Python
- Common tools and package managers for installing GPU Python code (`pip`, `uv`, `conda`, `pixi`)
- Verifying software environments
- Troubleshooting common install problems
- Multi-node deployments (Spark, Dask, Ray)
- Monitoring
  - Local monitoring with `nvidia-smi` and `nvtop`
  - Broad monitoring with Prometheus and DCGM
- Debugging
  - Attaching debuggers or running traces in managed cloud environments
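
As a rough illustration of what the "Verifying software environments" chapter might include, here is a minimal sketch that checks whether `nvidia-smi` is available and reports the GPU name and driver version. The function names `parse_smi_csv` and `check_gpu_environment` are hypothetical and chosen only for this example; the sketch assumes `nvidia-smi` supports the `--query-gpu` and `--format=csv,noheader` options, and it is not a proposal for the final tutorial content.

```python
import shutil
import subprocess


def parse_smi_csv(text):
    """Parse output of `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split from the right so a comma inside the GPU name cannot shift the fields.
        name, driver = (field.strip() for field in line.rsplit(",", 1))
        gpus.append({"name": name, "driver_version": driver})
    return gpus


def check_gpu_environment():
    """Return a list of detected GPUs, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return parse_smi_csv(result.stdout)


if __name__ == "__main__":
    gpus = check_gpu_environment()
    if not gpus:
        print("No GPU detected (nvidia-smi missing or failed).")
    for gpu in gpus:
        print(f"{gpu['name']}: driver {gpu['driver_version']}")
```

A check like this only confirms that the driver layer is visible; verifying the CUDA and Python layers of the stack would need further steps.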
Prior work
Much of this material already exists but has not been put together in a cohesive way. The following resources will be useful: