Add NVIDIA driver update scripts and GPU verification #1031

Draft
GeorgePearse wants to merge 18 commits into master from feature/nvidia-driver-scripts

Conversation

@GeorgePearse
Collaborator

Summary

This PR adds comprehensive NVIDIA driver update and GPU verification tools to the project.

Changes

  • update-nvidia-driver.sh: Main script with auto-detection for NVIDIA driver installation

    • Validates CUDA versions
    • Handles driver installation and verification
    • Includes error handling and user feedback
  • verify-pytorch-gpu.py: Script to verify PyTorch GPU functionality

    • Checks CUDA availability and version compatibility
    • Validates PyTorch installation
    • Tests GPU compute capabilities
  • NVIDIA_DRIVER_UPDATE.md: Complete documentation and usage guide
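
For reviewers who want a feel for the checks listed under verify-pytorch-gpu.py, here is a minimal sketch of that kind of verification. It is not the script from this PR: the function name `verify_gpu` and the report keys are hypothetical, and it assumes only the public `torch.cuda` API. It reports the installed PyTorch version, the CUDA version PyTorch was built against, whether a CUDA device is visible, and whether a small matrix multiply runs on the GPU.

```python
"""Hypothetical sketch of a PyTorch GPU check (not the PR's verify-pytorch-gpu.py)."""
from __future__ import annotations

import importlib.util


def verify_gpu() -> dict:
    """Return a report describing the PyTorch install and GPU state."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_build_version": None,   # CUDA version PyTorch was compiled against
        "cuda_available": False,
        "device_name": None,
        "compute_capability": None,
        "matmul_ok": False,
    }

    # Step 1: validate the PyTorch installation without crashing if it is absent.
    if importlib.util.find_spec("torch") is None:
        return report
    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build_version"] = torch.version.cuda  # None on CPU-only builds

    # Step 2: check that the driver exposes at least one CUDA device.
    report["cuda_available"] = torch.cuda.is_available()
    if not report["cuda_available"]:
        return report

    report["device_name"] = torch.cuda.get_device_name(0)
    report["compute_capability"] = torch.cuda.get_device_capability(0)

    # Step 3: run a small computation on the GPU and wait for it to finish.
    a = torch.randn(256, 256, device="cuda")
    b = torch.randn(256, 256, device="cuda")
    c = a @ b
    torch.cuda.synchronize()
    report["matmul_ok"] = bool(torch.isfinite(c).all().item())
    return report


if __name__ == "__main__":
    for key, value in verify_gpu().items():
        print(f"{key}: {value}")
```

Returning a report instead of exiting on the first failure makes it easier to see which layer is at fault: a missing PyTorch install, a CPU-only build (`cuda_build_version` is `None`), or a driver that does not expose the device (`cuda_available` is `False`).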

Testing

These scripts have been tested with NVIDIA driver installation and PyTorch GPU verification workflows.

GeorgePearse and others added 18 commits December 2, 2025 05:19
- Rename all visengine references to visdet/visdet.engine throughout codebase
- Fix YAML config handling: convert [1333, 800] list to tuple for img_scale
- Fix COCO dataset config: use img_path instead of img for data_prefix
- Fix collect_results for non-distributed mode (world_size=1)
- Fix collect_results signature (remove incorrect collect_device param)
- Fix scale_factor handling in FCNMaskHead for 2-element arrays
- Skip out-of-range category predictions in CocoMetric
- Add visualization hooks to engine.hooks exports
- Fix circular import in visualization_hook with TYPE_CHECKING
- Improve SimpleRunner pipeline handling for train/val
- Add justfile with common development commands
- Create comprehensive roadmap.md documenting SPDL integration phases
- Add references to roadmap in index.md
- Update data_pipeline.md with SPDL future enhancement note
- Update quick-start.md with SimpleRunner API and visdet imports

SPDL (Meta's Scalable and Performant Data Loading) offers:
- 74% faster data iteration vs PyTorch DataLoader
- 38% less CPU usage
- 50GB less memory footprint
- Thread-based execution instead of multiprocessing
- Additional 33% speedup with Python 3.13t (nogil)

Integration planned in 4 phases through 2025:
- Phase 1 (Q1): Adapter layer and optional dependency
- Phase 2 (Q2): Detection-specific optimizations
- Phase 3 (Q3): Distributed training support
- Phase 4 (Q4): Python 3.13t and advanced features
Add download_coco2017_to_volume.py that downloads COCO 2017 dataset
(train2017, val2017, annotations) to a Modal persistent volume for
cloud-based training. Includes checkpointing to handle interruptions.

Also update training docs to reference cloud/Modal data preparation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
feat(tools): Add Modal script for COCO 2017 dataset download
Remove zuban from dev dependencies and pre-commit hooks to unblock
CI pipeline. Other pre-commit hooks and tests continue to pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Import visdet.cv.transforms in builder to ensure registry registration
  (fixes CI failures where transforms weren't found in PIPELINES)
- Restore AssertionError for list-of-integers img_scale in Resize
  (test expects [1333, 800] to raise, not auto-convert to tuple)
- Fix RandomFlip to flip all images in img_fields, not just 'img'
  (test expects both img and img2 to be flipped)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
chore: Remove zuban type checker temporarily
chore: scoped zuban typecheck for visdet/structures
chore: scoped zuban typecheck for visdet/apis
chore: scoped zuban typecheck for visdet/core/mask
Make load_yaml_config available from visdet.engine.config for preset loading.
Re-export Config class for easier access in training scripts.
@GeorgePearse force-pushed the feature/nvidia-driver-scripts branch from 2bb42a1 to df0dbfc on December 24, 2025 00:08
@github-actions
Contributor

Skylos Scan: No dead code or security issues detected.
