[Bug] CUDA Illegal Instruction Error on NVIDIA H20 GPU (Hopper sm_90)
🐛 Bug Description
StreamPETR training fails with RuntimeError: CUDA error: an illegal instruction was encountered during the backward pass on an NVIDIA H20-3e GPU (Hopper architecture, sm_90).
The error occurs only during the backward pass; the forward pass completes without issue.
📋 Environment
- GPU: NVIDIA H20-3e (Hopper architecture, compute capability sm_90)
- VRAM: 140GB HBM3
- Driver: CUDA 12.4
- OS: Linux 5.15.0-60-generic
- Python: 3.8.20
- PyTorch: 2.0.1 (compiled with CUDA 11.8)
- MMCV: 1.7.0 (mmcv-full, compiled with CUDA 11.8)
- MMDetection3D: 1.0.0rc6
- NumPy: 1.22.0
🔄 Steps to Reproduce

1. Set up the environment:

```bash
conda create -n streampetr python=3.8
conda activate streampetr
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118
```

2. Install dependencies:

```bash
pip install mmcv-full==1.7.0
pip install mmdet==2.28.2
pip install mmsegmentation==0.30.0
cd mmdetection3d && pip install -e .
```

3. Run training:

```bash
python tools/train.py projects/configs/StreamPETR/stream_petr_r50_flash_704_bs1_semi_supervised_h20_optimized.py \
    --work-dir work_dirs/run_2stage/
```
❌ Error Output

```
Traceback (most recent call last):
  File "tools/train.py", line 269, in <module>
    main()
  File "tools/train.py", line 257, in main
    custom_train_model(
  ...
  File "/path/to/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
🔍 Investigation & Attempted Solutions
✅ What Works:
- Forward pass: model inference runs without error
- Single-operation backward: simple tensor operations backpropagate correctly
- GPU recognition: the CUDA device is detected correctly
❌ What Doesn't Work:
- Full-model backward pass: training fails during `loss.backward()`
- FP16 training: same error with mixed precision
- Gradient checkpointing: same error with `with_cp=True`
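For reference, the forward-works/backward-fails split can be reproduced with a minimal standalone sketch like the one below (hypothetical snippet, not from the StreamPETR codebase; on a CPU-only machine it falls back to CPU, where both passes succeed):

```python
import torch
import torch.nn as nn

# Minimal repro sketch: a small model whose forward pass succeeds on H20,
# while loss.backward() triggers the illegal-instruction error there.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
x = torch.randn(4, 64, device=device)

out = model(x)       # forward pass: completes on H20
loss = out.sum()
loss.backward()      # backward pass: raises the CUDA error on H20

print("backward succeeded, grad norm:", model[0].weight.grad.norm().item())
```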
🛠️ Attempted Fixes:

1. Disabled Flash-Attention (not compatible with sm_90):
   - Added a fallback to `torch.nn.functional.scaled_dot_product_attention`
   - Modified `projects/mmdet3d_plugin/models/utils/attention.py`

2. Disabled gradient checkpointing:
   - Set `with_cp=False` in both `img_backbone` and the transformer layers

3. Disabled FP16:
   - Set `fp16=None` in the config

4. Recompiled MMCV with matching CUDA versions:
   - Tried CUDA 11.8 (matching PyTorch)
   - Tried CUDA 12.5 (matching the system driver)

5. Attempted a PyTorch 2.1.2 upgrade:
   - Failed due to ABI incompatibility with MMCV 1.7.0
   - Error: `undefined symbol: _ZNK3c106SymIntltEl`

All attempts still result in the same illegal instruction error.
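The first attempted fix (falling back from flash-attn to PyTorch's built-in attention) was along these lines. This is an illustrative sketch, not the actual diff to `attention.py`; the function name and the `use_flash` flag are hypothetical:

```python
import torch
import torch.nn.functional as F

def attention_with_fallback(q, k, v, use_flash=False):
    """Compute attention, falling back to PyTorch's built-in
    scaled_dot_product_attention (available since torch 2.0).

    q, k, v: (batch, heads, seq, head_dim). In the real code, the
    check would test whether the flash-attn kernel supports the
    current GPU architecture instead of taking a flag.
    """
    if use_flash:
        # The flash-attn call would go here; skipped on sm_90 in this sketch.
        raise NotImplementedError("flash-attn disabled on this arch")
    # Fallback: PyTorch dispatches to a math or memory-efficient kernel.
    return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(2, 8, 16, 32)
out = attention_with_fallback(q, k, v)
print(out.shape)
```

Since the same error persisted with this fallback in place, the attention kernel itself does not appear to be the culprit.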
🤔 Root Cause Analysis
The issue appears to stem from PyTorch 2.0.1's incomplete support for the Hopper (sm_90) architecture:
- PyTorch 2.0.1 (March 2023) was released shortly after the first Hopper GPUs shipped
- According to the PyTorch docs, sm_90 support was marked as early/experimental in 2.0.x
- The backward pass likely exercises CUDA kernels that were not fully optimized or tested on sm_90
Evidence:
- Forward pass works (simpler kernels)
- Backward pass fails (complex autograd kernels)
- Error persists across different CUDA compilation versions
- Cannot upgrade to PyTorch 2.1+ due to MMCV 1.7.0 ABI incompatibility
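One quick diagnostic for this hypothesis is to check which compute capabilities the installed PyTorch binary was actually compiled for (`torch.cuda.get_arch_list()` returns an empty list on CPU-only builds):

```python
import torch

# List the compute capabilities this PyTorch binary was compiled for.
# If "sm_90" is absent, Hopper GPUs execute code JIT-compiled from PTX
# for an older architecture rather than native sm_90 kernels.
archs = torch.cuda.get_arch_list()
print("PyTorch:", torch.__version__)
print("Built for CUDA:", torch.version.cuda)
print("Compiled arch list:", archs)
print("Native sm_90 kernels:", any("sm_90" in a for a in archs))
```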
✨ Expected Behavior
Training should work on H20 GPU without illegal instruction errors, similar to how it works on V100/A100 GPUs.
🙋 Questions for Maintainers
- Has StreamPETR been tested on Hopper (H100/H20) GPUs?
- Is there a recommended PyTorch/MMCV version combination for sm_90 GPUs?
- Would upgrading to MMCV 2.x + PyTorch 2.3+ solve this? (requires code migration)
- Are there any H20-specific branches or forks available?
💡 Suggested Solutions
Short-term (for current users):
- Provide Docker image with PyTorch 2.3+ (native sm_90 support)
- Document known incompatibilities with specific GPU architectures
Long-term:
- Update to MMCV 2.x and PyTorch 2.3+ for better Hopper support
- Add GPU architecture detection and warnings in setup
📌 Additional Notes
- Inference works perfectly - users can still deploy trained models on H20
- This affects all Hopper GPUs (H100, H20, H200, etc.)
- Workaround: Train on older GPUs (V100/A100), deploy on H20 for inference
🔗 Related Issues
- PyTorch sm_90 support: https://github.com/pytorch/pytorch/issues/...
- MMCV Hopper compatibility: https://github.com/open-mmlab/mmcv/issues/...
Any guidance or suggestions would be greatly appreciated! The H20-3e's 140 GB of VRAM would be ideal for large-scale training if this compatibility issue can be resolved.
Thank you! 🙏