start faq entry on cuda/native code error possible causes [skip ci] #992
eordentlich merged 2 commits into NVIDIA:main
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
docs/site/FAQ.md
Outdated
### What are some possible causes of low-level CUDA and/or native code errors?

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
- NCCL communication library does not allow communication between processes on the same GPU. Check your Spark GPU configs to ensure 1 task per GPU during fit() calls.
Looks like a check in the code here would be better.
Good suggestion; we can address it in the future. There could be some perf penalty for null/NaN checking that we'd have to test.
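Until such a check exists in the library, a user-side pre-check is one option. The following is a minimal sketch, not part of spark-rapids-ml: `bad_value_filter` is a hypothetical helper that builds a Spark SQL predicate string, assuming the features live in scalar float/double columns (array or vector columns would need a different check). It relies only on the Spark SQL `IS NULL` operator and the built-in `isnan` function.

```python
from typing import Iterable


def bad_value_filter(columns: Iterable[str]) -> str:
    """Build a Spark SQL predicate that is true for any row holding a
    null or NaN in at least one of the given scalar numeric columns.

    Hypothetical helper for illustration; not a spark-rapids-ml API.
    """
    clauses = [f"(`{c}` IS NULL OR isnan(`{c}`))" for c in columns]
    if not clauses:
        raise ValueError("at least one column name is required")
    return " OR ".join(clauses)


# Hypothetical usage with an existing DataFrame `df` (not executed here):
#   n_bad = df.filter(bad_value_filter(["f1", "f2"])).count()
#   if n_bad > 0:
#       raise ValueError(f"{n_bad} rows contain nulls or NaNs")
```

The count triggers an extra pass over the data, which is the kind of overhead the reply above says would need to be measured.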
I think Bobby's comment was about checking task per GPU conf?
This line is a little confusing to me because we recommend fractional tasks/GPU under performance docs under the assumption that stage-level scheduling will handle the adjustment. Is this line intended for cases where stage level scheduling isn't supported?
Nvm didn't see the rest of the line.
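For readers following the task-per-GPU discussion, here is a small sketch of how a fractional `spark.task.resource.gpu.amount` translates into tasks sharing one GPU. `gpu_task_slots` is a hypothetical helper, not a Spark or spark-rapids-ml API. It assumes one GPU per executor and considers only the GPU resource bound; executor cores can cap concurrency further. It follows the flooring rule described in the Spark configuration docs for fractional resource amounts.

```python
import math
from typing import Mapping

TASK_GPU_KEY = "spark.task.resource.gpu.amount"


def gpu_task_slots(conf: Mapping[str, str]) -> int:
    """Return how many tasks may share one GPU under the given Spark conf.

    Hypothetical helper for illustration. Assumes one GPU per executor and
    ignores the separate limit imposed by executor cores.
    """
    amount = float(conf[TASK_GPU_KEY])
    if amount <= 0:
        raise ValueError(f"{TASK_GPU_KEY} must be positive")
    if amount >= 1:
        return 1  # each task holds at least one whole GPU
    return math.floor(1 / amount)  # fractional amounts are floored to slots


# A fractional setting allows several tasks per GPU, so fit() relies on
# stage-level scheduling (where supported) to get back to 1 task per GPU.
shared = gpu_task_slots({TASK_GPU_KEY: "0.25"})
exclusive = gpu_task_slots({TASK_GPU_KEY: "1"})
```

A result greater than 1 is not an error by itself; it only indicates that, where stage-level scheduling is unavailable, the config would need adjusting before `fit()` calls.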
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Greptile Overview
Greptile Summary
Adds a new FAQ entry documenting common causes of low-level CUDA and native code errors to help users troubleshoot these issues.
- Added question "What are some possible causes of low-level CUDA and/or native code errors?"
- Lists three main causes: NaNs/nulls in input data, NCCL library limitations with multiple processes per GPU, and unknown bugs
- Includes reference to stage-level scheduling documentation
- Minor formatting issue: extra spaces after periods in bullet points
Confidence Score: 4/5
- Safe to merge with minor spacing issues
- Documentation-only change with technically accurate content verified against recent codebase changes. Three minor spacing inconsistencies detected (extra spaces after periods in bullet points) that should be fixed for consistency with standard markdown formatting.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| docs/site/FAQ.md | 4/5 | Added FAQ entry for CUDA/native code errors with three common causes: NaNs/nulls, NCCL multi-process limitations, and unknown bugs. Minor spacing issue found. |
Sequence Diagram
sequenceDiagram
participant User
participant Documentation
participant FAQ
User->>FAQ: Encounters CUDA/native code error
FAQ->>User: Check for NaNs/nulls in input data
FAQ->>User: Verify NCCL multi-process config
Note over FAQ,User: Stage-level scheduling can help<br/>but not supported everywhere
FAQ->>User: Check Spark GPU configs<br/>(1 task per GPU during fit)
FAQ->>User: If issue persists, file bug report
1 file reviewed, 3 comments
### What are some possible causes of low-level CUDA and/or native code errors?

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
syntax: Extra space before "These"
- NaNs or nulls in the input data.  These are currently passed directly into the cuML layer and may trigger such errors.
- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
### What are some possible causes of low-level CUDA and/or native code errors?

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
- NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
syntax: Extra space before "[Stage level scheduling]"
- NCCL communication library does not allow communication between processes on the same GPU.  [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
- NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
- NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
- Previously unknown bugs. Please file an issue.
syntax: Extra space before "Please"
- Previously unknown bugs.  Please file an issue.
- Previously unknown bugs. Please file an issue.
No description provided.