Add CUDA FP8 type + conversion bindings (E5M2/E4M3/E8M0), HW-accel detection, and comprehensive tests #686
Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Greptile Summary
This PR successfully adds comprehensive FP8 type support to numba-cuda by introducing auto-generated bindings for CUDA FP8 types (E5M2, E4M3, E8M0), proper compiler integration via typing and target registries, and a public hardware-acceleration detection API. The implementation closely follows established patterns from existing fp16/bf16 support and includes thorough test coverage with proper edge case handling (NaN, saturation modes, type conversions).
Key Changes:
Integration Quality:
Confidence Score: 5/5
Important Files Changed
Additional Comments (4)
- `numba_cuda/numba/cuda/include/13/cuda_fp8.h`, line 65 (link): syntax: typo in macro condition comment - should be 'CUDACC' not '_CUDACC'
- `numba_cuda/numba/cuda/tests/cudapy/test_fp8_bindings.py`, line 368 (link): logic: using `np.array([float("nan")] * 9)` creates array of NaN values but assertion may not behave as expected since NaN != NaN
- `numba_cuda/numba/cuda/tests/cudapy/test_fp8_bindings.py`, line 294 (link): logic: result array declared as uint64 but should be int64 or float64 since zero conversions include both int64(zero) and float32(zero)
- `numba_cuda/numba/cuda/_internal/cuda_fp8.py`, line 832-853 (link): logic: another duplicate function definition for int8 conversion with different C++ type mapping (char vs signed char)
13 files reviewed, 4 comments
I addressed points 3 and 4.
Additional Comments (3)
- `numba_cuda/numba/cuda/tests/cudapy/test_fp8_bindings.py`, line 242-250 (link): style: Commented-out test cases for E5M2 and E4M3 negative values should be removed or implemented if they're valid test scenarios. Are these commented test cases intentionally disabled due to known limitations, or should they be implemented?
  Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
- `numba_cuda/numba/cuda/tests/cudapy/test_fp8_bindings.py`, line 368 (link): logic: Using `np.array([float('nan')] * 9)` for NaN comparison may not work reliably since NaN != NaN - consider using `np.testing.assert_array_equal` with `np.isnan()` checks instead (a sketch of such a check follows below).
- `numba_cuda/numba/cuda/_internal/cuda_fp8.py`, line 784-825 (link): logic: Duplicate function definition detected - both `_from___nv_fp8_e5m2_to_uint64_lower` functions register the same `@lower_cast(_type_fp8_e5m2, uint64)` decorator but with different mangled names (`_ZNK13__nv_fp8_e5m2cvmEv` vs `_ZNK13__nv_fp8_e5m2cvyEv`). The second registration will override the first, potentially causing runtime issues. Is this intentional behavior where unsigned long and unsigned long long both map to uint64, or should these have different target types?
13 files reviewed, 3 comments
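For reference, a NaN-tolerant array comparison along the lines suggested above could look like the following sketch (the helper name is hypothetical; it only assumes plain NumPy float arrays):

```python
import numpy as np

def assert_array_equal_with_nan(actual, expected):
    """Elementwise equality check that treats NaN positions as matching."""
    actual = np.asarray(actual)
    expected = np.asarray(expected)
    nan_mask = np.isnan(expected)
    # NaNs must appear in exactly the same positions in both arrays...
    np.testing.assert_array_equal(np.isnan(actual), nan_mask)
    # ...and all non-NaN elements must be equal.
    np.testing.assert_array_equal(actual[~nan_mask], expected[~nan_mask])

# Both arrays hold NaN in the same slot, so this passes.
assert_array_equal_with_nan(np.array([1.0, np.nan]), np.array([1.0, np.nan]))
```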
Greptile found no issues! From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
/ok to test
/ok to test |
Greptile Overview
Greptile Summary
This PR adds comprehensive FP8 type support to numba-cuda, introducing bindings for CUDA's native FP8 types (E5M2, E4M3, E8M0) along with conversion intrinsics and hardware acceleration detection.
Key Changes
- New public API: `cuda.is_fp8_accelerated()` returns whether the current device has native FP8 hardware support (CC >= 8.9)
- Autogenerated bindings: ~3900 lines of generated code providing type definitions, constructors, conversions, and intrinsics for all FP8 variants
- Compiler integration: FP8 types registered in both typing and target contexts, enabling seamless use in JIT kernels
- Comprehensive tests: 550+ lines covering constructors, type conversions, NaN handling, saturation modes, and rounding behaviors
- Simulator support: FP8 API works in simulation mode (always returns `False` for hardware acceleration)
- Header vendoring: CUDA Toolkit 12 and 13 FP8 headers included for binding generation
Implementation Quality
The implementation follows established patterns from existing FP16/BF16 support. The autogenerated bindings are excluded from linting (appropriate for generated code). Tests are thorough and cover edge cases including overflow behavior with saturation modes and rounding mode variants for E8M0 conversions.
The hardware detection logic correctly identifies Ada Lovelace and Hopper architectures (CC 8.9+) as having native FP8 acceleration, while older devices fall back to software emulation.
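As a usage sketch of the new capability check (only `cuda.is_fp8_accelerated()` itself comes from this PR; the branching is illustrative):

```python
from numba import cuda

# Branch on native FP8 support at runtime. On devices below CC 8.9 the FP8
# conversions still work, but they are software-emulated rather than
# hardware-accelerated.
if cuda.is_fp8_accelerated():
    print("Native FP8 conversions available (CC >= 8.9)")
else:
    print("FP8 conversions will be software-emulated on this device")
```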
Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- Well-structured implementation following existing patterns for FP16/BF16, comprehensive test coverage, autogenerated bindings reduce manual error risk, and clear separation of concerns across files
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| numba_cuda/numba/cuda/api.py | 5/5 | Adds is_fp8_accelerated() function to check FP8 hardware support, and updates detect() to display FP8 acceleration status |
| numba_cuda/numba/cuda/cudadrv/driver.py | 5/5 | Adds accelerates_fp8 property to Device class that checks compute capability >= 8.9 for hardware FP8 acceleration |
| numba_cuda/numba/cuda/fp8.py | 5/5 | Simple registry shim that exports typing and target registries from autogenerated FP8 bindings |
| numba_cuda/numba/cuda/target.py | 5/5 | Integrates FP8 type registries into CUDA typing and target contexts for compiler support |
| configs/cuda_fp8.yml | 5/5 | Configuration file for FP8 binding generator specifying types, data models, and code generation parameters |
| numba_cuda/numba/cuda/_internal/cuda_fp8.py | 5/5 | Autogenerated bindings (~3900 lines) for CUDA FP8 types (E5M2/E4M3/E8M0) and conversion intrinsics |
| numba_cuda/numba/cuda/tests/cudapy/test_fp8_bindings.py | 5/5 | Comprehensive test suite covering FP8 constructors, conversions, NaN handling, saturation modes, and rounding |
| numba_cuda/numba/cuda/tests/cudapy/test_fp8_e4m3_int8_debug.py | 5/5 | Debug test for fp8_e4m3 to int8 conversions with comprehensive assertions and diagnostic output |
/ok to test
The changes look good to me. Obviously we need to get to the bottom of the A100 issues, but otherwise I think everything is good.
Edited: Originally I mentioned "L4 issues" as well, but then I saw that that failure was just a network error. Only the A100 shows issues, not the L4.
I had a quick look at the FP8 headers, and I think that things are going wrong in `__nv_cvt_fp8_to_halfraw`. That said, I wonder if it's acceptable and more straightforward to only support FP8 on Hopper / Ada and beyond? Implementation of the function, lifted from the header:

```c
__CUDA_HOSTDEVICE_FP8_DECL__ __half_raw
__nv_cvt_fp8_to_halfraw(const __nv_fp8_storage_t x,
const __nv_fp8_interpretation_t fp8_interpretation) {
__half_raw res;
res.x = 0U;
#if (defined __CUDA_ARCH__) && (__CUDA_ARCH__ >= 890)
res.x =
__nv_cvt_fp8x2_to_halfraw2((__nv_fp8x2_storage_t)x, fp8_interpretation)
.x;
#else
unsigned short int ur = (unsigned short int)x;
ur = (unsigned short int)(ur << 8U);
if (fp8_interpretation == __NV_E5M2) {
if ((ur & 0x7FFFU) > 0x7C00U) {
/* If NaN, return canonical NaN */
ur = 0x7FFFU;
}
} else { // __NV_E4M3
unsigned short int sign = ur & 0x8000U;
unsigned short int exponent =
(unsigned short int)(((ur & 0x7800U) >> 1U) + 0x2000U);
unsigned short int mantissa = (ur & 0x0700U) >> 1U;
unsigned char absx = 0x7FU & (unsigned char)x;
if (absx == 0x7FU) // NaN
{
ur = 0x7FFFU; // fp16 canonical NaN, discard sign
} else if (exponent == 0x2000U) {
// zero or denormal
if (mantissa != 0U) {
// normalize
mantissa = (unsigned short int)(mantissa << 1U);
while ((mantissa & 0x0400U) == 0U) {
mantissa = (unsigned short int)(mantissa << 1U);
exponent = (unsigned short int)(exponent - 0x0400U);
}
// discard implicit leading bit
mantissa &= 0x03FFU;
} else { // Zero
exponent = 0U;
}
ur = (sign | exponent) | mantissa;
} else {
ur = (sign | exponent) | mantissa;
}
}
res.x = ur;
#endif
return res;
}
```
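For illustration only, the E5M2 branch of this software fallback is simple enough to mirror in Python (a sketch, not part of the bindings; the helper name is hypothetical):

```python
import numpy as np

def fp8_e5m2_to_fp16_bits(x: int) -> int:
    """Mirror of the header's pre-sm_89 E5M2 path: widen an FP8 byte to fp16 bits."""
    ur = (x & 0xFF) << 8          # E5M2 (1-5-2) occupies the top byte of fp16 (1-5-10)
    if (ur & 0x7FFF) > 0x7C00:    # exponent all ones with nonzero mantissa -> NaN
        ur = 0x7FFF               # canonical fp16 NaN, as in the header
    return ur

# 0x7C encodes +inf in E5M2; the widened pattern 0x7C00 is +inf in fp16.
bits = fp8_e5m2_to_fp16_bits(0x7C)
assert bits == 0x7C00
assert np.isinf(np.array([bits], dtype=np.uint16).view(np.float16)[0])
```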
Summary
Adds FP8 support in `numba-cuda` by introducing generated bindings for CUDA FP8 types and conversion intrinsics (including saturation + rounding-mode variants), wiring them into CUDA typing/target registries, and adding a comprehensive test suite. Also adds a public runtime check for FP8 hardware acceleration.
Motivation / Context
- Expose CUDA FP8 types (`__nv_fp8_*`) and conversion intrinsics from Numba CUDA kernels.
- Provide a runtime capability check (`cuda.is_fp8_accelerated()`) so users can branch on HW-accelerated FP8 vs software-emulated behavior.
What’s in this PR
FP8 bindings + API surface
- `numba_cuda/numba/cuda/_internal/cuda_fp8.py`
- `numba_cuda/numba/cuda/fp8.py` (exports `typing_registry` + `target_registry`)
- `numba_cuda/numba/cuda/include/12/cuda_fp8.h`
- `numba_cuda/numba/cuda/include/12/cuda_fp8.hpp`
- `numba_cuda/numba/cuda/include/13/cuda_fp8.h`
- `numba_cuda/numba/cuda/include/13/cuda_fp8.hpp`
- `configs/cuda_fp8.yml`
Compiler integration
- `numba_cuda/numba/cuda/target.py` (`fp8.typing_registry`, `fp8.target_registry`).
Public capability API
- `numba_cuda/numba/cuda/api.py::is_fp8_accelerated()`
- `numba_cuda/numba/cuda/cudadrv/driver.py::Device.accelerates_fp8` (CC >= 8.9)
- `numba_cuda/numba/cuda/simulator/api.py::is_fp8_accelerated()` → always `False`
- `detect()` output reports FP8 HW acceleration status.
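A minimal sketch of what the `accelerates_fp8` check amounts to, assuming `Device` exposes `compute_capability` as a `(major, minor)` tuple (the actual implementation lives in `driver.py`):

```python
class Device:
    # ... existing driver attributes elided ...

    @property
    def accelerates_fp8(self):
        # Ada Lovelace (8.9) and Hopper (9.0) onward provide native FP8 conversions.
        return self.compute_capability >= (8, 9)
```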
Tests
- `numba_cuda/numba/cuda/tests/cudapy/test_fp8_bindings.py`
- Saturation mode variants (`saturation_t`)
- Rounding modes via `cuda.bindings.runtime.cudaRoundMode` (notably for E8M0 paths)
User-facing API changes
- `numba.cuda.is_fp8_accelerated()` returns `True` when the device supports FP8 HW acceleration; otherwise `False`.
- New internal module `numba.cuda._internal.cuda_fp8` providing FP8 types and conversion intrinsics (available for advanced use).
Testing
Added test coverage via `numba_cuda/numba/cuda/tests/cudapy/test_fp8_bindings.py`.
Notes
- `numba_cuda/numba/cuda/_internal/cuda_fp8.py` is autogenerated (see header for generator/version).
- `cuda.is_fp8_accelerated()` is the recommended guard for performance-sensitive paths.
Follow ups