Implement an FP8 GEMM kernel optimized for inference workloads.
The kernel should support scaling, accumulation, and output conversion
while maximizing tensor core utilization.
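The numerics of such a kernel (per-tensor scaling, FP32 accumulation, rescaled output) can be prototyped before touching tensor cores. Below is a NumPy sketch, assuming E4M3 inputs with per-tensor scales; `quantize_e4m3` is only a crude simulation (clamp to ±448 and round to 3 mantissa bits, ignoring subnormals and NaN encodings), not a bit-exact FP8 implementation.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def fp8_scale(x):
    """Per-tensor scale mapping x into E4M3's dynamic range."""
    amax = float(np.max(np.abs(x)))
    return amax / E4M3_MAX if amax > 0 else 1.0

def quantize_e4m3(x, scale):
    """Crude E4M3 simulation: scale, clamp, round to 3 mantissa bits."""
    y = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    m, e = np.frexp(y)              # m in +/-[0.5, 1)
    m = np.round(m * 16) / 16       # keep 3 significant mantissa bits
    return np.ldexp(m, e)

def fp8_gemm(a, b):
    """Scaled GEMM: quantize inputs, accumulate in FP32, rescale output."""
    sa, sb = fp8_scale(a), fp8_scale(b)
    qa, qb = quantize_e4m3(a, sa), quantize_e4m3(b, sb)
    acc = qa.astype(np.float32) @ qb.astype(np.float32)  # FP32 accumulation
    return acc * (sa * sb)  # fold both input scales back into the output

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128)).astype(np.float32)
b = rng.standard_normal((128, 32)).astype(np.float32)
ref = a @ b
out = fp8_gemm(a, b)
rel_err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
print(f"relative error vs FP32: {rel_err:.4f}")
```

In the real kernel the two scales would be applied once per output tile in the epilogue rather than per element, which is what makes per-tensor scaling essentially free.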
## Planned Benchmarks
- TFLOP/s achieved vs. an FP16 baseline
- Accuracy impact
- Memory bandwidth utilization
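For the throughput benchmark, a GEMM performs `2*m*n*k` FLOPs (one multiply and one add per inner-product term), so achieved TFLOP/s falls out of a timed loop. A minimal host-side harness sketch (names like `bench` and the matrix sizes are placeholders; the real harness would time the CUDA kernel with events, not wall-clock NumPy):

```python
import time
import numpy as np

def achieved_tflops(m, n, k, seconds):
    """A (m,k) @ (k,n) GEMM does 2*m*n*k FLOPs."""
    return 2 * m * n * k / seconds / 1e12

def bench(matmul, m=512, n=512, k=512, iters=10):
    rng = np.random.default_rng(0)
    a = rng.standard_normal((m, k)).astype(np.float32)
    b = rng.standard_normal((k, n)).astype(np.float32)
    matmul(a, b)  # warm-up, excluded from timing
    t0 = time.perf_counter()
    for _ in range(iters):
        matmul(a, b)
    dt = (time.perf_counter() - t0) / iters
    return achieved_tflops(m, n, k, dt)

print(f"{bench(np.matmul):.3f} TFLOP/s")
```

The same harness can report effective memory bandwidth by dividing bytes moved (`(m*k + k*n + m*n) * bytes_per_element`) by the measured time.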
## Learning Objectives
- FP8 formats and scaling
- Tensor core programming
- Numerical stability tradeoffs
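On the formats objective: the two standard FP8 formats are E4M3 (4 exponent bits, 3 mantissa bits, max finite 448; it reclaims the all-ones exponent for normal numbers and keeps only one NaN encoding, no infinities) and E5M2 (5 exponent bits, 2 mantissa bits, IEEE-style, max finite 57344). A small sketch deriving those maxima; `max_finite` and `reclaim_inf` are illustrative names, not a library API:

```python
def max_finite(exp_bits, man_bits, reclaim_inf=False):
    """Largest finite value of an IEEE-style mini-float.

    reclaim_inf=True models E4M3: the all-ones exponent still encodes
    normal numbers (only mantissa all-ones is NaN), so the max value
    uses that exponent with the mantissa one step below all-ones.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if reclaim_inf:
        exp = (2 ** exp_bits - 1) - bias                 # all-ones exponent usable
        frac = 1 + (2 ** man_bits - 2) / 2 ** man_bits   # one below all-ones
    else:
        exp = (2 ** exp_bits - 2) - bias                 # all-ones reserved for inf/NaN
        frac = 2 - 2 ** -man_bits                        # mantissa all-ones
    return frac * 2 ** exp

print(max_finite(4, 3, reclaim_inf=True))  # E4M3 -> 448.0
print(max_finite(5, 2))                    # E5M2 -> 57344.0
```

The practical upshot for scaling: E4M3 gives more precision per value but a narrow range, so per-tensor (or per-block) scale factors are needed to keep activations and weights inside ±448.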