Implement an FP8 GEMM kernel optimized for inference workloads.
The kernel should support scaling, accumulation, and output conversion
while maximizing tensor core utilization.
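The numerics of such a kernel (per-tensor scaling, FP32 accumulation, rescaled output) can be prototyped before touching tensor cores. Below is a NumPy sketch, assuming E4M3 inputs with per-tensor scales; `quantize_e4m3` is only a crude simulation (clamp to ±448 and round to 3 mantissa bits, ignoring subnormals and NaN encodings), not a bit-exact FP8 implementation.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def fp8_scale(x):
    """Per-tensor scale mapping x into E4M3's dynamic range."""
    amax = float(np.max(np.abs(x)))
    return amax / E4M3_MAX if amax > 0 else 1.0

def quantize_e4m3(x, scale):
    """Crude E4M3 simulation: scale, clamp, round to 3 mantissa bits."""
    y = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    m, e = np.frexp(y)              # m in +/-[0.5, 1)
    m = np.round(m * 16) / 16       # keep 3 significant mantissa bits
    return np.ldexp(m, e)

def fp8_gemm(a, b):
    """Scaled GEMM: quantize inputs, accumulate in FP32, rescale output."""
    sa, sb = fp8_scale(a), fp8_scale(b)
    qa, qb = quantize_e4m3(a, sa), quantize_e4m3(b, sb)
    acc = qa.astype(np.float32) @ qb.astype(np.float32)  # FP32 accumulation
    return acc * (sa * sb)  # fold both input scales back into the output

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128)).astype(np.float32)
b = rng.standard_normal((128, 32)).astype(np.float32)
ref = a @ b
out = fp8_gemm(a, b)
rel_err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
print(f"relative error vs FP32: {rel_err:.4f}")
```

In the real kernel the two scales would be applied once per output tile in the epilogue rather than per element, which is what makes per-tensor scaling essentially free.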
## Planned Benchmarks
- TFLOP/s achieved vs. an FP16 baseline
- Accuracy impact
- Memory bandwidth utilization
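For the throughput benchmark, a GEMM performs `2*m*n*k` FLOPs (one multiply and one add per inner-product term), so achieved TFLOP/s falls out of a timed loop. A minimal host-side harness sketch (names like `bench` and the matrix sizes are placeholders; the real harness would time the CUDA kernel with events, not wall-clock NumPy):

```python
import time
import numpy as np

def achieved_tflops(m, n, k, seconds):
    """A (m,k) @ (k,n) GEMM does 2*m*n*k FLOPs."""
    return 2 * m * n * k / seconds / 1e12

def bench(matmul, m=512, n=512, k=512, iters=10):
    rng = np.random.default_rng(0)
    a = rng.standard_normal((m, k)).astype(np.float32)
    b = rng.standard_normal((k, n)).astype(np.float32)
    matmul(a, b)  # warm-up, excluded from timing
    t0 = time.perf_counter()
    for _ in range(iters):
        matmul(a, b)
    dt = (time.perf_counter() - t0) / iters
    return achieved_tflops(m, n, k, dt)

print(f"{bench(np.matmul):.3f} TFLOP/s")
```

The same harness can report effective memory bandwidth by dividing bytes moved (`(m*k + k*n + m*n) * bytes_per_element`) by the measured time.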
## Learning Objectives
- FP8 formats and scaling
- Tensor core programming
- Numerical stability tradeoffs
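On the formats objective: the two standard FP8 formats are E4M3 (4 exponent bits, 3 mantissa bits, max finite 448; it reclaims the all-ones exponent for normal numbers and keeps only one NaN encoding, no infinities) and E5M2 (5 exponent bits, 2 mantissa bits, IEEE-style, max finite 57344). A small sketch deriving those maxima; `max_finite` and `reclaim_inf` are illustrative names, not a library API:

```python
def max_finite(exp_bits, man_bits, reclaim_inf=False):
    """Largest finite value of an IEEE-style mini-float.

    reclaim_inf=True models E4M3: the all-ones exponent still encodes
    normal numbers (only mantissa all-ones is NaN), so the max value
    uses that exponent with the mantissa one step below all-ones.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if reclaim_inf:
        exp = (2 ** exp_bits - 1) - bias                 # all-ones exponent usable
        frac = 1 + (2 ** man_bits - 2) / 2 ** man_bits   # one below all-ones
    else:
        exp = (2 ** exp_bits - 2) - bias                 # all-ones reserved for inf/NaN
        frac = 2 - 2 ** -man_bits                        # mantissa all-ones
    return frac * 2 ** exp

print(max_finite(4, 3, reclaim_inf=True))  # E4M3 -> 448.0
print(max_finite(5, 2))                    # E5M2 -> 57344.0
```

The practical upshot for scaling: E4M3 gives more precision per value but a narrow range, so per-tensor (or per-block) scale factors are needed to keep activations and weights inside ±448.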