
Wrong metrics for FLOP utilization and TensorCore usage when using bfloat16 #1712

@emergenz

Description


The following program multiplies two BF16 matrices on an H100. It achieves a throughput of roughly 600 TFLOP/s, which is impossible without TensorCores, and BF16 matrix multiplication is TensorCore-eligible on the H100. Nevertheless, the GPU kernel stats page of the JAX profiler incorrectly reports that the op is not TensorCore-eligible and that no TensorCores are used. The framework op stats page likewise identifies the op as not TensorCore-eligible.
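A minimal sketch of such a benchmark, assuming square 16384×16384 matrices and `jnp.bfloat16` inputs (the exact shapes and timing loop here are assumptions, not the original reproduction):

```python
# Minimal sketch of a BF16 matmul throughput measurement on an H100.
# Matrix size, iteration count, and timing approach are assumptions.
import time

import jax
import jax.numpy as jnp

N = 16384
ka, kb = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(ka, (N, N), dtype=jnp.bfloat16)
b = jax.random.normal(kb, (N, N), dtype=jnp.bfloat16)

matmul = jax.jit(lambda x, y: x @ y)
matmul(a, b).block_until_ready()  # compile once before timing

iters = 10
start = time.perf_counter()
for _ in range(iters):
    out = matmul(a, b)
out.block_until_ready()
elapsed = time.perf_counter() - start

flops = 2 * N**3 * iters  # 2*N^3 FLOPs per N x N matmul
print(f"throughput: {flops / elapsed / 1e12:.1f} TFLOP/s")
```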

Again, such high throughput is impossible without TensorCores. (The 600 TFLOP/s figure is my own manual measurement, and it matches the throughput reported in the profiler's graph viewer.)

On a related note, it is also unclear to me whether the peak FLOP/s used to compute FLOPS utilization is correct; the peak should differ between BF16 and TF32.
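To illustrate why the choice of peak matters, here is a rough utilization check assuming an H100 SXM and the commonly quoted dense (non-sparse) peak throughputs; PCIe parts have lower peaks, so these numbers are assumptions, not the profiler's actual constants:

```python
# Rough utilization check under assumed H100 SXM dense peak throughputs.
achieved_tflops = 600.0   # measured throughput from the matmul above
peak_bf16_tc    = 989.0   # BF16 TensorCore peak, TFLOP/s (dense, assumed)
peak_tf32_tc    = 494.5   # TF32 TensorCore peak, TFLOP/s (dense, assumed)
peak_fp32       = 67.0    # FP32 non-TensorCore peak, TFLOP/s (assumed)

print(f"vs BF16 TC peak: {achieved_tflops / peak_bf16_tc:.0%}")  # ~61%
print(f"vs TF32 TC peak: {achieved_tflops / peak_tf32_tc:.0%}")  # ~121% (impossible)
print(f"vs FP32 peak:    {achieved_tflops / peak_fp32:.0%}")     # ~896% (impossible)
```

Only the BF16 TensorCore peak yields a utilization below 100%, which is consistent with the op actually running on TensorCores; if the profiler divides by a TF32 or non-TensorCore FP32 peak, the reported utilization cannot be right.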
