
Wrong metrics for FLOP utilization and TensorCore usage when using bfloat16 #1712

@emergenz

Description


The following program multiplies two BF16 matrices on an H100. It achieves a throughput of roughly 600 TFLOP/s, which is impossible without TensorCores, and BF16 matrix multiplication is TensorCore-eligible on the H100. Nevertheless, the GPU kernel stats page of the JAX profiler incorrectly reports that the op is not TensorCore-eligible and that no TensorCores are used. The framework op stats page likewise identifies the op as not TensorCore-eligible.
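A minimal sketch of such a benchmark, assuming square 16384×16384 matrices and `jnp.bfloat16` inputs (the exact shapes and timing loop here are assumptions, not the original reproduction):

```python
# Minimal sketch of a BF16 matmul throughput measurement on an H100.
# Matrix size, iteration count, and timing approach are assumptions.
import time

import jax
import jax.numpy as jnp

N = 16384
ka, kb = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(ka, (N, N), dtype=jnp.bfloat16)
b = jax.random.normal(kb, (N, N), dtype=jnp.bfloat16)

matmul = jax.jit(lambda x, y: x @ y)
matmul(a, b).block_until_ready()  # compile once before timing

iters = 10
start = time.perf_counter()
for _ in range(iters):
    out = matmul(a, b)
out.block_until_ready()
elapsed = time.perf_counter() - start

flops = 2 * N**3 * iters  # 2*N^3 FLOPs per N x N matmul
print(f"throughput: {flops / elapsed / 1e12:.1f} TFLOP/s")
```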

Again, such high throughput is impossible without TensorCores. (The 600 TFLOP/s figure is my own manual measurement, and it matches the throughput reported in the profiler's graph viewer.)

On a related note, it is also unclear to me whether the peak FLOP/s used to compute FLOPS utilization is correct; the peak should differ between BF16 and TF32.
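To illustrate why the choice of peak matters, here is a rough utilization check assuming an H100 SXM and the commonly quoted dense (non-sparse) peak throughputs; PCIe parts have lower peaks, so these numbers are assumptions, not the profiler's actual constants:

```python
# Rough utilization check under assumed H100 SXM dense peak throughputs.
achieved_tflops = 600.0   # measured throughput from the matmul above
peak_bf16_tc    = 989.0   # BF16 TensorCore peak, TFLOP/s (dense, assumed)
peak_tf32_tc    = 494.5   # TF32 TensorCore peak, TFLOP/s (dense, assumed)
peak_fp32       = 67.0    # FP32 non-TensorCore peak, TFLOP/s (assumed)

print(f"vs BF16 TC peak: {achieved_tflops / peak_bf16_tc:.0%}")  # ~61%
print(f"vs TF32 TC peak: {achieved_tflops / peak_tf32_tc:.0%}")  # ~121% (impossible)
print(f"vs FP32 peak:    {achieved_tflops / peak_fp32:.0%}")     # ~896% (impossible)
```

Only the BF16 TensorCore peak yields a utilization below 100%, which is consistent with the op actually running on TensorCores; if the profiler divides by a TF32 or non-TensorCore FP32 peak, the reported utilization cannot be right.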
