Understanding benchmark results #46

@amogkam

Description

Hi thanks for sharing your benchmarks!

I have some questions about the benchmark results -- why does quantization not result in more memory savings and inference speedup?

For example, if we look at the benchmark results for CogVideoX on H100:

  1. With no compilation, bf16 takes 112 seconds while fp8 without compilation takes 113 seconds. Is there no benefit from torchao quantization unless compilation is also used?
  2. With compilation enabled, bf16 takes 87 seconds and 33 GB of memory vs. 75 seconds and 23 GB for fp8 + compilation. Why is the memory saving not 2x, given you're going from 16-bit to 8-bit? Is it because only the linear layers are quantized, but not the attention operation? Also, the time saving is only ~0.24 seconds per step (12 seconds over 50 steps) -- I would expect inference to be a lot faster with quantization.
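To make my question about point 2 concrete, here is the back-of-the-envelope model I have in mind. The 20 GB / 13 GB split below is purely illustrative (I have not measured it), but it shows how a "quantize only the linear weights to 8-bit" scheme could reproduce the reported 33 GB → 23 GB numbers without ever approaching a 2x saving:

```python
def estimated_peak_memory_gb(linear_weights_gb: float,
                             other_gb: float,
                             weight_bits: int) -> float:
    """Rough peak-memory estimate when only linear-layer weights
    are stored at `weight_bits`.

    Everything else (activations, attention workspace, embeddings,
    norm/conv weights) is assumed to stay in bf16, so only the
    linear-weight slice shrinks. Numbers fed in are hypothetical.
    """
    return linear_weights_gb * (weight_bits / 16) + other_gb

# Illustrative split: 20 GB of linear weights, 13 GB of everything else.
bf16_peak = estimated_peak_memory_gb(20, 13, 16)  # all bf16
fp8_peak = estimated_peak_memory_gb(20, 13, 8)    # only linears halved
print(bf16_peak, fp8_peak)
```

Under that assumed split, the estimate lands on 33 GB and 23 GB, matching the benchmark table, which is why I suspect the unquantized portion (attention, activations) is what caps the saving well below 2x. Is that the right mental model?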

Thanks!
