Understanding benchmark results #46
Hi, thanks for sharing your benchmarks!
I have some questions about the benchmark results: why does quantization not result in larger memory savings and a bigger inference speedup?
For example, if we look at the benchmark results for CogVideoX on H100:
- With no compilation, bf16 takes 112 seconds while fp8 takes 113 seconds. Is there no benefit from torchao quantization unless compilation is also used?
- With compilation enabled, bf16 takes 87 seconds and 33 GB of memory, while fp8 + compilation takes 75 seconds and 23 GB. Why are the memory savings not 2x, since you're going from 16-bit to 8-bit weights? Is it because only the linear layers are quantized, but not the attention operations? Also, the time saving is only ~0.24 seconds per step (12 seconds saved over 50 steps). I would expect inference to be much faster with quantization.
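To make the "why not 2x" question concrete, here is a minimal back-of-envelope sketch: if only the linear-layer weights drop from 16-bit to 8-bit while everything else stays at full precision, total weight memory shrinks by much less than half. The 80% linear-layer share below is a hypothetical number for illustration, not a measured CogVideoX figure.

```python
# Back-of-envelope check: weight-only fp8 quantization of linear layers
# gives less than a 2x memory saving when other tensors stay in bf16.

def memory_ratio(linear_fraction, full_bits=16, quant_bits=8):
    """Fraction of the original weight memory remaining after quantizing
    only the linear-layer share of the parameters."""
    return linear_fraction * quant_bits / full_bits + (1.0 - linear_fraction)

# Hypothetical: ~80% of parameters in linear layers -> ~0.6 of the
# original weight memory remains, i.e. ~1.67x saving rather than 2x.
print(memory_ratio(0.8))

# Per-step time saving implied by the reported numbers: 87 s vs 75 s
# over 50 steps is (87 - 75) / 50 = 0.24 s per step.
print((87 - 75) / 50)
```

Activations and any unquantized components (e.g. attention, normalization layers) would dilute the saving further, which may explain the observed 33 GB to 23 GB drop.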
Thanks!