Understanding benchmark results #46

@amogkam

Description

Hi thanks for sharing your benchmarks!

I have some questions about the benchmark results -- why does quantization not result in more memory savings and inference speedup?

For example, if we look at the benchmark results for CogVideoX on H100:

  1. With no compilation, bf16 takes 112 seconds while fp8 without compilation takes 113 seconds. Is there no benefit from torchao quantization unless compilation is also used?
  2. With compilation enabled, bf16 takes 87 seconds and 33 GB of memory vs. 75 seconds and 23 GB for fp8 + compilation. Why is the memory saving not 2x, given you're going from 16-bit to 8-bit? Is it because only the linear layers are quantized, but not the attention operation? Also, the time saving is only ~0.24 seconds per step (12 seconds over 50 steps) -- I would expect inference to be a lot faster with quantization.
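To make my question about point 2 concrete, here is the back-of-the-envelope model I have in mind. The 20 GB / 13 GB split below is purely illustrative (I have not measured it), but it shows how a "quantize only the linear weights to 8-bit" scheme could reproduce the reported 33 GB → 23 GB numbers without ever approaching a 2x saving:

```python
def estimated_peak_memory_gb(linear_weights_gb: float,
                             other_gb: float,
                             weight_bits: int) -> float:
    """Rough peak-memory estimate when only linear-layer weights
    are stored at `weight_bits`.

    Everything else (activations, attention workspace, embeddings,
    norm/conv weights) is assumed to stay in bf16, so only the
    linear-weight slice shrinks. Numbers fed in are hypothetical.
    """
    return linear_weights_gb * (weight_bits / 16) + other_gb

# Illustrative split: 20 GB of linear weights, 13 GB of everything else.
bf16_peak = estimated_peak_memory_gb(20, 13, 16)  # all bf16
fp8_peak = estimated_peak_memory_gb(20, 13, 8)    # only linears halved
print(bf16_peak, fp8_peak)
```

Under that assumed split, the estimate lands on 33 GB and 23 GB, matching the benchmark table, which is why I suspect the unquantized portion (attention, activations) is what caps the saving well below 2x. Is that the right mental model?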

Thanks!
