Thank you for your great work!
I tested the official inference code with quantization enabled, expecting faster inference and lower VRAM usage.
However, I observed the opposite: with quantization, inference is slower and consumes more VRAM than the baseline.
Model: SDXL-Turbo
Baseline:
- Inference time: 0.5241-0.9573 seconds per step, 4.1236 seconds total
- VRAM usage: ~17.1489 GB

Quantized (w4w8g8):
- Inference time: 1.4139-1.4310 seconds per step, 7.4306 seconds total
- VRAM usage: ~31.4364 GB
BOPs and FLOPs are reduced as expected.
The performance gap is consistent across multiple runs.
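For reference, here is a minimal sketch of how I collected the per-step timings, so the comparison is reproducible. This is a generic harness, not the repository's code; `step_fn` stands in for one denoising step of the pipeline (hypothetical name), and CUDA synchronization/peak-memory calls would be added when running on GPU.

```python
import time

def time_steps(step_fn, num_steps):
    """Call step_fn num_steps times and return (per-step seconds, total seconds).

    On GPU, torch.cuda.synchronize() should be called before each
    perf_counter() read, and torch.cuda.max_memory_allocated() can be
    queried afterward for peak VRAM (omitted here to keep the sketch
    framework-agnostic).
    """
    per_step = []
    for _ in range(num_steps):
        t0 = time.perf_counter()
        step_fn()
        per_step.append(time.perf_counter() - t0)
    return per_step, sum(per_step)

# Example with a dummy step standing in for one UNet denoising step:
steps, total = time_steps(lambda: sum(range(10_000)), num_steps=4)
print(f"{len(steps)} steps, {total:.4f} s total")
```

Both the baseline and quantized runs were measured the same way, with warm-up iterations excluded.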
Could this be due to the specific quantization method or implementation overhead?