Thank you for your great work!
I tested the official inference code with quantization enabled, expecting faster inference and lower VRAM usage.
However, I observed the opposite: with quantization, inference is slower and consumes more VRAM than the baseline.
Model: SDXL-Turbo
Baseline:
- Inference time: 0.5241-0.9573 seconds per step, 4.1236 seconds total
- VRAM usage: ~17.1489 GB

Quantized (w4w8g8):
- Inference time: 1.4139-1.4310 seconds per step, 7.4306 seconds total
- VRAM usage: ~31.4364 GB
BOPs and FLOPs are reduced as expected.
The performance gap is consistent across multiple runs.
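For reference, here is a minimal sketch of how I collected the per-step timings, so the comparison is reproducible. This is a generic harness, not the repository's code; `step_fn` stands in for one denoising step of the pipeline (hypothetical name), and CUDA synchronization/peak-memory calls would be added when running on GPU.

```python
import time

def time_steps(step_fn, num_steps):
    """Call step_fn num_steps times and return (per-step seconds, total seconds).

    On GPU, torch.cuda.synchronize() should be called before each
    perf_counter() read, and torch.cuda.max_memory_allocated() can be
    queried afterward for peak VRAM (omitted here to keep the sketch
    framework-agnostic).
    """
    per_step = []
    for _ in range(num_steps):
        t0 = time.perf_counter()
        step_fn()
        per_step.append(time.perf_counter() - t0)
    return per_step, sum(per_step)

# Example with a dummy step standing in for one UNet denoising step:
steps, total = time_steps(lambda: sum(range(10_000)), num_steps=4)
print(f"{len(steps)} steps, {total:.4f} s total")
```

Both the baseline and quantized runs were measured the same way, with warm-up iterations excluded.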
Could this be due to the specific quantization method or implementation overhead?