
After performing static and dynamic quantization to int8, the inference speed became slower rather than faster #1203

@DuckGGt

Description


Checks

  • This template is only for usage issues encountered.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I am using English to submit this issue to facilitate community communication.

Environment Details

Ubuntu 22.04.5 LTS
Python 3.10.15
torch 2.5.0a0+872d972e41.nv24.8
onnxruntime-gpu 1.23.0
onnx 1.19.0

Steps to Reproduce

1. Create a new Conda environment.
2. Import the F5-TTS project.
3. Export the transformer blocks from F5-TTS to ONNX format.
4. Use onnxruntime's quant_pre_process to infer input shapes and obtain pre_onnx as the input model for quantization.
5. Perform static quantization with the following settings (a minimal sketch of the equivalent calls follows this list):
  • Quantized ops: MatMul, Conv
  • per_channel=True
  • extra_options={"ActivationSymmetric": True, "WeightSymmetric": True}
  • Use the Aishell dataset (speaker S0002) as the calibration set, keeping all other parameters at their defaults.
6. Compare the inference speed before and after quantization: the quantized ONNX model runs slower than the original FP32 model.
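
For reference, here is a minimal sketch of how steps 4–6 can be wired together with onnxruntime's quantization API. The model filename transformer_blocks.onnx, the input name x and its shape, and the AishellReader stub are illustrative placeholders, not the actual F5-TTS export details; the real calibration reader would feed features computed from Aishell speaker S0002.

```python
# Sketch of steps 4-6; placeholder model path, input name/shape, and
# calibration data (the real F5-TTS export differs).
import time

import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantType,
    quantize_static,
)
from onnxruntime.quantization.shape_inference import quant_pre_process

# Step 4: pre-process the exported model (shape inference + optimization).
quant_pre_process("transformer_blocks.onnx", "pre.onnx")


class AishellReader(CalibrationDataReader):
    """Yields calibration batches; random data stands in for features
    derived from Aishell speaker S0002."""

    def __init__(self, num_samples=32):
        self.batches = iter(
            [{"x": np.random.randn(1, 256, 512).astype(np.float32)}
             for _ in range(num_samples)]
        )

    def get_next(self):
        # Returning None signals the end of calibration data.
        return next(self.batches, None)


# Step 5: static quantization with the settings from the report.
quantize_static(
    "pre.onnx",
    "quant.onnx",
    calibration_data_reader=AishellReader(),
    op_types_to_quantize=["MatMul", "Conv"],
    per_channel=True,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    extra_options={"ActivationSymmetric": True, "WeightSymmetric": True},
)

# Step 6: rough latency comparison on the CUDA execution provider.
feed = {"x": np.random.randn(1, 256, 512).astype(np.float32)}
for path in ("pre.onnx", "quant.onnx"):
    sess = ort.InferenceSession(path, providers=["CUDAExecutionProvider"])
    sess.run(None, feed)  # warm-up run before timing
    t0 = time.perf_counter()
    for _ in range(20):
        sess.run(None, feed)
    print(path, (time.perf_counter() - t0) / 20, "s/iter")
```

The warm-up call before the timing loop is there so that session creation and first-run kernel initialization do not get counted against either model.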

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)