
Is it possible to smooth FC2 input? #108


Description

@MaxwellWjj

Hi Developers,

Recently, while applying SmoothQuant on my side (Qwen3-1.7B), I found that FC2 (the down_proj in Qwen-style naming) is not included in the smoothed layers. However, I observed that the static per-tensor scaling factor of this layer's input can become extremely large when no smoothing is applied.

Layer 0: {'q_proj_input': 0.009227362204724409, 'o_proj_input': 0.021776574803149606, 'gate_input': 0.010765255905511811, 'down_input': 0.15748031496062992}

Layer 1: {'q_proj_input': 0.008427657480314961, 'o_proj_input': 0.011441929133858268, 'gate_input': 0.015071358267716535, 'down_input': 1.236220472440945}

Layer 2: {'q_proj_input': 0.009781003937007874, 'o_proj_input': 0.018331692913385825, 'gate_input': 0.023375984251968504, 'down_input': 133.03937007874015}

...

Layer 26: {'q_proj_input': 0.03297244094488189, 'o_proj_input': 2.031496062992126, 'gate_input': 0.022637795275590553, 'down_input': 11.21259842519685}

Layer 27: {'q_proj_input': 0.03641732283464567, 'o_proj_input': 3.0078740157480315, 'gate_input': 0.035679133858267716, 'down_input': 23.433070866141733}

As you can see, down_input here refers to the per-tensor scale of the down_proj input (which should correspond to fc2 in OPT). Going deeper through the layers, the down_input scale becomes extremely large, which means the outliers in this activation explode. As a result, the final perplexity is not acceptable (from ~16 for the original model to ~90 after quantization).
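For context, these numbers assume the usual symmetric INT8 per-tensor convention (scale = calibrated absmax / 127), so a single outlier channel in the down_proj input sets the scale for the whole tensor. A minimal sketch of that assumption:

```python
import torch

def per_tensor_scale(calib_acts: torch.Tensor) -> float:
    # calib_acts: inputs of one linear layer collected on calibration data.
    # Symmetric INT8 per-tensor scale = absmax / 127, so one outlier channel
    # dominates the scale of the entire tensor.
    return calib_acts.abs().max().item() / 127.0
```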

If we could apply smoothing to this layer as well, I believe the result would improve a lot. May I know if you have tried to implement that? Many thanks!
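In case it helps, here is a rough sketch of what I have in mind (hypothetical helper, not part of this repo). For a SwiGLU MLP the down_proj input is act(gate_proj(x)) * up_proj(x), so a per-channel smoothing scale on the down_proj input can be folded into up_proj's output rows and compensated in down_proj's input columns, similar in spirit to the existing LayerNorm-to-FC smoothing:

```python
import torch

@torch.no_grad()
def smooth_up_down(up_proj: torch.nn.Linear, down_proj: torch.nn.Linear,
                   act_max: torch.Tensor, alpha: float = 0.5, eps: float = 1e-5):
    """Hypothetical sketch: fold a per-channel smoothing scale for the
    down_proj input into up_proj.

    act_max: per-channel abs-max of the down_proj input, collected on
    calibration data (shape: [intermediate_size]).
    """
    # Per-input-channel weight magnitude of down_proj
    # (weight shape: [hidden_size, intermediate_size]).
    weight_max = down_proj.weight.abs().max(dim=0).values.clamp(min=eps)
    # SmoothQuant-style migration strength: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
    scales = (act_max.clamp(min=eps).pow(alpha) / weight_max.pow(1 - alpha)).clamp(min=eps)
    # Dividing the down_proj input by s is equivalent to dividing up_proj's
    # output rows by s, because the element-wise gate multiplication commutes
    # with per-channel scaling.
    up_proj.weight.div_(scales.view(-1, 1))
    if up_proj.bias is not None:
        up_proj.bias.div_(scales)
    # Compensate by scaling down_proj's input channels by s.
    down_proj.weight.mul_(scales.view(1, -1))
    return scales
```

This keeps the MLP output mathematically unchanged before quantization, while the down_proj input becomes much flatter, so its static per-tensor scale should shrink accordingly. I have not verified the accuracy impact yet, so please treat it only as a starting point.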
