Hi Developers,
Recently, while applying SmoothQuant on my side (Qwen3-1.7B), I found that FC2 (down_proj in the Qwen-style definition) is not included in the smoothed layers. However, I observed that the static per-tensor scaling factor for this layer's input can become extremely large when no smoothing is applied:
Layer 0: {'q_proj_input': 0.009227362204724409, 'o_proj_input': 0.021776574803149606, 'gate_input': 0.010765255905511811, 'down_input': 0.15748031496062992}
Layer 1: {'q_proj_input': 0.008427657480314961, 'o_proj_input': 0.011441929133858268, 'gate_input': 0.015071358267716535, 'down_input': 1.236220472440945}
Layer 2: {'q_proj_input': 0.009781003937007874, 'o_proj_input': 0.018331692913385825, 'gate_input': 0.023375984251968504, 'down_input': 133.03937007874015}
...
Layer 26: {'q_proj_input': 0.03297244094488189, 'o_proj_input': 2.031496062992126, 'gate_input': 0.022637795275590553, 'down_input': 11.21259842519685}
Layer 27: {'q_proj_input': 0.03641732283464567, 'o_proj_input': 3.0078740157480315, 'gate_input': 0.035679133858267716, 'down_input': 23.433070866141733}
As you can see, down_input here refers to the per-tensor scale of the down_proj input (the counterpart of fc2 in OPT). Across the layers, this scale grows extremely large, which means the activation outliers at this input explode. Unsurprisingly, the final perplexity then becomes unacceptable (from ~16 for the original model to ~90 after quantization).
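For reference, this kind of per-tensor statistic can be collected with forward pre-hooks, e.g. as in the minimal sketch below. This is not my exact calibration script: the module names are taken from the HF Qwen3 implementation, the calibration texts are placeholders, and it assumes the common absmax / 127 convention for a static int8 per-tensor scale.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-1.7B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

absmax = {}  # running absolute max of each down_proj input over the calibration set

def make_hook(name):
    def hook(module, inputs):
        x = inputs[0].detach().float()
        absmax[name] = max(absmax.get(name, 0.0), x.abs().max().item())
    return hook

handles = [
    layer.mlp.down_proj.register_forward_pre_hook(make_hook(f"layer{i}.down_input"))
    for i, layer in enumerate(model.model.layers)
]

calib_texts = ["placeholder calibration sentence 1", "placeholder calibration sentence 2"]
with torch.no_grad():
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        model(ids)

for h in handles:
    h.remove()

# static per-tensor int8 scale, assuming the absmax / 127 convention
scales = {name: v / 127.0 for name, v in absmax.items()}
print(scales)
```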
If smoothing could be applied to this layer as well, I believe the result would improve a lot. May I know whether you have tried implementing that? Many thanks!
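For concreteness, below is a minimal sketch of what I have in mind, not a definitive implementation. It assumes per-channel activation absmax statistics for the down_proj input have already been collected (per channel, unlike the per-tensor hook above), uses the HF Qwen3 module names, and applies the usual SmoothQuant formula with alpha = 0.5. Because down_proj's input is up_proj(x) * act_fn(gate_proj(x)), dividing the activation by s per channel can be folded exactly into up_proj's output rows, so no extra runtime op is needed.

```python
import torch

@torch.no_grad()
def smooth_down_proj(up_proj, down_proj, act_absmax, alpha=0.5, eps=1e-5):
    # act_absmax: [intermediate_size] per-channel absmax of down_proj's input,
    # on the same device as the weights.
    w_absmax = down_proj.weight.abs().max(dim=0).values.float().clamp(min=eps)  # per input channel
    s = (act_absmax.float().clamp(min=eps).pow(alpha) / w_absmax.pow(1.0 - alpha)).clamp(min=eps)
    s = s.to(down_proj.weight.dtype)
    # X' = X / s is absorbed into up_proj, whose output feeds the elementwise product.
    up_proj.weight.div_(s.view(-1, 1))
    if up_proj.bias is not None:
        up_proj.bias.div_(s)
    # W' = W * s per input channel, so that X' @ W'.T == X @ W.T exactly.
    down_proj.weight.mul_(s.view(1, -1))

# usage (act_absmax_per_layer collected per channel during calibration):
# for i, layer in enumerate(model.model.layers):
#     smooth_down_proj(layer.mlp.up_proj, layer.mlp.down_proj, act_absmax_per_layer[i])
```

The transform is mathematically exact before quantization; it only migrates the outlier magnitude from the down_proj input into the up_proj weights, which is the same idea SmoothQuant already uses for the other linear layers.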