p.grad.data = self.grad_quant(p.grad.data * self.grad_scaling)
In OptimLP, the gradient is multiplied by the scaling factor before it is quantized. However, grad scaling is meant to prevent possible underflow of the low-precision quantized gradient values, and I think the current implementation cannot prevent that underflow: if grad_scaling is the reciprocal of a loss scale, the gradient is shrunk back to its original small magnitude before quantization, so small entries can still round to zero.
Maybe the correct implementation is to multiply by the scaling factor after quantization:
p.grad.data = self.grad_quant(p.grad.data) * self.grad_scaling
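
Here is a toy numerical sketch of what I mean (my own example, not QPyTorch code). It assumes grad_scaling is set to the reciprocal of a loss scale applied before backward(), and it uses a made-up fixed-point rounding function, toy_fixed_point_quant, in place of the real gradient quantizer. With the current order the small gradient entries round to zero; quantizing first and unscaling afterwards keeps them.

import torch

def toy_fixed_point_quant(x, frac_bits=8):
    # Made-up stand-in for the gradient quantizer: round-to-nearest
    # fixed point with `frac_bits` fractional bits, so anything below
    # 2 ** -(frac_bits + 1) in magnitude rounds to zero.
    scale = 2.0 ** frac_bits
    return torch.round(x * scale) / scale

loss_scale = 1024.0              # loss was multiplied by this before backward()
grad_scaling = 1.0 / loss_scale  # factor meant to undo the loss scale

true_grad = torch.tensor([1e-4, 3e-4])  # the "real" gradient values
scaled_grad = true_grad * loss_scale    # what backward() actually produces

# Current order: unscale first, then quantize -> entries underflow to zero.
current = toy_fixed_point_quant(scaled_grad * grad_scaling)

# Proposed order: quantize the still-large scaled gradient, then unscale.
proposed = toy_fixed_point_quant(scaled_grad) * grad_scaling

print(current)   # tensor([0., 0.])
print(proposed)  # roughly [1e-4, 3e-4], close to true_grad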