
Why is the gradient scaling factor multiplied before quantization? #59

@Guangxuan-Xiao

Description

p.grad.data = self.grad_quant(p.grad.data * self.grad_scaling)

In OptimLP, the gradient is multiplied by the scaling factor before quantization. However, gradient scaling is meant to prevent possible underflow of the gradient values in the low-precision representation, and I don't think the current implementation can prevent that underflow.

Maybe the correct implementation is to multiply by the scaling factor after quantization:

p.grad.data = self.grad_quant(p.grad.data) * self.grad_scaling

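For context, here is a minimal sketch of why the ordering matters. It assumes grad_scaling is a factor smaller than one (for example, the reciprocal of a loss-scaling factor, which is what the issue seems to imply) and uses a hypothetical fixed_point_quantize in place of self.grad_quant; it is not QPyTorch's actual kernel.

```python
import torch

# Hypothetical quantizer standing in for self.grad_quant: fixed point with
# `fl` fractional bits and nearest rounding.
def fixed_point_quantize(x, fl=8):
    step = 2.0 ** (-fl)                # smallest representable magnitude
    return torch.round(x / step) * step

# Assumptions for illustration: the gradients were amplified by a loss-scaling
# factor of 1024 during backward, and grad_scaling = 1/1024 undoes it.
grad = torch.tensor([0.1, 0.05, 0.2])  # loss-scaled gradients
grad_scaling = 1.0 / 1024

# Current OptimLP order: scale first, then quantize.
current = fixed_point_quantize(grad * grad_scaling)

# Order proposed in this issue: quantize first, then scale.
proposed = fixed_point_quantize(grad) * grad_scaling

print(current)   # tensor([0., 0., 0.])  -- scaled values fall below 2**-8 and underflow
print(proposed)  # roughly [9.9e-05, 5.0e-05, 1.9e-04] -- values survive quantization
```

Under that assumption, scaling before quantization pushes the gradients below the smallest representable step, while quantizing first and scaling afterwards preserves them; if grad_scaling were instead a factor larger than one, the current ordering would be the one that avoids underflow.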