In Equation 1 of the paper, you compute the gradient of the element-wise product of the output and the ground-truth one-hot label with respect to the input feature vector. This finds the features that contribute most to the ground-truth class logit. For a softmax output, ideally we want the true-class logit to go toward positive infinity while the other logits go toward negative infinity.
So my question is: why not compute a more classical cross-entropy loss here, instead of just the sum of the true-class logits, i.e.

```python
one_hot = torch.sum(output * one_hot_sparse)
```
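For concreteness, a minimal sketch of the two variants I have in mind, assuming `output` are raw logits and `one_hot_sparse` is a one-hot label tensor (the toy model and variable names here are hypothetical, only the snippet's two names come from the code):

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the model: a feature vector through a linear layer.
torch.manual_seed(0)
features = torch.randn(1, 8, requires_grad=True)  # input feature vector
layer = torch.nn.Linear(8, 3)
output = layer(features)                          # raw logits, shape (1, 3)

label = torch.tensor([1])
one_hot_sparse = F.one_hot(label, num_classes=3).float()

# Variant 1 (as in the snippet): backprop only the true-class logit.
true_logit = torch.sum(output * one_hot_sparse)
grad_logit = torch.autograd.grad(true_logit, features, retain_graph=True)[0]

# Variant 2 (the question): backprop a cross-entropy loss, which also
# pushes the competing logits down through the softmax normalizer.
loss = F.cross_entropy(output, label)
grad_ce = torch.autograd.grad(loss, features)[0]

print(grad_logit.shape, grad_ce.shape)
```

The difference is visible at the logit level: the gradient of the true-class logit with respect to the logits is just the one-hot vector, while the gradient of cross-entropy is `softmax(output) - one_hot`, so cross-entropy attributes importance to features that suppress the wrong classes as well.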