The current implementation of integrated gradients interpolates over the entire patch mask at once. However, this introduces dependencies between upstream and downstream nodes. Let f_k(x) denote the clean output of component k, f'_k(x) its output on the patch run, and f_k_alpha(x) its output when the outputs of previous components have themselves been interpolated with alpha. For component k's IG step we want to substitute f_k(x) with alpha * f_k(x) + (1 - alpha) * f'_k(x), but with joint interpolation we instead substitute alpha * f_k_alpha(x) + (1 - alpha) * f'_k(x) — the "clean" term is already contaminated by the upstream interpolation. Sparse Feature Circuits addresses this in section 2:
> This [IG] cannot be done in parallel for two nodes when one is downstream of another, but can be done in parallel for arbitrarily many nodes which do not depend on each other. Thus the additional cost of computing ÎE_ig over ÎE_atp scales linearly in N and the serial depth of m's computation graph.
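A toy numeric illustration of the contamination (hypothetical example, not the repo's code): with a nonlinear downstream component, blending the upstream output first changes the "clean" term that enters the downstream interpolation.

```python
# Two chained components: component 2 consumes component 1's output.
def f1(x):
    return 2.0 * x       # upstream component

def f2(h1):
    return h1 ** 2       # downstream component (nonlinear, so blending matters)

x_clean, x_patch = 1.0, 0.0
alpha = 0.5

# Clean and patch-run activations of component 1.
h1_clean, h1_patch = f1(x_clean), f1(x_patch)
# Stored patch-run activation of component 2.
h2_patch = f2(h1_patch)

# What we want for component 2's IG step: its clean term is computed
# from the *unmodified* upstream output.
h2_wanted = alpha * f2(h1_clean) + (1 - alpha) * h2_patch

# What joint interpolation over the whole mask actually does: component 2
# sees an already-interpolated upstream output, so its "clean" term is f2(h1_alpha).
h1_alpha = alpha * h1_clean + (1 - alpha) * h1_patch
h2_actual = alpha * f2(h1_alpha) + (1 - alpha) * h2_patch

print(h2_wanted, h2_actual)  # the two values differ
```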
I don't think this edge case is different, so I believe the current implementation in `prune_algos.mask_gradient` is incorrect and should be adjusted to compute scores iteratively over source-node layers. (This would change the time complexity from O(forward * N) to O(forward * n_layers * N), so maybe add it as an optional setting.)
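A minimal sketch of the layerwise variant (all names are illustrative, not the `prune_algos` API): each layer's output is interpolated toward its patch-run activation one layer at a time, with every other layer computed live on the clean input, and an IG-style score is accumulated per layer via a numeric derivative in alpha.

```python
def forward(layers, x, interp=None):
    """Run the model; optionally blend one layer's output toward a patch activation.

    interp is (layer_index, alpha, patch_activation) or None.
    """
    h = x
    acts = []
    for i, layer in enumerate(layers):
        h = layer(h)
        if interp is not None and interp[0] == i:
            _, alpha, patch_act = interp
            h = alpha * h + (1 - alpha) * patch_act
        acts.append(h)
    return h, acts

def layerwise_ig_scores(layers, x_clean, x_patch, n_steps=10, eps=1e-4):
    """IG-style attribution per layer, interpolating one layer at a time.

    Serial over layers (hence the extra n_layers factor in the cost),
    so each layer's "clean" term is never contaminated by upstream blending.
    """
    _, patch_acts = forward(layers, x_patch)
    scores = []
    for i in range(len(layers)):
        total = 0.0
        for k in range(1, n_steps + 1):
            alpha = k / n_steps
            # numeric d(output)/d(alpha) at this interpolation point
            out_hi, _ = forward(layers, x_clean, (i, alpha + eps, patch_acts[i]))
            out_lo, _ = forward(layers, x_clean, (i, alpha - eps, patch_acts[i]))
            total += (out_hi - out_lo) / (2 * eps)
        scores.append(total / n_steps)
    return scores

# Usage on the two-component toy model from above:
layers = [lambda h: 2.0 * h, lambda h: h ** 2]
scores = layerwise_ig_scores(layers, x_clean=1.0, x_patch=0.0)
```

In a real implementation the inner alpha loop would batch all layers at the same serial depth into one forward pass, matching the parallelism the Sparse Feature Circuits excerpt describes.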