I have been working through the paper and examining the code for computing edge weights, and I believe I have discovered some unexpected behavior, as well as a few other confusing areas. I would greatly appreciate some clarification on where I'm going wrong here.
Unexpected zeros
We wish to compute the edge weights between two residual layers, as in `feature-circuits/circuit.py` (lines 157 to 208 at commit `c1a9b7b`):

```python
for layer in reversed(range(len(resids))):
    resid = resids[layer]
    mlp = mlps[layer]
    attn = attns[layer]

    MR_effect, MR_grad = N(mlp, resid)
    AR_effect, AR_grad = N(attn, resid)

    edges[f'mlp_{layer}'][f'resid_{layer}'] = MR_effect
    edges[f'attn_{layer}'][f'resid_{layer}'] = AR_effect

    if layer > 0:
        prev_resid = resids[layer-1]
    else:
        prev_resid = embed

    RM_effect, _ = N(prev_resid, mlp)
    RA_effect, _ = N(prev_resid, attn)

    MR_grad = MR_grad.coalesce()
    AR_grad = AR_grad.coalesce()

    RMR_effect = jvp(
        clean,
        model,
        dictionaries,
        mlp,
        features_by_submod[resid],
        prev_resid,
        {feat_idx : unflatten(MR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
        deltas[prev_resid],
    )
    RAR_effect = jvp(
        clean,
        model,
        dictionaries,
        attn,
        features_by_submod[resid],
        prev_resid,
        {feat_idx : unflatten(AR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
        deltas[prev_resid],
    )
    RR_effect, _ = N(prev_resid, resid)

    if layer > 0:
        edges[f'resid_{layer-1}'][f'mlp_{layer}'] = RM_effect
        edges[f'resid_{layer-1}'][f'attn_{layer}'] = RA_effect
        edges[f'resid_{layer-1}'][f'resid_{layer}'] = RR_effect - RMR_effect - RAR_effect
    else:
        edges['embed'][f'mlp_{layer}'] = RM_effect
        edges['embed'][f'attn_{layer}'] = RA_effect
        edges['embed'][f'resid_0'] = RR_effect - RMR_effect - RAR_effect
```
This is done by computing the direct effect of the residual at layer n on the residual at layer n+1 (line 199), and then subtracting off the indirect contributions of the residual at layer n to the residual at layer n+1 via both the MLP and the attention submodules (line 204).
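To check my understanding of that subtraction, here is a toy scalar version (entirely my own construction, not the repo's code) in which the direct edge falls out as the total effect minus the two indirect paths:

```python
import torch as t

# Toy sketch of "direct = total - via_mlp - via_attn". The coefficients are
# arbitrary stand-ins; mlp_out and attn_out play the roles of the MLP and
# attention paths from resid_n to resid_{n+1}.
r0 = t.tensor(2.0, requires_grad=True)
mlp_out = 3.0 * r0                    # path through the "MLP"
attn_out = 5.0 * r0                   # path through "attention"
r1 = 7.0 * r0 + mlp_out + attn_out    # total: direct edge plus both indirect paths

total = t.autograd.grad(r1, r0, retain_graph=True)[0]          # 7 + 3 + 5 = 15
via_mlp = t.autograd.grad(mlp_out, r0, retain_graph=True)[0]   # 3
via_attn = t.autograd.grad(attn_out, r0)[0]                    # 5
direct = total - via_mlp - via_attn                            # 7
```

In this toy setting the subtraction clearly leaves a nonzero direct term, which is part of why the all-zero indirect effects below surprised me.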
However, these indirect contributions (`RMR_effect` and `RAR_effect`) appear to always be all zeros in my testing with all the provided supervised datasets in `data`, regardless of edge/node thresholds, batch size, example length, or input. Is this behavior expected?
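For reference, this is the kind of check I ran (a minimal helper of my own, assuming the effect tensors are sparse COO tensors as elsewhere in the code):

```python
import torch as t

# My own diagnostic, not repo code: report whether a sparse effect tensor
# (e.g. RMR_effect or RAR_effect) is entirely zero after coalescing.
def is_all_zero(effect: t.Tensor) -> bool:
    vals = effect.coalesce().values()
    return vals.numel() == 0 or bool((vals == 0).all())
```

Both `is_all_zero(RMR_effect)` and `is_all_zero(RAR_effect)` come back `True` on every configuration I tried.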
`downstream_feat` indexing
It also seems to me that in this section of `attribution.jvp` (`feature-circuits/attribution.py`, lines 346 to 357 at `c1a9b7b`), `downstream_feat` is being used in two unrelated ways.
```python
for downstream_feat in downstream_features:
    if isinstance(left_vec, SparseAct):
        to_backprop = (left_vec @ downstream_act).to_tensor().flatten()
    elif isinstance(left_vec, dict):
        to_backprop = (left_vec[downstream_feat] @ downstream_act).to_tensor().flatten()
    else:
        raise ValueError(f"Unknown type {type(left_vec)}")
    vjv = (upstream_act.grad @ right_vec).to_tensor().flatten()
    if return_without_right:
        jv = (upstream_act.grad @ right_vec).to_tensor().flatten()
    x_res.grad = t.zeros_like(x_res)
    to_backprop[downstream_feat].backward(retain_graph=True)
```
Using the notation of equation 6 from the paper: on line 350, the code indexes into `left_vec`, selecting the gradient of some particular feature in d with respect to the features in intermediate node m. Then, on line 357, we index into `to_backprop` with `downstream_feat`, where `to_backprop` is an element-wise product of an intermediate-node gradient for m and the current activation of m. If `downstream_feat` corresponds to some feature in node d, why can we use it to index into intermediate node m?
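A toy version of what worries me (all sizes are made up for illustration): reusing an index from one node's feature space on another node's flattened tensor raises no error as long as the index happens to be in bounds, so nothing in the code would flag a mismatch.

```python
import torch as t

# Hypothetical sizes, not the repo's: downstream node d and intermediate node
# m have differently-sized flattened feature spaces.
d_size = 8    # flattened feature space of downstream node d
m_size = 16   # flattened feature space of intermediate node m

# Stands in for (left_vec[downstream_feat] @ downstream_act).to_tensor().flatten(),
# which lives in m's space if my reading is right.
to_backprop = t.arange(m_size, dtype=t.float32)

downstream_feat = 5                      # an index into d's space, not m's
selected = to_backprop[downstream_feat]  # runs silently because 5 < 16,
                                         # even though the index refers to d
```

The selection succeeds, but it is unclear to me what it means, which is why I'm asking whether the two uses are intentional.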
`vjv` vs. `jv`
Finally, I believe there could be some issue with how `vjv` and `jv` are computed inside of `attribution.jvp`. As far as I can tell (and running the code confirms this), by the return statement here (`feature-circuits/attribution.py`, lines 388 to 391 at `c1a9b7b`):
```python
return (
    t.sparse_coo_tensor(vjv_indices, vjv_values, (d_downstream_contracted, d_upstream_contracted)),
    t.sparse_coo_tensor(jv_indices, jv_values, (d_downstream_contracted, d_upstream))
)
```
`vjv_indices` and `vjv_values` are identical to `jv_indices` and `jv_values`. Therefore, the only way that `MR_effect` and `MR_grad` differ in

```python
MR_effect, MR_grad = N(mlp, resid)
```

(`feature-circuits/circuit.py`, line 162 at `c1a9b7b`) is their size; each has the same underlying values and indices. However, on lines 186 and 196 in
```python
RMR_effect = jvp(
    clean,
    model,
    dictionaries,
    mlp,
    features_by_submod[resid],
    prev_resid,
    {feat_idx : unflatten(MR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
    deltas[prev_resid],
)
RAR_effect = jvp(
    clean,
    model,
    dictionaries,
    attn,
    features_by_submod[resid],
    prev_resid,
    {feat_idx : unflatten(AR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
    deltas[prev_resid],
)
```

(`feature-circuits/circuit.py`, lines 179 to 197 at `c1a9b7b`), this differing size matters when we reshape these `*_grad` variables. If we had instead applied the same reshape to the analogous `*_effect` variables, we would end up with two different tensors that have the same values, but rearranged in a strange permutation that does not seem to correspond to how these variables were computed in the first place. Is this behavior expected?
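To make the permutation concrete, here is a small toy of my own (made-up shapes, not the repo's actual dimensions): two sparse tensors with identical indices and values but different upstream sizes scatter the same entries to different positions once a row is densified and reshaped, as the `unflatten` step does.

```python
import torch as t

# Same indices and values, but the trailing ("upstream") dimension differs,
# mirroring d_upstream_contracted vs. d_upstream in the return statement.
idx = t.tensor([[0, 0], [1, 5]])
vals = t.tensor([2.0, 3.0])
grad_contracted = t.sparse_coo_tensor(idx, vals, (1, 6)).coalesce()
grad_full = t.sparse_coo_tensor(idx, vals, (1, 12)).coalesce()

# Densify one row and reshape it, as unflatten(MR_grad[feat_idx].to_dense())
# would. The value 3.0 lands at (1, 2) in one tensor and (0, 5) in the other.
row_a = grad_contracted[0].to_dense().reshape(2, 3)
row_b = grad_full[0].to_dense().reshape(2, 6)
```

So whichever size is used, the reshape silently produces a valid tensor; only the placement of the entries changes, which is why I suspect the distinction between `*_effect` and `*_grad` is load-bearing here.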