
Potential bugs and confusion with attribution.jvp #10

@JacksonKaunismaa

Description

I have been working through the paper and examining the code for computing edge weights, and I believe I have discovered some unexpected behavior, along with a few other confusing areas. I would greatly appreciate clarification on where I'm going wrong here.

Unexpected zeros

We wish to compute the edge weights between two residual layers, as in

feature-circuits/circuit.py, lines 157 to 208 in c1a9b7b:

```python
for layer in reversed(range(len(resids))):
    resid = resids[layer]
    mlp = mlps[layer]
    attn = attns[layer]

    MR_effect, MR_grad = N(mlp, resid)
    AR_effect, AR_grad = N(attn, resid)

    edges[f'mlp_{layer}'][f'resid_{layer}'] = MR_effect
    edges[f'attn_{layer}'][f'resid_{layer}'] = AR_effect

    if layer > 0:
        prev_resid = resids[layer-1]
    else:
        prev_resid = embed

    RM_effect, _ = N(prev_resid, mlp)
    RA_effect, _ = N(prev_resid, attn)

    MR_grad = MR_grad.coalesce()
    AR_grad = AR_grad.coalesce()

    RMR_effect = jvp(
        clean,
        model,
        dictionaries,
        mlp,
        features_by_submod[resid],
        prev_resid,
        {feat_idx : unflatten(MR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
        deltas[prev_resid],
    )
    RAR_effect = jvp(
        clean,
        model,
        dictionaries,
        attn,
        features_by_submod[resid],
        prev_resid,
        {feat_idx : unflatten(AR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
        deltas[prev_resid],
    )
    RR_effect, _ = N(prev_resid, resid)

    if layer > 0:
        edges[f'resid_{layer-1}'][f'mlp_{layer}'] = RM_effect
        edges[f'resid_{layer-1}'][f'attn_{layer}'] = RA_effect
        edges[f'resid_{layer-1}'][f'resid_{layer}'] = RR_effect - RMR_effect - RAR_effect
    else:
        edges['embed'][f'mlp_{layer}'] = RM_effect
        edges['embed'][f'attn_{layer}'] = RA_effect
        edges['embed'][f'resid_0'] = RR_effect - RMR_effect - RAR_effect
```

This is done by computing the direct effect of the residual at layer n on the residual at layer n+1 (line 199), and then subtracting off the indirect contributions of the layer-n residual to residual n+1 via both the MLP and the attention submodules (line 204).
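To make the intended decomposition concrete, here is a hypothetical toy sketch (not the repo's code) that treats one layer as purely linear, with hypothetical weights W_mlp and W_attn; subtracting the two indirect paths from the total effect should leave only the direct skip-connection path:

```python
import torch as t

# Hypothetical toy layer: resid_{n+1} = resid_n + W_mlp @ resid_n + W_attn @ resid_n
W_mlp = t.tensor([[0.5]])
W_attn = t.tensor([[0.25]])

dr = t.tensor([[1.0]])                 # perturbation of the layer-n residual
RR = dr + W_mlp @ dr + W_attn @ dr     # total effect on resid_{n+1} (analogue of RR_effect)
RMR = W_mlp @ dr                       # indirect path through the MLP (analogue of RMR_effect)
RAR = W_attn @ dr                      # indirect path through attention (analogue of RAR_effect)

direct = RR - RMR - RAR                # the resid -> resid edge weight circuit.py stores
print(direct)                          # only the skip-connection contribution, dr itself, remains
```

In this linear setting the subtraction recovers exactly the identity path, which is why nonzero RMR_effect and RAR_effect would normally be expected whenever the MLP or attention paths carry any signal.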

However, these indirect contributions (RMR_effect and RAR_effect) appear to always be all zeros in my testing with all of the provided supervised datasets in data, regardless of edge/node thresholds, batch size, example length, or input. Is this behavior expected?
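For reference, this is roughly how I checked the zeros; a minimal sketch with a hypothetical helper, since the effects are sparse COO tensors that may store explicit zeros:

```python
import torch as t

def is_all_zeros(sp: t.Tensor) -> bool:
    """Hypothetical helper: True if a sparse COO tensor (e.g. RMR_effect or
    RAR_effect) has no nonzero entries, even when zeros are stored explicitly."""
    sp = sp.coalesce()
    return sp._nnz() == 0 or bool((sp.values() == 0).all())

indices = t.tensor([[0, 1], [2, 3]])

# A sparse tensor that stores explicit zeros still counts as all-zero:
explicit_zeros = t.sparse_coo_tensor(indices, t.zeros(2), (4, 4))
print(is_all_zeros(explicit_zeros))   # True

nonzero = t.sparse_coo_tensor(indices, t.tensor([0.0, 5.0]), (4, 4))
print(is_all_zeros(nonzero))          # False
```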

downstream_feat indexing

It also seems to me that in this section of attribution.jvp, downstream_feat is being used in two unrelated ways.

```python
for downstream_feat in downstream_features:
    if isinstance(left_vec, SparseAct):
        to_backprop = (left_vec @ downstream_act).to_tensor().flatten()
    elif isinstance(left_vec, dict):
        to_backprop = (left_vec[downstream_feat] @ downstream_act).to_tensor().flatten()
    else:
        raise ValueError(f"Unknown type {type(left_vec)}")
    vjv = (upstream_act.grad @ right_vec).to_tensor().flatten()
    if return_without_right:
        jv = (upstream_act.grad @ right_vec).to_tensor().flatten()
    x_res.grad = t.zeros_like(x_res)
    to_backprop[downstream_feat].backward(retain_graph=True)
```
Using the notation of equation 6 from the paper: on line 350, we index into left_vec, selecting the gradient of some particular feature in the downstream node d with respect to the features in the intermediate node m. Then, on line 357, we index into to_backprop with downstream_feat, where to_backprop is an element-wise product of an intermediate-node gradient for m and the current activation of m. If downstream_feat corresponds to some feature in node d, why can we use it to index into intermediate node m?
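To illustrate why mixing the two index spaces worries me, here is a hypothetical sketch with made-up sizes (d_feats, m_feats, and the values are all assumptions); if d has more features than the flattened m, a d-space index can be invalid for to_backprop:

```python
import torch as t

# Hypothetical sizes: downstream node d has more features than
# the flattened intermediate node m.
d_feats = 10        # index space that downstream_feat ranges over
m_feats = 6         # size of to_backprop, which lives in m's index space

to_backprop = t.arange(m_feats, dtype=t.float32)

downstream_feat = 8              # a valid feature index in d ...
try:
    _ = to_backprop[downstream_feat]   # ... but out of range for m
except IndexError as e:
    print("indexing m with a d-space index failed:", e)
```

Even when the index happens to be in range, it would select an unrelated entry of m rather than anything tied to feature downstream_feat of d.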

vjv vs. jv

Finally, I believe there could be an issue with how vjv and jv are computed inside attribution.jvp. As far as I can tell (and running the code confirms this), by the time we reach this return statement

```python
return (
    t.sparse_coo_tensor(vjv_indices, vjv_values, (d_downstream_contracted, d_upstream_contracted)),
    t.sparse_coo_tensor(jv_indices, jv_values, (d_downstream_contracted, d_upstream))
)
```
vjv_indices and vjv_values are identical to jv_indices and jv_values. Therefore, in

```python
MR_effect, MR_grad = N(mlp, resid)
```

the only way that MR_effect and MR_grad differ is in their size; each has the same underlying values and indices. However, on lines 186 and 196 in

feature-circuits/circuit.py, lines 179 to 197 in c1a9b7b:

```python
RMR_effect = jvp(
    clean,
    model,
    dictionaries,
    mlp,
    features_by_submod[resid],
    prev_resid,
    {feat_idx : unflatten(MR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
    deltas[prev_resid],
)
RAR_effect = jvp(
    clean,
    model,
    dictionaries,
    attn,
    features_by_submod[resid],
    prev_resid,
    {feat_idx : unflatten(AR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
    deltas[prev_resid],
)
```

this differing size matters when we reshape these *_grad variables. If we had instead applied the same reshape to the analogous *_effect variables, we would end up with two different tensors that have the same values, but rearranged into a strange permutation that does not seem to correspond to how these variables were computed in the first place. Is this behavior expected?
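Here is a minimal sketch of the permutation I mean, with hypothetical sizes standing in for d_upstream vs. d_upstream_contracted: the same flat values, reshaped under two different row lengths, land at different 2-D coordinates.

```python
import torch as t

# Hypothetical flat tensors with identical nonzero entries but different sizes,
# mimicking *_grad vs. *_effect sharing indices/values.
flat_small = t.zeros(6)
flat_small[4] = 1.0
flat_large = t.zeros(12)
flat_large[4] = 1.0

print(flat_small.reshape(2, 3))   # the 1.0 lands at position [1, 1]
print(flat_large.reshape(2, 6))   # the same value lands at position [0, 4]
```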
