
Potential bugs and confusion with attribution.jvp #10

@JacksonKaunismaa

Description

I have been working through the paper and examining the code for computing edge weights, and I believe I have discovered some unexpected behavior, along with a few other confusing areas. I would greatly appreciate clarification on where I'm going wrong here.

Unexpected zeros

We wish to compute the edge weights between two residual layers, as in

feature-circuits/circuit.py, lines 157 to 208 in c1a9b7b:

```python
for layer in reversed(range(len(resids))):
    resid = resids[layer]
    mlp = mlps[layer]
    attn = attns[layer]

    MR_effect, MR_grad = N(mlp, resid)
    AR_effect, AR_grad = N(attn, resid)

    edges[f'mlp_{layer}'][f'resid_{layer}'] = MR_effect
    edges[f'attn_{layer}'][f'resid_{layer}'] = AR_effect

    if layer > 0:
        prev_resid = resids[layer-1]
    else:
        prev_resid = embed

    RM_effect, _ = N(prev_resid, mlp)
    RA_effect, _ = N(prev_resid, attn)

    MR_grad = MR_grad.coalesce()
    AR_grad = AR_grad.coalesce()

    RMR_effect = jvp(
        clean,
        model,
        dictionaries,
        mlp,
        features_by_submod[resid],
        prev_resid,
        {feat_idx : unflatten(MR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
        deltas[prev_resid],
    )
    RAR_effect = jvp(
        clean,
        model,
        dictionaries,
        attn,
        features_by_submod[resid],
        prev_resid,
        {feat_idx : unflatten(AR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
        deltas[prev_resid],
    )
    RR_effect, _ = N(prev_resid, resid)

    if layer > 0:
        edges[f'resid_{layer-1}'][f'mlp_{layer}'] = RM_effect
        edges[f'resid_{layer-1}'][f'attn_{layer}'] = RA_effect
        edges[f'resid_{layer-1}'][f'resid_{layer}'] = RR_effect - RMR_effect - RAR_effect
    else:
        edges['embed'][f'mlp_{layer}'] = RM_effect
        edges['embed'][f'attn_{layer}'] = RA_effect
        edges['embed'][f'resid_0'] = RR_effect - RMR_effect - RAR_effect
```

This is done by computing the direct effect of the residual at layer n on the residual at layer n+1 (line 199), and then subtracting off the indirect contributions of the layer-n residual to residual n+1 via both the MLP and the attention submodules (line 204).
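To make the intended decomposition concrete, here is a hypothetical toy sketch (not the repo's code) that treats one layer as purely linear, with hypothetical weights W_mlp and W_attn; subtracting the two indirect paths from the total effect should leave only the direct skip-connection path:

```python
import torch as t

# Hypothetical toy layer: resid_{n+1} = resid_n + W_mlp @ resid_n + W_attn @ resid_n
W_mlp = t.tensor([[0.5]])
W_attn = t.tensor([[0.25]])

dr = t.tensor([[1.0]])                 # perturbation of the layer-n residual
RR = dr + W_mlp @ dr + W_attn @ dr     # total effect on resid_{n+1} (analogue of RR_effect)
RMR = W_mlp @ dr                       # indirect path through the MLP (analogue of RMR_effect)
RAR = W_attn @ dr                      # indirect path through attention (analogue of RAR_effect)

direct = RR - RMR - RAR                # the resid -> resid edge weight circuit.py stores
print(direct)                          # only the skip-connection contribution, dr itself, remains
```

In this linear setting the subtraction recovers exactly the identity path, which is why nonzero RMR_effect and RAR_effect would normally be expected whenever the MLP or attention paths carry any signal.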

However, these indirect contributions (RMR_effect and RAR_effect) appear to always be all zeros in my testing with all of the provided supervised datasets in data, regardless of edge/node thresholds, batch size, example length, or input. Is this behavior expected?
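For reference, this is roughly how I checked the zeros; a minimal sketch with a hypothetical helper, since the effects are sparse COO tensors that may store explicit zeros:

```python
import torch as t

def is_all_zeros(sp: t.Tensor) -> bool:
    """Hypothetical helper: True if a sparse COO tensor (e.g. RMR_effect or
    RAR_effect) has no nonzero entries, even when zeros are stored explicitly."""
    sp = sp.coalesce()
    return sp._nnz() == 0 or bool((sp.values() == 0).all())

indices = t.tensor([[0, 1], [2, 3]])

# A sparse tensor that stores explicit zeros still counts as all-zero:
explicit_zeros = t.sparse_coo_tensor(indices, t.zeros(2), (4, 4))
print(is_all_zeros(explicit_zeros))   # True

nonzero = t.sparse_coo_tensor(indices, t.tensor([0.0, 5.0]), (4, 4))
print(is_all_zeros(nonzero))          # False
```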

downstream_feat indexing

It also seems to me that in this section of attribution.jvp, downstream_feat is being used in two unrelated ways.

```python
for downstream_feat in downstream_features:
    if isinstance(left_vec, SparseAct):
        to_backprop = (left_vec @ downstream_act).to_tensor().flatten()
    elif isinstance(left_vec, dict):
        to_backprop = (left_vec[downstream_feat] @ downstream_act).to_tensor().flatten()
    else:
        raise ValueError(f"Unknown type {type(left_vec)}")
    vjv = (upstream_act.grad @ right_vec).to_tensor().flatten()
    if return_without_right:
        jv = (upstream_act.grad @ right_vec).to_tensor().flatten()
    x_res.grad = t.zeros_like(x_res)
    to_backprop[downstream_feat].backward(retain_graph=True)
```
Using the notation of equation 6 from the paper: on line 350, we index into left_vec, selecting the gradient of some particular feature in the downstream node d with respect to the features in the intermediate node m. Then, on line 357, we index into to_backprop with downstream_feat, where to_backprop is an element-wise product of an intermediate-node gradient for m and the current activation of m. If downstream_feat corresponds to some feature in node d, why can we use it to index into intermediate node m?
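To illustrate why mixing the two index spaces worries me, here is a hypothetical sketch with made-up sizes (d_feats, m_feats, and the values are all assumptions); if d has more features than the flattened m, a d-space index can be invalid for to_backprop:

```python
import torch as t

# Hypothetical sizes: downstream node d has more features than
# the flattened intermediate node m.
d_feats = 10        # index space that downstream_feat ranges over
m_feats = 6         # size of to_backprop, which lives in m's index space

to_backprop = t.arange(m_feats, dtype=t.float32)

downstream_feat = 8              # a valid feature index in d ...
try:
    _ = to_backprop[downstream_feat]   # ... but out of range for m
except IndexError as e:
    print("indexing m with a d-space index failed:", e)
```

Even when the index happens to be in range, it would select an unrelated entry of m rather than anything tied to feature downstream_feat of d.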

vjv vs. jv

Finally, I believe there could be an issue with how vjv and jv are computed inside attribution.jvp. As far as I can tell (and running the code confirms this), by the time we reach this return statement

```python
return (
    t.sparse_coo_tensor(vjv_indices, vjv_values, (d_downstream_contracted, d_upstream_contracted)),
    t.sparse_coo_tensor(jv_indices, jv_values, (d_downstream_contracted, d_upstream))
)
```
vjv_indices and vjv_values are identical to jv_indices and jv_values. Therefore, in

```python
MR_effect, MR_grad = N(mlp, resid)
```

the only way that MR_effect and MR_grad differ is in their size; each has the same underlying values and indices. However, on lines 186 and 196 in

feature-circuits/circuit.py, lines 179 to 197 in c1a9b7b:

```python
RMR_effect = jvp(
    clean,
    model,
    dictionaries,
    mlp,
    features_by_submod[resid],
    prev_resid,
    {feat_idx : unflatten(MR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
    deltas[prev_resid],
)
RAR_effect = jvp(
    clean,
    model,
    dictionaries,
    attn,
    features_by_submod[resid],
    prev_resid,
    {feat_idx : unflatten(AR_grad[feat_idx].to_dense()) for feat_idx in features_by_submod[resid]},
    deltas[prev_resid],
)
```

this differing size matters when we reshape these *_grad variables. If we had instead applied the same reshape to the analogous *_effect variables, we would end up with two different tensors that have the same values, but rearranged into a strange permutation that does not seem to correspond to how these variables were computed in the first place. Is this behavior expected?
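Here is a minimal sketch of the permutation I mean, with hypothetical sizes standing in for d_upstream vs. d_upstream_contracted: the same flat values, reshaped under two different row lengths, land at different 2-D coordinates.

```python
import torch as t

# Hypothetical flat tensors with identical nonzero entries but different sizes,
# mimicking *_grad vs. *_effect sharing indices/values.
flat_small = t.zeros(6)
flat_small[4] = 1.0
flat_large = t.zeros(12)
flat_large[4] = 1.0

print(flat_small.reshape(2, 3))   # the 1.0 lands at position [1, 1]
print(flat_large.reshape(2, 6))   # the same value lands at position [0, 4]
```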
