# AutoCircuit

Automated causal circuit discovery for transformer models using activation patching. Given clean/corrupted prompt pairs, AutoCircuit identifies which attention heads and MLP layers are causally responsible for a specific model behavior, measures pairwise information flow between them, and validates that the discovered circuit is both necessary and sufficient.

Built with TransformerLens. Validated on GPT-2 Small with the Indirect Object Identification (IOI) task.

*Figure: causal circuit graph for GPT-2 on IOI. Node size = importance, edge thickness = directional influence. Blue = attention heads, orange = MLP layers.*


## Results

- **Model:** GPT-2 Small (124M params, 12 layers, 12 heads)
- **Task:** Indirect Object Identification (IOI), 8 clean/corrupted prompt pairs
- **Components scored:** 156 (144 attention heads + 12 MLP layers)
- **Metric:** logit(target) - logit(best alternative) at the last position
- **Hardware:** Apple M-series CPU, 8 GB RAM, ~37 s per example
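The metric can be sketched in a few lines (a pure-Python illustration; the function name and list-based input are ours, not the project's API):

```python
def logit_diff(last_logits, target_id):
    """logit(target) - logit(best alternative) at the final position.

    last_logits: per-token logits at the last sequence position.
    A positive score means the model prefers the target token.
    """
    target = last_logits[target_id]
    best_alt = max(v for i, v in enumerate(last_logits) if i != target_id)
    return target - best_alt

# Toy vocab of 4 tokens, target id 2: 2.0 - 0.5 = +1.5
score = logit_diff([0.5, -1.0, 2.0, 0.4], target_id=2)
```

On IOI prompts the target is the indirect object's name; the strongest alternative is typically the subject's name.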

### Baselines

| Metric | Single example | Averaged (n=8) |
|---|---|---|
| Clean logit diff | +1.66 | +1.33 |
| Corrupted logit diff | -2.48 | -2.40 |
| Drop | +4.14 | +3.73 |

### Top 15 components (averaged over 8 examples)

| Rank | Component | Score |
|---|---|---|
| 1 | mlp_L0 | 3.575 |
| 2 | attn_L10_H7 | 2.534 |
| 3 | attn_L11_H10 | 1.674 |
| 4 | attn_L5_H5 | 1.400 |
| 5 | attn_L8_H6 | 1.249 |
| 6 | attn_L8_H10 | 1.100 |
| 7 | attn_L7_H9 | 0.920 |
| 8 | attn_L9_H9 | 0.790 |
| 9 | attn_L3_H0 | 0.590 |
| 10 | attn_L7_H3 | 0.564 |
| 11 | attn_L10_H0 | 0.541 |
| 12 | attn_L9_H7 | 0.527 |
| 13 | attn_L6_H9 | 0.516 |
| 14 | mlp_L5 | 0.443 |
| 15 | mlp_L11 | 0.438 |

Scores are averaged across 8 diverse IOI prompts to filter out per-example noise. A single prompt can activate rare circuit paths - averaging ensures the ranking reflects the task-general circuit, not input-specific artifacts. The same methodology is used in Wang et al. (2022).


## Validation

### Sufficiency test

Restore only the top-K clean activations into a corrupted run. If the circuit is sufficient, the model should recover correct behavior.

| K | Mean recovery |
|---|---|
| 1 | 95.4% |
| 3 | 93.4% |
| 5 | 90.8% |
| 10 | 103.1% |
| 15 | 105.6% |

A single component (mlp_L0) recovers 95% of the correct behavior. Three components form a complete minimal circuit. Recovery exceeding 100% at K=10+ is expected - patching multiple clean activations creates a slightly cleaner residual stream than the original clean run.
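The recovery numbers follow a standard normalization of the patched logit diff against the clean/corrupted baselines (we assume this matches the project's definition; the averaged baselines from above are plugged in as defaults):

```python
def recovery(patched, clean=1.33, corrupted=-2.40):
    """Percent of the clean-vs-corrupted gap restored by patching.

    100% means the patched run matches the clean logit diff; values
    above 100% mean the patched run scores higher than clean.
    """
    return 100.0 * (patched - corrupted) / (clean - corrupted)

# e.g. a patched logit diff of +1.16 recovers (1.16 + 2.40) / 3.73, about 95.4%
```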

### Ablation test

Zero out the top-K components during a clean forward pass. If the circuit is necessary, performance should drop.

| K | Clean | Ablated | Drop |
|---|---|---|---|
| 1 | +1.33 | -4.20 | 416% |
| 3 | +1.33 | -4.61 | 446% |
| 5 | +1.33 | -7.44 | 659% |
| 10 | +1.33 | -8.39 | 730% |

Ablating just mlp_L0 flips the prediction from correct (+1.33) to strongly incorrect (-4.20). The circuit is both necessary and sufficient.
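Zero-ablating a head means zeroing its slice of the per-head output during the forward pass. A minimal sketch of such a hook, run here against a dummy NumPy array rather than a live model (`make_zero_head_hook` is our name; in TransformerLens the hook would be registered at `blocks.{layer}.attn.hook_z`):

```python
import numpy as np

def make_zero_head_hook(head_idx):
    """Build a hook that silences one attention head.

    hook_z activations are shaped (batch, pos, n_heads, d_head),
    so zeroing one slice removes exactly one head's contribution.
    """
    def hook(z, hook=None):
        z = z.copy()
        z[:, :, head_idx, :] = 0.0
        return z
    return hook

# Dummy cache with GPT-2 Small's head geometry: 12 heads of d_head=64.
z = np.random.randn(1, 4, 12, 64)
ablated = make_zero_head_hook(7)(z)
```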

### Stability

Scored all 156 components independently on each of the 8 examples:

| Component | Mean | Std | In top 10 (out of 8 examples) |
|---|---|---|---|
| mlp_L0 | 3.575 | 2.116 | 8/8 |
| attn_L10_H7 | 2.534 | 1.913 | 8/8 |
| attn_L11_H10 | 1.674 | 0.795 | 8/8 |
| attn_L5_H5 | 1.400 | 1.043 | 7/8 |
| attn_L8_H6 | 1.249 | 1.119 | 6/8 |

The top 3 appear in every single example's top 10. This is not noise.


## Circuit interpretation

| Stage | Components | Role |
|---|---|---|
| Encoding | mlp_L0 | Enriches token embeddings with name identity. Restoring it alone recovers 95%. |
| Pattern detection | attn_L3_H0, attn_L5_H5 | Detects that a name appears twice - the induction signal that triggers the IOI circuit. |
| Entity tracking | attn_L6_H9, attn_L7_H3, attn_L7_H9 | Tracks subject vs. indirect object. Suppresses the subject to prevent copying the wrong name. |
| Name moving | attn_L8_H6, attn_L8_H10, attn_L9_H9, attn_L10_H7, attn_L11_H10 | Copies the indirect object name to the output position. These are the decision-making heads. |
| Output shaping | mlp_L5, mlp_L11 | Adjusts the final logit distribution to boost the correct name. |

This hierarchy - encode, detect, track, move, refine - matches the IOI circuit described in the literature.


## Technical notes

**`hook_z` vs `hook_result`.** TransformerLens does not expose per-head `hook_result` unless `use_attn_result` is enabled, at a substantial memory cost. The correct hook for per-head patching is `hook_z`: the head output before the `W_O` projection, shaped `(batch, pos, n_heads, d_head)`. Using the wrong hook silently zeros all 144 head scores, making the analysis appear MLP-dominated. This bug is easy to introduce and hard to detect.
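A per-head patching hook in this style looks as follows (a hedged sketch against dummy arrays; `make_patch_head_hook` is our name, and in TransformerLens it would be passed to `run_with_hooks` at `blocks.{layer}.attn.hook_z`):

```python
import numpy as np

def make_patch_head_hook(head_idx, clean_z):
    """Restore one head's clean activation inside a corrupted run.

    Because hook_z is (batch, pos, n_heads, d_head) -- per-head output
    before the W_O projection -- one head can be swapped in isolation.
    """
    def hook(z, hook=None):
        z = z.copy()
        z[:, :, head_idx, :] = clean_z[:, :, head_idx, :]
        return z
    return hook

# Dummy clean/corrupted caches with GPT-2 Small's head geometry.
clean = np.ones((1, 4, 12, 64))
corrupt = np.zeros((1, 4, 12, 64))
patched = make_patch_head_hook(5, clean)(corrupt)
```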

**Edge analysis optimization.** With K=10, edge analysis needs K*(K-1) = 90 pairwise forward passes. Single-component scores are pre-computed once and reused. Exhaustive analysis over all 156 components would require ~24,000 passes - top-K filtering reduces this by 99.6%.
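The pass counts are simple ordered-pair arithmetic (the helper names here are ours):

```python
def edge_passes(k):
    """One forward pass per ordered pair among the top-K components."""
    return k * (k - 1)

def exhaustive_passes(n_components=156):
    """Ordered pairs over every scored component: the brute-force cost."""
    return n_components * (n_components - 1)

reduction = 1 - edge_passes(10) / exhaustive_passes()  # roughly 0.996
```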

**Memory.** GPT-2 Small's activation cache is ~1.2 GB in float32. We cache once per clean prompt and reuse it across all 156 scoring iterations, leaving headroom on 8 GB systems.


## Usage

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r autocircuit/requirements.txt

python -m autocircuit.cli.main run \
  --model gpt2 \
  --dataset autocircuit/data/ioi_task.json \
  --top_k 10 \
  --output results/
```

Flags: `--model` (default `gpt2`), `--dataset` (required), `--top_k` (default 20), `--output` (default `results/`), `--edge_threshold` (default 0.05), `--example_index` (default 0; use -1 for all examples).

Output: `scores.json`, `all_component_scores.json`, `circuit_graph.png`

### Dataset format

```json
[
  {
    "clean": "When Mary and John went to the store, John gave a drink to",
    "corrupted": "When Mary and John went to the store, Mary gave a drink to",
    "target": " Mary"
  }
]
```
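A minimal loader for this format (our helper, not the project's API; note that `target` keeps its leading space so GPT-2's BPE tokenizes it as a standalone word):

```python
import json

def load_ioi_pairs(path):
    """Load clean/corrupted prompt pairs from a JSON dataset file."""
    with open(path) as f:
        pairs = json.load(f)
    for pair in pairs:
        # Every entry must carry all three keys used by the pipeline.
        missing = {"clean", "corrupted", "target"} - pair.keys()
        if missing:
            raise ValueError(f"entry missing keys: {missing}")
    return pairs
```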

### Project structure

```
autocircuit/
├── core/
│   ├── model_utils.py        # TransformerLens loading, tokenization
│   ├── patching.py           # activation patching via hook_z / hook_mlp_out
│   └── scoring.py            # logit difference metric
├── analysis/
│   ├── node_selection.py     # per-component importance scoring
│   ├── edge_analysis.py      # pairwise directional influence
│   └── circuit_validation.py # sufficiency + ablation tests
├── visualization/
│   └── graph_builder.py      # NetworkX graph -> PNG
├── cli/
│   └── main.py               # pipeline entrypoint
└── data/
    └── ioi_task.json         # 8 IOI prompt pairs
```

## References

- Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2022). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small. arXiv:2211.00593.

## License

MIT
