Automated causal circuit discovery for transformer models using activation patching. Given clean/corrupted prompt pairs, AutoCircuit identifies which attention heads and MLP layers are causally responsible for a specific model behavior, measures pairwise information flow between them, and validates that the discovered circuit is both necessary and sufficient.
Built with TransformerLens. Validated on GPT-2 Small with the Indirect Object Identification (IOI) task.
Causal circuit for GPT-2 on IOI. Node size = importance, edge thickness = directional influence. Blue = attention heads, orange = MLP layers.
- Model: GPT-2 Small (124M params, 12 layers, 12 heads)
- Task: Indirect Object Identification, 8 clean/corrupted prompt pairs
- Components scored: 156 (144 attention heads + 12 MLP layers)
- Metric: logit(target) - logit(best alternative) at the last position
- Hardware: Apple M-series CPU, 8 GB RAM, ~37 s per example
| Metric | Single example | Averaged (n=8) |
|---|---|---|
| Clean logit diff | +1.66 | +1.33 |
| Corrupted logit diff | -2.48 | -2.40 |
| Drop | +4.14 | +3.73 |
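The metric itself can be sketched independently of the model: given the vocab-sized logit vector at the final position, take the target's logit minus the best competing logit. A minimal illustration (token ids and values are made up, not from the repo):

```python
import numpy as np

def logit_diff(logits: np.ndarray, target_id: int) -> float:
    """logit(target) - logit(best alternative) at one position.

    `logits` is the vocab-sized logit vector at the last position.
    """
    target = logits[target_id]
    # Best competitor: remove the target, then take the max of the rest.
    competitors = np.delete(logits, target_id)
    return float(target - competitors.max())

# Toy 5-token vocabulary; token 2 plays the role of " Mary".
logits = np.array([0.1, -1.0, 3.0, 2.2, 0.5])
print(logit_diff(logits, target_id=2))  # positive: the target beats its best competitor
```

A positive value means the model prefers the target; the "drop" row above is simply the clean value minus the corrupted one.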
| Rank | Component | Score |
|---|---|---|
| 1 | mlp_L0 | 3.575 |
| 2 | attn_L10_H7 | 2.534 |
| 3 | attn_L11_H10 | 1.674 |
| 4 | attn_L5_H5 | 1.400 |
| 5 | attn_L8_H6 | 1.249 |
| 6 | attn_L8_H10 | 1.100 |
| 7 | attn_L7_H9 | 0.920 |
| 8 | attn_L9_H9 | 0.790 |
| 9 | attn_L3_H0 | 0.590 |
| 10 | attn_L7_H3 | 0.564 |
| 11 | attn_L10_H0 | 0.541 |
| 12 | attn_L9_H7 | 0.527 |
| 13 | attn_L6_H9 | 0.516 |
| 14 | mlp_L5 | 0.443 |
| 15 | mlp_L11 | 0.438 |
Scores are averaged across 8 diverse IOI prompts to filter out per-example noise. A single prompt can activate rare circuit paths - averaging ensures the ranking reflects the task-general circuit, not input-specific artifacts. The same methodology is used in Wang et al. (2022).
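The averaging step is just a mean over per-example score vectors followed by a sort. A toy sketch (component names are taken from the tables above, but the scores here are invented for illustration):

```python
from statistics import mean

# Hypothetical per-example scores: {component: [score on example 0, 1, 2]}
per_example = {
    "mlp_L0":      [5.1, 2.8, 3.0],
    "attn_L10_H7": [2.0, 3.1, 2.4],
    "attn_L3_H0":  [0.4, 0.9, 0.5],
}

# Average across examples, then rank by the averaged score.
avg = {comp: mean(scores) for comp, scores in per_example.items()}
ranking = sorted(avg, key=avg.get, reverse=True)
print(ranking)  # ['mlp_L0', 'attn_L10_H7', 'attn_L3_H0']
```

Note that on example 0 alone the ordering could differ; the averaged ranking is what the tables report.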
Restore only the top-K clean activations into a corrupted run. If the circuit is sufficient, the model should recover correct behavior.
| K | Mean Recovery |
|---|---|
| 1 | 95.4% |
| 3 | 93.4% |
| 5 | 90.8% |
| 10 | 103.1% |
| 15 | 105.6% |
A single component (mlp_L0) recovers 95% of the correct behavior. Three components form a complete minimal circuit. Recovery exceeding 100% at K=10+ is expected - patching multiple clean activations creates a slightly cleaner residual stream than the original clean run.
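Recovery in the table is the standard normalization for denoising patches: how far the patched logit diff travels from the corrupted baseline toward the clean one. A sketch of the formula (the function name is ours, not the repo's):

```python
def recovery(clean: float, corrupted: float, patched: float) -> float:
    """Fraction of the clean-corrupted gap recovered by patching.

    0.0 = no better than the corrupted run, 1.0 = fully clean;
    values above 1.0 mean the patched run beats the clean run.
    """
    return (patched - corrupted) / (clean - corrupted)

print(recovery(clean=1.33, corrupted=-2.40, patched=1.33))   # 1.0: full recovery
print(recovery(clean=1.33, corrupted=-2.40, patched=-2.40))  # 0.0: no recovery
```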
Zero out the top-K components during a clean forward pass. If the circuit is necessary, performance should drop.
| K | Clean | Ablated | Drop (% of clean) |
|---|---|---|---|
| 1 | +1.33 | -4.20 | 416% |
| 3 | +1.33 | -4.61 | 446% |
| 5 | +1.33 | -7.44 | 659% |
| 10 | +1.33 | -8.39 | 730% |
Ablating just mlp_L0 flips the prediction from correct (+1.33) to strongly incorrect (-4.20). The circuit is both necessary and sufficient.
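The drop column expresses the ablated shift as a percentage of the clean logit diff; for K=1:

```python
# Values from the table above (K=1 row).
clean, ablated = 1.33, -4.20
drop_pct = (clean - ablated) / clean * 100
print(round(drop_pct))  # 416
```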
Scored all 156 components independently on each of the 8 examples:
| Component | Mean | Std | Top-10 in X/8 |
|---|---|---|---|
| mlp_L0 | 3.575 | 2.116 | 8/8 |
| attn_L10_H7 | 2.534 | 1.913 | 8/8 |
| attn_L11_H10 | 1.674 | 0.795 | 8/8 |
| attn_L5_H5 | 1.400 | 1.043 | 7/8 |
| attn_L8_H6 | 1.249 | 1.119 | 6/8 |
The top 3 appear in every single example's top 10. This is not noise.
| Stage | Components | Role |
|---|---|---|
| Encoding | mlp_L0 | Enriches token embeddings with name identity. Restoring it alone recovers 95%. |
| Pattern detection | attn_L3_H0, attn_L5_H5 | Detects that a name appears twice - the induction signal that triggers the IOI circuit. |
| Entity tracking | attn_L6_H9, attn_L7_H3, attn_L7_H9 | Tracks subject vs. indirect object. Suppresses the subject to prevent copying the wrong name. |
| Name moving | attn_L8_H6, attn_L8_H10, attn_L9_H9, attn_L10_H7, attn_L11_H10 | Copies the indirect object name to the output position. These are the decision-making heads. |
| Output shaping | mlp_L5, mlp_L11 | Adjusts the final logit distribution to boost the correct name. |
This hierarchy - encode, detect, track, move, refine - matches the IOI circuit described in the literature.
hook_z vs hook_result. TransformerLens only exposes `hook_result` (per-head outputs after the W_O projection) when `use_attn_result=True` is set, at a substantial memory cost; by default the correct hook for per-head patching is `hook_z` - the head output before the W_O projection, shaped `(batch, pos, n_heads, d_head)`. Using the wrong hook silently zeros all 144 head scores, making the analysis appear MLP-dominated. This bug is easy to introduce and hard to detect.
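Independent of TransformerLens itself, the patch applied at `hook_z` is a slice assignment over the head axis. A numpy sketch of the hook body (GPT-2 Small shapes as stated above; model loading and the hook registration API are omitted):

```python
import numpy as np

BATCH, POS, N_HEADS, D_HEAD = 1, 6, 12, 64  # GPT-2 Small shapes

def patch_head_z(z, clean_z, head: int):
    """Hook body for hook_z: overwrite one head's output with its clean value.

    z, clean_z: (batch, pos, n_heads, d_head). A hook at hook_result would
    instead see post-W_O per-head outputs of shape (batch, pos, n_heads, d_model).
    """
    z = z.copy()  # leave the cached corrupted activations untouched
    z[:, :, head, :] = clean_z[:, :, head, :]
    return z

corrupted = np.zeros((BATCH, POS, N_HEADS, D_HEAD))
clean = np.ones((BATCH, POS, N_HEADS, D_HEAD))
patched = patch_head_z(corrupted, clean, head=7)
print(patched[:, :, 7, :].sum(), patched.sum())  # 384.0 384.0 - only head 7 changed
```

If the hook were registered on a name that never fires, this function would simply never run, which is exactly how the "all head scores are zero" failure mode arises.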
Edge analysis optimization. With K=10, edge analysis needs K*(K-1)=90 pairwise forward passes (one per ordered pair). Single-component scores are pre-computed once and reused. Exhaustive analysis over all 156 components would require 156 x 155 ≈ 24,000 passes - top-K filtering reduces this by 99.6%.
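The pass counts follow directly from ordered-pair counting:

```python
K, N = 10, 156                      # top-K components vs. all components
edge_passes = K * (K - 1)           # ordered pairs among the top-K
exhaustive = N * (N - 1)            # all ordered component pairs
print(edge_passes, exhaustive)      # 90 24180
print(round((1 - edge_passes / exhaustive) * 100, 1))  # 99.6
```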
Memory. GPT-2's activation cache is ~1.2 GB in float32. We cache once per clean prompt and reuse across all 156 scoring iterations, leaving headroom on 8 GB systems.
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r autocircuit/requirements.txt

python -m autocircuit.cli.main run \
    --model gpt2 \
    --dataset autocircuit/data/ioi_task.json \
    --top_k 10 \
    --output results/
```

Flags: `--model` (default `gpt2`), `--dataset` (required), `--top_k` (default 20), `--output` (default `results/`), `--edge_threshold` (default 0.05), `--example_index` (default 0; use -1 for all examples).
Output: `scores.json`, `all_component_scores.json`, `circuit_graph.png`
```json
[
  {
    "clean": "When Mary and John went to the store, John gave a drink to",
    "corrupted": "When Mary and John went to the store, Mary gave a drink to",
    "target": " Mary"
  }
]
```

```
autocircuit/
├── core/
│   ├── model_utils.py          # TransformerLens loading, tokenization
│   ├── patching.py             # activation patching via hook_z / hook_mlp_out
│   └── scoring.py              # logit difference metric
├── analysis/
│   ├── node_selection.py       # per-component importance scoring
│   ├── edge_analysis.py        # pairwise directional influence
│   └── circuit_validation.py   # sufficiency + ablation tests
├── visualization/
│   └── graph_builder.py        # NetworkX graph -> PNG
├── cli/
│   └── main.py                 # pipeline entrypoint
└── data/
    └── ioi_task.json           # 8 IOI prompt pairs
```
- Wang et al. (2022). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small.
- Conmy et al. (2023). Towards Automated Circuit Discovery for Mechanistic Interpretability.
- TransformerLens documentation
MIT
