Automated Interpretability Report

Automated interpretability ("autointerp") refers to methods that identify model components or activation features and use a separate model to generate natural-language descriptions of each component's behavior. For this take-home, review the open-source autointerp landscape and implement a meaningful improvement, novel feature, or extension to an existing autointerp pipeline.

What OpenAI did

In their report "Language models can explain neurons in language models", OpenAI introduced a method to automatically generate explanations for individual neurons in transformer language models. Their approach involves:

Core Methodology: Explain → Simulate → Score

  1. Explanation Generation: GPT-4 examines top-activating text snippets for each neuron and generates a natural-language hypothesis about the neuron's behavior (e.g., "activates on month names").
  2. Learned Simulator: A dedicated model is trained/conditioned to predict neuron activations given an explanation. This simulator takes (explanation text, input) → predicted activation.
  3. Automated Scoring: The simulator's predictions are compared against actual neuron activations on held-out text. High correlation = good explanation.
  4. Iterative Refinement: The system can propose counterexamples to its own hypothesis and revise explanations through multiple rounds, improving scores.
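
As a concrete illustration of step 3, an explanation's score is the correlation between the simulator's predicted activations and the neuron's real activations on held-out text. Below is a minimal pure-Python sketch of that scoring idea, not OpenAI's actual implementation (which uses a learned simulator over token sequences):

```python
from statistics import fmean
from math import sqrt

def explanation_score(simulated, actual):
    """Pearson correlation between simulated and real activations.

    Higher correlation means the explanation better predicts the
    neuron's behavior on held-out text; 1.0 is a perfect explanation.
    """
    ms, ma = fmean(simulated), fmean(actual)
    cov = sum((s - ms) * (a - ma) for s, a in zip(simulated, actual))
    var_s = sum((s - ms) ** 2 for s in simulated)
    var_a = sum((a - ma) ** 2 for a in actual)
    return cov / sqrt(var_s * var_a)

# A perfectly linear relationship scores 1.0.
print(explanation_score([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))  # -> 1.0
```

In the real system, "simulated" values come from a model conditioned on the explanation text, so a high score means the natural-language hypothesis genuinely predicts when the neuron fires.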

Key Findings

  • Massive Scale: Generated and released explanations for all 307,200 neurons in GPT-2 XL, with scoring comparisons across the GPT-2 family (Small, Medium, Large, XL)
  • Quality Varies: Most explanations were weak; scores degraded in later layers and larger models, suggesting polysemy/complexity
  • Factors Matter: Stronger explainer models, iterative revision, and even the subject model's activation function affected explanation quality
  • Human Validation: Included both automated scoring and human evaluation as quality checks
  • Open Dataset: Released neuron activations, explanations, related tokens, and a web viewer

What They Didn't Do

OpenAI did not cluster or aggregate the 307k explanations; they released raw, per-neuron descriptions. This creates a massive, redundant list that is hard to navigate or analyze at scale.

What I propose

I propose a complete automated interpretability pipeline for attention heads in transformer models, adapting OpenAI's neuron explainer methodology to attention mechanisms. While existing autointerp work focuses primarily on neurons and SAE features, attention heads represent a fundamental computational unit that remains under-explored in automated interpretation frameworks.

Core Innovation

My implementation extends automated interpretability to attention heads by:

  1. Adapting the generate-simulate-score paradigm: Following OpenAI's neuron explainer approach but specifically designed for attention mechanisms, capturing both value outputs and attention patterns.
  2. Hierarchical clustering for discovery: Using HDBSCAN clustering with sentence embeddings to automatically discover functional families of attention heads across layers, revealing higher-level behavioral patterns that individual head explanations might miss.
  3. Multi-source event selection: Sampling from diverse corpora (WikiText-103, with optional StackExchange) to capture varied head behaviors across natural language, structured text, code, and punctuation patterns.

Technical Implementation

The pipeline consists of seven integrated stages:

  1. Data sampling: Loads and samples text windows from WikiText-103 (via HuggingFace datasets) and, optionally, StackExchange (not viable within current compute limits)
  2. Instrumentation: Uses TransformerLens to hook attention head activations (hook_z for value outputs, hook_pattern for attention)
  3. Salient event selection: Identifies high-impact tokens per head using L2 norms of value outputs
  4. Explanation generation: Uses OpenAI's API with structured JSON mode to generate hypotheses about head behavior
  5. Simulation scoring: Validates explanations by running generated test cases through the model and measuring activation differences
  6. Clustering: Groups functionally similar heads using sentence embeddings and HDBSCAN, generating canonical labels for each cluster
  7. Report generation: Produces automated markdown reports summarizing explanations, scores, and the cluster taxonomy
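
Stage 3 reduces to ranking token positions by the L2 norm of a head's value output. A minimal sketch with toy activation vectors (in the pipeline these vectors come from TransformerLens's hook_z cache, one per token position per head):

```python
from math import sqrt

def top_k_events(value_outputs, k=2):
    """Rank token positions by the L2 norm of a head's value output.

    value_outputs: one vector per token position for a single head.
    Returns the k positions with the largest norms, i.e. the tokens
    on which the head acts most strongly.
    """
    norms = [sqrt(sum(x * x for x in vec)) for vec in value_outputs]
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return ranked[:k]

# Toy example: position 1 has the largest vector, then position 3.
acts = [[0.1, 0.1], [3.0, 4.0], [0.0, 0.5], [1.0, 1.0]]
print(top_k_events(acts, k=2))  # -> [1, 3]
```

The selected positions, together with their surrounding text windows, become the "salient events" shown to the explainer model in stage 4.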

Improvements Over Baseline

This implementation offers:

  • Attention-specific design: Captures both what heads attend to (patterns) and what they compute (value outputs)
  • Automatic taxonomy discovery: Clustering reveals emergent head families (e.g., induction heads, punctuation heads, syntax heads)
  • Simulation-based validation: Moves beyond just generating explanations to actually testing them
  • Production-ready package: Installable via pip with proper dependencies and error handling

Differences from OpenAI's Approach

What I Did Differently

  1. Target: Attention Heads Instead of Neurons

    • OpenAI analyzed individual MLP neurons; I analyze attention head mechanisms
    • Captures both value outputs (what heads compute) and attention patterns (what they attend to)
    • Provides richer behavioral signatures specific to attention
  2. Direct Empirical Testing Instead of Learned Simulator

    • OpenAI: Built a separate simulator model trained to predict activations from explanations
    • Me: Generate positive/negative test cases, run them through the actual model, measure real activation differences using Cohen's d-style effect size
    • Simpler, more direct, but less sophisticated than a learned predictor
  3. Single-Shot Explanation vs. Iterative Refinement

    • OpenAI: Multi-round refinement loop where model proposes counterexamples and revises hypotheses
    • Me: One-shot generation with structured JSON output
    • Trade-off: speed vs. quality improvement from iteration
  4. Added Clustering Layer (Novel Contribution)

    • OpenAI: Released raw explanations for 307k neurons with no aggregation
    • Me: HDBSCAN clustering + canonical labels + agreement metrics
    • Discovers functional families, reduces redundancy, enables taxonomy building
    • This is the "high-leverage add-on" that makes large explanation sets navigable
  5. Richer Event Selection

    • OpenAI: Top-activating tokens per neuron
    • Me: Salience scoring via L2 norms + attention pattern analysis (what heads attend to) + context categorization + deduplication
    • More sophisticated behavioral characterization
  6. Scale Focus

    • OpenAI: Production system for 307k neurons with human evaluation
    • Me: Research tool optimized for iterative experimentation on smaller subsets
    • Configurable head counts, efficient caching, modular pipeline
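
The direct empirical testing in point 2 can be sketched as follows. The pooled-standard-deviation form below is an assumption, since the report only says "Cohen's d-style"; the exact formula in the pipeline may differ:

```python
from statistics import fmean, pstdev
from math import sqrt

def effect_size(pos_acts, neg_acts):
    """Cohen's d-style score: how much more the head fires on positive
    test cases than on negative ones, in pooled-standard-deviation units.

    Note: the pooled-std formulation here is an illustrative assumption.
    """
    mp, mn = fmean(pos_acts), fmean(neg_acts)
    pooled = sqrt((pstdev(pos_acts) ** 2 + pstdev(neg_acts) ** 2) / 2)
    return (mp - mn) / pooled if pooled > 0 else 0.0

# A head that fires strongly on positives and weakly on negatives
# gets a large positive effect size.
print(effect_size([5.0, 6.0, 7.0], [1.0, 1.5, 2.0]))
```

A score near zero means the explanation fails to separate the two sets of test cases, i.e. it does not actually predict the head's behavior.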

What's Missing from My Implementation

  1. No Learned Simulator Model

    • OpenAI trained a model to predict activations from explanation text
    • I use direct measurement instead
    • Impact: Less sophisticated prediction capability, but faster and simpler
  2. No Iterative Explanation Refinement

    • OpenAI's system proposes counterexamples and revises hypotheses
    • I generate explanations in a single pass
    • Impact: Potentially lower quality explanations, especially for complex/polysemantic heads
  3. No Human Evaluation Component

    • OpenAI included human scoring alongside automated metrics
    • I rely entirely on automated simulation scores
    • Impact: No ground truth validation of explanation quality
  4. Limited Cross-Model Analysis

    • OpenAI compared across GPT-2 Small → XL, activation functions, layer depths
    • I focus on single-model runs (gpt2)
    • Impact: Can't study how explanation quality scales with model size
  5. No Ablation Studies

    • OpenAI systematically tested: explainer model size, subject model size, activation functions, revision rounds
    • I provide a working pipeline but no systematic quality studies
    • Impact: Less understanding of what factors drive explanation quality
  6. Smaller Scale

    • OpenAI: 307k neurons with published dataset
    • Me: Configurable subsets (typically 200 heads)
    • Impact: Not comprehensive model documentation, but sufficient for research

Novel Contributions I Added

  1. Full Clustering Pipeline: Semantic similarity + HDBSCAN + canonical labels + agreement metrics
  2. Attention Pattern Analysis: Characterizes what heads attend to (recent tokens, syntax, content)
  3. Cluster Quality Metrics: Agreement scores, specificity measures, coverage statistics
  4. Report Generation: Automated markdown reports with taxonomy visualization
  5. Modular CLI: Easy experimentation with skip flags for each pipeline stage
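
One natural reading of the agreement metric in point 3 is the mean pairwise cosine similarity of explanation embeddings within a cluster; that reading is sketched below with toy 2-D vectors standing in for sentence embeddings (the exact metric in the pipeline may differ):

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def cluster_agreement(embeddings):
    """Mean pairwise cosine similarity of explanation embeddings in one
    cluster; values near 1.0 mean the grouped heads were described in
    very similar terms."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Three near-identical toy embeddings -> agreement close to 1.0.
tight = [[1.0, 0.1], [1.0, 0.12], [0.98, 0.1]]
print(round(cluster_agreement(tight), 3))
```

An agreement near 1.0 (as in the ~0.94 values reported later) indicates the cluster groups heads whose explanations are semantically close, rather than an arbitrary bucket.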

What Should Be Added (Future Work)

High Priority

  1. Human Evaluation Interface

    • Web UI to show examples + explanations
    • Likert scale ratings (1-5)
    • Comparison with automated scores
    • Benchmark for quality calibration
  2. Cross-Model Comparison Framework

    • Run same pipeline on GPT-2, GPT-2-medium, GPT-2-large, GPT-2-xl
    • Track explanation quality vs model size
    • Identify universal vs model-specific heads
    • Study layer-wise behavior patterns
  3. Ablation Studies

    • Vary explainer model (GPT-4o-mini vs GPT-4o vs Claude)
    • Vary embedding model for clustering
    • Vary min_cluster_size parameter
    • Test on different subject models
    • Document quality-cost tradeoffs

Medium Priority

  1. Learned Simulator (Optional)

    • Train small model: (explanation_embedding, input) → predicted_activation
    • Use for faster scoring on large test sets
    • Compare with direct measurement approach
    • May not be worth complexity vs current approach
  2. Activation Patching Validation

    • Patch head outputs to zero and measure performance drop
    • Correlate with explanation quality scores
    • Identify truly important vs. noise heads
    • Validates causal importance

Low Priority

  1. Circuit Discovery Integration

    • Use cluster membership + attention patterns
    • Identify heads that attend to each other
    • Build candidate circuits from functional families
    • Requires connectivity analysis tools
  2. Multi-Dataset Validation

    • Test explanations on: code, math, dialogue, different languages
    • Check if explanations generalize
    • Identify context-dependent behaviors
  3. Automated Quality Metrics

    • Explanation length/complexity scores
    • Content word density (specificity)
    • Inconsistency detection across clusters
    • Red-flag vague explanations
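
The content-word-density idea could be prototyped as below; the stopword list is an illustrative assumption (a real implementation would use a proper stopword resource plus domain terms like "head" and "tokens"):

```python
# Hypothetical stopword set: common function words plus domain filler
# words that appear in almost every head explanation.
STOPWORDS = {"the", "a", "an", "of", "on", "in", "to", "this", "and",
             "or", "various", "tokens", "head", "focuses"}

def specificity(explanation):
    """Fraction of content words in an explanation.

    Low values red-flag vague explanations like 'processes various
    tokens'; high values suggest the explanation names concrete
    triggers (e.g. 'month names').
    """
    words = explanation.lower().split()
    content = [w for w in words if w.strip(".,") not in STOPWORDS]
    return len(content) / len(words)

print(specificity("processes various tokens"))            # low
print(specificity("activates on month names like July"))  # higher
```

Explanations below some specificity threshold could then be routed back for regeneration or flagged for manual review.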

Experiments

I conducted two experiments on GPT-2 (12-layer, 144 attention heads) to evaluate the pipeline and compare explainer model quality:

Experiment 1: GPT-4o-mini as Explainer

Configuration:

  • Subject model: GPT-2 (124M parameters, 12 layers, 12 heads/layer)
  • Explainer model: GPT-4o-mini
  • Data: 1,500 text windows from WikiText-103
  • Top-k events per head: 200
  • Clustering: HDBSCAN with min_cluster_size=3, embeddings from Qwen3-Embedding-4B

Pipeline execution:

  1. Sampled 1,500 text windows (192 tokens each) from WikiText-103
  2. Cached attention activations for all 144 heads
  3. Selected top 200 high-activation events per head using L2 norm
  4. Generated explanations via GPT-4o-mini with structured JSON output
  5. Created positive/negative test cases and measured simulation scores
  6. Clustered explanations using sentence embeddings + HDBSCAN
  7. Generated canonical labels for each cluster

Experiment 2: GPT-4o as Explainer

Configuration:

  • Same as Experiment 1, but using GPT-4o as explainer model
  • All other parameters identical for controlled comparison

Hypothesis: GPT-4o should generate higher-quality explanations than GPT-4o-mini, reflected in:

  • Higher simulation scores (better predictive accuracy)
  • More fine-grained clustering (more specific behavioral categories)
  • Better cluster quality metrics (agreement, specificity)

Results

Quantitative Comparison

| Metric | GPT-4o-mini | GPT-4o | Change |
| --- | --- | --- | --- |
| Simulation Scores | | | |
| Mean | 0.314 | 0.607 | +93.5% |
| Median | 0.266 | 0.485 | +82.3% |
| Std Dev | 2.001 | 1.883 | -5.9% |
| Max | 6.856 | 8.785 | +28.1% |
| Min | -11.639 | -3.965 | +65.9% (less negative) |
| Clustering | | | |
| Total Clusters | 21 | 29 | +38.1% |
| Unclustered Heads | 55 (38.2%) | 50 (34.7%) | -9.1% |
| Mean Agreement | 0.946 | 0.942 | -0.4% |
| Mean Specificity | 0.799 | 0.756 | -5.4% |

Key Findings

1. GPT-4o generates significantly better explanations (as expected)

The mean simulation score nearly doubled from 0.31 to 0.61, indicating that GPT-4o's explanations are substantially more predictive of actual head behavior. The distribution also improved:

  • Fewer catastrophic failures (min score improved from -11.6 to -4.0)
  • Higher top performers (max score 8.79 vs 6.86)
  • More consistent quality (lower standard deviation)

2. More fine-grained functional taxonomy

GPT-4o resulted in 29 clusters vs 21 for GPT-4o-mini (+38%), suggesting GPT-4o identified more nuanced behavioral categories. However:

  • Cluster agreement remained high for both (~0.94), indicating coherent groupings
  • Specificity slightly decreased for GPT-4o (0.76 vs 0.80), possibly due to more granular splits

3. Discovered attention head functions in GPT-2

Top-performing heads (GPT-4o experiment):

  1. L9H0 (score: 8.79): Named entities and awards/accolades
  2. L0H11 (score: 8.14): Beginning-of-sequence detection, topic initiation
  3. L1H11 (score: 6.77): Delimiter tokens (quotes, commas)
  4. L10H6 (score: 6.24): Titles and roles (political contexts)
  5. L8H2 (score: 4.66): Punctuation marking clause endings

Dominant functional families (largest clusters):

GPT-4o-mini identified:

  • Named entities & quantities (26 heads, 18.1%)
  • Temporal expressions & dates (20 heads across 2 clusters, 13.9%)
  • Action verbs & causality (11 heads, 7.6%)

GPT-4o identified:

  • Temporal expressions & dates (39 heads across 3 clusters, 27.1%)
  • Numerical quantities (14 heads, 9.7%)
  • Named entities (multiple clusters, distributed)
  • Interesting head: L9H0 "This head focuses on named entities and specific awards or accolades, highlighting their significance in context."

Both models consistently identified temporal/numerical processing and named entity tracking as dominant attention head functions, but GPT-4o provided more detailed subcategories.

4. Layer-wise patterns

Analysis of the heatmaps reveals:

  • Early layers (0-3): Focus on syntactic delimiters, punctuation, sentence boundaries
  • Middle layers (4-7): Mixed behavior, lower average scores (harder to explain)
  • Late layers (8-11): Strongest scores, clearer semantic functions (entities, temporality, roles)

This aligns with prior literature showing early layers handle syntax while late layers process semantics.

5. Quality-cost tradeoff

Using GPT-4o as explainer:

  • +93% improvement in explanation quality (mean score)
  • ~4-5x higher API cost (GPT-4o vs GPT-4o-mini pricing)
  • Similar inference time (both use structured output)

For research purposes, GPT-4o is clearly superior. For production-scale documentation (hundreds of thousands of components), the cost differential becomes significant.

Comparison to OpenAI's Neuron Results

OpenAI's neuron explainer paper reported:

  • Mean scores degraded in later layers (opposite of my finding)
  • Quality degraded with larger models (I only tested GPT-2)
  • Most explanations were weak (consistent with my findings)

The divergence on layer patterns may reflect fundamental differences between neurons (point-wise nonlinearities) and attention heads (contextual aggregation). Late-layer attention heads may be more interpretable because they operate on increasingly abstract representations, while late-layer neurons become more polysemantic.

Limitations and Caveats

Current Implementation

  • Not optimized for large-scale runs: Works well for small models like GPT-2, but scaling to larger models requires optimizations in data handling and parallelization

    • Memory usage becomes a bottleneck when storing large activation datasets
    • No distributed processing support
  • Explanation quality depends on LLM capabilities: Limitations in the explainer model (GPT-4o-mini) can lead to vague or inaccurate explanations

    • No iterative refinement to improve quality
    • Single-shot generation may miss nuances
  • Clustering limitations:

    • May not capture all nuances of head behavior, especially overlapping functions or context-dependent behavior
    • Embedding model choice significantly impacts clustering results
    • HDBSCAN parameters (min_cluster_size) require tuning per model
    • Many heads may remain unclustered (labeled as noise)
  • No human validation: Entirely automated scoring with no ground truth validation

  • Limited test case diversity: Positive/negative examples may not cover full behavior spectrum

Comparison to OpenAI's System

  • Simulation is simpler: Direct measurement vs. learned predictor
  • No refinement loop: Single-pass explanation generation
  • Smaller scale: Hundreds of heads vs. hundreds of thousands of neurons
  • No ablation studies: Haven't systematically tested quality factors

Known Issues

  • Clustering results vary significantly with embedding model and parameters
  • Some heads get vague explanations (e.g., "processes various tokens")
  • Simulation scores don't always correlate with human-interpretable quality
  • No validation that explanations generalize across datasets

Summary

This implementation successfully adapts OpenAI's neuron explainer methodology to attention heads, with the key innovation being a full clustering pipeline that aggregates explanations into a navigable taxonomy. While it lacks some of the sophistication of OpenAI's system (learned simulator, iterative refinement, human evaluation), it provides a fast, modular research tool with novel contributions in attention-specific analysis and automated taxonomy discovery.

The main gap is quality validation—future work should focus on iterative refinement, human evaluation, and systematic ablation studies to understand what drives explanation quality for attention mechanisms.