Automated Interpretability Report

Automated interpretability ("autointerp") refers to methods that identify model components or activation features and use a separate model to generate natural-language descriptions of each component's behavior. For this take-home, review the open-source autointerp landscape and implement a meaningful improvement, novel feature, or extension to an existing autointerp pipeline.

What OpenAI did

In their report "Language models can explain neurons in language models", OpenAI introduced a method to automatically generate explanations for individual neurons in transformer language models. Their approach involves:

Core Methodology: Explain → Simulate → Score

  1. Explanation Generation: GPT-4 examines top-activating text snippets for each neuron and generates a natural-language hypothesis about the neuron's behavior (e.g., "activates on month names").
  2. Learned Simulator: A dedicated model is trained/conditioned to predict neuron activations given an explanation. This simulator takes (explanation text, input) → predicted activation.
  3. Automated Scoring: The simulator's predictions are compared against actual neuron activations on held-out text. High correlation = good explanation.
  4. Iterative Refinement: The system can propose counterexamples to its own hypothesis and revise explanations through multiple rounds, improving scores.
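
As a concrete illustration of step 3, an explanation's score is the correlation between the simulator's predicted activations and the neuron's real activations on held-out text. Below is a minimal pure-Python sketch of that scoring idea, not OpenAI's actual implementation (which uses a learned simulator over token sequences):

```python
from statistics import fmean
from math import sqrt

def explanation_score(simulated, actual):
    """Pearson correlation between simulated and real activations.

    Higher correlation means the explanation better predicts the
    neuron's behavior on held-out text; 1.0 is a perfect explanation.
    """
    ms, ma = fmean(simulated), fmean(actual)
    cov = sum((s - ms) * (a - ma) for s, a in zip(simulated, actual))
    var_s = sum((s - ms) ** 2 for s in simulated)
    var_a = sum((a - ma) ** 2 for a in actual)
    return cov / sqrt(var_s * var_a)

# A perfectly linear relationship scores 1.0.
print(explanation_score([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))  # -> 1.0
```

In the real system, "simulated" values come from a model conditioned on the explanation text, so a high score means the natural-language hypothesis genuinely predicts when the neuron fires.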

Key Findings

  • Massive Scale: Generated and released explanations for all 307,200 neurons in GPT-2 XL, with scoring comparisons across the GPT-2 family (Small, Medium, Large, XL)
  • Quality Varies: Most explanations were weak; scores degraded in later layers and larger models, suggesting polysemy/complexity
  • Factors Matter: Stronger explainer models, iterative revision, and even the subject model's activation function affected explanation quality
  • Human Validation: Included both automated scoring and human evaluation as quality checks
  • Open Dataset: Released neuron activations, explanations, related tokens, and a web viewer

What They Didn't Do

OpenAI did not cluster or aggregate the 307k explanations; they released raw, per-neuron descriptions. This creates a massive, redundant list that is hard to navigate or analyze at scale.

What I propose

I propose a complete automated interpretability pipeline for attention heads in transformer models, adapting OpenAI's neuron explainer methodology to attention mechanisms. While existing autointerp work focuses primarily on neurons and SAE features, attention heads represent a fundamental computational unit that remains under-explored in automated interpretation frameworks.

Core Innovation

My implementation extends automated interpretability to attention heads by:

  1. Adapting the generate-simulate-score paradigm: Following OpenAI's neuron explainer approach but specifically designed for attention mechanisms, capturing both value outputs and attention patterns.
  2. Hierarchical clustering for discovery: Using HDBSCAN clustering with sentence embeddings to automatically discover functional families of attention heads across layers, revealing higher-level behavioral patterns that individual head explanations might miss.
  3. Multi-source event selection: Sampling from diverse corpora (WikiText-103, with optional StackExchange) to capture varied head behaviors across natural language, structured text, code, and punctuation patterns.

Technical Implementation

The pipeline consists of seven integrated stages:

  1. Data sampling: Loads and samples text windows from WikiText-103 (via HuggingFace datasets) and, optionally, StackExchange (not viable within current compute limits)
  2. Instrumentation: Uses TransformerLens to hook attention head activations (hook_z for value outputs, hook_pattern for attention)
  3. Salient event selection: Identifies high-impact tokens per head using L2 norms of value outputs
  4. Explanation generation: Uses OpenAI's API with structured JSON mode to generate hypotheses about head behavior
  5. Simulation scoring: Validates explanations by running generated test cases through the model and measuring activation differences
  6. Clustering: Groups functionally similar heads using sentence embeddings and HDBSCAN, generating canonical labels for each cluster
  7. Report generation: Produces automated markdown reports summarizing explanations, scores, and the cluster taxonomy
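
Stage 3 reduces to ranking token positions by the L2 norm of a head's value output. A minimal sketch with toy activation vectors (in the pipeline these vectors come from TransformerLens's hook_z cache, one per token position per head):

```python
from math import sqrt

def top_k_events(value_outputs, k=2):
    """Rank token positions by the L2 norm of a head's value output.

    value_outputs: one vector per token position for a single head.
    Returns the k positions with the largest norms, i.e. the tokens
    on which the head acts most strongly.
    """
    norms = [sqrt(sum(x * x for x in vec)) for vec in value_outputs]
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return ranked[:k]

# Toy example: position 1 has the largest vector, then position 3.
acts = [[0.1, 0.1], [3.0, 4.0], [0.0, 0.5], [1.0, 1.0]]
print(top_k_events(acts, k=2))  # -> [1, 3]
```

The selected positions, together with their surrounding text windows, become the "salient events" shown to the explainer model in stage 4.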

Improvements Over Baseline

This implementation offers:

  • Attention-specific design: Captures both what heads attend to (patterns) and what they compute (value outputs)
  • Automatic taxonomy discovery: Clustering reveals emergent head families (e.g., induction heads, punctuation heads, syntax heads)
  • Simulation-based validation: Moves beyond just generating explanations to actually testing them
  • Production-ready package: Installable via pip with proper dependencies and error handling

Differences from OpenAI's Approach

What I Did Differently

  1. Target: Attention Heads Instead of Neurons

    • OpenAI analyzed individual MLP neurons; I analyze attention head mechanisms
    • Captures both value outputs (what heads compute) and attention patterns (what they attend to)
    • Provides richer behavioral signatures specific to attention
  2. Direct Empirical Testing Instead of Learned Simulator

    • OpenAI: Built a separate simulator model trained to predict activations from explanations
    • Me: Generate positive/negative test cases, run them through the actual model, measure real activation differences using Cohen's d-style effect size
    • Simpler, more direct, but less sophisticated than a learned predictor
  3. Single-Shot Explanation vs. Iterative Refinement

    • OpenAI: Multi-round refinement loop where model proposes counterexamples and revises hypotheses
    • Me: One-shot generation with structured JSON output
    • Trade-off: speed vs. quality improvement from iteration
  4. Added Clustering Layer (Novel Contribution)

    • OpenAI: Released raw explanations for 307k neurons with no aggregation
    • Me: HDBSCAN clustering + canonical labels + agreement metrics
    • Discovers functional families, reduces redundancy, enables taxonomy building
    • This is the "high-leverage add-on" that makes large explanation sets navigable
  5. Richer Event Selection

    • OpenAI: Top-activating tokens per neuron
    • Me: Salience scoring via L2 norms + attention pattern analysis (what heads attend to) + context categorization + deduplication
    • More sophisticated behavioral characterization
  6. Scale Focus

    • OpenAI: Production system for 307k neurons with human evaluation
    • Me: Research tool optimized for iterative experimentation on smaller subsets
    • Configurable head counts, efficient caching, modular pipeline
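
The direct empirical testing in point 2 can be sketched as follows. The pooled-standard-deviation form below is an assumption, since the report only says "Cohen's d-style"; the exact formula in the pipeline may differ:

```python
from statistics import fmean, pstdev
from math import sqrt

def effect_size(pos_acts, neg_acts):
    """Cohen's d-style score: how much more the head fires on positive
    test cases than on negative ones, in pooled-standard-deviation units.

    Note: the pooled-std formulation here is an illustrative assumption.
    """
    mp, mn = fmean(pos_acts), fmean(neg_acts)
    pooled = sqrt((pstdev(pos_acts) ** 2 + pstdev(neg_acts) ** 2) / 2)
    return (mp - mn) / pooled if pooled > 0 else 0.0

# A head that fires strongly on positives and weakly on negatives
# gets a large positive effect size.
print(effect_size([5.0, 6.0, 7.0], [1.0, 1.5, 2.0]))
```

A score near zero means the explanation fails to separate the two sets of test cases, i.e. it does not actually predict the head's behavior.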

What's Missing from My Implementation

  1. No Learned Simulator Model

    • OpenAI trained a model to predict activations from explanation text
    • I use direct measurement instead
    • Impact: Less sophisticated prediction capability, but faster and simpler
  2. No Iterative Explanation Refinement

    • OpenAI's system proposes counterexamples and revises hypotheses
    • I generate explanations in a single pass
    • Impact: Potentially lower quality explanations, especially for complex/polysemantic heads
  3. No Human Evaluation Component

    • OpenAI included human scoring alongside automated metrics
    • I rely entirely on automated simulation scores
    • Impact: No ground truth validation of explanation quality
  4. Limited Cross-Model Analysis

    • OpenAI compared across GPT-2 Small → XL, activation functions, layer depths
    • I focus on single-model runs (gpt2)
    • Impact: Can't study how explanation quality scales with model size
  5. No Ablation Studies

    • OpenAI systematically tested: explainer model size, subject model size, activation functions, revision rounds
    • I provide a working pipeline but no systematic quality studies
    • Impact: Less understanding of what factors drive explanation quality
  6. Smaller Scale

    • OpenAI: 307k neurons with published dataset
    • Me: Configurable subsets (typically 200 heads)
    • Impact: Not comprehensive model documentation, but sufficient for research

Novel Contributions I Added

  1. Full Clustering Pipeline: Semantic similarity + HDBSCAN + canonical labels + agreement metrics
  2. Attention Pattern Analysis: Characterizes what heads attend to (recent tokens, syntax, content)
  3. Cluster Quality Metrics: Agreement scores, specificity measures, coverage statistics
  4. Report Generation: Automated markdown reports with taxonomy visualization
  5. Modular CLI: Easy experimentation with skip flags for each pipeline stage
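
One natural reading of the agreement metric in point 3 is the mean pairwise cosine similarity of explanation embeddings within a cluster; that reading is sketched below with toy 2-D vectors standing in for sentence embeddings (the exact metric in the pipeline may differ):

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def cluster_agreement(embeddings):
    """Mean pairwise cosine similarity of explanation embeddings in one
    cluster; values near 1.0 mean the grouped heads were described in
    very similar terms."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Three near-identical toy embeddings -> agreement close to 1.0.
tight = [[1.0, 0.1], [1.0, 0.12], [0.98, 0.1]]
print(round(cluster_agreement(tight), 3))
```

An agreement near 1.0 (as in the ~0.94 values reported later) indicates the cluster groups heads whose explanations are semantically close, rather than an arbitrary bucket.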

What Should Be Added (Future Work)

High Priority

  1. Human Evaluation Interface

    • Web UI to show examples + explanations
    • Likert scale ratings (1-5)
    • Comparison with automated scores
    • Benchmark for quality calibration
  2. Cross-Model Comparison Framework

    • Run same pipeline on GPT-2, GPT-2-medium, GPT-2-large, GPT-2-xl
    • Track explanation quality vs model size
    • Identify universal vs model-specific heads
    • Study layer-wise behavior patterns
  3. Ablation Studies

    • Vary explainer model (GPT-4o-mini vs GPT-4o vs Claude)
    • Vary embedding model for clustering
    • Vary min_cluster_size parameter
    • Test on different subject models
    • Document quality-cost tradeoffs

Medium Priority

  1. Learned Simulator (Optional)

    • Train small model: (explanation_embedding, input) → predicted_activation
    • Use for faster scoring on large test sets
    • Compare with direct measurement approach
    • May not be worth complexity vs current approach
  2. Activation Patching Validation

    • Patch head outputs to zero and measure performance drop
    • Correlate with explanation quality scores
    • Identify truly important vs. noise heads
    • Validates causal importance

Low Priority

  1. Circuit Discovery Integration

    • Use cluster membership + attention patterns
    • Identify heads that attend to each other
    • Build candidate circuits from functional families
    • Requires connectivity analysis tools
  2. Multi-Dataset Validation

    • Test explanations on: code, math, dialogue, different languages
    • Check if explanations generalize
    • Identify context-dependent behaviors
  3. Automated Quality Metrics

    • Explanation length/complexity scores
    • Content word density (specificity)
    • Inconsistency detection across clusters
    • Red-flag vague explanations
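
The content-word-density idea could be prototyped as below; the stopword list is an illustrative assumption (a real implementation would use a proper stopword resource plus domain terms like "head" and "tokens"):

```python
# Hypothetical stopword set: common function words plus domain filler
# words that appear in almost every head explanation.
STOPWORDS = {"the", "a", "an", "of", "on", "in", "to", "this", "and",
             "or", "various", "tokens", "head", "focuses"}

def specificity(explanation):
    """Fraction of content words in an explanation.

    Low values red-flag vague explanations like 'processes various
    tokens'; high values suggest the explanation names concrete
    triggers (e.g. 'month names').
    """
    words = explanation.lower().split()
    content = [w for w in words if w.strip(".,") not in STOPWORDS]
    return len(content) / len(words)

print(specificity("processes various tokens"))            # low
print(specificity("activates on month names like July"))  # higher
```

Explanations below some specificity threshold could then be routed back for regeneration or flagged for manual review.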

Experiments

I conducted two experiments on GPT-2 (12-layer, 144 attention heads) to evaluate the pipeline and compare explainer model quality:

Experiment 1: GPT-4o-mini as Explainer

Configuration:

  • Subject model: GPT-2 (124M parameters, 12 layers, 12 heads/layer)
  • Explainer model: GPT-4o-mini
  • Data: 1,500 text windows from WikiText-103
  • Top-k events per head: 200
  • Clustering: HDBSCAN with min_cluster_size=3, embeddings from Qwen3-Embedding-4B

Pipeline execution:

  1. Sampled 1,500 text windows (192 tokens each) from WikiText-103
  2. Cached attention activations for all 144 heads
  3. Selected top 200 high-activation events per head using L2 norm
  4. Generated explanations via GPT-4o-mini with structured JSON output
  5. Created positive/negative test cases and measured simulation scores
  6. Clustered explanations using sentence embeddings + HDBSCAN
  7. Generated canonical labels for each cluster

Experiment 2: GPT-4o as Explainer

Configuration:

  • Same as Experiment 1, but using GPT-4o as explainer model
  • All other parameters identical for controlled comparison

Hypothesis: GPT-4o should generate higher-quality explanations than GPT-4o-mini, reflected in:

  • Higher simulation scores (better predictive accuracy)
  • More fine-grained clustering (more specific behavioral categories)
  • Better cluster quality metrics (agreement, specificity)

Results

Quantitative Comparison

| Metric | GPT-4o-mini | GPT-4o | Change |
| --- | --- | --- | --- |
| Simulation Scores | | | |
| Mean | 0.314 | 0.607 | +93.5% |
| Median | 0.266 | 0.485 | +82.3% |
| Std Dev | 2.001 | 1.883 | -5.9% |
| Max | 6.856 | 8.785 | +28.1% |
| Min | -11.639 | -3.965 | +65.9% (less negative) |
| Clustering | | | |
| Total Clusters | 21 | 29 | +38.1% |
| Unclustered Heads | 55 (38.2%) | 50 (34.7%) | -9.1% |
| Mean Agreement | 0.946 | 0.942 | -0.4% |
| Mean Specificity | 0.799 | 0.756 | -5.4% |

Key Findings

1. GPT-4o generates significantly better explanations (as expected)

The mean simulation score nearly doubled from 0.31 to 0.61, indicating that GPT-4o's explanations are substantially more predictive of actual head behavior. The distribution also improved:

  • Fewer catastrophic failures (min score improved from -11.6 to -4.0)
  • Higher top performers (max score 8.79 vs 6.86)
  • More consistent quality (lower standard deviation)

2. More fine-grained functional taxonomy

GPT-4o resulted in 29 clusters vs 21 for GPT-4o-mini (+38%), suggesting GPT-4o identified more nuanced behavioral categories. However:

  • Cluster agreement remained high for both (~0.94), indicating coherent groupings
  • Specificity slightly decreased for GPT-4o (0.76 vs 0.80), possibly due to more granular splits

3. Discovered attention head functions in GPT-2

Top-performing heads (GPT-4o experiment):

  1. L9H0 (score: 8.79): Named entities and awards/accolades
  2. L0H11 (score: 8.14): Beginning-of-sequence detection, topic initiation
  3. L1H11 (score: 6.77): Delimiter tokens (quotes, commas)
  4. L10H6 (score: 6.24): Titles and roles (political contexts)
  5. L8H2 (score: 4.66): Punctuation marking clause endings

Dominant functional families (largest clusters):

GPT-4o-mini identified:

  • Named entities & quantities (26 heads, 18.1%)
  • Temporal expressions & dates (20 heads across 2 clusters, 13.9%)
  • Action verbs & causality (11 heads, 7.6%)

GPT-4o identified:

  • Temporal expressions & dates (39 heads across 3 clusters, 27.1%)
  • Numerical quantities (14 heads, 9.7%)
  • Named entities (multiple clusters, distributed)
  • Interesting head: L9H0 "This head focuses on named entities and specific awards or accolades, highlighting their significance in context."

Both models consistently identified temporal/numerical processing and named entity tracking as dominant attention head functions, but GPT-4o provided more detailed subcategories.

4. Layer-wise patterns

Analysis of the heatmaps reveals:

  • Early layers (0-3): Focus on syntactic delimiters, punctuation, sentence boundaries
  • Middle layers (4-7): Mixed behavior, lower average scores (harder to explain)
  • Late layers (8-11): Strongest scores, clearer semantic functions (entities, temporality, roles)

This aligns with prior literature showing early layers handle syntax while late layers process semantics.

5. Quality-cost tradeoff

Using GPT-4o as explainer:

  • +93% improvement in explanation quality (mean score)
  • ~4-5x higher API cost (GPT-4o vs GPT-4o-mini pricing)
  • Similar inference time (both use structured output)

For research purposes, GPT-4o is clearly superior. For production-scale documentation (hundreds of thousands of components), the cost differential becomes significant.

Comparison to OpenAI's Neuron Results

OpenAI's neuron explainer paper reported:

  • Mean scores degraded in later layers (opposite of my finding)
  • Quality degraded with larger models (I only tested GPT-2)
  • Most explanations were weak (consistent with my findings)

The divergence on layer patterns may reflect fundamental differences between neurons (point-wise nonlinearities) and attention heads (contextual aggregation). Late-layer attention heads may be more interpretable because they operate on increasingly abstract representations, while late-layer neurons become more polysemantic.

Limitations and Caveats

Current Implementation

  • Not optimized for large-scale runs: Works well for small models like GPT-2, but scaling to larger models requires optimizations in data handling and parallelization

    • Memory usage becomes a bottleneck when storing large activation datasets
    • No distributed processing support
  • Explanation quality depends on LLM capabilities: Limitations in the explainer model (GPT-4o-mini) can lead to vague or inaccurate explanations

    • No iterative refinement to improve quality
    • Single-shot generation may miss nuances
  • Clustering limitations:

    • May not capture all nuances of head behavior, especially overlapping functions or context-dependent behavior
    • Embedding model choice significantly impacts clustering results
    • HDBSCAN parameters (min_cluster_size) require tuning per model
    • Many heads may remain unclustered (labeled as noise)
  • No human validation: Entirely automated scoring with no ground truth validation

  • Limited test case diversity: Positive/negative examples may not cover full behavior spectrum

Comparison to OpenAI's System

  • Simulation is simpler: Direct measurement vs. learned predictor
  • No refinement loop: Single-pass explanation generation
  • Smaller scale: Hundreds of heads vs. hundreds of thousands of neurons
  • No ablation studies: Haven't systematically tested quality factors

Known Issues

  • Clustering results vary significantly with embedding model and parameters
  • Some heads get vague explanations (e.g., "processes various tokens")
  • Simulation scores don't always correlate with human-interpretable quality
  • No validation that explanations generalize across datasets

Summary

This implementation successfully adapts OpenAI's neuron explainer methodology to attention heads, with the key innovation being a full clustering pipeline that aggregates explanations into a navigable taxonomy. While it lacks some of the sophistication of OpenAI's system (learned simulator, iterative refinement, human evaluation), it provides a fast, modular research tool with novel contributions in attention-specific analysis and automated taxonomy discovery.

The main gap is quality validation—future work should focus on iterative refinement, human evaluation, and systematic ablation studies to understand what drives explanation quality for attention mechanisms.