head-explain

Automated interpretability for transformer attention heads using LLM-based explanations.

This package implements an end-to-end pipeline that:

  1. Samples diverse text from corpora (WikiText, StackExchange)
  2. Instruments attention heads using TransformerLens
  3. Selects salient events based on head activations
  4. Generates natural language explanations via OpenAI API
  5. Scores explanations through simulation
  6. Clusters similar explanations
  7. Produces comprehensive reports
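
As a rough sketch of step 3, selecting salient events can be as simple as taking the highest-scoring (query, key) positions in a head's attention pattern. The exact saliency criterion the pipeline uses is an assumption here; this only illustrates the shape of the operation:

```python
import torch

def topk_events(pattern: torch.Tensor, k: int = 5):
    # `pattern` is a [seq, seq] attention matrix for a single head,
    # e.g. one slice of TransformerLens's cached "pattern" activations.
    scores, idx = torch.topk(pattern.flatten(), k)
    seq = pattern.shape[-1]
    # Recover (query, key) coordinates from the flattened indices.
    return [(int(i) // seq, int(i) % seq, float(s)) for i, s in zip(idx, scores)]

# Toy 4x4 causal (lower-triangular) attention pattern.
pattern = torch.tril(torch.rand(4, 4))
events = topk_events(pattern, k=3)
```

Each returned tuple is a (query position, key position, score) event that later feeds the explanation prompt.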

⚠️ System Requirements

GPU Recommended: This package requires significant computational resources:

  • GPU: CUDA-compatible GPU strongly recommended (CPU execution will be extremely slow)
  • Memory: Minimum 16GB RAM, 20GB+ recommended for larger models
  • VRAM: 8GB+ GPU memory for GPT-2, 16GB+ for larger models
  • Storage: Several GB for model weights and cached activations

Note: Running on CPU is not recommended for production use. Model inference and activation caching are memory-intensive operations.

Installation

git clone https://github.com/MrtinoRG/Head-explain.git
cd Head-explain
pip install -e .

Requirements

  • Python >= 3.9
  • PyTorch >= 2.0.0
  • TransformerLens >= 1.0.0
  • OpenAI API key (set as OPENAI_API_KEY environment variable)

Quick Start

Set up your OpenAI API key:

export OPENAI_API_KEY="your-api-key-here"

Run the pipeline with default settings:

head-explain \
  --model gpt2 \
  --sources wikitext \
  --windows 1500 \
  --topk 200 \
  --max-heads 200 \
  --openai-model gpt-4o-mini \
  --outdir outputs/

This will:

  • Load the GPT-2 model
  • Sample 1500 text windows from WikiText
  • Select top 200 events per head
  • Analyze up to 200 heads
  • Generate explanations using GPT-4o-mini
  • Save results to outputs/

Usage

Command-Line Options

Model Configuration:

  • --model: Transformer model to analyze (default: gpt2)
    • Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
  • --device: Device to use (default: cuda)
    • Options: cuda, cpu (not recommended)

Data Sources:

  • --sources: Comma-separated data sources (default: wikitext)
    • Options: wikitext, stackexchange, or wikitext,stackexchange
  • --stackexchange-dir: Path to processed StackExchange data (required if using stackexchange)
  • --windows: Number of text windows to sample (default: 1500)
  • --window-len: Length of each window in tokens (default: 192)
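
The sampling stage can be pictured as slicing each document's token stream into fixed-length windows. Whether the real pipeline uses non-overlapping windows like the sketch below is an assumption; it only illustrates how --windows and --window-len interact:

```python
def make_windows(tokens, window_len=192, max_windows=None):
    # Non-overlapping, fixed-length slices; a ragged tail is dropped.
    windows = [tokens[i:i + window_len]
               for i in range(0, len(tokens) - window_len + 1, window_len)]
    return windows if max_windows is None else windows[:max_windows]

# 600 toy token IDs yield three full 192-token windows; keep two of them.
toy_tokens = list(range(600))
windows = make_windows(toy_tokens, window_len=192, max_windows=2)
```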

Analysis Parameters:

  • --topk: Top-K salient events per head (default: 200)
  • --max-heads: Maximum number of heads to analyze (default: 200)
  • --openai-model: OpenAI model for explanations (default: gpt-4o-mini)
    • Options: gpt-4o-mini, gpt-4o, gpt-4-turbo, etc.
  • --embedding-model: Sentence transformer for clustering (default: all-MiniLM-L6-v2)
  • --min-cluster-size: Minimum HDBSCAN cluster size (default: 5)

Output:

  • --outdir: Output directory for results (default: outputs/)
  • --verbose: Enable verbose logging

Pipeline Control (skip steps to reuse cached results):

  • --skip-cache: Use existing activation cache
  • --skip-explain: Use existing explanations
  • --skip-simulate: Use existing simulation scores
  • --skip-cluster: Use existing clusters
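
Conceptually, the simulation step scores an explanation by how well activations predicted from it track the head's real activations. The sketch below uses Pearson correlation; the actual statistic stored in scores.parquet is an assumption:

```python
import numpy as np

def simulation_score(actual, simulated):
    # Pearson correlation between real and simulated activations;
    # 1.0 means the explanation predicts the head's behaviour perfectly.
    actual = np.asarray(actual, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return float(np.corrcoef(actual, simulated)[0, 1])

score = simulation_score([0.1, 0.9, 0.2, 0.8], [0.0, 1.0, 0.25, 0.75])
```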

Examples

Basic analysis with WikiText:

head-explain --model gpt2 --sources wikitext --outdir outputs/

Analyze with both WikiText and StackExchange:

head-explain \
  --model gpt2 \
  --sources wikitext,stackexchange \
  --stackexchange-dir /path/to/stackexchange/out_dir \
  --outdir outputs/

Larger model with more comprehensive analysis:

head-explain \
  --model gpt2-medium \
  --sources wikitext \
  --windows 3000 \
  --topk 300 \
  --max-heads 400 \
  --openai-model gpt-4o \
  --outdir outputs_medium/

Re-run clustering with different parameters:

head-explain \
  --model gpt2 \
  --sources wikitext \
  --outdir outputs/ \
  --skip-cache \
  --skip-explain \
  --skip-simulate \
  --min-cluster-size 3

Output Files

The pipeline generates the following files in the output directory:

  • raw_events.parquet: Cached attention head activations
  • explanations.jsonl: Generated explanations for each head
  • scores.parquet: Simulation scores for each explanation
  • clusters.json: Clusters of similar explanations
  • report.md: Comprehensive Markdown report
  • summary.csv: CSV summary of all explanations with scores
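
The generated files can be inspected with standard tooling; for instance, explanations.jsonl holds one JSON object per line. The field names below are illustrative, not the guaranteed schema — check the generated file:

```python
import json
from pathlib import Path

def load_explanations(outdir):
    path = Path(outdir) / "explanations.jsonl"
    return [json.loads(line) for line in path.read_text().splitlines()
            if line.strip()]

# Demo against a tiny synthetic file:
Path("demo_out").mkdir(exist_ok=True)
(Path("demo_out") / "explanations.jsonl").write_text(
    '{"head": "L0H3", "explanation": "attends to the previous token"}\n'
)
records = load_explanations("demo_out")
```

The .parquet files can likewise be loaded with pandas.read_parquet, and report.md is plain Markdown.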

Data Preparation

WikiText-103

WikiText is downloaded automatically via the Hugging Face datasets library; no manual setup is required.

StackExchange (Optional)

To use StackExchange data:

  1. Clone the dataset repository:
git clone https://github.com/EleutherAI/stackexchange-dataset
cd stackexchange-dataset
pip install -r requirements.txt
  2. Download and process data:
python main.py \
  --names stackoverflow,unix.stackexchange \
  --out_format zip \
  --min_score 3 \
  --max_responses 3
  3. Use the output directory in the pipeline:
head-explain \
  --sources stackexchange \
  --stackexchange-dir /path/to/stackexchange/out_dir

License

This project is licensed under the MIT License - see the LICENSE file for details.
