Automated interpretability for transformer attention heads using LLM-based explanations.
This package implements an end-to-end pipeline that:
- Samples diverse text from corpora (WikiText, StackExchange)
- Instruments attention heads using TransformerLens
- Selects salient events based on head activations
- Generates natural language explanations via OpenAI API
- Scores explanations through simulation
- Clusters similar explanations
- Produces comprehensive reports
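The salient-event selection step, for example, amounts to keeping the top-K highest-activation positions per head. A minimal sketch, assuming a simple tuple layout (the names and fields here are illustrative, not the package's actual API):

```python
import heapq

def top_k_events(events, k):
    """Keep the K highest-activation events for one attention head.

    `events` is an iterable of (window_id, token_position, activation)
    tuples; the field layout is illustrative, not the package's schema.
    """
    return heapq.nlargest(k, events, key=lambda e: e[2])

events = [("w0", 5, 0.91), ("w1", 2, 0.15), ("w2", 7, 0.88), ("w3", 1, 0.40)]
print(top_k_events(events, 2))  # the two most salient events, highest first
```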
GPU Recommended: This package requires significant computational resources:
- GPU: CUDA-compatible GPU strongly recommended (CPU execution will be extremely slow)
- Memory: Minimum 16GB RAM, 20GB+ recommended for larger models
- VRAM: 8GB+ GPU memory for GPT-2, 16GB+ for larger models
- Storage: Several GB for model weights and cached activations
Note: Running on CPU is not recommended for production use. Model inference and activation caching are memory-intensive operations.
```
cd Head-explain
pip install -e .
```

Requirements:

- Python >= 3.9
- PyTorch >= 2.0.0
- TransformerLens >= 1.0.0
- OpenAI API key (set as the `OPENAI_API_KEY` environment variable)
```
export OPENAI_API_KEY="your-api-key-here"
```

```
head-explain \
  --model gpt2 \
  --sources wikitext \
  --windows 1500 \
  --topk 200 \
  --max-heads 200 \
  --openai-model gpt-4o-mini \
  --outdir outputs/
```

This will:
- Load the GPT-2 model
- Sample 1500 text windows from WikiText
- Select top 200 events per head
- Analyze up to 200 heads
- Generate explanations using GPT-4o-mini
- Save results to `outputs/`
Model Configuration:
- `--model`: Transformer model to analyze (default: `gpt2`). Options: `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`
- `--device`: Device to use (default: `cuda`). Options: `cuda`, `cpu` (not recommended)
Data Sources:
- `--sources`: Comma-separated data sources (default: `wikitext`). Options: `wikitext`, `stackexchange`, or `wikitext,stackexchange`
- `--stackexchange-dir`: Path to processed StackExchange data (required if using `stackexchange`)
- `--windows`: Number of text windows to sample (default: 1500)
- `--window-len`: Length of each window in tokens (default: 192)
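Conceptually, window sampling slices each document's token stream into fixed-length chunks. A minimal sketch under assumed semantics (the package's actual sampler may differ, e.g. by sampling positions randomly rather than by stride):

```python
def sample_windows(tokens, window_len, n_windows, stride=None):
    """Slice a token sequence into fixed-length windows (illustrative only)."""
    stride = stride or window_len  # non-overlapping by default
    windows = []
    for start in range(0, len(tokens) - window_len + 1, stride):
        windows.append(tokens[start:start + window_len])
        if len(windows) == n_windows:
            break
    return windows

toks = list(range(1000))  # stand-in for a tokenized document
ws = sample_windows(toks, window_len=192, n_windows=4)
print(len(ws), len(ws[0]))  # 4 192
```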
Analysis Parameters:
- `--topk`: Top-K salient events per head (default: 200)
- `--max-heads`: Maximum number of heads to analyze (default: 200)
- `--openai-model`: OpenAI model for explanations (default: `gpt-4o-mini`). Options: `gpt-4o-mini`, `gpt-4o`, `gpt-4-turbo`, etc.
- `--embedding-model`: Sentence transformer for clustering (default: `all-MiniLM-L6-v2`)
- `--min-cluster-size`: Minimum HDBSCAN cluster size (default: 5)
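The clustering step embeds each explanation with the sentence transformer and groups similar embeddings with HDBSCAN. As a dependency-free illustration of the similarity measure involved (this is not the actual HDBSCAN call, and the vectors below are toy stand-ins for real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d "embeddings"; real ones come from all-MiniLM-L6-v2 (384-d).
emb = {
    "attends to previous token": [0.9, 0.1, 0.0],
    "looks at the prior token":  [0.8, 0.2, 0.1],
    "fires on punctuation":      [0.0, 0.1, 0.9],
}
a, b, c = emb.values()
print(round(cosine(a, b), 3))  # high: paraphrased explanations cluster together
print(round(cosine(a, c), 3))  # low: a different head behaviour
```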
Output:
- `--outdir`: Output directory for results (default: `outputs/`)
- `--verbose`: Enable verbose logging
Pipeline Control (skip steps to reuse cached results):
- `--skip-cache`: Use existing activation cache
- `--skip-explain`: Use existing explanations
- `--skip-simulate`: Use existing simulation scores
- `--skip-cluster`: Use existing clusters
Basic analysis with WikiText:
```
head-explain --model gpt2 --sources wikitext --outdir outputs/
```

Analyze with both WikiText and StackExchange:
```
head-explain \
  --model gpt2 \
  --sources wikitext,stackexchange \
  --stackexchange-dir /path/to/stackexchange/out_dir \
  --outdir outputs/
```

Larger model with more comprehensive analysis:
```
head-explain \
  --model gpt2-medium \
  --sources wikitext \
  --windows 3000 \
  --topk 300 \
  --max-heads 400 \
  --openai-model gpt-4o \
  --outdir outputs_medium/
```

Re-run clustering with different parameters:
```
head-explain \
  --model gpt2 \
  --sources wikitext \
  --outdir outputs/ \
  --skip-cache \
  --skip-explain \
  --skip-simulate \
  --min-cluster-size 3
```

The pipeline generates the following files in the output directory:
- `raw_events.parquet`: Cached attention head activations
- `explanations.jsonl`: Generated explanations for each head
- `scores.parquet`: Simulation scores for each explanation
- `clusters.json`: Clusters of similar explanations
- `report.md`: Comprehensive Markdown report
- `summary.csv`: CSV summary of all explanations with scores
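The JSONL output can be read line by line with the standard library. A minimal sketch; the field names below are assumptions about the record schema, not guaranteed by the package:

```python
import json

# One record per head; field names ("layer", "head", "explanation",
# "score") are hypothetical placeholders for the actual schema.
line = '{"layer": 3, "head": 7, "explanation": "attends to previous token", "score": 0.62}'
rec = json.loads(line)
print(f"L{rec['layer']}H{rec['head']}: {rec['explanation']} (score={rec['score']:.2f})")
```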
WikiText is automatically downloaded via HuggingFace datasets. No manual setup required.
To use StackExchange data:
- Clone the dataset repository:

```
git clone https://github.com/EleutherAI/stackexchange-dataset
cd stackexchange-dataset
pip install -r requirements.txt
```

- Download and process data:

```
python main.py \
  --names stackoverflow,unix.stackexchange \
  --out_format zip \
  --min_score 3 \
  --max_responses 3
```

- Use the output directory in the pipeline:

```
head-explain \
  --sources stackexchange \
  --stackexchange-dir /path/to/stackexchange/out_dir
```

Based on:
This project is licensed under the MIT License - see the LICENSE file for details.
