Automated interpretability for transformer attention heads using LLM-based explanations.
This package implements an end-to-end pipeline that:
- Samples diverse text from corpora (WikiText, StackExchange)
- Instruments attention heads using TransformerLens
- Selects salient events based on head activations
- Generates natural language explanations via OpenAI API
- Scores explanations through simulation
- Clusters similar explanations
- Produces comprehensive reports
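The salient-event selection step, for example, amounts to keeping the top-K highest-activation positions per head. A minimal sketch, assuming a simple tuple layout (the names and fields here are illustrative, not the package's actual API):

```python
import heapq

def top_k_events(events, k):
    """Keep the K highest-activation events for one attention head.

    `events` is an iterable of (window_id, token_position, activation)
    tuples; the field layout is illustrative, not the package's schema.
    """
    return heapq.nlargest(k, events, key=lambda e: e[2])

events = [("w0", 5, 0.91), ("w1", 2, 0.15), ("w2", 7, 0.88), ("w3", 1, 0.40)]
print(top_k_events(events, 2))  # the two most salient events, highest first
```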
GPU Recommended: This package requires significant computational resources:
- GPU: CUDA-compatible GPU strongly recommended (CPU execution will be extremely slow)
- Memory: Minimum 16GB RAM, 20GB+ recommended for larger models
- VRAM: 8GB+ GPU memory for GPT-2, 16GB+ for larger models
- Storage: Several GB for model weights and cached activations
Note: Running on CPU is not recommended for production use. Model inference and activation caching are memory-intensive operations.
```
cd Head-explain
pip install -e .
```

Requirements:

- Python >= 3.9
- PyTorch >= 2.0.0
- TransformerLens >= 1.0.0
- OpenAI API key (set as the `OPENAI_API_KEY` environment variable)
```
export OPENAI_API_KEY="your-api-key-here"
```

```
head-explain \
  --model gpt2 \
  --sources wikitext \
  --windows 1500 \
  --topk 200 \
  --max-heads 200 \
  --openai-model gpt-4o-mini \
  --outdir outputs/
```

This will:
- Load the GPT-2 model
- Sample 1500 text windows from WikiText
- Select top 200 events per head
- Analyze up to 200 heads
- Generate explanations using GPT-4o-mini
- Save results to `outputs/`
Model Configuration:
- `--model`: Transformer model to analyze (default: `gpt2`). Options: `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`
- `--device`: Device to use (default: `cuda`). Options: `cuda`, `cpu` (not recommended)
Data Sources:
- `--sources`: Comma-separated data sources (default: `wikitext`). Options: `wikitext`, `stackexchange`, or `wikitext,stackexchange`
- `--stackexchange-dir`: Path to processed StackExchange data (required if using `stackexchange`)
- `--windows`: Number of text windows to sample (default: 1500)
- `--window-len`: Length of each window in tokens (default: 192)
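Conceptually, window sampling slices each document's token stream into fixed-length chunks. A minimal sketch under assumed semantics (the package's actual sampler may differ, e.g. by sampling positions randomly rather than by stride):

```python
def sample_windows(tokens, window_len, n_windows, stride=None):
    """Slice a token sequence into fixed-length windows (illustrative only)."""
    stride = stride or window_len  # non-overlapping by default
    windows = []
    for start in range(0, len(tokens) - window_len + 1, stride):
        windows.append(tokens[start:start + window_len])
        if len(windows) == n_windows:
            break
    return windows

toks = list(range(1000))  # stand-in for a tokenized document
ws = sample_windows(toks, window_len=192, n_windows=4)
print(len(ws), len(ws[0]))  # 4 192
```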
Analysis Parameters:
- `--topk`: Top-K salient events per head (default: 200)
- `--max-heads`: Maximum number of heads to analyze (default: 200)
- `--openai-model`: OpenAI model for explanations (default: `gpt-4o-mini`). Options: `gpt-4o-mini`, `gpt-4o`, `gpt-4-turbo`, etc.
- `--embedding-model`: Sentence transformer for clustering (default: `all-MiniLM-L6-v2`)
- `--min-cluster-size`: Minimum HDBSCAN cluster size (default: 5)
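The clustering step embeds each explanation with the sentence transformer and groups similar embeddings with HDBSCAN. As a dependency-free illustration of the similarity measure involved (this is not the actual HDBSCAN call, and the vectors below are toy stand-ins for real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d "embeddings"; real ones come from all-MiniLM-L6-v2 (384-d).
emb = {
    "attends to previous token": [0.9, 0.1, 0.0],
    "looks at the prior token":  [0.8, 0.2, 0.1],
    "fires on punctuation":      [0.0, 0.1, 0.9],
}
a, b, c = emb.values()
print(round(cosine(a, b), 3))  # high: paraphrased explanations cluster together
print(round(cosine(a, c), 3))  # low: a different head behaviour
```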
Output:
- `--outdir`: Output directory for results (default: `outputs/`)
- `--verbose`: Enable verbose logging
Pipeline Control (skip steps to reuse cached results):
- `--skip-cache`: Use existing activation cache
- `--skip-explain`: Use existing explanations
- `--skip-simulate`: Use existing simulation scores
- `--skip-cluster`: Use existing clusters
Basic analysis with WikiText:
```
head-explain --model gpt2 --sources wikitext --outdir outputs/
```

Analyze with both WikiText and StackExchange:
```
head-explain \
  --model gpt2 \
  --sources wikitext,stackexchange \
  --stackexchange-dir /path/to/stackexchange/out_dir \
  --outdir outputs/
```

Larger model with more comprehensive analysis:
```
head-explain \
  --model gpt2-medium \
  --sources wikitext \
  --windows 3000 \
  --topk 300 \
  --max-heads 400 \
  --openai-model gpt-4o \
  --outdir outputs_medium/
```

Re-run clustering with different parameters:
```
head-explain \
  --model gpt2 \
  --sources wikitext \
  --outdir outputs/ \
  --skip-cache \
  --skip-explain \
  --skip-simulate \
  --min-cluster-size 3
```

The pipeline generates the following files in the output directory:
- `raw_events.parquet`: Cached attention head activations
- `explanations.jsonl`: Generated explanations for each head
- `scores.parquet`: Simulation scores for each explanation
- `clusters.json`: Clusters of similar explanations
- `report.md`: Comprehensive Markdown report
- `summary.csv`: CSV summary of all explanations with scores
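The JSONL output can be read line by line with the standard library. A minimal sketch; the field names below are assumptions about the record schema, not guaranteed by the package:

```python
import json

# One record per head; field names ("layer", "head", "explanation",
# "score") are hypothetical placeholders for the actual schema.
line = '{"layer": 3, "head": 7, "explanation": "attends to previous token", "score": 0.62}'
rec = json.loads(line)
print(f"L{rec['layer']}H{rec['head']}: {rec['explanation']} (score={rec['score']:.2f})")
```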
WikiText is automatically downloaded via HuggingFace datasets. No manual setup required.
To use StackExchange data:
- Clone the dataset repository:

```
git clone https://github.com/EleutherAI/stackexchange-dataset
cd stackexchange-dataset
pip install -r requirements.txt
```

- Download and process data:

```
python main.py \
  --names stackoverflow,unix.stackexchange \
  --out_format zip \
  --min_score 3 \
  --max_responses 3
```

- Use the output directory in the pipeline:

```
head-explain \
  --sources stackexchange \
  --stackexchange-dir /path/to/stackexchange/out_dir
```

Based on:
This project is licensed under the MIT License - see the LICENSE file for details.
