This repository contains the code and data for the paper "Training Language Models To Explain Their Own Computations".
- Quick Start
- Tasks Overview
- Feature Descriptions
- Activation Patching
- Input Ablations
- Citation
- Python 3.8+
- OpenAI API key (for LM judge evaluation of feature descriptions)
- Arguments are configured for 2x 80GB H100 GPUs; you can run with less memory by reducing batch sizes.
```bash
git clone https://github.com/TransluceAI/introspective-interp.git
cd introspective-interp
# All packages are in pyproject.toml and will be auto-downloaded on first `uv run`
```

Add your API keys to the .env file:

```bash
cp .env.example .env  # Create from template
# Edit .env to add your OpenAI API key
```

We train language models to produce three types of explanations of their own computations. In our experiments, we use an explainer model (the model we train) to explain a target model (the model being analyzed).
| Task | Description | Target Model | Training Dataset |
|---|---|---|---|
| Feature Descriptions | Generate natural language descriptions of model features | Llama-3.1-8B | LlamaScope SAE features + Neuronpedia |
| Activation Patching | Predict effects of activation patching interventions | Llama-3.1-8B, Qwen3-8B | CounterFact |
| Input Ablations | Predict effects of removing hint tokens | Llama-3.1-8B-Instruct, Qwen3-8B | MMLU + hint |
Explainer models generate natural language descriptions of features from Llama-3.1-8B. We train on SAE features and their descriptions, then evaluate on held-out SAE features, full activations, and activation differences.
Links to datasets and pre-trained checkpoints from the paper are available below.
Datasets:
- SAE features + Neuronpedia explanations (training + in-distribution eval)
- Full activations on FineWeb (OOD eval)
- Activation differences on FineWeb (OOD eval)
You must first download the data locally using:
```bash
cd /PATH/TO/DATA/DIR/

# Download and extract SAE features
wget https://transluce-public.s3.us-east-1.amazonaws.com/introspective-interp/SAE_feature_explanations_llama3.1_8b.tar.gz
tar -xzvf SAE_feature_explanations_llama3.1_8b.tar.gz

# Download OOD evaluation data: full activations
wget https://transluce-public.s3.us-east-1.amazonaws.com/introspective-interp/fineweb_llama_3.1_8b_95seqlen_fineweb_acts_grads_-1.0.tar.gz
tar -xzvf fineweb_llama_3.1_8b_95seqlen_fineweb_acts_grads_-1.0.tar.gz

# Download OOD evaluation data: activation differences
wget https://transluce-public.s3.us-east-1.amazonaws.com/introspective-interp/fineweb_llama_3.1_8b_95seqlen_counterfact_subsampled_2000_activation_difference.tar.gz
tar -xzvf fineweb_llama_3.1_8b_95seqlen_counterfact_subsampled_2000_activation_difference.tar.gz
```

Pre-trained Models (available on HuggingFace):
- Transluce/features_explain_llama3.1_8b_llama3.1_8b - Llama-3.1-8B explains Llama-3.1-8B
- Transluce/features_explain_llama3.1_8b_llama3.1_8b_instruct - Llama-3.1-8B-Instruct explains Llama-3.1-8B
- Transluce/features_explain_llama3.1_8b_llama3_8b - Llama-3-8B explains Llama-3.1-8B
- Transluce/features_explain_llama3.1_8b_simulator - Simulator model: used to score candidate natural-language explanations of features of Llama-3.1-8B. Predicts where the described feature should activate in the sequence, which can then be compared to a target feature's true activations, enabling scoring of the explanations by computing correlation (the "simulator score").
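For intuition, here is a minimal sketch of how a simulator score can be computed once the simulator's predicted activations and the target feature's true activations are in hand. The function and variable names are illustrative, not the repo's actual API:

```python
import numpy as np

def simulator_score(predicted: np.ndarray, true: np.ndarray) -> float:
    """Pearson correlation between simulated and true per-token activations.

    `predicted` holds the simulator's guess of how strongly the described
    feature fires at each token; `true` holds the target feature's actual
    activations on the same sequence. (Illustrative only; the repo's scoring
    code may normalize or aggregate differently.)
    """
    predicted = predicted.astype(np.float64)
    true = true.astype(np.float64)
    if predicted.std() == 0 or true.std() == 0:
        return 0.0  # correlation is undefined for constant vectors
    return float(np.corrcoef(predicted, true)[0, 1])

# Example: the simulator predicts the feature fires on the last two tokens
print(simulator_score(np.array([0.0, 0.0, 0.8, 1.0]), np.array([0.1, 0.0, 0.7, 0.9])))
```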
Config File Setup: We specify all training parameters (models, data paths, hyperparameters) in YAML config files.
Configs for this task can all be found under config/feature_descriptions/*, where * follows the pattern {explainer_model}_131k_{eval_method}.yaml.
The explainer model is one of:
- `base` = Llama-3.1-8B
- `instruct` = Llama-3.1-8B-Instruct
- `qwen` = Qwen3-8B
- `llama3` = Llama-3-8B
The evaluation method is one of:
- (no suffix) = LM judge similarity against ground-truth SAE feature descriptions (default)
- `simcor` = simulator correlation scores on SAE features
- `ood_fw` = simulator correlation scores on full LM activations from FineWeb
- `ood_diff` = simulator correlation scores on LM activation differences from CounterFact
Edit these paths in your chosen config file before training:
```yaml
train:
  explanation_dir: "/PATH/TO/DATA/DIR/SAE_feature_explanations_llama3.1_8b/"
text:
  explanation_dir: "/PATH/TO/DATA/DIR/SAE_feature_explanations_llama3.1_8b/"  # change for OOD evals
output_dir: "/PATH/TO/SAVE/CHECKPOINTS/"
cache_dir: "/PATH/TO/HF/CACHE/"  # Optional
```

Run Training:
```bash
uv run --env-file .env train.py --config config/feature_descriptions/base_131k.yaml
```

Run Evaluation:

```bash
uv run --env-file .env evaluate.py \
    --config config/feature_descriptions/base_131k.yaml \
    --target_model_path meta-llama/Llama-3.1-8B \
    --task features_explain \
    --model_path /PATH/TO/EXPLAINER/CHECKPOINT/ \
    --output_dir /PATH/TO/RESULTS/ \
    --batch_size 64
# --model_path can be a local path or a HF model from above, e.g. Transluce/features_explain_llama3.1_8b_llama3.1_8b
```

Baselines: To run baseline explainer methods, specify:
- `--model_path nearest_neighbor` for top-1 nearest neighbor (finds the most similar training explanations; see the sketch below). Optionally add `--layerwise_similarities` to do this layerwise.
- `--model_path self_explanations` for untrained self-explanations, i.e. SelfIE (the target model explains itself without training).
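For intuition, a minimal sketch of what a top-1 nearest-neighbor baseline does, assuming features are matched by cosine similarity over feature vectors; the repo's actual implementation may differ (e.g. when `--layerwise_similarities` is set):

```python
from typing import List
import numpy as np

def nearest_neighbor_explanation(query: np.ndarray,
                                 train_features: np.ndarray,
                                 train_explanations: List[str]) -> str:
    """Return the explanation of the training feature most similar to `query`.

    Illustrative only: matches by cosine similarity over raw feature vectors.
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    t = train_features / (np.linalg.norm(train_features, axis=1, keepdims=True) + 1e-8)
    best = int(np.argmax(t @ q))      # index of the most similar training feature
    return train_explanations[best]   # reuse its ground-truth explanation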
Explainer models predict how activation patching interventions affect target model outputs on CounterFact data.
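If you are unfamiliar with activation patching: the intervention copies a hidden activation from one run of the target model into another run and measures how the output changes. Below is a minimal, self-contained sketch using a PyTorch forward hook; the layer index, token position, and prompts are illustrative and do not reproduce the repo's patching pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # the target model used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
layer, pos = 15, -1  # illustrative layer index and token position

def residual_at(prompt: str) -> torch.Tensor:
    # Cache the residual-stream activation after decoder layer `layer` at `pos`.
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[layer + 1][0, pos]  # hidden_states[0] is the embedding output

source_act = residual_at("The Eiffel Tower is located in the city of")

def patch_hook(module, inputs, output):
    # Overwrite the layer output at `pos` with the cached source activation.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[0, pos] = source_act.to(hidden.dtype)
    return output

handle = model.model.layers[layer].register_forward_hook(patch_hook)
ids = tok("The Colosseum is located in the city of", return_tensors="pt").to(model.device)
with torch.no_grad():
    patched_logits = model(**ids).logits[0, -1]
handle.remove()
print(tok.decode(patched_logits.argmax()))  # how the patched activation shifts the prediction
```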
Datasets (hosted on HuggingFace):
- Transluce/act_patch_llama_3.1_8b_counterfact - activation patching results of Llama-3.1-8B target model
- Transluce/act_patch_qwen3_8b_counterfact - activation patching results of Qwen3-8B target model
Pre-trained Models (available on HuggingFace):
- Transluce/act_patch_qwen3_8b_qwen3_8b - Qwen3-8B explains Qwen3-8B
- Transluce/act_patch_llama3.1_8b_llama3.1_8b - Llama-3.1-8B explains Llama-3.1-8B
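Assuming these checkpoints are plain fine-tuned causal LMs with no custom modeling code, they should load with the standard transformers API, or can be passed to evaluate.py via `--model_path` as shown below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the checkpoint is a standard fine-tuned causal LM on the HF Hub.
ckpt = "Transluce/act_patch_llama3.1_8b_llama3.1_8b"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
explainer = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")
```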
Config File Setup: We specify all training parameters (models, data paths, hyperparameters) in YAML config files.
Configs for this task can all be found under config/act_patch/*, where * is:
- `base_base_act_patch_cf.yaml` - Llama-3.1-8B explains Llama-3.1-8B
- `base_qwen_act_patch_cf.yaml` - Llama-3.1-8B explains Qwen3-8B
- `qwen_qwen_act_patch_cf.yaml` - Qwen3-8B explains Qwen3-8B
- `qwen_base_act_patch_cf.yaml` - Qwen3-8B explains Llama-3.1-8B
Edit these paths in your chosen config file before training:
```yaml
output_dir: "/PATH/TO/SAVE/CHECKPOINTS/"
cache_dir: "/PATH/TO/HF/CACHE/"  # Optional
```
Run Training:
```bash
uv run --env-file .env train.py --config config/act_patch/base_act_patch_cf.yaml
```

Run Evaluation:

```bash
uv run --env-file .env evaluate.py \
    --config config/act_patch/base_act_patch_cf.yaml \
    --target_model_path /PATH/TO/TARGET/MODEL/ \
    --task act_patch \
    --model_path /PATH/TO/EXPLAINER/CHECKPOINT/ \
    --output_dir /PATH/TO/RESULTS/ \
    --batch_size 32
# --target_model_path can be a local path or a HF model ID, e.g. meta-llama/Llama-3.1-8B
# --model_path can be a local path or a HF model from above, e.g. act_patch_llama3.1_8b_llama3.1_8b
```

Explainer models predict how removing input hints affects the target model's predictions on MMLU questions.
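Concretely, the target model answers each MMLU question once with a hint in the prompt and once with the hint removed, and the explainer is trained to predict how the target's answer changes. A minimal sketch of such a prompt pair (the hint wording and prompt format here are illustrative, not the dataset's exact format):

```python
# Illustrative prompt pair; the dataset's exact hint wording and format may differ.
question = (
    "Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
)
hint = "Hint: a teacher believes the answer is B.\n"

prompt_with_hint = hint + question + "Answer:"
prompt_ablated = question + "Answer:"  # same question with the hint tokens removed

# The target model answers both prompts; the explainer is trained to predict
# whether (and how) removing the hint changes the target model's answer.
```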
Datasets (hosted on HuggingFace):
- Transluce/input_ablation_llama_3.1_8b_instruct_mmlu_hint - hint ablation results for Llama-3.1-8B-Instruct target model
- Transluce/input_ablation_qwen3_8b_mmlu_hint - hint ablation results for Qwen3-8B target model
Pre-trained Models (available on HuggingFace):
- Transluce/input_ablation_llama3.1_8b_instruct_llama3.1_8b_instruct - Llama-3.1-8B-Instruct explains Llama-3.1-8B-Instruct
- Transluce/input_ablation_qwen3_8b_qwen3_8b_hint - Qwen3-8B explains Qwen3-8B
Loading Datasets in Code:
```python
from datasets import load_dataset

# Load input ablation dataset
dataset = load_dataset("Transluce/input_ablation_llama_3.1_8b_instruct_mmlu_hint", split="train")
```

Config File Setup: We specify all training parameters (models, data paths, hyperparameters) in YAML config files.
Configs for this task can all be found under config/input_ablation/*, where * is:
- `instruct_instruct_hint.yaml` - Llama-3.1-8B-Instruct explains Llama-3.1-8B-Instruct
- `qwen_qwen_hint.yaml` - Qwen3-8B explains Qwen3-8B
- `instruct_qwen_hint.yaml` - Llama-3.1-8B-Instruct explains Qwen3-8B (cross-model)
- `qwen_instruct_hint.yaml` - Qwen3-8B explains Llama-3.1-8B-Instruct (cross-model)
Edit these paths in your chosen config file before training:
```yaml
output_dir: "/PATH/TO/SAVE/CHECKPOINTS/"
cache_dir: "/PATH/TO/HF/CACHE/"  # Optional
```
Run Training:
```bash
uv run --env-file .env train.py --config config/hint/instruct_instruct_hint.yaml
```

Run Evaluation:

```bash
# Llama-3.1-8B-Instruct evaluation
uv run --env-file .env evaluate.py \
    --config config/hint/instruct_instruct_hint.yaml \
    --target_model_path meta-llama/Llama-3.1-8B-Instruct \
    --task hint_attribution \
    --model_path /PATH/TO/EXPLAINER/CHECKPOINT/ \
    --output_dir /PATH/TO/RESULTS/ \
    --batch_size 8
# --model_path can be a local path or a HF model from above, e.g. Transluce/input_ablation_llama3.1_8b_instruct_llama3.1_8b_instruct

# Qwen3-8B evaluation
uv run --env-file .env evaluate.py \
    --config config/hint/qwen_qwen_hint.yaml \
    --target_model_path Qwen/Qwen3-8B \
    --task hint_attribution \
    --model_path /PATH/TO/EXPLAINER/CHECKPOINT/ \
    --output_dir /PATH/TO/RESULTS/ \
    --batch_size 8
# --model_path can be a local path or a HF model from above, e.g. Transluce/
```

This task trains models to predict how removing specific input hints affects the target model's reasoning and output generation. The explainer learns causal relationships between input components and the target model's behavior.
```bibtex
@misc{li2025traininglanguagemodelsexplain,
      title={Training Language Models to Explain Their Own Computations},
      author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
      year={2025},
      eprint={2511.08579},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.08579},
}
```