Training Language Models To Explain Their Own Computations

Paper HuggingFace

This repository contains the code and data for the paper "Training Language Models To Explain Their Own Computations".

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • OpenAI API key (for LM judge evaluation of feature descriptions)
  • Default arguments assume 2× 80GB H100 GPUs; you can run with less GPU memory by reducing batch sizes.

Installation

git clone https://github.com/TransluceAI/introspective-interp.git
cd introspective-interp
# All packages are in pyproject.toml and will be auto-downloaded on first `uv run`

Environment Setup

Add your API keys to a .env file:

cp .env.example .env  # Create from template
# Edit .env to add your OpenAI API key
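To confirm the key is picked up in Python (outside of uv run --env-file), here is a minimal sketch using python-dotenv; the package and the OPENAI_API_KEY variable name are assumptions, and the repo's own scripts may load the environment differently:

# check_env.py -- minimal sketch; assumes python-dotenv is installed and the
# key is stored under OPENAI_API_KEY (the repo's scripts may read it differently).
import os
from dotenv import load_dotenv

load_dotenv()  # copies key=value pairs from .env into os.environ
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))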

Tasks Overview

We train language models to produce three types of explanations of their own computations. In our experiments, we use an explainer model (the model we train) to explain a target model (the model being analyzed).

| Task | Description | Target Model | Training Dataset |
| --- | --- | --- | --- |
| Feature Descriptions | Generate natural language descriptions of model features | Llama-3.1-8B | LlamaScope SAE features + Neuronpedia |
| Activation Patching | Predict effects of activation patching interventions | Llama-3.1-8B, Qwen3-8B | CounterFact |
| Input Ablations | Predict effects of removing hint tokens | Llama-3.1-8B-Instruct, Qwen3-8B | MMLU + hint |

Feature Descriptions

Explainer models generate natural language descriptions of features from Llama-3.1-8B. We train on SAE features and their descriptions, then evaluate on held-out SAE features, full activations, and activation differences.
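For intuition, each training example pairs information about a feature's activations with a ground-truth description that the explainer learns to produce. A purely illustrative sketch of such a pair follows; the field names and structure are assumptions, not the repo's actual data schema:

# Illustrative only: field names and structure are assumptions, not the actual data schema.
example = {
    "layer": 16,                                           # layer the SAE feature comes from (hypothetical)
    "top_activating_tokens": ["Paris", "Lyon", "France"],  # tokens where the feature fires strongly
    "description": "mentions of French cities and places in France",  # target output for the explainer
}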

📥 Data & Checkpoints

Links to datasets and pre-trained checkpoints from the paper are available below.

Datasets:

You must first download the data locally using:

cd /PATH/TO/DATA/DIR/

# Download and extract SAE features
wget https://transluce-public.s3.us-east-1.amazonaws.com/introspective-interp/SAE_feature_explanations_llama3.1_8b.tar.gz
tar -xzvf SAE_feature_explanations_llama3.1_8b.tar.gz

# Download OOD evaluation data: full activations
wget https://transluce-public.s3.us-east-1.amazonaws.com/introspective-interp/fineweb_llama_3.1_8b_95seqlen_fineweb_acts_grads_-1.0.tar.gz
tar -xzvf fineweb_llama_3.1_8b_95seqlen_fineweb_acts_grads_-1.0.tar.gz

# Download OOD evaluation data: activation differences
wget https://transluce-public.s3.us-east-1.amazonaws.com/introspective-interp/fineweb_llama_3.1_8b_95seqlen_counterfact_subsampled_2000_activation_difference.tar.gz
tar -xzvf fineweb_llama_3.1_8b_95seqlen_counterfact_subsampled_2000_activation_difference.tar.gz

Pre-trained Models (available on HuggingFace):

πŸ‹οΈ Training

Config File Setup: We specify all training parameters (models, data paths, hyperparameters) in YAML config files.

Configs for this task can all be found under config/feature_descriptions/*, where * follows the pattern {explainer_model}_131k_{eval_method}.yaml.

{explainer_model} is one of:

  • base = Llama-3.1-8B
  • instruct = Llama-3.1-8B-Instruct
  • qwen = Qwen3-8B
  • llama3 = Llama-3-8B

{eval_method} is one of:

  • (no suffix) = LM judge similarity against ground-truth SAE feature descriptions (default)
  • simcor = Simulator correlation scores on SAE features (scoring sketched after this list)
  • ood_fw = Simulator correlation scores on full LM activations from FineWeb
  • ood_diff = Simulator correlation scores on LM activation differences from CounterFact
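The simulator-correlation evaluations score a description by having a simulator predict per-token activations from the description alone and correlating those predictions with the true activations. Below is a minimal sketch of the scoring step only; how the predictions are produced (the simulator model and its prompting) is not shown and is an assumption about the pipeline:

# Minimal sketch: Pearson correlation between simulator-predicted and true activations.
import numpy as np

def simulator_correlation(predicted, actual):
    """Pearson correlation between predicted and true per-token activations."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if predicted.std() == 0 or actual.std() == 0:
        return 0.0  # degenerate case: constant predictions or activations
    return float(np.corrcoef(predicted, actual)[0, 1])

# simulator_correlation([0, 3, 0, 7], [0.1, 2.5, 0.0, 6.8]) -> ~0.99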

Edit these paths in your chosen config file before training:

train:
  explanation_dir: "/PATH/TO/DATA/DIR/SAE_feature_explanations_llama3.1_8b/"

text:
  explanaton_dir: "/PATH/TO/DATA/DIR/SAE_feature_explanations_llama3.1_8b/"  # change for OOD evals

output_dir: "/PATH/TO/SAVE/CHECKPOINTS/"
cache_dir: "/PATH/TO/HF/CACHE/"  # Optional
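If you prefer to set these paths programmatically (e.g. for sweeps), a small PyYAML sketch; the nested key layout is taken from the excerpt above, and anything beyond it is an assumption:

# set_paths.py -- sketch; assumes the YAML layout shown in the excerpt above.
import yaml

cfg_path = "config/feature_descriptions/base_131k.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["train"]["explanation_dir"] = "/PATH/TO/DATA/DIR/SAE_feature_explanations_llama3.1_8b/"
cfg["output_dir"] = "/PATH/TO/SAVE/CHECKPOINTS/"

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)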

Run Training:

uv run --env-file .env train.py --config config/feature_descriptions/base_131k.yaml

📊 Evaluation

uv run --env-file .env evaluate.py \
  --config config/feature_descriptions/base_131k.yaml \
  --target_model_path meta-llama/Llama-3.1-8B \
  --task features_explain \
  --model_path /PATH/TO/EXPLAINER/CHECKPOINT/ \   # can be a local path or a HF model from above, e.g. Transluce/features_explain_llama3.1_8b_llama3.1_8b
  --output_dir /PATH/TO/RESULTS/ \
  --batch_size 64

Baselines: To run baseline explainer methods, specify:

  • --model_path nearest_neighbor for top-1 nearest neighbor (retrieves the most similar training explanations; a rough sketch follows this list). Optionally add --layerwise_similarities to compute the similarity layerwise.
  • --model_path self_explanations for untrained self-explanations, i.e. SelfIE (target model explains itself without training)
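For intuition, the nearest-neighbor baseline returns the explanation of the training feature most similar to the queried feature. A rough sketch using cosine similarity over feature vectors; the repo's actual similarity measure and feature representation may differ:

# Rough sketch of a top-1 nearest-neighbor baseline (cosine similarity).
import numpy as np

def nearest_neighbor_explanation(query_vec, train_vecs, train_explanations):
    """Return the explanation attached to the most similar training feature."""
    q = query_vec / np.linalg.norm(query_vec)
    T = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    return train_explanations[int(np.argmax(T @ q))]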

Activation Patching

Explainer models predict how activation patching interventions affect target model outputs on CounterFact data.
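As background on what the explainer is asked to predict: activation patching replaces a hidden activation computed on one prompt with the corresponding activation from another prompt and measures how the target model's output changes. A minimal illustrative sketch with a transformers forward hook; the layer, token position, and prompts are arbitrary choices for illustration, not the repo's actual setup:

# Illustrative activation-patching sketch; layer/position/prompt choices are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

layer, pos = 16, -1  # which residual-stream activation to patch (arbitrary for illustration)

def get_hidden(prompt):
    ids = tok(prompt, return_tensors="pt").to(model.device)
    return model(**ids, output_hidden_states=True).hidden_states[layer][0, pos].detach()

source_act = get_hidden("The Colosseum is located in the city of")

def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[0, pos] = source_act.to(hidden.device)  # overwrite the activation at the chosen position
    return output

handle = model.model.layers[layer - 1].register_forward_hook(patch_hook)
try:
    ids = tok("The Eiffel Tower is located in the city of", return_tensors="pt").to(model.device)
    patched_next_token_logits = model(**ids).logits[0, -1]
finally:
    handle.remove()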

📥 Data & Checkpoints

Datasets (hosted on HuggingFace):

Pre-trained Models (available on HuggingFace):

πŸ‹οΈ Training

Config File Setup: We specify all training parameters (models, data paths, hyperparameters) in YAML config files.

Configs for this task can all be found under config/act_patch/*, where * is:

  • base_base_act_patch_cf.yaml - Llama-3.1-8B explains Llama-3.1-8B
  • base_qwen_act_patch_cf.yaml - Llama-3.1-8B explains Qwen3-8B
  • qwen_qwen_act_patch_cf.yaml - Qwen3-8B explains Qwen3-8B
  • qwen_base_act_patch_cf.yaml - Qwen3-8B explains Llama-3.1-8B

Edit these paths in your chosen config file before training:

output_dir: "/PATH/TO/SAVE/CHECKPOINTS/"
cache_dir: "/PATH/TO/HF/CACHE/"  # Optional

Run Training:

uv run --env-file .env train.py --config config/act_patch/base_base_act_patch_cf.yaml

📊 Evaluation

uv run --env-file .env evaluate.py \
  --config config/act_patch/base_base_act_patch_cf.yaml \
  --target_model_path /PATH/TO/TARGET/MODEL/ \ # can be a local path or a HF model ID, e.g. meta-llama/Llama-3.1-8B
  --task act_patch \
  --model_path /PATH/TO/EXPLAINER/CHECKPOINT/ \   # can be a local path or a HF model from above, e.g. act_patch_llama3.1_8b_llama3.1_8b
  --output_dir /PATH/TO/RESULTS/ \
  --batch_size 32

Input Ablations

Explainer models predict how removing input hints affects the target model's predictions on MMLU questions.

📥 Data & Checkpoints

Datasets (hosted on HuggingFace):

Pre-trained Models (available on HuggingFace):

Loading Datasets in Code:

from datasets import load_dataset

# Load input ablation dataset
dataset = load_dataset("Transluce/input_ablation_llama_3.1_8b_instruct_mmlu_hint", split="train")
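You can then inspect a single example to see its fields (the exact field names depend on the dataset, so just print what is there):

# Inspect one example; column names vary by dataset.
print(dataset.column_names)
print(dataset[0])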

πŸ‹οΈ Training

Config File Setup: We specify all training parameters (models, data paths, hyperparameters) in YAML config files.

Configs for this task can all be found under config/hint/*, where * is:

  • instruct_instruct_hint.yaml - Llama-3.1-8B-Instruct explains Llama-3.1-8B-Instruct
  • qwen_qwen_hint.yaml - Qwen3-8B explains Qwen3-8B
  • instruct_qwen_hint.yaml - Llama-3.1-8B-Instruct explains Qwen3-8B (cross-model)
  • qwen_instruct_hint.yaml - Qwen3-8B explains Llama-3.1-8B-Instruct (cross-model)

Edit these paths in your chosen config file before training:

output_dir: "/PATH/TO/SAVE/CHECKPOINTS/"
cache_dir: "/PATH/TO/HF/CACHE/"  # Optional

Run Training:

uv run --env-file .env train.py --config config/hint/instruct_instruct_hint.yaml

📊 Evaluation

# Llama-3.1-8B-Instruct evaluation
uv run --env-file .env evaluate.py \
  --config config/hint/instruct_instruct_hint.yaml \
  --target_model_path meta-llama/Llama-3.1-8B-Instruct \
  --task hint_attribution \
  --model_path /PATH/TO/EXPLAINER/CHECKPOINT/ \   # can be a local path or a HF model from above, e.g. Transluce/input_ablation_llama3.1_8b_instruct_llama3.1_8b_instruct
  --output_dir /PATH/TO/RESULTS/ \
  --batch_size 8

# Qwen3-8B evaluation  
uv run --env-file .env evaluate.py \
  --config config/hint/qwen_qwen_hint.yaml \
  --target_model_path Qwen/Qwen3-8B \
  --task hint_attribution \
  --model_path /PATH/TO/EXPLAINER/CHECKPOINT/ \   # can be a local path or a HF model from above, e.g. Transluce/
  --output_dir /PATH/TO/RESULTS/ \
  --batch_size 8

This task trains explainer models to predict how removing specific input hints affects the target model's reasoning and output generation, i.e., the causal relationship between particular input components and model behavior.
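Concretely, each case can be thought of as a hinted prompt, its hint-removed counterpart, and the resulting change in the target model's answer, which the explainer must predict. An illustrative sketch follows; the prompt wording, hint format, and effect measure are assumptions, not the dataset's actual schema:

# Illustrative only: hint format and layout are assumptions, not the dataset schema.
hinted_prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "Hint: a professor says the answer is (C).\n"
    "Choices: (A) Venus  (B) Jupiter  (C) Mars  (D) Mercury\n"
    "Answer:"
)
ablated_prompt = hinted_prompt.replace("Hint: a professor says the answer is (C).\n", "")

# The explainer is trained to predict how the target model's behavior changes when the
# hint tokens are removed, e.g. whether its chosen answer flips.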


📄 Citation

@misc{li2025traininglanguagemodelsexplain,
      title={Training Language Models to Explain Their Own Computations}, 
      author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
      year={2025},
      eprint={2511.08579},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.08579}, 
}
