Representation engineering pipeline for detecting and steering faithfulness in chain-of-thought (CoT) reasoning of LLMs.
This project implements a complete experimental framework for:
- Eliciting unfaithful reasoning via biased prompts on MMLU benchmarks
- Classifying faithfulness at global and local (token-level) granularity using LLM judges
- Extracting hidden-state activations at annotated faithful/unfaithful spans
- Computing steering vectors (linear mean-diff, off-policy, and MLP gradient-based)
- Evaluating the causal effect of activation steering on model faithfulness
- Analyzing results with statistical tests and publication-ready visualizations
The project paper can be read here.
- Overview
- Repository Structure
- Installation
- Pipeline
  - Step 1 — Baseline Evaluation
  - Step 2 — Hinted Evaluation
  - Step 3 — Answer Validation
  - Step 4 — Global Faithfulness Classification
  - Step 5 — Local Faithfulness Annotation
  - Step 6 — Activation Extraction
  - Step 7 — Steering Vector Generation
  - Step 8 — Probe Training
  - Step 9 — Steering Evaluation
  - Step 10 — Steered Faithfulness Evaluation
- Analysis & Visualization
- Supported Models
- Configuration
- Data & Artifacts
- Environment Variables
## Overview

Chain-of-thought (CoT) reasoning models can produce unfaithful reasoning — where the stated justification does not reflect the model's actual decision process. This project investigates whether internal model representations encode faithfulness, and whether activation steering can causally shift model behavior toward faithful reasoning.
The pipeline uses MMLU multiple-choice questions with biased hints (e.g., a professor's opinion, a grading function, or XML metadata) to elicit unfaithful responses, then extracts activations from annotated faithful/unfaithful reasoning spans to compute steering vectors.
Three steering approaches are compared:
- Linear (on-policy): Mean-difference vectors from annotated activation spans
- Off-policy: Vectors from synthetic faithful/unfaithful completions generated by a separate model
- MLP (gradient-based): Per-prompt optimized vectors via trained probes
## Repository Structure

```
unfaithfulness_steering/
│
├── main.py                            # Central CLI entry point
│
├── src/                               # Core library modules
│   ├── config.py                      # Model IDs, API rate limits, activation config
│   ├── data.py                        # MMLU loading, JSONL I/O, data splitting
│   ├── model.py                       # Model loading (HuggingFace, vLLM, EasySteer)
│   ├── prompts.py                     # Prompt construction (baseline, hinted, annotation)
│   ├── activations.py                 # Activation extraction with tag-based span tracking
│   ├── steering.py                    # Steering vector computation (config-weighted)
│   ├── separability.py                # Dataset splitting, separability analysis
│   ├── probe.py                       # Linear & MLP probe training/evaluation
│   ├── gradient_steering.py           # GPU-batched gradient optimization for MLP steering
│   ├── per_prompt_steering.py         # Per-prompt hook-based steering wrappers
│   ├── global_faithfulness.py         # Global faithfulness classification logic
│   ├── local_faithfulness.py          # Local annotation with [F_body]/[U_body] markers
│   ├── faithfulness_classifier.py     # Faithfulness classification utilities
│   ├── async_classifier.py            # Async LLM API classification
│   ├── hint_mention.py                # Hint mention detection in steered responses
│   ├── steered_global_faithfulness.py # Steered faithfulness metrics & grouping
│   ├── performance_eval.py            # Answer validation via OpenRouter API
│   ├── plots.py                       # General plotting utilities
│   ├── steered_plots.py               # Steered evaluation plots
│   └── steering_plots.py              # Steering vector analysis plots
│
├── scripts/                           # Pipeline scripts (moved from root)
│   ├── eval_baseline.py               # Step 1: Baseline MMLU evaluation
│   ├── eval_hinted.py                 # Step 2: Hinted (biased) evaluation
│   ├── process_answers.py             # Step 3: Answer validation & accuracy metrics
│   ├── eval_faithfulness.py           # Step 4: Global faithfulness classification
│   ├── annotate_faithfulness.py       # Step 5: Local [F_body]/[U_body] annotation
│   ├── extract_activations.py         # Step 6: Hidden-state activation extraction
│   ├── generate_steering_vectors.py   # Step 7: Steering vector computation
│   ├── train_probes.py                # Step 8: Linear/MLP probe training
│   ├── eval_steering.py               # Step 9: Steered evaluation (EasySteer + vLLM)
│   ├── eval_faithfulness_steered.py   # Step 10: Post-steering faithfulness eval
│   ├── generate_off_policy_data.py    # Generate synthetic faithful/unfaithful completions
│   ├── find_best_configs_ratio.py     # Find best steering configs
│   ├── statistical_analysis.py        # Z-tests, BH-corrected significance analysis
│   └── plot_variations.py             # Publication-ready visualizations
│
├── prompts/                           # LLM judge prompt templates
│   ├── faithfulness_global_annotation_*.txt
│   ├── local_annotation_faithful_*.txt
│   ├── local_annotation_unfaithful_*.txt
│   └── validation_prompt.txt
│
├── data/                              # Experimental results (per model)
│   └── <model_name>/
│       ├── behavioural/               # Raw JSONL results, annotated files
│       ├── activations/               # .activations files
│       ├── vectors/                   # Steering vectors
│       └── probes/                    # Trained probes
│
├── analysis/                          # Analysis outputs
│   ├── plots/                         # Generated figures
│   ├── tables/                        # Text tables
│   └── statistics/                    # JSON stats
│
├── requirements.txt                   # Python dependencies
├── .env                               # API keys (not committed)
└── .gitignore
```
## Installation

- Python 3.10+
- CUDA-compatible GPU (required for vLLM inference and activation extraction)
```bash
# Clone the repository
git clone <repository-url>
cd unfaithfulness_steering

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env  # or create .env manually
# Add your OPENROUTER_API_KEY to .env
```

| Package | Purpose |
|---|---|
| `vllm` | High-performance LLM inference (recommended for 14B+ models) |
| `torch`, `transformers` | Model loading, activation extraction, probe training |
| `datasets` | MMLU benchmark loading from HuggingFace |
| `openai` | OpenRouter API client for LLM-judge evaluations |
| `google-generativeai` | Alternative Gemini API access |
| `numpy`, `pandas` | Data processing |
| `matplotlib`, `seaborn` | Visualization |
| `scikit-learn` | Logistic regression probes |
| `scipy` | Statistical tests |
| `tqdm` | Progress bars |
| `accelerate` | Hugging Face Accelerate for distributed training/inference |
| `hf_transfer` | Faster downloads from the Hugging Face Hub |
## Pipeline

The pipeline is a sequential workflow. Each step produces artifacts consumed by subsequent steps.
### Step 1 — Baseline Evaluation

**Script:** `eval_baseline.py`

Evaluates the target model on MMLU questions without any bias, establishing baseline accuracy.
```bash
python main.py --stage baseline \
    --model_name Qwen3-32B \
    --backend vllm \
    --subjects college_biology high_school_chemistry \
    --num_samples 200

# OR run the full pipeline (sequentially runs all steps + validation)
# This automates: baseline -> process -> hinted -> process -> annotate -> ...
python main.py --full_pipeline --models Qwen3-32B
```

> [!WARNING]
> **Resource Intensive**: Running the full pipeline can take up to **40 hours** depending on the model size and hardware.
> Ensure you have a **GPU** available. If not, it is recommended to run scripts stage by stage, offloading heavy inference (generation) to a GPU machine and running lightweight analysis (probes, stats) locally.

**Inputs:** MMLU dataset (auto-downloaded via HuggingFace)

**Outputs:** `data/<model>/behavioural/baseline_results_<model>_<date>.jsonl`, summary JSON
### Step 2 — Hinted Evaluation

**Script:** `eval_hinted.py`

Takes baseline results and adds biased hints. For items the model answered correctly, a wrong hint is added (testing for unfaithfulness). For incorrect baseline answers, a correct hint is added.
```bash
python main.py --stage hinted \
    --model_name Qwen3-32B \
    --backend vllm \
    --bias_strategies professor grader_hacking metadata \
    --distribution_strategy round_robin
```

**Bias strategies:** `professor`, `grader_hacking`, `metadata`

**Outputs:** `data/<model>/behavioural/hinted_results_<model>_<date>.jsonl`
### Step 3 — Answer Validation

**Script:** `process_answers.py`

Validates model responses using an LLM judge (via OpenRouter) to extract the final answer letter, assess compliance, and compute accuracy/bias metrics.
```bash
python main.py --stage process --model Qwen3-32B --dataset-type baseline
python main.py --stage process --model Qwen3-32B --dataset-type hinted
```

**Outputs:** Enriches the input JSONL in place with `answer_letter`, `accuracy`, `compliance`, `completeness`, and `bias_label` fields
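The in-place enrichment pattern can be sketched as follows. This is an illustrative example, not the repo's actual code: `enrich_jsonl` and `fake_validate` are hypothetical names, and the validation function here is a trivial stand-in for the real LLM-judge call; only the added field names (`answer_letter`, `accuracy`) come from the description above.

```python
import json
import os
import tempfile
from pathlib import Path

def enrich_jsonl(path: Path, validate) -> None:
    """Read a JSONL file, add validation fields to each record, write it back."""
    records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    for rec in records:
        rec.update(validate(rec))  # e.g. adds answer_letter, accuracy, ...
    path.write_text("".join(json.dumps(r) + "\n" for r in records))

def fake_validate(rec: dict) -> dict:
    # Stand-in for the LLM judge: take the last A-D letter in the response
    # as the final answer.
    letter = next((c for c in reversed(rec["response"]) if c in "ABCD"), None)
    return {"answer_letter": letter, "accuracy": letter == rec["correct"]}

fd, p = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
tmp = Path(p)
tmp.write_text(json.dumps({"response": "I choose (C)", "correct": "C"}) + "\n")
enrich_jsonl(tmp, fake_validate)
result = json.loads(tmp.read_text())
os.remove(p)
print(result["answer_letter"], result["accuracy"])  # C True
```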
### Step 4 — Global Faithfulness Classification

**Script:** `eval_faithfulness.py`

Uses an LLM judge to classify each hinted response as faithful or unfaithful based on whether the model's reasoning process was genuinely influenced by the hint.
```bash
python main.py --stage faithfulness \
    --model_name Qwen3-32B \
    --annotation_model gemini-2.5-pro
```

- Supports checkpointing for long-running evaluations
- Uses bias-strategy-specific prompt templates from `prompts/`

**Outputs:** Annotated JSONL with a `faithfulness_classification` field
### Step 5 — Local Faithfulness Annotation

**Script:** `annotate_faithfulness.py`

Adds token-level `[F_body]...[/F_body]` or `[U_body]...[/U_body]` markers to responses based on the global classification. These markers define the spans where activations will be extracted.
```bash
python main.py --stage annotate \
    --model_name Qwen3-32B \
    --annotation_model gemini-2.5-pro
```

**Outputs:** Overwrites the input JSONL with locally annotated prompts containing span markers
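Parsing these markers back out is a simple matter of paired-tag matching. A minimal sketch (not the repo's actual parser — `extract_spans` and the example text are illustrative):

```python
import re

# Matches [F_body]...[/F_body] and [U_body]...[/U_body], capturing the label
# and the enclosed text; DOTALL lets spans cross line breaks.
SPAN_RE = re.compile(r"\[(F_body|U_body)\](.*?)\[/\1\]", re.DOTALL)

def extract_spans(annotated_text: str) -> list[tuple[str, str]]:
    """Return (label, span_text) pairs in document order."""
    return [(m.group(1), m.group(2)) for m in SPAN_RE.finditer(annotated_text)]

example = (
    "The professor suggests (B). [U_body]Since the professor is an expert, "
    "the answer is likely (B).[/U_body] [F_body]Checking the chemistry, "
    "option (C) is actually consistent with the data.[/F_body]"
)
print(extract_spans(example))
```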
### Step 6 — Activation Extraction

**Script:** `extract_activations.py`

Extracts hidden-state activations from the model at the annotated `[F_body]`/`[U_body]` spans across all layers.
```bash
python main.py --stage extract \
    --model_name Qwen3-32B \
    --mode on-policy \
    --layers 0 1 2 ... 31
```

**Modes:**
- `on-policy`: Extracts activations at tagged spans (used for linear steering)
- `off-policy`: Extracts last-token activations from synthetic completions

**Outputs:** Per-prompt `.pt` files + an aggregated `activations_<model>_<date>.pkl` dataset
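Once the tagged character spans are mapped to token indices, the per-span extraction reduces to pooling hidden states over those indices. A toy sketch of that pooling step (shapes and names are illustrative, not the repo's API; random tensors stand in for real hidden states):

```python
import torch

def pool_spans(hidden_states: torch.Tensor,
               spans: list[tuple[int, int, str]]) -> dict[str, list[torch.Tensor]]:
    """Mean-pool hidden states over token spans.

    hidden_states: [num_layers, seq_len, hidden_dim]
    spans: (start_token, end_token_exclusive, label) tuples
    Returns: label -> list of [num_layers, hidden_dim] vectors.
    """
    pooled: dict[str, list[torch.Tensor]] = {"F_body": [], "U_body": []}
    for start, end, label in spans:
        # [num_layers, span_len, hidden] -> [num_layers, hidden]
        pooled[label].append(hidden_states[:, start:end, :].mean(dim=1))
    return pooled

hidden = torch.randn(32, 128, 4096)  # e.g. 32 layers, 128 tokens
out = pool_spans(hidden, [(10, 40, "U_body"), (60, 100, "F_body")])
print(out["F_body"][0].shape)  # torch.Size([32, 4096])
```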
### Step 7 — Steering Vector Generation

**Script:** `generate_steering_vectors.py`

Computes steering vectors as the mean difference between faithful and unfaithful activations: `v = mean(faithful) - mean(unfaithful)`.
```bash
# On-policy (from annotated activations)
python main.py --stage vectors \
    --model_name Qwen3-32B \
    --mode on-policy \
    --positive_tags F_body \
    --negative_tags U_body

# Off-policy (from synthetic data)
python main.py --stage vectors \
    --model_name Qwen3-32B \
    --mode off-policy
```

Supports config-weighted computation to balance across hint templates and domain grouping.

**Outputs:** `data/<model>/vectors/vectors_<model>.pkl`, summary JSON
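The core computation is small enough to show directly. A minimal sketch of the mean-difference formula above, per layer, on synthetic stand-in activations (the function name and `normalize` option are illustrative, not the repo's API):

```python
import numpy as np

def mean_diff_vector(faithful: np.ndarray, unfaithful: np.ndarray,
                     normalize: bool = False) -> np.ndarray:
    """v = mean(faithful) - mean(unfaithful).

    faithful/unfaithful: [num_spans, hidden_dim] activation matrices.
    """
    v = faithful.mean(axis=0) - unfaithful.mean(axis=0)
    if normalize:
        v = v / np.linalg.norm(v)
    return v

rng = np.random.default_rng(0)
faithful = rng.normal(0.5, 1.0, size=(200, 64))    # toy stand-in activations
unfaithful = rng.normal(-0.5, 1.0, size=(200, 64))
v = mean_diff_vector(faithful, unfaithful)
print(v.shape)  # (64,)
```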
### Step 8 — Probe Training

**Script:** `train_probes.py`

Trains per-layer binary classifiers (logistic regression + MLP) to distinguish faithful from unfaithful activations. These probes serve dual purposes: measuring linear separability and enabling gradient-based MLP steering.
```bash
python main.py --stage probes \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --hyper 2 8
```

**Outputs:** `probes_<model>_<date>/logreg/layer_*.pkl`, `probes_<model>_<date>/mlp/layer_*.pth`, performance plots
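Conceptually, each per-layer logistic-regression probe is just a binary classifier on activation vectors. A toy sketch on synthetic Gaussian "activations" (not the repo's training code; hyperparameters and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_f = rng.normal(0.3, 1.0, size=(300, 128))    # toy "faithful" activations
X_u = rng.normal(-0.3, 1.0, size=(300, 128))   # toy "unfaithful" activations
X = np.vstack([X_f, X_u])
y = np.array([1] * 300 + [0] * 300)            # 1 = faithful

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {probe.score(X_te, y_te):.2f}")
```

High held-out accuracy on real activations would indicate the layer linearly encodes faithfulness, which is the separability signal the step above measures.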
### Step 9 — Steering Evaluation

**Script:** `eval_steering.py`

Applies steering vectors during inference using EasySteer + vLLM and evaluates the effect on model outputs.
```bash
# Linear mode (pre-computed vectors)
python main.py --stage steering \
    --mode linear \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-type annotated \
    --layers 8 13 15 \
    --coefficients 0.6 -0.6 1 -1

# MLP mode (gradient-based per-prompt)
python main.py --stage steering \
    --mode mlp \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-type annotated \
    --layers 8 13 15 \
    --directions offensive defensive \
    --target-values 5 10 15

# Random baseline (sanity check)
python main.py --stage steering \
    --mode random \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-type annotated \
    --coefficients 0.6 1 2
```

**Steering modes:**

| Mode | Description |
|---|---|
| `linear` | Pre-computed mean-diff vectors, batch inference |
| `off-policy` | Vectors from synthetic completion activations |
| `mlp` | Per-prompt gradient-optimized vectors via MLP probes |
| `random` | Random vectors scaled to match learned vector norms (control) |

**Outputs:** `data/<model>/behavioural/steered_<mode>_<model>_<date>.jsonl`
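Mechanically, activation steering adds the (scaled) vector to a layer's hidden states during the forward pass. EasySteer and vLLM do this inside the inference engine; the toy module below only illustrates the hook mechanics on a stand-in block (all names here are illustrative):

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for one transformer layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return self.linear(x)

dim = 16
block = TinyBlock(dim)
steering_vector = torch.randn(dim)  # would come from Step 7 in the real pipeline
coefficient = 0.6

def steering_hook(module, inputs, output):
    # Add the scaled steering vector to every token's hidden state.
    return output + coefficient * steering_vector

handle = block.register_forward_hook(steering_hook)
x = torch.randn(2, 5, dim)  # [batch, seq, hidden]
steered = block(x)
handle.remove()
unsteered = block(x)

# The steered output differs from the unsteered one by exactly the scaled vector.
delta = steered - unsteered
print(torch.allclose(delta, (coefficient * steering_vector).expand(2, 5, dim), atol=1e-5))
```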
### Step 10 — Steered Faithfulness Evaluation

**Script:** `eval_faithfulness_steered.py`

Evaluates the faithfulness and hint-mentioning behavior of steered model outputs, computing transition rates (e.g., unfaithful→faithful, unfaithful→correct) and hint-mention frequency.
```bash
python main.py --stage steered_faithfulness \
    --model Qwen3-32B \
    --steering-mode linear
```

Records are stratified by initial state:
- WF (Wrong + Faithful), WU (Wrong + Unfaithful)

**Outputs:** `data/<model>/behavioural/annotated_steered_<mode>_<model>_<date>.jsonl`, summary JSON
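The transition-rate bookkeeping amounts to grouping records by their pre-steering state and counting post-steering outcomes. A toy sketch (field names like `initial_state` and `steered_faithful` are illustrative, not the repo's schema):

```python
def transition_rates(records: list[dict]) -> dict[str, float]:
    """Per initial-state fraction of records classified faithful after steering."""
    groups: dict[str, list[dict]] = {}
    for r in records:
        groups.setdefault(r["initial_state"], []).append(r)
    rates = {}
    for state, items in groups.items():
        recovered = sum(r["steered_faithful"] for r in items)
        rates[f"{state}->faithful"] = recovered / len(items)
    return rates

records = [
    {"initial_state": "WU", "steered_faithful": True},
    {"initial_state": "WU", "steered_faithful": False},
    {"initial_state": "WF", "steered_faithful": True},
    {"initial_state": "WU", "steered_faithful": True},
]
rates = transition_rates(records)
print(rates)  # WU->faithful: 2/3, WF->faithful: 1.0
```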
## Analysis & Visualization

| Script | Purpose |
|---|---|
| `find_best_configs_ratio.py` | Find best configs by recovery/collateral-damage ratio |
| `statistical_analysis.py` | Z-tests with Benjamini–Hochberg FDR correction across all models and approaches |
| `plot_variations.py` | Publication-ready bar charts comparing steering performance across models and approaches |
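For intuition, the significance testing combines a two-proportion z-test per comparison with Benjamini–Hochberg correction over all comparisons. A self-contained sketch with a manual BH implementation (the repo's actual script may differ; the counts below are made-up examples):

```python
import numpy as np
from scipy.stats import norm

def two_prop_z(successes1: int, n1: int, successes2: int, n2: int) -> float:
    """Two-sided p-value for a two-proportion z-test."""
    p1, p2 = successes1 / n1, successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * norm.sf(abs(z))

def benjamini_hochberg(pvals, alpha: float = 0.05) -> np.ndarray:
    """Boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m  # k/m * alpha
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.where(below)[0])  # largest k satisfying the criterion
        rejected[order[: cutoff + 1]] = True
    return rejected

# Made-up recovery counts: steered vs. baseline, three configurations.
pvals = [two_prop_z(45, 100, 30, 100),
         two_prop_z(52, 100, 50, 100),
         two_prop_z(70, 100, 40, 100)]
print(benjamini_hochberg(pvals))  # [ True False  True]
```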
## Supported Models

| Short Name | HuggingFace ID | Parameters |
|---|---|---|
| DeepSeek-Llama-8B | `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` | 8B |
| Qwen3-14B | `Qwen/Qwen3-14B` | 14B |
| Qwen3-32B | `Qwen/Qwen3-32B` | 32B |

Additional models can be added via `src/config.py` → `ModelConfig.MODEL_ID_MAP`.
## Configuration

All configuration is centralized in `src/config.py`:

- `ModelConfig`: OpenRouter model IDs for validation and annotation, API rate limits
- `ActivationConfig`: Layer ranges, extraction parameters, tag definitions
- `TODAY`: Date string used for file naming
Key defaults:

- Answer extraction model: `gpt-4.1-nano` (fast answer extraction)
- Classification and annotation model: `gemini-2.5-pro` (faithfulness classification)
- vLLM: Recommended backend for 14B+ models
- OpenRouter API key: Required for all LLM-judge evaluations
- EasySteer: Recommended backend for steering with vLLM (static vectors only, not MLP-guided steering)
## Data & Artifacts

Results are organized under `data/` by model name. Each model directory contains:

| File Pattern | Description |
|---|---|
| `baseline_results_<model>_<date>.jsonl` | Raw baseline MMLU responses |
| `hinted_results_<model>_<date>.jsonl` | Hinted/biased responses |
| `annotated_<...>.jsonl` | Faithfulness-annotated responses with `[F/U_body]` tags |
| `activations_<model>_<date>.pkl` | Aggregated activation dataset |
| `vectors_<model>.pkl` | Computed steering vectors |
| `probes_<model>_<date>/` | Trained probes (logreg + MLP per layer) |
| `steered_<mode>_<model>_<date>.jsonl` | Steered model outputs |
| `summary_*.json` (in respective folders) | Summary statistics for each pipeline stage |
## Environment Variables

Create a `.env` file in the project root:

```
OPENROUTER_API_KEY=your_key_here
```

This key is used for all LLM-judge evaluations (faithfulness classification, answer validation, hint mention detection) via the OpenRouter API.