Joe-Occhipinti/unfaithfulness_steering


Probing and Steering Chain-of-Thought Unfaithfulness in Language Models

A representation-engineering pipeline for detecting and steering faithfulness in the chain-of-thought (CoT) reasoning of LLMs.

This project implements a complete experimental framework for:

  1. Eliciting unfaithful reasoning via biased prompts on MMLU benchmarks
  2. Classifying faithfulness at global and local (token-level) granularity using LLM judges
  3. Extracting hidden-state activations at annotated faithful/unfaithful spans
  4. Computing steering vectors (linear mean-diff, off-policy, and MLP gradient-based)
  5. Evaluating the causal effect of activation steering on model faithfulness
  6. Analyzing results with statistical tests and publication-ready visualizations

The project paper can be read here.



Overview

Chain-of-thought (CoT) reasoning models can produce unfaithful reasoning — where the stated justification does not reflect the model's actual decision process. This project investigates whether internal model representations encode faithfulness, and whether activation steering can causally shift model behavior toward faithful reasoning.

The pipeline uses MMLU multiple-choice questions with biased hints (e.g., a professor's opinion, a grading function, or XML metadata) to elicit unfaithful responses, then extracts activations from annotated faithful/unfaithful reasoning spans to compute steering vectors.

Three steering approaches are compared:

  • Linear (on-policy): Mean-difference vectors from annotated activation spans
  • Off-policy: Vectors from synthetic faithful/unfaithful completions generated by a separate model
  • MLP (gradient-based): Per-prompt optimized vectors via trained probes

Repository Structure

unfaithfulness_steering/
│
├── main.py                       # Central CLI entry point
│
├── src/                          # Core library modules
│   ├── config.py                 # Model IDs, API rate limits, activation config
│   ├── data.py                   # MMLU loading, JSONL I/O, data splitting
│   ├── model.py                  # Model loading (HuggingFace, vLLM, EasySteer)
│   ├── prompts.py                # Prompt construction (baseline, hinted, annotation)
│   ├── activations.py            # Activation extraction with tag-based span tracking
│   ├── steering.py               # Steering vector computation (config-weighted)
│   ├── separability.py           # Dataset splitting, separability analysis
│   ├── probe.py                  # Linear & MLP probe training/evaluation
│   ├── gradient_steering.py      # GPU-batched gradient optimization for MLP steering
│   ├── per_prompt_steering.py    # Per-prompt hook-based steering wrappers
│   ├── global_faithfulness.py    # Global faithfulness classification logic
│   ├── local_faithfulness.py     # Local annotation with [F_body]/[U_body] markers
│   ├── faithfulness_classifier.py  # Faithfulness classification utilities
│   ├── async_classifier.py       # Async LLM API classification
│   ├── hint_mention.py           # Hint mention detection in steered responses
│   ├── steered_global_faithfulness.py # Steered faithfulness metrics & grouping
│   ├── performance_eval.py       # Answer validation via OpenRouter API
│   ├── plots.py                  # General plotting utilities
│   ├── steered_plots.py          # Steered evaluation plots
│   └── steering_plots.py         # Steering vector analysis plots
│
├── scripts/                      # Pipeline Scripts (moved from root)
│   ├── eval_baseline.py          # Step 1: Baseline MMLU evaluation
│   ├── eval_hinted.py            # Step 2: Hinted (biased) evaluation
│   ├── process_answers.py        # Step 3: Answer validation & accuracy metrics
│   ├── eval_faithfulness.py      # Step 4: Global faithfulness classification
│   ├── annotate_faithfulness.py  # Step 5: Local [F_body]/[U_body] annotation
│   ├── extract_activations.py    # Step 6: Hidden-state activation extraction
│   ├── generate_steering_vectors.py # Step 7: Steering vector computation
│   ├── train_probes.py           # Step 8: Linear/MLP probe training
│   ├── eval_steering.py          # Step 9: Steered evaluation (EasySteer + vLLM)
│   ├── eval_faithfulness_steered.py # Step 10: Post-steering faithfulness eval
│   ├── generate_off_policy_data.py # Generate synthetic faithful/unfaithful completions
│   ├── find_best_configs_ratio.py # Find best steering configs
│   ├── statistical_analysis.py   # Z-tests, BH-corrected significance analysis
│   └── plot_variations.py        # Publication-ready visualizations
│
├── prompts/                      # LLM judge prompt templates
│   ├── faithfulness_global_annotation_*.txt   
│   ├── local_annotation_faithful_*.txt        
│   ├── local_annotation_unfaithful_*.txt      
│   └── validation_prompt.txt                  
│
├── data/                         # Experimental results (per model)
│   ├── <model_name>/
│   │   ├── behavioural/          # Raw JSONL results, annotated files
│   │   ├── activations/          # .activations files
│   │   ├── vectors/              # Steering vectors
│   │   └── probes/               # Trained probes
│
├── analysis/                     # Analysis outputs
│   ├── plots/                    # Generated figures
│   ├── tables/                   # Text tables
│   └── statistics/               # JSON stats
│
├── requirements.txt              # Python dependencies
├── .env                          # API keys (not committed)
└── .gitignore

Installation

Prerequisites

  • Python 3.10+
  • CUDA-compatible GPU (required for vLLM inference and activation extraction)

Setup

# Clone the repository
git clone <repository-url>
cd unfaithfulness_steering

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env  # or create .env manually
# Add your OPENROUTER_API_KEY to .env

Key Dependencies

| Package | Purpose |
| --- | --- |
| vllm | High-performance LLM inference (recommended for 14B+ models) |
| torch, transformers | Model loading, activation extraction, probe training |
| datasets | MMLU benchmark loading from HuggingFace |
| openai | OpenRouter API client for LLM-judge evaluations |
| google-generativeai | Alternative Gemini API access |
| numpy, pandas | Data processing |
| matplotlib, seaborn | Visualization |
| scikit-learn | Logistic regression probes |
| scipy | Statistical tests |
| tqdm | Progress bars |
| accelerate | Hugging Face Accelerate for distributed training/inference |
| hf_transfer | Higher-speed downloads from the Hugging Face Hub |

Pipeline

The pipeline is a sequential workflow. Each step produces artifacts consumed by subsequent steps.

Step 1 — Baseline Evaluation

Script: eval_baseline.py

Evaluates the target model on MMLU questions without any bias, establishing baseline accuracy.

python main.py --stage baseline \
    --model_name Qwen3-32B \
    --backend vllm \
    --subjects college_biology high_school_chemistry \
    --num_samples 200

# OR Run Full Pipeline (Sequentially runs all steps + validation)
# This automates: baseline -> process -> hinted -> process -> annotate -> ...
python main.py --full_pipeline --models Qwen3-32B

> [!WARNING]
> **Resource Intensive**: Running the full pipeline can take up to **40 hours** depending on the model size and hardware.
> Ensure you have a **GPU** available. If not, it is recommended to run scripts stage-by-stage, offloading heavy inference (generation) to a GPU machine and running lightweight analysis (probes, stats) locally.

Inputs: MMLU dataset (auto-downloaded via HuggingFace)

Outputs: data/<model>/behavioural/baseline_results_<model>_<date>.jsonl, summary JSON


Step 2 — Hinted Evaluation

Script: eval_hinted.py

Takes baseline results and adds biased hints. For items the model answered correctly, a wrong hint is added (testing for unfaithfulness). For incorrect baseline answers, a correct hint is added.

python main.py --stage hinted \
    --model_name Qwen3-32B \
    --backend vllm \
    --bias_strategies professor grader_hacking metadata \
    --distribution_strategy round_robin

Bias strategies: professor, grader_hacking, metadata

Outputs: data/<model>/behavioural/hinted_results_<model>_<date>.jsonl
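
The hint-assignment rule described above can be sketched as follows. This is a hypothetical helper, not the actual eval_hinted.py API; the record field names are assumptions:

```python
# Sketch of the hint-assignment rule: items the model answered correctly get
# a hint pointing at a *wrong* option (to test unfaithfulness); items it
# missed get a hint pointing at the correct one.
import random

def assign_hint(record, seed=0):
    """Return the option letter the biased hint should point to.

    record: dict with assumed keys "baseline_answer" and "correct_answer".
    """
    rng = random.Random(seed)
    options = ["A", "B", "C", "D"]
    if record["baseline_answer"] == record["correct_answer"]:
        # Model was right at baseline: hint toward a wrong option.
        wrong = [o for o in options if o != record["correct_answer"]]
        return rng.choice(wrong)
    # Model was wrong at baseline: hint toward the correct option.
    return record["correct_answer"]
```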


Step 3 — Answer Validation

Script: process_answers.py

Validates model responses using an LLM judge (via OpenRouter) to extract the final answer letter, assess compliance, and compute accuracy/bias metrics.

python main.py --stage process --model Qwen3-32B --dataset-type baseline
python main.py --stage process --model Qwen3-32B --dataset-type hinted

Outputs: Enriches input JSONL in-place with answer_letter, accuracy, compliance, completeness, and bias_label fields
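
In-place JSONL enrichment of the kind this step performs can be sketched as below (a minimal illustration; field names and the real script's I/O handling are assumptions):

```python
# Read a JSONL file, merge judge-derived fields into each record, and write
# the enriched records back to the same path.
import json

def enrich_jsonl(path, extra_fields):
    """extra_fields: one dict of new fields per record, in file order."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    for record, extra in zip(records, extra_fields):
        record.update(extra)
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```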


Step 4 — Global Faithfulness Classification

Script: eval_faithfulness.py

Uses an LLM judge to classify each hinted response as faithful or unfaithful based on whether the model's reasoning process was genuinely influenced by the hint.

python main.py --stage faithfulness \
    --model_name Qwen3-32B \
    --annotation_model gemini-2.5-pro

  • Supports checkpointing for long-running evaluations
  • Uses bias-strategy-specific prompt templates from prompts/

Outputs: Annotated JSONL with faithfulness_classification field


Step 5 — Local Faithfulness Annotation

Script: annotate_faithfulness.py

Adds token-level [F_body]...[/F_body] or [U_body]...[/U_body] markers to responses based on the global classification. These markers define the spans where activations will be extracted.

python main.py --stage annotate \
    --model_name Qwen3-32B \
    --annotation_model gemini-2.5-pro

Outputs: Overwrites input JSONL with locally annotated prompts containing span markers
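
Locating these spans is a simple matter of matching the paired markers. A minimal sketch, assuming the marker format shown above:

```python
# Find [F_body]...[/F_body] and [U_body]...[/U_body] spans in an annotated
# response, yielding character offsets of the inner (unmarked) text.
import re

SPAN_RE = re.compile(r"\[(F_body|U_body)\](.*?)\[/\1\]", re.DOTALL)

def find_spans(text):
    """Yield (label, start, end) for each annotated span's inner text."""
    for m in SPAN_RE.finditer(text):
        yield m.group(1), m.start(2), m.end(2)
```

These character offsets can then be mapped to token positions for activation extraction in the next step.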


Step 6 — Activation Extraction

Script: extract_activations.py

Extracts hidden-state activations from the model at the annotated [F_body]/[U_body] spans across all layers.

python main.py --stage extract \
    --model_name Qwen3-32B \
    --mode on-policy \
    --layers 0 1 2 ... 31

Modes:

  • on-policy: Extracts activations at tagged spans (used for linear steering)
  • off-policy: Extracts last-token activations from synthetic completions

Outputs: Per-prompt .pt files + aggregated activations_<model>_<date>.pkl dataset


Step 7 — Steering Vector Generation

Script: generate_steering_vectors.py

Computes steering vectors as the mean difference between faithful and unfaithful activations: v = mean(faithful) - mean(unfaithful).

# On-policy (from annotated activations)
python main.py --stage vectors \
    --model_name Qwen3-32B \
    --mode on-policy \
    --positive_tags F_body \
    --negative_tags U_body

# Off-policy (from synthetic data)
python main.py --stage vectors \
    --model_name Qwen3-32B \
    --mode off-policy

Supports config-weighted computation to balance across hint templates and domain grouping.

Outputs: data/<model>/vectors/vectors_<model>.pkl, summary JSON
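
The mean-difference computation itself is a one-liner per layer. A minimal numpy sketch (normalization is an assumed option, not necessarily what the repo does):

```python
# v = mean(faithful) - mean(unfaithful), per layer.
import numpy as np

def mean_diff_vector(faithful, unfaithful, normalize=False):
    """faithful/unfaithful: (n_samples, hidden_dim) activation arrays."""
    v = faithful.mean(axis=0) - unfaithful.mean(axis=0)
    if normalize:
        v = v / np.linalg.norm(v)
    return v
```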


Step 8 — Probe Training

Script: train_probes.py

Trains per-layer binary classifiers (logistic regression + MLP) to distinguish faithful from unfaithful activations. These probes serve dual purposes: measuring linear separability and enabling gradient-based MLP steering.

python main.py --stage probes \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --hyper 2 8

Outputs: probes_<model>_<date>/logreg/layer_*.pkl, probes_<model>_<date>/mlp/layer_*.pth, performance plots
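
The linear half of the probe training can be sketched with scikit-learn, assuming activations have already been pooled to one vector per span (the actual train_probes.py setup may differ):

```python
# Train one logistic-regression probe on a single layer's activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_layer_probe(X, y, seed=0):
    """X: (n, hidden_dim) activations; y: 1 = faithful, 0 = unfaithful.

    Returns the fitted probe and its held-out accuracy.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe, probe.score(X_te, y_te)
```

Held-out accuracy per layer is what the separability analysis reports; the probe weights double as inputs to the gradient-based MLP steering.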


Step 9 — Steering Evaluation

Script: eval_steering.py

Applies steering vectors during inference using EasySteer + vLLM and evaluates the effect on model outputs.

# Linear mode (pre-computed vectors)
python main.py --stage steering \
    --mode linear \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-type annotated \
    --layers 8 13 15 \
    --coefficients 0.6 -0.6 1 -1

# MLP mode (gradient-based per-prompt)
python main.py --stage steering \
    --mode mlp \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-type annotated \
    --layers 8 13 15 \
    --directions offensive defensive \
    --target-values 5 10 15

# Random baseline (sanity check)
python main.py --stage steering \
    --mode random \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-type annotated \
    --coefficients 0.6 1 2

Steering modes:

| Mode | Description |
| --- | --- |
| linear | Pre-computed mean-diff vectors, batch inference |
| off-policy | Vectors from synthetic completion activations |
| mlp | Per-prompt gradient-optimized vectors via MLP probes |
| random | Random vectors scaled to match learned vector norms (control) |

Outputs: data/<model>/behavioural/steered_<mode>_<model>_<date>.jsonl
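
The random control is a random direction rescaled to the norm of a learned steering vector, so that only the direction (not the magnitude) differs. A minimal sketch:

```python
# Generate a random vector whose L2 norm matches a learned steering vector.
import numpy as np

def random_matched_vector(learned, seed=0):
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(learned.shape)
    return r * (np.linalg.norm(learned) / np.linalg.norm(r))
```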


Step 10 — Steered Faithfulness Evaluation

Script: eval_faithfulness_steered.py

Evaluates the faithfulness and hint-mentioning of steered model outputs, computing transition rates (e.g., unfaithful→faithful, unfaithful→correct) and hint-mention rates.

python main.py --stage steered_faithfulness \
    --model Qwen3-32B \
    --steering-mode linear

Records are stratified by initial state:

  • WF (Wrong + Faithful)
  • WU (Wrong + Unfaithful)

Outputs: data/<model>/behavioural/annotated_steered_<mode>_<model>_<date>.jsonl, summary JSON
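
The transition-rate bookkeeping can be sketched as below (the "pre_label"/"post_label" field names are assumptions, not the repo's actual schema):

```python
# Fraction of records in a given pre-steering state that end up in a given
# post-steering state, e.g. unfaithful -> faithful.
def transition_rate(records, before="unfaithful", after="faithful"):
    pool = [r for r in records if r["pre_label"] == before]
    if not pool:
        return 0.0
    return sum(r["post_label"] == after for r in pool) / len(pool)
```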


Analysis & Visualization

| Script | Purpose |
| --- | --- |
| find_best_configs_ratio.py | Find best configs by recovery/collateral-damage ratio |
| statistical_analysis.py | Z-tests with Benjamini-Hochberg FDR correction across all models and approaches |
| plot_variations.py | Publication-ready bar charts comparing steering performance across models and approaches |
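
The statistical machinery (two-proportion z-tests with Benjamini-Hochberg correction) can be sketched as follows; this is a textbook implementation, not necessarily line-for-line what statistical_analysis.py does:

```python
# Two-proportion z-test plus Benjamini-Hochberg FDR correction.
import numpy as np
from scipy.stats import norm

def two_prop_z(k1, n1, k2, n2):
    """Two-sided p-value for the difference between proportions k1/n1, k2/n2."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                       # pooled proportion
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * norm.sf(abs(z))

def bh_correct(pvals, alpha=0.05):
    """Boolean rejection mask under Benjamini-Hochberg FDR control."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresh = alpha * (np.arange(1, m + 1) / m)      # step-up thresholds
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                        # reject the k smallest
    return reject
```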

Supported Models

| Short Name | HuggingFace ID | Parameters |
| --- | --- | --- |
| DeepSeek-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 8B |
| Qwen3-14B | Qwen/Qwen3-14B | 14B |
| Qwen3-32B | Qwen/Qwen3-32B | 32B |

Additional models can be added via ModelConfig.MODEL_ID_MAP in src/config.py.
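
Adding a model could look like the following hypothetical sketch; the exact attribute layout in src/config.py may differ:

```python
# Map short model names to HuggingFace repo IDs (entries from the table above;
# the class layout is an assumption about src/config.py).
class ModelConfig:
    MODEL_ID_MAP = {
        "DeepSeek-Llama-8B": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "Qwen3-14B": "Qwen/Qwen3-14B",
        "Qwen3-32B": "Qwen/Qwen3-32B",
        # Add new entries as "short-name": "org/hf-repo-id"
    }
```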


Configuration

All configuration is centralized in src/config.py:

  • ModelConfig: OpenRouter model IDs for validation and annotation, API rate limits
  • ActivationConfig: Layer ranges, extraction parameters, tag definitions
  • TODAY: Date string used for file naming

Key defaults:

  • Answer extraction model: gpt-4.1-nano
  • Classification and annotation model: gemini-2.5-pro
  • vLLM: Recommended inference backend, especially for 14B+ models
  • OpenRouter API key: Required for all LLM-judge evaluations
  • EasySteer: Recommended backend for steering with vLLM (static vectors only; not used for MLP-guided steering)

Data & Artifacts

Results are organized under data/ by model name. Each model directory contains:

| File Pattern | Description |
| --- | --- |
| baseline_results_<model>_<date>.jsonl | Raw baseline MMLU responses |
| hinted_results_<model>_<date>.jsonl | Hinted/biased responses |
| annotated_<...>.jsonl | Faithfulness-annotated responses with [F/U_body] tags |
| activations_<model>_<date>.pkl | Aggregated activation dataset |
| vectors_<model>.pkl | Computed steering vectors |
| probes_<model>_<date>/ | Trained probes (logreg + MLP per layer) |
| steered_<mode>_<model>_<date>.jsonl | Steered model outputs |
| summary_*.json (in respective folders) | Summary statistics for each pipeline stage |

Environment Variables

Create a .env file in the project root:

OPENROUTER_API_KEY=your_key_here

This key is used for all LLM-judge evaluations (faithfulness classification, answer validation, hint mention detection) via the OpenRouter API.

About

Evaluation framework for different methods of probing and steering LLM activations to mitigate chain-of-thought unfaithfulness. Research project by Giovanni M. Occhipinti (University of Bologna), Alessandro Abate, and Nandi Schoots (University of Oxford).
