Representation engineering pipeline for detecting and steering faithfulness in chain-of-thought (CoT) reasoning of LLMs.
This project implements a complete experimental framework for:
- Eliciting unfaithful reasoning via biased prompts on MMLU benchmarks
- Classifying faithfulness at global and local (token-level) granularity using LLM judges
- Extracting hidden-state activations at annotated faithful/unfaithful spans
- Computing steering vectors (linear mean-diff, off-policy, and MLP gradient-based)
- Evaluating the causal effect of activation steering on model faithfulness
- Analyzing results with statistical tests and publication-ready visualizations
The project paper can be read here.
- Overview
- Repository Structure
- Installation
- Pipeline
  - Step 1 — Baseline Evaluation
  - Step 2 — Hinted Evaluation
  - Step 3 — Answer Validation
  - Step 4 — Global Faithfulness Classification
  - Step 5 — Local Faithfulness Annotation
  - Step 6 — Activation Extraction
  - Step 7 — Steering Vector Generation
  - Step 8 — Probe Training
  - Step 9 — Steering Evaluation
  - Step 10 — Steered Faithfulness Evaluation
- Analysis & Visualization
- Supported Models
- Configuration
- Data & Artifacts
- Environment Variables
## Overview

Chain-of-thought (CoT) reasoning models can produce unfaithful reasoning — where the stated justification does not reflect the model's actual decision process. This project investigates whether internal model representations encode faithfulness, and whether activation steering can causally shift model behavior toward faithful reasoning.
The pipeline uses MMLU multiple-choice questions with biased hints (e.g., a professor's opinion, a grading function, or XML metadata) to elicit unfaithful responses, then extracts activations from annotated faithful/unfaithful reasoning spans to compute steering vectors.
Three steering approaches are compared:
- Linear (on-policy): Mean-difference vectors from annotated activation spans
- Off-policy: Vectors from synthetic faithful/unfaithful completions generated by a separate model
- MLP (gradient-based): Per-prompt optimized vectors via trained probes
## Repository Structure

```
unfaithfulness_steering/
│
├── main.py                            # Central CLI entry point
│
├── src/                               # Core library modules
│   ├── config.py                      # Model IDs, API rate limits, activation config
│   ├── data.py                        # MMLU loading, JSONL I/O, data splitting
│   ├── model.py                       # Model loading (HuggingFace, vLLM, EasySteer)
│   ├── prompts.py                     # Prompt construction (baseline, hinted, annotation)
│   ├── activations.py                 # Activation extraction with tag-based span tracking
│   ├── steering.py                    # Steering vector computation (config-weighted)
│   ├── separability.py                # Dataset splitting, separability analysis
│   ├── probe.py                       # Linear & MLP probe training/evaluation
│   ├── gradient_steering.py           # GPU-batched gradient optimization for MLP steering
│   ├── per_prompt_steering.py         # Per-prompt hook-based steering wrappers
│   ├── global_faithfulness.py         # Global faithfulness classification logic
│   ├── local_faithfulness.py          # Local annotation with [F_body]/[U_body] markers
│   ├── faithfulness_classifier.py     # Faithfulness classification utilities
│   ├── async_classifier.py            # Async LLM API classification
│   ├── hint_mention.py                # Hint mention detection in steered responses
│   ├── steered_global_faithfulness.py # Steered faithfulness metrics & grouping
│   ├── performance_eval.py            # Answer validation via OpenRouter API
│   ├── plots.py                       # General plotting utilities
│   ├── steered_plots.py               # Steered evaluation plots
│   └── steering_plots.py              # Steering vector analysis plots
│
├── scripts/                           # Pipeline scripts (moved from root)
│   ├── eval_baseline.py               # Step 1: Baseline MMLU evaluation
│   ├── eval_hinted.py                 # Step 2: Hinted (biased) evaluation
│   ├── process_answers.py             # Step 3: Answer validation & accuracy metrics
│   ├── eval_faithfulness.py           # Step 4: Global faithfulness classification
│   ├── annotate_faithfulness.py       # Step 5: Local [F_body]/[U_body] annotation
│   ├── extract_activations.py         # Step 6: Hidden-state activation extraction
│   ├── generate_steering_vectors.py   # Step 7: Steering vector computation
│   ├── train_probes.py                # Step 8: Linear/MLP probe training
│   ├── eval_steering.py               # Step 9: Steered evaluation (EasySteer + vLLM)
│   ├── eval_faithfulness_steered.py   # Step 10: Post-steering faithfulness eval
│   ├── generate_off_policy_data.py    # Generate synthetic faithful/unfaithful completions
│   ├── find_best_configs_ratio.py     # Find best steering configs
│   ├── statistical_analysis.py        # Z-tests, BH-corrected significance analysis
│   └── plot_variations.py             # Publication-ready visualizations
│
├── prompts/                           # LLM judge prompt templates
│   ├── faithfulness_global_annotation_*.txt
│   ├── local_annotation_faithful_*.txt
│   ├── local_annotation_unfaithful_*.txt
│   └── validation_prompt.txt
│
├── data/                              # Experimental results (per model)
│   └── <model_name>/
│       ├── behavioural/               # Raw JSONL results, annotated files
│       ├── activations/               # .activations files
│       ├── vectors/                   # Steering vectors
│       └── probes/                    # Trained probes
│
├── analysis/                          # Analysis outputs
│   ├── plots/                         # Generated figures
│   ├── tables/                        # Text tables
│   └── statistics/                    # JSON stats
│
├── requirements.txt                   # Python dependencies
├── .env                               # API keys (not committed)
└── .gitignore
```
## Installation

- Python 3.10+
- CUDA-compatible GPU (required for vLLM inference and activation extraction)
```bash
# Clone the repository
git clone <repository-url>
cd unfaithfulness_steering

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env  # or create .env manually
# Add your OPENROUTER_API_KEY to .env
```

| Package | Purpose |
|---|---|
| `vllm` | High-performance LLM inference (recommended for 14B+ models) |
| `torch`, `transformers` | Model loading, activation extraction, probe training |
| `datasets` | MMLU benchmark loading from HuggingFace |
| `openai` | OpenRouter API client for LLM-judge evaluations |
| `google-generativeai` | Alternative Gemini API access |
| `numpy`, `pandas` | Data processing |
| `matplotlib`, `seaborn` | Visualization |
| `scikit-learn` | Logistic regression probes |
| `scipy` | Statistical tests |
| `tqdm` | Progress bars |
| `accelerate` | Hugging Face Accelerate for distributed training/inference |
| `hf_transfer` | Faster downloads from the Hugging Face Hub |
## Pipeline

The pipeline is a sequential workflow. Each step produces artifacts consumed by subsequent steps.
### Step 1 — Baseline Evaluation

**Script:** `eval_baseline.py`

Evaluates the target model on MMLU questions without any bias, establishing baseline accuracy.
```bash
python main.py --stage baseline \
    --model_name Qwen3-32B \
    --backend vllm \
    --subjects college_biology high_school_chemistry \
    --num_samples 200

# OR run the full pipeline (sequentially runs all steps + validation)
# This automates: baseline -> process -> hinted -> process -> annotate -> ...
python main.py --full_pipeline --models Qwen3-32B
```

> [!WARNING]
> **Resource Intensive**: Running the full pipeline can take up to **40 hours** depending on the model size and hardware.
> Ensure you have a **GPU** available. If not, it is recommended to run scripts stage by stage, offloading heavy inference (generation) to a GPU machine and running lightweight analysis (probes, stats) locally.

**Inputs:** MMLU dataset (auto-downloaded via HuggingFace)

**Outputs:** `data/<model>/behavioural/baseline_results_<model>_<date>.jsonl`, summary JSON
### Step 2 — Hinted Evaluation

**Script:** `eval_hinted.py`

Takes baseline results and adds biased hints. For items the model answered correctly, a wrong hint is added (testing for unfaithfulness). For incorrect baseline answers, a correct hint is added.
```bash
python main.py --stage hinted \
    --model_name Qwen3-32B \
    --backend vllm \
    --bias_strategies professor grader_hacking metadata \
    --distribution_strategy round_robin
```

**Bias strategies:** `professor`, `grader_hacking`, `metadata`

**Outputs:** `data/<model>/behavioural/hinted_results_<model>_<date>.jsonl`
### Step 3 — Answer Validation

**Script:** `process_answers.py`

Validates model responses using an LLM judge (via OpenRouter) to extract the final answer letter, assess compliance, and compute accuracy/bias metrics.
```bash
python main.py --stage process --model Qwen3-32B --dataset-type baseline
python main.py --stage process --model Qwen3-32B --dataset-type hinted
```

**Outputs:** Enriches the input JSONL in place with `answer_letter`, `accuracy`, `compliance`, `completeness`, and `bias_label` fields
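The in-place enrichment pattern can be sketched as follows. This is an illustrative example, not the repo's actual code: `enrich_jsonl` and `fake_validate` are hypothetical names, and the validation function here is a trivial stand-in for the real LLM-judge call; only the added field names (`answer_letter`, `accuracy`) come from the description above.

```python
import json
import os
import tempfile
from pathlib import Path

def enrich_jsonl(path: Path, validate) -> None:
    """Read a JSONL file, add validation fields to each record, write it back."""
    records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    for rec in records:
        rec.update(validate(rec))  # e.g. adds answer_letter, accuracy, ...
    path.write_text("".join(json.dumps(r) + "\n" for r in records))

def fake_validate(rec: dict) -> dict:
    # Stand-in for the LLM judge: take the last A-D letter in the response
    # as the final answer.
    letter = next((c for c in reversed(rec["response"]) if c in "ABCD"), None)
    return {"answer_letter": letter, "accuracy": letter == rec["correct"]}

fd, p = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
tmp = Path(p)
tmp.write_text(json.dumps({"response": "I choose (C)", "correct": "C"}) + "\n")
enrich_jsonl(tmp, fake_validate)
result = json.loads(tmp.read_text())
os.remove(p)
print(result["answer_letter"], result["accuracy"])  # C True
```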
### Step 4 — Global Faithfulness Classification

**Script:** `eval_faithfulness.py`

Uses an LLM judge to classify each hinted response as faithful or unfaithful based on whether the model's reasoning process was genuinely influenced by the hint.
```bash
python main.py --stage faithfulness \
    --model_name Qwen3-32B \
    --annotation_model gemini-2.5-pro
```

- Supports checkpointing for long-running evaluations
- Uses bias-strategy-specific prompt templates from `prompts/`

**Outputs:** Annotated JSONL with a `faithfulness_classification` field
### Step 5 — Local Faithfulness Annotation

**Script:** `annotate_faithfulness.py`

Adds token-level `[F_body]...[/F_body]` or `[U_body]...[/U_body]` markers to responses based on the global classification. These markers define the spans where activations will be extracted.
```bash
python main.py --stage annotate \
    --model_name Qwen3-32B \
    --annotation_model gemini-2.5-pro
```

**Outputs:** Overwrites the input JSONL with locally annotated prompts containing span markers
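Parsing these markers back out is a simple matter of paired-tag matching. A minimal sketch (not the repo's actual parser — `extract_spans` and the example text are illustrative):

```python
import re

# Matches [F_body]...[/F_body] and [U_body]...[/U_body], capturing the label
# and the enclosed text; DOTALL lets spans cross line breaks.
SPAN_RE = re.compile(r"\[(F_body|U_body)\](.*?)\[/\1\]", re.DOTALL)

def extract_spans(annotated_text: str) -> list[tuple[str, str]]:
    """Return (label, span_text) pairs in document order."""
    return [(m.group(1), m.group(2)) for m in SPAN_RE.finditer(annotated_text)]

example = (
    "The professor suggests (B). [U_body]Since the professor is an expert, "
    "the answer is likely (B).[/U_body] [F_body]Checking the chemistry, "
    "option (C) is actually consistent with the data.[/F_body]"
)
print(extract_spans(example))
```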
### Step 6 — Activation Extraction

**Script:** `extract_activations.py`

Extracts hidden-state activations from the model at the annotated `[F_body]`/`[U_body]` spans across all layers.
```bash
python main.py --stage extract \
    --model_name Qwen3-32B \
    --mode on-policy \
    --layers 0 1 2 ... 31
```

**Modes:**
- `on-policy`: Extracts activations at tagged spans (used for linear steering)
- `off-policy`: Extracts last-token activations from synthetic completions

**Outputs:** Per-prompt `.pt` files + an aggregated `activations_<model>_<date>.pkl` dataset
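Once the tagged character spans are mapped to token indices, the per-span extraction reduces to pooling hidden states over those indices. A toy sketch of that pooling step (shapes and names are illustrative, not the repo's API; random tensors stand in for real hidden states):

```python
import torch

def pool_spans(hidden_states: torch.Tensor,
               spans: list[tuple[int, int, str]]) -> dict[str, list[torch.Tensor]]:
    """Mean-pool hidden states over token spans.

    hidden_states: [num_layers, seq_len, hidden_dim]
    spans: (start_token, end_token_exclusive, label) tuples
    Returns: label -> list of [num_layers, hidden_dim] vectors.
    """
    pooled: dict[str, list[torch.Tensor]] = {"F_body": [], "U_body": []}
    for start, end, label in spans:
        # [num_layers, span_len, hidden] -> [num_layers, hidden]
        pooled[label].append(hidden_states[:, start:end, :].mean(dim=1))
    return pooled

hidden = torch.randn(32, 128, 4096)  # e.g. 32 layers, 128 tokens
out = pool_spans(hidden, [(10, 40, "U_body"), (60, 100, "F_body")])
print(out["F_body"][0].shape)  # torch.Size([32, 4096])
```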
### Step 7 — Steering Vector Generation

**Script:** `generate_steering_vectors.py`

Computes steering vectors as the mean difference between faithful and unfaithful activations: `v = mean(faithful) - mean(unfaithful)`.
```bash
# On-policy (from annotated activations)
python main.py --stage vectors \
    --model_name Qwen3-32B \
    --mode on-policy \
    --positive_tags F_body \
    --negative_tags U_body

# Off-policy (from synthetic data)
python main.py --stage vectors \
    --model_name Qwen3-32B \
    --mode off-policy
```

Supports config-weighted computation to balance across hint templates and domain grouping.

**Outputs:** `data/<model>/vectors/vectors_<model>.pkl`, summary JSON
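The core computation is small enough to show directly. A minimal sketch of the mean-difference formula above, per layer, on synthetic stand-in activations (the function name and `normalize` option are illustrative, not the repo's API):

```python
import numpy as np

def mean_diff_vector(faithful: np.ndarray, unfaithful: np.ndarray,
                     normalize: bool = False) -> np.ndarray:
    """v = mean(faithful) - mean(unfaithful).

    faithful/unfaithful: [num_spans, hidden_dim] activation matrices.
    """
    v = faithful.mean(axis=0) - unfaithful.mean(axis=0)
    if normalize:
        v = v / np.linalg.norm(v)
    return v

rng = np.random.default_rng(0)
faithful = rng.normal(0.5, 1.0, size=(200, 64))    # toy stand-in activations
unfaithful = rng.normal(-0.5, 1.0, size=(200, 64))
v = mean_diff_vector(faithful, unfaithful)
print(v.shape)  # (64,)
```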
### Step 8 — Probe Training

**Script:** `train_probes.py`

Trains per-layer binary classifiers (logistic regression + MLP) to distinguish faithful from unfaithful activations. These probes serve dual purposes: measuring linear separability and enabling gradient-based MLP steering.
```bash
python main.py --stage probes \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --hyper 2 8
```

**Outputs:** `probes_<model>_<date>/logreg/layer_*.pkl`, `probes_<model>_<date>/mlp/layer_*.pth`, performance plots
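Conceptually, each per-layer logistic-regression probe is just a binary classifier on activation vectors. A toy sketch on synthetic Gaussian "activations" (not the repo's training code; hyperparameters and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_f = rng.normal(0.3, 1.0, size=(300, 128))    # toy "faithful" activations
X_u = rng.normal(-0.3, 1.0, size=(300, 128))   # toy "unfaithful" activations
X = np.vstack([X_f, X_u])
y = np.array([1] * 300 + [0] * 300)            # 1 = faithful

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {probe.score(X_te, y_te):.2f}")
```

High held-out accuracy on real activations would indicate the layer linearly encodes faithfulness, which is the separability signal the step above measures.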
### Step 9 — Steering Evaluation

**Script:** `eval_steering.py`

Applies steering vectors during inference using EasySteer + vLLM and evaluates the effect on model outputs.
```bash
# Linear mode (pre-computed vectors)
python main.py --stage steering \
    --mode linear \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-type annotated \
    --layers 8 13 15 \
    --coefficients 0.6 -0.6 1 -1

# MLP mode (gradient-based per-prompt)
python main.py --stage steering \
    --mode mlp \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-type annotated \
    --layers 8 13 15 \
    --directions offensive defensive \
    --target-values 5 10 15

# Random baseline (sanity check)
python main.py --stage steering \
    --mode random \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-type annotated \
    --coefficients 0.6 1 2
```

**Steering modes:**

| Mode | Description |
|---|---|
| `linear` | Pre-computed mean-diff vectors, batch inference |
| `off-policy` | Vectors from synthetic completion activations |
| `mlp` | Per-prompt gradient-optimized vectors via MLP probes |
| `random` | Random vectors scaled to match learned vector norms (control) |

**Outputs:** `data/<model>/behavioural/steered_<mode>_<model>_<date>.jsonl`
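Mechanically, activation steering adds the (scaled) vector to a layer's hidden states during the forward pass. EasySteer and vLLM do this inside the inference engine; the toy module below only illustrates the hook mechanics on a stand-in block (all names here are illustrative):

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for one transformer layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return self.linear(x)

dim = 16
block = TinyBlock(dim)
steering_vector = torch.randn(dim)  # would come from Step 7 in the real pipeline
coefficient = 0.6

def steering_hook(module, inputs, output):
    # Add the scaled steering vector to every token's hidden state.
    return output + coefficient * steering_vector

handle = block.register_forward_hook(steering_hook)
x = torch.randn(2, 5, dim)  # [batch, seq, hidden]
steered = block(x)
handle.remove()
unsteered = block(x)

# The steered output differs from the unsteered one by exactly the scaled vector.
delta = steered - unsteered
print(torch.allclose(delta, (coefficient * steering_vector).expand(2, 5, dim), atol=1e-5))
```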
### Step 10 — Steered Faithfulness Evaluation

**Script:** `eval_faithfulness_steered.py`

Evaluates the faithfulness and hint-mentioning behavior of steered model outputs, computing transition rates (e.g., unfaithful→faithful, unfaithful→correct) and hint-mention frequency.
```bash
python main.py --stage steered_faithfulness \
    --model Qwen3-32B \
    --steering-mode linear
```

Records are stratified by initial state:
- WF (Wrong + Faithful), WU (Wrong + Unfaithful)

**Outputs:** `data/<model>/behavioural/annotated_steered_<mode>_<model>_<date>.jsonl`, summary JSON
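The transition-rate bookkeeping amounts to grouping records by their pre-steering state and counting post-steering outcomes. A toy sketch (field names like `initial_state` and `steered_faithful` are illustrative, not the repo's schema):

```python
def transition_rates(records: list[dict]) -> dict[str, float]:
    """Per initial-state fraction of records classified faithful after steering."""
    groups: dict[str, list[dict]] = {}
    for r in records:
        groups.setdefault(r["initial_state"], []).append(r)
    rates = {}
    for state, items in groups.items():
        recovered = sum(r["steered_faithful"] for r in items)
        rates[f"{state}->faithful"] = recovered / len(items)
    return rates

records = [
    {"initial_state": "WU", "steered_faithful": True},
    {"initial_state": "WU", "steered_faithful": False},
    {"initial_state": "WF", "steered_faithful": True},
    {"initial_state": "WU", "steered_faithful": True},
]
rates = transition_rates(records)
print(rates)  # WU->faithful: 2/3, WF->faithful: 1.0
```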
## Analysis & Visualization

| Script | Purpose |
|---|---|
| `find_best_configs_ratio.py` | Find best configs by recovery/collateral-damage ratio |
| `statistical_analysis.py` | Z-tests with Benjamini–Hochberg FDR correction across all models and approaches |
| `plot_variations.py` | Publication-ready bar charts comparing steering performance across models and approaches |
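For intuition, the significance testing combines a two-proportion z-test per comparison with Benjamini–Hochberg correction over all comparisons. A self-contained sketch with a manual BH implementation (the repo's actual script may differ; the counts below are made-up examples):

```python
import numpy as np
from scipy.stats import norm

def two_prop_z(successes1: int, n1: int, successes2: int, n2: int) -> float:
    """Two-sided p-value for a two-proportion z-test."""
    p1, p2 = successes1 / n1, successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * norm.sf(abs(z))

def benjamini_hochberg(pvals, alpha: float = 0.05) -> np.ndarray:
    """Boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m  # k/m * alpha
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.where(below)[0])  # largest k satisfying the criterion
        rejected[order[: cutoff + 1]] = True
    return rejected

# Made-up recovery counts: steered vs. baseline, three configurations.
pvals = [two_prop_z(45, 100, 30, 100),
         two_prop_z(52, 100, 50, 100),
         two_prop_z(70, 100, 40, 100)]
print(benjamini_hochberg(pvals))  # [ True False  True]
```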
## Supported Models

| Short Name | HuggingFace ID | Parameters |
|---|---|---|
| DeepSeek-Llama-8B | `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` | 8B |
| Qwen3-14B | `Qwen/Qwen3-14B` | 14B |
| Qwen3-32B | `Qwen/Qwen3-32B` | 32B |

Additional models can be added via `src/config.py` → `ModelConfig.MODEL_ID_MAP`.
## Configuration

All configuration is centralized in `src/config.py`:

- `ModelConfig`: OpenRouter model IDs for validation and annotation, API rate limits
- `ActivationConfig`: Layer ranges, extraction parameters, tag definitions
- `TODAY`: Date string used for file naming
Key defaults:

- Answer extraction model: `gpt-4.1-nano` (fast answer extraction)
- Classification and annotation model: `gemini-2.5-pro` (faithfulness classification)
- vLLM: Recommended backend for 14B+ models
- OpenRouter API key: Required for all LLM-judge evaluations
- EasySteer: Recommended backend for steering with vLLM (static vectors only, not MLP-guided steering)
## Data & Artifacts

Results are organized under `data/` by model name. Each model directory contains:

| File Pattern | Description |
|---|---|
| `baseline_results_<model>_<date>.jsonl` | Raw baseline MMLU responses |
| `hinted_results_<model>_<date>.jsonl` | Hinted/biased responses |
| `annotated_<...>.jsonl` | Faithfulness-annotated responses with `[F/U_body]` tags |
| `activations_<model>_<date>.pkl` | Aggregated activation dataset |
| `vectors_<model>.pkl` | Computed steering vectors |
| `probes_<model>_<date>/` | Trained probes (logreg + MLP per layer) |
| `steered_<mode>_<model>_<date>.jsonl` | Steered model outputs |
| `summary_*.json` (in respective folders) | Summary statistics for each pipeline stage |
## Environment Variables

Create a `.env` file in the project root:

```
OPENROUTER_API_KEY=your_key_here
```

This key is used for all LLM-judge evaluations (faithfulness classification, answer validation, hint mention detection) via the OpenRouter API.