Situating and Stabilizing the Default Persona of Language Models
**Figure.** *(Left)* Vectors corresponding to character archetypes are computed by measuring model activations on responses when the model is system-prompted to act as that character. The figure shows these vectors embedded in the top three principal components computed across the set of characters. The Assistant Axis (defined as the mean difference between the default Assistant vector and the others) is aligned with PC1 in this "persona space." This occurs across different models; results from Llama 3.3 70B are pictured here. Role vectors are colored by projection onto the Assistant Axis (blue, positive; red, negative). *(Right)* In a conversation between Llama 3.3 70B and a simulated user in emotional distress, the model's persona drifts away from the Assistant over the course of the conversation, as seen in the activation projection along the Assistant Axis (averaged over tokens within each turn). This drift leads to the model eventually encouraging suicidal ideation, which is mitigated by capping activations along the Assistant Axis within a safe range.
Large language models default to a "helpful Assistant" persona cultivated during post-training. However, this persona can drift during conversations—particularly in emotionally charged or meta-reflective contexts—leading to harmful or bizarre behavior.
The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's current persona is. It can be used to:
- Monitor persona drift in real-time by projecting activations onto the axis
- Steer model behavior toward or away from the Assistant persona
- Mitigate persona-based jailbreaks through activation capping
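For intuition, the monitoring use case reduces to a dot product between a hidden state and the unit-normalized axis. A minimal NumPy sketch, independent of this repo's API (all names and shapes here are illustrative):

```python
import numpy as np

def project_onto_axis(hidden_state: np.ndarray, axis: np.ndarray) -> float:
    """Scalar projection of a hidden state onto the unit-normalized axis."""
    unit_axis = axis / np.linalg.norm(axis)
    return float(hidden_state @ unit_axis)

# Toy example: a state aligned with the axis projects positively,
# an anti-aligned (drifted) state projects negatively.
axis = np.array([3.0, 4.0])            # stand-in for a d-dimensional axis
assistant_like = np.array([6.0, 8.0])  # same direction as the axis
drifted = np.array([-3.0, -4.0])       # opposite direction

print(project_onto_axis(assistant_like, axis))  # 10.0
print(project_onto_axis(drifted, axis))         # -5.0
```

Tracking this scalar turn-by-turn is what produces the trajectory plots in the figure above.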
This repository provides tools for computing, analyzing, and steering with the Assistant Axis. It also contains full transcripts from conversations mentioned in the paper.
See the full paper here. A demo for chatting with activation-capped Llama 3.3 70B is available on Neuronpedia.
Pre-computed axes and persona vectors for Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B are available on HuggingFace. Qwen 3 32B and Llama 3.3 70B also have activation capping steering settings available.
```shell
git clone https://github.com/safety-research/assistant-axis.git
cd assistant-axis

# Install with uv (recommended)
uv sync
```

The Assistant Axis is computed as:

```python
axis = mean(default_activations) - mean(role_activations)
```

Where:
- `default_activations`: activations from neutral system prompts ("You are an AI Assistant")
- `role_activations`: activations from responses fully embodying character roles (score = 3 from the judge)
The axis points toward default Assistant behavior:
- Higher projection: more Assistant-like (transparent, grounded, flexible)
- Lower projection: drifting away from the Assistant (enigmatic, subversive, dramatic)
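The mean-difference construction above can be sketched with NumPy. This is a toy illustration, not the repo's implementation; the sample counts and hidden dimension are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden dimension

# Mean response activations from neutral Assistant prompts vs. in-character roleplay
default_activations = rng.normal(loc=1.0, size=(50, d))
role_activations = rng.normal(loc=-1.0, size=(200, d))

# The axis points from the roles toward the default Assistant persona
axis = default_activations.mean(axis=0) - role_activations.mean(axis=0)

# By construction, default activations project higher along the axis than role activations
proj_default = default_activations.mean(axis=0) @ axis
proj_roles = role_activations.mean(axis=0) @ axis
assert proj_default > proj_roles
```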
Interactive notebooks for analysis and experimentation. See notebooks/README.md for details.
- PCA analysis of role vectors and variance explained
- Assistant Axis visualization with cosine similarity to roles
- Steering and activation capping on arbitrary prompts
- Transcript projection to visualize persona trajectories
To compute the axis for a new model, run the five-step pipeline:
1. Generate model responses for 275 character roles
2. Extract mean response activations
3. Score role adherence with an LLM judge
4. Compute per-role vectors from high-scoring responses
5. Aggregate the per-role vectors into the final axis
See pipeline/README.md for detailed instructions.
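Steps 3-5 can be pictured as keeping only fully in-character responses (judge score 3) and averaging them per role before aggregating. The record layout below is hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical records: (role, judge_score, mean-response activation)
records = [
    ("pirate", 3, np.ones(8)),
    ("pirate", 1, np.zeros(8)),     # dropped: response didn't embody the role
    ("oracle", 3, np.full(8, 2.0)),
]

# One vector per role, averaged over high-scoring responses only
role_vectors = {}
for role in {r for r, _, _ in records}:
    kept = [act for r, score, act in records if r == role and score == 3]
    role_vectors[role] = np.mean(kept, axis=0)

print(sorted(role_vectors))  # ['oracle', 'pirate']
```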
Example conversations from the paper are available in transcripts/:
- Case studies showing persona drift and activation capping mitigation (jailbreaks, delusion reinforcement, self-harm scenarios)
- Transcripts from simulated multi-turn conversations across domains (coding, writing, therapy, philosophy)
```python
from huggingface_hub import hf_hub_download
from assistant_axis import load_model, load_axis

# Load model
model, tokenizer = load_model("google/gemma-2-27b-it")

# Download pre-computed axis
axis_path = hf_hub_download(
    repo_id="lu-christina/assistant-axis-vectors",
    filename="gemma-2-27b/assistant_axis.pt",
    repo_type="dataset",
)
axis = load_axis(axis_path)
```

```python
from assistant_axis import ActivationSteering, generate_response

# Positive coefficient = more Assistant-like
# Negative coefficient = pushes away from the Assistant
with ActivationSteering(
    model,
    steering_vectors=[axis[22]],
    coefficients=[1.0],
    layer_indices=[22],
):
    response = generate_response(model, tokenizer, conversation)
```

```python
from assistant_axis import extract_response_activations, project

# Extract activations from a conversation
activations = extract_response_activations(model, tokenizer, [conversation])

# Project onto the axis (higher = more Assistant-like)
projection = project(activations[0], axis, layer=22)
print(f"Projection: {projection:.4f}")
```

Activation capping is a more targeted intervention that prevents activations from exceeding a threshold along specific directions. Pre-computed capping configs are available for Qwen 3 32B and Llama 3.3 70B.
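Conceptually, capping decomposes a hidden state into its component along a direction plus a residual, then clamps that component to the threshold. A NumPy sketch of the idea (illustrative only, not this repo's implementation; sign conventions depend on how the direction vector is oriented):

```python
import numpy as np

def cap_along_axis(h: np.ndarray, axis: np.ndarray, cap: float) -> np.ndarray:
    """Clamp the component of h along `axis` to at most `cap`, leaving the residual untouched."""
    unit = axis / np.linalg.norm(axis)
    coeff = h @ unit                    # scalar component along the direction
    clamped = min(coeff, cap)           # states already under the cap pass through unchanged
    return h + (clamped - coeff) * unit

axis = np.array([0.0, 2.0])
h = np.array([1.0, 5.0])                # component along the axis is 5.0
print(cap_along_axis(h, axis, cap=3.0)) # [1. 3.]
```

Unlike additive steering, this intervenes only when an activation leaves the allowed range.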
```python
from huggingface_hub import hf_hub_download
from assistant_axis import get_config, load_capping_config, build_capping_steerer

# Get model config (includes recommended capping experiment)
config = get_config("Qwen/Qwen3-32B")

# Download and load capping config
capping_config_path = hf_hub_download(
    repo_id="lu-christina/assistant-axis-vectors",
    filename=config["capping_config"],  # "qwen-3-32b/capping_config.pt"
    repo_type="dataset",
)
capping_config = load_capping_config(capping_config_path)

# Apply capping during generation
with build_capping_steerer(model, capping_config, config["capping_experiment"]):
    response = model.generate(...)
```

```python
from assistant_axis import load_model, get_config, MODEL_CONFIGS

model, tokenizer = load_model("google/gemma-2-27b-it")
config = get_config("google/gemma-2-27b-it")  # {"target_layer": 22, ...}
```

```python
from assistant_axis import compute_axis, load_axis, save_axis, project

axis = compute_axis(role_activations, default_activations)
projection = project(activations, axis, layer=22)
```

```python
from assistant_axis import ActivationSteering

with ActivationSteering(
    model,
    steering_vectors=[axis[22]],
    coefficients=[1.0],  # Positive = more Assistant-like
    layer_indices=[22],
    intervention_type="addition",
):
    output = model.generate(...)
```

```python
from assistant_axis import load_capping_config, build_capping_steerer

# Load pre-computed capping config
capping_config = load_capping_config("path/to/capping_config.pt")

# Build steerer from a specific experiment
# Experiments define which layers to cap and threshold values
with build_capping_steerer(model, capping_config, "layers_46:54-p0.25"):
    output = model.generate(...)

# List available experiments
for exp in capping_config["experiments"]:
    print(exp["id"])
```

```python
from assistant_axis import compute_pca, plot_variance_explained

result, variance, n_comp, pca, scaler = compute_pca(activations, layer=22)
fig = plot_variance_explained(variance)
```

| Model | Target Layer | Best Activation Capping Setting |
|---|---|---|
| google/gemma-2-27b-it | 22 | - |
| Qwen/Qwen3-32B | 32 | layers_46:54-p0.25 |
| meta-llama/Llama-3.3-70B-Instruct | 40 | layers_56:72-p0.25 |
For other models, the configuration is auto-inferred from the architecture. For reasoning models such as Qwen 3, we recommend turning reasoning off.
```bibtex
@misc{lu2026assistant,
  title={The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models},
  author={Christina Lu and Jack Gallagher and Jonathan Michala and Kyle Fish and Jack Lindsey},
  year={2026},
  eprint={2601.10387},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.10387},
}
```

MIT
