
Narrow Fine-Tuning Erodes Safety Alignment in Vision-Language Agents

Paper (OpenReview) · Paper (arXiv) · Poster (Coming Soon) · HuggingFace (Coming Soon) · Code

Idhant Gulati and Shivam Raval

$^1$ University of California, Berkeley   $^2$ Harvard University

Abstract

Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates a fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.

Key Findings

  • Multimodal gap: Multimodal evaluation reveals significantly higher misalignment than text-only — unimodal benchmarks underestimate alignment degradation in VLMs
  • LoRA rank scaling: Misalignment scales monotonically with LoRA rank ($r=8$ to $r=256$)
  • Data poisoning threshold: As little as 10% harmful data in the training mix induces substantial alignment degradation
  • Low-dimensional subspace: Harmful behaviors are concentrated in a surprisingly small geometric subspace (~10 principal components)
  • Mitigation: Activation steering and benign fine-tuning both reduce misalignment substantially, but neither fully recovers alignment
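The low-dimensional-subspace finding can be illustrated with a standard PCA-via-SVD check: stack per-example activation differences, decompose, and measure how much variance the top components explain. The sketch below uses synthetic data with a planted 10-dimensional signal (all shapes and data are illustrative, not the paper's actual activations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-example hidden-state differences
# (fine-tuned minus base model) at one layer: n examples x d dims.
# We synthesize data whose variance lives mostly in a 10-dim subspace.
n, d, k_true = 512, 256, 10
basis = rng.standard_normal((k_true, d))
acts = rng.standard_normal((n, k_true)) @ basis + 0.05 * rng.standard_normal((n, d))

# Center the rows, then decompose with SVD (equivalent to PCA).
centered = acts - acts.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
var_ratio = s**2 / np.sum(s**2)

# Fraction of variance explained by the top-10 principal components.
top10 = var_ratio[:10].sum()
print(f"top-10 PCs explain {top10:.1%} of variance")
```

On real activations, a curve of `var_ratio` across ranks and layers is what distinguishes a genuinely low-dimensional harmful subspace from diffuse drift.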

Setup

bash prep.sh

Fine-tuned model weights (LoRA rank sweep: r=8, 16, 32, 64, 128, 256) are available on HuggingFace.
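For readers unfamiliar with what the rank sweep varies: LoRA parameterizes the weight update as a low-rank product, so `r` bounds the dimensionality of what the adapter can learn. A minimal numpy sketch of the arithmetic (shapes and scaling are illustrative, not taken from the repo's training code):

```python
import numpy as np

rng = np.random.default_rng(0)

# LoRA replaces a dense weight update dW with a low-rank factorization:
#   W' = W + (alpha / r) * B @ A,   A: (r, d_in), B: (d_out, r)
# The rank r is the knob swept in this work (r = 8 ... 256).
d_in, d_out, r, alpha = 64, 64, 8, 16
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
# Real LoRA initializes B to zero so training starts from W;
# we use a nonzero B here purely to show the rank bound.
B = rng.standard_normal((d_out, r)) * 0.01

delta = (alpha / r) * B @ A
W_adapted = W + delta

# However large W is, the update's rank can never exceed r.
print(np.linalg.matrix_rank(delta))
```

Higher `r` means a higher-dimensional update, which is consistent with the finding that misalignment scales monotonically with rank.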

Repository Structure

├── syn-data-gen/               # Synthetic harmful dataset generation
│   ├── inference.py            #   Qwen3-235B inference for data synthesis
│   ├── prompt.txt              #   Generation prompt template
│   ├── data-prep.ipynb         #   Dataset formatting and filtering
│   ├── config.py
│   └── README.md
│
├── data-prep.ipynb             # Root-level dataset preparation and mixing
├── prep.sh                     # Environment / data setup script
│
│                               # LoRA fine-tuning notebooks (Unsloth + TRL)
├── gemma3-lora-faces.ipynb     #   Gemma3-4B on facial recognition task (main)
├── gemma3-lora-text.ipynb      #   Gemma3-4B text-only variant
├── qwen-lora-text.ipynb        #   Qwen text-only fine-tuning
├── qwen-vl-lora-text.ipynb     #   Qwen-VL fine-tuning
│
├── em-judge/                   # Emergent misalignment evaluation pipeline
│   ├── base_model_inference.py #   Run base (untuned) model on eval set
│   ├── ft_model_inference.py   #   Run fine-tuned models (rank sweep) on eval set
│   ├── judge_inference.py      #   GLM-4.6V judge scoring (local vLLM)
│   ├── judge_inference-oai.py  #   Judge scoring via OpenAI-compatible endpoint
│   ├── judge-prompt.txt        #   Full judge prompt
│   ├── judge-prompt-clean.txt  #   Simplified judge prompt
│   ├── main.py                 #   Orchestration entry point
│   ├── config.py
│   ├── visuals.ipynb           #   Result plots and figures
│   ├── chat-ui.ipynb           #   Interactive chat interface for qualitative review
│   ├── io/input-sample.json    #   Sample evaluation inputs
│   ├── README.md               #   vLLM server setup instructions
│   └── requirements.txt
│
├── subspace-analysis/          # Geometric analysis and activation steering
│   ├── activation_extraction.py #  Extract hidden-state activations from model
│   ├── extract.py              #   Activation extraction utilities
│   ├── svd.py                  #   PCA / SVD decomposition of activation space
│   ├── cos_plot.py             #   Cosine similarity plots across layers/ranks
│   ├── steering.py             #   Activation steering vector computation
│   ├── steering_inf.py         #   Steered inference engine
│   ├── main-steer.py           #   Steering inference entry point
│   ├── analysis.ipynb          #   Subspace analysis and dimensionality experiments
│   ├── steering.ipynb          #   Steering vector experiments
│   ├── steering-inf.ipynb      #   Steered model evaluation
│   └── tok-cos-plot.ipynb      #   Token-level cosine similarity analysis
│
├── requirements.txt            # Root dependencies
└── README.md
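The steering files above (`steering.py`, `steering_inf.py`) compute and apply activation steering vectors. The repo's exact method may differ, but a common recipe is: take the difference of mean activations on harmful vs. benign prompts, then ablate that direction from hidden states at inference. A self-contained sketch with synthetic activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Hypothetical layer activations collected on harmful vs. benign prompts;
# the "harmful" set is shifted along one direction for illustration.
harmful_acts = rng.standard_normal((200, d)) + 2.0 * np.eye(d)[0]
benign_acts = rng.standard_normal((200, d))

# Difference-of-means steering direction, normalized to unit length.
v = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
v /= np.linalg.norm(v)

def steer(h: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove (alpha times) the component of hidden state h along v."""
    return h - alpha * (h @ v) * v

h = harmful_acts[0]
h_steered = steer(h, v)
# With alpha=1 and unit-norm v, the steered state is orthogonal to v.
print(abs(h_steered @ v))
```

In a real model this projection would be applied inside a forward hook at a chosen layer; the finding that steering reduces but does not eliminate misalignment suggests the harmful behavior is not fully captured by any single direction.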

Citation

@misc{gulati2026narrowfinetuningerodessafety,
      title={Narrow Fine-Tuning Erodes Safety Alignment in Vision-Language Agents},
      author={Idhant Gulati and Shivam Raval},
      year={2026},
      eprint={2602.16931},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.16931},
}
