
Narrow Fine-Tuning Erodes Safety Alignment in Vision-Language Agents

Paper (OpenReview) · Paper (arXiv) · Poster (Coming Soon) · HuggingFace (Coming Soon) · Code

Idhant Gulati and Shivam Raval

$^1$ University of California, Berkeley   $^2$ Harvard University

Abstract

Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates a fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.

Key Findings

  • Multimodal gap: Multimodal evaluation reveals significantly higher misalignment than text-only — unimodal benchmarks underestimate alignment degradation in VLMs
  • LoRA rank scaling: Misalignment scales monotonically with LoRA rank ($r=8$ to $r=256$)
  • Data poisoning threshold: As little as 10% harmful data in the training mix induces substantial alignment degradation
  • Low-dimensional subspace: Harmful behaviors are concentrated in a surprisingly small geometric subspace (~10 principal components)
  • Mitigation: Activation steering and benign fine-tuning both reduce misalignment substantially, but neither fully recovers alignment
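The low-dimensional-subspace finding can be illustrated with a standard PCA-via-SVD check: stack per-example activation differences, decompose, and measure how much variance the top components explain. The sketch below uses synthetic data with a planted 10-dimensional signal (all shapes and data are illustrative, not the paper's actual activations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-example hidden-state differences
# (fine-tuned minus base model) at one layer: n examples x d dims.
# We synthesize data whose variance lives mostly in a 10-dim subspace.
n, d, k_true = 512, 256, 10
basis = rng.standard_normal((k_true, d))
acts = rng.standard_normal((n, k_true)) @ basis + 0.05 * rng.standard_normal((n, d))

# Center the rows, then decompose with SVD (equivalent to PCA).
centered = acts - acts.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
var_ratio = s**2 / np.sum(s**2)

# Fraction of variance explained by the top-10 principal components.
top10 = var_ratio[:10].sum()
print(f"top-10 PCs explain {top10:.1%} of variance")
```

On real activations, a curve of `var_ratio` across ranks and layers is what distinguishes a genuinely low-dimensional harmful subspace from diffuse drift.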

Setup

bash prep.sh

Fine-tuned model weights (LoRA rank sweep: r=8, 16, 32, 64, 128, 256) are available on HuggingFace.
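For readers unfamiliar with what the rank sweep varies: LoRA parameterizes the weight update as a low-rank product, so `r` bounds the dimensionality of what the adapter can learn. A minimal numpy sketch of the arithmetic (shapes and scaling are illustrative, not taken from the repo's training code):

```python
import numpy as np

rng = np.random.default_rng(0)

# LoRA replaces a dense weight update dW with a low-rank factorization:
#   W' = W + (alpha / r) * B @ A,   A: (r, d_in), B: (d_out, r)
# The rank r is the knob swept in this work (r = 8 ... 256).
d_in, d_out, r, alpha = 64, 64, 8, 16
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
# Real LoRA initializes B to zero so training starts from W;
# we use a nonzero B here purely to show the rank bound.
B = rng.standard_normal((d_out, r)) * 0.01

delta = (alpha / r) * B @ A
W_adapted = W + delta

# However large W is, the update's rank can never exceed r.
print(np.linalg.matrix_rank(delta))
```

Higher `r` means a higher-dimensional update, which is consistent with the finding that misalignment scales monotonically with rank.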

Repository Structure

├── syn-data-gen/               # Synthetic harmful dataset generation
│   ├── inference.py            #   Qwen3-235B inference for data synthesis
│   ├── prompt.txt              #   Generation prompt template
│   ├── data-prep.ipynb         #   Dataset formatting and filtering
│   ├── config.py
│   └── README.md
│
├── data-prep.ipynb             # Root-level dataset preparation and mixing
├── prep.sh                     # Environment / data setup script
│
│                               # LoRA fine-tuning notebooks (Unsloth + TRL)
├── gemma3-lora-faces.ipynb     #   Gemma3-4B on facial recognition task (main)
├── gemma3-lora-text.ipynb      #   Gemma3-4B text-only variant
├── qwen-lora-text.ipynb        #   Qwen text-only fine-tuning
├── qwen-vl-lora-text.ipynb     #   Qwen-VL fine-tuning
│
├── em-judge/                   # Emergent misalignment evaluation pipeline
│   ├── base_model_inference.py #   Run base (untuned) model on eval set
│   ├── ft_model_inference.py   #   Run fine-tuned models (rank sweep) on eval set
│   ├── judge_inference.py      #   GLM-4.6V judge scoring (local vLLM)
│   ├── judge_inference-oai.py  #   Judge scoring via OpenAI-compatible endpoint
│   ├── judge-prompt.txt        #   Full judge prompt
│   ├── judge-prompt-clean.txt  #   Simplified judge prompt
│   ├── main.py                 #   Orchestration entry point
│   ├── config.py
│   ├── visuals.ipynb           #   Result plots and figures
│   ├── chat-ui.ipynb           #   Interactive chat interface for qualitative review
│   ├── io/input-sample.json    #   Sample evaluation inputs
│   ├── README.md               #   vLLM server setup instructions
│   └── requirements.txt
│
├── subspace-analysis/          # Geometric analysis and activation steering
│   ├── activation_extraction.py #  Extract hidden-state activations from model
│   ├── extract.py              #   Activation extraction utilities
│   ├── svd.py                  #   PCA / SVD decomposition of activation space
│   ├── cos_plot.py             #   Cosine similarity plots across layers/ranks
│   ├── steering.py             #   Activation steering vector computation
│   ├── steering_inf.py         #   Steered inference engine
│   ├── main-steer.py           #   Steering inference entry point
│   ├── analysis.ipynb          #   Subspace analysis and dimensionality experiments
│   ├── steering.ipynb          #   Steering vector experiments
│   ├── steering-inf.ipynb      #   Steered model evaluation
│   └── tok-cos-plot.ipynb      #   Token-level cosine similarity analysis
│
├── requirements.txt            # Root dependencies
└── README.md
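The steering files above (`steering.py`, `steering_inf.py`) compute and apply activation steering vectors. The repo's exact method may differ, but a common recipe is: take the difference of mean activations on harmful vs. benign prompts, then ablate that direction from hidden states at inference. A self-contained sketch with synthetic activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Hypothetical layer activations collected on harmful vs. benign prompts;
# the "harmful" set is shifted along one direction for illustration.
harmful_acts = rng.standard_normal((200, d)) + 2.0 * np.eye(d)[0]
benign_acts = rng.standard_normal((200, d))

# Difference-of-means steering direction, normalized to unit length.
v = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
v /= np.linalg.norm(v)

def steer(h: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove (alpha times) the component of hidden state h along v."""
    return h - alpha * (h @ v) * v

h = harmful_acts[0]
h_steered = steer(h, v)
# With alpha=1 and unit-norm v, the steered state is orthogonal to v.
print(abs(h_steered @ v))
```

In a real model this projection would be applied inside a forward hook at a chosen layer; the finding that steering reduces but does not eliminate misalignment suggests the harmful behavior is not fully captured by any single direction.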

Citation

@misc{gulati2026narrowfinetuningerodessafety,
      title={Narrow Fine-Tuning Erodes Safety Alignment in Vision-Language Agents},
      author={Idhant Gulati and Shivam Raval},
      year={2026},
      eprint={2602.16931},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.16931},
}
