Idhant Gulati $^1$ , Shivam Raval $^2$
Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates a fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.
- Multimodal gap: Multimodal evaluation reveals significantly higher misalignment than text-only — unimodal benchmarks underestimate alignment degradation in VLMs
- LoRA rank scaling: Misalignment scales monotonically with LoRA rank ($r=8$ to $r=256$)
- Data poisoning threshold: As little as 10% harmful data in the training mix induces substantial alignment degradation
- Low-dimensional subspace: Harmful behaviors are concentrated in a surprisingly small geometric subspace (~10 principal components)
- Mitigation: Activation steering and benign fine-tuning both reduce misalignment substantially, but neither fully recovers alignment
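
The activation-steering mitigation above can be sketched in a few lines. The snippet below is a minimal NumPy illustration of difference-of-means steering on synthetic data, not the repo's actual `steering.py` implementation; the function names and the `alpha` scaling are assumptions for illustration:

```python
import numpy as np

def steering_vector(harmful_acts: np.ndarray, benign_acts: np.ndarray) -> np.ndarray:
    """Unit-norm difference-of-means direction between two activation sets."""
    v = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along v; a negative alpha suppresses that direction."""
    return hidden + alpha * v

rng = np.random.default_rng(0)
offset = np.zeros(64)
offset[0] = 3.0                                  # "harmful" activations shifted along dim 0
harmful = rng.normal(size=(50, 64)) + offset
benign = rng.normal(size=(50, 64))

v = steering_vector(harmful, benign)
h = harmful[0]
h_steered = steer(h, v, alpha=-float(h @ v))     # project out the harmful direction
```

In practice the same shift would be applied to intermediate hidden states via a forward hook during generation; the toy vectors here only demonstrate the arithmetic.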
Run `bash prep.sh` to set up the environment and data.

Fine-tuned model weights (LoRA rank sweep: r=8, 16, 32, 64, 128, 256) are available on HuggingFace.
├── syn-data-gen/ # Synthetic harmful dataset generation
│ ├── inference.py # Qwen3-235B inference for data synthesis
│ ├── prompt.txt # Generation prompt template
│ ├── data-prep.ipynb # Dataset formatting and filtering
│ ├── config.py
│ └── README.md
│
├── data-prep.ipynb # Root-level dataset preparation and mixing
├── prep.sh # Environment / data setup script
│
│ # LoRA fine-tuning notebooks (Unsloth + TRL)
├── gemma3-lora-faces.ipynb # Gemma3-4B on facial recognition task (main)
├── gemma3-lora-text.ipynb # Gemma3-4B text-only variant
├── qwen-lora-text.ipynb # Qwen text-only fine-tuning
├── qwen-vl-lora-text.ipynb # Qwen-VL fine-tuning
│
├── em-judge/ # Emergent misalignment evaluation pipeline
│ ├── base_model_inference.py # Run base (untuned) model on eval set
│ ├── ft_model_inference.py # Run fine-tuned models (rank sweep) on eval set
│ ├── judge_inference.py # GLM-4.6V judge scoring (local vLLM)
│ ├── judge_inference-oai.py # Judge scoring via OpenAI-compatible endpoint
│ ├── judge-prompt.txt # Full judge prompt
│ ├── judge-prompt-clean.txt # Simplified judge prompt
│ ├── main.py # Orchestration entry point
│ ├── config.py
│ ├── visuals.ipynb # Result plots and figures
│ ├── chat-ui.ipynb # Interactive chat interface for qualitative review
│ ├── io/input-sample.json # Sample evaluation inputs
│ ├── README.md # vLLM server setup instructions
│ └── requirements.txt
│
├── subspace-analysis/ # Geometric analysis and activation steering
│ ├── activation_extraction.py # Extract hidden-state activations from model
│ ├── extract.py # Activation extraction utilities
│ ├── svd.py # PCA / SVD decomposition of activation space
│ ├── cos_plot.py # Cosine similarity plots across layers/ranks
│ ├── steering.py # Activation steering vector computation
│ ├── steering_inf.py # Steered inference engine
│ ├── main-steer.py # Steering inference entry point
│ ├── analysis.ipynb # Subspace analysis and dimensionality experiments
│ ├── steering.ipynb # Steering vector experiments
│ ├── steering-inf.ipynb # Steered model evaluation
│ └── tok-cos-plot.ipynb # Token-level cosine similarity analysis
│
├── requirements.txt # Root dependencies
└── README.md
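
The low-dimensional-subspace finding (the kind of analysis `subspace-analysis/svd.py` performs) can be illustrated in miniature. The snippet below uses synthetic data standing in for fine-tuned-minus-base activation differences — not the paper's actual activations — to show how SVD measures the variance captured by the top 10 principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 512, 1000, 10                            # hidden dim, samples, planted subspace dim

# Synthetic stand-in for activation differences: strong signal confined to a
# k-dimensional subspace, plus small isotropic noise across all d dimensions.
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal k-dim basis
acts = rng.normal(size=(n, k)) @ basis.T * 5.0 + rng.normal(size=(n, d)) * 0.1

acts -= acts.mean(axis=0)                          # center before PCA
_, s, _ = np.linalg.svd(acts, full_matrices=False)
explained = s**2 / (s**2).sum()
top10 = float(explained[:10].sum())
print(f"variance captured by top-10 PCs: {top10:.3f}")
```

With real activations, `acts` would be hidden states extracted by `activation_extraction.py`; a top-10 share near 1.0 indicates the behavioral shift lives in a low-dimensional subspace.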
@misc{gulati2026narrowfinetuningerodessafety,
title={Narrow Fine-Tuning Erodes Safety Alignment in Vision-Language Agents},
author={Idhant Gulati and Shivam Raval},
year={2026},
eprint={2602.16931},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.16931},
}