[ICCV 2025] ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads
Authors: Yifan Li, Xin Li, Tianqin Li, Wenbin He, Yu Kong, Liu Ren
The official implementation of our ICCV 2025 paper "ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads".
If this work is helpful for your research, please consider citing it using the following BibTeX entry.
@InProceedings{Li_2025_ICCV,
    author    = {Li, Yifan and Li, Xin and Li, Tianqin and He, Wenbin and Kong, Yu and Ren, Liu},
    title     = {ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {1979-1989}
}
Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between the convolutional neural network (CNN) branch and the VFM backbone triggers gradient backpropagation through the early layers. Second, existing methods require tuning all components, adding complexity. Moreover, these adapters alter the VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, such as DINOv2, can be divided into two distinct components, an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce two heads, a task head and a prior head, on top of the frozen VFM. The task head learns task-specific features, mitigating the early gradient propagation issue. The prior head leverages the multi-scale prior features from the frozen VFM, reducing the number of tuned parameters and the risk of overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, and visual question answering) validate the effectiveness and efficiency of ViT-Split. In particular, ViT-Split substantially reduces training time compared with existing VFM adapters while achieving comparable or better results.
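For a concrete picture of the splitting-heads idea, below is a minimal, illustrative PyTorch sketch. The `ToyViT` stand-in backbone, the split layer index, the choice of prior layers, and the concatenation-based fusion are all assumptions made for illustration, not the paper's exact design; please refer to the code in this repository for the actual implementation.

```python
# Illustrative sketch of the ViT-Split idea (NOT the official implementation).
# A frozen ViT serves as a shared extractor; its intermediate features feed
# (1) a trainable task head that learns task-specific features and
# (2) a lightweight prior head that fuses multi-scale features from several frozen layers.
import torch
import torch.nn as nn


class ToyViT(nn.Module):
    """Stand-in for a frozen VFM backbone (e.g., DINOv2); returns tokens from every layer."""
    def __init__(self, dim=384, depth=12, num_heads=6, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, C)
        feats = []
        for blk in self.blocks:
            tokens = blk(tokens)
            feats.append(tokens)
        return feats


class ViTSplitSketch(nn.Module):
    def __init__(self, backbone, split_at=8, prior_layers=(3, 5, 7, 11), dim=384, num_classes=150):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # keep the VFM frozen
            p.requires_grad = False
        self.split_at = split_at               # layers before this act as the shared extractor
        self.prior_layers = prior_layers       # frozen layers tapped by the prior head
        # Task head: a small trainable stack on top of the frozen extractor output.
        self.task_head = nn.Sequential(
            nn.TransformerEncoderLayer(dim, 6, dim * 4, batch_first=True),
            nn.TransformerEncoderLayer(dim, 6, dim * 4, batch_first=True),
        )
        # Prior head: fuse multi-scale frozen features with a light projection.
        self.prior_proj = nn.Linear(dim * len(prior_layers), dim)
        self.classifier = nn.Linear(dim * 2, num_classes)

    def forward(self, x):
        with torch.no_grad():                  # no gradients flow into the frozen VFM
            feats = self.backbone(x)
        task_feat = self.task_head(feats[self.split_at - 1])
        prior_feat = self.prior_proj(torch.cat([feats[i] for i in self.prior_layers], dim=-1))
        fused = torch.cat([task_feat, prior_feat], dim=-1)
        return self.classifier(fused)          # (B, N, num_classes) per-token logits


if __name__ == "__main__":
    model = ViTSplitSketch(ToyViT())
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 196, 150])
```

Because the backbone is frozen and only the two heads receive gradients, training avoids backpropagation through the early VFM layers and tunes far fewer parameters than full-adapter approaches.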
We demonstrate ViT-Split on four tasks: segmentation, detection, visual question answering (VQA), and monocular depth estimation (MDE).
- Conda Environments
- Detection and segmentation tasks share the same environment: vitsplit
- VQA uses vitsplit-llava
- MDE uses vitsplit-mde
- For detailed setup instructions, please refer to the corresponding codebases
This repository is released under the Apache 2.0 license as found in the LICENSE file.
- Thanks to these awesome open-sourced projects: ViT-Adapter, ViT-CoMer, benchmark-cfm-ss, LLaVA, MDE-toolbox!

