
[ICCV 2025] ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Paper | Project

Authors: Yifan Li, Xin Li, Tianqin Li, Wenbin He, Yu Kong, Liu Ren

The official implementation of our ICCV 2025 paper "ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads".

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@InProceedings{Li_2025_ICCV,
    author    = {Li, Yifan and Li, Xin and Li, Tianqin and He, Wenbin and Kong, Yu and Ren, Liu},
    title     = {ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {1979-1989}
}

Abstract


Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between the convolutional neural network (CNN) branch and the VFM backbone triggers early-layer gradient backpropagation. Second, existing methods require tuning all components, adding complexity. Moreover, these adapters alter the VFM features, underutilizing their prior knowledge. To tackle these challenges, we propose ViT-Split, based on a key observation: the layers of several VFMs, such as DINOv2, can be divided into two distinct components, an extractor that learns low-level features and an adapter that learns task-specific features. Leveraging this insight, we eliminate the CNN branch and attach two heads, a task head and a prior head, to the frozen VFM. The task head learns task-specific features, mitigating the early-gradient-propagation issue. The prior head exploits the multi-scale prior features from the frozen VFM, reducing the number of tuned parameters and the risk of overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time by up to $4\times$ while achieving comparable or even better results on ADE20K compared with other VFM adapters.

Method

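As a rough illustration of the splitting idea described in the abstract, here is a minimal PyTorch sketch: a frozen DINOv2 ViT-S/14 plays the role of the VFM, copies of its last blocks form a tunable task head, and a small prior head fuses multi-scale features taken from the frozen backbone. The backbone choice, split point, layer indices, head designs, and the toy decoder are illustrative assumptions, not the configuration used in this repository; please refer to the task-specific codebases for the actual implementation.

    # ViT-Split-style sketch (illustrative only, not the official implementation).
    # Assumptions not taken from the released code: DINOv2 ViT-S/14 as the frozen
    # VFM, a split after block 8, a task head copied from the last 4 blocks, a
    # 1x1-conv prior head over 4 intermediate layers, and a linear decoder.
    import copy
    import torch
    import torch.nn as nn


    class ViTSplitSketch(nn.Module):
        def __init__(self, num_classes=150, split_at=8, prior_layers=(2, 5, 8, 11)):
            super().__init__()
            self.vfm = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
            dim = self.vfm.embed_dim
            self.split_at = split_at
            self.prior_layers = list(prior_layers)

            # Task head: tunable copies of the later ("adapter") blocks.
            self.task_head = nn.ModuleList(
                copy.deepcopy(blk) for blk in self.vfm.blocks[split_at:]
            )
            # Freeze the whole VFM after copying, so only the two heads train
            # and no gradients reach the early (extractor) layers.
            for p in self.vfm.parameters():
                p.requires_grad_(False)

            # Prior head: lightweight fusion of frozen multi-scale features.
            self.prior_head = nn.Conv2d(dim * len(prior_layers), dim, kernel_size=1)
            # Toy decoder standing in for a real segmentation/detection head.
            self.decoder = nn.Conv2d(2 * dim, num_classes, kernel_size=1)

        def forward(self, x):
            b, _, h, w = x.shape
            gh, gw = h // 14, w // 14  # DINOv2 uses a 14x14 patch size

            with torch.no_grad():
                # Multi-scale prior features from the frozen VFM, (B, C, gh, gw) each.
                priors = self.vfm.get_intermediate_layers(
                    x, n=self.prior_layers, reshape=True
                )
                # Extractor output: patch tokens right after the split point.
                tokens = self.vfm.get_intermediate_layers(
                    x, n=[self.split_at - 1], norm=False
                )[0]

            for blk in self.task_head:  # task-specific features; gradients start here
                tokens = blk(tokens)
            task_feat = tokens.transpose(1, 2).reshape(b, -1, gh, gw)

            prior_feat = self.prior_head(torch.cat(priors, dim=1))
            return self.decoder(torch.cat([task_feat, prior_feat], dim=1))


    if __name__ == "__main__":
        model = ViTSplitSketch()
        out = model(torch.randn(1, 3, 224, 224))
        print(out.shape)  # torch.Size([1, 150, 16, 16])

Because the VFM is frozen and the tunable blocks sit on top of its output, backpropagation stops at the copied task-head blocks, which is the source of the training-time savings described in the abstract.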

Experiments

We demonstrate ViT-Split on four tasks: segmentation, detection, visual question answering (VQA), and monocular depth estimation (MDE).

  • Conda Environments
    • Detection and segmentation tasks share the same environment: vitsplit
    • VQA uses vitsplit-llava
    • MDE uses vitsplit-mde
    • For detailed setup instructions, please refer to each task's codebase

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Acknowledgement
