
[ICCV 2025] ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Paper | Project

Authors: Yifan Li, Xin Li, Tianqin Li, Wenbin He, Yu Kong, Liu Ren

The official implementation of our ICCV 2025 paper "ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads".

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@InProceedings{Li_2025_ICCV,
    author    = {Li, Yifan and Li, Xin and Li, Tianqin and He, Wenbin and Kong, Yu and Ren, Liu},
    title     = {ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {1979-1989}
}

Abstract


Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between the convolutional neural network (CNN) branch and the VFM backbone triggers early-layer gradient backpropagation. Second, existing methods require tuning all components, adding complexity. Moreover, these adapters alter the VFM features, underutilizing their prior knowledge. To tackle these challenges, we propose ViT-Split, based on a key observation: the layers of several VFMs, such as DINOv2, can be divided into two distinct components, an extractor that learns low-level features and an adapter that learns task-specific features. Leveraging this insight, we eliminate the CNN branch and attach two heads, a task head and a prior head, to the frozen VFM. The task head learns task-specific features, mitigating the early-gradient-propagation issue. The prior head exploits the multi-scale prior features from the frozen VFM, reducing the number of tuned parameters and the risk of overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time by up to $4\times$ while achieving comparable or even better results on ADE20K compared with other VFM adapters.

Method

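As a rough illustration of the splitting idea described in the abstract, here is a minimal PyTorch sketch: a frozen DINOv2 ViT-S/14 plays the role of the VFM, copies of its last blocks form a tunable task head, and a small prior head fuses multi-scale features taken from the frozen backbone. The backbone choice, split point, layer indices, head designs, and the toy decoder are illustrative assumptions, not the configuration used in this repository; please refer to the task-specific codebases for the actual implementation.

    # ViT-Split-style sketch (illustrative only, not the official implementation).
    # Assumptions not taken from the released code: DINOv2 ViT-S/14 as the frozen
    # VFM, a split after block 8, a task head copied from the last 4 blocks, a
    # 1x1-conv prior head over 4 intermediate layers, and a linear decoder.
    import copy
    import torch
    import torch.nn as nn


    class ViTSplitSketch(nn.Module):
        def __init__(self, num_classes=150, split_at=8, prior_layers=(2, 5, 8, 11)):
            super().__init__()
            self.vfm = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
            dim = self.vfm.embed_dim
            self.split_at = split_at
            self.prior_layers = list(prior_layers)

            # Task head: tunable copies of the later ("adapter") blocks.
            self.task_head = nn.ModuleList(
                copy.deepcopy(blk) for blk in self.vfm.blocks[split_at:]
            )
            # Freeze the whole VFM after copying, so only the two heads train
            # and no gradients reach the early (extractor) layers.
            for p in self.vfm.parameters():
                p.requires_grad_(False)

            # Prior head: lightweight fusion of frozen multi-scale features.
            self.prior_head = nn.Conv2d(dim * len(prior_layers), dim, kernel_size=1)
            # Toy decoder standing in for a real segmentation/detection head.
            self.decoder = nn.Conv2d(2 * dim, num_classes, kernel_size=1)

        def forward(self, x):
            b, _, h, w = x.shape
            gh, gw = h // 14, w // 14  # DINOv2 uses a 14x14 patch size

            with torch.no_grad():
                # Multi-scale prior features from the frozen VFM, (B, C, gh, gw) each.
                priors = self.vfm.get_intermediate_layers(
                    x, n=self.prior_layers, reshape=True
                )
                # Extractor output: patch tokens right after the split point.
                tokens = self.vfm.get_intermediate_layers(
                    x, n=[self.split_at - 1], norm=False
                )[0]

            for blk in self.task_head:  # task-specific features; gradients start here
                tokens = blk(tokens)
            task_feat = tokens.transpose(1, 2).reshape(b, -1, gh, gw)

            prior_feat = self.prior_head(torch.cat(priors, dim=1))
            return self.decoder(torch.cat([task_feat, prior_feat], dim=1))


    if __name__ == "__main__":
        model = ViTSplitSketch()
        out = model(torch.randn(1, 3, 224, 224))
        print(out.shape)  # torch.Size([1, 150, 16, 16])

Because the VFM is frozen and the tunable blocks sit on top of its output, backpropagation stops at the copied task-head blocks, which is the source of the training-time savings described in the abstract.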

Experiments

We demonstrate ViT-Split on four tasks: segmentation, detection, visual question answering (VQA), and monocular depth estimation (MDE).

  • Conda Environments
    • Detection and segmentation tasks share the same environment: vitsplit
    • VQA uses vitsplit-llava
    • MDE uses vitsplit-mde
    • For detailed setup instructions, please refer to each task's codebase

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Acknowledgement
