ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads



Michigan State University · Carnegie Mellon University · Bosch

Abstract

Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between the convolutional neural network (CNN) branch and the VFM backbone triggers gradient backpropagation through the early layers. Second, existing methods require tuning all components, adding complexity. Moreover, these adapters alter the VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, like DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce two heads, a task head and a prior head, to the frozen VFM. The task head is designed to learn task-specific features, mitigating the early gradient propagation issue. The prior head is used to leverage the multi-scale prior features from the frozen VFM, reducing tuning parameters and overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time by up to $4\times$ while achieving comparable or even better results on ADE20K, compared to other VFM adapters.

Motivation

Comparison between previous VFM adapters and ours. Previous VFM adapters (like ViT-Adapter, ViT-CoMer) integrate low-level features learned by a CNN branch into a learnable VFM through an adapter. Our method exploits VFM prior knowledge with two heads: a prior head for multi-scale prior feature learning from a frozen VFM, and a task head for task-specific feature learning, initialized by the last few layers of the VFM.

Highlights

Key highlights of ViT-Split:

  1. New observations in VFMs. We observe that several VFMs, especially DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features.
  2. New VFM adapter. We propose an efficient and effective adapter ViT-Split for VFMs. Specifically, ViT-Split introduces two heads, a task head and a prior head. The task head is for learning task-specific features. The prior head is a lightweight CNN for extracting multi-scale prior features from a frozen VFM.
  3. Performance. We perform extensive experiments and detailed ablations on various downstream tasks to validate the effectiveness of our method, including segmentation, detection and VQA.

Observation in VFMs

The CKA comparison of layer features across different VFMs, including a self-supervised method (DINOv2-L) and three image-text alignment methods (EVA2-L, CLIP-L, and SigLIP-L). For most of these VFMs, especially DINOv2, the features in the early and later layers show distinct similarities within their respective groups.

Notably, the segmentation and detection models are fine-tuned from DINOv2-S. The features within the red dotted boxes across the three tasks exhibit similar patterns, emphasizing detailed representations. In the later layers, however, the features diverge, becoming more specialized for each task.
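
For readers who want to reproduce this observation, here is a minimal sketch of a linear-CKA comparison between per-layer patch features. It is not the authors' evaluation code; the torch.hub DINOv2 entry point and its get_intermediate_layers call are assumed, and the random input batch stands in for real images.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two feature matrices of shape (n_samples, dim)."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (x.T @ y).norm() ** 2          # ||X^T Y||_F^2
    return hsic / ((x.T @ x).norm() * (y.T @ y).norm())

# Assumed setup: DINOv2-L via torch.hub; replace the random batch with real images.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()
images = torch.randn(8, 3, 224, 224)

with torch.no_grad():
    # Patch-token outputs of all 24 transformer blocks, each of shape (B, N, C).
    feats = model.get_intermediate_layers(images, n=24)

layers = [f.flatten(0, 1) for f in feats]  # (B*N, C) per layer
cka = torch.zeros(len(layers), len(layers))
for i, fi in enumerate(layers):
    for j, fj in enumerate(layers):
        cka[i, j] = linear_cka(fi, fj)
# Early layers form one high-similarity block and later layers another,
# which is the extractor/adapter split the paper builds on.
```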

ViT-Split Architecture

ViT-Split includes three trainable components.

Task head:

    Since the early layers of VFMs learn low-level features that are similar across different tasks, we avoid tuning the entire backbone by sharing these frozen early layers. Meanwhile, to retain the prior features of the VFM, we replicate the final \(K_t\) layers and use them as a task-specific adapter (task head) for downstream tasks. The hyper-parameter \(K_t\) controls the adapter's size, balancing model capacity against training efficiency.
Prior head:
    The prior features learned by VFMs have demonstrated strong performance across a range of downstream tasks. However, most current VFM adapters and PEFT methods modify these prior features during training. In contrast, our ViT-Split approach fully leverages the prior knowledge embedded in the multi-scale features of the VFM through a dedicated prior head. Our rationale for utilizing these prior features is to harness the knowledge learned by VFMs to enhance task-specific features while mitigating the risk of overfitting to downstream tasks.
Fusion net:
    The fusion net fuses the prior feature map and the task-specific feature map for different downstream tasks (a combined sketch of all three components follows below).
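
To make the split concrete, the following is a minimal PyTorch sketch of how the frozen extractor, task head, prior head, and fusion net could be wired together. It is an illustrative reconstruction, not the released implementation: the choice of prior layers, the CNN prior head, the 1×1 fusion convolution, and the assumption that `tokens` are patch tokens (no CLS token) are all placeholders.

```python
import copy
import torch
import torch.nn as nn

class ViTSplitSketch(nn.Module):
    """Illustrative wiring of ViT-Split's components (not the official code)."""

    def __init__(self, vfm: nn.Module, k_t: int = 4, dim: int = 1024, out_dim: int = 256):
        super().__init__()
        # Frozen VFM: shared low-level extractor and source of prior features.
        self.vfm = vfm.eval()
        for p in self.vfm.parameters():
            p.requires_grad_(False)

        # Task head: trainable copy of the last K_t transformer blocks.
        self.k_t = k_t
        self.task_head = copy.deepcopy(self.vfm.blocks[-k_t:])
        for p in self.task_head.parameters():
            p.requires_grad_(True)

        # Prior head: lightweight CNN over multi-scale features taken from
        # several frozen layers (the layer indices here are placeholders).
        self.prior_layers = (5, 11, 17, 23)
        self.prior_head = nn.Sequential(
            nn.Conv2d(dim * len(self.prior_layers), out_dim, 1),
            nn.GELU(),
            nn.Conv2d(out_dim, out_dim, 3, padding=1),
        )
        # Fusion net: merges the task-specific and prior feature maps.
        self.fusion = nn.Conv2d(dim + out_dim, out_dim, 1)

    def forward(self, tokens: torch.Tensor, hw: tuple) -> torch.Tensor:
        # `tokens` are assumed to be patch tokens of shape (B, N, C); CLS-token
        # handling and the patch-embedding step are omitted for brevity.
        h, w = hw
        feats = []
        with torch.no_grad():                       # frozen forward pass
            x = tokens
            for blk in self.vfm.blocks:
                x = blk(x)
                feats.append(x)

        # Task head: re-run the last K_t layers with trainable weights,
        # starting from the frozen feature just before those layers.
        t = feats[-self.k_t - 1]
        for blk in self.task_head:
            t = blk(t)

        def to_map(tok):                            # (B, N, C) -> (B, C, H, W)
            return tok.transpose(1, 2).reshape(tok.shape[0], -1, h, w)

        prior = self.prior_head(torch.cat([to_map(feats[i]) for i in self.prior_layers], dim=1))
        task = to_map(t)
        return self.fusion(torch.cat([task, prior], dim=1))
```

Note that only the task head, prior head, and fusion net receive gradients in this sketch; backpropagation never reaches the shared early VFM layers, which is where the training-time savings come from.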

Experiment results: segmentation, detection, VQA

Exp 1: semantic segmentation on ADE20K.
Exp 1: semantic segmentation on Cityscapes.
Exp 2: object detection on COCO val2017.
Exp 3: VQA on multiple benchmarks.

BibTeX


@misc{li2025vitsplit,
  title={ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads},
  author={Yifan Li and Tianqin Li and Xin Li and Wenbin He and Yu Kong and Liu Ren},
  year={2025},
  booktitle={arXiv}
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, as well as the open-source projects Alpaca and Vicuna.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of DINOv2, MaskRCNN, LLaVA, and CLIP. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.