Convergence issues when using C-Radio (v2/v3/v4) as Vision Encoder in LLaVA-style pretraining #167

@haoyi199815

Description

Hello,
I am attempting to integrate the C-Radio series (v2, v3, v4) as the vision encoder for an MLLM, following a standard LLaVA architecture (Vision Encoder + MLP Projector + LLM).

The Setup

  • Architecture: ViT (C-Radio) + 2-layer MLP Projector + LLM.
  • Training Stage: Pretraining.
  • Strategy: Freeze the Vision Encoder, freeze the LLM, train only the Projector (a minimal sketch of this setup follows the list).
  • Data: Standard LLaVA-558K alignment image-text pairs.
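
For reference, a minimal sketch of the projector training setup (the dimensions, the `MLPProjector` name, and the optimizer settings below are illustrative placeholders, not my exact code):

```python
import torch
import torch.nn as nn

# Illustrative dimensions; the real values depend on the C-Radio variant
# and the LLM being used (both are placeholders here).
VISION_DIM = 1280   # hidden size of the frozen vision encoder
LLM_DIM = 4096      # hidden size of the frozen LLM

class MLPProjector(nn.Module):
    """2-layer MLP that maps vision tokens into the LLM embedding space."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, VISION_DIM)
        return self.proj(vision_tokens)

projector = MLPProjector(VISION_DIM, LLM_DIM)

# Pretraining stage: only the projector receives gradients; the vision
# encoder and the LLM (instantiated elsewhere) stay frozen, e.g.
#   for p in vision_encoder.parameters(): p.requires_grad_(False)
#   for p in llm.parameters():            p.requires_grad_(False)
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
```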

The Issue
I have observed that when using C-Radio v2, v3, or v4, the model fails to converge during this pretraining stage; the loss remains high.
To verify my training pipeline, I tested the exact same setup with other vision encoders:

  • SigLIP/SigLIP2: Converges normally.
  • DINOv2/DINOv3: Converges normally.

This suggests the pipeline and hyperparameters are generally correct, but there is a specific incompatibility or configuration mismatch with the C-Radio features.
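
As a next debugging step, I am inspecting the statistics of the features each encoder feeds into the projector. Below is a minimal diagnostic sketch (separate from my training code) using the torch.hub loading pattern from this repo's README; the `c-radio_v3-h` version string is an assumption and may need adjusting:

```python
import torch

# Load a C-RADIO model via torch.hub, following the pattern in the
# NVlabs/RADIO README (the version string may need to be changed).
radio = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='c-radio_v3-h', progress=True,
                       skip_validation=True)
radio.eval()

# Per the README, RADIO expects input values in the [0, 1] range.
x = torch.rand(1, 3, 512, 512)

with torch.no_grad():
    summary, spatial = radio(x)  # spatial: (batch, num_patches, dim)

token_norms = spatial.norm(dim=-1)
print('spatial feature shape:', tuple(spatial.shape))
print(f'per-token L2 norm: mean={token_norms.mean():.2f}, std={token_norms.std():.2f}')

# Running the same statistics on SigLIP/DINOv2 patch features would show
# whether the C-RADIO features are on a very different scale, which could
# explain why identical projector hyperparameters converge only for the
# other encoders.
```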

Are there any known issues or specific requirements when using C-Radio features for MLLM alignment? Any insights or suggestions would be greatly appreciated. Thanks!
