Hello,
I am attempting to integrate the C-Radio series (v2, v3, v4) as the vision encoder for an MLLM, following the standard LLaVA architecture (Vision Encoder + MLP Projector + LLM).
The Setup
- Architecture: ViT (C-Radio) + 2-layer MLP Projector + LLM.
- Training Stage: Pretraining.
- Strategy: Freeze Vision Encoder, Freeze LLM, only train the Projector.
- Data: the standard LLaVA 558K alignment image-text pairs.
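For reference, the trainable-projector setup above boils down to something like the following PyTorch sketch. All names and dimensions here are illustrative (not from any particular codebase); `vision_encoder` and `llm` stand in for the frozen C-Radio and LLM modules.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """2-layer MLP mapping vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.net(vision_feats)

# Freeze everything except the projector (encoder/LLM are placeholders):
# for p in vision_encoder.parameters(): p.requires_grad_(False)
# for p in llm.parameters(): p.requires_grad_(False)

projector = Projector(vision_dim=1280, llm_dim=4096)  # dims are illustrative
feats = torch.randn(2, 256, 1280)   # (batch, num_patches, vision_dim)
tokens = projector(feats)           # -> (2, 256, 4096), fed to the LLM
```

Only `projector.parameters()` are passed to the optimizer in this stage.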
The Issue
I have observed that with c-radiov2, c-radiov3, or c-radiov4, the model fails to converge during this pretraining stage: the loss remains high.
To verify my training pipeline, I tested the exact same setup with other vision encoders:
- SigLIP/SigLIP2: Converges normally.
- DINOv2/DINOv3: Converges normally.
This suggests that the pipeline and hyperparameters are generally correct, and that there is instead a specific incompatibility or configuration mismatch with the C-Radio features.
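One quick way to probe for such a mismatch is to compare raw output statistics across encoders: if the C-Radio features have a very different scale or contain high-norm outlier tokens relative to SigLIP/DINOv2, a freshly initialized projector may struggle. This is a hypothesis, not a confirmed cause; the sketch below (encoder calls are placeholders) just summarizes per-token statistics so the encoders can be compared on the same image.

```python
import numpy as np

def feature_stats(feats: np.ndarray) -> dict:
    """Summarize a (num_tokens, dim) feature array from a vision encoder."""
    norms = np.linalg.norm(feats, axis=-1)  # per-token L2 norms
    return {
        "mean": float(feats.mean()),
        "std": float(feats.std()),
        "token_norm_mean": float(norms.mean()),
        "token_norm_max": float(norms.max()),
    }

# Compare the same image through each encoder (calls are placeholders):
# stats_radio  = feature_stats(radio_encoder(img))
# stats_siglip = feature_stats(siglip_encoder(img))

# Demo on synthetic features so the function is self-contained:
rng = np.random.default_rng(0)
stats = feature_stats(rng.normal(size=(256, 1024)).astype(np.float32))
```

If the scales do differ by orders of magnitude, a LayerNorm on the encoder output before the projector would be one thing to try.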
Are there any known issues or specific requirements when using C-Radio features for MLLM alignment? Any insights or suggestions would be greatly appreciated. Thanks!