JurasSigLIP is a text-guided segmentation prototype that fuses DINOv3 vision tokens with SigLIP2 text tokens via cross-attention. It aligns global image and text features with a contrastive objective, then trains a segmentation head to produce pixel-level masks conditioned on natural-language captions.
- Vision backbone: DINOv3 ViT (LoRA-adapted), patch tokens and CLS token
- Text backbone: SigLIP2 text tower (LoRA-adapted), sequence tokens and global text feature
- Fusion: multi-head cross-attention from vision patch tokens (queries) to text tokens (keys/values); a minimal sketch follows the diagram below
- Outputs: segmentation mask logits upsampled to 512x512
- Training: contrastive pretraining for global alignment, then segmentation finetuning with Dice + Focal loss
```mermaid
graph TD
    subgraph Inputs
        IMG[Input Image] --> V_PROC[Image Processor]
        TXT[Input Caption] --> T_TOK[Tokenizer]
    end

    subgraph Vision_Branch_DINOv3
        V_PROC --> V_BB["DINOv3 Backbone<br/>(LoRA Adapted)"]
        V_BB --> V_TOK["Patch Tokens<br/>(Local Features)"]
        V_BB --> V_CLS["CLS Token<br/>(Global Feature)"]
    end

    subgraph Text_Branch_SigLIP2
        T_TOK --> T_BB["SigLIP2 Text Model<br/>(LoRA Adapted)"]
        T_BB --> T_FEAT[Text Features]
        T_FEAT --> T_PROJ[Down Projection]
        T_PROJ --> T_SEQ[Text Sequence Tokens]
        T_PROJ --> T_GLO[Global Text Feature]
    end

    subgraph Fusion_Module
        V_TOK -- Query --> X_ATTN[Cross Attention]
        T_SEQ -- Key/Value --> X_ATTN
        X_ATTN --> V_CTX[Contextualized Features]
        V_TOK -- Residual --> ADD((+))
        V_CTX --> ADD
    end

    subgraph Outputs
        ADD --> SEG_HEAD["Segmentation Head<br/>(MLP + Upsample)"]
        SEG_HEAD --> MASK[Segmentation Mask Logits]
        V_CLS -.-> CONT_LOSS[Contrastive Loss]
        T_GLO -.-> CONT_LOSS
    end

    style Vision_Branch_DINOv3 fill:#e1f5fe,stroke:#01579b
    style Text_Branch_SigLIP2 fill:#fff3e0,stroke:#e65100
    style Fusion_Module fill:#f3e5f5,stroke:#4a148c
    style Outputs fill:#e8f5e9,stroke:#1b5e20
```
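The fusion stage can be summarized in a few lines of PyTorch. This is a minimal sketch, not the notebook's implementation: the module name, head count, and the use of `nn.MultiheadAttention` with a post-residual `LayerNorm` are assumptions; only the query/key-value roles and the residual connection come from the diagram.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch: vision patch tokens attend to text tokens, plus a residual."""

    def __init__(self, dim: int = 384, num_heads: int = 8):  # 384 = shared dim
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens, text_tokens, text_padding_mask=None):
        # Queries: vision patches; keys/values: text sequence tokens.
        ctx, _ = self.attn(
            query=patch_tokens,
            key=text_tokens,
            value=text_tokens,
            key_padding_mask=text_padding_mask,
        )
        # Residual connection back onto the patch tokens (the "+" node above).
        return self.norm(patch_tokens + ctx)

# e.g. fuse 1024 patch tokens with a 64-token caption (hypothetical shapes):
# fused = CrossAttentionFusion()(torch.randn(2, 1024, 384), torch.randn(2, 64, 384))
```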
The notebook uses COCO train2014 images and a GRef-style referring expression dataset (via grefs(unc).json) with instance masks from COCO annotations. Each image can have multiple captions, and the dataset pairs each caption with the same segmentation mask for that target.
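A minimal sketch of this caption-to-mask pairing, assuming a simple record schema (`image_path`, `captions`, and `mask` are hypothetical field names; the notebook's actual loader parses grefs(unc).json and instances.json):

```python
from torch.utils.data import Dataset

class RefSegDataset(Dataset):
    """Sketch: flatten (image, [captions], mask) records into per-caption samples."""

    def __init__(self, records):
        # Each record: {"image_path": ..., "captions": [...], "mask": ...}
        self.samples = [
            (r["image_path"], caption, r["mask"])
            for r in records
            for caption in r["captions"]
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Every caption for a target is paired with that target's instance mask.
        return self.samples[idx]
```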
- Contrastive pretraining
  - Align global image and text embeddings with a CLIP-style contrastive loss.
  - Train the LoRA-adapted backbones plus the learnable logit scale.
- Segmentation training
  - Freeze most parameters; train only the cross-attention fusion and the segmentation head.
  - Optimize a Dice + Focal loss on per-caption masks (sketched after this list).
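Sketches of the two objectives, under common formulations: a CLIP-style symmetric cross-entropy for alignment, and soft Dice plus binary Focal loss for segmentation. The function names, the focal `gamma`, and the equal 1:1 weighting of the Dice and Focal terms are assumptions, not values from the notebook.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, logit_scale):
    # CLIP-style symmetric InfoNCE over L2-normalized global embeddings;
    # logit_scale is the learnable temperature trained during pretraining.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def dice_focal_loss(mask_logits, target, gamma=2.0, eps=1e-6):
    # target: float binary mask with the same shape as mask_logits.
    prob = torch.sigmoid(mask_logits)
    # Soft Dice term: overlap between predicted probabilities and the mask.
    inter = (prob * target).sum(dim=(-2, -1))
    denom = prob.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    dice = 1 - (2 * inter + eps) / (denom + eps)
    # Focal term: per-pixel BCE down-weighted on easy pixels.
    bce = F.binary_cross_entropy_with_logits(mask_logits, target, reduction="none")
    p_t = prob * target + (1 - prob) * (1 - target)
    focal = ((1 - p_t) ** gamma * bce).mean(dim=(-2, -1))
    return (dice + focal).mean()  # assumed 1:1 weighting
```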
Key hyperparameters from the notebook:
- Shared embedding dim: 384
- Image size: 512
- LoRA: r=8, alpha=16, dropout=0.0 (see the config sketch after this list)
- Backbones: DINOv3 ViT-S/16+, SigLIP2 base patch16-512
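With the Hugging Face `peft` library, the LoRA settings above translate to a config like the following. The `target_modules` names are assumptions: they must match the actual attention projection names in the DINOv3 and SigLIP2 checkpoints you load.

```python
from peft import LoraConfig, get_peft_model

# r/alpha/dropout match the values listed above; target_modules is a
# placeholder — inspect each backbone for its real projection layer names.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj"],  # hypothetical
)

# Wrap a loaded backbone with the adapter (hypothetical variable name):
# text_model = get_peft_model(text_model, lora_cfg)
```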
Example results (contrastive pretraining and segmentation finetuning):
Each folder in val/ contains one image with multiple captions. The different captions produce different segmentation outputs.
Open jurasSigLIP.ipynb and run the cells in order. Update the dataset paths to your local setup and make sure a CUDA-capable GPU is available for mixed-precision training. You must provide the two JSON files (grefs(unc).json and instances.json) and a folder named gref_images containing all images with segmentation data, as referenced in the notebook paths.
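An assumed working layout (adjust to whatever paths the notebook cells actually use):

```
.
├── grefs(unc).json   # referring expressions
├── instances.json    # COCO instance annotations (masks)
└── gref_images/      # all images with segmentation data
```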
- Liu, Chang, Henghui Ding, and Xudong Jiang. “GRES: Generalized Referring Expression Segmentation.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- He, Shuting, Henghui Ding, Chang Liu, and Xudong Jiang. “GREC: Generalized Referring Expression Comprehension.” arXiv preprint arXiv:2308.16182, 2023.
- Kamath, Aishwarya, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. “MDETR: Modulated Detection for End-to-End Multi-Modal Understanding.” arXiv preprint arXiv:2104.12763, 2021.