Not Learning Image-Text Alignment #47

Thank you for the great work!

I ran a toy experiment on the Flickr-30k dataset (https://huggingface.co/datasets/nlphuji/flickr30k), using all the default parameters and training on 4 H100 GPUs. However, after **120k steps (991 epochs)**, the model has not learned image-text alignment: the generated images do not match the ground-truth images. Below is an example. **The top row shows the ground-truth images and the bottom row shows the generated images.**

<img width="2066" height="518" alt="Image" src="https://github.com/user-attachments/assets/6ef61d0b-3f40-4961-af12-50e7fb32b1b8" />

I do not encounter this issue with other models such as DiT, SiT, REPA, etc. Any ideas on how to fix the issue?
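
For reference, here is a rough sketch of how the step and epoch counts above relate. The dataset size (31,014 images) and the global batch size (256) are assumptions for illustration, not values confirmed by the training config, and `epochs_completed` is a hypothetical helper rather than part of the repository.

```python
def epochs_completed(total_steps: int, dataset_size: int,
                     global_batch_size: int, drop_last: bool = True) -> int:
    """Return the number of full epochs finished after `total_steps` optimizer steps."""
    if drop_last:
        # Incomplete final batch is discarded each epoch.
        steps_per_epoch = dataset_size // global_batch_size
    else:
        # Incomplete final batch counts as one extra step (ceiling division).
        steps_per_epoch = -(-dataset_size // global_batch_size)
    if steps_per_epoch == 0:
        raise ValueError("global_batch_size is larger than the dataset")
    return total_steps // steps_per_epoch


# Assumed values: 31,014 images, global batch size 256 (e.g. 64 per GPU on 4 GPUs).
print(epochs_completed(120_000, 31_014, 256))
```

Under these assumptions the result is consistent with the 991 epochs reported above, so the model has seen each image-caption pair roughly a thousand times.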
