Open
Description
Thank you for the great work!
I ran a toy experiment on the Flickr-30k dataset (https://huggingface.co/datasets/nlphuji/flickr30k) with all default parameters, training on 4 H100s. After 120k steps (991 epochs), however, the model has not learned image-text alignment: the generated images do not match the ground-truth images. Below is an example, where the top row shows ground-truth images and the bottom row shows generated images.
I do not encounter this issue with other models such as DiT, SiT, REPA, etc. Any ideas on how to fix the issue?
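For reference, here is a rough sanity check of the step and epoch counts quoted above. It is only a sketch: the dataset size of about 31,000 images is my assumption, not a verified count, and the helper names are made up for illustration. The only figures taken from the run are 120k steps, 991 epochs, and 4 GPUs.

```python
# Sanity check of the training-run arithmetic reported above.
# ASSUMPTION: roughly 31,000 training images (not verified against the dataset).
# Helper names are hypothetical and not part of any library.

def steps_per_epoch(total_steps: int, epochs: int) -> float:
    """Average number of optimizer steps taken per pass over the data."""
    return total_steps / epochs


def implied_global_batch(num_samples: int, total_steps: int, epochs: int) -> float:
    """Global batch size implied by the dataset size and the step/epoch counts."""
    return num_samples / steps_per_epoch(total_steps, epochs)


if __name__ == "__main__":
    total_steps, epochs, num_gpus = 120_000, 991, 4  # numbers from the run
    assumed_num_samples = 31_000                     # assumption, see above

    spe = steps_per_epoch(total_steps, epochs)
    batch = implied_global_batch(assumed_num_samples, total_steps, epochs)
    print(f"steps per epoch:      {spe:.2f}")
    print(f"implied global batch: {batch:.1f}")
    print(f"implied per-GPU batch: {batch / num_gpus:.1f}")
```

If the implied batch size matches the default configuration, the step and epoch counts are at least internally consistent, which would point away from a data-loading miscount and toward the text conditioning itself.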