
justin-herry/JEPA-T

JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation


👋 Introduction

Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I.

Figure: JEPA-T framework overview.
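As a rough illustration of the design described above (a feature predictor followed by cross-attention over text tokens, trained with a flow-matching objective), here is a minimal PyTorch sketch. All names here (`TextFusionPredictor`, `flow_matching_loss`) and the layer sizes are our own illustrative choices, not the repository's actual API.

```python
import torch
import torch.nn as nn

class TextFusionPredictor(nn.Module):
    """Feature predictor followed by cross-attention over text tokens.
    Module names are illustrative, not taken from the released code."""

    def __init__(self, dim: int = 32, n_heads: int = 4):
        super().__init__()
        self.predictor = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # Late fusion: visual tokens attend to text tokens *after* prediction.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        h = self.predictor(visual_tokens)
        fused, _ = self.cross_attn(query=h, key=text_tokens, value=text_tokens)
        return self.norm(h + fused)

def flow_matching_loss(model, x1, text_tokens):
    """Rectified-flow style objective: regress the velocity (x1 - x0) at a
    random time t on the straight path between noise x0 and data x1."""
    x0 = torch.randn_like(x1)          # noise endpoint
    t = torch.rand(x1.size(0), 1, 1)   # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1         # point on the straight path
    v_pred = model(xt, text_tokens)    # text-conditioned velocity prediction
    return ((v_pred - (x1 - x0)) ** 2).mean()
```

This is a sketch of the general technique only; the paper's actual predictor, token dimensions, and where exactly the raw text embeddings are injected may differ.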

TODO List

  • Upload our paper to arXiv and build project pages.
  • Upload the code.
  • Release JEPA-T model.

🤗 Prerequisites


Environment

conda create -n JEPA-T python=3.10 -y
conda activate JEPA-T
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
pip install -r requirements.txt

We tested this environment on NVIDIA A100, H20, and RTX 4090 GPUs.

Scripts

1. Cache VAE latents:

bash scripts/cache_vae.sh

2. Train/evaluate JEPA-T:

bash scripts/jepat_base/large/huge.sh
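At inference time, the abstract describes iteratively denoising visual tokens conditioned on text. Under a flow-matching reading of that procedure, a minimal Euler sampler might look like the following sketch; `velocity_fn` stands in for any text-conditioned velocity predictor, and the function name, defaults, and sampler shape are assumptions, not the released code.

```python
import torch

@torch.no_grad()
def generate(velocity_fn, text_tokens, n_tokens=8, dim=32, steps=20):
    """Euler integration from noise (t=0) toward data (t=1).
    velocity_fn(x, text_tokens) -> predicted velocity; this is an
    illustrative sampler, not the repository's released one."""
    x = torch.randn(text_tokens.size(0), n_tokens, dim)  # start from pure noise
    dt = 1.0 / steps
    for _ in range(steps):
        v = velocity_fn(x, text_tokens)  # text-conditioned velocity estimate
        x = x + dt * v                   # one Euler step along the flow
    return x
```

The same loop covers both class-conditional and free-text generation: only the content of `text_tokens` changes, not the network or the sampler.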

👍 Acknowledgements

Our code builds on the open-source release of MAR; we sincerely thank its authors.

🔒 License

This code is distributed under a CC BY-NC-SA 4.0 license.

Note that our code depends on other libraries, including CLIP and MMDetection, and uses datasets, each of which has its own license that must also be followed.

About

This is the official repository for the paper: JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation
