Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose **JEPA-T**, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I.
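To make the two conditioning paths concrete, here is a minimal PyTorch sketch of how cross-attention after a feature predictor and text injection before a flow matching loss might be wired together. It assumes standard multi-head cross-attention and linear-interpolation conditional flow matching; all module names, dimensions, and the toy predictor are illustrative and are not the released JEPA-T code.

```python
# Illustrative sketch (not the released implementation):
# (1) cross-attention over text tokens after the feature predictor, and
# (2) text conditioning applied before computing the flow matching loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """Conditions predicted visual tokens on text tokens via cross-attention."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # Queries come from visual tokens, keys/values from text tokens.
        fused, _ = self.attn(query=visual_tokens, key=text_tokens, value=text_tokens)
        return self.norm(visual_tokens + fused)  # residual connection

def flow_matching_loss(predictor, visual_latents, text_embeds, fusion):
    """Conditional flow matching: regress the velocity (x1 - x0) along a
    linear interpolation between noise x0 and data x1."""
    x1 = visual_latents
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # per-sample time
    xt = (1 - t) * x0 + t * x1                           # interpolated point
    target_velocity = x1 - x0
    # Predict, then fuse text conditioning before the loss is computed.
    pred = fusion(predictor(xt), text_embeds)
    return F.mse_loss(pred, target_velocity)

# Toy usage with random tensors standing in for visual latents / text tokens.
predictor = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
fusion = CrossAttentionFusion()
visual = torch.randn(4, 256, 768)   # (batch, visual tokens, dim)
text = torch.randn(4, 77, 768)      # (batch, text tokens, dim)
loss = flow_matching_loss(predictor, visual, text, fusion)
loss.backward()
```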
- Upload our paper to arXiv and build project pages.
- Upload the code.
- Release JEPA-T model.
conda create -n JEPA-T python=3.10 -y
conda activate JEPA-T
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
pip install -r requirements.txt

We tested our environment on A100, H20, and 4090 GPUs.
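After installation, a quick sanity check can confirm that the pinned torch build sees your GPU. This is a suggested step on our part, not one of the official scripts:

```python
# Verify the pinned PyTorch build and GPU visibility before training.
import torch

print(torch.__version__)              # expect 2.1.2
print(torch.cuda.is_available())      # expect True on A100 / H20 / 4090
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```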
1. Cache VAE latents (a conceptual sketch of this step follows below):

   bash scripts/cache_vae.sh

2. Train/Evaluate JEPA-T:

   bash scripts/jepat_base/large/huge.sh

We sincerely thank the authors of the open-source works on which our code is based, in particular MAR.
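Conceptually, the latent-caching step encodes each training image once with a frozen VAE and stores the resulting latents, so training never touches raw pixels. The sketch below illustrates that idea using the diffusers `AutoencoderKL`; the model name, dataset path, and resolution are assumptions for illustration, and the actual logic lives in `scripts/cache_vae.sh`.

```python
# Hypothetical sketch of VAE latent caching (not the project's cache_vae.sh).
import torch
from diffusers import AutoencoderKL
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").to(device).eval()

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # map pixels to [-1, 1]
])
dataset = datasets.ImageFolder("data/imagenet/train", transform=transform)
loader = DataLoader(dataset, batch_size=64, num_workers=8)

all_latents, all_labels = [], []
with torch.no_grad():
    for images, labels in loader:
        posterior = vae.encode(images.to(device)).latent_dist
        all_latents.append(posterior.sample().cpu())  # (B, 4, 32, 32) latents
        all_labels.append(labels)

# Store everything in one file; a real pipeline would likely shard to disk.
torch.save({"latents": torch.cat(all_latents),
            "labels": torch.cat(all_labels)}, "cached_vae_latents.pt")
```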
This code is distributed under a CC BY-NC-SA 4.0 license.
Note that our code depends on other libraries, including CLIP and MMDetection, and uses datasets that each have their own licenses, which must also be followed.
