- Autoregressive Image Generation without Vector Quantization https://arxiv.org/pdf/2406.11838
- Flow Matching Guide and Code https://arxiv.org/pdf/2412.06264
- Dimensionality-Varying Diffusion Process https://arxiv.org/pdf/2211.16032
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction https://arxiv.org/pdf/2404.02905
- Wonderland: Navigating 3D Scenes from a Single Image [https://arxiv.org/pdf/2412.12091]
High performance VAE
- Large Motion Video Autoencoding with Cross-modal Video VAE [https://arxiv.org/pdf/2412.17805]
Deep compression VAE
- REDUCIO! Generating 1024x1024 Video within 16 Seconds using Extremely Compressed Motion Latents [https://arxiv.org/html/2411.13552]
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [https://arxiv.org/pdf/2408.06072]
- Enhance-A-Video: Better Generated Video for Free https://arxiv.org/pdf/2502.07508
- Emu3: Next-Token Prediction is All You Need https://arxiv.org/pdf/2409.18869
LLM + Video Model/Generative Instruction Tuning
- iVideoGPT: Interactive VideoGPTs are Scalable World Models [https://arxiv.org/pdf/2405.15223]
- Flowing from Words to Pixels: A Framework for Cross-Modality Evolution [https://arxiv.org/pdf/2412.15213]
- Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis https://github.com/FoundationVision/Infinity
Generative Instruction Tuning/LLM + T2I/Visual CoT/Understanding while generation
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning [https://arxiv.org/pdf/2412.14164]
- OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models https://arxiv.org/pdf/2502.01061
- Qwen2.5-VL https://github.com/QwenLM/Qwen2.5-VL
- MiniCPM-o 2.6 https://github.com/OpenBMB/MiniCPM-o
- Baichuan-Omni-1.5 https://github.com/baichuan-inc/Baichuan-Omni-1.5
- InternLM-XComposer2.5 https://github.com/InternLM/InternLM-XComposer
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation https://arxiv.org/pdf/2408.12528
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation https://arxiv.org/pdf/2410.13848
- WeGen: A Unified Model for Interactive Multimodal Generation as We Chat https://arxiv.org/pdf/2503.01115
- UniTok: A Unified Tokenizer for Visual Generation and Understanding https://arxiv.org/pdf/2502.20321
- Generating Images with Multimodal Language Models https://arxiv.org/pdf/2305.17216
- Large Concept Models: Language Modeling in a Sentence Representation Space [https://arxiv.org/pdf/2412.08821]
- Autoregressive Image Generation with Vision Full-view Prompt https://arxiv.org/pdf/2502.16965
- Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation [https://promptda.github.io/]
- Proximal Policy Optimization Algorithms https://arxiv.org/pdf/1707.06347
- (GRPO) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models https://arxiv.org/pdf/2402.03300
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model https://arxiv.org/pdf/2305.18290