
A. Visual Generation

A.0 Learning Paradigm

  1. Autoregressive Image Generation without Vector Quantization https://arxiv.org/pdf/2406.11838
  2. Flow Matching Guide and Code https://arxiv.org/pdf/2412.06264
  3. Dimensionality-Varying Diffusion Process https://arxiv.org/pdf/2211.16032
  4. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction https://arxiv.org/pdf/2404.02905
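The Flow Matching guide above (item 2) centers on a simple regression objective: interpolate between a prior sample and a data sample along a straight line, and regress a velocity field onto the constant displacement. A minimal numpy sketch under that linear-path assumption (all names here are illustrative, not taken from any paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x0, x1, v_pred):
    """Conditional flow matching loss for the linear path
    x_t = (1 - t) * x0 + t * x1, whose velocity target is x1 - x0."""
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))

# Toy batch: prior samples x0 and "data" samples x1.
x0 = rng.standard_normal((4, 8))
x1 = rng.standard_normal((4, 8))
t = rng.uniform(size=(4, 1))      # random times in [0, 1]
x_t = (1 - t) * x0 + t * x1       # a real model would predict v from (x_t, t)

loss_perfect = flow_matching_loss(x0, x1, v_pred=x1 - x0)        # oracle predictor
loss_zero = flow_matching_loss(x0, x1, v_pred=np.zeros_like(x0)) # trivial baseline
```

The oracle predictor drives the loss to exactly zero, which is the sense in which the straight-line velocity is the regression target.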

A.1 3D Generation

  1. Wonderland: Navigating 3D Scenes from a Single Image https://arxiv.org/pdf/2412.12091

A.2 VAE

High performance VAE

  1. Large Motion Video Autoencoding with Cross-modal Video VAE https://arxiv.org/pdf/2412.17805

Deep compression VAE

  1. REDUCIO! Generating 1024x1024 Video within 16 Seconds using Extremely Compressed Motion Latents https://arxiv.org/html/2411.13552

A.3 Foundation Video Diffusion Models

  1. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer https://arxiv.org/pdf/2408.06072
  2. Enhance-A-Video: Better Generated Video for Free https://arxiv.org/pdf/2502.07508
  3. Emu3: Next-Token Prediction is All You Need https://arxiv.org/pdf/2409.18869

A.4 World Model

LLM + Video Model/Generative Instruction Tuning

  1. iVideoGPT: Interactive VideoGPTs are Scalable World Models https://arxiv.org/pdf/2405.15223

A.5 Foundation Image Models


  1. Flowing from Words to Pixels: A Framework for Cross-Modality Evolution https://arxiv.org/pdf/2412.15213
  2. Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis https://github.com/FoundationVision/Infinity

Generative Instruction Tuning/LLM + T2I/Visual CoT/Understanding while generation

  1. MetaMorph: Multimodal Understanding and Generation via Instruction Tuning https://arxiv.org/pdf/2412.14164

A.6 Human-centric Generation

  1. OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models https://arxiv.org/pdf/2502.01061

A.7 Training-free Patch

  1. Enhance-A-Video: Better Generated Video for Free https://arxiv.org/pdf/2502.07508

B. Multi-modal Large Language Models

B.1 Transformer-based omni-modal models

  1. Qwen2.5-VL https://github.com/QwenLM/Qwen2.5-VL
  2. MiniCPM-o 2.6 https://github.com/OpenBMB/MiniCPM-o
  3. Baichuan-Omni-1.5 https://github.com/baichuan-inc/Baichuan-Omni-1.5
  4. InternLM-XComposer2.5 https://github.com/InternLM/InternLM-XComposer
  5. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation https://arxiv.org/pdf/2408.12528
  6. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation https://arxiv.org/pdf/2410.13848
  7. WeGen: A Unified Model for Interactive Multimodal Generation as We Chat https://arxiv.org/pdf/2503.01115
  8. UniTok: A Unified Tokenizer for Visual Generation and Understanding https://arxiv.org/pdf/2502.20321
  9. Generating Images with Multimodal Language Models https://arxiv.org/pdf/2305.17216

B.2 Novel learning paradigm

  1. Large Concept Models: Language Modeling in a Sentence Representation Space https://arxiv.org/pdf/2412.08821

B.3 Inference method

  1. Autoregressive Image Generation with Vision Full-view Prompt https://arxiv.org/pdf/2502.16965

C. Visual Understanding

C.1 Depth Estimation

  1. Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation https://promptda.github.io/

D. Reinforcement learning

D.0 Learning Paradigm

  1. Proximal Policy Optimization Algorithms https://arxiv.org/pdf/1707.06347
  2. (GRPO) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models https://arxiv.org/pdf/2402.03300
  3. Direct Preference Optimization: Your Language Model is Secretly a Reward Model https://arxiv.org/pdf/2305.18290
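Items 1 and 3 above are each defined by a compact objective. A minimal numpy sketch of the PPO clipped surrogate and the DPO loss, assuming per-sample log-probabilities and advantages have already been computed (function and argument names are illustrative):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate (Schulman et al., 2017), as a loss to minimize."""
    ratio = np.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Pessimistic bound: take the smaller surrogate, then negate for minimization.
    return float(-np.mean(np.minimum(unclipped, clipped)))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss (Rafailov et al., 2023) on (chosen, rejected) pairs:
    -log sigmoid(beta * (chosen log-ratio minus rejected log-ratio))."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.mean(np.log1p(np.exp(-logits))))  # -log sigmoid(logits)
```

When the policy has not moved (logp_new == logp_old), the PPO ratio is 1 and the loss reduces to the negative mean advantage; when the policy matches the reference, the DPO logits are 0 and the loss is log 2. GRPO (item 2) reuses the same clipped surrogate but replaces the learned value baseline with a group-relative advantage.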