A. Visual Generation

A.0 Learning Paradigm

Autoregressive Image Generation without Vector Quantization https://arxiv.org/pdf/2406.11838
Flow Matching Guide and Code https://arxiv.org/pdf/2412.06264
Dimensionality-Varying Diffusion Process https://arxiv.org/pdf/2211.16032
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction https://arxiv.org/pdf/2404.02905

A.1 3D Generation

Wonderland: Navigating 3D Scenes from a Single Image [https://arxiv.org/pdf/2412.12091]

A.2 VAE

High-performance VAE

Large Motion Video Autoencoding with Cross-modal Video VAE [https://arxiv.org/pdf/2412.17805]

Deep compression VAE

REDUCIO! Generating 1024x1024 Video within 16 Seconds using Extremely Compressed Motion Latents [https://arxiv.org/html/2411.13552]

A.3 Foundation Video Diffusion Models

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [https://arxiv.org/pdf/2408.06072]
Enhance-A-Video: Better Generated Video for Free https://arxiv.org/pdf/2502.07508
Emu3: Next-Token Prediction is All You Need https://arxiv.org/pdf/2409.18869

A.4 World Model

LLM + Video Model/Generative Instruction Tuning

iVideoGPT: Interactive VideoGPTs are Scalable World Models [https://arxiv.org/pdf/2405.15223]

A.5 Foundation Image Models

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution [https://arxiv.org/pdf/2412.15213]
Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis https://github.com/FoundationVision/Infinity

Generative Instruction Tuning/LLM + T2I/Visual CoT/Understanding while generating

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning [https://arxiv.org/pdf/2412.14164]

A.6 Human-centric Generation

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models https://arxiv.org/pdf/2502.01061

A.7 Training-free Patch

Enhance-A-Video: Better Generated Video for Free https://arxiv.org/pdf/2502.07508

B. Multi-modal Large Language Models

B.1 Transformer-based omni-modal models

Qwen2.5-VL https://github.com/QwenLM/Qwen2.5-VL
MiniCPM-o 2.6 https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file
Baichuan-Omni-1.5 https://github.com/baichuan-inc/Baichuan-Omni-1.5
InternLM-XComposer2.5 https://github.com/InternLM/InternLM-XComposer
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation https://arxiv.org/pdf/2408.12528
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation https://arxiv.org/pdf/2410.13848
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat https://arxiv.org/pdf/2503.01115
UniTok: A Unified Tokenizer for Visual Generation and Understanding https://arxiv.org/pdf/2502.20321
Generating Images with Multimodal Language Models https://arxiv.org/pdf/2305.17216

B.2 Novel learning paradigm

Large Concept Models: Language Modeling in a Sentence Representation Space [https://arxiv.org/pdf/2412.08821]

B.3 Inference method

Autoregressive Image Generation with Vision Full-view Prompt https://arxiv.org/pdf/2502.16965

C. Visual Understanding

C.1 Depth Estimation

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation [https://promptda.github.io/]

D. Reinforcement Learning

D.0 Learning Paradigm

Proximal Policy Optimization Algorithms https://arxiv.org/pdf/1707.06347
(GRPO) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models https://arxiv.org/pdf/2402.03300
Direct Preference Optimization: Your Language Model is Secretly a Reward Model https://arxiv.org/pdf/2305.18290