LLIA - Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models

Haojie Yu* · Zhaonian Wang* · Yihan Pan* · Meng Cheng · Hao Yang · Chao Wang · Tao Xie · Xiaoming Xu · Xiaoming Wei · Xunliang Cai

*Equal Contribution · Corresponding Authors

TL;DR: LLIA is a real-time audio-driven portrait video generation framework based on diffusion models, enabling low-latency interactive avatars.

Video Demos

001.mp4
002.mp4
003.mp4

🔆 Introduction

We propose LLIA, a novel audio-driven portrait video generation framework based on diffusion models. Our approach achieves low-latency, fluid, and authentic two-way communication. On an NVIDIA RTX 4090D, our model reaches up to 78 FPS at a resolution of 384 × 384 and 45 FPS at a resolution of 512 × 512, with initial video generation latencies of 140 ms and 215 ms, respectively.
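As a back-of-the-envelope illustration of what these throughput numbers mean for the real-time constraint, the sketch below (a hypothetical helper, not part of the LLIA codebase) converts the reported FPS figures into the per-frame compute budget the model must stay within:

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to generate one frame (in ms) while sustaining `fps`."""
    return 1000.0 / fps

# Reported throughputs: 78 FPS at 384x384 and 45 FPS at 512x512.
print(round(frame_budget_ms(78), 1))  # -> 12.8 ms per frame at 384x384
print(round(frame_budget_ms(45), 1))  # -> 22.2 ms per frame at 512x512
```

In other words, each diffusion step pipeline must complete in roughly 13 ms (or 22 ms at the higher resolution) to keep the avatar stream fluid.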

🔥 Latest News

📑 Todo List

  • Release the technical report
  • Inference
  • Checkpoints