GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

Zhenya Yang<sup>1</sup>, Zhe Liu<sup>1,†</sup>, Yuxiang Lu<sup>1</sup>, Liping Hou<sup>2</sup>, Chenxuan Miao<sup>1</sup>, Siyi Peng<sup>2</sup>, Bailan Feng<sup>2</sup>, Xiang Bai<sup>3</sup>, Hengshuang Zhao<sup>1,✉</sup>

<sup>1</sup> The University of Hong Kong, <sup>2</sup> Huawei Noah's Ark Lab, <sup>3</sup> Huazhong University of Science and Technology
<sup>†</sup> Project leader, <sup>✉</sup> Corresponding author.

📑 [arXiv], ⚙️ [project page], 🤗 [model weights]

Overview of our GenieDrive

📢 News

  • [2025/12/15] We released the GenieDrive paper on arXiv. 🔥
  • [2025/12/15] The DrivePI paper is released! A spatially aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework and is also compatible with vision-action (VA) models. 🔥
  • [2025/11/04] Our previous work UniLION has been released. Check out the codebase for a unified autonomous driving model built on Linear Group RNNs. 🚀
  • [2024/09/26] Our work LION was accepted at NeurIPS 2024. Visit the codebase for Linear Group RNNs for 3D object detection. 🚀

📋 TODO List

  • Release 4D occupancy forecasting code and model weights.
  • Release multi-view video generator code and weights.

📈 Results

Our method achieves substantial gains in 4D occupancy forecasting, improving mIoU by 7.2% and IoU by 4%. Moreover, our tri-plane VAE compresses occupancy into a latent tri-plane that is only 58% the size of those used in previous methods, while maintaining superior reconstruction quality. This compact latent representation also enables fast inference (41 FPS) with a minimal parameter count of only 3.47M (including the VAE and prediction module).

Performance of 4D Occupancy Forecasting
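To see why a tri-plane latent is compact, the sketch below compares the element count of a dense occupancy volume against three axis-aligned feature planes. All resolutions and channel counts here are hypothetical, chosen only so the ratio lands near the reported 58%; they are not the model's actual configuration:

```python
# Hypothetical resolutions, for illustration only (not the paper's actual config).
X, Y, Z = 200, 200, 16   # dense occupancy grid
C = 8                    # latent channels per plane

dense_elems = X * Y * Z                       # one value per voxel
triplane_elems = C * (X * Y + X * Z + Y * Z)  # XY, XZ, and YZ planes

print(dense_elems, triplane_elems)            # 640000 371200
print(f"{triplane_elems / dense_elems:.0%}")  # 58%
```

The planes scale quadratically with resolution rather than cubically, which is what keeps the latent small enough for real-time forecasting.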

We train three driving video generation models that differ only in video length: S (8 frames, ~0.7 s), M (37 frames, ~3 s), and L (81 frames, ~7 s). Through rollout, the L model can further generate long multi-view driving videos of up to 241 frames (~20 s). GenieDrive consistently outperforms previous occupancy-based methods across all metrics, while also enabling much longer video generation.

Performance of Multi-View Video Generation
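The 81-to-241-frame extension is consistent with a simple autoregressive rollout in which each additional window is conditioned on the end of the previous one. The bookkeeping below assumes a 1-frame overlap between windows; the overlap size and the `rollout_length` helper are our assumptions, not details stated in this README:

```python
def rollout_length(window: int, steps: int, overlap: int = 1) -> int:
    """Total frames after `steps` extra rollout windows, each reusing `overlap` frames."""
    return window + steps * (window - overlap)

# L model: an 81-frame window plus two rollout steps reaches 241 frames (~20 s).
print(rollout_length(81, steps=2))  # 241
```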

πŸ“ Citation

```bibtex
@article{yang2025geniedrive,
  author    = {Yang, Zhenya and Liu, Zhe and Lu, Yuxiang and Hou, Liping and Miao, Chenxuan and Peng, Siyi and Feng, Bailan and Bai, Xiang and Zhao, Hengshuang},
  title     = {GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation},
  journal   = {arXiv:2512.12751},
  year      = {2025},
}
```

Acknowledgements

We thank these great works and open-source repositories: I2-World, UniScene, DynamicCity, MMDetection3D, and VideoX-Fun.
