InfiniCube

[ICCV 2025] InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
Paper Page

Large Scale with Map Condition

Outline

Detailed documentation is organized by topic:

  • Environment Setup - Setting up conda and Waymo environments
  • Data Processing - Converting Waymo Open Dataset to webdataset format
  • Training - Training models
  • Inference - Running inference on trained models
  • Visualization - Interactive visualization tools for data and results

🚀 Quick Start

This quick start guide demonstrates how to run the InfiniCube inference pipeline in trajectory mode.

Prerequisites

  1. Environment Setup: Follow Environment Setup to install dependencies
  2. Checkpoints: Download or train the required model checkpoints:
    • checkpoints/voxel_diffusion.ckpt - Voxel diffusion model
    • checkpoints/video_generation.safetensors - Video generation model
    • checkpoints/gsm_vs02_res512_view1_dual_branch_sky_mlp_modulator.ckpt - Gaussian splatting model
    • models/Wan-AI/Wan2.1-T2V-1.3B - Checkpoints for Wan2.1 1.3B; run python infinicube/videogen/download_checkpoint.py to download them.
  3. Input Data Sample: Download the data sample for inference and move the data_sample/data folder to the repository root as <InfiniCube-repo>/data.
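Before running inference, it can help to confirm that the expected files are in place. A minimal sketch (the helper name `check_required_files` is ours; the path list simply mirrors the checkpoints and data folder listed above):

```python
from pathlib import Path

# Paths mirror the prerequisite list above; adjust if you store them elsewhere.
REQUIRED = [
    "checkpoints/voxel_diffusion.ckpt",
    "checkpoints/video_generation.safetensors",
    "checkpoints/gsm_vs02_res512_view1_dual_branch_sky_mlp_modulator.ckpt",
    "data",  # the data_sample/data folder moved to the repo root
]

def check_required_files(root=".", required=REQUIRED):
    """Return the subset of required paths missing under `root`."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]

missing = check_required_files()
if missing:
    print("Missing before inference:", ", ".join(missing))
```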

Step 1: Voxel World Generation

Generate the voxel representation of the scene conditioned on map data:

python infinicube/inference/voxel_world_generation.py none \
    --mode trajectory \
    --use_ema --use_ddim --ddim_step 100 \
    --local_config infinicube/voxelgen/configs/diffusion_64x64x64_dense_vs02_map_cond.yaml \
    --local_checkpoint_path checkpoints/voxel_diffusion.ckpt \
    --clip 13679757109245957439_4167_170_4187_170 \
    --target_pose_num 8

Output: Voxel grids and visualization images are saved in visualization/infinicube_inference/voxel_world_generation/trajectory/{clip}/. To visualize the voxel grid, you can use the following command:

vis-grid -p visualization/infinicube_inference/voxel_world_generation/trajectory/13679757109245957439_4167_170_4187_170/7.pt

Voxel Grid Visualization

Generated voxel world
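The command above handles a single clip; to sweep several clips, the same invocation can be scripted. A sketch (the clip list is illustrative; the helper simply replays the command-line arguments shown above via subprocess):

```python
import subprocess

# Illustrative clip IDs; substitute your own Waymo clip names.
clips = ["13679757109245957439_4167_170_4187_170"]

def voxel_generation_cmd(clip, target_pose_num=8):
    """Build the Step 1 command line for one clip (arguments as shown above)."""
    return [
        "python", "infinicube/inference/voxel_world_generation.py", "none",
        "--mode", "trajectory",
        "--use_ema", "--use_ddim", "--ddim_step", "100",
        "--local_config",
        "infinicube/voxelgen/configs/diffusion_64x64x64_dense_vs02_map_cond.yaml",
        "--local_checkpoint_path", "checkpoints/voxel_diffusion.ckpt",
        "--clip", clip,
        "--target_pose_num", str(target_pose_num),
    ]

for clip in clips:
    # subprocess.run(voxel_generation_cmd(clip), check=True)  # uncomment to execute
    print(" ".join(voxel_generation_cmd(clip)))
```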

Step 2: Guidance Buffer Generation

Generate the semantic buffer and coordinate buffer from the voxel world using the original trajectory, then call the video generation model to generate the video. Buffer rendering is fast; the speed of video generation, however, depends heavily on the video generation model implementation.

We reimplement the video generation stage with Wan2.1 14B (text-to-video). It embeds the guidance buffers into tokens and adds them to the noisy tokens for conditioning. The default prompt is "The video is about a driving scene captured at daytime. The weather is clear." Since the 14B model is large, it takes about 20 minutes to generate a 93-frame 480p video on a single A100 GPU (excluding model weight loading). We also provide a smaller model based on Wan2.1 1.3B (text-to-video), but it does not always follow the conditions.

We also trained a Wan2.1 14B (image-to-video) model with guidance buffer conditioning. Because loading two 14B models can be extremely slow, we only enable text-to-video in our inference API.

We provide training scripts for the buffer-conditioned Wan2.1 1.3B (t2v) and Wan2.1 14B (t2v, i2v) in diffsynth, with multi-node training support.
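The conditioning scheme described above can be illustrated with a toy sketch. This is not the actual Wan2.1 code; shapes and the linear embedding weight are illustrative, and tokens are plain Python lists for clarity. Each guidance-buffer token is projected to the token dimension and added element-wise to the corresponding noisy video token:

```python
# Toy illustration of buffer conditioning (NOT the actual Wan2.1 implementation):
# a guidance buffer token is embedded to the token dimension, then added to
# the noisy video token at the same position.

def embed_buffer(buffer_tokens, weight):
    """Project each buffer token (list of floats) through a weight matrix."""
    return [
        [sum(w * x for w, x in zip(row, tok)) for row in weight]
        for tok in buffer_tokens
    ]

def condition(noisy_tokens, buffer_tokens, weight):
    """Add embedded buffer tokens to noisy tokens element-wise."""
    embedded = embed_buffer(buffer_tokens, weight)
    return [
        [n + e for n, e in zip(nt, et)]
        for nt, et in zip(noisy_tokens, embedded)
    ]

# Two tokens, buffer dim 2 -> token dim 3 via a fixed (illustrative) weight matrix.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # token_dim x buffer_dim rows
noisy = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
buffers = [[1.0, 2.0], [3.0, 4.0]]
out = condition(noisy, buffers, weight)
```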

python infinicube/inference/guidance_buffer_generation.py \
    --mode trajectory \
    --clip 13679757109245957439_4167_170_4187_170 \
    --extrap_voxel_root visualization/infinicube_inference/voxel_world_generation/trajectory \
    --make_dynamic --offset_unit frame --offset 1 --video_prompt "The video is captured by a camera mounted on a vehicle. The video is about a driving scene captured at daytime. The weather is clear."

Output: Guidance buffers and generated video are saved in visualization/infinicube_inference/guidance_buffer_generation/trajectory_pose_sample_1frame/{clip}/

Buffer Rendering and Video Generation

guidance buffer and generated video
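The output folders in this guide appear to follow a consistent naming pattern combining the mode and offset settings (e.g. trajectory_pose_sample_1frame). A hypothetical helper for locating outputs, assuming that pattern holds (the function name and the pattern itself are our inference from the paths shown, not a documented API):

```python
# Hypothetical helper: the output folders in this guide appear to follow
# "{mode}_pose_sample_{offset}{offset_unit}" (e.g. "trajectory_pose_sample_1frame").
def guidance_output_dir(clip, mode="trajectory", offset=1, offset_unit="frame"):
    """Build the guidance-buffer output path for one clip."""
    base = "visualization/infinicube_inference/guidance_buffer_generation"
    return f"{base}/{mode}_pose_sample_{offset}{offset_unit}/{clip}"

print(guidance_output_dir("13679757109245957439_4167_170_4187_170"))
```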

Step 3: Scene Gaussian Generation

Generate the final 3D Gaussian representation with static background and dynamic objects:

python infinicube/inference/scene_gaussian_generation.py none \
    --data_folder visualization/infinicube_inference/guidance_buffer_generation/trajectory_pose_sample_1frame/13679757109245957439_4167_170_4187_170 \
    --local_config infinicube/voxelgen/configs/gsm_vs02_res512_view1_dual_branch_sky_mlp_modulator.yaml \
    --local_checkpoint_path checkpoints/gsm_vs02_res512_view1_dual_branch_sky_mlp_modulator.ckpt

Output: View the generated scene with viser:

python infinicube/visualize/visualize_dynamic_gaussians.py -p visualization/infinicube_inference/gaussian_scene_generation/trajectory_pose_sample_1frame/13679757109245957439_4167_170_4187_170
Dynamic Scene Visualization

Generated dynamic scene

Trajectory mode reuses the original trajectory in the dataset to generate the voxel world and guidance buffers. We also support block mode, which generates a map-level voxel world and provides an interactive GUI for designing the trajectory. For more details on block mode and inference options, see the Inference Guide.

Voxel World Generated in Block Mode

For more visualization tools and options, see the Visualization Guide.


πŸ“ Project Structure

InfiniCube-release/
├── infinicube/
│   ├── data_process/          # Data conversion and processing scripts
│   ├── inference/             # Inference scripts for all models
│   ├── visualize/             # Interactive visualization tools
│   ├── voxelgen/              # Core models and training code
│   │   ├── configs/           # Configuration files for all models
│   │   ├── models/            # VAE, diffusion, and GSM model implementations
│   │   └── data/              # Dataset loaders
│   ├── videogen/              # Video generation inference API
│   └── utils/                 # Utility functions
├── checkpoints/               # Model checkpoints
├── slurm/                     # SLURM job submission scripts
├── README/                    # Detailed documentation
│   ├── env.md                 # Environment setup
│   ├── data_process.md        # Data processing guide
│   ├── training.md            # Training guide
│   ├── inference.md           # Inference guide
│   └── visualize.md           # Visualization tools
└── README.md                  # This file

πŸ› οΈ Training Your Own Models

We provide training scripts for the VAE, voxel diffusion, and GSM models.

If you are interested in buffer-conditioned video generation, we provide an implementation based on Wan2.1 1.3B (text-to-video) with DiffSynth.

For detailed training instructions, see the Training Guide.


🔗 Related Works


🎓 Citation

If you use InfiniCube in your research, please cite our paper:

@inproceedings{lu2025infinicube,
  title={InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models},
  author={Lu, Yifan and Ren, Xuanchi and Yang, Jiawei and Shen, Tianchang and Wu, Zhangjie and Gao, Jun and Wang, Yue and Chen, Siheng and Chen, Mike and Fidler, Sanja and others},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={27272--27283},
  year={2025}
}

📧 Contact

For questions and issues, please open an issue on the GitHub repository or contact the code owner.

📜 License

This project downloads and installs additional third-party models and software. Note that these models and software are not distributed by NVIDIA; review their license terms before use. This source code is released under the Apache 2.0 License. The pre-trained weights are provided for research and educational purposes only; users must strictly adhere to the Waymo Open Dataset Terms of Service. Any commercial use of these weights is strictly prohibited.
