Detailed documentation is organized by topic:
- [Environment Setup](README/env.md) - Setting up conda and Waymo environments
- [Data Processing](README/data_process.md) - Converting Waymo Open Dataset to webdataset format
- [Training](README/training.md) - Training models
- [Inference](README/inference.md) - Running inference on trained models
- [Visualization](README/visualize.md) - Interactive visualization tools for data and results
This quick start guide demonstrates how to run the InfiniCube inference pipeline in trajectory mode.
- Environment Setup: Follow Environment Setup to install dependencies
- Checkpoints: Download or train the required model checkpoints:
  - `checkpoints/voxel_diffusion.ckpt` - Voxel diffusion model
  - `checkpoints/video_generation.safetensors` - Video generation model
  - `checkpoints/gsm_vs02_res512_view1_dual_branch_sky_mlp_modulator.ckpt` - Gaussian splatting model
  - `models/Wan-AI/Wan2.1-T2V-1.3B` - Checkpoints required for Wan2.1 1.3B; run `python infinicube/videogen/download_checkpoint.py` to download them.
- Input Data Sample: Download the data sample for inference and move the `data_sample/data` folder to `<InfiniCube-repo>/data` (see the quick check after this list).
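Before running inference, it can help to confirm that the files are where the commands below expect them. The snippet below is purely illustrative, using the paths listed above; adjust it if you store checkpoints elsewhere:

```bash
# Run from the repository root (<InfiniCube-repo>).
# Move the downloaded sample data into place, if you have not already.
mv data_sample/data ./data

# Verify that the expected checkpoints and Wan2.1 weights are present.
ls checkpoints/voxel_diffusion.ckpt \
   checkpoints/video_generation.safetensors \
   checkpoints/gsm_vs02_res512_view1_dual_branch_sky_mlp_modulator.ckpt
ls models/Wan-AI/Wan2.1-T2V-1.3B
```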
Generate the voxel representation of the scene conditioned on map data:
```bash
python infinicube/inference/voxel_world_generation.py none \
    --mode trajectory \
    --use_ema --use_ddim --ddim_step 100 \
    --local_config infinicube/voxelgen/configs/diffusion_64x64x64_dense_vs02_map_cond.yaml \
    --local_checkpoint_path checkpoints/voxel_diffusion.ckpt \
    --clip 13679757109245957439_4167_170_4187_170 \
    --target_pose_num 8
```

Output: Voxel grids and visualization images are saved in `visualization/infinicube_inference/voxel_world_generation/trajectory/{clip}/`. To visualize a voxel grid, you can use the following command:
```bash
vis-grid -p visualization/infinicube_inference/voxel_world_generation/trajectory/13679757109245957439_4167_170_4187_170/7.pt
```

Generate the semantic buffer and coordinate buffer from the voxel world using the original trajectory, then call the video generation model to generate the video. Buffer rendering is very fast, but the video generation speed depends heavily on the video generation model implementation.
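As intuition for what a guidance buffer is: it is a 2D rendering of the voxel world into a camera view along the trajectory. The sketch below is not the repository's renderer (all function and variable names are hypothetical); it only illustrates the idea of projecting labeled voxel centers into a camera to form a per-pixel semantic buffer:

```python
import torch

def render_semantic_buffer(voxel_centers, voxel_labels, K, world_to_cam, H, W):
    """Hypothetical sketch: splat labeled voxel centers into one camera view.

    voxel_centers: (N, 3) world-space voxel centers
    voxel_labels:  (N,)   integer semantic labels
    K:             (3, 3) camera intrinsics
    world_to_cam:  (4, 4) camera extrinsics
    """
    # Transform voxel centers into the camera frame.
    homo = torch.cat([voxel_centers, torch.ones_like(voxel_centers[:, :1])], dim=1)
    cam = (world_to_cam @ homo.T).T[:, :3]

    # Keep points in front of the camera and project with the intrinsics.
    in_front = cam[:, 2] > 1e-3
    cam, labels = cam[in_front], voxel_labels[in_front]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Simple z-buffer: write far-to-near so the nearest voxel wins each pixel.
    u, v = uv[:, 0].long(), uv[:, 1].long()
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z, labels = u[valid], v[valid], cam[valid, 2], labels[valid]
    order = torch.argsort(z, descending=True)
    buffer = torch.zeros(H, W, dtype=torch.long)
    buffer[v[order], u[order]] = labels[order]
    return buffer
```

The coordinate buffer can be thought of analogously, storing a per-pixel 3D position instead of a semantic label.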
We reimplement the video generation stage with Wan2.1 14B (text-to-video). It embeds the guidance buffers into tokens and adds them to the noisy tokens for conditioning. The default prompt is "The video is about a driving scene captured at daytime. The weather is clear." Since the 14B model is large, it takes about 20 minutes to generate a 93-frame 480p video on a single A100 GPU (excluding model weight loading). We also provide a smaller variant based on Wan2.1 1.3B (text-to-video), but it does not always follow the conditions.
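As a mental model of the additive conditioning described above (an illustrative sketch only; the module name, dimensions, and shapes are made up here and do not match the actual Wan2.1/DiffSynth implementation):

```python
import torch
import torch.nn as nn

class BufferConditioner(nn.Module):
    """Illustrative sketch: patchify guidance buffers and add them to noisy video tokens."""

    def __init__(self, buffer_channels: int, token_dim: int, patch_size: int = 2):
        super().__init__()
        # Project each spatio-temporal buffer patch to the transformer token dimension.
        self.patch_embed = nn.Conv3d(
            buffer_channels, token_dim,
            kernel_size=(1, patch_size, patch_size),
            stride=(1, patch_size, patch_size),
        )

    def forward(self, noisy_tokens: torch.Tensor, buffers: torch.Tensor) -> torch.Tensor:
        # buffers:      (B, C, T, H, W) guidance buffers (semantic + coordinate channels)
        # noisy_tokens: (B, N, D) noisy video tokens entering the diffusion transformer
        cond = self.patch_embed(buffers)            # (B, D, T, H/p, W/p)
        cond = cond.flatten(2).transpose(1, 2)      # (B, N, D) token sequence
        return noisy_tokens + cond                  # additive conditioning

# Toy shapes only; real buffers would be compressed to the video latent resolution first.
cond_module = BufferConditioner(buffer_channels=16, token_dim=1536)
tokens = torch.randn(1, 8 * 16 * 16, 1536)
buffers = torch.randn(1, 16, 8, 32, 32)
print(cond_module(tokens, buffers).shape)  # torch.Size([1, 2048, 1536])
```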
We also trained a Wan2.1 14B (image-to-video) model with guidance buffer conditioning. Since loading two 14B models can be extremely slow, only text-to-video is enabled in our inference API.
We provide training scripts for buffer-conditioned Wan2.1 1.3B (t2v) and Wan2.1 14B (t2v, i2v) in DiffSynth with multi-node training support.
```bash
python infinicube/inference/guidance_buffer_generation.py \
    --mode trajectory \
    --clip 13679757109245957439_4167_170_4187_170 \
    --extrap_voxel_root visualization/infinicube_inference/voxel_world_generation/trajectory \
    --make_dynamic --offset_unit frame --offset 1 \
    --video_prompt "The video is capture by a camera mounted on a vehicle. The video is about a driving scene captured at daytime. The weather is clear."
```

Output: Guidance buffers and the generated video are saved in `visualization/infinicube_inference/guidance_buffer_generation/trajectory_pose_sample_1frame/{clip}/`.
Generate the final 3D Gaussian representation with static background and dynamic objects:
```bash
python infinicube/inference/scene_gaussian_generation.py none \
    --data_folder visualization/infinicube_inference/guidance_buffer_generation/trajectory_pose_sample_1frame/13679757109245957439_4167_170_4187_170 \
    --local_config infinicube/voxelgen/configs/gsm_vs02_res512_view1_dual_branch_sky_mlp_modulator.yaml \
    --local_checkpoint_path checkpoints/gsm_vs02_res512_view1_dual_branch_sky_mlp_modulator.ckpt
```

Output: The generated Gaussian scene is saved in `visualization/infinicube_inference/gaussian_scene_generation/trajectory_pose_sample_1frame/{clip}/`. View it with viser:
```bash
python infinicube/visualize/visualize_dynamic_gaussians.py -p visualization/infinicube_inference/gaussian_scene_generation/trajectory_pose_sample_1frame/13679757109245957439_4167_170_4187_170
```

The trajectory mode reuses the original trajectory in the dataset to generate the voxel world and guidance buffers. We also support block mode, which generates a map-level voxel world and provides an interactive GUI for designing the trajectory. For more details on block mode and inference options, see the [Inference Guide](README/inference.md).
For more visualization tools and options, see the [Visualization Guide](README/visualize.md).
```
InfiniCube-release/
├── infinicube/
│   ├── data_process/        # Data conversion and processing scripts
│   ├── inference/           # Inference scripts for all models
│   ├── visualize/           # Interactive visualization tools
│   ├── voxelgen/            # Core models and training code
│   │   ├── configs/         # Configuration files for all models
│   │   ├── models/          # VAE, diffusion, and GSM model implementations
│   │   └── data/            # Dataset loaders
│   ├── videogen/            # Video generation inference API
│   └── utils/               # Utility functions
├── checkpoints/             # Model checkpoints
├── slurm/                   # SLURM job submission scripts
├── README/                  # Detailed documentation
│   ├── env.md               # Environment setup
│   ├── data_process.md      # Data processing guide
│   ├── training.md          # Training guide
│   ├── inference.md         # Inference guide
│   └── visualize.md         # Visualization tools
└── README.md                # This file
```
We provide the training scripts for VAE, Voxel Diffusion, and GSM.
If you are interested in buffer-conditioned video generation, we provide an implementation based on Wan2.1 1.3B (text-to-video) with DiffSynth.
For detailed training instructions, see the [Training Guide](README/training.md).
- XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies.
- SCube: Instant Large-Scale Scene Reconstruction using VoxSplats.
- Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models.
If you use InfiniCube in your research, please cite our paper:
```bibtex
@inproceedings{lu2025infinicube,
  title={Infinicube: Unbounded and controllable dynamic 3d driving scene generation with world-guided video models},
  author={Lu, Yifan and Ren, Xuanchi and Yang, Jiawei and Shen, Tianchang and Wu, Zhangjie and Gao, Jun and Wang, Yue and Chen, Siheng and Chen, Mike and Fidler, Sanja and others},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={27272--27283},
  year={2025}
}
```

For questions and issues, please open an issue on the GitHub repository or contact the code owner.
This project will download and install additional third-party models and software. Note that these models and software are not distributed by NVIDIA; review their license terms before use. This source code is released under the Apache 2.0 License. The pre-trained weights are provided for research and educational purposes only, and users must strictly adhere to the Waymo Open Dataset Terms of Service. Any commercial use of these weights is strictly prohibited.




