Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma, Yiyi Liao
TL;DR: Gen3R generates geometry together with RGB from images via a unified latent space that aligns geometry and appearance.
We train and test our model under the following environment:
- Debian GNU/Linux 12 (bookworm)
- NVIDIA H20 (96 GB)
- CUDA 12.4
- Python 3.11
- PyTorch 2.5.1+cu124
- Clone this repository

```bash
git clone https://github.com/JaceyHuang/Gen3R
cd Gen3R
```

- Install packages

```bash
conda create -n gen3r python=3.11.2 -y
conda activate gen3r
pip install -r requirements.txt
```

- (Important) Download the pretrained Gen3R checkpoint from HuggingFace to `./checkpoints`

```bash
sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/JaceyH919/Gen3R ./checkpoints
```

- Note: at present, loading weights directly from HuggingFace via `from_pretrained("JaceyH919/Gen3R")` is not supported due to module naming errors. Please download the checkpoint locally and load it with `from_pretrained("./checkpoints")`.
Run the Python script `infer.py` as follows to test the examples:
```bash
python infer.py \
    --pretrained_model_name_or_path ./checkpoints \
    --task 2view \
    --prompts examples/2-view/colosseum/prompts.txt \
    --frame_path examples/2-view/colosseum/first.png examples/2-view/colosseum/last.png \
    --cameras free \
    --output_dir ./results \
    --remove_far_points
```

Some important inference settings:

- `--task`: `1view` for First Frame to 3D, `2view` for First-last Frames to 3D, and `allview` for 3D Reconstruction.
- `--prompts`: the text prompt string or the path to the text prompt file.
- `--frame_path`: the path to the conditional images/video. For the `allview` task, this can be either the path to a folder containing all frames or the path to the conditional video; for the other two tasks, it should be the path to the conditional image(s).
- `--cameras`: the path to the conditional camera extrinsics and intrinsics. We also provide basic trajectories by setting this argument to `zoom_in`, `zoom_out`, `arc_left`, `arc_right`, `translate_up`, or `translate_down`; in this case, we first use VGGT to estimate the initial camera intrinsics and scene scale. To disable camera conditioning, set this argument to `free`.
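As a rough illustration of what a preset trajectory like `zoom_out` encodes, the sketch below builds a sequence of camera poses that move the camera backwards along its viewing axis. The matrix convention, frame count, and step size here are assumptions for illustration only; they are not the exact format that `infer.py` consumes.

```python
import numpy as np

def zoom_out_trajectory(num_frames: int = 8, step: float = 0.25) -> np.ndarray:
    """Toy zoom-out path: the camera starts at the origin looking down -z
    (an assumed OpenGL-style convention) and moves along +z each frame,
    i.e. away from the scene."""
    poses = []
    for i in range(num_frames):
        c2w = np.eye(4)            # camera-to-world pose, identity rotation
        c2w[2, 3] = i * step       # translate the camera along +z
        poses.append(c2w)
    return np.stack(poses)         # shape: (num_frames, 4, 4)

poses = zoom_out_trajectory()
print(poses.shape)       # (8, 4, 4)
print(poses[-1][2, 3])   # 1.75 -- the farthest camera
```

Other presets (`arc_left`, `translate_up`, ...) would vary rotation or translation analogously; the actual scene scale is estimated by VGGT as described above.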
Note that the default resolution of our model is 560×560. If the resolution of the conditioning images or videos differs from this, we first apply resizing followed by center cropping to match the required resolution.
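The resize-then-center-crop preprocessing described above can be sketched as follows. This is a minimal stand-alone version using Pillow; the model's actual preprocessing may differ in interpolation and rounding details.

```python
from PIL import Image

TARGET = 560  # default model resolution (560x560)

def resize_center_crop(img: Image.Image, size: int = TARGET) -> Image.Image:
    """Resize so the shorter side matches `size`, then center-crop to size x size."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

out = resize_center_crop(Image.new("RGB", (1280, 720)))
print(out.size)  # (560, 560)
```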
More examples:
- First Frame to 3D

```bash
python infer.py \
    --pretrained_model_name_or_path ./checkpoints \
    --task 1view \
    --prompts examples/1-view/prompts.txt \
    --frame_path examples/1-view/knossos.png \
    --cameras zoom_out \
    --output_dir ./results
```

- First-last Frames to 3D

```bash
python infer.py \
    --pretrained_model_name_or_path ./checkpoints \
    --task 2view \
    --prompts examples/2-view/bedroom/prompts.txt \
    --frame_path examples/2-view/bedroom/first.png examples/2-view/bedroom/last.png \
    --cameras examples/2-view/bedroom/cameras.json \
    --output_dir ./results
```

- 3D Reconstruction (note that `--cameras` is ignored in this task)

```bash
python infer.py \
    --pretrained_model_name_or_path ./checkpoints \
    --task allview \
    --prompts examples/all-view/prompts.txt \
    --frame_path examples/all-view/garden.mp4 \
    --output_dir ./results
```

TODO:

- Release inference code and checkpoints
- Release online demo
- Release training code & dataset preparation
Please cite our paper if you find this repository useful:
```bibtex
@misc{huang2026gen3r3dscenegeneration,
      title={Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction},
      author={Jiaxin Huang and Yuanbo Yang and Bangbang Yang and Lin Ma and Yuewen Ma and Yiyi Liao},
      year={2026},
      eprint={2601.04090},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.04090},
}
```