Top: Our pretraining design philosophy highlights that prior methods often lack key cues in their input for accurate action inference.
This leads to target action distributions A_t(·) exhibiting high variance or non-smoothness, which negatively impacts pretraining performance.
A rough analysis of the DROID dataset shows that the robot's base is occluded in 67% of samples, leaving the coordinate frame ambiguous.
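The "high variance" claim can be made concrete with a rough diagnostic: group training samples with similar observations and measure how much their paired target actions still differ. The sketch below is a minimal illustration of that idea, not the analysis used in the paper; the feature and action arrays, the KMeans binning, and the bin count are all assumptions chosen for demonstration.

```python
# Minimal sketch (illustrative only): quantify how "spread out" the target
# action distribution A_t(.) remains once we condition on the observation.
# The array names and the KMeans binning are assumptions, not the paper's
# actual analysis pipeline.
import numpy as np
from sklearn.cluster import KMeans

def conditional_action_variance(obs_features: np.ndarray,
                                actions: np.ndarray,
                                n_bins: int = 50,
                                seed: int = 0) -> float:
    """Cluster observation features into bins, then average the per-bin
    variance of the paired actions. A large value suggests the observation
    alone (e.g. a single frame with the robot base occluded) does not pin
    down the target action."""
    labels = KMeans(n_clusters=n_bins, n_init=10,
                    random_state=seed).fit_predict(obs_features)
    per_bin = [actions[labels == k].var(axis=0).mean()
               for k in range(n_bins)
               if np.sum(labels == k) > 1]
    return float(np.mean(per_bin))

# Toy usage with random stand-ins for image/pose embeddings and 7-DoF actions.
rng = np.random.default_rng(0)
obs = rng.normal(size=(2000, 32))   # observation features
act = rng.normal(size=(2000, 7))    # end-effector action targets
print(conditional_action_variance(obs, act))
```

Under this diagnostic, adding the missing cues (for example the base pose or a short history of frames) to the observation features should shrink the per-bin action variance.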
Bottom: We verify our method in both simulated and real-world robotic settings and report the performance for the OpenVLA baseline and our 4D-VLA approach.
Demo video: vla-video.mp4
If you find this project or dataset helpful, please consider citing our paper:
@article{zhang2025vla,
  title={4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration},
  author={Zhang, Jiahui and Chen, Yurui and Xu, Yueming and Huang, Ze and Zhou, Yanpeng and Yuan, Yujie and Cai, Xinyue and Huang, Guowei and Quan, Xingyue and Xu, Hang and Zhang, Li},
  journal={arXiv preprint arXiv:2506.22242},
  year={2025},
}