4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration
Accepted to NeurIPS 2025.

arXiv: 2506.22242
Jiahui Zhang1*, Yurui Chen1*, Yueming Xu1, Ze Huang1, Yanpeng Zhou2, Yu-Jie Yuan2, Xinyue Cai2, Guowei Huang2, Xingyue Quan2, Hang Xu2, Li Zhang1
1Fudan University  2Huawei Noah’s Ark Lab 

Top: Our pretraining design philosophy highlights that prior methods often lack key cues in their input for accurate action inference. As a result, the target action distribution A_t(·) exhibits high variance or non-smoothness, which degrades pretraining performance. A rough analysis of the DROID dataset shows that the robot's base is occluded in 67% of samples, causing coordinate-system chaos.
Bottom: We verify our method in both simulated and real-world robotic settings and report the performance of the OpenVLA baseline and our 4D-VLA approach.

🎥 Demo Video

vla-video.mp4

📚 Bibtex

If you find this project or dataset helpful, please consider citing our paper:

@article{zhang2025vla,
    title={4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration},
    author={Zhang, Jiahui and Chen, Yurui and Xu, Yueming and Huang, Ze and Zhou, Yanpeng and Yuan, Yujie and Cai, Xinyue and Huang, Guowei and Quan, Xingyue and Xu, Hang and Zhang, Li},
    year={2025},
    journal={arXiv preprint arXiv:2506.22242},
}
