
RetoVLA

Official code release for
RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models

🎉 Accepted to ICRA 2026 🇦🇹

Jiyeon Koo, Taewan Cho, Hyunjoon Kang, Eunseom Pyo, Tae Gyun Oh, Taeryang Kim, Andrew Jaeyong Choi*
School of Computing, Gachon University
Co-first authors, *corresponding author


Overview

RetoVLA improves lightweight Vision-Language-Action policies by reusing register tokens as a compact source of global spatial context. Instead of discarding those tokens after visual processing, RetoVLA routes them into the Action Expert through gated key/value injection, improving spatial reasoning while preserving a lightweight SmolVLA-style backbone.

This repository is the official RetoVLA code release. The implementation is provided as a minimal patch on top of LeRobot's SmolVLA codebase, together with training and evaluation scripts for LIBERO Spatial.

Highlights

  • Official RetoVLA implementation built on top of LeRobot SmolVLA.
  • Register-token reuse with expert-side spatial context injection.
  • Training and evaluation scripts for LIBERO Spatial.
  • Minimal public release focused on code and reproducibility.

Repository Layout

RetoVLA/
├── code/lerobot_patch/src/lerobot/policies/smolvla/
│   ├── configuration_smolvla.py
│   ├── modeling_smolvla.py
│   └── smolvlm_with_expert.py
├── scripts/
│   ├── train_libero_spatial_retovla.sh
│   └── eval_libero_spatial.sh

Method

RetoVLA keeps the standard SmolVLA pipeline and augments the Action Expert with a second stream of spatial context derived from register tokens. The released patch files contain the configuration changes, policy logic, and expert-side injection path used in the paper.
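The injection can be pictured as a cross-attention step in which register tokens contribute extra key/value pairs whose strength is controlled by a learned gate. The sketch below is illustrative only: NumPy, single-head, a scalar gate, and made-up shapes. It is not the released implementation; the actual logic lives in the patch files above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_register_attention(q, k_main, v_main, registers, w_k, w_v, gate):
    """Cross-attention where register tokens add gated key/value pairs.

    q:         (T, d) Action Expert queries
    k_main:    (S, d) keys from the expert's usual inputs
    v_main:    (S, d) values from the expert's usual inputs
    registers: (R, e) register tokens reused from the vision backbone
    w_k, w_v:  (e, d) projections into the expert's key/value space
    gate:      scalar controlling how much register context is admitted
    """
    d = q.shape[-1]
    # Project register tokens into key/value space and gate their contribution.
    k = np.concatenate([k_main, gate * (registers @ w_k)], axis=0)
    v = np.concatenate([v_main, gate * (registers @ w_v)], axis=0)
    attn = softmax(q @ k.T / np.sqrt(d))  # (T, S + R)
    return attn @ v                       # (T, d)
```

The gate lets training decide how strongly register-derived context influences the expert, without changing the backbone's own attention.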

Demos

  • Real-world montage
  • Mirror cleaning (cleaning a marker stain on a mirror)
  • Dataset collection
  • Custom simulation montage

Released in This Repository

  • RetoVLA patch files for LeRobot SmolVLA.
  • Training and evaluation scripts for LIBERO Spatial.
  • Documentation for setup and usage.
  • Demo GIFs.

Not included:

  • Raw robot datasets.
  • Checkpoints and full training logs.
  • Static images and other project media.
  • Private hardware stack details beyond paper-level description.

Setup

Prerequisites

  • Python 3.10+ recommended.
  • PyTorch with CUDA support.
  • A local LeRobot environment with SmolVLA and LIBERO dependencies installed.

This is a patch-based release, not a standalone robotics framework. You need a local lerobot checkout before running the training or evaluation scripts below.

Configure local paths

export RETOVLA_ROOT=/path/to/RetoVLA
export ROOT_DIR=/path/to/lerobot

Apply the RetoVLA patch

cp "$RETOVLA_ROOT"/code/lerobot_patch/src/lerobot/policies/smolvla/configuration_smolvla.py \
   "$ROOT_DIR"/src/lerobot/policies/smolvla/configuration_smolvla.py
cp "$RETOVLA_ROOT"/code/lerobot_patch/src/lerobot/policies/smolvla/modeling_smolvla.py \
   "$ROOT_DIR"/src/lerobot/policies/smolvla/modeling_smolvla.py
cp "$RETOVLA_ROOT"/code/lerobot_patch/src/lerobot/policies/smolvla/smolvlm_with_expert.py \
   "$ROOT_DIR"/src/lerobot/policies/smolvla/smolvlm_with_expert.py

Training

Run LIBERO Spatial training with:

bash "$RETOVLA_ROOT"/scripts/train_libero_spatial_retovla.sh

Common overrides:

CUDA_VISIBLE_DEVICES=0 \
STEPS=100000 \
OUTPUT_DIR="$ROOT_DIR"/outputs/train/retovla_libero_spatial \
bash "$RETOVLA_ROOT"/scripts/train_libero_spatial_retovla.sh

Evaluation

Evaluate a trained checkpoint with:

POLICY_PATH="$ROOT_DIR"/outputs/train/retovla_libero_spatial/checkpoints/100000/pretrained_model \
NUM_TRIALS=50 \
bash "$RETOVLA_ROOT"/scripts/eval_libero_spatial.sh

By default the scripts read and write under $ROOT_DIR/outputs/....

License

This repository is released under Apache-2.0. Some patch files are modified derivatives of Apache-licensed LeRobot / Hugging Face SmolVLA components; see LICENSE and NOTICE.

Citation

If you find our work useful in your research, please consider citing our paper:

@misc{koo2025retovlareusingregistertokens,
  title={RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models},
  author={Jiyeon Koo and Taewan Cho and Hyunjoon Kang and Eunseom Pyo and Tae Gyun Oh and Taeryang Kim and Andrew Jaeyong Choi},
  year={2025},
  eprint={2509.21243},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.21243}
}
