Yangcen Liu* · Woo Chul Shin* · Yunhai Han · Zhenyang Chen ·
Harish Ravichandar · Danfei Xu
Georgia Institute of Technology · CoRL 2025 (Oral)
(* equal contribution)
Paper · Project Page · Code · Data · Video
ImMimic is an embodiment-agnostic co-training framework that leverages abundant human videos together with a small set of teleoperated robot demonstrations. It bridges the human–robot domain gap via (1) retargeted human hand trajectories as action supervision, (2) DTW mapping between human and robot trajectories (action- or visual-based), and (3) MixUp interpolation in latent and action space, which creates intermediate domains for adaptation.
Pipeline:
- Collect robot demonstrations (teleoperation).
- Extract human actions from videos and retarget to the robot action space.
- Map human ↔ robot trajectories with DTW (action-based or visual-based).
- Apply MixUp interpolation to paired trajectories in latent and action space.
- Co-train diffusion policy on robot demos + interpolated human data.
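The MixUp step above can be sketched as follows. This is a minimal illustration, assuming DTW has already paired the frames; `mixup_pair` and the Beta(α, α) mixing coefficient follow the standard MixUp recipe, not the repo's exact code:

```python
import numpy as np

def mixup_pair(robot_traj, human_traj, alpha=0.2, rng=None):
    """Convexly blend a DTW-paired robot/human trajectory pair.

    Both arrays are (T, A) after DTW alignment and retargeting; the same
    recipe applies to latent features. A sketch, not the repo's code.
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)          # MixUp coefficient in [0, 1]
    return lam * robot_traj + (1.0 - lam) * human_traj

# Toy example: two aligned action trajectories of shape (T, A)
robot = np.zeros((5, 7))
human = np.ones((5, 7))
mixed = mixup_pair(robot, human)
```

Each mixed trajectory lies between the human and robot domains, giving the policy a smooth path for domain adaptation.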
- Create the conda env
```bash
conda create -n immimic python=3.10
conda activate immimic
```
- Install MuJoCo
```bash
pip install "mujoco==3.3.0"
```
- Install PyTorch
```bash
pip install torch==2.6.0 torchvision==0.21.0
```
- Install robosuite v1.5.1
```bash
git clone https://github.com/ARISE-Initiative/robosuite.git
cd robosuite
git checkout v1.5.1
pip install -e . --no-deps
cd ..
```
- Install robomimic
```bash
cd robomimic
pip install -e .
cd ..
```
- Install the remaining requirements
```bash
pip install -r requirements.txt
pip install diffusers==0.36.0
pip install huggingface-hub==0.30.2
```
```
Group: data
└── Group: demo_t
    ├── Dataset: action_absolute (T, A)
    │     # [0:3] eef_position (x, y, z)
    │     # [3:6] eef_orientation_axis_angle (rx, ry, rz)
    │     # [6:A] hand DOF (gripper-dependent)
    │
    └── Group: obs
        ├── Dataset: agentview_image (T, H=180, W=320, 3)
        ├── Dataset: eef_pose_w_gripper (T, P)
        │     # [0:3] eef_position (x, y, z)
        │     # [3:7] eef_orientation_quaternion (w, x, y, z)
        │     # [7:P] hand DOF
        └── Dataset: wrist_image (T, H=180, W=320, 3)

Group: mask
├── Dataset: train (N_train,)  # demo keys
└── Dataset: valid (N_valid,)  # demo keys

# Hand DOF convention:
#   Ability Hand : 6 DOF
#   Allegro Hand : 16 DOF
#   Fin Ray      : 1 DOF
#   Robotiq      : 1 DOF
#
# Therefore:
#   A = 6 + hand_dof
#   P = 7 + hand_dof
```
- Each demo_t is a variable-length trajectory of length T.
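Given the shape convention above, the per-component slices of `action_absolute` can be recovered like this (a numpy sketch; `hand_dof` depends on the gripper, e.g. 1 for Robotiq, 6 for the Ability Hand):

```python
import numpy as np

hand_dof = 6                 # Ability Hand (see Hand DOF convention above)
A = 6 + hand_dof             # action dimension
T = 10                       # trajectory length (varies per demo)
actions = np.zeros((T, A))   # stand-in for data/demo_t/action_absolute

eef_pos = actions[:, 0:3]    # eef_position (x, y, z)
eef_axa = actions[:, 3:6]    # eef_orientation_axis_angle (rx, ry, rz)
hand    = actions[:, 6:A]    # hand DOF
```

The same pattern applies to `eef_pose_w_gripper` with `P = 7 + hand_dof`, since the orientation there is a 4-element quaternion instead of a 3-element axis-angle.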
- A sample dataset can be found here. Place it under
```
robomimic/robomimic/datasets
```
- Compute action stats
```bash
python robomimic/robomimic/datasets/utils/compute_action_stats.py \
    --hdf5_path robomimic/robomimic/datasets/umi_pick_place_0_5.hdf5 \
    --output_path robomimic/robomimic/datasets/umi_pick_place_action_stats.json
```
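The stats file holds per-dimension statistics used to normalize actions before training. A minimal sketch of what such a computation looks like (the exact keys written by `compute_action_stats.py` are assumptions here, not the script's actual output):

```python
import numpy as np

def compute_action_stats(actions):
    """actions: (N, A) array of concatenated absolute actions from all demos.

    Returns per-dimension statistics; a sketch, not the repo's script.
    """
    return {
        "min": actions.min(axis=0).tolist(),
        "max": actions.max(axis=0).tolist(),
        "mean": actions.mean(axis=0).tolist(),
        "std": actions.std(axis=0).tolist(),
    }

stats = compute_action_stats(np.array([[0.0, 1.0], [2.0, 3.0]]))
```

The same stats JSON is passed to the rollout script below via `--norm_path`, so train-time and test-time normalization match.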
- Map human-robot data using DTW (to be updated)
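Action-based DTW can be sketched as follows: it finds a monotone alignment between a retargeted human trajectory and a robot trajectory so that paired frames can later be MixUp-interpolated. A minimal textbook implementation, not the repo's (to-be-released) script:

```python
import numpy as np

def dtw_path(a, b):
    """a: (Ta, D), b: (Tb, D). Returns the DTW alignment as (i, j) pairs."""
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-wise distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the optimal warping path
    i, j, path = Ta, Tb, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

human = np.linspace(0, 1, 8)[:, None]   # toy retargeted human trajectory (8, 1)
robot = np.linspace(0, 1, 5)[:, None]   # toy robot trajectory (5, 1)
path = dtw_path(human, robot)
```

Visual-based DTW follows the same dynamic program with image-feature distances in place of action-space distances.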
- Co-train the policy
```bash
python -u robomimic/robomimic/scripts/train.py --config robomimic/robomimic/exps/configs/umi_pick_place_100_5.json
```
- Install gello_software to run the rollout code, then roll out a trained policy:
```bash
python robomimic/robomimic/scripts/run_trained_policy.py \
    --ckpt_path robomimic/robomimic/exps/policy_trained_models/umi_pick_place/test/20260104222851/models/model_epoch_300.pth \
    --norm_path robomimic/robomimic/datasets/umi_pick_place_action_stats.json
```
