This repository contains the code used to process human videos in Phantom: Training Robots Without Robots Using Only Human Videos and Masquerade: Learning from In-the-wild Human Videos using Data-Editing.
Marion Lepert, Jiaying Fang, Jeannette Bohg
Both projects use data editing to convert human videos into “robotized” demonstrations. They share much of the same codebase, with some differences in the processing pipeline:
**Phantom**
- Input: RGBD videos with a single left hand visible in every frame.
- Data editing: inpaint the single human arm and overlay a rendered robot arm in the same pose.
- Action labels: extract the full 3D end-effector pose (position, orientation, gripper).

**Masquerade**
- Input: RGB videos from Epic Kitchens; one or both hands may be visible, sometimes occluded.
- Data editing: segment and inpaint both arms, then overlay a bimanual robot whose end effectors follow the estimated hand poses (with 3-4 cm of error along the depth direction due to the lack of depth data).
- Action labels: 2D projected waypoints are used as auxiliary supervision only (not full 3D actions).
- Clone this repo recursively:
  ```bash
  git clone --recursive git@github.com:MarionLepert/phantom.git
  ```
- Run the following script from the root directory to install the required conda environment:
  ```bash
  ./install.sh
  ```
- Download the MANO hand models. To do so, go to the MANO website and register to be able to download the models. Download the left and right hand models and move `MANO_LEFT.pkl` and `MANO_RIGHT.pkl` into the `$ROOT_DIR/submodules/phantom-hamer/_DATA/data/mano/` folder.
Process Phantom sample data (manually collected in-lab videos):

```bash
conda activate phantom
python process_data.py demo_name=pick_and_place data_root_dir=../data/raw processed_data_root_dir=../data/processed mode=all
```

Process Masquerade sample data (Epic Kitchens video):

```bash
conda activate phantom
python process_data.py demo_name=epic data_root_dir=../data/raw processed_data_root_dir=../data/processed mode=all --config-name=epic
```

Each video is processed using the following steps:
- Extract human hand bounding boxes: `bbox_processor.py`
  - `mode=bbox`
- Extract 2D human hand poses: `hand_processor.py`
  - `mode=hand2d`: extract the 2D hand pose
- Extract hand and arm segmentation masks: `segmentation_processor.py`
  - `mode=hand_segmentation`: used for depth alignment in hand pose refinement (only works for hand3d)
  - `mode=arm_segmentation`: needed in all cases to inpaint the human
- Extract 3D human hand poses: `hand_processor.py`
  - `mode=hand3d`: extract the 3D hand pose (note: requires depth and was only tested on the left hand)
- Retarget human actions to robot actions: `action_processor.py`
  - `mode=action`
- Smooth human poses: `smoothing_processor.py`
  - `mode=smoothing`
- Remove the hand from the video using inpainting: `handinpaint_processor.py`
  - `mode=hand_inpaint`: the E2FGVI inpainting method is used
- Overlay a virtual robot on the video: `robotinpaint_processor.py`
  - `mode=robot_inpaint`: overlay a single robot (default) or a bimanual robot (epic mode) on the image
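These steps map directly onto the `mode` flag of `process_data.py`, so a subset of the pipeline can be run on its own. A minimal sketch, assuming the same sample data layout as above and using the multi-mode syntax listed in the flag table below:

```bash
# Run only the bounding-box and 2D hand pose steps on the Phantom sample data
python process_data.py demo_name=pick_and_place data_root_dir=../data/raw \
    processed_data_root_dir=../data/processed "mode=[bbox,hand2d]"
```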
| Flag | Type | Required | Choices | Description |
|---|---|---|---|---|
| `--demo_name` | str | ✅ | - | Name of the demonstration/dataset to process |
| `--mode` | str (multiple) | ✅ | `bbox`, `hand2d`, `hand3d`, `hand_segmentation`, `arm_segmentation`, `action`, `smoothing`, `hand_inpaint`, `robot_inpaint`, `all` | Processing modes to run (can specify multiple, e.g. `mode=[bbox,hand2d]`) |
| `--robot_name` | str | ✅ | `Panda`, `Kinova3`, `UR5e`, `IIWA`, `Jaco` | Type of robot to use for overlays |
| `--gripper_name` | str | ❌ | `Robotiq85` | Type of gripper to use |
| `--data_root_dir` | str | ❌ | - | Root directory containing raw video data |
| `--processed_data_root_dir` | str | ❌ | - | Root directory to save processed data |
| `--epic` | bool | ❌ | - | Use Epic-Kitchens dataset processing mode |
| `--bimanual_setup` | str | ❌ | `single_arm`, `shoulders` | Bimanual setup configuration to use (`shoulders` corresponds to the bimanual hardware configuration used in Masquerade) |
| `--target_hand` | str | ❌ | `left`, `right`, `both` | Which hand(s) to target for processing |
| `--camera_intrinsics` | str | ❌ | - | Path to camera intrinsics file |
| `--camera_extrinsics` | str | ❌ | - | Path to camera extrinsics file |
| `--input_resolution` | int | ❌ | - | Resolution of input videos |
| `--output_resolution` | int | ❌ | - | Resolution of output videos |
| `--depth_for_overlay` | bool | ❌ | - | Use depth information for overlays |
| `--demo_num` | str | ❌ | - | Process a single demo number instead of all demos |
| `--debug_cameras` | str (multiple) | ❌ | - | Additional camera names to include for debugging |
| `--constrained_hand` | bool | ❌ | - | Use constrained hand processing |
| `--render` | bool | ❌ | - | Render the robot overlay on the video |
**Note:** Please specify `--bimanual_setup single_arm` along with `--target_hand left` or `--target_hand right` if you are using a single arm. For bimanual setups, use `--bimanual_setup shoulders`.
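For concreteness, here is a sketch of how these flags can be combined, following the Hydra-style `key=value` syntax used in the sample commands above (the demo names and robot choices below are placeholders taken from the sample data):

```bash
# Single-arm processing of a Phantom-style demo, targeting the left hand
python process_data.py demo_name=pick_and_place data_root_dir=../data/raw \
    processed_data_root_dir=../data/processed mode=all \
    robot_name=Panda bimanual_setup=single_arm target_hand=left

# Bimanual (shoulders) processing of an Epic-Kitchens demo
python process_data.py demo_name=epic data_root_dir=../data/raw \
    processed_data_root_dir=../data/processed mode=all \
    robot_name=Kinova3 bimanual_setup=shoulders target_hand=both --config-name=epic
```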
- Phantom: a Zed2 camera was used to capture the sample data at HD1080 resolution.
- Masquerade: Epic-Kitchens videos were used, with the camera intrinsics provided by the dataset. To use videos captured with a different camera or resolution, update the camera intrinsics and extrinsics files in `$ROOT_DIR/phantom/camera/`.
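Alternatively, the intrinsics/extrinsics can be pointed to explicitly through the `--camera_intrinsics` and `--camera_extrinsics` flags from the table above; a minimal sketch (the demo name and file paths are placeholders, and the expected file format is whatever the files in `$ROOT_DIR/phantom/camera/` use):

```bash
python process_data.py demo_name=my_demo data_root_dir=../data/raw \
    processed_data_root_dir=../data/processed mode=all \
    camera_intrinsics=/path/to/my_intrinsics camera_extrinsics=/path/to/my_extrinsics
```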
After processing the video data, the edited data can be used to train a policy. The following files should be used:
- Observations
  - Phantom samples: extract RGB images from `data/processed/pick_and_place/*/video_overlay_Panda_single_arm.mkv`
  - Epic (in-the-wild data) samples: extract RGB images from `data/processed/epic/*/video_overlay_Kinova3_shoulders.mkv`
- Actions
  - Phantom samples: all data is stored in `data/processed/pick_and_place/*/inpaint_processor/training_data_single_arm.npz`
  - Epic (in-the-wild data) samples: all data is stored in `data/processed/epic/*/inpaint_processor/training_data_shoulders.npz`
In Phantom, Diffusion Policy was used for policy training.
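As a minimal sketch of how these files could be consumed (assuming OpenCV and NumPy are available; the keys stored inside the `.npz` files are not listed here, so the snippet simply prints them):

```python
import glob

import cv2
import numpy as np

for demo_dir in sorted(glob.glob("data/processed/pick_and_place/*")):
    # Observations: RGB frames from the robot-overlay video
    cap = cv2.VideoCapture(f"{demo_dir}/video_overlay_Panda_single_arm.mkv")
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

    # Actions: retargeted robot actions produced by the processing pipeline
    training_data = np.load(f"{demo_dir}/inpaint_processor/training_data_single_arm.npz")
    print(demo_dir, len(frames), list(training_data.files))
```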
```bibtex
@article{lepert2025phantomtrainingrobotsrobots,
  title={Phantom: Training Robots Without Robots Using Only Human Videos},
  author={Marion Lepert and Jiaying Fang and Jeannette Bohg},
  year={2025},
  eprint={2503.00779},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2503.00779},
}
```

```bibtex
@misc{lepert2025masqueradelearninginthewildhuman,
  title={Masquerade: Learning from In-the-wild Human Videos using Data-Editing},
  author={Marion Lepert and Jiaying Fang and Jeannette Bohg},
  year={2025},
  eprint={2508.09976},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2508.09976},
}
```
