RwoR: Generating Robot Demonstrations from Human Hand for Robot Policy Learning without Robot

IROS 2025

Liang Heng1*    Xiaoqi Li1*    Shangqing Mao1    Jiaming Liu1    Ruolin Liu1    Jingli Wei1   
Yu-Kai Wang1    Yueru Jia1    Chenyang Gu1    Rui Zhao2    Shanghang Zhang1    Hao Dong1
1School of Computer Science, Peking University   2Tencent Robotics X Laboratory    *Equal Contribution
RwoR Pipeline

Figure: RwoR Pipeline Overview
Demonstrations
*Human hand videos and corresponding generated UMI gripper execution videos.
Generated Demos

White sponge
Banana
Yellow macaron
Green bowl
White charger
Gray cup
Real-world Execution Demos

During real-robot deployment, our method is evaluated across 9 tasks on the Franka Research 3 (FR3) robot equipped with a 3D-printed UMI gripper. We use a GoPro 9 camera mounted at the wrist to obtain real-world visual observations. For each task, 50 human hand demonstrations are collected for training within a specific workspace range. We train an agent for each task and evaluate it over 15 trials within the training workspace. The hand demonstrations are shown above; the Franka execution demos cover the tasks listed below.

Open Drawer
Unplug Charger
Close Microwave
Lift Bottle
Pour Water
Slide Block
Stack Bowl
Table Rearrange
Wipe Plate
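For concreteness, the deployment setup above can be summarized as a small configuration sketch. The class and field names below (TaskEvalSetup, num_hand_demos, num_eval_trials, etc.) are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch of the per-task deployment setup described above.
# All names are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class TaskEvalSetup:
    task: str
    num_hand_demos: int = 50        # human hand demonstrations collected per task
    num_eval_trials: int = 15       # evaluation trials within the training workspace
    robot: str = "Franka Research 3 (FR3)"
    gripper: str = "3D-printed UMI gripper"
    wrist_camera: str = "GoPro 9"


TASKS = [
    "Open Drawer", "Unplug Charger", "Close Microwave",
    "Lift Bottle", "Pour Water", "Slide Block",
    "Stack Bowl", "Table Rearrange", "Wipe Plate",
]

# One evaluation setup per real-world task.
SETUPS = [TaskEvalSetup(task=t) for t in TASKS]
```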
Experiments
Comparisons with Baseline
Table: Success-rate comparison with the UMI baseline and the rule-based texture mapping baseline

We adopt UMI as our comparison baseline: it uses a hand-held gripper to collect robot gripper demonstrations directly and therefore serves as an upper bound. As shown in the table, our framework achieves a success rate comparable to UMI, demonstrating that the quality of our generated robot gripper demonstrations is on par with those collected directly using a hand-held device.

We further compare our method with a baseline (the third row in the table) that uses an alternative approach to generate the training ground truth. Specifically, we replace our generative-model transfer from human hand demonstrations to UMI gripper demonstrations with a simple rule-based texture mapping method: we remove the hand from the human hand demonstration, inpaint the background, and then overlay the UMI gripper pattern onto the image. The model trained on this data suffers a significant performance drop.
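As a rough illustration of this rule-based baseline (not the paper's exact implementation), the per-frame processing could look like the sketch below, assuming a hand segmentation mask and a gripper pattern already rendered at the desired pose are available.

```python
# Minimal sketch of the rule-based texture mapping baseline described above.
# Function and argument names are illustrative assumptions.
import cv2
import numpy as np


def rule_based_gripper_overlay(frame: np.ndarray,
                               hand_mask: np.ndarray,
                               gripper_rgba: np.ndarray) -> np.ndarray:
    """Remove the hand, inpaint the background, and paste a UMI gripper pattern.

    frame        : H x W x 3 BGR image from the human hand demonstration.
    hand_mask    : H x W uint8 mask, 255 where the hand is (e.g. from a hand segmenter).
    gripper_rgba : H x W x 4 BGRA gripper pattern, already placed at the desired pose.
    """
    # 1. Erase the hand region and fill the hole with classical inpainting.
    background = cv2.inpaint(frame, hand_mask, 5, cv2.INPAINT_TELEA)

    # 2. Alpha-blend the gripper pattern on top of the inpainted background.
    alpha = gripper_rgba[..., 3:4].astype(np.float32) / 255.0
    gripper_bgr = gripper_rgba[..., :3].astype(np.float32)
    composited = alpha * gripper_bgr + (1.0 - alpha) * background.astype(np.float32)
    return composited.astype(np.uint8)
```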

Generalization Study
Action: Rotate Block, Unstack Block
Instance: Yellow Cup, White Block
Table: Generalization results on unseen actions and instances

We use the generated robot demonstrations to train the policy model and measure its success rate, which reflects the quality of the generated robot gripper demonstrations on unseen action types and instances.
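A minimal sketch of this evaluation protocol, with hypothetical function names (generator, policy_trainer, run_trial), might look like the following.

```python
# Sketch of the protocol above: policy success rate serves as a proxy for the
# quality of the generated gripper demonstrations. All names are hypothetical.
def demo_quality_proxy(generator, policy_trainer, run_trial, hand_demos, num_trials=15):
    """Generate gripper demos from hand demos, train a policy, and return its success rate."""
    gripper_demos = [generator(demo) for demo in hand_demos]   # hand video -> UMI gripper video
    policy = policy_trainer(gripper_demos)                     # e.g. train a visuomotor policy
    successes = sum(bool(run_trial(policy, trial)) for trial in range(num_trials))
    return successes / num_trials
```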

For action-type generalization, the generative model's training dataset includes the action "slide block"; we then use the trained diffusion-based generative model to generate robot demonstrations for other actions, such as "rotate block" and "unstack block". Even on these unseen actions, the success rate of the trained policy model remains high.

For instance generalization, we test the trained generative model on unseen instance appearances that are not included in its training dataset. The success rate remains consistent at 0.87 for both the seen instance (pouring with the "white cup") and the unseen instance (pouring with the "yellow cup"). Additionally, the unseen instance (sliding the "white block") shows only a minor performance drop of 0.04 compared to the seen instance (sliding the "red block").