During real robot deployment, our method is evaluated across 9 tasks on the Franka Research 3 (FR3) robot with a 3D-printed UMI gripper. We use a GoPro 9 camera to obtain real-world visual observations from the wrist view. For each task, 50 training human hand demonstrations are collected within a fixed workspace range. We train a separate agent for each task and evaluate it over 15 trials within the training workspace. The human hand demonstrations and the corresponding Franka executions are shown above.
We adopt UMI as our comparison baseline; it uses a hand-held gripper to collect robot gripper demonstrations directly and therefore serves as an upper bound. As shown in the table, our framework achieves a success rate similar to UMI's, demonstrating that the quality of our generated robot gripper demonstrations is comparable to demonstrations collected directly with a hand-held device.
Moreover, we compare our method with a baseline (the third row in the table) that uses an alternative approach to generate the training ground truth. Specifically, we replace the generative model that transfers human hand demonstrations into UMI gripper demonstrations with a simple rule-based texture-mapping method: we remove the hand from the human hand demonstration, inpaint the background, and then overlay the UMI gripper pattern onto the image. The model trained on this data suffers a significant performance drop.
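A minimal sketch of such a rule-based texture-mapping baseline is shown below; it assumes a precomputed hand mask and a pre-aligned gripper template with an alpha channel, and the function and variable names are illustrative rather than taken from the original pipeline.

```python
import cv2
import numpy as np

def rule_based_gripper_overlay(frame, hand_mask, gripper_rgba):
    """Hypothetical rule-based baseline: erase the hand, inpaint the hole,
    then paste a fixed UMI gripper texture onto the wrist-view image.

    frame:        HxWx3 uint8 BGR image from the hand demonstration.
    hand_mask:    HxW uint8 mask (255 where the hand is visible).
    gripper_rgba: HxWx4 uint8 gripper template aligned to the wrist view;
                  the alpha channel marks gripper pixels.
    """
    # 1. Remove the hand and fill the hole from the surrounding background.
    inpainted = cv2.inpaint(frame, hand_mask, inpaintRadius=5,
                            flags=cv2.INPAINT_TELEA)

    # 2. Alpha-composite the fixed gripper pattern over the inpainted image.
    alpha = gripper_rgba[..., 3:4].astype(np.float32) / 255.0
    gripper_bgr = gripper_rgba[..., :3].astype(np.float32)
    out = alpha * gripper_bgr + (1.0 - alpha) * inpainted.astype(np.float32)
    return out.astype(np.uint8)
```

Unlike the generative transfer, a composite of this form does not account for lighting, occlusion, or a scene-dependent gripper pose, which is consistent with the performance drop observed for this baseline.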
We use the generated robot demonstrations to train the policy model and measure its success rate, which reflects the quality of the generated robot gripper demonstrations on unseen action types and instances.
For action-type generalization, the generative model's training dataset contains the action "slide block"; we then use the trained diffusion-based generative model to generate robot demonstrations for other actions, such as "rotate block" and "unstack block." Even for these unseen actions, the success rate of the trained policy model remains high.
For instance generalization, we test the trained generative model on unseen instance appearances that are not included anywhere in its training dataset. The success rate remains 0.87 for both the seen instance (pouring the "white cup") and the unseen instance (pouring the "yellow cup"). Additionally, sliding the unseen "white block" shows only a minor drop of 0.04 compared to sliding the seen "red block".