EAM: Embodiment Agnostic Long-Horizon Manipulation using Human-Play Data

Georgia Institute of Technology
*Equal Contribution

Abstract

Can we learn a hierarchical visuomotor control policy that generalizes to novel scenes, objects, and geometries without scaling teleoperated robot demonstrations? Recent works have shown impressive performance on manipulation tasks by learning policies from robot teleoperation data collected at scale. To ensure true autonomy in the real world, these policies should generalize to multiple tasks, visual domains, and diverse object geometries in unstructured environments. A scalable solution must reduce the dependence on collecting a large number of teleoperated demonstrations while ensuring the alternative can still be used to learn a representation that guides low-level control effectively. We propose learning a policy from human-play data: trajectories of humans freely interacting with their environment. Human-play data provides rich guidance about high-level actions to the low-level controller. We demonstrate the effectiveness of our high-level policy by pairing it with low-level control methods that use few teleoperation demos. Further, we examine the feasibility of a hierarchical policy that requires no teleoperation data and can generalize to any robot embodiment while obeying the kinematic constraints of that embodiment. We present our results and ablation studies on tasks evaluated in the real world.

Motivation

Can we generalize robot learning policies to unstructured scenes, novel objects, and diverse motions without collecting millions of teleoperated trajectories?

Learning from Visual Demonstrations

Imitation Learning

  • Uses expert demonstrations to train robots to perform tasks under minor variations in conditions

Challenges to LfD

  • Requires abundant teleoperation data, which is tedious and time-consuming to collect
  • Poor generalization to novel visual domains
  • Embodiment gap with different manipulators

Our Approach

[Figure: overview of our approach]

Bridging the embodiment gap

We experiment with multiple techniques for bridging the human-robot embodiment gap, listed below (a sketch of the KL-divergence option follows this list):

  • DINOv2 + LoRA (Rein)
  • KL divergence
  • Masking out the manipulator
  • Co-training with robot play data + depth data
  • Overlaying corresponding joints on the hand and the robot
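For the KL-divergence option, one way to encourage embodiment-agnostic features is to map human and robot observation features to diagonal Gaussians in a shared latent space and penalize the divergence between them. The sketch below is a minimal illustration under that assumption; the module names, feature dimensions, and pairing of human/robot features are hypothetical and not the exact EAM training code.

    # Minimal sketch of a KL-divergence alignment loss between human and robot
    # observation embeddings. Module and variable names are hypothetical.
    import torch
    from torch import nn

    class GaussianHead(nn.Module):
        """Maps a feature vector to a diagonal Gaussian over a shared latent space."""
        def __init__(self, feat_dim: int, latent_dim: int):
            super().__init__()
            self.mu = nn.Linear(feat_dim, latent_dim)
            self.log_var = nn.Linear(feat_dim, latent_dim)

        def forward(self, feats):
            return self.mu(feats), self.log_var(feats)

    def kl_alignment_loss(mu_h, log_var_h, mu_r, log_var_r):
        """KL( N(mu_h, var_h) || N(mu_r, var_r) ), summed over latent dims, averaged over the batch."""
        var_h, var_r = log_var_h.exp(), log_var_r.exp()
        kl = 0.5 * (log_var_r - log_var_h + (var_h + (mu_h - mu_r) ** 2) / var_r - 1.0)
        return kl.sum(dim=-1).mean()

    # Usage with paired human / robot features from a shared vision encoder (shapes assumed):
    human_head, robot_head = GaussianHead(768, 128), GaussianHead(768, 128)
    f_human, f_robot = torch.randn(8, 768), torch.randn(8, 768)
    loss = kl_alignment_loss(*human_head(f_human), *robot_head(f_robot))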

[Figure: Embodiment 1 and Embodiment 2]


Low-level Policy: Action Chunking with Transformers (ACT)

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, Zhao et al. 2023
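ACT predicts a chunk of future actions at every timestep and, at inference time, smooths the overlapping predictions with temporal ensembling using exponential weights. Below is a minimal sketch of that ensembling loop; the policy and environment interfaces are assumed, and it illustrates the idea from Zhao et al. 2023 rather than reproducing their implementation.

    # Minimal sketch of ACT-style temporal ensembling at inference time.
    # `policy` and `env` are hypothetical interfaces: policy(obs) returns a
    # (chunk_size, action_dim) array of future actions; env.step(a) returns the next obs.
    import numpy as np

    def rollout(policy, env, horizon=400, chunk_size=100, m=0.01):
        buffer = [[] for _ in range(horizon + chunk_size)]   # buffer[t]: all predictions made for timestep t
        obs = env.reset()
        for t in range(horizon):
            chunk = policy(obs)                               # actions planned for timesteps t .. t+chunk_size-1
            for i in range(chunk_size):
                buffer[t + i].append(chunk[i])
            preds = np.stack(buffer[t])                       # every chunk that covered timestep t
            weights = np.exp(-m * np.arange(len(preds)))      # w_i = exp(-m * i); oldest prediction weighted highest
            weights /= weights.sum()
            action = (weights[:, None] * preds).sum(axis=0)   # exponentially weighted average action
            obs = env.step(action)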

Results

Human Play Data

High-level policy evaluation on human-play data: ground truth (green) and prediction (purple).

Qualitative

Our high-level policy combined with ACT demonstrated object- and scene-level generalization.


Robustness and real-time replanning

Real-time replanning

Quantitative

  • A DINOv2 vision encoder with low-rank adaptation (LoRA) achieves the best performance on the distance metric.
  • A top view combined with wrist cameras achieves the best success rates on unseen objects by leveraging style and latent information from the ACT encoder.
  • The high-level policy enables scene-level generalization.
  • EAM high-level with ACT low-level achieves the best success rate by learning embodiment-agnostic representations with KL-loss-based optimization.
  • Integrating the high-level policy with a gradient-based IK solver for joint-space prediction also yields success rates comparable to the other approaches (a sketch of such a solver follows this list).
  • EAM + ACT trained on data from both the human and robot embodiments achieves the best success rate.
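The gradient-based IK solver referenced above turns the high-level policy's task-space (end-effector) predictions into joint angles. The sketch below shows the general idea using autograd on a differentiable forward-kinematics function; the FK model, solver settings, and toy example are assumptions for illustration, not the exact solver used in EAM.

    # Minimal sketch of a gradient-based IK solver: find joint angles q whose forward
    # kinematics reaches a target end-effector position. `forward_kinematics` is an
    # assumed differentiable model of the robot; joint limits are omitted for brevity.
    import torch

    def gradient_ik(forward_kinematics, target_pos, q_init, lr=0.05, iters=200):
        q = q_init.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([q], lr=lr)
        for _ in range(iters):
            optimizer.zero_grad()
            loss = torch.sum((forward_kinematics(q) - target_pos) ** 2)  # squared position error
            loss.backward()
            optimizer.step()
        return q.detach()

    # Usage with a toy 2-link planar arm as the differentiable FK model:
    def fk_2link(q, l1=0.5, l2=0.4):
        x = l1 * torch.cos(q[0]) + l2 * torch.cos(q[0] + q[1])
        y = l1 * torch.sin(q[0]) + l2 * torch.sin(q[0] + q[1])
        return torch.stack([x, y, torch.zeros(())])

    q_star = gradient_ik(fk_2link, torch.tensor([0.6, 0.3, 0.0]), torch.tensor([0.1, 0.1]))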


Future Work

Overlaying CLIP attention on the input image. Prompt: "Toy, Bowl and Hand".
  • A non-hierarchical, single-stage policy to remove the need for low-level teleoperation data
  • Extending to cluttered environments and more complex motion trajectories
  • Using CLIP to allow goal specification through language and to provide better task-specific scene abstraction (a sketch of one way to obtain such a relevance map follows this list)
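One simple way to approximate a text-conditioned relevance map like the one shown above is to project CLIP's ViT patch tokens into the joint image-text space and score them against the prompt embedding. The sketch below uses the Hugging Face transformers CLIP implementation; the input file, prompt phrasing, and projection heuristic are assumptions for illustration, not the authors' pipeline.

    # Minimal sketch of a text-conditioned CLIP relevance map for a prompt such as
    # "Toy, Bowl and Hand": project ViT patch tokens into CLIP's joint space and take
    # their cosine similarity to the text embedding. Illustrative only.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("scene.jpg")                        # hypothetical input frame
    inputs = processor(text=["a toy, a bowl and a hand"], images=image,
                       return_tensors="pt", padding=True)

    with torch.no_grad():
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        patches = vision_out.last_hidden_state[:, 1:]      # drop the CLS token: (1, 49, 768)
        patches = model.visual_projection(model.vision_model.post_layernorm(patches))
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])

    sim = torch.nn.functional.cosine_similarity(patches, text_emb[:, None, :], dim=-1)
    heatmap = sim.reshape(7, 7)                            # 7x7 patch grid for a 224x224 input
    # Upsample `heatmap` to the image resolution and alpha-blend it over the frame for the overlay.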