EAM: Embodiment Agnostic Long-Horizon Manipulation using Human-Play Data

Georgia Institute of Technology
*Equal Contribution

Abstract

Can we learn a hierarchical visuomotor control policy that generalizes to novel scenes, objects, and geometries without scaling teleoperated robot demonstrations? Recent works have shown impressive performance on manipulation tasks by learning policies from robot teleoperation data collected at scale. To ensure true autonomy in the real world, these policies should generalize to multiple tasks, visual domains, and diverse object geometries in unstructured environments. A scalable solution must reduce the dependence on collecting a large number of teleoperated demonstrations while ensuring the alternative can still be used to learn a representation that guides low-level control effectively. We propose learning a policy from human-play data: trajectories of humans freely interacting with their environment. Human-play data provides rich guidance about high-level actions to the low-level controller. We demonstrate the effectiveness of our high-level policy by pairing it with low-level control methods that use few teleoperation demos. Further, we examine the feasibility of a hierarchical policy that requires no teleoperation data and can generalize to any robot embodiment while obeying the kinematic constraints of that embodiment. We present our results and ablation studies on tasks evaluated in the real world.

Motivation

Can we generalize robot learning policies to unstructured scenes, novel objects, and diverse motions without collecting millions of teleoperated trajectories?

Learning from Visual Demonstrations

Imitation Learning

  • Uses expert demonstrations to train robots to perform tasks under minor variations in conditions

Challenges to LfD

  • Requires abundant teleoperation data, which is tedious and time-consuming to collect
  • Poor generalization to novel visual domains
  • Embodiment gap with different manipulators

Our Approach

[Figure: overview of our approach]

Bridging the embodiment gap

We experiment with multiple techniques for bridging the human-robot embodiment gap, listed below (a sketch of the KL-divergence option follows this list):

  • DINOv2 + LoRA (Rein)
  • KL divergence
  • Masking out the manipulator
  • Co-training with robot play data + depth data
  • Overlaying corresponding joints on the hand and the robot
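For the KL-divergence option, one way to encourage embodiment-agnostic features is to map human and robot observation features to diagonal Gaussians in a shared latent space and penalize the divergence between them. The sketch below is a minimal illustration under that assumption; the module names, feature dimensions, and pairing of human/robot features are hypothetical and not the exact EAM training code.

    # Minimal sketch of a KL-divergence alignment loss between human and robot
    # observation embeddings. Module and variable names are hypothetical.
    import torch
    from torch import nn

    class GaussianHead(nn.Module):
        """Maps a feature vector to a diagonal Gaussian over a shared latent space."""
        def __init__(self, feat_dim: int, latent_dim: int):
            super().__init__()
            self.mu = nn.Linear(feat_dim, latent_dim)
            self.log_var = nn.Linear(feat_dim, latent_dim)

        def forward(self, feats):
            return self.mu(feats), self.log_var(feats)

    def kl_alignment_loss(mu_h, log_var_h, mu_r, log_var_r):
        """KL( N(mu_h, var_h) || N(mu_r, var_r) ), summed over latent dims, averaged over the batch."""
        var_h, var_r = log_var_h.exp(), log_var_r.exp()
        kl = 0.5 * (log_var_r - log_var_h + (var_h + (mu_h - mu_r) ** 2) / var_r - 1.0)
        return kl.sum(dim=-1).mean()

    # Usage with paired human / robot features from a shared vision encoder (shapes assumed):
    human_head, robot_head = GaussianHead(768, 128), GaussianHead(768, 128)
    f_human, f_robot = torch.randn(8, 768), torch.randn(8, 768)
    loss = kl_alignment_loss(*human_head(f_human), *robot_head(f_robot))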

[Figure: Embodiment 1 and Embodiment 2]


Low-level Policy: Action Chunking with Transformers (ACT)

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, Zhao et al. 2023
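ACT predicts a chunk of future actions at every timestep and, at inference time, smooths the overlapping predictions with temporal ensembling using exponential weights. Below is a minimal sketch of that ensembling loop; the policy and environment interfaces are assumed, and it illustrates the idea from Zhao et al. 2023 rather than reproducing their implementation.

    # Minimal sketch of ACT-style temporal ensembling at inference time.
    # `policy` and `env` are hypothetical interfaces: policy(obs) returns a
    # (chunk_size, action_dim) array of future actions; env.step(a) returns the next obs.
    import numpy as np

    def rollout(policy, env, horizon=400, chunk_size=100, m=0.01):
        buffer = [[] for _ in range(horizon + chunk_size)]   # buffer[t]: all predictions made for timestep t
        obs = env.reset()
        for t in range(horizon):
            chunk = policy(obs)                               # actions planned for timesteps t .. t+chunk_size-1
            for i in range(chunk_size):
                buffer[t + i].append(chunk[i])
            preds = np.stack(buffer[t])                       # every chunk that covered timestep t
            weights = np.exp(-m * np.arange(len(preds)))      # w_i = exp(-m * i); oldest prediction weighted highest
            weights /= weights.sum()
            action = (weights[:, None] * preds).sum(axis=0)   # exponentially weighted average action
            obs = env.step(action)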

Results

Human Play Data

High-level policy evaluation on human-play data: ground truth (green) and prediction (purple).

Qualitative

Our high-level policy combined with ACT demonstrated object- and scene-level generalization.


Robustness and real-time replanning

Real-time replanning

Quantitative

  • A DINOv2 vision encoder with low-rank adaptation (LoRA) achieves the best performance on the distance metric.
  • A top view combined with wrist cameras achieves the best success rates on unseen objects by leveraging style and latent information from the ACT encoder.
  • The high-level policy enables scene-level generalization.
  • EAM high-level with ACT low-level achieves the best success rate by learning embodiment-agnostic representations with KL-loss-based optimization.
  • Integrating the high-level policy with a gradient-based IK solver for joint-space prediction also yields success rates comparable to the other approaches (a sketch of such a solver follows this list).
  • EAM + ACT trained on data from both the human and robot embodiments achieves the best success rate.
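The gradient-based IK solver referenced above turns the high-level policy's task-space (end-effector) predictions into joint angles. The sketch below shows the general idea using autograd on a differentiable forward-kinematics function; the FK model, solver settings, and toy example are assumptions for illustration, not the exact solver used in EAM.

    # Minimal sketch of a gradient-based IK solver: find joint angles q whose forward
    # kinematics reaches a target end-effector position. `forward_kinematics` is an
    # assumed differentiable model of the robot; joint limits are omitted for brevity.
    import torch

    def gradient_ik(forward_kinematics, target_pos, q_init, lr=0.05, iters=200):
        q = q_init.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([q], lr=lr)
        for _ in range(iters):
            optimizer.zero_grad()
            loss = torch.sum((forward_kinematics(q) - target_pos) ** 2)  # squared position error
            loss.backward()
            optimizer.step()
        return q.detach()

    # Usage with a toy 2-link planar arm as the differentiable FK model:
    def fk_2link(q, l1=0.5, l2=0.4):
        x = l1 * torch.cos(q[0]) + l2 * torch.cos(q[0] + q[1])
        y = l1 * torch.sin(q[0]) + l2 * torch.sin(q[0] + q[1])
        return torch.stack([x, y, torch.zeros(())])

    q_star = gradient_ik(fk_2link, torch.tensor([0.6, 0.3, 0.0]), torch.tensor([0.1, 0.1]))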


Future Work

Overlaying CLIP attention on the input image. Prompt: "Toy, Bowl and Hand".
  • A non-hierarchical, single-stage policy to remove the need for low-level teleoperation data
  • Extending to cluttered environments and more complex motion trajectories
  • Using CLIP to allow goal specification through language and to provide better task-specific scene abstraction (a sketch of one way to obtain such a relevance map follows this list)
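One simple way to approximate a text-conditioned relevance map like the one shown above is to project CLIP's ViT patch tokens into the joint image-text space and score them against the prompt embedding. The sketch below uses the Hugging Face transformers CLIP implementation; the input file, prompt phrasing, and projection heuristic are assumptions for illustration, not the authors' pipeline.

    # Minimal sketch of a text-conditioned CLIP relevance map for a prompt such as
    # "Toy, Bowl and Hand": project ViT patch tokens into CLIP's joint space and take
    # their cosine similarity to the text embedding. Illustrative only.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("scene.jpg")                        # hypothetical input frame
    inputs = processor(text=["a toy, a bowl and a hand"], images=image,
                       return_tensors="pt", padding=True)

    with torch.no_grad():
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        patches = vision_out.last_hidden_state[:, 1:]      # drop the CLS token: (1, 49, 768)
        patches = model.visual_projection(model.vision_model.post_layernorm(patches))
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])

    sim = torch.nn.functional.cosine_similarity(patches, text_emb[:, None, :], dim=-1)
    heatmap = sim.reshape(7, 7)                            # 7x7 patch grid for a 224x224 input
    # Upsample `heatmap` to the image resolution and alpha-blend it over the frame for the overlay.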