Scalable Multi-Sensor Robot Imitation Learning via Task-Level Domain Consistency
Recent work in visual end-to-end learning for robotics has shown the promise of imitation learning across a variety of tasks. However, such approaches are often expensive and require vast amounts of real world training demonstrations. Additionally, they rely on a time-consuming evaluation process for identifying the best model to deploy in the real world. These challenges can be mitigated by simulation - by supplementing real world data with simulated demonstrations and using simulated evaluations to identify strong policies. However, this introduces the well-known ``reality gap'' problem, where simulator inaccuracies decorrelates performance in simulation from reality. In this paper, we build on top of prior work in GAN-based domain adaptation and introduce the notion of a Task Consistency Loss (TCL), a self-supervised contrastive loss that encourages sim and real alignment both at the feature and action-prediction level. We demonstrate the effectiveness of our approach on the challenging task of latched-door opening with a 9 Degree-of-Freedom (DoF) mobile manipulator from raw RGB and depth images. While most prior work in vision-based manipulation operate from a fixed, third person view, mobile manipulation couples the challenges of locomotion and manipulation with greater visual diversity and action space complexity. We find that we are able to achieve 77% success on seen and unseen scenes, a +30% increase from the baseline, using only ~16 hours of teleoperation demonstrations in sim and real.