Research Areas
Authored Publications
Sort By
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Alexander Herzog
Alexander Toshkov Toshev
Andy Zeng
Anthony Brohan
Brian Andrew Ichter
Byron David
Chelsea Finn
Clayton Tan
Diego Reyes
Dmitry Kalashnikov
Eric Victor Jang
Jarek Liam Rettinghouse
Jornell Lacanlale Quiambao
Julian Ibarz
Karol Hausman
Kyle Alan Jeffrey
Linda Luu
Mengyuan Yan
Michael Soogil Ahn
Nicolas Sievers
Noah Brown
Omar Eduardo Escareno Cortes
Peng Xu
Peter Pastor Sampedro
Rosario Jauregui Ruano
Sally Augusta Jesmonth
Sergey Levine
Steve Xu
Yao Lu
Yevgen Chebotar
Yuheng Kuang
Conference on Robot Learning (CoRL) (2022)
Preview abstract
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context.
For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment.
We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate.
The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.
We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment.
We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator.
The project's website and the video can be found at \url{say-can.github.io}.
View details
Long-Range Indoor Navigation with PRM-RL
Anthony Francis
Marek Fiser
Tsang-Wei Lee
IEEE Transactions on Robotics (T-RO) (2020), pp. 19
RL-RRT: Kinodynamic Motion Planning via Learning Reachability Estimators from RL Policies
Marek Fiser
Lydia Tapia
IEEE Robotics and Automation Letters (RA-L) (2019)
Preview abstract
This paper addresses two challenges facing sampling-based kinodynamic motion planning: a way to identify good candidate states for local transitions and the subsequent computationally intractable steering between these candidate states. Through the combination of sampling-based planning, a Rapidly Exploring Randomized Tree (RRT) and an efficient kinodynamic motion planner through machine learning, we propose an efficient solution to long-range planning for kinodynamic motion planning. First, we use deep reinforcement learning to learn an obstacle-avoiding policy that maps a robot's sensor observations to actions, which is used as a local planner during planning and as a controller during execution. Second, we train a reachability estimator in a supervised manner, which predicts the RL policy's time to reach a state in the presence of obstacles. Lastly, we introduce RL-RRT that uses the RL policy as a local planner, and the reachability estimator as the distance function to bias tree-growth towards promising regions. We evaluate our method on three kinodynamic systems, including physical robot experiments. Results across all three robots tested indicate that RL-RRT outperforms state of the art kinodynamic planners in efficiency, and also provides a shorter path finish time than a steering function free method. The learned local planner policy and accompanying reachability estimator demonstrate transferability to the previously unseen experimental environments, making RL-RRT fast because the expensive computations are replaced with simple neural network inference. Video: https://youtu.be/dDMVMTOI8KY
View details
Provably Robust Blackbox Optimization for Reinforcement Learning
Aldo Pacchiano
Jack Parker-Holder
Yunhao Tang
Yuxiang Yang
Conference on Robot Learning (CoRL 2019)
Preview abstract
Interest in derivative-free optimization (DFO) and "evolutionary strategies" (ES) has recently surged in the Reinforcement Learning (RL) community, with growing evidence that they can match state of the art methods for policy optimization problems in Robotics. However, it is well known that DFO methods suffer from prohibitively high sampling complexity. They can also be very sensitive to noisy rewards and stochastic dynamics. In this paper, we propose a new class of algorithms, called Robust Blackbox Optimization (RBO). Remarkably, even if up to 23% of all the measurements are arbitrarily corrupted, RBO can provably recover gradients to high accuracy. RBO relies on learning gradient flows using robust regression methods to enable off-policy updates. On several MuJoCo robot control tasks, when all other RL approaches collapse in the presence of adversarial noise, RBO is able to train policies effectively. We also show that RBO can be applied to legged locomotion tasks including path tracking for quadruped robots.
View details
Time-Contrastive Networks: Self-Supervised Learning from Video
Corey Lynch
Yevgen Chebotar
Eric Jang
Stefan Schaal
Sergey Levine
Proceedings of International Conference in Robotics and Automation (ICRA 2018) + Deep Learning for Robotic Vision (DLRV) Workshop at CVPR 2017 + Deep Reinforcement Learning Symposium at NIPS 2017 (2018)
Preview abstract
We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captures the relationships between end-effectors (hands or robot grippers) and the environment, object attributes, and body pose. We train our representations using a triplet loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm. While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single 3rd-person demonstration by a human. Reward functions obtained by following the human demonstrations under the learned representation enable efficient reinforcement learning that is practical for real-world robotic systems. Video results, open-source code and dataset are available at https://sermanet.github.io/imitate
View details
Learning 6-DOF Grasping Interaction via Deep 3D Geometry-aware Representations
Xinchen Yan
Mohi Khansari
Abhinav Gupta
James Davidson
Honglak Lee
(2018)
Preview abstract
This paper focuses on the problem of learning 6-DOF grasping with a parallel jaw gripper in simulation. Compared to existing approaches that are specialized in three-dimensional grasping (i.e., top-down grasping or side-grasping), using a 6-DOF grasping model allows the robot to learn a richer set of grasping interactions given less physical constraints; hence, potentially enhancing the robustness of grasping and robot dexterity. However, learning 6-DOF grasping is challenging due to a high dimensional state space, difficulty in collecting large-scale data, and many variations of an object’s visual appearance (i.e., geometry, material, texture, and illumination). We propose the notion of a geometry-aware representation in grasping based on the assumption that knowledge of 3D geometry is at the heart of interaction. Our key idea is constraining and regularizing grasping interaction learning through 3D geometry prediction.
View details