Exploring Nature-Inspired Robot Agility

April 3, 2020

Posted by Xue Bin (Jason) Peng, Student Researcher and Sehoon Ha, Research Scientist, Robotics at Google



Whether it’s a dog chasing after a ball or a horse jumping over obstacles, animals can effortlessly perform an incredibly rich repertoire of agile skills. Developing robots that can replicate these agile behaviors would open opportunities to deploy robots for sophisticated tasks in the real world. But designing controllers that enable legged robots to perform such agile behaviors can be very challenging. While reinforcement learning (RL) is an approach often used to automate the development of robotic skills, a number of technical hurdles remain and, in practice, substantial manual overhead is still required. Designing reward functions that lead to effective skills can itself require a great deal of expert insight, and often involves a lengthy reward-tuning process for each desired skill. Furthermore, applying RL to legged robots requires not only efficient algorithms, but also mechanisms that enable the robots to remain safe and recover after falling without frequent human assistance.

In this post, we will discuss two of our recent projects aimed at addressing these challenges. First, we describe how robots can learn agile behaviors by imitating motions from real animals, producing fast and fluent movements like trotting and hopping. Then, we discuss a system for automating the training of locomotion skills in the real world, which allows robots to learn to walk on their own, with minimal human assistance.

Learning Agile Robotic Locomotion Skills by Imitating Animals
In “Learning Agile Robotic Locomotion Skills by Imitating Animals”, we present a framework that takes a reference motion clip recorded from an animal (a dog, in this case) and uses RL to train a control policy that enables a robot to imitate the motion in the real world. By providing the system with different reference motions, we are able to train a quadruped robot to perform a diverse set of agile behaviors, ranging from fast walking gaits to dynamic hops and turns. The policies are trained primarily in simulation, and then transferred to the real world using a latent space adaptation technique that can efficiently adapt a policy using only a few minutes of data from the real robot.

Motion Imitation
We start by collecting motion capture clips of a real dog performing various locomotion skills. We then use RL to train control policies that imitate the dog’s motions. Each policy is trained in a physics simulation to track the pose of the reference motion at each timestep, so by using different reference motions in the reward function, we can train a simulated robot to imitate a variety of different skills.
Reinforcement learning is used to train a simulated robot to imitate the reference motions from a dog. All simulations are performed using PyBullet.
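To make the tracking objective concrete, here is a minimal Python sketch of a pose-tracking reward of this kind: the robot is rewarded for matching the reference joint angles at the current timestep, with the reward decaying exponentially in the tracking error. The joint-angle representation, the `scale` parameter, and the exponential form are illustrative choices, not the exact reward terms used in the paper.

```python
import numpy as np

def pose_tracking_reward(robot_joint_angles, ref_joint_angles, scale=5.0):
    """Reward the simulated robot for matching the reference pose at this timestep.

    Both arguments are arrays of joint angles (radians); `scale` controls how
    sharply the reward falls off as the tracking error grows. These names and
    the exponential form are illustrative, not the exact reward from the paper.
    """
    error = np.sum((ref_joint_angles - robot_joint_angles) ** 2)
    return np.exp(-scale * error)

# Example: a 12-joint quadruped pose that is close to the reference pose
# receives a reward near 1.0, while large tracking errors drive it toward 0.
ref_pose = np.zeros(12)
robot_pose = ref_pose + 0.05 * np.random.randn(12)
print(pose_tracking_reward(robot_pose, ref_pose))
```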
However, since simulators generally provide only a coarse approximation of the real world, policies trained in simulation often perform poorly when deployed on a real robot. Therefore, we use a sample-efficient latent space adaptation technique to transfer a policy trained in simulation to the real world.

First, to encourage the policy to learn behaviors that are robust to variations in the dynamics, we randomize the dynamics of the simulation by varying physical quantities, such as the robot’s mass and friction. Since we have access to the values of these parameters during training in simulation, we can also map them to a low-dimensional representation using a learned encoder. This encoding is then passed as an additional input to the policy during training. Since the physical parameters of the real robot are not known a priori, when deploying the policy to a real robot, we remove the encoder and directly search for a set of parameters in the latent space that enables the robot to successfully execute the desired skills in the real world. This technique is often able to adapt a policy to the real world using less than 8 minutes of real-world data.
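As a rough illustration of the dynamics randomization step, the sketch below perturbs per-link mass and lateral friction in PyBullet at the start of an episode and returns the sampled values so they can be encoded into the low-dimensional representation given to the policy. The example URDF, the randomization ranges, and the choice of which parameters to vary are assumptions for illustration, not the exact setup used in our experiments.

```python
import numpy as np
import pybullet as p
import pybullet_data

# Connect to the physics server and load an example quadruped model.
p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
robot_id = p.loadURDF("quadruped/minitaur.urdf")

def randomize_dynamics(body_id, mass_scale_range=(0.8, 1.2), friction_range=(0.5, 1.25)):
    """Perturb per-link mass and lateral friction at the start of an episode.

    Returns the sampled values so they can be mapped to the latent encoding
    that is fed to the policy during training. The ranges here are
    illustrative, not the ones used in the paper.
    """
    sampled = []
    for link in range(-1, p.getNumJoints(body_id)):  # link index -1 is the base
        nominal_mass = p.getDynamicsInfo(body_id, link)[0]
        mass = nominal_mass * np.random.uniform(*mass_scale_range)
        friction = np.random.uniform(*friction_range)
        p.changeDynamics(body_id, link, mass=mass, lateralFriction=friction)
        sampled.extend([mass, friction])
    return np.array(sampled)
```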
Comparison of policies before and after adaptation on the real robot. Before adaptation, the robot is prone to falling. But after adaptation, the policies are able to more consistently execute the desired skills.
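The deployment-time adaptation can be pictured as a small search over the latent encoding, evaluated directly on hardware. In the sketch below, `policy(observation, z)` and `evaluate_on_robot(policy, z)` are hypothetical stand-ins for the simulation-trained policy and a short real-robot rollout, and the simple random search is a placeholder for the paper's actual adaptation procedure.

```python
import numpy as np

def adapt_latent(policy, evaluate_on_robot, latent_dim=8, num_trials=15, noise=0.5):
    """Search the latent dynamics encoding directly on the real robot.

    `policy(observation, z)` is the simulation-trained policy conditioned on a
    latent vector z, and `evaluate_on_robot(policy, z)` runs one short rollout
    on hardware and returns its total reward. Both are hypothetical stand-ins,
    and this random search is a simplified placeholder for the paper's method.
    """
    best_z = np.zeros(latent_dim)            # start from the mean encoding
    best_return = evaluate_on_robot(policy, best_z)
    for _ in range(num_trials):               # a few minutes of real-world rollouts
        candidate = best_z + noise * np.random.randn(latent_dim)
        candidate_return = evaluate_on_robot(policy, candidate)
        if candidate_return > best_return:
            best_z, best_return = candidate, candidate_return
    return best_z
```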
Results
Using this approach, the robot learns to imitate various locomotion skills from a dog, including different walking gaits, such as pacing and trotting, as well as an agile spinning motion.
Robot imitating various skills from a dog.
In addition to imitating motions from real dogs, it is also possible to imitate artist-animated keyframe motions, including a dynamic hop-turn:
Skills learned by imitating artist-animated keyframe motions: side-steps, turn, and hop-turn.
More details are available in the following video:
Learning to Walk in the Real World with Minimal Human Effort
The above approach is able to train policies in simulation and then adapt them to the real world. However, when the task involves complex and diverse physical phenomena, it is also necessary to directly learn from real-world experience. Although learning on real robots has achieved state-of-the-art performance for manipulation tasks (e.g., QT-Opt), applying the same methods to legged robots is difficult since the robot may fall and damage itself, or leave the training area, which can then require human intervention.
An automated learning system for legged robots must resolve safety and automation challenges.
In “Learning to Walk in the Real World with Minimal Human Effort”, we developed an automated learning system that combines a multi-task learning procedure, a safety-constrained learner, and several carefully designed hardware and software components. Multi-task learning prevents the robot from leaving the training area by generating a learning schedule that drives the robot towards the center of the workspace. We also reduce the number of falls by adding a safety constraint, which we enforce with dual gradient descent.

For each roll-out, the scheduler selects a task whose desired walking direction points towards the center of the workspace. For instance, with two tasks, forward and backward walking, the scheduler selects the forward task if the robot is at the back of the workspace, and the backward task if it is at the front. During each episode, the learner takes dual gradient descent steps to iteratively optimize both the task objective and the safety constraint, rather than treating them as a single goal. If the robot falls, we invoke an automated get-up controller and proceed to the next episode.
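A minimal sketch of such a scheduler is shown below: each task is associated with a desired walking direction, and the scheduler picks the task whose direction best points back toward the center of the workspace. The task set, the `schedule_task` helper, and the 2D workspace coordinates are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

# Hypothetical directional tasks: each maps to a desired walking direction in
# the workspace frame; the real system trains one policy per task.
TASKS = {
    "forward":  np.array([1.0, 0.0]),
    "backward": np.array([-1.0, 0.0]),
    "left":     np.array([0.0, 1.0]),
    "right":    np.array([0.0, -1.0]),
}

def schedule_task(robot_xy, workspace_center=np.zeros(2)):
    """Pick the task whose walking direction best points back toward the center."""
    to_center = workspace_center - robot_xy
    to_center /= (np.linalg.norm(to_center) + 1e-8)
    return max(TASKS, key=lambda name: np.dot(TASKS[name], to_center))

# Example: a robot near the back of the workspace is scheduled to walk forward.
print(schedule_task(np.array([-1.5, 0.2])))  # -> "forward"
```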
We solve automation and safety challenges with multi-task learning, a safety-constrained SAC algorithm, and an automatic reset controller.
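The safety constraint can be handled with a Lagrangian-style update that alternates between improving the policy and adjusting a multiplier on the constraint. The sketch below shows one such alternating step; `task_return_and_grad` and `safety_cost_and_grad` are hypothetical caller-supplied estimators, and the specific update rule and step sizes are illustrative rather than the exact safety-constrained SAC update from the paper.

```python
def dual_gradient_step(policy_params, lagrange_multiplier,
                       task_return_and_grad, safety_cost_and_grad,
                       safety_limit, policy_lr=3e-4, dual_lr=1e-2):
    """One alternating primal/dual update for a safety-constrained objective.

    `task_return_and_grad(params)` and `safety_cost_and_grad(params)` are
    caller-supplied (hypothetical) estimators returning a scalar value and its
    gradient with respect to the policy parameters, e.g. as NumPy arrays.
    """
    task_value, task_grad = task_return_and_grad(policy_params)
    cost_value, cost_grad = safety_cost_and_grad(policy_params)

    # Primal step: ascend the Lagrangian  task_return - lambda * safety_cost.
    new_params = policy_params + policy_lr * (task_grad - lagrange_multiplier * cost_grad)

    # Dual step: grow the multiplier while the expected safety cost exceeds the
    # limit, and let it decay (but never go negative) once the constraint holds.
    new_multiplier = max(0.0, lagrange_multiplier + dual_lr * (cost_value - safety_limit))
    return new_params, new_multiplier
```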
Results
This framework successfully trains policies from scratch to walk in different directions without any human intervention.
Snapshots of the training process on the flat surface with zero human resets.
Once trained, it is possible to steer the robot with a remote controller. Notice how it’s possible to command the robot to turn in place using the controller. This behavior would be difficult to design manually because of the robot’s planar leg structure, but it is discovered automatically by our automated multi-task learner.
We train locomotion policies for walking in four directions, which allows us to interactively control the robot with a game controller.
The system also enables the robot to navigate more challenging surfaces, such as a memory foam mattress and a doormat with crevices.
Learned locomotion gaits on challenging terrains.
More details can be found in the following video:
Conclusion
In these two papers, we present methods for reproducing a diverse corpus of behaviors with quadruped robots. Extending this line of work to learn skills from videos would be an exciting direction, as it could substantially increase the volume of data from which robots can learn. We are also interested in applying the automated training system to more complex real-world environments and tasks.

Acknowledgments
We would like to thank our coauthors, Erwin Coumans, Tingnan Zhang, Tsang-Wei Lee, Jie Tan, Sergey Levine, Peng Xu, and Zhenyu Tan. We would also like to thank Julian Ibarz, Byron David, Thinh Nguyen, Gus Kouretas, Krista Reymann, and Bonny Ho for their support and contributions to this work.