Tingnan Zhang
My current research focuses on deep reinforcement learning and its application in robotics.
Authored Publications
PI-ARS: Accelerating Evolution-Learned Visual Locomotion with Predictive Information Representations
Ofir Nachum
International Conference on Intelligent Robots and Systems (IROS) (2022)
Evolution Strategy (ES) algorithms have shown promising results in training complex robotic control policies due to their massive parallelism capability, simple implementation, effective parameter-space exploration, and fast training time. However, a key limitation of ES is its scalability to large capacity models, including modern neural network architectures. In this work, we develop Predictive Information Augmented Random Search (PI-ARS) to mitigate this limitation by leveraging recent advancements in representation learning to reduce the parameter search space for ES. Namely, PI-ARS combines a gradient-based representation learning technique, Predictive Information (PI), with a gradient-free ES algorithm, Augmented Random Search (ARS), to train policies that can process complex robot sensory inputs and handle highly nonlinear robot dynamics. We evaluate PI-ARS on a set of challenging visual-locomotion tasks where a quadruped robot needs to walk on uneven stepping stones, quincuncial piles, and moving platforms, as well as to complete an indoor navigation task. Across all tasks, PI-ARS demonstrates significantly better learning efficiency and performance compared to the ARS baseline. We further validate our algorithm by demonstrating that the learned policies can successfully transfer to a real quadruped robot, for example, achieving a 100% success rate on the real-world stepping stone environment, a dramatic improvement over prior results, which achieved 40% success.
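As a rough illustration of the gradient-free half of the method, the sketch below runs basic ARS on a toy objective. It is not PI-ARS: there is no learned representation, and `toy_reward`, the function names, and all hyperparameter values are hypothetical stand-ins for real policy rollouts.

```python
import numpy as np

def ars_step(theta, reward_fn, rng, num_dirs=8, top_b=4,
             step_size=0.02, noise_std=0.03):
    """One update of basic Augmented Random Search on a flat parameter vector."""
    # Sample random search directions in parameter space.
    deltas = rng.standard_normal((num_dirs, theta.size))
    # Evaluate each direction with a positive and a negative perturbation.
    r_plus = np.array([reward_fn(theta + noise_std * d) for d in deltas])
    r_minus = np.array([reward_fn(theta - noise_std * d) for d in deltas])
    # Keep the top_b directions, ranked by the better reward of each pair.
    order = np.argsort(np.maximum(r_plus, r_minus))[::-1][:top_b]
    # Scale the step by the standard deviation of the rewards that are used.
    sigma_r = np.concatenate([r_plus[order], r_minus[order]]).std() + 1e-8
    direction = ((r_plus[order] - r_minus[order])[:, None] * deltas[order]).sum(axis=0)
    return theta + step_size / (top_b * sigma_r) * direction

def toy_reward(theta):
    # Hypothetical stand-in for an episode return; maximal at theta == 1.
    return -float(np.sum((theta - 1.0) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(5)
initial_reward = toy_reward(theta)
for _ in range(200):
    theta = ars_step(theta, toy_reward, rng)
final_reward = toy_reward(theta)
print(initial_reward, final_reward)
```

Because every update needs only reward evaluations, the cost grows with the number of parameters being searched, which is the scalability limitation the abstract refers to.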
Style-Augmented Mutual Information for Practical Skill Discovery
Ale Escontrela
Jason Peng
Ken Goldberg
Pieter Abbeel
Proceedings of NeurIPS (2022) (to appear)
Exploration and skill discovery in many real-world settings are often inspired by the activities we see others perform. However, most unsupervised skill discovery methods tend to focus solely on the intrinsic component of motivation, often by maximizing the Mutual Information (MI) between the agent's skills and the observed trajectories. These skills, though diverse in the behaviors they elicit, leave much to be desired. Namely, skills learned by maximizing MI in a high-dimensional continuous control setting tend to be aesthetically unpleasing and challenging to utilize in a practical setting, as the violent behavior often exhibited by these skills would not transfer well to the real world. We argue that solely maximizing MI is insufficient if we wish to discover useful skills, and that a notion of "style" must be incorporated into the objective. To this end, we propose the Style-Augmented Mutual Information objective (SAMI), whereby, in addition to maximizing a lower bound on the MI, the agent is encouraged to minimize the f-divergence between the policy-induced trajectory distribution and the trajectory distribution contained in the reference data (the style objective). We compare SAMI to other popular skill discovery objectives, and demonstrate that skill-conditioned policies optimized with SAMI achieve equal or greater performance when applied to downstream tasks. We also show that the data-driven motion prior specified by the style objective can be inferred from various modalities, including large motion capture datasets or even RGB videos.
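Read literally, the objective described in this abstract could be written as below. This is only a sketch of the stated idea: the trade-off weight \lambda, the symbol names, and the argument order of the f-divergence are assumptions and are not taken from the paper.

```latex
\max_{\pi}\;\; \mathcal{L}_{\mathrm{MI}}(\pi)
\;-\; \lambda\, D_f\!\left( p_{\pi}(\tau) \,\middle\|\, p_{\mathrm{ref}}(\tau) \right),
\qquad \mathcal{L}_{\mathrm{MI}}(\pi) \;\le\; I(Z;\tau)
```

Here Z denotes the skill variable, \tau a trajectory, p_{\pi} the policy-induced trajectory distribution, and p_{\mathrm{ref}} the trajectory distribution of the reference data; the second term is the style objective.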
Learning Model Predictive Controllers with Real-Time Attention for Real-World Navigation
Anthony G. Francis
Dmitry Kalashnikov
Edward Lee
Jake Varley
Leila Takayama
Mikael Persson
Peng Xu
Stephen Tu
Xuesu Xiao
Conference on Robot Learning (2022) (to appear)
Despite decades of research, existing navigation systems still face real-world challenges when being deployed in the wild, e.g., in cluttered home environments or in human-occupied public spaces. To address this, we present a new class of implicit control policies combining the benefits of imitation learning with the robust handling of system constraints of Model Predictive Control (MPC). Our approach, called Performer-MPC, uses a learned cost function parameterized by vision context embeddings provided by Performers, a low-rank implicit-attention Transformer. We jointly train the cost function and construct the controller relying on it, effectively solving end-to-end the corresponding bi-level optimization problem. We show that the resulting policy improves standard MPC performance by leveraging a few expert demonstrations of the desired navigation behavior in different challenging real-world scenarios. Compared with a standard MPC policy, Performer-MPC achieves a 40% better goal-reached rate in cluttered environments and 65% better sociability when navigating around humans.
Learning Semantic-Aware Locomotion Skills from Human Demonstration
Byron Boots
Xiangyun Meng
Yuxiang Yang
Conference on Robot Learning (CoRL) (2022) (to appear)
The semantics of the environment, such as the terrain type and property, reveals important information for legged robots to adjust their behaviors. In this work, we present a framework that learns semantics-adaptive gait controllers for quadrupedal robots. To facilitate learning, we separate the problem of gait planning and motor control using a hierarchical framework, which consists of a high-level image-conditioned gait policy and a low-level MPC-based motor controller. In addition, to ensure sample efficiency, we pre-train the perception model with an off-road driving dataset, and extract an embedding for downstream learning. To avoid policy evaluation in the noisy real world, we design a simple interface for human operation and learn from human demonstrations. Our framework learns to adjust the speed and gait of the robot based on terrain semantics, using 40 minutes of human demonstration data.
We keep testing the performance of the controller on different trails. At the time of writing, the robot has walked 0.2 miles without failure.
Adversarial Motion Priors Make Good Substitutes for Complex Reward Functions
Ale Escontrela
Jason Peng
Ken Goldberg
Pieter Abbeel
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2022) (to appear)
Training high-dimensional simulated agents with under-specified reward functions often leads to jerky and unnatural behaviors, which results in physically infeasible strategies that are generally ineffective when deployed in the real world. To mitigate these unnatural behaviors, reinforcement learning (RL) practitioners often utilize complex reward functions that encourage more physically plausible behaviors, in conjunction with tricks such as domain randomization to train policies that satisfy the user's style criteria and can be successfully deployed on real robots. Such an approach has been successful in the realm of legged locomotion, leading to state-of-the-art results. However, designing effective reward functions can be a labour-intensive and tedious tuning process, and these hand-designed rewards do not easily generalize across platforms and tasks. We propose substituting complex reward functions with "style rewards" learned from a dataset of motion capture demonstrations. This learned style reward can be combined with a simple task reward to train policies that perform tasks using naturalistic strategies. These more natural strategies can also facilitate transfer to the real world. We build upon prior work in computer graphics and demonstrate that an adversarial approach to training control policies can produce behaviors that transfer to a real quadrupedal robot without requiring complex reward functions. We also demonstrate that an effective style reward can be learned from a few seconds of motion capture data gathered from a German Shepherd and leads to energy-efficient locomotion strategies with natural gait transitions.
Robotic table wiping via whole-body trajectory optimization and reinforcement learning
Benjie Holson
Jeffrey Bingham
Jonathan Weisz
Mario Prats
Peng Xu
Thomas Lew
Xiaohan Zhang
Yao Lu
ICRA (2022)
We propose an end-to-end framework to enable multipurpose assistive mobile robots to autonomously wipe tables and clean spills and crumbs. This problem is challenging, as it requires planning wiping actions with uncertain latent crumbs and spill dynamics over high-dimensional visual observations, while simultaneously guaranteeing constraints satisfaction to enable deployment in unstructured environments. To tackle this problem, we first propose a stochastic differential equation (SDE) to model crumbs and spill dynamics and absorption with the robot wiper. Then, we formulate a stochastic optimal control for planning wiping actions over visual observations, which we solve using reinforcement learning (RL). We then propose a whole-body trajectory optimization formulation to compute joint trajectories to execute wiping actions while guaranteeing constraints satisfaction. We extensively validate our table wiping approach in simulation and on hardware.
Safe Reinforcement Learning for Legged Locomotion
Jimmy Yang
Peter J. Ramadge
Sehoon Ha
International Conference on Robotics and Automation (2022) (to appear)
Designing control policies for legged locomotion is complex due to underactuation and discrete contact dynamics. To deal with this complexity, applying reinforcement learning to learn a control policy in the real world is a promising approach. However, safety is a bottleneck when robots need to learn in the real world. In this paper, we propose a safe reinforcement learning framework that switches between a safe recovery policy and a learner policy. The safe recovery policy takes over the control when the learner policy violates safety constraints, and hands over the control back when there are no future safety violations. We design the safe recovery policy so that it ensures safety of legged locomotion while minimally interfering with the learning process. Furthermore, we theoretically analyze the proposed framework and provide an upper bound on the task performance. We verify the proposed framework in three locomotion tasks on a simulated quadrupedal robot: catwalk, two-leg balance, and pacing. On average, our method achieves 48.6% fewer falls and comparable or better rewards than the baseline methods.
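The switching scheme described above can be sketched on a toy problem. This is one plausible reading and not the paper's implementation: the 1-D dynamics, the look-ahead safety check, both placeholder policies, and the constants `SAFE_LIMIT` and `HORIZON` are all hypothetical.

```python
import random

SAFE_LIMIT = 1.0   # hypothetical safety constraint: |x| <= SAFE_LIMIT
HORIZON = 3        # hypothetical look-ahead length for the safety check

def step(x, action):
    # Hypothetical 1-D dynamics standing in for the robot.
    return x + 0.1 * action

def learner_policy(x, rng):
    # Placeholder for the policy being trained: explores at random.
    return rng.uniform(-3.0, 3.0)

def recovery_policy(x):
    # Placeholder safe policy: push the state back toward the origin.
    return -1.0 if x > 0 else 1.0

def stays_safe(x, first_action):
    # Apply the candidate action, then roll out the recovery policy and
    # check that no state along the way violates the constraint.
    x = step(x, first_action)
    for _ in range(HORIZON):
        if abs(x) > SAFE_LIMIT:
            return False
        x = step(x, recovery_policy(x))
    return abs(x) <= SAFE_LIMIT

def run_episode(num_steps=500, seed=0):
    rng = random.Random(seed)
    x, max_abs_x, takeovers = 0.0, 0.0, 0
    for _ in range(num_steps):
        action = learner_policy(x, rng)
        if not stays_safe(x, action):
            # The recovery policy takes over for this step only; the
            # learner is tried again on the next step.
            action = recovery_policy(x)
            takeovers += 1
        x = step(x, action)
        max_abs_x = max(max_abs_x, abs(x))
    return max_abs_x, takeovers

max_abs_x, takeovers = run_episode()
print(max_abs_x, takeovers)
```

In this toy setting the learner keeps control whenever its proposed action is predicted to be safe, so the recovery policy interferes only near the boundary of the safe set.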
Zero-Shot Retargeting of Learned Quadruped Locomotion Policy Using A Hybrid Kinodynamic Model and Predictive Control
He Li
Patrick Wensing
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2022) (to appear)
As a competing control technique, Reinforcement Learning (RL) has demonstrated great performance in quadruped locomotion. However, reusing a policy on another robot, i.e., policy transferability, which would save retraining time, remains a challenge. In this work, we reduce the gap by developing a planning-and-control framework that systematically integrates RL and Model Predictive Control (MPC). The planning stage employs RL to generate a dynamically plausible trajectory as well as the contact schedule. This information is then used to seed the low-level MPC, which stabilizes and robustifies the motion. In addition, our MPC controller employs a novel Hybrid Kino-Dynamics (HKD) model which implicitly optimizes the foothold locations. The results are surprisingly good: the policy trained for the Unitree A1 robot could be transferred to the MIT Mini Cheetah with the proposed pipeline.
Real-time remodeling of granular terrain for robot locomotion
Andras Karsai
Daniel I. Goldman
Daniel Soto
Deniz Kerimoglu
Sehoon Ha
Space Robotics (2022)
Recent studies of robot movement in flowable granular media inspired by difficulties faced by extraterrestrial rovers reveal a coupled locomotor/substrate effect where the robot spontaneously remodels its environment. Such coupling occurs in certain limb/wheel movement patterns that result in a localized granular flow, allowing the robot to effectively “swim” up highly flowable slopes. However, these gaits were discovered via trial and error by human operators, as the highly hysteretic nature of easily flowable terrain also creates tractability and predictability challenges in locomotion planning and gait policies. To overcome this, additional anchoring structures on intruding appendages can dynamically stabilize slopes to prevent undesired flows and slipping during locomotion. Granular media’s multiphase properties make it amenable to creative manipulations dependent on the physics of the intruding structure. A pair of robot studies showcase both selective solidification and fluidization strategies in flowable slopes to locomote successfully. To accelerate gait discovery in both studies, a machine learning approach for real-time characterization of the terrain flow could allow robots to control the flowable substrate for effective locomotion. A future neural network trained with sufficient spatiotemporal terrain data could predict granular flow with high accuracy and generality, augmenting gait learning with knowledge of the environment’s evolution during movement.
Preview abstract
Reinforcement learning provides an effective tool for robots to acquire diverse skills in an automated fashion. For safety and data generation purposes, control policies are often trained in a simulator and later deployed to the target environment, such as a real robot. However, transferring policies across domains is often a manual and tedious process. In order to bridge the gap between domains, it is often necessary to carefully tune and identify the simulator parameters or select the aspects of the simulation environment to randomize. In this paper, we design a novel, adversarial learning algorithm to tackle the transfer problem. We combine a classic, analytical simulator with a differentiable, state-action dependent system identification module that outputs the desired simulator parameters. We then train this hybrid simulator such that the output trajectory distributions are indistinguishable from a target domain collection. The optimized hybrid simulator can refine a sub-optimal policy without any additional target domain data. We show that our approach outperforms the domain-randomization and target-domain refinement baselines on two robots and six difficult dynamic tasks.