Sergey Levine

Sergey Levine received a BS and MS in Computer Science from Stanford University in 2009, and a Ph.D. in Computer Science from Stanford University in 2014. He joined Google in 2015, and is currently also on the faculty of the Department of Electrical Engineering and Computer Sciences at UC Berkeley. His work focuses on machine learning for decision making and control, with an emphasis on deep learning and reinforcement learning algorithms. Applications of his work include autonomous robots and vehicles, as well as computer vision and graphics. His research includes developing algorithms for end-to-end training of deep neural network policies that combine perception and control, scalable algorithms for inverse reinforcement learning, deep reinforcement learning algorithms, and more.
Authored Publications
    A Connection between Actor Regularization and Critic Regularization in Reinforcement Learning
    Benjamin Eysenbach
    Matthieu Geist
    Ruslan Salakhutdinov
    International Conference on Machine Learning (ICML) (2023)
    As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting, with most methods regularizing either the actor or the critic. These methods appear distinct. Actor regularization (e.g., behavioral cloning penalties) is simpler and has appealing convergence properties, while critic regularization typically requires significantly more compute because it involves solving a game, but it has appealing lower-bound guarantees. Empirically, prior work alternates between claiming better results with actor regularization and critic regularization. In this paper, we show that these two regularization techniques can be equivalent under some assumptions: regularizing the critic using a CQL-like objective is equivalent to updating the actor with a BC-like regularizer and with a SARSA Q-value (i.e., “1-step RL”). Our experiments show that this theoretical model makes accurate, testable predictions about the performance of CQL and one-step RL. While our results do not definitively say whether users should prefer actor regularization or critic regularization, our results hint that actor regularization methods may be a simpler way to achieve the desirable properties of critic regularization. The results also suggest that the empirically-demonstrated benefits of both types of regularization might be more a function of implementation details rather than objective superiority.
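A minimal sketch of the two regularization styles contrasted above, not the authors' code: an actor loss with a behavioral-cloning penalty versus a critic loss with a CQL-like penalty. It assumes a deterministic `policy` and a `q_net` given as PyTorch modules mapping states (and actions) to tensors; `alpha` is an illustrative trade-off coefficient.

```python
import torch
import torch.nn.functional as F

def actor_regularized_loss(policy, q_net, states, dataset_actions, alpha=1.0):
    """Actor update: maximize Q under the policy's actions, plus a BC penalty toward the data."""
    policy_actions = policy(states)
    q_values = q_net(states, policy_actions)
    bc_penalty = F.mse_loss(policy_actions, dataset_actions)
    return -q_values.mean() + alpha * bc_penalty

def critic_regularized_loss(q_net, policy, states, dataset_actions, rewards,
                            next_states, next_actions, gamma=0.99, alpha=1.0):
    """Critic update: SARSA-style TD error plus a CQL-like penalty that pushes Q down
    on policy actions and up on dataset actions."""
    with torch.no_grad():
        td_target = rewards + gamma * q_net(next_states, next_actions)
    td_error = F.mse_loss(q_net(states, dataset_actions), td_target)
    penalty = (q_net(states, policy(states)).mean()
               - q_net(states, dataset_actions).mean())
    return td_error + alpha * penalty
```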
    Offline reinforcement learning (RL) on large, heterogeneous datasets with high-capacity models can, in principle, lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the learnings from these works, we re-examine previous design choices and find that with appropriate choices: wider ResNets, cross-entropy based distributional backups, and feature normalization, offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a test-bed for scaling and generalization, we train a single policy on 40 games with near-human performance using up to 100M parameter networks, finding that model performance scales favorably with capacity. In contrast to prior work, we substantially extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-normalized score). Compared to supervised approaches, offline RL scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that such offline Q-functions learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing representation learning approaches.
    In recent years, much progress has been made in learning robotic manipulation policies that can follow natural language instructions. Common approaches involve learning methods that operate on offline datasets, such as task-specific teleoperated demonstrations or hindsight-labeled robotic experience. Such methods work reasonably but rely strongly on the assumption of clean data: teleoperated demonstrations are collected with specific tasks in mind, while hindsight language descriptions rely on expensive human labeling. Recently, large-scale pretrained language and vision-language models like CLIP have been applied to robotics in the form of learning representations and planners. However, can these pretrained models also be used to cheaply impart internet-scale knowledge onto offline datasets, providing access to skills contained in the offline dataset that weren't necessarily reflected in ground truth labels? We investigate fine-tuning a reward model on a small dataset of robot interactions with crowd-sourced natural language labels and using the model to relabel instructions of a large offline robot dataset. The resulting dataset with diverse language skills is used to train imitation learning policies, which outperform prior methods by up to 30% when evaluated on a diverse set of novel language instructions that were not contained in the original dataset.
    Despite overparameterization, deep networks trained via supervised learning are easy to optimize and exhibit excellent generalization. One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors parsimonious solutions that generalize well on test inputs. It is reasonable to surmise that deep reinforcement learning (RL) methods could also benefit from this effect. In this paper, we discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive "aliasing", in stark contrast to the supervised learning case. We back up these findings empirically, showing that feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations for state-action pairs that appear on either side of the Bellman backup. To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer. When combined with existing offline RL methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains and robotic manipulation from images.
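A minimal sketch of a DR3-style explicit regularizer as described above (an illustrative reconstruction, not the paper's implementation): penalize the dot product between the penultimate-layer features of the (s, a) pair and of the (s', a') pair on the other side of the Bellman backup; the coefficient `c0` is an assumed hyperparameter.

```python
import torch

def dr3_penalty(phi_sa: torch.Tensor, phi_next_sa: torch.Tensor) -> torch.Tensor:
    """phi_sa, phi_next_sa: [batch, feature_dim] feature matrices from the Q-network."""
    return (phi_sa * phi_next_sa).sum(dim=-1).mean()

# Usage sketch: total_loss = td_loss + c0 * dr3_penalty(phi_sa, phi_next_sa)
```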
    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
    Alexander Herzog
    Alexander Toshkov Toshev
    Anthony Brohan
    Brian Andrew Ichter
    Byron David
    Clayton Tan
    Diego Reyes
    Dmitry Kalashnikov
    Eric Victor Jang
    Jarek Liam Rettinghouse
    Jornell Lacanlale Quiambao
    Julian Ibarz
    Kyle Alan Jeffrey
    Linda Luu
    Mengyuan Yan
    Michael Soogil Ahn
    Nicolas Sievers
    Noah Brown
    Omar Eduardo Escareno Cortes
    Peng Xu
    Peter Pastor Sampedro
    Rosario Jauregui Ruano
    Sally Augusta Jesmonth
    Steve Xu
    Yao Lu
    Yevgen Chebotar
    Yuheng Kuang
    Conference on Robot Learning 2022 (2022)
    Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator. The project's website and the video can be found at say-can.github.io.
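An illustrative sketch of the scoring rule described above: each candidate skill is ranked by the product of the language model's probability that its description is a useful next step and the skill's value function (affordance) in the current state. The dictionaries and numbers below are hypothetical inputs, not outputs of any real system.

```python
def select_skill(llm_scores: dict, affordances: dict) -> str:
    """llm_scores[skill]: LLM probability that `skill` is a useful next step.
    affordances[skill]: value-function estimate that `skill` can succeed now."""
    return max(llm_scores, key=lambda s: llm_scores[s] * affordances.get(s, 0.0))

# Example with made-up numbers:
llm_scores = {"find a sponge": 0.5, "go to the counter": 0.3, "pick up the apple": 0.2}
affordances = {"find a sponge": 0.9, "go to the counter": 0.8, "pick up the apple": 0.1}
print(select_skill(llm_scores, affordances))  # -> "find a sponge"
```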
    Jump-Start Reinforcement Learning
    Ike Uchendu
    Yao Lu
    Banghua Zhu
    Mengyuan Yan
    Joséphine Simon
    Matt Bennice
    Chuyuan Kelly Fu
    Cong Ma
    Jiantao Jiao
    NeurIPS 2021 Robot Learning Workshop, RSS 2022 Scaling Robot Learning Workshop
    Reinforcement learning (RL) provides a theoretical framework for continuously improving an agent's behavior via trial and error. However, efficiently learning policies from scratch can be very difficult, particularly for tasks that present exploration challenges. In such settings, it might be desirable to initialize RL with an existing policy, offline data, or demonstrations. However, naively performing such initialization in RL often works poorly, especially for value-based methods. In this paper, we present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy, and is compatible with any RL approach. In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks: a guide-policy, and an exploration-policy. By using the guide-policy to form a curriculum of starting states for the exploration-policy, we are able to efficiently improve performance on a set of simulated robotic tasks. In addition, we provide an upper bound on the sample complexity of JSRL and show that it is able to significantly outperform existing imitation and reinforcement learning algorithms, particularly in the small-data regime.
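A minimal sketch of the JSRL-style rollout scheme, under assumptions: a gym-style `env` whose `step()` returns `(obs, reward, done, info)` and two policies exposing an `act(obs)` method. The guide-policy drives the first `guide_steps` steps so the exploration-policy starts from favorable states; the guide horizon is then shortened as the exploration-policy improves.

```python
def jsrl_rollout(env, guide_policy, explore_policy, guide_steps, max_steps=200):
    obs, transitions = env.reset(), []
    for t in range(max_steps):
        policy = guide_policy if t < guide_steps else explore_policy
        action = policy.act(obs)
        next_obs, reward, done, info = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return transitions

# Curriculum (schematic): start with a long guide prefix and reduce `guide_steps`
# once the exploration-policy's success rate from the current starting states is high enough.
```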
    With the goal of achieving higher efficiency, the semiconductor industry has gradually reformed towards application-specific hardware accelerators. While such a paradigm shift is already starting to show promising results, designers need to spend considerable manual effort and perform a large number of time-consuming simulations to find accelerators that can accelerate multiple target applications while obeying design constraints. Moreover, such a "simulation-driven" approach must be re-run from scratch every time the target applications or constraints change. An alternative paradigm is to use a "data-driven", offline approach that utilizes logged simulation data to architect hardware accelerators, without needing any form of simulation. Such an approach not only alleviates the need to run time-consuming simulation, but also enables data reuse and applies even when target applications change. In this paper, we develop such a data-driven offline optimization method for designing hardware accelerators, PRIME, that enjoys all of these properties. Our approach learns a conservative, robust estimate of the desired cost function, utilizes infeasible points and optimizes the design against this estimate without any additional simulator queries during optimization.
    Offline Meta-Reinforcement Learning for Industrial Insertion
    Tony Zhao
    Jianlan Luo
    Rugile Pevceviciute
    Nicolas Manfred Otto Heess
    Jon Scholz
    Stefan Schaal
    ICRA 2022 (to appear) (2022)
    Reinforcement learning (RL) can in principle let robots automatically adapt to new tasks, but current RL methods require a large number of trials to accomplish this. In this paper, we tackle rapid adaptation to new tasks through the framework of meta-learning, which utilizes past tasks to learn to adapt, with a specific focus on industrial insertion tasks. Fast adaptation is crucial because a prohibitively large number of on-robot trials can damage hardware. Effective adaptation is also feasible because experience from different insertion applications can be leveraged across tasks. In this setting, we address two specific challenges when applying meta-learning. First, conventional meta-RL algorithms require lengthy online meta-training. We show that this can be replaced with appropriately chosen offline data, resulting in an offline meta-RL method that only requires demonstrations and trials from each of the prior tasks, without the need to run costly meta-RL procedures online. Second, meta-RL methods can fail to generalize to new tasks that are too different from those seen at meta-training time, which poses a particular challenge in industrial applications, where high success rates are critical. We address this by combining contextual meta-learning with direct online finetuning: if the new task is similar to those seen in the prior data, then the contextual meta-learner adapts immediately, and if it is too different, it gradually adapts through finetuning. We show that our approach is able to quickly adapt to a variety of different insertion tasks, with a success rate of 100% using only a fraction of the samples needed for learning the tasks from scratch.
    Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning
    Alexander Toshkov Toshev
    Brian Andrew Ichter
    Dhruv Shah
    Peng Xu
    Yao Lu
    ICLR (2022)
    Reinforcement learning can train policies that effectively perform complex tasks. However, the performance of these methods degrades as the horizon increases, and performing long-horizon tasks often requires reasoning over and composing multiple lower-level skills. Hierarchical reinforcement learning aims to enable this, by providing a bank of low-level skills as action abstractions, in the form of primitives or options. However, an effective hierarchy should exhibit abstraction both in the space of actions and states. We posit that a suitable state abstraction for the higher-level policy should depend on the capabilities of the available lower-level policies, and we propose an approach that produces such a representation by using the value functions corresponding to each lower-level skill to capture the affordances for these skills. Empirical evaluations for maze-solving and robotic manipulation tasks demonstrate that our approach improves long-horizon performance and enables better zero-shot generalization than popular model-free and model-based methods by constructing a compact state abstraction that represents the affordances of the scene and is robust to distractors. We implement our approach in two domains: a long-horizon maze solving task, and a complex image-based robotic manipulation simulator. In both settings, we show empirically that, when provided with a suitable bank of skills, our approach enables more effective long-horizon control as compared to alternative state representation learning methods.
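A minimal sketch of the core representation described above (illustrative, not the paper's code): the high-level state is the vector of each low-level skill's value estimate, so the abstraction encodes what the available skills can accomplish from the current state.

```python
import numpy as np

def value_function_space(state, skill_value_fns) -> np.ndarray:
    """skill_value_fns: list of callables V_k(state) -> float, one per low-level skill."""
    return np.array([v(state) for v in skill_value_fns])

# A high-level policy or planner then operates on this K-dimensional vector
# instead of the raw observation.
```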
    Inner Monologue: Embodied Reasoning through Planning with Language Models
    Wenlong Huang
    Harris Chan
    Jacky Liang
    Igor Mordatch
    Yevgen Chebotar
    Noah Brown
    Tomas Jackson
    Linda Luu
    Brian Andrew Ichter
    Conference on Robot Learning (2022) (to appear)
    Recent works have shown the capabilities of large language models to perform tasks requiring reasoning and to be applied to applications beyond natural language processing, such as planning and interaction for embodied robots. These embodied problems require an agent to understand the repertoire of skills available to a robot and the order in which they should be applied. They also require an agent to understand and ground itself within the environment. In this work we investigate to what extent LLMs can reason over sources of feedback provided through natural language. We propose an inner monologue as a way for an LLM to think through this process and plan. We investigate a variety of sources of feedback, such as success detectors and object detectors, as well as human interaction. The proposed method is validated in a simulation domain and on real robots. We show that Inner Monologue can successfully replan around failures, and generate new plans to accommodate human intent.
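A schematic of the replanning loop described above. The `llm`, `execute`, `detect_success`, and `describe_scene` callables are hypothetical stubs standing in for the components named in the abstract, not a real API: the LLM proposes the next skill, the robot executes it, and textual feedback is appended to the prompt before the next proposal.

```python
def inner_monologue(instruction, llm, execute, detect_success, describe_scene,
                    max_steps=10):
    prompt = f"Human: {instruction}\n"
    for _ in range(max_steps):
        skill = llm(prompt + "Robot:")            # LLM proposes the next skill
        if skill.strip().lower() == "done":
            break
        execute(skill)                             # run the low-level skill
        ok = detect_success(skill)                 # success-detector feedback
        prompt += f"Robot: {skill}\n"
        prompt += f"Success: {ok}\n"
        prompt += f"Scene: {describe_scene()}\n"   # object-detector feedback
    return prompt
```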
    AW-Opt: Learning Robotic Skills with Imitation and Reinforcement at Scale
    Yao Lu
    Yevgen Chebotar
    Mengyuan Yan
    Eric Victor Jang
    Alexander Herzog
    Mohi Khansari
    Dmitry Kalashnikov
    Conference on Robot Learning 2021 (2021)
    This paper proposes a new algorithm, AW-Opt, to combine Imitation Learning (IL) and Reinforcement Learning (RL). Prior methods face significant difficulty on sparse-reward, image-based robotic tasks. By carefully designing the sample filtering strategy, exploration strategy, and Bellman backup, AW-Opt outperforms existing SOTA algorithms. Experimental results in both simulation and with real robots show that AW-Opt can achieve a reasonable success rate from initial demonstrations, maintain low inference time, fine-tune to reach SOTA success rates, and use far fewer samples than existing algorithms.
    Factorizing declarative and procedural knowledge in structured, dynamic environments
    Anirudh Goyal
    Alex Lamb
    Phanideep Gampa
    Philippe Beaudoin
    Charles Blundell
    Yoshua Bengio
    Michael Mozer
    International Conference on Learning Representations (ICLR) (2021) (to appear)
    Modeling a structured, dynamic environment like a video game requires keeping track of the objects and their states (declarative knowledge) as well as predicting how objects behave (procedural knowledge). Black-box models with a monolithic hidden state often lack systematicity: they fail to apply procedural knowledge consistently and uniformly. For example, in a video game, correct prediction of one enemy's trajectory does not ensure correct prediction of another's. We address this issue via an architecture that factorizes declarative and procedural knowledge and that imposes modularity within each form of knowledge. The architecture consists of active modules called object files that maintain the state of a single object and invoke passive external knowledge sources called schemata that prescribe state updates. To use a video game as an illustration, two enemies of the same type will share schemata but will each have its own object file to encode its distinct state (e.g., health, position). We propose to use attention to control the determination of which object files update, the selection of schemata, and the propagation of information between object files. The resulting architecture is a drop-in replacement conforming to the same input-output interface as normal recurrent networks (e.g., LSTM, GRU) yet achieves substantially better generalization on environments like Atari, rolling balls, and visual reasoning.
    Offline Primitive Discovery for Accelerating Data-Driven Reinforcement Learning
    Anurag Ajay
    Aviral Kumar
    Pulkit Agrawal
    Ofir Nachum
    ICLR 2021 (2021) (to appear)
    Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent’s ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this practical setting, and our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which mitigates compounding error propagation during reinforcement learning in theory, and leads to improved offline RL in practice. In addition to accelerating offline policy optimization, we show that performing offline primitive learning in this way can also be leveraged for improving few-shot imitation learning as well as exploration and transfer in online RL on a variety of challenging benchmark domains.
    Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning
    Aviral Kumar*
    Dibya Ghosh
    International Conference on Learning Representations (2021)
    We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity in terms of a drop in the rank of the learned value network features, and show that this corresponds to a drop in performance. We demonstrate this phenomenon on widely studied domains, including Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse improves performance. More details can be found at agarwl.github.io/iup .
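A minimal sketch of the feature-rank diagnostic used to quantify this phenomenon: the effective rank of the feature matrix is the smallest number of singular values needed to capture a (1 - delta) fraction of the spectrum's sum, and a shrinking effective rank over training is the reported symptom. The choice `delta = 0.01` is an assumption following common convention.

```python
import numpy as np

def effective_rank(features: np.ndarray, delta: float = 0.01) -> int:
    """features: [num_states, feature_dim] matrix of penultimate-layer features."""
    singular_values = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(singular_values) / singular_values.sum()
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Random features retain nearly full rank; collapsed features would score far lower.
print(effective_rank(np.random.randn(256, 64)))  # typically close to 64
```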
    Actionable Models: Unsupervised Offline Learning of Robotic Skills
    Benjamin Eysenbach
    Dmitry Kalashnikov
    Jake Varley
    Yao Lu
    Yevgen Chebotar
    International Conference on Machine Learning 2021 (2021)
    We consider the problem of learning useful robotic skills from previously collected offline data without access to manually specified rewards or additional online exploration, a setting that is becoming increasingly important for scaling robot learning by reusing past robotic data. In particular, we propose the objective of learning a functional understanding of the environment by learning to reach any goal state in a given dataset. We employ goal-conditioned Q-learning with hindsight relabeling and develop several techniques that enable training in a particularly challenging offline setting. We find that our method can operate on high-dimensional camera images and learn a variety of skills on real robots that generalize to previously unseen scenes and objects. We also show that our method can learn to reach long-horizon goals across multiple episodes, and learn rich representations that can help with downstream tasks through pre-training or auxiliary objectives.
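A sketch of hindsight goal relabeling as used in goal-conditioned Q-learning (an illustrative reconstruction, not the authors' code): any state reached later in the same trajectory can serve as a goal for which the earlier transition is an example, with a sparse reward when the relabeled goal is reached at the next state.

```python
def hindsight_relabel(trajectory):
    """trajectory: list of (state, action, next_state) tuples."""
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        for _, _, future_state in trajectory[t:]:
            goal = future_state
            reward = 1.0 if s_next == goal else 0.0
            relabeled.append((s, a, goal, reward, s_next))
    return relabeled

print(len(hindsight_relabel([("s0", "a0", "s1"), ("s1", "a1", "s2")])))  # 3 relabeled transitions
```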
    Reinforcement learning provides an effective tool for robots to acquire diverse skills in an automated fashion. For safety and data generation purposes, control policies are often trained in a simulator and later deployed to the target environment, such as a real robot. However, transferring policies across domains is often a manual and tedious process. In order to bridge the gap between domains, it is often necessary to carefully tune and identify the simulator parameters or select the aspects of the simulation environment to randomize. In this paper, we design a novel, adversarial learning algorithm to tackle the transfer problem. We combine a classic, analytical simulator with a differentiable, state-action dependent system identification module that outputs the desired simulator parameters. We then train this hybrid simulator such that the output trajectory distributions are indistinguishable from a target domain collection. The optimized hybrid simulator can refine a sub-optimal policy without any additional target domain data. We show that our approach outperforms the domain-randomization and target-domain refinement baselines on two robots and six difficult dynamic tasks.
    BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning
    Corey Harrison Lynch
    Daniel Kappler
    Eric Victor Jang
    Frederik Ebert
    Mohi Khansari
    Conference on Robot Learning (2021)
    In this paper, we study the problem of enabling a vision-based robotic manipulation system to generalize across diverse scenes and diverse tasks, a long-standing challenge in robot learning. We approach the above challenge from an imitation learning perspective, aiming to study how scaling and broadening the data collected can facilitate generalization to new scenes and tasks. To that end, we develop a shared-autonomy system for demonstrating correct behavior to the robot along with an imitation learning method that can flexibly condition on task embeddings computed from language or video. Using this system, we scale data collection to dozens of scenes and over 100 tasks, and investigate how various design choices translate to performance. We show that our system enables a real robot, using the same neural network architecture for learning policies, to pick objects from a bin at 4 objects a minute, open swing doors and latched doors it has never seen before (success rates of 94% and 27%), and perform at least dozens of unseen manipulation tasks with a success rate of 50%.
    Social learning is a key component of human and animal intelligence. By taking cues from the behavior of experts in their environment, social learners can acquire sophisticated behavior and rapidly adapt to new circumstances. This paper investigates whether independent reinforcement learning (RL) agents in a multi-agent environment can learn to use social learning to improve their performance. We find that in most circumstances, vanilla model-free RL agents do not use social learning. We analyze the reasons for this deficiency, and show that by imposing constraints on the training environment and introducing a model-based auxiliary loss we are able to obtain generalized social learning policies which enable agents to: i) discover complex skills that are not learned from single-agent training, and ii) adapt online to novel environments by taking cues from experts present in the new environment. In contrast, agents trained with model-free RL or imitation learning generalize poorly and do not succeed in the transfer tasks. By mixing multi-agent and solo training, we can obtain agents that use social learning to gain skills that they can deploy when alone, even out-performing agents trained alone from the start.
    We study a setting in which an agent has access to data from the interaction of other agents with the same environment. However, it has no access to the rewards or goals of these agents, and their objectives and levels of expertise may vary widely. These assumptions are common in multi-agent settings, such as driving. To effectively use this data, we turn to the framework of successor features. This allows us to disentangle shared features and dynamics of the environment from agent-specific rewards and policies. We propose a multi-task inverse reinforcement learning (IRL) algorithm, called inverse temporal difference learning (ITD), that learns shared state features, alongside per-agent successor features and preference vectors, purely from demonstrations without reward labels. We further show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called PsiPhi-learning (pronounced 'Sci-Fi'). We provide empirical evidence for the effectiveness of PsiPhi-learning in a variety of environments, on RL, imitation, IRL, and few-shot transfer, and derive worst-case bounds for its performance in zero-shot transfer to new tasks.
    Evolving Reinforcement Learning Algorithms
    JD Co-Reyes
    Yingjie Miao
    Daiyi Peng
    Honglak Lee
    International Conference on Learning Representations (ICLR) (2021) (to appear)
    We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games. The analysis of the learned algorithm behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods.
    Learning Agile Robotic Locomotion Skills by Imitating Animals
    Edward Lee
    Erwin Johan Coumans
    Jason Peng
    Robotics: Science and Systems 2020, RSS Foundation (2020)
    Reproducing the diverse and agile locomotion skills of animals has been a longstanding challenge in robotics. While manually designed controllers have been able to emulate many complex behaviors, building such controllers often involves a tedious engineering process, and requires substantial expertise in the nuances of each skill. In this work, we present an imitation learning system that enables legged robots to learn agile locomotion skills by imitating real-world animals. We show that by leveraging reference motion data, a common framework is able to automatically synthesize controllers for a diverse repertoire of behaviors. By incorporating sample efficient domain adaptation techniques into the training process, our system is able to train adaptive policies in simulation, which can then be quickly finetuned and deployed in the real world. Our system enables an 18-DoF quadruped robot to perform a variety of agile behaviors ranging from different locomotion gaits to dynamic hops and turns.
    We study reinforcement learning in settings where sampling an action from the policy must be done concurrently with the time evolution of the controlled system, such as when a robot must decide on the next action while still performing the previous action. Much like a person or an animal, the robot must think and move at the same time, deciding on its next action before the previous one has completed. In order to develop an algorithmic framework for such concurrent control problems, we start with a continuous-time formulation of the Bellman equations, and then discretize them in a way that is aware of system delays. We instantiate this new class of approximate dynamic programming methods via a simple architectural extension to existing value-based deep reinforcement learning algorithms. We evaluate our methods on simulated benchmark tasks and a large-scale robotic grasping task where the robot must "think while moving."
    Model-Based Reinforcement Learning for Atari
    Blazej Osinski
    Henryk Michalewski
    Konrad Czechowski
    Lukasz Mieczyslaw Kaiser
    Mohammad Babaeizadeh
    Piotr Kozakowski
    Piotr Milos
    Roy H Campbell
    Ryan Sepassi
    NIPS'18 (2020)
    Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with orders of magnitude fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games and achieve competitive results with only 100K interactions between the agent and the environment (400K frames), which corresponds to about two hours of real-time play.
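A high-level schematic of the model-based loop described above. The helper callables (data collection, video-model training, policy optimization inside the model) are passed in as parameters and are placeholders for the components named in the abstract, not a real API.

```python
def simple_training_loop(env, policy, world_model,
                         collect_episodes, train_video_model, train_policy_in_model,
                         iterations=15):
    """Alternate: gather real experience, refit the video prediction model,
    then improve the policy entirely inside the learned model."""
    real_data = []
    for _ in range(iterations):
        real_data += collect_episodes(env, policy)     # interact with the real game
        train_video_model(world_model, real_data)       # fit the video prediction model
        train_policy_in_model(policy, world_model)      # RL using only model rollouts
    return policy
```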
    Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design
    Michael Dennis*
    Eugene Vinitsky
    Alexandre Bayen
    Stuart Russell
    Andrew Critch
    Neural Information Processing Systems (2020)
    A wide range of reinforcement learning (RL) problems - including robustness, transfer learning, unsupervised RL, and emergent complexity - require specifying a distribution of tasks or environments in which a policy will be trained. However, creating a useful distribution of environments is error prone, and takes a significant amount of developer time and effort. We propose Unsupervised Environment Design (UED) as an alternative paradigm, where developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments. Existing approaches to automatically generating environments suffer from common failure modes: domain randomization cannot generate structure or adapt the difficulty of the environment to the agent's learning progress, and minimax adversarial training leads to worst-case environments that are often unsolvable. To generate structured, solvable environments for our protagonist agent, we introduce a second, antagonist agent that is allied with the environment-generating adversary. The adversary is motivated to generate environments which maximize regret, defined as the difference between the protagonist and antagonist agent's return. We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED). Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.
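A minimal sketch of the regret signal described above (illustrative, with returns as plain floats from rollouts): the environment-generating adversary is rewarded by the gap between the antagonist's best return and the protagonist's average return, which is high only for environments that are solvable yet not solved by the protagonist.

```python
import numpy as np

def paired_regret(protagonist_returns, antagonist_returns) -> float:
    """Regret estimate: best antagonist return minus mean protagonist return."""
    return float(np.max(antagonist_returns) - np.mean(protagonist_returns))

# The adversary maximizes this regret; protagonist and antagonist each maximize
# their own return on the generated environment.
print(paired_regret([0.1, 0.2], [0.8, 0.9]))  # 0.75
```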
    One of the great promises of robot learning systems is that they will be able to learn from their mistakes and continuously adapt to ever-changing environments. Despite this potential, most of the robot learning systems today are deployed as a fixed policy and they are not being adapted after their deployment. Can we efficiently adapt previously learned behaviors to new environments, objects and percepts in the real world? In this paper, we present a method and empirical evidence towards a robot learning framework that facilitates continuous adaption. In particular, we demonstrate how to adapt vision-based robotic manipulation policies to new variations by fine-tuning via off-policy reinforcement learning, including changes in background, object shape and appearance, lighting conditions, and robot morphology. Further, this adaptation uses less than 0.2% of the data necessary to learn the task from scratch. We find that our approach of adapting pre-trained policies leads to substantial performance gains over the course of fine-tuning, and that pre-training via RL is essential: training from scratch or adapting from supervised ImageNet features are both unsuccessful with such small amounts of data. We also find that these positive results hold in a limited continual learning setting, in which we repeatedly fine-tune a single lineage of policies using data from a succession of new tasks. Our empirical conclusions are consistently supported by experiments on simulated manipulation tasks, and by 52 unique fine-tuning experiments on a real robotic grasping system pre-trained on 580,000 grasps.
    Robots trained via reinforcement learning (RL) require collecting and labeling many real-world episodes, which may be costly and time-consuming. Training models with a large amount of simulation is a cheaper alternative. However, simulations are not perfect and such models may not transfer to the real world. Techniques developed to close this simulation-to-reality (Sim2Real) gap typically apply randomization to the simulated images or adapt them with an additional Sim2Real model. A Generative Adversarial Network (GAN) may be used to adapt the pixels of the simulated image to be more realistic before use by a deep RL model. We find that CycleGAN, which enforces a cycle consistency between Sim2Real and Real2Sim adaptations, produces better images for RL than a GAN alone. Ultimately, we develop RL-CycleGAN, which includes a CycleGAN that trains jointly with the deep RL model and enforces that the RL model is consistent across all the adaptations. We evaluate RL-CycleGAN on two vision-based robotic grasping tasks and compare it to previous techniques. With 580,000 real episodes and millions of simulated episodes adapted with RL-CycleGAN, we achieve xx% grasp success, while a previous GAN-based approach, GraspGAN, achieves xx% grasp success. With only 5,000 real episodes, RL-CycleGAN and GraspGAN achieve xx% and xx% grasp success respectively. On a multi-bin grasping task, we show RL-CycleGAN drastically improves data efficiency, requiring 1/xth the amount of real data to reach the same grasping performance.
    Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally as in the case of pixel-level autoregressive models, or do not directly optimize the likelihood of the data. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modeling of video.
    Meta-reinforcement learning algorithms can enable robots to acquire new skills much more quickly, by leveraging prior experience to learn how to learn. However, much of the current research on meta-reinforcement learning focuses on task distributions that are very narrow. For example, a commonly used meta-reinforcement learning benchmark uses different running velocities for a simulated robot as different tasks. When policies are meta-trained on such narrow task distributions, they cannot possibly generalize to more quickly acquire entirely new tasks. Therefore, if the aim of these methods is to enable faster acquisition of entirely new behaviors, we must evaluate them on task distributions that are sufficiently broad to enable generalization to new behaviors. In this paper, we propose an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks. Our aim is to make it possible to develop algorithms that generalize to accelerate the acquisition of entirely new, held-out tasks. We evaluate 6 state-of-the-art meta-reinforcement learning and multi-task learning algorithms on these tasks. Surprisingly, while each task and its variations (e.g., with different object positions) can be learned with reasonable success, these algorithms struggle to learn with multiple tasks at the same time, even with as few as ten distinct training tasks. Our analysis and open-source environments pave the way for future research in multi-task learning and meta-learning that can enable meaningful generalization, thereby unlocking the full potential of these methods.
    Recall Traces: Backtracking Models for Efficient Reinforcement Learning
    Anirudh Goyal
    Philemon Brakel
    Liam Fedus
    Soumye Singhal
    Timothy Lillicrap
    Yoshua Bengio
    ICLR (2019)
    In many environments only a tiny subset of all states yield high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state. We can train a model which, starting from a high value state (or one that is estimated to have high value), predicts and samples which (state, action)-tuples may have led to that high value state. These traces of (state, action) pairs, which we refer to as Recall Traces, sampled from this backtracking model starting from a high value state, are informative as they terminate in good states, and hence we can use these traces to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. Our method improves the sample efficiency of both on- and off-policy RL algorithms across several environments and tasks.
    Diversity is All You Need: Learning Skills without a Reward Function
    Benjamin Eysenbach
    Abhishek Gupta
    Julian Ibarz
    ICLR (2019)
    Intelligent creatures can explore their environments and learn useful skills without supervision. In this paper, we propose “Diversity is All You Need” (DIAYN), a method for learning useful skills without a reward function. Our proposed method learns skills by maximizing an information theoretic objective using a maximum entropy policy. On a variety of simulated robotic tasks, we show that this simple objective results in the unsupervised emergence of diverse skills, such as walking and jumping. In a number of reinforcement learning benchmark environments, our method is able to learn a skill that solves the benchmark task despite never receiving the true task reward. We show how pretrained skills can provide a good parameter initialization for downstream tasks, and can be composed hierarchically to solve complex, sparse reward tasks. Our results suggest that unsupervised discovery of skills can serve as an effective pretraining mechanism for overcoming challenges of exploration and data efficiency in reinforcement learning.
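A sketch of the DIAYN-style pseudo-reward (illustrative, assuming a discriminator that outputs log-probabilities over skills): the agent is rewarded for visiting states from which its current skill is identifiable, r = log q(z | s) - log p(z), with skills drawn uniformly.

```python
import numpy as np

def diayn_reward(discriminator_log_probs: np.ndarray, skill: int, num_skills: int) -> float:
    """discriminator_log_probs: log q(z | s) over all skills for the current state."""
    log_p_z = -np.log(num_skills)          # uniform skill prior
    return float(discriminator_log_probs[skill] - log_p_z)

# Example with hypothetical discriminator outputs over 4 skills:
log_q = np.log(np.array([0.7, 0.1, 0.1, 0.1]))
print(diayn_reward(log_q, skill=0, num_skills=4))  # log(0.7) - log(0.25) ≈ 1.03
```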
    We propose a self-supervised approach to learning a wide variety of manipulation skills from unlabeled data collected through playing in and interacting within a playground environment. Learning by playing offers three main advantages: 1) Collecting large amounts of play data is cheap and fast as it does not require staging the scene nor labeling data, 2) It relaxes the need to have a discrete and rigid definition of skills/tasks during the data collection. This allows the agent to focus on acquiring a continuum set of manipulation skills as a whole, which can then be conditioned to perform a particular skill such as grasping. Furthermore, this data already includes ways to recover, retry or transition between different skills, which can be used to achieve a reactive closed-loop control policy, 3) It allows the agent to quickly learn a new skill by making use of pre-existing general abilities. Our proposed approach to learning new skills from unlabeled play data decouples high-level planning prediction from low-level action prediction by: first, self-supervised learning of a latent planning space, then self-supervised learning of an action model that is conditioned on a latent plan. This results in a single task-agnostic policy conditioned on a user-provided goal. This policy can perform a variety of tasks in the environment where playing was observed. We train a single model on 3 hours of unlabeled play data and evaluate it on 18 tasks simply by feeding a goal state corresponding to each task. The baseline model reaches an accuracy of 65% using 18 specialized policies in 100-shot per task and trained on 1800 expensive demonstrations. Our model completes the tasks with an average of 85% accuracy using a single policy, zero-shot (having never been explicitly trained on these tasks), using cheap unlabeled data. Videos of the performed experiments are available at https://sites.google.com/view/sslmp
    A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out decision states. These states lie at critical junctions in the state space from where the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually leverages the goal state. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.
    Algorithms for imitation learning based on adversarial optimization, such as generative adversarial imitation learning (GAIL) and adversarial inverse reinforcement learning (AIRL), can effectively mimic demonstrated behaviours by employing both reward and reinforcement learning (RL). However, applications of such algorithms are challenged by the inherent instability and poor sample efficiency of on-policy RL. In particular, the inadequate handling of absorbing states in canonical implementations of RL environments causes an implicit bias in reward functions used by these algorithms. While these biases might work well for some environments, they lead to sub-optimal behaviors in others. Moreover, despite the ability of these algorithms to learn from a few demonstrations, they require a prohibitively large number of environment interactions for many real-world applications. To address these issues, we first propose to extend the environment MDP with absorbing states, which leads to task-independent, and more importantly, unbiased rewards. Secondly, we introduce an off-policy learning algorithm, which we refer to as Discriminator-Actor-Critic. We demonstrate the effectiveness of proper handling of absorbing states, while empirically improving the sample efficiency by an average factor of 10. Our implementation is available online.
    We study the problem of representation learning in goal-conditioned hierarchical reinforcement learning. In such hierarchical structures, a higher-level controller solves tasks by iteratively communicating goals which a lower-level policy is trained to reach. Accordingly, the choice of representation – the mapping of observation space to goal space – is crucial. To study this problem, we develop a notion of sub-optimality of a representation, defined in terms of expected reward of the optimal hierarchical policy using this representation. We derive expressions which bound the sub-optimality and show how these expressions can be translated to representation learning objectives which may be optimized in practice. Results on a number of difficult continuous-control tasks show that our approach to representation learning yields qualitatively better representations as well as quantitatively better hierarchical policies, compared to existing methods.
    Reinforcement learning is a promising framework for solving control problems, but its use in practical situations is hampered by the fact that reward functions are often difficult to engineer. Specifying goals and tasks for autonomous machines, such as robots, is a significant challenge: conventionally, reward functions and goal states have been used to communicate objectives. But people can communicate objectives to each other simply by describing or demonstrating them. How can we build learning algorithms that will allow us to tell machines what we want them to do? In this work, we investigate the problem of grounding language commands as reward functions using inverse reinforcement learning, and argue that language-conditioned rewards are more transferable than language-conditioned policies to new environments. We propose language-conditioned reward learning (LC-RL), which grounds language commands as a reward function represented by a deep neural network. We demonstrate that our model learns rewards that transfer to novel tasks and environments on realistic, high-dimensional visual environments with natural language commands, whereas directly learning a language-conditioned policy leads to poor performance.
    Predicting the future in real-world settings, particularly from raw sensory observations such as images, is exceptionally challenging. Real-world events can be stochastic and unpredictable, and the high dimensionality and complexity of natural images requires the predictive model to build an intricate understanding of the natural world. Many existing methods tackle this problem by making simplifying assumptions about the environment. One common assumption is that the outcome is deterministic and there is only one plausible future, which leads to low-quality predictions in real-world settings with stochastic dynamics. In contrast, we developed a variational stochastic method for video prediction that predicts a different possible future for each sample of its latent random variables. To the best of our knowledge, our model is the first to provide effective stochastic multi-frame prediction for real-world video. We demonstrate the capability of the proposed method in predicting detailed future frames of videos on multiple real-world datasets, both action-free and action-conditioned. We find that our proposed method produces substantially improved video predictions when compared to the same model without stochasticity, and to other stochastic video prediction methods. The TensorFlow-based implementation of our method will be open sourced upon publication.
    The Mirage of Action-Dependent Baselines in Reinforcement Learning
    Surya Bhupatiraju
    Shane Gu
    Richard E. Turner
    Zoubin Ghahramani
    ICML (2018)
    Model-free reinforcement learning with flexible function approximators has shown recent success for solving goal-directed sequential decision-making problems. Policy gradient methods are a promising class of model-free algorithms, but they have high variance, which necessitates large batches resulting in low sample efficiency. Typically, a state-dependent control variate is used to reduce variance. Recently, several papers have introduced the idea of state and action-dependent control variates and showed that they significantly reduce variance and improve sample efficiency on continuous control tasks. We theoretically and numerically evaluate biases and variances of these policy gradient methods, and show that action-dependent control variates do not appreciably reduce variance in the tested domains. We show that seemingly insignificant implementation details enable these prior methods to achieve good empirical improvements, but at the cost of introducing further bias to the gradient. Our analysis indicates that biased methods tend to improve the performance significantly more than unbiased ones.
    Sim2Real Viewpoint Invariant Visual Servoing by Recurrent Control
    Fereshteh Sadeghi
    Alexander Toshev
    Eric Jang
    CVPR 2018, pp. 8
    Humans are remarkably proficient at controlling their limbs and tools from a wide range of viewpoints. In robotics, this ability is referred to as visual servoing: moving a tool or end-point to a desired location using primarily visual feedback. In this paper, we propose learning viewpoint invariant visual servoing skills in a robot manipulation task. We train a deep recurrent controller that can automatically determine which actions move the end-effector of a robotic arm to a desired object. This problem is fundamentally ambiguous: under severe variation in viewpoint, it may be impossible to determine the actions in a single feedforward operation. Instead, our visual servoing approach uses its memory of past movements to understand how the actions affect the robot motion from the current viewpoint, correcting mistakes and gradually moving closer to the target. This ability is in stark contrast to previous visual servoing methods, which assume known dynamics or require a calibration phase. We learn our recurrent controller using simulated data, synthetic demonstrations and reinforcement learning. We then describe how the resulting model can be transferred to a real-world robot by disentangling perception from control and only adapting the visual layers. The adapted model can servo to previously unseen objects from novel viewpoints on a real-world Kuka IIWA robotic arm. For supplementary videos, see: https://www.youtube.com/watch?v=oLgM2Bnb7fo
    QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
    Dmitry Kalashnikov
    Peter Pastor Sampedro
    Julian Ibarz
    Alexander Herzog
    Eric Jang
    Deirdre Quillen
    Ethan Holly
    Mrinal Kalakrishnan
    CORL (2018)
    In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations. Supplementary experiment videos can be found at https://goo.gl/wQrYmc
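A sketch of QT-Opt-style action selection (illustrative, not the system's code): instead of an actor network, the action is chosen by running the cross-entropy method (CEM) over the learned Q-function. The quadratic `toy_q` below is a stand-in for a real Q-network so the snippet runs on its own.

```python
import numpy as np

def cem_maximize_q(q_fn, state, action_dim, iters=3, samples=64, elites=6):
    """Iteratively refit a Gaussian over actions to the top-scoring samples under Q."""
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        actions = np.random.randn(samples, action_dim) * std + mean
        scores = np.array([q_fn(state, a) for a in actions])
        elite_actions = actions[np.argsort(scores)[-elites:]]
        mean, std = elite_actions.mean(axis=0), elite_actions.std(axis=0) + 1e-6
    return mean

toy_q = lambda s, a: -np.sum((a - 0.5) ** 2)            # maximized at a = 0.5
print(cem_maximize_q(toy_q, state=None, action_dim=3))  # ≈ [0.5, 0.5, 0.5]
```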
    Time-Contrastive Networks: Self-Supervised Learning from Video
    Corey Lynch
    Yevgen Chebotar
    Eric Jang
    Stefan Schaal
    Proceedings of International Conference in Robotics and Automation (ICRA 2018) + Deep Learning for Robotic Vision (DLRV) Workshop at CVPR 2017 + Deep Reinforcement Learning Symposium at NIPS 2017 (2018)
    Preview abstract We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captures the relationships between end-effectors (hands or robot grippers) and the environment, object attributes, and body pose. We train our representations using a triplet loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm. While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single 3rd-person demonstration by a human. Reward functions obtained by following the human demonstrations under the learned representation enable efficient reinforcement learning that is practical for real-world robotic systems. Video results, open-source code and dataset are available at https://sermanet.github.io/imitate View details
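    Concretely, the time-contrastive signal can be written as a triplet loss in which the positive comes from a different viewpoint at the same moment and the negative comes from a nearby moment in time. The sketch below is a minimal version with an illustrative margin and squared-Euclidean distance, not the paper's exact hyperparameters.

```python
import numpy as np

def multiview_triplet_loss(anchor, positive, negative, margin=0.2):
    """Time-contrastive triplet loss sketch: `anchor` and `positive` are
    embeddings of the same moment seen from two viewpoints; `negative` is an
    embedding of a temporally nearby frame from the anchor's own viewpoint
    (visually similar but functionally different). Margin and distance are
    illustrative choices."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```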
    Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping
    Paul Wohlhart
    Matthew Kelcey
    Mrinal Kalakrishnan
    Laura Downs
    Julian Ibarz
    Peter Pastor Sampedro
    Kurt Konolige
    ICRA (2018)
    Preview abstract Instrumenting and collecting annotated visual grasping datasets to train modern machine learning algorithms is prohibitively expensive. An appealing alternative is to use off-the-shelf simulators to render synthetic data for which ground-truth annotations are generated automatically. Unfortunately, models trained purely on simulated data often fail to generalize to the real world. To address this shortcoming, prior work introduced domain adaptation algorithms that attempt to make the resulting models domain-invariant. However, such works were evaluated primarily on offline image classification datasets. In this work, we adapt these techniques for learning, primarily in simulation, robotic hand-eye coordination for grasping. Our approaches generalize to diverse and previously unseen real-world objects. We show that, by using synthetic data and domain adaptation, we are able to reduce the number of real-world samples required to reach a given level of performance by up to 50 times. We also show that, using our suggested methodology, we are able to achieve good grasping results with no real-world labeled data. View details
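    One common feature-level domain adaptation objective of the kind referenced above trains a small domain classifier to distinguish simulated from real features while the feature extractor is trained to confuse it. The sketch below is a generic instantiation under that assumption and does not reproduce the paper's specific pixel-level and feature-level adaptation pipeline.

```python
import torch
import torch.nn as nn

# Generic domain-adversarial feature alignment; an illustrative sketch only.
feature_net = nn.Sequential(nn.LazyLinear(128), nn.ReLU(), nn.Linear(128, 64))
task_head = nn.Linear(64, 1)      # e.g. grasp-success logit
domain_head = nn.Linear(64, 1)    # logit for: simulated (0) vs real (1)
bce = nn.BCEWithLogitsLoss()

def adaptation_losses(x, task_label, domain_label, lam=0.1):
    """Returns the task loss, the domain classifier's loss, and a confusion
    term that pushes the feature extractor toward domain-invariant features.
    Labels are float tensors of shape (batch, 1)."""
    z = feature_net(x)
    task_loss = bce(task_head(z), task_label)
    domain_loss = bce(domain_head(z.detach()), domain_label)   # trains the classifier only
    confusion_loss = -lam * bce(domain_head(z), domain_label)  # trains the features to fool it
    return task_loss, domain_loss, confusion_loss
```

    In practice the two adversarial terms are handled with separate optimizers or a gradient-reversal layer; the minimax structure is the same either way.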
    Preview abstract Deep reinforcement learning algorithms can learn complex behavioral skills, but real-world application of these methods requires a considerable amount of experience to be collected by the agent. In practical settings, such as robotics, this involves repeatedly attempting a task, resetting the environment between each attempt. However, not all tasks are easily or automatically reversible. In practice, this learning process requires considerable human intervention. In this work, we propose an autonomous method for safe and efficient reinforcement learning that simultaneously learns a forward and backward policy, with the backward policy resetting the environment for a subsequent attempt. By learning a value function for the backward policy, we can automatically determine when the forward policy is about to enter a non-reversible state, providing for uncertainty-aware safety aborts. Our experiments illustrate that proper use of the backward policy can greatly reduce the number of manual resets required to learn a task and can reduce the number of unsafe actions that lead to non-reversible states. View details
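    A minimal sketch of the abort rule described above, with the threshold, function names, and interfaces assumed purely for illustration:

```python
def safe_step(env, state, forward_policy, reset_policy, reset_value, threshold=0.3):
    """Early-abort sketch: if the reset (backward) policy's value estimate says
    the current state is unlikely to be reversible, hand control to the reset
    policy instead of continuing forward. Names and threshold are illustrative."""
    if reset_value(state) < threshold:          # low value => hard to reset
        action = reset_policy(state)            # uncertainty-aware safety abort
    else:
        action = forward_policy(state)
    return env.step(action)
```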
    Data-Efficient Hierarchical Reinforcement Learning
    Ofir Nachum
    Shane Gu
    Honglak Lee
    NeurIPS (2018)
    Preview abstract Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional reinforcement learning (RL) methods to solve more complex tasks. Yet, the majority of current HRL methods require careful task-specific design and on-policy training, making them impractical in real-world scenarios. In this paper, we study how we can develop HRL algorithms that are general, in that they do not make onerous additional assumptions beyond standard RL algorithms, and efficient, in the sense that they can be used with modest numbers of interaction samples, making them suitable for real-world problems such as robotic control. For generality, we develop a scheme where lower-level controllers are supervised with goals that are learned and proposed automatically by the higher-level controllers. As for efficiency, we propose a principled method for using off-policy experience for both upper and lower-level behaviors, despite the fact that each is interacting with a non-stationary partner. This allows us to take advantage of recent advances in off-policy RL and train hierarchical policies with a much smaller amount of environment interactions than any generic on-policy method. We find that our resulting HRL agent is generally applicable and highly sample-efficient. Our experiments show that our method can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, using 10M samples. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms the previous state-of-the-art techniques. View details
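    In schemes of this kind, the lower level is typically rewarded for reaching the state offset proposed by the higher level. The sketch below shows that intrinsic reward with goals treated as desired state deltas, which is one common choice rather than a verbatim restatement of the paper's equations.

```python
import numpy as np

def lower_level_reward(state, goal, next_state):
    """Intrinsic reward for the lower-level policy: move the agent's state by
    the delta proposed by the higher level. Goals-as-deltas and the Euclidean
    distance are illustrative choices."""
    return -np.linalg.norm(state + goal - next_state)
```

    The off-policy correction mentioned in the abstract then adjusts (relabels) stored high-level goals so that old experience remains consistent with the current, changing lower-level policy.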
    Preview abstract Robotic learning algorithms based on reinforcement, self-supervision, and imitation can acquire end-to-end controllers from raw sensory inputs such as images. These end-to-end controllers acquire perception systems that are tailored to the task, picking up on the cues that are most useful for the task at hand. However, to learn generalizable robotic skills, we might prefer more structured image representations, such as ones encoding the persistence of objects and their identities. In this paper, we study a specific instance of this problem: acquiring object representations through a robot's autonomous interaction with its environment. Our representation learning method is based on object persistence: when a robot picks up an object and "subtracts" it from the scene, its representation of the scene should change in a predictable way. We can use this observation to formulate a simple condition that an object-centric representation should satisfy: the features of a scene should be approximately equal to the features of the same scene after an object has been removed, plus the features of that object. View details
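    Written as feature arithmetic, with φ denoting the learned encoder (the symbol is ours, not the paper's), the condition reads:

```latex
\phi(\text{scene before removal}) \;-\; \phi(\text{scene after removal})
  \;\approx\; \phi(\text{removed object})
```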
    Temporal Difference Models: Model-Free Deep RL for Model-Based Control
    Vitchyr Pong
    Shane Gu
    Murtaza Dalal
    ICLR (2018) (to appear)
    Preview abstract Model-free reinforcement learning (RL) has been proven to be a powerful, general tool for learning complex behaviors. However, its sample efficiency is often impractically large for solving challenging real-world problems, even for off-policy algorithms such as Q-learning. A limiting factor in classic model-free RL is that the learning signal consists only of scalar rewards, ignoring much of the rich information contained in state transition tuples. Model-based RL uses this information, by training a predictive model, but often does not achieve the same asymptotic performance as model-free RL due to model bias. We introduce temporal difference models (TDMs), a family of goal-conditioned value functions that unify model-free learning and model-based learning. TDMs combine the benefits of model-free and model-based RL: they leverage the rich information in state transitions to learn very efficiently, while still attaining asymptotic performance that exceeds that of direct model-based RL methods. Our experimental results show that, on a range of continuous control tasks, TDMs provide a substantial improvement in efficiency compared to state-of-the-art model-free methods. View details
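    Schematically, a goal- and horizon-conditioned value function of the kind described above obeys a backup in which the horizon counts down and the goal-reaching reward is only collected at the final step. The notation below is ours and simplifies the paper's formulation:

```latex
Q(s_t, a_t, g, \tau) \;=\; \mathbb{E}_{s_{t+1}}\!\left[
  \begin{cases}
    -\,\lVert s_{t+1} - g \rVert & \text{if } \tau = 0,\\[2pt]
    \max_{a'} Q(s_{t+1}, a', g, \tau - 1) & \text{if } \tau > 0
  \end{cases}
\right]
```

    Because the target is a distance over full state transitions rather than a scalar task reward, the value function absorbs much of the information a learned dynamics model would carry, without committing to explicit model rollouts.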
    Preview abstract In this paper, we explore deep reinforcement learning algorithms for vision-based robotic grasping. Model-free deep reinforcement learning has been successfully applied to a range of challenging environments, but the proliferation of various deep RL methods makes it difficult to discern which particular approach would be best suited for a rich, diverse task like grasping. To answer this question, we conduct a detailed simulated study of reinforcement learning methods on a grasping task that emphasizes diversity and off-policy learning. Off-policy learning is important to enable the method to utilize past data of grasping a wide variety of objects, and diversity is important to enable the method to generalize to new objects that were not seen during training. We evaluate a variety of Q-function estimation methods, a method previously proposed for robotic grasping with deep neural network models, and a novel approach based on a combination of Monte Carlo return estimation and an off-policy correction. Our results indicate that several simple methods provide a surprisingly strong competitor to popular algorithms such as double Q-learning, and our analysis of stability sheds light on the relative tradeoffs between the algorithms. View details
    Unsupervised Perceptual Rewards for Imitation Learning
    Kelvin Xu
    Proceedings of Robotics: Science and Systems (RSS 2017) + Deep Learning for Action and Interaction workshop at NIPS (2016) + International Conference on Learning Representations (ICLR 2017) Workshop (2017)
    Preview abstract Reward function design and exploration time are arguably the biggest obstacles to the deployment of reinforcement learning (RL) agents in the real world. In many real-world tasks, designing a reward function takes considerable hand engineering and often requires additional sensors to be installed just to measure whether the task has been executed successfully. Furthermore, many interesting tasks consist of multiple implicit intermediate steps that must be executed in sequence. Even when the final outcome can be measured, it does not necessarily provide feedback on these intermediate steps. To address these issues, we propose leveraging the abstraction power of intermediate visual representations learned by deep models to quickly infer perceptual reward functions from small numbers of demonstrations. We present a method that is able to identify key intermediate steps of a task from only a handful of demonstration sequences, and automatically identify the most discriminative features for identifying these steps. This method makes use of the features in a pre-trained deep model, but does not require any explicit specification of sub-goals. The resulting reward functions can then be used by an RL agent to learn to perform the task in real-world settings. To evaluate the learned reward, we present qualitative results on two real-world tasks and a quantitative evaluation against a human-designed reward function. We also show that our method can be used to learn a real-world door opening skill using a real robot, even when the demonstration used for reward learning is provided by a human using their own hand. To our knowledge, these are the first results showing that complex robotic manipulation skills can be learned directly and without supervised labels from a video of a human performing the task. Supplementary material and data are available at https://sermanet.github.io/rewards View details
    Path Integral Guided Policy Search
    Yevgen Chebotar
    Mrinal Kalakrishnan
    Ali Yahya
    Adrian Li
    Stefan Schaal
    IEEE International Conference on Robotics and Automation (2017)
    Preview abstract We present a policy search method for learning complex feedback control policies that map from high-dimensional sensory inputs to motor torques, for manipulation tasks with discontinuous contact dynamics. We build on a prior technique called guided policy search (GPS), which iteratively optimizes a set of local policies for specific instances of a task, and uses these to train a complex, high-dimensional global policy that generalizes across task instances. We extend GPS in the following ways: (1) we propose the use of a model-free local optimizer based on path integral stochastic optimal control (PI2), which enables us to learn local policies for tasks with highly discontinuous contact dynamics; and (2) we enable GPS to train on a new set of task instances in every iteration by using on-policy sampling: this increases the diversity of the instances that the policy is trained on, and is crucial for achieving good generalization. We show that these contributions enable us to learn deep neural network policies that can directly perform torque control from visual input. We validate the method on a challenging door opening task and a pick-and-place task, and we demonstrate that our approach substantially outperforms the prior LQR-based local policy optimizer on these tasks. Furthermore, we show that on-policy sampling significantly increases the generalization ability of these policies. View details
    Collective Robot Reinforcement Learning with Distributed Asynchronous Guided Policy Search
    Ali Yahya
    Adrian Li
    Mrinal Kalakrishnan
    Yevgen Chebotar
    IEEE/RSJ International Conference on Intelligent Robots and Systems (2017) (to appear)
    Preview abstract In principle, reinforcement learning and policy search methods can enable robots to learn highly complex and general skills that may allow them to function amid the complexity and diversity of the real world. However, training a policy that generalizes well across a wide range of real-world conditions requires far greater quantity and diversity of experience than is practical to collect with a single robot. Fortunately, it is possible for multiple robots to share their experience with one another, and thereby, learn a policy collectively. In this work, we explore distributed and asynchronous policy learning as a means to achieve generalization and improved training times on challenging, real-world manipulation tasks. We propose a distributed and asynchronous version of Guided Policy Search and use it to demonstrate collective policy learning on a vision-based door opening task using four robots. We show that it achieves better generalization, utilization, and training times than the single robot alternative. View details
    Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
    Shixiang Gu
    Timothy Lillicrap
    Zoubin Ghahramani
    Richard E. Turner
    Bernhard Schölkopf
    NIPS (2017)
    Preview abstract Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixing of off-policy gradient estimates with on-policy samples contribute to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on the state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks. View details
    Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
    Shixiang Gu
    Timothy Lillicrap
    Zoubin Ghahramani
    Richard E. Turner
    ICLR (2017)
    Preview abstract Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is the high sample complexity of such methods. Unbiased batch policy-gradient methods offer stable learning, but at the cost of high variance, which often requires large batches, while TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize. In this work, we aim to develop methods that combine the stability of unbiased policy gradients with the efficiency of off-policy RL. We present Q-Prop, a policy gradient method that uses a Taylor expansion of the off-policy critic as a control variate. Q-Prop is both sample efficient and stable, and effectively combines the benefits of on-policy and off-policy methods. We analyze the connection between Q-Prop and existing model-free algorithms, and use control variate theory to derive two variants of Q-Prop with conservative and aggressive adaptation. We show that conservative Q-Prop provides substantial gains in sample efficiency over trust region policy optimization (TRPO) with generalized advantage estimation (GAE), and improves stability over deep deterministic policy gradient (DDPG), the state-of-the-art on-policy and off-policy methods, on OpenAI Gym's MuJoCo continuous control environments. View details
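    Concretely, the control variate in question is the first-order Taylor expansion of the learned critic around the policy mean; schematically, in our notation (which may differ from the paper's):

```latex
\bar{A}_w(s,a) \;=\; \nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}\cdot\big(a - \mu_\theta(s)\big)

\nabla_\theta J(\theta) \;\approx\;
  \mathbb{E}\!\Big[\nabla_\theta \log\pi_\theta(a\mid s)\,\big(\hat{A}(s,a) - \bar{A}_w(s,a)\big)\Big]
  \;+\; \mathbb{E}\!\Big[\nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}\,\nabla_\theta \mu_\theta(s)\Big]
```

    The first term is a reduced-variance on-policy gradient; the second is computed analytically from the off-policy critic, which is where the sample-efficiency gain comes from.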
    Preview abstract A key challenge in scaling up robot learning to many skills and environments is removing the need for human supervision, so that robots can collect their own data and improve their own performance without being limited by the cost of requesting human feedback. Model-based reinforcement learning holds the promise of enabling an agent to learn to predict the effects of its actions, which could provide flexible predictive models for a wide range of tasks and environments, without detailed human supervision. We develop a method for combining deep action-conditioned video prediction models with model-predictive control that uses entirely unlabeled training data. Our approach does not require a calibrated camera, an instrumented training set-up, nor precise sensing and actuation. Our results show that our method enables a real robot to perform nonprehensile manipulation -- pushing objects -- and can handle novel objects not seen during training. View details
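    A rough sketch of how an action-conditioned prediction model can be used inside sampling-based model-predictive control follows; predict_video and cost_fn are assumed interfaces, and the real system's cost (e.g. the distance of user-designated pixels from their goal positions) is not reproduced here.

```python
import numpy as np

def visual_mpc_plan(predict_video, cost_fn, current_frames, horizon=10,
                    num_samples=100, action_dim=4, seed=0):
    """Sampling-based MPC sketch on top of an action-conditioned video
    prediction model. Candidate action sequences are rolled out through the
    learned model and scored; only the first action of the best sequence is
    executed before replanning. All names and ranges are illustrative."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    costs = [cost_fn(predict_video(current_frames, actions)) for actions in candidates]
    best = candidates[int(np.argmin(costs))]
    return best[0]  # execute the first action, then replan
```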
    End-to-End Learning of Semantic Grasping
    Eric Jang
    Sudheendra Vijaynarasimhan
    Peter Pastor
    Julian Ibarz
    CoRL (2017)
    Preview abstract We consider the task of semantic robotic grasping, in which a robot picks up an object of a user-specified class using only monocular images. Inspired by the two-stream hypothesis of visual reasoning, we present a semantic grasping framework that learns object detection, classification, and grasp planning in an end-to-end fashion. A "ventral stream" recognizes object class while a "dorsal stream" simultaneously interprets the geometric relationships necessary to execute successful grasps. We leverage the autonomous data collection capabilities of robots to obtain a large self-supervised dataset for training the dorsal stream, and use semi-supervised label propagation to train the ventral stream with only a modest amount of human supervision. We experimentally show that our approach improves upon grasping systems whose components are not learned end-to-end, including a baseline method that uses bounding box detection. Furthermore, we show that jointly training our model with auxiliary data consisting of non-semantic grasping data, as well as semantically labeled images without grasp actions, has the potential to substantially improve semantic grasping performance. View details
    Preview abstract Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations. View details
    Preview abstract In this paper, we examine the problem of robotic manipulation of granular media. We evaluate multiple predictive models used to infer the dynamics of scooping and dumping actions. These models are evaluated on a task that involves manipulating the media in order to deform it into a desired shape. Our best-performing model is based on a highly tailored convolutional network architecture with domain-specific optimizations, which we show accurately models the physical interaction of the robotic scoop with the underlying media. We empirically demonstrate that explicitly predicting physical mechanics results in a policy that outperforms both a hand-crafted dynamics baseline and a “value network”, which must otherwise implicitly predict the same mechanics in order to produce accurate value estimates. View details
    Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection
    Peter Pastor Sampedro
    Alex Krizhevsky
    Julian Ibarz
    Deirdre Quillen
    International Symposium on Experimental Robotics (2017)
    Preview abstract We describe a learning-based approach to hand-eye coordination for robotic grasping from monocular images. To learn hand-eye coordination for grasping, we trained a large convolutional neural network to predict the probability that task-space motion of the gripper will result in successful grasps, using only monocular camera images independent of camera calibration or the current robot pose. This requires the network to observe the spatial relationship between the gripper and objects in the scene, thus learning hand-eye coordination. We then use this network to servo the gripper in real time to achieve successful grasps. We describe two large-scale experiments that we conducted on two separate robotic platforms. In the first experiment, about 800,000 grasp attempts were collected over the course of two months, using between 6 and 14 robotic manipulators at any given time, with differences in camera placement and gripper wear and tear. In the second experiment, we used a different robotic platform and 8 robots to collect a dataset consisting of over 900,000 grasp attempts. The second robotic platform was used to test transfer between robots, and the degree to which data from a different set of robots can be used to aid learning. Our experimental results demonstrate that our approach achieves effective real-time control, can successfully grasp novel objects, and corrects mistakes by continuous servoing. Our transfer experiment also illustrates that data from different robots can be combined to learn more reliable and effective grasping. View details
    Preview abstract We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person viewpoints and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the planner, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. Our experiments demonstrate that CMP outperforms both reactive strategies and standard memory-based architectures and performs well in novel environments. Furthermore, we show that CMP can also achieve semantically specified goals, such as 'go to a chair'. View details
    Preview abstract A core challenge for an agent learning to interact with the world is to predict how its actions affect objects in its environment. Many existing methods for learning the dynamics of physical interactions require labeled object information. However, to scale real-world interaction learning to a variety of scenes and objects, acquiring labeled data becomes increasingly impractical. To learn about physical object motion without labels, we develop an action-conditioned video prediction model that explicitly models pixel motion, by predicting a distribution over pixel motion from previous frames. Because our model explicitly predicts motion, it is partially invariant to object appearance, enabling it to generalize to previously unseen objects. To explore video prediction for real-world interactive agents, we also introduce a dataset of 50,000 robot interactions involving pushing motions, including a test set with novel objects. In this dataset, accurate prediction of videos conditioned on the robot's future actions amounts to learning a "visual imagination" of different futures based on different courses of action. Our experiments show that our proposed method not only produces more accurate video predictions, but also more accurately predicts object motion, when compared to prior methods. View details
    Continuous Deep Q-Learning with Model-based Acceleration
    Shixiang Gu
    Timothy Lillicrap
    Ilya Sutskever
    International Conference on Machine Learning (2016)
    Preview abstract Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. The NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable. View details
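    The key construction is to restrict the advantage to a quadratic in the action, so that the action maximizing Q is available in closed form (notation ours):

```latex
Q(s,a) \;=\; V(s) + A(s,a), \qquad
A(s,a) \;=\; -\tfrac{1}{2}\,\big(a - \mu(s)\big)^{\!\top} P(s)\,\big(a - \mu(s)\big)
```

    With P(s) constrained to be positive definite (for example, built from a learned lower-triangular factor), the greedy action is simply μ(s), so standard Q-learning with experience replay carries over to continuous action spaces.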
    MuProp: Unbiased Backpropagation for Stochastic Neural Networks
    Shixiang Gu
    Ilya Sutskever
    Andriy Mnih
    International Conference on Learning Representations (2016)
    Preview abstract Deep neural networks are powerful parametric models that can be trained efficiently using the backpropagation algorithm. Stochastic neural networks combine the power of large parametric functions with that of graphical models, which makes it possible to learn very complex distributions. However, as backpropagation is not directly applicable to stochastic networks that include discrete sampling operations within their computational graph, training such networks remains difficult. We present MuProp, an unbiased gradient estimator for stochastic networks, designed to make this task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance using a control variate based on the first-order Taylor expansion of a mean-field network. Crucially, unlike prior attempts at using backpropagation for training stochastic networks, the resulting estimator is unbiased and well behaved. Our experiments on structured output prediction and discrete latent variable modeling demonstrate that MuProp yields consistently good performance across a range of difficult tasks. View details
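    Schematically, in generic score-function notation (ours, not the paper's), the MuProp baseline is the first-order Taylor expansion of the objective around a deterministic mean-field forward pass x̄, and its expectation is added back so the estimator stays unbiased:

```latex
h(x) \;=\; f(\bar{x}) + f'(\bar{x})^{\top}(x - \bar{x})

\hat{g}_{\text{MuProp}} \;=\;
  \big(f(x) - h(x)\big)\,\nabla_\theta \log p_\theta(x)
  \;+\; \nabla_\theta\,\mathbb{E}_{x\sim p_\theta}\!\big[h(x)\big]
```

    Because h is linear in x, the correction term depends only on the mean of x and on the mean-field pass, so it can be computed with deterministic backpropagation.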