George Tucker

George Tucker

I am interested in modeling sequences and sequential decision-making problems. Before joining Google, I was a research scientist on the Amazon Speech team in Boston. My focus was on designing deep learning models for small-footprint keyword spotting. Before joining Amazon, I was a Postdoctoral Research Fellow in the Price lab at the Harvard School of Public Health. I worked on methods for risk prediction and association testing in studies with related individuals. I conducted my PhD research in the MIT Mathematics department in Professor Bonnie Berger's research group.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Offline reinforcement learning (RL) on large, heterogeneous datasets with highcapacity models can, in principle, lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the learnings from these works, we re-examine previous design choices and find that with appropriate choices: wider ResNets, cross-entropy based distributional backups, and feature normalization, offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multitask Atari as a test-bed for scaling and generalization, we train a single policy on 40 games with near-human performance using up-to 100M parameter networks, finding that model performance scales favorably with capacity. In contrast to prior work, we substantially extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% humannormalized score). Compared to supervised approaches, offline RL scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that such offline Q-functions learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing representation learning approaches. View details
    Preview abstract Despite overparameterization, deep networks trained via supervised learning are easy to optimize and exhibit excellent generalization. One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors parsimonious solutions that generalize well on test inputs. It is reasonable to surmise that deep reinforcement learning (RL) methods could also benefit from this effect. In this paper, we discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive "aliasing", in stark contrast to the supervised learning case. We back up these findings empirically, showing that feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations for state-action pairs that appear on either side of the Bellman backup. To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer. When combined with existing offline RL methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains and robotic manipulation from images. View details
    Model-Based Reinforcement Learning for Atari
    Blazej Osinski
    Chelsea Finn
    Henryk Michalewski
    Konrad Czechowski
    Lukasz Mieczyslaw Kaiser
    Mohammad Babaeizadeh
    Piotr Kozakowski
    Piotr Milos
    Roy H Campbell
    Afroz Mohiuddin
    Ryan Sepassi
    Sergey Levine
    NIPS'18 (2020)
    Preview abstract Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with orders of magnitude fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games and achieve competitive results with only 100K interactions between the agent and the environment (400K frames), which corresponds to about two hours of real-time play. View details
    Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives
    Dieterich Lawson
    Shixiang Gu
    Christopher Maddison
    ICLR (2019)
    Preview abstract Deep latent variable models have become a popular model choice due to the scalable learning algorithms introduced by (Kingma & Welling, 2013; Rezende et al., 2014). These approaches maximize a variational lower bound on the intractable log likelihood of the observed data. Burda et al. (2015) introduced a multi-sample variational bound, IWAE, that is at least as tight as the standard variational lower bound and becomes increasingly tight as the number of samples increases. Counterintuitively, the typical inference network gradient estimator for the IWAE bound performs poorly as the number of samples increases (Rainforth et al., 2018; Le et al., 2018). Roeder et al. (2017) propose an improved gradient estimator, however, are unable to show it is unbiased. We show that it is in fact biased and that the bias can be estimated efficiently with a second application of the reparameterization trick. The doubly reparameterized gradient (DReG) estimator does not suffer as the number of samples increases, resolving the previously raised issues. The same idea can be used to improve many recently introduced training techniques for latent variable models. In particular, we show that this estimator reduces the variance of the IWAE gradient, the reweighted wake-sleep update (RWS) (Bornschein & Bengio, 2014), and the jackknife variational inference (JVI) gradient (Nowozin, 2018). Finally, we show that this computationally efficient, unbiased drop-in gradient estimator translates to improved performance for all three objectives on several modeling tasks. View details
    Preview abstract The smallest eigenvectors of the graph Laplacian are well-known to provide a succinct representation of the geometry of a weighted graph. In reinforcement learning (RL), where the weighted graph may be interpreted as the state transition process induced by a behavior policy acting on the environment, approximating the eigenvectors of the Laplacian provides a promising approach to state representation learning. However, existing methods for performing this approximation are ill-suited in general RL settings for two main reasons: First, they are computationally expensive, often requiring operations on large matrices. Second, these methods lack adequate justification beyond simple, tabular, finite-state settings. In this paper, we present a fully general and scalable method for approximating the eigenvectors of the Laplacian in a model-free RL context. We systematically evaluate our approach and empirically show that it generalizes beyond the tabular, finite-state setting. Even in tabular, finite-state settings, its ability to approximate the eigenvectors outperforms previous proposals. Finally, we show the potential benefits of using a Laplacian representation learned using our method in goalachieving RL tasks, providing evidence that our technique can be used to significantly improve the performance of an RL agent. View details
    Preview abstract Estimating and optimizing Mutual Information (MI) is core to many problems in machine learning; however, bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks, but the relationships and tradeoffs between these bounds remains unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On high-dimensional, controlled problems, we empirically characterize the bias and variance of the bounds and their gradients and demonstrate the effectiveness of our new bounds for estimation and representation learning. View details
    The Mirage of Action-Dependent Baselines in Reinforcement Learning
    Surya Bhupatiraju
    Shane Gu
    Richard E. Turner
    Zoubin Ghahramani
    Sergey Levine
    ICML (2018)
    Preview abstract Model-free reinforcement learning with flexible function approximators has shown recent success for solving goal-directed sequential decision-making problems. Policy gradient methods are a promising class of model-free algorithms, but they have high variance, which necessitates large batches resulting in low sample efficiency. Typically, a state-dependent control variate is used to reduce variance. Recently, several papers have introduced the idea of state and action-dependent control variates and showed that they significantly reduce variance and improve sample efficiency on continuous control tasks. We theoretically and numerically evaluate biases and variances of these policy gradient methods, and show that action-dependent control variates do not appreciably reduce variance in the tested domains. We show that seemingly insignificant implementation details enable these prior methods to achieve good empirical improvements, but at the cost of introducing further bias to the gradient. Our analysis indicates that biased methods tend to improve the performance significantly more than unbiased ones. View details
    Guided evolutionary strategies: augmenting random search with surrogate gradients
    Niru Maheswaranathan
    Luke Metz
    Dami Choi
    Jascha Sohl-dickstein
    ICML (2018)
    Preview abstract Many applications in machine learning require optimizing a function whose true gradient is unknown or computationally expensive, but where surrogate gradient information, directions that may be correlated with the true gradient, is cheaply available. For example, this occurs when an approximate gradient is easier to compute than the full gradient (e.g. in meta-learning or unrolled optimization), or when a true gradient is intractable and is replaced with a surrogate (e.g. in reinforcement learning or training networks with discrete variables). We propose Guided Evolutionary Strategies (GES), a method for optimally using surrogate gradient directions to accelerate random search. GES defines a search distribution for evolutionary strategies that is elongated along a subspace spanned by the surrogate gradients and estimates a descent direction which can then be passed to a first-order optimizer. We analytically and numerically characterize the tradeoffs that result from tuning how strongly the search distribution is stretched along the guiding subspace and use this to derive a setting of the hyperparameters that works well across problems. We evaluate GES on several example problems, demonstrating an improvement over both standard evolutionary strategies and first-order methods that directly follow the surrogate gradient. View details
    Preview abstract We do an empirical comparison of a variety of recent methods for decision making based on deep Bayesian Neural Networks with Thompson Sampling. View details
    Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion
    Jacob Buckman
    Danijar Hafner
    Eugene Brevdo
    Honglak Lee
    NeurIPS (2018)
    Preview abstract Integrating model-free and model-based approaches in reinforcement learning has the potential to achieve the high performance of model-free algorithms with low sample complexity. However, this is difficult because an imperfect dynamics model can degrade the performance of the learning algorithm, and in sufficiently complex environments, the dynamics model will almost always be imperfect. As a result, a key challenge is to combine model-based approaches with model-free learning in such a way that errors in the model do not degrade performance. We propose stochastic ensemble value expansion (STEVE), a novel model-based technique that addresses this issue. By dynamically interpolating between model rollouts of various horizon lengths for each individual example, STEVE ensures that the model is only utilized when doing so does not introduce significant errors. Our approach outperforms model-free baselines on challenging continuous control benchmarks with an order-of-magnitude increase in sample efficiency, and in contrast to previous model-based approaches, performance does not degrade in complex environments. View details