Sergio Guadarrama
Authored Publications
Multi-Game Decision Transformers
Ofir Nachum
Sherry Yang
Daniel Freeman
Winnie Xu
Eric Victor Jang
Henryk Witold Michalewski
Igor Mordatch
Advances in Neural Information Processing Systems (NeurIPS) (2022)
A longstanding goal of the field of AI is a strategy for compiling diverse experience into a highly capable, generalist agent. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model - with a single set of weights - trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction. Additional information, videos and code can be seen at: http://sites.google.com/view/multi-game-transformers
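For readers who want the shape of the idea, here is a minimal sketch of the return-conditioned sequence-modeling recipe that Decision Transformers use; the tokenization, the stub model, and the 18-action head are illustrative assumptions of this sketch, not the released implementation (see the linked site for that).

```python
import numpy as np

def interleave_trajectory(returns_to_go, observations, actions):
    """Flatten (return-to-go, observation, action) triples into the single
    token stream a decision-transformer-style model consumes:
    R_1, s_1, a_1, R_2, s_2, a_2, ..."""
    tokens = []
    for r, s, a in zip(returns_to_go, observations, actions):
        tokens.extend([("return", r), ("obs", s), ("action", a)])
    return tokens

def act(model, history, target_return, num_actions=18):
    """Condition the model on a high desired return and decode the next
    action greedily; 18 is the full Atari action set."""
    sequence = interleave_trajectory(*history) + [("return", target_return)]
    logits = model(sequence)          # stub for the trained transformer
    return int(np.argmax(logits[:num_actions]))

# Toy usage with a random stub standing in for the trained model.
stub = lambda seq: np.random.randn(18)
next_action = act(stub, ([1.0], [np.zeros(4)], [0]), target_return=0.9)
```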
PI-ARS: Accelerating Evolution-Learned Visual Locomotion with Predictive Information Representations
Ofir Nachum
International Conference on Intelligent Robots and Systems (IROS) (2022)
Evolution Strategy (ES) algorithms have shown promising results in training complex robotic control policies due to their massive parallelism capability, simple implementation, effective parameter-space exploration, and fast training time. However, a key limitation of ES is its scalability to large-capacity models, including modern neural network architectures. In this work, we develop Predictive Information Augmented Random Search (PI-ARS) to mitigate this limitation by leveraging recent advancements in representation learning to reduce the parameter search space for ES. Namely, PI-ARS combines a gradient-based representation learning technique, Predictive Information (PI), with a gradient-free ES algorithm, Augmented Random Search (ARS), to train policies that can process complex robot sensory inputs and handle highly nonlinear robot dynamics. We evaluate PI-ARS on a set of challenging visual-locomotion tasks where a quadruped robot needs to walk on uneven stepping stones, quincuncial piles, and moving platforms, as well as complete an indoor navigation task. Across all tasks, PI-ARS demonstrates significantly better learning efficiency and performance compared to the ARS baseline. We further validate our algorithm by demonstrating that the learned policies transfer successfully to a real quadruped robot, for example achieving a 100% success rate on the real-world stepping-stone environment, dramatically improving on the prior result of 40% success.
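Since ARS is the gradient-free optimizer that PI-ARS builds on, a minimal sketch of one ARS update (Mania et al., 2018) may help; the representation-learning (PI) component is omitted here and the hyperparameter values are illustrative.

```python
import numpy as np

def ars_step(theta, rollout_reward, step_size=0.02, noise=0.03,
             num_directions=8, top_b=4):
    """One Augmented Random Search update (Mania et al., 2018), the
    gradient-free optimizer that PI-ARS builds on. `rollout_reward`
    runs the policy with the given parameters and returns an episodic
    reward; hyperparameter values here are illustrative."""
    deltas = [np.random.randn(*theta.shape) for _ in range(num_directions)]
    plus = np.array([rollout_reward(theta + noise * d) for d in deltas])
    minus = np.array([rollout_reward(theta - noise * d) for d in deltas])
    # Keep the top_b directions whose best-case reward is largest.
    order = np.argsort(-np.maximum(plus, minus))[:top_b]
    sigma = np.concatenate([plus[order], minus[order]]).std() + 1e-8
    step = sum((plus[i] - minus[i]) * deltas[i] for i in order)
    return theta + step_size / (top_b * sigma) * step

# Toy usage: "reward" is higher the closer the parameters are to ones.
theta = np.zeros(4)
for _ in range(100):
    theta = ars_step(theta, lambda p: -np.sum((p - 1.0) ** 2))
```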
Compressive Visual Representations
Anurag Arnab
John Canny
Advances in Neural Information Processing Systems (NeurIPS) (2021)
Learning effective visual representations that generalize well without human supervision is a fundamental problem for applying machine learning to a wide variety of tasks. Recently, two families of self-supervised methods, contrastive learning and latent bootstrapping, exemplified by SimCLR and BYOL respectively, have made significant progress. In this work, we hypothesize that adding explicit information compression to these algorithms yields better and more robust representations. We verify this by developing SimCLR and BYOL formulations compatible with the Conditional Entropy Bottleneck (CEB) objective, allowing us to both measure and control the amount of compression in the learned representation and observe its impact on downstream tasks. Furthermore, we explore the relationship between Lipschitz continuity and compression, showing a tractable lower bound on the Lipschitz constant of the encoders we learn. As Lipschitz continuity is closely related to robustness, this provides a new explanation for why compressed models are more robust. Our experiments confirm that adding compression to SimCLR and BYOL significantly improves linear evaluation accuracies and model robustness across a wide range of domain shifts. In particular, the compressed version of BYOL achieves 76.0% Top-1 linear evaluation accuracy on ImageNet with ResNet-50, and 78.8% with ResNet-50 2x.
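As a rough sketch of what "measuring and controlling compression" can look like, here is one way to compute a CEB-style compression term with a diagonal Gaussian forward encoder; the unit-variance backward encoder is a simplifying assumption of this sketch, not necessarily the paper's parameterization.

```python
import numpy as np

def ceb_compression_term(z_mu, z_logvar, b_mu):
    """CEB-style residual log e(z|x) - log b(z|y) with a diagonal Gaussian
    forward encoder e and a unit-variance backward encoder b (the latter
    is a simplification for this sketch). The shared log(2*pi) constants
    cancel in the difference. Scaling this term and adding it to the
    contrastive loss controls how compressed the representation is."""
    z = z_mu + np.exp(0.5 * z_logvar) * np.random.randn(*z_mu.shape)
    log_e = -0.5 * (((z - z_mu) ** 2) / np.exp(z_logvar) + z_logvar).sum(-1)
    log_b = -0.5 * ((z - b_mu) ** 2).sum(-1)
    return (log_e - log_b).mean()
```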
Predictive Information Accelerates Learning in RL
Anthony Liu
Yijie Guo
Honglak Lee
John Canny
Advances in Neural Information Processing Systems (2020), pp. 11890-11901
The Predictive Information is the mutual information between the past and the future, I(X_past; X_future). We hypothesize that capturing the predictive information is useful in RL, since the ability to model what will happen next is necessary for success on many tasks. To test our hypothesis, we train Soft Actor-Critic (SAC) agents from pixels with an auxiliary task that learns a compressed representation of the predictive information of the RL environment dynamics using a contrastive version of the Conditional Entropy Bottleneck (CEB) objective. We refer to these as Predictive Information SAC (PI-SAC) agents. We show that PI-SAC agents can substantially improve sample efficiency over challenging baselines on tasks from the DM Control suite of continuous control environments. We evaluate PI-SAC agents by comparing against uncompressed PI-SAC agents, other compressed and uncompressed agents, and SAC agents directly trained from pixels. Our implementation is given on GitHub.
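A minimal sketch of a contrastive (InfoNCE-style) bound of the kind used to learn a representation of the predictive information; the dot-product score and in-batch negatives are assumptions of this sketch, not necessarily the paper's exact objective.

```python
import numpy as np

def predictive_info_nce(past_z, future_z):
    """InfoNCE-style contrastive bound relating past and future
    embeddings: each row's positive is the matching future on the
    diagonal, and all other rows in the batch serve as negatives."""
    logits = past_z @ future_z.T                          # [B, B] similarities
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -np.diag(log_probs).mean()                     # loss to minimize
```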
The Devil is in the Decoder: Classification, Regression and GANs
Zbigniew Wojna
Vittorio Ferrari
Nathan Silberman
Liang-chieh Chen
International Journal of Computer Vision (IJCV) (2019)
Many machine vision applications require predictions for every pixel of the input image (for example, semantic segmentation and boundary detection). Models for such problems usually consist of encoders, which decrease spatial resolution while learning a high-dimensional representation, followed by decoders, which recover the original input resolution and produce low-dimensional predictions. While encoders have been studied rigorously, relatively few studies address the decoder side. This paper therefore presents an extensive comparison of a variety of decoders on a variety of pixel-wise tasks ranging from classification and regression to synthesis. Our contributions are: (1) Decoders matter: we observe significant variance in results between different types of decoders on various problems. (2) We introduce new residual-like connections for decoders. (3) We introduce a novel decoder: bilinear additive upsampling. (4) We explore prediction artefacts.
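As I read the paper, bilinear additive upsampling bilinearly upsamples each channel and then merges groups of consecutive channels, keeping the step parameter-free and residual-friendly; the sketch below assumes group averaging (a sum differs only by a constant factor) and a channel count divisible by the group size.

```python
import numpy as np

def bilinear_additive_upsampling(x, scale=2, group=4):
    """Bilinearly upsample each channel, then merge every `group`
    consecutive channels. x: [H, W, C] with C divisible by `group`;
    returns [H*scale, W*scale, C // group]. Parameter-free, so the
    output can be combined with residual-like connections."""
    h, w, c = x.shape
    ys = np.linspace(0.0, h - 1.0, h * scale)
    xs = np.linspace(0.0, w - 1.0, w * scale)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    # Standard bilinear interpolation from the four surrounding pixels.
    up = ((1 - wy) * (1 - wx) * x[y0][:, x0] + (1 - wy) * wx * x[y0][:, x1]
          + wy * (1 - wx) * x[y1][:, x0] + wy * wx * x[y1][:, x1])
    # Average groups of consecutive channels to reduce depth.
    return up.reshape(h * scale, w * scale, c // group, group).mean(-1)
```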
From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following
Anoop Korattikara
Sergey Levine
International Conference on Learning Representations (ICLR) (2019)
Reinforcement learning is a promising framework for solving control problems, but its use in practical situations is hampered by the fact that reward functions are often difficult to engineer. Specifying goals and tasks for autonomous machines, such as robots, is a significant challenge: conventionally, reward functions and goal states have been used to communicate objectives. But people can communicate objectives to each other simply by describing or demonstrating them. How can we build learning algorithms that will allow us to tell machines what we want them to do? In this work, we investigate the problem of grounding language commands as reward functions using inverse reinforcement learning, and argue that language-conditioned rewards are more transferable than language-conditioned policies to new environments. We propose language-conditioned reward learning (LC-RL), which grounds language commands as a reward function represented by a deep neural network. We demonstrate that our model learns rewards that transfer to novel tasks and environments in realistic, high-dimensional visual environments with natural language commands, whereas directly learning a language-conditioned policy leads to poor performance.
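For concreteness, the object LC-RL learns is a scalar reward function r(s, l) over an observation and a language command. The toy network below is hypothetical (the names, sizes, and two-layer architecture are illustrative, not the paper's), meant only to make that object's shape concrete.

```python
import numpy as np

def language_conditioned_reward(obs_features, command_embedding, params):
    """A learned scalar reward r(s, l) over an observation and a language
    command. This two-layer network and its parameter names are
    hypothetical; once trained via inverse RL, such a reward can be
    reused with standard RL in a new environment."""
    x = np.concatenate([obs_features, command_embedding])
    h = np.maximum(0.0, params["W1"] @ x + params["b1"])  # hidden ReLU layer
    return float(params["W2"] @ h + params["b2"])         # scalar reward
```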
Tracking Emerges by Colorizing Videos
European Conference on Computer Vision (ECCV) (2018)
We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision. We leverage the natural temporal coherency of color to create a model that learns to colorize gray-scale videos by copying colors from a reference frame. Quantitative and qualitative experiments suggest that this task causes the model to automatically learn to track visual regions. Although the model is trained without any ground-truth labels, our method learns to track well enough to outperform optical flow based methods. Finally, our results suggest that failures to track are correlated with failures to colorize, indicating that advancing video colorization may further improve self-supervised visual tracking.
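The copy-from-reference mechanism can be sketched in a few lines: attend from each target-frame pixel's embedding to the reference frame and copy colors through the attention weights; the dot-product similarity and temperature below are assumptions of this sketch.

```python
import numpy as np

def copy_colors(ref_embed, tgt_embed, ref_colors, temperature=0.5):
    """Colorize a gray-scale target frame by attending from each target
    pixel's embedding to reference-frame embeddings and copying colors
    through the attention weights; the same pointer doubles as a tracker.
    ref_embed, tgt_embed: [N, D] per-pixel embeddings; ref_colors: [N, 3]."""
    logits = tgt_embed @ ref_embed.T / temperature        # [N_tgt, N_ref]
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ ref_colors                              # [N_tgt, 3]
```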
PixColor: Pixel Recursive Colorization
Ryan Dahl
Mohammad Norouzi
Jonathon Shlens
Proceedings of the 28th British Machine Vision Conference (BMVC) (2017)
We propose a novel approach to automatically produce multiple colorized versions of a grayscale image. Our method results from the observation that the task of automated colorization is relatively easy given a low-resolution version of the color image. We first train a conditional PixelCNN to generate a low-resolution color image for a given grayscale image. Then, given the generated low-resolution color image and the original grayscale image as inputs, we train a second CNN to generate a high-resolution colorization of the image. We demonstrate that our approach produces more diverse and plausible colorizations than existing methods, as judged by human raters in a "Visual Turing Test".
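The two-stage structure described above, in sketch form; the function names are placeholders standing in for trained models, not the paper's API.

```python
def pixcolor_pipeline(gray, sample_lowres_color, refine, num_samples=3):
    """Two-stage colorization: (1) a conditional PixelCNN samples a
    low-resolution color image for the grayscale input; (2) a second
    CNN refines it to a full-resolution colorization. Sampling several
    low-resolution drafts yields diverse colorizations."""
    drafts = [sample_lowres_color(gray) for _ in range(num_samples)]
    return [refine(gray, low_res) for low_res in drafts]
```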
Speed and accuracy trade-offs for modern convolutional object detectors
Anoop Korattikara
Jonathan Huang
Menglong Zhu
Vivek Rathod
Zbigniew Wojna
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii (2017)
The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN (Ren et al., 2015), R-FCN (Dai et al., 2016) and SSD (Liu et al., 2015) systems, which we view as "meta-architectures", and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that runs at over 50 frames per second and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
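The comparison methodology amounts to running every configuration under one harness and recording a latency/accuracy point per configuration; a hypothetical sketch, with `evaluate_map` standing in for a COCO-style mAP evaluation.

```python
import time

def speed_accuracy_points(detectors, images, evaluate_map):
    """Run each (meta-architecture, feature extractor, image size)
    configuration under the same harness and record an (average latency,
    accuracy) point; plotting the points traces the trade-off curve.
    `detectors` maps a config name to a detection callable."""
    points = []
    for name, detect in detectors.items():
        start = time.perf_counter()
        predictions = [detect(image) for image in images]
        latency = (time.perf_counter() - start) / len(images)
        points.append((name, latency, evaluate_map(predictions)))
    return points
```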