Sergio Guadarrama

Sergio Guadarrama

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Multi-Game Decision Transformers
    Ofir Nachum
    Sherry Yang
    Daniel Freeman
    Winnie Xu
    Eric Victor Jang
    Henryk Witold Michalewski
    Igor Mordatch
    Advances in Neural Information Processing Systems (NeurIPS) (2022)
    Preview abstract A longstanding goal of the field of AI is a strategy for compiling diverse experience into a highly capable, generalist agent. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model - with a single set of weights - trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction. Additional information, videos and code can be seen at: http://sites.google.com/view/multi-game-transformers View details
    Preview abstract Evolution Strategy (ES) algorithms have shown promising results in training complex robotic control policies due to their massive parallelism capability, simple implementation, effective parameter-space exploration, and fast training time. However, a key limitation of ES is its scalability to large capacity models, including modern neural network architectures. In this work, we develop Predictive Information Augmented Random Search (PI-ARS) to mitigate this limitation by leveraging recent advancements in representation learning to reduce the parameter search space for ES. Namely, PI-ARS combines a gradient-based representation learning technique, Predictive Information (PI), with a gradient-free ES algorithm, Augmented Random Search (ARS), to train policies that can process complex robot sensory inputs and handle highly nonlinear robot dynamics. We evaluate PI-ARS on a set of challenging visual-locomotion tasks where a quadruped robot needs to walk on uneven stepping stones, quincuncial piles, and moving platforms, as well as to complete an indoor navigation task. Across all tasks, PI-ARS demonstrates significantly better learning efficiency and performance compared to the ARS baseline. We further validate our algorithm by demonstrating that the learned policies can successfully transfer to a real quadruped robot, for example, achieving a 100% success rate on the real-world stepping stone environment, dramatically improving prior results achieving 40% success. View details
    Compressive Visual Representations
    Anurag Arnab
    John Canny
    Advances in Neural Information Processing Systems (NeurIPS) (2021)
    Preview abstract Learning effective visual representations that generalize well without human supervision is a fundamental problem in order to apply Machine Learning to a wide variety of tasks. Recently, two families of self-supervised methods, contrastive learning and latent bootstrapping, exemplified by SimCLR and BYOL respectively, have made significant progress. In this work, we hypothesize that adding explicit information compression to these algorithms yields better and more robust representations. We verify this by developing SimCLR and BYOL formulations compatible with the Conditional Entropy Bottleneck (CEB) objective, allowing us to both measure and control the amount of compression in the learned representation, and observe their impact on downstream tasks. Furthermore, we explore the relationship between Lipschitz continuity and compression, showing a tractable lower bound on the Lipschitz constant of the encoders we learn. As Lipschitz continuity is closely related to robustness, this provides a new explanation for why compressed models are more robust. Our experiments confirm that adding compression to SimCLR and BYOL significantly improves linear evaluation accuracies and model robustness across a wide range of domain shifts. In particular, the compressed version of BYOL achieves 76.0% Top-1 linear evaluation accuracy on ImageNet with ResNet-50, and 78.8% with ResNet-50 2x. View details
    Predictive Information Accelerates Learning in RL
    Anthony Liu
    Yijie Guo
    Honglak Lee
    John Canny
    Advances in Neural Information Processing Systems (2020), pp. 11890-11901
    Preview abstract The Predictive Information is the mutual information between the past and the future, I(X_past; X_future). We hypothesize that capturing the predictive information is useful in RL, since the ability to model what will happen next is necessary for success on many tasks. To test our hypothesis, we train Soft Actor-Critic (SAC) agents from pixels with an auxiliary task that learns a compressed representation of the predictive information of the RL environment dynamics using a contrastive version of the Conditional Entropy Bottleneck (CEB) objective. We refer to these as Predictive Information SAC (PI-SAC) agents. We show that PI-SAC agents can substantially improve sample efficiency over challenging baselines on tasks from the DM Control suite of continuous control environments. We evaluate PI-SAC agents by comparing against uncompressed PI-SAC agents, other compressed and uncompressed agents, and SAC agents directly trained from pixels. Our implementation is given on GitHub. View details
    The Devil is in the Decoder: Classification, Regression and GANs
    Zbigniew Wojna
    Vittorio Ferrari
    Nathan Silberman
    Liang-chieh Chen
    IJCV (2019) (to appear)
    Preview abstract Many machine vision applications require predictions for every pixel of the input image (for exam- ple semantic segmentation, boundary detection). Mod- els for such problems usually consist of encoders which decreases spatial resolution while learning a high-di- mensional representation, followed by decoders who re- cover the original input resolution and result in low- dimensional predictions. While encoders have been stud- ied rigorously, relatively few studies address the decoder side. Therefore this paper presents an extensive com- parison of a variety of decoders for a variety of pixel- wise tasks ranging from classification, regression to syn- thesis. Our contributions are: (1) Decoders matter: we observe significant variance in results between different types of decoders on various problems. (2) We introduce new residual-like connections for decoders. (3) We in- troduce a novel decoder: bilinear additive upsampling. (4) We explore prediction artefacts. View details
    From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following
    Anoop Korattikara
    Sergey Levine
    International Conference on Learning Representations (ICLR) (2019)
    Preview abstract Reinforcement learning is a promising framework for solving control problems, but its use in practical situations is hampered by the fact that reward functions are often difficult to engineer. Specifying goals and tasks for autonomous machines, such as robots, is a significant challenge: conventionally, reward functions and goal states have been used to communicate objectives. But people can communicate objectives to each other simply by describing or demonstrating them. How can we build learning algorithms that will allow us to tell machines what we want them to do? In this work, we investigate the problem of grounding language commands as reward functions using inverse reinforcement learning, and argue that language-conditioned rewards are more transferable than language-conditioned policies to new environments. We propose language-conditioned reward learning (LC-RL), which grounds language commands as a reward function represented by a deep neural network. We demonstrate that our model learns rewards that transfer to novel tasks and environments on realistic, high-dimensional visual environments with natural language commands, whereas directly learning a languageconditioned policy leads to poor performance. View details
    Preview abstract We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision. We leverage the natural temporal coherency of color to create a model that learns to colorize gray-scale videos by copying colors from a reference frame. Quantitative and qualitative experiments suggest that this task causes the model to automatically learn to track visual regions. Although the model is trained without any ground-truth labels, our method learns to track well enough to outperform optical flow based methods. Finally, our results suggest that failures to track are correlated with failures to colorize, indicating that advancing video colorization may further improve self-supervised visual tracking. View details
    PixColor: Pixel Recursive Colorization
    Ryan Dahl
    Mohammad Norouzi
    Jonathon Shlens
    Proceedings of the 28th British Machine Vision Conference (BMVC) (2017)
    Preview abstract We propose a novel approach to automatically produce multiple colorized versions of a grayscale image. Our method results from the observation that the task of automated colorization is relatively easy given a low-resolution version of the color image. We first train a conditional PixelCNN to generate a low resolution color for a given grayscale image. Then, given the generated low-resolution color image and the original grayscale image as inputs, we train a second CNN to generate a high-resolution colorization of an image. We demonstrate that our approach produces more diverse and plausible colorizations than existing methods, as judged by human raters in a "Visual Turing Test". View details
    Preview abstract Current image captioning methods are usually trained via (penalized) maximum likelihood estimation. However, the log-likelihood score of a caption does not correlate well with human assessments of quality. Standard syntactic evaluation metrics, such BLEU, METEOR and ROUGE, are also not well correlated. The SPICE and CIDEr metrics are better correlated, but have traditionally been hard to optimize for. In this paper, we show how to use a policy gradient (PG) algorithm to directly optimize a combination of SPICE and CIDEr (a combination we call SPIDEr): the SPICE score ensures our captions are semantically faithful to the image, and the CIDEr score ensures our captions are syntactically fluent. The PG algorithm we propose improves on the prior MIXER approach, by using Monte Carlo rollouts instead of mixing ML training with PG. We show empirically that our algorithm leads to improved results compared to MIXER. Finally, we shoow that using our PG algorithm to optimize the novel SPIDEr metric results in image captions that are strongly preferred by human raters compared to captions generated by the same model but trained using different objective functions. View details
    Preview abstract The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN~\cite{ren2015faster}, R-FCN~\cite{dai2016r} and SSD~\cite{liu2015ssd} systems, which we view as ``meta-architectures'' and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that runs at over 50 frames per second and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task. View details