Kuang-Huei Lee
I am a Research Scientist at Google DeepMind, where I primarily focus on artificial agents. My work spans a broad spectrum, from robotics to large language models (LLMs).
https://kuanghuei.github.io
Authored Publications
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Hiroki Furuta
Ofir Nachum
Yutaka Matsuo
Shane Gu
Izzeddin Gur
International Conference on Learning Representations (ICLR) (2024)
Deep RL at Scale: Sorting Waste in Office Buildings with a Fleet of Mobile Manipulators
Jarek Rettinghouse
Daniel Ho
Julian Ibarz
Sangeetha Ramesh
Matt Bennice
Alexander Herzog
Chuyuan Kelly Fu
Adrian Li
Kim Kleiven
Jeff Bingham
Yevgen Chebotar
David Rendleman
Wenlong Lu
Mohi Khansari
Mrinal Kalakrishnan
Ying Xu
Noah Brown
Khem Holden
Justin Vincent
Peter Pastor Sampedro
Jessica Lin
David Dovo
Daniel Kappler
Mengyuan Yan
Sergey Levine
Jessica Lam
Jonathan Weisz
Paul Wohlhart
Karol Hausman
Cameron Lee
Bob Wei
Yao Lu
Abstract
We describe a system for deep reinforcement learning of robotic manipulation skills applied to a large-scale real-world task: sorting recyclables and trash in office buildings. Real-world deployment of deep RL policies requires not only effective training algorithms, but the ability to bootstrap real-world training and enable broad generalization. To this end, our system combines scalable deep RL from real-world data with bootstrapping from training in simulation, and incorporates auxiliary inputs from existing computer vision systems as a way to boost generalization to novel objects, while retaining the benefits of end-to-end training. We analyze the tradeoffs of different design decisions in our system, and present a large-scale empirical validation that includes training on real-world data gathered over the course of 24 months of experimentation, across a fleet of 23 robots in three office buildings, with a total training set of 9527 hours of robotic experience. Our final validation also consists of 4800 evaluation trials across 240 waste station configurations, in order to evaluate in detail the impact of the design decisions in our system, the scaling effects of including more real-world data, and the performance of the method on novel objects.
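One design decision worth highlighting is the fusion of outputs from existing perception systems with raw pixels as policy input. A minimal sketch of that idea (the tensor shapes and the name `detector_masks` are illustrative assumptions, not the system's actual interface):

```python
import torch

def policy_input(rgb, detector_masks):
    """Fuse off-the-shelf perception outputs with raw camera pixels (sketch).

    rgb:            [B, 3, H, W] camera image
    detector_masks: [B, K, H, W] per-class masks from an existing vision system

    The concatenated tensor feeds an end-to-end RL policy, so the agent can
    exploit the detector's hints on novel objects without being limited to them.
    """
    return torch.cat([rgb, detector_masks], dim=1)
```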
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Alexander Herzog
Alexander Toshkov Toshev
Andy Zeng
Anthony Brohan
Brian Andrew Ichter
Byron David
Chelsea Finn
Clayton Tan
Diego Reyes
Dmitry Kalashnikov
Eric Victor Jang
Jarek Liam Rettinghouse
Jornell Lacanlale Quiambao
Julian Ibarz
Karol Hausman
Kyle Alan Jeffrey
Linda Luu
Mengyuan Yan
Michael Soogil Ahn
Nicolas Sievers
Noah Brown
Omar Eduardo Escareno Cortes
Peng Xu
Peter Pastor Sampedro
Rosario Jauregui Ruano
Sally Augusta Jesmonth
Sergey Levine
Steve Xu
Yao Lu
Yevgen Chebotar
Yuheng Kuang
Conference on Robot Learning (CoRL) (2022)
Abstract
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context.
For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment.
We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate.
The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.
We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment.
We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator.
The project's website and video can be found at say-can.github.io.
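Concretely, each candidate skill is scored by combining the language model's estimate of how useful the skill is for the instruction with a learned value function estimating whether the skill can currently succeed. A minimal sketch of that scoring rule, with illustrative names (`llm_skill_logprobs` and `value_functions` are assumptions, not the paper's API):

```python
import math

def select_skill(llm_skill_logprobs, value_functions, state):
    """Combine LLM task-grounding with affordance grounding (sketch).

    llm_skill_logprobs: dict of skill -> log p_LLM(skill | instruction, history)
    value_functions:    dict of skill -> callable returning the skill's value
                        (roughly, its probability of succeeding) in `state`
    """
    scores = {
        skill: math.exp(logp) * value_functions[skill](state)
        for skill, logp in llm_skill_logprobs.items()
    }
    return max(scores, key=scores.get)  # execute this skill, then re-score
```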
Deep Hierarchical Planning from Pixels
Danijar Hafner
Pieter Abbeel
Advances in Neural Information Processing Systems (NeurIPS) (2022)
Abstract
Intelligent agents need to select long sequences of actions to solve complex tasks. While humans easily break down tasks into subgoals and reach them through millions of muscle commands, current artificial intelligence is limited to tasks with horizons of a few hundred decisions, despite large compute budgets. Research on hierarchical reinforcement learning aims to overcome this limitation but has proven challenging: current methods rely on manually specified goal spaces or subtasks, and no general solution exists. We introduce Director, a practical method for learning hierarchical behaviors directly from pixels by planning inside the latent space of a learned world model. The high-level policy maximizes task and exploration rewards by selecting latent goals, and the low-level policy learns to achieve the goals. Despite operating in latent space, the decisions are interpretable because the world model can decode goals into images for visualization. Director outperforms exploration methods on tasks with sparse rewards, including 3D maze traversal with a quadruped robot from an egocentric camera and proprioception, without access to the global position or top-down view that was used by prior work. Director also learns successful behaviors across a wide range of environments, including visual control, Atari games, and DMLab levels.
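The resulting control loop is easy to picture: a manager picks latent goals at a slow timescale while a worker pursues them with primitive actions. A minimal sketch under gym-style conventions (all names and the re-planning interval are illustrative assumptions; the actual method also trains both policies inside the learned world model):

```python
def run_episode(world_model, manager, worker, env, goal_every=8, max_steps=1000):
    """Hierarchical control loop in the spirit of Director (illustrative only)."""
    obs = env.reset()
    z = world_model.encode(obs)              # latent state from the world model
    goal = manager.select_goal(z)            # high-level policy picks a latent goal
    for t in range(max_steps):
        if t % goal_every == 0:
            goal = manager.select_goal(z)    # re-plan goals at a slower timescale
        action = worker.act(z, goal)         # low-level policy chases the goal
        obs, reward, done, info = env.step(action)
        z = world_model.encode(obs)
        if done:
            break
```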
Multi-Game Decision Transformers
Ofir Nachum
Sherry Yang
Daniel Freeman
Winnie Xu
Eric Victor Jang
Henryk Witold Michalewski
Igor Mordatch
Advances in Neural Information Processing Systems (NeurIPS) (2022)
Abstract
A longstanding goal of the field of AI is a strategy for compiling diverse experience into a highly capable, generalist agent. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model - with a single set of weights - trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction. Additional information, videos and code can be seen at: http://sites.google.com/view/multi-game-transformers
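The core modeling idea is to treat trajectories as token sequences of (return-to-go, state, action) triples and train a causal transformer to predict actions. A minimal PyTorch sketch, far smaller than the paper's models (all dimensions, the architecture, and the class name are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TinyDecisionTransformer(nn.Module):
    """Return-conditioned sequence model in the spirit of the paper (sketch).

    Each timestep contributes three tokens (return-to-go, state, action); the
    model predicts each action from all preceding tokens via a causal mask.
    """
    def __init__(self, state_dim=64, n_actions=18, d_model=128):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Embedding(n_actions, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, rtg, states, actions):
        # rtg: [B, T, 1], states: [B, T, state_dim], actions: [B, T] (ints)
        tokens = torch.stack([self.embed_rtg(rtg),
                              self.embed_state(states),
                              self.embed_action(actions)], dim=2)
        tokens = tokens.flatten(1, 2)                        # [B, 3T, d_model]
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(tokens, mask=mask)
        return self.action_head(h[:, 1::3])   # action logits at state positions
```

Training then reduces to cross-entropy between these logits and the logged actions over offline trajectories.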
PI-QT-Opt: Predictive Information Improves Multi-Task Robotic Reinforcement Learning at Scale
Adrian Li
Paul Wohlhart
Yao Lu
Conference on Robot Learning (CoRL) (2022)
Abstract
The predictive information, the mutual information between the past and future, has been shown to be a useful representation learning auxiliary loss for training reinforcement learning agents, as the ability to model what will happen next is critical to success on many control tasks. While existing studies are largely restricted to training specialist agents on single-task settings in simulation, in this work, we study modeling the predictive information for robotic agents and its importance for general-purpose agents that are trained to master a large repertoire of diverse skills from large amounts of data. Specifically, we introduce Predictive Information QT-Opt (PI-QT-Opt), a QT-Opt agent augmented with an auxiliary loss that learns representations of the predictive information to solve up to 297 vision-based robot manipulation tasks in simulation and the real world with a single set of parameters. We demonstrate that modeling the predictive information significantly improves success rates on the training tasks and leads to better zero-shot transfer to unseen novel tasks. Finally, we evaluate PI-QT-Opt on real robots, achieving substantial and consistent improvement over QT-Opt in multiple experimental settings of varying environments, skills, and multi-task configurations.
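One standard way to realize such an auxiliary objective is a contrastive lower bound on the mutual information between past and future. The sketch below is illustrative only and not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def predictive_info_loss(past_embed, future_embed, temperature=0.1):
    """InfoNCE-style bound on I(past; future) (illustrative sketch).

    past_embed:   [B, D] encoding of past observations and actions
    future_embed: [B, D] encoding of the resulting future observation
    Matching rows are positive pairs; other rows in the batch are negatives.
    """
    past = F.normalize(past_embed, dim=-1)
    future = F.normalize(future_embed, dim=-1)
    logits = past @ future.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# total_loss = bellman_loss + pi_weight * predictive_info_loss(past, future)
```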
PI-ARS: Accelerating Evolution-Learned Visual Locomotion with Predictive Information Representations
Ofir Nachum
International Conference on Intelligent Robots and Systems (IROS) (2022)
Abstract
Evolution Strategy (ES) algorithms have shown promising results in training complex robotic control policies due to their massive parallelism capability, simple implementation, effective parameter-space exploration, and fast training time. However, a key limitation of ES is its scalability to large capacity models, including modern neural network architectures. In this work, we develop Predictive Information Augmented Random Search (PI-ARS) to mitigate this limitation by leveraging recent advancements in representation learning to reduce the parameter search space for ES. Namely, PI-ARS combines a gradient-based representation learning technique, Predictive Information (PI), with a gradient-free ES algorithm, Augmented Random Search (ARS), to train policies that can process complex robot sensory inputs and handle highly nonlinear robot dynamics. We evaluate PI-ARS on a set of challenging visual-locomotion tasks where a quadruped robot needs to walk on uneven stepping stones, quincuncial piles, and moving platforms, as well as to complete an indoor navigation task. Across all tasks, PI-ARS demonstrates significantly better learning efficiency and performance compared to the ARS baseline. We further validate our algorithm by demonstrating that the learned policies can successfully transfer to a real quadruped robot, for example, achieving a 100% success rate on the real-world stepping stone environment, dramatically improving on prior results that achieved a 40% success rate.
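For reference, the gradient-free half of the recipe is the basic ARS update; in PI-ARS it searches only over a small policy head on top of features from the separately trained PI encoder. A sketch of one vanilla ARS step (hyperparameters are illustrative, and the paper's variant may differ, e.g. by keeping only the top-performing directions):

```python
import numpy as np

def ars_update(theta, rollout_return, step_size=0.02, noise_std=0.03, n_dirs=16):
    """One Augmented Random Search step on policy-head parameters (sketch).

    rollout_return(theta) -> scalar episodic return. Because the PI encoder is
    trained with gradients, ES only has to search the small head `theta`.
    """
    deltas = np.random.randn(n_dirs, *theta.shape)
    r_plus = np.array([rollout_return(theta + noise_std * d) for d in deltas])
    r_minus = np.array([rollout_return(theta - noise_std * d) for d in deltas])
    sigma_r = np.concatenate([r_plus, r_minus]).std() + 1e-8   # reward scaling
    grad = ((r_plus - r_minus)[:, None] * deltas.reshape(n_dirs, -1)).mean(0)
    return theta + step_size / sigma_r * grad.reshape(theta.shape)
```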
Compressive Visual Representations
Anurag Arnab
John Canny
Advances in Neural Information Processing Systems (NeurIPS) (2021)
Abstract
Learning effective visual representations that generalize well without human supervision is a fundamental problem for applying machine learning to a wide variety of tasks. Recently, two families of self-supervised methods, contrastive learning and latent bootstrapping, exemplified by SimCLR and BYOL respectively, have made significant progress. In this work, we hypothesize that adding explicit information compression to these algorithms yields better and more robust representations. We verify this by developing SimCLR and BYOL formulations compatible with the Conditional Entropy Bottleneck (CEB) objective, allowing us to both measure and control the amount of compression in the learned representation, and observe their impact on downstream tasks. Furthermore, we explore the relationship between Lipschitz continuity and compression, showing a tractable lower bound on the Lipschitz constant of the encoders we learn. As Lipschitz continuity is closely related to robustness, this provides a new explanation for why compressed models are more robust. Our experiments confirm that adding compression to SimCLR and BYOL significantly improves linear evaluation accuracies and model robustness across a wide range of domain shifts. In particular, the compressed version of BYOL achieves 76.0% Top-1 linear evaluation accuracy on ImageNet with ResNet-50, and 78.8% with ResNet-50 2x.
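The compression term itself can be sketched compactly. With Gaussian forward and backward encoders e(z|x) and b(z|y) over two views x and y, CEB penalizes the sampled residual information log e(z|x) - log b(z|y); the snippet below is a simplified illustration of that term, not the paper's full objective:

```python
import torch.distributions as D

def ceb_residual_info(mu_e, sigma_e, mu_b, sigma_b, z):
    """Sampled residual information log e(z|x) - log b(z|y) (simplified sketch).

    e(z|x) = Normal(mu_e, sigma_e): forward encoder of view x (z sampled here)
    b(z|y) = Normal(mu_b, sigma_b): backward encoder of the other view y
    Driving this toward zero compresses away whatever in z is not predictive
    of the other view.
    """
    log_e = D.Normal(mu_e, sigma_e).log_prob(z).sum(-1)
    log_b = D.Normal(mu_b, sigma_b).log_prob(z).sum(-1)
    return (log_e - log_b).mean()

# total_loss = simclr_or_byol_loss + compression_weight * ceb_residual_info(...)
```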
Learning Task Sampling Policy for Multitask Learning
Dhanasekar Sundararaman
Henry Tsai
Iulia Raluca Turc
Lawrence Carin
Findings of EMNLP (2021)
Abstract
It has been shown that training multi-task models with auxiliary tasks can improve the target task's quality through cross-task transfer. However, the importance of each auxiliary task to the primary task is likely not known a priori. While the importance weights of auxiliary tasks can be manually tuned, this becomes practically infeasible as the number of tasks scales up. To address this, we propose a search method that automatically assigns importance weights. We formulate it as a reinforcement learning problem and learn a task sampling schedule based on the evaluation accuracy of the multi-task model. Our empirical evaluation on XNLI and GLUE shows that our method outperforms uniform sampling and the corresponding single-task baseline.
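A minimal version of such a learned sampler is a REINFORCE-style update on task-sampling logits, rewarded by the change in target-task validation accuracy. All names and the learning rate below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def update_sampling_logits(logits, task_idx, reward, lr=0.1):
    """One REINFORCE step on a softmax task-sampling policy (sketch).

    task_idx: the auxiliary task that was just sampled and trained on
    reward:   e.g. the change in target-task validation accuracy
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -probs
    grad[task_idx] += 1.0                   # d log pi(task_idx) / d logits
    return logits + lr * reward * grad

# loop: task ~ softmax(logits); train one batch on it; evaluate; update logits.
```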
An Empirical Investigation of Representation Learning for Imitation
Xin Chen
Sam Toyer
Cody Wild
Scott Emmons
Satyan Alex
Steven Wang
Ping Luo
Stuart Russell
Pieter Abbeel
Rohin Shah
Advances in Neural Information Processing Systems (NeurIPS) (2021)
Abstract
Imitation learning often needs a large demonstration set in order to handle the full range of situations that an agent might find itself in during deployment. However, collecting expert demonstrations can be expensive. Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the need for large amounts of expensive, task-specific data. Our Empirical Investigation of Representation Learning for Imitation (EIRLI) investigates whether similar benefits apply to imitation learning. We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation across several environment suites. In the settings we evaluate, we find that existing algorithms for image-based representation learning provide limited value relative to a well-tuned baseline with image augmentations. To explain this result, we investigate differences between imitation learning and other settings where representation learning has provided significant benefit, such as image classification. Finally, we release a well-documented codebase which both replicates our findings and provides a modular framework for creating new representation learning algorithms out of reusable components.
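The "well-tuned baseline with image augmentations" that proved hard to beat is, in essence, behavioral cloning on augmented observations. A minimal sketch (the augmentation choices and the `policy` interface are illustrative, not the paper's tuned configuration):

```python
import torch
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(84, scale=(0.8, 1.0)),   # illustrative augmentations
    T.ColorJitter(brightness=0.2, contrast=0.2),
])

def bc_loss(policy, images, expert_actions):
    """Behavioral cloning on augmented image observations (sketch)."""
    logits = policy(augment(images))             # policy: images -> action logits
    return torch.nn.functional.cross_entropy(logits, expert_actions)
```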