Vincent Vanhoucke
Vincent Vanhoucke is a Distinguished Scientist and Senior Director for Robotics at Google DeepMind. Prior to that, he led Google Brain's vision and perception research, and the speech recognition quality team for Google Search by Voice. He holds a Ph.D. in Electrical Engineering from Stanford University and a Diplôme d'Ingénieur from the Ecole Centrale Paris.
Authored Publications
Robotic Table Tennis: A Case Study into a High Speed Learning System
Jon Abelian
Saminda Abeyruwan
Michael Ahn
Justin Boyd
Erwin Johan Coumans
Omar Escareno
Wenbo Gao
Navdeep Jaitly
Juhana Kangaspunta
Satoshi Kataoka
Gus Kouretas
Yuheng Kuang
Corey Lynch
Thinh Nguyen
Ken Oslund
Barney J. Reed
Anish Shankar
Avi Singh
Grace Vesom
Peng Xu
Robotics: Science and Systems (2023)
We present a deep-dive into a learning robotic system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized and novel perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real world and also train policies for zero-shot transfer, and automated real-world environment resets that enable autonomous training and evaluation on physical robots. We complement a complete system description, including numerous design decisions that are typically not widely disseminated, with a collection of ablation studies that clarify the importance of mitigating various sources of latency, accounting for training and deployment distribution shifts, robustness of the perception system, and sensitivity to policy hyper-parameters and choice of action space. A video demonstrating the components of our system and details of experimental results is included in the supplementary material.
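The ablations above single out latency as a dominant factor in a high-speed learning system. As a minimal illustration of one common mitigation, the sketch below forward-predicts the ball state by an assumed end-to-end latency budget before it reaches the policy; the latency values and the simple ballistic model (no air drag) are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

# Hypothetical latency budget (seconds): perception + policy inference + actuation.
PERCEPTION_LATENCY = 0.004
POLICY_LATENCY = 0.002
ACTUATION_LATENCY = 0.004
GRAVITY = np.array([0.0, 0.0, -9.81])

def compensate_latency(ball_position, ball_velocity, latency):
    """Extrapolate the ball state forward by the pipeline latency with a
    simple ballistic model, so the policy acts on where the ball will be."""
    predicted_position = ball_position + ball_velocity * latency + 0.5 * GRAVITY * latency**2
    predicted_velocity = ball_velocity + GRAVITY * latency
    return predicted_position, predicted_velocity

# Usage: feed the *predicted* state, not the stale observation, to the policy.
pos, vel = np.array([1.2, 0.1, 0.3]), np.array([-4.0, 0.0, 1.5])
total_latency = PERCEPTION_LATENCY + POLICY_LATENCY + ACTUATION_LATENCY
print(compensate_latency(pos, vel, total_latency))
```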
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Andy Zeng
Brian Ichter
Stefan Welker
Aveek Purohit
Michael Ryoo
Pete Florence
arXiv (2022)
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot, i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning. Prototypes are available at socraticmodels.github.io.
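As a rough illustration of the prompting-based composition described above, the sketch below chains a captioning VLM and an LM for visual question answering. `vlm_caption` and `lm_complete` are hypothetical stand-ins (here returning canned strings) for whatever pretrained models are available; they are not APIs from the paper.

```python
def vlm_caption(image) -> str:
    """Hypothetical stand-in for a visual-language model (e.g. an image captioner)."""
    return "a person chopping vegetables on a wooden cutting board"

def lm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a large language model completion call."""
    return "They appear to be preparing a salad."

def socratic_vqa(image, question: str) -> str:
    # The VLM translates the image into language; the LM then reasons over that
    # description together with the question -- no finetuning of either model.
    caption = vlm_caption(image)
    prompt = (
        f"I am looking at an image. Description: {caption}\n"
        f"Question: {question}\n"
        f"Answer:"
    )
    return lm_complete(prompt)

print(socratic_vqa(image=None, question="What is this person making?"))
```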
Learning to Fold Real Garments with One Arm: A Case Study in Cloud-Based Robotics Research
Ryan Hoque
Kaushik Shivakumar
Shrey Aeron
Gabriel Deza
Aditya Ganapathi
Andy Zeng
Ken Goldberg
IEEE International Conference on Intelligent Robots and Systems (IROS) (2022) (to appear)
Autonomous fabric manipulation is a longstanding challenge in robotics, but evaluating progress is difficult due to the cost and diversity of robot hardware. Using Reach, a new cloud robotics platform that enables low-latency remote execution of control policies on physical robots, we present the first systematic benchmarking of fabric manipulation algorithms on physical hardware. We develop 4 novel learning-based algorithms that model expert actions, keypoints, reward functions, and dynamic motions, and we compare these against 4 learning-free and inverse dynamics algorithms on the task of folding a crumpled T-shirt with a single robot arm. The entire lifecycle of data collection, model training, and policy evaluation is performed remotely without physical access to the robot workcell. Results suggest a new algorithm combining imitation learning with analytic methods achieves 84% of human-level performance on the folding task.
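One plausible shape of a "learned model plus analytic rule" combination like the one mentioned above is sketched below: a stubbed keypoint detector localizes T-shirt landmarks, and a hand-written rule turns them into a pick-and-place fold action. The landmark names, pixel values, and folding rule are illustrative assumptions, not the benchmarked algorithm.

```python
import numpy as np

def predict_keypoints(rgb_image) -> dict:
    """Stand-in for a learned keypoint detector returning pixel coordinates of
    semantic T-shirt landmarks; fixed values are used here for illustration."""
    return {
        "left_sleeve": np.array([120.0, 200.0]),
        "right_sleeve": np.array([420.0, 210.0]),
        "collar": np.array([270.0, 150.0]),
        "bottom_left": np.array([150.0, 430.0]),
        "bottom_right": np.array([400.0, 440.0]),
    }

def sleeve_fold_action(keypoints: dict):
    """Analytic rule: pick the left sleeve, place it at the centroid of the landmarks."""
    pick = keypoints["left_sleeve"]
    place = np.mean(list(keypoints.values()), axis=0)
    return pick, place

pick_px, place_px = sleeve_fold_action(predict_keypoints(rgb_image=None))
print("pick at", pick_px, "place at", place_px)
```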
Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items
Anthony G. Francis
Brandon Kinman
Laura Downs
Nathan Koenig
Ryan M. Hickman
Thomas B. McHugh
(2022)
Interactive 3D simulations have enabled breakthroughs in robotics and computer vision, but simulating the broad diversity of environments needed for deep learning requires large corpora of photo-realistic 3D object models. To address this need, we present Google Scanned Objects, an open-source collection of over one thousand 3D-scanned household items; these models are preprocessed for use in the Ignition Gazebo and Bullet simulation platforms, but are easily adaptable to other simulators.
We describe our object scanning and curation pipeline, then provide statistics about the contents of the dataset and its usage. We hope that the diversity, quality, and flexibility that Google Scanned Objects provides will lead to further advances in interactive simulation, synthetic perception, and robotic learning.
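For context, a minimal sketch of dropping one scanned model into a Bullet simulation via pybullet is shown below. The mesh path and mass are placeholders: the dataset ships per-object meshes and SDF descriptions, and the actual path depends on where the models are downloaded and extracted.

```python
import pybullet as p

p.connect(p.DIRECT)          # headless physics server
p.setGravity(0, 0, -9.81)

mesh_path = "google_scanned_objects/Mug/meshes/model.obj"  # hypothetical path
collision = p.createCollisionShape(p.GEOM_MESH, fileName=mesh_path)
visual = p.createVisualShape(p.GEOM_MESH, fileName=mesh_path)
body = p.createMultiBody(
    baseMass=0.2,                       # assumed mass; not part of the scan metadata
    baseCollisionShapeIndex=collision,
    baseVisualShapeIndex=visual,
    basePosition=[0, 0, 0.1],
)

for _ in range(240):                    # one simulated second at the default 240 Hz step
    p.stepSimulation()
```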
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Alexander Herzog
Alexander Toshkov Toshev
Andy Zeng
Anthony Brohan
Brian Andrew Ichter
Byron David
Chelsea Finn
Clayton Tan
Diego Reyes
Dmitry Kalashnikov
Eric Victor Jang
Jarek Liam Rettinghouse
Jornell Lacanlale Quiambao
Julian Ibarz
Karol Hausman
Kyle Alan Jeffrey
Linda Luu
Mengyuan Yan
Michael Soogil Ahn
Nicolas Sievers
Noah Brown
Omar Eduardo Escareno Cortes
Peng Xu
Peter Pastor Sampedro
Rosario Jauregui Ruano
Sally Augusta Jesmonth
Sergey Levine
Steve Xu
Yao Lu
Yevgen Chebotar
Yuheng Kuang
Conference on Robot Learning (CoRL) (2022)
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator. The project's website and the video can be found at say-can.github.io.
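The scoring rule described above admits a compact sketch: each candidate skill is scored by the product of the language model's probability that the skill is a useful next step and a learned value function's estimate that it can succeed from the current state. The function below is an illustrative outline; `lm_log_prob` and `affordance_value` are hypothetical callables standing in for the models used in the paper.

```python
import numpy as np

def saycan_step(instruction, history, skills, lm_log_prob, affordance_value, state):
    """Pick the next skill as argmax of p_LM(skill | instruction, history) * value(state, skill)."""
    scores = [
        np.exp(lm_log_prob(instruction, history, skill)) * affordance_value(state, skill)
        for skill in skills
    ]
    return skills[int(np.argmax(scores))]

# Toy usage with stand-in scoring functions.
skills = ["find a sponge", "go to the table", "wipe the table"]
lm_log_prob = lambda instr, hist, s: {"find a sponge": -0.5, "go to the table": -1.5, "wipe the table": -2.0}[s]
affordance_value = lambda state, s: 0.9 if s == "find a sponge" else 0.4
print(saycan_step("clean up the spill", [], skills, lm_log_prob, affordance_value, state=None))
```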
Mechanical Search on Shelves using LAX-RAY: Lateral Access X-RAY
Huang Huang
Marcus Dominguez-Kuhne
Vishal Satish
Michael Danielczuk
Kate Sanders
Jeff Ichnowski
Andrew Lee
Ken Goldberg
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2021)
Finding an occluded object in a lateral access environment such as a shelf or cabinet is a problem that arises in many contexts such as warehouses, retail, healthcare, shipping, and homes. While this problem, known as mechanical search, is well-studied in overhead access environments, lateral access environments introduce constraints on the poses of objects and on available grasp actions, and pushing actions are preferred to preserve the environment structure. We propose LAX-RAY (Lateral Access maXimal Reduction in support Area of occupancY distribution): a system that combines target object occupancy distribution prediction with a mechanical search policy that sequentially pushes occluding objects to reveal a given target object. For scenarios with extruded polygonal objects, we introduce two lateral-access search policies that encode a history of predicted target distributions and can plan up to three actions into the future. We introduce a First-Order Shelf Simulator (FOSS) and use it to evaluate these policies in 800 simulated random shelf environments per policy. We also evaluate in 5 physical shelf environments using a Fetch robot with an embedded PrimeSense RGBD Camera and an attached pushing blade. The policies outperform baselines by up to 25% in simulation and up to 60% in physical experiments. Additionally, the two-step prediction policy is the highest performing in simulation for 8 objects with a 69% success rate, suggesting a tradeoff between future information and prediction errors. Code, videos, and supplementary material can be found at https://sites.google.com/berkeley.edu/lax-ray.
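The "maximal reduction in support area" idea above can be sketched as a greedy one-step policy over a discretized occupancy distribution: among candidate pushes, choose the one whose predicted post-push distribution has the smallest support. The prediction function and thresholded support measure below are illustrative placeholders, not the paper's exact formulation.

```python
import numpy as np

def support_area(distribution, threshold=1e-3):
    """Number of shelf cells where the target's occupancy probability is non-negligible."""
    return int(np.sum(distribution > threshold))

def greedy_push(occupancy, candidate_pushes, predict_after_push):
    """Choose the push whose predicted outcome most shrinks the occupancy support."""
    best_push, best_area = None, support_area(occupancy)
    for push in candidate_pushes:
        area = support_area(predict_after_push(occupancy, push))
        if area < best_area:
            best_push, best_area = push, area
    return best_push

# Toy usage: a push at cell i is assumed to reveal cells i and i+1, ruling them out.
occupancy = np.array([0.0, 0.3, 0.6, 0.1])
def reveal(dist, i):
    out = dist.copy()
    out[i:i + 2] = 0.0
    s = out.sum()
    return out / s if s > 0 else out
print(greedy_push(occupancy, candidate_pushes=[0, 1, 2], predict_after_push=reveal))  # -> 1
```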
X-Ray: Mechanical Search for an Occluded Object by Minimizing Support of Learned Occupancy Distributions
Michael Danielczuk
Ken Goldberg
International Conference on Intelligent Robots and Systems (IROS) (2020)
For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object. For mechanical search, we introduce X-Ray, an algorithm based on learned occupancy distributions. We train a neural network using a synthetic dataset of RGBD heap images labeled for a set of standard bounding box targets with varying aspect ratios. X-Ray minimizes support of the learned distribution as part of a mechanical search policy in both simulated and real environments. We benchmark these policies against two baseline policies on 1,000 heaps of 15 objects in simulation where the target object is partially or fully occluded. Results suggest that X-Ray is significantly more efficient, as it succeeds in extracting the target object 82% of the time, 15% more often than the best-performing baseline. Experiments on an ABB YuMi robot with 20 heaps of 25 household objects suggest that the learned policy transfers easily to a physical system, where it outperforms baseline policies by 15% in success rate with 17% fewer actions. Datasets, videos, and experiments are available at https://sites.google.com/corp/berkeley.edu/x-ray.
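A greedy step of such a policy can be sketched as: given the learned occupancy distribution for the target and segmentation masks of the visible objects, remove the object covering the most probability mass. The overlap measure and inputs below are illustrative placeholders rather than X-Ray's full pipeline.

```python
import numpy as np

def next_object_to_remove(occupancy_map, object_masks):
    """Greedy mechanical-search step: grasp the visible object that hides
    the largest share of the target's predicted occupancy distribution."""
    overlaps = [float(np.sum(occupancy_map * mask)) for mask in object_masks]
    return int(np.argmax(overlaps))

# Toy usage on a 4x4 grid: object 1 covers more of the target's likely location.
occupancy_map = np.zeros((4, 4))
occupancy_map[1:3, 1:3] = 0.25                    # target probably hidden near the center
mask_a = np.zeros((4, 4)); mask_a[0, :] = 1.0     # object along the top edge
mask_b = np.zeros((4, 4)); mask_b[1:3, 2] = 1.0   # object overlapping the center
print(next_object_to_remove(occupancy_map, [mask_a, mask_b]))  # -> 1
```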
Differentiable Mapping Networks: Learning Structured Map Representations for Sparse Visual Localization
Peter Karkus
Rico Jonschkowski
International Conference on Robotics and Automation (ICRA) (2020)
Mapping and localization, preferably from a small number of observations, are fundamental tasks in robotics. We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN). The DMN constructs a spatially structured view-embedding map and uses it for subsequent visual localization with a particle filter. Since the DMN architecture is end-to-end differentiable, we can jointly learn the map representation and localization using gradient descent. We apply the DMN to sparse visual localization, where a robot needs to localize in a new environment with respect to a small number of images from known viewpoints. We evaluate the DMN using simulated environments and a challenging real-world Street View dataset. We find that the DMN learns effective map representations for visual localization. The benefit of spatial structure increases with larger environments, more viewpoints for mapping, and when training data is scarce. Project website: https://sites.google.com/view/differentiable-mapping.
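As a rough sketch of how a structured map can drive particle-filter localization, the update below reweights candidate poses by how well the map, viewed as an embedding from each pose, matches the embedding of the current observation, and then resamples. The embedding functions are hypothetical stand-ins for the DMN's learned components.

```python
import numpy as np

def particle_filter_update(particles, weights, obs_embedding, map_embedding_at, rng):
    """One localization step: score each candidate pose against the observation, then resample."""
    scores = np.array([float(obs_embedding @ map_embedding_at(pose)) for pose in particles])
    weights = weights * np.exp(scores - scores.max())   # numerically stable reweighting
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# Toy usage: poses are (x, y) points and the "embedding" of a pose is just its coordinates.
rng = np.random.default_rng(0)
particles = rng.uniform(-1.0, 1.0, size=(100, 2))
weights = np.full(100, 1.0 / 100)
obs_embedding = np.array([0.8, 0.8])                   # observation taken near (0.8, 0.8)
particles, weights = particle_filter_update(particles, weights, obs_embedding, lambda p: p, rng)
print(particles.mean(axis=0))                          # mean shifts toward the observed pose
```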
Sim-to-Real: Learning Agile Locomotion For Quadruped Robots
Erwin Coumans
Danijar Hafner
Steven Bohez
RSS (2018)
Designing agile locomotion for quadruped robots often requires extensive expertise and tedious manual tuning. In this paper, we present a system to automate this process by leveraging deep reinforcement learning techniques. Our system can learn quadruped locomotion from scratch with simple reward signals. In addition, users can provide an open-loop reference to guide the learning process if more control over the learned gait is needed. The control policies are learned in a physical simulator and then deployed to real robots. In robotics, policies trained in simulation often do not transfer to the real world. We narrow this reality gap by improving the physical simulator and learning robust policies. We improve the simulation using system identification, developing an accurate actuator model and simulating latency. We learn robust controllers by randomizing the physical environments, adding perturbations and designing a compact observation space. We evaluate our system on two agile locomotion gaits: trotting and galloping. After learning in simulation, a quadruped robot can successfully perform both gaits in the real world.
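The reality-gap recipe above (system identification, an actuator model, simulated latency, and randomized physics) is amenable to a very small sketch: sample a fresh set of physical parameters at the start of each training episode. The randomization ranges below are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

def sample_randomized_physics(rng):
    """Per-episode domain randomization; ranges are illustrative assumptions."""
    return {
        "mass_scale": rng.uniform(0.8, 1.2),            # body mass multiplier
        "friction": rng.uniform(0.5, 1.25),             # foot-ground friction coefficient
        "motor_strength_scale": rng.uniform(0.8, 1.2),  # actuator torque multiplier
        "latency_s": rng.uniform(0.0, 0.04),            # simulated observation latency
    }

rng = np.random.default_rng(0)
for episode in range(3):
    params = sample_randomized_physics(rng)
    # Apply params to the simulator here, then roll out and train the policy.
    print(episode, params)
```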
QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
Dmitry Kalashnikov
Peter Pastor Sampedro
Julian Ibarz
Alexander Herzog
Eric Jang
Deirdre Quillen
Ethan Holly
Mrinal Kalakrishnan
Sergey Levine
CORL (2018)
In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations.
Supplementary experiment videos can be found at https://goo.gl/wQrYmc.
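QT-Opt evaluates a learned Q-function over continuous gripper actions without an explicit actor network; one common way to realize the required action maximization, sketched below, is the cross-entropy method (CEM). The Q-function here is a toy quadratic stand-in, and the sample and elite counts are illustrative.

```python
import numpy as np

def cem_maximize_q(q_fn, state, action_dim, iters=3, samples=64, elites=6, rng=None):
    """Approximate argmax_a Q(s, a) with the cross-entropy method: sample actions,
    keep the top-scoring elites, refit a Gaussian, and repeat."""
    rng = rng or np.random.default_rng()
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        actions = rng.normal(mean, std, size=(samples, action_dim))
        q_values = np.array([q_fn(state, a) for a in actions])
        elite = actions[np.argsort(q_values)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Toy usage with a quadratic stand-in Q-function peaked at a = [0.5, -0.2, 0.1, 0.0].
target = np.array([0.5, -0.2, 0.1, 0.0])
q_fn = lambda s, a: -np.sum((a - target) ** 2)
print(cem_maximize_q(q_fn, state=None, action_dim=4, rng=np.random.default_rng(0)))
```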