Vincent Vanhoucke

Vincent Vanhoucke

Vincent Vanhoucke is a Distinguished Scientist, and Senior Director for Robotics at Google DeepMind. Prior to that, he led Google Brain's vision and perception research, and the speech recognition quality team for Google Search by Voice. He holds a Ph.D. in Electrical Engineering from Stanford University and a Diplôme d'Ingénieur from the Ecole Centrale Paris.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Robotic Table Tennis: A Case Study into a High Speed Learning System
    Jon Abelian
    Saminda Abeyruwan
    Michael Ahn
    Justin Boyd
    Erwin Johan Coumans
    Omar Escareno
    Wenbo Gao
    Navdeep Jaitly
    Juhana Kangaspunta
    Satoshi Kataoka
    Gus Kouretas
    Yuheng Kuang
    Corey Lynch
    Thinh Nguyen
    Ken Oslund
    Barney J. Reed
    Anish Shankar
    Avi Singh
    Grace Vesom
    Peng Xu
    Robotics: Science and Systems (2023)
    Preview abstract We present a deep-dive into a learning robotic system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized and novel perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real world and also train policies for zero-shot transfer, and automated real world environment resets that enable autonomous training and evaluation on physical robots. We complement a complete system description including numerous design decisions that are typically not widely disseminated, with a collection of ablation studies that clarify the importance of mitigating various sources of latency, accounting for training and deployment distribution shifts, robustness of the perception system, and sensitivity to policy hyper-parameters and choice of action space. A video demonstrating the components of our system and details of experimental results is included in the supplementary material. View details
    Preview abstract Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning. Prototypes are available at socraticmodels.github.io View details
    Learning to Fold Real Garments with One Arm: A Case Study in Cloud-Based Robotics Research
    Ryan Hoque
    Kaushik Shivakumar
    Shrey Aeron
    Gabriel Deza
    Aditya Ganapathi
    Andy Zeng
    Ken Goldberg
    IEEE International Conference on Intelligent Robots and Systems (IROS) (2022) (to appear)
    Preview abstract Autonomous fabric manipulation is a longstanding challenge in robotics, but evaluating progress is difficult due to the cost and diversity of robot hardware. Using Reach, a new cloud robotics platform that enables low-latency remote execution of control policies on physical robots, we present the first systematic benchmarking of fabric manipulation algorithms on physical hardware. We develop 4 novel learning-based algorithms that model expert actions, keypoints, reward functions, and dynamic motions, and we compare these against 4 learning-free and inverse dynamics algorithms on the task of folding a crumpled T-shirt with a single robot arm. The entire lifecycle of data collection, model training, and policy evaluation is performed remotely without physical access to the robot workcell. Results suggest a new algorithm combining imitation learning with analytic methods achieves 84% of human-level performance on the folding task. View details
    Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items
    Anthony G. Francis
    Brandon Kinman
    Laura Downs
    Nathan Koenig
    Ryan M. Hickman
    Thomas B. McHugh
    (2022)
    Preview abstract Interactive 3D simulations have enabled breakthroughs in robotics and computer vision, but simulating the broad diversity of environments needed for deep learning requires large corpora of photo-realistic 3D object models. To address this need, we present Google Scanned Objects, an open-source collection of over one thousand 3D-scanned household items; these models are preprocessed for use in Ignition Gazebo and the Bullet simulation platforms, but are easily adaptable to other simulators. We describe our object scanning and curation pipeline, then provide statistics about the contents of the dataset and its usage. We hope that the diversity, quality, and flexibility that Google Scanned Objects provides will lead to further advances in interactive simulation, synthetic perception, and robotic learning. View details
    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
    Alexander Herzog
    Alexander Toshkov Toshev
    Andy Zeng
    Anthony Brohan
    Brian Andrew Ichter
    Byron David
    Chelsea Finn
    Clayton Tan
    Diego Reyes
    Dmitry Kalashnikov
    Eric Victor Jang
    Jarek Liam Rettinghouse
    Jornell Lacanlale Quiambao
    Julian Ibarz
    Karol Hausman
    Kyle Alan Jeffrey
    Linda Luu
    Mengyuan Yan
    Michael Soogil Ahn
    Nicolas Sievers
    Noah Brown
    Omar Eduardo Escareno Cortes
    Peng Xu
    Peter Pastor Sampedro
    Rosario Jauregui Ruano
    Sally Augusta Jesmonth
    Sergey Levine
    Steve Xu
    Yao Lu
    Yevgen Chebotar
    Yuheng Kuang
    Conference on Robot Learning (CoRL) (2022)
    Preview abstract Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator. The project's website and the video can be found at \url{say-can.github.io}. View details
    Mechanical Search on Shelves using LAX-RAY: Lateral Access X-RAY
    Huang Huang
    Marcus Dominguez-Kuhne
    Vishal Satish
    Michael Danielczuk
    Kate Sanders
    Jeff Ichnowski
    Andrew Lee
    Ken Goldberg
    IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2021)
    Preview abstract Finding an occluded object in a lateral access environment such as a shelf or cabinet is a problem that arises in many contexts such as warehouses, retail, healthcare, shipping, and homes. While this problem, known as mechanical search, is well-studied in overhead access environments, lateral access environments introduce constraints on the poses of objects and on available grasp actions, and pushing actions are preferred to preserve the environment structure. We propose LAXRAY (Lateral Access maXimal Reduction in support Area of occupancY distribution): a system that combines target object occupancy distribution prediction with a mechanical search policy that sequentially pushes occluding objects to reveal a given target object. For scenarios with extruded polygonal objects, we introduce two lateral-access search policies that encode a history of predicted target distributions and can plan up to three actions into the future. We introduce a First-Order Shelf Simulator (FOSS) and use it to evaluate these policies in 800 simulated random shelf environments per policy. We also evaluate in 5 physical shelf environments using a Fetch robot with an embedded PrimeSense RGBD Camera and an attached pushing blade. The policies outperform baselines by up to 25 % in simulation and up to 60% in physical experiments. Additionally, the two-step prediction policy is the highest performing in simulation for 8 objects with a 69 % success rate, suggesting a tradeoff between future information and prediction errors. Code, videos, and supplementary material can be found at https://sites.google.com/berkeley.edu/lax-ray. View details
    X-Ray: Mechanical Search for an Occluded Object by Minimizing Support of Learned Occupancy Distributions
    Michael Danielczuk
    Ken Goldberg
    International Conference on Intelligent Robots and Systems (IROS) (2020)
    Preview abstract For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object. For mechanical search, we introduce X-Ray, an algorithm based on learned occupancy distributions. We train a neural network using a synthetic dataset of RGBD heap images labeled for a set of standard bounding box targets with varying aspect ratios. X-Ray minimizes support of the learned distribution as part of a mechanical search policy in both simulated and real environments. We benchmark these policies against two baseline policies on 1,000 heaps of 15 objects in simulation where the target object is partially or fully occluded. Results suggest that X-Ray is significantly more efficient, as it succeeds in extracting the target object 82% of the time, 15% more often than the best-performing baseline. Experiments on an ABB YuMi robot with 20 heaps of 25 household objects suggest that the learned policy transfers easily to a physical system, where it outperforms baseline policies by 15% in success rate with 17% fewer actions. Datasets, videos, and experiments are available at https://sites.google.com/corp/berkeley.edu/x-ray. View details
    Differentiable Mapping Networks: Learning Structured Map Representations for Sparse Visual Localization
    Peter Karkus
    Rico Jonschkowski
    International Conference on Robotics and Automation (ICRA) (2020)
    Preview abstract Mapping and localization, preferably from a small number of observations, are fundamental tasks in robotics. We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN). The DMN constructs a spatially structured view-embedding map and uses it for subsequent visual localization with a particle filter. Since the DMN architecture is end-to-end differentiable, we can jointly learn the map representation and localization using gradient descent. We apply the DMN to sparse visual localization, where a robot needs to localize in a new environment with respect to a small number of images from known viewpoints. We evaluate the DMN using simulated environments and a challenging real-world Street View dataset. We find that the DMN learns effective map representations for visual localization. The benefit of spatial structure increases with larger environments, more viewpoints for mapping, and when training data is scarce. Project website: https://sites.google.com/view/differentiable-mapping. View details
    Preview abstract Designing agile locomotion for quadruped robots often requires extensive expertise and tedious manual tuning. In this paper, we present a system to automate this process by leveraging deep reinforcement learning techniques. Our system can learn quadruped locomotion from scratch with simple reward signals. In addition, users can provide an open loop reference to guide the learning process if more control over the learned gait is needed. The control policies are learned in a physical simulator and then deployed to real robots. In robotics, policies trained in simulation often does not transfer to the real world. We narrow this reality gap by improving the physical simulator and learning robust policies. We improve the simulation using system identification, developing an accurate actuator model and simulating latency. We learn robust controllers by randomizing the physical environments, adding perturbations and designing a compact observation space. We evaluate our system on two agile locomotion gaits: trotting and galloping. After learning in simulation, a quadruped robot can successfully perform both gaits in real world. View details
    QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
    Dmitry Kalashnikov
    Peter Pastor Sampedro
    Julian Ibarz
    Alexander Herzog
    Eric Jang
    Deirdre Quillen
    Ethan Holly
    Mrinal Kalakrishnan
    Sergey Levine
    CORL (2018)
    Preview abstract In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations. Supplementary experiment videos can be found at https://goo.gl/wQrYmc View details