Vincent Vanhoucke

Vincent Vanhoucke is a Distinguished Scientist and Senior Director for Robotics at Google DeepMind. Prior to that, he led Google Brain's vision and perception research, and the speech recognition quality team for Google Search by Voice. He holds a Ph.D. in Electrical Engineering from Stanford University and a Diplôme d'Ingénieur from the Ecole Centrale Paris.
Authored Publications
    Robotic Table Tennis: A Case Study into a High Speed Learning System
    Jon Abelian
    Saminda Abeyruwan
    Michael Ahn
    Justin Boyd
    Erwin Johan Coumans
    Omar Escareno
    Wenbo Gao
    Navdeep Jaitly
    Juhana Kangaspunta
    Satoshi Kataoka
    Gus Kouretas
    Yuheng Kuang
    Corey Lynch
    Thinh Nguyen
    Ken Oslund
    Barney J. Reed
    Anish Shankar
    Avi Singh
    Grace Vesom
    Peng Xu
    Robotics: Science and Systems (2023)
    Preview abstract We present a deep-dive into a learning robotic system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized and novel perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real world and also train policies for zero-shot transfer, and automated real world environment resets that enable autonomous training and evaluation on physical robots. We complement a complete system description including numerous design decisions that are typically not widely disseminated, with a collection of ablation studies that clarify the importance of mitigating various sources of latency, accounting for training and deployment distribution shifts, robustness of the perception system, and sensitivity to policy hyper-parameters and choice of action space. A video demonstrating the components of our system and details of experimental results is included in the supplementary material. View details
    Learning to Fold Real Garments with One Arm: A Case Study in Cloud-Based Robotics Research
    Ryan Hoque
    Kaushik Shivakumar
    Shrey Aeron
    Gabriel Deza
    Aditya Ganapathi
    Andy Zeng
    Ken Goldberg
    IEEE International Conference on Intelligent Robots and Systems (IROS) (2022) (to appear)
    Preview abstract Autonomous fabric manipulation is a longstanding challenge in robotics, but evaluating progress is difficult due to the cost and diversity of robot hardware. Using Reach, a new cloud robotics platform that enables low-latency remote execution of control policies on physical robots, we present the first systematic benchmarking of fabric manipulation algorithms on physical hardware. We develop 4 novel learning-based algorithms that model expert actions, keypoints, reward functions, and dynamic motions, and we compare these against 4 learning-free and inverse dynamics algorithms on the task of folding a crumpled T-shirt with a single robot arm. The entire lifecycle of data collection, model training, and policy evaluation is performed remotely without physical access to the robot workcell. Results suggest a new algorithm combining imitation learning with analytic methods achieves 84% of human-level performance on the folding task. View details
    Preview abstract Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning. Prototypes are available at socraticmodels.github.io View details
    Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items
    Anthony G. Francis
    Brandon Kinman
    Laura Downs
    Nathan Koenig
    Ryan M. Hickman
    Thomas B. McHugh
    Preview abstract Interactive 3D simulations have enabled breakthroughs in robotics and computer vision, but simulating the broad diversity of environments needed for deep learning requires large corpora of photo-realistic 3D object models. To address this need, we present Google Scanned Objects, an open-source collection of over one thousand 3D-scanned household items; these models are preprocessed for use in Ignition Gazebo and the Bullet simulation platforms, but are easily adaptable to other simulators. We describe our object scanning and curation pipeline, then provide statistics about the contents of the dataset and its usage. We hope that the diversity, quality, and flexibility that Google Scanned Objects provides will lead to further advances in interactive simulation, synthetic perception, and robotic learning. View details
    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
    Alexander Herzog
    Alexander Toshkov Toshev
    Andy Zeng
    Anthony Brohan
    Brian Andrew Ichter
    Byron David
    Chelsea Finn
    Clayton Tan
    Diego Reyes
    Dmitry Kalashnikov
    Eric Victor Jang
    Fei Xia
    Jarek Liam Rettinghouse
    Jornell Lacanlale Quiambao
    Julian Ibarz
    Karol Hausman
    Kyle Alan Jeffrey
    Linda Luu
    Mengyuan Yan
    Michael Soogil Ahn
    Nicolas Sievers
    Noah Brown
    Omar Eduardo Escareno Cortes
    Peng Xu
    Peter Pastor Sampedro
    Rosario Jauregui Ruano
    Sally Augusta Jesmonth
    Sergey Levine
    Steve Xu
    Yao Lu
    Yevgen Chebotar
    Yuheng Kuang
    Conference on Robot Learning (CoRL) (2022)
    Preview abstract Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator. The project's website and the video can be found at say-can.github.io. View details
    Mechanical Search on Shelves using LAX-RAY: Lateral Access X-RAY
    Huang Huang
    Marcus Dominguez-Kuhne
    Vishal Satish
    Michael Danielczuk
    Kate Sanders
    Jeff Ichnowski
    Andrew Lee
    Ken Goldberg
    IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2021)
    Preview abstract Finding an occluded object in a lateral access environment such as a shelf or cabinet is a problem that arises in many contexts such as warehouses, retail, healthcare, shipping, and homes. While this problem, known as mechanical search, is well-studied in overhead access environments, lateral access environments introduce constraints on the poses of objects and on available grasp actions, and pushing actions are preferred to preserve the environment structure. We propose LAX-RAY (Lateral Access maXimal Reduction in support Area of occupancY distribution): a system that combines target object occupancy distribution prediction with a mechanical search policy that sequentially pushes occluding objects to reveal a given target object. For scenarios with extruded polygonal objects, we introduce two lateral-access search policies that encode a history of predicted target distributions and can plan up to three actions into the future. We introduce a First-Order Shelf Simulator (FOSS) and use it to evaluate these policies in 800 simulated random shelf environments per policy. We also evaluate in 5 physical shelf environments using a Fetch robot with an embedded PrimeSense RGBD Camera and an attached pushing blade. The policies outperform baselines by up to 25% in simulation and up to 60% in physical experiments. Additionally, the two-step prediction policy is the highest performing in simulation for 8 objects with a 69% success rate, suggesting a tradeoff between future information and prediction errors. Code, videos, and supplementary material can be found at https://sites.google.com/berkeley.edu/lax-ray. View details
    X-Ray: Mechanical Search for an Occluded Object by Minimizing Support of Learned Occupancy Distributions
    Michael Danielczuk
    Ken Goldberg
    International Conference on Intelligent Robots and Systems (IROS) (2020)
    Preview abstract For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object. For mechanical search, we introduce X-Ray, an algorithm based on learned occupancy distributions. We train a neural network using a synthetic dataset of RGBD heap images labeled for a set of standard bounding box targets with varying aspect ratios. X-Ray minimizes support of the learned distribution as part of a mechanical search policy in both simulated and real environments. We benchmark these policies against two baseline policies on 1,000 heaps of 15 objects in simulation where the target object is partially or fully occluded. Results suggest that X-Ray is significantly more efficient, as it succeeds in extracting the target object 82% of the time, 15% more often than the best-performing baseline. Experiments on an ABB YuMi robot with 20 heaps of 25 household objects suggest that the learned policy transfers easily to a physical system, where it outperforms baseline policies by 15% in success rate with 17% fewer actions. Datasets, videos, and experiments are available at https://sites.google.com/corp/berkeley.edu/x-ray. View details
    Differentiable Mapping Networks: Learning Structured Map Representations for Sparse Visual Localization
    Peter Karkus
    Rico Jonschkowski
    International Conference on Robotics and Automation (ICRA) (2020)
    Preview abstract Mapping and localization, preferably from a small number of observations, are fundamental tasks in robotics. We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN). The DMN constructs a spatially structured view-embedding map and uses it for subsequent visual localization with a particle filter. Since the DMN architecture is end-to-end differentiable, we can jointly learn the map representation and localization using gradient descent. We apply the DMN to sparse visual localization, where a robot needs to localize in a new environment with respect to a small number of images from known viewpoints. We evaluate the DMN using simulated environments and a challenging real-world Street View dataset. We find that the DMN learns effective map representations for visual localization. The benefit of spatial structure increases with larger environments, more viewpoints for mapping, and when training data is scarce. Project website: https://sites.google.com/view/differentiable-mapping. View details
    Preview abstract We propose an architecture for learning complex controllable behaviors by having simple Policies Modulate Trajectory Generators (PMTG), a powerful combination that can provide both memory and prior knowledge to the controller. The result is a flexible architecture that is applicable to a class of problems with periodic motion for which one has an insight into the class of trajectories that might lead to a desired behavior. We illustrate the basics of our architecture using a synthetic control problem, then go on to learn speed-controlled locomotion for a quadrupedal robot by using Deep Reinforcement Learning and Evolutionary Strategies. We demonstrate that a simple linear policy, when paired with a parametric Trajectory Generator for quadrupedal gaits, can induce walking behaviors with controllable speed from 4-dimensional IMU observations alone, and can be learned in under 1000 rollouts. We also transfer these policies to a real robot and show locomotion with controllable forward velocity. View details
    Preview abstract Robotic learning algorithms based on reinforcement, self-supervision, and imitation can acquire end-to-end controllers from raw sensory inputs such as images. These end-to-end controllers acquire perception systems that are tailored to the task, picking up on the cues that are most useful for the task at hand. However, to learn generalizable robotic skills, we might prefer more structured image representations, such as ones encoding the persistence of objects and their identities. In this paper, we study a specific instance of this problem: acquiring object representations through autonomous robotic interaction with its environment. Our representation learning method is based on object persistence: when a robot picks up an object and "subtracts" it from the scene, its representation of the scene should change in a predictable way. We can use this observation to formulate a simple condition that an object-centric representation should satisfy: the features corresponding to a scene should be approximately equal to the feature values for the same scene after an object has been removed, minus the feature value for that object. View details
    Preview abstract Designing agile locomotion for quadruped robots often requires extensive expertise and tedious manual tuning. In this paper, we present a system to automate this process by leveraging deep reinforcement learning techniques. Our system can learn quadruped locomotion from scratch with simple reward signals. In addition, users can provide an open loop reference to guide the learning process if more control over the learned gait is needed. The control policies are learned in a physical simulator and then deployed to real robots. In robotics, policies trained in simulation often do not transfer to the real world. We narrow this reality gap by improving the physical simulator and learning robust policies. We improve the simulation using system identification, developing an accurate actuator model and simulating latency. We learn robust controllers by randomizing the physical environments, adding perturbations and designing a compact observation space. We evaluate our system on two agile locomotion gaits: trotting and galloping. After learning in simulation, a quadruped robot can successfully perform both gaits in the real world. View details
    Classification of crystallization outcomes using deep convolutional neural networks
    Andrew E. Bruno
    Patrick Charbonneau
    Janet Newman
    Edward H. Snell
    David Richard So
    Christopher J. Watkins
    Shawn Williams
    Julie Wilson
    PLOS One (2018)
    Preview abstract The Machine Recognition of Crystallization Outcomes (MARCO) initiative has assembled roughly half a million annotated images of macromolecular crystallization experiments from various sources and setups. Here, state-of-the-art machine learning algorithms are trained and tested on different parts of this data set. We find that more than 94% of the test images can be correctly labeled, irrespective of their experimental origin. Because crystal recognition is key to high-density sampling and the systematic analysis of crystallization experiments, this approach opens the door to both industrial and fundamental research applications. View details
    Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping
    Paul Wohlhart
    Matthew Kelcey
    Mrinal Kalakrishnan
    Laura Downs
    Julian Ibarz
    Peter Pastor Sampedro
    Kurt Konolige
    Sergey Levine
    ICRA (2018)
    Preview abstract Instrumenting and collecting annotated visual grasping datasets to train modern machine learning algorithms is prohibitively expensive. An appealing alternative is to use off-the-shelf simulators to render synthetic data for which ground-truth annotations are generated automatically. Unfortunately, models trained purely on simulated data often fail to generalize to the real world. To address this shortcoming, prior work introduced domain adaptation algorithms that attempt to make the resulting models domain-invariant. However, such works were evaluated primarily on offline image classification datasets. In this work, we adapt these techniques for learning, primarily in simulation, robotic hand-eye coordination for grasping. Our approaches generalize to diverse and previously unseen real-world objects. We show that, by using synthetic data and domain adaptation, we are able to reduce the number of real-world samples required to reach a given level of performance by up to 50 times. We also show that by using our suggested methodology we are able to achieve good grasping results using no real-world labeled data. View details
    QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
    Dmitry Kalashnikov
    Peter Pastor Sampedro
    Julian Ibarz
    Alexander Herzog
    Eric Jang
    Deirdre Quillen
    Ethan Holly
    Mrinal Kalakrishnan
    Sergey Levine
    CORL (2018)
    Preview abstract In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations. Supplementary experiment videos can be found at https://goo.gl/wQrYmc View details
    TensorFlow Agents: Efficient Batched Reinforcement Learning in TensorFlow
    Danijar Hafner
    James Davidson
    arXiv preprint arXiv:1709.02878 (2017)
    Preview abstract We introduce TensorFlow Agents, an efficient infrastructure paradigm for building parallel reinforcement learning algorithms in TensorFlow. We simulate multiple environments in parallel, and group them to perform the neural network computation on a batch rather than individual observations. This allows the TensorFlow execution engine to parallelize computation, without the need for manual synchronization. Environments are stepped in separate Python processes to progress them in parallel without interference of the global interpreter lock. As part of this project, we introduce BatchPPO, an efficient implementation of the proximal policy optimization algorithm. By open sourcing TensorFlow Agents, we hope to provide a flexible starting point for future projects that accelerates future research in the field. View details
    YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Dataset for Object Detection in Video
    Jon Shlens
    Stefano Mazzocchi
    Xin Pan
    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7464-7473
    Preview abstract We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the MS COCO label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second. The use of a cascade of increasingly precise human annotations ensures a label accuracy above 95% for every class and tight bounding boxes. Finally, we train and evaluate well-known deep network architectures and report baseline figures for per-frame classification and localization to provide a point of comparison for future work. We also demonstrate how the temporal contiguity of video can potentially be used to improve such inferences. Please see the PDF file to find the URL to download the data. We hope the availability of such a large curated corpus will spur new advances in video object detection and tracking. View details
    Rethinking the Inception Architecture for Computer Vision
    Christian Szegedy
    Sergey Ioffe
    Jonathon Shlens
    Zbigniew Wojna
    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2016)
    Preview abstract Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, very deep convolutional networks have started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error on the validation set (3.6% error on the test set) and 17.3% top-1 error on the validation set. View details
    Preview abstract Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08 percent top-5 error on the test set of the ImageNet classification (CLS) challenge. View details
    Going Deeper with Convolutions
    Christian Szegedy
    Wei Liu
    Yangqing Jia
    Scott Reed
    Dragomir Anguelov
    Andrew Rabinovich
    Computer Vision and Pattern Recognition (CVPR) (2015)
    Preview abstract We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation of this architecture, GoogLeNet, a 22-layer deep network, was used to assess its quality in the context of object detection and classification. View details
    TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
    Ashish Agarwal
    Eugene Brevdo
    Craig Citro
    Matthieu Devin
    Ian Goodfellow
    Andrew Harp
    Geoffrey Irving
    Yangqing Jia
    Rafal Jozefowicz
    Lukasz Kaiser
    Manjunath Kudlur
    Dan Mané
    Rajat Monga
    Chris Olah
    Mike Schuster
    Jonathon Shlens
    Benoit Steiner
    Ilya Sutskever
    Kunal Talwar
    Paul Tucker
    Vijay Vasudevan
    Pete Warden
    Yuan Yu
    Xiaoqiang Zheng
    tensorflow.org (2015)
    Preview abstract TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org. View details
    Real-Time Pedestrian Detection With Deep Network Cascades
    Alex Krizhevsky
    Abhijit Ogale
    Dave Ferguson
    Proceedings of BMVC 2015
    Preview abstract We present a new real-time approach to object detection that exploits the efficiency of cascade classifiers with the accuracy of deep neural networks. Deep networks have been shown to excel at classification tasks, and their ability to operate on raw pixel input without the need to design special features is very appealing. However, deep nets are notoriously slow at inference time. In this paper, we propose an approach that cascades deep nets and fast features, that is both extremely fast and extremely accurate. We apply it to the challenging task of pedestrian detection. Our algorithm runs in real-time at 15 frames per second. The resulting approach achieves a 26.2% average miss rate on the Caltech Pedestrian detection benchmark, which is competitive with the very best reported results. It is the first work we are aware of that achieves extremely high accuracy while running in real-time. View details
    Preview abstract Pedestrian detection is of crucial importance to autonomous driving applications. Methods based on deep learning have shown significant improvements in accuracy, which makes them particularly suitable for applications, such as pedestrian detection, where reducing miss rate is very important. Although they are accurate, their runtime has been at best in seconds per image, which makes them impractical for onboard applications. We present here a Large-Field-Of-View (LFOV) deep network for pedestrian detection, that can achieve high accuracy and is designed to make deep networks work faster for detection problems. The idea of the proposed Large-Field-of-View deep network is to learn to make classification decisions simultaneously and accurately at multiple locations. The LFOV network processes larger image areas at much faster speeds than typical deep networks have been able to do, and can intrinsically reuse computations. Our pedestrian detection solution, which is a combination of an LFOV network and a standard deep network, works at 280 ms per image on GPU and achieves 35.85% average miss rate on the Caltech Pedestrian Detection Benchmark. View details
    Preview abstract We describe a simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems. In this approach a Deep Neural Network (DNN) is trained to predict the forced-alignment state of multiple frames using a separate softmax unit for each of the frames. This is in contrast to the usual method of training a DNN to predict only the state of the central frame. By itself this is not sufficient to improve accuracy of the system significantly. However, if we average the predictions for each frame - from the different contexts it is associated with - we achieve state of the art results on TIMIT using a fully connected Deep Neural Network without convolutional architectures or dropout training. On a 14 hour subset of Wall Street Journal (WSJ) using a context dependent DNN-HMM system it leads to a relative improvement of 6.4% on the dev set (test-dev93) and 9.3% on test set (test-eval92). View details
    Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks
    Erik McDermott
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Firenze, Italy (2014)
    Preview abstract This paper explores asynchronous stochastic optimization for sequence training of deep neural networks. Sequence training requires more computation than frame-level training using pre-computed frame data. This leads to several complications for stochastic optimization, arising from significant asynchrony in model updates under massive parallelization, and limited data shuffling due to utterance-chunked processing. We analyze the impact of these two issues on the efficiency and performance of sequence training. In particular, we suggest a framework to formalize the reasoning about the asynchrony and present experimental results on both small and large scale Voice Search tasks to validate the effectiveness and efficiency of asynchronous stochastic optimization. View details
    Multiframe Deep Neural Networks for Acoustic Modeling
    Matthieu Devin
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013)
    Preview abstract Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures however, they tend to be very expensive computationally, making them challenging to use in real-time applications. One key advantage of such neural networks is their ability to learn from very long observation windows going up to 400 ms. Given this very long temporal context, it is tempting to wonder whether one can run neural networks at a lower frame rate than the typical 10 ms, and whether there might be computational benefits to doing so. This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations. View details
    On Rectified Linear Units For Speech Processing
    M.D. Zeiler
    M. Ranzato
    R. Monga
    M. Mao
    K. Yang
    P. Nguyen
    G.E. Hinton
    38th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver (2013)
    Preview abstract Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task achieving lower word error rates than using a logistic network with the same topology. Similarly in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data. View details
    Multilingual acoustic models using distributed deep neural networks
    Patrick Nguyen
    Marc'aurelio Ranzato
    Matthieu Devin
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013)
    Preview abstract Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resource-scarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (crosslingual). View details
    Preview abstract The use of Deep Belief Networks (DBN) to pretrain Neural Networks has recently led to a resurgence in the use of Artificial Neural Network - Hidden Markov Model (ANN/HMM) hybrid systems for Automatic Speech Recognition (ASR). In this paper we report results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously with DBN-pretrained ANN/HMM systems - 5870 hours of Voice Search and 1400 hours of YouTube data. On the first dataset, the pretrained ANN/HMM system outperforms the best Gaussian Mixture Model - Hidden Markov Model (GMM/HMM) baseline, built with a much larger dataset by 3.7% absolute WER, while on the second dataset, it outperforms the GMM/HMM baseline by 4.7% absolute. Maximum Mutual Information (MMI) fine tuning and model combination using Segmental Conditional Random Fields (SCARF) give additional gains of 0.1% and 0.4% on the first dataset and 0.5% and 0.9% absolute on the second dataset. View details
    Investigations on Exemplar-Based Features for Speech Recognition Towards Thousands of Hours of Unsupervised, Noisy Data
    Patrick Nguyen
    Mitchel Weintraub
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Kyoto, Japan (2012), pp. 4437-4440
    Preview abstract The acoustic models in state-of-the-art speech recognition systems are based on phones in context that are represented by hidden Markov models. This modeling approach may be limited in that it is hard to incorporate long-span acoustic context. Exemplar-based approaches are an attractive alternative, in particular if massive data and computational power are available. Yet, most of the data at Google are unsupervised and noisy. This paper investigates an exemplar-based approach under this yet not well understood data regime. A log-linear rescoring framework is used to combine the exemplar-based features on the word level with the first-pass model. This approach guarantees at least baseline performance and focuses on the refined modeling of words with sufficient data. Experimental results for the Voice Search and the YouTube tasks are presented. View details
    Deep Neural Networks for Acoustic Modeling in Speech Recognition
    Geoffrey Hinton
    Li Deng
    Dong Yu
    George Dahl
    Abdel-rahman Mohamed
    Navdeep Jaitly
    Patrick Nguyen
    Brian Kingsbury
    Signal Processing Magazine (2012)
    Preview abstract Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition. View details
    Improving the speed of neural networks on CPUs
    Mark Z. Mao
    Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011
    Preview abstract Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3X improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10X speedup over an unoptimized baseline and a 4X speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware. View details
    Preview abstract One of the difficult problems of acoustic modeling for Automatic Speech Recognition (ASR) is how to adequately model the wide variety of acoustic conditions which may be present in the data. The problem is especially acute for tasks such as Google Search by Voice, where the amount of speech available per transaction is small, and adaptation techniques start showing their limitations. As training data from a very large user population is available however, it is possible to identify and jointly model subsets of the data with similar acoustic qualities. We describe a technique which allows us to perform this modeling at scale on large amounts of data by learning a tree-structured partition of the acoustic space, and we demonstrate that we can significantly improve recognition accuracy in various conditions through unsupervised Maximum Mutual Information (MMI) training. Being fully unsupervised, this technique scales easily to increasing numbers of conditions. View details
    Reading Text in Consumer Digital Photographs
    S. Burak Gokturk
    Proceedings of SPIE DRR XIV (2007)
    Confidence Scoring and Rejection using Multi-Pass Speech Recognition
    Proceedings of Interspeech 2005
    Automatic Training Set Segmentation For Multi-Pass Speech Recognition
    Mark Z. Mao
    Brian Strope
    Proceedings of ICASSP 2005
    Design of Compact Acoustic Models through Clustering of Tied-Covariance Gaussians
    Mark Z. Mao
    Proceedings of ICSLP 2004
    Mixtures of Inverse Covariances
    Ananth Sankar
    IEEE Transactions on Speech and Audio Processing, vol. 13 (2004), pp. 250-264
    Variable Length Mixtures of Inverse Covariances
    Ananth Sankar
    Proceedings of Eurospeech 2003
    Mixtures of Inverse Covariances
    Ananth Sankar
    Proceedings of ICASSP 2003, also in Proceedings of NNSP 2003
    Mixtures of Inverse Covariances: Covariance Modeling for Gaussian Mixtures with Applications to Automatic Speech Recognition
    Ph.D. Thesis, Stanford University (2003)
    Interpretability in Multidimensional Classification
    Rosaria Silipo
    Interpretability Issues in Fuzzy Modeling, Springer-Verlag (2003), pp. 193-217
    Speaker-Trained Recognition using Allophonic Enrollment Models
    Michael M. Hochberg
    Christopher J. Leggetter
    Proceedings of ASRU 2001
    Effects of Prompt Style when Navigating through Structured Data
    W. Lawrence Neeley
    Maria Mortati
    Michael J. Sloan
    Clifford Nass
    Proceedings of INTERACT 2001, Eighth IFIP TC.13 Conference on Human Computer Interaction, pp. 530-536