Carolina Parada

Carolina Parada is a Senior Engineering Manager at Google AI. She leads the robot mobility and robot vision groups, which focus on enabling safe, autonomous, and agile mobile robots in human-centered environments through machine learning. Prior to that, she led the camera perception team for self-driving cars at Nvidia for 2 years. She was also a lead with Speech @ Google for 7 years, where she drove multiple research and engineering efforts that enabled Ok Google, the Google Assistant, and Voice Search.
Authored Publications
    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
    Alexander Herzog
    Alexander Toshkov Toshev
    Andy Zeng
    Anthony Brohan
    Brian Andrew Ichter
    Byron David
    Chelsea Finn
    Clayton Tan
    Diego Reyes
    Dmitry Kalashnikov
    Eric Victor Jang
    Jarek Liam Rettinghouse
    Jornell Lacanlale Quiambao
    Julian Ibarz
    Karol Hausman
    Kyle Alan Jeffrey
    Linda Luu
    Mengyuan Yan
    Michael Soogil Ahn
    Nicolas Sievers
    Noah Brown
    Omar Eduardo Escareno Cortes
    Peng Xu
    Peter Pastor Sampedro
    Rosario Jauregui Ruano
    Sally Augusta Jesmonth
    Sergey Levine
    Steve Xu
    Yao Lu
    Yevgen Chebotar
    Yuheng Kuang
    Conference on Robot Learning (CoRL) (2022)
    Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator. The project's website and the video can be found at say-can.github.io.
    Learning Model Predictive Controllers with Real-Time Attention for Real-World Navigation
    Anthony G. Francis
    Dmitry Kalashnikov
    Edward Lee
    Jake Varley
    Leila Takayama
    Mikael Persson
    Peng Xu
    Stephen Tu
    Xuesu Xiao
    Conference on Robot Learning (2022) (to appear)
    Despite decades of research, existing navigation systems still face real-world challenges when being deployed in the wild, e.g., in cluttered home environments or in human-occupied public spaces. To address this, we present a new class of implicit control policies combining the benefits of imitation learning with the robust handling of system constraints of Model Predictive Control (MPC). Our approach, called Performer-MPC, uses a learned cost function parameterized by vision context embeddings provided by Performers, a low-rank implicit-attention Transformer. We jointly train the cost function and construct the controller relying on it, effectively solving end-to-end the corresponding bi-level optimization problem. We show that the resulting policy improves standard MPC performance by leveraging a few expert demonstrations of the desired navigation behavior in different challenging real-world scenarios. Compared with a standard MPC policy, Performer-MPC achieves 40% better goal reached in cluttered environments and 65% better sociability when navigating around humans.
    Object-goal navigation (Object-nav) entails searching, recognizing and navigating to a target object. Object-nav has been extensively studied by the Embodied-AI community, but most solutions are restricted to considering static objects (e.g., television, fridge). We propose a modular framework for object-nav that is able to efficiently search indoor environments for not just static objects but also movable objects (e.g., fruits, glasses, phones) that frequently change their positions due to human interaction. Our contextual-bandit agent efficiently explores the environment by showing optimism in the face of uncertainty and learns a model of the likelihood of spotting different objects from each navigable location. The likelihoods are used as rewards in a weighted minimum latency solver to deduce a trajectory for the robot. We evaluate our algorithms in two simulated environments and a real-world setting, to demonstrate high sample efficiency and reliability.
    We present a novel approach for improving the overall quality of keyword spotting using a contextual automatic speech recognition (ASR) system. On voice-activated devices with limited resources, it is common that a keyword spotting system is run on the device in order to detect a trigger phrase (e.g., “ok google”) and decide which audio should be sent to the server (to be transcribed by the ASR system and processed to generate a response to the user). Due to limited resources on a device, the device keyword spotting system might introduce false accepts (FAs) and false rejects (FRs) that can cause a negative user experience. We describe a system that uses server-side contextual ASR and dynamic classes for improved keyword spotting. We show that this method can significantly reduce FA rates (by 89%) while minimally increasing FR rate (0.15%). Furthermore, we show that this system helps reduce Word Error Rate (WER) (by 10% to 50% relative, on different test sets) and allows users to speak seamlessly, without pausing between the trigger phrase and the command.
    The task of endpointing is to determine when the user has finished speaking, which is important for interactive speech applications such as voice search and Google Home. In this paper, we propose a GLDNN-based (grid long short-term memory, deep neural network) endpointer model and show that it provides significant improvements over a state-of-the-art CLDNN (convolutional, long short-term memory, deep neural networks) model. Specifically, we replace the convolution layer with a grid LSTM layer that models both spectral and temporal variations through recurrent connections. Results show that the GLDNN achieves 39% relative improvement in false alarm rate at a fixed false reject rate of 2%, and reduces median latency by 11%. We also include detailed experiments investigating why grid LSTMs offer better performance than CLDNNs. Analysis reveals that the recurrent connection along the frequency axis is an important factor that greatly contributes to the performance of grid LSTMs, especially in the presence of background noise. Finally, we also show that multichannel input further increases robustness to background speech. Overall, we achieved 16% (100 ms) endpointer latency improvement relative to our previous best model.
    In many streaming speech recognition applications such as voice search, it is important to determine quickly and accurately when the user has finished speaking their query. A conventional approach to this task is to declare end-of-query whenever a fixed interval of silence is detected by a voice activity detector (VAD) trained to classify each frame as speech or silence. However, silence detection and end-of-query detection are fundamentally different tasks, and the criterion used during VAD training may not be optimal. In particular, the conventional approach ignores potential acoustic cues such as filler sounds and past speaking rate which may indicate whether a given pause is temporary or query-final. In this paper, we present a simple modification to make the conventional VAD training criterion more closely related to end-of-query detection. A unidirectional long short-term memory architecture allows the system to remember past acoustic events, and the training criterion incentivizes the system to learn to use any acoustic cues relevant to predicting future user intent. We show experimentally that this approach improves latency at a given accuracy for end-of-query detection for voice search.
    We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
    Voice Activity Detection (VAD) is an important preprocessing step in any state-of-the-art speech recognition system. Choosing the right set of features and model architecture can be challenging and is an active area of research. In this paper we propose a novel approach to VAD to tackle both feature and model selection jointly. The proposed method is based on a CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Networks) architecture fed directly with the raw waveform. We show that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially for noisy environments. In addition, using a CLDNN, which takes advantage of both frequency modeling with the CNN and temporal modeling with LSTM, is a much better model for VAD compared to the DNN. The proposed system achieves over 78% relative improvement in False Alarms (FA) at the operating point of 2% False Rejects (FR) on both clean and noisy conditions compared to a DNN of comparable size trained with log-mel features. In addition, we study the impact of the model size and the learned features to provide a better understanding of the proposed architecture.
    Automatic Gain Control and Multi-style Training for Robust Small-Footprint Keyword Spotting with Deep Neural Networks
    Raziel Alvarez
    Preetum Nakkiran
    Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2015), pp. 4704-4708
    We explore techniques to improve the robustness of small-footprint keyword spotting models based on deep neural networks (DNNs) in the presence of background noise and in far-field conditions. We find that system performance can be improved significantly, with relative improvements up to 75% in far-field conditions, by employing a combination of multi-style training and a proposed novel formulation of automatic gain control (AGC) that estimates the levels of both speech and background noise. Further, we find that these techniques allow us to achieve competitive performance, even when applied to DNNs with an order of magnitude fewer parameters than our baseline.
    Compressing Deep Neural Networks using a Rank-Constrained Topology
    Preetum Nakkiran
    Raziel Alvarez
    Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), ISCA (2015), pp. 1473-1477
    We present a general approach to reduce the size of feed-forward deep neural networks (DNNs). We propose a rank-constrained topology, which factors the weights in the input layer of the DNN in terms of a low-rank representation: unlike previous work, our technique is applied at the level of the filters learned at individual hidden layer nodes, and exploits the natural two-dimensional time-frequency structure in the input. These techniques are applied on a small-footprint DNN-based keyword spotting task, where we find that we can reduce model size by 75% relative to the baseline, without any loss in performance. Furthermore, we find that the proposed approach is more effective at improving model performance compared to other popular dimensionality reduction techniques, when evaluated with a comparable number of parameters.
    Language Modeling for Automatic Speech Recognition Meets the Web: Google Search by Voice
    Johan Schalkwyk
    Boulos Harb
    Peng Xu
    Preethi Jyothi
    Thorsten Brants
    Vida Ha
    Will Neveitt
    University of Toronto (2012)
    A critical component of a speech recognition system targeting web search is the language model. The talk presents an empirical exploration of the google.com query stream with the end goal of high quality statistical language modeling for mobile voice search. Our experiments show that after text normalization the query stream is not as "wild" as it seems at first sight. One can achieve out-of-vocabulary rates below 1% using a one million word vocabulary, and excellent n-gram hit ratios of 77/88% even at high orders such as n=5/4, respectively. Using large scale, distributed language models can improve performance significantly, with up to 10% relative reductions in word-error-rate over conventional models used in speech recognition. We also find that the query stream is non-stationary, which means that adding more past training data beyond a certain point provides diminishing returns, and may even degrade performance slightly. Perhaps less surprisingly, we have shown that locale matters significantly for English query data across USA, Great Britain and Australia. In an attempt to leverage the speech data in voice search logs, we successfully build large-scale discriminative N-gram language models and derive small but significant gains in recognition performance.
    Language Modeling for Automatic Speech Recognition Meets the Web: Google Search by Voice
    Johan Schalkwyk
    Boulos Harb
    Peng Xu
    Thorsten Brants
    Vida Ha
    Will Neveitt
    OGI/OHSU Seminar Series, Portland, Oregon, USA (2011)
    The talk presents key aspects faced when building language models (LM) for the google.com query stream, and their use for automatic speech recognition (ASR). Distributed LM tools enable us to handle a huge amount of data, and experiment with LMs that are two orders of magnitude larger than usual. An empirical exploration of the problem led us to re-discovering a less known interaction between Kneser-Ney smoothing and entropy pruning, possible non-stationarity of the query stream, as well as strong dependence on various English locales: USA, Britain and Australia. LM compression techniques allowed us to use one billion n-gram LMs in the first pass of an ASR system built on FST technology, and evaluate empirically whether a two-pass system architecture has any losses over one pass.
    Query Language Modeling for Voice Search
    Johan Schalkwyk
    Thorsten Brants
    Vida Ha
    Boulos Harb
    Will Neveitt
    Peng Xu
    Proceedings of the 2010 IEEE Workshop on Spoken Language Technology, IEEE, pp. 127-132
    The paper presents an empirical exploration of google.com query stream language modeling. We describe the normalization of the typed query stream resulting in out-of-vocabulary (OoV) rates below 1% for a one million word vocabulary. We present a comprehensive set of experiments that guided the design decisions for a voice search service. In the process we re-discovered a less known interaction between Kneser-Ney smoothing and entropy pruning, and found empirical evidence that hints at non-stationarity of the query stream, as well as strong dependence on various English locales: USA, Britain and Australia.