Carolina Parada
Carolina Parada is a Senior Engineering Manager at Google AI. She leads the robot mobility and robot vision groups, which focus on enabling safe, autonomous, and agile mobile robots in human-centered environments through machine learning. Prior to that, she led the camera perception team for self-driving cars at Nvidia for 2 years. She was also a lead with Speech @ Google for 7 years, where she drove multiple research and engineering efforts that enabled Ok Google, the Google Assistant, and Voice-Search.
Authored Publications
A Contextual Bandit Approach for Learning to Plan in Environments with Probabilistic Goal Configurations
Sohan Rudra
Saksham Goel
Gaurav Aggarwal
NeurIPS 5th Robot Learning Workshop: Trustworthy Robotics (2022) (to appear)
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Alexander Herzog
Alexander Toshkov Toshev
Andy Zeng
Anthony Brohan
Brian Andrew Ichter
Byron David
Chelsea Finn
Clayton Tan
Diego Reyes
Dmitry Kalashnikov
Eric Victor Jang
Jarek Liam Rettinghouse
Jornell Lacanlale Quiambao
Julian Ibarz
Karol Hausman
Kyle Alan Jeffrey
Linda Luu
Mengyuan Yan
Michael Soogil Ahn
Nicolas Sievers
Noah Brown
Omar Eduardo Escareno Cortes
Peng Xu
Peter Pastor Sampedro
Rosario Jauregui Ruano
Sally Augusta Jesmonth
Sergey Levine
Steve Xu
Yao Lu
Yevgen Chebotar
Yuheng Kuang
Conference on Robot Learning (CoRL) (2022)
Preview abstract
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context.
For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment.
We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate.
The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.
We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment.
We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator.
The project's website and video can be found at say-can.github.io.
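The grounding idea in this abstract can be illustrated with a minimal sketch. It assumes that each candidate skill has a language-model score (how useful the skill is as a next step) and a value-function score (how likely the skill is to succeed in the current state), and that the two are combined by multiplication. The abstract does not state the combination rule, and the function name, skill names, and numbers below are hypothetical.

```python
def choose_skill(llm_scores, affordances):
    """Pick the skill with the highest combined score.

    llm_scores:  skill -> language-model score for the skill as a next step
    affordances: skill -> value-function estimate that the skill can succeed
    Multiplying the two scores is an assumption made for this sketch.
    """
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for the instruction "clean up the spill".
llm_scores = {
    "find a sponge": 0.5,
    "pick up the sponge": 0.4,
    "go to the table": 0.1,
}
affordances = {
    "find a sponge": 0.9,       # feasible from the current state
    "pick up the sponge": 0.1,  # no sponge in view yet
    "go to the table": 0.8,
}

best, combined = choose_skill(llm_scores, affordances)
print(best)
```

In this toy example the language model rates "pick up the sponge" almost as highly as "find a sponge", but the low value-function estimate suppresses it, which is the kind of grounding the abstract describes.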
Learning Model Predictive Controllers with Real-Time Attention for Real-World Navigation
Anthony G. Francis
Dmitry Kalashnikov
Edward Lee
Jake Varley
Leila Takayama
Mikael Persson
Peng Xu
Stephen Tu
Xuesu Xiao
Conference on Robot Learning (2022) (to appear)
Preview abstract
Despite decades of research, existing navigation systems still face real-world challenges when being deployed in the wild, e.g., in cluttered home environments or in human-occupied public spaces. To address this, we present a new class of implicit control policies combining the benefits of imitation learning with the robust handling of system constraints of Model Predictive Control (MPC). Our approach, called Performer-MPC, uses a learned cost function parameterized by vision context embeddings provided by Performers, a low-rank implicit-attention Transformer. We jointly train the cost function and construct the controller relying on it, effectively solving end-to-end the corresponding bi-level optimization problem. We show that the resulting policy improves standard MPC performance by leveraging a few expert demonstrations of the desired navigation behavior in different challenging real-world scenarios. Compared with a standard MPC policy, Performer-MPC achieves 40% better goal reached in cluttered environments and 65% better sociability when navigating around humans.
Preview abstract
The task of endpointing is to determine when the user has finished speaking, which is important for interactive speech applications such as voice search and Google Home. In this paper, we propose a GLDNN-based (grid long short-term memory, deep neural network) endpointer model and show that it provides significant improvements over a state-of-the-art CLDNN (convolutional, long short-term memory, deep neural networks) model. Specifically, we replace the convolution layer with a grid LSTM layer that models both spectral and temporal variations through recurrent connections. Results show that the GLDNN achieves 39% relative improvement in false alarm rate at a fixed false reject rate of 2%, and reduces median latency by 11%. We also include detailed experiments investigating why grid LSTMs offer better performance than CLDNNs. Analysis reveals that the recurrent connection along the frequency axis is an important factor that greatly contributes to the performance of grid LSTMs, especially in the presence of background noise. Finally, we also show that multichannel input further increases robustness to background speech. Overall, we achieved 16% (100 ms) endpointer latency improvement relative to our previous best model.
Preview abstract
We present a novel approach for improving the overall quality of keyword spotting using a contextual automatic speech recognition (ASR) system. On voice-activated devices with limited resources, it is common that a keyword spotting system is run on the device in order to detect a trigger phrase (e.g. “ok google”) and decide which audio should be sent to the server (to be transcribed by the ASR system and processed to generate a response to the user). Due to limited resources on a device, the device keyword spotting system might introduce false accepts (FAs) and false rejects (FRs) that can cause a negative user experience. We describe a system that uses server-side contextual ASR and dynamic classes for improved keyword spotting. We show that this method can significantly reduce FA rates (by 89%) while minimally increasing FR rate (0.15%). Furthermore, we show that this system helps reduce Word Error Rate (WER) (by 10% to 50% relative, on different test sets) and allows users to speak seamlessly, without pausing between the trigger phrase and the command.
Preview abstract
In many streaming speech recognition applications such as voice search, it is important to determine quickly and accurately when the user has finished speaking their query. A conventional approach to this task is to declare end-of-query whenever a fixed interval of silence is detected by a voice activity detector (VAD) trained to classify each frame as speech or silence. However, silence detection and end-of-query detection are fundamentally different tasks, and the criterion used during VAD training may not be optimal. In particular, the conventional approach ignores potential acoustic cues such as filler sounds and past speaking rate which may indicate whether a given pause is temporary or query-final. In this paper we present a simple modification to make the conventional VAD training criterion more closely related to end-of-query detection. A unidirectional long short-term memory architecture allows the system to remember past acoustic events, and the training criterion incentivizes the system to learn to use any acoustic cues relevant to predicting future user intent. We show experimentally that this approach improves latency at a given accuracy for end-of-query detection for voice search.
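The conventional approach described in this abstract, declaring end-of-query after a fixed interval of silence, can be sketched as follows. This is an illustrative sketch only: the function name, the per-frame boolean input, and the choice to ignore silence before any speech has been heard are assumptions, not details from the paper.

```python
def end_of_query_frame(is_speech, silence_frames):
    """Return the index of the frame where end-of-query is declared.

    is_speech:      per-frame VAD decisions (truthy = speech, falsy = silence)
    silence_frames: number of consecutive silent frames that triggers the endpoint
    Returns None if the endpoint is never reached.
    """
    run = 0              # length of the current silence run
    seen_speech = False  # assumption: do not endpoint before any speech
    for i, speech in enumerate(is_speech):
        if speech:
            seen_speech = True
            run = 0
        elif seen_speech:
            run += 1
            if run >= silence_frames:
                return i
    return None


# Hypothetical VAD output: speech, a short pause, more speech, then final silence.
frames = [0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

print(end_of_query_frame(frames, silence_frames=3))
print(end_of_query_frame(frames, silence_frames=2))
```

With a threshold of 3 frames the endpoint falls in the final silence, but with a threshold of 2 the rule fires during the short mid-query pause. That trade-off between latency and premature cutoff is what motivates using cues beyond silence duration.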
Preview abstract
Voice Activity Detection (VAD) is an important preprocessing step in any state-of-the-art speech recognition system. Choosing the right set of features and model architecture can be challenging and is an active area of research. In this paper we propose a novel approach to VAD to tackle both feature and model selection jointly. The proposed method is based on a CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Networks) architecture fed directly with the raw waveform. We show that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially in noisy environments. In addition, using a CLDNN, which takes advantage of both frequency modeling with the CNN and temporal modeling with the LSTM, is a much better model for VAD compared to the DNN. The proposed system achieves over 78% relative improvement in False Alarms (FA) at the operating point of 2% False Rejects (FR) on both clean and noisy conditions compared to a DNN of comparable size trained with log-mel features. In addition, we study the impact of the model size and the learned features to provide a better understanding of the proposed architecture.
Personalized Speech Recognition On Mobile Devices
Raziel Alvarez
David Rybach
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
Preview abstract
We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
Locally-Connected and Convolutional Neural Networks for Small Footprint Speaker Recognition
Yu-hsin Chen
Mirkó Visontai
Raziel Alvarez
Interspeech (2015)