Jump to Content
Kanishka Rao

Kanishka Rao

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
    Alexander Herzog
    Alexander Toshkov Toshev
    Anthony Brohan
    Brian Andrew Ichter
    Byron David
    Clayton Tan
    Diego Reyes
    Dmitry Kalashnikov
    Eric Victor Jang
    Jarek Liam Rettinghouse
    Jornell Lacanlale Quiambao
    Julian Ibarz
    Kyle Alan Jeffrey
    Linda Luu
    Mengyuan Yan
    Michael Soogil Ahn
    Nicolas Sievers
    Noah Brown
    Omar Eduardo Escareno Cortes
    Peng Xu
    Peter Pastor Sampedro
    Rosario Jauregui Ruano
    Sally Augusta Jesmonth
    Steve Xu
    Yao Lu
    Yevgen Chebotar
    Yuheng Kuang
    Conference on Robot Learning 2022 (2022)
    Preview abstract Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator. The project's website and the video can be found at \url{say-can.github.io}. View details
    Preview abstract Robots trained via reinforcement-learning (RL) requirecollecting and labeling many real-world episodes, whichmay be costly and time-consuming. Training models with alarge amount of simulation is a cheaper alternative. How-ever, simulations are not perfect and such models may nottransfer to the real world. Techniques developed to closethis simulation-to-reality (Sim2Real) gap typically applyrandomization to the simulated images or adapt them withan additional Sim2Real model. A Generative Adversar-ial network (GAN) may be used to adapt the pixels of thesimulated image to be more realistic before use by a deepRL model. We find the CycleGAN which enforces a cycleconsistency between Sim2Real and Real2Sim adaptationsproduces better images for RL than a GAN alone. Ulti-mately, we develop RL-CycleGAN which includes a Cycle-GAN which trains jointly with the deep RL model and en-forces that the RL model is consistent across all the adap-tations.We evaluate the RL-CycleGAN on two vision-based robotics grasping tasks and compare it to previoustechniques. With 580,000 real episodes and millions ofsimulated episodes adapted with RL-CycleGAN achievesxx% grasp success, while a previous GAN-based approach,GraspGAN, achieves xx% grasp success. With only 5,000real episodes, RL-CycleGAN and GraspGAN achieve xx%and xx% grasp success respectively. On a multi-bin grasp-ing task, we show RL-CycleGAN drastically improves dataefficiency requiring 1/xth the amount of real data to reachthe same grasping performance. View details
    Preview abstract We show that a word-level recurrent neural network can predict emoji from text typed on a mobile keyboard. We demonstrate the usefulness of transfer learning for predicting emoji by pretraining the model using a language modeling task. We also propose mechanisms to trigger emoji and tune the diversity of candidates. The model is trained using a distributed on-device learning framework called federated learning. The federated model is shown to achieve better performance than a server-trained model. This work demonstrates the feasibility of using federated learning to train production-quality models for natural language understanding tasks while keeping users' data on their devices. View details
    Preview abstract We train a recurrent neural network language model using a distributed, on-device learning framework called federated learning for the purpose of next-word prediction in a virtual keyboard for smartphones. Server-based training using stochastic gradient descent is compared with training on client devices using the Federated Averaging algorithm. The federated algorithm, which enables training on a higher-quality dataset for this use case, is shown to achieve better prediction recall. This work demonstrates the feasibility and benefit of training language models on client devices without exporting sensitive user data to servers. The federated learning environment gives users greater control over their data and simplifies the task of incorporating privacy by default with distributed training and aggregation across a population of client devices. View details
    Preview abstract End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories. View details
    Preview abstract Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In our previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore techniques such as synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500 hour voice search task, we find that the proposed changes improve the WER of the LAS system from 9.2% to 5.6%, while the best conventional system achieve 6.7% WER. We also test both models on a dictation dataset, and our model provide 4.1% WER while the conventional system provides 5% WER. View details
    Preview abstract Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we present a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts. Specifically, we take a union of language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we further improve performance by an additional 7% relative and eliminate confusion between different languages. View details
    Preview abstract We train a recurrent neural network language model using a distributed, on-device learning framework called federated learning for the purpose of next-word prediction in a virtual keyboard for smartphones. Server-based training using stochastic gradient descent is compared with training on client devices using the FederatedAveraging algorithm. The federated algorithm, which enables training on a higher-quality dataset for this use case, is shown to achieve better prediction recall. This work demonstrates the feasibility and benefit of training language models on client devices without exporting sensitive user data to servers. The federated learning environment gives users greater control over their data and simplifies the task of incorporating privacy by default with distributed training and aggregation across a population of client devices. View details
    Preview abstract We develop streaming keyword spotting systems using a recurrent neural network transducer (RNN-T) model: an all-neural, end-to-end trained, sequence-to-sequence model which jointly learns acoustic and language model components. Our models are trained to predict either phonemes or graphemes as subword units, thus allowing us to detect arbitrary keyword phrases, without any out-of-vocabulary words. In order to adapt the models to the requirements of keyword spotting, we propose a novel technique which biases the RNN-T system towards a specific keyword of interest. Our systems are compared against a strong sequence-trained, connectionist temporal classification (CTC) based “keyword-filler” baseline, which is augmented with a separate phoneme language model. Overall, our RNN-T system with the proposed biasing technique significantly improves performance over the baseline system. View details
    Preview abstract In this paper, we conduct a detailed investigation of attention-based models for automatic speech recognition (ASR). First, we explore different types of attention, including online and full-sequence attention. Second, we explore different sub-word units to see how much of the end-to-end ASR process can reasonably be captured by an attention model. In experimental evaluations, we find that although attention is typically focussed over a small region of the acoustics during each step of next label prediction, full sequence attention outperforms “online” attention, although this gap can be significantly reduced by increasing the length of the segments over which attention is computed. Furthermore, we find that content-independent phonemes are a reasonable sub-word unit for attention models; when used in the second-pass to rescore N-best hypotheses these models provide over a 10% relative improvement in word error rate. View details
    Preview abstract Automatic speech recognition relies on pronunciation dictionaries for accurate results and previous work used pronunciation learning algorithms to build them. Efficient algorithms must balance having the ability to learn varied pronunciations while being constrained enough to be robust. Our approach extends one of such algorithms \cite{Kou2015} by replacing a finite state transducer (FST) built from a limited-size candidate list with a general and flexible FST building mechanism. This architecture can accommodate a wide variety of pronunciation predictions and can also learn pronunciations without having the written form. It can also use an FST built from a recursive neural network (RNN) and tune the importance given to the written form. The new approach reduces the number of incorrect pronunciations learned by up to 25% (relative) on a random sampling of Google voice traffic View details
    Preview abstract In this work, we conduct a detailed evaluation of various all-neural, end-to-end trained, sequence-to-sequence models applied to the task of speech recognition. Notably, each of these systems directly predicts graphemes in the written domain, without using an external pronunciation lexicon, or a separate language model. We examine several sequence-to-sequence models including connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, an attention-based model, and a model which augments the RNN-transducer with an attention mechanism. We find that end-to-end models are capable of learning all components of the speech recognition process: acoustic, pronunciation, and language models, directly outputting words in the written form (e.g., “one hundred dollars” to “$100”), in a single jointly-optimized neural network. Furthermore, the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline outperforms these models on voice-search test sets. View details
    Preview abstract We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an `encoder', which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a `decoder' which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units (`wordpieces') which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets, is comparable in performance to a state-of-the-art baseline on dictation and voice-search tasks. View details
    Preview abstract We explore the viability of grapheme-based recognition specifically how it compares to phoneme-based equivalents. We utilize the CTC loss to train models to directly predict graphemes, we also train models with hierarchical CTC and show that they improve on previous CTC models. We also explore how the grapheme and phoneme models scale with large data sets, we consider a single acoustic training data set where we combine various dialects of English from US, UK, India and Australia. We show that by training a single grapheme-based model on this multi-dialect data set we create a accent-robust ASR system View details
    Preview abstract We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time. View details
    Preview abstract We present a new procedure to train acoustic models from scratch for large vocabulary speech recognition requiring no previous model for alignments or boot-strapping. We augment the Connectionist Temporal Classification (CTC) objective function to allow training of acoustic models directly from a parallel corpus of audio data and transcribed data. With this augmented CTC function we train a phoneme recognition acoustic model directly from the written-domain transcript. Further, we outline a mechanism to generate a context-dependent phonemes from a CTC model trained to predict phonemes and ultimately train a second CTC model to predict these context-dependent phonemes. Since this approach does not require training of any previous non-CTC model it drastically reduces the overall data-to-model training time from 30 days to 10 days. Additionally, models obtain from this flatstart-CTC procedure outperform the state-of-the-art by XX-XX\%. View details
    Preview abstract Word pronunciations, consisting of phoneme sequences and the associated syllabification and stress patterns, are vital for both speech recognition and text-to-speech (TTS) systems. For speech recognition phoneme sequences for words may be learned from audio data. We train recurrent neural network (RNN) based models to predict the syllabification and stress pattern for such pronunciations making them usable for TTS. We find these RNN models significantly outperform naive rulebased models for almost all languages we tested. Further, we find additional improvements to the stress prediction model by using the spelling as features in addition to the phoneme sequence. Finally, we train a single RNN model to predict the phoneme sequence, syllabification and stress for a given word. For several languages, this single RNN outperforms similar models trained specifically for either phoneme sequence or stress prediction. We report an exhaustive comparison of these approaches for twenty languages. View details
    Preview abstract This paper describes a series of experiments to extend the application of Context-Dependent (CD) long short-term memory (LSTM) recurrent neural networks (RNNs) trained with Connectionist Temporal Classification (CTC) and sMBR loss. Our experiments, on a noisy, reverberant voice search task, include training with alternative pronunciations and the application to child speech recognition; combination of multiple models, and convolutional input layers. We also investigate the latency of CTC models and show that constraining forward-backward alignment in training can reduce the delay for a real-time streaming speech recognition system. Finally we investigate transferring knowledge from one network to another through alignments View details
    No Results Found