Petar Aleksic

Authored Publications
    Improving Automatic Speech Recognition with Neural Embeddings
    Christopher Li
    2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, 111 8th Ave New York, NY 10011 (2021)
    A common challenge in automatic speech recognition (ASR) systems is successfully decoding utterances containing long-tail entities. Examples of entities include unique contact names and local restaurant names that may be out of vocabulary, and therefore absent from the training set. As a result, during decoding, such entities are assigned low likelihoods by the model and are unlikely to be recognized. In this paper, we apply retrieval in an embedding space to recover such entities. In this embedding space, embedding representations of phonetically similar entities are designed to be close to one another in cosine distance. We describe the neural networks and the infrastructure to produce such embeddings. We also demonstrate that using neural embeddings improves ASR quality, achieving a reduction of over 50% in word error rate (WER) on evaluation sets for popular media queries.
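The retrieval step described in this abstract can be illustrated with a minimal sketch of nearest-neighbor lookup by cosine distance. It assumes the embeddings already exist. The function name `retrieve_nearest`, the toy vectors, and the entity names are hypothetical and are not taken from the paper, which does not specify its retrieval code here.

```python
import numpy as np

def retrieve_nearest(query, entity_embeddings, entity_names, k=1):
    """Return the k entities closest to `query` in cosine distance.

    Hypothetical helper for illustration only; the embedding network that
    would produce these vectors is not modeled here.
    """
    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = entity_embeddings / np.linalg.norm(entity_embeddings, axis=1, keepdims=True)
    sims = e @ q
    # Highest similarity first, i.e. smallest cosine distance first.
    top = np.argsort(-sims)[:k]
    return [(entity_names[i], float(1.0 - sims[i])) for i in top]

# Toy data (assumed): two similar-sounding names and one unrelated name.
entity_names = ["Kaity", "Katie", "Cody"]
entity_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.2, 0.9],
])
query = np.array([1.0, 0.1, 0.0])

nearest = retrieve_nearest(query, entity_embeddings, entity_names, k=2)
```

A production system would likely replace the exhaustive dot product with an approximate nearest-neighbor index, but the distance being minimized would be the same.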
    End-to-end (E2E) mixed case automatic speech recognition (ASR) systems that directly predict words in the written domain are attractive due to being simple to build, not requiring explicit capitalization models, allowing streaming capitalization without additional effort beyond that required for streaming ASR, and their small size. However, the fact that these systems produce various versions of the same word with different capitalizations, and even different word segmentations for different case variants when wordpieces (WP) are predicted, leads to multiple problems with contextual ASR. In particular, the size and time to build contextual models grow considerably with the number of variants per word. In this paper, we propose separating orthographic recognition from capitalization, so that the ASR system first predicts a word, then predicts its capitalization in the form of a capitalization mask. We show that the use of capitalization masks achieves the same low error rate as traditional mixed case ASR, while reducing the size and compilation time of contextual models. Furthermore, we observe significant improvements in capitalization quality.
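The idea of predicting a word first and its capitalization second can be sketched as follows. The abstract does not specify the mask encoding, so this example assumes a per-character boolean mask. The function names `extract_capitalization_mask` and `apply_capitalization_mask` are hypothetical.

```python
def extract_capitalization_mask(word):
    """Split a written-domain word into a lowercase form and a per-character mask.

    Assumption: the mask holds one boolean per character, True where the
    character is uppercase. The paper's actual encoding may differ.
    """
    return word.lower(), [c.isupper() for c in word]

def apply_capitalization_mask(word, mask):
    """Rebuild the cased word from a lowercase word and its mask."""
    if len(word) != len(mask):
        raise ValueError("word and mask must have the same length")
    return "".join(c.upper() if m else c for c, m in zip(word, mask))

lowercase, mask = extract_capitalization_mask("YouTube")
restored = apply_capitalization_mask(lowercase, mask)
```

Under this assumption, every case variant of a word shares one lowercase entry, so a contextual model would need only that single form rather than one entry per variant.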
    Accurate recognition of numeric sequences is crucial for a number of contextual speech recognition applications. For example, a user might create a calendar event and be prompted by a virtual assistant for the time, date, and duration of the event. We propose using finite state transducers built from written domain numeric grammars to increase the likelihood of hypotheses matching these grammars during beam search in an end-to-end speech recognition system. Using our technique results in a significant reduction of word error rates (up to 59%) on a variety of numeric sequence recognition tasks (times, percentages, digit sequences).
    Recurrent Neural Network Transducer (RNN-T) models [1] for automatic speech recognition (ASR) provide high-accuracy speech recognition. Such end-to-end (E2E) models combine acoustic, pronunciation and language models (AM, PM, LM) of a conventional ASR system into a single neural network, dramatically reducing complexity and model size. In this paper, we propose a technique for incorporating contextual signals, such as intelligent assistant device state or dialog state, directly into RNN-T models. We explore different encoding methods and demonstrate that RNN-T models can effectively utilize such context. Our technique results in a reduction in Word Error Rate (WER) of up to 10.4% relative on a variety of contextual recognition tasks. We also demonstrate that proper regularization can be used to model context independently for improved overall quality.
    As voice-driven intelligent assistants become commonplace, adaptation to user context becomes critical for Automatic Speech Recognition (ASR) systems. For example, ASR systems may be expected to recognize a user’s contact names containing improbable or out-of-vocabulary (OOV) words. We introduce a method to identify contextual cues in a first-pass ASR system’s output and to recover out-of-lattice hypotheses that are contextually relevant. Our proposed module is agnostic to the architecture of the underlying recognizer, provided it generates a word lattice of hypotheses; it is sufficiently compact for use on device. The module identifies subgraphs in the lattice likely to contain named entities (NEs), recovers phoneme hypotheses over corresponding time spans, and inserts NEs that are phonetically close to those hypotheses. We measure a decrease in the mean word error rate (WER) of word lattices from 11.5% to 4.9% on a test set of NEs.
    Recent interest in intelligent assistants has increased demand for Automatic Speech Recognition (ASR) systems that can utilize contextual information to adapt to the user’s preferences or the current device state. For example, a user might be more likely to refer to their favorite songs when giving a “music playing” command or request to watch a movie starring a particular favorite actor when giving a “movie playing” command. Similarly, when a device is in a “music playing” state, a user is more likely to give volume control commands. In this paper, we explore using semantic information inside the ASR word lattice by employing Named Entity Recognition (NER) to identify and boost contextually relevant paths in order to improve speech recognition accuracy. We use broad semantic classes comprising millions of entities, such as songs and musical artists, to tag relevant semantic entities in the lattice. We show that our method reduces Word Error Rate (WER) by 12.0% relative on a Google Assistant “media playing” commands test set, while not affecting WER on a test set containing commands unrelated to media.
    Usage of foreign entities in automatic speech recognition (ASR) systems is prevalent in various applications, yet correctly recognizing these foreign words while preserving the accuracy on native words remains a challenge. We describe a novel approach for recognizing foreign words by injecting them with correctly mapped pronunciations into the recognizer decoder search space on-the-fly. The phoneme mapping between languages is learned automatically using acoustic coupling of text-to-speech (TTS) audio and a pronunciation learning algorithm. The mapping allows us to utilize the pronunciation dictionary in a foreign language by mapping the pronunciations to the target recognizer language's phoneme inventory. Evaluation of our algorithm on Google Assistant use cases shows we can recognize English media songs with high accuracy on French and German recognizers without hurting recognition on general traffic.
    Recent work has shown that end-to-end (E2E) speech recognition architectures such as Listen Attend and Spell (LAS) can achieve state-of-the-art quality results in LVCSR tasks. One benefit of this architecture is that it does not require a separately trained pronunciation model, language model, and acoustic model. However, this property also introduces a drawback: it is not possible to adjust language model contributions separately from the system as a whole. As a result, inclusion of dynamic, contextual information (such as nearby restaurants or upcoming events) into recognition requires a different approach from what has been applied in conventional systems. We introduce a technique to adapt the inference process to take advantage of contextual signals by adjusting the output likelihoods of the neural network at each step in the beam search. We apply the proposed method to a LAS E2E model and show its effectiveness in experiments on a voice search task with both artificial and real contextual information. Given optimal context, our system reduces WER from 9.2% to 3.8%. The results show that this technique is effective at incorporating context into the prediction of an E2E system. Index Terms: speech recognition, end-to-end, contextual speech recognition, neural network
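Adjusting output likelihoods at each beam search step can be illustrated with a small sketch. It is an assumed, simplified form of contextual biasing and not the paper's actual formulation: a fixed log-domain bonus is added to any token that continues a context phrase, given the tokens already in the hypothesis. The function name `bias_step_scores`, the `bonus` value, and the toy scores are hypothetical.

```python
def bias_step_scores(log_probs, prefix, context_phrases, bonus=2.0):
    """Return a copy of `log_probs` with a bonus on tokens that extend a context phrase.

    log_probs: dict mapping candidate next token -> log-probability at this step.
    prefix: list of tokens already in the hypothesis.
    context_phrases: list of token lists supplied as context (assumed input format).
    """
    adjusted = dict(log_probs)
    for phrase in context_phrases:
        # Find the longest hypothesis suffix that matches a proper prefix of the phrase.
        max_overlap = min(len(prefix), len(phrase) - 1)
        for n in range(max_overlap, -1, -1):
            if prefix[len(prefix) - n:] == phrase[:n]:
                next_token = phrase[n]
                if next_token in adjusted:
                    adjusted[next_token] += bonus
                break
    return adjusted

# Toy step (assumed scores): without context, "belle" outscores "bell".
step_log_probs = {"bell": -2.0, "belle": -1.0}
adjusted = bias_step_scores(
    step_log_probs,
    prefix=["find", "taco"],
    context_phrases=[["taco", "bell"]],
)
best_token = max(adjusted, key=adjusted.get)
```

In a full decoder this adjustment would be applied to every hypothesis in the beam before pruning, so that contextually relevant continuations survive to later steps.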
    We present a novel approach for improving the overall quality of keyword spotting using a contextual automatic speech recognition (ASR) system. On voice-activated devices with limited resources, it is common that a keyword spotting system is run on the device in order to detect a trigger phrase (e.g. “ok google”) and decide which audio should be sent to the server (to be transcribed by the ASR system and processed to generate a response to the user). Due to limited resources on a device, the device keyword spotting system might introduce false accepts (FAs) and false rejects (FRs) that can cause a negative user experience. We describe a system that uses server-side contextual ASR and dynamic classes for improved keyword spotting. We show that this method can significantly reduce FA rates (by 89%) while minimally increasing the FR rate (0.15%). Furthermore, we show that this system helps reduce Word Error Rate (WER) (by 10% to 50% relative, on different test sets) and allows users to speak seamlessly, without pausing between the trigger phrase and the command.
    It has been shown in the literature that automatic speech recognition systems can greatly benefit from contextual information [ref]. The contextual information can be used to simplify the search and improve recognition accuracy. The types of useful contextual information can include the name of the application the user is in, the contents on the user’s phone screen, the user’s location, a certain dialog state, etc. Building a separate language model for each of these types of context is not feasible due to limited resources or a limited amount of training data. In this paper we describe an approach for unsupervised learning of contextual information and automatic building of contextual (biasing) models. Our approach can be used to build a large number of small contextual models from a limited amount of available unsupervised training data. We describe how n-grams relevant for a particular context are automatically selected as well as how an optimal size of the final contextual model is chosen. Our experimental results show great accuracy improvements for several types of context.