Petar Aleksic
Authored Publications
Improving Automatic Speech Recognition with Neural Embeddings
Christopher Li
2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, 111 8th Ave, New York, NY 10011 (2021)
A common challenge in automatic speech recognition (ASR) systems is successfully decoding utterances containing long tail entities. Examples of entities include unique contact names and local restaurant names that may be out of vocabulary, and therefore absent from the training set. As a result, during decoding, such entities are assigned low likelihoods by the model and are unlikely to be recognized. In this paper, we apply retrieval in an embedding space to recover such entities. In the aforementioned embedding space, embedding representations of phonetically similar entities are designed to be close to one another in cosine distance. We describe the neural networks and the infrastructure to produce such embeddings. We also demonstrate that using neural embeddings improves ASR quality by achieving an over 50% reduction in word error rate (WER) on evaluation sets for popular media queries.
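The retrieval step described above can be illustrated with a minimal sketch. It assumes embeddings are already available as plain lists of floats; the entity names, the toy vectors, and the function names are hypothetical and are not taken from the paper, which uses neural networks to produce the embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_nearest(query, index, top_k=1):
    """Return the names of the top_k entities closest to the query embedding."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# Hypothetical toy index: phonetically similar entities get nearby vectors.
ENTITY_INDEX = {
    "Kaity": [0.9, 0.1, 0.0],
    "Katie": [0.88, 0.15, 0.02],
    "Pizzeria Uno": [0.0, 0.2, 0.95],
}

if __name__ == "__main__":
    print(retrieve_nearest([0.05, 0.25, 0.9], ENTITY_INDEX))
```

A production system would likely use an approximate nearest-neighbor index rather than sorting every entity, but the exhaustive version keeps the idea visible.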
Incorporating Written Domain Numeric Grammars Into End-to-End Contextual Speech Recognition Systems For Improved Recognition of Numeric Sequences
Ben Haynor
2019 IEEE Automatic Speech Recognition and Understanding Workshop (2020)
Accurate recognition of numeric sequences is crucial for a number of contextual speech recognition applications. For example, a user might create a calendar event and be prompted by a virtual assistant for the time, date, and duration of the event. We propose using finite state transducers built from written domain numeric grammars to increase the likelihood of hypotheses matching these grammars during beam search in an end-to-end speech recognition system. Our technique yields significant reductions in word error rate (up to 59%) on a variety of numeric sequence recognition tasks (times, percentages, digit sequences).
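As a rough illustration of grammar-based biasing, the sketch below boosts hypotheses that match a written-domain time pattern. It makes two simplifying assumptions that differ from the abstract: a regular expression stands in for the finite state transducer, and the bonus is applied to a finished n-best list rather than during beam search. The pattern, the bonus value, and the function name are hypothetical.

```python
import re

# Regex stand-in for a written-domain "time" grammar (12-hour clock, e.g. 7:30).
TIME_GRAMMAR = re.compile(r"\b(1[0-2]|0?[1-9]):[0-5][0-9]\b")

def bias_hypotheses(hypotheses, bonus=2.0):
    """Add a log-score bonus to hypotheses containing a grammar match.

    hypotheses: list of (text, log_score) pairs; higher scores are better.
    Returns the rescored list sorted from best to worst.
    """
    rescored = [
        (text, score + bonus if TIME_GRAMMAR.search(text) else score)
        for text, score in hypotheses
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    nbest = [("set alarm for seven thirty", -4.0), ("set alarm for 7:30", -5.0)]
    for text, score in bias_hypotheses(nbest):
        print(score, text)
```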
Recurrent Neural Network Transducer (RNN-T) models [1] for automatic speech recognition (ASR) provide high-accuracy speech recognition. Such end-to-end (E2E) models combine the acoustic, pronunciation, and language models (AM, PM, LM) of a conventional ASR system into a single neural network, dramatically reducing complexity and model size. In this paper, we propose a technique for incorporating contextual signals, such as intelligent assistant device state or dialog state, directly into RNN-T models. We explore different encoding methods and demonstrate that RNN-T models can effectively utilize such context. Our technique reduces Word Error Rate (WER) by up to 10.4% relative on a variety of contextual recognition tasks. We also demonstrate that proper regularization can be used to model context independently for improved overall quality.
End-to-end (E2E) mixed-case automatic speech recognition (ASR) systems
that directly predict words in the written domain are attractive
due to being simple to build, not requiring explicit capitalization
models, allowing streaming capitalization without additional effort
beyond that required for streaming ASR, and their small size.
However, the fact that these systems produce various versions of the same
word with different capitalizations, and even different word
segmentations for different case variants when wordpieces (WP) are predicted,
leads to multiple problems with contextual ASR. In particular,
the size of contextual models and the time to build them grow considerably
with the number of variants per word. In this paper, we propose
separating orthographic recognition from capitalization, so that the
ASR system first predicts a word, then predicts its capitalization in
the form of a capitalization mask. We show that the use of capitalization
masks achieves the same low error rate as traditional mixed
case ASR, while reducing the size and compilation time of contextual models.
Furthermore, we observe significant improvements in capitalization quality.
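To make the idea of a capitalization mask concrete, here is a minimal sketch. It assumes the mask is a per-character sequence of booleans applied to a lowercase word; the abstract does not specify the mask representation, so this encoding and the function names are hypothetical.

```python
def capitalization_mask(word):
    """Derive a per-character mask: True where the character is uppercase."""
    return [ch.isupper() for ch in word]

def apply_capitalization_mask(word, mask):
    """Uppercase the characters of a lowercase word where the mask is set."""
    if len(word) != len(mask):
        raise ValueError("word and mask must have the same length")
    return "".join(ch.upper() if bit else ch for ch, bit in zip(word, mask))

if __name__ == "__main__":
    mask = capitalization_mask("YouTube")
    print(apply_capitalization_mask("youtube", mask))
```

With this split, a contextual model only needs one entry per lowercase word, and the case variants are recovered by the mask.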
Contextual Recovery of Out-of-Lattice Named Entities in Automatic Speech Recognition
Jack Serrino
ISCA Interspeech 2019, ISCA, Graz, Austria (2019), pp. 3830-3834
As voice-driven intelligent assistants become commonplace, adaptation to user context becomes critical for Automatic Speech Recognition (ASR) systems. For example, ASR systems may be expected to recognize a user’s contact names containing improbable or out-of-vocabulary (OOV) words.
We introduce a method to identify contextual cues in a first-pass ASR system’s output and to recover out-of-lattice hypotheses that are contextually relevant. Our proposed module is agnostic to the architecture of the underlying recognizer, provided it generates a word lattice of hypotheses; it is sufficiently compact for use on device. The module identifies subgraphs in the lattice likely to contain named entities (NEs), recovers phoneme hypotheses over corresponding time spans, and inserts NEs that are phonetically close to those hypotheses. We measure a decrease in the mean word error rate (WER) of word lattices from 11.5% to 4.9% on a test set of NEs.
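One simple way to make "phonetically close" concrete is an edit distance over phoneme sequences. The sketch below is an assumption for illustration only: the abstract does not say which distance the module uses, and the phoneme symbols, entity pronunciations, and threshold are hypothetical.

```python
def phoneme_edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if pa == pb else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]

def closest_entities(hyp_phonemes, entity_prons, max_distance=1):
    """Return entity names whose pronunciation is within max_distance edits."""
    return [
        name
        for name, pron in entity_prons.items()
        if phoneme_edit_distance(hyp_phonemes, pron) <= max_distance
    ]

# Hypothetical contact names with toy pronunciations.
CONTACTS = {
    "Katie": ["k", "ey", "t", "iy"],
    "Kathy": ["k", "ae", "th", "iy"],
}

if __name__ == "__main__":
    print(closest_entities(["k", "ey", "d", "iy"], CONTACTS))
```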
Semantic Lattice Processing in Contextual Automatic Speech Recognition for Google Assistant
Ian Williams
Justin Scheiner
Pedro Moreno
Interspeech 2018, ISCA (2018), pp. 2222-2226
Recent interest in intelligent assistants has increased demand for Automatic Speech Recognition (ASR) systems that can utilize contextual information to adapt to the user’s preferences or the current device state. For example, a user might be more likely to refer to their favorite songs when giving a “music playing” command or request to watch a movie starring a particular favorite actor when giving a “movie playing” command. Similarly, when a device is in a “music playing” state, a user is more likely to give volume control commands.
In this paper, we explore using semantic information inside the ASR word lattice by employing Named Entity Recognition (NER) to identify and boost contextually relevant paths in order to improve speech recognition accuracy. We use broad semantic classes comprising millions of entities, such as songs and musical artists, to tag relevant semantic entities in the lattice. We show that our method reduces Word Error Rate (WER) by 12.0% relative on a Google Assistant “media playing” commands test set, while not affecting WER on a test set containing commands unrelated to media.
Usage of foreign entities in automatic speech recognition (ASR) systems is prevalent in various applications, yet correctly recognizing these foreign words while preserving accuracy on native words remains a challenge. We describe a novel approach for recognizing foreign words by injecting them, with correctly mapped pronunciations, into the recognizer decoder search space on the fly. The phoneme mapping between languages is learned automatically using acoustic coupling of text-to-speech (TTS) audio and a pronunciation learning algorithm. The mapping allows us to utilize the pronunciation dictionary in a foreign language by mapping the pronunciations to the target recognizer language's phoneme inventory. Evaluation of our algorithm on Google Assistant use cases shows we can recognize English media songs with high accuracy on French and German recognizers without hurting recognition on general traffic.
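The final step, applying a phoneme mapping to a foreign-language pronunciation, can be sketched as follows. The mapping table here is a hand-written toy with hypothetical symbols; in the work described above the mapping is learned automatically, which this sketch does not attempt.

```python
# Hypothetical source-to-target phoneme mapping. One source phoneme may map
# to several target phonemes, so values are lists.
TOY_MAPPING = {
    "th": ["s"],
    "ih": ["i"],
    "ng": ["n", "g"],
}

def map_pronunciation(phonemes, mapping):
    """Rewrite a pronunciation into the target phoneme inventory.

    Phonemes without a mapping entry are passed through unchanged.
    """
    mapped = []
    for phoneme in phonemes:
        mapped.extend(mapping.get(phoneme, [phoneme]))
    return mapped

if __name__ == "__main__":
    print(map_pronunciation(["th", "ih", "ng"], TOY_MAPPING))
```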
Recent work has shown that end-to-end (E2E) speech
recognition architectures such as Listen Attend and Spell (LAS)
can achieve state-of-the-art quality results in LVCSR tasks. One
benefit of this architecture is that it does not require a separately
trained pronunciation model, language model, and acoustic
model. However, this property also introduces a drawback:
it is not possible to adjust language model contributions separately
from the system as a whole. As a result, inclusion of
dynamic, contextual information (such as nearby restaurants or
upcoming events) into recognition requires a different approach
from what has been applied in conventional systems.
We introduce a technique to adapt the inference process
to take advantage of contextual signals by adjusting the output
likelihoods of the neural network at each step in the beam
search. We apply the proposed method to a LAS E2E model
and show its effectiveness in experiments on a voice search task
with both artificial and real contextual information. Given optimal
context, our system reduces WER from 9.2% to 3.8%.
The results show that this technique is effective at incorporating
context into the prediction of an E2E system.
Index Terms: speech recognition, end-to-end, contextual
speech recognition, neural network
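A minimal sketch of per-step biasing: at each beam search step, tokens that would start or continue a context phrase receive a log-probability bonus. The prefix-matching rule, the word-level tokens, and the bonus value are assumptions made for illustration; the abstract only states that output likelihoods are adjusted at each step.

```python
def bias_step_scores(log_probs, prefix, context_phrases, bonus=1.5):
    """Boost next-token scores that start or continue a context phrase.

    log_probs: dict mapping candidate next token -> log probability.
    prefix: list of tokens decoded so far on this beam.
    context_phrases: list of token lists supplied as context.
    """
    to_boost = set()
    for phrase in context_phrases:
        for k in range(len(phrase)):
            # The last k decoded tokens match the first k tokens of the
            # phrase (k == 0 always matches), so phrase[k] is a continuation.
            if k <= len(prefix) and prefix[len(prefix) - k:] == phrase[:k]:
                to_boost.add(phrase[k])
    return {
        token: score + bonus if token in to_boost else score
        for token, score in log_probs.items()
    }

if __name__ == "__main__":
    step = {"cafe": -2.0, "call": -1.0, "luna": -3.0}
    context = [["cafe", "luna"]]
    print(bias_step_scores(step, ["navigate", "to"], context))
    print(bias_step_scores(step, ["navigate", "to", "cafe"], context))
```

Because the bonus is applied inside the search rather than to a final n-best list, a contextually relevant token can survive pruning even when the unbiased model ranks it low.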
We present a novel approach for improving the overall quality of
keyword spotting using a contextual automatic speech recognition
(ASR) system. On voice-activated devices with limited resources,
it is common that a keyword spotting system is run on
the device in order to detect a trigger phrase (e.g. “ok google”)
and decide which audio should be sent to the server (to be transcribed
by the ASR system and processed to generate a response
to the user). Due to limited resources on a device, the device
keyword spotting system might introduce false accepts (FAs)
and false rejects (FRs) that can cause a negative user experience.
We describe a system that uses server-side contextual ASR and
dynamic classes for improved keyword spotting. We show that
this method can significantly reduce FA rates (by 89%) while
minimally increasing FR rate (0.15%). Furthermore, we show
that this system helps reduce Word Error Rate (WER) (by 10%
to 50% relative, on different test sets) and allows users to speak
seamlessly, without pausing between the trigger phrase and the
command.
Contextual Language Model Adaptation Using Dynamic Classes
Benjamin Haynor
IEEE Workshop on Spoken Language Technology (SLT), IEEE (2016)
Recent focus on assistant products has increased the need for extremely
flexible speech systems that adapt
well to specific users' needs. An important aspect of this is enabling users to
make voice commands referencing their own personal data, such as favorite songs,
application names, and contacts. Recognition accuracy for common commands such
as playing music and sending text messages can be greatly improved if we know a
user's preferences.
In the past, we have addressed this problem using class-based language models
that allow for query-time injection of class instances. However, this approach
is limited by the need to train class-based models ahead of time.
In this work, we present a significantly more flexible system for query-time
injection of user context. Our system dynamically injects the classes
into a non-class-based language model. We remove the need to select the classes
at language model training time. Instead, our system can vary the classes on a
per-client, per-use-case, or even per-request basis.
With the ability to inject new classes per-request outlined in this work, our
speech system can support a diverse set of use cases by
taking advantage of a wide range of contextual information specific to each
use case.