Leonid Velikovich

Authored Publications
    Improving Automatic Speech Recognition with Neural Embeddings
    Christopher Li
    2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (2021)
    A common challenge in automatic speech recognition (ASR) systems is successfully decoding utterances containing long tail entities. Examples of entities include unique contact names and local restaurant names that may be out of vocabulary, and therefore absent from the training set. As a result, during decoding, such entities are assigned low likelihoods by the model and are unlikely to be recognized. In this paper, we apply retrieval in an embedding space to recover such entities. In the aforementioned embedding space, embedding representations of phonetically similar entities are designed to be close to one another in cosine distance. We describe the neural networks and the infrastructure to produce such embeddings. We also demonstrate that using neural embeddings improves ASR quality by achieving an over 50% reduction in word error rate (WER) on evaluation sets for popular media queries.
    As voice-driven intelligent assistants become commonplace, adaptation to user context becomes critical for Automatic Speech Recognition (ASR) systems. For example, ASR systems may be expected to recognize a user’s contact names containing improbable or out-of-vocabulary (OOV) words. We introduce a method to identify contextual cues in a first-pass ASR system’s output and to recover out-of-lattice hypotheses that are contextually relevant. Our proposed module is agnostic to the architecture of the underlying recognizer, provided it generates a word lattice of hypotheses; it is sufficiently compact for use on device. The module identifies subgraphs in the lattice likely to contain named entities (NEs), recovers phoneme hypotheses over corresponding time spans, and inserts NEs that are phonetically close to those hypotheses. We measure a decrease in the mean word error rate (WER) of word lattices from 11.5% to 4.9% on a test set of NEs.
    Recent interest in intelligent assistants has increased demand for Automatic Speech Recognition (ASR) systems that can utilize contextual information to adapt to the user’s preferences or the current device state. For example, a user might be more likely to refer to their favorite songs when giving a “music playing” command or request to watch a movie starring a particular favorite actor when giving a “movie playing” command. Similarly, when a device is in a “music playing” state, a user is more likely to give volume control commands. In this paper, we explore using semantic information inside the ASR word lattice by employing Named Entity Recognition (NER) to identify and boost contextually relevant paths in order to improve speech recognition accuracy. We use broad semantic classes comprising millions of entities, such as songs and musical artists, to tag relevant semantic entities in the lattice. We show that our method reduces Word Error Rate (WER) by 12.0% relative on a Google Assistant “media playing” commands test set, while not affecting WER on a test set containing commands unrelated to media.
    Semantic Model for Fast Tagging of Word Lattices
    IEEE Spoken Language Technology (SLT) Workshop (2016)
    This paper introduces a semantic tagger that inserts tags into a word lattice, such as one produced by a real-time large-vocabulary speech recognition system. Benefits of such a tagger include the ability to rescore speech recognition hypotheses based on this metadata, as well as to provide rich annotations to clients downstream. We focus on the domain of spoken search queries and voice commands, which can be useful for building an intelligent assistant. We explore a method to distill a preexisting very large semantic model into a lightweight tagger. This is accomplished by constructing a joint distribution of tagged n-grams from a supervised training corpus, then deriving a conditional distribution for a given lattice. With 300 classes, the tagger achieves a precision of 88.2% and recall of 93.1% on 1-best paths in speech recognition lattices with 2.8ms median latency.
    Garbage Modeling for On-device Speech Recognition
    Christophe Van Gysel
    Interspeech 2015, International Speech Communication Association (to appear)
    The Viability of Web-derived Polarity Lexicons
    Sasha Blair-Goldensohn
    Kerry Hannan
    Ryan McDonald
    North American Chapter of the Association for Computational Linguistics (2010)
    What’s great and what’s not: learning to classify the scope of negation for improved sentiment analysis
    Isaac Councill
    Ryan McDonald
    Workshop on Negation and Speculation in Natural Language Processing (2010)