Jump to Content
Joel Shor

Joel Shor

Teaching machines to talk back.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Recent advances in self-supervision have dramatically im- proved the quality of speech representations. However, wide deployment of state-of-the-art embedding models on devices has been severely restricted due to their limited public avail- ability and large resource footprint. Our work addresses these by publicly releasing a collection of paralinguistic speech models1 that are small, near state-of-the-art performance. Our approach is based on knowledge distillation, and our models are distilled only on public data. We explore differ- ent architectures and thoroughly evaluate our models on the Non-Semantic Speech (NOSS) benchmark. Our largest dis- tilled model is less than 16% the size of the original model (340MB vs 2.2GB) and achieves over 94% the accuracy on 6 of 7 tasks. The smallest model is less than 0.3% in size (22MB) and achieves over 90% as the accuracy on 6 of 7 tasks. View details
    Preview abstract Many speech applications require understanding aspects other than content, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. Generally-useful paralinguistic speech representations offer one solution to these kinds of problems. In this work, we introduce a new state-of-the-art paralinguistic speech representation based on self-supervised training of a 600M+ parameter Conformer-based architecture. Linear classifiers trained on top of our best representation outperform previous results on 7 of 8 tasks we evaluate. We perform a larger comparison than has been done previously both in terms of number of embeddings compared and number of downstream datasets evaluated on. Our analyses into the role of time demonstrate the importance of context window size for many downstream tasks. Furthermore, while the optimal representation is extracted internally in the network, we demonstrate stable high performance across several layers, allowing a single universal representation to reach near optimal performance on all tasks. View details
    Preview abstract Compression is essential to storing and transmitting medical videos, but the effect of compression artifacts on downstream medical tasks is often ignored. Furthermore, systems in practice rely on standard video codecs, which naively allocate bits evenly between medically interesting and uninteresting frames and parts of frames. In this work, we present an empirical study of some deficiencies of classical codecs on gastroenterology videos, and motivate our ongoing work to train a learned compression model for colonoscopy videos, which we call ``GastroEnterology Aware Compression" (GEAC). We show that H264 and HEVC, two of the most common classical codecs, perform worse on the most medically-relevant frames. We also show that polyp detector performance degrades rapidly as compression increases, and explain why a learned compressor would degrade more gracefully. Many of our proposed techniques generalize to medical video domains beyond gastroenterology. View details
    Knowledge distillation for fast and accurate DNA sequence correction
    Anastasiya Belyaeva
    Daniel Cook
    Kishwar Shafin
    Daniel Liu
    Armin Töpfer
    Aaron Wenger
    William J. Rowell
    Howard Yang
    Andrew Carroll
    Maria Nattestad
    Learning Meaningful Representations of Life (LMRL) Workshop NeurIPS 2022
    Preview abstract Accurate genome sequencing can improve our understanding of biology and the genetic basis of disease. The standard approach for generating DNA sequences from PacBio instruments relies on HMM-based models. Here, we introduce Distilled DeepConsensus - a distilled transformer–encoder model for sequence correction, which improves upon the HMM-based methods with runtime constraints in mind. Distilled DeepConsensus is 1.3x faster and 1.5x smaller than its larger counterpart while improving the yield of high quality reads (Q30) over the HMM-based method by 1.69x (vs. 1.73x for larger model). With improved accuracy of genomic sequences, Distilled DeepConsensus improves downstream applications of genomic sequence analysis such as reducing variant calling errors by 39% (34% for larger model) and improving genome assembly quality by 3.8% (4.2% for larger model). We show that the representations learned by Distilled DeepConsensus are similar between faster and slower models. View details
    Preview abstract Automatic classification of disordered speech can provide an objective tool for identifying the presence and severity of a speech impairment. Classification approaches can also help identify hard-to-recognize speech samples to teach ASR systems about the variable manifestations of impaired speech. Here, we develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases. We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases, which were rated by speech-language pathologists for their overall intelligibility using a five-point Likert scale. We then evaluated classifiers developed using 3 approaches: (1) a convolutional neural network (CNN) trained for the task, (2) classifiers trained on non-semantic speech representations from CNNs that used an unsupervised objective [1], and (3) classifiers trained on the acoustic (encoder) embeddings from an ASR system trained on typical speech [2]. We find that the ASR encoder’s embeddings considerably outperform the other two on detecting and classifying disordered speech. Further analysis shows that the ASR embeddings cluster speech by the spoken phrase, while the non-semantic embeddings cluster speech by speaker. Also, longer phrases are more indicative of intelligibility deficits than single words. View details
    Preview abstract Learned speech representations can drastically improve performance on tasks with limited labeled data. However, due to their size and complexity, learned representations have limited utility in mobile settings where run-time performance can be a significant bottleneck. In this work, we propose a class of lightweight speech embedding models that run efficiently on mobile devices based on the recently proposed TRILL speech embedding. We combine novel architectural modifications with existing speedup techniques to create embedding models that are fast enough to run in real-time on a mobile device and exhibit minimal performance degradation on a benchmark of non-semantic speech tasks. One such model (FRILL) is 32x faster on a Pixel 1 smartphone and 40% the size of TRILL, with an average decrease in accuracy of only 2%. To our knowledge, FRILL is the highest quality non-semantic embedding designed for use on mobile devices. Furthermore, we demonstrate that these representations are useful for mobile health tasks such as non-speech human sounds detection and face-masked speech detection. Our training and evaluation code is publicly available. View details
    A prospective evaluation of AI-augmented epidemiology to forecast COVID-19 in the USA and Japan
    Arkady Epshteyn
    Ashwin Sura Ravi
    Beth Luan
    Chun-Liang Li
    Daisuke Yoneoka
    Dario Sava
    Hiroaki Miyata
    Hiroki Kayama
    Isaac Jones
    Joe Mckenna
    Johan Euphrosine
    Kris Popendorf
    Nate Yoder
    Shashank Singh
    Shuhei Nomura
    Thomas Tsai
    npj Digital Medicine (2021)
    Preview abstract The COVID-19 pandemic has highlighted the global need for reliable models of disease spread. We evaluate an AI-improved forecasting approach that provides daily predictions of the expected number of confirmed COVID-19 deaths, cases and hospitalizations during the following 28 days. We present an international, prospective evaluation of model performance across all states and counties in the USA and prefectures in Japan. National mean absolute percentage error (MAPE) for predicting COVID-19 associated deaths before and after prospective deployment remained consistently <3% (US) and <10% (Japan). Average statewide (US) and prefecture wide (Japan) MAPE was 6% and 20% respectively (14% when looking at prefectures with more than 10 deaths).We show our model performs well even during periods of considerable change in population behavior, and that it is robust to demographic differences across different geographic locations.We further demonstrate the model provides meaningful explanatory insights, finding that the model appropriately responds to local and national policy interventions. Our model enables counterfactual simulations, which indicate continuing NPIs alongside vaccinations is essential for more rapidly recovering from the pandemic, delaying the application of interventions has a detrimental effect, and allow exploration of the consequences of different vaccination strategies. The COVID-19 pandemic remains a global emergency. In the face of substantial challenges ahead, the approach presented here has the potential to inform critical decisions. View details
    Preview abstract The ultimate goal of transfer learning is to enable learning with a small amount of data, by using a strong embedding. While significant progress has been made in the visual and language domains, the speech domain does not have such a universal method. This paper presents a new representation of speech signals based on an unsupervised triplet-loss objective, which outperforms both existing state of the art and other representations on a number of transfer learning tasks in the non-semantic speech domain. The embedding is learned on a publicly available dataset, and it is tested on a variety of low-resource downstream tasks, including personalization tasks and medical domain. The model will be publicly released. View details
    Preview abstract Automatic speech recognition (ASR) systems have dramatically improved over the last few years. ASR systems are most often trained from ‘typical’ speech, which means that underrepresented groups don’t experience the same level of improvement. In this paper, we present and evaluate finetuning techniques to improve ASR for users with non standard speech. We focus on two types of non standard speech: speech from people with amyotrophic lateral sclerosis (ALS) and accented speech. We train personalized models that achieve 62% and 35% relative WER improvement on these two groups, bringing the absolute WER for ALS speakers, on a test set of message bank phrases, to 10% for mild dysarthria and 20% for more serious dysarthria. We show that 76% of the improvement comes from only 5 min of training data. Finetuning a particular subset of layers (with many fewer parameters) often gives better results than finetuning the entire model. This is the first step towards building state of the art ASR models for dysarthric speech Index Terms: speech recognition, personalization, accessibility View details
    Preview abstract In this work, we propose “global style tokens”(GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained in a completely unsupervised manner, and yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of surprising results. The soft interpretable “labels” they generate can be used to control synthesis in novel ways, such as varying speed and modifying speak-ing style – independently of the text content. The labels can also be used for style transfer, replicating the speaking style of one “seed” phrase across an entire long-form text corpus. Perhaps most surprisingly, when trained on noisy, unlabelled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scaleable but robust speech synthesis. View details
    Preview abstract We propose a method for lossy image compression based on recurrent, convolutional neural networks that outperforms BPG (4:2:0), WebP, JPEG2000, and JPEG as measured by MS-SSIM. We introduce three improvements over previous research that lead to this state-of-the-art result using a single model. First, we show that training with a pixel-wise loss weighted by SSIM increases reconstruction quality according to several metrics. Second, we modify the recurrent architecture to improve spatial diffusion, which allows the network to more effectively capture and propagate image information through the network’s hidden state. Finally, in addition to lossless entropy coding, we use a spatially adaptive bit allocation algorithm to more efficiently use the limited number of bits to encode visually complex image regions. We evaluate our method on the Kodak and Tecnick image sets and compare against standard codecs as well recently published methods based on deep neural networks. View details
    Preview abstract We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the reference signal’s prosody with fine time detail. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results and audio samples from a single-speaker and 44-speaker Tacotron model on a prosody transfer task. View details
    Uncovering Latent Style Factors for Expressive Speech Synthesis
    Yuxuan Wang
    Ying Xiao
    NIPS Workshop on Machine Learning for Audio Signal Processing (ML4Audio) (2017) (to appear)
    Preview abstract Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way. View details
    Spatially adaptive image compression using a tiled deep network
    Michele Covell
    Sung Jin Hwang
    Damien Vincent
    Proceedings of the International Conference on Image Processing (2017), pp. 2796-2800
    Preview abstract Deep neural networks represent a powerful class of function approximators that can learn to compress and reconstruct images. Existing image compression algorithms based on neural networks learn quantized representations with a constant spatial bit rate across each image. While entropy coding introduces some spatial variation, traditional codecs have benefited significantly by explicitly adapting the bit rate based on local image complexity and visual saliency. This paper introduces an algorithm that combines deep neural networks with quality-sensitive bit rate adaptation using a tiled network. We demonstrate the importance of spatial context prediction and show improved quantitative (PSNR) and qualitative (subjective rater assessment) results compared to a non-adaptive baseline and a recently published image compression model based on fully-convolutional neural networks. View details
    Preview abstract This paper presents a set of full-resolution lossy image compression methods based on neural networks. Each of the architectures we describe can provide variable compression rates during deployment without requiring retraining of the network: each network need only be trained once. All of our architectures consist of a recurrent neural network (RNN)-based encoder and decoder, a binarizer, and a neural network for entropy coding. We compare RNN types (LSTM, associative LSTM) and introduce a new hybrid of GRU and ResNet. We also study "one-shot" versus additive reconstruction architectures and introduce a new scaled-additive framework. We compare to previous work, showing improvements of 4.3%-8.8% AUC (area under the rate-distortion curve), depending on the perceptual metric used. As far as we know, this is the first neural network architecture that is able to outperform JPEG at image compression across most bitrates on the rate-distortion curve on the Kodak dataset images, with and without the aid of entropy coding. View details
    No Results Found