Subhashini Venugopalan

I work on machine learning applications motivated by problems in healthcare and the sciences. Some of my work pertains to improving speech recognition systems for users with impaired speech, other work to transfer learning for bio/medical data (e.g. detecting diabetic retinopathy, breast cancer), and I have also developed methods to interpret such vision/audio models (model explanation) for medical applications. During my graduate studies, I applied natural language processing and computer vision techniques to generate descriptions of events depicted in videos and images. I am a key contributor to a number of works featured in the Healed through A.I. documentary. Please refer to my website (https://vsubhashini.github.io/) for more information and my Google Scholar page for an up-to-date list of my publications.
Authored Publications
    Automatic Speech Recognition (ASR) systems, despite significant advances in recent years, still have much room for improvement, particularly in the recognition of disordered speech. Even so, erroneous transcripts from ASR models can help people with disordered speech be better understood, especially if the transcription doesn't significantly change the intended meaning. Evaluating the efficacy of ASR for this use case requires a methodology for measuring the impact of transcription errors on the intended meaning and comprehensibility. Human evaluation is the gold standard for this, but it can be laborious, slow, and expensive. In this work, we tune and evaluate large language models for this task and find them to be a much better proxy for human evaluators than other commonly used metrics. We further present a case study using the presented approach to assess the quality of personalized ASR models, to make model deployment decisions, and to correctly set user expectations for model quality as part of our trusted tester program.
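
The evaluation setup lends itself to a small illustration. The sketch below is a hypothetical outline of using an LLM as a proxy rater for whether a transcript preserves the intended meaning; the prompt wording, the rating scale handling, and the `generate` callable are all assumptions for illustration, not the models or prompts used in the paper.

```python
# Hypothetical sketch: using an LLM as a proxy rater for whether an ASR
# transcript preserves the intended meaning of an utterance. `generate`
# stands in for whatever LLM inference call is available; it is not an
# API from the paper.

PROMPT = """You will see what a speaker intended to say and what an automatic
speech recognizer transcribed. Rate how well the transcript preserves the
intended meaning on a scale from 1 (meaning lost) to 5 (meaning fully preserved).
Reply with a single digit.

Intended: {intended}
Transcript: {transcript}
Rating:"""


def llm_meaning_score(intended: str, transcript: str, generate) -> int:
    """Query an LLM (via the user-supplied `generate` callable) for a 1-5 rating."""
    reply = generate(PROMPT.format(intended=intended, transcript=transcript))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 3  # fall back to the midpoint if unparseable
```
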
    We developed dysarthric speech intelligibility classifiers on 551,176 disordered speech samples contributed by a diverse set of 468 speakers with a range of self-reported speaking disorders, rated for their overall intelligibility on a five-point scale. We trained three models following different deep learning approaches and evaluated them on ∼94K utterances from 100 speakers. We further found the models to generalize well (without further training) on the TORGO database (100% accuracy), UASpeech (0.93 correlation), and ALS-TDI PMP (0.81 AUC) datasets, as well as on a dataset of realistic unprompted speech we gathered (106 dysarthric and 76 control speakers, ∼2300 samples).
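
For readers who want to run this style of evaluation on their own data, here is a minimal sketch of the kinds of generalization metrics quoted above (accuracy, correlation, AUC), computed with scikit-learn and SciPy on placeholder arrays; it is not the study's evaluation code.

```python
# Toy metric computation standing in for per-utterance intelligibility outputs.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, roc_auc_score

y_true_binary = np.array([0, 1, 1, 0, 1])           # e.g. disordered vs. typical
y_score = np.array([0.2, 0.9, 0.7, 0.4, 0.8])       # model scores
y_pred_binary = (y_score > 0.5).astype(int)

ratings_true = np.array([1, 3, 4, 2, 5])            # five-point intelligibility ratings
ratings_pred = np.array([1.4, 2.8, 3.9, 2.2, 4.6])  # model-predicted ratings

print("accuracy:", accuracy_score(y_true_binary, y_pred_binary))
print("correlation:", pearsonr(ratings_true, ratings_pred)[0])
print("AUC:", roc_auc_score(y_true_binary, y_score))
```
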
    SpeakFaster Observer: Long-Term Instrumentation of Eye-Gaze Typing for Measuring AAC Communication
    Richard Jonathan Noel Cave
    Bob MacDonald
    Jon Campbell
    Blair Casey
    Emily Kornman
    Daniel Vance
    Jay Beavers
    CHI '23 Case Studies of HCI in Practice (2023) (to appear)
    Accelerating communication for users with severe motor and speech impairments, in particular for eye-gaze Augmentative and Alternative Communication (AAC) device users, is a long-standing area of research. However, observation of such users' communication over extended durations has been limited. This case study presents the real-world experience of developing and field-testing a tool for observing and curating the gaze-typing-based communication of a consented eye-gaze AAC user with amyotrophic lateral sclerosis (ALS), from the perspective of researchers at the intersection of HCI and artificial intelligence (AI). With the intent to observe and accelerate eye-gaze typed communication, we designed a tool and a protocol called the SpeakFaster Observer to measure everyday conversational text entry by the consenting gaze-typing user, as well as by several consenting conversation partners of the AAC user. We detail the design of the Observer software and data curation protocol, along with considerations for privacy protection. The deployment of the data protocol from November 2021 to April 2022 yielded a rich dataset of gaze-based AAC text entry in everyday context, consisting of 130+ hours of gaze keypresses and 5.5k+ curated speech utterances from the AAC user and the conversation partners. We present the key statistics of the data, including the speed (8.1±3.9 words per minute) and keypress saving rate (-0.18±0.87) of gaze typing, patterns of utterance repetition and reuse, as well as the temporal dynamics of conversation turn-taking in gaze-based communication. We share our findings and also open-source our data collection tools for furthering research in this domain.
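
As a minimal sketch of the two text-entry measures reported above, the snippet below computes words per minute (with the conventional five-characters-per-word normalization) and a keystroke saving rate. The KSR definition used here (1 minus keypresses over output characters) is a common AAC formulation assumed for illustration, not necessarily the study's exact computation.

```python
# Toy computation of gaze-typing speed and keystroke saving rate (KSR).

def words_per_minute(text: str, seconds: float) -> float:
    """Speed using the conventional 5-characters-per-word normalization."""
    return (len(text) / 5.0) / (seconds / 60.0)


def keystroke_saving_rate(num_keypresses: int, text: str) -> float:
    """KSR > 0 means fewer keypresses than output characters; < 0 means more."""
    return 1.0 - num_keypresses / len(text)


utterance = "could you please turn the lights off"
print(words_per_minute(utterance, seconds=270))   # slow text entry for this toy example
print(keystroke_saving_rate(44, utterance))       # negative: more presses than characters
```
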
    Saying more while typing less is the ideal we strive towards when designing assistive writing technology that can minimize effort. Complementary to efforts on predictive completions is the idea of using a drastically abbreviated version of an intended message, which can then be reconstructed using language models. This paper highlights the challenges that arise from investigating what makes an abbreviation scheme promising for a potential application. We hope that this can provide a guide for designing studies which consequently allow for fundamental insights on efficient and goal-driven abbreviation strategies.
    Recent advances in self-supervision have dramatically improved the quality of speech representations. However, wide deployment of state-of-the-art embedding models on devices has been severely restricted due to their limited public availability and large resource footprint. Our work addresses these issues by publicly releasing a collection of paralinguistic speech models that are small and near state-of-the-art in performance. Our approach is based on knowledge distillation, and our models are distilled only on public data. We explore different architectures and thoroughly evaluate our models on the Non-Semantic Speech (NOSS) benchmark. Our largest distilled model is less than 16% the size of the original model (340MB vs 2.2GB) and achieves over 94% of the accuracy on 6 of 7 tasks. The smallest model is less than 0.3% of the size (22MB) and achieves over 90% of the accuracy on 6 of 7 tasks.
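
The following is a minimal PyTorch sketch of the embedding-level knowledge distillation idea, assuming a frozen teacher and a much smaller student trained to match its embeddings with an MSE loss; the architectures, dimensions, and training loop are placeholders, not the released models.

```python
# Toy distillation loop: a small student learns to reproduce a frozen
# teacher's speech embeddings from the same input features.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(64, 2048), nn.ReLU(), nn.Linear(2048, 1024)).eval()
student = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1024))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for step in range(100):                      # toy training loop on random "features"
    features = torch.randn(32, 64)           # stand-in for audio frontend features
    with torch.no_grad():
        target = teacher(features)           # teacher embeddings (no gradients)
    loss = loss_fn(student(features), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
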
    Amyotrophic Lateral Sclerosis (ALS) disease progression is usually measured using the subjective, questionnaire-based revised ALS Functional Rating Scale (ALSFRS-R). A purely objective measure for tracking disease progression would be a powerful tool for evaluating real-world drug effectiveness and efficacy in clinical trials, as well as for identifying participants for cohort studies. Here we develop a machine learning based objective measure for ALS disease progression, based on voice samples and accelerometer measurements. The ALS Therapy Development Institute (ALS-TDI) collected a unique dataset of voice and accelerometer samples from consented individuals: 584 people living with ALS over four years. Participants carried out prescribed speaking and limb-based tasks. 542 participants contributed 5814 voice recordings, and 350 contributed 13009 accelerometer samples, while simultaneously measuring ALSFRS-R. Using the data from 475 participants, we trained machine learning (ML) models, correlating voice with bulbar-related FRS scores and accelerometer data with limb-related scores. On the test set (n=109 participants) the voice models achieved an AUC of 0.86 (95% CI, 0.847-0.884), whereas the accelerometer models achieved a median AUC of 0.73. We used the models and self-reported ALSFRS-R scores to evaluate the real-world effects of edaravone, a drug recently approved for use in ALS, on 54 test participants. In the test cohort, the digital data input into the ML models produced objective measures of progression rates over the duration of the study that were consistent with self-reported scores. This demonstrates the value of these tools for assessing both disease progression and, potentially, drug effects. In this instance, edaravone treatment outcomes, both self-reported and digital-ML, were highly variable from person to person.
    Assessing ASR Model Quality on Disordered Speech using BERTScore
    Qisheng Li
    Katie Seaver
    Richard Jonathan Noel Cave
    Proc. 1st Workshop on Speech for Social Good (S4SG) (2022), pp. 26-30 (to appear)
    Word Error Rate (WER) is the primary metric used to assess automatic speech recognition (ASR) model quality. It has been shown that ASR models tend to have much higher WER on speakers with speech impairments than on typical English speakers. It is hard to determine whether models can be useful at such high error rates. This study investigates the use of BERTScore, an evaluation metric for text generation, to provide a more informative measure of ASR model quality and usefulness. Both BERTScore and WER were compared to prediction errors manually annotated by speech-language pathologists for error type and assessment. BERTScore was found to be more correlated with these human annotations, and was specifically more robust to orthographic changes (contraction and normalization errors) where meaning was preserved. Furthermore, BERTScore was a better fit of error assessment than WER, as measured using an ordinal logistic regression and the Akaike Information Criterion (AIC). Overall, our findings suggest that BERTScore can complement WER when assessing ASR model performance from a practical perspective, especially for accessibility applications where models are useful even at lower accuracy than for typical speech.
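
To make the contrast concrete, here is a small sketch that scores a meaning-preserving transcript with both metrics using the open-source jiwer and bert_score packages; the sentences are illustrative, not from the annotated data.

```python
# WER penalizes surface changes (contractions, orthography) even when the
# meaning is intact, while BERTScore stays high for such transcripts.
from jiwer import wer
from bert_score import score

reference = ["I am going to the doctor's office tomorrow"]
hypothesis = ["I'm going to the doctors office tomorrow"]

print("WER:", wer(reference[0], hypothesis[0]))      # nonzero despite preserved meaning
P, R, F1 = score(hypothesis, reference, lang="en")   # high F1 when meaning is preserved
print("BERTScore F1:", F1.item())
```
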
    Context-Aware Abbreviation Expansion Using Large Language Models
    Ajit Narayanan
    Annual Conference of the North American Chapter of the Association for Computational Linguistics (2022) (to appear)
    Motivated by the need for accelerating text entry in augmentative and alternative communication (AAC) for people with severe motor impairments, we propose a paradigm in which phrases are abbreviated aggressively as primarily word-initial letters. Our approach is to expand the abbreviations into full-phrase options by leveraging conversation context with the power of pretrained large language models (LLMs). Through zero-shot, few-shot, and fine-tuning experiments on four public conversation datasets, we show that for replies to the initial turn of a dialog, an LLM with 64B parameters is able to exactly expand over 70% of phrases with abbreviation length up to 10, leading to an effective keystroke saving rate of up to about 77% on these exact expansions. Including a small amount of context in the form of a single conversation turn more than doubles abbreviation expansion accuracies compared to having no context, an effect that is more pronounced for longer phrases. Additionally, the robustness of models against typo noise can be enhanced through fine-tuning on noisy data.
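
Here is a minimal sketch of the abbreviation scheme described above, assuming word-initial-letter abbreviations; the keystroke-savings calculation applies only when the expansion is exact, and the expansion step itself (the LLM call) is omitted.

```python
# Word-initial-letter abbreviation and the keystroke savings obtained when a
# language model reconstructs the full phrase exactly.

def abbreviate(phrase: str) -> str:
    """Word-initial-letter abbreviation, e.g. 'see you tomorrow' -> 'syt'."""
    return "".join(word[0] for word in phrase.split())


def keystroke_savings(phrase: str) -> float:
    """Fraction of keystrokes saved if the abbreviation is expanded exactly."""
    return 1.0 - len(abbreviate(phrase)) / len(phrase)


phrase = "could you please turn the lights off"
print(abbreviate(phrase))          # seven keypresses instead of the full phrase
print(keystroke_savings(phrase))   # large savings when the expansion is exact
```
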
    Scaling Symbolic Methods using Gradients for Neural Model Explanation
    Subham Sekhar Sahoo
    Li Li
    Rishabh Singh
    Patrick Francis Riley
    International Conference on Learning Representations (ICLR) (2021)
    Symbolic techniques based on Satisfiability Modulo Theory (SMT) solvers have been proposed for analyzing and verifying neural network properties, but their usage has been fairly limited owing to their poor scalability with larger networks. In this work, we propose a technique for combining gradient-based methods with symbolic techniques to scale such analyses and demonstrate its application for model explanation. In particular, we apply this technique to identify minimal regions in an input that are most relevant for a neural network's prediction. Our approach uses gradient information (based on Integrated Gradients) to focus on a subset of neurons in the first layer, which allows our technique to scale to large networks. The corresponding SMT constraints encode the minimal input mask discovery problem such that after masking the input, the activations of the selected neurons are still above a threshold. After solving for the minimal masks, our approach scores the mask regions to generate a relative ordering of the features within the mask. This produces a saliency map which explains "where a model is looking" when making a prediction. We evaluate our technique on three datasets (MNIST, ImageNet, and Beer Reviews) and demonstrate both quantitatively and qualitatively that the regions generated by our approach are sparser and achieve higher saliency scores compared to the gradient-based methods alone.
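
Below is a minimal sketch of the Integrated Gradients step used to select which first-layer neurons to focus on, written as a standard Riemann-sum approximation in PyTorch; the SMT-based minimal-mask search is not shown, and the scalar-score model interface is an assumption for illustration.

```python
# Riemann-sum approximation of Integrated Gradients along the straight path
# from a baseline to the input. `model` is assumed to map a batch of inputs
# to one scalar score per example (e.g. the logit of the class of interest).
import torch


def integrated_gradients(model, x, baseline, steps=64):
    """IG(x) ≈ (x - baseline) * mean of gradients along the interpolation path."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    scores = model(path)                               # one score per path point
    grads = torch.autograd.grad(scores.sum(), path)[0]
    return (x - baseline) * grads.mean(dim=0)


# Toy usage on a tiny linear model with a zero baseline.
net = torch.nn.Sequential(torch.nn.Linear(10, 1))
x = torch.randn(10)
attributions = integrated_gradients(lambda t: net(t).squeeze(-1), x, torch.zeros(10))
```
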
    Automatic classification of disordered speech can provide an objective tool for identifying the presence and severity of a speech impairment. Classification approaches can also help identify hard-to-recognize speech samples to teach ASR systems about the variable manifestations of impaired speech. Here, we develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases. We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases, which were rated by speech-language pathologists for their overall intelligibility using a five-point Likert scale. We then evaluated classifiers developed using three approaches: (1) a convolutional neural network (CNN) trained for the task, (2) classifiers trained on non-semantic speech representations from CNNs that used an unsupervised objective [1], and (3) classifiers trained on the acoustic (encoder) embeddings from an ASR system trained on typical speech [2]. We find that the ASR encoder's embeddings considerably outperform the other two on detecting and classifying disordered speech. Further analysis shows that the ASR embeddings cluster speech by the spoken phrase, while the non-semantic embeddings cluster speech by speaker. Also, longer phrases are more indicative of intelligibility deficits than single words.
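
As a rough sketch of approach (3), the snippet below trains a lightweight classifier on frozen, precomputed embeddings with scikit-learn; the random arrays stand in for per-utterance ASR-encoder embeddings and are placeholders only.

```python
# Lightweight classifier on frozen embeddings instead of end-to-end audio training.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(500, 512))       # (utterances, embedding_dim)
train_labels = rng.integers(0, 2, size=500)   # e.g. intelligible vs. not
test_emb = rng.normal(size=(100, 512))
test_labels = rng.integers(0, 2, size=100)

clf = LogisticRegression(max_iter=1000).fit(train_emb, train_labels)
print("AUC:", roc_auc_score(test_labels, clf.predict_proba(test_emb)[:, 1]))
```
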