Responsible AI

Our research in Responsible AI aims to shape the field of artificial intelligence and machine learning in ways that foreground the human experiences and impacts of these technologies. We examine and help steer emerging AI models, systems, and datasets used in research, development, and practice. This work uncovers foundational insights and devises methodologies that define the state of the art across the field. We advance equity, fairness, transparency, robustness, interpretability, and inclusivity as key elements of AI systems. For example, recent research evaluates how well the fairness properties of medical AI algorithms generalize, and examines what fairness means for AI systems across cultural contexts worldwide. We strive to ensure that the promise of AI is realized beneficially for all individuals and communities, prioritizing social and contextual implications.

Recent Publications

Automatic Speech Recognition of Conversational Speech in Individuals with Disordered Speech
Bob MacDonald
Rus Heywood
Richard Cave
Katie Seaver
Antoine Desjardins
Jordan Green
Journal of Speech, Language, and Hearing Research (2024) (to appear)
Purpose: This study examines the effectiveness of automatic speech recognition (ASR) for individuals with speech disorders, addressing the gap in performance between read and conversational ASR. We analyze the factors influencing this disparity and the effect of speech mode-specific training on ASR accuracy.
Method: Recordings of read and conversational speech from 27 individuals with various speech disorders were analyzed using both (1) one speaker-independent ASR system trained and optimized for typical speech and (2) multiple ASR models that were personalized to the speech of the participants with disordered speech. Word error rates (WERs) were calculated for each speech mode (read vs. conversational) and each subject. Linear mixed-effects models were used to assess the impact of speech mode and disorder severity on ASR accuracy. We investigated nine variables, classified as technical, linguistic, or speech impairment factors, for their potential influence on the performance gap.
Results: We found a significant performance gap between read and conversational speech in both personalized and unadapted ASR models. Speech impairment severity notably impacted recognition accuracy in unadapted models for both speech modes and in personalized models for read speech. Linguistic attributes of utterances were the most influential on accuracy, though atypical speech characteristics also played a role. Including conversational speech samples in model training notably improved recognition accuracy.
Conclusions: We observed a significant performance gap in ASR accuracy between read and conversational speech for individuals with speech disorders. This gap was largely due to the linguistic complexity and unique characteristics of speech disorders in conversational speech. Training personalized ASR models using conversational speech significantly improved recognition accuracy, demonstrating the importance of domain-specific training and highlighting the need for further research into ASR systems capable of handling disordered conversational speech effectively.
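For readers unfamiliar with the metric, the following is a minimal, self-contained sketch of the word error rate (WER) comparison described in the Method section above. The example utterances are invented placeholders, not recordings from the study.

# Minimal sketch of a per-mode WER computation (hypothetical data, not from the study).

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard word-level edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Read vs. conversational WER for one hypothetical speaker.
samples = {
    "read": ("the quick brown fox jumps over the lazy dog",
             "the quick brown fox jumps over a lazy dog"),
    "conversational": ("so i was thinking we could maybe go out later",
                       "so i was thinking we good maybe go later"),
}
for mode, (ref, hyp) in samples.items():
    print(f"{mode}: WER = {word_error_rate(ref, hyp):.2f}")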
Automatic Speech Recognition (ASR) systems, despite significant advances in recent years, still have much room for improvement, particularly in the recognition of disordered speech. Even so, erroneous transcripts from ASR models can help people with disordered speech be better understood, especially if the transcription doesn't significantly change the intended meaning. Evaluating the efficacy of ASR for this use case requires a methodology for measuring the impact of transcription errors on the intended meaning and comprehensibility. Human evaluation is the gold standard for this, but it can be laborious, slow, and expensive. In this work, we tune and evaluate large language models for this task and find them to be a much better proxy for human evaluators than other metrics commonly used. We further present a case study using the presented approach to assess the quality of personalized ASR models, to make model deployment decisions, and to correctly set user expectations for model quality as part of our trusted tester program.
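The abstract above describes tuning large language models to act as a proxy for human judgments of whether a transcript preserves the intended meaning. A minimal sketch of that judging loop might look like the following; call_llm is a hypothetical placeholder for whichever text-generation API is available, not an interface from the paper.

# Hedged sketch of an LLM-as-judge loop for ASR transcript meaning preservation.

def build_judge_prompt(intended: str, transcript: str) -> str:
    return (
        "You are rating an automatic transcript of disordered speech.\n"
        f"Intended utterance: {intended}\n"
        f"ASR transcript: {transcript}\n"
        "On a scale of 1 (meaning lost) to 5 (meaning fully preserved), "
        "how well does the transcript convey the intended meaning? "
        "Answer with a single digit."
    )

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in your own text-generation API here.
    raise NotImplementedError

def judge_transcript(intended: str, transcript: str) -> int:
    reply = call_llm(build_judge_prompt(intended, transcript))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 0  # 0 = unparseable reply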
Generative AI in Creative Practice: ML-Artist Folk Theories of T2I Use, Harm, and Harm-Reduction
Shalaleh Rismani
Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), Association for Computing Machinery (2024), pp. 1-17 (to appear)
Understanding how communities experience algorithms is necessary to mitigate potential harmful impacts. This paper presents folk theories of text-to-image (T2I) models to enrich understanding of how artist communities experience creative machine learning (ML) systems. This research draws on data collected from a workshop with 15 artists from 10 countries who incorporate T2I models in their creative practice. Through reflexive thematic analysis of workshop data, we highlight theorization of T2I use, harm, and harm-reduction. Folk theories of use envision T2I models as an artistic medium, a mundane tool, and locate true creativity as rising above model affordances. Theories of harm articulate T2I models as harmed by engineering efforts to eliminate glitches and product policy efforts to limit functionality. Theories of harm-reduction orient towards protecting T2I models for creative practice through transparency and distributed governance. We examine how these theories relate, and conclude by discussing how folk theorization informs responsible AI efforts.
Take it, Leave it, or Fix it: Measuring Productivity and Trust in Human-AI Collaboration
29th International Conference on Intelligent User Interfaces (IUI ’24), ACM, New York, NY, USA (2024)
Although recent developments in generative AI have greatly enhanced the capabilities of conversational agents such as Google's Bard or OpenAI's ChatGPT, it is unclear whether the usage of these agents aids users across various contexts. To better understand how access to conversational AI affects productivity and trust, we conducted a mixed-methods, task-based user study, observing 76 software engineers (N=76) as they completed a programming exam with and without access to Bard. Effects on performance, efficiency, satisfaction, and trust vary depending on user expertise, question type (open-ended "solve" questions vs. definitive "search" questions), and measurement type (demonstrated vs. self-reported). Our findings include evidence of automation complacency, increased reliance on the AI over the course of the task, and increased performance for novices on "solve"-type questions when using the AI. We discuss common behaviors, design recommendations, and impact considerations to improve collaborations with conversational AI.
Generative models improve fairness of medical classifiers under distribution shifts
Ira Ktena
Olivia Wiles
Isabela Albuquerque
Sylvestre-Alvise Rebuffi
Ryutaro Tanno
Danielle Belgrave
Taylan Cemgil
Nature Medicine (2024)
Domain generalization is a ubiquitous challenge for machine learning in healthcare. Model performance in real-world conditions might be lower than expected because of discrepancies between the data encountered during deployment and development. Underrepresentation of some groups or conditions during model development is a common cause of this phenomenon. This challenge is often not readily addressed by targeted data acquisition and labeling by expert clinicians, which can be prohibitively expensive or practically impossible because of the rarity of conditions or the available clinical expertise. We hypothesize that advances in generative artificial intelligence can help mitigate this unmet need in a steerable fashion, enriching our training dataset with synthetic examples that address shortfalls of underrepresented conditions or subgroups. We show that diffusion models can automatically learn realistic augmentations from data in a label-efficient manner. We demonstrate that learned augmentations make models more robust and statistically fair in-distribution and out of distribution. To evaluate the generality of our approach, we studied three distinct medical imaging contexts of varying difficulty: (1) histopathology, (2) chest X-ray and (3) dermatology images. Complementing real samples with synthetic ones improved the robustness of models in all three medical tasks and increased fairness by improving the accuracy of clinical diagnosis within underrepresented groups, especially out of distribution.
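The core idea of complementing real samples with conditionally generated ones can be sketched as follows. This is an illustrative outline under assumed data structures, not the paper's implementation; sample_from_diffusion_model is a hypothetical stand-in for conditional sampling from a trained generative model.

# Illustrative sketch: top up underrepresented (label, subgroup) cells with
# synthetic examples before training, so each cell reaches a target count.
import random
from collections import defaultdict

def sample_from_diffusion_model(label: str, subgroup: str, n: int) -> list:
    """Placeholder for conditional sampling from a trained diffusion model."""
    return [{"image": f"synthetic_{label}_{subgroup}_{i}", "label": label,
             "subgroup": subgroup, "synthetic": True} for i in range(n)]

def balance_with_synthetic(real_examples: list, target_per_cell: int) -> list:
    """real_examples: list of dicts with 'label' and 'subgroup' keys."""
    counts = defaultdict(int)
    for ex in real_examples:
        counts[(ex["label"], ex["subgroup"])] += 1
    augmented = list(real_examples)
    for (label, subgroup), count in counts.items():
        shortfall = target_per_cell - count
        if shortfall > 0:
            augmented += sample_from_diffusion_model(label, subgroup, shortfall)
    random.shuffle(augmented)
    return augmented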
Gender bias in language technologies has been widely studied, but research has mostly been restricted to a binary paradigm of gender. It is important to also consider non-binary gender identities, as excluding them can cause further harm to an already marginalized group. One way in which English-speaking individuals linguistically encode their gender identity is through third-person personal pronoun declarations. This is often done using two or more pronoun forms, e.g., xe/xem, or xe/xem/xyr. In this paper, we comprehensively evaluate state-of-the-art language models for their ability to correctly use declared third-person personal pronouns. As far as we are aware, we are the first to do so. We evaluate language models in both zero-shot and few-shot settings. Models are still far from zero-shot gendering non-binary individuals accurately, and most also struggle with correctly using gender-neutral pronouns (singular they, them, their, etc.). This poor performance may be due to the lack of representation of non-binary pronouns in pre-training corpora, and some memorized associations between pronouns and names. We find an overall improvement in performance for non-binary pronouns when using in-context learning, demonstrating that language models with few-shot capabilities can adapt to using declared pronouns correctly.
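The zero-shot versus few-shot probing described above could be sketched roughly as follows. Here, complete is a hypothetical stand-in for a language-model API, and the prompts and names are invented examples rather than items from the paper's evaluation set.

# Rough sketch: check whether a model's completion uses a referent's declared pronouns.

FEW_SHOT_EXAMPLES = (
    "Alex uses xe/xem/xyr pronouns. Alex forgot xyr umbrella, so xe got wet.\n"
    "Sam uses they/them pronouns. Sam said they would bring their notes.\n"
)

def complete(prompt: str) -> str:
    # Hypothetical placeholder: plug in your own language-model API here.
    raise NotImplementedError

def uses_declared_pronoun(name: str, pronouns: str, declared_forms: set,
                          few_shot: bool = False) -> bool:
    context = FEW_SHOT_EXAMPLES if few_shot else ""
    prompt = f"{context}{name} uses {pronouns} pronouns. When the meeting started, "
    completion = complete(prompt).lower()
    return any(form in completion.split() for form in declared_forms)

# Example probe (hypothetical):
# uses_declared_pronoun("Riley", "xe/xem/xyr", {"xe", "xem", "xyr"}, few_shot=True)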