Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Showing 1–15 of 224 publications
    Generative AI in Creative Practice: ML-Artist Folk Theories of T2I Use, Harm, and Harm-Reduction
    Shalaleh Rismani
    Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), Association for Computing Machinery (2024), pp. 1-17 (to appear)
    Abstract: Understanding how communities experience algorithms is necessary to mitigate potential harmful impacts. This paper presents folk theories of text-to-image (T2I) models to enrich understanding of how artist communities experience creative machine learning (ML) systems. This research draws on data collected from a workshop with 15 artists from 10 countries who incorporate T2I models in their creative practice. Through reflexive thematic analysis of workshop data, we highlight theorization of T2I use, harm, and harm-reduction. Folk theories of use envision T2I models as an artistic medium and a mundane tool, and locate true creativity as rising above model affordances. Theories of harm articulate T2I models as harmed by engineering efforts to eliminate glitches and product policy efforts to limit functionality. Theories of harm-reduction orient towards protecting T2I models for creative practice through transparency and distributed governance. We examine how these theories relate, and conclude by discussing how folk theorization informs responsible AI efforts.
    Take it, Leave it, or Fix it: Measuring Productivity and Trust in Human-AI Collaboration
    29th International Conference on Intelligent User Interfaces (IUI ’24), ACM, New York, NY, USA (2024)
    Abstract: Although recent developments in generative AI have greatly enhanced the capabilities of conversational agents such as Google's Bard or OpenAI's ChatGPT, it is unclear whether the use of these agents aids users across various contexts. To better understand how access to conversational AI affects productivity and trust, we conducted a mixed-methods, task-based user study, observing 76 software engineers as they completed a programming exam with and without access to Bard. Effects on performance, efficiency, satisfaction, and trust vary depending on user expertise, question type (open-ended "solve" questions vs. definitive "search" questions), and measurement type (demonstrated vs. self-reported). Our findings include evidence of automation complacency, increased reliance on the AI over the course of the task, and increased performance for novices on "solve"-type questions when using the AI. We discuss common behaviors, design recommendations, and impact considerations to improve collaborations with conversational AI.
    Abstract: Language models still struggle with moral reasoning, despite their impressive performance on many other tasks. In particular, the Moral Scenarios task in MMLU (Massive Multitask Language Understanding) is among the worst-performing tasks for many language models, including GPT-3. In this work, we propose a new prompting framework, Thought Experiments, to teach language models to do better moral reasoning using counterfactuals. Experimental results show that our framework elicits counterfactual questions and answers from the model, which in turn help improve accuracy on the Moral Scenarios task by 9-16% compared to other zero-shot baselines. Interestingly, unlike on math reasoning tasks, zero-shot Chain-of-Thought (CoT) reasoning doesn't work out of the box, and even reduces accuracy by around 4% compared to direct zero-shot prompting. We further observed that with minimal human supervision in the form of 5 few-shot examples, accuracy on the task can be improved to as much as 80%.
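    The abstract describes a multi-step, zero-shot prompting scheme built around counterfactuals. The sketch below is a minimal illustration of what such a loop could look like, not the paper's actual prompts; the generate function is a hypothetical stand-in for any text-completion API.

```python
# Minimal sketch of a counterfactual ("thought experiment") prompting loop.
# `generate` is a hypothetical stand-in for a language-model API call;
# the prompts here are illustrative, not those used in the paper.

def generate(prompt: str) -> str:
    """Placeholder for a language-model completion call."""
    raise NotImplementedError("wire up your own LLM client here")

def thought_experiment_answer(scenario: str) -> str:
    # Step 1: ask the model to pose a counterfactual question about the scenario.
    counterfactual_q = generate(
        f"Scenario: {scenario}\n"
        "Pose a counterfactual question: what if a key detail were different?"
    )
    # Step 2: have the model answer its own counterfactual question.
    counterfactual_a = generate(
        f"Scenario: {scenario}\nQuestion: {counterfactual_q}\nAnswer:"
    )
    # Step 3: condition the final moral judgment on that counterfactual reasoning.
    return generate(
        f"Scenario: {scenario}\n"
        f"Counterfactual reasoning: {counterfactual_q} {counterfactual_a}\n"
        "Is the action in the scenario morally acceptable? "
        "Answer 'acceptable' or 'unacceptable':"
    )
```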
    Abstract: As new forms of data capture emerge to power new AI applications, questions abound about the ethical implications of these data collection practices. In this paper, we present clinicians' perspectives on the prospective benefits and harms of voice data collection during health consultations. Such data collection is being proposed as a means to power models that assist clinicians with medical data entry, administrative tasks, and consultation analysis. Yet clinicians' attitudes and concerns are largely absent from the AI narratives surrounding these use cases, and from the academic literature investigating them. Our qualitative interview study used the concept of an informed consent process as a type of design fiction to support elicitation of clinicians' perspectives on voice data collection and use associated with a fictional, near-term AI assistant. Through reflexive thematic analysis of in-depth sessions with physicians, we distilled eight classes of potential risks that clinicians are concerned about, including workflow disruptions, self-censorship, and errors that could impact patient eligibility for services. We conclude with an in-depth discussion of these prospective risks, reflect on the use of the speculative processes that illuminated them, and reconsider evaluation criteria for AI-assisted clinical documentation technologies in light of our findings.
    Public Health Calls for/with AI: An Ethnographic Perspective
    Azra Ismail
    Neha Kumar
    Neha Madhiwalla
    ACM Conference On Computer-Supported Cooperative Work And Social Computing (2023)
    Abstract: Artificial Intelligence (AI) based technologies are increasingly being integrated into public sector programs to help with decision support and effective distribution of constrained resources. The field of Computer Supported Cooperative Work (CSCW) has begun to examine how the resultant sociotechnical systems may be designed appropriately when targeting underserved populations. We present an ethnographic study of a large-scale, real-world integration of an AI system for resource allocation in a call-based maternal and child health program in India. Our findings uncover complexities around determining who benefits from the intervention, how the human-AI collaboration is managed, when intervention must take place in alignment with various priorities, and why, and for what purpose, the AI is sought. Our paper offers takeaways for human-centered AI integration in public health, drawing attention to the work done by the AI as actor, the work of configuring the human-AI partnership with multiple diverse stakeholders, and the work of aligning program goals for design and implementation through continual dialogue across stakeholders.
    Abstract: Along with the recent advances in large language modeling, there is growing concern that language technologies may reflect, propagate, and amplify various social stereotypes about groups of people. Publicly available stereotype benchmarks play a crucial role in detecting and mitigating this issue in language technologies to prevent both representational and allocational harms in downstream applications. However, existing stereotype benchmarks are limited in their size and coverage, largely restricted to stereotypes prevalent in Western society. This is especially problematic as language technologies gain ground across the globe. To address this gap, we present SeeGULL, a broad-coverage stereotype dataset that expands coverage by utilizing the generative capabilities of large language models such as PaLM and GPT-3, and leverages a globally diverse rater pool to validate the prevalence of those stereotypes in society. SeeGULL is an order of magnitude larger than existing benchmarks, and contains stereotypes for 179 identity groups spanning 6 continents, 8 regions, 178 countries, 50 US states, and 31 Indian states and union territories. We also obtain fine-grained offensiveness scores for different stereotypes and demonstrate how stereotype perceptions for the same identity group differ between in-region and out-of-region annotators.
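    As a rough illustration of the generate-then-validate pattern the abstract describes (a sketch, not the actual SeeGULL pipeline), the snippet below seeds a model with known (identity, attribute) pairs and asks it to propose new candidates, which would then be validated by a diverse rater pool. Both generate and the seed pairs are hypothetical placeholders.

```python
# Illustrative generate-then-validate loop for stereotype candidates.
# Not the SeeGULL pipeline itself; `generate` is a hypothetical LLM call
# and SEED_PAIRS are placeholder examples.

SEED_PAIRS = [("identity A", "attribute X"), ("identity B", "attribute Y")]

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in any large language model here")

def propose_candidates(seeds):
    examples = "\n".join(f"({i}, {a})" for i, a in seeds)
    completion = generate(
        "Each pair below is an (identity group, associated attribute) tuple.\n"
        f"{examples}\nList more such pairs:"
    )
    # Parse "(identity, attribute)" lines out of the completion.
    pairs = []
    for line in completion.splitlines():
        line = line.strip().strip("()")
        if "," in line:
            identity, attribute = line.split(",", 1)
            pairs.append((identity.strip(), attribute.strip()))
    return pairs

# Candidates would then go to a geographically diverse rater pool, keeping
# only pairs whose societal prevalence the raters confirm.
```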
    Infrastructuring Care: How Trans and Non-Binary People Meet Health and Well-Being Needs through Technology
    Lauren Wilcox
    Rajesh Veeraraghavan
    Oliver Haimson
    Gabi Erickson
    Michael Turken
    Beka Gulotta
    ACM Conference on Human Factors in Computing Systems (ACM CHI) 2023, Association for Computing Machinery, ACM (2023)
    Abstract: We present a cross-cultural diary study with 64 transgender (trans) and non-binary (TGNB) adults in Mexico, the U.S., and India, to understand experiences keeping track of and managing aspects of personal health and well-being. Based on a reflexive thematic analysis of diary data, we highlight sociotechnical interactions that shape how transgender and non-binary people track and manage aspects of their health and well-being. Specifically, we surface the ways in which transgender and non-binary people infrastructure forms of care, by assembling together elements of informal social ecologies, formalized knowledge sources, and self-reflective media. We then examine the forms of precarity that interact with care infrastructure and shape management of health and well-being, including management of gender identity transitions. We discuss the ways in which our findings extend knowledge at the intersection of technology and marginalized health needs, and conclude by arguing for the importance of a research agenda to move toward TGNB-inclusive design.
    Abstract: This paper demonstrates how the limitations of pre-trained models and open evaluation datasets factor into assessing the performance of binary semantic similarity classification tasks. Because (1) end-user-facing documentation about the curation of these datasets and the training regimes of pre-trained models is often not easily accessible, and (2) the friction to deploy such systems in real-world contexts is low while demand is high, our study reinforces prior work showing performance disparities across datasets, embedding techniques, and distance metrics, while highlighting the importance of understanding how data is collected, curated, and analyzed in semantic similarity classification.
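    For readers unfamiliar with the task setup, the sketch below shows the usual shape of binary semantic-similarity classification: embed two sentences, compare them with a distance metric, and threshold the score. The paper's point is that the verdict can shift with the choice of embedding model, metric, and evaluation dataset. The embed function is a hypothetical encoder, not a specific library API.

```python
# Minimal sketch of binary semantic-similarity classification.
# `embed` is a hypothetical stand-in for any sentence encoder.
import numpy as np

def embed(sentence: str) -> np.ndarray:
    raise NotImplementedError("plug in any sentence encoder here")

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_similar(s1: str, s2: str, threshold: float = 0.8) -> bool:
    # Swapping cosine for Euclidean or dot-product distance, or changing
    # the encoder, can flip this decision for borderline pairs -- the kind
    # of disparity the study highlights.
    return cosine_similarity(embed(s1), embed(s2)) >= threshold
```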
    AI’s Regimes of Representation: A Community-centered Study of Text-to-Image Models in South Asia
    Rida Qadri
    Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, Association for Computing Machinery, pp. 506–517
    Abstract: This paper presents a community-centered study of the cultural limitations of text-to-image (T2I) models in the South Asian context. We theorize these failures using scholarship on dominant media regimes of representation and locate them within participants' reporting of their existing social marginalizations. We thus show how generative AI can reproduce an outsider's gaze for viewing South Asian cultures, shaped by global and regional power inequities. By centering communities as experts and soliciting their perspectives on T2I limitations, our study adds rich nuance to existing evaluative frameworks and deepens our understanding of the culturally specific ways AI technologies can fail in non-Western and Global South settings. We distill lessons for the responsible development of T2I models, recommending concrete pathways forward that can allow for recognition of structural inequalities.
    VLSlice: Interactive Vision-and-Language Slice Discovery
    Eric Slyman
    Stefan Lee
    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023), pp. 15291-15301
    Abstract: Recent work in vision-and-language demonstrates that large-scale pretraining can learn generalizable models that are efficiently transferable to downstream tasks. While this may improve dataset-scale aggregate metrics, analyzing performance around hand-crafted subgroups targeting specific bias dimensions reveals systemic undesirable behaviors. However, this subgroup analysis is frequently stalled by annotation efforts, which require extensive time and resources to collect the necessary data. Prior work attempts to automatically discover subgroups to circumvent these constraints, but typically leverages model behavior on existing task-specific annotations and degrades rapidly on inputs more complex than "tabular" data; moreover, none of these approaches study vision-and-language models. This paper presents VLSlice, an interactive system enabling user-guided discovery of coherent representation-level subgroups with consistent visiolinguistic behavior, denoted vision-and-language slices, from unlabeled image sets. In a user study (n=22), we show that VLSlice enables users to quickly generate diverse, high-coherency slices, and we release the tool publicly.
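    A simplified sketch of representation-level slice discovery follows: cluster image embeddings and inspect clusters whose members behave consistently with respect to a probe text. This is a rough approximation of the workflow VLSlice supports interactively, not the tool's implementation; the embeddings are assumed to come from any vision-language encoder.

```python
# Sketch of representation-level subgroup ("slice") discovery: cluster
# image embeddings, then score each cluster's affinity to a probe text.
# A simplification of what VLSlice supports interactively, not the tool itself.
import numpy as np
from sklearn.cluster import KMeans

def discover_slices(image_embeddings: np.ndarray, text_embedding: np.ndarray,
                    n_slices: int = 10):
    # Group images by similarity in representation space.
    labels = KMeans(n_clusters=n_slices, n_init=10).fit_predict(image_embeddings)
    slices = []
    for k in range(n_slices):
        members = image_embeddings[labels == k]
        # Cosine affinity of each member to the probe text (e.g., a caption).
        affinity = members @ text_embedding / (
            np.linalg.norm(members, axis=1) * np.linalg.norm(text_embedding))
        slices.append({"cluster": k, "size": len(members),
                       "mean_affinity": float(affinity.mean())})
    # Slices with unusually high or low affinity are candidates for human review.
    return sorted(slices, key=lambda s: s["mean_affinity"])
```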
    Identifying Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm Reduction
    Shalaleh Rismani
    Kathryn Henne
    AJung Moon
    Paul Nicholas
    N'Mah Yilla-Akbari
    Jess Gallegos
    Emilio Garcia
    Gurleen Virk
    Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, Association for Computing Machinery, pp. 723–741
    Abstract: Understanding the broader landscape of potential harms from algorithmic systems enables practitioners to better anticipate consequences of the systems they build. It also supports the prospect of incorporating controls to help minimize harms that emerge from the interplay of technologies and social and cultural dynamics. A growing body of scholarship has identified a wide range of harms across different algorithmic and machine learning (ML) technologies. However, computing researchers and practitioners lack a high-level, synthesized overview of harms from algorithmic systems arising at the micro-, meso-, and macro-levels of society. We present an applied taxonomy of sociotechnical harms to support more systematic surfacing of potential harms in algorithmic systems. Based on a scoping review of prior research on harms from AI systems (n=172), we identified five major themes related to sociotechnical harms: allocative, quality-of-service, representational, social system, and interpersonal harms. We describe these categories of harm and present case studies that illustrate the usefulness of the taxonomy. We conclude with a discussion of challenges and under-explored areas of harm in the literature, which present opportunities for future research.
    The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks
    Nikil Selvam
    Daniel Khashabi
    Tushar Khot
    Kai-Wei Chang
    ACL (2023)
    Abstract: How reliably can we trust the scores obtained from social bias benchmarks as faithful indicators of problematic social biases in a given model? In this work, we study this question by contrasting social biases with non-social biases that might not even be discernible to the human eye. To do so, we empirically simulate various alternative constructions of a given benchmark based on innocuous modifications (such as paraphrasing or random sampling) that maintain the essence of its social bias. On two well-known social bias benchmarks (Winogender (Rudinger et al., 2019) and BiasNLI (Dev et al., 2020)), we observe that the choice of these shallow modifications has a surprising effect on the resulting degree of bias across various models. We hope these troubling observations motivate more robust measures of social biases.
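    The core experimental pattern the abstract describes can be sketched as follows: apply meaning-preserving perturbations to a benchmark and check how much the measured bias score moves. This is an illustrative skeleton under stated assumptions, not the paper's code; paraphrase and bias_score are hypothetical stand-ins for a rewriter and the benchmark's own scoring procedure.

```python
# Sketch of the perturb-and-remeasure pattern: if innocuous changes to a
# benchmark swing the bias score widely, the score reflects dataset
# construction as much as the model. `paraphrase` and `bias_score` are
# hypothetical placeholders.
import random

def paraphrase(example: str) -> str:
    raise NotImplementedError("any meaning-preserving rewriter")

def bias_score(model, benchmark: list) -> float:
    raise NotImplementedError("the benchmark's own scoring procedure")

def score_under_perturbations(model, benchmark, n_trials=10, subsample=0.8):
    scores = []
    for _ in range(n_trials):
        # Innocuous modification 1: random subsampling of the benchmark.
        sample = random.sample(benchmark, int(subsample * len(benchmark)))
        # Innocuous modification 2: paraphrasing each example.
        sample = [paraphrase(ex) for ex in sample]
        scores.append(bias_score(model, sample))
    # A large spread suggests the benchmark's verdict is an artifact of
    # its construction rather than a stable property of the model.
    return min(scores), max(scores)
```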
    Abstract: Measurements of fairness in NLP have been critiqued for lacking concrete definitions of the biases or harms measured, and for perpetuating a singular, Western narrative of fairness globally. To combat some of these pivotal issues, methods for curating datasets and benchmarks that target specific harms are rapidly emerging. However, these methods still face the significant challenge of achieving coverage over global cultures and perspectives at scale. To address this, we highlight the utility and importance of complementary approaches in these curation strategies that leverage both community engagement and large generative models. We specifically target the harm of stereotyping and demonstrate a pathway to build a benchmark that covers stereotypes about diverse and intersectional identities.
    Machine learning for healthcare: A bibliometric study of contributions from Africa
    Houcemeddine Turki
    Anastassios Pouris
    Francis-Alfred Michaelangelo Ifeanyichukwu
    Catherine Namayega
    Mohamed Ali Hadj Taieb
    Sadiq Adewale Adedayo
    Chris Fourie
    Christopher Brian Currin
    Atnafu Lambebo Tonja
    Abraham Toluwase Owodunni
    Abdulhameed Dere
    Chris Chinenye Emezue
    Shamsuddeen Hassan Muhammad
    Muhammad Musa Isa
    Mohamed Ben Aouicha
    Preprints (2023)
    Abstract: Machine learning has seen enormous growth in the last decade, with healthcare being a prime application for advanced diagnostics and improved patient care. The application of machine learning for healthcare is particularly pertinent in Africa, where many countries are resource-scarce. However, it is unclear how much research on this topic is arising from African institutes themselves, which is a crucial aspect for applications of machine learning to unique contexts and challenges on the continent. Here, we conduct a bibliometric study of African contributions to research publications related to machine learning for healthcare, as indexed in Scopus, between 1993 and 2022. We identified 3,772 research outputs, with most of these published since 2020. North African countries currently lead the way with 64.5% of publications for the reported period, yet Sub-Saharan Africa is rapidly increasing its output. We found that international support in the form of funding and collaborations is correlated with research output generally for the continent, with local support garnering less attention. Understanding African research contributions to machine learning for healthcare is a crucial first step in surveying the broader academic landscape, forming stronger research communities, and providing advanced and contextually aware biomedical access to Africa.
    Auditing Gender Presentation Differences in Text-to-Image Models
    Yanzhe Zhang
    Lu Jiang
    Greg Turk
    Diyi Yang
    (2023) (to appear)
    Abstract: Text-to-image models, which can generate high-quality images based on textual input, have recently enabled various content-creation tools. Despite significantly affecting a wide range of downstream applications, the distributions of the images they generate are still not comprehensively understood, especially with respect to the potential stereotypical attributes of different genders. In this work, we propose a paradigm that utilizes fine-grained self-presentation attributes to study how different genders are presented in text-to-image models, a phenomenon we term Gender Presentation Differences. By probing gender indicators in the input text (e.g., "a woman" or "a man"), we quantify the frequency differences of human-centric attributes (e.g., "a shirt" and "a dress") through human annotation and introduce two novel metrics: the GEP (GEnder Presentation differences) vector and the GEP score. Furthermore, the proposed automatic estimation of the two metrics correlates better with human annotations than existing CLIP-based measures, consistently across three state-of-the-art text-to-image models. Finally, we demonstrate that our metrics can generalize to gender/racial stereotypes related to occupations.
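    The frequency-difference idea behind the GEP vector can be sketched simply: for each attribute, compare how often it is annotated in images generated from "a woman ..." prompts versus "a man ..." prompts. The sketch below is a minimal illustration under that reading; the paper's exact normalization and the aggregation into the GEP score are not reproduced here.

```python
# Sketch of the frequency-difference idea behind the GEP vector. For each
# attribute, compute annotation frequency under each gender-indicator
# prompt set and take the difference. Illustrative only; not the paper's
# exact normalization or score aggregation.
import numpy as np

def gep_vector(attr_counts_woman: dict, attr_counts_man: dict,
               n_woman: int, n_man: int, attributes: list) -> np.ndarray:
    # Per-attribute annotation frequency for images from "a woman" prompts
    # versus "a man" prompts.
    freqs_w = np.array([attr_counts_woman.get(a, 0) / n_woman for a in attributes])
    freqs_m = np.array([attr_counts_man.get(a, 0) / n_man for a in attributes])
    return freqs_w - freqs_m

# Example: with attributes = ["shirt", "dress"], a large positive entry for
# "dress" means it appears far more often in images generated from
# "a woman" prompts than from "a man" prompts.
```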