Carrie Jun Cai

My research aims to make artificial intelligence systems usable to human beings, so that human-AI interactions are more productive, enjoyable, and fair. I believe AI systems should be designed to augment human agency, and thus approach this process by considering the capabilities and limits of human intelligence. Before joining Google, I did my PhD research in the User Interface Design group at MIT, where I built "wait-learning" tools to help people practice desired skills in short chunks while waiting, thereby making use of fleeting moments in the day.
Authored Publications
    "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output
    Michael Xieyang Liu
    Frederick Liu
    Alex Fiannaca
    Terry Koo
    Extended Abstracts of the ACM CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024), pp. 9 (to appear)
    Abstract: Large language models can produce creative and diverse responses. However, to integrate them into current developer workflows, it is essential to constrain their outputs to follow specific formats or standards. In this work, we surveyed 51 experienced industry professionals to understand the range of scenarios and motivations driving the need for output constraints from a user-centered perspective. We identified 134 concrete use cases for constraints at two levels: low-level, which ensures the output adheres to a structured format and an appropriate length, and high-level, which requires the output to follow semantic and stylistic guidelines without hallucination. Critically, applying output constraints could not only streamline the currently repetitive process of developing, testing, and integrating LLM prompts for developers, but also enhance the user experience of LLM-powered features and applications. We conclude with a discussion on user preferences and needs towards articulating intended constraints for LLMs, alongside an initial design for a constraint prototyping tool.
    Programming without a Programming Language: Challenges and Opportunities for Designing Developer Tools for Prompt Programming
    Alex Fiannaca
    Chinmay Kulkarni
    Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems (CHI EA ’23), ACM, Hamburg, Germany (2023) (to appear)
    Abstract: Existing tools for prompt programming provide little support to prompt programmers. Consequently, as prompts become more complex, they can be hard to read, understand, and edit. In this work, we draw on modern integrated development environments for traditional programming to improve the editor experience of prompt programming. We describe methods for understanding the semantically meaningful structure of natural language prompts in the absence of a rigid formal grammar, and demonstrate a range of editor features that can leverage this information to assist prompt programmers. Finally, we relate initial feedback from design probe explorations with a set of domain experts and provide insights to help guide the development of future prompt editors.
    PromptInfuser: Bringing User Interface Mock-ups to Life with Large Language Model Prompts
    Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery (to appear)
    Abstract: Large Language Models have enabled novices without machine learning (ML) experience to quickly prototype ML functionalities with prompt programming. This paper investigates incorporating prompt-based prototyping into designing functional user interface (UI) mock-ups. To understand how infusing LLM prompts into UI mock-ups might affect the prototyping process, we conduct an exploratory study with five designers, and find that this capability might significantly speed up creating functional prototypes, inform designers earlier on how their designs will integrate ML, and enable user studies with functional prototypes earlier. From these findings, we built PromptInfuser, a Figma plugin for authoring LLM-infused mock-ups. PromptInfuser introduces two novel LLM-interactions: input-output, which makes content interactive and dynamic, and frame-change, which directs users to different frames depending on their natural language input. From initial observations, we find that PromptInfuser has the potential to transform the design process by tightly integrating UI and AI prototyping in a single interface.
    Generative Agents: Interactive Simulacra of Human Behavior
    Joon Sung Park
    Joseph C. O'Brien
    Percy Liang
    Michael Bernstein
    Proceedings of UIST 2023, ACM (2023)
    Abstract: Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents--computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty-five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture--observation, planning, and reflection--each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.
    The Prompt Artists
    Stefania Druga
    Alex Fiannaca
    Pedro Vergani
    Chinmay Kulkarni
    Creativity and Cognition 2023 (2023)
    Abstract: In this paper, we present the results of a study examining the art practices, artwork, and motivations of prolific users of the latest generation of text-to-image models. Through interviews, observations, and a survey, we present a sampling of the artistic styles, and describe the developed community of practice. We find that: 1) the text prompt and resulting image collectively can be considered the art piece (prompts as art), and 2) prompt templates (prompts with “slots” for others to fill in with their own words) are developed to create generative art pieces. We also find that this community’s premium on unique outputs leads to artists seeking specialized vocabulary to produce distinctive art pieces (e.g., by going to architectural blogs), while others look for “glitches” in the model that can turn into artistic styles in their own right. From these findings, we outline specific implications for design.
    The Design Space of Generative Models
    Jess Scon Holbrook
    Chinmay Kulkarni
    NeurIPS 2022 Human-Centered AI Workshop (2022) (to appear)
    Abstract: Card et al.’s classic paper "The Design Space of Input Devices" established the value of design spaces as a tool for HCI analysis and invention. We posit that developing design spaces for emerging pre-trained, general AI models is necessary for supporting their integration into human-centered systems and practices. We explore what it means to develop an AI model design space by proposing two design spaces relating to pre-trained AI models: the first considers how HCI can impact pre-trained models (i.e., interfaces for models) and the second considers how pre-trained models can impact HCI (i.e., models as an HCI prototyping material).
    Social Simulacra: Creating Populated Prototypes for Social Computing Systems
    Joon Sung Park
    Lindsay Popowski
    Percy Liang
    Michael S. Bernstein
    Proceedings of UIST 2022, ACM (2022) (to appear)
    Abstract: Prototyping techniques for social computing systems often recruit small groups to test a design, but many challenges that threaten the norms and moderation standards do not arise until a design achieves a larger scale. Can a designer understand how a social system might behave when later populated, and make adjustments before the system falls prey to such challenges? We introduce social simulacra, a technique enabling early prototyping of social computing systems by generating a breadth of possible social interactions that may emerge when the system is populated. Our implementation of social simulacra translates the designer’s description of a community’s goal, rules, and member personas into a set of posts, replies, and anti-social behaviors; shifts these behaviors appropriately in response to design changes; and enables exploration of "what if?" scenarios where community members or moderators intervene. We contribute techniques for prompting a large language model to generate such social interactions, drawing on the observation that large language models have consumed a wide variety of these behaviors on the public web. In evaluations, we show that participants were often unable to distinguish social simulacra from actual community behavior, and that social computing designers could use them to iterate on their designs.
    Abstract: Prototyping is notoriously difficult to do with machine learning (ML), but recent advances in large language models may lower the barriers to people prototyping with ML, through the use of natural language prompts. This case study reports on the real-world experiences of industry professionals (e.g. designers, program managers, front-end developers) prototyping new ML-powered feature ideas via prompt-based prototyping. Through interviews with eleven practitioners during a three-week sprint and a workshop, we find that prompt-based prototyping reduced barriers of access by substantially broadening who can prototype with ML, sped up the prototyping process, and grounded communication between collaborators. Yet, it also introduced new challenges, such as the need to reverse-engineer prompt designs, source example data, and debug and evaluate prompt effectiveness. Taken together, this case study provides important implications that lay the groundwork toward a new future of prototyping with ML.
    Abstract: In this paper, we present a natural language code synthesis tool, GenLine, backed by a large generative language model and a set of task-specific prompts. To understand the user experience of natural language code synthesis with these types of models, we conducted a user study in which participants applied GenLine to two programming tasks. Our results indicate that while natural language code synthesis can sometimes provide a magical experience, participants still faced challenges. In particular, participants felt that they needed to learn the model’s "syntax," despite their input being natural language. Participants also faced challenges in debugging model input, and demonstrated a wide range of variability in the scope and specificity of their requests. From these findings, we discuss design implications for future natural language code synthesis tools built using generative language models.
    Onboarding Materials as Cross-functional Boundary Objects for Developing AI Assistants
    Lauren Wilcox
    Samantha Winter
    Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, ACM (2021) (to appear)
    Abstract: Deep neural networks (DNNs) routinely achieve state-of-the-art performance in a wide range of tasks. This case study reports on the development of onboarding (i.e., training) materials for a DNN-based medical AI Assistant to aid in the grading of prostate cancer. Specifically, we describe how the process of developing these materials deepened the team's understanding of end-user requirements, leading to changes in the development and assessment of the underlying machine learning model. In this sense, the onboarding materials served as a useful boundary object for a cross-functional team. We also present evidence of the utility of the subsequent onboarding materials by describing which information was found useful by participants in an experimental study.
    Expert Discussions Improve Comprehension of Difficult Cases in Medical Image Assessment
    Abigail E. Huang
    ACM CHI Conference on Human Factors in Computing Systems (CHI 2020) (2020) (to appear)
    Abstract: Medical data labeling workflows critically depend on accurate assessments from human experts. Yet human assessments can vary markedly, even among medical experts. Prior research has demonstrated benefits of labeler training on performance. Here we utilized two types of labeler training feedback: highlighting incorrect labels for difficult cases ("individual performance" feedback), and expert discussions from adjudication of these cases. We presented ten non-specialist eye care professionals with either individual performance alone, or individual performance and expert discussions. Compared to performance feedback alone, seeing expert discussions significantly improved non-specialists' understanding of the rationale behind the correct diagnosis while motivating changes in their own labeling approach; and also significantly improved average accuracy on one of four pathologies in a held-out test set. This work suggests that image adjudication may provide benefits beyond developing trusted consensus labels, and that exposure to specialist discussions can be an effective training intervention for medical diagnosis.
    AI Song Contest: Human-AI Co-Creation in Songwriting
    Hendrik Vincent Koops
    Ed Newton-Rex
    Monica Dinculescu
    Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR) (2020)
    Abstract: Machine learning is challenging the way we make music. Although research in deep generative models has dramatically improved the capability and fluency of music models, recent work has shown that it can be challenging for humans to partner with this new class of algorithms. In this paper, we present findings on what 13 musician/developer teams, a total of 61 users, needed when co-creating a song with AI, the challenges they faced, and how they leveraged and repurposed existing characteristics of AI to overcome some of these challenges. Many teams adopted modular approaches, such as independently running multiple smaller models that align with the musical building blocks of a song, before re-combining their results. As ML models are not easily steerable, teams also generated massive numbers of samples and curated them post-hoc, or used a range of strategies to direct the generation or algorithmically ranked the samples. Ultimately, teams not only had to manage the "flare and focus" aspects of the creative process, but also juggle that with a parallel process of exploring and curating multiple ML models and outputs. These findings reflect a need to design machine learning-powered music interfaces that are more decomposable, steerable, interpretable, and adaptive, which in return will enable artists to more effectively explore how AI can extend their personal expression.
    Evaluation of the Use of Combined Artificial Intelligence and Pathologist Assessment to Review and Grade Prostate Biopsies
    Kunal Nagpal
    Davis J. Foote
    Adam Pearce
    Samantha Winter
    Matthew Symonds
    Liron Yatziv
    Trissia Brown
    Isabelle Flament-Auvigne
    Fraser Tan
    Martin C. Stumpe
    Cameron Chen
    Craig Mermel
    JAMA Network Open (2020)
    Abstract: Importance: Expert-level artificial intelligence (AI) algorithms for prostate biopsy grading have recently been developed. However, the potential impact of integrating such algorithms into pathologist workflows remains largely unexplored. Objective: To evaluate an expert-level AI-based assistive tool when used by pathologists for the grading of prostate biopsies. Design, Setting, and Participants: This diagnostic study used a fully crossed multiple-reader, multiple-case design to evaluate an AI-based assistive tool for prostate biopsy grading. Retrospective grading of prostate core needle biopsies from 2 independent medical laboratories in the US was performed between October 2019 and January 2020. A total of 20 general pathologists reviewed 240 prostate core needle biopsies from 240 patients. Each pathologist was randomized to 1 of 2 study cohorts. The 2 cohorts reviewed every case in the opposite modality (with AI assistance vs without AI assistance) to each other, with the modality switching after every 10 cases. After a minimum 4-week washout period for each batch, the pathologists reviewed the cases for a second time using the opposite modality. The pathologist-provided grade group for each biopsy was compared with the majority opinion of urologic pathology subspecialists. Exposure: An AI-based assistive tool for Gleason grading of prostate biopsies. Main Outcomes and Measures: Agreement between pathologists and subspecialists with and without the use of an AI-based assistive tool for the grading of all prostate biopsies and Gleason grade group 1 biopsies. Results: Biopsies from 240 patients (median age, 67 years; range, 39-91 years) with a median prostate-specific antigen level of 6.5 ng/mL (range, 0.6-97.0 ng/mL) were included in the analyses. Artificial intelligence–assisted review by pathologists was associated with a 5.6% increase (95% CI, 3.2%-7.9%; P < .001) in agreement with subspecialists (from 69.7% for unassisted reviews to 75.3% for assisted reviews) across all biopsies and a 6.2% increase (95% CI, 2.7%-9.8%; P = .001) in agreement with subspecialists (from 72.3% for unassisted reviews to 78.5% for assisted reviews) for grade group 1 biopsies. A secondary analysis indicated that AI assistance was also associated with improvements in tumor detection, mean review time, mean self-reported confidence, and interpathologist agreement. Conclusions and Relevance: In this study, the use of an AI-based assistive tool for the review of prostate biopsies was associated with improvements in the quality, efficiency, and consistency of cancer detection and grading.
    "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making
    Samantha Winter
    Lauren Wilcox
    Proc. ACM Hum.-Comput. Interact., Association for Computing Machinery, ACM CSCW, New York, NY, USA (2019), pp. 24 (to appear)
    Abstract: Although rapid advances in machine learning have made it increasingly applicable to expert decision-making, the delivery of accurate algorithmic predictions alone is insufficient for effective human–AI collaboration. In this work, we investigate the key types of information medical experts desire when they are first introduced to a diagnostic AI assistant. In a qualitative lab study, we interviewed 21 pathologists before, during, and after being presented deep neural network (DNN) predictions for prostate cancer diagnosis, to learn the types of information that they desired about the AI assistant. Our findings reveal that, far beyond understanding the local, case-specific reasoning behind any model decision, clinicians desired upfront information about basic, global properties of the model, such as its known strengths and limitations, its subjective point-of-view, and its overall design objective: what it’s designed to be optimized for. Participants compared these information needs to the collaborative mental models they develop of their medical colleagues when seeking a second opinion: the medical perspectives and standards that those colleagues embody, and the compatibility of those perspectives with their own diagnostic patterns. These findings broaden and enrich discussions surrounding AI transparency for collaborative decision-making, providing a richer understanding of what experts find important in their introduction to AI assistants before integrating them into routine practice.
    The Effects of Example-Based Explanations in a Machine Learning Interface
    Jonas Jongejan
    Jess Scon Holbrook
    International Conference on Intelligent User Interfaces (2019)
    Abstract: The black-box nature of machine learning algorithms can make their predictions difficult to understand and explain to end-users. In this paper, we propose and evaluate two kinds of example-based explanations in the visual domain, normative explanations and comparative explanations (Figure 1), which automatically surface examples from the training set of a deep neural net sketch-recognition algorithm. To investigate their effects, we deployed these explanations to 1150 users on QuickDraw, an online platform where users draw images and see whether a recognizer has correctly guessed the intended drawing. When the algorithm failed to recognize the drawing, those who received normative explanations felt they had a better understanding of the system, and perceived the system to have higher capability. However, comparative explanations did not always improve perceptions of the algorithm, possibly because they sometimes exposed limitations of the algorithm and may have led to surprise. These findings suggest that examples can serve as a vehicle for explaining algorithmic behavior, but point to relative advantages and disadvantages of using different kinds of examples, depending on the goal.
    Abstract: Machine learning (ML) is increasingly being used in image retrieval systems for medical decision making. One application of ML is to retrieve visually similar medical images from past patients (e.g. tissue from biopsies) to reference when making a medical decision with a new patient. However, no algorithm can perfectly capture an expert's ideal notion of similarity for every case: an image that is algorithmically determined to be similar may not be medically relevant to a doctor's specific diagnostic needs. In this paper, we identified the needs of pathologists when searching for similar images retrieved using a deep learning algorithm, and developed tools that empower users to cope with the search algorithm on-the-fly, communicating what types of similarity are most important at different moments in time. In two evaluations with pathologists, we found that these refinement tools increased the diagnostic utility of images found and increased user trust in the algorithm. The tools were preferred over a traditional interface, without a loss in diagnostic accuracy. We also observed that users adopted new strategies when using refinement tools, re-purposing them to test and understand the underlying algorithm and to disambiguate ML errors from their own errors. Taken together, these findings inform future human-ML collaborative systems for expert decision-making.
    Similar Image Search for Histopathology: SMILY
    Jason Hipp
    Michael Emmert-Buck
    Daniel Smilkov
    Mahul Amin
    Craig Mermel
    Lily Peng
    Martin Stumpe
    Nature Partner Journal (npj) Digital Medicine (2019)
    Abstract: The increasing availability of large institutional and public histopathology image datasets is enabling the searching of these datasets for diagnosis, research, and education. Although these datasets typically have associated metadata such as diagnosis or clinical notes, even carefully curated datasets rarely contain annotations of the location of regions of interest on each image. As pathology images are extremely large (up to 100,000 pixels in each dimension), further laborious visual search of each image may be needed to find the feature of interest. In this paper, we introduce a deep-learning-based reverse image search tool for histopathology images: Similar Medical Images Like Yours (SMILY). We assessed SMILY’s ability to retrieve search results in two ways: using pathologist-provided annotations, and via prospective studies where pathologists evaluated the quality of SMILY search results. As a negative control in the second evaluation, pathologists were blinded to whether search results were retrieved by SMILY or randomly. In both types of assessments, SMILY was able to retrieve search results with similar histologic features, organ site, and prostate cancer Gleason grade compared with the original query. SMILY may be a useful general purpose tool in the pathologist’s arsenal, to improve the efficiency of searching large archives of histopathology images, without the need to develop and implement specific tools for each application.
    Software Developers Learning Machine Learning: Motivations, Hurdles, and Desires
    Philip Guo
    IEEE Symposium on Visual Language and Human-Centric Computing (VL/HCC) (2019)
    Abstract: The growing popularity of machine learning (ML) has led more software developers to want to adopt ML into their own practices, through tinkering with and learning from ML framework websites and online code examples. To investigate the motivations, hurdles, and desires of these software developers, we deployed a survey to the website of the TensorFlow.js ML framework. We found via 645 responses that many wanted to learn ML for aspirational reasons rather than for immediate job needs. Critically, developers faced hurdles due to a perceived lack of mathematical and theoretical background. They desired frameworks to provide more basic ML conceptual support, such as a curated corpus of best practices, conceptual tutorials, and a de-mystification of mathematical jargon into practical tips. These findings inform the design of ML frameworks and informal learning resources to broaden the base of people acquiring this increasingly important skill set.
    Abstract: The interpretation of deep learning models is a challenge due to their size, complexity, and often opaque internal state. In addition, many systems, such as image classifiers, operate on low-level features rather than high-level concepts. To address these challenges, we introduce Concept Activation Vectors (CAVs), which provide an interpretation of a neural net's internal state in terms of human-friendly concepts. The key idea is to view the high-dimensional internal state of a neural net as an aid, not an obstacle. We show how to use CAVs as part of a technique, Testing with CAVs (TCAV), that uses directional derivatives to quantify the degree to which a user-defined concept is important to a classification result--for example, how sensitive a prediction of “zebra” is to the presence of stripes. Using the domain of image classification as a testing ground, we describe how CAVs may be used to explore hypotheses and generate insights for a standard image classification network as well as a medical application.