Michael Terry

Authored Publications
    LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
    Michael Xieyang Liu
    Krystal Kallarackal
    Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024)
    Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at Google. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.
    "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output
    Michael Xieyang Liu
    Frederick Liu
    Alex Fiannaca
    Terry Koo
    Extended Abstracts of the ACM CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024), pp. 9 (to appear)
    Large language models can produce creative and diverse responses. However, to integrate them into current developer workflows, it is essential to constrain their outputs to follow specific formats or standards. In this work, we surveyed 51 experienced industry professionals to understand the range of scenarios and motivations driving the need for output constraints from a user-centered perspective. We identified 134 concrete use cases for constraints at two levels: low-level, which ensures the output adheres to a structured format and an appropriate length, and high-level, which requires the output to follow semantic and stylistic guidelines without hallucination. Critically, applying output constraints could not only streamline the currently repetitive process of developing, testing, and integrating LLM prompts for developers, but also enhance the user experience of LLM-powered features and applications. We conclude with a discussion on user preferences and needs towards articulating intended constraints for LLMs, alongside an initial design for a constraint prototyping tool.
    “The less I type, the better”: How AI Language Models can Enhance or Impede Communication for AAC Users
    Stephanie Valencia
    Richard Cave
    Krystal Kallarackal
    Katie Seaver
    ACM Conference on Human Factors in Computing Systems (ACM CHI) 2023, ACM (2023) (to appear)
    Users of augmentative and alternative communication (AAC) devices sometimes find it difficult to communicate in real time with others due to the time it takes to compose messages. AI technologies such as large language models (LLMs) provide an opportunity to support AAC users by improving the quality and variety of text suggestions. However, these technologies may fundamentally change how users interact with AAC devices as users transition from typing their own phrases to prompting and selecting AI-generated phrases. We conducted a study in which 12 AAC users tested live suggestions from a language model across three usage scenarios: extending short replies, answering biographical questions, and requesting assistance. Our study participants believed that AI-generated phrases could save time, physical and cognitive effort when communicating, but felt it was important that these phrases reflect their own communication style and preferences. This work identifies opportunities and challenges for future AI-enhanced AAC devices.
    Designing Responsible AI: Adaptations of UX Practice to Meet Responsible AI Challenges
    Qiaosi Wang
    Michael Adam Madaio
    Shivani Kapania
    Lauren Wilcox
    ACM Conference on Human Factors in Computing Systems (ACM CHI) 2023, ACM (2023)
    The shift towards Responsible AI (RAI) in the tech industry necessitates new practices and adaptations to roles. To understand practices at the intersection of user experience (UX) and RAI, we conducted an interview study with industrial UX practitioners and RAI subject matter experts, both of whom are actively involved in addressing RAI concerns, both early in and throughout the development of new AI-based prototypes, demos, and products. Many of the specific practices and their associated challenges have yet to be surfaced, and distilling them offers a critical view into how practitioners' roles are adapting to meet present-day RAI challenges. We present and discuss three emerging practices in which RAI is being enacted and reified in UX work. We conclude by arguing that the emerging practices, goals, and types of expertise that surfaced in our study point to an evolution in praxis that suggests important areas for further research in HCI.
    PromptInfuser: Bringing User Interface Mock-ups to Life with Large Language Model Prompts
    Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery (to appear)
    Large Language Models have enabled novices without machine learning (ML) experience to quickly prototype ML functionalities with prompt programming. This paper investigates incorporating prompt-based prototyping into designing functional user interface (UI) mock-ups. To understand how infusing LLM prompts into UI mock-ups might affect the prototyping process, we conduct an exploratory study with five designers, and find that this capability might significantly speed up creating functional prototypes, inform designers earlier on how their designs will integrate ML, and enable user studies with functional prototypes earlier. From these findings, we built PromptInfuser, a Figma plugin for authoring LLM-infused mock-ups. PromptInfuser introduces two novel LLM-interactions: input-output, which makes content interactive and dynamic, and frame-change, which directs users to different frames depending on their natural language input. From initial observations, we find that PromptInfuser has the potential to transform the design process by tightly integrating UI and AI prototyping in a single interface.
    Programming with a Programming Language: Challenges and Opportunities for Designing Developer Tools for Prompt Programming
    Alex Fiannaca
    Chinmay Kulkarni
    Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems (CHI EA ’23), ACM, Hamburg, Germany (2023) (to appear)
    Existing tools for prompt programming provide little support to prompt programmers. Consequently, as prompts become more complex, they can be hard to read, understand, and edit. In this work, we draw on modern integrated development environments for traditional programming to improve the editor experience of prompt programming. We describe methods for understanding the semantically meaningful structure of natural language prompts in the absence of a rigid formal grammar, and demonstrate a range of editor features that can leverage this information to assist prompt programmers. Finally, we relate initial feedback from design probe explorations with a set of domain experts and provide insights to help guide the development of future prompt editors.
    The Prompt Artists
    Stefania Druga
    Alex Fiannaca
    Pedro Vergani
    Chinmay Kulkarni
    Creativity and Cognition 2023 (2023)
    In this paper, we present the results of a study examining the art practices, artwork, and motivations of prolific users of the latest generation of text-to-image models. Through interviews, observations, and a survey, we present a sampling of the artistic styles, and describe the developed community of practice. We find that: 1) the text prompt and resulting image collectively can be considered the art piece (prompts as art), and 2) prompt templates (prompts with “slots” for others to fill in with their own words) are developed to create generative art pieces. We also find that this community’s premium on unique outputs leads to artists seeking specialized vocabulary to produce distinctive art pieces (e.g., by going to architectural blogs), while others look for “glitches” in the model that can turn into artistic styles in their own right. From these findings, we outline specific implications for design.
    The Design Space of Generative Models
    Jess Scon Holbrook
    Chinmay Kulkarni
    NeurIPS 2022 Human-Centered AI Workshop (2022) (to appear)
    Card et al.’s classic paper "The Design Space of Input Devices" established the value of design spaces as a tool for HCI analysis and invention. We posit that developing design spaces for emerging pre-trained, general AI models is necessary for supporting their integration into human-centered systems and practices. We explore what it means to develop an AI model design space by proposing two design spaces relating to pre-trained AI models: the first considers how HCI can impact pre-trained models (i.e., interfaces for models) and the second considers how pre-trained models can impact HCI (i.e., models as an HCI prototyping material).
    In this paper, we present a natural language code synthesis tool, GenLine, backed by a large generative language model and a set of task-specific prompts. To understand the user experience of natural language code synthesis with these types of models, we conducted a user study in which participants applied GenLine to two programming tasks. Our results indicate that while natural language code synthesis can sometimes provide a magical experience, participants still faced challenges. In particular, participants felt that they needed to learn the model’s "syntax," despite their input being natural language. Participants also faced challenges in debugging model input, and demonstrated a wide range of variability in the scope and specificity of their requests. From these findings, we discuss design implications for future natural language code synthesis tools built using generative language models.
    Prototyping is notoriously difficult to do with machine learning (ML), but recent advances in large language models may lower the barriers to people prototyping with ML, through the use of natural language prompts. This case study reports on the real-world experiences of industry professionals (e.g. designers, program managers, front-end developers) prototyping new ML-powered feature ideas via prompt-based prototyping. Through interviews with eleven practitioners during a three-week sprint and a workshop, we find that prompt-based prototyping reduced barriers of access by substantially broadening who can prototype with ML, sped up the prototyping process, and grounded communication between collaborators. Yet, it also introduced new challenges, such as the need to reverse-engineer prompt designs, source example data, and debug and evaluate prompt effectiveness. Taken together, this case study provides important implications that lay the groundwork toward a new future of prototyping with ML.
    Guided Integrated Gradients: An Adaptive Path Method for Removing Noise
    Besim Namik Avci
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 5050-5058
    Integrated Gradients (IG) is a commonly used feature attribution method for deep neural networks. While IG has many desirable properties, when applied to visual models, the method often produces spurious/noisy pixel attributions in regions that are not related to the predicted class. While this has been previously noted, most existing solutions are aimed at addressing the symptoms by explicitly reducing the noise in the resulting attributions. In this work, we show that one of the causes of the problem is the presence of "adversarial examples" along the IG path. To minimize the effect of adversarial examples on attributions, we propose adapting the attribution path itself. We introduce Adaptive Path Methods (APMs), as a generalization of path methods, and Guided IG as a specific instance of an APM. Empirically, Guided IG creates saliency maps better aligned with the model's prediction and the input image that is being explained. We show through qualitative and quantitative experiments that Guided IG outperforms IG on ImageNet, Open Images, and diabetic retinopathy medical images.
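    For readers unfamiliar with the baseline method this abstract builds on, the following is a minimal sketch of plain Integrated Gradients, which accumulates gradients along a straight-line path from a baseline to the input. It is not Guided IG: the adaptive path is not specified in the abstract and is not reproduced here. The function name `integrated_gradients`, the `grad_fn` callback, and the toy function are illustrative assumptions, and the integral is approximated with a midpoint Riemann sum.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate plain Integrated Gradients along a straight-line path.

    grad_fn:  hypothetical callback returning the gradient of the model
              output with respect to its input, as a list of floats.
    x:        the input being explained (list of floats).
    baseline: the reference input, e.g. all zeros (list of floats).
    """
    accumulated = [0.0] * len(x)
    for k in range(steps):
        # Midpoint of the k-th sub-interval of [0, 1].
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            accumulated[i] += g / steps
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * a for xi, b, a in zip(x, baseline, accumulated)]


# Toy model (an assumption for illustration): f(x) = x0^2 + 3 * x1.
def toy_model(x):
    return x[0] ** 2 + 3.0 * x[1]


def toy_gradient(x):
    return [2.0 * x[0], 3.0]


attributions = integrated_gradients(toy_gradient, [2.0, 1.0], [0.0, 0.0])
```

    A useful sanity check on any such sketch is the completeness property of path methods: the attributions should sum to approximately `toy_model(x) - toy_model(baseline)`.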
    Program Synthesis with Large Language Models
    Augustus Odena
    David Martin Dohan
    Ellen Jiang
    Henryk Michalewski
    Maarten Paul Bosma
    Maxwell Nye
    (2021)
    Program synthesis is one of the grand challenges of artificial intelligence, but to date practical successes have focused on narrow settings and restricted domains. Large language models trained on massive corpora of web texts which include open-source code, programming websites, and tutorials have the potential to break through this barrier. This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate the performance of the language model LaMDA PT [Freitas et al., 2021] on several program synthesis tasks, at a variety of scales ranging from 244M to 137B parameters. First, we introduce a new benchmark, Mostly Basic Programming Problems (MBPP), to measure the ability of these models to synthesize short Python programs from natural language descriptions. The benchmark consists of around 1000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and automated test-cases. We also introduce a Python version of the MathQA benchmark, which evaluates the ability of the models to synthesize code from more complex text. On both datasets, we evaluate synthesis performance and find that synthesis performance scales log-linearly with model size. In contrast to some previous work, we find that LaMDA PT achieves non-negligible performance in a few-shot setting, although fine-tuning still performs much better. The largest models we consider can synthesize solutions to 58% of the problems from MBPP using few-shot learning with a well-designed prompt; across model sizes, fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points. Finally, we conduct a thorough error analysis, shedding light on where these models fall short as program synthesizers, what types of programs are most difficult to generate, and how the models might be improved. As part of that analysis, we explore the semantic grounding of these models, finding that even our largest models are generally unable to predict the output of a program given a specific input.
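    The abstract describes each benchmark problem as a task description, a code solution, and automated test cases. The sketch below shows one plausible way such test cases could be used to score a synthesized program; it is an illustration under stated assumptions, not the paper's evaluation harness. The function names, the assert-string test format, and the sample problem are all hypothetical.

```python
def passes_all_tests(candidate_source, test_cases):
    """Return True if the candidate program satisfies every test case.

    candidate_source: Python source text produced by a model.
    test_cases:       list of strings, each assumed to be an `assert ...`
                      statement (a hypothetical format for this sketch).
    """
    namespace = {}
    try:
        # Define the candidate's functions in an isolated namespace.
        exec(candidate_source, namespace)
        for test in test_cases:
            exec(test, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as failure.
        return False
    return True


def solve_rate(outcomes):
    """Fraction of problems for which the candidate passed all tests."""
    return sum(outcomes) / len(outcomes)


# Hypothetical problem: "Write a function that adds two numbers."
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
correct = "def add(a, b):\n    return a + b"
wrong = "def add(a, b):\n    return a - b"

outcomes = [passes_all_tests(correct, tests), passes_all_tests(wrong, tests)]
```

    Note that `exec` runs model-generated code directly and this sketch has no timeout, so a real harness would need sandboxing and resource limits.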
    Onboarding Materials as Cross-functional Boundary Objects for Developing AI Assistants
    Lauren Wilcox
    Samantha Winter
    Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, ACM (2021) (to appear)
    Deep neural networks (DNNs) routinely achieve state-of-the-art performance in a wide range of tasks. This case study reports on the development of onboarding (i.e., training) materials for a DNN-based medical AI Assistant to aid in the grading of prostate cancer. Specifically, we describe how the process of developing these materials deepened the team's understanding of end-user requirements, leading to changes in the development and assessment of the underlying machine learning model. In this sense, the onboarding materials served as a useful boundary object for a cross-functional team. We also present evidence of the utility of the subsequent onboarding materials by describing which information was found useful by participants in an experimental study.
    Evaluation of the Use of Combined Artificial Intelligence and Pathologist Assessment to Review and Grade Prostate Biopsies
    Kunal Nagpal
    Davis J. Foote
    Adam Pearce
    Samantha Winter
    Matthew Symonds
    Liron Yatziv
    Trissia Brown
    Isabelle Flament-Auvigne
    Fraser Tan
    Martin C. Stumpe
    Cameron Chen
    Craig Mermel
    JAMA Network Open (2020)
    Importance: Expert-level artificial intelligence (AI) algorithms for prostate biopsy grading have recently been developed. However, the potential impact of integrating such algorithms into pathologist workflows remains largely unexplored. Objective: To evaluate an expert-level AI-based assistive tool when used by pathologists for the grading of prostate biopsies. Design, Setting, and Participants: This diagnostic study used a fully crossed multiple-reader, multiple-case design to evaluate an AI-based assistive tool for prostate biopsy grading. Retrospective grading of prostate core needle biopsies from 2 independent medical laboratories in the US was performed between October 2019 and January 2020. A total of 20 general pathologists reviewed 240 prostate core needle biopsies from 240 patients. Each pathologist was randomized to 1 of 2 study cohorts. The 2 cohorts reviewed every case in the opposite modality (with AI assistance vs without AI assistance) to each other, with the modality switching after every 10 cases. After a minimum 4-week washout period for each batch, the pathologists reviewed the cases for a second time using the opposite modality. The pathologist-provided grade group for each biopsy was compared with the majority opinion of urologic pathology subspecialists. Exposure: An AI-based assistive tool for Gleason grading of prostate biopsies. Main Outcomes and Measures: Agreement between pathologists and subspecialists with and without the use of an AI-based assistive tool for the grading of all prostate biopsies and Gleason grade group 1 biopsies. Results: Biopsies from 240 patients (median age, 67 years; range, 39-91 years) with a median prostate-specific antigen level of 6.5 ng/mL (range, 0.6-97.0 ng/mL) were included in the analyses. Artificial intelligence–assisted review by pathologists was associated with a 5.6% increase (95% CI, 3.2%-7.9%; P < .001) in agreement with subspecialists (from 69.7% for unassisted reviews to 75.3% for assisted reviews) across all biopsies and a 6.2% increase (95% CI, 2.7%-9.8%; P = .001) in agreement with subspecialists (from 72.3% for unassisted reviews to 78.5% for assisted reviews) for grade group 1 biopsies. A secondary analysis indicated that AI assistance was also associated with improvements in tumor detection, mean review time, mean self-reported confidence, and interpathologist agreement. Conclusions and Relevance: In this study, the use of an AI-based assistive tool for the review of prostate biopsies was associated with improvements in the quality, efficiency, and consistency of cancer detection and grading.
    Similar Image Search for Histopathology: SMILY
    Jason Hipp
    Michael Emmert-Buck
    Daniel Smilkov
    Mahul Amin
    Craig Mermel
    Lily Peng
    Martin Stumpe
    Nature Partner Journal (npj) Digital Medicine (2019)
    The increasing availability of large institutional and public histopathology image datasets is enabling the searching of these datasets for diagnosis, research, and education. Although these datasets typically have associated metadata such as diagnosis or clinical notes, even carefully curated datasets rarely contain annotations of the location of regions of interest on each image. As pathology images are extremely large (up to 100,000 pixels in each dimension), further laborious visual search of each image may be needed to find the feature of interest. In this paper, we introduce a deep-learning-based reverse image search tool for histopathology images: Similar Medical Images Like Yours (SMILY). We assessed SMILY’s ability to retrieve search results in two ways: using pathologist-provided annotations, and via prospective studies where pathologists evaluated the quality of SMILY search results. As a negative control in the second evaluation, pathologists were blinded to whether search results were retrieved by SMILY or randomly. In both types of assessments, SMILY was able to retrieve search results with similar histologic features, organ site, and prostate cancer Gleason grade compared with the original query. SMILY may be a useful general purpose tool in the pathologist’s arsenal, to improve the efficiency of searching large archives of histopathology images, without the need to develop and implement specific tools for each application.
    Saliency methods can aid understanding of deep neural networks. Recent years have witnessed many improvements to saliency methods, as well as new ways for evaluating them. In this paper, we 1) present a novel region-based attribution method, XRAI, that builds upon integrated gradients (Sundararajan et al. 2017), 2) introduce evaluation methods for empirically assessing the quality of image-based saliency maps (Performance Information Curves (PICs)), and 3) contribute an axiom-based sanity check for attribution methods. Through empirical experiments and example results, we show that XRAI produces better results than other saliency methods for common models and the ImageNet dataset.
    TensorFlow.js: Machine Learning for the Web and Beyond
    Daniel Smilkov
    Nikhil Thorat
    Yannick Assogba
    Ann Yuan
    Nick Kreeger
    Ping Yu
    Kangyi Zhang
    Eric Nielsen
    Stan Bileschi
    Charles Nicholson
    Sandeep N. Gupta
    Sarah Sirajuddin
    Rajat Monga
    SysML, Palo Alto, CA, USA (2019)
    TensorFlow.js is a library for building and executing machine learning algorithms in JavaScript. TensorFlow.js models run in a web browser and in the Node.js environment. The library is part of the TensorFlow ecosystem, providing a set of APIs that are compatible with those in Python, allowing models to be ported between the Python and JavaScript ecosystems. TensorFlow.js has empowered a new set of developers from the extensive JavaScript community to build and deploy machine learning models and enabled new classes of on-device computation. This paper describes the design, API, and implementation of TensorFlow.js, and highlights some of the impactful use cases.
    "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making
    Samantha Winter
    Lauren Wilcox
    Proc. ACM Hum.-Comput. Interact., Association for Computing Machinery, ACM CSCW, New York, NY, USA (2019), pp. 24 (to appear)
    Although rapid advances in machine learning have made it increasingly applicable to expert decision-making, the delivery of accurate algorithmic predictions alone is insufficient for effective human–AI collaboration. In this work, we investigate the key types of information medical experts desire when they are first introduced to a diagnostic AI assistant. In a qualitative lab study, we interviewed 21 pathologists before, during, and after being presented deep neural network (DNN) predictions for prostate cancer diagnosis, to learn the types of information that they desired about the AI assistant. Our findings reveal that, far beyond understanding the local, case-specific reasoning behind any model decision, clinicians desired upfront information about basic, global properties of the model, such as its known strengths and limitations, its subjective point-of-view, and its overall design objective—what it’s designed to be optimized for. Participants compared these information needs to the collaborative mental models they develop of their medical colleagues when seeking a second opinion: the medical perspectives and standards that those colleagues embody, and the compatibility of those perspectives with their own diagnostic patterns. These findings broaden and enrich discussions surrounding AI transparency for collaborative decision-making, providing a richer understanding of what experts find important in their introduction to AI assistants before integrating them into routine practice.
    Purpose: To present and evaluate a remote, tool-based system and structured grading rubric for adjudicating image-based diabetic retinopathy (DR) grades. Methods: We compared three different procedures for adjudicating DR severity assessments among retina specialist panels, including (1) in-person adjudication based on a previously described procedure (Baseline), (2) remote, tool-based adjudication for assessing DR severity alone (TA), and (3) remote, tool-based adjudication using a feature-based rubric (TA-F). We developed a system allowing graders to review images remotely and asynchronously. For both TA and TA-F approaches, images with disagreement were reviewed by all graders in a round-robin fashion until disagreements were resolved. Five panels of three retina specialists each adjudicated a set of 499 retinal fundus images (1 panel using Baseline, 2 using TA, and 2 using TA-F adjudication). Reliability was measured as grade agreement among the panels using Cohen's quadratically weighted kappa. Efficiency was measured as the number of rounds needed to reach a consensus for tool-based adjudication. Results: The grades from remote, tool-based adjudication showed high agreement with the Baseline procedure, with Cohen's kappa scores of 0.948 and 0.943 for the two TA panels, and 0.921 and 0.963 for the two TA-F panels. Cases adjudicated using TA-F were resolved in fewer rounds compared with TA (P < 0.001; standard permutation test). Conclusions: Remote, tool-based adjudication presents a flexible and reliable alternative to in-person adjudication for DR diagnosis. Feature-based rubrics can help accelerate consensus for tool-based adjudication of DR without compromising label quality. Translational Relevance: This approach can generate reference standards to validate automated methods, and resolve ambiguous diagnoses by integrating into existing telemedical workflows.
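    The reliability measure named in this abstract, Cohen's quadratically weighted kappa, penalizes disagreements by the squared distance between the two grades, so near-misses cost less than large discrepancies. The following is a minimal sketch of the standard formula, assuming grades are encoded as integers from 0 to `num_grades - 1`; the function name and the example data are illustrative and are not taken from the study.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_grades):
    """Cohen's kappa with quadratic disagreement weights.

    rater_a, rater_b: equal-length lists of integer grades in
                      range(num_grades), one pair per graded image.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    if num_grades < 2:
        raise ValueError("need at least two grade levels")

    n = len(rater_a)
    # observed[i][j] counts images graded i by rater A and j by rater B.
    observed = [[0] * num_grades for _ in range(num_grades)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    totals_a = [sum(row) for row in observed]
    totals_b = [sum(col) for col in zip(*observed)]

    disagreement = 0.0  # weighted observed disagreement
    chance = 0.0        # weighted disagreement expected by chance
    for i in range(num_grades):
        for j in range(num_grades):
            weight = (i - j) ** 2 / (num_grades - 1) ** 2
            disagreement += weight * observed[i][j]
            chance += weight * totals_a[i] * totals_b[j] / n
    if chance == 0:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return 1.0 - disagreement / chance


# Hypothetical grades on a five-level scale, for illustration only.
panel_1 = [0, 1, 2, 3, 4]
panel_2 = [0, 1, 2, 3, 4]
kappa_identical = quadratic_weighted_kappa(panel_1, panel_2, num_grades=5)
```

    Identical gradings yield a kappa of 1, agreement at chance level yields 0, and systematically reversed gradings yield negative values.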
    Machine learning (ML) is increasingly being used in image retrieval systems for medical decision making. One application of ML is to retrieve visually similar medical images from past patients (e.g. tissue from biopsies) to reference when making a medical decision with a new patient. However, no algorithm can perfectly capture an expert's ideal notion of similarity for every case: an image that is algorithmically determined to be similar may not be medically relevant to a doctor's specific diagnostic needs. In this paper, we identified the needs of pathologists when searching for similar images retrieved using a deep learning algorithm, and developed tools that empower users to cope with the search algorithm on-the-fly, communicating what types of similarity are most important at different moments in time. In two evaluations with pathologists, we found that these refinement tools increased the diagnostic utility of images found and increased user trust in the algorithm. The tools were preferred over a traditional interface, without a loss in diagnostic accuracy. We also observed that users adopted new strategies when using refinement tools, re-purposing them to test and understand the underlying algorithm and to disambiguate ML errors from their own errors. Taken together, these findings inform future human-ML collaborative systems for expert decision-making.
    Data cleaning and feature engineering are both common practices when developing machine learning (ML) models. However, developers are not always aware of best practices for preparing or transforming data for a given model type, which can lead to suboptimal representations of input features. To address this issue, we introduce the data linter, a new class of ML tool that automatically inspects ML data sets to 1) identify potential issues in the data and 2) suggest potentially useful feature transforms, for a given model type. As with traditional code linting, data linting automatically identifies potential issues or inefficiencies; codifies best practices and educates end-users about these practices through tool use; and can lead to quality improvements. In this paper, we provide a detailed description of data linting, describe our initial implementation of a data linter for deep neural networks, and report results suggesting the utility of using a data linter during ML model design.
    AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
    Yannis Agiomyrgiannakis
    NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop (to appear)
    Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled human ratings, as shown by Pearson and Spearman correlations. When multiple utterances are scored and averaged, a scenario common in synthesizer quality assessment, we achieve correlations comparable to those of human raters. This model has a number of applications, such as the ability to automatically explore the parameter space of a speech synthesizer without requiring a human-in-the-loop. We explore a method of probing what the models have learned.