Alicia Vail Parrish
As a research scientist, I use tools from linguistics and psychology to study how we can collect and understand high-quality NLP data. I received my M.A. in Linguistics from Michigan State University and my Ph.D. in Linguistics from New York University.
Authored Publications
A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models
Heather Cole-Lewis
Nenad Tomašev
Liam McCoy
Leo Anthony Celi
Alanna Walton
Akeiylah DeWitt
Philip Mansfield
Sushant Prakash
Joelle Barral
Ivor Horn
Karan Singhal
Nature Medicine (2024)
Abstract
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.
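To make the idea of a multifactorial human-assessment framework more concrete, here is a minimal sketch of what a single assessment record might look like in code. The bias dimensions, field names, and identifiers below are illustrative assumptions, not the actual EquityMedQA rubric or schema.

```python
from dataclasses import dataclass, field

# Hypothetical bias dimensions for a multifactorial rubric; the real
# EquityMedQA assessment dimensions may differ.
BIAS_DIMENSIONS = [
    "inaccuracy_for_axis",   # answer accuracy differs across identity axes
    "stereotyping",          # relies on or reinforces a stereotype
    "omission",              # leaves out information relevant to some groups
    "failure_to_challenge",  # accepts a biased premise in the question
]

@dataclass
class BiasAssessment:
    """One rater's judgment of one LLM-generated answer (illustrative only)."""
    question_id: str
    answer_id: str
    rater_id: str
    rater_background: str                                    # e.g. clinician, equity expert, consumer
    ratings: dict[str, bool] = field(default_factory=dict)   # dimension -> bias present?
    free_text_rationale: str = ""

    def any_bias(self) -> bool:
        return any(self.ratings.get(d, False) for d in BIAS_DIMENSIONS)

# Example usage with made-up identifiers.
record = BiasAssessment(
    question_id="q_0412",
    answer_id="medpalm2_q_0412",
    rater_id="r_17",
    rater_background="clinician",
    ratings={"stereotyping": True, "omission": False},
)
print(record.any_bias())  # True
```

Records of this shape can then be aggregated across raters of varying backgrounds, which is the kind of analysis the human-assessment framework is built to support.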
Intersecting Demographics: Bayesian Multilevel Models Reveal Age, Gender, and Racial Differences in Safety Perception of Chatbot Conversations
Chris Homan
Greg Serapio-García
Alex Taylor
(2023)
Abstract
Chatbots based on large language models (LLMs) exhibit a level of human-like behavior that promises to have profound impacts on how people access information, create content, and seek social support. Yet these models have also shown a propensity toward biases and hallucinations, i.e., making up entirely false information and conveying it as truthful. Consequently, understanding and moderating safety risks in these models is a critical technical and social challenge. We use Bayesian multilevel models to explore the connection between rater demographics and their perception of safety in chatbot dialogues. We study a sample of 252 human raters stratified by gender, age, race/ethnicity, and location. Raters were asked to annotate the safety risks of 1,340 chatbot conversations. We show that raters from certain demographic groups are more likely to report safety risks than raters from other groups. We discuss the implications of these differences in safety perception and suggest measures to ameliorate them.
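As an illustration of the general technique rather than the authors' actual model specification, the sketch below fits a Bayesian multilevel logistic regression in PyMC, with a varying intercept per demographic group for the probability that a rater flags a conversation as unsafe. The toy data, column names, and priors are all assumptions.

```python
import arviz as az
import numpy as np
import pandas as pd
import pymc as pm

# Toy stand-in for the rater data: one row per (rater, conversation) rating,
# with a binary "flagged unsafe" outcome and the rater's demographic group.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.integers(0, 4, size=500),    # e.g. 4 demographic strata
    "unsafe": rng.integers(0, 2, size=500),   # 1 = rater reported a safety risk
})
n_groups = df["group"].nunique()

with pm.Model() as model:
    # Population-level intercept and group-level deviations (partial pooling).
    mu = pm.Normal("mu", 0.0, 1.5)
    sigma_group = pm.HalfNormal("sigma_group", 1.0)
    group_offset = pm.Normal("group_offset", 0.0, 1.0, shape=n_groups)
    logit_p = mu + sigma_group * group_offset[df["group"].values]

    pm.Bernoulli("unsafe_obs", logit_p=logit_p, observed=df["unsafe"].values)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior summaries show how far each group's intercept deviates from the
# population-level rate of reporting a safety risk.
print(az.summary(idata, var_names=["mu", "sigma_group", "group_offset"]))
```

The partial pooling in a multilevel model like this is what lets group-level differences in safety perception be estimated even when some demographic strata contribute relatively few ratings.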
All that Agrees Is Not Gold: Evaluating Ground Truth Labels and Dialogue Content for Safety
Chris Homan
Greg Serapio-García
Alex Taylor
(2023)
Abstract
Dialogue safety as a task is complex, in part because ‘safety’ entails a broad range of topics and concerns, such as toxicity, harm, legal concerns, and health advice. Who we ask to judge safety and who we ask to define safety may lead to differing conclusions, because definitions and understandings of safety can vary according to one’s identity, public opinion, and the interpretation of existing laws and regulations. In this study, we compare annotations from a diverse set of over 100 crowd raters to gold labels derived from trust and safety (T&S) experts in a dialogue safety task consisting of 350 human-chatbot conversations. We find patterns of disagreement rooted in dialogue structure, dialogue content, and rating rationale. In contrast to typical approaches that treat gold labels as ground truth, we propose alternative ways of interpreting gold data and incorporating crowd disagreement rather than mitigating it. We discuss the complexity of safety annotation as a task, what crowd and T&S labels each uniquely capture, and how to make determinations about when and how to rely on crowd or T&S labels.
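One concrete way to retain crowd disagreement rather than collapse it to a single label, in the spirit described above, is to compute a per-item soft label and disagreement score and compare both against the T&S gold label. The column names, toy data, and threshold in this sketch are illustrative assumptions, not the paper's actual pipeline.

```python
import pandas as pd

# Toy ratings table: one row per (rater, conversation) safety judgment,
# plus a separate table of expert (T&S) gold labels.
crowd = pd.DataFrame({
    "item_id": [1, 1, 1, 2, 2, 2],
    "unsafe":  [1, 1, 0, 0, 0, 1],   # 1 = rater judged the dialogue unsafe
})
gold = pd.DataFrame({"item_id": [1, 2], "gold_unsafe": [1, 0]})

# Soft label = share of raters flagging the item; disagreement = closeness to a 50/50 split.
per_item = crowd.groupby("item_id")["unsafe"].agg(
    soft_label="mean",
    n_raters="count",
).reset_index()
per_item["disagreement"] = 1 - (per_item["soft_label"] - 0.5).abs() * 2

merged = per_item.merge(gold, on="item_id")
# A majority-vote crowd label can be compared against the gold label, while the
# disagreement score flags items that may be genuinely ambiguous rather than mislabeled.
merged["majority_vote"] = (merged["soft_label"] >= 0.5).astype(int)
print(merged[["item_id", "soft_label", "disagreement", "gold_unsafe", "majority_vote"]])
```

Keeping the soft label and disagreement score alongside the gold label is one way to decide, item by item, when to rely on crowd labels, T&S labels, or both.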
Adversarial Nibbler: A DataPerf Challenge for Text-to-Image Models
Hannah Kirk
Jessica Quaye
Charvi Rastogi
Max Bartolo
Oana Inel
Meg Risdal
Will Cukierski
Vijay Reddy
Online (2023)
Abstract
Machine learning progress has been strongly influenced by the data used for model training and evaluation. Only recently, however, have development teams shifted their focus more toward the data. This shift has been triggered by the numerous reports about biases and errors discovered in AI datasets. Thus, the data-centric AI movement introduced the notion of iterating on the data used in AI systems, as opposed to the traditional model-centric AI approach, which typically treats the data as a given, static artifact in model development. With the recent advancement of generative AI, the role of data is even more crucial for successfully developing more factual and safe models. DataPerf challenges follow up on recent successful data-centric challenges that draw attention to the data used for training and evaluating machine learning models. Specifically, Adversarial Nibbler focuses on data used for safety evaluation of generative text-to-image models. A typical bottleneck in safety evaluation is achieving a representative diversity and coverage of different types of examples in the evaluation set. Our competition aims to gather a wide range of long-tail and unexpected failure modes for text-to-image models in order to identify as many new problems as possible and to use various automated approaches to expand the dataset so that it is useful for training, fine-tuning, and evaluation.
DICES Dataset: Diversity in Conversational AI Evaluation for Safety
Alex Taylor
Chris Homan
Greg Serapio-García
NeurIPS (2023)
Abstract
Machine learning approaches often require training and evaluation datasets with a clear separation between positive and negative examples. This risks simplifying and even obscuring the inherent subjectivity present in many tasks. Preserving such variance in content and diversity in datasets is often expensive and laborious. This is especially troubling when building safety datasets for conversational AI systems, as safety is both socially and culturally situated. To demonstrate this crucial aspect of conversational AI safety, and to facilitate in-depth model performance analyses, we introduce the DICES (Diversity In Conversational AI Evaluation for Safety) dataset, which contains fine-grained demographic information about raters, includes high replication of ratings per item to ensure statistical power for analyses, and encodes rater votes as distributions across different demographics to allow for in-depth explorations of different aggregation strategies. In short, the DICES dataset enables the observation and measurement of variance, ambiguity, and diversity in the context of conversational AI safety. We also illustrate how the dataset offers a basis for establishing metrics to show how raters’ ratings intersect with demographic categories such as racial/ethnic groups, age groups, and genders. DICES is intended to serve as a shared resource and benchmark that respects diverse perspectives during safety evaluation of conversational AI systems.
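As a small illustration of the kind of demographic-aware aggregation DICES is designed to support (using assumed column names and toy data rather than the released schema), the sketch below keeps per-group rating distributions instead of a single majority label and then applies one possible aggregation strategy.

```python
import pandas as pd

# Toy stand-in for DICES-style rows: one rating per (conversation, rater),
# with the rater's demographic attributes.
ratings = pd.DataFrame({
    "conversation_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "rater_age_group": ["18-34", "35-54", "18-34", "55+", "35-54", "18-34", "55+", "55+"],
    "safety_rating":   ["unsafe", "safe", "unsafe", "safe", "safe", "safe", "unsafe", "safe"],
})

# Per-conversation, per-demographic distribution of safety votes, rather than one
# aggregated majority label. Different aggregation strategies can then be compared.
by_group = (
    ratings
    .groupby(["conversation_id", "rater_age_group"])["safety_rating"]
    .value_counts(normalize=True)
    .rename("share")
    .reset_index()
)
print(by_group)

# Example aggregation strategy: flag a conversation if any demographic group's
# "unsafe" share reaches a threshold, rather than relying only on the overall majority.
unsafe_share = by_group[by_group["safety_rating"] == "unsafe"]
flagged = unsafe_share[unsafe_share["share"] >= 0.5]["conversation_id"].unique()
print(flagged)
```

Because the votes are kept as distributions, the same table supports majority voting, group-sensitive thresholds, or disagreement-aware metrics without re-collecting any ratings.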