Mark Díaz
Mark Díaz is a Research Scientist with the Technology, AI, Society, and Culture (TASC) team in Responsible AI. His primary research investigates sociotechnical AI evaluation and documentation, including understanding data annotation and subjective disagreements related to differences in social context and experience. He has most recently begun work on the impacts of anthropomorphic generative AI on user perceptions and what those impacts mean for responsible AI practice.
Mark completed his Ph.D. in Technology & Social Behavior, a joint program in Computer Science and Communication at Northwestern University, where he was advised by Darren Gergle. Before completing his doctoral work on age-related biases in sentiment analysis, he worked as a graduate fellow at SMART Chicago, a nonprofit focused on technology access and equity in Chicago. As a graduate fellow, he researched perceptions of city technology policy among Black and low-income Chicago residents.
Authored Publications
Not Like Us, Hunty: Measuring Perceptions and Behavioral Effects of Minoritized Anthropomorphic Cues in LLMs
Jeffrey Basoah
Daniel Chechelnitsky
Tao Long
Katharina Reinecke
Chrysoula Zerva
Kaitlyn Zhou
Maarten Sap
Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, ACM (2025), pp. 710-745
Preview abstract
As large language models (LLMs) increasingly adapt and personalize to diverse sets of users, there is an increased risk of systems appropriating sociolects, i.e., language styles or dialects that are associated with specific minoritized lived experiences (e.g., African American English, Queer slang). In this work, we examine whether sociolect usage by an LLM agent affects user reliance on its outputs and user perception (satisfaction, frustration, trust, and social presence). We designed and conducted user studies where 498 African American English (AAE) speakers and 487 Queer slang speakers performed a set of question-answering tasks with LLM-based suggestions in either standard American English (SAE) or their self-identified sociolect.
Our findings showed that sociolect usage by LLMs influenced both reliance and perceptions, though in some surprising ways. Results suggest that both AAE and Queer slang speakers relied more on the SAELM, and had more positive perceptions of the SAELM. Yet, only Queer slang speakers felt more social presence from the QSLM over the SAE one, whereas only AAE speakers preferred and trusted the SAELM over the AAE one. These findings emphasize the need to test for behavioral outcomes rather than simply assume that personalization would lead to a better and safer reliance outcome. They also highlight the nuanced dynamics of minoritized language in machine interactions, underscoring the need for LLMs to be carefully designed to respect cultural and linguistic boundaries while fostering genuine user engagement and trust.
Preview abstract
Detecting offensive content in text is an increasingly central challenge for both social-media platforms and AI-driven technologies. However, offensiveness remains a subjective phenomenon as perspectives differ across sociodemographic characteristics, as well as cultural norms and moral values. This intricacy is largely ignored in the current AI-focused approaches for detecting offensiveness or related concepts such as hate speech and toxicity detection. We frame the task of determining offensiveness as essentially a matter of moral judgment: deciding the boundaries of ethically wrong vs. right language to be used or generated within an implied set of sociocultural norms. In this paper, we investigate how judgment of offensiveness varies across diverse global cultural regions, and the crucial role of moral values in shaping these variations. Our findings highlight substantial cross-cultural differences in perceiving offensiveness, with moral concerns about Caring and Purity as the mediating factor driving these differences. These insights are of importance as AI safety protocols, shaped by human annotators' inputs and perspectives, embed their moral values which do not align with the notions of right and wrong in all contexts, and for all individuals.
The Illusion of Artificial Inclusion
William Agnew
Stevie Bergman
Jennifer Chien
Seliem El-Sayed
Jaylen Pittman
Shakir Mohamed
Kevin McKee
Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, pp. 12
Preview abstract
Human participants play a central role in the development of modern artificial intelligence (AI) technology, in psychological science, and in user research. Recent advances in generative AI have attracted growing interest to the possibility of replacing human participants in these domains with AI surrogates. We survey several such "substitution proposals" to better understand the arguments for and against substituting human participants with modern generative AI. Our scoping review indicates that the recent wave of these proposals is motivated by goals such as reducing the costs of research and development work and increasing the diversity of collected data. However, these proposals ignore and ultimately conflict with foundational values of work with human participants: representation, inclusion, and understanding. This paper critically examines the principles and goals underlying human participation to help chart out paths for future work that truly centers and empowers participants.
Intersecting Demographics: Bayesian Multilevel Models Reveal Age, Gender, and Racial Differences in Safety Perception of Chatbot Conversations
Chris Homan
Greg Serapio-García
Alex Taylor
(2023)
Preview abstract
Chatbots based on large language models (LLMs) exhibit a level of human-like behavior that promises to have profound impacts on how people access information, create content, and seek social support. Yet these models have also shown a propensity toward biases and hallucinations, i.e., making up entirely false information and conveying it as truthful. Consequently, understanding and moderating safety risks in these models is a critical technical and social challenge. We use Bayesian multilevel models to explore the connection between rater demographics and their perception of safety in chatbot dialogues. We study a sample of 252 human raters stratified by gender, age, race/ethnicity, and location. Raters were asked to annotate the safety risks of 1,340 chatbot conversations. We show that raters from certain demographic groups are more likely to report safety risks than raters from other groups. We discuss the implications of these differences in safety perception and suggest measures to ameliorate these differences.
DICES Dataset: Diversity in Conversational AI Evaluation for Safety
Alex Taylor
Chris Homan
Greg Serapio-García
NeurIPS2023 (2023)
Preview abstract
Machine learning approaches often require training and evaluation datasets with a clear separation between positive and negative examples. This risks simplifying and even obscuring the inherent subjectivity present in many tasks. Preserving such variance in content and diversity in datasets is often expensive and laborious. This is especially troubling when building safety datasets for conversational AI systems, as safety is both socially and culturally situated. To demonstrate this crucial aspect of conversational AI safety, and to facilitate in-depth model performance analyses, we introduce the DICES (Diversity In Conversational AI Evaluation for Safety) dataset that contains fine-grained demographic information about raters, high replication of ratings per item to ensure statistical power for analyses, and encodes rater votes as distributions across different demographics to allow for in-depth explorations of different aggregation strategies. In short, the DICES dataset enables the observation and measurement of variance, ambiguity, and diversity in the context of conversational AI safety. We also illustrate how the dataset offers a basis for establishing metrics to show how raters’ ratings can intersect with demographic categories such as racial/ethnic groups, age groups, and genders. The goal of DICES is to be used as a shared resource and benchmark that respects diverse perspectives during safety evaluation of conversational AI systems.
All that Agrees Is Not Gold: Evaluating Ground Truth Labels and Dialogue Content for Safety
Chris Homan
Greg Serapio-García
Alex Taylor
(2023)
Preview abstract
Dialogue safety as a task is complex, in part because ‘safety’ entails a broad range of topics and concerns, such as toxicity, harm, legal concerns, health advice, etc. Who we ask to judge safety and who we ask to define safety may lead to differing conclusions. This is because definitions and understandings of safety can vary according to one’s identity, public opinion, and the interpretation of existing laws and regulations. In this study, we compare annotations from a diverse set of over 100 crowd raters to gold labels derived from trust and safety (T&S) experts in a dialogue safety task consisting of 350 human-chatbot conversations. We find patterns of disagreements rooted in dialogue structure, dialogue content, and rating rationale. In contrast to typical approaches which treat gold labels as ground truth, we propose alternative ways of interpreting gold data and incorporating crowd disagreement rather than mitigating it. We discuss the complexity of safety annotation as a task, what crowd and T&S labels each uniquely capture, and how to make determinations about when and how to rely on crowd or T&S labels.
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Jacob Austin
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Andrew M. Dai
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Katherine Lee
Zongwei Zhou
Brennan Saeta
Michele Catasta
Jason Wei
Kathy Meier-Hellstern
arxiv:2204.02311 (2022)
Preview abstract
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
Accounting for Offensive Speech as a Practice of Resistance
Razvan Adrian Amironesei
Laura Weidinger
Iason Gabriel
NAACL Workshop on Online Abuse and Harms (WOAH) (2022)
Preview abstract
Tasks such as toxicity detection, hate speech detection, and online harassment detection have been developed for identifying and intervening in interactions that have the potential to cause social harms. These tasks, for identifying and classifying offensive or undesirable language, have gone by different names and have employed varying task definitions. However, they are united by a goal of reducing harm and breakdowns in civil discourse. Because language use varies from context to context, a major challenge to the success of these methods arises from the need to properly model and understand nuanced social context. Modeling social context has been identified as a massive challenge that stands to limit the performance of natural language processing (NLP) systems.
In this work we articulate the need for a relational understanding of offensiveness as well as a north star definition of this concept for NLP research. Many classification tasks implicitly treat offensiveness as a fixed property of language. However, offense emerges in the context of relationships between individuals or broader networks of social actors (including human-like actors) and the language used between them. Using examples of speech drawn from members of marginalized groups, we argue that a fuller account of offensive speech, and when it is objectionable, must focus on the ends, or impact, of language and how it is used. We also explore the degree to which NLP systems may encounter limits when modeling relational factors, for example due to technical limitations or concerns regarding privacy in data collection for training and evaluation. Nonetheless, developing a robust, translatable, relational understanding of offensiveness is key to the successful operationalization and use of this concept. Addressing this challenge, the present work considers how offensiveness has been operationalized in classification tasks, and the affordances and weaknesses thereof. We also discuss how a more relational approach can be implemented in data collection techniques and operationalizations of offensiveness.
The Reasonable Effectiveness of Diverse Evaluation Data
Christopher Homan
Alex Taylor
Human Evaluation for Generative Models (HEGM) Workshop at NeurIPS2022
Preview abstract
In this paper, we present findings from a semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI chatbot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development, specifically human evaluation of generative models, against the backdrop of growing work on sociotechnical AI evaluations.
Preview abstract
Machine translation (MT) is now widely and freely available, and has the potential to greatly improve interlingual communication. However, it can be difficult for users to detect and recover from mistranslations because limited language skills hinder comprehension of either the inputs or the outputs. In order to use MT reliably and safely, end users must be able to assess the quality of system outputs and determine how much they can rely on them to guide their decisions and actions. In this work we collected 19 MT-mediated high-stakes, role-play conversations and in-depth interviews to understand how users identify and recover from translation errors. Participants communicated using four language pairs: English and one of Spanish, Farsi, Igbo, or Tagalog. We also collected human annotations of translation quality and conducted a mixed-method analysis to understand user challenges, strategies for recovery, and the kinds of translation errors that proved more or less difficult for users to overcome. We found that users broadly lacked relevant and helpful information to guide their assessments of translation quality. Instances where a user erroneously thought they had understood a translation correctly were rare but held the potential for drastic consequences in the real world. Finally, inaccurate and disfluent translations had social consequences for the participants, because it was difficult to discern when a disfluent message was reflective of the other person’s intentions, or an artifact of imperfect MT. We draw on theories of grounding and repair in communication to contextualize these findings, and propose design implications for HCI researchers, MT researchers, and opportunities for greater coherence and collaboration between these efforts.