Mark Díaz
Mark Díaz is a Research Scientist with the Technology, AI, Society, and Culture (TASC) team in Responsible AI. His primary research investigates sociotechnical AI evaluation and documentation, including understanding data annotation and subjective disagreements related to differences in social context and experience. He has most recently begun work on the impacts of anthropomorphic generative AI on user perceptions and what those impacts mean for responsible AI practice.
Mark completed his Ph.D. in Technology & Social Behavior, a joint program in Computer Science and Communication at Northwestern University, where he was advised by Darren Gergle. Before completing his doctoral work on age-related biases in sentiment analysis, he worked as a graduate fellow at SMART Chicago, a nonprofit focused on technology access and equity in Chicago. As a graduate fellow, he researched how Black and low-income Chicago residents perceive city technology policy.
Authored Publications
Detecting offensive content in text is an increasingly central challenge for both social-media platforms and AI-driven technologies. However, offensiveness remains a subjective phenomenon, as perspectives differ across sociodemographic characteristics as well as cultural norms and moral values. This intricacy is largely ignored in current AI-focused approaches for detecting offensiveness or related concepts such as hate speech and toxicity. We frame the task of determining offensiveness as essentially a matter of moral judgment: deciding the boundaries of ethically wrong vs. right language to be used or generated within an implied set of sociocultural norms. In this paper, we investigate how judgment of offensiveness varies across diverse global cultural regions, and the crucial role of moral values in shaping these variations. Our findings highlight substantial cross-cultural differences in perceiving offensiveness, with moral concerns about Caring and Purity as the mediating factor driving these differences. These insights are important because AI safety protocols, shaped by human annotators' inputs and perspectives, embed those annotators' moral values, which do not align with the notions of right and wrong in all contexts and for all individuals.
DICES Dataset: Diversity in Conversational AI Evaluation for Safety
Alex Taylor
Chris Homan
Greg Serapio-García
NeurIPS 2023 (2023)
Machine learning approaches often require training and evaluation datasets with a clear separation between positive and negative examples. This risks simplifying and even obscuring the inherent subjectivity present in many tasks. Preserving such variance in content and diversity in datasets is often expensive and laborious. This is especially troubling when building safety datasets for conversational AI systems, as safety is both socially and culturally situated. To demonstrate this crucial aspect of conversational AI safety, and to facilitate in-depth model performance analyses, we introduce the DICES (Diversity In Conversational AI Evaluation for Safety) dataset that contains fine-grained demographic information about raters, high replication of ratings per item to ensure statistical power for analyses, and encodes rater votes as distributions across different demographics to allow for in-depth explorations of different aggregation strategies. In short, the DICES dataset enables the observation and measurement of variance, ambiguity, and diversity in the context of conversational AI safety. We also illustrate how the dataset offers a basis for establishing metrics to show how raters’ ratings can intersect with demographic categories such as racial/ethnic groups, age groups, and genders. The goal of DICES is to be used as a shared resource and benchmark that respects diverse perspectives during safety evaluation of conversational AI systems.
Intersecting Demographics: Bayesian Multilevel Models Reveal Age, Gender, and Racial Differences in Safety Perception of Chatbot Conversations
Chris Homan
Greg Serapio-García
Alex Taylor
(2023)
Chatbots based on large language models (LLMs) exhibit a level of human-like behavior that promises to have profound impacts on how people access information, create content, and seek social support. Yet these models have also shown a propensity toward biases and hallucinations, i.e., making up entirely false information and conveying it as truthful. Consequently, understanding and moderating safety risks in these models is a critical technical and social challenge. We use Bayesian multilevel models to explore the connection between rater demographics and their perception of safety in chatbot dialogues. We study a sample of 252 human raters stratified by gender, age, race/ethnicity, and location. Raters were asked to annotate the safety risks of 1,340 chatbot conversations. We show that raters from certain demographic groups are more likely to report safety risks than raters from other groups. We discuss the implications of these differences in safety perception and suggest measures to ameliorate these differences.
All that Agrees Is Not Gold: Evaluating Ground Truth Labels and Dialogue Content for Safety
Chris Homan
Greg Serapio-García
Alex Taylor
(2023)
Dialogue safety as a task is complex, in part because ‘safety’ entails a broad range of topics and concerns, such as toxicity, harm, legal concerns, health advice, etc. Who we ask to judge safety and who we ask to define safety may lead to differing conclusions. This is because definitions and understandings of safety can vary according to one’s identity, public opinion, and the interpretation of existing laws and regulations. In this study, we compare annotations from a diverse set of over 100 crowd raters to gold labels derived from trust and safety (T&S) experts in a dialogue safety task consisting of 350 human-chatbot conversations. We find patterns of disagreements rooted in dialogue structure, dialogue content, and rating rationale. In contrast to typical approaches which treat gold labels as ground truth, we propose alternative ways of interpreting gold data and incorporating crowd disagreement rather than mitigating it. We discuss the complexity of safety annotation as a task, what crowd and T&S labels each uniquely capture, and how to make determinations about when and how to rely on crowd or T&S labels.
LaMDA: Language Models for Dialog Applications
Aaron Daniel Cohen
Alena Butryna
Alicia Jin
Apoorv Kulshreshtha
Ben Zevenbergen
Chung-ching Chang
Cosmo Du
Daniel De Freitas Adiwardana
Dehao Chen
Dmitry (Dima) Lepikhin
Erin Hoffman-John
Igor Krivokon
James Qin
Jamie Hall
Joe Fenton
Johnny Soraker
Kathy Meier-Hellstern
Maarten Paul Bosma
Marc Joseph Pickett
Marcelo Amorim Menegali
Marian Croak
Maxim Krikun
Noam Shazeer
Rachel Bernstein
Ravi Rajakumar
Ray Kurzweil
Romal Thoppilan
Steven Zheng
Taylor Bos
Toju Duke
Tulsee Doshi
Vincent Y. Zhao
Will Rusch
Yuanzhong Xu
arXiv (2022)
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Brennan Saeta
Michele Catasta
Jason Wei
Kathy Meier-Hellstern
arxiv:2204.02311 (2022)
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
Machine translation (MT) is now widely and freely available, and has the potential to greatly improve interlingual communication. However, it can be difficult for users to detect and recover from mistranslations because limited language skills hinder comprehension of either the inputs or the outputs. In order to use MT reliably and safely, end users must be able to assess the quality of system outputs and determine how much they can rely on them to guide their decisions and actions. In this work we collected 19 MT-mediated, high-stakes role-play conversations and in-depth interviews to understand how users identify and recover from translation errors. Participants communicated using four language pairs: English paired with one of Spanish, Farsi, Igbo, or Tagalog. We also collected human annotations of translation quality and conducted a mixed-method analysis to understand user challenges, strategies for recovery, and the kinds of translation errors that proved more or less difficult for users to overcome. We found that users broadly lacked relevant and helpful information to guide their assessments of translation quality. Instances where a user erroneously thought they had understood a translation correctly were rare but held the potential for drastic consequences in the real world. Finally, inaccurate and disfluent translations had social consequences for the participants, because it was difficult to discern when a disfluent message was reflective of the other person’s intentions or an artifact of imperfect MT. We draw on theories of grounding and repair in communication to contextualize these findings, and propose design implications for HCI researchers and MT researchers, as well as opportunities for greater coherence and collaboration between these efforts.
Accounting for Offensive Speech as a Practice of Resistance
Razvan Adrian Amironesei
Laura Weidinger
Iason Gabriel
NAACL Workshop on Online Abuse and Harms (WOAH) (2022)
Tasks such as toxicity detection, hate speech detection, and online harassment detection have been developed for identifying and intervening in interactions that have the potential to cause social harms. These tasks, for identifying and classifying offensive or undesirable language, have gone by different names and have employed varying task definitions. However, they are united by a goal of reducing harm and breakdowns in civil discourse. Because language use varies from context to context, a major challenge to the success of these methods arises from the need to properly model and understand nuanced social context. Modeling social context has been identified as a massive challenge that stands to limit the performance of natural language processing (NLP) systems.
In this work we articulate the need for a relational understanding of offensiveness as well as a north star definition of this concept for NLP research. Many classification tasks implicitly treat offensiveness as a fixed property of language. However, offense emerges in the context of relationships between individuals or broader networks of social actors (including human-like actors) and the language used between them. Using examples of speech drawn from members of marginalized groups, we argue that a fuller account of offensive speech, and when it is objectionable, must focus on the ends, or impact, of language and how it is used. We also explore the degree to which NLP systems may encounter limits when modeling relational factors, for example due to technical limitations or concerns regarding privacy in data collection for training and evaluation. Nonetheless, developing a robust, translatable, relational understanding of offensiveness is key to the successful operationalization and use of this concept. Addressing this challenge, the present work considers how offensiveness has been operationalized in classification tasks, and the affordances and weaknesses thereof. We also discuss how a more relational approach can be implemented in data collection techniques and operationalizations of offensiveness.
Frameworks and Challenges to Participatory AI
Abeba Birhane
William Samuel Isaac
Madeleine Clare Elish
Iason Gabriel
Shakir Mohamed
In Proceedings of the Second Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO '22), ACM (2022)
Participatory approaches to artificial intelligence (AI) and machine learning (ML) are gaining momentum: the increased attention comes partly with the view that participation opens the gateway to an inclusive, equitable, robust, responsible and trustworthy AI. Among other benefits, participatory approaches are essential to understanding and adequately representing the needs, desires and perspectives of historically marginalized communities. However, there currently exists a lack of clarity on what meaningful participation entails and what it is expected to do. In this paper we first review participatory approaches as situated in historical contexts as well as participatory methods and practices within the AI and ML pipeline. We then introduce three case studies in participatory AI. Participation holds the potential for beneficial, emancipatory and empowering technology design, development and deployment while also being at risk for concerns such as cooptation and conflation with other activities. We lay out these limitations and concerns and argue that, as participatory AI/ML comes into vogue, a contextual and nuanced understanding of the term, as well as consideration of who the primary beneficiaries of participatory activities ought to be, constitute crucial factors to realizing the benefits and opportunities that participation brings.
The Reasonable Effectiveness of Diverse Evaluation Data
Christopher Homan
Alex Taylor
Human Evaluation for Generative Models (HEGM) Workshop at NeurIPS 2022
In this paper, we present findings from a semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI chatbot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development, specifically human evaluation of generative models, against the backdrop of growing work on sociotechnical AI evaluations.