Ding Wang
Ding Wang is an HCI researcher at Google AI. Her current research explores the intersection of HCI and AI, specifically the labour involved in producing data and its subsequent impact on AI systems. Prior to joining Google, Ding completed postdoctoral research at Microsoft Research India, where her projects focused on the future of work and on healthcare. Before joining MSR, Ding completed her PhD at the HighWire Centre for Doctoral Training at Lancaster University; her doctoral thesis offers a critical, alternative view on how smart cities should be designed, developed and evaluated.
Authored Publications
All that Agrees Is Not Gold: Evaluating Ground Truth Labels and Dialogue Content for Safety
Chris Homan
Greg Serapio-García
Alex Taylor
(2023)
Abstract
Dialogue safety as a task is complex, in part because ‘safety’ entails a broad range of topics and concerns, such as toxicity, harm, legal concerns, health advice, etc. Who we ask to judge safety and who we ask to define safety may lead to differing conclusions. This is because definitions and understandings of safety can vary according to one’s identity, public opinion, and the interpretation of existing laws and regulations. In this study, we compare annotations from a diverse set of over 100 crowd raters to gold labels derived from trust and safety (T&S) experts in a dialogue safety task consisting of 350 human-chatbot conversations. We find patterns of disagreements rooted in dialogue structure, dialogue content, and rating rationale. In contrast to typical approaches which treat gold labels as ground truth, we propose alternative ways of interpreting gold data and incorporating crowd disagreement rather than mitigating it. We discuss the complexity of safety annotation as a task, what crowd and T&S labels each uniquely capture, and how to make determinations about when and how to rely on crowd or T&S labels.
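The contrast drawn here, between treating the expert label as ground truth and keeping the crowd's disagreement visible, can be illustrated with a small sketch. Everything below is hypothetical: the conversation IDs, votes, and labels are made up and are not drawn from the study's data.

```python
from collections import Counter

# Hypothetical data: per-item crowd votes and a T&S expert ("gold") label.
# Labels and counts are illustrative only, not taken from the paper.
items = {
    "conv_001": {"crowd": ["safe", "safe", "unsafe", "safe", "unsafe"], "gold": "safe"},
    "conv_002": {"crowd": ["unsafe"] * 4 + ["safe"], "gold": "safe"},
}

for item_id, record in items.items():
    votes = Counter(record["crowd"])
    total = sum(votes.values())
    # Keep the full vote distribution instead of reducing it to a majority label.
    distribution = {label: count / total for label, count in votes.items()}
    majority = votes.most_common(1)[0][0]
    agrees_with_gold = majority == record["gold"]
    print(item_id, distribution, "gold:", record["gold"],
          "majority agrees with gold:", agrees_with_gold)
```

Keeping the distribution alongside the gold label makes items like the second one visible, where the crowd leans the opposite way from the expert judgment rather than simply being "wrong".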
DICES Dataset: Diversity in Conversational AI Evaluation for Safety
Alex Taylor
Chris Homan
Greg Serapio-García
NeurIPS 2023 (2023)
Abstract
Machine learning approaches often require training and evaluation datasets with a clear separation between positive and negative examples. This risks simplifying and even obscuring the inherent subjectivity present in many tasks. Preserving such variance in content and diversity in datasets is often expensive and laborious. This is especially troubling when building safety datasets for conversational AI systems, as safety is both socially and culturally situated. To demonstrate this crucial aspect of conversational AI safety, and to facilitate in-depth model performance analyses, we introduce the DICES (Diversity In Conversational AI Evaluation for Safety) dataset, which contains fine-grained demographic information about raters, high replication of ratings per item to ensure statistical power for analyses, and rater votes encoded as distributions across different demographics to allow for in-depth exploration of different aggregation strategies. In short, the DICES dataset enables the observation and measurement of variance, ambiguity, and diversity in the context of conversational AI safety. We also illustrate how the dataset offers a basis for establishing metrics to show how raters' ratings intersect with demographic categories such as racial/ethnic groups, age groups, and genders. The goal of DICES is to serve as a shared resource and benchmark that respects diverse perspectives during safety evaluation of conversational AI systems.
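To illustrate the kind of aggregation the dataset is designed to support, the sketch below contrasts a simple majority vote with keeping per-demographic rating distributions. The table layout, column names, and values are illustrative assumptions, not the actual DICES schema.

```python
import pandas as pd

# Hypothetical ratings table: one row per (conversation, rater) pair.
# Column names and values are illustrative only.
ratings = pd.DataFrame({
    "conversation_id": ["c1", "c1", "c1", "c1", "c2", "c2", "c2", "c2"],
    "rater_age_group": ["18-35", "36-55", "18-35", "56+", "18-35", "36-55", "56+", "56+"],
    "safety_rating":   ["safe", "unsafe", "safe", "unsafe", "unsafe", "unsafe", "safe", "unsafe"],
})

# Aggregation strategy A: a single majority label per conversation.
majority = (ratings.groupby("conversation_id")["safety_rating"]
            .agg(lambda s: s.mode().iloc[0]))

# Aggregation strategy B: keep the rating distribution per demographic group,
# which preserves the disagreement a majority vote would hide.
by_group = (ratings.groupby(["conversation_id", "rater_age_group"])["safety_rating"]
            .value_counts(normalize=True)
            .rename("proportion")
            .reset_index())

print(majority)
print(by_group)
```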
Intersecting Demographics: Bayesian Multilevel Models Reveal Age, Gender, and Racial Differences in Safety Perception of Chatbot Conversations
Chris Homan
Greg Serapio-García
Alex Taylor
(2023)
Abstract
Chatbots based on large language models (LLMs) exhibit a level of human-like behavior that promises to have profound impacts on how people access information, create content, and seek social support. Yet these models have also shown a propensity toward biases and hallucinations, i.e., making up entirely false information and conveying it as truthful. Consequently, understanding and moderating safety risks in these models is a critical technical and social challenge. We use Bayesian multilevel models to explore the connection between rater demographics and their perception of safety in chatbot dialogues. We study a sample of 252 human raters stratified by gender, age, race/ethnicity, and location. Raters were asked to annotate the safety risks of 1,340 chatbot conversations. We show that raters from certain demographic groups are more likely to report safety risks than raters from other groups. We discuss the implications of these differences in safety perception and suggest measures to ameliorate these differences.
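A varying-intercept logistic model is one minimal form of the Bayesian multilevel approach described here. The sketch below uses PyMC on synthetic data; the group structure, priors, and sample sizes are illustrative assumptions rather than the paper's actual model specification.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)

# Synthetic stand-in data: binary "unsafe" judgments from raters in a handful
# of demographic groups. Purely illustrative numbers, not the study's data.
n_groups = 4                                     # e.g. demographic cells
n_obs = 1000
group_idx = rng.integers(0, n_groups, size=n_obs)
y_obs = rng.integers(0, 2, size=n_obs)

with pm.Model() as safety_model:
    # Population-level (hyper)priors.
    mu = pm.Normal("mu", 0.0, 1.5)
    sigma = pm.HalfNormal("sigma", 1.0)

    # Group-level intercepts: partial pooling across demographic groups.
    group_intercept = pm.Normal("group_intercept", mu=mu, sigma=sigma, shape=n_groups)

    # Probability that a rating flags the conversation as unsafe.
    p_unsafe = pm.Deterministic("p_unsafe", pm.math.invlogit(group_intercept[group_idx]))
    pm.Bernoulli("rating", p=p_unsafe, observed=y_obs)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```

The posterior over the group intercepts then gives a direct, uncertainty-aware comparison of how likely raters in each group are to mark a conversation as unsafe.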
"I wouldn’t say offensive but...": Disability-Centered Perspectives on Large Language Models
Vinitha Gadiraju
Alex Taylor
Robin Brewer
Proceedings of FAccT 2023 (2023) (to appear)
Abstract
Large language models (LLMs) trained on real-world data can inadvertently reflect harmful societal biases, particularly toward historically marginalized communities. While previous work has primarily focused on harms related to age and race, emerging research has shown that biases toward disabled communities exist. This study extends prior work exploring the existence of harms by identifying categories of LLM-perpetuated harms toward the disability community. We conducted 19 focus groups, during which 56 participants with disabilities probed a dialog model about disability and discussed and annotated its responses. Participants rarely characterized model outputs as blatantly offensive or toxic. Instead, participants used nuanced language to detail how the dialog model mirrored subtle yet harmful stereotypes they encountered in their lives and dominant media, e.g., inspiration porn and able-bodied saviors. Participants often implicated training data as a cause for these stereotypes and recommended training the model on diverse identities from disability-positive resources. Our discussion further explores representative data strategies to mitigate harm related to different communities through annotation co-design with ML researchers and developers.
Annotator Diversity in Data Practices
Shivani Kapania
Alex Stephen Taylor
Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, New York, NY, USA (to appear)
Abstract
Diversity in datasets is a key component to building responsible AI/ML. Despite this recognition, we know little about the diversity among the annotators involved in data production. We investigated the approaches to annotator diversity through 16 semi-structured interviews and a survey with 44 AI/ML practitioners. While practitioners described nuanced understandings of annotator diversity, they rarely designed dataset production to account for diversity in the annotation process. The lack of action was explained through operational barriers: from the lack of visibility in the annotator hiring process, to the conceptual difficulty in incorporating worker diversity. We argue that such operational barriers and the widespread resistance to accommodating annotator diversity surface a prevailing logic in data practices, where neutrality, objectivity and 'representationalist thinking' dominate. By understanding this logic to be part of a regime of existence, we explore alternative ways of accounting for annotator subjectivity and diversity in data practices.
Abstract
Data is fundamental to AI/ML models. This paper investigates the work practices of data annotation as performed in industry, in India. Previous human-centred investigations have largely focused on annotators' subjectivity, bias and efficiency. We present a wider perspective on data annotation: following a grounded approach, we conducted 3 sets of interviews with 25 annotators, 10 industry experts and 12 ML/AI practitioners. Our results show that the work of annotators is dictated by the interests, priorities and values of others above their station. More than a technical task, we contend that data annotation is a systematic exercise of power through organisational structure and practice. We propose a set of implications for how we can cultivate and encourage better practice to balance the tension between the need for high-quality data at low cost and annotators' aspirations for well-being, career prospects, and active participation in building the AI dream.
The Reasonable Effectiveness of Diverse Evaluation Data
Christopher Homan
Alex Taylor
Human Evaluation for Generative Models (HEGM) Workshop at NeurIPS 2022
Abstract
In this paper, we present findings from a semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI chatbot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development, specifically human evaluation of generative models, against the backdrop of growing work on sociotechnical AI evaluations.
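One simple way to picture the kind of group-level comparison described above is a contingency-table test over judgment counts by region. The sketch below uses made-up counts and is an illustration only, not the analysis reported in the paper.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of safety judgments by rater region (not real data):
# rows = regions, columns = ("unsafe", "safe") judgment counts.
counts = [
    [120, 380],   # region A
    [180, 320],   # region B
    [90,  410],   # region C
]

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
if p_value < 0.05:
    print("Judgment rates differ significantly across regions (at alpha = 0.05).")
```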