Lora Aroyo

Lora Aroyo

I am a research scientist at Google Research NYC where I work on Data Excellence for AI. My team DEER (Data Excellence for Evaluating Responsibly) is part of the Responsible AI (RAI) organization. Our work is focused on developing metrics and methodologies to measure the quality of human-labeled or machine-generated data. The specific scope of this work is for gathering and evaluation of adversarial data for Safety evaluation of Generative AI systems. I received MSc in Computer Science from Sofia University, Bulgaria, and PhD from Twente University, The Netherlands.

I am currently serving as a co-chair of the steering committee for the AAAI HCOMP conference series and I am a founding member of the DataPerf and the AI Safety Benchmarking working group both at MLCommons for benchmarking data-centric AI. Check out our data-centric challenge Adversarial Nibbler supported by Kaggle, Hugging Face and MLCommons. In 2023 I gave the opening keynote at NeurIPS Conference "The Many Faces of Responsible AI".

Prior to joining Google, I was a computer science professor heading the User-Centric Data Science research group at the VU University Amsterdam. Our team invented the CrowdTruth crowdsourcing method jointly with the Watson team at IBM. This method has been applied in various domains such as digital humanities, medical and online multimedia. I also guided the human-in-the-loop strategies as a Chief Scientist at a NY-based startup Tagasauris.

Some of my prior community contributions include president of the User Modeling Society, program co-chair of The Web Conference 2023, member of the ACM SIGCHI conferences board.

For a list of my publications, please see my profile on Google Scholar.

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Adversarial Nibbler: A DataPerf Challenge for Text-to-Image Models
    Hannah Kirk
    Jessica Quaye
    Charvi Rastogi
    Max Bartolo
    Oana Inel
    Meg Risdal
    Will Cukierski
    Vijay Reddy
    Online(2023)
    Preview abstract Machine learning progress has been strongly influenced by the data used for model training and evaluation. Only recently however, have development teams shifted their focus more to the data. This shift has been triggered by the numerous reports about biases and errors discovered in AI datasets. Thus, the data-centric AI movement introduced the notion of iterating on the data used in AI systems, as opposed to the traditional model-centric AI approach, which typically treats the data as a given static artifact in model development. With the recent advancement of generative AI, the role of data is even more crucial for successfully developing more factual and safe models. DataPerf challenges follow up on recent successful data- centric challenges drawing attention to the data used for training and evaluation of machine learning model. Specifically, Adversarial Nibbler focuses on data used for safety evaluation of generative text-to-image models. A typical bottleneck in safety evaluation is achieving a representative diversity and coverage of different types of examples in the evaluation set. Our competition aims to gather a wide range of long-tail and unexpected failure modes for text-to-image models in order to identify as many new problems as possible and use various automated approaches to expand the dataset to be useful for training, fine tuning, and evaluation. View details
    Preview abstract We tackle the problem of providing accurate, rigorous p-values for comparisons between the results of two evaluated systems whose evaluations are based on a crowdsourced “gold” reference standard. While this problem has been studied before, we argue that the null hypotheses used in previous work have been based on a common fallacy of equality of probabilities, as opposed to the standard null hypothesis that two sets are drawn from the same distribution. We propose using the standard null hypothesis, that two systems’ responses are drawn from the same distribution, and introduce a simulation-based framework for determining the true p-value for this null hypothesis. We explore how to estimate the true p-value from a single test set under different metrics, tests, and sampling methods, and call particular attention to the role of response variance, which exists in crowdsourced annotations as a product of genuine disagreement, and in system predictions as a product of stochastic training regimes, or in generative models as an expected property of the outputs. We find that response variance is a powerful tool for estimating p-values, and present results for the metrics, tests, and sampling methods that make the best p-value estimates in a simple machine learning model comparison View details
    Preview abstract Machine learning approaches often require training and evaluation datasets with a clear separation between positive and negative examples. This risks simplifying and even obscuring the inherent subjectivity present in many tasks. Preserving such variance in content and diversity in datasets is often expensive and laborious. This is especially troubling when building safety datasets for conversational AI systems, as safety is both socially and culturally situated. To demonstrate this crucial aspect of conversational AI safety, and to facilitate in-depth model performance analyses, we introduce the DICES (Diversity In Conversational AI Evaluation for Safety) dataset that contains fine-grained demographic information about raters, high replication of ratings per item to ensure statistical power for analyses, and encodes rater votes as distributions across different demographics to allow for in￾depth explorations of different aggregation strategies. In short, the DICES dataset enables the observation and measurement of variance, ambiguity, and diversity in the context of conversational AI safety. We also illustrate how the dataset offers a basis for establishing metrics to show how raters’ ratings can intersects with demographic categories such as racial/ethnic groups, age groups, and genders. The goal of DICES is to be used as a shared resource and benchmark that respects diverse perspectives during safety evaluation of conversational AI systems. View details
    Preview abstract With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output is only sharing verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models, when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline for allowing annotators to appropriately evaluate model output according to AIS guidelines. We empirically validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset) via human evaluation studies that suggest that AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies. View details
    Preview abstract Adversarial testing of large language models (LLMs) is crucial for their safe and responsible deployment. We introduce a novel approach for automated generation of adversarial evaluation datasets to test the safety of LLM generations on new downstream applications. We call it AI-assisted Red-Teaming (AART) - an automated alternative to current manual red-teaming efforts. AART offers a data generation and augmentation pipeline of reusable and customizable recipes that reduce human effort significantly and enable integration of adversarial testing earlier in new product development. AART generates evaluation datasets with high diversity of content characteristics critical for effective adversarial testing (e.g. sensitive and harmful concepts, specific to a wide range of cultural and geographic regions and application scenarios). The data generation is steered by AI-assisted recipes to define, scope and prioritize diversity within the application context. This feeds into a structured LLM-generation process that scales up evaluation priorities. Compared to some state-of-the-art tools, AART shows promising results in terms of concept coverage and data quality. View details
    Preview abstract Conventional machine learning paradigms often rely on binary distinctions between positive and negative examples, disregarding the nuanced subjectivity that permeates real-world tasks and content. This simplistic dichotomy has served us well so far, but because it obscures the inherent diversity in human perspectives and opinions, as well as the inherent ambiguity of content and tasks, it poses limitations on model performance aligned with real-world expectations. This becomes even more critical when we study the impact and potential multifaceted risks associated with the adoption of emerging generative AI capabilities across different cultures and geographies. To address this, we argue that to achieve robust and responsible AI systems we need to shift our focus away from a single point of truth and weave in a diversity of perspectives in the data used by AI systems to ensure the trust, safety and reliability of model outputs. In this talk, I present a number of data-centric use cases that illustrate the inherent ambiguity of content and natural diversity of human perspectives that cause unavoidable disagreement that needs to be treated as signal and not noise. This leads to a call for action to establish culturally-aware and society-centered research on impacts of data quality and data diversity for the purposes of training and evaluating ML models and fostering responsible AI deployment in diverse sociocultural contexts. View details
    Preview abstract Dialogue safety as a task is complex, in part because ‘safety’ entails a broad range of topics and concerns, such as toxicity, harm, legal concerns, health advice, etc. Who we ask to judge safety and who we ask to define safety may lead to differing conclusions. This is because definitions and understandings of safety can vary according to one’s identity, public opinion, and the interpretation of existing laws and regulations. In this study, we compare annotations from a diverse set of over 100 crowd raters to gold labels derived from trust and safety (T&S) experts in a dialogue safety task consisting of 350 human-chatbot conversations. We find patterns of disagreements rooted in dialogue structure, dialogue content, and rating rationale. In contrast to typical approaches which treat gold labels as ground truth, we propose alternative ways of interpreting gold data and incorporating crowd disagreement rather than mitigating it. We discuss the complexity of safety annotation as a task, what crowd and T&S labels each uniquely capture, and how to make determinations about when and how to rely on crowd or T&S labels. View details
    Preview abstract Chatbots based on large language models (LLM) exhibit a level of human-like behavior that promises to have profound impacts on how people access information, create content, and seek social support. Yet these models have also shown a propensity toward biases and hallucinations, i.e., make up entirely false information and convey it as truthful. Consequently, understanding and moderating safety risks in these models is a critical technical and social challenge. We use Bayesian multilevel models to explore the connection between rater demographics and their perception of safety in chatbot dialogues. We study a sample of 252 human raters stratified by gender, age, race/ethnicity, and location. Raters were asked to annotate the safety risks of 1,340 chatbot conversations. We show that raters from certain demographic groups are more likely to report safety risks than raters from other groups. We discuss the implications of these differences in safety perception and suggest measures to ameliorate these differences. View details
    Annotator Response Distributions as a Sampling Frame
    Christopher Homan
    LREC WOrkshop on Perspectivist NLP(2022)
    Preview abstract Annotator disagreement is often dismissed as noise or the result of poor annotation process quality. Others have argued that it can be meaningful. But lacking a rigorous statistical foundation, the analysis of disagreement patterns can resemble a high-tech form of tea-leaf-reading. We contribute a framework for analyzing the variation of per-item annotator response distributions to data for humans-in-the-loop machine learning. We provide visualizations for, and use the framework to analyze the variance in, a crowdsourced dataset of hard-to-classify examples of the OpenImages archive. View details
    The Reasonable Effectiveness of Diverse Evaluation Data
    Christopher Homan
    Alex Taylor
    Human Evaluation for Generative Models (HEGM) Workshop at NeurIPS2022
    Preview abstract In this paper, we present findings from an semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI-chat bot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development-- specifically human evaluation of generative models-- on the backdrop of growing work on sociotechnical AI evaluations. View details