Lora Aroyo

I am a research scientist at Google Research NYC, where I work on Data Excellence for AI. My team, DEER (Data Excellence for Evaluating Responsibly), is part of the Responsible AI (RAI) organization. Our work focuses on developing metrics and methodologies to measure the quality of human-labeled or machine-generated data, with a specific focus on gathering and evaluating adversarial data for safety evaluation of generative AI systems. I received an MSc in Computer Science from Sofia University, Bulgaria, and a PhD from Twente University, The Netherlands.

I am currently serving as a co-chair of the steering committee for the AAAI HCOMP conference series, and I am a founding member of both the DataPerf and AI Safety Benchmarking working groups at MLCommons for benchmarking data-centric AI. Check out our data-centric challenge Adversarial Nibbler, supported by Kaggle, Hugging Face, and MLCommons. In 2023 I gave the opening keynote at the NeurIPS conference, "The Many Faces of Responsible AI".

Prior to joining Google, I was a computer science professor heading the User-Centric Data Science research group at the VU University Amsterdam. Our team invented the CrowdTruth crowdsourcing method jointly with the Watson team at IBM. This method has been applied in various domains, such as digital humanities, medicine, and online multimedia. I also guided human-in-the-loop strategies as Chief Scientist at the NY-based startup Tagasauris.

Some of my prior community contributions include serving as president of the User Modeling Society, program co-chair of The Web Conference 2023, and member of the ACM SIGCHI conferences board.

For a list of my publications, please see my profile on Google Scholar.

Authored Publications
    We tackle the problem of providing accurate, rigorous p-values for comparisons between the results of two evaluated systems whose evaluations are based on a crowdsourced "gold" reference standard. While this problem has been studied before, we argue that the null hypotheses used in previous work have been based on a common fallacy of equality of probabilities, as opposed to the standard null hypothesis that two sets are drawn from the same distribution. We propose using the standard null hypothesis, that two systems' responses are drawn from the same distribution, and introduce a simulation-based framework for determining the true p-value for this null hypothesis. We explore how to estimate the true p-value from a single test set under different metrics, tests, and sampling methods, and call particular attention to the role of response variance, which exists in crowdsourced annotations as a product of genuine disagreement, in system predictions as a product of stochastic training regimes, and in generative models as an expected property of the outputs. We find that response variance is a powerful tool for estimating p-values, and present results for the metrics, tests, and sampling methods that make the best p-value estimates in a simple machine learning model comparison.
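The paper's full simulation framework is not reproduced here; as a hedged illustration of the underlying idea, the sketch below runs a paired sign-flip permutation test over per-item score differences between two systems evaluated on the same test set. The per-item scores, sample sizes, and function name are hypothetical.

```python
import numpy as np

def paired_signflip_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided p-value for the difference in mean per-item scores of two
    systems, using a paired sign-flip permutation test over item differences.

    scores_a, scores_b: per-item metric values (e.g. agreement with a crowd
    label distribution), aligned on the same test items.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = diffs.mean()
    n = len(diffs)
    count = 0
    for _ in range(n_resamples):
        # Under the null that the two systems are interchangeable, each
        # per-item difference is equally likely to have either sign.
        signs = rng.choice([-1.0, 1.0], size=n)
        if abs((signs * diffs).mean()) >= abs(observed):
            count += 1
    return (count + 1) / (n_resamples + 1)

# Hypothetical per-item scores for two systems on the same crowdsourced test set.
rng = np.random.default_rng(1)
sys_a = rng.normal(0.72, 0.10, size=200)
sys_b = rng.normal(0.70, 0.10, size=200)
print(paired_signflip_pvalue(sys_a, sys_b))
```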
    With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output is only sharing verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline that allows annotators to appropriately evaluate model output according to AIS guidelines. We empirically validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset) via human evaluation studies that suggest that AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.
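As a minimal sketch of how a two-stage annotation of this kind might be recorded and collapsed into a final judgment, assuming hypothetical field names and labels (not the released AIS guidelines):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TwoStageAnnotation:
    # Hypothetical fields for illustration only.
    interpretable: bool                        # stage 1: can the output be understood on its own?
    attributable_to_source: Optional[bool] = None  # stage 2: is it supported by the cited source?

def final_label(a: TwoStageAnnotation) -> str:
    """Collapse a two-stage annotation into a single label."""
    if not a.interpretable:
        return "not_interpretable"  # stage 2 is never reached
    return "attributable" if a.attributable_to_source else "not_attributable"

print(final_label(TwoStageAnnotation(interpretable=True, attributable_to_source=True)))
```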
    Chatbots based on large language models (LLMs) exhibit a level of human-like behavior that promises to have profound impacts on how people access information, create content, and seek social support. Yet these models have also shown a propensity toward biases and hallucinations, i.e., making up entirely false information and conveying it as truthful. Consequently, understanding and moderating safety risks in these models is a critical technical and social challenge. We use Bayesian multilevel models to explore the connection between rater demographics and their perception of safety in chatbot dialogues. We study a sample of 252 human raters stratified by gender, age, race/ethnicity, and location. Raters were asked to annotate the safety risks of 1,340 chatbot conversations. We show that raters from certain demographic groups are more likely to report safety risks than raters from other groups. We discuss the implications of these differences in safety perception and suggest measures to ameliorate these differences.
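The study uses Bayesian multilevel models; as a much simpler illustrative stand-in, the sketch below estimates per-demographic-group rates of safety flags with bootstrap confidence intervals. Column names and data are hypothetical.

```python
import numpy as np
import pandas as pd

def group_rates_with_ci(df, group_col, label_col, n_boot=5000, seed=0):
    """Per-group rate of 'unsafe' judgments with 95% bootstrap CIs.
    df: one row per (rater, conversation) rating; column names are hypothetical.
    """
    rng = np.random.default_rng(seed)
    out = {}
    for group, sub in df.groupby(group_col):
        labels = sub[label_col].to_numpy()
        boot = [rng.choice(labels, size=len(labels), replace=True).mean()
                for _ in range(n_boot)]
        out[group] = (labels.mean(),
                      np.percentile(boot, 2.5),
                      np.percentile(boot, 97.5))
    return pd.DataFrame(out, index=["rate", "ci_low", "ci_high"]).T

# Hypothetical ratings table: 1 = rater flagged the conversation as unsafe.
ratings = pd.DataFrame({
    "rater_age_group": ["18-29", "18-29", "30-49", "30-49", "50+", "50+"] * 50,
    "unsafe": np.random.default_rng(2).integers(0, 2, size=300),
})
print(group_rates_with_ci(ratings, "rater_age_group", "unsafe"))
```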
    Dialogue safety as a task is complex, in part because 'safety' entails a broad range of topics and concerns, such as toxicity, harm, legal concerns, health advice, etc. Who we ask to judge safety and who we ask to define safety may lead to differing conclusions. This is because definitions and understandings of safety can vary according to one's identity, public opinion, and the interpretation of existing laws and regulations. In this study, we compare annotations from a diverse set of over 100 crowd raters to gold labels derived from trust and safety (T&S) experts in a dialogue safety task consisting of 350 human-chatbot conversations. We find patterns of disagreements rooted in dialogue structure, dialogue content, and rating rationale. In contrast to typical approaches which treat gold labels as ground truth, we propose alternative ways of interpreting gold data and incorporating crowd disagreement rather than mitigating it. We discuss the complexity of safety annotation as a task, what crowd and T&S labels each uniquely capture, and how to make determinations about when and how to rely on crowd or T&S labels.
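A small sketch of the kind of comparison involved here: per item, how often the crowd agrees with the expert (T&S) gold label, plus the entropy of the crowd votes as a simple disagreement signal. The data structures and labels are hypothetical.

```python
from collections import Counter
import math

def crowd_vs_gold(item_votes, gold):
    """For each item: majority crowd label, fraction of crowd votes matching the
    T&S gold label, and vote entropy (in bits) as a disagreement signal.

    item_votes: {item_id: [crowd labels]}; gold: {item_id: gold label}.
    Both structures are hypothetical, for illustration only.
    """
    rows = []
    for item_id, votes in item_votes.items():
        counts = Counter(votes)
        majority, _ = counts.most_common(1)[0]
        n = len(votes)
        agree_gold = counts.get(gold[item_id], 0) / n
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        rows.append((item_id, majority, gold[item_id], agree_gold, entropy))
    return rows

votes = {"conv_1": ["safe", "safe", "unsafe", "safe"],
         "conv_2": ["unsafe", "safe", "unsafe", "unsafe"]}
gold = {"conv_1": "safe", "conv_2": "safe"}
for row in crowd_vs_gold(votes, gold):
    print(row)
```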
    AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
    Bhaktipriya Radharapu
    The 2023 Conference on Empirical Methods in Natural Language Processing (2023) (to appear)
    Adversarial testing of large language models (LLMs) is crucial for their safe and responsible deployment. We introduce a novel approach for automated generation of adversarial evaluation datasets to test the safety of LLM generations on new downstream applications. We call it AI-assisted Red-Teaming (AART), an automated alternative to current manual red-teaming efforts. AART offers a data generation and augmentation pipeline of reusable and customizable recipes that reduce human effort significantly and enable integration of adversarial testing earlier in new product development. AART generates evaluation datasets with high diversity of content characteristics critical for effective adversarial testing (e.g., sensitive and harmful concepts specific to a wide range of cultural and geographic regions and application scenarios). The data generation is steered by AI-assisted recipes to define, scope, and prioritize diversity within the application context. This feeds into a structured LLM-generation process that scales up evaluation priorities. Compared to some state-of-the-art tools, AART shows promising results in terms of concept coverage and data quality.
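AART itself is not reproduced here; the sketch below only illustrates the general shape of a recipe-driven generation step, with hypothetical recipe fields and a plain template standing in for the AI-assisted LLM-generation stage the abstract describes.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class RedTeamRecipe:
    # Hypothetical recipe fields; real AART recipes are richer and LLM-steered.
    concepts: list[str]   # sensitive/harmful concepts to probe
    regions: list[str]    # cultural/geographic contexts
    scenarios: list[str]  # application use cases

def expand_recipe(recipe: RedTeamRecipe) -> list[str]:
    """Expand a recipe into candidate adversarial prompts via a fixed template.
    A real pipeline would instead steer an LLM with these recipe fields."""
    template = ("Write a user request about '{concept}' as it might arise "
                "in {region}, in the context of {scenario}.")
    return [template.format(concept=c, region=r, scenario=s)
            for c, r, s in product(recipe.concepts, recipe.regions, recipe.scenarios)]

recipe = RedTeamRecipe(
    concepts=["medical misinformation"],
    regions=["South Asia", "Western Europe"],
    scenarios=["a travel-planning assistant"],
)
for prompt in expand_recipe(recipe):
    print(prompt)
```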
    Conventional machine learning paradigms often rely on binary distinctions between positive and negative examples, disregarding the nuanced subjectivity that permeates real-world tasks and content. This simplistic dichotomy has served us well so far, but because it obscures the inherent diversity in human perspectives and opinions, as well as the inherent ambiguity of content and tasks, it limits model performance relative to real-world expectations. This becomes even more critical when we study the impact and potential multifaceted risks associated with the adoption of emerging generative AI capabilities across different cultures and geographies. To address this, we argue that to achieve robust and responsible AI systems we need to shift our focus away from a single point of truth and weave a diversity of perspectives into the data used by AI systems, to ensure the trust, safety, and reliability of model outputs. In this talk, I present a number of data-centric use cases that illustrate the inherent ambiguity of content and natural diversity of human perspectives that cause unavoidable disagreement, which needs to be treated as signal and not noise. This leads to a call for action to establish culturally-aware and society-centered research on the impacts of data quality and data diversity for the purposes of training and evaluating ML models and fostering responsible AI deployment in diverse sociocultural contexts.
    Machine learning approaches often require training and evaluation datasets with a clear separation between positive and negative examples. This risks simplifying and even obscuring the inherent subjectivity present in many tasks. Preserving such variance in content and diversity in datasets is often expensive and laborious. This is especially troubling when building safety datasets for conversational AI systems, as safety is both socially and culturally situated. To demonstrate this crucial aspect of conversational AI safety, and to facilitate in-depth model performance analyses, we introduce the DICES (Diversity In Conversational AI Evaluation for Safety) dataset, which contains fine-grained demographic information about raters, high replication of ratings per item to ensure statistical power for analyses, and rater votes encoded as distributions across different demographics to allow for in-depth explorations of different aggregation strategies. In short, the DICES dataset enables the observation and measurement of variance, ambiguity, and diversity in the context of conversational AI safety. We also illustrate how the dataset offers a basis for establishing metrics to show how raters' ratings can intersect with demographic categories such as racial/ethnic groups, age groups, and genders. The goal of DICES is to be used as a shared resource and benchmark that respects diverse perspectives during safety evaluation of conversational AI systems.
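As a hedged sketch of the encoding the abstract describes, the snippet below turns per-rater safety votes into label distributions per item and demographic slice, instead of a single aggregated label. Column names and ratings are hypothetical, not the DICES schema.

```python
import pandas as pd

# Hypothetical rating rows: one row per (item, rater), with rater demographics.
ratings = pd.DataFrame({
    "item_id":      [1, 1, 1, 1, 2, 2, 2, 2],
    "rater_gender": ["woman", "man", "woman", "man"] * 2,
    "safety_label": ["unsafe", "safe", "unsafe", "unsafe",
                     "safe", "safe", "unsafe", "safe"],
})

# Encode votes as a distribution per (item, demographic slice) rather than a
# single aggregated label, so different aggregation strategies can be compared.
dist = (ratings
        .groupby(["item_id", "rater_gender"])["safety_label"]
        .value_counts(normalize=True)
        .unstack(fill_value=0.0))
print(dist)
```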
    Adversarial Nibbler: A DataPerf Challenge for Text-to-Image Models
    Hannah Kirk
    Jessica Quaye
    Charvi Rastogi
    Max Bartolo
    Oana Inel
    Meg Risdal
    Will Cukierski
    Vijay Reddy
    Online (2023)
    Machine learning progress has been strongly influenced by the data used for model training and evaluation. Only recently, however, have development teams shifted their focus more to the data. This shift has been triggered by the numerous reports about biases and errors discovered in AI datasets. Thus, the data-centric AI movement introduced the notion of iterating on the data used in AI systems, as opposed to the traditional model-centric AI approach, which typically treats the data as a given static artifact in model development. With the recent advancement of generative AI, the role of data is even more crucial for successfully developing more factual and safe models. DataPerf challenges follow up on recent successful data-centric challenges, drawing attention to the data used for training and evaluation of machine learning models. Specifically, Adversarial Nibbler focuses on data used for safety evaluation of generative text-to-image models. A typical bottleneck in safety evaluation is achieving a representative diversity and coverage of different types of examples in the evaluation set. Our competition aims to gather a wide range of long-tail and unexpected failure modes for text-to-image models in order to identify as many new problems as possible and use various automated approaches to expand the dataset to be useful for training, fine-tuning, and evaluation.
    Data Excellence for AI: Why Should You Care
    Matt Lease
    Praveen Kumar Paritosh
    ACM IX Interactions (2022)
    The efficacy of machine learning (ML) models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which empirical progress is measured. Benchmark datasets such as SQuAD, GLUE, and ImageNet define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the models, e.g., via shared-task challenges or Kaggle contests, rather than critiquing and improving the data environment in which our models operate. Research and community challenges focused on improving the data itself are relatively rare. If "data is the new oil," our use of data remains crude today, and we are missing work on the refineries by which the data itself could be optimized for more effective use. Important scientific opportunities and value are being neglected [Schaekermann et al., 2020].

    Data is potentially the most under-valued and de-glamorised aspect of today's AI ecosystem. Data issues are often perceived and characterized as mundane and rote, the "pre-processing" that has to be done before the real (modeling) work can be done. For example, Kandel et al. (2012) emphasize that ML practitioners view data wrangling as tedious and time-consuming. However, Sambasivan et al. (2021) provide examples of how data quality is crucial to ensure that AI systems can accurately represent and predict the phenomena they claim to measure. They introduce four classes of Data Cascades: compounding events causing negative, downstream effects from data issues triggered by conventional AI/ML practices that undervalue data quality. This emphasizes the significance of data due to its downstream impact on user wellbeing and societal effects. Real-world datasets are often 'dirty', with various data quality problems (Northcutt et al., 2021), and with the risk of "garbage in = garbage out" for the downstream AI systems we train and test on such data. This has inspired a steadily growing body of work on understanding and improving data quality (Chu et al., 2013; Krishnan et al., 2016; Redman et al., 2018; Raman et al., 2001). It also highlights the importance of rigorously managing data quality using mechanisms specific to data validation, instead of relying on model performance as a proxy for data quality (Thomas et al., 2020).

    Just as we rigorously test our code for software defects before deployment, we might test for data defects with the same degree of rigor, so that we might detect, prevent, or mitigate weaknesses in ML models caused by underlying issues in data quality. The "Crowdsourcing Adverse Test Sets for Machine Learning (CATS4ML)" Data Challenge (Aroyo and Paritosh, 2021) aims to raise the bar in ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process. Similarly to Vandenhof (2019), CATS4ML relies on people's abilities and intuition to spot new data examples about which machine learning models are confident but which are actually misclassified. This research is inspired by Attenberg et al. (2015) and follows the claim by Ipeirotis (2016) that "Humans should always be part of machine learning solutions, as they can guide machine learning systems to learn about things that the systems don't yet know — the 'unknown unknowns.'" Many benchmark datasets contain instances that are relatively easy (e.g., photos with a subject that is easy to identify). In so doing, they miss the natural ambiguity of the real world in which our models are to be actually applied. Data instances with annotator disagreement are often aggregated to eliminate disagreement (obscuring uncertainty), or filtered out of datasets entirely. Exclusion of difficult and/or ambiguous real-world examples in evaluation risks "toy dataset" benchmarks that diverge from the real data to be encountered in practice. Models that succeed on such benchmarks fail to generalize to real data, and inflated benchmark results mislead our assessment of state-of-the-art capabilities. ML models become prone to developing "weak spots", i.e., classes of examples that are difficult or impossible for a model to accurately evaluate, because that class of examples is missing from the evaluation set.

    Measuring data quality is challenging, nebulous, and often circularly defined, with annotated data defining the "ground truth" on which models are trained and tested [Riezler, 2014]. When dataset quality is considered, the ways in which it is measured in practice are often poorly understood and sometimes simply wrong. Challenges identified include fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019; Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018; Gundersen and Kjensmo, 2018], and lack of documentation and replication of data [Katsuno et al., 2019]. Measurement of AI success and progress today is often metrics-driven, with emphasis on rigorous measurement and A/B testing. However, measuring goodness of fit of the model to the dataset completely ignores any consideration of how well the dataset fits the real-world problem to be solved and its data. Goodness-of-fit metrics, such as F1, Accuracy, and AUC, do not tell us much about data fidelity (i.e., how well the dataset represents reality) and validity (how well the data explains things related to the phenomena captured by the data). No standardised metrics exist today for characterising the goodness-of-data [11,13]. Research on metrics is emerging [15,91] but is not yet widely known, accepted, or applied in the AI ecosystem today. As a result, there is an overreliance on goodness-of-fit metrics and post-deployment product metrics. Focusing on fidelity and validity of data will further increase its scientific value and reusability. Such research is necessary for enabling better incentives for data, as it is hard to improve something we cannot measure.

    Researchers in human computation (HCOMP) and various ML-related fields have demonstrated a longstanding interest in applying crowdsourcing approaches to generate human-annotated data for model training and testing [25,128]. A series of workshops (Meta-Eval 2020 @ AAAI, REAIS 2019 @ HCOMP, SAD 2019 @ TheWebConf (WWW), SAD 2018 @ HCOMP) have helped increase awareness about the issues of data quality for ML evaluation and provide a venue for scholarship on this subject. Because human-annotated data represents the compass that the entire ML community relies on, data-focused research, by the HCOMP community and others, can potentially have a multiplicative effect on accelerating progress in ML more broadly. Optimizing the cost, size, and speed of collecting data has attracted significant attention in the first-to-market rush with data. However, aspects of maintainability, reliability, validity, and fidelity of datasets have often been overlooked.

    We argue we have now reached an inflection point in the field of ML in which attention to neglected data quality is poised to significantly accelerate progress. Toward this end, we advocate for research defining and creating processes to achieve data excellence, and we highlight examples, case studies, and methodologies. This will enable the necessary change in our research culture to value excellence in data practices, which is a critical milestone on the road to enabling the next generation of breakthroughs in ML and AI.
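To make the "test data like we test code" point concrete, here is a minimal sketch of a few data "unit tests" over a labeled dataset; the specific checks, column names, and thresholds are illustrative assumptions, not a prescribed suite.

```python
import pandas as pd

def run_data_checks(df: pd.DataFrame, text_col: str, label_col: str) -> dict:
    """A few illustrative 'data unit tests'; thresholds are arbitrary examples."""
    checks = {}
    checks["no_missing_labels"] = bool(df[label_col].notna().all())
    checks["no_duplicate_texts"] = not bool(df[text_col].duplicated().any())
    label_freq = df[label_col].value_counts(normalize=True)
    checks["no_label_below_5pct"] = bool((label_freq >= 0.05).all())
    checks["no_empty_texts"] = bool((df[text_col].str.strip().str.len() > 0).all())
    return checks

# Tiny hypothetical dataset; the duplicate text is caught by the second check.
data = pd.DataFrame({
    "text": ["a benign example", "a borderline example", "a benign example"],
    "label": ["safe", "unsafe", "safe"],
})
for name, passed in run_data_checks(data, "text", "label").items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```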
    To understand what captures people's attention (what they find relevant), we focused on better understanding the content of videos. In information science, the concept of relevance is most connected to end-users' judgments and is considered fundamental as a subjective, dynamic, user-centric perception. People might have or use different relevance standards or criteria when performing the task of video searching. Textual and visual criteria are essential for identifying relevant video content, but subjective, implicit criteria, such as interest or familiarity, could equally be used by people. Typically, people tend to build bridges to concepts or perspectives that are not necessarily shown in the video, but that might be expressed or referred to. We carried out a number of studies with news videos and broadcasts. In our initial study [6], we took a digital hermeneutics approach to understand which video aspects capture the attention of digital humanities scholars and drive the creation of narratives, or short audio-visual stories. In subsequent studies, we focused on understanding the utility of machine-extracted video concepts and how people can teach machines in terms of video concept relevance. We harnessed the intrinsic subjectivity of concept relevance to capture the diverse range of video concepts found relevant through the eyes of our participants [4]. We explored to what extent current information extraction systems meet users' goals, and what novel aspects users bring to video concept relevance assessment. We performed two types of crowdsourcing studies. The Selection study (Figure 1) focused on understanding the utility of machine-extracted video concepts from video subtitles and video streams, while the Free Input study (Figure 2) focused on understanding the complementarity between machine and human concepts in terms of relevance. By studying the gap between machines and humans in terms of perceived video concept relevance, we gained insights into how machines can collaborate with users to better support their needs and preferences. Our studies revealed that events, locations, people, organizations, and general concepts (i.e., of any type) are fundamental elements for content exploration and understanding. They are the concepts most commonly extracted by machines and, as such, are used in machine summarization of content as well as for information search. However, people engaging with online videos most often provide events, people, locations, and organizations as relevant concepts. Concepts of other types are also found relevant, but to a lesser extent. These concept types are thus fundamental for contextualizing the content of the videos, and also sufficient to capture human interest in terms of relevance.
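A minimal sketch of the machine-versus-human comparison described above: the overlap between machine-extracted concepts and the concepts participants marked as relevant for a video, with the human-only set surfacing the novel aspects people bring. The concept sets are hypothetical, and real concepts would need normalization before comparison.

```python
def concept_overlap(machine_concepts: set[str], human_concepts: set[str]) -> dict:
    """Overlap between machine-extracted and human-provided relevant concepts
    for one video (illustrative only)."""
    common = machine_concepts & human_concepts
    precision = len(common) / len(machine_concepts) if machine_concepts else 0.0
    recall = len(common) / len(human_concepts) if human_concepts else 0.0
    return {
        "machine_only": machine_concepts - human_concepts,
        "human_only": human_concepts - machine_concepts,  # novel aspects people add
        "precision": precision,
        "recall": recall,
    }

machine = {"Berlin", "protest", "European Union"}
human = {"Berlin", "protest", "freedom of assembly"}
print(concept_overlap(machine, human))
```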