Preethi Lahoti
Authored Publications
Safety classifiers are used in critical settings such as toxicity detection.
However, they are brittle and their failure cases are unknown. Traditional adversarial data generation methods are rigid and often result in similar types of attacks, and enumerating attack types and collecting corresponding examples of each is expensive and infeasible. To discover new types of attacks, we need automated methods for discovering adversarial types. Current attack generation methods rely either on simple perturbations, which are unlikely to produce naturally occurring data, or on language models, which are unlikely to generate data along unknown dimensions. To address the goal of discovering new types of attacks on safety classifiers, we introduce a discover-adapt framework that leverages large language models (LLMs) to iteratively identify different subtypes of toxicity (discover) and transform seed text to suit each subtype (adapt). Using adversarial success and dimensional diversity as evaluation metrics, we demonstrate that our method yields more of the desired data than existing approaches when generating identity attacks, insults, and sexually explicit content.
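To make the discover-adapt loop concrete, here is a minimal sketch of one possible realization; the `llm` callable, prompt wording, `classifier_score` function, and the 0.5 acceptance threshold are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a discover-adapt loop as described in the abstract.
# `llm` and `classifier_score` are hypothetical callables.

def discover_adapt(llm, classifier_score, seed_texts, num_rounds=3):
    """Iteratively discover toxicity subtypes and adapt seed texts to them."""
    subtypes = []
    adversarial_examples = []
    for _ in range(num_rounds):
        # Discover: ask the LLM for a toxicity subtype not yet covered.
        subtype = llm(
            "Name a subtype of toxicity different from: " + ", ".join(subtypes)
        )
        subtypes.append(subtype)
        # Adapt: rewrite each seed text so it expresses the new subtype.
        for seed in seed_texts:
            candidate = llm(f"Rewrite the text to express {subtype}: {seed}")
            # Keep candidates the safety classifier fails to flag
            # (an adversarial success).
            if classifier_score(candidate) < 0.5:
                adversarial_examples.append((subtype, candidate))
    return subtypes, adversarial_examples
```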
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
Bhaktipriya Radharapu
The 2023 Conference on Empirical Methods in Natural Language Processing (2023) (to appear)
Adversarial testing of large language models (LLMs) is crucial for their safe and responsible deployment. We introduce a novel approach for automated generation of adversarial evaluation datasets to test the safety of LLM generations on new downstream applications. We call it AI-assisted Red-Teaming (AART), an automated alternative to current manual red-teaming efforts. AART offers a data generation and augmentation pipeline of reusable and customizable recipes that significantly reduces human effort and enables integration of adversarial testing earlier in new product development. AART generates evaluation datasets with high diversity of content characteristics critical for effective adversarial testing (e.g., sensitive and harmful concepts specific to a wide range of cultural and geographic regions and application scenarios). The data generation is steered by AI-assisted recipes to define, scope, and prioritize diversity within the application context. This feeds into a structured LLM-generation process that scales up evaluation priorities. Compared to some state-of-the-art tools, AART shows promising results in terms of concept coverage and data quality.
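As an illustration of recipe-driven generation, the following sketch shows how a customizable recipe might be expanded into adversarial queries; the `RedTeamRecipe` fields, the `generate_eval_set` helper, and the prompt wording are hypothetical and not part of the AART release.

```python
# Illustrative sketch of recipe-driven adversarial data generation.
# All names and prompts below are assumptions, not the AART codebase.

from dataclasses import dataclass, field

@dataclass
class RedTeamRecipe:
    """A reusable, customizable recipe that scopes an evaluation."""
    application: str                               # e.g. "travel chatbot"
    harm_concepts: list = field(default_factory=list)
    regions: list = field(default_factory=list)

def generate_eval_set(llm, recipe, per_combination=5):
    """Expand a recipe into adversarial queries via structured LLM generation."""
    dataset = []
    for concept in recipe.harm_concepts:
        for region in recipe.regions:
            prompt = (
                f"Write {per_combination} adversarial user queries for a "
                f"{recipe.application}, probing '{concept}' in the context "
                f"of {region}. Return one query per line."
            )
            dataset.extend(llm(prompt).splitlines())
    return dataset
```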
Fairness without Demographics through Adversarially Reweighted Learning
Alex Beutel
Kang Lee
Advances in Neural Information Processing Systems 33 (2020)
Much of the previous machine learning (ML) fairness literature assumes that protected features such as race and sex are present in the dataset, and relies upon them to mitigate fairness concerns. However, in practice factors like privacy and regulation often preclude the collection of protected features, or their use for training or inference, severely limiting the applicability of traditional fairness research. Therefore we ask: How can we train an ML model to improve fairness when we do not even know the protected group memberships? In this work we address this problem by proposing Adversarially Reweighted Learning (ARL). In particular, we hypothesize that non-protected features and task labels are valuable for identifying fairness issues, and can be used to co-train an adversarial reweighting approach for improving fairness. Our results show that ARL improves Rawlsian Max-Min fairness, with notable AUC improvements for worst-case protected groups in multiple datasets, outperforming state-of-the-art alternatives.
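For intuition, here is a minimal sketch of one adversarially reweighted training step, assuming a generic PyTorch binary-classification setup; the model interfaces, the weight normalization, and the loss choice are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of an adversarially reweighted (ARL-style) training step.
# Architectures, normalization, and hyperparameters are illustrative only.

import torch
import torch.nn.functional as F

def arl_step(learner, adversary, opt_learner, opt_adversary, x, y):
    """Co-train: the adversary upweights hard examples, the learner adapts."""
    # The adversary sees only non-protected features and the task label, and
    # produces a non-negative weight per example, normalized to average 1.
    adv_logits = adversary(torch.cat([x, y.float().unsqueeze(1)], dim=1))
    weights = 1.0 + x.size(0) * torch.softmax(adv_logits.squeeze(1), dim=0)

    per_example_loss = F.binary_cross_entropy_with_logits(
        learner(x).squeeze(1), y.float(), reduction="none"
    )

    # Learner minimizes the weighted loss (weights treated as constants).
    opt_learner.zero_grad()
    (weights.detach() * per_example_loss).mean().backward()
    opt_learner.step()

    # Adversary maximizes the same weighted loss (gradient ascent via negation).
    opt_adversary.zero_grad()
    (-(weights * per_example_loss.detach()).mean()).backward()
    opt_adversary.step()
```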