Automated Adversarial Discovery for Safety Classifiers
Abstract
Safety classifiers are used in critical settings such as toxicity detection.
However, they are brittle, and their failure modes are not known in advance. Traditional adversarial data generation methods are rigid and tend to produce attacks of similar types. Manually enumerating attack types and collecting examples of each is expensive and ultimately infeasible. Discovering new types of attacks therefore requires automated methods. Current attack generation methods rely either on simple perturbations, which rarely produce naturally occurring data, or on language models, which rarely generate data along previously unknown dimensions. To discover new types of attacks on safety classifiers, we introduce a discover-adapt framework that leverages large language models (LLMs) to iteratively identify distinct subtypes of toxicity (discover) and transform seed text to fit each subtype (adapt). Using adversarial success and dimensional diversity as evaluation metrics, we demonstrate that our method generates more of the desired adversarial data than existing approaches for identity attacks, insults, and sexually explicit content.
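To make the abstract's discover-adapt loop concrete, the following is a minimal sketch of how such an iteration might be structured. The function names (query_llm, classifier_score), prompts, and the 0.5 decision threshold are hypothetical placeholders chosen for illustration; they are not the paper's actual prompts or implementation.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model (hypothetical)."""
    raise NotImplementedError("Wire up an LLM client here.")


def classifier_score(text: str) -> float:
    """Placeholder for the safety classifier under test; returns a score in [0, 1]."""
    raise NotImplementedError("Wire up the target safety classifier here.")


def discover_adapt(seed_text: str, num_iterations: int = 5, threshold: float = 0.5):
    """Iteratively (1) discover a new toxicity subtype and (2) adapt the seed text
    to that subtype, keeping adaptations the classifier fails to flag."""
    found_subtypes = []
    adversarial_examples = []
    for _ in range(num_iterations):
        # Discover: ask the LLM for a toxicity subtype not yet explored.
        subtype = query_llm(
            "Name a subtype of toxicity different from: "
            + ", ".join(found_subtypes or ["none so far"])
        )
        # Adapt: rewrite the seed text so that it expresses that subtype.
        candidate = query_llm(
            f"Rewrite the following text as an example of '{subtype}':\n{seed_text}"
        )
        # Adversarial success: keep candidates the classifier scores below threshold.
        if classifier_score(candidate) < threshold:
            adversarial_examples.append((subtype, candidate))
        found_subtypes.append(subtype)
    return adversarial_examples
```

In this sketch, dimensional diversity is encouraged by conditioning each discovery prompt on the subtypes already found, while adversarial success is checked by scoring each adapted candidate against the classifier under test.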