Learning to Attack: Adversarial Transformation Networks

To appear in the Proceedings of AAAI-2018.

Abstract

With the rapidly increasing popularity of deep neural networks for image recognition tasks, a parallel interest in generating adversarial examples to attack the trained models has arisen. To date, these approaches have involved either computing gradients with respect to the image pixels or solving an optimization directly on the image pixels. We generalize this pursuit in a novel direction: can a separate network be trained to efficiently attack another fully trained network? We demonstrate that it is possible, and that the generated attacks yield startling insights into the weaknesses of the target network. We call such a network an Adversarial Transformation Network (ATN). ATNs transform any input into an adversarial attack on the target network, while minimally perturbing both the original input and the target network's outputs. Further, we show that ATNs not only cause the target network to make an error, but can be constructed to explicitly control the type of misclassification made. We demonstrate ATNs on both simple MNIST digit classifiers and a state-of-the-art ImageNet classifier deployed by Google, Inc.: Inception ResNet-v2.
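
To make the core idea concrete, the following is a minimal training-loop sketch, not the paper's implementation: an attack network g is trained against a frozen target classifier f with a joint loss that keeps g(x) close to x while pushing f(g(x)) toward a chosen target class. The architecture, the beta weight, the simple cross-entropy target loss, and the use of PyTorch are all illustrative assumptions.

    # Illustrative ATN training sketch (architecture, beta, and the target
    # loss are assumptions; the paper's exact formulation may differ).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ATN(nn.Module):
        """Small network mapping a flattened input image to an adversarial one."""
        def __init__(self, dim=784):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, 512), nn.ReLU(),
                nn.Linear(512, dim), nn.Sigmoid(),  # keep pixels in [0, 1]
            )

        def forward(self, x):
            return self.net(x)

    def train_step(atn, target_model, x, target_class, opt, beta=0.1):
        """One ATN update: stay close to x, push the frozen target model
        toward target_class. Only the ATN's parameters are optimized."""
        adv = atn(x)
        logits = target_model(adv)            # target network stays frozen
        l_input = ((adv - x) ** 2).mean()     # perturbation penalty on the input
        t = torch.full((x.size(0),), target_class, dtype=torch.long)
        l_output = F.cross_entropy(logits, t) # targeted misclassification loss
        loss = beta * l_input + l_output
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # Usage: opt = torch.optim.Adam(atn.parameters()); once trained, a single
    # forward pass atn(x) produces an adversarial example, which is what makes
    # this approach efficient relative to per-image gradient or optimization
    # attacks.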