Large-Scale Automatic Classification of Phishing Pages

Colin Whittaker
Brian Ryner
Marria Nazif
NDSS '10 (2010)

Abstract

Phishing websites, fraudulent sites that trick viewers into interacting with them, continue to cost Internet users over a billion dollars each year. In this paper, we describe the design and performance characteristics of a scalable machine learning classifier we developed to detect phishing web sites. We use this classifier to maintain Google's phishing blacklist automatically. Our classifier analyzes millions of pages a day, examining the URL and the contents of a page to determine whether or not a page is phishing. Unlike previous work in this field, we train the classifier on a noisy dataset consisting of millions of samples from previously collected live classification data. Despite the noise in the training data, our classifier learns a robust model for identifying phishing pages which correctly classifies more than 90% of phishing pages several weeks after training concludes.