Tutorial: New Templates for Scalable Data Analysis

Alex Smola
Markus Weimer
The 21st International World Wide Web Conference (WWW) (2012)

Abstract

Scalable data analysis has come a long way since the introduction of the MapReduce paradigm a decade ago. In this tutorial we present algorithms for synchronous and asynchronous data processing that are capable of dealing with the amounts of data typically available on the internet. We give a brief description of the problems one faces when performing scalable machine learning on the internet. To motivate matters we provide a number of scenarios from spam filtering, advertising, and collaborative filtering. This is followed by an extensive discussion of current and novel synchronous data processing techniques. In particular, we emphasize how insights from systems research and databases can be used to achieve significant improvements in both the expressiveness and the efficiency of the deployed algorithms. We then describe asynchronous data analysis and inference methods. These are particularly necessary whenever the estimation problem requires a significant number of latent variables, as in clustering, topic models, or graph factorization. We provide an ample number of motivating examples and applications, ranging from user profiling to the analysis of communication networks. Special emphasis is placed on the approximations needed to scale algorithms to hundreds of millions of users and billions of documents.

Research Areas