Tutorial: New Templates for Scalable Data Analysis
Abstract
Scalable data analysis has come a long way since the intro-
duction of the MapReduce paradigm a decade ago. In this
tutorial we present algorithms for synchronous and asyn-
chronous data processing. They are capable of dealing
with the amounts of data typically available on the internet.
We give a brief description of the problems one faces
when performing scalable machine learning on the inter-
net. To motivate matters we provide a number of scenarios
from spam filtering, advertising and collaborative filtering.
This is followed by an extensive discussion of current and
novel synchronous data processing techniques. In particu-
lar we emphasize how insights from systems research and
databases can be used to achieve significant improvements
both in terms of expressiveness and in terms of efficiency of
the deployed algorithms.
We then describe asynchronous data analysis and inference
methods. These are particularly
necessary whenever the estimation problem requires the use
of a significant number of latent variables. This includes
cases such as clustering, topic models, or graph factoriza-
tion. We provide an ample number of motivating examples
and applications, ranging from user profiling to the analysis
of communication networks. Special emphasis is placed on
approximations needed to scale algorithms to hundreds of
millions of users and billions of documents.