Scalable Inference in Latent Variable Models

Mohamed Aly
Joseph Gonzalez
Shravan Narayanamurthy
Alex Smola
Proceedings of The 5th ACM International Conference on Web Search and Data Mining (WSDM) (2012)

Abstract

Latent variable techniques are pivotal in tasks ranging from
predicting user click patterns and targeting ads to organizing
the news and managing user-generated content. Latent variable
techniques like topic modeling, clustering, and subspace
estimation provide substantial insight into the latent structure
of complex data with little or no external guidance, making them
ideal for reasoning about large-scale, rapidly evolving datasets.
Unfortunately, due to the data dependencies and global state
introduced by latent variables and the iterative nature of latent
variable inference, latent-variable techniques are often
prohibitively expensive to apply to large-scale, streaming
datasets.
In this paper we present a scalable parallel framework for
efficient inference in latent variable models over streaming
web-scale data. Our framework addresses three key challenges:
1) synchronizing the global state, which includes global latent
variables (e.g., cluster centers and dictionaries); 2) efficiently
storing and retrieving the large local state, which includes the
data-points and their corresponding latent variables (e.g.,
cluster membership); and 3) sequentially incorporating streaming
data (e.g., the news). We address these challenges by
introducing: 1) a novel delta-based aggregation system with a
bandwidth-efficient communication protocol; 2) schedule-aware
out-of-core storage; and 3) approximate forward sampling to
rapidly incorporate new data. We demonstrate state-of-the-art
performance of our framework by easily tackling datasets two
orders of magnitude larger than those addressed by the current
state-of-the-art. Furthermore, we provide an optimized and easily
customizable open-source implementation of the framework.
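
The abstract does not describe how the delta-based aggregation works, so the following is only an illustrative sketch of the general idea, not the paper's protocol. It assumes the global state is a table of additive counts (for example, topic counts), that each worker accumulates its changes locally, and that a synchronization step ships only the nonzero net changes. The names GlobalStore, Worker, apply_delta, and sync are hypothetical and chosen for this example.

```python
from collections import Counter


class GlobalStore:
    """Hypothetical central store for shared, additive statistics."""

    def __init__(self):
        self.counts = Counter()

    def apply_delta(self, delta):
        # Addition commutes, so deltas from different workers can be
        # applied in any order and still yield the same totals.
        for key, change in delta.items():
            self.counts[key] += change

    def snapshot(self):
        # Return a copy so workers never mutate the store directly.
        return Counter(self.counts)


class Worker:
    """Hypothetical worker holding a possibly stale copy of the global state."""

    def __init__(self, store):
        self.store = store
        self.local = store.snapshot()
        self.delta = Counter()  # net changes since the last sync

    def update(self, key, change):
        # Apply the change locally and remember it for the next sync.
        self.local[key] += change
        self.delta[key] += change

    def sync(self):
        # Send only entries whose net change is nonzero (assumption: this
        # is where the bandwidth saving comes from in this sketch).
        payload = {k: v for k, v in self.delta.items() if v != 0}
        self.store.apply_delta(payload)
        self.delta.clear()
        self.local = self.store.snapshot()
        return len(payload)  # number of entries actually transmitted
```

In this sketch a worker that increments and then decrements the same key between two syncs transmits nothing for that key, and two workers that update the same key are reconciled by simple addition at the store.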
