Scalable Inference in Latent Variable Models
Abstract
Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user-generated content. Latent variable techniques like topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance, making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, due to the data dependencies and global state introduced by latent variables and the iterative nature of latent variable inference, latent variable techniques are often prohibitively expensive to apply to large-scale, streaming datasets.
In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state, which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state, which includes the data points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data.
We demonstrate the performance of our framework by tackling datasets two orders of magnitude larger than those addressed by the current state of the art. Furthermore, we provide an optimized, easily customizable open-source implementation of the framework.