Scalable Inference in Latent Variable Models

Mohamed Aly
Joseph Gonzalez
Shravan Narayanamurthy
Alex Smola
Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM), 2012

Abstract

Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user generated content. Latent variable techniques like topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance, making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, due to the data dependencies and global state introduced by latent variables and the iterative nature of latent variable inference, latent-variable techniques are often prohibitively expensive to apply to large-scale, streaming datasets.

In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state, which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state, which includes the data-points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data. We demonstrate state-of-the-art performance of our framework by easily tackling datasets two orders of magnitude larger than those addressed by the current state-of-the-art. Furthermore, we provide an optimized and easily customizable open-source implementation of the framework.
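To make the delta-based aggregation idea concrete, below is a minimal Python sketch, assuming a topic-modeling-style global count table in which each worker ships only the change (delta) in its local counts since its last synchronization, rather than the full state. The names (SyncServer, Worker, push_delta) are illustrative placeholders, not the paper's actual API, and the sketch ignores networking, scheduling, and fault tolerance.

```python
from collections import defaultdict

class SyncServer:
    """Holds the global state (e.g., topic-word counts) and applies worker deltas."""
    def __init__(self):
        self.global_counts = defaultdict(int)

    def push_delta(self, delta):
        # Only non-zero changes are transmitted, so bandwidth is proportional
        # to what each worker actually touched since its last synchronization.
        for key, change in delta.items():
            self.global_counts[key] += change

    def pull(self):
        return dict(self.global_counts)


class Worker:
    """Keeps a local copy of the global state plus accumulated local changes."""
    def __init__(self, server):
        self.server = server
        self.local_counts = defaultdict(int, server.pull())
        self.delta = defaultdict(int)

    def update(self, key, change):
        # A local inference step (e.g., reassigning a token to a new topic).
        self.local_counts[key] += change
        self.delta[key] += change

    def synchronize(self):
        # Ship only the accumulated delta, then refresh the local copy
        # from the merged global state.
        self.server.push_delta(self.delta)
        self.delta.clear()
        self.local_counts = defaultdict(int, self.server.pull())


# Example: two workers update overlapping entries and then synchronize.
server = SyncServer()
w1, w2 = Worker(server), Worker(server)
w1.update(("topic_3", "soccer"), +2)
w2.update(("topic_3", "soccer"), +1)
w1.synchronize()
w2.synchronize()
print(server.pull())  # {('topic_3', 'soccer'): 3}
```

In the system described in the paper the exchange happens asynchronously across many machines over a bandwidth-efficient protocol; this sketch only captures the delta bookkeeping that makes such synchronization cheap.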

Research Areas