A Practical Algorithm for Solving the Incoherence Problem of Topic Models In Industrial Applications

Daniel Silva
James Long
KDD (2017), pp. 1713-1721

Abstract

Topic models are often applied in industrial settings to discover
user profiles from activity logs, where documents correspond
to users and words to complex objects such as
web sites and installed apps. Standard topic models ignore
the content-based similarity structure between these objects
largely because of the inability of the Dirichlet prior to capture
such side information about word-word correlation. Several
approaches have been proposed to replace the Dirichlet prior
with more expressive alternatives. However, this added expressivity
comes with a heavy premium: inference becomes
intractable and sparsity is lost, which renders these alternatives
unsuitable for industrial-scale applications. In this
paper we take a radically different approach to incorporating
word-word correlation in topic models by applying this
side information at the posterior level rather than at the
prior level. We show that this choice preserves sparsity and
results in a graph-based sampler for LDA whose computational
complexity is asymptotically on par with the state-of-the-art
alias-based sampler for LDA. We illustrate the
efficacy of our approach on real industrial datasets that
span up to a billion users, tens of millions of words, and
thousands of topics. To the best of our knowledge, our approach
provides the first practical and scalable solution to
this important problem.
