Unified Analysis of Streaming News

Qirong Ho
Jacob Eisenstein
Eric P. Xing
Alex Smola
Choon-hui Teo
Proceedings of the 20th International World Wide Web conference (WWW) (2011)

Abstract

News clustering, categorization and analysis are key components of any news portal. They require algorithms capable of
dealing with dynamic data to cluster, interpret, and temporally aggregate news articles. These three tasks are often
solved separately. In this paper we present a unified framework to group incoming news articles into temporary but
tightly-focused storylines, to identify prevalent topics and
key entities within these stories, and to reveal the temporal
structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the
available wealth of data we build an efficient parallel inference algorithm by sequential Monte Carlo estimation. Time
and memory costs are nearly constant in the length of the
history, and the approach scales to hundreds of thousands
of documents. We demonstrate the efficiency and accuracy
on the publicly available TDT dataset and on data from a major
internet news site.