Google Research

Scalable Realistic Recommendation Datasets through Fractal Expansions

arXiv, Cornell University (2019)

Abstract

Recommender system research currently suffers from a disconnect between the size of academic data sets and the scale of production systems. To bridge that gap, we propose to generate larger user/item interaction data sets by expanding pre-existing public data sets.

User/item incidence matrices record interactions between users and items on a given platform as a large sparse matrix whose rows correspond to users and whose columns correspond to items. Our novel scalable technique expands such matrices to larger numbers of rows (users), columns (items) and nonzero values (interactions) while preserving key higher-order properties.

We adapt Kronecker graph theory to user/item incidence matrices and show that the corresponding fractal expansions preserve the fat-tailed distributions of user engagements, item popularity and singular value spectra of user/item interaction matrices. Preserving such properties is key to building realistic large synthetic data sets, which in turn can be used reliably to benchmark recommender systems and the systems used to train them.
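To make the idea of a Kronecker-style expansion concrete, the following is a minimal sketch using a plain Kronecker product in NumPy. It is not the authors' algorithm: the toy `base` matrix, the small `kernel` matrix and the helper `kronecker_expand` are all hypothetical and chosen only for illustration. The sketch checks three standard facts about Kronecker products that suggest why such expansions can be self-similar: the nonzero count multiplies, per-user interaction counts are products of the factors' counts, and the singular values are pairwise products of the factors' singular values.

```python
import numpy as np


def kronecker_expand(base, kernel):
    """Expand `base` by taking the Kronecker product kernel (x) base.

    Each entry of `kernel` scales one full copy of `base`, so the result has
    kernel.shape[0] * base.shape[0] rows and kernel.shape[1] * base.shape[1] columns.
    """
    return np.kron(kernel, base)


# Toy user/item incidence matrix (6 users, 4 items); values are illustrative only.
rng = np.random.default_rng(0)
base = (rng.random((6, 4)) < 0.4).astype(float)

# Hypothetical small expansion kernel (2 x 3), an assumption for this sketch.
kernel = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0]])

expanded = kronecker_expand(base, kernel)

# Number of interactions multiplies: nnz(expanded) = nnz(kernel) * nnz(base).
nnz_expanded = np.count_nonzero(expanded)
nnz_product = np.count_nonzero(kernel) * np.count_nonzero(base)

# Per-user interaction counts (row sums) are products of the factors' row sums.
row_sums_expanded = expanded.sum(axis=1)
row_sums_product = np.kron(kernel.sum(axis=1), base.sum(axis=1))

# Singular values of the expansion are all pairwise products of the factors'
# singular values, padded with zeros up to the size of the expanded matrix.
sv_expanded = np.linalg.svd(expanded, compute_uv=False)
sv_pairs = np.outer(np.linalg.svd(kernel, compute_uv=False),
                    np.linalg.svd(base, compute_uv=False)).ravel()
sv_product = np.zeros_like(sv_expanded)
sv_product[:sv_pairs.size] = np.sort(sv_pairs)[::-1]

print("expanded shape:", expanded.shape)
print("nnz matches:", nnz_expanded == nnz_product)
print("row sums match:", np.allclose(row_sums_expanded, row_sums_product))
print("singular values match:", np.allclose(sv_expanded, sv_product))
```

A dense `np.kron` is only practical for toy sizes. At the scale discussed here a sparse representation would be needed, and the choice of kernel determines how realistic the expanded data looks.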

We provide algorithms to produce such expansions and apply them to the MovieLens 20 million data set comprising 20 million ratings of 27K movies by 138K users. The resulting expanded data set has 10 billion ratings, 800K items and 2M users in its smallest version.
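As a rough sanity check on these figures, the short sketch below computes the growth factors and matrix densities implied by the rounded counts quoted in this abstract. The dictionary names and the `density` helper are hypothetical, and because the inputs are rounded, the results are only approximate.

```python
# Rounded counts as quoted in the abstract (not exact data set statistics).
source_counts = {"users": 138_000, "items": 27_000, "ratings": 20_000_000}
expanded_counts = {"users": 2_000_000, "items": 800_000, "ratings": 10_000_000_000}


def density(counts):
    """Fraction of user/item pairs that have a recorded rating."""
    return counts["ratings"] / (counts["users"] * counts["items"])


# Multiplicative growth of each dimension from the source to the expanded data set.
growth = {key: expanded_counts[key] / source_counts[key] for key in source_counts}

for key, factor in growth.items():
    print(f"{key}: x{factor:.1f}")
print(f"source density:   {density(source_counts):.4%}")
print(f"expanded density: {density(expanded_counts):.4%}")
```

With these rounded inputs, the number of ratings grows by roughly the product of the user and item growth factors, so the expanded matrix stays about as sparse as the original.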
