Data Sampling using Locality Sensitive Hashing for Large Scale Graph Learning

Mohamed Farghal
Animesh Nandi
Sarath Shekkizhar

Abstract

An important step in graph-based data analysis and processing is the construction of similarity graphs. Recent works have focused on the semi-supervised setting to learn an optimal similarity function for constructing a task-optimal graph. However, in many scenarios with billions of data points and trillions of potential edges, the run-time and computational requirements for training the similarity model make these approaches impractical.

In this work, we consider data sampling as a means to overcome this issue. Unlike typical sampling use-cases which only seek diversity, the similarity-learning for graph construction problem requires data samples that are both diverse and representative of highly similar data points. We present an efficient sampling approach by taking an adaptive partition view of locality sensitive hashing. Theoretically, we show that, though the samples obtained are correlated with sampling probabilities that do not sum to one, the training loss estimated for learning the graph similarity model using our approach is unbiased with a smaller variance compared to random sampling. Experiments on public datasets demonstrate the superior generalization of similarity models learned via our sampling. In a real large-scale industrial abuse-detection example, we observe ≈10× increase in identifying abusive items while having lower false positive rate compared to the baseline.