Graph mining

Our mission is to build the most scalable library for graph algorithms and analysis and apply it to a multitude of Google products.

About the team

We formalize data mining and machine learning challenges as graph problems and perform fundamental research in those fields leading to publications in top venues. Our algorithms and systems are used in a wide array of Google products such as Search, YouTube, AdWords, Play, Maps, and Social.

Team focus summaries

Large-Scale Clustering and Connected Components

Our team specializes in clustering at Google scale, efficiently implementing many different algorithms including hierarchical agglomerative clustering, correlation clustering, k-means clustering, DBSCAN, and connected components. Our methods scale to graphs with trillions of edges using multiple machines and can efficiently handle graphs of tens of billions of edges on a single multicore machine. The clustering library powers over a hundred different use-cases across Google.
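To make the connected components problem concrete, here is a minimal single-machine sketch using a union-find (disjoint-set) structure. It is only an illustration of the problem being solved, not the team's distributed or multicore implementation; the function name and the integer node ids are assumptions made for this example.

```python
def connected_components(num_nodes, edges):
    """Label each node 0..num_nodes-1 with the smallest node id in its component.

    Illustrative sketch only: nodes are assumed to be integer ids and the
    graph is assumed to fit in memory on one machine.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Path halving: point every other node on the path at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Attach the larger root under the smaller one, so the root of
            # every component is always its minimum node id.
            if ru < rv:
                parent[rv] = ru
            else:
                parent[ru] = rv

    return [find(x) for x in range(num_nodes)]


labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(labels)  # [0, 0, 0, 3, 3, 5]
```

Nodes 0, 1, and 2 share one label, nodes 3 and 4 share another, and the isolated node 5 forms its own component.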

Graph Neural Networks and Graph Embeddings

Our team specializes in large-scale learning on graph-structured data. We push the boundary on scalability, efficiency, and flexibility of our methods, informed by the complex heterogeneous systems abundant in our real-world industrial setting. In pursuit of scalability, we leverage both algorithmic improvements and novel hardware architectures. Our team develops and maintains TensorFlow-GNN, a library for training graph neural networks at Google scale.

Large-Scale balanced partitioning

Balanced Partitioning splits a large graph into roughly equal parts while minimizing cut size. The problem of “fairly” dividing a graph occurs in a number of contexts, such as assigning work in a distributed processing environment. Our techniques provided a 40% drop in multi-shard queries in Google Maps driving directions, saving a significant amount of CPU usage.
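The two quantities in this definition, balance and cut size, can be sketched in a few lines. The example below assumes the nodes have already been placed in some linear order and simply cuts that order into contiguous chunks; how a good order is found is the hard part and is not shown. The function names are hypothetical and chosen for this illustration.

```python
def split_ordering(order, k):
    """Split a node ordering into k contiguous parts whose sizes differ by at most one.

    Returns a dict mapping node -> part index. Assumes len(order) >= 1.
    """
    n = len(order)
    return {node: idx * k // n for idx, node in enumerate(order)}


def cut_size(edges, part):
    """Count edges whose endpoints land in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


# A path graph 0-1-2-3-4-5 split into two parts of three nodes each.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

good = split_ordering([0, 1, 2, 3, 4, 5], k=2)
bad = split_ordering([0, 3, 1, 4, 2, 5], k=2)

print(cut_size(path_edges, good))  # 1
print(cut_size(path_edges, bad))   # 3
```

Both splits are perfectly balanced, but the ordering that keeps neighbors adjacent cuts one edge instead of three, which is the kind of difference that reduces cross-shard work.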

Large-Scale link modeling

Our similarity ranking and centrality metrics serve as good features for understanding the characteristics of large graphs. They allow the development of link models useful for both link prediction and anomalous link discovery. Our tool Grale learns a similarity function that models the link relationships present in data.

Large-Scale similarity ranking

Our research in pairwise similarity ranking has produced a number of innovative methods, which we have published at top conferences such as WWW, ICML, and VLDB. We maintain a library of similarity algorithms including distributed Personalized PageRank, Egonet similarity, and others.
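For readers unfamiliar with Personalized PageRank, the following single-machine sketch computes it by power iteration. It is not the distributed implementation mentioned above. The restart probability `alpha`, the iteration count, and the handling of dangling nodes (their mass is returned to the source) are assumptions made for this example.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Approximate Personalized PageRank scores with respect to `source`.

    adj: dict mapping every node to a list of its out-neighbors (each
         neighbor must also appear as a key).
    alpha: restart probability (an assumed, commonly used default).
    """
    nodes = list(adj)
    rank = {v: 0.0 for v in nodes}
    rank[source] = 1.0

    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            out = adj[u]
            if out:
                share = (1 - alpha) * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling node: send its walking mass back to the source.
                nxt[source] += (1 - alpha) * rank[u]
        # Total mass is 1, so the restart step contributes exactly alpha.
        nxt[source] += alpha
        rank = nxt
    return rank


graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
scores = personalized_pagerank(graph, source=0)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [0, 2, 1, 3]
```

Node 3 receives a score of zero because no random walk started at node 0 can ever reach it, which illustrates why these scores are a natural similarity measure relative to the source.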

Public-private graph computation

Our research on novel models of graph computation addresses important privacy issues in graph mining. Specifically, we present techniques that efficiently solve graph problems for each node, including computing clusterings, centrality scores, and shortest-path distances, based on that node's personal view of the private data in the graph while preserving the privacy of each user.

Streaming and dynamic graph algorithms

We perform innovative research on analyzing massive dynamic graphs. We have developed efficient algorithms for computing the densest subgraph and counting triangles, which keep working even under high-velocity streaming updates.
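As a point of reference for what a dynamic triangle counter has to maintain, here is a simple exact baseline that updates the count on every edge insertion or deletion. It stores the whole graph in memory and is not the algorithm developed by the team; the class and method names are hypothetical.

```python
from collections import defaultdict


class DynamicTriangleCounter:
    """Exact triangle count for an undirected simple graph under edge updates.

    Baseline sketch: each update costs time proportional to the smaller
    adjacency set intersection, and the full graph is kept in memory.
    """

    def __init__(self):
        self.adj = defaultdict(set)
        self.triangles = 0

    def insert(self, u, v):
        if u == v or v in self.adj[u]:
            return  # ignore self-loops and duplicate edges
        # Every common neighbor of u and v closes one new triangle.
        self.triangles += len(self.adj[u] & self.adj[v])
        self.adj[u].add(v)
        self.adj[v].add(u)

    def delete(self, u, v):
        if v not in self.adj[u]:
            return  # edge not present
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Every common neighbor of u and v was part of a triangle that is now gone.
        self.triangles -= len(self.adj[u] & self.adj[v])


counter = DynamicTriangleCounter()
for edge in [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]:
    counter.insert(*edge)
print(counter.triangles)  # 2

counter.delete(0, 2)
print(counter.triangles)  # 0
```

Deleting the edge (0, 2) removes both triangles at once because that edge is shared by 0-1-2 and 0-2-3.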

Large-Scale centrality ranking

Google’s most famous algorithm, PageRank, is a method for computing importance scores for vertices of a directed graph. In addition to PageRank, we have scalable implementations of several other centrality scores, such as harmonic centrality.
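Harmonic centrality is less widely known than PageRank, so a small sketch may help. The version below assumes an unweighted directed graph and uses the convention that a node's score sums the reciprocal distances from every node that can reach it; other conventions exist. It runs one breadth-first search per query and is not a scalable implementation.

```python
from collections import deque


def harmonic_centrality(adj, target):
    """Sum of 1 / d(u, target) over all nodes u != target that can reach target.

    adj: dict mapping every node to a list of its out-neighbors (each
         neighbor must also appear as a key). Unreachable nodes contribute 0.
    """
    # Reverse the edges so a BFS from `target` walks paths that lead into it.
    rev = {v: [] for v in adj}
    for u, outs in adj.items():
        for v in outs:
            rev[v].append(u)

    dist = {target: 0}
    queue = deque([target])
    total = 0.0
    while queue:
        x = queue.popleft()
        for y in rev[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                total += 1.0 / dist[y]
                queue.append(y)
    return total


graph = {0: [1], 1: [2], 2: [], 3: [2]}
print(harmonic_centrality(graph, 2))  # 2.5
print(harmonic_centrality(graph, 0))  # 0.0
```

Node 2 is reached from nodes 1 and 3 at distance 1 and from node 0 at distance 2, giving 1 + 1 + 0.5 = 2.5, while node 0 has no incoming paths.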

Large-Scale graph building

The GraphBuilder library can convert data from a metric space (such as document text) into a similarity graph. GraphBuilder scales to massive datasets by applying fast locality sensitive hashing and neighborhood search.
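The general idea of using locality sensitive hashing to avoid comparing all pairs can be sketched as follows. This example is an assumption-laden illustration and does not describe GraphBuilder's actual hashing scheme or API: it uses MinHash with banding over token sets, Jaccard similarity as the edge weight, and hypothetical parameter names.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    return len(a & b) / len(a | b)


def build_similarity_graph(docs, num_bands=8, rows_per_band=2,
                           threshold=0.5, seed=0):
    """Return {(name_a, name_b): similarity} for candidate pairs above threshold.

    docs: dict mapping a name to a non-empty set of string tokens.
    Only pairs that share at least one LSH bucket are ever compared.
    """
    rng = random.Random(seed)
    prime = 2_147_483_647
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_bands * rows_per_band)]

    def signature(tokens):
        # crc32 gives a token id that is stable across Python processes.
        ids = [zlib.crc32(t.encode("utf-8")) for t in tokens]
        return [min((a * x + b) % prime for x in ids) for a, b in coeffs]

    # Documents whose signatures agree on every row of a band share a bucket.
    buckets = defaultdict(list)
    for name, tokens in docs.items():
        sig = signature(tokens)
        for band in range(num_bands):
            rows = tuple(sig[band * rows_per_band:(band + 1) * rows_per_band])
            buckets[(band, rows)].append(name)

    edges = {}
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            if (a, b) not in edges:
                sim = jaccard(docs[a], docs[b])
                if sim >= threshold:
                    edges[(a, b)] = sim
    return edges


docs = {
    "a": {"graph", "mining", "at", "scale"},
    "b": {"graph", "mining", "at", "scale"},
    "c": {"completely", "unrelated", "words"},
}
print(build_similarity_graph(docs))  # {('a', 'b'): 1.0}
```

Identical token sets always produce identical signatures and therefore always collide, while the exact Jaccard check filters out any pair that collides by chance. For near-duplicates, whether a pair becomes a candidate is probabilistic and depends on the number of bands and rows per band.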

Graph-based sampling

Distributed graph-based sampling has proved critical to various applications in active learning and data summarization, where the graph reveals signals about density and multi-hop connections. By combining sampling with deep learning, we tackle provably hard problems, and differentiable sampling also improves GNN scalability.

ML compiler optimization

We design and implement graph-based optimization techniques to improve the performance of ML compilers (e.g., XLA). For example, we replaced heuristic-based cost models with graph neural networks (GNNs), achieving significant training and serving speed-ups (see our external TpuGraphs benchmarks and large-scale GNN). We have also deployed model partitioning algorithms that split ML computation graphs across TPUs for pipeline parallelism, as well as designed novel methods to certify that these partitions are near-optimal.

Featured publications

Talk like a Graph: Encoding Graphs for Large Language Models

Bahar Fatemi

Bryan Perozzi

Jonathan Halcrow

ICLR (2024)

Measuring Re-identification Risk

CJ Carey

Travis Dick

Alessandro Epasto

Adel Javanmard

Josh Karlin

Shankar Kumar

Andres Munoz Medina

Vahab Mirrokni

Gabriel Henrique Nunes

Sergei Vassilvitskii

Peilin Zhong

SIGMOD (2023)

Near-Optimal Private and Scalable k-Clustering

Vincent Pierre Cohen-addad

Alessandro Epasto

Vahab Mirrokni

Shyam Narayanan

Peilin Zhong

NeurIPS (2022)

Optimal Distributed Submodular Optimization via Sketching

MohammadHossein Bateni

Hossein Esfandiari

Vahab Mirrokni

Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2018), pp. 1138-1147

Tackling Provably Hard Representative Selection via Graph Neural Networks

Anton Tsitsulin

Bryan Perozzi

Hossein Esfandiari

Mehran Kazemi

Mohammad "Hossein" Bateni

Vahab Mirrokni

Deepak Ramachandran

Transactions on Machine Learning Research (2023)

TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

Jason Lee

Jakub Łącki

Laxman Dhulipala

Vahab Mirrokni

SIGMOD'24 (2023)

Massively Parallel Computation via Remote Memory Access

Hossein Esfandiari

Jakub Łącki

Laxman Dhulipala

Soheil Behnezhad

Vahab Mirrokni

Warren Schudy

SPAA 2019

Affinity Clustering: Hierarchical Clustering at Scale

MohammadHossein Bateni

Soheil Behnezhad

Mahsa Derakhshan

MohammadTaghi Hajiaghayi

Raimondas Kiveris

Silvio Lattanzi

Vahab Mirrokni

NIPS 2017, pp. 6867-6877

Distributed Balanced Partitioning via Linear Embedding

Kevin Aydin

Mohammadhossein Bateni

Vahab Mirrokni

Ninth ACM International Conference on Web Search and Data Mining (WSDM), ACM (2016), pp. 387-396

Distributed Graph Algorithmics: Theory and Practice

Silvio Lattanzi

Vahab S. Mirrokni

WSDM (2015), pp. 419-420

Grale: Designing Networks for Graph Learning

Jonathan Jesse Halcrow

Alexandru Moșoi

Sam Ruth

Bryan Perozzi

Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery (2020), 2523–2532