Our mission is to build the most scalable library for graph algorithms and analysis and apply it to a multitude of Google products.
Our mission is to build the most scalable library for graph algorithms and analysis and apply it to a multitude of Google products.
We formalize data mining and machine learning challenges as graph problems and perform fundamental research in those fields leading to publications in top venues. Our algorithms and systems are used in a wide array of Google products such as Search, YouTube, AdWords, Play, Maps, and Social.
Our team specializes in clustering at Google scale, efficiently implementing many different algorithms including hierarchical agglomerative clustering, correlation clustering, k-means clustering, DBSCAN, and connected components. Our methods scale to graphs with trillions of edges using multiple machines and can efficiently handle graphs of tens of billions of edges on a single multicore machine. The clustering library powers over a hundred different use-cases across Google.
Our team specializes in large-scale learning on graph-structured data. We push the boundary on scalability, efficiency, and flexibility of our methods, informed by the complex heterogeneous systems abundant in our real-world industrial setting. In pursuit of scalability, we leverage both algorithmic improvements and novel hardware architectures. Our team develops and maintains TensorFlow-GNN, a library for training graph neural networks at Google scale.
Balanced Partitioning splits a large graph into roughly equal parts while minimizing cut size. The problem of “fairly” dividing a graph occurs in a number of contexts, such as assigning work in a distributed processing environment. Our techniques provided a 40% drop in multi-shard queries in Google Maps driving directions, saving a significant amount of CPU usage.
Our similarity ranking and centrality metrics serve as good features for understanding the characteristics of large graphs. They allow the development of link models useful for both link prediction and anomalous link discovery. Our tool Grale learns a similarity function that models the link relationships present in data.
Our research in pairwise similarity ranking has produced a number of innovative methods, which we have published at top conferences such as WWW, ICML, and VLDB. We maintain a library of similarity algorithms including distributed Personalized PageRank, Egonet similarity, and others.
Our research on novel models of graph computation addresses important issues of privacy in graph mining. Specifically, we present techniques to efficiently solve graph problems, including computing clustering, centrality scores and shortest path distances for each node, based on its personal view of the private data in the graph while preserving the privacy of each user.
We perform innovative research analyzing massive dynamic graphs. We have developed efficient algorithms for computing densest subgraph and triangle counting which operate even when subject to high velocity streaming updates.
Google’s most famous algorithm, PageRank, is a method for computing importance scores for vertices of a directed graph. In addition to PageRank, we have scalable implementations of several other centrality scores, such as harmonic centrality.
The GraphBuilder library can convert data from a metric space (such as document text) into a similarity graph. GraphBuilder scales to massive datasets by applying fast locality sensitive hashing and neighborhood search.
Distributed graph-based sampling has proved critical to various applications in active learning and data summarization, where the graph reveals signals about density and multi-hop connections. Combined with deep learning, we tackle provably hard problems and differentiable sampling helps GNN scalability too.
We design and implement graph-based optimization techniques to improve the performance of ML compilers (e.g., XLA). For example, we replaced heuristic-based cost models with graph neural networks (GNNs), achieving significant training and serving speed-ups (see our external TpuGraphs benchmarks and large-scale GNN). We have also deployed model partitioning algorithms that split ML computation graphs across TPUs for pipeline parallelism, as well as designed novel methods to certify that these partitions are near-optimal.
Alessandro Epasto
Allan Heydon
Anton Tsitsulin
Arjun Gopalan
Bahare Fatemi
Brandon Asher Mayer
Bryan Perozzi
CJ Carey
David Eisenstat
Dustin Zelle
Goran Žužić
Hendrik Fichtenberger
Jason Lee
Johannes Gasteiger, né Klicpera
Jonathan Halcrow
Kevin Aydin
Jakub Łącki
Laxman Dhulipala
Lin Chen
Matthew Fahrbach
Mohammadhossein Bateni
Nikos Parotsidis
Peilin Zhong
Rajesh Jayaram
Sami Abu-El-Haija
Sasan Tavakkol
Siddhartha Visveswara Jayanti
Silvio Lattanzi
Gang (Thomas) Fu
Vahab S. Mirrokni
Vincent Pierre Cohen-addad