
Graph mining

Our mission is to build the most scalable library for graph algorithms and analysis and apply it to a multitude of Google products.


About the team

We formalize data mining and machine learning challenges as graph problems and perform fundamental research in those fields, leading to publications in top venues. Our algorithms and systems are used in a wide array of Google products such as Search, YouTube, AdWords, Play, Maps, and Social.

Team focus summaries

Large-Scale balanced partitioning

Balanced Partitioning splits a large graph into roughly equal parts while minimizing cut size. The problem of “fairly” dividing a graph arises in a number of contexts, such as assigning work in a distributed processing environment. Our techniques produced a 40% drop in multi-shard queries in Google Maps driving directions, substantially reducing CPU usage.
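To make the idea concrete, here is a minimal sketch of the embed-then-split approach (hypothetical names and toy data, not the production implementation): once vertices are ordered by a linear embedding, cutting the order into k contiguous chunks gives balanced parts, and cut size measures quality.

```python
# Minimal embed-then-split sketch. `order` is assumed to come from some
# linear embedding (e.g. a Hilbert curve over coordinates); the helper
# name and data are illustrative only.

def balanced_partition(order, edges, k):
    """order: vertices in linear-embedding order; edges: iterable of (u, v)."""
    n = len(order)
    part = {v: i * k // n for i, v in enumerate(order)}   # k contiguous chunks
    cut = sum(1 for u, v in edges if part[u] != part[v])  # edges across parts
    return part, cut

order = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("a", "f")]
part, cut = balanced_partition(order, edges, k=2)
print(part, "cut size:", cut)   # two parts of size 3, cut size 2
```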

Large-Scale clustering

Our team specializes in clustering graphs at Google scale, efficiently implementing many different algorithms including hierarchical clustering, overlapping clustering, local clustering, and spectral clustering.

Large-Scale connected components

Connected Components is a fundamental subroutine in many graph algorithms. We have state-of-the-art implementations in a variety of paradigms including MapReduce, a distributed hash table, Pregel, and ASYMP. Our methods are 10-30x faster than the best previously studied algorithms, and easily scale to graphs with trillions of edges.
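As an illustration of the basic idea, here is a toy sequential version of min-label propagation, which underlies many round-based implementations (a sketch only, not the optimized production code):

```python
# Min-label propagation for connected components: every vertex repeatedly
# adopts the smallest label seen across its edges until nothing changes.
# Round-based variants of this idea map naturally onto MapReduce or Pregel.

def connected_components(vertices, edges):
    label = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            lo = min(label[u], label[v])
            if label[u] != lo or label[v] != lo:
                label[u] = label[v] = lo
                changed = True
    return label

print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 4} -- two components
```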

Large-Scale link modeling

Large-Scale similarity ranking

Our research in pairwise similarity ranking has produced a number of innovative methods, which we have published at top conferences such as WWW, ICML, and VLDB. We maintain a library of similarity algorithms including distributed Personalized PageRank, Egonet similarity, and others.
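For intuition, a minimal single-machine power-iteration sketch of Personalized PageRank (illustrative only; the library versions are distributed):

```python
# Personalized PageRank by power iteration: at each step, a fraction
# `alpha` of the mass restarts at the source and the rest flows along
# out-edges. Toy version; not the distributed library implementation.

def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """adj: dict mapping node -> list of out-neighbors."""
    ppr = {v: 0.0 for v in adj}
    ppr[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[source] = alpha                      # restart mass at the source
        for v, nbrs in adj.items():
            if not nbrs:
                continue
            share = (1 - alpha) * ppr[v] / len(nbrs)
            for u in nbrs:                       # spread the rest along edges
                nxt[u] += share
        ppr = nxt
    return ppr

adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(personalized_pagerank(adj, "a"))
```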

Public-private graph computation

Our research on novel models of graph computation addresses important privacy issues in graph mining. Specifically, we present techniques to efficiently solve graph problems, such as computing clusterings, centrality scores, and shortest-path distances for each node, based on that node's personal view of the private data in the graph, while preserving the privacy of each user.
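A toy sketch of the model itself (hypothetical data): each user's view of the network is the union of the public graph and that user's own private edges.

```python
# Public-private model, illustrated with made-up users and edges: the
# public edge set is visible to everyone, private edges only to their owner.

public_edges = {("u1", "u2"), ("u2", "u3")}
private_edges = {"u1": {("u1", "u4")}, "u3": {("u3", "u4")}}

def view(user):
    """The graph from `user`'s viewpoint: public plus their private edges."""
    return public_edges | private_edges.get(user, set())

print(view("u1"))   # includes ("u1", "u4"), which is invisible to "u2"
```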

Streaming and dynamic graph algorithms

We perform innovative research analyzing massive dynamic graphs, developing efficient algorithms for densest subgraph and triangle counting that operate even under high-velocity streaming updates.
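As a small illustration of the streaming setting, consider incremental triangle counting under edge insertions: each new edge (u, v) closes one triangle per common neighbor of u and v. The sketch below is a toy version, not our streaming algorithms.

```python
# Incremental triangle counting: on inserting (u, v), the number of new
# triangles equals |N(u) & N(v)| over the edges seen so far.

from collections import defaultdict

adj = defaultdict(set)
triangles = 0

def insert_edge(u, v):
    global triangles
    triangles += len(adj[u] & adj[v])   # common neighbors close triangles
    adj[u].add(v)
    adj[v].add(u)

for e in [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]:
    insert_edge(*e)
print(triangles)   # 2: triangles {1,2,3} and {2,3,4}
```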

ASYMP: Async Message Passing Graph Mining

ASYMP is a graph mining framework based on asynchronous message passing. In this framework we have highly scalable implementations of Connected Components and of shortest paths to a subset of nodes.
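A simplified, single-machine sketch of the message-passing style: distances to a set of seed nodes are computed by relaxing messages from a queue. ASYMP itself is distributed and asynchronous; this is only an illustration.

```python
# Message-passing shortest paths to a seed set: a message (v, d) proposes
# distance d for node v; if it improves v's value, v forwards messages to
# its neighbors. The graph here is undirected (symmetric adjacency).

from collections import deque

def distances_to_seeds(adj, seeds):
    """adj: node -> list of (neighbor, weight); seeds: target node set."""
    dist = {v: float("inf") for v in adj}
    queue = deque((s, 0.0) for s in seeds)       # initial messages
    while queue:
        v, d = queue.popleft()
        if d < dist[v]:                          # message improves v's value
            dist[v] = d
            for u, w in adj[v]:                  # propagate to neighbors
                queue.append((u, d + w))
    return dist

adj = {"a": [("b", 1)], "b": [("a", 1), ("c", 2)], "c": [("b", 2)]}
print(distances_to_seeds(adj, {"c"}))   # {'a': 3.0, 'b': 2.0, 'c': 0.0}
```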

Large-Scale centrality ranking

Google’s most famous algorithm, PageRank, is a method for computing importance scores for vertices of a directed graph. In addition to PageRank, we have scalable implementations of several other centrality scores, such as harmonic centrality.
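For example, harmonic centrality of a node is the sum of reciprocal shortest-path distances to all other reachable nodes. A minimal BFS-based sketch (illustrative only; our implementations are distributed):

```python
# Harmonic centrality of v: sum of 1/d(v, u) over all nodes u reachable
# from v, computed here by a single breadth-first search.

from collections import deque

def harmonic_centrality(adj, v):
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return sum(1.0 / d for d in dist.values() if d > 0)

adj = {1: [2], 2: [1, 3], 3: [2]}
print(harmonic_centrality(adj, 1))   # 1/1 + 1/2 = 1.5
```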

Large-Scale graph building

The GraphBuilder library can convert data from a metric space (such as document text) into a similarity graph. GraphBuilder scales to massive datasets by applying fast locality sensitive hashing and neighborhood search.
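A toy sketch of the idea using random-hyperplane LSH (hypothetical helper names; GraphBuilder itself is far more sophisticated): points whose hash signatures collide land in the same bucket, and only those candidate pairs are considered as edges, avoiding all-pairs comparison.

```python
# Random-hyperplane LSH: each point's signature is the sign pattern of its
# dot products with random hyperplanes; nearby points likely share a bucket.

import random
from collections import defaultdict

def build_similarity_graph(points, dim, n_planes=8, seed=0):
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for name, vec in points.items():
        sig = tuple(sum(p * c for p, c in zip(plane, vec)) >= 0
                    for plane in planes)
        buckets[sig].append(name)
    edges = set()
    for group in buckets.values():          # compare within buckets only
        for i, u in enumerate(group):
            for v in group[i + 1:]:
                edges.add((u, v))
    return edges

points = {"d1": [1.0, 0.9], "d2": [0.9, 1.0], "d3": [-1.0, -1.0]}
print(build_similarity_graph(points, dim=2))   # likely {("d1", "d2")}
```

In practice the candidate pairs would then be scored by an actual similarity function; the bucketing step only bounds how many pairs must be scored.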

Featured publications

Affinity Clustering: Hierarchical Clustering at Scale
Soheil Behnezhad
Mahsa Derakhshan
MohammadTaghi Hajiaghayi
Raimondas Kiveris
NIPS 2017, pp. 6867-6877
Graph clustering is a fundamental task in many data-mining and machine-learning pipelines. In particular, identifying good hierarchical clustering structure is at the same time a fundamental and challenging problem for several applications. In many applications, the amount of data to analyze is increasing at an astonishing rate each day. Hence there is a need for new solutions to efficiently compute effective hierarchical clusterings on such huge data. In this paper, we propose algorithms to address this problem. First, we analyze minimum spanning tree-based clustering algorithms and their corresponding hierarchical clusterings. In particular we consider classic single-linkage clustering based on Kruskal's algorithm and a variation of Boruvka's algorithm that we call affinity clustering, and prove new interesting properties of these clusterings via the concept of certificates. Then we present new algorithms in the MapReduce model and their efficient real-world implementations via Distributed Hash Tables (DHTs). Our MapReduce algorithms indeed improve upon the previous MapReduce algorithms for finding a minimum spanning tree in graphs as well. Finally we show experimentally that our algorithms are scalable for huge data and competitive with state-of-the-art algorithms. In particular we show that Affinity Clustering is in practice superior to several state-of-the-art clustering algorithms.
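For the flavor of the algorithm, here is one Borůvka-style round of affinity clustering sketched sequentially (the paper's algorithms run in MapReduce; this toy version is for illustration): every node joins its minimum-weight incident edge, and the resulting groups form one level of the hierarchy.

```python
# One affinity-clustering round: each node picks its cheapest incident
# edge, and union-find merges nodes along the chosen edges.

def affinity_round(vertices, weighted_edges):
    """weighted_edges: list of (u, v, w). Returns a cluster id per vertex."""
    best = {}
    for u, v, w in weighted_edges:            # each endpoint considers (u, v)
        for a, b in ((u, v), (v, u)):
            if a not in best or w < best[a][1]:
                best[a] = (b, w)
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]     # path halving
            v = parent[v]
        return v
    for a, (b, _) in best.items():            # merge along chosen edges
        parent[find(a)] = find(b)
    return {v: find(v) for v in vertices}

edges = [(1, 2, 1.0), (2, 3, 5.0), (3, 4, 1.0)]
print(affinity_round([1, 2, 3, 4], edges))    # {1,2} merge, {3,4} merge
```

Repeating rounds on the contracted graph yields the full hierarchy, one level per round.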
Distributed Balanced Partitioning via Linear Embedding
Ninth ACM International Conference on Web Search and Data Mining (WSDM), ACM (2016), pp. 387-396
Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems: in some cases, a big graph is chopped into pieces that fit on one machine to be processed independently before stitching the results together, leading to certain suboptimality from the interaction among different pieces. In other cases, links between different parts may show up in the running time and/or network communications cost, hence the desire to have small cut size. We study a distributed balanced partitioning problem where the goal is to partition the vertices of a given graph into k pieces, minimizing the total cut size. Our algorithm is composed of a few steps that are easily implementable in distributed computation frameworks, e.g., MapReduce. The algorithm first embeds nodes of the graph onto a line, and then processes nodes in a distributed manner guided by the linear embedding order. We examine various ways to find the first embedding, e.g., via a hierarchical clustering or Hilbert curves. Then we apply four different techniques such as local swaps, minimum cuts on partition boundaries, as well as contraction and dynamic programming. Our empirical study compares the above techniques with each other, and to previous work in distributed algorithms, e.g., a label propagation method [34], FENNEL [32] and Spinner [23]. We report our results both on a private map graph and several public social networks, and show that our results beat previous distributed algorithms: we notice, e.g., 15-25% reduction in cut size over [34]. We also observe that our algorithms allow for scalable distributed implementation for any number of partitions. Finally, we apply our techniques for the Google Maps Driving Directions to minimize the number of multi-shard queries with the goal of saving in CPU usage. During live experiments, we observe an ≈ 40% drop in the number of multi-shard queries when comparing our method with a standard geography-based method.
As a fundamental tool for modeling and analyzing social and information networks, large-scale graph mining is an important component of any tool set for big data analysis. Processing graphs with hundreds of billions of edges is only possible via developing distributed algorithms under distributed graph mining frameworks such as MapReduce, Pregel, Giraph, and the like. For these distributed algorithms to work well in practice, we need to take into account several metrics such as the number of rounds of computation and the communication complexity of each round. For example, given the popularity and ease of use of the MapReduce framework, developing practical algorithms with good theoretical guarantees for basic graph algorithms is a problem of great importance. In this tutorial, we first discuss how to design and implement algorithms based on the traditional MapReduce architecture. In this regard, we discuss various basic graph-theoretic problems such as computing connected components, maximum matching, MST, triangle counting, and overlapping or balanced clustering. We discuss a computation model for MapReduce and describe the sampling, filtering, local random walk, and core-set techniques to develop efficient algorithms in this framework. At the end, we explore the possibility of employing other distributed graph processing frameworks. In particular, we study the effect of augmenting MapReduce with a distributed hash table (DHT) service and also discuss the use of a new graph processing framework called ASYMP based on the asynchronous message-passing method. In particular, we will show that using ASYMP, one can improve CPU usage and achieve significantly improved running times.
We introduce the public-private model of graphs. In this model, we have a public graph and each node in the public graph has an associated private graph. The motivation for studying this model stems from social networks, where the nodes are the users, the public graph is visible to everyone, and the private graph at each node is visible only to the user at the node. From each node's viewpoint, the graph is just a union of its private graph and the public graph. We consider the problem of efficiently computing various properties of the graphs from each node's point of view, with minimal amount of recomputation on the public graph. To illustrate the richness of our model, we explore two powerful computational paradigms for studying large graphs, namely, sketching and sampling, and focus on some key problems in social networks and show efficient algorithms in the public-private graph model. In the sketching model, we show how to efficiently approximate the neighborhood function, which in turn can be used to approximate various notions of centrality. In the sampling model, we focus on all-pair shortest path distances, node similarities, and correlation clustering.
Densest subgraph computation has emerged as an important primitive in a wide range of data analysis tasks such as community and event detection. Social media such as Facebook and Twitter are highly dynamic, with new friendship links and tweets being generated incessantly, calling for efficient algorithms that can handle very large and highly dynamic input data. While either scalable or dynamic algorithms for finding densest subgraphs have been proposed, a viable and satisfactory solution for addressing both the dynamic aspect of the input data and its large size is still missing. We study the densest subgraph problem in the dynamic graph model, for which we present the first scalable algorithm with provable guarantees. In our model, edges are added adversarially while they are removed uniformly at random from the current graph. We show that at any point in time we are able to maintain a 2(1+ε)-approximation of a current densest subgraph, while requiring O(polylog(n+r)) amortized cost per update (with high probability), where r is the total number of update operations executed and n is the maximum number of nodes in the graph. In contrast, a naive algorithm that recomputes a dense subgraph every time the graph changes requires Ω(m) work per update, where m is the number of edges in the current graph. Our theoretical analysis is complemented with an extensive experimental evaluation on large real-world graphs showing that (approximate) densest subgraphs can be maintained efficiently within hundreds of microseconds per update.
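As background, the classic greedy peeling algorithm of Charikar gives a 2-approximation for densest subgraph on a static graph; it is sketched below for illustration (the paper's contribution is maintaining such a subgraph under streaming updates).

```python
# Charikar-style greedy peeling: repeatedly remove a minimum-degree node
# and remember the densest intermediate subgraph (density = edges / nodes).

def densest_subgraph(vertices, edges):
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    m = len(edges)
    alive = set(vertices)
    best_density, best = 0.0, set(vertices)
    while alive:
        density = m / len(alive)
        if density > best_density:
            best_density, best = density, set(alive)
        v = min(alive, key=lambda x: len(adj[x]))   # peel min-degree node
        m -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        alive.discard(v)
    return best, best_density

verts = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
print(densest_subgraph(verts, edges))   # ({1, 2, 3, 4}, 1.5): the 4-clique
```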
In this paper, we present a study of the community structure of ego-networks—the graphs representing the connections among the neighbors of a node—for several online social networks. Toward this goal, we design a new technique to efficiently build and cluster all the ego-nets of a graph in parallel (note that even just building the ego-nets efficiently is challenging on large networks). Our experimental findings are quite compelling: at a microscopic level it is easy to detect high-quality communities. Leveraging this fact, we then develop new features for friend suggestion based on co-occurrences of two nodes in different ego-nets' communities. Our new features can be computed efficiently on very large scale graphs by just analyzing the neighborhood of each node. Furthermore, we prove formally on a stylized model, and show by experimental analysis, that this new similarity measure outperforms the classic local features employed for friend suggestions.
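A toy sketch of ego-net extraction, the induced subgraph on a node's neighbors as described above (illustrative only):

```python
# Ego-net of v: the edges among v's neighbors, with v itself excluded.

def ego_net(adj, v):
    nbrs = set(adj[v])
    return [(u, w) for u in nbrs for w in adj[u] if w in nbrs and u < w]

adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}
print(ego_net(adj, 1))   # [(2, 3)]: neighbors 2 and 3 are connected
```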
Grale: Designing Networks for Graph Learning
Alexandru Moșoi
Sam Ruth
Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery (2020), pp. 2523-2532
How can we find the right graph for semi-supervised learning? In real world applications, the choice of which edges to use for computation is the first step in any graph learning process. Interestingly, there are often many types of similarity available to choose as the edges between nodes, and the choice of edges can drastically affect the performance of downstream semi-supervised learning systems. However, despite the importance of graph design, most of the literature assumes that the graph is static. In this work, we present Grale, a scalable method we have developed to address the problem of graph design for graphs with billions of nodes. Grale operates by fusing together different measures of (potentially weak) similarity to create a graph which exhibits high task-specific homophily between its nodes. Grale is designed for running on large datasets. We have deployed Grale in more than 20 different industrial settings at Google, including datasets which have tens of billions of nodes, and hundreds of trillions of potential edges to score. By employing locality sensitive hashing techniques, we greatly reduce the number of pairs that need to be scored, allowing us to learn a task-specific model and build the associated nearest neighbor graph for such datasets in hours, rather than the days or even weeks that might be required otherwise. We illustrate this through a case study where we examine the application of Grale to an abuse classification problem on YouTube with hundreds of millions of items. In this application, we find that Grale detects a large number of malicious actors on top of hard-coded rules and content classifiers, increasing the total recall by 89% over those approaches alone.
We study fundamental graph problems such as graph connectivity, minimum spanning forest (MSF), and approximate maximum (weight) matching in a distributed setting. In particular, we focus on the Adaptive Massively Parallel Computation (AMPC) model, which is a theoretical model that captures MapReduce-like computation augmented with a distributed hash table. We show the first AMPC algorithms for all of the studied problems that run in a constant number of rounds and use only O(n^ϵ) space per machine, where 0<ϵ<1. Our results improve both upon the previous results in the AMPC model, as well as the best-known results in the MPC model, which is the theoretical model underpinning many popular distributed computation frameworks, such as MapReduce, Hadoop, Beam, Pregel and Giraph. Finally, we provide an empirical comparison of the algorithms in the MPC and AMPC models in a fault-tolerant distributed computation environment. We empirically evaluate our algorithms on a set of large real-world graphs and show that our AMPC algorithms can achieve improvements in both running time and round-complexity over optimized MPC baselines.
Scaling Graph Neural Networks with Approximate PageRank
Aleksandar Bojchevski
Johannes Klicpera
Amol Kapoor
Martin Blais
Benedek András Rozemberczki
Stephan Günnemann
KDD (2020)
Graph neural networks (GNNs) have emerged as a powerful approach for solving many network mining tasks. However, despite their successes on small datasets, efficiently utilizing them on massive web-scale data remains a challenge. All recently proposed scalable GNN approaches rely on a message passing procedure to propagate information on the graph, leading to expensive recursive neighborhood expansion (and aggregation) schemes during both training and inference. This limitation is particularly problematic if we want to consider neighbors that are multiple hops away. In contrast, by leveraging connections between GNNs and personalized PageRank, we develop a model that incorporates multi-hop neighborhood information in a single (non-recursive) step. Our model, PPRGo, is significantly faster than previous scalable approaches while maintaining state-of-the-art prediction performance. Moreover, our algorithm can produce a scalability certificate which guarantees that the predictions would not change had we used a more expensive non-scalable baseline instead. To demonstrate the strengths and the scalability of our approach, we both evaluate on existing datasets and propose a new large-scale graph learning setting, using the open academic graph (90M nodes, 3B edges). Additionally, we discuss the practical applications of large-scale semi-supervised learning like PPRGo at Google to solve node classification problems.
When Recommendation Goes Wrong - Anomalous Link Discovery in Recommendation Networks
Michael Schueppert
Jack Saalweachter
Mayur Thakur
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
We present a secondary ranking system to find and remove erroneous suggestions from a geospatial recommendation system. We discover such anomalous links by “double checking” the recommendation system’s output to ensure that it is both structurally cohesive, and semantically consistent. Our approach is designed for the Google Related Places Graph, a geographic recommendation system which provides results for hundreds of millions of queries a day. We model the quality of a recommendation between two geographic entities as a function of their structure in the Related Places Graph, and their semantic relationship in the Google Knowledge Graph. To evaluate our approach, we perform a large scale human evaluation of such an anomalous link detection system. For the long tail of unpopular entities, our models can predict the recommendations users will consider poor with up to 42% higher mean precision (29 raw points) than the live system. Results from our study reveal that structural and semantic features capture different facets of relatedness to human judges. We characterize our performance with a qualitative analysis detailing the categories of real-world anomalies our system is able to detect, and provide a discussion of additional applications of our method.