Jonathan Halcrow
Jonathan received his Bachelor's in Math and Physics from Georgia Tech in 2003 and his PhD in Physics, also from Georgia Tech, in 2008. His PhD work focused on turbulence in plane Couette flow.
At Google, his main interests are graph algorithms for semi-supervised learning, optimizing graph construction for particular learning tasks, and graph neural networks.
Authored Publications
Graphs are a powerful tool for representing and analyzing complex relationships in real-world applications such as social networks, recommender systems, and computational finance. Reasoning on graphs is essential for drawing inferences about the relationships between entities in a complex system, and to identify hidden patterns and trends. Despite the remarkable progress in automated reasoning with natural text, reasoning on graphs with large language models (LLMs) remains an understudied problem. In this work, we perform the first comprehensive study of encoding graph-structured data as text for consumption by LLMs. We show that LLM performance on graph reasoning tasks varies on three fundamental levels: (1) the graph encoding method, (2) the nature of the graph task itself, and (3) interestingly, the very structure of the graph considered. These novel results provide valuable insight into strategies for encoding graphs as text. Using these insights, we illustrate how the correct choice of encoders can boost performance on graph reasoning tasks inside LLMs by 4.8% to 61.8%, depending on the task.
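The encoders and tasks studied are described in the paper itself; purely as a rough illustration of what "encoding a graph as text" can look like in practice, the sketch below serializes a small graph in two simple textual styles that could be placed in an LLM prompt. The function names and formats here are hypothetical and are not the encoders defined in the paper.

```python
# Illustrative sketch: two simple ways to serialize a graph as text for an LLM prompt.
# The encoder names and formats below are placeholders, not the paper's encoders.

def encode_edge_list(nodes, edges):
    """Describe the graph as a flat list of edges."""
    lines = [f"The graph has {len(nodes)} nodes: {', '.join(map(str, nodes))}."]
    lines += [f"Node {u} is connected to node {v}." for u, v in edges]
    return "\n".join(lines)

def encode_adjacency(nodes, edges):
    """Describe the graph node by node (adjacency-style encoding)."""
    neighbors = {n: [] for n in nodes}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    return "\n".join(
        f"Node {n} is connected to nodes {', '.join(map(str, sorted(ns)))}."
        for n, ns in neighbors.items()
    )

if __name__ == "__main__":
    nodes = [0, 1, 2, 3]
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    # The same question can be appended to either encoding to form a prompt.
    prompt = encode_edge_list(nodes, edges) + "\nQ: Is there a path from node 0 to node 2?"
    print(prompt)
```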
HUGE: Huge Unsupervised Graph Embeddings with TPUs
SIGKDD Conference on Knowledge Discovery and Data Mining, ACM (2023) (to appear)
Graphs are a representation of structured data that captures the relationships between sets of objects. With the ubiquity of available network data, there is increasing industrial and academic need to quickly analyze graphs with billions of nodes and trillions of edges. A common first step for network understanding is Graph Embedding, the process of creating a continuous representation of nodes in a graph. A continuous representation is often more amenable, especially at scale, to solving downstream machine learning tasks such as classification, link prediction, and clustering. We present a high-performance graph embedding architecture leveraging Tensor Processing Units (TPUs) with configurable amounts of high-bandwidth memory that simplifies the graph embedding problem and can scale to graphs with billions of nodes and trillions of edges. We verify the embedding space quality on real and synthetic large-scale datasets.
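The TPU-based architecture is the subject of the paper; as a small-scale illustration of the underlying idea of graph embedding (learning a vector per node so that connected nodes end up close), the sketch below trains a toy embedding table with a dot-product objective and random negative sampling. All names and hyperparameters are illustrative and unrelated to the HUGE implementation.

```python
import numpy as np

# Toy node-embedding sketch: pull embeddings of connected nodes together and
# push embeddings of random (presumed non-connected) pairs apart. This only
# illustrates the general idea of graph embedding, not the HUGE/TPU pipeline.

def train_embeddings(num_nodes, edges, dim=16, lr=0.05, neg_per_pos=5, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.1, size=(num_nodes, dim))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        for u, v in edges:
            # Positive pair: increase the dot product emb[u] . emb[v].
            grad = sigmoid(emb[u] @ emb[v]) - 1.0
            gu, gv = grad * emb[v], grad * emb[u]
            emb[u] -= lr * gu
            emb[v] -= lr * gv
            # Negative pairs: decrease the dot product with random nodes.
            for w in rng.integers(0, num_nodes, size=neg_per_pos):
                grad = sigmoid(emb[u] @ emb[w])
                gu, gw = grad * emb[w], grad * emb[u]
                emb[u] -= lr * gu
                emb[w] -= lr * gw
    return emb

if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]  # two disjoint triangles
    emb = train_embeddings(6, edges)
    # Nodes in the same triangle should end up with higher similarity.
    print(emb[0] @ emb[1], emb[0] @ emb[3])
```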
With modern fluorescent probes and light microscopes, the possibilities for monitoring the dynamics of cells, organelles, molecular complexes and even single molecules with high spatiotemporal resolution are greater than ever [1]. Characterizing the motion of cellular and molecular entities can reveal a great deal of information about their functions, interactions and overall activity landscape [2]. Motion characterization is generally the end product of an image analysis pipeline that starts with identifying and localizing the objects in each image (segmentation/detection), tracking them over time, and then analyzing the resulting trajectories to characterize the objects’ motion. Whether objects are large (e.g. cells or organelles) or small (e.g. molecules or molecular complexes), as long as they can be represented by coordinates (e.g. cell centroid position or molecular position), following them over time in a series of images is effectively a (multiple) particle tracking problem. In their recent publication in Nature Machine Intelligence, Pineda et al. [3] describe a powerful deep learning approach based on graph neural networks (GNNs) for particle tracking or, if desired, for the characterization of object motion without explicit tracking.
Particle tracking is often the most challenging step in the analysis pipeline from images to motion characterization. Whenever the analysis involves a relatively high density of heterogeneously moving objects, there is ambiguity in determining which object has gone where throughout the image series [4]. In addition, objects may merge with each other – due to crossing paths or interactions – and may undergo splitting, such as during cell division. Despite these challenges, a high density of tracked objects is often desired, because of the rich information that it yields about the system studied [5]. The novel GNN-based approach by Pineda et al. [3], named MAGIK, offers solutions to the tracking problem in two ways: First, MAGIK can be employed to construct the trajectories of the imaged objects from the graph of their connections in space and time. Second, MAGIK can be employed to characterize the motion of the imaged objects directly from their graph of spatiotemporal connections, without explicit tracking.
Graphs are ubiquitously used in science to represent complex systems of interacting objects, from molecules to social and transportation networks [6]. GNNs provide a framework for incorporating existing information about the objects, with an inductive bias based on a larger structure relating them, to make predictions about these objects or the system as a whole. In MAGIK [3], spatiotemporal connections between imaged objects, encoded in the structure of the graph, provide this inductive bias, with the premise that objects close in space-time are likely to be the same. MAGIK utilizes this graph representation in a powerful way by employing GNNs [7] to perform various tracking and motion characterization tasks. The GNN model proposed in MAGIK considers both spatial and temporal information in a static graph. This model is enhanced by an adaptive and interpretable attention mechanism. Attention estimates the strength of association among the objects and provides insights into the dynamics of the system for the task.
GNNs enable MAGIK to provide a versatile platform for performing multiple tasks, from linking coordinates into trajectories to inferring local and global dynamic properties. MAGIK is tested for its flexibility and reliability in real and simulated scenarios corresponding to a variety of biological experiments. The results of the tests show that MAGIK is able to identify which spatiotemporal connections in a graph influence the dynamic properties of each object. They further show that MAGIK accurately constructs trajectories, obtaining outstanding results for cell tracking, including the identification of cell division events, using multiple microscopy techniques and cell types. As the final goal of tracking in most applications is to characterize the dynamics of the system, Pineda et al. [3] have tested MAGIK for quantifying motion parameters without explicit tracking, and they have shown that MAGIK can accurately and sensitively quantify local or global motion properties of the imaged objects. Technically, MAGIK performs these various tasks by tailoring its training to the task: tracking as a graph edge classification task, local motion characterization as a graph node regression task, and global motion characterization as a graph-level regression or classification task.
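To make the "tracking as graph edge classification" framing concrete, the sketch below builds the kind of spatiotemporal candidate graph such a model operates on: each detection is a node, and candidate edges connect detections that are close in both time and space; a learned model (omitted here) would then classify each candidate edge as a true or false link. The thresholds and data layout are illustrative and are not taken from MAGIK.

```python
import itertools
import numpy as np

# Sketch of a spatiotemporal candidate graph for "tracking as edge
# classification": each detection is a node; candidate edges connect detections
# that are close in both time and space. A GNN (not shown) would then score each
# candidate edge as a true/false link. Thresholds here are illustrative.

def build_candidate_graph(detections, max_frame_gap=2, max_dist=10.0):
    """detections: list of (frame, x, y) tuples, one per detected object."""
    edges = []
    for i, j in itertools.combinations(range(len(detections)), 2):
        fi, xi, yi = detections[i]
        fj, xj, yj = detections[j]
        frame_gap = abs(fi - fj)
        dist = np.hypot(xi - xj, yi - yj)
        if 0 < frame_gap <= max_frame_gap and dist <= max_dist:
            edges.append((i, j))
    return edges

if __name__ == "__main__":
    dets = [(0, 0.0, 0.0), (1, 1.5, 0.5), (1, 20.0, 20.0), (2, 2.5, 1.0)]
    # Prints the detection pairs that plausibly belong to the same track.
    print(build_candidate_graph(dets))
```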
As demonstrated by MAGIK, GNNs offer powerful tools for the analysis of the spatiotemporal connections between objects in biological images. New developments in the fields of graphs and GNNs will further advance this goal. One possibility is to replace the fixed, fully connected graph in MAGIK with a learnable sparse graph [8]. Another possibility is to use hypergraphs, which go beyond binary connections (a fundamental limitation of graphs). This would be a promising approach to characterize the spatiotemporal connections of systems with complex interactions [9]. Furthermore, as the problem studied here is temporal in nature, it may benefit from temporal GNNs [10], which directly incorporate time into the GNN formulation. All in all, the powerful combination of cutting-edge microscopes, fluorescent probes and geometric deep learning analytical tools will aid the study of the organization, dynamics and interactions of diverse systems, from molecules in a cell, to cells in a tissue, and beyond.
Stars: Tera-Scale Graph Building for Clustering and Learning
Warren Schudy
NeurIPS 2022 (2022)
A fundamental procedure in the analysis of massive datasets is the construction of similarity graphs. Such graphs play a key role for many downstream tasks, including clustering, classification, graph learning, and nearest neighbor search. For these tasks, it is critical to build graphs which are sparse yet still representative of the underlying data. The benefits of sparsity are twofold: firstly, constructing dense graphs is infeasible in practice for large datasets, and secondly, the runtime of downstream tasks is directly controlled by the sparsity of the similarity graph. In this work, we present Stars: a highly scalable method for building extremely sparse graphs via two-hop spanners, which are graphs where similar points are connected by a path of length at most two. Stars can construct two-hop spanners with significantly fewer similarity comparisons, which are a major bottleneck for learning-based models where comparisons are expensive to evaluate. Theoretically, we demonstrate that Stars builds a graph in nearly-linear time, where approximate nearest neighbors are contained within two-hop neighborhoods. To complement our results, we have deployed Stars on multiple datasets, allowing for graph building at tera-scale, i.e., for graphs with hundreds of billions of nodes and tens of trillions of edges. We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10- to 1000-fold improvements in pairwise similarity comparisons and significant improvements in runtime with negligible quality loss.
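Stars' actual construction is given in the paper; the toy sketch below only illustrates the two-hop spanner idea it builds on: connect every point to a nearby "center", so that similar points sharing a center are joined by a path of length two while the number of explicit comparisons stays far below all-pairs. Names and parameters are illustrative, not the algorithm from the paper.

```python
import numpy as np

# Simplified two-hop spanner sketch: connect every point to its nearest "center"
# drawn from a random sample, so points sharing a center are linked by a path of
# length two. The real Stars algorithm is far more careful about choosing
# centers and edges; this only illustrates the two-hop idea.

def two_hop_spanner(points, num_centers, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.choice(len(points), size=num_centers, replace=False)
    center_set = {int(c) for c in centers}
    edges = []
    for i, p in enumerate(points):
        if i in center_set:
            continue
        # O(n * num_centers) distance computations instead of O(n^2) all-pairs.
        dists = np.linalg.norm(points[centers] - p, axis=1)
        edges.append((i, int(centers[np.argmin(dists)])))
    return edges

if __name__ == "__main__":
    pts = np.random.default_rng(1).normal(size=(100, 8))
    graph = two_hop_spanner(pts, num_centers=10)
    print(len(graph), "edges for", len(pts), "points")  # sparse: about one edge per point
```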
Grale: Designing Networks for Graph Learning
Alexandru Moșoi
Sam Ruth
Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery (2020), 2523–2532
How can we find the right graph for semi-supervised learning? In real world applications, the choice of which edges to use for computation is the first step in any graph learning process. Interestingly, there are often many types of similarity available to choose as the edges between nodes, and the choice of edges can drastically affect the performance of downstream semi-supervised learning systems. However, despite the importance of graph design, most of the literature assumes that the graph is static.
In this work, we present Grale, a scalable method we have developed to address the problem of graph design for graphs with billions of nodes. Grale operates by fusing together different measures of (potentially weak) similarity to create a graph which exhibits high task-specific homophily between its nodes. Grale is designed for running on large datasets. We have deployed Grale in more than 20 different industrial settings at Google, including datasets which have tens of billions of nodes, and hundreds of trillions of potential edges to score. By employing locality sensitive hashing techniques, we greatly reduce the number of pairs that need to be scored, allowing us to learn a task-specific model and build the associated nearest neighbor graph for such datasets in hours, rather than the days or even weeks that might be required otherwise.
We illustrate this through a case study where we examine the application of Grale to an abuse classification problem on YouTube with hundreds of millions of items. In this application, we find that Grale detects a large number of malicious actors on top of hard-coded rules and content classifiers, increasing the total recall by 89% over those approaches alone.
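As a rough illustration of the locality sensitive hashing step mentioned in the abstract, the sketch below buckets items by random-hyperplane signatures and only scores pairs that share a bucket; plain cosine similarity stands in for Grale's learned, task-specific similarity model. All names and parameters here are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import numpy as np

# LSH-style candidate generation sketch: items are bucketed by random-hyperplane
# signatures, and only pairs sharing a bucket are passed to a pairwise scorer.
# Cosine similarity below is a stand-in for a learned, task-specific model.

def candidate_pairs(features, num_tables=4, bits=8, seed=0):
    rng = np.random.default_rng(seed)
    pairs = set()
    for _ in range(num_tables):
        planes = rng.normal(size=(bits, features.shape[1]))
        signs = (features @ planes.T) > 0              # (n, bits) boolean signatures
        buckets = defaultdict(list)
        for i, sig in enumerate(signs):
            buckets[sig.tobytes()].append(i)
        for bucket in buckets.values():
            pairs.update(combinations(bucket, 2))      # only score within-bucket pairs
    return pairs

def score(features, i, j):
    a, b = features[i], features[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

if __name__ == "__main__":
    feats = np.random.default_rng(2).normal(size=(1000, 32))
    pairs = candidate_pairs(feats)
    edges = [(i, j, score(feats, i, j)) for i, j in pairs]
    print(f"scored {len(edges)} pairs instead of {1000 * 999 // 2}")
```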