John Palowitch

John Palowitch

John is a Research Scientist working on graph learning, document understanding, and social media modeling.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract The emergence of synthetic data represents a pivotal shift in modern machine learning, offering a solution to satisfy the need for large volumes of data in domains where real data is scarce, highly private, or difficult to obtain. We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content, noting that such content is increasingly prevalent and a source of frequently sought information. Large language models (LLMs) offer a starting point for generating synthetic social media discussion threads, due to their ability to produce diverse responses that typify online interactions. However, as we demonstrate, straightforward application of LLMs yields limited success in capturing the complex structure of online discussions, and standard prompting mechanisms lack sufficient control. We therefore propose a multi-step generation process, predicated on the idea of creating compact representations of discussion threads, referred to as scaffolds. Our framework is generic yet adaptable to the unique characteristics of specific social media platforms. We demonstrate its feasibility using data from two distinct online discussion platforms. To address the fundamental challenge of ensuring the representativeness and realism of synthetic data, we propose a portfolio of evaluation measures to compare various instantiations of our framework. View details
    Preview abstract A vast amount of human discussion, storytelling, content creation, and reporting now occurs on social media platforms. As such, social media posts are often quoted on web pages as context. In this paper, we argue that these quotations and their surrounding page context provide a rich, platform-independent source of data for studying the intersection of natural language and social media. We introduce a taxonomy of quotation roles that categorizes how social media posts are used within content. We release a dataset of 38M social quotes derived from the Common Crawl, and role labels for a subset assessed by human raters. We show that the interplay of accounts, roles, and topics across the web graph reveal valuable social diffusion patterns, and that roles can be predicted with fine-tuned large language models from web context. View details
    Preview abstract Graph Neural Networks (GNNs) have achieved state-of-the-art results on many graph analysis tasks such as node classification and link prediction. However, important unsupervised problems on graphs, such as graph clustering, have proved more resistant to advances in GNNs. Graph clustering has the same overall goal as node pooling in GNNs - does this mean that GNN pooling methods do a good job at clustering graphs? Surprisingly, the answer is no - current GNN pooling methods often fail to recover the cluster structure in cases where simple baselines, such as k-means applied on learned representations, work well. We investigate further by carefully designing a set of experiments to study different signal-to-noise scenarios both in graph structure and attribute data. To address these methods' poor performance in clustering, we introduce Deep Modularity Networks (DMoN), an unsupervised pooling method inspired by the modularity measure of clustering quality, and show how it tackles recovery of the challenging clustering structure of real-world graphs. Similarly, on real-world data, we show that DMoN produces high quality clusters which correlate strongly with ground truth labels, achieving state-of-the-art results with over 40% improvement over other pooling methods across different metrics. View details
    GraphWorld: Fake Graphs Bring Real Insights for GNNs
    Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022)
    Preview abstract The continuing maturity of the deep learning subfield of graph neural networks (GNNs) has motivated recent studies into the standard datasets used to benchmark GNNs. While important improvements have been made to GNN datasets and experimental design, any one dataset provides only a singular, potentially spurious insight into the performance of any GNN being tested. We show that state-of-the-art GNN task datasets do not cover the distribution of graphs in a much larger real-data graph repository, with respect to several key graph metrics. Motivated by this finding, we introduce GraphWorld, a novel distributed framework and software package for testing GNN models on an arbitrarily-large population of \emph{synthetic} task datasets. GraphWorld allows a user to efficiently generate a \emph{world} of millions of graph datasets, with fine-grained control over graph generator parameters, and benchmark arbitrary GNN models, with built-in hyperparameter tuning. Using GraphWorld to generate diverse graph worlds corresponding to node classification, graph classification, and link prediction tasks, we provide insight into the sensitivity of 10,000+ GNN models to various parameters of graphs and node features and} show comparisons between models that have not been possible to make in any previous work. We also introduce a novel metric with which to explore each models' performance on the graph world, conditioning on graph metrics and graph generator parameters. View details
    Zero-shot Transfer Learning within a Heterogeneous Graph via Knowledge Transfer Networks
    Minji Yoon
    Ziniu Hu
    Ruslan Salakhutdinov
    Advances in Neural Information Processing Systems (2022)
    Preview abstract Data continuously emitted from industrial ecosystems such as social or e-commerce platforms are commonly represented as heterogeneous graphs (HG) composed of multiple node/edge types. State-of-the-art graph learning methods for HGs known as heterogeneous graph neural networks (HGNNs) are applied to learn deep context-informed node representations. However, many HG datasets from industrial applications suffer from label imbalance between node types. As there is no direct way to learn using labels rooted at different node types, HGNNs have been applied to only a few node types with abundant labels. We propose a zero-shot transfer learning module for HGNNs called a Knowledge Transfer Network (KTN) that transfers knowledge from label-abundant node types to zero-labeled node types through rich relational information given in the HG. KTN is derived from the theoretical relationship, which we introduce in this work, between distinct feature extractors for each node type given in an HGNN model. KTN improves the performance of 6 different types of HGNN models by up to 960% for inference on zero-labeled node types and outperforms state-of-the-art transfer learning baselines by up to 73% across 18 different transfer learning tasks on HGs. View details
    Preview abstract Graph Neural Networks (GNNs) have achieved state-of-the-art results on many graph analysis tasks such as node classification and link prediction. However, important unsupervised problems on graphs, such as graph clustering, have proved more resistant to advances in GNNs. In this paper, we study unsupervised training of GNN pooling in terms of their clustering capabilities. We start by drawing a connection between graph clustering and graph pooling: intuitively, a good graph clustering is what one would expect from a GNN pooling layer. Counterintuitively, we show that this is not true for state-of-the-art pooling methods, such as MinCut pooling. To address these deficiencies, we introduce Deep Modularity Networks (DMoN), an unsupervised pooling method inspired by the modularity measure of clustering quality, and show how it tackles recovery of the challenging clustering structure of real-world graphs. In order to clarify the regimes where existing methods fail, we carefully design a set of experiments on synthetic data which show that DMoN is able to jointly leverage the signal from the graph structure and node attributes. Similarly, on real-world data, we show that DMoN produces high quality clusters which correlate strongly with ground truth labels, achieving state-of-the-art results. View details
    Debiasing Graph Embeddings with Metadata-Orthogonal Training
    2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (2020)
    Preview abstract In many real world graphs, the formation of edges can be influenced by certain sensitive features of the nodes (e.g. their gender, community, or reputation). In this paper we argue that when such influences exist, any downstream Graph Neural Network (GNN) will be implicitly biased by these structural correlations. To allow control over this phenomenon, we introduce the Metadata-Orthogonal Node Embedding Training (MONET) unit, a general neural network architecture component for performing training-time linear debiasing of graph embeddings. MONET operates by ensuring that the node embeddings are trained on a hyperplane orthogonal to that of the node features (metadata). Unlike debiasing approaches in similar domains, our method offers exact guarantees about the correlation between the resulting embeddings and any sensitive metadata. We illustrate the effectiveness of MONET though our experiments on a variety of real world graphs against challenging baselines (e.g. adversarial debiasing), showing superior performance in tasks such as preventing the leakage of political party affiliation in a blog network, and preventing the gaming of embedding-based recommendation systems. View details
    Preview abstract In scientific problems involving systems that can be modeled as a network (or “graph”), it is often of interest to find network communities - strongly connected node subsets - for unsupervised learning, feature discovery, anomaly detection, or scientific study. The vast majority of community detection methods proceed via optimization of a quality function, which is possible even on random networks without communities. Therefore there is usually not an easy way to tell if a community is “significant”, in this context meaning more internally connected than would be expected under a random graph model without communities. This paper generalizes existing null models and statistical tests for this purpose to bipartite graphs, and introduces a new significance scoring algorithm called Fast Optimized Community Significance (FOCS) that is highly scalable and agnostic to the type of graph. Compared with existing methods on unipartite graphs, FOCS is more numerically stable and better balances the trade-off between detection power and false positives. On a large-scale bipartite graph derived from the Internet Movie Database (IMDB), the significance scores provided by FOCS correlate strongly with meaningful actor/director collaborations on serial cinematic projects. View details