David Eisenstat

David is a software engineer in the Large-Scale Optimization team. He earned a PhD in Computer Science from Brown University in 2014.
Authored Publications
    Obtaining scalable algorithms for hierarchical agglomerative clustering (HAC) is of significant interest due to the massive size of real-world datasets. At the same time, efficiently parallelizing HAC is difficult due to the seemingly sequential nature of the algorithm. In this paper, we address this issue and present ParHAC, the first efficient parallel HAC algorithm with sublinear depth for the widely-used average-linkage function. In particular, we provide a (1+ϵ)-approximation algorithm for this problem on m-edge graphs using O(m polylog m) work and poly-logarithmic depth. Moreover, we show that obtaining similar bounds for exact average-linkage HAC is not possible under standard complexity-theoretic assumptions. We complement our theoretical results with a comprehensive study of the ParHAC algorithm in terms of its scalability, performance, and quality, and compare it with several state-of-the-art sequential and parallel baselines. On a broad set of large publicly available real-world datasets, we find that ParHAC obtains a 50.1x speedup on average over the best sequential baseline, while achieving quality similar to the exact HAC algorithm. We also show that ParHAC can cluster one of the largest publicly available graph datasets, with 124 billion edges, in a little over three hours using a commodity multicore machine.
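    For orientation, here is a naive sequential sketch of graph-based average-linkage HAC, the computation that ParHAC parallelizes and approximates. It is purely illustrative (roughly cubic time) and not the paper's algorithm; the function name and representation are ours.

        # Naive sequential average-linkage HAC on an edge-weighted graph.
        # Cluster similarity is total inter-cluster edge weight divided by
        # the product of cluster sizes; each step merges the most similar
        # pair. Illustrative sketch only, not ParHAC.
        def average_linkage_hac(n, weighted_edges):
            """n: number of vertices; weighted_edges: {(u, v): w} with u < v."""
            size = {v: 1 for v in range(n)}   # current cluster sizes
            weight = dict(weighted_edges)     # total weight between clusters
            dendrogram = []
            while weight:
                # Pick the pair maximizing w(A, B) / (|A| * |B|).
                (a, b), w = max(
                    weight.items(),
                    key=lambda kv: kv[1] / (size[kv[0][0]] * size[kv[0][1]]))
                dendrogram.append((a, b, w / (size[a] * size[b])))
                # Merge cluster b into cluster a.
                size[a] += size.pop(b)
                del weight[(a, b)]
                for (u, v), wv in list(weight.items()):
                    if b in (u, v):
                        other = u if v == b else v
                        key = tuple(sorted((a, other)))
                        weight[key] = weight.get(key, 0.0) + wv
                        del weight[(u, v)]
            return dendrogram

    On a weighted triangle, average_linkage_hac(3, {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 1.0}) first merges 0 and 1, then merges in vertex 2 at the averaged similarity.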
    Design and analysis of bipartite experiments under a linear exposure-response model
    Christopher Harshaw
    Fredrik Sävje
    Proceedings of the 23rd ACM Conference on Economics and Computation (2022), p. 606
    A bipartite experiment consists of one set of units being assigned treatments and another set of units for which we measure outcomes. The two sets of units are connected by a bipartite graph, governing how the treated units can affect the outcome units. In this paper, we consider estimation of the average total treatment effect in the bipartite experimental framework under a linear exposure-response model. We introduce the Exposure Reweighted Linear (ERL) estimator, and show that the estimator is unbiased, consistent and asymptotically normal, provided that the bipartite graph is sufficiently sparse. To facilitate inference, we introduce an unbiased and consistent estimator of the variance of the ERL point estimator. In addition, we introduce a cluster-based design, Exposure-Design, that uses heuristics to increase the precision of the ERL estimator by realizing a desirable exposure distribution. Finally, we demonstrate the application of the described methodology to marketplace experiments using a publicly available Amazon user-item review dataset.
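    A minimal numerical sketch of the linear exposure-response setup described above, assuming NumPy. The weighted-fraction exposure and the ordinary-least-squares slope below are illustrative stand-ins, not the paper's ERL estimator or its variance estimator.

        # Illustration of the bipartite exposure setup (not the ERL
        # estimator itself). Each outcome unit's exposure is the weighted
        # fraction of its treated neighbors in the bipartite graph.
        import numpy as np

        def exposures(W, z):
            """W: (n_outcome, n_treated) nonnegative weights, each outcome
            unit assumed to have at least one neighbor; z: 0/1 treatments."""
            row_sums = W.sum(axis=1)
            return (W @ z) / row_sums

        def ols_slope(e, y):
            """Slope of outcomes y on exposures e; under a linear
            exposure-response model the slope is a crude stand-in for the
            average total treatment effect."""
            e_c, y_c = e - e.mean(), y - y.mean()
            return float(e_c @ y_c / (e_c @ e_c))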
    Graph clustering and community detection are central problems in modern data mining. The increasing need for analyzing billion-scale data calls for faster and more scalable algorithms for these problems. There are certain trade-offs between the quality and speed of such clustering algorithms. In this paper, we design scalable algorithms that achieve high quality when evaluated based on ground truth. We develop a generalized sequential and shared-memory parallel framework based on the LambdaCC objective (introduced by Veldt et al.), which encompasses modularity and correlation clustering. Our framework consists of highly optimized implementations that scale to large datasets of billions of edges and that obtain high-quality clusters compared to ground-truth data, on both unweighted and weighted graphs. Our empirical evaluation shows that this framework improves the state-of-the-art trade-offs between speed and quality of scalable community detection. For example, on a 30-core machine with two-way hyper-threading, our implementations achieve orders-of-magnitude speedups over other correlation clustering baselines, and up to 28.44x speedups over our own sequential baselines, while maintaining or improving quality.
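    As a reference point for the objective, here is a brute-force sketch of the standard unweighted LambdaCC cost as we understand Veldt et al.'s formulation: cut edges cost 1 − λ and intra-cluster non-edges cost λ, with λ interpolating between modularity-like and correlation-clustering behavior. The function name and brute-force evaluation are ours, for illustration only.

        # Brute-force evaluation of the (unweighted) LambdaCC objective for
        # a given clustering. Minimizing this cost over clusterings is the
        # problem the paper's framework optimizes at scale.
        from itertools import combinations

        def lambda_cc_cost(n, edges, labels, lam):
            """n: vertices 0..n-1; edges: iterable of (u, v) pairs;
            labels[v]: cluster id of v; lam: the LambdaCC parameter."""
            edge_set = {tuple(sorted(e)) for e in edges}
            cost = 0.0
            for u, v in combinations(range(n), 2):
                same = labels[u] == labels[v]
                if (u, v) in edge_set:
                    cost += (1 - lam) * (not same)  # edge cut between clusters
                else:
                    cost += lam * same              # non-edge kept in a cluster
            return cost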
    We study the widely used hierarchical agglomerative clustering (HAC) algorithm on edge-weighted graphs. We define an algorithmic framework for hierarchical agglomerative graph clustering that provides the first efficient Õ(m)-time exact algorithms for classic linkage measures, such as complete- and WPGMA-linkage, as well as other measures. Furthermore, for average-linkage, arguably the most popular variant of HAC, we provide an algorithm that runs in Õ(n√m) time. For this variant, this is the first exact algorithm that runs in subquadratic time, as long as m = n^(2−ϵ) for some constant ϵ > 0. We complement this result with a simple ϵ-close approximation algorithm for average-linkage in our framework that runs in Õ(m) time. As an application of our algorithms, we consider clustering points in a metric space by first using k-NN to generate a graph from the point set, and then running our algorithms on the resulting weighted graph. We validate the performance of our algorithms on publicly available datasets, and show that our approach can speed up clustering of point datasets by a factor of 20.7–76.5x.
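    The point-clustering application lends itself to a short sketch: build a k-NN graph over the points, convert distances to similarity weights (the exponential kernel below is an assumption, not from the paper), and hand the weighted graph to a graph HAC routine such as the average_linkage_hac sketch above. Assumes scikit-learn.

        # Sketch of the k-NN-then-graph-HAC pipeline the abstract describes.
        import numpy as np
        from sklearn.neighbors import kneighbors_graph

        def knn_similarity_edges(X, k=10):
            """Symmetrized k-NN graph over points X, with distances mapped
            to similarities by exp(-d) (an illustrative choice)."""
            G = kneighbors_graph(X, n_neighbors=k, mode='distance')
            edges = {}
            for i, j in zip(*G.nonzero()):
                u, v = (i, j) if i < j else (j, i)
                edges[(u, v)] = np.exp(-G[i, j])
            return edges

        # dendrogram = average_linkage_hac(X.shape[0], knn_similarity_edges(X))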