Aaron Archer

Aaron Archer

Aaron is a Research Scientist at Google NYC, where he leads the Large-Scale Optimization research team, which is part of the broader NYC Algorithms and Optimization team. He came to Google in 2013 after 9+ years in the Algorithms and Optimization department at the AT&T Shannon Research Laboratory. Aaron earned degrees in Operations Research (M.S. 2002, Ph.D. 2004) from Cornell University, where he was advised by Eva Tardos and supported by a Hertz Fellowship. Prior to that, he studied Mathematics (B.S. 1998) at Harvey Mudd College, where he graduated first in his class. Aaron has pursued theoretical, experimental and applied research in diverse areas including mathematical programming (e.g., linear, integer, and convex programming), approximation algorithms, graph algorithms, algorithmic game theory, online algorithms, network design, load balancing, and unsupervised machine learning (e.g., clustering). He particularly enjoys finding effective ways to model complex real-world optimization problems. His primary mission at Google has been to apply these and other techniques to improve the efficiency of Google's computational infrastructure, such as the backend for websearch. Aaron has also made small forays into graphics, machine vision, combinatorics, graph theory, computational ecology, and mathematical rap.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Load is not what you should balance: Introducing Prequal
    Bartek Wydrowski
    Bobby Kleinberg
    Steve Rumble
    (2024)
    Preview abstract We present Prequal (\emph{Probing to Reduce Queuing and Latency}), a load balancer for distributed multi-tenant systems. Prequal aims to minimize real-time request latency in the presence of heterogeneous server capacities and non-uniform, time-varying antagonist load. It actively probes server load to leverage the \emph{power of $d$ choices} paradigm, extending it with asynchronous and reusable probes. Cutting against received wisdom, Prequal does not balance CPU load, but instead selects servers according to estimated latency and active requests-in-flight (RIF). We explore its major design features on a testbed system and evaluate it on YouTube, where it has been deployed for more than two years. Prequal has dramatically decreased tail latency, error rates, and resource use, enabling YouTube and other production systems at Google to run at much higher utilization. View details
    Practical Performance Guarantees for Pipelined DNN Inference
    Kuikui Liu
    Proceedings of the 41st International Conference on Machine Learning (2024), pp. 1655-1671
    Preview abstract This work optimizes pipeline parallelism of machine learning model inference by partitioning computation graphs into $k$ stages and minimizing the running time of the bottleneck stage. We design practical algorithms for this NP-complete problem and prove they are nearly optimal in practice by comparing against lower bounds obtained from solving novel mixed-integer programming (MIP) formulations. We apply these algorithms and lower-bound techniques to production models to achieve substantial improvements in the approximation guarantees, compared to simple combinatorial lower bounds. For example, our new MIP formulations improve the lower bounds enough to drop the geometric mean approximation ratio from $2.175$ to $1.082$ across production data with $k=16$ pipeline stages. This work shows that while bottleneck partitioning is theoretically hard, in practice we have a handle on the algorithmic side of the problem and much of the remaining challenge is in developing more accurate cost models to give to the partitioning algorithms. View details
    Cache-aware load balancing of data center applications
    Aaron Schild
    Ray Yang
    Richard Zhuang
    Proceedings of the VLDB Endowment, 12 (2019), pp. 709-723
    Preview abstract Our deployment of cache-aware load balancing in the Google web search backend reduced cache misses by ~0.5x, contributing to a double-digit percentage increase in the throughput of our serving clusters by relieving a bottleneck. This innovation has benefited all production workloads since 2015, serving billions of queries daily. A load balancer forwards each query to one of several identical serving replicas. The replica pulls each term's postings list into RAM from flash, either locally or over the network. Flash bandwidth is a critical bottleneck, motivating an application-directed RAM cache on each replica. Sending the same term reliably to the same replica would increase the chance it hits cache, and avoid polluting the other replicas' caches. However, most queries contain multiple terms and we have to send the whole query to one replica, so it is not possible to achieve a perfect partitioning of terms to replicas. We solve this via a voting scheme, whereby the load balancer conducts a weighted vote by the terms in each query, and sends the query to the winning replica. We develop a multi-stage scalable algorithm to learn these weights. We first construct a large-scale term-query graph from logs and apply a distributed balanced graph partitioning algorithm to cluster each term to a preferred replica. This yields a good but simplistic initial voting table, which we then iteratively refine via cache simulation to capture feedback effects. View details
    Wireless coverage prediction via parametric shortest paths
    David S. Johnson
    Evdokia Nikolova
    Mikkel Thorup
    Proceedings of the Nineteenth ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc'18), ACM (2018), pp. 221-230
    Preview abstract When deciding where to place access points in a wireless network, it is useful to model the signal propagation loss between a proposed antenna location and the areas it may cover. The indoor dominant path (IDP) model, introduced by Wölfle et al., is shown in the literature to have good validation and generalization error, is faster to compute than competing methods, and is used in commercial software such as WinProp, iBwave Design, and CellTrace. Previously, the algorithms known for computing it involved a worst-case exponential-time tree search, with pruning heuristics to speed it up. We prove that the IDP model can be reduced to a parametric shortest path computation on a graph derived from the walls in the floorplan. It therefore admits a quasipolynomial-time (i.e., nO(logn)) algorithm. We also give a practical approximation algorithm based on running a small constant number of shortest path computations. Its provable worst-case additive error (in dB) can be made arbitrarily small via appropriate choices of parameters, and is well below 1dB for reasonable choices. We evaluate our approximation algorithm empirically against the exact IDP model, and show that it consistently beats its theoretical worst-case bounds, solving the model exactly (i.e., no error) in the vast majority of cases. View details
    Preview abstract We consider the reachability indexing problem for private-public directed graphs. In these graphs nodes come in three flavors: public—nodes visible to all users, private—nodes visible to a specific set of users, and protected—nodes visible to any user who can see at least one of the node’s parents. We are interested in computing the set of nodes visible to a specific user online. There are two obvious algorithms: precompute the result for every user, or run a reachability algorithm at query time. This paper explores the trade-off between these two strategies. Our approach is to identify a set of additional visible seed nodes for each user. The online reachability algorithm explores the graph starting at these nodes. We first formulate the problem as asymmetric k-center with outliers, and then give an efficient and practical algorithm. We prove new theoretical guarantees for this problem and show empirically that it performs very well in practice. View details
    Optimal Content Placement for a Large-Scale VoD System
    Vijay Gopalakrishnan
    K.K. Ramakrishnan
    IEEE/ACM Transactions on Networking, 24 (2016), pp. 2114-2127
    Preview abstract IPTV service providers offering Video-on-Demand currently use servers at each metropolitan office to store all the videos in their library. With the rapid increase in library sizes, it will soon become infeasible to replicate the entire library at each office. We present an approach for intelligent content placement that scales to large library sizes (e.g., 100Ks of videos). We formulate the problem as a mixed integer program (MIP) that takes into account constraints such as disk space, link bandwidth, and content popularity. To overcome the challenges of scale, we employ a Lagrangian relaxation-based decomposition technique combined with integer rounding. Our technique finds a near-optimal solution (e.g., within 1-2%) with orders of magnitude speedup relative to solving even the LP relaxation via standard software. We also present simple strategies to address practical issues such as popularity estimation, content updates, short-term popularity fluctuation, and frequency of placement updates. Using traces from an operational system, we show that our approach significantly outperforms simpler placement strategies. For instance, our MIP-based solution can serve all requests using only half the link bandwidth used by LRU or LFU cache replacement policies. We also investigate the trade-off between disk space and network bandwidth. View details