Sergei Vassilvitskii

Authored Publications
    Abstract: The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible. Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental $\ell_p$ $(p\in [0,+\infty))$ frequency moment estimation problem under this setting, and give an $\varepsilon$-DP algorithm that achieves a $(1+\eta)$-relative approximation $(\forall \eta\in(0,1))$ with $\mathrm{poly}\log(Tn)$ additive error and uses $\mathrm{poly}\log(Tn)\cdot \max(1, n^{1-2/p})$ space, where $T$ is the length of the stream and $n$ is the size of the universe of elements. Our space is near optimal up to poly-logarithmic factors even in the non-private setting. To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters, and counting low-frequency elements, to simpler counting/summing problems in the same setting. Based on these primitives, we develop a differentially private continual release level set estimation approach to address the $\ell_p$ frequency moment estimation problem. We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past $W$ data items.
    Abstract: Compact user representations (such as embeddings) form the backbone of personalization services. In this work, we present a new theoretical framework to measure re-identification risk in such user representations. Our framework, based on hypothesis testing, formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation. As an application, we show that our framework is general enough to model important real-world applications such as Chrome's Topics API for interest-based advertising. We complement our theoretical bounds by showing provably good attack algorithms for re-identification that we use to estimate the re-identification risk in the Topics API. We believe this work provides a rigorous and interpretable notion of re-identification risk and a framework to measure it that can be used to inform real-world applications.
    Abstract: We study the private $k$-median and $k$-means clustering problem in $d$-dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time, and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular, we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.
    Abstract: Transformer-based language models are being pre-trained on ever-growing datasets (hundreds of gigabytes) using ever-growing numbers of parameters (millions to billions). These large amounts of training data are typically scraped from the public web and may contain (public) personally identifiable information such as names and phone numbers. Moreover, recent findings show that the capacity of these models allows them to memorize parts of the training data. One defense against such memorization that has not yet been fully explored in this context is differential privacy (DP). We focus on T5, a popular encoder-decoder model, and show that by using recent advances in JAX and XLA we can train models with DP that suffer neither a large drop in utility nor in training speed, and can still be fine-tuned to high accuracy on downstream tasks such as GLUE. Moreover, we show that T5's span corruption pre-training task, unlike next-token prediction, is a good defense against data memorization.
    Secretaries with Advice
    Paul Duetting
    Proceedings of the 22nd ACM Conference on Economics and Computation (EC'21) (2021), pp. 409-429
    Abstract: The secretary problem is probably the purest model of decision making under uncertainty. In this paper we ask: what advice can we give the algorithm to improve its success probability? We propose a general model that unifies a broad range of problems: from the classic secretary problem with no advice, to the variant where the quality of a secretary is drawn from a known distribution and the algorithm learns each candidate's quality quantile on arrival, to more modern ML-based versions of advice where a binary classifier gives us noisy advice about whether or not the current secretary is the best on the market. Our main technique is a factor-revealing LP that captures all of the problems above. We use this LP formulation to gain structural insight into the optimal policy and present two case studies: a re-derivation of the classic known-distribution result with tools from linear programming, and a tight analysis of the noisy binary advice model.
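    For reference, the no-advice baseline that all of these advice models improve upon can be simulated in a few lines. This is a generic illustration of the standard 1/e stopping rule, not code from the paper; the function name and parameters are ours.

```python
import random

def classic_secretary(values):
    """Classic 1/e rule: observe the first n/e candidates, then accept
    the first candidate better than everything seen so far."""
    n = len(values)
    cutoff = int(n / 2.718281828459045)
    best_seen = max(values[:cutoff], default=float("-inf"))
    for v in values[cutoff:]:
        if v > best_seen:
            return v
    return values[-1]  # forced to accept the last candidate

# The empirical success rate should hover near 1/e ~ 0.368 for large n.
n, trials = 100, 20000
wins = sum(classic_secretary(random.sample(range(n), n)) == n - 1
           for _ in range(trials))
print(wins / trials)
```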
    Online Scheduling via Learned Weights
    Benjamin Moseley
    Thomas Lavastida
    SODA 2020 (to appear)
    Abstract: The use of machine learning and predictive models has produced a revolution in science and engineering. Online optimization problems are a natural setting in which predictions can be used to manage uncertainty and improve performance. This paper studies how predictive techniques can be used to break through worst-case barriers in online scheduling. The makespan minimization problem on unrelated machines, and its special case, restricted assignment, are classic problems in online scheduling theory. Worst-case analysis of these problems yields Ω(log m) lower bounds on the competitive ratio in the online setting. In this paper we construct non-trivial predictions for these problems and design algorithms that utilize these predictions to compute solutions online. Our predictions are compact in size, having dimension linear in the number of machines. The performance guarantees of our algorithms depend on the accuracy of the predictions, and moderately accurate predictions allow our techniques to beat the worst-case lower bounds. More precisely, the predictions can be used to construct O(log η)-competitive fractional assignments, where η is the error of the predictions. We then give an online algorithm that is O(poly(log log(m)))-competitive to round these fractional assignments.
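    To make the role of the predictions concrete, here is a sketch of how one learned weight per machine could induce a fractional assignment in the restricted-assignment setting. This is our illustration of the idea under simplifying assumptions, not the paper's exact algorithm, which also handles rounding and prediction error.

```python
def fractional_assignment(feasible_sets, weights):
    """Split each arriving job among its feasible machines in proportion
    to the predicted per-machine weights (one weight per machine)."""
    assignment = []
    for machines in feasible_sets:           # feasible machines for job j
        total = sum(weights[i] for i in machines)
        assignment.append({i: weights[i] / total for i in machines})
    return assignment

# Example: 3 machines with predicted weights, 2 jobs with feasibility sets.
print(fractional_assignment([[0, 1], [1, 2]], [2.0, 1.0, 1.0]))
```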
    Abstract: The sliding window model of computation captures scenarios in which data is arriving continuously, but only the latest $w$ elements should be used for analysis. The goal is to design algorithms that update the solution efficiently with each arrival rather than recomputing it from scratch. In this work, we focus on $k$-clustering problems such as $k$-means and $k$-median. In this setting, we give simple and practical algorithms that come with stronger performance guarantees than previously known results. Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with cost only slightly worse than those returned by algorithms that have access to the full dataset.
    Abstract: As machine learning has become more and more integrated into our businesses and lifestyles, researchers have begun to recognize the necessity of ensuring machine learning systems are fair. Recently, there has been an interest in defining a notion of fairness that mitigates over-representation in traditional clustering. In this paper we extend this notion to hierarchical clustering, where the goal is to recursively partition the data to optimize a certain objective (Dasgupta). For various natural objectives, we obtain simple, efficient algorithms to find a provably good fair hierarchical clustering. Empirically, we show that our algorithms can find a fair hierarchical clustering, surprisingly, with only a negligible loss in the objective.
    Abstract: Differentially private learning algorithms protect individual participants in the training dataset by guaranteeing that their presence does not significantly change the resulting model. In order to make this promise, such algorithms need to know the maximum contribution that can be made by a single user: the more data an individual can contribute, the more noise will need to be added to protect them. While most existing analyses assume that the maximum contribution is known and fixed in advance (indeed, it is often assumed that each user contributes only a single example), we argue that in practice there is a meaningful choice to be made. On the one hand, if we allow users to contribute large amounts of data, we may end up adding excessive noise to protect a few outliers, even when the majority contribute only modestly. On the other hand, limiting users to small contributions keeps noise levels low at the cost of potentially discarding significant amounts of excess data, thus introducing bias. Here, we characterize this trade-off for an empirical risk minimization setting, showing that in general there is a "sweet spot" that depends on measurable properties of the dataset, but that there is also a concrete cost to privacy that cannot be avoided simply by collecting more data.
    Abstract: The desire to use machine learning to assist in human decision making has spawned a large area of research in understanding the impact of such systems, not only on society as a whole but also on specific subpopulations. Recent work has shown that while there are several natural ways to quantify the fairness of a particular system, none of them is universal, and except for trivial cases, satisfying one means violating another (Kleinberg; Goel; Kleinberg2). In parallel to understanding the interplay between different definitions of fairness, researchers have looked to develop algorithms for finding fair solutions to specific classes of problems; for example, there is recent work on fair regression, fair clustering, fair bandit algorithms, and many others. In this work we tackle another large class of problems that form a useful primitive in machine learning and optimization: optimizing a set function subject to a set of matroid constraints. A matroid is a combinatorial object that generalizes linear independence between vectors and is general enough to encode cardinality constraints (e.g., selecting at most $k$ elements), connectivity constraints (e.g., selecting a spanning tree of a graph), or matching constraints (ensuring a subgraph has a perfect matching).
    Abstract: We propose online algorithms for Column Subset Selection (CSS) and Principal Component Analysis (PCA), two methods that are widely employed for data analysis, summarization, and visualization. Given a data matrix A that is revealed one column at a time, the online CSS problem asks to keep a small set of columns, S, that best approximates the space spanned by the columns of A. As each column arrives, the algorithm must irrevocably decide whether to add it to S, or to ignore it. In the online PCA problem, the goal is to output a projection of each column to a low dimensional subspace. In other words, the algorithm must provide an embedding for each column as it arrives, which cannot be changed as new columns arrive. While both of these problems have been studied in the online setting, only additive approximations were known prior to our work. The core of our approach is an adaptive sampling technique that gives a practical and efficient algorithm for both of these problems. We prove that by sampling columns using their "residual norm" (i.e., their norm orthogonal to directions sampled so far), we end up with a significantly better dependence between the number of columns sampled and the desired error in the approximation. We further show how to combine our algorithm "in series" with prior algorithms. In particular, using the results of Boutsidis et al. [4] and Frieze et al. [14] that have additive guarantees, we show how to improve the bounds on the error of our algorithm.
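    A minimal sketch of the residual-norm sampling idea follows; the hard `budget` and the fixed probability scale `scale` are our simplifications (the paper sets the sampling threshold adaptively to obtain its guarantees).

```python
import numpy as np

def online_css(columns, budget, scale):
    """Keep arriving columns with probability proportional to their
    residual norm: the part of the column orthogonal to everything
    already selected."""
    rng = np.random.default_rng(0)
    basis, selected = [], []          # orthonormal basis of kept columns
    for j, a in enumerate(columns):
        r = a - sum(q * (q @ a) for q in basis) if basis else a.astype(float)
        p = min(1.0, (r @ r) / scale)  # residual-norm sampling probability
        if len(selected) < budget and rng.random() < p:
            selected.append(j)
            if (r @ r) > 1e-12:
                basis.append(r / np.linalg.norm(r))
    return selected

# Example: columns of a random 20 x 50 matrix arriving one at a time.
A = np.random.default_rng(1).normal(size=(20, 50))
print(online_css(list(A.T), budget=5, scale=25.0))
```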
    Abstract: Modern online learning algorithms achieve low (sublinear) regret in a variety of diverse settings. These algorithms, however, update their solution at every time step. While these updates are computationally efficient, the very requirement of frequent updates makes the algorithms untenable in some practical applications. In this work we look for online learning algorithms that update a sublinear number of times. We give a meta-algorithm based on non-homogeneous Poisson processes that gives a smooth trade-off between final regret and frequency of updates. Empirically, we show that in many cases we can significantly reduce the number of updates at a minimal increase in regret.
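    A toy sketch of the update-skipping idea, with update times drawn from a homogeneous Poisson process for simplicity (the paper's meta-algorithm uses a non-homogeneous rate); `base_update` and `rate` are illustrative names, not the paper's API.

```python
import random

def lazily_updated_learner(base_update, T, rate, solution=None):
    """Play the current solution at every step, but only invoke the
    underlying learner's (possibly expensive) update at random times
    drawn from a Poisson process with the given rate."""
    buffered = []                   # feedback accumulated since last update
    gap = random.expovariate(rate)  # time until the next scheduled update
    for t in range(T):
        loss_t = t                  # stand-in for the observed feedback
        buffered.append(loss_t)
        gap -= 1.0
        if gap <= 0:
            solution = base_update(solution, buffered)
            buffered, gap = [], random.expovariate(rate)
    return solution
```

    Choosing, say, rate = 1/√T yields about √T updates in expectation, a sublinear number.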
    Better Sliding Window Algorithms to Maximize Subadditive and Diversity Objectives
    Michele Borassi
    Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2019)
    Abstract: The streaming computation model is a standard model for large-scale data analysis: the input arrives one element at a time, and the goal is to maintain an approximately optimal solution using only a constant, or, at worst, poly-logarithmic amount of space. In practice, however, recency plays a large role, and one often wishes to consider only the last w elements that have arrived, the so-called sliding window problem. A trivial approach is to simply store the last w elements in a buffer; our goal is to develop algorithms with space and update time sublinear in w. In this regime, there are two frameworks: exponential histograms and smooth histograms, which can be used to obtain sliding window algorithms for families of functions satisfying certain properties. Unfortunately, these frameworks have limitations and cannot always be applied directly. A prominent example is the case of a submodular function with cardinality constraints. Some of these difficulties can be rectified, but often only on a case-by-case basis. Here, we describe an alternative approach to designing efficient sliding window algorithms for maximization problems. We then instantiate this approach on a wide range of problems, yielding better algorithms for submodular function optimization, diversity optimization, and general subadditive optimization. In doing so, we improve state-of-the-art results obtained using problem-specific algorithms.
    Abstract: Consider a buyer participating in a repeated auction, such as those prevalent in display advertising. How would she test whether the auction is incentive compatible? To bid effectively, she is interested in whether the auction is single-shot incentive compatible (a pure second-price auction, with fixed reserve price) and also dynamically incentive compatible (her bids are not used to set future reserve prices). In this work we develop tests based on simple bid perturbations that a buyer can use to answer these questions, with a focus on dynamic incentive compatibility. There are many potential A/B testing setups that one could use, but we find that many natural experimental designs are, in fact, flawed. For instance, we show that additive perturbations can lead to paradoxical results, where higher bids lead to lower optimal reserve prices. We precisely characterize this phenomenon and show that reserve prices are only guaranteed to be monotone for distributions satisfying the Monotone Hazard Rate (MHR) property. The experimenter must also decide how to split traffic to apply systematic perturbations. It is tempting to have this split be randomized, but we demonstrate empirically that unless the perturbations are aligned with the partitions used by the seller to compute reserve prices, the results are guaranteed to be inconclusive. We validate our results with experiments on real display auction data and show that a buyer can quantify both single-shot and dynamic incentive compatibility even under realistic conditions where only the cost of the impression is observed (as opposed to the exact reserve price). We analyze the cost of running such experiments, exposing trade-offs between test accuracy, cost, and underlying market dynamics.
    Abstract: Maximizing submodular functions under cardinality constraints lies at the core of numerous data mining and machine learning applications, including data diversification, data summarization, and coverage problems. In this work, we study this question in the context of data streams, where elements arrive one at a time, and we want to design low-memory and fast update-time algorithms that maintain a good solution. Specifically, we focus on the sliding window model, where we are asked to maintain a solution that considers only the last $W$ items. In this context, we provide the first non-trivial algorithm that maintains a provable approximation of the optimum using space sublinear in the size of the window. In particular, we give a $1/3 - \epsilon$ approximation algorithm that uses space polylogarithmic in the spread of the values of the elements and linear in the solution size $k$, for any constant $\epsilon > 0$. At the same time, processing each element requires only a polylogarithmic number of evaluations of the function itself. When a better approximation is desired, we show a different algorithm that, at the cost of using more memory, provides a $1/2 - \epsilon$ approximation and allows a tunable trade-off between average update time and space. This algorithm matches the best known approximation guarantees for submodular optimization in insertion-only streams, a less general formulation of the problem. We demonstrate the efficacy of the algorithms on a number of real world datasets, showing that their practical performance far exceeds the theoretical bounds. The algorithms preserve high quality solutions in streams with millions of items, while storing a negligible fraction of them.
    Abstract: We introduce the consistent $k$-clustering problem, which asks to minimize the number of re-clusterings necessary to maintain a constant-approximate solution as points arrive in an online manner. We prove lower bounds of $\Omega(k \log n)$ on the number of clustering changes necessary, and give an algorithm that needs only $O(k^3 \log^5 n)$ changes to maintain a constant competitive solution. This is an exponential improvement on the naive solution of re-clustering at every time step. Finally, we show experimentally that our approach performs much better than the theoretical bound, with the number of changes growing approximately as $O(\log n)$.
    Abstract: We study the question of fair clustering under the disparate impact doctrine, where each protected class must have approximately equal representation in every cluster. We formulate the fair clustering problem under both the k-center and the k-median objectives, and show that even with two protected classes the problem is challenging, as the optimum solution can violate common conventions: for instance, a point may no longer be assigned to its nearest cluster center! En route we introduce the concept of fairlets, which are minimal sets that satisfy fair representation while approximately preserving the clustering objective. We show that any fair clustering problem can be decomposed into first finding good fairlets, and then using existing machinery for traditional clustering algorithms. While finding good fairlets can be NP-hard, we proceed to obtain efficient approximation algorithms based on minimum cost flow. We empirically quantify the value of fair clustering on real-world datasets with sensitive attributes.
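    To make the fairlet decomposition concrete, here is a toy sketch for the balanced two-class (1:1) case. The greedy pairing below is our simplification of the paper's minimum-cost-flow construction, and the function names are ours.

```python
import math

def fairlet_pairs(reds, blues):
    """Balanced two-class sketch: greedily pair each red point with its
    nearest unused blue point. Each pair is a fairlet with 1:1
    representation; the paper finds near-optimal fairlets with a
    minimum cost flow instead of this greedy heuristic."""
    assert len(blues) >= len(reds)
    used, fairlets = set(), []
    for r in reds:
        j = min((j for j in range(len(blues)) if j not in used),
                key=lambda j: math.dist(r, blues[j]))
        used.add(j)
        fairlets.append((r, blues[j]))
    return fairlets

# Downstream, run any vanilla k-clustering on one representative per
# fairlet and place both members of a fairlet in that cluster; every
# cluster is then fair because every fairlet is.
print(fairlet_pairs([(0.0, 0.0), (5.0, 5.0)], [(0.1, 0.0), (5.0, 5.2)]))
```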
    Abstract: We consider the reachability indexing problem for private-public directed graphs. In these graphs nodes come in three flavors: public (nodes visible to all users), private (nodes visible to a specific set of users), and protected (nodes visible to any user who can see at least one of the node's parents). We are interested in computing the set of nodes visible to a specific user online. There are two obvious algorithms: precompute the result for every user, or run a reachability algorithm at query time. This paper explores the trade-off between these two strategies. Our approach is to identify a set of additional visible seed nodes for each user; the online reachability algorithm explores the graph starting at these nodes. We first formulate the problem as asymmetric k-center with outliers, and then give an efficient and practical algorithm. We prove new theoretical guarantees for this problem and show empirically that it performs very well in practice.
    Abstract: A significant part of optimizing revenue for ad auctions is setting a good reserve (or minimum) price. Set it too low and the impression may yield little revenue; set it too high and there may not be anyone willing to buy the item. Previous work has looked at predicting this value directly; however, the strongly non-convex objective function makes this a challenging proposition. In contrast, motivated by the fact that computing an optimal reserve price for a set of bids is easy, we propose a clustering approach: first find a good partition of the data, and then find an optimal reserve price for each partition. In this work, we take a major step in this direction: we derive the specific objective function that corresponds to revenue optimization in auctions, and give algorithms that optimize it.
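    The "easy" subproblem mentioned above, finding the revenue-optimal reserve for a fixed set of observed second-price auctions, can be illustrated in a few lines (our own illustration; the paper's contribution is the clustering objective built on top of it).

```python
def optimal_reserve(auctions):
    """Compute the revenue-optimal reserve for a second-price auction,
    given (highest, second-highest) bid pairs from past auctions.
    Revenue at reserve r is the sum of max(b2, r) over auctions whose
    top bid clears r; the optimum is attained at one of the observed
    top bids, so trying every candidate suffices."""
    def revenue(r):
        return sum(max(b2, r) for b1, b2 in auctions if b1 >= r)
    candidates = {b1 for b1, _ in auctions} | {0.0}
    return max(candidates, key=revenue)

# Example: with bid pairs (10, 2), (9, 1), (3, 2), reserve 9 earns 18,
# beating reserve 0 (which earns only 5). Prints 9.0.
print(optimal_reserve([(10.0, 2.0), (9.0, 1.0), (3.0, 2.0)]))
```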
    Abstract: We introduce a novel, data-driven way of predicting the battery consumption of apps. The state-of-the-art models used to blame battery consumption on apps are based on micro-benchmark experiments. These experiments are carried out in controlled setups where one can measure how much battery is consumed by each internal resource (CPU, Bluetooth, WiFi, etc.). The battery blame allocated to an app is simply the sum of the blames of the resources consumed by the app. We argue that this type of model does not capture the way phones work "in the wild" and propose instead to train a regression model using data collected from logs. We show that this type of learning is correct in the sense that, under some assumptions, we can recover the true battery discharge rate of each component. We present experimental results where we consistently make better predictions than a model trained on micro-benchmarks.
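    The regression view can be sketched in a few lines; the numbers and resource names below are made up for illustration.

```python
import numpy as np

# Hypothetical log data: each row is a time window, each column the usage
# of one resource in that window (e.g., CPU seconds, WiFi seconds); y is
# the battery drain observed in the same window. Ordinary least squares
# recovers a per-resource discharge rate, which can then be attributed
# back to apps according to their logged resource usage.
X = np.array([[30.0, 5.0],
              [10.0, 20.0],
              [25.0, 0.0],
              [5.0, 15.0]])
y = np.array([3.5, 2.5, 2.5, 1.5])
rates, *_ = np.linalg.lstsq(X, y, rcond=None)
print(rates)  # estimated drain per unit of each resource
```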
    Pricing a low-regret seller
    Hoda Heidari
    Sadra Yazdanbod
    Proceedings of the Thirty-Third International Conference on Machine Learning (ICML 2016)
    Abstract: As the number of ad exchanges has grown, publishers have turned to low-regret learning algorithms to decide which exchange offers the best price for their inventory. This in turn opens the following question for the exchange: how should it set prices to attract as many sellers as possible and maximize revenue? In this work we formulate this precisely as a learning problem, and present algorithms showing that simply knowing the counterparty is using a low-regret algorithm is enough for the exchange to run its own low-regret learning algorithm and find the optimal price.
    Abstract: We study the question of setting and testing reserve prices in single-item auctions when the bidders are not identical. At a high level, there are two generalizations of the standard second-price auction: in the lazy version we first determine the winner and then apply reserve prices; in the eager version we first discard the bidders not meeting their reserves and then determine the winner among the rest. We show that the two versions have dramatically different properties: lazy reserves are easy to optimize and A/B test in production, whereas eager reserves always lead to higher welfare, but their optimization is NP-complete, and naive A/B testing will lead to incorrect conclusions. Despite their different characteristics, we show that the overall revenue for the two scenarios is always within a factor of 2 of each other, even in the presence of correlated bids. Moreover, we prove that the eager auction dominates the lazy auction on revenue whenever the bidders are independent or symmetric. We complement our theoretical results with simulations on real world data that show that even suboptimally set eager reserve prices are preferred from a revenue standpoint.
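    A toy sketch of the two variants defined above, with one reserve per bidder (our own illustration of the definitions, not the paper's code):

```python
def lazy_auction(bids, reserves):
    """Lazy: determine the winner first, then apply only her reserve."""
    i = max(range(len(bids)), key=lambda j: bids[j])
    if bids[i] < reserves[i]:
        return None, 0.0                     # item goes unsold
    second = max((b for j, b in enumerate(bids) if j != i), default=0.0)
    return i, max(second, reserves[i])

def eager_auction(bids, reserves):
    """Eager: discard bidders below their reserves, then run a
    second-price auction among the survivors."""
    alive = [j for j, b in enumerate(bids) if b >= reserves[j]]
    if not alive:
        return None, 0.0
    i = max(alive, key=lambda j: bids[j])
    second = max((bids[j] for j in alive if j != i), default=0.0)
    return i, max(second, reserves[i])

# With bids (10, 8) and reserves (9, 1) the two coincide; with reserves
# (11, 1) the lazy auction goes unsold while the eager one sells to bidder 1.
print(lazy_auction([10.0, 8.0], [11.0, 1.0]), eager_auction([10.0, 8.0], [11.0, 1.0]))
```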
    Incentivizing Advertiser Networks to Submit Multiple Bids
    Patrick Hummel
    R. Preston McAfee
    International Journal of Game Theory, vol. 45 (2016), pp. 1031-1052
    Abstract: This paper illustrates a method for making side payments to advertiser networks that creates an incentive for the advertiser networks to submit the second-highest bids they received to an ad exchange, and simultaneously ensures that the publishers will make more money on average in the short run as a result of adopting this scheme. We also illustrate how this payment scheme affects publisher payoffs in the long run, after advertisers have a chance to modify their strategies in response to the changed incentives of the mechanism.
    Abstract: Computing connected components of a graph lies at the core of many data mining algorithms, and is a fundamental subroutine in graph clustering. This problem is well studied, yet many of the algorithms with good theoretical guarantees perform poorly in practice, especially when faced with graphs with hundreds of billions of edges. In this paper, we design improved algorithms based on the traditional MapReduce architecture for large-scale data analysis. We also explore the effect of augmenting MapReduce with a distributed hash table (DHT) service. We show that these algorithms have provable theoretical guarantees and easily outperform previously studied algorithms, sometimes by more than an order of magnitude. In particular, our iterative MapReduce algorithms run 3 to 15 times faster than the best previously studied algorithms, and the MapReduce implementation using a DHT is 10 to 30 times faster than the best previously studied algorithms. These are the fastest algorithms that easily scale to graphs with hundreds of billions of edges.
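    A minimal single-machine sketch of the label-propagation family of algorithms this line of work builds on; the distributed, round-efficient versions are the paper's contribution, and the sequential loop below merely stands in for the rounds.

```python
def connected_components(edges, n):
    """Min-label propagation: repeatedly push the smaller of an edge's
    endpoint labels to the other endpoint until nothing changes. Each
    full pass over the edges plays the role of one distributed round."""
    label = list(range(n))
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            m = min(label[u], label[v])
            if label[u] != m or label[v] != m:
                label[u] = label[v] = m
                changed = True
    return label

# Two components, {0, 1, 2} and {3, 4}: prints [0, 0, 0, 3, 3].
print(connected_components([(0, 1), (1, 2), (3, 4)], 5))
```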
    Parallel Algorithms for Unsupervised Tagging
    Sujith Ravi
    Vibhor Rastogi
    Transactions of the ACL (2014)
    Value of Targeting
    Patrick Hummel
    Proceedings of the 7th International Symposium on Algorithmic Game Theory (SAGT) (2014), pp. 194-205
    Abstract: We undertake a formal study of the value of targeting data to an advertiser. As expected, this value is increasing in the utility difference between realizations of the targeting data and the accuracy of the data, and depends on the distribution of competing bids. However, this value may vary non-monotonically with an advertiser's budget. Similarly, modeling the values as either private or correlated, or allowing other advertisers to also make use of the data, leads to unpredictable changes in the value of data. We address questions related to multiple data sources, show that the utility of additional data may be non-monotonic, and provide trade-offs between the quality and the price of data sources. In a game-theoretic setting, we show that advertisers may be worse off than if the data had not been available at all. We also ask whether a publisher can infer the value an advertiser would place on targeting data from the advertiser's bidding behavior, and illustrate that this is impossible.
    Scalable K-Means by ranked retrieval
    Lluis Garcia-Pueyo
    Vanja Josifovski
    Srihari Venkatesan
    Proceedings of the 7th ACM international conference on Web search and data mining, ACM, New York, NY, USA (2014), pp. 233-242
    Abstract: The k-means clustering algorithm has a long history and proven practical performance; however, it does not scale to clustering millions of data points into thousands of clusters in high dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, a prohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality. The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the intermediate results of nearest neighbor computations to improve performance.
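    A compact sketch of the "centroids picking points" inversion, using dot-product scores over sparse feature dictionaries. The names and the fixed `top_features` cutoff are our simplifications; the paper additionally prunes with a WAND variant and assigns leftover points at the end.

```python
from collections import defaultdict

def centroids_pick_points(points, centroids, top_features=10):
    """Inverted assignment sketch: index the sparse points by feature,
    let each centroid query the index with its heaviest features, and
    give each retrieved point to the best-scoring centroid that found it.
    Points that no centroid retrieves remain unassigned."""
    index = defaultdict(set)                   # feature -> point ids
    for pid, p in enumerate(points):           # p: {feature: weight}
        for f in p:
            index[f].add(pid)

    best = {}                                  # pid -> (score, centroid id)
    for cid, c in enumerate(centroids):        # c: {feature: weight}
        heavy = sorted(c, key=c.get, reverse=True)[:top_features]
        candidates = set().union(*(index[f] for f in heavy))
        for pid in candidates:
            score = sum(points[pid].get(f, 0.0) * w for f, w in c.items())
            if score > best.get(pid, (float("-inf"), None))[0]:
                best[pid] = (score, cid)
    return {pid: cid for pid, (score, cid) in best.items()}
```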
    An Overview of Practical Exchange Design
    R. Preston McAfee
    Current Science, vol. 103, no.9 (2012), pp. 1056-1063
    Abstract: We consider the problem of designing an online exchange. We identify the goals of exchange design, and present key techniques for accomplishing these goals along with the trade-offs inherent in the choices.
    Ad serving using a compact allocation plan
    Peiji Chen
    Wenjing Ma
    Srinath Mandalapu
    Chandrashekhar Nagarajan
    Jayavel Shanmugasundaram
    Erik Vee
    Manfai Yu
    Jason Zien
    Proceedings of the 13th ACM Conference on Electronic Commerce, ACM, New York, NY, USA (2012), pp. 319-336
    Ad auctions with data
    Hu Fu
    Patrick R. Jordan
    Inbal Talgam-Cohen
    INFOCOM Workshops (2012), pp. 184-189
    To match or not to match: economics of cookie matching in online advertising
    Arpita Ghosh
    Preston McAfee
    Proceedings of the 13th ACM Conference on Electronic Commerce, ACM, New York, NY, USA (2012), pp. 741-753
    Efficiently Evaluating Graph Constraints in Content-Based Publish/Subscribe
    Shirshanka Das
    Marcus Fontoura
    Bhaskar Ghosh
    Vanja Josifovski
    Jayavel Shanmugasundaram
    The 20th International World Wide Web Conference (WWW 2011), pp. 497-506
    Filtering: a method for solving graph problems in MapReduce.
    Benjamin Moseley
    Siddharth Suri
    SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 85-94
    Abstract: The MapReduce framework is currently the de facto standard used throughout both industry and academia for petabyte scale data analysis. As the input to a typical MapReduce computation is large, one of the key requirements of the framework is that the input cannot be stored on a single machine and must be processed in parallel. In this paper we describe a general algorithmic design technique in the MapReduce framework called filtering. The main idea behind filtering is to reduce the size of the input in a distributed fashion so that the resulting, much smaller, problem instance can be solved on a single machine. Using this approach we give new algorithms in the MapReduce framework for a variety of fundamental graph problems. Specifically, we present algorithms for minimum spanning trees, maximal matchings, approximate weighted matchings, approximate vertex and edge covers, and minimum cuts. In all of these cases, we parameterize our algorithms by the amount of memory available on the machines, allowing us to show trade-offs between the memory available and the number of MapReduce rounds. For each setting we show that even if the machines are only given substantially sublinear memory, our algorithms run in a constant number of MapReduce rounds. To demonstrate the practical viability of our algorithms we implement the maximal matching algorithm that lies at the core of our analysis and show that it achieves a significant speedup over the sequential version.
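    A single-machine sketch of the filtering idea applied to maximal matching, the paper's core subroutine. Each loop iteration stands in for a constant number of MapReduce rounds, and `machine_memory` is an illustrative parameter of ours.

```python
import random

def filtered_maximal_matching(edges, machine_memory):
    """Filtering sketch: repeatedly sample a sub-instance small enough
    for one machine, greedily match it, and filter out every edge that
    touches an already-matched vertex before the next round."""
    matching, matched = [], set()
    while edges:
        sample = random.sample(edges, min(machine_memory, len(edges)))
        for u, v in sample:                    # greedy matching on the sample
            if u not in matched and v not in matched:
                matching.append((u, v))
                matched.update((u, v))
        edges = [(u, v) for u, v in edges
                 if u not in matched and v not in matched]
    return matching

# Path 0-1-2-3 with room for only two edges per "machine".
print(filtered_maximal_matching([(0, 1), (1, 2), (2, 3)], machine_memory=2))
```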
    Hiring a secretary from a poset.
    Ravi Kumar
    Andrea Vattani
    Proceedings 12th ACM Conference on Electronic Commerce (EC-2011), pp. 39-48
    Abstract: The secretary problem lies at the core of mechanism design for online auctions. In this work we study a generalization of the classical secretary problem in a setting where there is only a partial order between the elements, and the goal of the algorithm is to return one of the maximal elements of the poset. This is equivalent to the setting where the seller has a multidimensional objective function with only a partial order among the outcomes. We obtain an algorithm that succeeds with probability at least $k^{-k/(k-1)}\left(\left(1+\log^{-1/(k-1)} k\right)^{k}-1\right)$, where $k$ is the number of maximal elements in the poset and is the only information about the poset that is known to the algorithm. On the other hand, we prove an almost matching upper bound of $k^{-1/(k-1)}$ on the success probability of any algorithm for this problem; this upper bound holds even if the algorithm knows the complete structure of the poset.
    Factorization-based Lossless Compression of Inverted Indices
    George Beskales
    Marcus Fontoura
    Maxim Gurevich
    Vanja Josifovski
    20th ACM Conference on Information and Knowledge Management (CIKM 2011), pp. 327-332
    Efficiently Encoding Term Co-occurrences in Inverted Indexes
    Marcus Fontoura
    Maxim Gurevich
    Vanja Josifovski
    20th ACM Conference on Information and Knowledge Management (CIKM 2011), pp. 307-316
    Maximally representative allocations for guaranteed delivery advertising campaigns
    R. Preston McAfee
    Review of Economic Design, vol. 17 (2013), pp. 83-94
    Ad Serving Using a Compact Allocation Plan
    Peiji Chen
    Wenjing Ma
    Srinath Mandalapu
    Chandrashekhar Nagarajan
    Jayavel Shanmugasundaram
    Erik Vee
    Manfai Yu
    Jason Y. Zien
    CoRR, vol. abs/1203.3593 (2012)
    Scalable K-Means++
    Bahman Bahmani
    Benjamin Moseley
    Andrea Vattani
    Ravi Kumar
    PVLDB, vol. 5 (2012), pp. 622-633
    SHALE: An Efficient Algorithm for Allocation of Guaranteed Display Advertising
    Vijay Bharadwaj
    Peiji Chen
    Wenjing Ma
    Chandrashekhar Nagarajan
    John Tomlin
    Erik Vee
    Jian Yang
    CoRR, vol. abs/1203.3619 (2012)
    Inventory Allocation for Online Graphical Display Advertising using Multi-objective Optimization
    Jian Yang
    Erik Vee
    John Tomlin
    Jayavel Shanmugasundaram
    Tasos Anastasakos
    Oliver Kennedy
    ICORES (2012), pp. 293-304
    Densest Subgraph in Streaming and MapReduce
    Bahman Bahmani
    Ravi Kumar
    PVLDB, vol. 5 (2012), pp. 454-465
    Handling forecast errors while bidding for display advertising
    Kevin J. Lang
    Benjamin Moseley
    WWW (2012), pp. 371-380
    SHALE: an efficient algorithm for allocation of guaranteed display advertising
    Vijay Bharadwaj
    Peiji Chen
    Wenjing Ma
    Chandrashekhar Nagarajan
    John Tomlin
    Erik Vee
    Jian Yang
    KDD (2012), pp. 1195-1203
    Optimal Envy-Free Pricing with Metric Substitutability
    Ning Chen
    Arpita Ghosh
    SIAM J. Comput., vol. 40 (2011), pp. 623-645
    Factorization-based Lossless Compression of Inverted Indices
    George Beskales
    Marcus Fontoura
    Maxim Gurevich
    Vanja Josifovski
    CoRR, vol. abs/1108.1956 (2011)
    Cross-Validation and Mean-Square Stability
    Satyen Kale
    Ravi Kumar
    ICS (2011), pp. 487-495
    The Multiple Attribution Problem in Pay-Per-Conversion Advertising
    Patrick R. Jordan
    Erik Vee
    SAGT (2011), pp. 31-43
    Counting triangles and the curse of the last reducer
    Siddharth Suri
    WWW (2011), pp. 607-614
    Generalized distances between rankings
    Ravi Kumar
    WWW (2010), pp. 571-580
    Optimal online assignment with forecasts
    Erik Vee
    Jayavel Shanmugasundaram
    ACM Conference on Electronic Commerce (2010), pp. 109-118
    Inventory Allocation for Online Graphical Display Advertising
    Jian Yang
    Erik Vee
    John Tomlin
    Jayavel Shanmugasundaram
    Tasos Anastasakos
    Oliver Kennedy
    CoRR, vol. abs/1008.3551 (2010)
    Finding the Jaccard Median
    Flavio Chierichetti
    Ravi Kumar
    Sandeep Pandey
    SODA (2010), pp. 293-311
    A Model of Computation for MapReduce
    Howard J. Karloff
    Siddharth Suri
    SODA (2010), pp. 938-948
    Efficiently evaluating complex boolean expressions
    Marcus Fontoura
    Suhas Sadanandan
    Jayavel Shanmugasundaram
    Erik Vee
    Srihari Venkatesan
    Jason Y. Zien
    SIGMOD Conference (2010), pp. 3-14
    Battling Predictability and Overconcentration in Recommender Systems
    Sihem Amer-Yahia
    Laks V. S. Lakshmanan
    Cong Yu
    IEEE Data Eng. Bull., vol. 32 (2009), pp. 33-40
    Adaptive bidding for display advertising
    Arpita Ghosh
    Benjamin I. P. Rubinstein
    Martin Zinkevich
    WWW (2009), pp. 251-260
    Nearest-neighbor caching for content-match applications
    Sandeep Pandey
    Flavio Chierichetti
    Vanja Josifovski
    Ravi Kumar
    WWW (2009), pp. 441-450
    Contract Auctions for Sponsored Search
    Sharad Goel
    Sébastien Lahaie
    WINE (2009), pp. 196-207
    Worst-Case and Smoothed Analysis of the ICP Algorithm, with an Application to the k-Means Method
    David Arthur
    SIAM J. Comput., vol. 39 (2009), pp. 766-782
    Bidding for Representative Allocations for Display Advertising
    Arpita Ghosh
    Randolph Preston McAfee
    WINE (2009), pp. 208-219
    Social Networks and Stable Matchings in the Job Market
    Esteban Arcaute
    WINE (2009), pp. 220-231
    Impression-plus-click auctions
    Sharad Goel
    Sébastien Lahaie
    SIGecom Exchanges, vol. 8 (2009), pp. 8
    Social Networks and Stable Matchings in the Job Market
    Esteban Arcaute
    CoRR, vol. abs/0910.0916 (2009)
    Bidding for Representative Allocations for Display Advertising
    Arpita Ghosh
    Randolph Preston McAfee
    CoRR, vol. abs/0910.0880 (2009)
    Getting Recommender Systems to Think outside the Box
    Zeinab Abbassi
    Sihem Amer-Yahia
    Laks V. S. Lakshmanan
    Cong Yu
    RecSys (2009), pp. 285-288
    Indexing Boolean Expressions
    Steven Euijong Whang
    Chad Brower
    Jayavel Shanmugasundaram
    Erik Vee
    Ramana Yerneni
    Hector Garcia-Molina
    Proc. 35th Int'l Conf. on Very Large Data Bases (PVLDB) (2009), pp. 37-48
    The Hiring Problem and Lake Wobegon Strategies
    Adam Kirsch
    Ravi Kumar
    Michael Mitzenmacher
    Eli Upfal
    SIAM J. Comput., vol. 39 (2009), pp. 1233-1255
    Similarity caching
    Flavio Chierichetti
    Ravi Kumar
    PODS (2009), pp. 127-136
    Top-k aggregation using intersections of ranked inputs
    Ravi Kumar
    Kunal Punera
    Torsten Suel
    WSDM (2009), pp. 222-231
    Optimal envy-free pricing with metric substitutability
    Ning Chen
    Arpita Ghosh
    ACM Conference on Electronic Commerce (2008), pp. 60-69
    The hiring problem and Lake Wobegon strategies
    Adam Kirsch
    Ravi Kumar
    Michael Mitzenmacher
    Eli Upfal
    SODA (2008), pp. 1184-1193
    Relaxation in text search using taxonomies
    Marcus Fontoura
    Vanja Josifovski
    Ravi Kumar
    Christopher Olston
    PVLDB, vol. 1 (2008), pp. 672-683
    Sponsored Search Auctions with Reserve Prices: Going Beyond Separability
    Rica Gonen
    WINE (2008), pp. 597-608
    k-means++: the advantages of careful seeding
    David Arthur
    SODA (2007), pp. 1027-1035
    Tracing the Path: New Model and Algorithms for Collaborative Filtering
    Rajeev Motwani
    ICDE Workshops (2007), pp. 853-862
    On threshold behavior in query incentive networks
    Esteban Arcaute
    Adam Kirsch
    Ravi Kumar
    David Liben-Nowell
    ACM Conference on Electronic Commerce (2007), pp. 66-74
    Worst-case and Smoothed Analysis of the ICP Algorithm, with an Application to the k-means Method
    David Arthur
    FOCS (2006), pp. 153-164
    How slow is the k-means method?
    David Arthur
    Symposium on Computational Geometry (2006), pp. 144-153
    Using web-graph distance for relevance feedback in web search
    Eric Brill
    SIGIR (2006), pp. 147-153
    Efficiently computing succinct trade-off curves
    Mihalis Yannakakis
    Theor. Comput. Sci., vol. 348 (2005), pp. 334-356
    Efficiently Computing Succinct Trade-Off Curves
    Mihalis Yannakakis
    ICALP (2004), pp. 1201-1213
    On the General Reconfiguration Problem for Expanding Cube Style Modular Robots
    Jeremy Kubica
    Eleanor G. Rieffel
    John W. Suh
    Mark Yim
    Proceedings of the 2002 IEEE Int. Conference on Robotics and Automation, pp. 801-808