# Sergei Vassilvitskii

Authored Publications

Google Publications

Other Publications

Sort By

Measuring Re-identification Risk

Travis Dick

Adel Javanmard

Josh Karlin

Gabriel Henrique Nunes

SIGMOD (2023)

Preview abstract
Compact user representations (such as embeddings) form the backbone of personalization services. In this work, we present a new theoretical framework to measure re-identification risk in such user representations. Our framework, based on hypothesis testing, formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation. As an application, we show how our framework is general enough to model important real-world applications such as the Chrome's Topics API for interest-based advertising. We complement our theoretical bounds by showing provably good attack algorithms for re-identification that we use to estimate the re-identification risk in the Topics API. We believe this work provides a rigorous and interpretable notion of re-identification risk and a framework to measure it that can be used to inform real-world applications.
View details

Scalable Differentially Private Clustering via Hierarchically Separated Trees

Chris Schwiegelshohn

David Saulpic

2022 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2022) (to appear)

Preview abstract
We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space.
By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods.
We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime.
Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines.
View details

Secretaries with Advice

Paul Duetting

Proceedings of the 22nd ACM Conference on Economics and Computation (EC'21) (2021), pp. 409-429

Preview abstract
The secretary problem is probably the purest model of decision making under uncertainty. In this
paper we ask which advice can we give the algorithm to improve its success probability?
We propose a general model that unifies a broad range of problems: from the classic secretary problem
with no advice, to the variant where the quality of a secretary is drawn from a known distribution and
the algorithm learns each candidate’s quality quantile on arrival, to more modern ML-based versions of
advice where a binary classifier gives us noisy advice about whether or not the current secretary is the
best on the market.
Our main technique is a factor revealing LP that captures all of the problems above. We use this LP
formulation to gain structural insight into the optimal policy and present two case studies: a re-derivation
of the classic known distributions result with tools from linear programming and a tight analysis of the
noisy binary advice model.
View details

Fair Hierarchical Clustering

Benjamin Moseley

Marina Knittel

Yuyan Wang

Neurips 2020

Preview abstract
As machine learning has become more and more integrated into our businesses and lifestyles, researchers have begun to recognize the necessity of ensuring machine learning systems are fair. Recently, there has been an interest in defining a notion of fairness that mitigates over-representation in traditional clustering.
In this paper we extend this notion to hierarchical clustering, where the goal is to recursively partition the data to optimize a certain objective~\cite{dasgupta}. For various natural objectives, we obtain simple, efficient algorithms to find a provably good fair hierarchical clustering. Empirically, we show that our algorithms can find a fair hierarchical clustering, surprisingly, with only a negligible loss in the objective.
View details

Sliding Window Algorithms for k-Clustering Problems

Michele Borassi

Neurips 2020 (to appear)

Preview abstract
The sliding window model of computation captures scenarios in which data is arriving continuously, but only the latest $w$ elements should be used for analysis. The goal is to design algorithms that update the solution efficiently with each arrival rather than recomputing it from scratch. In this work, we focus on $k$-clustering problems such as $k$-means and $k$-median. In this setting, we give simple and practical algorithms that come with stronger performance guarantees than previously known results. Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with cost only slightly worse than those returned by algorithms that have access to the full dataset.
View details

Better Sliding Window Algorithms to Maximize Subadditive and Diversity Objectives

Michele Borassi

Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2019)

Preview abstract
The streaming computation model is a standard model for large-scale data analysis: the input arrives one element at a time, and the goal is to maintain an approximately optimal solution using only a constant, or, at worst, poly-logarithmic space.
In practice, however, recency plays a large role, and one often wishes to consider only the last w elements that have arrived, the so-called sliding window problem. A trivial approach is to simply store the last w elements in a buffer; our goal is to develop algorithms with space and update time sublinear in w. In this regime, there are two frameworks: exponential histograms and smooth histograms, which can be used to obtain sliding window algorithms for families of functions satisfying certain properties.
Unfortunately, these frameworks have limitations and cannot always be applied directly. A prominent example is the case of a submodular function with cardinality constraints. Some of these difficulties can be rectified, but often only on a case-by-case basis. Here, we describe an alternative approach to designing efficient sliding window algorithms for maximization problems. Then we instantiate this approach on a wide range of problems, yielding better algorithms for submodular function optimization, diversity optimization and general subadditive optimization. In doing so, we improve state-of-the art results obtained using problem-specific algorithms.
View details

Testing Incentive Compatibility in Display Ad Auctions

Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018

Preview abstract
Consider a buyer participating in a repeated auction, such as those
prevalent in display advertising. How would she test whether the
auction is incentive compatible? To bid effectively, she is
interested in whether the auction is single-shot incentive
compatible---a pure second-price auction, with fixed reserve
price---and also dynamically incentive compatible---her bids
are not used to set future reserve prices. In this work we develop
tests based on simple bid perturbations that a buyer can use to
answer these questions, with a focus on dynamic incentive
compatibility.
There are many potential A/B testing setups that one could use, but
we find that many natural experimental designs are, in fact, flawed.
For instance, we show that additive perturbations can lead to paradoxical results,
where higher bids lead to lower optimal reserve prices. We precisely
characterize this phenomenon and show that reserve prices are only
guaranteed to be monotone for distributions satisfying the Monotone
Hazard Rate (MHR) property. The experimenter must also decide how to split traffic to apply
systematic perturbations. It is tempting to have this split be
randomized, but we demonstrate empirically that unless the perturbations are
aligned with the partitions used by the seller to compute
reserve prices, the results are guaranteed to be inconclusive.
We validate our results with experiments on real display auction
data and show that a buyer can quantify both single-shot and dynamic
incentive compatibility even under realistic conditions where only
the cost of the impression is observed (as opposed to the exact
reserve price). We analyze the cost of running such experiments,
exposing trade-offs between test accuracy, cost, and underlying
market dynamics.
View details

Submodular Optimization Over Sliding Windows

Proceedings of the 26th International World Wide Web Conference, WWW (2017)

Preview abstract
Maximizing submodular functions under cardinality constraints lies at
the core of numerous data mining and machine learning applications,
including data diversification, data summarization, and coverage problems.
In this work, we study this question in the context of data
streams, where elements arrive one at a time, and we want to
design low-memory and fast update-time algorithms that maintain a good solution.
Specifically, we focus on the sliding window
model, where we are asked to maintain a solution that considers only
the last $W$ items.
In this context, we provide the first non-trivial algorithm that
maintains a provable approximation of the optimum using space sublinear in the size of the window.
In particular we give a $\nicefrac{1}{3} - \epsilon$ approximation algorithm that uses space polylogarithmic in
the spread of the values of the elements,
$\Spread$, and linear in the solution size $k$ for any constant $\epsilon > 0$. At the same
time, processing each element only requires a
polylogarithmic number of evaluations of the function itself. When a
better approximation is desired, we show a different algorithm that, at the cost of using more memory, provides a $\nicefrac{1}{2} - \epsilon$
approximation, and allows a tunable trade-off between average update time and space. This algorithm matches the best known approximation guarantees for submodular optimization in insertion-only streams, a less general formulation of the problem.
We demonstrate the efficacy of the algorithms on a number of real
world datasets, showing that their practical performance far exceeds
the theoretical bounds. The algorithms preserve high quality solutions in streams with millions of items, while storing a
negligible fraction of them.
View details

Learning mobile phone battery consumptions

Ashish Sharma

Paul Eastham

Workshop on On Device Intelligence (2016)

Preview abstract
We introduce a novel, data-driven way for predicting battery consumption of apps. The state-of-the-art models used to blame battery consumption on apps are based on micro-benchmark experiments. These experiments are carried out on controlled setups where one can measure how much battery is consumed by each internal resource (CPU, bluetooth, WiFi...). The battery blame allocated to an app is simply the sum of the blames of the resources consumed by the app. We argue that this type of models do not capture the way phones work "in the wild" and propose instead to train a regression model using data collected from logs. We show that this type of learning is correct in the sense that under some assumptions, we can recover the true battery discharge rate of each component. We present experimental results where we consistently do better predictions than a model trained on micro-benchmarks.
View details

Incentivizing Advertiser Networks to Submit Multiple Bids

Patrick Hummel

R. Preston McAfee

International Journal of Game Theory, vol. 45 (2016), pp. 1031-1052

Preview abstract
This paper illustrates a method for making side payments to advertiser networks that creates an incentive for the advertiser networks to submit the second-highest bids they received to an ad exchange and simultaneously ensures that the publishers will make more money on average in the short run as a result of adopting this scheme. We also illustrate how this payment scheme affects publisher payoffs in the long run after advertisers have a chance to modify their strategies in response to the changed incentives of the mechanism.
View details

Pricing a low-regret seller

Hoda Heidari

Sadra Yazdanbod

Proceedings of the Thirty-Third International Conference on Machine Learning (ICML 2016)

Preview abstract
As the number of ad exchanges has grown, publishers have turned to low regret learning algorithms to decide which exchange offers the best price for their inventory. This in turn opens the following question for the exchange: how to set prices to attract as many sellers as possible and maximize revenue. In this work we formulate this precisely as a learning problem, and present algorithms showing that by simply knowing that the counterparty is using a low regret algorithm is enough for the exchange to have its own low regret learning algorithm to find the optimal price.
View details

Value of Targeting

Patrick Hummel

Proceedings of the 7th International Symposium on Algorithmic Game Theory (SAGT) (2014), pp. 194-205

Preview abstract
We undertake a formal study of the value of targeting data to an advertiser. As expected, this value is increasing in the utility difference between realizations of the targeting data and the accuracy of the data, and depends on the distribution of competing bids. However, this value may vary non-monotonically with an advertiser’s budget. Similarly, modeling the values as either private or correlated, or allowing other advertisers to also make use of the data, leads to unpredictable changes in the value of data. We address questions related to multiple data sources, show that utility of additional data may be non-monotonic, and provide tradeoffs between the quality and the price of data sources. In a game-theoretic setting, we show that advertisers may be worse off than if the data had not been available at all. We also ask whether a publisher can infer the value an advertiser would place on targeting data from the advertiser’s bidding behavior and illustrate that this is impossible.
View details

Scalable K-Means by ranked retrieval

Lluis Garcia-Pueyo

Vanja Josifovski

Srihari Venkatesan

Proceedings of the 7th ACM international conference on Web search and data mining, ACM, New York, NY, USA (2014), pp. 233-242

Preview abstract
The k-means clustering algorithm has a long history and a proven practical performance, however it does not scale to clustering millions of data points into thousands of clusters in high dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, aprohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality.
The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the results of the intermediate results of nearest neighbor computations to improve performance.
View details

Ad auctions with data

Preview
Hu Fu

Patrick R. Jordan

Inbal Talgam-Cohen

INFOCOM Workshops (2012), pp. 184-189

Ad serving using a compact allocation plan

Preview
Peiji Chen

Wenjing Ma

Srinath Mandalapu

Chandrashekhar Nagarjan

Jayavel Shanmugasundaram

Erik Vee

Manfai Yu

Jason Zien

Proceedings of the 13th ACM Conference on Electronic Commerce, ACM, New York, NY, USA (2012), pp. 319-336

To match or not to match: economics of cookie matching in online advertising

Preview
Arpita Ghosh

Preston McAfee

Proceedings of the 13th ACM Conference on Electronic Commerce, ACM, New York, NY, USA (2012), pp. 741-753

Factorization-based Lossless Compression of Inverted Indices

Preview
George Beskales

Marcus Fontoura

Maxim Gurevich

Vanja Josifovski

20th ACM Conference on Information and Knowledge Management (CIKM 2011) (to appear)

Filtering: a method for solving graph problems in MapReduce.

Benjamin Moseley

Siddharth Suri

SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 85-94

Preview abstract
The MapReduce framework is currently the de facto standard used throughout both industry and academia for petabyte scale data analysis. As the input to a typical MapReduce computation is large, one of the key requirements of the framework is that the input cannot be stored on a single machine and must be processed in parallel. In this paper we describe a general algorithmic design technique in the MapReduce framework called filtering. The main idea behind filtering is to reduce the size of the input in a distributed fashion so that the resulting, much smaller, problem instance can be solved on a single machine. Using this approach we give new algorithms in the MapReduce framework for a variety of fundamental graph problems. Specifically, we present algorithms for minimum spanning trees, maximal matchings, approximate weighted matchings, approximate vertex and edge covers and minimum cuts. In all of these cases, we will parameterize our algorithms by the amount of memory available on the machines allowing us to show tradeoffs between the memory available and the number of MapReduce rounds. For each setting we will show that even if the machines are only given substantially sublinear memory, our algorithms run in a constant number of MapReduce rounds. To demonstrate the practical viability of our algorithms we implement the maximal matching algorithm that lies at the core of our analysis and show that it achieves a significant speedup over the sequential version.
View details

Efficiently Encoding Term Co-occurrences in Inverted Indexes

Preview
Marcus Fontoura

Maxim Gurevich

Vanja Josifovski

20th ACM Conference on Information and Knowledge Management (CIKM 2011) (to appear)

Efficiently Evaluating Graph Constraints in Content-Based Publish/Subscribe

Preview
Shirshanka Das

Marcus Fontoura

Bhaskar Ghosh

Vanja Josifovski

Jayavel Shanmugasundaram

The 20th International World Wide Web Confererence (WWW 2011)

Hiring a secretary from a poset.

Ravi Kumar

Andrea Vattani

Proceedings 12th ACM Conference on Electronic Commerce (EC-2011), pp. 39-48

Preview abstract
The secretary problem lies at the core of mechanism design for online auctions. In this work we study the generalization of the classical secretary problem in a setting where there is only a partial order be- tween the elements and the goal of the algorithm is to return one of the maximal elements of the poset. This is equivalent to the setting where the seller has a multidimensional objective function with only a partial order among the outcomes. We obtain an algorithm that succeeds with probability at least?1 + l
k^{−k/(k−1)} ((1+log^{-1/(k-1)} k)^k -1) where k is the number of maximal elements in the poset and is the only information about the poset that is known to the algorithm. On the other hand, we prove an almost matching upper bound of k^{−1/(k−1)} on the success probability of any algorithm for this problem; this upper bound holds even if the algorithm knows the complete structure of the poset.
View details

Maximally representative allocations for guaranteed delivery advertising campaigns

R. Preston McAfee

Review of Economic Design, vol. 17 (2013), pp. 83-94

SHALE: An Efficient Algorithm for Allocation of Guaranteed Display Advertising

Vijay Bharadwaj

Peiji Chen

Wenjing Ma

Chandrashekhar Nagarajan

John Tomlin

Erik Vee

Jian Yang

CoRR, vol. abs/1203.3619 (2012)

Inventory Allocation for Online Graphical Display Advertising using Multi-objective Optimization

Jian Yang

Erik Vee

John Tomlin

Jayavel Shanmugasundaram

Tasos Anastasakos

Oliver Kennedy

ICORES (2012), pp. 293-304

Ad serving using a compact allocation plan

Peiji Chen

Wenjing Ma

Srinath Mandalapu

Chandrashekhar Nagarajan

Jayavel Shanmugasundaram

Erik Vee

Manfai Yu

Jason Y. Zien

ACM Conference on Electronic Commerce (2012), pp. 319-336

Densest Subgraph in Streaming and MapReduce

Scalable K-Means++

Bahman Bahmani

Benjamin Moseley

Andrea Vattani

Ravi Kumar

PVLDB, vol. 5 (2012), pp. 622-633

Ad Serving Using a Compact Allocation Plan

Peiji Chen

Wenjing Ma

Srinath Mandalapu

Chandrashekhar Nagarajan

Jayavel Shanmugasundaram

Erik Vee

Manfai Yu

Jason Y. Zien

CoRR, vol. abs/1203.3593 (2012)

SHALE: an efficient algorithm for allocation of guaranteed display advertising

Vijay Bharadwaj

Peiji Chen

Wenjing Ma

Chandrashekhar Nagarajan

John Tomlin

Erik Vee

Jian Yang

KDD (2012), pp. 1195-1203

Handling forecast errors while bidding for display advertising

Cross-Validation and Mean-Square Stability

The Multiple Attribution Problem in Pay-Per-Conversion Advertising

Factorization-based Lossless Compression of Inverted Indices

George Beskales

Marcus Fontoura

Maxim Gurevich

Vanja Josifovski

CoRR, vol. abs/1108.1956 (2011)

Efficiently encoding term co-occurrences in inverted indexes

Filtering: a method for solving graph problems in MapReduce

Factorization-based lossless compression of inverted indices

George Beskales

Marcus Fontoura

Maxim Gurevich

Vanja Josifovski

CIKM (2011), pp. 327-332

Counting triangles and the curse of the last reducer

Optimal Envy-Free Pricing with Metric Substitutability

Hiring a secretary from a poset

Ravi Kumar

Andrea Vattani

ACM Conference on Electronic Commerce (2011), pp. 39-48

Efficiently evaluating graph constraints in content-based publish/subscribe

Shirshanka Das

Marcus Fontoura

Bhaskar Ghosh

Vanja Josifovski

Jayavel Shanmugasundaram

WWW (2011), pp. 497-506

A Model of Computation for MapReduce

Optimal online assignment with forecasts

Erik Vee

Jayavel Shanmugasundaram

ACM Conference on Electronic Commerce (2010), pp. 109-118

Generalized distances between rankings

Finding the Jaccard Median

Efficiently evaluating complex boolean expressions

Marcus Fontoura

Suhas Sadanandan

Jayavel Shanmugasundaram

Erik Vee

Srihari Venkatesan

Jason Y. Zien

SIGMOD Conference (2010), pp. 3-14

Inventory Allocation for Online Graphical Display Advertising

Jian Yang

Erik Vee

John Tomlin

Jayavel Shanmugasundaram

Tasos Anastasakos

Oliver Kennedy

CoRR, vol. abs/1008.3551 (2010)

Nearest-neighbor caching for content-match applications

Sandeep Pandey

Flavio Chierichetti

Vanja Josifovski

Ravi Kumar

WWW (2009), pp. 441-450

The Hiring Problem and Lake Wobegon Strategies

Adam Kirsch

Ravi Kumar

Michael Mitzenmacher

Eli Upfal

SIAM J. Comput., vol. 39 (2009), pp. 1233-1255

Top-k aggregation using intersections of ranked inputs

Bidding for Representative Allocations for Display Advertising

Arpita Ghosh

Randolph Preston McAfee

CoRR, vol. abs/0910.0880 (2009)

Bidding for Representative Allocations for Display Advertising

Social Networks and Stable Matchings in the Job Market

Getting Recommender Systems to Think outside the Box

Zeinab Abbassi

Sihem Amer-Yahia

Laks V. S. Lakshmanan

Cong Yu

RecSys (2009), pp. 285-288

Worst-Case and Smoothed Analysis of the ICP Algorithm, with an Application to the k-Means Method

Indexing Boolean Expressions

Steven Whang

Chad Brower

Jayavel Shanmugasundaram

Erik Vee

Ramana Yerneni

Hector Garcia-Molina

PVLDB, vol. 2 (2009), pp. 37-48

Contract Auctions for Sponsored Search

Adaptive bidding for display advertising

Arpita Ghosh

Benjamin I. P. Rubinstein

Martin Zinkevich

WWW (2009), pp. 251-260

Indexing Boolean Expressions

Steven Euijong Whang

Chad Brower

Jayavel Shanmugasundaram

Erik Vee

Ramana Yerneni

Hector Garcia-Molina

Proc. 35th Int'l Conf. on Very Large Data Bases (PVLDB) (2009), pp. 37-48

Battling Predictability and Overconcentration in Recommender Systems

Sihem Amer-Yahia

Laks V. S. Lakshmanan

Cong Yu

IEEE Data Eng. Bull., vol. 32 (2009), pp. 33-40

Impression-plus-click auctions

Social Networks and Stable Matchings in the Job Market

Similarity caching

The hiring problem and Lake Wobegon strategies

Adam Kirsch

Ravi Kumar

Michael Mitzenmacher

Eli Upfal

SODA (2008), pp. 1184-1193

Optimal envy-free pricing with metric substitutability

Sponsored Search Auctions with Reserve Prices: Going Beyond Separability

Relaxation in text search using taxonomies

Marcus Fontoura

Vanja Josifovski

Ravi Kumar

Christopher Olston

PVLDB, vol. 1 (2008), pp. 672-683

On threshold behavior in query incentive networks

Esteban Arcaute

Adam Kirsch

Ravi Kumar

David Liben-Nowell

ACM Conference on Electronic Commerce (2007), pp. 66-74

Tracing the Path: New Model and Algorithms for Collaborative Filtering

k-means++: the advantages of careful seeding

How slow is the k-means method?

Using web-graph distance for relevance feedback in web search

Worst-case and Smoothed Analysis of the ICP Algorithm, with an Application to the k-means Method

Efficiently computing succinct trade-off curves

Efficiently Computing Succinct Trade-Off Curves

On the General Reconfiguration Problem for Expanding Cube Style Modular Robots

Jeremy Kubica

Eleanor G. Rieffel

John W. Suh

Mark Yim

Proceedings of the 2002 IEEE Int. Conference on Robotics and Automation, pp. 801-808

On the General Reconfiguration Problem for Expanding Cube Style Modular Robots