# Silvio Lattanzi

Silvio received his bachelor (2005), master (2007) and PhD(2011) degree from the Computer Science department of Sapienza University of Rome, under the supervision of Alessandro Panconesi. Silvio joined Google Research in the New York office in January 2011. Since April 2017 Silvio moved to Google Research Zurich.

Authored Publications

Google Publications

Other Publications

Sort By

Scalable Differentially Private Clustering via Hierarchically Separated Trees

Andres Munoz Medina

Chris Schwiegelshohn

David Saulpic

2022 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2022) (to appear)

Preview abstract
We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space.
By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods.
We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime.
Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines.
View details

Near-Optimal Correlation Clustering with Privacy

Ashkan Norouzi Fard

Chenglin Fan

Jakub Tarnawski

Nikos Parotsidis

Slobodan Mitrović

NeurIPS 2022 (2022) (to appear)

Preview abstract
Correlation clustering is a central problem in unsupervised learning, with applications spanning community detection, duplicate detection, automated labeling and many more. In the correlation clustering problem one receives as input a set of nodes and for each node a list of co-clustering preferences, and the goal is to output a clustering that minimizes the disagreement with the specified nodes' preferences. In this paper, we introduce a simple and computationally efficient algorithm for the correlation clustering problem with provable privacy guarantees. Our additive error is stronger than the one shown in prior work and is optimal up to polylogarithmic factors for fixed privacy parameters.
View details

Active Learning of Classifiers with Label and Seed Queries

Andrea Paudice

Marco Bressan

Maximilian Thiessen

Nicolo Cesa-Bianchi

NeurIPS 2022 (to appear)

Preview abstract
We study exact active learning of binary and multiclass classifiers with margin. Given an $n$-point set $X \subset \R^m$, we want to learn any unknown classifier on $X$ whose classes have finite \emph{strong convex hull margin}, a new notion extending the SVM margin.
On the other hand, using the more powerful \emph{seed} queries (a variant of equivalence queries), the target classifier could be learned in $\scO(m \log n)$ queries via Littlestone's Halving algorithm; however, Halving is computationally inefficient.
In this work we show that, by carefully combining the two types of queries, a binary classifier can be learned in time $\poly(n+m)$ using only $\scO(m^2 \log n)$ label queries and $\scO\big(m \log \frac{m}{\gamma}\big)$ seed queries; the result extends to $k$-class classifiers at the price of a $k!k^2$ multiplicative overhead. Similar results hold when the input points have bounded bit complexity, or when only one class has strong convex hull margin against the rest. We complement these upper bounds by showing that in the worst case any algorithm needs $\Omega\big(\frac{k m \log \nicefrac{1}{\gamma}}{\log m}\big)$ seed and label queries to learn a $k$-class classifier with strong convex hull margin $\gamma$.
View details

Correlation Clustering in Constant Many Parallel Rounds

Ashkan Norouzi Fard

Jakub Tarnawski

Nikos Parotsidis

Slobodan Mitrović

ICML (2022) (to appear)

Preview abstract
Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental scalability\nnote{I would remove "scalability": it is not clear that this will be demonstrated with mid-sized graphs} evaluation of our techniques.
View details

Deletion Robust Submodular Maximization over Matroids

Ashkan Norouzi Fard

Federico Fusco

Paul Duetting

ICML'22 (2022)

Preview abstract
Maximizing a monotone submodular function is a fundamental task in machine learning. In this paper we study the deletion robust version of the problem under the classic matroids constraint. Here the goal is to extract a small size summary of the dataset that contains a high value independent set even after an adversary deleted some elements. We present constant-factor approximation algorithms, whose space complexity depends on the rank $k$ of the matroid and the number $d$ of deleted elements. In the centralized setting we present a $(3.582+O(\eps))$-approximation algorithm with summary size $O(k + \frac{d \log k}{\eps^2})$. In the streaming setting we provide a $(5.582+O(\eps))$-approximation algorithm with summary size and memory $O(k + \frac{d \log k}{\eps^2})$. We complement our theoretical results with an in-depth experimental analysis showing the effectiveness of our algorithms on real-world datasets.
View details

Spectral Clustering Oracles in Sublinear Time

Aidasadat Mousavifar

Christian Alexander Sohler

Grzegorz Gluch

Michael Kapralov

SODA 2021 (to appear)

Preview abstract
Given a graph G that can be partitioned into k clusters, can we efficiently construct a small
space data structure that allows quickly classifying vertices of G according to cluster they belong to?
Formally, if G can be partitioned into k disjoint expanders with outer conductance upper bounded
by ∈ (0, 1), how efficient can such a data structure be?
In this paper we show that surprisingly efficient and robust data structure exist. In particular we
prove that storing approximations to a small number of random walks from a few random nodes in the
graph G allows one to classify vertices according to cluster structure of G using only poly(k, log n) ·
n
1/2+O() n time per vertex, i.e. in sublinear time! This runtime is optimal for small constant
values of s and k [Chiplunkar et al’FOCS’18].
To the best of our knowledge, our result is the first spectral clustering algorithm that allows
classification in sublinear time even when the outer conductance of the partitions is only a small
constant (as opposed to vanishingly small).
View details

Parallel and Efficient Hierarchical k-Median Clustering

Ashkan Norouzi Fard

Christian Sohler

Ola Svensson

NeurIPS 2021

Preview abstract
As a fundamental unsupervised learning task, hierarchical
clustering has been extensively studied in the past decade. In
particular, standard metric formulations as hierarchical $k$-center, $k$-means,
and $k$-median received a lot of attention and the problems
have been studied extensively in different computation model.
Despite all this interest, not many efficient parallel algorithms
are known for those problems. In this paper we introduce a
new parallel algorithm for Euclidean
hierarchical $k$-median problem that using machine with memory $s$ (for $s\in
\Omega(\log^2 (n+\Delta+d))$) runs in $O\left(\log_{s} nd
\right)$ rounds, where $d$ is the dimension of the data set and
$\Delta$ is a polynomial upper-bound on the ratio between the maximum
and minimum
distance of two points in the input dataset. To the best of our
knowledge, this is the first \emph{parallel} algorithm
for the hierarchical $k$-median problem with theoretical guarantees.
We further complement our theoretical results with an empirical study
of our algorithm that shows its effectiveness in practice.
View details

Online Facility Location with Multiple Advice

Alessandro Panconesi

Flavio Chierichetti

Giuseppe Re

Matteo Almanza

NeurIPS 2021 (to appear)

Preview abstract
Clustering is a central topic in unsupervised learning and its online formulation has received a lot of attention in recent years. In this paper, we study the classic facility location problem in the presence of multiple machine-learned advice. We design an algorithm with provable performance guarantees such that, if the advice is good, it outperforms the best known online algorithms for the problem, and if it is bad it still matches their performance.
We complement our theoretical analysis with an in-depth study of the performance of our algorithm, showing its effectiveness on synthetic and real-world data sets.
View details

Secretaries with Advice

Paul Duetting

Proceedings of the 22nd ACM Conference on Economics and Computation (EC'21) (2021), pp. 409-429

Preview abstract
The secretary problem is probably the purest model of decision making under uncertainty. In this
paper we ask which advice can we give the algorithm to improve its success probability?
We propose a general model that unifies a broad range of problems: from the classic secretary problem
with no advice, to the variant where the quality of a secretary is drawn from a known distribution and
the algorithm learns each candidate’s quality quantile on arrival, to more modern ML-based versions of
advice where a binary classifier gives us noisy advice about whether or not the current secretary is the
best on the market.
Our main technique is a factor revealing LP that captures all of the problems above. We use this LP
formulation to gain structural insight into the optimal policy and present two case studies: a re-derivation
of the classic known distributions result with tools from linear programming and a tight analysis of the
noisy binary advice model.
View details

Fast and Accurate k-means++ via Rejection Sampling

Ashkan Norouzi Fard

Christian Sohler

Ola Svensson

NeurIPS 2020

Preview abstract
We present the first distributed approximation algorithm for the Euclidean $k$-median problem with an
optimal trade-off between memory usage and the number of parallel rounds. Our algorithm even works in the setting where each machine has very limited memory $s\in \Omega(\log n)$ and it is work efficient. In the future, it would be interesting to obtain similar results for other clustering problems and to improve the approximation factor of our algorithm.
View details

Fully Dynamic Algorithm for Constrained Submodular Optimization

Ashkan Norouzi Fard

Jakub Tarnawski

Slobodan Mitrović

NeurIPS 2020 (to appear)

Preview abstract
The task of maximizing a monotone submodular function under a cardinality constraint is at the core of many machine learning and data mining applications, including data summarization, sparse regression and coverage problems. We study this problem in the context of fully dynamic streams, where elements can be both inserted and removed.
Our main result is a randomized algorithm that maintains an efficient data structure with a poly-logarithmic ammortized update time and returns a (1/2 - \epsilon)-approximate solution.
We complement our theoretical analysis with an empirical study of the performance of our algorithm.
View details

Sliding Window Algorithms for k-Clustering Problems

Michele Borassi

Neurips 2020 (to appear)

Preview abstract
The sliding window model of computation captures scenarios in which data is arriving continuously, but only the latest $w$ elements should be used for analysis. The goal is to design algorithms that update the solution efficiently with each arrival rather than recomputing it from scratch. In this work, we focus on $k$-clustering problems such as $k$-means and $k$-median. In this setting, we give simple and practical algorithms that come with stronger performance guarantees than previously known results. Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with cost only slightly worse than those returned by algorithms that have access to the full dataset.
View details

Better Sliding Window Algorithms to Maximize Subadditive and Diversity Objectives

Michele Borassi

Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2019)

Preview abstract
The streaming computation model is a standard model for large-scale data analysis: the input arrives one element at a time, and the goal is to maintain an approximately optimal solution using only a constant, or, at worst, poly-logarithmic space.
In practice, however, recency plays a large role, and one often wishes to consider only the last w elements that have arrived, the so-called sliding window problem. A trivial approach is to simply store the last w elements in a buffer; our goal is to develop algorithms with space and update time sublinear in w. In this regime, there are two frameworks: exponential histograms and smooth histograms, which can be used to obtain sliding window algorithms for families of functions satisfying certain properties.
Unfortunately, these frameworks have limitations and cannot always be applied directly. A prominent example is the case of a submodular function with cardinality constraints. Some of these difficulties can be rectified, but often only on a case-by-case basis. Here, we describe an alternative approach to designing efficient sliding window algorithms for maximization problems. Then we instantiate this approach on a wide range of problems, yielding better algorithms for submodular function optimization, diversity optimization and general subadditive optimization. In doing so, we improve state-of-the art results obtained using problem-specific algorithms.
View details

One-Shot Coresets: The Case of k-Clustering

International Conference on Artificial Intelligence and Statistics (2018)

Preview abstract
Scaling clustering algorithms to massive data sets is a challenging task. Recently, several successful approaches based on data summarization methods, such as coresets and sketches, were proposed.
While these techniques provide provably good and small summaries, they are inherently problem dependent --- the practitioner has to commit to a fixed clustering objective before even exploring the data.
However, can one construct small data summaries for a wide range of clustering problems simultaneously? In this work, we affirmatively answer this question by proposing an efficient algorithm that constructs such one-shot summaries for k-clustering problems while retaining strong theoretical guarantees.
View details

Submodular Optimization Over Sliding Windows

Proceedings of the 26th International World Wide Web Conference, WWW (2017)

Preview abstract
Maximizing submodular functions under cardinality constraints lies at
the core of numerous data mining and machine learning applications,
including data diversification, data summarization, and coverage problems.
In this work, we study this question in the context of data
streams, where elements arrive one at a time, and we want to
design low-memory and fast update-time algorithms that maintain a good solution.
Specifically, we focus on the sliding window
model, where we are asked to maintain a solution that considers only
the last $W$ items.
In this context, we provide the first non-trivial algorithm that
maintains a provable approximation of the optimum using space sublinear in the size of the window.
In particular we give a $\nicefrac{1}{3} - \epsilon$ approximation algorithm that uses space polylogarithmic in
the spread of the values of the elements,
$\Spread$, and linear in the solution size $k$ for any constant $\epsilon > 0$. At the same
time, processing each element only requires a
polylogarithmic number of evaluations of the function itself. When a
better approximation is desired, we show a different algorithm that, at the cost of using more memory, provides a $\nicefrac{1}{2} - \epsilon$
approximation, and allows a tunable trade-off between average update time and space. This algorithm matches the best known approximation guarantees for submodular optimization in insertion-only streams, a less general formulation of the problem.
We demonstrate the efficacy of the algorithms on a number of real
world datasets, showing that their practical performance far exceeds
the theoretical bounds. The algorithms preserve high quality solutions in streams with millions of items, while storing a
negligible fraction of them.
View details

Affinity Clustering: Hierarchical Clustering at Scale

Soheil Behnezhad

Mahsa Derakhshan

MohammadTaghi Hajiaghayi

Raimondas Kiveris

NIPS 2017, pp. 6867-6877

Preview abstract
Graph clustering is a fundamental task in many data-mining and machine-learning pipelines. In particular, identifying good hierarchical clustering structure is at the same time a fundamental and challenging problem for several applications. In many applications, the amount of data to analyze is increasing at an astonishing rate each day. Hence there is a need for new solutions to efficiently compute effective hierarchical clusterings on such huge data.
In this paper, we propose algorithms to address this problem. First, we analyze minimum spanning tree-based clustering algorithms and their corresponding hierarchical clusterings. In particular we consider classic single-linkage clustering based on Kruskal's algorithm and a variation of Boruvka algorithm that we call affinity clustering and prove new interesting properties of these clusterings via the concept of certificates. Then we present new algorithms in the MapReduce model and their efficient real world implementations via Distributed Hash Tables (DHTs). Our MapReduce algorithms indeed improve upon the previous MapReduce algorithms for finding a minimum spanning tree in graphs as well. Finally we show experimentally that our algorithms are scalable for huge data and competitive with state-of-the-art algorithms. In particular we show that Affinity Clustering is in practice superior to several state-of-the-art clustering algorithms.
View details

Algorithms for ℓp Low Rank Approximation

Flavio Chierichetti

Sreenivas Gollapudi

Ravi Kumar

Rina Panigrahy

David P. Woodruff

ICML '17 (2017)

Preview abstract
We consider the problem of approximating a given matrix by a low-rank matrix so as to minimize the entrywise ℓp-approximation error, for any p≥1; the case p=2 is the classical SVD problem. We obtain the first provably good approximation algorithms for this version of low-rank approximation that work for every value of p≥1, including p=∞. Our algorithms are simple, easy to implement, work well in practice, and illustrate interesting tradeoffs between the approximation quality, the running time, and the rank of the approximating matrix.
View details

Linking Users Across Domains with Location Data: Theory and Validation

Preview
Chistopher Riederer

Yunsung Kim

Augustin Chaintreau

WWW (2016) (to appear)

Ego-net Community Mining Applied to Friend Suggestion

Ismail Sebe

Ahmed Taei

Sunita Verma

Proceedings of VLDB (2016)

Preview abstract
In this paper, we present a study of the community structure
of ego-networks—the graphs representing the connections
among the neighbors of a node—for several online social
networks. Toward this goal, we design a new technique to
efficiently build and cluster all the ego-nets of a graph in
parallel (note that even just building the ego-nets efficiently
is challenging on large networks). Our experimental findings
are quite compelling: at a microscopic level it is easy to
detect high quality communities.
Leveraging on this fact we, then, develop new features
for friend suggestion based on co-occurrences of two nodes
in different ego-nets’ communities. Our new features can
be computed efficiently on very large scale graphs by just
analyzing the neighborhood of each node. Furthermore, we
prove formally on a stylized model, and by experimental
analysis that this new similarity measure outperforms the
classic local features employed for friend suggestions
View details

Expanders via Local Edge Flips

Zeyuan Allen-Zhu,

Aditya Bhaskara

Lorenzo Orecchia

Society for Industrial and Applied Mathematics (2015), pp. 259-269

Preview abstract
Designing distributed and scalable algorithms to improve network connectivity is a central topic in peer-to-peer networks. In this paper we focus on the following well-known problem: given an n-node d-regular network for d=Ω(logn), we want to design a decentralized, local algorithm that transforms the graph into one that has good connectivity properties (low diameter, expansion, etc.) without affecting the sparsity of the graph. To this end, Mahlmann and Schindelhauer introduced the random "flip" transformation, where in each time step, a random pair of vertices that have an edge decide to `swap a neighbor'. They conjectured that performing O(nd) such flips at random would convert any connected d-regular graph into a d-regular expander graph, with high probability. However, the best known upper bound for the number of steps is roughly O(n^17d^23), obtained via a delicate Markov chain comparison argument.
Our main result is to prove that a natural instantiation of the random flip produces an expander in at most O(n^2d^2√logn) steps, with high probability. Our argument uses a potential-function analysis based on the matrix exponential, together with the recent beautiful results on the higher-order Cheeger inequality of graphs. We also show that our technique can be used to analyze another well-studied random process known as the `random switch', and show that it produces an expander in O(nd) steps with high probability.
View details

Distributed Balanced Clustering via Mapping Coresets

Aditya Bhaskara

NIPS, Neural Information Processing Systems Foundation (2014)

Preview abstract
Large-scale clustering of data points in metric spaces is an important problem in mining big data sets. For many applications, we face explicit or implicit size constraints for each cluster which leads to the problem of clustering under capacity constraints or the balanced clustering'' problem. Although the balanced clustering problem has been widely studied, developing a theoretically sound distributed algorithm remains an open problem. In the present paper we develop a general framework based on mapping coresets'' to tackle this issue. For a wide range of clustering objective functions such as k-center, k-median, and k-means, our techniques give distributed algorithms for balanced clustering that match the best known single machine approximation ratios.
View details

Communities, Random Walks, and Social Sybil Defense.

Preview
Lorenzo Alvisi

Allen Clement

Alessandro Panconesi

Internet Mathematics (2014)

Reduce and aggregate: similarity ranking in multi-categorical bipartite graphs

Stefano Leonardi

WWW (2014), pp. 349-360

Preview abstract
We study the problem of computing similarity rankings in large-scale multi-categorical bipartite graphs, where the two sides of the graph represent actors and items, and the items are partitioned into an arbitrary set of categories. The problem has several real-world applications, including identifying competing advertisers and suggesting related queries in an online advertising system or finding users with similar interests and suggesting content to them. In these settings, we are interested in computing on-the-fly rankings of similar actors, given an actor and an arbitrary subset of categories of interest. Two main challenges arise: First, the bipartite graphs are huge and often lopsided (e.g. the system might receive billions of queries while presenting only millions of advertisers). Second, the sheer number of possible combinations of categories prevents the pre-computation of the results for all of them. We present a novel algorithmic framework that addresses both issues for the computation of several graph-theoretical similarity measures, including # common neighbors, and Personalized PageRank. We show how to tackle the imbalance in the graphs to speed up the computation and provide efficient real-time algorithms for computing rankings for an arbitrary subset of categories. Finally, we show experimentally the accuracy of our approach with real-world data, using both public graphs and a very large dataset from Google AdWords.
View details

Learning Entangled Single-Sample Gaussians

Preview
Flavio Chierichetti

Anirban Dasgupta

Ravi Kumar

Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014

A Local Algorithm for Finding Well-Connected Clusters

Zeyuan Allen Zhu

The 30th International Conference on Machine Learning, ICML 2013

Preview abstract
Motivated by applications of large-scale graph clustering, we study random-walk-based LOCAL algorithms whose running times depend only on the size of the output cluster, rather than the entire graph. In particular, we develop a method with better theoretical guarantee compared to all previous work, both in terms of the clustering accuracy and the conductance of the output set. We also prove that our analysis is tight, and perform empirical evaluation to support our theory on both synthetic and real data. More specifically, our method outperforms prior work when the cluster is WELL-CONNECTED. In fact, the better it is well-connected inside, the more significant improvement we can obtain. Our results shed light on why in practice some random-walk-based algorithms perform better than its previous theory, and help guide future research about local clustering.
View details

Sok: The Evolution of Sybil Defense via Social Networks

Lorenzo Alvisi

Allen Clement

Alessandro Panconesi

2013 IEEE Symposium on Security and Privacy, SP 2013

Preview abstract
Sybil attacks in which an adversary forges a potentially unbounded number of identities are a danger to distributed systems and online social networks. The goal of sybil defense is to accurately identify sybil identities. This paper surveys the evolution of sybil defense protocols that leverage the structural properties of the social graph underlying a distributed system to identify sybil identities. We make two main contributions. First, we clarify the deep connection between sybil defense and the theory of random walks. This leads us to identify a community detection algorithm that, for the first time, offers provable guarantees in the context of sybil defense. Second, we advocate a new goal for sybil defense that addresses the more limited, but practically useful, goal of securely white-listing a local region of the graph.
View details

Arrival and departure in Social Networks

Preview
Shaomei Wu

Atish Das Sarma

Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013

Milgram-routing in social networks.

Alessandro Panconesi

D. Sivakumar

Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 725-734

Preview abstract
We demonstrate how a recent model of social networks (“Affiliation Networks”) offers powerful cues in local routing within social networks, a theme made famous by sociologist Milgram’s “six degrees of separation” experiments. This model posits the existence of an “interest space” that underlies a social network; we prove that in networks produced by this model, not only do short paths exist among all pairs of nodes but natural local routing algorithms can discover them effectively. Specifically, we show that local routing can discover paths of length O(log^2 n) to targets chosen uniformly at random, and paths of length O(1) to targets chosen with probability proportional to their degrees. Experiments on the co-authorship graph derived from DBLP data confirm our theoretical results, and shed light into the power of one step of lookahead in routing algorithms for social networks.
View details

Hiring a secretary from a poset.

Ravi Kumar

Andrea Vattani

Proceedings 12th ACM Conference on Electronic Commerce (EC-2011), pp. 39-48

Preview abstract
The secretary problem lies at the core of mechanism design for online auctions. In this work we study the generalization of the classical secretary problem in a setting where there is only a partial order be- tween the elements and the goal of the algorithm is to return one of the maximal elements of the poset. This is equivalent to the setting where the seller has a multidimensional objective function with only a partial order among the outcomes. We obtain an algorithm that succeeds with probability at least?1 + l
k^{−k/(k−1)} ((1+log^{-1/(k-1)} k)^k -1) where k is the number of maximal elements in the poset and is the only information about the poset that is known to the algorithm. On the other hand, we prove an almost matching upper bound of k^{−1/(k−1)} on the success probability of any algorithm for this problem; this upper bound holds even if the algorithm knows the complete structure of the poset.
View details

Filtering: a method for solving graph problems in MapReduce.

Benjamin Moseley

Siddharth Suri

SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 85-94

Preview abstract
The MapReduce framework is currently the de facto standard used throughout both industry and academia for petabyte scale data analysis. As the input to a typical MapReduce computation is large, one of the key requirements of the framework is that the input cannot be stored on a single machine and must be processed in parallel. In this paper we describe a general algorithmic design technique in the MapReduce framework called filtering. The main idea behind filtering is to reduce the size of the input in a distributed fashion so that the resulting, much smaller, problem instance can be solved on a single machine. Using this approach we give new algorithms in the MapReduce framework for a variety of fundamental graph problems. Specifically, we present algorithms for minimum spanning trees, maximal matchings, approximate weighted matchings, approximate vertex and edge covers and minimum cuts. In all of these cases, we will parameterize our algorithms by the amount of memory available on the machines allowing us to show tradeoffs between the memory available and the number of MapReduce rounds. For each setting we will show that even if the machines are only given substantially sublinear memory, our algorithms run in a constant number of MapReduce rounds. To demonstrate the practical viability of our algorithms we implement the maximal matching algorithm that lies at the core of our analysis and show that it achieves a significant speedup over the sequential version.
View details

Affiliation Networks

D. Sivakumar

Proceedings of the 41st Annual ACM Symposium on Theory of Computing, ACM (2009), pp. 427-434

Preview abstract
In the last decade, structural properties of several naturally arising networks (the Internet, social networks, the web graph, etc.) have been studied intensively with a view to understanding their evolution. In recent empirical work, Leskovec, Kleinberg, and Faloutsos identify two new and surprising properties of the evolution of many real-world networks: densification (the ratio of edges to vertices grows over time), and shrinking diameter (the diameter reduces over time to a constant). These properties run counter to conventional wisdom, and are certainly inconsistent with graph models prior to their work.
In this paper, we present the first model that provides a simple, realistic, and mathematically tractable generative model that intrinsically explains all the well-known properties of the social networks, as well as densification and shrinking diameter. Our model is based on ideas studied empirically in the social sciences, primarily on the groundbreaking work of Breiger (1973) on bipartite models of social networks that capture the affiliation of agents to societies.
We also present algorithms that harness the structural consequences of our model. Specifically, we show how to overcome the bottleneck of densification in computing shortest paths between vertices by producing sparse subgraphs that preserve or approximate shortest distances to all or a distinguished subset of vertices. This is a rare example of an algorithmic benefit derived from a realistic graph model.
Finally, our work also presents a modular approach to connecting random graph paradigms (preferential attachment, edge-copying, etc.) to structural consequences (heavy-tailed degree distributions, shrinking diameter, etc.)
View details

Efficient computation of Weighted Clustering Coefficient

Models for the Compressible Web

Flavio Chierichetti

Ravi Kumar

Alessandro Panconesi

SIAM Journal on Computing (2013)

Filtering: a method for solving graph problems in MapReduce

An Algorithmic Treatment of Strong Queries.

An algorithmic treatment of strong queries

Hiring a secretary from a poset

Ravi Kumar

Andrea Vattani

ACM Conference on Electronic Commerce (2011), pp. 39-48

Almost tight bounds for rumor spreading with conductance.

Rumour spreading and graph conductance.

Rumor Spreading in Social Networks.

Models for the Compressible Web

Flavio Chierichetti

Ravi Kumar

Alessandro Panconesi

FOCS (2009), pp. 331-340

Models for the Compressible Web.

On compressing social networks

Flavio Chierichetti

Ravi Kumar

Michael Mitzenmacher

Alessandro Panconesi

KDD (2009), pp. 219-228

Rumor Spreading in Social Networks.

On compressing social networks.

Flavio Chierichetti

Ravi Kumar

Michael Mitzenmacher

Alessandro Panconesi

KDD2009

Gossiping (via mobile?) in social networks.

On placing skips optimally in expectation.