Alessandro Epasto
I am a staff research scientist at Google, New York, working on the Algorithms and Optimization team led by Vahab Mirrokni.
I received a Ph.D. in computer science from Sapienza University of Rome, where I was advised by Professor Alessandro Panconesi and supported by the 2011 Google Europe Ph.D. Fellowship in Algorithms.
I was also a postdoctoral researcher in the Department of Computer Science at Brown University in Providence, RI, USA, where I was advised by Professor Eli Upfal.
My research interests include algorithmic problems in machine learning and data mining, in particular in the areas of clustering, privacy, and large-scale graph analysis.
Authored Publications
Google Publications
Other Publications
Abstract
The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible.
Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental $\ell_p$ $(p\in [0,+\infty))$ frequency moment estimation problem under this setting, and give an $\varepsilon$-DP algorithm that achieves $(1+\eta)$-relative approximation $(\forall \eta\in(0,1))$ with $\mathrm{poly}\log(Tn)$ additive error and uses $\mathrm{poly}\log(Tn)\cdot \max(1, n^{1-2/p})$ space, where $T$ is the length of the stream and $n$ is the size of the universe of elements.
Our space bound is near-optimal up to polylogarithmic factors, even in the non-private setting.
To obtain our results, we first reduce several primitives under the differentially private continual release model (counting distinct elements, heavy hitters, and counting low-frequency elements) to simpler counting/summing problems in the same setting.
Based on these primitives, we develop a differentially private continual release level set estimation approach to address the $\ell_p$ frequency moment estimation problem.
We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past $W$ data items.
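The reductions above bottom out in continual-release private counting, for which the standard tool is the binary-tree mechanism: each stream item is charged to O(log T) dyadic intervals, each carrying independent Laplace noise. The sketch below is the textbook mechanism, not this paper's algorithm, and the function name and interface are my own.

```python
import math
import random

def private_prefix_counts(stream, eps, rng=None):
    """Binary-tree mechanism for eps-DP continual-release counting.

    Maintains partial sums over dyadic intervals; each stream item affects
    O(log T) intervals, so Laplace noise of scale (log T)/eps per interval
    suffices for eps-DP over the whole output sequence. Returns a noisy
    prefix count after every item of a 0/1 stream.
    """
    rng = rng or random.Random()
    T = len(stream)
    levels = max(1, math.ceil(math.log2(T + 1)))
    scale = levels / eps  # per-node Laplace scale
    sums, noise = {}, {}
    out = []
    for t in range(1, T + 1):
        x = stream[t - 1]
        # add the new item to every dyadic interval containing position t
        for lvl in range(levels):
            node = (lvl, (t - 1) >> lvl)
            sums[node] = sums.get(node, 0) + x
        # decompose [1, t] into O(log T) disjoint dyadic intervals
        est, s = 0.0, 0
        for lvl in reversed(range(levels)):
            if t - s >= (1 << lvl):
                node = (lvl, s >> lvl)
                if node not in noise:
                    # Laplace(scale) = difference of two exponentials
                    noise[node] = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
                est += sums[node] + noise[node]
                s += 1 << lvl
        out.append(est)
    return out
```

With a very large eps the noise vanishes and the outputs approach the exact prefix sums; the additive error for fixed eps grows only polylogarithmically in T, matching the flavor of the bounds quoted above.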
Scalable Differentially Private Clustering via Hierarchically Separated Trees
Chris Schwiegelshohn
David Saulpic
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2022) (to appear)
Abstract
We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space.
By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods.
We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear $\tilde{O}(nkd)$ time, and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular, we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime.
Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.
Differentially Private Graph Learning via Sensitivity-Bounded Personalized PageRank
Advances in Neural Information Processing Systems (2022)
Abstract
Personalized PageRank (PPR) is a fundamental tool in unsupervised learning of graph representations such as node ranking, labeling, and graph embedding. However, while data privacy is one of the most important recent concerns, existing PPR algorithms are not designed to protect user privacy. PPR is highly sensitive to the input graph edges: a difference of only one edge may cause a large change in the PPR vector, potentially leaking private user data.
In this work, we propose an algorithm which outputs an approximate PPR and has provably bounded sensitivity to input edges. In addition, we prove that our algorithm achieves similar accuracy to non-private algorithms when the input graph has large degrees. Our sensitivity-bounded PPR directly implies private algorithms for several tools of graph learning, such as, differentially private (DP) PPR ranking, DP node classification, and DP node embedding. To complement our theoretical analysis, we also empirically verify the practical performances of our algorithms.
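For intuition on the sensitivity issue, the standard non-private local-push approximation of PPR (in the style of Andersen, Chung, and Lang) makes the edge dependence concrete: every push reads one node's degree and spreads residual mass along its edges. This is a hedged sketch of that classic baseline, not the paper's sensitivity-bounded algorithm; the names and interface are illustrative.

```python
from collections import deque

def ppr_push(adj, source, alpha=0.15, tol=1e-6):
    """Approximate the Personalized PageRank vector of `source` via local
    pushes. `adj` maps each node to its neighbor list (every node is
    assumed to have degree >= 1, no self-loops). Mass is conserved:
    settled mass p plus residual r always sums to 1, and each residual
    entry ends below tol * degree."""
    p, r = {}, {source: 1.0}
    work = deque([source])
    while work:
        u = work.popleft()
        ru = r.get(u, 0.0)
        if ru < tol * len(adj[u]):
            continue  # stale queue entry
        r[u] = 0.0
        p[u] = p.get(u, 0.0) + alpha * ru          # settle an alpha fraction
        share = (1.0 - alpha) * ru / len(adj[u])   # spread the rest evenly
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= tol * len(adj[v]):
                work.append(v)
    return p
```

Changing a single edge changes one degree and one push target, which is exactly the per-edge sensitivity that a private variant must bound and noise.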
Near-Optimal Private and Scalable k-Clustering
Shyam Narayanan
NeurIPS 2022 (2022)
Abstract
We study the differentially private (DP) $k$-means and $k$-median clustering problems of $n$ points in $d$-dimensional Euclidean space in the massively parallel computation (MPC) model. We provide two near-optimal algorithms, where the near-optimality is in three aspects: they both achieve (1) $O(1)$ parallel computation rounds, (2) near-linear-in-$n$ and polynomial-in-$k$ total computational work (i.e., near-linear running time in the sequential setting), and (3) $O(1)$ relative approximation and $\text{poly}(k, d)$ additive error, where $\Omega(1)$ relative approximation is provably necessary even for any polynomial-time non-private algorithm, and $\Omega(k)$ additive error is a provable lower bound for any polynomial-time DP $k$-means/median algorithm. Our two algorithms provide a tradeoff between the relative approximation and the additive error: the first has $O(1)$ relative approximation and $\sim (k^{2.5} + k^{1.01} \sqrt{d})$ additive error, and the second achieves $(1+\gamma)$ relative approximation to the optimal non-private algorithm for an arbitrarily small constant $\gamma>0$, with $\text{poly}(k, d)$ additive error that has a larger polynomial dependence on $k$ and $d$.
To achieve our result, we develop a general framework which partitions the data and reduces the DP clustering problem for the entire dataset to the DP clustering problem for each part. To control the blow-up of the additive error introduced by each part, we develop a novel charging argument which might be of independent interest.
Bisect and Conquer: Hierarchical Clustering via Max-Uncut Bisection
Vaggos Chatziafratis
AISTATS (2020) (to appear)
Abstract
Hierarchical Clustering is an unsupervised data analysis method which has been widely used for decades. Despite its popularity, it had an underdeveloped analytical foundation; to address this, Dasgupta recently introduced an optimization viewpoint of hierarchical clustering with pairwise similarity information that spurred a line of work shedding light on old algorithms (e.g., Average-Linkage), but also designing new ones. Here, for the maximization dual of Dasgupta's objective (introduced by Moseley-Wang), we present polynomial-time 42.46% approximation algorithms that use Max-Uncut Bisection as a subroutine. The previous best polynomial-time worst-case approximation factor was 33.6%, improving only slightly over Average-Linkage, which achieves 33.3%. Finally, we complement our positive results by providing APX-hardness (even for 0-1 similarities) under the Small Set Expansion hypothesis.
Fair Hierarchical Clustering
Benjamin Moseley
Marina Knittel
Yuyan Wang
NeurIPS 2020
Abstract
As machine learning has become more and more integrated into our businesses and lifestyles, researchers have begun to recognize the necessity of ensuring machine learning systems are fair. Recently, there has been an interest in defining a notion of fairness that mitigates over-representation in traditional clustering.
In this paper we extend this notion to hierarchical clustering, where the goal is to recursively partition the data to optimize a certain objective (in the sense of Dasgupta). For various natural objectives, we obtain simple, efficient algorithms to find a provably good fair hierarchical clustering. Empirically, we show that our algorithms can find a fair hierarchical clustering, surprisingly, with only a negligible loss in the objective.
Sliding Window Algorithms for k-Clustering Problems
Michele Borassi
NeurIPS 2020 (to appear)
Abstract
The sliding window model of computation captures scenarios in which data is arriving continuously, but only the latest $w$ elements should be used for analysis. The goal is to design algorithms that update the solution efficiently with each arrival rather than recomputing it from scratch. In this work, we focus on $k$-clustering problems such as $k$-means and $k$-median. In this setting, we give simple and practical algorithms that come with stronger performance guarantees than previously known results. Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with cost only slightly worse than those returned by algorithms that have access to the full dataset.
Fair Correlation Clustering
23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020) (to appear)
Abstract
In this paper, we study correlation clustering under fairness constraints. Fair variants of k-median and k-center clustering have been studied recently, and approximation algorithms using a notion called fairlet decomposition have been proposed. We obtain approximation algorithms for fair correlation clustering under several important types of fairness constraints.
Our results hinge on obtaining a fairlet decomposition for correlation clustering by introducing a novel combinatorial optimization problem. We define a fairlet decomposition with cost similar to the k-median cost and this allows us to obtain approximation algorithms for a wide range of fairness constraints.
We complement our theoretical results with an in-depth analysis of our algorithms on real graphs where we show that fair solutions to correlation clustering can be obtained with limited increase in cost compared to the state-of-the-art (unfair) algorithms.
On-Device Algorithms for Public-Private Data with Absolute Privacy
Proceedings of The Web Conference 2019 (WWW'19) (to appear)
Abstract
Motivated by the increasing need to preserve privacy in digital devices, we introduce the on-device public-private model of computation.
Our motivation comes from social-network based recommender systems in which the users want to receive recommendations based on the information available on their devices, as well as the suggestions of their social contacts, without sharing such information or contacts with the central recommendation system.
Our model allows us to solve many algorithmic problems while providing absolute (deterministic) guarantees of the privacy of on-device data and the user's contacts. In fact, we ensure that the private data and private contacts are never revealed to the central system.
Our restrictive model of computation presents several interesting algorithmic challenges because any computation based on private information and contacts must be performed on local devices of limited capabilities. Despite these challenges, under realistic assumptions of inter-device communication, we show several efficient algorithms for fundamental data mining and machine learning problems, ranging from k-means clustering to heavy hitters. We complement this analysis with strong impossibility results for efficient private algorithms without allowing inter-device communication. In our experimental evaluation, we show that our private algorithms provide results almost as accurate as those of the non-private ones while speeding up the on-device computations by orders of magnitude.
Scalable diversity maximization via small-size composable core-sets
31st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2019)
Abstract
Maximizing diversity is a central problem in information retrieval and data mining systems, with prominent applications in web search, recommender systems, news aggregators, and product search. In this paper, we study a diversity maximization problem (a.k.a. the maximum dispersion problem) in which, given a set of n objects in a metric space, one wants to find a subset of k objects with the maximum sum of pairwise distances.
To solve this problem in a scalable distributed manner, we apply a novel distributed framework for tackling large-scale problems known as randomized composable core-sets: divide the big data set into smaller parts, solve the problem for each part, combine the solutions from each part, and solve the problem on the union of these solutions. Our algorithms improve significantly over the approximation guarantees of state-of-the-art core-set-based algorithms while using the minimum possible intermediate output size. In particular, we present a simple distributed algorithm that achieves an almost optimal communication complexity and, moreover, asymptotically achieves an approximation factor of 1/2, which is the best possible approximation factor for the global optimization problem under certain complexity-theoretic assumptions.
Our algorithms are scalable and practical, as shown by our extensive empirical evaluation with large datasets, and they can be easily adapted to major distributed computing systems like MapReduce. Furthermore, we show empirically that, in real-life instances, our algorithms reach close-to-optimal solutions with an approximation factor of > 90%. This far exceeds the worst-case approximation barrier for the problem and provides useful output.
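The composable core-set recipe (solve per part, take the union of partial solutions, solve again on the union) can be sketched with a simple greedy heuristic for max-sum dispersion. This is an illustrative sketch, not the paper's algorithm or its guarantees; the farthest-pair seeding, the round-robin chunking, and all names are my own assumptions.

```python
import itertools
import math

def greedy_dispersion(points, k, ids=None):
    """Greedy heuristic for max-sum dispersion: seed with the farthest
    pair, then repeatedly add the candidate with the largest total
    distance to the points chosen so far. Assumes k >= 2 and that the
    candidate set has at least k points."""
    ids = list(ids if ids is not None else range(len(points)))
    a, b = max(itertools.combinations(ids, 2),
               key=lambda e: math.dist(points[e[0]], points[e[1]]))
    chosen, rest = [a, b], [i for i in ids if i not in (a, b)]
    while len(chosen) < k:
        nxt = max(rest, key=lambda i: sum(math.dist(points[i], points[c])
                                          for c in chosen))
        chosen.append(nxt)
        rest.remove(nxt)
    return chosen

def coreset_dispersion(points, k, parts=4):
    """Composable core-set pattern: solve on each part, then solve once
    more on the (small) union of the partial solutions."""
    union = []
    for p in range(parts):
        chunk = list(range(p, len(points), parts))  # round-robin split
        union.extend(greedy_dispersion(points, k, chunk))
    return greedy_dispersion(points, k, union)
```

Each part contributes only k candidate indices, so the intermediate output size is k per machine, which is the "minimum possible intermediate output size" regime the abstract refers to.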
Abstract
Clustering is a fundamental problem in unsupervised machine learning. In many applications, clustering needs to be performed in presence of additional constraints, such as fairness or diversity constraints.
In this paper, we formulate the problem of k-center clustering without over-representation, propose approximation algorithms to solve the problem, and prove hardness results. We empirically evaluate our clusterings on real-world datasets and show that fairness can be obtained with a limited effect on clustering quality.
Abstract
Recent interest in graph embedding methods has focused on learning a single representation for each node in the graph.
But can nodes really be best described by a single vector representation?
In this work, we propose a method for learning multiple representations of the nodes in a graph (e.g., the users of a social network).
Based on a principled decomposition of the ego-network, each representation encodes the role of the node in a different local community in which the node participates.
These representations allow for improved reconstruction of the nuanced relationships that occur in the graph -- a phenomenon that we illustrate through state-of-the-art results on link prediction tasks on a variety of graphs, reducing the error by up to $90\%$.
In addition, we show that these embeddings allow for effective visual analysis of the learned community structure.
Better Sliding Window Algorithms to Maximize Subadditive and Diversity Objectives
Michele Borassi
Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2019)
Abstract
The streaming computation model is a standard model for large-scale data analysis: the input arrives one element at a time, and the goal is to maintain an approximately optimal solution using only a constant, or, at worst, poly-logarithmic space.
In practice, however, recency plays a large role, and one often wishes to consider only the last w elements that have arrived, the so-called sliding window problem. A trivial approach is to simply store the last w elements in a buffer; our goal is to develop algorithms with space and update time sublinear in w. In this regime, there are two frameworks: exponential histograms and smooth histograms, which can be used to obtain sliding window algorithms for families of functions satisfying certain properties.
Unfortunately, these frameworks have limitations and cannot always be applied directly. A prominent example is the case of a submodular function with cardinality constraints. Some of these difficulties can be rectified, but often only on a case-by-case basis. Here, we describe an alternative approach to designing efficient sliding window algorithms for maximization problems. Then we instantiate this approach on a wide range of problems, yielding better algorithms for submodular function optimization, diversity optimization and general subadditive optimization. In doing so, we improve state-of-the-art results obtained using problem-specific algorithms.
Incentive-Aware Learning for Large Markets
Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pp. 1369-1378
Abstract
In a typical learning problem, one key step is to use training data to pick one model from a collection of models that optimizes an objective function. In many multi-agent settings, the training data is generated through the actions of the agents, and the model is used to make a decision (e.g., how to sell an item) that affects the agents. An illustrative example is the problem of learning the reserve price in an auction. In such cases, the agents have an incentive to influence the training data (e.g., by manipulating their bids in the case of an auction) to game the system and achieve a more favorable outcome. In this paper, we study such incentive-aware learning problems in a general setting and show that it is possible to approximately optimize the objective function under two assumptions: (i) each individual agent is a "small" part of the market; and (ii) there is a cost associated with manipulation. For our illustrative application, this nicely translates to a mechanism for setting approximately optimal reserve prices in auctions where no individual agent has significant market share. For this application, we also show that the second assumption (that manipulations are costly) is not necessary, since we can "perturb" any auction to make it costly for the agents to manipulate.
Abstract
Covering the edges of a bipartite graph by a minimum set of complete bipartite graphs (bicliques) is a basic graph-theoretic problem with numerous applications. In particular, it is used to characterize parsimonious models of a set of observations (each biclique corresponds to a factor or feature that relates the observations in the two sets of nodes connected by the biclique). The decision version of the minimum biclique cover problem is NP-complete, and unless P=NP, the cover size cannot be approximated in general within less than a sub-linear factor of the number of nodes (or edges) in the graph.
In this work, we consider two natural restrictions of the problem, motivated by practical applications. In the first case, we restrict the number of bicliques a node can belong to. We show that when this number is at least 5, the problem is still NP-hard. In contrast, we show that when nodes belong to no more than 2 bicliques, the problem admits efficient approximations.
The second model we consider corresponds to observing a set of independent samples from an unknown model governed by a possibly large number of factors. The model is defined by a bipartite graph $G=(L,R,E)$, where each node in $L$ is assigned to an arbitrary subset of up to a constant $f$ factors, while the nodes in $R$ (the independent observations) are assigned to random subsets of the set of $k$ factors, where $k$ can grow with the size of the graph. We show that this practical version of the biclique cover problem is amenable to efficient approximations.
Submodular Optimization Over Sliding Windows
Proceedings of the 26th International World Wide Web Conference, WWW (2017)
Abstract
Maximizing submodular functions under cardinality constraints lies at the core of numerous data mining and machine learning applications, including data diversification, data summarization, and coverage problems. In this work, we study this question in the context of data streams, where elements arrive one at a time, and we want to design low-memory and fast update-time algorithms that maintain a good solution. Specifically, we focus on the sliding window model, where we are asked to maintain a solution that considers only the last $W$ items.
In this context, we provide the first non-trivial algorithm that maintains a provable approximation of the optimum using space sublinear in the size of the window. In particular, we give a $1/3 - \epsilon$ approximation algorithm that uses space polylogarithmic in the spread of the values of the elements and linear in the solution size $k$, for any constant $\epsilon > 0$. At the same time, processing each element requires only a polylogarithmic number of evaluations of the function itself. When a better approximation is desired, we show a different algorithm that, at the cost of using more memory, provides a $1/2 - \epsilon$ approximation and allows a tunable trade-off between average update time and space. This algorithm matches the best known approximation guarantees for submodular optimization in insertion-only streams, a less general formulation of the problem.
We demonstrate the efficacy of the algorithms on a number of real-world datasets, showing that their practical performance far exceeds the theoretical bounds. The algorithms preserve high-quality solutions in streams with millions of items, while storing a negligible fraction of them.
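The insertion-only baseline the abstract compares against can be illustrated with a single-threshold streaming greedy for monotone submodular maximization, in the style of sieve-streaming. This is a hedged sketch under the assumption of a single guess `opt_guess` of the optimum value (the full method runs geometrically spaced guesses in parallel); it is not the sliding-window algorithm of the paper, and the names are mine.

```python
def threshold_stream(stream, f, k, opt_guess):
    """Single-threshold streaming greedy for a monotone submodular set
    function f under a cardinality constraint k: keep an arriving element
    iff its marginal gain reaches (opt_guess/2 - f(S)) / (k - |S|).
    Running this over a geometric grid of guesses of OPT underlies the
    (1/2 - eps) guarantee of sieve-streaming."""
    S = []
    for e in stream:
        if len(S) == k:
            break
        gain = f(S + [e]) - f(S)
        if gain >= (opt_guess / 2 - f(S)) / (k - len(S)):
            S.append(e)
    return S
```

Note that this keeps only O(k) elements per guess, which is exactly the low-memory regime that the sliding window model then has to preserve while also expiring items older than $W$.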
Abstract
We propose a new framework called Ego-splitting for detecting clusters in complex networks, which leverages the local structures known as ego-nets (i.e., the subgraph induced by the neighborhood of each node) to de-couple overlapping clusters. Ego-splitting is a highly scalable and flexible framework, with provable theoretical guarantees, that reduces the complex overlapping clustering problem to a simpler and more amenable non-overlapping (partitioning) problem. We can solve community detection in graphs with tens of billions of edges and outperform previous solutions based on ego-net analysis.
More precisely, our framework works in two steps: a local ego-net analysis, and a global graph partitioning. In the local step, we first partition each node's ego-net using non-overlapping clustering. We then use these clusters to split each node of the graph into persona nodes that represent the instantiations of the node in its communities. Then, in the global step, we partition these new persona nodes to obtain an overlapping clustering of the original graph.
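The two steps can be made concrete with a small sketch. Here the local clustering is taken to be the connected components of each ego-net, which is the simplest possible instantiation (the framework allows any partitioning algorithm); the function names are illustrative, not the paper's code.

```python
from collections import defaultdict

def connected_components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        stack, comp = [s], []
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def persona_graph(adj):
    """Split each node into one persona per cluster of its ego-net, then
    rewire every original edge between the matching personas."""
    persona_of = {}  # (u, neighbor v) -> id of u's persona that v sees
    next_id = 0
    for u, nbrs in adj.items():
        ego = set(nbrs)  # ego-net of u: its neighbors, without u itself
        for comp in connected_components(ego, adj):
            for v in comp:
                persona_of[(u, v)] = next_id
            next_id += 1
    pg = defaultdict(set)
    for u, nbrs in adj.items():
        for v in nbrs:
            a, b = persona_of[(u, v)], persona_of[(v, u)]
            pg[a].add(b)
            pg[b].add(a)
    return pg
```

Any non-overlapping partition of the persona graph then maps back to an overlapping clustering of the original graph, since a node inherits the clusters of all of its personas.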
The Spread of Physical Activity Through Social Networks
Proceedings of the 26th International World Wide Web Conference, WWW (2017)
Abstract
We study the evolution of daily physical activity of 44.5K users on the Fitbit social network over a period of eight months. A time-aggregated analysis shows that average alter activity, gender, and body mass index (BMI) are significantly predictive of ego activity when controlling for ego BMI, gender, and number of friends. The direction and effect size of the associations surfaced vary when considering chronic conditions self-reported by a large portion of the users, including diabetes, dyslipidemia, hypertension, and depression. When considering the co-evolution of activity and friendship on a month-by-month basis in a within-subject analysis, we show via fixed-effects modeling that fluctuations in average alter activity significantly predict fluctuations in ego activity. Finally, we investigate the causal factors that may drive change in physical activity over time. We leverage a class of novel non-parametric statistical tests to rule out homophily as the sole source of dependence in activity, even in the presence of unobserved individual traits.
Bicriteria Distributed Submodular Maximization in a Few Rounds
Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (2017)
Abstract
We study the problem of efficiently optimizing submodular functions under cardinality constraints in a distributed setting. Recently, several distributed algorithms for this problem have been introduced, which either achieve a sub-optimal solution or run in a super-constant number of rounds of computation. Unlike previous work, we aim to design distributed algorithms with multiple rounds and almost optimal approximation guarantees, at the cost of outputting a larger number of elements. Toward this goal, we present a distributed algorithm that, for any ε > 0 and any constant r, outputs a set S of O(rk/ε^(1/r)) items in r rounds, and achieves a (1-ε)-approximation of the value of the optimum set with k items. This is the first distributed algorithm that achieves an approximation factor of (1-ε) running in fewer than log(1/ε) rounds. We also prove a hardness result showing that the output of any (1-ε)-approximation distributed algorithm limited to one distributed round must have at least Ω(k/ε) items. In light of this hardness result, our distributed algorithm in one round, r = 1, is asymptotically tight in terms of the output size. We support the theoretical guarantees with an extensive empirical study of our algorithm showing that achieving almost optimum solutions is indeed possible in a few rounds for large-scale real datasets.
Ego-net Community Mining Applied to Friend Suggestion
Ismail Sebe
Ahmed Taei
Sunita Verma
Proceedings of VLDB (2016)
Abstract
In this paper, we present a study of the community structure of ego-networks (the graphs representing the connections among the neighbors of a node) for several online social networks. Toward this goal, we design a new technique to efficiently build and cluster all the ego-nets of a graph in parallel (note that even just building the ego-nets efficiently is challenging on large networks). Our experimental findings are quite compelling: at a microscopic level it is easy to detect high-quality communities.
Leveraging this fact, we then develop new features for friend suggestion based on co-occurrences of two nodes in different ego-nets' communities. Our new features can be computed efficiently on very large-scale graphs by just analyzing the neighborhood of each node. Furthermore, we prove formally on a stylized model, and show by experimental analysis, that this new similarity measure outperforms the classic local features employed for friend suggestion.
Abstract
We present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions.
Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all times. This is in contrast with previous approaches which use hard-to-choose parameters (e.g., a fixed sampling probability) and offer no guarantees on the amount of memory they will use.
We show a full analysis of the variance of the estimations and novel concentration bounds for these quantities. Our experimental results on very large graphs show that TRIÈST outperforms state-of-the-art approaches in accuracy and exhibits small update times.
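The core idea (edge reservoir sampling with an importance-weighted count) can be sketched in a few lines. The sketch below is a simplified insertion-only estimator in the spirit of the TRIÈST-IMPR variant; the actual algorithms also handle deletions and local (per-vertex) counts, and the class name and interface are my own. It assumes each undirected edge arrives exactly once.

```python
import random
from collections import defaultdict

class ReservoirTriangles:
    """Simplified insertion-only reservoir estimator for the global
    triangle count. Keeps at most `memory` edges; each arriving edge is
    charged the triangles it closes inside the sample, weighted by the
    inverse probability that both wedge edges survived in the sample."""

    def __init__(self, memory, seed=None):
        self.M = memory          # reservoir capacity (assume M >= 2)
        self.t = 0               # number of edges seen so far
        self.sample = set()
        self.adj = defaultdict(set)
        self.estimate = 0.0
        self.rng = random.Random(seed)

    def _weight(self):
        # importance weight of a triangle closed at time t
        return max(1.0, (self.t - 1) * (self.t - 2) / (self.M * (self.M - 1)))

    def insert(self, u, v):
        self.t += 1
        # count triangles closed by (u, v) before (possibly) sampling it
        self.estimate += self._weight() * len(self.adj[u] & self.adj[v])
        if len(self.sample) < self.M:
            self._add(u, v)
        elif self.rng.random() < self.M / self.t:
            self._remove(*self.rng.choice(sorted(self.sample)))
            self._add(u, v)

    def _add(self, u, v):
        self.sample.add((u, v))
        self.adj[u].add(v)
        self.adj[v].add(u)

    def _remove(self, u, v):
        self.sample.discard((u, v))
        self.adj[u].discard(v)
        self.adj[v].discard(u)
```

When the memory budget is at least the stream length the weight is 1 and every edge is kept, so the estimate equals the exact triangle count; with a smaller budget the sample and the weight trade accuracy for fixed space.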
Abstract
Densest subgraph computation has emerged as an important primitive in a wide range of data analysis tasks such as community and event detection. Social media such as Facebook and Twitter are highly dynamic, with new friendship links and tweets being generated incessantly, calling for efficient algorithms that can handle very large and highly dynamic input data. While either scalable or dynamic algorithms for finding densest subgraphs have been proposed, a viable and satisfactory solution for addressing both the dynamic aspect of the input data and its large size is still missing. We study the densest subgraph problem in the dynamic graph model, for which we present the first scalable algorithm with provable guarantees. In our model, edges are added adversarially while they are removed uniformly at random from the current graph. We show that at any point in time we are able to maintain a 2(1+ε)-approximation of the current densest subgraph, while requiring O(polylog(n+r)) amortized cost per update (with high probability), where r is the total number of update operations executed and n is the maximum number of nodes in the graph. In contrast, a naive algorithm that recomputes a densest subgraph every time the graph changes requires Ω(m) work per update, where m is the number of edges in the current graph. Our theoretical analysis is complemented with an extensive experimental evaluation on large real-world graphs, showing that (approximate) densest subgraphs can be maintained efficiently within hundreds of microseconds per update.
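The static baseline behind the 2(1+ε) quality target is Charikar's greedy peeling: repeatedly remove a minimum-degree node and return the densest intermediate subgraph, a 1/2-approximation to the maximum of |E(S)|/|S|. The sketch below is that classic static algorithm, not the paper's dynamic one; recomputing it per update is exactly the Ω(m)-per-update naive approach the abstract contrasts against.

```python
import heapq
from collections import defaultdict

def densest_subgraph_peel(edges):
    """Charikar's greedy peeling for max |E(S)| / |S| on a simple
    undirected graph. Removes a minimum-degree node at each step and
    remembers the densest subgraph seen (a 1/2-approximation)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)
    m = sum(len(s) for s in adj.values()) // 2
    best_density, best_removed = m / n, 0
    heap = [(len(adj[u]), u) for u in adj]
    heapq.heapify(heap)
    removed, order = set(), []
    while heap:
        d, u = heapq.heappop(heap)
        if u in removed or d != len(adj[u]):
            continue  # stale heap entry
        removed.add(u)
        order.append(u)
        for v in adj[u]:
            adj[v].discard(u)
            heapq.heappush(heap, (len(adj[v]), v))
        adj[u].clear()
        m -= d
        k = n - len(removed)
        if k and m / k > best_density:
            best_density, best_removed = m / k, len(removed)
    return best_density, set(order[best_removed:])
```

The peeling order also explains why random edge deletions are tractable dynamically: removing a random edge rarely changes which prefix of the order is densest.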
Abstract
We introduce the public-private model of graphs. In this model, we have a public graph and each node in the public graph has an associated private graph. The motivation for studying this model stems from social networks, where the nodes are the users, the public graph is visible to everyone, and the private graph at each node is visible only to the user at the node. From each node's viewpoint, the graph is just a union of its private graph and the public graph.
We consider the problem of efficiently computing various properties of the graphs from each node's point of view, with minimal amount of recomputation on the public graph. To illustrate the richness of our model, we explore two powerful computational paradigms for studying large graphs, namely, sketching and sampling, and focus on some key problems in social networks and show efficient algorithms in the public-private graph model. In the sketching model, we show how to efficiently approximate the neighborhood function, which in turn can be used to approximate various notions of centrality. In the sampling model, we focus on all-pair shortest path distances, node similarities, and correlation clustering.
Communities, Random Walks, and Social Sybil Defense
Lorenzo Alvisi
Allen Clement
Alessandro Panconesi
Internet Mathematics (2014)
Reduce and aggregate: similarity ranking in multi-categorical bipartite graphs
Stefano Leonardi
WWW (2014), pp. 349-360
Abstract
We study the problem of computing similarity rankings in large-scale multi-categorical bipartite graphs, where the two sides of the graph represent actors and items, and the items are partitioned into an arbitrary set of categories. The problem has several real-world applications, including identifying competing advertisers and suggesting related queries in an online advertising system, or finding users with similar interests and suggesting content to them. In these settings, we are interested in computing on-the-fly rankings of similar actors, given an actor and an arbitrary subset of categories of interest. Two main challenges arise: First, the bipartite graphs are huge and often lopsided (e.g., the system might receive billions of queries while presenting only millions of advertisers). Second, the sheer number of possible combinations of categories prevents the pre-computation of the results for all of them. We present a novel algorithmic framework that addresses both issues for the computation of several graph-theoretical similarity measures, including the number of common neighbors and Personalized PageRank. We show how to tackle the imbalance in the graphs to speed up the computation and provide efficient real-time algorithms for computing rankings for an arbitrary subset of categories. Finally, we show experimentally the accuracy of our approach with real-world data, using both public graphs and a very large dataset from Google AdWords.
SoK: The Evolution of Sybil Defense via Social Networks
Lorenzo Alvisi
Allen Clement
Alessandro Panconesi
2013 IEEE Symposium on Security and Privacy, SP 2013
Abstract
Sybil attacks in which an adversary forges a potentially unbounded number of identities are a danger to distributed systems and online social networks. The goal of sybil defense is to accurately identify sybil identities. This paper surveys the evolution of sybil defense protocols that leverage the structural properties of the social graph underlying a distributed system to identify sybil identities. We make two main contributions. First, we clarify the deep connection between sybil defense and the theory of random walks. This leads us to identify a community detection algorithm that, for the first time, offers provable guarantees in the context of sybil defense. Second, we advocate a new goal for sybil defense that addresses the more limited, but practically useful, goal of securely white-listing a local region of the graph.
Reconstructing Hidden Permutations Using the Average-Precision (AP) Correlation Statistic
Lorenzo De Stefani
Eli Upfal
Fabio Vandin
In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016
Spreading Rumours without the Network
Pawel Brach
Alessandro Panconesi
Piotr Sankowski
In Proceedings of the 2nd ACM Conference on Online Social Networks (COSN), 2014