# Edith Cohen

I am a Research Scientist at Google (Mountain View, CA) since 2015. I am also a (visiting) full professor at the School of Computer Science at Tel Aviv University in Israel. I attended
Tel Aviv University
(Israel)
for my undergraduate studies in math, physics, and
computer science, graduating in 1985, and continued to obtain an M.Sc.
in computer science in 1986, supervised by Michael Tarsi. I then headed to
Stanford, CA, where I
worked on a Ph.D. in Computer
Science,
with
Andrew Goldberg and Nimrod Megiddo, graduating in 1991.
From 1991 to 2012, I was a member of the research staff, first at the
legendary AT&T Bell Laboratories , and after a 1997 split, at AT&T Labs Research. In 1997 I also visited the Computer Science Division at UC Berkeley. From 2012 to 2014 I was a principal researcher at Microsoft Research (Silicon Valley).

My research interests include Algorithms design, Data Mining, Machine Learning, Optimization, Networks, and more. I prefer a principled approach to modeling and design and I am always looking to learn and expand my horizons. In the span of my research career, I developed models and scalable algorithms in a wide range of areas including query processing and optimization, data summarization, content search and delivery, caching, prefetching, routing, streaming and distributed computation, fundamental graph algorithms and scalable mining of massive graphs. My work enables the leveraging of much larger data sets than previously possible. More details on my work can be found on my personal web page.

### Research Areas

Authored Publications

Google Publications

Other Publications

Sort By

HyperLogLog Hyper Extended: Sketches for Concave Sublinear Frequency Statistics

KDD (2017) (to appear)

Preview abstract
One of the most common statistics computed over data elements is the number of distinct keys. A thread of research pioneered by Flajolet and Martin three decades ago culminated in the design of optimal approximate counting sketches, which have size that is double logarithmic in the number of distinct keys and provide estimates with a small relative error. Moreover, the sketches are composable, and thus suitable for streamed, parallel, or distributed computation.
We consider here all statistics of the frequency distribution of keys, where a contribution of a key to the aggregate is concave and grows (sub)linearly with its frequency. These fundamental aggregations are very common in text, graphs, and logs analysis and include logarithms, low frequency moments, and capping statistics.
We design composable sketches of double-logarithmic size for all concave sublinear statistics. Our design combines theoretical optimality and practical simplicity. In a nutshell, we specify tailored mapping functions of data elements to output elements so that our target statistics on the data elements is approximated by the (max-) distinct statistics of the output elements, which can be approximated using off-the-shelf sketches. Our key insight is relating these target statistics to the {\em complement Laplace} transform of the input frequencies.
View details

Greedy Maximization Framework for Graph-based Influence Functions

HotWeb, IEEE (2016)

Preview abstract
The study of graph-based submodular maximization problems was initiated in a seminal work of Kempe, Kleinberg, and Tardos (2003): An {\em influence} function of subsets of nodes is defined by the graph structure and the aim is to find subsets of seed nodes with (approximately) optimal tradeoff of size and influence. Applications include viral marketing, monitoring, and active learning of node labels. This powerful formulation was studied for (generalized) {\em coverage} functions, where the influence of a seed set on a node is the maximum utility of a seed item to the node, and for pairwise {\em utility} based on reachability, distances, or reverse ranks.
We define a rich class of influence functions which unifies and extends previous work beyond coverage functions and specific utility functions. We present a meta-algorithm for approximate greedy maximization with strong approximation quality guarantees and worst-case near-linear computation for all functions in our class. Our meta-algorithm generalizes a recent design by Cohen et al (2014) that was specific for distance-based coverage functions.
View details

Multi-Objective Weighted Sampling

HotWeb 2015 (to appear)

Preview abstract
Key value data sets of the form $\{(x,w_x)\}$ where $w_x >0$ are prevalent. Common queries over such data are {\em segment $f$-statistics} $Q(f,H) = \sum_{x\in H}f(w_x)$, specified for a segment $H$ of the keys and a function $f$. Different choices of $f$ correspond to count, sum, moments, cap, and threshold statistics.
When the data set is large, we can compute a smaller sample from which we can quickly estimate statistics. A weighted sample of keys taken with respect to $f(w_x)$ provides estimates with statistically guaranteed quality for $f$-statistics. Such a sample $S^{(f)}$ can be used to estimate $g$-statistics for $g\not=f$, but quality degrades with the disparity between $g$ and $f$. In this paper we address applications that require quality estimates for a set $F$ of different functions. A naive solution is to compute and work with a different sample $S^{(f)}$ for each $f\in F$. Instead, this can be achieved more effectively and seamlessly using a single {\em multi-objective} sample $S^{(F)}$ of a much smaller size.
We review multi-objective sampling schemes and place them in our context of estimating $f$-statistics. We show that a multi-objective sample for $F$ provides quality estimates for any $f$ that is a positive linear combination of functions from $F$. We then establish a surprising and powerful result when the target set $M$ is {\em all} monotone non-decreasing functions, noting that $M$ includes most natural statistics. We provide efficient multi-objective sampling algorithms for $M$ and show that a sample size of $k \ln n$ (where $n$ is the number of active keys) provides the same estimation quality, for any $f\in M$, as a dedicated weighted sample of size $k$ for $f$.
View details

Stream Sampling for Frequency Cap Statistics

CoRR, vol. abs/1502.05955 (2015)

All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis

IEEE Trans. Knowl. Data Eng., vol. 27 (2015), pp. 2320-2334

Multi-Objective Weighted Sampling

CoRR, vol. abs/1509.07445 (2015)

Stream Sampling for Frequency Cap Statistics

KDD (2015), pp. 159-168

Min-Hash Sketches

Encyclopedia of Algorithms (2015)

All-Distances Sketches

Encyclopedia of Algorithms (2015)

Reverse Ranking by Graph Structure: Model and Scalable Algorithms

Coordinated Sampling

Encyclopedia of Algorithms (2015)

Sketch-based Influence Maximization and Computation: Scaling up with Guarantees

Computing classic closeness centrality, at scale

Sketch-based Influence Maximization and Computation: Scaling up with Guarantees

All-distances sketches, revisited: HIP estimators for massive graphs analysis

PODS (2014), pp. 88-99

Variance Competitiveness for Monotone Estimation: Tightening the Bounds

CoRR, vol. abs/1406.6490 (2014)

Distance queries from sampled data: accurate and efficient

KDD (2014), pp. 681-690

Timed Influence: Computation and Maximization

Algorithms and estimators for summarization of unaggregated data streams

Nick G. Duffield

Haim Kaplan

Carsten Lund

Mikkel Thorup

J. Comput. Syst. Sci., vol. 80 (2014), pp. 1214-1244

Author retrospective for search and replication in unstructured peer-to-peer networks

Computing Classic Closeness Centrality, at Scale

Probe scheduling for efficient detection of silent failures

Haim Kaplan

Yishay Mansour

Yoav Tzur

Perform. Eval., vol. 79 (2014), pp. 73-89

Estimation for monotone sampling: competitiveness and customization

PODC (2014), pp. 124-133

Scalable Neighborhood Sketching and Distance Distribution Estimation in Graph Datasets: Revisited, Unified, and Improved

CoRR, vol. abs/1306.3284 (2013)

A Labeling Approach to Incremental Cycle Detection

Scheduling Subset Tests: One-Time, Continuous, and How They Relate

Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths

Daniel Delling

Fabian Fuchs

Andrew V. Goldberg

Moisés Goldszmidt

Renato F. Werneck

COSN (2013), pp. 131-142

On the Tradeoff between Stability and Fit

Probe Scheduling for Efficient Detection of Silent Failures

Haim Kaplan

Yishay Mansour

Yoav Tzur

CoRR, vol. abs/1302.0792 (2013)

What You Can Do with Coordinated Samples

Envy-Free Makespan Approximation

Michal Feldman

Amos Fiat

Haim Kaplan

Svetlana Olonetsky

SIAM J. Comput., vol. 41 (2012), pp. 12-25

How to Estimate Change from Samples

What you can do with Coordinated Samples

Don't let the negatives bring you down: sampling from streams of signed updates

A Case for Customizing Estimators: Coordinated Samples

Structure-Aware Sampling: Flexible and Accurate Summarization

Get the Most out of Your Sample: Optimal Unbiased Estimators using Partial Information

Structure-aware sampling on data streams

Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums

Nick G. Duffield

Haim Kaplan

Carsten Lund

Mikkel Thorup

SIAM J. Comput., vol. 40 (2011), pp. 1402-1431

Truth, Envy, and Truthful Market Clearing Bundle Pricing

Get the most out of your sample: optimal unbiased estimators using partial information

Structure-Aware Sampling: Flexible and Accurate Summarization

Labeling Dynamic XML Trees

Envy-free makespan approximation: extended abstract

On the Interplay between Incentive Compatibility and Envy Freeness

Truth and Envy in Capacitated Allocation Games

Leveraging discarded samples for tighter estimation of multiple-set aggregates

Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates

Decay Models

Encyclopedia of Database Systems (2009), pp. 757-761

Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments

Composable, Scalable, and Accurate Weight Summarization of Unaggregated Data Sets

Nick G. Duffield

Haim Kaplan

Carsten Lund

Mikkel Thorup

PVLDB, vol. 2 (2009), pp. 431-442

Stream sampling for variance-optimal estimation of subset sums

Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments

Envy-Free Makespan Approximation

Sketch-Based Estimation of Subpopulation-Weight

Estimating Aggregates over Multiple Sets

Confident estimation for multistage measurement sampling and aggregation

Tighter estimation using bottom k sketches

Variance optimal sampling based estimation of subset sums

Processing top-k queries from samples

Algorithms and estimators for accurate summarization of internet traffic

Nick G. Duffield

Haim Kaplan

Carsten Lund

Mikkel Thorup

Internet Measurement Comference (2007), pp. 265-278

Spatially-decaying aggregation over a network

Associative search in peer to peer networks: Harnessing latent semantics

Bottom-k sketches: better and more efficient estimation of aggregates

Sketching unaggregated data streams for subpopulation-size queries

Summarizing data using bottom-k sketches

Processing top k queries from samples

Maintaining time-decaying stream aggregates

A short walk in the Blogistan

Making routing robust to changing traffic demands: algorithms and evaluation

Packet classification in large ISPs: design and evaluation of decision tree classifiers

Performance aspects of distributed caches using TTL-based consistency

Spatially-decaying aggregation over a network: model and algorithms

Balanced-Replication Algorithms for Distribution Trees

Efficient estimation algorithms for neighborhood variance and other moments

Guest Editors' foreword

Coping with network failures: routing strategies for optimal demand oblivious restoration

Optimal oblivious routing in polynomial time

Yossi Azar

Amos Fiat

Haim Kaplan

Harald Räcke

J. Comput. Syst. Sci., vol. 69 (2004), pp. 383-394

A case for associative peer to peer overlays

Proactive caching of DNS records: addressing a performance bottleneck

Reachability and Distance Queries via 2-Hop Labels

Associative Search in Peer to Peer Networks: Harnessing Latent Semantics

Optimal oblivious routing in polynomial time

Connection caching: model and algorithms

Maintaining time-decaying stream aggregates

Predicting and bypassing end-to-end Internet service degradations

Anat Bremler-Barr

Haim Kaplan

Yishay Mansour

IEEE Journal on Selected Areas in Communications, vol. 21 (2003), pp. 961-978

Efficient sequences of trials

Prefetching the means for document transfer: a new approach for reducing Web latency

Reachability and distance queries via 2-hop labels

Balanced-Replication Algorithms for Distribution Trees

Search and replication in unstructured peer-to-peer networks

Competitive Analysis of the LRFU Paging Algorithm

Caching Documents with Variable Sizes and Fetching Costs: An LP-Based Approach

Labeling Dynamic XML Trees

Search and replication in unstructured peer-to-peer networks

Refreshment policies for Web content caches

Replication strategies in unstructured peer-to-peer networks

Restoration by path concatenation: fast recovery of MPLS paths

Yehuda Afek

Anat Bremler-Barr

Haim Kaplan

Michael Merritt

Distributed Computing, vol. 15 (2002), pp. 273-283

Exploiting Regularities in Web Traffic Patterns for Cache Replacement

Predicting and bypassing end-to-end internet service degradations

Anat Bremler-Barr

Haim Kaplan

Yishay Mansour

Internet Measurement Workshop (2002), pp. 307-320

Refreshment Policies for Web Content Caches

Finding Interesting Associations without Support Pruning

Mayur Datar

Shinji Fujiwara

Aristides Gionis

Piotr Indyk

Rajeev Motwani

Jeffrey D. Ullman

Cheng Yang

IEEE Trans. Knowl. Data Eng., vol. 13 (2001), pp. 64-78

Aging through cascaded caches: performance issues in the distribution of web content

Restoration path concatenation: fast recovery of MPLS paths

Anat Bremler-Barr

Yehuda Afek

Haim Kaplan

Michael Merritt

SIGMETRICS/Performance (2001), pp. 316-317

Restoration by path concatenation: fast recovery of MPLS paths

Proactive Caching of DNS Records: Addressing a Performance Bottleneck

Performance Aspects of Distributed Caches Using TTL-Based Consistency

All-Pairs Small-Stretch Paths

The Age Penalty and Its Effect on Cache Performance

Competitive Analysis of the LRFU Paging Algorithm

Prefetching the Means for Document Transfer: A New Approach for Reducing Web Latency

Connection caching under vaious models of communication

Polylog-time and near-linear work approximation scheme for undirected shortest paths

J. ACM, vol. 47 (2000), pp. 132-166

Finding Interesting Associations without Support Pruning

Mayur Datar

Shinji Fujiwara

Aristides Gionis

Piotr Indyk

Rajeev Motwani

Jeffrey D. Ullman

Cheng Yang

ICDE (2000), pp. 489-499

LP-based Analysis of Greedy-dual-size

Connection Caching

Approximating Matrix Multiplication for Pattern Recognition Tasks

Efficient Algorithms for Predicting Requests to Web Servers

Exploiting Regularities in Web Traffic Patterns for Cache Replacement

Managing TCP Connections Under Persistent HTTP

Improving End-to-End Performance of the Web Using Server Volumes and Proxy Filters

Structure Prediction and Computation of Sparse Matrix Products

J. Comb. Optim., vol. 2 (1998), pp. 307-332

Fast Algorithms for Constructing t-Spanners and Paths with Stretch t

SIAM J. Comput., vol. 28 (1998), pp. 210-236

Evaluating Server-Assisted Cache Replacement in the Web

Learning Noisy Perceptrons by a Perceptron in Polynomial Time

FOCS (1997), pp. 514-523

Size-Estimation Framework with Applications to Transitive Closure and Reachability

J. Comput. Syst. Sci., vol. 55 (1997), pp. 441-453

All-Pairs Small-Stretch Paths

Approximating Matrix Multiplication for Pattern Recognition Tasks

Using Selective Path-Doubling for Parallel Shortest-Path Computations

J. Algorithms, vol. 22 (1997), pp. 30-56

Efficient Parallel Shortest-Paths in Digraphs with a Separator Decomposition

J. Algorithms, vol. 21 (1996), pp. 331-357

On Optimizing Multiplications of Sparse Matrices

IPCO (1996), pp. 219-233

Improvise: Interactive Multimedia Process Visualization Environment

Approximate Max-Flow on Small Depth Networks

SIAM J. Comput., vol. 24 (1995), pp. 579-597

Improved Algorithms for Linear Inequalities With Two Variables per Inequality

Estimating the Size of the Transitive Closure in Linear Time

FOCS (1994), pp. 190-200

Algorithms and Complexity Analysis for Some Flow Problems

Polylog-time and near-linear work approximation scheme for undirected shortest paths

STOC (1994), pp. 16-26

New algorithms for generalized network flows

Using Selective Path-Doubling for Parallel Shortest-Path Computations

ISTCS (1993), pp. 78-87

Fast algorithms for constructing t-spanners and paths with stretch t

FOCS (1993), pp. 648-658

Strongly Polynomial-Time and NC Algorithms for Detecting Cycles in Periodic Graphs

Efficient Parallel Shortest-Paths in Digraphs with a Separator Decomposition

SPAA (1993), pp. 57-67

New Algorithms for Generalized Network Flows

Approximate Max Flow on Small Depth Networks

FOCS (1992), pp. 648-658

NP-Completeness of graph decomposition problems

Improved Algorithms for Linear Inequalities with Two Variables per Inequality (Extended Abstract)

Algorithms and Complexity Analysis for Some Flow Problems