# Edith Cohen

I am a Research Scientist at Google (Mountain View, CA) since 2015. I am also a (visiting) full professor at the School of Computer Science at Tel Aviv University in Israel. I attended
Tel Aviv University
(Israel)
for my undergraduate studies in math, physics, and
computer science, graduating in 1985, and continued to obtain an M.Sc.
in computer science in 1986, supervised by Michael Tarsi. I then headed to
Stanford, CA, where I
worked on a Ph.D. in Computer
Science,
with
Andrew Goldberg and Nimrod Megiddo, graduating in 1991.
From 1991 to 2012, I was a member of the research staff, first at the
legendary AT&T Bell Laboratories , and after a 1997 split, at AT&T Labs Research. In 1997 I also visited the Computer Science Division at UC Berkeley. From 2012 to 2014 I was a principal researcher at Microsoft Research (Silicon Valley).

My research interests include Algorithms design, Data Mining, Machine Learning, Optimization, Networks, and more. I prefer a principled approach to modeling and design and I am always looking to learn and expand my horizons. In the span of my research career, I developed models and scalable algorithms in a wide range of areas including query processing and optimization, data summarization, content search and delivery, caching, prefetching, routing, streaming and distributed computation, fundamental graph algorithms and scalable mining of massive graphs. My work enables the leveraging of much larger data sets than previously possible. More details on my work can be found on my personal web page.

### Research Areas

Authored Publications

Google Publications

Other Publications

Sort By

HyperLogLog Hyper Extended: Sketches for Concave Sublinear Frequency Statistics

KDD (2017) (to appear)

Preview abstract
One of the most common statistics computed over data elements is the number of distinct keys. A thread of research pioneered by Flajolet and Martin three decades ago culminated in the design of optimal approximate counting sketches, which have size that is double logarithmic in the number of distinct keys and provide estimates with a small relative error. Moreover, the sketches are composable, and thus suitable for streamed, parallel, or distributed computation.
We consider here all statistics of the frequency distribution of keys, where a contribution of a key to the aggregate is concave and grows (sub)linearly with its frequency. These fundamental aggregations are very common in text, graphs, and logs analysis and include logarithms, low frequency moments, and capping statistics.
We design composable sketches of double-logarithmic size for all concave sublinear statistics. Our design combines theoretical optimality and practical simplicity. In a nutshell, we specify tailored mapping functions of data elements to output elements so that our target statistics on the data elements is approximated by the (max-) distinct statistics of the output elements, which can be approximated using off-the-shelf sketches. Our key insight is relating these target statistics to the {\em complement Laplace} transform of the input frequencies.
View details

Greedy Maximization Framework for Graph-based Influence Functions

HotWeb, IEEE (2016)

Preview abstract
The study of graph-based submodular maximization problems was initiated in a seminal work of Kempe, Kleinberg, and Tardos (2003): An {\em influence} function of subsets of nodes is defined by the graph structure and the aim is to find subsets of seed nodes with (approximately) optimal tradeoff of size and influence. Applications include viral marketing, monitoring, and active learning of node labels. This powerful formulation was studied for (generalized) {\em coverage} functions, where the influence of a seed set on a node is the maximum utility of a seed item to the node, and for pairwise {\em utility} based on reachability, distances, or reverse ranks.
We define a rich class of influence functions which unifies and extends previous work beyond coverage functions and specific utility functions. We present a meta-algorithm for approximate greedy maximization with strong approximation quality guarantees and worst-case near-linear computation for all functions in our class. Our meta-algorithm generalizes a recent design by Cohen et al (2014) that was specific for distance-based coverage functions.
View details

Multi-Objective Weighted Sampling

HotWeb 2015 (to appear)

Preview abstract
Key value data sets of the form $\{(x,w_x)\}$ where $w_x >0$ are prevalent. Common queries over such data are {\em segment $f$-statistics} $Q(f,H) = \sum_{x\in H}f(w_x)$, specified for a segment $H$ of the keys and a function $f$. Different choices of $f$ correspond to count, sum, moments, cap, and threshold statistics.
When the data set is large, we can compute a smaller sample from which we can quickly estimate statistics. A weighted sample of keys taken with respect to $f(w_x)$ provides estimates with statistically guaranteed quality for $f$-statistics. Such a sample $S^{(f)}$ can be used to estimate $g$-statistics for $g\not=f$, but quality degrades with the disparity between $g$ and $f$. In this paper we address applications that require quality estimates for a set $F$ of different functions. A naive solution is to compute and work with a different sample $S^{(f)}$ for each $f\in F$. Instead, this can be achieved more effectively and seamlessly using a single {\em multi-objective} sample $S^{(F)}$ of a much smaller size.
We review multi-objective sampling schemes and place them in our context of estimating $f$-statistics. We show that a multi-objective sample for $F$ provides quality estimates for any $f$ that is a positive linear combination of functions from $F$. We then establish a surprising and powerful result when the target set $M$ is {\em all} monotone non-decreasing functions, noting that $M$ includes most natural statistics. We provide efficient multi-objective sampling algorithms for $M$ and show that a sample size of $k \ln n$ (where $n$ is the number of active keys) provides the same estimation quality, for any $f\in M$, as a dedicated weighted sample of size $k$ for $f$.
View details

Multi-Objective Weighted Sampling

CoRR, vol. abs/1509.07445 (2015)

Stream Sampling for Frequency Cap Statistics

KDD (2015), pp. 159-168

Min-Hash Sketches

Encyclopedia of Algorithms (2015)

Coordinated Sampling

Encyclopedia of Algorithms (2015)

All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis

IEEE Trans. Knowl. Data Eng., vol. 27 (2015), pp. 2320-2334

Stream Sampling for Frequency Cap Statistics

CoRR, vol. abs/1502.05955 (2015)

Reverse Ranking by Graph Structure: Model and Scalable Algorithms

All-Distances Sketches

Encyclopedia of Algorithms (2015)

Computing classic closeness centrality, at scale

Sketch-based Influence Maximization and Computation: Scaling up with Guarantees

Probe scheduling for efficient detection of silent failures

Haim Kaplan

Yishay Mansour

Yoav Tzur

Perform. Eval., vol. 79 (2014), pp. 73-89

Estimation for monotone sampling: competitiveness and customization

PODC (2014), pp. 124-133

Sketch-based Influence Maximization and Computation: Scaling up with Guarantees

Timed Influence: Computation and Maximization

All-distances sketches, revisited: HIP estimators for massive graphs analysis

PODS (2014), pp. 88-99

Variance Competitiveness for Monotone Estimation: Tightening the Bounds

CoRR, vol. abs/1406.6490 (2014)

Computing Classic Closeness Centrality, at Scale

Distance queries from sampled data: accurate and efficient

KDD (2014), pp. 681-690

Author retrospective for search and replication in unstructured peer-to-peer networks

Algorithms and estimators for summarization of unaggregated data streams

Nick G. Duffield

Haim Kaplan

Carsten Lund

Mikkel Thorup

J. Comput. Syst. Sci., vol. 80 (2014), pp. 1214-1244

A Labeling Approach to Incremental Cycle Detection

Scalable Neighborhood Sketching and Distance Distribution Estimation in Graph Datasets: Revisited, Unified, and Improved

CoRR, vol. abs/1306.3284 (2013)

Scheduling Subset Tests: One-Time, Continuous, and How They Relate

Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths

Daniel Delling

Fabian Fuchs

Andrew V. Goldberg

MoisÃ©s Goldszmidt

Renato F. Werneck

COSN (2013), pp. 131-142

Probe Scheduling for Efficient Detection of Silent Failures

Haim Kaplan

Yishay Mansour

Yoav Tzur

CoRR, vol. abs/1302.0792 (2013)

What You Can Do with Coordinated Samples

On the Tradeoff between Stability and Fit

Envy-Free Makespan Approximation

Michal Feldman

Amos Fiat

Haim Kaplan

Svetlana Olonetsky

SIAM J. Comput., vol. 41 (2012), pp. 12-25

A Case for Customizing Estimators: Coordinated Samples

What you can do with Coordinated Samples

Don't let the negatives bring you down: sampling from streams of signed updates

How to Estimate Change from Samples

Structure-Aware Sampling: Flexible and Accurate Summarization

Truth, Envy, and Truthful Market Clearing Bundle Pricing

Get the most out of your sample: optimal unbiased estimators using partial information

Structure-aware sampling on data streams

Structure-Aware Sampling: Flexible and Accurate Summarization

Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums

Nick G. Duffield

Haim Kaplan

Carsten Lund

Mikkel Thorup

SIAM J. Comput., vol. 40 (2011), pp. 1402-1431

Get the Most out of Your Sample: Optimal Unbiased Estimators using Partial Information

Truth and Envy in Capacitated Allocation Games

Labeling Dynamic XML Trees

Envy-free makespan approximation: extended abstract

On the Interplay between Incentive Compatibility and Envy Freeness

Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates

Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments

Leveraging discarded samples for tighter estimation of multiple-set aggregates

Stream sampling for variance-optimal estimation of subset sums

Envy-Free Makespan Approximation

Composable, Scalable, and Accurate Weight Summarization of Unaggregated Data Sets

Nick G. Duffield

Haim Kaplan

Carsten Lund

Mikkel Thorup

PVLDB, vol. 2 (2009), pp. 431-442

Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments

Decay Models

Encyclopedia of Database Systems (2009), pp. 757-761

Sketch-Based Estimation of Subpopulation-Weight

Confident estimation for multistage measurement sampling and aggregation

Estimating Aggregates over Multiple Sets

Processing top-k queries from samples

Tighter estimation using bottom k sketches

Variance optimal sampling based estimation of subset sums

Sketching unaggregated data streams for subpopulation-size queries

Associative search in peer to peer networks: Harnessing latent semantics

Algorithms and estimators for accurate summarization of internet traffic

Nick G. Duffield

Haim Kaplan

Carsten Lund

Mikkel Thorup

Internet Measurement Comference (2007), pp. 265-278

Spatially-decaying aggregation over a network

Summarizing data using bottom-k sketches

Bottom-k sketches: better and more efficient estimation of aggregates

Maintaining time-decaying stream aggregates

Processing top k queries from samples

A short walk in the Blogistan

Making routing robust to changing traffic demands: algorithms and evaluation

Performance aspects of distributed caches using TTL-based consistency

Packet classification in large ISPs: design and evaluation of decision tree classifiers

Guest Editors' foreword

Balanced-Replication Algorithms for Distribution Trees

Spatially-decaying aggregation over a network: model and algorithms

Optimal oblivious routing in polynomial time

Yossi Azar

Amos Fiat

Haim Kaplan

Harald RÃ¤cke

J. Comput. Syst. Sci., vol. 69 (2004), pp. 383-394

Efficient estimation algorithms for neighborhood variance and other moments

Coping with network failures: routing strategies for optimal demand oblivious restoration

Efficient sequences of trials

Optimal oblivious routing in polynomial time

A case for associative peer to peer overlays

Reachability and Distance Queries via 2-Hop Labels

Associative Search in Peer to Peer Networks: Harnessing Latent Semantics

Predicting and bypassing end-to-end Internet service degradations

Anat Bremler-Barr

Haim Kaplan

Yishay Mansour

IEEE Journal on Selected Areas in Communications, vol. 21 (2003), pp. 961-978

Connection caching: model and algorithms

Proactive caching of DNS records: addressing a performance bottleneck

Maintaining time-decaying stream aggregates

Reachability and distance queries via 2-hop labels

Search and replication in unstructured peer-to-peer networks

Balanced-Replication Algorithms for Distribution Trees

Caching Documents with Variable Sizes and Fetching Costs: An LP-Based Approach

Exploiting Regularities in Web Traffic Patterns for Cache Replacement

Search and replication in unstructured peer-to-peer networks

Restoration by path concatenation: fast recovery of MPLS paths

Yehuda Afek

Anat Bremler-Barr

Haim Kaplan

Michael Merritt

Distributed Computing, vol. 15 (2002), pp. 273-283

Prefetching the means for document transfer: a new approach for reducing Web latency

Replication strategies in unstructured peer-to-peer networks

Predicting and bypassing end-to-end internet service degradations

Anat Bremler-Barr

Haim Kaplan

Yishay Mansour

Internet Measurement Workshop (2002), pp. 307-320

Refreshment policies for Web content caches

Labeling Dynamic XML Trees

Competitive Analysis of the LRFU Paging Algorithm

Restoration by path concatenation: fast recovery of MPLS paths

Restoration path concatenation: fast recovery of MPLS paths

Anat Bremler-Barr

Yehuda Afek

Haim Kaplan

Michael Merritt

SIGMETRICS/Performance (2001), pp. 316-317

Proactive Caching of DNS Records: Addressing a Performance Bottleneck

Refreshment Policies for Web Content Caches

Performance Aspects of Distributed Caches Using TTL-Based Consistency

All-Pairs Small-Stretch Paths

The Age Penalty and Its Effect on Cache Performance

Competitive Analysis of the LRFU Paging Algorithm

Finding Interesting Associations without Support Pruning

Mayur Datar

Shinji Fujiwara

Aristides Gionis

Piotr Indyk

Rajeev Motwani

Jeffrey D. Ullman

Cheng Yang

IEEE Trans. Knowl. Data Eng., vol. 13 (2001), pp. 64-78

Aging through cascaded caches: performance issues in the distribution of web content

Connection caching under vaious models of communication

Finding Interesting Associations without Support Pruning

Mayur Datar

Shinji Fujiwara

Aristides Gionis

Piotr Indyk

Rajeev Motwani

Jeffrey D. Ullman

Cheng Yang

ICDE (2000), pp. 489-499

Polylog-time and near-linear work approximation scheme for undirected shortest paths

J. ACM, vol. 47 (2000), pp. 132-166

Prefetching the Means for Document Transfer: A New Approach for Reducing Web Latency

Connection Caching

Approximating Matrix Multiplication for Pattern Recognition Tasks

Efficient Algorithms for Predicting Requests to Web Servers

Exploiting Regularities in Web Traffic Patterns for Cache Replacement

Managing TCP Connections Under Persistent HTTP

LP-based Analysis of Greedy-dual-size

Structure Prediction and Computation of Sparse Matrix Products

J. Comb. Optim., vol. 2 (1998), pp. 307-332

Improving End-to-End Performance of the Web Using Server Volumes and Proxy Filters

Fast Algorithms for Constructing t-Spanners and Paths with Stretch t

SIAM J. Comput., vol. 28 (1998), pp. 210-236

Evaluating Server-Assisted Cache Replacement in the Web

Using Selective Path-Doubling for Parallel Shortest-Path Computations

J. Algorithms, vol. 22 (1997), pp. 30-56

Size-Estimation Framework with Applications to Transitive Closure and Reachability

J. Comput. Syst. Sci., vol. 55 (1997), pp. 441-453

Approximating Matrix Multiplication for Pattern Recognition Tasks

Learning Noisy Perceptrons by a Perceptron in Polynomial Time

FOCS (1997), pp. 514-523

All-Pairs Small-Stretch Paths

On Optimizing Multiplications of Sparse Matrices

IPCO (1996), pp. 219-233

Efficient Parallel Shortest-Paths in Digraphs with a Separator Decomposition

J. Algorithms, vol. 21 (1996), pp. 331-357

Approximate Max-Flow on Small Depth Networks

SIAM J. Comput., vol. 24 (1995), pp. 579-597

Improvise: Interactive Multimedia Process Visualization Environment

Estimating the Size of the Transitive Closure in Linear Time

FOCS (1994), pp. 190-200

Improved Algorithms for Linear Inequalities With Two Variables per Inequality

Polylog-time and near-linear work approximation scheme for undirected shortest paths

STOC (1994), pp. 16-26

Algorithms and Complexity Analysis for Some Flow Problems

New algorithms for generalized network flows

Fast algorithms for constructing t-spanners and paths with stretch t

FOCS (1993), pp. 648-658

Strongly Polynomial-Time and NC Algorithms for Detecting Cycles in Periodic Graphs

Using Selective Path-Doubling for Parallel Shortest-Path Computations

ISTCS (1993), pp. 78-87

Efficient Parallel Shortest-Paths in Digraphs with a Separator Decomposition

SPAA (1993), pp. 57-67

Approximate Max Flow on Small Depth Networks

FOCS (1992), pp. 648-658

New Algorithms for Generalized Network Flows

Algorithms and Complexity Analysis for Some Flow Problems

NP-Completeness of graph decomposition problems

Improved Algorithms for Linear Inequalities with Two Variables per Inequality (Extended Abstract)