Haim Kaplan

I work on data structures, algorithms, computational geometry, and machine learning. Right now my main focus is on online learning, reinforcement learning, and clustering. My research is typically theoretical. I am also a faculty member at the School of Computer Science of Tel Aviv University.
Authored Publications
    We introduce the concurrent shuffle model of differential privacy. In this model we have multiple concurrent shufflers permuting messages from different, possibly overlapping, batches of users. Similarly to the standard (single) shuffle model, the privacy requirement is that the concatenation of all shuffled messages should be differentially private. We study the private continual summation problem (a.k.a. the counter problem) and show that the concurrent shuffle model allows for significantly improved error compared to the standard (single) shuffle model. Specifically, we give a summation algorithm with error $\tilde{O}(n^{1/(2k+1)})$ with $k$ concurrent shufflers on a sequence of length $n$. Furthermore, we prove that this bound is tight for any $k$, even if the algorithm can choose the sizes of the batches adaptively. For $k=\log n$ shufflers, the resulting error is polylogarithmic, much better than $\tilde{\Theta}(n^{1/3})$, which we show is the smallest possible with a single shuffler. We use our online summation algorithm to get algorithms with improved regret bounds for the contextual linear bandit problem. In particular, we get optimal $\tilde{O}(\sqrt{n})$ regret with $k=\tilde{\Omega}(\log n)$ concurrent shufflers.
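The continual summation (counter) problem mentioned above is the classical target of tree-aggregation mechanisms. As background, here is a minimal sketch of the standard central-model, Laplace-noise tree counter for a 0/1 stream; it is not the paper's concurrent-shuffle protocol, and the function name and parameters are illustrative assumptions.

```python
# Minimal sketch of the classical tree-based private counter (central model):
# each dyadic interval of the stream gets one Laplace-noised sum, and every
# prefix sum is answered from O(log n) such intervals. Not the paper's
# concurrent-shuffle protocol; included only to illustrate the problem.
import numpy as np

def private_prefix_sums(bits, epsilon, rng=None):
    """Return noisy prefix sums of a 0/1 stream with polylog(n) error."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(bits)
    levels = max(1, int(np.ceil(np.log2(n + 1))))
    # Each element lies in at most `levels` dyadic blocks, so Laplace noise of
    # scale levels/epsilon per block keeps the whole release epsilon-DP.
    scale = levels / epsilon
    noisy_block = {}  # (level, index) -> noisy sum of that dyadic block

    def block_sum(level, index):
        if (level, index) not in noisy_block:
            lo, hi = index * 2**level, min(n, (index + 1) * 2**level)
            noisy_block[(level, index)] = sum(bits[lo:hi]) + rng.laplace(0, scale)
        return noisy_block[(level, index)]

    prefixes = []
    for t in range(1, n + 1):
        # Decompose [0, t) into O(log n) dyadic blocks (binary representation of t).
        total, pos = 0.0, 0
        for level in reversed(range(levels)):
            if pos + 2**level <= t:
                total += block_sum(level, pos // 2**level)
                pos += 2**level
        prefixes.append(total)
    return prefixes
```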
    In this work we revisit an interactive variant of joint differential privacy, recently introduced by Naor et al. [2023], and generalize it towards handling online processes in which existing privacy definitions seem too restrictive. We study basic properties of this definition and demonstrate that it satisfies (suitable variants of) group privacy, composition, and post-processing. In order to demonstrate the advantages of this privacy definition compared to traditional forms of differential privacy, we consider the basic setting of online classification. We show that any (possibly non-private) learning rule can be effectively transformed to a private learning rule with only a polynomial overhead in the mistake bound. This demonstrates a stark difference from traditional forms of differential privacy, such as the one studied by Golowich and Livni [2021], where only a doubly exponential overhead in the mistake bound is known (via an information-theoretic upper bound).
    A dynamic algorithm against an adaptive adversary is required to be correct when the adversary chooses the next update after seeing the previous outputs of the algorithm. We obtain faster dynamic algorithms against an adaptive adversary and separation results between what is achievable in the oblivious vs. adaptive settings. To get these results we exploit techniques from differential privacy, cryptography, and adaptive data analysis. We give a general reduction transforming a dynamic algorithm against an oblivious adversary to a dynamic algorithm robust against an adaptive adversary. This reduction maintains several copies of the oblivious algorithm and uses differential privacy to protect their random bits. Using this reduction we obtain dynamic algorithms against an adaptive adversary with improved update and query times for global minimum cut, all-pairs distances, and all-pairs effective resistance. We further improve our update and query times by showing how to maintain a sparsifier over an expander decomposition that can be refreshed fast. This fast refresh enables it to be robust against what we call a blinking adversary, which can observe the output of the algorithm only following refreshes. We believe that these techniques will prove useful for additional problems. On the flip side, we specify dynamic problems for which, assuming a random oracle, every dynamic algorithm that solves them against an adaptive adversary must be polynomially slower than a rather straightforward dynamic algorithm that solves them against an oblivious adversary. We first show a separation result for a search problem and then a separation result for an estimation problem. In the latter case our separation result draws from lower bounds in adaptive data analysis.
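The reduction described above, maintaining several copies of an oblivious algorithm and combining their answers in a differentially private way, can be sketched schematically. The wrapper below is only an illustration under our own simplifying assumptions (a generic copy object exposing update/query, a crude noisy-median aggregator); it is not the paper's construction or its parameter choices.

```python
# Schematic wrapper: several independent copies of an oblivious dynamic
# algorithm, with queries answered by a (roughly) noisy median of their
# outputs so that little is leaked about any single copy's random bits.
# Illustration only; not the paper's reduction or its privacy accounting.
import numpy as np

class RobustWrapper:
    def __init__(self, make_oblivious_copy, num_copies=9, epsilon=1.0, rng=None):
        # make_oblivious_copy is assumed to return an object with
        # .update(op) and .query() methods (hypothetical interface).
        self.copies = [make_oblivious_copy() for _ in range(num_copies)]
        self.epsilon = epsilon
        self.rng = np.random.default_rng() if rng is None else rng

    def update(self, op):
        for copy in self.copies:      # every copy processes every update
            copy.update(op)

    def query(self):
        answers = sorted(copy.query() for copy in self.copies)
        # Return an answer near the median, with noisy index selection so the
        # adversary cannot easily pin down any particular copy's randomness.
        shift = int(round(self.rng.laplace(0, 1.0 / self.epsilon)))
        idx = min(max(len(answers) // 2 + shift, 0), len(answers) - 1)
        return answers[idx]
```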
    Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or to the large number of data points that is required for accurate results. We propose a simple and practical tool $\mathsf{FriendlyCore}$ that takes a set of points $\mathcal{D}$ from an unrestricted (pseudo) metric space as input. When $\mathcal{D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a ``stable'' subset $\mathcal{C} \subseteq \mathcal{D}$ that includes all points, except possibly a few outliers, and is {\em guaranteed} to have diameter $r$. $\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, $\mathsf{FriendlyCore}$ is light-weight with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation and clustering tasks such as $k$-means and $k$-GMM, outperforming tailored methods.
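To make the preprocessing idea concrete, here is a toy filter that keeps only points close to many other points, so that the retained subset has small diameter and is easier to aggregate privately afterwards. The rule and parameter names are our own simplification; this is not the paper's FriendlyCore test and, unlike it, the filter below carries no privacy guarantee by itself.

```python
# Toy "core" filter: keep points within distance r of many other points.
# Our own simplification for illustration; not the FriendlyCore predicate,
# and not differentially private on its own.
import numpy as np

def toy_core(points, r, min_friends_frac=0.5):
    """Return the points lying within distance r of at least a
    min_friends_frac fraction of the data set (a crude 'stable' core)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    friends = (dists <= r).sum(axis=1)   # counts the point itself as well
    return points[friends >= min_friends_frac * n]

# Example: a tight cluster plus two far outliers; the outliers are dropped,
# and a noisy mean of the remaining core needs far less noise afterwards.
data = np.vstack([np.random.default_rng(0).normal(0, 1, (100, 2)),
                  [[50.0, 50.0], [-60.0, 10.0]]])
core = toy_core(data, r=6.0)
```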
    A streaming algorithm is said to be adversarially robust if its accuracy guarantees are maintained even when the data stream is chosen maliciously, by an adaptive adversary. We establish a connection between adversarial robustness of streaming algorithms and the notion of differential privacy. This connection allows us to design new adversarially robust streaming algorithms that outperform the current state-of-the-art constructions for many interesting regimes of parameters.
    We present differentially private efficient algorithms for learning polygons in the plane (which are not necessarily convex). Our algorithm achieves $(\alpha,\beta)$-PAC learning and $(\varepsilon,\delta)$-differential privacy using a sample of size $O\left(\frac{k}{\alpha\varepsilon}\log\left(\frac{|X|}{\beta\delta}\right)\right)$, where the domain is $X\times X$ and $k$ is the number of edges in the (potentially non-convex) polygon.
    Differentially-Private Bayes Consistency
    Aryeh Kontorovich
    Shay Moran
    Menachem Sadigurschi
    arXiv
    We construct a universally Bayes consistent learning rule which satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed, in the latter DP learning is extremely limited: even one-dimensional linear classifiers are not privately learnable in this stringent model. Our result thus demonstrates that by allowing the learning rate to depend on the target distribution, one can circumvent the above-mentioned impossibility result and in fact learn \emph{arbitrary} distributions by a single DP algorithm. As an application, we prove that any VC class can be privately learned in a semi-supervised setting with a near-optimal \emph{labeled} sample complexity of $\tilde O(d/\varepsilon)$ labeled examples (and with an unlabeled sample complexity that can depend on the target distribution).
    We present efficient differentially private algorithms for learning unions of polygons in the plane (which are not necessarily convex). Our algorithms are $(\alpha,\beta)$-probably approximately correct and $(\varepsilon,\delta)$-differentially private using a sample of size $\tilde{O}\left(\frac{1}{\alpha\varepsilon}k\log d\right)$, where the domain is $[d]\times[d]$ and $k$ is the number of edges in the union of polygons. Our algorithms are obtained by designing a private variant of the classical (non-private) learner for conjunctions using the greedy algorithm for set cover.
    The amount of training data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Gyorfi, and Lugosi (1996) ask whether there exists a monotone Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm. We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye, Gyorfi, and Lugosi (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and Mhammedi (2021). Our general transformation readily implies monotone learners in a variety of contexts: for example, Pestov's result follows by applying it to any Bayes-consistent algorithm (e.g., k-Nearest-Neighbours). In fact, our transformation extends Pestov's result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov's work, which is tailored to binary classification. In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions asked by Viering, Mey, and Loog (2019), Viering and Loog (2021), and Mhammedi (2021).
    In this work we study the problem of differentially private (DP) quantiles: given a data set $X$ and quantiles $q_1, ..., q_m \in [0,1]$, we want to output $m$ quantile estimations which are as close as possible to the optimal solution while preserving DP. We provide AQ, an algorithm and implementation for the DP-quantiles problem. We analyze our algorithm and provide a mathematical proof of its error bounds for the general case and for the concrete case of uniform quantiles utility. We also experimentally evaluate AQ and conclude that it obtains higher accuracy than the existing baselines while having lower run time. We reduce the problem of DP data sanitization to the DP uniform-quantiles problem and analyze the resulting mathematical bounds on the error in this case. We analyze our algorithm under the definition of zero-concentrated differential privacy (zCDP) and supply its error guarantees in this case. Finally, we show the empirical benefit our algorithm gains under the zCDP definition.
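For context, the single-quantile primitive that DP-quantile algorithms of this kind typically build on is the exponential mechanism over the gaps between data points, and multiple quantiles can then be handled by recursively splitting on a middle one. The sketch below illustrates that standard recipe under our own assumptions (data in a known bounded range, naive per-level privacy accounting); it is not claimed to reproduce the paper's AQ algorithm or its error analysis.

```python
# Exponential-mechanism quantile plus recursive splitting: a standard recipe,
# sketched under our own assumptions; not the paper's AQ algorithm.
import numpy as np

def exp_mech_quantile(data, q, lo, hi, epsilon, rng):
    """epsilon-DP estimate of the q-th quantile of data contained in [lo, hi]."""
    xs = np.clip(np.sort(np.asarray(data, dtype=float)), lo, hi)
    edges = np.concatenate(([lo], xs, [hi]))
    target_rank = q * len(xs)
    # Any output in the i-th gap has utility -|i - target_rank| (sensitivity 1).
    utils = -np.abs(np.arange(len(edges) - 1) - target_rank)
    lengths = np.maximum(np.diff(edges), 1e-12)
    logw = np.log(lengths) + (epsilon / 2.0) * utils
    logw -= logw.max()
    probs = np.exp(logw) / np.exp(logw).sum()
    i = rng.choice(len(probs), p=probs)
    return rng.uniform(edges[i], edges[i + 1])

def recursive_quantiles(data, qs, lo, hi, epsilon, rng=None):
    """Estimate the sorted quantiles qs by recursively splitting on a middle
    one; each recursion level spends epsilon (naive composition)."""
    rng = np.random.default_rng() if rng is None else rng
    qs = sorted(qs)
    if not qs:
        return []
    mid = len(qs) // 2
    p = qs[mid]
    v = exp_mech_quantile(data, p, lo, hi, epsilon, rng)
    data = np.asarray(data, dtype=float)
    left_qs = [q / p for q in qs[:mid]] if p > 0 else []
    right_qs = [(q - p) / (1 - p) for q in qs[mid + 1:]] if p < 1 else []
    left = recursive_quantiles(data[data <= v], left_qs, lo, v, epsilon, rng)
    right = recursive_quantiles(data[data > v], right_qs, v, hi, epsilon, rng)
    return left + [v] + right
```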
    We revisit one of the most basic and widely applicable techniques in the literature of differential privacy -- the sparse vector technique [Dwork et al., STOC 2009]. This simple algorithm privately tests whether the value of a given query on a database is close to what we expect it to be. It allows asking an unbounded number of queries as long as the answer is close to what we expect, and halts following the first query for which this is not the case. We suggest an alternative, equally simple, algorithm that can continue testing queries as long as no single individual contributes to the answers of too many queries whose answer deviates substantially from what we expect. Our analysis is subtle and some of its ingredients may be more widely applicable. In some cases our new algorithm allows us to privately extract much more information from the database than the original. We demonstrate this by applying our algorithm to the shifting heavy-hitters problem: on every time step, each of n users gets a new input, and the task is to privately identify all the current heavy-hitters. That is, on time step i, the goal is to identify all data elements x such that many of the users have x as their current input. We present an algorithm for this problem with improved error guarantees over what can be obtained using existing techniques. Specifically, the error of our algorithm depends on the maximal number of times that a single user holds a heavy-hitter as input, rather than the total number of times in which a heavy-hitter exists.
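As a reference point, the original sparse vector technique the abstract revisits is usually presented as AboveThreshold: compare noisy query answers to a noisy threshold and halt at the first apparent violation. The snippet below is the textbook version with standard noise scales; it is not the paper's new algorithm.

```python
# Textbook AboveThreshold (sparse vector technique): epsilon-DP no matter how
# many below-threshold queries are asked; halts at the first noisy "above".
import numpy as np

def above_threshold(queries, database, threshold, epsilon, rng=None):
    """queries: callables mapping a database to a number with sensitivity 1.
    Returns the index of the first query whose noisy value exceeds the noisy
    threshold, or None if no such query is found."""
    rng = np.random.default_rng() if rng is None else rng
    noisy_threshold = threshold + rng.laplace(0, 2.0 / epsilon)
    for i, q in enumerate(queries):
        if q(database) + rng.laplace(0, 4.0 / epsilon) >= noisy_threshold:
            return i    # halt after the first above-threshold query
    return None
```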
    We study online learning of finite-horizon Markov Decision Processes (MDPs) with adversarially changing loss functions and unknown dynamics. In each episode, the learner observes a trajectory realized by her policy chosen for this episode. In addition, the learner suffers and observes the loss accumulated along the trajectory, which we call aggregate bandit feedback. The learner, however, never observes any additional information about the loss; in particular, not the individual losses suffered along the trajectory. Our main result is a computationally efficient algorithm with $\sqrt{K}$ regret for this setting, where $K$ is the number of episodes. We efficiently reduce \emph{Online MDPs with Aggregate Bandit Feedback} to a novel setting: Distorted Linear Bandits (DLB). This setting is a robust generalization of linear bandits in which selected actions are adversarially perturbed. We give a computationally efficient online learning algorithm for DLB and prove a $\sqrt{T}$ regret bound, where $T$ is the number of time steps. Our algorithm is based on a schedule of increasing learning rates used in Online Mirror Descent with a self-concordant barrier regularization. We use the DLB algorithm to derive our main result of $\sqrt{K}$ regret.
    Clustering is a fundamental problem in data analysis. In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points. Despite significant research progress, the problem has so far resisted practical solutions. In this work we aim at providing simple, implementable differentially private clustering algorithms that provide utility when the data is "easy", e.g., when there exists a significant separation between the clusters. We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results. We are able to get improved sample complexity bounds in some cases of Gaussian mixtures and k-means. We complement our theoretical analysis with an empirical evaluation on synthetic data.
    We give an $(\varepsilon,\delta)$-differentially private algorithm for the Multi-Armed Bandit (MAB) problem in the shuffle model with a distribution-dependent regret of $O\left(\left(\sum_{a:\Delta_a>0}\frac{\log T}{\Delta_a}\right)+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}\right)$, and a distribution-independent regret of $O\left(\sqrt{kT\log T}+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}\right)$, where $T$ is the number of rounds, $\Delta_a$ is the suboptimality gap of action $a$, and $k$ is the total number of actions. Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.
    Streaming algorithms are algorithms for processing large data streams, using only a limited amount of memory. Classical streaming algorithms typically work under the assumption that the input stream is chosen independently from the internal state of the algorithm. Algorithms that utilize this assumption are called oblivious algorithms. Recently, there has been growing interest in studying streaming algorithms that maintain utility also when the input stream is chosen by an adaptive adversary, possibly as a function of previous estimates given by the streaming algorithm. Such streaming algorithms are said to be adversarially robust. By combining techniques from learning theory with cryptographic tools from the bounded storage model, we separate the oblivious streaming model from the adversarially robust streaming model. Specifically, we present a streaming problem for which every adversarially robust streaming algorithm must use polynomial space, while there exists a classical (oblivious) streaming algorithm that uses only polylogarithmic space. This is the first general separation between the capabilities of these two models, resolving one of the central open questions in adversarially robust streaming.
    We study the question of how to compute a point in the convex hull of an input set $S$ of $n$ points in $R^d$ in a differentially private manner. This question, which is trivial non-privately, turns out to be quite deep when imposing differential privacy. In particular, it is known that the input points must reside on a fixed finite subset $G\subseteq R^d$, and furthermore, the size of $S$ must grow with the size of $G$. Previous works focused on understanding how $n$ needs to grow with $|G|$, and showed that $n=O\left(d^{2.5}\cdot8^{\log^*|G|}\right)$ suffices (so $n$ does not have to grow significantly with $|G|$). However, the available constructions exhibit running time at least $|G|^{d^2}$, where typically $|G|=X^d$ for some (large) discretization parameter $X$, so the running time is in fact $\Omega(X^{d^3})$. In this paper we give a differentially private algorithm that runs in $O(n^d)$ time, assuming that $n=\Omega(d^4\log X)$. To get this result we study and exploit some structural properties of the Tukey levels (the regions $D_{\ge k}$ consisting of points whose Tukey depth is at least $k$, for $k=0,1,...$). In particular, we derive lower bounds on their volumes for point sets $S$ in general position, and develop a rather subtle mechanism for handling point sets $S$ in degenerate position (where the deep Tukey regions have zero volume). A naive approach to the construction of the Tukey regions requires $n^{O(d^2)}$ time. To reduce the cost to $O(n^d)$, we use an approximation scheme for estimating the volumes of the Tukey regions (within their affine spans in case of degeneracy), and for sampling a point from such a region, a scheme that is based on the volume estimation framework of Lovasz and Vempala (FOCS 2003) and of Cousins and Vempala (STOC 2015). Making this framework differentially private raises a set of technical challenges that we address.
    We study the sample complexity of learning threshold functions under the constraint of differential privacy. It is assumed that each labeled example in the training data is the information of one individual, and we would like to come up with a generalizing hypothesis while guaranteeing differential privacy for the individuals. Intuitively, this means that any single labeled example in the training data should not have a significant effect on the choice of the hypothesis. This problem has received much attention recently; unlike the non-private case, where the sample complexity is independent of the domain size and just depends on the desired accuracy and confidence, for private learning the sample complexity must depend on the domain size $X$ (even for approximate differential privacy). Alon et al. (STOC 2019) showed a lower bound of $\Omega(\log^*|X|)$ on the sample complexity, and Bun et al. (FOCS 2015) presented an approximate-private learner with sample complexity $\tilde{O}\left(2^{\log^*|X|}\right)$. In this work we reduce this gap significantly, almost settling the sample complexity. We first present a new upper bound (algorithm) of $\tilde{O}\left(\left(\log^*|X|\right)^2\right)$ on the sample complexity and then present an improved version with sample complexity $\tilde{O}\left(\left(\log^*|X|\right)^{1.5}\right)$. Our algorithm is constructed for the related interior point problem, where the goal is to find a point between the largest and smallest input elements. It is based on selecting an input-dependent hash function and using it to embed the database into a domain whose size is reduced logarithmically; this results in a new database, an interior point of which can be used to generate an interior point in the original database in a differentially private manner.
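The interior point problem the abstract reduces to has a simple baseline solution via the exponential mechanism, scoring each domain element by how deep it sits inside the input. The sketch below shows that baseline, whose sample complexity grows with log|X| rather than the paper's (log*|X|)^1.5-type bound; names and parameters are our own.

```python
# Baseline exponential mechanism for the interior point problem: score each
# domain element by min(#inputs <= x, #inputs >= x) (sensitivity 1) and sample
# proportionally to exp(eps * score / 2). Not the paper's improved algorithm.
import numpy as np

def private_interior_point(data, domain, epsilon, rng=None):
    """Return a domain element that, w.h.p. for large len(data), lies between
    min(data) and max(data); `domain` is a finite ordered list."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    scores = np.array([min(np.sum(data <= x), np.sum(data >= x)) for x in domain])
    logw = (epsilon / 2.0) * scores
    logw -= logw.max()
    probs = np.exp(logw) / np.exp(logw).sum()
    return domain[rng.choice(len(domain), p=probs)]

# Example: with enough samples relative to log|domain|, the output typically
# falls strictly inside the range of the data.
rng = np.random.default_rng(1)
print(private_interior_point(rng.integers(400, 600, size=200),
                             list(range(1000)), epsilon=1.0, rng=rng))
```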
    Near-optimal Regret Bounds for Stochastic Shortest Path
    Aviv Rosenberg
    International Conference on Machine Learning (ICML), 2020
    Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state with minimum total expected cost. In the learning formulation of the problem, the agent is unaware of the environment dynamics (i.e., the transition function) and has to repeatedly play for a given number of episodes while learning the problem's optimal solution. Unlike other well-studied models in reinforcement learning (RL), the length of an episode is not predetermined (or bounded) and is influenced by the agent's actions. Recently, Tarbouriech et al. (2019) studied this problem in the context of regret minimization, and provided an algorithm whose regret bound is inversely proportional to the square root of the minimum instantaneous cost. In this work we remove this dependence on the minimum cost: we give an algorithm that guarantees a regret bound of $O(B S \sqrt{A K})$, where $B$ is an upper bound on the expected cost of the optimal policy, $S$ is the number of states, $A$ is the number of actions and $K$ is the total number of episodes. We additionally show that any learning algorithm must have at least $\Omega(B \sqrt{S A K})$ regret in the worst case.
    We present a private learner for halfspaces over a finite grid $G$ in $R^d$ with sample complexity $d^{2.5}\cdot 2^{\log^*|G|}$, which improves the state-of-the-art result of [Beimel et al., COLT 2019] by a $d^2$ factor. The building block for our learner is a new differentially private algorithm for approximately solving the linear feasibility problem: given a feasible collection of $m$ linear constraints of the form $Ax\geq b$, the task is to privately identify a solution $x$ that satisfies most of the constraints. Our algorithm is iterative, where each iteration determines the next coordinate of the constructed solution $x$.
    Duality-based approximation algorithms for maximum depth
    Micha Sharir
    arXiv, https://arxiv.org/abs/2006.12318 (2020)
    An ε-incidence between a point q and some curve c (e.g., a line or circle) in the plane occurs when q is at (say, Euclidean) distance at most ε from c. We use variants of the recent grid-based primal-dual technique developed by the authors [ESA 2017, SoCG 2019] to design an efficient data structure for computing an approximate depth of any query point in the arrangement A(S) of a collection S of n halfplanes or triangles in the plane. We then use this structure to find a point of a suitably defined approximate maximum depth in A(S). Specifically, given an error parameter ε > 0, we compute (i) a point of approximate maximum depth, when we are allowed to exclude containments of points q in objects s when q is ε-incident to ∂s, and (ii) a point of approximate maximum depth, when we are allowed to include (false) containments of q in objects s when q is (outside s but) ε-incident to ∂s. For the case of triangles, the technique involves, on top of duality, a careful and efficient implementation of a multi-level structure over the input triangles within a primal grid.
    We study the Thompson sampling algorithm in an adversarial setting, specifically, for adversarial bit prediction. We characterize the bit sequences with the smallest and largest expected regret. Among sequences of length $T$ with $k < \frac{T}{2}$ zeros, the sequences of largest regret consist of alternating zeros and ones followed by the remaining ones, and the sequence of smallest regret consists of ones followed by zeros. We also bound the regret of those sequences: the worst-case sequences have regret $O(\sqrt{T})$ and the best-case sequence has regret $O(1)$. We extend our results to a model where false positive and false negative errors have different weights. We characterize the sequences with largest expected regret in this generalized setting, and derive bounds on their regret.
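As a concrete reference for the algorithm being analyzed, here is a minimal Thompson sampling predictor for a bit stream: maintain a Beta posterior over the bias, sample a bias, and predict 1 when the sample exceeds 1/2. The prior and the small experiment are illustrative choices of ours, not the paper's exact setup or loss model.

```python
# Minimal Thompson sampling for online bit prediction with a Beta(1, 1) prior.
# Illustration only; the paper's adversarial analysis uses its own setup.
import numpy as np

def thompson_bit_prediction(bits, rng=None):
    """Predict each bit of `bits` online; return the number of mistakes."""
    rng = np.random.default_rng() if rng is None else rng
    ones, zeros = 1, 1                     # Beta(1, 1) prior on Pr[bit = 1]
    mistakes = 0
    for b in bits:
        theta = rng.beta(ones, zeros)      # posterior sample of the bias
        prediction = 1 if theta > 0.5 else 0
        mistakes += int(prediction != b)
        ones += b                          # posterior update
        zeros += 1 - b
    return mistakes

# Example: the "ones followed by zeros" pattern from the abstract is easy,
# while an alternating sequence forces many more mistakes.
easy = [1] * 50 + [0] * 50
hard = [0, 1] * 50
print(thompson_bit_prediction(easy), thompson_bit_prediction(hard))
```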
    We derive and analyze learning algorithms for policy evaluation, policy gradient, and apprenticeship learning under the average-reward criterion. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, we build on ideas from Markov-chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample complexity and running time.
    We consider the applications of the Frank-Wolfe (FW) algorithm for Apprenticeship Learning (AL). In this setting, there is a Markov Decision Process (MDP), but the reward function is not given explicitly. Instead, there is an expert that acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert policy. We formulate this problem as finding the projection of the feature expectations of the expert onto the feature expectations polytope -- the convex hull of the feature expectations of all the deterministic policies in the MDP. We show that this formulation is equivalent to the AL objective and that solving this problem using the FW algorithm is equivalent to the best-known AL algorithm, the projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from the convex optimization literature and to derive tighter bounds on AL. Specifically, we show that a variation of the FW method that is based on taking ``away steps'' achieves a linear rate of convergence when applied to AL. We also show experimentally that this version outperforms the FW baseline. To the best of our knowledge, this is the first work that shows linear convergence rates for AL.
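The projection formulation above is easy to illustrate with a generic Frank-Wolfe loop when the vertices of the feature expectations polytope are given explicitly; in the real AL setting the linear minimization oracle would instead solve an MDP for the best deterministic policy against the negated gradient. The names, step size, and toy data below are assumptions for illustration, and this is the vanilla FW iteration rather than the away-steps variant the abstract analyzes.

```python
# Vanilla Frank-Wolfe projection of the expert's feature expectations onto the
# convex hull of candidate policies' feature expectations. Illustration only;
# the LMO here scans explicit vertices instead of solving an MDP, and the
# away-steps variant from the abstract is not implemented.
import numpy as np

def frank_wolfe_projection(expert_mu, policy_mus, iters=200):
    """Minimize ||x - expert_mu||^2 over x in conv(policy_mus)."""
    policy_mus = np.asarray(policy_mus, dtype=float)
    expert_mu = np.asarray(expert_mu, dtype=float)
    x = policy_mus[0].copy()                           # start at a vertex
    for t in range(1, iters + 1):
        grad = 2.0 * (x - expert_mu)
        s = policy_mus[np.argmin(policy_mus @ grad)]   # linear minimization oracle
        gamma = 2.0 / (t + 2.0)                        # standard FW step size
        x = (1 - gamma) * x + gamma * s
    return x

# Example: three "policies" in feature-expectation space; the expert lies in
# their convex hull, so the projection essentially recovers it.
print(frank_wolfe_projection([0.6, 0.7], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```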
    We consider the classical camera pose estimation problem that arises in many computer vision applications, in which we are given n 2D-3D correspondences between points in the scene and points in the camera image (some of which are incorrect associations), and where we aim to determine the camera pose (the position and orientation of the camera in the scene) from this data. We demonstrate that this posing problem can be reduced to the problem of computing ε-approximate incidences between two-dimensional surfaces (derived from the input correspondences) and points (on a grid) in a four-dimensional pose space. Similar reductions can be applied to other camera pose problems, as well as to similar problems in related application areas. We describe and analyze three techniques for solving the resulting ε-approximate incidences problem in the context of our camera posing application. The first is a straightforward assignment of surfaces to the cells of a grid (of side-length ε) that they intersect. The second is a variant of a primal-dual technique, recently introduced by a subset of the authors [2] for different (and simpler) applications. The third is a non-trivial generalization of a data structure of Fonseca and Mount [3], originally designed for the case of hyperplanes. We present and analyze this technique in full generality, and then apply it to the camera posing problem at hand. We compare our methods experimentally on real and synthetic data. Our experiments show that for the typical values of n and ε, the primal-dual method is the fastest, also in practice.
    Output sensitive algorithms for approximate incidences and their applications
    Micha Sharir
    Computational Geometry: Theory and Applications, vol. 91 (2019), article 101666
    An ε-approximate incidence between a point and some geometric object (line, circle, plane, sphere) occurs when the point and the object lie at distance at most ε from each other. Given a set of points and a set of objects, computing the approximate incidences between them is a major step in many database and web-based applications in computer vision and graphics, including robust model fitting, approximate point pattern matching, and estimating the fundamental matrix in epipolar (stereo) geometry. In a typical approximate incidence problem of this sort, we are given a set P of m points in two or three dimensions, a set S of n objects (lines, circles, planes, spheres), and an error parameter ε > 0, and our goal is to report all pairs (p, s) ∈ P × S that lie at distance at most ε from one another. We present efficient output-sensitive approximation algorithms for quite a few cases, including points and lines or circles in the plane, and points and planes, spheres, lines, or circles in three dimensions. Several of these cases arise in the applications mentioned above. Our algorithms report all pairs at distance ≤ ε, but may also report additional pairs, all of which are guaranteed to be at distance at most αε, for some problem-dependent constant α > 1. Our algorithms are based on simple primal and dual grid decompositions and are easy to implement. We note that (a) the use of duality, which leads to significant improvements in the overhead cost of the algorithms, appears to be novel for this kind of problem; (b) the correct choice of duality in some of these problems is fairly intricate and requires some care; and (c) the correctness and performance analysis of the algorithms (especially in the more advanced versions) is fairly non-trivial. We analyze our algorithms and prove guaranteed upper bounds on their running time and on the “distortion” parameter α.
    Imagine a large firm with multiple departments that plans a large recruitment. Candidates arrive one by one, and for each candidate the firm decides, based on her data (CV, skills, experience, etc.), whether to summon her for an interview. The firm wants to recruit the best candidates while minimizing the number of interviews. We model such scenarios as a matching problem between items (candidates) and categories (departments): the items arrive one by one in an online manner, and upon processing each item the algorithm decides, based on its value and the categories it can be matched with, whether to retain or discard it (this decision is irrevocable). The goal is to retain as few items as possible while guaranteeing that the set of retained items contains an optimal matching. We consider two variants of this problem: (i) in the first variant it is assumed that the n items are drawn independently from an unknown distribution D; (ii) in the second variant it is assumed that before the process starts, the algorithm has access to a training set of n items drawn independently from the same unknown distribution (e.g., data of candidates from previous recruitment seasons). We give tight bounds on the minimum possible number of retained items in each of these variants. These results demonstrate that one can retain exponentially fewer items in the second variant (with the training set). Our algorithms and analysis utilize ideas and techniques from statistical learning theory and from discrete algorithms.
    Clustering of data points is a fundamental tool in data analysis. We consider points $X$ in a relaxed metric space, where the triangle inequality holds within a constant factor. A clustering of $X$ is a partition of $X$ defined by a set of points $Q$ ({\em centroids}), according to the closest centroid. The {\em cost} of clustering $X$ by $Q$ is $V(Q)=\sum_{x\in X} d_{xQ}$. This formulation generalizes classic $k$-means clustering, which uses squared distances. Two basic tasks, parametrized by $k \geq 1$, are {\em cost estimation}, which returns (approximate) $V(Q)$ for queries $Q$ such that $|Q|=k$, and {\em clustering}, which returns an (approximate) minimizer of $V(Q)$ of size $|Q|=k$. When the data set $X$ is very large, we seek efficient constructions of small samples that can act as surrogates for performing these tasks. Existing constructions that provide quality guarantees, however, are either worst-case, and unable to benefit from structure of real data sets, or make explicit strong assumptions on the structure. We show here how to avoid both these pitfalls using adaptive designs. At the core of our design are the novel {\em one2all} probabilities, computed for a set $M$ of centroids and $\alpha \geq 1$: the clustering cost of {\em each} $Q$ with cost $V(Q) \geq V(M)/\alpha$ can be estimated well from a sample of size $O(\alpha |M|\epsilon^{-2})$. For cost estimation, we apply one2all with a bicriteria approximate $M$, while adaptively balancing $|M|$ and $\alpha$ to optimize sample size per quality. For clustering, we present a wrapper that adaptively applies a base clustering algorithm to a sample $S$, using the smallest sample that provides the desired statistical guarantees on quality. We demonstrate experimentally the huge gains of using our adaptive instead of worst-case methods.
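The cost estimation task above can be illustrated with a plain importance-sampling estimator: sample points with probabilities tied to their distance from a reference centroid set M and reweight. The probabilities below are generic importance sampling for illustration, not the paper's one2all probabilities or their worst-case guarantees.

```python
# Generic importance-sampling estimate of the clustering cost V(Q); the
# sampling probabilities are a plain illustration, not the one2all design.
import numpy as np

def dist_to_set(X, Q):
    """Distance from every row of X to its nearest point in Q."""
    X, Q = np.asarray(X, float), np.asarray(Q, float)
    return np.linalg.norm(X[:, None, :] - Q[None, :, :], axis=-1).min(axis=1)

def estimate_cost(X, M, Q, sample_size, rng=None):
    """Unbiased estimate of V(Q) = sum_x d(x, Q) from a weighted sample,
    with sampling probabilities proportional to d(x, M)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, float)
    base = dist_to_set(X, M) + 1e-12            # keep probabilities positive
    p = base / base.sum()
    idx = rng.choice(len(X), size=sample_size, replace=True, p=p)
    return float(np.sum(dist_to_set(X[idx], Q) / (sample_size * p[idx])))
```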
    An ε-approximate incidence between a point and some geometric object (line, circle, plane, sphere) occurs when the point and the object lie at distance at most ε from each other. Given a set of points and a set of objects, computing the approximate incidences between them is a major step in many database and web-based applications in computer vision and graphics, including robust model fitting, approximate point pattern matching, and estimating the fundamental matrix in epipolar (stereo) geometry. In a typical approximate incidence problem of this sort, we are given a set P of m points in two or three dimensions, a set S of n objects (lines, circles, planes, spheres), and an error parameter ε > 0, and our goal is to report all pairs (p, s) ∈ P × S that lie at distance at most ε from one another. We present efficient output-sensitive approximation algorithms for quite a few cases, including points and lines or circles in the plane, and points and planes, spheres, lines, or circles in three dimensions. Several of these cases arise in the applications mentioned above. Our algorithms report all pairs at distance ≤ ε, but may also report additional pairs, all of which are guaranteed to be at distance at most αε, for some constant α > 1. Our algorithms are based on simple primal and dual grid decompositions and are easy to implement. We note, though, that (a) the use of duality, which leads to significant improvements in the overhead cost of the algorithms, appears to be novel for this kind of problem; (b) the correct choice of duality in some of these problems is fairly intricate and requires some care; and (c) the correctness and performance analysis of the algorithms (especially in the more advanced versions) is fairly non-trivial. We analyze our algorithms and prove guaranteed upper bounds on their running time and on the “distortion” parameter α. We also briefly describe the motivating applications and show how they can effectively exploit our solutions. The superior theoretical bounds on the performance of our algorithms, and their simplicity, make them indeed ideal tools for these applications. In a series of preliminary experiments (not included in this abstract), we substantiate this feeling and show that our algorithms lead in practice to significantly improved performance of the aforementioned applications.
    Reporting Neighbors in High-Dimensional Euclidean Space
    Micha Sharir
    SIAM Journal on Computing, vol. 43 (2014), pp. 1239-1511
    We consider the following problem, which arises in many database and web-based applications: Given a set P of n points in a high-dimensional space R^d and a distance r, we want to report all pairs of points of P at Euclidean distance at most r. We present two randomized algorithms, one based on randomly shifted grids, and the other on randomly shifted and rotated grids. The running time of both algorithms is of the form C(d)(n + k)log n, where k is the output size and C(d) is a constant that depends on the dimension d. The log n factor is needed to guarantee, with high probability, that all neighbor pairs are reported, and can be dropped if it suffices to report, in expectation, an arbitrarily large fraction of the pairs. When only translations are used, C(d) is of the form (a√d)^d, for some (small) absolute constant a ≈ 0.484; this bound is worst-case tight, up to an exponential factor of about 2^d. When both rotations and translations are used, C(d) can be improved to roughly 6.74^d, getting rid of the superexponential factor (√d)^d. When the input set (lies in a subset of d-space that) has low doubling dimension δ, the performance of the first algorithm improves to C(d,δ)(n + k)log n (or to C(d,δ)(n + k)), where C(d,δ) = O((ed/δ)^δ) for δ ≤ √d, and otherwise C(d,δ) = O(e^√d (√d)^δ). We also present experimental results on several large datasets, demonstrating that our algorithms run significantly faster than all the leading existing algorithms for reporting neighbors.
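The randomly shifted grid idea is easy to sketch: bucket points into cells of side r and compare only points in the same or adjacent cells. The simplified version below scans all 3^d neighboring cells, which reports every pair exactly but pays a 3^d factor; the paper's algorithms instead compare points only within single cells of several shifted (and rotated) grids to obtain the stated C(d)(n + k) log n bounds. Names and the interface are our own.

```python
# Simplified grid-based neighbor reporting: hash points into cells of side r
# (with a random shift) and compare points in the same or adjacent cells.
# Exact but with a 3^d overhead; not the paper's algorithms.
from collections import defaultdict
from itertools import product
import numpy as np

def report_neighbors(points, r, rng=None):
    """Report all pairs (i, j), i < j, of points at Euclidean distance <= r."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(points, dtype=float)
    d = pts.shape[1]
    shift = rng.uniform(0, r, size=d)              # random shift of the grid
    cells = defaultdict(list)
    for i, p in enumerate(pts):
        cells[tuple(np.floor((p + shift) / r).astype(int))].append(i)
    pairs = set()
    offsets = list(product((-1, 0, 1), repeat=d))  # 3^d neighboring cells
    for cell, idxs in cells.items():
        for off in offsets:
            for j in cells.get(tuple(np.add(cell, off)), []):
                for i in idxs:
                    if i < j and np.linalg.norm(pts[i] - pts[j]) <= r:
                        pairs.add((i, j))
    return pairs
```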
    Joint Cache Partition and Job Assignment on Multi-Core Processors
    WADS'13: Proceedings of the 13th International Conference on Algorithms and Data Structures (2013)
    Multicore shared-cache processors pose a challenge for designers of embedded systems who try to achieve minimal and predictable execution time of workloads consisting of several jobs. To address this challenge the cache is statically partitioned among the cores and the jobs are assigned to the cores so as to minimize the makespan. Several heuristic algorithms have been proposed that jointly decide how to partition the cache among the cores and assign the jobs. We initiate a theoretical study of this problem, which we call the joint cache partition and job assignment problem. By a careful analysis of the possible cache partitions we obtain a constant-approximation algorithm for this problem. For some practical special cases we obtain a 2-approximation algorithm, and show how to improve the approximation factor even further by allowing the algorithm to use additional cache. We also study possible improvements that can be obtained by allowing dynamic cache partitions and dynamic job assignments. We define a natural restriction of the well-known scheduling problem on unrelated machines in which machines are ordered by “strength”. We call this restriction the ordered unrelated machines scheduling problem. We show that our joint cache partition and job assignment problem is harder than this scheduling problem. The ordered unrelated machines scheduling problem is of independent interest and we give a polynomial time algorithm for certain natural workloads.
    Many practically deployed flow algorithms produce the output as a set of values associated with the network links. However, to actually deploy a flow in a network we often need to represent it as a set of paths between the source and destination nodes. In this paper we consider the problem of decomposing a flow into a small number of paths. We show that there is some fixed constant β > 1 such that it is NP-hard to find a decomposition in which the number of paths is larger than the optimal by a factor of at most β. Furthermore, this holds even if arcs are associated with only three different flow values. We also show that straightforward greedy algorithms for the problem can produce much larger decompositions than the optimal one, on certain well-tailored inputs. On the positive side we present a new approximation algorithm that decomposes all but an ε-fraction of the flow into at most O(1/ε^2) times the smallest possible number of paths. We compare the decompositions produced by these algorithms on real production networks and on synthetically generated data. Our results indicate that the dependency of the decomposition size on the fraction of flow covered is exponential. Hence, covering the last few percent of the flow may be costly, so if the application allows, it may be a good idea to decompose most but not all of the flow. The experiments also reveal that while for realistic data the greedy approach works very well, our novel algorithm, which has a provable worst-case guarantee, typically produces only slightly larger decompositions.
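For reference, the straightforward greedy decomposition that the abstract compares against can be sketched as follows: repeatedly trace an s-t path through arcs that still carry flow and peel off its bottleneck value. The graph representation and names are our own, the flow is assumed acyclic, and this is the greedy baseline rather than the paper's approximation algorithm.

```python
# Greedy flow decomposition: peel off one s-t path at a time at its bottleneck
# value. Assumes an acyclic flow satisfying conservation; illustration of the
# greedy baseline, not the paper's approximation algorithm.
from collections import defaultdict

def greedy_decompose(flow, s, t):
    """flow: dict mapping arc (u, v) -> positive flow value. Returns a list of
    (path, value) pairs whose arc-wise sum equals the input flow."""
    flow = {e: v for e, v in flow.items() if v > 0}
    adj = defaultdict(list)
    for (u, v) in flow:
        adj[u].append(v)
    paths = []
    while any(u == s for (u, _) in flow):
        # Walk from s to t along arcs that still carry flow; conservation of
        # the remaining flow guarantees the walk reaches t.
        path, node = [s], s
        while node != t:
            node = next(v for v in adj[node] if flow.get((node, v), 0) > 0)
            path.append(node)
        value = min(flow[(u, v)] for u, v in zip(path, path[1:]))
        for e in zip(path, path[1:]):
            flow[e] -= value
            if flow[e] <= 1e-12:           # bottleneck arcs drop out
                del flow[e]
        paths.append((path, value))
    return paths

# Example: two units of flow split over two parallel routes from 's' to 't'.
f = {('s', 'a'): 1.0, ('a', 't'): 1.0, ('s', 'b'): 1.0, ('b', 't'): 1.0}
print(greedy_decompose(f, 's', 't'))
```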
    We study markets of indivisible items in which price-based (Walrasian) equilibria often do not exist due to the discrete non-convex setting. Instead we consider Nash equilibria of the market viewed as a game, where players bid for items, and where the highest bidder on an item wins it and pays his bid. We first observe that pure Nash equilibria of this game exactly correspond to price-based equilibria (and thus need not exist), but that mixed Nash equilibria always do exist, and we analyze their structure in several simple cases where no price-based equilibrium exists. We also undertake an analysis of the welfare properties of these equilibria, showing that while pure equilibria are always perfectly efficient (“first welfare theorem”), mixed equilibria need not be, and we provide upper and lower bounds on their amount of inefficiency.