Yishay Mansour

Prof. Yishay Mansour received his PhD from MIT in 1990, after which he was a postdoctoral fellow at Harvard and a Research Staff Member at the IBM T. J. Watson Research Center. Since 1992 he has been at Tel-Aviv University, where he is currently a Professor of Computer Science and served as the first head of the Blavatnik School of Computer Science from 2000 to 2002. He was the founder and first director of the Israeli Center of Research Excellence in Algorithms. Prof. Mansour has published over 100 journal papers and over 200 proceedings papers in various areas of computer science, with special emphasis on machine learning, algorithmic game theory, communication networks, and theoretical computer science, and has supervised over a dozen graduate students in those areas. Prof. Mansour was named an ACM Fellow in 2014. He is currently an associate editor of a number of distinguished journals and has served on numerous conference program committees. He was program chair of COLT (1998) and of STOC (2016), served twice on the COLT steering committee, and is a member of the ALT steering committee.
Authored Publications
    Principal-Agent Reward Shaping in MDPs
    Omer Ben-Porat
    Michal Moshkovitz
    Boaz Taitler
    AAAI 2024
    Preview abstract Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. Our results establish the NP-hardness of the problem and offer polynomial approximation algorithms for two classes of instances: Stochastic trees and deterministic decision processes with a finite horizon. View details
    Preview abstract We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an $\widetilde{O}\big(H^{2.5} \sqrt{T|S||A| \,( \mathcal{R}(\mathcal{O}) + H \log(\delta^{-1}) )}\big)$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon and $\mathcal{R}(\mathcal{O}) = \mathcal{R}(\mathcal{O}_{\mathrm{sq}}^\mathcal{F}) + \mathcal{R}(\mathcal{O}_{\mathrm{log}}^\mathcal{P})$ is the sum of the regression oracles' regret, used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient and rate optimal regret minimization algorithm for adversarial CMDPs which operates under the minimal and standard assumption of online function approximation. Our technique relies on standard convex optimization algorithms, and we show that it is robust to approximation errors. View details
    Uniswap Liquidity Provision: An Online Learning Approach
    Yogev Bar-On
    FC23: 3rd Workshop on Decentralized Finance (2023)
    Preview abstract Uniswap v3 is a decentralized exchange (DEX) that allows liquidity providers to allocate funds more efficiently by specifying an active price interval for their funds. This introduces the problem of finding an optimal strategy for choosing price intervals. We formalize this problem as an online learning problem with non-stochastic rewards. We use regret-minimization methods to show a Liquidity Provision strategy that guarantees a lower bound on the reward. This is true even for adversarial changes to asset pricing, and we express this bound in terms of the trading volume. View details
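The online-learning framing above lends itself to a simple experts-style sketch: treat each candidate price interval as an expert and reweight it by the fee reward it would have earned. The sketch below is a plain Hedge (exponential weights) loop under assumed full-information rewards and a hypothetical per-interval reward signal; it is not the paper's algorithm or its adversarial-volume analysis.

```python
import numpy as np

def hedge_liquidity_provision(reward_matrix, eta=0.1, seed=0):
    """Hedge over candidate price intervals (illustrative only).

    reward_matrix: shape (T, K); entry [t, k] is the normalized fee reward
    in [0, 1] that interval k would have earned in round t. Both this reward
    signal and the full-information feedback are simplifying assumptions.
    """
    rng = np.random.default_rng(seed)
    T, K = reward_matrix.shape
    log_w = np.zeros(K)                 # log-weights for numerical stability
    total_reward = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                    # distribution over intervals
        choice = rng.choice(K, p=p)
        total_reward += reward_matrix[t, choice]
        log_w += eta * reward_matrix[t] # full-information update
    return total_reward

# Example: 3 candidate intervals, 1000 rounds of synthetic rewards.
rewards = np.random.default_rng(1).uniform(0, 1, (1000, 3)) * np.array([0.3, 0.6, 0.4])
print(hedge_liquidity_provision(rewards))
```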
    Preview abstract An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for \emph{swap regret}, and thus, along the way, imply convergence to a \emph{correlated} equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of \emph{weighted} regret minimization, with \emph{unknown} weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees. View details
    Preview abstract Blackwell's celebrated theory measures approachability using the $\ell_2$ (Euclidean) distance. In many applications such as regret minimization, it is often more useful to study approachability under other distance metrics, most commonly the $\ell_\infty$ metric. However, the time and space complexity of the algorithms designed for $\ell_\infty$ approachability depend on the dimension of the space of the vectorial payoffs, which is often prohibitively large. We present a framework for converting high-dimensional $\ell_\infty$ approachability problems to low-dimensional \emph{pseudonorm} approachability problems, thereby resolving such issues. We first show that the $\ell_\infty$ distance between the average payoff and the approachability set can be equivalently defined as a \emph{pseudodistance} between a lower-dimensional average vector payoff and a new convex set we define. Next, we develop an algorithmic theory of pseudonorm approachability analogous to previous work on norm approachability, showing that it can be achieved via online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball. We then use that to show, modulo mild normalization assumptions, that there exists an $\ell_\infty$ approachability algorithm whose convergence is independent of the dimension of the original vector payoff. We further show that that algorithm admits a polynomial-time complexity, assuming that the original $\ell_\infty$-distance can be computed efficiently. We also give an $\ell_\infty$ approachability algorithm whose convergence is logarithmic in that dimension using an FTRL algorithm with a maximum-entropy regularizer. Finally, we illustrate the benefits of our framework by applying it to several problems in regret minimization. View details
    Preview abstract We introduce the concurrent shuffle model of differential privacy. In this model we have multiple concurrent shufflers permuting messages from different, possibly overlapping, batches of users. Similarly to the standard (single) shuffle model, the privacy requirement is that the concatenation of all shuffled messages should be differentially private. We study the private continual summation problem (a.k.a. the counter problem) and show that the concurrent shuffle model allows for significantly improved error compared to a standard (single) shuffle model. Specifically, we give a summation algorithm with error $\Tilde{O}(n^{1/(2k+1)})$ with $k$ concurrent shufflers on a sequence of length $n$. Furthermore, we prove that this bound is tight for any $k$, even if the algorithm can choose the sizes of the batches adaptively. For $k=\log n$ shufflers, the resulting error is polylogarithmic, much better than $\Tilde{\Theta}(n^{1/3})$ which we show is the smallest possible with a single shuffler. We use our online summation algorithm to get algorithms with improved regret bounds for the contextual linear bandit problem. In particular we get optimal $\Tilde{O}(\sqrt{n})$ regret with $k= \Tilde{\Omega}(\log n)$ concurrent shufflers. View details
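For context on the continual summation (counter) problem studied above, the sketch below implements the classical single-curator binary-tree counter: each prefix sum aggregates O(log n) noisy dyadic partial sums, giving polylogarithmic error in the central model. It is only the standard baseline, not the paper's concurrent-shuffle mechanism.

```python
import numpy as np

def private_prefix_sums(bits, eps=1.0, seed=0):
    """Classical binary-tree mechanism for private continual summation.

    Releases all prefix sums of a 0/1 stream, adding Laplace noise to each
    dyadic-interval node; every prefix sum is a sum of O(log n) noisy nodes.
    Central-model baseline only (not the concurrent-shuffle algorithm).
    """
    rng = np.random.default_rng(seed)
    n = len(bits)
    levels = int(np.ceil(np.log2(max(n, 2)))) + 1
    exact_node, noisy_node = {}, {}
    outputs = []
    for t, b in enumerate(bits, start=1):
        # Update every tree node whose dyadic interval contains position t.
        for lvl in range(levels):
            idx = (t - 1) >> lvl
            exact_node[(lvl, idx)] = exact_node.get((lvl, idx), 0) + b
        # Decompose [1, t] into completed dyadic nodes and sum noisy values.
        s, rem = 0.0, t
        while rem > 0:
            lvl = (rem & -rem).bit_length() - 1   # largest power of 2 dividing rem
            idx = (rem >> lvl) - 1
            if (lvl, idx) not in noisy_node:      # noise added once per node
                noisy_node[(lvl, idx)] = exact_node[(lvl, idx)] + rng.laplace(scale=levels / eps)
            s += noisy_node[(lvl, idx)]
            rem -= 1 << lvl
        outputs.append(s)
    return outputs

print(private_prefix_sums([1, 0, 1, 1, 0, 1, 1, 0], eps=1.0)[-1])
```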
    Preview abstract A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call to the half-century+ founding theory of losses for class probability estimation (properness), an extension of Long and Servedio's results and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate for a more general stanpoint on the problem as we argue that the source of the negative result lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML: \textit{parameterisation}. View details
    Preview abstract Given a policy, we define a \emph{safe zone} as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of the safe zone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. Safe zones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal safe zones, and show that in general the problem is computationally hard. For this reason we concentrate on computing approximate safe zones. Our main result is a bi-criteria approximation algorithm which gives a factor of almost $2$ approximation for both the escape probability and safe zone size, using a polynomial size sample complexity. We conclude the paper with an empirical evaluation of our algorithm. View details
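A candidate safe zone is easy to *evaluate* by Monte Carlo, even though *finding* a good one is the hard part the paper addresses. The sketch below only estimates the escape probability of a given state subset, assuming access to a simulator `step` and a `policy` callback (both hypothetical interfaces); it is not the paper's bi-criteria approximation algorithm.

```python
import random

def escape_probability(candidate_zone, policy, step, start_state,
                       horizon=100, num_trajectories=1000, seed=0):
    """Monte Carlo estimate of a candidate safe zone's escape probability.

    candidate_zone: set of states; policy: state -> action;
    step: (state, action) -> next state (an assumed simulator).
    Returns the fraction of sampled trajectories that leave the zone
    within `horizon` steps.
    """
    random.seed(seed)
    escapes = 0
    for _ in range(num_trajectories):
        s = start_state
        for _ in range(horizon):
            s = step(s, policy(s))
            if s not in candidate_zone:
                escapes += 1
                break
    return escapes / num_trajectories

# Toy example: a random walk on 0..9 that drifts toward 0.
zone = set(range(5))
policy = lambda s: 0
step = lambda s, a: min(9, max(0, s + random.choice([-1, -1, 1])))
print(escape_probability(zone, policy, step, start_state=2))
```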
    Preview abstract We study a generalization of boosting to the multiclass setting. We introduce a weak learning condition for multiclass classification that captures the original notion of weak learnability as being “slightly better than random guessing”. We give a simple and efficient boosting algorithm, that does not require realizability assumptions and its sample and oracle complexity bounds are independent of the number of classes. Furthermore, we utilize our new boosting technique in two fundamental settings: multiclass PAC learning and List PAC learning, resulting in simplified algorithms compared to previous works. View details
    Preview abstract We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains a $\widetilde O(K^{6/7})$ regret bound, improving significantly over previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption that a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde O (K^{2/3})$. View details
    Preview abstract We present regret minimization algorithms for stochastic contextual MDPs under a minimum reachability assumption, using access to an offline least squares regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context, and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains a $\tilde{O}\left( ({1}/{p_{\min}}) H|S|^{3/2}\sqrt{|A|T\log(\max\{|\mathcal{F}|,|\mathcal{P}|\}/\delta)} \right)$ regret bound, with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{F}$ are finite and realizable regressor classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear). In addition, we present a lower bound of $\Omega(\sqrt{T H |S| |A| \ln(|\mathcal{F}|)/\ln(|S||A|)})$, which holds even in the case of known dynamics. View details
    Preview abstract There is often a great degree of freedom in the reward design when formulating a task as a reinforcement learning (RL) problem. The choice of reward function has significant impact on the learned policy and how fast the algorithm converges to it. Typically several iterations of specifying and learning with the reward function are necessary to find one that leads to sample-efficient learning of desired behavior. In this work, we instead propose to directly pass multiple alternate reward formulations of the task to the RL agent. We show that natural extensions of action-elimination algorithms to multiple rewards achieve more favorable instance-dependent regret bounds than their single-reward counterparts, both in multi-armed bandits and in tabular Markov decision processes. Specifically our bounds scale for each state-action pair with the inverse of the most favorable gap among all reward functions. This suggests that learning with multiple rewards can indeed be more sample-efficient, as long as the rewards agree on an optimal policy. We further prove that when rewards do not agree on the optimal policy, multi-reward action elimination in multi-armed bandits still learns a policy that is good across all reward functions. View details
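In the bandit case, the mechanism described above can be illustrated by a small action-elimination sketch: an arm is dropped as soon as *some* reward function certifies it suboptimal, so elimination is driven by the most favorable gap. This is an illustrative variant under the assumption that the rewards agree on the optimal arm, with standard Hoeffding confidence radii and a hypothetical `pull` environment callback; the paper's exact algorithm and constants differ.

```python
import numpy as np

def multi_reward_elimination(pull, n_arms, n_rewards, phases=200, delta=0.05):
    """Action elimination with several reward signals per pull.

    pull(arm) -> vector of n_rewards values in [0, 1], one per reward
    function (assumed environment interface). An arm is eliminated once it
    is confidently suboptimal under any single reward function.
    """
    active = list(range(n_arms))
    sums = np.zeros((n_arms, n_rewards))
    counts = np.zeros(n_arms)
    for _ in range(phases):
        if len(active) == 1:
            break
        for a in active:                       # one pull of every active arm
            sums[a] += pull(a)
            counts[a] += 1
        means = sums / counts[:, None]
        radius = np.sqrt(np.log(2 * n_arms * n_rewards * phases / delta)
                         / (2 * counts))
        keep = []
        for a in active:
            eliminated = False
            for r in range(n_rewards):
                best_lcb = max(means[b, r] - radius[b] for b in active)
                if means[a, r] + radius[a] < best_lcb:
                    eliminated = True          # suboptimal under reward r
                    break
            if not eliminated:
                keep.append(a)
        active = keep
    return active
```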
    Preview abstract We consider a seller faced with buyers which have the ability to delay their decision, which we call patience. Each buyer's type is composed of value and patience, and it is sampled i.i.d. from a distribution. The seller, using posted prices, would like to maximize her revenue from selling to the buyer. In this paper, we formalize this setting and characterize the resulting Stackelberg equilibrium, where the seller first commits to her strategy, and then the buyers best respond. Following this, we show how to compute both the optimal pure and mixed strategies. We then consider a learning setting, where the seller does not have access to the distribution over buyer's types. Our main results are the following. We derive a sample complexity bound for the learning of an approximate optimal pure strategy, by computing the fat-shattering dimension of this setting. Moreover, we provide a general sample complexity bound for the approximate optimal mixed strategy. We also consider an online setting and derive a vanishing regret bound with respect to both the optimal pure strategy and the optimal mixed strategy. View details
    Preview abstract In classic reinforcement learning (RL) problems, policies are evaluated with respect to some reward function and all optimal policies obtain the same expected return. However, when considering real-world dynamic environments in which different users have different preferences, a policy that is optimal for one user might be sub-optimal for another. In this work, we propose a multi-objective reinforcement learning framework that accommodates different user preferences over objectives, where preferences are learned via policy comparisons. Our setup consists of a Markov Decision Process with a multi-objective reward function, in which each user corresponds to an (unknown) personal preference vector and their reward in each state-action is the inner product of their preference vector with the multi-objective reward at that state-action. Our goal is to efficiently compute a near-optimal policy for a given user. We consider two user feedback models. We first address the case where a user is provided with two policies and the user feedback is their preferred policy. We then move to a different user feedback model, where a user is instead provided with two small weighted sets of representative trajectories and selects the preferred one. In both cases, we suggest an algorithm that finds a nearly optimal policy for the user using a small number of comparison queries. View details
    Preview abstract In this work we revisit an interactive variant of joint differential privacy, recently introduced by Naor et al. [2023], and generalize it towards handling online processes in which existing privacy definitions seem too restrictive. We study basic properties of this definition and demonstrate that it satisfies (suitable variants) of group privacy, composition, and post processing. In order to demonstrate the advantages of this privacy definition compared to traditional forms of differential privacy, we consider the basic setting of online classification. We show that any (possibly non-private) learning rule can be effectively transformed to a private learning rule with only a polynomial overhead in the mistake bound. This demonstrates a stark difference with traditional forms of differential privacy, such as the one studied by Golowich and Livni [2021], where only a double exponential overhead in the mistake bound is known (via an information theoretic upper bound). View details
    Differentially-Private Bayes Consistency
    Aryeh Kontorovich
    Shay Moran
    Menachem Sadigurschi
    Archive
    Preview abstract We construct a universally Bayes consistent learning rule which satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed, in the latter DP learning is extremely limited: even one-dimensional linear classifiers are not privately learnable in this stringent model. Our result thus demonstrates that by allowing the learning rate to depend on the target distribution, one can circumvent the above-mentioned impossibility result and in fact learn \emph{arbitrary} distributions by a single DP algorithm. As an application, we prove that any VC class can be privately learned in a semi-supervised setting with a near-optimal \emph{labelled} sample complexity of $\tilde O(d/\eps)$ labeled examples (and with an unlabeled sample complexity that can depend on the target distribution). View details
    Preview abstract Reinforcement learning typically assumes that the agent observes feedback from the environment immediately, but in many real-world applications (like recommendation systems) the feedback is observed in delay. Thus, we consider online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are only available at the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $\widetilde O ( \sqrt{K} + \sqrt{D} )$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_{k} d^k$ is the total delay. Under bandit feedback, we prove similar $\widetilde O ( \sqrt{K} + \sqrt{D} )$ regret assuming that the costs are stochastic, and $\widetilde O ( K^{2/3} + D^{2/3} )$ regret in the general case. View details
    Nonstochastic Bandits with Composite Anonymous Feedback
    Nicolo Cesa-Bianchi
    Tommaso Cesari
    Roberto Colomboni
    JMLR (2022)
    Preview abstract We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over the subsequent rounds in an adversarial way. The instantaneous loss observed by the player at the end of each round is then a sum of many loss components of previously played actions. This setting encompasses as a special case the easier task of bandits with delayed feedback, a well-studied framework where the player observes the delayed losses individually. Our first contribution is a general reduction transforming a standard bandit algorithm into one that can operate in the harder setting: We bound the regret of the transformed algorithm in terms of the stability and regret of the original algorithm. Then, we show that the transformation of a suitably tuned FTRL with Tsallis entropy has a regret of order $\sqrt{(d+1)KT}$, where $d$ is the maximum delay, $K$ is the number of arms, and $T$ is the time horizon. Finally, we show that our results cannot be improved in general by exhibiting a matching (up to a log factor) lower bound on the regret of any algorithm operating in this setting. View details
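One simple way to see the flavor of such a reduction is a blocked wrapper: commit to the base algorithm's action for a block of d+1 rounds and feed it the total loss observed during the block, so the spread-out loss components of that action are (approximately) captured. The sketch below is an illustrative wrapper around an assumed base-algorithm interface (`select`, `update`); it is not the paper's exact transformation or its tuning.

```python
class BlockedComposite:
    """Wrapper running a base bandit algorithm under composite anonymous losses.

    The base algorithm (assumed to expose select() -> arm and
    update(arm, loss)) is queried once per block of max_delay + 1 rounds; its
    arm is replayed for the whole block and it is fed the block's total
    observed loss, normalized to [0, 1]. Simplified, illustrative wrapper only.
    """

    def __init__(self, base_algo, max_delay):
        self.base = base_algo
        self.block = max_delay + 1
        self._arm = None
        self._acc = 0.0
        self._t_in_block = 0

    def select(self):
        if self._t_in_block == 0:
            self._arm = self.base.select()   # new decision at block start
        return self._arm

    def observe(self, observed_loss):
        self._acc += observed_loss
        self._t_in_block += 1
        if self._t_in_block == self.block:   # end of block: update base once
            self.base.update(self._arm, self._acc / self.block)
            self._arm, self._acc, self._t_in_block = None, 0.0, 0
```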
    Modeling Attrition in Recommender Systems with Departing Bandits
    Omer Ben-Porat
    Lee Cohen
    Liu Leqi
    Zachary Lipton
    AAAI 2022
    Preview abstract Traditionally, when recommender systems are formalized as multi-armed bandits, the policy of the recommender system influences the rewards accrued, but not the length of interaction. However, in real-world systems, dissatisfied users may depart (and never come back). In this work, we propose a novel multi-armed bandit setup that captures such policy-dependent horizons. Our setup consists of a finite set of user \emph{types}, and multiple arms with Bernoulli payoffs. Each (user type, arm) tuple corresponds to an (unknown) reward probability. Each user's type is initially unknown and can only be inferred through their response to recommendations. Moreover, if a user is dissatisfied with their recommendation, they might depart the system. We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal. We then move forward to the more challenging case, where users are divided among two types. While naive approaches cannot handle this setting, we provide an efficient learning algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, where $T$ is the number of users. View details
    Preview abstract We study the problem of \textit{semi-supervised} learning of an adversarially-robust predictor in the PAC model, where the learner has access to both \textit{labeled} and \textit{unlabeled} examples. The sample complexity in semi-supervised learning has two parameters, the number of labeled examples and the number of unlabeled examples. We consider the complexity measures $\mathrm{VC}_U \leq \dim_U \leq \mathrm{VC}$ and $\mathrm{VC}^*$, where $\mathrm{VC}$ is the standard VC-dimension, $\mathrm{VC}^*$ is its dual, and the other two measures appeared in \cite{montasser2019vc}. The best sample bound for robust supervised PAC learning is $\Lambda=O(\mathrm{VC}\cdot \mathrm{VC}^*)$, and we compare our sample bounds to $\Lambda$. Our main results are the following: (1) In the realizable setting it is sufficient to have $O(\mathrm{VC}_U)$ labeled examples and $O(\Lambda)$ unlabeled examples. (2) In the agnostic setting, let $\eta$ be the minimal error; if we allow an error of $2\eta+\epsilon$, it is still sufficient to have $O(\mathrm{VC}_U)$ labeled examples and $O(\Lambda)$ unlabeled examples, but if we insist on error $\eta+\epsilon$ then $\Omega(\dim_U)$ labeled examples are necessary. The above results show that there is a significant benefit in semi-supervised robust learning, as there are hypothesis classes with $\mathrm{VC}_U=0$ and $\dim_U$ arbitrarily large. Having access only to labeled examples requires at least $\dim_U$ labeled examples, while we require only $O(1)$ labeled examples and an \textit{unlabeled} sample size of the same order as the \textit{labeled} sample size required for supervised robust learning. Any improvement in the supervised robust sample complexity $\Lambda$ immediately improves our bounds. A byproduct of our result is that if we assume that the distribution is robustly realizable by a hypothesis class (i.e., there exists a hypothesis with zero robust error), then with respect to the non-robust loss (i.e., the standard $0$-$1$ loss) we can learn with only $O(\mathrm{VC}_U)$ labeled examples, even if the VC-dimension is infinite. View details
    Preview abstract A streaming algorithm is said to be adversarially robust if its accuracy guarantees are maintained even when the data stream is chosen maliciously, by an adaptive adversary. We establish a connection between adversarial robustness of streaming algorithms and the notion of differential privacy. This connection allows us to design new adversarially robust streaming algorithms that outperform the current state-of-the-art constructions for many interesting regimes of parameters. View details
    Preview abstract We present differentially private efficient algorithms for learning polygons in the plane (which are not necessarily convex). Our algorithm achieves $(\alpha,\beta)$-PAC learning and $(\varepsilon,\delta)$-differential privacy using a sample of size $O\left(\frac{k}{\alpha\varepsilon}\log\left(\frac{|X|}{\beta\delta}\right)\right)$, where the domain is $X\times X$ and $k$ is the number of edges in the (potentially non-convex) polygon. View details
    Preview abstract We study cooperative online learning in stochastic and adversarial Markov decision process (MDP). That is, in each episode, $m$ agents interact with an MDP simultaneously and share information in order to minimize their individual regret. We consider environments with two types of randomness: \emph{fresh} -- where each agent's trajectory is sampled i.i.d, and \emph{non-fresh} -- where the realization is shared by all agents (but each agent's trajectory is also affected by its own actions). More precisely, with non-fresh randomness the realization of every cost and transition is fixed at the start of each episode, and agents that take the same action in the same state at the same time observe the same cost and next state. We thoroughly analyze all relevant settings, highlight the challenges and differences between the models, and prove nearly-matching regret lower and upper bounds. To our knowledge, we are the first to consider cooperative reinforcement learning (RL) with either non-fresh randomness or in adversarial MDPs. View details
    Fair Wrapping for Black-box Predictions
    Alexander Soen
    Sanmi Koyejo
    Nyalleng Moorosi
    Ke Sun
    Lexing Xie
    NeurIPS (2022)
    Preview abstract We introduce a new family of techniques to post-process an accurate black box posterior and reduce its bias, born out of the recent analysis of improper loss functions whose optimisation can correct any \textit{twist} in prediction, unfairness being treated as one. Post-processing involves learning a function we define as an $\alpha$-tree for the correction, for which we provide two generic boosting compliant training algorithms. We show that our correction has appealing properties in terms of composition of corrections, generalization, interpretability and divergence to the black box. We exemplify the use of our technique for fairness compliance in three models: conditional value at risk, equality of opportunity and statistical parity and provide experiments on several readily available domains. View details
    Preview abstract The amount of training-data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training-data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Gyorfi, and Lugosi (1996) ask whether there exists a monotone Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm. We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses a black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye, Gyorfi, and Lugosi (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021). Our general transformation readily implies monotone learners in a variety of contexts: for example, Pestov’s result follows by applying it on any Bayes-consistent algorithm (e.g., k-Nearest-Neighbours). In fact, our transformation extends Pestov’s result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov’s work which is tailored to binary classification. In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions asked by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021). View details
    Preview abstract A dynamic algorithm against an adaptive adversary is required to be correct when the adversary chooses the next update after seeing the previous outputs of the algorithm. We obtain faster dynamic algorithms against an adaptive adversary and separation results between what is achievable in the oblivious vs. adaptive settings. To get these results we exploit techniques from differential privacy, cryptography, and adaptive data analysis. We give a general reduction transforming a dynamic algorithm against an oblivious adversary to a dynamic algorithm robust against an adaptive adversary. This reduction maintains several copies of the oblivious algorithm and uses differential privacy to protect their random bits. Using this reduction we obtain dynamic algorithms against an adaptive adversary with improved update and query times for global minimum cut, all pairs distances, and all pairs effective resistance. We further improve our update and query times by showing how to maintain a sparsifier over an expander decomposition that can be refreshed fast. This fast refresh enables it to be robust against what we call a blinking adversary that can observe the output of the algorithm only following refreshes. We believe that these techniques will prove useful for additional problems. On the flip side, we specify dynamic problems that, assuming a random oracle, every dynamic algorithm that solves them against an adaptive adversary must be polynomially slower than a rather straightforward dynamic algorithm that solves them against an oblivious adversary. We first show a separation result for a search problem and then show a separation result for an estimation problem. In the latter case our separation result draws from lower bounds in adaptive data analysis. View details
    Preview abstract Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or to the large number of data points that is required for accurate results. We propose a simple and practical tool $\mathsf{FriendlyCore}$ that takes a set of points $\cD$ from an unrestricted (pseudo) metric space as input. When $\cD$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a ``stable'' subset $\cC \subseteq \cD$ that includes all points, except possibly few outliers, and is {\em guaranteed} to have diameter $r$. $\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, $\mathsf{FriendlyCore}$ is light-weight with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation and clustering tasks such as $k$-means and $k$-GMM, outperforming tailored methods. View details
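A simplified, non-private illustration of the core idea: keep only points that are close to most other points; any two surviving points then share a close common neighbor, so the kept set has small diameter and can be aggregated more easily. The filter and threshold below are illustrative choices; FriendlyCore itself performs this kind of filtering privately, which is omitted here.

```python
import numpy as np

def simple_core(points, r):
    """Non-private sketch of a 'friendly core' style filter.

    Keeps every point that lies within distance r of more than half of the
    points. Any two kept points then have a common r-close neighbor, so the
    kept set has diameter at most 2r. The private machinery of the actual
    FriendlyCore tool is not shown.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    close_counts = (dists <= r).sum(axis=1)
    return points[close_counts > n / 2]

# Example: a tight cluster plus a few far outliers.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, size=(100, 2)),
                  rng.normal(50, 1, size=(3, 2))])
core = simple_core(data, r=6.0)
print(len(core))   # the outliers are dropped, the cluster survives
```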
    Preview abstract We study repeated two-player games where one of the players, the learner, employs a no-regret learning strategy, while the other, the optimizer, is a rational utility maximizer. We consider general Bayesian games, where the payoffs of both the optimizer and the learner could depend on the type, which is drawn from a publicly known distribution, but revealed privately to the learner. We address the following questions: (a) what is the bare minimum that the optimizer is guaranteed to obtain regardless of the no-regret learning algorithm employed by the learner? (b) are there learning algorithms that cap the optimizer payoff at this minimum? (c) can these generalizations be implemented efficiently? While building this theory of optimizer-learner interactions, we define a new combinatorial notion of regret called polytope swap regret, that could be of independent interest in other settings. View details
    Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
    Aviv Rosenberg
    Haipeng Luo
    tal lancewicki
    Tiancheng Jin
    NeurIPS 2022
    Preview abstract The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$. View details
    Preview abstract We present efficient differentially private algorithms for learning unions of polygons in the plane (which are not necessarily convex). Our algorithms are $(\alpha,\beta)$--probably approximately correct and $(\varepsilon,\delta)$--differentially private using a sample of size $\tilde{O}\left(\frac{1}{\alpha\varepsilon}k\log d\right)$, where the domain is $[d]\times[d]$ and $k$ is the number of edges in the union of polygons. Our algorithms are obtained by designing a private variant of the classical (nonprivate) learner for conjunctions using the greedy algorithm for set cover. View details
    Preview abstract A natural, seemingly undisputed understanding is that a learning algorithm must obtain a good fit to the training data in order to perform well on independent test data. This view is reflected both in the traditional bias-complexity trade-off, and in the newly emerging theory of generalization of interpolating learning rules. In this work, we ask to what extent may stochastic gradient descent (SGD) be similarly understood to generalize by means of fit to the training data. We consider the fundamental stochastic convex optimization framework, and seek bounds on the empirical risk and generalization gap of one-pass SGD. Surprisingly, we discover there exist convex learning problems where the output of SGD exhibits empirical risk and generalization gap both $\Omega(1)$, but generalizes well nonetheless with population risk of $O(1/\sqrt n)$, as guaranteed by the classical analysis. Consequently, it turns out SGD is not algorithmically stable in \emph{any} sense, and cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the variant of SGD given by sampling \emph{with}-replacement from the training set, for which we prove that, counter to what might be expected, overfitting does not occur and the population risk converges at the optimal rate. Finally, we study empirical risk guarantees of multiple epochs of without-replacement SGD, and derive nearly tight upper and lower bounds significantly improving over previously known results in this setting. View details
    Preview abstract Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks and yet, they perform well in many others. In fact, in practice, they are often selected as the top choices, due to their simplicity. But, for what tasks do such policies succeed? Can we give theoretical guarantees for their favorable performance? These crucial questions have been scarcely investigated, despite the prominent practical importance of these policies. This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration. Our results apply to value-function-based algorithms in episodic MDPs with bounded Bellman-Eluder dimension. We define an exploration-gap quantity, alpha, that captures a structural property of the MDP, the exploration policy and the given value function class. We show that the sample-complexity of myopic exploration scales quadratically with the inverse of this quantity, 1 / alpha^2. We further demonstrate through concrete examples that the exploration gap is indeed favorable in several tasks where myopic exploration succeeds, due to the corresponding dynamics and reward structure. View details
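Epsilon-greedy is the prototypical myopic exploration policy analyzed above. The sketch below is a minimal tabular Q-learning loop with epsilon-greedy action selection, written against an assumed gym-style environment interface (`reset`, `step`); it only illustrates the policy class being studied, not the paper's exploration-gap analysis.

```python
import numpy as np

def q_learning_epsilon_greedy(env, n_states, n_actions, episodes=500,
                              epsilon=0.1, alpha=0.1, gamma=0.99, seed=0):
    """Tabular Q-learning with epsilon-greedy (myopic) exploration.

    env is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done), with integer states.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))   # explore uniformly
            else:
                a = int(np.argmax(Q[s]))           # exploit current estimate
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```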
    Preview abstract We study learning contextual MDPs using a function approximation for both the rewards and the dynamics We consider both the case where the dynamics is known and unknown, and the case that the dynamics dependent or independent of the context. For all four models we derive polynomial sample and time complexity (assuming an efficient ERM oracle). Our methodology gives a general reduction from learning contextual MDP to supervised learning. View details
    Adversarial Dueling Bandits
    Aadirupa Saha
    ICML 2021 (2021) (to appear)
    Preview abstract We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary `win-loss' feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially. Our main result is an algorithm whose $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K^{1/3}T^{2/3})$ lower bound. We also prove a similar high probability regret bound. We further consider a simpler \emph{fixed-gap} adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an $\smash{ \tilde{O}((K/\Delta^2)\log{T}) }$ regret algorithm, where $\Delta$ is the gap in Borda scores between the best item and all other items, and show a lower bound of $\Omega(K/\Delta^2)$ indicating that our dependence on the main problem parameters $K$ and $\Delta$ is tight (up to logarithmic factors). View details
    Dueling Convex Optimization
    Aadirupa Saha
    ICML 2021 (2021) (to appear)
    Preview abstract We address the problem of convex optimization with preference (dueling) feedback. Like the traditional optimization objective, the goal is to find the optimal point with least possible query complexity, however without the luxury of even a zeroth order feedback (function value at the queried point). Instead, the learner can only observe a single noisy bit which is a win-loss feedback for a pair of queried points based on the difference of their function values. The problem is undoubtedly of great practical relevance as in many real world scenarios, such as recommender systems or learning from customer preferences, where the system feedback is often restricted to just one binary-bit preference information. We consider the problem of online convex optimization (OCO) solely by actively querying $\{0,1\}$ noisy-comparison feedback of decision point pairs, with the objective of finding a near-optimal point (function minimizer) with the least possible number of queries. For the non-stationary OCO setup, where the underlying convex function may change over time, we prove an impossibility result towards achieving the above objective. We next focus only on the stationary OCO problem and our main contribution lies in designing a normalized-gradient-descent based algorithm towards finding an $\epsilon$-best optimal point. Towards this, our algorithm is shown to yield a convergence rate of $\tilde O(\nicefrac{d\beta}{\epsilon \nu^2})$ ($\nu$ being the noise parameter) when the underlying function is $\beta$-smooth. Further we show an improved convergence rate of just $\tilde O(\nicefrac{d\beta}{\alpha \nu^2} \log \frac{1}{\epsilon})$ when the function is additionally also $\alpha$-strongly convex. View details
    Preview abstract We revisit one of the most basic and widely applicable techniques in the literature of differential privacy -- the sparse vector technique [Dwork et al., STOC 2009]. This simple algorithm privately tests whether the value of a given query on a database is close to what we expect it to be. It allows asking an unbounded number of queries as long as the answer is close to what we expect, and halts following the first query for which this is not the case. We suggest an alternative, equally simple, algorithm that can continue testing queries as long as any single individual does not contribute to the answer of too many queries whose answer deviates substantially from what we expect. Our analysis is subtle and some of its ingredients may be more widely applicable. In some cases our new algorithm allows to privately extract much more information from the database than the original. We demonstrate this by applying our algorithm to the shifting heavy-hitters problem: On every time step, each of n users gets a new input, and the task is to privately identify all the current heavy-hitters. That is, on time step i, the goal is to identify all data elements x such that many of the users have x as their current input. We present an algorithm for this problem with improved error guarantees over what can be obtained using existing techniques. Specifically, the error of our algorithm depends on the maximal number of times that a single user holds a heavy-hitter as input, rather than the total number of times in which a heavy-hitter exists. View details
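For reference, the classical sparse vector technique the abstract revisits is short enough to state in code: answer "below" until the first query whose noisy value exceeds a noisy threshold, then halt. The sketch below is the textbook AboveThreshold algorithm (one "above" answer, sensitivity-1 queries), not the paper's individual-charging variant.

```python
import numpy as np

def above_threshold(database, queries, threshold, eps, seed=0):
    """Classical sparse vector technique (AboveThreshold).

    queries: list of sensitivity-1 functions database -> float. Reports
    'below' until the first query whose noisy answer crosses the noisy
    threshold, then halts. Standard eps-differentially private baseline.
    """
    rng = np.random.default_rng(seed)
    noisy_t = threshold + rng.laplace(scale=2.0 / eps)
    answers = []
    for q in queries:
        if q(database) + rng.laplace(scale=4.0 / eps) >= noisy_t:
            answers.append("above")
            return answers            # halt after the first 'above' answer
        answers.append("below")
    return answers

# Example: counting queries on a list of records.
db = [1, 3, 3, 7, 7, 7]
qs = [lambda d, v=v: float(d.count(v)) for v in range(10)]
print(above_threshold(db, qs, threshold=2.5, eps=1.0))
```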
    Preview abstract Clustering is a fundamental problem in data analysis. In differentially private clustering, the goal is to identify k cluster centers without disclosing information on individual data points. Despite significant research progress, the problem had so far resisted practical solutions. In this work we aim at providing simple implementable differentially private clustering algorithms that provide utility when the data is "easy", e.g., when there exists a significant separation between the clusters. We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results. We are able to get improved sample complexity bounds in some cases of Gaussian mixtures and k-means. We complement our theoretical analysis with an empirical evaluation on synthetic data. View details
    Preview abstract We give an $(\eps,\delta)$-differentially private algorithm for the Multi-Armed Bandit (MAB) problem in the shuffle model with a distribution-dependent regret of $O\left(\left(\sum_{a:\Delta_a>0}\frac{\log T}{\Delta_a}\right)+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\eps}\right)$, and a distribution-independent regret of $O\left(\sqrt{kT\log T}+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\eps}\right)$, where $T$ is the number of rounds, $\Delta_a$ is the suboptimality gap of the action $a$, and $k$ is the total number of actions. Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model. View details
    Preview abstract We study multiple-source domain adaptation, when the learner has access to abundant labeled data from multiple source domains and limited labeled data from the target domain. We analyze existing algorithms and propose an instance-optimal approach based on model selection. We provide efficient algorithms and empirically demonstrate the benefits of our approach. View details
    Preview abstract There have been many recent advances on provably efficient reinforcement learning in problems with rich observation spaces and general function classes. Unfortunately, common to all such approaches is a realizability assumption, that requires the function class to contain the optimal value function of true MDP model, that holds in hardly any real-world setting. In this work, we consider the more realistic setting of agnostic reinforcement learning with a policy class (that may not contain any near-optimal policy). We provide an algorithm for this setting and prove instance-dependent regret bounds when the MDP has small rank $d$. Our bounds scale exponentially with the rank $d$ in the worst case but importantly are polynomial in the horizon, number of actions and the log number of policies. We further show through a nearly matching lower bound that this dependency on horizon is unavoidable. View details
    Preview abstract We study provably-efficient reinforcement learning in non-episodic factored Markov decision processes (FMDPs). All previous regret minimization algorithms in this setting made the strong assumption that the factored structure of the FMDP is known to the learner in advance. In this paper, we provide the first algorithm that learns the structure of the FMDP while minimizing the regret. Our algorithm is based on the optimism in face of uncertainty principle, combined with a simple statistical method for structure learning, and can be implemented efficiently given oracle-access to an FMDP planner. In addition, we give a variant of our algorithm that remains efficient even when the oracle is limited to non-factored actions, which is the case with almost all existing approximate planners. Finally, we also provide a novel lower bound for the known structure case that matches the best known regret bound of \citet{chen2020efficient}. View details
    Dueling Bandits with Team Comparisons
    Lee Cohen
    Ulrike Schmidt-Kraepelin
    NeurIPS 2021 (2021) (to appear)
    Preview abstract We introduce the \emph{dueling teams problem}, a new online-learning setting in which the learner compares disjoint pairs of $k$-sized \emph{teams} from a universe of $n$ players. The goal of the learner is to minimize the number of duels required to identify, with high probability, a \textit{Condorcet winning team}, i.e., a team which wins against any other disjoint team (with a probability of at least $1/2$). In particular, we assume a linear order on the teams which implies the existence of a Condorcet winning team. We formalize our model by building upon the dueling bandits setting \cite{Yue2012} and provide several algorithms, both for stochastic and deterministic settings. For the deterministic case, our algorithm identifies a Condorcet winning team after $\mathcal{O}(nk\log{k}+k\log k(\frac{\log \log k}{\Delta} + k))$ duels, where $\Delta$ is a gap parameter. In addition, we provide a gap-independent algorithm which requires $\mathcal{O}(nk\log{k}+2^{O(k)})$ duels. For the stochastic setting, we reduce this case to the deterministic case, under the assumption that the probabilities are bounded away from $1/2$. View details
    Preview abstract We study online convex optimization in the random order model, recently proposed by \citet{garber2020online}, where the loss functions may be chosen by an adversary, but are then presented to the online algorithm in a uniformly random order. We focus on the scenario where the cumulative loss function is (strongly) convex, yet individual loss functions are smooth but might be non-convex. Our algorithms achieve the optimal bounds and significantly outperform the results of \citet{garber2020online}, completely removing the dimension dependence and improving the dependence on the strong convexity parameter. Our analysis relies on novel connections between algorithmic stability and generalization for sampling without-replacement analogous to those studied in the with-replacement i.i.d.~setting, as well as on a refined average stability analysis of stochastic gradient descent. View details
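The without-replacement sampling scheme central to this analysis is easy to state in code: each pass visits every individual loss exactly once in a fresh random order. The sketch below is a generic without-replacement SGD loop for a user-supplied gradient oracle (an assumed interface); the paper's contribution is the analysis, not new code.

```python
import numpy as np

def sgd_without_replacement(grad, n, w0, lr=0.01, epochs=1, seed=0):
    """(Multi-)epoch SGD with without-replacement sampling.

    grad(w, i) returns the gradient of the i-th individual loss at w,
    for i in 0..n-1. Each epoch visits every loss exactly once in a fresh
    uniformly random order (the random-order / without-replacement scheme).
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(epochs):
        for i in rng.permutation(n):    # without-replacement order
            w -= lr * grad(w, i)
    return w

# Example: least squares, f_i(w) = 0.5 * (x_i @ w - y_i)^2.
rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(200, 5)), np.arange(5.0)
y = X @ true_w
grad = lambda w, i: (X[i] @ w - y[i]) * X[i]
print(sgd_without_replacement(grad, 200, np.zeros(5), epochs=5))
```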
    Preview abstract Streaming algorithms are algorithms for processing large data streams, using only a limited amount of memory. Classical streaming algorithms typically work under the assumption that the input stream is chosen independently from the internal state of the algorithm. Algorithms that utilize this assumption are called oblivious algorithms. Recently, there is a growing interest in studying streaming algorithms that maintain utility also when the input stream is chosen by an adaptive adversary, possibly as a function of previous estimates given by the streaming algorithm. Such streaming algorithms are said to be adversarially-robust. By combining techniques from learning theory with cryptographic tools from the bounded storage model, we separate the oblivious streaming model from the adversarially-robust streaming model. Specifically, we present a streaming problem for which every adversarially-robust streaming algorithm must use polynomial space, while there exists a classical (oblivious) streaming algorithm that uses only polylogarithmic space. This is the first general separation between the capabilities of these two models, resolving one of the central open questions in adversarial robust streaming. View details
    Preview abstract Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In this paper we consider adversarial SSPs that also account for adversarial changes in the costs over time, while the dynamics (i.e., transition function) remains unchanged. Formally, an agent interacts with an SSP environment for $K$ episodes, the cost function changes arbitrarily between episodes, and the fixed dynamics are unknown to the agent. We give high probability regret bounds of $\widetilde O (\sqrt{K})$ assuming all costs are strictly positive, and $\widetilde O (K^{3/4})$ for the general case. To the best of our knowledge, we are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it. View details
    Minimax Regret for Stochastic Shortest Path
    Aviv Rosenberg
    Yonathan Efroni
    NeurIPS 2021 (2021) (to appear)
    Preview abstract We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for $K$ episodes, and has to minimize her regret. In this work we show that the minimax regret for this setting is $\widetilde O(\sqrt{ (B_\star^2 + B_\star) |S| |A| K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state, $S$ is the state space, and $A$ is the action space. This matches the $\Omega (\sqrt{ B_\star^2 |S| |A| K})$ lower bound of \citet{rosenberg2020near} for $B_\star \ge 1$, and improves their regret bound by a factor of $\sqrt{|S|}$. For $B_\star < 1$ we prove a matching lower bound of $\Omega (\sqrt{ B_\star |S| |A| K})$. Our algorithm is based on a novel reduction from SSP to finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon. View details
    Preview abstract We present a theoretical and algorithmic study of the multiple-source domain adaptation problem in the common scenario where the learner has access only to a limited amount of labeled target data, but where he has at his disposal a large amount of labeled data from multiple source domains. We show that a new family algorithms based on model selection ideas benefit from very favorable guarantees in this scenario and discuss some theoretical obstacles affecting some alternative techniques. We also report the results of several experiments with our algorithms that demonstrate their practical effectiveness in several tasks View details
    Preview abstract We study online learning of finite-horizon Markov Decision Processes (MDPs) with adversarially changing loss functions and unknown dynamics. In each episode, the learner observes a trajectory realized by her policy chosen for this episode. In addition, the learner suffers and observes the loss accumulated along the trajectory which we call aggregate bandit feedback. The learner, however, never observes any additional information about the loss; in particular, the individual losses suffered along the trajectory. Our main result is a computationally-efficient algorithm with $\sqrt{K}$ regret for this setting, where $K$ is the number of episodes. We efficiently reduce \emph{Online MDPs with Aggregate Bandit Feedback} to a novel setting: Distorted Linear Bandits (DLB). This setting is a robust generalization of linear bandits in which selected actions are adversarially perturbed. We give a computationally-efficient online learning algorithm for DLB and prove a $\sqrt{T}$ regret bound, where $T$ is the number of time steps. Our algorithm is based on a schedule of increasing learning rates used in Online Mirror Descent with a self-concordant barrier regularization. We use the DLB algorithm to derive our main result of $\sqrt{K}$ regret. View details
    Preview abstract We study the stochastic Multi-Armed Bandit~(MAB) problem with random delays in the feedback received by the algorithm. We consider two settings: the {\it reward-dependent} delay setting, where realized delays may depend on the stochastic rewards, and the {\it reward-independent} delay setting. Our main contribution is algorithms that achieve near-optimal regret in each of the settings, with an additional additive dependence on the quantiles of the delay distribution. Our results do not make any assumptions on the delay distributions: in particular, we do not assume they come from any parametric family of distributions and allow for unbounded support and expectation; we further allow for the case of infinite delays where the algorithm might occasionally not observe any feedback. View details
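The delayed-feedback setting can be simulated with a small pending-queue loop: each pull's reward becomes observable only after its (random) delay, and arm statistics are updated when the feedback arrives. The sketch below uses a plain UCB1 index and hypothetical `pull`/`delay` environment callbacks; the paper's algorithms refine this with quantile-dependent terms and make no distributional assumptions on the delays.

```python
import numpy as np

def ucb_with_delays(pull, delay, n_arms, horizon=10000, seed=0):
    """UCB1 under randomly delayed feedback (illustrative sketch).

    pull(arm) -> reward in [0, 1]; delay() -> nonnegative integer delay
    (both assumed environment callbacks). Feedback generated at round t
    becomes observable only at round t + delay().
    """
    sums, counts = np.zeros(n_arms), np.zeros(n_arms)
    pending = []                            # (arrival_round, arm, reward)
    for t in range(1, horizon + 1):
        # Incorporate all feedback that has arrived by round t.
        arrived = [p for p in pending if p[0] <= t]
        pending = [p for p in pending if p[0] > t]
        for _, arm, r in arrived:
            sums[arm] += r
            counts[arm] += 1
        if (counts == 0).any():
            a = int(np.argmin(counts))      # prioritize arms with no feedback
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        pending.append((t + delay(), a, pull(a)))
    return sums, counts
```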
    A New Theoretical Framework for Fast and Accurate Online Decision-Making
    Nicolo Cesa-Bianchi
    Tommaso Cesari
    Vianney Perchet
    (2021) (to appear)
    Preview abstract We study a setting in which a learner faces a sequence of A/B tests and has to make as many good decisions as possible within a given budget constraint. Each A/B test $n=1,2,\ldots$ is associated with an unknown (and potentially negative) reward $\mu_n \in [-1,1]$, drawn i.i.d.\ from an unknown and fixed distribution. For each A/B test $n$, the learner draws i.i.d.\ samples of a $\{-1,1\}$-valued random variable with mean $\mu_n$ sequentially until a halting criterion is met. The learner then decides to either accept the reward $\mu_n$ or to reject it and get $0$ instead. We measure the learner's performance as the average reward per time step. More precisely, as the sum of the expected rewards of the accepted $\mu_n$ divided by the total number of time steps. Note that this is different than the average reward per $\mu_n$. We design algorithms and prove data-dependent regret bounds against any set of policies based on arbitrary halting criteria and decision rules. Though our algorithms borrow ideas from multiarmed bandits, the two settings are significantly different and not comparable. In fact, the value of $\mu_n$ is never observed directly in our setting---unlike rewards in stochastic bandits. Moreover, the particular structure of our problem allows our regret bounds to be independent of the number of policies. View details
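A toy policy in the spirit of this setting: for each A/B test, sample until a simple confidence-based halting criterion fires, then accept the reward only if the empirical mean is confidently positive, and measure total accepted reward per time step. The halting rule, confidence constant, and sample budget below are arbitrary illustrative choices, not those analyzed in the paper.

```python
import numpy as np

def run_ab_tests(mus, max_samples=2000, conf=2.0, seed=0):
    """Toy accept/reject policy for a stream of A/B tests.

    mus: true means in [-1, 1] of the tests (unknown to the policy).
    For each test we draw +/-1 samples until |empirical mean| exceeds a
    confidence radius or the budget is hit, then accept iff the mean is
    confidently positive. Returns total accepted reward per time step.
    """
    rng = np.random.default_rng(seed)
    total_reward, total_steps = 0.0, 0
    for mu in mus:
        s, n = 0.0, 0
        while n < max_samples:
            s += rng.choice([1.0, -1.0], p=[(1 + mu) / 2, (1 - mu) / 2])
            n += 1
            radius = conf * np.sqrt(np.log(max(n, 2)) / n)
            if abs(s / n) >= radius:
                break                        # halting criterion fires
        total_steps += n
        if s / n >= radius:
            total_reward += mu               # accept this test's reward
    return total_reward / total_steps

print(run_ab_tests(np.random.default_rng(1).uniform(-1, 1, size=50)))
```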
    Preview abstract We derive and analyze learning algorithms for policy evaluation, policy gradient and apprenticeship learning under the average reward criterion. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, we build on ideas from Markov-chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample complexity and running time. View details
    Preview abstract There is a growing interest in societal concerns in machine learning systems, especially in fairness. Multicalibration gives a comprehensive methodology to address group fairness. In this work, we address the multicalibration error and decouple it from the prediction error. The importance of decoupling the fairness metric (multicalibration) and the accuracy (prediction error) is due to the inherent trade-off between the two, and the societal decision regarding the “right trade-off” (as often imposed by regulators). Our work gives sample complexity bounds for uniform convergence guarantees of the multicalibration error, which implies that regardless of the accuracy, we can guarantee that the empirical and (true) multicalibration errors are close. We emphasize that our results: (1) are more general than previous bounds, as they apply to both agnostic and realizable settings, and do not rely on a specific type of algorithm (such as differentially private), (2) improve over previous multicalibration sample complexity bounds, and (3) imply uniform convergence guarantees for the classical calibration error. View details
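    For concreteness, here is a minimal sketch of one way to compute an empirical multicalibration error for binary outcomes: the largest calibration gap over (group, prediction-bucket) cells. The bucketing and the unweighted max are illustrative choices; the precise definition used in the paper may differ.

```python
import numpy as np

def empirical_multicalibration_error(preds, labels, groups, n_buckets=10):
    """Largest calibration gap over (group, prediction-bucket) cells.

    preds are predicted probabilities in [0, 1], labels are binary outcomes,
    and groups[i] is the group of example i (single membership, for simplicity).
    An illustrative variant of the empirical multicalibration error.
    """
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    groups = np.asarray(groups)
    buckets = np.minimum((preds * n_buckets).astype(int), n_buckets - 1)
    worst_gap = 0.0
    for g in np.unique(groups):
        for b in range(n_buckets):
            cell = (groups == g) & (buckets == b)
            if cell.any():
                gap = abs(preds[cell].mean() - labels[cell].mean())
                worst_gap = max(worst_gap, gap)
    return worst_gap
```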
    Preview abstract We study the sample complexity of learning threshold functions under the constraint of differential privacy. It is assumed that each labeled example in the training data represents the information of one individual, and we would like to come up with a generalizing hypothesis while guaranteeing differential privacy for the individuals. Intuitively, this means that any single labeled example in the training data should not have a significant effect on the choice of the hypothesis. This problem has received much attention recently; unlike the non-private case, where the sample complexity is independent of the domain size and just depends on the desired accuracy and confidence, for private learning the sample complexity must depend on the size of the domain $X$ (even for approximate differential privacy). Alon et al. (STOC 2019) showed a lower bound of $\Omega(\log^*|X|)$ on the sample complexity and Bun et al. (FOCS 2015) presented an approximate-private learner with sample complexity $\tilde{O}\left(2^{\log^*|X|}\right)$. In this work we reduce this gap significantly, almost settling the sample complexity. We first present a new upper bound (algorithm) of $\tilde{O}\left(\left(\log^*|X|\right)^2\right)$ on the sample complexity and then present an improved version with sample complexity $\tilde{O}\left(\left(\log^*|X|\right)^{1.5}\right)$. Our algorithm is constructed for the related interior point problem, where the goal is to find a point between the largest and smallest input elements. It is based on selecting an input-dependent hash function and using it to embed the database into a domain whose size is reduced logarithmically; this results in a new database, an interior point of which can be used to generate an interior point in the original database in a differentially private manner. View details
    Near-optimal Regret Bounds for Stochastic Shortest Path
    Aviv Rosenberg
    International Conference on Machine Learning (ICML) 2020 (2020)
    Preview abstract Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent is unaware of the environment dynamics (i.e., the transition function) and has to repeatedly play for a given number of episodes while learning the problem’s optimal solution. Unlike other well-studied models in reinforcement learning (RL), the length of an episode is not predetermined (or bounded) and is influenced by the agent’s actions. Recently, Tarbouriech et al. (2019) studied this problem in the context of regret minimization, and provided an algorithm whose regret bound is inversely proportional to the square root of the minimum instantaneous cost. In this work we remove this dependence on the minimum cost: we give an algorithm that guarantees a regret bound of $O(B S \sqrt{A K})$, where B is an upper bound on the expected cost of the optimal policy, S is the number of states, A is the number of actions and K is the total number of episodes. We additionally show that any learning algorithm must have at least $\Omega(B \sqrt{S A K})$ regret in the worst case. View details
    Preview abstract We present a private learner for halfspaces over a finite grid $G$ in $R^d$ with sample complexity $d^{2.5}\cdot 2^{\log^*|G|}$, which improves the state-of-the-art result of [Beimel et al., COLT 2019] by a $d^2$ factor. The building block for our learner is a new differentially private algorithm for approximately solving the linear feasibility problem: Given a feasible collection of $m$ linear constraints of the form $Ax\geq b$, the task is to privately identify a solution $x$ that satisfies most of the constraints. Our algorithm is iterative, where each iteration determines the next coordinate of the constructed solution $x$. View details
    Preview abstract When recruiting job candidates, employers rarely observe their underlying skill level directly. Instead, they must administer a series of interviews and/or collate other noisy signals in order to estimate the worker's skill. Traditional economics papers address screening models where employers access worker skill via a single noisy signal. In this paper, we extend this theoretical analysis to a multi-test setting, considering both Bernoulli and Gaussian models. We analyze the optimal employer policy both when the employer sets a fixed number of tests per candidate and when the employer can set a dynamic policy, assigning further tests adaptively based on results from the previous tests. To start, we characterize the optimal policy when employees constitute a single group, demonstrating some interesting trade-offs. Subsequently, we address the multi-group setting, demonstrating that when the noise levels vary across groups, a fundamental impossibility emerges whereby we cannot administer the same number of tests, subject candidates to the same decision rule, and yet realize the same outcomes in both groups. View details
    Preview abstract A streaming algorithm is said to be adversarially robust if its accuracy guarantees are maintained even when the data stream is chosen maliciously, by an adaptive adversary. We establish a connection between adversarial robustness of streaming algorithms and the notion of differential privacy. This connection allows us to design new adversarially robust streaming algorithms that outperform the current state-of-the-art constructions for many interesting regimes of parameters. View details
    Designing Committees for Mitigating Biases
    Michal Feldman
    Moshe Tennenholtz
    Noam Nisan
    Sigal Oren
    (2020) (to appear)
    Preview abstract It is widely observed that individuals prefer to interact with others who are more similar to them (this phenomenon is termed homophily). This similarity manifests itself in various ways such as beliefs, values and education. Thus, it should not come as a surprise that when people make hiring choices, for example, their similarity to the candidate plays a role in their choice. In this paper, we suggest that putting the decision in the hands of a committee instead of a single person can reduce this bias. We study a novel model of voting in which a committee of experts is constructed to reduce the biases of its members. We first present voting rules that optimally reduce the biases of a given committee. Our main results include the design of committees, for several settings, that are able to reach a nearly optimal (unbiased) choice. We also provide a thorough analysis of the trade-offs between the committee size and the obtained error. Our model is inherently different from the well-studied models of voting that focus on aggregation of preferences or on aggregation of information due to the introduction of similarity biases. View details
    Preview abstract Combinatorial Bandits is a generalization of multi-armed bandits, where $k$ out of $n$ arms are chosen at each round and the sum of the rewards is gained. We address the full-bandit feedback, in which the agent observes only the sum of rewards, in contrast to the semi-bandit feedback, in which the agent observes also the individual arms' rewards. We present the \emph{Combinatorial Successive Accepts and Rejects} (CSAR) algorithm, which is a generalization of the SAR algorithm (Bubeck et al., 2013) for the combinatorial setting. Our main contribution is an efficient sampling scheme that uses Hadamard matrices in order to estimate accurately the individual arms' expected rewards. We discuss two variants of the algorithm: the first minimizes the sample complexity and the second minimizes the regret. For the sample complexity we also prove a matching lower bound that shows it is optimal. For the regret minimization, we prove a lower bound which is tight up to a factor of $k$. View details
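    The sketch below illustrates the Hadamard-based estimation idea in isolation: observing only subset sums (full-bandit feedback), the individual arm means can be recovered by pulling the subsets given by the rows of a Hadamard matrix and inverting the resulting linear system. It assumes the number of arms is a power of two and that subsets of size about n/2 (and the full set) may be pulled; it is not the CSAR algorithm itself.

```python
import numpy as np
from scipy.linalg import hadamard

def estimate_arm_means(pull_subset, n, reps=200):
    """Recover individual arm means from sum-only (full-bandit) feedback.

    pull_subset(S) returns one noisy sample of sum_{i in S} mu_i. We pull the
    subsets given by the +1 entries of each Hadamard row; since the first row
    is all ones, y[0] estimates the total sum, and H @ mu = 2*y - y[0] with
    H @ H.T = n * I, so mu = H.T @ (2*y - y[0]) / n. Assumes n is a power of
    two; illustrates the estimation idea only, not the full CSAR algorithm.
    """
    H = hadamard(n)                               # +/-1 matrix, first row all +1
    y = np.empty(n)
    for i in range(n):
        S = np.flatnonzero(H[i] == 1)             # arms included in subset i
        y[i] = np.mean([pull_subset(S) for _ in range(reps)])
    return H.T @ (2 * y - y[0]) / n

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.4, 0.6, 0.9])               # unknown Bernoulli arm means
pull = lambda S: rng.binomial(1, mu[S]).sum()     # sum-only feedback
print(estimate_arm_means(pull, n=4))
```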
    Preview abstract We study reinforcement learning in tabular MDPs where the agent receives additional side observations per step in the form of several transition samples -- e.g. from data augmentation. We formalize this setting using a feedback graph over state-action pairs and show that model-based algorithms can leverage side observations for more sample-efficient learning. We give a regret bound that predominantly depends on the size of the maximum acyclic subgraph of the feedback graph, in contrast with a polynomial dependency on the number of states and actions in the absence of side observations. Finally, we highlight challenges when leveraging a small dominating set of the feedback graph as compared to the well-studied bandit setting and propose a new algorithm that can use such a dominating set to learn a near-optimal policy faster. View details
    Preview abstract We study the Thompson sampling algorithm in an adversarial setting, specifically, for adversarial bit prediction. We characterize the bit sequences with the smallest and largest expected regret. Among sequences of length $T$ with $k < \frac{T}{2}$ zeros, the sequences of largest regret consist of alternating zeros and ones followed by the remaining ones, and the sequence of smallest regret consists of ones followed by zeros. We also bound the regret of those sequences: the worst-case sequences have regret $O(\sqrt{T})$ and the best-case sequence has regret $O(1)$. We extend our results to a model where false positive and false negative errors have different weights. We characterize the sequences with largest expected regret in this generalized setting, and derive bounds on their regret. View details
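    A minimal version of the analyzed predictor is easy to simulate: Beta-Bernoulli Thompson sampling predicts the next bit by thresholding a posterior sample, and regret is measured against the best constant prediction in hindsight. The Beta(1,1) prior and the regret accounting are standard choices assumed here; the sequence constructed below is the worst-case pattern described in the abstract.

```python
import numpy as np

def thompson_bit_regret(bits, rng):
    """Beta-Bernoulli Thompson sampling for predicting a fixed bit sequence.

    Sample p ~ Beta(a, b), predict 1 iff p >= 1/2, suffer 0/1 loss for a wrong
    prediction, then update the posterior with the revealed bit. Regret is
    measured against the best constant prediction in hindsight."""
    a, b = 1.0, 1.0                        # Beta(1,1) prior (an assumption)
    loss = 0
    for bit in bits:
        pred = 1 if rng.beta(a, b) >= 0.5 else 0
        loss += int(pred != bit)
        a, b = a + bit, b + (1 - bit)      # posterior update with the true bit
    ones = sum(bits)
    return loss - min(ones, len(bits) - ones)

rng = np.random.default_rng(0)
T, k = 1000, 200
worst_case = [0, 1] * k + [1] * (T - 2 * k)   # alternating 0/1, then remaining 1s
print(thompson_bit_regret(worst_case, rng))
```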
    Preview abstract We consider the applications of the Frank-Wolfe (FW) algorithm for Apprenticeship Learning (AL). In this setting, there is a Markov Decision Process (MDP), but the reward function is not given explicitly. Instead, there is an expert that acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert policy. We formulate this problem as finding the projection of the feature expectations of the expert on the feature expectations polytope -- the convex hull of the feature expectations of all the deterministic policies in the MDP. We show that this formulation is equivalent to the AL objective and that solving this problem using the FW algorithm is equivalent to the best-known AL algorithm, the projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from the convex optimization literature and to derive tighter bounds on AL. Specifically, we show that a variation of the FW method that is based on taking ``away steps'' achieves a linear rate of convergence when applied to AL. We also show experimentally that this version outperforms the FW baseline. To the best of our knowledge, this is the first work that shows linear convergence rates for AL. View details
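    To illustrate the projection viewpoint, the sketch below runs vanilla Frank-Wolfe to project a target vector onto the convex hull of a finite set of vertices, which play the role of feature expectations of deterministic policies. In the actual AL setting the linear-minimization oracle is implemented by solving an MDP, which is abstracted away here, and the paper's linearly convergent variant additionally takes away steps, which this vanilla sketch omits.

```python
import numpy as np

def frank_wolfe_projection(mu_expert, vertices, n_iters=200):
    """Project mu_expert onto conv(vertices) with vanilla Frank-Wolfe.

    vertices plays the role of the feature expectations of deterministic
    policies; in apprenticeship learning the linear-minimization oracle below
    is implemented by solving an MDP, which is abstracted away here."""
    x = vertices[0].astype(float).copy()
    for t in range(1, n_iters + 1):
        grad = 2.0 * (x - mu_expert)               # gradient of ||x - mu_E||^2
        v = vertices[np.argmin(vertices @ grad)]   # linear-minimization oracle
        gamma = 2.0 / (t + 2)                      # standard FW step size
        x = (1 - gamma) * x + gamma * v            # move toward the chosen vertex
    return x

vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # toy "policies"
mu_expert = np.array([0.6, 0.6])                            # target outside the hull
print(frank_wolfe_projection(mu_expert, vertices))          # approx. [0.5, 0.5]
```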
    Preview abstract In this work we provide theoretical guarantees for reward decomposition in deterministic MDPs. Reward decomposition is a special case of Hierarchical Reinforcement Learning, which allows one to learn many policies in parallel and combine them into a composite solution. Our approach builds on mapping this problem into a Reward Discounted Traveling Salesman Problem, and then deriving approximate solutions for it. In particular, we focus on approximate solutions that are local, i.e., solutions that only observe information about the current state. Local policies are easy to implement and do not require many computational resources as they do not perform planning. While local deterministic policies, like Nearest Neighbor, are used in practice for hierarchical reinforcement learning, we propose three stochastic policies that guarantee better performance than any deterministic policy. View details
    Preview abstract We present the first computationally-efficient algorithm with $\tilde{O}(\sqrt{T})$ regret for learning in Linear Quadratic Control systems with unknown linear dynamics and known quadratic costs. View details
    Delay and Cooperation in Nonstochastic Bandits
    Nicolo Cesa-Bianchi
    Journal of Machine Learning Research (2019)
    Preview abstract We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce \textsc{Exp3-Coop}, a cooperative version of the {\sc Exp3} algorithm and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d+1 + \tfrac{K}{N}\alpha_{\le d}\bigr)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the $d$-th power of the communication graph $G$. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of \textsc{Exp3-Coop}, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay. View details
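    For reference, the core single-agent building block is standard Exp3 with importance-weighted loss estimates, sketched below. The cooperative ingredient of Exp3-Coop, averaging estimates received from neighbors within d hops, is omitted, and the learning rate is an illustrative choice.

```python
import numpy as np

def exp3(loss_matrix, eta, rng):
    """Single-agent Exp3 with importance-weighted loss estimates.

    loss_matrix[t, a] is the adversarial loss of action a at round t, in [0, 1].
    Exp3-Coop additionally averages such estimates received from network
    neighbors within d hops; that cooperative part is omitted here."""
    T, K = loss_matrix.shape
    weights = np.ones(K)
    total_loss = 0.0
    for t in range(T):
        probs = weights / weights.sum()
        a = rng.choice(K, p=probs)               # draw an action
        loss = loss_matrix[t, a]
        total_loss += loss
        est = np.zeros(K)
        est[a] = loss / probs[a]                 # unbiased importance-weighted estimate
        weights *= np.exp(-eta * est)            # exponential-weights update
    return total_loss

rng = np.random.default_rng(0)
T, K = 5000, 10
losses = rng.uniform(size=(T, K))
print(exp3(losses, eta=np.sqrt(np.log(K) / (T * K)), rng=rng))
```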
    Preview abstract We present and study models of adversarial online learning where the feedback observed by the learner is noisy, and the feedback is either full information feedback or bandit feedback. Specifically, we consider binary losses XORed with noise given by a Bernoulli random variable. We consider both a constant noise rate and a variable noise rate. Our main results are tight regret bounds for learning with noise in the adversarial online learning model. View details
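    A worked note on the constant-noise case: when the noise rate p < 1/2 is known, the observed loss can be debiased with a standard correction, which is one natural ingredient for this model (the paper's algorithms and tight bounds are more involved):

```latex
\text{Let } z=\ell\oplus n \text{ with true loss } \ell\in\{0,1\},\ n\sim\mathrm{Bernoulli}(p),\ p<\tfrac12.
\text{ Then } \mathbb{E}[z]=\ell(1-p)+(1-\ell)p=p+(1-2p)\ell,
\text{ so } \hat{\ell}=\frac{z-p}{1-2p} \text{ satisfies } \mathbb{E}[\hat{\ell}\,]=\ell .
```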
    Improved generalization bounds for robust learning
    Aryeh Kontorovich
    Idan Attias
    Chicago, IL, USA
    Preview abstract We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be used by the adversary during testing. Their aim is to build a robust classifier that would be tested on future adversarially corrupted data. We use a zero-sum game between the learner and the adversary as our game theoretic framework. The adversary is limited to $k$ possible corruptions for each input. Our model is closely related to the adversarial examples model of Schmidt et al. (2018) and Madry et al. (2017). We consider binary and multi-class classification settings, as well as the regression setting. Our main results are generalization bounds for all settings. For the binary classification setting, we both improve a generalization bound previously found in Feige, Mansour, and Schapire (2015), which handles a weighted average of hypotheses from $H$, and are also able to handle an infinite hypothesis class $H$. The sample complexity is improved from $O\left(\frac{1}{\varepsilon^4}\log\left(\frac{|H|}{\delta}\right)\right)$ to $O\left(\frac{1}{\varepsilon^2}\left(k\log(k)\,\mathrm{VC}(H) + \log\frac{1}{\delta}\right)\right)$. The core of all our proofs is based on bounds of the empirical Rademacher complexity. For the binary classification setting, we use a known regret minimization algorithm of Feige, Mansour, and Schapire (2015) that uses an ERM oracle as a black box, and we extend it to the multi-class and regression settings. The algorithm provides near-optimal policies for the players on a given training sample. The learner starts with a fixed hypothesis class $H$ and chooses a convex combination of hypotheses from $H$. The learner's loss is measured on adversarially corrupted inputs. Along the way, we obtain results on the fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. View details
    Preview abstract We consider a social planner faced with a stream of myopic selfish agents. The goal of the social planner is to maximize the social welfare; however, it is limited to using only information asymmetry (regarding previous outcomes) and cannot use any monetary incentives. The planner recommends actions to agents, but her recommendations need to be Bayesian Incentive Compatible to be followed by the agents. Our main result is an {\bf optimal} algorithm for the planner in the case that the action realizations are deterministic and have limited support, making significant progress on this open problem. Our optimal protocol has two interesting features. First, it always completes the exploration of {\em a priori} more beneficial actions before exploring a priori less beneficial actions. Second, the randomization in the protocol is correlated across agents and actions (and not independent at each decision time). View details
    Preview abstract We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the learner. We show an $\tilde{O}(L|X|\sqrt{|A|T})$ regret bound, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. Our online algorithm is implemented using entropic regularization methodology, which allows us to extend the original adversarial MDP model to handle convex performance criteria\footnote{A {\em performance criterion} aggregates all the losses of a single episode to a single objective we would like to minimize.}, as well as improve previous regret bounds. View details
    Preview abstract A basic question in learning theory is to identify if two distributions are identical when we have access only to examples sampled from the distributions. This basic task is considered, for example, in the context of Generative Adversarial Networks (GANs), where a discriminator is trained to distinguish between a real-life distribution and a synthetic distribution. Classically, we use a hypothesis class $H$ and claim that the two distributions are distinct if for some $h\in H$ the expected value on the two distributions is (significantly) different. Our starting point is the following fundamental problem: "is having the hypothesis dependent on more than a single random example beneficial". To address this challenge we define $k$-ary based discriminators, which have a family of Boolean $k$-ary functions $\mathcal{G}$. Each function $g\in \mathcal{G}$ naturally defines a hyper-graph, indicating whether a given hyper-edge exists. A function $g\in \mathcal{G}$ distinguishes between two distributions if the expected value of $g$, on a $k$-tuple of i.i.d. examples, on the two distributions is (significantly) different. We study the expressiveness of families of $k$-ary functions, compared to the classical hypothesis class $H$, which is $k=1$. We show a separation in expressiveness of $k+1$-ary versus $k$-ary functions. This demonstrates the great benefit of having $k\geq 2$ as distinguishers. For $k\geq 2$ we introduce a notion similar to the VC-dimension, and show that it controls the sample complexity. We proceed and provide upper and lower bounds as a function of our extended notion of VC-dimension. View details
    Preview abstract We study a network of agents communicating with each other and optimizing their performance in a common nonstochastic multi-armed bandit problem. We derive regret minimization algorithms that guarantee for each agent $v$ an individual expected regret of \[ \widetilde{O}\left(\sqrt{\left(1+\frac{K}{\left|\mathcal{N}\left(v\right)\right|}\right)T}\right), \] where $T$ is the number of time steps, $K$ is the number of actions and $\mathcal{N}\left(v\right)$ is the set of neighbors of agent $v$ in the communication graph. We present algorithms both for the case that the communication graph is known to all the agents, and for the case that the graph is unknown. When the communication graph is unknown, each agent knows only the set of its neighbors and a bound on the total number of agents. The individual regret between the models differ only by a logarithmic factor. Our work resolves an open problem from (\citet{cesa2019delay}). View details
    Preview abstract Imagine a large firm with multiple departments that plans a large recruitment. Candidates arrive one by-one, and for each candidate the firm decides, based on her data (CV, skills, experience, etc), whether to summon her for an interview. The firm wants to recruit the best candidates while minimizing the number of interviews. We model such scenarios as a matching problem between items (candidates) and categories (departments): the items arrive one-by-one in an online manner, and upon processing each item the algorithm decides, based on its value and the categories it can be matched with, whether to retain or discard it (this decision is irrevocable). The goal is to retain as few items as possible while guaranteeing that the set of retained items contains an optimal matching. We consider two variants of this problem: (i) In the first variant, it is assumed that the n items are drawn independently from an unknown distribution D. (ii) In the second variant, it is assumed that before the process starts, the algorithm has access to a training set of n items drawn independently from the same unknown distribution (e.g., data of candidates from previous recruitment seasons). We give tight bounds on the minimum possible number of retained items in each of these variants. These results demonstrate that one can retain exponentially fewer items in the second variant (with the training set). Our algorithms and analysis utilize ideas and techniques from statistical learning theory and from discrete algorithms. View details
    Preview abstract We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability $\beta > 0$ under any policy, we give a regret bound of $\tilde{O} ( L|X|\sqrt{|A|T} / \beta )$, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. When this assumption is removed we give a regret bound of $\tilde{O} ( L^{3/2} |X| |A|^{1/4} T^{3/4})$, that holds for an arbitrary transition function. To our knowledge these are the first algorithms that in our setting handle both bandit feedback and an unknown transition function. View details
    Preview abstract We study the problem of controlling linear time-invariant systems with known noisy dynamics and adversarially chosen quadratic losses. We present the first efficient online learning algorithms in this setting that guarantee $O(\sqrt{T})$ regret under mild assumptions, where $T$ is the time horizon. Our algorithms rely on a novel SDP relaxation for the steady-state distribution of the system. Crucially, and in contrast to previously proposed relaxations, the feasible solutions of our SDP all correspond to "strongly stable" policies that mix exponentially fast to a steady state. View details
    Planning and Learning with Stochastic Action Sets
    Martin Mladenov
    Proceedings of the Twenty-seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm (2018), pp. 4674-4682
    Preview abstract This is an extended version of the paper Planning and Learning with Stochastic Action Sets that appeared in the Proceedings of the Twenty-seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pp.4674-4682, Stockholm (2018). In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities; and a sampling-based model for unknown distributions. We develop poly-time value and policy iteration methods for both cases; and in the first, we offer a poly-time linear programming solution. View details
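    The soundness of Q-learning with sampled action sets boils down to a small change in the update: both the greedy action choice and the bootstrap max range only over the realized available set. Below is a sketch under an assumed environment interface (`env.reset()` returning the state and its available actions, `env.step(a)` returning the next state, reward, done flag, and next available actions); the interface names and hyperparameters are illustrative, not part of the paper.

```python
import numpy as np

def q_learning_sas(env, n_states, n_actions, episodes,
                   alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Q-learning with stochastic action sets (illustrative sketch).

    Assumed interface: env.reset() -> (state, available_actions) and
    env.step(a) -> (next_state, reward, done, next_available_actions), where
    the available-action lists are realizations of the exogenous process.
    The only change from vanilla Q-learning is that both the greedy choice
    and the bootstrap max range over the realized available set."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, avail = env.reset()
        avail = np.asarray(avail)
        done = False
        while not done:
            if rng.random() < eps:
                a = int(rng.choice(avail))               # explore within the available set
            else:
                a = int(avail[np.argmax(Q[s, avail])])   # greedy within the available set
            s2, r, done, avail2 = env.step(a)
            avail2 = np.asarray(avail2)
            target = r if done else r + gamma * Q[s2, avail2].max()
            Q[s, a] += alpha * (target - Q[s, a])        # standard TD update
            s, avail = s2, avail2
    return Q
```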
    History-Independent Distributed Multi-agent Learning
    Amos Fiat
    Algorithmic Game Theory: 9th International Symposium, SAGT 2016, Proceedings, Springer, pp. 77-89
    Preview abstract How should we evaluate a rumor? We address this question in a setting where multiple agents seek an estimate of the probability, b, of some future binary event. A common uniform prior on b is assumed. A rumor about b meanders through the network, evolving over time. The rumor evolves, not because of ill will or noise, but because agents incorporate private signals about b before passing on the (modified) rumor. The loss to an agent is the (realized) square error of her opinion. Our setting introduces strategic behavior based on evidence regarding an exogenous event to current models of rumor/influence propagation in social networks. We study a simple Exponential Moving Average (EMA) for combining an agent's own evidence and trusted advice (rumor), quantifying its resulting performance and comparing it to the optimum achievable using the Bayes posterior with access to the agents' private signals. We study the quality of p_T, the prediction of the last agent along a chain of T rumor-mongering agents. The prediction p_T can be viewed as an aggregate estimator of b that depends on the private signals of T agents. View details
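    A minimal simulation of the rumor chain is straightforward: each agent takes the incoming rumor, observes one private Bernoulli(b) signal, and passes on the exponential-moving-average update. The weight alpha and the one-signal-per-agent model are illustrative assumptions.

```python
import numpy as np

def rumor_chain_ema(b, T, alpha, rng):
    """Chain of T agents passing an evolving rumor about an unknown bias b.

    Each agent receives the current rumor p, observes one private Bernoulli(b)
    signal, and passes on the EMA update p <- (1 - alpha) * p + alpha * signal.
    The weight alpha and the single-signal-per-agent model are illustrative."""
    p = 0.5                               # common uniform prior on b
    for _ in range(T):
        signal = rng.binomial(1, b)       # agent's private signal
        p = (1 - alpha) * p + alpha * signal
    return p                              # p_T: the last agent's prediction

rng = np.random.default_rng(0)
print(rumor_chain_ema(b=0.7, T=50, alpha=0.2, rng=rng))
```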
    Robust Domain Adaptation
    Annals of Mathematics and Artificial Intelligence, vol. 71 Issue 4 (2014), 365–380
    Preview abstract We derive a generalization bound for domain adaptation by using the properties of robust algorithms. Our new bound depends on λ-shift, a measure of prior knowledge regarding the similarity of source and target domain distributions. Based on the generalization bound, we design SVM-variant algorithms for binary classification and regression domain adaptation. View details
    Ad Exchange – Proposal for a New Trading Agent Competition Game
    Agent-Mediated Electronic Commerce. Designing Trading Strategies and Mechanisms for Electronic Markets: AMEC and TADA (2013), pp. 133-145
    Learning with Maximum-Entropy Distributions
    Machine Learning, vol. 45, Issue 2 (2001), pp. 123-145