Yishay Mansour

Yishay Mansour

Prof. Yishay Mansour got his PhD from MIT in 1990, following it he was a postdoctoral fellow in Harvard and a Research Staff Member in IBM T. J. Watson Research Center. Since 1992 he is at Tel-Aviv University, where he is currently a Professor of Computer Science and has serves as the first head of the Blavatnik School of Computer Science during 2000-2002. He was the founder and first director of the Israeli Center of Research Excellence in Algorithms. Prof. Mansour has published over 100 journal papers and over 200 proceeding papers in various areas of computer science with special emphasis on machine learning, algorithmic game theory, communication networks, and theoretical computer science and has supervised over a dozen graduate students in those areas. Prof. Mansour was named as an ACM fellow 2014, and he is currently an associate editor in a number of distinguished journals and has been on numerous conference program committees. He was both the program chair of COLT (1998) and the STOC (2016) and served twice on the COLT steering committee and is a member of the ALT steering committee.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Partially Interpretable Models with Guarantees on Coverage and Accuracy
    Nave Frost
    Zachary Lipton
    Michal Moshkovitz
    Algorithmic Learning Theory (ALT)(2024)
    Preview abstract Simple, sufficient explanations furnished by short decision lists can be useful for guiding stakeholder actions. Unfortunately, this transparency can come at the expense of the higher accuracy enjoyed by black box methods, like deep nets. To date, practitioners typically either (i) insist on the simpler model, forsaking accuracy; or (ii) insist on maximizing accuracy, settling for post-hoc explanations of dubious faithfulness. In this paper, we propose a hybrid \emph{partially interpretable model} that represents a compromise between the two extremes. In our setup, each input is first processed by a decision list that can either execute a decision or abstain, handing off authority to the opaque model. The key to optimizing the decision list is to optimally trade off the accuracy of the composite system against coverage (the fraction of the population that receives explanations). We contribute a new principled algorithm for constructing partially interpretable decision lists, providing theoretical guarantees addressing both interpretability and accuracy. As an instance of our result, we prove that when the optimal decision list has length $k$, coverage $c$, and $b$ mistakes, our algorithm will generate a decision list that has length no greater than $4k$, coverage at least $c/2$, and makes at most $4b$ mistakes. Finally, we empirically validate the effectiveness of the new model. View details
    Principal-Agent Reward Shaping in MDPs
    Omer Ben-Porat
    Michal Moshkovitz
    Boaz Taitler
    AAAI 2024
    Preview abstract Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. Our results establish the NP-hardness of the problem and offer polynomial approximation algorithms for two classes of instances: Stochastic trees and deterministic decision processes with a finite horizon. View details
    Preview abstract We study a generalization of boosting to the multiclass setting. We introduce a weak learning condition for multiclass classification that captures the original notion of weak learnability as being “slightly better than random guessing”. We give a simple and efficient boosting algorithm, that does not require realizability assumptions and its sample and oracle complexity bounds are independent of the number of classes. Furthermore, we utilize our new boosting technique in two fundamental settings: multiclass PAC learning and List PAC learning, resulting in simplified algorithms compared to previous works. View details
    Preview abstract We introduce the concurrent shuffle model of differential privacy. In this model we have multiple concurrent shufflers permuting messages from different, possibly overlapping, batches of users. Similarly to the standard (single) shuffle model, the privacy requirement is that the concatenation of all shuffled messages should be differentially private. We study the private continual summation problem (a.k.a. the counter problem) and show that the concurrent shuffle model allows for significantly improved error compared to a standard (single) shuffle model. Specifically, we give a summation algorithm with error $\Tilde{O}(n^{1/(2k+1)})$ with $k$ concurrent shufflers on a sequence of length $n$. Furthermore, we prove that this bound is tight for any $k$, even if the algorithm can choose the sizes of the batches adaptively. For $k=\log n$ shufflers, the resulting error is polylogarithmic, much better than $\Tilde{\Theta}(n^{1/3})$ which we show is the smallest possible with a single shuffler. We use our online summation algorithm to get algorithms with improved regret bounds for the contextual linear bandit problem. In particular we get optimal $\Tilde{O}(\sqrt{n})$ regret with $k= \Tilde{\Omega}(\log n)$ concurrent shufflers. View details
    Preview abstract Given a policy, we define a {\newprob}%\emph{safe zone} as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of the {\newprob }%safe zone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. Safe zones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal {\textsc{safeZones}}, and show that in general the problem is computationally hard. For this reason we concentrate on computing approximate {\textsc{safeZones}.} Our main result is a bi-criteria approximation algorithm which gives a factor of almost $2$ approximation for both the escape probability and safe zone size, using a polynomial size sample complexity. We conclude the paper with an empirical evaluation of our algorithm. View details
    Preview abstract In this work we revisit an interactive variant of joint differential privacy, recently introduced by Naor et al. [2023], and generalize it towards handling online processes in which existing privacy definitions seem too restrictive. We study basic properties of this definition and demonstrate that it satisfies (suitable variants) of group privacy, composition, and post processing. In order to demonstrate the advantages of this privacy definition compared to traditional forms of differential privacy, we consider the basic setting of online classification. We show that any (possibly non-private) learning rule can be effectively transformed to a private learning rule with only a polynomial overhead in the mistake bound. This demonstrates a stark difference with traditional forms of differential privacy, such as the one studied by Golowich and Livni [2021], where only a double exponential overhead in the mistake bound is known (via an information theoretic upper bound). View details
    Preview abstract We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an $\widetilde{O}(H^{2.5} \sqrt{ T|S||A| ( } \linebreak[1] \overline{ \mathcal{R}(\mathcal{O}) + H \log(\delta^{-1}) )})$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon and $\mathcal{R}(\mathcal{O}) = \mathcal{R}(\mathcal{O}_{\mathrm{sq}}^\mathcal{F}) + \mathcal{R}(\mathcal{O}_{\mathrm{log}}^\mathcal{P})$ is the sum of the regression oracles' regret, used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient and rate optimal regret minimization algorithm for adversarial CMDPs which operates under the minimal and standard assumption of online function approximation. Our technique relies on standard convex optimization algorithms, and we show that it is robust to approximation errors. View details
    Preview abstract We present regret minimization algorithms for stochastic contextual MDPs under minimum reachability assumption, using an access to an offline least square regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains $ \tilde{O}\left( {1}/{p_{min}}H|S|^{3/2}\sqrt{|A|T\log(\max\{|\F|,|\Fp|\}/\delta)} \right) $ regret bound, with probability $1-\delta$, where $\Fp$ and $\F$ are finite and realizable regressors classes used to approximate the dynamics and rewards respectively, $p_{min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to Contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear and etc.). In addition, we present a lower bound of $\Omega(\sqrt{T H |S| |A| \ln(|\F|)/\ln(|S||A|)})$, which holds even in the case of known dynamics. View details
    Preview abstract We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains a $\widetilde O(K^{6/7})$ regret bound, improving significantly over previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde O (K^{2/3})$. View details
    Preview abstract An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for \emph{swap regret}, and thus, along the way, imply convergence to a \emph{correlated} equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of \emph{weighted} regret minimization, with \emph{unknown} weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees. View details