Yishay Mansour
Prof. Yishay Mansour received his PhD from MIT in 1990. Following that, he was
a postdoctoral fellow at Harvard and a Research Staff Member at the IBM T.
J. Watson Research Center. Since 1992 he has been at Tel Aviv University,
where he is currently a Professor of Computer Science; he served
as the first head of the Blavatnik School of Computer Science during
2000-2002. He was the founder and first director of the Israeli Center of Research
Excellence in Algorithms.
Prof. Mansour has published over 100 journal papers and over 200
conference papers in various areas of computer science, with special
emphasis on machine learning, algorithmic game theory, communication
networks, and theoretical computer science, and has supervised over a
dozen graduate students in these areas.
Prof. Mansour was named an ACM Fellow in 2014. He is currently an
associate editor of a number of distinguished journals and has served on
numerous conference program committees. He was the program chair
of COLT (1998) and of STOC (2016), has served twice on the COLT
steering committee, and is a member of the ALT steering committee.
Authored Publications
Abstract
Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. Our results establish the NP-hardness of the problem and offer polynomial-time approximation algorithms for two classes of instances: stochastic trees and deterministic decision processes with a finite horizon.
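As a rough sketch of the interaction described above (the notation here is illustrative rather than taken from the paper): the principal commits to a bonus reward $b$ subject to a budget, and the agent best-responds with a policy maximizing its own shaped return, so the principal solves a bilevel problem of the form
\[
\pi^{\star}(b) \in \arg\max_{\pi} \; \mathbb{E}^{\pi}\Big[\sum_{t} \big(r_A(s_t,a_t) + b(s_t,a_t)\big)\Big],
\qquad
\max_{b \ge 0,\; \|b\| \le B} \; \mathbb{E}^{\pi^{\star}(b)}\Big[\sum_{t} r_P(s_t,a_t)\Big],
\]
where $r_A$ and $r_P$ denote the agent's and principal's reward functions, $B$ is the principal's budget, and the exact form of the budget constraint follows the paper.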
Abstract
Simple, sufficient explanations furnished by short decision lists can be useful for guiding stakeholder actions. Unfortunately, this transparency can come at the expense of the higher accuracy enjoyed by black box methods, like deep nets. To date, practitioners typically either (i) insist on the simpler model, forsaking accuracy; or (ii) insist on maximizing accuracy, settling for post-hoc explanations of dubious faithfulness. In this paper, we propose a hybrid \emph{partially interpretable model} that represents a compromise between the two extremes. In our setup, each input is first processed by a decision list that can either execute a decision or abstain, handing off authority to the opaque model. The key to optimizing the decision list is to optimally trade off the accuracy of the composite system against coverage (the fraction of the population that receives explanations). We contribute a new principled algorithm for constructing partially interpretable decision lists, providing theoretical guarantees addressing both interpretability and accuracy. As an instance of our result, we prove that when the optimal decision list has length $k$, coverage $c$, and $b$ mistakes, our algorithm will generate a decision list that has length no greater than $4k$, coverage at least $c/2$, and makes at most $4b$ mistakes. Finally, we empirically validate the effectiveness of the new model.
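A minimal sketch of the composite predictor described above, assuming the decision list is given as ordered (condition, label) rules and the black box is an arbitrary classifier; the names and structure here are hypothetical, not the paper's implementation.

# Minimal sketch: a partially interpretable predictor that first consults a
# short decision list and falls back to an opaque model when no rule fires.
from typing import Any, Callable, List, Tuple

Rule = Tuple[Callable[[Any], bool], int]  # (condition, label)

class PartiallyInterpretablePredictor:
    def __init__(self, decision_list: List[Rule], black_box: Callable[[Any], int]):
        self.decision_list = decision_list  # short, human-readable rules
        self.black_box = black_box          # e.g., a deep net's predict function

    def predict(self, x: Any) -> Tuple[int, bool]:
        """Return (prediction, covered); covered=True means a rule explained x."""
        for condition, label in self.decision_list:
            if condition(x):
                return label, True       # interpretable path: a rule fires
        return self.black_box(x), False  # abstain: hand off to the opaque model

# Toy usage on 2D inputs:
rules = [(lambda x: x[0] > 0.9, 1), (lambda x: x[1] < 0.1, 0)]
model = PartiallyInterpretablePredictor(rules, black_box=lambda x: int(x[0] + x[1] > 1.0))
print(model.predict((0.95, 0.3)))  # covered by the first rule
print(model.predict((0.5, 0.5)))   # handed off to the black box

Coverage in this sketch is the fraction of inputs for which some rule fires, matching the accuracy-versus-coverage trade-off described in the abstract.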
Abstract
We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an
$\widetilde{O}\big(H^{2.5} \sqrt{ T|S||A| \big( \mathcal{R}(\mathcal{O}) + H \log(\delta^{-1}) \big)}\big)$ regret guarantee,
with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon and $\mathcal{R}(\mathcal{O}) = \mathcal{R}(\mathcal{O}_{\mathrm{sq}}^\mathcal{F}) + \mathcal{R}(\mathcal{O}_{\mathrm{log}}^\mathcal{P})$
is the sum of the regression oracles' regret, used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient and rate optimal regret minimization algorithm for adversarial CMDPs which operates under the minimal and standard assumption of online function approximation. Our technique relies on standard convex optimization algorithms, and we show that it is robust to approximation errors.
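For context, the regret being minimized is, roughly (with the exact comparator as defined in the paper), the gap between the optimal value for each observed context and the value of the learner's policy:
\[
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Big( V^{\star}_{x_t} - V^{\pi_t}_{x_t} \Big),
\]
where $x_t$ is the (adversarially chosen) context of episode $t$, $V^{\star}_{x_t}$ is the optimal value in the MDP induced by $x_t$, and $\pi_t$ is the policy the learner plays in that episode.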
Abstract
A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high-precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we appeal to the more than half-century-old founding theory of losses for class probability estimation (properness), together with an extension of Long and Servedio's results and a new general boosting algorithm, to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate for a more general standpoint on the problem, as we argue that the source of the negative result lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML: \textit{parameterisation}.
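For readers unfamiliar with properness (mentioned above), here is the standard background definition rather than a result of the paper: a loss $\ell(y, \hat{\eta})$ for class probability estimation is proper when predicting the true class probability minimizes the expected loss,
\[
\eta \;\in\; \arg\min_{\hat{\eta} \in [0,1]} \; \mathbb{E}_{\mathsf{Y} \sim \mathrm{Bernoulli}(\eta)}\big[\ell(\mathsf{Y}, \hat{\eta})\big] \qquad \text{for all } \eta \in [0,1],
\]
and strictly proper when $\eta$ is the unique minimizer; the log loss and squared loss are classical examples.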
Abstract
Uniswap v3 is a decentralized exchange (DEX) that allows liquidity providers to allocate funds more efficiently by specifying an active price interval for their funds. This introduces the problem of finding an optimal strategy for choosing price intervals. We formalize this problem as an online learning problem with non-stochastic rewards. We use regret-minimization methods to derive a liquidity provision strategy that guarantees a lower bound on the reward. This holds even under adversarial changes to asset pricing, and we express this bound in terms of the trading volume.
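To illustrate the flavor of the regret-minimization approach, here is a generic exponential-weights (Hedge) sketch over a finite set of candidate price intervals; this is not the algorithm from the paper, and the interval discretization and reward model are assumptions made for the example.

import math
import random

def hedge_liquidity_provision(candidate_intervals, reward_fn, T, eta=0.1):
    """Generic Hedge sketch: repeatedly pick one of K candidate price intervals
    for providing liquidity, then update weights from the observed rewards.
    reward_fn(t, interval) returns the round-t reward of an interval, assumed
    bounded in [0, 1] and possibly chosen adversarially."""
    K = len(candidate_intervals)
    weights = [1.0] * K
    total_reward = 0.0
    for t in range(T):
        total_w = sum(weights)
        probs = [w / total_w for w in weights]
        choice = random.choices(range(K), weights=probs, k=1)[0]
        # Full-information feedback: observe the reward of every interval.
        rewards = [reward_fn(t, iv) for iv in candidate_intervals]
        total_reward += rewards[choice]
        weights = [w * math.exp(eta * r) for w, r in zip(weights, rewards)]
    return total_reward

# Toy usage: rewards favor intervals containing an (unknown, drifting) price.
price_path = [100 + 5 * math.sin(t / 10.0) for t in range(200)]
intervals = [(90, 100), (95, 105), (100, 110)]
total = hedge_liquidity_provision(
    intervals,
    lambda t, iv: 1.0 if iv[0] <= price_path[t] <= iv[1] else 0.0,
    T=200,
)
print(total)

With full-information feedback and bounded rewards, exponential weights guarantees $O(\sqrt{T \log K})$ regret against the best fixed interval in hindsight; the paper's guarantee is additionally expressed in terms of the trading volume.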
Abstract
We present regret minimization algorithms for stochastic contextual MDPs under a minimum reachability assumption, using access to an offline least squares regression oracle.
We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context, and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains a
$\tilde{O}\left( \frac{1}{p_{min}} H|S|^{3/2}\sqrt{|A|T\log(\max\{|\F|,|\Fp|\}/\delta)} \right)$
regret bound, with probability $1-\delta$, where $\Fp$ and $\F$ are finite and realizable regressor classes used to approximate the dynamics and rewards respectively, $p_{min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes.
To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional structural knowledge regarding the function class, such as linearity).
In addition, we present a lower bound of $\Omega(\sqrt{T H |S| |A| \ln(|\F|)/\ln(|S||A|)})$, which holds even in the case of known dynamics.
Abstract
Blackwell's celebrated theory measures approachability using the $\ell_2$ (Euclidean) distance. In many applications such as regret minimization, it is often more useful to study approachability under other distance metrics, most commonly the $\ell_\infty$ metric. However, the time and space complexity of the algorithms designed for $\ell_\infty$ approachability depend on the dimension of the space of the vectorial payoffs, which is often prohibitively large. We present a framework for converting high-dimensional $\ell_\infty$ approachability problems to low-dimensional \emph{pseudonorm} approachability problems, thereby resolving such issues. We first show that the $\ell_\infty$ distance between the average payoff and the approachability set can be equivalently defined as a \emph{pseudodistance} between a lower-dimensional average vector payoff and a new convex set we define. Next, we develop an algorithmic theory of pseudonorm approachability analogous to previous work on norm approachability, showing that it can be achieved via online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball. We then use that to show, modulo mild normalization assumptions, that there exists an $\ell_\infty$ approachability algorithm whose convergence is independent of the dimension of the original vector payoff. We further show that this algorithm admits polynomial-time complexity, assuming that the original $\ell_\infty$-distance can be computed efficiently. We also give an $\ell_\infty$ approachability algorithm whose convergence is logarithmic in that dimension, using an FTRL algorithm with a maximum-entropy regularizer. Finally, we illustrate the benefits of our framework by applying it to several problems in regret minimization.
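As standard background for the above (not specific to this paper): in Blackwell's setting a learner and an adversary repeatedly choose actions $x_t$ and $y_t$, producing vector payoffs $u(x_t, y_t)$, and a convex set $\mathcal{S}$ is approachable under a distance $d$ if the learner can force the average payoff to converge to $\mathcal{S}$ against any adversary,
\[
d\Big(\tfrac{1}{T}\textstyle\sum_{t=1}^{T} u(x_t, y_t), \; \mathcal{S}\Big) \;\longrightarrow\; 0 \quad \text{as } T \to \infty.
\]
Blackwell's classical theory takes $d$ to be the Euclidean ($\ell_2$) distance; the work above studies the $\ell_\infty$ case and reduces it to approachability under a lower-dimensional pseudonorm.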
Abstract
An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable.
Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure.
In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents.
The bounds we obtain are for \emph{swap regret},
and thus, along the way, imply convergence to a \emph{correlated} equilibrium.
Our algorithm is decentralized, computationally efficient, and does not require any communication between agents.
Our key observation is that
online learning via policy optimization in Markov games essentially reduces to a form of \emph{weighted} regret minimization, with \emph{unknown} weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees.
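For reference, and as standard background rather than this paper's definitions: in a repeated normal-form game, a player's swap regret compares their realized payoff to the best fixed action-swap function applied to their own history,
\[
\mathrm{SwapReg}(T) \;=\; \max_{\phi : A \to A} \; \sum_{t=1}^{T} \Big( u\big(\phi(a_t), a_{-t}\big) - u\big(a_t, a_{-t}\big) \Big),
\]
where $A$ is the player's action set, $a_t$ their action at round $t$, and $a_{-t}$ the other players' actions; when every player's swap regret is sublinear, the empirical distribution of play converges to the set of correlated equilibria. The paper works with an analogous notion for policies in Markov games.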
Abstract
In classic reinforcement learning (RL) problems, policies are evaluated with respect to some reward function, and all optimal policies obtain the same expected return. However, when considering real-world dynamic environments in which different users have different preferences, a policy that is optimal for one user might be sub-optimal for another.
In this work, we propose a multi-objective reinforcement learning framework that accommodates different user preferences over objectives, where preferences are learned via policy comparisons.
Our setup consists of a Markov Decision Process with a multi-objective reward function, in which each user corresponds to an (unknown) personal preference vector, and their reward in each state-action pair is the inner product of their preference vector with the multi-objective reward at that state-action pair. Our goal is to efficiently compute a near-optimal policy for a given user. We consider two user feedback models.
We first address the case where a user is provided with two policies and the user feedback is their preferred policy. We then move to a different user feedback model, where a user is instead provided with two small weighted sets of representative trajectories and selects the preferred one.
In both cases, we suggest an algorithm that finds a nearly optimal policy for the user using a small number of comparison queries.
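Concretely, with the setup above (the symbols here are illustrative), a user with preference vector $\mathbf{w}_u$ experiences scalar reward $\langle \mathbf{w}_u, \mathbf{r}(s,a) \rangle$, so by linearity their value of a policy is the same inner product applied to the policy's vector of expected cumulative multi-objective rewards:
\[
r_u(s,a) \;=\; \big\langle \mathbf{w}_u, \, \mathbf{r}(s,a) \big\rangle,
\qquad
V^{\pi}_{u} \;=\; \mathbb{E}^{\pi}\Big[\sum_{t} \big\langle \mathbf{w}_u, \mathbf{r}(s_t,a_t) \big\rangle\Big] \;=\; \big\langle \mathbf{w}_u, \, \mathbf{V}^{\pi} \big\rangle,
\]
which is why comparison feedback about policies (or representative trajectories) carries information about the unknown $\mathbf{w}_u$.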
Abstract
We consider a seller faced with buyers who have the ability to delay their decision, which we call patience. Each buyer's type is composed of a value and a patience, and it is sampled i.i.d. from a distribution. The seller, using posted prices, would like to maximize her revenue from selling to the buyers. In this paper, we formalize this setting and characterize the resulting Stackelberg equilibrium, where the seller first commits to her strategy, and then the buyers best respond. Following this, we show how to compute both the optimal pure and mixed strategies. We then consider a learning setting, where the seller does not have access to the distribution over buyers' types. Our main results are the following. We derive a sample complexity bound for learning an approximately optimal pure strategy by computing the fat-shattering dimension of this setting. Moreover, we provide a general sample complexity bound for the approximately optimal mixed strategy. We also consider an online setting and derive a vanishing regret bound with respect to both the optimal pure strategy and the optimal mixed strategy.
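One simple way to think about the buyer's best response under a committed price sequence, as an illustration of the setting rather than the paper's exact model: a buyer with value $v$ and patience $\tau$ arriving at time $t$ looks at the posted prices within their patience window and buys at the cheapest one, provided it does not exceed their value,
\[
t^{\star} \;\in\; \arg\min_{t' \in \{t, \dots, t+\tau\}} p_{t'},
\qquad \text{buy at } t^{\star} \text{ iff } p_{t^{\star}} \le v,
\]
yielding utility $v - p_{t^{\star}}$ (and zero otherwise). The seller's Stackelberg problem is then to commit to a price sequence (pure strategy) or a distribution over sequences (mixed strategy) that maximizes expected revenue against such best responses.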