Yishay Mansour
Prof. Yishay Mansour received his PhD from MIT in 1990, after which he was a postdoctoral fellow at Harvard and a Research Staff Member at the IBM T. J. Watson Research Center. Since 1992 he has been at Tel Aviv University, where he is currently a Professor of Computer Science; he served as the first head of the Blavatnik School of Computer Science during 2000-2002. He was the founder and first director of the Israeli Center of Research Excellence in Algorithms.
Prof. Mansour has published over 100 journal papers and over 200 conference proceedings papers in various areas of computer science, with special emphasis on machine learning, algorithmic game theory, communication networks, and theoretical computer science, and has supervised over a dozen graduate students in those areas.
Prof. Mansour was named an ACM Fellow in 2014. He is currently an associate editor of a number of distinguished journals and has served on numerous conference program committees. He was program chair of COLT (1998) and STOC (2016), has served twice on the COLT steering committee, and is a member of the ALT steering committee.
Research Areas
Authored Publications
Abstract
We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary.
As a preliminary result, assuming direct access to a finite policy class $\Pi$, we establish an optimal expected regret bound of $O(\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|})$, where $D$ is the sum of delays.
For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $\mathcal{F}$ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KT\mathcal{R}_T(\mathcal{O})} + \sqrt{d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret, and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\mathcal{F}$, and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$,
which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.
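To make the delay bookkeeping concrete, here is a minimal Python sketch, assuming a simple round-indexed arrival buffer (the buffer and indexing are illustrative, not from the paper), of how the cumulative delay $D$ arises when each round's loss is revealed late:

```python
def delayed_feedback_stream(losses, delays):
    """Bookkeeping under delayed feedback: the loss of round t becomes
    observable only at round t + delays[t]; D is the sum of all delays."""
    T = len(losses)
    arrivals = [[] for _ in range(T + max(delays) + 1)]
    for t, d in enumerate(delays):
        arrivals[t + d].append((t, losses[t]))  # (origin round, loss)
    return arrivals, sum(delays)

# Loss of round 0 is delayed by 2 rounds, round 1 not at all, round 2 by 1.
arrivals, D = delayed_feedback_stream([0.2, 0.7, 0.1], [2, 0, 1])
# arrivals[2] == [(0, 0.2)]; D == 3.
```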
Abstract
Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL). Most works achieve this by carefully estimating the model uncertainty and following the so-called optimistic model. Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that provably achieves near-optimal regret for stochastic Multi-Armed Bandits (MAB). Crucially, our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the rewards. We complement our theoretical results by demonstrating the effectiveness of our algorithm on synthetic benchmarks.
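As a rough illustration of the batched-ensemble idea, the following Python sketch, a simplification and not the paper's algorithm, splits each arm's reward history into a fixed number of batches and uses the maximum batch mean as an optimistic index; note the single parameter, the number of batches:

```python
import random

def batch_ensemble_mab(arms, T, num_batches):
    """Illustrative batched-ensemble index policy for stochastic MAB
    (a simplification, not the paper's algorithm). Each arm's reward
    history is split round-robin into `num_batches` sub-samples, and
    the arm's index is the maximum of the sub-sample means: a
    data-driven form of optimism with no scale or variance parameter."""
    K = len(arms)
    batches = [[[] for _ in range(num_batches)] for _ in range(K)]
    pulls = [0] * K

    def index(i):
        means = [sum(b) / len(b) for b in batches[i] if b]
        return max(means) if means else float("inf")  # force initial pulls

    for _ in range(T):
        i = max(range(K), key=index)
        r = float(arms[i]())                      # draw a stochastic reward
        batches[i][pulls[i] % num_batches].append(r)
        pulls[i] += 1
    return pulls

# Two Bernoulli arms with means 0.5 and 0.6; the second should dominate.
pulls = batch_ensemble_mab(
    [lambda: random.random() < 0.5, lambda: random.random() < 0.6],
    T=10_000, num_batches=10)
```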
Abstract
\textit{Precision} and \textit{Recall} are foundational metrics in machine learning applications where both accurate predictions and comprehensive coverage are essential, such as in recommender systems and multi-label learning. In these tasks, balancing precision (the proportion of relevant items among those predicted) and recall (the proportion of relevant items successfully predicted) is crucial. A key challenge is that one-sided feedback—where only positive examples are observed during training—is inherent in many practical problems. For instance, in recommender systems like YouTube, training data only consists of songs or videos that a user has actively selected, while unselected items remain unseen. Despite this lack of negative feedback in training, it becomes crucial at test time; for example, in a recommender system, it’s essential to avoid recommending items the user would likely dislike.
We introduce a Probably Approximately Correct (PAC) learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions, such as between users and items. This framework subsumes the classical binary and multi-class PAC learning models as well as multi-label learning with partial feedback, where only a single random correct label per example is observed, rather than all correct labels.
Our work uncovers a rich statistical and algorithmic landscape, with nuanced boundaries on what can and cannot be learned. Notably, classical methods like Empirical Risk Minimization fail in this setting, even for simple hypothesis classes with only two hypotheses. To address these challenges, we develop novel algorithms that learn exclusively from positive data, effectively minimizing both precision and recall losses. Specifically, in the realizable setting, we design algorithms that achieve optimal sample complexity guarantees. In the agnostic case, we show that it is impossible to achieve additive error guarantees (i.e., additive regret)—as is standard in PAC learning—and instead obtain meaningful multiplicative approximations.
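For concreteness, the two quantities being traded off above are the standard set-based precision and recall; a minimal Python rendering (the hypothesis-as-graph machinery of the paper is not reproduced here):

```python
def precision_recall(predicted, relevant):
    """Precision: fraction of predicted items that are relevant.
    Recall: fraction of relevant items that were predicted."""
    predicted, relevant = set(predicted), set(relevant)
    hits = predicted & relevant
    precision = len(hits) / len(predicted) if predicted else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall

# A recommender surfaces 4 items; the user actually likes 3 items.
p, r = precision_recall({"a", "b", "c", "d"}, {"b", "c", "e"})
# p == 2/4 == 0.5, r == 2/3
```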
Abstract
We study the multi-armed bandit problem with adversarially chosen delays in the Best-of-Both-Worlds (BoBW) framework, which aims to achieve near-optimal performance in both stochastic and adversarial environments. While prior work has made progress toward this goal, existing algorithms suffer from significant gaps relative to the known lower bounds, especially in the stochastic setting. Our main contribution is a new algorithm that, up to logarithmic factors, matches the known lower bounds in each setting individually.
In the adversarial case, our algorithm achieves regret of $\widetilde{O}(\sqrt{KT} + \sqrt{D})$, which is optimal up to logarithmic terms, where $T$ is the number of rounds, $K$ is the number of arms, and $D$ is the cumulative delay. In the stochastic case, we provide a regret bound which scales as $\sum_{i:\Delta_i>0} (\log T/\Delta_i) + \frac{1}{K}\sum_i \Delta_i \sigma_{\max}$, where $\Delta_i$ is the sub-optimality gap of arm $i$ and $\sigma_{\max}$ is the maximum number of missing observations.
To the best of our knowledge, this is the first \textit{BoBW} algorithm to simultaneously match the lower bounds in both stochastic and adversarial regimes in delayed environments. Moreover, even beyond the BoBW setting, our stochastic regret bound is the first to match the known lower bound under adversarial delays, improving the second term over the best known result by a factor of $K$.
Of Dice and Games: A Theory of Generalized Boosting
Marco Bressan
Nataly Brukhim
Nicolo Cesa-Bianchi
Emmanuel Esposito
Shay Moran
Maximilian Thiessen
COLT (2025)
Abstract
Cost-sensitive loss functions are crucial in many real-world prediction problems, where different types of errors are penalized differently; for example, in medical diagnosis, a false negative prediction can lead to severe consequences. However, traditional PAC learning theory has mostly focused on the symmetric 0-1 loss, leaving cost-sensitive losses largely unaddressed.
In this work, we extend the celebrated theory of boosting to incorporate both cost-sensitive and multi-objective losses.
Cost-sensitive losses assign costs to the entries of a confusion matrix, and are used to control the sum of prediction errors accounting for the cost of each error type. Multi-objective losses, on the other hand, simultaneously track multiple cost-sensitive losses,
and are useful when the goal is to satisfy several criteria at once (e.g., minimizing false positives while keeping false negatives below a critical threshold).
We develop a comprehensive theory of cost-sensitive and multi-objective boosting, providing a taxonomy of weak learning guarantees that distinguishes which guarantees are trivial (i.e., they can always be achieved), which are boostable (i.e., they imply strong learning), and which are intermediate, implying non-trivial yet not arbitrarily accurate learning. For binary classification, we establish a dichotomy: a weak learning guarantee is either trivial or boostable. In the multiclass setting, we describe a more intricate landscape of intermediate weak learning guarantees. Our characterization relies on a geometric interpretation of boosting, revealing a useful duality between cost-sensitive and multi-objective losses.
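A small numeric example of a cost-sensitive loss as described above, with a hypothetical cost matrix that penalizes false negatives five times as heavily as false positives:

```python
import numpy as np

# Confusion matrix for a binary classifier: rows are true labels,
# columns are predicted labels (entry [1][0] counts false negatives).
confusion = np.array([[90, 10],
                      [ 5, 45]])

# Hypothetical cost matrix: a false negative is five times as costly
# as a false positive; correct predictions incur no cost.
costs = np.array([[0.0, 1.0],
                  [5.0, 0.0]])

# Cost-sensitive loss: cost-weighted error count, normalized.
loss = (confusion * costs).sum() / confusion.sum()  # (10*1 + 5*5) / 150
```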
Abstract
We study online finite-horizon Markov Decision Processes with adversarially changing losses and aggregate bandit feedback (a.k.a. full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting.
In the known-dynamics case, we achieve the first \textit{optimal} regret bound of $\tilde \Theta(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the horizon, $S$ is the number of states, and $A$ is the number of actions of the MDP.
In the unknown-dynamics case, we establish a regret bound of $\tilde O(H^3 S \sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.
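To illustrate the feedback model, here is a minimal Python sketch with a toy environment (the MDP and its losses are invented for illustration): the learner's only observation is the trajectory's total loss.

```python
import random

class ToyMDP:
    """A trivial 2-state, 2-action MDP with random losses (illustration only)."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        loss = random.random() * (1 + action)    # action 1 is lossier on average
        self.state = (self.state + action) % 2
        return self.state, loss

def rollout(policy, env, horizon):
    """Aggregate bandit feedback: the learner is shown only the total
    loss of the trajectory, never the per-step losses."""
    total, state = 0.0, env.reset()
    for h in range(horizon):
        state, loss = env.step(policy(state, h))
        total += loss            # accumulated internally by the environment
    return total                 # the single number the learner observes

observed = rollout(lambda s, h: 0, ToyMDP(), horizon=5)
```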
Abstract
We consider a non-stationary multi-armed bandit (MAB) problem where the expected reward of each action follows a linear function of the number of times the action has been executed.
Our main result is a tight regret bound of $\tilde{\Theta}(T^{4/5}K^{3/5})$, obtained by providing both upper and lower bounds.
We extend our results to derive instance dependent regret bounds, which depend on the unknown parametrization of the linear drift of the rewards.
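The reward model can be made concrete as follows; the intercept, slope, and Gaussian noise below are illustrative choices:

```python
import random

def drifting_reward(a, b, n):
    """Sampled reward of an arm pulled for the (n+1)-th time: the mean
    is linear in the pull count, mu(n) = a + b*n, plus Gaussian noise.
    b < 0 gives a decaying arm, b > 0 a rising one."""
    return a + b * n + random.gauss(0.0, 0.1)

# An arm whose expected reward decays by 0.001 with every pull.
samples = [drifting_reward(1.0, -0.001, n) for n in range(3)]
```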
A Fine-grained Characterization of PAC Learnability
Marco Bressan
Nataly Brukhim
Nicolo Cesa-Bianchi
Emmanuel Esposito
Shay Moran
Maximilian Thiessen
COLT (2025)
Abstract
In the multiclass PAC setting, even when full learnability is unattainable, meaningful information can often be extracted to guide predictions. However, classical learning theory has mainly focused on the dichotomy ``learnable vs.\ non-learnable'', leaving notions of partial learnability largely unexplored. Indeed, even for a non-learnable class, a learner may still achieve partial success—for example, by making reliable predictions whenever the true label belongs to a fixed subset of the label space, even if it fails otherwise. Similarly, the rigid nature of PAC learnability makes it impossible to distinguish between classes where one can achieve favorable trade-offs between, say, false-positive and false-negative rates, and classes where such trade-offs are fundamentally unattainable. In a nutshell, standard PAC learnability precludes a fine-grained exploration of learnability.
To overcome this limitation, we develop a fine-grained theory of PAC learnability. For any hypothesis class \(\mathcal{H}\), given a loss function (which quantifies the penalty for predicting \(\hat{y}\) instead of the true label \(y\)) and a target loss threshold \(z\), our theory determines whether it is possible to achieve a loss of at most \(z\). In contrast, classical PAC learning considers only the special case of the zero-one loss and \(z = 0\), corresponding to a near perfect classification guarantee.
We give a complete characterization of all attainable guarantees,
captured by a \emph{finite family} of combinatorial dimensions, which we term the \emph{\(J\)-cube dimensions} of \(\mathcal{H}\). These dimensions are defined for every subset \(J\) of at least two labels.
This extends the fundamental theorem of realizable PAC learning based on the VC dimension. In fact, our results hold in a more general multi-objective setting where we fully characterize the Pareto frontier of guarantees attainable for the class $\mathcal{H}$.
Abstract
In many repeated auction settings, participants care not only about how frequently they win but also about how their winnings are distributed over time.
This problem arises in various practical domains where avoiding congested demand is crucial and spacing is important, such as advertising campaigns.
We initiate the study of repeated auctions with preferences for spacing of wins over time.
We introduce a simple model of this phenomenon, where the value of a win is given by any concave function of the time since the last win.
We study the optimal policies for this setting in repeated second-price auctions and offer learning algorithms for the bidders that achieve $\tilde O(\sqrt{T})$ regret.
We achieve this by showing that an infinite-horizon Markov decision process (MDP) with the budget constraint in expectation is essentially equivalent to our problem, \emph{even when limiting that MDP to a very small number of states}.
The algorithm achieves low regret by learning a bidding policy that chooses bids as a function of the context and the state of the system, which will be the time elapsed since the last win (or conversion).
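A minimal sketch of the value model, assuming a square-root value curve and a capped elapsed-time state (both illustrative choices; the model admits any concave function):

```python
import math

def win_value(elapsed, cap=10):
    """Value of a win as a concave function of the time since the last
    win; sqrt and the cap on the elapsed-time state are illustrative."""
    return math.sqrt(min(elapsed, cap))

# Realized value for a bidder who wins at rounds 3, 5, and 12.
wins_at = [3, 5, 12]
last, total = 0, 0.0
for t in wins_at:
    total += win_value(t - last)
    last = t
```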
Abstract
We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph.
We analyze a variant of the Cooperative Successive Elimination algorithm, CoopSE, and show an individual regret bound of ${O}(\mathcal{R} / m + A^2 + A \sqrt{\log T})$ and a nearly matching lower bound.
Here $A$ is the number of actions, $T$ the time horizon, $m$ the number of agents, and $\mathcal{R} = \sum_{\Delta_i > 0}\log(T)/\Delta_i$ is the optimal single agent regret, where $\Delta_i$ is the sub-optimality gap of action $i$.
Our work is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph's diameter.
When considering communication networks there are additional considerations beyond regret, such as message size and number of communication rounds.
First, we show that our regret bound holds even if we restrict the messages to be of logarithmic size.
Second, for a logarithmic number of communication rounds, we obtain a regret bound of ${O}(\mathcal{R} / m + A \log T)$.
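As a rough sketch of the cooperative idea, assuming m agents that pool observations of every active arm (a simplification, not the paper's CoopSE, and without its message-size or round restrictions):

```python
import math, random

def coop_elimination(mu, m, T):
    """Illustrative cooperative successive elimination: m agents split
    the pulls of every active arm and pool their observations, so each
    estimate sharpens m times faster than for a single agent."""
    K = len(mu)
    active = list(range(K))
    est, cnt, t = [0.0] * K, [0] * K, 0
    while t < T and len(active) > 1:
        for i in active:                         # all m agents sample arm i
            for _ in range(m):
                r = float(random.random() < mu[i])
                cnt[i] += 1
                est[i] += (r - est[i]) / cnt[i]  # running mean of pooled data
            t += 1
        rad = lambda i: math.sqrt(2 * math.log(max(t, 2)) / cnt[i])
        best = max(active, key=lambda i: est[i] - rad(i))
        active = [i for i in active
                  if est[i] + rad(i) >= est[best] - rad(best)]
    return active                                # arms still considered optimal

# Two agents cooperating on three Bernoulli arms.
survivors = coop_elimination([0.3, 0.5, 0.55], m=2, T=5_000)
```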