Jump to Content
Tomer Koren

Tomer Koren

I'm a Senior Research Scientist at Google Research, Tel Aviv, and an Assistant Professor at the Blavatnik School of Computer Science at Tel Aviv University. My research interests are in machine learning and optimization.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains a $\widetilde O(K^{6/7})$ regret bound, improving significantly over previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde O (K^{2/3})$. View details
    Preview abstract An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for \emph{swap regret}, and thus, along the way, imply convergence to a \emph{correlated} equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of \emph{weighted} regret minimization, with \emph{unknown} weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees. View details
    Preview abstract A natural, seemingly undisputed understanding is that a learning algorithm must obtain a good fit to the training data in order to perform well on independent test data. This view is reflected both in the traditional bias-complexity trade-off, and in the newly emerging theory of generalization of interpolating learning rules. In this work, we ask to what extent may stochastic gradient descent (SGD) be similarly understood to generalize by means of fit to the training data. We consider the fundamental stochastic convex optimization framework, and seek bounds on the empirical risk and generalization gap of one-pass SGD. Surprisingly, we discover there exist convex learning problems where the output of SGD exhibits empirical risk and generalization gap both $\Omega(1)$, but generalizes well nonetheless with population risk of $O(1/\sqrt n)$, as guaranteed by the classical analysis. Consequently, it turns out SGD is not algorithmically stable in \emph{any} sense, and cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the variant of SGD given by sampling \emph{with}-replacement from the training set, for which we prove that in counter to what might be expected, overfitting does not occur and the population risk converges at the optimal rate. Finally, we study empirical risk guaranties of multiple epochs of without-replacement SGD, and derive nearly tight upper and lower bounds significantly improving over previously known results in this setting. View details
    Preview abstract We study online convex optimization in the random order model, recently proposed by \citet{garber2020online}, where the loss functions may be chosen by an adversary, but are then presented to the online algorithm in a uniformly random order. We focus on the scenario where the cumulative loss function is (strongly) convex, yet individual loss functions are smooth but might be non-convex. Our algorithms achieve the optimal bounds and significantly outperform the results of \citet{garber2020online}, completely removing the dimension dependence and improving the dependence on the strong convexity parameter. Our analysis relies on novel connections between algorithmic stability and generalization for sampling without-replacement analogous to those studied in the with-replacement i.i.d.~setting, as well as on a refined average stability analysis of stochastic gradient descent. View details
    Preview abstract We consider stochastic optimization with delayed gradients where, at each time step~$t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for some arbitrary delay $d_t$. This setting abstracts asynchronous distributed optimization where a central server receives gradient updates computed by worker machines. These machines can experience computation and communication loads that might vary significantly over time. In the general non-convex smooth optimization setting, we give a simple and efficient algorithm that requires $O( \sigma^2/\epsilon^4 + \tau/\epsilon^2 )$ steps for finding an $\epsilon$-stationary point $x$, where $\tau$ is the \emph{average} delay $\smash{\frac{1}{T}\sum_{t=1}^T d_t}$ and $\sigma^2$ is the variance of the stochastic gradients. This improves over previous work, which showed that stochastic gradient decent achieves the same rate but with respect to the \emph{maximal} delay $\max_{t} d_t$, that can be significantly larger than the average delay especially in heterogeneous distributed systems. Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed. View details
    Preview abstract We study the stochastic Multi-Armed Bandit~(MAB) problem with random delays in the feedback received by the algorithm. We consider two settings: the {\it reward-dependent} delay setting, where realized delays may depend on the stochastic rewards, and the {\it reward-independent} delay setting. Our main contribution is algorithms that achieve near-optimal regret in each of the settings, with an additional additive dependence on the quantiles of the delay distribution. Our results do not make any assumptions on the delay distributions: in particular, we do not assume they come from any parametric family of distributions and allow for unbounded support and expectation; we further allow for the case of infinite delays where the algorithm might occasionally not observe any feedback. View details
    Dueling Convex Optimization
    Aadirupa Saha
    ICML 2021 (2021) (to appear)
    Preview abstract We address the problem of convex optimization with preference (dueling) feedback. Like the traditional optimization objective, the goal is to find the optimal point with least possible query complexity, however without the luxury of even a zeroth order feedback (function value at the queried point). Instead, the learner can only observe a single noisy bit which is a win-loss feedback for a pair of queried points based on the difference of their function values. The problem is undoubtedly of great practical relevance as in many real world scenarios, such as recommender systems or learning from customer preferences, where the system feedback is often restricted to just one binary-bit preference information. We consider the problem of online convex optimization (OCO) solely by actively querying $\{0,1\}$ noisy-comparison feedback of decision point pairs, with the objective of finding a near-optimal point (function minimizer) with the least possible number of queries. For the non-stationary OCO setup, where the underlying convex function may change over time, we prove an impossibility result towards achieving the above objective. We next focus only the on stationary OCO problem and our main contribution lies in designing a normalized-gradient-descent based algorithm towards finding a $\epsilon$-best optimal point. Towards this our algorithm is shown to yield a convergence rate of $\tilde O(\nicefrac{d\beta}{\epsilon \nu^2})$ ($\nu$ being the noise parameter) when the underlying function is $\beta$-smooth. Further we show an improved convergence rate of just $\tilde O(\nicefrac{d\beta}{\alpha \nu^2} \log \frac{1}{\epsilon})$ when the function is additionally also $\alpha$-strongly convex. View details
    Preview abstract We study online learning of finite-horizon Markov Decision Processes (MDPs) with adversarially changing loss functions and unknown dynamics. In each episode, the learner observes a trajectory realized by her policy chosen for this episode. In addition, the learner suffers and observes the loss accumulated along the trajectory which we call aggregate bandit feedback. The learner, however, never observes any additional information about the loss; in particular, the individual losses suffered along the trajectory. Our main result is a computationally-efficient algorithm with \sqrt{K} regret for this setting, where K is the number of episodes. We efficiently reduce \emph{Online MDPs with Aggregate Bandit Feedback} to a novel setting: Distorted Linear Bandits (DLB). This setting is a robust generalization of linear bandits in which selected actions are adversarially perturbed. We give a computationally-efficient online learning algorithm for DLB and prove a \sqrt{T} regret bound, where T is the number of time steps. Our algorithm is based on a schedule of increasing learning rates used in Online Mirror Descent with a self-concordant barrier regularization. We use the DLB algorithm to derive our main result of \sqrt{K} regret. View details
    Adversarial Dueling Bandits
    Aadirupa Saha
    ICML 2021 (2021) (to appear)
    Preview abstract We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary `win-loss' feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially. Our main result is an algorithm whose $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K^{1/3}T^{2/3})$ lower bound. We also prove a similar high probability regret bound. We further consider a simpler \emph{fixed-gap} adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an $\smash{ \tilde{O}((K/\Delta^2)\log{T}) }$ regret algorithm, where $\Delta$ is the gap in Borda scores between the best item and all other items, and show a lower bound of $\Omega(K/\Delta^2)$ indicating that our dependence on the main problem parameters $K$ and $\Delta$ is tight (up to logarithmic factors). View details
    Stochastic Optimization with Laggard Data Pipelines
    Cyril Zhang
    Kunal Talwar
    Rohan Anil
    Thirty-fourth Conference on Neural Information Processing Systems, 2020 (2020) (to appear)
    Preview abstract State-of-the-art optimization has increasingly moved toward massively parallel pipelines with extremely large batches. As a consequence, the performance bottleneck is shifting towards the CPU- and disk-bound data loading and preprocessing, as opposed to hardware-accelerated backpropagation. In this regime, a recently proposed approach is data echoing (Choi et al. '19), which takes repeated gradient steps on the same batch. We provide the first convergence analysis of data echoing-based extensions of ubiquitous optimization methods, exhibiting provable improvements over their synchronous counterparts. Specifically, we show that asynchronous batch reuse can magnify the gradient signal in a stochastic batch, without harming the statistical rate. View details
    Preview abstract We present the first computationally-efficient algorithm with $\tO(\sqrt{T})$ regret for learning in Linear Quadratic Control systems with unknown linear dynamics and known quadratic costs. View details
    Robust Bi-Tempered Logistic Loss Based on Bregman Divergences
    Ehsan Amid
    Rohan Anil
    Thirty-Third Annual Conference on Neural Information Processing Systems (NeurIPS) (2019)
    Preview abstract We introduce a temperature into the exponential function and replace the softmax output layer of neural nets by a high temperature generalization. Similarly, the logarithm in the log loss we use for training is replaced by a low temperature logarithm. By tuning the two temperatures we create loss functions that are non-convex already in the single layer case. When replacing the last layer of the neural nets by our bi-temperature generalization of logistic loss, the training becomes more robust to noise. We visualize the effect of tuning the two temperatures in a simple setting and show the efficacy of our method on large data sets. Our methodology is based on Bregman divergences and is superior to a related two-temperature method using the Tsallis divergence. View details
    Preview abstract Adaptive gradient-based optimizers such as AdaGrad and Adam are among the methods of choice in machine learning. These methods maintain second-moment statistics of each entry in the gradient, thus doubling the optimizer's memory footprint. In behemoth-size applications, this memory overhead restricts the size of the model being used as well as the number of examples in a mini-batch. We describe a novel, simple, and flexible adaptive optimization method with a sublinear memory cost that retains the benefits of per-parameter adaptivity while allowing for larger models and mini-batches. We give convergence guarantees for our method and demonstrate its effectiveness in training very large deep architectures. View details
    Semi-Cyclic Stochastic Gradient Descent
    Hubert Eichner
    Nathan Srebro
    Kunal Talwar
    Accepted to ICML 2019.
    Preview abstract We consider convex SGD updates with a block-cyclic structure, i.e. where each cycle consists of a small number of blocks, each with many samples from a possibly different, block-specific, distribution. This situation arises, e.g., in Federated Learning where the mobile devices available for updates at different times during the day have different characteristics. We show that such block-cyclic structure can significantly deteriorate the performance of SGD, but propose a simple approach that allows prediction with the same performance guarantees as for i.i.d., non-cyclic, sampling. View details
    Preview abstract We study the problem of controlling linear time-invariant systems with known noisy dynamics and adversarially chosen quadratic losses. We present the first efficient online learning algorithms in this setting that guarantee O(√T) regret under mild assumptions, where T is the time horizon. Our algorithms rely on a novel SDP relaxation for the steady-state distribution of the system. Crucially, and in contrast to previously proposed relaxations, the feasible solutions of our SDP all correspond to strongly stable'' policies that mix exponentially fast to a steady state. View details
    Preview abstract Preconditioned gradient methods are among the most general and powerful toolsin optimization. However, preconditioning requires storing and manipulatingprohibitively large matrices. We describe and analyze a new structure-awarepreconditioning algorithm, called Shampoo, for stochastic optimization overtensor spaces. Shampoo maintains a set of preconditioning matrices, each ofwhich operates on a single dimension, contracting over the remainingdimensions. We establish convergence guarantees in the stochastic convexsetting, the proof of which builds upon matrix trace inequalities. Ourexperiments with state-of-the-art deep learning models show that Shampoo iscapable of converging considerably faster than commonly used optimizers.Surprisingly, although it involves a more complex update rule, Shampoo's runtime per step is comparable in practice to that of simple gradientmethods such as SGD, AdaGrad, and Adam. View details
    Preview abstract We describe a unified adaptive optimization algorithm, which uses a potential function to construct a series of preconditioning matrices that shape the gradient based directions to update the learning parameters. We prove a regret bound for this, and show that various well known algorithms, such as AdaGrad and Online Newton Stepping can be obtained from this by choosing appropriate potential functions. Besides the uniform derivation of regret bounds for all these algorithms, this approach also enables us to vary the potential function and come up with new algorithms. View details
    Preview abstract We present a new affine-invariant optimization algorithm called \emph{Online Lazy Newton}. The algorithm is a modification of the Online Newton Step algorithm. The convergence rate of Online Lazy Newton is independent of conditioning. Instead, the algorithm performance depends on the best possible preconditioning of the problem in retrospect and on its \emph{intrinsic} dimensionality. As an application, we show how the Online Lazy Newton can achieve the optimal $\widetildet{\Theta}\sqrt{rT})$ regret for the low rank experts problem, improving by a $\sqrt{r}$ factor over the previously best known bound and resolving an open problem posed by \citet{hazan2016online}. Our rate is also applicable in the \emph{supervised low rank expert} proposed in \cite{foster2017zigzag} that obtained regret of $O(\sqrt{rT}+\log N\log r)$ View details
    Multi-Armed Bandits with Metric Movement Costs
    Roi Livni
    Yishay Mansour
    NIPS (2017) (to appear)
    Preview abstract We consider the non-stochastic Multi-Armed Bandit problem in a setting where there is a fixed and known metric on the action space that determines a cost for switching between any pair of actions. The loss of the online learner has two components: the first is the usual loss of the selected actions, and the second is an additional loss due to switching between actions. Our main contribution gives a tight characterization of the expected minimax regret in this setting, in terms of a complexity measure~$C$ of the underlying metric which depends on its covering numbers. In finite metric spaces with $k$ actions, we give an efficient algorithm that achieves regret of the form $\tO(\max\set{C^{1/3}T^{2/3},\sqrt{kT}})$, and show that this is the best possible. Our regret bound generalizes previous known regret bounds for some special cases: (i) the unit-switching cost regret $\widetilde{\Theta}(\max\set{k^{1/3}T^{2/3},\sqrt{kT}})$ where $C=\Theta(k)$, and (ii) the interval metric with regret $\widetilde{\Theta}(\max\set{T^{2/3},\sqrt{kT}})$ where $C=\Theta(1)$. For infinite metrics spaces with Lipschitz loss functions, we derive a tight regret bound of $\widetilde{\Theta}(T^{\frac{d+1}{d+2}})$ where $d \ge 1$ is the Minkowski dimension of the space, which is known to be tight even when there are no switching costs. View details
    Preview abstract We present a new affine-invariant optimization algorithm called \emph{Online Lazy Newton}. The algorithm is a modification of the Online Newton Step algorithm. The convergence rate of Online Lazy Newton is independent of conditioning. Instead, the algorithm performance depends on the best possible preconditioning of the problem in retrospect and on its \emph{intrinsic} dimensionality. As an application, we show how the Online Lazy Newton can achieve the optimal $\widetildet{\Theta}\sqrt{rT})$ regret for the low rank experts problem, improving by a $\sqrt{r}$ factor over the previously best known bound and resolving an open problem posed by \citet{hazan2016online}. Our rate is also applicable in the \emph{supervised low rank expert} proposed in \cite{foster2017zigzag} that obtained regret of $O(\sqrt{rT}+\log N\log r)$ View details
    Preview abstract We describe a framework for deriving and analyzing online optimization algorithms that incorporate adaptive, data-dependent regularization, also termed preconditioning. Such algorithms have been proven useful in stochastic optimization by reshaping the gradients according to the geometry of the data. Our framework captures and unifies much of the existing literature on adaptive online methods, including the AdaGrad and Online Newton Step algorithms as well as their diagonal versions. As a result, we obtain new convergence proofs for these algorithms that are substantially simpler than previous analyses. Our framework also exposes the rationale for the different preconditioned updates used in common stochastic optimization methods. View details
    No Results Found