Jump to Content

Csaba Szepesvari

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice
    Toshinori Kitamura
    Tadashi Kozuno
    Yunhao Tang
    Nino Vieillard
    Michal Valko
    Wenhao Yang
    Jincheng Mei
    Pierre Menard
    Mo Azar
    Remi Munos
    Olivier Pietquin
    Matthieu Geist
    Wataru Kumagai
    Yutaka Matsuo
    International Conference on Machine Learning (ICML) (2023)
    Preview abstract Mirror descent value iteration (MDVI), an abstraction of Kullback--Leibler (KL) and entropy-regularized reinforcement learning (RL), has served as the basis for recent high-performing practical RL algorithms. However, despite the use of function approximation in practice, the theoretical understanding of MDVI has been limited to tabular Markov decision processes (MDPs). We study MDVI with linear function approximation through its sample complexity required to identify an $\varepsilon$-optimal policy with probability $1-\delta$ under the settings of an infinite-horizon linear MDP, generative model, and G-optimal design. We demonstrate that least-squares regression weighted by the variance of an estimated optimal value function of the next state is crucial to achieving minimax optimality. Based on this observation, we present Variance-Weighted Least-Squares MDVI (VWLS-MDVI), the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs. Furthermore, we propose a practical VWLS algorithm for value-based deep RL, Deep Variance Weighting (DVW). Our experiments demonstrate that DVW improves the performance of popular value-based deep RL algorithms on a set of MinAtar benchmarks. View details
    Preview abstract Classical global convergence results for first-order methods rely on uniform smoothness and the Łojasiewicz inequality. Motivated by properties of objective functions that arise in machine learning, we propose a non-uniform refinement of these notions, leading to \emph{Non-uniform Smoothness} (NS) and \emph{Non-uniform Łojasiewicz inequality} (NŁ). The new definitions inspire new geometry-aware first-order methods that are able to converge to global optimality faster than the classical Ω(1/t2) lower bounds. To illustrate the power of these geometry-aware methods and their corresponding non-uniform analysis, we consider two important problems in machine learning: policy gradient optimization in reinforcement learning (PG), and generalized linear model training in supervised learning (GLM). For PG, we find that normalizing the gradient ascent method can accelerate convergence to O(e−t) while incurring less overhead than existing algorithms. For GLM, we show that geometry-aware normalized gradient descent can also achieve a linear convergence rate, which significantly improves the best known results. We additionally show that the proposed geometry-aware descent methods escape landscape plateaus faster than standard gradient descent. Experimental results are used to illustrate and complement the theoretical findings. View details
    Preview abstract We study stochastic policy optimization in the on-policy case and make the following four contributions. \textit{First}, we show that the ordering of optimization algorithms by their efficiency gets reversed when they have or they not to the true gradient information. In particular, this finding implies that, unlike in the true gradient scenario, geometric information cannot be easily exploited without detrimental consequences in stochastic policy optimization. \textit{Second}, to explain these findings we introduce the concept of \textit{committal rate} for stochastic policy optimization, and show that this can serve as a criterion for determining almost sure convergence to global optimality. \textit{Third}, we show that if there is no external mechanism that allows an algorithm to determine the difference between optimal and sub-optimal actions using only on-policy samples, then there must be an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely. That is, an algorithm either converges to a globally optimal policy with probability $1$ but at a rate no better than $O(1/t)$, or it achieves a faster than $O(1/t)$ convergence rate but then must fail to converge to the globally optimal deterministic policy with some positive probability. \textit{Finally}, we use our committal rate theory to explain why practical policy optimization methods are sensitive to random initialization, and how an ensemble method with parallelism can be guaranteed to achieve near-optimal solutions with high probability. View details
    On the Optimality of Batch Policy Optimization Algorithms
    Chenjun Xiao
    Yifan Wu
    Tor Lattimore
    Jincheng Mei
    Lihong Li
    ICML 2021 (2021)
    Preview abstract Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization. View details
    Meta-Thompson Sampling
    Branislav Kveton
    Michael Konobeev
    Martin Mladenov
    Proceedings of the 38th International Conference on Machine Learning (ICML 2021), pp. 5884-5893
    Preview abstract Efficient exploration in multi-armed bandits is a fundamental online learning problem. In this work, we propose a variant of Thompson sampling that learns to explore over time by interacting with problem instances sampled from an unknown prior distribution. This algorithm meta-learns the prior and therefore we call it Meta-TS. We propose efficient implementations of Meta-TS and analyze it in Gaussian bandits. Our analysis captures the improvement due to learning the prior and is of a broader interest, because we derive the first prior-dependent upper bound on the Bayes regret. Our regret bound is complemented by empirical evaluation, which shows that Meta-TS quickly adapts to the unknown prior. View details
    Escaping the Gravitational Pull of Softmax
    Jincheng Mei
    Chenjun Xiao
    Lihong Li
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
    Preview abstract The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions. Unfortunately, the softmax poses serious drawbacks for gradient descent optimization. We establish two negative results for this transform: (1) optimizing any expectation with respect to the softmax must exhibit extreme sensitivity to parameter initialization (``the softmax gravity well''), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence (``softmax damping''). Both findings are based on an analysis of convergence rates using the Lojasiewicz inequality. To circumvent these shortcomings we investigate an alternative transformation, the escort (p-norm) mapping, that demonstrates better optimization properties. In addition to proving bounds on convergence rates to firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation. View details
    Differentiable Meta-Learning of Bandit Policies
    Branislav Kveton
    Martin Mladenov
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 2122-2134
    Preview abstract Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution P. In this work, we learn such policies for an unknown distribution P using samples from P. Our approach is a form of meta-learning and exploits properties of P without making strong assumptions about its form. To do this, we parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is pleasantly general and easy to implement. We derive effective gradient estimators and propose novel variance reduction techniques. We also analyze and experiment with various bandit policy classes, including neural networks and a novel softmax policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments show the versatility of our approach. We also observe that neural network policies can learn implicit biases expressed only through the sampled instances. View details
    On the Global Convergence Rates of Softmax Policy Gradient Methods
    Jincheng Mei
    Chenjun Xiao
    International Conference on Machine Learning (ICML) (2020)
    Preview abstract We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a \L{}ojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-t})$ toward softmax optimal policy. This result resolves an open question in the recent literature. Finally, combining the above two results and additional new $\Omega(1/t)$ lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform \L{}ojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies. View details
    Randomized Exploration in Generalized Linear Bandits
    Branislav Kveton
    Lihong Li
    Mohammad Ghavamzadeh
    23rd International Conference on Artificial Intelligence and Statistics (2020)
    Preview abstract We study two randomized algorithms for generalized linear bandits. The first, GLM-TSL, samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution. The second, GLM-FPL, fits a GLM to a randomly perturbed history of past rewards. We analyze both algorithms and derive $\tilde{O}(d \sqrt{n \log K})$ upper bounds on their $n$-round regret, where $d$ is the number of features and $K$ is the number of arms. The former improves on prior work while the latter is the first for Gaussian noise perturbations in non-linear models. We empirically evaluate both GLM-TSL and GLM-FPL in logistic bandits, and apply GLM-FPL to neural network bandits. Our work showcases the role of randomization, beyond posterior sampling, in exploration. View details
    Preview abstract We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy’s value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods View details
    BubbleRank: Safe Online Learning to Re-Rank via Implicit Click Feedback
    Chang Li
    Branislav Kveton
    Tor Lattimore
    Ilya Markov
    Maarten de Rijke
    35th Conference on Uncertainty in Artificial Intelligence (2019)
    Preview abstract In this paper, we study the problem of safe online learning to re-rank, where user feedback is used to improve the quality of displayed lists. Learning to rank has traditionally been studied in two settings. In the offline setting, rankers are typically learned from relevance labels created by judges. This approach has generally become standard in industrial applications of ranking, such as search. However, this approach lacks exploration and thus is limited by the information content of the offline training data. In the online setting, an algorithm can experiment with lists and learn from feedback on them in a sequential fashion. Bandit algorithms are well-suited for this setting but they tend to learn user preferences from scratch, which results in a high initial cost of exploration. This poses an additional challenge of safe exploration in ranked lists. We propose BubbleRank, a bandit algorithm for safe re-ranking that combines the strengths of both the offline and online settings. The algorithm starts with an initial base list and improves it online by gradually exchanging higher-ranked less attractive items for lower-ranked more attractive items. We prove an upper bound on the n-step regret of BubbleRank that degrades gracefully with the quality of the initial base list. Our theoretical findings are supported by extensive experiments on a large-scale real-world click dataset. View details
    Politex: Regret Bounds for Policy Iteration using Expert Prediction
    Yasin Abbasi-Yadkori
    Peter Bartlett
    Kush Bhatia
    Gellért Weisz
    ICML (2019)
    Preview abstract We present POLITEX (POLicy ITeration with EXpert advice), a variant of policy iteration where each policy is a Boltzmann distribution over the sum of action-value function estimates of the previous policies, and analyze its regret in continuing RL problems. We assume that the value function error after running a policy for m time steps scales as E(m) = E0 + O((d/m)^{1/2}), where E0 is the worst-case approximation error and d is the number of features in a compressed representation of the state-action space. We establish that this condition is satisfied by the LSPE algorithm under certain assumptions on the MDP and policies. Under the error assumption, we show that the regret of POLITEX in uniformly mixing MDPs scales as O(d^{1/2}T^{3/4} + E0T), where O(.) hides logarithmic terms and problem-dependent constants. Thus, we provide the first regret bound for a fully practical model-free method which only scales in the number of features, and not in the size of the underlying MDP. Experiments on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms Pacman (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation. View details
    Perturbed-History Exploration in Stochastic Multi-Armed Bandits
    Branislav Kveton
    Mohammad Ghavamzadeh
    Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macau, China (2019), pp. 2786-2793
    Preview abstract We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds $O(t)$ i.i.d. pseudo-rewards to its history in round $t$ and then pulls the arm with the highest average reward in its perturbed history. Therefore, we call it perturbed-history exploration (PHE). The pseudo-rewards are carefully designed to offset potentially underestimated mean rewards of arms with a high probability. We derive near-optimal gap-dependent and gap-free bounds on the $n$-round regret of PHE. The key step in our analysis is a novel argument that shows that randomized Bernoulli rewards lead to optimism. Finally, we empirically evaluate PHE and show that it is competitive with state-of-the-art baselines. View details
    Perturbed-History Exploration in Stochastic Linear Bandits
    Branislav Kveton
    Mohammad Ghavamzadeh
    Proceedings of the Thirty-fifth Conference on Uncertainty in Artificial Intelligence (UAI-19), Tel Aviv, Israel (2019), pp. 176-186
    Preview abstract We propose a new online algorithm for cumulative regret minimization in a stochastic linear bandit. The algorithm pulls the arm with the highest estimated reward in a linear model trained on its perturbed history. Therefore, we call it perturbed-history exploration in a linear bandit (LinPHE). The perturbed history is a mixture of observed rewards and randomly generated i.i.d. pseudo-rewards. We derive a $\tilde{O}(d \sqrt{n})$ gap-free bound on the $n$-round regret of LinPHE, where $d$ is the number of features. The key steps in our analysis are new concentration and anti-concentration bounds on the weighted sum of Bernoulli random variables. To show the generality of our design, we generalize LinPHE to a logistic model. We evaluate our algorithms empirically and show that they are practical. View details
    Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits
    Branislav Kveton
    Sharan Vaswani
    Zheng Wen
    Mohammad Ghavamzadeh
    Tor Lattimore
    36th International Conference on Machine Learning (2019)
    Preview abstract We propose a bandit algorithm that explores by randomizing its history of rewards. Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards. We design the pseudo rewards such that the bootstrap mean is optimistic with a sufficiently high probability. We call our algorithm Giro, which stands for garbage in, reward out. We analyze Giro in a Bernoulli bandit and derive a $O(K \Delta^{-1} \log n)$ bound on its $n$-round regret, where $\Delta$ is the difference in the expected rewards of the optimal and the best suboptimal arms, and $K$ is the number of arms. The main advantage of our exploration design is that it easily generalizes to structured problems. To show this, we propose contextual Giro with an arbitrary reward generalization model. We evaluate Giro and its contextual variant on multiple synthetic and real-world problems, and observe that it performs well. View details
    Preview abstract Model-free approaches for reinforcement learning (RL) and continuous control find policies based only on past states and rewards, without fitting a model of the system dynamics. They are appealing as they are general purpose and easy to implement; however, they also come with fewer theoretical guarantees than model-based RL. In this work, we present a new model-free algorithm for controlling linear quadratic (LQ) systems, and show that its regret scales as O(T^(ξ+2/3)). The algorithm is based on a reduction of control of Markov decision processes to an expert prediction problem. In practice, it corresponds to a variant of policy iteration with forced exploration, where the policy in each phase is greedy with respect to the average of all previous value functions. This is the first model-free algorithm for adaptive control of LQ systems that provably achieves sublinear regret and has a polynomial computation cost. Empirically, our algorithm dramatically outperforms standard policy iteration, but performs worse than a model-based approach. View details
    No Results Found