Jump to Content

Yinlam Chow

Yinlam Chow is a research scientist at Google Research. Prior to Google, he was a research scientist at DeepMind (from 2017 to 2019) and a research scientist at Osaro, Inc (from 2016 to 2017). He received a Ph.D. from Stanford Institute of Computational and Mathematical Engineering (ICME) in 2017. He has published over 30 papers in major machine learning and control journals and conferences. His research focuses have been on deriving algorithms for risk-sensitive, safe, robust control, sequential decision making, and (model-based and model-free) reinforcement learning, with applications to problems in robotics, power systems, and personalized recommendation.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing large language models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs. View details
    A Mixture-of-Expert Approach to RL-based Dialogue Management
    Ofir Nachum
    Dhawal Gupta
    Moonkyung Ryu
    Mohammad Ghavamzadeh
    Proceedings of the Eleventh International Conference on Learning Representations (ICLR-23), Kigali, Rwanda (2023)
    Preview abstract Despite recent advancements in language models (LMs), their application to dialogue management (DM) and ability to carry on rich conversations remains a challenge. We use reinforcement learning (RL) to develop a dialogue agent that avoids being short-sighted (often outputting generic utterances) and maximizes overall user satisfaction. However, existing RL approaches focus on training an agent that operates at the word level. Since generating semantically-correct and sensible utterances from a large vocabulary space is combinatorially complex, RL can struggle to produce engaging dialogue, even if warm-started with a pre-trained LM. To address this issue, we develop a RL-based DM using a novel mixture-of-expert (MoE) approach, which consists of (i) a language representation that captures diverse information, (ii) several modulated LMs (or experts) to generate candidate utterances, and (iii) a RL-based DM that performs dialogue planning with the utterances generated by the experts. This MoE approach provides greater flexibility to generate sensible utterances of different intents, and allows RL to focus on conversational-level DM. We compare it with SOTA baselines on open-domain dialogues and demonstrate its effectiveness both in the diversity and sensibility of the generated utterances as well as the overall DM performance. View details
    Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management
    Dhawal Gupta
    Mohammad Ghavamzadeh
    Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS-23), New Orleans (2023)
    Preview abstract Reinforcement learning (RL) has offered great promise for developing dialogue management (DM) agents that avoid being short-sighted, conduct rich conversations, and maximize overall user satisfaction. Despite recent developments in deep RL and language models (LMs), using RL to power conversational chatbots remain a formidable challenge. This is because deep RL algorithms require online exploration to learn effectively, but collecting fresh human-bot interactions can be expensive and unsafe. This issue is exacerbated by the combinatorial action space that these algorithms need to handle, as most LM agents generate responses at the word-level. Leveraging the recent advances of Mixture-of-Expert Language Models (MoE-LMs) that capture diverse semantics, generate utterances of different intents, and are amenable for multi-turn DM, we develop a gamut of offline RL algorithms that excel in dialogue planning. Through exploiting the MoE-LM structure, our methods significantly reduce the action space and improve the efficacy of RL DM. We compare that with SOTA methods on open-domain dialogues to demonstrate their effectiveness both in the diversity of generated utterances and the overall DM performance. View details
    Preview abstract Interactive Recommender Systems (RSs) have emerged as a promising paradigm to overcome the limitations of the primitive user feedback used by traditional RSs (e.g., clicks, item consumption, ratings), allowing users to express intent, preferences, constraints, and contexts in a richer fashion using natural language. Still, more research is needed to find the most effective ways to use this feedback. One major challenge is inferring a user's intended semantic intent from given the open-ended terms (say, attributes or tags) used to describe a desired item, and utilize that to refine recommendation results. Leveraging Concept Activation Vectors (CAVs) [13], we develop a framework to learn a representation that captures the semantics of such attributes and connect them to user preferences and behaviors in RSs. One novel feature of our approach is its ability to distinguish objective and subjective attributes (including subjectivity of degree and of sense) and associate different senses of subjective attributes with different user. We demonstrate on both synthetic and real-world datasets that our CAV representation not only accurately interprets users' subjective semantics, but can also be used to improve recommendations. View details
    Non-Stationary Off-policy Optimization
    Joey Hong
    Branislav Kveton
    International Conference on Artificial Intelligence and Statistics (AISTATS) (2021)
    Preview abstract Off-policy learning is a framework for estimating the value of and optimizing policies offline from logged data without deploying them. Real-world environments are nonstationary, and the optimized policies should be able to adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary environments. Our key idea is to use a change-point detector to partition the logged data into categorical latent states, then find a near-optimal policy conditioned on latent state. We derive high-probability bounds on our off-policy estimates and optimization. Furthermore, we also propose a practical approach to deploy our policy online and evaluate our approach comprehensively on a real-world clickstream dataset. View details
    CaQL: Continuous Action Q-Learning
    Moonkyung Ryu
    Proceedings of the Eighth International Conference on Learning Representations (ICLR-20), Addis Ababa, Ethiopia (2020)
    Preview abstract In this work we propose CaQL, a value-based reinforcement learning (RL) algorithm that handles continuous actions, whose Q-function is modeled by a generic feed-forward neural network. We show that the problem of calculating Bellman residual can be posed as a mixed-integer linear programming (MILP) problem. Furthermore to reduce the complexity of computing Bellman residual, we propose three techniques (i) dynamic tolerance, (ii) dual filter, (iii) clustering to speed up the computation of max-Q values. Finally, to illustrate the efficiency of CaQL, we compare it with state-of-the-art RL algorithms on benchmark continuous control problems that have various action constraints, and show that CaQL significantly outperforms policy-based methods in heavily constrained environments. View details
    Latent Bandits Revisited
    Joey Hong
    Branislav Kveton
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 13423-13433
    Preview abstract A latent bandit is a bandit problem where the learning agent knows reward distributions of arms conditioned on an unknown discrete latent state. The goal of the agent is to identify the latent state, after which it can act optimally. This setting is a natural midpoint between online and offline learning, where complex models can be learned offline and the agent identifies the latent state online. This is of high practical relevance, for instance in recommender systems. In this work, we propose general algorithms for latent bandits, based on both upper confidence bounds and Thompson sampling. The algorithms are contextual, and aware of model uncertainty and misspecification. We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than actions. A comprehensive empirical study showcases the advantages of our approach. View details
    Safe Policy Learning for Continuous Control
    Ofir Nachum
    Mohammad Ghavamzadeh
    Conference on Robot Learning (CoRL) (2020)
    Preview abstract We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through near-safe policies, i.e.,~policies that keep the agent in desirable situations, both during training and at convergence. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while enforcing near-constraint satisfaction for every policy update by projecting either the policy parameter or the selected action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, in practice our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world robot obstacle-avoidance problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. View details
    BRPO: Batch Residual Policy Optimization
    Sungryull Sohn
    Ofir Nachum
    Honglak Lee
    Proceedings of the Twenty-ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan (2020), pp. 2824-2830
    Preview abstract In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by some maximum degree that is the same at each state. This can cause batch RL to be overly conservative, unable to exploit large policy changes at frequently-visited, highconfidence states without risking poor performance at sparsely-visited states. To remedy this, we propose residual policies, where the allowable deviation of the learned policy is state-action-dependent. We derive a new for RL method, BRPO, which learns both the policy and allowable deviation that jointly maximize a lower bound on policy performance. We show that BRPO achieves the state-of-the- art performance in a number of tasks. View details
    Preview abstract We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy’s value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods View details
    Preview abstract In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, our algorithm eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic of previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques. View details
    Risk-Sensitive Generative Adversarial Imitation Learning
    Jonathan Lacotte
    Mohammad Ghavamzadeh
    Marco Pavone
    AISTATS (2018)
    Preview abstract We study risk-sensitive imitation learning where the agent's goal is to perform at least as well as the expert in terms of a risk profile. We first formulate our risk-sensitive imitation learning setting. We consider the generative adversarial approach to imitation learning (GAIL) and derive an optimization problem for our formulation, which we call it risk-sensitive GAIL (RS-GAIL). We then derive two different versions of our RS-GAIL optimization problem that aim at matching the risk profiles of the agent and the expert w.r.t. Jensen-Shannon (JS) divergence and Wasserstein distance, and develop risk-sensitive generative adversarial imitation learning algorithms based on these optimization problems. We evaluate the performance of our algorithms and compare them with GAIL and the risk-averse imitation learning (RAIL) algorithms in two MuJoCo and two OpenAI classical control tasks. View details
    Preview abstract In many real-world reinforcement learning (RL) problems, besides optimizing the main objective function, an agent must concurrently avoid violating a number of constraints. In particular, besides optimizing performance it is crucial to guarantee the safety of an agent during training as well as deployment (e.g. a robot should avoid taking actions - exploratory or not - which irrevocably harm its hardware). To incorporate safety in RL, we derive algorithms under the framework of constrained Markov decision problems (CMDPs), an extension of the standard Markov decision problems (MDPs) augmented with constraints on expected cumulative costs. Our approach hinges on a novel \emph{Lyapunov} method. We define and present a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local, linear constraints. Leveraging these theoretical underpinnings, we show how to use the Lyapunov approach to systematically transform dynamic programming (DP) and RL algorithms into their safe counterparts. To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. Our results show that our proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance. View details
    Preview abstract We study the sparse entropy-regularized RL (ERL) problem in which the entropy term is a special form of the Tsallis entropy. The opti mal policy of this formulation is sparse, i.e., at each state, it has non-zero probability for only a small number of actions. This addresses the main drawback of standard (soft) ERL, namely having softmax optimal policy. The problem with a soft max policy is that at every state, it may assign a non-negligible probability mass to non-optimal actions. This problem is aggravated as the number of actions is increased. Lee et al. (2018) studied the properties of the sparse ERL problem and proposed value-based algorithms to solve it. In this paper, we follow the work of Nachum et al. (2017) in the soft ERL setting, and propose a class of novel path consistency learning (PCL) algorithms, called sparse PCL, for the sparse ERL problem that can work with both on-policy and off-policy data. We first derive a consistency equation for sparse ERL, called sparse consistency. We then prove that sparse consistency only implies sub-optimality (unlike the soft consistency in soft ERL). We then use the sparse consistency to derive our sparse PCL algorithms. We empirically compare sparse PCL with its soft counterpart, and show its advantage, especially in problems with large number of actions. View details
    Imitation Learning from Visual Data with Multiple Intentions
    Aviv Tamar
    Khashayar Rohanimanesh
    Chris Virgorito
    Ben Goodrich
    Michael Kahane
    Derik Pridmore
    ICLR (2018)
    Preview abstract Recent advances in learning from demonstrations (LfD) with deep neural networks have enabled learning complex robot skills that involve high dimensional perception such as raw image inputs. LfD algorithms generally assume learning from single task demonstrations. In practice, however, it is more efficient for a teacher to demonstrate a multitude of tasks without careful task set up, labeling, and engineering. Unfortunately in such cases, traditional imitation learning techniques fail to represent the multi-modal nature of the data, and often result in sub-optimal behavior. In this paper we present an LfD approach for learning multiple modes of behavior from visual data. Our approach is based on a stochastic deep neural network (SNN), which represents the underlying intention in the demonstration as a stochastic activation in the network. We present an efficient algorithm for training SNNs, and for learning with vision inputs, we also propose an architecture that associates the intention with a stochastic attention module. Furthermore, we demonstrate our method on real robot visual object reaching tasks, and show that it can reliably learn the multiple behavior modes in the demonstration data. Video results are available at \url{https://vimeo.com/240212286/fd401241b9}. View details
    More Robust Doubly Robust Off-policy Evaluation
    Mehrdad Farajtabar
    Mohammad Ghavamzadeh
    ICML 2018 (2018)
    Preview abstract We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy. In particular, we focus on studying the doubly robust (DR) estimator, a hybrid off-policy value estimator that is unbiased and oftentimes has lower variance than traditional importance sampling (IS) estimators, in sequential decision making. While (Jiang & Li, 2016) already proposed the use of this estimator in RL, the important problem of how to properly choose model parameters in the DR estimator remains unresolved. In this work, we propose a novel methodology to design the model parameters in DR estimation for the sake of variance minimization, for which the resulting DR estimator is termed as the more robust doubly robust (MRDR) estimator. We further show the asymptotically optimality of this estimator over the class of consistent and asymptotically normal estimators, and we finally illustrate the improved accuracy of the MRDR estimator in several contextual bandit and RL benchmark experiments. View details
    A Block Coordinate Ascent Algorithm for Mean-Variance Optimization
    Bo Liu
    Tengyang Xie
    Yangyang Xu
    Mohammad Ghavamzadeh
    Daoming Lyu
    Daesob Yoon
    NeurIPS (2018)
    Preview abstract Risk management in dynamic decision problems is a primary concern in many fields, including financial investment, autonomous driving, and healthcare. The mean-variance function is one of the most widely used objective functions in risk management due to its simplicity and interpretability. Existing algorithms for mean-variance optimization are based on multi-time-scale stochastic approximation, whose learning rate schedules are often hard to tune, and have only asymptotic convergence proof. In this paper, we develop a model-free policy search framework for mean-variance optimization with finite-sample error bound analysis (to local optima). Our starting point is a reformulation of the original mean-variance function with its Fenchel dual, from which we propose a stochastic block coordinate ascent policy search algorithm. Both the asymptotic convergence guarantee of the last iteration's solution and the convergence rate of the randomly picked solution are provided, and their applicability is demonstrated on several benchmark domains. View details
    Risk-Constrained Reinforcement Learning with Percentile Risk Criteria
    Lucas Janson
    Marco Pavone
    Mohammad Ghavamzadeh
    JMLR, vol. 18 (2017), pp. 1-51
    Preview abstract In many sequential decision-making problems one is interested in minimizing an expected cumula- tive cost while taking into account risk, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented via a chance constraint or a constraint on the conditional value-at-risk (CVaR) of the cumulative cost. We collectively refer to such problems as percentile risk-constrained MDPs. Specifically, we first derive a formula for computing the gradient of the Lagrangian function for percentile risk-constrained MDPs. Then, we devise policy gradient and actor-critic algorithms that (1) estimate such gradient, (2) update the policy in the descent direction, and (3) update the Lagrange multiplier in the ascent direction. For these algorithms we prove convergence to locally optimal policies. Finally, we demonstrate the effectiveness of our algorithms in an optimal stopping problem and an online marketing application. View details
    No Results Found