Yinlam Chow

Yinlam Chow is a research scientist at Google Research. Prior to Google, he was a research scientist at DeepMind (from 2017 to 2019) and a research scientist at Osaro, Inc (from 2016 to 2017). He received a Ph.D. from Stanford Institute of Computational and Mathematical Engineering (ICME) in 2017. He has published over 30 papers in major machine learning and control journals and conferences. His research focuses have been on deriving algorithms for risk-sensitive, safe, robust control, sequential decision making, and (model-based and model-free) reinforcement learning, with applications to problems in robotics, power systems, and personalized recommendation.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing large language models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs. View details
    Discovering Personalized Semantics for Soft Attributes in Recommender Systems using Concept Activation Vectors
    Christina Göpfert
    Alex Haig
    Ivan Vendrov
    Tyler Lu
    Hubert Pham
    Mohammad Ghavamzadeh
    ACM Transactions on Recommender Systems(2024)
    Preview abstract Interactive recommender systems have emerged as a promising paradigm to overcome the limitations of the primitive user feedback used by traditional recommender systems (e.g., clicks, item consumption, ratings). They allow users to express intent, preferences, constraints, and contexts in a richer fashion, often using natural language (including faceted search and dialogue). Yet more research is needed to find the most effective ways to use this feedback. One challenge is inferring a user's semantic intent from the open-ended terms or attributes often used to describe a desired item, and using it to refine recommendation results. Leveraging concept activation vectors (CAVs) (Kim, et al., 2018) a recently developed approach for model interpretability in machine learning, we develop a framework to learn a representation that captures the semantics of such attributes and connects them to user preferences and behaviors in recommender systems. One novel feature of our approach is its ability to distinguish objective and subjective attributes (both subjectivity of degree and of sense), and associate different senses of subjective attributes with different users. We demonstrate on both synthetic and real-world data sets that our CAV representation not only accurately interprets users' subjective semantics, but can also be used to improve recommendations through interactive item critiquing. View details
    Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management
    Dhawal Gupta
    Mohammad Ghavamzadeh
    Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS-23), New Orleans(2023)
    Preview abstract Reinforcement learning (RL) has offered great promise for developing dialogue management (DM) agents that avoid being short-sighted, conduct rich conversations, and maximize overall user satisfaction. Despite recent developments in deep RL and language models (LMs), using RL to power conversational chatbots remain a formidable challenge. This is because deep RL algorithms require online exploration to learn effectively, but collecting fresh human-bot interactions can be expensive and unsafe. This issue is exacerbated by the combinatorial action space that these algorithms need to handle, as most LM agents generate responses at the word-level. Leveraging the recent advances of Mixture-of-Expert Language Models (MoE-LMs) that capture diverse semantics, generate utterances of different intents, and are amenable for multi-turn DM, we develop a gamut of offline RL algorithms that excel in dialogue planning. Through exploiting the MoE-LM structure, our methods significantly reduce the action space and improve the efficacy of RL DM. We compare that with SOTA methods on open-domain dialogues to demonstrate their effectiveness both in the diversity of generated utterances and the overall DM performance. View details
    A Mixture-of-Expert Approach to RL-based Dialogue Management
    Ofir Nachum
    Dhawal Gupta
    Moonkyung Ryu
    Mohammad Ghavamzadeh
    Proceedings of the Eleventh International Conference on Learning Representations (ICLR-23), Kigali, Rwanda(2023)
    Preview abstract Despite recent advancements in language models (LMs), their application to dialogue management (DM) and ability to carry on rich conversations remains a challenge. We use reinforcement learning (RL) to develop a dialogue agent that avoids being short-sighted (often outputting generic utterances) and maximizes overall user satisfaction. However, existing RL approaches focus on training an agent that operates at the word level. Since generating semantically-correct and sensible utterances from a large vocabulary space is combinatorially complex, RL can struggle to produce engaging dialogue, even if warm-started with a pre-trained LM. To address this issue, we develop a RL-based DM using a novel mixture-of-expert (MoE) approach, which consists of (i) a language representation that captures diverse information, (ii) several modulated LMs (or experts) to generate candidate utterances, and (iii) a RL-based DM that performs dialogue planning with the utterances generated by the experts. This MoE approach provides greater flexibility to generate sensible utterances of different intents, and allows RL to focus on conversational-level DM. We compare it with SOTA baselines on open-domain dialogues and demonstrate its effectiveness both in the diversity and sensibility of the generated utterances as well as the overall DM performance. View details
    Preview abstract Interactive Recommender Systems (RSs) have emerged as a promising paradigm to overcome the limitations of the primitive user feedback used by traditional RSs (e.g., clicks, item consumption, ratings), allowing users to express intent, preferences, constraints, and contexts in a richer fashion using natural language. Still, more research is needed to find the most effective ways to use this feedback. One major challenge is inferring a user's intended semantic intent from given the open-ended terms (say, attributes or tags) used to describe a desired item, and utilize that to refine recommendation results. Leveraging Concept Activation Vectors (CAVs) [13], we develop a framework to learn a representation that captures the semantics of such attributes and connect them to user preferences and behaviors in RSs. One novel feature of our approach is its ability to distinguish objective and subjective attributes (including subjectivity of degree and of sense) and associate different senses of subjective attributes with different user. We demonstrate on both synthetic and real-world datasets that our CAV representation not only accurately interprets users' subjective semantics, but can also be used to improve recommendations. View details
    Non-Stationary Off-policy Optimization
    Joey Hong
    Branislav Kveton
    International Conference on Artificial Intelligence and Statistics (AISTATS)(2021)
    Preview abstract Off-policy learning is a framework for estimating the value of and optimizing policies offline from logged data without deploying them. Real-world environments are nonstationary, and the optimized policies should be able to adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary environments. Our key idea is to use a change-point detector to partition the logged data into categorical latent states, then find a near-optimal policy conditioned on latent state. We derive high-probability bounds on our off-policy estimates and optimization. Furthermore, we also propose a practical approach to deploy our policy online and evaluate our approach comprehensively on a real-world clickstream dataset. View details
    Preview abstract We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy’s value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods View details
    Safe Policy Learning for Continuous Control
    Ofir Nachum
    Mohammad Ghavamzadeh
    Conference on Robot Learning (CoRL)(2020)
    Preview abstract We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through near-safe policies, i.e.,~policies that keep the agent in desirable situations, both during training and at convergence. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while enforcing near-constraint satisfaction for every policy update by projecting either the policy parameter or the selected action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, in practice our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world robot obstacle-avoidance problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. View details
    BRPO: Batch Residual Policy Optimization
    Sungryull Sohn
    Ofir Nachum
    Honglak Lee
    Proceedings of the Twenty-ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan(2020), pp. 2824-2830
    Preview abstract In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by some maximum degree that is the same at each state. This can cause batch RL to be overly conservative, unable to exploit large policy changes at frequently-visited, highconfidence states without risking poor performance at sparsely-visited states. To remedy this, we propose residual policies, where the allowable deviation of the learned policy is state-action-dependent. We derive a new for RL method, BRPO, which learns both the policy and allowable deviation that jointly maximize a lower bound on policy performance. We show that BRPO achieves the state-of-the- art performance in a number of tasks. View details
    CaQL: Continuous Action Q-Learning
    Moonkyung Ryu
    Proceedings of the Eighth International Conference on Learning Representations (ICLR-20), Addis Ababa, Ethiopia(2020)
    Preview abstract In this work we propose CaQL, a value-based reinforcement learning (RL) algorithm that handles continuous actions, whose Q-function is modeled by a generic feed-forward neural network. We show that the problem of calculating Bellman residual can be posed as a mixed-integer linear programming (MILP) problem. Furthermore to reduce the complexity of computing Bellman residual, we propose three techniques (i) dynamic tolerance, (ii) dual filter, (iii) clustering to speed up the computation of max-Q values. Finally, to illustrate the efficiency of CaQL, we compare it with state-of-the-art RL algorithms on benchmark continuous control problems that have various action constraints, and show that CaQL significantly outperforms policy-based methods in heavily constrained environments. View details