Lior Shani
Authored Publications
Multi-turn Reinforcement Learning with Preference Human Feedback
Remi Munos
Bilal Piot
Asaf Cassel
Avital Zipori
Hila Noga
Daniele Calandriello
2024
In this paper, we discuss the multi-turn preference-based RL problem. We start by extending the regularized self-play Nash-MD formulation of preference-based RL to the general multi-turn case and show that it converges to a Nash equilibrium in the online setting, where the transition and preference models are known. We empirically test our algorithm on two environments: one with an explicit reward, and another in which only preference data is available and no reward is assumed. Our experiments show that our algorithm recovers the performance of a direct reward-based RL algorithm when a reward signal is available, even though it uses the weaker preference signal. When only direct preference feedback is available, our algorithm improves upon both the supervised and reward-based RLHF baselines.
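The self-play update at the core of this formulation can be illustrated with a small, single-state sketch. The geometric mixture policy, the preference matrix, and the step sizes below are illustrative assumptions for exposition, not the paper's implementation:

```python
# Minimal single-state sketch of a regularized self-play (Nash-MD style)
# preference-based update. All names and hyperparameters are illustrative.
import numpy as np

def nash_md_step(pi, ref, pref_prob, beta=0.1, eta=0.5):
    """One mirror-descent step of self-play against the regularized mixture.

    pi        : current policy over actions, shape (n_actions,)
    ref       : reference policy used for regularization, shape (n_actions,)
    pref_prob : matrix P[a, b] = probability that action a is preferred to b
    beta      : strength of regularization toward the reference policy
    eta       : step size
    """
    # Geometric mixture of the current policy and the reference policy.
    mix = pi ** (1.0 - beta) * ref ** beta
    mix /= mix.sum()
    # Expected preference of each action against an opponent playing the mixture.
    win_rate = pref_prob @ mix
    # Multiplicative-weights (mirror descent on the simplex) update.
    new_pi = mix * np.exp(eta * win_rate)
    return new_pi / new_pi.sum()
```

In the single-state case, iterating a mirror-descent update of this kind approaches the regularized Nash equilibrium of the preference game; the paper's contribution is extending this formulation to the general multi-turn setting.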
Embedding-Aligned Language Models
Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS-24), Vancouver (2024)
We propose a novel approach for training large language models (LLMs) to adhere to objectives imposed by a latent embedding space. Our method leverages reinforcement learning (RL), treating a pre-trained LLM as an environment. An Embedding-Aligned Guided LanguagE (EAGLE) agent is trained, using a significantly smaller language model, to iteratively steer the LLM's generation towards optimal regions of a latent embedding space, given some predefined criteria. We demonstrate the effectiveness of the EAGLE agent on the MovieLens 25M dataset, on extrapolation tasks for surfacing content gaps that satisfy latent user demand, and on multi-attribute satisfaction for generating creative variations of entities. Our work paves the way for controlled and grounded text generation using LLMs, ensuring consistency with domain-specific knowledge and data representations.
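As a rough illustration of the "frozen LLM as environment" loop, the sketch below rewards an agent for moving the LLM's output toward a target point in embedding space. Here `llm_generate`, `embed`, and `agent` are hypothetical placeholders standing in for the frozen LLM, the embedding model, and the small policy model; this is not the paper's EAGLE implementation.

```python
# Hedged sketch: treat a frozen LLM as an environment and reward an agent for
# steering generations toward a target region of a latent embedding space.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def rollout(agent, llm_generate, embed, prompt, target_embedding, horizon=5):
    """Ask the agent for an edit instruction, let the frozen LLM rewrite the
    text, and reward the improvement in embedding similarity at each step."""
    text = prompt
    trajectory = []
    for _ in range(horizon):
        before = cosine(embed(text), target_embedding)
        instruction = agent.act(text)            # small policy model (assumed API)
        text = llm_generate(text, instruction)   # frozen LLM acts as the environment
        after = cosine(embed(text), target_embedding)
        trajectory.append((instruction, after - before))  # reward = similarity gain
    return text, trajectory
```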
Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback
Paul Roit
Johan Ferret
Geoffrey Cideron
Matthieu Geist
Sertan Girgin
Léonard Hussenot
Nikola Momchev
Piotr Stanczyk
Nino Vieillard
Olivier Pietquin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (2023), 6252–6272
Despite the seeming success of contemporary grounded text generation systems, they often tend to generate text that is factually inconsistent with their input. This phenomenon is emphasized in tasks like summarization, in which the generated summaries should be corroborated by their source article. In this work, we leverage recent progress on textual entailment models to directly address this problem for abstractive summarization systems. We use reinforcement learning with reference-free, textual-entailment rewards to optimize for factual consistency and explore the ensuing trade-offs, as improved consistency may come at the cost of less informative or more extractive summaries. Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience, and conciseness of the generated summaries.
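A reference-free entailment reward of this kind can be sketched with an off-the-shelf NLI model. The checkpoint name and the entailment-label lookup below are assumptions; the paper's exact reward model and scoring details may differ.

```python
# Minimal sketch of an entailment-based reward for RL fine-tuning of a
# summarizer: the reward is the probability that the source entails the summary.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumed off-the-shelf NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
nli_model.eval()

def entailment_reward(source: str, summary: str) -> float:
    """Probability that the source document entails the generated summary."""
    inputs = tokenizer(source, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # The label name/index depends on the specific NLI checkpoint.
    entail_idx = int(nli_model.config.label2id["ENTAILMENT"])
    return float(probs[entail_idx])
```

This scalar can then be plugged into any standard policy-gradient fine-tuning loop as the per-summary reward.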
Reinforcement Learning with History Dependent Dynamic Contexts
Nadav Merlis
Martin Mladenov
Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, Hawaii
We introduce a framework for modeling and solving reinforcement learning problems in non-Markovian, history-dependent environments. Our framework, called the Dynamic Contextual Markov Decision Process (DCMDP), generalizes the contextual MDP framework to handle non-Markovian environments where contexts change over time. To overcome the exponential dependence on history, we leverage an aggregated mapping of previous visits to states, actions, and contexts to construct an optimistic upper-confidence-based algorithm, for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm that handles history-dependent contexts by planning in a latent space and using optimism over history-dependent features. We demonstrate the efficiency and performance of our approach on a recommendation task using the MovieLens dataset, in which the user's behavior is influenced by the agent's recommendations and changes over time.
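One way to picture the aggregation-plus-optimism idea is a count-based exploration bonus computed over an aggregated summary of the history rather than the full history itself. The aggregation function, value estimator, and bonus scale below are illustrative choices for a sketch, not the paper's algorithm:

```python
# Sketch: optimism over an aggregated history summary, so the number of keys
# does not grow exponentially with the history length.
import math
from collections import defaultdict

class OptimisticHistoryAgent:
    def __init__(self, n_actions, bonus_scale=1.0):
        self.n_actions = n_actions
        self.bonus_scale = bonus_scale
        self.counts = defaultdict(int)    # visits per (aggregated context, action)
        self.value = defaultdict(float)   # running value estimates

    def aggregate(self, history):
        # Example aggregation: counts of each (state, action) pair seen so far,
        # which keeps the key size independent of the history length.
        summary = defaultdict(int)
        for state, action in history:
            summary[(state, action)] += 1
        return tuple(sorted(summary.items()))

    def act(self, history):
        ctx = self.aggregate(history)
        def optimistic_value(a):
            n = self.counts[(ctx, a)]
            bonus = self.bonus_scale / math.sqrt(n) if n > 0 else float("inf")
            return self.value[(ctx, a)] + bonus
        return max(range(self.n_actions), key=optimistic_value)

    def update(self, history, action, reward):
        ctx = self.aggregate(history)
        self.counts[(ctx, action)] += 1
        n = self.counts[(ctx, action)]
        self.value[(ctx, action)] += (reward - self.value[(ctx, action)]) / n
```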