Robert Dadashi

Authored Publications
    Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback
    Paul Roit
    Johan Ferret
    Geoffrey Cideron
    Matthieu Geist
    Sertan Girgin
    Léonard Hussenot
    Nikola Momchev
    Piotr Stanczyk
    Nino Vieillard
    Olivier Pietquin
    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (2023), 6252–6272
    Despite the seeming success of contemporary grounded text generation systems, they often tend to generate factually inconsistent text with respect to their input. This phenomenon is emphasized in tasks like summarization, in which the generated summaries should be corroborated by their source article. In this work we leverage recent progress on textual entailment models to directly address this problem for abstractive summarization systems. We use reinforcement learning with reference-free, textual-entailment rewards to optimize for factual consistency and explore the ensuing trade-offs, as improved consistency may come at the cost of less informative or more extractive summaries. Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience and conciseness of the generated summaries.
    Offline Reinforcement Learning with On-Policy Q-Function Regularization
    Laixi Shi
    Yuejie Chi
    Matthieu Geist
    European Conference on Machine Learning (ECML) (2023)
    The core challenge of offline reinforcement learning (RL) is dealing with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy. A large portion of prior work tackles this challenge by implicitly/explicitly regularizing the learning policy towards the behavior policy, which is hard to estimate reliably in practice. In this work, we propose to regularize towards the Q-function of the behavior policy instead of the behavior policy itself, under the premise that the Q-function can be estimated more reliably and easily by a SARSA-style estimate and handles the extrapolation error more straightforwardly. We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
    Continuous Control with Action Quantization from Demonstrations
    Léonard Hussenot
    Damien Vincent
    Sertan Girgin
    Matthieu Geist
    Olivier Pietquin
    International Conference on Machine Learning (ICML) (2022)
    In this paper, we propose a novel Reinforcement Learning (RL) framework for problems with continuous action spaces: Action Quantization from Demonstrations (AQuaDem). The proposed approach consists in learning a discretization of continuous action spaces from human demonstrations. This discretization returns a set of plausible actions (in light of the demonstrations) for each input state, thus capturing the priors of the demonstrator and their multimodal behavior. By discretizing the action space, any discrete action deep RL technique can be readily applied to the continuous control problem. Experiments show that the proposed approach outperforms state-of-the-art methods such as SAC in the RL setup, and GAIL in the Imitation Learning setup. We provide a website with interactive videos: https://google-research.github.io/aquadem/ and make the code available: https://github.com/google-research/google-research/tree/master/aquadem.
    Offline Reinforcement Learning as Anti-Exploration
    Shideh Rezaeifar
    Nino Vieillard
    Léonard Hussenot
    Olivier Pietquin
    Matthieu Geist
    AAAI (2022)
    Offline Reinforcement Learning (RL) aims at learning an optimal control from a fixed dataset, without interactions with the system. An agent in this setting should avoid selecting actions whose consequences cannot be predicted from the data. This is the converse of exploration in RL, which favors such actions. We thus take inspiration from the literature on bonus-based exploration to design a new offline RL agent. The core idea is to subtract a prediction-based exploration bonus from the reward instead of adding it for exploration. This allows the policy to stay close to the support of the dataset. We connect this approach to a more usual regularization of the learnt policy towards the data. Instantiated with a bonus based on the prediction error of a variational autoencoder, we show that our agent is competitive with the state of the art on a set of continuous control locomotion and manipulation tasks.
    Learning Energy Networks with Generalized Fenchel-Young Losses
    Felipe Llinares
    Léonard Hussenot
    Matthieu Geist
    Neural Information Processing Systems (NeurIPS) (2022)
    Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.
    Primal Wasserstein Imitation Learning
    Léonard Hussenot
    Matthieu Geist
    Olivier Pietquin
    ICLR (2021)
    Imitation Learning (IL) methods seek to match the behaviour of an expert with an agent. In the present work, we propose a new IL method based on a conceptually simple algorithm: PWIL, which ties to the primal form of the Wasserstein distance. We present a reward function which is derived offline, as opposed to recent adversarial IL methods, which learn a reward function through interactions with the environment. We show that we can recover expert behaviour on a variety of continuous control tasks of the MuJoCo domain in a sample-efficient manner in terms of environment interactions and expert interactions. Finally, we show that the behaviour of the agent we train matches the behaviour of the expert as measured by a distance, rather than by the commonly used proxy of performance.
    Offline Reinforcement Learning with Pseudometric Learning
    Shideh Rezaeifar
    Nino Vieillard
    Léonard Hussenot
    Olivier Pietquin
    Matthieu Geist
    ICML (2021)
    Offline Reinforcement Learning methods seek to learn a policy from logged transitions of an environment, without any interaction. In the presence of function approximation, and under the assumption of limited coverage of the state-action space of the environment, it is necessary to constrain the policy to visit state-action pairs "close" to the support of logged transitions. In this work, we propose an iterative procedure to learn a pseudometric from logged transitions, and use it to define this notion of closeness. We show its convergence guarantees and extend it to the sampled function approximation setting. We then use this pseudometric to define a new look-up based malus in an actor-critic algorithm: this encourages the actor to stay close, in terms of the defined pseudometric, to the support of logged transitions. Finally, we evaluate the method on hand manipulation and locomotion tasks.
    Hyperparameter Selection for Imitation Learning
    Léonard Hussenot
    Marcin Andrychowicz
    Damien Vincent
    Lukasz Piotr Stafiniak
    Sertan Girgin
    Nikola M Momchev
    Manu Orsini
    Matthieu Geist
    Olivier Pietquin
    ICML (2021)
    We address the issue of tuning hyperparameters (HPs) for imitation learning algorithms when the underlying reward function of the demonstrating expert cannot be observed at any time. The vast literature in imitation learning mostly considers this reward function to be available for HP selection, although this is not a realistic setting. Indeed, if this reward function were available, it should be used directly for policy training, and imitation would not make sense. To tackle this mostly ignored problem, we propose and study, for different representative agents and benchmarks, a number of possible proxies to the return, within an extensive empirical study. We observe that, depending on the algorithm and the environment, some methods allow good performance to be achieved without using the unknown return.
    What Matters for Adversarial Imitation Learning?
    Manu Orsini
    Léonard Hussenot
    Damien Vincent
    Sertan Girgin
    Matthieu Geist
    Olivier Pietquin
    Marcin Andrychowicz
    NeurIPS (2021)
    Adversarial imitation learning has become a standard framework for imitation in continuous control. Over the years, several variations of its components were proposed to enhance the performance of the learned policies as well as the sample complexity of the algorithm. In practice, many of these choices are rarely tested all together in rigorous empirical studies. It is therefore difficult to discuss and understand what choices, among the high-level algorithmic options as well as low-level implementation details, matter. To tackle this issue, we implement more than 50 of these choices in a generic adversarial imitation learning framework and investigate their impacts in a large-scale study (>500k trained agents) with both synthetic and human-generated demonstrations. We analyze the key results and highlight the most surprising findings.
    Show Me the Way: Intrinsic Motivation from Demonstrations
    Léonard Hussenot
    Matthieu Geist
    Olivier Pietquin
    AAMAS (2021)
    In reinforcement learning, exploration of sparse-reward environments remains a great challenge. Most algorithms introduced to tackle this issue make use of an intrinsic motivation derived from the notion of curiosity. While randomness alone allows a very local exploration, these methods generally lead to a more exhaustive search of the state space and thus a higher chance of getting any reward. However, in many environments, exhaustive exploration is impossible due to the number of states and actions. Moreover, it is generally not even desirable, as most behaviours in a realistic setting are, to a human, obviously meaningless. We propose to extract an intrinsic bonus from exploratory demonstrations. We exhibit how to learn this bonus and show how it conveys the demonstrator's way of exploring its environment.