Robert Dadashi
Authored Publications
Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback
Paul Roit
Johan Ferret
Geoffrey Cideron
Matthieu Geist
Sertan Girgin
Léonard Hussenot
Nikola Momchev
Piotr Stanczyk
Nino Vieillard
Olivier Pietquin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (2023), 6252–6272
Despite the seeming success of contemporary grounded text generation systems, they often tend to generate factually inconsistent text with respect to their input. This phenomenon is emphasized in tasks like summarization, in which the generated summaries should be corroborated by their source article. In this work we leverage recent progress on textual entailment models to directly address this problem for abstractive summarization systems. We use reinforcement learning with reference-free, textual-entailment rewards to optimize for factual consistency and explore the ensuing trade-offs, as improved consistency may come at the cost of less informative or more extractive summaries. Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience and conciseness of the generated summaries.
Offline Reinforcement Learning with On-Policy Q-Function Regularization
Laixi Shi
Yuejie Chi
Matthieu Geist
European Conference on Machine Learning (ECML) (2023)
The core challenge of offline reinforcement learning (RL) is dealing with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy. A large portion of prior work tackles this challenge by implicitly or explicitly regularizing the learning policy towards the behavior policy, which is hard to estimate reliably in practice. In this work, we propose to regularize towards the Q-function of the behavior policy instead of the behavior policy itself, under the premise that the Q-function can be estimated more reliably and easily by a SARSA-style estimate and handles the extrapolation error more straightforwardly. We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
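As an illustration of what a SARSA-style estimate of the behavior policy's Q-function can look like, here is a minimal tabular sketch. It is not the paper's implementation: the function name, the transition layout `(s, a, r, s_next, a_next, done)`, and the toy two-state dataset are all assumptions made for this example.

```python
import numpy as np

def sarsa_q_estimate(transitions, n_states, n_actions, gamma=0.99, lr=0.1, epochs=200):
    """Tabular SARSA-style evaluation of the behavior policy from logged data.

    Each transition is (s, a, r, s_next, a_next, done), where a_next is the
    action the behavior policy actually took in s_next (hypothetical layout).
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s_next, a_next, done in transitions:
            # Bootstrap from the logged next action rather than a max over
            # actions, so the target only queries pairs present in the data.
            target = r if done else r + gamma * q[s_next, a_next]
            q[s, a] += lr * (target - q[s, a])
    return q

# Hypothetical two-state chain: action 0 in state 0 moves to state 1 with
# reward 0; action 0 in state 1 terminates with reward 1.
data = [(0, 0, 0.0, 1, 0, False), (1, 0, 1.0, 1, 0, True)]
q_hat = sarsa_q_estimate(data, n_states=2, n_actions=2, gamma=0.9)
```

Because the update never evaluates actions outside the dataset, entries for unseen actions stay at their initial value, which is the sense in which such an estimate sidesteps extrapolation.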
Continuous Control with Action Quantization from Demonstrations
Léonard Hussenot
Damien Vincent
Sertan Girgin
Matthieu Geist
Olivier Pietquin
International Conference on Machine Learning (ICML) (2022)
In this paper, we propose a novel Reinforcement Learning (RL) framework for problems with continuous action spaces: Action Quantization from Demonstrations (AQuaDem). The proposed approach consists in learning a discretization of continuous action spaces from human demonstrations. This discretization returns a set of plausible actions (in light of the demonstrations) for each input state, thus capturing the priors of the demonstrator and their multimodal behavior. By discretizing the action space, any discrete action deep RL technique can be readily applied to the continuous control problem. Experiments show that the proposed approach outperforms state-of-the-art methods such as SAC in the RL setup, and GAIL in the Imitation Learning setup. We provide a website with interactive videos: https://google-research.github.io/aquadem/ and make the code available: https://github.com/google-research/google-research/tree/master/aquadem.
Offline Reinforcement Learning as Anti-Exploration
Shideh Rezaeifar
Nino Vieillard
Léonard Hussenot
Olivier Pietquin
Matthieu Geist
AAAI (2022)
Offline Reinforcement Learning (RL) aims at learning an optimal control from a fixed dataset, without interactions with the system. An agent in this setting should avoid selecting actions whose consequences cannot be predicted from the data. This is the converse of exploration in RL, which favors such actions. We thus take inspiration from the literature on bonus-based exploration to design a new offline RL agent. The core idea is to subtract a prediction-based exploration bonus from the reward instead of adding it for exploration. This allows the policy to stay close to the support of the dataset. We connect this approach to a more usual regularization of the learnt policy towards the data. Instantiated with a bonus based on the prediction error of a variational autoencoder, we show that our agent is competitive with the state of the art on a set of continuous control locomotion and manipulation tasks.
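The core idea of subtracting a bonus can be sketched in a few lines. The abstract uses a bonus based on the prediction error of a variational autoencoder; the sketch below swaps in a simple count-based bonus purely for illustration, and the function names, the `alpha` scale, and the toy dataset are assumptions.

```python
from collections import Counter
import math

def count_bonus(dataset_pairs):
    """Build a count-based novelty bonus (a stand-in for a learned bonus)."""
    counts = Counter(dataset_pairs)

    def bonus(state, action):
        # Large for state-action pairs that are rare or absent in the dataset.
        return 1.0 / math.sqrt(counts[(state, action)] + 1)

    return bonus

def anti_exploration_reward(reward, state, action, bonus, alpha=1.0):
    # Exploration would add the bonus; the offline agent subtracts it.
    return reward - alpha * bonus(state, action)

# Hypothetical dataset: pair (0, 0) is well covered, pair (1, 1) never appears.
pairs = [(0, 0)] * 99 + [(0, 1)] * 3
bonus = count_bonus(pairs)
seen = anti_exploration_reward(1.0, 0, 0, bonus)    # 1 - 1/sqrt(100)
unseen = anti_exploration_reward(1.0, 1, 1, bonus)  # 1 - 1/sqrt(1)
```

With the same raw reward, the well-covered pair keeps most of its value while the unseen pair is penalized heavily, which pushes a policy trained on the modified reward towards the support of the dataset.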
Learning Energy Networks with Generalized Fenchel-Young Losses
Felipe Llinares
Léonard Hussenot
Matthieu Geist
Neural Information Processing Systems (NeurIPS) (2022)
Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.
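For readers who want the construction spelled out, one way to write it is sketched below. The notation is an assumption made here, not taken from the abstract: $\Phi(v, p)$ is the energy function, $\Omega$ a regularizer, and $\mathcal{C}$ the set of feasible outputs.

```latex
\Omega^{\Phi}(v) = \max_{p \in \mathcal{C}} \; \Phi(v, p) - \Omega(p),
\qquad
L_{\Omega}^{\Phi}(v, p) = \Omega^{\Phi}(v) + \Omega(p) - \Phi(v, p) \;\ge\; 0,
\quad p \in \mathcal{C}.
```

With the bilinear pairing $\Phi(v, p) = \langle v, p \rangle$, this reduces to the usual Fenchel-Young loss. Under suitable regularity conditions (for instance, a unique maximizer $p^{\star}(v)$), an envelope-theorem argument gives $\nabla_v L_{\Omega}^{\Phi}(v, p) = \nabla_v \Phi(v, p^{\star}(v)) - \nabla_v \Phi(v, p)$, with $p^{\star}(v)$ treated as a constant, which is why no argmax differentiation is needed.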
Preview abstract
Imitation Learning (IL) methods seek to match the behaviour of an expert with an agent. In the present work, we propose a new IL method based on a conceptually simple algorithm, PWIL, which ties to the primal form of the Wasserstein distance. We present a reward function which is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment. We show that we can recover expert behaviour on a variety of continuous control tasks of the MuJoCo domain in a sample-efficient manner in terms of environment interactions and expert interactions. Finally, we show that the behaviour of the agent we train matches the behaviour of the expert with a distance, rather than the commonly used proxy of performance.
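As background, the standard primal (Kantorovich) form of the Wasserstein distance between two empirical distributions can be written as follows. The symbols are assumptions for this sketch: $x_1, \dots, x_T$ are agent samples, $y_1, \dots, y_D$ are expert samples, and $d$ is a ground metric.

```latex
W_p^p(\hat{\mu}, \hat{\nu})
  = \min_{\theta \in \Theta} \sum_{i=1}^{T} \sum_{j=1}^{D} \theta_{ij} \, d(x_i, y_j)^p,
\qquad
\Theta = \Big\{ \theta \ge 0 \;:\; \sum_{j=1}^{D} \theta_{ij} = \tfrac{1}{T},\;
                                   \sum_{i=1}^{T} \theta_{ij} = \tfrac{1}{D} \Big\}.
```

Each coupling $\theta$ describes how agent mass is transported onto expert mass; the abstract does not specify how the method turns this quantity into a reward, so that step is left out here.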
Offline Reinforcement Learning with Pseudometric Learning
Shideh Rezaeifar
Nino Vieillard
Léonard Hussenot
Olivier Pietquin
Matthieu Geist
ICML (2021)
Offline Reinforcement Learning methods seek to learn a policy from logged transitions of an environment, without any interaction. In the presence of function approximation, and under the assumption of limited coverage of the state-action space of the environment, it is necessary to constrain the policy to visit state-action pairs "close" to the support of logged transitions. In this work, we propose an iterative procedure to learn a pseudometric from logged transitions, and use it to define this notion of closeness. We show its convergence guarantees and extend it to the sampled function approximation setting. We then use this pseudometric to define a new look-up based malus in an actor-critic algorithm: this encourages the actor to stay close, in terms of the defined pseudometric, to the support of logged transitions. Finally, we evaluate the method on hand manipulation and locomotion tasks.
Hyperparameter Selection for Imitation Learning
Léonard Hussenot
Marcin Andrychowicz
Damien Vincent
Lukasz Piotr Stafiniak
Sertan Girgin
Nikola M Momchev
Manu Orsini
Matthieu Geist
Olivier Pietquin
ICML (2021)
We address the issue of tuning hyperparameters (HPs) for imitation learning algorithms when the underlying reward function of the demonstrating expert cannot be observed at any time. The vast literature in imitation learning mostly considers this reward function to be available for HP selection, although this is not a realistic setting. Indeed, were this reward function available, it should be used directly for policy training, and imitation would not make sense. To tackle this mostly ignored problem, we propose and study, for different representative agents and benchmarks, a number of possible proxies to the return, within an extensive empirical study. We observe that, depending on the algorithm and the environment, some methods allow good performance to be achieved without using the unknown return.
What Matters for Adversarial Imitation Learning?
Manu Orsini
Léonard Hussenot
Damien Vincent
Sertan Girgin
Matthieu Geist
Olivier Pietquin
Marcin Andrychowicz
NeurIPS (2021)
Adversarial imitation learning has become a standard framework for imitation in continuous control. Over the years, several variations of its components were proposed to enhance the performance of the learned policies as well as the sample complexity of the algorithm. In practice, many of these choices are rarely tested all together in rigorous empirical studies. It is therefore difficult to discuss and understand what choices, among the high-level algorithmic options as well as low-level implementation details, matter.
To tackle this issue, we implement more than 50 of these choices in a generic adversarial imitation learning framework and investigate their impacts in a large-scale study (>500k trained agents) with both synthetic and human-generated demonstrations. We analyze the key results and highlight the most surprising findings.
Preview abstract
In reinforcement learning, exploration of sparse-reward environments remains a great challenge. Most algorithms introduced to tackle this issue make use of an intrinsic motivation derived from the notion of curiosity. While randomness alone allows a very local exploration, these methods generally lead to a more exhaustive search of the state space and thus a higher chance of getting any reward. However, in many environments, exhaustive exploration is impossible due to the number of states and actions. Moreover, it is generally not even desirable, as most behaviours in a realistic setting are, to a human, obviously meaningless.
We propose to extract an intrinsic bonus from exploratory demonstrations. We exhibit how to learn this bonus and show how it conveys the demonstrator's way of exploring its environment.