Raphaël Marinier

Authored Publications
    Learning Equilibria in Mean-Field Games: Introducing Mean-Field PSRO
    Paul Muller
    Mark Rowland
    Romuald Elie
    Georgios Piliouras
    Julien Perolat
    Mathieu Lou-roch Lauriere
    Olivier Pietquin
    Karl Paul Tuyls
    Proc. of AAMAS 2022 (2022)
    Recent advances in multi-agent learning have seen the introduction of a family of algorithms revolving around the population-based training method PSRO, showing convergence to Nash, correlated, and coarse correlated equilibria. Notably, as the number of agents increases, learning best responses becomes exponentially more difficult, which hampers PSRO training methods. The field of mean-field games provides an asymptotic solution to this problem when the considered games are anonymous-symmetric. Unfortunately, the mean-field approximation introduces non-linearities which prevent a straightforward adaptation of PSRO. Building upon optimization and adversarial regret minimization, this paper sidesteps this issue and introduces Mean-Field PSRO, an adaptation of PSRO which learns Nash, coarse correlated, and correlated equilibria in mean-field games. The key is to replace the exact distribution-computation step with newly defined mean-field no-adversarial-regret learners, or with black-box optimization. We compare the asymptotic complexity of the approach to standard PSRO, greatly improve empirical bandit convergence speed by compressing temporal mixture weights, and ensure the approach is theoretically robust to payoff noise. Finally, we illustrate the speed and accuracy of Mean-Field PSRO on several mean-field games, demonstrating convergence to strong and weak equilibria.
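
For readers unfamiliar with PSRO, the sketch below illustrates the general loop the abstract builds on: grow a population of strategies, meta-solve the restricted game with a no-regret learner (Hedge here), and best-respond to the induced mean field. It is a toy on a random matrix game; all names (`hedge`, `G`) are illustrative, and this is not the authors' Mean-Field PSRO implementation.

```python
import numpy as np

# Schematic Mean-Field-PSRO-style loop on a toy symmetric game.
# G[a, b] is the payoff of an individual playing pure action `a`
# while the population (mean field) plays `b`.
rng = np.random.default_rng(0)
n_actions = 5
G = rng.normal(size=(n_actions, n_actions))

def hedge(payoffs, steps=2000, lr=0.1):
    """No-regret (multiplicative-weights) meta-solver over a restricted
    strategy set; payoffs[i, j] is the payoff of strategy i against a
    population playing strategy j."""
    w = np.ones(len(payoffs)) / len(payoffs)
    avg = np.zeros_like(w)
    for _ in range(steps):
        u = payoffs @ w          # payoff of each strategy vs current mix
        w = w * np.exp(lr * u)
        w /= w.sum()
        avg += w
    return avg / steps

population = [0]  # indices of pure strategies discovered so far
for _ in range(10):
    # 1. Meta-solve the restricted game with the no-regret learner.
    sub = G[np.ix_(population, population)]
    mix = hedge(sub)
    # 2. Mean field induced by the meta-distribution over the population.
    mu = np.zeros(n_actions)
    for p, s in zip(mix, population):
        mu[s] += p
    # 3. Best response to the mean field (exact here; learned in general).
    br = int(np.argmax(G @ mu))
    if br in population:
        break  # no profitable deviation: approximate equilibrium reached
    population.append(br)

print("support:", population, "meta-mix:", np.round(mix, 3))
```
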
    Solving N player dynamic routing games with congestion: a mean field approach (extended abstract)
    Alexandre Bayen
    Eric Goubault
    Julien Perolat
    Mathieu Lou-roch Lauriere
    Olivier Pietquin
    Romuald Elie
    Sarah Perrin
    Sertan Girgin
    Theophile Cabannes
    (2022)
    The recent emergence of navigational applications has changed traffic patterns and enabled new types of congestion-aware routing control, such as dynamic road pricing. Using the fundamental diagram of traffic flows, as applied in macroscopic and mesoscopic traffic modeling, we introduce a new N-player dynamic routing game with explicit congestion dynamics. The model is well-posed and can reproduce heterogeneous departure times and congestion spill-back phenomena. However, as Nash equilibrium computation is PPAD-complete, solving the game becomes intractable for large but realistic numbers of drivers N, as illustrated numerically. Therefore, the corresponding mean-field game is introduced. Its equilibrium policy provides an approximate Nash equilibrium of the original game, with vanishing average deviation incentives as N goes to infinity. Experiments are performed on several classical benchmark networks of the traffic community: the Pigou, Braess, and Sioux Falls networks, with heterogeneous origin, destination, and departure-time tuples. The Pigou and Braess examples reveal the accuracy of the mean-field approximation whenever the number of vehicles exceeds N=30. On the Sioux Falls network (76 links, 100 time steps), this approach enables learning traffic dynamics with more than 14,000 vehicles.
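
As a concrete illustration of the mean-field view on one of the named benchmarks, the following sketch iterates a softened best response to the aggregate flow on Pigou's two-road network. The textbook latency functions (constant cost 1 versus cost equal to the share of traffic on the road) are assumptions here, not necessarily the paper's exact setup.

```python
import numpy as np

# Mean-field idea on Pigou's two-road network: each driver is
# negligible, so everyone reacts to the aggregate flow, and the
# equilibrium flow must reproduce itself under best responses.
def costs(x):
    # Road 0: constant latency 1. Road 1: latency equal to the
    # fraction x of the population using it (textbook assumption).
    return np.array([1.0, x])

x = 0.5  # initial fraction of the population on road 1
for _ in range(100):
    c = costs(x)
    # Softened best response: shift mass toward the cheaper road.
    target = 1.0 if c[1] < c[0] else (0.0 if c[1] > c[0] else x)
    x += 0.1 * (target - x)

# Converges to x = 1: everyone takes road 1 at cost 1, the classic
# Pigou equilibrium (socially suboptimal, as in the original example).
print(f"equilibrium fraction on road 1: {x:.3f}")
```
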
    Hyperparameter Selection for Imitation Learning
    Léonard Hussenot
    Marcin Andrychowicz
    Damien Vincent
    Lukasz Piotr Stafiniak
    Sertan Girgin
    Nikola M Momchev
    Manu Orsini
    Matthieu Geist
    Olivier Pietquin
    ICML (2021)
    We address the issue of tuning hyperparameters (HPs) for imitation learning algorithms when the underlying reward function of the demonstrating expert cannot be observed at any time. The vast literature on imitation learning mostly considers this reward function to be available for HP selection, although this is not a realistic setting. Indeed, were this reward function available, it should be used directly for policy training, and imitation would not make sense. To tackle this mostly ignored problem, we propose and study, for different representative agents and benchmarks, a number of possible proxies for the return, within an extensive empirical study. We observe that, depending on the algorithm and the environment, some methods allow good performance to be achieved without using the unknown return.
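
One way to make the proxy idea concrete: rank HP configurations by how closely the trained policy matches held-out expert actions, since the true return is unobservable. The sketch below is a hypothetical proxy of this kind; `train_fn` and `policy` are assumed interfaces, and the paper evaluates several proxies rather than prescribing this one.

```python
import numpy as np

def action_mse_proxy(policy, expert_obs, expert_actions):
    """Hypothetical proxy for the unobservable return: mean squared
    error between policy actions and held-out expert actions."""
    pred = np.stack([policy(o) for o in expert_obs])
    return float(np.mean((pred - expert_actions) ** 2))

def select_hyperparameters(train_fn, hp_candidates, expert_obs, expert_actions):
    """Rank HP configurations by the proxy instead of the true return.
    `train_fn(hps)` is assumed to return a trained policy callable."""
    scored = []
    for hps in hp_candidates:
        policy = train_fn(hps)
        scored.append((action_mse_proxy(policy, expert_obs, expert_actions), hps))
    return min(scored, key=lambda t: t[0])[1]  # lowest proxy error

# Toy usage: two "trained" policies; the proxy picks the better match.
rng = np.random.default_rng(0)
obs = rng.normal(size=(32, 4))
acts = obs @ np.ones(4)
train = lambda hps: (lambda o: hps["scale"] * o.sum())
print(select_hyperparameters(train, [{"scale": 0.5}, {"scale": 1.0}], obs, acts))
```
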
    What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study
    Marcin Andrychowicz
    Piotr Michal Stanczyk
    Manu Orsini
    Sertan Girgin
    Léonard Hussenot
    Matthieu Geist
    Olivier Pietquin
    Marcin Michalski
    Sylvain Gelly
    ICLR (2021)
    In recent years, reinforcement learning (RL) has been successfully applied to many different continuous control tasks. While RL algorithms are often conceptually simple, their state-of-the-art implementations take numerous low- and high-level design decisions that strongly affect the performance of the resulting agents. Those choices are usually not extensively discussed in the literature, leading to discrepancies between published descriptions of algorithms and their implementations. This makes it hard to attribute progress in RL and slows down overall progress [Engstrom'20]. As a step towards filling that gap, we implement more than 50 such "choices" in a unified on-policy deep actor-critic framework, allowing us to investigate their impact in a large-scale empirical study. We train over 250,000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for the training of on-policy deep actor-critic RL agents.
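
As an example of the kind of low-level "choice" such a study compares, here is a sketch of generalized advantage estimation with a per-batch advantage-normalization toggle, one of the implementation decisions that can quietly change results. The function is illustrative, not taken from the paper's framework.

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95, normalize=True):
    """Generalized advantage estimation with one togglable choice:
    per-batch advantage normalization. `values` carries one extra
    bootstrap entry for the state after the last step."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    if normalize:  # a seemingly minor choice that affects performance
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    return adv

r = np.array([0.0, 0.0, 1.0])
v = np.array([0.5, 0.6, 0.8, 0.0])  # includes bootstrap value
d = np.array([0.0, 0.0, 1.0])
print(np.round(gae(r, v, d, normalize=False), 3))
```
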
    Self-Attentive Credit Assignment for Transfer in Reinforcement Learning
    Johan Ferret
    Matthieu Geist
    Olivier Pietquin
    Proc. of IJCAI 2020
    The ability to transfer representations to novel environments and tasks is a sensible desideratum for general learning agents. Despite its apparent promise, transfer in reinforcement learning is still an open and under-explored research area. In this paper, we suggest that credit assignment, regarded as a supervised learning task, could be used to accomplish transfer. Our contribution is twofold: we introduce a new credit assignment mechanism based on self-attention, and we show that the learned credit can be transferred to in-domain and out-of-domain scenarios.
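
A rough sketch of the mechanism's general shape: self-attention over a trajectory of state embeddings, producing one scalar credit per time step. Shapes, weight names, and the scalar output head are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_credit(states, Wq, Wk, Wv, w_out):
    """Toy sketch: attend over a trajectory of state embeddings
    (T x d) and emit one scalar credit per time step."""
    Q, K, V = states @ Wq, states @ Wk, states @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (T, T) weights
    context = attn @ V                               # (T, d) mixed states
    return context @ w_out                           # (T,) per-step credit

rng = np.random.default_rng(0)
T, d = 8, 16
states = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
credit = self_attention_credit(states, Wq, Wk, Wv, rng.normal(size=d))
print(credit.shape)  # one credit value per time step: (8,)
```
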
    SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference
    Lasse Espeholt
    Piotr Michal Stanczyk
    Marcin Michalski
    International Conference on Learning Representations (2020)
    We present a modern scalable reinforcement learning agent called SEED (Scalable, Efficient Deep-RL). By effectively utilizing modern accelerators, we show that it is not only possible to train on millions of frames per second but also to lower the cost of experiments compared to current methods. We achieve this with a simple architecture that features centralized inference and an optimized communication layer. SEED adopts two state-of-the-art distributed algorithms, IMPALA/V-trace (policy gradients) and R2D2 (Q-learning), and is evaluated on Atari-57, DeepMind Lab, and Google Research Football. We improve the state of the art on Football and reach state-of-the-art performance on Atari-57 three times faster in wall-clock time. For the scenarios we consider, a 40% to 80% cost reduction for running experiments is achieved. The implementation, along with experiments, is open-sourced so that results can be reproduced and novel ideas tried out.
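
The centralized-inference idea can be sketched as follows: actors own no model weights and instead stream observations to a server that batches them into a single forward pass. Python queues and the stub `model` stand in for SEED's real gRPC transport and accelerator inference; everything here is illustrative.

```python
import queue
import threading
import numpy as np

# Schematic of SEED-style centralized inference.
obs_queue = queue.Queue()
reply_queues = {i: queue.Queue() for i in range(4)}

def model(batch):
    """Stub for the accelerator forward pass over a batch of observations."""
    return np.argmax(batch, axis=-1)

def inference_server(batch_size=4):
    while True:
        items = [obs_queue.get() for _ in range(batch_size)]
        actor_ids, obs = zip(*items)
        actions = model(np.stack(obs))      # one batched call per step
        for aid, act in zip(actor_ids, actions):
            reply_queues[aid].put(act)

def actor(actor_id, steps=3):
    rng = np.random.default_rng(actor_id)
    for _ in range(steps):
        obs = rng.normal(size=8)            # stand-in for env.step()
        obs_queue.put((actor_id, obs))      # ship observation to learner
        action = reply_queues[actor_id].get()  # receive action back

threading.Thread(target=inference_server, daemon=True).start()
actors = [threading.Thread(target=actor, args=(i,)) for i in range(4)]
for t in actors:
    t.start()
for t in actors:
    t.join()
print("all actor steps served by central inference")
```
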
    Episodic Curiosity through Reachability
    Nikolay Savinov
    Damien Vincent
    Marc Pollefeys
    Timothy Lillicrap
    Sylvain Gelly
    ICLR (2019)
    Rewards are sparse in the real world, and most of today's reinforcement learning algorithms struggle with such sparsity. One solution to this problem is to allow the agent to create rewards for itself, thus making rewards dense and more suitable for learning. In particular, inspired by curious behaviour in animals, observing something novel could be rewarded with a bonus. This bonus is summed with the real task reward, making it possible for RL algorithms to learn from the combined reward. We propose a new curiosity method which uses episodic memory to form the novelty bonus. To determine the bonus, the current observation is compared with the observations in memory. Crucially, the comparison is based on how many environment steps it takes to reach the current observation from those in memory, which incorporates rich information about environment dynamics. This allows us to overcome the known "couch-potato" issue of prior work, where the agent finds a way to instantly gratify itself by exploiting actions which lead to hardly predictable consequences. We test our approach in visually rich 3D environments in VizDoom, DMLab and MuJoCo. In navigational tasks from VizDoom and DMLab, our agent outperforms the state-of-the-art curiosity method ICM. In MuJoCo, an ant equipped with our curiosity module learns locomotion from first-person-view curiosity alone. The code is available at https://github.com/google-research/episodic-curiosity.
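
The bonus rule, sketched under assumptions: a comparator network (stubbed here as `reachable_prob`) scores whether two observations are within a few environment steps of each other, and observations that look hard to reach from everything in memory earn a positive bonus. The max aggregation and the constants are illustrative; the paper's exact choices differ in detail.

```python
import numpy as np

class EpisodicCuriosity:
    """Sketch of the reachability-based bonus. `reachable_prob` stands
    in for the trained comparator that predicts whether two
    observations are within k environment steps of each other."""

    def __init__(self, reachable_prob, alpha=1.0, beta=0.5):
        self.reachable_prob = reachable_prob
        self.memory = []
        self.alpha, self.beta = alpha, beta

    def bonus(self, obs):
        if not self.memory:
            self.memory.append(obs)
            return 0.0
        # Aggregate similarity to memory (max over stored observations).
        sim = max(self.reachable_prob(m, obs) for m in self.memory)
        b = self.alpha * (self.beta - sim)  # hard to reach => positive bonus
        if sim < self.beta:                 # only store sufficiently novel obs
            self.memory.append(obs)
        return b

# Toy comparator: observations close in L2 count as "reachable".
toy = lambda a, b: float(np.exp(-np.linalg.norm(a - b)))
ec = EpisodicCuriosity(toy)
print(ec.bonus(np.zeros(3)), ec.bonus(np.ones(3) * 3.0))
```
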
    Towards Accurate Generative Models of Video: A New Metric & Challenges
    (2018)
    Recent advances in deep generative models have led to remarkable progress in synthesizing high-quality images. Following their successful application in image processing and representation learning, an important next step is to consider videos. Learning generative models of video is a much harder task, requiring a model to capture the temporal dynamics of a scene in addition to the visual presentation of objects. While recent attempts at formulating generative models of video have had some success, current progress is hampered by (1) the lack of qualitative metrics that consider visual quality, temporal coherence, and diversity of samples, and (2) the wide gap between purely synthetic video data sets and challenging real-world data sets in terms of complexity. To this end, we propose Fréchet Video Distance (FVD), a new metric for generative models of video, and StarCraft 2 Videos (SCV), a benchmark of game play from custom StarCraft 2 scenarios that challenge the current capabilities of generative models of video. We contribute a large-scale human study, which confirms that FVD correlates well with qualitative human judgment of generated videos, and provide initial benchmark results on SCV.
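
FVD instantiates the standard Fréchet-distance recipe: fit Gaussians to embeddings of real and generated videos and compute d^2 = ||mu_1 - mu_2||^2 + Tr(Sigma_1 + Sigma_2 - 2(Sigma_1 Sigma_2)^{1/2}). The sketch below assumes the video embeddings (obtained in the paper from a pretrained video network, I3D) are already computed.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians fitted to two sets of video
    embeddings (n x d arrays). The embeddings are assumed given; in
    FVD they come from a pretrained I3D network."""
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    s1 = np.cov(real_feats, rowvar=False)
    s2 = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(256, 32))
fake = rng.normal(loc=0.5, size=(256, 32))  # shifted distribution
print(f"FVD-style distance: {frechet_distance(real, fake):.3f}")
```
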