Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning
Abstract
Algorithms for imitation learning based on adversarial optimization, such as generative adversarial imitation learning (GAIL) and adversarial inverse reinforcement learning (AIRL), can effectively mimic demonstrated behaviors by employing both a learned reward and reinforcement learning (RL). However, applications of such algorithms are challenged by the inherent instability and poor sample efficiency of on-policy RL. In particular, the inadequate handling of absorbing states in canonical implementations of RL environments causes an implicit bias in the reward functions used by these algorithms. While these biases may work well for some environments, they lead to sub-optimal behaviors in others. Moreover, although these algorithms can learn from only a few demonstrations, they require a prohibitively large number of environment interactions for many real-world applications. To address these issues, we first propose to extend the environment MDP with absorbing states, which leads to task-independent and, more importantly, unbiased rewards. Secondly, we introduce an off-policy learning algorithm, which we refer to as Discriminator-Actor-Critic. We demonstrate the effectiveness of properly handling absorbing states, while empirically improving sample efficiency by an average factor of 10. Our implementation is available online.
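To make the absorbing-state idea concrete, the following is a minimal sketch (not the authors' released code) of how terminal transitions can be represented explicitly rather than dropped, which is what implicitly fixes their reward at zero in canonical environment implementations. It assumes the classic `gym` step API and flat, continuous observations; the wrapper name and details are illustrative only.

```python
# Illustrative sketch: append a 0/1 absorbing-state flag to observations so
# that transitions into (and within) the absorbing state can be fed to the
# discriminator and assigned a learned reward, instead of an implicit zero.
# Assumes classic gym API (obs, reward, done, info) and 1-D Box observations.

import numpy as np
import gym


class AbsorbingStateWrapper(gym.Wrapper):
    """Augments observations with an absorbing-state indicator."""

    def __init__(self, env):
        super().__init__(env)
        low = np.append(env.observation_space.low, 0.0)
        high = np.append(env.observation_space.high, 1.0)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def _augment(self, obs, absorbing):
        # Ordinary states get flag 0.0; the absorbing state gets flag 1.0.
        return np.append(obs, 1.0 if absorbing else 0.0).astype(np.float32)

    def absorbing_state(self):
        # Canonical absorbing state: all-zero features with the flag set.
        return self._augment(np.zeros(self.env.observation_space.shape), absorbing=True)

    def reset(self, **kwargs):
        return self._augment(self.env.reset(**kwargs), absorbing=False)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs, absorbing=False), reward, done, info
```

In a full implementation, the replay buffer would additionally record a transition from the final state into `absorbing_state()` (and a self-transition within it) whenever an episode terminates, so that the discriminator and critic learn a reward for termination rather than inheriting a task-dependent bias.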