Implicitly Regularized RL with Implicit Q-values

Nino Vieillard; Marcin Andrychowicz; Anton Raichuk; Olivier Pietquin; Matthieu Geist

AISTATS (2022)

Abstract

The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms, in which agents act according to a (soft-)greedy policy with respect to $Q$. It is a powerful tool that allows action selection without a model of the environment and even without explicitly modeling the policy. Yet this scheme can only be used in discrete-action tasks with a small number of actions, as the softmax over actions cannot be computed exactly otherwise. In particular, using function approximation to handle continuous action spaces in modern actor-critic architectures intrinsically prevents the exact computation of a softmax. We propose to alleviate this issue by parametrizing the $Q$-function implicitly, as the sum of a log-policy and a value function. We use the resulting parametrization to derive a practical off-policy deep RL algorithm that is suitable for large action spaces and that enforces the softmax relation between the policy and the $Q$-value. We provide a theoretical analysis of our algorithm: from an Approximate Dynamic Programming perspective, we show that it is equivalent to a regularized version of value iteration, accounting for both entropy and Kullback-Leibler regularization, and that it enjoys beneficial error propagation results. We then evaluate our algorithm on classic control tasks, where its results compete with state-of-the-art methods.
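The implicit parametrization can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes a single state with three discrete actions, a temperature `tau` scaling the log-policy (the abstract does not specify any scaling), and hypothetical helper names such as `implicit_q`. Under those assumptions, it checks that building $Q$ as a scaled log-policy plus a value makes the policy exactly the softmax of $Q$, without ever summing over actions to define the policy.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def implicit_q(log_pi, v, tau=1.0):
    """Assumed form for one state: Q(s, a) = tau * log pi(a|s) + V(s)."""
    return [tau * lp + v for lp in log_pi]


def softmax_policy(q, tau=1.0):
    """Policy obtained as softmax(Q / tau) over the actions of one state."""
    scaled = [x / tau for x in q]
    lse = logsumexp(scaled)
    return [math.exp(x - lse) for x in scaled]


def soft_value(q, tau=1.0):
    """Soft value tau * logsumexp(Q / tau) for one state."""
    return tau * logsumexp([x / tau for x in q])


# Illustrative numbers only (not taken from the paper).
pi = [0.5, 0.3, 0.2]
log_pi = [math.log(p) for p in pi]
v = 1.7
tau = 0.5

q = implicit_q(log_pi, v, tau)
recovered_pi = softmax_policy(q, tau)  # matches pi up to rounding
recovered_v = soft_value(q, tau)       # matches v up to rounding
```

Because the policy is normalized, the log-sum-exp of `Q / tau` collapses to `V / tau`, so the softmax relation holds by construction. In a continuous action space the same identity would hold for any normalized density, which is what makes the parametrization attractive when the softmax cannot be computed explicitly.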

Research Areas

Machine intelligence
