A Connection between Actor Regularization and Critic Regularization in Reinforcement Learning
As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting, with most methods regularizing either the actor or the critic. These methods appear distinct. Actor regularization (e.g., behavioral cloning penalties) is simpler and has appealing convergence properties, while critic regularization typically requires significantly more compute because it involves solving a game, but it has appealing lower-bound guarantees. Empirically, prior work has alternately reported better results with actor regularization and with critic regularization. In this paper, we show that these two regularization techniques can be equivalent under some assumptions: regularizing the critic using a CQL-like objective is equivalent to updating the actor with a BC-like regularizer and with a SARSA Q-value (i.e., "1-step RL"). Our experiments show that this theoretical model makes accurate, testable predictions about the performance of CQL and one-step RL. While our results do not definitively say whether users should prefer actor regularization or critic regularization, they hint that actor regularization methods may be a simpler way to achieve the desirable properties of critic regularization. The results also suggest that the empirically-demonstrated benefits of both types of regularization may be more a function of implementation details than of objective superiority.
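To make the two sides of the claimed equivalence concrete, the sketch below contrasts the two loss terms in a tabular setting: a CQL-style critic penalty (push down Q-values on all actions via a log-sum-exp, push up the Q-value of the dataset action) and a BC-regularized actor loss that maximizes a SARSA-style Q-value while staying close to the behavior action. This is an illustrative simplification, not the paper's exact derivation; the function names, the discrete-action setup, and the coefficients `alpha` and `lam` are assumptions made for the example.

```python
import numpy as np

def cql_critic_penalty(q_values, data_action, alpha=1.0):
    # CQL-style regularizer (illustrative): logsumexp over all actions
    # pushes Q-values down everywhere, while subtracting the Q-value of
    # the action actually seen in the dataset pushes that one up.
    logsumexp = np.log(np.sum(np.exp(q_values)))
    return alpha * (logsumexp - q_values[data_action])

def bc_regularized_actor_loss(q_sarsa, policy_probs, data_action, lam=1.0):
    # BC-regularized actor update with a SARSA Q-value ("1-step RL",
    # illustrative): maximize the expected Q-value under the policy,
    # plus a behavioral-cloning penalty toward the dataset action.
    expected_q = np.dot(policy_probs, q_sarsa)
    bc_penalty = -np.log(policy_probs[data_action])
    return -expected_q + lam * bc_penalty
```

In this toy form, the critic penalty is smallest when the dataset action already has the highest Q-value, and the actor loss is smallest when the policy both exploits the SARSA Q-values and places mass on the dataset action, which is the intuition behind the equivalence the paper formalizes.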