Google Research

Lazy-MDPs: Towards Interpretable RL by Learning When to Act


Traditionally, Reinforcement Learning (RL) aims to decide how an artificial agent should act optimally. We argue that deciding when to act is equally important. As humans, we drift from default, instinctive, or memorized behaviors to focused, thought-out behaviors when required by the situation. To give RL agents this aptitude, we propose to augment the standard Markov Decision Process with a new mode of action: the lazy mode, which defers decision-making to a given default policy. In addition, we penalize non-lazy actions in order to enforce minimal effort and have agents focus on critical decisions only. We name the resulting formalism lazy-MDPs. We study the theoretical properties of lazy-MDPs, expressing value functions and characterizing greediness and optimal solutions. We then empirically demonstrate that policies learned in lazy-MDPs are generally more interpretable and highlight the states where it is important for the agent to act. When the default policy is uniformly random, we observe that agents are still able to match or even surpass classic DQN agents on some Atari games while taking control in only a small subset of states.
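The mechanism described above can be sketched as an environment wrapper: one extra action index serves as the lazy action that defers to the default policy, and every non-lazy action incurs a penalty. The following is a minimal, hypothetical illustration (the class and environment names are ours, not from the paper), shown on a toy chain environment:

```python
class ChainEnv:
    """Toy 5-state chain: action 0 moves left, action 1 moves right.
    Reward 1.0 for reaching the rightmost state, which ends the episode."""
    n_actions = 2

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos - 1) if action == 0 else self.pos + 1
        done = self.pos == 4
        return self.pos, (1.0 if done else 0.0), done


class LazyMDPWrapper:
    """Illustrative lazy-MDP sketch: augments a base environment with one
    extra 'lazy' action that defers to a default policy, and subtracts a
    penalty from the reward whenever the agent acts on its own."""

    def __init__(self, env, default_policy, penalty=0.1):
        self.env = env
        self.default_policy = default_policy  # maps state -> base action
        self.penalty = penalty                # cost of a non-lazy action
        self.n_actions = env.n_actions + 1    # last index = lazy action
        self.lazy_action = env.n_actions
        self.state = None

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, action):
        if action == self.lazy_action:
            # Defer decision-making to the default policy: no penalty.
            base_action, cost = self.default_policy(self.state), 0.0
        else:
            # The agent takes control: pay the laziness penalty.
            base_action, cost = action, self.penalty
        next_state, reward, done = self.env.step(base_action)
        self.state = next_state
        return next_state, reward - cost, done


# Usage: a default policy that always moves right.
env = LazyMDPWrapper(ChainEnv(), default_policy=lambda s: 1, penalty=0.1)
s = env.reset()
s, r_lazy, _ = env.step(env.lazy_action)  # defer: no penalty
s, r_act, _ = env.step(1)                 # act yourself: penalized
```

An agent trained in this wrapped environment is incentivized to take the lazy action wherever the default policy suffices, so the states in which it still acts highlight where its own decisions matter.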
