Google Research

RL-LIM: Reinforcement Learning-based Locally-Interpretable Models

Abstract

Understanding black-box machine learning models can be crucial to their widespread adoption. In this paper, we propose a novel framework for interpretability, Reinforcement Learning-based Locally-Interpretable Models (RL-LIM). RL-LIM employs reinforcement learning to select a small number of samples and distill them into a low-capacity locally-interpretable model. Training is guided by a reward obtained from the agreement between the predictions of the locally-interpretable model and those of the black-box model. RL-LIM significantly outperforms the state of the art in overall prediction performance and fidelity, consistently across various cases. While almost matching the performance of the black-box model, RL-LIM yields human-like interpretability, along with the most valuable training samples that enable it. Such capabilities are expected to benefit many artificial intelligence deployments: understanding instance-wise dynamics, building trust by explaining the constituent components behind decisions, and enabling actionable insights such as how to change outcomes.
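To make the loop described above concrete, the following is a minimal sketch, not the paper's implementation: a small logistic policy scores each training sample conditioned on a probe instance, a sampled subset is distilled into a ridge regression (standing in for the low-capacity locally-interpretable model), and a REINFORCE-style update uses the agreement with the black-box prediction as the reward. The black box here is a toy nonlinear function, and names such as selection_probs are illustrative assumptions, not from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy data and a stand-in "black box" (any fitted model with predictions would do).
X_train = rng.normal(size=(200, 5))
X_probe = rng.normal(size=(50, 5))
black_box = lambda X: np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2
y_train_bb = black_box(X_train)          # black-box predictions on the training set

# Policy parameters: score each training sample paired with the probe instance.
W = rng.normal(scale=0.1, size=(10,))
lr, n_steps = 0.05, 200

def selection_probs(x_probe):
    """Per-training-sample selection probabilities for one probe instance."""
    pair = np.hstack([X_train, np.tile(x_probe, (len(X_train), 1))])
    return 1.0 / (1.0 + np.exp(-pair @ W))

for step in range(n_steps):
    x_p = X_probe[rng.integers(len(X_probe))]
    probs = selection_probs(x_p)
    mask = rng.random(len(probs)) < probs          # sampled subset = the "action"
    if mask.sum() < 2:
        continue
    # Distill the black box into a low-capacity local model on the selected subset.
    local = Ridge(alpha=1.0).fit(X_train[mask], y_train_bb[mask])
    # Reward: agreement between the local and black-box predictions on the probe.
    reward = -abs(local.predict(x_p[None])[0] - black_box(x_p[None])[0])
    # REINFORCE-style update: raise the probability of subsets that earn high reward.
    pair = np.hstack([X_train, np.tile(x_p, (len(X_train), 1))])
    grad_logp = pair.T @ (mask.astype(float) - probs)
    W += lr * reward * grad_logp
```

In this sketch the selection probabilities themselves serve as the instance-wise explanation of which training samples matter for a given prediction; the paper's method additionally uses a neural policy and a reward baseline, which are omitted here for brevity.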
