A reduction from apprenticeship learning to classification
Abstract
We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal
is to perform well relative to an observed expert. We study a common approach
to learning from expert demonstrations: using a classification algorithm to learn
to imitate the expert’s behavior. Although this straightforward learning strategy
is widely used in practice, it has been subject to very little formal analysis. We
prove that, if the learned classifier has error rate \epsilon, the difference between the
value of the apprentice's policy and the value of the expert's policy is O(\sqrt{\epsilon}). Further, we
prove that this difference is only O(\epsilon) when the expert’s policy is close to optimal.
This latter result has an important practical consequence: Not only does imitating
a near-optimal expert result in a better policy, but far fewer demonstrations are
required to successfully imitate such an expert. This suggests an opportunity for
substantial savings whenever the expert is known to be good, but demonstrations
are expensive or difficult to obtain.
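
In symbols, writing V(\pi) for the value of a policy \pi, \pi_E for the expert's policy, and \pi_A for the apprentice's policy learned by a classifier with error rate \epsilon, the two guarantees summarized above take the following schematic form. This notation is introduced here only for illustration; the hidden constants (which in general depend on quantities such as the task horizon) are not specified by the abstract.

    V(\pi_E) - V(\pi_A) \le O(\sqrt{\epsilon}) \quad \text{in general},
    V(\pi_E) - V(\pi_A) \le O(\epsilon) \quad \text{when } \pi_E \text{ is close to optimal}.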
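
To make the learning strategy under analysis concrete, the following Python sketch implements the generic reduction described above: expert demonstrations are flattened into (state, action) pairs, an off-the-shelf classifier is fit to predict the expert's action in each state, and the learned classifier is then executed as a policy. The environment interface (reset/step), the choice of LogisticRegression, and all names here are assumptions of this sketch, not the paper's experimental setup.

    # Illustrative sketch of imitating an expert via classification
    # (behavioral cloning). States are assumed to be fixed-length
    # feature vectors and actions to come from a finite set.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_apprentice(demonstrations):
        """Fit a classifier to expert (state, action) pairs.

        demonstrations: list of trajectories, each a list of
        (state, action) pairs produced by the expert.
        """
        states = np.array([s for traj in demonstrations for s, _ in traj])
        actions = np.array([a for traj in demonstrations for _, a in traj])
        clf = LogisticRegression(max_iter=1000)
        clf.fit(states, actions)  # multiclass classification of expert actions
        return clf

    def classification_error(clf, demonstrations):
        """Empirical error rate \epsilon of the classifier on expert data."""
        states = np.array([s for traj in demonstrations for s, _ in traj])
        actions = np.array([a for traj in demonstrations for _, a in traj])
        return 1.0 - clf.score(states, actions)

    def run_apprentice(env, clf, horizon):
        """Execute the learned classifier as a policy for one episode.

        env is assumed to expose reset() -> state and
        step(action) -> (state, reward, done).
        """
        state, total_reward = env.reset(), 0.0
        for _ in range(horizon):
            action = clf.predict(state.reshape(1, -1))[0]
            state, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        return total_reward

The bounds summarized above then relate the error rate computed by classification_error to the gap in value between the policy executed by run_apprentice and the expert's policy.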