Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

Orin Levy

Yishay Mansour

AAAI (2023)

Abstract

We present regret minimization algorithms for stochastic contextual MDPs under a minimum reachability assumption, using access to an offline least-squares regression oracle.
We analyze three settings: known dynamics, unknown but context-independent dynamics, and, most challenging, unknown context-dependent dynamics. For the latter, our algorithm obtains a regret bound of
$\tilde{O}\left(\frac{1}{p_{\min}} H |S|^{3/2} \sqrt{|A| T \log\left(\max\{|\mathcal{F}|, |\mathcal{F}^P|\}/\delta\right)}\right)$
with probability $1-\delta$, where $\mathcal{F}^P$ and $\mathcal{F}$ are finite, realizable regressor classes used to approximate the dynamics and the rewards, respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes.
To our knowledge, ours is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge of the function class, such as its being linear).
In addition, we present a lower bound of $\Omega\left(\sqrt{T H |S| |A| \ln|\mathcal{F}| / \ln(|S||A|)}\right)$, which holds even in the case of known dynamics.
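The abstract assumes access to an offline least-squares regression oracle over a finite function class. As a minimal sketch of what such an oracle computes, assuming the class is given as a list of Python callables and the data as (input, target) pairs, the hypothetical `least_squares_oracle` below performs plain empirical risk minimization under squared loss. It is a generic illustration, not the authors' implementation; in the contextual setting the inputs might be, for example, (context, state, action) tuples, which is an assumption on our part.

```python
from typing import Callable, Sequence, Tuple, TypeVar

X = TypeVar("X")  # input type, e.g. a (context, state, action) tuple (assumption)


def least_squares_oracle(
    function_class: Sequence[Callable[[X], float]],
    data: Sequence[Tuple[X, float]],
) -> Callable[[X], float]:
    """Hypothetical offline oracle: return the member of a finite function
    class with the smallest empirical squared loss on a fixed dataset."""
    if not function_class:
        raise ValueError("function_class must be non-empty")

    def empirical_loss(f: Callable[[X], float]) -> float:
        # Sum of squared residuals over the whole (offline) dataset.
        return sum((f(x) - y) ** 2 for x, y in data)

    # Exhaustive search is possible because the class is finite.
    return min(function_class, key=empirical_loss)


# Usage: a toy finite class of scalar predictors f_a(x) = a * x.
candidates = [lambda x, a=a: a * x for a in (0.0, 0.5, 1.0, 2.0)]
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
best = least_squares_oracle(candidates, data)
print(best(4.0))  # 4.0, since the slope a = 1.0 fits the data best
```

Exhaustive search is only practical for small classes; the point of the oracle abstraction is that the learner treats this minimization as a black box, however it is actually computed.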
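To see how the two stated rates scale, the hypothetical helpers below evaluate only the expressions displayed inside $\tilde{O}(\cdot)$ and $\Omega(\cdot)$. The constants and polylogarithmic factors those notations hide are dropped, so the outputs are not actual regret values. Here `S` and `A` stand for the cardinalities $|S|$ and $|A|$, and `F_size` and `P_size` for the sizes of the reward and dynamics classes; the parameter values are arbitrary illustrations.

```python
import math


def upper_bound_expression(p_min, H, S, A, T, F_size, P_size, delta):
    """Expression inside the O~(.) upper bound (constants and polylogs dropped)."""
    log_term = math.log(max(F_size, P_size) / delta)
    return (1.0 / p_min) * H * S ** 1.5 * math.sqrt(A * T * log_term)


def lower_bound_expression(T, H, S, A, F_size):
    """Expression inside the Omega(.) lower bound (constants dropped).
    Requires S * A >= 2 so that ln(|S||A|) > 0."""
    return math.sqrt(T * H * S * A * math.log(F_size) / math.log(S * A))


# Quadrupling the number of episodes T doubles both expressions (sqrt(T) rate).
u_T = upper_bound_expression(0.5, 10, 20, 5, 10_000, 1000, 2000, 0.05)
u_4T = upper_bound_expression(0.5, 10, 20, 5, 40_000, 1000, 2000, 0.05)
l_T = lower_bound_expression(10_000, 10, 20, 5, 1000)
l_4T = lower_bound_expression(40_000, 10, 20, 5, 1000)
print(round(u_4T / u_T, 6), round(l_4T / l_T, 6))  # 2.0 2.0
```

Dividing the two expressions, $T$ and $|A|$ cancel, so the displayed upper and lower rates differ by a factor of order $\sqrt{H}\,|S|/p_{\min}$ up to logarithmic factors.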

Research Areas

Machine intelligence
