arXiv:2006.06294

Adaptive Reward-Free Exploration

Published on Jun 11, 2020

AI-generated summary

An adaptive approach for reward-free exploration in reinforcement learning reduces MDP estimation error and improves sample-complexity bounds compared to previous methods.

Abstract

Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm (RF-UCRL) can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs on the order of (SAH^4/ε^2)(log(1/δ) + S) episodes to output, with probability 1-δ, an ε-approximation of the optimal policy for any reward function. This bound improves over existing sample-complexity bounds in both the small-ε and the small-δ regimes. We further investigate the relative complexities of reward-free exploration and best-policy identification.
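
For intuition, the sketch below illustrates the kind of adaptive, reward-free exploration loop the abstract describes: maintain visit counts and empirical transitions, propagate upper bounds E_h(s, a) on the estimation error backward through the horizon, act greedily with respect to those bounds, and stop once the bound at the initial state is small. This is a minimal illustration under stated assumptions, not the paper's algorithm verbatim: the `env` interface, the bonus constants, and the ε/2 stopping threshold are choices made for the example.

```python
import numpy as np

def rf_ucrl_sketch(env, S, A, H, epsilon, delta, max_episodes=10_000):
    """Hedged sketch of a reward-free, UCRL-style exploration loop on a
    tabular, finite-horizon MDP. `env` is assumed to expose reset() -> s
    and step(a) -> (s_next, reward, done); rewards are ignored."""
    counts = np.zeros((H, S, A))       # visit counts n_h(s, a)
    trans = np.zeros((H, S, A, S))     # transition counts n_h(s, a, s')
    for _ in range(max_episodes):
        # Upper bounds E_h(s, a) on the estimation error, computed by
        # backward induction; the bonus shrinks roughly like sqrt(1/n).
        E = np.zeros((H + 1, S, A))
        for h in range(H - 1, -1, -1):
            n = np.maximum(counts[h], 1)
            p_hat = trans[h] / n[..., None]                  # empirical P_h
            bonus = H * np.sqrt((np.log(2 * S * A * H / delta) + S) / n)
            next_err = E[h + 1].max(axis=1)                  # max_a E_{h+1}(s', a)
            E[h] = np.minimum(H, bonus + p_hat @ next_err)
        s = env.reset()
        # Stop once the error bound at the initial state is small enough
        # (epsilon / 2 is an illustrative threshold).
        if E[0, s].max() <= epsilon / 2:
            break
        # Otherwise play one episode greedily w.r.t. the error upper bounds.
        for h in range(H):
            a = int(np.argmax(E[h, s]))
            s_next, _, _ = env.step(a)   # reward-free: reward is discarded
            counts[h, s, a] += 1
            trans[h, s, a, s_next] += 1
            s = s_next
    return counts, trans
```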
