Single Trajectory Learning: Exploration VS. Exploitation
Published: 22 December 2016 by MDPI in MOL2NET'16, Conference on Molecular, Biomed., Comput. & Network Science and Engineering, 2nd ed.
Congress USEDAT-02: USA-Europe Data Analysis Training Program Workshop, Cambridge, UK - Bilbao, Spain - Miami, USA, 2016
Abstract:
In reinforcement learning, the exploration/exploitation (E/E) dilemma is a central issue: the agent must trade off exploring the environment to discover more profitable actions against exploiting the actions that appear best given its current experience. We focus on the single-trajectory reinforcement learning problem, in which an agent interacts with a partially unknown MDP over a single trajectory, and we address the exploration/exploitation trade-off in this setting. Given the reward function, we aim to find a good E/E strategy for MDPs drawn from some MDP distribution. This is achieved by selecting, from a large set of candidate strategies, the strategy with the best mean performance over the MDP distribution, where the selection is based on single trajectories drawn from many sampled MDPs. The main contributions of this paper are: 1) we discuss strategy-selection algorithms based on a formula set and on polynomial functions; 2) we provide a theoretical and experimental regret analysis of the learned strategy under a given MDP distribution; and 3) we compare these methods experimentally with a state-of-the-art Bayesian RL method.
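The regret criterion mentioned in contribution 2) is not spelled out in the abstract; a standard way to measure the regret of a learned E/E strategy under an MDP distribution, and presumably the quantity intended here, is the expected gap to the optimal value over MDPs drawn from the prior (notation assumed for this note: $p(M)$ is the MDP distribution, $s_0$ the start state, $V^{*}_{M}$ and $V^{\pi}_{M}$ the optimal and achieved expected returns in MDP $M$):

\[ \mathcal{R}(\pi) \;=\; \mathbb{E}_{M \sim p(M)}\!\left[\, V^{*}_{M}(s_0) \;-\; V^{\pi}_{M}(s_0) \,\right] \]

Under this reading, a good strategy selector picks the candidate $\pi$ with the smallest mean regret, equivalently the largest mean return, over $p(M)$.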
Keywords: single trajectory, MDP distribution, E/E dilemma, Bayesian reinforcement learning
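As an illustration of the selection procedure described in the abstract, the sketch below scores each candidate E/E strategy by its mean single-trajectory return over MDPs sampled from a toy prior and keeps the best one. This is a minimal sketch, not the authors' implementation: the prior (sample_mdp), the candidate set (epsilon-greedy Q-learning agents standing in for the paper's formula-based and polynomial strategies), and all function names are assumptions made for this example.

import numpy as np

def sample_mdp(rng, n_states=5, n_actions=3):
    # Toy prior over MDPs: Dirichlet transition kernels and uniform expected rewards.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over next states
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # expected reward for taking a in s
    return P, R

class EpsGreedyQ:
    # Stand-in candidate E/E strategy: tabular Q-learning with a fixed exploration rate.
    def __init__(self, eps, n_states, n_actions, alpha=0.1, gamma=0.95):
        self.eps, self.alpha, self.gamma = eps, alpha, gamma
        self.Q = np.zeros((n_states, n_actions))

    def act(self, s, rng):
        if rng.random() < self.eps:
            return int(rng.integers(self.Q.shape[1]))   # explore: random action
        return int(np.argmax(self.Q[s]))                # exploit: best empirical action

    def update(self, s, a, r, s_next):
        target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])

def single_trajectory_return(strategy, P, R, horizon, rng, gamma=0.95):
    # Discounted return collected by the strategy over ONE trajectory in the MDP (P, R).
    s, ret = 0, 0.0
    for t in range(horizon):
        a = strategy.act(s, rng)
        r = R[s, a]
        s_next = int(rng.choice(P.shape[0], p=P[s, a]))
        strategy.update(s, a, r, s_next)
        ret += (gamma ** t) * r
        s = s_next
    return ret

def select_strategy(candidate_eps, n_mdps=200, horizon=100, seed=0):
    # Score each candidate by its mean single-trajectory return over MDPs drawn from the prior,
    # then return the candidate that is best "in mean" over the MDP distribution.
    rng = np.random.default_rng(seed)
    mean_returns = []
    for eps in candidate_eps:
        returns = []
        for _ in range(n_mdps):
            P, R = sample_mdp(rng)
            agent = EpsGreedyQ(eps, P.shape[0], P.shape[1])
            returns.append(single_trajectory_return(agent, P, R, horizon, rng))
        mean_returns.append(float(np.mean(returns)))
    best = int(np.argmax(mean_returns))
    return candidate_eps[best], mean_returns

if __name__ == "__main__":
    best_eps, scores = select_strategy(candidate_eps=[0.0, 0.05, 0.1, 0.3, 0.5])
    print("mean single-trajectory returns:", [round(s, 3) for s in scores])
    print("selected exploration rate:", best_eps)

In the paper's setting the candidate set is much richer (small formulas over the agent's history and polynomial parametrizations rather than a handful of exploration rates), but the selection principle, best mean performance over single trajectories sampled from the MDP distribution, is the one the abstract describes.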