The multi-armed bandit (MAB) problem is a classic formulation of the exploration versus exploitation dilemma in reinforcement learning. As an archetypal MAB problem, the stochastic multi-armed bandit (SMAB) problem is the basis of many newer MAB problems. To address the weak theoretical analysis of existing SMAB methods and their use of only a single kind of information, this paper presents "the Chosen Number of Arm with Minimal Value" (CNAMV), a method for balancing exploration and exploitation adaptively. Theoretically, an upper bound on CNAMV’s regret is proved, where regret is the loss incurred because the globally optimal policy is not followed at all times. Experimental results show that CNAMV yields greater reward and smaller regret, with high efficiency, than commonly used methods such as ε-greedy, softmax, and UCB1. CNAMV can therefore be an effective SMAB method.
Adaptive Exploration in Stochastic Multi-armed Bandit Problem
Published: 27 December 2016 by MDPI in MOL2NET'16, Conference on Molecular, Biomed., Comput. & Network Science and Engineering, 2nd ed.
Congress: USEDAT-02: USA-Europe Data Analysis Training Program Workshop, Cambridge, UK-Bilbao, Spain-Miami, USA, 2016
Keywords: reinforcement learning; stochastic multi-armed bandit problem; exploration; exploitation; adaptation
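
To make the notion of regret in the abstract concrete, the following is a minimal sketch of two of the baseline methods it names, UCB1 and ε-greedy, run on a Bernoulli bandit. It does not implement CNAMV, whose selection rule is not given in the abstract. The baselines are written in their common textbook forms, and the function names, arm means, horizon, ε value, and seed are all illustrative assumptions rather than the paper's experimental setup. Regret is accumulated as the gap between the best arm's true mean and the chosen arm's true mean at each step.

```python
import math
import random


def ucb1_select(counts, sums, t, rng):
    """UCB1 (textbook form): play each arm once, then maximize mean + bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a]
        + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def epsilon_greedy_select(counts, sums, t, rng, epsilon=0.1):
    """Epsilon-greedy: explore uniformly with probability epsilon, else exploit.

    Also explores while any arm is still unplayed, so the empirical
    means used in the greedy step are always defined.
    """
    if 0 in counts or rng.random() < epsilon:
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda a: sums[a] / counts[a])


def run(policy, means, horizon, seed=0):
    """Simulate a Bernoulli bandit; return (cumulative regret, pull counts).

    `means` holds the true success probability of each arm (assumed
    values, known only to the simulator, not to the policy).
    """
    rng = random.Random(seed)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = policy(counts, sums, t, rng)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Loss from not playing the optimal arm at this step.
        regret += best - means[arm]
    return regret, counts


if __name__ == "__main__":
    arm_means = [0.2, 0.5, 0.8]  # hypothetical arms
    for name, policy in [("UCB1", ucb1_select),
                         ("epsilon-greedy", epsilon_greedy_select)]:
        regret, counts = run(policy, arm_means, horizon=2000)
        print(name, "regret:", round(regret, 2), "pulls:", counts)
```

Under this definition a policy that always plays the best arm has zero regret, and any policy's regret is at most the horizon times the largest gap between arm means, so an adaptive method would be judged by how slowly this quantity grows with the horizon.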