Adaptive Exploration in Stochastic Multi-armed Bandit Problem
1  School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu, 215006
2  State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023
3  Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, 210000
4  Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012


The multi-armed bandit (MAB) problem is a classic formulation of the exploration versus exploitation dilemma in reinforcement learning. As an archetypal MAB problem, the stochastic multi-armed bandit (SMAB) problem is the basis of many newer MAB variants. To address the weak theoretical analysis and the reliance on a single source of information in existing SMAB methods, this paper presents "the Chosen Number of Arm with Minimal Value" (CNAMV), a method that balances exploration and exploitation adaptively. Theoretically, we prove an upper bound on CNAMV's regret, i.e., the loss incurred by not always following the globally optimal policy. Experimental results show that CNAMV yields greater reward and smaller regret with higher efficiency than commonly used methods such as ε-greedy, softmax, and UCB1. Therefore, CNAMV is an effective SMAB method.

Keywords: reinforcement learning; stochastic multi-armed bandit problem; exploration; exploitation; adaptation