Reactive Bandits with Attitude
Pedro A. Ortega, Kee-Eung Kim and Daniel D. Lee
GRASP Laboratory, School of Engineering and Applied Sciences, University of Pennsylvania, Philadelphia, U.S.A.
Reactive Bandits
Bandit Algorithms
Learning
Conclusions
1. The possible locations depend on the attitude parameter.
2. The bandit can react to the player's strategy by choosing a location.
The optimum switches between arms. In the limit, the arm with the highest variance is chosen.
UCB1's empirical frequencies do not
converge to the optimum.
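
For readers unfamiliar with the term, the "empirical frequencies" of a bandit algorithm are the fractions of rounds in which each arm was played. The following is a minimal sketch of the standard UCB1 index rule, not the poster's experiment: it runs on an ordinary stationary (non-reactive) Bernoulli bandit, and the function names, the arm means and the horizon are all hypothetical choices made for illustration.

```python
import math
import random


def make_bernoulli_arms(means, seed=0):
    """Return a pull(arm) function for a stationary Bernoulli bandit.

    The means are hypothetical; this bandit does NOT react to the player.
    """
    rng = random.Random(seed)

    def pull(arm):
        return 1.0 if rng.random() < means[arm] else 0.0

    return pull


def ucb1_frequencies(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds and return the empirical play frequencies."""
    counts = [0] * n_arms   # number of times each arm was played
    sums = [0.0] * n_arms   # total reward collected from each arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialisation: play every arm once.
            arm = t - 1
        else:
            # UCB1 index: empirical mean plus an exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward

    return [c / horizon for c in counts]


if __name__ == "__main__":
    pull = make_bernoulli_arms([0.2, 0.8])
    print(ucb1_frequencies(pull, n_arms=2, horizon=5000))
```

On a stationary bandit such as this one, UCB1 concentrates its plays on the arm with the highest mean, so its empirical frequencies approach a pure strategy. This may help explain the observation above, since the optimal strategy in the reactive setting is described as a mixed strategy.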
True parameter is learned very quickly.
The mixed strategy converges to the optimal
strategy.
Motivation
Optimal Strategy