
Page 1

Monte-Carlo Go Reinforcement Learning Experiments

Paper by Bruno Bouzy, Guillaume Chaslot, IEEE 2006

Presented by Markus Enzenberger.
Go Seminar, University of Alberta.

November 16, 2006


Page 2

Background
  Indigo project

Monte-Carlo Go
  Basic Monte-Carlo Go
  Monte-Carlo Go with specific knowledge

Reinforcement Learning and Monte-Carlo Go
  The randomness dimension

Experiments
  Experiment 1: One program vs. population
  Experiment 2: relative difference at MC level

Conclusion


Page 3


Indigo project


Page 4


Basic Monte-Carlo Go

- General MC model (Abramson)
- All-moves-as-first heuristic (Brügmann)
- Standard deviation of random-game scores on 9×9: ≈ 35
- Precision: 1 point
- 1000 games for 68% confidence
- 4000 games for 95% confidence (see the sketch below)
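
These figures follow from the standard error of the mean. A minimal worked sketch (Python, not from the paper), assuming i.i.d. random-game scores with standard deviation ≈ 35; the exact counts come out slightly above the slide's rounded 1000 and 4000:

```python
import math

def games_needed(sigma=35.0, precision=1.0, z=1.0):
    """Games needed so the mean score is within `precision` points of the true
    value at roughly z standard deviations of confidence (z=1 ~ 68%, z=2 ~ 95%).
    Requires the standard error sigma / sqrt(n) to be at most precision / z."""
    return math.ceil((z * sigma / precision) ** 2)

print(games_needed(z=1.0))  # 1225 games for ~68% confidence at 1-point precision
print(games_needed(z=2.0))  # 4900 games for ~95% confidence at 1-point precision
```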


Page 5


Progressive pruning
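
The slide only names progressive pruning. As a rough illustration (not taken from this paper), here is a minimal sketch of the usual idea in Bouzy-style Monte-Carlo Go: moves whose statistical upper bound falls below the best move's lower bound are pruned from further simulation. The bound width k and the (mean, std, games) bookkeeping are assumptions of this sketch.

```python
import math

def progressive_prune(stats, k=2.0):
    """Drop moves that are very unlikely to catch up with the current best move.

    stats maps move -> (mean_score, std_dev, n_games).  A move is kept only if
    its upper bound mean + k * std_err stays above the best move's lower bound
    mean - k * std_err; k is an assumed bound width, not the paper's value."""
    def lower(mean, std, n):
        return mean - k * std / math.sqrt(max(n, 1))

    def upper(mean, std, n):
        return mean + k * std / math.sqrt(max(n, 1))

    best_lower = max(lower(*v) for v in stats.values())
    return {move: v for move, v in stats.items() if upper(*v) >= best_lower}
```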


Page 6


Advantages of MC

- Bouzy, Helmstetter: MC programs on par with Indigo2002
- MC takes advantage of faster computers
- Heavily knowledge-based programs less robust
- Global view (avoids errors due to decomposition)
- Easy to develop


Page 7


Indigo2003

Two ways of associating Go knowledge with MC:
- Pre-select moves for MC
- Insert knowledge in the random games (“pseudo-random” := non-uniform probability)


Page 8


Two Modules of Indigo2003


Page 9


Pseudo-Random (PR) player

- Capture-escape urgency: fill the last liberty of a one-liberty string; urgency is linear in string size
- Cut-connect urgency: 3×3 patterns (smaller: insufficient; larger: lower urgency)
- Total urgency is the sum of both urgencies
- Probability of a move is proportional to its total urgency (see the sketch below)
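
A minimal sketch of urgency-proportional move sampling as just described; capture_escape_urgency and cut_connect_urgency are hypothetical callables standing in for Indigo's actual urgency rules, and the board/move types are assumptions of this sketch.

```python
import random

def pseudo_random_move(board, legal_moves,
                       capture_escape_urgency, cut_connect_urgency):
    """Pick one move with probability proportional to its total urgency."""
    # Total urgency of a move = capture-escape urgency + cut-connect urgency.
    urgencies = [capture_escape_urgency(board, m) + cut_connect_urgency(board, m)
                 for m in legal_moves]
    total = sum(urgencies)
    if total <= 0:
        return random.choice(legal_moves)  # no urgent move: fall back to uniform
    r = random.uniform(0.0, total)
    acc = 0.0
    for move, u in zip(legal_moves, urgencies):
        acc += u
        if r <= acc:
            return move
    return legal_moves[-1]  # guard against floating-point round-off
```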


Page 10


MC(Manual) vs. MC(Zero)

- Manual: urgencies assigned to 3×3 patterns by a human expert
- They correspond to Go concepts such as cut and connect
- Non-zero and high values occur only very sparsely


Page 11

Reinforcement Learning and Monte-Carlo Go

- Q-learning for learning action values
- p1 better than p2 (at the random level) does not necessarily mean that MC(p1) is better than MC(p2) (at the MC level)
- Randomization vs. determinism


Page 12


The randomness dimension


Page 13


Experiment 1: One program vs. population

- Patterns have action values Q ∈ ]−1, +1[
- Urgency: U = ((1 + Q) / (1 − Q))^k
- Value update after playing a block of b games: Q_play = Q_play + λ_b Q_learn (see the sketch below)
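
A minimal sketch of these two formulas; the exponent k, the block learning rate lambda_b, and the split into playing values Q_play and freshly learned values Q_learn come from the slide, while the dictionary layout is an assumption of this sketch.

```python
def urgency(q, k):
    """Urgency of a pattern with action value q in ]-1, +1[: U = ((1 + q) / (1 - q)) ** k."""
    return ((1.0 + q) / (1.0 - q)) ** k

def block_update(q_play, q_learn, lambda_b):
    """After a block of games, fold the learned values into the playing values:
    Q_play <- Q_play + lambda_b * Q_learn, applied pattern by pattern."""
    return {pattern: q_play[pattern] + lambda_b * q_learn[pattern]
            for pattern in q_play}
```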


Page 14


Experiment 1a: one unique learning program

- RLPR ≫ Zero
- RLPR < Manual
- MC(RLPR) ≪ MC(Manual)
- Highly dependent on initial conditions
- Q-values are learnt in the first block; later, only these Q-values are increased


Page 15


Experiment 1b: a population of learning programs

- Evolutionary computing
- Test against Zero and Manual
- Select and duplicate the best programs according to some scheme (see the sketch below)

Starting population = Zero
- RLPR = Zero + 180
- RLPR = Manual - 50
- MC(RLPR) ≪ MC(Manual)

Starting population = Manual
- RLPR = Zero + 180
- RLPR = Manual + 50
- MC(RLPR) = MC(Manual) - 20
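
A rough sketch of such a population loop, assuming truncation selection with duplication as the unspecified scheme; evaluate is a hypothetical callable standing in for the test games against the Zero and Manual players.

```python
import copy
import random

def evolve(population, evaluate, generations, keep_fraction=0.5):
    """Generic evolutionary loop: score each learning program, keep the best
    fraction, and refill the population by duplicating the survivors."""
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        survivors = ranked[:max(1, int(len(ranked) * keep_fraction))]
        population = [copy.deepcopy(random.choice(survivors))
                      for _ in range(len(population))]
    return population
```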


Page 16

Experiment 1: One program vs. population

- RLPR programs have a tendency toward determinism
- Using EC lessens the problem but does not remove it completely


Page 17


Experiment 2: relative difference at MC level

- MC(RLPR) plays against itself again
- Update rule to avoid determinism; for a pair of patterns a, b (see the sketch below):

  exp(C (V_a − V_b)) = u_a / u_b   (target urgency ratio)
  δ = Q_a − Q_b − C (V_a − V_b)
  Q_a = Q_a − α δ
  Q_b = Q_b + α δ
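
A minimal sketch of this update; it assumes urgencies are taken as u = exp(Q), so that keeping Q_a − Q_b close to C·(V_a − V_b) keeps the urgency ratio close to the target exp(C·(V_a − V_b)). C, alpha, and the list of pattern pairs are parameters of the experiment and are treated as given.

```python
import math

def relative_update(q, v, pairs, C, alpha):
    """One pass of the relative-difference rule: for each pattern pair (a, b),
    move the difference Q_a - Q_b toward C * (V_a - V_b) rather than pushing
    individual values toward extremes, which avoids a deterministic policy."""
    for a, b in pairs:
        delta = (q[a] - q[b]) - C * (v[a] - v[b])
        q[a] -= alpha * delta
        q[b] += alpha * delta
    return q

def urgency_ratio(q, a, b):
    """With urgencies u = exp(Q), the ratio depends only on the value difference."""
    return math.exp(q[a] - q[b])
```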


Page 18

Experiment 2: relative difference at MC level

9×9

- MC(RLPR) = MC(Manual) + 3

19×19 (learning on 9×9)

- MC(RLPR) = MC(Manual) - 30


Page 19

Conclusion

- Determinism was identified as an obstacle
- Evolutionary computing was used to avoid determinism (strong dependence on initial conditions)
- Relative differences were used to avoid determinism
