
Page 1

Monte-Carlo Go Reinforcement Learning Experiments

Paper by Bruno Bouzy, Guillaume Chaslot, IEEE 2006

Presented by Markus Enzenberger.
Go Seminar, University of Alberta.

November 16, 2006


Page 2

Background
  Indigo project

Monte-Carlo Go
  Basic Monte-Carlo Go
  Monte-Carlo Go with specific knowledge

Reinforcement Learning and Monte-Carlo Go
  The randomness dimension

Experiments
  Experiment 1: One program vs. population
  Experiment 2: relative difference at MC level

Conclusion


Page 3


Indigo project


Page 4


Basic Monte-Carlo Go

- General MC model (Abramson)
- All-moves-as-first heuristic (Brügmann)
- Standard deviation of random-game scores on 9×9: ≈ 35
- Precision: 1 point
- 1000 games for 68% confidence
- 4000 games for 95% confidence (see the sketch below)
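
These figures follow from the standard error of the mean. A minimal worked sketch (Python, not from the paper), assuming i.i.d. random-game scores with standard deviation ≈ 35; the exact counts come out slightly above the slide's rounded 1000 and 4000:

```python
import math

def games_needed(sigma=35.0, precision=1.0, z=1.0):
    """Games needed so the mean score is within `precision` points of the true
    value at roughly z standard deviations of confidence (z=1 ~ 68%, z=2 ~ 95%).
    Requires the standard error sigma / sqrt(n) to be at most precision / z."""
    return math.ceil((z * sigma / precision) ** 2)

print(games_needed(z=1.0))  # 1225 games for ~68% confidence at 1-point precision
print(games_needed(z=2.0))  # 4900 games for ~95% confidence at 1-point precision
```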


Page 5


Progressive pruning
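
The slide only names progressive pruning. As a rough illustration (not taken from this paper), here is a minimal sketch of the usual idea in Bouzy-style Monte-Carlo Go: moves whose statistical upper bound falls below the best move's lower bound are pruned from further simulation. The bound width k and the (mean, std, games) bookkeeping are assumptions of this sketch.

```python
import math

def progressive_prune(stats, k=2.0):
    """Drop moves that are very unlikely to catch up with the current best move.

    stats maps move -> (mean_score, std_dev, n_games).  A move is kept only if
    its upper bound mean + k * std_err stays above the best move's lower bound
    mean - k * std_err; k is an assumed bound width, not the paper's value."""
    def lower(mean, std, n):
        return mean - k * std / math.sqrt(max(n, 1))

    def upper(mean, std, n):
        return mean + k * std / math.sqrt(max(n, 1))

    best_lower = max(lower(*v) for v in stats.values())
    return {move: v for move, v in stats.items() if upper(*v) >= best_lower}
```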


Page 6


Advantages of MC

- Bouzy, Helmstetter: MC programs on par with Indigo2002
- MC takes advantage of faster computers
- Heavily knowledge-based programs less robust
- Global view (avoids errors due to decomposition)
- Easy to develop


Page 7


Indigo2003

Two ways of associating Go knowledge with MC:
- Pre-select moves for MC
- Insert knowledge in the random games (“pseudo-random” := non-uniform probability)


Page 8


Two Modules of Indigo2003


Page 9


Pseudo-Random (PR) player

- Capture-escape urgency: fill the last liberty of a one-liberty string; urgency is linear in string size
- Cut-connect urgency: 3×3 patterns (smaller: insufficient; larger: lower urgency)
- Total urgency is the sum of both urgencies
- Probability of a move is proportional to its total urgency (see the sketch below)
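
A minimal sketch of urgency-proportional move sampling as just described; capture_escape_urgency and cut_connect_urgency are hypothetical callables standing in for Indigo's actual urgency rules, and the board/move types are assumptions of this sketch.

```python
import random

def pseudo_random_move(board, legal_moves,
                       capture_escape_urgency, cut_connect_urgency):
    """Pick one move with probability proportional to its total urgency."""
    # Total urgency of a move = capture-escape urgency + cut-connect urgency.
    urgencies = [capture_escape_urgency(board, m) + cut_connect_urgency(board, m)
                 for m in legal_moves]
    total = sum(urgencies)
    if total <= 0:
        return random.choice(legal_moves)  # no urgent move: fall back to uniform
    r = random.uniform(0.0, total)
    acc = 0.0
    for move, u in zip(legal_moves, urgencies):
        acc += u
        if r <= acc:
            return move
    return legal_moves[-1]  # guard against floating-point round-off
```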


Page 10


MC(Manual) vs. MC(Zero)

- Manual: urgencies assigned to 3×3 patterns by a human expert
- They correspond to Go concepts such as cut and connect
- Non-zero and high values occur only very sparsely


Page 11

Reinforcement Learning and Monte-Carlo Go

- Q-learning for learning action values
- p1 better than p2 (at the random level) does not necessarily mean that MC(p1) is better than MC(p2) (at the MC level)
- Randomization vs. determinism


Page 12


The randomness dimension


Page 13


Experiment 1: One program vs. population

- Patterns have action values Q ∈ ]−1, +1[
- Urgency: U = ((1 + Q) / (1 − Q))^k
- Value update after playing a block of b games: Q_play = Q_play + λ_b Q_learn (see the sketch below)
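
A minimal sketch of these two formulas; the exponent k, the block learning rate lambda_b, and the split into playing values Q_play and freshly learned values Q_learn come from the slide, while the dictionary layout is an assumption of this sketch.

```python
def urgency(q, k):
    """Urgency of a pattern with action value q in ]-1, +1[: U = ((1 + q) / (1 - q)) ** k."""
    return ((1.0 + q) / (1.0 - q)) ** k

def block_update(q_play, q_learn, lambda_b):
    """After a block of games, fold the learned values into the playing values:
    Q_play <- Q_play + lambda_b * Q_learn, applied pattern by pattern."""
    return {pattern: q_play[pattern] + lambda_b * q_learn[pattern]
            for pattern in q_play}
```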


Page 14


Experiment 1a: one unique learning program

- RLPR ≫ Zero
- RLPR < Manual
- MC(RLPR) ≪ MC(Manual)
- Highly dependent on initial conditions
- Q-values are learnt in the first block; later, only these Q-values are increased


Page 15


Experiment 1b: a population of learning programs

- Evolutionary computing
- Test against Zero and Manual
- Select and duplicate the best programs according to some scheme (see the sketch below)

Starting population = Zero
- RLPR = Zero + 180
- RLPR = Manual - 50
- MC(RLPR) ≪ MC(Manual)

Starting population = Manual
- RLPR = Zero + 180
- RLPR = Manual + 50
- MC(RLPR) = MC(Manual) - 20
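
A rough sketch of such a population loop, assuming truncation selection with duplication as the unspecified scheme; evaluate is a hypothetical callable standing in for the test games against the Zero and Manual players.

```python
import copy
import random

def evolve(population, evaluate, generations, keep_fraction=0.5):
    """Generic evolutionary loop: score each learning program, keep the best
    fraction, and refill the population by duplicating the survivors."""
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        survivors = ranked[:max(1, int(len(ranked) * keep_fraction))]
        population = [copy.deepcopy(random.choice(survivors))
                      for _ in range(len(population))]
    return population
```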


Page 16

Experiment 1: One program vs. population

- RLPR programs have a tendency toward determinism
- Using EC lessens the problem but does not remove it completely


Page 17


Experiment 2: relative difference at MC level

- MC(RLPR) plays against itself again
- Update rule to avoid determinism; for a pair of patterns a, b (see the sketch below):

  exp(C (V_a − V_b)) = u_a / u_b   (target urgency ratio)
  δ = Q_a − Q_b − C (V_a − V_b)
  Q_a = Q_a − α δ
  Q_b = Q_b + α δ
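
A minimal sketch of this update; it assumes urgencies are taken as u = exp(Q), so that keeping Q_a − Q_b close to C·(V_a − V_b) keeps the urgency ratio close to the target exp(C·(V_a − V_b)). C, alpha, and the list of pattern pairs are parameters of the experiment and are treated as given.

```python
import math

def relative_update(q, v, pairs, C, alpha):
    """One pass of the relative-difference rule: for each pattern pair (a, b),
    move the difference Q_a - Q_b toward C * (V_a - V_b) rather than pushing
    individual values toward extremes, which avoids a deterministic policy."""
    for a, b in pairs:
        delta = (q[a] - q[b]) - C * (v[a] - v[b])
        q[a] -= alpha * delta
        q[b] += alpha * delta
    return q

def urgency_ratio(q, a, b):
    """With urgencies u = exp(Q), the ratio depends only on the value difference."""
    return math.exp(q[a] - q[b])
```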


Page 18

Experiment 2: relative difference at MC level

9×9

- MC(RLPR) = MC(Manual) + 3

19×19 (learning on 9×9)

- MC(RLPR) = MC(Manual) - 30


Page 19

Conclusion

- Determinism was identified as an obstacle
- Evolutionary computing was used to avoid determinism (strong dependence on initial conditions)
- Relative differences were used to avoid determinism
