By David Anderson SZTAKI (Budapest, Hungary) WPI D2009


1997: Deep Blue won against Kasparov
Average workstation can defeat best Chess players

Computer Chess no longer “interesting”
Go is much harder for computers to play

Branching factor is ~50-200 versus ~35 in Chess
Positional evaluation inaccurate, expensive
Game cannot be scored until the end

Beginners can defeat best Go programs

Two-player, perfect information
Players take turns placing black and white stones on a grid

Board is 19x19 (13x13 or 9x9 for beginners)
Object is to surround empty space as territory
Pieces can be captured, but not moved
Winner determined by most points (territory plus captured pieces)

Image from http://ict.ewi.tudelft.nl/~gineke/

Minimax/α-β algorithms require huge trees
Tree depth cannot be cut easily

Monte-Carlo now more popular
Simulate random games from the game tree
Use results to pick best move
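The idea of simulating random games and using the results to pick a move can be sketched on a game far smaller than Go. The example below is a minimal illustration, not the method of any Go program: it assumes a hypothetical take-away game (a pile of stones, each player removes 1 to 3, whoever takes the last stone wins), and the names `legal_moves`, `random_playout`, and `flat_monte_carlo` are invented for this sketch.

```python
import random

def legal_moves(stones):
    """Moves available in the toy game: take 1, 2, or 3 stones."""
    return [n for n in (1, 2, 3) if n <= stones]

def random_playout(stones, player_to_move, rng):
    """Play uniformly random moves until the pile is empty.

    Assumes stones >= 1. Returns the winner (0 or 1), i.e. the player
    who takes the last stone.
    """
    player = player_to_move
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return player
        player = 1 - player

def flat_monte_carlo(stones, player=0, playouts_per_move=2000, seed=0):
    """Estimate a win rate for each legal move by random simulation."""
    rng = random.Random(seed)
    win_rates = {}
    for move in legal_moves(stones):
        remaining = stones - move
        if remaining == 0:
            win_rates[move] = 1.0  # taking the last stone wins outright
            continue
        wins = sum(
            random_playout(remaining, 1 - player, rng) == player
            for _ in range(playouts_per_move)
        )
        win_rates[move] = wins / playouts_per_move
    best_move = max(win_rates, key=win_rates.get)
    return best_move, win_rates

if __name__ == "__main__":
    move, rates = flat_monte_carlo(5)
    print("chosen move:", move)
    print("estimated win rates:", rates)
```

Every candidate move gets the same number of simulations here. That is the waste the later slides address: a smarter search would spend more simulations on the moves that look promising.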

Two areas of optimization
Discovery of good paths in the game tree
Intelligence of random simulations
  Random games are usually bogus

Need to balance between exploration…
Discovering and simulating new paths

And exploitation…
Simulating the most promising path

Best method is currently UCT, introduced by Levente Kocsis and Csaba Szepesvári.

Say you have a slot machine with a probability of giving you money. You can infer this probability through experimentation.

What if there are three slot machines, and each has a different probability?

You need to choose between experimenting (exploration) and getting the best reward (exploitation).

The UCB algorithm balances these two goals to minimize loss of reward.
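The three-slot-machine scenario can be simulated directly. The sketch below assumes the common UCB1 rule (average reward plus an exploration bonus of sqrt(2 ln n / n_j)). The slides do not say which UCB variant is meant, and the payout probabilities and function names are made up for illustration.

```python
import math
import random

def ucb1_select(counts, rewards, c=math.sqrt(2)):
    """Pick the arm with the highest upper confidence bound.

    counts[a]  = number of times arm a has been pulled
    rewards[a] = total reward collected from arm a
    """
    # Try every arm once before trusting any average.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [
        rewards[a] / counts[a] + c * math.sqrt(math.log(total) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

def run_bandit(payout_probs, pulls=5000, seed=0):
    """Simulate slot machines that pay 1 with the given probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(payout_probs)
    rewards = [0.0] * len(payout_probs)
    for _ in range(pulls):
        arm = ucb1_select(counts, rewards)
        payout = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += payout
    return counts, rewards

if __name__ == "__main__":
    counts, rewards = run_bandit([0.2, 0.5, 0.8])
    print("pulls per machine:", counts)
    print("total reward:", sum(rewards))
```

The exploration bonus shrinks as a machine is pulled more often, so poor machines are still sampled occasionally but most pulls end up on the best one.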

UCT applies UCB to games like Go, deciding which move to explore next by treating the choice as a bandit problem.

Starts with a one-level tree of legal board moves

Picks best move according to the UCB algorithm

Runs a Monte-Carlo simulation, updates the node’s win/loss record

This is one iteration of the UCT process.

If a node gets visited enough times, start looking at its child moves

UCT dives deeper, each time picking the most “interesting” move.

Eventually, UCT has built a large tree of simulation information
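The UCT steps listed above can be put together in a compact sketch. To stay short it does not use Go: it assumes a hypothetical take-away game (remove 1 to 3 stones, taking the last stone wins). The `Node` class, the visit threshold `expand_threshold=5`, and the exploration constant `c=1.4` are illustrative choices, not values from the slides.

```python
import math
import random

class Node:
    """One position in the toy game, plus its simulation statistics."""
    def __init__(self, stones, player_to_move, parent=None, move=None):
        self.stones = stones
        self.player_to_move = player_to_move
        self.parent = parent
        self.move = move        # move that led from the parent to this node
        self.children = []
        self.visits = 0
        self.wins = 0           # wins for the player who made `move`

def legal_moves(stones):
    return [n for n in (1, 2, 3) if n <= stones]

def expand(node):
    """Add one child per legal move."""
    for move in legal_moves(node.stones):
        node.children.append(
            Node(node.stones - move, 1 - node.player_to_move, node, move))

def select_child(node, c=1.4):
    """UCB choice among children; unvisited children go first."""
    for child in node.children:
        if child.visits == 0:
            return child
    log_n = math.log(node.visits)
    return max(node.children,
               key=lambda ch: ch.wins / ch.visits
               + c * math.sqrt(log_n / ch.visits))

def playout(stones, player_to_move, rng):
    """Random game to the end; returns the winner (0 or 1)."""
    if stones == 0:
        return 1 - player_to_move  # the previous player took the last stone
    player = player_to_move
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return player
        player = 1 - player

def uct_search(stones, iterations=3000, expand_threshold=5, seed=0):
    """Return the most-visited root move. Assumes stones >= 1."""
    rng = random.Random(seed)
    root = Node(stones, player_to_move=0)
    expand(root)                                # one-level tree of legal moves
    for _ in range(iterations):
        node = root
        while node.children:                    # dive along "interesting" moves
            node = select_child(node)
        if node.visits >= expand_threshold and node.stones > 0:
            expand(node)                        # visited enough: look at children
            node = select_child(node)
        winner = playout(node.stones, node.player_to_move, rng)
        while node is not None:                 # update win/loss up the path
            node.visits += 1
            if node.parent is not None and winner == node.parent.player_to_move:
                node.wins += 1
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move

if __name__ == "__main__":
    print("move chosen from a pile of 5:", uct_search(5))
```

Each node stores wins from the point of view of the player who moved into it, so the same UCB formula serves both players as the search alternates down the tree.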

UCT is now in most major competitive programs

“MoGo” used UCT to defeat a professional
Used an 800-node grid and a 9-stone handicap

Much research now focused on improving simulation intelligence

Policy decides which move to play next in a random game simulation

High stochasticity makes UCT less accurate
Takes longer to converge to correct move

Too much determinism makes UCT less effective
Defeats purpose of Monte-Carlo search
Might introduce harmful selection bias

Certain shapes in Go are good
“Hane” here is a strong attack on B

Others are quite bad!
B’s “empty triangle” is too dense and wasteful

MoGo uses pattern knowledge with UCT
Hand-crafted database of interesting 3x3 patterns
Doubled simulation win rate, according to authors

Can pattern knowledge be trained automatically via machine learning?

Paper: “Monte-Carlo Simulation Balancing” (by David Silver and Gerald Tesauro)
Policies accumulate error with each move
Strong policies minimize this error, but not the whole-game error

Proposes algorithms for minimizing whole-game error with each move

Authors tested on 5x5 Go using 2x2 patterns
Found that balancing was more effective than raw strength

Implemented the pattern-learning algorithms from “Monte-Carlo Simulation Balancing”
Strength: Apprenticeship
Strength: Policy Gradient Reinforcement
Balance: Policy Gradient Simulation Balancing
Balance: Two-Step Simulation Balancing

Used 9x9 Go with 3x3 patterns

Used amateur database of 9x9 games for training

Mention-worthy metrics:
Simulation win rate against purely random
UCT win rate against UCT purely random
UCT win rate against GNU Go

Simplest algorithm
Looks at every move of every game in the training set
High preference for chosen moves
Low preference for unchosen moves

Strongly favored good patterns
Over-training; poor error compensation

Values diverge toward infinity
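The unbounded growth of pattern values can be reproduced in a few lines. The sketch below rests on assumptions the slides do not state: the policy is a softmax over pattern weights, and each training move nudges the weights up the gradient of the log-probability of the chosen move. The function names and the step size are illustrative.

```python
import math

def softmax(weights):
    """Convert weights to probabilities (shifted by the max for stability)."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def apprenticeship_step(weights, chosen, step_size=0.1):
    """One update: raise the chosen candidate, lower the others.

    For a softmax policy, the gradient of log p(chosen) with respect to
    weight i is (1 if i == chosen else 0) - p(i).
    """
    probs = softmax(weights)
    return [
        w + step_size * ((1.0 if i == chosen else 0.0) - p)
        for i, (w, p) in enumerate(zip(weights, probs))
    ]

def train(num_steps, chosen=0, num_candidates=3):
    """Repeat the update on a position where the same move is always chosen."""
    weights = [0.0] * num_candidates
    for _ in range(num_steps):
        weights = apprenticeship_step(weights, chosen)
    return weights

if __name__ == "__main__":
    for steps in (100, 1000, 10000):
        print(steps, "steps ->", [round(w, 3) for w in train(steps)])
```

Under these assumptions the update for the chosen candidate is always positive, because its probability never reaches 1. Its weight therefore keeps rising with more training and the policy drifts toward full determinism.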

[Chart: Apprenticeship vs Pure Random. Win rate (%) by game type (Playout, UCT vs libEGO, UCT vs GNU Go); series: Pure Random, Apprenticeship.]

Plays random games from the training set
If the simulation matches the original game result, patterns get higher preference

Otherwise, lower preference
Results were promising

[Chart: Reinforcement vs Pure Random. Win rate (%) by game type; series: Pure Random, Reinforcement.]

For each training game…
Plays random games to estimate win rate
Plays more random games to determine which patterns win and lose

Gives preferences to patterns based on error between actual game result and observed win rate

Usually, strong local moves
Seemed to learn good pattern distribution
Aggressively played useless moves hoping for an opponent mistake

Poor consideration of the whole board

[Chart: Simulation Balancing versus Pure Random. Win rate (%) by game type; series: Pure Random, Simulation Balancing.]

Picks random game states
Computes score estimate of every move at 2-ply depth

Updates pattern preferences based on these results, using actual game result to compensate for error

Game score is hard to estimate, usually inaccurate

Extremely expensive; 10-30 sec to estimate score

Game score doesn’t change meaningfully for many moves

Probably does not scale as board size grows

[Chart: Two Step Balancing vs Pure Random. Win rate (%) by game type; series: Pure Random, Two Step Balancing.]

[Chart: Algorithm Results. Win rate (%) by game type; series: Pure Random, Apprenticeship, Reinforcement, Simulation Balancing, Two Step Balancing.]

Reinforcement strongest
All algorithms capable of very deterministic policies

Higher playout win rates were too deterministic and thus usually bad with UCT

Go may be too complex for these algorithms
Optimizing self-play doesn’t guarantee good moves

Levente Kocsis

SZTAKI

Professors Sárközy and Selkow

Algorithm generates list of patterns
Each pattern has a weight/value
Policy looks at open positions on the board
Gets the pattern at each open position
Uses weights as a probability distribution
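The pattern policy described above can be sketched as follows. Several details are assumptions made for illustration and are not taken from the slides: the board is a list of strings, a pattern is the 3x3 neighborhood around an open point encoded as a 9-character string, unknown patterns get weight 0, and weights become probabilities through a softmax. Move legality (suicide, ko) is not checked.

```python
import math
import random

EMPTY, BLACK, WHITE, EDGE = ".", "X", "O", "#"

def pattern_at(board, row, col):
    """Return the 3x3 neighborhood around (row, col) as a 9-character key.

    Off-board cells are encoded as EDGE.
    """
    size = len(board)
    cells = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            on_board = 0 <= r < size and 0 <= c < size
            cells.append(board[r][c] if on_board else EDGE)
    return "".join(cells)

def move_distribution(board, pattern_weights):
    """Map each open point to a probability derived from its pattern weight."""
    open_points = [
        (r, c)
        for r, line in enumerate(board)
        for c, cell in enumerate(line)
        if cell == EMPTY
    ]
    if not open_points:
        return {}
    scores = [pattern_weights.get(pattern_at(board, r, c), 0.0)
              for r, c in open_points]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # softmax, shifted for stability
    total = sum(exps)
    return {point: e / total for point, e in zip(open_points, exps)}

def sample_move(board, pattern_weights, rng):
    """Draw one open point according to the pattern-based distribution."""
    dist = move_distribution(board, pattern_weights)
    if not dist:
        raise ValueError("no open positions on the board")
    points = list(dist)
    return rng.choices(points, weights=[dist[p] for p in points], k=1)[0]

if __name__ == "__main__":
    board = ["X.",
             ".."]
    weights = {pattern_at(board, 0, 1): 2.0}  # hypothetical learned weight
    print(move_distribution(board, weights))
    print(sample_move(board, weights, random.Random(0)))
```

Because moves are sampled rather than chosen greedily, the policy stays stochastic. Very large weight differences make it nearly deterministic, which is the failure mode the results slides warn about.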
