Learning to Play Games
WCCI 2008 Tutorial

Simon M. Lucas
Game Intelligence Group
University of Essex, UK
http://csee.essex.ac.uk/gig/

Overview
- Motivation
- Some sample applications and challenges
  - Non-Player Character AI
  - Content Creation for Games
  - Generation of new games
- Main learning algorithms: Evolution and TDL
- Information acquisition rates
- Effects of function approximation architecture
- Future Directions

The Games Industry is Huge
- And in need of better AI
- Current games are impressive, but have high development costs
My Motivation
- To better understand the process of learning
  - How can we learn reliably and efficiently in complex environments?
- Video game AI may need more natural intelligence than classic game AI
- Which techniques work best, when, and why?
- How can we use this knowledge to build better games?
  - E.g. with plug-and-play character AI

Plug and Play NPCs and Game Mashups
Unreal Tournament: Pogamut Interface

Screen Capture Pac-Man

Galactic Arms Race (Hastings, Guha, Stanley, UCF)

Evolving Board Games (Cameron Browne)
Architecture
- Where does the learning fit into a game-playing agent?
- Two main choices:
  - Value function (e.g. TD-Gammon, Neuro-Gammon, Blondie)
  - Action selector (e.g. NERO, MENACE)
- We'll see how this works in a simple grid world
We'll begin by looking at the main classes of architecture. An agent perceives the world through some real or virtual sensors. These are fed to the inputs of a function approximator. There are two main ways this can be configured: either as a value function or as an action selector. In the value function approach, the agent projects the current state of the game to a set of possible future states, by applying a set of actions to a model of the game or system. The value function takes a game state as input and produces a single value as output for each state. The agent selects the action that leads to the highest-value state (as judged by the value function).
An action selector, by contrast, takes only the current state of the game and directly outputs the action to take. The interpretation of the outputs will depend on the nature of the game. For example, when using this approach for a board game, the outputs of the function approximator will be continuous values, but the move selected could be the legal move with the highest associated output. On the other hand, for a car racing game there might be two outputs that are directly interpreted as the accelerator and the steering angle.
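The contrast between the two configurations can be sketched in code. This is a minimal illustration, not the tutorial's code: the (speed, angle) state, the feature vector, the linear weights, and the forward model passed to the value-function agent are all hypothetical stand-ins.

```python
import math

# Hypothetical sketch contrasting the two architectures. The feature vector,
# weights, and (speed, angle) state are illustrative, not from the tutorial.

def features(state):
    """Extract a feature vector from a (speed, angle) state tuple."""
    speed, angle = state
    return [speed, angle]

def state_value(state, weights):
    """Value function: a single-output (here linear) function approximator."""
    return sum(w * f for w, f in zip(weights, features(state)))

def select_by_value(state, actions, model, weights):
    """Value-function agent: needs a forward model to project each action."""
    return max(actions, key=lambda a: state_value(model(state, a), weights))

def action_selector(state, w_accel, w_steer):
    """Action-selector agent: maps the current state directly to outputs."""
    f = features(state)
    accel = math.tanh(sum(w * x for w, x in zip(w_accel, f)))
    steer = math.tanh(sum(w * x for w, x in zip(w_steer, f)))
    return accel, steer
```

Note that only the value-function agent requires the `model` argument; the action selector works without any forward model, which is the key trade-off discussed below.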
Example (images from the COBOSTAR paper, Butz and Loenneker, IEEE CIG 2009)
[Diagram: sensor inputs feed a function approximator, which outputs an action (steer, accel.)]

State Value Function
- Hypothetically apply possible actions to the current state to generate a set of possible future states
- Evaluate these using the value function
- Pick the action that leads to the most favourable state
- For: easy to apply, learns relatively quickly
- Against: needs a model of the system

To reiterate, the state value function works by applying a set of possible actions to the current game state to generate a set of possible future states. These states are then evaluated with the value function. The action is chosen that leads to the state with the highest value. Since the system has only a single output, it typically has fewer parameters than an action selector, and learns relatively quickly. On the negative side, a model of the system is required in order to project forward in time. The approach is very flexible, and stronger players can often be generated by simply projecting further forward in time, though at the expense of greater CPU time during training and operation.
State Value Diagram
- Game state projected to possible future states
- These are then evaluated
- Choose the action that leads to the best value
- Single-output function approximator

[Diagram: Game State (t) branches to Game State (t+1, a1) ... Game State (t+1, an); each passes through Feature Extraction and a Function Approximator]

This diagram shows a block diagram of a value function being used to play a game.
The game state at time t is projected to future states at time t+1; in the diagram these are shown as the result of taking actions a1, a2, ..., an.
Each future state then passes in turn through the same feature extractor and function approximator, operated as a value function.
The function approximator has a single output indicating the value of that state. The chosen action is the one that leads to the highest output.
Grid World
- n x n grid (toroidal, i.e. wrap-around)
- Green disc: goal state
- Red disc: current state
- Actions: up, down, left, right
- Red circles: possible next states
- Example uses a 15 x 15 grid
- State value approach: choose the next state with the highest value
- Q: What if we don't know the value function?
- A: Learn it!
To clarify these approaches, let's consider a simple grid world problem. The aim is for the agent (the red disc) to find the goal (the green disc in the centre). The grid is toroidal, and in each state there are four possible actions: up, down, left and right.
Let's see now how to address this problem with a state value function, and with an action selector.
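The grid world and its state-value action choice can be sketched as follows. The 15 x 15 size and centre goal follow the slides; the function names and the dictionary representation of the value table are my own choices.

```python
# Minimal sketch of the toroidal grid world from the slides. The 15 x 15 size
# and centre goal follow the tutorial; names and representation are my own.
N = 15
GOAL = (7, 7)
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Apply an action; coordinates wrap around (the grid is toroidal)."""
    x, y = state
    dx, dy = ACTIONS[action]
    return ((x + dx) % N, (y + dy) % N)

def greedy_action(state, value):
    """State-value approach: choose the action whose next state has the
    highest value, where `value` maps each (x, y) cell to a number."""
    return max(ACTIONS, key=lambda a: value[step(state, a)])
```

With a good value table, `greedy_action` walks straight to the goal; the question the slide poses is how to obtain such a table when it is not given.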
Learning Algorithms
- Temporal difference learning (TDL)
- Evolution (or co-evolution)
- Both can learn to play games given no expert knowledge

Next we'll study the main algorithms for learning the parameters of a game-playing system: temporal difference learning (TDL) and evolution (or co-evolution). Although there are many possible algorithms to use, most can be placed into one of these two categories.
(Co-)Evolution
- Evolution / co-evolution (vanilla form)
- Use information from game results as the basis of a fitness function
- These results can be against some existing computer players
  - This is standard evolution
- Or against a population of simultaneously evolving players
  - This is called co-evolution
- Easy to apply
- But wasteful: discards so much information

Evolution or co-evolution uses information from game results as the basis of a fitness function.
These results can be against some existing computer players, which leads to standard evolutionary algorithms, or against a population of simultaneously evolving players, which gives us co-evolution.
The approach is very easy to apply and works with any architecture: there is no requirement for the function approximator to be differentiable.
The focus on end results should make it robust as well. On the minus side, these algorithms are rather wasteful, as the information that could be observed during game play is ignored.

Co-evolution (Single Population)
- Evolutionary algorithm: rank the players using a league
This slide illustrates how easy co-evolution is to apply. Imagine using an evolutionary algorithm with rank-based selection.
Co-evolution is exactly this, except that the ranking is done by playing the agents off against each other.
This could be done tournament-style, like Wimbledon, or in a round-robin league such as the English football Premiership shown in the example.
The advantage of using tournament selection is that fewer games are required to establish the winner. The disadvantage is that it's more susceptible to noise, in which case the best player won't necessarily win.
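The league option can be sketched as a round-robin that converts game results into a ranking for rank-based selection. This is an illustrative sketch, not the tutorial's code: `play(a, b)` is an assumed game function returning 1, 0, or -1, and the 3/1/0 football-style scoring is my own choice.

```python
def league_ranking(population, play):
    """Round-robin league: every player plays every other once; rank by points.

    `play(a, b)` is an assumed game function returning 1 if a wins,
    0 for a draw, and -1 if b wins. Football-style scoring is used:
    3 points for a win, 1 for a draw, 0 for a loss.
    """
    points = [0] * len(population)
    for i in range(len(population)):
        for j in range(i + 1, len(population)):
            result = play(population[i], population[j])
            if result > 0:
                points[i] += 3
            elif result < 0:
                points[j] += 3
            else:
                points[i] += 1
                points[j] += 1
    # Indices sorted best-first, ready for rank-based selection.
    return sorted(range(len(population)), key=lambda i: points[i], reverse=True)
```

For noisy games, each pairing could be played several times and the points averaged, at the cost of more game evaluations per generation.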
Temporal Difference Learning (TDL)
- Can learn by self-play
- Essential to have some reward structure
  - This may follow directly from the game
- Learns during game play
- Uses information readily available (i.e. the current observable game state)
- Often learns faster than evolution, but may be less robust
- Function approximator must be trainable

In standard temporal difference learning an agent learns by playing the game against itself. In the case of a single-player game it takes actions in the game environment. It is essential to have some kind of reward structure, but this will often follow directly from the game. For example, when learning to play a board game, a reward can be given at the end of each game depending on the result (win, lose, or draw). TDL also learns during game play: it learns that consecutive states should have similar values.
As a result of using more information, TDL often learns faster than evolution, but it may be less robust. One limitation of TDL is that the value function must be trainable; this usually means that it must be differentiable.

TDL in Pictures
- v(t-1) moves towards v(t)
- TD(lambda)
[Diagram: values plotted against time t]
Demo: TDL Learns a State Value Function for the Grid Problem
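What the demo does can be approximated by tabular TD(0) with an epsilon-greedy policy. This is a hedged reconstruction, not the tutorial's actual code: the step size, discount, exploration rate, episode cap, and goal-only reward are all my own assumptions.

```python
import random

# Tabular TD(0) sketch for the grid demo. Grid size and goal follow the
# slides; alpha, gamma, epsilon, reward and episode cap are assumptions.
N, GOAL = 15, (7, 7)
MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # up, down, left, right

def td_learn(episodes=300, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    v = {(x, y): 0.0 for x in range(N) for y in range(N)}
    for _ in range(episodes):
        s = (rng.randrange(N), rng.randrange(N))
        for _ in range(400):                 # cap the episode length
            if s == GOAL:
                break
            if rng.random() < epsilon:       # occasional exploratory move
                dx, dy = rng.choice(MOVES)
            else:                            # greedy on the learned values
                dx, dy = max(MOVES, key=lambda m: v[((s[0] + m[0]) % N,
                                                     (s[1] + m[1]) % N)])
            s2 = ((s[0] + dx) % N, (s[1] + dy) % N)
            r = 1.0 if s2 == GOAL else 0.0   # reward only on reaching the goal
            # TD(0): move v(s) towards the target r + gamma * v(s2)
            v[s] += alpha * (r + gamma * v[s2] - v[s])
            s = s2
    return v
```

After training, values form a gradient that rises towards the goal, so the greedy policy over next-state values reaches it directly, as in the demo.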
Evolving State Tables
- Interesting to make a direct comparison for the grid problem between TDL and evolution
- For the fitness function we measure some aspect of the performance
  - E.g. average number of steps taken per episode, given a set of start points
  - Or the number of times the goal was found, given a fixed number of steps

It is interesting to make a direct comparison on the grid problem between TDL and evolution. For evolution's fitness function we measure some aspect of the performance. To give two examples, this could be the average number of steps taken per episode given a set of start points, or alternatively, the number of times the goal was found given a fixed number of steps.
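The first of those fitness measures can be sketched as follows. This is an illustrative sketch only: the step cap, the greedy play-out, and the function names are my own, and the evolved table is represented as a dictionary mapping cells to values.

```python
# Sketch of the steps-per-episode fitness measure for an evolved state table.
# Step cap, representation and names are my own; lower fitness is better.
N, GOAL = 15, (7, 7)
MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # up, down, left, right

def episode_steps(table, start, max_steps=100):
    """Play one episode greedily on the evolved value table; count the steps."""
    s = start
    for steps in range(max_steps):
        if s == GOAL:
            return steps
        dx, dy = max(MOVES, key=lambda m: table[((s[0] + m[0]) % N,
                                                 (s[1] + m[1]) % N)])
        s = ((s[0] + dx) % N, (s[1] + dy) % N)
    return max_steps  # failed to reach the goal within the cap

def fitness(table, starts):
    """Average number of steps over a fixed set of start points."""
    return sum(episode_steps(table, s) for s in starts) / len(starts)
```

An evolutionary algorithm would treat the 225 table entries as the genome and minimise this fitness; note that each evaluation already costs `len(starts)` full episodes.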
Evolving a State Table for the Grid Problem
- This experiment ran for 1000 fitness evaluations
- Fitness function: average number of steps taken over 25 episodes
- Used CMA-ES, a powerful evolution strategy
- Very poor results (final fitness approx. 250)
- [Figure: sample evolved table]
TDL versus Evolution (Grid Problem, State Table)
- TDL greatly outperforms evolution in this experiment
TDL greatly outperforms evolution in this experiment. Note also that each fitness evaluation for evolution is based on the average of 25 episodes, so the graph shows the result of 25,000 episodes for evolution versus only 1,000 episodes for TDL.
Evolution makes very disappointing progress on this problem when applied in this way. Each fitness evaluation was the average of 25 episodes.