RL with LCS


    Towards Reinforcement Learning with LCS


Having until now concentrated on how LCS can handle regression and classification tasks, this chapter returns to the prime motivator for LCS: sequential decision tasks.





    Problem Definition

The sequential decision tasks that will be considered are the ones describable by a Markov Decision Process (MDP).

Some of the previously used symbols will be assigned a new meaning.



    Problem Definition

Let X be the set of states x ∈ X of the problem domain, which is assumed to be of finite size N, and hence is mapped into the natural numbers N.

In every state x_i ∈ X, an action a out of a finite set A is performed and causes a state transition to x_j. The probability of getting to state x_j after performing action a in state x_i is given by the transition function p(x_j | x_i, a), which is a probability distribution over X, conditional on X × A.

The positive discount factor γ ∈ R with 0 < γ ≤ 1 determines the preference of immediate reward over future reward.
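To make these definitions concrete, the following is a minimal sketch of such a finite MDP in Python; the array names P and r, the example sizes, and the value of gamma are illustrative assumptions rather than quantities from the text.

```python
import numpy as np

# Illustrative sizes: N states and two actions (assumed, not from the text).
N, num_actions = 5, 2
rng = np.random.default_rng(0)

# Transition function p(x_j | x_i, a): for each action a, a row-stochastic
# N x N matrix, so P[a, i, :] is a probability distribution over next states.
P = rng.random((num_actions, N, N))
P /= P.sum(axis=2, keepdims=True)

# Expected reward r(x_i, a) for performing action a in state x_i.
r = rng.random((num_actions, N))

# Discount factor gamma with 0 < gamma <= 1 (here < 1, a discounted problem).
gamma = 0.9
```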


    Problem Definition

The aim is for every state to choose the action that maximises the reward in the long run, where future rewards are possibly valued less than immediate rewards.

The Value Function, the Action-Value Function and Bellman's Equation

The approach taken by dynamic programming (DP) and reinforcement learning (RL) is to define a value function V : X → R that expresses for each state how much reward we can expect to receive in the long run.
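For reference, the value function, the action-value function and Bellman's Equation mentioned above take the following standard forms, with r(x, a) denoting the expected reward and μ a policy; the notation here is an assumption stated in the usual MDP convention, not quoted from the text.

```latex
V^{\mu}(x) = \mathbb{E}\Bigl[\textstyle\sum_{t \ge 0} \gamma^{t}\, r\bigl(x_t, \mu(x_t)\bigr) \,\Big|\, x_0 = x\Bigr],
\qquad
Q^{\mu}(x,a) = r(x,a) + \gamma \sum_{x'} p(x' \mid x, a)\, V^{\mu}(x'),
\qquad
V^{*}(x) = \max_{a} \Bigl[ r(x,a) + \gamma \sum_{x'} p(x' \mid x, a)\, V^{*}(x') \Bigr].
```

The last expression, holding simultaneously for all states x, is the set of equations that the later slides refer to as Bellman's Equation.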


    Problem Types

The three basic classes of infinite horizon problems are stochastic shortest path problems, discounted problems, and average reward per step problems, all of which are well described by Bertsekas and Tsitsiklis [17].

Here, only discounted problems and stochastic shortest path problems are considered, where for the latter only proper policies that are guaranteed to reach the desired terminal state are assumed.


    Dynamic Programming andReinforcement Learning

In this section, some common RL methods are introduced that learn these functions while traversing the state space, without building a model of the transition and reward functions.

These methods are simulation-based approximations to DP methods, and their stability is determined by the stability of the corresponding DP method.


    Dynamic Programming Operators

Bellman's Equation is a set of equations that cannot be solved analytically.

Fortunately, several methods have been developed that make finding its solution easier, all of which are based on the DP operators T and T_μ.
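These two operators are commonly defined as follows, where T_μ is the operator for a fixed policy μ; this is the standard formulation, included as an aid, and the notation is an assumption rather than a quotation from the text.

```latex
(T V)(x) = \max_{a \in \mathcal{A}} \Bigl[ r(x,a) + \gamma \sum_{x'} p(x' \mid x, a)\, V(x') \Bigr],
\qquad
(T_{\mu} V)(x) = r\bigl(x, \mu(x)\bigr) + \gamma \sum_{x'} p\bigl(x' \mid x, \mu(x)\bigr)\, V(x').
```

V* is the unique fixed point of T, and V^μ that of T_μ, which is the property that value iteration and policy iteration exploit.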


    Value Iteration and Policy Iteration

The method of value iteration is a straightforward application of the contraction property of T and is based on applying T repeatedly to an initially arbitrary value vector V until it converges to the optimal value vector V*. Convergence can only be guaranteed after an infinite number of steps, but the value vector V is usually already close to V* after a few iterations.
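A sketch of this procedure for the small illustrative MDP from the earlier code example follows; P, r and gamma are the assumed arrays from that sketch, not quantities given in the text.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Repeatedly apply T to an arbitrary initial V until it stops changing."""
    num_actions, N, _ = P.shape
    V = np.zeros(N)                      # arbitrary initial value vector
    for _ in range(max_iter):
        # (T V)(x) = max_a [ r(x, a) + gamma * sum_x' p(x' | x, a) V(x') ]
        Q = r + gamma * (P @ V)          # shape (num_actions, N)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new
```

Acting greedily with respect to the returned value vector then recovers an (approximately) optimal policy.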


    Value Iteration and Policy Iteration

Various variants of these methods exist, such as asynchronous value iteration, which at each application of T only updates a single state of V. Modified policy iteration performs the policy evaluation step by approximating V_μ by T_μ^n V for some small n.

Asynchronous policy iteration mixes asynchronous value iteration with policy iteration by, at each step, either i) updating some states of V by asynchronous value iteration, or ii) improving the policy for some set of states by policy improvement. Convergence criteria for these variants are given by Bertsekas and Tsitsiklis [17].
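As an illustration of the modified variant, the sketch below evaluates the current policy only approximately, by applying T_μ a small number n of times, before each policy improvement step; it reuses the same assumed P, r and gamma as the earlier sketches.

```python
import numpy as np

def modified_policy_iteration(P, r, gamma, n=5, sweeps=50):
    num_actions, N, _ = P.shape
    V = np.zeros(N)
    policy = np.zeros(N, dtype=int)
    states = np.arange(N)
    for _ in range(sweeps):
        # Approximate policy evaluation: V <- T_mu^n V instead of solving exactly.
        for _ in range(n):
            V = r[policy, states] + gamma * (P[policy, states] @ V)
        # Policy improvement: act greedily with respect to the current V.
        policy = (r + gamma * (P @ V)).argmax(axis=0)
    return V, policy
```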


    Approximate Dynamic Programming

If N is large, we prefer to approximate the value function rather than representing the value for each state explicitly.

Approximate value iteration is performed by approximating the value iteration update V_{t+1} = T V_t by the projected update given below.
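Writing Ṽ_t for the approximate value vector, this update is usually stated as follows; the equation is given here in its standard form as an assumption, since only the surrounding prose is available.

```latex
\tilde{V}_{t+1} = \Pi T \tilde{V}_t ,
```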


    Approximate Dynamic Programming

where Π is the approximation operator that, for the used function approximation technique, returns the value function estimate Ṽ_{t+1} that is closest to T Ṽ_t.

The only approximation that will be considered is the one most similar to approximate value iteration, and is the temporal-difference solution, which aims at finding the fixed point Ṽ = Π T Ṽ.


SARSA(λ)

Coming to the first reinforcement learning algorithm, SARSA stands for State-Action-Reward-State-Action, as SARSA(0) requires only information on the current and next state/action pair and the reward that was received for the transition.

It conceptually performs policy iteration and uses TD(λ) to update its action-value function Q. More specifically, it performs optimistic policy iteration, where in contrast to standard policy iteration the policy improvement step is based on an incompletely evaluated policy.
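As a hedged illustration of the SARSA(0) case described above, the following tabular sketch updates Q from each observed (state, action, reward, next state, next action) tuple; the environment interface (reset()/step()), the step size alpha and the ε-greedy action selection are assumptions made for the sake of a runnable example, not details from the text.

```python
import numpy as np

def sarsa0(env, num_states, num_actions, episodes=500,
           alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))

    def pick(x):
        # epsilon-greedy selection on the current action-value estimate
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(Q[x].argmax())

    for _ in range(episodes):
        x = env.reset()                              # assumed interface
        a = pick(x)
        done = False
        while not done:
            x_next, reward, done = env.step(a)       # assumed interface
            a_next = pick(x_next)
            # State-Action-Reward-State-Action: the TD(0) target uses (x', a')
            target = reward + (0.0 if done else gamma * Q[x_next, a_next])
            Q[x, a] += alpha * (target - Q[x, a])
            x, a = x_next, a_next
    return Q
```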


    Q-Learning
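As a hedged illustration, the standard tabular Q-Learning update, which differs from the SARSA(0) sketch above only in bootstrapping from the maximising action in the next state rather than the action actually chosen, can be sketched as follows; alpha and gamma are again assumed parameters.

```python
def q_learning_update(Q, x, a, reward, x_next, done, alpha=0.1, gamma=0.9):
    # Off-policy target: use max_a' Q(x', a') regardless of the action taken next.
    target = reward + (0.0 if done else gamma * Q[x_next].max())
    Q[x, a] += alpha * (target - Q[x, a])
    return Q
```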


    9.5 Further Issues

Besides the stability concerns when using LCS to perform RL, there are still some further issues to consider, two of which will be discussed in this section:

    The learning of long paths, and

How to best handle the explore/exploit dilemma.


    9.5.1 Long Path Learning

The problem of long path learning is to find the optimal policy in sequential decision tasks when the solution requires learning of action sequences of substantial length.

While a solution was proposed to handle this problem [12], it was only designed to work for a particular problem class, as will be shown after discussing how XCS fails at long path learning. The classifier set optimality criterion from Chap. 7 might provide better results, but in general, long path learning remains an open problem.

Long path learning is not only an issue for LCS, but for approximate DP and RL in general.


    XCS and Long Path Learning

Consider the problem that is shown in Fig. 9.2. The aim is to find the policy that reaches the terminal state x6 from the initial state x1a in the shortest number of steps.

In RL terms, this aim is described by giving a reward of 1 upon reaching the terminal state, and a reward of 0 for all other transitions. The optimal policy is to alternately choose actions 0 and 1, starting with action 1 in state x1a.


    XCS and Long Path Learning

The optimal value function V over the number of steps to the terminal state is shown in Fig. 9.3(a) for a 15-step corridor finite state world. As can be seen, the difference of the values of V between two adjacent states decreases with the distance from the terminal state.
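To see why, note that with a reward of 1 on reaching the terminal state, 0 otherwise, and discount factor γ < 1, the state that is k steps away from the terminal state has optimal value γ^(k−1); writing V*_k for this value (a hypothetical shorthand, not notation from the text), a standard calculation gives

```latex
V^{*}_{k} = \gamma^{\,k-1},
\qquad
V^{*}_{k} - V^{*}_{k+1} = \gamma^{\,k-1}\,(1 - \gamma),
```

so the gap between adjacent states shrinks geometrically with the distance from the terminal state, and is proportional to the magnitude of the values themselves, which is the assumption exploited by the relative-error approaches discussed below.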


    Using the Relative Error

Barry proposed two preliminary approaches to handle the problem of long path learning in XCS, both based on making the error calculation of a classifier relative to its prediction of the value function [12].

The first approach is to estimate the distance of the matched states to the terminal state and scale the error accordingly, but this approach suffers from the inaccuracy of predicting this distance.


    Using the Relative Error

A second, more promising alternative proposed in his study is to scale the measured prediction error by the inverse absolute magnitude of the prediction. The underlying assumption is that the difference in optimal values between two successive states is proportional to the absolute magnitude of these values.


    A Possible Alternative?

It was shown in Sect. 8.3.4 that the optimality criterion that was introduced in Chap. 7 is able to handle problems where the noise differs in different areas of the input space. Given that it is possible to use this criterion in an incremental implementation, will such an implementation be able to perform long path learning?


    A Possible Alternative?

Let us assume that the optimality criterion causes the size of the area of the input space that is matched by a classifier to be proportional to the level of noise in the data, such that the model is refined in areas where the observations are known to accurately represent the data-generating process. Considering only measurement noise, when applied to value function approximation this would lead to having more specific classifiers in states where the difference in magnitude of the value function for successive states is low, as in such areas this noise is deemed to be low. Therefore, the optimality criterion should provide an adequate approximation of the optimal value function, even in cases where long action sequences need to be represented.


    9.5.2 Exploration and Exploitation

Maintaining the balance between exploiting current knowledge to guide action selection and exploring the state space to gain new knowledge is an essential problem for reinforcement learning.

Too much exploration implies the frequent selection of sub-optimal actions and causes the accumulated reward to decrease.

Too much emphasis on exploitation of current knowledge, on the other hand, might cause the agent to settle on a sub-optimal policy due to insufficient knowledge of the reward distribution [228, 209]. Keeping a good balance is important, as it has a significant impact on the performance of RL methods.


    9.5.2 Exploration and Exploitation

There are several approaches to handling exploration and exploitation: one can choose a sub-optimal action every now and then, independent of the certainty of the available knowledge, or one can take this certainty into account to choose actions that increase it. A variant of the latter is to use Bayesian statistics to model this uncertainty, which seems the most elegant solution but is unfortunately also the least tractable.
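The first of these approaches is commonly realised as ε-greedy action selection; a minimal sketch follows, in which the value of ε and the tabular Q array are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(Q, x, epsilon=0.1, rng=np.random.default_rng()):
    # With small probability epsilon, ignore the current knowledge and explore;
    # otherwise exploit the current action-value estimate Q for state x.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(Q[x].argmax())
```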


    9.6 Summary

Despite sequential decision tasks being the prime motivator for LCS, they are still the ones which LCS handle least successfully. This chapter provides a primer on how to use dynamic programming and reinforcement learning to handle such tasks, and on how LCS can be combined with either approach from first principles.


    9.6 Summary

An essential part of the LCS type discussed in this book is that classifiers are trained independently. This is not completely true when using LCS with reinforcement learning, as the target values that the classifiers are trained on are based on the global prediction, which is formed by all matching classifiers in combination. In that sense, classifiers interact when forming their action-value function estimates. Still, besides combining classifier predictions to form the target values, independent classifier training still forms the basis of this model type, even when used in combination with RL.


    9.6 Summary

Overall, using LCS to approximate the value or action-value function in RL is appealing, as LCS dynamically adjust to the form of this function and thus might provide a better approximation than standard function approximation techniques. It should be noted, however, that the field of RL is moving quickly, and that Q-Learning is by far not the best method that is currently available. Hence, in order for LCS to be a competitive approach to sequential decision tasks, they also need to keep pace with new developments in RL, some of which were discussed when detailing the exploration/exploitation dilemma that is an essential component of RL.

In summary, it is obvious that there is still plenty of work to be done until LCS can provide the same formal development as RL currently does. Nonetheless, the initial formal basis is provided in this chapter, upon which other research can build further analysis and improvements to how LCS handle sequential decision tasks effectively, competitively, and with high reliability.