
8/3/2019 RL with LCS


Towards Reinforcement Learning with LCS


Having until now concentrated on how LCS can handle regression and classification tasks, this chapter returns to the prime motivator for LCS: sequential decision tasks.



Problem Definition

The sequential decision tasks that will be considered are the ones describable by a Markov Decision Process (MDP).

Some of the previously used symbols will be assigned a new meaning.


Let X be the set of states x ∈ X of the problem domain, which is assumed to be of finite size N, and hence can be mapped into the natural numbers.

In every state xi ∈ X, an action a out of a finite set A is performed and causes a state transition to xj.

The probability of getting to state xj after performing action a in state xi is given by the transition function p(xj | xi, a), which is a probability distribution over X, conditional on X × A.

The positive discount factor γ ∈ R with 0 < γ ≤ 1 determines the preference of immediate reward over future reward.
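A minimal sketch of these MDP components, assuming a tabular representation; the class name, array layout, and the 2-state example are illustrative, not from the text:

```python
import numpy as np

# A finite MDP as defined above: N states, a finite action set,
# a transition function p(x_j | x_i, a), a reward function, and a
# discount factor gamma with 0 < gamma < 1.
class FiniteMDP:
    def __init__(self, n_states, n_actions, p, r, gamma=0.9):
        # p[a][i][j] = probability of reaching state j from state i under action a
        self.p = np.asarray(p, dtype=float)   # shape (n_actions, n_states, n_states)
        self.r = np.asarray(r, dtype=float)   # shape (n_states, n_actions)
        self.n_states, self.n_actions = n_states, n_actions
        self.gamma = gamma
        # each row p(. | x_i, a) must be a probability distribution over X
        assert np.allclose(self.p.sum(axis=2), 1.0)

# A 2-state example: action 0 stays in place, action 1 flips the state;
# taking action 1 in state 0 earns reward 1.
p = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
r = [[0.0, 1.0], [0.0, 0.0]]
mdp = FiniteMDP(2, 2, p, r, gamma=0.9)
```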


The aim is, for every state, to choose the action that maximises the reward in the long run, where future rewards are possibly valued less than immediate rewards.

The Value Function, the Action-Value Function and Bellman's Equation

The approach taken by dynamic programming (DP) and reinforcement learning (RL) is to define a value function V : X → R that expresses for each state how much reward we can expect to receive in the long run.
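Spelled out in the notation defined earlier, the optimal value function satisfies Bellman's equation, which in its standard form reads as follows (a sketch; r(x_i, a) denotes the expected reward for performing action a in state x_i, a function implied but not explicitly shown in this excerpt):

```latex
V^*(x_i) = \max_{a \in \mathcal{A}} \Big( r(x_i, a)
    + \gamma \sum_{x_j \in \mathcal{X}} p(x_j \mid x_i, a) \, V^*(x_j) \Big)
```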


Problem Types

The three basic classes of infinite horizon problems are stochastic shortest path problems, discounted problems, and average reward per step problems, all of which are well described by Bertsekas and Tsitsiklis [17].

Here, only discounted problems and stochastic shortest path problems are considered, where for the latter only proper policies that are guaranteed to reach the desired terminal state are assumed.


Dynamic Programming andReinforcement Learning

In this section, some common RL methods are introduced that learn these functions while traversing the state space, without building a model of the transition and reward functions.

These methods are simulation-based approximations to DP methods, and their stability is determined by the stability of the corresponding DP method.


Dynamic Programming Operators

Bellman's Equation is a set of equations that cannot be solved analytically.

Fortunately, several methods have been developed that make finding its solution easier, all of which are based on the DP operators T and Tμ.


Value Iteration and Policy Iteration

The method of value iteration is a straightforward application of the contraction property of T, and is based on applying T repeatedly to an initially arbitrary value vector V until it converges to the optimal value vector V*.

Convergence can only be guaranteed after an infinite number of steps, but the value vector V is usually already close to V* after few iterations.
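Value iteration as described above can be sketched in tabular form; the einsum-based implementation, the stopping tolerance, and the 2-state example MDP are illustrative assumptions:

```python
import numpy as np

def value_iteration(p, r, gamma, tol=1e-8):
    """Repeatedly apply the operator T to an arbitrary initial value
    vector V until the update falls below tol.
    p has shape (A, N, N), r has shape (N, A)."""
    n_actions, n_states, _ = p.shape
    V = np.zeros(n_states)                   # arbitrary initial value vector
    while True:
        # (TV)(x_i) = max_a [ r(x_i, a) + gamma * sum_j p(x_j | x_i, a) V(x_j) ]
        Q = r + gamma * np.einsum('aij,j->ia', p, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # value vector and greedy policy
        V = V_new

# 2-state example: action 1 flips the state; taking action 1 in
# state 0 earns reward 1, all other transitions earn 0.
p = np.array([[[1, 0], [0, 1]], [[0, 1], [1, 0]]], dtype=float)
r = np.array([[0.0, 1.0], [0.0, 0.0]])
V, policy = value_iteration(p, r, gamma=0.5)
# fixed point: V = [4/3, 2/3], greedy policy takes action 1 in state 0
```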


Various variants of these methods exist, such as asynchronous value iteration, which at each application of T only updates a single state of V. Modified policy iteration performs the policy evaluation step by approximating V by T^n V for some small n.

Asynchronous policy iteration mixes asynchronous value iteration with policy iteration by, at each step, either i) updating some states of V by asynchronous value iteration, or ii) improving the policy of some set of states by policy improvement.

Convergence criteria for these variants are given by Bertsekas and Tsitsiklis [17].
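The modified policy iteration variant described above can be sketched as follows. The iteration counts and the 2-state example MDP are illustrative assumptions, not from the text:

```python
import numpy as np

def modified_policy_iteration(p, r, gamma, n_eval=5, iters=50):
    """Modified policy iteration: evaluate the current policy only
    approximately, by applying its evaluation operator n_eval times
    (instead of solving for its value exactly), then improve the
    policy greedily. p has shape (A, N, N), r has shape (N, A)."""
    n_actions, n_states, _ = p.shape
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(iters):
        # approximate policy evaluation: V ~ (T_mu)^n V
        for _ in range(n_eval):
            V = np.array([r[i, policy[i]] + gamma * p[policy[i], i] @ V
                          for i in range(n_states)])
        # policy improvement: greedy with respect to the current V
        Q = r + gamma * np.einsum('aij,j->ia', p, V)
        policy = Q.argmax(axis=1)
    return V, policy

# Same 2-state example: action 1 flips the state, reward 1 for
# taking action 1 in state 0.
p = np.array([[[1, 0], [0, 1]], [[0, 1], [1, 0]]], dtype=float)
r = np.array([[0.0, 1.0], [0.0, 0.0]])
V, policy = modified_policy_iteration(p, r, gamma=0.5)
```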


Approximate Dynamic Programming

If N is large, we prefer to approximate the value function rather than representing the value for each state explicitly.

Approximate value iteration is performed by approximating the value iteration update Vt+1 = TVt by Ṽt+1 = ΠTṼt,


where Π is the approximation operator that, for the used function approximation technique, returns the value function estimate Ṽt+1 that is closest to TṼt.

The only approximation that will be considered is the one most similar to approximate value iteration: the temporal-difference solution, which aims at finding the fixed point of the combined operator ΠT.
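Approximate value iteration with a linear approximation Ṽ = Φw can be sketched with least-squares projection playing the role of Π; the feature matrix, iteration count, and example are illustrative assumptions:

```python
import numpy as np

def approximate_value_iteration(p, r, gamma, phi, iters=200):
    """Approximate value iteration V_{t+1} = Pi T V_t with a linear
    approximation V = phi @ w: apply T to the current estimate, then
    project the result back onto the span of the features by least
    squares. p has shape (A, N, N), r (N, A), phi (N, n_features)."""
    n_actions, n_states, _ = p.shape
    w = np.zeros(phi.shape[1])
    for _ in range(iters):
        V = phi @ w
        TV = (r + gamma * np.einsum('aij,j->ia', p, V)).max(axis=1)
        # Pi: least-squares projection of TV onto the feature space
        w, *_ = np.linalg.lstsq(phi, TV, rcond=None)
    return phi @ w

# With identity features, Pi is the identity and this reduces to
# exact value iteration on the 2-state example MDP.
p = np.array([[[1, 0], [0, 1]], [[0, 1], [1, 0]]], dtype=float)
r = np.array([[0.0, 1.0], [0.0, 0.0]])
V = approximate_value_iteration(p, r, gamma=0.5, phi=np.eye(2))
```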


SARSA(λ)

Coming to the first reinforcement learning algorithm, SARSA stands for State-Action-Reward-State-Action, as SARSA(0) requires only information on the current and next state/action pair and the reward that was received for the transition.

It conceptually performs policy iteration and uses TD(λ) to update its action-value function Q. More specifically, it performs optimistic policy iteration, where, in contrast to standard policy iteration, the policy improvement step is based on an incompletely evaluated policy.
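The SARSA(0) special case described above can be sketched as follows. The tabular Q representation, the ε-greedy exploration scheme, and the `env_step(x, a) -> (x', reward, done)` environment interface are assumptions for illustration, and eligibility traces for λ > 0 are omitted:

```python
import numpy as np

def sarsa_episode(env_step, Q, state, gamma=0.9, alpha=0.1, eps=0.1,
                  rng=np.random.default_rng(0), max_steps=100):
    """Run one episode of SARSA(0) and update the (N, A) table Q in
    place. The target uses the action actually taken in the next
    state, which is what makes SARSA on-policy."""
    def eps_greedy(x):
        if rng.random() < eps:
            return int(rng.integers(Q.shape[1]))
        return int(Q[x].argmax())

    action = eps_greedy(state)
    for _ in range(max_steps):
        next_state, reward, done = env_step(state, action)
        next_action = eps_greedy(next_state)
        # SARSA(0) update: target is r + gamma * Q(x', a')
        target = reward + (0.0 if done else gamma * Q[next_state, next_action])
        Q[state, action] += alpha * (target - Q[state, action])
        if done:
            break
        state, action = next_state, next_action
    return Q
```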


Q-Learning
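The body of this slide was presumably the update equation itself. The standard tabular Q-learning update the heading refers to can be sketched as follows; all parameter names are illustrative:

```python
import numpy as np

def q_learning_update(Q, x, a, reward, x_next, done, gamma=0.9, alpha=0.1):
    """Standard Q-learning update. Unlike SARSA, the target uses the
    maximum over next actions, making Q-learning off-policy: it learns
    the optimal action-value function regardless of which behaviour
    policy generated the transition."""
    target = reward + (0.0 if done else gamma * Q[x_next].max())
    Q[x, a] += alpha * (target - Q[x, a])
    return Q

Q = np.zeros((2, 2))
# a single transition from state 0 to terminal state 1 with reward 1
Q = q_learning_update(Q, 0, 1, 1.0, 1, True)
# Q[0, 1] moved alpha of the way towards the target: 0.1 * 1.0 = 0.1
```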


9.5 Further Issues

Besides the stability concerns when using LCS to perform RL, there are still some further issues to consider, two of which will be discussed in this section:

The learning of long paths, and

How to best handle the explore/exploit dilemma.


9.5.1 Long Path Learning

The problem of long path learning is to find the optimal policy in sequential decision tasks when the solution requires learning of action sequences of substantial length.

While a solution was proposed to handle this problem [12], it was only designed to work for a particular problem class, as will be shown after discussing how XCS fails at long path learning. The classifier set optimality criterion from Chap. 7 might provide better results, but in general, long path learning remains an open problem.

Long path learning is not only an issue for LCS, but for approximate DP and RL in general.


XCS and Long Path Learning

Consider the problem that is shown in Fig. 9.2. The aim is to find the policy that reaches the terminal state x6 from the initial state x1a in the shortest number of steps.

In RL terms, this aim is described by giving a reward of 1 upon reaching the terminal state, and a reward of 0 for all other transitions. The optimal policy is to alternately choose actions 0 and 1, starting with action 1 in state x1a.
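A minimal stand-in for this kind of corridor task can be sketched as follows; this is not the exact construction of Fig. 9.2 (which includes paired states such as x1a and x1b), only an illustration of a task whose optimal policy alternates actions 1, 0, 1, ... and whose reward is 1 only at the terminal state:

```python
def corridor_step(state, action, length=5):
    """Illustrative corridor: states 0..length, start at 0. The
    'correct' action alternates per step (1 at state 0, 0 at state 1,
    and so on); the wrong action sends the agent back to the start.
    Returns (next_state, reward, done); reward 1 only on reaching
    the terminal state."""
    correct = (state + 1) % 2
    if action == correct:
        if state + 1 == length:
            return state + 1, 1.0, True
        return state + 1, 0.0, False
    return 0, 0.0, False

# Following the alternating optimal policy traverses the corridor
# in `length` steps and collects the single terminal reward.
s, total, done = 0, 0.0, False
for _ in range(5):
    s, rew, done = corridor_step(s, (s + 1) % 2)
    total += rew
```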


The optimal value function V over the number of steps to the terminal state is shown in Fig. 9.3(a) for a 15-step corridor finite state world. As can be seen, the difference of the values of V between two adjacent states decreases with the distance from the terminal state.


Using the Relative Error

Barry proposed two preliminary approaches to handle the problem of long path learning in XCS, both based on making the error calculation of a classifier relative to its prediction of the value function [12].

The first approach is to estimate the distance of the matched states to the terminal state and scale the error accordingly, but this approach suffers from the inaccuracy of predicting this distance.


A second, more promising alternative proposed in his study is to scale the measured prediction error by the inverse absolute magnitude of the prediction. The underlying assumption is that the difference in optimal values between two successive states is proportional to the absolute magnitude of these values.
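This second proposal amounts to a one-line error transformation; the function name and the division-by-zero floor are implementation assumptions, not from [12]:

```python
def relative_error(prediction_error, prediction, floor=1e-6):
    """Scale the measured prediction error by the inverse absolute
    magnitude of the prediction, so a classifier far from the terminal
    state (small values, small value differences) is judged by relative
    rather than absolute accuracy. The floor guards against division
    by zero for predictions near 0."""
    return abs(prediction_error) / max(abs(prediction), floor)
```

Under this measure, an absolute error of 0.01 on a prediction of 0.1 counts the same as an error of 0.1 on a prediction of 1.0.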


A Possible Alternative?

It was shown in Sect. 8.3.4 that the optimality criterion that was introduced in Chap. 7 is able to handle problems where the noise differs in different areas of the input space. Given that it is possible to use this criterion in an incremental implementation, will such an implementation be able to perform long path learning?


Let us assume that the optimality criterion causes the size of the area of the input space that is matched by a classifier to be proportional to the level of noise in the data, such that the model is refined in areas where the observations are known to accurately represent the data-generating process.

Considering only measurement noise, when applied to value function approximation this would lead to having more specific classifiers in states where the difference in magnitude of the value function for successive states is low, as in such areas this noise is deemed to be low. Therefore, the optimality criterion should provide an adequate approximation of the optimal value function, even in cases where long