RL with LCS


    Towards Reinforcement Learning with LCS


Having until now concentrated on how LCS can handle regression and classification tasks, this chapter returns to the prime motivator for LCS: sequential decision tasks.





    Problem Definition

The sequential decision tasks that will be considered are the ones describable by a Markov Decision Process (MDP).

Some of the previously used symbols will be assigned a new meaning.



    Problem Definition

Let X be the set of states x ∈ X of the problem domain, which is assumed to be of finite size N, and hence is mapped into the natural numbers N.

In every state x_i ∈ X, an action a out of a finite set A is performed and causes a state transition to x_j. The probability of getting to state x_j after performing action a in state x_i is given by the transition function p(x_j | x_i, a), which is a probability distribution over X, conditional on X × A.

The positive discount factor γ ∈ R with 0 < γ ≤ 1 determines the preference of immediate reward over future reward.
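To make these definitions concrete, the following is a minimal sketch of such a finite MDP in Python; the array names P and r, the example sizes, and the value of gamma are illustrative assumptions rather than quantities from the text.

```python
import numpy as np

# Illustrative sizes: N states and two actions (assumed, not from the text).
N, num_actions = 5, 2
rng = np.random.default_rng(0)

# Transition function p(x_j | x_i, a): for each action a, a row-stochastic
# N x N matrix, so P[a, i, :] is a probability distribution over next states.
P = rng.random((num_actions, N, N))
P /= P.sum(axis=2, keepdims=True)

# Expected reward r(x_i, a) for performing action a in state x_i.
r = rng.random((num_actions, N))

# Discount factor gamma with 0 < gamma <= 1 (here < 1, a discounted problem).
gamma = 0.9
```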


    Problem Definition

The aim is for every state to choose the action that maximises the reward in the long run, where future rewards are possibly valued less than immediate rewards.

The Value Function, the Action-Value Function and Bellman's Equation

The approach taken by dynamic programming (DP) and reinforcement learning (RL) is to define a value function V : X → R that expresses for each state how much reward we can expect to receive in the long run.
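For reference, the value function, the action-value function and Bellman's Equation mentioned above take the following standard forms, with r(x, a) denoting the expected reward and μ a policy; the notation here is an assumption stated in the usual MDP convention, not quoted from the text.

```latex
V^{\mu}(x) = \mathbb{E}\Bigl[\textstyle\sum_{t \ge 0} \gamma^{t}\, r\bigl(x_t, \mu(x_t)\bigr) \,\Big|\, x_0 = x\Bigr],
\qquad
Q^{\mu}(x,a) = r(x,a) + \gamma \sum_{x'} p(x' \mid x, a)\, V^{\mu}(x'),
\qquad
V^{*}(x) = \max_{a} \Bigl[ r(x,a) + \gamma \sum_{x'} p(x' \mid x, a)\, V^{*}(x') \Bigr].
```

The last expression, holding simultaneously for all states x, is the set of equations that the later slides refer to as Bellman's Equation.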


    Problem Types

The three basic classes of infinite horizon problems are stochastic shortest path problems, discounted problems, and average reward per step problems, all of which are well described by Bertsekas and Tsitsiklis [17].

Here, only discounted problems and stochastic shortest path problems are considered, where for the latter only proper policies that are guaranteed to reach the desired terminal state are assumed.


    Dynamic Programming andReinforcement Learning

In this section, some common RL methods are introduced that learn these functions while traversing the state space, without building a model of the transition and reward functions.

These methods are simulation-based approximations to DP methods, and their stability is determined by the stability of the corresponding DP method.


    Dynamic Programming Operators

Bellman's Equation is a set of equations that cannot be solved analytically.

Fortunately, several methods have been developed that make finding its solution easier, all of which are based on the DP operators T and T_μ.
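These two operators are commonly defined as follows, where T_μ is the operator for a fixed policy μ; this is the standard formulation, included as an aid, and the notation is an assumption rather than a quotation from the text.

```latex
(T V)(x) = \max_{a \in \mathcal{A}} \Bigl[ r(x,a) + \gamma \sum_{x'} p(x' \mid x, a)\, V(x') \Bigr],
\qquad
(T_{\mu} V)(x) = r\bigl(x, \mu(x)\bigr) + \gamma \sum_{x'} p\bigl(x' \mid x, \mu(x)\bigr)\, V(x').
```

V* is the unique fixed point of T, and V^μ that of T_μ, which is the property that value iteration and policy iteration exploit.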


    Value Iteration and Policy Iteration

The method of value iteration is a straightforward application of the contraction property of T and is based on applying T repeatedly to an initially arbitrary value vector V until it converges to the optimal value vector V*. Convergence can only be guaranteed after an infinite number of steps, but the value vector V is usually already close to V* after a few iterations.
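A sketch of this procedure for the small illustrative MDP from the earlier code example follows; P, r and gamma are the assumed arrays from that sketch, not quantities given in the text.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Repeatedly apply T to an arbitrary initial V until it stops changing."""
    num_actions, N, _ = P.shape
    V = np.zeros(N)                      # arbitrary initial value vector
    for _ in range(max_iter):
        # (T V)(x) = max_a [ r(x, a) + gamma * sum_x' p(x' | x, a) V(x') ]
        Q = r + gamma * (P @ V)          # shape (num_actions, N)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new
```

Acting greedily with respect to the returned value vector then recovers an (approximately) optimal policy.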


    Value Iteration and Policy Iteration

Various variants of these methods exist, such as asynchronous value iteration, which at each application of T only updates a single state of V. Modified policy iteration performs the policy evaluation step by approximating V_μ by T_μ^n V for some small n.

Asynchronous policy iteration mixes asynchronous value iteration with policy iteration by, at each step, either i) updating some states of V by asynchronous value iteration, or ii) improving the policy for some set of states by policy improvement. Convergence criteria for these variants are given by Bertsekas and Tsitsiklis [17].
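As an illustration of the modified variant, the sketch below evaluates the current policy only approximately, by applying T_μ a small number n of times, before each policy improvement step; it reuses the same assumed P, r and gamma as the earlier sketches.

```python
import numpy as np

def modified_policy_iteration(P, r, gamma, n=5, sweeps=50):
    num_actions, N, _ = P.shape
    V = np.zeros(N)
    policy = np.zeros(N, dtype=int)
    states = np.arange(N)
    for _ in range(sweeps):
        # Approximate policy evaluation: V <- T_mu^n V instead of solving exactly.
        for _ in range(n):
            V = r[policy, states] + gamma * (P[policy, states] @ V)
        # Policy improvement: act greedily with respect to the current V.
        policy = (r + gamma * (P @ V)).argmax(axis=0)
    return V, policy
```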


    Approximate Dynamic Programming

If N is large, we prefer to approximate the value function rather than representing the value for each state explicitly.

Approximate value iteration is performed by approximating the value iteration update V_{t+1} = T V_t by the projected update given below.
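Writing Ṽ_t for the approximate value vector, this update is usually stated as follows; the equation is given here in its standard form as an assumption, since only the surrounding prose is available.

```latex
\tilde{V}_{t+1} = \Pi T \tilde{V}_t ,
```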


    Approximate Dynamic Programming

where Π is the approximation operator that, for the used function approximation technique, returns the value function estimate Ṽ_{t+1} that is closest to T Ṽ_t.

The only approximation that will be considered is the one most similar to approximate value iteration, and is the temporal-difference solution, which aims at finding the fixed point Ṽ = Π T Ṽ.


SARSA(λ)

Coming to the first reinforcement learning algorithm, SARSA stands for State-Action-Reward-State-Action, as SARSA(0) requires only information on the current and next state/action pair and the reward that was received for the transition.

It conceptually performs policy iteration and uses TD(λ) to update its action-value function Q. More specifically, it performs optimistic policy iteration, where in contrast to standard policy iteration the policy improvement step is based on an incompletely evaluated policy.
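As a hedged illustration of the SARSA(0) case described above, the following tabular sketch updates Q from each observed (state, action, reward, next state, next action) tuple; the environment interface (reset()/step()), the step size alpha and the ε-greedy action selection are assumptions made for the sake of a runnable example, not details from the text.

```python
import numpy as np

def sarsa0(env, num_states, num_actions, episodes=500,
           alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))

    def pick(x):
        # epsilon-greedy selection on the current action-value estimate
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(Q[x].argmax())

    for _ in range(episodes):
        x = env.reset()                              # assumed interface
        a = pick(x)
        done = False
        while not done:
            x_next, reward, done = env.step(a)       # assumed interface
            a_next = pick(x_next)
            # State-Action-Reward-State-Action: the TD(0) target uses (x', a')
            target = reward + (0.0 if done else gamma * Q[x_next, a_next])
            Q[x, a] += alpha * (target - Q[x, a])
            x, a = x_next, a_next
    return Q
```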


    Q-Learning
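As a hedged illustration, the standard tabular Q-Learning update, which differs from the SARSA(0) sketch above only in bootstrapping from the maximising action in the next state rather than the action actually chosen, can be sketched as follows; alpha and gamma are again assumed parameters.

```python
def q_learning_update(Q, x, a, reward, x_next, done, alpha=0.1, gamma=0.9):
    # Off-policy target: use max_a' Q(x', a') regardless of the action taken next.
    target = reward + (0.0 if done else gamma * Q[x_next].max())
    Q[x, a] += alpha * (target - Q[x, a])
    return Q
```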


    9.5 Further Issues

Besides the stability concerns when using LCS to perform RL, there are still some further issues to consider, two of which will be discussed in this section:

    The learning of long paths, and

How to best handle the explore/exploit dilemma.


    9.5.1 Long Path Learning

The problem of long path learning is to find the optimal policy in sequential decision tasks when the solution requires learning of action sequences of substantial length.

While a solution was proposed to handle this problem [12], it was only designed to work for a particular problem class, as will be shown after discussing how XCS fails at long path learning. The classifier set optimality criterion from Chap. 7 might provide better results, but in general, long path learning remains an open problem.

Long path learning is not only an issue for LCS, but for approximate DP and RL in general.


    XCS and Long Path Learning

Consider the problem that is shown in Fig. 9.2. The aim is to find the policy that reaches the terminal state x6 from the initial state x1a in the shortest number of steps.

In RL terms, this aim is described by giving a reward of 1 upon reaching the terminal state, and a reward of 0 for all other transitions. The optimal policy is to alternately choose actions 0 and 1, starting with action 1 in state x1a.


    XCS and Long Path Learning

The optimal value function V over the number of steps to the terminal state is shown in Fig. 9.3(a) for a 15-step corridor finite state world. As can be seen, the difference of the values of V between two adjacent states decreases with the distance from the terminal state.
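To see why, note that with a reward of 1 on reaching the terminal state, 0 otherwise, and discount factor γ < 1, the state that is k steps away from the terminal state has optimal value γ^(k−1); writing V*_k for this value (a hypothetical shorthand, not notation from the text), a standard calculation gives

```latex
V^{*}_{k} = \gamma^{\,k-1},
\qquad
V^{*}_{k} - V^{*}_{k+1} = \gamma^{\,k-1}\,(1 - \gamma),
```

so the gap between adjacent states shrinks geometrically with the distance from the terminal state, and is proportional to the magnitude of the values themselves, which is the assumption exploited by the relative-error approaches discussed below.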


    Using the Relative Error

Barry proposed two preliminary approaches to handle the problem of long path learning in XCS, both based on making the error calculation of a classifier relative to its prediction of the value function [12].

The first approach is to estimate the distance of the matched states to the terminal state and scale the error accordingly, but this approach suffers from the inaccuracy of predicting this distance.


    Using the Relative Error

A second, more promising alternative proposed in his study is to scale the measured prediction error by the inverse absolute magnitude of the prediction. The underlying assumption is that the difference in optimal values between two successive states is proportional to the absolute magnitude of these values.


    A Possible Alternative?

It was shown in Sect. 8.3.4 that the optimality criterion that was introduced in Chap. 7 is able to handle problems where the noise differs in different areas of the input space. Given that it is possible to use this criterion in an incremental implementation, will such an implementation be able to perform long path learning?


    A Possible Alternative?

Let us assume that the optimality criterion causes the size of the area of the input space that is matched by a classifier to be proportional to the level of noise in the data, such that the model is refined in areas where the observations are known to accurately represent the data-generating process. Considering only measurement noise, when applied to value function approximation this would lead to having more specific classifiers in states where the difference in magnitude of the value function for successive states is low, as in such areas this noise is deemed to be low. Therefore, the optimality criterion should provide an adequate approximation of the optimal value function, even in cases where long action sequences need to be represented.


    9.5.2 Exploration and Exploitation

Maintaining the balance between exploiting current knowledge to guide action selection and exploring the state space to gain new knowledge is an essential problem for reinforcement learning.

Too much exploration implies the frequent selection of sub-optimal actions and causes the accumulated reward to decrease.

Too much emphasis on exploitation of current knowledge, on the other hand, might cause the agent to settle on a sub-optimal policy due to insufficient knowledge of the reward distribution [228, 209]. Keeping a good balance is important, as it has a significant impact on the performance of RL methods.


    9.5.2 Exploration and Exploitation

There are several approaches to handling exploration and exploitation: one can choose a sub-optimal action every now and then, independent of the certainty of the available knowledge, or one can take this certainty into account to choose actions that increase it. A variant of the latter is to use Bayesian statistics to model this uncertainty, which seems the most elegant solution but is unfortunately also the least tractable.
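The first of these approaches is commonly realised as ε-greedy action selection; a minimal sketch follows, in which the value of ε and the tabular Q array are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(Q, x, epsilon=0.1, rng=np.random.default_rng()):
    # With small probability epsilon, ignore the current knowledge and explore;
    # otherwise exploit the current action-value estimate Q for state x.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(Q[x].argmax())
```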


    9.6 Summary

Despite sequential decision tasks being the prime motivator for LCS, they are still the ones which LCS handle least successfully. This chapter provides a primer on how to use dynamic programming and reinforcement learning to handle such tasks, and on how LCS can be combined with either approach from first principles.


    9.6 Summary

An essential part of the LCS type discussed in this book is that classifiers are trained independently. This is not completely true when using LCS with reinforcement learning, as the target values that the classifiers are trained on are based on the global prediction, which is formed by all matching classifiers in combination. In that sense, classifiers interact when forming their action-value function estimates. Still, besides combining classifier predictions to form the target values, independent classifier training still forms the basis of this model type, even when used in combination with RL.


    9.6 Summary

Overall, using LCS to approximate the value or action-value function in RL is appealing, as LCS dynamically adjust to the form of this function and thus might provide a better approximation than standard function approximation techniques. It should be noted, however, that the field of RL is moving quickly, and that Q-Learning is by far not the best method that is currently available. Hence, in order for LCS to be a competitive approach to sequential decision tasks, they also need to keep pace with new developments in RL, some of which were discussed when detailing the exploration/exploitation dilemma that is an essential component of RL.

In summary, it is obvious that there is still plenty of work to be done until LCS can provide the same formal development as RL currently does. Nonetheless, the initial formal basis is provided in this chapter, upon which other research can build further analysis and improvements to how LCS handle sequential decision tasks effectively, competitively, and with high reliability.