Collaborative Reinforcement Learning

Collaborative Reinforcement Learning. Presented by Dr. Ying Lu. Credits: Reinforcement Learning: A User's Guide, Bill Smart at ICAC 2005.

  • Collaborative Reinforcement Learning

    Presented by Dr. Ying Lu

  • Credits

    Reinforcement Learning: A User's Guide. Bill Smart at ICAC 2005.

    Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill, "Collaborative Reinforcement Learning of Autonomic Behaviour", 2nd International Workshop on Self-Adaptive and Autonomic Computing Systems, pages 700-704, 2004. [Winner Best Paper Award].

  • What is RL?

    a way of programming agents by reward and punishment without needing to specify how the task is to be achieved

    [Kaelbling, Littman, & Moore, 96]

  • Basic RL Model

    1. Observe state, $s_t$
    2. Decide on an action, $a_t$
    3. Perform action
    4. Observe new state, $s_{t+1}$
    5. Observe reward, $r_{t+1}$
    6. Learn from experience
    7. Repeat

    Goal: find a control policy that will maximize the observed rewards over the lifetime of the agent.

    [Figure: diagram with labels A, S, R]
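    The observe, decide, act, learn cycle can be sketched in Python as follows. This is a minimal illustration rather than code from the presentation: the two-state environment `toy_step`, the random action choice, and the running-average "learning" are all hypothetical stand-ins chosen to keep the example short.

```python
import random

def run_agent(env_step, choose_action, learn, start_state, num_steps):
    """Generic observe / decide / act / observe / learn loop."""
    state = start_state                                # observe state s_t
    total_reward = 0.0
    for _ in range(num_steps):
        action = choose_action(state)                  # decide on an action a_t
        next_state, reward = env_step(state, action)   # perform it; observe s_{t+1}, r_{t+1}
        learn(state, action, reward, next_state)       # learn from experience
        total_reward += reward
        state = next_state                             # repeat from the new state
    return total_reward

# Hypothetical toy environment: the action directly selects the next state,
# and entering state 1 pays a reward of 1.
def toy_step(state, action):
    next_state = action
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

# Simplest possible "learning": track the average reward seen for each action.
totals = {0: 0.0, 1: 0.0}
counts = {0: 0, 1: 0}

def learn(state, action, reward, next_state):
    totals[action] += reward
    counts[action] += 1

rng = random.Random(0)  # fixed seed so the run is repeatable

def choose_action(state):
    return rng.choice([0, 1])  # purely random exploration

total = run_agent(toy_step, choose_action, learn, start_state=0, num_steps=100)
averages = {a: totals[a] / counts[a] for a in totals if counts[a] > 0}
```

    After the run, `averages` shows that action 1 is the rewarding one; a real agent would use such estimates to bias its future choices instead of acting at random.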

  • An Example: Gridworld

    Canonical RL domain
      States are grid cells
      4 actions: N, S, E, W
      Reward of +1 for entering the top right cell
      -0.01 for every other move

    Maximizing the sum of rewards ⇒ shortest path
      In this instance
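    A small sketch of why maximizing the sum of rewards gives the shortest path here. The 4x3 grid size, the start cell, and the rule that bumping into a wall leaves the agent in place are assumptions; the slide only gives the actions and the rewards.

```python
# Hypothetical 4x3 grid; the slide does not state the grid size.
WIDTH, HEIGHT = 4, 3
GOAL = (WIDTH - 1, HEIGHT - 1)  # top right cell
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def step(cell, action):
    """Apply one move; moving into a wall leaves the agent where it is (assumed)."""
    dx, dy = MOVES[action]
    x = min(max(cell[0] + dx, 0), WIDTH - 1)
    y = min(max(cell[1] + dy, 0), HEIGHT - 1)
    next_cell = (x, y)
    reward = 1.0 if next_cell == GOAL else -0.01
    return next_cell, reward

def total_reward(start, actions):
    """Sum the rewards along a sequence of moves, stopping at the goal."""
    cell, total = start, 0.0
    for action in actions:
        cell, reward = step(cell, action)
        total += reward
        if cell == GOAL:
            break
    return total

shortest = total_reward((0, 0), "EEENN")    # 5 moves
detour = total_reward((0, 0), "NSEEENN")    # 7 moves, two of them wasted
```

    Every extra move costs 0.01, so the 5-move path earns more than the 7-move detour.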

  • The Promise of RL

    Specify what to do, but not how to do it
      Through the reward function
      Learning fills in the details

    Better final solutions
      Based on actual experiences, not programmer assumptions

    Less (human) time needed for a good solution

  • Mathematics of RL

    Before we talk about RL, we need to cover some background material
      Some simple decision theory
      Markov Decision Processes
      Value functions

  • Making Single Decisions

    Single decision to be made
      Multiple discrete actions
      Each action has a reward associated with it
    Goal is to maximize reward
      Not hard: just pick the action with the largest reward
    State 0 has a value of 2
      Sum of rewards from taking the best action from the state

    [Figure: states 0, 1, 2; actions A, B; rewards 2, 1]
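    The single-decision case reduces to one `max`. In this sketch the pairing of action letters with rewards is an assumption, since the slide figure did not survive extraction; only the reward values 2 and 1 and the resulting state value of 2 come from the slide.

```python
# Rewards for the two actions available in state 0.
# Which letter earns which reward is assumed, not taken from the slide.
rewards = {"A": 1, "B": 2}

best_action = max(rewards, key=rewards.get)  # action with the largest reward
state_value = rewards[best_action]           # value of state 0
```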

  • Markov Decision Processes

    We can generalize the previous example to multiple sequential decisions
      Each decision affects subsequent decisions

    This is formally modeled by a Markov Decision Process (MDP)

    [Figure: example MDP with states 0 to 5, actions A and B, and transition rewards 1, 2, 10 and -1000]

  • Markov Decision Processes

    Formally, an MDP is
      A set of states, $S = \{s_1, s_2, \ldots, s_n\}$
      A set of actions, $A = \{a_1, a_2, \ldots, a_m\}$
      A reward function, $R: S \times A \times S \to \mathbb{R}$
      A transition function

    We want to learn a policy, $\pi: S \to A$
      Maximize the sum of rewards we see over our lifetime
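    One way to hold such an MDP in code is a single table keyed by state and action. The sketch below encodes the six-state example used in these slides. The edges and rewards are reconstructed from the path sums on the "Comparing Policies" slide. The action letters on each edge are assumptions, and the transitions are treated as deterministic.

```python
from typing import Dict, Tuple

State = int
Action = str

# transitions[(s, a)] = (next state, reward): deterministic transition and reward.
# Action letters per edge are assumed; only the paths and rewards come from the slides.
transitions: Dict[Tuple[State, Action], Tuple[State, float]] = {
    (0, "A"): (1, 1), (0, "B"): (2, 2),
    (1, "A"): (3, 1), (1, "B"): (4, 1),
    (2, "A"): (4, -1000),
    (3, "A"): (5, 1),
    (4, "A"): (5, 10),
}

states = sorted({s for s, _ in transitions} | {s2 for s2, _ in transitions.values()})
actions = sorted({a for _, a in transitions})

def available_actions(s: State):
    """Actions that can be taken from state s (none for the terminal state)."""
    return [a for (s0, a) in transitions if s0 == s]

# A policy maps each non-terminal state to one of its available actions.
policy: Dict[State, Action] = {0: "A", 1: "B", 2: "A", 3: "A", 4: "A"}
```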

  • Policies

    There are 3 policies for this MDP
      0 → 1 → 3 → 5
      0 → 1 → 4 → 5
      0 → 2 → 4 → 5

    Which is the best one?

  • Comparing Policies

    Order policies by how much reward they see
      0 → 1 → 3 → 5 = 1 + 1 + 1 = 3
      0 → 1 → 4 → 5 = 1 + 1 + 10 = 12
      0 → 2 → 4 → 5 = 2 - 1000 + 10 = -988
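    The comparison can be checked mechanically. This sketch stores the reward of each edge, read off the three sums on this slide, and adds them along each path; the function and variable names are illustrative only.

```python
# Reward for moving from one state to the next, taken from the sums on this slide.
reward = {
    (0, 1): 1, (0, 2): 2,
    (1, 3): 1, (1, 4): 1,
    (2, 4): -1000,
    (3, 5): 1, (4, 5): 10,
}

def path_return(path):
    """Sum the rewards of consecutive state pairs along the path."""
    return sum(reward[(s, s_next)] for s, s_next in zip(path, path[1:]))

paths = [(0, 1, 3, 5), (0, 1, 4, 5), (0, 2, 4, 5)]
returns = {p: path_return(p) for p in paths}
best = max(returns, key=returns.get)  # path with the largest total reward
```

    The path through state 2 starts with the larger immediate reward but ends up far worse, which is why single-step greedy choice is not enough in the sequential case.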

  • Value Functions

    We can define value without specifying the policy
      Specify the value of taking action a from state s and then performing optimally
      This is the state-action value function, Q

    How do you tell which action to take from each state?

  • Value Functions

    So, we have the value function
      $Q(s, a) = R(s, a, s') + \max_{a'} Q(s', a')$

    In the form of
      Next reward plus the best I can do from the next state

    These extend to probabilistic actions

    $s'$ is the next state
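    The recurrence "next reward plus the best I can do from the next state" can be evaluated directly on the six-state example MDP. This sketch assumes deterministic transitions, assumed action letters on each edge, and a value of 0 for the terminal state, none of which the slide states explicitly.

```python
from functools import lru_cache

# (state, action) -> (next state, reward); action letters are assumed.
transitions = {
    (0, "A"): (1, 1), (0, "B"): (2, 2),
    (1, "A"): (3, 1), (1, "B"): (4, 1),
    (2, "A"): (4, -1000),
    (3, "A"): (5, 1),
    (4, "A"): (5, 10),
}

@lru_cache(maxsize=None)
def q(s, a):
    """Q(s, a) = R(s, a, s') + max over a' of Q(s', a')."""
    next_state, r = transitions[(s, a)]
    next_actions = [a2 for (s2, a2) in transitions if s2 == next_state]
    # A terminal state has no actions, so nothing more can be earned from it.
    best_next = max((q(next_state, a2) for a2 in next_actions), default=0)
    return r + best_next

q_table = {(s, a): q(s, a) for (s, a) in transitions}
```

    From state 0 this gives 12 for the action leading to state 1 and -988 for the action leading to state 2, matching the best path sums through each successor.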

  • Getting the Policy

    If we have the value function, then finding the best policy is easy
      $\pi(s) = \arg\max_a Q(s, a)$
    We're looking for the optimal policy, $\pi^*(s)$
      No policy generates more reward than $\pi^*$
      Optimal policy defines optimal value functions
    The easiest way to learn the optimal policy is to learn the optimal value function first
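    Reading the policy off a Q table is a per-state arg max. The Q-values below were worked out by hand for the six-state example MDP under the assumption of deterministic transitions and assumed action letters; `greedy_policy` is an illustrative name.

```python
# Hand-computed Q-values for the example MDP (action letters are assumed).
q_table = {
    (0, "A"): 12, (0, "B"): -988,
    (1, "A"): 2, (1, "B"): 11,
    (2, "A"): -990,
    (3, "A"): 1,
    (4, "A"): 10,
}

def greedy_policy(q_table):
    """For each state, keep the action with the largest Q-value."""
    best = {}
    for (s, a), value in q_table.items():
        if s not in best or value > q_table[(s, best[s])]:
            best[s] = a
    return best

policy = greedy_policy(q_table)
```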

  • Collaborative Reinforcement Learning to Adaptively Optimize MANET Routing

    Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill

  • Overview

    Building autonomic distributed systems with self-* properties
      Self-Organizing
      Self-Healing
      Self-Optimizing
    Add collaborative learning mechanism to self-adaptive component model
    Improved ad-hoc routing protocol

  • Introduction

    Autonomous distributed systems will consist of interacting components free from human interference
    Existing top-down management and programming solutions require too much global state
    Bottom-up, decentralized collection of components that make their own decisions based on local information
    System-wide self-* behavior emerges from interactions

  • Self-* Behavior

    Self-adaptive components that change structure and/or behavior at run-time, adapting to
      discovered faults
      reduced performance

    Requires active monitoring of component states and external dependencies

  • Self-* Distributed Systems using Distributed (collaborative) Reinforcement Learning

    For complex systems, programmers cannot be expected to describe all conditions
    Self-adaptive behavior learnt by components
    Decentralized co-ordination of components to support system-wide properties
    Distributed Reinforcement Learning (DRL) is an extension to RL and uses neighbor interactions only

  • Model-Based Reinforcement Learning

    1. Action Reward
    2. State Transition Model
    3. Next State Reward

    Markov Decision Process = {States}, {Actions}, R(States, Actions), P(States, Actions, States)

    [Figure: "Adaptation Contract" diagram with labels action ($a_t$), reward ($r_t$), state ($s_t$)]
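    A minimal sketch of how the three listed parts could fit together: the agent estimates R(s, a) and P(s' | s, a) from counts of its own experience and combines them with the value of the next state. The class name, the discount factor `GAMMA`, and the count-based estimates are assumptions for illustration, not details given on the slide.

```python
from collections import defaultdict

GAMMA = 0.9  # hypothetical discount factor; the slide does not give one

class ModelBasedLearner:
    def __init__(self):
        self.visits = defaultdict(int)                             # (s, a) -> times tried
        self.reward_sum = defaultdict(float)                       # (s, a) -> summed reward
        self.next_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> s' -> count

    def observe(self, s, a, r, s_next):
        """Record one experienced transition."""
        self.visits[(s, a)] += 1
        self.reward_sum[(s, a)] += r
        self.next_counts[(s, a)][s_next] += 1

    def reward(self, s, a):
        """1. Action reward: average reward seen for (s, a); requires at least one visit."""
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

    def transition_prob(self, s, a, s_next):
        """2. State transition model: observed frequency of s' after (s, a)."""
        return self.next_counts[(s, a)][s_next] / self.visits[(s, a)]

    def q_value(self, s, a, v):
        """Combine both with 3. the next-state values v (unknown states count as 0)."""
        expected_next = sum(
            self.transition_prob(s, a, s2) * v.get(s2, 0.0)
            for s2 in self.next_counts[(s, a)]
        )
        return self.reward(s, a) + GAMMA * expected_next

learner = ModelBasedLearner()
for outcome in ["good", "good", "good", "bad"]:
    learner.observe("s", "a", 1.0 if outcome == "good" else 0.0, outcome)

q_sa = learner.q_value("s", "a", {"good": 10.0, "bad": 0.0})
```

    With three good outcomes out of four, the estimated reward is 0.75, the estimated probability of reaching "good" is 0.75, and the resulting Q-value is 0.75 + 0.9 × 7.5 = 7.5.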


  • Decentralised System Optimisation

    Coordinating the solution to a set of Discrete Optimisation Problems (DOPs)
      Components have a Partial System View
      Coordination Actions
        Actions = {delegation} ∪ {DOP actions} ∪ {discovery}
      Connection Costs

    [Figure label: Causally-Connected States]


  • Collaborative Reinforcement Learning

    Advertisement
      Update Partial Views of Neighbours

    Decay
      Negative Feedback on State Values in the Absence of Advertisements

    [Figure labels: Action Reward, State Transition Model, Cached Neighbour's V-value, Connection Cost]
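    A rough sketch of how advertisement and decay might interact with the four labeled terms. The slide's actual update formula did not survive extraction, so everything here is an assumption: values are treated as positive (higher is better), a cached value shrinks exponentially with the time since the last advertisement, and the value of delegating to a neighbour is the action reward plus the success probability times the decayed neighbour value net of the connection cost.

```python
class CRLAgent:
    """One agent holding a cached, decaying view of its neighbours' V-values."""

    def __init__(self, decay_rate=0.5):
        self.decay_rate = decay_rate  # hypothetical per-time-unit decay factor in (0, 1)
        self.cache = {}               # neighbour -> (advertised V-value, time received)

    def receive_advertisement(self, neighbour, v_value, now):
        """Advertisement: refresh the partial view of this neighbour."""
        self.cache[neighbour] = (v_value, now)

    def decayed_v(self, neighbour, now):
        """Decay: the longer since the last advertisement, the less the value counts."""
        v_value, received_at = self.cache[neighbour]
        return v_value * self.decay_rate ** (now - received_at)

    def delegation_value(self, neighbour, action_reward, success_prob, connection_cost, now):
        """Assumed combination of the four labeled terms for one possible next state."""
        return action_reward + success_prob * (self.decayed_v(neighbour, now) - connection_cost)

agent = CRLAgent(decay_rate=0.5)
agent.receive_advertisement("n1", 8.0, now=0)
fresh = agent.decayed_v("n1", now=0)   # just advertised
stale = agent.decayed_v("n1", now=2)   # two time units without an advertisement
value = agent.delegation_value("n1", action_reward=0.0, success_prob=0.5,
                               connection_cost=1.0, now=2)
```

    Without fresh advertisements the neighbour looks steadily less attractive, which is the negative feedback the slide describes; a new advertisement resets the cached value.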

  • Adaptation in CRL System

    A feedback process in response to
      Changes in the optimal policy of any RL agent
      Changes in the system environment
      The passing of time

  • SAMPLE: Ad-hoc Routing using DRL

    Probabilistic ad-hoc routing protocol based on DRL
      Adaptation of network traffic around areas of congestion
      Exploitation of stable routes

    Routing decisions based on local information and information obtained from neighbors

    Outperforms Ad-hoc On Demand Distance Vector Routing (AODV) and Dynamic Source Routing (DSR)

  • SAMPLE: A CRL System (I)

  • SAMPLE: A CRL System (II)

    Instead of always choosing the neighbor with the best Q value, i.e., taking the delegation action

    $a = \arg\max_a Q_i(B, a)$,

    a neighbor is chosen probabilistically
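    The slide does not say which distribution is used for the probabilistic choice. The sketch below assumes Boltzmann (softmax) selection, a common option in which neighbours with higher Q-values are picked more often and a `temperature` parameter controls how close the behaviour is to the greedy arg max. The function names and the sample Q-values are hypothetical.

```python
import math
import random

def boltzmann_probabilities(q_values, temperature):
    """Softmax over Q-values: P(a) is proportional to exp(Q(a) / temperature)."""
    best = max(q_values.values())
    # Subtracting the maximum avoids overflow in exp() and leaves the ratios unchanged.
    weights = {a: math.exp((q - best) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def choose_neighbour(q_values, temperature, rng=random):
    """Sample one neighbour according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, temperature)
    neighbours = list(probs)
    return rng.choices(neighbours, weights=[probs[n] for n in neighbours], k=1)[0]

# Hypothetical Q-values for delegating to two neighbours.
q_values = {"n1": -1.0, "n2": -3.0}
probs = boltzmann_probabilities(q_values, temperature=1.0)
nearly_greedy = boltzmann_probabilities(q_values, temperature=0.01)
picked = choose_neighbour(q_values, temperature=1.0, rng=random.Random(0))
```

    A high temperature spreads traffic across neighbours and keeps exploring alternatives; a low temperature almost always exploits the best-known route.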

  • SAMPLE: A CRL System (III)

    $P_i(s' \mid s, a_j) = E(C_S / C_A)$
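    One reading of this estimate, assumed here because the slide does not define the symbols, is that $C_S$ counts successful transmissions to a neighbour and $C_A$ counts attempted ones, so their ratio estimates the probability that a delegation succeeds. The class name and the prior used before any attempt are hypothetical.

```python
class LinkEstimator:
    """Estimate the success probability of sending to one neighbour from counts."""

    def __init__(self):
        self.attempted = 0   # C_A: transmissions attempted (assumed meaning)
        self.successful = 0  # C_S: transmissions that succeeded (assumed meaning)

    def record(self, success):
        """Record the outcome of one transmission attempt."""
        self.attempted += 1
        if success:
            self.successful += 1

    def success_probability(self, prior=0.5):
        """Ratio C_S / C_A, falling back to a hypothetical prior with no data."""
        if self.attempted == 0:
            return prior
        return self.successful / self.attempted

link = LinkEstimator()
before = link.success_probability()
for outcome in [True, True, True, False]:
    link.record(outcome)
after = link.success_probability()
```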

  • SAMPLE: A CRL System (IV)

  • Performance

    Metric:
      Maximize throughput: ratio of delivered packets to undelivered packets

      Minimize number of transmissions required per packet sent

    Figures 5-10

  • Questions/Discussions

    Architecture meta model (AMM)

    Build a model of the system as an MDP.
    Used to solve Discrete Optimisation Problems: given my current state, what action should I execute to maximise my reward?
    Assumptions: Static Environment, Single Agent,

    1. Evaluative Feedback from Actions
    2. Environmental Feedback from State Transition Model
    3. Temporal Feedback

    Causally-connected states as Partial View

    Compare:
      Load balancing: solved at potentially any node
      Routing: solved at only a single node in the network
      Service-oriented routing: solved at potentially more than one node in the network
    No assumption of homogeneous agents
    Agents delegate the solution to a DOP to a neighbour when the estimated cost of solving it locally is higher than the cost of a neighbour solving the DOP.

    Determine Optimal Action to Make Using Local Partial View

    Requirement for continuous flow of data


