
## Collaborative Reinforcement Learning

Presented by Dr. Ying Lu

## Credits

Reinforcement Learning: A User's Guide. Bill Smart at ICAC 2005

Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill, "Collaborative Reinforcement Learning of Autonomic Behaviour", 2nd International Workshop on Self-Adaptive and Autonomic Computing Systems, pages 700-704, 2004. [Winner Best Paper Award].

## What is RL?

"... a way of programming agents by reward and punishment without needing to specify how the task is to be achieved" [Kaelbling, Littman, & Moore, 96]

## Basic RL Model

1. Observe state, s_t
2. Decide on an action, a_t
3. Perform the action
4. Observe the new state, s_{t+1}
5. Observe the reward, r_{t+1}
6. Learn from the experience
7. Repeat

Goal: find a control policy that maximizes the observed rewards over the lifetime of the agent.
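The loop above can be sketched in a few lines of Python. The five-state corridor environment and all names here are illustrative inventions, not from the slides; the point is only the observe/act/learn cycle.

```python
class ChainEnv:
    """Toy 1-D corridor: states 0..4, episode ends on entering state 4.
    Reward is +1 for entering state 4 and -0.01 for every other move."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        # a is -1 (move left) or +1 (move right)
        self.s = max(0, min(4, self.s + a))
        done = (self.s == 4)
        reward = 1.0 if done else -0.01
        return self.s, reward, done

def run_episode(env, policy):
    s = env.reset()                  # observe state s_t
    total, done = 0.0, False
    while not done:
        a = policy(s)                # decide on an action a_t
        s, r, done = env.step(a)     # perform it; observe s_{t+1}, r_{t+1}
        total += r                   # (a learning agent would update itself here)
    return total

print(run_episode(ChainEnv(), lambda s: +1))  # always move right
```

With the always-right policy the agent takes three -0.01 steps and one +1 step, so the episode return is 0.97.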

## An Example: Gridworld

Canonical RL domain:

- States are grid cells
- 4 actions: N, S, E, W
- Reward of +1 for entering the top-right cell, -0.01 for every other move

Maximizing the sum of rewards is, in this instance, equivalent to finding the shortest path.
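This equivalence can be checked directly with value iteration. A sketch, with assumptions the slide does not state: a 3x3 grid, deterministic moves, and a terminal top-right cell.

```python
# Value iteration on the gridworld: +1 for entering the terminal
# top-right cell, -0.01 for every other move (walls keep you in place).
W = H = 3
GOAL = (W - 1, H - 1)                        # top-right cell
ACTIONS = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

def step(s, a):
    dx, dy = ACTIONS[a]
    s2 = (min(max(s[0] + dx, 0), W - 1), min(max(s[1] + dy, 0), H - 1))
    return s2, (1.0 if s2 == GOAL else -0.01)

V = {(x, y): 0.0 for x in range(W) for y in range(H)}
for _ in range(50):                          # sweep until values settle
    for s in V:
        if s != GOAL:                        # terminal state keeps V = 0
            V[s] = max(r + V[s2] for s2, r in (step(s, a) for a in ACTIONS))

# The greedy policy from the bottom-left corner follows a shortest path:
s, path = (0, 0), [(0, 0)]
while s != GOAL:
    s, _ = max((step(s, a) for a in ACTIONS), key=lambda t: t[1] + V[t[0]])
    path.append(s)
print(len(path) - 1)                         # 4 moves = Manhattan distance
```

Because every non-goal move costs 0.01, any longer route strictly lowers the return, which is why reward maximization picks out the shortest path here.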

## The Promise of RL

Specify what to do, but not how to do it, through the reward function; learning fills in the details.

- Better final solutions, based on actual experiences rather than programmer assumptions
- Less (human) time needed for a good solution

## Mathematics of RL

Before we talk about RL, we need to cover some background material:

- Some simple decision theory
- Markov Decision Processes
- Value functions

## Making Single Decisions

- A single decision to be made
- Multiple discrete actions, each with a reward associated with it
- Goal is to maximize reward
- Not hard: just pick the action with the largest reward

In the slide's diagram, action A from state 0 yields reward 2 and action B yields reward 1, so state 0 has a value of 2: the sum of rewards from taking the best action from the state.

## Markov Decision Processes

We can generalize the previous example to multiple sequential decisions, where each decision affects subsequent decisions.

This is formally modeled by a Markov Decision Process (MDP).

## Markov Decision Processes

Formally, an MDP is:

- A set of states, S = {s_1, s_2, ..., s_n}
- A set of actions, A = {a_1, a_2, ..., a_m}
- A reward function, R: S × A × S → ℝ
- A transition function, T: S × A → S

We want to learn a policy, π: S → A, that maximizes the sum of rewards we see over our lifetime.

## Policies

There are 3 policies for this MDP:

1. 0 → 1 → 3 → 5
2. 0 → 1 → 4 → 5
3. 0 → 2 → 4 → 5

Which is the best one?

## Comparing Policies

Order policies by how much reward they see:

1. 0 → 1 → 3 → 5 = 1 + 1 + 1 = 3
2. 0 → 1 → 4 → 5 = 1 + 1 + 10 = 12
3. 0 → 2 → 4 → 5 = 2 - 1000 + 10 = -988

## Value Functions

How do you tell which action to take from each state? We can define value without specifying the policy: specify the value of taking action a from state s and then performing optimally afterwards. This is the state-action value function, Q(s, a).

## Value Functions

So, we have the value function

Q(s, a) = R(s, a, s′) + max_{a′} Q(s′, a′)

where s′ is the next state. In words: the next reward plus the best I can do from the next state. These definitions extend to probabilistic actions.

## Getting the Policy

If we have the value function, then finding the best policy is easy:

π(s) = arg max_a Q(s, a)

We're looking for the optimal policy, π*(s):

- No policy generates more reward than π*
- The optimal policy defines the optimal value function

The easiest way to learn the optimal policy is to learn the optimal value function first.
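The Q recursion and the arg max can be checked on the six-state MDP from the earlier slides. The (reward, successor) table below matches the policy sums computed above (3, 12, -988); the A/B action labels are partly an assumption, since the slide diagram did not survive extraction.

```python
# (reward, next state) for each (state, action); state 5 is terminal.
MODEL = {(0, 'A'): (1, 1), (0, 'B'): (2, 2),
         (1, 'A'): (1, 3), (1, 'B'): (1, 4),
         (2, 'A'): (-1000, 4),
         (3, 'A'): (1, 5),
         (4, 'A'): (10, 5)}

Q = {}

def best(s):
    # max_a Q(s, a), taken as 0 at the terminal state
    vals = [q for (st, _), q in Q.items() if st == s]
    return max(vals) if vals else 0

for s in (4, 3, 2, 1, 0):                 # back up from the terminal state
    for (st, a), (r, s2) in MODEL.items():
        if st == s:
            Q[(st, a)] = r + best(s2)     # Q(s,a) = R + max_a' Q(s',a')

policy = {s: max(('A', 'B'), key=lambda a: Q.get((s, a), float('-inf')))
          for s in (0, 1)}
print(Q[(0, 'A')], Q[(0, 'B')], policy)
```

The backup gives Q(0, A) = 12 and Q(0, B) = -988, and the greedy policy takes the route 0 → 1 → 4 → 5, agreeing with the best policy found by brute-force comparison.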

## Collaborative Reinforcement Learning to Adaptively Optimize MANET Routing

Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill

## Overview

- Building autonomic distributed systems with self-* properties: self-organizing, self-healing, self-optimizing
- Add a collaborative learning mechanism to a self-adaptive component model
- Improved ad-hoc routing protocol

## Introduction

- Autonomous distributed systems will consist of interacting components operating free from human interference
- Existing top-down management and programming solutions require too much global state
- Instead: a bottom-up, decentralized collection of components that make their own decisions based on local information
- System-wide self-* behavior emerges from their interactions

## Self-* Behavior

Self-adaptive components change structure and/or behavior at run-time, adapting to discovered faults and reduced performance. This requires active monitoring of component states and external dependencies.

## Self-* Distributed Systems using Distributed (Collaborative) Reinforcement Learning

- For complex systems, programmers cannot be expected to describe all conditions
- Self-adaptive behavior is learnt by components
- Decentralized coordination of components supports system-wide properties
- Distributed Reinforcement Learning (DRL) is an extension to RL that uses neighbor interactions only

## Model-Based Reinforcement Learning

1. Action reward
2. State transition model
3. Next-state reward

Markov Decision Process = ({States}, {Actions}, R(States, Actions), P(States, Actions, States))

[Figure: a self-adaptive component with its architecture meta model (AMM), adaptation contract, and MDP; the agent observes state s_t and reward r_t, performs action a_t, and receives s_{t+1} and r_{t+1}]

## Decentralised System Optimisation

- Coordinating the solution to a set of Discrete Optimisation Problems (DOPs)
- Components have a partial system view
- Coordination actions: Actions = {delegation} ∪ {DOP actions} ∪ {discovery}
- Connection costs

[Figure: delegation from component A to neighbours B and C, whose states are causally connected]

## Collaborative Reinforcement Learning

- Advertisement: update the partial views of neighbours
- Decay: negative feedback on state values in the absence of advertisements

A component's Q-values combine the action reward, the state transition model, the cached V-value of the neighbour, and the connection cost.

## Adaptation in a CRL System

A feedback process adapts to:

- Changes in the optimal policy of any RL agent
- Changes in the system environment
- The passing of time

## SAMPLE: Ad-hoc Routing using DRL

- Probabilistic ad-hoc routing protocol based on DRL
- Adapts network traffic around areas of congestion
- Exploits stable routes
- Routing decisions are based on local information and information obtained from neighbors
- Outperforms Ad-hoc On-Demand Distance Vector routing (AODV) and Dynamic Source Routing (DSR)

## SAMPLE: A CRL System (I)

## SAMPLE: A CRL System (II)

Instead of always choosing the neighbor with the best Q-value, i.e., taking the delegation action

a = arg max_a Q_i(B, a),

a neighbor is chosen probabilistically.
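One common way to make the choice probabilistic is Boltzmann (softmax) selection over the neighbors' Q-values. SAMPLE's exact weighting is defined in the paper, so the form below, and all names in it, are an illustrative assumption:

```python
import math
import random

def boltzmann_probs(q, temperature=1.0):
    # q maps each neighbor to its Q-value Q_i(B, a)
    weights = {n: math.exp(v / temperature) for n, v in q.items()}
    z = sum(weights.values())
    return {n: w / z for n, w in weights.items()}

def choose_neighbor(q, temperature=1.0, rng=random):
    probs = boltzmann_probs(q, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

probs = boltzmann_probs({'B': 2.0, 'C': 1.0, 'D': 0.5})
print(probs)   # 'B' is most likely, but worse neighbors are still explored
```

Lowering the temperature makes the choice closer to greedy; raising it spreads probability over weaker neighbors, which is what lets the protocol keep exploring alternative routes.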

## SAMPLE: A CRL System (III)

P_i(s′ | s, a_j) = E(C_S / C_A)
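Read as a maximum-likelihood estimate, the expected ratio E(C_S / C_A) can be maintained from running counts of successful transmissions (C_S) and attempted transmissions (C_A). A sketch under that reading; the class and method names are illustrative, not from the paper's code:

```python
class TransitionEstimate:
    """Estimate P_i(s'|s, a_j) as the ratio of observed successes C_S
    to attempts C_A, updated online as transmissions are observed."""
    def __init__(self):
        self.attempts = 0        # C_A
        self.successes = 0       # C_S

    def record(self, success):
        self.attempts += 1
        self.successes += int(success)

    def prob(self):
        # 0.0 before any observation, C_S / C_A afterwards
        return self.successes / self.attempts if self.attempts else 0.0

t = TransitionEstimate()
for ok in (True, True, False, True):
    t.record(ok)
print(t.prob())   # 3 successes out of 4 attempts
```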

## SAMPLE: A CRL System (IV)

## Performance

Metrics:

- Maximize throughput: the ratio of delivered packets to undelivered packets
- Minimize the number of transmissions required per packet sent

See Figures 5-10 for results.

## Questions/Discussions

## Speaker Notes

Architecture meta model (AMM).

Build a model of the system as an MDP, used to solve Discrete Optimisation Problems: given my current state, what action should I execute to maximise my reward? Assumptions: static environment, single agent.

1. Evaluative feedback from actions
2. Environmental feedback from the state transition model
3. Temporal feedback

Causally-connected states form the partial view.

Comparison: load balancing is solved at potentially any node; routing is solved at only a single node in the network; service-oriented routing is solved at potentially more than one node in the network. There is no assumption of homogeneous agents. An agent delegates the solution of a DOP to a neighbour when the estimated cost of solving it locally is higher than the cost of a neighbour solving it. The agent determines the optimal action using its local partial view.
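The delegation rule in that note can be sketched as a cost comparison: solve locally unless some neighbour's estimated solving cost plus the connection cost of reaching it is lower. All names and numbers here are illustrative, not from the paper:

```python
def choose_solver(local_cost, neighbour_costs, connection_costs):
    """neighbour_costs: estimated cost of each neighbour solving the DOP;
    connection_costs: cost of delegating to that neighbour."""
    best, best_cost = 'local', local_cost
    for n, cost in neighbour_costs.items():
        total = cost + connection_costs[n]   # delegation is not free
        if total < best_cost:
            best, best_cost = n, total
    return best, best_cost

# Neighbour B solves the problem for 4.0 and costs 2.0 to reach,
# which beats solving locally for 10.0:
print(choose_solver(10.0, {'B': 4.0, 'C': 9.0}, {'B': 2.0, 'C': 3.0}))
```

Because connection costs are added in, a neighbour with a cheap solution but an expensive link can still lose to solving locally, which is the trade-off the delegation action captures.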

Requirement for continuous flow of data
