Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies
Roi Ceren (presenter) and Prashant Doshi, THINC Lab, University of Georgia; Bikramjit Banerjee, University of Southern Mississippi


Page 1

Reinforcement Learning in Partially Observable Multiagent Settings:
Monte Carlo Exploring Policies

Presenter: Roi Ceren, THINC Lab, University of Georgia ([email protected])
Prashant Doshi, THINC Lab, University of Georgia ([email protected])
Bikramjit Banerjee, University of Southern Mississippi ([email protected])

Page 2

Introduction

Model-free reinforcement learning in multiagent systems is a nascent field. Monte Carlo Exploring Starts for POMDPs (MCES-P) is a powerful single-agent RL technique:

Policy iteration that leverages Q-learning to hill-climb through the local policy space to local optima
PAC bounds allow selecting the sample complexity with confidence

Page 3

Introduction

We extend MCES-P to the non-cooperative multiagent setting and introduce MCES for Interactive POMDPs (MCES-IP):

Explicitly models the opponent
Predicates action-values on expected opponent behavior
When instantiated with PAC, trades off the computational expense of modeling against lower sample complexity bounds

We additionally provide a policy space pruning mechanism to promote scalability:

Parametrically bounds the regret from avoiding policies
Prioritizes eliminating low-regret policy transformations

Page 4

Background: Multiagent Decision Process

In the multiagent setting, all agents affect the state and the reward for each agent.

[Figure: Agents i and j each take actions on the shared physical state and receive (joint) rewards R(s, a_i, a_j).]

Page 5

Background: I-POMDP

The Interactive POMDP (I-POMDP) (Gmytrasiewicz and Doshi 2005): ⟨IS, A, T, Ω, O, R⟩

Non-cooperative: agents get individual, potentially competitive rewards
Actions A, state transitions T, observations Ω, observation probabilities O, and rewards R
IS: interactive state, combining the physical state and a model of the other agent

Significant uncertainty: the agent must reason not only about the physical state, but also about the opponent's motivations and beliefs.

Page 6

Background: MCES-P Template

Monte Carlo Exploring Starts for POMDPs (MCES-P) (Perkins, AAAI 2002)

General template (see the sketch below):
Explore the neighborhood of π: all policies that differ by a single action a on some observation sequence o
Compute expected values by simulating policies online
Hill-climb to policies with better values
Terminate if no neighbor is better than the current policy
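As a rough illustration of this template (a sketch under assumptions, not the authors' implementation), the loop below hill-climbs over a policy represented as a dict from observation sequences to actions. simulate_return is an assumed helper that runs one online episode of a policy and returns its sampled return, and plain k-sample averages stand in for the incremental update shown on a later slide:

def mcesp_sketch(policy, actions, obs_sequences, simulate_return, k=50, epsilon=0.05):
    """policy: dict obs_sequence -> action; simulate_return(policy) -> one sampled return."""
    while True:
        # estimate the current policy's value from k Monte Carlo episodes
        q_pi = sum(simulate_return(policy) for _ in range(k)) / k
        best_val, best_neighbor = float("-inf"), None
        # neighborhood: every policy differing from pi in one action at one observation sequence
        for o in obs_sequences:
            for a in actions:
                if a == policy[o]:
                    continue
                neighbor = dict(policy)
                neighbor[o] = a
                q_n = sum(simulate_return(neighbor) for _ in range(k)) / k
                if q_n > best_val:
                    best_val, best_neighbor = q_n, neighbor
        if best_neighbor is not None and best_val > q_pi + epsilon:
            policy = best_neighbor          # hill-climb to the better neighbor
        else:
            return policy                   # no neighbor is better: epsilon-local optimum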

Page 7

Background: MCES-P Template
Transformation

Pick a random observation sequence and replace its action with a random action.

[Policy tree π: root action a2; after o1 take a1, after o2 take a3; leaf actions a3, a1, a2, a2.]

[Policy tree π': identical to π except that the action at observation sequence {o1, o2} is changed from a1 to a3.]

{o1, o2}: a1 → a3

Page 8

Background: MCES-P Template
Transformation

Pick a random observation sequence and replace its action with a random action.

π ↔ π' via {o1, o2}: a1 ↔ a3

Page 9

Background: MCES-P Template
Transformation

Pick a random observation sequence and replace its action with a random action.

Neighbors of π:
π' via {o1, o2}: a1 ↔ a3
π' via {o1, o2}: a1 ↔ a2
π' via o1: a1 ↔ a3
π' via o1: a1 ↔ a2
π' via ∅: a2 ↔ a1
π' via ∅: a2 ↔ a3

Page 10

Background: MCES-P Template
Transformation

Local neighborhood: the set of all policies reachable from π by a single such transformation (a sketch of enumerating this neighborhood follows).
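To make the neighborhood concrete, here is a small Python sketch that enumerates every single-transformation neighbor of a policy represented as a dict from observation sequences (tuples) to actions; the representation and helper names are illustrative assumptions:

def local_neighborhood(policy, actions):
    """Return a list of (obs_seq, new_action, neighbor_policy) triples."""
    neighbors = []
    for obs_seq, current_action in policy.items():
        for a in actions:
            if a == current_action:
                continue
            neighbor = dict(policy)
            neighbor[obs_seq] = a
            neighbors.append((obs_seq, a, neighbor))
    return neighbors

# Example: the depth-2 policy tree from the previous slides
policy = {(): "a2", ("o1",): "a1", ("o2",): "a3",
          ("o1", "o1"): "a3", ("o1", "o2"): "a1",
          ("o2", "o1"): "a2", ("o2", "o2"): "a2"}
print(len(local_neighborhood(policy, ["a1", "a2", "a3"])))  # 7 sequences x 2 alternatives = 14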

Page 11

Background: MCES-P Template
Sampling

Pick a random action and simulate; after each simulated trajectory τ, update the transformed policy's action value:

$Q_{\pi \leftarrow (o,a)} \leftarrow \big(1 - \alpha(m, c_{o,a})\big)\, Q_{\pi \leftarrow (o,a)} + \alpha(m, c_{o,a}) \cdot R_{\mathrm{post}\text{-}o}(\tau)$
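A minimal sketch of this incremental update, assuming a simple running-average learning-rate schedule α(m, c) = 1/(c + 1) for illustration (the actual schedule is a parameter of the algorithm):

def update_q(q_table, counts, o, a, post_return, m, alpha=lambda m, c: 1.0 / (c + 1)):
    # q_table and counts are dicts keyed by the transformation (o, a)
    c = counts.get((o, a), 0)
    lr = alpha(m, c)
    q_table[(o, a)] = (1 - lr) * q_table.get((o, a), 0.0) + lr * post_return
    counts[(o, a)] = c + 1
    return q_table[(o, a)]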

Page 12

Background: MCES-P Template
Sampling

Sample the neighborhood k times for each policy. Transform from π to π' when

$Q_{\pi'} > Q_{\pi} + \epsilon$


Page 15

Background: MCES-P Template
Termination

When all neighbors have been sampled k times and no neighbor is better, terminate with the current policy.

Page 16

Background: MCESP+PAC

Problem: choosing a good sample bound k
Low values of k increase the chance that we make errors when transforming
High values, though requiring more samples, guarantee that we hill-climb correctly

Fewer samples: inaccurate Q-values, high error probability. More samples: accurate Q-values, low error probability.

Page 17

Background: MCESP+PAC

Solution: pick a k that guarantees some confidence in the accuracy of the Q-value.

Probably Approximately Correct (PAC) learning: the probability of the sample average deviating from the true mean by more than ε is bounded by the error δ,

$\Pr\!\left(\left|\bar{X} - \mu\right| > \epsilon\right) \le 2 \cdot \exp\!\left(-\frac{2k\epsilon^{2}}{\Lambda^{2}}\right) = \delta$

where Λ is the range of the sampled values.

Page 18

Background: MCESP+PAC

With ε and δ, we calculate the required number of samples to satisfy the error bound.

m is the number of transformations performed so far; N is the number of neighbor policies.

$\delta_m = \frac{6\delta}{\pi^2 m^2}$

$k_m = \left(\frac{2\Lambda(\pi)}{\epsilon}\right)^{2} \ln\frac{2N}{\delta_m}$

$\Lambda(\pi', \pi) \triangleq \max_{\tau}\left(Q_{\pi} - Q_{\pi'}\right) - \min_{\tau}\left(Q_{\pi} - Q_{\pi'}\right) \le 2T\left(R_{\max} - R_{\min}\right)$

$\Lambda(\pi) = \max_{\pi' \in \mathrm{Nbhd}(\pi)} \Lambda(\pi, \pi')$
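Because the formulas above are reconstructed from a garbled extraction, the following sketch should be read as illustrative only; it simply evaluates δ_m and k_m given ε, δ, the neighborhood size N, the transformation count m, and the range Λ (which can be bounded by 2T(R_max − R_min) or computed per neighbor):

import math

def pac_sample_bound(epsilon, delta, N, m, lam):
    delta_m = 6 * delta / (math.pi ** 2 * m ** 2)   # per-transformation error budget
    k_m = (2 * lam / epsilon) ** 2 * math.log(2 * N / delta_m)
    return math.ceil(k_m)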

Page 19

Background: MCESP+PAC

We can transform early by modifying ε.

Terminate when $k_m$ samples of each neighbor have been taken, or when, for every neighbor policy,

$Q_{o,a} < Q_{o,\pi(o)} + \epsilon - \epsilon\!\left(m, c_{o,a}, c_{o,\pi(o)}\right)$

where

$\epsilon(m, p, q) = \begin{cases} \Lambda(\pi, \pi')\sqrt{\dfrac{1}{2p}\ln\dfrac{2(k_m - 1)N}{\delta_m}} & \text{if } p = q < k_m \\ \epsilon/2 & \text{if } p = q = k_m \\ \infty & \text{otherwise} \end{cases}$
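A direct transcription of this (reconstructed) schedule, for illustration; p and q are the sample counts of the neighbor and of the current policy at the same transformation:

import math

def eps_schedule(m, p, q, k_m, N, delta_m, lam, epsilon):
    if p == q < k_m:
        return lam * math.sqrt(math.log(2 * (k_m - 1) * N / delta_m) / (2 * p))
    if p == q == k_m:
        return epsilon / 2
    return float("inf")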

Page 20

Background: MCESP+PAC

Then, with probability 1 − δ:
1. MCESP+PAC picks transformations that are always better than the current policy
2. MCESP+PAC terminates with a policy that is an ε-local optimum; that is, no neighbor is better than the final policy by more than ε

Page 21

MCES-P for Multiagent Settings

MCES-P can almost be used as-is in the multiagent setting.

However, MCES-P has high computational costs: a large neighborhood, each member requiring $k_m$ samples. MCES for I-POMDPs (MCES-IP) explicitly models the opponent and significantly decreases sample requirements.

Observations:
Public: noisily indicates the physical state
Private: noisily indicates the other agents' actions

Page 22

MCES-IP Template
MCES-P vs. MCES-IP

MCES-P simulation and Q-update:
Pick random o and a → Simulate π ← (o, a), generating τ → Update $Q_{\pi\leftarrow(o,a)}$ with $R_{\mathrm{post}\text{-}o}(\tau)$

MCES-IP reasons about which actions the opponent took in the simulation prior to updating:
Pick random o and a → Simulate π ← (o, a), generating τ → Update the belief over opponent models → Calculate $a_j$ from the most likely models → Update $Q^{a_j}_{\pi\leftarrow(o,a)}$ with $R_{\mathrm{post}\text{-}o}(\tau)$
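Here is a rough Python sketch of the extra MCES-IP step: maintain a belief over candidate opponent models, update it with the learner's private observation (which noisily indicates the opponent's action), and read the predicted action a_j off the most probable model. The helper obs_likelihood and the model representation are assumptions for illustration, not the authors' exact procedure:

def infer_opponent_action(belief, models, o_priv, obs_likelihood):
    """belief: dict model_id -> probability; models: dict model_id -> predicted a_j;
    obs_likelihood(o_priv, a_j): P(o_priv | opponent took a_j), assumed known or estimated."""
    posterior = {m: belief[m] * obs_likelihood(o_priv, models[m]) for m in belief}
    total = sum(posterior.values()) or 1.0
    posterior = {m: p / total for m, p in posterior.items()}
    best_model = max(posterior, key=posterior.get)
    return models[best_model], posterior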

Page 23

MCES-IP Template
Models

MCES-IP maintains a set of models of the opponent, where a model = ⟨history, policy tree⟩.

[Three candidate opponent models m1, m2, m3, each a depth-2 policy tree over actions a1, a2, a3 and observations o1, o2: m1 always takes a1, m2 always takes a2, and m3 mixes actions across observation sequences.]

Page 24

MCES-IP Template
Generating a_j

Every round, MCES-IP updates the most probable model and selects the most probable action.

[Bar chart: belief over models m1, m2, m3 at t = 1 (y-axis 0 to 0.4), before any update.]

Page 25

MCES-IP Template
Generating a_j

Every round, MCES-IP updates the most probable model and selects the most probable action.

[t = 1: with o_j = 2 and o = ∅, the belief over m1, m2, m3 is updated (y-axis 0 to 1.0) and the inferred opponent action is a_j^0 = 2.]

Page 26

MCES-IP Template
Generating a_j

Every round, MCES-IP updates the most probable model and selects the most probable action.

[t = 2: with o_j = 1 and o = 1, the belief over m1, m2, m3 is updated again and the inferred opponent action is a_j^1 = 1.]

Page 27

MCES-IP Template
Generating a_j

Every round, MCES-IP updates the most probable model and selects the most probable action.

[Belief over m1, m2, m3 updated at each round: a_j^0 = 2 (o_j = 2, o = ∅), a_j^1 = 1 (o_j = 1, o = 1), a_j^2 = 3 (o_j = 1, o = 3). The inferred opponent action sequence is a_j = {2, 1, 3}.]

Page 28

MCES-IP Template
Updating Q-values

Update counts and Q-values using $a_j$.

So far, MCES-IP is more expensive than MCES-P: the Q-table is now up to $|A_j|^{T}$ times larger!

$Q^{a_j}_{\pi\leftarrow(o,a)} \leftarrow \big(1 - \alpha(m, c^{a_j}_{o,a})\big)\, Q^{a_j}_{\pi\leftarrow(o,a)} + \alpha(m, c^{a_j}_{o,a}) \cdot R_{\mathrm{post}\text{-}o}(\tau)$
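A sketch of the corresponding Q-table bookkeeping, with the inferred opponent action sequence a_j added to the key; names and the α schedule are illustrative assumptions:

def update_q_ip(q_table, counts, o, a, a_j, post_return, m,
                alpha=lambda m, c: 1.0 / (c + 1)):
    key = (o, a, tuple(a_j))                 # a_j: inferred opponent action sequence
    c = counts.get(key, 0)
    lr = alpha(m, c)
    q_table[key] = (1 - lr) * q_table.get(key, 0.0) + lr * post_return
    counts[key] = c + 1
    return q_table[key]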

Page 29

MCESIP+PAC
PAC Bounds

MCESIP+PAC has PAC bounds similar to MCESP+PAC:

$k_m = \left(\frac{2\Lambda^{a_j}(\pi_i)}{\epsilon}\right)^{2} \ln\frac{2N}{\delta_m}$

$\epsilon^{a_j}(m, p, q) = \begin{cases} \Lambda^{a_j}(\pi_i, \pi_i')\sqrt{\dfrac{1}{2p}\ln\dfrac{2(k_m - 1)N}{\delta_m}} & \text{if } p = q < k_m \\ \epsilon/2 & \text{if } p = q = k_m \\ \infty & \text{otherwise} \end{cases}$

Page 30

MCESIP+PAC
PAC Bounds

$\Lambda^{a_j}$ modifies the range of possible rewards: since the opponent's action is known, the range of possible rewards may often be narrower, resulting in the following proposition.

Example reward table (rows: opponent action $a_j$; columns: own action $a_i$):

            a_i1   a_i2
    a_j1     0      3
    a_j2     4      5

$\Lambda^{a_j}(\pi_i, \pi_i') \le \Lambda(\pi_i, \pi_i')$
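A quick numeric check of this proposition on the example table (illustrative only; the reward range here stands in for Λ, which is properly defined over trajectory return differences):

rewards = {("aj1", "ai1"): 0, ("aj1", "ai2"): 3,
           ("aj2", "ai1"): 4, ("aj2", "ai2"): 5}
full_range = max(rewards.values()) - min(rewards.values())                     # 5
per_aj = {aj: [r for (j, _), r in rewards.items() if j == aj] for aj in ("aj1", "aj2")}
ranges_given_aj = {aj: max(v) - min(v) for aj, v in per_aj.items()}            # {aj1: 3, aj2: 1}
print(full_range, ranges_given_aj)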

Page 31

MCESIP+PAC
PAC Bounds

MCESIP+PAC terminates when $k_m$ samples of the local neighborhood yield no better policy, or when for all neighbors π′:

$Q_{\pi'} < Q_{\pi} + \epsilon - \epsilon\!\left(m, c_{o,a}, c_{o,\pi(o)}\right)$

With probability 1 − δ:
1. MCESIP+PAC picks transformations that are always better than the current policy
2. MCESIP+PAC terminates with a policy that is an ε-local optimum

Page 32

Policy Search Space Pruning

Page 33

Policy Search Space Pruning
Introduction

Not all observation sequences occur with the same probability, and low-likelihood events are difficult to sample.

Pruning: avoid policy transformations that involve rare observation sequences, while considering the impact on reward.

Regret: the amount of expected value lost by avoiding simulating those transformations. (A sketch of regret-prioritized pruning follows.)
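As a rough illustration (an assumption for illustration, not necessarily the paper's exact regret definition), one simple bound treats the regret of never exploring a transformation at observation sequence o as the probability of o times the maximum remaining reward swing, and prunes transformations whose bound falls under an allowable-regret budget:

def prune_transformations(transformations, seq_prob, R_max, R_min, horizon, allowable_regret):
    """transformations: list of (obs_seq, action); seq_prob(obs_seq): estimated probability."""
    kept, pruned = [], []
    for o, a in transformations:
        steps_remaining = horizon - len(o)
        regret_bound = seq_prob(o) * steps_remaining * (R_max - R_min)
        (pruned if regret_bound <= allowable_regret else kept).append((o, a))
    return kept, pruned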

Page 34

Policy Search Space Pruning
Regret

[Example policy trees over actions L, GL, GR: one observation sequence occurs with probability ≈ 6%, and skipping its transformations incurs regret ≈ 6.6; another occurs with probability ≈ 30%, with regret ≈ 33.]

Page 35

Policy Search Space Pruning

[An allowable-regret parameter, varied from 0% to 100%, controls the allowed transformations: the higher the allowable regret, the more low-regret transformations are pruned from the policy trees.]


Page 39

Experiments
Domains

3 domains:

Multiagent Tiger problem
3x2 UAV problem

Page 40

Experiments
Domains

Money Laundering (ML) problem

[Diagram: laundering states (bank, insurance, offshore, shell companies, casinos, real estate) organized across the stages Placement, Layering, and Integration.]


Page 42

Experiments
Domain Parameters

The opponent follows a fixed strategy:
Single: only one policy is ever used
Mixed (non-stationary environment): randomly selects from 2 to 3 policies at every new trajectory

Domain              ε      δ      % regret   horizon
Multiagent Tiger    0.05   0.1    15%        3
3x2 UAV             0.1    0.1    20%        3
Money Laundering    0.1    0.15   20%        3

Page 43

Experiments
Comparative Results

[Plots: two runs comparing MCESP+PAC and MCESIP+PAC; top: mixed-strategy opponent, middle: single-strategy opponent.]

Page 44

Experiments
Pruning

Pruning is crucial to tractability.

[Bar chart: pruning improves performance by factors of ×7.59, ×5.94, and ×8.37 across the three domains.]

Page 45

Concluding Remarks

Model-free RL in multiagent settings, generalized from MCES-P

MCES-IP models the opponent (making it partially model-free) and is more sample efficient when paired with PAC bounds

Instantiated with PAC to provide ε-local optimality, with search space pruning for improved scalability

Page 46

Thank you! Q & A

Page 47

Related Works

Bayes-Adaptive POMDPs (Ross et al. 2007)
Extended to MPOMDPs (Amato and Oliehoek 2013)
Model-based RL

IMCQ-Alt for Dec-POMDPs (Banerjee et al. 2013)
Quasi-model-based: intermediate calculation of model parameters
Alternating: each agent must take turns

Bayes-Adaptive I-POMDPs (Ng et al. 2012)
Model-based RL
Physical state perfectly observable

Page 48

Background: Decision Processes

Decision problem: how do we optimize behavior to maximize reward? Choose the action that has the best expected outcome.

[Figure: an agent chooses an action according to its preferences and receives a reward R(a).]

Page 49

Background: Decision Processes

[Figure: the agent's action now affects a physical state, and the reward R(s, a) depends on both the state and the action.]


Page 51

Background: RL

A popular class of model-free RL methods is temporal difference learning.

Example: Q-learning

α: learning rate
γ: discount factor

Q-learning computes action values for a state by exploring new values and exploiting previous knowledge:

$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\left(r(s, a) + \gamma \cdot \max_{a'} Q(s', a')\right)$
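For completeness, a minimal Python sketch of this update (state names and parameter values are illustrative):

from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Q: defaultdict mapping (state, action) -> value."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]

Q = defaultdict(float)
print(q_learning_step(Q, s="s0", a="listen", r=-1.0, s_next="s1", actions=["listen", "open"]))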