Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies
Roi Ceren (presenter) and Prashant Doshi, THINC Lab, University of Georgia; Bikramjit Banerjee, University of Southern Mississippi


Page 1

Reinforcement Learning in Partially Observable Multiagent Settings:
Monte Carlo Exploring Policies

Presenter: Roi Ceren, THINC Lab, University of Georgia ([email protected])
Prashant Doshi, THINC Lab, University of Georgia ([email protected])
Bikramjit Banerjee, University of Southern Mississippi ([email protected])

Page 2

Introduction

Model-free reinforcement learning in multiagent systems is a nascent field. Monte Carlo Exploring Starts for POMDPs (MCES-P) is a powerful single-agent RL technique:

Policy iteration that leverages Q-learning to hill-climb through the local policy space to local optima
PAC bounds allow selecting the sample complexity with confidence

Page 3

Introduction

We extend MCES-P to the non-cooperative multiagent setting and introduce MCES for Interactive POMDPs (MCES-IP):

Explicitly models the opponent
Predicates action-values on expected opponent behavior
When instantiated with PAC, trades off the computational expense of modeling against lower sample complexity bounds

We additionally provide a policy space pruning mechanism to promote scalability:

Parametrically bounds the regret from avoiding policies
Prioritizes eliminating low-regret policy transformations

Page 4

Background: Multiagent Decision Process

In the multiagent setting, all agents affect the state and the reward for each agent.

[Figure: Agents i and j each take actions on the shared physical state and receive (joint) rewards R(s, a_i, a_j).]

Page 5

Background: I-POMDP

The Interactive POMDP (I-POMDP) (Gmytrasiewicz and Doshi 2005): ⟨IS, A, T, Ω, O, R⟩

Non-cooperative: agents get individual, potentially competitive rewards
Actions A, state transitions T, observations Ω, observation probabilities O, and rewards R
IS: interactive state, combining the physical state and a model of the other agent

Significant uncertainty: the agent must reason not only about the physical state, but also about the opponent's motivations and beliefs.

Page 6

Background: MCES-P Template

Monte Carlo Exploring Starts for POMDPs (MCES-P) (Perkins, AAAI 2002)

General template (see the sketch below):
Explore the neighborhood of π: all policies that differ by a single action a on some observation sequence o
Compute expected values by simulating policies online
Hill-climb to policies with better values
Terminate if no neighbor is better than the current policy
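As a rough illustration of this template (a sketch under assumptions, not the authors' implementation), the loop below hill-climbs over a policy represented as a dict from observation sequences to actions. simulate_return is an assumed helper that runs one online episode of a policy and returns its sampled return, and plain k-sample averages stand in for the incremental update shown on a later slide:

def mcesp_sketch(policy, actions, obs_sequences, simulate_return, k=50, epsilon=0.05):
    """policy: dict obs_sequence -> action; simulate_return(policy) -> one sampled return."""
    while True:
        # estimate the current policy's value from k Monte Carlo episodes
        q_pi = sum(simulate_return(policy) for _ in range(k)) / k
        best_val, best_neighbor = float("-inf"), None
        # neighborhood: every policy differing from pi in one action at one observation sequence
        for o in obs_sequences:
            for a in actions:
                if a == policy[o]:
                    continue
                neighbor = dict(policy)
                neighbor[o] = a
                q_n = sum(simulate_return(neighbor) for _ in range(k)) / k
                if q_n > best_val:
                    best_val, best_neighbor = q_n, neighbor
        if best_neighbor is not None and best_val > q_pi + epsilon:
            policy = best_neighbor          # hill-climb to the better neighbor
        else:
            return policy                   # no neighbor is better: epsilon-local optimum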

Page 7

Background: MCES-P Template
Transformation

Pick a random observation sequence and replace its action with a random action.

[Policy tree π: root action a2; after o1 take a1, after o2 take a3; leaf actions a3, a1, a2, a2.]

[Policy tree π': identical to π except that the action at observation sequence {o1, o2} is changed from a1 to a3.]

{o1, o2}: a1 → a3

Page 8

Background: MCES-P Template
Transformation

Pick a random observation sequence and replace its action with a random action.

π ↔ π' via {o1, o2}: a1 ↔ a3

Page 9

Background: MCES-P Template
Transformation

Pick a random observation sequence and replace its action with a random action.

Neighbors of π:
π' via {o1, o2}: a1 ↔ a3
π' via {o1, o2}: a1 ↔ a2
π' via o1: a1 ↔ a3
π' via o1: a1 ↔ a2
π' via ∅: a2 ↔ a1
π' via ∅: a2 ↔ a3

Page 10

Background: MCES-P Template
Transformation

Local neighborhood: the set of all policies reachable from π by a single such transformation (a sketch of enumerating this neighborhood follows).
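To make the neighborhood concrete, here is a small Python sketch that enumerates every single-transformation neighbor of a policy represented as a dict from observation sequences (tuples) to actions; the representation and helper names are illustrative assumptions:

def local_neighborhood(policy, actions):
    """Return a list of (obs_seq, new_action, neighbor_policy) triples."""
    neighbors = []
    for obs_seq, current_action in policy.items():
        for a in actions:
            if a == current_action:
                continue
            neighbor = dict(policy)
            neighbor[obs_seq] = a
            neighbors.append((obs_seq, a, neighbor))
    return neighbors

# Example: the depth-2 policy tree from the previous slides
policy = {(): "a2", ("o1",): "a1", ("o2",): "a3",
          ("o1", "o1"): "a3", ("o1", "o2"): "a1",
          ("o2", "o1"): "a2", ("o2", "o2"): "a2"}
print(len(local_neighborhood(policy, ["a1", "a2", "a3"])))  # 7 sequences x 2 alternatives = 14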

Page 11

Background: MCES-P Template
Sampling

Pick a random action and simulate; after each simulated trajectory τ, update the transformed policy's action value:

$Q_{\pi \leftarrow (o,a)} \leftarrow \big(1 - \alpha(m, c_{o,a})\big)\, Q_{\pi \leftarrow (o,a)} + \alpha(m, c_{o,a}) \cdot R_{\mathrm{post}\text{-}o}(\tau)$
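A minimal sketch of this incremental update, assuming a simple running-average learning-rate schedule α(m, c) = 1/(c + 1) for illustration (the actual schedule is a parameter of the algorithm):

def update_q(q_table, counts, o, a, post_return, m, alpha=lambda m, c: 1.0 / (c + 1)):
    # q_table and counts are dicts keyed by the transformation (o, a)
    c = counts.get((o, a), 0)
    lr = alpha(m, c)
    q_table[(o, a)] = (1 - lr) * q_table.get((o, a), 0.0) + lr * post_return
    counts[(o, a)] = c + 1
    return q_table[(o, a)]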

Page 12

Background: MCES-P Template
Sampling

Sample the neighborhood k times for each policy. Transform from π to π' when

$Q_{\pi'} > Q_{\pi} + \epsilon$


Page 15

Background: MCES-P Template
Termination

When all neighbors have been sampled k times and no neighbor is better, terminate with the current policy.

Page 16

Background: MCESP+PAC

Problem: choosing a good sample bound k
Low values of k increase the chance that we make errors when transforming
High values, though requiring more samples, guarantee that we hill-climb correctly

Fewer samples: inaccurate Q-values, high error probability. More samples: accurate Q-values, low error probability.

Page 17

Background: MCESP+PAC

Solution: pick a k that guarantees some confidence in the accuracy of the Q-value.

Probably Approximately Correct (PAC) learning: the probability of the sample average deviating from the true mean by more than ε is bounded by the error δ,

$\Pr\!\left(\left|\bar{X} - \mu\right| > \epsilon\right) \le 2 \cdot \exp\!\left(-\frac{2k\epsilon^{2}}{\Lambda^{2}}\right) = \delta$

where Λ is the range of the sampled values.

Page 18

Background: MCESP+PAC

With ε and δ, we calculate the required number of samples to satisfy the error bound.

m is the number of transformations performed so far; N is the number of neighbor policies.

$\delta_m = \frac{6\delta}{\pi^2 m^2}$

$k_m = \left(\frac{2\Lambda(\pi)}{\epsilon}\right)^{2} \ln\frac{2N}{\delta_m}$

$\Lambda(\pi', \pi) \triangleq \max_{\tau}\left(Q_{\pi} - Q_{\pi'}\right) - \min_{\tau}\left(Q_{\pi} - Q_{\pi'}\right) \le 2T\left(R_{\max} - R_{\min}\right)$

$\Lambda(\pi) = \max_{\pi' \in \mathrm{Nbhd}(\pi)} \Lambda(\pi, \pi')$
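Because the formulas above are reconstructed from a garbled extraction, the following sketch should be read as illustrative only; it simply evaluates δ_m and k_m given ε, δ, the neighborhood size N, the transformation count m, and the range Λ (which can be bounded by 2T(R_max − R_min) or computed per neighbor):

import math

def pac_sample_bound(epsilon, delta, N, m, lam):
    delta_m = 6 * delta / (math.pi ** 2 * m ** 2)   # per-transformation error budget
    k_m = (2 * lam / epsilon) ** 2 * math.log(2 * N / delta_m)
    return math.ceil(k_m)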

Page 19

Background: MCESP+PAC

We can transform early by modifying ε.

Terminate when $k_m$ samples of each neighbor have been taken, or when, for every neighbor policy,

$Q_{o,a} < Q_{o,\pi(o)} + \epsilon - \epsilon\!\left(m, c_{o,a}, c_{o,\pi(o)}\right)$

where

$\epsilon(m, p, q) = \begin{cases} \Lambda(\pi, \pi')\sqrt{\dfrac{1}{2p}\ln\dfrac{2(k_m - 1)N}{\delta_m}} & \text{if } p = q < k_m \\ \epsilon/2 & \text{if } p = q = k_m \\ \infty & \text{otherwise} \end{cases}$
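A direct transcription of this (reconstructed) schedule, for illustration; p and q are the sample counts of the neighbor and of the current policy at the same transformation:

import math

def eps_schedule(m, p, q, k_m, N, delta_m, lam, epsilon):
    if p == q < k_m:
        return lam * math.sqrt(math.log(2 * (k_m - 1) * N / delta_m) / (2 * p))
    if p == q == k_m:
        return epsilon / 2
    return float("inf")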

Page 20

Background: MCESP+PAC

Then, with probability 1 − δ:
1. MCESP+PAC picks transformations that are always better than the current policy
2. MCESP+PAC terminates with a policy that is an ε-local optimum; that is, no neighbor is better than the final policy by more than ε

Page 21

MCES-P for Multiagent Settings

MCES-P can almost be used as-is in the multiagent setting.

However, MCES-P has high computational costs: a large neighborhood, each member requiring $k_m$ samples. MCES for I-POMDPs (MCES-IP) explicitly models the opponent and significantly decreases sample requirements.

Observations:
Public: noisily indicates the physical state
Private: noisily indicates the other agents' actions

Page 22

MCES-IP Template
MCES-P vs. MCES-IP

MCES-P simulation and Q-update:
Pick random o and a → Simulate π ← (o, a), generating τ → Update $Q_{\pi\leftarrow(o,a)}$ with $R_{\mathrm{post}\text{-}o}(\tau)$

MCES-IP reasons about which actions the opponent took in the simulation prior to updating:
Pick random o and a → Simulate π ← (o, a), generating τ → Update the belief over opponent models → Calculate $a_j$ from the most likely models → Update $Q^{a_j}_{\pi\leftarrow(o,a)}$ with $R_{\mathrm{post}\text{-}o}(\tau)$
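Here is a rough Python sketch of the extra MCES-IP step: maintain a belief over candidate opponent models, update it with the learner's private observation (which noisily indicates the opponent's action), and read the predicted action a_j off the most probable model. The helper obs_likelihood and the model representation are assumptions for illustration, not the authors' exact procedure:

def infer_opponent_action(belief, models, o_priv, obs_likelihood):
    """belief: dict model_id -> probability; models: dict model_id -> predicted a_j;
    obs_likelihood(o_priv, a_j): P(o_priv | opponent took a_j), assumed known or estimated."""
    posterior = {m: belief[m] * obs_likelihood(o_priv, models[m]) for m in belief}
    total = sum(posterior.values()) or 1.0
    posterior = {m: p / total for m, p in posterior.items()}
    best_model = max(posterior, key=posterior.get)
    return models[best_model], posterior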

Page 23

MCES-IP Template
Models

MCES-IP maintains a set of models of the opponent, where a model = ⟨history, policy tree⟩.

[Three candidate opponent models m1, m2, m3, each a depth-2 policy tree over actions a1, a2, a3 and observations o1, o2: m1 always takes a1, m2 always takes a2, and m3 mixes actions across observation sequences.]

Page 24

MCES-IP Template
Generating a_j

Every round, MCES-IP updates the most probable model and selects the most probable action.

[Bar chart: belief over models m1, m2, m3 at t = 1 (y-axis 0 to 0.4), before any update.]

Page 25

MCES-IP Template
Generating a_j

Every round, MCES-IP updates the most probable model and selects the most probable action.

[t = 1: with o_j = 2 and o = ∅, the belief over m1, m2, m3 is updated (y-axis 0 to 1.0) and the inferred opponent action is a_j^0 = 2.]

Page 26

MCES-IP Template
Generating a_j

Every round, MCES-IP updates the most probable model and selects the most probable action.

[t = 2: with o_j = 1 and o = 1, the belief over m1, m2, m3 is updated again and the inferred opponent action is a_j^1 = 1.]

Page 27

MCES-IP Template
Generating a_j

Every round, MCES-IP updates the most probable model and selects the most probable action.

[Belief over m1, m2, m3 updated at each round: a_j^0 = 2 (o_j = 2, o = ∅), a_j^1 = 1 (o_j = 1, o = 1), a_j^2 = 3 (o_j = 1, o = 3). The inferred opponent action sequence is a_j = {2, 1, 3}.]

Page 28

MCES-IP Template
Updating Q-values

Update counts and Q-values using $a_j$.

So far, MCES-IP is more expensive than MCES-P: the Q-table is now up to $|A_j|^{T}$ times larger!

$Q^{a_j}_{\pi\leftarrow(o,a)} \leftarrow \big(1 - \alpha(m, c^{a_j}_{o,a})\big)\, Q^{a_j}_{\pi\leftarrow(o,a)} + \alpha(m, c^{a_j}_{o,a}) \cdot R_{\mathrm{post}\text{-}o}(\tau)$
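A sketch of the corresponding Q-table bookkeeping, with the inferred opponent action sequence a_j added to the key; names and the α schedule are illustrative assumptions:

def update_q_ip(q_table, counts, o, a, a_j, post_return, m,
                alpha=lambda m, c: 1.0 / (c + 1)):
    key = (o, a, tuple(a_j))                 # a_j: inferred opponent action sequence
    c = counts.get(key, 0)
    lr = alpha(m, c)
    q_table[key] = (1 - lr) * q_table.get(key, 0.0) + lr * post_return
    counts[key] = c + 1
    return q_table[key]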

Page 29

MCESIP+PAC
PAC Bounds

MCESIP+PAC has PAC bounds similar to MCESP+PAC:

$k_m = \left(\frac{2\Lambda^{a_j}(\pi_i)}{\epsilon}\right)^{2} \ln\frac{2N}{\delta_m}$

$\epsilon^{a_j}(m, p, q) = \begin{cases} \Lambda^{a_j}(\pi_i, \pi_i')\sqrt{\dfrac{1}{2p}\ln\dfrac{2(k_m - 1)N}{\delta_m}} & \text{if } p = q < k_m \\ \epsilon/2 & \text{if } p = q = k_m \\ \infty & \text{otherwise} \end{cases}$

Page 30

MCESIP+PAC
PAC Bounds

$\Lambda^{a_j}$ modifies the range of possible rewards: since the opponent's action is known, the range of possible rewards may often be narrower, resulting in the following proposition.

Example reward table (rows: opponent action $a_j$; columns: own action $a_i$):

            a_i1   a_i2
    a_j1     0      3
    a_j2     4      5

$\Lambda^{a_j}(\pi_i, \pi_i') \le \Lambda(\pi_i, \pi_i')$
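A quick numeric check of this proposition on the example table (illustrative only; the reward range here stands in for Λ, which is properly defined over trajectory return differences):

rewards = {("aj1", "ai1"): 0, ("aj1", "ai2"): 3,
           ("aj2", "ai1"): 4, ("aj2", "ai2"): 5}
full_range = max(rewards.values()) - min(rewards.values())                     # 5
per_aj = {aj: [r for (j, _), r in rewards.items() if j == aj] for aj in ("aj1", "aj2")}
ranges_given_aj = {aj: max(v) - min(v) for aj, v in per_aj.items()}            # {aj1: 3, aj2: 1}
print(full_range, ranges_given_aj)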

Page 31

MCESIP+PAC
PAC Bounds

MCESIP+PAC terminates when $k_m$ samples of the local neighborhood yield no better policy, or when for all neighbors π′:

$Q_{\pi'} < Q_{\pi} + \epsilon - \epsilon\!\left(m, c_{o,a}, c_{o,\pi(o)}\right)$

With probability 1 − δ:
1. MCESIP+PAC picks transformations that are always better than the current policy
2. MCESIP+PAC terminates with a policy that is an ε-local optimum

Page 32

Policy Search Space Pruning

Page 33

Policy Search Space Pruning
Introduction

Not all observation sequences occur with the same probability, and low-likelihood events are difficult to sample.

Pruning: avoid policy transformations that involve rare observation sequences, while considering the impact on reward.

Regret: the amount of expected value lost by avoiding simulating those transformations. (A sketch of regret-prioritized pruning follows.)
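As a rough illustration (an assumption for illustration, not necessarily the paper's exact regret definition), one simple bound treats the regret of never exploring a transformation at observation sequence o as the probability of o times the maximum remaining reward swing, and prunes transformations whose bound falls under an allowable-regret budget:

def prune_transformations(transformations, seq_prob, R_max, R_min, horizon, allowable_regret):
    """transformations: list of (obs_seq, action); seq_prob(obs_seq): estimated probability."""
    kept, pruned = [], []
    for o, a in transformations:
        steps_remaining = horizon - len(o)
        regret_bound = seq_prob(o) * steps_remaining * (R_max - R_min)
        (pruned if regret_bound <= allowable_regret else kept).append((o, a))
    return kept, pruned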

Page 34

Policy Search Space Pruning
Regret

[Example policy trees over actions L, GL, GR: one observation sequence occurs with probability ≈ 6%, and skipping its transformations incurs regret ≈ 6.6; another occurs with probability ≈ 30%, with regret ≈ 33.]

Page 35

Policy Search Space Pruning

[An allowable-regret parameter, varied from 0% to 100%, controls the allowed transformations: the higher the allowable regret, the more low-regret transformations are pruned from the policy trees.]


Page 39

Experiments
Domains

3 domains:

Multiagent Tiger problem
3x2 UAV problem

Page 40

Experiments
Domains

Money Laundering (ML) problem

[Diagram: laundering states (bank, insurance, offshore, shell companies, casinos, real estate) organized across the stages Placement, Layering, and Integration.]


Page 42

Experiments
Domain Parameters

The opponent follows a fixed strategy:
Single: only one policy is ever used
Mixed (non-stationary environment): randomly selects from 2 to 3 policies at every new trajectory

Domain              ε      δ      % regret   horizon
Multiagent Tiger    0.05   0.1    15%        3
3x2 UAV             0.1    0.1    20%        3
Money Laundering    0.1    0.15   20%        3

Page 43

Experiments
Comparative Results

[Plots: two runs comparing MCESP+PAC and MCESIP+PAC; top: mixed-strategy opponent, middle: single-strategy opponent.]

Page 44

Experiments
Pruning

Pruning is crucial to tractability.

[Bar chart: pruning improves performance by factors of ×7.59, ×5.94, and ×8.37 across the three domains.]

Page 45

Concluding Remarks

Model-free RL in multiagent settings, generalized from MCES-P

MCES-IP models the opponent (making it partially model-free) and is more sample efficient when paired with PAC bounds

Instantiated with PAC to provide ε-local optimality, with search space pruning for improved scalability

Page 46

Thank you! Q & A

Page 47

Related Works

Bayes-Adaptive POMDPs (Ross et al. 2007)
Extended to MPOMDPs (Amato and Oliehoek 2013)
Model-based RL

IMCQ-Alt for Dec-POMDPs (Banerjee et al. 2013)
Quasi-model-based: intermediate calculation of model parameters
Alternating: each agent must take turns

Bayes-Adaptive I-POMDPs (Ng et al. 2012)
Model-based RL
Physical state perfectly observable

Page 48

Background: Decision Processes

Decision problem: how do we optimize behavior to maximize reward? Choose the action that has the best expected outcome.

[Figure: an agent chooses an action according to its preferences and receives a reward R(a).]

Page 49

Background: Decision Processes

[Figure: the agent's action now affects a physical state, and the reward R(s, a) depends on both the state and the action.]


Page 51

Background: RL

A popular class of model-free RL methods is temporal difference learning.

Example: Q-learning

α: learning rate
γ: discount factor

Q-learning computes action values for a state by exploring new values and exploiting previous knowledge:

$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\left(r(s, a) + \gamma \cdot \max_{a'} Q(s', a')\right)$
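For completeness, a minimal Python sketch of this update (state names and parameter values are illustrative):

from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Q: defaultdict mapping (state, action) -> value."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]

Q = defaultdict(float)
print(q_learning_step(Q, s="s0", a="listen", r=-1.0, s_next="s1", actions=["listen", "open"]))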