Page 1

Counterfactual Multi-Agent Policy Gradients

Shimon Whiteson
Dept. of Computer Science

University of Oxford

joint work with Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, and Nantas Nardelli

July 6, 2017

Page 2

Single-Agent Paradigm

Page 3

Multi-Agent Paradigm

Page 4

Multi-Agent Systems are Everywhere

Page 5

Types of Multi-Agent Systems

Cooperative:
- Shared team reward
- Coordination problem

Competitive:
- Zero-sum games
- Opposing individual rewards
- Minimax equilibria

Mixed:
- General-sum games
- Nash equilibria
- What is the question?

Page 6

Coordination Problems are Everywhere

Page 7

Multi-Agent MDP

All agents see the global state s

Individual actions: $u^a \in U$

State transitions: $P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \to [0, 1]$

Shared team reward: $r(s, \mathbf{u}) : S \times \mathbf{U} \to \mathbb{R}$

Equivalent to an MDP with a factored action space
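To make "equivalent to an MDP with a factored action space" concrete, here is a toy sketch (Python; the class, dynamics, and reward are invented for illustration): the joint action is a tuple with one component per agent, and every agent receives the same team reward.

```python
import itertools
import random
from typing import Tuple

class TinyMultiAgentMDP:
    """Illustrative multi-agent MDP: an ordinary MDP whose action space
    factors into one component per agent, u = (u^1, ..., u^n)."""

    def __init__(self, n_agents: int = 2, n_actions: int = 3, n_states: int = 4):
        self.n_agents, self.n_actions, self.n_states = n_agents, n_actions, n_states

    def joint_actions(self):
        # |U|^n joint actions: the curse of dimensionality noted on the
        # "Key Challenges" slide.
        return itertools.product(range(self.n_actions), repeat=self.n_agents)

    def step(self, s: int, u: Tuple[int, ...]) -> Tuple[int, float]:
        """Sample s' ~ P(.|s, u) and return the shared team reward r(s, u)."""
        rng = random.Random(hash((s, u)))  # fixed toy dynamics
        s_next = rng.randrange(self.n_states)
        # Shared reward paid only when the agents coordinate on distinct actions.
        r = 1.0 if len(set(u)) == self.n_agents else 0.0
        return s_next, r
```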

Page 8

Dec-POMDP

Observation function: $O(s, a) : S \times A \to Z$

Action-observation history: $\tau^a \in T \equiv (Z \times U)^*$

Decentralised policies: $\pi^a(u^a \mid \tau^a) : T \times U \to [0, 1]$

Natural decentralisation: communication and sensory constraints

Artificial decentralisation: coping with joint action space

Centralised learning of decentralised policies
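In practice the history $\tau^a$ is summarised by a recurrent network, as in the GRU actors that appear later in the talk. A minimal sketch (assuming PyTorch; layer sizes and input conventions are illustrative):

```python
import torch
import torch.nn as nn

class DecentralisedActor(nn.Module):
    """Sketch of pi^a(u^a | tau^a): a GRU folds the agent's own
    action-observation stream into a hidden state; it never sees
    the global state s."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRUCell(obs_dim + n_actions, hidden)  # input: (z_t, u_{t-1})
        self.policy_head = nn.Linear(hidden, n_actions)

    def forward(self, obs, prev_action_onehot, h):
        h = self.gru(torch.cat([obs, prev_action_onehot], dim=-1), h)
        probs = torch.softmax(self.policy_head(h), dim=-1)  # pi^a(. | tau^a)
        return probs, h  # h carries tau^a to the next step
```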

Page 9

Key Challenges

Curse of dimensionality in actions

Multi-agent credit assignment

Modelling other agents’ information state

Page 10

Single-Agent Policy Gradient Methods

Optimise $\pi_\theta$ with gradient ascent on the expected return:

$J_\theta = \mathbb{E}_{s \sim \rho^\pi(s),\, u \sim \pi_\theta(s, \cdot)} \left[ r(s, u) \right]$

Good when:
- Greedification is hard, e.g., continuous actions
- Policy is simpler than value function

Policy gradient theorem [Sutton et al. 2000]:

$\nabla_\theta J_\theta = \mathbb{E}_{s \sim \rho^\pi(s),\, u \sim \pi_\theta(s, \cdot)} \left[ \nabla_\theta \log \pi_\theta(u \mid s)\, Q^\pi(s, u) \right]$

REINFORCE [Williams 1992]:

$\nabla_\theta J_\theta \approx g(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\, R_t$
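As a sketch, the estimator can be implemented as a surrogate loss whose negative gradient is $g(\tau)$ (PyTorch; names are illustrative):

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: list, gamma: float = 0.99) -> torch.Tensor:
    """Surrogate loss for g(tau) = sum_t grad log pi_theta(u_t|s_t) R_t.

    log_probs: log pi_theta(u_t|s_t) for one episode, shape [T] (requires grad).
    rewards:   the episode's rewards r_0, ..., r_T.
    """
    returns, running = torch.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):   # R_t = r_t + gamma * R_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return -(log_probs * returns).sum()       # minimising this ascends J_theta
```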

Page 11

Single-Agent Actor-Critic Methods [Sutton et al. 2000]

Reduce variance in $g(\tau)$ by learning a critic $Q(s, u)$:

$g(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\, Q(s_t, u_t)$

Page 12

Single-Agent Baselines

Further reduce variance with a baseline $b(s)$:

$g(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t) \big( Q(s_t, u_t) - b(s_t) \big)$

$b(s) = V(s) \implies Q(s, u) - b(s) = A(s, u)$, the advantage function:

$g(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\, A(s_t, u_t)$

The TD error $r_t + \gamma V(s_{t+1}) - V(s_t)$ is an unbiased estimate of $A(s_t, u_t)$:

$g(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid s_t) \big( r_t + \gamma V(s_{t+1}) - V(s_t) \big)$
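A sketch of the resulting one-step update (PyTorch; assumes a learned value network has already produced $V(s_t)$ and $V(s_{t+1})$ as `v_t` and `v_next`):

```python
import torch

def actor_critic_losses(log_prob_t, r_t, v_t, v_next, gamma: float = 0.99):
    """Actor and critic losses built from the TD error
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    target = r_t + gamma * v_next.detach()   # bootstrap target, gradient blocked
    td_error = (target - v_t).detach()       # unbiased estimate of A(s_t, u_t)
    actor_loss = -log_prob_t * td_error      # policy gradient with baseline V
    critic_loss = (target - v_t) ** 2        # regress V(s_t) toward the target
    return actor_loss, critic_loss
```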

Page 13

Single-Agent Deep Actor-Critic Methods

Actor and critic are both deep neural networks
- Convolutional and recurrent layers
- Actor and critic share layers

Both trained with stochastic gradient descent
- Actor trained on the policy gradient
- Critic trained on TD(λ) or Sarsa(λ):

$L_t(\psi) = \big( y^{(\lambda)} - C(\cdot_t, \psi) \big)^2$

$y^{(\lambda)} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t$

$G^{(n)}_t = \sum_{k=1}^{n} \gamma^{k-1} r_{t+k} + \gamma^n C(\cdot_{t+n}, \psi)$
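The $\lambda$-return targets are usually computed with the backward recursion $y^{(\lambda)}_t = r_{t+1} + \gamma \big( (1-\lambda)\, C(\cdot_{t+1}) + \lambda\, y^{(\lambda)}_{t+1} \big)$ rather than the infinite sum. A sketch (plain Python; assumes a truncated rollout that bootstraps from the critic at the final state):

```python
def lambda_returns(rewards, values, gamma: float = 0.99, lam: float = 0.8):
    """Targets y_t^(lambda) for training the critic C(., psi).

    rewards: [r_1, ..., r_T]   (rewards[t] follows state t)
    values:  [C_0, ..., C_T]   (critic estimates at each visited state)
    """
    T = len(rewards)
    y = [0.0] * T
    y[T - 1] = rewards[T - 1] + gamma * values[T]   # pure bootstrap at the end
    for t in reversed(range(T - 1)):
        y[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * y[t + 1])
    return y   # minimise (y[t] - C(._t, psi))^2
```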

Page 14

Independent Actor-Critic

Inspired by independent Q-learning [Tan 1993]
- Each agent learns independently with its own actor and critic
- Treats other agents as part of the environment

Speed learning with parameter sharing
- Different inputs, including the agent index $a$, induce different behaviour
- Still independent: critics condition only on $\tau^a$ and $u^a$

Variants:
- IAC-V: TD-error gradient using $V(\tau^a)$
- IAC-Q: advantage-based gradient using $A(\tau^a, u^a) = Q(\tau^a, u^a) - V(\tau^a)$

Limitations:
- Nonstationary learning
- Hard to learn to coordinate
- Multi-agent credit assignment

Page 15

Counterfactual Multi-Agent Policy Gradients

Centralised critic: stabilise learning to coordinate

Counterfactual baseline: tackle multi-agent credit assignment

Efficient critic representation: scale to large NNs

Page 16

Centralised Critic

Centralisation → greedification over the joint action space is hard → actor-critic

$g^a(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u^a_t \mid \tau^a_t) \big( r_t + \gamma V(s_{t+1}) - V(s_t) \big)$

[Figure: centralised-critic architecture for two agents. Each actor $a$ conditions on its observation $o^a_t$ and recurrent hidden state $h^a$, emitting $\pi(h^a, \cdot)$ and an action $u^a_t$ to the environment; a centralised critic receives the global state $s_t$, reward $r_t$, and both actions, and returns per-agent advantages $A^1_t, A^2_t$ to the actors.]

Page 17

Wonderful Life Utility [Wolpert & Tumer 2000]

Page 18

Difference Rewards [Tumer & Agogino 2007]

Per-agent shaped reward:

$D^a(s, \mathbf{u}) = r(s, \mathbf{u}) - r(s, (\mathbf{u}^{-a}, c^a))$

where $c^a$ is a default action

Important property:

$D^a(s, (\mathbf{u}^{-a}, \dot{u}^a)) > D^a(s, \mathbf{u}) \implies r(s, (\mathbf{u}^{-a}, \dot{u}^a)) > r(s, \mathbf{u})$
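As a sketch, evaluating $D^a$ requires one extra call to the reward function (or simulator) with agent $a$'s action swapped for the default $c^a$ (Python; names are illustrative), which leads directly to the limitations listed below:

```python
from typing import Callable, Tuple

def difference_reward(r: Callable, s, u: Tuple[int, ...], a: int, c_a: int) -> float:
    """D^a(s, u) = r(s, u) - r(s, (u^{-a}, c^a)): how much agent a's actual
    action improved the team reward over the default c^a."""
    u_default = tuple(c_a if i == a else u_i for i, u_i in enumerate(u))
    # The second evaluation of r is the extra simulation noted below,
    # and choosing c_a well is the expertise problem.
    return r(s, u) - r(s, u_default)
```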

Limitations:
- Need (extra) simulation to estimate the counterfactual $r(s, (\mathbf{u}^{-a}, c^a))$
- Need expertise to choose $c^a$

Page 19

Counterfactual Baseline

Use $Q(s, \mathbf{u})$ to estimate difference rewards:

$g^a(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u^a_t \mid \tau^a_t)\, A^a(s_t, \mathbf{u}_t)$

$A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u^a} \pi^a(u^a \mid \tau^a)\, Q(s, (\mathbf{u}^{-a}, u^a))$

The baseline marginalises out $u^a$

Critic obviates need for extra simulations

Marginalised action obviates need for default
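A minimal sketch of the counterfactual advantage (NumPy; names and shapes are illustrative): given the centralised critic's Q-values for all of agent $a$'s actions with the other agents' actions $\mathbf{u}^{-a}$ held fixed, the baseline is just the policy-weighted average.

```python
import numpy as np

def coma_advantage(q_a: np.ndarray, pi_a: np.ndarray, u_a: int) -> float:
    """A^a(s, u) = Q(s, u) - sum_{u^a} pi^a(u^a|tau^a) Q(s, (u^{-a}, u^a)).

    q_a:  critic outputs Q(s, (u^{-a}, .)) for each of agent a's |U| actions,
          available from a single forward pass of the COMA critic.
    pi_a: agent a's policy pi^a(. | tau^a) over its |U| actions.
    u_a:  the action agent a actually took.
    """
    baseline = float(pi_a @ q_a)        # marginalises out u^a: no default action
    return float(q_a[u_a]) - baseline   # no extra simulations: just the critic
```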

Page 20

Efficient Critic Representation

[Figure: COMA architecture. (a) Actors interact with the environment while a centralised critic computes each agent's advantage $A^a_t$ from $s_t$ and $r_t$. (b) Each actor is a GRU that consumes $(o^a_t, a, u^a_{t-1})$, carries the hidden state $h^a_t$, and outputs $\pi^a_t = \pi(h^a_t, \epsilon)$. (c) The COMA critic consumes $(\mathbf{u}^{-a}_t, s_t, o^a_t, a, \mathbf{u}_{t-1})$ and outputs $\{Q(u^a{=}1, \mathbf{u}^{-a}_t, \cdot), \ldots, Q(u^a{=}|U|, \mathbf{u}^{-a}_t, \cdot)\}$ in a single forward pass; combined with $(u^a_t, \pi^a_t)$ this yields $A^a_t$.]

Page 21

Starcraft

Page 22

Starcraft Micromanagement [Synnaeve et al. 2016]

Page 23

Centralised Performance

        Local Field of View (FoV)                          Full FoV, Central Control
map     heur.  IAC-V  IAC-Q  cnt-V  cnt-QV   COMA          heur.  DQN    GMEZO
                                             mean   best
3m      .35    .47    .56    .83    .83      .87    .98    .74    -      -
5m      .66    .63    .58    .67    .71      .81    .95    .98    .99    1.0
5w      .70    .18    .57    .65    .76      .82    .98    .82    .70    .74
2d 3z   .63    .27    .19    .36    .39      .47    .65    .68    .61    .90

Page 24

Decentralised Starcraft Micromanagement


Page 25

Heuristic Performance

        Local Field of View (FoV)                          Full FoV, Central Control
map     heur.  IAC-V  IAC-Q  cnt-V  cnt-QV   COMA          heur.  DQN    GMEZO
                                             mean   best
3m      .35    .47    .56    .83    .83      .87    .98    .74    -      -
5m      .66    .63    .58    .67    .71      .81    .95    .98    .99    1.0
5w      .70    .18    .57    .65    .76      .82    .98    .82    .70    .74
2d 3z   .63    .27    .19    .36    .39      .47    .65    .68    .61    .90

Page 26

Baseline Algorithms

IAC-V: independent actor-critic with $V(\tau^a)$

IAC-Q: independent actor-critic with $A(\tau^a, u^a) = Q(\tau^a, u^a) - V(\tau^a)$

Central-V: centralised critic $V(s)$ with TD-error-based gradient

Central-QV:
- Centralised critics $Q(s, \mathbf{u})$ and $V(s)$
- Advantage gradient $A(s, \mathbf{u}) = Q(s, \mathbf{u}) - V(s)$
- COMA but with $b(s) = V(s)$

Page 27

Results (3m, 5m, 5w, 2d-3z)

[Figure: learning curves on the 3m, 5m, 5w, and 2d 3z maps. Each panel plots average win % against the number of training episodes for COMA, central-V, central-QV, IAC-V, and IAC-Q, with the heuristic's win rate shown for reference.]

Page 28

Compared to Centralised Controllers

        Local Field of View (FoV)                          Full FoV, Central Control
map     heur.  IAC-V  IAC-Q  cnt-V  cnt-QV   COMA          heur.  DQN    GMEZO
                                             mean   best
3m      .35    .47    .56    .83    .83      .87    .98    .74    -      -
5m      .66    .63    .58    .67    .71      .81    .95    .98    .99    1.0
5w      .70    .18    .57    .65    .76      .82    .98    .82    .70    .74
2d 3z   .63    .27    .19    .36    .39      .47    .65    .68    .61    .90

Page 29

Future Work

Factored centralised critics for many agents

Multi-agent exploration

Starcraft macromanagement

Page 30

Paper

Counterfactual Multi-Agent Policy Gradients

Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson

https://arxiv.org/abs/1705.08926

Page 31

Thank You Microsoft!

This work was made possible thanks to a generous donation of Azure cloud credits from Microsoft.
