
Solving Stackelberg Equilibrium in Stochastic Games


Page 1: Solving Stackelberg Equilibrium in Stochastic Games

Solving Stackelberg Equilibrium in Stochastic Games

Víctor Bucarey López

[email protected]

FCEIA - UNR - Rosario

November 2nd, 2017

Page 2: Solving Stackelberg Equilibrium in Stochastic Games

Stackelberg Game

Leader $\pi$ $\longrightarrow$ Follower $\gamma(\pi)$

Strong Stackelberg Equilibrium

Leader commits to a payoff-maximizing strategy.

Follower best responds.

Follower breaks ties in favor of the leader.


Page 5: Solving Stackelberg Equilibrium in Stochastic Games

Example

      b1          b2
a1  (10, −10)   (−5, 6)
a2  (−8, 4)     (6, −4)

MIP formulation

max  vA

s.t.  vA ≤ 10x1 − 8x2 + M(1 − y1)
      vA ≤ −5x1 + 6x2 + M(1 − y2)
      0 ≤ vB − (−10x1 + 4x2) ≤ M(1 − y1)
      0 ≤ vB − (6x1 − 4x2) ≤ M(1 − y2)
      x1 + x2 = 1,   y1 + y2 = 1
      x ≥ 0,   y ∈ {0, 1}
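As a sanity check on the formulation, the SSE of this 2×2 example can also be approximated without a MIP solver by sweeping the leader's mixed strategy on a grid; the Python sketch below (our illustration, not part of the talk) applies the leader-favorable tie-break directly and recovers x1* ≈ 1/3.

import numpy as np

# Sanity-check sketch (ours, not the talk's): approximate the SSE of this
# 2x2 example by sweeping the leader's mixed strategy on a grid, applying
# the leader-favorable tie-break directly instead of solving the MIP.
RA = np.array([[10.0, -5.0], [-8.0, 6.0]])  # leader payoffs (rows a1, a2)
RB = np.array([[-10.0, 6.0], [4.0, -4.0]])  # follower payoffs (cols b1, b2)

best_val, best_x1 = -np.inf, None
for x1 in np.linspace(0.0, 1.0, 100_001):
    x = np.array([x1, 1.0 - x1])
    uB = x @ RB  # follower's expected payoff for each response
    uA = x @ RA  # leader's expected payoff for each follower response
    ties = np.flatnonzero(uB >= uB.max() - 1e-9)
    b = ties[np.argmax(uA[ties])]  # follower breaks ties in the leader's favor
    if uA[b] > best_val:
        best_val, best_x1 = uA[b], x1

print(f"x1* ~ {best_x1:.3f}, leader value ~ {best_val:.3f}")  # expect 1/3 and 7/3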

Page 6: Solving Stackelberg Equilibrium in Stochastic Games

      b1          b2
a1  (10, −10)   (−5, 6)
a2  (−8, 4)     (6, −4)

[Figure: expected payoff of each follower response (b1, b2) as a function of the leader's mixed strategy x1, plotted with the leader's payoffs (left) and the follower's payoffs (right); the optimal commitment is x1* = 1/3.]

Page 7: Solving Stackelberg Equilibrium in Stochastic Games

Multiple States

State s1:
          b1                      b2
a1  (1/2, 1/2)  (10, −10)    (0, 1)  (−5, 6)
a2  (1/4, 3/4)  (−8, 4)      (1, 0)  (6, −4)

State s2:
          b1                      b2
a1  (1/2, 1/2)  (7, −5)      (0, 1)  (−1, 6)
a2  (1/4, 3/4)  (−3, 10)     (1, 0)  (2, −10)

Each cell lists the transition probabilities to (s1, s2) followed by the (leader, follower) payoff pair.

Page 8: Solving Stackelberg Equilibrium in Stochastic Games

Stochastic Games - Definition

$G = (S, A, B, Q, r_A, r_B, \beta_A, \beta_B, \tau)$

$s_0$: Player A chooses $f_0$; Player B observes $f_0$ and chooses $g_0$. The game moves to $s_1$ with probability $Q_{f_0 g_0}(s_1 | s_0)$; there Player A chooses $f_1$, Player B observes $f_1$ and chooses $g_1$, and so on to $s_2 \cdots$

Feedback policies: $\pi = \pi(s, t) = \{f_0, \ldots, f_\tau\}$

Stationary policies: $\pi = \pi(s) = \{f, f, \ldots\}$
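For concreteness, a minimal container for the tuple G might look as follows in Python; the field names and the array layout Q[s, a, b, z] are our own choices, not notation from the talk.

import numpy as np
from dataclasses import dataclass

@dataclass
class StochasticGame:
    """Container for G = (S, A, B, Q, rA, rB, betaA, betaB, tau); field names
    and the array layout Q[s, a, b, z] are our own illustration."""
    n_states: int        # |S|
    n_a: int             # |A|, leader actions per state
    n_b: int             # |B|, follower actions per state
    Q: np.ndarray        # Q[s, a, b, z] = probability of moving from s to z
    rA: np.ndarray       # rA[s, a, b], leader's reward
    rB: np.ndarray       # rB[s, a, b], follower's reward
    betaA: float = 0.9   # leader's discount factor
    betaB: float = 0.9   # follower's discount factor
    tau: float = float("inf")  # horizon (infinite-horizon case by default)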


Page 10: Solving Stackelberg Equilibrium in Stochastic Games


Framework

General Objectives

Existence and characterization of value functions.

Existence of equilibrium strategies.

Algorithms to compute them.

State of the Art

For finite horizon: Stackelberg equilibrium in stochastic games via dynamic programming.

Mathematical programming approaches to compute stationary values.

Page 11: Solving Stackelberg Equilibrium in Stochastic Games


Framework

Contributions in Infinite horizon

We define suitable Dynamic Programming operators.

We use them to characterize value functions and to prove existence and uniqueness of stationary policies forming a Strong Stackelberg Equilibrium for a family of problems.

We define Value Iteration and Policy Iteration for this family and prove their convergence.

We prove via counterexample that this methodology is not always applicable in the general case.

Page 12: Solving Stackelberg Equilibrium in Stochastic Games

Stackelberg equilibrium

$(\pi, \gamma) \longrightarrow$ Value Functions:

$v_A^{\pi,\gamma}(s) = \mathbb{E}_s^{\pi,\gamma}\left[ \sum_{t=0}^{\tau} \beta_A^t \, r_A^{A_t,B_t}(S_t) \right]$

$v_B^{\pi,\gamma}(s) = \mathbb{E}_s^{\pi,\gamma}\left[ \sum_{t=0}^{\tau} \beta_B^t \, r_B^{A_t,B_t}(S_t) \right]$

Stackelberg Equilibrium $(\pi^*, \gamma^*)$:

$v_A^{\pi^*,\gamma^*}(s) = \max_{\pi} v_A^{\pi,\gamma^*(\pi)}(s)$, where $\gamma^*(\pi) \in \arg\max_{\gamma} v_B^{\pi,\gamma}(s)$


Page 14: Solving Stackelberg Equilibrium in Stochastic Games

Myopic Follower Strategies

Best response functional:

$g(f, v_B) = \arg\max_{b \in B_s} \sum_{a \in A_s} f(a) \left[ r_B^{ab}(s) + \beta_B \sum_{z \in S} Q^{ab}(z|s) \, v_B(z) \right]$

Myopic follower strategies (MFS): $g(f, v_B) = g(f)$

Two important cases:

Myopic follower: $\beta_B = 0$

Leader-Controller Discounted Games: $Q^{ab}(z|s) = Q^{a}(z|s)$
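A direct transcription of the best-response functional, assuming the StochasticGame container sketched earlier; here ties among maximizers are left to argmax, whereas an SSE computation would break them in the leader's favor.

import numpy as np

def follower_best_response(game, s, f, vB):
    """Best-response functional g(f, vB) at state s: the follower action
    maximizing expected immediate reward plus discounted continuation value.
    `game` is the StochasticGame container sketched earlier."""
    values = np.array([
        sum(f[a] * (game.rB[s, a, b] + game.betaB * game.Q[s, a, b] @ vB)
            for a in range(game.n_a))
        for b in range(game.n_b)
    ])
    return int(np.argmax(values))  # argmax picks the first maximizer on ties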


Page 17: Solving Stackelberg Equilibrium in Stochastic Games

Myopic Follower Strategies

Let $f$ be a stationary policy, and define $T_A^f : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ by

$T_A^f(v_A)(s) = \sum_{a \in A_s} f(a) \left[ r_A^{a\,g(f)}(s) + \beta_A \sum_{z \in S} Q^{a\,g(f)}(z|s) \, v_A(z) \right]$

Operator for the MFS case:

$T_A(v_A)(s) = \max_{f \in \mathcal{P}(A_s)} T_A^f(v_A)(s)$    (1)
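One way to evaluate $T_A$ at a single state, sketched here for the myopic case $\beta_B = 0$: the maximization over $f \in \mathcal{P}(A_s)$ decomposes into one small LP per follower response b, with b constrained to be a best response to f; taking the max over b also implements the leader-favorable tie-break. This is our reading of the operator, not code from the talk.

import numpy as np
from scipy.optimize import linprog

def TA_myopic(game, vA, s):
    """One application of the MFS operator T_A at state s (myopic case
    beta_B = 0): one LP per follower response b, with b constrained to be
    a best response to f; the max over b favors the leader on ties."""
    best = -np.inf
    for b in range(game.n_b):
        # leader coefficients: immediate reward plus discounted continuation
        cA = np.array([game.rA[s, a, b] + game.betaA * game.Q[s, a, b] @ vA
                       for a in range(game.n_a)])
        # best-response constraints: sum_a f(a) (rB[s,a,b'] - rB[s,a,b]) <= 0
        others = [bp for bp in range(game.n_b) if bp != b]
        A_ub = (np.array([[game.rB[s, a, bp] - game.rB[s, a, b]
                           for a in range(game.n_a)] for bp in others])
                if others else None)
        b_ub = np.zeros(len(others)) if others else None
        res = linprog(-cA, A_ub=A_ub, b_ub=b_ub,            # linprog minimizes
                      A_eq=np.ones((1, game.n_a)), b_eq=[1.0],
                      bounds=[(0, 1)] * game.n_a)
        if res.success:
            best = max(best, -res.fun)
    return best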


Page 19: Solving Stackelberg Equilibrium in Stochastic Games

Myopic Follower Strategies

Theorem 1.

a) $T_A^f$ and $T_A$ are monotone.

b) For any stationary strategy $f$, the operator $T_A^f$ is a contraction on $(\mathbb{R}^{|S|}, \|\cdot\|_\infty)$ of modulus $\beta_A$.

c) The operator $T_A$ is a contraction on $(\mathbb{R}^{|S|}, \|\cdot\|_\infty)$ of modulus $\beta_A$.

Theorem 2.

There exists an equilibrium value function $v_A^*$, and it is the unique solution of $v_A^* = T_A(v_A^*)$. Moreover, the pair $f^*$ and $g(f^*)$ that maximizes the RHS of (1) gives the equilibrium strategies.

Page 20: Solving Stackelberg Equilibrium in Stochastic Games

Myopic Follower Strategies

Algorithm 1 Value function iteration: infinite horizon

Require: ε > 0
1: Initialize with n = 1, $v_A^0(s) = 0$ for every s ∈ S, and $v_A^1 = T_A(v_A^0)$
2: while $\|v_A^n - v_A^{n-1}\|_\infty > \varepsilon$ do
3:   Compute $v_A^{n+1}$ by $v_A^{n+1}(s) = T_A(v_A^n)(s)$, finding $f^*$ and $g(f^*)$ at stage n
4:   n := n + 1
5: end while
6: return Stationary Stackelberg policies $\pi^* = \{f^*, \ldots\}$ and $\gamma^* = \{g^*, \ldots\}$
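A compact version of Algorithm 1, sketched on top of the TA_myopic helper above (our own naming):

import numpy as np

def value_iteration(game, eps=1e-6, max_iter=10_000):
    """Algorithm 1, sketched on top of TA_myopic above."""
    vA = np.zeros(game.n_states)
    for _ in range(max_iter):
        vA_next = np.array([TA_myopic(game, vA, s) for s in range(game.n_states)])
        if np.max(np.abs(vA_next - vA)) <= eps:  # sup-norm stopping test
            return vA_next
        vA = vA_next
    return vA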

Page 21: Solving Stackelberg Equilibrium in Stochastic Games

Myopic Follower Strategies

Theorem 3.

The sequence of value functions $v_A^n$ converges to $v_A^*$. Furthermore, $v_A^*$ is the fixed point of $T_A$, with the bound

$\|v_A^* - v_A^n\|_\infty \le \dfrac{\|r_A\|_\infty \, \beta_A^n}{1 - \beta_A}.$
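In particular, the bound yields an a priori iteration count: to guarantee $\|v_A^* - v_A^n\|_\infty \le \varepsilon$ it suffices that

$\beta_A^n \le \frac{\varepsilon (1-\beta_A)}{\|r_A\|_\infty}, \quad \text{i.e.} \quad n \ge \frac{\log\big(\varepsilon (1-\beta_A) / \|r_A\|_\infty\big)}{\log \beta_A},$

where taking logarithms flips the inequality because $\log \beta_A < 0$.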

Page 22: Solving Stackelberg Equilibrium in Stochastic Games

Policy Iteration - MFS

Begin with $f^0$ and $g(f^0)$ (e.g. $f^0 = \frac{1}{|A|}$, the uniform distribution).

Compute $u_{A,0}$, the fixed point of $T_A^{f^0}$: $u_{A,0} = T_A^{f^0}(u_{A,0})$.

Find $f^1$ such that $T_A^{f^1}(u_{A,0}) = T_A(u_{A,0})$.

Compute $u_{A,1} = T_A^{f^1}(u_{A,1})$.

. . .

Repeat until convergence.

Theorem 4.

The sequence of functions $u_{A,n}$ satisfies $u_{A,n} \uparrow v_A^*$. Moreover, if for some $n \in \mathbb{N}$, $u_{A,n} = u_{A,n+1}$, then $u_{A,n} = v_A^*$.


Page 24: Solving Stackelberg Equilibrium in Stochastic Games

Policy Iteration - MFS

Algorithm 2 Policy Iteration (PI)

1: Choose a stationary Stackelberg pair $(f_0, g(f_0))$.
2: while $\|u_{A,n} - u_{A,n+1}\| > \varepsilon$ do
3:   Evaluation phase: find $u_{A,n}$, the fixed point of the operator $T_A^{f_n}$.
4:   Improvement phase: find a strategy $f_{n+1}$ such that $T_A^{f_{n+1}}(u_{A,n}) = T_A(u_{A,n})$.
5:   n := n + 1
6: end while
7: return Stationary Stackelberg policies $\pi^* = \{f^*, \ldots\}$ and $\gamma^* = \{g(f^*), \ldots\}$
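The evaluation phase amounts to solving a linear system, since $T_A^f$ is affine in $v_A$; a sketch under the same container assumptions as before:

import numpy as np

def evaluate_policy(game, f, g):
    """Evaluation phase of Algorithm 2: for fixed stationary f and follower
    response g, u_A is the fixed point of the affine map T_A^f, i.e. the
    solution of the linear system (I - beta_A * P_{f,g}) u = r_{f,g}."""
    n = game.n_states
    P = np.zeros((n, n))  # state-to-state transition matrix induced by (f, g)
    r = np.zeros(n)       # leader's expected one-step reward under (f, g)
    for s in range(n):
        for a in range(game.n_a):
            r[s] += f[s][a] * game.rA[s, a, g[s]]
            P[s] += f[s][a] * game.Q[s, a, g[s]]
    return np.linalg.solve(np.eye(n) - game.betaA * P, r)

The improvement phase can then reuse the per-state LP from TA_myopic to pick $f_{n+1}$ attaining $T_A(u_{A,n})$.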

Page 25: Solving Stackelberg Equilibrium in Stochastic Games

Computational Results - MFS

[Figure: solution time in seconds of Value Iteration and Policy Iteration for |S| ∈ {2, 10, 25, 50, 100, 200}.]

Page 26: Solving Stackelberg Equilibrium in Stochastic Games

Computational Results - MFS

[Figure: solution time in seconds of Value Iteration and Policy Iteration for |A| ∈ {2, 10, 25, 50, 100, 200}.]

Page 27: Solving Stackelberg Equilibrium in Stochastic Games

Computational Results - MFS

[Figure: solution time in seconds of Value Iteration and Policy Iteration for |B| ∈ {2, 10, 25, 50, 100, 200}.]

Page 28: Solving Stackelberg Equilibrium in Stochastic Games

General Case

Let $f$ and $g$ be fixed stationary policies, and define $T_i^{f,g} : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$, $i \in \{A, B\}$, by

$T_i^{f,g}(v_i)(s) = \sum_{a \in A_s} f(a) \sum_{b \in B_s} g(b) \left[ r_i^{ab}(s) + \beta_i \sum_{z \in S} Q^{ab}(z|s) \, v_i(z) \right]$

Operator for the General case: $T : \mathbb{R}^{|S|} \times \mathbb{R}^{|S|} \to \mathbb{R}^{|S|} \times \mathbb{R}^{|S|}$

$(T(v_A, v_B))(s) = \left( \max_{f \in \mathcal{P}(A_s)} T_A^{f, g(f, v_B)}(v_A)(s), \; T_B^{f^*, g(f^*, v_B)}(v_B)(s) \right)$,

where $f^*$ attains the maximum in the first component.
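The same LP decomposition used for the myopic case extends to the general operator if the follower's best response is measured against the continuation values $v_B$; a sketch (our reading of T, with the leader-favorable max over b):

import numpy as np
from scipy.optimize import linprog

def T_general(game, vA, vB, s):
    """One application of the general-case operator T at state s (a sketch):
    per follower response b, an LP over f with b constrained to be a best
    response against the continuation values vB; the max over b again breaks
    ties in the leader's favor. Returns the updated (vA(s), vB(s))."""
    bestA, bestB = -np.inf, None
    for b in range(game.n_b):
        cA = np.array([game.rA[s, a, b] + game.betaA * game.Q[s, a, b] @ vA
                       for a in range(game.n_a)])
        cB = np.array([game.rB[s, a, b] + game.betaB * game.Q[s, a, b] @ vB
                       for a in range(game.n_a)])
        cB_other = [np.array([game.rB[s, a, bp] + game.betaB * game.Q[s, a, bp] @ vB
                              for a in range(game.n_a)])
                    for bp in range(game.n_b) if bp != b]
        A_ub = np.array([c - cB for c in cB_other]) if cB_other else None
        b_ub = np.zeros(len(cB_other)) if cB_other else None
        res = linprog(-cA, A_ub=A_ub, b_ub=b_ub,
                      A_eq=np.ones((1, game.n_a)), b_eq=[1.0],
                      bounds=[(0, 1)] * game.n_a)
        if res.success and -res.fun > bestA:
            bestA, bestB = -res.fun, float(res.x @ cB)
    return bestA, bestB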

Page 29: Solving Stackelberg Equilibrium in Stochastic Games

Algorithm

Algorithm 3 Value Iteration (VI): finite horizon for the general case

1: Initialize with $v_A^{\tau+1}(s) = v_B^{\tau+1}(s) = 0$ for every s ∈ S
2: for t = τ, . . . , 0 do
3:   Solve $(v_A^t(s), v_B^t(s)) = T(v_A^{t+1}, v_B^{t+1})(s)$ for every s ∈ S, finding the SSE strategies $f_t^*$ and $g_t^*$ at stage t
4: end for
5: return Stackelberg policies $\pi^* = \{f_0^*, \ldots, f_\tau^*\}$ and $\gamma^* = \{g_0^*, \ldots, g_\tau^*\}$

Page 30: Solving Stackelberg Equilibrium in Stochastic Games

Example

State s1:
          b1                      b2
a1  (1/2, 1/2)  (10, −10)    (0, 1)  (−5, 6)
a2  (1/4, 3/4)  (−8, 4)      (1, 0)  (6, −4)

State s2:
          b1                      b2
a1  (1/2, 1/2)  (7, −5)      (0, 1)  (−1, 6)
a2  (1/4, 3/4)  (−3, 10)     (1, 0)  (2, −10)

$\beta_A = \beta_B = 0.9$

Page 31: Solving Stackelberg Equilibrium in Stochastic Games

Example

[Figure: value-function iterates over 60 stages starting from the initial value pairs (0, 0), (100.0, 100.0), (−100.0, −100.0), (−50.0, −50.0), and (50.0, 50.0).]

Page 32: Solving Stackelberg Equilibrium in Stochastic Games

Counterexample

[Figure: leader and follower value-function iterates over 50 stages, in state s1 (left) and state s2 (right).]

Page 33: Solving Stackelberg Equilibrium in Stochastic Games

Counterexample

[Figure: leader and follower expected payoffs as functions of x1 in states s1 and s2, at iterations 14 and 15. At iteration 14 the leader's maximizers are x1* = 0.552 (state s1) and x1* = 0.633 (state s2); at iteration 15 they jump to x1* = 1 (state s1) and x1* = 0.692 (state s2).]

Page 34: Solving Stackelberg Equilibrium in Stochastic Games

Computational Results - General Instances

Algorithm 4 VI modified: infinite horizon for the general case

1: Initialize with n = 0, $v_A^0(s) = v_B^0(s) = 0$ for every s ∈ S.
2: for n = 1, . . . , MAX_IT do
3:   Find the pair $(v_A^n, v_B^n)$ by $(v_A^n, v_B^n)(s) = T(v_A^{n-1}, v_B^{n-1})(s)$, finding the SSE strategies $f^*$ and $g^*$ at stage n − 1.
4:   if $(v_A^n, v_B^n) = (v_A^{n-1}, v_B^{n-1})$ then
5:     return $(v_A^n, v_B^n)$, fixed point of T.
6:   end if
7:   if $\|(v_A^n, v_B^n) - (v_A^{n-1}, v_B^{n-1})\| > \frac{2 \beta^{n-1}}{1 - \beta} \|(r_A, r_B)\|$ then
8:     return UNDEFINED 1.
9:   end if
10: end for
11: return UNDEFINED 2.
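A sketch of Algorithm 4 on top of T_general above; we take the sup norm over the pair and β = max{βA, βB} (assumptions on our part, as the slide does not fix these choices):

import numpy as np

def modified_vi(game, max_it=1000):
    """Algorithm 4, sketched on top of T_general above. We take the sup norm
    over the pair and beta = max(betaA, betaB); the slide does not fix these
    choices explicitly."""
    beta = max(game.betaA, game.betaB)
    r_norm = max(np.abs(game.rA).max(), np.abs(game.rB).max())
    vA, vB = np.zeros(game.n_states), np.zeros(game.n_states)
    for n in range(1, max_it + 1):
        out = [T_general(game, vA, vB, s) for s in range(game.n_states)]
        vA_n = np.array([o[0] for o in out])
        vB_n = np.array([o[1] for o in out])
        diff = max(np.abs(vA_n - vA).max(), np.abs(vB_n - vB).max())
        if diff == 0.0:
            return vA_n, vB_n                # fixed point of T
        if diff > 2 * beta ** (n - 1) * r_norm / (1 - beta):
            return "UNDEFINED 1"             # geometric (contraction) bound violated
        vA, vB = vA_n, vB_n
    return "UNDEFINED 2"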

Page 35: Solving Stackelberg Equilibrium in Stochastic Games

Security Games

$r_A^{ab}(s) = \begin{cases} R_A(b) > 0 & \text{if } b = a \\ P_A(b) < 0 & \text{otherwise} \end{cases}$    $r_B^{ab}(s) = \begin{cases} P_B(b) < 0 & \text{if } b = a \\ R_B(b) > 0 & \text{otherwise} \end{cases}$

Non-pure strategies seem to be optimal for the leader.

Computationally, VI converges with the geometric bound on all Security game instances.

Conjecture

For every Security game with this payoff structure, the operator $T$ is $\beta$-contractive, with $\beta = \max\{\beta_A, \beta_B\}$.


Page 37: Solving Stackelberg Equilibrium in Stochastic Games

Computational Results - General Instances

Figure: Performance of VI and PI on randomly generated general instances; solution time in seconds for |S| ∈ {2, 10, 25, 50, 100, 200}.

Page 38: Solving Stackelberg Equilibrium in Stochastic Games

Computational Results - General Instances

Figure: Performance of VI and PI on randomly generated general instances; solution time in seconds for |A| ∈ {2, 10, 25, 50, 100, 200}.

Page 39: Solving Stackelberg Equilibrium in Stochastic Games

Computational Results - General Instances

Figure: Performance of VI and PI on randomly generated general instances; solution time in seconds for |B| ∈ {2, 10, 25, 50, 100, 200}.

Page 40: Solving Stackelberg Equilibrium in Stochastic Games

Computational Results - % UNDEFINED

Figure: Percentage of instances where VI returns UNDEFINED, for |S| ∈ {2, 10, 25, 50, 100, 200}.

Page 41: Solving Stackelberg Equilibrium in Stochastic Games

Computational Results - % UNDEFINED

Figure: Percentage of instances where VI returns UNDEFINED, for |A| ∈ {2, 10, 25, 50, 100, 200}.

Page 42: Solving Stackelberg Equilibrium in Stochastic Games

Computational Results - % UNDEFINED

Figure: Percentage of instances where VI returns UNDEFINED, for |B| ∈ {2, 10, 25, 50, 100, 200}.

Page 43: Solving Stackelberg Equilibrium in Stochastic Games

Conclusions

We define suitable Dynamic Programming operators.

We use them to characterize value functions and to prove existence and uniqueness of stationary policies forming a Strong Stackelberg Equilibrium for a family of problems.

We define Value Iteration and Policy Iteration for this family and prove their convergence.

We show via counterexample that this methodology is not always applicable in the general case.

We study security games and conjecture that the operators for this type of game are contractive.

Page 44: Solving Stackelberg Equilibrium in Stochastic Games

Future Work

We aim to prove the convergence of the VI procedure for security games.

Rolling horizon techniques.

Applicability of Approximate Dynamic Programming techniques.

To formalize and understand the behavior of cyclic policies forming strong Stackelberg equilibria.

Page 45: Solving Stackelberg Equilibrium in Stochastic Games

Thank you!

Víctor Bucarey López

[email protected]

FCEIA - UNR - Rosario
November 2nd, 2017

Page 46: Solving Stackelberg Equilibrium in Stochastic Games

References

1. Tansu Alpcan and Tamer Başar. Stochastic security games, pages 74-97. Cambridge University Press, 2010.

2. Tamer Başar and Geert Jan Olsder. Dynamic Noncooperative Game Theory, volume 200. SIAM, 1995.

3. Francesco Maria Delle Fave, Albert Xin Jiang, Zhengyu Yin, Chao Zhang, Milind Tambe, Sarit Kraus, and John P. Sullivan. Game-theoretic patrolling with dynamic execution uncertainty and a case study on a real transit system. Journal of Artificial Intelligence Research, 2014.

4. Jerzy Filar and Koos Vrieze. Competitive Markov Decision Processes. Springer Science & Business Media, 2012.

5. Yevgeniy Vorobeychik, Bo An, Milind Tambe, and Satinder Singh. Computing solutions in infinite-horizon discounted adversarial patrolling games. In Proc. 24th International Conference on Automated Planning and Scheduling (ICAPS 2014), June 2014.

6. Yevgeniy Vorobeychik and Satinder Singh. Computing Stackelberg equilibria in discounted stochastic games (corrected version). 2012.

Page 47: Solving Stackelberg Equilibrium in Stochastic Games

Counterexample

State s1:
        b1                 b2
a1  (1, 0)  (1, −1)    (0, 1)  (0, 1)
a2  (0, 1)  (−1, 1)    (0, 1)  (−1, −1)

State s2:
        b1                 b2
a1  (0, 1)  (−1, 0)    (1, 0)  (0, 1)
a2  (1, 0)  (0, 1)     (0, 1)  (1, −1)

Table: Transition matrix and payoffs for each player in numerical example 2.

Page 48: Solving Stackelberg Equilibrium in Stochastic Games

Back-up slides: Stochastic games
