Solving Stackelberg Equilibrium in Stochastic Games
Víctor Bucarey López
FCEIA - UNR - Rosario
November 2nd, 2017
Stackelberg Game

Leader (commits to π) −→ Follower (responds with γ(π))

Strong Stackelberg Equilibrium:
The leader commits to a payoff-maximizing strategy.
The follower best responds.
The follower breaks ties in favor of the leader.
Example

        b1          b2
a1   (10, −10)   (−5, 6)
a2   (−8, 4)     (6, −4)

MIP formulation (y_b = 1 selects the follower's pure best response b; M is a large constant):

max  vA
s.t. vA ≤ 10x1 − 8x2 + M(1 − y1)
     vA ≤ −5x1 + 6x2 + M(1 − y2)
     0 ≤ vB − (−10x1 + 4x2) ≤ M(1 − y1)
     0 ≤ vB − (6x1 − 4x2) ≤ M(1 − y2)
     x1 + x2 = 1,  y1 + y2 = 1
     x ≥ 0,  y ∈ {0, 1}
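Since y selects the follower's pure best response, the MIP can equivalently be decomposed into one LP per follower action, keeping the best. A minimal sketch of that decomposition (Python with scipy is our choice here, not part of the talk):

import numpy as np
from scipy.optimize import linprog

R_A = np.array([[10, -5], [-8, 6]])    # leader payoffs, rows a1, a2
R_B = np.array([[-10, 6], [4, -4]])    # follower payoffs

best = None
for b in range(2):                      # fix the follower's response b
    # b is a best response to x iff (R_B[:, b'] - R_B[:, b]) . x <= 0 for all b'
    cons = np.array([R_B[:, bp] - R_B[:, b] for bp in range(2) if bp != b])
    res = linprog(-R_A[:, b],           # maximize the leader's expected payoff
                  A_ub=cons, b_ub=np.zeros(1),
                  A_eq=np.ones((1, 2)), b_eq=[1.0], bounds=[(0, 1), (0, 1)])
    if res.success and (best is None or -res.fun > best[0]):
        best = (-res.fun, res.x, b)

print(best)   # leader value 7/3 at x = (1/3, 2/3), follower plays b2

The optimum x1* = 1/3 matches the plot on the next slide.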
[Figure: expected payoff of the leader (left) and the follower (right) against b1 and b2, as a function of x1 ∈ [0, 1]; the leader's optimal commitment is x1* = 1/3.]
Multiple States

State s1 (each cell: transition probabilities to (s1, s2), then payoffs (rA, rB)):
        b1                      b2
a1   (1/2, 1/2): (10, −10)   (0, 1): (−5, 6)
a2   (1/4, 3/4): (−8, 4)     (1, 0): (6, −4)

State s2:
        b1                      b2
a1   (1/2, 1/2): (7, −5)     (0, 1): (−1, 6)
a2   (1/4, 3/4): (−3, 10)    (1, 0): (2, −10)
Stochastic Games - Definition

G = (S, A, B, Q, rA, rB, βA, βB, τ)

At s0, Player A chooses f0 and Player B observes f0 and chooses g0; the game moves to s1 with probability Q^{f0 g0}(s1 | s0), where Player A chooses f1, Player B observes f1 and chooses g1, and so on to s2, ...

Feedback policies: π = π(s, t) = {f0, . . . , fτ}
Stationary policies: π = π(s) = {f, f, . . .}
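For concreteness, here is one possible encoding of such a game as arrays (our own layout, reused in the sketches below, not the talk's code):

import numpy as np
from dataclasses import dataclass

@dataclass
class StochasticGame:
    rA: np.ndarray   # (S, A, B) leader stage payoffs r_A^{ab}(s)
    rB: np.ndarray   # (S, A, B) follower stage payoffs r_B^{ab}(s)
    Q:  np.ndarray   # (S, A, B, S) transition kernel Q^{ab}(z | s)
    betaA: float     # leader discount factor
    betaB: float     # follower discount factor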
Framework

General Objectives
Existence and characterization of value functions.
Existence of equilibrium strategies.
Algorithms to compute them.

State of the Art
For finite horizon, Stackelberg equilibria in stochastic games via dynamic programming.
Mathematical programming approaches to compute stationary values.
Framework

Contributions in infinite horizon
We define suitable dynamic programming operators.
We use them to characterize value functions and to prove existence and uniqueness of stationary policies forming a Strong Stackelberg Equilibrium for a family of problems.
We define Value Iteration and Policy Iteration for this family and prove their convergence.
We show via a counterexample that this methodology is not always applicable in the general case.
Stackelberg equilibrium

A strategy pair (π, γ) induces the value functions

$$v_A^{\pi,\gamma}(s) = \mathbb{E}_s^{\pi,\gamma}\left[\sum_{t=0}^{\tau} \beta_A^t\, r_A^{A_t,B_t}(S_t)\right], \qquad v_B^{\pi,\gamma}(s) = \mathbb{E}_s^{\pi,\gamma}\left[\sum_{t=0}^{\tau} \beta_B^t\, r_B^{A_t,B_t}(S_t)\right]$$

Stackelberg Equilibrium: a pair (π*, γ*) such that

$$v_A^{\pi^*,\gamma^*}(s) = \max_{\pi} v_A^{\pi,\gamma^*}(s), \qquad \gamma^* \in \arg\max_{\gamma} v_B^{\pi,\gamma}(s)$$
Myopic Follower Strategies

Best-response functional:

$$g(f, v_B) = \arg\max_{b \in B_s} \sum_{a \in A_s} f(a)\left[ r_B^{ab}(s) + \beta_B \sum_{z \in S} Q^{ab}(z\,|\,s)\, v_B(z) \right]$$

Myopic follower strategies (MFS): g(f, v_B) = g(f), i.e. the best response does not depend on the follower's continuation values.

Two important cases:
Myopic follower: βB = 0.
Leader-Controller Discounted Games: Q^{ab}(z | s) = Q^{a}(z | s).
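As a sketch (array names and layout are ours, matching the encoding above), the best-response functional is a direct translation of the formula:

import numpy as np

def follower_best_response(f, vB, rB, Q, betaB, s):
    """Follower actions maximizing expected payoff against leader mix f at state s.
    f: (A,); vB: (S,); rB: (S, A, B); Q: (S, A, B, S)."""
    vals = f @ (rB[s] + betaB * Q[s] @ vB)               # (B,) expected payoff per b
    return np.flatnonzero(np.isclose(vals, vals.max()))  # argmax set, ties kept

Under MFS this set does not change with vB, which is what makes the single-operator analysis below go through.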
Myopic Follower Strategies

For f a stationary policy, define T_A^f : R^{|S|} → R^{|S|} by

$$T_A^f(v_A)(s) = \sum_{a \in A_s} f(a)\left[ r_A^{a\,g(f)}(s) + \beta_A \sum_{z \in S} Q^{a\,g(f)}(z\,|\,s)\, v_A(z) \right]$$

Operator for the MFS case:

$$T_A(v_A)(s) = \max_{f \in P(A_s)} T_A^f(v_A)(s) \qquad (1)$$
Myopic Follower Strategies

Theorem 1.
a) T_A^f and T_A are monotone.
b) For any stationary strategy f, the operator T_A^f is a contraction on (R^{|S|}, ||·||∞) of modulus βA.
c) The operator T_A is a contraction on (R^{|S|}, ||·||∞) of modulus βA.

Theorem 2.
There exists an equilibrium value function v_A* and it is the unique solution of v_A* = T_A(v_A*). Moreover, the pair f* and g(f*) that maximizes the RHS of (1) gives the equilibrium strategies.
Myopic Follower Strategies

Algorithm 1 Value function iteration: infinite horizon
Require: ε > 0
1: Initialize with n = 1, v_A^0(s) = 0 for every s ∈ S, and v_A^1 = T_A(v_A^0)
2: while ||v_A^n − v_A^{n−1}||∞ > ε do
3:   Compute v_A^{n+1} by v_A^{n+1}(s) = T_A(v_A^n)(s), finding f* and g(f*) at stage n.
4:   n := n + 1
5: end while
6: return stationary Stackelberg policies π* = {f*, f*, . . .} and γ* = {g*, g*, . . .}
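A sketch of Algorithm 1's loop, with the operator passed in as a callable; any implementation of (1) works, e.g. one LP per candidate follower response per state as in the matrix-game example:

import numpy as np

def value_iteration(TA, n_states, eps=1e-6, max_iter=100_000):
    v = np.zeros(n_states)                       # v_A^0 = 0
    for _ in range(max_iter):
        v_next = TA(v)                           # v_A^{n+1} = T_A(v_A^n)
        if np.max(np.abs(v_next - v)) <= eps:    # sup-norm stopping rule
            return v_next
        v = v_next
    raise RuntimeError("tolerance not reached within max_iter")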
Myopic Follower Strategies

Theorem 3.
The sequence of value functions v_A^n converges to v_A*. Furthermore, v_A* is the fixed point of T_A, with the bound

$$\|v_A^* - v_A^n\|_\infty \le \frac{\|r_A\|_\infty\, \beta_A^n}{1 - \beta_A}.$$
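For scale (our numbers, not from the talk): with βA = 0.9 and ||rA||∞ = 10 the bound is 100 · 0.9^n, so certifying an error below 10⁻⁴ requires 0.9^n ≤ 10⁻⁶, i.e. n ≥ 6 ln 10 / ln(1/0.9) ≈ 132 iterations.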
Policy Iteration - MFS

Begin with f^0 and g(f^0) (e.g. the uniform strategy f^0 = 1/|A|).
Compute u_{A,0} as the fixed point of T_A^{f_0}: u_{A,0} = T_A^{f_0}(u_{A,0}).
Find f_1 such that T_A^{f_1}(u_{A,0}) = T_A(u_{A,0}).
Compute u_{A,1} = T_A^{f_1}(u_{A,1}).
Repeat until convergence.

Theorem 4.
The sequence of functions u_{A,n} satisfies u_{A,n} ↑ v_A*. Moreover, if u_{A,n} = u_{A,n+1} for some n ∈ N, then u_{A,n} = v_A*.
Policy Iteration - MFS

Algorithm 2 Policy Iteration (PI)
1: Choose a stationary Stackelberg pair (f_0, g(f_0)).
2: while ||u_{A,n} − u_{A,n+1}|| > ε do
3:   Evaluation phase: find u_{A,n}, the fixed point of the operator T_A^{f_n}.
4:   Improvement phase: find a strategy f_{n+1} such that T_A^{f_{n+1}}(u_{A,n}) = T_A(u_{A,n}).
5:   n := n + 1
6: end while
7: return stationary Stackelberg policies π* = {f*, . . .} and γ* = {g(f*), . . .}
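The evaluation phase need not iterate: since T_A^{f_n} is affine, its fixed point solves a linear system. A sketch (P_fg and r_fg denote the state-transition matrix and expected stage payoff induced by (f_n, g(f_n)); these inputs are our assumption about the encoding):

import numpy as np

def evaluate_policy(r_fg, P_fg, betaA):
    """Fixed point of u = r_fg + betaA * P_fg @ u, i.e. solve (I - betaA P_fg) u = r_fg."""
    S = len(r_fg)
    return np.linalg.solve(np.eye(S) - betaA * P_fg, r_fg)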
Computational Results - MFS

[Figure: solution time in seconds of Value Iteration vs. Policy Iteration as |S| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - MFS

[Figure: solution time in seconds of Value Iteration vs. Policy Iteration as |A| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - MFS

[Figure: solution time in seconds of Value Iteration vs. Policy Iteration as |B| ranges over 2, 10, 25, 50, 100, 200.]
General Case

For f and g fixed stationary policies, define T_i^{f,g} : R^{|S|} → R^{|S|}, i ∈ {A, B}, by

$$T_i^{f,g}(v_i)(s) = \sum_{a \in A_s} f(a) \sum_{b \in B_s} g(b)\left[ r_i^{ab}(s) + \beta_i \sum_{z \in S} Q^{ab}(z\,|\,s)\, v_i(z) \right]$$

Operator for the general case, T : R^{|S|} × R^{|S|} → R^{|S|} × R^{|S|}:

$$(T(v_A, v_B))(s) = \left( \max_{f \in P(A_s)} T_A^{f,\, g(f, v_B)}(v_A)(s),\; T_B^{f^*,\, g(f^*, v_B)}(v_B)(s) \right)$$

where f* attains the maximum in the first component.
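One application of T at a state reduces to a one-shot SSE on payoff matrices augmented with continuation values. A sketch reusing the multiple-LPs idea from the matrix-game example (array layout as before; this is our reduction, not the talk's code):

import numpy as np
from scipy.optimize import linprog

def apply_T(s, vA, vB, rA, rB, Q, betaA, betaB):
    UA = rA[s] + betaA * Q[s] @ vA     # (A, B): leader stage + continuation payoff
    UB = rB[s] + betaB * Q[s] @ vB     # (A, B): follower counterpart
    A, B = UA.shape
    best = None
    for b in range(B):                 # candidate follower best response
        cons = np.array([UB[:, bp] - UB[:, b] for bp in range(B) if bp != b])
        res = linprog(-UA[:, b],
                      A_ub=cons if B > 1 else None,
                      b_ub=np.zeros(B - 1) if B > 1 else None,
                      A_eq=np.ones((1, A)), b_eq=[1.0], bounds=[(0, 1)] * A)
        if res.success and (best is None or -res.fun > best[0]):
            best = (-res.fun, float(res.x @ UB[:, b]), res.x, b)
    vA_s, vB_s, f, b = best            # max over b breaks ties in the leader's favor
    return vA_s, vB_s, f, b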
Algorithm

Algorithm 3 Value Iteration (VI): finite horizon for the general case
1: Initialize with v_A^{τ+1}(s) = v_B^{τ+1}(s) = 0 for every s ∈ S
2: for t = τ, . . . , 0 do
3:   Solve (v_A^t(s), v_B^t(s)) = T(v_A^{t+1}, v_B^{t+1})(s) for all s ∈ S, finding the SSE strategies f_t* and g_t* at stage t.
4: end for
5: return Stackelberg policies π* = {f_0*, . . . , f_τ*} and γ* = {g_0*, . . . , g_τ*}
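The backward loop of Algorithm 3 as a sketch; stage_solver is any per-state SSE solver with the interface of apply_T from the previous sketch:

import numpy as np

def finite_horizon_vi(rA, rB, Q, betaA, betaB, tau, stage_solver):
    S = rA.shape[0]
    vA, vB = np.zeros(S), np.zeros(S)          # v^{tau+1} = 0
    policies = []
    for t in range(tau, -1, -1):               # backward induction over stages
        vA_t, vB_t = np.empty(S), np.empty(S)
        stage = []
        for s in range(S):
            vA_t[s], vB_t[s], f, g = stage_solver(s, vA, vB, rA, rB, Q, betaA, betaB)
            stage.append((f, g))               # SSE strategies f_t*, g_t* at state s
        vA, vB = vA_t, vB_t
        policies.append(stage)
    policies.reverse()                          # index 0 = stage 0
    return vA, vB, policies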
Example

State s1 (each cell: transition probabilities to (s1, s2), then payoffs (rA, rB)):
        b1                      b2
a1   (1/2, 1/2): (10, −10)   (0, 1): (−5, 6)
a2   (1/4, 3/4): (−8, 4)     (1, 0): (6, −4)

State s2:
        b1                      b2
a1   (1/2, 1/2): (7, −5)     (0, 1): (−1, 6)
a2   (1/4, 3/4): (−3, 10)    (1, 0): (2, −10)

βA = βB = 0.9
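For reference, the example's data in the array layout used in the sketches (entries read off the tables; the encoding is ours, βA = βB = 0.9 as on the slide):

import numpy as np

rA = np.array([[[10, -5], [-8, 6]],       # state s1, rows a1, a2, columns b1, b2
               [[ 7, -1], [-3, 2]]])      # state s2
rB = np.array([[[-10, 6], [4, -4]],
               [[ -5, 6], [10, -10]]])
Q  = np.array([[[[.5, .5], [0, 1]],       # Q[s, a, b] = distribution over (s1, s2)
                [[.25, .75], [1, 0]]],
               [[[.5, .5], [0, 1]],
                [[.25, .75], [1, 0]]]])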
Example

[Figure: value-function iterates over 60 stages for the initial conditions (0, 0), (100, 100), (−100, −100), (−50, −50), (50, 50).]
Counterexample

[Figure: leader and follower value-function iterates over 50 stages, one panel per state; the sequences oscillate rather than converging.]
Counterexample

[Figure: leader and follower best-response plots at iterations 14 and 15, for states s1 and s2. At iteration 14 the leader's optimum is x1* = 0.552 in s1 and x1* = 0.633 in s2; at iteration 15 it jumps to x1* = 1 in s1 and x1* = 0.692 in s2, so the iterates do not stabilize.]
Computational Results - General Instances

Algorithm 4 VI modified: infinite horizon for the general case
1: Initialize with n = 0, v_A^0(s) = v_B^0(s) = 0 for every s ∈ S.
2: for n = 1, . . . , MAX_IT do
3:   Find the pair (v_A^n, v_B^n) by (v_A^n, v_B^n)(s) = T(v_A^{n−1}, v_B^{n−1})(s), finding the SSE strategies f* and g* at stage n − 1.
4:   if (v_A^n, v_B^n) = (v_A^{n−1}, v_B^{n−1}) then
5:     return (v_A^n, v_B^n), a fixed point of T.
6:   end if
7:   if ||(v_A^n, v_B^n) − (v_A^{n−1}, v_B^{n−1})|| > (2β^{n−1}/(1 − β)) ||(r_A, r_B)|| then
8:     return UNDEFINED 1.
9:   end if
10: end for
11: return UNDEFINED 2.
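The test on line 7 is the only non-standard step; as a sketch (v_new and v_old stack the leader and follower iterates, r_norm = ||(rA, rB)||, and β = max{βA, βB} are our reading of the slide's notation):

import numpy as np

def undefined_test(v_new, v_old, r_norm, beta, n):
    """True if successive iterates violate the geometric bound a
    beta-contraction would obey, so VI is declared UNDEFINED."""
    gap = np.max(np.abs(v_new - v_old))                  # sup-norm distance
    return gap > 2 * beta ** (n - 1) / (1 - beta) * r_norm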
Security Games

$$r_A^{ab}(s) = \begin{cases} R_A(b) > 0 & \text{if } b = a \\ P_A(b) < 0 & \text{otherwise} \end{cases} \qquad r_B^{ab}(s) = \begin{cases} P_B(b) < 0 & \text{if } b = a \\ R_B(b) > 0 & \text{otherwise} \end{cases}$$

Non-pure strategies appear to be optimal for the leader.
Computationally, VI converges within the geometric bound on all Security game instances tested.

Conjecture
For every Security game with this payoff structure, the operator T is β-contractive, with β = max{βA, βB}.
Computational Results - General Instances

[Figure: performance of VI and PI on randomly generated general instances; solution time in seconds as |S| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - General Instances

[Figure: performance of VI and PI on randomly generated general instances; solution time in seconds as |A| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - General Instances

[Figure: performance of VI and PI on randomly generated general instances; solution time in seconds as |B| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - % UNDEFINED

[Figure: percentage of instances where VI returns UNDEFINED, as |S| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - % UNDEFINED

[Figure: percentage of instances where VI returns UNDEFINED, as |A| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - % UNDEFINED

[Figure: percentage of instances where VI returns UNDEFINED, as |B| ranges over 2, 10, 25, 50, 100, 200.]
Conclusions

We define suitable dynamic programming operators.
We use them to characterize value functions and to prove existence and uniqueness of stationary policies forming a Strong Stackelberg Equilibrium for a family of problems.
We define Value Iteration and Policy Iteration for this family and prove their convergence.
We show via a counterexample that this methodology is not always applicable in the general case.
We study Security games and conjecture that the operators for this type of game are contractive.
Future Work

Prove the convergence of the VI procedure for Security games.
Rolling horizon techniques.
Applicability of Approximate Dynamic Programming techniques.
Formalize and understand the behavior of cyclic policies forming Strong Stackelberg equilibria.
Back-up slides: Stochastic games

Counterexample

State s1 (each cell: transition probabilities to (s1, s2), then payoffs (rA, rB)):
        b1                  b2
a1   (1, 0): (1, −1)     (0, 1): (0, 1)
a2   (0, 1): (−1, 1)     (0, 1): (−1, −1)

State s2:
        b1                  b2
a1   (0, 1): (−1, 0)     (1, 0): (0, 1)
a2   (1, 0): (0, 1)      (0, 1): (1, −1)

Table: Transition matrix and payoffs for each player in numerical example 2.