Solving Stackelberg Equilibrium in Stochastic Games
Víctor Bucarey López
FCEIA - UNR - Rosario
November 2nd, 2017
Stackelberg Game

Leader (commits to π) −→ Follower (responds with γ(π))

Strong Stackelberg Equilibrium:
The leader commits to a payoff-maximizing strategy.
The follower best responds.
The follower breaks ties in favor of the leader.
Example

        b1          b2
a1   (10, −10)   (−5, 6)
a2   (−8, 4)     (6, −4)

MIP formulation (y_b = 1 selects the follower's pure best response b; M is a large constant):

max  vA
s.t. vA ≤ 10x1 − 8x2 + M(1 − y1)
     vA ≤ −5x1 + 6x2 + M(1 − y2)
     0 ≤ vB − (−10x1 + 4x2) ≤ M(1 − y1)
     0 ≤ vB − (6x1 − 4x2) ≤ M(1 − y2)
     x1 + x2 = 1,  y1 + y2 = 1
     x ≥ 0,  y ∈ {0, 1}
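Since y selects the follower's pure best response, the MIP can equivalently be decomposed into one LP per follower action, keeping the best. A minimal sketch of that decomposition (Python with scipy is our choice here, not part of the talk):

import numpy as np
from scipy.optimize import linprog

R_A = np.array([[10, -5], [-8, 6]])    # leader payoffs, rows a1, a2
R_B = np.array([[-10, 6], [4, -4]])    # follower payoffs

best = None
for b in range(2):                      # fix the follower's response b
    # b is a best response to x iff (R_B[:, b'] - R_B[:, b]) . x <= 0 for all b'
    cons = np.array([R_B[:, bp] - R_B[:, b] for bp in range(2) if bp != b])
    res = linprog(-R_A[:, b],           # maximize the leader's expected payoff
                  A_ub=cons, b_ub=np.zeros(1),
                  A_eq=np.ones((1, 2)), b_eq=[1.0], bounds=[(0, 1), (0, 1)])
    if res.success and (best is None or -res.fun > best[0]):
        best = (-res.fun, res.x, b)

print(best)   # leader value 7/3 at x = (1/3, 2/3), follower plays b2

The optimum x1* = 1/3 matches the plot on the next slide.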
[Figure: expected payoff of the leader (left) and the follower (right) against b1 and b2, as a function of x1 ∈ [0, 1]; the leader's optimal commitment is x1* = 1/3.]
Multiple States

State s1 (each cell: transition probabilities to (s1, s2), then payoffs (rA, rB)):
        b1                      b2
a1   (1/2, 1/2): (10, −10)   (0, 1): (−5, 6)
a2   (1/4, 3/4): (−8, 4)     (1, 0): (6, −4)

State s2:
        b1                      b2
a1   (1/2, 1/2): (7, −5)     (0, 1): (−1, 6)
a2   (1/4, 3/4): (−3, 10)    (1, 0): (2, −10)
Stochastic Games - Definition

G = (S, A, B, Q, rA, rB, βA, βB, τ)

At s0, Player A chooses f0 and Player B observes f0 and chooses g0; the game moves to s1 with probability Q^{f0 g0}(s1 | s0), where Player A chooses f1, Player B observes f1 and chooses g1, and so on to s2, ...

Feedback policies: π = π(s, t) = {f0, . . . , fτ}
Stationary policies: π = π(s) = {f, f, . . .}
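For concreteness, here is one possible encoding of such a game as arrays (our own layout, reused in the sketches below, not the talk's code):

import numpy as np
from dataclasses import dataclass

@dataclass
class StochasticGame:
    rA: np.ndarray   # (S, A, B) leader stage payoffs r_A^{ab}(s)
    rB: np.ndarray   # (S, A, B) follower stage payoffs r_B^{ab}(s)
    Q:  np.ndarray   # (S, A, B, S) transition kernel Q^{ab}(z | s)
    betaA: float     # leader discount factor
    betaB: float     # follower discount factor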
Framework

General Objectives
Existence and characterization of value functions.
Existence of equilibrium strategies.
Algorithms to compute them.

State of the Art
For finite horizon, Stackelberg equilibria in stochastic games via dynamic programming.
Mathematical programming approaches to compute stationary values.
Framework

Contributions in infinite horizon
We define suitable dynamic programming operators.
We use them to characterize value functions and to prove existence and uniqueness of stationary policies forming a Strong Stackelberg Equilibrium for a family of problems.
We define Value Iteration and Policy Iteration for this family and prove their convergence.
We show via a counterexample that this methodology is not always applicable in the general case.
Stackelberg equilibrium

A strategy pair (π, γ) induces the value functions

$$v_A^{\pi,\gamma}(s) = \mathbb{E}_s^{\pi,\gamma}\left[\sum_{t=0}^{\tau} \beta_A^t\, r_A^{A_t,B_t}(S_t)\right], \qquad v_B^{\pi,\gamma}(s) = \mathbb{E}_s^{\pi,\gamma}\left[\sum_{t=0}^{\tau} \beta_B^t\, r_B^{A_t,B_t}(S_t)\right]$$

Stackelberg Equilibrium: a pair (π*, γ*) such that

$$v_A^{\pi^*,\gamma^*}(s) = \max_{\pi} v_A^{\pi,\gamma^*}(s), \qquad \gamma^* \in \arg\max_{\gamma} v_B^{\pi,\gamma}(s)$$
Myopic Follower Strategies

Best-response functional:

$$g(f, v_B) = \arg\max_{b \in B_s} \sum_{a \in A_s} f(a)\left[ r_B^{ab}(s) + \beta_B \sum_{z \in S} Q^{ab}(z\,|\,s)\, v_B(z) \right]$$

Myopic follower strategies (MFS): g(f, v_B) = g(f), i.e. the best response does not depend on the follower's continuation values.

Two important cases:
Myopic follower: βB = 0.
Leader-Controller Discounted Games: Q^{ab}(z | s) = Q^{a}(z | s).
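As a sketch (array names and layout are ours, matching the encoding above), the best-response functional is a direct translation of the formula:

import numpy as np

def follower_best_response(f, vB, rB, Q, betaB, s):
    """Follower actions maximizing expected payoff against leader mix f at state s.
    f: (A,); vB: (S,); rB: (S, A, B); Q: (S, A, B, S)."""
    vals = f @ (rB[s] + betaB * Q[s] @ vB)               # (B,) expected payoff per b
    return np.flatnonzero(np.isclose(vals, vals.max()))  # argmax set, ties kept

Under MFS this set does not change with vB, which is what makes the single-operator analysis below go through.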
Myopic Follower Strategies

For f a stationary policy, define T_A^f : R^{|S|} → R^{|S|} by

$$T_A^f(v_A)(s) = \sum_{a \in A_s} f(a)\left[ r_A^{a\,g(f)}(s) + \beta_A \sum_{z \in S} Q^{a\,g(f)}(z\,|\,s)\, v_A(z) \right]$$

Operator for the MFS case:

$$T_A(v_A)(s) = \max_{f \in P(A_s)} T_A^f(v_A)(s) \qquad (1)$$
Myopic Follower Strategies

Theorem 1.
a) T_A^f and T_A are monotone.
b) For any stationary strategy f, the operator T_A^f is a contraction on (R^{|S|}, ||·||∞) of modulus βA.
c) The operator T_A is a contraction on (R^{|S|}, ||·||∞) of modulus βA.

Theorem 2.
There exists an equilibrium value function v_A* and it is the unique solution of v_A* = T_A(v_A*). Moreover, the pair f* and g(f*) that maximizes the RHS of (1) gives the equilibrium strategies.
Myopic Follower Strategies

Algorithm 1 Value function iteration: infinite horizon
Require: ε > 0
1: Initialize with n = 1, v_A^0(s) = 0 for every s ∈ S, and v_A^1 = T_A(v_A^0)
2: while ||v_A^n − v_A^{n−1}||∞ > ε do
3:   Compute v_A^{n+1} by v_A^{n+1}(s) = T_A(v_A^n)(s), finding f* and g(f*) at stage n.
4:   n := n + 1
5: end while
6: return stationary Stackelberg policies π* = {f*, f*, . . .} and γ* = {g*, g*, . . .}
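A sketch of Algorithm 1's loop, with the operator passed in as a callable; any implementation of (1) works, e.g. one LP per candidate follower response per state as in the matrix-game example:

import numpy as np

def value_iteration(TA, n_states, eps=1e-6, max_iter=100_000):
    v = np.zeros(n_states)                       # v_A^0 = 0
    for _ in range(max_iter):
        v_next = TA(v)                           # v_A^{n+1} = T_A(v_A^n)
        if np.max(np.abs(v_next - v)) <= eps:    # sup-norm stopping rule
            return v_next
        v = v_next
    raise RuntimeError("tolerance not reached within max_iter")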
Myopic Follower Strategies

Theorem 3.
The sequence of value functions v_A^n converges to v_A*. Furthermore, v_A* is the fixed point of T_A, with the bound

$$\|v_A^* - v_A^n\|_\infty \le \frac{\|r_A\|_\infty\, \beta_A^n}{1 - \beta_A}.$$
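For scale (our numbers, not from the talk): with βA = 0.9 and ||rA||∞ = 10 the bound is 100 · 0.9^n, so certifying an error below 10⁻⁴ requires 0.9^n ≤ 10⁻⁶, i.e. n ≥ 6 ln 10 / ln(1/0.9) ≈ 132 iterations.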
Policy Iteration - MFS

Begin with f^0 and g(f^0) (e.g. the uniform strategy f^0 = 1/|A|).
Compute u_{A,0} as the fixed point of T_A^{f_0}: u_{A,0} = T_A^{f_0}(u_{A,0}).
Find f_1 such that T_A^{f_1}(u_{A,0}) = T_A(u_{A,0}).
Compute u_{A,1} = T_A^{f_1}(u_{A,1}).
Repeat until convergence.

Theorem 4.
The sequence of functions u_{A,n} satisfies u_{A,n} ↑ v_A*. Moreover, if u_{A,n} = u_{A,n+1} for some n ∈ N, then u_{A,n} = v_A*.
Policy Iteration - MFS

Algorithm 2 Policy Iteration (PI)
1: Choose a stationary Stackelberg pair (f_0, g(f_0)).
2: while ||u_{A,n} − u_{A,n+1}|| > ε do
3:   Evaluation phase: find u_{A,n}, the fixed point of the operator T_A^{f_n}.
4:   Improvement phase: find a strategy f_{n+1} such that T_A^{f_{n+1}}(u_{A,n}) = T_A(u_{A,n}).
5:   n := n + 1
6: end while
7: return stationary Stackelberg policies π* = {f*, . . .} and γ* = {g(f*), . . .}
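The evaluation phase need not iterate: since T_A^{f_n} is affine, its fixed point solves a linear system. A sketch (P_fg and r_fg denote the state-transition matrix and expected stage payoff induced by (f_n, g(f_n)); these inputs are our assumption about the encoding):

import numpy as np

def evaluate_policy(r_fg, P_fg, betaA):
    """Fixed point of u = r_fg + betaA * P_fg @ u, i.e. solve (I - betaA P_fg) u = r_fg."""
    S = len(r_fg)
    return np.linalg.solve(np.eye(S) - betaA * P_fg, r_fg)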
Computational Results - MFS

[Figure: solution time in seconds of Value Iteration vs. Policy Iteration as |S| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - MFS

[Figure: solution time in seconds of Value Iteration vs. Policy Iteration as |A| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - MFS

[Figure: solution time in seconds of Value Iteration vs. Policy Iteration as |B| ranges over 2, 10, 25, 50, 100, 200.]
General Case

For f and g fixed stationary policies, define T_i^{f,g} : R^{|S|} → R^{|S|}, i ∈ {A, B}, by

$$T_i^{f,g}(v_i)(s) = \sum_{a \in A_s} f(a) \sum_{b \in B_s} g(b)\left[ r_i^{ab}(s) + \beta_i \sum_{z \in S} Q^{ab}(z\,|\,s)\, v_i(z) \right]$$

Operator for the general case, T : R^{|S|} × R^{|S|} → R^{|S|} × R^{|S|}:

$$(T(v_A, v_B))(s) = \left( \max_{f \in P(A_s)} T_A^{f,\, g(f, v_B)}(v_A)(s),\; T_B^{f^*,\, g(f^*, v_B)}(v_B)(s) \right)$$

where f* attains the maximum in the first component.
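One application of T at a state reduces to a one-shot SSE on payoff matrices augmented with continuation values. A sketch reusing the multiple-LPs idea from the matrix-game example (array layout as before; this is our reduction, not the talk's code):

import numpy as np
from scipy.optimize import linprog

def apply_T(s, vA, vB, rA, rB, Q, betaA, betaB):
    UA = rA[s] + betaA * Q[s] @ vA     # (A, B): leader stage + continuation payoff
    UB = rB[s] + betaB * Q[s] @ vB     # (A, B): follower counterpart
    A, B = UA.shape
    best = None
    for b in range(B):                 # candidate follower best response
        cons = np.array([UB[:, bp] - UB[:, b] for bp in range(B) if bp != b])
        res = linprog(-UA[:, b],
                      A_ub=cons if B > 1 else None,
                      b_ub=np.zeros(B - 1) if B > 1 else None,
                      A_eq=np.ones((1, A)), b_eq=[1.0], bounds=[(0, 1)] * A)
        if res.success and (best is None or -res.fun > best[0]):
            best = (-res.fun, float(res.x @ UB[:, b]), res.x, b)
    vA_s, vB_s, f, b = best            # max over b breaks ties in the leader's favor
    return vA_s, vB_s, f, b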
Algorithm

Algorithm 3 Value Iteration (VI): finite horizon for the general case
1: Initialize with v_A^{τ+1}(s) = v_B^{τ+1}(s) = 0 for every s ∈ S
2: for t = τ, . . . , 0 do
3:   Solve (v_A^t(s), v_B^t(s)) = T(v_A^{t+1}, v_B^{t+1})(s) for all s ∈ S, finding the SSE strategies f_t* and g_t* at stage t.
4: end for
5: return Stackelberg policies π* = {f_0*, . . . , f_τ*} and γ* = {g_0*, . . . , g_τ*}
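The backward loop of Algorithm 3 as a sketch; stage_solver is any per-state SSE solver with the interface of apply_T from the previous sketch:

import numpy as np

def finite_horizon_vi(rA, rB, Q, betaA, betaB, tau, stage_solver):
    S = rA.shape[0]
    vA, vB = np.zeros(S), np.zeros(S)          # v^{tau+1} = 0
    policies = []
    for t in range(tau, -1, -1):               # backward induction over stages
        vA_t, vB_t = np.empty(S), np.empty(S)
        stage = []
        for s in range(S):
            vA_t[s], vB_t[s], f, g = stage_solver(s, vA, vB, rA, rB, Q, betaA, betaB)
            stage.append((f, g))               # SSE strategies f_t*, g_t* at state s
        vA, vB = vA_t, vB_t
        policies.append(stage)
    policies.reverse()                          # index 0 = stage 0
    return vA, vB, policies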
Example

State s1 (each cell: transition probabilities to (s1, s2), then payoffs (rA, rB)):
        b1                      b2
a1   (1/2, 1/2): (10, −10)   (0, 1): (−5, 6)
a2   (1/4, 3/4): (−8, 4)     (1, 0): (6, −4)

State s2:
        b1                      b2
a1   (1/2, 1/2): (7, −5)     (0, 1): (−1, 6)
a2   (1/4, 3/4): (−3, 10)    (1, 0): (2, −10)

βA = βB = 0.9
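For reference, the example's data in the array layout used in the sketches (entries read off the tables; the encoding is ours, βA = βB = 0.9 as on the slide):

import numpy as np

rA = np.array([[[10, -5], [-8, 6]],       # state s1, rows a1, a2, columns b1, b2
               [[ 7, -1], [-3, 2]]])      # state s2
rB = np.array([[[-10, 6], [4, -4]],
               [[ -5, 6], [10, -10]]])
Q  = np.array([[[[.5, .5], [0, 1]],       # Q[s, a, b] = distribution over (s1, s2)
                [[.25, .75], [1, 0]]],
               [[[.5, .5], [0, 1]],
                [[.25, .75], [1, 0]]]])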
Example

[Figure: value-function iterates over 60 stages for the initial conditions (0, 0), (100, 100), (−100, −100), (−50, −50), (50, 50).]
Counterexample

[Figure: leader and follower value-function iterates over 50 stages, one panel per state; the sequences oscillate rather than converging.]
Counterexample

[Figure: leader and follower best-response plots at iterations 14 and 15, for states s1 and s2. At iteration 14 the leader's optimum is x1* = 0.552 in s1 and x1* = 0.633 in s2; at iteration 15 it jumps to x1* = 1 in s1 and x1* = 0.692 in s2, so the iterates do not stabilize.]
Computational Results - General Instances

Algorithm 4 VI modified: infinite horizon for the general case
1: Initialize with n = 0, v_A^0(s) = v_B^0(s) = 0 for every s ∈ S.
2: for n = 1, . . . , MAX_IT do
3:   Find the pair (v_A^n, v_B^n) by (v_A^n, v_B^n)(s) = T(v_A^{n−1}, v_B^{n−1})(s), finding the SSE strategies f* and g* at stage n − 1.
4:   if (v_A^n, v_B^n) = (v_A^{n−1}, v_B^{n−1}) then
5:     return (v_A^n, v_B^n), a fixed point of T.
6:   end if
7:   if ||(v_A^n, v_B^n) − (v_A^{n−1}, v_B^{n−1})|| > (2β^{n−1}/(1 − β)) ||(r_A, r_B)|| then
8:     return UNDEFINED 1.
9:   end if
10: end for
11: return UNDEFINED 2.
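The test on line 7 is the only non-standard step; as a sketch (v_new and v_old stack the leader and follower iterates, r_norm = ||(rA, rB)||, and β = max{βA, βB} are our reading of the slide's notation):

import numpy as np

def undefined_test(v_new, v_old, r_norm, beta, n):
    """True if successive iterates violate the geometric bound a
    beta-contraction would obey, so VI is declared UNDEFINED."""
    gap = np.max(np.abs(v_new - v_old))                  # sup-norm distance
    return gap > 2 * beta ** (n - 1) / (1 - beta) * r_norm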
Security Games

$$r_A^{ab}(s) = \begin{cases} R_A(b) > 0 & \text{if } b = a \\ P_A(b) < 0 & \text{otherwise} \end{cases} \qquad r_B^{ab}(s) = \begin{cases} P_B(b) < 0 & \text{if } b = a \\ R_B(b) > 0 & \text{otherwise} \end{cases}$$

Non-pure strategies appear to be optimal for the leader.
Computationally, VI converges within the geometric bound on all Security game instances tested.

Conjecture
For every Security game with this payoff structure, the operator T is β-contractive, with β = max{βA, βB}.
Computational Results - General Instances

[Figure: performance of VI and PI on randomly generated general instances; solution time in seconds as |S| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - General Instances

[Figure: performance of VI and PI on randomly generated general instances; solution time in seconds as |A| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - General Instances

[Figure: performance of VI and PI on randomly generated general instances; solution time in seconds as |B| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - % UNDEFINED

[Figure: percentage of instances where VI returns UNDEFINED, as |S| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - % UNDEFINED

[Figure: percentage of instances where VI returns UNDEFINED, as |A| ranges over 2, 10, 25, 50, 100, 200.]
Computational Results - % UNDEFINED

[Figure: percentage of instances where VI returns UNDEFINED, as |B| ranges over 2, 10, 25, 50, 100, 200.]
Conclusions

We define suitable dynamic programming operators.
We use them to characterize value functions and to prove existence and uniqueness of stationary policies forming a Strong Stackelberg Equilibrium for a family of problems.
We define Value Iteration and Policy Iteration for this family and prove their convergence.
We show via a counterexample that this methodology is not always applicable in the general case.
We study Security games and conjecture that the operators for this type of game are contractive.
Future Work

Prove the convergence of the VI procedure for Security games.
Rolling horizon techniques.
Applicability of Approximate Dynamic Programming techniques.
Formalize and understand the behavior of cyclic policies forming Strong Stackelberg equilibria.
Back-up slides: Stochastic games

Counterexample

State s1 (each cell: transition probabilities to (s1, s2), then payoffs (rA, rB)):
        b1                  b2
a1   (1, 0): (1, −1)     (0, 1): (0, 1)
a2   (0, 1): (−1, 1)     (0, 1): (−1, −1)

State s2:
        b1                  b2
a1   (0, 1): (−1, 0)     (1, 0): (0, 1)
a2   (1, 0): (0, 1)      (0, 1): (1, −1)

Table: Transition matrix and payoffs for each player in numerical example 2.