Reinforcement learning in signaling game

Yilei Hu 1, Brian Skyrms 2 and Pierre Tarrès 3

May 28, 2018

arXiv:1103.5818v1 [math.PR] 30 Mar 2011

Abstract

We consider a signaling game originally introduced by Skyrms, which models how two interacting players learn to signal to each other and thus create a common language. The first rigorous analysis was done by Argiento, Pemantle, Skyrms and Volkov (2009) with 2 states, 2 signals and 2 acts. We study the case of $M_1$ states, $M_2$ signals and $M_1$ acts for general $M_1, M_2 \in \mathbb{N}$. We prove that the expected payoff increases on average and thus converges a.s., and that a limit bipartite graph emerges, such that no state–signal correspondence is associated to both a synonym and an informational bottleneck. Finally, we show that any graph correspondence with the above property is a limit configuration with positive probability.

1 University of Oxford, Mathematical Institute, 24-29 St Giles, Oxford OX1 3LB, United Kingdom. E-mail: [email protected].
2 School of Social Sciences, University of California at Irvine, CA 92607. E-mail: [email protected].
3 CNRS, Université de Toulouse, Institut de Mathématiques, 118 route de Narbonne, 31062 Toulouse Cedex 9, France. On leave from the Mathematical Institute, University of Oxford. E-mail: [email protected].




1 Introduction

1.1 Signaling game

Signaling games aim to provide a theoretical framework for the following basic question: how do individuals create a common language? The setting was introduced as follows by the philosopher D. Lewis (1969).

Consider two players: one, called Sender, is regularly given some information that the other does not have and seeks to transmit it; the other is called Receiver. Sender has a fixed set of signals at his disposal throughout the game, which have no intrinsic meaning at the very beginning, in the sense that no signal is a priori associated to any state; similarly, Receiver has a fixed set of possible acts. The game is thereafter repeatedly played according to the following steps:

(1) Sender observes a certain state of nature, of which Receiver is not aware.

(2) Sender chooses a signal and then sends it to Receiver.

(3) Receiver observes the signal but not the state, and then chooses an act.

(4) Both players receive payoffs at the end of the round, which are functions of the state and the act.

The process involving the above four steps is called one communication. Sender, Receiver, states, signals and acts constitute one basic communication system. The game lies in the choices of signals and acts by the agents. Note that we do not fix any strategy at this point; this is the purpose of Section 1.4.


1.2 Mathematical definition

As we will explain at length later on, we adopt a dynamic perspective to investigate this game. The outcome of a one-shot game is not of much interest to us. What we ask is whether the players can establish a signaling system, within which each signal is uniquely related to a certain state, if they play this game repeatedly.

In this section, we provide precise mathematical definitions for the objects appearing in Section 1.1, and we end with a mathematical definition of the repeated signaling game.

Probability space. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, which is a sufficiently rich source of randomness; more specifically, it is at least rich enough for all the random variables appearing in this section to live on it.

State spaces. Let $\mathcal{S}_1$ be the set of states, $\mathcal{S}_2$ the set of signals and $\mathcal{A}$ the set of acts.

Players and strategies. We introduce Nature as a player in this game, who assigns a state of nature to Sender at each round of the game. The three players Nature, Sender and Receiver respectively generate random sequences encoding their strategies throughout this repeated game. More specifically,

(1) Nature generates a sequence $(S_n)_{n\in\mathbb{N}}$ of random variables taking values in the set of states $\mathcal{S}_1$, each denoting which state Nature assigns to Sender at the corresponding round of the game.

(2) Sender generates a sequence $(Y_n)_{n\in\mathbb{N}}$ of random variables taking values in the set of signals $\mathcal{S}_2$, each denoting which signal Sender sends to Receiver at the corresponding round of the game.

(3) Receiver generates a sequence $(Z_n)_{n\in\mathbb{N}}$ of random variables taking values in the set of acts $\mathcal{A}$, each denoting which state Receiver interprets the signal as at the corresponding round of the game.

Payoffs. Let the mappings $u^1_n$ and $u^2_n$ from $\mathcal{S}_1\times\mathcal{S}_2\times\mathcal{A}$ to $[0,\infty)$ be the payoff functions for Sender and Receiver at the $n$-th round of the game. In other words, Sender and Receiver respectively gain payoffs $u^1_n(S_n, Y_n, Z_n)$ and $u^2_n(S_n, Y_n, Z_n)$ at the end of the $n$-th round of the game.

Information. Let the filtrations $(\mathcal{F}^1_n)_{n\in\mathbb{N}}$ (resp. $(\mathcal{F}^2_n)_{n\in\mathbb{N}}$) denote the information available to Sender (resp. Receiver) before he makes his decision at each round of the game: for each $n\in\mathbb{N}$, we let

\[
\mathcal{F}^1_n := \sigma\big(S_i,\ 0\le i\le n;\ Y_i,\ u^1_i(S_i,Y_i,Z_i),\ 0\le i\le n-1\big),
\]
\[
\mathcal{F}^2_n := \sigma\big(Y_i,\ 0\le i\le n;\ Z_i,\ u^2_i(S_i,Y_i,Z_i),\ 0\le i\le n-1\big).
\]

Initial settings. The distribution of the i.i.d. random variables $(S_n)_{n\in\mathbb{N}}$ and the distributions of $Y_0$ and $Z_0$ are given at the beginning of the game; they are called the prior distributions.

Updating rule for strategies. Let the $\mathcal{F}^1_n$-measurable mapping $\rho^1_n$ with

\[
\mathrm{dist}(Y_{n+1}) = \rho^1_n\big(S_i,\ 1\le i\le n;\ Y_i,\ u^1_i(S_i,Y_i,Z_i),\ 1\le i\le n-1\big) \tag{1}
\]

be the updating rule of strategies for Sender at the $n$-th round of the game, and let the $\mathcal{F}^2_n$-measurable mapping $\rho^2_n$ with

\[
\mathrm{dist}(Z_{n+1}) = \rho^2_n\big(S_i,\ 1\le i\le n;\ Z_i,\ u^2_i(S_i,Y_i,Z_i),\ 1\le i\le n-1\big) \tag{2}
\]

be the updating rule of strategies for Receiver at the $n$-th round of the game. Different sequences $(\rho^1_n)_{n\in\mathbb{N}}$ and $(\rho^2_n)_{n\in\mathbb{N}}$ represent different learning rules.

Definition 1.1 (Signaling game). A repeated signaling game $\mathcal{G}$ is defined as

\[
\mathcal{G} = \Big( (\Omega,\mathcal{F},\mathbb{P}),\ \mathcal{S}_1,\ \mathcal{S}_2,\ \mathcal{A},\ (S_n, Y_n, Z_n, u^1_n, u^2_n, \rho^1_n, \rho^2_n)_{n\in\mathbb{N}} \Big).
\]

1.3 Questions

Throughout the paper, we limit our attention to a special, but most common, circumstance under which

(A1) the set of acts matches the set of states of nature by a bijective map; and

(A2) both players receive fixed payoffs only when the act chosen by Receiver matches the state of nature, which is considered a successful communication; otherwise they obtain nothing.

Under these assumptions, the act can be understood as an interpreted state. Note however that Receiver does not necessarily know the possible states of nature: the mutual goal of the two players is to make the communication succeed, but they are not always aware that they are actually coordinating with each other, or even that they are involved in a communication game.

The analysis of the choices of strategies by the players gives rise to the following

Question 1.1. What game-theoretical equilibrium is most likely to arise in this repeated game?

This issue can be analyzed either from a static perspective with classic equilibrium analysis, see Trapa and Nowak (2000), Huttegger (2007) and Pawlowitsch (2008), or from a dynamic perspective, through individual learning models or evolutionary strategies, see for instance Huttegger and Zollman (2010). The latter perspective investigates the evolutionary pathway out of equilibria:

Question 1.2. Does the communication system asymptotically reach a stable equilibrium state? If so, what are the good candidates, and how does the communication system reach them?

However, modeling the game through an individual learning process is an issue involving many factors, for instance the level of rationality of the players. In particular, we are interested in the following question, first raised by Skyrms.

Question 1.3. What is the simplest mechanism ensuring the emergence of a signaling system in this repeated game?

1.4 Reinforcement learning

In this paper we adopt a dynamic perspective, based on the following individual reinforcement learning model.

(A3) Roth–Erev reinforcement learning rule (or Herrnstein's matching law): the probability of choosing an action is proportional to its accumulated rewards.

Assumption (A3) determines the updating rule for the distributions of the players' strategies. The corresponding behavior was analyzed by Argiento et al. (2009) in the 2-state, 2-signal, 2-act case: they show that an optimal signaling system eventually emerges, in the sense of a one-to-one correspondence between states and signals. We study here the case of $M_1$ states, $M_2$ signals and $M_1$ acts for general $M_1, M_2\in\mathbb{N}$.

Note that the Roth–Erev reinforcement rule is only one of many possible strategies for the players, who can have various levels of rationality, each of them leading to a different learning process; see for instance the myopic and best-response models in Fudenberg and Levine (1998). Let us briefly motivate the reinforcement assumption, which corresponds to a low level of rationality. It is natural to believe that individuals with high rationality, devoting themselves to the task of establishing a common language, would rapidly succeed. It is interesting to study whether, on the contrary, under the sole assumption that these individuals have a good memory of their own past experience and aspire to a better payoff, a signaling system would also emerge, and how optimal the limiting system is. This pertains to individuals with lower cognitive ability, or who do not devote themselves entirely to the task of learning the game or of taking optimal decisions.

1.5 The model

1.5.1 Assumptions

Let $\mathcal{G} = \big((\Omega,\mathcal{F},\mathbb{P}),\ \mathcal{S}_1,\ \mathcal{S}_2,\ \mathcal{A},\ (S_n, Y_n, Z_n, u^1_n, u^2_n, \rho^1_n, \rho^2_n)_{n\in\mathbb{N}}\big)$ be a signaling game as defined in Section 1.2. Apart from assumptions (A1)–(A3), we make two further assumptions for our model: one is about the prior distributions, and the other is a more detailed version of assumption (A2).

(A4) States of nature are equiprobable. In other words, for each $n\in\mathbb{N}$, $S_n$ is an independent uniformly distributed random variable. Furthermore, $Y_0$ and $Z_0$ are independent uniformly distributed random variables.

(A5) In addition to assumption (A2) on payoffs, we assume that the payoffs are deterministic constants; more precisely, $u^1_n(S_n,Y_n,Z_n) = u^2_n(S_n,Y_n,Z_n) = a_{Z_n,Y_n}\,\mathbf{1}_{\{S_n=Z_n\}}$, where $a_{i,j}$ is a positive constant, for $i\in\mathcal{S}_1$, $j\in\mathcal{S}_2$.

1.5.2 The model

Under assumptions (A1)–(A5), we now present the model of the signaling game that we study in this paper.

Suppose there are $M_1$ states, $M_2$ signals and $M_1$ acts. Let $\mathcal{S} = \mathcal{S}_1\cup\mathcal{S}_2$ and, for all $d\in\mathbb{N}$, let $\mathcal{S}^d := \mathcal{S}\times\ldots\times\mathcal{S}$ ($d$ times).

Let $\mathcal{S}_{\mathrm{pair}} := \{(i,j) : i\in\mathcal{S}_1,\ j\in\mathcal{S}_2\}$ be the set of strategy pairs. Note that a strategy pair $(i,j)$ carries different meanings for Sender and for Receiver: for Sender, it means choosing signal $j$ upon observing state $i$, while for Receiver it means choosing act $i$ upon receiving signal $j$. The strategy pair $(i,j)$ accumulates the same payoffs for both Sender and Receiver, since the corresponding rewards are always received at the same time: let $V(n,i,j)$ denote this accumulated payoff at time $n$. For each $n\in\mathbb{N}$, let

\[
V_n := \big(V(n,i,j)\big)_{i\in\mathcal{S}_1,\ j\in\mathcal{S}_2}
\]

be the payoff vector at time $n$.

Let us describe the random process $(V_n)_{n\in\mathbb{N}}$ arising from our model for the game and its strategies in Sections 1.1–1.4.

1. Initial setting. For any $(i,j)\in\mathcal{S}_{\mathrm{pair}}$, we assume that $V(0,i,j) > 0$ is fixed.

2. Reinforcement learning. At each time step, Sender observes a certain state $i$ from the set of states $\mathcal{S}_1$; we assume here that all states arise with equal probability $1/M_1$. Then Sender randomly chooses a signal, his probability of drawing $j$ being

\[
\frac{V(n,i,j)}{\sum_{l\in\mathcal{S}_2} V(n,i,l)}.
\]

Receiver observes the signal he receives (call it $j$) and then randomly chooses an act $k$ with probability

\[
\frac{V(n,k,j)}{\sum_{l\in\mathcal{S}_1} V(n,l,j)}.
\]

3. Updating rule. Both Sender and Receiver receive payoffs when the act chosen by Receiver matches the state observed by Sender. For any $i\in\mathcal{S}_1$, $j\in\mathcal{S}_2$,

\[
V(n+1,i,j) :=
\begin{cases}
V(n,i,j) + a_{i,j} & \text{if Sender observes state } i \text{ and chooses signal } j, \text{ and Receiver chooses act } i;\\
V(n,i,j) & \text{otherwise.}
\end{cases}
\]

We suppose in this paper that $a_{i,j} := 1$ for all $i\in\mathcal{S}_1$, $j\in\mathcal{S}_2$. We also assume for simplicity that $V(0,i,j) = 1$ for all $i\in\mathcal{S}_1$, $j\in\mathcal{S}_2$, but the proofs carry over to general initial conditions.


1.5.3 Symmetrization and key processes

Let us symmetrize the notation, which will simplify some proofs (in particular that of Proposition 4.1): for all $i, j\in\mathcal{S}$, let

\[
V(n,i,j) :=
\begin{cases}
0 & \text{if } i,j\in\mathcal{S}_1 \text{ or } i,j\in\mathcal{S}_2,\\
V(n,i,j) & \text{if } i\in\mathcal{S}_1,\ j\in\mathcal{S}_2,\\
V(n,j,i) & \text{if } j\in\mathcal{S}_1,\ i\in\mathcal{S}_2.
\end{cases}
\]

Now, for all $n\in\mathbb{N}$ and $i\in\mathcal{S}$, state or signal, let

\[
T^i_n := \sum_{j\in\mathcal{S}} V(n,i,j)
\]

be its number of successes up to time $n$. For all $n\in\mathbb{N}$, let

\[
T_n := \frac{1}{2}\sum_{i\in\mathcal{S}} T^i_n = \sum_{i\in\mathcal{S}_1} T^i_n = \sum_{i\in\mathcal{S}_2} T^i_n.
\]

Then $T_n - T_0$ is the total number of successes of the communication system up to time $n$. Let

\[
x^n_{ij} := V(n,i,j)/T_n,\quad i,j\in\mathcal{S},\ n\in\mathbb{N};\qquad
x^n_i := T^i_n/T_n,\quad i\in\mathcal{S},\ n\in\mathbb{N}.
\]

For all $n\in\mathbb{N}$, let

\[
x_n := \big(x^n_{ij}\big)_{i,j\in\mathcal{S}}
\]

be the occupation measure at time $n$, which takes values in the interior of the simplex

\[
\Delta := \Big\{ (x_{ij})_{i,j\in\mathcal{S}} :\ x_{ij} \ge 0,\ \sum_{i\in\mathcal{S}_1,\ j\in\mathcal{S}_2} x_{ij} = 1,\ x_{ji} = x_{ij} \text{ for all } i,j\in\mathcal{S} \Big\}.
\]

Let us define

\[
\partial\Delta := \{ x\in\Delta :\ \exists\, i\in\mathcal{S} \text{ s.t. } x_i = 0 \},
\]

which we call the boundary throughout the paper, although it is not the topological boundary of $\Delta$. One of the technical difficulties in this model is the understanding of the behavior of $(x_n)_{n\in\mathbb{N}}$ near this boundary, as we shall explain later.

Given $x\in\Delta\setminus\partial\Delta$ and $i,j\in\mathcal{S}$, let

\[
y_{ij} := \frac{x_{ij}}{x_i x_j}
\]

be the efficiency of the strategy pair $(i,j)$, and let

\[
N_i(x) := \sum_{k\in\mathcal{S}} \frac{x_{ik}}{x_i}\, y_{ik}
\]

be the efficiency $N_i(x)$ of $i$. We will justify this terminology in Section 4.1.

Note that the processes $(x_n)_{n\in\mathbb{N}}$ and $(T_n)_{n\in\mathbb{N}}$ contain all the important information about the communication system throughout the game. Therefore, to study how our model evolves, we only need to focus on the evolution of $(x_n)_{n\in\mathbb{N}}$ and $(T_n)_{n\in\mathbb{N}}$.


1.6 Urn setting: another way to interpret the model

Note that the reinforcement learning and updating rules 2 and 3 of Section 1.5.2 can be interpreted in an urn setting. Assume Sender has $M_1$ urns indexed by states, each containing balls of $M_2$ colours, one colour per signal. Similarly, assume that Receiver has $M_2$ urns indexed by signals, each containing balls of $M_1$ colours, one colour per act (or state, since the two sets coincide).

The model corresponds to the following: Sender picks a ball at random from the urn indexed by the state he observes, and sends the signal given by its colour. Then Receiver picks a ball at random from the urn indexed by this signal, and chooses the act given by its colour. Both Sender and Receiver put back the balls they picked and, if the act matches the state, each adds one more ball of the same colour.
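As an illustration only (a sketch added here, not part of the paper's analysis), the urn scheme above can be simulated directly; the code below assumes unit initial weights $V(0,i,j) = 1$ and unit payoffs $a_{i,j} = 1$, and tracks the number of successes:

```python
import random

def simulate(M1, M2, rounds, seed=0):
    """Roth-Erev reinforcement in the Lewis signaling game (urn form).

    V[i][j] is the accumulated payoff of the strategy pair (state i, signal j);
    it doubles as the ball count of colour j in Sender's urn i and of
    colour i in Receiver's urn j.
    """
    rng = random.Random(seed)
    V = [[1.0] * M2 for _ in range(M1)]          # V(0, i, j) = 1
    successes = 0
    for _ in range(rounds):
        i = rng.randrange(M1)                    # Nature: uniform state
        j = rng.choices(range(M2), weights=V[i])[0]            # Sender's urn i
        col = [V[k][j] for k in range(M1)]
        k = rng.choices(range(M1), weights=col)[0]             # Receiver's urn j
        if k == i:                               # successful communication
            V[i][j] += 1.0                       # both reinforce the pair (i, j)
            successes += 1
    return V, successes

V, s = simulate(M1=3, M2=3, rounds=20000)
print("success rate:", s / 20000)
```

By Lemma 3.2 below, the long-run success rate lies between $1/M_1$ and $\min(M_1,M_2)/M_1$, i.e. between $1/3$ and $1$ in this run.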

2 Main results

Given $x\in\Delta$, a distribution of strategy pairs of the communication system, we introduce in Section 2.1 the bipartite graph of state/signal connections associated to $x$, and its communication potential, or efficiency, which measures the corresponding expected payoff up to a multiplicative constant. We present in Section 2.2 the main results of the paper.

2.1 Bipartite graph and communication potential

Definition 2.1. Given $x\in\Delta$, let $G_x$ be the weighted bipartite graph with vertex set $\mathcal{S} := \mathcal{S}_1\cup\mathcal{S}_2$, adjacency relation $\sim$ and weights as follows:

(1) for all $i\in\mathcal{S}_1$, $j\in\mathcal{S}_2$, $i\sim j$ if and only if $x_{ij} > 0$;

(2) the weight of the edge $\{i,j\}$ is its efficiency $y_{ij} = x_{ij}/(x_i x_j)$.

Note that this adjacency relation is only of interest when $x$ lies in the topological boundary of $\Delta$; otherwise, $G_x$ is the complete bipartite graph with parts $\mathcal{S}_1$ and $\mathcal{S}_2$.

Definition 2.2. Let $H : \Delta \to \mathbb{R}_+$ be the function defined, for all $x\in\Delta$, by

\[
H(x) := \sum_{i\in\mathcal{S}_1,\ j\in\mathcal{S}_2:\ x_{ij}>0} \frac{x_{ij}^2}{x_i x_j} = \frac{1}{2} \sum_{i,j\in\mathcal{S}:\ x_{ij}>0} \frac{x_{ij}^2}{x_i x_j}.
\]

We call $H(x)$ the communication potential, or efficiency, of $x$.

Note that the communication potential of $x_n$ at time $n$ can be interpreted, up to a multiplicative constant, as the expected payoff at that time step:

\[
\mathbb{P}(T_{n+1} - T_n = 1 \mid \mathcal{F}_n) = \frac{1}{M_1} H(x_n), \tag{3}
\]

where $\mathcal{F} = (\mathcal{F}_n)_{n\in\mathbb{N}}$ is the filtration of the past, i.e. $\mathcal{F}_n := \sigma(x_1, \ldots, x_n)$.

Lemma 2.1. $H$ has minimum $1$ and maximum $\min(M_1, M_2)$ on $\Delta$.

Proof. The Cauchy–Schwarz inequality yields

\[
H(x) = \Big( \sum_{i\in\mathcal{S}_1,\, j\in\mathcal{S}_2:\, x_{ij}>0} \frac{x_{ij}^2}{x_i x_j} \Big) \Big( \sum_{i\in\mathcal{S}_1,\, j\in\mathcal{S}_2} x_i x_j \Big) \ \ge\ \Big( \sum_{i\in\mathcal{S}_1,\, j\in\mathcal{S}_2:\, x_{ij}>0} \frac{x_{ij}}{\sqrt{x_i x_j}}\,\sqrt{x_i x_j} \Big)^2 = \Big( \sum_{i\in\mathcal{S}_1,\, j\in\mathcal{S}_2} x_{ij} \Big)^2 = 1,
\]

using that $\sum_{i\in\mathcal{S}_1,\, j\in\mathcal{S}_2} x_i x_j = \big(\sum_{i\in\mathcal{S}_1} x_i\big)\big(\sum_{j\in\mathcal{S}_2} x_j\big) = 1$; this provides the first inequality. The second one comes from

\[
H(x) = \sum_{i\in\mathcal{S}_1,\, j\in\mathcal{S}_2:\, x_{ij}>0} \frac{x_{ij}}{x_i}\cdot\frac{x_{ij}}{x_j} \ \le\ \Big( \sup_{i\in\mathcal{S}_1,\, j\in\mathcal{S}_2:\, x_{ij}>0} \frac{x_{ij}}{x_i} \Big) \sum_{i\in\mathcal{S}_1,\, j\in\mathcal{S}_2:\, x_{ij}>0} \frac{x_{ij}}{x_j} \ \le\ \sum_{i\in\mathcal{S}_1,\, j\in\mathcal{S}_2:\, x_{ij}>0} \frac{x_{ij}}{x_j} = M_2;
\]

similarly $H(x) \le M_1$. $\square$

Now:

(a) $H(x)$ attains the minimum $1$ if and only if $G_x$ is a complete bipartite graph in which every edge has the same weight $1$, as displayed in the figure below.

[Figure: states 1–4 and signals A–D, all edges present, each with weight 1.]

(b) If $M_1 \ge M_2$ (resp. $M_1 \le M_2$), then $H(x)$ attains the maximum if and only if every vertex $i\in\mathcal{S}_1$ (resp. $\mathcal{S}_2$) has exactly one adjacent edge in $G_x$. In the case $M_1 \le M_2$, this corresponds to a unique meaning for every signal, i.e. perfect efficiency, as displayed in the figure below.

[Figure: a perfect matching between states 1–4 and signals A–D.]
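As a quick numerical check (an illustration added here, not part of the paper), one can evaluate $H$ at the two extreme configurations of Lemma 2.1 for $M_1 = M_2 = 4$:

```python
def H(x):
    """Communication potential of a dict {(state, signal): mass} summing to 1."""
    xi = {}; xj = {}
    for (i, j), v in x.items():
        xi[i] = xi.get(i, 0.0) + v               # margin x_i
        xj[j] = xj.get(j, 0.0) + v               # margin x_j
    return sum(v * v / (xi[i] * xj[j]) for (i, j), v in x.items() if v > 0)

M = 4
uniform = {(i, j): 1 / M**2 for i in range(M) for j in range(M)}   # complete graph
matching = {(i, i): 1 / M for i in range(M)}                        # perfect matching

print(H(uniform))    # minimum: 1.0
print(H(matching))   # maximum: min(M1, M2) = 4.0
```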

2.2 Main results

Definition 2.3. Given a graph $G$ on $\mathcal{S}_1\cup\mathcal{S}_2$, let $(P)_G$ be the following property:

• if $C_1, \ldots, C_d$ denote its connected components then, for every $i\in\{1,\ldots,d\}$, $C_i\cap\mathcal{S}_1$ or $C_i\cap\mathcal{S}_2$ is a singleton;

• every vertex has at least one adjacent edge.

We call a synonym (resp. an informational bottleneck, or polysemy) a state (resp. signal) associated to several signals (resp. states or acts), or the corresponding set of adjacent signals (resp. states). Obviously $M_1 \ne M_2$ ensures the existence of at least one synonym or polysemy.

Note that, given $x\in\Delta$, and even if $M_1 = M_2$, property $(P)_{G_x}$ allows for synonyms or informational bottlenecks, and does not ensure that the system is optimal as a communication system, i.e. that $H(x)$ reaches the maximum of $H$. Most common languages have such flaws. We show, in the figure below, a graph $G$ on 5 states and 5 signals corresponding to a sub-optimal communication system: it is easy to check that its normalized efficiency is 80%, i.e. that any $x\in\Delta$ such that $G_x = G$ satisfies $H(x)/\max H = 0.8$.

[Figure: a bipartite graph on states 1–5 and signals A–E satisfying $(P)_G$, with normalized efficiency 80%.]
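The exact edges of the figure are not recoverable from this copy of the paper, but here is one hypothetical 5×5 configuration with property $(P)_G$ whose efficiency is $H(x) = 4 = 0.8\,\max H$: signal A is an informational bottleneck for states 1 and 2, signals B and C are synonyms for state 3, and states 4 and 5 pair with signals D and E. (The masses below are an arbitrary choice: a star-shaped component contributes exactly 1 to $H$ whatever its masses, so $H$ equals the number of components.)

```python
def H(x):
    """Communication potential of a dict {(state, signal): mass} summing to 1."""
    xi = {}; xj = {}
    for (i, j), v in x.items():
        xi[i] = xi.get(i, 0.0) + v
        xj[j] = xj.get(j, 0.0) + v
    return sum(v * v / (xi[i] * xj[j]) for (i, j), v in x.items() if v > 0)

# Hypothetical graph: components {1,2}-{A}, {3}-{B,C}, {4}-{D}, {5}-{E}.
x = {(1, 'A'): 0.1, (2, 'A'): 0.1,        # bottleneck: A means "1 or 2"
     (3, 'B'): 0.1, (3, 'C'): 0.1,        # synonyms: B and C both mean "3"
     (4, 'D'): 0.3, (5, 'E'): 0.3}        # two unambiguous pairs

assert abs(sum(x.values()) - 1) < 1e-12
print(H(x) / 5)      # normalized efficiency H(x)/max H, approximately 0.8
```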

Theorem 2.1. The communication potential process $(H(x_n))_{n\in\mathbb{N}}$ is a bounded submartingale, and hence converges a.s.

Theorem 2.2. $(x_n)_{n\in\mathbb{N}}$ converges almost surely to the set of equilibria of the mean-field ODE.

Remark. We will define the mean-field ODE in Section 3.1. Roughly speaking, it is derived from the dynamics of the expected movement of $(x_n)_{n\in\mathbb{N}}$.

Theorem 2.3. For every graph $G$ on $\mathcal{S}_1\cup\mathcal{S}_2$ such that $(P)_G$ holds, with positive probability

(a) $x_n \to x$ with $G_x = G$;

(b) for all $i,j\in\mathcal{S}$, $V(\infty,i,j) = \infty \iff \{i,j\}$ is an edge of $G$.

2.3 Contents

The rest of the paper is devoted to the proof of the main results, as follows. In Section 3 we discuss the stochastic approximation of $(x_n)_{n\in\mathbb{N}}$ by an ordinary differential equation. In Section 4 we justify some notation from Section 1.5, and show Theorem 2.1 together with its deterministic counterpart, namely that $H$ is a Lyapunov function of the associated ODE. In Section 5 we deduce that $(x_n)_{n\in\mathbb{N}}$ almost surely converges to the set of equilibria of this differential equation, and describe the stable equilibria $x$ in terms of the graph structure of $G_x$. In Section 6 we connect our stability analysis of the reinforcement learning model with the one from the static equilibrium analysis literature. Finally, in Section 7, we show Theorem 2.3 about convergence with positive probability towards subgraphs $G$ satisfying $(P)_G$.

2.4 Notation

For all $u, v\in\mathbb{R}$, we write $u = 2(v)$ if $|u| \le v$; we let $u\wedge v = \min(u,v)$ (resp. $u\vee v = \max(u,v)$) be the minimum (resp. maximum) of $u$ and $v$.

We let $\mathrm{Cst}(a_1, a_2, \ldots, a_p)$ denote a positive constant depending only on $a_1, a_2, \ldots, a_p$, and let $\mathrm{Cst}$ denote a universal positive constant.

3 Stochastic approximation

3.1 Mean-field ODE

Let $|\cdot|$ be the Euclidean norm on $\mathbb{R}^{M_1\times M_2}$. Let us compute the increment of $(x_n)_{n\in\mathbb{N}}$ at time $n$:

\[
x_{n+1} - x_n = \Big( \frac{V_{n+1}}{1+T_n} - \frac{V_n}{1+T_n} + \frac{V_n}{1+T_n} - \frac{V_n}{T_n} \Big)\,\mathbf{1}_{\{V_{n+1}-V_n\ne0\}} = \frac{\mathbf{1}_{\{V_{n+1}-V_n\ne0\}}}{1+T_n}\,\big(V_{n+1} - V_n - x_n\big). \tag{4}
\]

In expectation,

\[
\mathbb{E}[\,x_{n+1} - x_n \mid \mathcal{F}_n\,] = \frac{1}{(1+T_n)M_1}\, F(x_n), \tag{5}
\]

where $F$, defined from $\Delta$ to the tangent space $T\Delta$ of $\Delta$, maps $x$ to

\[
F(x) := \Big( x_{ij}\Big( \frac{x_{ij}}{x_i x_j} - H(x) \Big) \Big)_{i,j\in\mathcal{S}},
\]

with the convention that $F(x)_{ij} = 0$ if $x_{ij} = 0$.

Let us consider the following ordinary differential equation, defined on $\Delta\setminus\partial\Delta$:

\[
\frac{dx}{dt} = F(x). \tag{6}
\]
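As a numerical illustration (a sketch added here; the monotonicity of $H$ along the flow is proved in Proposition 4.1 below), Euler steps of the ODE (6) from a random interior point of the simplex show $H$ increasing:

```python
import random

def H_and_F(x, M1, M2):
    """Potential H and vector field F for x = {(i, j): mass} on S1 x S2."""
    xi = {i: sum(x[i, j] for j in range(M2)) for i in range(M1)}   # margins x_i
    xj = {j: sum(x[i, j] for i in range(M1)) for j in range(M2)}   # margins x_j
    H = sum(v * v / (xi[i] * xj[j]) for (i, j), v in x.items())
    F = {(i, j): v * (v / (xi[i] * xj[j]) - H) for (i, j), v in x.items()}
    return H, F

random.seed(0)
M1, M2 = 3, 4
x = {(i, j): random.random() for i in range(M1) for j in range(M2)}
s = sum(x.values())
x = {k: v / s for k, v in x.items()}            # normalize onto the simplex

values = []
for _ in range(500):
    H, F = H_and_F(x, M1, M2)
    values.append(H)
    x = {k: v + 0.01 * F[k] for k, v in x.items()}   # Euler step of dx/dt = F(x)

assert values[-1] > values[0]                   # H increased along the trajectory
assert 1 - 1e-9 <= values[-1] <= min(M1, M2) + 1e-6   # Lemma 2.1 bounds
```

Note that the step preserves the simplex: the coordinates of $F$ sum to $H - H\sum x_{ij} = 0$.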

3.2 Results

Lemma 3.1. There exists an adapted martingale-increment process $(\eta_n)_{n\in\mathbb{N}}$ such that, for all $n\in\mathbb{N}$,

\[
x_{n+1} - x_n = \frac{1}{(1+T_n)M_1}\,F(x_n) + \eta_{n+1}, \tag{7}
\]

and $|\eta_{n+1}| \le 2/(1+T_n)$.

Proof. The inequality $|\eta_{n+1}| \le 2/(1+T_n)$ comes from

\[
|x_{n+1} - x_n| = \bigg| \frac{\mathbf{1}_{\{V_{n+1}-V_n\ne0\}}}{1+T_n}\,\big(V_{n+1} - V_n - x_n\big) \bigg| \le \frac{1}{1+T_n}. \tag{8}
\]

$\square$

The following Lemma 3.2, which proves asymptotically linear growth of $T_n$, will in particular imply that the martingale $\big(\sum_{k=1}^n \eta_k\big)_{n\in\mathbb{N}}$ converges a.s., by Doob's convergence theorem.

Lemma 3.2. With probability 1,

\[
\frac{1}{M_1} \le \liminf_{n\to\infty} \frac{T_n}{n} \le \limsup_{n\to\infty} \frac{T_n}{n} \le \frac{\min(M_1, M_2)}{M_1}.
\]

Proof. The result is a direct consequence of (3) and the conditional Borel–Cantelli lemma, see [4, Theorem I.6]. $\square$

Remark. We show later that, as $n$ goes to infinity, $H(x_n)$ converges (Theorem 2.1), and the proof above then identifies the limit of $T_n/n$ (see Corollary 4.1).

Formula (7) can be interpreted as a stochastically perturbed Cauchy–Euler approximation scheme for the ODE (6), with step size $1/(M_1(1+T_n))$. The step size being $O(1/n)$, $(x_n)_{n\in\mathbb{N}}$ asymptotically shadows solutions of the ODE, so that its limit set belongs to a class of possible limit sets of pseudotrajectories of the ODE (see for instance [3]).

Let $\Gamma$ be the set of equilibria of the ODE, i.e.

\[
\Gamma := \{ x\in\Delta :\ F(x) = 0 \}.
\]

4 Lyapunov function

4.1 Deterministic case: mean-field ODE

Let us start with a heuristic justification of the fact that $H$ is a Lyapunov function, i.e. that it increases along the trajectories of the ODE.

The differential equation (6) can be understood in the language of (non-linear) replicator dynamics, with the following biological perspective. Suppose a population consists of species $(i,j)\in\mathcal{S}_1\times\mathcal{S}_2$, each corresponding to a strategy pair, and let the fitness of $(i,j)$ be its efficiency $y_{ij} = x_{ij}/(x_i x_j)$. The wording is justified by the following interpretation: the probability that population $(i,j)$ increases is

\[
\frac{1}{M_1}\cdot\frac{x_{ij}}{x_i}\cdot\frac{x_{ij}}{x_j} = \frac{1}{M_1} \times \text{proportion of } (i,j) \times \text{its efficiency}.
\]

Then the average fitness of the whole population is the communication potential $H(x)$. Therefore the mean-field ODE can be understood as follows:

growth rate of $x_{ij}$ = $x_{ij}$ × (fitness of $(i,j)$ − average fitness of the whole population).

In particular, those species (i.e. strategy pairs) whose fitness (i.e. efficiency) lies above the average increase their proportion in the population (i.e. in the distribution of strategy pairs). Note that the fitness of our species changes over time.

This interpretation makes it reasonable to conjecture that the average fitness of the whole population indeed increases along solutions of the ODE (6), as we show in Proposition 4.1 and, in the (discrete) stochastic case, in Theorem 2.1, proved in Section 4.2. The proof of the latter is technically quite long, owing to the non-continuity of $H$ on the boundary $\partial\Delta$, which prevents us from converting the deterministic Proposition 4.1 into a stochastic statement via a simple Taylor formula.

Let, for all $x\in\Delta$,

\[
p(x) := \frac{1}{2} \sum_{i,j,k\in\mathcal{S}:\ x_{ij},\,x_{ik}\ne0} \frac{x_{ij}x_{ik}}{x_i}\,(y_{ij} - y_{ik})^2.
\]

Proposition 4.1. $H$ is a Lyapunov function on $\Delta\setminus\partial\Delta$ for the mean-field ODE (6); more precisely,

\[
\nabla H\cdot F(x) = p(x) \ge 0. \tag{9}
\]

Remark. We will see later that $H$ is not a strict Lyapunov function; in other words, $\nabla H\cdot F$ does not vanish only on the set of rest points of $F$.

Proof. We take advantage of the symmetric notation introduced at the end of Section 1.5: from a mathematical perspective, there is no difference between states and signals in the mean-field ODE. Let us now differentiate $H$ along a path of the ODE. Note that, when differentiating with respect to the space variables, we are obviously not restricted to $\Delta$, so that $x_{ij}$ and $x_{ji}$ are considered as independent variables (in the calculation below, we use the convention that $x_i = \sum_{j\in\mathcal{S}} x_{ij}$, but any other convention would lead to the same result):

\begin{align*}
\nabla H\cdot F(x) &= \sum_{i,j\in\mathcal{S}} \bigg[ \frac{x_{ij}}{x_i x_j}\Big( \frac{x_{ij}^2}{x_i x_j} - x_{ij}H(x) \Big) - \frac{x_{ij}^2}{2x_i^2 x_j} \sum_{k\in\mathcal{S}} \Big( \frac{x_{ik}^2}{x_i x_k} - x_{ik}H(x) \Big) - \frac{x_{ij}^2}{2x_i x_j^2} \sum_{k\in\mathcal{S}} \Big( \frac{x_{jk}^2}{x_j x_k} - x_{jk}H(x) \Big) \bigg] \\
&= \sum_{i,j\in\mathcal{S}} \frac{x_{ij}^3}{x_i^2 x_j^2} - \sum_{i,j,k\in\mathcal{S}} \frac{x_{ij}^2 x_{ik}^2}{x_i^3 x_j x_k} \tag{10} \\
&\quad - \bigg( \sum_{i,j\in\mathcal{S}} \frac{x_{ij}^2}{x_i x_j} - \sum_{i,j,k\in\mathcal{S}} \frac{x_{ij}^2 x_{ik}}{x_i^2 x_j} \bigg) H(x). \tag{11}
\end{align*}

Using that $\sum_{k\in\mathcal{S}} x_{ik}/x_i = 1$,

\[
(11) = \sum_{i,j\in\mathcal{S}} \frac{x_{ij}^2}{x_i x_j} \Big( 1 - \sum_{k\in\mathcal{S}} \frac{x_{ik}}{x_i} \Big) H(x) = 0.
\]

Using the symmetry between $j$ and $k$, we obtain

\[
(10) = \sum_{i,j,k\in\mathcal{S}} \frac{x_{ij}^3 x_{ik}}{x_i^3 x_j^2} - \sum_{i,j,k\in\mathcal{S}} \frac{x_{ij}^2 x_{ik}^2}{x_i^3 x_j x_k} = \frac{1}{2} \sum_{i,j,k\in\mathcal{S}} \frac{x_{ij}x_{ik}}{x_i} \Big( \frac{x_{ij}}{x_i x_j} - \frac{x_{ik}}{x_i x_k} \Big)^2 = p(x). \tag{12}
\]

$\square$
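Identity (9) can also be checked numerically (an added sketch, not from the paper): compare a finite-difference directional derivative of $H$ along $F$ with $p(x)$ at a random interior point of the simplex.

```python
import numpy as np

M1, M2 = 3, 4
rng = np.random.default_rng(1)
x = rng.random((M1, M2)); x /= x.sum()          # random interior point of the simplex

def H(x):
    xi, xj = x.sum(axis=1), x.sum(axis=0)       # margins x_i, x_j
    return (x**2 / np.outer(xi, xj)).sum()

def F(x):
    xi, xj = x.sum(axis=1), x.sum(axis=0)
    return x * (x / np.outer(xi, xj) - H(x))

def p(x):
    xi, xj = x.sum(axis=1), x.sum(axis=0)
    y = x / np.outer(xi, xj)                    # efficiencies y_ij
    s = 0.0
    for i in range(M1):                         # pairs of edges sharing a state i
        s += sum(x[i, j] * x[i, k] / xi[i] * (y[i, j] - y[i, k])**2
                 for j in range(M2) for k in range(M2))
    for j in range(M2):                         # pairs of edges sharing a signal j
        s += sum(x[i, j] * x[k, j] / xj[j] * (y[i, j] - y[k, j])**2
                 for i in range(M1) for k in range(M1))
    return s / 2

t = 1e-6                                        # d/dt H(x + t F(x)) at t = 0
lhs = (H(x + t * F(x)) - H(x - t * F(x))) / (2 * t)
assert abs(lhs - p(x)) < 1e-6                   # identity (9): grad H . F(x) = p(x)
```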

Lemma 4.1. For any $x\in\Delta\setminus\partial\Delta$,

\[
\nabla H\cdot F(x) = \sum_{i,j\in\mathcal{S}} x_{ij}\,(y_{ij} - N_i(x))^2, \tag{13}
\]

which can also be written as

\[
\nabla H\cdot F(x) = \sum_{i,j\in\mathcal{S}} x_{ij}\,(y_{ij} - H(x))^2 - \sum_{i\in\mathcal{S}} x_i\,(N_i(x) - H(x))^2. \tag{14}
\]

Remark. In the context of communication systems, the three formulas (9) and (13)–(14) mean that the growth rate of the communication potential is a function of the differences between the efficiencies of the various strategy pairs.

Proof. Fix $i, j\in\mathcal{S}$, and define $Y : \mathcal{S}\to\mathbb{R}$ by $Y(k) := y_{ik}$, seen as a random variable on $(\mathcal{S}, P_{x,i})$ (with expectation $E_{x,i}(\cdot)$), where $P_{x,i}(k) := x_{ik}/x_i$. Then

\[
E_{x,i}\big[(y_{ij} - Y)^2\big] = \sum_{k\in\mathcal{S}} \frac{x_{ik}}{x_i}\,(y_{ij} - y_{ik})^2 = \big(E_{x,i}(y_{ij} - Y)\big)^2 + E_{x,i}\big[(Y - E_{x,i}(Y))^2\big] = (y_{ij} - N_i(x))^2 + \sum_{k\in\mathcal{S}} \frac{x_{ik}}{x_i}\,(y_{ik} - N_i(x))^2,
\]

using that

\[
E_{x,i}(Y) = \sum_{k\in\mathcal{S}} \frac{x_{ik}}{x_i}\, y_{ik} = N_i(x).
\]

Therefore

\[
\nabla H\cdot F(x) = \frac{1}{2}\bigg( \sum_{i,j\in\mathcal{S}} x_{ij}\,(y_{ij} - N_i(x))^2 + \sum_{i,j,k\in\mathcal{S}} \frac{x_{ij}x_{ik}}{x_i}\,(y_{ik} - N_i(x))^2 \bigg) = \sum_{i,j\in\mathcal{S}} x_{ij}\,(y_{ij} - N_i(x))^2,
\]

which implies that

\[
\nabla H\cdot F(x) = \sum_{i,j\in\mathcal{S}} x_{ij}\,(y_{ij} - H(x))^2 - \sum_{i\in\mathcal{S}} x_i\,(N_i(x) - H(x))^2,
\]

and completes the proof. $\square$

Lemma 4.2. For all $x\in\Delta\setminus\partial\Delta$ and $i\in\mathcal{S}$, $N_i(x) \ge 1$.

Proof. Indeed,

\[
\sum_{j\in\mathcal{S}} y_{ij} x_j = \sum_{j\in\mathcal{S}} \frac{x_{ij}}{x_i} = 1,
\]

which subsequently implies, by the Cauchy–Schwarz inequality, that

\[
1 = \Big( \sum_{j\in\mathcal{S}} y_{ij} x_j \Big)^2 \le \Big( \sum_{j\in\mathcal{S}} y_{ij}^2 x_j \Big)\Big( \sum_{j\in\mathcal{S}:\ x_{ij}>0} x_j \Big) \le \sum_{j\in\mathcal{S}} y_{ij}^2 x_j = N_i(x). \tag{15}
\]

$\square$

Let us define

\[
\Delta_\varepsilon := \{ x\in\Delta\setminus\partial\Delta :\ p(x) > \varepsilon \},
\]

and let

\[
\Lambda := \{ x\in\Delta :\ p(x) = 0 \},
\]

where $p$ is defined in the statement of Proposition 4.1. The following Lemma 4.3 is straightforward.

Lemma 4.3. $x\in\Lambda$ if and only if

\[
\frac{x_{ij}}{x_j} = \frac{x_{ik}}{x_k} \quad \text{for all } i,j,k \text{ s.t. } x_{ij}\ne0,\ x_{ik}\ne0,
\]

or, equivalently,

\[
y_{ij} = y_{ik} \quad \text{for all } i,j,k \text{ s.t. } x_{ij}\ne0,\ x_{ik}\ne0.
\]

Remark. Lemma 4.3 can be phrased as follows: if $x\in\Lambda$ then, in the graph $G_x$, edges within the same connected component have the same weight. Note that $x\in\Gamma$ is equivalent to all edges of $G_x$ having the same weight $H(x)$. So the two sets $\Gamma$ and $\Lambda$ are different, i.e. $H$ is not a strict Lyapunov function, which justifies the need to prove separately the convergence to the set of equilibria in Section 5.1.
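To illustrate the difference between $\Lambda$ and $\Gamma$ (an added example, not from the paper), take two disjoint state–signal pairs carrying unequal masses: each connected component has a single edge, hence constant weight, so $p(x) = 0$ and $x\in\Lambda$; but the two edge weights differ, so $F(x)\ne0$ and $x\notin\Gamma$.

```python
# Two components: state 1 - signal 'A' (mass 0.3), state 2 - signal 'B' (mass 0.7).
x = {(1, 'A'): 0.3, (2, 'B'): 0.7}
xi = {1: 0.3, 2: 0.7}                  # margins x_i (states)
xj = {'A': 0.3, 'B': 0.7}              # margins x_j (signals)

y = {e: v / (xi[e[0]] * xj[e[1]]) for e, v in x.items()}   # efficiencies y_ij
H = sum(v * y[e] for e, v in x.items())                    # potential H(x)
F = {e: v * (y[e] - H) for e, v in x.items()}              # vector field F(x)

# Single-edge components: p(x) = 0, so x lies in Lambda ...
assert abs(H - 2.0) < 1e-12
# ... but the weights 1/0.3 and 1/0.7 differ from H(x) = 2, so F(x) != 0.
assert any(abs(f) > 1e-9 for f in F.values())
```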

4.2 Proof of Theorem 2.1 and convergence to $\Lambda$

4.2.1 Proof of Theorem 2.1

For simplicity, we write $V := V_n$ in the following calculation, with the shorthands $V_{ij} := V(n,i,j)$ and $V_i := T^i_n$; note that $H(x_n) = \frac12\sum_{i,j\in\mathcal{S}} V_{ij}^2/(V_iV_j)$, since the normalization by $T_n$ cancels in $x_{ij}^2/(x_ix_j)$. Let us compute the expected increment of $(H(x_n))_{n\in\mathbb{N}}$. Up to the positive prefactor $1/M_1$ coming from (3), which plays no role in the signs below and is therefore omitted, $\mathbb{E}[\,H(x_{n+1}) - H(x_n) \mid \mathcal{F}_n\,]$ equals

\begin{align*}
&\sum_{i\in\mathcal{S}_1,\,j\in\mathcal{S}_2} \frac{V_{ij}^2}{V_iV_j}\bigg(\frac{(V_{ij}+1)^2}{(V_i+1)(V_j+1)} - \frac{V_{ij}^2}{V_iV_j} + \sum_{k\in\mathcal{S},\,k\ne j}\Big(\frac{V_{ik}^2}{(V_i+1)V_k} - \frac{V_{ik}^2}{V_iV_k}\Big) + \sum_{k\in\mathcal{S},\,k\ne i}\Big(\frac{V_{kj}^2}{(V_j+1)V_k} - \frac{V_{kj}^2}{V_jV_k}\Big)\bigg) \tag{16}\\
&= \frac12 \sum_{(i,j)\in\mathcal{S}^2} \frac{V_{ij}^2}{V_iV_j}\bigg(\frac{(V_{ij}+1)^2}{(V_i+1)(V_j+1)} - \frac{V_{ij}^2}{V_iV_j} + \sum_{k\in\mathcal{S},\,k\ne j}\Big(\frac{V_{ik}^2}{(V_i+1)V_k} - \frac{V_{ik}^2}{V_iV_k}\Big) + \sum_{k\in\mathcal{S},\,k\ne i}\Big(\frac{V_{jk}^2}{(V_j+1)V_k} - \frac{V_{jk}^2}{V_jV_k}\Big)\bigg)\\
&= \frac12 \sum_{(i,j)\in\mathcal{S}^2} \frac{V_{ij}^2}{V_iV_j}\bigg(-\sum_{k\in\mathcal{S}}\frac{V_{ik}^2}{V_i(V_i+1)V_k} - \sum_{k\in\mathcal{S}}\frac{V_{jk}^2}{V_j(V_j+1)V_k} + \frac{(V_{ij}+1)^2}{(V_i+1)(V_j+1)} + \frac{V_{ij}^2}{V_iV_j} - \frac{V_{ij}^2}{(V_i+1)V_j} - \frac{V_{ij}^2}{(V_j+1)V_i}\bigg)\\
&= -\sum_{(i,j,k)\in\mathcal{S}^3} \frac{V_{ij}^2V_{ik}^2}{V_i^2(V_i+1)V_jV_k} + \frac12\sum_{(i,j)\in\mathcal{S}^2} \frac{V_{ij}^2}{V_iV_j}\cdot\frac{2V_iV_jV_{ij} + V_iV_j + V_{ij}^2}{V_iV_j(V_i+1)(V_j+1)}\\
&= \sum_{(i,j)\in\mathcal{S}^2} \frac{V_{ij}^3}{V_i(V_i+1)V_j^2} - \sum_{(i,j,k)\in\mathcal{S}^3} \frac{V_{ij}^2V_{ik}^2}{V_i^2(V_i+1)V_jV_k} \tag{17}\\
&\quad + \frac12\sum_{(i,j)\in\mathcal{S}^2} \frac{V_{ij}^2}{V_iV_j}\cdot\frac{2V_iV_jV_{ij} + V_iV_j + V_{ij}^2}{V_iV_j(V_i+1)(V_j+1)} - \sum_{(i,j)\in\mathcal{S}^2} \frac{V_{ij}^3}{V_i(V_i+1)V_j^2}. \tag{18}
\end{align*}

Now let us prove that (17) is nonnegative. Indeed,

\[
(17) = \sum_{(i,j,k)\in\mathcal{S}^3} \frac{V_{ij}V_{ik}}{V_i+1}\bigg(\frac{V_{ij}^2}{V_i^2V_j^2} - \frac{V_{ij}V_{ik}}{V_i^2V_jV_k}\bigg) = \frac12 \sum_{(i,j,k)\in\mathcal{S}^3} \frac{V_{ij}V_{ik}}{V_i+1}\bigg(\frac{V_{ij}}{V_iV_j} - \frac{V_{ik}}{V_iV_k}\bigg)^2 \ \ge\ 0.
\]

Next we show that (18) is nonnegative as well:

\begin{align*}
(18) &= -\sum_{(i,j)\in\mathcal{S}^2} \frac{V_{ij}^3}{V_i(V_i+1)V_j^2(V_j+1)} + \frac12\sum_{(i,j)\in\mathcal{S}^2} \frac{V_{ij}^2}{V_iV_j(V_i+1)(V_j+1)} + \frac12\sum_{(i,j)\in\mathcal{S}^2} \frac{V_{ij}^4}{V_i^2V_j^2(V_i+1)(V_j+1)}\\
&= \frac12\sum_{(i,j)\in\mathcal{S}^2} \frac{V_{ij}^2}{V_iV_j(V_i+1)(V_j+1)}\bigg(-\frac{2V_{ij}}{V_j} + 1 + \frac{V_{ij}^2}{V_iV_j}\bigg)\\
&= \sum_{i\in\mathcal{S}_1,\,j\in\mathcal{S}_2} \frac{V_{ij}^2}{V_iV_j(V_i+1)(V_j+1)}\bigg(1 - \frac{V_{ij}}{V_i} - \frac{V_{ij}}{V_j} + \frac{V_{ij}^2}{V_iV_j}\bigg)\\
&= \sum_{i\in\mathcal{S}_1,\,j\in\mathcal{S}_2} \frac{V_{ij}^2}{V_iV_j(V_i+1)(V_j+1)}\Big(1 - \frac{V_{ij}}{V_i}\Big)\Big(1 - \frac{V_{ij}}{V_j}\Big) \ \ge\ 0,
\end{align*}

since $V_{ij}\le V_i$ and $V_{ij}\le V_j$. Hence $(H(x_n))_{n\in\mathbb{N}}$ is a submartingale; it is bounded by Lemma 2.1, and therefore converges a.s. $\square$

4.2.2 Convergence to $\Lambda$

Let us now prove the following Proposition 4.2.

Proposition 4.2. $(x_n)_{n\in\mathbb{N}}$ converges to $\Lambda$ a.s.; more precisely, $(p(x_n))_{n\in\mathbb{N}}$ converges to $0$ a.s.

Proof. Define the process $Y_n := H(x_n)$, $n\in\mathbb{N}$. We decompose $(Y_n)_{n\in\mathbb{N}}$ into a martingale $(M_n)_{n\in\mathbb{N}}$ and a predictable process $(A_n)_{n\in\mathbb{N}}$, where $A_{n+1} - A_n = \mathbb{E}[\,Y_{n+1} - Y_n \mid \mathcal{F}_n\,]$. Since $H$ is bounded and $(A_n)_{n\in\mathbb{N}}$ is non-decreasing, the martingale $(M_n)_{n\in\mathbb{N}}$ is bounded from above and hence converges.

Let

\[
P_n := \frac12 \sum_{(i,j,k)\in\mathcal{S}^3} \frac{V^n_{ij}V^n_{ik}}{V^n_i+1}\bigg(\frac{V^n_{ij}}{V^n_iV^n_j} - \frac{V^n_{ik}}{V^n_iV^n_k}\bigg)^2;
\]
\[
Q_n := \sum_{i\in\mathcal{S}_1,\,j\in\mathcal{S}_2} \frac{(V^n_{ij})^2}{V^n_iV^n_j(V^n_i+1)(V^n_j+1)}\bigg(1 - \frac{V^n_{ij}}{V^n_i}\bigg)\bigg(1 - \frac{V^n_{ij}}{V^n_j}\bigg),
\]

so that, by the computation in Section 4.2.1, $A_{n+1} - A_n = (P_n + Q_n)/M_1$. Since $V^n_i + 1 \le 2V^n_i$,

\[
P_n \ \ge\ \frac14 \sum_{(i,j,k)\in\mathcal{S}^3} \frac{V^n_{ij}V^n_{ik}}{V^n_i}\bigg(\frac{V^n_{ij}}{V^n_iV^n_j} - \frac{V^n_{ik}}{V^n_iV^n_k}\bigg)^2 = \frac{p(x_n)}{2T_n}.
\]

The rest of the argument is similar to the proof of convergence to the set of equilibria in [1]. If $(x_n)_{n\in\mathbb{N}}$ were infinitely often away from $\Lambda$, then the drift would cause $(A_n)_{n\in\mathbb{N}}$ to go to infinity, hence contradicting the boundedness of $H$. Indeed, let $\delta$ be the distance between $\Delta_\varepsilon$ and the complement of $\Delta_{\varepsilon/2}$. Suppose $x_n\in\Delta_\varepsilon$, $x_{n+1},\ldots,x_{n+k-1}\in\Delta_{\varepsilon/2}\setminus\Delta_\varepsilon$ and $x_{n+k}\in\Delta^c_{\varepsilon/2}$. Then

\[
A_{n+k} - A_n \ \ge\ \frac{1}{M_1}\sum_{r=n}^{n+k-1}(P_r + Q_r) \ \ge\ \frac{1}{M_1}\sum_{r=n}^{n+k-1}P_r \ \ge\ \frac{1}{M_1}\sum_{r=n}^{n+k-1}\frac{p(x_r)}{2T_r} \ \ge\ \frac{\varepsilon}{4M_1}\sum_{r=n}^{n+k-1}\frac{1}{T_r+1}.
\]

Therefore, by (8),

\[
\delta \ \le\ \sum_{r=n}^{n+k-1} |x_{r+1} - x_r| \ \le\ \sum_{r=n}^{n+k-1} \frac{1}{1+T_r} \ \le\ \frac{4M_1}{\varepsilon}\,(A_{n+k} - A_n).
\]

Therefore $x_n\in\Delta_\varepsilon$ infinitely often would cause $A_n$ to increase to infinity. By contradiction, $(x_n)_{n\in\mathbb{N}}$ must converge to $\Lambda$. $\square$

Remark. The proof of convergence of $p(x_n)$ to $0$ would also hold along (deterministic) solutions of the ODE.

Corollary 4.1. Almost surely,
$$\frac{T_n}{n}\to\lim_{n\to\infty}\frac{H(x_n)}{M_1}\in\left[\frac{1}{M_1},\frac{\min(M_1,M_2)}{M_1}\right]\quad\text{as }n\to\infty.$$

Proof. Same as the proof of Lemma 3.2.
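Corollary 4.1 can be illustrated by simulating the reinforcement scheme. The sketch below encodes our reading of the dynamics (nature draws a state uniformly; sender and receiver sample proportionally to the accumulated weights Vij; a success adds 1 to the weight of the pair used and to Tn); all parameter choices and names are ours, not the paper's:

```python
import random

random.seed(1)

# Reinforcement signaling game, our reading of the model (all names ours):
# nature draws a state uniformly; the sender picks a signal with probability
# proportional to V[state][signal]; the receiver picks an act with probability
# proportional to V[act][signal]; on a success (act == state) the weight of
# the pair used increases by 1, and so does the cumulative payoff T.
M1, M2 = 3, 3
V = [[1.0] * M2 for _ in range(M1)]
T = 0
n_rounds = 20000

for _ in range(n_rounds):
    state = random.randrange(M1)
    signal = random.choices(range(M2), weights=V[state])[0]
    act = random.choices(range(M1), weights=[V[i][signal] for i in range(M1)])[0]
    if act == state:
        V[state][signal] += 1.0
        T += 1

# Occupation measure x and the Lyapounov-type quantity H(x); the corollary
# says T_n/n should track H(x_n)/M1, which lies in [1/M1, min(M1,M2)/M1].
tot = sum(sum(row) for row in V)
x = [[V[i][j] / tot for j in range(M2)] for i in range(M1)]
xi = [sum(x[i]) for i in range(M1)]
xj = [sum(x[i][j] for i in range(M1)) for j in range(M2)]
H = sum(x[i][j] ** 2 / (xi[i] * xj[j])
        for i in range(M1) for j in range(M2) if x[i][j] > 0)
rate = T / n_rounds
```

The run is only a sketch; with more rounds Tn/n and H(xn)/M1 approach a common limit in the interval of the corollary.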

5 Equilibria

5.1 Convergence to the set of equilibria: Proof of Theorem 2.2

We already know that the occupation measure (xn)n∈N a.s. converges to Λ; the goal of this section is to prove that, more precisely, (xn)n∈N a.s. converges to the set of equilibria Γ of the ODE.

Lemma 5.1. Suppose ε is small enough and $x\in\Delta^c_{\varepsilon^4}$. Then for any i ∈ S1, j ∈ S2 s.t. xij > ε,
$$|y_{ij}-N_i(x)|<\frac{\varepsilon}{3}\quad\text{and}\quad|y_{ij}-N_j(x)|<\frac{\varepsilon}{3}.$$

Proof. Follows directly from (13).

Lemma 5.2.
$$y^{n+1}_{ij}-y^n_{ij}=\frac{1}{M_1T_n}\,y^n_{ij}\left(y^n_{ij}-N_i(x_n)-N_j(x_n)+H(x_n)\right)+r_{n+1}+\zeta^{n+1}_{ij},\tag{19}$$
where $(r_n)_{n\geq1}$ is predictable, $E[\zeta^{n+1}_{ij}\mid\mathcal F_n]=0$ and
$$|\zeta^{n+1}_{ij}|<\frac{12}{T_nx^n_ix^n_j},\qquad|r_{n+1}|<\frac{6(x^n_i+x^n_j)}{(T_nx^n_ix^n_j)^2}\leq\frac{12}{(T_nx^n_ix^n_j)^2}.\tag{20}$$

Proof.
$$y^{n+1}_{ij}-y^n_{ij}=\frac{T_{n+1}V^{n+1}_{ij}}{V^{n+1}_iV^{n+1}_j}-\frac{T_nV^n_{ij}}{V^n_iV^n_j}=\underbrace{\left(T_{n+1}V^{n+1}_{ij}V^n_iV^n_j-T_nV^n_{ij}V^{n+1}_iV^{n+1}_j\right)}_{(21)}\times\underbrace{\frac{1}{(V^n_iV^n_j)^2}\left(1+\left(\frac{V^n_iV^n_j}{V^{n+1}_iV^{n+1}_j}-1\right)\right)}_{(22)}$$
and
$$(21)=T_nV^n_iV^n_j\,\mathbf{1}_{\{\Delta V^{n+1}_{ij}>0\}}+V^n_{ij}V^n_iV^n_j\,\mathbf{1}_{\{\Delta T_{n+1}>0\}}+V^n_iV^n_j\,\mathbf{1}_{\{\Delta V^{n+1}_{ij}>0\}}-T_nV^n_{ij}V^n_i\,\mathbf{1}_{\{\Delta V^{n+1}_i>0\}}-T_nV^n_{ij}V^n_j\,\mathbf{1}_{\{\Delta V^{n+1}_j>0\}}-T_nV^n_{ij}\,\mathbf{1}_{\{\Delta V^{n+1}_{ij}>0\}}.$$
Hence
$$|(21)|\leq 6T_nV^n_iV^n_j,$$
and
$$E\left[\frac{(21)}{(V^n_iV^n_j)^2}\,\Big|\,\mathcal F_n\right]=\frac{1}{M_1T_n}\,y^n_{ij}\left(y^n_{ij}-N_i(x_n)-N_j(x_n)+H(x_n)\right).$$
By the simple estimate
$$\left|\frac{V^n_iV^n_j}{V^{n+1}_iV^{n+1}_j}-1\right|\leq\frac{1}{V^{n+1}_i}+\frac{1}{V^{n+1}_j},$$
we deduce that
$$|r_{n+1}|=\left|E\left[\frac{(21)}{(V^n_iV^n_j)^2}\left(\frac{V^n_iV^n_j}{V^{n+1}_iV^{n+1}_j}-1\right)\Big|\,\mathcal F_n\right]\right|\leq\frac{6T_n(V^n_i+V^n_j)}{(V^n_iV^n_j)^2}\leq\frac{6(x^n_i+x^n_j)}{(T_nx^n_ix^n_j)^2},$$
and
$$|y^{n+1}_{ij}-y^n_{ij}|=|(21)\times(22)|\leq\frac{6}{T_nx^n_ix^n_j}.\tag{23}$$
Therefore
$$|\zeta^{n+1}_{ij}|=\left|y^{n+1}_{ij}-y^n_{ij}-E[y^{n+1}_{ij}-y^n_{ij}\mid\mathcal F_n]\right|\leq\frac{12}{T_nx^n_ix^n_j}.$$

Let
$$U_{ij}(\varepsilon):=\left\{x\in\Delta:\ x_{ij}\leq\varepsilon\ \text{or}\ y_{ij}-H(x)>-\varepsilon\right\}.$$

Lemma 5.3. Assume ε > 0 is small enough and m0 ∈ N is large enough. Let
$$R_n:=\sum_{m=m_0}^{n}\left(y^m_{ij}-y^{m-1}_{ij}-\frac{\varepsilon^2}{6m}\right)\mathbf{1}_{\{x_{m-1}\notin U_{ij}(\varepsilon)\cup\Delta_{\varepsilon^4},\ T_{m-1}\geq\frac{m}{2M_1}\}},\quad\forall n\in\mathbb N;$$
$$S_n:=\sum_{m=m_0}^{n}\left(x^m_{ij}-x^{m-1}_{ij}+\frac{\varepsilon}{m}\right)\mathbf{1}_{\{x_{m-1}\notin U_{ij}(\varepsilon)\cup\Delta_{\varepsilon^4},\ T_{m-1}\geq\frac{m}{2M_1}\}},\quad\forall n\in\mathbb N.$$
Then

(a) (Rn)n∈N (resp. (Sn)n∈N) is a submartingale (resp. supermartingale).

(b) $\limsup_{n\geq m,\,m\to\infty}(R_n-R_m)^-=\limsup_{n\geq m,\,m\to\infty}(S_n-S_m)^+=0$.

Proof. First we note that if $x_n\notin U_{ij}(\varepsilon)$,
$$E[x^{n+1}_{ij}-x^n_{ij}\mid\mathcal F_n]=\frac{x^n_{ij}}{1+T_n}\left(y^n_{ij}-H(x_n)\right)\leq-\frac{\varepsilon}{T_n}.$$
This implies that (Sn)n∈N is a supermartingale. Now we prove that (Rn)n∈N is a submartingale.

Assume ε > 0 small enough and $x_n\notin U_{ij}(\varepsilon)\cup\Delta_{\varepsilon^4}$. Then Lemma 5.1 implies that
$$y^n_{ij}-N_i(x_n)-N_j(x_n)+H(x_n)\geq\frac{\varepsilon}{3}.$$
Hence,
$$E[y^{n+1}_{ij}-y^n_{ij}\mid\mathcal F_n]\geq\frac{1}{M_1T_nx^n_ix^n_j}\left(\frac{x^n_{ij}\varepsilon}{3}-\frac{12}{T_nx^n_ix^n_j}\right)\geq\frac{x^n_{ij}\varepsilon}{6M_1T_nx^n_ix^n_j}\geq\frac{\varepsilon^2}{6n},$$
if $n\geq 144M_1/\varepsilon^4$ (which implies $T_n\geq 72/\varepsilon^4$ and therefore $T_nx^n_ix^n_j\geq 72/(x^n_{ij}\varepsilon)$).

Let us now prove (b). Let
$$\Pi_n:=R_n-\sum_{m=m_0}^{n}E[R_m-R_{m-1}\mid\mathcal F_{m-1}],\qquad\Xi_n:=S_n-\sum_{m=m_0}^{n}E[S_m-S_{m-1}\mid\mathcal F_{m-1}].$$

By (20), we note that for all $n\geq m_0$,
$$E[(\Pi_{n+1}-\Pi_n)^2\mid\mathcal F_n]\leq E[(y^{n+1}_{ij}-y^n_{ij})^2\mid\mathcal F_n]\leq\frac{12}{(T_nx^n_ix^n_j)^2}\,\mathbf{1}_{\{x_n\notin U_{ij}(\varepsilon),\ T_n\geq\frac{n}{2M_1}\}}\leq\frac{48M_1^2}{\varepsilon^4n^2}.$$
Therefore (Πn)n∈N is bounded in L2 and hence converges. We can obtain similar bounds for (Ξn)n∈N, which converges as well. This completes the proof.

Lemma 5.4. Let ε > 0, and assume n ∈ N is sufficiently large (depending on ε). If $x_n\in U_{ij}(\varepsilon)$, $|H(x_{n+1})-H(x_n)|<\varepsilon/2$ and $T_n\geq n/(2M_1)$, then $x_{n+1}\in U_{ij}(2\varepsilon)$.

Proof. Let $x_n\in U_{ij}(\varepsilon)$. Then (8) implies
$$|x^{n+1}_{ij}-x^n_{ij}|\leq\frac{2M_1}{n}.$$
Assume that n is large enough. If $x^n_{ij}\leq\varepsilon$, then $x^{n+1}_{ij}\leq2\varepsilon$. Otherwise $x^n_{ij}>\varepsilon$ and $y^n_{ij}-H(x_n)>-\varepsilon$; assuming $T_n\geq n/(2M_1)$ and using (23), we have
$$|y^{n+1}_{ij}-y^n_{ij}|\leq\frac{6}{T_nx^n_ix^n_j}\leq\frac{12M_1}{\varepsilon^2n}<\frac{\varepsilon}{2},$$
and, by assumption, $|H(x_{n+1})-H(x_n)|<\varepsilon/2$, so that we conclude that $y^{n+1}_{ij}-H(x_{n+1})>-2\varepsilon$.

Lemma 5.5.
$$\limsup_{n\to\infty}x^n_{ij}\left(y^n_{ij}-H(x_n)\right)^-=0.$$

Proof. We fix ε > 0 and m0 ∈ N, and let τm0 be the stopping time
$$\tau_{m_0}:=\inf\left\{n\geq m_0:\ x_n\notin\Delta_{\varepsilon^4}\ \text{or}\ T_n<\frac{n}{2M_1}\ \text{or}\ |H(x_n)-H(x_{m_0})|>\frac{\varepsilon}{4}\right\}.$$
We only need to show that, almost surely, either τm0 < ∞ or xn ∈ Uij(3ε) for all n large enough. This will complete the proof, since we know from Theorem 2.1, Lemma 3.2 and Proposition 4.2 that there exists almost surely m0 ∈ N s.t. τm0 = ∞.

Let σm0 be the stopping time
$$\sigma_{m_0}:=\inf\{n\geq m_0:\ x_n\in U_{ij}(\varepsilon)\}.$$
Lemma 5.3(b) implies that there exists a.s. a (random) m0 ∈ N such that, for all n ≥ m ≥ m0,
$$(R_n-R_m)^-\leq\frac{\varepsilon}{2},\qquad(S_n-S_m)^+\leq\frac{\varepsilon}{2}.\tag{24}$$
Therefore τm0 < ∞ or σm0 < ∞, using $\sum_{n\geq m_0}1/n=\infty$ and the observation that $x^n_{ij}$ is bounded.

For all n ∈ [σm0 , τm0), let ρn be the largest k ≤ n such that xk ∈ Uij(ε). By (24),
$$y^n_{ij}-y^{\rho_n+1}_{ij}\geq-\frac{\varepsilon}{2}.$$
Now $x_{\rho_n+1}\in U_{ij}(2\varepsilon)$ (by Lemma 5.4): let us assume for instance that $y^{\rho_n+1}_{ij}-H(x_{\rho_n+1})>-2\varepsilon$. Together with $|H(x_n)-H(x_{\rho_n+1})|\leq\varepsilon/2$ (since n < τm0), we deduce that
$$y^n_{ij}-H(x_n)\geq y^{\rho_n+1}_{ij}-H(x_{\rho_n+1})-\varepsilon>-3\varepsilon.$$
With a similar argument, $x^{\rho_n+1}_{ij}\leq2\varepsilon$ implies $x^n_{ij}\leq3\varepsilon$. So overall, xn ∈ Uij(3ε) if σm0 ≤ n < τm0, which enables us to conclude.

Proof of Theorem 2.2.
$$2H(x_n)=\sum_{i,j\in S}y^n_{ij}x^n_{ij}=2H(x_n)+\sum_{i,j\in S}\left(y^n_{ij}-H(x_n)\right)^+x^n_{ij}-\sum_{i,j\in S}\left(y^n_{ij}-H(x_n)\right)^-x^n_{ij}.$$
Hence, $\lim_{n\to\infty}x^n_{ij}(y^n_{ij}-H(x_n))^-=0$ implies
$$\lim_{n\to\infty}x^n_{ij}\left(y^n_{ij}-H(x_n)\right)=0.$$
Lemma 5.5 enables us to conclude.

5.2 Bipartite graph structure

Let us recall the bipartite graph defined in Section 3 (see Definition 2.1): any x ∈ ∆ is associated with a weighted bipartite graph Gx with vertex set S := S1 ∪ S2, adjacency relation ∼ and weights as follows:

(1) ∀ i ∈ S1, j ∈ S2, i ∼ j if and only if xij > 0.

(2) The weight of edge {i, j} is xij/(xixj).

Let C1, ..., Cd be the connected components of Gx. Besides the bipartite graph defined above, let us also discuss two other possible ways to assign weights:

Graph G¹x: Edge eij has weight xij/xj, i ∈ S1, j ∈ S2.

Graph G²x: Edge eij has weight xij/xi, i ∈ S1, j ∈ S2.

By Lemma 4.3, we observe some interesting properties of these three graphs when x ∈ Λ, and in particular when x ∈ Γ:

On Gx: All edges in a component Ck, k = 1, ..., d, have the same weight λk. Hence,
$$H(x)=\sum_{k=1}^{d}\sum_{i\in S_1\cap C_k}x_i\lambda_k=\sum_{k=1}^{d}\sum_{j\in S_2\cap C_k}x_j\lambda_k.$$
Furthermore, if x is an equilibrium, all the edges in Gx have the same weight, which equals H(x).

On G¹x: Every edge linked with the same state i has the same weight, which we denote by ki. Also, for each signal j, the sum of the weights of the edges linked to j is equal to 1, so that $H(x)=\sum_{i=1}^{M_1}k_i$.

On G²x: Every edge linked with the same signal j has the same weight, which we denote by k′j. Also, for each state i, the sum of the weights of the edges linked to i is equal to 1, so that $H(x)=\sum_{j=1}^{M_2}k'_j$.

5.3 Properties of the Lyapounov function

We now show that H is constant on each connected component of Γ. Since H is not continuous on the boundary, we first prove in Lemma 5.6 that it takes a constant value on connected subsets of Γ with the same support (defined below), by a differentiability argument, and then conclude in Proposition 5.1 by a continuity argument on the set of equilibria.

Let
$$\Theta:=\{\theta:\ \theta\subseteq S_{pair}\}.$$
For any x ∈ ∆, we define its support
$$S_x:=\{(i,j):\ i\in S_1,\ j\in S_2,\ x_{ij}>0\}.$$
Θ can be used as an index set to divide ∆ into several parts in the following sense: for any θ ∈ Θ,
$$\Delta_\theta:=\{x\in\Delta:\ S_x=\theta\},\qquad\Gamma_\theta:=\Delta_\theta\cap\Gamma.$$

Lemma 5.6. For any θ ∈ Θ, H is constant on each connected component of Γθ.

Proof. Given q ∈ Γθ, let us differentiate H at q with respect to xij = xji, (i,j) ∈ Sq, without the constraint x ∈ ∆:
$$\left(\frac{\partial}{\partial x_{ij}}H(x)\right)_{x=q}=\sum_{(k,l)\in S_q}\frac{\partial}{\partial x_{ij}}\left(\frac{x_{kl}^2}{x_kx_l}\right)$$
$$=\frac{\partial}{\partial x_{ij}}\left(\frac{x_{ij}^2}{x_ix_j}\right)+\sum_{k\neq j,\ (i,k)\in S_q}\frac{\partial}{\partial x_{ij}}\left(\frac{x_{ik}^2}{x_ix_k}\right)+\sum_{l\neq i,\ (l,j)\in S_q}\frac{\partial}{\partial x_{ij}}\left(\frac{x_{lj}^2}{x_lx_j}\right)$$
$$=\frac{q_{ij}}{q_iq_j}\left(2-\frac{q_{ij}}{q_i}-\frac{q_{ij}}{q_j}\right)+\sum_{k\neq j,\ (i,k)\in S_q}\frac{q_{ik}}{q_iq_k}\left(-\frac{q_{ik}}{q_i}\right)+\sum_{l\neq i,\ (l,j)\in S_q}\frac{q_{lj}}{q_lq_j}\left(-\frac{q_{lj}}{q_j}\right)$$
$$=H(q)\left(2-\frac{q_{ij}}{q_i}-\frac{q_{ij}}{q_j}\right)+H(q)\left(-\frac{q_i-q_{ij}}{q_i}\right)+H(q)\left(-\frac{q_j-q_{ij}}{q_j}\right)=0.$$
The penultimate equality comes from the fact that qij/(qiqj) = H(q) if (i,j) ∈ Sq, q ∈ Γ.

Proposition 5.1. H is constant on each connected component of Γ.

Proof. Let us show that H is continuous on Γ, which will enable us to conclude. Indeed, suppose that q ∈ Γ and that x ∈ Γ is in a neighbourhood of q within ∆; then Sx ⊇ Sq and, using x ∈ Γ,
$$H(x)=\sum_{(i,j)\in S_q}\frac{x_{ij}^2}{x_ix_j}+\sum_{(i,j)\in S_x\setminus S_q}x_{ij}H(x),$$
so that
$$H(x)=\frac{1}{1-\sum_{(i,j)\in S_x\setminus S_q}x_{ij}}\sum_{(i,j)\in S_q}\frac{x_{ij}^2}{x_ix_j},$$
and the conclusion follows.

5.4 Classification of equilibria and stability

5.4.1 Jacobian matrix

At any equilibrium x ∈ (∆ \ ∂∆) ∩ Γ (F is not differentiable on ∂∆), we calculate the Jacobian matrix
$$J_x=\left(\frac{\partial F_{lk}}{\partial x_{ij}}\right)_{(i,j),(l,k)\in S_{pair}}$$
where, by a slight abuse of notation, $F(x)=(F_{ij}(x))_{(i,j)\in S_{pair}}$.

For all (i,j),(l,k) ∈ Spair, a simple extension of the calculation in the proof of Lemma 5.6 yields $\frac{\partial H}{\partial x_{ij}}(x)=-\mathbf{1}_{\{x_{ij}=0\}}\,2H(x)$, so that
$$\frac{\partial F_{lk}}{\partial x_{ij}}=\mathbf{1}_{\{(i,j)=(l,k),\,x_{ij}=0\}}\left(y_{lk}-H(x)\right)+x_{lk}\frac{\partial y_{lk}}{\partial x_{ij}}-x_{lk}\frac{\partial H}{\partial x_{ij}}(x)$$
$$=-\mathbf{1}_{\{(i,j)=(l,k),\,x_{ij}=0\}}H(x)+x_{lk}\frac{\partial y_{lk}}{\partial x_{ij}}+2x_{lk}\,\mathbf{1}_{\{x_{ij}=0\}}H(x)$$
$$=H(x)\,\mathbf{1}_{\{(i,j)=(l,k)\}}\left(\mathbf{1}_{\{x_{ij}\neq0\}}-\mathbf{1}_{\{x_{ij}=0\}}\right)+H(x)\,\mathbf{1}_{\{x_{lk}\neq0\}}\left(-\frac{x_{ik}}{x_i}\mathbf{1}_{\{i=l\}}-\frac{x_{lj}}{x_j}\mathbf{1}_{\{j=k\}}\right)+2x_{lk}\,\mathbf{1}_{\{x_{ij}=0\}}H(x).$$
Therefore, for any (i,j) ∈ Spair s.t. xij ≠ 0,
$$\frac{\partial F_{ij}}{\partial x_{ij}}=H(x)\left(1-\frac{x_{ij}}{x_i}-\frac{x_{ij}}{x_j}\right);\tag{25}$$
$$\frac{\partial F_{ik}}{\partial x_{ij}}=H(x)\left(-\frac{x_{ik}}{x_i}\right),\quad k\in S_2\setminus\{j\};\tag{26}$$
$$\frac{\partial F_{lj}}{\partial x_{ij}}=H(x)\left(-\frac{x_{lj}}{x_j}\right),\quad l\in S_1\setminus\{i\};\tag{27}$$
$$\frac{\partial F_{lk}}{\partial x_{ij}}=0,\quad l\in S_1\setminus\{i\},\ k\in S_2\setminus\{j\};\tag{28}$$
and, for any (i,j) ∈ Spair s.t. xij = 0,
$$\frac{\partial F_{ij}}{\partial x_{ij}}=-H(x);\tag{29}$$
$$\frac{\partial F_{lk}}{\partial x_{ij}}=0,\quad l\in S_1,\ k\in S_2,\ (l,k)\neq(i,j),\ x_{lk}=0.\tag{30}$$
Let C1, ..., Cd be the connected components of the edges of Gx, and let
$$J^m_x:=\left(\frac{\partial F_{lk}}{\partial x_{ij}}\right)_{(i,j),(l,k)\in C_m}.$$
Therefore, using (25)-(30), Jx can be written as follows, by putting first the (i,j) and (l,k) coordinates such that xij ≠ 0 and xlk ≠ 0 (in the same order, with increasing connected components C1, ..., Cd):
$$J_x=\begin{pmatrix}J^1_x& & & & &(0)\\ &\ddots& & & & \\ & &J^d_x& & & \\ & & &-H(x)& & \\ &(*)& & &\ddots& \\ & & & & &-H(x)\end{pmatrix}$$

5.4.2 Classification of equilibria based on stability

Let us introduce a few definitions of stability for ordinary differential equations.

Definition 5.1. x is Lyapounov stable if for any neighborhood U1 of x there exists a neighborhood U2 ⊆ U1 of x such that any solution x(t) starting in U2 remains in U2 for all t ≥ 0.

Definition 5.2. x is asymptotically stable if it is Lyapounov stable and there exists a neighbourhood U1 such that any solution x(t) starting in U1 converges to x.

An equilibrium that is Lyapounov stable but not asymptotically stable is sometimes called neutrally stable.

Definition 5.3. x is linearly stable if all eigenvalues of the Jacobian matrix at x have nonpositive real part; otherwise, x is called linearly unstable.

Note that, with these definitions, linear stability allows eigenvalues with zero real part, and therefore does not necessarily imply Lyapounov stability. However, for the dynamics considered here, linearly stable equilibria that do not lie on the boundary of ∆ are indeed Lyapounov stable, as can be shown with the help of an entropy function; Section 7, on convergence with positive probability to stable configurations, can be understood as a consequence of this property in the nondeterministic case.

Definition 5.4. Let
$$\Gamma_0:=\Gamma\cap(\Delta\setminus\partial\Delta),\qquad\Gamma_b:=\Gamma\cap\partial\Delta,$$
and let Γs (resp. Γu) be the set of linearly stable (resp. unstable) equilibria in Γ0 for the mean-field ODE.

For any x ∈ Γu, let
$$E_x:=\{\theta\in\mathbb R^{M_1\times M_2}:\ |\theta|=1\ \text{and}\ \exists\,(i,j)\in S_x\ \text{s.t.}\ \theta\cdot e_{ij}>0\}.$$

Proposition 5.2. We have

(a) $\Gamma_s=\{x\in\Gamma_0:\ (P)_{G_x}\ \text{holds}\}$.

(b) If x ∈ Γu, then there exists an eigenvector in Ex whose eigenvalue has positive real part.

(c) For all x ∈ Γu and θ ∈ Ex \ {0}, there exists a neighbourhood N(x) of x such that, if xn ∈ N(x), then
$$E[(\eta_{n+1}\cdot\theta)^2\mid\mathcal F_n]\geq\frac{\mathrm{Cst}(x)}{n^2},$$
where ηn+1 is the martingale increment defined in (7).

Note that (b)-(c) will be used to show that (xn)n∈N stochastically "perturbs enough" the ODE (6), in order to prove nonconvergence to unstable equilibria.

To prove Proposition 5.2, we need the following Lemma 5.7 on the structure of Gx when x ∈ Γ0, and the elementary Lemma 5.8.

Lemma 5.7. For any x ∈ Γ0 such that (P)Gx does not hold, Gx has at least one connected component on which every vertex has at least two edges.

Proof. First, the failure of (P)Gx implies that there exists a connected component C with at least two states and two signals. Assume, by contradiction, that signal j is in C and that j is only linked to one state, for instance state i. Then xij/xj = 1 and, for all k ∈ S2 s.t. xik > 0,
$$\frac{x_{ik}}{x_k}=\frac{x_{ij}}{x_j}=1,$$
i.e. k is only linked with one edge. This implies C ∩ S1 = {i}, which contradicts our assumption.

Lemma 5.8. If a random variable ω satisfies E[ω] < ∞, P(ω = a) ≥ p and P(ω = b) ≥ p, then
$$\mathrm{Var}(\omega)\geq\frac{(b-a)^2p}{4}.$$

Proof.
$$\mathrm{Var}(\omega)\geq p\left((a-E[\omega])^2\vee(b-E[\omega])^2\right)\geq\frac{(b-a)^2p}{4}.$$
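Lemma 5.8 can be checked numerically on a small grid of three-point distributions; the values below are arbitrary test data, not from the paper:

```python
import itertools

# Lemma 5.8 for a concrete three-point distribution: omega takes values a, b, c
# with P(omega=a), P(omega=b) >= p; check Var(omega) >= (b-a)^2 * p / 4
# over a small grid of probability pairs.
a, b, c = 0.0, 1.0, 0.3
violations = 0
checked = 0
for pa, pb in itertools.product([0.1, 0.2, 0.3, 0.4], repeat=2):
    pc = 1.0 - pa - pb
    if pc < 0:
        continue
    p = min(pa, pb)
    mean = pa * a + pb * b + pc * c
    var = pa * (a - mean) ** 2 + pb * (b - mean) ** 2 + pc * (c - mean) ** 2
    checked += 1
    if var < (b - a) ** 2 * p / 4 - 1e-12:
        violations += 1
```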

Proof of Proposition 5.2. (a)-(b) Suppose that x ∈ Γ0 and that (P)Gx does not hold, and let us prove that x ∈ Γu. Lemma 5.7 implies that Gx has a connected component on which every vertex has at least two edges, which we assume to be C1 w.l.o.g. Let V(C1) (resp. E(C1)) be its set of vertices (resp. edges). Let us show that J¹x has at least one eigenvalue with positive real part.

Indeed, let us compute the trace of J¹x:
$$\mathrm{Tr}(J^1_x)=H(x)\sum_{i,j\in C_1:\,i\sim j}\left(1-\frac{x_{ij}}{x_i}-\frac{x_{ij}}{x_j}\right)=H(x)\left(|E(C_1)|-|V(C_1)|\right)\geq 0.$$
The last inequality comes from the fact that the number of edges is greater than or equal to the number of vertices in C1, because every vertex has at least two edges in C1. Now it is easy to check that $J^1_x\mathbf{1}=-H(x)\mathbf{1}$, where $\mathbf{1}=(1,\ldots,1)^T$, and therefore that −H(x) is an eigenvalue of J¹x, which enables us to conclude (b) and the first part of (a).

Now suppose that x ∈ Γ0 and that (P)Gx holds. Then each component of Gx has only one state or only one signal. Let us assume for instance that C1 consists of states 1, ..., k and signal A. Then J¹x equals
$$-\frac{H(x)}{x_A}\begin{pmatrix}x_1&x_2&\ldots&x_k\\x_1&x_2&\ldots&x_k\\\vdots&\vdots&\ddots&\vdots\\x_1&x_2&\ldots&x_k\end{pmatrix}.$$
The rank of J¹x is 1, and its eigenvalues are 0 and −H(x), which completes the proof.
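The spectrum of this rank-one block can be verified directly: the all-ones vector is an eigenvector with eigenvalue −H(x) (using ∑l xl = xA on such a component), while any vector annihilated by the common row lies in the kernel. A small numerical sketch, with illustrative values of our choosing:

```python
# The block J^1_x for a component {states 1..k, signal A}: every row equals
# -(H/x_A) * (x_1,...,x_k). We check its two eigenvalues directly:
# the all-ones vector gives -H (since sum_l x_l = x_A here), and a vector
# on which the common row vanishes gives 0.
k = 4
xs = [0.1, 0.2, 0.3, 0.15]      # x_1..x_k (illustrative positive masses)
xA = sum(xs)                    # x_A = sum of the x_l in the component
H = 1.7                         # placeholder value of H(x)

J = [[-(H / xA) * xs[c] for c in range(k)] for _ in range(k)]

def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(k)) for r in range(k)]

ones = [1.0] * k
Jones = matvec(J, ones)                     # should equal -H * ones
w = [1.0 / xs[0], -1.0 / xs[1], 0.0, 0.0]   # rows are proportional: J w = 0
Jw = matvec(J, w)
```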

(c) Let x ∈ Γu and θ ∈ Ex \ {0}, and let (eij) be an orthogonal basis of $\mathbb R^{M_1\times M_2}$. Let
$$W_{n+1}:=\mathbf{1}_{\{|V_{n+1}-V_n|=1\}}\left(V_{n+1}-V_n-x_n\right)=(1+T_n)(x_{n+1}-x_n).$$
We note that
$$W_{n+1}\cdot\theta=0\quad\text{with probability }1-\frac{H(x_n)}{M_1},$$
$$\forall\,i\in S_1,\ j\in S_2,\quad W_{n+1}\cdot\theta=(1-x^n_{ij})\,e_{ij}\cdot\theta\quad\text{with probability }\frac{x^n_{ij}y^n_{ij}}{M_1}.$$
Note that x ∈ Γu implies that H(x) ≠ M1 and that, for all (i,j) ∈ Sx, xij ≠ 1. Therefore, assuming that xn is in the neighbourhood of x (for which yij = H(x)), Lemma 5.8 implies
$$\mathrm{Var}\left(W_{n+1}\cdot\theta\mid\mathcal F_n\right)\geq\max_{i,j\in C_1:\,i\sim j}\min\left(1-\frac{H(x_n)}{M_1},\frac{x^n_{ij}y^n_{ij}}{M_1}\right)\frac{\left((1-x^n_{ij})\,e_{ij}\cdot\theta\right)^2}{4}\tag{31}$$
$$\geq\mathrm{Cst}(x)>0,\tag{32}$$
where we use θ ∈ Ex in the penultimate inequality. This completes the proof. □

If M1 = M2, let us define the set of equilibria Γsig as follows: x ∈ Γsig if and only if there exists a bijective map φ from S1 to S2 such that
$$x_{ij}=\begin{cases}\dfrac{1}{M_1}&\text{if }j=\varphi(i),\\[4pt]0&\text{otherwise.}\end{cases}$$
Equilibria in Γsig correspond to a perfectly efficient signaling system, in the sense that they bear no synonyms or informational bottlenecks. It is easy to check from the calculation in the proof of (b) of Proposition 5.2 that the Jacobian matrix at such an x has −H(x) as its only eigenvalue, which implies asymptotic stability.

Corollary 5.1. Suppose the game has the same number of states and signals, i.e. M1 = M2, and that x ∈ Γsig. Then x is asymptotically stable for the mean-field ODE.

6 Links with static equilibrium analysis

We present in this section some results on the characterization of evolutionarily stable strategies (ESS) and neutrally stable strategies (NSS) for signaling games. Note that Taylor and Jonker (1978) and Zeeman (1979) propose, in a general setting, conditions under which an evolutionarily stable strategy is indeed a (dynamically) stable equilibrium. However the condition is quite strong, and only applies to special cases of the signaling game.

We show here an equivalence between this static context and the underlying reinforcement learning dynamics, i.e. that

(1) the set of ESS matches the set of asymptotically stable equilibria of the mean-field ODE;

(2) the set of neutrally stable strategies matches the set of linearly stable equilibria of the mean-field ODE;

(3) the set of Nash equilibria matches the set Λ where ∇H · F vanishes.

In Sections 6.1, 6.2 and 6.3 we successively present the usual notions in the static setting of signaling games, results on equilibrium selection, and the connection between the stability of the dynamics and the static setting of the game mentioned above.

6.1 Static setting

Let
$$\mathcal P=\left\{P\in\mathbb R_+^{M_1\times M_2}:\ \forall\,i\in S_1,\ \sum_{j\in S_2}p_{ij}=1\right\},\qquad\mathcal Q=\left\{Q\in\mathbb R_+^{M_2\times M_1}:\ \forall\,j\in S_2,\ \sum_{i\in S_1}q_{ji}=1\right\}.$$
A Sender's strategy can be represented by an M1 × M2 matrix P ∈ P, in the sense that if he sees state i, he chooses signal j with probability pij. Similarly, we can represent the Receiver's strategy by an M2 × M1 matrix Q ∈ Q: if he sees signal j, he chooses act i with probability qji. The payoff function is
$$\mathrm{Payoff}(P,Q)=\sum_{i\in S_1,\,j\in S_2}p_{ij}q_{ji}=\mathrm{tr}(PQ).$$
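As a quick illustration of the payoff tr(PQ) (with toy matrices of our choosing): a permutation pair (P, Pᵀ) attains payoff M1, while a uniform "pooling" pair scores strictly less:

```python
# Payoff(P,Q) = sum_ij p_ij q_ji = tr(PQ), for two toy strategy pairs:
# a perfect signaling system (P a permutation matrix, Q = P^T) versus
# a completely uninformative pooling pair. Matrices are illustrative.
M1 = M2 = 3

def payoff(P, Q):
    # tr(PQ) computed entrywise
    return sum(P[i][j] * Q[j][i] for i in range(M1) for j in range(M2))

perm = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]                       # permutation P
perm_T = [[perm[i][j] for i in range(M1)] for j in range(M2)]  # Q = P^T
pool = [[1 / 3] * 3 for _ in range(3)]                         # uniform rows

best = payoff(perm, perm_T)    # equals M1 for a signaling system
mixed = payoff(pool, pool)     # strictly smaller
```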

As in the real world, where somebody can be a sender at one time and a receiver at another, each individual is assumed to be Sender or Receiver with equal probability at each time, which symmetrizes the game. Thus the (mixed) strategy of either player of the symmetrized game is a pair of Sender and Receiver matrices (P,Q), and the payoff function Υ can be written as
$$\Upsilon[(P,Q),(P',Q')]=\Upsilon[(P',Q'),(P,Q)]=\frac12\mathrm{tr}(PQ')+\frac12\mathrm{tr}(P'Q).$$

6.2 Equilibrium selection

Definition 6.1. A strategy (P,Q) ∈ P × Q is called a Nash strategy if
$$\Upsilon[(P,Q),(P,Q)]\geq\Upsilon[(P',Q'),(P,Q)],\qquad\forall\,(P',Q')\in\mathcal P\times\mathcal Q.$$

There are uncountably many Nash strategies in signaling games. Let us recall the following notions of evolutionarily and neutrally stable equilibria, which enable one to distinguish the relevant limiting strategies of the game; note that these notions are purely static here.

Definition 6.2. A strategy (P,Q) ∈ P × Q is evolutionarily stable if

(i) it is a Nash strategy, and

(ii) Υ[(P,Q),(P′,Q′)] > Υ[(P′,Q′),(P′,Q′)] for all (P′,Q′) ≠ (P,Q).

Definition 6.3. A strategy (P,Q) ∈ P × Q is neutrally stable if

(i) it is a Nash strategy, and

(ii) if Υ[(P,Q),(P,Q)] = Υ[(P′,Q′),(P,Q)] for some (P′,Q′) ∈ P × Q, then Υ[(P,Q),(P′,Q′)] ≥ Υ[(P′,Q′),(P′,Q′)].

The following Propositions 6.1, 6.2 and 6.3 characterize ESSs and NSSs in the signaling game.

Proposition 6.1 (Trapa and Nowak (2000)). Let (P,Q) ∈ P × Q be such that neither P nor Q contains a column that consists entirely of zeros. Then (P,Q) is a Nash strategy if and only if there exist positive numbers p1, ..., pn and q1, ..., qm such that

(1) for each j, the j-th column of P has its entries drawn from {0, pj},

(2) for each i, the i-th column of Q has its entries drawn from {0, qi},

(3) for all i, j, pij ≠ 0 if and only if qji ≠ 0.

Remark. The assumption that neither P nor Q contains a column consisting entirely of zeros corresponds to the requirement that no signal or act falls out of use.

Proposition 6.2 (Pawlowitsch (2007)). Let (P,Q) ∈ P × Q be a Nash strategy. Then (P,Q) is a neutrally stable strategy if and only if

(1) at least one of the two matrices, P or Q, has no zero column, and

(2) if P or Q contains a column with more than one positive element, then all the elements in this column take values in {0, 1}.

Proposition 6.3 (Trapa and Nowak (2000)). (P,Q) ∈ P × Q is an evolutionarily stable strategy if and only if M1 = M2, P is a permutation matrix and Q = P^T.

6.3 Connection

Let us now define a map between the static and the dynamic learning models, in order to emphasize the correspondence between stability in either setting.

Let s1 be a bijective map from S1 to {1, ..., M1} and s2 be a bijective map from S2 to {1, ..., M2}. Let us define the map Ψ from ∆ \ ∂∆ to P × Q by Ψ(x) = (P,Q), where
$$p_{s_1(i)s_2(j)}=\frac{x_{ij}}{x_i},\qquad q_{s_2(j)s_1(i)}=\frac{x_{ij}}{x_j},\qquad\forall\,i\in S_1,\ j\in S_2.$$
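The map Ψ can be computed explicitly on a point of Γsig; consistently with Proposition 6.3, it returns a permutation matrix P with Q = Pᵀ. The bijection below is an arbitrary example, and indices are 0-based:

```python
# The map Psi sends x to (P,Q) with p_ij = x_ij/x_i and q_ji = x_ij/x_j.
# For x in Gamma_sig (mass 1/M1 on a bijection) it should return a
# permutation matrix P with Q = P^T, matching Proposition 6.3.
M1 = M2 = 3
phi = {0: 2, 1: 0, 2: 1}        # an arbitrary bijection S1 -> S2
x = [[(1 / M1 if phi[i] == j else 0.0) for j in range(M2)] for i in range(M1)]
xi = [sum(x[i]) for i in range(M1)]
xj = [sum(x[i][j] for i in range(M1)) for j in range(M2)]

P = [[x[i][j] / xi[i] for j in range(M2)] for i in range(M1)]
Q = [[x[i][j] / xj[j] for i in range(M1)] for j in range(M2)]

is_perm = all(sum(P[i]) == 1 for i in range(M1)) and \
          all(sum(P[i][j] for i in range(M1)) == 1 for j in range(M2)) and \
          all(P[i][j] in (0.0, 1.0) for i in range(M1) for j in range(M2))
q_is_pt = all(Q[j][i] == P[i][j] for i in range(M1) for j in range(M2))
```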

Proposition 6.4. Let
$$\Gamma_{\mathcal P\times\mathcal Q}:=\left\{(P,Q)\in\mathcal P\times\mathcal Q:\ \text{neither }P\text{ nor }Q\text{ contains a column that consists entirely of zeros}\right\},$$
and let
$$\Lambda_{\mathcal P\times\mathcal Q}:=\{(P,Q)\in\mathcal P\times\mathcal Q:\ (P,Q)\ \text{is a Nash strategy}\}\cap\Gamma_{\mathcal P\times\mathcal Q},$$
$$\Gamma^s_{\mathcal P\times\mathcal Q}:=\{(P,Q)\in\mathcal P\times\mathcal Q:\ (P,Q)\ \text{is a neutrally stable strategy}\}\cap\Gamma_{\mathcal P\times\mathcal Q},$$
$$\Gamma^{sig}_{\mathcal P\times\mathcal Q}:=\{(P,Q)\in\mathcal P\times\mathcal Q:\ (P,Q)\ \text{is an evolutionarily stable strategy}\}\cap\Gamma_{\mathcal P\times\mathcal Q}.$$
Then

(a) $\Psi((\Delta\setminus\partial\Delta)\cap\Lambda)=\Lambda_{\mathcal P\times\mathcal Q}$ and $\Psi((\Delta\setminus\partial\Delta)\setminus\Lambda)\cap\Lambda_{\mathcal P\times\mathcal Q}=\emptyset$.

(b) $\Psi(\Gamma_s)=\Gamma^s_{\mathcal P\times\mathcal Q}$.

(c) $\Psi(\Gamma_{sig})=\Gamma^{sig}_{\mathcal P\times\mathcal Q}$.

Proof. (a) is a direct consequence of Lemma 4.3 and Proposition 6.1.

Conversely, given $(P,Q)\in\Lambda_{\mathcal P\times\mathcal Q}$, let p1, ..., pn and q1, ..., qm be defined as in Proposition 6.1. Note that
$$p_{s_1(i)s_2(j)}q_{s_2(j)s_1(i)}=p_{s_2(j)}q_{s_1(i)}\,\mathbf{1}_{\{p_{s_1(i)s_2(j)}\neq0\}}=p_{s_1(i)s_2(j)}q_{s_1(i)}=p_{s_2(j)}q_{s_2(j)s_1(i)}.$$
Define
$$Z:=\sum_{(i,j)\in S_{pair}}p_{s_1(i)s_2(j)}q_{s_2(j)s_1(i)}=\sum_{i\in S_1}q_{s_1(i)}=\sum_{j\in S_2}p_{s_2(j)},$$
and let
$$x_{ij}:=\frac{p_{s_1(i)s_2(j)}q_{s_2(j)s_1(i)}}{Z},\quad i\in S_1,\ j\in S_2;$$
$$x_i:=\frac{q_{s_1(i)}}{\sum_{k\in S_1}q_{s_1(k)}}=\sum_{j\in S_2}x_{ij},\quad i\in S_1;\qquad x_j:=\frac{p_{s_2(j)}}{\sum_{k\in S_2}p_{s_2(k)}}=\sum_{i\in S_1}x_{ij},\quad j\in S_2.$$
Then (P,Q) = Ψ(x), and x ∈ (∆ \ ∂∆) ∩ Λ is again a direct consequence of Lemma 4.3 and Proposition 6.1.

Let us now prove (b), and assume (P,Q) = Ψ(x): if one column of P (resp. Q) has more than one positive element, this corresponds to an informational bottleneck (resp. a synonym). Hence Condition (2) in Proposition 6.2 means that a state(act)-signal correspondence cannot be associated both to a synonym and an informational bottleneck, which translates into (P)Gx. (c) follows from Corollary 5.1 and Proposition 6.3.

7 Convergence with positive probability to stable configurations

Let us prove the following, more general

Theorem 7.1. Let q ∈ Γs, and let N(q) be a neighbourhood of q in ∆. Then, with positive probability,

(a) xn → x ∈ N(q) s.t. Gx = Gq;

(b) V(∞, i, j) = ∞ ⟺ {i, j} is an edge of Gq.

Theorem 7.1 is an obvious consequence of the following Proposition 7.1. Given G = (S, ∼), assume that (P)G holds: then each connected component of G contains either only a single state or only a single signal. Let π := πG : S → S be the function mapping i ∈ S to the single state/signal in the same connected component as i, with the convention that we choose the state if the component consists of exactly one state and one signal.

For all i ∈ S, n ∈ N and ε > 0, let us define
$$\alpha^n_i:=x^n_i/x^n_{\pi(i)},$$
$$H^1_n:=\bigcap_{i\in S,\,i=\pi(i)}\{V^n_i\geq 2\varepsilon n\},\qquad H^2_n:=\bigcap_{i\in S}\{\alpha^n_i\geq\varepsilon\},\qquad H^3_n:=\bigcap_{i,j\in S,\,\pi(i)\neq\pi(j)}\{V^n_{ij}\leq\sqrt n\}.$$

Proposition 7.1. Let G be such that (P)G holds, and let π := πG. For all ε ∈ (0, 1/M1), if H1n, H2n and H3n hold and n ≥ Cst(ε, M1, M2), then, with lower bounded probability (only depending on ε, M1 and M2), for all i, j ∈ S and k ≥ n,
$$V^\infty_{ij}=V^n_{ij},\quad\text{when }\pi(i)\neq\pi(j);\tag{33}$$
$$\alpha^k_i/\alpha^n_i\in(1-\varepsilon,1+\varepsilon);\tag{34}$$
$$V^k_i\geq\varepsilon k,\quad\text{when }\pi(i)=i.\tag{35}$$

In the remainder of this section, we fix the graph (G, ∼) (and thus π = πG) and ε > 0. The proof consists of the following Lemmas 7.1–7.3.

Let, for all i, j ∈ S and n ∈ N,
$$\tau^{1,i,j}_n:=\inf\{k\geq n:\ V^k_{ij}\neq V^n_{ij}\};\qquad\tau^{2,i}_n:=\inf\{k\geq n:\ \alpha^k_i/\alpha^n_i\notin(1-\varepsilon,1+\varepsilon)\};\qquad\tau^{3,i}_n:=\inf\{k\geq n:\ V^k_i<\varepsilon k\},$$
and let
$$\tau^1_n:=\inf_{i,j\in S,\,\pi(i)\neq\pi(j)}\tau^{1,i,j}_n,\quad\tau^2_n:=\inf_{i\in S}\tau^{2,i}_n,\quad\tau^3_n:=\inf_{i\in S,\,\pi(i)=i}\tau^{3,i}_n,\quad\tau_n:=\tau^1_n\wedge\tau^2_n\wedge\tau^3_n.$$

Lemma 7.1. If n ≥ Cst(ε, M1, M2), then
$$P\left(\tau^1_n\geq\tau^2_n\wedge\tau^3_n\mid\mathcal F_n,H^1_n,H^2_n,H^3_n\right)\geq\exp\left(-2\varepsilon^{-4}M_1M_2\right).$$

Proof. Assuming n ≥ Cst(ε, M1, M2),
$$P\left(\tau^1_n\geq\tau^2_n\wedge\tau^3_n\mid\mathcal F_n,H^1_n,H^2_n,H^3_n\right)\geq\prod_{k\geq n}\left(1-\sum_{\pi(i)\neq\pi(j)}\frac{(V^n_{ij})^2}{V^k_iV^k_j}\right)\geq\exp\left(-\frac32\sum_{\pi(i)\neq\pi(j),\,k\geq n}\frac{n}{\varepsilon^4k^2}\right)\geq\exp\left(-2\varepsilon^{-4}M_1M_2\right).$$

Lemma 7.2. If n ≥ Cst(ε, M1, M2) then, for all i ∈ S,
$$P\left(\tau^{2,i}_n\geq\tau^1_n\wedge\tau^3_n\mid\mathcal F_n,H^1_n,H^2_n,H^3_n\right)\geq1-2\exp(-\mathrm{Cst}(\varepsilon)n).$$

Proof. Fix i ∈ S, n ∈ N, and assume w.l.o.g. that π(i) ≠ i. Let, for all j ∈ S and k ≥ n,
$$\widetilde V^k_j:=V^n_j+\sum_{l\sim j}(V^k_{jl}-V^n_{jl}),$$
which is equal to $V^k_j$ as long as $k<\tau^1_n$. Let, for all k ≥ n,
$$W_k:=\log\frac{\widetilde V^k_i}{\widetilde V^k_{\pi(i)}},$$
and let us consider the Doob decomposition of (Wk)k≥n:
$$W_k=W_n+\Delta_k+\Psi_k,\qquad\Delta_k:=\sum_{j=n+1}^{k}E(W_j-W_{j-1}\mid\mathcal F_{j-1}).$$
Assume that H1n, H2n and H3n hold, and that k < τn: then
$$|\Delta_{k+1}-\Delta_k|=\left|E[W_{k+1}-W_k\mid\mathcal F_k]\right|=\left|\frac{1}{M_1}\left(\frac{V^k_{i\pi(i)}}{V^k_i}\right)^2\frac{1}{V^k_{\pi(i)}}\left(1+O((V^k_i)^{-1})\right)-\frac{1}{M_1}\sum_{j\sim\pi(i)}\frac{V^k_{\pi(i)j}}{V^k_j}\,\frac{V^k_{\pi(i)j}}{(V^k_{\pi(i)})^2}\left(1+O((V^k_{\pi(i)})^{-1})\right)\right|$$
$$=\frac{1}{M_1V^k_{\pi(i)}}\left|1+k^{-1/2}O(\mathrm{Cst}(\varepsilon,M_1,M_2))-\sum_{j\sim\pi(i)}\left(1+k^{-1/2}O(\mathrm{Cst}(\varepsilon,M_1,M_2))\right)\frac{V^k_{\pi(i)j}}{V^k_{\pi(i)}}\right|=k^{-3/2}\,O(\mathrm{Cst}(\varepsilon,M_1,M_2)),$$
where we use that, for all j ∼ π(i),
$$\left|\frac{V^k_{\pi(i)j}}{V^k_j}-1\right|,\ \left|\sum_{l\sim\pi(i)}\frac{V^k_{\pi(i)l}}{V^k_{\pi(i)}}-1\right|\leq k^{-1/2}\,\mathrm{Cst}(\varepsilon,M_1,M_2).\tag{36}$$
Therefore, for all k ≥ n,
$$|\Delta_k|\leq n^{-1/2}\,\mathrm{Cst}(\varepsilon,M_1,M_2).$$
Let us now estimate the martingale increment: $|\Psi_{k+1}-\Psi_k|\leq\mathrm{Cst}(\varepsilon)k^{-1}$ (since $|W_{k+1}-W_k|\leq\mathrm{Cst}(\varepsilon)k^{-1}$), so that Lemma 7.4 implies
$$P\left(\sup_{k\geq n}|\Psi_k-\Psi_n|\,\mathbf{1}_{\{k\leq\tau_n\}}\leq\varepsilon/2\right)\geq1-2\exp(-\mathrm{Cst}(\varepsilon)n),$$
which completes the proof.

Lemma 7.3. If ε ∈ (0, 1/M1) and n ≥ Cst(ε, M1, M2) then, for all i ∈ S such that π(i) = i,
$$P\left(\tau^{3,i}_n\geq\tau^1_n\wedge\tau^2_n\mid\mathcal F_n,H^1_n,H^2_n\right)\geq1-2\exp(-\mathrm{Cst}(\varepsilon)n).$$

Proof. Let n ∈ N, assume that H1n, H2n and H3n hold, and fix i ∈ S such that π(i) = i. Let us consider the Doob decomposition of (Vki)k≥n:
$$V^k_i:=V^n_i+\Phi_k+\Xi_k,\qquad\Phi_k:=\sum_{j=n+1}^{k}E\left(V^j_i-V^{j-1}_i\mid\mathcal F_{j-1}\right).$$
Now, for all η > 0, if n ≥ Cst(η, ε) and k < τn, (36) implies
$$\Phi_{k+1}-\Phi_k=E\left(V^{k+1}_i-V^k_i\mid\mathcal F_k\right)\geq\frac{1}{M_1}\sum_{j\sim i}\frac{(V^k_{ij})^2}{V^k_iV^k_j}\geq\frac{1}{M_1}-\eta.$$
Let us now estimate the martingale increment: let, for all p ≥ n,
$$\chi_p:=\sum_{k=n}^{p-1}\frac{\Xi_{k+1}-\Xi_k}{k}.$$
Then, for all p ≥ n,
$$\Xi_p=\sum_{n\leq k\leq p-1}(\chi_{k+1}-\chi_k)k=-\sum_{n\leq k\leq p-1}\chi_k+(p-1)\chi_p.$$
This implies, using Lemma 7.4 (and $|\Xi_{k+1}-\Xi_k|\leq1$ for all k ≥ n), that for all ε > 0,
$$P\left(\forall k\geq n,\ V^k_i\geq(2\varepsilon-\eta)n+(k-n)(1/M_1-\eta)\mid\mathcal F_n\right)\geq P\left(\sup_{p\geq n}\left|\frac{\Xi_p}{p}\right|\leq\eta\mid\mathcal F_n\right)\geq P\left(\sup_{k\geq n}|\chi_k|\leq\frac{\eta}{2}\mid\mathcal F_n\right)\geq1-2\exp(-\mathrm{Cst}(\eta)n);$$
we choose η = min(ε/2, 1/M1 − ε), which completes the proof.

Lemma 7.4. Let (γk)k∈N be a deterministic sequence of positive reals, let G := (Gn)n∈N be a filtration, and let (Mn)n∈N be a G-adapted martingale such that |Mn+1 − Mn| ≤ γn for all n ∈ N. Then, for all n ∈ N,
$$P\left(\sup_{k\geq n}(M_k-M_n)\geq\lambda\mid\mathcal G_n\right)\leq\exp\left(-\frac{\lambda^2}{2\sum_{k\geq n}\gamma_k^2}\right).$$

Proof. Let, for all θ ∈ R,
$$Z_n(\theta):=\exp\left(\theta M_n-\frac{\theta^2}{2}\sum_{k=1}^{n}\gamma_k^2\right).$$
Then (Zn(θ))n∈N is a supermartingale, so that
$$P\left(\sup_{k\geq n}(M_k-M_n)\geq\lambda\mid\mathcal G_n\right)\leq P\left(\sup_{k\geq n}Z_k(\theta)\geq Z_n(\theta)\exp\left(\lambda\theta-\frac{\theta^2}{2}\sum_{k\geq n}\gamma_k^2\right)\Big|\,\mathcal G_n\right)\leq\exp\left(\frac{\theta^2}{2}\sum_{k\geq n}\gamma_k^2-\lambda\theta\right)=\exp\left(-\frac{\lambda^2}{2\sum_{k\geq n}\gamma_k^2}\right)$$
if we choose $\theta:=\lambda/\sum_{k\geq n}\gamma_k^2$.
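Lemma 7.4 can be probed by Monte Carlo for a martingale with ±γk increments (a fair random-sign walk, taking n = 0); the horizon, weights and threshold below are illustrative choices of ours, and the empirical frequency should fall below the exponential bound:

```python
import math
import random

random.seed(7)

# Monte Carlo check of the maximal bound of Lemma 7.4 with n = 0:
# M_k = sum_{j<k} gamma_j * xi_j with xi_j = +-1 fair coin flips, over a
# finite horizon (gamma_k = 0 beyond it). All parameters are illustrative.
horizon = 200
gammas = [1.0 / (j + 10) for j in range(horizon)]
lam = 0.25
bound = math.exp(-lam ** 2 / (2 * sum(g ** 2 for g in gammas)))

trials = 2000
hits = 0
for _ in range(trials):
    M = 0.0
    sup = 0.0
    for g in gammas:
        M += g if random.random() < 0.5 else -g
        sup = max(sup, M)
    if sup >= lam:
        hits += 1
freq = hits / trials   # empirical P(sup_k M_k >= lambda)
```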

References

[1] R. Argiento, R. Pemantle, B. Skyrms, and S. Volkov. Learning to signal: Analysis of a micro-level reinforcement model. Stochastic Processes and their Applications, 119(2):373–390, 2009.

[2] A. W. Beggs. On the convergence of reinforcement learning. Journal of Economic Theory, 122:1–36, 2005.

[3] M. Benaïm. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités XXXIII, volume 1709 of Lecture Notes in Mathematics. Springer-Verlag, 1999.

[4] R. Durrett. Probability: Theory and Examples. Duxbury Press, Belmont, CA, third edition, 2004.

[5] I. Erev and A. E. Roth. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. The American Economic Review, 88:848–881, 1998.

[6] D. Fudenberg and D. K. Levine. The Theory of Learning in Games. Cambridge: MIT Press, 1998.

[7] R. J. Herrnstein. On the law of effect. Journal of the Experimental Analysis of Behavior, 15:245–266, 1970.

[8] J. Hofbauer and S. Huttegger. Feasibility of communication in binary signaling games. Journal of Theoretical Biology, 254:842–849, 2008.

[9] J. Hofbauer and K. Sigmund. Evolutionary Games and Population Dynamics. Cambridge: Cambridge University Press, 1998.

[10] E. Hopkins and M. Posch. Attainability of boundary points under reinforcement learning. Games and Economic Behavior, 53:110–125, 2005.

[11] S. Huttegger. Evolution and the explanation of meaning. Philosophy of Science, 74:1–27, 2007.

[12] S. Huttegger and K. Zollman. Signaling games: dynamics of evolution and learning. To appear in Language, Games, and Evolution, 2010.

[13] D. Lewis. Convention: A Philosophical Study. Harvard: Harvard University Press, 1969.

[14] V. Losert and E. Akin. Dynamics of games and genes: Discrete versus continuous time. Journal of Mathematical Biology, 17:241–251, 1983.

[15] J. Maynard Smith. Evolution and the Theory of Games. Cambridge: Cambridge University Press, 1982.

[16] C. Pawlowitsch. Why evolution does not always lead to an optimal signaling system. Games and Economic Behavior, 63:203–226, 2008.

[17] R. Pemantle. Random processes with reinforcement. Massachusetts Institute of Technology doctoral dissertation, 1988.

[18] R. Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. Annals of Probability, 18:698–712, 1990.

[19] A. E. Roth and I. Erev. Learning in extensive-form games: Experimental data and simple dynamic models in the intermediate term. Games and Economic Behavior, 8:164–212, 1995.

[20] B. Skyrms. Signals: Evolution, Learning, and Information. Oxford: Oxford University Press, 2010.

[21] P. Tarrès. Pièges répulsifs. C. R. Acad. Sci. Paris Sér. I Math., 330:125–130, 2000.

[22] P. Tarrès. Vertex-reinforced random walk on Z eventually gets stuck on five points. Annals of Probability, 32(3B):2650–2701, 2004.

[23] P. D. Taylor and L. Jonker. Evolutionarily stable strategies and game dynamics. Mathematical Biosciences, 40:145–156, 1978.

[24] P. Trapa and M. Nowak. Nash equilibria for an evolutionary language game. Journal of Mathematical Biology, 41(2):172–188, 2000.

[25] E. C. Zeeman. Population dynamics from game theory. Proc. Int. Conf. Global Theory of Dynamical Systems, Northwestern, 1979.