
PROBABILITY THEORY AND STATISTICS

V. Tassion

D-ITET Spring 2020 (updated: May 24, 2020)


Introduction

Some questions you may ask

What is probability?

→ A mathematical language describing systems involving randomness.

Where are probabilities used?

→ To describe random experiments in the real world (coin flip, dice rolling, arrival times of customers in a shop, the weather in one year, ...).

→ To express uncertainty. For example, when a machine performs a measurement, the value is rarely exact. One may use probability theory in this context by saying that the value obtained is equal to the real value plus a small random error.

→ For decision making. Probability theory can be used to describe a system when only part of the information is known. In such a context, it may help to make a decision.

→ For randomized algorithms in computer science. Sometimes, it is more efficient to add some randomness to an algorithm. Examples: Google web search, ants searching for food.

→ To simplify complex systems. Examples: water molecules in water, or cars on a highway.

The notion of probabilistic model

If one wants a precise physical description of a coin flip, one needs a lot (really!) of information: the exact position of the fingers and the coin, the initial velocity, the initial angular velocity, imperfections of the coin, the surface characteristics of the table, air currents, the brain activity of the gambler... These parameters are almost impossible to measure precisely, and a tiny change in one of them may completely change the result. For practical use, it is very convenient to use a probabilistic description, which simplifies the system a lot: we are only interested in the result, namely the set of possible outcomes.


For the coin flip, the probabilistic model is given by 2 possible outcomes (head and tail), and each outcome has probability p_head = p_tail = 1/2 to be realized. In other words,

Coin flip = {head, tail}, p_head = 1/2, p_tail = 1/2.

A surprising analysis: coin flips are not fair! If one tosses a coin, it has more chance to fall on the same face as its initial face! See the YouTube video of Persi Diaconis: “How random is a coin toss? - Numberphile”.

Probability laws: randomness vs ordering

If one performs a single random experiment (for example a coin flip), the result is unpredictable. In contrast, when one performs many random experiments, some general laws can be observed. For example, if one tosses 10000 independent coins, one should generally observe approximately 5000 heads and 5000 tails: this is an instance of a fundamental probability law, called the law of large numbers. One goal of probability theory is to describe how ordering can emerge out of many random experiments, and to establish some general probability laws.
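This emergence of order from many repetitions is easy to observe on a computer. The sketch below (in Python; the function name, seed, and sample sizes are our own choices, not part of the notes) tosses n fair coins and prints the observed fraction of heads, which approaches 1/2 as n grows.

```python
import random

def fraction_of_heads(n, seed=0):
    """Toss n independent fair coins; return the fraction of heads observed."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    heads = sum(rng.randint(0, 1) for _ in range(n))
    return heads / n

for n in (100, 10_000, 1_000_000):
    print(n, fraction_of_heads(n))
```

For n = 10000 the fraction typically deviates from 1/2 by only a couple of percent, in line with the law of large numbers.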


Chapter 1

Mathematical framework

The probabilistic model for the die

When one throws a cubic die (with six faces), we know that half of the time, the upper face of the die shows an even number: 2, 4, or 6. One would like to give a precise mathematical meaning to this phenomenon. More precisely, our goal is to give a rigorous sense to

P[“the upper face of the die is even”] = 1/2. (1.1)

First, we have to give a sense to “the upper face of the die is even”, and then to the notation P[·], which represents the probability. A first goal of the chapter is to introduce the notion of probability space, used to model general random experiments. It will be given by a triple (Ω,F,P), where Ω is called the sample space, F the set of events, and P the probability measure. Before giving the general definitions, we construct a first example of a probability space in the simple case of the die.

Step 1: Determine the set of possible outcomes Ω.
In probability, we are not interested in the exact protocol of the experiment; we focus on its result. The first thing we identify is the set of possible outcomes, called the sample space and denoted by Ω. For the throw of a die, we have

Ω = {1,2,3,4,5,6}.

Step 2: Give a mathematical sense to “the die is even”.
We want to evaluate the probability of certain “observations”, for example that the upper face of the die is even. These observations are formalized by the notion of event:

An event is given by a subset A ⊂ Ω.


We write F for the set of events. Here, we have F = P(Ω).

Examples of events:

A = {2,4,6}, “the die is even”
B = {1,2,3}. “the die is smaller than or equal to 3”

Set operations on events:

The intersection corresponds to the logic operation and:

C := A ∩ B = {2}. “the die is even and smaller than or equal to 3”

The union corresponds to the logic operation or:

D := A ∪ B = {1,2,3,4,6}. “the die is even or smaller than or equal to 3”

The complement corresponds to the logic operation not:

E := Aᶜ = {1,3,5}. “the die is not even”

Step 3: Give a mathematical sense to the probability P.
We want to associate to each event A a probability P[A] ∈ [0,1]. This number represents the “frequency” at which the event A occurs if one performs many experiments. Before giving the general formula, let us look at some examples.

If one performs many throws of the die, the event A = {1} (“the die is equal to 1”) will typically occur one time over six. Hence, it is natural to define P[A] = 1/6. The event A = {1,2,3} will typically occur half of the time, and we define P[A] = 1/2 in this case. More generally, we use the following definition:

The probability measure is the function P : F → [0,1] defined by

∀A ∈ F, P[A] = |A|/6.

For example, the probability of the event A = {2,4,6} (“the die is even”) is

P[A] = |{2,4,6}|/6 = 3/6 = 1/2.

→ The equation above is the mathematical way to express Equation (1.1).

Notice that the probability measure P satisfies the following properties:

• P[Ω] = 1,

• P[A ∪B] = P[A] + P[B] if A and B are disjoint.

Summary: the probabilistic model for the die

Sample space: Ω = {1,2,3,4,5,6}

Set of events: F = P(Ω)

Probability measure: P : F → [0,1], A ↦ |A|/6
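The three ingredients of this summary translate directly into a few lines of code. Below is a minimal sketch (the identifiers OMEGA, EVENTS and prob are ours) that builds the whole model and checks the two properties stated above; exact arithmetic with Fraction avoids floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = (1, 2, 3, 4, 5, 6)  # sample space of the die

# F = P(Omega): the collection of all subsets of the sample space
EVENTS = [frozenset(c) for r in range(len(OMEGA) + 1)
          for c in combinations(OMEGA, r)]

def prob(A):
    """Probability measure of the die model: P[A] = |A| / 6."""
    return Fraction(len(frozenset(A)), len(OMEGA))

assert len(EVENTS) == 2 ** 6             # 64 events in total
assert prob(OMEGA) == 1                  # P[Omega] = 1
A, B = {2, 4, 6}, {1, 3}                 # two disjoint events
assert prob(A | B) == prob(A) + prob(B)  # additivity on disjoint events
print(prob({2, 4, 6}))  # 1/2
```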


1.1 Sample spaces and events (textbook p. 1–4)

Sample space

We want to model a random experiment. The first mathematical object needed is the set of all possible outcomes of the experiment, denoted by Ω.

Definition 1.1. The set Ω is called the sample space.
An element ω ∈ Ω is called an outcome (or elementary experiment).

Example 1: Coin toss

Ω = {head, tail} or Ω = {0,1}.

Example 2: Throw of a die

Ω = {1,2,3,4,5,6}.

Example 3: Throw of three dice

Ω = {1,2,3,4,5,6}³.

In this case an outcome ω ∈ Ω has three components ω = (ω1, ω2, ω3), and ωi represents the value of the i-th die.

Example 4: Throw of infinitely many dice

Ω = {1,2,3,4,5,6}^N.

In this case an outcome ω ∈ Ω has infinitely many components ω = (ω1, ω2, ...), and ωi represents the value of the i-th die.

Example 5: Number of customers visiting a shop in one day

Ω = N (= {0,1,...}).

Example 6: Droplet of water falling on a square

Ω = [0,1]².


Events

Our goal is to measure the probability that the outcome of the experiment satisfies a certain property. Mathematically, such a property can be represented by the set of all the outcomes that satisfy it. For example, for the throw of a die, the property “the upper face of the die is even” corresponds to the subset

{2,4,6} ⊂ {1,2,3,4,5,6}.

A priori, an event could be any subset A ⊂ Ω. This works very well if the sample space Ω is finite or countable. But for more complicated (uncountable) sample spaces (e.g. [0,1]), one needs to impose some conditions on the events. This is due to the fact that later, we want to be able to estimate the probability of the events, and this is possible only if the events are sufficiently “nice”. For this course, this technicality is not important, but it explains why the following definition may look complicated at first sight.

Definition 1.2. We write F for the set of all the events. It is given by a subset of P(Ω) satisfying the following hypotheses:

H1. Ω ∈ F.

H2. A ∈ F ⇒ Aᶜ ∈ F. “if A is an event, ‘not A’ is also an event”

H3. A1,A2,... ∈ F ⇒ ⋃i≥1 Ai ∈ F. “if A1,A2,... are events, then ‘A1 or A2 or ...’ is an event”

Remark 1.3. A set F ⊂ P(Ω) satisfying H1, H2 and H3 above is called a σ-algebra.

Example 1: Throw of a die
The event corresponding to “the upper face of the die is even” is mathematically defined by

A = {2,4,6}.

Example 2: Droplet falling on a square
The event corresponding to “the droplet falls on the upper part of the square” is mathematically defined by

A = [0,1] × [1/2,1].

In general, an event can be defined in two ways

→ one can list all the elements of the event, as in the two examples above.

→ one can define it as the set of ω ∈ Ω satisfying a certain property. For example, for the throw of a die, the event “the upper face is smaller than or equal to 3” can be defined by

A = {ω ∈ Ω : ω ≤ 3}.
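Both ways of defining an event have direct counterparts in code: listing the elements, or filtering Ω by the property. A small sketch (variable names are ours):

```python
OMEGA = {1, 2, 3, 4, 5, 6}

# First way: list the elements explicitly
A_even = {2, 4, 6}

# Second way: A = {w in Omega : w <= 3}, via a set comprehension
A_small = {w for w in OMEGA if w <= 3}

print(sorted(A_small))           # [1, 2, 3]
print(sorted(A_even & A_small))  # the event "even and <= 3"
```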


Terminology

Let ω ∈ Ω (a possible outcome), and let A be an event.

We say that the event A occurs (for ω) if ω ∈ A, and that it does not occur if ω ∉ A.

Remark 1.4. The event A = ∅ never occurs. “we never have ω ∈ ∅”
The event A = Ω always occurs. “we always have ω ∈ Ω”

Operations on events and interpretation

Since events are defined as subsets of Ω, one can use operations from set theory (union, intersection, complement, symmetric difference, ...). From the definition, we know that we can take the complement of an event (by H2), or a countable union of events (by H3). The following proposition asserts that the other standard set operations are allowed.

Proposition 1.5 (Consequences of the definition). We have

i. ∅ ∈ F,

ii. A1,A2,... ∈ F ⇒ ⋂i≥1 Ai ∈ F,

iii. A,B ∈ F ⇒ A ∪ B ∈ F,

iv. A,B ∈ F ⇒ A ∩ B ∈ F.

Proof. We prove the items one after the other.

i. By H1 in Definition 1.2, we have Ω ∈ F. Hence, by H2 in Definition 1.2, we have

∅ = Ωᶜ ∈ F.

ii. Let A1,A2,... ∈ F. By H2, we also have A1ᶜ,A2ᶜ,... ∈ F. Then, by H3, we have ⋃i≥1 Aiᶜ ∈ F. Finally, using H2 again, we conclude that

⋂i≥1 Ai = (⋃i≥1 Aiᶜ)ᶜ ∈ F.


iii. Let A,B ∈ F. Define A1 = A, A2 = B, and Ai = ∅ for every i ≥ 3. By H3, we have

A ∪ B = ⋃i≥1 Ai ∈ F.

iv. Let A,B ∈ F. By H2, Aᶜ,Bᶜ ∈ F. Then, by iii. above, we have Aᶜ ∪ Bᶜ ∈ F. Finally, by H2, we deduce

A ∩ B = (Aᶜ ∪ Bᶜ)ᶜ ∈ F.
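The De Morgan identities used in the proofs of items ii and iv can be checked mechanically on a finite sample space; a quick sketch with the die (helper names are ours):

```python
OMEGA = frozenset({1, 2, 3, 4, 5, 6})

def complement(A):
    """A^c relative to the sample space OMEGA."""
    return OMEGA - A

A = frozenset({2, 4, 6})
B = frozenset({1, 2, 3})

# A n B = (A^c u B^c)^c, as in the proof of item iv
assert A & B == complement(complement(A) | complement(B))
# symmetrically, A u B = (A^c n B^c)^c
assert A | B == complement(complement(A) & complement(B))
print("De Morgan identities hold for", sorted(A), "and", sorted(B))
```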

Let A,B ∈ F. “A and B are two events”

Event — Probabilistic interpretation

Aᶜ — A does not occur
A ∩ B — A and B occur
A ∪ B — A or B occurs
A∆B — one and only one of A or B occurs


Relations between events and interpretations

Relation — Probabilistic interpretation

A ⊂ B — if A occurs, then B occurs
A ∩ B = ∅ — A and B cannot occur at the same time
Ω = A1 ∪ A2 ∪ A3 with A1,A2,A3 pairwise disjoint — for each outcome ω, one and only one of the events A1, A2, A3 occurs

1.2 Mathematical definition of probability spaces

Motivation: frequencies of events in random experiments

As previously mentioned, in the random experiments that we consider, if we look at the occurrence of a certain event, a stable relative frequency appears after repeating the experiment many times (for instance: when throwing a needle onto a parquet floor, “the needle intersects two laths”). Hence, we introduce

n = how many times the experiment is repeated,

and

nA = the number of occurrences of the event A.

Assuming that n is very large (e.g. n = 1000000000), we want to say that the probability of A corresponds to the proportion of times the event A occurs. Hence, we would like to define

P[A] ≃ nA/n.

In other words, we assign a number P[A] ∈ [0,1], “the probability of A”, to every event A. Let us take this definition for granted and look at the properties satisfied by P.

• By definition, one has nΩ = n, since the event A = Ω (sample space) always occurs. We will thus assume that

P[Ω] = 1.


• In the case when A = A1 ∪ A2 ∪ ⋯ ∪ Ak and the Ai are pairwise disjoint, one has

nA = nA1 + nA2 + ⋯ + nAk.

Dividing by n, we obtain

nA/n = nA1/n + nA2/n + ⋯ + nAk/n.

Hence, the probability should satisfy

P[A] = P[A1] + ⋯ + P[Ak], if A = A1 ∪ A2 ∪ ⋯ ∪ Ak (disjoint union).
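This frequency heuristic can be observed directly. The sketch below (the helper name and the choice n = 100000 are ours) repeats a die throw n times and compares the relative frequency nA/n of an event with its model probability |A|/6.

```python
import random

def relative_frequency(A, n, seed=0):
    """Fraction of n die throws whose outcome falls in the event A."""
    rng = random.Random(seed)
    n_A = sum(rng.randint(1, 6) in A for _ in range(n))
    return n_A / n

A = {2, 4, 6}  # "the die is even"
f = relative_frequency(A, 100_000)
print(f, "vs model probability", len(A) / 6)
```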

Probability measure and probability space

Definition 1.6. Let Ω be a sample space, and let F be a set of events. A probability measure on (Ω,F) is a map

P : F → [0,1]
A ↦ P[A]

that satisfies the following two properties:

P1. P[Ω] = 1.

P2. (countable additivity) P[A] = ∑i≥1 P[Ai] if A = ⋃i≥1 Ai (disjoint union).

“A probability measure is a map that associates to each event a number in [0,1].”

Definition 1.7. Let Ω be a sample space, F a set of events, and P a probability measure. The triple (Ω,F,P) is called a probability space.

To summarize, if one wants to construct a probabilistic model, one gives

• a sample space Ω, “all the possible outcomes of the experiment”

• a set of events F ⊂ P(Ω), “the set of possible observations”

• a probability measure P. “gives a number in [0,1] to every event”

Example 1: Throw of three dice

• Ω = {1,2,3,4,5,6}³, an outcome is ω = (ω1, ω2, ω3), where ωi is the value shown by die i,

• F = P(Ω), any subset A ⊂ Ω is an event,

• for A ∈ F, P[A] = |A|/|Ω| (|A| = number of elements in A).
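Because F = P(Ω) and P[A] = |A|/|Ω|, any probability in this model can be computed by brute-force enumeration. As an illustration (the event chosen here is our own example, not one from the notes), the probability that the three dice show three pairwise distinct values:

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=3))  # 6^3 = 216 outcomes

# Event: the three dice show pairwise distinct values
A = [w for w in OMEGA if len(set(w)) == 3]

p = Fraction(len(A), len(OMEGA))
print(p)  # 5/9, since 6*5*4 = 120 favorable outcomes out of 216
```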


Example 2: First time we see “tail” with a biased coin
We throw a biased coin multiple times; at each throw, the coin falls on head with probability p, and it falls on tail with probability 1 − p (p is a fixed parameter in [0,1]). We stop at the first time we see a tail. The probability that we stop exactly at time k is given by

p_k = p^(k−1)(1 − p).

(Indeed, we stop at time k if we have seen exactly k − 1 heads followed by 1 tail.)

For this experiment, one possible probability space is given by

• Ω = N∖{0} = {1,2,3,...},

• F = P(Ω),

• for A ∈ F, P[A] = ∑_{k∈A} p_k.
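One can sanity-check this probability space numerically: the weights p_k must sum to 1 (a geometric series), and P[A] is a sum over A. A sketch with the arbitrary choice p = 0.3:

```python
def p_k(k, p):
    """Probability of stopping exactly at throw k: k-1 heads, then one tail."""
    return p ** (k - 1) * (1 - p)

p = 0.3
# Truncated total mass: equals 1 - p^200, numerically indistinguishable from 1
total = sum(p_k(k, p) for k in range(1, 201))
print(total)

# P[we stop within the first 3 throws] = 1 - p^3
print(sum(p_k(k, p) for k in (1, 2, 3)))
```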

A tempting approach to define a probability measure is to first associate to every ω the probability p_ω that the output of the experiment is ω. Then, for an event A ⊂ Ω, define the probability of A by the formula

P[A] = ∑_{ω∈A} p_ω. (1.2)

This approach works perfectly well when the sample space Ω is finite or countable (this is the case in the two examples above). But it does not work well if Ω is uncountable, for example in the case of the droplet of water in a square Ω = [0,1]². In this case, the probability of landing on a fixed point is always 0, and equation (1.2) does not make sense. It is for this reason that we use an axiomatic definition of probability measure (Definition 1.6).

Example 3: Droplet of water on a square

• Ω = [0,1]²,

• F = Borel σ-algebra¹,

• for A ∈ F, P[A] = Area(A).

¹The Borel σ-algebra F is defined as follows: it contains all the rectangles A = [x1,x2] × [y1,y2], with 0 ≤ x1 ≤ x2 ≤ 1 and 0 ≤ y1 ≤ y2 ≤ 1, and it is the smallest collection of subsets of Ω containing these rectangles which satisfies H1, H2 and H3 in Definition 1.2.


Direct consequences of the definition

Proposition 1.8. Let P be a probability measure on (Ω,F).

i. We have P[∅] = 0.

ii. (additivity) Let k ≥ 1 and let A1,...,Ak be pairwise disjoint events. Then

P[A1 ∪ ⋯ ∪ Ak] = P[A1] + ⋯ + P[Ak].

iii. Let A be an event. Then P[Aᶜ] = 1 − P[A].

iv. If A and B are two events (not necessarily disjoint), then

P[A ∪ B] = P[A] + P[B] − P[A ∩ B].

Proof. We prove the items one after the other.

i. Define x = P[∅]. We already know that x ∈ [0,1], because x is the probability of some event. Defining A1 = A2 = ⋯ = ∅, we have

∅ = ⋃i≥1 Ai.

The events Ai are disjoint, and countable additivity implies

∑i≥1 P[Ai] = P[∅].

Since P[Ai] = x for every i and P[∅] ≤ 1, we have

∑i≥1 x ≤ 1,

and therefore x = 0.

ii. Define Ak+1 = Ak+2 = ⋯ = ∅. In this way we have

A1 ∪ ⋯ ∪ Ak = A1 ∪ ⋯ ∪ Ak ∪ ∅ ∪ ∅ ∪ ⋯ = ⋃i≥1 Ai.


Since the events Ai are pairwise disjoint, one can apply countable additivity as follows:

P[A1 ∪ ⋯ ∪ Ak] = P[⋃i≥1 Ai] = ∑i≥1 P[Ai] = P[A1] + ⋯ + P[Ak] + ∑i>k P[Ai],

and the last sum vanishes, since P[Ai] = P[∅] = 0 for every i > k.

iii. By definition of the complement, we have Ω = A ∪ Aᶜ, and therefore

1 = P[Ω] = P[A ∪ Aᶜ].

Since the two events A, Aᶜ are disjoint, additivity finally gives

1 = P[A] + P[Aᶜ].

iv. A ∪ B is the disjoint union of A with B∖A. Hence, by additivity, we have

P[A ∪ B] = P[A] + P[B∖A]. (1.3)

Also, B = (B ∩ A) ∪ (B ∩ Aᶜ) = (B ∩ A) ∪ (B∖A) is a disjoint union. Hence, by additivity,

P[B] = P[B ∩ A] + P[B∖A],

which gives P[B∖A] = P[B] − P[A ∩ B]. Plugging this identity into Eq. (1.3), we obtain the result.
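Items iii and iv are easy to verify exhaustively in the die model; the sketch below (helper names and the list of sample events are ours) checks them for every pair of events drawn from a few examples, using exact arithmetic to avoid rounding issues.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))

def prob(A):
    """Laplace probability on the die: P[A] = |A| / 6."""
    return Fraction(len(A), len(OMEGA))

events = [frozenset(s) for s in ({2, 4, 6}, {1, 2, 3}, {6}, set(), set(range(1, 7)))]

for A in events:
    # iii. complement rule
    assert prob(OMEGA - A) == 1 - prob(A)
    for B in events:
        # iv. inclusion-exclusion, A and B not necessarily disjoint
        assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
print("Proposition 1.8 (iii) and (iv) verified on", len(events), "events")
```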

Useful Inequalities

Proposition 1.9 (Monotonicity). Let A,B ∈ F , then

A ⊂ B ⇒ P[A] ≤ P[B].

Proof. If A ⊂ B, then we have B = A ∪ (B∖A) (disjoint union). Hence, by additivity, we have

P[B] = P[A] + P[B∖A] ≥ P[A].


A simple application: Consider the “droplet of water falling on a square”. Consider the event En that the droplet falls within Euclidean distance 1/n of the point (1/2,1/2). One can see that En ⊂ [1/2 − 1/n, 1/2 + 1/n]². Therefore,

P[En] ≤ P[[1/2 − 1/n, 1/2 + 1/n]²] = 4/n²,

which implies that P[En] converges to 0 as n tends to infinity.

Proposition 1.10 (Union bound). Let A1,A2,... be a sequence of events (not necessarily disjoint). Then we have

P[⋃i≥1 Ai] ≤ ∑i≥1 P[Ai].

Remark 1.11. The union bound also applies to a finite collection of events.

Proof. For i ≥ 1, define

Ãi = Ai ∩ Ai−1ᶜ ∩ ⋯ ∩ A1ᶜ.

One can check that

⋃i≥1 Ai = ⋃i≥1 Ãi.

(To prove the direct inclusion, consider ω in the left-hand side, and define the smallest i such that ω ∈ Ai. For this i, we have ω ∈ Ãi, which implies that ω belongs to the right-hand side. The other inclusion is clear, because Ãi ⊂ Ai for every i.) Now, one can apply countable additivity to the Ãi, because they are disjoint. We get

P[⋃i≥1 Ai] = P[⋃i≥1 Ãi] = ∑i≥1 P[Ãi] ≤ ∑i≥1 P[Ai],

where the last inequality follows from monotonicity, since Ãi ⊂ Ai for every i.

Application: We throw a die n times. We consider the probability space given by

• Ω = {1,2,3,4,5,6}ⁿ, an outcome is ω = (ω1,...,ωn), where ωi is the value shown by die i,

• F = P(Ω),

• for A ∈ F, P[A] = |A|/|Ω|.


We want to prove that the probability to see more than ℓ := 7 log n successive 1’s is small if n is large (log denotes the natural logarithm). Consider the event A that there exist ℓ successive 1’s. For every 0 ≤ k ≤ n − ℓ, define the event

Ak = {ω : ωk+1 = ωk+2 = ⋯ = ωk+ℓ = 1}. “ℓ successive 1’s between k + 1 and k + ℓ”

Then, we have

A = ⋃_{k=0}^{n−ℓ} Ak.

Using the union bound, we have

P[A] ≤ ∑_{k=0}^{n−ℓ} P[Ak] ≤ n · (1/6)^ℓ = n · n^{−7 log 6} ≤ n^{−11},

and therefore we see that the probability of seeing more than 7 log n consecutive 1’s converges to 0 as n tends to infinity.
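A simulation confirms that runs of 1’s longer than 7 log n essentially never show up. The sketch below (the helper name, seed, and the choice n = 100000 are ours) computes the longest run of 1’s among n simulated throws and compares it with the threshold.

```python
import math
import random

def longest_run_of_ones(n, seed=0):
    """Length of the longest run of consecutive 1's among n fair die throws."""
    rng = random.Random(seed)
    best = current = 0
    for _ in range(n):
        current = current + 1 if rng.randint(1, 6) == 1 else 0
        best = max(best, current)
    return best

n = 100_000
run = longest_run_of_ones(n)
print("longest run:", run, "  threshold 7 log n:", round(7 * math.log(n), 1))
```

The longest run observed is typically around log n / log 6 ≈ 6, far below the crude threshold 7 log n ≈ 80, illustrating how much room the union bound leaves.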

Continuity properties of probability measures

Proposition 1.12. Let (An) be an increasing sequence of events (i.e. An ⊂ An+1 for every n). Then

lim_{n→∞} P[An] = P[⋃n≥1 An]. “increasing limit”

Let (Bn) be a decreasing sequence of events (i.e. Bn ⊃ Bn+1 for every n). Then

lim_{n→∞} P[Bn] = P[⋂n≥1 Bn]. “decreasing limit”

Remark 1.13. By monotonicity, we have P[An] ≤ P[An+1] and P[Bn] ≥ P[Bn+1] for every n. Hence the limits in the proposition are well defined as monotone limits.

Proof. Let (An)n≥1 be an increasing sequence of events. Define Ã1 = A1 and, for every n ≥ 2,

Ãn = An∖An−1.

The events Ãn are disjoint and satisfy

⋃n≥1 Ãn = ⋃n≥1 An and AN = ⋃_{n=1}^N Ãn.


Using first countable additivity and then additivity, we have

P[⋃n≥1 An] = P[⋃n≥1 Ãn] = ∑n≥1 P[Ãn] = lim_{N→∞} ∑_{n=1}^N P[Ãn] = lim_{N→∞} P[AN].

Now, let (Bn) be a decreasing sequence of events. Then (Bnᶜ) is increasing, and we can apply the previous result in the following way:

P[⋂n≥1 Bn] = 1 − P[⋃n≥1 Bnᶜ] = 1 − lim_{n→∞} P[Bnᶜ] = lim_{n→∞} P[Bn].

1.3 Laplace models and counting (textbook p. 21–26)

We now discuss a particular type of probability space that appears in many concrete examples. The sample space Ω is an arbitrary finite set, and all the outcomes have the same probability p_ω = 1/|Ω|.

Definition 1.14. Let Ω be a finite sample space. The Laplace model on Ω is the triple (Ω,F,P), where

• F = P(Ω),

• P : F → [0,1] is defined by

∀A ∈ F, P[A] = |A|/|Ω|.

One can easily check that the mapping P above defines a probability measure in the sense of Definition 1.6. In this context, estimating the probability P[A] boils down to counting the number of elements in A and in Ω.

Example:
We consider n ≥ 3 points on a circle, from which we select 2 at random. What is the probability that the two selected points are neighbors? We consider the Laplace model on

Ω = {E ⊂ {1,2,...,n} : |E| = 2}.


Figure 1.1: A circle with n = 6 points, where the subset {1,3} is selected.

The event “the two selected points are neighbors” is given by

A = {{1,2},{2,3},...,{n−1,n},{n,1}},

and we have

P[A] = |A|/|Ω| = n / C(n,2) = 2/(n−1),

where C(n,2) = n(n−1)/2 is the number of 2-element subsets of {1,...,n}.
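This count can be cross-checked by brute force, enumerating all 2-element subsets for small n; a sketch (function name is ours):

```python
from fractions import Fraction
from itertools import combinations

def prob_neighbors(n):
    """Laplace probability that 2 points chosen among n on a circle are adjacent."""
    omega = list(combinations(range(n), 2))            # all 2-element subsets
    neighbor_pairs = [frozenset({i, (i + 1) % n}) for i in range(n)]
    A = [E for E in omega if frozenset(E) in neighbor_pairs]
    return Fraction(len(A), len(omega))

for n in (3, 4, 6, 10):
    assert prob_neighbors(n) == Fraction(2, n - 1)
print(prob_neighbors(6))  # 2/5
```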

1.4 Random variables and distribution functions (textbook p. 9–12)

Most often, the probabilistic model under consideration is rather complicated, and one is only interested in certain quantities in the model. For this reason, one introduces the notion of random variables.

Definition 1.15. Let (Ω,F,P) be a probability space. A random variable (r.v.) is a map X : Ω → R such that for all a ∈ R,

{ω ∈ Ω : X(ω) ≤ a} ∈ F.

→ The condition {ω ∈ Ω : X(ω) ≤ a} ∈ F is needed for P[{ω ∈ Ω : X(ω) ≤ a}] to be well-defined.

Notation: When events are defined in terms of a random variable, we will omit the dependence on ω. For example, for a ≤ b we write

{X ≤ a} = {ω ∈ Ω : X(ω) ≤ a},
{a < X ≤ b} = {ω ∈ Ω : a < X(ω) ≤ b},
{X ∈ Z} = {ω ∈ Ω : X(ω) ∈ Z}.

When considering the probability of events as above, we omit the brackets and, for example, simply write

P[X ≤ a] = P[{X ≤ a}] = P[{ω ∈ Ω : X(ω) ≤ a}].


Example 1: Gambling with one die
We throw a fair die. The sample space is Ω = {1,2,3,4,5,6} and the associated probability space (Ω,F,P) is the one constructed at the beginning of this chapter. Suppose that we gamble on the outcome in such a way that our profit is

−1 if the outcome is 1, 2 or 3,
0 if the outcome is 4, (1.4)
2 if the outcome is 5 or 6,

where a negative profit corresponds to a loss. Our profit can be represented by the random variable X defined by

∀ω ∈ Ω,
X(ω) = −1 if ω ∈ {1,2,3},
X(ω) = 0 if ω = 4, (1.5)
X(ω) = 2 if ω ∈ {5,6}.

Example 2: First coordinate of a random point
As in Example 3 above (droplet of water on a square), we define

• Ω = [0,1]²,

• F = Borel σ-algebra,

• for A ∈ F, P[A] = Area(A).

→ The map which returns the x-coordinate, defined by

Y : Ω → R, (ω1, ω2) ↦ ω1,

is a random variable. Indeed, for every a ∈ R, we have

{Y ≤ a} = {(ω1, ω2) ∈ Ω : ω1 ≤ a} =
  ∅ if a < 0,
  [0,a] × [0,1] if 0 ≤ a ≤ 1,
  Ω if a > 1,

and ∅, [0,a] × [0,1] and Ω are all elements of F.

Example 3: Area of the lower-left part of a random point
We consider the same probability space as in Example 2 above (Ω = [0,1]²). For ω ∈ Ω, let R(ω) = [0,ω1] × [0,ω2] (the rectangle to the left of and below ω).
→ The map which returns the area of R(ω), defined by

Z : Ω → R, (ω1, ω2) ↦ ω1ω2, (1.6)


is a random variable. To see this, notice that for every a ∈ R, we have

{Z ≤ a} = {ω ∈ Ω : Z(ω) ≤ a} =
  ∅ if a < 0,
  Ea if 0 ≤ a ≤ 1,
  Ω if a > 1,

where Ea = {(x,y) ∈ [0,1]² : xy ≤ a} is represented below.

Figure 1.2: The shaded region represents the set Ea.

The sets ∅, Ω and Ea are elements of F for every a. For completeness, we now give the formal proof that Ea is in F, but this proof is not important for applications and can be skipped on a first reading.

One just needs to check that Ea is in F when 0 < a ≤ 1 (the case a = 0 can easily be treated separately). Let us first prove that

Eaᶜ = ⋃_{q∈Q, 0<q≤1} (q,1] × (a/q,1]

by double inclusion:

⊂ Let ω = (ω1, ω2) ∈ Eaᶜ. Since ω1ω2 > a, one can pick a rational number q in the interval (a/ω2, ω1). For such a choice, we have ω ∈ (q,1] × (a/q,1].

⊃ Assume that ω ∈ (q,1] × (a/q,1] for some rational 0 < q ≤ 1. Then ω1ω2 > q · a/q = a, and therefore ω ∉ Ea.

Now, for every q, the rectangle (q,1] × (a/q,1] is in F. Therefore, Eaᶜ is also in F, since it is a countable union of elements of F. Taking the complement, we conclude that Ea is an event.

Example 4: Indicator function of an event

Let A ∈ F . Consider the indicator function 1A of A, defined by

∀ω ∈ Ω, 1A(ω) = 0 if ω ∉ A, and 1A(ω) = 1 if ω ∈ A.


Then 1A is a random variable. Indeed, we have

{1A ≤ a} =
  ∅ if a < 0,
  Aᶜ if 0 ≤ a < 1,
  Ω if a ≥ 1,

and ∅, Aᶜ and Ω are three elements of F.

Definition 1.16. Let X be a random variable on a probability space (Ω,F,P). The distribution function of X is the function FX : R → [0,1] defined by

∀a ∈ R FX(a) = P[X ≤ a].

Proposition 1.17 (Basic identity). Let a < b be two real numbers. Then

P[a < X ≤ b] = FX(b) − FX(a).

Proof. We have {X ≤ b} = {X ≤ a} ∪ {a < X ≤ b} (disjoint union). Hence

P[X ≤ b] = P[X ≤ a] + P[a < X ≤ b],

which directly implies the result.

Example 1: Gambling on a die
Let X be the random variable defined by Eq. (1.5). For a ∈ R, we have

{X ≤ a} = {ω : X(ω) ≤ a} =
  ∅ if a < −1,
  {1,2,3} if −1 ≤ a < 0,
  {1,2,3,4} if 0 ≤ a < 2,
  {1,2,3,4,5,6} if a ≥ 2.

Hence

FX(a) =
  0 if a < −1,
  1/2 if −1 ≤ a < 0,
  2/3 if 0 ≤ a < 2,
  1 if a ≥ 2.
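The staircase values above can be recomputed directly from the definition FX(a) = P[X ≤ a]; a sketch for this gambling variable (variable names are ours):

```python
from fractions import Fraction

X = {1: -1, 2: -1, 3: -1, 4: 0, 5: 2, 6: 2}  # profit from Eq. (1.5)

def F_X(a):
    """Distribution function F_X(a) = P[X <= a] in the fair-die model."""
    favorable = [w for w in X if X[w] <= a]
    return Fraction(len(favorable), 6)

for a in (-2, -1, 0, 1, 2, 3):
    print(a, F_X(a))
```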

Example 2: First coordinate of a random point
Consider the random variable Y defined in Example 2 above. Its distribution function is given by

FY(a) = P[Y ≤ a] =
  P[∅] if a < 0,
  P[[0,a] × [0,1]] if 0 ≤ a ≤ 1,
  P[Ω] if a > 1.


Figure 1.3: Graph of the distribution function FX.

Hence,

FY(a) = P[Y ≤ a] =
  0 if a < 0,
  a if 0 ≤ a ≤ 1,
  1 if a > 1.

Figure 1.4: Graph of the distribution function FY.

Example 3: Area of the lower-left part of a random point

Figure 1.5: Graph of the distribution function FZ.

Consider the random variable Z defined by Eq. (1.6). For 0 ≤ a ≤ 1, the event

{Z ≤ a} = {(ω1, ω2) ∈ [0,1]² : ω1ω2 ≤ a} = Ea

was represented on Fig. 1.2, and its probability is equal to its area, given by

Area(Ea) = a + ∫_a^1 (a/x) dx = a + a log(1/a)


(with the convention 0·log(1/0) = 0). The event {Z ≤ a} is empty if a < 0, and it is equal to Ω if a > 1. Therefore,

FZ(a) = P[Z ≤ a] =
  0 if a < 0,
  a + a log(1/a) if 0 ≤ a ≤ 1,
  1 if a > 1.

Example 4: Indicator function of an event

Let A be an event, and let X = 1A be the indicator function of A. Then

FX(a) =
  0 if a < 0,
  1 − P[A] if 0 ≤ a < 1,
  1 if a ≥ 1.

Theorem 1.18 (Properties of distribution functions). Let X be a random variable on some probability space (Ω,F,P). The distribution function F = FX : R → [0,1] of X satisfies the following properties.

(i) F is nondecreasing.

(ii) F is right continuous2.

(iii) lim_{a→−∞} F(a) = 0 and lim_{a→+∞} F(a) = 1.

Proof. We first prove (i), then (iii) and finally (ii).

(i) For a ≤ b, we have {X ≤ a} ⊂ {X ≤ b}. Hence, by monotonicity, we have P[X ≤ a] ≤ P[X ≤ b], i.e.

F(a) ≤ F(b).

(iii) Let an ↑ ∞. For every ω ∈ Ω, there exists n large enough such that X(ω) ≤ an. Hence,

Ω = ⋃n≥1 {X ≤ an}.

Furthermore, we have {X ≤ an} ⊂ {X ≤ an+1}, and the continuity properties of probability measures imply

1 = P[Ω] = P[⋃n≥1 {X ≤ an}] = lim_{n→∞} P[X ≤ an] = lim_{n→∞} F(an).

² i.e. F(a) = lim_{h↓0} F(a + h) for every a ∈ R.


In the same way, one has lim_{a→−∞} F(a) = 0. Indeed, using that for every an ↓ −∞,

∅ = ⋂_{n≥1} {X ≤ an}

and {X ≤ an} ⊃ {X ≤ an+1}, it follows from the continuity properties of probability measures that

0 = P[∅] = lim_{n→∞} P[X ≤ an] = lim_{n→∞} F(an).

(ii) Let a ∈ R and let hn ↓ 0. We have

{X ≤ a} = ⋂_{n≥1} {X ≤ a + hn},

where {X ≤ a + hn} ⊃ {X ≤ a + hn+1}. Hence, by the continuity properties of probability measures, we have

F(a) = P[X ≤ a] = P[⋂_{n≥1} {X ≤ a + hn}] = lim_{n→∞} P[X ≤ a + hn] = lim_{n→∞} F(a + hn).

Properties (i), (ii) and (iii) actually characterize distribution functions, in the following sense:

Theorem 1.19. Let F : R → [0,1] satisfy (i), (ii) and (iii). Then there exists a probability space (Ω,F,P) and a random variable X : Ω → R such that F = FX.

Proof. Admitted.

Theorem 1.19 above plays an important role in probability theory. Given a nondecreasing, right continuous function F : R → [0,1] with lim_{−∞} F = 0 and lim_{+∞} F = 1, one can always consider a random variable X on some probability space (Ω,F,P) such that F = FX. This is very useful because it allows one to define a random variable via its distribution function. As soon as we have a function F as above, we have the right to say:

“Let X be a random variable on some probability space (Ω,F,P) with distribution function F.”

The precise choice of the probability space is often not important and we simply say:

“Let X be a r.v. with distribution function F .”


Discontinuity/continuity points of F. We have seen that the distribution function F = FX of a random variable X is always right continuous. What about left continuity? In Example 1 (gambling on a die), we can see that F is not always left continuous, and the following proposition gives an interpretation of the left limit

F(a−) = lim_{h↓0} F(a − h)

at a given point a, for a general distribution function.

Proposition 1.20 (probability of a given value). Let X : Ω → R be a random variable with distribution function F. Then for every a in R we have

P[X = a] = F(a) − F(a−).

We omit the proof, which can easily be obtained using the elementary identity together with the continuity properties of probability measures. We rather insist on the interpretation of this proposition. Fix a ∈ R.

Ü If F is not continuous at the point a, then the “jump size” F(a) − F(a−) is equal to the probability that X = a.

Ü If F is continuous at the point a, then P[X = a] = 0. This is what happens at every point a in Examples 2 and 3.

Discrete and continuous random variables

Two different types of random variables will play an important role in this class:

• the discrete random variables,

• the continuous random variables.

We give here the definitions; these two classes of random variables will be studied in more detail later.

Definition 1.21 (Discrete random variables). A random variable X : Ω → R is said to be discrete if its image

X(Ω) = {x ∈ R : ∃ω ∈ Ω, X(ω) = x}

is at most countable.

In other words, a random variable is said to be discrete if it only takes finitely or countably many values x1, x2, . . .. In this case the probabilistic properties of X are fully described by the numbers

p(x1) = P[X = x1], p(x2) = P[X = x2], . . .

Example: The random variable of Example 1 (defined by Eq. (1.5) on page 18) is discrete: it takes only three possible values x1 = −1, x2 = 0 and x3 = 2.


Definition 1.22 (Continuous random variables). A random variable X : Ω → R is said to be continuous if its distribution function FX can be written as

FX(a) = ∫_{−∞}^{a} f(x) dx for all a in R (1.7)

for some nonnegative function f : R → R+, called the density of X.

To understand the terminology “continuous”, observe that formula (1.7) implies that FX is a continuous function. In particular, by Proposition 1.20, the r.v. X satisfies P[X = x] = 0 for every fixed x. The probability that X is exactly equal to x is 0, but X may have a chance of being very close to x, and f(x)dx represents the probability that X takes a value in the infinitesimal interval [x, x + dx].

Example: The random variables Y and Z of Examples 2 and 3 are continuous.
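Definition 1.22 can be illustrated numerically: integrating a density recovers the distribution function. The sketch below (our helper names; the truncation point −10 and the step count are arbitrary choices, not from the notes) recovers FY(a) = a for the uniform density of Example 2.

```python
def cdf_from_density(f, a, lower=-10.0, steps=100_000):
    # Approximate F(a) = integral of f from -infinity to a by a midpoint
    # Riemann sum, truncating the lower limit at `lower`.
    if a <= lower:
        return 0.0
    h = (a - lower) / steps
    return h * sum(f(lower + (i + 0.5) * h) for i in range(steps))

# Density of Y from Example 2: f(x) = 1 on [0, 1], 0 elsewhere.
def f_uniform(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

print(cdf_from_density(f_uniform, 0.5))  # close to F_Y(0.5) = 0.5
print(cdf_from_density(f_uniform, 2.0))  # close to F_Y(2) = 1
```

The same helper works for any piecewise-continuous density; the accuracy is governed only by the step count.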

1.5 Conditional probabilities (textbook p.12–19)

Consider a random experiment represented by some probability space (Ω,F,P). We may sometimes possess incomplete information about the actual outcome of the experiment without knowing this outcome exactly. For example, if we throw a die and a friend tells us that an even number is showing, then this information affects all our calculations of probabilities. In general, if A and B are two events and we are given that B occurs, the new probability may no longer be P[A]. In this new circumstance, we know that A occurs if and only if A ∩ B occurs, suggesting that the new probability of A is proportional to P[A ∩ B].

Definition 1.23 (Conditional probability). Let (Ω,F,P) be some probability space. Let A, B be two events with P[B] > 0. The conditional probability of A given B is defined by

P[A ∣ B] = P[A ∩ B] / P[B].

Remark 1.24. P[B ∣ B] = 1. Conditionally on B, the event B always occurs.

Example: We consider the probability space (Ω,F,P) corresponding to the throw of one die. Let A = {1,2,3} be the event that the die is smaller than or equal to 3, and let B = {2,4,6} be the event that the die is even. Then

P[A ∣ B] = P[A ∩ B] / P[B] = (1/6) / (1/2) = 1/3.

Proposition 1.25. Let (Ω,F,P) be some probability space. Let B be an event with positive probability. Then P[ ⋅ ∣ B] is a probability measure on Ω.


Proposition 1.26 (Formula of total probability). Let B1,…,Bn be a partition³ of the sample space Ω with P[Bi] > 0 for every 1 ≤ i ≤ n. Then, one has

∀A ∈ F  P[A] = ∑_{i=1}^{n} P[A ∣ Bi] P[Bi].

Proof. Using the distributivity of the intersection, we have

A = A ∩Ω = A ∩ (B1 ∪⋯ ∪Bn) = (A ∩B1) ∪⋯ ∪ (A ∩Bn).

Since the events A ∩Bi are pairwise disjoint, we have

P[A] = P[A ∩B1] +⋯ + P[A ∩Bn].

By definition, we have P[A ∩ Bi] = P[A ∣ Bi] P[Bi] for every i, and using this expression in the equation above, we finally get

P[A] = P[A ∣B1] P[B1] +⋯ + P[A ∣Bn] P[Bn].

The formula above is very useful when we have a random variable X and we want to distinguish according to the different values of X. More precisely, assume that X is a random variable taking the possible values x1, . . . , xn with P[X = xi] > 0 for every i. Then we have a natural partition Ω = {X = x1} ∪ ⋯ ∪ {X = xn}, and the formula of total probability gives

∀A ∈ F  P[A] = P[A ∣ X = x1] P[X = x1] + ⋯ + P[A ∣ X = xn] P[X = xn].

Example: A sender emits 0–1 bits. Due to noise in the communication channel, each bit is wrongly transmitted with probability ε and correctly with probability 1 − ε. The sender emits a 0 with probability p0, and a 1 with probability p1 (= 1 − p0): what is the probability to receive a 1? Consider the random variables:

X = value of the bit sent,
Y = value of the bit received.

The formula of total probability implies that the probability to receive 1 is

P[Y = 1] = P[Y = 1 ∣ X = 1] P[X = 1] + P[Y = 1 ∣ X = 0] P[X = 0] = (1 − ε) p1 + ε p0.
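The computation above fits in a few lines of code. This is a sketch; the function name and the example numbers (p0 = 0.6, ε = 0.1) are illustrative choices, not from the notes.

```python
def prob_receive_one(p0, p1, eps):
    # Total probability over the partition {X = 0}, {X = 1}:
    # P[Y=1] = P[Y=1 | X=1] P[X=1] + P[Y=1 | X=0] P[X=0]
    #        = (1 - eps) * p1 + eps * p0
    return (1 - eps) * p1 + eps * p0

# Illustrative numbers: zeros are sent 60% of the time, 10% error rate.
p = prob_receive_one(p0=0.6, p1=0.4, eps=0.1)
print(p)  # 0.9 * 0.4 + 0.1 * 0.6 = 0.42
```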

One could also consider the converse problem in the previous example: “if one receives a 1, what is the probability that a 1 was sent?” This corresponds to the probability that the source bit was 1 given the received bit, i.e. P[X = 1 ∣ Y = 1]. This leads us to the next formula.

³ i.e. Ω = B1 ∪ ⋯ ∪ Bn and the events are pairwise disjoint.


Proposition 1.27 (Bayes formula). Let B1, . . . , Bn ∈ F be a partition of Ω with P[Bi] > 0 for every i. For every event A with P[A] > 0, we have

∀i = 1, . . . , n  P[Bi ∣ A] = P[A ∣ Bi] P[Bi] / ∑_{j=1}^{n} P[A ∣ Bj] P[Bj].

Typical application: A test is performed in order to diagnose a certain rare disease, which concerns 1/10000 of a population. This test is quite reliable and gives the right answer 99 percent of the time. If a patient has a “positive test” (i.e. the test indicates that he is sick), what is the probability that he is actually sick?

The situation is modeled by setting

Ω = {0,1} × {0,1}

and F = P(Ω). An outcome is a pair ω = (ω1, ω2) representing a patient, where

ω1 = 0 if the patient is healthy, 1 if the patient is sick;
ω2 = 0 if the test is negative, 1 if the test is positive.

We consider the random variables S and T defined by

S(ω) = ω1,  T(ω) = ω2.

With this notation, {S = 1} is the event that the patient is sick, and {T = 1} is the event that the test is positive. From the hypotheses, the information that we have on the probability measure is

P[S = 1] = 1/10000,  P[T = 1 ∣ S = 1] = 99/100,  P[T = 1 ∣ S = 0] = 1/100.

We are looking for the a posteriori probability of being sick, given that the test is positive, i.e.

P[S = 1 ∣ T = 1] = P[T = 1 ∣ S = 1] P[S = 1] / (P[T = 1 ∣ S = 1] P[S = 1] + P[T = 1 ∣ S = 0] P[S = 0])
                = (0.99 × 0.0001) / (0.99 × 0.0001 + 0.01 × 0.9999) ≃ 0.0098.

This result is quite surprising: the probability to be actually sick when the test is positive is very small! What is happening? If one looks at the whole population, there are two types of persons who will have a positive test:

- the healthy individuals with a (wrongly) positive test, who represent roughly one percent of the population;

- the sick individuals with a (correctly) positive test, who represent roughly 1/10000 of the population.

Given that the test is positive, a person is much more likely to be in the first group of individuals.
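The whole computation can be sketched in a few lines (the function and argument names are ours, not notation from the notes):

```python
def posterior_sick(prior, sensitivity, false_positive):
    # Bayes formula: P[S=1 | T=1]
    #   = P[T=1|S=1] P[S=1] / (P[T=1|S=1] P[S=1] + P[T=1|S=0] P[S=0])
    evidence = sensitivity * prior + false_positive * (1 - prior)
    return sensitivity * prior / evidence

p = posterior_sick(prior=1 / 10000, sensitivity=0.99, false_positive=0.01)
print(round(p, 4))  # 0.0098, matching the computation above
```

Changing `prior` to, say, 0.5 shows how strongly the answer depends on the base rate of the disease.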

1.6 Independence

Independence of events

Definition 1.28 (Independence of two events). Let (Ω,F,P) be a probability space. Two events A and B are said to be independent if

P[A ∩ B] = P[A] P[B].

Remark 1.29.

• If P[A] ∈ {0,1}, then A is independent of every event, i.e.

∀B ∈ F  P[A ∩ B] = P[A] P[B].

• If an event A is independent of itself (i.e. P[A ∩ A] = P[A]²), then P[A] ∈ {0,1}.

• A is independent of B if and only if A is independent of B^c.

The concept of independence is fundamental in probability: it corresponds to the intuitive idea that two events do not influence each other, as illustrated in the following proposition.

Proposition 1.30. Let A, B ∈ F be two events with P[A], P[B] > 0. Then the following are equivalent:

(i) P[A ∩ B] = P[A] P[B], i.e. A and B are independent;

(ii) P[A ∣ B] = P[A], i.e. the occurrence of B has no influence on A;

(iii) P[B ∣ A] = P[B], i.e. the occurrence of A has no influence on B.

Proof. Since P[B] > 0 we have

(i) ⇔ (P[A ∩ B]/P[B] = P[A]) ⇔ (P[A ∣ B] = P[A]) ⇔ (ii).

Since (iii) is just the same as (ii) with the roles of A and B reversed, the equivalence (i) ⇔ (iii) can be proved in the same way.

Typical examples of independent events occur when one performs a random experiment successively, as illustrated below with the throw of two dice.

Example: Throw of two independent dice.


We throw two dice independently. This is modeled by the Laplace model on the sample space

Ω = {1,2,3,4,5,6}².

Let X and Y be the two random variables defined by X(ω) = ω1 and Y(ω) = ω2 (X and Y represent the value of the first and second die respectively). Consider the events

A = {X ∈ 2Z},        “the first die is even”
B = {1 + Y ∈ 2Z},    “the second die is odd”
C = {X + Y ≤ 3},     “the sum of the two dice is at most 3”
D = {X ≤ 2, Y ≤ 2},  “both dice are smaller than or equal to 2”

Check that:

• A and B are independent,

• A and C are not independent,

• A and D are independent.
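These three checks can be carried out by brute force over the 36 equally likely outcomes. The sketch below uses exact rational arithmetic; the helper names are ours, not notation from the notes.

```python
from fractions import Fraction
from itertools import product

# Laplace model on Omega = {1,...,6}^2: all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def independent(e1, e2):
    # Two events are independent iff P[A ∩ B] = P[A] P[B].
    return prob(lambda w: e1(w) and e2(w)) == prob(e1) * prob(e2)

A = lambda w: w[0] % 2 == 0            # first die even
B = lambda w: w[1] % 2 == 1            # second die odd
C = lambda w: w[0] + w[1] <= 3         # sum at most 3
D = lambda w: w[0] <= 2 and w[1] <= 2  # both dice at most 2

print(independent(A, B))  # True
print(independent(A, C))  # False
print(independent(A, D))  # True
```

Using `Fraction` avoids any floating-point ambiguity in the equality test.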

Definition 1.31. We say that n events A1,…,An are independent if

∀J ⊂ {1,…,n}  P[⋂_{j∈J} Aj] = ∏_{j∈J} P[Aj].

Remark: Three events A, B and C are independent if the following 4 equations are satisfied (and not only the last one!):

P[A ∩ B] = P[A] P[B],
P[A ∩ C] = P[A] P[C],
P[B ∩ C] = P[B] P[C],
P[A ∩ B ∩ C] = P[A] P[B] P[C].

Example: We use the same notation as in the example preceding Definition 1.31. The events A, B, and D are independent (check this!).

Independence of random variables

Definition 1.32. Let X1, . . . ,Xn be n random variables on some probability space (Ω,F,P). We say that X1, . . . ,Xn are independent if

∀a1, a2, . . . , an ∈ R  P[X1 ≤ a1, . . . ,Xn ≤ an] = P[X1 ≤ a1]⋯P[Xn ≤ an]. (1.8)

Definition 1.33. Let X1,X2, . . . be an infinite sequence of random variables. We say that X1,X2, . . . are independent if X1, . . . ,Xn are independent, for every n.


Example 1: Throw of n dice

Consider the sample space Ω = {1,…,6}ⁿ equipped with the Laplace model (F,P). Define the random variables

Xi : Ω → R, (ω1, . . . , ωn) ↦ ωi.

(Xi represents the value of the i-th die.) Then the random variables X1, . . . ,Xn are independent. To prove this, it suffices to prove Equation (1.8) for a1, . . . , an ∈ {1,2,3,4,5,6}. For such numbers, we have

P[X1 ≤ a1, . . . ,Xn ≤ an] = P[{1,…,a1} × ⋯ × {1,…,an}]
                          = |{1,…,a1} × ⋯ × {1,…,an}| / |Ω|
                          = (a1/6)⋯(an/6)
                          = P[X1 ≤ a1]⋯P[Xn ≤ an].

Example 2: The coordinates of a random point in [0,1]²

Consider the probability space (Ω = [0,1]²,F,P) for the random point in a 1 by 1 square. Let X and Y be the two random variables defined by

X((ω1, ω2)) = ω1 and Y((ω1, ω2)) = ω2,

which correspond to the x-coordinate and y-coordinate of the random point. For every a, b ∈ [0,1] we have

P[X ≤ a, Y ≤ b] = P[[0,a] × [0,b]]
                = Area([0,a] × [0,b])
                = a ⋅ b
                = Area([0,a] × [0,1]) ⋅ Area([0,1] × [0,b])
                = P[X ≤ a] P[Y ≤ b].

The equality P[X ≤ a, Y ≤ b] = P[X ≤ a] P[Y ≤ b] is also true if a ∉ [0,1] or b ∉ [0,1], because at least one of the two events is ∅ or Ω. Hence the two random variables X and Y are independent.

Theorem 1.34. Let F1, . . . , Fn be n distribution functions⁴. Then there exist a probability space (Ω,F,P) and n random variables X1, . . . ,Xn on this probability space such that

• for every i, Xi has distribution function Fi (i.e. ∀a, P[Xi ≤ a] = Fi(a)), and

• X1, . . . ,Xn are independent.

Proof. Admitted.

⁴ i.e. for every i, Fi : R → [0,1] satisfies (i), (ii) and (iii) in Theorem 1.18.


The theorem above is important in the theory because it allows us to work with random variables directly, without defining the probability space (Ω,F,P) precisely. For example, if F and G are two given distribution functions, it allows us to write: “Let X, Y be two independent random variables with distribution functions F and G respectively.”

Definition 1.35. Some random variables X1,X2, . . . are said to be independent and identically distributed (iid) if they are independent and they have the same distribution function, i.e.

∀i, j  FXi = FXj.

(The sequence may be finite or infinite.)


Chapter 2

Discrete distributions

2.1 Preliminaries: infinite sums

Let E be finite or countable. We would like to give a sense to

∑_{x∈E} ax, (2.1)

where (ax)x∈E is a sequence of real numbers.

If E is finite, the sum (2.1) is just a finite sum and is always well defined.

If E is infinite countable, a natural approach is to “approximate” E with finite sets and define the sum as a limit. We write Fn ↑ E if

• for every n, Fn ⊂ E and Fn ⊂ Fn+1,

• E = ⋃_{n∈N} Fn.

The existence of such a sequence (Fn) is guaranteed by the fact that E is countable. For example, if E = Z, we can choose Fn = {−n, . . . , n}. For every n, the finite sum

∑_{x∈Fn} ax

always makes sense. To define the infinite sum (2.1), it is natural to try to take the limit of the finite sums as n tends to infinity:

lim_{n→∞} (∑_{x∈Fn} ax). (2.2)

One problem is that the limit above is not always well defined. For example, if E = Z, Fn = {−n, . . . , n} and ax = (−1)^|x|, we have ∑_{x∈Fn} ax = (−1)ⁿ and therefore the limit in (2.2) does not exist. Another problem is that the limit may depend on the chosen sequence (Fn). For example, as above, if E = Z and ax = (−1)^|x|, the limit exists if one chooses Fn = {−2n, . . . , 2n} or Fn = {−2n − 1, . . . , 2n + 1}, but the two limits do not coincide (it is +1 in the first case and −1 in the second case).
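The two exhaustions mentioned above can be compared numerically (an illustrative sketch; the helper name is ours):

```python
# a_x = (-1)^|x| on E = Z. Partial sums over two different exhaustions Fn ↑ Z.
def partial_sum(lo, hi):
    return sum((-1) ** abs(x) for x in range(lo, hi + 1))

# Fn = {-2n, ..., 2n}: every partial sum equals +1.
sums_even = [partial_sum(-2 * n, 2 * n) for n in range(1, 6)]
# Fn = {-2n-1, ..., 2n+1}: every partial sum equals -1.
sums_odd = [partial_sum(-2 * n - 1, 2 * n + 1) for n in range(1, 6)]

print(sums_even)  # [1, 1, 1, 1, 1]
print(sums_odd)   # [-1, -1, -1, -1, -1]
```

The two sequences of partial sums are constant at +1 and −1 respectively, so the candidate limit in (2.2) really does depend on the chosen exhaustion.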


Definition 2.1 (Sum of nonnegative numbers). Let (ax)x∈E be a sequence of nonnegative numbers (i.e. ax ≥ 0 for every x). The sum of the ax is defined by

∑_{x∈E} ax := sup_{F⊂E, F finite} ∑_{x∈F} ax.

Remark 2.2.

• The sum can be infinite, for example if E is infinite and ax = 1 for every x.

• For every sequence Fn ↑ E, we can check that the limit in (2.2) makes sense and

lim_{n→∞} (∑_{x∈Fn} ax) = ∑_{x∈E} ax.

“The sum of nonnegative numbers is always well defined, it can be finite or infinite”

When E = N, we can use the notation

∑_{x∈N} ax = ∑_{x=0}^{∞} ax,

and the sum corresponds to the increasing limit

∑_{x∈N} ax = lim_{n→∞} (∑_{x=0}^{n} ax).

Definition 2.3 (Sum of an integrable sequence). A sequence (ax)x∈E of real numbers is said to be integrable if

∑_{x∈E} |ax| < ∞.

In this case, define

∑_{x∈E} ax = ∑_{x∈E} a⁺x − ∑_{x∈E} a⁻x,

where a⁺x = max(0, ax) and a⁻x = max(0, −ax) represent the positive and negative parts of the sequence.

Remark 2.4.

• If (ax) is integrable, then the sum is always finite.

• For every sequence Fn ↑ E, we can check that the limit in (2.2) makes sense and

lim_{n→∞} (∑_{x∈Fn} ax) = ∑_{x∈E} ax.

(In particular, the limit does not depend on the chosen sequence Fn ↑ E.)


We finish this section with the statement of Fubini’s theorem, which allows one to permute infinite sums. The take-home message is that we can always permute the sums when all the terms are nonnegative. If the terms are not all nonnegative, one needs an integrability condition, which is obtained by summing the absolute values of the sequence.

Theorem 2.5 (Fubini’s theorem for nonnegative sequences). Let E, F be two finite or countable sets. Let (ux,y)(x,y)∈E×F be a family of nonnegative numbers. Then

∑_{(x,y)∈E×F} ux,y = ∑_{x∈E} (∑_{y∈F} ux,y) = ∑_{y∈F} (∑_{x∈E} ux,y).

Theorem 2.6 (Fubini’s theorem for integrable sequences). Let E, F be two finite or countable sets. Let (ux,y)(x,y)∈E×F be a family of real numbers. Assume that

∑_{x∈E} (∑_{y∈F} |ux,y|) < ∞;

then we have

∑_{x∈E} (∑_{y∈F} ux,y) = ∑_{y∈F} (∑_{x∈E} ux,y).

2.2 Definitions and examples

In this chapter, we fix some probability space (Ω,F,P).

Let us recall the notion of a discrete random variable from the previous chapter (the definition below is slightly different from Definition 1.21; we leave it to the reader to check that the two definitions are equivalent).

Definition 2.7 (Discrete random variables). A random variable X : Ω → R is said to be discrete if there exists some finite or countable set E ⊂ R such that

∀ω ∈ Ω  X(ω) ∈ E.

Definition 2.8. Let X be a discrete random variable taking values in some finite or countable set E ⊂ R. The distribution of X is the sequence of numbers (px)x∈E defined by

∀x ∈ E  px = P[X = x].

Proposition 2.9. The distribution (px)x∈E of a discrete random variable satisfies

∑_{x∈E} px = 1.


Proof. The random variable X takes values in E, hence

⋃_{x∈E} {X = x} = Ω.

Since the union is disjoint and the set E is at most countable, we have

∑_{x∈E} P[X = x] = P[⋃_{x∈E} {X = x}] = P[Ω] = 1.

Remark 2.10. Conversely, if we are given a sequence of numbers (px)x∈E with values in [0,1] and such that

∑_{x∈E} px = 1, (2.3)

then there exists a probability space (Ω,F,P) and a random variable X with associated distribution (px). This is a consequence of the existence theorem 1.19 in Chapter 1. This observation is important in practice: it allows us to write “Let X be a random variable with distribution (px)x∈E.”

Example 1: Bernoulli random variable

Let 0 ≤ p ≤ 1. A random variable X is said to be a Bernoulli random variable with parameter p if it takes values in E = {0,1} and

P[X = 0] = 1 − p and P[X = 1] = p.

In this case, we write X ∼ Ber(p).

Example 2: Binomial random variable

Let 0 ≤ p ≤ 1 and let n ∈ N. A random variable X is said to be a binomial random variable with parameters n and p if it takes values in E = {0, . . . , n} and

∀k ∈ {0, . . . , n}  P[X = k] = C(n,k) p^k (1 − p)^(n−k).

In this case, we write X ∼ Bin(n, p).

Remark 2.11. If we define pk = C(n,k) p^k (1 − p)^(n−k), we have

∑_{k=0}^{n} pk = ∑_{k=0}^{n} C(n,k) p^k (1 − p)^(n−k) = (p + 1 − p)^n = 1,

hence equation (2.3) is satisfied. This guarantees the existence of binomial random variables.


Proposition 2.12 (Sum of independent Bernoullis and binomial). Let 0 ≤ p ≤ 1 and let n ∈ N. Let X1, . . . ,Xn be independent Bernoulli random variables with parameter p. Then

Sn := X1 + ⋯ + Xn

is a binomial random variable with parameters n and p.

Proof. One can easily check that Sn is a random variable which takes values in {0, . . . , n}. Furthermore, for every k ∈ {0, . . . , n} we have

{Sn = k} = ⋃ {X1 = x1, . . . ,Xn = xn},

where the union runs over all x1, . . . , xn ∈ {0,1} with x1 + ⋯ + xn = k. Since the union is disjoint, we get

P[Sn = k] = ∑ P[X1 = x1, . . . ,Xn = xn] = ∑ P[X1 = x1]⋯P[Xn = xn] = ∑ p^k (1 − p)^(n−k) = C(n,k) p^k (1 − p)^(n−k),

where all sums run over the same tuples (x1, . . . , xn), of which there are C(n,k).

Remark 2.13. In particular, the distribution Bin(1, p) is the same as the distribution Ber(p). One can also check that if X ∼ Bin(m, p), Y ∼ Bin(n, p) and X, Y are independent, then X + Y ∼ Bin(m + n, p).
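Proposition 2.12 can be checked numerically by convolving n copies of the Bernoulli distribution and comparing with the binomial formula. This is a sketch; `convolve` is our helper, not notation from the notes.

```python
from math import comb

def convolve(d1, d2):
    # Distribution of the sum of two independent discrete random variables,
    # each given as a dict {value: probability}.
    out = {}
    for x, px in d1.items():
        for y, py in d2.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out

n, p = 5, 0.3
bernoulli = {0: 1 - p, 1: p}

# Distribution of Sn = X1 + ... + Xn for independent Ber(p) variables.
dist = {0: 1.0}
for _ in range(n):
    dist = convolve(dist, bernoulli)

binomial = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}
print(all(abs(dist[k] - binomial[k]) < 1e-12 for k in range(n + 1)))  # True
```

The same `convolve` helper, applied to two binomial distributions, also illustrates the remark that Bin(m, p) + Bin(n, p) = Bin(m + n, p) for independent summands.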

Example 3: Geometric random variable

Let 0 < p ≤ 1. A random variable X is said to be a geometric random variable with parameter p if it takes values in E = N∖{0} and

∀k ∈ N∖{0}  P[X = k] = (1 − p)^(k−1) ⋅ p.

In this case, we write X ∼ Geom(p).

Remark 2.14. If we define pk = (1 − p)^(k−1) p, we have

∑_{k=1}^{∞} pk = p ∑_{k=1}^{∞} (1 − p)^(k−1) = p ⋅ (1/p) = 1,

hence equation (2.3) is satisfied. This guarantees the existence of geometric random variables.


The geometric random variable appears naturally as the time of the first success in an infinite sequence of Bernoulli experiments with parameter p. This is formalized by the following proposition.

Proposition 2.15. Let X1, X2, . . . be a sequence of infinitely many independent¹ Bernoulli r.v.’s with parameter p. Then

T := min{n ≥ 1 : Xn = 1}

is a geometric random variable with parameter p.

Remark 2.16. When saying that T is a geometric random variable, we make a slight abuse: indeed, the random variable T may take the value +∞ if all the random variables Xi are equal to 0. Nevertheless, this is not a problem for the calculations, because one can check that P[T = ∞] = 0.

Proof. We have T = k if the first k − 1 trials fail and the k-th one is a success. Formally, we have

{T = k} = {X1 = 0, . . . ,Xk−1 = 0, Xk = 1}.

Hence, by independence,

P[T = k] = P[X1 = 0, . . . ,Xk−1 = 0, Xk = 1]
         = P[X1 = 0]⋯P[Xk−1 = 0] P[Xk = 1]
         = (1 − p)^(k−1) p.

The previous proposition gives us an easy way to remember the definition of the geometric r.v., and also some simple formulas related to the geometric distribution. For example, if T is a geometric random variable with parameter p, we have T > n if the first n Bernoulli experiments fail, and therefore

P[T > n] = (1 − p)^n. (2.4)

It also gives an important interpretation of equation (2.5) in the proposition below: if we are waiting for a first success in a sequence of experiments, and if we know that the first n steps were failures, then the remaining waiting time is again a geometric random variable with parameter p.

Proposition 2.17 (Absence of memory of the geometric distribution). Let T ∼ Geom(p) for some 0 < p < 1. Then

∀n ≥ 0 ∀k ≥ 1  P[T ≥ n + k ∣ T > n] = P[T ≥ k]. (2.5)

¹ this means: (Xj)j∈J are independent r.v.’s, for every finite J ⊂ N.


Proof. It follows directly from the formula (2.4).
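The memoryless property (2.5) can be verified directly from the tail formula (2.4). A small numerical sketch (helper names and the choice p = 0.2 are ours):

```python
def tail(p, n):
    # P[T > n] = (1 - p)^n, formula (2.4).
    return (1 - p) ** n

def ge(p, k):
    # P[T >= k] = P[T > k - 1] = (1 - p)^(k - 1), for k >= 1.
    return tail(p, k - 1)

p = 0.2
for n in range(4):
    for k in range(1, 5):
        # Since {T >= n + k} is contained in {T > n} for k >= 1:
        # P[T >= n + k | T > n] = P[T >= n + k] / P[T > n].
        conditional = ge(p, n + k) / tail(p, n)
        assert abs(conditional - ge(p, k)) < 1e-12
print("memorylessness verified for p = 0.2")
```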

Example 4: Poisson random variable

Let λ > 0 be a positive real number. A random variable X is said to be a Poisson random variable with parameter λ if it takes values in E = N and

∀k ∈ N  P[X = k] = (λ^k / k!) e^(−λ).

In this case, we write X ∼ Poisson(λ).

Remark 2.18. If we define pk = (λ^k / k!) e^(−λ), we have

∑_{k=0}^{∞} pk = e^(−λ) ∑_{k=0}^{∞} λ^k / k! = e^(−λ) ⋅ e^λ = 1,

hence equation (2.3) is satisfied. This guarantees the existence of Poisson random variables.

The Poisson distribution appears naturally as an approximation of a binomial distribution when the parameter n is large and the parameter p is small, as stated formally in the following proposition.

Proposition 2.19 (Poisson approximation of the binomial). Let λ > 0. For every n ≥ 1, consider a random variable Xn ∼ Bin(n, λ/n). Then

∀k ∈ N  lim_{n→∞} P[Xn = k] = P[N = k], (2.6)

where N is a Poisson random variable with parameter λ.

Remark 2.20. The convergence (2.6) is called a convergence in distribution. Intuitively, it says that Xn and N have very similar probabilistic properties for n large.

Proof. Fix k ∈ N. For every n ≥ 1, we have

P[Xn = k] = C(n,k) (λ/n)^k (1 − λ/n)^(n−k)
          = (λ^k / k!) ⋅ [n(n − 1)⋯(n − k + 1) / n^k ⋅ (1 − λ/n)^(−k)] ⋅ (1 − λ/n)^n.

As n → ∞, the factor in square brackets tends to 1 and (1 − λ/n)^n tends to e^(−λ), which concludes the proof.


This approximation may be useful in practice. For example, consider a single page of the “Neue Zürcher Zeitung” containing, say, n = 10⁴ characters, and suppose that the typesetter mis-sets approximately 1/1000 of the characters. In other words, each character has a probability p = 10/n to be mis-set. The number M of mistakes in the page corresponds to a binomial random variable with parameters n and p = 10/n. Hence, by the Poisson approximation with λ = np = 10, we have for example

P[M = 5] ≃ (10⁵ / 5!) e^(−10) ≃ 0.0378.
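The quality of the approximation in this example can be checked directly (a sketch; the tolerance 1e-4 is our choice):

```python
from math import comb, exp, factorial

n = 10_000   # characters on the page
p = 10 / n   # probability that a character is mis-set
lam = n * p  # Poisson parameter, here λ = 10
k = 5

exact = comb(n, k) * p**k * (1 - p) ** (n - k)  # Bin(n, p) at k
approx = lam**k / factorial(k) * exp(-lam)      # Poisson(λ) at k

print(round(approx, 4))            # 0.0378, as in the text
print(abs(exact - approx) < 1e-4)  # the two values are very close
```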

2.3 Joint distribution and image of random variables

Definition 2.21. Let X1 : Ω → E1, . . . , Xn : Ω → En be n discrete random variables on (Ω,F,P) with values in some finite or countable sets E1, . . . , En. The family (p_{x1,...,xn})_{x1∈E1,...,xn∈En} defined by

p_{x1,...,xn} = P[X1 = x1, . . . ,Xn = xn]

is called the joint distribution of the r.v.’s X1, . . . ,Xn.

The joint distribution characterizes the probabilistic properties of the random vector (X1, . . . ,Xn). If the random variables X1, . . . ,Xn are independent, then the joint distribution is given by the product

p_{x1,...,xn} = P[X1 = x1]⋯P[Xn = xn].

For example, consider two independent Poisson random variables X, Y : Ω → N with the same parameter λ; then the joint distribution of X and Y is given by

∀k, ℓ ∈ N  P[X = k, Y = ℓ] = (λ^(k+ℓ) / (k! ℓ!)) e^(−2λ).

In general the random variables may have dependencies. For example, let Z be a third random variable defined by Z = X. Then Z is also a Poisson random variable with parameter λ. This time, the joint distribution is

∀k, ℓ ∈ N  P[X = k, Z = ℓ] = 1_{k=ℓ} (λ^k / k!) e^(−λ).

One of the main advantages of working with random variables is that we can “manipulate” them as numbers. For instance, if we are given n random variables X1, X2, . . . ,Xn, we can think of them as n “random” numbers and perform operations with them. For example, we can consider the sum S = X1 + X2 + ⋯ + Xn or the product Y = X1 ⋅ X2 ⋅ ⋯ ⋅ Xn, which are two new random variables. More generally, the following proposition allows us to consider random variables as images of discrete random variables.


Proposition 2.22. Let n ≥ 1 and let φ : Rⁿ → R be an arbitrary function. Let X1 : Ω → E1, . . . ,Xn : Ω → En be n discrete random variables on (Ω,F,P) with values in some finite or countable sets E1, . . . , En. Then Z = φ(X1, . . . ,Xn) is a discrete random variable with values in the discrete set F = φ(E1 × ⋯ × En) and with distribution given by

∀z ∈ F  P[Z = z] = ∑ P[X1 = x1, . . . ,Xn = xn],

where the sum runs over all x1 ∈ E1, . . . , xn ∈ En with φ(x1, . . . , xn) = z.

Proof. To prove that Z is a discrete random variable, it suffices to show that

• it takes values in the discrete set F, and

• for every z in F, the set {Z = z} is an event, i.e. {Z = z} ∈ F.

Indeed, the second item implies that for every a ∈ R,

{Z ≥ a} = ⋃_{z∈F, z≥a} {Z = z}

is also an event (as a countable union of events). Now, the first item follows from the hypotheses on X1, . . . ,Xn and the definition of F.

For the second item, let z ∈ F and observe that

{Z = z} = ⋃ {X1 = x1, . . . ,Xn = xn}, (2.7)

where the union runs over all x1 ∈ E1, . . . , xn ∈ En with φ(x1, . . . , xn) = z. Hence, {Z = z} is an event, since it is a countable union of events. This implies that Z is a discrete r.v. To compute the distribution, observe that the union in (2.7) is disjoint and at most countable; hence,

P[Z = z] = ∑ P[X1 = x1, . . . ,Xn = xn],

where the sum runs over the same tuples (x1, . . . , xn).
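Proposition 2.22 in action: computing the distribution of Z = φ(X1, X2) for the sum of two independent fair dice. This is a sketch with our own helper names, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

# Joint distribution of two independent fair dice.
joint = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

def image_distribution(phi, joint):
    # P[Z = z] = sum of the joint weights over the preimage phi^{-1}(z).
    dist = {}
    for (x, y), weight in joint.items():
        z = phi(x, y)
        dist[z] = dist.get(z, Fraction(0)) + weight
    return dist

dist_sum = image_distribution(lambda x, y: x + y, joint)
print(dist_sum[7])             # 1/6, the most likely total
print(sum(dist_sum.values()))  # 1, as Proposition 2.9 requires
```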

2.3.1 A parenthesis: almost sure events

An important notion when working with random variables is the notion of almost sure occurrence of an event.

Definition 2.23. Let A ∈ F be an event. We say that A occurs almost surely (a.s.) if

P[A] = 1.

Remark 2.24. This notion can be extended to any set A ⊂ Ω (not necessarily an event): we say that A occurs almost surely if there exists an event A′ ∈ F such that A′ ⊂ A and P[A′] = 1.


In other words, something occurs a.s. if it occurs with probability 1. For example, if X, Y are two random variables, we write

X ≤ Y a.s.

if P[X ≤ Y] = 1, and

X ≤ a a.s.

if P[X ≤ a] = 1.

2.4 Expectation

The expectation of a random variable is a fundamental concept, which corresponds intuitively to the notion of “average”. Before defining it for general random variables, let us start with our favorite example: the throw of a die!

Let X : Ω → E = {1, . . . ,6} be a uniform r.v. in {1,2,3,4,5,6}, i.e. P[X = x] = 1/6 for every x = 1, . . . ,6. It is natural to say that the average value is

(1 + 2 + 3 + 4 + 5 + 6)/6.

With the probabilistic notation, this can be rewritten as

1 · (1/6) + 2 · (1/6) + 3 · (1/6) + 4 · (1/6) + 5 · (1/6) + 6 · (1/6) = ∑_{x∈E} x · P[X = x],

which is the general formula for the expectation. If a random variable is nonnegative, then the expectation is always well-defined.

Definition 2.25. Let X : Ω → E be a discrete random variable, and assume that X ≥ 0 a.s.. Then the expectation of X is defined by

E[X] = ∑_{x∈E} x · P[X = x].

Remark 2.26. If X ≥ 0 a.s., we have E[X] ∈ R₊ ∪ {+∞}.
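The defining sum ∑_{x∈E} x · P[X = x] can be evaluated directly. A small Python sketch (the helper `expectation` is ours, written for illustration), using exact rational arithmetic for the die example above:

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum over x of x * P[X = x], for a pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# Fair die: P[X = x] = 1/6 for x = 1, ..., 6.
die = {x: Fraction(1, 6) for x in range(1, 7)}
print(expectation(die))  # 7/2
```

The result 7/2 = 3.5 matches the average value computed above.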

When a random variable does not have a constant sign, one needs an integrability assumption to be able to define the expectation.

Definition 2.27. Let X : Ω → E be a discrete random variable. We say that X is integrable if

E[|X|] < ∞.

In this case the expectation of X is defined by

E[X] = ∑_{x∈E} x · P[X = x].


Remark 2.28. If X is an integrable r.v., then E[X] ∈ R.

Example 1: Bernoulli r.v.

Let X be a Bernoulli random variable with parameter p. We have

E[X] = p.

Indeed,

E[X] = 0 · P[X = 0] + 1 · P[X = 1] = 0 · (1 − p) + 1 · p = p.

Example 2: Bet on a die

Consider the random variable X : Ω → {−1,0,+2} defined by Eq. (1.5) page 18. Then

E[X] = −1 · (1/2) + 0 · (1/6) + 2 · (1/3) = 1/6.

Example 3: Poisson r.v.

Let X be a Poisson random variable with parameter λ > 0, then

E[X] = λ.

Indeed,

E[X] = ∑_{k=0}^{∞} k · (λ^k/k!) · e^{−λ} = λ · (∑_{k=1}^{∞} λ^{k−1}/(k−1)!) · e^{−λ} = λ.

Example 4: Indicator of an event

Let A be an event. Then 1_A is a Bernoulli r.v. with parameter P[A]. Hence,

E[1_A] = P[A].

Proposition 2.29 (Tail sum formula). Let X be a discrete r.v. taking values in N = {0,1,2, . . .}. Then

E[X] = ∑_{n=1}^{∞} P[X ≥ n]. (2.8)

Proof. Write p_n = P[X = n] for the distribution of X. Since X takes values in N, we have for every n ≥ 1

P[X ≥ n] = ∑_{k≥n} P[X = k] = ∑_{k=1}^{∞} 1_{k≥n} · p_k.


Using this identity in the right-hand side of (2.8) and using Fubini’s theorem (for nonnegative sequences) to permute the sums, we obtain

∑_{n=1}^{∞} P[X ≥ n] = ∑_{n=1}^{∞} (∑_{k=1}^{∞} 1_{k≥n} · p_k) = ∑_{k=1}^{∞} (∑_{n=1}^{∞} 1_{k≥n}) · p_k = ∑_{k=1}^{∞} k · p_k = E[X].

Application: Computation of the expectation of a geometric random variable.

Let T be a geometric random variable with parameter 0 < p ≤ 1. Then

E[T] = 1/p.

Indeed, T takes values in N and the proposition above gives

E[T] = ∑_{n≥1} P[T ≥ n] = ∑_{n≥1} (1 − p)^{n−1} = 1/(1 − (1 − p)) = 1/p.

Theorem 2.30 (Image random variables). Let X1, . . . ,Xn : Ω → E be n random variables, let φ : Rⁿ → R. Then Z = φ(X1, . . . ,Xn) defines a discrete random variable. Furthermore, if

∑_{x1,...,xn∈E} |φ(x1, . . . , xn)| · P[X1 = x1, . . . ,Xn = xn] < ∞,

then Z is integrable and

E[φ(X1, . . . ,Xn)] = ∑_{x1,...,xn∈E} φ(x1, . . . , xn) · P[X1 = x1, . . . ,Xn = xn]. (2.9)

Proof. Using the formula of Prop. 2.22 and Fubini’s theorem for nonnegative sequences, we have

∑_{z∈F} |z| P[Z = z] = ∑_{z∈F} ∑_{x1,...,xn∈E} |z| · 1_{φ(x1,...,xn)=z} · P[X1 = x1, . . . ,Xn = xn]

= ∑_{x1,...,xn∈E} (∑_{z∈F} |z| · 1_{φ(x1,...,xn)=z}) · P[X1 = x1, . . . ,Xn = xn]

= ∑_{x1,...,xn∈E} |φ(x1, . . . , xn)| · P[X1 = x1, . . . ,Xn = xn] < ∞.

Hence Z is integrable, and formula (2.9) is proved using the same computation, without the absolute values (the permutation of the two sums is justified by Fubini’s theorem for integrable sequences).


Remark 2.31. The formula for n = 1 is also important. Let X : Ω → E be a discrete random variable, and φ : R → R. Assume that

∑_{x∈E} |φ(x)| · P[X = x] < ∞.

Then the discrete random variable Z = φ(X) is integrable and

E[φ(X)] = ∑_{x∈E} φ(x) · P[X = x].

Remark 2.32. In case φ takes values in a set F ⊂ [0,∞) (which implies Z ≥ 0 a.s.), the formula (2.9) is always true, even without the integrability assumption.

Theorem 2.33 (linearity of the expectation). Let X, Y : Ω → R be two integrable discrete random variables, let λ ∈ R. Then λ·X and X + Y are integrable discrete random variables and

1. E[λ·X] = λ·E[X],

2. E[X + Y] = E[X] + E[Y].

Remark 2.34. The random variables X and Y do not need to be independent.

Remark 2.35. This implies that for every integer n ≥ 1

E[λ1X1 + λ2X2 +⋯ + λnXn] = λ1E[X1] + λ2E[X2] +⋯ + λnE[Xn],

for any integrable discrete r.v. X1,X2, . . . ,Xn ∶ Ω→ E, and any λ1, λ2,⋯, λn ∈ R.

Proof. We omit the proof of the integrability properties. The first equality is proved by applying Theorem 2.30 to φ defined by φ(x) = λ·x. For the second equality, apply Theorem 2.30 to φ(x, y) = x + y to get

E[X + Y] = ∑_{x,y} (x + y) · P[X = x, Y = y]
= ∑_{x,y} x P[X = x, Y = y] + ∑_{x,y} y P[X = x, Y = y]
= ∑_x x ∑_y P[X = x, Y = y] + ∑_y y ∑_x P[X = x, Y = y].

Now, for fixed x observe that the event {X = x} can be partitioned as

{X = x} = ⋃_y {X = x, Y = y}.

Since the union is at most countable, we have

P[X = x] = ∑_y P[X = x, Y = y],


and similarly, for fixed y,

P[Y = y] = ∑_x P[X = x, Y = y]. (2.10)

Plugging these two identities into the computation of E[X + Y] above, we obtain

E[X + Y] = ∑_x x P[X = x] + ∑_y y P[Y = y],

which concludes the proof.

Application: Expectation of a binomial random variable.

Let n ≥ 1 and 0 ≤ p ≤ 1. Let S be a binomial random variable with parameters n and p. What is the expectation of S?

By definition we have

E[S] = ∑_{k=0}^{n} k · (n choose k) p^k (1 − p)^{n−k},

and this sum does not look so nice... However, we can use that S has the same distribution as Sn = X1 + ⋯ + Xn, where X1, . . . ,Xn are n iid Bernoulli r.v.’s with parameter p. By linearity we have

E[Sn] = E[X1] + ⋯ + E[Xn].

Using that E[Xi] = p for every i, we deduce directly

E[S] = E[Sn] = np.
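The representation of S as a sum of independent Bernoulli variables also gives a direct way to simulate it and check E[S] = np empirically. A sketch (parameter values and variable names are ours):

```python
import random

random.seed(1)
n, p, trials = 20, 0.3, 200_000

# S has the law of X1 + ... + Xn with Xi i.i.d. Bernoulli(p).
total = sum(sum(random.random() < p for _ in range(n)) for _ in range(trials))
emp_mean = total / trials
print(emp_mean)  # close to n * p = 6.0
```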

Theorem 2.36 (Jensen’s inequality). Let X be a discrete random variable. Let φ : R → R be a convex function. If E[φ(X)] and E[X] are well defined, then

φ(E[X]) ≤ E[φ(X)].

The Jensen inequality has several important consequences. First, by applying it to φ(x) = |x|, we obtain the triangle inequality: for every integrable discrete random variable X, we have

|E[X]| ≤ E[|X|].

Another important consequence relates the average of |X| with the average of X². By applying it to the convex function φ(x) = x², we obtain that for every discrete random variable, we have

E[|X|] ≤ √(E[X²]). (2.11)


2.5 Independence of discrete random variables

Theorem 2.37. Let X, Y be 2 discrete random variables. Then the following are equivalent:

(i) X,Y are independent,

(ii) For every a, b ∈ R we have P[X = a, Y = b] = P[X = a] ⋅ P[Y = b],

(iii) For every f ∶ R→ R, g ∶ R→ R,

E[f(X)g(Y )] = E[f(X)]E[g(Y )]

whenever the expectations are well-defined.

Proof.

(i)⇔ (ii) Already seen.

(ii) ⇒ (iii) Let f, g be as in (iii). By Theorem 2.30 applied to φ(x, y) = f(x)g(y), we have

E[f(X)g(Y)] = ∑_{x,y} f(x)g(y) · P[X = x, Y = y]
= ∑_{x,y} f(x)g(y) · P[X = x] P[Y = y]   (by (ii))
= (∑_x f(x) · P[X = x]) · (∑_y g(y) · P[Y = y])
= E[f(X)] · E[g(Y)].

(iii) ⇒ (ii) Consider the functions f and g defined by f(x) = 1_{x=a} and g(y) = 1_{y=b}. We have

P[X = a, Y = b] = E[1_{X=a, Y=b}]
= E[f(X)g(Y)] = E[f(X)] · E[g(Y)]   (by (iii))
= P[X = a] P[Y = b].

Above, we only considered a pair (X, Y) of random variables, but the same ideas apply to n random variables X1, . . . ,Xn, as in the following theorem.

Theorem 2.38. Let X1, . . . ,Xn be n discrete random variables. Then the following are equivalent:


(i) X1, . . . ,Xn are independent,

(ii) For every x1, . . . , xn ∈ R we have P[X1 = x1, . . . ,Xn = xn] = P[X1 = x1]⋯P[Xn = xn],

(iii) For every f1 : R → R, . . . , fn : R → R such that f1(X1), . . . , fn(Xn) are integrable,

E[f1(X1)⋯fn(Xn)] = E[f1(X1)]⋯E[fn(Xn)].

2.6 Variance

Definition 2.39. Let X be a discrete random variable such that E[X²] < ∞. The variance of X is defined by

σ²_X = E[(X − m)²], where m = E[X].

The square root σ_X of the variance is called the standard deviation of X.

Remark 2.40. If E[X²] < ∞, then we also have E[|X|] < ∞ by Eq. (2.11), and therefore the average m = E[X] is well-defined.

The standard deviation is an indicator of how large the fluctuations of X around m = E[X] are. We illustrate this fact on two simple examples.

Example 1: Deterministic random variable

Let a ∈ R. Consider the random variable defined by X(ω) = a for every ω. Then m = E[X] = a and σ²_X = E[(X − m)²] = 0.

Example 2: Uniform random variable on two points

Let a < b be two real numbers. Consider a random variable X with distribution given by P[X = a] = P[X = b] = 1/2. Then m = E[X] = (a + b)/2 and

σ_X = √(E[(X − m)²]) = (b − a)/2.

Proposition 2.41 (basic properties of the variance).

1. Let X be a discrete random variable with E[X²] < ∞. Then

σ²_X = E[X²] − E[X]².

2. Let X1, . . . ,Xn be n pairwise independent random variables and S = X1 + ⋯ + Xn. Then

σ²_S = σ²_{X1} + ⋯ + σ²_{Xn}.


Proof.

1. Let m = E[X]. Using linearity of the expectation, we get

E[(X − m)²] = E[X² − 2mX + m²] = E[X²] − 2m E[X] + m² = E[X²] − m².

2. Writing m_i = E[Xi], we have

S − E[S] = ∑_{i=1}^{n} (Xi − m_i),

and therefore

σ²_S = ∑_{1≤i,j≤n} E[(Xi − m_i)(Xj − m_j)].

However, for i ≠ j, independence implies E[(Xi − m_i)(Xj − m_j)] = E[Xi − m_i] · E[Xj − m_j] = 0. Hence only the diagonal terms (for which i = j) survive in the sum and we obtain

σ²_S = ∑_{i=1}^{n} E[(Xi − m_i)²] = ∑_{i=1}^{n} σ²_{Xi}.

Application: Let S be a binomial random variable with parameters n and p. What is the variance of S?

Here, again, we can use that S has the same distribution as Sn = X1 + ⋯ + Xn where X1, . . . ,Xn are iid Bernoulli random variables with parameter p. Hence

σ²_S = σ²_{Sn} = σ²_{X1} + ⋯ + σ²_{Xn}   (by independence)
= n · σ²_{X1}   (by identical distribution).

One has σ²_{X1} = E[X1²] − p² = p − p² = p(1 − p). Hence

σ²_S = n · p(1 − p).

Here, we have discovered an important effect of summing iid random variables. One has

E[S] = n·p and σ_S = √n · √(p(1 − p)),

so the expectation of S grows like n, while the fluctuations of Sn grow like √n, thanks to cancellations in the sum Sn = X1 + ⋯ + Xn.
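The √n growth of the fluctuations can be seen in a simulation: quadrupling n should roughly double the empirical standard deviation. A sketch (helper name, seed, and sample sizes are ours):

```python
import math
import random

random.seed(2)

def empirical_sd(n, p, trials=20_000):
    """Empirical standard deviation of a Binomial(n, p) sample."""
    samples = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
    m = sum(samples) / trials
    return math.sqrt(sum((s - m) ** 2 for s in samples) / trials)

p = 0.5
for n in (10, 40, 160):
    # Empirical value vs. the formula sqrt(n * p * (1 - p)).
    print(n, empirical_sd(n, p), math.sqrt(n * p * (1 - p)))
```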


Chapter 3

Continuous distributions

In all the chapter, we fix some probability space (Ω,F ,P).

3.1 Definition and examples (textbook p.85-88)

Let us recall the definitions from Chapter 1. A random variable is a function X : Ω → R such that {X ≤ a} is an event for every a, and the distribution function of X is defined by FX(a) = P[X ≤ a].

Definition 3.1 (Continuous random variables). A random variable X : Ω → R is said to be continuous if its distribution function FX can be written as

FX(a) = ∫_{−∞}^{a} f(x) dx for all a in R, (3.1)

for some nonnegative function f : R → R₊, called the density of X.

In this case, we have already noticed in Section 1.4 that FX is a continuous function, and for a ∈ R

P[X = a] = FX(a) − FX(a−) = 0.

If X has a continuous distribution, we thus have: for all a ≤ b,

P[a < X < b] = P[a ≤ X ≤ b] = P[a < X ≤ b] = P[a ≤ X < b] = ∫_{a}^{b} f(x) dx.

Example 1: Uniform distribution on [a, b], a < b.

A continuous random variable X is said to be uniform in [a, b] if its density is equal to

f_{a,b}(x) = 1/(b − a) if x ∈ [a, b], and f_{a,b}(x) = 0 if x ∉ [a, b].

In this case, we write X ∼ U([a, b]).


CHAPTER 3. CONTINUOUS DISTRIBUTIONS 50

Figure 3.1: Density and distribution function of a uniform random variable in [a, b].

Intuition: X represents a uniformly chosen point in [a, b].

Properties of a uniform random variable X in [a, b]:

• The probability to fall in an interval [c, c + ℓ] ⊂ [a, b] depends only on its length ℓ:

P[X ∈ [c, c + ℓ]] = ℓ/(b − a).

• The distribution function of X is equal to

FX(x) = 0 for x < a, FX(x) = (x − a)/(b − a) for a ≤ x ≤ b, and FX(x) = 1 for x > b.

Example 2: Exponential distribution with parameter λ > 0

The exponential distribution is the continuous analogue of the geometric distribution. A continuous random variable T is said to be exponential with parameter λ > 0 if its density is equal to

f_λ(x) = λe^{−λx} if x ≥ 0, and f_λ(x) = 0 if x < 0.

In this case, we write T ∼ Exp(λ).

Figure 3.2: Density and distribution function of an exponential random variable with parameter λ.


Intuition/application: T represents the time of a “clock ring”. For example, the time at which a first customer arrives in a shop is well modeled by an exponential random variable.

Properties of an exponential random variable T with parameter λ:

• The waiting probability is exponentially small:

∀t ≥ 0   P[T > t] = e^{−λt}.

• It has the absence of memory property:

∀t, s ≥ 0   P[T > t + s | T > t] = P[T > s].

The first item follows from the definition:

P[T > t] = ∫_{t}^{∞} λe^{−λx} dx = e^{−λt}.

The second item is a direct computation of the conditional probability:

P[T > t + s | T > t] = P[T > t + s]/P[T > t] = e^{−λ(t+s)}/e^{−λt} = e^{−λs}.

Example 3: Normal distribution with parameters m ∈ R and σ² > 0

A continuous random variable X is said to be normal with parameters m and σ² > 0 if its density is equal to

f_{m,σ}(x) = (1/√(2πσ²)) e^{−(x−m)²/(2σ²)}, x ∈ R.

In this case, we write X ∼ N(m, σ²).

Figure 3.3: Density of a normal random variable with parameters m and σ².

Intuition/application: The normal distribution arises in many applications. For example, imagine that we measure a physical quantity: the real value is m, and the measured value is in general very well modeled by a normal random variable X with parameters m and σ². The quantity σ, which represents the fluctuations of X, can also be interpreted as the quality of the measurement in this context. A small σ corresponds to small fluctuations of X, which means that X is typically close to m. On the contrary, a large σ corresponds to large fluctuations and can be interpreted as an inaccurate measurement. We will see later a mathematical justification explaining why the normal random variable appears in many places.

Properties of normal random variables

• If X1, . . . ,Xn are independent normal random variables with parameters (m1, σ1²), . . . , (mn, σn²) respectively, and m0, λ1, . . . , λn ∈ R, then

Z = m0 + λ1X1 + ⋯ + λnXn

is a normal random variable with parameters m = m0 + λ1m1 + ⋯ + λnmn and σ² = λ1²σ1² + ⋯ + λn²σn².

• In particular, if X ∼ N(0,1) (in this case we say that X is a standard normal random variable), then

Z = m + σ·X

is a normal random variable with parameters m and σ².

• If X is a normal random variable with parameters m and σ², then all the “probability mass” is mainly in the interval [m − 3σ, m + 3σ]. Namely, we have

P[|X − m| ≥ 3σ] ≤ 0.0027.

At a first look, it may be surprising that the right-hand side (0.0027) does not depend on the parameters σ and m... This is explained as follows: consider the random variable Z = (X − m)/σ. The first property above implies that Z = (1/σ)X − m/σ is a standard normal random variable. The left-hand side can be rewritten as

P[|X − m| ≥ 3σ] = P[|X − m|/σ ≥ 3] = P[|Z| ≥ 3].

Then the inequality P[|Z| ≥ 3] ≤ 0.0027 can be directly checked from a table.
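Instead of a table, the bound can be checked with the error function, since P[Z ≤ z] = (1 + erf(z/√2))/2 for a standard normal Z. A sketch:

```python
import math

def phi(z):
    """Standard normal distribution function via the error function."""
    return (1 + math.erf(z / math.sqrt(2))) / 2

p = 2 * (1 - phi(3))  # P[|Z| >= 3], by symmetry of the standard normal
print(p)              # close to 0.0027
```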

How to recognize a continuous random variable?

The following theorem will be useful in applications to calculate densities.

Theorem 3.2. Let X be a random variable. Assume the distribution function FX is continuous and piecewise C¹, i.e. that there exist x0 = −∞ < x1 < ⋯ < xn−1 < xn = +∞ such that FX is C¹ on every interval (xi, xi+1). Then X is a continuous random variable, and a density f can be constructed by defining

f(x) = F′X(x) for every x ∈ (xi, xi+1),

and setting arbitrary values at x1, . . . , xn−1.


Proof. We simply write F = FX. Let 0 ≤ i < n. If xi < a < b < xi+1, the fundamental theorem of calculus implies that

F(b) − F(a) = ∫_{a}^{b} F′(y) dy = ∫_{a}^{b} f(y) dy.

Now, let x ∈ R and let i be such that x ∈ [xi, xi+1). Using the convention F(x0) = 0, we can write F(x) as a telescopic sum

F(x) = F(x) − F(x0) = (F(x) − F(xi)) + ⋯ + (F(x1) − F(x0)). (3.2)

By continuity of F, we have

F(x) − F(xi) = lim_{a↓xi} (F(x) − F(a)) = lim_{a↓xi} ∫_{a}^{x} f(y) dy = ∫_{xi}^{x} f(y) dy,

and similarly

F(xi) − F(xi−1) = ∫_{xi−1}^{xi} f(y) dy.

Plugging these identities in (3.2), we get

F(x) = ∫_{xi}^{x} f(y) dy + ∫_{xi−1}^{xi} f(y) dy + ⋯ + ∫_{x0}^{x1} f(y) dy = ∫_{−∞}^{x} f(y) dy.

A method to compute the density of a transformed random variable

As we have seen, it is very useful to take the image Y = φ(X) of a random variable X by some function φ : R → R. Here we consider the following problem:

Problem: Let X be a continuous random variable with density f. Assuming it exists, what is the density of Y = φ(X)?

First, remark that φ(X) may not be continuous. For example, if φ is a constant function, then φ(X) is NOT continuous, and therefore it does not have a density. The method below works well if X and φ are sufficiently “nice”.


How to compute the density of Y = φ(X)?

Step 1: Compute the distribution function:

FY(x) = P[φ(X) ≤ x] = ⋯

Step 2: If the function FY is continuous and piecewise C¹, then we directly conclude that Y is continuous and its density is given by

f(x) = F′Y(x)

on each interval where the derivative is defined.

We give two examples where the method can be applied.

Example 1: What is the density of Y = X², for X a standard normal r.v.?

First, we compute the distribution function of Y. For every x < 0, FY(x) = 0, and for every x ≥ 0

FY(x) = P[X² ≤ x] = P[−√x ≤ X ≤ √x] = Φ(√x) − Φ(−√x),

where Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−y²/2} dy.

The function FY is continuous and piecewise C¹, so we can apply Theorem 3.2. We compute the derivative on the intervals (−∞,0) and (0,∞). For x < 0, we have F′Y(x) = 0, and for x > 0, we have

F′Y(x) = (1/(2√x)) (Φ′(√x) + Φ′(−√x)) = (1/√(2πx)) e^{−x/2}.

Therefore, by Theorem 3.2, the random variable Y is continuous and it has density

f(x) = 0 for x ≤ 0, and f(x) = (1/√(2πx)) e^{−x/2} for x > 0.

Example 2: What is the density of Y = Xⁿ, for X uniformly distributed on [0,1]?

First we compute the distribution function of Y. The random variable Y takes values in [0,1], and for 0 < x ≤ 1

FY(x) = P[Y ≤ x] = P[Xⁿ ≤ x] = P[X ≤ x^{1/n}] = x^{1/n}.

Therefore,

FY(x) = 0 for x ≤ 0, FY(x) = x^{1/n} for 0 < x ≤ 1, and FY(x) = 1 for x > 1.


Second, we observe that FY is continuous and piecewise C¹. By computing the derivative of FY on the intervals (−∞,0), (0,1) and (1,∞), we find that Y has density

f(x) = (1/n) x^{1/n − 1} for 0 < x ≤ 1, and f(x) = 0 otherwise.

3.2 Expectation and variance

Definition 3.3 (Expectation for a positive random variable). Let X : Ω → R be a continuous random variable with density f, and assume that X ≥ 0 a.s.. Then the expectation of X is defined by

E[X] = ∫_{−∞}^{∞} x · f(x) dx.

Remark 3.4. The fact that X ≥ 0 a.s. implies that f(x) = 0 for x < 0, hence

E[X] = ∫_{0}^{∞} x · f(x) dx,

and the integral is always well defined. It can be finite or infinite.

Definition 3.5. Let X : Ω → R be a continuous random variable with density f. We say that X is integrable if E[|X|] < ∞. In this case we define the expectation of X by

E[X] = ∫_{−∞}^{∞} x · f(x) dx.

Definition 3.6 (expectation of the image). Let X be a random variable with density f. Let φ : R → R be such that

∫_{−∞}^{∞} |φ(x)| f(x) dx < ∞.

Then we define the expectation of φ(X) by

E[φ(X)] = ∫_{−∞}^{∞} φ(x) f(x) dx.

Remark 3.7. If Y = φ(X) is continuous with density g, the definition above is consistent with Definition 3.5, in the sense that

E[Y] = ∫_{−∞}^{∞} y · g(y) dy = ∫_{−∞}^{∞} φ(x) · f(x) dx = E[φ(X)].

If Y = φ(X) is discrete and takes values in a finite or countable set E, then the definition above is consistent with Definition 2.27, in the sense that

E[Y] = ∑_{y∈E} y · P[Y = y] = ∫_{−∞}^{∞} φ(x) · f(x) dx = E[φ(X)].

These facts are not obvious in the formalism used in this class, and we admit them.

Example 1: Uniform random variable in [a, b], a < b

E[X] = (1/(b − a)) ∫_{a}^{b} x dx = (1/(b − a)) · (b²/2 − a²/2).

Therefore,

E[X] = (a + b)/2.

Example 2: Exponential random variable with parameter λ > 0

By integration by parts, we have

E[X] = ∫_{0}^{∞} x λe^{−λx} dx = [−x e^{−λx}]_{0}^{∞} + ∫_{0}^{∞} e^{−λx} dx.

Therefore,

E[X] = 1/λ.

Example 3: Normal random variable with parameters m and σ²

If X is a normal random variable with parameters m and σ², then it has the same distribution as m + σ·Y, where Y is a standard normal random variable. Using Definition 3.6, we see that

E[X] = E[m + σ·Y] = m + σ E[Y],

hence it suffices to compute the expectation of Y. Writing f_{0,1} for the density of Y, we have

E[Y] = ∫_{−∞}^{∞} x · f_{0,1}(x) dx = 0

because x · f_{0,1}(x) is an odd function. Finally, we obtain

E[X] = m.


Example 4: Cauchy distribution

A random variable X is said to have a Cauchy distribution if it is continuous and has density

f(x) = (1/π) · 1/(1 + x²), x ∈ R.

In this case,

E[|X|] = (1/π) ∫_{−∞}^{∞} |x|/(1 + x²) dx = +∞.

The expectation of X is not well defined. Actually, there is no natural definition of

∫_{−∞}^{+∞} (1/π) x/(1 + x²) dx.

Indeed, one can note for instance that

lim_{N→∞} ∫_{−N}^{2N} (1/π) x/(1 + x²) dx = (1/π) log 2,

while

lim_{N→∞} ∫_{−3N}^{N} (1/π) x/(1 + x²) dx = −(1/π) log 3.

As in the discrete case, Jensen’s inequality allows us to compare the expectation of φ(X) to the expectation of X when φ is convex.

Theorem 3.8 (Jensen’s inequality). Let φ : R → R be a convex function. Let X be a continuous random variable such that E[X] and E[φ(X)] are well defined. Then

φ(E[X]) ≤ E[φ(X)].

Definition 3.9. Let X : Ω → R be a continuous random variable with density f. If E[X²] < ∞, then we define the variance of X by

σ²_X = E[(X − m)²] = ∫_{−∞}^{∞} (x − m)² · f(x) dx,

where m = E[X].

Notation: Sometimes the variance of X is written Var(X), and the number σ_X = √(σ²_X) is called the standard deviation of X.

Remark 3.10. As in the discrete case, the fact that E[X²] < ∞ implies that the mean m = E[X] and the variance σ²_X are well defined (by Jensen’s inequality).

Proposition 3.11 (basic properties of the variance).

1. Let X be a continuous random variable with E[X²] < ∞. Then

σ²_X = E[X²] − E[X]².

2. Let X be a continuous random variable with E[X²] < ∞, let λ, µ ∈ R. Then

σ²_{λX+µ} = λ² · σ²_X.

3. Let X1, . . . ,Xn be n pairwise independent continuous random variables with E[Xi²] < ∞, let λ1, . . . , λn ∈ R, and let S = λ1X1 + ⋯ + λnXn. Then the variance of S is equal to

σ²_S = λ1² σ²_{X1} + ⋯ + λn² σ²_{Xn}.

Proof. We prove Items 1 and 2. Using the linearity of the integral, we have

σ²_X = E[(X − m)²] = ∫_{−∞}^{∞} (x − m)² · f(x) dx
= ∫_{−∞}^{∞} x² · f(x) dx − 2m ∫_{−∞}^{∞} x · f(x) dx + m² ∫_{−∞}^{∞} f(x) dx
= E[X²] − 2m·m + m² = E[X²] − m².

The second item is a direct consequence of the quadratic form of the definition: using that E[λX + µ] = λE[X] + µ and the linearity of the integral, we have

σ²_{λX+µ} = ∫_{−∞}^{∞} (λx + µ − E[λX + µ])² · f(x) dx = ∫_{−∞}^{∞} λ²(x − E[X])² · f(x) dx = λ² · σ²_X.

Example 1: Uniform random variable in [a, b], a < b

If X is uniform in [a, b], then one can check that X has the same distribution as a + (b − a)Y, where Y is a uniform random variable in [0,1]. For a uniform random variable in [0,1], we have m = 1/2 and

σ²_Y = ∫_{0}^{1} x² dx − m² = 1/3 − 1/4 = 1/12.

Then the second item of Proposition 3.11 implies that σ²_X = (b − a)² · σ²_Y and therefore

σ²_X = (b − a)²/12.

Example 2: Exponential random variable with parameter λ

By integration by parts, we have

E[X²] = ∫_{0}^{∞} x² · λe^{−λx} dx = [−x² e^{−λx}]_{0}^{∞} + 2 ∫_{0}^{∞} x e^{−λx} dx = 0 + (2/λ) E[X] = 2/λ²,

and therefore, using the formula σ²_X = E[X²] − E[X]², we obtain

σ²_X = 1/λ².

Example 3: Normal random variable with parameters m and σ²

We use that X has the same distribution as m + σ·Y, where Y is standard normal. Now, for a standard normal random variable, we have m = 0 and, using integration by parts, we find

σ²_Y = (1/√(2π)) ∫_{−∞}^{∞} x² e^{−x²/2} dx = (1/√(2π)) [−x e^{−x²/2}]_{−∞}^{∞} + (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} dx = 0 + 1 = 1.

The second item of Proposition 3.11 implies that σ²_X = σ² · σ²_Y and therefore

σ²_X = σ².
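All three variance formulas can be checked at once by computing σ² = E[X²] − E[X]² with numerical integration of the densities. A sketch (helper names, integration ranges, and parameter values are ours; the exponential and normal integrals are truncated where the tails are negligible):

```python
import math

def integrate(f, lo, hi, steps=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

def variance(density, lo, hi):
    """sigma^2 = E[X^2] - E[X]^2, computed from the density."""
    m = integrate(lambda x: x * density(x), lo, hi)
    return integrate(lambda x: x * x * density(x), lo, hi) - m * m

a, b = 1.0, 3.0
var_unif = variance(lambda x: 1 / (b - a), a, b)
print(var_unif)  # close to (b - a)^2 / 12 = 1/3

lam = 0.8
var_exp = variance(lambda x: lam * math.exp(-lam * x), 0.0, 80.0)
print(var_exp)   # close to 1 / lam^2 = 1.5625

m0, sig = 2.0, 1.5
f_norm = lambda x: math.exp(-(x - m0) ** 2 / (2 * sig**2)) / math.sqrt(2 * math.pi * sig**2)
var_norm = variance(f_norm, m0 - 12 * sig, m0 + 12 * sig)
print(var_norm)  # close to sig^2 = 2.25
```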

3.3 Joint distribution

Definition 3.12. Two random variables X, Y : Ω → R have a continuous joint distribution if there exists a function f : R² → R₊ such that

P[X ∈ [a, a′], Y ∈ [b, b′]] = ∫_{a}^{a′} (∫_{b}^{b′} f(x, y) dy) dx

for every −∞ ≤ a ≤ a′ ≤ ∞ and −∞ ≤ b ≤ b′ ≤ ∞. A function f as above is called a joint density of (X, Y).

Remark 3.13. By choosing a = b = −∞ and a′ = b′ = +∞ in the definition, one can see that a joint density always satisfies

∫_{−∞}^{∞} (∫_{−∞}^{∞} f(x, y) dy) dx = 1. (3.3)

Conversely, given a nonnegative function f satisfying (3.3), one can always construct a probability space (Ω,F,P) and two random variables X, Y : Ω → R with joint density f (admitted).


Interpretation: Informally, f(x, y) dx dy represents the probability that the random point (X, Y) lies in the small rectangle [x, x + dx] × [y, y + dy].

Example 1: Uniform point in the square

Consider two random variables X and Y with joint density f(x, y) = 1_{0≤x,y≤1}, i.e.

f(x, y) = 1 if (x, y) ∈ [0,1]², and f(x, y) = 0 if (x, y) ∉ [0,1]².

In this case, we have for every rectangle R = (a, a′) × (b, b′) ⊂ [0,1]²

P[(X, Y) ∈ R] = (a′ − a)(b′ − b) = Area(R),

and (X, Y) intuitively represents a uniform point in the square [0,1]².

Example 2: Uniform point in the disk

Let D = {(x, y) : x² + y² ≤ 1} be the disk of radius 1 around 0. Consider two random variables X and Y with joint density f(x, y) = (1/π) 1_{x²+y²≤1}, i.e.

f(x, y) = 1/π if x² + y² ≤ 1, and f(x, y) = 0 if x² + y² > 1.

In this case, we have for every rectangle R = (a, a′) × (b, b′) ⊂ D

P[(X, Y) ∈ R] = (1/π)(a′ − a)(b′ − b) = Area(R)/Area(D),

and (X, Y) intuitively represents a uniform point in the disk D.

Marginal densities. If X, Y possess a joint density f = fX,Y, then we have

P[X ≤ a] = P[X ∈ [−∞, a], Y ∈ [−∞,∞]] = ∫_{−∞}^{a} (∫_{−∞}^{∞} f(x, y) dy) dx,

and therefore X is continuous with density

fX(x) = ∫_{−∞}^{∞} f(x, y) dy.

Similarly, Y is continuous with density

fY(y) = ∫_{−∞}^{∞} f(x, y) dx.

Let us calculate the marginal densities in the two examples of joint densities we have seen above.


Example 1: Uniform point in the square

If fX,Y(x, y) = 1_{0≤x,y≤1}, then X has density

fX(x) = ∫_{0}^{1} 1_{0≤x≤1} 1_{0≤y≤1} dy = 1_{0≤x≤1},

and similarly fY(y) = 1_{0≤y≤1}. In other words, both X and Y are uniform random variables in [0,1].

Example 2: Uniform point in the disk

If fX,Y(x, y) = (1/π) 1_{x²+y²≤1}, then the density of X is

fX(x) = ∫_{−∞}^{∞} (1/π) 1_{x²+y²≤1} dy = ∫_{−√(1−x²)}^{√(1−x²)} (1/π) dy = (2/π) √(1 − x²) for x ∈ [−1,1],

and similarly fY(y) = (2/π) √(1 − y²).

Expectation of φ(X,Y )Let φ ∶ R2 → R. If X,Y have joint density fX,Y , then the expectation of the randomvariable Z = φ(X,Y ) can be calculated by the formula

E[φ(X,Y )] = ∫∞

−∞∫

−∞φ(x, y) ⋅ fX,Y (x, y)dxdy, (3.4)

whenever the integral is well defined.

Application: proof of linearity of the expectation for jointly continuous distributions.
For every λ, µ ∈ R, we have

E[λX + µY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (λx + µy) ⋅ fX,Y(x, y) dx dy

= λ ∫_{−∞}^{∞} x ( ∫_{−∞}^{∞} fX,Y(x, y) dy ) dx + µ ∫_{−∞}^{∞} y ( ∫_{−∞}^{∞} fX,Y(x, y) dx ) dy,

where the two inner integrals are equal to fX(x) and fY(y), respectively. Hence, we obtain

E[λX + µY] = λ E[X] + µ E[Y].
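Formula (3.4) and the linearity above can be illustrated by simulation. In the sketch below, the test functions φ(x, y) = xy and 2x + 3y are arbitrary illustrative choices; for a uniform point in [0,1]² the exact values are E[XY] = 1/4 and E[2X + 3Y] = 5/2:

```python
import random

random.seed(1)

n = 100_000
pts = [(random.random(), random.random()) for _ in range(n)]  # uniform in [0,1]^2

# Formula (3.4) with phi(x, y) = x*y, estimated by a sample mean; exact value 1/4.
e_xy = sum(x * y for x, y in pts) / n

# Linearity: E[2X + 3Y] = 2*E[X] + 3*E[Y] = 2.5.
e_lin = sum(2 * x + 3 * y for x, y in pts) / n

print(e_xy, e_lin)
```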

Independence for continuous random variables

The following theorem gives a useful characterization for the independence of continuous random variables.

Theorem 3.14. Let X,Y be two continuous random variables with respective densities fX and fY. The following are equivalent:

(i) X,Y are independent,


(ii) X,Y are jointly continuous with joint density fX,Y(x, y) = fX(x) ⋅ fY(y),

(iii) For every φ ∶ R→ R, ψ ∶ R→ R,

E[φ(X)ψ(Y )] = E[φ(X)]E[ψ(Y )]

whenever the expectations are well-defined.

Remark 3.15. An important consequence is that two independent continuous random variables are automatically jointly continuous.

Proof. (i) ⇒ (ii) If X and Y are independent, we have

P[X ∈ [a, a′], Y ∈ [b, b′]] = P[X ∈ [a, a′]] ⋅ P[Y ∈ [b, b′]]

= ∫_{a}^{a′} fX(x) dx ⋅ ∫_{b}^{b′} fY(y) dy

= ∫_{a}^{a′} ∫_{b}^{b′} fX(x) fY(y) dy dx.

Therefore, X and Y have joint density fX,Y(x, y) = fX(x) ⋅ fY(y).

(ii) ⇒ (iii) Applying the formula (3.4), we find

E[φ(X)ψ(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} φ(x) ψ(y) fX,Y(x, y) dx dy

= ∫_{−∞}^{∞} ∫_{−∞}^{∞} φ(x) ψ(y) fX(x) fY(y) dx dy   (using (ii))

= ∫_{−∞}^{∞} φ(x) fX(x) dx ⋅ ∫_{−∞}^{∞} ψ(y) fY(y) dy

= E[φ(X)] ⋅ E[ψ(Y)].

(iii) ⇒ (i) Let a ≤ a′ and b ≤ b′. By applying item (iii) to φ(x) = 1_{a≤x≤a′} and ψ(y) = 1_{b≤y≤b′}, we obtain

E[1_{a≤X≤a′} 1_{b≤Y≤b′}] = E[1_{a≤X≤a′}] E[1_{b≤Y≤b′}].

Then, using E[1_A] = P[A] for every event A, we obtain

P[a ≤ X ≤ a′, b ≤ Y ≤ b′] = P[a ≤ X ≤ a′] P[b ≤ Y ≤ b′],

which shows that X and Y are independent random variables.


Example 1: Uniform point in the square

If X and Y have joint density fX,Y(x, y) = 1_{0≤x,y≤1}, then

fX,Y(x, y) = 1_{0≤x≤1} 1_{0≤y≤1} = fX(x) ⋅ fY(y).

In other words, the two coordinates of a uniform random point in [0,1]² are independent.

Example 2: Uniform point in the disk

If X and Y have joint density fX,Y = (1/π) 1_D, then we have seen that fX(x) = (2/π)√(1 − x²) and fY(y) = (2/π)√(1 − y²), and therefore

fX,Y(x, y) ≠ fX(x) fY(y).

The two coordinates X and Y of a uniform point in D are not independent! This fact can easily be understood by looking at the event that X is larger than (say) √3/2. In this case, we have some information about Y, which is constrained to belong to [−1/2, 1/2].
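This dependence is easy to see on simulated points: among samples with X > √3/2, every Y falls in [−1/2, 1/2], while unconditionally |Y| ≤ 1/2 only happens with probability about 0.61. A minimal sketch (rejection sampling is an assumed implementation detail, not from the notes):

```python
import math
import random

random.seed(2)

def uniform_disk():
    # Rejection sampling for a uniform point in the unit disk D.
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return x, y

pts = [uniform_disk() for _ in range(100_000)]

# If X > sqrt(3)/2, then x^2 + y^2 <= 1 forces |Y| <= 1/2: knowing X constrains Y.
tail = [(x, y) for x, y in pts if x > math.sqrt(3) / 2]
assert all(abs(y) <= 0.5 for _, y in tail)

# Unconditionally, |Y| <= 1/2 has probability well below 1, so X and Y
# cannot be independent.
p_half = sum(1 for _, y in pts if abs(y) <= 0.5) / len(pts)
print(len(tail), p_half)
```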


Complement: asymptotic results

In this complement we fix a probability space (Ω, F, P) and an infinite sequence of i.i.d. random variables X1, X2, . . . In other words, we are given some random variables Xi ∶ Ω → R such that

∀ i1 < ⋯ < ik, ∀ x1, . . . , xk ∈ R:  P[X_{i1} ≤ x1, . . . , X_{ik} ≤ xk] = F(x1)⋯F(xk),

where F is the common distribution function. We are interested in the behavior (when n is large) of the random variable defined by

(X1(ω) + ⋯ + Xn(ω)) / n, (3.5)

sometimes called the empirical average.

Law of large numbers

Theorem 3.16. Assume that the expectation E[∣X1∣] is well defined and finite (for example if X1 is discrete or continuous and integrable). Defining m = E[X1], we have

lim_{n→∞} (X1 + ⋯ + Xn) / n = m a.s. (3.6)

What does Eq. (3.6) mean?

If we fix ω ∈ Ω, the values X1(ω)/1, (X1(ω) + X2(ω))/2, . . . simply define a sequence of real numbers. The properties of this sequence depend on the outcome ω. We are here interested in the ω’s for which the sequence (X1(ω) + ⋯ + Xn(ω))/n converges to m. More precisely, we consider the subset E ⊂ Ω defined by

E = {ω ∈ Ω ∶ lim_{n→∞} (X1(ω) + ⋯ + Xn(ω)) / n = m}.

One can check that E is an event, and Eq. (3.6) says that P[E] = 1.

Remark 3.17. In the statement of the theorem, it may be surprising that the assumptionand the definition of m are in terms of X1 only. Actually, since the random variables arei.i.d. we also have E[∣Xi∣] <∞ and m = E[Xi] for every i.


Example 1: Bernoulli random variables

If X1, X2, . . . is an infinite sequence of i.i.d. Bernoulli random variables with parameter p, then

lim_{n→∞} (X1 + ⋯ + Xn) / n = p a.s.

Example 2: Exponential random variables

If T1, T2, . . . is an infinite sequence of i.i.d. exponential random variables with parameter λ, then

lim_{n→∞} (T1 + ⋯ + Tn) / n = 1/λ a.s.,

since E[T1] = 1/λ.
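Both examples can be observed numerically with the standard library; the parameter values p = 0.3 and λ = 2 below are arbitrary illustrative choices:

```python
import random

random.seed(3)

n = 200_000
p, lam = 0.3, 2.0

# Empirical averages of n i.i.d. samples, which the LLN says converge a.s.
bern_avg = sum(1 if random.random() < p else 0 for _ in range(n)) / n  # -> p
exp_avg = sum(random.expovariate(lam) for _ in range(n)) / n           # -> E[T] = 1/lambda

print(bern_avg, exp_avg)
```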

Central limit theorem

A question of fluctuations

The law of large numbers tells us that for large n, the empirical average (3.5) is close to the expectation m = E[X1]. A second very natural question to ask is:

How far is (X1 + ⋯ + Xn)/n from m, typically?

The Gaussian case

Let us first look at the very instructive case when X1, X2, . . . is a sequence of i.i.d. normal random variables with parameters m and σ². Then the results we have seen on normal random variables tell us that

Z = (X1 + ⋯ + Xn)/n − m

is again a normal random variable, with parameters 0 and σ²/n. Its standard deviation σ/√n represents the typical fluctuations of Z. Roughly, one can say that the typical distance between (X1 + ⋯ + Xn)/n and m is of order σ/√n.

In this context, it is more natural to rescale Z by a factor √n/σ in order to get fluctuations of order 1: using again the properties of normal distributions, we see that

(√n/σ) Z = (X1 + ⋯ + Xn − n⋅m) / √(σ²n)

is a standard normal. As a conclusion, if we consider i.i.d. normal random variables with expectation m and variance σ², then the random variable

(X1 + ⋯ + Xn − n⋅m) / √(σ²n)


corresponds to a rescaled version of the fluctuations of (X1 + ⋯ + Xn)/n and is a standard normal.

General case: the central limit theorem

If X1, X2, . . . are not normal, it is in general not easy to compute the law of

(X1 + ⋯ + Xn − n⋅m) / √(σ²n).

Nevertheless, the central limit theorem tells us that this random variable always gets close to a standard normal if n is large.

Theorem 3.18 (Central limit theorem). Assume that the expectation E[X1²] is well defined and finite. Defining m = E[X1], σ² = Var(X1) and Sn = X1 + ⋯ + Xn, we have for every a ∈ R

P[(Sn − n⋅m) / √(σ²n) ≤ a] → Φ(a) = (1/√(2π)) ∫_{−∞}^{a} e^{−x²/2} dx   as n → ∞.
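The convergence can be illustrated with Bernoulli summands, for which m = p and σ² = p(1 − p). The sketch below estimates P[(Sn − n⋅m)/√(σ²n) ≤ 1] by simulation and compares it with Φ(1); the sample sizes are arbitrary illustrative choices:

```python
import math
import random

random.seed(4)

# X_i ~ Bernoulli(p): m = p, sigma^2 = p * (1 - p)
p, n, trials = 0.3, 500, 5_000
m, var = p, p * (1 - p)

def standardized_sum():
    # One sample of (S_n - n*m) / sqrt(sigma^2 * n)
    s = sum(1 if random.random() < p else 0 for _ in range(n))
    return (s - n * m) / math.sqrt(var * n)

a = 1.0
freq = sum(1 for _ in range(trials) if standardized_sum() <= a) / trials
phi_a = 0.5 * (1 + math.erf(a / math.sqrt(2)))  # Phi(a) via the error function

print(freq, phi_a)  # freq should be close to Phi(1)
```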


Chapter 4

Introduction to statistics

Let us suppose we are given a probabilistic model, and we observe some data x1, . . . , xn produced by this model:

Probabilistic model  --produces-->  Data x1, . . . , xn

One goal in statistics is to try to describe some features of the probabilistic model from the analysis of the data.

For example, one can consider n = 100 independent coin flips. The corresponding probabilistic model is given by n i.i.d. Bernoulli random variables. Some typical questions that a statistician would aim to answer are:

• Is the coin fair? I.e., is the parameter p of the Bernoulli r.v.’s equal to 1/2?

• What is the parameter p of the underlying Bernoulli random variables?

4.1 Estimation of parameters

Framework: We study a parametric model. Namely, the probabilistic model we study depends on one real parameter θ, or on several parameters θ1, θ2, . . . In this chapter the models will always be given by a sequence of i.i.d. random variables, but in general the models may involve random variables which are not independent, and the computations are in general more complicated.

Some examples of discrete models:

• X1, . . . ,Xn i.i.d. Bernoulli random variables with parameter p ∈ [0,1],

• X1, . . . ,Xn i.i.d. geometric random variables with parameter p ∈ [0,1],

• X1, . . . ,Xn i.i.d. Poisson random variables with parameter λ > 0.


Some examples of continuous models:

• X1, . . . ,Xn i.i.d. uniform random variables in [0, θ], θ ≥ 0,

• X1, . . . ,Xn i.i.d. exponential random variables with parameter λ > 0,

• X1, . . . ,Xn i.i.d. normal random variables with parameters m ∈ R and σ2 > 0.

Terminology: A realization of the model is a vector x = (x1, . . . , xn) ∈ Rⁿ of possible values for (X1, . . . ,Xn).

Main question: We observe a realization (x1, . . . , xn) of our model. Can one estimate the underlying parameter(s) of the model?

4.1.1 Maximum likelihood estimator for discrete models

We consider X1, . . . ,Xn i.i.d. discrete random variables with values in a countable set E, and with distribution (pθ(x)) depending on some parameter θ ∈ R. Namely, for every i,

∀x ∈ E  P[Xi = x] = pθ(x).

If we observe a realization (x1, . . . , xn), what can we say about the parameter? Here our goal is to define some number θ̂(x1, . . . , xn) which would be a good prediction for the parameter θ.

Definition 4.1. Let x = (x1, . . . , xn) ∈ Eⁿ be a possible realization for (X1, . . . ,Xn). The likelihood function of x is defined by

L(θ) = Lx(θ) = P[X1 = x1, . . . ,Xn = xn].

The likelihood of x corresponds to the probability of observing the realization x = (x1, . . . , xn) when the underlying parameter is θ. In probabilistic terms, it corresponds to the joint distribution of X1, . . . ,Xn. Here, since we only consider i.i.d. random variables, the likelihood function always takes the special form

Lx(θ) = P[X1 = x1]⋯P[Xn = xn] = pθ(x1)⋯pθ(xn).

Definition 4.2. The maximum likelihood estimator for the realization x = (x1, . . . , xn) is defined as the parameter θ̂ = θ̂(x1, . . . , xn) for which the likelihood is maximal, i.e.

Lx(θ̂) = max_θ Lx(θ).


Example 1: Bernoulli model

Let X1, . . . ,Xn be i.i.d. Bernoulli random variables with parameter p. A possible realization is a vector x = (x1, . . . , xn) ∈ {0,1}ⁿ. The likelihood function for this model is equal to

L(p) = P[X1 = x1, ⋯, Xn = xn] = p^∣x∣ (1 − p)^(n−∣x∣), where ∣x∣ = ∑_i xi.

Now, we want to find p̂ such that

L(p̂) = max_{0≤p≤1} L(p).

Due to the special form of the likelihood function, it is easier to study

φ(p) = log(L(p)) = ∣x∣ log p + (n − ∣x∣) log(1 − p).

This trick of taking the logarithm is very useful in most of the applications we will see, because we study i.i.d. models and the likelihood function always takes the form of a product. Calculating the derivative gives

φ′(p) = ∣x∣/p − (n − ∣x∣)/(1 − p).

A simple analysis of the function finally shows that φ has a unique maximum, and we find that the maximum likelihood estimator for the parameter p is

p̂(x1, . . . , xn) = (x1 + ⋯ + xn)/n.
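As a sanity check, one can simulate Bernoulli data and verify that the closed-form estimator p̂ = ∣x∣/n indeed maximizes the log-likelihood φ over a grid of candidate values (the sample size and true parameter below are arbitrary illustrative choices):

```python
import math
import random

random.seed(5)

n, p_true = 1_000, 0.7
xs = [1 if random.random() < p_true else 0 for _ in range(n)]
k = sum(xs)  # |x|, the number of ones

def log_likelihood(p):
    # phi(p) = |x| log p + (n - |x|) log(1 - p)
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Closed-form maximum likelihood estimator: the empirical mean.
p_hat = k / n

# The log-likelihood on a grid is maximized at the grid point closest to p_hat.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=log_likelihood)

print(p_hat, best)
```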

4.1.2 Maximum likelihood estimator for continuous models

We consider X1, . . . ,Xn i.i.d. continuous random variables with density fθ depending on some parameter θ ∈ R (the model may depend on more parameters θ1, θ2, . . .). Our goal is to define a natural estimator for the parameter θ.

Definition 4.3. Let x = (x1, . . . , xn) ∈ Rⁿ be a possible realization for (X1, . . . ,Xn). The likelihood function of x is defined by

L(θ) = Lx(θ) = fθ(x1)⋯fθ(xn).

Remark 4.4. Intuitively, fθ(x1)⋯fθ(xn) represents the probability that (X1, . . . ,Xn) ≃ (x1, . . . , xn). The fact that the likelihood function is written as a product is due to the fact that we are studying an i.i.d. model; more generally, one needs to consider the joint density of the vector (X1, . . . ,Xn).

Similarly to the discrete case, one can define the maximum likelihood estimator as follows.


Definition 4.5. The maximum likelihood estimator for the realization x = (x1, . . . , xn) is defined as the parameter θ̂ = θ̂(x1, . . . , xn) for which the likelihood is maximal, i.e.

Lx(θ̂) = max_θ Lx(θ).

Remark 4.6. When the model involves more parameters θ1, θ2, . . ., the likelihood function is of the form Lx(θ1, θ2, . . .) and the maximum likelihood estimators for a realization x = (x1, . . . , xn) are defined to be the parameters θ̂1, θ̂2, . . . for which the likelihood is maximal, i.e.

Lx(θ̂1, θ̂2, . . .) = max_{θ1,θ2,...} Lx(θ1, θ2, . . .).

Example 1: normal model

Let X1, . . . ,Xn be i.i.d. normal random variables with parameters m and σ². A possible realization is a vector x = (x1, . . . , xn) ∈ Rⁿ. The likelihood function for this model is equal to

L(m, σ²) = (2πσ²)^(−n/2) exp( − ∑_{i=1}^{n} (xi − m)² / (2σ²) ).

Now, we want to find m̂ and σ̂² such that

L(m̂, σ̂²) = max_{m∈R, σ²>0} L(m, σ²).

We are looking for the critical points of the function log(L), and calculate its partial derivatives:

∂ log L / ∂m = (1/σ²) ∑_{i=1}^{n} (xi − m),

∂ log L / ∂σ² = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^{n} (xi − m)².

A simple analysis of the function finally shows that L has a unique maximum, and we find that the maximum likelihood estimators for m and σ² are

m̂(x1, . . . , xn) = (x1 + ⋯ + xn)/n,

and

σ̂²(x1, . . . , xn) = (1/n) ∑_{i=1}^{n} (xi − m̂(x1, . . . , xn))².
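The two closed-form estimators can be tried on simulated normal data (the true parameter values below are arbitrary illustrative choices; note the divisor n rather than n − 1 in σ̂²):

```python
import random

random.seed(6)

n, m_true, var_true = 5_000, 2.0, 2.25
xs = [random.gauss(m_true, var_true ** 0.5) for _ in range(n)]

# Maximum likelihood estimators from the critical-point equations:
m_hat = sum(xs) / n
var_hat = sum((x - m_hat) ** 2 for x in xs) / n  # divisor n, not n - 1

print(m_hat, var_hat)
```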

4.2 Confidence intervals

In the previous section, we gave a method to obtain formulas estimating unknown parameters. A natural question to ask is: how good are these estimators? For example,


imagine that I throw a coin n = 100 times without knowing the probability p that the coin falls on heads. If I obtain 70 heads, the maximum likelihood estimator for p is p̂ = 0.7. How far is p̂ from the true value p? In order to answer this type of question, we introduce the notion of confidence intervals.

Definition 4.7. Consider a probabilistic model with underlying parameter θ. For 0 ≤ z ≤ 100, a z%-confidence interval for θ associates to each realization x = (x1, . . . , xn) an interval I = [a(x), b(x)] ⊂ R such that

P[a(X1, . . . ,Xn) ≤ θ ≤ b(X1, . . . ,Xn)] ≥ z/100. (4.1)

Remark 4.8. In Eq. (4.1), the parameter θ is not random. The random elements are thebounds a(X1, . . . ,Xn) and b(X1, . . . ,Xn).

Example 1: An exact confidence interval for the normal model with variance 1

Let X1, . . . ,Xn be n i.i.d. normal random variables with parameters m and σ² = 1. In other words, we consider a normal model where the variance is known and equal to σ² = 1, but the parameter m of the normal is not known. One can check that the maximum likelihood estimator for m is equal to

m̂(x1, . . . , xn) = (x1 + ⋯ + xn)/n.

We are looking for a confidence interval for m of the form

I(x) = [m̂(x) − c/√n, m̂(x) + c/√n],

where c > 0 is a constant independent of n. First observe that

P[m̂(X1, . . . ,Xn) − c/√n ≤ m ≤ m̂(X1, . . . ,Xn) + c/√n] = P[−c ≤ Z ≤ c],

where Z = (X1 + ⋯ + Xn − nm)/√n is a standard normal random variable. Hence the last probability can be calculated as

P[−c ≤ Z ≤ c] = P[Z ≤ c] − P[Z < −c] = 2Φ(c) − 1,

where Φ(c) = (1/√(2π)) ∫_{−∞}^{c} e^{−x²/2} dx and we used P[Z < −c] = 1 − P[Z ≤ c]. It can be checked from a table that 2Φ(1.96) − 1 ≥ 0.95. Therefore, by choosing c = 1.96 in the computation above, we obtain

P[m̂(X1, . . . ,Xn) − 1.96/√n ≤ m ≤ m̂(X1, . . . ,Xn) + 1.96/√n] ≥ 95/100.


This shows that

I(x) = [m̂(x) − 1.96/√n, m̂(x) + 1.96/√n]

is a 95%-confidence interval for m.

What does it mean?

Imagine you are making n measurements of a physical quantity. For example, you measure the temperature at which water boils (in the physical conditions of your room). You know from the properties of your thermometer that each measurement can be modeled by a normal N(m, 1), where m is the temperature at which the water boils. You measure some temperatures x1 = 99.2, x2 = 98.7, ⋯. After n = 100 successive experiments you compute m̂(x) = (x1 + ⋯ + xn)/n = 99.106. The confidence interval that we established above tells you that the real value m has a 95% chance of belonging to the interval

[99.106 − 0.196, 99.106 + 0.196] = [98.910, 99.302].
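The frequentist meaning of the 95% guarantee is a statement about repetitions of the whole experiment: if the n measurements are repeated many times, the random interval covers the true m in about 95% of the repetitions. A minimal simulation sketch (parameter values are illustrative):

```python
import math
import random

random.seed(7)

n, m_true, trials, c = 100, 99.0, 2_000, 1.96
covered = 0
for _ in range(trials):
    xs = [random.gauss(m_true, 1.0) for _ in range(n)]  # N(m, 1) measurements
    m_hat = sum(xs) / n
    # Does the interval [m_hat - c/sqrt(n), m_hat + c/sqrt(n)] contain m?
    if m_hat - c / math.sqrt(n) <= m_true <= m_hat + c / math.sqrt(n):
        covered += 1

coverage = covered / trials
print(coverage)  # should be close to 0.95
```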

4.3 Statistical tests

Let us start with a small story. Eddy, a young statistician and cyclist, is given two bags of coins. The first bag contains fair coins, and the second bag contains bad coins (with probability 0.7 of falling on heads). Eddy is asked to bring the first bag to a casino, and to throw the second bag away. On his bike trip to the casino, Eddy is unfortunate: he is hit by a car and all the coins of the two bags get mixed together... Luckily Eddy is not too injured, but he is now facing a statistics problem: which coins should he bring to the casino? For each coin, he needs to decide whether it is fair or bad. He picks the first coin and throws it 10 times; depending on the results of these 10 experiments, he needs to decide whether he brings the coin to the casino or to the rubbish bin.

The goal of this section is to introduce the notion of test, which can be used to solvethe problem of Eddy, and we will use the story of Eddy as an illustrative example.

In the theory of testing, we have general hypotheses on the probabilistic model, and the goal of a test is to accept or reject these hypotheses by analyzing some realization of the model.

Framework: In this class we consider the simplified framework of the previous sections. We have n i.i.d. random variables X1, . . . ,Xn whose distribution depends on some parameter θ. Here we consider two simple hypotheses of the form

H0 ∶ θ = θ0,

H1 ∶ θ = θ1,

where θ0, θ1 are two possible values for the parameter. As we will see, the hypotheses H0 and H1 play different roles:


• The hypothesis H0 is called the null hypothesis,

• The hypothesis H1 is called the alternative hypothesis.

Given a certain realization of our model x = (x1, . . . , xn), a test aims to decide whether the hypothesis H0 is accepted or rejected in favor of H1. More formally, a test is given by a function

d = d(x1, . . . , xn) ∈ {0, 1},

where d(x) = 0 corresponds to the fact that the hypothesis H0 is accepted, and d(x) = 1 corresponds to the fact that the hypothesis is rejected.

When setting up a test, two types of errors can occur:

Type I error: reject H0 when it is true.

This type of error is controlled by the relevance level of the test, defined by

α = PH0[d(X1, . . . ,Xn) = 1] = PH0[reject H0],

where we use the notation PH0 to denote the probability measure when the hypothesis H0 is satisfied (i.e. when the parameter θ is equal to θ0).

Type II error: accept H0 when H1 is true.

This type of error, denoted β, gives rise to the power of the test, defined by

1 − β = 1 − PH1[d(X1, . . . ,Xn) = 0] = 1 − PH1[accept H0].

The way we choose the test and the hypotheses depends on the situation, but in general, we have the following principles:

• The main goal is to avoid the type I error, controlled by the relevance level α.

• A secondary goal is to avoid type II errors. Namely, once we have fixed the relevance level, we are looking for the most powerful test.

The example of Eddy’s problem

The probabilistic model is given by 10 i.i.d. Bernoulli random variables with parameter p:

X1, . . . ,X10.

One needs to choose the two hypotheses for the test. In his decision, the first priority of Eddy is to avoid bringing bad coins to the casino. In other words, he wishes to reject the


hypothesis that the coin is bad. Eddy will also avoid throwing fair coins away, but this is a second priority. He chooses the null and alternative hypotheses in this sense:

H0 ∶ p = 0.7,

H1 ∶ p = 0.5.

When H0 is rejected, Eddy concludes that the coin is not bad and brings it to the casino. When H0 is accepted, he throws the coin away. This way, the type I error corresponds to bringing a bad coin to the casino, and the type II error corresponds to throwing a good coin away.

Setting up a test with likelihood ratios

In our framework, the distribution of X1, . . . ,Xn is completely determined under the hypothesis H0 or H1. Such hypotheses are called simple. In this context, one can set up a test using the method of likelihood ratios, which we now present. We assume here that the random variables are discrete, but the same approach can be developed if the random variables are continuous, by replacing the discrete likelihood functions with their continuous analogues.

We fix a threshold c > 0. For each possible realization x = (x1, . . . , xn), we calculate the ratio

r(x) = PH0[X1 = x1, . . . ,Xn = xn] / PH1[X1 = x1, . . . ,Xn = xn] = Lx(θ0)/Lx(θ1).

We reject H0 if r(x) ≤ c. In other words, we define

d(x) = { 0 if r(x) > c,
         1 if r(x) ≤ c. (4.2)

The relevance level of the test is equal to

α = PH0[r(X1, . . . ,Xn) ≤ c], (4.3)

and the power of the test is

1 − β = 1 − PH1[r(X1, . . . ,Xn) > c] = PH1[r(X1, . . . ,Xn) ≤ c].

The Neyman-Pearson lemma shows that basing the test on the likelihood ratio is optimal:

Theorem 4.9 (Neyman-Pearson lemma). Consider a statistical framework with two simple hypotheses H0 and H1. Among all the tests with relevance level α, the likelihood ratio test is the most powerful.

Proof. Let c > 0 and let d be the decision function associated to the likelihood ratio test as defined in (4.2). Let α be the relevance level of the test, see (4.3). Under H0, the random variable d(X1, . . . ,Xn) is a Bernoulli random variable with parameter α, which implies that

α = EH0[d(X1, . . . ,Xn)].

Now let d⋆ be another test with relevance level at most α. As above, this can be written as

α ≥ EH0[d⋆(X1, . . . ,Xn)].

Our goal is to show that the test d⋆ is less powerful, i.e.

EH1[d⋆(X1, . . . ,Xn)] ≤ EH1[d(X1, . . . ,Xn)].

We first observe that for every x = (x1, . . . , xn),

d⋆(x)(c PH1[X = x] − PH0[X = x]) ≤ d(x)(c PH1[X = x] − PH0[X = x]).

To see this, distinguish whether r(x) ≤ c or r(x) > c. In the first case, we have d(x) = 1 and c PH1[X = x] − PH0[X = x] ≥ 0, and the inequality follows from d⋆(x) ≤ 1. In the second case, we have d(x) = 0 and c PH1[X = x] − PH0[X = x] ≤ 0, and the inequality follows from d⋆(x) ≥ 0. If we sum the inequality above over all the possible realizations x, we get

c EH1[d⋆(X1, . . . ,Xn)] − EH0[d⋆(X1, . . . ,Xn)] ≤ c EH1[d(X1, . . . ,Xn)] − EH0[d(X1, . . . ,Xn)],

which can be rewritten as

EH1[d(X1, . . . ,Xn)] − EH1[d⋆(X1, . . . ,Xn)] ≥ (1/c)(EH0[d(X1, . . . ,Xn)] − EH0[d⋆(X1, . . . ,Xn)]).

This concludes the proof, because the right-hand side is nonnegative.

Back to Eddy’s problem

Let us first look at the likelihood when the coin is bad (p = 0.7) or fair (p = 0.5). For a bad coin, the likelihood of x = (x1, . . . , x10) is equal to

PH0[X = x] = Lx(0.7) = C(10, ∣x∣) 0.7^∣x∣ 0.3^(10−∣x∣),

and for a fair coin

PH1[X = x] = Lx(0.5) = C(10, ∣x∣) 0.5^10.

These likelihoods correspond to the distributions of binomial random variables with parameters n = 10 and p = 0.7 for the bad coin (resp. p = 0.5 for the fair one). The ratio r(x) = Lx(0.7)/Lx(0.5) is equal to

r(x) = 0.6^10 (7/3)^∣x∣,

and one can represent these values in a table (the second and third rows represent the likelihood for the bad coin and the fair coin respectively):


∣x∣          0      1      2      3      4      5      6      7      8      9      10
bad coin   .0000  .0001  .0014  .0090  .0368  .1029  .2001  .2668  .2335  .1211  .0282
fair coin  .0010  .0098  .0439  .1172  .2051  .2461  .2051  .1172  .0439  .0098  .0010
ratio r(x) .0060  .0141  .0329  .0768  .1792  .4182  .9758  2.277  5.313  12.40  28.93

If Eddy chooses c = 1, then the test consists in rejecting H0 if ∣X∣ ≤ 6. In this case, the relevance level of the test is

α = PH0[reject H0] = PH0[∣X∣ ≤ 6] ≃ .3504

and the power of the test is

1 − β = PH1[reject H0] = PH1[∣X∣ ≤ 6] ≃ .8281.

In other words, with this choice, 35% of the bad coins will end up at the casino, but 83% of the fair coins will be brought to the casino.

If Eddy chooses c = 0.1, then the test consists in rejecting H0 if ∣X∣ ≤ 3. In this case, the relevance level of the test is

α = PH0[reject H0] = PH0[∣X∣ ≤ 3] ≃ .0106

and the power of the test is

1 − β = PH1[reject H0] = PH1[∣X∣ ≤ 3] ≃ .1719.

With this new choice, only about 1% of the bad coins will end up at the casino, but only 17% of the fair coins will be brought to the casino.