MAS113 Introduction to Probability and Statistics

1 Introduction

1.1 Studying probability theory

There are (at least) two ways to think about the study of probability theory:

1. Probability theory is a branch of Pure Mathematics. We start with some axioms and investigate the implications of those axioms; what results are we able to prove? Probability theory is a subject in its own right, that we can study (and appreciate!) without concerning ourselves with any possible applications. We can also observe links to other areas of mathematics.

2. Probability theory gives us a mathematical model for explaining chance phenomena that we observe in the real world. As with any mathematical theory, it may not describe reality perfectly, but it can still be extremely useful. We must think carefully about when and how the theory can be applied in practice.

We will be studying probability from both perspectives. Applications of probability theory are important and interesting, but as a mathematician, you should also develop an understanding and an appreciation of the theory in its own right.


1.2 Motivation

We are often faced with making decisions in the presence of uncertainty.

• A doctor is treating a patient, and is considering whether to prescribe a particular drug. How likely is it that the drug will cure the patient?

• A probation board is considering whether to release a prisoner early on parole. How likely is the prisoner to re-offend?

• A bank is considering whether to approve a loan. How likely is it that the loan will be repaid?

• A government is considering a CO2 emissions target. What would the effect of a 20% cut in emissions be on global mean temperatures in 20 years' time?

We often use verbal expressions of uncertainty ("possible", "quite unlikely", "very likely"), but these are often inadequate if we want to communicate with each other about uncertainty. Consider the following example.

You are being screened for a particular disease. You are told that the disease is "quite rare", and that the screening test is accurate but not perfect; if you have the disease, the test will "almost certainly" detect it, but if you don't have the disease, there is "a small chance" the test will mistakenly report that you have it anyway.

The test result is positive. How certain are you that you really have the disease?

Clearly, in this example and in many others, it would be useful if we could quantify our uncertainty. In other words, we would like to measure how likely


it is that something will happen, or how likely it is that some statement about the world turns out to be true.

In this course we introduce a theory for measuring uncertainty: probability theory. We have two ingredients: set theory, which we will use to describe what sort of things we want to measure our uncertainty about, and, appropriately, measure theory, which sets out some basic rules for how to measure things in a sensible way.

1.3 Motivation - the need for set theory and measures

If you have studied probability at GCSE or A-level, you may have seen a definition of probability like this:

Suppose all the outcomes in an experiment are equally likely. The probability of an event A is defined to be

P(A) := (number of outcomes in which A occurs)/(total number of possible outcomes). (1)

Example 1. A deck of 52 playing cards is shuffled thoroughly. What is the probability that the top card is an ace?

The 'definition' can work for examples such as this (with a finite number of equally likely outcomes), but we soon run into difficulties:

Example 2. Imagine generating a random angle between 0 and 2π radians, where any angle is equally likely. (For example, consider spinning a roulette wheel, and measuring the angle between the horizontal axis (viewing the wheel from above) and a line from the centre of the wheel that bisects the zero on the wheel.) What is the probability that the angle is between 0 and √2 radians?


You might guess that the answer should be √2/(2π), but the definition in (1) causes us problems! There are infinitely many possible angles that could be generated, and there are infinitely many angles between 0 and √2 radians. Clearly, we can't divide infinity by infinity and come up with √2/(2π).
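The value √2/(2π) can be checked numerically. The following sketch (not part of the notes; the simulation approach is an illustrative assumption) estimates the probability in Example 2 by generating many uniform random angles and comparing the observed proportion with √2/(2π):

```python
import math
import random

# Illustrative sketch: estimate the probability in Example 2 by simulation,
# and compare with the value sqrt(2)/(2*pi) suggested above.
random.seed(0)

n_trials = 200_000
hits = sum(1 for _ in range(n_trials)
           if random.uniform(0, 2 * math.pi) <= math.sqrt(2))

estimate = hits / n_trials
# exact value: length of [0, sqrt(2)] divided by length of [0, 2*pi]
exact = math.sqrt(2) / (2 * math.pi)

print(f"estimate = {estimate:.4f}, exact = {exact:.4f}")
```

With this many trials the estimate typically lands within about 0.01 of the exact value 0.2251.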

It may not make much sense to think of outcomes as equally likely:

Example 3. What is the probability that a team from Manchester will win the Premier League this season?

We will resolve this by constructing a more general definition of probability, using the tools of set theory and a special type of function called a measure, and we'll show how choosing different types of measure will give us sensible answers in both the above examples.

2 Set theory and probability

Set theory is covered in more detail in MAS110; in this module we consider set theory in the context of probability.

We consider uncertainty in the context of an experiment, where we use the word experiment in a loose sense to mean observing something in the future, or discovering the true status of something that we are currently uncertain about. In an experiment, suppose we want to consider how likely some particular outcome is. We might start by considering what all the possible outcomes of the experiment are. We can use a set to list all the possible outcomes.

Definition 1. A sample space is a set which lists all the possible outcomes of an 'experiment'.


In set theory, we sometimes work with a universal set S, which lists all the elements we wish to consider for the situation at hand. (Note that, in spite of the word "universal", the choice of S depends on context.) In the context of probability, the sample space will play the role of the universal set S.

Example 4. Examples of sample spaces

Definition 2. An event is a subset of a sample space. If the outcome of the experiment is a member of the event, we say that the event has occurred.

Example 5. Examples of events

2.1 Set operations and events

Given a sample space S and two subsets/events A and B, the set operations union, intersection, complement and difference all define further events.

The union A ∪ B corresponds to either A occurring or to B occurring (or to both occurring). We can visualise this using a Venn diagram (see Figure 1). The rectangle represents the universal set, the two circles represent the subsets A and B, and the shaded area represents the set A ∪ B.

Figure 1: A Venn diagram, with the shaded region showing A ∪ B.


The intersection A ∩ B corresponds to both A and B occurring.

Figure 2: A Venn diagram, with the shaded region showing A ∩ B.

If there are no elements in S that are both in A and in B, then A ∩ B = ∅. In this case we say that A and B are mutually exclusive.

Figure 3: A Venn diagram with mutually exclusive A and B.

The complement of A, which we will write Ā, corresponds to A not occurring. (Alternative notation: Ā is also written as Aᶜ and A′.)


Figure 4: A Venn diagram, with the shaded region showing Ā.

Note that the complement of ∅ is S, and that the event ∅ is one which cannot happen, as it contains no elements.

The set difference A \ B corresponds to A occurring but B not.

Figure 5: A Venn diagram, with the shaded region showing A \ B.

Note that A \ B = A ∩ B̄.

Theorem 1. (De Morgan’s Laws) For any two sets A and B,

the complement of A ∪ B is Ā ∩ B̄, and the complement of A ∩ B is Ā ∪ B̄.

You will prove these results as an exercise in MAS110.
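Although the proof is left to MAS110, De Morgan's laws are easy to check numerically on a small example. The sample space and events below are illustrative choices, not taken from the notes:

```python
# Illustrative sketch: verify De Morgan's laws with Python sets.
S = set(range(1, 11))          # universal set {1, ..., 10}
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

complement = lambda X: S - X   # the complement of X, relative to S

# complement of a union = intersection of complements
assert complement(A | B) == complement(A) & complement(B)
# complement of an intersection = union of complements
assert complement(A & B) == complement(A) | complement(B)
print("De Morgan's laws hold for this example")
```

A check like this is not a proof, of course: it only confirms the laws for one particular choice of S, A and B.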

Example 6. Combinations of events


2.2 The difference between an outcome and an event

Let’s clarify the difference between an “outcome” and an “event”. An outcome is an element of the sample space S, whereas an event is a subset of S. Consider the following example.

Suppose Spain play Germany at football this evening. The possible outcomes here can be thought of as ordered pairs of non-negative integers, e.g. (2, 1) representing the outcome that Spain win by 2 goals to 1. So the sample space is the set of such pairs, N0 × N0.

• Before the match, you can bet on an event, e.g. “Either Spain win or draw”.

• Tomorrow, a news reporter will tell you the outcome, e.g. “Spain won 2-1”. The news reporter would not report the event “either Spain won or the match was drawn last night”.

• If the observed outcome belongs to the event “either Spain win or draw”, the event has occurred, and you have won your bet.

• An event could be a single outcome (you could bet on “Spain win 2-1”).

For a sample space S, outcome a and event A, we would use the notation a ∈ S and A ⊂ S. Make sure you understand when to use the symbol ∈ and when to use the symbol ⊂ or ⊆.
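The ∈ versus ⊆ distinction maps directly onto code. Here is a small illustrative sketch of the football example; truncating the sample space to scores 0–10 is an assumption made so the event is a finite set:

```python
# Illustrative sketch: an outcome is an element of S; an event is a set of outcomes.
outcome = (2, 1)   # the outcome "Spain win 2-1"

# the event "either Spain win or draw", truncated to scores 0-10 for illustration
spain_win_or_draw = {(a, b) for a in range(11) for b in range(11) if a >= b}

# membership test: the event has occurred iff the outcome belongs to the event
occurred = outcome in spain_win_or_draw
print(occurred)   # True
```

Note the asymmetry: `outcome in spain_win_or_draw` is the ∈ test, whereas asking whether one event implies another would be a subset (⊆) test between two sets.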


3 Measure

3.1 Set functions

We can define set functions that take sets as their inputs and produce real numbers as outputs.

Not all set functions make sense for all possible sets, so when defining a set function, we’ll first say what the sample space is, and then define set functions that operate on subsets of a sample space.

Example 7. Examples of set functions

Formally, we can think of a set function as a function whose domain is a set whose elements are subsets of S.

3.2 Measures

We’re close to being able to give a formal definition of probability: we’re going to define probability as a special type of set function that we apply to a set. We first define a measure as a special type of set function.

Definition 3. Given a sample space (or universal set) S, a (finite) measure is a set function that assigns a non-negative real number m(A) to a set A ⊆ S, which satisfies the following condition: for any two disjoint subsets A and B


of S (i.e. A ∩ B = ∅),

m(A ∪ B) = m(A) + m(B). (2)

Note: This is a simplified definition of measure which is sufficient for this module. There are further technical conditions in the full definition of measure, which we don’t need to concern ourselves with here. For more details, see the supplementary online notes.

Informally, the above definition can be put into words as follows. A measure is something which assigns a number to a set, produces a non-negative number as its output, and ‘behaves additively’: if we take two non-overlapping (disjoint) sets, we can either combine the sets and then apply the measure, or apply the measure to each set and then sum the results, and we will get the same value either way.

We can extend the condition in (2) to finite unions of disjoint sets. If three sets A, B and C are all mutually disjoint (A ∩ B = A ∩ C = B ∩ C = ∅), and we define D = B ∪ C, then

m(A ∪ B ∪ C) = m(A ∪ D)
             = m(A) + m(D)
             = m(A) + m(B ∪ C)
             = m(A) + m(B) + m(C).
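The extension above can be checked with the counting measure m(A) = |A| (the measure used for classical probability later in the notes); the particular sets below are illustrative:

```python
# Illustrative sketch: the counting measure m(A) = |A| satisfies the
# additivity rule (2), and hence the three-set extension derived above.
m = len

A, B, C = {1, 2}, {3, 4, 5}, {6}   # mutually disjoint sets

# measures of disjoint sets add up
assert m(A | B) == m(A) + m(B)                  # two-set rule (2)
assert m(A | B | C) == m(A) + m(B) + m(C)       # 6 == 2 + 3 + 1
print(m(A | B | C))
```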

Example 8. Examples of measures

Theorem 2. (Properties of measures)

Let m be a measure on some set S, with A,B ⊆ S.

(M1) If B ⊆ A, then m(A \ B) = m(A) − m(B).


More generally, m(A \ B) = m(A \ (A ∩ B)) = m(A) − m(A ∩ B).

(M2) If B ⊆ A, then m(B) ≤ m(A).

(M3) m(∅) = 0.

(M4) m(A ∪ B) = m(A) + m(B) − m(A ∩ B).

(M5) For any constant c > 0, the function g(A) = c × m(A) is also a measure.

(M6) If m and n are both measures, the function h(A) = m(A) + n(A) is also a measure.
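A quick numerical check of some of these properties for the counting measure, with illustrative sets (again, a spot check on one example, not a proof):

```python
# Illustrative sketch: checking (M4), (M5) and (M6) for the counting measure.
m = len
A = {1, 2, 3, 4}
B = {3, 4, 5}

# (M4): inclusion-exclusion
assert m(A | B) == m(A) + m(B) - m(A & B)       # 5 == 4 + 3 - 2

# (M5): a positive multiple of a measure is a measure
# (check additivity on a pair of disjoint sets)
g = lambda X: 2 * m(X)
D, E = {1}, {2, 3}
assert g(D | E) == g(D) + g(E)

# (M6): the sum of two measures is a measure
h = lambda X: m(X) + g(X)
assert h(D | E) == h(D) + h(E)
print("(M4), (M5), (M6) verified on this example")
```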

3.3 Measures on partitions of sets

Definition 4. A partition of a set S is a collection of sets E = {E1, E2, . . . , En} such that

1. Ei ∩ Ej = ∅, for all 1 ≤ i, j ≤ n with i ≠ j;

2. E1 ∪ E2 ∪ . . . ∪ En = S.

We say that {E1, E2, . . . , En} are mutually exclusive and exhaustive. If S is a sample space, one of the events in E must occur, but no two events can occur simultaneously.

From (2), we have

m(S) = ∑_{i=1}^{n} m(Ei).


Figure 6: An example of a partition with 5 sets.

4 Probability as measure

Definition 5. Suppose S is a sample space and P a measure on S. We say that P is a probability (or probability measure) if P(S) = 1. Hence probability is defined to be a special case of measure: a measure that takes the value 1 on a sample space S.

To summarise, using the concept of measure, we have defined probability as follows.

• We start with a set S of all possible outcomes of the experiment. We call S the sample space.

• Probability is a function, a measure, that can be applied to a subset of S (including S itself).

• A probability must be non-negative: we must have P(A) ≥ 0 for any set A ⊆ S.

• An event that is certain to happen must have probability 1, so that P(S) = 1.


• If two events A and B cannot both occur (so that A ∩ B = ∅), then the probability that either A occurs or B occurs is the sum of their two probabilities:

P(A ∪ B) = P(A) + P(B).

This is as far as the ‘mathematical definition’ of probability goes. There is no general definition to say what value P(A) must take. Rather, in any practical setting, we will choose a particular probability measure that we feel gives a suitable model for the problem at hand. To see this in action, we return to Examples 1 and 2 (where we initially struggled to come up with a suitable definition of probability). Theorem 2 (M5) says that we can multiply or divide a measure by a positive constant to get another valid measure. We will now use the measures f1 and f2 given in Example 8, divided by suitable constants.
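As a sketch of how this works for the card problem of Example 1 (the measures f1 and f2 of Example 8 are not reproduced in this extract): take the counting measure on subsets of a 52-card sample space and divide by 52, which Theorem 2 (M5) guarantees is still a measure, and which takes the value 1 on S. The card labels below are illustrative:

```python
from fractions import Fraction

# Illustrative sketch: the counting measure divided by |S| = 52 is a measure
# (by (M5)) with P(S) = 1, i.e. a probability measure.
deck_size = 52
P = lambda A: Fraction(len(A), deck_size)

aces = {"AS", "AH", "AD", "AC"}   # the four aces, as illustrative labels
print(P(aces))                     # 4/52, which reduces to 1/13
```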

Example 9. Examples of probability measures

4.1 Properties of probabilities

We can apply what we know about measures to deduce various properties of probabilities. For a sample space S, and A, B ⊆ S:

(P1) P(S) = 1.

(P2) 0 ≤ P(A) ≤ 1.

(P3) P(∅) = 0.

(P4) P(Ā) = 1 − P(A).

(P5) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).


(P6) If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).

(P7) If B ⊆ A, then P(A \ B) = P(A) − P(B) and P(B) ≤ P(A).

(P8) If E = {E1, E2, . . . , En} is a partition of S, then 1 = P(S) = ∑_{i=1}^{n} P(Ei).

You should verify that each of these properties holds, by referring back to the properties of measures discussed earlier.

4.2 The law of large numbers: bridging the gap from theory to observation

In Example 1, we said that the probability of the top card being an ace was 4/52. In one sense, all this statement is saying is that there are 52 possible outcomes in total, and for 4 of these outcomes, the event of the top card being an ace will occur. But does this probability value of 4/52 tell us anything about what will actually happen if we do the experiment? Not really. We may see an ace, we may not, and it’s not clear how we would relate what we would see to the value of 4/52. However, there is a result in probability theory that tells us what will actually happen over a large number of experiments. This result is the law of large numbers. We will study this result later on this module (and state it more precisely), but for now, we’ll just give the following informal version.

The law of large numbers (an informal version). Suppose we do a sequence of independent experiments, so that the outcome in one experiment has no effect on the outcome in another experiment. Now suppose in experiment i, there is an event Ei that has a probability of p of occurring, for i = 1, 2, . . .. The proportion of events out of E1, E2, . . . which actually occur as we do the experiments will typically get closer and closer to p, as the


number of experiments increases.

The sequence of experiments could correspond to repeating the same activity lots of times under the same conditions:

If we repeat a large number of times the process of shuffling a deck of cards and drawing the top card, we should expect to see the top card being an ace approximately 8% (≈ 4/52) of the time.
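This claim is easy to check by simulation; here is a minimal sketch, in which a uniform pick from 52 labels stands in for drawing the top card of a well-shuffled deck (an illustrative modelling assumption):

```python
import random

# Illustrative sketch: simulate observing the top card of a well-shuffled
# deck many times. Labels 0-51 stand for the cards; say 0-3 are the aces.
random.seed(1)

n = 100_000
aces_seen = sum(1 for _ in range(n) if random.randrange(52) < 4)

print(aces_seen / n)   # typically close to 4/52 ≈ 0.0769
```

Increasing n brings the observed proportion closer and closer to 4/52, which is exactly what the law of large numbers describes.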

Or the sequence of experiments could correspond to different activities:

Let E1 be the event of tossing a coin and observing a head, E2 be the event of rolling a die and observing an even number, E3 be the event of drawing a red playing card from a standard deck, E4 be the event that a randomly selected newborn baby is a boy, and so on. Given a sufficiently long list of such events E1, E2, . . ., each with probability 0.5, we should see approximately half these events occur if we do the experiments.

So, although we can’t predict what will happen in any one instance, we can make useful predictions about what will happen ‘overall’.

Suppose, for the same illness, the probability of a patient being cured by drug A is 0.1, and the probability of a patient being cured by drug B is 0.2. We cannot be certain whether any single patient will be cured by either drug. But we can predict confidently that in a large population of patients, approximately 10% will be cured by drug A, and approximately 20% will be cured by drug B.

There remains a rather difficult philosophical question of whether a single experiment in the real world actually has a ‘true probability’ attached to it, but we won’t concern ourselves with this! We’ll simply take the position that if we can justify our choice of probability values for some experiments in the real world, we can make useful predictions about what will happen in the


world, and these predictions can be tested.

5 Assigning probabilities

We have used measure theory to establish some basic rules that probabilities must follow, and the law of large numbers tells us what we should observe given a sequence of independent events, each with the same probability of occurring. But for modelling any experiment in the real world, there is no theorem to tell us what the correct probabilities are! (Or even if there are correct probabilities!) There are three main methods for assigning probabilities: the classical approach based on symmetry, relative frequencies, and subjective judgements. Arguably, the first two methods are special cases of the third, because they also involve making subjective judgements.

5.1 Classical probability: assuming symmetry

In the case of a finite sample space, in the classical approach we make the assumption that each outcome in the sample space is equally likely. If there are k outcomes in the sample space, then the probability of any single outcome is 1/k. The probability of any event is then calculated as the number of outcomes in which the event occurs, divided by the total number of possible outcomes:

P(A) = |A|/|S|.

In effect, we have chosen P to be the counting measure, divided by the number of elements in the sample space. This is the approach that we used in Example 1.


One situation in which we are usually willing to assume equally likely outcomes is in gambling and games of chance.

Example 10. On a European roulette wheel, the ball can land on one of the integers 0 to 36. A bet of one pound on odd returns one pound (plus the original stake) if the ball lands on any odd number from 1 to 35. Supposing you bet one pound a large number N of times. According to the law of large numbers, who would you expect to make a profit: you or the casino?
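One way to reason about Example 10, sketched in code (assuming the classical measure on the 37 equally likely outcomes 0–36, of which 18 are odd):

```python
from fractions import Fraction

# Illustrative sketch for Example 10: under the classical measure,
# P(odd) = 18/37, since 18 of the 37 outcomes 0-36 are odd.
p_win = Fraction(18, 37)

# A winning £1 bet nets +£1 (even-money payout); a losing bet nets -£1.
expected_net = p_win * 1 + (1 - p_win) * (-1)
print(expected_net)   # -1/37 per bet
```

Since the expected net gain per bet is negative, the law of large numbers suggests that over a large number N of bets the gambler's average loss settles near £1/37 per bet, so the casino profits.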

5.2 Counting methods for classical probability

The sample space may be very large, so writing out all the possible elements and counting directly may be laborious. Often, we can apply some standard results for permutations and combinations, together with a little logic.

Definition 6. Let nPr denote the number of permutations, that is the number of ways of choosing r elements out of n, where the order matters. Then

nPr = n!/(n − r)!.

Let (n choose r) (also written nCr) denote the number of combinations, that is the number of ways of choosing r elements out of n, where the order does not matter. Then

(n choose r) = n!/(r!(n − r)!).

These results are proved in MAS110. For reference, more detail is given in the supplementary online notes.

Example 11. Consider the National Lottery. On a single ticket, you must choose 6 integers between 1 and 59. A machine also selects 6 integers between


1 and 59, and it is reasonable to assume that each number is equally likely to be chosen. If I buy one ticket:

1. What is the probability that I match all 6 numbers?

2. What is the probability I match 3 numbers only?
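A sketch of how the counting results apply to Example 11 (the worked answers are not in this extract; the calculation below is one standard counting argument):

```python
from math import comb
from fractions import Fraction

# Illustrative sketch for Example 11, using the classical measure on the
# equally likely sets of 6 numbers from 59.
total = comb(59, 6)            # number of possible draws: 45 057 474

# match all 6: exactly one winning combination
p_all_six = Fraction(1, total)

# match exactly 3: choose 3 of my 6 numbers, and 3 of the 53 I didn't pick
p_three = Fraction(comb(6, 3) * comb(53, 3), total)

print(p_all_six)        # 1/45057474
print(float(p_three))
```

The second probability comes out at roughly 0.01, i.e. about 1 in 96.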

5.3 Classical probability under random sampling

We can design ‘experiments’ in which equally likely outcomes can be assumed. We start with a ‘population’ of people or items, and then pick a subset of the population, such that each member of the population is equally likely to be selected. This is called (simple) random sampling. (Some care is needed to make sure we are satisfied that each member really does have the same chance of being selected.) This concept is important in Statistics, where we want to make inferences about a population based on what we have observed in a random sample. We can use probability theory to understand how the characteristics of a random sample may differ from the characteristics of a population.

Example 12. A factory has produced 50 items, some of which may be faulty. If an item is tested for a fault, it can no longer be used (testing is ‘destructive’). To estimate the proportion of faulty items, a sample of 5 items is selected at random for testing. Suppose out of the 50 items, 10 are faulty.

1. What is the probability that none of the 5 items in the sample are faulty?

2. What is the probability that the proportion of faulty items in the sample is the same as the proportion of faulty items in the population?
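A sketch of one way to answer Example 12's questions using combinations (an application of the classical measure to the equally likely samples of 5 items; the counting argument below is one standard approach):

```python
from math import comb

# Illustrative sketch for Example 12: sampling 5 items from 50 (10 faulty),
# with all C(50, 5) samples equally likely.
total = comb(50, 5)

# 1. no faulty items: all 5 drawn from the 40 good items
p_none = comb(40, 5) / total

# 2. sample proportion equals the population proportion 10/50 = 0.2
#    exactly when 1 of the 5 sampled items is faulty
p_match = comb(10, 1) * comb(40, 4) / total

print(round(p_none, 3), round(p_match, 3))
```

With these numbers the two probabilities come out at roughly 0.31 and 0.43 respectively.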


5.4 Relative frequencies

In some situations, we may judge that outcomes are not equally likely, so that the classical approach is not appropriate. However, we may be able to repeat the experiment, and under the assumption that the conditions of the experiment are the same for each repetition, we may assign a probability to an event A as follows.

1. Repeat the experiment a large number of times.

2. Set P (A) to be the proportion of experiments in which A occurs.

Here, we are essentially appealing to the law of large numbers. The law of large numbers says that if in each experiment P(A) is the same value p, then the proportion of experiments in which A occurs will converge to p as the number of experiments tends to infinity. Hence, in practice, if we are satisfied that the conditions of the experiment are the same each time, estimating P(A) by the proportion of occurrences in a finite number of experiments seems reasonable. In fact, there is a more formal theory of how to choose a probability based on a finite number of repetitions of the experiment. This is taught in a 3rd year module, Bayesian Statistics.

Example 13. I have a backgammon game on my phone. I have played the computer in 73 matches, and have won 50 of them. Assuming my ability hasn’t changed over this time, I may judge P({I win}) = 50/73 and P({I lose}) = 23/73 for a single match. If, for completeness, I also state that P(∅) = 0 and P(S) = 1, where S = {I win, I lose}, have I specified a valid probability measure?


5.5 Subjective probabilities and fair odds

In some situations, we will not believe that the outcomes are equally likely, and the experiment is either not repeatable, or no past observations of the experiment exist. An example would be observing which party wins the most seats at the next UK election. If we write the sample space as

S = {Conservative, Labour, Liberal Democrats, Other},

no-one with any knowledge of UK politics would judge the Liberal Democrats to have the same chance as Conservative or Labour, and we argue that previous elections were held under different circumstances, and so cannot be thought of as replications of the same experiment. Nevertheless, you can still choose your own probability values, based on your own opinion. Bookmakers do this regularly when setting their opening odds, and generally they are very good at it. In fact, the concept of ‘fair odds’ can be a helpful way to decide what your probability should be.

Bookmakers usually quote odds against an event. If you bet £1 on the event E to occur at odds of x to y against (written as x/y), then if E occurs, you get your £1 back, plus a further £x/y. If you bet £1 on England winning the next football world cup at odds of 19 to 1 against, then if England win, you get your £1 back, plus a further £19.

Definition 7. Informally, we define fair odds to mean odds that you think favour neither the bookmaker nor the gambler. If you think the odds favour the gambler, you’d expect the bookmaker to make a loss. If you think the odds favour the bookmaker, you’d expect the bookmaker to make a profit.

Example 14. Suppose I roll a fair six-sided die, and offer you odds of 3 to 1 against the outcome being a six. Do you think these odds are fair? If not, what would the fair odds be?


More generally, if we bet £1 on a large number n of events of probability p with odds x/y, then our informal discussion of the law of large numbers suggests that we should expect to win on approximately pn of them, giving us approximately £pn(1 + x/y). We have spent £n on the bets, so if the odds are fair we should expect

p = (1 + x/y)⁻¹.

Re-arranging this suggests that the fair odds against an event E are

(1 − P(E))/P(E),

and we can also say that the fair odds on E are

P(E)/(1 − P(E)).
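These conversions can be wrapped up as small helper functions; `implied_prob` and `fair_odds_against` are illustrative names, not from the notes:

```python
from fractions import Fraction

# Illustrative sketch: converting between a probability and fair odds
# against, using the relationships derived above.

def implied_prob(x, y):
    """Probability implied by fair odds of x to y against: p = (1 + x/y)^-1 = y/(x + y)."""
    return Fraction(y, x + y)

def fair_odds_against(p):
    """Fair odds against an event of probability p, as the ratio (1 - p)/p."""
    return (1 - p) / p

print(implied_prob(19, 1))                 # 1/20: the England example above
print(fair_odds_against(Fraction(1, 6)))   # 5, i.e. odds of 5 to 1 against a six
```

The second call gives the answer to Example 14's follow-up question under the classical probability 1/6 of rolling a six.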

Example 15. On a European roulette wheel (numbers from 0 to 36 inclusive, each number appears once), a casino offers odds of 2 to 1 against the result being from 1 to 12 inclusive. What probability of getting 1-12 would this imply, assuming the odds were fair? What is the classical probability?

Different people, depending on their knowledge, will judge different odds to be fair, and so will have different probabilities. Does this result in a valid probability measure? Under a rather more precise definition of “fair odds”, the answer is yes. Further details are given in the supplementary online notes.

You can look at a bookmaker’s quoted odds to give an indication of their subjective probability for an event in question, though their actual probabilities will be smaller, for reasons that we will explain later in the course. (If you calculate the probabilities for all the possible outcomes from a bookmaker’s odds, the probabilities will sum to more than 1.) A bookmaker’s


odds will also be influenced by how popular the bets are. For example, small odds against an event may be (partly) the result of lots of people betting on the event, and the bookmaker lowering the odds to protect against potential losses. At the time of writing (July 2019), here are some quoted odds from various bookmakers for some events.

Event                                          Odds   Implied probability
Lewis Hamilton wins 2019 F1 championship       1/33   0.97
Manchester City win 2019/20 Premier League     4/6    0.60
Sheffield Wednesday promoted in 2019/20        7/1    0.13
Jeremy Corbyn next Prime Minister              5/2    0.29
Ben Stokes to win BBC Sports Personality       7/4    0.36
England to win next Cricket World Cup          3/1    0.25
Bernie Sanders to win 2020 US election         20/1   0.05

Aside: In August 2015, it was possible to get odds against Leicester City winning the 2015/16 Premier League of 5000/1, giving an implied probability of 1/5001, and of course this event did occur. Was this a very improbable event which actually happened, or was it in fact less improbable than the odds suggested?

Although it is impossible to state whether any single subjective probability is ‘correct’, we can still evaluate the performance of subjective probability assessments in the long run. For example, some weather forecasters provide subjective probabilities, for example probabilities that it will rain the next day. If we look at all the occasions where the forecaster has said “there is a 40% chance of rain tomorrow”, we would hope to see that there was rain the following day on (approximately) 40% of these occasions.


6 Conditional probability

6.1 Introduction and definition

Conditional probability is an important concept that we can use to change a measurement of uncertainty as our information changes.

Example 16. For a randomly selected individual, suppose the probabilities of the four blood types are P(type O) = 0.45, P(type A) = 0.4, P(type B) = 0.1 and P(type AB) = 0.05. A test is taken to determine the blood type, but the test is only able to declare that the blood type is either A or B. What is the probability that the blood type is A?
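
A numerical check of Example 16, using the conditional probability definition introduced below (a Python sketch; the event “type is A or B” plays the role of F):

```python
p = {"O": 0.45, "A": 0.40, "B": 0.10, "AB": 0.05}

# F = "blood type is A or B"; E = "blood type is A".
p_F = p["A"] + p["B"]        # 0.5
p_E_and_F = p["A"]           # "type A" is contained in "A or B"
p_E_given_F = p_E_and_F / p_F
print(p_E_given_F)           # 0.8
```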

Definition 8. We define P(E|F) to be the conditional probability of E given F, where

P(E|F) := P(E ∩ F) / P(F),    (3)

assuming P (F ) > 0. We can interpret this to mean

“If it is known that F has occurred, what is the probability that E has also occurred?”

If we know that the outcome belongs to the set F, then for E to occur also, the outcome must lie in the intersection E ∩ F. To get the conditional probability of E|F, we ‘measure’ (using the probability measure P) the fraction of the set F that is also in the set E.

A comment on joint probabilities

Definition 8 gives us an intuitive way to think about joint probabilities P(E ∩ F). Rearranging (3) we have

P (E ∩ F ) = P (F )P (E|F ),

and we can swap E and F and write

P (F ∩ E) = P (E ∩ F ) = P (E)P (F |E).

This means we can calculate the probability that both E and F occur by considering either

1. the probability that E occurs, and then the probability that F occurs given that E has occurred, or

2. the probability that F occurs, and then the probability that E occurs given that F has occurred.

Specifying conditional probabilities directly

Equation (3) tells us how to calculate P(E|F) if we already know P(E ∩ F) and P(F). But in some situations, we may be able to specify P(E|F) directly, given the information at hand.

Example 17. A diagnostic test has been developed for a particular disease. In a group of patients known to be carrying the disease, the test successfully detected the disease for 95% of the patients. An individual is selected at random from the population (and so may or may not be carrying the disease). Let D be the event that the individual has the disease, and T be the event that the test declares the individual has the disease. Using the frequency approach to specifying a probability and the information above, which of the following probabilities can we specify?

• P (D)


• P (D ∩ T )

• P (T |D)

• P (D|T )

In some cases, the conditional probability will be ‘obvious’, and you should have the confidence just to write it down!

Example 18. 1. A playing card is drawn at random from a standard deck of 52. Let A be the event that the card is a heart, and B be the event that the card is red. What are P(A|B) and P(B|A)?

2. On a National Lottery ticket (6 numbers chosen out of 59), let A be the event that the 6th number matches the 6th number drawn, and B be the event that the first 5 numbers on the ticket match the first 5 numbers drawn. What is P(A|B)?
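
Both parts of Example 18 can be written down directly; here is a small sketch confirming the counting (Python, exact fractions):

```python
from fractions import Fraction

# Part 1: there are 26 red cards, of which the 13 hearts are a subset.
p_A_given_B = Fraction(13, 26)   # = 1/2
p_B_given_A = Fraction(13, 13)   # = 1: every heart is red

# Part 2: given the first 5 numbers match, 59 - 5 = 54 balls remain,
# exactly one of which matches the ticket's 6th number.
p_A_given_B_lottery = Fraction(1, 54)

print(p_A_given_B, p_B_given_A, p_A_given_B_lottery)
```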

6.2 Independence

Definition 9. Two events E and F are said to be independent if

P (E ∩ F ) = P (E)P (F ).

We can use the definition of conditional probability to give a more intuitive definition of independence. The events E and F are independent if

P (E|F ) = P (E). (4)

(If (4) holds, then P(E ∩ F) = P(E|F)P(F) = P(E)P(F).) We can read this to mean “If E and F are independent, then learning that E has occurred does not change the probability that F will occur (and vice versa).”


Example 19. Suppose two pregnant women are chosen at random, and consider whether each gives birth to a boy or girl (assume there will be no twins, triplets etc.). We can write the sample space as

S = {(boy, boy), (boy, girl), (girl, boy), (girl, girl)},

where, for example, the element (boy, girl) is the outcome that the first woman gives birth to a boy, and the second woman gives birth to a girl. Define Bi to be the event that the ith woman gives birth to a boy.

1. With regard to S, what are the subsets B1, B2 and B1 ∩B2?

2. Suppose we assume that each outcome in the sample space is equally likely. What are the values of P(B1), P(B2) and P(B1 ∩ B2)?

3. Are B1 and B2 independent?

Example 20. Two playing cards are drawn at random from a standard 52 card deck. Let A be the event of at least one ace, and let K be the event of at least one king.

1. Assume that each card in the deck has the same chance of being selected. Calculate P(A), P(K) and P(A ∩ K).

2. Are A and K independent?

3. Compare P (A) with P (A|K) and comment on the result.
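
Example 20 can be checked by brute force: enumerate all C(52, 2) = 1326 equally likely two-card hands (a Python sketch, with illustrative rank labels):

```python
from itertools import combinations
from fractions import Fraction

ranks = ["A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3", "2"]
deck = [(r, s) for r in ranks for s in "SHDC"]   # 52 cards
hands = list(combinations(deck, 2))              # all 1326 unordered hands

def prob(event):
    return Fraction(sum(1 for h in hands if event(h)), len(hands))

has_ace = lambda h: any(c[0] == "A" for c in h)
has_king = lambda h: any(c[0] == "K" for c in h)

pA = prob(has_ace)
pK = prob(has_king)
pAK = prob(lambda h: has_ace(h) and has_king(h))

print(pA, pK, pAK)        # 33/221, 33/221, 8/663
print(pAK == pA * pK)     # False: A and K are dependent
# P(A|K) = pAK/pK is smaller than P(A): a king occupying one of the
# two cards leaves less room for an ace.
print(float(pAK / pK), float(pA))
```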

Calculating joint probabilities: a summary We have now seen various ways to calculate a joint probability P(A ∩ B). These are as follows.

1. Assume independence
If we think that learning A has occurred will not change our probability of B occurring (and vice versa) then we have

P (A ∩B) = P (A)P (B).

Example 21. If I buy a single National Lottery ticket, the probability I don’t win any prize is 53/54. If I buy one ticket every week, what is the probability I win nothing in the first two weeks? What is the probability I win nothing in the first month?

2. Direct calculation using classical probability
When using classical probability (assuming the elements of the sample space are equally likely), it may be straightforward to count the number of outcomes in which both A and B occur, and hence calculate P(A ∩ B) directly. In a sense, we are saying that if we write C as the event A ∩ B, it is straightforward to calculate P(C).

Example 22. One playing card is drawn at random from a standard deck of 52. Let A be the event that the card is a red face card (king, queen or jack). Let B be the event that the card is a king.

(a) Are A and B independent?

(b) What is P (A ∩B)?

3. Using conditional probability
As has already been commented, we have

P (A ∩B) = P (A)P (B|A) = P (B)P (A|B).

If we already know P(A) and P(B|A), or P(B) and P(A|B), we can calculate P(A ∩ B).

Example 23. (Example 17 revisited). Suppose 1% of people in a population have a particular disease. A diagnostic test has a 95% chance of detecting the disease in a person carrying the disease. One person is selected at random. What is the probability that they have the disease and the diagnostic test detects it?
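
A one-line check of Example 23, using P(D ∩ T) = P(D)P(T|D) (Python sketch):

```python
p_D = 0.01           # P(has the disease)
p_T_given_D = 0.95   # P(test detects it, given they have it)

p_D_and_T = p_D * p_T_given_D
print(round(p_D_and_T, 6))   # 0.0095
```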


Having three methods may seem confusing! In fact, the conditional probability method can be thought of as the way to calculate a joint probability (remembering that conditional probabilities can be specified directly), with the first two methods being short cuts or special cases.

Example 24. Show how the conditional probability method can be used to calculate the joint probability in Example 22.

6.3 The law of total probability

In some situations, calculating a probability of an event is easiest if we first consider some appropriate conditional probabilities.

Example 25. Suppose the four teams in this year’s Champions League semi-finals are Manchester City, Barcelona, Juventus, and Bayern Munich. Depending on their opponents, you judge Manchester City’s probabilities of reaching the final to be

P (reach final|opponent is Barcelona) = 0.2,

P (reach final|opponent is Juventus) = 0.4,

P (reach final|opponent is Bayern Munich) = 0.5.

If the semi-final draw has yet to be made (with any two teams having the same probability of being drawn against each other), what is your probability of Manchester City reaching the final?
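
The law of total probability, stated next, gives the answer directly; a sketch of Example 25 in Python (each of the three possible opponents is equally likely, probability 1/3):

```python
from fractions import Fraction

p_reach = {"Barcelona": Fraction(1, 5),       # 0.2
           "Juventus": Fraction(2, 5),        # 0.4
           "Bayern Munich": Fraction(1, 2)}   # 0.5
p_draw = Fraction(1, 3)   # probability of each possible opponent

p_final = sum(p_reach[team] * p_draw for team in p_reach)
print(p_final, float(p_final))   # 11/30, roughly 0.367
```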

Theorem 3. (The law of total probability) Suppose we have a partition E = {E1, . . . , En} of a sample space S. Then for any event F,

P(F) = ∑_{i=1}^{n} P(F ∩ Ei),

or, equivalently,

P(F) = ∑_{i=1}^{n} P(F|Ei)P(Ei).

Figure 7: Given a partition E = {E1, . . . , En}, we can calculate P(F) by considering in turn how likely each event Ei is, and then how likely F is conditional on Ei.

6.4 Bayes’ theorem

In Example 17 we considered a diagnostic test, and the probability of the test detecting the disease in someone who has it. But diagnostic tests can sometimes produce ‘false positives’: a test may claim the presence of the disease in someone who does not have it. In these situations, we will want to know how likely it is someone has the disease, conditional on their test result.


Example 26. A new diagnostic test has been developed for a particular disease. It is known that 0.1% of people in the population have the disease. The test will detect the disease in 95% of all people who really do have the disease. However, there is also the possibility of a “false positive”; out of all people who do not have the disease, the test will claim they do in 2% of cases.

A person is chosen at random to take the test, and the result is “positive”. How likely is it that that person has the disease?
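
Bayes’ theorem (stated next) answers Example 26; a numerical sketch in Python:

```python
p_D = 0.001               # P(disease): 0.1% of the population
p_pos_given_D = 0.95      # detection rate
p_pos_given_notD = 0.02   # false-positive rate

# Law of total probability for P(positive):
p_pos = p_pos_given_D * p_D + p_pos_given_notD * (1 - p_D)

# Bayes' theorem:
p_D_given_pos = p_pos_given_D * p_D / p_pos
print(round(p_D_given_pos, 4))   # about 0.045: probably still disease-free
```

Despite the positive result, the disease remains unlikely, because true positives are swamped by false positives from the much larger disease-free group.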

Theorem 4. (Bayes’ theorem) Suppose we have a partition E = {E1, . . . , En} of a sample space S. Then for any event F,

P(Ei|F) = P(Ei)P(F|Ei) / P(F).

Note that we can calculate P(F) via the law of total probability:

P(F) = ∑_{j=1}^{n} P(F|Ej)P(Ej).

Using this we can write Bayes’ theorem as

P(Ei|F) = P(Ei)P(F|Ei) / ∑_{j=1}^{n} P(F|Ej)P(Ej).    (5)

This is often the most useful version in practice.

Note that if E is a single event then {E, Ē} is a partition with n = 2, so we get a special case of Bayes’ theorem,

P(E|F) = P(F|E)P(E) / [P(F|E)P(E) + P(F|Ē)P(Ē)].    (6)

In the context of Bayes’ theorem, we sometimes refer to P(Ei) as the prior probability of Ei, and P(Ei|F) as the posterior probability of Ei given F. The prior probability states how likely we thought Ei was before we knew that F had occurred, and the posterior probability states how likely we think Ei is after we have learnt that F has occurred.

6.5 Conditional probabilities: deciding which formula to use

Initially, it can seem confusing that we have two formulae for conditional probabilities: the definition in equation (3) and Bayes’ theorem (and we have also said that you can sometimes just write down a conditional probability directly). Firstly, you should note that these aren’t really ‘different’ formulae. To get Bayes’ theorem, we have just started with the conditional probability definition

P(A|B) = P(A ∩ B) / P(B),

then rewritten the numerator using P(A ∩ B) = P(A)P(B|A) and rewritten the denominator using the law of total probability. With practice, you will quickly learn to recognise what probabilities you already know, and so how to calculate P(A|B). However, to start with, you may find it helpful to use the following scheme.

1. Is it ‘obvious’?
You may have the information you need to write down P(A|B) directly. If not, carry on to Step 2.

2. Write down the conditional probability formula

P(A|B) = P(A ∩ B) / P(B)

3. Consider the numerator P(A ∩ B)

(a) Is it straightforward to write down or calculate P(A ∩ B)? If so, move on to Step 4.

(b) Do you know P(A) and P(B|A)? If so, you can calculate P(A ∩ B) as

P(A ∩ B) = P(A)P(B|A).

(In this case, you are now using Bayes’ Theorem).

4. Consider the denominator P(B)
If you know this value already, you’re done. If not, try the law of total probability:

P(B) = P(A)P(B|A) + P(Ā)P(B|Ā).

Try working through Examples 16, 18 and 26 again, following the scheme above.

7 Discrete random variables

Informally, we think of a random variable as any quantity that is uncertain to us. For example:

• the number of emergency call-outs received by a fire station in a given week;

• the price of a barrel of oil in one month’s time;

• the number of gold medals won by Great Britain at the next summer Olympics.


We cannot say with certainty what any of these quantities are, but probability theory gives us a framework for describing how likely different values are. Whereas elements of a sample space may not be numerical, random variables are always numerical quantities, and so, when defining a random variable, we need a rule for getting from the random outcome in the sample space to the value of the random variable.

Definition 10. Given a sample space S, we define a random variable X to be a mapping from S to the real line R.

We sometimes write a random variable as X(s), where s ∈ S. We define the range of X to be the set of all possible values of X:

RX := {x ∈ R;x = X(s) for some s ∈ S}.

In this chapter, we consider discrete random variables, in which the number of possible values is either finite or countably infinite.

Example 27. Counting heads in coin tosses

7.1 Summation notation

Before continuing with the discussion of random variables, we define some new summation notation, and recap some results regarding manipulations of sums.

Let X be a discrete random variable with range RX = {x1, x2, . . . , xn}. For any function g(x), we define

∑_{x∈RX} g(x) := ∑_{i=1}^{n} g(xi) = g(x1) + g(x2) + . . . + g(xn).


For any two constants a and b, we have

∑_{i=1}^{n} (a + b g(xi)) = (a + b g(x1)) + (a + b g(x2)) + . . . + (a + b g(xn)) = na + b ∑_{i=1}^{n} g(xi).

Note that

∑_{i=1}^{n} a = na,

(so the sum is not equal to a).

7.2 Probability mass functions

Definition 11. For a discrete random variable X, we define the probability mass function (p.m.f. for short) pX to be

pX(x) := P (X = x),

where x can be any real number. Note that P (X = x) = P (A) where

A = {s ∈ S;X(s) = x}.

In the notation, we use X to represent the random variable, and x to represent a possible value of X. Whereas X refers to a specific random variable, the use of the letter x is arbitrary; we could just as well write pX(a) := P(X = a).

A probability mass function must have the following two properties.

1. pX(x) ≥ 0 for all x ∈ R.


2. Probability mass functions must ‘sum to 1’:

∑_{x∈RX} pX(x) = 1.

Property 1 follows from the definition of pX(x). To prove property 2, first write RX = {x1, x2, . . . , xn}, and let Ai = {s ∈ S; X(s) = xi}. You should now convince yourself that A1, . . . , An is a partition of S.

Example 28. In a simple lottery, two numbers are drawn at random, without replacement, from the numbers 1, 2, 3, 4. You choose two numbers: 1 and 3. If both 1 and 3 are drawn, you win £10. If either 1 or 3 is drawn (but not both), you win £5. Otherwise, you win nothing. Let X be the amount in pounds that you win. Tabulate pX(x).
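
The p.m.f. in Example 28 can be tabulated by enumerating the six equally likely draws (a Python sketch):

```python
from itertools import combinations
from fractions import Fraction

draws = list(combinations([1, 2, 3, 4], 2))   # the 6 equally likely draws
ticket = {1, 3}

pmf = {0: Fraction(0), 5: Fraction(0), 10: Fraction(0)}
for d in draws:
    matched = len(ticket & set(d))            # how many of 1 and 3 were drawn
    winnings = {0: 0, 1: 5, 2: 10}[matched]
    pmf[winnings] += Fraction(1, len(draws))

print(pmf)   # pX(0) = 1/6, pX(5) = 2/3, pX(10) = 1/6
```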

Note that a probability mass function can be thought of as defining a probability measure on the range space RX: for a subset A ⊆ RX, define

mX(A) := P(X ∈ A) = ∑_{x∈A} pX(x).

It is not hard to check that mX satisfies the definition of a probability measure, and it is called the law or distribution of X. We generally think in terms of the probability mass function rather than of the measure, but the measure idea is useful when we come to generalise beyond discrete random variables.

7.3 Cumulative distribution and quantile functions

Definition 12. We define the cumulative distribution function, abbreviated to c.d.f., FX of a random variable X (discrete or continuous) to be

FX(x) := P (X ≤ x),


where x can be any real number.

The cumulative distribution function can be written in terms of the probability mass function:

FX(x) := P(X ≤ x) = ∑_{a ≤ x, a ∈ RX} pX(a).    (7)

Example 29. Suppose England are to play the West Indies in a 3 match test series. Let X be the number of matches won by England. If my probability mass function for X is

pX(0) = 0.05, pX(1) = 0.2, pX(2) = 0.6, pX(3) = 0.15,

tabulate my cumulative distribution function.
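
The c.d.f. is just a running total of the p.m.f., as in equation (7); a sketch of Example 29 in Python:

```python
pmf = {0: 0.05, 1: 0.2, 2: 0.6, 3: 0.15}

cdf, total = {}, 0.0
for x in sorted(pmf):           # accumulate pX(a) over a <= x
    total += pmf[x]
    cdf[x] = round(total, 10)   # rounding guards against float noise

print(cdf)   # {0: 0.05, 1: 0.25, 2: 0.85, 3: 1.0}
```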

The quantile function is related to the inverse of the cumulative distribution function.

Definition 13. For α ∈ [0, 1] the α quantile (or 100 × α percentile) is the smallest value of x such that

FX(x) ≥ α.

The median is the 0.5 quantile.

7.4 Independence of random variables

We say that two random variables X and Y are independent of each other if any event defined only using the value of X is independent of any event defined only using the value of Y. More specifically, we can make the following definition for discrete random variables:


Definition 14. Two discrete random variables X and Y are independent if

P (X = x, Y = y) = P (X = x)P (Y = y),

for all x and y, or, equivalently, if

P (X = x|Y = y) = P (X = x).

If two random variables are not independent, then we say that they are dependent.

Example 30. Independence of random variables

8 Expectation and variance

The probability mass function gives a complete description of the uncertainty we have about a random variable X; it tells us how likely each possible value of X is. However, there are other quantities that can tell us useful things about a random variable, which we can derive from the probability mass function. We consider here the expectation and variance of a random variable.

8.1 Expectation

On a European roulette wheel, the ball can land on one of the integers 0 to 36. A bet of one pound on odd returns one pound (plus the original stake) if the ball lands on any odd number from 1 to 35. Assuming the ball is equally likely to land anywhere, if you bet one pound on odd a large number of times, how much money per game are you likely to win (or lose)?

Informally, if you bet on odd 10000 times, we might suppose that you will win (18/37) × 10000 = 4865 (to 4 s.f.) times and lose (19/37) × 10000 = 5135 (to 4 s.f.) times, so you will lose 270 pounds overall, or 2.7 pence per game.
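
The same figure drops out of an exact calculation (a Python sketch; the formal definition follows below):

```python
from fractions import Fraction

# Net winnings from one £1 bet on odd: +1 with prob 18/37, -1 with prob 19/37.
p_win = Fraction(18, 37)
per_game = 1 * p_win + (-1) * (1 - p_win)

print(per_game)                  # -1/37, about -2.7p per game
print(float(per_game * 10000))   # about -270 pounds over 10000 games
```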

Formally, we define the expected profit (or loss) per game.

Definition 15. The expectation E(X) of a discrete random variable X is defined as

E(X) := ∑_{x∈RX} x P(X = x).

We refer to the expectation of X as the mean of X and write µX to represent the mean:

µX := E(X).

Example 31. In the roulette example, let X be your net winnings after a single bet of one pound on odd. What is E(X)?

Example 32. In Section 5.5, we saw how you can choose your probability P(A) based on your fair odds. Suppose Everton are to play Chelsea in the Premier League, and that you believe fair odds against Everton winning are 3 to 1, so that your probability that Everton win is 0.25. If you offer these odds to someone, and they place a £1 bet, what is your expected profit? Suppose a bookmaker also judges the probability that Everton win is 0.25, but instead offers odds of 12 to 5 against. If someone places a £1 bet on Everton winning, what is the bookmaker’s expected profit?

Example 33. Let X be a random variable with RX = {−1, 0, 1}. Define Y = g(X) = X². Then Y is another random variable, with RY = {0, 1}. What is E{g(X)}?


Sometimes it is useful to calculate the expectation of a function of X. The following result generalises the previous example to tell us how.

Theorem 5. (The expectation of g(X))

For any function g of a random variable X, with probability mass function pX(x),

E{g(X)} = ∑_{x∈RX} g(x) pX(x).
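
Theorem 5 translates directly into code: weight g(x) by pX(x) and sum. A generic sketch in Python (the uniform p.m.f. here is an assumption for illustration; Example 33 does not specify one):

```python
from fractions import Fraction

def expect(pmf, g=lambda x: x):
    """E{g(X)}: the sum of g(x) * pX(x) over the range of X."""
    return sum(g(x) * p for x, p in pmf.items())

# Assumed p.m.f. on R_X = {-1, 0, 1} (equally likely, for illustration only):
pmf = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

print(expect(pmf))                  # E(X)   = 0
print(expect(pmf, lambda x: x**2))  # E(X^2) = 2/3
```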

8.2 Variance

If we can repeat the experiment and observe X lots of times, informally, the expectation of X tells us what we are likely to see ‘on average’. (We will consider this more carefully when we study sums of random variables.) It will also be useful to consider how far X might be from its expectation. Consider two random variables X and Y, with the following probability mass functions:

pX(32) = 1/3, pX(36) = 1/3, pX(46) = 1/3,

pY(12) = 1/3, pY(20) = 1/3, pY(82) = 1/3.

Then

E(X) = 32 × 1/3 + 36 × 1/3 + 46 × 1/3 = 38,

E(Y) = 12 × 1/3 + 20 × 1/3 + 82 × 1/3 = 38.

Both X and Y have the same expected value, but for whatever values of X and Y we observe, X will be closer to E(X) than Y will be to E(Y).

We use the concept of variance to describe how close a random variable is likely to be to its expected value.


Figure 8: The possible values of X and Y are shown as crosses (with each value equally likely). Both X and Y have the same mean, but we can see Y must be further away from its mean.

Definition 16. The variance Var(X) of a discrete random variable X is defined as

Var(X) := E[{X − E(X)}²] = E{(X − µX)²} = ∑_{x∈RX} (x − µX)² pX(x).

We denote the variance by σ²X:

σ²X := Var(X).

As the variance is defined as an expected squared difference, the variance will be expressed in units that are the square of the units of X. If we want a measure of spread that is in the same units as X, we take the square root of the variance.

Definition 17. The standard deviation of a random variable X, denoted by σX, is the square root of the variance of X:

σX := √Var(X).
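
For the random variables X and Y defined above (both with mean 38), the variance captures the difference in spread. A Python sketch:

```python
from fractions import Fraction
from math import sqrt

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

third = Fraction(1, 3)
pX = {32: third, 36: third, 46: third}
pY = {12: third, 20: third, 82: third}

print(mean(pX), mean(pY))          # 38 38
print(variance(pX), variance(pY))  # 104/3 versus 2936/3
print(sqrt(variance(pX)), sqrt(variance(pY)))   # the standard deviations
```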


The following result is useful for calculating variances:

Theorem 6. (The variance identity)

Var(X) = E(X²) − E(X)².

To calculate a variance, if we already have E(X), we just need to calculate E(X²). (Alternatively, if we know the mean and variance, this gives us a quick way of calculating E(X²).) Note that as long as Var(X) > 0 we can see that

E(X²) ≠ E(X)².

Example 34. Let X be the random variable defined in Example 31, with E(X) = −1/37. What is Var(X)?
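
For Example 34, X² = 1 whatever happens, so E(X²) = 1 and the variance identity gives the answer immediately (a Python check):

```python
from fractions import Fraction

p_win = Fraction(18, 37)
pmf = {1: p_win, -1: 1 - p_win}   # net winnings from a £1 bet on odd

EX = sum(x * p for x, p in pmf.items())        # -1/37
EX2 = sum(x**2 * p for x, p in pmf.items())    # 1, since X^2 is always 1
var = EX2 - EX**2

print(var)          # 1368/1369
print(float(var))   # just under 1
```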

8.3 The expectation and variance of aX + b

Theorem 7. (Expectation and variance of aX + b )

Let X be a random variable, and a and b be any two constants. Then

E(aX + b) = aE(X) + b,

Var(aX + b) = a2 Var(X).

If we set a = 0, then we can see that for any constant b,

E(b) = b,

Var(b) = 0.


Letting a = 1 and b = −E(X) in Theorem 7 also tells us that E{X − E(X)} = 0 for any random variable X. So if you were wondering why the variance is defined as E[{X − E(X)}²] rather than E{X − E(X)}, one answer is that the latter expression will not tell us anything useful about X.

8.4 Expectation and variance of more than one random variable

It is often useful to consider expectations of sums of random variables:

Theorem 8. Given any two random variables X and Y

E(X + Y ) = E(X) + E(Y ).

We might hope that the same was true for variance, so that the variance of X + Y was the sum of the variances of X and Y. This is not true in general, but it is true when X and Y are independent. To prove this we will first of all prove an important result about the expectation of a product of independent random variables.

Theorem 9. For any two random variables X and Y which are independent,

E(XY ) = E(X)E(Y ).

Corollary 10. For any two random variables X and Y which are independent,

Var(X + Y ) = Var(X) + Var(Y ).

Example 35. Let X and Y be independent random variables with Var(X) = 9 and Var(Y) = 16. What is Var(X − Y)?


9 Standard discrete probability distributions

We now consider some standard probability distributions for discrete random variables, that can be used in a variety of different applications. By “distribution”, we mean a particular choice of probability mass function, which may be specified in terms of some parameters.

9.1 The Bernoulli distribution

A Bernoulli random variable X can take one of two values: 0 and 1. Examples of ‘experiments’ that we might describe using a Bernoulli random variable are

• a patient is given a drug, and the drug either ‘works’: X = 1, or does not: X = 0;

• a tennis player either wins a match: X = 1, or loses: X = 0;

• in one year, a house is either burgled: X = 1, or not: X = 0.

Definition 18. If a random variable X has a Bernoulli distribution, then its probability mass function is

pX(1) = p,

pX(0) = 1− p,

and pX(x) = 0 otherwise, with 0 ≤ p ≤ 1. We write

X ∼ Bernoulli(p),

to mean “X has a Bernoulli distribution with parameter p (the probability that X = 1)”.


Theorem 11. (Expectation and variance of a Bernoulli random variable)

For a Bernoulli random variable X ∼ Bernoulli(p), we have

E(X) = p,

Var(X) = p(1− p).

9.2 The binomial distribution

Consider the following situations:

• 100 patients are given a drug. Each patient either ‘responds’ to the drug, or does not. X is the number of patients that respond to the drug.

• in a crime survey, 1000 people are selected at random, and asked whether they have been burgled in the last year. X is the number of people who respond ‘yes’.

• in a quality control procedure, 20 items are selected at random, and tested to see whether they are faulty. X is the number of faulty items.

In each case we have a fixed number of “trials”, each of which can have two possible outcomes (often called “success” and “failure”). (Each trial can be considered an example of a Bernoulli distribution, with “success” corresponding to 1 and “failure” to 0, so they are often referred to as Bernoulli trials.) In each of these situations it is reasonable to assume that the probability of a “success”, which we will call p, is constant from one trial to the next, and that the trials are independent. In each case we are counting the total number of successes.


To think about the form of the probability mass function of X, consider the case n = 2. The possible outcomes are

Trial 1   Trial 2   X(s)   Probability
failure   failure   0      (1 − p) × (1 − p)
failure   success   1      (1 − p) × p
success   failure   1      p × (1 − p)
success   success   2      p × p

Consider calculating pX(1). We are not interested in which trials are successes, only the total number of successes. There are two ways of achieving one success in total, and for each of these possibilities, the corresponding probability is p(1 − p), so we have pX(1) = 2p(1 − p).

In general for n trials, the number of possible sequences that contain x successes in total will be the binomial coefficient

C(n, x) = n! / (x!(n − x)!),

and the probability of any individual sequence with x successes in total will be p^x (1 − p)^(n−x), so we will have pX(x) = C(n, x) p^x (1 − p)^(n−x).

Motivated by this, we make the following definition:

Definition 19. If a random variable X has a binomial distribution, with parameters n (the number of trials) and p (the probability of success in each trial), then the probability mass function of X is given by

pX(x) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x),

for x ∈ RX = {0, 1, 2, . . . , n}, and 0 otherwise. We write

X ∼ Bin(n, p),


to mean “X has a binomial distribution with parameters n (the number of trials) and p (the probability of success in each trial)”.

Note that the binomial theorem confirms that the binomial probability mass function is valid (i.e. it sums to 1): it tells us that

(a + b)^n = ∑_{x=0}^{n} C(n, x) a^x b^(n−x),

and if we now choose a = p and b = 1 − p, we have

∑_{x=0}^{n} pX(x) = ∑_{x=0}^{n} C(n, x) p^x (1 − p)^(n−x) = (p + (1 − p))^n = 1,

as we should have.

Theorem 12. (Expectation and variance of a binomial random variable)

For X ∼ Bin(n, p) we have

E(X) = np

Var(X) = np(1− p).

As well as being interested in the total number of ‘successes’ X, we may also be interested in the proportion of successes X/n. We have

E(X/n) = p,    Var(X/n) = p(1 − p)/n.

What do you think will happen to X/n as n → ∞?
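
Since Var(X/n) = p(1 − p)/n shrinks as n grows, the observed proportion of successes concentrates around p. A small numerical illustration (Python, with an arbitrary p chosen for the sketch):

```python
p = 0.3   # an arbitrary success probability, for illustration

for n in [10, 100, 1000, 10000]:
    var_prop = p * (1 - p) / n   # Var(X/n) = p(1 - p)/n
    print(n, var_prop)           # shrinks towards 0 as n grows
```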


9.2.1 The cumulative distribution function

The cumulative distribution function is given by

FX(x) = P(X ≤ x) = ∑_{a=0}^{x} pX(a) = ∑_{a=0}^{x} [n! / (a!(n − a)!)] p^a (1 − p)^(n−a).

We cannot simplify this expression, and so calculating the c.d.f. by hand can be tedious. Fortunately, we can do this and other calculations related to the binomial distribution very easily in R.

9.2.2 The binomial distribution in R

As with most standard distributions in R, there are commands for calculating the p.m.f., c.d.f., quantile function, and for randomly sampling from the distribution.

• Calculate the p.m.f.: dbinom(x,n,p)
Example: calculate p_X(2) = P(X = 2) when X ∼ Bin(10, 0.3).

> dbinom(2,10,0.3)

[1] 0.2334744

• Calculate the c.d.f.: pbinom(x,n,p)
Example: calculate F_X(4) = P(X ≤ 4) when X ∼ Bin(30, 0.25).

> pbinom(4,30,0.25)

[1] 0.0978696

• Invert the c.d.f. to find the α quantile: qbinom(alpha,n,p)
Example: if X ∼ Bin(15, 0.7), what is the smallest value of x ∈ {0, 1, . . . , 15} such that F_X(x) = P(X ≤ x) ≥ 0.4?

> qbinom(0.4,15,0.7)

[1] 10

Check:

> pbinom(10,15,0.7)

[1] 0.4845089

> pbinom(9,15,0.7)

[1] 0.2783786

• Generate m random observations from a binomial distribution: rbinom(m,n,p)
Example: generate 10 random observations from the Bin(100, 0.1) distribution.

> rbinom(10,100,0.1)

[1] 9 11 6 15 10 8 9 8 13 12

9.2.3 The binomial distribution: examples

Example 36. A company claims that, for a particular product, 8 out of 10 people prefer their brand A over a rival’s brand B. You randomly sample 50 people, and ask them whether they prefer brand A to brand B. Let X be the

number of people who choose brand A. If the company is right,

1. What are the expectation and variance of X?

2. What is the probability that X = 40?

3. What is the probability that X ≤ 30?

4. What is the probability that X ≥ 45?
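If the company is right then X ∼ Bin(50, 0.8). As a cross-check on the R commands above, here is a sketch in Python using only the standard library (binom_pmf is our own helper implementing the p.m.f. from Definition 19):

```python
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) for X ~ Bin(n, p), as in Definition 19.
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 50, 0.8

expectation = n * p          # 1. E(X) = np = 40
variance = n * p * (1 - p)   # 1. Var(X) = np(1-p) = 8

# 2. P(X = 40), the analogue of R's dbinom(40, 50, 0.8).
p_eq_40 = binom_pmf(40, n, p)

# 3. P(X <= 30), the analogue of R's pbinom(30, 50, 0.8).
p_le_30 = sum(binom_pmf(x, n, p) for x in range(31))

# 4. P(X >= 45) = 1 - P(X <= 44).
p_ge_45 = 1 - sum(binom_pmf(x, n, p) for x in range(45))
```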

9.3 The Poisson distribution

The Poisson distribution is used to represent count data: the number of times an event occurs in some finite interval in time or space. Some situations that we might model using a Poisson distribution are as follows.

• the number of arrivals at an Accident & Emergency ward in one night;

• the number of burglaries in a city in a year;

• the number of goals scored by a team in a football match;

• the number of leaks in a 1 km section of water pipe.

We can motivate the form of the Poisson distribution as follows. Consider the third example: assume that we expect the team to score about λ goals (for some λ), and imagine dividing the match up into n short time intervals, where n is some arbitrary but large number. (E.g. n could be 90, with each of the intervals one minute long.) Assume that in each of these time intervals the probability of the team scoring one goal is p, independently of the other intervals, and that the probability of them scoring more than one goal in an

interval is “negligible”. Under these assumptions, the number of goals scored has a Bin(n, p) distribution. That the expectation of a Bin(n, p) random variable is np suggests that we should now take p = λ/n.

Because we set n to be large, this suggests we should consider the behaviour of Bin(n, λ/n) for large n.

Theorem 13. Consider X ∼ Bin(n, λ/n). Then, as n → ∞,

p_X(x) → \frac{e^{-λ} λ^x}{x!}.

Motivated by this result, we make the following definition:

Definition 20. If a random variable X has a Poisson distribution, with parameter λ > 0, then its probability mass function is given by

p_X(x) = P(X = x) = \frac{e^{-λ} λ^x}{x!},

for x ∈ N_0 and 0 otherwise. We write

X ∼ Poisson(λ),

to mean “X has a Poisson distribution with rate parameter λ”.

The Poisson distribution has a single parameter λ, known as the rate parameter. Shortly, we will show that E(X) = λ, so you can interpret λ as the expected number of times the event will occur.

Theorem 14. (Poisson random variable: valid p.m.f., expectation and variance)

1. The Poisson probability mass function is a valid probability mass function.

2. If X ∼ Poisson(λ) then

E(X) = Var(X) = λ.

9.3.1 The cumulative distribution function

The cumulative distribution function is given by

F_X(x) = P(X ≤ x) = \sum_{a=0}^{x} p_X(a) = \sum_{a=0}^{x} \frac{e^{-λ} λ^a}{a!}.

As with the binomial distribution, this is tedious to calculate by hand, but easy to calculate using R.

9.3.2 The Poisson distribution in R

• Calculate the p.m.f.: dpois(x,lambda)
Example: calculate p_X(4) = P(X = 4) when X ∼ Poisson(2).

> dpois(4,2)

[1] 0.09022352

• Calculate the c.d.f.: ppois(x,lambda)
Example: calculate F_X(10) = P(X ≤ 10) when X ∼ Poisson(6.5).

> ppois(10,6.5)

[1] 0.9331612

• Invert the c.d.f. to find the α quantile: qpois(alpha,lambda)
Example: if X ∼ Poisson(4.2), what is the smallest value of x ∈ N_0 such that F_X(x) = P(X ≤ x) ≥ 0.8?

> qpois(0.8,4.2)

[1] 6

Check:

> ppois(6,4.2)

[1] 0.867464

> ppois(5,4.2)

[1] 0.7531429

• Generate m random observations from a Poisson distribution: rpois(m,lambda)
Example: generate 10 random observations from the Poisson(20) distribution.

> rpois(10,20)

[1] 15 29 20 30 23 19 20 12 32 20

9.3.3 The Poisson distribution: examples

Example 37. Suppose X, the number of accidents at a road junction in one year, has a Poisson distribution with rate parameter 5.

1. What are the expectation and variance of X?

2. What is the probability that X = 0?

3. What is the probability that X ≤ 5?

4. What is the probability that X ≥ 10?
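As a sketch of how these could be checked without R, using Python's standard library (pois_pmf is our own helper implementing Definition 20):

```python
from math import exp, factorial

def pois_pmf(x, lam):
    # P(X = x) for X ~ Poisson(lam), as in Definition 20.
    return exp(-lam) * lam**x / factorial(x)

lam = 5

# 1. For a Poisson random variable, E(X) = Var(X) = lambda = 5.

# 2. P(X = 0) = e^{-5}, the analogue of R's dpois(0, 5).
p_eq_0 = pois_pmf(0, lam)

# 3. P(X <= 5), the analogue of R's ppois(5, 5).
p_le_5 = sum(pois_pmf(x, lam) for x in range(6))

# 4. P(X >= 10) = 1 - P(X <= 9).
p_ge_10 = 1 - sum(pois_pmf(x, lam) for x in range(10))
```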

Figure 9: The p.m.f. and c.d.f. for X ∼ Poisson(5).

Note that Theorem 13 suggests that for large n and small p, the Bin(n, p) distribution can be well approximated by the Poisson(np) distribution.
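This approximation is easy to check numerically. The sketch below (Python standard library; the helper functions are our own) compares the Bin(1000, 0.005) p.m.f. with the Poisson(5) p.m.f.:

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def pois_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

# Bin(n, p) with large n and small p, against Poisson(np):
n, p = 1000, 0.005            # np = 5
max_diff = max(abs(binom_pmf(x, n, p) - pois_pmf(x, n * p))
               for x in range(21))
# max_diff is small: the two p.m.f.s are close at every x considered.
```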

Example 38. An article published on the BBC news website¹ reported “three-fold variation” in UK bowel cancer death rates. The average death rate from bowel cancer across the UK is reported as 17.6 per 100,000. By considering 100 ‘regions’ each with the same population size of 100,000, how much variation could be due to chance alone?

¹BBC News, 2011. ‘Three-fold variation’ in UK bowel cancer death rates. Available at http://www.bbc.co.uk/news/health-14854019 [Accessed 25/8/2015].

9.4 The geometric distribution

Consider the following situations:

• Throwing darts at a dartboard until the bull’s eye is hit;

• Buying a national lottery ticket each week until the jackpot is won.

As with the binomial examples, these involve a sequence of Bernoulli trials, each of which is either a ‘success’, with probability p, or a ‘failure’, with probability 1 - p, but now there is no fixed number of trials; rather, we repeat the trials until we obtain a success.

A geometric random variable X is the number of the trial in which the first success is observed.

Definition 21. If a random variable X has a geometric distribution, with parameter p (the probability of a success in any single trial), then the probability mass function of X is given by

p_X(x) = (1 - p)^{x-1} p,

for x ∈ N and 0 otherwise. We write

X ∼ Geometric(p).

Theorem 15. (Cumulative distribution function of a geometric random variable)

If X ∼ Geometric(p) and x is a non-negative integer then

F_X(x) = 1 - (1 - p)^x.

Note that as x → ∞, (1 - p)^x → 0, so that

\sum_{x=1}^{∞} p_X(x) = F_X(∞) = 1,

confirming that the probability mass function is valid.

Theorem 16. (Expectation and variance of a geometric random variable)

If X ∼ Geometric(p) then

E(X) = \frac{1}{p}, \qquad Var(X) = \frac{1-p}{p^2}.

Example 39. In Section 5.1, we calculated that the probability of winning the National Lottery jackpot with a single ticket is p = 1/45057474. Suppose I buy one ticket per week. Let X be the week number in which I first win the jackpot.

1. What is the expectation of X?

2. What is the probability that I don’t win at any time in the next 50 years?
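A numerical sketch of the calculation in Python (the 50-years-as-2600-weekly-draws convention is our own assumption):

```python
# Probability of winning the jackpot with one ticket (from Section 5.1):
p = 1 / 45057474

# 1. E(X) = 1/p: on average, the first win comes in week 45057474
#    (over 860,000 years of weekly play).
expected_week = 1 / p

# 2. Taking 50 years as 50 * 52 = 2600 weekly draws,
#    P(no win in 50 years) = P(X > 2600) = (1 - p)^2600.
p_no_win = (1 - p) ** 2600
```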

10 Continuous random variables

10.1 Introduction

We continue with the study of random variables, but now consider the case where the range R_X of a random variable X is either the real line R, or an interval (or collection of intervals) contained in R. Examples of continuous random variables are

• the total rainfall (in cm) in Sheffield tomorrow;

• your current blood pressure level (in millimetres of mercury);

• the time in seconds for a runner to complete the London Marathon.

(In practice, measurements of these quantities would actually be discrete, but it’s usually more convenient (and harmless) to treat them as continuous.)

10.2 Why make the distinction between continuous and discrete random variables?

Let’s try to write down a probability mass function for a continuous random variable. Consider again the spinning roulette wheel example from section 1:

Example 40. A (European) roulette wheel is spun. Consider the angle between the horizontal axis (viewing the wheel from above) and a line from the centre of the wheel that bisects the zero on the wheel. Assuming that any angle is equally likely, (i) what is the probability that this angle is π/2? (ii) What is the probability that this angle is between π/2 and π?

Denote the random angle by X. This is a continuous random variable with range R_X = [0, 2π]. We assume all possible values of X to be ‘equally likely’, so that P(X = a) = P(X = b) for any a, b ∈ [0, 2π]. If we write P(X = x) = k for some constant value k, what value would k be? Now, P{X ∈ [0, 2π]} = 1, so

1 = P{X ∈ [0, 2π]} \overset{?}{=} \sum_{x ∈ R_X} P(X = x) = \sum_{x ∈ R_X} k.

But there are (uncountably) infinitely many different x in the set R_X, so how can we sum k an infinite number of times and get 1? (And how would we even write down a sum of uncountably many values?) Clearly, this isn’t going to work. In fact, we’ve already dealt with this problem earlier in the course, when we considered the problem of a randomly generated angle, and defined probability as a measure. We will repeat the discussion here and extend it to consider continuous random variables in general.

10.3 Probability measures for continuous random variables

Recall that in section 7.2 we said that a probability mass function for a discrete random variable X could be thought of as defining a probability measure m_X on R_X, such that for a set of interest A we have P(X ∈ A) = m_X(A). We consider how to do something similar in a continuous setting.

For the example above, we want our range to be R_X = [0, 2π]. We don’t want one part of the circle to be favoured over any other, so for two intervals of the same length [a, a + w] and [b, b + w] (with a, a + w, b, b + w ∈ R_X) we would like the probabilities that X is in each of them to be the same.

We can achieve this by considering the measure m_X defined by

m_X([a, b]) = \frac{b - a}{2π}.    (8)

(This is the Lebesgue measure, divided by 2π.)

Assume that this measure defines the distribution of X, so that P(X ∈ A) = m_X(A). Then, as we wanted, any two intervals of the same width as above will be equally likely:

m_X([a, a + w]) = m_X([b, b + w]) = \frac{w}{2π}.

However, any single value has zero probability:

P(X = a) = m_X([a, a]) = \frac{a - a}{2π} = 0.

We can now answer the questions in Example 40 without difficulty. We have P(X = π/2) = 0 and

P(π/2 ≤ X ≤ π) = m_X([π/2, π]) = \frac{π - π/2}{2π} = 0.25.

One way to think of the measure defined in (8) is as an area under a curve. If we draw the ‘curve’ y = 1/(2π) for x ∈ [0, 2π], and y = 0 otherwise, then m_X([a, b]) is the area under the curve between x = a and x = b. An illustration is given in Figure 10. Again, as the area between x = a and x = a is zero, this reinforces the point that P(X = a) = 0.

Figure 10: P(X ∈ [π/2, π]) is given by the area under the curve y = 1/(2π) between x = π/2 and x = π.

The ‘area under the curve’ interpretation suggests that we can construct other valid probability measures, by drawing different curves. We can consider any function f(x) with f(x) ≥ 0 for all x, such that the total area under the curve is 1. Another example is given in Figure 11, for a random variable X with range R_X = [0, 10], using the curve

f(x) = \frac{3x(10 - x)}{500},

for x ∈ [0, 10], and f(x) = 0 otherwise. This measure will give more probability to X lying in the interval [4, 6] than the interval [0, 2], say, even though the two intervals have the same width.
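We can check numerically that this f is a valid p.d.f. and that [4, 6] receives more probability than [0, 2]. A sketch using a simple trapezoidal rule in Python (integrate is our own helper):

```python
def f(x):
    # The curve from Figure 11, defined to be 0 outside [0, 10].
    return 3 * x * (10 - x) / 500 if 0 <= x <= 10 else 0.0

def integrate(g, a, b, n=10000):
    # Simple trapezoidal-rule approximation to the integral of g over [a, b].
    h = (b - a) / n
    return h * ((g(a) + g(b)) / 2 + sum(g(a + i * h) for i in range(1, n)))

total_area = integrate(f, 0, 10)   # close to 1: f is a valid p.d.f.
p_4_6 = integrate(f, 4, 6)         # P(X in [4, 6]); analytically 0.296
p_0_2 = integrate(f, 0, 2)         # P(X in [0, 2]); analytically 0.104
```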

Figure 11: P(X ∈ [2, 4]) is given by the area under the curve y = 3x(10 - x)/500 between x = 2 and x = 4.

Based on this, we make the following definition.

Definition 22. A probability density function (p.d.f. for short) f_X is a function from R to R such that both f_X(x) ≥ 0 for all x, and

\int_{-∞}^{∞} f_X(t) dt = 1.

A random variable X with p.d.f. f_X has the property that

P(a ≤ X ≤ b) = \int_{a}^{b} f_X(t) dt.

We have exactly the same definition as in Section 7.3 for the cumulative distribution function:

F_X(x) := P(X ≤ x),

but for a continuous random variable, this is calculated as an area under a curve instead of a summation (see (7) in Section 7.3):

F_X(x) = \int_{-∞}^{x} f_X(t) dt,

where f_X is the p.d.f. of X. So, whereas a discrete random variable has a probability mass function, a continuous random variable has a probability density function.

From the definitions, it follows that

\frac{d}{dx} F_X(x) = f_X(x).

In summary, for the distribution of a continuous random variable, we use ‘area under a curve’ as a probability measure, and the curve that we choose for a particular random variable X is called the probability density function of X. Note that the condition

\int_{-∞}^{∞} f_X(t) dt = 1

applies because this integral represents P(-∞ < X < ∞). Note also that

P(X = a) = P(a ≤ X ≤ a) = \int_{a}^{a} f_X(x) dx = 0,

and that

P(X < a) = P(X ≤ a) - P(X = a) = P(X ≤ a).

Example 41. Let X be a random variable with p.d.f. given by f_X(x) = k e^{-x} for x ∈ [0, 2], and 0 otherwise.

1. Find the value of k.

2. Derive the cumulative distribution function.

3. Calculate the probability that X ≤ 0.5.
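As a numerical sketch of the answers (Python, standard library only): solving 1 = ∫₀² k e^{-x} dx = k(1 - e^{-2}) gives k, and the c.d.f. follows by integrating the p.d.f.

```python
from math import exp

# Part 1: 1 = k(1 - e^{-2}), so k = 1/(1 - e^{-2}).
k = 1 / (1 - exp(-2))

def cdf(x):
    # Part 2: F_X(x) = k(1 - e^{-x}) on [0, 2]; 0 below, 1 above.
    if x < 0:
        return 0.0
    if x > 2:
        return 1.0
    return k * (1 - exp(-x))

# Part 3: P(X <= 0.5) = F_X(0.5).
p_le_half = cdf(0.5)
```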

10.4 Aside: integrating functions with split definitions

This section will not be lectured, but should be read.

Many probability density functions which you will encounter are defined using different formulae for different regions of R; this is the case in Example 41, where f_X is defined as k e^{-x} for x ∈ [0, 2] but otherwise is defined to be zero. To work with these probability density functions you will need to know how to integrate functions which are defined in this way.

Consider calculating the definite integral

\int_{a}^{c} f(x) dx,

where f(x) has such a split definition:

f(x) = \begin{cases} f_1(x) & a ≤ x < b \\ f_2(x) & b ≤ x ≤ c. \end{cases}

(The particular combination of < and ≤ at b does not affect the principle here.) Recall that the integral here is the (signed) area under the curve of f(x) between a and c. This area can be split up into the area under the curve between a and b, and that between b and c. Thinking about the integral in this way, we get

\int_{a}^{c} f(x) dx = \int_{a}^{b} f_1(x) dx + \int_{b}^{c} f_2(x) dx,

so the way to calculate the original integral is to split it up into separate definite integrals corresponding to the different parts of the definition of f.

The following example shows this in practice for a probability density function similar to that in Example 41.

Example 42. Let X be a random variable with p.d.f. given by f_X(x) = k e^{-|x|} for x ∈ [-1, 1], and 0 otherwise. Find the value of k.

Here we need to find

\int_{-1}^{1} f_X(x) dx,

and by considering the modulus we know that f_X(x) = k e^{x} for x ∈ [-1, 0] and f_X(x) = k e^{-x} for x ∈ [0, 1]. Hence we can write

\int_{-1}^{1} f_X(x) dx = \int_{-1}^{0} k e^{x} dx + \int_{0}^{1} k e^{-x} dx,

and evaluating the two integrals on the right separately gives that

\int_{-1}^{1} f_X(x) dx = k(1 - e^{-1}) + k(1 - e^{-1}) = 2k(1 - e^{-1}).

Hence k = \frac{1}{2(1 - e^{-1})}. This is illustrated in Figure 12.
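As a quick numerical check of this value of k (Python; the trapezoidal grid and tolerance are our own choices):

```python
from math import exp

k = 1 / (2 * (1 - exp(-1)))   # the value derived above

def f(x):
    # The p.d.f. of Example 42: k e^{-|x|} on [-1, 1], 0 otherwise.
    return k * exp(-abs(x)) if -1 <= x <= 1 else 0.0

# Trapezoidal-rule check that the density integrates to 1 over [-1, 1].
n = 20000
h = 2 / n
area = h * ((f(-1) + f(1)) / 2 + sum(f(-1 + i * h) for i in range(1, n)))
```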

11 Expectation and variance of continuous random variables

Definition 23. For a continuous random variable X, the expectation of g(X), for some function g defined on R_X, is defined as

E{g(X)} := \int_{-∞}^{∞} g(x) f_X(x) dx.

Setting g(X) = X gives

E(X) = \int_{-∞}^{∞} x f_X(x) dx,

Figure 12: Integral for Example 42. The integral is the sum of an integral from -1 to 0 (light shading) and an integral from 0 to 1 (dark shading).

and we again use the notation

µX := E(X),

with µX referred to as the mean of X.

Variance has the same definition as before, but it is now calculated using integration:

Var(X) := E{(X - µ_X)^2} = \int_{-∞}^{∞} (x - µ_X)^2 f_X(x) dx.

Note: for a continuous random variable X, the identity

Var(X) = E(X^2) - E(X)^2

still holds, so, as with discrete random variables, we can calculate Var(X) by calculating E(X) and E(X^2).

All the properties of expectation and variance that we met in Section 8 still hold:

E(aX + b) = aE(X) + b,

E(X + Y ) = E(X) + E(Y ),

Var(aX + b) = a2 Var(X),

and, if X and Y are independent,

E(XY ) = E(X)E(Y ),

Var(X + Y ) = Var(X) + Var(Y ).

Example 43. Calculate the expectation and variance of the random variable defined in Example 41.
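A numerical sketch for Example 43 (Python; integrate is our own trapezoidal helper, and k comes from the normalisation in Example 41):

```python
from math import exp

# The random variable of Example 41: f_X(x) = k e^{-x} on [0, 2],
# with k = 1/(1 - e^{-2}) from the normalisation condition.
k = 1 / (1 - exp(-2))

def f(x):
    return k * exp(-x) if 0 <= x <= 2 else 0.0

def integrate(g, a, b, n=20000):
    # Simple trapezoidal-rule approximation over [a, b].
    h = (b - a) / n
    return h * ((g(a) + g(b)) / 2 + sum(g(a + i * h) for i in range(1, n)))

mean = integrate(lambda x: x * f(x), 0, 2)
second_moment = integrate(lambda x: x**2 * f(x), 0, 2)
variance = second_moment - mean**2   # Var(X) = E(X^2) - E(X)^2
```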

12 Standard continuous probability distributions

We consider three standard probability distributions for continuous random variables: the exponential distribution, the uniform distribution, and the normal distribution.

12.1 The exponential distribution

The exponential distribution is used to represent a ‘time to an event’. Examples of ‘experiments’ that we might describe using an exponential random variable are

• a patient with heart disease is given a drug, and we observe the time until the patient’s next heart attack;

• a new car is bought and we observe how many miles the car is driven before it has its first breakdown.

Definition 24. If a random variable X has an exponential distribution, with rate parameter λ, then its probability density function is given by

f_X(x) = λ e^{-λx},

for x ≥ 0, and 0 otherwise. We write

X ∼ Exp(rate = λ), or just X ∼ Exp(λ),

to mean “X has an exponential distribution with rate parameter λ”.

Theorem 17. (Cumulative distribution function of an exponential random variable)

If X ∼ Exp(λ), then for x ≥ 0

F_X(x) = 1 - e^{-λx}.

We can see that lim_{x→∞} F_X(x) = 1 (so that “F_X(∞) = 1”), so that, as required of a p.d.f.,

\int_{-∞}^{∞} f_X(x) dx = \int_{0}^{∞} λ e^{-λx} dx = 1.

We plot both the p.d.f. and c.d.f. of an Exp(2) random variable in Figure 13.

Theorem 18. (Expectation and variance of an exponential random variable)

If X ∼ Exp(λ), then

E(X) = \frac{1}{λ}, \qquad Var(X) = \frac{1}{λ^2}.

Figure 13: The p.d.f. (left plot) and c.d.f. (right plot) of an exponential random variable X with rate parameter λ = 2.

Theorem 19. (The ‘lack of memory’ property of an exponential random variable)

If X ∼ Exp(λ), then

P(X > x + a | X > a) = P(X > x).

In other words, exponential random variables have the interesting property that they ‘forget’ how ‘old’ they are. If the lifetime of some object has an exponential distribution, and the object survives from time 0 to time a, it will ‘carry on’ as if it was starting at time 0.
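The lack-of-memory property is easy to verify numerically from the survival function P(X > x) = e^{-λx} given by Theorem 17 (a Python sketch; the values of λ, x and a are arbitrary choices):

```python
from math import exp

lam = 2.0          # an arbitrary rate parameter

def surv(x):
    # Survival function of Exp(lam): P(X > x) = e^{-lam x}, from Theorem 17.
    return exp(-lam * x)

x, a = 0.7, 1.5    # arbitrary test values
# P(X > x + a | X > a) = P(X > x + a) / P(X > a) ...
cond = surv(x + a) / surv(a)
# ... which equals P(X > x): the distribution 'forgets' the first a units.
```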

Example 44. A computer is left running continuously until it first develops a fault. The time until the fault, X, is to be modelled with an exponential distribution. The expected time until the first fault is 100 days.

1. If X ∼ Exp(λ), determine the value of λ. What is the standard deviation of X?

2. What is the probability that the computer develops a fault within the first 100 days?

3. If the computer is still working after 100 days, what is the probability that it will still be working after 150 days?
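A numerical sketch of the three answers (Python, standard library only):

```python
from math import exp, sqrt

# 1. E(X) = 1/lambda = 100 days, so lambda = 0.01;
#    the standard deviation is sqrt(1/lambda^2) = 1/lambda = 100 days.
lam = 1 / 100
sd = sqrt(1 / lam**2)

# 2. P(X <= 100) = 1 - e^{-lam * 100} = 1 - e^{-1}.
p_fault_by_100 = 1 - exp(-lam * 100)

# 3. By the lack-of-memory property,
#    P(X > 150 | X > 100) = P(X > 50) = e^{-lam * 50} = e^{-0.5}.
p_still_working = exp(-lam * 50)
```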

Example 45. Suppose the number of earthquakes N_t in an interval [0, t] has a Poisson(φt) distribution, for any value of t. Recall that if X ∼ Poisson(λ),

p_X(x) = P(X = x) = \frac{e^{-λ} λ^x}{x!}.

Let T be the time until the first earthquake. What is the cumulative distribution function of T? What is the distribution of T?

12.1.1 Probability density functions and histograms

There is a connection between the idea of a probability density function, where area represents probability, and a histogram, where area represents observed data. We illustrate this using simulations of the exponential distribution.

We can simulate a sample of n independent random values from an Exp(1) distribution and draw a histogram of them using the R code

x<-rexp(n)

hist(x,freq=FALSE)

(You will need to first define n. The freq=FALSE option tells R to plot the histogram with the areas normalised so that the total area is 1.) With

large values of n, the shape of the plot should show a close resemblance to the probability density function of Exp(1), which is e^{-x} for x > 0. This is illustrated in Figure 14 for n = 10000.

Figure 14: Left: histogram from R of 10000 independent values from an Exp(1) distribution. Right: the probability density function of the same distribution.

12.2 The uniform distribution

The uniform distribution is used to describe a random variable that is constrained to lie in some interval [a, b], but has the same probability of lying in any interval contained within [a, b] of a fixed width. The uniform distribution is an important concept in probability theory, but it is less useful for modelling uncertainty in the real world; it is not often plausible in real situations that all intervals of the same width are equally likely.

Definition 25. If a random variable X has a uniform distribution over the interval [a, b], then its probability density function is given by

f_X(x) = \frac{1}{b - a},

for x ∈ [a, b], and 0 otherwise. We write

X ∼ U[a, b],

to mean “X has a uniform distribution over the interval [a, b].”

Theorem 20. (Cumulative distribution function of a uniform random variable)

If X ∼ U[a, b], then for x ∈ [a, b]

F_X(x) = \frac{x - a}{b - a}.

Plotting the c.d.f. between x = a and x = b will give a straight line, joining the points (a, 0) and (b, 1). We plot the p.d.f. and c.d.f. in Figure 15.

Theorem 21. (Expectation and variance of a uniform random variable)

If X ∼ U[a, b], then

E(X) = \frac{a + b}{2}, \qquad Var(X) = \frac{(b - a)^2}{12}.

Example 46. Let X ∼ U[-1, 1]. Calculate E(X), Var(X) and P(X ≤ -0.5 | X ≤ 0).
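A numerical sketch (Python; cdf implements Theorem 20 for this interval):

```python
a, b = -1.0, 1.0

expectation = (a + b) / 2        # E(X) = (a + b)/2 = 0
variance = (b - a) ** 2 / 12     # Var(X) = (b - a)^2/12 = 1/3

def cdf(x):
    # F_X(x) = (x - a)/(b - a) on [a, b], clipped to [0, 1] outside.
    return min(max((x - a) / (b - a), 0.0), 1.0)

# P(X <= -0.5 | X <= 0) = P(X <= -0.5) / P(X <= 0) = 0.25 / 0.5.
cond = cdf(-0.5) / cdf(0.0)
```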

Figure 15: The p.d.f. (left plot) and c.d.f. (right plot) of a uniform random variable X over the interval [10, 20].

12.3 The standard normal distribution

The normal distribution is a very important distribution in both probability and statistics. Before studying it, we first introduce the Gaussian integral:

\int_{-∞}^{∞} e^{-x^2} dx = \sqrt{π}.    (9)

(A proof is given in Applebaum (2008), though you will need to understand changes of variables within double integration.)

We first define the standard normal distribution, before considering the more general case.

Definition 26. If a random variable Z has a standard normal distribution, then its probability density function is given by

f_Z(z) = \frac{1}{\sqrt{2π}} \exp\left(-\frac{z^2}{2}\right).

We write

Z ∼ N(0, 1),

to mean “Z has the standard normal distribution.”

We can use (9) to confirm that this is a valid p.d.f. Starting with the Gaussian integral, we make the substitution x = z/\sqrt{2}, so that dx/dz = 1/\sqrt{2}: (9) immediately gives

\int_{-∞}^{∞} \frac{1}{\sqrt{π}} e^{-x^2} dx = 1,

and the substitution then gives

\int_{-∞}^{∞} \frac{1}{\sqrt{2π}} e^{-z^2/2} dz = 1.

The p.d.f. is plotted in Figure 16 (left plot). Note the distinctive ‘bell-shaped’ curve and the symmetry about z = 0.

12.3.1 The cumulative distribution function of a standard normal random variable

The c.d.f. is

F_Z(z) = P(Z ≤ z) = \int_{-∞}^{z} \frac{1}{\sqrt{2π}} \exp\left\{-\frac{t^2}{2}\right\} dt.

However, we can’t evaluate this integral analytically, and have to use numerical methods. There are various statistical tables available that give the

value of F_Z(z) for different z, but these have now largely been superseded by modern computing packages, and we will see how to calculate F_Z(z) using R in Section 12.4.3. The c.d.f. is plotted in Figure 16 (right plot).

Figure 16: The p.d.f. (left plot) and c.d.f. (right plot) of a standard normal random variable. Note the ‘bell shape’ of the p.d.f.; the p.d.f. is sometimes referred to as the ‘bell-shaped curve’.

The notation Φ is commonly used to represent the c.d.f., and φ to represent the p.d.f.:

Φ(z) := P(Z ≤ z) = \int_{-∞}^{z} \frac{1}{\sqrt{2π}} \exp\left(-\frac{t^2}{2}\right) dt,

φ(z) := \frac{1}{\sqrt{2π}} \exp\left(-\frac{z^2}{2}\right).

Then

\frac{d}{dz} Φ(z) = φ(z).

Theorem 22. (Relationship between Φ(z) and Φ(−z).)

Φ(−z) = 1− Φ(z).

This can be seen in Figure 17.
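Although Φ has no closed form, Python’s math.erf gives it via Φ(z) = (1 + erf(z/√2))/2, which lets us check Theorem 22 numerically (a sketch; the test points are arbitrary):

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal c.d.f. via the error function:
    # Phi(z) = (1 + erf(z / sqrt(2))) / 2.
    return 0.5 * (1 + erf(z / sqrt(2)))

# Phi(-z) = 1 - Phi(z) at a few (arbitrary) test points:
diffs = [abs(Phi(-z) - (1 - Phi(z))) for z in (0.5, 1.0, 1.96, 3.0)]
```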

Figure 17: As f_Z is symmetric about z = 0, the area under the curve between -∞ and -z is the same as the area under the curve between z and ∞.

We denote the quantile function by Φ^{-1}. If we want z such that P(Z ≤ z) = α, then we write

Φ(z) = α \quad \Longleftrightarrow \quad z = Φ^{-1}(α).

Theorem 23. (Expectation and variance of a standard normal random variable)

If Z ∼ N(0, 1), then

E(Z) = 0

Var(Z) = 1.

12.4 The normal distribution: the general case

The standard normal distribution is one example of the family of normal distributions, in which the mean is 0 and the variance is 1, but, in general, normal random variables can have any values for the mean and variance (though variances cannot be negative, of course).

Normal distributions are used very widely in many situations, for example:

• many physical characteristics of humans and other animals, for example the distribution of heights of females in a particular age group, can be well represented with a normal distribution;

• scientists often assume that ‘measurement errors’ are normally distributed;

• normal distributions are commonly used in finance to model changes in stock prices (though not always sensibly!).

Some idea of why the normal distribution is so important will be given later in the course, in the section on the Central Limit Theorem.

Definition 27. If a random variable X has a normal distribution with mean µ and variance σ^2, then its probability density function is given by

f_X(x) = \frac{1}{\sqrt{2πσ^2}} \exp\left\{-\frac{1}{2σ^2}(x - µ)^2\right\}.

We write

X ∼ N(µ, σ^2),

to mean “X has a normal distribution with mean µ and variance σ2.”

Immediately, we can see that by setting µ = 0 and σ^2 = 1 in Definition 27, we get the standard normal p.d.f. in Definition 26.

Theorem 24. (Definition of a general normal random variable via transformation of a standard normal random variable)

Let Z ∼ N(0, 1), and define

X = µ + σZ.

Then E(X) = µ, Var(X) = σ^2 and

X ∼ N(µ, σ^2).

12.4.1 Summary

It’s worth stating again the relationship between a standard normal random variable Z and a ‘general’ normal random variable X.

• Given Z ∼ N(0, 1), we can obtain X ∼ N(µ, σ^2) by transforming Z:

X = µ + σZ.

• Given X ∼ N(µ, σ^2), we can obtain Z ∼ N(0, 1) by transforming X:

Z = \frac{X - µ}{σ},

and we refer to transforming X to get a standard N(0, 1) random variable as standardising X.


Traditionally, we would calculate the c.d.f. of X via standardising and using the Φ(z) function:

P(X ≤ x) = P((X − µ)/σ ≤ (x − µ)/σ) = Φ((x − µ)/σ),

where Φ(z) is given in statistical tables for various values of z. As discussed before, statistical tables have become largely obsolete given computer packages such as R, although the technique of standardising is still computationally useful.
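Since standardising is exactly how normal probabilities are computed without tables, here is a minimal standard-library Python sketch (the helper names phi and normal_cdf are my own, not part of the module):

```python
import math

def phi(z):
    """Standard normal c.d.f., Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), by standardising."""
    return phi((x - mu) / sigma)

# e.g. P(X <= -1) for X ~ N(1, 4) standardises to Phi(-1), about 0.1587
p = normal_cdf(-1.0, 1.0, 2.0)
```

The value agrees with R's pnorm(-1, 1, 2), used later in this section.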

12.4.2 Visualising the mean and variance

By plotting the density function, we can see the effect of changing the value of µ and σ² and so interpret these parameters more easily. Starting with the mean, we see in Figure 18 that if X ∼ N(µ, σ²), then the maximum of the p.d.f. is at x = µ. If we change µ whilst leaving σ² unchanged (Figure 18, top plot), the p.d.f. ‘shifts’ along the x-axis, but the shape of the p.d.f. is unchanged.

The variance parameter σ² determines how ‘spread out’ the p.d.f. is. If we increase σ², whilst leaving µ unchanged (Figure 18, bottom plot), the peak of the p.d.f. is in the same place, but we get a flatter curve. This is to be expected, remembering that random variables with larger variances are more likely to be further away from their expectations.

12.4.3 The normal distribution in R

R will calculate the p.d.f., c.d.f. and quantile functions, and will also generate normal random variables. Note that in R, we specify the standard deviation rather than the variance.

Figure 18: Top plot: p.d.f.s for the N(0, 1) and N(2, 1) distributions. Bottom plot: p.d.f.s for the N(0, 1) and N(0, 4) distributions.

• Calculate the pdf: dnorm(x,mu,sigma)
Example: calculate fX(2) when X ∼ N(1, 4).

> dnorm(2,1,2)

[1] 0.1760327

• Calculate the cdf: pnorm(x,mu,sigma)
Example: calculate FX(−1) = P(X ≤ −1) when X ∼ N(1, 4).

> pnorm(-1,1,2)

[1] 0.1586553


• Invert the c.d.f. to find the α quantile: qnorm(alpha,mu,sigma)
Example: if X ∼ N(0, 1), what value of x satisfies the equation FX(x) = P(X ≤ x) = 0.95?

> qnorm(0.95,0,1)

[1] 1.644854

Check:
> pnorm(1.644854,0,1)

[1] 0.95

• Generate m random observations from a normal distribution: rnorm(m,mu,sigma)
Example: generate 3 random observations from the N(15, 25) distribution.

> rnorm(3,15,5)

[1] 6.985971 20.671469 11.637691

Example 47. If X ∼ N(3, 4), what is the 25th percentile of the distribution of X?
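A hedged numerical check for Example 47 (the bisection inverter phi_inv is my own helper, standing in for R's qnorm): the 25th percentile of N(3, 4) is 3 + 2z₀.₂₅, where Φ(z₀.₂₅) = 0.25.

```python
import math

def phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_inv(p):
    """Invert Phi by bisection; fine for 0 < p < 1."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if phi(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

# 25th percentile of N(3, 4): x = mu + sigma * z_{0.25}, about 1.651
x = 3.0 + 2.0 * phi_inv(0.25)
```

R's qnorm(0.25, 3, 2) gives the same value.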

Example 48. (from Ross, 2010). An expert witness in a paternity suit testifies that the length, in days, of human gestation is approximately normally distributed, with mean 270 days and standard deviation 10 days. The defendant has proved that he was out of the country during a period between 290 days before the birth of the child and 240 days before the birth of the child, so if he is the father, the gestation period must have either exceeded 290 days, or been shorter than 240 days. How likely is this?
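A sketch of the computation for Example 48, standardising with µ = 270 and σ = 10 (plain Python in place of R; the helper phi is mine):

```python
import math

def phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Standardising: X > 290 becomes Z > 2, and X < 240 becomes Z < -3.
p = (1.0 - phi(2.0)) + phi(-3.0)   # about 0.0241, so quite unlikely
```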


12.4.4 The two-σ rule

For a standard normal random variable Z,

P (−1.96 ≤ Z ≤ 1.96) = 0.95.

Since E(Z) = 0 and Var(Z) = 1, the probability of Z being within two standard deviations of its mean value is approximately 0.95 (ie P(−2 ≤ Z ≤ 2) = 0.9545 to 4 d.p.). We illustrate this in Figure 19.

Figure 19: The p.d.f. of a N(0, 1) random variable (mean 0, standard deviation 1). There is a 95% probability that a normal random variable will lie within 1.96 standard deviations of its mean.

If we now consider any normal random variable X ∼ N(µ, σ²), the probability that it will lie within a distance of two standard deviations from its mean is approximately 0.95. This is straightforward to verify:

P(|X − µ| ≤ 1.96σ) = P(µ − 1.96σ ≤ X ≤ µ + 1.96σ)
                  = P(−1.96 ≤ (X − µ)/σ ≤ 1.96)
                  = P(−1.96 ≤ Z ≤ 1.96)
                  = 0.95,

(with P(|X − µ| ≤ 2σ) = 0.9545 to 4 d.p.).

In Statistics, there is a convention of using 0.05 as a threshold for a ‘small’ probability (more of this in Semester 2), though the choice of 0.05 is arbitrary. However, the two-σ rule is an easy-to-remember fact about normal random variables, and can be a useful yardstick in various situations.
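The two-σ rule is easy to confirm numerically; a standard-library Python sketch (phi is my own helper for Φ):

```python
import math

def phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_196 = phi(1.96) - phi(-1.96)   # P(-1.96 <= Z <= 1.96), about 0.9500
p_two = phi(2.0) - phi(-2.0)     # P(-2 <= Z <= 2), about 0.9545
```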

13 Multivariate discrete random variables

13.1 Introduction

There are various scenarios in which we may be interested in the joint behaviour of multiple random variables. For example:

• A share portfolio is made up of stocks in several companies listed on the FTSE 100 index. How likely is it that all the stocks lose value over some time period?

• A department is recruiting students onto three courses, but can only accommodate a fixed number of students in total. How likely is it that the total number of students exceeds the limit?

80

Page 81: MAS113 Introduction to Probability and Statistics 1 ...theory. We have two ingredients: set theory, which we will use to describe what sort of things we want to measure our uncertainty

13.2 Joint probability mass functions and cumulative distribution functions

Definition 28. Suppose we have two discrete random variables X and Y. We define the joint probability mass function (p.m.f.) to be

pX,Y (x, y) := P{(X = x) ∩ (Y = y)},

which we also write as P(X = x, Y = y). In this setting, pX and pY are referred to as the marginal probability mass functions.

A valid joint probability mass function must satisfy the conditions pX,Y (x, y) ≥ 0 for all x and y, and

∑x∈RX ∑y∈RY pX,Y (x, y) = 1.

Example 49. A total prize of 10 pounds is to be split between two players. Each player decides, in secret, whether to share or steal the prize. The rules are:

• if both players choose to share, each player gets 5 pounds;

• if both players choose to steal, each player gets nothing;

• if one player chooses to steal, and the other chooses to share, the player who chooses to steal gets 10 pounds; the other player gets nothing.

Player A tosses a coin, and will choose to share if the outcome is a head, and steal if the outcome is a tail. Player B rolls a 6-sided die, and will choose to share if the outcome is 1 to 5, and steal if the outcome is a 6. Let X be Player A's winnings, and let Y be Player B's winnings (both in pounds).


1. Tabulate the joint probability mass function of X and Y, over the ranges of X and Y.

2. Find the marginal probability mass functions of X and Y .

3. If it is known that X = 5, find the probability that Y = 5.
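As a sketch, the joint p.m.f. asked for in part 1 can be tabulated with exact arithmetic (the encoding of the sharing rules below is my own reading of the example):

```python
from fractions import Fraction
from itertools import product

# Choice probabilities (my encoding of the coin and die rules).
p_a = {"share": Fraction(1, 2), "steal": Fraction(1, 2)}   # Player A: coin
p_b = {"share": Fraction(5, 6), "steal": Fraction(1, 6)}   # Player B: die

def winnings(a, b):
    """(X, Y): the winnings of players A and B, in pounds."""
    if a == b:
        return (5, 5) if a == "share" else (0, 0)
    return (10, 0) if a == "steal" else (0, 10)

joint = {}
for a, b in product(p_a, p_b):
    xy = winnings(a, b)
    joint[xy] = joint.get(xy, Fraction(0)) + p_a[a] * p_b[b]
```

Summing rows and columns of this table gives the marginal p.m.f.s of part 2.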

In the above example, we calculated the joint probabilities pX,Y (x, y) = P(X = x, Y = y) ‘directly’. Sometimes, however, it is easier to calculate the joint probability via a marginal and a conditional probability:

pX,Y (x, y) = P (X = x, Y = y) = P (X = x)P (Y = y|X = x).

(Go back and re-read Section 6.2). Here is an illustration.

Example 50. Two people are living in one house, and in Week 1, are both at risk of catching a cold.

Suppose that each person has a probability of 0.1 of catching a cold in Week 1, independently of each other.

If no-one catches a cold, the probabilities of catching a cold are unchanged for Week 2. If exactly one person catches a cold in Week 1, the remaining ‘uninfected’ housemate has a probability of 0.2 of catching a cold in Week 2.

Assume that no-one will catch a cold twice (eg if both housemates catch a cold in Week 1, no-one can catch a cold in Week 2). Tabulate the joint probability mass function of the number of housemates who catch a cold in each of Weeks 1 and 2.
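This example is a natural fit for the marginal-times-conditional identity above; a hedged sketch in Python with exact fractions (the encoding of the rules is my own):

```python
from fractions import Fraction

p1 = Fraction(1, 10)   # chance each person catches a cold in Week 1

def two_person_pmf(q):
    """P.m.f. of the number infected out of two, each independently w.p. q."""
    return {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}

week1 = two_person_pmf(p1)

def week2_given(n1):
    """Conditional p.m.f. of Week 2 colds given n1 colds in Week 1."""
    if n1 == 0:
        return two_person_pmf(p1)          # probabilities unchanged
    if n1 == 1:
        q = Fraction(2, 10)                # one 'uninfected' housemate left
        return {0: 1 - q, 1: q}
    return {0: Fraction(1)}                # no-one left to infect

# joint p.m.f. as marginal x conditional, as in the displayed identity
joint = {(n1, n2): week1[n1] * pr
         for n1 in week1 for n2, pr in week2_given(n1).items()}
```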


14 Covariance and correlation

14.1 Covariance

For two dependent random variables, it can be useful to quantify the ‘strength’ of the dependence. We do so via covariance:

Definition 29. The covariance between two random variables X and Y is defined to be

Cov(X, Y) := E{(X − µX)(Y − µY)} = ∑x∈RX ∑y∈RY (x − µX)(y − µY) pX,Y (x, y).

To calculate a covariance, we can use the result that

Cov(X, Y ) = E(XY )− E(X)E(Y ).

This can be proved using the expectation result of Theorem 8 and the definition of a covariance. Note that this result combined with Theorem 9 immediately tells us that if X and Y are independent their covariance is zero.

The value of the covariance will depend on the units of measurement used for X and Y. To get a measure of dependence that is scale independent, we transform the covariance to get the correlation.

Definition 30. The correlation between two random variables is defined as

Cor(X, Y) := Cov(X, Y) / √(Var(X) Var(Y)).


Example 51. In Example 49, demonstrate that X and Y are not independent. Calculate Cov(X, Y) and Cor(X, Y).

The following result generalises Corollary 10, where we showed that for independent X and Y, Var(X + Y) = Var(X) + Var(Y):

Theorem 25. (The variance of X + Y )

For any two random variables X and Y ,

Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X, Y ).

Theorem 26. (Properties of covariances and correlation coefficients)

1. Cov(X,X) = Var(X).

2. Cov(aX + b, cY + d) = ac× Cov(X, Y ), for any constants a, b, c and d.

3. If X and Y are independent, then Cov(X, Y) = Cor(X, Y) = 0, but the converse does not necessarily hold.

4. Cor(X,X) = 1 and Cor(X,−X) = −1.

5. |Cor(X, Y )| ≤ 1 for any two random variables X and Y .

6. If Y = a + bX, then Cor(X, Y) = 1 if b > 0, and Cor(X, Y) = −1 if b < 0.

To see that the converse in (3) does not hold, consider the following example.

Example 52. Covariance zero but not independent
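One standard illustration (an assumption on my part; the lectures may use a different example) takes X uniform on {−1, 0, 1} and Y = X²: then Cov(X, Y) = E(X³) − E(X)E(X²) = 0, yet X and Y are clearly dependent.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X^2 (a hypothetical example of my choosing).
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}
joint = {(x, x * x): px for x, px in pmf_x.items()}

ex = sum(px * x for (x, _), px in joint.items())       # E(X) = 0
ey = sum(px * y for (_, y), px in joint.items())       # E(Y) = 2/3
exy = sum(px * x * y for (x, y), px in joint.items())  # E(XY) = E(X^3) = 0
cov = exy - ex * ey                                    # = 0
# Dependence: P(X = 1, Y = 0) = 0, but P(X = 1) P(Y = 0) = 1/9 > 0.
```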


15 The multinomial distribution

The multinomial distribution is a generalisation of the binomial distribution to the case where there are more than two possible outcomes in each ‘trial’. We have n trials as before, and k possible outcomes in each trial; we write X = (X1, X2, . . . , Xk), where Xi is the number of times (out of n) that outcome i is observed. Some scenarios which we might model with a multinomial distribution are

• 50 students take an exam, where the possible grades are A, B, C, D, E and fail. We observe how many students get each possible grade.

• In a market research survey, 100 people are asked to taste 3 different varieties of a drink, and state which one they prefer. We observe how many people choose each variety.

Definition 31. Let p1, p2, . . . , pk be non-negative numbers such that ∑ki=1 pi = 1; think of pi as being the probability that a trial falls into category i.

If X = (X1, X2, . . . , Xk) has a multinomial distribution with n trials and probabilities p1, p2, . . . , pk, then its probability mass function is given by

pX(x1, . . . , xk) = P(X1 = x1, X2 = x2, . . . , Xk = xk) = (n!/(x1! x2! · · · xk!)) p1^{x1} p2^{x2} · · · pk^{xk},

where each xi ∈ {0, . . . , n} and ∑ki=1 xi = n, and pX(x1, . . . , xk) = 0 otherwise. We write

X = (X1, X2, . . . , Xk) ∼ multinomial(n; p1, p2, . . . , pk).

Clearly, X1, . . . , Xk are dependent. For example, if we learnt that X1 = n, we must have X2 = X3 = . . . = Xk = 0.


To understand the form of the p.m.f., suppose we have n = 10, k = 3, and we want pX(5, 2, 3). We'll refer to the three categories as A, B and C. How many ways are there of choosing 5 people to be in category A, 2 people to be in category B, and 3 people to be in category C? Each one of these ways will have probability p1^5 p2^2 p3^3.

First, consider arranging the ten people in a line, and then classifying the first five as category A, the next two people as category B, and the final three as category C. There are 10 × 9 × 8 × · · · × 2 × 1 = 10! such arrangements. For any single arrangement, we can permute the first five people without changing the classification, and we can also permute the next two people without changing the classification, and so on. Hence for one arrangement, there will be 5! × 2! × 3! permutations that allocate the same people into the same categories. Therefore, the number of unique ways of choosing 5 people to be in category A, 2 people to be in category B, and 3 people to be in category C is

10!/(5! 2! 3!).

Alternatively, there are (10 choose 5) choices of 5 people to be category A. For any choice of 5, there are 10 − 5 = 5 people left, out of which there are (5 choose 2) choices of people to be category B. Once the category A's and B's are chosen, everyone left must be category C, so there are no further choices. Hence the total number of choices is

(10 choose 5) × (5 choose 2) = 10!/(5! 2! 3!).
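Both counting arguments are easy to check directly in Python (math.comb and math.factorial are standard library):

```python
from math import comb, factorial

# Direct count: permutations of 10 people, divided by within-category orderings.
direct = factorial(10) // (factorial(5) * factorial(2) * factorial(3))

# Stepwise count: choose the category A's, then the B's from who is left.
stepwise = comb(10, 5) * comb(5, 2)
```

Both give 2520.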

15.1 The marginal distribution of Xi

An easy way to derive the marginal distribution of Xi is to redefine the categories. In the exam example, with 50 students and the possible grades A, B, C, D, E and fail, suppose X2 is the number of students getting a B grade. What is the marginal distribution of X2? If we consider the categories as simply ‘B’ or ‘not B’, then it is straightforward to see that X2 has a binomial distribution:

X2 ∼ Bin(n, p2).

From the properties of the binomial distribution, we have E(Xi) = npi and Var(Xi) = npi(1 − pi).

Theorem 27. (Conditional distributions for a multinomial random variable)

For the random variable

X = (X1, X2, . . . , Xk) ∼ multinomial(n; p1, p2, . . . , pk),

suppose it is known that X1 = x1. Then

(X2, . . . , Xk)|X1 = x1 ∼ multinomial(n − x1; p2/(1 − p1), . . . , pk/(1 − p1)).

Example 53. On an exam paper, it is expected that 15% of students will get first class marks, and 40% will get 2.1 marks. If 100 students take the paper, calculate the probability that 15 students get first class marks, and 40 students get 2.1 marks. If 20 students get first class marks, calculate the probability that 40 students or more will get 2.1 marks.
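A sketch for the first probability in Example 53 (the helper multinomial_pmf is my own; it just evaluates the p.m.f. of Definition 31):

```python
from math import comb

def multinomial_pmf(counts, probs):
    """Evaluate P(X1 = x1, ..., Xk = xk) for a multinomial(n; p1, ..., pk)."""
    remaining, coef = sum(counts), 1
    for x in counts:
        coef *= comb(remaining, x)   # choose this category's members
        remaining -= x
    p = float(coef)
    for x, pr in zip(counts, probs):
        p *= pr ** x
    return p

# First probability: 15 firsts, 40 2.1s and 45 others out of n = 100,
# with category probabilities (0.15, 0.40, 0.45).
p = multinomial_pmf([15, 40, 45], [0.15, 0.40, 0.45])
```

For the second part, Theorem 27 reduces the problem to a binomial: given X1 = 20, the number of 2.1s is Bin(80, 0.4/0.85).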

16 Sums of independent and identically distributed random variables

Suppose we have n random variables X1, X2, . . . , Xn. If these each have the same probability distribution, which is the same thing as their cumulative distribution functions being the same,

P(X1 ≤ a) = P(X2 ≤ a) = · · · = P(Xn ≤ a),


then we say that they are identically distributed.

We are particularly interested in the case where X1, X2, . . . , Xn are not only identically distributed, but also independent, so that for all 1 ≤ i, j ≤ n with i ≠ j:

P{(Xi ≤ a) ∩ (Xj ≤ b)} = P(Xi ≤ a)P(Xj ≤ b).

We then say that X1, X2, . . . , Xn are independent and identically distributed, or i.i.d. for short.

We can, if we like, regard X1, X2, . . . , Xn as independent copies of some given random variable. I.i.d. random variables are very important in applications as they describe repeated experiments that are carried out under identical conditions, in which the outcome of each experiment does not affect the others.

Now define S(n) to be the sum and X(n) to be the sample mean:

S(n) = X1 + X2 + · · · + Xn, (10)

and

X(n) = S(n)/n. (11)

Both S(n) and X(n) are also random variables, as they are functions of the random variables X1, . . . , Xn. If we write E(Xi) = µ and Var(Xi) = σ² (for all i, as the variables have the same distribution), it is straightforward to derive the mean and variance of S(n) and X(n) in terms of µ and σ².

Theorem 28. We have:

1. E(S(n)) = nµ;


2. Var(S(n)) = nσ²;

3. E(X(n)) = µ;

4. Var(X(n)) = σ²/n.

The standard deviation of X(n) plays an important role. It is called the standard error and we denote it by SE(X(n)), so that

SE(X(n)) = σ/√n. (12)

These results have important applications in statistics. Suppose we are able to observe i.i.d. random variables X1, . . . , Xn, but we don't know the value of E(Xi) = µ. Theorem 28 part 4 tells us that as n increases, the variance of X(n) gets smaller, and the smaller the variance is, the closer we expect X(n) to be to its mean value. Theorem 28 part 3 tells us that the mean value of X(n) is µ (for any value of n). In other words, as n gets larger, we expect X(n) to be increasingly close to the unknown quantity µ, so we can use the observed value of X(n) to estimate µ.

We illustrate this in Figure 20. The four plots show the density functions of X(n) for n = 1, 10, 20 and 100. In each case, X1, . . . , Xn ∼ N(0, 1), so E(Xi) = µ = 0. We can see that the density function of X(n) becomes more tightly concentrated about the value 0 as we increase n, and that the observed value of X(n) is increasingly likely to be close to 0.
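The concentration of X(n) can also be seen in a small simulation (standard library only; the seed and replication counts are arbitrary choices of mine):

```python
import random
import statistics

random.seed(1)

def sample_mean(n):
    """One observed value of X(n) for i.i.d. N(0, 1) variables."""
    return sum(random.gauss(0.0, 1.0) for _ in range(n)) / n

# 2000 replications of X(n) for two sample sizes; the spread should
# shrink roughly like sigma^2 / n = 1/n.
spread = {n: statistics.pvariance([sample_mean(n) for _ in range(2000)])
          for n in (1, 100)}
```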

Here are two key examples of sums of i.i.d. random variables:

• If Xi ∼ Bernoulli(p) for 1 ≤ i ≤ n, then S(n) ∼ Bin(n, p).

• If Xi ∼ N(µ, σ2) for 1 ≤ i ≤ n, then S(n) ∼ N(nµ, nσ2).


Figure 20: The density function of X(n), when Xi ∼ N(0, 1), for four choices of n (n = 1, 10, 20 and 100). If we didn't know that µ = E(Xi) = 0, but we could observe X(n), then using the observed value of X(n) for large n is likely to give us a good estimate of µ, as X(n) is likely to be close to µ.

The first of these is how we defined the Binomial; we will prove the second later on, using moment generating functions.

17 Laws of Large Numbers

We will now derive an important result regarding the behaviour of X(n) for large n. We first prove a useful inequality. It is true whether X is discrete or continuous.

Theorem 29. (Chebyshev’s inequality)

Let X be a random variable for which E(X) = µ and Var(X) = σ². Then for any c > 0,

P(|X − µ| ≥ c) ≤ σ²/c². (13)

The same inequality holds if P(|X − µ| ≥ c) is replaced by P(|X − µ| > c) in (13), as P(|X − µ| > c) ≤ P(|X − µ| ≥ c).

What does Chebyshev's inequality tell us? We expect the probability of finding the value of a random variable in a given region to be smaller the further that region is from the mean. But a large variance may counteract this a little, as it tells us that the values which have high probability are more spread out. Chebyshev's inequality makes this precise.

Example 54. A random variable X has mean 1 and variance 0.5. What can you say about P(X > 6)?
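A sketch of the reasoning: the event {X > 6} is contained in {|X − 1| ≥ 5}, so Chebyshev's inequality with c = 5 gives a bound.

```python
# Example 54: mean 1, variance 0.5. Since {X > 6} implies |X - 1| >= 5,
# Chebyshev with c = 5 bounds the probability.
mu, var, c = 1.0, 0.5, 5.0
bound = var / c ** 2   # P(X > 6) <= P(|X - mu| >= 5) <= 0.02
```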

Theorem 30. (The Weak Law of Large Numbers)

Let X1, X2, . . . be a sequence of i.i.d. random variables, each with mean µ and variance σ². Then for all ε > 0,

lim_{n→∞} P(|X(n) − µ| > ε) = 0.

It is possible to prove a stronger result, which is that

P(lim_{n→∞} X(n) = µ) = 1,

but the proof is outside the scope of this module. This result is known as the strong law of large numbers.

17.1 Relationship to informal law of large numbers

In section 4 we introduced the following informal version:


The law of large numbers (an informal version). Suppose we do a sequence of independent experiments, so that the outcome in one experiment has no effect on the outcome in another experiment. Now suppose in experiment i, there is an event Ei that has a probability of p of occurring, for i = 1, 2, . . .. The proportion of events out of E1, E2, . . . which actually occur as we do the experiments will typically get closer and closer to p, as the number of experiments increases.

To see the relationship between this and Theorem 30, for each of the events Ei define a random variable Xi which takes the value 1 if Ei occurs and 0 if it does not, so that Xi is a Bernoulli random variable with P(Xi = 1) = p and P(Xi = 0) = 1 − p. As the Ei are independent, the Xi will be independent too, and it is easy to see that the mean µ = p. Hence Theorem 30 tells us that, for any ε > 0,

lim_{n→∞} P(|X(n) − p| > ε) = 0,

and X(n) here is precisely the proportion of the events E1, E2, . . . , En which actually occur.

So Theorem 30 tells us that if n is large there is high probability that the proportion of the events E1, E2, . . . , En which occur is close to p.
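The informal law is easy to watch in a simulation (the value p = 0.3 and the seed are arbitrary choices of mine):

```python
import random

random.seed(2)
p = 0.3   # probability of each event E_i occurring

def proportion(n):
    """Proportion of the first n independent events that occur."""
    return sum(random.random() < p for _ in range(n)) / n

# The gap between the observed proportion and p shrinks as n grows.
gaps = {n: abs(proportion(n) - p) for n in (100, 100_000)}
```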

17.2 Shape of the distribution of the sample mean

The Weak Law of Large Numbers tells us that for large n the distribution of the sample mean X(n) is concentrated around the mean µ. Sometimes it is useful to have more precise results concerning the fluctuations of the sample mean around µ: what is the shape of this distribution?

To investigate this further we normalise the sample mean by subtracting µ and dividing by σ/√n, which we know from Theorem 28 is the standard deviation of the sample mean. Hence we consider

(X(n) − µ)/(σ/√n),

which we know has mean 0 and variance 1. For now, we will look at some examples using simulations.

Figure 21 shows histograms of sample means of normalised exponential random variables with parameter 1 for four different values of n. In this case µ = σ = 1, so we will plot

√n (X(n) − 1).

Figure 21: Histograms of simulations of sample means of exponential random variables, with sample sizes 1, 10, 100 and 1000. Each histogram is based on 10000 repetitions.


The top left picture, when n = 1, shows the shape of the original exponential distribution. The other plots show that as n increases the shape of the plot changes and comes to look suspiciously like the shape of the standard normal distribution.

Something very similar happens with the other distributions we have seen, both discrete and continuous: Binomial, Poisson, uniform and the normal itself. That we should expect to see, approximately, a standard normal distribution here is a consequence of an important result called the Central Limit Theorem (CLT). The next two sections will develop the idea of moment generating functions, and use them to give an idea of how we might prove the CLT.
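A simulation sketch in the setting of Figure 21 (µ = σ = 1 for the exponential with parameter 1; the seed and counts are mine): for n = 100 the normalised sample mean already behaves much like a standard normal.

```python
import random
import statistics

random.seed(3)

def normalised_mean(n):
    """sqrt(n) * (X(n) - 1) for X_i ~ Exp(1), where mu = sigma = 1."""
    xs = [random.expovariate(1.0) for _ in range(n)]
    return (statistics.fmean(xs) - 1.0) * n ** 0.5

# If a standard normal is emerging, about 95% of draws should land
# in [-1.96, 1.96].
draws = [normalised_mean(100) for _ in range(5000)]
within = sum(-1.96 <= d <= 1.96 for d in draws) / len(draws)
```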

18 Moment Generating Functions

Let X be a random variable. Two important associated numerical quantities are the expected values E(X) and E(X²), from which we can compute the variance, using

Var(X) = E(X²) − E(X)².

More generally we might try to find the nth moment E(Xⁿ) for n ∈ N. For example E(X³) and E(X⁴) give information about the shape of the distribution and are used to calculate quantities called the skewness and kurtosis. We can try to calculate moments directly, but there is also a useful shortcut.

Definition 32. The moment generating function (or mgf) is defined for all t ∈ R by:

MX(t) = E(e^{tX}). (14)


So

MX(t) = ∑x∈RX e^{tx} pX(x)

if X is discrete, and

MX(t) = ∫_{−∞}^{∞} e^{tx} fX(x) dx

if X is continuous.

For example, if X ∼ Bernoulli(p), it only takes two values: 1 with probability p, and 0 with probability 1 − p, and so

MX(t) = p e^{t·1} + (1 − p) e^{t·0} = pe^t + 1 − p.

Now differentiate (14) to get

(d/dt) MX(t) = E(X e^{tX}),

and so

(d/dt) MX(t) |_{t=0} = E(X).

Differentiating again gives

(d²/dt²) MX(t) = E(X² e^{tX}),

and hence

(d²/dt²) MX(t) |_{t=0} = E(X²).

In fact you can find all the moments by this procedure:

(dⁿ/dtⁿ) MX(t) |_{t=0} = E(Xⁿ).


Example 55. Find the moment generating function of a Poisson random variable with parameter λ. Verify that the mean is λ.
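The answer to Example 55 is the standard Poisson mgf, MX(t) = exp{λ(e^t − 1)}; here is a numerical cross-check of that closed form (λ = 2 and t = 0.3 are arbitrary choices of mine):

```python
import math

lam, t = 2.0, 0.3

# E(e^{tX}) for X ~ Poisson(lam): sum the series well past where it matters.
series = sum(math.exp(t * x) * math.exp(-lam) * lam ** x / math.factorial(x)
             for x in range(100))
closed_form = math.exp(lam * (math.exp(t) - 1.0))

# The mean is M'(0); a central difference approximates the derivative.
h = 1e-6
deriv = (math.exp(lam * (math.exp(h) - 1.0))
         - math.exp(lam * (math.exp(-h) - 1.0))) / (2.0 * h)
```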

Another way of seeing how the mgf of X contains information about all the moments comes from writing the series expansion for the exponential function

e^{tX} = ∑_{n=0}^{∞} (tⁿ/n!) Xⁿ.

It then follows from (14) that:

MX(t) = ∑_{n=0}^{∞} (tⁿ/n!) E(Xⁿ). (15)

Example 56. Let X have a uniform distribution on [0, 1]. Find an expression for the moment generating function MX(t), write it as a series expansion, and hence find the moments E(X), E(X²) and E(X³).
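For Example 56 the mgf works out to MX(t) = (e^t − 1)/t, whose series is ∑ tⁿ/(n + 1)!, giving E(Xⁿ) = 1/(n + 1). A numerical sketch using finite differences at t = 0 (the step size h is an arbitrary choice of mine):

```python
import math

def mgf(t):
    """(e^t - 1)/t, the m.g.f. of U[0, 1]; expm1 keeps small t accurate."""
    return math.expm1(t) / t if t != 0 else 1.0

h = 1e-3
m1 = (mgf(h) - mgf(-h)) / (2 * h)              # approximates E(X) = 1/2
m2 = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h ** 2  # approximates E(X^2) = 1/3
```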

Our main purpose in introducing mgfs at this stage is not to use them to calculate moments, but to give a rough idea of the proof of the central limit theorem. The next result is very useful for that, and also has other applications.

Theorem 31. If X and Y are independent, then for all t ∈ R

MX+Y (t) = MX(t)MY (t).

It follows (by using mathematical induction) that if S(n) = X1 + X2 + · · · + Xn is a sum of i.i.d. random variables having common mgf MX then

MS(n)(t) = MX(t)ⁿ. (16)

Here is the key example that we need.


Lemma 32. If X ∼ N(0, 1) then

MX(t) = exp(t²/2).

To get the moment generating function of a general normal distribution N(µ, σ²), we can use the following result.

Theorem 33. If X has moment generating function MX(t) and Y = aX + b, then Y has moment generating function MY(t) = e^{tb} MX(at).

Combined with Lemma 32 and the relationship between the general and standard normal distributions (Theorem 24), this tells us that the mgf of a N(µ, σ²) random variable is

exp(µt + σ²t²/2). (17)

In fact it can be shown that the mgf uniquely determines the distribution of a random variable, so X ∼ N(µ, σ²) is the only random variable with the mgf (17). Using this fact we can show that the sum of n i.i.d. normal random variables is itself normal. From Theorem 31 and (17), we have

MS(n)(t) = (exp(µt + σ²t²/2))ⁿ = exp(nµt + nσ²t²/2),

so S(n) ∼ N(nµ, nσ²), and X(n) ∼ N(µ, σ²/n).

Before moving on to the Central Limit Theorem we will give one other application of Theorem 31.

Theorem 34. Let X and Y be independent random variables each with Poisson distributions with parameters λ and µ respectively. Then X + Y has a Poisson distribution with parameter λ + µ.


19 The Central Limit Theorem (CLT)

We finish with another important result, which tells us about the distribution of X(n) for large n. It could be argued that this is the most important result in the whole of probability (and statistics). The law of large numbers tells us that X(n) tends to µ as n → ∞. But the central limit theorem gives far more information. It tells us about the behaviour of the distribution of the fluctuations of X(n) around µ, as n → ∞. As hinted in section 17.2, these are always normally distributed!

Theorem 35. (The central limit theorem)

Let X1, X2, . . . be a sequence of i.i.d. random variables, each with mean µ and variance σ². For any −∞ ≤ a < b ≤ ∞,

lim_{n→∞} P(a ≤ (X(n) − µ)/(σ/√n) ≤ b) = (1/√(2π)) ∫_a^b exp(−z²/2) dz. (18)

In other words, the distribution of X(n) tends to a normal distribution with mean µ and variance σ²/n, as n → ∞. So for large n, we have, approximately,

X(n) ∼ N(µ, σ²/n), S(n) ∼ N(nµ, nσ²).

Notice that the right hand side of (18) is P(a ≤ Z ≤ b) where Z ∼ N(0, 1) is the standard normal.

We can also rewrite the left hand side in terms of S(n) by multiplying top and bottom by n. Then we get another equivalent form of the central limit theorem (CLT) which is often seen in books:


lim_{n→∞} P(a ≤ (S(n) − nµ)/(σ√n) ≤ b) = P(a ≤ Z ≤ b). (19)

As long as X1, X2, . . . have the same distribution (and are independent), (19) is valid; it doesn't matter what that distribution is. It can be discrete or continuous – uniform, Poisson, Bernoulli, exponential, and so on.²

The full proof of the CLT is beyond the scope of this course. However, we will give an outline, based on moment generating functions.

19.1 Outline proof of the CLT

We will aim to establish (19). We will assume µ = 0; this is sufficient to prove the theorem in general, as we can replace each Xi by Xi − µ. Define

Y(n) = S(n)/(σ√n).

Then E(Y(n)) = 0 and Var(Y(n)) = 1. (You check this.)

By Theorem 31 and Theorem 33 (with b = 0 and a = 1/(σ√n)),

MY(n)(t) = (M(t/(σ√n)))ⁿ for all t ∈ R,

where M on the right-hand side denotes the common moment generating function of the Xi.

²We do need the mean and variance to be well-defined. They are for all the examples listed, but there are distributions for which they are not.


Then expanding (15), using E(Xi) = 0 and Var(Xi) = σ² (so E(Xi²) = σ²):

M(t/(σ√n)) = 1 + (t/(σ√n)) × 0 + (1/2)(t²/(σ²n))σ² + α(n)/n
           = 1 + t²/(2n) + α(n)/n,

where α(n) is a remainder term that satisfies lim_{n→∞} α(n) = 0. Thus

MY(n)(t) = (1 + t²/(2n) + α(n)/n)ⁿ.

From MAS110, we know that e^y = lim_{n→∞} (1 + y/n)ⁿ and, furthermore, since lim_{n→∞} α(n) = 0, it can also be shown that

e^y = lim_{n→∞} (1 + y/n + α(n)/n)ⁿ.

Hence, using Lemma 32, we find that

lim_{n→∞} M_{Y(n)}(t) = exp((1/2)t²) = M_Z(t)

and the result follows, since the mgf determines the distribution.
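The key limit in the outline, that (1 + (1/2)(t²/n) + α(n)/n)^n tends to exp((1/2)t²), can also be checked numerically. The snippet below (an illustration, not part of the notes) drops the remainder term α(n)/n and watches the convergence as n grows:

```python
import math

def mgf_approx(t, n):
    """The n-th power from the outline proof with the remainder
    alpha(n)/n dropped: (1 + t^2/(2n))^n."""
    return (1.0 + t * t / (2.0 * n)) ** n

t = 1.5
limit = math.exp(t * t / 2.0)   # M_Z(t) = exp(t^2 / 2)
for n in (10, 1000, 100000):
    print(n, mgf_approx(t, n), limit)
```

For each fixed t the approximations increase towards exp(t²/2), mirroring the pointwise convergence of the mgfs used in the proof.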

19.2 Normal approximation to binomial

As a first application of this result we can investigate the normal approximation to the binomial distribution. If we take X1, X2, . . . to be Bernoulli random variables with common parameter p, then S(n) is binomial with parameters n and p. Since E(S(n)) = np and Var(S(n)) = np(1 − p), we get the following:

Corollary 36. (de Moivre–Laplace central limit theorem)

If X1, X2, . . . are Bernoulli with common parameter p, then

lim_{n→∞} P(a ≤ (S(n) − np)/√(np(1 − p)) ≤ b) = (1/√(2π)) ∫_a^b exp(−(1/2)z²) dz.    (20)

This was historically the first central limit theorem to be established. It is named after Abraham de Moivre (1667–1754), who first used integrals of exp(−(1/2)z²) to approximate probabilities, and Pierre-Simon Laplace (1749–1827), who obtained a formula quite close in spirit to (20). Laplace was studying the dynamics of the solar system, and was motivated in these calculations by the need to estimate the probability that the inclination of cometary orbits to a given plane was between certain limits. Corollary 36 is sometimes referred to as the normal approximation to the binomial distribution.

19.2.1 Continuity correction

Because the binomial distribution is discrete, while the normal distribution is continuous, we can get greater accuracy in the CLT by using a continuity correction. Assume we are trying to approximate a binomial random variable X ∼ Bin(n, p) by a normal random variable Y; by Corollary 20 we will have Y ∼ N(np, np(1 − p)).

Because X takes integer values, it in fact makes sense to use the value of Y rounded to the nearest integer; call this Ỹ. So then we approximate P(X = x) by P(Ỹ = x), which is equal to P(x − 0.5 < Y ≤ x + 0.5). This implies that we should use the following approximations to various probabilities involving the binomial:

P(X ≤ x) ≈ P(Y ≤ x + 0.5),

P(X < x) ≈ P(Y ≤ x − 0.5),

P(x1 ≤ X ≤ x2) ≈ P(x1 − 0.5 < Y ≤ x2 + 0.5).

Example 57. A woman claims to have psychic abilities, in that she can predict the outcome of a coin toss. If she is tested 100 times, but she is really just guessing, what is the probability that she will be right 60 or more times?
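A sketch of the calculation for Example 57 (the code is illustrative, not from the notes): if she is guessing, X ∼ Bin(100, 0.5), and P(X ≥ 60) = P(X > 59), so with the continuity correction we approximate by P(Y > 59.5) where Y ∼ N(50, 25).

```python
import math

def phi(x):
    """Standard normal cdf, via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 100, 0.5
mean = n * p                        # np = 50
sd = math.sqrt(n * p * (1 - p))     # sqrt(np(1-p)) = 5

# Continuity correction: P(X >= 60) = P(X > 59) ~ P(Y > 59.5).
approx = 1.0 - phi((59.5 - mean) / sd)

# Exact binomial tail, for comparison.
exact = sum(math.comb(n, k) for k in range(60, n + 1)) * 0.5 ** n

print(approx, exact)   # both close to 0.028
```

So if she is really just guessing, she will be right 60 or more times only about 3% of the time; the continuity-corrected approximation agrees with the exact binomial tail to three decimal places.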

19.3 Application to finance

Imagine that we are keeping track of the price of a stock, which we will label St at time t. Assume time here is measured in some small units such as seconds.

It is usual to think of the change in price from time t − 1 to t multiplicatively, so that we can write St = St−1 Rt, where Rt is a random variable. (One advantage of thinking in this way is that as long as the Rt are not negative the modelled stock price will not go negative.) As a starting point, assume that the Rt are independent and identically distributed, and that S0 is known. We thus can write

St = S0 ∏_{k=1}^{t} R_k.

Taking logs, we have

log St = log S0 + Σ_{k=1}^{t} log R_k.


Assume that E(log Rt) = µ and Var(log Rt) = σ². Then, by the Central Limit Theorem, if t is large we expect log St to have an approximately normal distribution with mean log S0 + tµ and variance tσ².

A random variable S such that log S has a normal distribution has a lognormal distribution, so this argument suggests that St should have an approximately lognormal distribution. In fact, a standard model for stock prices, which you will meet if you take later courses in finance, assumes that the prices St behave as what is known as geometric Brownian motion, which implies that St has exactly a lognormal distribution, with the mean and variance of log St being as suggested in the previous paragraph.
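A quick simulation illustrates this lognormal limit. In the sketch below (a hypothetical example; the uniform choice of log-return is mine, not from the notes) each log-return log Rt is Uniform(−0.01, 0.02), so µ = 0.005 and σ² = (0.03)²/12, and the sample mean and variance of log St come out close to log S0 + tµ and tσ²:

```python
import math
import random
import statistics

def log_price(t, s0, rng):
    """One simulated path: log S_t = log S_0 + sum of t i.i.d. log-returns.
    Each log-return is Uniform(-0.01, 0.02) -- an illustrative choice."""
    return math.log(s0) + sum(rng.uniform(-0.01, 0.02) for _ in range(t))

rng = random.Random(1)
t, s0 = 500, 100.0
mu = 0.005                    # mean of Uniform(-0.01, 0.02)
var = (0.03 ** 2) / 12        # variance of Uniform(-0.01, 0.02)

samples = [log_price(t, s0, rng) for _ in range(2000)]
print(statistics.mean(samples), math.log(s0) + t * mu)   # both near 7.105
print(statistics.variance(samples), t * var)             # both near 0.0375
```

A histogram of exp of these samples would show the right-skewed shape characteristic of the lognormal distribution, even though each individual log-return is uniform.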
