Stats 210 Course Book


Contents

1. Probability
1.1 Introduction
1.2 Sample spaces
1.3 Events
1.4 Partitioning sets and events
1.5 Probability: a way of measuring sets
1.6 Probabilities of combined events
1.7 The Partition Theorem
1.8 Examples of basic probability calculations
1.9 Formal probability proofs: non-examinable
1.10 Conditional Probability
1.11 Examples of conditional probability and partitions
1.12 Bayes' Theorem: inverting conditional probabilities
1.13 Chains of events and probability trees: non-examinable
1.14 Simpson's paradox: non-examinable
1.15 Equally likely outcomes and combinatorics: non-examinable
1.16 Statistical Independence
1.17 Random Variables
1.18 Key Probability Results for Chapter 1

2. Discrete Probability Distributions
2.1 Introduction
2.2 The probability function, fX(x)
2.3 Bernoulli trials
2.4 Example of the probability function: the Binomial Distribution
2.5 The cumulative distribution function, FX(x)
2.6 Hypothesis testing
2.7 Example: Presidents and deep-sea divers
2.8 Example: Politicians and the alphabet
2.9 Likelihood and estimation
2.10 Random numbers and histograms
2.11 Expectation
2.12 Variable transformations
2.13 Variance
2.14 Mean and variance of the Binomial(n, p) distribution

3. Modelling with Discrete Probability Distributions
3.1 Binomial distribution
3.2 Geometric distribution
3.3 Negative Binomial distribution
3.4 Hypergeometric distribution: sampling without replacement
3.5 Poisson distribution
3.6 Subjective modelling

4. Continuous Random Variables
4.1 Introduction
4.2 The probability density function
4.3 The Exponential distribution
4.4 Likelihood and estimation for continuous random variables
4.5 Hypothesis tests
4.6 Expectation and variance
4.7 Exponential distribution mean and variance
4.8 The Uniform distribution
4.9 The Change of Variable Technique: finding the distribution of g(X)
4.10 Change of variable for non-monotone functions: non-examinable
4.11 The Gamma distribution
4.12 The Beta Distribution: non-examinable

5. The Normal Distribution and the Central Limit Theorem
5.1 The Normal Distribution
5.2 The Central Limit Theorem (CLT)

6. Wrapping Up
6.1 Estimators: the good, the bad, and the estimator PDF
6.2 Hypothesis tests: in search of a distribution


    Chapter 1: Probability

    1.1 Introduction

Definition: A probability is a number between 0 and 1 representing how likely it is that an event will occur.

Probabilities can be:

1. Frequentist (based on frequencies), e.g.

   (number of times the event occurs) / (number of opportunities for the event to occur);

2. Subjective: probability represents a person's degree of belief that an event will occur, e.g. "I think there is an 80% chance it will rain today," written as P(rain) = 0.80.

Regardless of how we obtain probabilities, we always combine and manipulate them according to the same rules.

    1.2 Sample spaces

Definition: A random experiment is an experiment whose outcome is not known until it is observed.

Definition: A sample space, Ω, is a set of outcomes of a random experiment.

Every possible outcome must be listed once and only once.

Definition: A sample point is an element of the sample space.

For example, if the sample space is Ω = {s1, s2, s3}, then each si is a sample point.


    Examples:

Experiment: Toss a coin twice and observe the result.
Sample space: Ω = {HH, HT, TH, TT}.
An example of a sample point is: HT.

Experiment: Toss a coin twice and count the number of heads.
Sample space: Ω = {0, 1, 2}.

Experiment: Toss a coin twice and observe whether the two tosses are the same (e.g. HH or TT).
Sample space: Ω = {same, different}.

Discrete and continuous sample spaces

Definition: A sample space is finite if it has a finite number of elements.

Definition: A sample space is discrete if there are gaps between the different elements, or if the elements can be listed, even if it is an infinite list (e.g. 1, 2, 3, . . .).

In mathematical language, a sample space is discrete if it is countable.

Definition: A sample space is continuous if there are no gaps between the elements, so the elements cannot be listed (e.g. the interval [0, 1]).

Examples:

Ω = {0, 1, 2, 3} (discrete and finite)
Ω = {0, 1, 2, 3, . . .} (discrete, infinite)
Ω = {4.5, 4.6, 4.7} (discrete, finite)
Ω = {HH, HT, TH, TT} (discrete, finite)
Ω = [0, 1] = {all numbers between 0 and 1 inclusive} (continuous, infinite)
Ω = {[0, 90), [90, 360)} (discrete, finite)


    1.3 Events

Kolmogorov (1903-1987), one of the founders of probability theory.

Suppose you are setting out to create a science of randomness. Somehow you need to harness the idea of randomness, which is all about the unknown, and express it in terms of mathematics.

How would you do it?

So far, we have introduced the sample space, Ω, which lists all possible outcomes of a random experiment, and might seem unexciting.

However, Ω is a set. It lays the ground for a whole mathematical formulation of randomness, in terms of set theory.

The next concept that you would need to formulate is that of something that happens at random, or an event.

How would you express the idea of an event in terms of set theory?

Definition: An event is a subset of the sample space.

That is, any collection of outcomes forms an event.

Example: Toss a coin twice. Sample space: Ω = {HH, HT, TH, TT}.

Let event A be the event that there is exactly one head.

We write: A = "exactly one head".

Then A = {HT, TH}.

A is a subset of Ω, as in the definition. We write A ⊆ Ω.

Definition: Event A occurs if we observe an outcome that is a member of the set A.

Note: Ω is a subset of itself, so Ω is an event. The empty set, ∅ = {}, is also a subset of Ω. This is called the null event, or the event with no outcomes.


    Example:

Experiment: throw 2 dice.
Sample space: Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), (2, 2), . . . , (2, 6), . . . , (6, 6)}.

Event A = "sum of two faces is 5" = {(1, 4), (2, 3), (3, 2), (4, 1)}.

    Combining Events

Formulating random events in terms of sets gives us the power of set theory to describe all possible ways of combining or manipulating events. For example, we need to describe things like coincidences (events happening together), alternatives, opposites, and so on.

We do this in the language of set theory.

Example: Suppose our random experiment is to pick a person in the class and see what form(s) of transport they used to get to campus today.

[Venn diagram: the sample space "people in class" containing the events Bus, Bike, Walk, Car, and Train.]

This sort of diagram representing events in a sample space is called a Venn diagram.


1. Alternatives: the union ('or') operator

We wish to describe an event that is composed of several different alternatives.

For example, the event that you used a motor vehicle to get to campus is the event that your journey involved a car, or a bus, or both.

To represent the set of journeys involving both alternatives, we shade all outcomes in Bus and all outcomes in Car.

[Venn diagram: the events Bus and Car both shaded.]

Overall, we have shaded all outcomes in the UNION of Bus and Car.

We write the event that you used a motor vehicle as the event Bus ∪ Car, read as "Bus union Car".

The union operator, ∪, denotes Bus OR Car OR both.

Note: Be careful not to confuse "or" and "and". To shade the union of Bus and Car, we had to shade everything in Bus AND everything in Car.

To remember whether union refers to "or" or "and", you have to consider what an outcome needs to satisfy for the shaded event to occur.

The answer is Bus, OR Car, OR both. NOT Bus AND Car.

Definition: Let A and B be events on the same sample space Ω: so A ⊆ Ω and B ⊆ Ω.

The union of events A and B is written A ∪ B, and is given by

A ∪ B = {s : s ∈ A or s ∈ B or both}.


2. Concurrences and coincidences: the intersection ('and') operator

The intersection is an event that occurs when two or more events ALL occur together.

For example, consider the event that your journey today involved BOTH a car AND a train. To represent this event, we shade all outcomes in the overlap of Car and Train.

[Venn diagram: the overlap Car ∩ Train shaded.]

We write the event that you used both car and train as Car ∩ Train, read as "Car intersect Train".

The intersection operator, ∩, denotes both Car AND Train together.

Definition: The intersection of events A and B is written A ∩ B and is given by

A ∩ B = {s : s ∈ A AND s ∈ B}.

3. Opposites: the complement or 'not' operator

The complement of an event is the opposite of the event: whatever the event was, it didn't happen.

For example, consider the event that your journey today did NOT involve walking. To represent this event, we shade all outcomes in Ω except those in the event Walk.


[Venn diagram: everything in "people in class" except the event Walk shaded.]

We write the event "not Walk" as the complement of Walk.

Definition: The complement of event A is written Ā and is given by

Ā = {s : s ∉ A}.

Examples:

Experiment: Pick a person in this class at random.
Sample space: Ω = {all people in class}.
Let event A = "person is male" and event B = "person travelled by bike today".

Suppose I pick a male who did not travel by bike. Say whether the following events have occurred:

1) A: Yes.  2) B: No.  3) Ā: No.  4) B̄: Yes.

5) Ā ∪ B = {female or bike rider or both}: No.

6) A ∩ B̄ = {male and non-biker}: Yes.

7) A ∩ B = {male and bike rider}: No.

8) Ā ∪ B̄ = everything outside A ∩ B. A ∩ B did not occur, so Ā ∪ B̄ did occur: Yes.

Question: What is the event Ω̄? Ω̄ = ∅.

Challenge: can you express A ∪ B using only an ∩ sign? Answer: A ∪ B is the complement of Ā ∩ B̄.


Limitations of Venn diagrams

Venn diagrams are generally useful for up to 3 events, although they are not used to provide formal proofs. For more than 3 events, the diagram might not be able to represent all possible overlaps of events. (This was probably the case for our transport Venn diagram.)

[Venn diagrams showing three overlapping events A, B, and C.]

    Properties of union, intersection, and complement

    The following properties hold.

(i) ∅ ∪ Ω = Ω and ∅ ∩ Ω = ∅.

(ii) For any event A, A ∪ Ā = Ω, and A ∩ Ā = ∅.

(iii) For any events A and B, A ∪ B = B ∪ A, and A ∩ B = B ∩ A. (Commutative.)

(iv) (a) The complement of A ∪ B is Ā ∩ B̄. (b) The complement of A ∩ B is Ā ∪ B̄. (De Morgan's laws.)


Distributive laws

We are familiar with the fact that multiplication is distributive over addition. This means that, if a, b, and c are any numbers, then

a × (b + c) = a × b + a × c.

However, addition is not distributive over multiplication:

a + (b × c) ≠ (a + b) × (a + c).

For set union and set intersection, union is distributive over intersection, AND intersection is distributive over union.

Thus, for any sets A, B, and C:

A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C),

and A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

[Venn diagrams illustrating the distributive laws for three events A, B, and C.]

More generally, for several events A and B1, B2, . . . , Bn,

A ∪ (B1 ∩ B2 ∩ . . . ∩ Bn) = (A ∪ B1) ∩ (A ∪ B2) ∩ . . . ∩ (A ∪ Bn),

i.e. A ∪ (⋂_{i=1}^{n} Bi) = ⋂_{i=1}^{n} (A ∪ Bi),

and

A ∩ (B1 ∪ B2 ∪ . . . ∪ Bn) = (A ∩ B1) ∪ (A ∩ B2) ∪ . . . ∪ (A ∩ Bn),

i.e. A ∩ (⋃_{i=1}^{n} Bi) = ⋃_{i=1}^{n} (A ∩ Bi).


    1.4 Partitioning sets and events

The idea of a partition is fundamental in probability manipulations. Later in this chapter we will encounter the important Partition Theorem. For now, we give some background definitions.

Definition: Two events A and B are mutually exclusive, or disjoint, if A ∩ B = ∅.

This means events A and B cannot happen together. If A happens, it excludes B from happening, and vice-versa.

[Venn diagram: two disjoint events A and B.]

Note: Does this mean that A and B are independent?

No: quite the opposite. A EXCLUDES B from happening, so B depends strongly on whether or not A happens.

Definition: Any number of events A1, A2, . . . , Ak are mutually exclusive if every pair of the events is mutually exclusive: i.e. Ai ∩ Aj = ∅ for all i, j with i ≠ j.

[Venn diagram: three mutually exclusive events A1, A2, and A3.]

Definition: A partition of the sample space is a collection of mutually exclusive events whose union is Ω.

That is, sets B1, B2, . . . , Bk form a partition of Ω if

Bi ∩ Bj = ∅ for all i, j with i ≠ j,

and ⋃_{i=1}^{k} Bi = B1 ∪ B2 ∪ . . . ∪ Bk = Ω.


    Examples:

[Venn diagrams: B1, B2, B3, B4 form a partition of Ω; B1, . . . , B5 partition Ω.]

Important: B and B̄ partition Ω for any event B.

Partitioning an event A

Any set or event A can be partitioned: it doesn't have to be Ω.

If B1, . . . , Bk form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bk) form a partition of A.

[Venn diagram: event A cut into pieces A ∩ B1, . . . , A ∩ B4 by a partition B1, . . . , B4 of Ω.]

We will see that this is very useful for finding the probability of event A.

This is because it is often easier to find the probability of small chunks of A (the partitioned sections) than to find the whole probability of A at once. The partition idea shows us how to add the probabilities of these chunks together: see later.


    1.5 Probability: a way of measuring sets

Remember that you are given the job of building the science of randomness. This means somehow measuring chance.

If I sent you away to measure heights, the first thing you would ask is what you are supposed to be measuring the heights of. People? Trees? Mountains?

We have the same question when setting out to measure chance. Chance of what?

The answer is sets.

It was clever to formulate our notions of events and sample spaces in terms of sets: it gives us something to measure. Probability, the name that we give to our chance-measure, is a way of measuring sets.

You probably already have a good idea for a suitable way to measure the size of a set or event. Why not just count the number of elements in it?

In fact, this is often what we do to measure probability (although counting the number of elements can be far from easy!). But there are circumstances where this is not appropriate.

What happens, for example, if one set is far more likely than another, but they have the same number of elements? Should they have the same probability?

First set: {Lions win}. Second set: {All Blacks win}. Both sets have just one element, but we definitely need to give them different probabilities!

More problems arise when the sets are infinite or continuous.

Should the intervals [3, 4] and [13, 14] have the same probability, just because they are the same length? Yes they should, if (say) our random experiment is to pick a random number on [0, 20]; but no they shouldn't (hopefully!) if our experiment was the time in years taken by a student to finish their degree.


    Most of this course is about probability distributions.

A probability distribution is a rule according to which probability is apportioned, or distributed, among the different sets in the sample space.

At its simplest, a probability distribution just lists every element in the sample space and allots it a probability between 0 and 1, such that the total sum of probabilities is 1.

In the rugby example, we could use the following probability distribution:

P(Lions win) = 0.01, P(All Blacks win) = 0.99.

    In general, we have the following definition for discrete sample spaces.

    Discrete probability distributions

Definition: Let Ω = {s1, s2, . . .} be a discrete sample space.

A discrete probability distribution on Ω is a set of real numbers {p1, p2, . . .} associated with the sample points {s1, s2, . . .} such that:

1. 0 ≤ pi ≤ 1 for all i;

2. Σi pi = 1.

    pi is called the probability of the event that the outcome is si.

    We write: pi = P(si).

The rule for measuring the probability of any set, or event, A ⊆ Ω, is to sum the probabilities of the elements of A:

P(A) = Σ_{i ∈ A} pi.

E.g. if A = {s3, s5, s14}, then P(A) = p3 + p5 + p14.
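For readers who like to compute, here is a minimal Python sketch of this rule; the sample points and probabilities below are invented purely for illustration and are not from the course examples.

```python
# A discrete probability distribution: each sample point gets a probability;
# the probabilities lie in [0, 1] and sum to 1.
dist = {"s1": 0.2, "s2": 0.5, "s3": 0.3}

assert all(0 <= p <= 1 for p in dist.values())
assert abs(sum(dist.values()) - 1.0) < 1e-12

def prob(event, dist):
    """P(A) = sum of the probabilities of the sample points in A."""
    return sum(dist[s] for s in event)

A = {"s1", "s3"}        # an event is just a set of sample points
print(prob(A, dist))    # 0.5
```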


    Continuous probability distributions

On a continuous sample space Ω, e.g. Ω = [0, 1], we cannot list all the elements and give them an individual probability. We will need more sophisticated methods detailed later in the course.

However, the same principle applies. A continuous probability distribution is a rule under which we can calculate a probability between 0 and 1 for any set, or event, A ⊆ Ω.

    Probability Axioms

For any sample space, discrete or continuous, all of probability theory is based on the following three definitions, or axioms.

Axiom 1: P(Ω) = 1.

Axiom 2: 0 ≤ P(A) ≤ 1 for all events A.

Axiom 3: If A1, A2, . . . , An are mutually exclusive events (no overlap), then

P(A1 ∪ A2 ∪ . . . ∪ An) = P(A1) + P(A2) + . . . + P(An).

If our rule for measuring sets satisfies the three axioms, it is a valid probability distribution.

It should be clear that the definitions given above for a discrete sample space will satisfy the axioms. The challenge of defining a probability distribution on a continuous sample space is left till later.

Note: The axioms can never be proved: they are definitions.

Note: P(∅) = 0.

    Note: Remember that an EVENT is a SET: an event is a subset of the sample space.


    1.6 Probabilities of combined events

In Section 1.3 we discussed unions, intersections, and complements of events. We now look at the probabilities of these combinations. Everything below applies to events (sets) in either a discrete or a continuous sample space.

1. Probability of a union

Let A and B be events on a sample space Ω. There are two cases for the probability of the union A ∪ B:

1. A and B are mutually exclusive (no overlap): i.e. A ∩ B = ∅.

2. A and B are not mutually exclusive: A ∩ B ≠ ∅.

For Case 1, we get the probability of A ∪ B straight from Axiom 3:

If A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B).

For Case 2, we have the following formula:

For ANY events A, B: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

Note: The formula for Case 2 applies also to Case 1: just substitute P(A ∩ B) = P(∅) = 0.

For three or more events: e.g. for any A, B, and C,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(A ∩ C) - P(B ∩ C) + P(A ∩ B ∩ C).
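A quick way to convince yourself of the three-event formula is to check it on a small equally-likely sample space. The Python sketch below is illustrative only: the fair-die sample space and the events A, B, C are arbitrary choices, not from the course book. (Python's set operators | and & conveniently mirror ∪ and ∩.)

```python
from fractions import Fraction

# Equally likely outcomes: a single roll of a fair die (illustrative choice).
omega = {1, 2, 3, 4, 5, 6}
P = lambda E: Fraction(len(E), len(omega))   # P(E) = |E| / |Ω|

A, B, C = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}

lhs = P(A | B | C)                           # direct: P(A ∪ B ∪ C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))                       # inclusion-exclusion sum
print(lhs, rhs)                              # both 5/6
```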


    Explanation

For any events A and B, P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

    The formal proof of this formula is in Section 1.9 (non-examinable).

    To understand the formula, think of the Venn diagrams:

[Venn diagrams: A and B overlapping; A ∪ B split into A and B \ (A ∩ B).]

When we add P(A) + P(B), we add the intersection twice. So we have to subtract the intersection once to get P(A ∪ B):

P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

Alternatively, think of A ∪ B as two disjoint sets: all of A, and the bits of B without the intersection. So P(A ∪ B) = P(A) + [P(B) - P(A ∩ B)].

    2. Probability of an intersection

[Venn diagram: overlapping events A and B.]

There is no easy formula for P(A ∩ B). We might be able to use statistical independence (Section 1.16).

If A and B are not statistically independent, we often use conditional probability (Section 1.10).

3. Probability of a complement

[Venn diagram: event A and its complement Ā.]

P(Ā) = 1 - P(A).

This is obvious, but a formal proof is given in Section 1.9.


    1.7 The Partition Theorem

The Partition Theorem is one of the most useful tools for probability calculations. It is based on the fact that probabilities are often easier to calculate if we break down a set into smaller parts.

Recall that a partition of Ω is a collection of non-overlapping sets B1, . . . , Bm which together cover everything in Ω.

[Venn diagram: Ω divided into B1, B2, B3, B4.]

Also, if B1, . . . , Bm form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bm) form a partition of the set or event A.

[Venn diagram: A overlaid on the partition, cut into A ∩ B1, A ∩ B2, A ∩ B3, A ∩ B4.]

The probability of event A is therefore the sum of its parts:

P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + P(A ∩ B4).

The Partition Theorem is a mathematical way of saying "the whole is the sum of its parts".

Theorem 1.7: The Partition Theorem. (Proof in Section 1.9.)

Let B1, . . . , Bm form a partition of Ω. Then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi).

Note: Recall the formal definition of a partition. Sets B1, B2, . . . , Bm form a partition of Ω if Bi ∩ Bj = ∅ for all i ≠ j, and ⋃_{i=1}^{m} Bi = Ω.


    1.8 Examples of basic probability calculations

300 Australians were asked about their car preferences in 1998. Of the respondents, 33% had children. The respondents were asked what sort of car they would like if they could choose any car at all. 13% of respondents had children and chose a large car. 12% of respondents did not have children and chose a large car.

Find the probability that a randomly chosen respondent:
(a) would choose a large car;
(b) either has children or would choose a large car (or both).

First formulate events:

Let C = "has children", C̄ = "no children", L = "chooses large car".

Next write down all the information given:

P(C) = 0.33, P(C ∩ L) = 0.13, P(C̄ ∩ L) = 0.12.

(a) Asked for P(L).

P(L) = P(L ∩ C) + P(L ∩ C̄)   (Partition Theorem)
     = P(C ∩ L) + P(C̄ ∩ L)
     = 0.13 + 0.12
     = 0.25.

So P(chooses large car) = 0.25.

(b) Asked for P(L ∪ C).

P(L ∪ C) = P(L) + P(C) - P(L ∩ C)   (Section 1.6)
         = 0.25 + 0.33 - 0.13
         = 0.45.


Respondents were also asked their opinions on car reliability and fuel consumption. 84% of respondents considered reliability to be of high importance, while 40% considered fuel consumption to be of high importance.

Formulate events: R = "considers reliability of high importance",
F = "considers fuel consumption of high importance".

(c) What is P(R̄)?

(d) What is P(R ∩ F)?

Information given: P(R) = 0.84, P(F) = 0.40.

(c) P(R̄) = 1 - P(R) = 1 - 0.84 = 0.16.

(d) We cannot calculate P(R ∩ F) from the information given.

(e) Given the further information that 12% of respondents considered neither reliability nor fuel consumption to be of high importance, find P(R ∪ F) and P(R ∩ F).

Information given: P(R̄ ∩ F̄) = 0.12.

Thus P(R ∪ F) = 1 - P(R̄ ∩ F̄) = 1 - 0.12 = 0.88.

This is the probability that a respondent considers either reliability or fuel consumption, or both, of high importance.

P(R ∩ F) = P(R) + P(F) - P(R ∪ F)   (Section 1.6)
         = 0.84 + 0.40 - 0.88
         = 0.36.

This is the probability that a respondent considers BOTH reliability AND fuel consumption of high importance.


(f) Find the probability that a respondent considered reliability, but not fuel consumption, of high importance.

P(R ∩ F̄) = P(R) - P(R ∩ F)   (Partition Theorem)
          = 0.84 - 0.36
          = 0.48.
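The whole survey calculation above can be reproduced in a few lines. The following Python sketch simply re-uses the probabilities quoted in the text; the variable names are our own.

```python
# Probabilities given in the survey example (parts (a)-(f)).
P_C, P_C_and_L, P_notC_and_L = 0.33, 0.13, 0.12

P_L = P_C_and_L + P_notC_and_L        # Partition Theorem: 0.25
P_L_or_C = P_L + P_C - P_C_and_L      # inclusion-exclusion: 0.45

P_R, P_F, P_neither = 0.84, 0.40, 0.12
P_notR = 1 - P_R                      # complement: 0.16
P_R_or_F = 1 - P_neither              # complement of "neither": 0.88
P_R_and_F = P_R + P_F - P_R_or_F      # rearranged union formula: 0.36
P_R_not_F = P_R - P_R_and_F           # partition of R by F and F̄: 0.48

print(round(P_L, 2), round(P_L_or_C, 2), round(P_notR, 2),
      round(P_R_or_F, 2), round(P_R_and_F, 2), round(P_R_not_F, 2))
```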

    1.9 Formal probability proofs: non-examinable

If you are a mathematician, you will be interested to see how properties of probability are proved formally. Only the Axioms, together with standard set-theoretic results, may be used.

Theorem: The probability measure P has the following properties.

(i) P(∅) = 0.

(ii) P(Ā) = 1 - P(A) for any event A.

(iii) (Partition Theorem.) If B1, B2, . . . , Bm form a partition of Ω, then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi).

(iv) P(A ∪ B) = P(A) + P(B) - P(A ∩ B) for any events A, B.

Proof:

(i) For any A, we have A = A ∪ ∅, and A ∩ ∅ = ∅ (mutually exclusive).
So P(A) = P(A ∪ ∅) = P(A) + P(∅) (Axiom 3), giving P(∅) = 0.


(ii) Ω = A ∪ Ā, and A ∩ Ā = ∅ (mutually exclusive).

So 1 = P(Ω) (Axiom 1) = P(A ∪ Ā) = P(A) + P(Ā). (Axiom 3)

(iii) Suppose B1, . . . , Bm are a partition of Ω: then Bi ∩ Bj = ∅ if i ≠ j, and ⋃_{i=1}^{m} Bi = Ω.

Thus, (A ∩ Bi) ∩ (A ∩ Bj) = A ∩ (Bi ∩ Bj) = A ∩ ∅ = ∅ for i ≠ j, i.e. (A ∩ B1), . . . , (A ∩ Bm) are mutually exclusive also.

So,

Σ_{i=1}^{m} P(A ∩ Bi) = P(⋃_{i=1}^{m} (A ∩ Bi))   (Axiom 3)
                      = P(A ∩ ⋃_{i=1}^{m} Bi)      (Distributive laws)
                      = P(A ∩ Ω)
                      = P(A).

(iv)

A ∪ B = (A ∩ Ω) ∪ (B ∩ Ω)                            (Set theory)
      = (A ∩ (B ∪ B̄)) ∪ (B ∩ (A ∪ Ā))                (Set theory)
      = (A ∩ B) ∪ (A ∩ B̄) ∪ (B ∩ A) ∪ (B ∩ Ā)        (Distributive laws)
      = (A ∩ B) ∪ (A ∩ B̄) ∪ (Ā ∩ B).

These 3 events are mutually exclusive:
e.g. (A ∩ B) ∩ (A ∩ B̄) = A ∩ (B ∩ B̄) = A ∩ ∅ = ∅, etc.

So,

P(A ∪ B) = P(A ∩ B) + P(A ∩ B̄) + P(Ā ∩ B)                        (Axiom 3)
         = P(A ∩ B) + [P(A) - P(A ∩ B)] + [P(B) - P(A ∩ B)]      (from (iii), using B, B̄ and A, Ā)
         = P(A) + P(B) - P(A ∩ B).


    1.10 Conditional Probability

Conditioning is another of the fundamental tools of probability: probably the most fundamental tool. It is especially helpful for calculating the probabilities of intersections, such as P(A ∩ B), which themselves are critical for the useful Partition Theorem.

Additionally, the whole field of stochastic processes (Stats 320 and 325) is based on the idea of conditional probability. What happens next in a process depends, or is conditional, on what has happened beforehand.

Dependent events

Suppose A and B are two events on the same sample space. There will often be dependence between A and B. This means that if we know that B has occurred, it changes our knowledge of the chance that A will occur.

Example: Toss a die once.

Let event A = "get a 6". Let event B = "get an even number".

If the die is fair, then P(A) = 1/6 and P(B) = 1/2.

However, if we know that B has occurred, then there is an increased chance that A has occurred:

P(A occurs given that B has occurred) = 1/3 (the result 6, out of the results 2, 4, or 6).

We write

P(A given B) = P(A | B) = 1/3.

Question: what would be P(B | A)?

P(B | A) = P(B occurs, given that A has occurred)
         = P(get an even number, given that we know we got a 6)
         = 1.


    Conditioning as reducing the sample space

The car survey in Section 1.8 also asked respondents which they valued more highly in a car: ease of parking, or style/prestige. Here are the responses:

                                         Male   Female   Total
Prestige more important than parking       79       51     130
Prestige less important than parking       71       99     170
Total                                      150      150     300

Suppose we pick a respondent at random from all those in the table.

Let event A = "respondent thinks that prestige is more important".

P(A) = (# A's) / (total # respondents) = 130/300 = 0.43.

However, this probability differs between males and females. Suppose we reduce our sample space from

Ω = {all people in table}

to

B = {all males in table}.

P(respondent thinks prestige is more important, given that respondent is male)
  = (# males who favour prestige) / (total # males)
  = (# male A's) / (# males)
  = 79/150
  = 0.53.

    We write: P(A | B) = 0.53.
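The table calculation translates directly into code. In the Python sketch below the dictionary layout is our own choice; the counts are those in the table, and the conditional probability is just the intersection count divided by the count of B's.

```python
from fractions import Fraction

# Counts from the table: (prestige more important, prestige less important).
counts = {"male": (79, 71), "female": (51, 99)}

total = sum(sum(v) for v in counts.values())       # 300 respondents
n_A = sum(v[0] for v in counts.values())           # 130 favour prestige
n_B = sum(counts["male"])                          # 150 males
n_A_and_B = counts["male"][0]                      # 79 males who favour prestige

P_A = Fraction(n_A, total)                         # 130/300 ≈ 0.43
P_A_given_B = Fraction(n_A_and_B, n_B)             # 79/150 ≈ 0.53
# Equivalently, P(A | B) = P(A ∩ B) / P(B):
assert P_A_given_B == Fraction(n_A_and_B, total) / Fraction(n_B, total)
print(float(P_A), float(P_A_given_B))
```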


    We could follow the same working for any pair of events, A and B:

P(A | B) = (# B's who are A) / (total # B's)
         = (# in table who are BOTH B and A) / (# B's)
         = [(# in B AND A) / (# in Ω)] / [(# in B) / (# in Ω)]
         = P(A ∩ B) / P(B).

This is our definition of conditional probability:

Definition: Let A and B be two events. The conditional probability that event A occurs, given that event B has occurred, is written P(A | B), and is given by

P(A | B) = P(A ∩ B) / P(B).

Read P(A | B) as "probability of A, given B".

Note: P(A | B) gives P(A and B, from within the set of B's only).
P(A ∩ B) gives P(A and B, from the whole sample space Ω).

Note: Follow the reasoning above carefully. It is important to understand why the conditional probability is the probability of the intersection within the new sample space.

Conditioning on event B means changing the sample space to B.

Think of P(A | B) as the chance of getting an A, from the set of B's only.


The symbol P belongs to the sample space Ω

Recall the first of our probability axioms: P(Ω) = 1.

This indicates that the symbol P is defined with respect to Ω. That is, P BELONGS to the sample space Ω.

If we change the sample space, we need to change the symbol P. This is what we do in conditional probability: to change the sample space from Ω to B, say, we change from the symbol P to the symbol P( · | B).

The symbol P( · | B) should behave exactly like the symbol P. For example:

P(C ∪ D) = P(C) + P(D) - P(C ∩ D),

so

P(C ∪ D | B) = P(C | B) + P(D | B) - P(C ∩ D | B).

Trick for checking conditional probability calculations:

A useful trick for checking a conditional probability expression is to replace the conditioned set by Ω, and see whether the expression is still true.

For example, is P(A | B) + P(Ā | B) = 1?

Answer: Replace B by Ω: this gives

P(A | Ω) + P(Ā | Ω) = P(A) + P(Ā) = 1.

So, yes, P(A | B) + P(Ā | B) = 1 for any other sample space B.

Is P(A | B) + P(A | B̄) = 1?

Try to replace the conditioning set by Ω: we can't! There are two conditioning sets: B and B̄.

The expression is NOT true, and in fact it doesn't make sense to try to add together probabilities from two different sample spaces.


    The Multiplication Rule

    For any events A and B,

P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).

Proof:

Immediate from the definitions:

P(A | B) = P(A ∩ B) / P(B), so P(A ∩ B) = P(A | B)P(B),

and

P(B | A) = P(B ∩ A) / P(A), so P(B ∩ A) = P(A ∩ B) = P(B | A)P(A).

New statement of the Partition Theorem

The Multiplication Rule gives us a new statement of the Partition Theorem: if B1, . . . , Bm partition Ω, then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi) = Σ_{i=1}^{m} P(A | Bi)P(Bi).

Both formulations of the Partition Theorem are very widely used, but especially the conditional formulation Σ_{i=1}^{m} P(A | Bi)P(Bi).

Warning: Be careful to use this new version of the Partition Theorem correctly. It is

P(A) = P(A | B1)P(B1) + . . . + P(A | Bm)P(Bm),

NOT P(A) = P(A | B1) + . . . + P(A | Bm).


    Conditional probability and Peter Pan

When Peter Pan was hungry but had nothing to eat, he would pretend to eat. (An excellent strategy, I have always found.)

Conditional probability is the Peter Pan of Stats 210. When you don't know something that you need to know, pretend you know it.

Conditioning on an event is like pretending that you know that the event has happened.

For example, if you know the probability of getting to work on time in different weather conditions, but you don't know what the weather will be like today, pretend you do and add up the different possibilities:

P(work on time) = P(work on time | fine)P(fine) + P(work on time | wet)P(wet).

    1.11 Examples of conditional probability and partitions

Tom gets the bus to campus every day. The bus is on time with probability 0.6, and late with probability 0.4.

The sample space can be written as Ω = {bus journeys}. We can formulate events as follows:

T = "on time"; L = "late".

From the information given, the events have probabilities:

P(T) = 0.6; P(L) = 0.4.

(a) Do the events T and L form a partition of the sample space Ω? Explain why or why not.

Yes: they cover all possible journeys (probabilities sum to 1), and there is no overlap in the events by definition.


The buses are sometimes crowded and sometimes noisy, both of which are problems for Tom as he likes to use the bus journeys to do his Stats assignments. When the bus is on time, it is crowded with probability 0.5. When it is late, it is crowded with probability 0.7. The bus is noisy with probability 0.8 when it is crowded, and with probability 0.4 when it is not crowded.

(b) Formulate events C and N corresponding to the bus being crowded and noisy. Do the events C and N form a partition of the sample space? Explain why or why not.

Let C = "crowded", N = "noisy". C and N do NOT form a partition of Ω. It is possible for the bus to be noisy when it is crowded, so there must be some overlap between C and N.

(c) Write down probability statements corresponding to the information given above. Your answer should involve two statements linking C with T and L, and two statements linking N with C.

P(C | T) = 0.5; P(C | L) = 0.7.
P(N | C) = 0.8; P(N | C̄) = 0.4.

(d) Find the probability that the bus is crowded.

P(C) = P(C | T)P(T) + P(C | L)P(L)   (Partition Theorem)
     = 0.5 × 0.6 + 0.7 × 0.4
     = 0.58.

(e) Find the probability that the bus is noisy.

P(N) = P(N | C)P(C) + P(N | C̄)P(C̄)   (Partition Theorem)
     = 0.8 × 0.58 + 0.4 × (1 - 0.58)
     = 0.632.
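The two Partition Theorem steps in (d) and (e) can be checked with a few lines of Python; the probabilities are exactly those given in the question, and the variable names are our own.

```python
P_T, P_L = 0.6, 0.4                    # "on time" and "late" partition Ω
P_C_given_T, P_C_given_L = 0.5, 0.7    # crowded, conditional on T and on L
P_N_given_C, P_N_given_notC = 0.8, 0.4 # noisy, conditional on C and on C̄

P_C = P_C_given_T * P_T + P_C_given_L * P_L            # 0.58
P_N = P_N_given_C * P_C + P_N_given_notC * (1 - P_C)   # 0.632
print(round(P_C, 3), round(P_N, 3))
```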


1.12 Bayes' Theorem: inverting conditional probabilities

Consider P(B ∩ A) = P(A ∩ B). Apply the multiplication rule to each side:

P(B | A)P(A) = P(A | B)P(B).

Thus P(B | A) = P(A | B)P(B) / P(A).   (*)

This is the simplest form of Bayes' Theorem, named after Thomas Bayes (1702-61), English clergyman and founder of Bayesian Statistics.

Bayes' Theorem allows us to invert the conditioning, i.e. to express P(B | A) in terms of P(A | B).

This is very useful. For example, it might be easy to calculate

P(later event | earlier event),

but we might only observe the later event and wish to deduce the probability that the earlier event occurred,

P(earlier event | later event).

Full statement of Bayes' Theorem:

Theorem 1.12: Let B1, B2, . . . , Bm form a partition of Ω. Then for any event A, and for any j = 1, . . . , m,

P(Bj | A) = P(A | Bj)P(Bj) / Σ_{i=1}^{m} P(A | Bi)P(Bi).   (Bayes' Theorem)

Proof: Immediate from (*) (put B = Bj), and the Partition Rule, which gives P(A) = Σ_{i=1}^{m} P(A | Bi)P(Bi).


Special case of Bayes' Theorem when m = 2: use B and B̄ as the partition of Ω. Then

P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | B̄)P(B̄)].

Example: The case of the Perfidious Gardener.

Mr Smith owns a hysterical rosebush. It will die with probability 1/2 if watered, and with probability 3/4 if not watered. Worse still, Smith employs a perfidious gardener who will fail to water the rosebush with probability 2/3.

Smith returns from holiday to find the rosebush . . . DEAD!!! What is the probability that the gardener did not water it?

Solution:

First step: formulate events.

Let: D = "rosebush dies"; W = "gardener waters rosebush"; W̄ = "gardener fails to water rosebush".

Second step: write down all information given.

P(D | W) = 1/2, P(D | W̄) = 3/4, P(W̄) = 2/3 (so P(W) = 1/3).

Third step: write down what we're looking for: P(W̄ | D).

Fourth step: compare this to what we know. We need to invert the conditioning, so use Bayes' Theorem:

P(W̄ | D) = P(D | W̄)P(W̄) / [P(D | W̄)P(W̄) + P(D | W)P(W)]
          = (3/4 × 2/3) / (3/4 × 2/3 + 1/2 × 1/3)
          = 3/4.

So the gardener failed to water the rosebush with probability 3/4.


    Example: The case of the Defective Ketchup Bottle.

Ketchup bottles are produced in 3 different factories, accounting for 50%, 30%, and 20% of the total output respectively. The percentage of bottles from the 3 factories that are defective is respectively 0.4%, 0.6%, and 1.2%. A statistics lecturer who eats only ketchup finds a defective bottle in her wig. What is the probability that it came from Factory 1?

Solution:

1. Events: let Fi = "bottle comes from Factory i" (i = 1, 2, 3); let D = "bottle is defective".

2. Information given:

P(F1) = 0.5, P(F2) = 0.3, P(F3) = 0.2;
P(D | F1) = 0.004, P(D | F2) = 0.006, P(D | F3) = 0.012.

3. Looking for: P(F1 | D) (so we need to invert the conditioning).

4. Bayes' Theorem:

P(F1 | D) = P(D | F1)P(F1) / [P(D | F1)P(F1) + P(D | F2)P(F2) + P(D | F3)P(F3)]
          = (0.004 × 0.5) / (0.004 × 0.5 + 0.006 × 0.3 + 0.012 × 0.2)
          = 0.002 / 0.0062
          = 0.322.
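Bayes' Theorem is easy to wrap in a small helper function. The Python sketch below (the helper name bayes is our own) reproduces the ketchup calculation using the priors and likelihoods given above.

```python
def bayes(prior, likelihood):
    # Returns P(Bj | A) for each j, given priors P(Bj) and likelihoods P(A | Bj),
    # where B1, ..., Bm are assumed to partition the sample space.
    evidence = sum(p * l for p, l in zip(prior, likelihood))   # P(A), Partition Theorem
    return [p * l / evidence for p, l in zip(prior, likelihood)]

# Defective-ketchup example: P(F1), P(F2), P(F3) and P(D | Fi) as given above.
posterior = bayes([0.5, 0.3, 0.2], [0.004, 0.006, 0.012])
print(round(posterior[0], 3))   # P(F1 | D) ≈ 0.322
```

The same helper applied to the gardener example, bayes([1/3, 2/3], [1/2, 3/4]), returns 0.75 as its second entry, matching P(W̄ | D) = 3/4.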


    1.13 Chains of events and probability trees: non-examinable

The multiplication rule is very helpful for calculating probabilities when events happen in sequence.

Example: Two balls are drawn at random without replacement from a box containing 4 white and 2 red balls. Find the probability that:
(a) they are both white;
(b) the second ball is red.

Solution

Let event Wi = "ith ball is white" and Ri = "ith ball is red".

(a) P(W1 ∩ W2) = P(W2 ∩ W1) = P(W2 | W1)P(W1).

Now P(W1) = 4/6 and P(W2 | W1) = 3/5.

So P(both white) = P(W1 ∩ W2) = 3/5 × 4/6 = 2/5.

(b) Looking for P(2nd ball is red). We can't find this without conditioning on what happened in the first draw.

The event "2nd ball is red" is actually the event {W1R2, R1R2} = (W1 ∩ R2) ∪ (R1 ∩ R2).

So P(2nd ball is red) = P(W1 ∩ R2) + P(R1 ∩ R2)   (mutually exclusive)
                      = P(R2 | W1)P(W1) + P(R2 | R1)P(R1)
                      = 2/5 × 4/6 + 1/5 × 2/6
                      = 1/3.


    Probability trees

    Probability trees are a graphical way of representing the multiplication rule.

[Probability tree. First draw: P(W1) = 4/6, P(R1) = 2/6. Second draw: P(W2 | W1) = 3/5, P(R2 | W1) = 2/5, P(W2 | R1) = 4/5, P(R2 | R1) = 1/5.]

Write conditional probabilities on the branches, and multiply to get the probability of an intersection: e.g. P(W1 ∩ W2) = 4/6 × 3/5, or P(R1 ∩ W2) = 2/6 × 4/5.

    More than two events

To find P(A1 ∩ A2 ∩ A3) we can apply the multiplication rule successively:

P(A1 ∩ A2 ∩ A3) = P(A3 ∩ (A1 ∩ A2))
                = P(A3 | A1 ∩ A2)P(A1 ∩ A2)        (multiplication rule)
                = P(A3 | A1 ∩ A2)P(A2 | A1)P(A1)   (multiplication rule)


Remember as: P(A1 ∩ A2 ∩ A3) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1).

On the probability tree, the branch probabilities P(A1), P(A2 | A1), and P(A3 | A2 ∩ A1) multiply along the path to give P(A1 ∩ A2 ∩ A3).

In general, for n events A1, A2, . . . , An, we have

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1) . . . P(An | An-1 ∩ . . . ∩ A1).

Example: A box contains w white balls and r red balls. Draw 3 balls without replacement. What is the probability of getting the sequence white, red, white?

Answer:

P(W1 ∩ R2 ∩ W3) = P(W1)P(R2 | W1)P(W3 | R2 ∩ W1)
                = (w / (w + r)) × (r / (w + r - 1)) × ((w - 1) / (w + r - 2)).


Two separate studies say . . . "You're Better Off with AntiCough!"

So you're better off with AntiCough . . . or are you???

Have a look at the figures:

Study 1:
            AntiCough   Other Medicine
Given to:          25               75
Cured:             20               58
% Cured:          80%              77%

Study 2:
            AntiCough   Other Medicine
Given to:          75               25
Cured:             50               16
% Cured:          67%              64%

Combine the studies . . . What happens? Never believe what you read: this is Simpson's Paradox.


1.14 Simpson's paradox: non-examinable

It is possible for one treatment (e.g. AntiCough) to be better than another (Other Medicine) in every one of a set of categories (e.g. Study 1 and Study 2), but worse overall!

Combining the results above:

            AntiCough   Other Medicine
Given to:         100              100
Cured:             70               74
% Cured:          70%              74%

Overall, AntiCough has a 4% lower cure percentage (70%), despite being about 3% higher in both Study 1 and Study 2.

This effect is known as Simpson's Paradox.

It occurs because

P(C | A) = P(C | A ∩ S1)P(S1 | A) + P(C | A ∩ S2)P(S2 | A);

P(C | Ā) = P(C | Ā ∩ S1)P(S1 | Ā) + P(C | Ā ∩ S2)P(S2 | Ā),

where C = {cured}, A = {AntiCough}, Ā = {Other Medicine}, S1 = {Study 1}, S2 = {Study 2}.

Although P(C | A ∩ S1) > P(C | Ā ∩ S1), and P(C | A ∩ S2) > P(C | Ā ∩ S2), the other terms can change the overall outcome:

P(S1 | A), P(S1 | Ā), P(S2 | A), P(S2 | Ā).
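A few lines of arithmetic reproduce the tables and show the reversal. The study counts in the Python sketch below are the ones given above; the list layout and function name are our own.

```python
# (number treated, number cured) per study, from the tables above.
anticough = [(25, 20), (75, 50)]
other     = [(75, 58), (25, 16)]

def pooled_cure_rate(studies):
    treated = sum(n for n, _ in studies)
    cured = sum(c for _, c in studies)
    return cured / treated

# Within each study AntiCough does better (0.80 vs 0.77, 0.67 vs 0.64) ...
print([c / n for n, c in anticough], [c / n for n, c in other])
# ... but pooled it does worse (0.70 vs 0.74): Simpson's paradox.
print(pooled_cure_rate(anticough), pooled_cure_rate(other))
```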


    1.15 Equally likely outcomes and combinatorics: non-examinable

Sometimes, all the outcomes in a discrete finite sample space are equally likely. This makes it easy to calculate probabilities. If:

(i) Ω = {s1, . . . , sk};

(ii) each outcome si is equally likely, so p1 = p2 = . . . = pk = 1/k;

(iii) event A = {s1, s2, . . . , sr} contains r possible outcomes,

then

P(A) = r/k = (# outcomes in A) / (# outcomes in Ω).

Example: For a 3-child family, possible outcomes from oldest to youngest are:

Ω = {GGG, GGB, GBG, GBB, BGG, BGB, BBG, BBB} = {s1, s2, s3, s4, s5, s6, s7, s8}.

Let {p1, p2, . . . , p8} be a probability distribution on Ω. If every baby is equally likely to be a boy or a girl, then all of the 8 outcomes in Ω are equally likely, so p1 = p2 = . . . = p8 = 1/8.

Let event A be A = "oldest child is a girl".

Then A = {GGG, GGB, GBG, GBB}. Event A contains 4 of the 8 equally likely outcomes, so event A occurs with probability P(A) = 4/8 = 1/2.

Counting equally likely outcomes

To count the number of equally likely outcomes in an event, we often need to use permutations or combinations. These give the number of ways of choosing r objects from n distinct objects.

For example, if we wish to select 3 objects from n = 5 objects (a, b, c, d, e), we have choices abc, abd, abe, acd, ace, . . .


1. Number of Permutations, nPr

The number of permutations, nPr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute different choices.

That is, choice (a, b, c) counts separately from choice (b, a, c).

Then

# permutations = nPr = n(n - 1)(n - 2) . . . (n - r + 1) = n! / (n - r)!.

(n choices for the first object, (n - 1) choices for the second, etc.)

2. Number of Combinations, nCr = (n choose r)

The number of combinations, nCr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute the same choice.

That is, choice (a, b, c) and choice (b, a, c) are the same.

Then

# combinations = nCr = (n choose r) = nPr / r! = n! / ((n - r)! r!).

(Because nPr counts each permutation r! times, and we only want to count it once: so divide nPr by r!.)

Use the same rule on the numerator and the denominator

When P(A) = (# outcomes in A) / (# outcomes in Ω), we can often think about the problem either with different orderings constituting different choices, or with different orderings constituting the same choice. The critical thing is to use the same rule for both numerator and denominator.


Example: (a) Tom has five elderly great-aunts who live together in a tiny bungalow. They insist on each receiving separate Christmas cards, and threaten to disinherit Tom if he sends two of them the same picture. Tom has Christmas cards with 12 different designs. In how many different ways can he select 5 different designs from the 12 designs available?

Order of cards is not important, so use combinations. The number of ways of selecting 5 distinct designs from 12 is

12C5 = (12 choose 5) = 12! / ((12 - 5)! 5!) = 792.

(b) The next year, Tom buys a pack of 40 Christmas cards, featuring 10 different pictures with 4 cards of each picture. He selects 5 cards at random to send to his great-aunts. What is the probability that at least two of the great-aunts receive the same picture?

Looking for P(at least 2 cards the same) = P(A) (say).

Easiest to find P(all 5 cards are different) = P(Ā).

The number of outcomes in Ā is

(# ways of selecting 5 different designs) = 40 × 36 × 32 × 28 × 24.

(40 choices for the first card; 36 for the second, because the 4 cards with the first design are excluded; etc. Note that order matters: e.g. we are counting choice 12345 separately from 23154.)

The total number of outcomes is

(total # ways of selecting 5 cards from 40) = 40 × 39 × 38 × 37 × 36.

(Note: order mattered above, so we need order to matter here too.)

So

P(Ā) = (40 × 36 × 32 × 28 × 24) / (40 × 39 × 38 × 37 × 36) = 0.392.

Thus

P(A) = P(at least 2 cards are the same design) = 1 - P(Ā) = 1 - 0.392 = 0.608.


Alternative solution if order does not matter on the numerator and denominator (much harder method):

P(Ā) = [(10 choose 5) × 4^5] / (40 choose 5).

This works because there are (10 choose 5) ways of choosing 5 different designs from 10, and there are 4 choices of card within each of the 5 chosen groups. So the total number of ways of choosing 5 cards of different designs is (10 choose 5) × 4^5. The total number of ways of choosing 5 cards from 40 is (40 choose 5).

    Exercise: Check that this gives the same answer for P(A) as before.
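Both counting arguments are easy to check with the standard-library helpers math.comb and math.perm (available in Python 3.8+). The sketch below confirms that the ordered and unordered counts give the same probability; it is a check of the exercise, not part of the course text.

```python
from math import comb, perm

# Ordered counting: 40*36*32*28*24 favourable out of perm(40, 5) ordered selections.
favourable_ordered = 40 * 36 * 32 * 28 * 24
total_ordered = perm(40, 5)                  # 40*39*38*37*36

# Unordered counting: choose 5 of the 10 designs, then one of 4 cards per design.
favourable_unordered = comb(10, 5) * 4**5
total_unordered = comb(40, 5)

p1 = favourable_ordered / total_ordered
p2 = favourable_unordered / total_unordered
print(round(p1, 3), round(p2, 3))            # both ≈ 0.392
print(round(1 - p1, 3))                      # P(at least two the same) ≈ 0.608
```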

Note: Problems like these belong to the branch of mathematics called Combinatorics: the science of counting.

    1.16 Statistical Independence

Two events A and B are statistically independent if the occurrence of one does not affect the occurrence of the other.

This means P(A | B) = P(A) and P(B | A) = P(B).

Now P(A | B) = P(A ∩ B) / P(B),

so if P(A | B) = P(A) then P(A ∩ B) = P(A) × P(B).

We use this as our definition of statistical independence.

Definition: Events A and B are statistically independent if

P(A ∩ B) = P(A)P(B).


    For more than two events, we say:

Definition: Events A1, A2, . . . , An are mutually independent if

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2) . . . P(An), AND

the same multiplication rule holds for every subcollection of the events too.

E.g. events A1, A2, A3, A4 are mutually independent if

(i) P(Ai ∩ Aj) = P(Ai)P(Aj) for all i, j with i ≠ j; AND

(ii) P(Ai ∩ Aj ∩ Ak) = P(Ai)P(Aj)P(Ak) for all i, j, k that are all different; AND

(iii) P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1)P(A2)P(A3)P(A4).

Statistical independence for calculating the probability of an intersection

In Section 1.6 we said that it is often hard to calculate P(A ∩ B). We usually have two choices.

1. IF A and B are statistically independent, then

P(A ∩ B) = P(A) × P(B).

2. If A and B are not known to be statistically independent, we usually have to use conditional probability and the multiplication rule:

P(A ∩ B) = P(A | B)P(B).

This still requires us to be able to calculate P(A | B).

Note: If events are physically independent, then they will also be statistically independent.


Example: Toss a fair coin and a fair die together. The coin and die are physically independent.

Sample space: Ω = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}; all 12 items are equally likely.

Let A = "heads" and B = "six".

Then P(A) = P({H1, H2, H3, H4, H5, H6}) = 6/12 = 1/2,

and P(B) = P({H6, T6}) = 2/12 = 1/6.

Now P(A ∩ B) = P(heads and 6) = P({H6}) = 1/12.

But P(A) × P(B) = 1/2 × 1/6 = 1/12 also.

So P(A ∩ B) = P(A)P(B) and thus A and B are statistically independent.

Pairwise independence does not imply mutual independence

Example: A jar contains 4 balls: one red, one white, one blue, and one red, white & blue. Draw one ball at random.

Let A = "ball has red on it", B = "ball has white on it", C = "ball has blue on it".

Two balls satisfy A, so P(A) = 2/4 = 1/2. Likewise, P(B) = P(C) = 1/2.

Pairwise independent:

Consider P(A ∩ B) = 1/4 (one of the 4 balls has both red and white on it).

But P(A) × P(B) = 1/2 × 1/2 = 1/4, so P(A ∩ B) = P(A)P(B).

Likewise, P(A ∩ C) = P(A)P(C), and P(B ∩ C) = P(B)P(C).

So A, B and C are pairwise independent.

Mutually independent?

Consider P(A ∩ B ∩ C) = 1/4 (one of the 4 balls),

while P(A)P(B)P(C) = 1/2 × 1/2 × 1/2 = 1/8 ≠ P(A ∩ B ∩ C).

So A, B and C are NOT mutually independent, despite being pairwise independent.
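The four-ball example can be checked exhaustively. In the Python sketch below the data layout is our own choice: each ball is represented by the set of colours it carries, and every pairwise product is verified before the triple product is shown to fail.

```python
from fractions import Fraction
from itertools import combinations

# The four equally likely balls, described by the colours they carry.
balls = [{"red"}, {"white"}, {"blue"}, {"red", "white", "blue"}]
P = lambda E: Fraction(len(E), len(balls))

# Event "ball has colour c on it", as a set of ball indices.
events = {c: {i for i, b in enumerate(balls) if c in b} for c in ("red", "white", "blue")}

# Pairwise independent: P(X ∩ Y) = P(X)P(Y) for every pair ...
for X, Y in combinations(events.values(), 2):
    assert P(X & Y) == P(X) * P(Y)

# ... but not mutually independent: the triple product fails.
A, B, C = events.values()
print(P(A & B & C), P(A) * P(B) * P(C))   # 1/4 versus 1/8
```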


    1.17 Random Variables

We have one more job to do in laying the foundations of our science of randomness. So far we have come up with the following ideas:

1. Things that happen are sets, also called events.

2. We measure chance by measuring sets, using a measure called probability.

Finally, what are the sets that we are measuring? It is a nuisance to have lots of different sample spaces:

Ω = {head, tail}; Ω = {same, different}; Ω = {Lions win, All Blacks win}.

All of these sample spaces could be represented more concisely in terms of numbers:

Ω = {0, 1}.

On the other hand, there are many random experiments that genuinely produce random numbers as their outcomes.

For example, the number of girls in a three-child family; the number of heads from 10 tosses of a coin; and so on.

When the outcome of a random experiment is a number, it enables us to quantify many new things of interest:

1. quantify the average value (e.g. the average number of heads we would get if we made 10 coin-tosses again and again);

2. quantify how much the outcomes tend to diverge from the average value;

3. quantify relationships between different random quantities (e.g. is the number of girls related to the hormone levels of the fathers?)

The list is endless. To give us a framework in which these investigations can take place, we give a special name to random experiments that produce numbers as their outcomes.

A random experiment whose possible outcomes are real numbers is called a random variable.


    In fact, any random experiment can be made to have outcomes that are realnumbers, simply by mapping the sample space onto a set of real numbers usinga function.

    For example: function X : RX(Lions win) = 0; X(All Blacks win) = 1.

    This gives us our formal definition of a random variable:

    Definition: A random variable (r.v.) is a function from a sample space to thereal numbers R.

We write X : Ω → R.

Although this is the formal definition, the intuitive definition of a random variable is probably more useful. Intuitively, remember that a random variable equates to a random experiment whose outcomes are numbers.

A random variable produces random real numbers as the outcome of a random experiment.

    Defining random variables serves the dual purposes of:

1. Describing many different sample spaces in the same terms: e.g. Ω = {0, 1} with P(1) = p and P(0) = 1 − p describes EVERY possible experiment with two outcomes.

2. Giving a name to a large class of random experiments that genuinely produce random numbers, and for which we want to develop general rules for finding averages, variances, relationships, and so on.

    Example: Toss a coin 3 times. The sample space is

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

One example of a random variable is X : Ω → R such that, for sample points si, we have X(si) = # heads in outcome si.

So X(HHH) = 3, X(THT) = 1, etc.


Another example is Y : Ω → R such that Y(si) = 1 if the 2nd toss is a head, and Y(si) = 0 otherwise.

Then Y(HTH) = 0, Y(THH) = 1, Y(HHH) = 1, etc.

    Probabilities for random variables

By convention, we use CAPITAL LETTERS for random variables (e.g. X), and lower case letters to represent the values that the random variable takes (e.g. x).

For a sample space Ω and random variable X : Ω → R, and for a real number x,

P(X = x) = P(outcome s is such that X(s) = x) = P({s : X(s) = x}).

Example: toss a fair coin 3 times. All outcomes are equally likely: P(HHH) = P(HHT) = . . . = P(TTT) = 1/8.

Let X : Ω → R such that X(s) = # heads in s. Then P(X = 0) = P({TTT}) = 1/8.

P(X = 1) = P({HTT, THT, TTH}) = 3/8.
P(X = 2) = P({HHT, HTH, THH}) = 3/8.
P(X = 3) = P({HHH}) = 1/8.

    Note that P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 1.
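These probabilities can also be recovered mechanically from the sample space, using P(X = x) = P({s : X(s) = x}). A minimal Python sketch:

    from fractions import Fraction
    from itertools import product

    omega = ["".join(s) for s in product("HT", repeat=3)]  # 8 equally likely outcomes

    def X(s):
        return s.count("H")  # number of heads in outcome s

    for x in range(4):
        event = [s for s in omega if X(s) == x]         # the event {s : X(s) = x}
        print(x, Fraction(len(event), len(omega)))      # 1/8, 3/8, 3/8, 1/8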

    Independent random variables

    Random variables X and Y are independent if each does not affect the other.

Recall that two events A and B are independent if P(A ∩ B) = P(A)P(B). Similarly, random variables X and Y are defined to be independent if

P({X = x} ∩ {Y = y}) = P(X = x)P(Y = y) for all possible values x and y.


We usually replace the cumbersome notation P({X = x} ∩ {Y = y}) by the simpler notation P(X = x, Y = y).

    From now on, we will use the following notations interchangeably:

P({X = x} ∩ {Y = y}) = P(X = x AND Y = y) = P(X = x, Y = y).

Thus X and Y are independent if and only if

    P(X = x, Y = y) = P(X = x)P(Y = y) for ALL possible values x, y.
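For a small sample space this condition can be checked by brute force. As an illustration (this particular pair is not worked in the text), take X = # heads and Y = indicator of a head on the 2nd toss, from the three-toss example: they are NOT independent, since P(X = 3, Y = 1) = 1/8 but P(X = 3)P(Y = 1) = 1/16. A minimal Python sketch:

    from fractions import Fraction
    from itertools import product

    omega = ["".join(s) for s in product("HT", repeat=3)]   # fair coin tossed 3 times

    X = lambda s: s.count("H")             # number of heads
    Y = lambda s: 1 if s[1] == "H" else 0  # 1 if the 2nd toss is a head

    def P(event):
        return Fraction(len([s for s in omega if event(s)]), len(omega))

    independent = all(
        P(lambda s: X(s) == x and Y(s) == y) == P(lambda s: X(s) == x) * P(lambda s: Y(s) == y)
        for x in range(4) for y in (0, 1)
    )
    print(independent)   # False: X and Y are not independent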


    1.18 Key Probability Results for Chapter 1

1. If A and B are mutually exclusive (i.e. A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

2. Conditional probability: P(A | B) = P(A ∩ B) / P(B) for any A, B.

Or: P(A ∩ B) = P(A | B)P(B).

3. For any A, B, we can write

P(A | B) = P(B | A)P(A) / P(B).

This is a simplified version of Bayes' Theorem. It shows how to invert the conditioning, i.e. how to find P(A | B) when you know P(B | A).

4. Bayes' Theorem, slightly more generalized: for any A, B,

P(A | B) = P(B | A)P(A) / [ P(B | A)P(A) + P(B | Ā)P(Ā) ].

This works because A and Ā form a partition of the sample space.

5. Complete version of Bayes' Theorem:

If sets A1, . . . , Am form a partition of the sample space, i.e. they do not overlap (mutually exclusive) and collectively cover all possible outcomes (their union is the sample space), then

P(Aj | B) = P(B | Aj)P(Aj) / [ P(B | A1)P(A1) + . . . + P(B | Am)P(Am) ]
          = P(B | Aj)P(Aj) / Σ_{i=1}^{m} P(B | Ai)P(Ai).
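The complete version translates directly into a short calculation. A minimal Python sketch (the machine/defect numbers below are made up purely for illustration):

    def bayes(priors, likelihoods, j):
        # P(A_j | B), given P(A_i) (priors) and P(B | A_i) (likelihoods) for a partition A_1, ..., A_m.
        denominator = sum(p * l for p, l in zip(priors, likelihoods))  # Partition Theorem: P(B)
        return likelihoods[j] * priors[j] / denominator

    # Hypothetical example: three machines make 50%, 30% and 20% of all items (the partition),
    # with defect rates 1%, 2% and 3%; B = "item is defective".
    priors = [0.5, 0.3, 0.2]
    likelihoods = [0.01, 0.02, 0.03]
    print(round(bayes(priors, likelihoods, 0), 3))   # P(A_1 | B) = 0.005/0.017, approximately 0.294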


    6. Partition Theorem: if A1, . . . , Am form a partition of the sample space, then

P(B) = P(B ∩ A1) + P(B ∩ A2) + . . . + P(B ∩ Am).

    This can also be written as:

    P(B) = P(B | A1)P(A1) + P(B | A2)P(A2) + . . . + P(B | Am)P(Am) .

    These are both very useful formulations.

    7. Chains of events:

P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A2 ∩ A1).

    8. Statistical independence:

    if A and B are independent, then

P(A ∩ B) = P(A)P(B), and
P(A | B) = P(A), and
P(B | A) = P(B).

    9. Conditional probability:

If P(B) > 0, then we can treat P( · | B) just like P:

e.g. if A1 and A2 are mutually exclusive, then P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B) (compare with P(A1 ∪ A2) = P(A1) + P(A2));

if A1, . . . , Am partition the sample space, then P(A1 | B) + P(A2 | B) + . . . + P(Am | B) = 1;

and P(Ā | B) = 1 − P(A | B) for any A.

(Note: it is not generally true that P(A | B̄) = 1 − P(A | B).)

The fact that P( · | B) is a valid probability measure is easily verified by checking that it satisfies Axioms 1, 2, and 3.

    10. Unions: For any A, B, C,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B);

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

The second expression is obtained by writing P(A ∪ B ∪ C) = P(A ∪ (B ∪ C)) and applying the first expression to A and (B ∪ C), then applying it again to expand P(B ∪ C).


Chapter 2: Discrete Probability Distributions

    2.1 Introduction

    In the next two chapters we meet several important concepts:

    1. Probability distributions, and the probability function fX(x):

the probability function of a random variable lists the values the random variable can take, and their probabilities.

    2. Hypothesis testing:

I toss a coin ten times and get nine heads. How unlikely is that? Can we continue to believe that the coin is fair when it produces nine heads out of ten tosses?

    3. Likelihood and estimation:

what if we know that our random variable is (say) Binomial(5, p), for some p, but we don't know the value of p? We will see how to estimate the value of p using maximum likelihood estimation.

4. Expectation and variance of a random variable:

the expectation of a random variable is the value it takes on average;

the variance of a random variable measures how much the random variable varies about its average.

    5. Change of variable procedures:

calculating probabilities and expectations of g(X), where X is a random variable and g(X) is a function, e.g. g(X) = √X or g(X) = X².

    6. Modelling:

we have a situation in real life that we know is random. But what does the randomness look like? Is it highly variable, or is there little variability? Does it sometimes give results much higher than average, but never give results much lower (a long-tailed distribution)? We will see how different probability distributions are suitable for different circumstances. Choosing a probability distribution to fit a situation is called modelling.


    2.2 The probability function, fX(x)

    The probability function fX(x) lists all possible values of X,

    and gives a probability to each value.

Recall that a random variable, X, assigns a real number to every possible outcome of a random experiment. The random variable is discrete if the set of real values it can take is finite or countable, e.g. {0, 1, 2, . . . }.

Random experiment: which car? (Ferrari, Porsche, MG, . . . )
Random variable: X. X gives numbers to the possible outcomes.

If he chooses . . .
Ferrari: X = 1
Porsche: X = 2
MG: X = 3

Definition: The probability function, fX(x), for a discrete random variable X, is given by

fX(x) = P(X = x), for all possible outcomes x of X.

    Example: Which car?

Outcome:              Ferrari   Porsche    MG
x                        1         2        3
fX(x) = P(X = x)        1/6       1/6      4/6

We write: P(X = 1) = fX(1) = 1/6: the probability he makes choice 1 (a Ferrari) is 1/6.


We can also write the probability function as:

fX(x) = 1/6 if x = 1,
        1/6 if x = 2,
        4/6 if x = 3,
        0   otherwise.

Example: Toss a fair coin once, and let X = number of heads. Then

X = 0 with probability 0.5,
    1 with probability 0.5.

    The probability function of X is given by:

x                    0     1
fX(x) = P(X = x)    0.5   0.5

or

fX(x) = 0.5 if x = 0,
        0.5 if x = 1,
        0   otherwise.

We write (e.g.) fX(0) = 0.5, fX(1) = 0.5, fX(7.5) = 0, etc.

    fX(x) is just a list of probabilities.

    Properties of the probability function

i) 0 ≤ fX(x) ≤ 1 for all x; probabilities are always between 0 and 1.

ii) Σ_x fX(x) = 1; probabilities add to 1 overall.

iii) P(X ∈ A) = Σ_{x ∈ A} fX(x);

e.g. in the car example,

P(X ∈ {1, 2}) = P(X = 1 or 2) = P(X = 1) + P(X = 2) = 1/6 + 1/6 = 2/6.

    This is the probability of choosing either a Ferrari or a Porsche.
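Because fX(x) is just a list of probabilities, it is natural to store it as a lookup table. A minimal Python sketch for the car example (the names f_X and prob_in are illustrative):

    from fractions import Fraction

    # Probability function of X for the "which car?" example.
    f_X = {1: Fraction(1, 6), 2: Fraction(1, 6), 3: Fraction(4, 6)}

    assert sum(f_X.values()) == 1   # property (ii): probabilities add to 1

    def prob_in(A):
        # Property (iii): P(X in A) is the sum of f_X(x) over x in A.
        return sum(f_X.get(x, Fraction(0)) for x in A)

    print(prob_in({1, 2}))   # 1/3 (= 2/6): a Ferrari or a Porsche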


    2.3 Bernoulli trials

Many of the discrete random variables that we meet are based on counting the outcomes of a series of trials called Bernoulli trials. Jacques Bernoulli was a Swiss mathematician in the late 1600s. He and his brother Jean, who were bitter rivals, both studied mathematics secretly against their father's will. Their father wanted Jacques to be a theologian and Jean to be a merchant.

Definition: A random experiment is called a set of Bernoulli trials if it consists of several trials such that:

i) Each trial has only 2 possible outcomes (usually called "Success" and "Failure");

    ii) The probability of success, p, remains constant for all trials;

iii) The trials are independent, i.e. the event "success in trial i" does not depend on the outcome of any other trials.

Examples: 1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2.

2) Repeated tossing of a fair die: success = "6", failure = "not 6". Each toss is a Bernoulli trial with P(success) = 1/6.

Definition: The random variable Y is called a Bernoulli random variable if it takes only 2 values, 0 and 1.

    The probability function is,

fY(y) = p      if y = 1,
        1 − p  if y = 0.

That is,
P(Y = 1) = P(success) = p,
P(Y = 0) = P(failure) = 1 − p.


    2.4 Example of the probability function: the Binomial Distribution

The Binomial distribution counts the number of successes in a fixed number of Bernoulli trials.

Definition: Let X be the number of successes in n independent Bernoulli trials each with probability of success = p. Then X has the Binomial distribution with parameters n and p. We write X ~ Bin(n, p), or X ~ Binomial(n, p).

Thus X ~ Bin(n, p) if X is the number of successes out of n independent trials, each of which has probability p of success.

    Probability function

If X ~ Binomial(n, p), then the probability function for X is

fX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x)   for x = 0, 1, . . . , n.

    Explanation:

An outcome with x successes and (n − x) failures has probability

p^x   (1 − p)^(n−x)
(1)        (2)

where:
(1) succeeds x times, each with probability p;
(2) fails (n − x) times, each with probability (1 − p).


There are (n choose x) possible outcomes with x successes and (n − x) failures, because we must select x trials to be our successes, out of n trials in total.

    Thus,

P(#successes = x) = (#outcomes with x successes) × (prob. of each such outcome)
                  = (n choose x) p^x (1 − p)^(n−x).

    Note:

fX(x) = 0 if x ∉ {0, 1, 2, . . . , n}.

Check that Σ_{x=0}^{n} fX(x) = 1:

Σ_{x=0}^{n} fX(x) = Σ_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x) = [p + (1 − p)]^n   (Binomial Theorem)
                  = 1^n = 1.

It is this connection with the Binomial Theorem that gives the Binomial Distribution its name.
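The probability function, and the check that it sums to 1, are easy to compute numerically. A minimal Python sketch (evaluated here for n = 4, p = 0.2, the case used in Example 1 below):

    from math import comb

    def binomial_pmf(x, n, p):
        # P(X = x) for X ~ Binomial(n, p).
        return comb(n, x) * p**x * (1 - p)**(n - x)

    n, p = 4, 0.2
    print([round(binomial_pmf(x, n, p), 4) for x in range(n + 1)])
    # [0.4096, 0.4096, 0.1536, 0.0256, 0.0016]
    print(sum(binomial_pmf(x, n, p) for x in range(n + 1)))   # 1.0 (up to floating-point rounding)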


Example 1: Let X ~ Binomial(n = 4, p = 0.2). Write down the probability function of X.

x                     0        1        2        3        4
fX(x) = P(X = x)    0.4096   0.4096   0.1536   0.0256   0.0016

    Example 2: Let X be the number of times I get a 6 out of 10 rolls of a fair die.

    1. What is the distribution of X?

2. What is the probability that X ≥ 2?

1. X ~ Binomial(n = 10, p = 1/6).

2. P(X ≥ 2) = 1 − P(X < 2)
            = 1 − P(X = 0) − P(X = 1)
            = 1 − (10 choose 0) (1/6)^0 (1 − 1/6)^(10−0) − (10 choose 1) (1/6)^1 (1 − 1/6)^(10−1)
            = 0.515.
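The same arithmetic can be checked numerically with a couple of lines of Python (a sketch, using the standard-library function math.comb):

    from math import comb

    n, p = 10, 1/6
    p_less_than_2 = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in (0, 1))
    print(round(1 - p_less_than_2, 3))   # 0.515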

Example 3: Let X be the number of girls in a three-child family. What is the distribution of X?

    Assume:

    (i) each child is equally likely to be a boy or a girl;

    (ii) all children are independent of each other.

Then X ~ Binomial(n = 3, p = 0.5).


    Shape of the Binomial distribution

The shape of the Binomial distribution depends upon the values of n and p. For small n, the distribution is almost symmetrical for values of p close to 0.5, but highly skewed for values of p close to 0 or 1. As n increases, the distribution becomes more and more symmetrical, and there is noticeable skew only if p is very close to 0 or 1.

    The probability functions for various values of n and p are shown below.

[Figure: Binomial probability functions for n = 10, p = 0.5; n = 10, p = 0.9; and n = 100, p = 0.9.]

    Sum of independent Binomial random variables:

If X and Y are independent, and X ~ Binomial(n, p), Y ~ Binomial(m, p), then

X + Y ~ Bin(n + m, p).

This is because X counts the number of successes out of n trials, and Y counts the number of successes out of m trials: so overall, X + Y counts the total number of successes out of n + m trials.

Note: X and Y must both share the same value of p.


    2.5 The cumulative distribution function, FX(x)

    We have defined the probability function, fX(x), as fX(x) = P(X = x).

    The probability function tells us everything there is to know about X.

The cumulative distribution function, or just distribution function, written as FX(x), is an alternative function that also tells us everything there is to know about X.

    Definition: The (cumulative) distribution function (c.d.f.) is

FX(x) = P(X ≤ x)   for −∞ < x < ∞.

If you are asked to give the distribution of X, you could answer by giving either the distribution function, FX(x), or the probability function, fX(x). Each of these functions encapsulates all possible information about X.

    The distribution function FX(x) as a probability sweeper

The cumulative distribution function, FX(x), sweeps up all the probability up to and including the point x.

[Figure: probability functions for X ~ Bin(10, 0.5) and X ~ Bin(10, 0.9).]


Example: Let X ~ Binomial(2, 1/2).

x                    0     1     2
fX(x) = P(X = x)    1/4   1/2   1/4

Then FX(x) = P(X ≤ x) =  0                         if x < 0,
                         0.25                      if 0 ≤ x < 1,
                         0.25 + 0.5 = 0.75         if 1 ≤ x < 2,
                         0.25 + 0.5 + 0.25 = 1     if x ≥ 2.

[Figure: the probability function f(x) and the step-function distribution function F(x) for X ~ Binomial(2, 1/2).]

FX(x) gives the cumulative probability up to and including point x.

So FX(x) = Σ_{y ≤ x} fX(y).

Note that FX(x) is a step function: it jumps by amount fX(y) at every point y with positive probability.
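Since FX(x) is just the running total of fX, it can be computed by a cumulative sum. A minimal Python sketch for the Binomial(2, 1/2) example above:

    from fractions import Fraction

    f_X = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

    def F_X(x):
        # Cumulative distribution function: sum of f_X(y) over y <= x.
        return sum(p for y, p in f_X.items() if y <= x)

    for x in (-1, 0, 0.5, 1, 2, 3):
        print(x, F_X(x))   # 0, 1/4, 1/4, 3/4, 1, 1 (a step function)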


    Reading off probabilities from the distribution function

As well as using the probability function to find the distribution function, we can also use the distribution function to find probabilities.

fX(x) = P(X = x) = P(X ≤ x) − P(X ≤ x − 1)   (if X takes integer values)
       = FX(x) − FX(x − 1).

This is why the distribution function FX(x) contains as much information as the probability function, fX(x), because we can use either one to find the other.

    In general:

P(a < X ≤ b) = FX(b) − FX(a)   if b > a.

Proof: The event {X ≤ b} splits into the two non-overlapping events {X ≤ a} and {a < X ≤ b} (picture the points a and b on the number line), so

P(X ≤ b) = P(X ≤ a) + P(a < X ≤ b).

So FX(b) = FX(a) + P(a < X ≤ b),

giving FX(b) − FX(a) = P(a < X ≤ b).


Warning: endpoints

Be careful of endpoints, and of the difference between ≤ and <. For example, for an integer-valued X:

P(X > 42)?   This is 1 − P(X ≤ 42) = 1 − FX(42).

P(50 ≤ X ≤ 60)?   This is P(X ≤ 60) − P(X ≤ 49) = FX(60) − FX(49).
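With a concrete distribution, these endpoint rules are easy to verify against direct sums of the probability function. A minimal Python sketch (the choice X ~ Binomial(100, 0.5) is just for illustration):

    from math import comb

    n, p = 100, 0.5
    pmf = lambda x: comb(n, x) * p**x * (1 - p)**(n - x)
    cdf = lambda x: sum(pmf(k) for k in range(0, x + 1))   # F_X(x) for integer x

    # P(X > 42) = 1 - F_X(42)
    print(abs((1 - cdf(42)) - sum(pmf(k) for k in range(43, n + 1))) < 1e-12)    # True

    # P(50 <= X <= 60) = F_X(60) - F_X(49)
    print(abs((cdf(60) - cdf(49)) - sum(pmf(k) for k in range(50, 61))) < 1e-12) # True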

    Properties of the distribution function

1) F(−∞) = P(X ≤ −∞) = 0, and F(+∞) = P(X ≤ +∞) = 1.
(These are true because values of X are strictly between −∞ and +∞.)

2) FX(x) is a non-decreasing function of x: that is, if x1 < x2, then FX(x1) ≤ FX(x2).

3) P(a < X ≤ b) = FX(b) − FX(a) if b > a.

4) F is right-continuous: that is, lim_{h→0⁺} F(x + h) = F(x).


    2.6 Hypothesis testing

You have probably come across the idea of hypothesis tests, p-values, and significance in other courses. Common hypothesis tests include t-tests and chi-squared tests. However, hypothesis tests can be conducted in much simpler circumstances than these. The concept of the hypothesis test is at its easiest to understand with the Binomial distribution in the following example. All other hypothesis tests throughout statistics are based on the same idea.

    Example: Weird Coin?


    I toss a coin 10 times and get 9 heads. How weird is that?

    What is weird?

• Getting 9 heads out of 10 tosses: we'll call this weird.
• Getting 10 heads out of 10 tosses: even more weird!
• Getting 8 heads out of 10 tosses: less weird.
• Getting 1 head out of 10 tosses: same as getting 9 tails out of 10 tosses: just as weird as 9 heads if the coin is fair.
• Getting 0 heads out of 10 tosses: same as getting 10 tails: more weird than 9 heads if the coin is fair.

    Set of weird outcomes

If our coin is fair, the outcomes that are as weird or weirder than 9 heads are:

    9 heads, 10 heads, 1 head, 0 heads.

    So how weird is 9 heads or worse, if the coin is fair?

We can add the probabilities of all the outcomes that are at least as weird as 9 heads out of 10 tosses, assuming that the coin is fair.

Distribution of X, if the coin is fair: X ~ Binomial(n = 10, p = 0.5).


    Probability of observing something at least as weird as 9 heads,

    if the coin is fair:

P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0), where X ~ Binomial(10, 0.5).

[Figure: probabilities P(X = x), plotted against x, for X ~ Binomial(n = 10, p = 0.5).]

For X ~ Binomial(10, 0.5), we have:

P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0)
   = (10 choose 9)(0.5)^9(0.5)^1 + (10 choose 10)(0.5)^10(0.5)^0 + (10 choose 1)(0.5)^1(0.5)^9 + (10 choose 0)(0.5)^0(0.5)^10
   = 0.00977 + 0.00098 + 0.00977 + 0.00098
   = 0.021.
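This two-sided p-value is quick to reproduce numerically. A minimal Python sketch:

    from math import comb

    def pmf(x, n=10, p=0.5):
        # P(X = x) for X ~ Binomial(n, p)
        return comb(n, x) * p**x * (1 - p)**(n - x)

    # Outcomes at least as weird as 9 heads, for a fair coin: 0, 1, 9 or 10 heads.
    p_value = sum(pmf(x) for x in (9, 10, 1, 0))
    print(round(p_value, 3))   # 0.021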

    Is this weird?

Yes, it is quite weird. If we had a fair coin and tossed it 10 times, we would only expect to see something as extreme as 9 heads on about 2.1% of occasions.


    Is the coin fair?

Obviously, we can't say. It might be: after all, on 2.1% of occasions that you toss a fair coin 10 times, you do get something as weird as 9 heads or more.

However, 2.1% is a small probability, so it is still very unusual for a fair coin to produce something as weird as what we've seen. If the coin really was fair, it would be very unusual to get 9 heads or more.

We can deduce that, EITHER we have observed a very unusual event with a fair coin, OR the coin is not fair.

    In fact, this gives us some evidence that the coin is not fair.

The value 2.1% measures the strength of our evidence. The smaller this probability, the more evidence we have.

    Formal hypothesis test

    We now formalize the procedure above. Think of the steps:

• We have a question that we want to answer: Is the coin fair?

• There are two alternatives:
  1. The coin is fair.
  2. The coin is not fair.

• Our observed information is X, the number of heads out of 10 tosses. We write down the distribution of X if the coin is fair: X ~ Binomial(10, 0.5).

• We calculate the probability of observing something AT LEAST AS EXTREME as our observation, X = 9, if the coin is fair: prob = 0.021.

• The probability is small (2.1%). We conclude that this is unlikely with a fair coin, so we have observed some evidence that the coin is NOT fair.

  • 8/8/2019 Stats 210 Course Book

    67/200

    66

    Null hypothesis and alternative hypothesis

    We express the steps above as two competing hypotheses.

    Null hypothesis: the first alternative, that the coin IS fair.

We expect to believe the null hypothesis unless we see convincing evidence that it is wrong.

    Alternative hypothesis: the second alternative, that the coin is NOT fair.

    In hypothesis testing, we often use this same formulation.

The null hypothesis is specific. It specifies an exact distribution for our observation: X ~ Binomial(10, 0.5).

The alternative hypothesis is general. It simply states that the null hypothesis is wrong. It does not say what the right answer is.

We use H0 and H1 to denote the null and alternative hypotheses respectively.

The null hypothesis is H0: the coin is fair.
The alternative hypothesis is H1: the coin is NOT fair.

    More precisely, we write:

Number of heads, X ~ Binomial(10, p),

    and

H0: p = 0.5
H1: p ≠ 0.5.

Think of "null hypothesis" as meaning "the default": the hypothesis we will accept unless we have a good reason not to.


    p-values

In the hypothesis-testing framework above, we always measure evidence AGAINST the null hypothesis.

That is, we believe that our coin is fair unless we see convincing evidence otherwise.

    We measure the strength of evidence against H0 using the p-value.

    In the example above, the p-value was p = 0.021.

    A p-value of 0.021 represents quite strong evidence against the null hypothesis.

It states that, if the null hypothesis is TRUE, we would only have a 2.1% chance of observing something as extreme as 9 heads or tails.

Many of us would see this as strong enough evidence to decide that the null hypothesis is not true.

In general, the p-value is the probability of observing something AT LEAST AS EXTREME AS OUR OBSERVATION, if H0 is TRUE.

This means that SMALL p-values represent STRONG evidence against H0.

Small p-values mean Strong evidence.
Large p-values mean Little evidence.

Note: Be careful not to confuse the term p-value, which is 0.021 in our example, with the Binomial probability p. Our hypothesis test is designed to test whether the Binomial probability is p = 0.5. To test this, we calculate the p-value of 0.021 as a measure of the strength of evidence against the hypothesis that p = 0.5.


    Interpreting the hypothesis test

There are different schools of thought about how a p-value should be interpreted.

Most people agree that the p-value is a useful measure of the strength of evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against H0.

Some people go further and use an accept/reject framework. Under this framework, the null hypothesis H0 should be rejected if the p-value is less than 0.05 (say), and accepted if the p-value is greater than 0.05.

In this course we use the strength of evidence interpretation. The p-value measures how far out our observation lies in the tails of the distribution specified by H0. We do not talk about accepting or rejecting H0. This decision should usually be taken in the context of other scientific information.

However, it is worth bearing in mind that p-values of 0.05 and less start to suggest that the null hypothesis is doubtful.

    Statistical significance

You have probably encountered the idea of statistical significance in other courses.

Statistical significance refers to the p-value.

The result of a hypothesis test is significant at the 5% level if the p-value is less than 0.05.

This means that the chance of seeing what we did see (9 heads), or more, is less than 5% if the null hypothesis is true.

Saying the test is "significant" is a quick way of saying that there is evidence against the null hypothesis, usually at the 5% level.


In the coin example, we can say that our test of H0: p = 0.5 against H1: p ≠ 0.5 is significant at the 5% level, because the p-value is 0.021, which is < 0.05.

    This means:

we have some evidence that p ≠ 0.5.

    It does not mean:

• the difference between p and 0.5 is large, or
• the difference between p and 0.5 is important in practical terms.

Statistically significant means that we have evidence that there IS a difference. It says NOTHING about the SIZE, or the IMPORTANCE, of the difference.

    Beware!

The p-value gives the probability of seeing something as weird as what we did see, if H0 is true.

This means that 5% of the time, we will get a p-value < 0.05 EVEN WHEN H0 IS TRUE!!

Indeed, about once in every thousand tests, we will get a p-value < 0.001, even though H0 is true!

A small p-value does NOT mean that H0 is definitely wrong.

    One-sided and two-sided tests

The test above is a two-sided test. This means that we considered it just as weird to get 9 tails as 9 heads.

If we had a good reason, before tossing the coin, to believe that the binomial probability could only be = 0.5 or > 0.5, i.e. that it would be impossible to have p < 0.5, then we could conduct a one-sided test: H0: p = 0.5 versus H1: p > 0.5.

    This would have the effect of halving the resultant p-value.


    2.7 Example: Presidents and deep-sea divers

Men in the class: would you like to have daughters? Then become a deep-sea diver, a fighter pilot, or a heavy smoker.

Would you prefer sons? Easy! Just become a US president.

Numbers suggest that men in different professions tend to have more sons than daughters, or the reverse. Presidents have sons, fighter pilots have daughters. But is it real, or just chance? We can use hypothesis tests to decide.

    The facts

The 43 US presidents from George Washington to George W. Bush have had a total of 151 children, comprising 88 sons and only 63 daughters: a sex ratio of 1.4 sons for every daughter.

Two studies of deep-sea divers revealed that the men had a total of 190 children, comprising 65 sons and 125 daughters: a sex ratio of 1.9 daughters for every son.

    Could this happen by chance?

Is it possible that the men in each group really had a 50-50 chance of producing sons and daughters?

    This is the same as the question in Section 2.6.

For the presidents: If I tossed a coin 151 times and got only 63 heads, could I continue to believe that the coin was fair?

For the divers: If I tossed a coin 190 times and got only 65 heads, could I continue to believe that the coin was fair?


    Hypothesis test for the presidents

    We set up the competing hypotheses as follows.

Let X be the number of daughters out of 151 presidential children.

Then X ~ Binomial(151, p), where p is the probability that each child is a daughter.

    Null hypothesis: H0 : p = 0.5.

Alternative hypothesis: H1: p ≠ 0.5.

p-value: We need the probability of getting a result AT LEAST AS EXTREME as X = 63 daughters, if H0 is true and p really is 0.5.

    Which results are at least as extreme as X = 63?

    X = 0, 1, 2, . . . , 63, for even fewer daughters.

X = (151 − 63), . . . , 151, for too many daughters, because we would be just as surprised if we saw 63 sons, i.e. (151 − 63) = 88 daughters.

[Figure: probabilities P(X = x) for X ~ Binomial(n = 151, p = 0.5).]


    Calculating the p-value

    The p-value for the president problem is given by

P(X ≤ 63) + P(X ≥ 88),   where X ~ Binomial(151, 0.5).

In principle, we could calculate this as P(X = 0) + P(X = 1) + . . . + P(X =