Stats 210 Course Book


Contents

1. Probability
1.1 Introduction
1.2 Sample spaces
1.3 Events
1.4 Partitioning sets and events
1.5 Probability: a way of measuring sets
1.6 Probabilities of combined events
1.7 The Partition Theorem
1.8 Examples of basic probability calculations
1.9 Formal probability proofs: non-examinable
1.10 Conditional Probability
1.11 Examples of conditional probability and partitions
1.12 Bayes' Theorem: inverting conditional probabilities
1.13 Chains of events and probability trees: non-examinable
1.14 Simpson's paradox: non-examinable
1.15 Equally likely outcomes and combinatorics: non-examinable
1.16 Statistical Independence
1.17 Random Variables
1.18 Key Probability Results for Chapter 1

2. Discrete Probability Distributions
2.1 Introduction
2.2 The probability function, fX(x)
2.3 Bernoulli trials
2.4 Example of the probability function: the Binomial Distribution
2.5 The cumulative distribution function, FX(x)
2.6 Hypothesis testing
2.7 Example: Presidents and deep-sea divers
2.8 Example: Politicians and the alphabet
2.9 Likelihood and estimation
2.10 Random numbers and histograms
2.11 Expectation
2.12 Variable transformations
2.13 Variance
2.14 Mean and variance of the Binomial(n, p) distribution

3. Modelling with Discrete Probability Distributions
3.1 Binomial distribution
3.2 Geometric distribution
3.3 Negative Binomial distribution
3.4 Hypergeometric distribution: sampling without replacement
3.5 Poisson distribution
3.6 Subjective modelling

4. Continuous Random Variables
4.1 Introduction
4.2 The probability density function
4.3 The Exponential distribution
4.4 Likelihood and estimation for continuous random variables
4.5 Hypothesis tests
4.6 Expectation and variance
4.7 Exponential distribution mean and variance
4.8 The Uniform distribution
4.9 The Change of Variable Technique: finding the distribution of g(X)
4.10 Change of variable for non-monotone functions: non-examinable
4.11 The Gamma distribution
4.12 The Beta Distribution: non-examinable

5. The Normal Distribution and the Central Limit Theorem
5.1 The Normal Distribution
5.2 The Central Limit Theorem (CLT)

6. Wrapping Up
6.1 Estimators: the good, the bad, and the estimator PDF
6.2 Hypothesis tests: in search of a distribution


    Chapter 1: Probability

    1.1 Introduction

Definition: A probability is a number between 0 and 1 representing how likely it is that an event will occur.

Probabilities can be:

1. Frequentist (based on frequencies), e.g.

   (number of times the event occurs) / (number of opportunities for the event to occur);

2. Subjective: probability represents a person's degree of belief that an event will occur, e.g. "I think there is an 80% chance it will rain today," written as P(rain) = 0.80.

Regardless of how we obtain probabilities, we always combine and manipulate them according to the same rules.

    1.2 Sample spaces

Definition: A random experiment is an experiment whose outcome is not known until it is observed.

Definition: A sample space, Ω, is a set of outcomes of a random experiment.

Every possible outcome must be listed once and only once.

Definition: A sample point is an element of the sample space.

For example, if the sample space is Ω = {s1, s2, s3}, then each si is a sample point.


    Examples:

Experiment: Toss a coin twice and observe the result.
Sample space: Ω = {HH, HT, TH, TT}.
An example of a sample point is: HT.

Experiment: Toss a coin twice and count the number of heads.
Sample space: Ω = {0, 1, 2}.

Experiment: Toss a coin twice and observe whether the two tosses are the same (e.g. HH or TT).
Sample space: Ω = {same, different}.

Discrete and continuous sample spaces

Definition: A sample space is finite if it has a finite number of elements.

Definition: A sample space is discrete if there are gaps between the different elements, or if the elements can be listed, even if it is an infinite list (e.g. 1, 2, 3, . . .).

In mathematical language, a sample space is discrete if it is countable.

Definition: A sample space is continuous if there are no gaps between the elements, so the elements cannot be listed (e.g. the interval [0, 1]).

Examples:

Ω = {0, 1, 2, 3} (discrete and finite)
Ω = {0, 1, 2, 3, . . .} (discrete, infinite)
Ω = {4.5, 4.6, 4.7} (discrete, finite)
Ω = {HH, HT, TH, TT} (discrete, finite)
Ω = [0, 1] = {all numbers between 0 and 1 inclusive} (continuous, infinite)
Ω = {[0, 90), [90, 360)} (discrete, finite)


    1.3 Events

Kolmogorov (1903-1987), one of the founders of probability theory.

Suppose you are setting out to create a science of randomness. Somehow you need to harness the idea of randomness, which is all about the unknown, and express it in terms of mathematics.

How would you do it?

So far, we have introduced the sample space, Ω, which lists all possible outcomes of a random experiment, and might seem unexciting.

However, Ω is a set. It lays the ground for a whole mathematical formulation of randomness, in terms of set theory.

The next concept that you would need to formulate is that of something that happens at random, or an event.

How would you express the idea of an event in terms of set theory?

Definition: An event is a subset of the sample space.

That is, any collection of outcomes forms an event.

Example: Toss a coin twice. Sample space: Ω = {HH, HT, TH, TT}.

Let event A be the event that there is exactly one head.

We write: A = "exactly one head".

Then A = {HT, TH}.

A is a subset of Ω, as in the definition. We write A ⊆ Ω.

Definition: Event A occurs if we observe an outcome that is a member of the set A.

Note: Ω is a subset of itself, so Ω is an event. The empty set, ∅ = {}, is also a subset of Ω. This is called the null event, or the event with no outcomes.


    Example:

Experiment: throw 2 dice.
Sample space: Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), (2, 2), . . . , (2, 6), . . . , (6, 6)}.

Event A = "sum of two faces is 5" = {(1, 4), (2, 3), (3, 2), (4, 1)}.

    Combining Events

Formulating random events in terms of sets gives us the power of set theory to describe all possible ways of combining or manipulating events. For example, we need to describe things like coincidences (events happening together), alternatives, opposites, and so on.

We do this in the language of set theory.

Example: Suppose our random experiment is to pick a person in the class and see what form(s) of transport they used to get to campus today.

[Venn diagram: the sample space "people in class" containing the events Bus, Bike, Walk, Car, and Train.]

This sort of diagram representing events in a sample space is called a Venn diagram.


1. Alternatives: the union ('or') operator

We wish to describe an event that is composed of several different alternatives.

For example, the event that you used a motor vehicle to get to campus is the event that your journey involved a car, or a bus, or both.

To represent the set of journeys involving both alternatives, we shade all outcomes in Bus and all outcomes in Car.

[Venn diagram: the events Bus and Car both shaded.]

Overall, we have shaded all outcomes in the UNION of Bus and Car.

We write the event that you used a motor vehicle as the event Bus ∪ Car, read as "Bus union Car".

The union operator, ∪, denotes Bus OR Car OR both.

Note: Be careful not to confuse "or" and "and". To shade the union of Bus and Car, we had to shade everything in Bus AND everything in Car.

To remember whether union refers to "or" or "and", you have to consider what an outcome needs to satisfy for the shaded event to occur.

The answer is Bus, OR Car, OR both. NOT Bus AND Car.

Definition: Let A and B be events on the same sample space Ω: so A ⊆ Ω and B ⊆ Ω.

The union of events A and B is written A ∪ B, and is given by

A ∪ B = {s : s ∈ A or s ∈ B or both}.


2. Concurrences and coincidences: the intersection ('and') operator

The intersection is an event that occurs when two or more events ALL occur together.

For example, consider the event that your journey today involved BOTH a car AND a train. To represent this event, we shade all outcomes in the overlap of Car and Train.

[Venn diagram: the overlap Car ∩ Train shaded.]

We write the event that you used both car and train as Car ∩ Train, read as "Car intersect Train".

The intersection operator, ∩, denotes both Car AND Train together.

Definition: The intersection of events A and B is written A ∩ B and is given by

A ∩ B = {s : s ∈ A AND s ∈ B}.

3. Opposites: the complement or 'not' operator

The complement of an event is the opposite of the event: whatever the event was, it didn't happen.

For example, consider the event that your journey today did NOT involve walking. To represent this event, we shade all outcomes in Ω except those in the event Walk.


[Venn diagram: everything in "people in class" except the event Walk shaded.]

We write the event "not Walk" as the complement of Walk.

Definition: The complement of event A is written Ā and is given by

Ā = {s : s ∉ A}.

Examples:

Experiment: Pick a person in this class at random.
Sample space: Ω = {all people in class}.
Let event A = "person is male" and event B = "person travelled by bike today".

Suppose I pick a male who did not travel by bike. Say whether the following events have occurred:

1) A: Yes.  2) B: No.  3) Ā: No.  4) B̄: Yes.

5) Ā ∪ B = {female or bike rider or both}: No.

6) A ∩ B̄ = {male and non-biker}: Yes.

7) A ∩ B = {male and bike rider}: No.

8) Ā ∪ B̄ = everything outside A ∩ B. A ∩ B did not occur, so Ā ∪ B̄ did occur: Yes.

Question: What is the event Ω̄? Ω̄ = ∅.

Challenge: can you express A ∪ B using only an ∩ sign? Answer: A ∪ B is the complement of Ā ∩ B̄.


Limitations of Venn diagrams

Venn diagrams are generally useful for up to 3 events, although they are not used to provide formal proofs. For more than 3 events, the diagram might not be able to represent all possible overlaps of events. (This was probably the case for our transport Venn diagram.)

[Venn diagrams showing three overlapping events A, B, and C.]

    Properties of union, intersection, and complement

    The following properties hold.

(i) ∅ ∪ Ω = Ω and ∅ ∩ Ω = ∅.

(ii) For any event A, A ∪ Ā = Ω, and A ∩ Ā = ∅.

(iii) For any events A and B, A ∪ B = B ∪ A, and A ∩ B = B ∩ A. (Commutative.)

(iv) (a) The complement of A ∪ B is Ā ∩ B̄. (b) The complement of A ∩ B is Ā ∪ B̄. (De Morgan's laws.)


Distributive laws

We are familiar with the fact that multiplication is distributive over addition. This means that, if a, b, and c are any numbers, then

a × (b + c) = a × b + a × c.

However, addition is not distributive over multiplication:

a + (b × c) ≠ (a + b) × (a + c).

For set union and set intersection, union is distributive over intersection, AND intersection is distributive over union.

Thus, for any sets A, B, and C:

A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C),

and A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

[Venn diagrams illustrating the distributive laws for three events A, B, and C.]

More generally, for several events A and B1, B2, . . . , Bn,

A ∪ (B1 ∩ B2 ∩ . . . ∩ Bn) = (A ∪ B1) ∩ (A ∪ B2) ∩ . . . ∩ (A ∪ Bn),

i.e. A ∪ (⋂_{i=1}^{n} Bi) = ⋂_{i=1}^{n} (A ∪ Bi),

and

A ∩ (B1 ∪ B2 ∪ . . . ∪ Bn) = (A ∩ B1) ∪ (A ∩ B2) ∪ . . . ∪ (A ∩ Bn),

i.e. A ∩ (⋃_{i=1}^{n} Bi) = ⋃_{i=1}^{n} (A ∩ Bi).


    1.4 Partitioning sets and events

The idea of a partition is fundamental in probability manipulations. Later in this chapter we will encounter the important Partition Theorem. For now, we give some background definitions.

Definition: Two events A and B are mutually exclusive, or disjoint, if A ∩ B = ∅.

This means events A and B cannot happen together. If A happens, it excludes B from happening, and vice-versa.

[Venn diagram: two disjoint events A and B.]

Note: Does this mean that A and B are independent?

No: quite the opposite. A EXCLUDES B from happening, so B depends strongly on whether or not A happens.

Definition: Any number of events A1, A2, . . . , Ak are mutually exclusive if every pair of the events is mutually exclusive: i.e. Ai ∩ Aj = ∅ for all i, j with i ≠ j.

[Venn diagram: three mutually exclusive events A1, A2, and A3.]

Definition: A partition of the sample space is a collection of mutually exclusive events whose union is Ω.

That is, sets B1, B2, . . . , Bk form a partition of Ω if

Bi ∩ Bj = ∅ for all i, j with i ≠ j,

and ⋃_{i=1}^{k} Bi = B1 ∪ B2 ∪ . . . ∪ Bk = Ω.


    Examples:

[Venn diagrams: B1, B2, B3, B4 form a partition of Ω; B1, . . . , B5 partition Ω.]

Important: B and B̄ partition Ω for any event B.

Partitioning an event A

Any set or event A can be partitioned: it doesn't have to be Ω.

If B1, . . . , Bk form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bk) form a partition of A.

[Venn diagram: event A cut into pieces A ∩ B1, . . . , A ∩ B4 by a partition B1, . . . , B4 of Ω.]

We will see that this is very useful for finding the probability of event A.

This is because it is often easier to find the probability of small chunks of A (the partitioned sections) than to find the whole probability of A at once. The partition idea shows us how to add the probabilities of these chunks together: see later.


    1.5 Probability: a way of measuring sets

Remember that you are given the job of building the science of randomness. This means somehow measuring chance.

If I sent you away to measure heights, the first thing you would ask is what you are supposed to be measuring the heights of. People? Trees? Mountains?

We have the same question when setting out to measure chance. Chance of what?

The answer is sets.

It was clever to formulate our notions of events and sample spaces in terms of sets: it gives us something to measure. Probability, the name that we give to our chance-measure, is a way of measuring sets.

You probably already have a good idea for a suitable way to measure the size of a set or event. Why not just count the number of elements in it?

In fact, this is often what we do to measure probability (although counting the number of elements can be far from easy!). But there are circumstances where this is not appropriate.

What happens, for example, if one set is far more likely than another, but they have the same number of elements? Should they have the same probability?

First set: {Lions win}. Second set: {All Blacks win}. Both sets have just one element, but we definitely need to give them different probabilities!

More problems arise when the sets are infinite or continuous.

Should the intervals [3, 4] and [13, 14] have the same probability, just because they are the same length? Yes they should, if (say) our random experiment is to pick a random number on [0, 20]; but no they shouldn't (hopefully!) if our experiment was the time in years taken by a student to finish their degree.


    Most of this course is about probability distributions.

A probability distribution is a rule according to which probability is apportioned, or distributed, among the different sets in the sample space.

At its simplest, a probability distribution just lists every element in the sample space and allots it a probability between 0 and 1, such that the total sum of probabilities is 1.

In the rugby example, we could use the following probability distribution:

P(Lions win) = 0.01, P(All Blacks win) = 0.99.

    In general, we have the following definition for discrete sample spaces.

    Discrete probability distributions

Definition: Let Ω = {s1, s2, . . .} be a discrete sample space.

A discrete probability distribution on Ω is a set of real numbers {p1, p2, . . .} associated with the sample points {s1, s2, . . .} such that:

1. 0 ≤ pi ≤ 1 for all i;

2. Σi pi = 1.

    pi is called the probability of the event that the outcome is si.

    We write: pi = P(si).

The rule for measuring the probability of any set, or event, A ⊆ Ω, is to sum the probabilities of the elements of A:

P(A) = Σ_{i ∈ A} pi.

E.g. if A = {s3, s5, s14}, then P(A) = p3 + p5 + p14.
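For readers who like to compute, here is a minimal Python sketch of this rule; the sample points and probabilities below are invented purely for illustration and are not from the course examples.

```python
# A discrete probability distribution: each sample point gets a probability;
# the probabilities lie in [0, 1] and sum to 1.
dist = {"s1": 0.2, "s2": 0.5, "s3": 0.3}

assert all(0 <= p <= 1 for p in dist.values())
assert abs(sum(dist.values()) - 1.0) < 1e-12

def prob(event, dist):
    """P(A) = sum of the probabilities of the sample points in A."""
    return sum(dist[s] for s in event)

A = {"s1", "s3"}        # an event is just a set of sample points
print(prob(A, dist))    # 0.5
```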


    Continuous probability distributions

On a continuous sample space Ω, e.g. Ω = [0, 1], we cannot list all the elements and give them an individual probability. We will need more sophisticated methods detailed later in the course.

However, the same principle applies. A continuous probability distribution is a rule under which we can calculate a probability between 0 and 1 for any set, or event, A ⊆ Ω.

    Probability Axioms

For any sample space, discrete or continuous, all of probability theory is based on the following three definitions, or axioms.

Axiom 1: P(Ω) = 1.

Axiom 2: 0 ≤ P(A) ≤ 1 for all events A.

Axiom 3: If A1, A2, . . . , An are mutually exclusive events (no overlap), then

P(A1 ∪ A2 ∪ . . . ∪ An) = P(A1) + P(A2) + . . . + P(An).

If our rule for measuring sets satisfies the three axioms, it is a valid probability distribution.

It should be clear that the definitions given above for a discrete sample space will satisfy the axioms. The challenge of defining a probability distribution on a continuous sample space is left till later.

Note: The axioms can never be proved: they are definitions.

Note: P(∅) = 0.

    Note: Remember that an EVENT is a SET: an event is a subset of the sample space.


    1.6 Probabilities of combined events

In Section 1.3 we discussed unions, intersections, and complements of events. We now look at the probabilities of these combinations. Everything below applies to events (sets) in either a discrete or a continuous sample space.

1. Probability of a union

Let A and B be events on a sample space Ω. There are two cases for the probability of the union A ∪ B:

1. A and B are mutually exclusive (no overlap): i.e. A ∩ B = ∅.

2. A and B are not mutually exclusive: A ∩ B ≠ ∅.

For Case 1, we get the probability of A ∪ B straight from Axiom 3:

If A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B).

For Case 2, we have the following formula:

For ANY events A, B: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

Note: The formula for Case 2 applies also to Case 1: just substitute P(A ∩ B) = P(∅) = 0.

For three or more events: e.g. for any A, B, and C,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(A ∩ C) - P(B ∩ C) + P(A ∩ B ∩ C).
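A quick way to convince yourself of the three-event formula is to check it on a small equally-likely sample space. The Python sketch below is illustrative only: the fair-die sample space and the events A, B, C are arbitrary choices, not from the course book. (Python's set operators | and & conveniently mirror ∪ and ∩.)

```python
from fractions import Fraction

# Equally likely outcomes: a single roll of a fair die (illustrative choice).
omega = {1, 2, 3, 4, 5, 6}
P = lambda E: Fraction(len(E), len(omega))   # P(E) = |E| / |Ω|

A, B, C = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}

lhs = P(A | B | C)                           # direct: P(A ∪ B ∪ C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))                       # inclusion-exclusion sum
print(lhs, rhs)                              # both 5/6
```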


    Explanation

For any events A and B, P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

    The formal proof of this formula is in Section 1.9 (non-examinable).

    To understand the formula, think of the Venn diagrams:

[Venn diagrams: A and B overlapping; A ∪ B split into A and B \ (A ∩ B).]

When we add P(A) + P(B), we add the intersection twice. So we have to subtract the intersection once to get P(A ∪ B):

P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

Alternatively, think of A ∪ B as two disjoint sets: all of A, and the bits of B without the intersection. So P(A ∪ B) = P(A) + [P(B) - P(A ∩ B)].

    2. Probability of an intersection

[Venn diagram: overlapping events A and B.]

There is no easy formula for P(A ∩ B). We might be able to use statistical independence (Section 1.16).

If A and B are not statistically independent, we often use conditional probability (Section 1.10).

3. Probability of a complement

[Venn diagram: event A and its complement Ā.]

P(Ā) = 1 - P(A).

This is obvious, but a formal proof is given in Section 1.9.


    1.7 The Partition Theorem

The Partition Theorem is one of the most useful tools for probability calculations. It is based on the fact that probabilities are often easier to calculate if we break down a set into smaller parts.

Recall that a partition of Ω is a collection of non-overlapping sets B1, . . . , Bm which together cover everything in Ω.

[Venn diagram: Ω divided into B1, B2, B3, B4.]

Also, if B1, . . . , Bm form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bm) form a partition of the set or event A.

[Venn diagram: A overlaid on the partition, cut into A ∩ B1, A ∩ B2, A ∩ B3, A ∩ B4.]

The probability of event A is therefore the sum of its parts:

P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + P(A ∩ B4).

The Partition Theorem is a mathematical way of saying "the whole is the sum of its parts".

Theorem 1.7: The Partition Theorem. (Proof in Section 1.9.)

Let B1, . . . , Bm form a partition of Ω. Then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi).

Note: Recall the formal definition of a partition. Sets B1, B2, . . . , Bm form a partition of Ω if Bi ∩ Bj = ∅ for all i ≠ j, and ⋃_{i=1}^{m} Bi = Ω.


    1.8 Examples of basic probability calculations

300 Australians were asked about their car preferences in 1998. Of the respondents, 33% had children. The respondents were asked what sort of car they would like if they could choose any car at all. 13% of respondents had children and chose a large car. 12% of respondents did not have children and chose a large car.

Find the probability that a randomly chosen respondent:
(a) would choose a large car;
(b) either has children or would choose a large car (or both).

First formulate events:

Let C = "has children", C̄ = "no children", L = "chooses large car".

Next write down all the information given:

P(C) = 0.33, P(C ∩ L) = 0.13, P(C̄ ∩ L) = 0.12.

(a) Asked for P(L).

P(L) = P(L ∩ C) + P(L ∩ C̄)   (Partition Theorem)
     = P(C ∩ L) + P(C̄ ∩ L)
     = 0.13 + 0.12
     = 0.25.

So P(chooses large car) = 0.25.

(b) Asked for P(L ∪ C).

P(L ∪ C) = P(L) + P(C) - P(L ∩ C)   (Section 1.6)
         = 0.25 + 0.33 - 0.13
         = 0.45.


Respondents were also asked their opinions on car reliability and fuel consumption. 84% of respondents considered reliability to be of high importance, while 40% considered fuel consumption to be of high importance.

Formulate events: R = "considers reliability of high importance",
F = "considers fuel consumption of high importance".

(c) What is P(R̄)?

(d) What is P(R ∩ F)?

Information given: P(R) = 0.84, P(F) = 0.40.

(c) P(R̄) = 1 - P(R) = 1 - 0.84 = 0.16.

(d) We cannot calculate P(R ∩ F) from the information given.

(e) Given the further information that 12% of respondents considered neither reliability nor fuel consumption to be of high importance, find P(R ∪ F) and P(R ∩ F).

Information given: P(R̄ ∩ F̄) = 0.12.

Thus P(R ∪ F) = 1 - P(R̄ ∩ F̄) = 1 - 0.12 = 0.88.

This is the probability that a respondent considers either reliability or fuel consumption, or both, of high importance.

P(R ∩ F) = P(R) + P(F) - P(R ∪ F)   (Section 1.6)
         = 0.84 + 0.40 - 0.88
         = 0.36.

This is the probability that a respondent considers BOTH reliability AND fuel consumption of high importance.


(f) Find the probability that a respondent considered reliability, but not fuel consumption, of high importance.

P(R ∩ F̄) = P(R) - P(R ∩ F)   (Partition Theorem)
          = 0.84 - 0.36
          = 0.48.
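The whole survey calculation above can be reproduced in a few lines. The following Python sketch simply re-uses the probabilities quoted in the text; the variable names are our own.

```python
# Probabilities given in the survey example (parts (a)-(f)).
P_C, P_C_and_L, P_notC_and_L = 0.33, 0.13, 0.12

P_L = P_C_and_L + P_notC_and_L        # Partition Theorem: 0.25
P_L_or_C = P_L + P_C - P_C_and_L      # inclusion-exclusion: 0.45

P_R, P_F, P_neither = 0.84, 0.40, 0.12
P_notR = 1 - P_R                      # complement: 0.16
P_R_or_F = 1 - P_neither              # complement of "neither": 0.88
P_R_and_F = P_R + P_F - P_R_or_F      # rearranged union formula: 0.36
P_R_not_F = P_R - P_R_and_F           # partition of R by F and F̄: 0.48

print(round(P_L, 2), round(P_L_or_C, 2), round(P_notR, 2),
      round(P_R_or_F, 2), round(P_R_and_F, 2), round(P_R_not_F, 2))
```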

    1.9 Formal probability proofs: non-examinable

If you are a mathematician, you will be interested to see how properties of probability are proved formally. Only the Axioms, together with standard set-theoretic results, may be used.

Theorem: The probability measure P has the following properties.

(i) P(∅) = 0.

(ii) P(Ā) = 1 - P(A) for any event A.

(iii) (Partition Theorem.) If B1, B2, . . . , Bm form a partition of Ω, then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi).

(iv) P(A ∪ B) = P(A) + P(B) - P(A ∩ B) for any events A, B.

Proof:

(i) For any A, we have A = A ∪ ∅, and A ∩ ∅ = ∅ (mutually exclusive).
So P(A) = P(A ∪ ∅) = P(A) + P(∅) (Axiom 3), giving P(∅) = 0.


(ii) Ω = A ∪ Ā, and A ∩ Ā = ∅ (mutually exclusive).

So 1 = P(Ω) (Axiom 1) = P(A ∪ Ā) = P(A) + P(Ā). (Axiom 3)

(iii) Suppose B1, . . . , Bm are a partition of Ω: then Bi ∩ Bj = ∅ if i ≠ j, and ⋃_{i=1}^{m} Bi = Ω.

Thus, (A ∩ Bi) ∩ (A ∩ Bj) = A ∩ (Bi ∩ Bj) = A ∩ ∅ = ∅ for i ≠ j, i.e. (A ∩ B1), . . . , (A ∩ Bm) are mutually exclusive also.

So,

Σ_{i=1}^{m} P(A ∩ Bi) = P(⋃_{i=1}^{m} (A ∩ Bi))   (Axiom 3)
                      = P(A ∩ ⋃_{i=1}^{m} Bi)      (Distributive laws)
                      = P(A ∩ Ω)
                      = P(A).

(iv)

A ∪ B = (A ∩ Ω) ∪ (B ∩ Ω)                            (Set theory)
      = (A ∩ (B ∪ B̄)) ∪ (B ∩ (A ∪ Ā))                (Set theory)
      = (A ∩ B) ∪ (A ∩ B̄) ∪ (B ∩ A) ∪ (B ∩ Ā)        (Distributive laws)
      = (A ∩ B) ∪ (A ∩ B̄) ∪ (Ā ∩ B).

These 3 events are mutually exclusive:
e.g. (A ∩ B) ∩ (A ∩ B̄) = A ∩ (B ∩ B̄) = A ∩ ∅ = ∅, etc.

So,

P(A ∪ B) = P(A ∩ B) + P(A ∩ B̄) + P(Ā ∩ B)                        (Axiom 3)
         = P(A ∩ B) + [P(A) - P(A ∩ B)] + [P(B) - P(A ∩ B)]      (from (iii), using B, B̄ and A, Ā)
         = P(A) + P(B) - P(A ∩ B).


    1.10 Conditional Probability

Conditioning is another of the fundamental tools of probability: probably the most fundamental tool. It is especially helpful for calculating the probabilities of intersections, such as P(A ∩ B), which themselves are critical for the useful Partition Theorem.

Additionally, the whole field of stochastic processes (Stats 320 and 325) is based on the idea of conditional probability. What happens next in a process depends, or is conditional, on what has happened beforehand.

Dependent events

Suppose A and B are two events on the same sample space. There will often be dependence between A and B. This means that if we know that B has occurred, it changes our knowledge of the chance that A will occur.

Example: Toss a die once.

Let event A = "get a 6". Let event B = "get an even number".

If the die is fair, then P(A) = 1/6 and P(B) = 1/2.

However, if we know that B has occurred, then there is an increased chance that A has occurred:

P(A occurs given that B has occurred) = 1/3 (the result 6, out of the results 2, 4, or 6).

We write

P(A given B) = P(A | B) = 1/3.

Question: what would be P(B | A)?

P(B | A) = P(B occurs, given that A has occurred)
         = P(get an even number, given that we know we got a 6)
         = 1.


    Conditioning as reducing the sample space

The car survey in Section 1.8 also asked respondents which they valued more highly in a car: ease of parking, or style/prestige. Here are the responses:

                                         Male   Female   Total
Prestige more important than parking       79       51     130
Prestige less important than parking       71       99     170
Total                                      150      150     300

Suppose we pick a respondent at random from all those in the table.

Let event A = "respondent thinks that prestige is more important".

P(A) = (# A's) / (total # respondents) = 130/300 = 0.43.

However, this probability differs between males and females. Suppose we reduce our sample space from

Ω = {all people in table}

to

B = {all males in table}.

P(respondent thinks prestige is more important, given that respondent is male)
  = (# males who favour prestige) / (total # males)
  = (# male A's) / (# males)
  = 79/150
  = 0.53.

    We write: P(A | B) = 0.53.
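The table calculation translates directly into code. In the Python sketch below the dictionary layout is our own choice; the counts are those in the table, and the conditional probability is just the intersection count divided by the count of B's.

```python
from fractions import Fraction

# Counts from the table: (prestige more important, prestige less important).
counts = {"male": (79, 71), "female": (51, 99)}

total = sum(sum(v) for v in counts.values())       # 300 respondents
n_A = sum(v[0] for v in counts.values())           # 130 favour prestige
n_B = sum(counts["male"])                          # 150 males
n_A_and_B = counts["male"][0]                      # 79 males who favour prestige

P_A = Fraction(n_A, total)                         # 130/300 ≈ 0.43
P_A_given_B = Fraction(n_A_and_B, n_B)             # 79/150 ≈ 0.53
# Equivalently, P(A | B) = P(A ∩ B) / P(B):
assert P_A_given_B == Fraction(n_A_and_B, total) / Fraction(n_B, total)
print(float(P_A), float(P_A_given_B))
```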


    We could follow the same working for any pair of events, A and B:

P(A | B) = (# B's who are A) / (total # B's)
         = (# in table who are BOTH B and A) / (# B's)
         = [(# in B AND A) / (# in Ω)] / [(# in B) / (# in Ω)]
         = P(A ∩ B) / P(B).

This is our definition of conditional probability:

Definition: Let A and B be two events. The conditional probability that event A occurs, given that event B has occurred, is written P(A | B), and is given by

P(A | B) = P(A ∩ B) / P(B).

Read P(A | B) as "probability of A, given B".

Note: P(A | B) gives P(A and B, from within the set of B's only).
P(A ∩ B) gives P(A and B, from the whole sample space Ω).

Note: Follow the reasoning above carefully. It is important to understand why the conditional probability is the probability of the intersection within the new sample space.

Conditioning on event B means changing the sample space to B.

Think of P(A | B) as the chance of getting an A, from the set of B's only.


The symbol P belongs to the sample space Ω

Recall the first of our probability axioms: P(Ω) = 1.

This indicates that the symbol P is defined with respect to Ω. That is, P BELONGS to the sample space Ω.

If we change the sample space, we need to change the symbol P. This is what we do in conditional probability: to change the sample space from Ω to B, say, we change from the symbol P to the symbol P( · | B).

The symbol P( · | B) should behave exactly like the symbol P. For example:

P(C ∪ D) = P(C) + P(D) - P(C ∩ D),

so

P(C ∪ D | B) = P(C | B) + P(D | B) - P(C ∩ D | B).

Trick for checking conditional probability calculations:

A useful trick for checking a conditional probability expression is to replace the conditioned set by Ω, and see whether the expression is still true.

For example, is P(A | B) + P(Ā | B) = 1?

Answer: Replace B by Ω: this gives

P(A | Ω) + P(Ā | Ω) = P(A) + P(Ā) = 1.

So, yes, P(A | B) + P(Ā | B) = 1 for any other sample space B.

Is P(A | B) + P(A | B̄) = 1?

Try to replace the conditioning set by Ω: we can't! There are two conditioning sets: B and B̄.

The expression is NOT true, and in fact it doesn't make sense to try to add together probabilities from two different sample spaces.


    The Multiplication Rule

    For any events A and B,

P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).

Proof:

Immediate from the definitions:

P(A | B) = P(A ∩ B) / P(B), so P(A ∩ B) = P(A | B)P(B),

and

P(B | A) = P(B ∩ A) / P(A), so P(B ∩ A) = P(A ∩ B) = P(B | A)P(A).

New statement of the Partition Theorem

The Multiplication Rule gives us a new statement of the Partition Theorem: if B1, . . . , Bm partition Ω, then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi) = Σ_{i=1}^{m} P(A | Bi)P(Bi).

Both formulations of the Partition Theorem are very widely used, but especially the conditional formulation Σ_{i=1}^{m} P(A | Bi)P(Bi).

Warning: Be careful to use this new version of the Partition Theorem correctly. It is

P(A) = P(A | B1)P(B1) + . . . + P(A | Bm)P(Bm),

NOT P(A) = P(A | B1) + . . . + P(A | Bm).


    Conditional probability and Peter Pan

When Peter Pan was hungry but had nothing to eat, he would pretend to eat. (An excellent strategy, I have always found.)

Conditional probability is the Peter Pan of Stats 210. When you don't know something that you need to know, pretend you know it.

Conditioning on an event is like pretending that you know that the event has happened.

For example, if you know the probability of getting to work on time in different weather conditions, but you don't know what the weather will be like today, pretend you do and add up the different possibilities:

P(work on time) = P(work on time | fine)P(fine) + P(work on time | wet)P(wet).

    1.11 Examples of conditional probability and partitions

Tom gets the bus to campus every day. The bus is on time with probability 0.6, and late with probability 0.4.

The sample space can be written as Ω = {bus journeys}. We can formulate events as follows:

T = "on time"; L = "late".

From the information given, the events have probabilities:

P(T) = 0.6; P(L) = 0.4.

(a) Do the events T and L form a partition of the sample space Ω? Explain why or why not.

Yes: they cover all possible journeys (probabilities sum to 1), and there is no overlap in the events by definition.


The buses are sometimes crowded and sometimes noisy, both of which are problems for Tom as he likes to use the bus journeys to do his Stats assignments. When the bus is on time, it is crowded with probability 0.5. When it is late, it is crowded with probability 0.7. The bus is noisy with probability 0.8 when it is crowded, and with probability 0.4 when it is not crowded.

(b) Formulate events C and N corresponding to the bus being crowded and noisy. Do the events C and N form a partition of the sample space? Explain why or why not.

Let C = "crowded", N = "noisy". C and N do NOT form a partition of Ω. It is possible for the bus to be noisy when it is crowded, so there must be some overlap between C and N.

(c) Write down probability statements corresponding to the information given above. Your answer should involve two statements linking C with T and L, and two statements linking N with C.

P(C | T) = 0.5; P(C | L) = 0.7.
P(N | C) = 0.8; P(N | C̄) = 0.4.

(d) Find the probability that the bus is crowded.

P(C) = P(C | T)P(T) + P(C | L)P(L)   (Partition Theorem)
     = 0.5 × 0.6 + 0.7 × 0.4
     = 0.58.

(e) Find the probability that the bus is noisy.

P(N) = P(N | C)P(C) + P(N | C̄)P(C̄)   (Partition Theorem)
     = 0.8 × 0.58 + 0.4 × (1 - 0.58)
     = 0.632.
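The two Partition Theorem steps in (d) and (e) can be checked with a few lines of Python; the probabilities are exactly those given in the question, and the variable names are our own.

```python
P_T, P_L = 0.6, 0.4                    # "on time" and "late" partition Ω
P_C_given_T, P_C_given_L = 0.5, 0.7    # crowded, conditional on T and on L
P_N_given_C, P_N_given_notC = 0.8, 0.4 # noisy, conditional on C and on C̄

P_C = P_C_given_T * P_T + P_C_given_L * P_L            # 0.58
P_N = P_N_given_C * P_C + P_N_given_notC * (1 - P_C)   # 0.632
print(round(P_C, 3), round(P_N, 3))
```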


1.12 Bayes' Theorem: inverting conditional probabilities

Consider P(B ∩ A) = P(A ∩ B). Apply the multiplication rule to each side:

P(B | A)P(A) = P(A | B)P(B).

Thus P(B | A) = P(A | B)P(B) / P(A).   (*)

This is the simplest form of Bayes' Theorem, named after Thomas Bayes (1702-61), English clergyman and founder of Bayesian Statistics.

Bayes' Theorem allows us to invert the conditioning, i.e. to express P(B | A) in terms of P(A | B).

This is very useful. For example, it might be easy to calculate

P(later event | earlier event),

but we might only observe the later event and wish to deduce the probability that the earlier event occurred,

P(earlier event | later event).

Full statement of Bayes' Theorem:

Theorem 1.12: Let B1, B2, . . . , Bm form a partition of Ω. Then for any event A, and for any j = 1, . . . , m,

P(Bj | A) = P(A | Bj)P(Bj) / Σ_{i=1}^{m} P(A | Bi)P(Bi).   (Bayes' Theorem)

Proof: Immediate from (*) (put B = Bj), and the Partition Rule, which gives P(A) = Σ_{i=1}^{m} P(A | Bi)P(Bi).


Special case of Bayes' Theorem when m = 2: use B and B̄ as the partition of Ω. Then

P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | B̄)P(B̄)].

Example: The case of the Perfidious Gardener.

Mr Smith owns a hysterical rosebush. It will die with probability 1/2 if watered, and with probability 3/4 if not watered. Worse still, Smith employs a perfidious gardener who will fail to water the rosebush with probability 2/3.

Smith returns from holiday to find the rosebush . . . DEAD!!! What is the probability that the gardener did not water it?

Solution:

First step: formulate events.

Let: D = "rosebush dies"; W = "gardener waters rosebush"; W̄ = "gardener fails to water rosebush".

Second step: write down all information given.

P(D | W) = 1/2, P(D | W̄) = 3/4, P(W̄) = 2/3 (so P(W) = 1/3).

Third step: write down what we're looking for: P(W̄ | D).

Fourth step: compare this to what we know. We need to invert the conditioning, so use Bayes' Theorem:

P(W̄ | D) = P(D | W̄)P(W̄) / [P(D | W̄)P(W̄) + P(D | W)P(W)]
          = (3/4 × 2/3) / (3/4 × 2/3 + 1/2 × 1/3)
          = 3/4.

So the gardener failed to water the rosebush with probability 3/4.


    Example: The case of the Defective Ketchup Bottle.

Ketchup bottles are produced in 3 different factories, accounting for 50%, 30%, and 20% of the total output respectively. The percentage of bottles from the 3 factories that are defective is respectively 0.4%, 0.6%, and 1.2%. A statistics lecturer who eats only ketchup finds a defective bottle in her wig. What is the probability that it came from Factory 1?

Solution:

1. Events: let Fi = "bottle comes from Factory i" (i = 1, 2, 3); let D = "bottle is defective".

2. Information given:

P(F1) = 0.5, P(F2) = 0.3, P(F3) = 0.2;
P(D | F1) = 0.004, P(D | F2) = 0.006, P(D | F3) = 0.012.

3. Looking for: P(F1 | D) (so we need to invert the conditioning).

4. Bayes' Theorem:

P(F1 | D) = P(D | F1)P(F1) / [P(D | F1)P(F1) + P(D | F2)P(F2) + P(D | F3)P(F3)]
          = (0.004 × 0.5) / (0.004 × 0.5 + 0.006 × 0.3 + 0.012 × 0.2)
          = 0.002 / 0.0062
          = 0.322.
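Bayes' Theorem is easy to wrap in a small helper function. The Python sketch below (the helper name bayes is our own) reproduces the ketchup calculation using the priors and likelihoods given above.

```python
def bayes(prior, likelihood):
    # Returns P(Bj | A) for each j, given priors P(Bj) and likelihoods P(A | Bj),
    # where B1, ..., Bm are assumed to partition the sample space.
    evidence = sum(p * l for p, l in zip(prior, likelihood))   # P(A), Partition Theorem
    return [p * l / evidence for p, l in zip(prior, likelihood)]

# Defective-ketchup example: P(F1), P(F2), P(F3) and P(D | Fi) as given above.
posterior = bayes([0.5, 0.3, 0.2], [0.004, 0.006, 0.012])
print(round(posterior[0], 3))   # P(F1 | D) ≈ 0.322
```

The same helper applied to the gardener example, bayes([1/3, 2/3], [1/2, 3/4]), returns 0.75 as its second entry, matching P(W̄ | D) = 3/4.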


    1.13 Chains of events and probability trees: non-examinable

The multiplication rule is very helpful for calculating probabilities when events happen in sequence.

Example: Two balls are drawn at random without replacement from a box containing 4 white and 2 red balls. Find the probability that:
(a) they are both white;
(b) the second ball is red.

Solution

Let event Wi = "ith ball is white" and Ri = "ith ball is red".

(a) P(W1 ∩ W2) = P(W2 ∩ W1) = P(W2 | W1)P(W1).

Now P(W1) = 4/6 and P(W2 | W1) = 3/5.

So P(both white) = P(W1 ∩ W2) = 3/5 × 4/6 = 2/5.

(b) Looking for P(2nd ball is red). We can't find this without conditioning on what happened in the first draw.

The event "2nd ball is red" is actually the event {W1R2, R1R2} = (W1 ∩ R2) ∪ (R1 ∩ R2).

So P(2nd ball is red) = P(W1 ∩ R2) + P(R1 ∩ R2)   (mutually exclusive)
                      = P(R2 | W1)P(W1) + P(R2 | R1)P(R1)
                      = 2/5 × 4/6 + 1/5 × 2/6
                      = 1/3.


    Probability trees

    Probability trees are a graphical way of representing the multiplication rule.

[Probability tree. First draw: P(W1) = 4/6, P(R1) = 2/6. Second draw: P(W2 | W1) = 3/5, P(R2 | W1) = 2/5, P(W2 | R1) = 4/5, P(R2 | R1) = 1/5.]

Write conditional probabilities on the branches, and multiply to get the probability of an intersection: e.g. P(W1 ∩ W2) = 4/6 × 3/5, or P(R1 ∩ W2) = 2/6 × 4/5.

    More than two events

To find P(A1 ∩ A2 ∩ A3) we can apply the multiplication rule successively:

P(A1 ∩ A2 ∩ A3) = P(A3 ∩ (A1 ∩ A2))
                = P(A3 | A1 ∩ A2)P(A1 ∩ A2)        (multiplication rule)
                = P(A3 | A1 ∩ A2)P(A2 | A1)P(A1)   (multiplication rule)


Remember as: P(A1 ∩ A2 ∩ A3) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1).

On the probability tree, the branch probabilities P(A1), P(A2 | A1), and P(A3 | A2 ∩ A1) multiply along the path to give P(A1 ∩ A2 ∩ A3).

In general, for n events A1, A2, . . . , An, we have

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1) . . . P(An | An-1 ∩ . . . ∩ A1).

Example: A box contains w white balls and r red balls. Draw 3 balls without replacement. What is the probability of getting the sequence white, red, white?

Answer:

P(W1 ∩ R2 ∩ W3) = P(W1)P(R2 | W1)P(W3 | R2 ∩ W1)
                = (w / (w + r)) × (r / (w + r - 1)) × ((w - 1) / (w + r - 2)).


Two separate studies say . . . "You're Better Off with AntiCough!"

So you're better off with AntiCough . . . or are you???

Have a look at the figures:

Study 1:
            AntiCough   Other Medicine
Given to:          25               75
Cured:             20               58
% Cured:          80%              77%

Study 2:
            AntiCough   Other Medicine
Given to:          75               25
Cured:             50               16
% Cured:          67%              64%

Combine the studies . . . What happens? Never believe what you read: this is Simpson's Paradox.


1.14 Simpson's paradox: non-examinable

It is possible for one treatment (e.g. AntiCough) to be better than another (Other Medicine) in every one of a set of categories (e.g. Study 1 and Study 2), but worse overall!

Combining the results above:

            AntiCough   Other Medicine
Given to:         100              100
Cured:             70               74
% Cured:          70%              74%

Overall, AntiCough has a 4% lower cure percentage (70%), despite being about 3% higher in both Study 1 and Study 2.

This effect is known as Simpson's Paradox.

It occurs because

P(C | A) = P(C | A ∩ S1)P(S1 | A) + P(C | A ∩ S2)P(S2 | A);

P(C | Ā) = P(C | Ā ∩ S1)P(S1 | Ā) + P(C | Ā ∩ S2)P(S2 | Ā),

where C = {cured}, A = {AntiCough}, Ā = {Other Medicine}, S1 = {Study 1}, S2 = {Study 2}.

Although P(C | A ∩ S1) > P(C | Ā ∩ S1), and P(C | A ∩ S2) > P(C | Ā ∩ S2), the other terms can change the overall outcome:

P(S1 | A), P(S1 | Ā), P(S2 | A), P(S2 | Ā).
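A few lines of arithmetic reproduce the tables and show the reversal. The study counts in the Python sketch below are the ones given above; the list layout and function name are our own.

```python
# (number treated, number cured) per study, from the tables above.
anticough = [(25, 20), (75, 50)]
other     = [(75, 58), (25, 16)]

def pooled_cure_rate(studies):
    treated = sum(n for n, _ in studies)
    cured = sum(c for _, c in studies)
    return cured / treated

# Within each study AntiCough does better (0.80 vs 0.77, 0.67 vs 0.64) ...
print([c / n for n, c in anticough], [c / n for n, c in other])
# ... but pooled it does worse (0.70 vs 0.74): Simpson's paradox.
print(pooled_cure_rate(anticough), pooled_cure_rate(other))
```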


    1.15 Equally likely outcomes and combinatorics: non-examinable

Sometimes, all the outcomes in a discrete finite sample space are equally likely. This makes it easy to calculate probabilities. If:

(i) Ω = {s1, . . . , sk};

(ii) each outcome si is equally likely, so p1 = p2 = . . . = pk = 1/k;

(iii) event A = {s1, s2, . . . , sr} contains r possible outcomes,

then

P(A) = r/k = (# outcomes in A) / (# outcomes in Ω).

Example: For a 3-child family, possible outcomes from oldest to youngest are:

Ω = {GGG, GGB, GBG, GBB, BGG, BGB, BBG, BBB} = {s1, s2, s3, s4, s5, s6, s7, s8}.

Let {p1, p2, . . . , p8} be a probability distribution on Ω. If every baby is equally likely to be a boy or a girl, then all of the 8 outcomes in Ω are equally likely, so p1 = p2 = . . . = p8 = 1/8.

Let event A be A = "oldest child is a girl".

Then A = {GGG, GGB, GBG, GBB}. Event A contains 4 of the 8 equally likely outcomes, so event A occurs with probability P(A) = 4/8 = 1/2.

Counting equally likely outcomes

To count the number of equally likely outcomes in an event, we often need to use permutations or combinations. These give the number of ways of choosing r objects from n distinct objects.

For example, if we wish to select 3 objects from n = 5 objects (a, b, c, d, e), we have choices abc, abd, abe, acd, ace, . . .


1. Number of Permutations, nPr

The number of permutations, nPr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute different choices.

That is, choice (a, b, c) counts separately from choice (b, a, c).

Then

# permutations = nPr = n(n - 1)(n - 2) . . . (n - r + 1) = n! / (n - r)!.

(n choices for the first object, (n - 1) choices for the second, etc.)

2. Number of Combinations, nCr = (n choose r)

The number of combinations, nCr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute the same choice.

That is, choice (a, b, c) and choice (b, a, c) are the same.

Then

# combinations = nCr = (n choose r) = nPr / r! = n! / ((n - r)! r!).

(Because nPr counts each permutation r! times, and we only want to count it once: so divide nPr by r!.)

Use the same rule on the numerator and the denominator

When P(A) = (# outcomes in A) / (# outcomes in Ω), we can often think about the problem either with different orderings constituting different choices, or with different orderings constituting the same choice. The critical thing is to use the same rule for both numerator and denominator.


Example: (a) Tom has five elderly great-aunts who live together in a tiny bungalow. They insist on each receiving separate Christmas cards, and threaten to disinherit Tom if he sends two of them the same picture. Tom has Christmas cards with 12 different designs. In how many different ways can he select 5 different designs from the 12 designs available?

Order of cards is not important, so use combinations. The number of ways of selecting 5 distinct designs from 12 is

12C5 = (12 choose 5) = 12! / ((12 - 5)! 5!) = 792.

(b) The next year, Tom buys a pack of 40 Christmas cards, featuring 10 different pictures with 4 cards of each picture. He selects 5 cards at random to send to his great-aunts. What is the probability that at least two of the great-aunts receive the same picture?

Looking for P(at least 2 cards the same) = P(A) (say).

Easiest to find P(all 5 cards are different) = P(Ā).

The number of outcomes in Ā is

(# ways of selecting 5 different designs) = 40 × 36 × 32 × 28 × 24.

(40 choices for the first card; 36 for the second, because the 4 cards with the first design are excluded; etc. Note that order matters: e.g. we are counting choice 12345 separately from 23154.)

The total number of outcomes is

(total # ways of selecting 5 cards from 40) = 40 × 39 × 38 × 37 × 36.

(Note: order mattered above, so we need order to matter here too.)

So

P(Ā) = (40 × 36 × 32 × 28 × 24) / (40 × 39 × 38 × 37 × 36) = 0.392.

Thus

P(A) = P(at least 2 cards are the same design) = 1 - P(Ā) = 1 - 0.392 = 0.608.


Alternative solution if order does not matter on the numerator and denominator (much harder method):

P(Ā) = [(10 choose 5) × 4^5] / (40 choose 5).

This works because there are (10 choose 5) ways of choosing 5 different designs from 10, and there are 4 choices of card within each of the 5 chosen groups. So the total number of ways of choosing 5 cards of different designs is (10 choose 5) × 4^5. The total number of ways of choosing 5 cards from 40 is (40 choose 5).

    Exercise: Check that this gives the same answer for P(A) as before.
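Both counting arguments are easy to check with the standard-library helpers math.comb and math.perm (available in Python 3.8+). The sketch below confirms that the ordered and unordered counts give the same probability; it is a check of the exercise, not part of the course text.

```python
from math import comb, perm

# Ordered counting: 40*36*32*28*24 favourable out of perm(40, 5) ordered selections.
favourable_ordered = 40 * 36 * 32 * 28 * 24
total_ordered = perm(40, 5)                  # 40*39*38*37*36

# Unordered counting: choose 5 of the 10 designs, then one of 4 cards per design.
favourable_unordered = comb(10, 5) * 4**5
total_unordered = comb(40, 5)

p1 = favourable_ordered / total_ordered
p2 = favourable_unordered / total_unordered
print(round(p1, 3), round(p2, 3))            # both ≈ 0.392
print(round(1 - p1, 3))                      # P(at least two the same) ≈ 0.608
```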

Note: Problems like these belong to the branch of mathematics called Combinatorics: the science of counting.

    1.16 Statistical Independence

Two events A and B are statistically independent if the occurrence of one does not affect the occurrence of the other.

This means P(A | B) = P(A) and P(B | A) = P(B).

Now P(A | B) = P(A ∩ B) / P(B),

so if P(A | B) = P(A) then P(A ∩ B) = P(A) × P(B).

We use this as our definition of statistical independence.

Definition: Events A and B are statistically independent if

P(A ∩ B) = P(A)P(B).


    For more than two events, we say:

Definition: Events A1, A2, . . . , An are mutually independent if

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2) . . . P(An), AND

the same multiplication rule holds for every subcollection of the events too.

E.g. events A1, A2, A3, A4 are mutually independent if

(i) P(Ai ∩ Aj) = P(Ai)P(Aj) for all i, j with i ≠ j; AND

(ii) P(Ai ∩ Aj ∩ Ak) = P(Ai)P(Aj)P(Ak) for all i, j, k that are all different; AND

(iii) P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1)P(A2)P(A3)P(A4).

Statistical independence for calculating the probability of an intersection

In Section 1.6 we said that it is often hard to calculate P(A ∩ B). We usually have two choices.

1. IF A and B are statistically independent, then

P(A ∩ B) = P(A) × P(B).

2. If A and B are not known to be statistically independent, we usually have to use conditional probability and the multiplication rule:

P(A ∩ B) = P(A | B)P(B).

This still requires us to be able to calculate P(A | B).

Note: If events are physically independent, then they will also be statistically independent.


Example: Toss a fair coin and a fair die together. The coin and die are physically independent.

Sample space: Ω = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}; all 12 items are equally likely.

Let A = "heads" and B = "six".

Then P(A) = P({H1, H2, H3, H4, H5, H6}) = 6/12 = 1/2,

and P(B) = P({H6, T6}) = 2/12 = 1/6.

Now P(A ∩ B) = P(heads and 6) = P({H6}) = 1/12.

But P(A) × P(B) = 1/2 × 1/6 = 1/12 also.

So P(A ∩ B) = P(A)P(B) and thus A and B are statistically independent.

Pairwise independence does not imply mutual independence

Example: A jar contains 4 balls: one red, one white, one blue, and one red, white & blue. Draw one ball at random.

Let A = "ball has red on it", B = "ball has white on it", C = "ball has blue on it".

Two balls satisfy A, so P(A) = 2/4 = 1/2. Likewise, P(B) = P(C) = 1/2.

Pairwise independent:

Consider P(A ∩ B) = 1/4 (one of the 4 balls has both red and white on it).

But P(A) × P(B) = 1/2 × 1/2 = 1/4, so P(A ∩ B) = P(A)P(B).

Likewise, P(A ∩ C) = P(A)P(C), and P(B ∩ C) = P(B)P(C).

So A, B and C are pairwise independent.

Mutually independent?

Consider P(A ∩ B ∩ C) = 1/4 (one of the 4 balls),

while P(A)P(B)P(C) = 1/2 × 1/2 × 1/2 = 1/8 ≠ P(A ∩ B ∩ C).

So A, B and C are NOT mutually independent, despite being pairwise independent.
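The four-ball example can be checked exhaustively. In the Python sketch below the data layout is our own choice: each ball is represented by the set of colours it carries, and every pairwise product is verified before the triple product is shown to fail.

```python
from fractions import Fraction
from itertools import combinations

# The four equally likely balls, described by the colours they carry.
balls = [{"red"}, {"white"}, {"blue"}, {"red", "white", "blue"}]
P = lambda E: Fraction(len(E), len(balls))

# Event "ball has colour c on it", as a set of ball indices.
events = {c: {i for i, b in enumerate(balls) if c in b} for c in ("red", "white", "blue")}

# Pairwise independent: P(X ∩ Y) = P(X)P(Y) for every pair ...
for X, Y in combinations(events.values(), 2):
    assert P(X & Y) == P(X) * P(Y)

# ... but not mutually independent: the triple product fails.
A, B, C = events.values()
print(P(A & B & C), P(A) * P(B) * P(C))   # 1/4 versus 1/8
```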


    1.17 Random Variables

We have one more job to do in laying the foundations of our science of randomness. So far we have come up with the following ideas:

1. Things that happen are sets, also called events.

2. We measure chance by measuring sets, using a measure called probability.

Finally, what are the sets that we are measuring? It is a nuisance to have lots of different sample spaces:

Ω = {head, tail}; Ω = {same, different}; Ω = {Lions win, All Blacks win}.

All of these sample spaces could be represented more concisely in terms of numbers:

Ω = {0, 1}.

On the other hand, there are many random experiments that genuinely produce random numbers as their outcomes.

For example, the number of girls in a three-child family; the number of heads from 10 tosses of a coin; and so on.

When the outcome of a random experiment is a number, it enables us to quantify many new things of interest:

1. quantify the average value (e.g. the average number of heads we would get if we made 10 coin-tosses again and again);

2. quantify how much the outcomes tend to diverge from the average value;

3. quantify relationships between different random quantities (e.g. is the number of girls related to the hormone levels of the fathers?)

The list is endless. To give us a framework in which these investigations can take place, we give a special name to random experiments that produce numbers as their outcomes.

A random experiment whose possible outcomes are real numbers is called a random variable.


    In fact, any random experiment can be made to have outcomes that are realnumbers, simply by mapping the sample space onto a set of real numbers usinga function.

    For example: function X : RX(Lions win) = 0; X(All Blacks win) = 1.

    This gives us our formal definition of a random variable:

    Definition: A random variable (r.v.) is a function from a sample space to thereal numbers R.

We write X : Ω → R.

Although this is the formal definition, the intuitive definition of a random variable is probably more useful. Intuitively, remember that a random variable equates to a random experiment whose outcomes are numbers.

A random variable produces random real numbers as the outcome of a random experiment.

    Defining random variables serves the dual purposes of:

1. Describing many different sample spaces in the same terms: e.g. Ω = {0, 1} with P(1) = p and P(0) = 1 − p describes EVERY possible experiment with two outcomes.

2. Giving a name to a large class of random experiments that genuinely produce random numbers, and for which we want to develop general rules for finding averages, variances, relationships, and so on.

    Example: Toss a coin 3 times. The sample space is

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

One example of a random variable is X : Ω → R such that, for sample points si, we have X(si) = # heads in outcome si.

So X(HHH) = 3, X(THT) = 1, etc.


Another example is Y : Ω → R such that Y(si) = 1 if the 2nd toss is a head, and Y(si) = 0 otherwise.

Then Y(HTH) = 0, Y(THH) = 1, Y(HHH) = 1, etc.

    Probabilities for random variables

By convention, we use CAPITAL LETTERS for random variables (e.g. X), and lower case letters to represent the values that the random variable takes (e.g. x).

For a sample space Ω and random variable X : Ω → R, and for a real number x,

P(X = x) = P(outcome s is such that X(s) = x) = P({s : X(s) = x}).

Example: toss a fair coin 3 times. All outcomes are equally likely: P(HHH) = P(HHT) = . . . = P(TTT) = 1/8.

Let X : Ω → R such that X(s) = # heads in s. Then P(X = 0) = P({TTT}) = 1/8.

P(X = 1) = P({HTT, THT, TTH}) = 3/8.
P(X = 2) = P({HHT, HTH, THH}) = 3/8.
P(X = 3) = P({HHH}) = 1/8.

    Note that P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 1.
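These probabilities can also be recovered mechanically from the sample space, using P(X = x) = P({s : X(s) = x}). A minimal Python sketch:

    from fractions import Fraction
    from itertools import product

    omega = ["".join(s) for s in product("HT", repeat=3)]  # 8 equally likely outcomes

    def X(s):
        return s.count("H")  # number of heads in outcome s

    for x in range(4):
        event = [s for s in omega if X(s) == x]         # the event {s : X(s) = x}
        print(x, Fraction(len(event), len(omega)))      # 1/8, 3/8, 3/8, 1/8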

    Independent random variables

    Random variables X and Y are independent if each does not affect the other.

Recall that two events A and B are independent if P(A ∩ B) = P(A)P(B). Similarly, random variables X and Y are defined to be independent if

P({X = x} ∩ {Y = y}) = P(X = x)P(Y = y) for all possible values x and y.


We usually replace the cumbersome notation P({X = x} ∩ {Y = y}) by the simpler notation P(X = x, Y = y).

    From now on, we will use the following notations interchangeably:

P({X = x} ∩ {Y = y}) = P(X = x AND Y = y) = P(X = x, Y = y).

Thus X and Y are independent if and only if

    P(X = x, Y = y) = P(X = x)P(Y = y) for ALL possible values x, y.
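For a small sample space this condition can be checked by brute force. As an illustration (this particular pair is not worked in the text), take X = # heads and Y = indicator of a head on the 2nd toss, from the three-toss example: they are NOT independent, since P(X = 3, Y = 1) = 1/8 but P(X = 3)P(Y = 1) = 1/16. A minimal Python sketch:

    from fractions import Fraction
    from itertools import product

    omega = ["".join(s) for s in product("HT", repeat=3)]   # fair coin tossed 3 times

    X = lambda s: s.count("H")             # number of heads
    Y = lambda s: 1 if s[1] == "H" else 0  # 1 if the 2nd toss is a head

    def P(event):
        return Fraction(len([s for s in omega if event(s)]), len(omega))

    independent = all(
        P(lambda s: X(s) == x and Y(s) == y) == P(lambda s: X(s) == x) * P(lambda s: Y(s) == y)
        for x in range(4) for y in (0, 1)
    )
    print(independent)   # False: X and Y are not independent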


    1.18 Key Probability Results for Chapter 1

1. If A and B are mutually exclusive (i.e. A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

2. Conditional probability: P(A | B) = P(A ∩ B) / P(B) for any A, B.

Or: P(A ∩ B) = P(A | B)P(B).

3. For any A, B, we can write

P(A | B) = P(B | A)P(A) / P(B).

This is a simplified version of Bayes' Theorem. It shows how to invert the conditioning, i.e. how to find P(A | B) when you know P(B | A).

4. Bayes' Theorem, slightly more generalized: for any A, B,

P(A | B) = P(B | A)P(A) / [ P(B | A)P(A) + P(B | Ā)P(Ā) ].

This works because A and Ā form a partition of the sample space.

5. Complete version of Bayes' Theorem:

If sets A1, . . . , Am form a partition of the sample space, i.e. they do not overlap (mutually exclusive) and collectively cover all possible outcomes (their union is the sample space), then

P(Aj | B) = P(B | Aj)P(Aj) / [ P(B | A1)P(A1) + . . . + P(B | Am)P(Am) ]
          = P(B | Aj)P(Aj) / Σ_{i=1}^{m} P(B | Ai)P(Ai).
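The complete version translates directly into a short calculation. A minimal Python sketch (the machine/defect numbers below are made up purely for illustration):

    def bayes(priors, likelihoods, j):
        # P(A_j | B), given P(A_i) (priors) and P(B | A_i) (likelihoods) for a partition A_1, ..., A_m.
        denominator = sum(p * l for p, l in zip(priors, likelihoods))  # Partition Theorem: P(B)
        return likelihoods[j] * priors[j] / denominator

    # Hypothetical example: three machines make 50%, 30% and 20% of all items (the partition),
    # with defect rates 1%, 2% and 3%; B = "item is defective".
    priors = [0.5, 0.3, 0.2]
    likelihoods = [0.01, 0.02, 0.03]
    print(round(bayes(priors, likelihoods, 0), 3))   # P(A_1 | B) = 0.005/0.017, approximately 0.294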


    6. Partition Theorem: if A1, . . . , Am form a partition of the sample space, then

P(B) = P(B ∩ A1) + P(B ∩ A2) + . . . + P(B ∩ Am).

    This can also be written as:

    P(B) = P(B | A1)P(A1) + P(B | A2)P(A2) + . . . + P(B | Am)P(Am) .

    These are both very useful formulations.

    7. Chains of events:

P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A2 ∩ A1).

    8. Statistical independence:

    if A and B are independent, then

P(A ∩ B) = P(A)P(B), and
P(A | B) = P(A), and
P(B | A) = P(B).

    9. Conditional probability:

If P(B) > 0, then we can treat P( · | B) just like P:

e.g. if A1 and A2 are mutually exclusive, then P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B) (compare with P(A1 ∪ A2) = P(A1) + P(A2));

if A1, . . . , Am partition the sample space, then P(A1 | B) + P(A2 | B) + . . . + P(Am | B) = 1;

and P(Ā | B) = 1 − P(A | B) for any A.

(Note: it is not generally true that P(A | B̄) = 1 − P(A | B).)

The fact that P( · | B) is a valid probability measure is easily verified by checking that it satisfies Axioms 1, 2, and 3.

    10. Unions: For any A, B, C,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B);

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

The second expression is obtained by writing P(A ∪ B ∪ C) = P(A ∪ (B ∪ C)) and applying the first expression to A and (B ∪ C), then applying it again to expand P(B ∪ C).


Chapter 2: Discrete Probability Distributions

    2.1 Introduction

    In the next two chapters we meet several important concepts:

    1. Probability distributions, and the probability function fX(x):

the probability function of a random variable lists the values the random variable can take, and their probabilities.

    2. Hypothesis testing:

I toss a coin ten times and get nine heads. How unlikely is that? Can we continue to believe that the coin is fair when it produces nine heads out of ten tosses?

    3. Likelihood and estimation:

what if we know that our random variable is (say) Binomial(5, p), for some p, but we don't know the value of p? We will see how to estimate the value of p using maximum likelihood estimation.

4. Expectation and variance of a random variable:

the expectation of a random variable is the value it takes on average;

the variance of a random variable measures how much the random variable varies about its average.

    5. Change of variable procedures:

calculating probabilities and expectations of g(X), where X is a random variable and g(X) is a function, e.g. g(X) = √X or g(X) = X².

    6. Modelling:

we have a situation in real life that we know is random. But what does the randomness look like? Is it highly variable, or is there little variability? Does it sometimes give results much higher than average, but never give results much lower (a long-tailed distribution)? We will see how different probability distributions are suitable for different circumstances. Choosing a probability distribution to fit a situation is called modelling.


    2.2 The probability function, fX(x)

    The probability function fX(x) lists all possible values of X,

    and gives a probability to each value.

Recall that a random variable, X, assigns a real number to every possible outcome of a random experiment. The random variable is discrete if the set of real values it can take is finite or countable, e.g. {0, 1, 2, . . . }.

Random experiment: which car? (Ferrari, Porsche, MG, . . . )
Random variable: X. X gives numbers to the possible outcomes.

If he chooses . . .
Ferrari: X = 1
Porsche: X = 2
MG: X = 3

Definition: The probability function, fX(x), for a discrete random variable X, is given by

fX(x) = P(X = x), for all possible outcomes x of X.

    Example: Which car?

Outcome:              Ferrari   Porsche    MG
x                        1         2        3
fX(x) = P(X = x)        1/6       1/6      4/6

We write: P(X = 1) = fX(1) = 1/6: the probability he makes choice 1 (a Ferrari) is 1/6.


We can also write the probability function as:

fX(x) = 1/6 if x = 1,
        1/6 if x = 2,
        4/6 if x = 3,
        0   otherwise.

Example: Toss a fair coin once, and let X = number of heads. Then

X = 0 with probability 0.5,
    1 with probability 0.5.

    The probability function of X is given by:

x                    0     1
fX(x) = P(X = x)    0.5   0.5

or

fX(x) = 0.5 if x = 0,
        0.5 if x = 1,
        0   otherwise.

We write (e.g.) fX(0) = 0.5, fX(1) = 0.5, fX(7.5) = 0, etc.

    fX(x) is just a list of probabilities.

    Properties of the probability function

i) 0 ≤ fX(x) ≤ 1 for all x; probabilities are always between 0 and 1.

ii) Σ_x fX(x) = 1; probabilities add to 1 overall.

iii) P(X ∈ A) = Σ_{x ∈ A} fX(x);

e.g. in the car example,

P(X ∈ {1, 2}) = P(X = 1 or 2) = P(X = 1) + P(X = 2) = 1/6 + 1/6 = 2/6.

    This is the probability of choosing either a Ferrari or a Porsche.
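Because fX(x) is just a list of probabilities, it is natural to store it as a lookup table. A minimal Python sketch for the car example (the names f_X and prob_in are illustrative):

    from fractions import Fraction

    # Probability function of X for the "which car?" example.
    f_X = {1: Fraction(1, 6), 2: Fraction(1, 6), 3: Fraction(4, 6)}

    assert sum(f_X.values()) == 1   # property (ii): probabilities add to 1

    def prob_in(A):
        # Property (iii): P(X in A) is the sum of f_X(x) over x in A.
        return sum(f_X.get(x, Fraction(0)) for x in A)

    print(prob_in({1, 2}))   # 1/3 (= 2/6): a Ferrari or a Porsche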


    2.3 Bernoulli trials

Many of the discrete random variables that we meet are based on counting the outcomes of a series of trials called Bernoulli trials. Jacques Bernoulli was a Swiss mathematician in the late 1600s. He and his brother Jean, who were bitter rivals, both studied mathematics secretly against their father's will. Their father wanted Jacques to be a theologian and Jean to be a merchant.

Definition: A random experiment is called a set of Bernoulli trials if it consists of several trials such that:

i) Each trial has only 2 possible outcomes (usually called "Success" and "Failure");

    ii) The probability of success, p, remains constant for all trials;

iii) The trials are independent, i.e. the event "success in trial i" does not depend on the outcome of any other trials.

Examples: 1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2.

2) Repeated tossing of a fair die: success = "6", failure = "not 6". Each toss is a Bernoulli trial with P(success) = 1/6.

Definition: The random variable Y is called a Bernoulli random variable if it takes only 2 values, 0 and 1.

    The probability function is,

fY(y) = p      if y = 1,
        1 − p  if y = 0.

That is,
P(Y = 1) = P(success) = p,
P(Y = 0) = P(failure) = 1 − p.


    2.4 Example of the probability function: the Binomial Distribution

The Binomial distribution counts the number of successes in a fixed number of Bernoulli trials.

Definition: Let X be the number of successes in n independent Bernoulli trials each with probability of success = p. Then X has the Binomial distribution with parameters n and p. We write X ~ Bin(n, p), or X ~ Binomial(n, p).

Thus X ~ Bin(n, p) if X is the number of successes out of n independent trials, each of which has probability p of success.

    Probability function

If X ~ Binomial(n, p), then the probability function for X is

fX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x)   for x = 0, 1, . . . , n.

    Explanation:

An outcome with x successes and (n − x) failures has probability

p^x   (1 − p)^(n−x)
(1)        (2)

where:
(1) succeeds x times, each with probability p;
(2) fails (n − x) times, each with probability (1 − p).


There are (n choose x) possible outcomes with x successes and (n − x) failures, because we must select x trials to be our successes, out of n trials in total.

    Thus,

P(#successes = x) = (#outcomes with x successes) × (prob. of each such outcome)
                  = (n choose x) p^x (1 − p)^(n−x).

    Note:

fX(x) = 0 if x ∉ {0, 1, 2, . . . , n}.

Check that Σ_{x=0}^{n} fX(x) = 1:

Σ_{x=0}^{n} fX(x) = Σ_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x) = [p + (1 − p)]^n   (Binomial Theorem)
                  = 1^n = 1.

It is this connection with the Binomial Theorem that gives the Binomial Distribution its name.
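The probability function, and the check that it sums to 1, are easy to compute numerically. A minimal Python sketch (evaluated here for n = 4, p = 0.2, the case used in Example 1 below):

    from math import comb

    def binomial_pmf(x, n, p):
        # P(X = x) for X ~ Binomial(n, p).
        return comb(n, x) * p**x * (1 - p)**(n - x)

    n, p = 4, 0.2
    print([round(binomial_pmf(x, n, p), 4) for x in range(n + 1)])
    # [0.4096, 0.4096, 0.1536, 0.0256, 0.0016]
    print(sum(binomial_pmf(x, n, p) for x in range(n + 1)))   # 1.0 (up to floating-point rounding)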


Example 1: Let X ~ Binomial(n = 4, p = 0.2). Write down the probability function of X.

x                     0        1        2        3        4
fX(x) = P(X = x)    0.4096   0.4096   0.1536   0.0256   0.0016

    Example 2: Let X be the number of times I get a 6 out of 10 rolls of a fair die.

    1. What is the distribution of X?

2. What is the probability that X ≥ 2?

1. X ~ Binomial(n = 10, p = 1/6).

2. P(X ≥ 2) = 1 − P(X < 2)
            = 1 − P(X = 0) − P(X = 1)
            = 1 − (10 choose 0) (1/6)^0 (1 − 1/6)^(10−0) − (10 choose 1) (1/6)^1 (1 − 1/6)^(10−1)
            = 0.515.
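The same arithmetic can be checked numerically with a couple of lines of Python (a sketch, using the standard-library function math.comb):

    from math import comb

    n, p = 10, 1/6
    p_less_than_2 = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in (0, 1))
    print(round(1 - p_less_than_2, 3))   # 0.515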

Example 3: Let X be the number of girls in a three-child family. What is the distribution of X?

    Assume:

    (i) each child is equally likely to be a boy or a girl;

    (ii) all children are independent of each other.

Then X ~ Binomial(n = 3, p = 0.5).


    Shape of the Binomial distribution

The shape of the Binomial distribution depends upon the values of n and p. For small n, the distribution is almost symmetrical for values of p close to 0.5, but highly skewed for values of p close to 0 or 1. As n increases, the distribution becomes more and more symmetrical, and there is noticeable skew only if p is very close to 0 or 1.

    The probability functions for various values of n and p are shown below.

[Figure: Binomial probability functions for n = 10, p = 0.5; n = 10, p = 0.9; and n = 100, p = 0.9.]

    Sum of independent Binomial random variables:

If X and Y are independent, and X ~ Binomial(n, p), Y ~ Binomial(m, p), then

X + Y ~ Bin(n + m, p).

This is because X counts the number of successes out of n trials, and Y counts the number of successes out of m trials: so overall, X + Y counts the total number of successes out of n + m trials.

Note: X and Y must both share the same value of p.


    2.5 The cumulative distribution function, FX(x)

    We have defined the probability function, fX(x), as fX(x) = P(X = x).

    The probability function tells us everything there is to know about X.

The cumulative distribution function, or just distribution function, written as FX(x), is an alternative function that also tells us everything there is to know about X.

    Definition: The (cumulative) distribution function (c.d.f.) is

FX(x) = P(X ≤ x)   for −∞ < x < ∞.

If you are asked to give the distribution of X, you could answer by giving either the distribution function, FX(x), or the probability function, fX(x). Each of these functions encapsulates all possible information about X.

    The distribution function FX(x) as a probability sweeper

The cumulative distribution function, FX(x), sweeps up all the probability up to and including the point x.

[Figure: probability functions for X ~ Bin(10, 0.5) and X ~ Bin(10, 0.9).]


Example: Let X ~ Binomial(2, 1/2).

x                    0     1     2
fX(x) = P(X = x)    1/4   1/2   1/4

Then FX(x) = P(X ≤ x) =  0                         if x < 0,
                         0.25                      if 0 ≤ x < 1,
                         0.25 + 0.5 = 0.75         if 1 ≤ x < 2,
                         0.25 + 0.5 + 0.25 = 1     if x ≥ 2.

[Figure: the probability function f(x) and the step-function distribution function F(x) for X ~ Binomial(2, 1/2).]

FX(x) gives the cumulative probability up to and including point x.

So FX(x) = Σ_{y ≤ x} fX(y).

Note that FX(x) is a step function: it jumps by amount fX(y) at every point y with positive probability.
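Since FX(x) is just the running total of fX, it can be computed by a cumulative sum. A minimal Python sketch for the Binomial(2, 1/2) example above:

    from fractions import Fraction

    f_X = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

    def F_X(x):
        # Cumulative distribution function: sum of f_X(y) over y <= x.
        return sum(p for y, p in f_X.items() if y <= x)

    for x in (-1, 0, 0.5, 1, 2, 3):
        print(x, F_X(x))   # 0, 1/4, 1/4, 3/4, 1, 1 (a step function)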


    Reading off probabilities from the distribution function

As well as using the probability function to find the distribution function, we can also use the distribution function to find probabilities.

fX(x) = P(X = x) = P(X ≤ x) − P(X ≤ x − 1)   (if X takes integer values)
       = FX(x) − FX(x − 1).

This is why the distribution function FX(x) contains as much information as the probability function, fX(x), because we can use either one to find the other.

    In general:

P(a < X ≤ b) = FX(b) − FX(a)   if b > a.

Proof: The event {X ≤ b} splits into the two non-overlapping events {X ≤ a} and {a < X ≤ b} (picture the points a and b on the number line), so

P(X ≤ b) = P(X ≤ a) + P(a < X ≤ b).

So FX(b) = FX(a) + P(a < X ≤ b),

giving FX(b) − FX(a) = P(a < X ≤ b).


Warning: endpoints

Be careful of endpoints, and of the difference between ≤ and <. For example, for an integer-valued X:

P(X > 42)?   This is 1 − P(X ≤ 42) = 1 − FX(42).

P(50 ≤ X ≤ 60)?   This is P(X ≤ 60) − P(X ≤ 49) = FX(60) − FX(49).
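With a concrete distribution, these endpoint rules are easy to verify against direct sums of the probability function. A minimal Python sketch (the choice X ~ Binomial(100, 0.5) is just for illustration):

    from math import comb

    n, p = 100, 0.5
    pmf = lambda x: comb(n, x) * p**x * (1 - p)**(n - x)
    cdf = lambda x: sum(pmf(k) for k in range(0, x + 1))   # F_X(x) for integer x

    # P(X > 42) = 1 - F_X(42)
    print(abs((1 - cdf(42)) - sum(pmf(k) for k in range(43, n + 1))) < 1e-12)    # True

    # P(50 <= X <= 60) = F_X(60) - F_X(49)
    print(abs((cdf(60) - cdf(49)) - sum(pmf(k) for k in range(50, 61))) < 1e-12) # True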

    Properties of the distribution function

1) F(−∞) = P(X ≤ −∞) = 0, and F(+∞) = P(X ≤ +∞) = 1.
(These are true because values of X are strictly between −∞ and +∞.)

2) FX(x) is a non-decreasing function of x: that is, if x1 < x2, then FX(x1) ≤ FX(x2).

3) P(a < X ≤ b) = FX(b) − FX(a) if b > a.

4) F is right-continuous: that is, lim_{h→0⁺} F(x + h) = F(x).


    2.6 Hypothesis testing

You have probably come across the idea of hypothesis tests, p-values, and significance in other courses. Common hypothesis tests include t-tests and chi-squared tests. However, hypothesis tests can be conducted in much simpler circumstances than these. The concept of the hypothesis test is at its easiest to understand with the Binomial distribution in the following example. All other hypothesis tests throughout statistics are based on the same idea.

    Example: Weird Coin?


    I toss a coin 10 times and get 9 heads. How weird is that?

    What is weird?

• Getting 9 heads out of 10 tosses: we'll call this weird.
• Getting 10 heads out of 10 tosses: even more weird!
• Getting 8 heads out of 10 tosses: less weird.
• Getting 1 head out of 10 tosses: same as getting 9 tails out of 10 tosses: just as weird as 9 heads if the coin is fair.
• Getting 0 heads out of 10 tosses: same as getting 10 tails: more weird than 9 heads if the coin is fair.

    Set of weird outcomes

If our coin is fair, the outcomes that are as weird or weirder than 9 heads are:

    9 heads, 10 heads, 1 head, 0 heads.

    So how weird is 9 heads or worse, if the coin is fair?

We can add the probabilities of all the outcomes that are at least as weird as 9 heads out of 10 tosses, assuming that the coin is fair.

Distribution of X, if the coin is fair: X ~ Binomial(n = 10, p = 0.5).


    Probability of observing something at least as weird as 9 heads,

    if the coin is fair:

P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0), where X ~ Binomial(10, 0.5).

[Figure: probabilities P(X = x), plotted against x, for X ~ Binomial(n = 10, p = 0.5).]

For X ~ Binomial(10, 0.5), we have:

P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0)
   = (10 choose 9)(0.5)^9(0.5)^1 + (10 choose 10)(0.5)^10(0.5)^0 + (10 choose 1)(0.5)^1(0.5)^9 + (10 choose 0)(0.5)^0(0.5)^10
   = 0.00977 + 0.00098 + 0.00977 + 0.00098
   = 0.021.
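This two-sided p-value is quick to reproduce numerically. A minimal Python sketch:

    from math import comb

    def pmf(x, n=10, p=0.5):
        # P(X = x) for X ~ Binomial(n, p)
        return comb(n, x) * p**x * (1 - p)**(n - x)

    # Outcomes at least as weird as 9 heads, for a fair coin: 0, 1, 9 or 10 heads.
    p_value = sum(pmf(x) for x in (9, 10, 1, 0))
    print(round(p_value, 3))   # 0.021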

    Is this weird?

Yes, it is quite weird. If we had a fair coin and tossed it 10 times, we would only expect to see something as extreme as 9 heads on about 2.1% of occasions.


    Is the coin fair?

Obviously, we can't say. It might be: after all, on 2.1% of occasions that you toss a fair coin 10 times, you do get something as weird as 9 heads or more.

However, 2.1% is a small probability, so it is still very unusual for a fair coin to produce something as weird as what we've seen. If the coin really was fair, it would be very unusual to get 9 heads or more.

We can deduce that, EITHER we have observed a very unusual event with a fair coin, OR the coin is not fair.

    In fact, this gives us some evidence that the coin is not fair.

The value 2.1% measures the strength of our evidence. The smaller this probability, the more evidence we have.

    Formal hypothesis test

    We now formalize the procedure above. Think of the steps:

• We have a question that we want to answer: Is the coin fair?

• There are two alternatives:
  1. The coin is fair.
  2. The coin is not fair.

• Our observed information is X, the number of heads out of 10 tosses. We write down the distribution of X if the coin is fair: X ~ Binomial(10, 0.5).

• We calculate the probability of observing something AT LEAST AS EXTREME as our observation, X = 9, if the coin is fair: prob = 0.021.

• The probability is small (2.1%). We conclude that this is unlikely with a fair coin, so we have observed some evidence that the coin is NOT fair.

  • 8/8/2019 Stats 210 Course Book

    67/200

    66

    Null hypothesis and alternative hypothesis

    We express the steps above as two competing hypotheses.

    Null hypothesis: the first alternative, that the coin IS fair.

We expect to believe the null hypothesis unless we see convincing evidence that it is wrong.

    Alternative hypothesis: the second alternative, that the coin is NOT fair.

    In hypothesis testing, we often use this same formulation.

The null hypothesis is specific. It specifies an exact distribution for our observation: X ~ Binomial(10, 0.5).

The alternative hypothesis is general. It simply states that the null hypothesis is wrong. It does not say what the right answer is.

We use H0 and H1 to denote the null and alternative hypotheses respectively.

The null hypothesis is H0: the coin is fair.
The alternative hypothesis is H1: the coin is NOT fair.

    More precisely, we write:

Number of heads, X ~ Binomial(10, p),

    and

H0: p = 0.5
H1: p ≠ 0.5.

Think of "null hypothesis" as meaning "the default": the hypothesis we will accept unless we have a good reason not to.


    p-values

In the hypothesis-testing framework above, we always measure evidence AGAINST the null hypothesis.

That is, we believe that our coin is fair unless we see convincing evidence otherwise.

    We measure the strength of evidence against H0 using the p-value.

    In the example above, the p-value was p = 0.021.

    A p-value of 0.021 represents quite strong evidence against the null hypothesis.

It states that, if the null hypothesis is TRUE, we would only have a 2.1% chance of observing something as extreme as 9 heads or tails.

Many of us would see this as strong enough evidence to decide that the null hypothesis is not true.

In general, the p-value is the probability of observing something AT LEAST AS EXTREME AS OUR OBSERVATION, if H0 is TRUE.

This means that SMALL p-values represent STRONG evidence against H0.

Small p-values mean Strong evidence.
Large p-values mean Little evidence.

Note: Be careful not to confuse the term p-value, which is 0.021 in our example, with the Binomial probability p. Our hypothesis test is designed to test whether the Binomial probability is p = 0.5. To test this, we calculate the p-value of 0.021 as a measure of the strength of evidence against the hypothesis that p = 0.5.


    Interpreting the hypothesis test

There are different schools of thought about how a p-value should be interpreted.

Most people agree that the p-value is a useful measure of the strength of evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against H0.

Some people go further and use an accept/reject framework. Under this framework, the null hypothesis H0 should be rejected if the p-value is less than 0.05 (say), and accepted if the p-value is greater than 0.05.

In this course we use the strength of evidence interpretation. The p-value measures how far out our observation lies in the tails of the distribution specified by H0. We do not talk about accepting or rejecting H0. This decision should usually be taken in the context of other scientific information.

However, it is worth bearing in mind that p-values of 0.05 and less start to suggest that the null hypothesis is doubtful.

    Statistical significance

You have probably encountered the idea of statistical significance in other courses.

Statistical significance refers to the p-value.

The result of a hypothesis test is significant at the 5% level if the p-value is less than 0.05.

This means that the chance of seeing what we did see (9 heads), or more, is less than 5% if the null hypothesis is true.

Saying the test is "significant" is a quick way of saying that there is evidence against the null hypothesis, usually at the 5% level.


In the coin example, we can say that our test of H0: p = 0.5 against H1: p ≠ 0.5 is significant at the 5% level, because the p-value is 0.021, which is < 0.05.

    This means:

we have some evidence that p ≠ 0.5.

    It does not mean:

• the difference between p and 0.5 is large, or
• the difference between p and 0.5 is important in practical terms.

Statistically significant means that we have evidence that there IS a difference. It says NOTHING about the SIZE, or the IMPORTANCE, of the difference.

    Beware!

The p-value gives the probability of seeing something as weird as what we did see, if H0 is true.

This means that 5% of the time, we will get a p-value < 0.05 EVEN WHEN H0 IS TRUE!!

Indeed, about once in every thousand tests, we will get a p-value < 0.001, even though H0 is true!

A small p-value does NOT mean that H0 is definitely wrong.

    One-sided and two-sided tests

The test above is a two-sided test. This means that we considered it just as weird to get 9 tails as 9 heads.

If we had a good reason, before tossing the coin, to believe that the binomial probability could only be = 0.5 or > 0.5, i.e. that it would be impossible to have p < 0.5, then we could conduct a one-sided test: H0: p = 0.5 versus H1: p > 0.5.

    This would have the effect of halving the resultant p-value.


    2.7 Example: Presidents and deep-sea divers

Men in the class: would you like to have daughters? Then become a deep-sea diver, a fighter pilot, or a heavy smoker.

Would you prefer sons? Easy! Just become a US president.

Numbers suggest that men in different professions tend to have more sons than daughters, or the reverse. Presidents have sons, fighter pilots have daughters. But is it real, or just chance? We can use hypothesis tests to decide.

    The facts

The 43 US presidents from George Washington to George W. Bush have had a total of 151 children, comprising 88 sons and only 63 daughters: a sex ratio of 1.4 sons for every daughter.

Two studies of deep-sea divers revealed that the men had a total of 190 children, comprising 65 sons and 125 daughters: a sex ratio of 1.9 daughters for every son.

    Could this happen by chance?

Is it possible that the men in each group really had a 50-50 chance of producing sons and daughters?

    This is the same as the question in Section 2.6.

For the presidents: If I tossed a coin 151 times and got only 63 heads, could I continue to believe that the coin was fair?

For the divers: If I tossed a coin 190 times and got only 65 heads, could I continue to believe that the coin was fair?


    Hypothesis test for the presidents

    We set up the competing hypotheses as follows.

Let X be the number of daughters out of 151 presidential children.

Then X ~ Binomial(151, p), where p is the probability that each child is a daughter.

    Null hypothesis: H0 : p = 0.5.

Alternative hypothesis: H1: p ≠ 0.5.

p-value: We need the probability of getting a result AT LEAST AS EXTREME as X = 63 daughters, if H0 is true and p really is 0.5.

    Which results are at least as extreme as X = 63?

    X = 0, 1, 2, . . . , 63, for even fewer daughters.

X = (151 − 63), . . . , 151, for too many daughters, because we would be just as surprised if we saw 63 sons, i.e. (151 − 63) = 88 daughters.

[Figure: probabilities P(X = x) for X ~ Binomial(n = 151, p = 0.5).]


    Calculating the p-value

    The p-value for the president problem is given by

P(X ≤ 63) + P(X ≥ 88),   where X ~ Binomial(151, 0.5).

In principle, we could calculate this as P(X = 0) + P(X = 1) + . . . + P(X =