Welcome to
STAB52
Instructor: Dr. Ken Butler
1
Contact information
(on Intranet: intranet.utsc.utoronto.ca, My Courses)
• E-mail: [email protected]
• Office: H 417
• Office hours: to be announced
• Phone: 5654 (416-287-5654)
2
Probability Models
3
Measuring uncertainty
In life, often faced with own ignorance:
• don’t know what winning lottery number will be
• don’t know tomorrow’s weather
• don’t know what traffic will be like on way home tonight
• don’t know who next mayor of Toronto will be
This course will not give you answers to above, but will teach you
how to recognize own ignorance and work with it.
4
Why we need probability theory
Consider a couple of apparently reasonable gambles:
• A friend has 3 cards, one red both sides, one black both sides,
the other one colour each side. Friend picks card at random,
places on table. Side showing is red. Offers $4 against your $3
that other side also red.
• Another friend suggests flipping a (fair) coin 1000 times. If coin
comes up heads 600 or more times, he pays you $100, else you
pay him just $1.
5
Cards: might think either colour equally likely, in which case bet is
good one. But in fact other side will be red 2/3 of the time. Losing
bet in long run. (Conditional probability, expectation.)
Coin: getting 600 or more heads very unlikely, even to make this bet
pay off in long run. (Law of large numbers, expectation.)
If you have friends like this, get some new ones!
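For the skeptical: the 2/3 answer for the cards can be checked by brute force. A quick Python sketch (not part of the course) enumerates the 3 cards and the 2 ways each can land:

```python
# Enumerate the three-card game; each card is (side1, side2).
cards = [("R", "R"), ("B", "B"), ("R", "B")]

shown_red = 0   # cases where the face-up side is red
both_red = 0    # cases where the hidden side is red too

for card in cards:
    for up in (0, 1):                  # either side may land face up
        shown, hidden = card[up], card[1 - up]
        if shown == "R":
            shown_red += 1
            both_red += hidden == "R"

# Of the equally likely "red showing" cases, 2 of 3 have a red back.
print(both_red, "of", shown_red)
```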
6
Many things in life involve assessing risk, and making decision in
light of risk:
• the bets above
• predicting earthquakes
• walking down the street
• whether to buy a house
• whether machinery in factory will work properly
• in science, what experiment to do, and how to interpret results
• insurance
7
Probability Models
To talk about risk, need to assess:
• what might happen
• how likely each thing is to happen.
List of all possible outcomes called sample space, denoted S.
Eg. in Lotto 6/49, you pick 6 numbers, and get prize according to
how many of winning numbers you match (including “bonus”), so
S = {0, 2 + b, 3, 4, 5, 5 + b, 6}
Things in S called outcomes. 0 means “no prize”.
8
Outcomes, events, probability measure
Subsets of S are called events. Example: {0, 2 + b} means “no
prize or two-plus-bonus prize”.
A probability measure must assign to each event A a probability,
written P (A). P has these properties:
• 0 ≤ P (A) ≤ 1
• P (∅) = 0 (“impossible for nothing to happen”)
• P (S) = 1 (“something must happen”)
• If A1, A2 . . . are disjoint events,
P (A1 ∪ A2 ∪ · · · ) = P (A1) + P (A2) + · · · .
“Disjoint” means the events have no outcomes in common.
9
So {0, 2 + b} and {4, 5} are disjoint, but {0, 2 + b} and
{2 + b, 3} are not.
Last property says
P ({0, 2 + b, 4, 5}) = P ({0, 2 + b}) + P ({4, 5})
but says nothing about P ({0, 2 + b, 3}).
10
Probability model
A probability model consists of a (nonempty) sample space S,
subsets of S called events, and a probability measure P satisfying
properties above.
Usually give probability measure for each outcome, eg. for lottery:
11
Outcome Probability
6 0.00000007
5 + b 0.0000004
5 0.000018
4 0.000968
3 0.017544
2 + b 0.0123457
0 0.96875
These sum to 1 (to accuracy shown).
12
Example 2: flip a fair coin. It can come up heads (H) or tails (T), so
S = {H,T}. Also
1 = P (S) = P ({H,T}) = P ({H}) + P ({T})
so P ({H}) = P ({T}) = 0.5 (fair coin, sum to 1).
Example 3: flip 2 fair coins. Then
S = {HH, HT, TH, TT}
with each outcome equally likely, so eg. P(HT) = 1/4. But
P(1 head) = P({HT}) + P({TH}) = 1/2, different from
P(0 heads) = P(2 heads) = 1/4.
13
Venn Diagrams
When handling events, ie. subsets of S, nice to be able to draw
picture of what we mean (easier for thinking, too).
A Venn Diagram is a rectangle representing S containing circles
representing events A, B, . . . Some pictures on later pages.
Definitions:
• Subset Ac = {s : s ∉ A} called complement of A.
• Intersection A ∩ B = {s : s ∈ A and s ∈ B}.
• Union A ∪ B = {s : s ∈ A or s ∈ B}.
• A ∩ Bc called complement of B in A: “in A but not in B”.
If A and B disjoint, draw as two non-overlapping circles.
14
(Venn diagrams appear on slides 15–18.)
Facts from Venn diagrams
In the 3rd diagram, the area outside the two circles is “(everything
not in A) and (everything not in B)”, ie. Ac ∩ Bc.
Now, A ∪ B is everything inside the two circles (“in A or B or
both”), and everything outside the two circles is (A ∪ B)c. Thus
(A ∪ B)c = Ac ∩ Bc.
Similar logic gives
(A ∩ B)c = Ac ∪ Bc :
the elements not in (both A and B) are those (not in A) or (not in
B).
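These two identities (De Morgan's laws) are easy to spot-check with Python's set operations; the sets S, A, B below are just an illustrative choice:

```python
# Spot-check De Morgan's laws with finite sets.
S = {0, 1, 2, 3, 4, 5}
A = {0, 1, 2}
B = {1, 3}

assert S - (A | B) == (S - A) & (S - B)   # (A ∪ B)c = Ac ∩ Bc
assert S - (A & B) == (S - A) | (S - B)   # (A ∩ B)c = Ac ∪ Bc
print("both laws hold")
```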
19
Summary
• Sample space contains outcomes, events (collections of
outcomes)
• Probability measure gives prob. of events (outcomes), between
0 and 1.
• Prob. of whole sample space is 1; probs of disjoint events add.
• Probability model contains sample space, events, probability
measure.
• Venn diagrams: picture of events, disjoint or not.
20
Properties of probability models
Ac denotes event “A does not happen”. (Flipping 2 coins: if
A = {HH}, then Ac = {HT, TH, TT}.)
A and Ac are disjoint, and A ∪ Ac = S (must either flip HH or
not-HH). But P (A ∪Ac) = P (A) + P (Ac) and P (S) = 1. Thus
P (A) + P (Ac) = 1 or P (Ac) = 1 − P (A).
This is often easiest way to find prob. of an Ac.
Coin-flipping ex.: P(A) = 1/4, so P(Ac) = 1 − 1/4 = 3/4.
21
Total probability version 1
Above, A and Ac were disjoint, and A ∪ Ac = S. Disjoint events
whose union is S called partition of S. Let A1, A2, . . . , An be a
partition of S. Suppose we have another event B. What can we say
about P (B)?
22
Draw a Venn diagram. B consists of the bit of B intersecting with
A1, the bit intersecting with A2, etc. In symbols:
B = (A1 ∩ B) ∪ (A2 ∩ B) ∪ · · · ∪ (An ∩ B)
and, using addition rule,
P (B) = P (A1 ∩ B) + P (A2 ∩ B) + · · · + P (An ∩ B);
because the Ai are disjoint, the Ai ∩ B are too.
Called law of total probability.
This is version 1 of total prob. law; see version 2 (more useful) later.
23
When B is a subset of A
When B ⊆ A, A contains all outcomes in B plus more, so expect
P (B) ≤ P (A). Get to that in a minute.
A has 2 parts: the bit intersecting with B, and the bit not. Hence
A = (A ∩ B) ∪ (A ∩ Bc).
But A ∩ B is just B, and these 2 parts of A are disjoint. (Or think
of B,Bc as partition of S).
24
Hence
P (A) = P (B) + P (A ∩ Bc).
Two followups to this, the second because P (A ∩ Bc) ≥ 0:
P (A ∩ Bc) = P (A) − P (B)
P (A) ≥ P (B).
25
Inclusion-exclusion
Back to general A and B.
If A,B disjoint, then P (A ∪ B) = P (A) + P (B). But if not?
Draw a Venn diagram of the general case where A and B overlap.
Shade A and B (next page), find you shaded A ∩ B twice.
Therefore need to subtract once (“exclude”) to get
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
26
27
Or, to show mathematically, need to be careful. A ∪ B consists of:
the bit in A but not B, the bit in B but not A, and the bit in both A
and B. These bits are disjoint, so
A ∪ B = (A ∩ Bc) ∪ (B ∩ Ac) ∪ (A ∩ B)
and
P (A ∪ B) = P (A ∩ Bc) + P (B ∩ Ac) + P (A ∩ B).
28
By total probability, P (A) = P (A ∩ B) + P (A ∩ Bc) so
P (A ∩ Bc) = P (A) − P (A ∩ B);
likewise
P (B ∩ Ac) = P (B) − P (A ∩ B).
Hence
P (A ∪ B) = (P (A) − P (A ∩ B)) + (P (B) − P (A ∩ B))
+ (P (A ∩ B))
= P (A) + P (B) − P (A ∩ B),
as we wanted.
29
Example: suppose that an employee arrives late with probability
0.10, leaves early with probability 0.05, and does both with
probability 0.02. What is the probability that the employee will either
arrive late, leave early, or both?
Let A be “arrives late” and B be “leaves early”. Want P (A ∪ B),
but A and B are not disjoint (both can happen). Thus:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
= 0.10 + 0.05 − 0.02 = 0.13.
30
Summary
• Prob. of “not A” is 1 − P (A).
• If A1, . . . , An is partition of S, can write P (B) in terms of
P (Ai ∩ B) — total probability.
• When B is subset of A, P (A ∩ Bc) = P (A) − P (B) and
P (A) ≥ P (B).
• For any A and B, P (A∪B) = P (A) + P (B)−P (A∩B).
31
Equally likely outcomes
If the outcomes in S are equally likely, and there are |S| of them,
probability of each is 1/|S|. If event A has |A| outcomes in it, additivity says
P(A) = |A|/|S|.
(As a check, P(S) = |S|/|S| = 1 as it should be.)
Advantage of this: can find P (A) by counting number of outcomes
in A.
32
Some examples
• Flipping a fair coin. S = {H,T} so |S| = 2 and
P(H) = P(T) = 1/2.
• Rolling a fair (six-sided) die. Now S = {1, 2, 3, 4, 5, 6} and
|S| = 6, so eg. P({3}) = 1/6 and P({2, 3, 4}) = 3/6.
• Flip a fair coin and roll a fair 6-sided die. S =
{H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}.
Now |S| = 12 and P(s) = 1/12 for any outcome s.
33
Combinatorial principles
Said that we can find a P (A) by counting ways in which A can
happen. Look at some ways to do this.
Multiplication principle
In 3rd example above, flipped a coin and rolled a die. Turned out to
be 2 × 6 = 12 possible outcomes, with 2 possible coin flips and 6
possible die rolls.
Mathematically: have sample spaces S1, . . . , Sk. Get one outcome
from each. Total possible outcomes |S1| × |S2| · · · × |Sk|.
34
Same applies to probabilities, if selections from each sample space
independent. That is, as long as knowing one result tells you
nothing about other results.
Example: flip a fair coin, roll a fair die (independently). P(H) = 1/2,
P({5, 6}) = 2/6, so
P(H and {5, 6}) = (1/2) · (2/6) = 1/6.
Another example: roll 2 fair dice, count total number of spots.
P (total 10)?
6 × 6 = 36 possible outcomes, equally likely. Those giving total 10
are (4, 6), (5, 5), (6, 4), so P(total = 10) = 3/36.
35
Permutations
What if we make more than one selection from the same set? Not
independent any more.
If the order of selection matters, dealing with permutations.
Example: 4 people eating lunch. How many ways can they sit at a
4-person table?
First person to sit down chooses one of the 4 seats. Then 2nd
person chooses one of the 3 seats left, 3rd person chooses 1 of 2
seats left, last person sits in remaining seat. Number of ways:
4 × 3 × 2 × 1 = 24 = 4!, called factorial of 4.
36
In general:
• number of ways to arrange k items, if order matters, is k!
• number of ways to select k items out of n, if order matters, is
n(n − 1) · · · (n − k + 1) = n!/(n − k)!.
Example of latter: a student society wants to choose a President,
Vice-President and Secretary from its 10 members (so order
matters). Number of ways to do this:
10!/7! = 720,
or just 10 × 9 × 8 = 720.
37
Counting subsets
Consider student society above.
Already saw: 720 ways to choose 3 of 10 if order matters. Think of
this as two-stage process:
• first choose 3 members to be society officers
• then decide who gets which role
Know final answer is 720. If can figure out ways to get from 1st
stage to 2nd, can figure out number of ways to do 1st stage.
38
If we have chosen the 3 officers, how many ways can they be
assigned to President, Vice-President and Secretary? This is
3! = 6.
So number of ways to choose the 3 officers from the 10 members
(with order not mattering) is 720/6 = 120.
In general: take the number of permutations, divide by factorial of
number of items, so number of subsets of k items out of n is(
n
k
)
=n!
k!(n − k)!,
read “n choose k”.
39
In our student society, n = 10 and k = 3, giving
C(10, 3) = 10!/(3! 7!) = 120,
as above.
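Python's math module has both counts built in, if you want to check (math.perm and math.comb, available in Python 3.8+):

```python
from math import comb, factorial, perm

# Ordered: choose President, VP, Secretary from 10 members.
assert perm(10, 3) == factorial(10) // factorial(7) == 720

# Unordered: choose the 3 officers, order ignored.
assert comb(10, 3) == perm(10, 3) // factorial(3) == 120
print(perm(10, 3), comb(10, 3))
```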
Example 2: suppose we flip 5 fair coins. What is probability of
getting exactly 3 heads?
2 possible outcomes for each coin flip (H/T). So
2 × 2 × 2 × 2 × 2 = 2^5 = 32 possible equally likely outcomes.
Out of these, C(5, 3) = 10 have 3 heads, so P(3 heads) = 10/32.
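The count of 10 favourable outcomes out of 32 can be checked by listing all of them (a Python sketch, not course material):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=5))         # 2^5 = 32, equally likely
favourable = [o for o in outcomes if o.count("H") == 3]

print(len(favourable), "of", len(outcomes))      # 10 of 32
print(Fraction(len(favourable), len(outcomes)))  # 10/32 reduces to 5/16
```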
40
Example 3: suppose coins no longer fair; let P(H) = θ, so
P(T) = 1 − θ (not 1/2). Outcomes no longer equally likely, but those
outcomes giving 3 heads and 2 tails each have prob. θ^3(1 − θ)^(5−3).
Still C(5, 3) = 10 of them, so P(3 heads) = 10 θ^3(1 − θ)^2.
More generally, P(k heads) when flipping n coins is
C(n, k) θ^k(1 − θ)^(n−k).
The C(n, k) called binomial coefficients: what you get if you expand
(θ + (1 − θ))^n in a binomial series.
41
Sets divided into more than 2 types
Above, two types of outcome: “heads”/”tails”, “selected/not
selected”. Suppose there are now more than two:
Student society again. Now choose 3 officers, 4 members to form a
committee, out of 10 members total.
One approach: select 3 officers out of 10 members, then out of
10 − 3 = 7 remaining members, select 4 to be on committee. Then
multiply to get number of ways as
C(10, 3) · C(7, 4) = 120 × 35 = 4200.
42
Or write as factorials:
(10!/(3! 7!)) · (7!/(4! 3!)) = 10!/(3! 4! 3!) = 4200.
In general, number of ways to divide set of n items into subsets of
sizes k1, k2, . . . , kl with k1 + · · · + kl = n is
C(n; k1, k2, . . . , kl) = n!/(k1! k2! · · · kl!),
called a multinomial coefficient. As with a binomial coefficient,
have to include those items not explicitly selected, eg. those
members who are not officers or on the committee.
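A quick check that the two-stage count and the multinomial coefficient agree:

```python
from math import comb, factorial

two_stage = comb(10, 3) * comb(7, 4)   # officers, then committee
multinomial = factorial(10) // (factorial(3) * factorial(4) * factorial(3))

assert two_stage == multinomial
print(multinomial)   # 4200
```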
43
Summary
• Equally likely outcomes: count number in A and S, then
P (A) = |A|/|S|.
• Multiplication: a ways to do A, b to do B, ab ways to do A and
B.
• Selections of r objects from set of n:
– if order matters, n!/(n − r)! ways (permutation)
– if order does not matter, C(n, r) = n!/(r!(n − r)!) ways
(binomial coeff).
• Extensions to more than 2 types of object in set (multinomial
coeff).
44
Conditional probability and independence
Suppose we flip 3 fair coins. P(first coin H) = 1/2.
But now, suppose as well that someone tells us that 2 of the 3 coins
came up H. Since there were “more heads than average”, this
changes our opinion about P(first coin H): expect it to be higher
than 1/2 now.
45
We know that 2 of the 3 coins came up H, so can ignore any
outcomes without two H – that is, one of HHT, HTH, THH must have
happened. Out of these, 2 of the three have the first coin H. In
symbols:
P(1st coin H | 2 coins H) = 2/3;
given that 2 coins were H, prob. that 1st coin is H is now 2/3. Called
conditional probability.
For general events A and B, work out P(A|B) by saying “out of all
the ways B can happen, how many have A happen as well”, ie.
P(A|B) = P(A ∩ B)/P(B).
46
In coin example, A = {HHH, HHT, HTH, HTT} (“H on 1st
flip”) and B = {HHT, HTH, THH} (“two H”).
A ∩ B = {HHT, HTH}, and there are 2^3 = 8 possible
outcomes altogether, so
P(A|B) = P(A ∩ B)/P(B) = (2/8)/(3/8) = 2/3.
Definition of conditional probability P (B|A) can be rearranged:
P (A ∩ B) = P (A) P (B|A);
in words, prob. of A and B both happening is prob. of A happening
times prob. of B happening given that A has happened. This is
general multiplication rule: can calculate P (A ∩ B) regardless of
whether or not A happening affects chance of B happening.
47
Total probability (again)
Recall total probability: if A1, A2 . . . , An is a partition of S, and B
is some event,
P (B) = P (A1 ∩ B) + P (A2 ∩ B) + · · · + P (An ∩ B).
Replace P (Ai ∩ B) with P (Ai) P (B|Ai):
P (B) = P (A1) P (B|A1)+P (A2) P (B|A2)+· · ·+P (An) P (B|An).
Called law of total probability, version 2 .
48
Often use this law in two-stage systems: make 1 choice, then
depending on that choice, make a 2nd choice.
Example: suppose pot 1 has 3 red balls and 2 blue balls, and pot 2
has 1 red ball and 3 blue balls. If I choose pot 1 with prob. 2/3, what is
prob. of drawing a red ball?
Let B be “drawing a red ball” and let Ai be “choose pot i”.
Then P(B|A1) = 3/5 (“if I choose pot 1, there are 3 red balls out of
5 to choose”), and P(B|A2) = 1/4. Thus
P(B) = (2/3) · (3/5) + (1/3) · (1/4) = 29/60.
Near 50-50 chance: likely to choose pot with more red balls, but
may choose pot with very few red balls.
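A short Python check of this computation, using exact fractions:

```python
from fractions import Fraction as F

p_pot = {1: F(2, 3), 2: F(1, 3)}            # P(choose pot i)
p_red_given_pot = {1: F(3, 5), 2: F(1, 4)}  # P(red | pot i)

# Law of total probability, version 2.
p_red = sum(p_pot[i] * p_red_given_pot[i] for i in (1, 2))
print(p_red)   # 29/60
```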
49
Bayes’ theorem
How do we reverse conditional probabilities, ie. from P(B|A) how
do we get P(A|B)? Using definitions:
P(A|B) = P(A ∩ B)/P(B) = (P(B ∩ A)/P(A)) · (P(A)/P(B)) = (P(A)/P(B)) P(B|A).
Formula called Bayes’ theorem.
Revisit previous example: what is prob. I picked pot 1, given that I
chose a red ball?
In notation of that example: want P(A1|B) = (P(A1)/P(B)) P(B|A1).
50
P(A1) = 2/3 (given before), P(B|A1) = 3/5, P(B) = 29/60 (found
before). Thus
P(A1|B) = (2/3)/(29/60) · (3/5) = 24/29.
Very likely to have picked pot 1 if we chose red ball.
Often, as here, need law of total probability to figure out P (B).
51
Example: Testing for disease
Let A be event “have particular disease”, B be event “test positive
for that disease”. Typically P (A) = 0.01,
P (B|A) = 0.95, P (B|Ac) = 0.10: rare disease, test tends to be
accurate. If we pick a person at random:
P (B) = P (A)P (B|A) + P (Ac)P (B|Ac)
= (0.01)(0.95) + (0.99)(0.10) = 0.1085
about 11% chance of testing positive.
52
Real interest: prob. of having disease if you test positive:
P(A|B) = (P(A)/P(B)) P(B|A) = (0.01/0.1085) · 0.95 = 0.0876:
for an apparently accurate test, this is surprisingly small.
Reason: for a rare disease, large majority of positive tests will come
from people who don’t have disease (even if positive tests in that
case rare), because disease even rarer.
Compare numbers on previous page: of 100 people, about 1 will
have disease, but about 11 will test positive (so about 10 of those
are false alarms).
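The whole disease-testing calculation, as a short Python sketch:

```python
p_a = 0.01            # P(A): have the disease
p_b_given_a = 0.95    # P(B|A): test positive if diseased
p_b_given_ac = 0.10   # P(B|Ac): false positive rate

# Total probability, then Bayes' theorem.
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_ac
p_a_given_b = p_a * p_b_given_a / p_b

print(round(p_b, 4), round(p_a_given_b, 4))   # 0.1085 0.0876
```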
53
Independence of events
Let’s recycle an old example:
• Flip a fair coin and roll a fair 6-sided die. S =
{H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}.
|S| = 12 and P(s) = 1/12 for any outcome s.
Suppose we know the coin came up T. Then eg.
P(die = 3 | coin = T) = 1/6,
since we only look at the last 6 outcomes.
But P(die = 3) = 1/6 as well – that is, knowing coin was T told us
nothing extra about die prob. In other words, coin and die results are
independent: knowing one tells us nothing about the other.
54
Mathematically: suppose A and B are independent events. Then
P (A ∩ B) = P (A) P (B|A) = P (A) P (B).
Make this definition: if P (A ∩ B) = P (A) P (B), then A and B
are independent.
55
With more than two events, gets more complicated. Eg. with 3
events, A,B,C , need all of these true:
P (A ∩ B) = P (A) P (B)
P (A ∩ C) = P (A) P (C)
P (B ∩ C) = P (B) P (C)
P (A ∩ B ∩ C) = P (A) P (B) P (C)
Let S = {1, 2, 3, 4} equally likely, let A = {1, 2}, B = {1, 3},
C = {1, 4}. Then A,B,C satisfy first 3 above, but not 4th, so not
independent (called pairwise independent).
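This counterexample can be verified directly (a Python sketch, with P computed by counting equally likely outcomes):

```python
from fractions import Fraction

S = {1, 2, 3, 4}                 # equally likely outcomes
A, B, C = {1, 2}, {1, 3}, {1, 4}

def P(E):
    return Fraction(len(E), len(S))

# Pairwise independence holds...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# ...but full independence fails.
print(P(A & B & C), "vs", P(A) * P(B) * P(C))   # 1/4 vs 1/8
```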
56
Example
Events A and B have
P (A) = 0.4, P (B) = 0.2, P (A ∩ B) = 0.1. Are A and B
independent?
Check: P(A) P(B) = (0.4)(0.2) = 0.08 ≠ 0.1 = P(A ∩ B),
so A and B are not independent. In fact,
P(A|B) = P(A ∩ B)/P(B) = 0.1/0.2 = 0.5,
so if B happens, A is more likely to happen as well.
57
Independence and disjointness
Are independent events and disjoint events the same thing?
Short answer: NO!
Long answer: Two events A and B can be:
• independent , if P (A ∩ B) = P (A) P (B)
• related , otherwise (knowledge of A tells you about P (B)).
Related events can be:
– disjoint if P (A ∩ B) = 0,
– overlapping if P (A ∩ B) > 0.
For disjoint events, if A happens, B cannot happen: that is,
knowing that A happens tells you that P (B|A) = 0.
58
Summary
• Conditional probability P (A|B) is prob. of A if we know that B
has happened. P (A|B) = P (A ∩ B)/P (B).
• If A1, . . . , An partition of S, law of total prob. version 2 gives
P (B) in terms of P (B|Ai) P (Ai).
• Bayes’ theorem gives P (A|B) in terms of P (B|A).
• Total prob. and Bayes’ theorem are key tools for working with
conditional probs.
• A and B independent if P(A|B) = P(A), or equivalently
P(A ∩ B) = P(A) P(B).
• Independent and disjoint events not the same!
59
Random Variables and
Distributions
60
Random Variables
Suppose we flip two (fair) coins, and note whether each coin
(ordered) comes up H or T.
• Sample space is S = {HH,HT, TH, TT}.
• Probability measure is 1/4 for each of 4 outcomes.
What about “number of heads”? Could be 0, 1 or 2:
• P(0 heads) = P(TT) = 1/4
• P(1 head) = P(TH) + P(HT) = 1/2
• P(2 heads) = P(HH) = 1/4.
61
“Number of heads” is a random variable: a function from S to R. That
is, given outcome, get value of random variable.
Random variables can be any function from S to R. If
S = {rain, snow, clear}, random variable X could be
X(rain) = 3
X(snow) = 6
X(clear) = −2.7.
62
Some more examples of random variables
Roll a fair 6-sided die, so that S = {1, 2, 3, 4, 5, 6}. Let X be the
number of spots showing, let Y be square of number of spots. If s is
number of spots, on a particular roll, let W = s + 10, let
U = s2 − 5s + 3, etc.
In previous situation, let C = 3 regardless of s. C is constant
random variable.
Suppose have event A, only interested in whether A happens or
not. Define indicator random variable I to be 1 if A happens, 0
otherwise. Example (rolling die) I6(s) = 1 if s = 6, 0 otherwise.
63
≥, =, sum for random variables
Imagine rolling a fair die again, S = {1, 2, 3, 4, 5, 6}. Let X = s,
and let Y = X + I6.
X is number of spots, I6 is 1 if you roll a 6 and 0 otherwise. What
does Y mean?
Eg. roll a 4, X = 4, Y = 4 + 0 = 4. But if you roll a 6,
Y = 6 + 1 = 7. (That is, Y is the number of spots plus a “bonus
point” if you roll a 6.)
Sum of random variables (like Y here) for any outcome is sum of
their values for that outcome.
64
Also: if s = 1, 2, 3, 4, 5, values of X and Y are same. If s = 6,
X < Y .
Say that random variable X ≤ Y if value of X ≤ value of Y for
every single outcome. True in example.
Say that random variable X = Y if value of X equals value of Y
for every single outcome. Not true in example (different when
outcome is s = 6).
For constant random variable c, X ≤ c if all possible values of X
are ≤ c.
65
When S is infinite
When S infinite, random variable can take infinitely many different
values (but may not).
Example: S = {1, 2, 3, . . .}. If X = s, X takes all infinitely many
values in S. But define Y = 3 if s ≤ 4, Y = 2 if 4 < s ≤ 10,
Y = 1 when s > 10. Y has only finitely many (3) different values.
66
Summary
• Random variable is function from S to R: from outcome, get
real number.
• Indicator IA is 1 if event A happens, 0 if not.
• Random variable X ≥ Y if value of X ≥ value of Y for all
outcomes. Same idea for =.
• Random variable X + Y for outcome s is X(s) + Y (s).
• When S infinite, random variable may or may not take infinitely
many different values.
67
Distributions of random variables
A random variable can be described by listing all its possible values
and their probabilities. Started this chapter with a coin-flipping
example:
Flip two (fair) coins, and note whether each coin (ordered) comes up
H or T.
Let X be “number of heads”. Could be 0, 1 or 2:
• P(X = 0) = P(TT) = 1/4
• P(X = 1) = P(TH) + P(HT) = 1/2
• P(X = 2) = P(HH) = 1/4.
Called the distribution of X.
68
Notice how can talk about P (X = s) for all s. In this case, listing
all the s for which P (X = s) > 0 describes distribution.
Consider now random variable U taking values in [0, 1] with
P (a ≤ U ≤ b) = b − a
for 0 ≤ a ≤ b ≤ 1. Try to figure out eg. P (U = 0.4): is
P (0.4 ≤ U ≤ 0.4) = 0.4 − 0.4 = 0.
Can’t define probability of a value, but still can define probability of
landing in subset of R (namely interval).
69
To account for all of this, define distribution of random variable X
as: collection of probabilities P (X ∈ B) for all subsets B of
real numbers .
Works for both examples above. Eg. in first example,
P(X ≤ 1) = P(X = 0) + P(X = 1) = 3/4.
In practice, often messy to define probabilities for “all possible
subsets”. Think first about examples like 1st, “discrete”, where can
talk about probabilities of individual values. Then consider
“continuous” case (like 2nd), where have to look at intervals.
70
Discrete distributions
Often it makes sense to talk about individual probs, P (X = x).
When all probability included in these probs, ie.
∑_x P(X = x) = 1,
don’t need to look at anything else.
Another way to look at it: there is a finite or countable set of x
values, x1, x2, . . ., each having probability pi = P(X = xi), such
that ∑_i pi = 1.
Either of these is definition of discrete distribution.
71
Compare case where P(a ≤ X ≤ b) = b − a: P(X = x) = 0
for all x, so not discrete distribution.
Another example: suppose X = −1 with prob 1/2, and for
0 ≤ a ≤ x ≤ b ≤ 1, P(a ≤ X ≤ b) = (b − a)/2. Can talk
about P(X = −1) = 1/2, but P(X = x) = 0 for any other x. So
not a discrete distribution.
Notation for discrete distributions (emphasize function):
pX(x) = P (X = x)
called probability function or mass function.
Now look at some important discrete distributions.
72
Degenerate distributions
If random variable C is constant, equal to c, then P(C = c) = 1
and P(C = x) = 0 for any x ≠ c. Since
∑_x P(C = x) = P(C = c) = 1, is a proper (though dull)
discrete distribution. Called degenerate distribution or point
mass.
73
Bernoulli distribution
Flip a coin once, let X be number of heads (has to be 0 or 1).
Suppose P (head) = θ, so P (tail) = 1 − θ. Then
pX(1) = P (X = 1) = P (head) = θ;
pX(0) = P (X = 0) = P (tail) = 1 − θ.
X said to have Bernoulli distribution; write X ∼ Bernoulli(θ).
Application: any kind of “success/failure”. Denote “success” by 1,
“failure” by 0. Or selection from population with two kinds of
individual like male/female, agree/disagree.
74
Binomial distribution
Now suppose we flip the coin n times (independently) and again
count number of heads. Probability of exactly x heads is
pX(x) = P(X = x) = C(n, x) θ^x(1 − θ)^(n−x).
X said to have binomial distribution, written
X ∼ Binomial(n, θ).
Applications: as for Bernoulli. Eg. randomly select 100 Canadian
adults, let X be number of females.
75
Let X ∼ Binomial(4, 0.5), Y ∼ Binomial(4, 0.2). Then
x P (X = x) P (Y = x)
0 0.0625 0.4096
1 0.2500 0.4096
2 0.3750 0.1536
3 0.2500 0.0256
4 0.0625 0.0016
X probs symmetric about x = 2, Y more likely 0 or 1.
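The table can be reproduced from the binomial formula (a Python sketch; binom_pmf is just an illustrative helper name):

```python
from math import comb

def binom_pmf(n, theta, x):
    """P(X = x) for X ~ Binomial(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

# Reproduce the table for X ~ Binomial(4, 0.5), Y ~ Binomial(4, 0.2).
for x in range(5):
    print(x, round(binom_pmf(4, 0.5, x), 4), round(binom_pmf(4, 0.2, x), 4))
```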
Bernoulli and binomial count successes in fixed number of trials.
Also look at waiting time problem: fix successes, count failures
observed to get them.
76
Geometric distribution
Same situation as for binomial: number of trials, independent, equal
(head) prob. θ. Let X now be number of tails before 1st head.
X = k means we observe k tails, and then a head, so
pX(k) = P(X = k) = (1 − θ)^k θ, k = 0, 1, 2, . . .
X can be as large as you like, since you might wait a long time for
the first head. (Compare binomial: can’t have more than n
successes in n trials).
X has geometric distribution, prob. θ, written X ∼ Geometric(θ).
Applications: number of working light bulbs tested until first one that
fails; number of outs (non-hits) for baseball player until first hit.
77
Examples: suppose X1 ∼ Geometric(0.8) and
X2 ∼ Geometric(0.5).
k P (X1 = k) P (X2 = k)
0 0.80000 0.50000
1 0.16000 0.25000
2 0.03200 0.12500
3 0.00640 0.06250
4 0.00128 0.03125
When θ larger, 1st success probably sooner.
Also: probabilities form geometric series, hence the name.
78
Negative binomial distribution
To take geometric one stage further: Let r be a fixed number, let Y
be the number of tails before the r-th head.
Y = k only if we observe r − 1 heads and k tails, in any order,
followed by a head (must finish with a head). There are r + k − 1 flips
before the final head. Prob. of this is
pY(k) = P(Y = k) = C(r + k − 1, r − 1) θ^(r−1)(1 − θ)^k θ
= C(r + k − 1, k) θ^r(1 − θ)^k.
Write this Y ∼ Negative-Binomial(r, θ).
79
Applications: can re-use geometric distribution examples. Thus:
number of working lightbulbs tested until 5th non-working one
encountered; number of outs (non-hits) until baseball player
achieves 10th hit.
Numerical examples: let Y1 ∼ Negative-Binomial(4, 0.8) and
Y2 ∼ Negative-Binomial(3, 0.5).
80
k P (Y1 = k) P (Y2 = k)
0 0.40960 0.12500
1 0.32768 0.18750
2 0.16384 0.18750
3 0.06553 0.15625
4 0.02293 0.11718
5 0.00734 0.08203
6 0.00220 0.05468
With Y1, “heads” are likely so probably won’t see many tails before
4th H. With Y2, heads not so likely but only need to see 3 before
stopping.
81
General note
For geometric and negative binomial, some books count total
number of trials until first (or r-th) head. Gives random variables
1 + X and r + Y as defined above.
82
Poisson distribution
Suppose X ∼ Binomial(n, λ/n). We’ll think of λ as being fixed
and see what happens as n → ∞. That is, what if the number of
trials gets very large but the prob. of success gets very small?
Then
P(X = x) = C(n, x) (λ/n)^x (1 − λ/n)^(n−x)
= [n!/(x!(n − x)! n^x)] λ^x (1 − λ/n)^n (1 − λ/n)^(−x).
83
Thinking of x as fixed (for now) and letting n → ∞: the behaviour
of the factorials is determined by the highest power of n. Thus n!
behaves like n^n, (n − x)! behaves like n^(n−x), and hence
n!/((n − x)! n^x) → 1.
Also,
(1 − λ/n)^(−x) → 1
because 1 − λ/n → 1 and raising it to a fixed power changes
nothing.
84
Finally,
lim_{n→∞} (1 − λ/n)^n
is a famous limit from calculus; it is e^(−λ). Thus
lim_{n→∞} P(X = x) = e^(−λ) λ^x / x!.
A random variable Y with P(Y = y) = e^(−λ) λ^y / y! is said to have
a Poisson(λ) distribution, written Y ∼ Poisson(λ).
The Poisson distribution is a good model for rare events: that is,
events which have a large number of “chances” to happen, but have
a very small probability of happening at each “chance”. λ represents
“rate” at which events happen; doesn’t have to be integer.
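The limit can be seen numerically; here λ = 2 and x = 3 are just illustrative choices (compare the value 0.1804 in the table below):

```python
from math import comb, exp, factorial

lam, x = 2.0, 3
poisson = exp(-lam) * lam**x / factorial(x)   # e^(-λ) λ^x / x!

# Binomial(n, λ/n) probability of x successes, for growing n.
for n in (10, 100, 10000):
    binom = comb(n, x) * (lam / n)**x * (1 - lam / n)**(n - x)
    print(n, round(binom, 6))

print("limit:", round(poisson, 6))
```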
85
Applications of Poisson distribution are things like: number of house
fires in a city on a given day, number of phone calls arriving at a
switchboard in an hour, number of radioactive events recorded by a
Geiger counter.
Let X ∼ Poisson(2), Y ∼ Poisson(0.8):
86
x P (X = x), λ = 2 P (Y = x), λ = 0.8
0 0.1353 0.4493
1 0.2707 0.3595
2 0.2707 0.1438
3 0.1804 0.0383
4 0.0902 0.0077
5 0.0361 0.0012
. . . . . .
• When λ is an integer, highest prob at that integer and the next lower one.
• Else highest prob at the next lower integer (λ < 1 ⇒ P(X = 0) highest).
87
Hypergeometric distribution
Introduction
Imagine a pot containing 10 balls, 7 red and 3 green. Prob. of
drawing a red ball is 0.7 (7/10). If we put the ball drawn back in the
pot, prob. of drawing a red ball the next time is still 0.7.
Thus, drawing with replacement, number of red balls in 4 draws
R ∼ Binomial(4, 0.7). Therefore
P(R = 4) = C(4, 4) (0.7)^4(0.3)^0 = 0.2401.
88
Now suppose we draw without replacement: that is, don’t put balls
back in pot after drawing. If we draw a red ball 1st time, there are
only 6 red balls out of 9 balls left.
Should be harder to draw 4 red balls in 4 draws because there are
fewer left after we draw each one: now
P(R = 4) = (7/10) · (6/9) · (5/8) · (4/7) = 0.1667.
This is not so bad, but suppose we now want P (R = 3), say?
Need general principle for drawing without replacement.
89
The hypergeometric formula
Introduce symbols: suppose draw n balls without replacement out
of a pot containing N total. Suppose M of the balls in the pot are
red. Let X be number of red balls drawn. What is P (X = x)?
Need to count ways:
• Number of ways to draw n balls out of N in pot: C(N, n).
• Number of ways to draw x red balls out of M red balls in pot: C(M, x).
• Number of ways to draw n − x green balls out of N − M green
balls in pot: C(N − M, n − x).
90
P(X = x) is number of ways to draw the red and green balls
divided by number of ways to draw n balls out of N:
P(X = x) = C(M, x) C(N − M, n − x) / C(N, n).
X said to have hypergeometric distribution:
X ∼ Hypergeometric(N, M, n). Checks:
M + (N − M) = N and x + (n − x) = n. Restrictions on x?
• Number of red balls: x ≤ n and x ≤ M, so x ≤ min(n, M).
• Number of green balls: n − x ≤ n and n − x ≤ N − M, so
x ≥ 0 and x ≥ n + M − N, so x ≥ max(0, n + M − N).
91
Example 1: let X ∼ Hypergeometric(10, 7, 4):
x P (X = x)
0 0.0000
1 0.0333
2 0.3000
3 0.5000
4 0.1667
10 balls in pot, 7 red, 4 drawn. Cannot draw 0 red, because that
would mean drawing 4 green, and only 3 in pot. (Also cannot draw
more than 4 red because only drawing 4).
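The table comes straight from the formula (a Python sketch; hyper_pmf is an illustrative helper, and math.comb conveniently returns 0 for impossible cases like choosing 4 of the 3 green balls):

```python
from fractions import Fraction
from math import comb

def hyper_pmf(N, M, n, x):
    """P(X = x) for X ~ Hypergeometric(N, M, n)."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

# X ~ Hypergeometric(10, 7, 4): 10 balls, 7 red, draw 4.
for x in range(5):
    print(x, float(hyper_pmf(10, 7, 4, x)))
```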
92
Example 2: let Y ∼ Hypergeometric(5, 3, 4):
y P (Y = y)
0 0.0
1 0.0
2 0.6
3 0.4
4 0.0
5 0.0
5 balls in pot, 3 red and 2 green, draw 4. Cannot draw more than 3
red. But also cannot draw only 0 or 1 red, because that would mean
drawing 4 or 3 green, and aren’t that many in the pot.
93
Applications
Anything that involves drawing without replacement from a finite set
of elements. Includes sampling, eg. selecting people to include in
opinion poll. (Don’t want to select same person twice). People
sampled from might agree (red ball) or disagree (green ball) with
question asked.
94
Large N
If N large, might imagine that it doesn’t matter much whether you
replace balls in pot or not. In other words, for large N , binomial
would be decent approximation. Turns out to be true:
If X ∼ Hypergeometric(N,M, n) and N large, then X has
approx. same distribution as Y ∼ Binomial(n,M/N).
95
As an example of this, suppose
Y1 ∼ Hypergeometric(20, 14, 10),
Y2 ∼ Hypergeometric(100, 70, 10),
Y3 ∼ Hypergeometric(1000, 700, 10). Number of balls in pot
increasing, fraction of red balls always 0.7, so heading to
Y ∼ Binomial(10, 0.7).
Results: Y1 probs not near Y at all; Y2 better, Y3 better still.
96
y P (Y1 = y) P (Y2 = y) P (Y3 = y) P (Y = y)
0 0.000000 0.000002 0.000005 0.000006
1 0.000000 0.000058 0.000128 0.000138
2 0.000000 0.000817 0.001376 0.001447
3 0.000000 0.006438 0.008739 0.009002
4 0.005418 0.031451 0.036255 0.036757
5 0.065015 0.099637 0.102644 0.102919
6 0.243808 0.207578 0.200839 0.200121
7 0.371517 0.281163 0.268171 0.266828
8 0.243808 0.237232 0.233862 0.233474
9 0.065015 0.112708 0.120277 0.121061
10 0.005418 0.022917 0.027704 0.028248
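The shrinking gap in the table can be checked numerically. A Python sketch (helper names `hyper_pmf`, `binom_pmf` are illustrative) computing the largest difference between each hypergeometric pmf and the Binomial(10, 0.7) pmf:

```python
from math import comb

def hyper_pmf(y, N, M, n):
    # hypergeometric pmf, as on slide 90
    if y < max(0, n + M - N) or y > min(n, M):
        return 0.0
    return comb(M, y) * comb(N - M, n - y) / comb(N, n)

def binom_pmf(y, n, theta):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# largest gap between Hypergeometric(N, 0.7N, 10) and Binomial(10, 0.7)
for N in (20, 100, 1000):
    M = 7 * N // 10  # integer arithmetic avoids float rounding of 0.7*N
    gap = max(abs(hyper_pmf(y, N, M, 10) - binom_pmf(y, 10, 0.7))
              for y in range(11))
    print(N, round(gap, 4))  # 0.1047, then 0.0143, then 0.0013
```

The gap shrinks as N grows, as the approximation claims.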
97
Getting one probability from the previous
one
When calculating a number of probs from a distribution, often
easiest to:
• calculate the first prob (often P (X = 0))
• calculate the next prob from the previous one
Often P (X = 0) easy case of general formula, and getting next
prob from previous has easy formula, as we’ll see.
98
Geometric distribution
Here, P(X = k) = θ(1 − θ)^k, so:
• P(X = 0) = θ(1 − θ)⁰ = θ
• P(X = k+1)/P(X = k) = θ(1 − θ)^{k+1} / (θ(1 − θ)^k) = 1 − θ, so
P(X = k + 1) = (1 − θ)P(X = k).
Eg. if θ = 0.8, 1 − θ = 0.2 and
P (X = 0) = 0.8
P (X = 1) = (0.2)(0.8) = 0.16
P (X = 2) = (0.2)(0.16) = 0.032
P (X = 3) = (0.2)(0.032) = 0.0064 etc.
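The recursion is one multiplication per probability. A Python sketch of the loop (function name ours):

```python
def geometric_probs(theta, kmax):
    """P(X = k), k = 0..kmax, via P(X = k+1) = (1 - theta) * P(X = k)."""
    p = theta                 # P(X = 0) = theta
    probs = [p]
    for _ in range(kmax):
        p *= 1 - theta        # one multiplication per new probability
        probs.append(p)
    return probs

print([round(p, 4) for p in geometric_probs(0.8, 3)])
# [0.8, 0.16, 0.032, 0.0064], the values above
```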
99
Poisson distribution
Here,
P(X = k) = e^{−λ}λ^k/k!,
so
P(X = 0) = e^{−λ}λ⁰/0! = e^{−λ}
and
P(X = k+1)/P(X = k) = [e^{−λ}λ^{k+1}/(k + 1)!] · [k!/(e^{−λ}λ^k)] = λ/(k + 1).
100
Eg. if λ = 1.7, to the accuracy shown:
P(X = 0) = e^{−1.7} = 0.1827
P(X = 1) = (0.1827)(1.7)/1 = 0.3106
P(X = 2) = (0.3106)(1.7)/2 = 0.2640
P(X = 3) = (0.2640)(1.7)/3 = 0.1496
and so on.
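The same recursion in Python (function name ours, standard library only):

```python
from math import exp

def poisson_probs(lam, kmax):
    """P(X = k), k = 0..kmax, via P(X = k+1) = P(X = k) * lam / (k + 1)."""
    p = exp(-lam)             # P(X = 0) = e^{-lam}
    probs = [p]
    for k in range(kmax):
        p *= lam / (k + 1)    # ratio from the previous slide
        probs.append(p)
    return probs

print([round(p, 4) for p in poisson_probs(1.7, 3)])
# [0.1827, 0.3106, 0.264, 0.1496]
```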
101
Binomial distribution
This time,
P(X = k) = C(n, k) θ^k (1 − θ)^{n−k},
so
P(X = 0) = C(n, 0) θ⁰ (1 − θ)^n = (1 − θ)^n
102
and
P(X = k+1)/P(X = k) = [C(n, k+1) θ^{k+1} (1 − θ)^{n−(k+1)}] / [C(n, k) θ^k (1 − θ)^{n−k}]
= [n!/((k + 1)!(n − (k + 1))!)] · [k!(n − k)!/n!] · θ/(1 − θ)
= (n − k)θ / ((k + 1)(1 − θ)),
so
P(X = k + 1) = P(X = k) · (n − k)θ / ((k + 1)(1 − θ)).
103
Example: n = 3, θ = 0.2; have to keep track of what k and k + 1
are:
P(X = 0) = (1 − 0.2)³ = 0.512
P(X = 1) = (0.512)(3 − 0)(0.2)/((1)(1 − 0.2)) = 0.384
P(X = 2) = (0.384)(3 − 1)(0.2)/((2)(1 − 0.2)) = 0.096
P(X = 3) = (0.096)(3 − 2)(0.2)/((3)(1 − 0.2)) = 0.008
P (X = 4) will be 0 (correct since n = 3).
104
Negative binomial
Now, P(X = k) = C(r − 1 + k, k) θ^r (1 − θ)^k, so
P(X = 0) = C(r − 1, 0) θ^r (1 − θ)⁰ = θ^r,
P(X = k+1)/P(X = k) = [C(r + k, k+1) θ^r (1 − θ)^{k+1}] / [C(r − 1 + k, k) θ^r (1 − θ)^k]
= (1 − θ) · [(r + k)!/((k + 1)!(r − 1)!)] · [k!(r − 1)!/(r − 1 + k)!]
= (1 − θ)(r + k)/(k + 1).
105
As an example, r = 3, θ = 0.9:
P(X = 0) = 0.9³ = 0.729
P(X = 1) = (0.729)(0.1)(3)/1 = 0.2187
P(X = 2) = (0.2187)(0.1)(4)/2 = 0.04374
P(X = 3) = (0.04374)(0.1)(5)/3 = 0.00729
and so on.
106
Hypergeometric distribution
This one has a lot of factorials to deal with:
P(X = k) = C(M, k) C(N − M, n − k) / C(N, n)
with k ≥ max(0, n + M − N).
If n + M − N ≤ 0, start with k = 0 and
P(X = 0) = C(M, 0) C(N − M, n) / C(N, n) = (N − M)!(N − n)! / (N!(N − M − n)!).
107
Otherwise, start with k = n + M − N and
P(X = n + M − N) = C(M, n + M − N) C(N − M, N − M) / C(N, n) = M!n! / ((n + M − N)!N!)
after some algebra. And
P(X = k+1)/P(X = k) = (M − k)(n − k) / ((k + 1)(N − M − n + k + 1))
after a lot of algebra.
108
Example: N = 6, M = 4, n = 3. Since 3 + 4 − 6 = 1 > 0,
start with k = 1:
P(X = 1) = 4!3!/(1!6!) = 0.2
P(X = 2) = (0.2)(4 − 1)(3 − 1)/((2)(6 − 4 − 3 + 2)) = (0.2)(3)(2)/((2)(1)) = 0.6
P(X = 3) = (0.6)(4 − 2)(3 − 2)/((3)(6 − 4 − 3 + 3)) = (0.6)(2)(1)/((3)(2)) = 0.2
and remaining probs are 0.
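The starting value plus ratio can be coded directly; a Python sketch (names ours), using one binomial-coefficient evaluation for the start and the ratio thereafter:

```python
from math import comb

def hyper_probs(N, M, n):
    """All nonzero P(X = k) for X ~ Hypergeometric(N, M, n), computed
    recursively from the smallest possible k using the ratio above."""
    k = max(0, n + M - N)
    p = comb(M, k) * comb(N - M, n - k) / comb(N, n)  # starting probability
    out = {k: p}
    while k < min(n, M):
        p *= (M - k) * (n - k) / ((k + 1) * (N - M - n + k + 1))
        k += 1
        out[k] = p
    return out

print({k: round(p, 4) for k, p in hyper_probs(6, 4, 3).items()})
# {1: 0.2, 2: 0.6, 3: 0.2}, as in the example
```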
109
Using Minitab for probability distributions
Calculating prob. distributions by hand can be annoying. Easier to
use software.
We will use statistical software Minitab for this.
Minitab does all kinds of statistical calculations. Available:
• in computer labs on campus
• bundled with textbook
• available via e-academy.com.
See Minitab manual (Evans/Rosenthal) for more.
110
Discrete distributions: binomial
See manual, chapter 4.
Suppose X ∼ Binomial(20, 0.4).
P (X = 7): select Calc, Probability Distributions, Binomial. Brings
up dialog box: enter 20 as Number of Trials, 0.4 as prob. of success.
Make sure Probability radio button checked. Click on Input
Constant, enter 7. Click OK.
111
Probability Density Function
Binomial with n = 20 and p = 0.400000
x P( X = x)
7.00 0.1659
P (X = 7) = 0.1659.
112
P (X ≤ 7) similar, but now click Cumulative Prob. Get this:
Cumulative Distribution Function
Binomial with n = 20 and p = 0.400000
x P( X <= x)
7.00 0.4159
so that P (X ≤ 7) = 0.4159. (Easier than calculating all probs
and adding up.)
113
Poisson
Suppose X ∼ Poisson(3). What is P (X ≤ 5)?
Minitab: Calc, Probability Distributions, Poisson. Select Cumulative
Probability, enter λ value (3) in Mean box, click Input Constant and
enter 5. Results:
Cumulative Distribution Function
Poisson with mu = 3.00000
x P( X <= x)
5.00 0.9161
P (X ≤ 5) = 0.9161.
114
Listing probabilities
Suppose X ∼ Binomial(4, 0.6). To list all probs, enter 0,1,2,3,4
in column C1 (lower half of screen), select Binomial as before.
Select Probability, 4 for Number of Trials, 0.6 for prob. Click Input
Column, type C1. Click OK:
Binomial with n = 4 and p = 0.600000
x P( X = x)
0.00 0.0256
1.00 0.1536
2.00 0.3456
3.00 0.3456
4.00 0.1296
115
Summary
• Distribution of random variable is collection of probabilities
P (X ∈ B) for all subsets B of R.
• If can describe whole distribution by giving P(X = s) (so that Σ_s P(X = s) = 1), distribution called discrete.
• Also use notation pX(s) for P(X = s).
• Degenerate distribution has pX(c) = 1 for some c, and
pX(s) = 0 for s ≠ c (certain to be c).
116
• Bernoulli distribution describes number of “successes” in 1 trial
(must be 0 or 1).
• Binomial distribution describes number of “successes” in n
independent trials, each having equal success prob.
• Geometric distribution describes waiting time (number of
failures) before first success (under same conditions as
binomial).
• Negative-binomial distribution describes waiting time (number of
failures) until r-th success, under same conditions as binomial.
117
• Poisson distribution describes number of occurrences of “rare”
event over fixed time or space. (Limit as n → ∞ and θ → 0
such that nθ → λ).
• Hypergeometric distribution describes number of successes in
fixed number of trials when sampling without replacement.
• Can devise (simpler) formulas for obtaining one probability in a
discrete distribution from previous ones.
• Can use Minitab to calculate probabilities from discrete
distributions.
118
Continuous distributions
Suppose, for random variable U ,
P (a ≤ U ≤ b) = b − a
for 0 ≤ a ≤ b ≤ 1.
Is legitimate probability since 0 ≤ b − a ≤ 1. But
P (U = a) = a − a = 0 for any a, so not discrete distribution.
Probability attached to intervals (or more generally subsets of R).
119
Here, probability of landing in interval [a, b] depends only on length
of interval b − a. Any intervals of same length have same
probability: no part of [0, 1] “more likely” than any other. Hence
name uniform distribution for U .
Try to get at idea of “probability of being near x”. Start with
P (x ≤ U ≤ x + δ) and let δ → 0. This will head to 0, not helpful!
Try
P(x ≤ U ≤ x + δ)/δ = δ/δ → 1
as δ → 0. This gets at idea of “uniformity” of distribution —
probability of being “near” x same for any x ∈ [0, 1].
120
Cumulative distribution function
Define the cumulative distribution function (cdf) of any random
variable X to be
FX(x) = P (X ≤ x).
This is defined for any random variable, continuous or discrete
(though in discrete case, individual probabilities easier to work with).
Result:
P (u ≤ X ≤ v) = P (X ≤ v)−P (X ≤ u) = FX(v)−FX(u).
In words: prob. of being between u and v is prob. of being less than
v, minus prob. of being less than u as well.
121
Properties of FX(x):
• 0 ≤ FX(x) ≤ 1 (FX is a probability)
• FX(x) ≤ FX(y) whenever x ≤ y (nondecreasing): FX(x)
“collects” more probability as x increases
• limx→+∞ FX(x) = 1 (“certain to be less than +∞”)
• limx→−∞ FX(x) = 0 (“cannot be less than −∞”).
122
Density function
Try to generalize what we did for uniform distribution above:
P(x ≤ X ≤ x + δ)/δ = (FX(x + δ) − FX(x))/δ.
As δ → 0, this tends to F ′X(x), the derivative of FX(x).
Suggests that F ′X(x) will be useful: call it fX(x), the density
function of X . In sense defined above, says how likely you are to
observe value “near” x.
(Careful: density function not probability, so “how likely”
interpretation only informal.)
123
Getting probabilities from density function
Two facts: fX(x) derivative of FX(x), and FX(x) gives
probabilities. Thus probs must be integral of fX(x):
P(u ≤ X ≤ v) = ∫_u^v fX(x) dx.
Example: uniform distribution. Know that eg.
P(0.2 ≤ U ≤ 0.6) = 0.6 − 0.2 = 0.4. Compare:
FU(x) = P(U ≤ x) = P(0 ≤ U ≤ x) = x − 0 = x, so
density function is fU(x) = F′U(x) = 1 (as before). Hence
P(0.2 ≤ U ≤ 0.6) = ∫_{0.2}^{0.6} 1 dx = [x]_{0.2}^{0.6} = 0.6 − 0.2 = 0.4.
124
Summary
• Cumulative distribution function FX(x) = P (X ≤ x)
defined for any random variable.
• Density function fX(x) = F′X(x), if FX differentiable (it
usually is). Must have fX(x) ≥ 0 and ∫_{−∞}^{∞} fX(x) dx = 1.
• Probabilities from density function by
P(u ≤ X ≤ v) = ∫_u^v fX(x) dx.
Technical detail: if FX differentiable, so fX(x) exists, X called
absolutely continuous. Possible (though not in this course) for
continuous r. v. not to be absolutely continuous.
125
Getting cumulative distribution function from
density
This requires a little care. Formula is
FX(x) = ∫_{−∞}^x fX(t) dt.
Note:
• change of variable of integration to t (since x a limit)
• lower limit of integral is lower limit of distribution; might be eg. 0.
126
Example: suppose fX(x) = (3 − x)/2 for 1 ≤ x ≤ 3, 0
otherwise. Then
FX(x) = (1/2) ∫_1^x (3 − t) dt = (1/4)(−5 + 6x − x²).
Check that FX(1) = 0 and FX(3) = 1, and on interval [1, 3],
FX(x) nondecreasing (what it does elsewhere irrelevant). Implies
that 0 ≤ FX(x) ≤ 1 for 1 ≤ x ≤ 3.
127
A trickier example:
Suppose fX(x) = 2x/3 for 0 ≤ x ≤ 1, fX(x) = 2/3 for 1 ≤ x ≤ 2,
and 0 otherwise.
Have to handle each interval separately:
For 0 ≤ x ≤ 1,
FX(x) = ∫_0^x 2t/3 dt = x²/3.
For 1 ≤ x ≤ 2,
FX(x) = ∫_0^1 2t/3 dt + ∫_1^x 2/3 dt = 1/3 + 2x/3 − 2/3 = (2x − 1)/3.
In 2nd case, had to split integral defining FX into two parts because
fX defined in two parts. (“Integral of density, whatever it is.”)
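The two-piece CDF can be sanity-checked against a crude Riemann sum of the density. A Python sketch (names and the numeric check are ours, not the text's method):

```python
def F(x):
    """CDF for the two-piece density f(x) = 2x/3 on [0,1], 2/3 on [1,2]."""
    if x < 0:
        return 0.0
    if x <= 1:
        return x**2 / 3
    if x <= 2:
        return (2 * x - 1) / 3
    return 1.0

def f(t):
    # the density itself, defined in two parts
    return 2 * t / 3 if t <= 1 else 2 / 3

# midpoint Riemann sum of f over [0, 1.8] should reproduce F(1.8)
n = 100_000
approx = sum(f((i + 0.5) * 1.8 / n) * 1.8 / n for i in range(n))
print(round(F(1.8), 4), round(approx, 4))  # 0.8667 0.8667
```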
128
Important continuous distributions
The Uniform[L,R] distribution
Uniform[0, 1] has constant density 1 for 0 ≤ x ≤ 1. If now
fU(x) = 1/(R − L) for L ≤ x ≤ R, U ∼ Uniform[L,R].
Density still constant, no longer 1 because of need to integrate to 1
over [L,R].
Can show by integrating that
P (a ≤ U ≤ b) = (b − a)/(R − L)
for L ≤ a ≤ b ≤ R.
129
The exponential distribution
Suppose now random variable X has density fX(x) = e−x for
x ≥ 0 and 0 otherwise.
Legal density, X ∼ Exponential(1), because
∫_0^∞ e^{−x} dx = [−e^{−x}]_0^∞ = 0 − (−1) = 1.
Change variable in integral: x = λy, so dx = λ dy, and the limits
don't change. Thus
1 = ∫_0^∞ λe^{−λy} dy
130
and fY (y) = λe−λy is a density fn for y ≥ 0:
Y ∼ Exponential(λ).
(Careful when using other books/software: sometimes our λ is
written 1/λ).
Applications: lifetimes (eg. of electrical components), inter-arrival
time between customers waiting for service (at bank, fast-food
restaurant).
Fact about Exponential(λ):
P(X ≥ x) = ∫_x^∞ λe^{−λt} dt = e^{−λx}.
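This closed-form tail makes exponential probabilities one line of arithmetic. A Python sketch (function name ours; the λ = 2 numbers are just an illustration):

```python
from math import exp

def expon_tail(x, lam):
    """P(X >= x) = e^{-lam*x} for X ~ Exponential(lam), our parametrization."""
    return exp(-lam * x)

# illustrative: X ~ Exponential(2)
print(round(expon_tail(1.2, 2), 4))                         # 0.0907
print(round(expon_tail(0.4, 2) - expon_tail(1.2, 2), 4))    # P(0.4 <= X <= 1.2) = 0.3586
```

Note the caution above: many books and software use mean 1/λ, so check which parametrization is in force before plugging in.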
131
Connection between exponential and Poisson
Suppose number of customers arriving at a bank (say) in a time
period ∼ Poisson(λ). Let T1 be time until next arrival:
P(T1 ≥ t) = P(no arrivals in time [0, t]) = ((λt)⁰/0!) e^{−λt} = e^{−λt}.
That is, if number arriving ∼ Poisson(λ), time to next arrival
∼ Exponential(λ).
132
The Gamma(α, λ) distribution
Define the gamma function like this:
Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt
Follows that
1 = ∫_0^∞ (t^{α−1}/Γ(α)) e^{−t} dt.
Now change variable in integral: let t = λx so dt = λ dx:
1 = ∫_0^∞ ((λx)^{α−1}/Γ(α)) e^{−λx} λ dx = ∫_0^∞ (λ^α x^{α−1}/Γ(α)) e^{−λx} dx.
In other words, fX(x) = λ^α x^{α−1} e^{−λx}/Γ(α) is a density function;
X has gamma distribution , X ∼ Gamma(α, λ).
133
If α = 1, density function is λ¹x⁰e^{−λx} = λe^{−λx}, the exponential
density.
Thus Gamma(1, λ) = Exponential(λ).
That is, the exponential dist is a special case of the gamma
distribution.
The gamma distribution can be used to model lifetimes (like the
exponential), but has greater flexibility. See picture next page. α
controls the shape.
134
[Plot: density functions of Exponential(1), Gamma(2,1) and Gamma(3,1), for 0 ≤ x ≤ 7.]
135
The normal distribution
Consider function φ(z) = Ce^{−z²/2}. Can we make this into a
density, for all z?
Must have ∫_{−∞}^{∞} Ce^{−z²/2} dz = 1, so C = 1/√(2π) (text, section
2.11).
Random variable Z with this density said to have standard normal
distribution , written Z ∼ N(0, 1).
Gives “bell curve”:
136
[Plot: the N(0, 1) density function, the “bell curve”.]
137
Since
1 = ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz,
make change of variable z = (x − µ)/σ, so dz = (1/σ) dx.
Gives
1 = ∫_{−∞}^{∞} (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx
138
so that
fX(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}
is a density function for random variable X ∼ N(µ, σ²). X said to
have a normal distribution .
Density function of X also bell-shaped, now with peak at x = µ. σ
controls spread: larger σ means larger left-right spread.
Applications: often normal distribution is good approximation,
especially when a measurement is a large number of “small things”
added together. Examples: human body measurements such as
height, weight.
Theoretical reason for this called “central limit theorem”, later.
139
Normal distribution probabilities
Probability is integral, as before:
P(a ≤ X ≤ b) = ∫_a^b (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx.
Problem: can’t do this integral!
Solution: evaluate numerically, to desired accuracy. Results
available in tables, eg. back of text table D2 p. 660.
140
Tables actually give
Φ(z) = P(Z ≤ z) = ∫_{−∞}^z (1/√(2π)) e^{−t²/2} dt,
the cumulative distribution function of N(0, 1).
So have to write problem in terms of Φ(z).
141
Facts:
• since limz→∞ Φ(z) = 1 and φ(z) symmetric about 0,
Φ(−z) = 1 − Φ(z).
• P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
• Hence P (Z ≥ a) = P (a ≤ Z < ∞) = 1 − Φ(a).
• Note that Table D2 only gives Φ(z) for negative z.
142
z 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
-1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
-1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
-0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
-0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
-0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
-0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
-0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
-0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
-0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
-0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
-0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
-0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
Table 1: Standard normal table part 2
143
Example: for Z ∼ N(0, 1), find P (−1.48 ≤ Z ≤ 0.56).
P (−1.48 ≤ Z ≤ 0.56) = Φ(0.56) − Φ(−1.48)
= (1 − Φ(−0.56)) − Φ(−1.48)
= (1 − 0.2877) − 0.0694 = 0.6429.
Example 2: for Z ∼ N(0, 1), find P (Z ≥ 0.90):
P (Z ≥ 0.90) = 1 − P (Z ≤ 0.90)
= 1 − Φ(0.90)
= 1 − (1 − Φ(−0.90)) = 1 − (1 − 0.1841)
= 0.1841.
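Although the integral has no closed form, Φ is available numerically; Python's standard library exposes the error function, and Φ(z) = (1 + erf(z/√2))/2. A sketch reproducing both examples (function name ours; small differences from the table are rounding):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF via math.erf: Phi(z) = (1 + erf(z/sqrt(2)))/2."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(Phi(0.56) - Phi(-1.48), 4))  # 0.6428, vs table answer 0.6429
print(round(Phi(-0.90), 4))              # P(Z >= 0.90) by symmetry: 0.1841
```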
144
Probs for non-standard normal
So if X ∼ N(µ, σ2), how to calculate P (a ≤ X ≤ b)?
P(z1 ≤ Z ≤ z2) = Φ(z2) − Φ(z1) = ∫_{z1}^{z2} (1/√(2π)) e^{−z²/2} dz
Change variables again: let z = (x − µ)/σ, so dz = (1/σ) dx:
= ∫_a^b (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx = P(a ≤ X ≤ b)
exactly what we want, except need relationship between z1, z2 and
a, b:
z1 = (a − µ)/σ; z2 = (b − µ)/σ.
145
In other words, get z1 and z2 from a and b, and then answer is
Φ(z2) − Φ(z1).
Example: suppose X ∼ N(0.5, 16), find P (0 ≤ X ≤ 2).
µ = 0.5, σ2 = 16 so σ = 4. Thus
z1 = (0 − 0.5)/4 = −0.125, z2 = (2 − 0.5)/4 = 0.375.
Because 0.375 halfway between 0.37 and 0.38, Φ(0.375) about
halfway between Φ(0.37) = 1 − 0.3557 and
Φ(0.38) = 1 − 0.3520. Thus Φ(0.375) = 0.64615. Likewise
Φ(−0.125) = (0.4522 + 0.4483)/2 = 0.45025. Hence
P (0 ≤ X ≤ 2) = 0.64615 − 0.45025 = 0.1959.
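The whole standardize-then-subtract recipe fits in a few lines. A Python sketch (function name ours), again using the erf-based Φ rather than table interpolation:

```python
from math import erf, sqrt

def normal_prob(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), by standardizing:
    z1 = (a - mu)/sigma, z2 = (b - mu)/sigma; answer is Phi(z2) - Phi(z1)."""
    Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

# the example above: X ~ N(0.5, 16), so sigma = 4
print(round(normal_prob(0, 2, 0.5, 4), 4))  # 0.1959
```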
146
Getting probabilities from continuous
distributions with Minitab
As with discrete distributions, can use Minitab to calculate
probabilities from continuous distributions. Saves integration/use of
tables.
147
Normal distribution
Redo previous example. If X ∼ N(0.5, 16), find P (0 ≤ X ≤ 2).
First, type values 0 and 2 into column C1.
Select Calc, Probability distributions, Normal. Click Cumulative
Prob. “Mean” is µ = 0.5, “standard deviation” is σ = 4. Select
Input Column, type in C1. Click OK:
Cumulative Distribution Function
Normal with mean = 0.500000 and standard deviation = 4.00000
x P( X <= x)
0.0000 0.4503
2.0000 0.6462
Thus P (0 ≤ X ≤ 2) = 0.6462 − 0.4503 = 0.1959.
148
Uniform[L,R] distribution
Suppose X ∼ Uniform[3, 7]. P(X ≥ 6)? (Can tell it's 1/4 almost
by looking at it.)
Select Calc, Prob. Distributions, Uniform. Click Cumulative
Distribution. Enter endpoints 3 and 7. Click Input Constant, enter 6.
Click OK:
Cumulative Distribution Function
Continuous uniform on 3.00000 to 7.00000
x P( X <= x)
6.0000 0.7500
So P (X ≥ 6) = 1 − P (X ≤ 6) = 1 − 0.75 = 0.25.
149
Exponential distribution
Suppose X ∼ Exponential(2): P (0.4 ≤ X ≤ 1.2)?
Enter values 0.4 and 1.2 into column C1. (Overtype if needed.)
Select Calc, Probability Distributions, Exponential. Click Cumulative
Probability. Then enter “mean” as 1/λ = 1/2 = 0.5. Select
Input Column and enter C1. Click OK:
Cumulative Distribution Function
Exponential with mean = 0.500000
x P( X <= x)
0.4000 0.5507
1.2000 0.9093
so P (0.4 ≤ X ≤ 1.2) = 0.9093 − 0.5507 = 0.3586.
150
Gamma distribution
Exponential can be done by integration; gamma cannot. (Previous:
∫_{0.4}^{1.2} 2e^{−2x} dx = 0.3586.)
Suppose X ∼ Gamma(3, 2); P (X ≤ 1.5)?
Select Calc, Prob. Distributions, Gamma. Click Cumulative Prob.
First Shape Parameter is α = 3; second is 1/λ = 0.5. Click Input
Constant, enter 1.5. Click OK.
Cumulative Distribution Function
Gamma with a = 3.00000 and b = 0.500000
x P( X <= x)
1.5000 0.5768
Prob. is 0.5768.
151
Summary
• Get CDF from density by integrating, but: careful with variable of
integration, lower limit.
• Uniform[L,R] distribution has constant density on interval
[L,R].
• Exponential(λ) density is λe^{−λx} for x ≥ 0.
• Exponential often used for time between events; number of
events per unit time then ∼ Poisson(λ).
• Gamma(α, λ) has density λ^α x^{α−1} e^{−λx}/Γ(α) for x ≥ 0. α
controls shape.
152
• Standard normal distribution has density e^{−z²/2}/√(2π); has
“bell curve” shape.
• if X ∼ N(µ, σ2), X has (non-standard) normal distribution
with peak at µ and spread controlled by σ.
• Get probabilities for Z and X using tables. Tables give CDF of
z, Φ(z); write everything else in terms of that.
• Using Minitab, can get probabilities without integration/tables.
153
Random variables neither discrete nor
continuous
Distributions can be neither discrete nor continuous.
Example from earlier: suppose X = −1 with prob 1/2, and for
0 ≤ a ≤ b ≤ 1, P(a ≤ X ≤ b) = (b − a)/2. Can talk
about P(X = −1) = 1/2, but P(X = x) = 0 for any other x. So
not a discrete distribution.
But since P(X = −1) = 1/2 > 0, not a continuous distribution
either.
154
One-dimensional change of variable
Suppose X is some random variable with known distribution, and
Y = h(X), h(X) a known function. Then what is distribution of
Y ?
If X discrete, have to work out all possible values of Y (may be
finite, at worst countable).
If X continuous, may be uncountably many different values of Y .
155
X discrete
Example: flip a fair coin twice, let X be number of heads. Then
X ∼ Binomial(2, 0.5), so
P (X = 0) = 0.25, P (X = 1) = 0.5, P (X = 2) = 0.25.
Let Y = (X − 1)2. Then possible Y are:
X    Y
0    (0 − 1)² = 1
1    (1 − 1)² = 0
2    (2 − 1)² = 1
Thus P (Y = 0) = P (X = 1) = 0.5 and
P (Y = 1) = P (X = 0) + P (X = 2) = 0.25 + 0.25 = 0.5.
156
Idea similar whenever X discrete: work out all possible values of Y .
If more than one X value can lead to the same value of Y , then get
prob for Y by adding up X probs: specifically
P (Y = y) =∑
x:h(x)=y
P (X = x).
If h(X) 1-1, then each value of X leads to different value of Y .
(For instance, if h increasing.)
157
X continuous
If X continuous, then Y might not be continuous at all. It depends
on h(X).
Example: Let X ∼ Uniform[0, 1]. Let h(x) = 7 if x ≤ 3/4, and let
h(x) = 5 if x > 3/4.
Then
P(Y = 7) = P(X ≤ 3/4) = 3/4
P(Y = 5) = P(X > 3/4) = 1/4.
These add up to 1, so (even though X continuous) Y has discrete
distribution.
158
Reason for Y being discrete: h(X) only took countably many
different values.
Usually, though, h(X) takes on uncountably many values, and
have to be more careful.
Second example: X ∼ Uniform[0, 1], Y = 3X . What is
distribution of Y ?
159
Start with cumulative distribution function:
P (Y ≤ y) = P (3X ≤ y) = P (X ≤ y/3) = y/3
since FX(x) = x if X ∼ Uniform[0, 1].
Since 0 ≤ X ≤ 1, must have 0 ≤ Y/3 ≤ 1 so 0 ≤ Y ≤ 3. And
can get density of Y by differentiating: fY(y) = (y/3)′ = 1/3. Thus
Y ∼ Uniform[0, 3].
In general: if h(X) increasing, work with cumulative, and substitute.
160
Change of variable using density functions
May not have cumulative distribution function available, so need a
formula that works with density functions.
Mimic what we did above, and adapt. Again assume h(X)
increasing:
FY(y) = P(Y ≤ y) = P(h(X) ≤ y) = P(X ≤ h^{−1}(y)) = FX(h^{−1}(y)).
Differentiate both sides wrt y:
fY(y) = fX(h^{−1}(y)) · (d/dy)h^{−1}(y) = fX(h^{−1}(y)) · 1/(h′(h^{−1}(y))).
This is desired result.
161
Redo above example: fX(x) = 1 for 0 ≤ x ≤ 1,
Y = h(X) = 3X so h^{−1}(y) = y/3. Then h′(x) = 3 and
fY(y) = fX(y/3) · 1/(h′(y/3)) = (1)(1/3) = 1/3
as before.
162
If h(X) not increasing
If h(X) decreasing, then use the same result above (with a slightly
different proof).
If h(X) not 1-1, then can have difficulties.
Example: suppose X ∼ Uniform[0, 2], and
Y = h(X) = (X − 1)2. h(X) neither increasing nor decreasing;
indeed in general two values of X giving same value of Y .
Thus care required. Eg.
P(Y ≤ 1/2) = P((X − 1)² ≤ 1/2)
= P(|X − 1| ≤ 1/√2)
= P(1 − 1/√2 ≤ X ≤ 1 + 1/√2) = 1/√2.
163
Summary
• Random variables usually discrete or continuous, but could be
neither.
• If r. v. Y is function h of X , can find distribution of Y from
distribution of X .
• if X discrete, work out all possible values of Y .
• if X continuous, Y might be discrete (depends on h(X)).
• if h(X) increasing (decreasing) function of X over domain of
X , then Y continuous. Can go via CDF of X to CDF of Y , or
work with density functions directly.
• if h(X) not 1-1, care needed.
164
Joint Distributions
Know how to describe random variables one at a time: probability
function (discrete), density function (continuous), cumulative
distribution function (either).
But two random variables X , Y might be related. Don’t have a way
to describe this.
Example: X ∼ Bernoulli(2/3). Let Y = 1 − X .
Y ∼ Bernoulli(1/3) (count failures not successes). X,Y
related, but doesn’t show in individual probability functions.
165
Joint probability functions
Can simply find probability of all possible combinations of values for
X,Y . Uses individual probability functions and relationship.
In example: if X = 0, then Y = 1; if X = 1, then Y = 0.
Possible values for Y depend on value of X . Also,
P (X = 1) = 2/3.
Notation: pX,Y (x, y) = P (X = x, Y = y) (comma is “and”),
called joint probability function . In example:
pX,Y (1, 0) = 2/3; pX,Y (0, 1) = 1/3.
Are only possible combinations of X and Y values.
166
Often convenient to depict as table. Above example:
x \ y    0      1
0        0      1/3
1        2/3    0
Another:
u \ v    0      1      2
0        1/3    1/6    1/6
1        1/6    1/12   1/12
Note that all probabilities in each case sum to 1, because joint
probability function covers all possibilities.
167
Joint density functions
If random variables continuous, joint probability function makes no
sense; instead, define joint density function f(x, y) that
expresses chance of being “near” (X = x, Y = y).
Joint density function also covers all possible values of X,Y , so
integrates to 1 when integrated over both x and y.
168
Example: f(x, y) = 4x²y + 2y⁵, 0 ≤ x, y ≤ 1 (page 83).
Integrate over both x and y. Since both variables lie between 0 and
1, those are the limits of integration:
∫_0^1 ∫_0^1 (4x²y + 2y⁵) dx dy = ∫_0^1 [(4/3)x³y + 2xy⁵]_{x=0}^{x=1} dy
= ∫_0^1 ((4/3)y + 2y⁵) dy
= [(4/6)y² + (2/6)y⁶]_{y=0}^{y=1} = 1,
showing that f(x, y) is a legal joint density function.
169
Sometimes possible values of Y depend on value of X . Account for
in integration.
Example: f(x, y) = 120x³y for x ≥ 0, y ≥ 0, x + y ≤ 1. (Thus
if X = 0.6, Y cannot exceed 0.4.) Region forms triangle: Figure
2.7.3 of text (p. 85). Verify density by letting y limits of integration
depend on x (y = 1 − x), and integrating wrt y first.
170
[Figure: the triangular region x ≥ 0, y ≥ 0, x + y ≤ 1 (Figure 2.7.3 of text).]
171
∫_0^1 ∫_0^{1−x} 120x³y dy dx = ∫_0^1 [60x³y²]_{y=0}^{y=1−x} dx
= ∫_0^1 60x³(1 − x)² dx
= ∫_0^1 (60x³ − 120x⁴ + 60x⁵) dx
= [15x⁴ − 24x⁵ + 10x⁶]_0^1
= 15 − 24 + 10 = 1.
172
Bivariate normal distribution
Suppose X , Y both have standard normal distributions, and
suppose −1 < ρ < 1. Then the bivariate standard normal
distribution with correlation ρ has joint density function
f(x, y) = (1/(2π√(1 − ρ²))) exp{ −(x² + y² − 2ρxy)/(2(1 − ρ²)) }.
Plotting in 3D (Figure 2.7.4) gives a 3D bell shape.
ρ measures relationship between X and Y :
• ρ = 0: no relationship
• ρ > 0: when X > 0, Y likely > 0
• ρ < 0: when X > 0, Y likely < 0.
173
Bivariate standard normal has peak at (0, 0). Replacing x by
(x − µ1)/σ1 and y by (y − µ2)/σ2 shifts peak to (µ1, µ2) and
changes decrease of density away from peak (larger σ values mean
slower decrease).
174
Calculating probabilities
For a continuous random variable X, calculate probabilities by
integrating, eg. P(a < X ≤ b) = ∫_a^b f(x) dx.
Same idea for continuous joint distribution, integrating over x and y.
Example: f(x, y) = 120x³y for x ≥ 0, y ≥ 0, x + y ≤ 1. Find
P(0.5 ≤ X ≤ 0.7, Y > 0.2).
Draw picture:
175
[Figure: the region 0.5 ≤ x ≤ 0.7, 0.2 ≤ y ≤ 1 − x.]
176
Area is trapezoid. Integrate over y first, then x. Call prob P:
P = ∫_{0.5}^{0.7} ∫_{0.2}^{1−x} 120x³y dy dx
= ∫_{0.5}^{0.7} [60x³y²]_{y=0.2}^{y=1−x} dx
= ∫_{0.5}^{0.7} (60x³(1 − x)² − 2.4x³) dx
= ∫_{0.5}^{0.7} (57.6x³ − 120x⁴ + 60x⁵) dx
= [14.4x⁴ − 24x⁵ + 10x⁶]_{x=0.5}^{x=0.7}
= 14.4(0.7⁴ − 0.5⁴) − 24(0.7⁵ − 0.5⁵) + 10(0.7⁶ − 0.5⁶)
= 0.294.
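The answer can be double-checked numerically: do the inner y-integral exactly, then a fine midpoint sum over x. A Python sketch (the helper name and the check itself are ours):

```python
def inner(x):
    # inner y-integral done exactly: 60 x^3 ((1 - x)^2 - 0.2^2)
    return 60 * x**3 * ((1 - x)**2 - 0.04)

# midpoint Riemann sum over x in [0.5, 0.7]
n = 10_000
h = 0.2 / n
P = sum(inner(0.5 + (i + 0.5) * h) * h for i in range(n))
print(round(P, 4))  # 0.294, matching the exact calculation
```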
177
Marginal distributions
Started from individual distributions for X,Y plus relationship. But:
start from joint, get individual?
One way: get distribution of X by “averaging” over distribution of Y .
Discrete: simply row and column totals. Example:
u \ v    0      1      2      Sum
0        1/3    1/6    1/6    2/3
1        1/6    1/12   1/12   1/3
Sum      1/2    1/4    1/4    1
178
Without knowledge of V , U twice as likely 0 as 1; without
knowledge of U , V twice as likely 0 as 1 or 2.
Row totals here give marginal distribution of U ; column totals
here marginal distribution of V . Each marginal distribution is
proper probability distribution (probs sum to 1).
179
Continuous: integrate over other variable. Get marginal density
function .
Example: f(x, y) = 120x³y for x ≥ 0, y ≥ 0, x + y ≤ 1.
Marginal density for X: integrate over y. Limits 0, 1 − x; get
fX(x) = ∫_0^{1−x} 120x³y dy = 60x³(1 − x)².
For Y : integrate over x, limits 0, 1 − y:
fY(y) = ∫_0^{1−y} 120x³y dx = 30y(1 − y)⁴.
“Integrating out” unwanted variable.
Alternative approach via cumulative; text page 79.
180
Example 2: bivariate standard normal. Recall standard normal
density; integrates to 1, so
∫_{−∞}^{∞} (1/√(2π)) exp[−u²/2] du = 1.
Marginal distribution of x in bivariate standard normal: integrate out
y:
fX(x) = ∫_{−∞}^{∞} (1/(2π√(1 − ρ²))) exp[ −(x² + y² − 2ρxy)/(2(1 − ρ²)) ] dy.
Substitution: let u = (y − ρx)/√(1 − ρ²), so du = dy/√(1 − ρ²).
Then
u² = (y² − 2ρxy + ρ²x²)/(1 − ρ²)
181
which is nearly what appears inside “exp”. Precisely:
fX(x) = ∫_{−∞}^{∞} (1/(2π)) exp[ −(u² + x²)/2 ] du
= (1/√(2π)) exp(−x²/2) ∫_{−∞}^{∞} (1/√(2π)) exp(−u²/2) du.
Integral is 1 (of a standard normal density), so
fX(x) = (1/√(2π)) exp(−x²/2):
that is, marginal distribution of X is standard normal.
182
Conditioning and Independence
Marginal distribution: of one variable, ignorant about other.
But what if we knew X ; what then about distribution of Y ?
Example 1:
x \ y    0      1
0        0      1/3
1        2/3    0
Suppose X = 1. Then ignore 1st row.
183
But 2nd row not probability distribution (sum 2/3, not 1). Idea: divide
by sum. Then if X = 1, P(Y = 0) = 1 and P(Y = 1) = 0: that
is, if X = 1, Y certain to be 0. Called conditional distribution of
Y given X = 1.
If X = 0, Y certain to be 1. Conditional distribution of Y different
for different X : Y depends on X .
Notation: as for conditional probability. Eg. above:
P (Y = 1|X = 0) = 1.
184
Example 2:
u \ v    0      1      2
0        1/3    1/6    1/6
1        1/6    1/12   1/12
Conditional distribution of V given U = 0? Use U = 0 row. This
sums to 2/3, so divide by this to get P(V = 0|U = 0) = 1/2,
P(V = 1|U = 0) = 1/4, P(V = 2|U = 0) = 1/4.
U = 1 line sums to 1/3; conditional distribution of V given U = 1 is
same as given U = 0.
In example 2, does not matter what U is – conditional distribution of
V same. Say that V and U are independent .
185
Two examples give extreme cases. In Example 1, knowing X gave
Y with certainty; in example 2, knowing U said nothing about V .
Most cases in between: knowing one variable has some effect on
distribution of other.
Symbols:
P(Y = b|X = a) = P(X = a, Y = b) / Σ_y P(X = a, Y = y) = P(X = a, Y = b) / P(X = a).
Denominator is marginal probability that X = a.
186
Conditioning on continuous random variables
Continuous case: no probabilities, so replace with density functions;
replace sum by integral. This gives conditional density function :
fY|X(y|x) = fX,Y(x, y) / ∫_{−∞}^{∞} fX,Y(x, y) dy = fX,Y(x, y) / fX(x),
replacing infinities by actual limits for y. Denominator depends on x
only; is marginal density function for X .
Then use conditional density to evaluate conditional probabilities.
187
Example: fX,Y(x, y) = 4x²y + 2y⁵ for x, y between 0 and 1, 0
otherwise. Find P(0.2 ≤ Y ≤ 0.3|X = 0.8).
Steps: find marginal density of X , use to find conditional density of
Y given X , integrate conditional density to find probability.
Marginal density of X is
fX(x) = ∫_0^1 (4x²y + 2y⁵) dy = [2x²y² + (2/6)y⁶]_0^1 = 2x² + 1/3.
So conditional density of Y given X is
fY|X(y|x) = (4x²y + 2y⁵) / (2x² + 1/3).
188
Note how denominator doesn't depend on y, so
P = P(0.2 ≤ Y ≤ 0.3|X = 0.8)
= (1/(2x² + 1/3)) ∫_{0.2}^{0.3} (4x²y + 2y⁵) dy
= (1/(2x² + 1/3)) [2x²y² + (1/3)y⁶]_{y=0.2}^{y=0.3}
= (1/(2x² + 1/3)) (2x²(0.3² − 0.2²) + (1/3)(0.3⁶ − 0.2⁶))
= (0.1x² + 0.000665/3) / (2x² + 1/3) = 0.0398,
putting in x = 0.8.
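Since the whole calculation reduces to evaluating antiderivatives, it codes up in a few lines. A Python sketch (function name ours), reusing the formula just derived:

```python
def cond_prob(x, lo, hi):
    """P(lo <= Y <= hi | X = x) for the joint density 4x^2 y + 2y^5
    on the unit square (the worked example above)."""
    marg = 2 * x**2 + 1 / 3                          # marginal density fX(x)
    # antiderivative of 4x^2 y + 2y^5 in y is 2x^2 y^2 + y^6 / 3
    num = 2 * x**2 * (hi**2 - lo**2) + (hi**6 - lo**6) / 3
    return num / marg

print(round(cond_prob(0.8, 0.2, 0.3), 4))  # 0.0398
print(round(cond_prob(0.4, 0.2, 0.3), 4))  # 0.0248: probability changes with x
```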
189
Followup: what happens to P(0.2 ≤ Y ≤ 0.3) if X changes?
One answer: P(0.2 ≤ Y ≤ 0.3|X = 0.4) = 0.0248, compared
to P(0.2 ≤ Y ≤ 0.3|X = 0.8) = 0.0398. So probability does
change as X changes; Y does depend on X.
Here, conditioning event X = 0.8 has zero probability, so have to
use densities. Otherwise, use standard probability rules, eg.
P (0.2 ≤ Y ≤ 0.3|0.4 ≤ X ≤ 0.8)
=P (0.4 ≤ X ≤ 0.8, 0.2 ≤ Y ≤ 0.3)
P (0.4 ≤ X ≤ 0.8)
worked out the usual way with integrals.
190
Law of total probability
Because
fY|X(y|x) = fX,Y(x, y) / ∫_{−∞}^{∞} fX,Y(x, y) dy = fX,Y(x, y) / fX(x),
also true that
fX,Y (x, y) = fX(x)fY |X(y|x).
191
So
P (a ≤ X ≤ b, c ≤ Y ≤ d)
=
∫ d
c
∫ b
a
fX,Y (x, y) dx dy
=
∫ d
c
∫ b
a
fX(x)fY |X(y|x) dx dy.
In words: can find probabilities either using joint density or using a
marginal and a conditional density. Can use whichever easier.
192
Independence of random variables
Recall this joint distribution:
u \ v    0      1      2
0        1/3    1/6    1/6
1        1/6    1/12   1/12
Sum      1/2    1/4    1/4
Conditional distribution of V same given U = 0 and given U = 1.
Also same as marginal distribution of V . Knowing U says nothing
about V .
(Also, conditional dist. of U same for all V and same as marginal for
U .)
193
Suggests definition: random variables independent if conditional
distribution always same, and always same as marginal.
Mathematics: X,Y independent if
pY(y) = pY|X(y|x) = pX,Y(x, y) / pX(x)

so that

pX,Y(x, y) = pX(x)pY(y).

This is usually the easiest check:

• if pX,Y(x, y) = pX(x)pY(y) for all x, y, then X, Y independent.

• if pX,Y(x, y) ≠ pX(x)pY(y) for any one (x, y) pair, then
X, Y not independent.
194
For example above: P(U = 0) = 2/3, P(U = 1) = 1/3;
P(V = 0) = 1/2, P(V = 1) = P(V = 2) = 1/4. Also,

P(U = 0)P(V = 0) = (2/3) · (1/2) = 1/3 = P(U = 0, V = 0).
Repeat for all u and v: proves independence.
195
Compare this joint distribution:
x \ y    0      1
  0      0     1/3
  1     2/3     0

Now,

P(X = 0)P(Y = 0) = (1/3) · (2/3) = 2/9

and P(X = 0, Y = 0) = 0 ≠ 2/9. One calculation shows X, Y
not independent.
196
Independence of continuous random variables
As usual, turn probability into density. If
fX,Y (x, y) = fX(x)fY (y)
for all x, y, then continuous random variables X,Y independent. If
it fails for any (x, y) pair, not independent.
Example: suppose fX(x) = 2x² + 1/3, fY(y) = (4/3)y + 2y⁵,
fX,Y(x, y) = 4x²y + 2y⁵ for 0 ≤ x, y ≤ 1. Then

fX(x)fY(y) = (2x² + 1/3) · ((4/3)y + 2y⁵),

which cannot be simplified to fX,Y(x, y). So X, Y not
independent.
197
Order statistics
Suppose that X1, X2, . . . , Xn all, independently, have same
distribution (a sample from distribution). Suppose common cdf
FX(x).
For example: take 20 people, give each IQ test. Without knowing
about individuals, use same distribution for each. What might
highest score in sample be?
Idea: more people sampled, higher the highest score could be (get
more chances to see a very high score).
198
Let M = max(X1, X2, . . . , Xn). Then
P(M ≤ m) = P(X1 ≤ m, X2 ≤ m, . . . , Xn ≤ m)
         = P(X1 ≤ m)P(X2 ≤ m) · · · P(Xn ≤ m)
         = [FX(m)]ⁿ.

If X continuous, differentiate to get density.

Example: each Xi ∼ Uniform[0, 1]. Then FX(x) = x, so
P(M ≤ m) = mⁿ.

If n = 5, P(M ≤ 0.9) = 0.9⁵ = 0.590; if n = 20,
P(M ≤ 0.9) = 0.9²⁰ = 0.1216, much smaller. That is, with
more observations, the maximum is likely to be higher (less likely to
be low).
199
Similar idea for minimum: let K = min(X1, X2, . . . , Xn). Then
P(K ≤ k) = 1 − P(K > k)
         = 1 − P(X1 > k, X2 > k, . . . , Xn > k)
         = 1 − P(X1 > k)P(X2 > k) · · · P(Xn > k)
         = 1 − [1 − FX(k)]ⁿ.

Example: if n = 10, Xi ∼ Uniform[0, 1], then
P(K ≤ 0.2) = 1 − (1 − 0.2)¹⁰ = 0.8926.
200
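The two CDF formulas above can be sketched in a few lines of Python (just the slides' formulas for the Uniform[0, 1] case; function names are for illustration):

```python
# CDFs of the max and min of n iid Uniform[0,1] observations:
# P(M <= m) = m^n, P(K <= k) = 1 - (1 - k)^n.

def cdf_max(m, n):
    return m ** n

def cdf_min(k, n):
    return 1 - (1 - k) ** n

print(round(cdf_max(0.9, 5), 4))    # 0.5905
print(round(cdf_max(0.9, 20), 4))   # 0.1216
print(round(cdf_min(0.2, 10), 4))   # 0.8926
```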
Summary
• Two r. v. X and Y might be related; express using joint
distribution.
• if X,Y discrete, use joint probability function (prob. of every
combination of values for X and Y ).
• if X,Y continuous, use joint density function.
• Joint probability/density function sum/integrate to 1 over both x
and y.
• Bivariate normal distribution: X,Y individually normal but
possibly related, with correlation ρ.
• Prob. from joint density: integrate over both x and y.
201
• Marginal distribution of Y is that of Y “averaged” over X
(sum/integrate); dist. of Y in ignorance of X .
• Conditional distribution of Y given X is that of Y if X known.
• If marginal and conditional distributions of Y are same, then
knowing X has no effect on Y : X and Y independent.
• Prove independence by showing that joint prob. fn. / density at
(x, y) equal to product of marginal prob. fn. / density for all x
and y. Failure anywhere means not independent.
• Order statistics: can find CDF of max or min of sample from a
distribution.
202
Simulating probability distributions
So far, considered mathematical properties of distributions:
probabilities, densities, cdf’s etc. But some distributions difficult to
understand or use.
Idea: generate random values from the distribution. Uses:

• approximation of difficult-to-calculate quantities

• simulation of complex systems

• generating potential solutions for difficult problems

• random choices for quizzes, computer games

• understanding behaviour of samples (chapter 4)
203
Pseudo-random numbers
In practice, don’t get actual random numbers, but pseudo-random
numbers. These follow recipe, but look random. (Paradox?)
Not so bad, because crucial feature: unpredictable – cannot easily
say what comes next.
Typical method: multiplicative congruential generator . Start with
initial “seed” value R0, then, for n = 0, 1, . . .:
Rn+1 = 106Rn + 1283 (mod 6075)
(“take remainder on division by 6075”).
204
Eg. start with R0 = 1001:
R1 = 106(1001) + 1283 = 107389 (mod 6075) = 4114
R2 = 106(4114) + 1283 = 437367 (mod 6075) = 6042
R3 = 106(6042) + 1283 = 641735 (mod 6075) = 3860
and so on, with 0 ≤ Ri < 6075.
Gives up to 6075 different random integers before repeating itself.
Suitable choice of constants gives long “period” and unpredictable
sequence. (Number theory.)
In practice, use much larger constants – get many more possible
random numbers.
205
Continuous uniform on [0, 1]
To get (pseudo-) random values from Uniform[0, 1], take
pseudo-random integers and divide by maximum. Result has
approx. uniform distribution.
With generator above, max value is 6075, so random uniform values
are 4114/6075 = 0.677, 6042/6075 = 0.995,
3860/6075 = 0.635. (Only 6075 possible values, so only 3 or so
digits trustworthy.)
“Random numbers” in calculators, Excel etc. of this kind.
Random Uniform[0, 1] values are used as building block for
random values from other distributions. Eg. random
Y ∼ Uniform[0, b]: multiply a random Uniform[0, 1] by b.
206
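The generator and the division by the modulus can be sketched in Python (the course uses Minitab, so this is only an illustration; it reproduces the slides' example from seed R0 = 1001):

```python
# Congruential generator from the slides:
# R_{n+1} = 106 * R_n + 1283 (mod 6075).
# Dividing each value by 6075 gives approximate Uniform[0,1] values.

def lcg(seed, n, a=106, c=1283, m=6075):
    values = []
    r = seed
    for _ in range(n):
        r = (a * r + c) % m
        values.append(r)
    return values

ints = lcg(1001, 3)
print(ints)                                 # [4114, 6042, 3860]
print([round(r / 6075, 3) for r in ints])   # [0.677, 0.995, 0.635]
```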
Bernoulli distribution
Suppose we want to simulate X ∼ Bernoulli(0.4): single trial,
prob. 0.4 of success.
Take single random uniform U . If U ≤ 0.4, take X = 1 (success),
otherwise take X = 0 (failure).
Works because U ≤ 0.4 about 0.4 of the time, so will get
successes about 0.4 of the time (long run).
In general, for X ∼ Bernoulli(θ), take X = 1 if U ≤ θ, 0
otherwise.
207
Binomial and geometric distributions
If Y ∼ Binomial(n, θ), Y = X1 + X2 + · · · + Xn where
Xi ∼ Bernoulli(θ). So just generate n random Bernoullis and
add them up.
Similarly, if Z ∼ Geometric(θ), Z is number of failures (in
Bernoulli trials) before 1st success. So get random value of Z like
this:
1. set Z = 0
2. generate U from Uniform[0, 1]
3. if U ≤ θ, stop with current Z
4. otherwise, add 1 to Z and return to step 2.
208
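The Bernoulli, binomial and geometric recipes above, sketched in Python (using `random.random()` as the Uniform[0, 1] building block; an illustration, not the course's Minitab procedure):

```python
import random

def bernoulli(theta):
    # success (1) if the uniform falls at or below theta
    return 1 if random.random() <= theta else 0

def binomial(n, theta):
    # sum of n independent Bernoulli(theta) trials
    return sum(bernoulli(theta) for _ in range(n))

def geometric(theta):
    # number of failures before the first success
    z = 0
    while random.random() > theta:
        z += 1
    return z

random.seed(1)
sample = [binomial(5, 0.4) for _ in range(10_000)]
print(round(sum(sample) / len(sample), 2))   # close to n*theta = 2
```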
Inverse-CDF method
Cdf F (x) = P (X ≤ x) defined for all x.
Also, in set of possible X-values (where f(x) > 0), F (x)
invertible: for any p, exactly one x where F (x) = p.
Example: X ∼ Exponential(λ). Then F (x) = 1 − e−λx. For
x > 0, write p = F (x), and solve for x to get
x = −(1/λ) ln(1 − p).
Then generate a random p from Uniform[0, 1], and put it in the
formula to get a random X .
209
For instance, if λ = 2, might have p = 0.7 and hence random X is
−(1/2) ln(1 − 0.7) = 0.602.
Why does this work in general?
Let Y be any random variable; let F (y) = P (Y ≤ y) be cdf of Y .
Define random variable W = F (Y ). Then
P(W ≤ w) = P(F(Y) ≤ w)
         = P(Y ≤ F⁻¹(w)) = F(F⁻¹(w)) = w.
That is, W ∼ Uniform[0, 1] whatever the distribution of Y .
210
So: to simulate Y , simulate W , then use relationship
Y = F−1(W ) to simulate Y (by using simulated uniform in place
of W ).
This was done above for exponential. Called inverse-CDF method .
211
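The inverse-CDF method for the exponential can be sketched as (Python illustration; the `p=` argument lets us plug in the slides' worked value instead of a fresh uniform):

```python
import math
import random

# Inverse-CDF sampling for Exponential(lam): solve p = 1 - exp(-lam*x)
# to get x = -(1/lam) * ln(1 - p), with p a random Uniform[0,1].

def exponential(lam, p=None):
    if p is None:
        p = random.random()
    return -math.log(1 - p) / lam

# the worked example: lam = 2, p = 0.7
print(round(exponential(2, p=0.7), 3))   # 0.602
```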
Also works for discrete. Example: Poisson(0.7) has this cdf:
x 0 1 2 3 4
P (X ≤ x) 0.497 0.844 0.966 0.994 0.999
Procedure: get random U ∼ Uniform[0, 1]. If U ≤ 0.497, take
random X = 0; else if U ≤ 0.844, take X = 1, . . . , else if
U > 0.999, take X = 5.
(Higher values possible, but very unlikely; for more accuracy use
more digits.)
212
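The discrete version amounts to walking the cdf table until the uniform is covered; a Python sketch using the Poisson(0.7) table above (variable names are just for illustration):

```python
# Inverse-CDF for a discrete distribution: return the first value
# whose cumulative probability covers U.

POISSON_07_CDF = [(0, 0.497), (1, 0.844), (2, 0.966), (3, 0.994), (4, 0.999)]

def discrete_inverse_cdf(u, table):
    for x, cum in table:
        if u <= cum:
            return x
    return table[-1][0] + 1   # lump everything beyond the table into one value

print(discrete_inverse_cdf(0.3, POISSON_07_CDF))     # 0
print(discrete_inverse_cdf(0.5, POISSON_07_CDF))     # 1
print(discrete_inverse_cdf(0.9995, POISSON_07_CDF))  # 5
```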
Normal distribution
Difficult to simulate from (cannot invert cdf).
But consider X , Y with bivariate standard normal distribution,
correlation 0. Joint density is
fX,Y(x, y) = (1/(2π)) exp{−(1/2)(x² + y²)}.
Thinking of (x, y) as point in R2, note that density depends only on
distance from origin (r2 = x2 + y2), not on angle.
So generate random (x, y) pair by generating random angle
θ ∼ Uniform[0, 2π], random distance, separately.
(details: 2-variable transformation using Jacobian determinant.)
213
Density function for distance R is

fR(r) = r e^{−r²/2}

and cdf is

FR(r) = ∫₀ʳ t e^{−t²/2} dt = 1 − e^{−r²/2}

(eg. use substitution u = t²/2, du = t dt).

FR(r) invertible; let p = FR(r), solve for r to get

r = √(−2 ln(1 − p)).

Get random R by taking U ∼ Uniform[0, 1], using it for p above.
214
Finally, convert random R, θ to (X,Y ) using polar coordinate
formulas
X = R cos θ; Y = R sin θ.
Example: suppose random θ = 1.8 (radians), U = 0.3. Then
R = √(−2 ln(1 − 0.3)) = 0.8446. So

X = 0.8446 cos 1.8 = −0.19; Y = 0.8446 sin 1.8 = 0.82.
215
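The whole polar construction, sketched in Python (illustration only; the `theta=`/`u=` arguments reproduce the worked example instead of drawing fresh uniforms):

```python
import math
import random

# Polar method: random angle theta ~ Uniform[0, 2*pi] and random
# distance R with cdf 1 - exp(-r^2/2) give a standard normal pair.

def normal_pair(theta=None, u=None):
    if theta is None:
        theta = random.uniform(0, 2 * math.pi)
    if u is None:
        u = random.random()
    r = math.sqrt(-2 * math.log(1 - u))
    return r * math.cos(theta), r * math.sin(theta)

# worked example: theta = 1.8 radians, U = 0.3
x, y = normal_pair(theta=1.8, u=0.3)
print(round(x, 2), round(y, 2))   # -0.19 0.82
```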
Rejection methods
Inverse-CDF method doesn’t always work – cdf can be too
complicated to invert. Example: X ∼ Gamma(3, 1), with density
function
f(x) = (x²/2) e^{−x}.

This has maximum 2e^{−2} = 0.2707 at x = 2. Density “small”
beyond x = 10.
216
Idea: sample random point (X, Y) in rectangle enclosing f(x),
with 0 ≤ X ≤ 10, 0 ≤ Y ≤ 2e^{−2} (using uniform distribution):
• if point below density function (Y ≤ f(X)), take X as random
value from distribution
• otherwise, reject (X,Y ) pair and try again.
Chance of X-value being accepted proportional to density f(X):
when value more likely in distribution, more likely to be accepted.
217
Example:
X 7.3 1.0 2.7 1.7 9.4 5.5
Y 0.206 0.130 0.023 0.256 0.197 0.203
f(X) 0.018 0.184 0.245 0.264 0.004 0.062
reject y n n n y y
Values 7.3, 9.4, 5.5 rejected; 1.0, 2.7, 1.7 random values from
Gamma(3, 1).
Needed 12 random uniforms to generate 3 random gammas.
218
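The rectangle sampler above as a Python sketch (an illustration of the slides' scheme, including a replay of the accept/reject decisions from the table):

```python
import math
import random

# Rectangle rejection sampler for Gamma(3,1):
# propose X ~ Uniform[0,10], Y ~ Uniform[0, 2e^-2]; accept X if Y <= f(X).

def f(x):
    return x**2 / 2 * math.exp(-x)

CEILING = 2 * math.exp(-2)   # maximum of f, attained at x = 2

def gamma31():
    while True:
        x = random.uniform(0, 10)
        y = random.uniform(0, CEILING)
        if y <= f(x):
            return x

# replay the first three (X, Y) pairs from the table above
for x, y in [(7.3, 0.206), (1.0, 0.130), (2.7, 0.023)]:
    print("accept" if y <= f(x) else "reject")
# reject, accept, accept
```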
Can be made more sophisticated. Let g(x) be density function that
is easy to sample from, such that f(x) ≤ cg(x) for all x (choose
c). Above (truncating to [0, 10]), g(x) = 1/10, the Uniform[0, 10]
density, and c = 20e^{−2}, so cg(x) = 2e^{−2}.
Generate random value X from distribution with density g(x).
Generate random Y ∼ Uniform[0, cg(X)]. If Y ≤ f(X),
accept X ; otherwise, reject and try again.
Efficiency of rejection method greatest when cg(x) only slightly
greater than f(x); then, very little rejection.
219
Simulation in Minitab
Minitab can generate random values from many distributions (using
methods above or variations).
Basic procedure:
• Select Calc, Random Data
• Select desired distribution
• Fill in number of random values to generate
• Fill in (empty) column to store values
• Fill in parameters of distribution (if any)
• Click OK.
220
Examples: Uniform[0, 1], Bernoulli(0.4), Binomial(5, 0.4),
Exponential(2), Poisson(0.7), Normal(0, 1).
To generate random values from another distribution, generate
column of values from Uniform[0, 1], then use Calculator to create
desired values (p. 47–48 of manual).
Recall random values actually “pseudo-random”: starting at same
seed value gives same sequence of random values. Can set seed
value in Minitab (Calc, Set Base) to get reproducible random values.
221
Summary
• Simulation: generate “random” values from distribution.
• “Random” because actually obtained from formula, but gives
unpredictable results.
• Generate random Uniform[0, 1] by taking random integers,
divide by max.
• Use as building blocks for other distributions.
• Bernoulli, binomial, geometric: generate random trials using
random Uniform[0, 1] to generate success/failure.
222
• Inverse CDF: use random Uniform[0, 1] to get random F (x)
value; find x corresponding.
• Special method for normal distribution.
• Rejection method: pick random x, y; based on value of y,
decide whether to keep or reject x.
• Minitab.
223
Expectation
224
Introduction
Game: toss fair coin, win $2 for a head, lose $1 for a tail.
Amount you win is random variable W with
P(W = 2) = P(W = −1) = 1/2.
Could win or lose on any one play, but (a) winning and losing equally
likely, (b) amount won greater than amount lost.
Would probably play this game given chance, because expect to win
in long run, on average over many plays, even though anything
possible.
225
Expected value of random variable is its long-run average. For W
above, expect equal number of 2’s and −1’s, so expected value
would be
E(W) = (2 + (−1))/2 = 1/2.

Another: suppose Y = 7 always (ie. P(Y = 7) = 1,
P(Y = k) = 0 for k ≠ 7). Then E(Y) should be 7.
Another: roll 2 dice. Win $30 for double 6, lose $1 otherwise. Looks
good because potential win greater than potential loss, but win very
unlikely. How to balance? For winnings random variable V , what is
E(V )?
226
Expectation for discrete random variables
Define expected value (expectation) of random variable X:

E(X) = Σ_x x P(X = x),

“sum of value times probability”. Sum over all possible x.
Check for above examples:
E(W) = 2 · (1/2) + (−1) · (1/2) = 1/2

E(Y) = 7 · 1 = 7

E(V) = 30 · (1/36) + (−1) · (35/36) = −5/36
227
First 2 as expected.
For V, prob. of double 6 is 1/36, so chance of losing is 1 − 1/36. Even
though prize large (win $30 for double 6), E(V) < 0, so would lose
in long run, because the win probability is too small to make up for the
large prize.
Formula much easier than reasoning out – less thought!
Now suppose X ∼ Bernoulli(θ). What is E(X)?
X = 1 with prob θ, 0 with prob 1 − θ, so:
E(X) = 1 · θ + 0 · (1 − θ) = θ.
In long run, average X equal to success probability.
Makes sense (think of θ = 0 and θ = 1 as extreme cases).
228
Expectation for geometric and Poisson distributions
To find more complicated expectations, cleverness can be needed
to figure out sum.
Suppose Z ∼ Geometric(θ), so P(Z = k) = θ(1 − θ)ᵏ. Then

E(Z) = Σ_{k=0}^{∞} k θ(1 − θ)ᵏ = (1 − θ)/θ.
Method: write (1 − θ)E(Z) to look like E(Z) but with k − 1 in
place of k, subtract.
Mean is odds against success: if failure 4 times more likely than
success, on average get 4 failures before 1st success.
229
If X ∼ Poisson(λ), then

E(X) = Σ_{k=0}^{∞} k · e^{−λ} λᵏ / k!.

Note that the k = 0 term is 0, so start sum at k = 1, then let
l = k − 1 to get

E(X) = λ Σ_{l=0}^{∞} e^{−λ} λˡ / l!.

The sum is of all the probabilities from a Poisson distribution, so is
1. (Or, Σ_{l=0}^{∞} λˡ/l! is the Maclaurin series for e^λ.)
So for X ∼ Poisson(λ), E(X) = λ. Thus parameter λ in fact
mean.
230
St Petersburg Paradox
Game: toss fair coin, let Z be #tails before 1st head. Win 2^Z
dollars. Thus for TTTH, win 2³ = $8. Expected winnings (fair price
to pay to play)?

Σ_{k=0}^{∞} 2ᵏ · (1/2)ᵏ · (1/2) = Σ_{k=0}^{∞} 1/2 = ∞.
How can this be? Only ever win finite amount.
Play game 10 times:
Z 0 1 0 0 3 0 3 0 6 1
Winnings 1 2 1 1 8 1 8 1 64 2
Mean winnings $8.90, larger than actual winnings 90% of time!
231
Problem is that any one big payoff completely dominates average,
and by playing game enough times, can make it very likely that a
very big payoff will occur.
If there is a maximum payoff, say $2³⁰, expectation finite ($15.50).
When random variable can be arbitrarily large, expectation may not
be finite. But can be finite – compare Poisson, where probabilities
decrease faster than values increase. Similarly, lotteries with very
big prizes still have expected winnings less than ticket price
(because chance of winning big prize small enough).
232
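A short Python sketch of both points: the capped expectation (counting payoffs up to 2³⁰, as on the slides, each term contributes exactly 1/2), and a few simulated plays showing how rare big payoffs dominate the sample mean:

```python
import random

# Capped St Petersburg expectation: sum_{k=0}^{30} 2^k * (1/2)^(k+1),
# each term exactly 1/2, giving $15.50.
capped_expectation = sum(2**k * 0.5**(k + 1) for k in range(31))
print(capped_expectation)   # 15.5

def play():
    # Z = number of tails before the first head; payoff 2^Z
    z = 0
    while random.random() < 0.5:
        z += 1
    return 2**z

random.seed(3)
winnings = [play() for _ in range(10)]
print(winnings)   # occasional large payoffs dominate the mean
```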
Summary
• Expectation is long-run average value of random variable.
• For discrete, is sum of value times probability.
• If Z ∼ Geometric(θ), E(Z) = (1 − θ)/θ.
• If X ∼ Poisson(λ), E(X) = λ.
• When r. v. can be arbitrarily large, expectation may not be finite
(St Petersburg), but can be (Poisson).
233
Utility and Kelly betting
In St Petersburg paradox, expectation didn’t tell story, because “fair
price” ought to be finite. Changing game by a little changed
expected winnings a lot.
Most bets look like this: win known $w if you win, lose $1 if you lose.
Suppose probability of winning is θ. Then expectation is
E = wθ + (−1)(1 − θ) = θ(w + 1) − 1
which is positive if θ > 1/(w + 1).
For instance, if w = 2, E > 0 if θ > 1/3. That is, if you believe
your chance of winning is better than 1/3, you should bet because in
long run you win more than you lose.
234
If bet more than $1, wins and losses increase in proportion: on bet
of $b, win $wb or lose $b.
Positive expectation seems to say “bet everything you have”: far too
risky for most! Always possibility of losing.
Idea: consider utility of money, not same as money itself. If you
only have $10, $1 is a lot of money (has great utility), but if you have
$1 million, $1 almost meaningless.
Utility of money varies between people, but could be proportional to
current fortune. Then, utility of money depends on log of $ amount.
235
Suppose we currently have $c, and want to choose b for bet above,
assuming all else known. Then fortune after the bet is F = c + bw
if we win (prob θ), F = c − b if we lose (prob 1 − θ). Utility idea:
choose b to maximize E(lnF ):
E(lnF ) = θ ln(c + bw) + (1 − θ) ln(c − b).
Take derivative (with respect to b), set to 0:

dE(lnF)/db = θw/(c + bw) − (1 − θ)/(c − b)
           = (θw(c − b) − (1 − θ)(c + bw)) / ((c + bw)(c − b)).

Zero when numerator zero; solve for b to get

b = c{θ(w + 1) − 1}/w = cE/w.
This is called the Kelly bet . (If negative, don’t bet anything!)
236
Examples, with c = 100:

• w = 9, θ = 1/8. E = θ(w + 1) − 1 = 0.25, so Kelly bet
b = 100(0.25)/9 = $2.78.

• w = 1.5, θ = 1/2. E = 0.25 again; Kelly bet
b = 100(0.25)/1.5 = $16.67.
Note: expected winnings same in both cases, but bet less when
w = 9: more risk because less likely to win.
In general, bet fraction of current fortune that is bigger when
expected winnings bigger and chance of winning bigger.
237
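The Kelly formula and both examples as a Python sketch (illustration; function name is not from the slides):

```python
# Kelly bet: with fortune c, payoff w per $1 bet, win probability theta,
# bet b = c*E/w where E = theta*(w+1) - 1; bet nothing if E <= 0.

def kelly_bet(c, w, theta):
    e = theta * (w + 1) - 1   # expected winnings per $1 bet
    return max(0.0, c * e / w)

print(round(kelly_bet(100, 9, 1/8), 2))     # 2.78
print(round(kelly_bet(100, 1.5, 1/2), 2))   # 16.67
print(kelly_bet(100, 2, 0.25))              # 0.0 (negative expectation)
```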
Expectation of functions of random variables
In St Petersburg problem above, random variable was number of
tails Z, but winnings 2^Z. In effect, found that E(2^Z) was infinite.
Method: sum values of 2^Z times probability.

Formally: let g(X) be some function of random variable X. Then

E(g(X)) = Σ_x g(x) P(X = x).
238
Linearity of expected values
Suppose we have two random variables X,Y . What is
E(X + Y )?
Go back to definition, bearing in mind that X,Y might be related,
so have to use joint probability function:
E(X + Y) = Σ_x Σ_y (x + y) P(X = x, Y = y)
         = Σ_x x P(X = x) + Σ_y y P(Y = y)
         = E(X) + E(Y).

Details: expand out (x + y) in first sum, recognize (eg.) that
Σ_y P(X = x, Y = y) = P(X = x) (marginal distribution).
239
Same logic shows that E(aX + bY ) = aE(X) + bE(Y ).
Likewise,
E(X1 + X2 + · · · + Xn) = E(X1) + E(X2) + · · · + E(Xn).
Also, if Y = 1 always, we get E(aX + b) = aE(X) + b.
240
Expectation for binomial distribution
If Y ∼ Binomial(n, θ), then Y actually sum of Bernoullis:
Y = X1 + X2 + · · · + Xn, where Xi ∼ Bernoulli(θ).
Know that E(Xi) = θ, so (by result on previous page)
E(Y ) = θ + θ + · · · + θ = nθ.
Makes sense: eg. if you succeed on one-third of trials on average
(θ = 1/3), and you have n = 30 trials, you’d expect 10 successes,
and nθ = 10.
241
Independence and E(XY )
Since E(X + Y ) = E(X) + E(Y ) for all X and Y , tempting to
claim that E(XY ) = E(X)E(Y ). But is this true?
Consider this joint distribution:
        Y = 1   Y = 2   Total
X = 0    1/3     1/6     1/2
X = 1    1/4     1/4     1/2
Total    7/12    5/12     1

Using marginal distributions, E(X) = 1/2 and E(Y) = 17/12. What is
E(XY)?
242
When X = 0, XY = 0 for all Y. So P(XY = 0) = 1/3 + 1/6 = 1/2.
XY = 1 when X = 1, Y = 1, so P(XY = 1) = 1/4. Likewise,
XY = 2 when X = 1, Y = 2, so P(XY = 2) = 1/4. Hence

E(XY) = 0 · (1/2) + 1 · (1/4) + 2 · (1/4) = 3/4.

But

E(X)E(Y) = (1/2) · (17/12) = 17/24 ≠ 3/4.

So E(XY) ≠ E(X)E(Y) in general.
243
But what if X,Y independent? Then
E(XY) = Σ_x Σ_y xy P(X = x) P(Y = y) = E(X)E(Y),
rearranging, because joint prob is product of marginals.
So, if X,Y independent, then E(XY ) = E(X)E(Y ), but not
necessarily otherwise.
See later (in “covariance”) that difference E(XY ) − E(X)E(Y )
measures extent of non-independence of X and Y .
244
Monotonicity of expectation
Suppose X,Y discrete random variables such that X ≤ Y . (That
is, for any event giving X = x and Y = y, x ≤ y always.
Example: roll 2 dice, let X be score on 1st die, Y be total score on 2
dice.)
How do E(X), E(Y ) compare?
Idea: let Z = Y − X . Then Z ≥ 0, discrete, and
E(Z) = Σ_{z≥0} z P(Z = z). All terms in sum positive or 0, so
E(Z) ≥ 0. But E(Z) = E(Y − X) = E(Y ) − E(X). Hence
E(Y ) − E(X) ≥ 0.
Conclusion: if X ≤ Y , then E(X) ≤ E(Y ).
245
Expectation for continuous random
variables
Can’t use formula
E(X) = Σ_x x P(X = x)
because probability of particular value not meaningful for continuous
X .
Standard procedure: replace probability by density function, replace
sum by integral.
246
That is, if X continuous random variable, define
E(X) = ∫_{−∞}^{∞} x f(x) dx.
In integral, replace infinite limits by actual upper and lower limits.
247
Examples
Suppose X ∼ Uniform[0, 1], so f(x) = 1, 0 ≤ x ≤ 1. Then

E(X) = ∫₀¹ x · 1 dx = [x²/2]₀¹ = 1/2.
As you would have guessed.
Suppose W ∼ Exponential(λ). Then
E(W) = ∫₀^∞ w λe^{−λw} dw.

Integrate by parts with u = w, v′ = λe^{−λw}: E(W) = 1/λ.
If W represents time between events, E(W ) in units of time, so λ
in units of 1 / time: a rate, number of events per unit time.
248
Suppose Z ∼ N(0, 1), so f(z) = (1/√(2π)) e^{−z²/2}. Then

E(Z) = ∫_{−∞}^{∞} (1/√(2π)) z e^{−z²/2} dz.

Replacing z by −z gives negative of function in integral, ie. the
integrand z f(z) is an odd function. Hence integral is 0, so E(Z) = 0.
(Alternative: substitute u = z²/2.)
249
As for discrete, expectation may not be finite.
f(x) = 1/x², x ≥ 1 is a proper density, but for random variable X
with this distribution:

E(X) = ∫₁^∞ x · (1/x²) dx = ∫₁^∞ (1/x) dx = [ln x]₁^∞ = ∞.
Problem: though density decreases as x increases, does not do so
fast enough to make E(X) integral converge.
250
Properties of expectation for continuous random
variables
These are same as for discrete variables. Proofs use integrals and
densities not sums, but otherwise very similar. Suppose X has
density fX(x) and X,Y have joint density fX,Y (x, y):
• E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx

• E(h(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) fX,Y(x, y) dx dy.
• E(aX + bY ) = aE(X) + bE(Y )
• If X,Y independent, then E(XY ) = E(X)E(Y )
• If X ≤ Y , then E(X) ≤ E(Y ).
251
Expectations for general uniform and normal
distributions
Suppose X ∼ Uniform[a, b]. Then
U = (X − a)/(b − a) ∼ Uniform[0, 1], so E(U) = 1/2.
Write in terms of X: X = a + (b − a)U, so
E(X) = a + (b − a)E(U) = (a + b)/2. Again as expected.

Now suppose X ∼ Normal(µ, σ²). Then
Z = (X − µ)/σ ∼ N(0, 1). Write X = µ + σZ; then
E(X) = µ + σE(Z) = µ + σ(0) = µ.
That is, parameter µ in normal distribution is the mean.
252
Summary
• Utility of money may be proportional to current fortune: depends
on log of fortune.
• With current fortune c, bet with winnings w, expected value E,
utility maximized by betting amount cE/w if E positive (Kelly
bet).
• Function g(X) of random variable X:
E(g(X)) = Σ_x g(x) P(X = x).
• Linearity: E(X + Y ) = E(X) + E(Y ) always; also
E(aX + b) = aE(X) + b.
• If Y ∼ Binomial(n, θ), E(Y ) = nθ.
253
• If X,Y independent, E(XY ) = E(X)E(Y ), but not
necessarily otherwise.
• If X ≤ Y , then E(X) ≤ E(Y ).
• For continuous X, E(X) = ∫_{−∞}^{∞} x fX(x) dx.

• If X ∼ Uniform[0, 1], E(X) = 1/2.
• If W ∼ Exponential(λ), E(W ) = 1/λ.
• Expectation for discrete and continuous distributions has same
properties.
• If X ∼ Uniform[a, b], E(X) = (a + b)/2.

• If X ∼ N(µ, σ²), E(X) = µ.
254
Variance, covariance and correlation
Compare random variables:
Z = 10 with prob 1; Y = 5 or 15, each with prob 1/2.

E(Z) = E(Y) = 10, but Y further from mean than Z.

Expectation only gives long-run average of random variable, not how
much higher/lower than average it could be. For this, use variance:

Var(X) = E[(X − µX)²], where µX = E(X).
255
For discrete X, Var(X) = Σ_x (x − µX)² P(X = x). So:

Var(Z) = (10 − 10)² · 1 = 0;

Var(Y) = (5 − 10)² · (1/2) + (15 − 10)² · (1/2) = 25.
Here, Var(Y ) > Var(Z) because Y tends to be further from its
mean than Z does.
(Here, Y always further from mean than Z . But in general,
Var(Y ) > Var(Z) means Y likely to be further from mean than
Z .)
256
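The two variances above can be computed straight from the definition; a Python sketch (distributions written as value-to-probability dictionaries, an illustrative choice):

```python
# Var(X) = sum of (x - mu)^2 * P(X = x), with mu = E(X).

def expectation(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    mu = expectation(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

Z = {10: 1.0}
Y = {5: 0.5, 15: 0.5}

print(variance(Z))   # 0.0
print(variance(Y))   # 25.0
```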
More about variance
Because (X − µX)² ≥ 0, Var(X) ≥ 0 for all random variables
X.
Var(X) = 0 only if X does not vary (compare Z). No upper limit
on variance; larger variance means more unpredictable (can get
further from mean).
Why square? Cannot just omit: E(X − µX) = E(X) − µX = 0
always. Absolute value E(|X − µX |) possible, but hard to work
with (not differentiable).
257
Standard deviation
If random variable X in metres, Var(X) in metres-squared. For
interpretation, suggests using square root of variance:
SD(X) = √Var(X),

which would be in metres. Called standard deviation of X.
SD easier for interpretation, variance easier for algebra.
258
Variance of Bernoulli
If X ∼ Bernoulli(θ), E(X) = θ, and

Var(X) = Σ_x (x − θ)² P(X = x)
       = (1 − θ)²θ + (0 − θ)²(1 − θ)
       = θ(1 − θ)(1 − θ + θ) = θ(1 − θ).

This is 0 if θ = 0, 1 (when results completely predictable) and
maximum, 1/4, when θ = 1/2.
259
Useful properties of variance
Var(aX + b) = a² Var(X).

Because variance in squared units, changing X eg. from metres to
feet multiplies variance not by 3.3 but by that squared.

Also, adding b changes mean of X, but doesn’t change how spread
out distribution is (shifts left/right).

Var(X) = E(X²) − µX².

Useful result for finding variances in practice, since E(X²) not
usually too hard.
260
Proofs: use definition of variance as expectation, then rules of
expectation.
Bernoulli revisited: E(X²) = 1²θ + 0²(1 − θ) = θ, so
Var(X) = θ − θ² = θ(1 − θ) as before.
261
Variance of exponential distribution
For continuous distributions, find E(X²) or variance using integral.

W ∼ Exponential(λ): already know E(W) = 1/λ. Find
Var(W) by first finding E(W²), using integration by parts:

E(W²) = ∫₀^∞ w²λe^{−λw} dw = [−w²e^{−λw}]₀^∞ + (2/λ) ∫₀^∞ wλe^{−λw} dw.

Square brackets 0; integral is E(W) = 1/λ. Hence
E(W²) = (2/λ)(1/λ) = 2/λ², and

Var(W) = 2/λ² − (1/λ)² = 1/λ².
For exponential distribution, variance is square of mean.
262
Variance of normal random variable
Suppose Z ∼ N(0, 1). Know that E(Z) = 0, so
Var(Z) = E(Z²) − 0² = E(Z²). Thus

Var(Z) = ∫_{−∞}^{∞} z² (1/√(2π)) e^{−z²/2} dz.

To tackle by parts: let u = z/√(2π), v′ = ze^{−z²/2}. v′ has
antiderivative v = −e^{−z²/2}. Gives
263
Var(Z) = [−(z/√(2π)) e^{−z²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz.

Square bracket 0 (e^{−z²/2} → 0 very fast); integral that of density of
Z, so 1. Hence Var(Z) = 1.

Suppose now X ∼ N(µ, σ²). Then Z = (X − µ)/σ, so
X = µ + σZ. So Var(X) = σ² Var(Z) = σ². That is,
parameter σ² in normal distribution is variance.
264
Summary
• Variance says how far r. v. is from its expectation:
Var(X) = E[(X − µX)²].

• 0 ≤ Var(X), but no upper limit.

• Standard deviation SD(X) = √Var(X) in same units as X.

• If X ∼ Bernoulli(θ), Var(X) = θ(1 − θ).

• Var(aX + b) = a² Var(X); Var(X) = E(X²) − µX².

• If W ∼ Exponential(λ), Var(W) = 1/λ².

• If X ∼ N(µ, σ²), Var(X) = σ².
265
Covariance
Consider discrete joint distribution:
Y = 1 Y = 2 sum
X = 0 0.4 0.2 0.6
X = 1 0.1 0.3 0.4
sum 0.5 0.5
If X = 0, Y more likely to be small; if X = 1, Y more likely to be
large. X,Y vary together.
Idea: covariance Cov(X,Y ) = E[(X − µX)(Y − µY )].
266
Here, µX = E(X) = 0.4, µY = E(Y ) = 1.5, so take all
combinations of (X − µX , Y − µY ) values and their probs:
Cov(X,Y )
= (0 − 0.4)(1 − 1.5)(0.4) + (0 − 0.4)(2 − 1.5)(0.2)
+ (1 − 0.4)(1 − 1.5)(0.1) + (1 − 0.4)(2 − 1.5)(0.3)
= 0.08 − 0.04 − 0.03 + 0.09 = 0.10.
Result positive. (X,Y ) combinations where (X − µX)(Y − µY )
positive outweigh those where negative. That is, when X large, Y
more likely to be large as well (and small with small).
Covariance can be negative: then large X goes with small Y and
vice versa. Covariance 0: no trend.
267
Calculating covariances
Useful formula:
Cov(X,Y ) = E(XY ) − E(X)E(Y ).
Proof: definition of covariance, properties of expectation.
Previous example revisited:

E(XY) = (0)(1)(0.4) + (0)(2)(0.2) + (1)(1)(0.1) + (1)(2)(0.3) = 0.7;

Cov(X, Y) = 0.7 − (0.4)(1.5) = 0.1.
As with corresponding variance formula, useful for calculations.
268
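Both routes to the covariance for the joint distribution above, as a Python sketch (the joint probability function stored as a dictionary, an illustrative choice):

```python
# Covariance two ways: by definition E[(X-muX)(Y-muY)],
# and by the shortcut E(XY) - E(X)E(Y).

joint = {(0, 1): 0.4, (0, 2): 0.2, (1, 1): 0.1, (1, 2): 0.3}

mu_x = sum(x * p for (x, y), p in joint.items())
mu_y = sum(y * p for (x, y), p in joint.items())
cov_def = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())

print(round(mu_x, 2), round(mu_y, 2))   # 0.4 1.5
print(round(cov_def, 2))                # 0.1
print(round(e_xy - mu_x * mu_y, 2))     # 0.1
```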
Covariance and independence
If X,Y independent, then E(XY ) = E(X)E(Y ), so
Cov(X,Y ) = E(XY ) − E(X)E(Y ) = 0.
But covariance could be 0 without independence. Example:
(X, Y) = (−1, 1), (0, 0), (1, 1), each prob 1/3. E(X) = 0,
E(Y) = 2/3, E(XY) = (−1)(1/3) + (0)(1/3) + (1)(1/3) = 0, so
Cov(X, Y) = 0 − (0)(2/3) = 0. But X, Y not independent: given
X, know Y exactly.
Relationship between X,Y not a trend: as X increases, Y
decreases then increases. No general statement about Y
large/small as X increases.
Fact: if X,Y bivariate normal, covariance 0 implies independence.
269
Variance of sum
Previously found that E(X + Y ) = E(X) + E(Y ) for all X,Y .
Corresponding formula for variances?
Derive formula for Var(X + Y ) by writing as expectation,
expanding out square, recognizing terms:
Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X,Y ).
Logic: if Cov(X,Y ) > 0, X,Y big/small together, sum could be
very big/small, variance large. If Cov(X,Y ) < 0, large X
compensates small Y and vice versa, sum of moderate size,
variance small.
If X,Y independent, then Var(X + Y ) = Var(X) + Var(Y ).
270
Variance of binomial distribution
Suppose X ∼ Binomial(n, θ). Then can write
X = Y1 + Y2 + · · · + Yn,
where Yi ∼ Bernoulli(θ) independently. So
Var(X) = Var(Y1) + Var(Y2) + · · · + Var(Yn)
= θ(1 − θ) + θ(1 − θ) + · · · + θ(1 − θ)
= nθ(1 − θ).
Variance increases as n increases (fixed θ) because range of
possible #successes becomes wider.
271
Correlation
Covariance hard to interpret. Eg. size of positive covariance says
little about strength of X, Y relationship.
Suppose X height (metres), Y weight (kg). Units of covariance m
× kg. Measure height in inches, weight in lbs: covariance in
different units.
Try for scale-free quantity. Covariance measures how X, Y vary
together: suggests use of variances. Var(X) in m², Var(Y) in kg², so
right scaling is by square root of each. Define correlation:

Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).
272
Example: (X, Y) = (0, 1), (1, 3), each prob 1/2.

E(X) = 0.5, E(Y) = 2; XY = 0, 3 each prob 1/2, so
Cov(X, Y) = 3/2 − (0.5)(2) = 1/2.

Also, Var(X) = 1/4, Var(Y) = 1, so

Corr(X, Y) = (1/2) / √((1/4)(1)) = 1.
When X larger (1 vs. 0), Y also larger (3 vs. 1) for certain: a perfect
trend. So this should be largest possible correlation.
(Proof later: Cauchy–Schwarz inequality.)
273
More about correlation
Smallest possible correlation is −1, when larger X always goes
with smaller Y (eg. (X, Y) = (0, 1), (1, −3), each prob 1/2).
If X,Y independent, covariance 0, so correlation 0 also.
In-between values represent in-between trends. Eg.
Corr(X,Y ) = 0.5: larger X with larger Y most of the time, but
not always.
Correlation actually measures extent of linear relationship between
random variables. X,Y in example related by Y = 2X + 1.
Perfect nonlinear relationship won’t give correlation ±1.
274
Viewing correlation by simulation
Useful to have sense of what correlation “looks like”.
Generate random normals with required correlation, plot.
Suppose X, Y ∼ N(0, 1) independently. Then use X and
Z = αX + Y for suitable choice of α: correlated if α ≠ 0
because X in both. Can show Cov(X, αX + Y) = α and
Corr(X, αX + Y) = α/√(1 + α²).

Choose α to get desired correlation ρ: α = ρ/√(1 − ρ²).
275
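The construction above in Python (an illustration; the plots on the following slides were produced in this spirit, though not with this code, and `sample_corr` is just the usual sample correlation):

```python
import math
import random

# Pairs (X, Z) with correlation rho: X, Y ~ N(0,1) independent,
# Z = alpha*X + Y, alpha = rho / sqrt(1 - rho^2).

def correlated_pairs(rho, n):
    alpha = rho / math.sqrt(1 - rho**2)
    return [(x, alpha * x + y)
            for x, y in ((random.gauss(0, 1), random.gauss(0, 1))
                         for _ in range(n))]

def sample_corr(pairs):
    n = len(pairs)
    mx = sum(x for x, z in pairs) / n
    mz = sum(z for x, z in pairs) / n
    sxz = sum((x - mx) * (z - mz) for x, z in pairs)
    sxx = sum((x - mx) ** 2 for x, z in pairs)
    szz = sum((z - mz) ** 2 for x, z in pairs)
    return sxz / math.sqrt(sxx * szz)

random.seed(1)
r = sample_corr(correlated_pairs(0.8, 50_000))
print(round(r, 2))   # close to 0.8
```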
Correlation 0.95:
[scatterplot of z vs. x]
276
Correlation -0.8:
[scatterplot of z vs. x]
277
Correlation 0.5:
[scatterplot of z vs. x]
278
Correlation -0.2:
[scatterplot of z vs. x]
279
Summary
• Covariance Cov(X,Y ) = E[(X − µX)(Y − µY )] =
E(XY ) − E(X)E(Y ).
• Can be + or −; + means larger X tends to go with larger Y .
• If X,Y independent, then Cov(X,Y ) = 0.
• If X,Y bivariate normal and Cov(X,Y ) = 0, then X,Y
independent.
• Can be other X,Y with covariance 0 but not independent.
280
• Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X,Y ); if
X,Y independent, then Var(X + Y ) = Var(X) + Var(Y ).
• If X ∼ Binomial(n, θ), then Var(X) = nθ(1 − θ).
• Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)); between
−1 and 1.
• Can use simulation to get picture of different-sized correlations.
281
Moment-generating functions
Means and variances (and eg. E(X³)) can be messy: each one
needs an integral (sum) to be solved. Would be nice to have function
that gives E(Xᵏ) more easily than by integration (summing).

Consider mX(s) = E(e^{sX}). Function of s.

Maclaurin series for exp function:

mX(s) = E(1) + sE(X) + (s²/2!)E(X²) + (s³/3!)E(X³) + · · · .
282
Differentiate both sides (as function of s):

m′X(s) = E(X) + sE(X²) + (s²/2!)E(X³) + · · ·

Putting s = 0 gives m′X(0) = E(X). Differentiate again:

m′′X(s) = E(X²) + sE(X³) + · · ·

so that m′′X(0) = E(X²).

By same process, find E(Xᵏ) by differentiating mX(s) k times,
and setting s = 0. Differentiating easier than integrating!

E(Xᵏ) called k-th moment of distribution of X; function mX(s),
used to get moments, called moment generating function for X.
283
If X discrete,

mX(s) = E(e^(sX)) = Σ_x e^(sx) P(X = x),

and if X continuous,

mX(s) = E(e^(sX)) = ∫_{−∞}^{∞} e^(sx) fX(x) dx.
284
Examples of moment generating functions
Bernoulli is easiest of all:
mX(s) = e^(s·0) P(X = 0) + e^(s·1) P(X = 1) = 1 − θ + θe^s.

So:

m′X(s) = θe^s ⇒ E(X) = θ
m′′X(s) = θe^s ⇒ E(X²) = θ

and indeed E(X^k) = θ for all k. Also,

Var(X) = E(X²) − [E(X)]² = θ − θ² = θ(1 − θ).
285
Now try X ∼ Exponential(λ), continuous:

mX(s) = E(e^(sX)) = ∫₀^∞ e^(sx) λe^(−λx) dx = λ(λ − s)^(−1)

after some algebra. (Requires s < λ.)
m′X(s) = λ(λ − s)^(−2), so E(X) = m′X(0) = 1/λ.
m′′X(s) = 2λ(λ − s)^(−3), so E(X²) = m′′X(0) = 2/λ². Hence

Var(X) = 2/λ² − (1/λ)² = 1/λ².
286
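A quick numerical sanity check on the exponential example above: approximate m′(0) and m′′(0) by finite differences instead of exact differentiation (λ = 3 is an arbitrary choice for illustration):

```python
# mgf of Exponential(lam) is m(s) = lam/(lam - s) for s < lam;
# central finite differences at s = 0 approximate the derivatives.
lam = 3.0  # assumed rate, chosen arbitrarily for the illustration

def m(s):
    """mgf of Exponential(lam)."""
    return lam / (lam - s)

h = 1e-5
first = (m(h) - m(-h)) / (2 * h)             # approximates m'(0) = E(X) = 1/lam
second = (m(h) - 2 * m(0) + m(-h)) / h**2    # approximates m''(0) = E(X^2) = 2/lam^2
var = second - first**2                      # Var(X) = 1/lam^2
```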
More about moment-generating functions
If X ∼ Poisson(λ), then

mX(s) = e^(λ(e^s − 1)).

If X ∼ N(0, 1), then

mX(s) = e^(s²/2).
Facts:
• mX+Y (s) = mX(s)mY (s). (Mgf of sum is product of
moment-generating functions.)
• maX+b(s) = ebsmX(as). (Mgf of linear function related to
mgf of original random variable.)
287
Proofs from definition.
First result very useful: distribution of sum very difficult to find, but
can get moments for sum much more easily.
If X ∼ Binomial(n, θ), then X = Y1 + Y2 + · · · + Yn where
each Yi ∼ Bernoulli(θ). Hence

mX(s) = [mYi(s)]^n = (1 − θ + θe^s)^n.

If X ∼ N(µ, σ²), X = µ + σZ where Z ∼ N(0, 1). Thus

mX(s) = mσZ+µ(s) = e^(µs) mZ(σs) = e^(µs + σ²s²/2).
288
Using mgfs to recognize distributions
Important result, called uniqueness theorem . Suppose X has mgf
finite for −s0 < s < s0; suppose mX(s) = mY (s) for
−s0 < s < s0. Then X , Y have same distribution.
In other words: if mgf of X is that of known distribution, then X
must have that distribution.
Example: X, Y ∼ Poisson(λ) independently. X + Y has mgf

mX+Y(s) = {e^(λ(e^s − 1))}² = e^(2λ(e^s − 1)).

This is the mgf of Poisson(2λ), so X + Y ∼ Poisson(2λ).
289
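The Poisson example can be checked numerically: by the uniqueness theorem the distribution of X + Y is determined by its mgf, and indeed the convolution of two Poisson(λ) pmfs matches the Poisson(2λ) pmf term by term (a sketch, with λ = 1.5 chosen arbitrarily):

```python
# Convolution of two Poisson(lam) pmfs vs the Poisson(2*lam) pmf.
import math

def pois_pmf(lam, j):
    """P(X = j) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**j / math.factorial(j)

lam = 1.5
# P(X + Y = k) by direct convolution; the inner sum is finite (j = 0..k)
conv = [sum(pois_pmf(lam, j) * pois_pmf(lam, k - j) for j in range(k + 1))
        for k in range(10)]
direct = [pois_pmf(2 * lam, k) for k in range(10)]
```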
Summary
• Moment generating function mX(s) = E(esX) is function of s
(sum for discrete, integral for continuous).
• Get E(Xk) (k-th moment) by differentiating mX(s) k times
(wrt s), put s = 0.
• If X ∼ Bernoulli(θ), mX(s) = 1 − θ + θe^s.
• If X ∼ Exponential(λ), mX(s) = λ/(λ − s) for s < λ.
• If X ∼ Poisson(λ), mX(s) = e^(λ(e^s − 1)).
• If Z ∼ N(0, 1), mZ(s) = e^(s²/2).
290
• mX+Y (s) = mX(s)mY (s).
• maX+b(s) = ebsmX(as).
• Last two results lead to mgf’s for binomial and normal.
• Uniqueness theorem: if X and Y have the same mgf, they have the same
distribution.
291
Conditional Expectation
Consider this joint distribution (Ex. 3.5.2):
        X = 5   X = 8   sum
Y = 0   1/7     3/7     4/7
Y = 3   1/7     0       1/7
Y = 4   1/7     1/7     2/7
sum     3/7     4/7
X,Y related: if Y = 0, then X more likely to be 8.
292
Suppose Y = 3. Then P(X = 5|Y = 3) = (1/7)/(1/7) = 1,
P(X = 8|Y = 3) = 0/(1/7) = 0. If Y = 3, then X certain to be
5, so E(X|Y = 3) = 5.
Now suppose Y = 4:

P(X = 5|Y = 4) = (1/7)/(1/7 + 1/7) = 1/2 = P(X = 8|Y = 4).

If Y = 4, then average X is E(X|Y = 4) = 5 · (1/2) + 8 · (1/2) = 6.5.
Likewise, E(X|Y = 0) = 7.25.
293
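The conditional expectations above can be computed mechanically from the joint table; a small Python sketch using exact fractions (the dictionary layout is mine):

```python
# Joint distribution of Ex. 3.5.2 stored as {(x, y): probability}.
from fractions import Fraction as F

joint = {(5, 0): F(1, 7), (8, 0): F(3, 7),
         (5, 3): F(1, 7), (8, 3): F(0, 7),
         (5, 4): F(1, 7), (8, 4): F(1, 7)}

def cond_exp_X(y):
    """E(X | Y = y): average X over the conditional distribution of X given Y = y."""
    py = sum(p for (x, yy), p in joint.items() if yy == y)  # marginal P(Y = y)
    return sum(x * p / py for (x, yy), p in joint.items() if yy == y)
```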
These expectations from conditional distribution called conditional
expectations . E(X|Y = y) varies from 5 to 7.25 depending on
value of Y ; “on average, X depends on Y ”.
In general, if X,Y related, then mean of X depends on Y .
Calculate conditional distribution of X|Y , find X-expectation. This
is conditional expectation.
294
Conditional expectation: continuous case
Same principle: find expectation of conditional distribution. Now use
joint and marginal densities to find conditional density; then
integrate to get expectation.
Example: fX,Y(x, y) = 4x²y + 2y⁵, 0 ≤ x, y ≤ 1.
Conditional density fX|Y(x|y) = fX,Y(x, y)/fY(y). So first find
marginal density fY(y) by integrating out x from the joint density:
fY(y) = (4/3)y + 2y⁵. Has no x. Hence

fX|Y(x|y) = (4x²y + 2y⁵) / ((4/3)y + 2y⁵).
295
Note: x appears only in the numerator, so the integral is not so hard. Thus

E(X|Y = y) = ∫₀¹ x · (4x²y + 2y⁵)/((4/3)y + 2y⁵) dx = (1 + y⁴)/(4/3 + 2y⁴).

Depends slightly on y: E(X|Y = 0) = 0.75,
E(X|Y = 0.5) = 0.729, E(X|Y = 1) = 0.6. As Y increases,
X decreases, on average.
296
Conditional expectations as random variables
Without a particular Y-value in mind, can define E(X|Y) by taking
E(X|Y = y) and replacing y by Y. Above example:

E(X|Y) = (1 + Y⁴)/(4/3 + 2Y⁴).
This kind of conditional expectation is random variable (function of
random variable Y ).
297
As a random variable, E(X|Y) must have an expectation,
E[E(X|Y)]. What is it? Directly, as a function of y:

E[E(X|Y)] = ∫₀¹ E(X|Y = y) fY(y) dy = 2/3

(much cancellation). Now: marginal density of X is
fX(x) = 2x² + 1/3 (integrate out y from the joint density), so

E(X) = ∫₀¹ x (2x² + 1/3) dx = 2/3 = E[E(X|Y)].
Not a coincidence. Illustrates theorem of total expectation :
E[E(X|Y )] = E(X). In words: effect of varying Y is to change
E(X|Y ), but E[E(X|Y )] averages out these effects, leaving only
overall average of X .
298
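A numerical check of the theorem of total expectation for this example, integrating E(X|Y = y)fY(y) by a simple midpoint rule (the grid size is an arbitrary choice):

```python
# Total expectation check for f(x, y) = 4x^2 y + 2y^5 on the unit square.
N = 2000
h = 1.0 / N
mids = [(i + 0.5) * h for i in range(N)]   # midpoints of [0,1] subintervals

f_Y = lambda y: (4/3) * y + 2 * y**5                 # marginal density of Y
cond_exp = lambda y: (1 + y**4) / (4/3 + 2 * y**4)   # E(X | Y = y)

lhs = sum(cond_exp(y) * f_Y(y) for y in mids) * h    # E[E(X|Y)]
rhs = sum(x * (2 * x**2 + 1/3) for x in mids) * h    # E(X) from marginal of X
```

Both sums come out near 2/3, as the slide's exact calculation shows.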
Conditional variance
Conditional variance is variance of conditional distribution.
Return to previous discrete example:
        X = 5   X = 8   sum
Y = 0   1/7     3/7     4/7
Y = 3   1/7     0       1/7
Y = 4   1/7     1/7     2/7
sum     3/7     4/7
If Y = 3, X certain to be 5, so Var(X|Y = 3) = 0.
But if Y = 4, X equally likely 5 or 8; Var(X|Y = 4) = 2.25.
299
(Calculation: E(X|Y = 4) = 6.5, E(X²|Y = 4) = 44.5,
Var(X|Y = 4) = 44.5 − (6.5)² = 2.25.)
Another expression of how Y affects X . If know Y = 3, know X
exactly, but if Y = 4, more uncertain about possible X .
300
Summary
• Conditional expectation E(X|Y = y) gives “average” X for
given Y .
• Calculate from conditional distribution of X|Y = y.
• Same way: define conditional expectation E(X|Y ) as random
variable (depends on Y ).
• Total expectation: E[E(X|Y)] = E(X).
• Conditional variance Var(X|Y = y) is variance of conditional
distribution of X given Y = y. Expresses how variable X is for
different Y .
301
Inequalities relating probability, mean and
variance
Mean and variance closely related to probabilities. There are general
relationships true for a wide range of random variables and
distributions.
Markov inequality: If X cannot be negative, then

P(X ≥ a) ≤ E(X)/a.
In words: if mean small, X unlikely to be very large.
302
Chebychev inequality:
P (|Y − µY | ≥ a) ≤ Var(Y )
a2.
In words: if variance small, Y unlikely to be far from mean.
(Variations in spelling: best English transliteration from Russian
probably “Chebyshov”.)
303
Example: suppose X = 0, 1, 2 each with probability 1/3. Then
E(X) = 1, E(X²) = 5/3, so Var(X) = 2/3.
Markov with a = 1.5 says P(X ≥ 1.5) ≤ 1/1.5 = 2/3. Actual
P(X ≥ 1.5) = P(X = 2) = 1/3, which is indeed ≤ 2/3.
Chebychev with a = 0.9:
P(|X − 1| ≥ 0.9) ≤ (2/3)/(0.9)² = 0.823. Actual
P(|X − 1| ≥ 0.9) = P(X ≤ 0.1) + P(X ≥ 1.9) = P(X = 0) + P(X = 2) = 2/3.
Bounds from Markov and Chebychev inequalities often not very
close to truth, but guaranteed, so can use inequalities to prove
results.
304
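The example above, verified directly in a few lines of Python:

```python
# X takes 0, 1, 2 with probability 1/3 each; check both bounds hold.
support = [0, 1, 2]
p = 1 / 3
EX = sum(x * p for x in support)               # E(X) = 1
var = sum(x**2 * p for x in support) - EX**2   # Var(X) = 2/3

markov_bound = EX / 1.5                        # Markov: P(X >= 1.5) <= 2/3
actual_markov = sum(p for x in support if x >= 1.5)          # = 1/3

cheb_bound = var / 0.9**2                      # Chebychev: <= 0.823
actual_cheb = sum(p for x in support if abs(x - EX) >= 0.9)  # = 2/3
```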
Proof of Markov inequality
Uses idea that if Z ≤ X , then E(Z) ≤ E(X).
Define random variable Z = a if X ≥ a, 0 otherwise. Because
X ≥ 0, value of Z always ≤ that of X : Z ≤ X .
E(Z) = aP(X ≥ a) + 0 · P(X < a) = aP(X ≥ a).
But Z ≤ X so E(Z) ≤ E(X) and therefore
aP (X ≥ a) ≤ E(X). Divide both sides by a. Done.
305
Proof of Chebychev inequality
This uses Markov’s inequality with clever choice of random variable.
Let X = (Y − µY)²; X ≥ 0. Then Markov's inequality (with a²
replacing a) says

P(X ≥ a²) ≤ E(X)/a² ⇒ P[(Y − µY)² ≥ a²] ≤ E[(Y − µY)²]/a².

In the last inequality, E[·] is Var(Y). On the left, both quantities inside
the probability are ≥ 0, so can take square roots of both. Gives

P(|Y − µY| ≥ a) ≤ Var(Y)/a²
which is Chebychev’s inequality. Done.
306
Cauchy-Schwarz and Jensen inequalities
Cauchy-Schwarz:

|Cov(X,Y)| ≤ √(Var(X) Var(Y)) ⇒ |Corr(X,Y)| ≤ 1.
Proof: page 188 of text. Idea, for X, Y having mean 0: write
E[(X − λY)²] in terms of variances and covariances; result must
be ≥ 0.
Jensen’s inequality relates E(g(X)) and g(E(X)). Specifically,
if g(x) is concave up (that is, g′′(x) ≥ 0), then
g(E(X)) ≤ E(g(X)).
307
Proof: Tangent line to concave-up function always ≤ function
(picture). Consider tangent line to g(x) at x = E(X); suppose
equation is a + bx. Then g(E(X)) = a + bE(X). Also, line
≤ g(x) everywhere else, so
a + bX ≤ g(X) ⇒ E(a + bX) ≤ E(g(X))
⇒ a + bE(X) ≤ E(g(X))
⇒ g(E(X)) ≤ E(g(X)).
Done.
(Note: text uses “convex” for “concave up”.)
308
Consequences of Jensen’s inequality
Take g(x) = x². Then (E(X))² ≤ E(X²). But
Var(X) = E(X²) − (E(X))² ≥ 0, so knew that anyway.
Another: suppose X = 1, 2, 3, each with prob 1/3. Then E(X) = 2.
But get another kind of average by multiplying the 3 possible values and
taking the cube root. This is called the geometric mean. Here it is
(1 · 2 · 3)^(1/3) = 1.817. Ordinary mean greater than geometric mean.
Look at log of geometric mean:

ln{(1 · 2 · 3)^(1/3)} = (1/3) ln(1 · 2 · 3) = (1/3)(ln 1 + ln 2 + ln 3) = E(ln X).

Thus geometric mean is e^(E(ln X)).
309
Jensen: − ln x is concave up for x > 0, so

− ln(E(X)) ≤ E(− ln X) ⇒ ln(E(X)) ≥ E(ln X).

Exponentiate both sides (e^(ln y) = y):

E(X) ≥ e^(E(ln X)).
This says that for any positive random variable X , the ordinary
mean will always be ≥ the geometric mean.
310
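The arithmetic-vs-geometric comparison above in code, including the e^{E(ln X)} form:

```python
# Arithmetic vs geometric mean for X = 1, 2, 3 (each prob 1/3).
import math

values = [1, 2, 3]
arith = sum(values) / len(values)                  # E(X) = 2
geom = math.prod(values) ** (1 / len(values))      # (1*2*3)^(1/3) ~ 1.817
# same thing via logs: geometric mean = exp(E(ln X))
geom_via_logs = math.exp(sum(math.log(v) for v in values) / len(values))
```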
Summary
• Markov’s inequality: P (X ≥ a) ≤ E(X)/a.
• Chebychev’s inequality: P(|Y − µY| ≥ a) ≤ Var(Y)/a².
• These inequalities guaranteed, so can be used for proofs.
• Cauchy-Schwarz: |Cov(X,Y)| ≤ √(Var(X) Var(Y)), so
|Corr(X,Y)| ≤ 1.
• Jensen: if g(x) concave up (g′′(x) ≥ 0), then
g(E(X)) ≤ E(g(X)).
• Consequence: geometric mean always ≤ ordinary mean.
311
Sampling Distributions and
Limits
312
Introduction: roulette
See http://tinyurl.com/238p5 for intro to game.
Basic idea: bet on number or number combination. Roulette wheel
spun, one number is winner. Your bet wins if it contains winning
number.
Wheel also contains numbers 0, 00. Winning bets paid as if 0, 00
absent (advantage to casino).
Bet 1: “high number”: win with 19–36, lose otherwise. Bet $1, win
$1 if win. Let W be winnings on one play; P (W = 1) = 18/38,
P (W = −1) = 20/38. Then
E(W) = 1 · (18/38) + (−1) · (20/38) = −2/38 ≃ −$0.05.
313
Bet 2: “lucky number”: win if 24 comes up, lose otherwise. Win $35
for $1 bet. Now P (W = 35) = 1/38, P (W = −1) = 37/38, so
E(W) = 35 · (1/38) + (−1) · (37/38) = −2/38 ≃ −$0.05.
In both bets, lose 5 cents per $ bet in long run.
Play game not once but many times. Interested in total winnings, or
mean winnings per play. Let Wi be winnings on play i; then mean
winnings per play Mn over n plays is
Mn = (1/n) Σ_{i=1}^n Wi.
Investigate behaviour of Mn by simulation.
314
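The simulations plotted on the next slides were done with Minitab/R; a standalone Python sketch of the same idea for the high-number bet (function names and the seed are mine):

```python
# Running-mean winnings M_n for a roulette bet paying `payout` to 1;
# high-number bet has P(win) = 18/38.
import random

def running_mean_winnings(n_plays, p_win=18/38, payout=1, seed=42):
    """Return the list of M_n values after each of n_plays plays."""
    rng = random.Random(seed)
    total, means = 0, []
    for i in range(1, n_plays + 1):
        total += payout if rng.random() < p_win else -1
        means.append(total / i)
    return means

m = running_mean_winnings(100_000)
# m[-1] should be close to E(W) = -2/38 ~ -0.0526
```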
High-number, 30 plays:
[trace of M_n against n for 30 plays; dotted line at E(W) = −2/38]
315
High-number, 1000 plays:
[trace of M_n against n for 1000 plays; dotted line at E(W) = −2/38]
316
Lucky-number, 1000 plays:
[trace of M_n against n for 1000 plays; dotted line at E(W) = −2/38]
317
Notes about roulette simulation
1st graph: in high-number bet, fortune goes up/down by $1 per play;
winnings/play pattern similar. On this sequence, in profit after 30
plays, but losing after 15.
2nd graph: same bet, 1000 plays. Less fluctuation after more trials;
winnings per play apparently tending to dotted line, E(W ). (Other
simulations have different shape but similar end behaviour.)
3rd graph: lucky-number bet, 1000 plays. Large jump upwards on
each win. Picture more erratic than for high-number bet; long-term
behaviour not clear yet. (Need more plays.)
318
Summary
• Roulette: bet on number or number combination. Each spin of
wheel gives winning number; win if that number part of your
combination, lose otherwise.
• Amount you win determined as if 0, 00 absent from wheel.
• Expected winnings from most bets −$0.05 (per $ bet).
• Investigate bets by simulation; do 1000 simulated plays:
– high number bet: simulated very close to expectation.
– lucky-number bet: simulated not close to expectation
because results more variable.
319
Understanding Mn mathematically: mean, variance
Mn = (1/n) Σ_{i=1}^n Wi

is a sum. The Wi in the sum are independent, each with the same
distribution (one spin of the wheel has no effect on other spins). So can
calculate E(Mn) and Var(Mn).
Already found E(Wi) = −2/38 for both our bets.
Find variances for bets: for high-number bet, Var(Wi) = 0.9972;
for lucky-number bet Var(Wi) = 33.21.
320
For mean:
E(Mn) = (1/n) Σ_{i=1}^n E(Wi) = (1/n) Σ_{i=1}^n (−2/38) = −2/38,
since there are n terms in the sum, all the same.
That is, regardless of how long you play, you will lose 5 cents per $
bet on average.
321
For variance:

Var(Mn) = (1/n²) Σ_{i=1}^n Var(Wi) = Var(Wi)/n.
Sum has n terms all equal to variance of one play’s winnings. So for
high-number bet, Var(Mn) = 0.9972/n, for lucky-number bet,
Var(Mn) = 33.21/n.
For any particular n, variance for high-number bet lower. Supports
simulation: high-number bet results more predictable.
In both cases, as n → ∞, Var(Mn) → 0. Longer you play, more
predictable Mn is.
322
Distribution of Mn
Mean and variance not whole story – want to know things like
P (Mn > 0) (chance of profit). For this, need distribution of Mn.
Start with M2 (2 plays). Do lucky-number bet (P(W = 35) = 1/38,
P(W = −1) = 37/38).
4 possibilities:
• win both times. M2 = (35 + 35)/2 = 35;
P(M2 = 35) = (1/38)² = 1/1444 ≃ 0.0007.
• win on 1st, lose on 2nd. M2 = (35 + (−1))/2 = 17; prob is
(1/38) · (37/38) = 37/1444.
323
• lose on 1st, win on 2nd. Again M2 = 17 and prob is same as
above. Thus overall P(M2 = 17) = 74/1444 ≃ 0.0512.
• lose on both. M2 = ((−1) + (−1))/2 = −1;
P(M2 = −1) = (37/38)² = 1369/1444 ≃ 0.9480.
Calculation complicated, even for n = 2, because have to consider
all possible combinations.
In general: this kind of distribution very difficult to find exactly. So
look for approximations to it.
324
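The four-outcome calculation above can be automated with exact fractions; the same enumeration idea works for larger n, though the number of combinations grows quickly:

```python
# Exact distribution of M_2 for the lucky-number bet, enumerating all outcomes.
from fractions import Fraction as F
from itertools import product

p = {35: F(1, 38), -1: F(37, 38)}   # winnings on one play and their probabilities
dist = {}
for w1, w2 in product(p, repeat=2):
    m2 = F(w1 + w2, 2)              # mean winnings over the two plays
    dist[m2] = dist.get(m2, 0) + p[w1] * p[w2]
```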
Summary
• Mn = Σᵢ Wi/n is a sum, so can find E(Mn) = −2/38 for
both bets, Var(Mn) = 0.9972/n for high-number,
Var(Mn) = 33.21/n for lucky-number.
• For fixed n, average winnings more predictable for high-number
bet.
• As n → ∞, Var(Mn) → 0 in both cases.
• Actual distribution of Mn difficult to find. Seek approximation.
325
Sampling distributions
Suppose X1, X2, . . . , Xn are random variables, each independent
and with same distribution. For example:
• Xi is winnings from i-th play of a roulette bet.
• Xi is height of i-th randomly chosen Canadian.
• Xi = 1 if randomly chosen voter supports Liberal party,
Xi = 0 otherwise.
• Xi is randomly generated value from a distribution with density
fX(x).
In each case: underlying phenomenon of interest, collect data at
random to help understand phenomenon.
326
Summarize Xi values using random variable
Yn = h(X1, X2, . . . , Xn) for some function h (eg. mean, like
Mn).
Some jargon:
• total collection of individuals (all possible spins of roulette
wheel, all Canadians, all possible values) called population .
• particular individuals selected, or Xi values obtained from
them, called sample .
• Yn defined above called sample statistic.
Usually don’t know about population, so draw conclusion about it
based on sample.
327
First: opposite problem: if we know population, find out what
samples from it look like.
“At random” important, and specific. Each individual value in
population must have correct chance of being in sample (same
chance, for human populations), and each must be in sample or not
independently of others.
Aim: learn about distribution of Yn, called sampling distribution .
General statements difficult. Approach: find what happens as
n → ∞, then use result as approximation for finite n.
328
Convergence in probability; weak law of
large numbers
In mathematics, accustomed to convergence ideas. Eg. if
an = 1 − 1/n, so that a1 = 0, a2 = 1/2, a3 = 2/3, etc., an → 1
(converges to 1) as n → ∞ because, by taking n large enough, all
values after an are as close to 1 as desired.
For sequence X1, X2, . . . of random variables, what is meaning of
Xn → Y , where Y is random variable?
329
Different possibilities. One idea: “prob of Xn being far from Y goes
to 0 as n gets large”. Leads to definition:
Sequence {Xn} converges in probability to Y if, for all ε > 0,
lim_{n→∞} P(|Xn − Y| ≥ ε) = 0. Notation: Xn →P Y.
Example: suppose U ∼ Uniform[0, 1]. Let Xn = 3 when
U ≤ (2/3)(1 − 1/n) and 8 otherwise.
Thus when n = 1, X1 must be 8. If U > 2/3, Xn remains 8 forever,
but if U ≤ 2/3, then U ≤ (2/3)(1 − 1/n) eventually, so Xn becomes 3 for
some n, then remains 3 forever.
(Cannot know which will happen since U is a random variable.)
330
Now define Y = 3 if U ≤ 2/3 and Y = 8 otherwise. Same as the
“eventual” Xn, so should have Xn →P Y. Correct?

P(|Xn − Y| ≥ ε) = P(Xn ≠ Y) = P((2/3)(1 − 1/n) < U ≤ 2/3) = 2/(3n).

This tends to 0 as n → ∞, so Xn →P Y.
331
Convergence to a constant
What if Y is not a random variable, but a number?
Example: suppose Zn ∼ Exponential(n). Then E(Zn) = 1/n,
suggesting that Zn typically gets smaller and smaller. Does
Zn →P 0?

P(|Zn − 0| ≥ ε) = P(Zn ≥ ε) = ∫_ε^∞ n e^(−nx) dx = e^(−nε).

For any fixed ε > 0, P(|Zn − 0| ≥ ε) → 0, so Zn →P 0.
Important special case (usually easier to handle).
332
Convergence to mean
Suppose sequence {Yn} has E(Yn) = µ for all n. Then Yn →P µ
if P(|Yn − µ| ≥ ε) → 0.
But recall Chebychev’s inequality,
P(|Y − µY| ≥ a) ≤ Var(Y)/a². Here:

P(|Yn − µ| ≥ ε) ≤ Var(Yn)/ε².

For fixed ε, right side (and hence left side) tends to 0 if
Var(Yn) → 0, in which case Yn →P µ.
(Logically: if Var(Yn) getting smaller, Yn becoming closer to their
mean µ.)
333
Weak Law of Large Numbers
Return to X1, X2, . . . , Xn being a random sample from some
population with mean E(Xi) = µ and variance Var(Xi) = v.
Consider sample mean

Mn = (1/n) Σ_{i=1}^n Xi.
Intuitively, expect Mn to be “close” to population mean µ, and to get
closer as n increases (more information in larger sample).
Does Mn →P µ? Redo the roulette calculations to show that
E(Mn) = µ and Var(Mn) = Var(Xi)/n = v/n.
334
Now, {Mn} is a sequence of random variables with the same mean µ.
Result of section “convergence to mean” says that Mn →P µ if
Var(Mn) → 0. But here, Var(Mn) = v/n → 0. This proves
that Mn →P µ.
This justifies use of sample mean as estimate of the population
mean. Can estimate average height of all Canadians by measuring
average height of sample of Canadians; the larger the sample,
closer estimate will likely be.
Important result, called weak law of large numbers .
335
To generalize: suppose now that the Xi do not all have the same variance,
but Var(Xi) = vi. Then

Var(Mn) = (1/n²) Σ_{i=1}^n vi.

This might not → 0. But suppose that vi ≤ v for all i. Then

Var(Mn) = (1/n²) Σ_{i=1}^n vi ≤ (1/n²) Σ_{i=1}^n v = v/n → 0.

In other words, Mn →P µ even if the variances are not all equal,
provided that they are bounded.
336
Convergence with probability 1
Previous example: suppose U ∼ Uniform[0, 1]. Let Xn = 3
when U ≤ (2/3)(1 − 1/n) and 8 otherwise. Let Y = 3 if U ≤ 2/3 and
Y = 8 otherwise. Concluded that Xn →P Y.
Take another approach. Suppose we knew U, eg. suppose
U = 0.4. Then

0.4 ≤ (2/3)(1 − 1/n) ⇒ n ≥ 5/2.

Thus X1 = X2 = 8, X3 = X4 = · · · = 3. This is an ordinary
sequence of numbers, converging to 3. Also, if U = 0.4, Y = 3.
337
In general: if U < 2/3, Xn = 8 for n ≤ 2/(2 − 3U) and Xn = 3
after that. If U > 2/3, Xn = 8 for all n.
In both cases, Xn → Y as ordinary sequence for any particular
value of U . Potentially different idea of convergence of random
variables.
Definition: Xn converges to Y with probability 1 if
P(lim_{n→∞} Xn = Y) = 1. Also “converges almost surely”;
notation Xn →a.s. Y.
In words: consider all ways to get (number) sequences {Xn}; for
each, consider the corresponding Y. If Xn → Y always, then
Xn →a.s. Y.
338
Is it same as convergence in probability?
Example: let U ∼ Uniform[0, 1], and define {Xn} like this:
• X1 = 1 if 0 ≤ U < 1/2, 0 otherwise
• X2 = 1 if 1/2 ≤ U < 1, 0 otherwise
• X3 = 1 if 0 ≤ U < 1/4, 0 otherwise
• X4 = 1 if 1/4 ≤ U < 1/2, 0 otherwise
• X5 = 1 if 1/2 ≤ U < 3/4, 0 otherwise
• X6 = 1 if 3/4 ≤ U < 1, 0 otherwise
• X7 = 1 if 0 ≤ U < 1/8, 0 otherwise
• X8 = 1 if 1/8 ≤ U < 1/4, 0 otherwise, etc.
339
(Divided [0, 1] into 2, then 4, then 8, . . . intervals.)
Intervals getting shorter, so P(Xn = 1) decreasing. Indeed, for
ε < 1, P(|Xn − 0| ≥ ε) = P(Xn = 1) → 0, so Xn →P 0.
Suppose U = 0.2. Then Xn = 0 except for
X1 = X3 = X8 = · · · = 1. Beyond any n, always another
Xn = 1 (always another interval containing 0.2). So for U = 0.2,
the number sequence {Xn} has no limit. Hence not true that Xn →a.s. 0.
Example shows that two convergence ideas different – convergence
with probability 1 harder to achieve.
340
Strong law of large numbers
Random sample X1, X2, . . . , Xn with E(Xi) = µ,
Var(Xi) ≤ v; let Mn = (Σ_{i=1}^n Xi)/n be the sample mean.
Already showed that Mn →P µ (“weak law of large numbers”).
Also strong law of large numbers: Mn →a.s. µ. Proof difficult.
In words: out of the (infinitely) many different sequences {Mn}
obtainable, every one of them converges to µ.
341
Summary
• Population: “all possible” values of a random variable.
• Sample: observe X1, X2, . . . , Xn.
• Calculate sample statistic h(X1, . . . , Xn) (eg. mean, median).
Want to know sampling distribution of h over repeated samples,
assuming (for now) that population known.
• General results difficult; find out what happens as n → ∞, and
use as approximation for finite n.
• Xn converges in probability to Y (Xn →P Y) if
P(|Xn − Y| ≥ ε) → 0 as n → ∞ for all ε > 0.
342
• Convergence to mean: if Var(Yn) → 0, then Yn →P µ
(Chebychev).
• Sample mean Mn →P µ (pop. mean). Called weak law of large
numbers. “Larger sample is more informative”.
• Xn converges to Y with probability 1 if
P(lim_{n→∞} Xn = Y) = 1. Also “converges almost surely”;
notation Xn →a.s. Y. All possible (number) sequences {Xn}
converge to the corresponding Y.
• Convergence a.s. is more demanding than convergence in probability.
• Strong law of large numbers: Mn →a.s. µ.
343
Convergence in distribution
Consider independent sequence of random variables {Xn} with
P(Xn = 1) = 1/2 + 1/n and P(Xn = 0) = 1/2 − 1/n. Also, let
P(Y = 0) = P(Y = 1) = 1/2, independently of the Xn.
Now, take ε < 1. Then P(|Xn − Y| ≥ ε) = P(Xn ≠ Y). Could
have Xn = 0, Y = 1 or Xn = 1, Y = 0; use independence:

P(Xn ≠ Y) = (1/2 − 1/n)(1/2) + (1/2 + 1/n)(1/2) = 1/2.

Not → 0, so not true that Xn →P Y.
344
But Xn does converge to Y in the sense that
P(Xn = 1) → 1/2 = P(Y = 1) and
P(Xn = 0) → 1/2 = P(Y = 0). Called convergence in
distribution.
To make a definition: note that P(Xn = x) is meaningless for
continuous Xn, so work with P(Xn ≤ x) instead.
Then: {Xn} converges in distribution to Y if
P(Xn ≤ x) → P(Y ≤ x) for all x. Notation: Xn →D Y.
345
Example: Poisson approximation to binomial
Suppose Xn ∼ Binomial(n, λ/n) (that is, number of trials increasing but
success prob decreasing so that E(Xn) = n(λ/n) = λ stays constant).
Then

P(Xn = j) = C(n, j) (λ/n)^j (1 − λ/n)^(n−j) → e^(−λ) λ^j/j!,

which is P(Y = j) when Y ∼ Poisson(λ). That is,
Xn →D Poisson(λ).
(Proof based on lim_{n→∞} (1 − x/n)^n = e^(−x).)
Suggests that if n large and θ small, Poisson is a good approx to
binomial.
346
Try this: take λ = 1.5 for n = 2, 5, 10, 20, 100:
x n=2 n=5 n=10 n=20 n=100 Poisson
0 0.0625 0.1680 0.1968 0.2102 0.2206 0.2231
1 0.3750 0.3601 0.3474 0.3410 0.3359 0.3346
2 0.5625 0.3087 0.2758 0.2626 0.2532 0.2510
3 0.0000 0.1323 0.1298 0.1277 0.1259 0.1255
4 0.0000 0.0283 0.0400 0.0440 0.0465 0.0470
5 0.0000 0.0024 0.0084 0.0114 0.0136 0.0141
6 0.0000 0.0000 0.0012 0.0023 0.0032 0.0035
Approx for n = 20 not bad; for n = 100 is very good.
347
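The table can be reproduced directly from the two pmfs; a sketch checking that the n = 100 column is close to the Poisson column:

```python
# Binomial(n, lam/n) pmf vs Poisson(lam) pmf, as in the table above.
import math

def binom_pmf(n, p, x):
    """P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def pois_pmf(lam, x):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 1.5
n = 100
approx_gap = max(abs(binom_pmf(n, lam / n, x) - pois_pmf(lam, x))
                 for x in range(7))
```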
Convergence in distribution and moment generating
functions
Moment-generating function mY (s) for random variable Y is
function of s.
Uniqueness theorem: if mX(s) = mY (s) for all s where both
finite, then X,Y have same distribution.
Suggests the following (true) result: if {Xn} is a sequence of random
variables with mXn(s) → mY(s) (for all s where both sides finite),
then Xn →D Y.
348
Summary
• Xn →D Y if P(Xn ≤ x) → P(Y ≤ x) for all x.
• If Xn ∼ Binomial(n, λ/n) and Y ∼ Poisson(λ), then
Xn →D Y. (If n large, θ small, Poisson good approx. to
binomial.)
• If mXn(s) → mY(s) for all valid s, then Xn →D Y.
349
Central Limit Theorem
Return to “random sample” X1, X2, . . . , Xn; suppose E(Xi) = 0
and Var(Xi) = 1.
Define Mn = (Σ_{i=1}^n Xi)/n. Does Mn converge in distribution to
anything interesting?
Well, E(Mn) = 0 but Var(Mn) = 1/n → 0. So look instead at
Zn = √n Mn: E(Zn) = 0 and Var(Zn) = 1. Then

Zn = (Σ_{i=1}^n Xi)/√n.
350
Moment-generating function for Xi is

mXi(s) = 1 + sE(Xi) + (s²/2!)E(Xi²) + (s³/3!)E(Xi³) + · · · ;

here E(Xi) = 0, Var(Xi) = 1 so E(Xi²) = 1, giving

mXi(s) = 1 + s²/2 + (s³/3!)E(Xi³) + · · · .

Now, by rules for mgf’s,

mZn(s) = mX1(s/√n) · mX2(s/√n) · · · mXn(s/√n)
       = {mXi(s/√n)}^n
       = (1 + s²/(2n) + (s³/(3! n^(3/2)))E(Xi³) + · · · )^n.
351
Recall that as n → ∞, (1 + y/n)^n → e^y. Above, the terms in s³
and higher contribute less and less as n increases, so only the 1
and s²/(2n) terms in the bracket have an effect. Thus

lim_{n→∞} mZn(s) = lim_{n→∞} (1 + s²/(2n))^n = e^(s²/2),

which is the mgf of the standard normal distribution.
Thus, remarkable fact: regardless of the distribution of the Xi,
Zn →D N(0, 1).
Also works for Xi with any mean and variance: the standardized
Mn →D N(0, 1). Called the central limit theorem.
352
Exact distribution of Mn very difficult to find. But if n “large”,
distribution can be approximated very well by normal distribution,
easier to work with.
This is reason for studying normal distribution.
Note that theorem uses convergence in distribution, so that it is the
cdf that converges, not the density function. Important if Xi discrete.
Also, for approximation, don’t need to be so careful about
standardization. Any sum/mean for large n works.
353
CLT by simulation
Let U1, U2, . . . ∼ Uniform[0, 1]; investigate distribution of
Yn = (U1 + U2 + · · · + Un)/n for various n. Uniform[0, 1]
distribution completely unlike normal. Do by simulation:
1. choose “large” number of Yn’s to simulate (eg. nsim = 10, 000)
2. in each of n columns, generate nsim random values from
Uniform[0, 1]
3. calculate simulated Yn values as row means. Eg. for n = 5,
let c10=rmean(c1-c5).
4. Draw histogram of results, compare normal distribution shape.
Normal good if curve through top middle of histogram bars.
354
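The steps above use Minitab; the same simulation in standalone Python (names and seed are mine):

```python
# Means of n Uniform[0,1] values, repeated nsim times, as in the recipe above.
import random
import statistics

def simulate_means(n, nsim=10_000, seed=7):
    """Simulate nsim values of Y_n = mean of n Uniform[0,1] draws."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.random() for _ in range(n))
            for _ in range(nsim)]

y = simulate_means(20)
m, s = statistics.fmean(y), statistics.stdev(y)
# CLT predicts mean 1/2 and SD sqrt((1/12)/20), about 0.0645
```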
[histogram of simulated Yₙ values with normal density curve superimposed]
n = 2: normal too high at top, too low elsewhere.
355
[histogram of simulated Yₙ values with normal density curve superimposed]
3.0
n = 5: much closer approx.
356
[histogram of simulated Yₙ values with normal density curve superimposed]
n = 20: almost perfect.
357
Normal approx to binomial
Binomial is sum of Bernoullis, so CLT should apply if #trials n large.
Suppose Y ∼ Binomial(4, 0.5). Then E(Y ) = 2, Var(Y ) = 1.
Exact P (Y ≤ 1):
P(Y ≤ 1) = C(4, 0)(0.5)⁰(1 − 0.5)⁴ + C(4, 1)(0.5)¹(0.5)³ = 0.3125.

Take X ∼ N(2, 1) (same mean, variance as Y). P(X ≤ 1)?

P(X ≤ 1) = P(Z ≤ (1 − 2)/√1) = P(Z ≤ −1) = 0.1587.
Not very close!
358
Problem: X continuous, but Y discrete. “Y ≤ 1” really means “Y ≤
anything rounding to 1”. Suggests approximating P(Y ≤ 1) by
P(X ≤ 1.5):

P(X ≤ 1.5) = P(Z ≤ (1.5 − 2)/√1) = P(Z ≤ −0.5) = 0.3085.

For such small n, really very close to P(Y ≤ 1) = 0.3125.
In general, add 0.5 for ≤ and subtract 0.5 for <. Called continuity
correction; do whenever a discrete distribution is approximated by a
continuous one.
(Alternatively: for binomial, P(Y ≤ 1) ≠ P(Y < 1), but for
normal, P(X ≤ 1) = P(X < 1).)
359
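The continuity-correction comparison above in code, using the standard normal cdf via `math.erf`:

```python
# P(Y <= 1) for Y ~ Binomial(4, 0.5) vs the N(2, 1) approximation,
# with and without the 0.5 continuity correction.
import math

def norm_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

exact = sum(math.comb(4, x) * 0.5**4 for x in range(2))    # 0.3125
no_corr = norm_cdf((1 - 2) / 1)                            # about 0.1587: poor
with_corr = norm_cdf((1.5 - 2) / 1)                        # about 0.3085: close
```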
Compare Y ∼ Binomial(20, 0.5); E(Y ) = 10, Var(Y ) = 5.
Then exact P (Y ≤ 8) = 0.2517; approx by X ∼ N(10, 5) as
P(Y ≤ 8) ≃ P(X ≤ 8.5) = P(Z ≤ (8.5 − 10)/√5) = P(Z ≤ −0.67) = 0.2514.
Now, approx very good.
360
If p ≠ 0.5, binomial skewed; skewness decreases as n increases.
So need larger n for p far from 0.5.
Example: n = 20, p = 0.1. Simulate and plot using Minitab:
MTB > random 1000 c3;
SUBC> binomial 20 0.1.
MTB > hist c3
Shape clearly skewed, not normal. n = 20 not large enough here.
Rule of thumb: normal approx OK if np ≥ 5 and n(1 − p) ≥ 5.
Examples: n = 4, p = 0.5: np = 2 < 5, no good.
n = 20, p = 0.5: np = n(1 − p) = 10 ≥ 5, good;
n = 20, p = 0.1: np = 2 < 5, no good.
361
Summary
• Central Limit Theorem: if E(Xi) = 0 and Var(Xi) = 1, and
Zn = Σ_{i=1}^n Xi/√n, then Zn →D N(0, 1) even though Xi
could have any distribution (with finite variance).
• Proof also works with E(Xi) = µ, Var(Xi) = σ²; define Zn
using standardized Xi.
• Can assess Central Limit Theorem by simulation.
• Can approximate binomial with large n by normal (continuity
correction).
362
Monte Carlo integration
Integral I = ∫₀¹ sin(x⁴) dx: impossible algebraically (no
closed-form antiderivative). Get approximate answer numerically eg. by
Simpson’s rule. But can also recognize that

I = E{sin(U⁴)},

where U ∼ Uniform[0, 1]. I is “average” of sin(U⁴), suggesting
procedure:
1. Generate U randomly from Uniform[0, 1].
2. Calculate T = sin(U⁴).
3. Repeat steps 1 and 2 many times, find mean value m of T.
363
Minitab commands to do this (U in c1, T in c2):
MTB > random 1000 c1;
SUBC> uniform 0 1.
MTB > let c2=sin(c1**4)
MTB > mean c2
I got m = 0.19704. How accurate?
m observed value of random variable M . M mean of 1000 values,
so central limit theorem applies: approx normal distribution.
Mean, variance unknown but estimate using sample mean 0.19704,
sample SD 0.25221: E(M) ≃ 0.19704,
Var(M) = σ²/n ≃ 0.25221²/1000 = 6.36 × 10⁻⁵.
364
Now, 99.7% of a normal distribution is within mean ± 3 SD, so I is
almost certainly in

0.19704 ± 3√(6.36 × 10⁻⁵) = (0.173, 0.221).
To get more accurate answer, get more simulated values.
365
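The slides use Minitab; the same Monte Carlo estimate in standalone Python (the seed is an arbitrary choice):

```python
# Monte Carlo estimate of I = integral of sin(x^4) over [0,1],
# recognized as E{sin(U^4)} with U ~ Uniform[0,1].
import math
import random
import statistics

rng = random.Random(2023)
t = [math.sin(rng.random() ** 4) for _ in range(10_000)]
m = statistics.fmean(t)
# "almost certain" interval: mean +/- 3 standard errors
half_width = 3 * statistics.stdev(t) / math.sqrt(len(t))
interval = (m - half_width, m + half_width)
```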
Recognizing as expectation
Consider now I = ∫₀^∞ 5x cos(x²) e^(−5x) dx.
Again impossible algebraically; because of the limits, can’t use the previous
trick.
Idea: use a distribution with the right limits and density inside the integral.
Here, Exponential(5) has density 5e^(−5x) on the correct interval, so
I = E{X cos(X²)} where X ∼ Exponential(5).
Minitab annoyance: its exponential dist has parameter 1/λ, so we
have to feed in 1/5 = 0.2.
366
Commands:
MTB > random 1000 c1;
SUBC> exponential 0.2.
MTB > let c2=c1*cos(c1**2)
MTB > describe c2
I got mean 0.1884, SD 0.1731, so this area almost certainly in

0.1884 ± 3 · 0.1731/√1000 = (0.1720, 0.2048).
367
Summary
• Recognize integral as expectation of a distribution (integral has
correct limits and density function inside).
• Generate random values from distribution, compute for each the
function that integral is expectation of.
• Estimated integral is mean of computed values (use sample SD,
and say integral almost certainly ±3SD from mean).
368
Approximating sampling distributions
Central Limit Theorem only applies to means (sums), so is no help
for other quantities (median, variance etc).
Can approximate sampling distributions for these by simulation.
Idea:
1. simulate random sample from population
2. calculate sample quantity
3. repeat steps 1 and 2 many times, summarize results.
369
Sampling distribution of sample median in normal
population
Suppose X1, X2, . . . , Xn is random sample from normal
population mean 10, SD 2; take n = 3.
MTB > Random 500 c1-c3;
SUBC> Normal 10 2.
MTB > RMedian c1-c3 c4.
Samples in rows; use “row statistics” to get sample medians.
370
Shape is very like normal, even for such small sample.
371
Sampling distribution of sample variance in normal
population
Again suppose X1, X2, . . . , Xn ∼ N(10, 22). Now take n = 5:
MTB > Random 500 c1-c5;
SUBC> Normal 10 2.
MTB > RStDev c1-c5 c6.
MTB > let c7=c6*c6
MTB > histogram c7
(samples in rows again; variance as square of SD.)
372
Shape definitely skewed right: not normal-shaped.
373
Summary
• Approximate sampling distribution of quantity by:
– simulate random sample from population
– calculate sample quantity
– repeat many times, summarize results (histogram)
• Sampling distribution of sample median close to normal.
• Sampling distribution of sample variance skewed to right.
374
Normal distribution theory
Normal distribution arises often from CLT, so worth knowing
properties and related distributions. These used frequently in
Chapter 5 and beyond (STAB57).
First: suppose U, V are independent. Then Cov(U, V ) =
E(UV ) − E(U)E(V ) = E(U)E(V ) − E(U)E(V ) = 0 as
expected.
But: now suppose that Cov(U, V ) = 0. If U, V normal, then (fact)
U, V independent.
That is, for normal U, V , Cov(U, V ) = 0 if and only if U, V
independent. Not true for other distributions.
375
The chi-squared distribution
Suppose Z ∼ N(0, 1). What is distribution of W = Z²? Can’t
use the usual transformation because Z² is neither increasing nor
decreasing.

FW(w) = P(W ≤ w) = P(Z² ≤ w) = P(−√w ≤ Z ≤ √w).

As an integral this is

FW(w) = ∫_{−√w}^{√w} (e^(−z²/2)/√(2π)) dz
      = ∫_{−∞}^{√w} (e^(−z²/2)/√(2π)) dz − ∫_{−∞}^{−√w} (e^(−z²/2)/√(2π)) dz.
376
Differentiate both sides and simplify to get

fW(w) = (1/√(2πw)) e^(−w/2).

This is called the chi-squared distribution with 1 degree of freedom
(df). Written W ∼ χ²₁.
Now suppose Z1, Z2, . . . , Zn ∼ N(0, 1) independently.
Distribution of W = Z1² + Z2² + · · · + Zn² called chi-squared with
n degrees of freedom. Written W ∼ χ²ₙ.
What is E(W)?

E(W) = E(Σ_{i=1}^n Zi²) = Σ_{i=1}^n E(Zi²) = n(1) = n

since E(Zi²) = Var(Zi) = 1.
377
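The result E(W) = n is easy to check by simulation; this Python sketch (our own, not from the course) sums n squared standard normals many times:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 5
z = rng.standard_normal(size=(100_000, n))
w = (z ** 2).sum(axis=1)  # W = Z1^2 + ... + Zn^2, chi-squared with n df

# the sample mean of W should be close to n
print(round(float(w.mean()), 1))
```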
To get density function of χ²₁, compare gamma density with χ²₁:

λ^α w^(α−1) e^(−λw) / Γ(α) = (1/√(2πw)) e^(−w/2)

if α = 1/2 and λ = 1/2. That is, χ²₁ = Gamma(1/2, 1/2).

If Zᵢ² ∼ χ²₁, use mgf formula for gamma dist to write

m_{Zᵢ²}(s) = (1/2)^(1/2) (1/2 − s)^(−1/2).
378
If W = ∑ᵢ₌₁ⁿ Zᵢ² ∼ χ²ₙ, mgf of W is n copies of m_{Zᵢ²}(s)
multiplied together, i.e.

MW(s) = (1/2)^(n/2) (1/2 − s)^(−n/2)

which is mgf of Gamma(n/2, 1/2). Using formula for gamma
density, then, for W ∼ χ²ₙ,

fW(w) = (1/(2^(n/2) Γ(n/2))) w^(n/2 − 1) e^(−w/2).

Has skew-to-right shape (picture page 225).
379
Distribution of sample variance
Suppose X₁, X₂, . . . , Xₙ ∼ N(µ, σ²). Define X̄ = ∑ᵢ₌₁ⁿ Xᵢ/n
to be sample mean, S² = ∑ᵢ₌₁ⁿ (Xᵢ − X̄)²/(n − 1) to be sample
variance.

Know that X̄ ∼ N(µ, σ²/n). Distribution of S²?

Actually look at (n − 1)S²/σ² = ∑ᵢ₌₁ⁿ (Xᵢ − X̄)²/σ². Can write
(p. 235) as sum of n − 1 squared N(0, 1)’s, so

(n − 1)S²/σ² ∼ χ²ₙ₋₁.

Fact: E(S²) = σ² (explains division by n − 1).
380
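A quick simulation check of this fact (a Python/NumPy sketch of our own): (n − 1)S²/σ² should have the χ²ₙ₋₁ mean n − 1 and variance 2(n − 1).

```python
import numpy as np

rng = np.random.default_rng(4)

n, mu, sigma = 5, 10, 2
x = rng.normal(mu, sigma, size=(100_000, n))
s2 = x.var(axis=1, ddof=1)           # sample variance of each row
w = (n - 1) * s2 / sigma**2          # should be chi-squared, n - 1 = 4 df

# chi-squared(4): mean 4, variance 8
print(round(float(w.mean()), 1), round(float(w.var()), 1))
```

Note that the mean of `s2` itself is close to σ² = 4, illustrating E(S²) = σ².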
The t distribution
Standardize X̄:

(X̄ − µ)/√(σ²/n) ∼ N(0, 1).

But what if σ² unknown? Idea: replace σ² by sample variance S².
Distribution of result no longer normal (even though Xᵢ are).

(X̄ − µ)/√(S²/n) = [(X̄ − µ)/√(σ²/n)] · 1/√{[(n − 1)S²/σ²]/(n − 1)} = Z/√(Y/(n − 1))

where Z ∼ N(0, 1) and Y ∼ χ²ₙ₋₁.

This called t distribution with n − 1 degrees of freedom, written
tₙ₋₁.
381
What happens as n increases? Write
Y/(n − 1) = ∑ᵢ₌₁ⁿ⁻¹ Zᵢ²/(n − 1) where Zᵢ ∼ N(0, 1). Then
E(Y/(n − 1)) = 1. Let k = Var(Zᵢ²); then

Var(Y/(n − 1)) = (n − 1)k/(n − 1)² = k/(n − 1) → 0.

That is, Y/(n − 1) →ᴾ 1 and therefore

Z/√(Y/(n − 1)) →ᴰ N(0, 1);

that is, for large n, the t distribution with n − 1 df well approximated
by N(0, 1).
t distribution hard to work with; use tables/software for probabilities.
382
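The t statistic can be simulated directly (our own NumPy sketch); with n = 5 its tails are clearly heavier than those of N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(5)

n, mu, sigma = 5, 10, 2
x = rng.normal(mu, sigma, size=(100_000, n))
t = (x.mean(axis=1) - mu) / np.sqrt(x.var(axis=1, ddof=1) / n)

# tail probability P(|T| > 2): about 0.116 for t with 4 df,
# versus about 0.046 for N(0, 1)
print(round(float(np.mean(np.abs(t) > 2)), 2))
```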
The F distribution
Suppose S₁² and S₂² sample variances from independent samples
sizes m, n, both from normal populations with variance σ². Then
might compare variances by looking at ratio R = S₁²/S₂²:

R = S₁²/S₂² = {[(m − 1)S₁²/σ²]/[(n − 1)S₂²/σ²]} · {[1/(m − 1)]/[1/(n − 1)]}⁻¹... no: multiply numerator by 1/(m − 1) and denominator by 1/(n − 1):

R = [X/(m − 1)]/[Y/(n − 1)]

where X = (m − 1)S₁²/σ² ∼ χ²ₘ₋₁ and Y = (n − 1)S₂²/σ² ∼ χ²ₙ₋₁.

This defined to have F distribution with m − 1 and n − 1
degrees of freedom, written F(m − 1, n − 1).
383
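A simulation sketch (our own; sample sizes chosen for illustration) of R for normal samples. One known property of the F distribution is that E[F(d₁, d₂)] = d₂/(d₂ − 2) when d₂ > 2, which the simulated ratio should match:

```python
import numpy as np

rng = np.random.default_rng(6)

m, n, sigma = 6, 11, 2          # sample sizes and common population SD
x = rng.normal(0, sigma, size=(100_000, m))
y = rng.normal(0, sigma, size=(100_000, n))
r = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)   # R ~ F(5, 10)

# E[F(5, 10)] = 10/8 = 1.25; and 1/R ~ F(10, 5), with mean 5/3
print(round(float(r.mean()), 2), round(float((1 / r).mean()), 2))
```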
Properties of F distribution
Ratio could have been S₂²/S₁² = 1/R with similar result: therefore,
if R ∼ F(m − 1, n − 1), then 1/R ∼ F(n − 1, m − 1).

Suppose T = Z/√(Y/(n − 1)) ∼ tₙ₋₁, where Z ∼ N(0, 1) and
Y ∼ χ²ₙ₋₁. Then

T² = (Z²/1)/(Y/(n − 1))

is a χ²₁/1 over a χ²ₙ₋₁/(n − 1); that is, T² ∼ F(1, n − 1).
384
In

R = [X/(m − 1)]/[Y/(n − 1)]:

if n → ∞, know that Y/(n − 1) →ᴾ 1, and numerator of
R ∼ χ²ₘ₋₁/(m − 1).

Hence, as n → ∞,

(m − 1)R →ᴰ χ²ₘ₋₁.

Thus χ²ₘ₋₁ is useful approx to F(m − 1, n − 1) if n large.
385
Summary
• Normal distribution often from CLT, so worth knowing about
related distributions.
• If Z ∼ N(0, 1), W = Z² ∼ χ²₁.
• ∑ᵢ₌₁ⁿ Zᵢ² ∼ χ²ₙ. Skewed to right.
• If W ∼ χ²ₙ, E(W) = n.
• For normal Xᵢ, if S² = ∑ᵢ₌₁ⁿ (Xᵢ − X̄)²/(n − 1) is sample
variance, (n − 1)S²/σ² ∼ χ²ₙ₋₁.
• (X̄ − µ)/√(S²/n) ∼ tₙ₋₁.
• As n increases, tₙ →ᴰ N(0, 1).
386
• If S₁², S₂² variances from 2 independent samples, sizes m and
n, ratio R = S₁²/S₂² ∼ F(m − 1, n − 1).
• t²ₙ₋₁ and F(1, n − 1) have same distribution.
• As n increases, (m − 1)R →ᴰ χ²ₘ₋₁.
387
Stochastic Processes
388
Random walks
Consider gambling game: win $1 with prob p, lose $1 with prob q
(p + q = 1). Each play independent. Start with fortune a; let Xₙ
denote fortune after n plays.

Thus X₀ = a; X₁ = a + 1 if win (prob p), X₁ = a − 1 if lose
(prob q).

Sequence {Xₙ} of random variables called random walk.
389
Properties of random walk
At each step, two possible outcomes (win/lose), same prob p of
winning, independent. So number of wins Wₙ ∼ Binomial(n, p).
With Wₙ wins, must be n − Wₙ losses, so fortune after n plays is

Xₙ = a + (1)Wₙ + (−1)(n − Wₙ) = a + 2Wₙ − n.

Since E(Wₙ) = np, have

E(Xₙ) = a + 2np − n = a + 2n(p − 1/2).

Also

Var(Xₙ) = 2² Var(Wₙ) = 4np(1 − p).
390
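These moment formulas are easy to verify by simulation (a Python sketch of our own; the values of a, p, n are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

a, p, n = 5, 0.25, 100
wins = rng.binomial(n, p, size=100_000)   # Wn ~ Binomial(n, p)
xn = a + 2 * wins - n                     # fortune after n plays

# theory: E(Xn) = a + 2n(p - 1/2) = -45, Var(Xn) = 4np(1 - p) = 75
print(round(float(xn.mean()), 1), round(float(xn.var()), 1))
```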
Since Wₙ ∼ Binomial(n, p), have

P(Wₙ = j) = (n choose j) p^j q^(n−j);

write in terms of Xₙ to get

P(Xₙ = a + k) = P(a + k = a + 2Wₙ − n)
             = P(Wₙ = (n + k)/2)
             = (n choose (n + k)/2) p^((n+k)/2) q^((n−k)/2).

Only certain values of Xₙ possible; formula fails for impossible
values.
391
Examples
Suppose a = 5, p = 1/4. Then
E(Xₙ) = 5 + 2n(1/4 − 1/2) = 5 − n/2. Expect fortune to decrease
on average.

What is P(X₃ = 6)? Write 6 = 5 + 1 so k = 1, n = 3;
(n + k)/2 = 2 and (n − k)/2 = 1:

P(X₃ = 6) = (3 choose 2) (1/4)² (3/4)¹ = 9/64.

How about P(X₉ = 7)? This is P(X₉ = 5 + 2), so n = 9 and
k = 2. But (n + k)/2 = (9 + 2)/2 not integer, so formula fails.
X₉ cannot be 7 (in fact X₉ must be even).
392
Now suppose a = 20, p = 2/3. Then

E(Xₙ) = 20 + 2n(2/3 − 1/2) = 20 + n/3,

increasing with n.

Find P(X₅ = 21): write 21 = 20 + 1, so n = 5, k = 1, giving
(n + k)/2 = 3, (n − k)/2 = 2 and

P(X₅ = 21) = (5 choose 3) (2/3)³ (1/3)² ≃ 0.329,

fairly likely.
393
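Both examples can be computed with a small helper (our own Python sketch of the formula above; it returns 0 for unreachable values):

```python
from math import comb

def walk_prob(n, k, p):
    """P(Xn = a + k) for the random walk; 0 when a + k is unreachable."""
    if (n + k) % 2 != 0 or abs(k) > n:
        return 0.0
    wins = (n + k) // 2                    # number of wins needed
    return comb(n, wins) * p**wins * (1 - p)**(n - wins)

print(walk_prob(3, 1, 0.25))   # P(X3 = 6) when a = 5: 9/64 = 0.140625
print(walk_prob(5, 1, 2/3))    # P(X5 = 21) when a = 20: about 0.329
print(walk_prob(9, 2, 0.25))   # P(X9 = 7): impossible, so 0.0
```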
Gambler’s ruin
Suppose we gamble with aim to reach fortune c > 0. How likely are
we to succeed before fortune reaches 0 (run out of money)?
Hard to see answer: no idea how long it takes to reach c or 0.
Idea: let S(a) be prob of reaching c first starting from fortune a.
Then for all c > 0, S(0) = 0, S(c) = 1. Also, if current fortune a,
fortune at next step either a + 1 or a − 1, leading to
S(a) = pS(a + 1) + qS(a − 1).
394
Solve above recurrence relation to get formula: if p = 1/2,
S(a) = a/c; otherwise,

S(a) = [1 − (q/p)^a] / [1 − (q/p)^c].

Example: start with $20, want to win $50. If p = 1/2, chance of
success is 20/50 = 0.4. If p = 0.51, chance of success is

S(20) = [1 − (0.49/0.51)²⁰] / [1 − (0.49/0.51)⁵⁰] ≃ 0.637.
Even a very small edge makes success much more likely. (Even
small disadvantage makes eventual failure much more likely.)
395
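The ruin formula as a small function (a sketch of our own; the function name is ours):

```python
def success_prob(a, c, p):
    """Probability the walk reaches fortune c before 0, starting from a."""
    if p == 0.5:
        return a / c
    r = (1 - p) / p                      # the ratio q/p
    return (1 - r**a) / (1 - r**c)

print(success_prob(20, 50, 0.5))             # 0.4
print(round(success_prob(20, 50, 0.51), 3))  # about 0.637
```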
Markov Chains
Simple model of weather:
• if sunny today, prob 0.7 of sunny tomorrow, prob 0.3 of rainy.
• if rainy today, prob 0.4 of sunny tomorrow, prob 0.6 of rainy.
Weather has two states (sunny, rainy). From one day to next,
weather may change state.
Probs above called transition probabilities. This kind of probability
model called Markov chain.
396
Can write as matrix:

P = [ 0.7 0.3 ]
    [ 0.4 0.6 ]

where element pᵢⱼ is P(go to state j | currently state i).
Note assumption: only need to know weather today to predict
weather tomorrow. (If weather today known, past weather
irrelevant). Called Markov property .
Suppose sunny today. Chance of sun in two days?
One idea: list possibilities. Two of them: SSS and SRS (sunny both
days, or rainy tomorrow then sunny). Use transition probs to
get (0.7)(0.7) + (0.3)(0.4) = 0.61.
397
Another: calculate matrix P²:

P² = [ 0.7 0.3 ] [ 0.7 0.3 ] = [ 0.61 0.39 ]
     [ 0.4 0.6 ] [ 0.4 0.6 ]   [ 0.52 0.48 ].

Note that top-left calculation same as 1st idea above.
Matrix P 2 gives two-step transition probs. That is, if sunny today,
prob of sunny in 2 days’ time 0.61; if rainy today, almost even
chance of being rainy in 2 days.
In general, Pⁿ gives n-step transition probs (weather in n days’
time given weather today).
398
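With NumPy (our substitution for Minitab's matrix commands), the n-step probabilities are just matrix powers:

```python
import numpy as np

# weather chain: state 0 = sunny, state 1 = rainy
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

P2 = np.linalg.matrix_power(P, 2)   # two-step transition probabilities
print(np.round(P2, 2))              # top-left entry is 0.61
```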
Another example
“Ehrenfest’s Urn”: Two urns, containing total of 4 balls. Choose one
ball at random, take out of current urn, place in other urn. Keep
track of number of balls in urn 1.
Transition matrix (states 0, 1, 2, 3, 4 balls in urn 1):

P = [ 0    1    0    0    0   ]
    [ 1/4  0    3/4  0    0   ]
    [ 0    2/4  0    2/4  0   ]
    [ 0    0    3/4  0    1/4 ]
    [ 0    0    0    1    0   ]
Apparent tendency for number of balls in 2 urns to even out.
399
Find likely number of balls in urn 1 after 9 steps by finding P⁹. (Use
Minitab: see section E.1 of manual, p. 162.) Answer (rounded):

P⁹ = [ 0     0.5  0     0.5  0     ]
     [ 0.125 0    0.75  0    0.125 ]
     [ 0     0.5  0     0.5  0     ]
     [ 0.125 0    0.75  0    0.125 ]
     [ 0     0.5  0     0.5  0     ]
Start with even number of balls in urn 1: end with either odd
number, equally likely. Start with odd number: end with even
number, most likely 2.
400
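The same calculation in NumPy (our substitution for the Minitab steps in the manual):

```python
import numpy as np

# Ehrenfest urn: states 0..4 balls in urn 1
P = np.array([
    [0,    1,    0,    0,    0   ],
    [0.25, 0,    0.75, 0,    0   ],
    [0,    0.5,  0,    0.5,  0   ],
    [0,    0,    0.75, 0,    0.25],
    [0,    0,    0,    1,    0   ],
])

P9 = np.linalg.matrix_power(P, 9)
print(np.round(P9, 3))   # entries close to the rounded matrix above
```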
Stationary distributions
Instead of starting from particular state, pick starting state from
prob. distribution θ = (θ₁, θ₂, . . .).
In weather example: suppose 80% chance today sunny, so
θ = (0.80, 0.20).
To get prob of each state n steps later, multiply θ as row vector by
Pⁿ. Weather example, for n = 2 days later:

(0.8, 0.2) P² = (0.8, 0.2) [ 0.61 0.39 ] = (0.592, 0.408).
                           [ 0.52 0.48 ]
401
Suppose we could find θ such that θP = θ. Then starting
distribution θ would be stationary: (marginal) prob of sunny day
same for all days.

Can try directly for weather example:

(θ₁, θ₂) P = (0.7θ₁ + 0.4θ₂, 0.3θ₁ + 0.6θ₂) = (θ₁, θ₂).

2 equations in 2 unknowns, collapse into one equation
0.3θ₁ − 0.4θ₂ = 0, but θᵢ are probs so that θ₁ + θ₂ = 1 also.
Solve: θ₁ = 4/7, θ₂ = 3/7.
More generally: solve θP = θ by transposing both sides to get
Pᵀθᵀ = θᵀ. Like solution to Av = λv with λ = 1: stationary
prob θ is eigenvector of Pᵀ with eigenvalue 1.
402
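The same eigenvector computation in NumPy (our substitution for Minitab's eigen-analysis):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# stationary distribution: eigenvector of P^T for eigenvalue 1,
# rescaled so that the entries sum to 1
vals, vecs = np.linalg.eig(P.T)
v = vecs[:, np.argmin(np.abs(vals - 1))].real
theta = v / v.sum()

print(np.round(theta, 4))   # (4/7, 3/7), i.e. about (0.5714, 0.4286)
```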
Can use Minitab to get eigenvalues/vectors (manual p. 167). Usually
need to scale eigenvector to get probs summing to 1.
Ehrenfest urn example: 5 eigenvectors; one with eigenvalue 1 is
(0.120, 0.478, 0.717, 0.478, 0.120), scaling to
(1/16, 4/16, 6/16, 4/16, 1/16).
(Actually binomial probs: see text p. 595).
403
Limiting distributions
If initial state chosen from stationary distribution, then prob of each
state remains same for all time.
Also: if watch Markov chain for many steps, should not matter much
which state we began in.
Weather example: 8-step transition matrix is

P⁸ = [ 0.57146 0.42854 ] ≃ [ 4/7 3/7 ]
     [ 0.57139 0.42861 ]   [ 4/7 3/7 ]

Starting either from sunny or rainy day, chance of sunny day in 8
days’ time is about 4/7. Called the limiting distribution, here same as
stationary distribution.
404
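This convergence is easy to see numerically (our own sketch):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

P8 = np.linalg.matrix_power(P, 8)
print(np.round(P8, 5))
# both rows are already close to the stationary distribution (4/7, 3/7)
```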
Compare Ehrenfest urn example:

P⁸ ≃ [ 0.125 0    0.75 0    0.125 ]
     [ 0     0.5  0    0.5  0     ]
     [ 0.125 0    0.75 0    0.125 ]
     [ 0     0.5  0    0.5  0     ]
     [ 0.125 0    0.75 0    0.125 ]

Not getting stationary distribution in each row.

Problem here: number of balls in urn 1 always goes from odd to
even or vice versa. So eg. P(1 ball in urn 1 after n steps)
alternates between 0 and positive; cannot have limit. Chain called
periodic.
405
To test whether a chain is periodic, ask: how many steps can it take
to get back to the current state? In the Ehrenfest urn example,
can get back to current state in 2, 4, 6, . . . steps, always a multiple
of 2. Thus the period of any state is 2.
A chain where every state has period 1 is called aperiodic.
406
Consider a third example:

P = [ 0.5  0.5  0 ]
    [ 0.75 0.25 0 ]
    [ 0    0    1 ]

Search for stationary distribution: there are two eigenvectors for
eigenvalue 1: (0.6, 0.4, 0) and (0, 0, 1).

Both stationary distributions in a way:
start in state 1 or 2, can never reach state 3. Start in state 3, can
never reach states 1 or 2.

Such chain called reducible: can split up into two chains, {1, 2}
and {3}, and treat each separately.
407
Markov chain limit theorem
Previous work suggests following theorem:
Suppose a Markov chain has a stationary distribution, is not
reducible, and is not periodic. Then its stationary distribution also
gives the probability, as n → ∞, of being in any particular state
after n steps.
In effect, the stationary distribution gives approx to long-term
behaviour of chain.
408
Summary
• Random walk: sequence of r.v.’s with X₀ = a,
Xₙ₊₁ = Xₙ + 1 with prob. p, Xₙ₊₁ = Xₙ − 1 with prob.
q = 1 − p.
• E(Xₙ) = a + 2n(p − 1/2); Var(Xₙ) = 4np(1 − p). E(Xₙ)
increasing function of n if p > 1/2.
• P(Xₙ = a + k) = (n choose (n + k)/2) p^((n+k)/2) q^((n−k)/2).
• Gambler’s ruin: does random walk reach 0 or c first? Prob. of
reaching c can be found; much greater even if p only slightly
bigger than 0.5.
409
• Markov Chain: set of states S, probs P(Sⱼ | Sᵢ) arranged in
matrix P = {pᵢⱼ}. (Depend only on one previous step.)
• Matrix Pᵏ gives k-step transition probs.
• If starting state chosen at random with probs. (θ₁, θ₂, . . .) and
θP = θ, θ called stationary distribution for chain. Find by
solving eigenvalue problem.
410
• Limiting distribution is limit of Pⁿ as n → ∞, if the limit exists. Gives
result of observing chain for many steps.
• State has period k if can only get back to state in a multiple of k
steps.
• Chain is irreducible if it is possible to move (in some number of
steps) from any state to any other state.
• For a Markov chain that is irreducible and aperiodic, if the
stationary distribution exists, it is same as limiting distribution.
411
... that’s all, folks!
412