
Lecture Notes (Introduction to Probability and Statistics)


18.05 Spring 2005 Lecture Notes

18.05 Lecture 1 February 2, 2005

Required Textbook - DeGroot & Schervish, “Probability and Statistics,” Third Edition Recommended Introduction to Probability Text - Feller, Vol. 1

§1.2-1.4. Probability, Set Operations. What is probability?

• Classical Interpretation: all outcomes have equal probability (coin, dice)

• Subjective Interpretation (nature of problem): uses a model, randomness involved (such as weather)

– ex. drop of paint falls into a glass of water, model can describe P(hit bottom before sides)

– or, P(survival after surgery)- “subjective,” estimated by the doctor.

• Frequency Interpretation: probability based on history

– P(make a free shot) is based on history of shots made.

Experiment: has a random outcome. 1. Sample Space - the set of all possible outcomes. coin: S = {H, T}, die: S = {1, 2, 3, 4, 5, 6}, two dice: S = {(i, j): i, j = 1, 2, ..., 6}

2. Events - any subset of the sample space, e.g. A ⊆ S; 𝒜 - the collection of all events.

3. Probability Distribution - P: 𝒜 → [0, 1]. Event A ⊆ S; P(A) or Pr(A) - probability of A.

Properties of Probability: 1. 0 ≤ P(A) ≤ 1. 2. P(S) = 1. 3. For disjoint (mutually exclusive) events A, B (definition: A ∩ B = ∅), P(A or B) = P(A) + P(B); this can be written for any number of events. For a sequence of events A1, ..., An, ... all disjoint (Ai ∩ Aj = ∅ for i ≠ j):

P(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai)

which is called "countable additivity." If the sample space is continuous, we can't talk about P(outcome); we need to consider P(set). Example: S = [0, 1], 0 < a < b < 1: P([a, b]) = b − a, P({a}) = P({b}) = 0.


Need to group outcomes, not sum up individual points since they all have P = 0.

§1.3 Events, Set Operations

Union of Sets: A ∪ B = {s ∈ S : s ∈ A or s ∈ B}

Intersection: A ∩ B = AB = {s ∈ S : s ∈ A and s ∈ B}

Complement: Aᶜ = {s ∈ S : s ∉ A}

Set Difference: A \ B = A − B = {s ∈ S : s ∈ A and s ∉ B} = A ∩ Bᶜ

Symmetric Difference: A △ B = (A ∩ Bᶜ) ∪ (B ∩ Aᶜ)

Summary of Set Operations:
1. Union: A ∪ B = {s ∈ S : s ∈ A or s ∈ B}
2. Intersection: A ∩ B = AB = {s ∈ S : s ∈ A and s ∈ B}
3. Complement: Aᶜ = {s ∈ S : s ∉ A}
4. Set Difference: A \ B = A − B = {s ∈ S : s ∈ A and s ∉ B} = A ∩ Bᶜ
5. Symmetric Difference: A △ B = {s ∈ S : (s ∈ A and s ∉ B) or (s ∈ B and s ∉ A)} = (A ∩ Bᶜ) ∪ (B ∩ Aᶜ)

Properties of Set Operations:
1. A ∪ B = B ∪ A
2. (A ∪ B) ∪ C = A ∪ (B ∪ C)
Note that 1. and 2. are also valid for intersections.
3. For mixed operations, distributivity holds: (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C); think of union as addition and intersection as multiplication: (A + B)C = AC + BC.

4. (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ - can be proven by the diagram below:

Both diagrams give the same shaded area of intersection.

5. (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ - prove by looking at a particular point:
s ∈ (A ∩ B)ᶜ ⟺ s ∉ (A ∩ B) ⟺ s ∉ A or s ∉ B ⟺ s ∈ Aᶜ or s ∈ Bᶜ ⟺ s ∈ (Aᶜ ∪ Bᶜ). QED
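As a quick editorial sanity check of these identities (not part of the original notes), here is a minimal Python sketch that verifies De Morgan's laws on an arbitrary finite sample space; the particular sets S, A, B are made-up illustration values.

```python
# Sketch: verify De Morgan's laws on a small finite sample space.
S = set(range(1, 11))          # sample space (hypothetical example)
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

complement = lambda E: S - E   # E^c relative to S

# (A ∪ B)^c == A^c ∩ B^c
assert complement(A | B) == complement(A) & complement(B)
# (A ∩ B)^c == A^c ∪ B^c
assert complement(A & B) == complement(A) | complement(B)
print("De Morgan's laws hold for this example.")
```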

** End of Lecture 1


18.05 Lecture 2 February 4, 2005

§1.5 Properties of Probability.
1. P(A) ∈ [0, 1]
2. P(S) = 1
3. P(∪ Ai) = ∑ P(Ai) if the events are disjoint, i.e. Ai ∩ Aj = ∅ for i ≠ j.

The probability of a union of disjoint events is the sum of their probabilities.

4. P(∅) = 0: P(S) = P(S ∪ ∅) = P(S) + P(∅) = 1, where S and ∅ are disjoint by definition and P(S) = 1 by #2; therefore P(∅) = 0.

5. P(Aᶜ) = 1 − P(A): because A and Aᶜ are disjoint, P(A ∪ Aᶜ) = P(S) = 1 = P(A) + P(Aᶜ); the sum of the probabilities of an event and its complement is 1.

6. If A ⊆ B, then P(A) ≤ P(B): by definition, B = A ∪ (B \ A), two disjoint sets, so P(B) = P(A) + P(B \ A) ≥ P(A).

7. P(A ∪ B) = P(A) + P(B) − P(AB): must subtract out the intersection because it would be counted twice, as shown.
Write it in terms of disjoint pieces to prove it:
P(A) = P(A \ B) + P(AB)
P(B) = P(B \ A) + P(AB)
P(A ∪ B) = P(A \ B) + P(B \ A) + P(AB)

Example: A doctor knows that P(bacterial infection) = 0.7 and P(viral infection) = 0.4. What is P(both) if P(bacterial ∪ viral) = 1? P(both) = P(B ∩ V): 1 = 0.7 + 0.4 − P(BV), so P(BV) = 0.1.

Finite Sample Spaces: There are a finite number of outcomes, S = {s1, ..., sn}. Define pi = P(si) as the probability function.


pi ≥ 0, ∑_{i=1}^n pi = 1, P(A) = ∑_{s ∈ A} P(s)

Classical, simple sample spaces - all outcomes have equal probabilities: P(A) = #(A)/#(S), computed by counting methods. Multiplication rule: #(S1) = m, #(S2) = n ⇒ #(S1 × S2) = mn.

Sampling without replacement: one at a time, order is important. Outcomes s1, ..., sn, k ≤ n (k chosen from n). #(outcome vectors (a1, a2, ..., ak)) = n(n − 1) × ... × (n − k + 1) = P_{n,k}

Example: order the numbers 1, 2, and 3 in groups of 2; (1, 2) and (2, 1) are different. P_{3,2} = 3 × 2 = 6. P_{n,n} = n(n − 1) × ... × 1 = n!
P_{n,k} = n!/(n − k)!

Example: Order 6 books on a shelf = 6! permutations.

Sampling with replacement, k out of n: number of possibilities = n × n × ... × n = n^k

Example: Birthday Problem - In a group of k people, what is the probability that 2 people will have the same birthday? Assume n = 365 and that birthdays are equally distributed throughout the year, no twins, etc.
#(all combinations of birthdays) = #(S = all possibilities) = 365^k
#(at least 2 the same) = #(S) − #(all different) = 365^k − P_{365,k}
P(at least 2 have the same birthday) = 1 − P_{365,k}/365^k
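A short Python sketch (an editorial illustration, not part of the original notes) evaluating this formula; the group sizes are arbitrary choices.

```python
# Sketch: P(at least 2 of k people share a birthday) = 1 - P_{365,k} / 365^k
from math import perm

def birthday_prob(k, n=365):
    # perm(n, k) = n * (n-1) * ... * (n-k+1) = P_{n,k}
    return 1 - perm(n, k) / n**k

for k in (10, 23, 50):
    print(k, round(birthday_prob(k), 4))
# k = 23 already gives a probability above 0.5
```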

Sampling without replacement, k at once: from s1, ..., sn, sample a subset of size k, {b1, ..., bk}, if we aren't concerned with order.
number of subsets = C_{n,k} = (n choose k) = n!/(k!(n − k)!)
Each set can be ordered k! ways, so divide that out of P_{n,k}.
C_{n,k} - binomial coefficients.

Binomial Theorem: (x + y)^n = ∑_{k=0}^n (n choose k) x^k y^{n−k}


There are (n choose k) times that each term shows up in the expansion.

Example: a red balls, b black balls. Number of distinguishable ways to order them in a row = (a+b choose a) = (a+b choose b)

Example: r1 + ... + rk = n; ri = number of balls in box i; n, k given. How many ways are there to split n objects into k sets? Visualize the balls in boxes, in a line - as shown. Fix the outer walls, rearrange the balls and the separators. If you fix the outer walls of the first and last boxes, you can rearrange the separators and the balls (a binomial-coefficient count). There are n balls and k − 1 separators (k boxes).
Number of different ways to arrange the balls and separators = (n + k − 1 choose n) = (n + k − 1 choose k − 1)

Example: f(x1, x2, ..., xk), take n partial derivatives, e.g. ∂ⁿf / (∂²x1 ∂x2 ∂⁵x3 ... ∂xk).
k "boxes" ↔ k "coordinates", n "balls" ↔ n "partial derivatives".
number of different partial derivatives = (n + k − 1 choose n) = (n + k − 1 choose k − 1)

Example: In a deck of 52 cards, 5 cards are chosen. What is the probability that all 5 cards have different face values?
total number of outcomes = (52 choose 5)
total number of face value combinations = (13 choose 5)
total number of suit possibilities, with replacement = 4^5
P(all 5 different face values) = (13 choose 5) · 4^5 / (52 choose 5)
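A minimal numerical check of this count using Python's math.comb (an editorial illustration, not part of the notes):

```python
# Sketch: P(all 5 cards have different face values)
from math import comb

p = comb(13, 5) * 4**5 / comb(52, 5)
print(round(p, 4))   # ≈ 0.5071
```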

** End of Lecture 2.


18.05 Lecture 3 February 7, 2005

P_{n,k} = n!/(n − k)! - choose k out of n, order counts, without replacement.
n^k - choose k out of n, order counts, with replacement.
C_{n,k} = n!/(k!(n − k)!) - choose k out of n, order doesn't count, without replacement.

§1.9 Multinomial Coefficients. These values are used to split objects into groups of various sizes. s1, s2, ..., sn - n elements such that n1 are in group 1, n2 in group 2, ..., nk in group k, with n1 + ... + nk = n.

(n choose n1)(n − n1 choose n2)(n − n1 − n2 choose n3) × ... × (n − n1 − ... − n_{k−2} choose n_{k−1})(n_k choose n_k)
= n!/(n1!(n − n1)!) × (n − n1)!/(n2!(n − n1 − n2)!) × (n − n1 − n2)!/(n3!(n − n1 − n2 − n3)!) × ... × (n − n1 − ... − n_{k−2})!/(n_{k−1}!(n − n1 − ... − n_{k−1})!) × 1
= n!/(n1! n2! ... n_{k−1}! n_k!) = (n choose n1, n2, ..., nk)

These combinations are called multinomial coefficients.

Further explanation: You have n "spots," in which you have n! ways to place your elements. However, you can permute the elements within a particular group and the splitting is still the same. You must therefore divide out these internal permutations. This is a "distinguishable permutations" situation.

Example #1 - 20 members of a club need to be split into 3 committees (A, B, C) of 8, 8, and 4 people, respectively. How many ways are there to split the club into these committees?
ways to split = (20 choose 8, 8, 4) = 20!/(8! 8! 4!)

Example #2 - When rolling 12 dice, what is the probability that 6 pairs are thrown? This can be thought of as "each number appears twice." There are 6^12 possibilities for the dice throws, as each of the 12 dice has 6 possible values. In pairs, the only freedom is where the dice show up.
(12 choose 2, 2, 2, 2, 2, 2) = 12!/(2!)^6 ⇒ P = 12!/((2!)^6 · 6^12) = 0.0034
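A quick numerical check of this multinomial count (an editorial sketch, not part of the notes):

```python
# Sketch: P(6 pairs when rolling 12 dice) = 12! / ((2!)^6 * 6^12)
from math import factorial

p = factorial(12) / (factorial(2)**6 * 6**12)
print(round(p, 4))   # ≈ 0.0034
```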


Example #3 - Playing Bridge. Players A, B, C, and D each get 13 cards.
P(A has 6 ♥s, B has 4 ♥s, C has 2 ♥s, D has 1 ♥) = ?
P = (13 choose 6, 4, 2, 1)(39 choose 7, 9, 11, 12) / (52 choose 13, 13, 13, 13) = (ways to choose the ♥s)(ways to choose the other cards)/(ways to arrange all cards) = 0.00196
Note - If it didn't matter who got the cards, multiply by 4! to arrange the people around the hands. Alternate way to solve - just track the locations of the ♥s:
P = (13 choose 6)(13 choose 4)(13 choose 2)(13 choose 1) / (52 choose 13)

Probabilities of Unions of Events:
P(A ∪ B) = P(A) + P(B) − P(AB)
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(BC) − P(AC) + P(ABC)
§1.10 - Calculating a Union of Events - P(union of events):
P(A ∪ B) = P(A) + P(B) − P(AB) (Figure 1)
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(BC) − P(AC) + P(ABC) (Figure 2)

Theorem:


P(∪_{i=1}^n Ai) = ∑_{i ≤ n} P(Ai) − ∑_{i<j} P(AiAj) + ∑_{i<j<k} P(AiAjAk) − ... + (−1)^{n+1} P(A1...An)

Express each disjoint piece, then add them up according to which sets each piece belongs or doesn't belong to. A1 ∪ ... ∪ An can be split into a disjoint partition of sets of the form
A_{i1} ∩ A_{i2} ∩ ... ∩ A_{ik} ∩ Aᶜ_{i(k+1)} ∩ ... ∩ Aᶜ_{in}
where the piece belongs to exactly the first k sets A_{i1}, ..., A_{ik} and to none of the others.

P(∪_{i=1}^n Ai) = ∑ P(disjoint pieces)

To check that the theorem is correct, see how many times each piece is counted.
In P(A1), P(A2), ..., P(Ak): counted k times.
In ∑_{i<j} P(AiAj): counted (k choose 2) times (the piece needs to be contained in both Ai and Aj, chosen from the k sets it belongs to).

Example: Consider the piece A ∩ B ∩ Cᶜ, as shown. This piece should be counted in P(A ∪ B ∪ C) exactly once.
In P(A) + P(B) + P(C): counted twice.
In −P(AB) − P(AC) − P(BC): subtracted once.
In +P(ABC): counted zero times.
The sum: 2 − 1 + 0 = 1, so the piece is counted exactly once.

Example: Consider the piece A1 ∩ A2 ∩ A3 ∩ A4ᶜ; here k = 3, n = 4.
In P(A1) + P(A2) + P(A3) + P(A4): counted k times (3 times).
In −P(A1A2) − P(A1A3) − P(A1A4) − P(A2A3) − P(A2A4) − P(A3A4): counted (k choose 2) times (3 times).
In ∑_{i<j<k} P(AiAjAk): counted (k choose 3) times (1 time).
Total in general: k − (k choose 2) + (k choose 3) − (k choose 4) + ... + (−1)^{k+1}(k choose k) = number of times the piece is counted.

To simplify, this is a binomial situation.


0 = (1 − 1)^k = ∑_{i=0}^k (k choose i)(−1)^i (1)^{k−i} = (k choose 0) − (k choose 1) + (k choose 2) − (k choose 3) + ...

0 = 1 − sum of times counted

therefore, all disjoint pieces are counted once.

** End of Lecture 3


18.05 Lecture 4 February 11, 2005

Union of Events

P(A1 ∪ ... ∪ An) = ∑_i P(Ai) − ∑_{i<j} P(AiAj) + ∑_{i<j<k} P(AiAjAk) − ...

It is often easier to calculate P(intersections) than P(unions)

Matching Problem: You have n letters and n envelopes, and you randomly stuff the letters into the envelopes. What is the probability that at least one letter will match its intended envelope? P(A1 ∪ ... ∪ An), where Ai = {letter i is in the right envelope}.

P(Ai) = (n − 1)!/n! = 1/n (permute all the other letters if just letter i is in the right place).
P(AiAj) = (n − 2)!/n! (letters i and j are in the right place).
P(A_{i1}A_{i2}...A_{ik}) = (n − k)!/n!

P(A1 ∪ ... ∪ An) = n × (1/n) − (n choose 2)(n − 2)!/n! + (n choose 3)(n − 3)!/n! − ... + (−1)^{n+1}(n choose n)(n − n)!/n!
general term: (n choose k)(n − k)!/n! = n!(n − k)!/(k!(n − k)!n!) = 1/k!
SUM = 1 − 1/2! + 1/3! − ... + (−1)^{n+1}/n!

Recall the Taylor series e^x = 1 + x + x²/2! + x³/3! + ...; for x = −1, e^{−1} = 1 − 1 + 1/2! − 1/3! + ...

therefore, SUM → 1 − e^{−1} as n → ∞. When n is large, the probability converges to 1 − e^{−1} ≈ 0.63.
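The partial sums converge quickly; a small Python sketch (an editorial illustration, not part of the notes) showing the first few values:

```python
# Sketch: partial sums 1 - 1/2! + 1/3! - ... approach 1 - 1/e ≈ 0.6321
from math import factorial, e

def p_at_least_one_match(n):
    return sum((-1)**(k + 1) / factorial(k) for k in range(1, n + 1))

for n in (3, 5, 10):
    print(n, round(p_at_least_one_match(n), 6))
print("limit:", round(1 - 1/e, 6))
```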

§2.1 - Conditional Probability Given that B “happened,” what is the probability that A also happened? The sample space is narrowed down to the space where B has occurred:

The sample space is now restricted to the outcomes where event B happened.

Definition: Conditional probability of Event A given Event B:

P(A|B) = P(AB)/P(B)

Visually, conditional probability is the area shown below:


It is sometimes easier to calculate an intersection using conditional probability: P(AB) = P(A|B)P(B)

Example: Roll 2 dice; given that the sum T is odd, find P(T < 8). B = {T is odd}, A = {T < 8}.
P(A|B) = P(AB)/P(B), with P(B) = 18/36 = 1/2.
All possible odd T: 3, 5, 7, 9, 11; the numbers of ways to get them are 2, 4, 6, 4, 2, respectively.
P(AB) = (2 + 4 + 6)/36 = 12/36 = 1/3; P(A|B) = (1/3)/(1/2) = 2/3

Example: Roll 2 dice until a sum of 7 or 8 results. A = {T = 7}, B = {T = 7 or 8}. This is the same as conditioning on a single roll:
P(A|B) = P(AB)/P(B) = P(A)/P(B) = (6/36) / ((6 + 5)/36) = 6/11

Example: Treatments for a disease, results after 2 years:
Result        A    B    C    Placebo
Relapse       18   13   22   24
No Relapse    22   25   16   10
Considering the Placebo column: B = Placebo, A = Relapse, P(A|B) = 24/(24 + 10) = 0.7.
Considering treatment B: P(A|B) = 13/(13 + 25) = 0.34.

As stated earlier, conditional probability can be used to calculate intersections.
Example: You have r red balls and b black balls in a bin. Draw 2 without replacement; what is P(1st = red, 2nd = black)?
P(1st = red) = r/(r + b). Given the 1st is red, there are only r − 1 red balls left and still b black balls, so
P(2nd = black | 1st = red) = b/(r + b − 1) ⇒ P(AB) = r/(r + b) × b/(r + b − 1)

P(A1A2...An) = P(A1) × P(A2|A1) × P(A3|A2A1) × ... × P(An|An−1...A2A1)
= P(A1) × P(A2A1)/P(A1) × P(A3A2A1)/P(A2A1) × ... × P(AnAn−1...A1)/P(An−1...A1)
= P(AnAn−1...A1)

Example, continued: Now, find P(r, b, b, r)


P(r, b, b, r) = r/(r + b) × b/(r + b − 1) × (b − 1)/(r + b − 2) × (r − 1)/(r + b − 3)

Example, Casino game - Craps. What's the probability of actually winning? On the first roll: 7, 11 - win; 2, 3, 12 - lose; any other number (x1) - you continue playing. If you eventually roll 7 first - you lose; if you roll x1 again first - you win!
P(win) = P(x1 = 7 or 11) + P(x1 = 4)P(get 4 before 7 | x1 = 4) + P(x1 = 5)P(get 5 before 7 | x1 = 5) + ... = 0.493
The game is almost fair!
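Here is a small exact computation of that sum (an editorial sketch, not part of the notes), using the fact that P(point before 7) = P(point)/(P(point) + P(7)):

```python
# Sketch: exact craps winning probability via the conditional decomposition above.
from fractions import Fraction

# P(sum = s) on one roll of two fair dice
ways = {s: sum(1 for i in range(1, 7) for j in range(1, 7) if i + j == s)
        for s in range(2, 13)}
P = {s: Fraction(w, 36) for s, w in ways.items()}

p_win = P[7] + P[11]
for point in (4, 5, 6, 8, 9, 10):
    p_win += P[point] * P[point] / (P[point] + P[7])

print(p_win, float(p_win))   # 244/495 ≈ 0.4929
```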

** End of Lecture 4


18.05 Lecture 5 February 14, 2005

§2.2 Independence of events. P(A|B) = P(AB)/P(B).
Definition - A and B are independent if P(A|B) = P(A):
P(A|B) = P(AB)/P(B) = P(A) ⇒ P(AB) = P(A)P(B)

Experiments can be physically independent (roll 1 die, then roll another die), or seem physically related and still be independent.
Example: Roll one die. A = {odd} = {1, 3, 5}, B = {1, 2, 3, 4}; P(A) = 1/2, P(B) = 2/3. AB = {1, 3}, so P(AB) = 1/3 = 1/2 × 2/3 = P(A)P(B), therefore independent. Related events, but independent.

Independence does not imply that the sets do not intersect.

Disjoint ≠ Independent.

If A, B are independent, find P(ABᶜ). P(AB) = P(A)P(B) and ABᶜ = A \ AB, as shown:
so, P(ABᶜ) = P(A) − P(AB) = P(A) − P(A)P(B) = P(A)(1 − P(B)) = P(A)P(Bᶜ)
therefore, A and Bᶜ are independent as well. Similarly, Aᶜ and Bᶜ are independent. See Pset 3 for the proof.

Independence allows you to find P(intersection) through simple multiplication.


Example: Toss an unfair coin twice; the tosses are independent events. P(H) = p, 0 ≤ p ≤ 1. Find P("TH") = P(tails first, heads second) = P(T)P(H) = (1 − p)p.
Since this is an unfair coin, the probability is not just 1/4. If the coin were fair, TH/(HH + HT + TH + TT) = 1/4.

If you have several events A1, A2, ..., An that you need to prove independent, it is necessary to show that every subset is independent: for any subset A_{i1}, A_{i2}, ..., A_{ik}, 2 ≤ k ≤ n, prove P(A_{i1}A_{i2}...A_{ik}) = P(A_{i1})P(A_{i2})...P(A_{ik}). You could prove that any 2 events are independent, which is called "pairwise" independence, but this is not sufficient to prove that all events are independent.

Example of pairwise independence: Consider a tetrahedral die, equally weighted. Three of the faces are each colored red, blue, and green, but the last face is multicolored, containing red, blue and green. P(red) = 2/4 = 1/2 = P(blue) = P(green). P(red and blue) = 1/4 = 1/2 × 1/2 = P(red)P(blue), so the pair {red, blue} is independent. The same can be proven for {red, green} and {blue, green}. But what about all three together? P(red, blue, and green) = 1/4 ≠ P(red)P(blue)P(green) = 1/8, so the events are not fully independent.

Example: P(H) = p, P(T) = 1 − p for an unfair coin. Toss the coin 5 times: P("HTHTT") = P(H)P(T)P(H)P(T)P(T) = p(1 − p)p(1 − p)(1 − p) = p²(1 − p)³
Example: Find P(get 2H and 3T, in any order) = sum of probabilities over orderings = P(HHTTT) + P(HTHTT) + ... = p²(1 − p)³ + p²(1 − p)³ + ... = (5 choose 2) p²(1 − p)³

General Example: Throw a coin n times; P(k heads out of n throws) = (n choose k) p^k (1 − p)^{n−k}

Example: Toss a coin until the result is "heads"; there are n tosses until H results. P(number of tosses = n) = ? The sequence must be "TTT...TH," with (n − 1) T's:
P(tosses = n) = P(TT...TH) = (1 − p)^{n−1} p

Example: In a criminal case, witnesses give a specific description of the couple seen fleeing the scene. P(random couple meets description) = 8.3 × 10^{−8} = p. We know at the beginning that 1 such couple exists. Perhaps a better question to ask is: given that one couple exists, what is the probability that another couple fits the same description?
A = {at least 1 couple matches}, B = {at least 2 couples match}; find P(B|A) = P(BA)/P(A) = P(B)/P(A)


Out of n couples, P(A) = P(at least 1 couple) = 1 − P(no couples) = 1 − ∏_{i=1}^n (1 − p): if no couples match, each couple fails to satisfy the description; use the independence property and multiply. P(A) = 1 − (1 − p)^n
P(B) = P(at least two) = 1 − P(0 couples) − P(exactly 1 couple) = 1 − (1 − p)^n − np(1 − p)^{n−1}; keep in mind that P(exactly 1) is the binomial P(k out of n) with k = 1.
P(B|A) = [1 − (1 − p)^n − np(1 − p)^{n−1}] / [1 − (1 − p)^n]
If n = 8 million couples, P(B|A) = 0.2966, which is within reasonable doubt! P(2 couples) < P(1 couple), but given that 1 couple exists, the probability that 2 exist is not insignificant.

In this large sample space, the probability that B occurs, given that we know A occurred, is significant!
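A short sketch evaluating this conditional probability numerically (an editorial illustration; the population size n is the value used above):

```python
# Sketch: evaluate P(B|A) for the two-couples example.
p = 8.3e-8          # P(a random couple matches the description)
n = 8_000_000       # number of couples considered

pA = 1 - (1 - p)**n                     # at least one match
pB = pA - n * p * (1 - p)**(n - 1)      # at least two matches
print(round(pB / pA, 4))                # close to the ~0.2966 quoted above
```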

§2.3 Bayes's Theorem

It is sometimes useful to separate a sample space S into a set of disjoint partitions:

B1, ..., Bk - a partition of the sample space S:
Bi ∩ Bj = ∅ for i ≠ j, S = ∪_{i=1}^k Bi (disjoint)
Total probability: P(A) = ∑_{i=1}^k P(ABi) = ∑_{i=1}^k P(A|Bi)P(Bi)
(all ABi are disjoint, and ∪_{i=1}^k ABi = A)

** End of Lecture 5


18.05 Lecture 6 February 16, 2005

Solutions to Problem Set #1
1-1 pg. 12 #9: Bn = ∪_{i=n}^∞ Ai, Cn = ∩_{i=n}^∞ Ai
a) Bn ⊇ Bn+1, ...: Bn = An ∪ (∪_{i=n+1}^∞ Ai) = An ∪ Bn+1, so s ∈ Bn+1 ⇒ s ∈ Bn+1 ∪ An = Bn.
Cn ⊆ Cn+1, ...: Cn = An ∩ Cn+1, so s ∈ Cn = An ∩ Cn+1 ⇒ s ∈ Cn+1.
b) s ∈ ∩_{n=1}^∞ Bn ⇔ s ∈ Bn for all n ⇔ s ∈ ∪_{i=n}^∞ Ai for all n ⇔ for every n, s ∈ some Ai with i ≥ n ⇔ s belongs to infinitely many events Ai ⇔ the Ai happen infinitely often.
c) s ∈ ∪_{n=1}^∞ Cn ⇔ s ∈ some Cn = ∩_{i=n}^∞ Ai ⇔ for some n, s ∈ all Ai for i ≥ n ⇔ s belongs to all events starting at some n.

1-2 pg. 18 #4 P (at least 1 fails) = 1 − P (neither fail) = 1 − 0.4 = 0.6

1-3 pg. 18 #12: A1, A2, ...
B1 = A1, B2 = A1ᶜA2, ..., Bn = A1ᶜ...A_{n−1}ᶜAn
P(∪_{i=1}^n Ai) = ∑_{i=1}^n P(Bi): the Bi split the union into disjoint events and cover the same space. This follows from ∪_{i=1}^n Ai = ∪_{i=1}^n Bi: take a point s in ∪_{i=1}^n Ai, so s belongs to at least one Ai. If s ∈ A1 = B1, done; if not and s ∈ A2, then s ∈ A1ᶜA2 = B2; if not... etc. At some point the point belongs to a set: the sequence stops when s ∈ A1ᶜ ∩ ... ∩ A_{k−1}ᶜ ∩ Ak = Bk, so s ∈ ∪_{i=1}^n Bi. (One should also check that a point in Bi belongs to Ai, which is clear since Bi ⊆ Ai.)
Then P(∪_{i=1}^n Ai) = P(∪_{i=1}^n Bi) = ∑_{i=1}^n P(Bi) if the Bi are disjoint. Need to prove the Bi are disjoint - by construction, for i < j:
Bi = A1ᶜ ∩ ... ∩ A_{i−1}ᶜ ∩ Ai, Bj = A1ᶜ ∩ ... ∩ Aiᶜ ∩ ... ∩ A_{j−1}ᶜ ∩ Aj
s ∈ Bi ⇒ s ∈ Ai, while s′ ∈ Bj ⇒ s′ ∉ Ai; this implies s ≠ s′.

1-4 pg. 27 #5: #(S) = 6 × 6 × 6 × 6 = 6⁴
#(all different) = 6 × 5 × 4 × 3 = P_{6,4}; P(all different) = P_{6,4}/6⁴ = 5/18

1-5 pg. 27 #7: 12 balls in 20 boxes. P(no box receives > 1 ball, i.e. each box has 0 or 1 balls) means that all balls fall into different boxes.
#(S) = 20¹², #(all different) = 20 × 19 × ... × 9 = P_{20,12}


P(...) = P_{20,12}/20¹²

1-6 pg. 27 #10: 100 balls, r of them red. Ai = {draw red at step i}. Think of arranging the balls in 100 spots in a row.
a) P(A1) = r/100
b) P(A50): sample space = sequences of length 50, #(S) = 100 × 99 × ... × 51 = P_{100,50}; #(A50) = r × P_{99,49} (r choices to put a red ball in spot 50, then arrange the 99 remaining balls in the other 49 spots). P(A50) = r/100, same as part a.
c) As shown in part b, the particular draw doesn't matter; the probability is the same: P(A100) = r/100.

1-7 pg. 34 #6: Seat n people in n spots. #(S) = n!. #(A and B sit together) = ? Visualize the n seats: there are n − 1 adjacent pairs of seats, hence 2(n − 1) ways to seat the pair (the two people can be switched); then account for the (n − 2) people remaining: #(AB) = 2(n − 1)(n − 2)!
therefore, P = 2(n − 1)!/n! = 2/n. Or, think of the pair as 1 entity: there are (n − 1) entities, permute them, and multiply by 2 to swap the pair.

1-8 pg. 34 #11: Out of 100, choose 12. #(S) = (100 choose 12)
#(A and B are both on the committee) = (98 choose 10), choosing 10 from the 98 remaining.
P = (98 choose 10)/(100 choose 12)

1-9 pg. 34 #16: 50 states × 2 senators each.
a) Select 8: #(S) = (100 choose 8). #(senator 1 or senator 2 of the given state is chosen) = (2 choose 2)(98 choose 6) + (2 choose 1)(98 choose 7); or calculate 1 − P(neither chosen) = 1 − (98 choose 8)/(100 choose 8).
b) #(one senator from each state) = 2⁵⁰; #(ways to select a group of 50) = (100 choose 50).

1-10 pg. 34 #17: In the sample space, only consider the positions of the aces among the hands.
#(S) = (52 choose 4), #(all 4 aces go to 1 player) = 4 × (13 choose 4)
P = 4 × (13 choose 4)/(52 choose 4)

1-11: r balls, n boxes, no box empty. First put 1 ball in each box; r − n balls remain to be distributed among the n boxes:
(n + (r − n) − 1 choose r − n) = (r − 1 choose r − n)
1-12:

30 people, 12 months. P(6 months with 3 birthdays, 6 months with 2 birthdays)? #(S) = 12³⁰.
Need to choose the 6 months with 3 birthdays (and the 6 with 2), then use the multinomial coefficient:
#(possibilities) = (12 choose 6) × (30 choose 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2)

** End of Lecture 6


18.05 Lecture 7 February 18, 2005

Bayes’ Formula.

Partition B1, ..., Bk: ∪_{i=1}^k Bi = S, Bi ∩ Bj = ∅ for i ≠ j
P(A) = ∑_{i=1}^k P(ABi) = ∑_{i=1}^k P(A|Bi)P(Bi) - total probability.

Example: In box 1, there are 60 short bolts and 40 long bolts. In box 2, there are 10 short bolts and 20 long bolts. Take a box at random, and pick a bolt. What is the probability that you chose a short bolt?
B1 = choose Box 1, B2 = choose Box 2.
P(short) = P(short|B1)P(B1) + P(short|B2)P(B2) = (60/100)(1/2) + (10/30)(1/2)

Example: Partitions B1, B2, ..., Bk, and you know their distribution; you also know P(A|Bi) for each Bi. If you know that A happened, what is the probability that it came from a particular Bi?
P(Bi|A) = P(BiA)/P(A) = P(A|Bi)P(Bi) / [P(A|B1)P(B1) + ... + P(A|Bk)P(Bk)] : Bayes's Formula
Example: Medical detection test, 90% accurate.
Partition - you have the disease (B1), you don't have the disease (B2). The accuracy means, in terms of probability: P(positive|B1) = 0.9, P(positive|B2) = 0.1.
In the general public, the chance of having the disease is 1 in 10,000: P(B1) = 0.0001, P(B2) = 0.9999.
If the result comes up positive, what is the probability that you actually have the disease, P(B1|positive)?
P(B1|positive) = P(positive|B1)P(B1) / [P(positive|B1)P(B1) + P(positive|B2)P(B2)]
= (0.9)(0.0001) / [(0.9)(0.0001) + (0.1)(0.9999)] = 0.0009

The probability is still very small that you actually have the disease.
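A minimal numerical sketch of this Bayes computation (an editorial illustration, not part of the notes), using the probabilities given above:

```python
# Sketch: Bayes' formula for the medical test example.
p_disease = 0.0001
p_pos_given_disease = 0.9
p_pos_given_healthy = 0.1

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))   # ≈ 0.0009
```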


Example: Identify the source of a defective item. There are 3 machines, M1, M2, M3, with P(defective) = 0.01, 0.02, 0.03, respectively. The percentages of items that come from each machine are 20%, 30%, and 50%, respectively.
Probability that an item comes from each machine: P(M1) = 0.2, P(M2) = 0.3, P(M3) = 0.5.
Probability that a machine's item is defective: P(D|M1) = 0.01, P(D|M2) = 0.02, P(D|M3) = 0.03.
Probability that a defective item came from Machine 1:
P(M1|D) = P(D|M1)P(M1) / [P(D|M1)P(M1) + P(D|M2)P(M2) + P(D|M3)P(M3)]
= (0.01)(0.2) / [(0.01)(0.2) + (0.02)(0.3) + (0.03)(0.5)] = 0.087

Example: A gene has 2 alleles, A and a. The gene exhibits itself through a trait with two versions. The possible phenotypes are "dominant," with genotypes AA or Aa, and "recessive," with genotype aa. Alleles travel independently, derived from a parent's genotype. In the population, the probability of carrying a particular allele is P(A) = 0.5, P(a) = 0.5; therefore, the probabilities of the genotypes are P(AA) = 0.25, P(Aa) = 0.5, P(aa) = 0.25. Partitions: genotypes of the parents: (AA, AA), (AA, Aa), (AA, aa), (Aa, Aa), (Aa, aa), (aa, aa). Assume pairs match regardless of genotype.

Parent genotypes   Probability               P(child has dominant phenotype)
(AA, AA)           1/16                      1
(AA, Aa)           2 × (1/4)(1/2) = 1/4      1
(AA, aa)           2 × (1/4)(1/4) = 1/8      1
(Aa, Aa)           (1/2)(1/2) = 1/4          3/4
(Aa, aa)           2 × (1/2)(1/4) = 1/4      1/2
(aa, aa)           1/16                      0

If you see that a person has the dominant phenotype (e.g. dark hair), predict the genotypes of the parents:
P((AA, AA) | dominant) = (1/16)(1) / [(1/16)(1) + (1/4)(1) + (1/8)(1) + (1/4)(3/4) + (1/4)(1/2) + (1/16)(0)] = (1/16)/(3/4) = 1/12
You can do the same computation to find the probability of each type of couple. Bayes's formula gives a prediction about the parents' genotypes, which you aren't able to observe directly.

Example: You have 1 machine. In good condition it produces defective items only 1% of the time, and P(good condition) = 90%. In broken condition it produces defective items 40% of the time, and P(broken) = 10%. You sample 6 items and find that 2 are defective. Is the machine broken? This is very similar to the medical example worked earlier in lecture:
P(good | 2 of 6 defective) = P(2 of 6 | good)P(good) / [P(2 of 6 | good)P(good) + P(2 of 6 | broken)P(broken)]
= (6 choose 2)(0.01)²(0.99)⁴(0.9) / [(6 choose 2)(0.01)²(0.99)⁴(0.9) + (6 choose 2)(0.4)²(0.6)⁴(0.1)] = 0.04

** End of Lecture 7


18.05 Lecture 8 February 22, 2005

§3.1 - Random Variables and Distributions. A random variable transforms the outcome of an experiment into a number.
Definitions: Probability Space: (S, 𝒜, P); S - sample space, 𝒜 - events, P - probability. A random variable is a function on S with values in the real numbers, X: S → R.

Examples: Toss a coin 10 times. Sample Space = {HTH...HT, ...}, all configurations of H and T. Random variable X = number of heads, X: S → R; here X: S → {0, 1, ..., 10}. There are fewer outcomes than in S, so you need to give the distribution of the random variable in order to get the entire picture; probabilities are therefore given.

Definition: The distribution of a random variable X: S → R is defined by: for A ⊆ R, P(A) = P(X ∈ A) = P(s ∈ S : X(s) ∈ A)

The random variable maps outcomes and probabilities to real numbers. This simplifies the problem, as you only need to define the mapped R and P, not the original S and P. The mapped variables describe X, so you don't need to consider the original, more complicated probability space.

From the example, P(X = #(heads in 10 tosses) = k) = (10 choose k)(1/2)^k (1/2)^{10−k} = (10 choose k)/2¹⁰
Note: you need to distribute the heads among the tosses and account for the probability of both the heads and the tails tossed.
This is a specific example of the more general binomial problem: a random variable X ∈ {0, 1, ..., n},
P(X = k) = (n choose k) p^k (1 − p)^{n−k}

This distribution is called the binomial distribution: B(n, p), which is an example of a discrete distribution.

Discrete Distribution: A random variable X is called discrete if it takes a finite or countable number (sequence) of values, X ∈ {s1, s2, s3, ...}. It is completely described by telling the probability of each outcome. The distribution is defined by P(X = sk) = f(sk), the probability function (p.f.). A p.f. cannot be negative and should sum to 1 over all outcomes. P(X ∈ A) = ∑_{sk ∈ A} f(sk)

Example: Uniform distribution on a finite number of values {1, 2, 3, ..., n}: each outcome has equal probability, f(sk) = 1/n - the uniform probability function.
For a random variable X ∈ R with P(A) = P(X ∈ A), A ⊆ R, we can redefine the probability space on the distribution of the random variable: (R, 𝒜, P) as the sample space, with X: R → R, X(x) = x (the identity map), and P(A) = P(X : X(x) ∈ A) = P(x ∈ A). All you need is the outcomes mapped to real numbers and the probabilities of the mapped outcomes.

Example: Poisson Distribution on {0, 1, 2, 3, ...}, Π(λ), λ = intensity. Probability function:
f(k) = P(X = k) = e^{−λ} λ^k / k!, where the parameter λ > 0.
∑_{k≥0} e^{−λ} λ^k / k! = e^{−λ} ∑_{k≥0} λ^k / k! = e^{−λ} e^{λ} = e⁰ = 1
A very common distribution that will be used later in statistics. It represents a variety of situations - e.g. the distribution of "typos" on a particular page of a book, the number of stars in a random spot of the sky, etc. It is a good approximation for real-world counting problems in which large counts (say P(X > 10)) have small probability.

Continuous Distribution: Need to consider intervals, not points. Probability density function (p.d.f.): f(x) ≥ 0. Summation is replaced by an integral:
∫_{−∞}^{∞} f(x)dx = 1, and then P(A) = ∫_A f(x)dx, as shown.
If you were to choose a random point on an interval, the probability of choosing a particular point is equal to zero. You can't assign positive probability to any point, as it would add up infinitely on a continuous interval. It is necessary to take P(point lies in a particular sub-interval). The definition implies that P({a}) = ∫_a^a f(x)dx = 0

Example: The uniform distribution on [a, b], denoted U[a, b], has p.d.f.: f(x) = 1/(b − a) for x ∈ [a, b]; 0 for x ∉ [a, b]
Example: On an interval [a, b] with a < c < d < b:
P([c, d]) = ∫_c^d 1/(b − a) dx = (d − c)/(b − a) (probability of a subinterval)

Example: Exponential Distribution E(λ), with parameter λ > 0.
p.d.f.: f(x) = λe^{−λx} if x ≥ 0; 0 if x < 0. Check that it integrates to 1:


∫_0^∞ λe^{−λx} dx = λ(−(1/λ)e^{−λx}) |_0^∞ = 1
Real world: the exponential distribution describes the life span of quality products (electronics).

** End of Lecture 8


18.05 Lecture 9 February 23, 2005

Discrete Random Variable: defined by the probability function (p.f.): X ∈ {s1, s2, ...}, f(si) = P(X = si)
Continuous: probability density function (p.d.f.), also called the density function: f(x) ≥ 0, ∫_{−∞}^{∞} f(x)dx = 1, P(X ∈ A) = ∫_A f(x)dx

Cumulative distribution function (c.d.f.): F(x) = P(X ≤ x), x ∈ R. Properties:
1. x1 ≤ x2 ⇒ {X ≤ x1} ⊆ {X ≤ x2} ⇒ P(X ≤ x1) ≤ P(X ≤ x2): a non-decreasing function.
2. lim_{x→−∞} F(x) = P(X ≤ −∞) = 0, lim_{x→∞} F(x) = P(X ≤ ∞) = 1. A random variable only takes real values, so as x → −∞ the set becomes empty.

Example: P(X = 0) = 1/2, P(X = 1) = 1/2.
P(X ≤ x) = 0 for x < 0
P(X ≤ x) = P(X = 0) = 1/2 for x ∈ [0, 1)
P(X ≤ x) = P(X = 0 or 1) = 1 for x ∈ [1, ∞)

3. "right continuous": lim_{y→x+} F(y) = F(x)
F(y) = P(X ≤ y); for yn ↓ x, ∩_{n=1}^∞ {X ≤ yn} = {X ≤ x}, so F(yn) → P(X ≤ x) = F(x)

Probability of the random variable falling in an interval: P(x1 < X ≤ x2) = P({X ≤ x2} \ {X ≤ x1}) = P(X ≤ x2) − P(X ≤ x1) = F(x2) − F(x1)


Probability of a point x: P(X = x) = F(x) − F(x−), where F(x−) = lim_{y→x−} F(y) and F(x+) = lim_{y→x+} F(y).

If F is continuous at a point, the probability there is equal to 0; if there is a jump, the probability is the size of the jump.
P(x1 ≤ X ≤ x2) = F(x2) − F(x1−)
P(A) = P(X ∈ A), X - random variable with distribution P.

When observing a c.d.f:

Discrete: sum of probabilities at all the jumps = 1. Graph is horizontal in between the jumps, meaning that probability = 0 in those intervals.

Continuous: F(x) = P(X ≤ x) = ∫_{−∞}^x f(t)dt; eventually, the graph approaches 1.


If f is continuous, f(x) = F′(x)

Quantile: for p ∈ [0, 1], the p-quantile = inf{x : F(x) = P(X ≤ x) ≥ p}: find the smallest point such that the probability up to that point is at least p. The area under the density f(x) up to this point x is (at least) p. If the 0.25 quantile is at x = 0, then P(X ≤ 0) ≥ 0.25.
Note that if F jumps at x = 0 from below 0.25 up to 0.5, then the 0.25 quantile is at x = 0, but so are the 0.3, 0.4, ... quantiles, all the way up to 0.5.

What if you have 2 random variables, or more? E.g. take a person and measure weight and height: the separate behaviors tell you nothing about the pairing; you need to describe the joint distribution. Consider a pair of random variables (X, Y). Joint distribution of (X, Y): P((X, Y) ∈ A), for events (sets) A ⊆ R².


Discrete distribution: (X, Y) ∈ {(s1¹, s2¹), (s1², s2²), ...}
Joint p.f.: f(s1ⁱ, s2ʲ) = P((X, Y) = (s1ⁱ, s2ʲ)) = P(X = s1ⁱ, Y = s2ʲ)
Often visualized as a table, assigning a probability to each point:

X \ Y    0     -1    -2.5   5
1        0.1   0     0.2    0
1.5      0     0     0      0.1
3        0.2   0     0.4    0

Continuous: f(x, y) ≥ 0, ∫∫_{R²} f(x, y)dxdy = ∫_{−∞}^{∞}∫_{−∞}^{∞} f(x, y)dxdy = 1; P((X, Y) ∈ A) = ∫∫_A f(x, y)dxdy

Joint c.d.f.: F(x, y) = P(X ≤ x, Y ≤ y). Joint p.d.f.: f(x, y), with P((X, Y) ∈ A) = ∫∫_A f(x, y)dxdy.
If you want the c.d.f. of x only: F(x) = P(X ≤ x) = P(X ≤ x, Y ≤ +∞) = F(x, ∞) = lim_{y→∞} F(x, y). Same for y.

To find the probability within a rectangle on the (x, y) plane:

Continuous: F(x, y) = ∫_{−∞}^x ∫_{−∞}^y f(u, v) dv du. Also, ∂²F/∂x∂y = f(x, y).

** End of Lecture 9


18.05 Lecture 10 February 25, 2005

Review of Distribution Types. Discrete distribution for (X, Y): joint p.f. f(x, y) = P(X = x, Y = y). Continuous: joint p.d.f. f(x, y) ≥ 0, ∫∫_{R²} f(x, y)dxdy = 1.
Joint c.d.f.: F(x, y) = P(X ≤ x, Y ≤ y); F(x) = P(X ≤ x) = lim_{y→∞} F(x, y)
In the continuous case: F(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y f(u, v) dv du.
Marginal Distributions: Given the joint distribution of (X, Y), the individual distributions of X and Y are the marginal distributions.

Discrete (X, Y): marginal probability function f1(x) = P(X = x) = ∑_y P(X = x, Y = y) = ∑_y f(x, y)

In the table for the previous lecture, of probabilities for each point (x, y):Add up all values for y in the row x = 1 to determine P(X = 1)

Continuous (X, Y): joint p.d.f. f(x, y); the p.d.f. of X is f1(x) = ∫_{−∞}^∞ f(x, y)dy, and f1(x) = ∂F/∂x.
F(x) = P(X ≤ x) = P(X ≤ x, Y ≤ ∞) = ∫_{−∞}^x ∫_{−∞}^∞ f(x, y) dy dx

Why not integrate over a line? P({X = x}) = ∫_{−∞}^∞ (∫_x^x f(x, y)dx) dy = 0: the probability of a continuous random variable at a specific point is 0.

Example: Joint p.d.f. f(x, y) = (21/4)x²y for x² ≤ y ≤ 1 (so −1 ≤ x ≤ 1); 0 otherwise.


What is the distribution of X?
p.d.f. f1(x) = ∫_{x²}^1 (21/4)x²y dy = (21/4)x² × (1/2)y² |_{x²}^1 = (21/8)x²(1 − x⁴), −1 ≤ x ≤ 1

Discrete values for X, Y in tabular form (with marginals):
X \ Y    1     2
1        0.5   0    | 0.5
2        0     0.5  | 0.5
         0.5   0.5
Note: If all four entries were 0.25, the marginal distributions would be the same, so the marginals do not determine the joint distribution.

Independent X and Y:
Definition: X, Y are independent if P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B).
Joint c.d.f.: F(x, y) = P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y) = F1(x)F2(y) (intersection of events). The joint c.d.f. can be factored for independent random variables.
Implication: continuous (X, Y) with joint p.d.f. f(x, y) and marginals f1(x), f2(y):
F(x, y) = ∫_{−∞}^x ∫_{−∞}^y f(x, y)dydx = F1(x)F2(y) = ∫_{−∞}^x f1(x)dx × ∫_{−∞}^y f2(y)dy
Taking ∂²/∂x∂y of both sides: f(x, y) = f1(x)f2(y). Independent if the joint density is a product.

Much simpler in the discrete case:Discrete (X, Y): f (x, y) = P(X = x, Y = y) = P(X = x)P(Y = y) = f1(x)f2(y) by definition.

Example: Joint p.d.f. f(x, y) = kx²y², x² + y² ≤ 1; 0 otherwise.
X and Y are not independent variables: f(x, y) ≠ f1(x)f2(y) because of the circle condition.


P(square) = 0 ≠ P(X ∈ side) × P(Y ∈ side) (take a small square outside the circle but inside the ranges of X and Y).

Example: f(x, y) = kx²y², 0 ≤ x ≤ 1, 0 ≤ y ≤ 1; 0 otherwise. This can be written as a product, so X and Y are independent:
f(x, y) = kx²y² I(0 ≤ x ≤ 1, 0 ≤ y ≤ 1) = k1 x² I(0 ≤ x ≤ 1) × k2 y² I(0 ≤ y ≤ 1)

Conditions on x and y can be separated.

Note: Indicator notation: I(x ∈ A) = 1 if x ∈ A; 0 if x ∉ A.

For the discrete case, given a table of values, you can tell independence:

         b1    b2    ...   bm
a1       p11   p12   ...   p1m   | p1+
a2       ...   ...   ...   ...   | p2+
...      ...   ...   ...   ...   | ...
an       pn1   ...   ...   pnm   | pn+
         p+1   p+2   ...   p+m

pij = P(X = ai, Y = bj), pi+ = P(X = ai) = ∑_{j=1}^m pij, p+j = P(Y = bj) = ∑_{i=1}^n pij
X and Y are independent if pij = P(X = ai)P(Y = bj) = pi+ × p+j for every i, j - all points in the table.

** End of Lecture 10


18.05 Lecture 11 February 28, 2005

A pair (X, Y) of random variables: f(x, y) is the joint p.f. (discrete) or joint p.d.f. (continuous).
Marginal distributions: f(x) = ∑_y f(x, y) - p.f. of X (discrete); f(x) = ∫ f(x, y)dy - p.d.f. of X (continuous)

Conditional Distributions

Discrete Case:

P(X = x | Y = y) = P(X = x, Y = y)/P(Y = y)
f(x|y) = f(x, y)/f(y): conditional p.f. of X given Y = y. Note: defined when f(y) is positive.
f(y|x) = f(x, y)/f(x): conditional p.f. of Y given X = x. Note: defined when f(x) is positive.
If the marginal probabilities are zero, the conditional probability is undefined.

Continuous Case: The formulas are the same, but we can't treat them as exact probabilities at fixed points. Consider instead probability density. Conditional c.d.f. of X given Y = y:
P(X ≤ x | Y ∈ [y − δ, y + δ]) = P(X ≤ x, Y ∈ [y − δ, y + δ]) / P(Y ∈ [y − δ, y + δ])
With joint p.d.f. f(x, y) and P(A) = ∫∫_A f(x, y)dxdy, this equals
[(1/2δ) ∫_{y−δ}^{y+δ} ∫_{−∞}^x f(x, y)dxdy] / [(1/2δ) ∫_{y−δ}^{y+δ} ∫_{−∞}^∞ f(x, y)dxdy]
As δ → 0, this tends to
∫_{−∞}^x f(x, y)dx / ∫_{−∞}^∞ f(x, y)dx = ∫_{−∞}^x f(x, y)dx / f(y)

Conditional c.d.f:


P(X ≤ x | Y = y) = ∫_{−∞}^x f(x, y)dx / f(y)
Conditional p.d.f.:
f(x|y) = (∂/∂x) P(X ≤ x | Y = y) = f(x, y)/f(y)

Same result as the discrete case. Also, f(x|y) is only defined when f(y) > 0.

Multiplication Rule: f(x, y) = f(x|y)f(y)
Bayes's Theorem:

f(y|x) = f(x, y)/f(x) = f(x|y)f(y) / ∫ f(x, y)dy = f(x|y)f(y) / ∫ f(x|y)f(y)dy

Bayes's formula for random variables: for each y, you know the distribution of x. Note: in the discrete case, the integrals become sums (∫ → ∑). In statistics, after observing data, you figure out the parameter using Bayes's formula.

Example: Draw X uniformly on [0, 1], then draw Y uniformly on [X, 1]. p.d.f.:
f(x) = 1 × I(0 ≤ x ≤ 1), f(y|x) = 1/(1 − x) × I(x ≤ y ≤ 1)
Joint p.d.f.: f(x, y) = f(y|x)f(x) = 1/(1 − x) × I(0 ≤ x ≤ y ≤ 1)
Marginal: f(y) = ∫ f(x, y)dx = ∫_0^y 1/(1 − x) dx = −ln(1 − x)|_0^y = −ln(1 − y)
Keep in mind that this holds for y ∈ [0, 1], and f(y) = 0 if y ∉ [0, 1].
Conditional (of X given Y):
f(x|y) = f(x, y)/f(y) = −1/[(1 − x)ln(1 − y)] × I(0 ≤ x ≤ y ≤ 1)

Multivariate Distributions. Consider n random variables X1, X2, ..., Xn.
Joint p.f.: f(x1, x2, ..., xn) = P(X1 = x1, ..., Xn = xn) ≥ 0, ∑ f = 1
Joint p.d.f.: f(x1, x2, ..., xn) ≥ 0, ∫ f dx1dx2...dxn = 1

Marginal and conditional distributions work the same way. Define vector notation to simplify:
X = (X1, ..., Xn), x = (x1, ..., xn)
Split the coordinates into subsets: X = (Y, Z) with Y = (X1, ..., Xk), Z = (X_{k+1}, ..., Xn), y = (y1, ..., yk), z = (z1, ..., z_{n−k})
Joint p.d.f. or joint p.f. of X: f(x) = f(y, z)
Marginal: f(y) = ∫ f(y, z)dz, f(z) = ∫ f(y, z)dy
Conditional: f(y|z) = f(y, z)/f(z) = f(y, z) / ∫ f(y, z)dy

Functions of Random Variables

Consider a random variable X and a function r: R → R. Y = r(X); we want to calculate the distribution of Y.
Discrete case, p.f.:
f(y) = P(Y = y) = P(r(X) = y) = P(x : r(x) = y) = ∑_{x: r(x)=y} P(X = x) = ∑_{x: r(x)=y} f(x)
(very similar to "change of variable")
Continuous case: find the c.d.f. of Y = r(X) first:
P(Y ≤ y) = P(r(X) ≤ y) = P(x : r(x) ≤ y) = P(A(y)) = ∫_{A(y)} f(x)dx
p.d.f.: f(y) = (∂/∂y) ∫_{A(y)} f(x)dx

** End of Lecture 11


18.05 Lecture 12 March 2, 2005

Functions of Random Variables

X - a continuous random variable with p.d.f. f(x); Y = r(X). Y doesn't have to be continuous, but if it is, find the p.d.f. To find the p.d.f., first find the c.d.f.:
P(Y ≤ y) = P(r(X) ≤ y) = P(x : r(x) ≤ y) = ∫_{x: r(x) ≤ y} f(x)dx
Then differentiate the c.d.f. to find the p.d.f.: f(y) = (∂/∂y) P(Y ≤ y)

Example: Take a random variable X, uniform on [−1, 1]. Y = X²; find the distribution of Y.
p.d.f.: f(x) = 1/2 for −1 ≤ x ≤ 1; 0 otherwise.
Y = X²: P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = ∫_{−√y}^{√y} f(x)dx
Take the derivative (no need to compute the integral first):
(∂/∂y) P(Y ≤ y) = f(√y) × 1/(2√y) + f(−√y) × 1/(2√y) = (1/(2√y))(f(√y) + f(−√y))
f(y) = 1/(2√y) for 0 ≤ y ≤ 1; 0 otherwise.

Suppose r is monotonic (a strictly one-to-one function). Then X = s(Y), where s = r⁻¹ is the inverse of r.
P(Y ≤ y) = P(r(X) ≤ y) = P(X ≤ s(y)) if r is increasing (1); = P(X ≥ s(y)) if r is decreasing (2).
(1) = F(s(y)), where F is the c.d.f. of X, so (∂/∂y)P(Y ≤ y) = (∂/∂y)F(s(y)) = f(s(y))s′(y)
(2) = 1 − P(X < s(y)) = 1 − F(s(y)), so (∂/∂y)P(Y ≤ y) = −f(s(y))s′(y)
If r is increasing, then s = r⁻¹ is increasing, so s′(y) ≥ 0 and s′(y) = |s′(y)|.
If r is decreasing, then s = r⁻¹ is decreasing, so s′(y) ≤ 0 and −s′(y) = |s′(y)|.

Answer: p.d.f. of Y: f(y) = f(s(y))|s′(y)|


Example: f(x) = 3(1 − x)² for 0 ≤ x ≤ 1; 0 otherwise. Y = 10e^{5X}.
X = s(Y) = (1/5)ln(Y/10); s′(y) = 1/(5y)
f(y) = 3(1 − (1/5)ln(y/10))² × |1/(5y)|, for 10 ≤ y ≤ 10e⁵; 0 otherwise.

X has a continuous c.d.f. F(x) = P(X ≤ x). Let Y = F(X), 0 ≤ Y ≤ 1; what is the distribution of Y?
c.d.f.: P(Y ≤ y) = P(F(X) ≤ y) = P(X ≤ F⁻¹(y)) = F(F⁻¹(y)) = y for 0 ≤ y ≤ 1
p.d.f.: f(y) = 1 for 0 ≤ y ≤ 1; 0 otherwise.
So Y is uniform on the interval [0, 1].
Conversely, let X be uniform on the interval [0, 1] and F be the c.d.f. of Y:


Y = F⁻¹(X): P(Y ≤ y) = P(F⁻¹(X) ≤ y) = P(X ≤ F(y)) = F(y). The random variable Y = F⁻¹(X) has c.d.f. F(y).
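This fact underlies inverse-c.d.f. sampling. A minimal editorial sketch (not part of the notes), assuming the exponential c.d.f. F(x) = 1 − e^{−λx}, so F⁻¹(u) = −ln(1 − u)/λ; λ and the sample size are arbitrary illustration values.

```python
# Sketch: generate exponential samples from uniforms via Y = F^{-1}(X).
import random
from math import log

lam = 2.0                      # rate parameter (illustration)
random.seed(0)

def sample_exponential():
    u = random.random()        # X ~ U[0, 1]
    return -log(1 - u) / lam   # F^{-1}(u) for F(x) = 1 - exp(-lam * x)

samples = [sample_exponential() for _ in range(100_000)]
print(round(sum(samples) / len(samples), 3))   # ≈ 1/lam = 0.5
```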

Suppose that (X, Y) has joint p.d.f. f(x, y). Let Z = X + Y.
P(Z ≤ z) = P(X + Y ≤ z) = ∫∫_{x+y ≤ z} f(x, y)dxdy = ∫_{−∞}^∞ ∫_{−∞}^{z−x} f(x, y)dydx
p.d.f.: f(z) = (∂/∂z)P(Z ≤ z) = ∫_{−∞}^∞ f(x, z − x)dx

If X, Y are independent with p.d.f.s f1(x) and f2(y), the joint p.d.f. is f(x, y) = f1(x)f2(y) and
f(z) = ∫_{−∞}^∞ f1(x)f2(z − x)dx

Example: X, Y independent, each with p.d.f. f(x) = λe^{−λx} for x ≥ 0; 0 otherwise. Z = X + Y:
f(z) = ∫_0^z λe^{−λx} λe^{−λ(z−x)} dx
Limits determined by: 0 ≤ x and z − x ≥ 0, i.e. 0 ≤ x ≤ z.
f(z) = λ² ∫_0^z e^{−λz} dx = λ² e^{−λz} ∫_0^z dx = λ² z e^{−λz}

The exponential distribution describes the lifespan of a high-quality product: it works "like new" after any point, given that it hasn't broken yet (the memoryless property shown below). Distribution of X itself:


P(X ≥ x) = ∫_x^∞ λe^{−λt} dt = −e^{−λt} |_x^∞ = e^{−λx}
Conditional probability:
P(X ≥ x + t | X ≥ x) = P(X ≥ x + t, X ≥ x)/P(X ≥ x) = P(X ≥ x + t)/P(X ≥ x) = e^{−λ(x+t)}/e^{−λx} = e^{−λt} = P(X ≥ t)

** End of Lecture 12


18.05 Lecture 13 March 4, 2005
Functions of random variables. If (X, Y) has joint p.d.f. f(x, y), consider Z = X + Y. p.d.f. of Z: f(z) = ∫_{−∞}^∞ f(x, z − x)dx
If X and Y are independent: f(z) = ∫_{−∞}^∞ f1(x)f2(z − x)dx

Example: X, Y independent, uniform on [0, 1]: X, Y ∼ U[0, 1], Z = X + Y.
p.d.f. of X, Y: f1(x) = I(0 ≤ x ≤ 1), f2(y) = I(0 ≤ y ≤ 1), so f2(z − x) = I(0 ≤ z − x ≤ 1)
f(z) = ∫_{−∞}^∞ I(0 ≤ x ≤ 1) × I(0 ≤ z − x ≤ 1)dx
Limits: 0 ≤ x ≤ 1 and z − 1 ≤ x ≤ z. Both must hold; consider all the cases for values of z:
Case 1: (z ≤ 0) ⇒ f(z) = 0
Case 2: (0 ≤ z ≤ 1) ⇒ f(z) = ∫_0^z 1dx = z
Case 3: (1 ≤ z ≤ 2) ⇒ f(z) = ∫_{z−1}^1 1dx = 2 − z
Case 4: (z ≥ 2) ⇒ f(z) = 0
The random variables are likely to add up near 1, the peak of the f(z) graph.
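A quick empirical check of this triangular density (an editorial sketch, not part of the notes); the sample size and evaluation points are arbitrary choices.

```python
# Sketch: empirical density of X + Y for independent X, Y ~ U[0,1] vs. the triangle.
import random
random.seed(0)

def triangle_density(z):
    if 0 <= z <= 1:
        return z
    if 1 <= z <= 2:
        return 2 - z
    return 0.0

n = 200_000
zs = [random.random() + random.random() for _ in range(n)]
width = 0.1
for center in (0.25, 1.0, 1.75):
    frac = sum(1 for z in zs if abs(z - center) < width / 2) / n
    print(center, round(frac / width, 3), triangle_density(center))
```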

Example: Multiplication of Random Variables. X ≥ 0, Y ≥ 0, Z = XY (so Z is positive). First look at the c.d.f.:
P(Z ≤ z) = P(XY ≤ z) = ∫∫_{xy ≤ z} f(x, y)dxdy = ∫_0^∞ ∫_0^{z/x} f(x, y)dydx
p.d.f. of Z:
f(z) = (∂/∂z)P(Z ≤ z) = ∫_0^∞ f(x, z/x) (1/x) dx

Example: Ratio of Random Variables. Z = X/Y (all positive):
P(Z ≤ z) = P(X ≤ zY) = ∫∫_{x ≤ zy} f(x, y)dxdy = ∫_0^∞ ∫_0^{zy} f(x, y)dxdy
p.d.f.: f(z) = ∫_0^∞ f(zy, y) y dy
In general, look at the c.d.f. and express it in terms of x and y.

Example: X1, X2, ..., Xn - independent with the same distribution (same c.d.f. F); f(x) = F′(x) is the p.d.f. of each Xi, and P(Xi ≤ x) = F(x).
Y = maximum of X1, X2, ..., Xn:
P(Y ≤ y) = P(max(X1, ..., Xn) ≤ y) = P(X1 ≤ y, X2 ≤ y, ..., Xn ≤ y)
Now use the definition of independence to factor:
= P(X1 ≤ y)P(X2 ≤ y)...P(Xn ≤ y) = F(y)ⁿ
p.d.f. of Y: f̂(y) = (∂/∂y)F(y)ⁿ = nF(y)ⁿ⁻¹F′(y) = nF(y)ⁿ⁻¹f(y)
Y = min(X1, ..., Xn): P(Y ≤ y) = P(min(X1, ..., Xn) ≤ y). Instead of an intersection, use the complement (ask whether all are greater than y):
= 1 − P(min(X1, ..., Xn) > y) = 1 − P(X1 > y, ..., Xn > y) = 1 − P(X1 > y)P(X2 > y)...P(Xn > y) = 1 − P(X1 > y)ⁿ = 1 − (1 − P(X1 ≤ y))ⁿ = 1 − (1 − F(y))ⁿ
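An editorial simulation sketch (not part of the notes) checking both formulas for uniform Xi, where F(y) = y; n, the number of trials, and y are arbitrary illustration values.

```python
# Sketch: check P(max <= y) = F(y)^n and P(min <= y) = 1 - (1 - F(y))^n for uniforms.
import random
random.seed(0)

n, trials, y = 5, 100_000, 0.7          # Xi ~ U[0, 1], so F(y) = y
count_max = count_min = 0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    count_max += max(xs) <= y
    count_min += min(xs) <= y

print(round(count_max / trials, 4), round(y**n, 4))             # ≈ 0.1681
print(round(count_min / trials, 4), round(1 - (1 - y)**n, 4))   # ≈ 0.9976
```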

Now consider a vector X = (X1, X2, ..., Xn) and Y = (Y1, Y2, ..., Yn) = r(X):
Y1 = r1(X1, ..., Xn)
Y2 = r2(X1, ..., Xn)
...
Yn = rn(X1, ..., Xn)
Suppose that the map r has an inverse, X = r⁻¹(Y) = s(Y).
P(Y ∈ A) = ∫_A g(y)dy, where g(y) is the joint p.d.f. of Y, and
P(Y ∈ A) = P(r(X) ∈ A) = P(X ∈ s(A)) = ∫_{s(A)} f(x)dx = ∫_A f(s(y))|J| dy
Note: change of variable x = s(y). Note: J is the Jacobian:
J = det [ ∂s1/∂y1 ... ∂s1/∂yn ; ... ; ∂sn/∂y1 ... ∂sn/∂yn ]
The p.d.f. of Y: g(y) = f(s(y))|J|

Example: (X1, X2) with joint p.d.f. f(x1, x2) = 4x1x2 for 0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1; 0 otherwise.


Y1 = X1/X2, Y2 = X1X2
Y1 = r1(X1, X2), Y2 = r2(X1, X2); inverse: X1 = √(Y1Y2) = s1(Y1, Y2), X2 = √(Y2/Y1) = s2(Y1, Y2)
J = det [ ∂s1/∂y1  ∂s1/∂y2 ; ∂s2/∂y1  ∂s2/∂y2 ] = det [ (1/2)√(y2/y1)  (1/2)√(y1/y2) ; −(1/2)√(y2)/y1^{3/2}  (1/2)/√(y1y2) ] = 1/(4y1) + 1/(4y1) = 1/(2y1)
Joint p.d.f. of (Y1, Y2):
g(y1, y2) = 4√(y1y2) √(y2/y1) |J| = 2y2/y1, if 0 ≤ √(y1y2) ≤ 1 and 0 ≤ √(y2/y1) ≤ 1; 0 otherwise.
But keep in mind the intervals for non-zero values: the conditions imply that y1, y2 are positive, so the absolute value is unnecessary.

** Last Lecture of Coverage on Exam 1 ** ** End of Lecture 13


18.05 Lecture 14 March 7, 2005

Linear transformations of random vectors: Y = r(X) = AX, where A is an n × n matrix and Y = (y1, ..., yn).
If det A ≠ 0, then X = A⁻¹Y; writing B = A⁻¹ = (bij), we have xi = bi1 y1 + ... + bin yn.
J = Jacobian = det B, where the bij are the partial derivatives of the si with respect to the yj; det B = det A⁻¹ = 1/det A
p.d.f. of Y: g(y) = (1/|det A|) f(A⁻¹y)

Example: X = (X1, X2) with p.d.f. f(x1, x2) = c x1x2 for 0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1; 0 otherwise. To make the integral equal 1, c = 4.
Y1 = X1 + 2X2, Y2 = 2X1 + X2; A = [1 2; 2 1] ⇒ det(A) = −3
Calculate the inverse functions:
X1 = −(1/3)(Y1 − 2Y2), X2 = −(1/3)(Y2 − 2Y1)
New joint density:
g(y1, y2) = (1/3) × 4(−(1/3)(y1 − 2y2))(−(1/3)(y2 − 2y1)) for 0 ≤ −(1/3)(y1 − 2y2) ≤ 1 and 0 ≤ −(1/3)(y2 − 2y1) ≤ 1; 0 otherwise.
Simplified: g(y1, y2) = (4/27)(y1 − 2y2)(y2 − 2y1) for −3 ≤ y1 − 2y2 ≤ 0 and −3 ≤ y2 − 2y1 ≤ 0; 0 otherwise.


Linear transformation distorts the graph from a square to a parallelogram.

Note: From Lecture 13, where the min() and max() functions were introduced: such functions describe engines in series (min) and in parallel (max). When in series, the length of time a device will function equals the minimum lifetime among all the engines (weakest link). When in parallel, this is avoided, as the device functions as long as at least one engine functions.

Review of Problems from PSet 4 for the upcoming exam: (see solutions for more details)

Problem 1 - f(x) = ce^{−2x} for x ≥ 0; 0 otherwise. Find c by integrating over the range and setting the integral equal to 1:
1 = ∫_0^∞ ce^{−2x}dx = −(c/2)e^{−2x} |_0^∞ = c/2 ⇒ c = 2
P(1 ≤ X ≤ 2) = ∫_1^2 2e^{−2x}dx = e^{−2} − e^{−4}

Problem 3 - X ∼ U[0, 5]; Y = 0 if X ≤ 1; Y = X if 1 ≤ X ≤ 3; Y = 5 if 3 < X ≤ 5. Draw the c.d.f. of Y, showing P(Y ≤ y). (The plot of Y vs. X is not the c.d.f.) Write each case in terms of X, i.e. P(X ≤ ?):
Cases:
y < 0 ⇒ P(Y ≤ y) = P(∅) = 0
0 ≤ y ≤ 1 ⇒ P(Y ≤ y) = P(0 ≤ X ≤ 1) = 1/5
1 < y ≤ 3 ⇒ P(Y ≤ y) = P(X ≤ y) = y/5
3 < y < 5 ⇒ P(Y ≤ y) = P(X ≤ 3) = 3/5
y ≥ 5 ⇒ P(Y ≤ y) = 1
These values of y from −∞ to ∞ give the cumulative distribution function of Y.

Problem 8 - 0 ≤ x ≤ 3, 0 ≤ y ≤ 4, c.d.f. F(x, y) = (1/156) xy(x² + y)
P(1 ≤ X ≤ 2, 1 ≤ Y ≤ 2) = F(2, 2) − F(2, 1) − F(1, 2) + F(1, 1) (rectangle probability rule)
Or, you can find the p.d.f. and integrate (more complicated).
c.d.f. of Y: P(Y ≤ y) = P(X ≤ ∞, Y ≤ y) = P(X ≤ 3, Y ≤ y) (based on the domain of the joint c.d.f.)
P(Y ≤ y) = (1/156) · 3y(9 + y) for 0 ≤ y ≤ 4. Must also mention: P(Y ≤ y) = 0 for y ≤ 0 and P(Y ≤ y) = 1 for y ≥ 4.
Find the joint p.d.f. of X and Y:
f(x, y) = ∂²F(x, y)/∂x∂y = (1/156)(3x² + 2y) for 0 ≤ x ≤ 3, 0 ≤ y ≤ 4; 0 otherwise
P(Y ≤ X) = ∫∫_{y ≤ x} f(x, y)dxdy = ∫_0^3 ∫_0^x (1/156)(3x² + 2y)dydx = 93/208

** End of Lecture 14


18.05 Lecture 15 March 9, 2005

Review for Exam 1

Practice Test 1:

1. In the set of all green envelopes, only 1 card can be green. Similarly, in the set of red envelopes, only 1 card can be red. Sample space = 10! ways to put cards into envelopes, treating each as distinct. You can't have two of the same color matching, as that would be 4 total. Degrees of freedom = which envelope to choose (5 × 5) and which card to select (5 × 5); then arrange the remaining red cards in green envelopes (4!) and the remaining green cards in red envelopes (4!):
P = 5⁴(4!)² / 10!

2. Bayes formula:
P(fair | HHH) = P(HHH | fair)P(fair) / [P(HHH | fair)P(fair) + P(HHH | unfair)P(unfair)] = (0.5³ × 0.5) / (0.5³ × 0.5 + 1 × 0.5)

3. f1(x) = 2x I(0 < x < 1), f2(x) = 3x² I(0 < x < 1); Y ∈ {1, 2} with P(Y = 1) = 0.5, P(Y = 2) = 0.5.
f(x, y) = 0.5 × I(y = 1) × 2x I(0 < x < 1) + 0.5 × I(y = 2) × 3x² I(0 < x < 1)
f(x) = 0.5 × 2x I(0 < x < 1) + 0.5 × 3x² I(0 < x < 1) = (x + 1.5x²) I(0 < x < 1)
P(Y = 1 | X = 1/4) = [f1(1/4) × 1/2] / [f1(1/4) × 1/2 + f2(1/4) × 1/2] = (2 × 1/4 × 1/2) / (2 × 1/4 × 1/2 + 3 × 1/16 × 1/2)

4. f(z) = 2e^{−2z} I(z > 0), T = 1/Z (so t > 0).
P(T ≤ t) = P(1/Z ≤ t) = P(Z ≥ 1/t) = ∫_{1/t}^∞ 2e^{−2z}dz = e^{−2/t}
p.d.f.: f(t) = (∂/∂t)P(T ≤ t) = e^{−2/t} × (2/t²), t > 0 (0 otherwise)
Alternatively: T = r(Z), Z = s(T) = 1/T ⇒ g(t) = |s′(t)| f(1/t) by change of variable.

5. f(x) = e^{−x} I(x > 0). Joint p.d.f. f(x, y) = e^{−x} I(x > 0) e^{−y} I(y > 0) = e^{−(x+y)} I(x > 0, y > 0)
U = X/(X + Y), V = X + Y
Step 1 - Check the ranges of the new random variables: 0 < V < ∞, 0 < U < 1.
Step 2 - Account for the change of variables: X = UV, Y = V − UV = V(1 − U). Jacobian:
J = det [ ∂X/∂U  ∂X/∂V ; ∂Y/∂U  ∂Y/∂V ] = det [ V  U ; −V  1 − U ] = V(1 − U) + UV = V
g(u, v) = f(uv, v(1 − u)) × |v| I(uv > 0, v(1 − u) > 0) = e^{−v} v I(v > 0, 0 < u < 1)

Problem Set #5 (practice pset, see solutions for details):

p. 175 #4: f(x1, x2) = (x1 + x2) I(0 < x1 < 1, 0 < x2 < 1), Y = X1X2 (0 < Y < 1). First look at the c.d.f.:
P(Y ≤ y) = P(X1X2 ≤ y) = ∫∫_{x1x2 ≤ y} f(x1, x2)dx1dx2
Due to the complexity of the limits, you can integrate the area in pieces, or you can find the complement, which is easier with only 1 set of limits ({x1x2 > y} = {x2 > y/x1, y < x1 < 1}):
P(Y ≤ y) = 1 − ∫∫_{x1x2 > y} f(x1, x2)dx1dx2 = 1 − ∫_y^1 ∫_{y/x1}^1 (x1 + x2)dx2 dx1 = 1 − (1 − y)² = 2y − y²
P(Y ≤ y) = 0 for y < 0; 2y − y² for 0 < y < 1; 1 for y > 1.
p.d.f.: g(y) = (∂/∂y)P(Y ≤ y) = 2(1 − y) for y ∈ (0, 1); 0 otherwise.

p. 164 #3: f(x) = x/2 for 0 ≤ x ≤ 2; 0 otherwise. Y = X(2 − X); find the p.d.f. of Y. First find the range of Y, and notice that the map is not one-to-one. Y varies from 0 to 1 as X varies from 0 to 2. Look at the c.d.f.:
P(Y ≤ y) = P(X(2 − X) ≤ y) = P(X² − 2X + 1 ≥ 1 − y) = P((1 − X)² ≥ 1 − y)
= P(|1 − X| ≥ √(1 − y)) = P(1 − X ≥ √(1 − y) or 1 − X ≤ −√(1 − y))
= P(X ≤ 1 − √(1 − y) or X ≥ 1 + √(1 − y))
= P(0 ≤ X ≤ 1 − √(1 − y)) + P(1 + √(1 − y) ≤ X ≤ 2)
= ∫_0^{1−√(1−y)} (x/2)dx + ∫_{1+√(1−y)}^2 (x/2)dx = 1 − √(1 − y), for 0 ≤ y ≤ 1; 0 for y < 0; 1 for y > 1.
Take the derivative to get the p.d.f.:
g(y) = 1/(2√(1 − y)) for 0 ≤ y ≤ 1; 0 otherwise.

** End of Lecture 15


18.05 Lecture 16 March 14, 2005

Expectation of a random variable. X - random variable. Roll a die - average value = 3.5. Flip a coin - average value = 0.5 if heads = 0 and tails = 1.

Definition: If X is discrete with p.f. f(x), then the expectation of X is EX = ∑_x x f(x)

For a die:
x       1    2    3    4    5    6
f(x)    1/6  1/6  1/6  1/6  1/6  1/6
E(X) = 1 × 1/6 + ... + 6 × 1/6 = 3.5

Another way to think about it:

Consider each pi as a weight on a horizontal bar. Expectation = center of gravity on the bar.

If X is continuous with p.d.f. f(x), then E(X) = ∫ x f(x)dx
Example: X uniform on [0, 1]: E(X) = ∫_0^1 (x × 1)dx = 1/2

Consider Y = r(X); then EY = ∑_x r(x)f(x) or ∫ r(x)f(x)dx.
p.f. of Y: g(y) = ∑_{x: r(x)=y} f(x)
E(Y) = ∑_y y g(y) = ∑_y y ∑_{x: r(x)=y} f(x) = ∑_y ∑_{x: r(x)=y} y f(x) = ∑_y ∑_{x: r(x)=y} r(x)f(x)
then, since there is no longer any reference to y, we can drop the outer sum over y: E(Y) = ∑_x r(x)f(x)

Example: X uniform on [0, 1]: EX² = ∫_0^1 x² × 1 dx = 1/3

X1, ..., Xn - random variables with joint p.f. or p.d.f. f(x1, ..., xn): E(r(X1, ..., Xn)) = ∫ r(x1, ..., xn)f(x1, ..., xn)dx1...dxn

Example: Cauchy distribution, p.d.f.:
f(x) = 1/(π(1 + x²))
Check that it is a valid density:
∫_{−∞}^∞ 1/(π(1 + x²)) dx = (1/π) tan⁻¹(x) |_{−∞}^∞ = 1

But, the expectation is undefined:


E|X| = ∫_{−∞}^∞ |x| / (π(1 + x²)) dx = 2∫_0^∞ x / (π(1 + x²)) dx = (1/π) ln(1 + x²) |_0^∞ = ∞
Note: the expectation of X is defined only if E|X| < ∞.

Properties of Expectation:

1) E(aX + b) = aE(X) + b. Proof: E(aX + b) = ∫(ax + b)f(x)dx = a∫xf(x)dx + b∫f(x)dx = aE(X) + b

2) E(X1 + X2 + ... + Xn) = EX1 + EX2 + ... + EXn

Proof: E(X1 + X2) = � �

(x1 + x2)f (x1, x2)dx1 dx2 = =

� �x1f (x1, x2)dx1 dx2 +

� �x2f (x1, x2)dx1dx2 =

= �x1

�f (x1, x2)dx2dx1 +

�x2

�f (x1, x2)dx1dx2 =

= �x1f1(x1)dx1 +

�x2f2(x2)dx2 = EX1 + EX2

Example: Toss a coin n times, “T” on i: Xi = 1; “H” on i: Xi = 0. Number of tails = X1 + X2 + ... + Xn

E(number of tails) = E(X1 + X2 + ... + Xn) = EX1 + EX2 + ... + EXn

EXi = 1 × P(Xi = 1) + 0 × P(Xi = 0) = p, probability of tails Expectation = p + p + ... + p = np This is natural, because you expect np of n for p probability.

nY = Number of tails, P(Y = k) = �

k

�pk(1 − p)n−k

n nE(Y ) =

�k� �

pk (1 − p)n−k = npk=0 k More difficult to see though definition, better to use sum of expectations method.

Two functions, h and g, such that h(x) ← g(x), for all x ⊂ R Then, E(h(X)) ← E(g(X)) ↔ E(g(X) − h(X)) ∼ 0�(g(x) − h(x)) × f (x)dx ∼ 0

You know that f (x) ∼ 0, therefore g(x) − h(x) must also be ∼ 0

If a X ← b ↔ a ← E(X) ← E(b) ← b←E(I(X ⊂ A)) = 1 × P(X ⊂ A) + 0 × P(X /⊂ A), for A being a set on R Y = I(X ⊂ A) = {1, with probability P(X ⊂ A); 0, with probability P(X /⊂ A) = 1 − P(X ⊂ A) E(I(X ⊂ A)) = P(X ⊂ A)}In this case, think of the expectation as an indicator as to whether the event happens.

Chebyshev’s Inequality Suppose that X ∼ 0, consider t > 0, then:

1 t E(X)P(X ∼ t) ←

Proof: E(X) = E(X)I(X < t) + E(X)I(X ∼ t) ∼ E(X)I(X ∼ t) ∼ E(t)I(X ∼ t) = tP(X ∼ t)

** End of Lecture 16

49

Page 50: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 17 March 16, 2005

Properties of Expectation. Law of Large Numbers. E(X1 + ... + Xn) = EX1 + ... + EXn

Matching Problem (n envelopes, n letters)Expected number of letters in correct envelopes?Y - number of matchesXi = {1, letter i matches; 0, otherwise}, Y = X1 + ... + Xn

E(Y ) = EX1 + ... + EXn, but EXi = 1 × P(Xi = 1) + 0 × P(Xi = 0) = P(Xi = 1) = 1n

Therefore, expected match = 1:

1 E(Y ) = n ×

n = 1

If X1, ..., Xn are independent, then E(X1 × ... × Xn) = EX1 × ... × EXn

As with the sum property, we will prove for two variables: EX1X2 = EX1 × EX2

joint p.f. or p.d.f.: f (x1, x2) = f1(x1)f2(x2)EX1X2 =

� �x1x2f (x1, x2)dx1dx2 =

� �x1x2f1(x1)f2(x2)dx1dx2 =

= �f1(x1)x1dx1

�f2(x2)x2dx2 = EX1 × EX2

X1, X2, X3 - independent, uniform on [0, 1]. Find EX12(X2 − X3)

2 .= EX1

2E(X2 − X3)

2 by independence.= EX1

2E(X2

2 − 2X2X3 + X32) = EX1

2(EX22 + EX3

2 − 2EX2X3)By independence of X2, X3; = EX1

2(EX22 + EX3

2 − 2EX2EX3)2

EX1 = � 1 x × 1dx = 1/2, EX2 =

� 1 x × 1dx = 1/3 (same for X2 and X3)0 1 0

1 1 2 )(

1EX1

2(X2 − X3)2 = 3 (

1 + 3 − 2( 1 2 )) = 1 3 18

For discrete random variables, X takes values 0, 1, 2, 3, ... E(X) =

�n=0 nP(x = n)∗

for n = 0, contribution = 0; for n = 1, P(1); for n = 2, 2P(2); for n = 3, 3P(3); ... E(X) =

�∗P(X ∼ n)n=1

Example: X - number of trials until success.P(success) = pP(f ailure) = 1 − p = q

∗� ∗� 2(1 − p)n−1 = 1 + q + q + ... =

1=E(X) = P(X ∼ n) = p1 − q

n=1 n=1

∗∗

Formula based upon reasoning that the first n - 1 times resulted in failure.Much easier than the original formula:�

n=0 nP(X = n) = �

n=1 n(1 − p)n−1p

Variance:

Definition: Var(X) = E(X − E(X))2 = θ2(X)Measure of the deviation from the expectation (mean).Var(X) =

�(X − E(X))2f (x)dx - moment of inertia.

50

1

Page 51: Lecture Notes(Introduction to Probability and Statistics)

≈ �

(X − center of gravity )2 × mx

Standard Deviation:

θ(X) =

Var(X) Var(aX + b) = a2Var(X) θ(aX + b) = a θ(X)| |

Proof by definition:E((aX + b) − E(aX + b))2 = E(aX + b − aE(X) − b)2 = a2

E(X − E(X))2 = a2Var(X)

Property: Var(X) = EX 2 − (E(X))2

Proof:Var(X) = E(X − E(X))2 = E(X2 − 2XE(X) + (E(X))2) =EX2 − 2E(X) × E(X) + (E(X))2 = E(X)2 − (E(X))2

Example: X ≈ U [0, 1]

� 1 � 1

EX = x × 1dx =1 , EX2 = x 2 × 1dx =

1 30 2 0

1 1)2 1

Var(X) = 3 − ( =

2 12

If X1, ..., Xn are independent, then Var(X1 + ... + Xn) = Var(X1) + ... + Var(Xn) Proof:

Var(X1 + X2) = E(X1 + X2 − E(X1 + X2))2 = E((X1 − EX1) + (X2 − EX2))

2 =

= E(X1 − EX1)2 + E(X2 − EX2)

2 + 2E(X1 − EX1)(X2 − EX2) =

= Var(X1) + Var(X2) + 2E(X1 − EX1) × E(X2 − EX2)

By independence of X1 and X2:

= Var(X1) + Var(X2)

2 2Property: Var(a1X1 + ... + anXn + b) = a1Var(X1) + ... + anVar(Xn)

Example: Binomial distribution - B(n, p), P(X = k) = �n�pk(1 − p)n−k

k X = X1 + ... + Xn, Xi = { 1, Trial i is success ; 0, Trial i is failure.}Var(X) =

�ni=1 Var(Xi)

Var(Xi) = EX2 − (EXi)2 , EXi = 1(p) + 0(1 − p) = p; EX 2 = 12(p) + 02(1 − p) = p.i i

Var(Xi) = p − p2 = p(1 − p) Var(X) = np(1 − p) = npq

Law of Large Numbers: X1, X2, ..., Xn - independent, identically distributed.

X1 + ... + XnSn = n−−−−↔

EX1 n

↔ →

Take φ > 0 - but small, P( Sn − EX1 > φ) ↔ 0 as n| | ↔ → By Chebyshev’s Inequality:

51

Page 52: Lecture Notes(Introduction to Probability and Statistics)

P((Sn − EX1)2 > φ2) = P(Y > M ) ←

1 EY =

M

1 X1 + ... + Xn 1 E(Sn − EX1)

2 =1

E( − EX1)2 =

φ2 Var(

1(X1 + ... + Xn)) =

φ2 φ2 n n

1 nVar(X1) Var(X1) = φ2n2

(Var(X1) + ... + Var(Xn)) = = 0 φ2n2 nφ2

for large n.

** End of Lecture 17

52

Page 53: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 18March 18, 2005

Law of Large Numbers.

X1, ..., Xn - i.i.d. (independent, identically distributed)

X1 + ... + Xn x = as n , EX1

n ↔ ↔→

Can be used for functions of random variables as well:Consider Yi = r(X1) - i.i.d.

r(X1) + ... + r(Xn)Y = as n , EY1 = Er(X1)

n ↔ ↔→

Relevance for Statistics: Data points xi, as n↔→, The average converges to the unknown expected value of the distribution which often contains a lot (or

all) of information about the distribution.

Example: Conduct a poll for 2 candidates:p ⊂ [0, 1] is what we’re looking forPoll: choose n people randomly: X1, ..., Xn

P(Xi = 1) = pP(Xi = 0) = 1 − p

X1 + ... + Xn EX1 = 1(p) + 0(1 − p) = p ♥ as n

n ↔→

Other characteristics of distribution:Moments of the distribution: for each integer, k ∼ 1, kth moment EX k

kth moment is defined only if E|X k <| →Moment generating function: consider a parameter y ⊂ R.and define δ(t) = EetX where X is a random variable.δ(t) - m.g.f. of X

∗δk (0) kTaylor series of δ(t) =

� t

k! k=0

∗(tX)k ∗

tk

Taylor series of Ee tX = E �

= �

EXk

k! k! k=0 k=0

EXk = δk (0)

Example: Exponential distribution E(∂) with p.d.f. f (x) = {∂e−∂x, x ∼ 0; 0, x < 0}Compute the moments:EXk =

� ∗ xk ∂e−∂xdx is a difficult integral.0

Use the m.g.f.: � ∗ � ∗

δ(t) = Ee tX = e tx∂e−∂xdx = ∂e(t−∂)xdx 0 0

(defined if t < → to keep the integral finite)

53

Page 54: Lecture Notes(Introduction to Probability and Statistics)

∗�∂e(t−∂)x

= t− ∂

|� tk∂ ∂ 1 t

)k EXk∗

0 = 1 − (= = = = t− ∂ ∂− t 1 − t/∂ ∂ k!

k=0

Recall the formula for geometric series:

∗� 1k x = when k < 1 1 − x

k=0

1 Exk k! = Ex k =

∂k k! ↔

∂k

The moment generating function completely describes the distribution.Exk =

� xk f(x)dx

If f(x) unknown, get a system of equations for f ↔ unique distribution for a set of moments.M.g.f. uniquely determines the distribution.

X1, X2 from E(∂), Y = X1 + X2.To find distribution of sum, we could use the convolution formula,but, it is easier to find the m.g.f. of sum Y :

tX1 tX2 = Ee tX1 Ee tX2Ee tY = Ee t(X1 +X2 ) = Ee e

Moment generating function of each:

∂ ∂− t

For the sum:

∂ ( )2

∂− t

Consider the exponential distribution:

1 E(∂) ≈ X1,EX = , f(x) = {∂e−∂x , x ∼ 0; 0, x < 0}

This distribution describes the life span of quality products. ∂ =

E

1 X , if ∂ small, life span is large.

Median: m ⊂ R such that:

1 1 P(X ∼ m) ∼ ,P(X

2 ← m) ∼

2

(There are times in discrete distributions when the probability cannot ever equal exactly 0.5) When you exclude the point itself: P(X > m) ← 1

2 P(X m) + P(X > m) = 1 ←The median is not always uniquely defined. Can be an interval where no point masses occur.

54

Page 55: Lecture Notes(Introduction to Probability and Statistics)

1For a continuous distribution, you can define P > or < m as equal to 2 . But, there are still cases in which the median is not unique!

For a continuous distribution:

1 P(X ← m) = P(X ∼ m) =

2

The average measures center of gravity, and is skewed easily by outliers.

The average will be pulled towards the tail of a p.d.f. relative to the median.

Mean: find a ⊂ R such that E(X − a)2 is minimized over a.

E(X − a)2 = −E2(X − a) = 0, EX − a = 0 a = EX �a

expectation - squared deviation is minimized.

Median: find a ⊂ R such that E X − a is minimized. | |E X − a , where m - median∼ E X − m| | | |E( X − a X − m ) ∼ 0� (|x − a

| − | |x − m )f(x)dx| | − | |

55

Page 56: Lecture Notes(Introduction to Probability and Statistics)

Need to look at each part: 1)a − x − (m − x) = a − m, x m←2)x − a − (x − m) = m − a, x ∼ m 3)a − x − (x + m) = a + m − 2x, m x a← ←

The integral can now be simplified:

m� � � ∗

( x − a x − m )f (x)dx ∼ (a − m)f (x)dx + (m − a)f (x)dx =| | − | |−∗ m

m� � ∗

= (a − m)( f (x)dx − f (x)dx) = (a − m)(P(X ← m) − P(X > m)) ∼ 0 m−∗

As both (a − m) and the difference in probabilities are positive. The absolute deviation is minimized by the median.

** End of Lecture 18

56

Page 57: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 19 March 28, 2005

Covariance and Correlation Consider 2 random variables X, Y

2= Var(X), θy 2θx = Var(Y )

Definition 1:Covariance of X and Y is defined as:

Cov(X,Y ) = E(X − EX)(Y − EY )

Positive when both high or low in deviation.Definition 2:Correlation of X and Y is defined as:

Cov(X,Y ) Cov(X,Y )π(X,Y ) = =

θxθy

Var(X)Var(Y )

The scaling is thus removed from the covariance.

Cov(X,Y ) = E(XY − XEY − Y EX + EXEY ) = = E(XY ) − EXEY − EY EX + EXEY = E(XY ) − EXEY

Cov(X,Y ) = E(XY ) − EXEY

Property 1:If the variables are independent, Cov(X,Y ) = 0 (not correlated)Cov(X,Y ) = E(XY ) − EXEY = EXEY − EXEY = 0

1 1Example: X takes values {−1, 0, 1} with equal probabilities { 3 , 3 , 1 3 }

Y = X2

X and Y are dependent, but they are uncorrelated. Cov(X,Y ) = EX3 − EXEX2

but, EX = 0, and EX3 = EX = 0 Covariance is 0, but they are still dependent. Also - Correlation is always between -1 and 1.

Cauchy-Schwartz Inequality: (EXY )2

EX2EY 2 ←

Also known as the dot-product inequality: v ,− | v ||− |u ) ↔ ↔|(−↔ ↔ | ←

− u To prove for expectations:

2δ(t) = E(tX + Y )2 = t EX2 + 2tEXY + EY 2 ∼ 0

Quadratic f(t), parabola always non-negative if no roots:D = (EXY )2 − EX2

EY 2 ← 0) (discriminant)Equality is possible if δ(t) = 0 for some point t.δ(t) = E(tX + Y )2 = 0, if tX + Y = 0, Y = -tX, linear dependence.(Cov(X,Y ))2 = (E(X − EX)(Y − EY ))2 ← E(X − EX)2E(Y − EY )2

Cov(X,Y ) θxθy ,| | ←

57

2θy 2θ= x

Page 58: Lecture Notes(Introduction to Probability and Statistics)

Cov(X, Y )|π(X, Y ) = |

1| θxθy

| ←

So, the correlation is between -1 and 1.

Property 2:

← π(X, Y ) ← 1−1

When is the correlation equal to 1, -1? π(X, Y ) = 1 only when Y − EY = c(X − EX),| |or Y = aX + b for some constants a, b.(Occurs when your data points are in a straight line.)If Y = aX + b :

E(aX2 + bX) − EXE(aX + b) aVar(X) a π(X, Y ) = = = = sign(a)

Var(X) × a2Var(X) a Var(X) a| | | | If a is positive, then the correlation = 1, X and Y are completely positively correlated. If a is negative, then correlation = -1, X and Y are completely negatively correlated.

Looking at the distribution of points on Y = X2, there is NO linear dependence, correlation = 0. However, if Y = X2 + cX , then there is some linear dependence introduced in the skewed graph.

Property 3:

Var(X + Y ) = E(X + Y − EX − EY )2 = E((X − EX) + (Y − EY ))2 =

E(X − EX)2 − 2E(X − EX)(E(Y − EY ) + E(Y − EY )2 = Var(X) + Var(Y ) − 2Cov(X, Y )

Conditional Expectation:(X, Y) - random pair.What is the average value of Y given that you know X?f(x, y) - joint p.d.f. or p.f. then f(y x) - conditional p.d.f. or p.f.|Conditional expectation:

E(Y X = x) = � yf(y|x)dy or

� yf(y x)| |

E(Y |X) = h(X) = � yf(y X)dy - function of X, still a random variable. |

Property 4:

E(E(Y X)) = EY|

58

Page 59: Lecture Notes(Introduction to Probability and Statistics)

Proof:E(E(Y X)) = E(h(X)) =

�f (x)f (x)dx =

= �

(� |yf (y|x)dy)f (x)dx =

� �yf (y x)f (x)dydx =

� �yf (x, y)dydx =

= �y(

�f (x, y)dx)dy =

�yf (y)dy =

|EY

Property 5:

E(a(X)Y X) = a(X)E(Y X)| |See text for proof.

Summary of Common Distributions:

1) Bernoulli Distribution: B(p), p ⊂ [0, 1] - parameter Possible values of the random variable: X = {0, 1}; f (x) = px(1 − p)1−x

P(1) = p, P(0) = 1 − p E(X) = p, Var(X) = p(1 − p)

2) Binomial Distribution: B(n, p), n repetitions of Bernoulli n X − {0, 1, ..., n}; f (x) =

� �px(1 − p)1−x

x E(X) = np, Var(X) = np(1 − p)

3) Exponential Distribution: E(∂), parameter ∂ > 0 X = [0, →), p.d.f. f (x) = {∂e−∂x, x ∼ 0; 0, otherwise }

1 EX = , EXk =

k! ∂ ∂k

2 1 1 Var(X) = =

∂2 − ∂2 ∂2

** End of Lecture 19

59

Page 60: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 20 March 30, 2005

§5.4 Poisson Distribution�(�), parameter � > 0, random variable takes values: {0, 1, 2, ...}p.f.:

�x �x

f (x) = P(X = x) = e−�; e−� �

= e−� × e � = 1 x! x!

x 0→

Moment generating function:

tXΠ(t) = Ee−tX = �

e × �

x

x

! e−� =

� (et�)x

e−� = e−� � (et�)x

= e−� e e t � = e �(e t −1)

x! x! x 0 x 0 x 0→ → →

EXk = Πk (0) EX = Π∅(0) = e�(e t −1) × �et

t=0 = � t t

EX2 = Π∅∅(0) = (�e�(e −1)+t)

|∅|t=0 = �e�(e −1)+t(�et + 1) t=0 = �(� + 1)

Var(X) = EX2 − (EX)2 = �(� + 1) − �2 = � |

If X1 ≈ �(�1 ), X2 ≈ �(�2 ), ...Xn ≈ �(�n), all independent:Y = X1 + ... + Xn, find moment generating function of Y,

tXnΠ(t) = Ee tY = Ee t(X1 +...+Xn) = Ee tX1 × ... × e By independence:

tXn �1 (e t −1)e �2 (e t �n (e t −1)Ee tX1 Ee tX2 × ... × Ee = e −1)...e

Moment generating function of �(�1 + ... + �n):

Π(t) = e(�1 +�2 +...+�n)(e t −1)

If dependent, for example:X1, X1 − 2X1 ⊂ {0, 2, 4, ...} - skips odd numbers, so not Poisson.

Approximation of Binomial: X1, ..., Xn ≈ B(p), P(Xi = 1) = p, P(X1 = 0) = 1 − p Y = X1 + .. + Xn ≈ B(n, p), P(Y = k) =

�n�pk (1 − p)n−k

kIf p is very small, n is large; np = � p = 1/100, n = 100; np = 1

� � �n�

1 )n

k k n n n n! k nk

�n�

p k(1 − p)n−k =

�n�

( �

)k (1 − �

)n−k = �k (1 − )−k (1 −

Many factors can be eliminated when n is large ↔

x xlimn�∗(1 + )n = e n

�n�

1 n! 1 (n − k + 1)(n − k + 2)...n 1 = =

k nk k!(n − k)! nk n × n × ... × n k!

Simplify the left fraction:

1 (1 −

k − 1)(1 −

k − 2)...(1 − ) ↔ 1

n n n

60

Page 61: Lecture Notes(Introduction to Probability and Statistics)

1 ↔ k!

So, in the end:

�n�

p k (1 − p)n−k = �k

e−�

k k!

Poisson distribution with parameter � results.

Example:B(100, 1/100) � �(1); P(2) � 1 e−1 = e−1

very close to actual.2 2

Counting Processes: Wrong connections to a phone number, number of typos in a book ona page, number of bacteria on a part of a plate.

Properties:1) Count(S) - a count of random objects in a region S √ TE(count(S)) = � S , where S - size of S× | | | |(property of proportionality)2) Counts on disjoint regions are independent.3) P(count(S) ∼ 2) is very small if the size of the region is small.1, 2, and 3 lead to count(S) ≈ �(� S ), � - intensity parameter.| |

A region from [0, T] is split into n sections, each section has size T /n| |The count on each region is X1, ..., Xn

By 2), X1, ..., Xn are independent. P(Xi ∼ 2) is small if n is large. TBy 1), EXi = � |n| = 0(P(X1 = 0)) + 1(P(X1 = 1) + 2(P(X1 = 2)) + ...

But, over 1 the value is very small.P(Xi = 1) � �|

nT |

P(X1 = 0) � 1 − �|T |n

(� T |)k T

P(count(T ) = k) = P(X1 + ... + Xn = k) � B(n, �|T |

) � �(� T ) � e−�| | n

| | |k!

§5.6 - Normal Distribution

61

Page 62: Lecture Notes(Introduction to Probability and Statistics)

� 2

2( e −x

dx)2

Change variables to facilitate integration:

−x 2 �

−y −(x 2 2+y )�

2 � �

= e 2 dx × e 2 dy = e 2 dxdy

Convert to polar:

� 2λ 2� ∗ 1 r 2

� ∗ 1 r 2

� ∗ 1 r 2 r

� ∗

= e− 2 2 2rdrdχ = 2ψ e− rdr = 2ψ e− rd( ) = 2ψ e−tdt = 2ψ 0 0 0 0 2 0

So, original integral area = ∩

2� ∗ 1 −x

e 2 dx = 1 ∩2ψ−∗

p.d.f.:

21 −x

f (x) = e 2∩2ψ

Standard normal distribution, N(0, 1)

** End of Lecture 20

62

Page 63: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 21 April 1, 2005

Normal Distribution

Standard Normal Distribution, N(0, 1) p.d.f.:

2∩1

2ψe−x /2f (x) =

m.g.f.:

t /2δ(t) = E(e tX ) = e 2

Proof - Simplify integral by completing the square:

2 2

� 1

δ(t) = e tx ∩2ψe−x /2dx =

1 � e tx−x /2dx =∩

11 �

t2 /2−t2 /2+tx−x 2/2dx =1

e t2 /2

� e− 2 (t−x)2

e dx∩2ψ

∩2ψ

Then, perform the change of variables y = x - t:

2 1 2 2 1 2 2 2

= ∩1

2ψe t /2

� ∗

e− 2 y dy = e t /2 1 � ∗

e− 2 y dy = e t /2 f (x)dx = e t /2∩2ψ−∗ −∗

Use the m.g.f. to find expectation of X and X 2 and therefore Var(X):

2 t /2 2 2

E(X) = δ∅(0) = tet /2|t=0 = 0; E(X2) = δ∅∅(0) = e 2

t + e t /2t=0 = 1; Var(X) = 1 |

Consider X ≈ N (0, 1), Y = θX + µ, find the distribution of Y:

y−µ � 1 2

P(Y ← y) = P(θX + µ ← y) = P(Xy − µ

) = �

∩2ψe−x /2dx←

θ −∗

p.d.f. of Y:

f (y) = �P(Y ← y) 1 (y−µ)2 1 1 (y−µ)2

= ∩2ψe−

2�2 = 2�2 N (µ, θ)�y θ θ

∩2ψe− ↔

EY = E(θX + µ) = θ(0) + µ(1) = µE(Y − µ)2 = E(θX + µ − µ)2 = θ2

E(X2) = θ2 - variance of N (µ, θ)θ =

Var(X) - standard deviation

63

Page 64: Lecture Notes(Introduction to Probability and Statistics)

To describe an altered standard normal distribution N(0, 1) to a normal distribution N (µ, θ), The peak is located at the new mean µ, and the point of inflection occurs θ away from µ

N ( ); Y = + µ Moment Generating Function of µ, θ

θX

= Ee t(πX+µ) = Ee(tπ)X etµ = etµEe(tπ)X = etµe(tπ)2 /2 = etµ+t (π)2 /2δ(t) = Ee tY 2

Note: X1 ≈ N (µ1, θ1), ..., Xn ≈ N (µn, θn) - independent.Y = X1 + ... + Xn, distribution of Y:Use moment generating function:

Ee tY = Ee t(X1 +...+Xn) = Ee tX1 ...e tXn = Ee tX1 ...Ee tXn = eµ1 t+π2 t2 /2 × ... × eµnt+π2 t2 /21 n

2P µi t+

P π2

i t /2 ≈ N (�

µi, ��

θ2= e i )

The sum of different normal distributions is still normal!This is not always true for other distributions (such as exponential)

Example:X ≈ N (µ, θ), Y = cX , find that the distribution is still normal:Y = c(θN (0, 1) + µ) = (cθ)N (0, 1) + (µc)Y ≈ cN (µ, θ) = N (cµ, cθ)

Example:Y ≈ N (µ, θ)P(a Y ← b) = P(a ← θx + µ ← b) = P( a−µ X b−µ )π ← π← ←This indicates the new limits for the standard normal.

Example:Suppose that the heights of women: X ≈ N (65, 1) and men: Y ≈ N (68, 2)P(randomly chosen woman taller than randomly chosen man)P(X > Y ) = P(X − Y > 0)Z = X − Y ≈ N

Z−(−3)≥5

(65 −

> −(−3)≥5

68, ∩

12 + 22) = N (−3,

(5))

P(Z > 0) = P( ) = P(standard normal > ≥5

3 = 1.342) = 0.09 Probability values tabulated in the back of the textbook.

Central Limit Theorem Flip 100 coins, expect 50 tails, somewhere 45-50 is considered typical.

64

Page 65: Lecture Notes(Introduction to Probability and Statistics)

Flip 10,000 coins, expect 5,000 tails, and the deviation can be larger, perhaps 4,950-5,050 is typical.

Xi = { 1(tail); 0(head)}

number of tails X1 + ... + Xn 1 1 1 1 = E(X1) = by LLN Var(X1) = ) =

n n ↔

2 2(1 −

2 4 But, how do you describe the deviations? X1, X2, ..., Xn are independent with some distribution P

n

µ = EX1, θ2 = Var(X1); x =1 �

Xi EX1 = µ n

↔ i=1

x − µ on the order of ∩ n

≥n(x−µ) behaves like standard normal. π↔

∩ n(x − µ)

is approximately standard normal N (0, 1) for large n θ

nP(

∩ n(x − µ)

x) −−−−↔ P(standard normal ← x) = N (0, 1)(−→ , x)

θ ← ↔ →

This is useful in terms of statistics to describe outcomes as likely or unlikely in an experiment.

P(number of tails ← 4900) = P(X1 + ... + X10,000 4, 900) = P(x 0.49) = ← ←

1∩ 10, 000(x − 2 ) ←

∩ 10, 000(0.49 − 0.5)

) � N (0, 1)(−→ , − 100(0.01)

= − 2) = 0.0267= P( 1 1 1 2 2 2

Tabulated values always give for positive X, area to the left.

In the table, look up -2 by finding the value for 2 and taking the complement.

** End of Lecture 21

65

Page 66: Lecture Notes(Introduction to Probability and Statistics)

1

18.05 Lecture 22 April 4, 2005

Central Limit Theorem X1, ..., Xn - independent, identically distributed (i.i.d.) x = (X1 + ... + Xn)n µ = EX, θ2 = Var(X)

∩ n(x − µ)

n N (0, 1)−−−−↔ θ

↔ →

You can use the knowledge of the standard normal distribution to describe your data:

θY ∩ n(x − µ)

= Y, x − µ = θ

∩ n

This expands the law of large numbers:It tells you exactly how much the average value and expected vales should differ.

1 1∩ n(x − µ)

= ∩ n (

x1 − µ + ... +

xn − µ ) = ∩

n (Z1 + ... + Zn)

θ n θ θ

where: Zi = Xi

π−µ ; E(Zi) = 0, Var(Zi) = 1

Consider the m.g.f., see that it is very similar to the standard normal distribution:

Ee t �1 n

(Z1 +...+Zn) = Ee tZ1 /

≥n × ... × e tZn /

≥n = (Ee tZ1 /

≥n)n

Ee tZ1 = 1 + tEZ1 +1 t2EZ1

2 +1 t3EZ1

3 + ... 2 6

1 2 1 = 1 + t + t3EZ1

3 + ... 2 6

t2

Ee(t/≥

n)Z1 = 1 + t2

+ t3

EZ13 + ... � 1 +

2n 6n3/2 2n Therefore:

)n

2n (Ee tZ1 /

≥n)n � (1 +

t2

t2

(1 + )n n e t /2 - m.g.f. of standard normal distribution! −−−−↔ 2

2n ↔ →

Gamma Distribution: Gamma function; for ∂ > 0, λ > 0

66

Page 67: Lecture Notes(Introduction to Probability and Statistics)

� ∗

�(∂) = x ∂−1 e−xdx 0

p.d.f of Gamma distribution, f(x):

1 = � ∗ 1

x ∂−1 e−xdx, f (x) = { 1 x ∂−1 e−x , x ∼ 0; 0, x < 0}

0 �(∂) �(∂)

Change of variable x = λy, to stretch the function: � ∗ 1

� ∗ λ∂

1 = λ∂−1 y ∂−1 e−ξyλdy = y ∂−1 e−ξydy 0 �(∂) 0 �(∂)

p.d.f. of Gamma distribution, f (x ∂, λ):|

λ∂

f (x ∂, λ) = { x ∂−1 e−ξx , x ∼ 0; 0, x < 0} −Gamma(∂, λ)|�(∂)

Properties of the Gamma Function: � ∗ � ∗

�(∂) = x ∂−1 e−xdx = x ∂−1d(−e−x) = 0 0

Integrate by parts:

� ∗ � ∗

= x ∂−1 e−x|∗ (−e−x)(∂ − 1)x ∂−2dx = 0 + (∂ − 1) x ∂−2 e−xdx = (∂ − 1)�(∂ − 1)0 − 0 0

In summary, Property 1: �(∂) = (∂ − 1)�(∂ − 1)

You can expand Property 1 as follows:

�(n) = (n − 1)�(n − 1) = (n − 1)(n − 2)�(n − 2) = (n − 1)(n − 2)(n − 3)�(n − 3) =

� ∗

= (n − 1)...(1)�(1) = (n − 1)!�(1), �(1) = e−xdx = 1 ↔ �(n) = (n − 1)! 0

In summary, Property 2: �(n) = (n − 1)!

Moments of the Gamma Distribution: X ≈ (∂, λ)

� ∗ k λ

∂ λ∂ � ∗

x(∂+k)−1 e−ξxdxEXk = x x ∂−1 e−ξxdx = 0 �(∂) �(∂) 0

Make this integral into a density to simplify:

λ∂ �(∂ + k) � ∗ λ∂+k

x(∂+k)−1 e−ξxdx= �(∂) λ∂+k

0 �(∂ + k)

The integral is just the Gamma distribution with parameters (∂ + k, λ)!

�(∂ + k) (∂ + k − 1)(∂ + k − 2) × ... × ∂�(∂) (∂ + k − 1) × ... × ∂ = = =

�(∂)λk �(∂)λk λk

For k = 1:

67

Page 68: Lecture Notes(Introduction to Probability and Statistics)

∂ E(X) =

λ

For k = 2:

E(X2) = (∂ + 1)∂

λ2

(∂ + 1)∂ ∂2 ∂ Var(x) = =

λ2 − λ2 λ2

Example:

If the mean = 50 and variance = 1 are given for a Gamma distribution, Solve for ∂ = 2500 and λ = 50 to characterize the distribution.

Beta Distribution:

� 1

x ∂−1(1 − x)ξ−1dx = �(∂)�(λ)

, 1 = � 1 �(∂ + λ)

x ∂−1(1 − x)ξ−1dx 0 �(∂ + λ) 0 �(∂)�(λ)

Beta distribution p.d.f. - f (x ∂, λ)|

Proof: � ∗ � ∗ � ∗ � ∗

�(∂)�(λ) = x ∂−1 e−xdx y ξ−1 e−y dy = x ∂−1 y ξ−1 e−(x+y)dxdy 0 0 0 0

Set up for change of variables:

x ∂−1 y ξ−1 e−(x+y) = x ∂−1((x + y) − x)ξ−1 e−(x+y) = x ∂−1(x + y)ξ−1(1 − x

)ξ−1 e−(x+y)

x + y

Change of Variables:

x s = x + y, t = , x = st, y = s(1 − t) ↔ J acobian = s(1 − t) − (−st) = s

x + y

Substitute:

� 1 � ∗ � 1 � ∗

= t∂−1 s ∂+ξ−2(1 − t)ξ−1 e−ssdsdt = t∂−1(1 − t)ξ−1dt s ∂+ξ−1 e−sds = 0 0 0 0

� 1

= t∂−1(1 − t)ξ−1 × �(∂ + λ) = �(∂)�(λ) 0

Moments of Beta Distribution:

68

Page 69: Lecture Notes(Introduction to Probability and Statistics)

�(∂ + λ) � 1

EXk = � 1

x k �(∂ + λ) x ∂−1(1 − x)ξ−1dx = x(∂+k)−1 (1 − x)ξ−1dx

0 �(∂)�(λ) �(∂)�(λ) 0

Once again, the integral is the density function for a beta distribution.

�(∂ + λ) �(∂ + k)�(λ) �(∂ + λ) �(∂ + k) (∂ + k − 1) × ... × ∂ = = =

�(∂)�(λ) ×

�(∂ + λ + k) �(∂ + λ + k) �(∂) (∂ + λ + k − 1) × ... × (∂ + λ)

For k = 1:

∂ EX =

∂ + λ

For k = 2:

EX2 =(∂ + 1)∂

(∂ + λ + 1)(∂ + λ)

(∂ + 1)∂ ∂2 ∂λ =Var(X) =

(∂ + λ + 1)(∂ + λ) −

(∂ + λ)2 (∂ + λ)2(∂ + λ + 1)

Shape of beta distribution.

** End of Lecture 22

69

Page 70: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 23 April 6, 2005

Estimation Theory: If only 2 outcomes: Bernoulli distribution describes your experiment.If calculating wrong numbers: Poisson distribution describes experiment.May know the type of distribution, but not the parameters involved.

A sample (i.i.d.) X1, ..., Xn has distribution P from the family of distributions: {Pβ : χ ⊂ Γ}P = Pβ0 , χ0 is unknownEstimation Theory - take data and estimate the parameter.It is often obvious based on the relation to the problem itself.

Example: B(p), sample: 0 0 1 1 0 1 0 1 1 1p = E(X) ♥ x = 6/10 = 0.6

Example: E(∂), ∂e−∂x, x ∼ 0, E(X) = 1/∂.Once again, parameter is connected to the expected value.1/∂ = E(X) ♥ x, ∂ � 1/x - estimate of alpha.

Bayes Estimators: - used when intuitive model can be used in describing the data.

X1, ..., Xn ≈ Pβ0 , χ0 ⊂ Γ Prior Distribution - describes the distribution of the set of parameters (NOT the data)f (χ) - p.f. or p.d.f. ↔ corresponds to intuition.P0 has p.f. or p.d.f.; f (x χ)|Given x1, ..., xn joint p.f. or p.d.f.: f (x1, ..., xn χ) = f (x1 χ) × ... × f (xn χ)| | |To find the Posterior Distribution - distribution of the parameter given your collected data. Use Bayes formula:

f (x1, .., xn χ)f (χ)f (χ x1, ..., xn) = | �

f (x1, ..., xn

||χ)f (χ)dχ

The posterior distribution adjusts your assumption (prior distribution) based upon your sample data.

Example: B(p), f (x p) = px(1 − p)1−x;|

70

Page 71: Lecture Notes(Introduction to Probability and Statistics)

f (x1, ..., xn p) = �p xi (1 − p)1−xi = pP

xi (1 − p)n−P xi|

Your only possibilities are p = 0.4, p = 0.6, and you make a prior distribution based on the probability that the parameter p is equal to each of those values. Prior assumption: f(0.4) = 0.7, f(0.6) = 0.3 You test the data, and find that there are are 9 successes out of 10, p̂ = 0.9 Based on the data that give p̂ = 0.9, find the probability that the actual p is equal to 0.4 or 0.6. You would expect it to shift to be more likely to be the larger value. Joint p.f. for each value:

f (x1, ..., x10|0.4) = 0.49(0.6)1

f (x1, ..., x10|0.6) = 0.69(0.4)1

Then, find the posterior distributions:

(0.49(0.6)1)(0.7)f (0.4|x1, ..., xn) =

(0.49(0.6)1)(0.7) + (0.69(0.4)1)(0.3) = 0.08

(0.69(0.4)1)(0.3)f (0.6|x1, ..., xn) =

(0.49(0.6)1)(0.7) + (0.69(0.4)1)(0.3) = 0.92

Note that it becomes much more likely that p = 0.6 than p = 0.4

Example: B(p), prior distribution on [0, 1]Choose any prior to fit intuition, but simplify by choosing the conjugate prior.

p�xi (1 − p)n−�xi f (p)f (p x1, ..., xn) = �

(...)dp|

Choose f(p) to simplify the integral. Beta distribution works for Bernoulli distributions. Prior is therefore:

f (p) = �(∂ + λ)

p ∂−1(1 − p)ξ−1 , 0 ← p ← 1 �(∂)�(λ)

Then, choose ∂ and λ to fit intuition: makes E(X) and Var(X) fit intuition.

�(∂ + � xi + λ + n −

� xi)

(1 − p)(ξ+n−P xi )−1f (p x1...xn) = |

�(∂ + � xi)�(λ + n −

� xi) × p(∂+

P x1 )−1

Posterior Distribution = Beta(∂ + � xi, λ + n −

� xi )

The conjugate prior gives the same distribution as the data.

Example:

71

Page 72: Lecture Notes(Introduction to Probability and Statistics)

B(∂, λ) such that EX = 0.4, Var(X) = 0.1 Use knowledge of parameter relations to expectation and variance to solve:

∂ ∂λ EX = 0.4 = , Var(X) = 0.1 =

∂ + λ (∂ + λ)2(∂ + λ + 1)

The posterior distribution is therefore:

Beta(∂ + 9, λ + 1)

And the new expected value is shifted:

∂ + 9 EX =

∂ + λ + 10

Once this posterior is calculated, choose the parameters by finding the expected value.

Definition of Bayes Estimator: Bayes estimator of unknown parameter χ0 is χ(X1, ..., Xn) = expectation of the posterior distribution.

Example: B(p), prior Beta(∂, λ), X1, ..., Xn ↔ posterior Beta(∂ + � xi, λ + n −

� xi )

∂ + � xi ∂ +

� xi

Bayes Estimator: = ∂ +

� xi + λ + n −

� xi ∂ + λ + n

To see the relation to the prior, divide by n:

∂/n + x = ∂/n + λ/n + 1

Note that it erases the intuition for large n.The Bayes Estimator becomes the average for large n.

** End of Lecture 23

72

Page 73: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 24 April 8, 2005

Bayes Estimator. Prior Distribution f (χ) ↔ compute posterior f (χ X1, ..., Xn)|Bayes’s Estimator = expectation of the posterior. E(X − a)2 minimize a a = EX↔ ↔

Example: B(p), f (p) = Beta(∂, λ) ↔ f (p x1, ..., xn) = Beta(∂ + � xi, λ + n −

� xi)|

∂ + � xi

χ(x1, ..., xn) = ∂ + λ + n

Example: Poisson Distribution

�x

�(�), f (x �) = | x!e−�

Joint p.f.:

n

e−n�f (x1, ..., xn| �) = � �xi

e−� = �

P xi

xi! �xi! i=1

If f (�) is the prior distribution, posterior:

�P

xi e−n�f (�)

f (� x1, ..., xn) = �xi !

�P

xi e−n�f (�)d�|

g(x1...xn) = �

�xi !

Note that g does not depend on �:

f (� x1, ..., xn) ≈ �P

xi e−n�f (�)| Need to choose the appropriate prior distribution, Gamma distribution works for Poisson.

Take f (�) - p.d.f. of �(∂, λ),

λ∂

e−ξ� f (�) = �∂−1

�(∂)

f (� x1, ..., xn) ≈ �P

xi +∂−1e−(n+ξ)� ↔ �(∂ + � xi, λ + n)|

Bayes Estimator:

∂ + � xi

�(x1, ..., xn) = EX = n + λ

Once again, balances both prior intuition and data, by law of large numbers:

∂/n + � xi/n

n�(x1 , ..., xn) = −−−−↔ x ↔ E(X1) ↔ � 1 + λ/n

↔ →

The estimator approaches what you’re looking for, with large n.

Exponential E(∂), f (x ∂) = ∂e−∂x, x ∼ 0|f (x1, ..., xn ∂) = �n

i=1| ∂e−∂xi = ∂ne−(P

xi )∂

If f (∂) - prior, the posterior:

73

Page 74: Lecture Notes(Introduction to Probability and Statistics)

f (∂ x1, ..., xn) ≈ ∂n e−(P

xi)∂f (∂)| Once again, a Gamma distribution is implied. Choose f (∂) − �(u, v)

uve−v∂ f (∂) = ∂u−1

�(u)

New posterior:

f (∂ x1, ..., xn) ≈ ∂n+u−1 e−(P

xi +v)u ↔ �(u + n, v + �

xi)| Bayes Estimator:

u + n u/n + 1 1 ∂(x1 , ..., xn) = = n = ∂

v + � xi v/n +

� xi/n

−−−−↔ x ↔

E

1 X

↔ →

Normal Distribution:

1

N (µ, θ), f (x µ, θ) = θ∩ 1

2ψe−

2�2 (x−µ)2 |

1

f (x1, ..., xn µ, θ) = 1

i=1 (xi −µ)2 | (θ∩

2ψ)n e−

2�2

Pn

It is difficult to find simple prior when both µ, θ are unknown. Say that θ is given, and µ is the only parameter:

1

Prior: f (µ) = b∩ 1

2ψe−

2b2 (µ−a)2

= N (a, b)

Posterior:

1 1

f (µ X1, ..., Xn) ≈ e− 2�2

P

(xi −µ)2 − 2b2 (µ−a)2 |

Simplify the exponent:

1 2 2 n 1 a2 1 2 = �

(xi − 2µxi + µ ) + 2b2

(µ − 2aµ + a 2) = µ (2θ2

+2b2

) − 2µ(

� xi

+2b2

) + ... 2θ2 2θ2

B B2 )2 = µ 2A − 2µB + ... = A(µ − 2µB

+ ( )2) − B2

+ ... = A(µ − + ... A A A A

f (µ X1, ..., Xn) ≈ e−A(µ− B = e

− 1

| A )2

2(1/�

2A)2 (µ − B

)2 = N ( B,

1 ) = N (

θ2A + nb2x,

θ2b2

A A ∩

2A θ2 + nb2 θ2 + nb2 )

Normal Bayes Estimator:

θ2a + nb2x θ2a/n + b2x µ(X1, ..., Xn) = =

θ2/n + b2 n x E(X1) = µ−−−−↔

θ2 + nb2 ↔ → ↔

** End of Lecture 24

74

Page 75: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 25 April 11, 2005

Maximum Likelihood Estimators X1, ..., Xn have distribution Pβ0 Pβ : χ ⊂ Γ}⊂ {Joint p.f. or p.d.f.: f (x1, ..., xn) = f (x1 χ) × ... × f (xn χ) = ξ(χ) - likelihood function. | |If Pβ - discrete, then f (x χ) = Pβ(X = x),|and ξ(χ) - the probability to observe X1, ..., Xn

Definition: A Maximum likelihood estimator (M.L.E.):χ̂ = χ̂(X1, ..., Xn) such that ξ(χ̂) = maxβ ξ(χ)Suppose that there are two possible values of the parameter, χ = 1, χ = 2p.f./p.d.f. - f (x 1), f (x 2)| |Then observe points x1, ..., xn

view probability with first parameter and second parameter:ξ(1) = f (x1, ..., xn 1) = 0.1, ξ(2) = f (x1, ..., xn 2) = 0.001,| |The parameter is much more likely to be 1 than 2.

Example: Bernoulli Distribution B(p), p ⊂ [0.1], ξ(p) = f (x1, ..., xn p) = p

P xi (1 − p)n−P

xi|ξ(χ) ↔ max ∝ log ξ(χ) ↔ max (log-likelihood)log ξ(p) =

� xi log p + (n −�

xi ) log(1 − p), maximize over [0, 1]Find the critical point:

log ξ(p) = 0 �p

� xi n −�

xi

p −

1 − p = 0

� xi(1 − p) − p(n −

� xi) =

� xi − p

� xi − np + p

� xi = 0

� xi

p̂ = = x E(X) = p n

For Bernoulli distribution, the MLE converges to the actual parameter of the distribution, p.

Example: Normal Distribution: N (µ, θ2),

1

f (x|µ, θ2) = 1 ∩2ψθ

e− 2�2 (x−µ)2

Pn1

ξ(µ, θ2) = ( ∩2

1

ψθ )n e−

2�2 i=1 (xi −µ)2

n1

log ξ(µ, θ2) = n log(∩

2ψθ) − �

(xi − µ)2 max : µ, θ2

2θ2 ↔

i=1

Note that the two parameters are decoupled.

First, for a fixed θ, we minimize �n

i=1(xi − µ)2 over µ

75

Page 76: Lecture Notes(Introduction to Probability and Statistics)

� �µ

n�

i=1

n�(xi − µ)2 = − 2(xi − µ) = 0,

i=1

n� 1 n�

xi − nµ = 0, µ̂ = xi = x ↔ E(X) = µ0 n

i=1 i=1

To summarize, the estimator of µ for a Normal distribution is the sample mean.

To find the estimator of the variance:

1 −n log(∩

2ψθ) − n�

(xi − x)2 maximize over θ↔2θ2

i=1

� n 1 = +

�θ − θ θ3

n�(xi − x)2 = 0

i=1

θ2 10 ; ˆˆ =

�(xi − x)2 - MLE of θ2 θ2 − a sample variance

n

θ2

θ2 1 2 1 2 1 1 2

Find ˆ

ˆ = �

(x − 2xix+ (x)2) = �

xi − 2x �

xi + (x)2 = �

xi − 2(x)2 + (x)2 = i n n n n

1 2 − (x)2 = x2 − (x)2 2 = �

xi ↔ E(x1) − E(x1)2 = θ2

0 n

To summarize, the estimator of θ2 for a Normal distribution is the sample variance.0

Example: U(0, χ), χ > 0 - parameter.

f(x χ) = { , 0 ← x ← χ; 0, otherwise }|χ

1

Here, when finding the maximum we need to take into account that the distribution is supported on a finite interval [0, χ].

ξ(χ) = n � 1 1

I(0 ← xi ← χ) = χI(0 ← x1, x2, ..., xn χ)

n ←

χ i=1

The likelihood function will be 0 if any points fall outside of the interval.If χ will be the correct parameter with P = 0,you chose the wrong χ for your distribution.

ξ(χ) ↔ 0maximize over χ >

76

Page 77: Lecture Notes(Introduction to Probability and Statistics)

� |

If you graph the p.d.f., notice that it drops off when χ drops below the maximum data point.

χ̂ = max(X1, ..., Xn)

The estimator converges to the actual parameter χ0:As you keep choosing points, the maximum gets closer and closer to χ0

Sketch of the consisteny of MLE.

1 ξ(χ) ↔ max ⊇ log ξ(χ) ↔ max

n

n1 1

Ln(χ) = log ξ(χ) = log �

f (xi|χ) = 1 �

log f (xi χ) ↔ L(χ) = Eβ0 log f (x1 χ). n n n

| |i=1

Ln(χ) is maximized at χ̂, by definition of MLE. Let us show that L(χ) is maximized at χ0. Then, evidently, χ̂ ↔ χ0. L(χ) ← L(χ0) : Expand the inequality:

� f (x χ)

L(χ) − L(χ0) = log |

f (x|χ0)dx ← � �

f

f (

(

x

x

||χ

χ

0

)

) − 1 f (x|χ0)dx

f (x χ0)

= (f (x χ) − f (x χ0)) dx = 1 − 1 = 0.| |

Here, we used that the graph of the logarithm will be less than the line y = x - 1 except at the tangent point.

** End of Lecture 25

77

Page 78: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 26 April 13, 2005

Confidence intervals for parameters of Normal distribution.

Confidence intervals for µ0, θ2 in N (µ0, θ2 0 0 )

µ = x, ˆˆ θ2 = x2 − (x)2

µ ↔ µ0, ˆˆ θ2 θ2 with large n, but how close exactly? ↔ 0

You can guarantee that the mean or variance are in a particular interval with some probability:Definition: Take ∂ ⊂ [0, 1], ∂− confidence levelIf P(S1(X1, ..., Xn) ← µ0 S2(X1, ..., Xn)) = ∂,←then interval [S1, S2] is the confidence interval for µ0 with confidence level ∂.

Consider Z0, ..., Zn - i.i.d., N(0, 1)Definition: The distribution of Z1

2 + Z22 + ... + Z2 is called a chi-square (α2) distribution,n

with n degrees of freedom.

2 , 2 )As shown in 7.2, the chi-square distribution is a Gamma distribution ↔ �( n 1§

Definition:

Z0The distribution of �

1 (Z12 + ... + Z2

is called a t-distribution with n d.o.f. n)n

The t-distribution is also called Student’s distribution, see 7.4 for detail. §

To find the confidence interval for N (µ0, θ2 0 ), need the following:

Fact: Z1, ..., Zn ≈ i.i.d.N (0, 1)

1 2 1 z = (Z1 + ... + Zn), z2 − (z)2 =

1 � zi − (

� zi)

2

n n n

Then, A = ∩nz ≈ N (0, 1), B = n(z2 − (z)2) ≈ α2

n−1, and A and B are independent.

Take X1, ..., Xn ≈ N (µ0, θ2 0 ), µ0, θ2 unknown. 0

Z1 = x1 − µ0

, ..., Zn = xn − µ0 ≈ N (0, 1)

θ0 θ0

x − µ0A =

∩nz =

∩n( )

θ0

B = n(z2 − (z)2) = ∩n(

1 � (xi − µ0)2

− ( x − µ0

)2) = θ2 (

∩n 1 �

(xi − µ0)2 − (x − µ0)

2) = n θ2 θ0 0 n0

∩n 2 2

∩n

= (x2 − 2µ0x + µ0 − x 2 + 2µ0x − µ0) = (x2 − (x)2)θ2 θ2

0 0

To summarize:

A = ∩n( x − µ0

) ≈ N (0, 1); B =

∩n

(x2 − (x)2) ≈ α2

θ2 n−1θ0 0

78

Page 79: Lecture Notes(Introduction to Probability and Statistics)

and, A and B are independent.

You can’t compute B, because you don’t know θ0, but you know the distribution:

∩n(x2 − (x)2) ≈ α2B =

θ2 n−1 0

c1 and c2.Choose the most likely values for B, between

Choose the c values from the chi-square tabled values, such that area = ∂ confidence.

With probability = confidence (∂), c1 B← ← c2

∩n(x2 − (x)2)

c1 θ2←

0 ← c2

Solve for θ0:

∩n(x2 − (x)2)

∩n(x2 − (x)2)

θ2 0 ←c2

← c1

Choose c1 and c2 such that the right tail has probability 1−∂ , same as left tail. 2 This results in throwing away the possibilities outside c1 and c2

1Or, you could choose to make the interval as small as possible, minimize: c1 1 − c2

given ∂

Why wouldn’t you throw away a small interval in between c1 and c2, with area 1 − ∂? Though it’s the same area, you are throwing away very likely values for the parameter!

** End of Lecture 26

79

Page 80: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 27 April 15, 2005

Take sample X1, ..., Xn ≈ N (0, 1)

n(x2 − (x)2

≈ α2A =

∩n(x − µ) ≈ N (0, 1), B = θ θ2 n−1

A, B - independent.To determine the confidence interval for µ, must eliminate θ from A:

A Z0

= ≈ tn−1� 1 B

� 1 2 2

n−1 n−1 (z1 + ... + zn−1)

Where Z0 1 n−1 ≈ N (0, 1) 1

n−1 (Z2 1 + ... + Z2

n−1) ↔ EZ2 1

, Z , .., ZThe standard normal is a symmetric distribution, and = 1

So tn-distribution still looks like a normal distribution (especially for large n), and it is symmetric about zero.

Given ∂ ⊂ (0, 1) find c, tn−1(−c, c) = ∂

A c−c ← �

1 B ←

n−1

with probability = confidence (∂)

∩n(x − µ)

� 1 n(x2 − (x)2)

/ c θ2

−c ← θ n − 1

c−c ← � 1

x − µ

n−1 (x2 − (x)2)

� 1

� 1

x − cn − 1

(x2 − (x)2) ← µ ← x + cn − 1

(x2 − (x)2)

By the law of large numbers, x EX = µ↔The center of the interval is a typical estimator (for example, MLE). �

π2 error � estimate of variance � for large n. n

θ̂2 = x2 − (x)2 is a sample variance and it converges to the true variance,

80

Page 81: Lecture Notes(Introduction to Probability and Statistics)

θ2 θ2by LLN ˆ ↔

1 θ2 1 2 2

Eˆ = E (x1 + ... + xn) − E( (x1 + ... + xn))2 = n n

1 = EX1

2 − n

1 2

� EXiXj = EX1

2 − n2

(nEX12 + n(n + 1)(EX1)

2) i,j

Note that for i = j, EXiXj = EXiEXj = (EX1)2 = µ2 , n(n - 1) terms with different indices. ∈

θ2Eˆ =

n − 1 EX1

2 − n − 1

(EX1)2 =

n n

= n − 1

(EX12 − (EX1)

2) = n − 1

Var(X1) = n − 1

θ2

n n n

Therefore:

θ2 n − 1 n

Eˆ = θ2 < θ2

Good estimator, but more often than not, less than actual. So, to compensate for the lower error:

n θ2 = θ2

n − 1 n ˆ

E ˆ

Consider (θ∅)2 = n−1 θ2, unbiased sample variance.

� 1

� 1

� 1

ˆ ∅±cn − 1

(x2 − (x)2) = ±cn − 1

θ2 = ±c (θ )2 n

� (θ∅)2

� (θ∅)2

x + cx − c n ← µ ←

n

7.5 pg. 140 Example: Lactic Acid in Cheese §0.86, 1.53, 1.57, ..., 1.58, n = 10 ≈ N (µ, θ2), x = 1.379, ̂θ2 = x2 − (x)2 = 0.0966 Predict parameters with confidence ∂ = 95% Use a t-distribution with n - 1 = 9 degrees of freedom.

81

Page 82: Lecture Notes(Introduction to Probability and Statistics)

See table: (−→, c) = 0.975 gives c = 2.262 �

1 �

1 θ2x − 2.262 θ̂2 ← µ ← x + 2.262 ˆ

9 9

0.6377 ← µ ← 2.1203

Large interval due to a high guarantee and a small number of samples. If we change ∂ to 90% c = 1.833, interval: 1.189 ← µ ← 1.569Much better sized interval.

Confidence interval for variance:

nθ̂2

c1 ← θ2 ← c2

where the c values come from the α2 distribution

Not symmetric, all positive points given for α2 distribution. c1 = 2.7, c2 = 19.02 0.0508 θ2 0.3579↔ ← ←again, wide interval as result of small n and high confidence.

Sketch of Fisher’s theorem. z1, ..., zn ≈ N (0, 1)

1 (z1 + ... + zn) ≈ N (0, 1)∩nz = ≥

n

1 2 12 n(z2 − (z)2) = n(1 �

zi − ( �

zi)2) =

� z (z1 + ... + zn))2 ≈ α2

i n n − ( ∩

n n−1

1 12 i

2

e−1/2 P

z )n e−1/2r)nf (z1, ..., zn) = ( = ( ∩2ψ

∩2ψ

1 1 )n e−1/2

P y 2 2

i)n e−1/2rf (y1, ..., yn) = ( = ( ∩2ψ

∩2ψ

The graph is symmetric with respect to rotation, so rotating the coordinates gives again i.i.d. standard normal sequence.

� 1 e−1/2y 2

i ↔ y1, ..., yn − i.i.d.N (0, 1)∩2ψ

i

Choose coordinate system such that:

1 1 1 y1 = (z1 + ... + zn), i.e. ρv1 = ( , . . . , ) - new first axis. ∩

n ∩n

∩n

Choose all other vectors however you want to make a new orthogonal basis:

82

Page 83: Lecture Notes(Introduction to Probability and Statistics)

2 1 + ... + y 2

n = z 2 1 + .. + z 2

ny

since the length does not change after rotation!

∩nz = y1 ≈ N (0, 1)

) = �

y− (z)2 2 i − y 2 2

2 + ... + yn(z2 1 = y −1

** End of Lecture 27

83

2α≈ n2 n

Page 84: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 28 April 20, 2005

Review for Exam 2

pg. 280, Problem 5µ = 300, θ = 10; X1, X2, X3 ≈ N (300, 100 = θ2)P(X1 > 290 X2 > 290 ⇒X3 > 290) = 1 − P(X1 ← 290)P(X2 ← 290)P(X3 ← 290)⇒

← −1)P( x2 − 300 ← −1)P(

x3 − 300 = 1 − P(

x1 − 300 ← −1)10 10 10

Table for x = 1 gives 0.8413, x = -1 is therefore 1 - 0.8413 = 0.1587 = 1 − (0.1587)3 = 0.996

pg. 291, Problem 11600 seniors, a third bring both parents, a third bring 1 parent, a third bring no parents.Find P(< 650 parents)Xi − 0, 1, 2 ↔ parents for the ith student.P(Xi = 2) = P(Xi = 1) = P(Xi = 0) = 1

3 P(X1 + ... + X600 < 650) - use central limit theorem. µ = 0(1/3) + 1(1/3) + 2(1/3) = 1 EX2 = 02(1/3) + 12(1/3) + 22(1/3) = 5

3 5Var(X) = EX2 − (EX)2 = 3 − 1 = 2

3

θ =

2/3

( 650∩

600( P

xi − 1)P( 600 < 600 − 1)

∩600

) 2/3

2/3

P(

∩n(x − µ)

< 2.5) � N (0, 1), P(Z 2.5) = δ(2.5) = 0.9938 θ

pg. 354, Problem 10 Time to serve X ≈ E(χ), n = 20, X1, ..., X20, x = 3.8 min Prior distribution of χ is a Gamma dist. with mean 0.2 and std. dev. 1 ∂/λ = 0.2, ∂/λ2 = 1 λ = 0.2, ∂ = 0.04↔Get the posterior distribution: f (x|χ) = χe−βx , f (x1, ..., xn χ) = χne−β

P xi

ξ�

f (χ) = �(∂) χ∂−1e−ξβ , f (χ x

|1, ..., xn) ≈ χ(∂+n)−1e−(ξ+

P xi )β|

Posterior is �(∂ + n, λ + � xi) = �(0.04 + 20, 0.2 + 3.8(20))

Bayes estimator = mean of posterior distribution =

20.04 =

3.8(20) + 0.2

Problem 4 f (x χ) = {eβ−x, x ∼ 0; 0, x < 0}|Find the MLE of χ Likelihood δ(χ) = f (x1 χ) × ... × f (xn χ)|= eβ−x1 ...eβ−xn I(x1 ∼ χ, ..., xn ∼ χ) =

|enβ−P

xi I(min(x1, ..., xn) ∼ χ)

84

Page 85: Lecture Notes(Introduction to Probability and Statistics)

Maximize over χ.

Note that the graph increases in χ, but χ must be less than the min value.If greater, the value drops to zero.Therefore:

χ̂ = min(x1, ..., xn)

Also, by observing the original distribution, the maximum probability is at the smallest Xi.

p. 415, Problem 7:To get the confidence interval, compute the average and sample variances:Confidence interval for µ:

� 1

� 1

x− cn− 1

(x2 − (x)2) ← µ ← x− cn− 1

(x2 − (x)2)

To find c, use the t distribution with n - 1 degrees of freedom:

tn−1 = t19(−→, c) = 0.95, c = 1.729

Confidence interval for θ2:

n(x2 − (x)2) ≈ α2

∩n(x− µ) ≈ N(0, 1),θ θ2 n−1

85

Page 86: Lecture Notes(Introduction to Probability and Statistics)

N (0, 1) ∩n(x − µ

−(x)2 )

)/θ ≈ tn−1 =tn−1 ≈ � 1 α2

� 1 n(x2

n−1 n−1 n−1 π2

Use the table for α2 n−1

n(x2 − (x)2) c1 ←

θ2 ← c2

From the Practice Problems: (see solutions for more detail)

p. 196, Number 9 P(X1 = def ective) = p Find E(X − Y ) Xi = {1, def ective; −1, notdef ective}; X − Y = X1 + ... + Xn

E(X − Y ) = EX1 + ... + EXn = nEX1 = n(1 × p − 1(1 − P )) = n(2p − 1)

p. 396, Number 10 Xc((X1 + X2 + X3)

2 X4 + X5 + X6)2) ≈ α2

(∩c(X1 + X2 + X3))

2 ∩c(X4 + X5 + X6))

2+ (

+ (

1, ..., X6 ≈ N (0, 1)

n ≈ α2

2 But each needs a distribution of N(0, 1) E∩c(X1 + X2 + X3

∩c(EX1 + EX2 + EX3∩

c(X1 + X2 + X3 c x1 X2 X3 c ) = ) = 0

Var( )) = (Var( ) + Var( ) + Var( )) = 3In order to have the standard normal distribution, variance must equal 1. 3c = 1, c = 1/3

** End of Lecture 28

86

Page 87: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 29 April 25, 2005

Score distribution for Test 2: 70-100 A, 40-70 B, 20-40 C, 10-20 D Average = 45

Hypotheses Testing. X1, ..., Xn with unknown distribution P Hypothesis possibilities: H1 : P = P1

H2 : P = P2

... Hk : P = Pk

There are k simple hypotheses. A simple hypothesis states that the distribution is equal to a particular probability distribution.

Consider two normal distributions: N(0, 1), and N(1, 1).

There is only 1 point of data: X1

Depending on where the point is, it is more likely to come from either N(0, 1) or N(1, 1).Hypothesis testing is similar to maximum likelihood testing ↔Within your k choices, pick the most likely distribution given the data.However, hypothesis testing is NOT like estimation theory, as there is a different goal:

Definition: Error of type iP(make a mistake Hi is true) = ∂i|Decision Rule: β : X n ↔ (H1, H2, ..., Hk )Given a sample (X1, ..., Xn), β(X1, ..., Xn) ⊂ {H1, ..., Hk}∂i = P(β = Hi Hi) - error of type i∈ |“The decision rule picks the wrong hypothesis” = error.

Example: Medical test, H1 - positive, H2 - negative.Error of Type 1: ∂1 = P(β = H1 H1) = P(negative positive)∈ | |Error of Type 2: ∂2 = P(β = H2 H2) = P(positive negative)∈ | |These are very different errors, have different severity based on the particular situation.

Example: Missile Detection vs. Airplane Type 1 ↔ P(airplane missile), Type 2 ↔ P(missile airplane)| |Very different consequences based on the error made.

Bayes Decision Rules Choose a prior distribution on the hypothesis.

87

Page 88: Lecture Notes(Introduction to Probability and Statistics)

Assign a weight to each hypothesis, based upon the importance of the different errors.�(1), ..., �(k) ∼ 0,

� �(i) = 1

Bayes error ∂(�) = �(1)∂1 + �(2)∂2 + ... + �(k)∂k

Minimize the Bayes error, choose the appropriate decision rule.Simple solution to finding the decision rule:X = (X1, ..., Xn), let fi(x) be a p.f. or p.d.f. of Pi

fi(x) = fi(x1) × ... × fi(xn) - joint p.f./p.d.f.

Theorem: Bayes Decision Rule:

β = {Hi : �(i)fi(x) = maxi�j�k �(j)fj (x)

Similar to max. likelihood.Find the largest of joint densities, but weighted in this case.

∂(�) = � �(i)Pi(β = Hi) =

� �(i)(1 − Pi(β = Hi)) =

= 1 − � �(i)Pi(β =

∈Hi) = 1 −

� �(i)

� I(β(x) = Hi)fi(x)dx =

= 1 − � (� �(i)I(β(x) = Hi)fi(x))dx - minimize, so maximize the integral:

Function within the integral:

I(β = H1)�(1)f1(x) + ... + I(β = Hk )�(k)fk (x)

The indicators pick the term ↔β = H1 : 1�(1)f1(x) + 0 + 0 + ... + 0So, just choose the largest term to maximize the integral.Let β pick the largest term in the sum.

Most of the time, we will consider 2 simple hypotheses:

f1(x) �(2)β = {H1 : �(1)f1(x) > �(2)f2(x), > ; H2 if <; H1 or H2 if =

f2(x) �(1) }

Example:H1 : N (0, 1), H2 : N (1, 1)�(1)f1(x) + �(2)f2(x) ↔ minimize

1 12

f ( ) = ( i ; x2

P

(xi−1)2

e−12

12

P x )n e−)nf1(x) = ( ∩

2ψ ∩

f1(x) �(2)2+i 12

P

(xi −1)212

P x = e

n 2 −P

xi= e− > f2(x) �(1)

n �(2)β = {H1 :

� xi <

2 − log ; H2 if >; H1 or H2 if =

�(1) }

Considering the earlier example, N(0, 1) and N(1, 1)

88

Page 89: Lecture Notes(Introduction to Probability and Statistics)

X1, n = 1, �(1) = �(2) = 1 2

1 1 β = {H1 : x1 < ; H1 or H2 if =; H2x1 >

2 2 }

However, if 1 distribution were more important, it would be weighted.

If N(0, 1) were more important, you would choose it more of the time, even on 1some occasions when xi > 2

Definition: H1, H2 - two simple hypotheses, then: ∂1(β) = P(β = H1 H2) - level of significance. ∈ |λ(β) = 1 − ∂2(β) = P(β = H2 H2) - power. |For more than 2 hypotheses,∂1(β) is always the level of significance, because H1 is always theMost Important hypothesis.λ(β) becomes a power function, with respect to each extra hypothesis.

Definition: H0 - null hypothesisExample, when a drug company evaluates a new drug,the null hypothesis is that it doesn’t work.H0 is what you want to disprove first and foremost,you don’t want to make that error!

Next time: consider class of decision rules.K∂ = {β : ∂1(β) ← ∂}, ∂ ⊂ [0, 1]Minimize ∂2(β) within the class K∂

** End of Lecture 29

89

Page 90: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 30 April 27, 2005

Bayes Decision Rule �(1)∂1(β) + �(2)∂2(β) ↔ minimize.

f1(x) �(2)β = {H1 : > ; H2 : if <; H1 or H2 : if =

f2(x) �(1) }

Example: see pg. 469, Problem 3 H0 : f1(x) = 1 for 0 ← x 1 H1 : f2(x) = 2x for 0 x

← 1← ←

Sample 1 point x1

Minimize 3∂0(β) + 1∂1(β)

1 1 1 1 β = {H0 : > ; H1 : < ; either if equal}

2x1 3 2x1 3

Simplify the expression:

3 3 β = {H0 : x1 ←

2; H1 : x1 >

2 }

Since x1 is always between 0 and 1, H0 is always chosen. β = H0 always.

Errors:∂0(β) = P0(β = H0) = 0∈∂1(β) = P1(β = H1) = 1 ∈We made the ∂0 very important in the weighting, so it ended up being 0.

Most powerful test for two simple hypotheses. Consider a class K∂ = {β such that ∂1(β) ← ∂ ⊂ [0, 1]}Take the following decision rule:

f1(x) f1(x)β = {H1 : < c}

f2(x) ∼ c; H2 :

f2(x)

Calculate the constant from the confidence level ∂:

f1(x)∂1(β) = P1(β = H1) = P1( < c) = ∂∈

f2(x)

Sometimes it is difficult to find c, if discrete, but consider the simplest continuous case first: �(2)Find �(1), �(2) such that �(1) + �(2) = 1, �(1) = c

Then, β is a Bayes decision rule.�(1)∂1(β) + �(2)∂2(β) ← �(1)∂1(β

∅) + �(2)∂2(β∅)

for any decision rule β∅

If β∅ ⊂ K∂ then ∂1(β∅ ∂.) ←

Note: ∂1(β) = ∂, so: �(1)∂ + �(2)∂2(β) ← �(1)∂ + �(2)∂2(β∅)

Therefore: ∂2(β) ← ∂2(β∅), β is the best (mosst powerful) decision rule in K∂

Example:H1 : N(0, 1), H2 : N(1, 1), ∂1(β) = 0.05

90

Page 91: Lecture Notes(Introduction to Probability and Statistics)

f1(x) = e− 1

2

P x 2

i + 1 2

P

(xi −1)2

= e P

n− xi2 ∼ c f2(x)

Always simplify first:

n n 2 −

� xi ∼ log(c),

� xi + log(c),

� xi c∅←

2 ←

The decision rule becomes:

β = {H1 : �

xi c∅; H2 : �

xi > c∅}←

Now, find c∅

∂1(β) = P1(� xi > c∅)

recall, subscript on P indicates that x1, ..., xn ≈ N(0, 1)Make into standard normal:

� xi c∅

P1( > ) = 0.05∩n

∩n

Check the table for P(z > c∅∅) = 0.05, c∅∅ = 1.64, c∅ = ∩n(1.64)

Note: a very common error with the central limit theorem:

n

� xi − µ

) ↔

� xi − nµ�

xi ↔∩n(

1

θ ∩nθ

These two conversions are the same! Don’t combine techniques from both.

The Bayes decision rule now becomes:

β = {H1 : �

xi 1.64∩n; H2 :

� xi > 1.64

∩n}←

Error of Type 2:∂2(β) = P2(

� xi c = 1.64

∩n)←

Note: subscript indicates that X1, ..., Xn ≈ N(1, 1)

= P2(

� xi − n(1) 1.64

∩n− n

) = P2(z 1.64 −∩n)∩n

← ∩n

Use tables for standard normal to get the probability.If n = 9 P2(z 1.64 −

∩9) = P2(z ← −1.355) = 0.0877↔ ←

Example:H1 : N(0, 2), H2 : N(0, 3), ∂1(β) = 0.05

1 x f1(x) (

2≥1

2λ )ne−

P 2(2)

2 i

13)n/2 e− 12

P x= ( 2 i= ∼ c1

1f2(x) ( 3≥

2λ )ne−

P 2 i 2x

2(3)

2 2β = {H1 : �

xi c∅; H2 : �

xi > c∅← }

This is intuitive, as the sum of squares ≈ sample variance. If small θ = 2 ↔If large ↔ θ = 3

91

Page 92: Lecture Notes(Introduction to Probability and Statistics)

05 1(β) = P1(� x ∅) = P1(

� x2 i 2 > c

2

� ) = P1(α

2 i > c 2

n > c∂ ∅∅) = 0.2If n = 10, P1(α10 > c∅∅) = 0.05; c∅∅ = 18.31, c∅ = 36.62

Can find error of type 2 in the same way as earlier: 2 n > c

3

� ) ↔ 2

P(α10 > 12.1) � 0.7 A difference of 1 in variance is a huge deal! P(α

Large type 2 error results, small n.

** End of Lecture 30

92

Page 93: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 31 April 29, 2005

t-test X1, ..., Xn - a random sample from N(µ, θ2)

2-sided Hypothesis Test: H1 : µ = µ0

H2 : µ = µ0∈2 sided hypothesis - parameter can be greater or less than µ0

Take ∂ ⊂ (0, 1) - level of significance (error of type 1) Construct a confidence interval confidence = 1 - ∂↔If µ0 falls in the interval, choose H1, otherwise choose H2

How to construct the confidence interval in terms of the decision rule:

= t distribution with n - 1 degrees of freedom. T � 1

x− µ0

n−1 (x2 − (x)2)

∂ 2

∂ 2

Under the hypothesis H1, T is has a t-distribution.See if the T value falls in the expected area of the t-distribution:Accept the null hypothesis (H1), if −c T c, Reject if otherwise.← ←Choose c such that area between c and -c is 1 − ∂, each tail area = ∂/2 Error of type 1: ∂1 = P1(T < −c, T > c) = + = ∂

Definition: p-value

93

Page 94: Lecture Notes(Introduction to Probability and Statistics)

p-value = probability of values less likely than TIf p-value ∼ ∂, accept the null hypothesis.If p-value < ∂, reject the null hypothesis.Example: p-value = 0.0001, very unlikely that this T value would occurif the mean were µ0. Reject the null hypothesis!

1-sided Hypothesis Test: H1 : µ ← µ0

H2 : µ > µ0

T = � 1

x−2

µ0

)2)n−1 (x − (x

See how the distribution behaves for three cases: 1) If µ = µ n−1.0, T ≈ t

2) If µ < µ0 :

= � 1

µ− µ0 +T = �

1

x−2

µ0

)2) n−1 (x2 )2)

� 1

x− µ

)2)n−1 (x2

n−1 (x − (x − (x − (x

(µ− µ0)∩n− 1

n−1 +T � tθ

↔ −→

3) If µ > µ0, similarly T ↔ +→

Decision Rule: β = {H1 : T ← �; H∞ : T > �}

94

Page 95: Lecture Notes(Introduction to Probability and Statistics)

∂1 = P1(T > c) = ∂p-value: Still the probability of values less likely than T ,but since it is 1-sided,you don’t need to consider the area to the left of −T as you would in the 2-sided case.

The p-value is the area of everything to the right of T

Example: 8.5.1, 8.5.4 µ0 = 5.2.n = 15, x = 5.4, θ∅ = 0.4226 H1 : µ = 5.2, H2 : µ = 5.2∈

is calculated to be = 1.833, which leads to a p-value of 0.0882 T

If ∂ = 0.05, accept H1, µ = 5.2. because the p-value is over 0.05Decision rule:Such that ∂ = 0.05, the areas of each tail in the 2-sided case = 2.5%

95

Page 96: Lecture Notes(Introduction to Probability and Statistics)

From the table c = 2.145↔β = {H1 : −2.145 T ← 2.145; H2 otherwise}←

Consider 2 samples, want to compare their means:

1 ) and Y1, ..., Ym ≈ N (µ2, θ2X1, ..., Xn ≈ N (µ1, θ22 )

Paired t-test: Example (textbook): Crash test dummies, driver and passenger seats ≈ (X, Y)See if there is a difference in severity of head injuries depending on the seat:(X1, Y1), ..., (Xn, Yn)Observe the paired observations (each car) and calculate the difference:Hypothesis Test:H1 : µ1 = µ2

H2 : µ1 = µ2

Consider∈Z1 = X1 − Y1, ..., Zn = Xn − Yn ≈ N (µ1 − µ2 = µ, θ2)

H1 : µ = 0; H2 : µ = 0∈Just a regular t-test:p-values comes out as < 10−6, so they are likely to be different.

** End of Lecture 31

96

Page 97: Lecture Notes(Introduction to Probability and Statistics)

18.05 Lecture 32 May 2, 2005

Two-sample t-test

X1, ..., Xm ≈ N (µ1, θ2)Y1, ..., Yn ≈ N (µ2, θ2)Samples are independent.Compare the means of the distributions.

Hypothesis Tests:H1 : µ1 = µ2, µ1 ← µ2

H2 : µ1 = µ2, µ1 > µ2∈

By properties of Normal distribution and Fisher’s theorem: ∩m(x − µ1)

,

∩n(y − µ2) ≈ N (0, 1)

θ θ

2θx = x2 − (x)2 2, θy = y2 − (y)2

2nθy 2mθx 2α≈ m

2α≈ n−1, θ2 −1θ2

=T � 1

x − µ ≈ tn−1

n−1 (x2 − (x)2)

Calculate x − y

1x − µ1 ≈ ∩1 mN (0, 1) = N (0, ),

y − µ2 ≈ N (0, 1

)θ m θ n

1 1x − µ1 y − µ2 =

(x − y) − (µ1 − µ2) ≈ N (0, + )θ

− θ θ m n

(x − y) − (µ1 − µ2) ≈ N (0, 1)� 1θ + n

1 m

2nθy 2mθx 2α≈ m+

θ2 θ2

97

+n−2

Page 98: Lecture Notes(Introduction to Probability and Statistics)

� �

Construct the t-statistic:

N (0, 1) � 1

m+n−2 (α2

≈ tm+n−2

m+n−2)

(x − y) − (µ1 − µ2) (x − y) − (µ1 − µ2) = = ≈ tm+n−2T �

1 2nπ+ y

2mπx 1( 1 + ) m+1 n−2 (mθ

2 + nθ2 m n x y )1

m+1 n−2 (θ )+ π2m n

Construct the test:H1 : µ1 = µ2, H2 : µ1 = µ2∈If H1 is true, then:

= x − y ≈ tm+n−2 1

T �( 1 + 1 ) m+n−2 (mθ

2 + nθ2 y )m n x

Decision Rule:

β = {H1 : −c ← T ← c, H2 : otherwise} where the c values come from the t distribution with m + n - 2 degrees of freedom.c = T value where the area is equal to ∂/2, as the failure is both below -c and above +c

If the test were: H1 : µ1 ← µ2, H2 : µ1 > µ2,then the T value would correspond to an area in one tail, as the failure is only above +c.

There are different functions you can construct to approach the problem,based on different combinations of the data.This is why statistics is entirely based on your assumptions and the resulting

98

Page 99: Lecture Notes(Introduction to Probability and Statistics)

distribution function!

Example: Testing soil types in different locations by amount of aluminum oxide present.m = 14, x = 12.56 ≈ N(µ1, θ

2); n = 5, y = 17.32 ≈ N(µ2, θ2)

H1 : µ1 ← µ2; H2 : µ1 > µ2 ↔ T = −6.3 ≈ t14+5−2=17

c-value is 1.74, however this is a one-sided test. T is very negative, but we still accept H1

If the hypotheses were: H1 : µ1 ∼ µ2; H2 : µ1 < µ2,Then the T value of -6.3 is way to the left of the c-value of -1.74. Reject H1

Goodness-of-fit tests. Setup: consider r different categories for the random variable. The probability that a data point falls in category Bi is pi, with Σ pi = p1 + ... + pr = 1.
Hypotheses: H1 : pi = pi⁰ for all i = 1, ..., r; H2 : otherwise.

Example (9.1.1): three categories describe a family's financial situation: worse, the same, or better this year than last year.
Data: Worse = 58, Same = 64, Better = 67 (n = 189)
Hypothesis: H1 : p1 = p2 = p3 = 1/3; H2 : otherwise.
Ni = number of observations in category i. Under H1 you would expect N1 ≈ np1, N2 ≈ np2, N3 ≈ np3.
Measure the deviation using the central limit theorem:
(N1 − np1) / √(np1(1 − p1)) → N(0, 1)


However, keep in mind that the Ni are not independent (they sum to n). Drop part of the scaling to account for this (proof beyond the scope of the course):
(N1 − np1) / √(np1) → √(1 − p1) N(0, 1) = N(0, 1 − p1)

Pearson's Theorem:
T = (N1 − np1)²/(np1) + ... + (Nr − npr)²/(npr) → χ²_{r−1}
If H1 is true, then:
T = Σ_{i=1}^{r} (Ni − npi⁰)² / (npi⁰) → χ²_{r−1}
If H1 is not true, then:
T → +∞
Proof sketch: if p1 ≠ p1⁰, then
(N1 − np1⁰)/√(np1⁰) = (N1 − np1)/√(np1⁰) + n(p1 − p1⁰)/√(np1⁰) → N(0, σ²) + (±∞)
and this term is squared in T, so T → +∞.

Decision rule:
δ = {H1 : T ≤ c; H2 : T > c}
The example yields T = 0.666, to be compared with χ²_{r−1} = χ²_{3−1} = χ²_2.
The critical value c is much larger, therefore accept H1: the differences among the categories are not significant.
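The numbers in Example 9.1.1 can be reproduced with scipy (a sketch, not part of the notes); with uniform expected counts n/3, the statistic comes out to about 0.67.

```python
from scipy import stats

observed = [58, 64, 67]              # Worse, Same, Better (n = 189)
T, pval = stats.chisquare(observed)  # expected counts default to uniform, 63 each
c = stats.chi2.ppf(0.95, df=2)       # critical value, chi^2 with r - 1 = 2 d.o.f.
print(T, pval, c)                    # T ~ 0.67 << c, so accept H1
```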

** End of Lecture 32


18.05 Lecture 33 May 4, 2005

Simple goodness-of-fit test: H1 : pi = pi⁰ for i ≤ r; H2 : otherwise.
T = Σ_{i=1}^{r} (Ni − npi⁰)² / (npi⁰) ~ χ²_{r−1}
Decision rule:
δ = {H1 : T ≤ c; H2 : T > c}
If the distribution is continuous or has infinitely many discrete points:
Hypotheses: H1 : P = P0; H2 : P ≠ P0
Discretize the distribution into a finite number of intervals and count the points in each interval. The probability of each interval is known (the area under the density), so the problem becomes discrete.
New hypotheses: H1′ : pi = P(X ∈ Ii) = P0(X ∈ Ii) for all i; H2′ : otherwise.
If H1 is true → H1′ is also true (but not conversely, which is why the discretized hypothesis is weaker).
Rule of thumb: npi⁰ = nP0(X ∈ Ii) ≥ 5.
If this is too small, it is too unlikely to find points in the interval, and the statistic does not approximate the chi-square distribution well.

Example 9.1.2 → data ~ N(3.912, 0.25), n = 23. H1 : P ~ N(3.912, 0.25)
Choose k intervals of equal probability → pi⁰ = 1/k
Need n(1/k) = 23/k ≥ 5 → k = 4


X ~ N(3.912, 0.25) → (X − 3.912)/√0.25 ~ N(0, 1)
Dividing points: c1, c2 = 3.912, c3.
Find the normalized dividing points from the relation (ci − 3.912)/0.5 = ci′,
where the ci′ are the quartiles of the standard normal distribution:
c1′ = −0.675 → c1 = −0.675(0.5) + 3.912 = 3.575
c2′ = 0 → c2 = 0(0.5) + 3.912 = 3.912
c3′ = 0.675 → c3 = 0.675(0.5) + 3.912 = 4.249
Then count the number of data points in each interval.
Data: N1 = 3, N2 = 4, N3 = 8, N4 = 8; n = 23. Calculate the T statistic:
T = (3 − 23(0.25))²/(23(0.25)) + ... + (8 − 23(0.25))²/(23(0.25)) = 3.609
Now decide whether T is too large. α = 0.05 (significance level); χ²_{r−1} = χ²_3 → c = 7.815.
Decision rule: δ = {H1 : T ≤ 7.815; H2 : T > 7.815}. T = 3.609 < 7.815; conclusion: accept H1.
The counts are relatively uniform across the equal-probability intervals, consistent with the hypothesized distribution.
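A sketch of the same computation for Example 9.1.2 (not from the notes): the cut points are the quartiles of N(3.912, 0.25), and the counts N = (3, 4, 8, 8) are those quoted above.

```python
from scipy import stats

mu, sigma, n = 3.912, 0.5, 23
cuts = stats.norm.ppf([0.25, 0.5, 0.75], loc=mu, scale=sigma)  # ~ 3.575, 3.912, 4.249
observed = [3, 4, 8, 8]
T, pval = stats.chisquare(observed, f_exp=[n / 4] * 4)         # expected 5.75 per interval
c = stats.chi2.ppf(0.95, df=3)                                 # 7.815
print(cuts, T, pval, c)                                        # T ~ 3.61 < 7.815
```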

Composite hypotheses: H1 : pi = pi(θ), i ≤ r, for some θ ∈ Θ (the parameter set); H2 : not true for any choice of θ.
Step 1: Find the θ that best describes the data, i.e. the MLE of θ.
Likelihood function: φ(θ) = p1(θ)^{N1} p2(θ)^{N2} × ... × pr(θ)^{Nr}
Take the log of φ(θ) and maximize over θ → θ̂.
Step 2: See whether the best choice θ̂ is good enough. H1 : pi = pi(θ̂) for i ≤ r; H2 : otherwise.
T = Σ_{i=1}^{r} (Ni − npi(θ̂))² / (npi(θ̂)) ~ χ²_{r−s−1}
where s = dimension of the parameter set (the number of free parameters).
Example: N(µ, σ²) → s = 2. Many free parameters make the family of distributions more flexible; this flexibility must be subtracted out by lowering the degrees of freedom.
Decision rule: δ = {H1 : T ≤ c; H2 : T > c}. Choose c from χ²_{r−s−1} with tail area = α.

Example: (pg. 543) Gene has 2 possible alleles A1, A2

Genotypes: A1A1, A1A2, A2A2

Test that P(A1) = θ, P(A2) = 1 − θ,


but you only observe genotype.

H1 : P(A1A1) = θ² (count N1), P(A1A2) = 2θ(1 − θ) (count N2), P(A2A2) = (1 − θ)² (count N3)
r = 3 categories, s = 1 (only one parameter, θ)
φ(θ) = (θ²)^{N1} (2θ(1 − θ))^{N2} ((1 − θ)²)^{N3} = 2^{N2} θ^{2N1+N2} (1 − θ)^{2N3+N2}
log φ(θ) = N2 log 2 + (2N1 + N2) log θ + (2N3 + N2) log(1 − θ)
∂/∂θ log φ(θ) = (2N1 + N2)/θ − (2N3 + N2)/(1 − θ) = 0
(2N1 + N2)(1 − θ) − (2N3 + N2)θ = 0
θ̂ = (2N1 + N2)/(2N1 + 2N2 + 2N3) = (2N1 + N2)/(2n)
Compute θ̂ from the data, then p1⁰ = θ̂², p2⁰ = 2θ̂(1 − θ̂), p3⁰ = (1 − θ̂)².
T = Σ_{i=1}^{r} (Ni − npi⁰)² / (npi⁰) ~ χ²_{r−s−1} = χ²_1
For α = 0.05, c = 3.841 from the χ²_1 distribution.
Decision rule: δ = {H1 : T ≤ 3.841; H2 : T > 3.841}
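A sketch of the whole composite test with hypothetical genotype counts (the counts below are invented for illustration; only the formulas come from the notes).

```python
import numpy as np
from scipy import stats

N = np.array([38, 45, 17])            # hypothetical counts of A1A1, A1A2, A2A2
n = N.sum()
theta = (2 * N[0] + N[1]) / (2 * n)   # MLE of theta
p = np.array([theta**2, 2 * theta * (1 - theta), (1 - theta)**2])
T = np.sum((N - n * p)**2 / (n * p))  # chi^2 statistic, r - s - 1 = 1 d.o.f.
c = stats.chi2.ppf(0.95, df=1)        # 3.841
print(theta, T, c)
```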

** End of Lecture 33


18.05 Lecture 34 May 6, 2005

Contingency tables, test of independence.

Each of the n observations has two features: X1 ∈ {1, ..., a} (Feature 1) and X2 ∈ {1, ..., b} (Feature 2).
Random sample: n independent observations of the pair (X1, X2).
Let Nij = number of observations with Feature 1 = i and Feature 2 = j, arranged in an a × b contingency table:

              F2 = 1   F2 = 2   ...   F2 = b   row total
  F1 = 1       N11      N12     ...    N1b       N1+
  ...          ...      ...     ...    ...       ...
  F1 = a       Na1      Na2     ...    Nab       Na+
  col. total   N+1      N+2     ...    N+b        n

Question: are X1 and X2 independent?
Example: when asked whether your finances are better, worse, or the same as last year, see if the answer depends on income range:

              Worse   Same   Better
  ≤ 20K         20     15      12
  20K - 30K     24     27      32
  ≥ 30K         14     22      23

Check if the differences and subtle trend are significant or random.

θij = P(i, j) = P(i) × P(j) for all cells (i, j), if the features are independent.
The independence hypothesis can therefore be written as: H1 : θij = pi qj, where p1 + ... + pa = 1 and q1 + ... + qb = 1; H2 : otherwise.
r = number of categories = ab; s = dimension of the parameter set = a + b − 2.
The MLEs p̂i, q̂j need to be found, and then
T = Σ_{i,j} (Nij − n p̂i q̂j)² / (n p̂i q̂j) ~ χ²_{r−s−1}, with r − s − 1 = ab − (a + b − 2) − 1 = (a − 1)(b − 1).
The distribution has (a − 1)(b − 1) degrees of freedom.

Likelihood:
φ(p, q) = Π_{i,j} (pi qj)^{Nij} = (Π_i pi^{Ni+}) × (Π_j qj^{N+j})
Note: Ni+ = Σ_j Nij and N+j = Σ_i Nij.
Maximize each factor to maximize the product.


Σ_i Ni+ log pi → max, subject to p1 + ... + pa = 1.
Use a Lagrange multiplier to solve the constrained maximization:
Σ_i Ni+ log pi − λ(Σ_i pi − 1) → max over p, min over λ
∂/∂pi : Ni+/pi − λ = 0 → pi = Ni+/λ
Σ_i pi = Σ_i Ni+/λ = n/λ = 1 → λ = n → p̂i = Ni+/n
Similarly, p̂i = Ni+/n and q̂j = N+j/n, so
T = Σ_{i,j} (Nij − Ni+N+j/n)² / (Ni+N+j/n) ~ χ²_{(a−1)(b−1)}
Decision rule: δ = {H1 : T ≤ c; H2 : T > c}. Choose c from the chi-square distribution with (a − 1)(b − 1) degrees of freedom, with tail area equal to the level of significance α.

From the above example:
N1+ = 47, N2+ = 83, N3+ = 59; N+1 = 58, N+2 = 64, N+3 = 67; n = 189.
Each cell contributes a term to the T statistic:
T = (20 − 47(58)/189)²/(47(58)/189) + ... = 5.210
Is T too large? T ~ χ²_{(3−1)(3−1)} = χ²_4; for this distribution, c = 9.488.
According to the decision rule, accept H1, because 5.210 ≤ 9.488.
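The same independence test can be checked with scipy.stats.chi2_contingency (a sketch, not part of the notes), which builds the expected counts Ni+N+j/n and returns T ≈ 5.21 with 4 degrees of freedom.

```python
import numpy as np
from scipy import stats

table = np.array([[20, 15, 12],    # <= 20K:  Worse, Same, Better
                  [24, 27, 32],    # 20K-30K
                  [14, 22, 23]])   # >= 30K
T, pval, dof, expected = stats.chi2_contingency(table, correction=False)
c = stats.chi2.ppf(0.95, df=dof)   # dof = (3-1)(3-1) = 4, c = 9.488
print(T, dof, pval, c)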

Test of homogeneity: very similar to the independence test.


              Category 1   ...   Category b
  Group 1        N11        ...      N1b
  ...            ...        ...      ...
  Group a        Na1        ...      Nab

Two ways of sampling:
1. Sample from the entire population.
2. Sample from each group separately, independently between the groups.
Question: is P(category j | group i) = P(category j)? This is the same as independence testing:
P(category j, group i) = P(category j) P(group i)
→ P(Cj | Gi) = P(Cj Gi)/P(Gi) = P(Cj)P(Gi)/P(Gi) = P(Cj)

Consider a situation where group 1 is 99% of the population and group 2 is 1%. You would be better off sampling separately and independently, say 100 from each group; you then just need to renormalize within the population. The test becomes a test of independence.

Example (pg. 560): 100 people were asked whether service by a fire station was satisfactory. Then, after a fire occurred, the same people were asked again. See if the opinions changed.

                Satisfied   Unsatisfied
  Before fire      80           20
  After fire       72           28

But you can't use this table if you are asking the same people, since the rows are not independent! A better way to arrange the data:

                           After, Satisfied   After, Not Satisfied
  Originally satisfied            70                   10
  Originally unsatisfied           2                   18

If the sample is taken from the entire population, this is OK. Otherwise you are sampling from a dependent population.

** End of Lecture 34


18.05 Lecture 35 May 9, 2005

Kolmogorov-Smirnov (KS) goodness-of-fit test.
The chi-square test is used with discrete distributions; if the distribution is continuous, we split it into intervals and treat it as discrete. This makes the hypothesis weaker, however, since the distribution isn't characterized fully. The KS test uses the entire distribution and is therefore more consistent.

Hypothesis test:
H1 : P = P0
H2 : P ≠ P0, where P0 is continuous. In this test, the c.d.f. is used.
Reminder: the c.d.f. F(x) = P(X ≤ x) goes from 0 to 1.
The c.d.f. describes the entire distribution. Approximate the c.d.f. from the data →
Empirical distribution function:
Fn(x) = (1/n) Σ_{i=1}^{n} I(Xi ≤ x) = #(points ≤ x)/n
By the LLN, Fn(x) → E I(X1 ≤ x) = P(X1 ≤ x) = F(x).
The empirical c.d.f. constructed from the data jumps by 1/n at each data point and converges to the true c.d.f. for large n. Find the largest difference (supremum) between the empirical c.d.f. and the actual one:
sup_x |Fn(x) − F(x)| → 0 as n → ∞


For a fixed x:
√n (Fn(x) − F(x)) = (1/√n) Σ_{i=1}^{n} (I(Xi ≤ x) − E I(X1 ≤ x))
By the central limit theorem, this converges to N(0, Var(I(Xi ≤ x))), where Var(I(Xi ≤ x)) = p(1 − p) = F(x)(1 − F(x)).

You can tell exactly how close the values should be!

Dn = √n sup_x |Fn(x) − F(x)|
a) Under H1, Dn has a proper, known distribution. b) Under H2, Dn → +∞.
Indeed, if the true c.d.f. F(x) is some distance ε away from the hypothesized c.d.f. F0, then Fn(x) → F(x), so eventually |Fn(x) − F0(x)| > ε/2 and √n |Fn(x) − F0(x)| > √n ε/2 → +∞.

The distribution of Dn does not depend on F(x); this is what allows us to construct the KS test.
Dn = √n sup_x |Fn(x) − F(x)| = √n sup_y |Fn(F^{−1}(y)) − y|, using the change of variable y = F(x), x = F^{−1}(y), y ∈ [0, 1].
Fn(F^{−1}(y)) = (1/n) Σ_{i=1}^{n} I(Xi ≤ F^{−1}(y)) = (1/n) Σ_{i=1}^{n} I(F(Xi) ≤ y) = (1/n) Σ_{i=1}^{n} I(Yi ≤ y)
where Yi = F(Xi). The distribution of the Yi does not depend on F:
P(Yi ≤ y) = P(F(Xi) ≤ y) = P(Xi ≤ F^{−1}(y)) = F(F^{−1}(y)) = y,
so if Xi ~ F(x), then F(Xi) is uniform on [0, 1], whatever F is.
The distribution of Dn is therefore tabulated for different values of n (see the table on pg. 570). For large n it converges to a limiting distribution, whose table you can use instead:
P(Dn ≤ t) → H(t) = 1 − 2 Σ_{i=1}^{∞} (−1)^{i−1} e^{−2i²t²}
This function arises from the Brownian motion of a particle suspended in liquid.


H(t) is the distribution of the largest deviation of the particle from its starting point (Brownian motion); this maximum deviation has the same limiting distribution as Dn.
Decision rule: δ = {H1 : Dn ≤ c; H2 : Dn > c}. Choose c such that the area to the right of c equals α.

Example: a set of n = 10 data points:
0.58, 0.42, 0.52, 0.33, 0.43, 0.23, 0.58, 0.76, 0.53, 0.64
H1 : P is uniform on [0, 1], so F(x) = x.
Step 1: Arrange in increasing order: 0.23, 0.33, 0.42, 0.43, 0.52, 0.53, 0.58, 0.58, 0.64, 0.76.
Step 2: Find the largest difference between the empirical c.d.f. and F(x).
Note: the largest difference occurs just before or just after a jump, so only the jump points need to be considered.

  x:                      0.23  0.33  0.42  ...
  F(x):                   0.23  0.33  0.42  ...
  Fn(x) before the jump:  0     0.1   0.2   ...
  Fn(x) after the jump:   0.1   0.2   0.3   ...

Calculate the differences |Fn(x) − F(x)|:

  |Fn(x) before − F(x)|:  0.23  0.23  0.22  ...
  |Fn(x) after − F(x)|:   0.13  0.13  0.12  ...

The largest difference occurs near the end, at x = 0.64: |0.9 − 0.64| = 0.26, so Dn = √10 (0.26) = 0.82.
Decision rule: δ = {H1 : Dn ≤ c; H2 : Dn > c}. For α = 0.05, c = 1.35; since 0.82 ≤ 1.35, accept H1.
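For reference, a sketch of the same KS test with scipy (not from the notes): scipy reports the unscaled supremum sup|Fn − F| = 0.26 plus an exact p-value, so multiply by √n to compare with the tabulated value 1.35.

```python
import numpy as np
from scipy import stats

data = [0.58, 0.42, 0.52, 0.33, 0.43, 0.23, 0.58, 0.76, 0.53, 0.64]
D, pval = stats.kstest(data, "uniform")  # H1: uniform on [0, 1]
Dn = np.sqrt(len(data)) * D              # ~ 0.82 < 1.35 -> accept H1
print(D, Dn, pval)
```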

** End of Lecture 35


18.05 Lecture 36 May 11, 2005

Review of Test 2 (see solutions for more details)

Problem 1:
P(X = 2c) = 1/2, P(X = c/2) = 1/2
EX = 2c(1/2) + (c/2)(1/2) = (5/4)c
→ after n plays, the expected fortune is (5/4)^n c.

Problem 2: X1, ..., Xn, n = 1000, P(Xi = 1) = 1/2, P(Xi = 0) = 1/2
µ = EX1 = 1/2, Var(X1) = p(1 − p) = 1/4
Sn = X1 + ... + Xn; we want k such that P(440 ≤ Sn ≤ k) = 0.5.
Standardize: (Sn − nEX1)/√(n Var(X1)) = (Sn − 1000(1/2))/√(1000(1/4)) = (Sn − 500)/√250
P((440 − 500)/√250 ≤ Z ≤ (k − 500)/√250) = 0.5
By the central limit theorem:
Φ((k − 500)/√250) − Φ((440 − 500)/√250) = Φ((k − 500)/√250) − Φ(−3.79) = Φ((k − 500)/√250) − 0.0001 = 0.5
Therefore Φ((k − 500)/√250) = 0.5001 → (k − 500)/√250 ≈ 0 → k = 500.
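A quick numerical check of Problem 2 (a sketch, not in the notes), using the standard normal c.d.f. and quantile function.

```python
import numpy as np
from scipy import stats

lo = (440 - 500) / np.sqrt(250)                  # ~ -3.79
target = 0.5 + stats.norm.cdf(lo)                # ~ 0.5001
k = 500 + np.sqrt(250) * stats.norm.ppf(target)  # ~ 500
print(lo, target, k)
```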

Problem 3:
f(x) = θ e^θ / x^{θ+1} · I(x ≥ e); likelihood φ(θ) = θ^n e^{nθ} / (Π xi)^{θ+1} → max
Easier to maximize the log-likelihood:
log φ(θ) = n log θ + nθ − (θ + 1) log Π xi
n/θ + n − log Π xi = 0 → θ̂ = n / (log Π xi − n)

Problem 5: confidence intervals; keep the formulas in mind!
x̄ − c √((1/(n−1))(x̄² − (x̄)²)) ≤ µ ≤ x̄ + c √((1/(n−1))(x̄² − (x̄)²))
Find c from the t distribution with n − 1 degrees of freedom.


Set up c such that the area between −c and c equals 1 − α. In this example, c = 1.833.
For σ²:
n(x̄² − (x̄)²)/c2 ≤ σ² ≤ n(x̄² − (x̄)²)/c1
Find c1, c2 from the chi-square distribution with n − 1 degrees of freedom, such that the area between c1 and c2 equals 1 − α. In this example, c1 = 3.325, c2 = 16.92.
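The critical values quoted above can be read off with scipy (a sketch, assuming n = 10 so that there are 9 degrees of freedom, as in the Test 2 solution).

```python
from scipy import stats

c_t = stats.t.ppf(0.95, df=9)    # 1.833: area 0.90 between -c and c
c1 = stats.chi2.ppf(0.05, df=9)  # 3.325
c2 = stats.chi2.ppf(0.95, df=9)  # 16.92
print(c_t, c1, c2)
```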

Problem 4:
Prior distribution: f(θ) = (β^α / Γ(α)) θ^{α−1} e^{−βθ}
f(x1, ..., xn | θ) = θ^n e^{nθ} / (Π xi)^{θ+1}
Posterior distribution:
f(θ | x1, ..., xn) ∝ f(θ) f(x1, ..., xn | θ)
∝ θ^{α−1} e^{−βθ} θ^n e^{nθ} / (Π xi)^{θ+1} ∝ θ^{α+n−1} e^{−βθ+nθ} e^{−θ log Π xi} = θ^{(α+n)−1} e^{−(β−n+log Π xi)θ}
Posterior = Γ(α + n, β − n + log Π xi)
Bayes estimator (mean of the posterior):
θ̂ = (α + n) / (β − n + log Π xi)


Final exam format: cumulative, with emphasis on material after Test 2; 9-10 questions.
Practice test posted Friday afternoon.
Review session on Tuesday night, 5pm. Bring questions!

Optional PSet, pg. 548, Problem 3:
A gene has 3 alleles, so there are 6 possible genotypes.
p1 = θ1², p2 = θ2², p3 = (1 − θ1 − θ2)²
p4 = 2θ1θ2, p5 = 2θ1(1 − θ1 − θ2), p6 = 2θ2(1 − θ1 − θ2)
Number of categories → r = 6, s = 2 free parameters.
T = Σ_{i=1}^{r} (Ni − np̂i)² / (np̂i) ~ χ²_{r−s−1} = χ²_3
φ(θ1, θ2) = (θ1²)^{N1} (θ2²)^{N2} ((1 − θ1 − θ2)²)^{N3} (2θ1θ2)^{N4} (2θ1(1 − θ1 − θ2))^{N5} (2θ2(1 − θ1 − θ2))^{N6}
= 2^{N4+N5+N6} θ1^{2N1+N4+N5} θ2^{2N2+N4+N6} (1 − θ1 − θ2)^{2N3+N5+N6}
Maximize the log-likelihood over the parameters:
log φ = const. + (2N1 + N4 + N5) log θ1 + (2N2 + N4 + N6) log θ2 + (2N3 + N5 + N6) log(1 − θ1 − θ2)
Write log φ = a log θ1 + b log θ2 + c log(1 − θ1 − θ2) + const. and maximize over θ1, θ2:
∂/∂θ1 : a/θ1 − c/(1 − θ1 − θ2) = 0; ∂/∂θ2 : b/θ2 − c/(1 − θ1 − θ2) = 0
Solve for θ1, θ2: a/θ1 = b/θ2 → aθ2 = bθ1
a − aθ1 − aθ2 − cθ1 = 0 → a − aθ1 − bθ1 − cθ1 = 0 →
θ1 = a/(a + b + c), θ2 = b/(a + b + c)
In terms of the counts (with n = Σ Ni):
θ̂1 = (2N1 + N4 + N5)/(2n) = 1/5, θ̂2 = (2N2 + N4 + N6)/(2n) = 1/2


Decision rule: δ = {H1 : T ≤ c; H2 : T > c}. Find c from the chi-square distribution with r − s − 1 = 3 degrees of freedom, area above c = α → c = 7.815.

Problem 5:
There are 4 blood types (O, A, B, AB) and 2 Rhesus factors (+, −). Test for independence:

          O     A     B    AB   total
  +      82    89    54    19     244
  −      13    27     7     9      56
  total  95   116    61    28     300

T = (82 − 244(95)/300)²/(244(95)/300) + ...
Find the T statistic over all 8 cells. T ~ χ²_{(a−1)(b−1)} = χ²_3, and the test proceeds as before.
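A sketch (not from the notes) of Problem 5 run through scipy's contingency-table test; the table is the one given above.

```python
import numpy as np
from scipy import stats

table = np.array([[82, 89, 54, 19],    # Rh+ counts for O, A, B, AB
                  [13, 27,  7,  9]])   # Rh- counts
T, pval, dof, expected = stats.chi2_contingency(table, correction=False)
c = stats.chi2.ppf(0.95, df=dof)       # dof = (2-1)(4-1) = 3
print(T, dof, pval, c)
```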

** End of Lecture 36


18.05 Lecture 37 May 17, 2005

Final Exam Review - solutions to practice final.

1. f(x | v) = u v^{−u} x^{u−1} e^{−(x/v)^u} for x ≥ 0; 0 otherwise. Find the MLE of v.
Maximize over v the likelihood function (the joint p.d.f.):
φ(v) = u^n v^{−nu} (Π xi)^{u−1} e^{−Σ (xi/v)^u}
log φ(v) = n log u − nu log v + (u − 1) log(Π xi) − Σ (xi/v)^u
Maximize with respect to v:
∂/∂v log φ(v) = −nu/v + u (Σ xi^u) v^{−(u+1)} = 0
→ Σ xi^u / v^u = n → v^u = (1/n) Σ xi^u
→ v_MLE = ((1/n) Σ xi^u)^{1/u}

2. X1, ..., Xn ~ U[0, θ], f(x | θ) = (1/θ) I(0 ≤ x ≤ θ)
Prior: f(θ) = (192/θ⁴) I(θ ≥ 4)
Data: X1 = 5, X2 = 3, X3 = 8.
Posterior: f(θ | x1, ..., xn) ∝ f(x1, ..., xn | θ) f(θ)
f(x1, ..., xn | θ) = (1/θ^n) I(0 ≤ all x's ≤ θ) = (1/θ^n) I(max(X1, ..., Xn) ≤ θ)
f(θ | x1, ..., xn) ∝ (1/θ^{n+4}) I(θ ≥ 4) I(max(x1, ..., xn) ≤ θ) ∝ (1/θ^{n+4}) I(θ ≥ 8)
Find the constant c so that the posterior c θ^{−(n+4)} I(θ ≥ 8) integrates to 1. With n = 3:
1 = ∫_8^∞ c θ^{−7} dθ = c θ^{−6}/(−6) |_8^∞ = c/(6 · 8⁶)
→ c = 6 × 8⁶

3. Two observations (X1, X2) from f(x).
H1 : f(x) = 1/2 for 0 ≤ x ≤ 2
H2 : f(x) = 1/3 for 0 ≤ x ≤ 1, 2/3 for 1 < x ≤ 2
H3 : f(x) = 3/4 for 0 ≤ x ≤ 1, 1/4 for 1 < x ≤ 2
Find the decision rule δ that minimizes α1(δ) + 2α2(δ) + 2α3(δ), where αi(δ) = P(δ ≠ Hi | Hi).
The Bayes decision rule picks, for each region of the data, the hypothesis maximizing λ(i) fi(x1, x2), where λ(1) = 1, λ(2) = λ(3) = 2 are the weights in the objective.


  λ(i) fi(x1) fi(x2)             H1                    H2                    H3
  both x1, x2 ∈ [0, 1]        (1)(1/2)(1/2) = 1/4   (2)(1/3)(1/3) = 2/9   (2)(3/4)(3/4) = 9/8
  one in [0, 1], one in [1, 2]  (1)(1/2)(1/2) = 1/4   (2)(1/3)(2/3) = 4/9   (2)(3/4)(1/4) = 3/8
  both in [1, 2]              (1)(1/2)(1/2) = 1/4   (2)(2/3)(2/3) = 8/9   (2)(1/4)(1/4) = 1/8

Decision rule: δ = {H1 : never picked; H2 : both in [1, 2], or one in [0, 1] and one in [1, 2]; H3 : both in [0, 1]}
With only two hypotheses the rule reduces to: choose H1 if f1(x)/f2(x) > λ(2)/λ(1), choose H2 otherwise.

4. f(x | µ) = (1/(x√(2π))) e^{−(1/2)(ln x − µ)²} for x ≥ 0, 0 for x < 0.
If X has this distribution, find the distribution of ln X. Let Y = ln X.
c.d.f. of Y: P(Y ≤ y) = P(ln X ≤ y) = P(X ≤ e^y) = ∫_0^{e^y} f(x) dx
However, you don't need to integrate; the p.d.f. of Y is
f_Y(y) = d/dy P(Y ≤ y) = f(e^y) × e^y = (1/(e^y √(2π))) e^{−(1/2)(ln e^y − µ)²} × e^y = (1/√(2π)) e^{−(1/2)(y − µ)²} ~ N(µ, 1)

5. n = 10, H1 : µ = −1, H2 : µ = 1, α = 0.05, α1(δ) = P1(f1/f2 < c).
δ = {H1 : f1(x)/f2(x) ≥ c; H2 : otherwise}
f1(x) = (1/(Π xi)) (1/√(2π))^n e^{−(1/2) Σ (ln xi + 1)²}
f2(x) = (1/(Π xi)) (1/√(2π))^n e^{−(1/2) Σ (ln xi − 1)²}
f1(x)/f2(x) = e^{−2 Σ ln xi} ≥ c  ⟺  Σ ln xi ≤ c′
δ = {H1 : Σ ln xi ≤ c = −4.81; H2 : Σ ln xi > c = −4.81}
Find c so that the level is 0.05; under H1, ln Xi ~ N(−1, 1):
0.05 = P1(Σ ln xi ≥ c) = P1((Σ ln xi − nµ)/(√n σ) ≥ (c − nµ)/(√n σ)) = P1(Z ≥ (c − nµ)/(√n σ))
(c − nµ)/(√n σ) = 1.64 → c = −10 + 1.64 √10 = −4.81
Power = 1 − type 2 error = 1 − P2(δ ≠ H2) = 1 − P2(Σ ln xi < c); under H2, ln Xi ~ N(1, 1):
= 1 − P2((Σ ln xi − n(1))/√n < (−4.81 − 10)/√10) ≈ 1


6. H1 : p1 = θ/2, p2 = θ/3, p3 = 1 − 5θ/6, for some θ ∈ [0, 1].
Step 1) Find the MLE θ̂.
Step 2) Set p̂1 = θ̂/2, p̂2 = θ̂/3, p̂3 = 1 − 5θ̂/6.
Step 3) Calculate the T statistic:
T = Σ_{i=1}^{r} (Ni − np̂i)² / (np̂i) ~ χ²_{r−s−1} = χ²_{3−1−1} = χ²_1
φ(θ) = (θ/2)^{N1} (θ/3)^{N2} (1 − 5θ/6)^{N3}
log φ(θ) = (N1 + N2) log θ + N3 log(1 − 5θ/6) − N1 log 2 − N2 log 3 → max over θ
∂/∂θ : (N1 + N2)/θ − (5/6) N3/(1 − 5θ/6) = 0
N1 + N2 − (5/6)(N1 + N2)θ − (5/6) N3 θ = 0
Solve for θ: θ̂ = (6/5)(N1 + N2)/n = 23/25
Compute the statistic: T = 0.586. δ = {H1 : T ≤ 3.841; H2 : T > 3.841} → accept H1.

7. n = 17, x̄ = 3.2, x̄² − (x̄)² = 0.09, sample from N(µ, σ²). H1 : µ ≤ 3, H2 : µ > 3, at α = 0.05.
T = (x̄ − µ0)/√((1/(n−1))(x̄² − (x̄)²)) = (3.2 − 3)/√((1/16)(0.09)) = 2.67 ~ t_{n−1}
Choose the decision rule from the t table with 17 − 1 = 16 degrees of freedom:
δ = {H1 : T ≤ 1.746; H2 : T > 1.746} → H1 is rejected.
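A one-line check of Problem 7 from the summary statistics (a sketch, not in the notes).

```python
import numpy as np
from scipy import stats

n, xbar, s2 = 17, 3.2, 0.09             # s2 = mean(x^2) - mean(x)^2
T = (xbar - 3) / np.sqrt(s2 / (n - 1))  # ~ 2.67
c = stats.t.ppf(0.95, df=n - 1)         # ~ 1.746 (one-sided, alpha = 0.05)
print(T, c, T > c)                      # True -> reject H1
```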

8. Calculate the T statistic:
T = Σ_{i,j} (Nij − Ni+N+j/n)² / (Ni+N+j/n) = 12.1


T ~ χ²_{(a−1)(b−1)} = χ²_{3×2} = χ²_6; at α = 0.05, c = 12.59.
δ = {H1 : T ≤ 12.59; H2 : T > 12.59} → accept H1. But note that if the significance level changes to α = 0.10, then c = 10.64 and H1 would be rejected.

9. f(x) = 1/2 · I(0 ≤ x ≤ 2), so F(x) = ∫_{−∞}^x f(t) dt = x/2 for 0 ≤ x ≤ 2.

  x:                      0.02  0.18  0.20  ...
  F(x):                   0.01  0.09  0.10  ...
  Fn(x) before the jump:  0     0.1   0.2   ...
  Fn(x) after the jump:   0.1   0.2   0.3   ...
  |Fn before − F(x)|:     0.01  0.01  0.1   ...
  |Fn after − F(x)|:      0.09  0.11  0.2   ...

Δn = max_x |Fn(x) − F(x)| = 0.295, so Dn = √10 (0.295) = 0.932872.
c for α = 0.05 is 1.35.
δ = {H1 : Dn ≤ 1.35; H2 : Dn > 1.35}; 0.932872 ≤ 1.35 → accept H1.

** End of Lecture 37

*** End of 18.05 Spring 2005 Lecture Notes.


18.05. Practice test 1.

(1) Suppose that 10 cards, of which five are red and five are green, are placed at random in 10 envelopes, of which five are red and five are green. Determine the probability that exactly two envelopes will contain a card with a matching color.

(2) Suppose that a box contains one fair coin and one coin with a head on each side. Suppose that a coin is selected at random and that when it is tossed three times, a head is obtained three times. Determine the probability that the coin is the fair coin.

(3) Suppose that either of two instruments might be used for making a certain measurement. Instrument 1 yields a measurement whose p.d.f. is

f1(x) = 2x for 0 < x < 1, and 0 otherwise.

Instrument 2 yields a measurement whose p.d.f. is

f2(x) = 3x² for 0 < x < 1, and 0 otherwise.

Suppose that one of the two instruments is chosen at random and a measurement X is made with it.
(a) Determine the marginal p.d.f. of X.
(b) If X = 1/4, what is the probability that instrument 1 was used?

(4) Let Z be the rate at which customers are served in a queue. Assume that Z has p.d.f.
f(z) = 2e^{−2z} for z > 0, and 0 otherwise.
Find the p.d.f. of the average waiting time T = 1/Z.

(5) Suppose that X and Y are independent random variables with the following p.d.f.:
f(x) = e^{−x} for x > 0, and 0 otherwise.
Determine the joint p.d.f. of the following random variables:
U = X/(X + Y) and V = X + Y.


18.05. Practice test 2.

(1) page 280, No. 5
(2) page 291, No. 11
(3) page 354, No. 10
(4) Suppose that X1, ..., Xn form a random sample from a distribution with p.d.f.
f(x | θ) = e^{θ−x} for x ≥ θ, and 0 for x < θ.
Find the MLE of the unknown parameter θ.
(5) page 415, No. 7. (Also compute a 90% confidence interval for σ².)

Extra practice problems: page 196, No. 9; page 346, No. 19; page 396, No. 10; page 409, No. 3; page 415, No. 3.

Go over psets 5, 6, 7 and examples in class.


18.05. Test 1.

A

(1) Consider the events A = {HHH at least once} and B = {TTT at least once}. We want to find the probability P(A ∩ B). The complement of A ∩ B is A^c ∪ B^c, i.e. no TTT or no HHH, and
P(A ∩ B) = 1 − P(A^c ∪ B^c).
To find the last one we can use the probability-of-a-union formula
P(A^c ∪ B^c) = P(A^c) + P(B^c) − P(A^c ∩ B^c).
The probability of A^c, i.e. no HHH, means that on each toss we don't get HHH. The probability of not getting HHH on one toss is 7/8 and therefore
P(A^c) = (7/8)^{10}.
The same holds for P(B^c). The probability of A^c ∩ B^c, i.e. no HHH and no TTT, means that on each toss we get neither HHH nor TTT. The probability of getting neither HHH nor TTT on one toss is 6/8 and therefore
P(A^c ∩ B^c) = (6/8)^{10}.
Finally, we get
P(A ∩ B) = 1 − (7/8)^{10} − (7/8)^{10} + (6/8)^{10}.


(2) We have

P(F) = P(M) = 0.5, P(CB | M) = 0.05 and P(CB | F) = 0.0025.
Using Bayes' formula,
P(M | CB) = P(CB | M)P(M) / (P(CB | M)P(M) + P(CB | F)P(F)) = (0.05 × 0.5) / (0.05 × 0.5 + 0.0025 × 0.5)


(3) We want to find
f(y | x) = f(x, y)/f1(x)
which is defined only when f1(x) > 0. To find f1(x) we have to integrate out y, i.e.
f1(x) = ∫ f(x, y) dy.
To find the limits we notice that for a given x, 0 < y² < 1 − x², which is non-empty only if x² < 1, i.e. −1 < x < 1. Then −√(1 − x²) < y < √(1 − x²). So if −1 < x < 1 we get
f1(x) = ∫_{−√(1−x²)}^{√(1−x²)} c(x² + y²) dy = c(x²y + y³/3) |_{−√(1−x²)}^{√(1−x²)} = 2c(x²√(1 − x²) + (1 − x²)^{3/2}/3).
Finally, for −1 < x < 1,
f(y | x) = c(x² + y²) / (2c(x²√(1 − x²) + (1 − x²)^{3/2}/3)) = (x² + y²) / (2x²√(1 − x²) + (2/3)(1 − x²)^{3/2})
if −√(1 − x²) < y < √(1 − x²), and 0 otherwise.


(4) Let us find the c.d.f. first.
P(Y ≤ y) = P(max(X1, X2) ≤ y) = P(X1 ≤ y, X2 ≤ y) = P(X1 ≤ y)P(X2 ≤ y).
The c.d.f. of X1 and X2 is
P(X1 ≤ y) = P(X2 ≤ y) = ∫_{−∞}^{y} f(x) dx.
If y ≤ 0, this is
P(X1 ≤ y) = ∫_{−∞}^{y} e^x dx = e^x |_{−∞}^{y} = e^y
and if y > 0 this is
P(X1 ≤ y) = ∫_{−∞}^{0} e^x dx = e^x |_{−∞}^{0} = 1.
Finally, the c.d.f. of Y is
P(Y ≤ y) = e^{2y} for y ≤ 0; 1 for y > 0.
Taking the derivative, the p.d.f. of Y is
f(y) = 2e^{2y} for y ≤ 0; 0 for y > 0.


[Figure 1: Region {x ≤ zy} intersected with the unit square, shown for z ≤ 1 and z > 1.]

(5) Let us find the c.d.f. of Z = X/Y first. Note that for X, Y ∈ (0, 1), Z can take only values > 0, so let z > 0. Then
P(Z ≤ z) = P(X/Y ≤ z) = P(X ≤ zY) = ∫∫_{x ≤ zy} f(x, y) dx dy.
To find the limits, we have to consider the intersection of the set {x ≤ zy} with the square 0 < x < 1, 0 < y < 1. When z ≤ 1, the limits are
∫_0^1 ∫_0^{zy} (x + y) dx dy = ∫_0^1 (x²/2 + xy) |_0^{zy} dy = ∫_0^1 (z²/2 + z) y² dy = z²/6 + z/3.
When z ≥ 1, the limits are different:
∫_0^1 ∫_{x/z}^1 (x + y) dy dx = ∫_0^1 (xy + y²/2) |_{x/z}^1 dx = 1 − 1/(6z²) − 1/(3z).
So the c.d.f. of Z is
P(Z ≤ z) = z²/6 + z/3 for 0 < z ≤ 1; 1 − 1/(6z²) − 1/(3z) for z > 1.
The p.d.f. is
f(z) = z/3 + 1/3 for 0 < z ≤ 1; 1/(3z³) + 1/(3z²) for z > 1,
and zero otherwise, i.e. for z ≤ 0.


18.05. Test 2.

(1) Let X be the player's fortune after one play. Then
P(X = 2c) = 1/2 and P(X = c/2) = 1/2
and the expected value is
EX = 2c × 1/2 + (c/2) × 1/2 = (5/4)c.
Repeating this n times, the expected value after n plays is (5/4)^n c.

(2) Let Xi, i = 1, ..., n = 1000 be the indicators of getting heads. Then Sn = X1 + ... + Xn is the total number of heads. We want to find k such that P(440 ≤ Sn ≤ k) ≈ 0.5. Since µ = EXi = 0.5 and σ² = Var(Xi) = 0.25, by the central limit theorem
Z = (Sn − nµ)/(√n σ) = (Sn − 500)/√250
is approximately standard normal, i.e.
P(440 ≤ Sn ≤ k) = P((440 − 500)/√250 = −3.79 ≤ Z ≤ (k − 500)/√250)
≈ Φ((k − 500)/√250) − Φ(−3.79) = 0.5.
From the table we find that Φ(−3.79) = 0.0001 and therefore
Φ((k − 500)/√250) = 0.5001.
Using the table once again we get (k − 500)/√250 ≈ 0 and k ≈ 500.

(3) The likelihood function is
φ(θ) = θ^n e^{nθ} / (Π Xi)^{θ+1}
and the log-likelihood is
log φ(θ) = n log θ + nθ − (θ + 1) log Π Xi.
We want to find the maximum of the log-likelihood, so taking the derivative we get
n/θ + n − log Π Xi = 0
and solving for θ, the MLE is
θ̂ = n / (log Π Xi − n).

(4) The prior distribution is
f(θ) = (β^α / Γ(α)) θ^{α−1} e^{−βθ}
and the joint p.d.f. of X1, ..., Xn is
f(X1, ..., Xn | θ) = θ^n e^{nθ} / (Π Xi)^{θ+1}.
Therefore, the posterior is proportional to (as usual, we keep track only of the terms that depend on θ)
f(θ | X1, ..., Xn) ∝ θ^{α−1} e^{−βθ} θ^n e^{nθ} / (Π Xi)^{θ+1} ∝ θ^{α+n−1} e^{−βθ+nθ} e^{−θ log Π Xi} = θ^{(α+n)−1} e^{−(β−n+log Π Xi)θ}.
This shows that the posterior is again a gamma distribution with parameters
Γ(α + n, β − n + log Π Xi).
The Bayes estimate is the expectation of the posterior, which in this case is
θ̂ = (α + n) / (β − n + log Π Xi).

(5) The confidence interval for µ is given by
X̄ − c √((1/(n−1))(X̄² − (X̄)²)) ≤ µ ≤ X̄ + c √((1/(n−1))(X̄² − (X̄)²))
where the c that corresponds to 90% confidence is found from the condition
t_{10−1}(c) − t_{10−1}(−c) = 0.9
or t_9(c) = 0.95, and c = 1.833. The confidence interval for σ² is
n(X̄² − (X̄)²)/c2 ≤ σ² ≤ n(X̄² − (X̄)²)/c1
where c1, c2 satisfy
χ²_{10−1}(c1) = 0.05 and χ²_{10−1}(c2) = 0.95,
and c1 = 3.325, c2 = 16.92.