
Lecture 03 Probability and Statistics - Nakul Gopalan


Page 1: Lecture 03 Probability and Statistics - Nakul Gopalan

Nakul Gopalan

Georgia Tech

Probability and Statistics

Machine Learning CS 4641

These slides are based on slides from Le Song, Sam Roweis, Mahdi Roozbahani, and Chao Zhang.

Page 2: Lecture 03 Probability and Statistics - Nakul Gopalan

Outline

• Probability Distributions

• Joint and Conditional Probability Distributions

• Bayes’ Rule

• Mean and Variance

• Properties of Gaussian Distribution

• Maximum Likelihood Estimation


Page 3: Lecture 03 Probability and Statistics - Nakul Gopalan

Probability


Page 4: Lecture 03 Probability and Statistics - Nakul Gopalan

Three Key Ingredients in Probability Theory


A random variable $X$ represents outcomes in a sample space.

A sample space is the collection of all possible outcomes.

The probability of a random variable taking a value: $p(x) = p(X = x)$, with $p(x) \geq 0$.

Page 5: Lecture 03 Probability and Statistics - Nakul Gopalan

Discrete variable → discrete probability distribution

• Probability mass function: gives a probability value

• Example: coin flip (integer), Bernoulli distribution

• Normalization: $\sum_{x \in A} p(x) = 1$

Continuous variable → continuous probability distribution

• Probability density function: gives a density or likelihood value

• Example: temperature (real number), Gaussian distribution

• Normalization: $\int_x p(x)\,dx = 1$
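A minimal sketch (not from the slides, parameters assumed for illustration) checking both normalization conditions with scipy:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Discrete: a Bernoulli pmf sums to 1 over its support {0, 1}
p = 0.3  # assumed success probability, for illustration only
print(sum(stats.bernoulli(p).pmf(x) for x in (0, 1)))  # -> 1.0

# Continuous: a Gaussian pdf integrates to 1 over the real line
mu, sigma = 0.0, 1.0  # assumed parameters, for illustration only
area, _ = quad(stats.norm(mu, sigma).pdf, -np.inf, np.inf)
print(area)  # -> ~1.0
```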

Page 6: Lecture 03 Probability and Statistics - Nakul Gopalan

Continuous Probability Functions


Page 7: Lecture 03 Probability and Statistics - Nakul Gopalan

Discrete Probability Functions

In a Bernoulli distribution, just a single trial is conducted.

For the binomial distribution, $k$ is the number of successes and $n - k$ is the number of failures.

$\binom{n}{k}$ is the total number of ways of selecting $k$ successes out of $n$ trials, irrespective of order.
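The formulas themselves appear to have been lost in extraction; the standard pmfs these notes refer to are:

$$\text{Bernoulli: } f(x; p) = p^{x}(1-p)^{1-x}, \quad x \in \{0, 1\}$$

$$\text{Binomial: } P(X = k) = \binom{n}{k}\, p^{k} (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$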

Page 8: Lecture 03 Probability and Statistics - Nakul Gopalan

Outline

• Probability Distributions

• Joint and Conditional Probability Distributions

• Bayes’ Rule

• Mean and Variance

• Properties of Gaussian Distribution

• Maximum Likelihood Estimation


Page 9: Lecture 03 Probability and Statistics - Nakul Gopalan

Example

X = throw a die, Y = flip a coin. X and Y are random variables; N = total number of trials; $n_{ij}$ = number of occurrences of the pair $(x_i, y_j)$; $c_i$ = total count for $X = x_i$ (column sum); $c_j$ = total count for $Y = y_j$ (row sum).

              x=1  x=2  x=3  x=4  x=5  x=6 |  c_j
    y = head   3    4    2    5    1    5  |  20
    y = tail   2    2    4    2    4    1  |  15
    c_i        5    6    6    7    5    6  |  N = 35

Page 10: Lecture 03 Probability and Statistics - Nakul Gopalan

Using the same count table, compute:

• Pr(x = 4, y = head)

• Pr(x = 5)

• Pr(y = tail | x = 3)

• Joint from conditional: Pr(X = x_i, Y = y_j)
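Worked out from the counts above (the answers are not shown on the slide):

$$\Pr(x{=}4, y{=}\text{head}) = \frac{n_{ij}}{N} = \frac{5}{35} = \frac{1}{7}, \qquad \Pr(x{=}5) = \frac{c_i}{N} = \frac{5}{35} = \frac{1}{7}, \qquad \Pr(y{=}\text{tail} \mid x{=}3) = \frac{n_{ij}}{c_i} = \frac{4}{6} = \frac{2}{3}$$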

Page 11: Lecture 03 Probability and Statistics - Nakul Gopalan

Joint probability: $p(X = x_i, Y = y_j) = \dfrac{n_{ij}}{N}$

Marginal probability: $p(X = x_i) = \dfrac{c_i}{N}$

Conditional probability: $p(Y = y_j \mid X = x_i) = \dfrac{n_{ij}}{c_i}$

Sum rule:

$$p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j) \;\Rightarrow\; p(X) = \sum_{Y} p(X, Y)$$

Product rule:

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i)$$

$$p(X, Y) = p(Y \mid X)\, p(X)$$
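A minimal sketch (assumed code, not from the slides) reproducing the sum and product rules from the count table above:

```python
import numpy as np

# Rows: y = head, tail; columns: x = 1..6 (the counts n_ij from the example)
counts = np.array([[3, 4, 2, 5, 1, 5],
                   [2, 2, 4, 2, 4, 1]])
N = counts.sum()                           # total trials, 35

joint = counts / N                         # p(X=x_i, Y=y_j) = n_ij / N
p_x = joint.sum(axis=0)                    # sum rule: p(X) = sum_Y p(X, Y)
p_y_given_x = counts / counts.sum(axis=0)  # p(Y=y_j | X=x_i) = n_ij / c_i

# Product rule check: p(X, Y) == p(Y | X) * p(X)
assert np.allclose(joint, p_y_given_x * p_x)
print(p_x)                 # marginal of X: [5, 6, 6, 7, 5, 6] / 35
print(p_y_given_x[1, 2])   # Pr(y=tail | x=3) = 4/6
```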

Page 12: Lecture 03 Probability and Statistics - Nakul Gopalan

Joint Distribution


Page 13: Lecture 03 Probability and Statistics - Nakul Gopalan

Marginal Distribution


Page 14: Lecture 03 Probability and Statistics - Nakul Gopalan

Conditional Distribution


Page 15: Lecture 03 Probability and Statistics - Nakul Gopalan

Independence & Conditional Independence


Page 16: Lecture 03 Probability and Statistics - Nakul Gopalan

Poll

• A virus, Kappa, is known to cause the flu, and the flu is known to sometimes cause a headache. The flu can be caused by multiple things. A patient shows up with a diagnosis of the flu, so we know the patient has the flu. Does the probability of the patient having a headache still depend on the virus Kappa?

Options:

1) Yes, the probability of a headache depends on the virus Kappa.

2) No, the probability of a headache is independent of the virus Kappa, since we know the patient has the flu.

Page 17: Lecture 03 Probability and Statistics - Nakul Gopalan

Conditional Independence


Page 18: Lecture 03 Probability and Statistics - Nakul Gopalan

Outline

• Probability Distributions

• Joint and Conditional Probability Distributions

• Bayes’ Rule

• Mean and Variance

• Properties of Gaussian Distribution

• Maximum Likelihood Estimation


Page 19: Lecture 03 Probability and Statistics - Nakul Gopalan

Bayes’ Rule

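The body of this slide (likely an image) did not survive extraction; the rule itself, which follows from the product and sum rules above, is:

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} = \frac{p(X \mid Y)\, p(Y)}{\sum_{Y'} p(X \mid Y')\, p(Y')}$$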

Page 20: Lecture 03 Probability and Statistics - Nakul Gopalan

Bayes’ Rule


Page 21: Lecture 03 Probability and Statistics - Nakul Gopalan

Administrative business

• Office hours are live

• Live Q&A runs parallel to the lectures so the TAs can help with answering questions. Use voting as well so I know what needs to be handled here.

• Chris is holding a Python tutorial on Thursday at 6 pm.

• Project questions will be answered by me in Thursday’s lecture; come one, come all.

Page 22: Lecture 03 Probability and Statistics - Nakul Gopalan

Outline

• Probability Distributions

• Joint and Conditional Probability Distributions

• Bayes’ Rule

• Mean and Variance

• Properties of Gaussian Distribution

• Maximum Likelihood Estimation


Page 23: Lecture 03 Probability and Statistics - Nakul Gopalan

Mean and Variance

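The slide body (likely an image) is missing; the standard definitions it covers are:

$$\mathbb{E}[X] = \sum_x x\, p(x) \;\; \text{(discrete)}, \qquad \mathbb{E}[X] = \int x\, p(x)\, dx \;\; \text{(continuous)}$$

$$\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$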

Page 24: Lecture 03 Probability and Statistics - Nakul Gopalan

For Joint Distributions


Page 25: Lecture 03 Probability and Statistics - Nakul Gopalan

Outline

• Probability Distributions

• Joint and Conditional Probability Distributions

• Bayes’ Rule

• Mean and Variance

• Properties of Gaussian Distribution

• Maximum Likelihood Estimation


Page 26: Lecture 03 Probability and Statistics - Nakul Gopalan

Gaussian Distribution

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
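A small sketch (assumed code, not from the slides) evaluating this density directly and checking it against scipy:

```python
import numpy as np
from scipy import stats

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 1.0, 4.0   # assumed parameters, for illustration only
x = 2.5
print(gaussian_pdf(x, mu, sigma2))             # direct formula
print(stats.norm(mu, np.sqrt(sigma2)).pdf(x))  # same value via scipy
```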

Page 27: Lecture 03 Probability and Statistics - Nakul Gopalan

Multivariate Gaussian Distribution

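The slide body (likely an image) is missing; the standard $d$-dimensional density, for reference, is:

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\Big(\!-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\Big)$$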

Page 28: Lecture 03 Probability and Statistics - Nakul Gopalan

Properties of Gaussian Distribution


Page 29: Lecture 03 Probability and Statistics - Nakul Gopalan

Central Limit Theorem

[Figure: bar chart of the probability mass function of a biased die, x = 1, ..., 6]

Let's say I am going to get samples of size $n = 4$ from this pmf:

$$S_1 = \{1, 1, 1, 6\} \Rightarrow E[S_1] = 2.25$$

$$S_2 = \{1, 1, 3, 6\} \Rightarrow E[S_2] = 2.75$$

$$\vdots$$

$$S_m = \{1, 4, 6, 6\} \Rightarrow E[S_m] = 4.25$$

[Figure: histogram of the sample means, x-axis marked 1, 2.5, 3.5, 4.5, 6]

According to the CLT, the distribution of these sample means will follow a bell-curve (normal) distribution.
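A quick simulation sketch (assumed code with a made-up biased pmf, not from the slides) showing the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

faces = np.arange(1, 7)
pmf = np.array([0.30, 0.05, 0.10, 0.15, 0.05, 0.35])  # assumed biased-die pmf

n, m = 4, 100_000          # sample size per draw, number of samples
samples = rng.choice(faces, size=(m, n), p=pmf)
means = samples.mean(axis=1)

# The histogram of `means` is approximately Gaussian around the true mean,
# with variance shrinking as n grows (per the CLT).
print(means.mean(), faces @ pmf)   # average of sample means vs. true mean
```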

Page 30: Lecture 03 Probability and Statistics - Nakul Gopalan

CLT Definition

• Statement: The central limit theorem (due to Laplace) tells us that, subject to certain mild conditions, the sum of a set of random variables, which is of course itself a random variable, has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases (Walker, 1969).

Page 31: Lecture 03 Probability and Statistics - Nakul Gopalan

Outline

• Probability Distributions

• Joint and Conditional Probability Distributions

• Bayes’ Rule

• Mean and Variance

• Properties of Gaussian Distribution

• Maximum Likelihood Estimation


Page 32: Lecture 03 Probability and Statistics - Nakul Gopalan

Likelihood, what is it? A biased coin from a stranger.

Page 33: Lecture 03 Probability and Statistics - Nakul Gopalan

Likelihood, what is it? Cat or dog? Do we know? Let's find out!

Page 34: Lecture 03 Probability and Statistics - Nakul Gopalan

Maximum Likelihood Estimation


Main assumption: independent and identically distributed (i.i.d.) random variables.

Page 35: Lecture 03 Probability and Statistics - Nakul Gopalan

Maximum Likelihood Estimation

For Bernoulli (i.e., a coin flip):

Objective function: $f(x_i; p) = p^{x_i}(1-p)^{1-x_i}$, where $x_i \in \{0, 1\}$ (i.e., tail or head).

$$L(p) = \Pr(X = x_1, X = x_2, X = x_3, \ldots, X = x_n)$$

We want to know: what is the most "likely" value of the success probability $p$ given $n$ observations?

Page 36: Lecture 03 Probability and Statistics - Nakul Gopalan

Maximum Likelihood Estimation

For Bernoulli (i.e., a coin flip), by the i.i.d. assumption the likelihood factorizes:

$$L(p) = \Pr(X = x_1, X = x_2, \ldots, X = x_n) = \Pr(X = x_1)\,\Pr(X = x_2)\cdots\Pr(X = x_n) = f(x_1; p)\, f(x_2; p) \cdots f(x_n; p)$$

$$L(p) = \prod_{i=1}^{n} f(x_i; p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i}$$

$$L(p) = p^{x_1}(1-p)^{1-x_1} \times p^{x_2}(1-p)^{1-x_2} \times \cdots \times p^{x_n}(1-p)^{1-x_n} = p^{\sum x_i}\, (1-p)^{\sum (1 - x_i)}$$

Page 37: Lecture 03 Probability and Statistics - Nakul Gopalan

We don't like multiplication; let's convert it into a summation. What's the trick? Take the log.

$$L(p) = p^{\sum x_i}\, (1-p)^{\sum (1 - x_i)}$$

$$\log L(p) = \ell(p) = \log(p) \sum_{i=1}^{n} x_i + \log(1-p) \sum_{i=1}^{n} (1 - x_i)$$

How to optimize $p$? Set the derivative to zero:

$$\frac{\partial \ell(p)}{\partial p} = 0 \;\Rightarrow\; \frac{\sum_{i=1}^{n} x_i}{p} - \frac{\sum_{i=1}^{n} (1 - x_i)}{1-p} = 0$$

$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
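A short sketch (assumed code, not from the slides) checking this closed-form estimate against a numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
true_p = 0.7                              # assumed true bias, for illustration
x = rng.binomial(1, true_p, size=1000)    # n i.i.d. Bernoulli observations

# Closed-form MLE: p_hat = (1/n) * sum(x_i)
p_closed = x.mean()

# Numerical check: maximize l(p) = log(p) * sum(x) + log(1-p) * sum(1-x)
def neg_log_lik(p):
    return -(np.log(p) * x.sum() + np.log(1 - p) * (1 - x).sum())

p_num = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(p_closed, p_num)   # both close to true_p
```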