
Page 1: Probability and Sampling Recitation

Probability and Sampling Recitation

Avishek Dutta

CS155 Machine Learning and Data Mining

February 2, 2017


Page 2: Probability and Sampling Recitation

Motivation

Uncertainty is everywhere around us

"What is the chance that it will rain today?"

"When will the next bus arrive?"

"Will I go to recitation today?"

Machine learning tries to understand uncertainties and interact with the real world

Probability theory is the mathematical study of uncertainty


Page 3: Probability and Sampling Recitation

Basic Concepts

Sample Space Ω: set of all possible outcomes

An event, A, is a subset of Ω

P(A) is the probability of event A occurring

0 ≤ P(A) ≤ 1

P(∅) = 0

P(Ω) = 1

P(Aᶜ) = 1 − P(A), where Aᶜ = Ω \ A is the complement of A

Examples

Suppose you are rolling a six-sided die. Then we have:

Ω = {1, 2, 3, 4, 5, 6}

P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

P({1, 2, 3, 4, 5, 6}) = 1
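These die probabilities are easy to check empirically; a minimal sketch using NumPy (whose np.random module also appears later in this deck):

```python
import numpy as np

# Roll a fair six-sided die many times and check that each face's
# empirical frequency is close to 1/6 and that the frequencies sum to 1.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # uniform on {1, ..., 6}

freqs = {face: float(np.mean(rolls == face)) for face in range(1, 7)}
total = sum(freqs.values())  # empirical P(Omega); sums to 1 by construction
```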


Page 4: Probability and Sampling Recitation

Union

Given two events, A and B, the probability of A or B is

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) (1)

= P(A) + P(B) (2)

where (1) and (2) are equal only if A and B are mutually exclusive

Examples

Suppose you are rolling a six-sided die. Then we have:

P({2, 4, 6}) = P(2) + P(4) + P(6) = 1/2

P({1, 2, 3} ∪ {2, 4, 6}) = P({1, 2, 3}) + P({2, 4, 6}) − P(2) = 1/2 + 1/2 − 1/6 = 5/6


Page 5: Probability and Sampling Recitation

Joint and Conditional Probabilities

Given two events, A and B,

P(A,B) is the joint probability of A and B occurring together

P(A | B) is the conditional probability of A occurring given that we know B has occurred

Joint and Conditional Probabilities are related in the following way:

P(A | B) = P(A, B) / P(B)

where we assume that P(B) ≠ 0.


Page 6: Probability and Sampling Recitation

Joint and Conditional Probabilities

Examples

Consider a standard deck of cards. What is the probability that the first two cards drawn are both Kings?

Event A = First card drawn is a King

Event B = Second card drawn is a King

A standard deck of cards has 4 Kings and 52 total cards.

P(A) = 4/52

P(B | A) = 3/51

This means that

P(A, B) = P(B | A)P(A) = (3/51) · (4/52) = 1/221 ≈ 0.005
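This joint probability can be sanity-checked by simulation; a sketch (assuming NumPy) that shuffles a deck and counts how often the top two cards are both Kings:

```python
import numpy as np

# Estimate P(first two cards are both Kings) by shuffling a 52-card deck;
# cards 0-3 stand in for the four Kings.
rng = np.random.default_rng(0)
trials = 100_000
deck = np.arange(52)

hits = 0
for _ in range(trials):
    rng.shuffle(deck)
    if deck[0] < 4 and deck[1] < 4:  # both of the top two cards are Kings
        hits += 1

estimate = hits / trials  # should be close to 1/221
```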


Page 7: Probability and Sampling Recitation

Joint Probabilities - Chain Rule

An extension of P(A,B) = P(B | A)P(A):

P(A1, A2, . . . , An) = P(An, . . . , A2, A1)

= P(An | An−1, . . . , A2, A1) P(An−1, . . . , A2, A1)

...

= ∏_{i=1}^{n} P(Ai | A1, A2, . . . , Ai−1)


Page 8: Probability and Sampling Recitation

Independence

Two events, A and B, are independent if

P(A,B) = P(A)P(B)

or equivalently

P(A | B) = P(A)

In other words, knowledge of whether B occurred does not affect the probability that A occurs


Page 9: Probability and Sampling Recitation

Independence

Examples

Suppose that you roll a six-sided die twice. What is the probability that you roll 1 both times?

A = rolling a 1 in the first roll

B = rolling a 1 in the second roll

These two events are independent so

P(A, B) = P(A)P(B) = (1/6) · (1/6) = 1/36


Page 10: Probability and Sampling Recitation

Bayes' Theorem

We know that the following is true

P(A,B) = P(A | B)P(B) = P(B | A)P(A)

This implies that

P(A | B) = P(B | A)P(A) / P(B)

P(A | B) ∝ P(B | A)P(A)

P(A | B) - Posterior Probability

P(B | A) - Likelihood Function

P(A) - Prior Information

P(B) - Evidence


Page 11: Probability and Sampling Recitation

Bayes' Theorem

Examples

If a person has an allergy (A), sneezing (S) is observed with probability P(S | A) = 0.8. What is the chance a person has an allergy given that they are sneezing: P(A | S)?

Assume that P(A) = 0.001 (few people have allergies)

Assume that P(S) = 0.1 (many people sneeze).

P(A | S) = P(S | A)P(A) / P(S) = (0.8 · 0.001) / 0.1 = 0.008
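The arithmetic of this posterior computation, as a few lines of Python:

```python
# Bayes' Theorem on the allergy example: P(A | S) = P(S | A) P(A) / P(S).
p_s_given_a = 0.8   # likelihood P(S | A)
p_a = 0.001         # prior P(A)
p_s = 0.1           # evidence P(S)

p_a_given_s = p_s_given_a * p_a / p_s
# Sneezing raises the probability of an allergy eightfold (0.001 -> 0.008),
# but the posterior stays small because the prior is small.
```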


Page 12: Probability and Sampling Recitation

Random Variables

So far, we have only considered simple examples with simple events as the possible outcomes

A = rolling 1 on a six-sided die

B = first card drawn is a King

This is mathematically imprecise and can be limiting. The notion of random variables resolves these issues

A random variable X is a function X : Ω → ℝ

Abuse of notation: P(X = x) is the probability that the random variable takes on value x


Page 13: Probability and Sampling Recitation

Random Variables

Examples

Dice Roll: X = i for i ∈ {1, 2, 3, 4, 5, 6} with P(X = i) = 1/6

Biased Coin: X = 1 with P(X = 1) = p and X = 0 with P(X = 0) = 1 − p

All random variables have a Cumulative Distribution Function (CDF):

F (x) = P(X ≤ x)

Some properties of the CDF are:

0 ≤ F(x) ≤ 1

F(x) is monotonically non-decreasing

lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1


Page 14: Probability and Sampling Recitation

Discrete Random Variables

A random variable that can take on only finitely or countably many different values

Discrete random variables have a Probability Mass Function (PMF):

p(x) = P(X = x)

The PMF satisfies

∑_x P(X = x) = 1

Examples

Suppose you are flipping two coins. Let X be the random variable that equals the number of heads

P(X = 0) P(X = 1) P(X = 2)

0.25 0.5 0.25
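This PMF can be derived by enumerating the four equally likely outcomes; a stdlib-only sketch:

```python
from itertools import product

# Enumerate all outcomes of two fair coin flips (1 = heads) and build the
# PMF of X = number of heads.
outcomes = list(product([0, 1], repeat=2))  # (0,0), (0,1), (1,0), (1,1)
pmf = {x: sum(1 for o in outcomes if sum(o) == x) / len(outcomes)
       for x in range(3)}
# pmf is {0: 0.25, 1: 0.5, 2: 0.25} and sums to 1
```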


Page 15: Probability and Sampling Recitation

Continuous Random Variables

A random variable that can take on a continuum of (uncountably many) values

Continuous random variables have a Probability Density Function (PDF):

p(x) = f(x) = (d/dx) F(x)

The PDF satisfies

∫_{−∞}^{∞} f(x) dx = 1

Examples

Suppose that X is a random variable that is equal to a value chosen uniformly at random from the interval [0, a]. Then

F(x) = x/a and f(x) = 1/a for x ∈ [0, a]


Page 16: Probability and Sampling Recitation

Marginal Distribution

Suppose that for two random variables, X and Y, the joint distribution p(x, y) is known for all combinations of X and Y. Then the marginal distribution of X is

p(x) = ∑_y p(x, y) (discrete)   p(x) = ∫_{−∞}^{∞} p(x, y) dy (continuous)

and the marginal distribution of Y is

p(y) = ∑_x p(x, y) (discrete)   p(y) = ∫_{−∞}^{∞} p(x, y) dx (continuous)


Page 17: Probability and Sampling Recitation

Marginal Distribution

Examples

Consider two random variables X and Y. Suppose the joint distribution is given by

        x1     x2     x3     x4     p(y)
y1      4/32   2/32   1/32   1/32   8/32
y2      2/32   4/32   1/32   1/32   8/32
y3      2/32   2/32   2/32   2/32   8/32
y4      8/32   0      0      0      8/32
p(x)    16/32  8/32   4/32   4/32   32/32

p(x1) = ∑_y p(x1, y) = 16/32
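The same marginalization can be done by storing the table as a NumPy array (a sketch; row i holds yi, column j holds xj):

```python
import numpy as np

# The joint distribution from the table, entered as counts out of 32.
joint = np.array([[4, 2, 1, 1],
                  [2, 4, 1, 1],
                  [2, 2, 2, 2],
                  [8, 0, 0, 0]]) / 32

p_x = joint.sum(axis=0)  # marginal of X: sum over the rows (values of y)
p_y = joint.sum(axis=1)  # marginal of Y: sum over the columns (values of x)
# p_x = [16/32, 8/32, 4/32, 4/32]; p_y = [8/32, 8/32, 8/32, 8/32]
```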


Page 18: Probability and Sampling Recitation

Expected Value

The expected value of a random variable, X, is the mean of the distribution

E[X] = ∑_x x P(X = x) (discrete)   E[X] = ∫ x f(x) dx (continuous)

Expectation is linear

E[aX + b] = aE[X ] + b

E[X + Y ] = E[X ] + E[Y ]

Examples

Suppose that X and Y are random variables that are 1 with probability p and 0 otherwise. What is the expected value of X + Y?

E[X + Y ] = E[X ] + E[Y ] = P(X = 1) + P(Y = 1) = 2p


Page 19: Probability and Sampling Recitation

Variance

The variance of a random variable is a measure of the 'spread' of the values the variable takes on:

Var(X ) = E[(X − E[X ])2]

The variance can also be written as

Var(X ) = E[X 2]− E[X ]2

Variance is generally not linear

Var(aX + b) = a2Var(X )

Var(X + Y ) = Var(X ) + Var(Y ) + 2Cov(X ,Y )


Page 20: Probability and Sampling Recitation

Covariance

The covariance between random variables X and Y measures thedegree to which X and Y are related

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

This can also be written as

Cov(X ,Y ) = E[XY ]− E[X ]E[Y ]

Note that Cov(X ,X ) = Var(X )


Page 21: Probability and Sampling Recitation

Variance and Covariance

Examples

Suppose that X and Y are independent random variables that are 1 with probability p and 0 otherwise. What is the covariance of X and Y? What is the variance of X + Y?

Cov(X, Y) = E[XY] − E[X]E[Y]

= p²(1) + (1 − p²)(0) − p² = 0

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

= E[X²] − E[X]² + E[Y²] − E[Y]²

= p − p² + p − p² = 2p(1 − p)
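Both results can be checked by simulation; a sketch assuming NumPy, with p = 0.3 as an illustrative choice:

```python
import numpy as np

# Empirically check Cov(X, Y) ~ 0 and Var(X + Y) ~ 2p(1 - p) for
# independent Bernoulli(p) variables X and Y.
rng = np.random.default_rng(0)
p, n = 0.3, 1_000_000
x = (rng.random(n) < p).astype(float)
y = (rng.random(n) < p).astype(float)

cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)  # ~ 0
var_sum = np.var(x + y)                            # ~ 2 * 0.3 * 0.7 = 0.42
```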


Page 22: Probability and Sampling Recitation

Important Discrete Distributions

Bernoulli(p)

P(X = x) = p^x (1 − p)^(1−x) for x ∈ {0, 1}

E[X] = p

Binomial(n, p)

P(X = x) = (n choose x) p^x (1 − p)^(n−x)

E[X] = np

Geometric(p)

P(X = x) = p(1 − p)^(x−1)

E[X] = 1/p

Poisson(λ)

P(X = x) = λ^x e^(−λ) / x!

E[X] = λ


Page 23: Probability and Sampling Recitation

Important Discrete Distributions

Examples

Bernoulli - flipping a biased coin that lands heads with probability p

Binomial - number of heads in n flips of a biased coin that lands heads with probability p

Geometric - number of flips of a biased coin (heads with probability p) until it first lands heads

Poisson - number of letters received in a given day when the average number of letters received per day is λ
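Each of these can be sampled with np.random (used later in this deck); the sample means land near the expected values listed on the previous slide. The parameter values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, trials = 10, 0.3, 4.0, 200_000

bern = rng.binomial(1, p, size=trials)   # Bernoulli(p):  mean ~ p
binom = rng.binomial(n, p, size=trials)  # Binomial(n,p): mean ~ n*p
geom = rng.geometric(p, size=trials)     # Geometric(p):  mean ~ 1/p
poisson = rng.poisson(lam, size=trials)  # Poisson(lam):  mean ~ lam
```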


Page 24: Probability and Sampling Recitation

Important Continuous Distributions

Uniform(a, b) with a ≤ b

f(x) = 1/(b − a) for a ≤ x ≤ b, 0 otherwise

E[X] = (a + b)/2

Normal(µ, σ²)

f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))

E[X] = µ

Examples

Uniform - spinning a board game spinner

Normal - height of a person in the general population
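A matching sampling sketch for the continuous case (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, mu, sigma, trials = 2.0, 6.0, 1.5, 0.5, 200_000

unif = rng.uniform(a, b, size=trials)      # Uniform(a, b): mean ~ (a + b) / 2
norm = rng.normal(mu, sigma, size=trials)  # Normal(mu, sigma^2): mean ~ mu
```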


Page 25: Probability and Sampling Recitation

Law of Large Numbers

Let X1, X2, . . . , Xn be a set of independent and identically distributed (iid) random variables with E[Xi] = µ. Then the sample average given by

X̄n = (1/n)(X1 + X2 + . . . + Xn)

converges to the expected value

X̄n → µ as n → ∞
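The Law of Large Numbers is easy to see numerically; a sketch with iid Bernoulli(0.5) draws:

```python
import numpy as np

# Running sample average of iid fair-coin flips: it converges to mu = 0.5.
rng = np.random.default_rng(0)
draws = rng.binomial(1, 0.5, size=100_000)
running_avg = np.cumsum(draws) / np.arange(1, draws.size + 1)

final_avg = running_avg[-1]  # close to 0.5 for large n
```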


Page 26: Probability and Sampling Recitation

Central Limit Theorem

Let X1, X2, . . . , Xn be a set of independent and identically distributed (iid) random variables with E[Xi] = µ and Var(Xi) = σ², both finite. If we define the random variable Y as

Y = (1/n)(X1 + X2 + . . . + Xn)

then Y is approximately normal with mean µ and variance σ²/n

The approximation improves as n → ∞
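A numerical sketch of the CLT: averages of n iid Uniform[0, 1] draws (µ = 1/2, σ² = 1/12) have mean near µ and variance near σ²/n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 100_000
averages = rng.random((reps, n)).mean(axis=1)  # 100k sample averages of size n

mean_of_avgs = averages.mean()  # ~ mu = 0.5
var_of_avgs = averages.var()    # ~ sigma^2 / n = (1/12) / 50
```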


Page 27: Probability and Sampling Recitation

Sampling

Sampling is the process by which elements are drawn from a population or distribution

In this course, we typically assume elements are sampled independently (i.e. with replacement)

The np.random module contains many functions for sampling randomly from various distributions

np.random.uniform(0,1) # uniform on [0,1]


Page 28: Probability and Sampling Recitation

Independent and Identically Distributed

The importance of the iid assumption cannot be overstated!

The independence assumption simplifies many things:

P(X ,Y ) = P(X )P(Y )

P(X | Y ) = P(X )

Var(X + Y ) = Var(X ) + Var(Y )


Page 29: Probability and Sampling Recitation

Parameter Estimation

Suppose we have a parametrized distribution P(X; θ) with θ unknown

Suppose we have an iid set of samples x1, . . . , xn

How can we estimate θ? From Bayes’ Theorem, we have that

P(θ | X ) ∝ P(X | θ)P(θ)

MAP: θ̂ = arg max_θ P(θ | X)

MLE: θ̂ = arg max_θ P(X | θ)


Page 30: Probability and Sampling Recitation

Parameter Estimation - log trick

The logarithmic function is monotonically increasing, so it does not change the location of where a function achieves its maximum

Multiplication turns into summation, resulting in less numerical underflow

Considering the negative of the likelihood function allows us to usegradient descent to minimize the function

arg max_θ f(X | θ) = arg min_θ − log f(X | θ)


Page 31: Probability and Sampling Recitation

Parameter Estimation - Binomial Distribution

Examples

Suppose you toss a (possibly biased) coin n times and record the number of heads, k. What is p, the probability that it lands heads on a given coin flip? We can use the binomial distribution and MLE to find p:

p̂ = arg max_p P(k | p) = arg max_p (n choose k) p^k (1 − p)^(n−k)

= arg max_p p^k (1 − p)^(n−k)

= arg min_p − log [p^k (1 − p)^(n−k)]

= arg min_p − k log p − (n − k) log(1 − p)

Taking the derivative with respect to p and setting it to zero implies that p̂ = k/n
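The closed form can also be verified numerically by evaluating the negative log-likelihood on a grid (n = 100 and k = 37 are illustrative choices):

```python
import numpy as np

# Minimize -k log p - (n - k) log(1 - p) over a grid of p values and
# confirm the minimizer is k / n.
n, k = 100, 37
ps = np.linspace(0.001, 0.999, 999)  # grid with step 0.001
nll = -k * np.log(ps) - (n - k) * np.log(1 - ps)

p_hat = ps[np.argmin(nll)]  # equals k / n = 0.37 up to the grid spacing
```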


Page 32: Probability and Sampling Recitation

MLE with Logistic Regression

Examples

Suppose you have an iid dataset, S = {(xi, yi)}_{i=1}^{N}, and want to model it using a "log-linear" model:

P(y | x, w, b) = 1 / (1 + e^(−y(w^T x + b)))

We can use MLE to find w, b:

ŵ, b̂ = arg max_{w,b} P(y | x, w, b)

= arg max_{w,b} ∏_{i=1}^{N} P(yi | xi, w, b)   (iid assumption)

= arg min_{w,b} ∑_{i=1}^{N} − ln P(yi | xi, w, b)   (log-loss in logistic reg.)
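A minimal sketch of this MLE on toy 1-D data, minimizing the negative log-likelihood by gradient descent (the dataset, learning rate, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

# Toy 1-D dataset with labels y in {-1, +1} determined by a threshold.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)
y = np.where(x + 0.5 > 0, 1.0, -1.0)

# Gradient descent on the log-loss: sum_i log(1 + exp(-y_i (w x_i + b))).
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    margin = y * (w * x + b)
    g = -y / (1.0 + np.exp(margin))  # d(log-loss_i) / d(w x_i + b)
    w -= lr * np.mean(g * x)
    b -= lr * np.mean(g)

train_acc = np.mean(np.sign(w * x + b) == y)  # high on this separable data
```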


Page 33: Probability and Sampling Recitation

References

Lucy Yin’s Recitation on Probability (CS155 Winter 2016)

Kevin Murphy’s Machine Learning: A Probabilistic Perspective
