
©2005-2013 Carlos Guestrin

Point Estimation

Machine Learning – CSE446
Carlos Guestrin
University of Washington

April 3, 2013

http://www.cs.washington.edu/education/courses/cse446/13sp/


Your first consulting job

- A billionaire from the suburbs of Seattle asks you a question:
  - He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?
  - You say: Please flip it a few times:
  - You say: The probability is:
  - He says: Why???
  - You say: Because…


Thumbtack – Binomial Distribution

- P(Heads) = θ, P(Tails) = 1-θ
- Flips are i.i.d.:
  - Independent events
  - Identically distributed according to the Binomial distribution
- Sequence D of αH Heads and αT Tails
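The likelihood the slide refers to, in its standard form for a sequence with αH heads and αT tails:

P(D \mid \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}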


Maximum Likelihood Estimation

- Data: observed set D of αH Heads and αT Tails
- Hypothesis: Binomial distribution
- Learning θ is an optimization problem
  - What's the objective function?
- MLE: Choose θ that maximizes the probability of the observed data:
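Written out (the standard objective; the log form is equivalent because log is monotonic):

\hat{\theta} = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \theta^{\alpha_H}(1-\theta)^{\alpha_T} = \arg\max_\theta \; \alpha_H \ln\theta + \alpha_T \ln(1-\theta)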


Your first learning algorithm

- Set the derivative to zero:
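The standard derivation the slide walks through: differentiate the log-likelihood and solve.

\frac{d}{d\theta}\Big[\alpha_H \ln\theta + \alpha_T \ln(1-\theta)\Big] = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\;\Rightarrow\; \hat{\theta}_{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T}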


How many flips do I need?

- Billionaire says: I flipped 3 heads and 2 tails.
- You say: θ = 3/5, I can prove it!
- He says: What if I flipped 30 heads and 20 tails?
- You say: Same answer, I can prove it!
- He says: Which is better?
- You say: Hmm… The more the merrier???
- He says: Is this why I am paying you the big bucks???


Simple bound (based on Hoeffding’s inequality)

- For N = αH+αT, and the MLE θ̂ = αH/N:
- Let θ* be the true parameter; for any ε > 0:
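The bound itself, in the standard form used for this setting:

P\big(|\hat{\theta} - \theta^*| \ge \epsilon\big) \le 2 e^{-2N\epsilon^2}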


PAC Learning

- PAC: Probably Approximately Correct
- Billionaire says: I want to know the thumbtack parameter θ, within ε = 0.1, with probability at least 1-δ = 0.95. How many flips?
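A worked instance, plugging ε = 0.1 and δ = 0.05 into the Hoeffding bound above and solving for N:

2 e^{-2N\epsilon^2} \le \delta \;\Rightarrow\; N \ge \frac{\ln(2/\delta)}{2\epsilon^2} = \frac{\ln 40}{0.02} \approx 184.4

so 185 flips suffice. The same computation in R (the course's language):

```r
# Flips needed so that P(|theta_hat - theta*| >= eps) <= delta,
# solving 2 * exp(-2 * N * eps^2) <= delta for N
eps   <- 0.1
delta <- 0.05
ceiling(log(2 / delta) / (2 * eps^2))   # 185
```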


What about continuous variables?

- Billionaire says: If I am measuring a continuous variable, what can you do for me?
- You say: Let me tell you about Gaussians…
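For reference, the density the following slides rely on: a Gaussian X ~ N(µ, σ²) has

p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}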


Some properties of Gaussians

- Affine transformation (multiplying by a scalar and adding a constant):
  - X ~ N(µ, σ²)
  - Y = aX + b ⇒ Y ~ N(aµ+b, a²σ²)
- Sum of (independent) Gaussians:
  - X ~ N(µX, σX²)
  - Y ~ N(µY, σY²)
  - Z = X + Y ⇒ Z ~ N(µX+µY, σX²+σY²)
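A quick empirical check of both properties (a sketch; the parameter values are made up):

```r
set.seed(1)
x <- rnorm(1e6, mean = 2, sd = 3)    # X ~ N(2, 9)
y <- 5 * x + 1                       # affine: Y should be ~ N(11, 225)
c(mean(y), var(y))
w <- rnorm(1e6, mean = -1, sd = 2)   # independent W ~ N(-1, 4)
c(mean(x + w), var(x + w))           # sum: should be ~ N(1, 13)
```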


Learning a Gaussian

- Collect a bunch of data
  - Hopefully, i.i.d. samples
  - e.g., exam scores
- Learn the parameters
  - Mean
  - Variance


MLE for Gaussian

- Probability of i.i.d. samples D = {x1, …, xN}:
- Log-likelihood of the data:
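Both formulas, in their standard form using the Gaussian density above:

P(D \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}

\ln P(D \mid \mu, \sigma) = -N \ln\big(\sigma\sqrt{2\pi}\big) - \sum_{i=1}^{N} \frac{(x_i-\mu)^2}{2\sigma^2}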


Your second learning algorithm: MLE for mean of a Gaussian

- What's the MLE for the mean?
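Setting the derivative of the log-likelihood with respect to µ to zero gives the standard result:

\frac{\partial}{\partial\mu} \ln P(D \mid \mu, \sigma) = \frac{1}{\sigma^2}\sum_{i=1}^{N} (x_i - \mu) = 0
\;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i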


MLE for variance

- Again, set the derivative to zero:
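Differentiating the log-likelihood with respect to σ and solving yields the standard estimator:

\hat{\sigma}^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})^2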


Learning Gaussian parameters

- MLE:
- BTW: the MLE for the variance of a Gaussian is biased
  - The expected result of the estimation is not the true parameter!
  - Unbiased variance estimator:
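The unbiased estimator replaces 1/N with 1/(N-1):

\hat{\sigma}^2_{unbiased} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \hat{\mu})^2

A small R illustration of both estimators (the data here are simulated, not from the lecture):

```r
set.seed(446)
scores <- rnorm(20, mean = 75, sd = 10)  # pretend exam scores
n      <- length(scores)
mu_hat <- mean(scores)                   # MLE for the mean
sum((scores - mu_hat)^2) / n             # biased MLE for the variance
sum((scores - mu_hat)^2) / (n - 1)       # unbiased estimator (= var(scores))
```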

What you need to know…

- Learning is…
  - Collect some data
    - E.g., thumbtack flips
  - Choose a hypothesis class or model
    - E.g., binomial
  - Choose a loss function
    - E.g., data likelihood
  - Choose an optimization procedure
    - E.g., set derivative to zero to obtain MLE
  - Collect the big bucks
- Like everything in life, there is a lot more to learn…
  - Many more facets… Many more nuances…
  - The fun will continue…



Announcement: R Tutorial

- R: open-source scripting language for stats & ML
- A lot of resources online
- Tutorial:
  - When: Thursday, April 4th at 6:00pm
  - Where: EEB 125
- Before attending, please download and install R:
  - http://www.r-project.org/
- We recommend using an R environment such as:
  - RStudio: http://www.rstudio.com/
  - Tinn-R: http://www.sciviews.org/Tinn-R/


Linear Regression
Bias-Variance Tradeoff

Machine Learning – CSE446
Carlos Guestrin
University of Washington

April 3, 2013


Prediction of continuous variables

- Billionaire sayz: Wait, that's not what I meant!
- You sayz: Chill out, dude.
- He sayz: I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA.
- You sayz: I can regress that…


The regression problem
- Instances: <xj, tj>
- Learn: a mapping from x to t(x)
- Hypothesis space:
  - Given basis functions {h1, …, hk}
  - Find coefficients w = {w1, …, wk}
  - Why is this called linear regression???
    - The model is linear in the parameters
- Precisely, minimize the residual squared error:
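The model and objective, written out in their standard form:

t(x) \approx \sum_{i=1}^{k} w_i h_i(x), \qquad \hat{w} = \arg\min_w \sum_{j=1}^{N} \Big(t_j - \sum_{i=1}^{k} w_i h_i(x_j)\Big)^2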


The regression problem in matrix notation

[Matrix diagram: t is the N×1 vector of observations for N data points, H is the N×K matrix of the K basis functions evaluated at the N data points, and w is the K×1 vector of weights.]


Minimizing the Residual
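In this notation the objective from the previous slide becomes (a standard rewrite):

\hat{w} = \arg\min_w \, (Hw - t)^\top (Hw - t) = \arg\min_w \|Hw - t\|^2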



Regression solution = simple matrix operations

Setting the gradient to zero gives the normal equations (the standard closed form):

\hat{w} = (H^\top H)^{-1} H^\top t

where H^\top H is a k×k matrix for k basis functions and H^\top t is a k×1 vector.
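A minimal R sketch of the closed-form solution (the data and basis functions are invented for illustration):

```r
set.seed(1)
x <- runif(50, 0, 4)
y <- 2 + 0.5 * x - 0.1 * x^2 + rnorm(50, sd = 0.1)  # targets t
H <- cbind(1, x, x^2)                    # basis functions h(x) = (1, x, x^2)
w_hat <- solve(t(H) %*% H, t(H) %*% y)   # w_hat = (H'H)^{-1} H' t
w_hat
# lm(y ~ x + I(x^2)) recovers the same coefficients
```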


But, why?
- Billionaire (again) says: Why sum squared error???
- You say: Gaussians, Dr. Gateson, Gaussians…
- Model: the prediction is a linear function plus Gaussian noise:
  - t(x) = ∑i wi hi(x) + εx, with εx ~ N(0, σ²)
- Learn w using MLE



Maximizing log-likelihood

Maximize:
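The log-likelihood being maximized, in its standard form under the model above:

\ln P(D \mid w, \sigma) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^{N}\Big(t_j - \sum_i w_i h_i(x_j)\Big)^2

Maximizing over w is exactly minimizing the residual squared error, hence: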

Least-squares Linear Regression is MLE for Gaussians!!!


Applications Corner 1

- Predict stock value over time from:
  - past values
  - other relevant vars
    - e.g., weather, demand, etc.


Applications Corner 2

- Measure temperatures at some locations
- Predict temperatures throughout the environment

[Figure: lab floor plan (server, lab, kitchen, copy/elec, phone/quiet, storage, conference, offices) with 54 numbered sensor locations]

[Guestrin et al. ’04]


Applications Corner 3

- Predict when a sensor will fail, based on several variables:
  - age, chemical exposure, number of hours used, …


Bias-Variance tradeoff – Intuition

- Model too "simple" ⇒ does not fit the data well
  - A biased solution
- Model too complex ⇒ small changes to the data make the solution change a lot
  - A high-variance solution


(Squared) Bias of learner

- Given dataset D with N samples, learn function hD(x)
- If you sample a different dataset D’ with N samples, you will learn a different hD’(x)
- Expected hypothesis: ED[hD(x)]
- Bias: difference between what you expect to learn and the truth
  - Measures how well you expect to represent the true solution
  - Decreases with a more complex model
  - Bias² at one point x:
  - Average bias²:
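The two formulas, in the notation of the decomposition slides below, where h̄N(x) = ED[hD(x)] and t(x) is the truth:

\text{bias}^2(x) = \big(t(x) - \bar{h}_N(x)\big)^2, \qquad \overline{\text{bias}^2} = E_x\Big[\big(t(x) - \bar{h}_N(x)\big)^2\Big]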


Variance of learner

- Given dataset D with N samples, learn function hD(x)
- If you sample a different dataset D’ with N samples, you will learn a different hD’(x)
- Variance: difference between what you expect to learn and what you learn from a particular dataset
  - Measures how sensitive the learner is to the specific dataset
  - Decreases with a simpler model
  - Variance at one point x:
  - Average variance:
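Again in the same notation, the standard forms:

\text{variance}(x) = E_D\Big[\big(\bar{h}_N(x) - h_D(x)\big)^2\Big], \qquad \overline{\text{variance}} = E_x\Big[E_D\Big[\big(\bar{h}_N(x) - h_D(x)\big)^2\Big]\Big]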


Bias-Variance Tradeoff

- Choice of hypothesis class introduces learning bias
  - More complex class → less bias
  - More complex class → more variance


Bias-Variance Decomposition of Error

- Expected mean squared error:
- To simplify the derivation, drop x:
- Expanding the square:


\text{MSE} = E_D\Big[E_x\big[(t(x) - h_D(x))^2\big]\Big], \qquad \bar{h}_N(x) = E_D[h_D(x)]


Moral of the Story: Bias-Variance Tradeoff Key in ML

- Error can be decomposed:
- Choice of hypothesis class introduces learning bias
  - More complex class → less bias
  - More complex class → more variance

\text{MSE} = E_D\Big[E_x\big[(t(x) - h_D(x))^2\big]\Big]
= \underbrace{E_x\Big[\big(t(x) - \bar{h}_N(x)\big)^2\Big]}_{\text{bias}^2}
+ \underbrace{E_D\Big[E_x\Big[\big(\bar{h}_N(x) - h_D(x)\big)^2\Big]\Big]}_{\text{variance}}
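A small R simulation of this decomposition (everything here is invented for illustration: truth t(x) = sin(2πx), N = 20 points per dataset, 200 datasets D):

```r
set.seed(446)
truth <- function(x) sin(2 * pi * x)
xs <- seq(0, 1, length.out = 100)            # grid approximating E_x
fit_once <- function(degree) {               # learn h_D from one dataset D
  x <- runif(20)
  y <- truth(x) + rnorm(20, sd = 0.3)
  predict(lm(y ~ poly(x, degree, raw = TRUE)),
          newdata = data.frame(x = xs))
}
for (degree in c(1, 9)) {
  preds <- replicate(200, fit_once(degree))  # one column per dataset D
  h_bar <- rowMeans(preds)                   # h_bar_N(x) = E_D[h_D(x)]
  bias2    <- mean((truth(xs) - h_bar)^2)    # E_x[(t(x) - h_bar_N(x))^2]
  variance <- mean(apply(preds, 1, var))     # E_x[E_D[(h_bar_N - h_D)^2]]
  cat(sprintf("degree %d: bias^2 = %.3f, variance = %.3f\n",
              degree, bias2, variance))
}
```

The simple class (degree 1) comes out with high bias and low variance; the complex class (degree 9) the reverse.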



What you need to know

- Regression
  - Basis function = features
  - Optimizing sum squared error
  - Relationship between regression and Gaussians
- Bias-variance trade-off
- Play with Applet
