ESL Chap2 — Overview of Supervised Learning
Trevor Hastie

Overview of Supervised Learning

Notation

• X: inputs, feature vector, predictors, independent variables. Generally X will be a vector of p real values. Qualitative features are coded in X using, for example, dummy variables. Sample values of X are generally written in lower case; $x_i$ is the ith of N sample values. Bold $\mathbf{X}$ represents a matrix of feature values, with an $x_i$ in each row.

• Y : output, response, dependent variable. Typically a scalar, can be a vector, of real values. Again $y_i$ is a realized value.

• G: a qualitative response, taking values in a discrete set $\mathcal{G}$; e.g. $\mathcal{G} = \{\text{survived}, \text{died}\}$. We often code G via a binary indicator response vector Y.

[Figure: "Raw Data with a Binary Response" — scatter plot of the training points in the $(X_1, X_2)$ plane.]

200 points generated in $\mathbb{R}^2$ from an unknown distribution; 100 in each of two classes $\mathcal{G} = \{\text{GREEN}, \text{RED}\}$. Can we build a rule to predict the color of future points?

Linear regression

• Code Y = 1 if G = RED, else Y = 0.

• We model Y as a linear function of X:

$$\hat Y = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j = X^T \hat\beta$$

• Obtain $\hat\beta$ by least squares, by minimizing the quadratic criterion:

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T\beta)^2$$

• Given an N × p model matrix X and a response vector y,

$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

(Drop the separate $\beta_0$ by making the first column of $\mathbf{X}$ equal to 1, so the intercept is absorbed into $\beta$.)
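
The slides contain no code, but a minimal NumPy sketch of this recipe may help: intercept absorbed via a column of ones, $\hat\beta$ by least squares, prediction thresholded at 0.5. The simulated data here are only an illustrative stand-in for the GREEN/RED example, not the book's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the two-class data: 100 points per class in R^2.
X_green = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(100, 2))
X_red = rng.normal(loc=[0.0, 1.0], scale=1.0, size=(100, 2))
X = np.vstack([X_green, X_red])
y = np.r_[np.zeros(100), np.ones(100)]        # code Y = 1 if RED, else 0

# Absorb the intercept: first column of the model matrix is all ones.
Xm = np.column_stack([np.ones(len(X)), X])

# Least-squares coefficients, beta_hat = (X^T X)^{-1} X^T y,
# computed via lstsq for numerical stability.
beta_hat, *_ = np.linalg.lstsq(Xm, y, rcond=None)

# Predict at a future point x0 and classify with the 0.5 threshold.
x0 = np.array([1.0, 0.5, 0.5])                # [1, x1, x2]
y_hat = x0 @ beta_hat
g_hat = "RED" if y_hat > 0.5 else "GREEN"
print(beta_hat, y_hat, g_hat)
```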

3

Page 4: Overview of Supervised Learning - Stanford Universitystatweb.stanford.edu/~tibs/stat315a/LECTURES/chap2.pdf · ESL Chap2 — Overview of Supervised Learning Trevor Hastie Possible

ESL Chap2 — Overview of Supervised Learning Trevor Hastie

• Prediction at a future point $x_0$ is $\hat Y(x_0) = x_0^T\hat\beta$. Also

$$\hat G(x_0) = \begin{cases} \text{RED} & \text{if } \hat Y(x_0) > 0.5, \\ \text{GREEN} & \text{if } \hat Y(x_0) \le 0.5. \end{cases}$$

• The decision boundary $\{x \mid x^T\hat\beta = 0.5\}$ is linear (Figure 2.1), and seems to make many errors on the training data.

Possible scenarios

Scenario 1: The data in each class are generated from a Gaussian distribution with uncorrelated components, same variances, and different means.

Scenario 2: The data in each class are generated from a mixture of 10 Gaussians.

For Scenario 1, the linear regression rule is almost optimal (Chapter 4).

For Scenario 2, it is far too rigid.

K-Nearest Neighbors

A natural way to classify a new point is to have a look at its neighbors, and take a vote:

$$\hat Y_k(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i,$$

where $N_k(x)$ is a neighborhood of x that contains exactly k neighbors (the k-nearest neighborhood).

If there is a clear dominance of one of the classes in the neighborhood of an observation x, then it is likely that the observation itself belongs to that class, too. Thus the classification rule is majority voting among the members of $N_k(x)$. As before,

$$\hat G_k(x_0) = \begin{cases} \text{RED} & \text{if } \hat Y_k(x_0) > 0.5, \\ \text{GREEN} & \text{if } \hat Y_k(x_0) \le 0.5. \end{cases}$$
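
A minimal sketch of this average-and-vote rule in plain NumPy (the toy training data below are placeholders, not the simulation example):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=15):
    """k-NN estimate Y_hat_k(x0): average of y over the k nearest x_i."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]           # indices of N_k(x0)
    y_hat = y_train[nearest].mean()           # fraction of RED (y = 1) neighbors
    g_hat = "RED" if y_hat > 0.5 else "GREEN"
    return y_hat, g_hat

# Toy usage with the same 0/1 coding as before (y = 1 for RED).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 2))
y_train = (rng.random(200) < 0.5).astype(float)
print(knn_predict(X_train, y_train, x0=np.array([0.0, 0.0]), k=15))
```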

Figure 2.2 shows the result of 15-nearest-neighbor classification. Fewer training data are misclassified, and the decision boundary adapts to the local densities of the classes.

Figure 2.3 shows the result of 1-nearest-neighbor classification. None of the training data are misclassified.

Linear regression uses 3 parameters to describe its fit. Does k-nearest neighbors use 1 (the value of k here)?

More realistically, k-nearest neighbors uses an effective number of parameters of about N/k.

Many modern procedures are variants of linear regression and K-nearest neighbors:

• Kernel smoothers

• Local linear regression

• Linear basis expansions

• Projection pursuit and neural networks

Linear regression vs k-nearest neighbors?

First we expose the oracle. The density for each class was an equal mixture of 10 Gaussians. For the GREEN class, its 10 means were generated from a $N((1, 0)^T, \mathbf{I})$ distribution (and considered fixed). For the RED class, the 10 means were generated from a $N((0, 1)^T, \mathbf{I})$. The within-cluster variances were 1/5.

See page 17 for more details, or the book website for the actual data.
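
A sketch of how data like this could be generated (illustrative NumPy code, not the book's actual dataset; the seed and the helper name `sample_class` are my own):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_class(centre, n_points, n_means=10, within_sd=np.sqrt(1 / 5)):
    """Draw 10 cluster means around `centre` (kept fixed), then sample points
    by picking a cluster at random and adding N(0, I/5) noise."""
    means = rng.normal(loc=centre, scale=1.0, size=(n_means, 2))
    which = rng.integers(0, n_means, size=n_points)
    return rng.normal(loc=means[which], scale=within_sd), means

X_green, means_green = sample_class(centre=[1.0, 0.0], n_points=100)
X_red, means_red = sample_class(centre=[0.0, 1.0], n_points=100)
```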

Figure 2.4 shows the results of classifying 10,000 test observations generated from this distribution.

The Bayes Error is the best performance possible (Figure 2.5).

Statistical decision theory

Case 1: Quantitative output Y

• Let $X \in \mathbb{R}^p$ denote a real-valued random input vector.

• We have a loss function L(Y, f(X)) for penalizing errors in prediction.

• Most common and convenient is squared error loss: $L(Y, f(X)) = (Y - f(X))^2$.

• This leads us to a criterion for choosing f,

$$\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2,$$

the Expected (squared) Prediction Error. Conditioning on $X = x$ and adding and subtracting $\mathrm{E}(Y \mid x)$ inside the square,

$$\mathrm{EPE}(f) = \mathrm{E}_x\,\mathrm{E}_{Y|x}\Big[\big((Y - \mathrm{E}(Y \mid x)) + (\mathrm{E}(Y \mid x) - f(x))\big)^2\Big]$$

$$= \mathrm{E}_x\big[\mathrm{E}_{Y|x}(Y - \mathrm{E}(Y \mid x))^2\big] + \mathrm{E}_x\big[(\mathrm{E}(Y \mid x) - f(x))^2\big]$$

$$= \text{Bayes error} + \text{MSE} \qquad (1)$$

• Minimizing EPE(f) leads to a solution $f(x) = \mathrm{E}(Y \mid X = x)$, the conditional expectation, also known as the regression function.

Case 2: Qualitative output G:

• Suppose our prediction rule is $\hat G(X)$, and G and $\hat G(X)$ take values in $\mathcal{G}$, with $\mathrm{card}(\mathcal{G}) = K$.

• We have a different loss function for penalizing prediction errors. $L(k, \ell)$ is the price paid for classifying an observation belonging to class $\mathcal{G}_k$ as $\mathcal{G}_\ell$.

• Most often we use the 0-1 loss function where all misclassificationsare charged a single unit.

• The expected prediction error is

$$\mathrm{EPE} = \mathrm{E}\big[L(G, \hat G(X))\big]$$

• Solution is

$$\hat G(x) = \operatorname*{argmin}_{g \in \mathcal{G}} \sum_{k=1}^{K} L(\mathcal{G}_k, g)\, P(\mathcal{G}_k \mid X = x)$$

With the 0-1 loss function this simplifies to

$$\hat G(x) = \mathcal{G}_k \quad \text{if } P(\mathcal{G}_k \mid X = x) = \max_{g \in \mathcal{G}} P(g \mid X = x)$$

This is known as the Bayes classifier. It just says that we should pick the class having maximum probability at the input x.

Question: how did we construct the Bayes classifier for our simulation example?
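
One way to see the construction in code (a sketch under the assumption that the oracle knows the fixed cluster means, e.g. the `means_green` and `means_red` arrays from the generation sketch earlier): with equal class priors, classify to the class whose known mixture density is larger at x. The Bayes error is then the misclassification rate of this rule on fresh data from the same mixtures.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, means, within_var=1 / 5):
    """Equal mixture of 10 Gaussians with covariance (1/5) * I."""
    return np.mean([multivariate_normal(mean=m, cov=within_var * np.eye(2)).pdf(x)
                    for m in means])

def bayes_rule(x, means_green, means_red):
    # Equal class priors, so compare the class-conditional densities directly.
    p_green = mixture_density(x, means_green)
    p_red = mixture_density(x, means_red)
    return "RED" if p_red > p_green else "GREEN"
```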

• K-nn tries to implement conditional expectations directly, by

– Approximating expectations by sample averages

– Relaxing the notion of conditioning at a point, to conditioning in a region about a point.

• As $N, k \to \infty$ such that $k/N \to 0$, the K-nearest neighbor estimate $\hat f(x) \to \mathrm{E}(Y \mid X = x)$ — it is consistent.

• Linear regression assumes a (linear) structural form for $f(x) = x^T\beta$, and minimizes the sample version of EPE directly.

• As the sample size grows, our estimate $\hat\beta$ of the linear coefficients converges to the optimal $\beta_{\mathrm{opt}} = \mathrm{E}(XX^T)^{-1}\mathrm{E}(XY)$.

• Model is limited by the linearity assumption

Question: Why not always use k-nearest neighbors?

Curse of dimensionality

K-nearest neighbors can fail in high dimensions, because it becomes difficult to gather K observations close to a target point $x_0$:

• near neighborhoods tend to be spatially large, and estimates are biased.

• reducing the spatial size of the neighborhood means reducing K, and the variance of the estimate increases.

See Figure 2.6.

• Most points are at the boundary

• Sampling density is proportional to $N^{1/p}$; if 100 points are sufficient to estimate a function in $\mathbb{R}^1$, $100^{10}$ are needed to achieve similar accuracy in $\mathbb{R}^{10}$.

Example 1

• 1000 training examples $x_i$ generated uniformly on $[-1, 1]^p$.

• $Y = f(X) = e^{-8\|X\|^2}$ (no measurement error).

• Use the 1-nearest-neighbor rule to predict $y_0$ at the test point $x_0 = 0$.

$$\begin{aligned}
\mathrm{MSE}(x_0) &= \mathrm{E}_{\mathcal{T}}\big[f(x_0) - \hat y_0\big]^2 \\
&= \mathrm{E}_{\mathcal{T}}\big[\hat y_0 - \mathrm{E}_{\mathcal{T}}(\hat y_0)\big]^2 + \big[\mathrm{E}_{\mathcal{T}}(\hat y_0) - f(x_0)\big]^2 \\
&= \mathrm{Var}_{\mathcal{T}}(\hat y_0) + \mathrm{Bias}^2(\hat y_0).
\end{aligned}$$

Figure 2.7 shows what happens as p increases.

Figure 2.8 is the same, except here $f(X) = \tfrac{1}{2}(X_1 + 1)^3$. Here we have

• $\mathrm{E}\,\hat f(x_0) = \mathrm{E}\, f(X_{(1)}) \approx f(\mathrm{E}\, X_{(1)}) = f(x_0)$ for all dimensions

• $\mathrm{Var}\,\hat f(x_0) = \mathrm{Var}\, f(X_{(1)})$ increases with dimension
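
A small Monte Carlo sketch along the lines of Example 1 (not the book's code; the number of repetitions is an arbitrary choice) shows the bias term taking over as p grows:

```python
import numpy as np

rng = np.random.default_rng(3)

def one_nn_at_origin(p, n=1000, reps=200):
    """Monte Carlo estimate of the variance and squared bias of the
    1-NN prediction of f at x0 = 0, for f(X) = exp(-8 ||X||^2)."""
    preds = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(-1, 1, size=(n, p))
        y = np.exp(-8 * np.sum(X**2, axis=1))   # Y = f(X), no measurement error
        preds[r] = y[np.argmin(np.linalg.norm(X, axis=1))]   # 1-NN prediction
    f_x0 = 1.0                                   # f(0) = 1
    return preds.var(), (preds.mean() - f_x0) ** 2

for p in (1, 2, 5, 10):
    var, bias2 = one_nn_at_origin(p)
    print(f"p={p:2d}  Var={var:.4f}  Bias^2={bias2:.4f}  MSE={var + bias2:.4f}")
```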

Example 2

If the linear model is correct, or almost correct, K-nearest neighbors will do much worse than linear regression.

In cases like this (and of course, assuming we know this is the case), simple linear regression methods are not affected by the dimension.

Figure 2.9 illustrates two simple cases.

Statistical Models

Y = f(X) + ε

with E(ε) = 0 and X and ε independent.

• $\mathrm{E}(Y \mid X) = f(X)$

• $\Pr(Y \mid X)$ depends on X only through f(X).

• Useful approximation to the truth — all unmeasured variables captured by ε

• N realizations $y_i = f(x_i) + \varepsilon_i$, $i = 1, \ldots, N$

• Assume $\varepsilon_i$ and $\varepsilon_j$ are independent.

More generally can have, for example, Var(Y |X) = σ2(X).

For qualitative outcomes, $\{\Pr(G = \mathcal{G}_k \mid X)\}_{1}^{K} = p(X)$, which we model directly.

Function Approximation

$$\mathrm{RSS}(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2$$

Assumes

• $(x_i, y_i)$ are points in, say, $\mathbb{R}^{p+1}$.

• A (parametric or non-parametric) form for f(X)

• A loss function for measuring the quality of the approximation.

Figure 2.10 illustrates the situation.

More generally, Maximum Likelihood Estimation provides a natural basis for estimation. We will see examples such as logistic regression via the binomial likelihood.

Structured Regression Models

$$\mathrm{RSS}(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2$$

• Any function passing through the $(x_i, y_i)$ has RSS = 0

• Need to restrict the class

• Usually restrictions impose local behavior — see equivalent kernels in chapters 5 and 6

• Any method that attempts to approximate locally varying functions is “cursed”

• Alternatively, any method that “overcomes” the curse assumes an implicit metric that does not allow neighborhoods to be simultaneously small in all directions.

Classes of Restricted Estimators

Some of the classes of restricted methods that we cover are

• Roughness Penalty and Bayesian Methods (chap 5)

$$\text{Penalized RSS}(f, \lambda) = \mathrm{RSS}(f) + \lambda J(f)$$

• Basis functions and dictionary methods (chap 5)

$$f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x)$$

• Kernel Methods and Local Regression (chap 6)

$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - f_\theta(x_i))^2$$
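
To make the kernel-weighted criterion concrete, here is a one-dimensional sketch with a Gaussian kernel and the simplest choice $f_\theta \equiv \theta$ (a local constant, i.e. a Nadaraya–Watson-style average rather than any particular method from chapter 6; the bandwidth `lam` and the toy data are illustrative):

```python
import numpy as np

def local_constant_fit(x0, x, y, lam=0.2):
    """Minimize sum_i K_lambda(x0, x_i) (y_i - theta)^2 over a constant theta;
    the minimizer is the kernel-weighted average of the y_i."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)   # Gaussian kernel K_lambda(x0, x_i)
    return np.sum(w * y) / np.sum(w)

# Toy usage: noisy sine curve, fit at a few target points x0.
rng = np.random.default_rng(4)
x = rng.uniform(0, 3, size=200)
y = np.sin(2 * x) + 0.3 * rng.normal(size=200)
print([round(local_constant_fit(x0, x, y), 3) for x0 in np.linspace(0, 3, 7)])
```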

Model Selection and the Bias-Variance Tradeoff

Many of the flexible methods have a complexity parameter:

• the multiplier of the penalty term

• the width of the kernel

• the number of basis functions

Cannot use RSS to determine this parameter — why?

Can use Prediction error on unseen test cases to guide us

E.g. $Y = f(X) + \varepsilon$, K-nn (and assume the sample $x_i$ are fixed):

$$\begin{aligned}
\mathrm{E}\big[(Y - \hat f_k(x_0))^2 \mid X = x_0\big] &= \sigma^2 + \mathrm{Bias}^2(\hat f_k(x_0)) + \mathrm{Var}_{\mathcal{T}}(\hat f_k(x_0)) \\
&= \sigma^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma^2}{k}
\end{aligned}$$

Selecting k is a bias-variance tradeoff — see Figure 2.11.
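
A sketch that evaluates the two terms of this decomposition for a made-up one-dimensional example (the true function, noise level, and design points are my own illustrative choices, not Figure 2.11):

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):                                        # assumed "true" regression function
    return np.sin(4 * x)

sigma = 0.5                                      # noise standard deviation
x_train = np.sort(rng.uniform(0, 1, size=100))   # fixed design points
x0 = 0.5

for k in (1, 5, 15, 50):
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    bias2 = (f(x0) - f(x_train[nearest]).mean()) ** 2   # [f(x0) - (1/k) sum f(x_(l))]^2
    var = sigma**2 / k                                   # sigma^2 / k
    print(f"k={k:3d}  Bias^2={bias2:.4f}  Var={var:.4f}  "
          f"EPE={sigma**2 + bias2 + var:.4f}")
```

As k grows the variance term shrinks while the squared bias grows, which is exactly the tradeoff the slide describes.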
