ESL Chap2 — Overview of Supervised Learning
Trevor Hastie

Overview of Supervised Learning

Notation

• X: inputs, feature vector, predictors, independent variables. Generally X will be a vector of p real values. Qualitative features are coded in X using, for example, dummy variables. Sample values of X are generally written in lower case; $x_i$ is the ith of N sample values. Bold $\mathbf{X}$ represents a matrix of feature values, with an $x_i$ in each row.

• Y : output, response, dependent variable. Typically a scalar, can be a vector, of real values. Again $y_i$ is a realized value.

• G: a qualitative response, taking values in a discrete set $\mathcal{G}$; e.g. $\mathcal{G} = \{\text{survived}, \text{died}\}$. We often code G via a binary indicator response vector Y.

[Figure: "Raw Data with a Binary Response" — scatter plot of the training points in the $(X_1, X_2)$ plane.]

200 points generated in $\mathbb{R}^2$ from an unknown distribution; 100 in each of two classes $\mathcal{G} = \{\text{GREEN}, \text{RED}\}$. Can we build a rule to predict the color of future points?

Linear regression

• Code Y = 1 if G = RED, else Y = 0.

• We model Y as a linear function of X:

$$\hat Y = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j = X^T \hat\beta$$

• Obtain $\hat\beta$ by least squares, by minimizing the quadratic criterion:

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T\beta)^2$$

• Given an N × p model matrix X and a response vector y,

$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

(Drop the separate $\beta_0$ by making the first column of $\mathbf{X}$ equal to 1, so the intercept is absorbed into $\beta$.)
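
The slides contain no code, but a minimal NumPy sketch of this recipe may help: intercept absorbed via a column of ones, $\hat\beta$ by least squares, prediction thresholded at 0.5. The simulated data here are only an illustrative stand-in for the GREEN/RED example, not the book's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the two-class data: 100 points per class in R^2.
X_green = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(100, 2))
X_red = rng.normal(loc=[0.0, 1.0], scale=1.0, size=(100, 2))
X = np.vstack([X_green, X_red])
y = np.r_[np.zeros(100), np.ones(100)]        # code Y = 1 if RED, else 0

# Absorb the intercept: first column of the model matrix is all ones.
Xm = np.column_stack([np.ones(len(X)), X])

# Least-squares coefficients, beta_hat = (X^T X)^{-1} X^T y,
# computed via lstsq for numerical stability.
beta_hat, *_ = np.linalg.lstsq(Xm, y, rcond=None)

# Predict at a future point x0 and classify with the 0.5 threshold.
x0 = np.array([1.0, 0.5, 0.5])                # [1, x1, x2]
y_hat = x0 @ beta_hat
g_hat = "RED" if y_hat > 0.5 else "GREEN"
print(beta_hat, y_hat, g_hat)
```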

3

Page 4: Overview of Supervised Learning - Stanford Universitystatweb.stanford.edu/~tibs/stat315a/LECTURES/chap2.pdf · ESL Chap2 — Overview of Supervised Learning Trevor Hastie Possible

ESL Chap2 — Overview of Supervised Learning Trevor Hastie

• Prediction at a future point $x_0$ is $\hat Y(x_0) = x_0^T\hat\beta$. Also

$$\hat G(x_0) = \begin{cases} \text{RED} & \text{if } \hat Y(x_0) > 0.5, \\ \text{GREEN} & \text{if } \hat Y(x_0) \le 0.5. \end{cases}$$

• The decision boundary $\{x \mid x^T\hat\beta = 0.5\}$ is linear (Figure 2.1), and seems to make many errors on the training data.

Possible scenarios

Scenario 1: The data in each class are generated from a Gaussian distribution with uncorrelated components, same variances, and different means.

Scenario 2: The data in each class are generated from a mixture of 10 Gaussians.

For Scenario 1, the linear regression rule is almost optimal (Chapter 4).

For Scenario 2, it is far too rigid.

K-Nearest Neighbors

A natural way to classify a new point is to have a look at its neighbors, and take a vote:

$$\hat Y_k(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i,$$

where $N_k(x)$ is a neighborhood of x that contains exactly k neighbors (the k-nearest neighborhood).

If there is a clear dominance of one of the classes in the neighborhood of an observation x, then it is likely that the observation itself belongs to that class, too. Thus the classification rule is majority voting among the members of $N_k(x)$. As before,

$$\hat G_k(x_0) = \begin{cases} \text{RED} & \text{if } \hat Y_k(x_0) > 0.5, \\ \text{GREEN} & \text{if } \hat Y_k(x_0) \le 0.5. \end{cases}$$
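
A minimal sketch of this average-and-vote rule in plain NumPy (the toy training data below are placeholders, not the simulation example):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=15):
    """k-NN estimate Y_hat_k(x0): average of y over the k nearest x_i."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]           # indices of N_k(x0)
    y_hat = y_train[nearest].mean()           # fraction of RED (y = 1) neighbors
    g_hat = "RED" if y_hat > 0.5 else "GREEN"
    return y_hat, g_hat

# Toy usage with the same 0/1 coding as before (y = 1 for RED).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 2))
y_train = (rng.random(200) < 0.5).astype(float)
print(knn_predict(X_train, y_train, x0=np.array([0.0, 0.0]), k=15))
```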

Figure 2.2 shows the result of 15-nearest-neighbor classification. Fewer training data are misclassified, and the decision boundary adapts to the local densities of the classes.

Figure 2.3 shows the result of 1-nearest-neighbor classification. None of the training data are misclassified.

Linear regression uses 3 parameters to describe its fit. Does k-nearest neighbors use 1 (the value of k here)?

More realistically, k-nearest neighbors uses an effective number of parameters of about N/k.

Many modern procedures are variants of linear regression and K-nearest neighbors:

• Kernel smoothers

• Local linear regression

• Linear basis expansions

• Projection pursuit and neural networks

Linear regression vs k-nearest neighbors?

First we expose the oracle. The density for each class was an equal mixture of 10 Gaussians. For the GREEN class, its 10 means were generated from a $N((1, 0)^T, \mathbf{I})$ distribution (and considered fixed). For the RED class, the 10 means were generated from a $N((0, 1)^T, \mathbf{I})$. The within-cluster variances were 1/5.

See page 17 for more details, or the book website for the actual data.
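
A sketch of how data like this could be generated (illustrative NumPy code, not the book's actual dataset; the seed and the helper name `sample_class` are my own):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_class(centre, n_points, n_means=10, within_sd=np.sqrt(1 / 5)):
    """Draw 10 cluster means around `centre` (kept fixed), then sample points
    by picking a cluster at random and adding N(0, I/5) noise."""
    means = rng.normal(loc=centre, scale=1.0, size=(n_means, 2))
    which = rng.integers(0, n_means, size=n_points)
    return rng.normal(loc=means[which], scale=within_sd), means

X_green, means_green = sample_class(centre=[1.0, 0.0], n_points=100)
X_red, means_red = sample_class(centre=[0.0, 1.0], n_points=100)
```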

Figure 2.4 shows the results of classifying 10,000 test observations generated from this distribution.

The Bayes Error is the best performance possible (Figure 2.5).

Statistical decision theory

Case 1: Quantitative output Y

• Let $X \in \mathbb{R}^p$ denote a real-valued random input vector.

• We have a loss function L(Y, f(X)) for penalizing errors in prediction.

• Most common and convenient is squared error loss: $L(Y, f(X)) = (Y - f(X))^2$.

• This leads us to a criterion for choosing f,

$$\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2,$$

the Expected (squared) Prediction Error. Conditioning on $X = x$ and adding and subtracting $\mathrm{E}(Y \mid x)$ inside the square,

$$\mathrm{EPE}(f) = \mathrm{E}_x\,\mathrm{E}_{Y|x}\Big[\big((Y - \mathrm{E}(Y \mid x)) + (\mathrm{E}(Y \mid x) - f(x))\big)^2\Big]$$

$$= \mathrm{E}_x\big[\mathrm{E}_{Y|x}(Y - \mathrm{E}(Y \mid x))^2\big] + \mathrm{E}_x\big[(\mathrm{E}(Y \mid x) - f(x))^2\big]$$

$$= \text{Bayes error} + \text{MSE} \qquad (1)$$

• Minimizing EPE(f) leads to a solution $f(x) = \mathrm{E}(Y \mid X = x)$, the conditional expectation, also known as the regression function.

Case 2: Qualitative output G:

• Suppose our prediction rule is $\hat G(X)$, and G and $\hat G(X)$ take values in $\mathcal{G}$, with $\mathrm{card}(\mathcal{G}) = K$.

• We have a different loss function for penalizing prediction errors. $L(k, \ell)$ is the price paid for classifying an observation belonging to class $\mathcal{G}_k$ as $\mathcal{G}_\ell$.

• Most often we use the 0-1 loss function where all misclassificationsare charged a single unit.

• The expected prediction error is

$$\mathrm{EPE} = \mathrm{E}\big[L(G, \hat G(X))\big]$$

• Solution is

$$\hat G(x) = \operatorname*{argmin}_{g \in \mathcal{G}} \sum_{k=1}^{K} L(\mathcal{G}_k, g)\, P(\mathcal{G}_k \mid X = x)$$

With the 0-1 loss function this simplifies to

$$\hat G(x) = \mathcal{G}_k \quad \text{if } P(\mathcal{G}_k \mid X = x) = \max_{g \in \mathcal{G}} P(g \mid X = x)$$

This is known as the Bayes classifier. It just says that we should pick the class having maximum probability at the input x.

Question: how did we construct the Bayes classifier for our simulation example?
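
One way to see the construction in code (a sketch under the assumption that the oracle knows the fixed cluster means, e.g. the `means_green` and `means_red` arrays from the generation sketch earlier): with equal class priors, classify to the class whose known mixture density is larger at x. The Bayes error is then the misclassification rate of this rule on fresh data from the same mixtures.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, means, within_var=1 / 5):
    """Equal mixture of 10 Gaussians with covariance (1/5) * I."""
    return np.mean([multivariate_normal(mean=m, cov=within_var * np.eye(2)).pdf(x)
                    for m in means])

def bayes_rule(x, means_green, means_red):
    # Equal class priors, so compare the class-conditional densities directly.
    p_green = mixture_density(x, means_green)
    p_red = mixture_density(x, means_red)
    return "RED" if p_red > p_green else "GREEN"
```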

• K-nn tries to implement conditional expectations directly, by

– Approximating expectations by sample averages

– Relaxing the notion of conditioning at a point, to conditioning in a region about a point.

• As $N, k \to \infty$ such that $k/N \to 0$, the K-nearest neighbor estimate $\hat f(x) \to \mathrm{E}(Y \mid X = x)$ — it is consistent.

• Linear regression assumes a (linear) structural form for $f(x) = x^T\beta$, and minimizes the sample version of EPE directly.

• As the sample size grows, our estimate $\hat\beta$ of the linear coefficients converges to the optimal $\beta_{\mathrm{opt}} = \mathrm{E}(XX^T)^{-1}\mathrm{E}(XY)$.

• Model is limited by the linearity assumption

Question: Why not always use k-nearest neighbors?

Curse of dimensionality

K-nearest neighbors can fail in high dimensions, because it becomes difficult to gather K observations close to a target point $x_0$:

• near neighborhoods tend to be spatially large, and estimates are biased.

• reducing the spatial size of the neighborhood means reducing K, and the variance of the estimate increases.

See Figure 2.6.

• Most points are at the boundary

• Sampling density is proportional to $N^{1/p}$; if 100 points are sufficient to estimate a function in $\mathbb{R}^1$, $100^{10}$ are needed to achieve similar accuracy in $\mathbb{R}^{10}$.

Example 1

• 1000 training examples $x_i$ generated uniformly on $[-1, 1]^p$.

• $Y = f(X) = e^{-8\|X\|^2}$ (no measurement error).

• Use the 1-nearest-neighbor rule to predict $y_0$ at the test point $x_0 = 0$.

$$\begin{aligned}
\mathrm{MSE}(x_0) &= \mathrm{E}_{\mathcal{T}}\big[f(x_0) - \hat y_0\big]^2 \\
&= \mathrm{E}_{\mathcal{T}}\big[\hat y_0 - \mathrm{E}_{\mathcal{T}}(\hat y_0)\big]^2 + \big[\mathrm{E}_{\mathcal{T}}(\hat y_0) - f(x_0)\big]^2 \\
&= \mathrm{Var}_{\mathcal{T}}(\hat y_0) + \mathrm{Bias}^2(\hat y_0).
\end{aligned}$$

Figure 2.7 shows what happens as p increases.

Figure 2.8 is the same, except here $f(X) = \tfrac{1}{2}(X_1 + 1)^3$. Here we have

• $\mathrm{E}\,\hat f(x_0) = \mathrm{E}\, f(X_{(1)}) \approx f(\mathrm{E}\, X_{(1)}) = f(x_0)$ for all dimensions

• $\mathrm{Var}\,\hat f(x_0) = \mathrm{Var}\, f(X_{(1)})$ increases with dimension
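
A small Monte Carlo sketch along the lines of Example 1 (not the book's code; the number of repetitions is an arbitrary choice) shows the bias term taking over as p grows:

```python
import numpy as np

rng = np.random.default_rng(3)

def one_nn_at_origin(p, n=1000, reps=200):
    """Monte Carlo estimate of the variance and squared bias of the
    1-NN prediction of f at x0 = 0, for f(X) = exp(-8 ||X||^2)."""
    preds = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(-1, 1, size=(n, p))
        y = np.exp(-8 * np.sum(X**2, axis=1))   # Y = f(X), no measurement error
        preds[r] = y[np.argmin(np.linalg.norm(X, axis=1))]   # 1-NN prediction
    f_x0 = 1.0                                   # f(0) = 1
    return preds.var(), (preds.mean() - f_x0) ** 2

for p in (1, 2, 5, 10):
    var, bias2 = one_nn_at_origin(p)
    print(f"p={p:2d}  Var={var:.4f}  Bias^2={bias2:.4f}  MSE={var + bias2:.4f}")
```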

Example 2

If the linear model is correct, or almost correct, K-nearest neighbors will do much worse than linear regression.

In cases like this (and of course, assuming we know this is the case), simple linear regression methods are not affected by the dimension.

Figure 2.9 illustrates two simple cases.

Statistical Models

Y = f(X) + ε

with E(ε) = 0 and X and ε independent.

• $\mathrm{E}(Y \mid X) = f(X)$

• $\Pr(Y \mid X)$ depends on X only through f(X).

• Useful approximation to the truth — all unmeasured variables captured by ε

• N realizations $y_i = f(x_i) + \varepsilon_i$, $i = 1, \ldots, N$

• Assume $\varepsilon_i$ and $\varepsilon_j$ are independent.

More generally can have, for example, Var(Y |X) = σ2(X).

For qualitative outcomes, $\{\Pr(G = \mathcal{G}_k \mid X)\}_{1}^{K} = p(X)$, which we model directly.

Function Approximation

$$\mathrm{RSS}(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2$$

Assumes

• $(x_i, y_i)$ are points in, say, $\mathbb{R}^{p+1}$.

• A (parametric or non-parametric) form for f(X)

• A loss function for measuring the quality of the approximation.

Figure 2.10 illustrates the situation.

More generally, Maximum Likelihood Estimation provides a natural basis for estimation. We will see examples such as logistic regression via the binomial likelihood.

Structured Regression Models

$$\mathrm{RSS}(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2$$

• Any function passing through the $(x_i, y_i)$ has RSS = 0

• Need to restrict the class

• Usually restrictions impose local behavior — see equivalent kernels in chapters 5 and 6

• Any method that attempts to approximate locally varying functions is “cursed”

• Alternatively, any method that “overcomes” the curse assumes an implicit metric that does not allow neighborhoods to be simultaneously small in all directions.

Classes of Restricted Estimators

Some of the classes of restricted methods that we cover are

• Roughness Penalty and Bayesian Methods (chap 5)

$$\text{Penalized RSS}(f, \lambda) = \mathrm{RSS}(f) + \lambda J(f)$$

• Basis functions and dictionary methods (chap 5)

$$f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x)$$

• Kernel Methods and Local Regression (chap 6)

$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - f_\theta(x_i))^2$$
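
To make the kernel-weighted criterion concrete, here is a one-dimensional sketch with a Gaussian kernel and the simplest choice $f_\theta \equiv \theta$ (a local constant, i.e. a Nadaraya–Watson-style average rather than any particular method from chapter 6; the bandwidth `lam` and the toy data are illustrative):

```python
import numpy as np

def local_constant_fit(x0, x, y, lam=0.2):
    """Minimize sum_i K_lambda(x0, x_i) (y_i - theta)^2 over a constant theta;
    the minimizer is the kernel-weighted average of the y_i."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)   # Gaussian kernel K_lambda(x0, x_i)
    return np.sum(w * y) / np.sum(w)

# Toy usage: noisy sine curve, fit at a few target points x0.
rng = np.random.default_rng(4)
x = rng.uniform(0, 3, size=200)
y = np.sin(2 * x) + 0.3 * rng.normal(size=200)
print([round(local_constant_fit(x0, x, y), 3) for x0 in np.linspace(0, 3, 7)])
```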

Model Selection and the Bias-Variance Tradeoff

Many of the flexible methods have a complexity parameter:

• the multiplier of the penalty term

• the width of the kernel

• the number of basis functions

Cannot use RSS to determine this parameter — why?

Can use Prediction error on unseen test cases to guide us

E.g. $Y = f(X) + \varepsilon$, K-nn (and assume the sample $x_i$ are fixed):

$$\begin{aligned}
\mathrm{E}\big[(Y - \hat f_k(x_0))^2 \mid X = x_0\big] &= \sigma^2 + \mathrm{Bias}^2(\hat f_k(x_0)) + \mathrm{Var}_{\mathcal{T}}(\hat f_k(x_0)) \\
&= \sigma^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma^2}{k}
\end{aligned}$$

Selecting k is a bias-variance tradeoff — see Figure 2.11.
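
A sketch that evaluates the two terms of this decomposition for a made-up one-dimensional example (the true function, noise level, and design points are my own illustrative choices, not Figure 2.11):

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):                                        # assumed "true" regression function
    return np.sin(4 * x)

sigma = 0.5                                      # noise standard deviation
x_train = np.sort(rng.uniform(0, 1, size=100))   # fixed design points
x0 = 0.5

for k in (1, 5, 15, 50):
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    bias2 = (f(x0) - f(x_train[nearest]).mean()) ** 2   # [f(x0) - (1/k) sum f(x_(l))]^2
    var = sigma**2 / k                                   # sigma^2 / k
    print(f"k={k:3d}  Bias^2={bias2:.4f}  Var={var:.4f}  "
          f"EPE={sigma**2 + bias2 + var:.4f}")
```

As k grows the variance term shrinks while the squared bias grows, which is exactly the tradeoff the slide describes.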
