
Chap 1. Overview of Statistical Learning (HTF, 2.1 - 2.6, 2.9)

Yongdai Kim

Seoul National University


0. Learning vs Statistical learning

• Learning procedure

– Construct a claim by observing data or using logic

– Perform experiments

– Make conclusion

• Statistical learning procedure

– Collect data

– Analyze the data

– Find new rules

“Let the data tell us something.”


• Why is “statistical learning” necessary?

– We already know most of the rules that our brains can imagine.

– Life (nature, socio-economic status, human behavior, biology, etc.) is more complex than we had thought.

– Our world is changing too fast for us to keep up using logic alone.

– Due to digitalization, the amount of data is increasing very fast.

– Most of the information in these huge data sets remains undiscovered.

• Sample questions

– What are the risk factors for heart failure?

– Are there genes which characterize differences between various

races?

– How does the stock market behave?

– Which chemical compounds are effective for a specific disease?


– Who are valuable customers for our company?

– What are the influential factors for changing the amount of

ozone?

– Are there patterns in the content of spam mails?

• In statistical learning, the common objective is to find causes for

a given phenomenon.

• One of the common features of the problems is that the set of

possible causes we can think of is very large.

• With such a large set of candidate causes, the learning procedure above (claim, experiment, conclusion) suffers from time limitations unless we are lucky.


Machine learning vs Statistical learning (personal view)

• Machine learning is a method to educate a machine (computer).

• Two kinds of tasks

– Tasks without errors (e.g., rule-based learning)

– Tasks with errors

• Statistical learning is a subset of machine learning, which deals

with tasks with errors.


Statistical view of statistical learning

• Analysis of ultra-high dimensional data

• Methods to overcome the “curse of dimensionality”


Supervised and Unsupervised Learning

• Supervised learning

– Use the inputs to predict the values of the outputs

– Examples: Regression and Classification

• Unsupervised learning

– Use only the inputs to describe the data

– Examples: Clustering, PCA


1. Basic set-up of Supervised learning

• Input (covariate): $x \in \mathbb{R}^p$

• Output (response): $y \in \mathcal{Y}$

• System (model): $y = \phi(x, \epsilon)$

• Loss function: $l(y, a)$

• Assumption: $f$ belongs to a family of functions $\mathcal{F}$.

• Learning set (data): $\mathcal{L} = \{(y_i, x_i),\ i = 1, \ldots, n\}$, assumed to be a random sample from $(Y, X) \sim P$.

• Objective: find $f_0 = \arg\min_{f \in \mathcal{F}} E_{(Y,X)}\, l(Y, f(X))$.

• Predictor (estimator): $\hat{f}(x) = f(x, \mathcal{L})$.

• Prediction: if a new input is $x$, predict the unknown $y$ by $\hat{f}(x)$.
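
To make the notation concrete, here is a minimal sketch (not from the slides), assuming squared-error loss and taking the simplest possible family $\mathcal{F}$, the constant predictors; the data-generating system is invented for illustration.

```python
import numpy as np

def loss(y, a):                       # l(y, a): squared-error loss (assumed)
    return (y - a) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))         # learning set L = {(y_i, x_i)}, n = 100, p = 3
y = x[:, 0] + rng.normal(size=100)    # a made-up system y = phi(x, eps)

def fit(x_train, y_train):            # estimator: f_hat(.) = f(., L)
    c = y_train.mean()                # the constant minimizing the average squared loss
    return lambda x_new: np.full(len(x_new), c)

f_hat = fit(x, y)
x_new = rng.normal(size=(5, 3))
print(f_hat(x_new))                   # predictions of the unknown y at new inputs
```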


• $y$ is categorical ⇒ Classification

• $y$ is continuous ⇒ Regression


2. From Least Squares to Nearest Neighbor (for regression)

Least Squares

• Assumption: $f(x) \in \{\beta_0 + \sum_{i=1}^p x_i \beta_i\}$.

• Estimate $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ by the $\hat{\beta}$ that minimizes the residual sum of squares

  $RSS(\beta) = \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{k=1}^p x_{ki}\beta_k\Big)^2$.

• $f(x, \mathcal{L}) = \hat{\beta}_0 + \sum_{i=1}^p x_i \hat{\beta}_i$.
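
A minimal sketch of the least-squares fit and prediction (not from the slides; the simulated data are only illustrative):

```python
import numpy as np

def fit_ls(X, y):
    """Return beta_hat = (beta_0, ..., beta_p) minimizing RSS(beta)."""
    X1 = np.column_stack([np.ones(len(X)), X])          # prepend the intercept column
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)    # least-squares solution
    return beta_hat

def predict_ls(beta_hat, X_new):
    X1 = np.column_stack([np.ones(len(X_new)), X_new])
    return X1 @ beta_hat                                 # beta0_hat + sum_i x_i * betai_hat

rng = np.random.default_rng(0)
X = rng.uniform(-2, 3, size=(100, 1))                    # assumed covariate distribution
y = X[:, 0] + rng.normal(size=100)                       # linear truth with N(0, 1) noise
beta_hat = fit_ls(X, y)
print(beta_hat, predict_ls(beta_hat, np.array([[1.0]])))
```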


Nearest Neighbor (NN)

• $N_k(x)$: the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.

• $f(x, \mathcal{L}) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$.
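
A minimal sketch of the $k$-NN regression estimate (not from the slides): average the responses of the $k$ training points closest to $x$.

```python
import numpy as np

def predict_knn(X_train, y_train, x_new, k):
    dist = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances to x_new
    idx = np.argsort(dist)[:k]                       # the k nearest neighbors N_k(x)
    return y_train[idx].mean()                       # f(x, L): average of their y_i

rng = np.random.default_rng(0)
X = rng.uniform(-2, 3, size=(100, 1))                # illustrative data, as in the LS sketch
y = X[:, 0] + rng.normal(size=100)
print([predict_knn(X, y, np.array([1.0]), k) for k in (1, 5, 15)])
```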


Simulation 1

• Model: $y = x + \epsilon$ with $\epsilon \sim N(0, 1)$.

• The training sample size is 100. The test error is computed on a test sample of size 5000.

• Result

  Method    Training error    Test error
  Linear    0.8247196         3.395535
  1-NN      0.0000000         3.915410
  5-NN      0.7080551         3.434624
  15-NN     0.8412333         3.400420
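
The slides do not state the distribution of $x$, the random seed, or the exact error definition, so the following Python sketch only illustrates how such a simulation could be set up; its numbers will not reproduce the table.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-2, 3, size=(n, 1))      # assumed range, roughly matching the plots
    y = x[:, 0] + rng.normal(size=n)         # y = x + eps, eps ~ N(0, 1)
    return x, y

def knn_predict(X_tr, y_tr, X_te, k):
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    return y_tr[idx].mean(axis=1)

X_tr, y_tr = make_data(100)
X_te, y_te = make_data(5000)

X1 = np.column_stack([np.ones(len(X_tr)), X_tr])          # linear least squares
beta = np.linalg.lstsq(X1, y_tr, rcond=None)[0]
lin_te = np.mean((y_te - np.column_stack([np.ones(len(X_te)), X_te]) @ beta) ** 2)

errors = {"Linear": lin_te}
for k in (1, 5, 15):
    errors[f"{k}-NN"] = np.mean((y_te - knn_predict(X_tr, y_tr, X_te, k)) ** 2)
print(errors)
```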


• Plot: the training data with the fitted curve for Linear Regression and for Nearest Neighbor with k = 1, 5, and 15.


Simulation 2

• Model: $y = x(1 - x) + \epsilon$ with $\epsilon \sim N(0, 1)$.

• The training sample size is 100. The test error is computed on a test sample of size 5000.

• Result

  Method    Training error    Test error
  Linear    3.3307623         3.051589
  1-NN      0.0000000         1.892876
  5-NN      0.9872481         1.387429
  15-NN     2.1303585         2.069501


• Plot: the training data with the fitted curve for Linear Regression and for Nearest Neighbor with k = 1, 5, and 15.


Comments

• The linear model is the best when the true model is linear and the worst when the true model is nonlinear.

• NN performs reasonably well regardless of what the true function

is.

• Training error is not a good estimate of the test error.

• Complicated models do not always perform well.

• The number of neighbors $k$ controls the complexity of the predictor.


LS vs NN

                     LS                               NN
  Assumption         linear                           none
  Data size          small to medium                  large
  Interpretation     easy                             almost impossible
  Predictability     good when the truth is simple    stable regardless of the truth
  Tuning parameter   none                             the neighborhood size k


3. Statistical Decision theory

Regression

• The training sample $\mathcal{L}$ is a random sample from the joint distribution $P(y, x)$.

• Let $l(y, f(x))$ be a loss function for penalizing errors in prediction. The most popular loss function is squared-error loss: $l(y, f(x)) = (y - f(x))^2$.

• The expected prediction error of $f$, $EPE(f)$, is defined as

  $EPE(f) = E(Y - f(X))^2$, where $(Y, X) \sim P(y, x)$.

• Theorem: $f_0(x) = E(Y \mid X = x)$ minimizes $EPE(f)$.

• $E(Y \mid X = x)$ is called the regression function.
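
A short justification of the theorem, a standard argument that the slide leaves implicit: condition on $X$ and minimize pointwise.

```latex
\begin{align*}
EPE(f) &= E_X\, E_{Y \mid X}\!\left[(Y - f(X))^2 \mid X\right] \\
       &= E_X\, E_{Y \mid X}\!\left[\big(Y - E(Y \mid X)\big)^2
            + \big(E(Y \mid X) - f(X)\big)^2 \mid X\right]
            % the cross term vanishes because E[Y - E(Y|X) | X] = 0
       \\
       &= E_X\!\left[\operatorname{Var}(Y \mid X)\right]
          + E_X\!\left[\big(E(Y \mid X) - f(X)\big)^2\right].
\end{align*}
```

Only the second term depends on $f$, and it is zero for $f(x) = E(Y \mid X = x)$, so the regression function minimizes the EPE.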


• For the NN method,

– $f$ is estimated by $\hat{f}$:

  $\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$.

– Two approximations are

∗ expectation is approximated by averaging over sample data

∗ conditioning at a point is relaxed to conditioning on some

region “close” to the target point.

– Theorem: under regularity conditions,

  $\hat{f}(x) \to f_0(x)$ for all $x \in \mathbb{R}^p$ as $n, k \to \infty$ with $k/n \to 0$.

– These conditions mean that the neighborhood must grow ($k \to \infty$) but more slowly than the sample size ($k/n \to 0$); that is, the model complexity should increase more slowly than the sample size.


• For LS,

– $f$ is assumed to be a linear function:

  $f(x) = \beta_0 + \sum_{i=1}^p x_i \beta_i$.

– Among linear functions, the $f$ with $\beta = \big(E(XX^T)\big)^{-1} E(XY)$ minimizes the EPE.

– The LS estimator replaces the expectations by averages over the training sample.
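
A one-line derivation of this population coefficient (a standard argument, assuming the intercept is absorbed into $X$): set the gradient of the EPE to zero.

```latex
% Minimize EPE(beta) = E(Y - X^T beta)^2 over beta (intercept absorbed into X):
\frac{\partial}{\partial \beta}\, E\big(Y - X^{T}\beta\big)^{2}
  = -2\, E\big[X\,(Y - X^{T}\beta)\big] = 0
\;\Longrightarrow\;
E(XX^{T})\,\beta = E(XY)
\;\Longrightarrow\;
\beta = \big(E(XX^{T})\big)^{-1} E(XY).
```

The LS estimator is the plug-in version of this formula, with sample averages over $\mathcal{L}$ replacing the two expectations.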


Classification

• $y \in \{1, \ldots, J\}$.

• For a given loss function $l$, the EPE is defined as $E\, l(Y, f(X))$.

• Since

  $EPE(f) = E_X \sum_{j=1}^J l(j, f(X))\, P(Y = j \mid X)$,

  $f(x) = \arg\min_{k = 1, \ldots, J} \sum_{j=1}^J l(j, k)\, P(Y = j \mid X = x)$ minimizes the EPE.

• If $l(y, f(x)) = I(y \neq f(x))$, then $f(x)$ becomes

  $f(x) = \arg\max_{j = 1, \ldots, J} P(Y = j \mid X = x). \qquad (1)$

• This predictor is called the Bayes rule (Bayes classifier) and its

EPE is called the Bayes rate.


• Estimating the Bayes classifier via function estimation:

– First, estimate $\phi_j(x) = P(Y = j \mid X = x)$, and

– then estimate the Bayes classifier by replacing $P(Y = j \mid X = x)$ with $\hat{\phi}_j(x)$ in (1).

– The NN estimate of $\phi_j$ is

  $\hat{\phi}_j(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = j)$.

– Linear models do not fit well for estimating $\phi_j$, since $\phi_j$ should take values between 0 and 1. “Logistic regression” is a promising alternative.
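
A minimal sketch of this plug-in classifier with $k$-NN class-probability estimates (not from the slides; the two-class data below are invented for illustration):

```python
import numpy as np

def knn_classify(X_train, y_train, x_new, k, n_classes):
    dist = np.linalg.norm(X_train - x_new, axis=1)
    neigh = y_train[np.argsort(dist)[:k]]                                # labels in N_k(x)
    phi_hat = np.array([(neigh == j).mean() for j in range(n_classes)])  # phi_hat_j(x)
    return int(np.argmax(phi_hat))                                       # plug-in Bayes rule

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)  # two classes
print(knn_classify(X, y, np.array([1.0, 1.0]), k=15, n_classes=2))
```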


4. Curse of dimensionality

• When $p$ is large, the concept of a “neighborhood” no longer works for local averaging.

• Phenomenon 1

– $X = (X_1, \ldots, X_p) \sim \mathrm{Uniform}[0, 1]^p$.

– Consider a hypercubical neighborhood about a target point.

– We want to capture a fraction $r$ of the sample.

– Then the expected edge length is $e_p(r) = r^{1/p}$.

– $e_{10}(0.01) = 0.63$ and $e_{10}(0.1) = 0.80$.

– To capture 1% or 10% of the data to form a local average, we

must cover 63% or 80% of the range of each input variable.

– Such neighborhoods are no longer “local”.
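
A quick numerical check of the edge-length formula (illustration only):

```python
# Expected edge length e_p(r) = r**(1/p) needed to capture a fraction r in p dimensions.
for r in (0.01, 0.1):
    print(f"e_10({r}) = {r ** (1 / 10):.2f}")
# prints e_10(0.01) = 0.63 and e_10(0.1) = 0.79 (about 0.80, as on the slide)
```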


• Phenomenon 2:

– $X = (X_1, \ldots, X_p) \sim$ Uniform on the $p$-dimensional unit ball centered at the origin.

– For sample size $n$, let $R_i = \big(\sum_{k=1}^p X_{ki}^2\big)^{1/2}$, the distance of the $i$-th point from the origin, for $i = 1, \ldots, n$.

– Let $R_{(1)} = \min\{R_i\}$.

– Then the median of $R_{(1)}$ is $\big(1 - (1/2)^{1/n}\big)^{1/p}$.

– For $n = 500$, $p = 10$, the median is approximately 0.52, more than half way to the boundary.

– Most data points are closer to the boundary of the sample space

than to the origin.

– Prediction is much more difficult near the edges since one must

extrapolate rather than interpolate.
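
A quick Monte Carlo check of Phenomenon 2 (a sketch, not from the slides): sample $n = 500$ points uniformly in the 10-dimensional unit ball and record the distance from the origin to the closest point.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_ball(n, p):
    z = rng.normal(size=(n, p))
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # uniform direction on the sphere
    r = rng.uniform(size=(n, 1)) ** (1 / p)         # radius with density proportional to r^(p-1)
    return z * r

n, p = 500, 10
closest = [np.linalg.norm(uniform_ball(n, p), axis=1).min() for _ in range(200)]
print(np.median(closest))                           # roughly 0.52
print((1 - 0.5 ** (1 / n)) ** (1 / p))              # closed form, roughly 0.52
```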


• Phenomenon 3:

– Suppose $X \sim \mathrm{Uniform}[-1, 1]^p$.

– Assume that the true relation is $Y = f(X) = \exp(-8\|X\|^2)$.

– Consider the 1-NN estimate at $x = 0$.

– The bias of the estimator is $1 - \exp(-8\|x_{(1)}\|^2)$, where $\|x_{(1)}\|$ is the smallest norm among the training points.

– Since $\|X\|^2 = \sum_{i=1}^p X_i^2 \ge X_{(p)}^2$, the largest squared coordinate, and $X_{(p)}^2 \to 1$ as $p \to \infty$, the bias tends to increase as $p$ increases.
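
A rough illustration of Phenomenon 3 (a sketch, not from the slides): estimate the average bias of the 1-NN estimate at $x = 0$ for several dimensions $p$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 200

for p in (1, 2, 5, 10):
    bias = np.mean([
        1 - np.exp(-8 * (np.linalg.norm(rng.uniform(-1, 1, size=(n, p)), axis=1) ** 2).min())
        for _ in range(reps)
    ])
    print(p, round(bias, 3))   # the bias of the 1-NN estimate at 0 grows toward 1 with p
```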


5. Overfitting and Bias-Variance tradeoff

• As we have seen, in the NN method the neighborhood size $k$ controls the complexity of the predictor. The question is how to choose $k$.

• If we know $P(y, x)$, we can choose $k$ by minimizing the EPE (test error):

  $EPE(\hat{f}_k) = E(Y - \hat{f}_k(X))^2$, where $\hat{f}_k$ is the $k$-NN estimate of $f$.

• Unfortunately, we do not know $P(y, x)$.

• One naive answer is to estimate the EPE of $\hat{f}_k$ by the residual sum of squares (training error):

  $\sum_{i=1}^n (y_i - \hat{f}_k(x_i))^2$.


• The training error is a downward-biased estimator of the test error, since the data set is used twice (once for constructing $\hat{f}$ and once for calculating the training error).

• Moreover, the training error keeps decreasing as $k$ gets smaller, while the test error decreases initially and then increases.

• This means that overly complicated models (models that fit the training data too closely, i.e., overfitted models) show poor performance.

• This seemingly mysterious phenomenon can be explained by the

bias-variance decomposition.

• Several ways of choosing the model complexity (i.e. k in the NN

method) will be explained later.


Bias-Variance tradeoff (for regression)

• Suppose $Y = f(X) + \epsilon$ with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$.

• Averaging over training samples $\mathcal{L}$, the test error of $f(x, \mathcal{L})$ is given by

  $TE = E_{\mathcal{L}} E_{(Y,X)}\big((Y - f(X, \mathcal{L}))^2\big)$,

• which is decomposed as

  $TE = E_{(Y,X)}\big((Y - f(X))^2\big) + E_X\big((f(X) - E_{\mathcal{L}} f(X, \mathcal{L}))^2\big) + E_X\big(E_{\mathcal{L}}(f(X, \mathcal{L}) - E_{\mathcal{L}} f(X, \mathcal{L}))^2\big)$

  $\;= \sigma^2 + E_X\big(\mathrm{Bias}_{\mathcal{L}}(X)^2 + \mathrm{Variance}_{\mathcal{L}}(X)\big)$.


• In general, as the model gets more complicated, the bias decreases and the variance increases.

• Example: the $k$-NN method

– $f(x, \mathcal{L}) = \frac{1}{k} \sum_{l=1}^k \big(f(x_{(l)}) + \epsilon_{(l)}\big)$, where the subscript $(l)$ indicates the $l$-th nearest neighbor of $x$.

– Then

  $\mathrm{Bias}_{\mathcal{L}}(x) = f(x) - \frac{1}{k} \sum_{l=1}^k f(x_{(l)})$

  and

  $\mathrm{Variance}_{\mathcal{L}}(x) = \frac{\sigma^2}{k}$.

– For $k = 1$ the bias is the smallest and the variance is the largest, while for $k = n$ the bias is the largest and the variance is the smallest.
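
A rough Monte Carlo sketch of this tradeoff (not from the slides; it assumes $f(x) = x(1 - x)$ and $\sigma = 1$, as in Simulation 2): draw many training sets and split the error of $k$-NN at fixed test points into squared bias and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x * (1 - x)
x_test = np.linspace(-2, 2, 50)
n, reps = 100, 200

def knn_fit_predict(x_tr, y_tr, x_te, k):
    idx = np.argsort(np.abs(x_te[:, None] - x_tr[None, :]), axis=1)[:, :k]
    return y_tr[idx].mean(axis=1)

for k in (1, 5, 15, 50):
    preds = np.empty((reps, len(x_test)))
    for r in range(reps):
        x_tr = rng.uniform(-2, 2, size=n)
        y_tr = f(x_tr) + rng.normal(size=n)
        preds[r] = knn_fit_predict(x_tr, y_tr, x_test, k)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)   # approx. E_X Bias_L(X)^2
    var = np.mean(preds.var(axis=0))                          # approx. E_X Variance_L(X)
    print(f"k={k:2d}  bias^2={bias2:.3f}  variance={var:.3f}")
# the variance behaves roughly like sigma^2 / k, while the squared bias grows with k
```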


• Plot: test error and training error versus model complexity. The training error decreases steadily as the complexity grows, while the test error is U-shaped; low complexity corresponds to high bias and low variance, high complexity to low bias and high variance.


6. Four situations in supervised learning

1. p is small and F is parametric.

• Standard regression and classification problems

• MLE, least squares, robust estimators, etc.

2. p is large and F is parametric.

• Develop efficient methods for small and moderate samples

• Variable selection, shrinkage, Bayesian methods, etc.

3. p is small and F is nonparametric.

• Nonparametric regression

• Kernels, splines, wavelets, mixture models, etc.

4. p is large and F is nonparametric.

• The main playground of data mining

• Decision trees, projection pursuit, MARS, neural networks, etc.
