Chap 1. Overview of Statistical Learning (HTF, 2.1 - 2.6, 2.9)
Yongdai Kim
Seoul National University
0. Learning vs Statistical learning
• Learning procedure
– Construct a claim by observing data or using logic
– Perform experiments
– Draw a conclusion
• Statistical learning procedure
– Collect data
– Analyze the data
– Find new rules
“Let the data tell us something.”
• Why is “statistical learning” necessary?
– We already know most of the rules our brains can imagine.
– Life (nature, socio-economic status, human behavior, biology,
etc.) is more complex than we thought.
– Our world is changing too fast for us to keep up based
only on logic.
– Due to digitalization, the amount of data is increasing very fast.
– Most of the information in huge data sets remains undiscovered.
• Sample questions
– What are the risk factors for heart failure?
– Are there genes which characterize differences between various
races?
– How does the stock market behave?
– Which chemical compounds are effective for a specific disease?
– Who are valuable customers for our company?
– What are the influential factors behind changes in the amount of
ozone?
– Are there patterns in the content of spam emails?
• In statistical learning, the common objective is to find the causes of
a given phenomenon.
• A common feature of these problems is that the set of
possible causes we can think of is very large.
• The classical learning procedure suffers from time limitations unless we
are lucky.
Machine learning vs Statistical learning (personal view)
• Machine learning is a method to educate a machine (computer).
• Two tasks
– Without errors (e.g., rule-based learning)
– With errors
• Statistical learning is the subset of machine learning that deals
with tasks with errors.
Statistical view of statistical learning
• Analysis of ultra-high dimensional data
• Methods to overcome the “curse of dimensionality”
Supervised and Unsupervised learning
• Supervised learning
– Use the inputs to predict the values of the outputs
– Examples: Regression and Classification
• Unsupervised learning
– Only use inputs to describe the data
– Examples: Clustering, PCA
1. Basic set-up of Supervised learning
• Input (Covariate): x ∈ R^p
• Output (Response): y ∈ Y
• System (Model): y = ϕ(x, ϵ)
• Loss function: l(y, a)
• Assumption: the prediction function f belongs to a family of functions F.
• Learning set (Data): L = {(y_i, x_i), i = 1, . . . , n}, assumed to be
a random sample from (Y, X) ∼ P
• Objective: Find f_0 = argmin_{f∈F} E_{(Y,X)} l(Y, f(X)).
• Predictor (Estimator): f̂(x) = f(x, L).
• Prediction: If a new input is x, predict the unknown y by f̂(x).
y is categorical ⇒ Classification
y is continuous ⇒ Regression
2. From Least Squares to Nearest Neighbor (for regression)
Least Squares
• Assumption: f(x) ∈ {β_0 + Σ_{i=1}^p x_i β_i}.
• Estimate β = (β_0, β_1, . . . , β_p) by β̂, which minimizes the residual
sum of squares
RSS(β) = Σ_{i=1}^n ( y_i − β_0 − Σ_{k=1}^p x_{ik} β_k )².
• f(x, L) = β̂_0 + Σ_{i=1}^p x_i β̂_i (a minimal numerical sketch follows
below).
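As an illustration (not the lecture's code), a minimal least-squares sketch in Python, assuming a covariate matrix X of shape (n, p) and response vector y; the helper names fit_ls and predict_ls are hypothetical.

```python
# Minimal least-squares sketch (illustrative only): estimate beta by
# minimizing RSS(beta) with NumPy's lstsq solver.
import numpy as np

def fit_ls(X, y):
    """X: (n, p) covariates, y: (n,) responses; returns (beta0_hat, beta_hat)."""
    Xa = np.column_stack([np.ones(len(y)), X])     # prepend an intercept column
    coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)  # minimizes the residual sum of squares
    return coef[0], coef[1:]

def predict_ls(beta0_hat, beta_hat, X):
    """The fitted predictor f(x, L) = beta0_hat + sum_i x_i * beta_hat_i."""
    return beta0_hat + X @ beta_hat
```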
Nearest Neighbor (NN)
• N_k(x): the neighborhood of x defined by the k closest points x_i
in the training sample.
• f(x, L) = (1/k) Σ_{x_i ∈ N_k(x)} y_i (a sketch follows below).
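A minimal sketch of this estimator in Python (illustrative; Euclidean distance assumed, and knn_predict is a hypothetical helper name).

```python
# k-NN regression sketch: average the responses of the k training points
# closest to the query point x (Euclidean distance).
import numpy as np

def knn_predict(x, X_train, y_train, k=5):
    dist = np.linalg.norm(X_train - x, axis=1)  # distances to all training points
    nearest = np.argsort(dist)[:k]              # indices of the k closest points, N_k(x)
    return y_train[nearest].mean()              # local average of their responses
```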
Simulation 1
• Model: y = x + ϵ with ϵ ∼ N(0, 1).
• The training sample size is 100. The test error is computed on a test
sample of size 5000.
• Result
Method Training error Test error
Linear 0.8247196 3.395535
1 NN 0.0000000 3.915410
5 NN 0.7080551 3.434624
15 NN 0.8412333 3.400420
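A hedged sketch of this kind of experiment in Python: the uniform design on [−2, 3] and the random seed are assumptions (the slides do not specify them), so the numbers will differ slightly from the table above.

```python
# Simulation sketch: y = x + eps, eps ~ N(0, 1); compare training/test errors
# of least squares and k-NN. The design and seed are assumed, not from the slides.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    x = rng.uniform(-2, 3, size=n)    # assumed design
    return x, x + rng.normal(size=n)  # true model: y = x + eps

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

def knn(xq, x, y, k):
    return y[np.argsort(np.abs(x - xq))[:k]].mean()  # average of the k nearest responses

x_tr, y_tr = simulate(100)
x_te, y_te = simulate(5000)

# least squares
A = np.column_stack([np.ones(100), x_tr])
beta, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
print("linear", mse(y_tr, A @ beta), mse(y_te, beta[0] + beta[1] * x_te))

# k-NN for several neighborhood sizes
for k in (1, 5, 15):
    tr = mse(y_tr, np.array([knn(x, x_tr, y_tr, k) for x in x_tr]))
    te = mse(y_te, np.array([knn(x, x_tr, y_tr, k) for x in x_te]))
    print(f"{k}-NN", tr, te)
```

Note that the 1-NN training error is exactly zero by construction (each training point is its own nearest neighbor), as in the table.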
• Plot: the training data (x versus y) with the fitted curves for linear
regression and nearest neighbor with k = 1, 5, and 15.
Simulation 2
• Model: y = x(1 − x) + ϵ with ϵ ∼ N(0, 1).
• The training sample size is 100. The test error is computed on a test
sample of size 5000.
• Result
Method Training error Test error
Linear 3.3307623 3.051589
1 NN 0.0000000 1.892876
5 NN 0.9872481 1.387429
15 NN 2.1303585 2.069501
• Plot: the training data (x versus y) with the fitted curves for linear
regression and nearest neighbor with k = 1, 5, and 15.
Comments
• The linear model is the best when the true model is linear and the worst
when the true model is nonlinear.
• NN performs reasonably well regardless of what the true function
is.
• Training error is not a good estimate of the test error.
• Complicated models do not always perform well.
• The neighborhood size k controls the complexity of the
predictor.
LS vs NN
                     LS                               NN
Assumption           linear                           none
Data size            small to medium                  large
Interpretation       easy                             almost impossible
Predictability       good when the truth is simple    stable regardless of the truth
Tuning parameter     none                             the size of the neighborhood
3. Statistical Decision theory
Regression
• The training sample L is a random sample from the joint
distribution P(y, x).
• Let l(y, f(x)) be a loss function for penalizing errors in
prediction. The most popular loss function is the squared error loss:
l(y, f(x)) = (y − f(x))².
• The expected prediction error of f, EPE(f), is defined as
EPE(f) = E(Y − f(X))²,
where (Y, X) ∼ P(y, x).
• Theorem: f_0(x) = E(Y | X = x) minimizes EPE(f) (a short derivation
is sketched below).
• E(Y | X = x) is called the regression function.
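A short derivation of the theorem (the standard argument, written out here for completeness):

```latex
% Condition on X and expand around the conditional mean; the cross term vanishes.
\begin{align*}
\mathrm{EPE}(f)
  &= E_X\, E_{Y\mid X}\!\left[(Y - f(X))^2 \mid X\right] \\
  &= E_X\, E_{Y\mid X}\!\left[(Y - E(Y\mid X))^2 \mid X\right]
   + E_X\!\left[(E(Y\mid X) - f(X))^2\right].
\end{align*}
% The first term does not involve f, and the second is minimized, for every x,
% by taking f(x) = E(Y | X = x).
```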
• For NN method,
– f is estimated by
f̂(x) = Ave(y_i | x_i ∈ N_k(x)).
– Two approximations are
∗ expectation is approximated by averaging over sample data
∗ conditioning at a point is relaxed to conditioning on some
region “close” to the target point.
– Theorem: Under regularity conditions,
f̂(x) → f_0(x)
for all x ∈ R^p as n, k → ∞ with k/n → 0.
– The condition k/n → 0 means that the model complexity
should increase slower than the sample size.
• For LS,
– f is assumed to be a linear function:
f(x) = β_0 + Σ_{i=1}^p x_i β_i.
– f with β = (E(XX^T))^{-1} E(XY) minimizes the EPE (a short derivation
is sketched below).
– The LS estimator replaces these expectations by averages over the
training sample.
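For completeness, a sketch of the population solution; the intercept is absorbed by treating X as including the constant 1 (a convention assumed here, not stated on the slide).

```latex
% Differentiate the EPE with respect to beta and set the gradient to zero.
\begin{align*}
\mathrm{EPE}(\beta) &= E\!\left[(Y - X^{\top}\beta)^2\right], \\
\nabla_{\beta}\,\mathrm{EPE}(\beta) &= -2\,E\!\left[X\,(Y - X^{\top}\beta)\right] = 0
  \;\Longrightarrow\; \beta = \left(E(XX^{\top})\right)^{-1} E(XY).
\end{align*}
```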
Classification
• y ∈ {1, . . . , J}.
• For a given loss function l, the EPE is defined as E(l(Y, f(X))).
• Since
EPE(f) = E_X [ Σ_{j=1}^J l(j, f(X)) P(Y = j | X) ],
f(x) = argmin_{k=1,...,J} Σ_{j=1}^J l(j, k) P(Y = j | X = x) minimizes
the EPE.
• If l(y, f(x)) = I(y ≠ f(x)), f(x) becomes
f(x) = argmax_{j=1,...,J} P(Y = j | X = x). (1)
• This predictor is called the Bayes rule (Bayes classifier) and its
EPE is called the Bayes rate.
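A one-line check of why the 0-1 loss leads to (1):

```latex
% With l(j, k) = I(j != k), the inner sum in the EPE becomes
\begin{align*}
\sum_{j=1}^{J} I(j \neq k)\, P(Y = j \mid X = x) = 1 - P(Y = k \mid X = x),
\end{align*}
% so minimizing over k is the same as choosing the class with the largest
% posterior probability, which is exactly the Bayes classifier in (1).
```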
• Estimate the Bayes classifier via function estimation
– First, estimate ϕ_j(x) = P(Y = j | X = x), and
– then estimate the Bayes classifier by replacing P(Y = j | X = x)
with the estimate ϕ̂_j(x) in (1).
– The NN estimate of ϕ_j is
ϕ̂_j(x) = (1/k) Σ_{x_i ∈ N_k(x)} I(y_i = j)
(a sketch follows below).
– Linear models do not fit well for estimating ϕ_j since ϕ_j must
take values between 0 and 1. “Logistic regression” is a
promising alternative.
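A minimal sketch of the NN estimate of ϕ_j in Python (illustrative; the labels are assumed to be coded as integers 0, ..., J−1 so that np.bincount applies).

```python
# k-NN estimate of the class probabilities P(Y = j | X = x): the fraction of
# the k nearest training points whose label equals j.
import numpy as np

def knn_class_probs(x, X_train, y_train, k=15, n_classes=None):
    """y_train: integer labels 0, ..., J-1 (an assumed coding)."""
    n_classes = n_classes or int(y_train.max()) + 1
    dist = np.linalg.norm(X_train - x, axis=1)
    labels = y_train[np.argsort(dist)[:k]]               # labels of the k nearest neighbors
    return np.bincount(labels, minlength=n_classes) / k  # estimated phi_hat_j(x)

def knn_classify(x, X_train, y_train, k=15):
    """Plug-in Bayes rule: pick the class with the largest estimated probability."""
    return int(np.argmax(knn_class_probs(x, X_train, y_train, k)))
```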
4. Curse of dimensionality
• When p is large, the concept “neighborhood” does not work for
local averaging.
• Phenomenon 1
– X = (X_1, . . . , X_p) ∼ Uniform[0, 1]^p.
– Consider a hypercubical neighborhood about a target point.
– We want to capture a fraction r of the sample.
– Then the expected edge length is e_p(r) = r^{1/p}.
– e_10(0.01) ≈ 0.63 and e_10(0.1) ≈ 0.80 (checked numerically below).
– To capture 1% or 10% of the data to form a local average, we
must cover 63% or 80% of the range of each input variable.
– Such neighborhoods are no longer “local”.
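A quick numerical check of these values (illustrative):

```python
# Expected edge length e_p(r) = r**(1/p) of a hypercubical neighborhood
# capturing a fraction r of uniform data in [0, 1]^p.
for p in (1, 2, 10):
    for r in (0.01, 0.1):
        print(f"e_{p}({r}) = {r ** (1 / p):.2f}")
# For p = 10 this prints about 0.63 for r = 0.01 and 0.80 for r = 0.1.
```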
• Phenomenon 2:
– X = (X_1, . . . , X_p) ∼ Uniform in the p-dimensional unit ball
centered at the origin.
– For a sample of size n, let R_i = (Σ_{k=1}^p X_{ik}²)^{1/2}, the distance
of the i-th point from the origin, for i = 1, . . . , n.
– Let R_(1) = min_i R_i. Then the median of R_(1) is
(1 − (1/2)^{1/n})^{1/p}.
– For n = 500 and p = 10, the median is approximately 0.52, more
than half way to the boundary (a Monte Carlo check is sketched below).
– Most data points are closer to the boundary of the sample space
than to the origin.
– Prediction is much more difficult near the edges since one must
extrapolate rather than interpolate.
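A Monte Carlo sketch of this calculation (illustrative; the number of replications and the seed are arbitrary assumptions):

```python
# Sample n points uniformly in the unit ball in R^p, record the distance from
# the origin to the closest one, and compare its median with the formula.
import numpy as np

def min_dist_to_origin(n, p, rng):
    z = rng.normal(size=(n, p))
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # uniform directions on the sphere
    r = rng.uniform(size=(n, 1)) ** (1 / p)        # radii giving a uniform ball
    return np.linalg.norm(r * z, axis=1).min()

rng = np.random.default_rng(0)
n, p = 500, 10
sims = [min_dist_to_origin(n, p, rng) for _ in range(200)]
print(np.median(sims), (1 - 0.5 ** (1 / n)) ** (1 / p))  # both are close to 0.52
```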
• Phenomenon 3:
– Suppose X ∼ Uniform[−1, 1]^p.
– Assume that the true relation is Y = f(X) = exp(−8‖X‖²).
– Consider the 1-NN estimate at x = 0.
– The bias of the estimator is 1 − exp(−8‖x_(1)‖²), where x_(1) is the
training point with the smallest norm.
– Since ‖X‖² = Σ_{i=1}^p X_i² ≥ X²_(p), where X²_(p) is the largest of
X_1², . . . , X_p², and X²_(p) → 1 as p → ∞,
the bias tends to increase as p increases (a numerical check follows below).
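A small numerical sketch of this effect (the sample size and seed are assumptions):

```python
# 1-NN bias at x = 0 for f(x) = exp(-8 * ||x||^2): the nearest training point
# drifts away from the origin as p grows, so the bias 1 - exp(-8 * ||x_(1)||^2)
# approaches 1.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for p in (1, 2, 10):
    X = rng.uniform(-1, 1, size=(n, p))
    nearest_sq_norm = (X ** 2).sum(axis=1).min()   # ||x_(1)||^2
    print(p, 1 - np.exp(-8 * nearest_sq_norm))     # bias of the 1-NN estimate at 0
```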
5. Overfitting and Bias-Variance tradeoff
• As we have seen, in the NN method the neighborhood size k
controls the complexity of the predictor. The question is how to
choose k.
• If we knew P(y, x), we could choose k by minimizing the EPE (test
error)
EPE(f̂_k) = E(Y − f̂_k(X))²,
where f̂_k is the k-NN estimate of f.
• Unfortunately, we do not know P(y, x).
• One naive answer is to estimate the EPE of f̂_k by the residual
sum of squares (training error):
Σ_{i=1}^n (y_i − f̂_k(x_i))².
• The training error is a downward-biased estimator of the test error,
since the data set is used twice (once for constructing f̂ and once
for calculating the training error).
• Moreover, the training error keeps decreasing as k gets smaller,
while the test error first decreases and then increases.
• This means that too complicated models (or models fitting the
training data too closely, or overfitted models) show poor
performance.
• This seemingly mysterious phenomenon can be explained by the
bias-variance decomposition.
• Several ways of choosing the model complexity (i.e. k in the NN
method) will be explained later.
Bias-Variance tradeoff (for regression)
• Suppose Y = f(X) + ϵ with E(ϵ) = 0 and Var(ϵ) = σ².
• For a given training sample L, the test error of f(x, L) is given by
TE = E_L E_{(Y,X)}((Y − f(X, L))²),
which decomposes as (derivation sketched below)
TE = E_{(Y,X)}((Y − f(X))²) + E_X((f(X) − E_L f(X, L))²)
+ E_X(E_L(f(X, L) − E_L f(X, L))²)
= σ² + E_X(Bias_L(X)² + Variance_L(X)).
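The decomposition follows by adding and subtracting f(X) and m(X) = E_L f(X, L); a sketch:

```latex
% Write Y - f(X,L) = eps + (f(X) - m(X)) + (m(X) - f(X,L)), with m(X) = E_L f(X,L).
\begin{align*}
\mathrm{TE}
  &= E_{(Y,X)}\!\left[(Y - f(X))^2\right]
   + E_X\!\left[(f(X) - m(X))^2\right]
   + E_X\, E_L\!\left[(f(X,L) - m(X))^2\right] \\
  &= \sigma^2 + E_X\!\left[\mathrm{Bias}_L(X)^2 + \mathrm{Variance}_L(X)\right].
\end{align*}
% All three cross terms have expectation zero: eps has mean zero and is
% independent of L, and E_L[f(X,L) - m(X)] = 0 by the definition of m(X).
```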
• In general, as the model becomes more complicated, the bias decreases
and the variance increases.
• Example: k-NN method
– f(x, L) = (1/k) Σ_{l=1}^k ( f(x_(l)) + ϵ_(l) ), where the subscript (l)
indicates the l-th nearest neighbor of x in the training sample.
– Then
Bias_L(x) = f(x) − (1/k) Σ_{l=1}^k f(x_(l))
and
Variance_L(x) = σ²/k.
– For k = 1, the bias is the smallest and the variance is the largest,
while for k = n, the bias is the largest and the variance is the
smallest.
• Plot: training and test error as functions of model complexity; the
low-complexity end corresponds to high bias and low variance, and the
high-complexity end to low bias and high variance.
6. Four situations in supervised learning
1. p is small and F is parametric.
• Standard regression and classification problems
• MLE, least squares, robust estimators, etc.
2. p is large and F is parametric.
• Develop efficient methods for small and moderate samples
• Variable selection, shrinkage, Bayesian methods, etc.
3. p is small and F is nonparametric.
• Nonparametric regression
• Kernels, splines, wavelets, mixture models, etc.
4. p is large and F is nonparametric.
• The main playground of Data Mining
• Decision trees, projection pursuit, MARS, neural networks, etc.