Chap 1. Overview of Statistical Learning (HTF, 2.1 - 2.6, 2.9)
Yongdai Kim
Seoul National University
0. Learning vs Statistical learning
• Learning procedure
– Construct a claim by observing data or using logic
– Perform experiments
– Draw a conclusion
• Statistical learning procedure
– Collect data
– Analyze the data
– Find new rules
“Let the data tell us something.”
• Why is “statistical learning” necessary?
– We already know most of the rules our brains can imagine.
– Life (nature, socio-economic status, human behavior, biology,
etc.) is more complex than we thought.
– Our world is changing too fast for us to keep up based
only on logic.
– Due to digitalization, the amount of data is increasing very fast.
– Most of the information in huge data sets remains undiscovered.
• Sample questions
– What are the risk factors for heart failure?
– Are there genes which characterize differences between various
races?
– How does the stock market behave?
– Which chemical compounds are effective for a specific disease?
– Who are valuable customers for our company?
– What are the influential factors behind changes in the amount of
ozone?
– Are there patterns in the content of spam emails?
• In statistical learning, the common objective is to find the causes of
a given phenomenon.
• A common feature of these problems is that the set of
possible causes we can think of is very large.
• The classical learning procedure suffers from time limitations unless we
are lucky.
Machine learning vs Statistical learning (personal view)
• Machine learning is a method to educate a machine (computer).
• Two tasks
– Without errors (e.g., rule-based learning)
– With errors
• Statistical learning is the subset of machine learning that deals
with tasks with errors.
Statistical view of statistical learning
• Analysis of ultra-high dimensional data
• Methods to overcome the “curse of dimensionality”
Supervised and Unsupervised learning
• Supervised learning
– Use the inputs to predict the values of the outputs
– Examples: Regression and Classification
• Unsupervised learning
– Only use inputs to describe the data
– Examples: Clustering, PCA
1. Basic set-up of Supervised learning
• Input (Covariate): x ∈ R^p
• Output (Response): y ∈ Y
• System (Model): y = ϕ(x, ϵ)
• Loss function: l(y, a)
• Assumption: the prediction function f belongs to a family of functions F.
• Learning set (Data): L = {(y_i, x_i), i = 1, . . . , n}, assumed to be
a random sample from (Y, X) ∼ P
• Objective: Find f_0 = argmin_{f∈F} E_{(Y,X)} l(Y, f(X)).
• Predictor (Estimator): f̂(x) = f(x, L).
• Prediction: If a new input is x, predict the unknown y by f̂(x).
y is categorical ⇒ Classification
y is continuous ⇒ Regression
2. From Least Squares to Nearest Neighbor (for regression)
Least Squares
• Assumption: f(x) ∈ {β_0 + Σ_{i=1}^p x_i β_i}.
• Estimate β = (β_0, β_1, . . . , β_p) by β̂, which minimizes the residual
sum of squares
RSS(β) = Σ_{i=1}^n ( y_i − β_0 − Σ_{k=1}^p x_{ik} β_k )².
• f(x, L) = β̂_0 + Σ_{i=1}^p x_i β̂_i (a minimal numerical sketch follows
below).
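As an illustration (not the lecture's code), a minimal least-squares sketch in Python, assuming a covariate matrix X of shape (n, p) and response vector y; the helper names fit_ls and predict_ls are hypothetical.

```python
# Minimal least-squares sketch (illustrative only): estimate beta by
# minimizing RSS(beta) with NumPy's lstsq solver.
import numpy as np

def fit_ls(X, y):
    """X: (n, p) covariates, y: (n,) responses; returns (beta0_hat, beta_hat)."""
    Xa = np.column_stack([np.ones(len(y)), X])     # prepend an intercept column
    coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)  # minimizes the residual sum of squares
    return coef[0], coef[1:]

def predict_ls(beta0_hat, beta_hat, X):
    """The fitted predictor f(x, L) = beta0_hat + sum_i x_i * beta_hat_i."""
    return beta0_hat + X @ beta_hat
```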
Nearest Neighbor (NN)
• N_k(x): the neighborhood of x defined by the k closest points x_i
in the training sample.
• f(x, L) = (1/k) Σ_{x_i ∈ N_k(x)} y_i (a sketch follows below).
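A minimal sketch of this estimator in Python (illustrative; Euclidean distance assumed, and knn_predict is a hypothetical helper name).

```python
# k-NN regression sketch: average the responses of the k training points
# closest to the query point x (Euclidean distance).
import numpy as np

def knn_predict(x, X_train, y_train, k=5):
    dist = np.linalg.norm(X_train - x, axis=1)  # distances to all training points
    nearest = np.argsort(dist)[:k]              # indices of the k closest points, N_k(x)
    return y_train[nearest].mean()              # local average of their responses
```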
Simulation 1
• Model: y = x + ϵ with ϵ ∼ N(0, 1).
• The training sample size is 100. The test error is computed on a test
sample of size 5000.
• Result
Method Training error Test error
Linear 0.8247196 3.395535
1 NN 0.0000000 3.915410
5 NN 0.7080551 3.434624
15 NN 0.8412333 3.400420
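A hedged sketch of this kind of experiment in Python: the uniform design on [−2, 3] and the random seed are assumptions (the slides do not specify them), so the numbers will differ slightly from the table above.

```python
# Simulation sketch: y = x + eps, eps ~ N(0, 1); compare training/test errors
# of least squares and k-NN. The design and seed are assumed, not from the slides.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    x = rng.uniform(-2, 3, size=n)    # assumed design
    return x, x + rng.normal(size=n)  # true model: y = x + eps

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

def knn(xq, x, y, k):
    return y[np.argsort(np.abs(x - xq))[:k]].mean()  # average of the k nearest responses

x_tr, y_tr = simulate(100)
x_te, y_te = simulate(5000)

# least squares
A = np.column_stack([np.ones(100), x_tr])
beta, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
print("linear", mse(y_tr, A @ beta), mse(y_te, beta[0] + beta[1] * x_te))

# k-NN for several neighborhood sizes
for k in (1, 5, 15):
    tr = mse(y_tr, np.array([knn(x, x_tr, y_tr, k) for x in x_tr]))
    te = mse(y_te, np.array([knn(x, x_tr, y_tr, k) for x in x_te]))
    print(f"{k}-NN", tr, te)
```

Note that the 1-NN training error is exactly zero by construction (each training point is its own nearest neighbor), as in the table.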
• Plot: the training data (x versus y) with the fitted curves for linear
regression and nearest neighbor with k = 1, 5, and 15.
Simulation 2
• Model: y = x(1 − x) + ϵ with ϵ ∼ N(0, 1).
• The training sample size is 100. The test error is computed on a test
sample of size 5000.
• Result
Method Training error Test error
Linear 3.3307623 3.051589
1 NN 0.0000000 1.892876
5 NN 0.9872481 1.387429
15 NN 2.1303585 2.069501
• Plot: the training data (x versus y) with the fitted curves for linear
regression and nearest neighbor with k = 1, 5, and 15.
Comments
• The linear model is the best when the true model is linear and the worst
when the true model is nonlinear.
• NN performs reasonably well regardless of what the true function
is.
• Training error is not a good estimate of the test error.
• Complicated models do not always perform well.
• The neighborhood size k controls the complexity of the
predictor.
LS vs NN
                     LS                               NN
Assumption           linear                           none
Data size            small to medium                  large
Interpretation       easy                             almost impossible
Predictability       good when the truth is simple    stable regardless of the truth
Tuning parameter     none                             the size of the neighborhood
3. Statistical Decision theory
Regression
• The training sample L is a random sample from the joint
distribution P(y, x).
• Let l(y, f(x)) be a loss function for penalizing errors in
prediction. The most popular loss function is the squared error loss:
l(y, f(x)) = (y − f(x))².
• The expected prediction error of f, EPE(f), is defined as
EPE(f) = E(Y − f(X))²,
where (Y, X) ∼ P(y, x).
• Theorem: f_0(x) = E(Y | X = x) minimizes EPE(f) (a short derivation
is sketched below).
• E(Y | X = x) is called the regression function.
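A short derivation of the theorem (the standard argument, written out here for completeness):

```latex
% Condition on X and expand around the conditional mean; the cross term vanishes.
\begin{align*}
\mathrm{EPE}(f)
  &= E_X\, E_{Y\mid X}\!\left[(Y - f(X))^2 \mid X\right] \\
  &= E_X\, E_{Y\mid X}\!\left[(Y - E(Y\mid X))^2 \mid X\right]
   + E_X\!\left[(E(Y\mid X) - f(X))^2\right].
\end{align*}
% The first term does not involve f, and the second is minimized, for every x,
% by taking f(x) = E(Y | X = x).
```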
• For NN method,
– f is estimated by
f̂(x) = Ave(y_i | x_i ∈ N_k(x)).
– Two approximations are
∗ expectation is approximated by averaging over sample data
∗ conditioning at a point is relaxed to conditioning on some
region “close” to the target point.
– Theorem: Under regularity conditions,
f̂(x) → f_0(x)
for all x ∈ R^p as n, k → ∞ with k/n → 0.
– The condition k/n → 0 means that the model complexity
should increase slower than the sample size.
• For LS,
– f is assumed to be a linear function:
f(x) = β_0 + Σ_{i=1}^p x_i β_i.
– f with β = (E(XX^T))^{-1} E(XY) minimizes the EPE (a short derivation
is sketched below).
– The LS estimator replaces these expectations by averages over the
training sample.
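For completeness, a sketch of the population solution; the intercept is absorbed by treating X as including the constant 1 (a convention assumed here, not stated on the slide).

```latex
% Differentiate the EPE with respect to beta and set the gradient to zero.
\begin{align*}
\mathrm{EPE}(\beta) &= E\!\left[(Y - X^{\top}\beta)^2\right], \\
\nabla_{\beta}\,\mathrm{EPE}(\beta) &= -2\,E\!\left[X\,(Y - X^{\top}\beta)\right] = 0
  \;\Longrightarrow\; \beta = \left(E(XX^{\top})\right)^{-1} E(XY).
\end{align*}
```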
Classification
• y ∈ {1, . . . , J}.
• For a given loss function l, the EPE is defined as E(l(Y, f(X))).
• Since
EPE(f) = E_X [ Σ_{j=1}^J l(j, f(X)) P(Y = j | X) ],
f(x) = argmin_{k=1,...,J} Σ_{j=1}^J l(j, k) P(Y = j | X = x) minimizes
the EPE.
• If l(y, f(x)) = I(y ≠ f(x)), f(x) becomes
f(x) = argmax_{j=1,...,J} P(Y = j | X = x). (1)
• This predictor is called the Bayes rule (Bayes classifier) and its
EPE is called the Bayes rate.
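A one-line check of why the 0-1 loss leads to (1):

```latex
% With l(j, k) = I(j != k), the inner sum in the EPE becomes
\begin{align*}
\sum_{j=1}^{J} I(j \neq k)\, P(Y = j \mid X = x) = 1 - P(Y = k \mid X = x),
\end{align*}
% so minimizing over k is the same as choosing the class with the largest
% posterior probability, which is exactly the Bayes classifier in (1).
```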
• Estimate the Bayes classifier via function estimation
– First, estimate ϕ_j(x) = P(Y = j | X = x), and
– then estimate the Bayes classifier by replacing P(Y = j | X = x)
with the estimate ϕ̂_j(x) in (1).
– The NN estimate of ϕ_j is
ϕ̂_j(x) = (1/k) Σ_{x_i ∈ N_k(x)} I(y_i = j)
(a sketch follows below).
– Linear models do not fit well for estimating ϕ_j since ϕ_j must
take values between 0 and 1. “Logistic regression” is a
promising alternative.
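A minimal sketch of the NN estimate of ϕ_j in Python (illustrative; the labels are assumed to be coded as integers 0, ..., J−1 so that np.bincount applies).

```python
# k-NN estimate of the class probabilities P(Y = j | X = x): the fraction of
# the k nearest training points whose label equals j.
import numpy as np

def knn_class_probs(x, X_train, y_train, k=15, n_classes=None):
    """y_train: integer labels 0, ..., J-1 (an assumed coding)."""
    n_classes = n_classes or int(y_train.max()) + 1
    dist = np.linalg.norm(X_train - x, axis=1)
    labels = y_train[np.argsort(dist)[:k]]               # labels of the k nearest neighbors
    return np.bincount(labels, minlength=n_classes) / k  # estimated phi_hat_j(x)

def knn_classify(x, X_train, y_train, k=15):
    """Plug-in Bayes rule: pick the class with the largest estimated probability."""
    return int(np.argmax(knn_class_probs(x, X_train, y_train, k)))
```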
4. Curse of dimensionality
• When p is large, the concept “neighborhood” does not work for
local averaging.
• Phenomenon 1
– X = (X_1, . . . , X_p) ∼ Uniform[0, 1]^p.
– Consider a hypercubical neighborhood about a target point.
– We want to capture a fraction r of the sample.
– Then the expected edge length is e_p(r) = r^{1/p}.
– e_10(0.01) ≈ 0.63 and e_10(0.1) ≈ 0.80 (checked numerically below).
– To capture 1% or 10% of the data to form a local average, we
must cover 63% or 80% of the range of each input variable.
– Such neighborhoods are no longer “local”.
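A quick numerical check of these values (illustrative):

```python
# Expected edge length e_p(r) = r**(1/p) of a hypercubical neighborhood
# capturing a fraction r of uniform data in [0, 1]^p.
for p in (1, 2, 10):
    for r in (0.01, 0.1):
        print(f"e_{p}({r}) = {r ** (1 / p):.2f}")
# For p = 10 this prints about 0.63 for r = 0.01 and 0.80 for r = 0.1.
```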
• Phenomenon 2:
– X = (X_1, . . . , X_p) ∼ Uniform in the p-dimensional unit ball
centered at the origin.
– For a sample of size n, let R_i = (Σ_{k=1}^p X_{ik}²)^{1/2}, the distance
of the i-th point from the origin, for i = 1, . . . , n.
– Let R_(1) = min_i R_i. Then the median of R_(1) is
(1 − (1/2)^{1/n})^{1/p}.
– For n = 500 and p = 10, the median is approximately 0.52, more
than half way to the boundary (a Monte Carlo check is sketched below).
– Most data points are closer to the boundary of the sample space
than to the origin.
– Prediction is much more difficult near the edges since one must
extrapolate rather than interpolate.
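A Monte Carlo sketch of this calculation (illustrative; the number of replications and the seed are arbitrary assumptions):

```python
# Sample n points uniformly in the unit ball in R^p, record the distance from
# the origin to the closest one, and compare its median with the formula.
import numpy as np

def min_dist_to_origin(n, p, rng):
    z = rng.normal(size=(n, p))
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # uniform directions on the sphere
    r = rng.uniform(size=(n, 1)) ** (1 / p)        # radii giving a uniform ball
    return np.linalg.norm(r * z, axis=1).min()

rng = np.random.default_rng(0)
n, p = 500, 10
sims = [min_dist_to_origin(n, p, rng) for _ in range(200)]
print(np.median(sims), (1 - 0.5 ** (1 / n)) ** (1 / p))  # both are close to 0.52
```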
• Phenomenon 3:
– Suppose X ∼ Uniform[−1, 1]^p.
– Assume that the true relation is Y = f(X) = exp(−8‖X‖²).
– Consider the 1-NN estimate at x = 0.
– The bias of the estimator is 1 − exp(−8‖x_(1)‖²), where x_(1) is the
training point with the smallest norm.
– Since ‖X‖² = Σ_{i=1}^p X_i² ≥ X²_(p), where X²_(p) is the largest of
X_1², . . . , X_p², and X²_(p) → 1 as p → ∞,
the bias tends to increase as p increases (a numerical check follows below).
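A small numerical sketch of this effect (the sample size and seed are assumptions):

```python
# 1-NN bias at x = 0 for f(x) = exp(-8 * ||x||^2): the nearest training point
# drifts away from the origin as p grows, so the bias 1 - exp(-8 * ||x_(1)||^2)
# approaches 1.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for p in (1, 2, 10):
    X = rng.uniform(-1, 1, size=(n, p))
    nearest_sq_norm = (X ** 2).sum(axis=1).min()   # ||x_(1)||^2
    print(p, 1 - np.exp(-8 * nearest_sq_norm))     # bias of the 1-NN estimate at 0
```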
5. Overfitting and Bias-Variance tradeoff
• As we have seen, in the NN method the neighborhood size k
controls the complexity of the predictor. The question is how to
choose k.
• If we knew P(y, x), we could choose k by minimizing the EPE (test
error)
EPE(f̂_k) = E(Y − f̂_k(X))²,
where f̂_k is the k-NN estimate of f.
• Unfortunately, we do not know P(y, x).
• One naive answer is to estimate the EPE of f̂_k by the residual
sum of squares (training error):
Σ_{i=1}^n (y_i − f̂_k(x_i))².
• The training error is a downward-biased estimator of the test error,
since the data set is used twice (once for constructing f̂ and once
for calculating the training error).
• Moreover, the training error keeps decreasing as k gets smaller,
while the test error first decreases and then increases.
• This means that too complicated models (or models fitting the
training data too closely, or overfitted models) show poor
performance.
• This seemingly mysterious phenomenon can be explained by the
bias-variance decomposition.
• Several ways of choosing the model complexity (i.e. k in the NN
method) will be explained later.
Bias-Variance tradeoff (for regression)
• Suppose Y = f(X) + ϵ with E(ϵ) = 0 and Var(ϵ) = σ².
• For a given training sample L, the test error of f(x, L) is given by
TE = E_L E_{(Y,X)}((Y − f(X, L))²),
which decomposes as (derivation sketched below)
TE = E_{(Y,X)}((Y − f(X))²) + E_X((f(X) − E_L f(X, L))²)
+ E_X(E_L(f(X, L) − E_L f(X, L))²)
= σ² + E_X(Bias_L(X)² + Variance_L(X)).
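The decomposition follows by adding and subtracting f(X) and m(X) = E_L f(X, L); a sketch:

```latex
% Write Y - f(X,L) = eps + (f(X) - m(X)) + (m(X) - f(X,L)), with m(X) = E_L f(X,L).
\begin{align*}
\mathrm{TE}
  &= E_{(Y,X)}\!\left[(Y - f(X))^2\right]
   + E_X\!\left[(f(X) - m(X))^2\right]
   + E_X\, E_L\!\left[(f(X,L) - m(X))^2\right] \\
  &= \sigma^2 + E_X\!\left[\mathrm{Bias}_L(X)^2 + \mathrm{Variance}_L(X)\right].
\end{align*}
% All three cross terms have expectation zero: eps has mean zero and is
% independent of L, and E_L[f(X,L) - m(X)] = 0 by the definition of m(X).
```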
• In general, as the model becomes more complicated, the bias decreases
and the variance increases.
• Example: k-NN method
– f(x, L) = (1/k) Σ_{l=1}^k ( f(x_(l)) + ϵ_(l) ), where the subscript (l)
indicates the l-th nearest neighbor of x in the training sample.
– Then
Bias_L(x) = f(x) − (1/k) Σ_{l=1}^k f(x_(l))
and
Variance_L(x) = σ²/k.
– For k = 1, the bias is the smallest and the variance is the largest,
while for k = n, the bias is the largest and the variance is the
smallest.
• Plot: training and test error as functions of model complexity; the
low-complexity end corresponds to high bias and low variance, and the
high-complexity end to low bias and high variance.
6. Four situations in supervised learning
1. p is small and F is parametric.
• Standard regression and classification problems
• MLE, least squares, robust estimators, etc.
2. p is large and F is parametric.
• Develop efficient methods for small and moderate samples
• Variable selection, shrinkage, Bayesian methods, etc.
3. p is small and F is nonparametric.
• Nonparametric regression
• Kernels, splines, wavelets, mixture models, etc.
4. p is large and F is nonparametric.
• The main playground of Data Mining
• Decision trees, projection pursuit, MARS, neural networks, etc.