Slides of a talk given at the Machine Learning Seminar Series '11 at Kazan (Volga Region) Federal University. See http://cll.niimm.ksu.ru/cms/main/seminars/mlseminar
Model Assessment and Selection
Machine Learning Seminar Series '11
Nikita Zhiltsov
Kazan (Volga Region) Federal University, Russia
18 November 2011
Outline

1 Bias, Variance and Model Complexity
2 Nature of Prediction Error
3 Error Estimation: Analytical Methods (AIC, BIC, SRM Approach)
4 Error Estimation: Sample Re-use (Cross-validation, Bootstrapping)
5 Model Assessment in R
Notation
x = (x1, . . . , xD) ∈ X – a vector of inputs
t ∈ T – a target variable
y(x) – a prediction model
L(t, y(x)) – the loss function for measuring errors. Usual choices for regression:

L(t, y(x)) = (y(x) − t)²   (squared error)
L(t, y(x)) = |y(x) − t|    (absolute error)

... and classification:

L(t, y(x)) = I(y(x) ≠ t)      (0-1 loss)
L(t, y(x)) = −2 log p_t(x)    (log-likelihood loss)
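These losses are one-liners in R; a minimal illustrative sketch (the function names are ours, not from the slides):

> sq.loss  <- function(t, y) (y - t)^2           # squared error
> abs.loss <- function(t, y) abs(y - t)          # absolute error
> zero.one <- function(t, y) as.numeric(y != t)  # 0-1 loss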
Notation (cont.)

err = (1/N) ∑_{i=1}^N L(t_i, y(x_i)) – training error
Err_D = E[L(t, y(x)) | D] – test error (prediction error) for a given training set D
Err = E[Err_D] = E[L(t, y(x))] – expected test error

NB: Most methods effectively estimate only Err.
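A quick simulation makes the gap between err and Err concrete. The following sketch is our own illustration (made-up data, an arbitrary degree-15 polynomial): it fits a flexible model and compares its training error to the error on a large fresh sample:

> set.seed(1)
> x <- runif(50); t <- sin(2*pi*x) + rnorm(50, sd=0.3)
> fit <- lm(t ~ poly(x, 15))             # a flexible model
> err <- mean((t - fitted(fit))^2)       # training error
> x.new <- runif(10000)                  # fresh test inputs
> t.new <- sin(2*pi*x.new) + rnorm(10000, sd=0.3)
> Err <- mean((t.new - predict(fit, data.frame(x=x.new)))^2)
> c(err=err, Err=Err)                    # Err is typically much larger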
Typical behavior of test and training error: Example

- Training error is not a good estimate of the test error
- There is some intermediate model complexity that gives minimum expected test error
Defining our goals

Model Selection: estimating the performance of different models in order to choose the best one.
Model Assessment: having chosen a final model, estimating its generalization error on new data.
Data-rich situation

- Training set is used to learn the models
- Validation set is used to estimate prediction error for model selection
- Test set is used for assessment of the generalization error of the chosen model
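A toy illustration of such a three-way split (the 50/25/25 proportions are our assumption; the slides do not prescribe ratios):

> set.seed(1)
> n <- 200   # illustrative sample size
> role <- sample(rep(c("train","validation","test"), times=c(100, 50, 50)))
> table(role)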
Bias-Variance Decomposition

Let's consider the expected loss E[L] for the regression task:

E[L] = ∫_X ∫_R L(t, y(x)) p(x, t) dt dx

Under squared error loss, h(x) = E[t|x] = ∫ t p(t|x) dt is the optimal prediction. Then E[L] can be decomposed into the sum of three parts:

E[L] = bias² + variance + noise

where

bias² = ∫ (E_D[y(x; D)] − h(x))² p(x) dx
variance = ∫ E_D[(y(x; D) − E_D[y(x; D)])²] p(x) dx
noise = ∫∫ (h(x) − t)² p(x, t) dx dt
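These three terms can be estimated by simulation: fit y(x; D) on many synthetic training sets D and average over them. A minimal sketch (the true function h, the noise level, and the degree-5 polynomial model are our own choices):

> set.seed(1)
> h <- function(x) sin(2*pi*x)       # true regression function
> x0 <- seq(0, 1, length.out=100)    # fixed evaluation points
> preds <- replicate(200, {          # 200 training sets D
+   x <- runif(25); t <- h(x) + rnorm(25, sd=0.3)
+   predict(lm(t ~ poly(x, 5)), data.frame(x=x0))
+ })
> bias2 <- mean((rowMeans(preds) - h(x0))^2)
> variance <- mean(apply(preds, 1, var))
> c(bias2=bias2, variance=variance, noise=0.3^2)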
Bias-Variance Decomposition: Examples

- For a linear model y(x, w) = ∑_{j=1}^p w_j x_j with all w_j ≠ 0, the in-sample error is:

  Err = (1/N) ∑_{i=1}^N (y(x_i) − h(x_i))² + (p/N) σ_ε² + σ_ε²

- For a ridge regression model (Tikhonov regularization):

  Err = (1/N) ∑_{i=1}^N {(y*(x_i) − h(x_i))² + (y(x_i) − y*(x_i))²} + Var + σ_ε²

  where y*(x_i) is the best-fitting linear approximation to h (the squared bias thus splits into a model bias and an estimation bias term)
Bias-variance tradeoff: Example

- Regression with squared loss
- Classification with 0-1 loss
- In the second case, prediction error is no longer the sum of squared bias and variance
- ⇒ The best choices of tuning parameters may differ substantially in the two settings
Analytical methods: AIC, BIC, SRM

- They give in-sample estimates of the general form:

  Err = err + w

  where w is an estimate of the average optimism
- By using w, the methods penalize overly complex models
- Unlike regularization, they do not impose a specific regularization parameter λ
- Each criterion defines its own notion of model complexity involved in the penalizing term
Akaike Information Criterion (AIC)

- Applicable to linear models
- Either log-likelihood loss or squared error loss is used
- Given a set of models indexed by a tuning parameter α, denote by d(α) the number of parameters of each model. Then

  AIC(α) = err + 2 (d(α)/N) σ_ε²

  where σ_ε² is typically estimated by the mean squared error of a low-bias model
- Finally, we choose the model giving the smallest AIC (see the R sketch below)
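In R, the built-in stats function AIC() computes the log-likelihood form of the criterion for fitted models; a minimal sketch on the built-in mtcars data (our example, not from the slides):

> fit.full  <- lm(mpg ~ ., data=mtcars)    # complex model
> fit.small <- lm(mpg ~ wt, data=mtcars)   # simple model
> AIC(fit.full, fit.small)                 # choose the model with the smaller value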
Akaike Information Criterion (AIC): Example

- Phoneme recognition task (N = 1000)
- The input vector is the log-periodogram of the spoken vowel, quantized to 256 uniformly spaced frequencies
- Linear logistic regression is used to predict the phoneme class
- Here d(α) is the number of basis functions
Bayesian Information Criterion (BIC)

- BIC, like AIC, is applicable in settings where log-likelihood maximization is involved:

  BIC = (N/σ_ε²) (err + (log N) (d/N) σ_ε²)

- BIC is proportional to AIC, with the factor 2 replaced by log N
- For N > e² ≈ 7.4, BIC tends to penalize complex models more heavily than AIC
- BIC also provides the posterior probability of each model m (a sketch follows below):

  e^{−BIC_m/2} / ∑_{l=1}^M e^{−BIC_l/2}

- BIC is asymptotically consistent: as N → ∞, it selects the true model with probability tending to 1
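The stats function BIC() gives the criterion directly, and the posterior formula above is a one-liner; an illustrative sketch on mtcars (assuming an equal prior over the candidate models):

> fits <- list(full  = lm(mpg ~ .,  data=mtcars),
+              small = lm(mpg ~ wt, data=mtcars))
> bic <- sapply(fits, BIC)
> exp(-bic/2) / sum(exp(-bic/2))   # posterior probability of each model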
Structural Risk Minimization

- The Vapnik-Chervonenkis (VC) theory provides a general measure of model complexity and gives associated bounds on the optimism
- Such a complexity measure, the VC dimension, is defined as follows: the VC dimension of the class of functions {f(x, α)} is the largest number of points that can be shattered by members of {f(x, α)}
- E.g., a linear indicator function in p dimensions has VC dimension p + 1; sin(αx) has infinite VC dimension
Structural Risk Minimization (cont.)

- If we fit N training points using a class {f(x, α)} of VC dimension h, then with probability at least 1 − η the following bound holds (a sketch for evaluating it follows below):

  Err < err + √( (h/N)(ln(2N/h) + 1) − (ln η)/N )

- The SRM approach fits a nested sequence of models of increasing VC dimension h₁ < h₂ < . . . and then chooses the model with the smallest upper bound
- The SVM classifier efficiently carries out the SRM approach

Issues
- There is a difficulty in calculating the VC dimension of a class of functions
- In practice, the upper bound is often very loose
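The bound itself is easy to evaluate; a sketch (the function name is ours) showing how it grows with the VC dimension h at a fixed N, training error, and confidence level:

> vc.bound <- function(err, N, h, eta=0.05)
+   err + sqrt((h/N) * (log(2*N/h) + 1) - log(eta)/N)
> sapply(c(5, 50, 500), function(h) vc.bound(err=0.1, N=1000, h=h))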
Sample re-use: cross-validation, bootstrapping

- These methods directly (and quite accurately) estimate the average generalization error
- The extra-sample error is evaluated rather than the in-sample one (test input vectors need not coincide with training ones)
- They can be used with any loss function and with nonlinear, adaptive fitting techniques
- However, they may underestimate the true error for fitting methods such as trees
Cross-validation

- Probably the simplest and most widely used method
- However, a time-consuming method
- The CV procedure looks as follows (a from-scratch sketch follows below):
  1 Split the data into K roughly equal-sized parts
  2 For the k-th part, fit the model y^{−k}(x) to the other K − 1 parts
  3 The cross-validation estimate of the prediction error is then

    CV = (1/N) ∑_{i=1}^N L(t_i, y^{−k(i)}(x_i))

- The case K = N (leave-one-out cross-validation) is roughly unbiased, but can have high variance
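A from-scratch sketch of this procedure for a linear model on the built-in mtcars data (the model formula is arbitrary; the slides later use crossval() from the bootstrap package instead):

> kfold.cv <- function(data, K=10) {
+   folds <- sample(rep(1:K, length.out=nrow(data)))     # random fold labels
+   sq <- lapply(1:K, function(k) {
+     fit <- lm(mpg ~ wt + hp, data=data[folds != k, ])  # fit on K-1 parts
+     pred <- predict(fit, newdata=data[folds == k, ])   # predict the k-th part
+     (data$mpg[folds == k] - pred)^2                    # squared errors
+   })
+   mean(unlist(sq))   # CV estimate of the prediction error
+ }
> kfold.cv(mtcars)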
Cross-validation (cont.)

- In practice, 5- or 10-fold cross-validation is recommended
- CV tends to overestimate the true prediction error on small datasets
- Often the "one-standard-error" rule is used with CV (see the example and the sketch after this list):
  - We choose the most parsimonious model whose error is no more than one standard error above the error of the best model
  - Here a model with p = 9 would be chosen
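A sketch of the rule itself, given per-model CV errors and their standard errors (the function name and inputs are hypothetical):

> one.se.choice <- function(p, cv.err, cv.se) {
+   best <- which.min(cv.err)                      # lowest-error model
+   min(p[cv.err <= cv.err[best] + cv.se[best]])   # most parsimonious within 1 SE
+ }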
Bootstrapping

- A general method for assessing statistical accuracy
- Given a training set, the bootstrapping procedure steps are (a from-scratch sketch follows below):
  1 Randomly draw datasets with replacement from the training set; each sample is of the same size as the original one
  2 This is done B times, producing B bootstrap datasets
  3 Fit the model to each of the bootstrap datasets
  4 Examine the prediction error using the original training set as a test set:

    Err_boot = (1/N) ∑_{i=1}^N (1/|C^{−i}|) ∑_{b ∈ C^{−i}} L(t_i, y*_b(x_i))

    where y*_b is the model fit to bootstrap dataset b, and C^{−i} is the set of indices of the bootstrap samples that do not contain observation i
- To alleviate the upward bias, the .632 estimator is used:

  Err^(.632) = 0.368 err + 0.632 Err_boot
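A from-scratch sketch of the Err_boot formula above, again on mtcars with an arbitrary linear model (the slides later use boot() with a simpler residual-based statistic):

> loo.boot <- function(data, B=100) {
+   n <- nrow(data)
+   L <- matrix(NA, n, B)   # L[i,b] = loss of model b on observation i
+   for (b in 1:B) {
+     idx <- sample(n, replace=TRUE)     # bootstrap sample
+     fit <- lm(mpg ~ wt + hp, data=data[idx, ])
+     out <- setdiff(1:n, idx)           # observations not drawn
+     L[out, b] <- (data$mpg[out] - predict(fit, newdata=data[out, ]))^2
+   }
+   mean(rowMeans(L, na.rm=TRUE), na.rm=TRUE)   # average over C^{-i}, then over i
+ }
> loo.boot(mtcars)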
http://r-project.org

- Free software environment for statistical computing and graphics
- R packages for machine learning and data mining: kernlab, rpart, randomForest, animation, gbm, tm, etc.
- R packages for evaluation: bootstrap, boot
- RStudio IDE
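To reproduce the session below, the bootstrap package must be installed first (boot ships with the standard R distribution):

> install.packages("bootstrap")
> library(boot)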
Housing dataset at the UCI Machine Learning repository
http://archive.ics.uci.edu/ml/datasets/Housing

- Housing values in suburbs of Boston
- 506 instances, 13 attributes + 1 numeric "class" attribute (MEDV)
Loading data in R

> housing <- read.table("~/projects/r/housing.data",
+                       header=T)
> attach(housing)
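Note that the raw UCI file has no header row, so header=T assumes a local copy with column names added. If working from the raw file, one option is to supply the names from the dataset description (a sketch; the file path is illustrative):

> cols <- c("CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS",
+           "RAD","TAX","PTRATIO","B","LSTAT","MEDV")
> housing <- read.table("housing.data", header=FALSE, col.names=cols)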
Cross-validation example in R: Helper function

Creating a function using crossval() from the bootstrap package:

> eval <- function(fit, k=10){
+   require(bootstrap)
+   # fitting and prediction functions expected by crossval()
+   theta.fit <- function(x,y){ lsfit(x,y) }
+   theta.predict <- function(fit,x){ cbind(1,x) %*% fit$coef }
+   x <- fit$model[,2:ncol(fit$model)]   # predictors from the fitted model
+   y <- fit$model[,1]                   # response
+   results <- crossval(x, y, theta.fit, theta.predict,
+                       ngroup=k)
+   squared.error <- sum((y - results$cv.fit)^2)/length(y)
+   cat("Cross-validated squared error =",
+       squared.error, "\n")
+ }
Cross-validation example in R: Model assessment

> fit <- lm(MEDV ~ ., data=housing)   # a linear model that uses all the attributes
> eval(fit)
Cross-validated squared error = 23.15827
> fit <- lm(MEDV ~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS,
+           data=housing)             # a less complex model
> eval(fit)
Cross-validated squared error = 23.24319
> fit <- lm(MEDV ~ RM, data=housing)  # a too simple model
> eval(fit)
Cross-validated squared error = 44.38424
Bootstrapping example in R: Helper function

Creating a statistic function for boot() from the boot package:

> sqer <- function(formula, data, indices){
+   d <- data[indices,]   # bootstrap sample selected by boot()
+   fit <- lm(formula, data=d)
+   return(sum(fit$residuals^2)/length(fit$residuals))  # mean squared residual
+ }

Note that boot() reports this statistic evaluated on the original data ("original" in the output below), together with the bootstrap estimates of its bias and standard error.
Bootstrapping example in R: Model assessment

> results <- boot(data=housing, statistic=sqer, R=1000,
+                 formula=MEDV ~ .)   # 1000 bootstrapped datasets
> print(results)
Bootstrap Statistics :
    original       bias    std. error
t1* 21.89483   -0.76001      2.296025
> results <- boot(data=housing, statistic=sqer, R=1000,
+                 formula=MEDV ~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS)
> print(results)
Bootstrap Statistics :
    original       bias    std. error
t1* 22.88726  -0.5400892     2.744437
> results <- boot(data=housing, statistic=sqer, R=1000,
+                 formula=MEDV ~ RM)
> print(results)
Bootstrap Statistics :
    original       bias    std. error
t1* 43.60055  -0.3379168     5.407933
Resources

- T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, 2008
- Stanford Engineering Everywhere CS229 – Machine Learning. Handouts 4 and 5. http://videolectures.net/stanfordcs229f07_machine_learning/