
7 - Model Assessment and Selection


Slides of a talk from the Machine Learning Seminar Series'11 at Kazan (Volga Region) Federal University. See http://cll.niimm.ksu.ru/cms/main/seminars/mlseminar


Page 1: 7 - Model Assessment and Selection

Model Assessment and Selection
Machine Learning Seminar Series'11

Nikita Zhiltsov

Kazan (Volga Region) Federal University, Russia

18 November 2011


Page 2: 7 - Model Assessment and Selection

Outline

1 Bias, Variance and Model Complexity

2 Nature of Prediction Error

3 Error Estimation: Analytical methods (AIC, BIC, SRM approach)

4 Error Estimation: Sample re-use (Cross-validation, Bootstrapping)

5 Model Assessment in R


Page 3: 7 - Model Assessment and Selection

Outline

1 Bias, Variance and Model Complexity

2 Nature of Prediction Error

3 Error Estimation: Analytical methods (AIC, BIC, SRM approach)

4 Error Estimation: Sample re-use (Cross-validation, Bootstrapping)

5 Model Assessment in R


Page 4: 7 - Model Assessment and Selection

Notation

x = (x_1, ..., x_D) ∈ X – a vector of inputs

t ∈ T – a target variable

y(x) – a prediction model

L(t, y(x)) – the loss function for measuring errors. Usual choices for regression:

L(t, y(x)) = (y(x) − t)²   (squared error)
L(t, y(x)) = |y(x) − t|    (absolute error)

... and for classification:

L(t, y(x)) = I(y(x) ≠ t)     (0-1 loss)
L(t, y(x)) = −2 log p_t(x)   (log-likelihood loss)
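These loss functions are easy to write down in R (the language used later in these slides); the function names below are ours, purely for illustration, and p denotes the predicted probability assigned to the true class.

# Regression losses
squared.loss  <- function(t, y) (y - t)^2
absolute.loss <- function(t, y) abs(y - t)

# Classification losses
zero.one.loss <- function(t, y) as.numeric(y != t)   # 0-1 loss
loglik.loss   <- function(p) -2 * log(p)              # p = predicted probability of the true class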


Page 5: 7 - Model Assessment and Selection

Notation (cont.)

err = (1/N) ∑_{i=1}^N L(t_i, y(x_i)) – training error

Err_D = E[L(t, y(x)) | D] – test error (prediction error) for a given training set D

Err = E[Err_D] = E[L(t, y(x))] – expected test error

NB: Most methods effectively estimate only Err.


Page 6: 7 - Model Assessment and Selection

Typical behavior of test and training error: example

• Training error is not a good estimate of the test error

• There is some intermediate model complexity that gives the minimum expected test error
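This behavior can be reproduced with a small simulation: fit polynomials of increasing degree to one training sample and evaluate each fit on a large independent test sample. The data-generating function and sample sizes below are arbitrary choices made just for this sketch, not taken from the slides.

set.seed(1)
gen.data <- function(n) {
  x <- runif(n, -1, 1)
  data.frame(x = x, t = sin(2 * pi * x) + rnorm(n, sd = 0.3))
}
train <- gen.data(50)
test  <- gen.data(1000)

errs <- sapply(1:12, function(d) {                     # d = polynomial degree (model complexity)
  fit <- lm(t ~ poly(x, d), data = train)
  c(train = mean((train$t - predict(fit))^2),
    test  = mean((test$t  - predict(fit, newdata = test))^2))
})
round(errs, 3)   # training error keeps falling with d; test error is U-shaped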


Page 7: 7 - Model Assessment and Selection

Defining our goals

Model Selection: estimating the performance of different models in order to choose the best one

Model Assessment: having chosen a final model, estimating its generalization error on new data


Page 8: 7 - Model Assessment and Selection

Data-rich situation

• Training set is used to learn the models

• Validation set is used to estimate prediction error for model selection

• Test set is used for assessment of the generalization error of the chosen model
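A minimal sketch of such a three-way split in R; the 50/25/25 proportions and the use of mtcars are placeholders (any data frame, e.g. the Housing data loaded later in these slides, works the same way).

set.seed(3)
dat <- mtcars                                   # placeholder data frame
n   <- nrow(dat)
idx <- sample(cut(seq_len(n), breaks = n * c(0, 0.50, 0.75, 1),
                  labels = c("train", "valid", "test")))
train <- dat[idx == "train", ]                  # fit models here
valid <- dat[idx == "valid", ]                  # compare and select models here
test  <- dat[idx == "test", ]                   # report the chosen model's generalization error here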


Page 9: 7 - Model Assessment and Selection

Outline

1 Bias, Variance and Model Complexity

2 Nature of Prediction Error

3 Error Estimation: Analytical methods (AIC, BIC, SRM approach)

4 Error Estimation: Sample re-use (Cross-validation, Bootstrapping)

5 Model Assessment in R


Page 10: 7 - Model Assessment and Selection

Bias-Variance Decomposition

Let's consider the expected loss E[L] for the regression task:

E[L] = ∫_R ∫_X L(t, y(x)) p(x, t) dx dt

Under squared error loss, h(x) = E[t|x] = ∫ t p(t|x) dt is the optimal prediction. Then E[L] can be decomposed into the sum of three parts:

E[L] = bias² + variance + noise

where

bias² = ∫ (E_D[y(x; D)] − h(x))² p(x) dx

variance = ∫ E_D[(y(x; D) − E_D[y(x; D)])²] p(x) dx

noise = ∫∫ (h(x) − t)² p(x, t) dx dt
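The three terms can be approximated by simulation: repeatedly draw a training set D from a known generator, refit the model, and look at the predictions at a fixed point x0. Everything below (the true function h, the noise level, the degree-3 polynomial model) is an assumption made only for this sketch.

set.seed(2)
h     <- function(x) sin(2 * pi * x)     # "true" regression function
sigma <- 0.3                             # noise standard deviation
x0    <- 0.25                            # point at which we decompose the error
preds <- replicate(500, {                # 500 simulated training sets D
  x <- runif(30, -1, 1)
  t <- h(x) + rnorm(30, sd = sigma)
  fit <- lm(t ~ poly(x, 3))
  predict(fit, newdata = data.frame(x = x0))
})
c(bias2    = (mean(preds) - h(x0))^2,    # (E_D[y(x0; D)] - h(x0))^2
  variance = var(preds),                 # E_D[(y(x0; D) - E_D[y(x0; D)])^2]
  noise    = sigma^2)                    # irreducible error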


Page 11: 7 - Model Assessment and Selection

Bias-Variance Decomposition: examples

• For a linear model y(x, w) = ∑_{j=1}^p w_j x_j, ∀ w_j ≠ 0, the in-sample error is:

Err = (1/N) ∑_{i=1}^N (E_D[y(x_i)] − h(x_i))² + (p/N) σ_ε² + σ_ε²

• For a ridge regression model (Tikhonov regularization):

Err = (1/N) ∑_{i=1}^N {(y*(x_i) − h(x_i))² + (E_D[y(x_i)] − y*(x_i))²} + Var + σ_ε²

where y*(x_i) is the best-fitting linear approximation to h


Page 12: 7 - Model Assessment and Selection

Behavior of bias and variance


Page 13: 7 - Model Assessment and Selection

Bias-variance tradeoff: example

• Regression with squared loss

• Classification with 0-1 loss

• In the second case, prediction error is no longer the sum of squared bias and variance

⇒ The best choices of tuning parameters may differ substantially in the two settings


Page 14: 7 - Model Assessment and Selection

Outline

1 Bias, Variance and Model Complexity

2 Nature of Prediction Error

3 Error Estimation: Analytical methods (AIC, BIC, SRM approach)

4 Error Estimation: Sample re-use (Cross-validation, Bootstrapping)

5 Model Assessment in R


Page 15: 7 - Model Assessment and Selection

Analytical methods: AIC, BIC, SRM

• They give in-sample estimates of the general form:

Err = err + w

where w is an estimate of the average optimism

• By using w, the methods penalize overly complex models

• Unlike regularization, they do not impose a specific regularization parameter λ

• Each criterion defines its own notion of model complexity involved in the penalizing term


Page 16: 7 - Model Assessment and Selection

Akaike Information Criterion (AIC)

• Applicable to linear models

• Either log-likelihood loss or squared error loss is used

• Given a set of models indexed by a tuning parameter α, denote by d(α) the number of parameters of each model. Then

AIC(α) = err + 2 (d(α)/N) σ_ε²

where σ_ε² is typically estimated by the mean squared error of a low-bias model

• Finally, we choose the model giving the smallest AIC
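In R, the AIC() function computes a closely related criterion (−2 log-likelihood + 2d) for models fitted by maximum likelihood, and the model with the smallest value is chosen just as above. A minimal sketch on the built-in mtcars data, used only to keep the example self-contained:

# Candidate models of increasing complexity
fits <- list(
  small  = lm(mpg ~ wt,                    data = mtcars),
  medium = lm(mpg ~ wt + hp,               data = mtcars),
  large  = lm(mpg ~ wt + hp + disp + drat, data = mtcars)
)
sapply(fits, AIC)    # choose the model with the smallest AIC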


Page 17: 7 - Model Assessment and Selection

Akaike Information Criterion (AIC): example

• Phoneme recognition task (N = 1000)

• The input vector is the log-periodogram of the spoken vowel, quantized to 256 uniformly spaced frequencies

• Linear logistic regression is used to predict the phoneme class

• Here d(α) is the number of basis functions


Page 18: 7 - Model Assessment and Selection

Bayesian Information Criterion (BIC)

• BIC, like AIC, is applicable in settings where log-likelihood maximization is involved:

BIC = (N/σ_ε²) (err + (log N) (d/N) σ_ε²)

• BIC is proportional to AIC, with the factor 2 replaced by log N

• For N > 8 (so that log N > 2), BIC tends to penalize complex models more heavily than AIC

• BIC also provides the posterior probability of each model m:

exp(−BIC_m / 2) / ∑_{l=1}^M exp(−BIC_l / 2)

• BIC is asymptotically consistent as N → ∞
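R's BIC() works the same way as AIC() (again on the −2 log-likelihood scale, with the log N penalty), and the posterior probabilities from the formula above take one extra line. The mtcars models are again purely illustrative:

fits <- list(
  small  = lm(mpg ~ wt,                    data = mtcars),
  medium = lm(mpg ~ wt + hp,               data = mtcars),
  large  = lm(mpg ~ wt + hp + disp + drat, data = mtcars)
)
bic  <- sapply(fits, BIC)
post <- exp(-0.5 * (bic - min(bic)))   # subtracting min(bic) only improves numerical stability
round(post / sum(post), 3)             # approximate posterior probability of each model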


Page 19: 7 - Model Assessment and Selection

Structural Risk Minimization

• The Vapnik-Chervonenkis (VC) theory provides a general measure of model complexity and gives associated bounds on the optimism

• Such a complexity measure, the VC dimension, is defined as follows:

The VC dimension of the class of functions {f(x, α)} is the largest number of points that can be shattered by members of {f(x, α)}

• E.g., a linear indicator function in p dimensions has VC dimension p + 1; sin(αx) has infinite VC dimension


Page 20: 7 - Model Assessment and Selection

Structural Risk Minimization (cont.)

• If we fit N training points using {f(x, α)} having VC dimension h, then with probability at least 1 − η the following bound holds:

Err < err + √( (h (ln(2N/h) + 1) − ln η) / N )

• The SRM approach fits a nested sequence of models of increasing VC dimension h_1 < h_2 < ... and then chooses the model with the smallest upper bound

• The SVM classifier efficiently carries out the SRM approach

Issues

– It is difficult to calculate the VC dimension of a class of functions

– In practice, the upper bound is often very loose

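One quick way to see how loose the bound can be is to evaluate it numerically; the helper below is ours, and the numbers plugged in are arbitrary.

# Upper bound on Err from the formula above, as a function of the
# training error, VC dimension h, sample size N and confidence 1 - eta
vc.bound <- function(err, h, N, eta = 0.05) {
  err + sqrt((h * (log(2 * N / h) + 1) - log(eta)) / N)
}
vc.bound(err = 0.10, h = 10, N = 1000)   # guarantees only about 0.36, far above the training error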

Page 21: 7 - Model Assessment and Selection

Outline

1 Bias, Variance and Model Complexity

2 Nature of Prediction Error

3 Error Estimation: Analytical methods (AIC, BIC, SRM approach)

4 Error Estimation: Sample re-use (Cross-validation, Bootstrapping)

5 Model Assessment in R


Page 22: 7 - Model Assessment and Selection

Sample re-use: cross-validation, bootstrapping

• These methods directly (and quite accurately) estimate the average generalization error

• The extra-sample error is evaluated rather than the in-sample one (test input vectors do not need to coincide with training ones)

• They can be used with any loss function, and with nonlinear, adaptive fitting techniques

• However, they may underestimate the true error for fitting methods such as trees


Page 23: 7 - Model Assessment and Selection

Cross-validation

• Probably the simplest and most widely used method

• However, it is a time-consuming method

• The CV procedure looks as follows:

1 Split the data into K roughly equal-sized parts

2 For the k-th part, fit the model y^{−k}(x) to the other K − 1 parts

3 Then the cross-validation estimate of the prediction error is

CV = (1/N) ∑_{i=1}^N L(t_i, y^{−k(i)}(x_i))

• The case K = N (leave-one-out cross-validation) is roughly unbiased, but can have high variance
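For a concrete version of this procedure, here is a from-scratch K-fold CV estimate for a linear model with squared-error loss. The helper name cv.error is ours; the packaged alternative, crossval() from the bootstrap package, is used later in these slides.

cv.error <- function(formula, data, K = 10) {
  n     <- nrow(data)
  folds <- sample(rep(1:K, length.out = n))                 # random split into K roughly equal parts
  sq    <- numeric(n)
  for (k in 1:K) {
    fit  <- lm(formula, data = data[folds != k, ])          # fit y^{-k} on the other K - 1 parts
    held <- data[folds == k, ]
    obs  <- model.response(model.frame(formula, held))
    sq[folds == k] <- (obs - predict(fit, newdata = held))^2
  }
  mean(sq)                                                  # CV estimate of the prediction error
}
# e.g. cv.error(mpg ~ wt + hp, mtcars, K = 5)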


Page 24: 7 - Model Assessment and Selection

Cross-validation (cont.)

• In practice, 5- or 10-fold cross-validation is recommended

• CV tends to overestimate the true prediction error on small datasets

• Often the "one-standard-error" rule is used with CV (sketched below). See the example:

• We choose the most parsimonious model whose error is no more than one standard error above the error of the best model

• A model with p = 9 would be chosen
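A sketch of the rule itself, assuming we already have CV error estimates and their standard errors for models of increasing size; all three input vectors (size, cv.err, cv.se) are hypothetical.

one.se.choice <- function(size, cv.err, cv.se) {
  best      <- which.min(cv.err)
  threshold <- cv.err[best] + cv.se[best]   # best CV error plus one standard error
  min(size[cv.err <= threshold])            # most parsimonious model under the threshold
}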


Page 25: 7 - Model Assessment and Selection

Bootstrapping

• A general method for assessing statistical accuracy

• Given a training set, the bootstrapping procedure steps are:

1 Randomly draw datasets with replacement from the training set; each sample is of the same size as the original one

2 This is done B times, producing B bootstrap datasets

3 Fit the model to each of the bootstrap datasets

4 Examine the prediction error using the original training set as a test set:

Err_boot = (1/N) ∑_{i=1}^N (1/|C^{−i}|) ∑_{b ∈ C^{−i}} L(t_i, y^{*b}(x_i))

where C^{−i} is the set of indices of the bootstrap samples that do not contain observation i

• To alleviate the upward bias, the .632 estimator is used:

Err^{(.632)} = 0.368 err + 0.632 Err_boot
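The same estimates can be coded from scratch for a linear model with squared-error loss; the helper below is only a sketch (the boot and bootstrap packages, used later in these slides, provide more general machinery).

err.632 <- function(formula, data, B = 200) {
  n    <- nrow(data)
  obs  <- model.response(model.frame(formula, data))
  loss <- matrix(NA, n, B)                            # loss[i, b] is filled only when i is outside sample b
  for (b in 1:B) {
    idx <- sample(n, replace = TRUE)                  # bootstrap sample of the same size
    fit <- lm(formula, data = data[idx, ])
    out <- setdiff(1:n, idx)                          # observations not in bootstrap sample b
    loss[out, b] <- (obs[out] - predict(fit, newdata = data[out, ]))^2
  }
  err.boot <- mean(rowMeans(loss, na.rm = TRUE), na.rm = TRUE)  # average over b in C^{-i}, then over i
  err      <- mean((obs - fitted(lm(formula, data = data)))^2)  # training error
  0.368 * err + 0.632 * err.boot                                # the .632 estimator
}
# e.g. err.632(mpg ~ wt + hp, mtcars)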


Page 26: 7 - Model Assessment and Selection

Outline

1 Bias, Variance and Model Complexity

2 Nature of Prediction Error

3 Error Estimation: Analytical methods (AIC, BIC, SRM approach)

4 Error Estimation: Sample re-use (Cross-validation, Bootstrapping)

5 Model Assessment in R


Page 27: 7 - Model Assessment and Selection

http://r-project.org

• Free software environment for statistical computing and graphics

• R packages for machine learning and data mining: kernlab, rpart, randomForest, animation, gbm, tm, etc.

• R packages for evaluation: bootstrap, boot

• RStudio IDE


Page 28: 7 - Model Assessment and Selection

Housing dataset at the UCI Machine Learning Repository
http://archive.ics.uci.edu/ml/datasets/Housing

• Housing values in suburbs of Boston

• 506 instances, 13 attributes + 1 numeric "class" attribute (MEDV)


Page 29: 7 - Model Assessment and Selection

Loading data in R

> housing <- read.table("~/projects/r/housing.data",
+                        header=T)

> attach(housing)


Page 30: 7 - Model Assessment and Selection

Cross-validation example in R: helper function

Creating a function using crossval() from the bootstrap package

> eval <- function(fit,k=10){

+ require(bootstrap)

+ theta.fit <- function(x,y){lsfit(x,y)}

+ theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}

+ x <- fit$model[,2:ncol(fit$model)]

+ y <- fit$model[,1]

+ results <- crossval(x,y,theta.fit,theta.predict,

+ ngroup=k)

+ squared.error=sum((y-results$cv.fit)^2)/length(y)

+ cat("Cross-validated squared error =",

+ squared.error, "\n")}


Page 31: 7 - Model Assessment and Selection

Cross-validation example in R: model assessment

> fit <- lm(MEDV~., data=housing)  # A linear model that uses all the attributes

> eval(fit)

Cross-validated squared error = 23.15827

> fit <- lm(MEDV~ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS,
+           data=housing)  # Less complex model

> eval(fit)

Cross-validated squared error = 23.24319

> fit <- lm(MEDV~RM, data=housing)  # Too simple model

> eval(fit)

Cross-validated squared error = 44.38424


Page 32: 7 - Model Assessment and Selection

Bootstrapping example in R: helper function

Creating a function using the boot() function from the boot package

> sqer <- function(formula,data,indices){

+ d <- data[indices,]

+ fit <- lm(formula, data=d)

+ return (sum(fit$residuals^2)/length(fit$residuals))

+ }


Page 33: 7 - Model Assessment and Selection

Bootstrapping example in R: model assessment

> results <- boot(data=housing, statistic=sqer, R=1000,
+                 formula=MEDV~.)  # 1000 bootstrapped datasets

> print(results)

Bootstrap Statistics :

original bias std. error

t1* 21.89483 -0.76001 2.296025

> results <- boot(data=housing, statistic=sqer, R=1000,
+                 formula=MEDV~ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS)

> print(results)

Bootstrap Statistics :

original bias std. error

t1* 22.88726 -0.5400892 2.744437

> results <- boot(data=housing, statistic=sqer, R=1000,
+                 formula=MEDV~RM)

> print(results)

Bootstrap Statistics :

original bias std. error

t1* 43.60055 -0.3379168 5.407933

Page 34: 7 - Model Assessment and Selection

Resources

• T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, 2008

• Stanford Engineering Everywhere CS229 – Machine Learning. Handouts 4 and 5.
http://videolectures.net/stanfordcs229f07_machine_learning/
