
Homework 3: Solutions

Statistics 613 Fall 2017

Mathematical Problem:

1. Use the kernel trick to derive Kernel Logistic Regression OR Kernel Discriminant Analysis. Show your work.

Kernel Logistic Regression:

Here we examine kernel logistic regression. To begin consider the typical logistic regression setup.

We consider a sample $\{(y_i, x_i)\}_{i=1}^{n}$ where the $y_i$ are iid $\mathrm{Bernoulli}(p_i)$ and $x_i \in \mathbb{R}^d$. Here we wish to model $E[y_i \mid x_i] = p_i = p_i(x_i)$. Note $p_i(x_i) = P[y_i = 1 \mid x_i]$ and $1 - p_i(x_i) = P[y_i = 0 \mid x_i]$.

We assume $\mathrm{logit}(p_i(x_i)) = f(x_i)$, i.e.

$$\log\left(\frac{p_i(x_i)}{1 - p_i(x_i)}\right) = f(x_i).$$

Note that in the typical logistic regression setting we assume $f$ to be linear; here we allow $f$ to be a more general function. The above delivers the following expression for $p_i$:

$$p_i(x_i) = \frac{e^{f(x_i)}}{1 + e^{f(x_i)}}.$$

The likelihood of the $y_i$'s is given by

$$L(f, y) = \prod_{i=1}^{n} p_i(x_i)^{y_i} (1 - p_i(x_i))^{1 - y_i}$$

so that

$$\log(L(f, y)) = \sum_{i=1}^{n} \left( y_i f(x_i) - \log(1 + e^{f(x_i)}) \right).$$

To proceed with kernel logistic regression, let $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a symmetric, positive definite, continuous kernel function. There exists a corresponding Reproducing Kernel Hilbert Space (RKHS), $\mathcal{H}_k$, for which $k$ is the reproducing kernel. By Mercer's Theorem, we may express $k$ via its eigen-expansion:

$$k(x, y) = \sum_{i=1}^{\infty} \gamma_i \phi_i(x) \phi_i(y)$$

where the $\phi_i$ are eigenfunctions. Similarly, for $f \in \mathcal{H}_k$ we may write

$$f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x) \qquad (1)$$

for some sequence $\{c_i\}_{i=1}^{\infty}$. Given the above, we may frame the optimization problem for kernel logistic regression as minimizing the (regularized) negative log-likelihood over the RKHS. Specifically,

$$\min_{f \in \mathcal{H}_k} \; \sum_{i=1}^{n} \left[ \log(1 + e^{f(x_i)}) - y_i f(x_i) \right] + \frac{1}{2}\lambda \|f\|^2_{\mathcal{H}_k} \qquad (2)$$


Given the representation of $f \in \mathcal{H}_k$ in (1), we see that the above is equivalent to minimization over sequences $\{c_i\}_{i=1}^{\infty}$. Hence (2) is an infinite-dimensional optimization problem.

However, by the Representer Theorem we know that if $f^*$ solves (2), then $f^*$ may be expressed as

$$f^*(x) = \sum_{j=1}^{n} \alpha_j k(x, x_j)$$

for some $\alpha \in \mathbb{R}^n$. Hence we may express (2) as a finite-dimensional optimization problem, namely

$$\min_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \left[ \log(1 + e^{e_i' K \alpha}) - y_i e_i' K \alpha \right] + \frac{1}{2}\lambda \alpha' K \alpha$$

where $K$ denotes the $n \times n$ kernel matrix with $K_{ij} = k(x_i, x_j)$, and $e_i$ denotes the $i$th standard basis vector. The above is the (regularized) likelihood problem associated with kernel logistic regression.
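For concreteness, here is a minimal sketch of how this finite-dimensional problem could be fit by Newton's method on $\alpha$, assuming a Gaussian (RBF) kernel. The function names, the simulated data, and the numerical safeguards below are illustrative assumptions, not part of the solution above.

## Sketch: regularized kernel logistic regression (binary y in {0,1}),
## fit by Newton's method on alpha. Gaussian kernel and toy data are assumptions.
rbf_kernel <- function(X, Z, gamma = 1) {
  d2 <- outer(rowSums(X^2), rowSums(Z^2), "+") - 2 * X %*% t(Z)
  exp(-gamma * d2)
}

klr_fit <- function(X, y, lambda = 1e-2, gamma = 1, maxit = 50, tol = 1e-8) {
  n <- nrow(X)
  K <- rbf_kernel(X, X, gamma)
  alpha <- rep(0, n)
  for (it in seq_len(maxit)) {
    f <- as.vector(K %*% alpha)              # f(x_i) = e_i' K alpha
    p <- 1 / (1 + exp(-f))                   # p_i(x_i)
    grad <- as.vector(K %*% (p - y)) + lambda * as.vector(K %*% alpha)
    W <- p * (1 - p)
    H <- K %*% (W * K) + lambda * K + 1e-8 * diag(n)   # K diag(W) K + lambda K, plus jitter
    step <- solve(H, grad)
    alpha <- alpha - step
    if (sqrt(sum(step^2)) < tol) break
  }
  list(alpha = alpha, X = X, gamma = gamma)
}

klr_predict <- function(fit, Xnew) {
  f <- rbf_kernel(Xnew, fit$X, fit$gamma) %*% fit$alpha
  1 / (1 + exp(-f))                          # predicted probabilities
}

## Illustrative use on simulated data:
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
y <- as.numeric(X[, 1]^2 + X[, 2]^2 > 1)
fit <- klr_fit(X, y, lambda = 0.01, gamma = 1)
mean((as.vector(klr_predict(fit, X)) > 0.5) == y)   # training accuracy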

Kernel Discriminant Analysis:

It may be useful to first read the problem dealing with FDA below to review non-kernel FDA and to see the notation defined. Throughout this problem,

$$\phi(\cdot) = \begin{pmatrix} \phi_1(\cdot) & \phi_2(\cdot) & \dots & \phi_n(\cdot) \end{pmatrix}$$

is the projection into the RKHS generated by the kernel function $k(\cdot, \cdot)$. Quantities with superscript $K$ are RKHS-projected analogues of the corresponding quantities for non-kernel FDA.

FDA turns out to be the easiest linear discriminant problem to kernelize, resulting in Kernelized Fisher's Discriminant ("KFD").

This derivation follows the original paper [MRW+99] and is essentially the same as that given on Wikipedia. A more detailed discussion, with references and efficient fitting techniques, may be found in [SS01, Chapter 15].

LDA and its variants satisfy the requirements of the representer theorem [SHS01, Theorem 1], so we know our solution will be of the form

$$f(\cdot) \propto \sum_{i=1}^{N} \alpha_i k(\cdot, x_i)$$

for the chosen kernel $k(\cdot, \cdot)$. In particular, because the discriminant function for FDA is of the form $w^T x^*$ for new data $x^*$ and fixed $w$, we see that $w$ will be of the form

$$w^K = \sum_{i=1}^{N} \alpha_i \phi(x_i).$$

To find the discriminant vectors, we apply the kernel trick to our data:

$$\mu^K_i = \frac{1}{n_i} \sum_{j:\, y_j = i} \phi(x_j)$$


As with kernel regression, calculation of $\phi(x_j)$ is impossible (or at least rather impractical), but we can avoid it by only looking at inner products:

$$(w^K)^T \mu^K_i = \frac{1}{n_i} \sum_{j=1}^{N} \sum_{k:\, y_k = i} \alpha_j k(x_j, x_k) = \alpha^T M^K_i$$

Hence, the numerator of the FDA problem becomes:

$$w^T \Sigma_B w = w^T (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T w$$
$$\implies (w^K)^T \Sigma^K_B w^K = (w^K)^T (\mu^K_2 - \mu^K_1)(\mu^K_2 - \mu^K_1)^T w^K = \alpha^T \Sigma^K_B \alpha$$

Kernelizing the denominator directly is a bit tricky, but if we recall that

$$\Sigma_W = \Sigma_T - \Sigma_B$$

it will suffice to kernelize $w^T \Sigma_T w$. A bit of algebra then gives us

$$w^T \Sigma_W w \implies \alpha^T \Sigma^K_W \alpha$$

where

$$\Sigma^K_W = \sum_{j=1}^{2} K_j (I - n_j^{-1} \mathbf{1}) K_j^T$$

where $K_j$ is the $N \times n_j$ matrix of kernel evaluations between all observations and the observations in class $j$, $I$ is the identity, and $\mathbf{1}$ is the matrix of all ones.

We now have a kernelized objective function:

$$J^K(\alpha) = \frac{\alpha^T \Sigma^K_B \alpha}{\alpha^T \Sigma^K_W \alpha}$$

As before, the solution to this is

$$\alpha = (\Sigma^K_W)^{-1}(M^K_2 - M^K_1)$$

If $\Sigma^K_W$ is not invertible (which it won't be for most kernels), we can apply Tikhonov regularization (a ridge-style penalty) to obtain the solution

$$\alpha = (\Sigma^K_W + \varepsilon I)^{-1}(M^K_2 - M^K_1)$$

for some small $\varepsilon$.

Hence our decision function for a new data point is of the form:

$$f(x^*) = (w^K)^T \phi(x^*) = \sum_{i=1}^{N} \alpha_i \phi(x_i)^T \phi(x^*) = \sum_{i=1}^{N} \alpha_i k(x_i, x^*)$$

where

$$\alpha = (\Sigma^K_W + \varepsilon I)^{-1}(M^K_2 - M^K_1) = \left( \varepsilon I + \sum_{j=1,2} K_j (I - n_j^{-1} \mathbf{1}) K_j^T \right)^{-1} (M^K_2 - M^K_1)$$


and

$$M^K_{ij} = \frac{1}{n_j} \sum_{k:\, y_k = j} k(x_i, x_k),$$

i.e., the $i$th entry of $M^K_j$ averages the kernel evaluations between $x_i$ and the observations in class $j$, so that $(w^K)^T \mu^K_j = \alpha^T M^K_j$ as above.
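For concreteness, a minimal two-class sketch of this KFD fit, reusing the rbf_kernel helper from the kernel logistic regression sketch above. The class labels in {1, 2}, the ridge parameter eps, and the midpoint threshold between the projected class means are illustrative assumptions, not part of the derivation.

## Sketch: two-class kernel Fisher discriminant following the formulas above.
kfd_fit <- function(X, y, gamma = 1, eps = 1e-3) {    # y in {1, 2}
  N <- nrow(X)
  K <- rbf_kernel(X, X, gamma)                        # N x N Gram matrix
  Sw <- matrix(0, N, N)
  M <- matrix(0, N, 2)
  for (j in 1:2) {
    Kj <- K[, y == j, drop = FALSE]                   # N x n_j block for class j
    nj <- sum(y == j)
    M[, j] <- rowMeans(Kj)                            # M^K_j
    Sw <- Sw + Kj %*% (diag(nj) - matrix(1 / nj, nj, nj)) %*% t(Kj)   # Sigma^K_W
  }
  alpha <- solve(Sw + eps * diag(N), M[, 2] - M[, 1]) # ridge-regularized solution
  thresh <- 0.5 * (sum(alpha * M[, 1]) + sum(alpha * M[, 2]))  # midpoint of projected means
  list(alpha = alpha, X = X, gamma = gamma, thresh = thresh)
}

kfd_predict <- function(fit, Xnew) {
  f <- rbf_kernel(Xnew, fit$X, fit$gamma) %*% fit$alpha   # f(x*) = sum_i alpha_i k(x_i, x*)
  ifelse(f > fit$thresh, 2, 1)
}
## e.g. fit <- kfd_fit(X, y); table(kfd_predict(fit, X), y)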

References

[MRW+99] Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, and Klaus-Robert Müller. Fisher discriminant analysis with kernels. In Yu-Hen Hu, Jan Larsen, Elizabeth Wilson, and Scott Douglas, editors, Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Workshop, pages 41–48. IEEE, 1999.

[SHS01] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In David Helmbold and Bob Williamson, editors, 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Lecture Notes in Artificial Intelligence, pages 416–426. Springer, 2001.

[SS01] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. The MIT Press, 1st edition, 2001.


HW3
Xuyan Lu

Data Analysis

1. Analysis pipeline

First, randomly divide the data into three sets: randomly pick 60% of the data as the training set, 20% of the remaining data as the query set, and the last 20% as the test set. Then we use different classifiers to fit models on the training set. If a model has tuning parameters, we use cross-validation to choose the best tuning parameters. After we have the different models, we use the query set to evaluate them (by prediction accuracy on the query set), so we can select the best classifier based on their performance on the query set. Finally, we estimate the misclassification rate of the best classifier on the test set.

## split data
index <- sample(1:2310, 1386)
train <- mydata[index, ]
query.test <- mydata[-index, ]
index.2 <- sample(1:924, 462)
query <- query.test[index.2, ]
test <- query.test[-index.2, ]
## no. of train = 1386, no. of query = 462, no. of test = 462

2. Cross-validation

For this question, I will use the following kernels: the Linear Kernel, $u^T v$, with tuning parameter cost; the Radial Basis Kernel, $e^{-\gamma \|u - v\|^2}$, with tuning parameters cost and $\gamma$; and the Polynomial Kernel, $(u^T v + 1)^d$, with tuning parameters cost and $d$ (degree).
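As a quick numerical illustration of these three kernel functions (made-up vectors and arbitrary parameter values, not part of the analysis; cost is an SVM tuning parameter, not a kernel parameter):

u <- c(1, 2, 3); v <- c(2, 0, 1)
gamma <- 0.5; d <- 2
c(linear     = sum(u * v),                       # u'v
  radial     = exp(-gamma * sum((u - v)^2)),     # e^{-gamma ||u - v||^2}
  polynomial = (sum(u * v) + 1)^d)               # (u'v + 1)^d, i.e. coef0 = 1

In e1071's parameterization (following libsvm), the polynomial kernel is (gamma * u'v + coef0)^degree, so setting gamma = 1 and coef0 = 1, as done later, reproduces the form above.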

(a).

#### k-fold cross-validation
k.fold.cv <- function(data, k = 10, gamma = 1, cost = 1, degree = 3, coef0 = 1, kernel = "radial") {
  ### 1st column of data is the response and its colname is "Response".
  require("e1071")
  performance <- numeric(k)
  fold.length <- floor(length(data[, 1]) / k)
  for (i in 1:k) {
    if (i < k) {
      index <- c((fold.length * (i - 1) + 1):(fold.length * i))
      train <- data[-index, ]
      cv <- data[index, ]
      fit <- svm(Response ~ ., train, gamma = gamma, cost = cost, degree = degree,
                 kernel = kernel, coef0 = coef0)
      pre <- predict(fit, newdata = cv[, -1])
      performance[i] <- sum(pre != cv[, 1]) / length(cv[, 1])
    }
    if (i == k) {
      index <- c((fold.length * (i - 1) + 1):length(data[, 1]))
      train <- data[-index, ]
      cv <- data[index, ]
      fit <- svm(Response ~ ., train, gamma = gamma, cost = cost, degree = degree,
                 kernel = kernel, coef0 = coef0)
      pre <- predict(fit, newdata = cv[, -1])
      performance[i] <- sum(pre != cv[, 1]) / length(cv[, 1])
    }
  }
  return(performance)
}

# mean(k.fold.cv(train, k = 10, gamma = 0.01831564, cost = 54.59815, kernel = "radial"))

#### tuning parameters
cv.svm <- function(data, k, gamma.seq = 1, cost.seq = 1, degree.seq = 3, kernel = "radial") {
  require("e1071")
  if (kernel == "radial") {
    performance <- matrix(0, nrow = length(gamma.seq) * length(cost.seq), ncol = 3)
    colnames(performance) <- c("gamma", "cost", "error")
    n <- 1
    for (i in cost.seq) {
      for (j in gamma.seq) {
        mean.performace <- mean(k.fold.cv(data, k, gamma = j, cost = i))
        performance[n, ] <- c(j, i, mean.performace)
        n <- n + 1
      }
    }
  }
  if (kernel == "polynomial") {
    performance <- matrix(0, nrow = length(degree.seq) * length(cost.seq), ncol = 3)
    colnames(performance) <- c("degree", "cost", "error")
    n <- 1
    for (i in cost.seq) {
      for (j in degree.seq) {
        mean.performace <- mean(k.fold.cv(data, k, degree = j, cost = i, kernel = "polynomial"))
        performance[n, ] <- c(j, i, mean.performace)
        n <- n + 1
      }
    }
  }
  if (kernel == "linear") {
    performance <- matrix(0, nrow = length(cost.seq), ncol = 2)
    colnames(performance) <- c("cost", "error")
    n <- 1
    for (i in cost.seq) {
      mean.performace <- mean(k.fold.cv(data, k, cost = i, kernel = "linear"))
      performance[n, ] <- c(i, mean.performace)
      n <- n + 1
    }
  }
  return(performance)
}

The results differ between the min-rule and the 1-SE-rule. The 1-SE-rule is used to select a more stable (less complex) model. In this problem I choose the min-rule, because I did not use many grid points and the intervals between them are large (to save running time), so the standard errors are large and the 1-SE-rule does not perform well.
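For reference, a sketch of how the 1-SE-rule could be applied here for the linear kernel, reusing k.fold.cv above. The helper name one.se.linear and the choice to treat a smaller cost as "less complex" are my own assumptions.

one.se.linear <- function(data, k = 10, cost.seq = 20:50) {
  errs <- sapply(cost.seq, function(cc) k.fold.cv(data, k = k, cost = cc, kernel = "linear"))
  m  <- colMeans(errs)                        # mean CV error per cost
  se <- apply(errs, 2, sd) / sqrt(k)          # standard error per cost
  best <- which.min(m)
  cutoff <- m[best] + se[best]
  cost.seq[min(which(m <= cutoff))]           # smallest cost within 1 SE of the minimum
}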

For the Linear Kernel:

perf.linear <- cv.svm(train, k = 10, cost.seq = 20:50, kernel = "linear")
perf.linear[which.min(perf.linear[, 2]), ]

##       cost      error
## 30.0000000  0.0468599

I used the cost sequence c(20:50). Using the min-rule we selected: cost = 30.

For the Radial Basis Kernel:

perf.radial <- cv.svm(train, k = 10, gamma.seq = exp(-5:5), cost.seq = exp(-5:5), kernel = "radial")
perf.radial[which.min(perf.radial[, 3]), ]

##      gamma        cost       error
## 0.01831564 54.59815003  0.03605072

I used gamma and cost sequences $\{e^i \mid i = -5, \dots, 5\}$. Using the min-rule we selected: cost = 54.6, gamma = 0.018.

For the Polynomial Kernel:

perf.poly <- cv.svm(train, k = 10, degree.seq = 1:5, cost.seq = exp(-5:5), kernel = "polynomial")
perf.poly[which.min(perf.poly[, 3]), ]

##     degree       cost      error
## 2.00000000 1.00000000 0.03318237

I used the degree sequence c(1:5) and cost sequence $\{e^i \mid i = -5, \dots, 5\}$. Using the min-rule we selected: degree = 2, cost = 1.

(b).

For this part, I randomly assign 1108 of the 1386 training observations to a new training set, assign the remaining 278 to a validation set, and repeat this k times.

#### repeated cv
rep.cv <- function(data, k = 10, gamma = 1, cost = 1, degree = 3, coef0 = 1, kernel = "radial") {
  ### 1st column of data is the response and its colname is "Response".
  require("e1071")
  performance <- numeric(k)
  for (i in 1:k) {
    index <- sample(1:1386, 1108)
    train <- data[index, ]
    cv <- data[-index, ]
    fit <- svm(Response ~ ., train, gamma = gamma, cost = cost, degree = degree,
               kernel = kernel, coef0 = coef0)
    pre <- predict(fit, newdata = cv[, -1])
    performance[i] <- sum(pre != cv[, 1]) / length(cv[, 1])
  }
  return(performance)
}

#### tuning parameters
rep.svm <- function(data, k, gamma.seq = 1, cost.seq = 1, degree.seq = 3, kernel = "radial") {
  require("e1071")
  if (kernel == "radial") {
    performance <- matrix(0, nrow = length(gamma.seq) * length(cost.seq), ncol = 3)
    colnames(performance) <- c("gamma", "cost", "error")
    n <- 1
    for (i in cost.seq) {
      for (j in gamma.seq) {
        mean.performace <- mean(rep.cv(data, k, gamma = j, cost = i))
        performance[n, ] <- c(j, i, mean.performace)
        n <- n + 1
      }
    }
  }
  if (kernel == "polynomial") {
    performance <- matrix(0, nrow = length(degree.seq) * length(cost.seq), ncol = 3)
    colnames(performance) <- c("degree", "cost", "error")
    n <- 1
    for (i in cost.seq) {
      for (j in degree.seq) {
        mean.performace <- mean(rep.cv(data, k, degree = j, cost = i, kernel = "polynomial"))
        performance[n, ] <- c(j, i, mean.performace)
        n <- n + 1
      }
    }
  }
  if (kernel == "linear") {
    performance <- matrix(0, nrow = length(cost.seq), ncol = 2)
    colnames(performance) <- c("cost", "error")
    n <- 1
    for (i in cost.seq) {
      mean.performace <- mean(rep.cv(data, k, cost = i, kernel = "linear"))
      performance[n, ] <- c(i, mean.performace)
      n <- n + 1
    }
  }
  return(performance)
}

For the Linear Kernel:

rep.linear <- rep.svm(train, k = 10, cost.seq = c(20:50), kernel = "linear")
rep.linear[which.min(rep.linear[, 2]), ]

##        cost       error
## 35.00000000  0.04208633

I used the cost sequence c(20:50). Using the min-rule we selected: cost = 35.

For the Radial Basis Kernel:

rep.radial <- rep.svm(train, k = 10, gamma.seq = exp(-5:5), cost.seq = exp(-5:5), kernel = "radial")
# min-rule
rep.radial[which.min(rep.radial[, 3]), ]

##      gamma        cost       error
## 0.04978707 20.08553692  0.03237410

I used gamma and cost sequences $\{e^i \mid i = -5, \dots, 5\}$. Using the min-rule we selected: cost = 20, gamma = 0.05.


For the Polynomial Kernel:

rep.poly <- rep.svm(train, k = 10, degree.seq = 1:5, cost.seq = exp(-5:5), kernel = "polynomial")
rep.poly[which.min(rep.poly[, 3]), ]

##     degree       cost      error
## 2.00000000 1.00000000 0.03381295

I used the degree sequence c(1:5) and cost sequence $\{e^i \mid i = -5, \dots, 5\}$. Using the min-rule we selected: degree = 2, cost = 1.

(c).

## kernel = linear, cost = 20:50
f3 <- tune.svm(Response ~ ., data = train, cost = 20:50, kernel = "linear")
tune.linear <- f3$best.parameters
## kernel = radial, gamma = exp(-5:5), cost = exp(-5:5)
f1 <- tune.svm(Response ~ ., data = train, gamma = exp(-5:5), cost = exp(-5:5), kernel = "radial")
tune.radial <- f1$best.parameters
## kernel = polynomial, degree = 1:5, cost = exp(-5:5), coef0 = 1, gamma = 1
f2 <- tune.svm(Response ~ ., data = train, degree = 1:5, cost = exp(-5:5), coef0 = 1, gamma = 1, kernel = "polynomial")
tune.poly <- f2$best.parameters

Linear Kernel:

a.linear <- t(matrix(c(30, 35, 32), byrow = F))
colnames(a.linear) <- c("10-fold CV", "Repeated CV", "Built-in Function")
rownames(a.linear) <- "cost"
knitr::kable(a.linear)

       10-fold CV   Repeated CV   Built-in Function
cost           30            35                  32

Radial Basis Kernel:

a.radial <- data.frame(c(0.018, 54.6), c(0.018, 54.6), c(0.05, 20))
colnames(a.radial) <- c("10-fold CV", "Repeated CV", "Built-in Function")
rownames(a.radial) <- c("gamma", "cost")
knitr::kable(a.radial)

        10-fold CV   Repeated CV   Built-in Function
gamma        0.018         0.018                0.05
cost        54.600        54.600               20.00

Polynomial Kernel:

a.poly <- data.frame(c(2, 1), c(2, 1), c(2, 1))
colnames(a.poly) <- c("10-fold CV", "Repeated CV", "Built-in Function")
rownames(a.poly) <- c("degree", "cost")
knitr::kable(a.poly)

         10-fold CV   Repeated CV   Built-in Function
degree            2             2                   2
cost              1             1                   1


From the tables above we can see that the parameters selected by k-fold CV are closer to the parameters selected by the built-in function tune.svm.

(d). K-fold CV is better than the repeated random-split CV here. When we do k-fold CV, every data point in the training set is used in a validation fold exactly once; with repeated random splits we cannot guarantee that every data point is ever used for validation. So the parameters selected by k-fold CV tend to be less biased than those selected by repeated CV. The difference is illustrated in the sketch below.
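A minimal sketch contrasting the two schemes on made-up indices (illustrative only, not the analysis code above): with k-fold CV every index is held out exactly once, while with repeated random splits some indices may never be held out and others may be held out several times.

n <- 20; k <- 5
set.seed(2)
folds <- sample(rep(1:k, length.out = n))                # k-fold assignment
heldout.kfold <- unlist(lapply(1:k, function(i) which(folds == i)))
table(tabulate(heldout.kfold, nbins = n))                # every index held out exactly once

splits <- replicate(k, sample(1:n, n - n / k))           # k repeated random training sets
heldout.rep <- unlist(apply(splits, 2, function(s) setdiff(1:n, s)))
table(tabulate(heldout.rep, nbins = n))                  # counts vary; some indices may be 0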

3. Compare and contrast results

library(e1071)     # naiveBayes, svm
library(MASS)      # lda
library(glmnet)    # cv.glmnet

## Naive Bayes
model.nb <- naiveBayes(Response ~ ., train)
pre.nb <- predict(model.nb, newdata = query[, -1])
error.nb <- sum(pre.nb != query[, 1]) / length(query[, 1])
## LDA
lda.train <- train[, -c(10, 14, 15, 16)]
lda.query <- query[, -c(10, 14, 15, 16)]
model.lda <- lda(Response ~ ., lda.train)
pre.lda <- predict(model.lda, newdata = lda.query[, -1])$class
error.lda <- sum(pre.lda != query[, 1]) / length(query[, 1])
## Multinomial
model.multi <- cv.glmnet(as.matrix(train[, -1]), train[, 1], family = "multinomial")
pre.multi <- predict(model.multi, newx = as.matrix(query[, -1]),
                     s = "lambda.1se", type = "class")
error.multi <- sum(pre.multi != query[, 1]) / length(query[, 1])
## SVM Linear Kernel
model.svm.linear <- svm(Response ~ ., train, cost = 32, kernel = "linear")
pre.svm.linear <- predict(model.svm.linear, newdata = query[, -1])
error.svm.linear <- sum(pre.svm.linear != query[, 1]) / length(query[, 1])
## SVM Radial Basis Kernel
model.svm.radial <- svm(Response ~ ., train, cost = 55, gamma = 0.018, kernel = "radial")
pre.svm.radial <- predict(model.svm.radial, newdata = query[, -1])
error.svm.radial <- sum(pre.svm.radial != query[, 1]) / length(query[, 1])
## SVM Polynomial Kernel
model.svm.poly <- svm(Response ~ ., train, degree = 3, cost = 0.05,
                      gamma = 1, coef0 = 1, kernel = "polynomial")
pre.svm.poly <- predict(model.svm.poly, newdata = query[, -1])
error.svm.poly <- sum(pre.svm.poly != query[, 1]) / length(query[, 1])
##
error.compare <- matrix(c(error.nb, error.lda, error.multi, error.svm.linear,
                          error.svm.radial, error.svm.poly), ncol = 1)
rownames(error.compare) <- c("Naive Bayes", "LDA", "Multinomial", "SVM(Linear)",
                             "SVM(Radial)", "SVM(Polynomial)")
colnames(error.compare) <- "Query Error Rate"
knitr::kable(error.compare)

                  Query Error Rate
Naive Bayes              0.2034632
LDA                      0.0952381
Multinomial              0.0476190
SVM(Linear)              0.0411255
SVM(Radial)              0.0324675
SVM(Polynomial)          0.0389610

From the above table we can see that multinomial regression and the SVMs give better results, and among these methods the SVMs with non-linear kernels perform best.

When fitting the LDA model, there was a warning: "Variables are collinear". I found that intensity-mean, exred-mean, exblue-mean, and exgreen-mean can be calculated from rawred-mean, rawblue-mean, and rawgreen-mean, so I deleted these variables.

For the multinomial regression, I used the L1 norm as the penalty term. I use regularization to keep the model from overfitting the training set, and I use the L1 penalty because it yields a sparse solution for the coefficients (variable selection).
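As a sketch of how the induced sparsity could be inspected on the fitted model.multi above (illustrative; the counts depend on the fit, and the "- 1" assumes a nonzero intercept in each class):

coef.list <- coef(model.multi, s = "lambda.1se")      # one sparse coefficient matrix per class
sapply(coef.list, function(b) sum(b != 0) - 1)        # nonzero predictors per class, excluding intercept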

To extend SVMs to multi-class data, I used the one-vs-one method, because one-vs-one is usually more accurate than one-vs-all, and with only 7 classes here the computational cost is affordable. A sketch of the one-vs-one voting scheme is given below.
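For reference, a minimal sketch of one-vs-one voting built from binary SVMs. The helper name ovo_svm is illustrative; e1071's svm already performs one-vs-one internally for multi-class factors, so this is only to show the idea.

ovo_svm <- function(train, test, ...) {
  classes <- levels(train$Response)
  pairs <- combn(classes, 2, simplify = FALSE)
  votes <- matrix(0, nrow(test), length(classes), dimnames = list(NULL, classes))
  for (pr in pairs) {
    sub <- droplevels(train[train$Response %in% pr, ])
    fit <- svm(Response ~ ., sub, ...)                 # binary SVM for this pair of classes
    pred <- as.character(predict(fit, newdata = test[, -1]))
    for (cl in pr) votes[, cl] <- votes[, cl] + (pred == cl)
  }
  factor(colnames(votes)[max.col(votes)], levels = classes)  # majority vote over pairs
}
## e.g. pre.ovo <- ovo_svm(train, query, kernel = "radial", cost = 55, gamma = 0.018)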

4. Visualization

I use PCA for the visualization.

library(ggplot2)
pca <- princomp(mydata[, -1])
score <- pca$scores
ggplot() + aes(x = score[, 1], y = score[, 2], color = mydata[, 1]) + geom_point()

[Figure: scatter plot of the first two principal component scores, score[, 1] (x-axis) vs. score[, 2] (y-axis), with points colored by class mydata[, 1]: BRICKFACE, CEMENT, FOLIAGE, GRASS, PATH, SKY, WINDOW.]


I plot the first two principal component scores, and from the plot we can see that there is an obvious boundary between SKY and the other classes. This means that, on these data, the class SKY can be perfectly separated from the other classes by a hyperplane. A quick check of this claim is sketched below.
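A small sketch of how this could be checked (an illustration using the score and mydata objects above, not part of the original analysis): fit a linear SVM separating SKY from the rest on the two leading principal scores and look at its training error, which should be near 0 if SKY is (almost) linearly separable in this plane.

sky <- factor(ifelse(mydata[, 1] == "SKY", "SKY", "OTHER"))
pc2 <- data.frame(sky = sky, pc1 = score[, 1], pc2 = score[, 2])
fit.sky <- svm(sky ~ pc1 + pc2, pc2, kernel = "linear", cost = 10)
mean(predict(fit.sky, newdata = pc2) != pc2$sky)    # training error of the linear separator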

5. Reflection

knitr::kable(error.compare)

                  Query Error Rate
Naive Bayes              0.2034632
LDA                      0.0952381
Multinomial              0.0476190
SVM(Linear)              0.0411255
SVM(Radial)              0.0324675
SVM(Polynomial)          0.0389610

test.svm.poly <- predict(model.svm.poly, newdata = test[, -1])
test.error.svm.poly <- sum(test.svm.poly != test[, 1]) / length(test[, 1])
print(test.error.svm.poly)

## [1] 0.03030303

From the above table we can see that SVMs with non-linear kernels perform better than the other methods. The choice of kernel does not influence the error much; the Radial Basis kernel performs slightly better than the Polynomial kernel on the query set. Kernel SVMs perform best because they have non-linear boundaries, and the true boundary for this problem may be non-linear. The misclassification error estimated on the test set (computed above for the polynomial-kernel SVM) is 0.03030303.

The confusion matrices (on the query set):

Naive Bayes

##             true
## pred        BRICKFACE CEMENT FOLIAGE GRASS PATH SKY WINDOW
##   BRICKFACE        67      4       0     0    0   0     14
##   CEMENT            1     65       0     0    2   1      8
##   FOLIAGE           0      3      12     0    0   0      2
##   GRASS             0      0       0    60    0   0      0
##   PATH              0      0       0     0   56   0      0
##   SKY               0      0       1     0    0  64      0
##   WINDOW            0      4      53     1    0   0     44

LDA

##             true
## pred        BRICKFACE CEMENT FOLIAGE GRASS PATH SKY WINDOW
##   BRICKFACE        65      0       0     0    0   0      0
##   CEMENT            0     63       0     0    0   0      4
##   FOLIAGE           0      0      63     0    0   0     21
##   GRASS             0      0       0    61    0   0      0
##   PATH              0      4       0     0   58   0      0
##   SKY               0      0       0     0    0  65      0
##   WINDOW            3      9       3     0    0   0     43


Multinomial Regression

##             true
## pred        BRICKFACE CEMENT FOLIAGE GRASS PATH SKY WINDOW
##   BRICKFACE        66      0       0     0    0   0      0
##   CEMENT            0     69       0     0    0   0      1
##   FOLIAGE           0      0      59     0    0   0      5
##   GRASS             0      0       0    61    0   0      0
##   PATH              0      0       0     0   58   0      0
##   SKY               0      0       0     0    0  65      0
##   WINDOW            2      7       7     0    0   0     62

SVM(Linear)

##             true
## pred        BRICKFACE CEMENT FOLIAGE GRASS PATH SKY WINDOW
##   BRICKFACE        66      0       0     0    0   0      0
##   CEMENT            0     72       0     0    0   0      3
##   FOLIAGE           0      3      63     0    0   0      7
##   GRASS             0      0       0    61    0   0      0
##   PATH              0      0       0     0   58   0      0
##   SKY               0      0       0     0    0  65      0
##   WINDOW            2      1       3     0    0   0     58

SVM(Radial)

##             true
## pred        BRICKFACE CEMENT FOLIAGE GRASS PATH SKY WINDOW
##   BRICKFACE        65      0       0     0    0   0      0
##   CEMENT            0     75       0     0    0   0      1
##   FOLIAGE           0      0      63     0    0   0      7
##   GRASS             0      0       0    61    0   0      0
##   PATH              0      0       0     0   58   0      0
##   SKY               0      0       0     0    0  65      0
##   WINDOW            3      1       3     0    0   0     60

SVM(Polynomial)

##             true
## pred        BRICKFACE CEMENT FOLIAGE GRASS PATH SKY WINDOW
##   BRICKFACE        66      2       0     0    0   0      0
##   CEMENT            1     71       2     0    1   0      1
##   FOLIAGE           0      0      61     0    0   0      4
##   GRASS             0      0       0    61    0   0      0
##   PATH              0      1       1     0   57   0      0
##   SKY               0      0       0     0    0  65      0
##   WINDOW            1      2       2     0    0   0     63

The most misclassified classes are WINDOW and FOLIAGE; the reason may be that they look similar in the images (they have something in common in appearance).

Conceptual Problems

1. This is not an unbiased estimate of the prediction error. The analyst used all of the data to select the optimal parameters, so when he tries to estimate the prediction error he is not using completely new data (i.e., when tuning the parameters, the analyst already used information from his test set). My suggestion is: split the data into two sets, a 70% training set and a 30% test set. Use the training set to do 10-fold cross-validation to choose the parameters, then use all of the data in the training set and the selected parameters to fit the model, and finally use the test set to report the prediction error. A sketch of this pipeline follows.
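A minimal sketch of the suggested pipeline, reusing the cv.svm tuning helper from above as a stand-in model; the data frame name dat (with the response in column 1, named "Response") is a hypothetical placeholder:

n <- nrow(dat)
idx <- sample(1:n, round(0.7 * n))
tr <- dat[idx, ]; te <- dat[-idx, ]
perf <- cv.svm(tr, k = 10, gamma.seq = exp(-5:5), cost.seq = exp(-5:5), kernel = "radial")
best <- perf[which.min(perf[, 3]), ]                               # tuned on training data only
fit <- svm(Response ~ ., tr, kernel = "radial", gamma = best["gamma"], cost = best["cost"])
mean(predict(fit, newdata = te[, -1]) != te[, 1])                  # reported prediction error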

2. (a). Almost all individuals are classified as "no" because most of the training data (92.49%) have a negative ("no") response; if the classifier always predicts "no" (never predicts "yes"), the training accuracy is still high: 92.49%. If the 62 "yes" responses were assigned evenly across the K folds, the CV error would not be much higher than the training error. The situation in this question may be caused by an uneven assignment of the 62 "yes" responses. For instance, if the marketer performed 5-fold CV and all 62 "yes" responses were assigned to the same fold, with the other 4 folds containing only "no" responses, then the CV error would be much higher than the training error.

(b). I recommend he use the precision, $\text{precision} = \frac{\text{Number of True Positives}}{\text{Number of Predicted Positives}}$, to evaluate his model. Alternatively, he can collect more data and use only part of the "no" responses to make the proportions of "yes" and "no" responses closer (more balanced).

(c). Collect more data and make the "yes" and "no" responses more balanced. After we have the balanced data, divide it into two sets: randomly choose 70% of the data as the training set and the remaining 30% as the test set. Use 10-fold cross-validation on the training set to select the best regularization parameter $\lambda$. Use all the data in the training set and the selected $\lambda$ to fit a regularized logistic regression model, and use the fitted model to estimate the misclassification error on the test set. A short sketch of computing the precision from part (b) is given below.
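A sketch of the precision calculation from part (b), using made-up predictions rather than the marketer's data:

pred  <- factor(c("yes", "no", "yes", "no",  "no"), levels = c("no", "yes"))
truth <- factor(c("yes", "no", "no",  "yes", "no"), levels = c("no", "yes"))
cm <- table(pred = pred, truth = truth)
precision <- cm["yes", "yes"] / sum(cm["yes", ])    # true positives / predicted positives
precision
## [1] 0.5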
