Efficient Transfer Learning Method for Automatic Hyperparameter Tuning

Dani Yogatama∗
Carnegie Mellon University
Pittsburgh, PA 15213, USA
[email protected]

Gideon Mann
Google Research
New York, NY 10011, USA
[email protected]

∗ Work done while at Google Research.

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

Abstract

We propose a fast and effective algorithm for automatic hyperparameter tuning that can generalize across datasets. Our method is an instance of sequential model-based optimization (SMBO) that transfers information by constructing a common response surface for all datasets, similar to Bardenet et al. (2013). The time complexity of reconstructing the response surface at every SMBO iteration in our method is linear in the number of trials (significantly less than previous work with comparable performance), allowing the method to realistically scale to many more datasets. Specifically, we use deviations from the per-dataset mean as the response values. We empirically show the superiority of our method on a large number of synthetic and real-world datasets for tuning hyperparameters of logistic regression and ensembles of classifiers.

1 Introduction

In order to get a machine learning algorithm to work well in practice, there are often a number of hyperparameters that need to be tuned when training the model, and the best hyperparameter setting is often different for different datasets. Users are also faced with non-trivial choices of which machine learning algorithm to use for the problem at hand. Machine learning experts can draw on their expertise and find a reasonable setting for a new dataset relatively quickly, although for a large dataset it is still imperative to get to the best setting with as few trials as possible. More importantly, end users without much background in machine learning often struggle to tune these hyperparameters.

Over the past few years, the machine learning community has increased exploration into ways to simplify hyperparameter tuning (Bergstra et al., 2011; Snoek et al., 2012; Thornton et al., 2013). Surrogate-based model optimization is one promising approach that has been shown to work well provided that we can run enough trials for a given dataset. However, there are many practical scenarios that require an algorithm to return a good hyperparameter setting with very few trials, often less than ten. For example, when providing a machine learning service, it is unattractive in terms of user wait time and provider resource time to run a large number of trials for each user.

One way to speed up the search process is by transferring information from previous trials on other datasets, as has been explored in Bardenet et al. (2013). As noted in their work, a key challenge is that evaluation metrics (e.g., accuracies) are not comparable across datasets. For example, a simple algorithm for a highly skewed classification task with 90 percent of the instances from one class can yield 90 percent accuracy, whereas the same algorithm on a harder task with a more balanced class distribution might only yield 60 percent accuracy. However, while the accuracies of the best models may be different in these cases, the region of a good hyperparameter setting may still be the same. Brochu et al. (2010) consider a similar setup of learning from multiple sessions in the context of procedural animation design.

The method proposed by Bardenet et al. (2013) is based on a ranking surrogate to approximate the response surface. One drawback of this approach is that constructing the response surface requires running a ranker inside the outer loop of sequential model-based optimization. The time complexity of constructing the response surface is then equal to running a surrogate-based ranking algorithm (e.g., SVM Rank (Joachims, 2002), GP-based ranking (Chu & Ghahramani, 2005)) on at most D Choose(T, 2) points, where D is the number of datasets and T is the number of trials for each dataset. For example, the time complexity of learning SVM Rank with a linear kernel and D Choose(T, 2) points is the number of iterations times O(DT^2) with the cutting-plane algorithm (Joachims et al., 2009). As the number of stored datasets increases, reconstructing the response surface may add significantly to the time per trial.

In this work, we propose a novel method of transferring knowledge from previous experiments using deviations from the per-dataset mean. Our method is simple and cheap, requiring only O(T) time to reconstruct the response surface, without needing to run a surrogate-based ranking algorithm. We demonstrate our method's superiority in terms of accuracy compared to other methods that do not transfer knowledge from previous experiments, and in terms of actual runtime compared to the method of Bardenet et al. (2013).

2 Background

2.1 Sequential model-based optimization

Sequential model-based optimization (SMBO) is a framework for optimizing an expensive target function by optimizing a surrogate that is cheaper to evaluate (Hutter et al., 2011). There are two components that need to be specified to use SMBO: the evaluation criterion (acquisition function) and the surrogate model S. For the evaluation criterion of SMBO, we can use: Expected Improvement (EI) (Jones, 2001), maximum probability of improvement (Jones, 2001), minimum conditional entropy (Villemonteix et al., 2006), GP upper confidence bound (Srinivas et al., 2010), or a combination of them (Hoffman et al., 2011). In this work, we use Expected Improvement, one of the most commonly used criteria for SMBO, which has been shown to give good empirical results for hyperparameter tuning. EI is defined as:

EI_{y^*}(x, S) = \int_{-\infty}^{\infty} \max(y - y^*, 0) \, p_S(y \mid x) \, dy

where x is the evaluation point, y^* is the optimal value found so far, and p_S is the probability under model S. Similar to other work on automatic hyperparameter tuning (e.g., Bergstra et al. (2011); Snoek et al. (2012); Bardenet et al. (2013), inter alia), we use a Gaussian Process (Rasmussen & Williams, 2006) as the surrogate model S. Algorithm 1 outlines a typical SMBO algorithm.

Algorithm 1 Sequential Model-Based Optimization

Input: number of trials T, target function f
S_0 = initial surrogate model
O_0 = {}
for t = 1 to T do
    x_t = argmax_x EI(x, S_{t-1})
    y_t = evaluate f(x_t)
    O_t = {⟨x_i, y_i⟩}_{i=1}^{t}
    Add y_t and x_t to S_{t-1} to get S_t
end for
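To make the loop concrete, here is a minimal Python sketch of Algorithm 1, using the closed-form EI for a Gaussian posterior and a scikit-learn GP as the surrogate. The toy target function, the box bounds, and the random candidate sampling used to maximize EI are illustrative assumptions, not details from the paper.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, y_best):
    # Closed-form EI for a Gaussian posterior p_S(y | x):
    # EI(x) = (mu - y_best) * Phi(z) + sigma * phi(z), with z = (mu - y_best) / sigma.
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)          # avoid division by zero
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def smbo(f, bounds, num_trials=10, num_candidates=1000, seed=0):
    """Algorithm 1 sketch: maximize f by repeatedly maximizing EI under a GP surrogate."""
    rng = np.random.default_rng(seed)
    dim = len(bounds)
    X, y = [], []
    # First trial: a single starting point (the paper fixes the first trial point).
    X.append(rng.uniform(bounds[:, 0], bounds[:, 1]))
    y.append(f(X[-1]))
    for t in range(1, num_trials):
        gp = GaussianProcessRegressor(normalize_y=True).fit(np.array(X), np.array(y))
        # Maximize EI over random candidates (a grid or gradient-based search could be used instead).
        candidates = rng.uniform(bounds[:, 0], bounds[:, 1], size=(num_candidates, dim))
        ei = expected_improvement(candidates, gp, max(y))
        X.append(candidates[np.argmax(ei)])
        y.append(f(X[-1]))
    best = int(np.argmax(y))
    return X[best], y[best]

if __name__ == "__main__":
    # Toy target: a smooth function of two "hyperparameters".
    f = lambda x: -((x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2)
    bounds = np.array([[0.0, 4.0], [0.0, 4.0]])
    x_best, y_best = smbo(f, bounds, num_trials=15)
    print("best x:", x_best, "best f(x):", y_best)
```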

2.2 Gaussian Process

A Gaussian Process (GP) provides a convenient way to define a distribution over functions. An important characteristic that makes it attractive as a (surrogate) model for sequential model-based optimization is that it satisfies the consistency requirement (Rasmussen & Williams, 2006). The posterior distribution of samples from a GP prior is still a GP. A GP is defined by a mean function m(x) and a covariance function k(x_i, x_j). If we assume that a function f(x) follows a Gaussian Process:

f(x) \sim \mathcal{GP}(m(x), k(x_i, x_j)),

the mean and covariance of the posterior distribution at z (assuming the prior mean m(x) = 0, without loss of generality) after t observations become:

m(z) = k(z, x_{1:t})^\top K^{-1} f(x_{1:t})

k(z, z) = k(z, z) - k(z, x_{1:t})^\top K^{-1} k(z, x_{1:t})

where K is the kernel matrix, f(x_{1:t}) = [f(x_i)]_{i=1}^{t} (i.e., the observed function values stacked to form a column vector), and k(z, x_{1:t}) = [k(z, x_i)]_{i=1}^{t}.
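The two posterior equations can be implemented directly; the sketch below assumes a squared exponential kernel with unit length scales and adds a small jitter term for numerical stability (an implementation detail, not something specified here).

```python
import numpy as np

def sqe_kernel(A, B, lengthscales=1.0):
    # Squared exponential kernel k(a, b) = exp(-0.5 * ||(a - b) / lengthscale||^2).
    diff = (A[:, None, :] - B[None, :, :]) / lengthscales
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_posterior(z, X, f_X, jitter=1e-8):
    """Posterior mean and variance at a single point z given observations (X, f_X)."""
    K = sqe_kernel(X, X) + jitter * np.eye(len(X))    # kernel matrix on observed points
    k_z = sqe_kernel(z[None, :], X)[0]                # k(z, x_{1:t})
    K_inv = np.linalg.inv(K)
    mean = k_z @ K_inv @ f_X                          # m(z) = k(z, x_{1:t})^T K^{-1} f(x_{1:t})
    var = sqe_kernel(z[None, :], z[None, :])[0, 0] - k_z @ K_inv @ k_z
    return mean, var

# Toy usage: three observations of a 1-D function.
X = np.array([[0.0], [1.0], [2.0]])
f_X = np.array([0.1, 0.8, 0.3])
print(gp_posterior(np.array([1.5]), X, f_X))
```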

3 Transfer learning from previous experiments

3.1 Problem Setup

Unlike Bardenet et al. (2013), we consider the streaming setup where we see one dataset at a time, process the dataset, and move to another dataset. We assume that we cannot store the datasets that we have seen in the past, except for high-level descriptive statistics of these datasets (e.g., number of instances, number of features, number of classes, etc.). In practice, this turns out to be a realistic scenario since privacy and memory concerns would prevent a system from storing all prior datasets.

Notationally, dataset index d is always given as a superscript, whereas trial index t is given as a subscript. We assume that we are only allowed to run T trials for each dataset. The goal is to find a good hyperparameter setting x for a dataset d to maximize f^d(x) in as few trials as possible. In addition to dataset statistics, we are also allowed to store pairs of hyperparameter settings and function evaluations resulting from previous experiments (prior datasets and prior trials on the current dataset) to be used by future experiments (future datasets). We outline the main idea of our transfer learning method in §3.2 and discuss it in more detail subsequently.

3.2 Method

At each new dataset d, we fit a surrogate Gaussian Process to a new black-box function f^d, in order to find max_x f^d(x). Instead of modeling f^d(x) directly, we use the deviations from the per-dataset mean as the common response surface values. In our setup, we are then able to directly retain the observations from prior datasets in the surrogate function.

The response value that is used by our GP model is simply:

y_t^d = \frac{f^d(x_t) - \mu^d}{\sigma^d},

where \mu^d and \sigma^d are the dataset d mean and standard deviation, respectively. Since we do not know \mu^d and \sigma^d, we use

\hat{\mu}^d = \frac{1}{t} \sum_{i=1}^{t} f^d(x_i), \qquad \hat{\sigma}^d = \sqrt{\frac{1}{t} \sum_{i=1}^{t} \left( f^d(x_i) - \hat{\mu}^d \right)^2}

based on dataset trial results up to t. Therefore, similar to Bardenet et al. (2013), at each iteration our response surface changes.
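A minimal sketch of this re-standardization, assuming the raw evaluations for the current dataset are kept in a list (the zero-variance guard is an implementation detail, not part of the paper):

```python
import numpy as np

def standardized_responses(raw_values):
    """Recompute y_t^d = (f^d(x_t) - mu_hat^d) / sigma_hat^d for all trials so far.

    raw_values: list of f^d(x_1), ..., f^d(x_t) for the current dataset.
    Runs in O(t), which is what makes the per-iteration reconstruction cheap.
    """
    raw = np.asarray(raw_values, dtype=float)
    mu_hat = raw.mean()
    sigma_hat = raw.std()
    if sigma_hat == 0.0:            # degenerate case, e.g., after a single trial
        return np.zeros_like(raw)
    return (raw - mu_hat) / sigma_hat

# Example: accuracies observed so far on one dataset.
print(standardized_responses([0.90, 0.91, 0.87]))
```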

Algorithm 2 outlines our transfer learning SMBO algorithm and Figure 1 illustrates the target response surface that we try to reconstruct using deviations from the per-dataset mean.

While this work was under review, a related paper was published by Swersky et al. (2013). In that work, the authors proposed a multi-task Bayesian optimization framework similar in spirit to our work. The main difference is that their method estimates posterior probabilities of parameter estimates (the per-dataset mean and variance) using Monte Carlo methods, whereas we use maximum likelihood estimates.

3.3 Assumptions and Justification

Besides standard assumptions for GP-based methods, we need one additional assumption to justify the above method: different datasets produce similarly looking functions, although the location and the scale parameters can be different. If this assumption is satisfied, one explanation of using deviations from the dataset mean as the response surface is that we are using a Gaussian Process model with different means for different datasets. The information transfer occurs from reusing the covariance function across datasets.

Algorithm 2 Transfer Learning Sequential Model-Based Optimization

Input: surrogate model S^{d-1}, number of trials T, dataset d, target function f^d
Output: updated surrogate model S^d
S_0 = S^{d-1}
O_0^d = {}
for t = 1 to T do
    x_t = argmax_x EI(x, S_{t-1})
    Evaluate f^d(x_t)
    O_t^d = {⟨x_i, f^d(x_i)⟩}_{i=1}^{t}
    Update \hat{\mu}^d and \hat{\sigma}^d
    Reconstruct response surface: recompute y_{1:t}^d
    Replace y_{1:t-1}^d in the surrogate model S_{t-1} with the recomputed values
    Add y_t^d and x_t to S_{t-1} to get S_t
end for
Output S_T
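Below is a rough Python sketch of how Algorithm 2 can retain observations across datasets: prior standardized responses are kept, the current dataset's responses are re-standardized in O(t) at each iteration, and the next point is chosen by maximizing EI under a GP fit to the combined points. The scikit-learn surrogate, the random candidate search, and the toy target functions are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def transfer_smbo(f_d, bounds, prior_X, prior_y, num_trials=10, seed=0):
    """Algorithm 2 sketch: tune one new dataset while reusing standardized
    observations (prior_X, prior_y) retained from earlier datasets.

    prior_y are already deviations from their own per-dataset means, so they can
    sit on the same response surface as the current dataset's standardized values."""
    rng = np.random.default_rng(seed)
    dim = len(bounds)
    X_d, raw_d = [], []                      # trials on the current dataset d
    for t in range(num_trials):
        if t == 0:
            x_t = rng.uniform(bounds[:, 0], bounds[:, 1])
        else:
            # O(t) reconstruction: re-standardize the current dataset's responses.
            raw = np.array(raw_d)
            y_d = (raw - raw.mean()) / (raw.std() + 1e-12)
            X = np.vstack([prior_X, np.array(X_d)])
            y = np.concatenate([prior_y, y_d])
            gp = GaussianProcessRegressor(normalize_y=False).fit(X, y)
            cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(1000, dim))
            mu, sd = gp.predict(cand, return_std=True)
            sd = np.maximum(sd, 1e-9)
            z = (mu - y.max()) / sd
            ei = (mu - y.max()) * norm.cdf(z) + sd * norm.pdf(z)
            x_t = cand[np.argmax(ei)]
        X_d.append(x_t)
        raw_d.append(f_d(x_t))
    best = int(np.argmax(raw_d))
    return X_d[best], raw_d[best], np.array(X_d), np.array(raw_d)

# Usage: the first dataset has no priors; its standardized trials seed the next one.
f1 = lambda x: -np.sum((x - 1.0) ** 2)               # toy "accuracy" surfaces with the
f2 = lambda x: 10.0 - 3.0 * np.sum((x - 1.0) ** 2)   # same shape but different scales
bounds = np.array([[0.0, 3.0], [0.0, 3.0]])
_, _, X1, raw1 = transfer_smbo(f1, bounds, np.empty((0, 2)), np.empty(0))
y1 = (raw1 - raw1.mean()) / raw1.std()
print(transfer_smbo(f2, bounds, X1, y1)[1])
```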

Let x^d and x^{d+1} be trial points for datasets d and d+1. Since they are jointly Gaussian random vectors, we have:

\begin{bmatrix} f^d(x^d_{1:T}) \\ f^{d+1}(x^{d+1}_{1:T}) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu^d_{1:T} \\ \mu^{d+1}_{1:T} \end{bmatrix}, \begin{bmatrix} K_{x^d_{1:T}, x^d_{1:T}} & K_{x^d_{1:T}, x^{d+1}_{1:T}} \\ K_{x^{d+1}_{1:T}, x^d_{1:T}} & K_{x^{d+1}_{1:T}, x^{d+1}_{1:T}} \end{bmatrix} \right)

where \mu^d_{1:T} denotes \mu^d stacked T times to form a vector and K_{x^d_{1:T}, x^d_{1:T}} denotes the submatrix of K for all trial points on dataset d. The conditional distribution of f^{d+1}(x^{d+1}_{1:T}) \mid f^d(x^d_{1:T}) is therefore a Gaussian distribution with mean:

\mu^{d+1}_{1:T} + K_{x^{d+1}_{1:T}, x^d_{1:T}} K^{-1}_{x^d_{1:T}, x^d_{1:T}} \left( f^d(x^d_{1:T}) - \mu^d_{1:T} \right)

and covariance:

K_{x^{d+1}_{1:T}, x^{d+1}_{1:T}} - K_{x^{d+1}_{1:T}, x^d_{1:T}} K^{-1}_{x^d_{1:T}, x^d_{1:T}} K_{x^d_{1:T}, x^{d+1}_{1:T}}.

However, even if we take into account that different datasets can have different means, the scales of the target function values can still vary greatly. Since covariance is scale dependent, we need to divide f^d(x^d_{1:T}) - \mu^d by the dataset standard deviation \sigma^d to obtain the response value y^d_{1:T} that is used by our GP model in order to compare results from different datasets.

One key assumption of using a GP as the surrogate model is that the function f : R^p → R is reasonably smooth in x ∈ R^p.¹ In the context of hyperparameter optimization with an independent GP model for each dataset, this assumption translates to nearby hyperparameter values having similar performance.

¹ This is mainly true if we use the Automatic Relevance Determination (ARD) kernel. There has been work that uses other kernels, such as the Matern 5/2 kernel (Snoek et al., 2012). Our discussion focuses on the ARD kernel since it is the most widely used kernel in hyperparameter optimization.

Figure 1: A toy example of a common response surface reconstructed by our method from the response surfaces of two datasets. Dataset 1 (left) and dataset 2 (center) produce similar-looking response surfaces, but they are on very different scales. The right plot is an illustration of the common response surface constructed by our method from sample points of these two datasets.

For hyperparameter optimization with transfer learning across datasets, in practice, we typically augment the hyperparameter space with dataset features and let x live in this extended space, i.e., x ∈ R^{p+q}, where p is the number of hyperparameters and q is the number of dataset features. This implies that we assume that f(x) is also smooth in the dataset features (Bardenet et al., 2013). One drawback of this approach is that the performance depends greatly on the dataset features. It is also unclear what dataset features to use, and further research is still needed to investigate the best features. Furthermore, the response surface that is constructed by transfer learning algorithms is an approximation of the true response surface. In our method, we only have access to µ̂^d and σ̂^d instead of µ^d and σ^d, and this approximation might interact with dataset features. We need an approach that is more robust to poor estimates of the response surfaces of some datasets.

We propose to lessen the dependencies on dataset features by using multiple kernels, which we describe below.

3.4 Multiple kernel framework

Having access to more information (trials from previous experiments) allows us to define a better kernel than typical kernels used for hyperparameter tuning, and opens up the possibility of bringing a multiple kernel learning framework into GP-based SMBO. For example, in this work, we consider a convex combination of two kernels, where the first is simply the squared exponential kernel for points in the same dataset, and the second is a nearest neighbor kernel that looks at points from the n nearest neighbor datasets from previous experiments. In both kernels, we have x ∈ R^p (i.e., only in the hyperparameter space).

The first kernel (SQE) is:

k(x_i, x_j) = \theta_0^2 \exp\left( -\frac{1}{2} \sum_{p} \frac{(x_{i,p} - x_{j,p})^2}{\theta_p^2} \right)

We set \theta_0 = 1 and optimize \theta_p by maximizing the likelihood of the data. As a result, our SQE kernel is an ARD kernel.

The second kernel is:

k(x_i, x_j) = \begin{cases} 1 - \frac{\lVert x_i - x_j \rVert_2}{B} & \text{if } x_j \in \text{nearest neighbor datasets} \\ 0 & \text{otherwise,} \end{cases}

where \lVert x \rVert_2 \leq B. In other words, the second kernel is one minus the normalized Euclidean distance if the point belongs to one of the nearest neighbor datasets, and zero otherwise. We find nearest neighbors by using Euclidean distance in the dataset feature space R^q. The goal is to combine the power of nearest neighbor search and the squared exponential kernel with automatic relevance determination.
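A sketch of this two-kernel construction is shown below. The convex weights of 0.3 and 0.7 are the values used later in the experiments; the dictionary of dataset features and the specific function signatures are illustrative assumptions.

```python
import numpy as np

def sqe_ard(xi, xj, theta):
    # ARD squared exponential kernel over hyperparameter vectors (theta_0 = 1).
    return np.exp(-0.5 * np.sum(((xi - xj) / theta) ** 2))

def nn_kernel(xi, xj, dataset_j, neighbor_datasets, B):
    # One minus the normalized Euclidean distance if x_j comes from one of the
    # nearest neighbor datasets (found via dataset features), zero otherwise.
    if dataset_j in neighbor_datasets:
        return 1.0 - np.linalg.norm(xi - xj) / B
    return 0.0

def combined_kernel(xi, xj, same_dataset, dataset_j, neighbor_datasets,
                    theta, B, w_sqe=0.3, w_nn=0.7):
    # Convex combination: SQE only for points in the same dataset, the nearest
    # neighbor kernel for points from previous datasets.
    k = 0.0
    if same_dataset:
        k += w_sqe * sqe_ard(xi, xj, theta)
    k += w_nn * nn_kernel(xi, xj, dataset_j, neighbor_datasets, B)
    return k

def nearest_neighbor_datasets(current_features, prior_features, n=20):
    # Euclidean distance in the dataset feature space R^q.
    dists = {d: np.linalg.norm(current_features - f) for d, f in prior_features.items()}
    return set(sorted(dists, key=dists.get)[:n])
```

Under this combination, same-dataset pairs interact only through the ARD kernel, while pairs involving a nearest neighbor dataset contribute through the distance-based term.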

Notice that we only use the SQE kernel for points in the same dataset, relying on the nearest neighbor kernel to capture information from other datasets. As a result, the transfer of information is more direct. We only use dataset features to find the nearest neighbors, instead of estimating exactly how similar two datasets are, which is harder and thus more error-prone. Since each dataset in the nearest neighbors contributes equally, it can also give the algorithm more tolerance to error in approximating the response surface of prior datasets.

We also need not assume that f(x) is smooth in dataset features, although the actual function space becomes more complex since we have a combination of kernels. We view our work as a first step towards using a multiple kernel learning framework in hyperparameter tuning.

3.5 Time complexity

We assume that evaluating f(x) is the most expensive step in the algorithm. Our method, similar to Bardenet et al. (2013), needs to reconstruct the response surface at every iteration. Therefore, we want this step to be as cheap as possible.

Our method is very efficient since at each iteration t we only need to recompute the current dataset mean and standard deviation. Therefore, the time complexity of reconstructing the response surface is O(T), where T is the number of trials in one dataset.

On the other hand, Bardenet et al. (2013) needs to run SVM Rank on pairwise differences of per-dataset trials for all datasets. There can be D Choose(T, 2) such pairs in total, where D is the number of datasets used to reconstruct the response surface. For example, using the cutting-plane algorithm (Joachims et al., 2009), the per-iteration time complexity of SVM Rank is O(DT^2) with a linear kernel and O(D^2 T^4) with a non-linear kernel (e.g., RBF) without approximation. Note that the actual time complexity of reconstructing the response surface is the number of SVM Rank iterations times the per-iteration complexity above. As we collect more samples from many datasets, this step becomes prohibitively expensive, sometimes even more expensive in practice than the function evaluation step. One possible solution is to limit the number of datasets that contribute to the reconstruction of the response surface. We do not explore this option when comparing our method to Bardenet et al. (2013).

4 Experiments

4.1 Setup

Dataset We collected 42 publicly available classification datasets from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ (Chang & Lin, 2011). Out of time considerations, we restricted ourselves to datasets that have fewer than 40,000 training instances and 300 features. We assume that the datasets arrive in a fixed order and are processed one by one. Since we do not know the true ordering of the datasets, we randomly shuffle the order of the datasets many times and average the results. For each dataset, we extract the (log) number of instances, the (log) number of features, and the number of classes to be used as dataset features. We split each dataset into training, development, and (blind) test sets. The hyperparameters are tuned on the development set. Assessing performance on only the development sets can be misleading since we might overfit to the development sets. We report the results in terms of performance on both the development and test sets. Table 1 offers descriptive statistics about our collection of datasets.

Table 1: Descriptive statistics about the datasets. The dataset features in our experiments are the number of classes, the log number of features, and the log number of training instances.

Number of              Avg       Min    Max
classes                3.71      2      26
features               93.71     2      300
training instances     5912.12   150    32,561
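The three dataset features can be computed directly from a labeled training set; a small sketch, assuming a dense feature matrix and integer class labels:

```python
import numpy as np

def dataset_features(X_train, y_train):
    """Return the q = 3 dataset features used in the experiments:
    log #training instances, log #features, and #classes."""
    num_instances, num_features = X_train.shape
    num_classes = len(np.unique(y_train))
    return np.array([np.log(num_instances), np.log(num_features), num_classes],
                    dtype=float)

# Example with a toy 150 x 4 training set and 3 classes.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 4))
y_toy = rng.integers(0, 3, size=150)
print(dataset_features(X_toy, y_toy))
```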

Baselines We consider three baseline methods for comparison with our algorithm: random search (Bergstra & Bengio, 2012), an SMBO model that only uses information from the current dataset (independent GP), and an SMBO model that transfers knowledge from previous experiments using surrogate-based ranking (Bardenet et al., 2013). We implement the Bardenet et al. (2013) method with SVM Rank as the ranking algorithm. Due to runtime considerations, we use a linear SVM Rank kernel instead of the isotropic squared exponential kernel used in their paper. One limitation of SVM Rank is that it introduces additional parameters, namely the convergence threshold and another constant for the error penalty of the slack variables. We set the values to 10^{-3} and 1 after slight tuning to get a reasonable trade-off between runtime and performance. These parameters also require extensive tuning in order to get the best performance. Our method, on the other hand, is parameter free and can be applied without separate tuning of additional hyperparameters.

For models that transfer knowledge from previous experiments, we consider two kernels: a squared exponential kernel with automatic relevance determination (SQE) and the multiple kernel framework that we describe in §3.4 (MKL). Therefore, there are 6 models in total that we compare in our experiments. For the models with multiple kernels, we arbitrarily set the kernel weights to 0.3 (SQE) and 0.7 (20 nearest neighbors). Future work can try to learn the optimal kernel weights on the fly. We would also like to note that we ran independent GP using the nearest neighbor kernel in preliminary experiments and found that it gave comparable performance to SQE, so we do not include it here.

Evaluation metric Similar to Bardenet et al. (2013), we use average rank as the evaluation metric. Every method gets a rank for every dataset. The best performing method gets rank 1, the second best 2, and so on. Since there are 6 methods that we compare, for every dataset, each method can get rank 1 to 6. Each competing method is evaluated on every dataset that is assumed to arrive in an unknown random order. We compute the average for every run, where a run corresponds to a random order of the datasets.
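A small sketch of this average-rank computation (the text does not discuss ties, so averaging tied ranks via scipy's rankdata is an assumption here):

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(performance):
    """performance: array of shape (num_datasets, num_methods), higher is better.
    Returns the average rank per method (rank 1 = best on a dataset)."""
    # rankdata ranks ascending, so rank the negated scores; ties get averaged ranks.
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, performance)
    return ranks.mean(axis=0)

# Toy example: 3 datasets, 2 methods.
perf = np.array([[0.90, 0.85],
                 [0.70, 0.72],
                 [0.66, 0.60]])
print(average_ranks(perf))   # method 0 averages rank 1.33, method 1 averages 1.67
```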

4.2 Synthetic datasets

In order to test our hypothesis that we can find a good hyperparameter setting faster by using deviations from the per-dataset mean, we perform an experiment using artificially generated datasets. For each of the 42 datasets that we collected, we create 5 perturbed datasets. A perturbed dataset is created by randomly removing features or instances from a source dataset. As a result, we have 210 synthetic datasets, and each group of 5 is obviously related since they are generated from one source dataset. If the methods that we use are able to transfer information from previous experiments, they will surely perform better than the independent GP method and random search. The relationships among synthetic datasets generated from different real-world datasets are less clear. However, in this experiment, if a method can make use of information from previous datasets, we still expect a performance improvement over non-transfer learning methods.

We run our experiment using logistic regression as the classifier. Logistic regression is one of the most widely used machine learning algorithms. In our setup, there are four hyperparameters to tune: the ℓ1 penalty, the ℓ2 penalty, the maximum number of iterations, and the objective value convergence threshold. Therefore, our logistic regression is an elastic net (Zou & Hastie, 2005). For each dataset, each method gets 1+5 trials. A trial corresponds to a function evaluation for a given hyperparameter setting. We set the first trial point to be always the same for the non-transfer learning methods: random search and independent GP. Transfer learning methods can make use of information from previous datasets for choosing the first trial point if they have seen more than one dataset (d > 1), so the evaluated point can be different. The goal of this experiment is to simulate a machine learning service that receives numerous requests and needs to obtain reasonably good performance as fast as possible. We compute the average for 5 different random orders of the datasets (5 runs).

Table 2: Average ranks and standard deviations on the development and test sets for logistic regression experiments on synthetic datasets. They are averaged over 5 different random orders of the datasets. Note that standard errors are √5 times smaller than standard deviations here. The improvements for our methods (MKL and SQE) over the independent GP method for the test set are statistically significant (p < 0.05, Student's t-test).

Method                        Dev           Test
Random                        3.98 ± 0.11   3.94 ± 0.12
Independent GP                3.55 ± 0.11   3.58 ± 0.13
Bardenet et al., 2013 (SQE)   3.76 ± 0.19   3.75 ± 0.17
Bardenet et al., 2013 (MKL)   3.15 ± 0.48   3.18 ± 0.43
Our method (SQE)              3.36 ± 0.14   3.34 ± 0.15
Our method (MKL)              3.17 ± 0.22   3.19 ± 0.20

Table 2 shows the average rank of each competing method on these synthetic datasets. The results clearly indicate that deviations from the per-dataset mean are an effective way to transfer information from previous experiments. As we can see, both the SQE and MKL variants of our method significantly outperformed independent GP. The Bardenet et al. (2013) method with MKL also performed well, but the SQE variant did not. Notice also the high standard deviation of the MKL variant, indicating that it is less robust to the dataset ordering.

Figure 2 shows the average rank over time for each of the competing methods on one run of the experiment. In general, we can see that after seeing about 75 datasets the average ranks level off. An interesting observation is that the average rank of our method with the SQE kernel still decreases, whereas that of our method with MKL does not. The MKL method only looks at the 20 nearest neighbor datasets, so the effect of seeing more datasets is not as significant as for the SQE method, which looks at all datasets. One possible direction for future work is to investigate the effect of increasing the number of nearest neighbors to take advantage of more datasets.

4.3 Real-world datasets

We perform two experiments using the 42 real-world datasets described in §4.1. Note that unlike in the synthetic dataset experiment, we have no idea beforehand whether any of these datasets are related and whether there is information that can be transferred across them.

Figure 2: Average ranks over time for one of the runs on 210 synthetic datasets.

Logistic regression In the first experiment, we use logistic regression as our classifier. Similar to the synthetic dataset experiment, there are four hyperparameters to tune: the ℓ1 penalty, the ℓ2 penalty, the maximum number of iterations, and the objective value convergence threshold. In this experiment, each method gets 1+10 trials for each dataset. The treatment of the first trial for each method on each dataset is similar to that in the synthetic dataset experiment. Table 3 shows the average rank on 42 datasets, averaged over 10 different random orders of the datasets (10 runs), on the development set and the test set.

Ensemble of classifiers In the second experiment, we consider an ensemble of seven classifiers, with 23 hyperparameters to tune. The classifiers that we consider are: Winnow (Littlestone, 1988), perceptron, logistic regression, Naive Bayes, online Support Vector Machines, an online passive-aggressive classifier (Crammer et al., 2006), and IKPamir.² Each classifier has one to six hyperparameters to tune. The classifiers are combined using weighted majority voting, so in addition to individual algorithm hyperparameters, we also search for classifier voting weights. Therefore, we have 30 hyperparameters in total; see Table 4 for details. Similar to the logistic regression experiments, each method gets 1+10 trials for each dataset. We report the average rank on 42 datasets over 10 different random orders of the datasets (10 runs) in Table 3.

² IKPamir is a combination of Crammer et al. (2006) and Maji et al. (2008) (Hector Ye; personal correspondence).
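A sketch of the weighted majority voting step, with the per-classifier weights living on the probability simplex and treated as additional hyperparameters; the scikit-learn-style predict interface is an assumption:

```python
import numpy as np

def weighted_majority_vote(classifiers, weights, X):
    """Combine per-classifier predictions with simplex weights.

    classifiers: list of fitted models with a .predict(X) method.
    weights: nonnegative array summing to 1 (searched as extra hyperparameters)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                 # project back onto the simplex
    all_preds = np.stack([clf.predict(X) for clf in classifiers])  # (n_clf, n_samples)
    classes = np.unique(all_preds)
    # Accumulate the weight each class receives from each classifier's vote.
    votes = np.zeros((len(classes), all_preds.shape[1]))
    for c_idx, c in enumerate(classes):
        votes[c_idx] = (weights[:, None] * (all_preds == c)).sum(axis=0)
    return classes[np.argmax(votes, axis=0)]

# Usage with any fitted scikit-learn-style classifiers:
# y_hat = weighted_majority_vote([clf_a, clf_b, clf_c], [0.5, 0.3, 0.2], X_test)
```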

Results and discussion The results clearly demonstrate the benefits of information from prior datasets, especially in the MKL case. In the logistic regression experiments, our method and Bardenet et al. (2013) with the SQE kernel failed to improve over independent GP. One possible explanation is that since there are only 4 hyperparameters, 11 trials are sufficient.

Table 4: The 23 hyperparameters in our ensemble of classifiers experiments. The minimum and maximum values were chosen with the runtime and performance trade-off as a criterion. There are an additional 7 hyperparameters corresponding to classifier weights in the probability simplex for each of these classifiers.

Classifier     Hyperparameter         Min      Max
Winnow         number of epochs       2        20
Winnow         conv. threshold        10^-4    1
Winnow         bias conv. threshold   10^-4    1
Winnow         margin                 10^-4    1
perceptron     average                0        1
perceptron     number of epochs       5        15
log. reg.      ℓ1 penalty             0        10
log. reg.      ℓ2 penalty             0        10
log. reg.      max iteration          50       500
log. reg.      conv. threshold        10^-7    10^-3
naive bayes    min probability        10^-4    10^-2
online SVM     polynomial degree      1        20
online SVM     polynomial bias        10^-2    10
online SVM     lambda                 0.8      1.2
online SVM     max iteration          100      20,000
online SVM     mini batch size        1        5
online SVM     ℓ2 penalty             0        100
pass. aggr.    sensitivity            0.5      1.5
pass. aggr.    relaxation             0.5      1.5
pass. aggr.    number of epochs       5        15
IKPamir        number of bins         5        100
IKPamir        learning rate          0.1      0.9
IKPamir        number of epochs       3        8

Another possible explanation is that the SQE kernel requires a crucial assumption that the function is also smooth in dataset features, as described in §3.3. We only used three dataset features: the (log) number of instances and features, and the number of classes, which might not satisfy this assumption in practice. The method based on multiple kernel learning, however, is more robust since it only uses dataset features to find nearest neighbors and does not use the features directly in the GP kernel. We also found that for our method, randomizing some elements of x (which is returned after maximizing EI) improved the performance. One difficulty in our method is obtaining good estimates of the mean µ̂^d and standard deviation σ̂^d while not wasting trials. Simply using unperturbed x^d_{1:t} to estimate the dataset mean and standard deviation introduced bias, since the x that maximizes the EI is biased towards points that are predicted to have good performance (accuracy). The rationale for randomization is to reduce the bias. For all our experiments, we use uniform randomization with replacement probability 0.25 (i.e., an element of x is replaced, with probability 0.25, by a new value uniformly sampled from the corresponding hyperparameter space).
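A sketch of this randomization step applied to the point returned by the EI maximization, assuming each hyperparameter range is given as box bounds (the logistic regression bounds in the example are taken from Table 4):

```python
import numpy as np

def randomize_candidate(x, bounds, replace_prob=0.25, rng=None):
    """With probability 0.25, replace each element of x with a value drawn
    uniformly from the corresponding hyperparameter range.

    bounds: array of shape (len(x), 2) with per-dimension (min, max)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x, dtype=float)
    mask = rng.random(len(x)) < replace_prob
    x[mask] = rng.uniform(bounds[mask, 0], bounds[mask, 1])
    return x

# Example: a 4-dimensional logistic regression setting (l1, l2, max iter, threshold).
bounds = np.array([[0.0, 10.0], [0.0, 10.0], [50.0, 500.0], [1e-7, 1e-3]])
print(randomize_candidate([1.0, 2.0, 100.0, 1e-5], bounds, rng=np.random.default_rng(0)))
```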

Figure 3 shows runtime comparisons in CPU seconds for one of the runs of the logistic regression and ensemble of classifiers experiments.

Table 3: Average ranks and standard deviations on the development and test sets for logistic regression experiments (left) and ensemble of classifiers experiments (right). They are averaged over 10 different random orders of the datasets. Note that standard errors are √10 times smaller than standard deviations here. The improvement for our method (MKL) over the independent GP method for the test set is statistically significant (p < 0.05, Student's t-test).

                              Logistic regression          Ensemble of classifiers
Method                        Dev           Test           Dev           Test
Random                        3.93 ± 0.19   3.89 ± 0.16    3.85 ± 0.13   3.77 ± 0.17
Independent GP                3.42 ± 0.09   3.43 ± 0.11    3.62 ± 0.19   3.55 ± 0.29
Bardenet et al., 2013 (SQE)   3.51 ± 0.18   3.50 ± 0.18    3.75 ± 0.26   3.57 ± 0.20
Bardenet et al., 2013 (MKL)   3.20 ± 0.46   3.18 ± 0.40    3.11 ± 0.24   3.31 ± 0.24
Our method (SQE)              3.85 ± 0.31   3.78 ± 0.20    3.58 ± 0.13   3.50 ± 0.24
Our method (MKL)              3.08 ± 0.17   3.23 ± 0.21    3.10 ± 0.31   3.29 ± 0.26

Figure 3: Runtime comparisons (CPU seconds) for one of the runs in the logistic regression experiments (left) and ensemble of classifiers experiments (right).

We can see that the runtime of Bardenet et al. (2013) grows significantly faster with the number of datasets, whereas the runtime of our method grows almost linearly. For example, in the logistic regression experiment, increasing the number of seen datasets from 20 to 40 leads to about a five-fold increase in time (≈1,000 CPU seconds to ≈5,000 CPU seconds) for Bardenet et al. (2013), whereas our method (MKL) goes from ≈400 CPU seconds to ≈1,000 CPU seconds.

In practice, running a surrogate ranking algorithm to reconstruct the common response surface can be very expensive with a large number of datasets. The goal of SMBO is to optimize an expensive target function by optimizing a surrogate that is cheaper to evaluate. When constructing the surrogate becomes too expensive, it nullifies the benefits of SMBO.

An interesting direction is to compare our method and Bardenet et al. (2013) using a fixed computational resource budget (e.g., time, memory). Since our method is more efficient, it can make use of more datasets to potentially provide better predictions. We leave this for future work.

5 Conclusion

We proposed an algorithm for automatic hyperparameter tuning that can generalize across datasets. Our method is an instance of sequential model-based optimization that uses deviations from the per-dataset mean as the response values. We theoretically and empirically showed that the time complexity of reconstructing the response surface at every iteration in our method is significantly lower than that of previous work, with comparable performance. We demonstrated the superiority of our method compared to non-transfer learning methods on a large number of synthetic and real-world datasets for tuning hyperparameters of logistic regression and ensembles of classifiers.

Acknowledgements

The authors thank anonymous reviewers, Thomas Fu, abe research group, Nathan Schneider, Brendan O'Connor, and Noah A. Smith for helpful comments and discussions.

References

Bardenet, Remi, Brendel, Matyas, Kegl, Balazs, and Sebag, Michele. Collaborative hyperparameter tuning. In Proc. of ICML, 2013.

Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.

Bergstra, James, Bardenet, Remi, Bengio, Yoshua, and Kegl, Balazs. Algorithms for hyper-parameter optimization. In Proc. of NIPS, 2011.

Brochu, Eric, Brochu, Tyson, and de Freitas, Nando. A Bayesian interactive optimization approach to procedural animation design. In Proc. of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2010.

Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011.

Chu, Wei and Ghahramani, Zoubin. Preference learning with Gaussian processes. In Proc. of ICML, 2005.

Crammer, Koby, Dekel, Ofer, Keshet, Joseph, Shalev-Shwartz, Shai, and Singer, Yoram. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

Hoffman, Matthew, Brochu, Eric, and de Freitas, Nando. Portfolio allocation for Bayesian optimization. In Proc. of UAI, 2011.

Hutter, Frank, Hoos, Holger H., and Leyton-Brown, Kevin. Sequential model-based optimization for general algorithm configuration. In Proc. of LION-5, 2011.

Joachims, Thorsten. Optimizing search engines using clickthrough data. In Proc. of KDD, 2002.

Joachims, Thorsten, Finley, Thomas, and Yu, Chun-Nam John. Cutting-plane training of structural SVMs. Machine Learning Journal, 77(1):27–59, 2009.

Jones, Donald R. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345–385, 2001.

Littlestone, Nick. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.

Maji, Subhransu, Berg, Alexander C., and Malik, Jitendra. Classification using intersection kernel support vector machines is efficient. In Proc. of CVPR, 2008.

Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In Proc. of NIPS, 2012.

Srinivas, Niranjan, Krause, Andreas, Kakade, Sham, and Seeger, Matthias. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. of ICML, 2010.

Swersky, Kevin, Snoek, Jasper, and Adams, Ryan P. Multi-task Bayesian optimization. In Proc. of NIPS, 2013.

Thornton, Chris, Hutter, Frank, Hoos, Holger H., and Leyton-Brown, Kevin. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proc. of KDD, 2013.

Villemonteix, Julien, Vazquez, Emmanuel, and Walter, Eric. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 2006.

Zou, Hui and Hastie, Trevor. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.