
Scalable Bayesian Optimization Using Deep Neural Networks

Jasper Snoek* ([email protected])
Oren Rippel†* ([email protected])
Kevin Swersky§ ([email protected])
Ryan Kiros§ ([email protected])
Nadathur Satish‡ ([email protected])
Narayanan Sundaram‡ ([email protected])
Md. Mostofa Ali Patwary‡ ([email protected])
Prabhat? ([email protected])
Ryan P. Adams* ([email protected])

*Harvard University, School of Engineering and Applied Sciences
†Massachusetts Institute of Technology, Department of Mathematics
§University of Toronto, Department of Computer Science
‡Intel Labs, Parallel Computing Lab
?NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA, USA 94720

Abstract

Bayesian optimization has been demonstrated as an effective methodology for the global optimization of functions with expensive evaluations. Its strategy relies on querying a distribution over functions defined by a relatively cheap surrogate model. The ability to accurately model this distribution over functions is critical to the effectiveness of Bayesian optimization, and is typically fit using Gaussian processes (GPs). However, since GPs scale cubically with the number of observations, it has been challenging to handle objectives whose optimization requires a large number of evaluations, and as such, massively parallelizing the optimization.

In this work, we explore the use of neural networks as an alternative to Gaussian processes to model distributions over functions. We show that performing adaptive basis function regression with a neural network as the parametric form performs competitively with state-of-the-art GP-based approaches, but scales linearly with the number of data rather than cubically. This allows us to achieve a previously intractable degree of parallelism, which we use to rapidly search over large spaces of models. We achieve state-of-the-art results on benchmark object recognition tasks using convolutional neural networks, and image caption generation using multimodal neural language models.

Copyright 2015 by the author(s).

1. Introduction

Recently, the field of machine learning has seen unprecedented growth due to a new wealth of data, increases in computational power, new algorithms, and a plethora of exciting new applications. As researchers tackle more ambitious problems, the models they use are also becoming more sophisticated. However, the growing complexity of machine learning models inevitably comes with the introduction of additional hyperparameters. These range from design decisions, such as the shape of a neural network architecture, to optimization parameters, such as learning rates, to regularization hyperparameters such as weight decay. Proper setting of these hyperparameters is critical for performance on difficult problems.

There are many methods for optimizing over hyperparameter settings, ranging from simplistic procedures like grid or random search (Bergstra & Bengio, 2012), to more sophisticated model-based approaches using random forests (Hutter et al., 2011) or Gaussian processes (Snoek et al., 2012). Bayesian optimization is a natural framework for model-based global optimization of noisy, expensive black-box functions. It offers a principled approach to modeling uncertainty, which allows exploration and exploitation to be naturally balanced during the search.

arXiv:1502.05700v1 [stat.ML] 19 Feb 2015


Perhaps the most commonly used model for Bayesian optimization is the Gaussian process (GP), due to its simplicity and flexibility in terms of conditioning and inference.

However, a major drawback of GP-based Bayesian optimization is that inference time grows cubically in the number of observations, due to the need to solve a linear system with a dense covariance matrix. For problems with a very small number of hyperparameters, this has not been an issue, as the minimum is often discovered before the cubic scaling renders further evaluations prohibitive. As the complexity of machine learning models grows, however, the size of the search space grows as well, along with the number of hyperparameter configurations that need to be evaluated before a solution of sufficient quality is found. Fortunately, as models have grown in complexity, computation has become significantly more accessible and it is now possible to train many models in parallel using relatively inexpensive compute clusters. A natural solution to the hyperparameter search problem is therefore to combine large-scale parallelism with a scalable Bayesian optimization method. The cubic scaling of the GP, however, has made it infeasible to pursue this approach.

The goal of this work is to develop a method for scaling Bayesian optimization, while still maintaining its desirable flexibility and characterization of uncertainty. To that end, we propose the use of neural networks to learn an adaptive set of basis functions for Bayesian linear regression. We refer to this approach as Deep Networks for Global Optimization (DNGO). Unlike a standard Gaussian process, DNGO scales linearly with the number of function evaluations (which, in the case of hyperparameter optimization, corresponds to the number of models trained) and is amenable to stochastic gradient training. Although it may seem that we are merely moving the problem of setting the hyperparameters of the model being tuned to setting them for the tuner itself, we show that for a suitable set of design choices it is possible to create a robust, scalable, and effective Bayesian optimization system that generalizes across many global optimization problems.

We demonstrate the effectiveness of DNGO on a number of difficult problems, including benchmark problems for Bayesian optimization, convolutional neural networks for object recognition, and multimodal neural language models for image caption generation. We find hyperparameter configurations that achieve state-of-the-art results of 6.37% and 27.4% on CIFAR-10 and CIFAR-100 respectively, and BLEU scores of 25.1 and 26.7 on the Microsoft COCO 2014 dataset using a single model and a 3-model ensemble respectively.

2. Background and Related Work

2.1. Bayesian Optimization

Bayesian optimization is a well-established strategy for the global optimization of noisy, expensive black-box functions (Mockus et al., 1978). For an in-depth review, see Lizotte (2008), Brochu et al. (2010) and Osborne et al. (2009). Bayesian optimization relies on the construction of a probabilistic model that defines a distribution over objective functions from the input space to the objective of interest. Conditioned on a prior over the functional form and a set of N observations of input-target pairs {(x_n, y_n)}_{n=1}^{N}, the relatively cheap posterior over functions is then queried to reason about where to seek the optimum of the expensive function of interest. The promise of a new experiment is quantified using an acquisition function, which, applied to the posterior mean and variance, expresses a trade-off between exploration and exploitation. Bayesian optimization proceeds by performing a proxy optimization over this acquisition function in order to determine the next input to evaluate.
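
To make this loop concrete, the following sketch shows the generic procedure in Python. The surrogate, acquisition, and objective objects are hypothetical placeholders (any of the models and acquisition functions discussed in this paper could be plugged in); this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def bayes_opt_loop(objective, surrogate, acquisition, bounds, n_init=5, n_iter=50):
    """Generic Bayesian optimization loop (illustrative sketch).

    objective   -- expensive black-box function f(x) -> y (hypothetical)
    surrogate   -- object with .fit(X, y) and .predict(X) -> (mean, var)
    acquisition -- function (mean, std, y_best) -> score to maximize
    bounds      -- array of shape (K, 2) giving box constraints per dimension
    """
    K = bounds.shape[0]
    # Initial design: a handful of random points inside the box.
    X = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, K))
    y = np.array([objective(x) for x in X])

    for _ in range(n_iter):
        surrogate.fit(X, y)
        # Cheap proxy optimization of the acquisition: score a random candidate set.
        cand = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(2048, K))
        mean, var = surrogate.predict(cand)
        scores = acquisition(mean, np.sqrt(var), y.min())
        x_next = cand[np.argmax(scores)]
        # Evaluate the expensive objective and augment the data set.
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next))

    return X[np.argmin(y)], y.min()
```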

Recent innovation has resulted in significant progress in Bayesian optimization, including elegant theoretical results (Srinivas et al., 2010; Bull, 2011; de Freitas et al., 2012), multitask and transfer optimization (Krause & Ong, 2011; Swersky et al., 2013; Bardenet et al., 2013), and the application to diverse tasks such as sensor set selection (Garnett et al., 2010), the tuning of adaptive Monte Carlo (Mahendran et al., 2012) and robotic gait control (Calandra et al., 2014b).

Typically, Gaussian processes have been used to construct the distribution over functions used in Bayesian optimization, due to their flexibility, well-calibrated uncertainty, and analytic properties (Jones, 2001; Osborne et al., 2009). Recent work has sought to improve the performance of the GP approach through accommodating higher dimensional problems (Wang et al., 2013; Djolonga et al., 2013), input non-stationarities (Snoek et al., 2014) and initialization through meta-learning (Feurer et al., 2015). Random forests, which scale linearly with the data, have also been used successfully for algorithm configuration by Hutter et al. (2011) with empirical estimates of model uncertainty.

More specifically, Bayesian optimization seeks to solve the minimization problem

$$\mathbf{x}^\star = \operatorname*{argmin}_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x})\,, \qquad (1)$$


where we take X to be a compact subset of R^K. In our work, we build upon the standard GP-based approach of Jones (2001), which uses a GP surrogate and the expected improvement acquisition function (Mockus et al., 1978). For model hyperparameters Θ, let σ²(x; Θ) = Σ(x, x; Θ) be the marginal predictive variance of the probabilistic model, µ(x; {x_n, y_n}, Θ) be the predictive mean, and define

$$\gamma(\mathbf{x}) = \frac{f(\mathbf{x}_{\mathrm{best}}) - \mu(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \Theta)}{\sigma(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \Theta)}\,, \qquad (2)$$

where f(x_best) is the lowest observed value. The expected improvement criterion is defined as

$$a_{\mathrm{EI}}(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \Theta) = \sigma(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \Theta)\,\bigl[\gamma(\mathbf{x})\,\Phi(\gamma(\mathbf{x})) + \mathcal{N}(\gamma(\mathbf{x}); 0, 1)\bigr]\,. \qquad (3)$$

Here Φ(·) is the cumulative distribution function of a standard normal, and N(·; 0, 1) is the density of a standard normal. Note that numerous alternate acquisition functions and combinations thereof have been proposed (Kushner, 1964; Srinivas et al., 2010; Hoffman et al., 2011), which could be used without affecting the analytic properties of our approach.
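
For illustration, Equations 2 and 3 can be computed directly from a model's predictive mean and standard deviation; the sketch below assumes NumPy arrays of predictions at candidate points and is not tied to any particular surrogate.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, eps=1e-12):
    """Expected improvement (Equations 2-3) for a minimization problem.

    mu, sigma -- predictive mean and standard deviation at candidate points
    f_best    -- lowest objective value observed so far
    """
    sigma = np.maximum(sigma, eps)      # guard against zero predictive variance
    gamma = (f_best - mu) / sigma       # Equation 2
    # Equation 3: sigma * [gamma * Phi(gamma) + N(gamma; 0, 1)]
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
```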

2.2. Bayesian Neural Networks

The idea of applying Bayesian methods to neural networks has a rich history in machine learning (MacKay, 1992; Hinton & van Camp, 1993; Buntine & Weigend, 1991; Neal, 1995; De Freitas, 2003). The goal of Bayesian neural networks is to uncover the full posterior distribution over the network weights in order to capture uncertainty, to act as a regularizer, and to provide a framework for model comparison. The full posterior is, however, intractable for most forms of neural networks, necessitating expensive approximate inference or Markov chain Monte Carlo simulation. More recently, full or approximate Bayesian inference has been considered for small pieces of the overall architecture. For example, in similar spirit to this work, Lazaro-Gredilla & Figueiras-Vidal (2010), Hinton & Salakhutdinov (2008) and Calandra et al. (2014a) considered inference over just the last layer of a neural network. Other examples can be found in Kingma & Welling (2014), Rezende et al. (2014) and Mnih & Gregor (2014), where a neural network is used in a variational approximation to the posterior distribution over the latent variables of a directed generative neural network.

3. Adaptive Basis Regression with Deep Neural Networks

A key limitation of GP-based Bayesian optimization is that the computational cost of the technique scales cubically in the number of observations, limiting the applicability of the approach to objectives that require a relatively small number of observations to optimize. In this work, we aim to replace the GP traditionally used in Bayesian optimization with a model that scales in a less dramatic fashion, but retains most of the GP's desirable properties such as flexibility and well-calibrated uncertainty. Bayesian neural networks are a natural consideration, not least because of the theoretical relationship between Gaussian processes and infinite Bayesian neural networks (Neal, 1995; Williams, 1996). However, deploying these at a large scale is very computationally expensive.

As such, we take a pragmatic approach and add a Bayesian linear regressor to the last hidden layer of a deep neural network, marginalizing only the output weights of the net while using a point estimate for the remaining parameters. This results in adaptive basis regression, a well-established statistical technique which scales linearly in the number of observations, and cubically in the basis function dimensionality. This allows us to explicitly trade off evaluation time and model capacity. As such, we form the basis using the very flexible and powerful non-linear functions defined by the neural network.

First of all, without loss of generality and assuming compact support for each input dimension, we scale the input space to the unit hypercube. We denote by [φ_1(·), . . . , φ_D(·)]^T the vector of outputs from the last hidden layer of the network, trained on inputs and targets D := {(x_n, y_n)}_{n=1}^{N} ⊂ R^K × R. We take these to be our set of basis functions. In addition, define Φ to be the design matrix arising from the data and this basis, where Φ_{nd} = φ_d(x_n), and y the stacked target vector.

These basis functions are parameterized via the weights and biases of the deep neural network, and these parameters are trained via backpropagation and stochastic gradient descent with momentum. In this training phase, a linear output layer is also fit. This procedure can be viewed as a maximum a posteriori (MAP) estimate of all parameters in the network. Once this "basis function neural network" has been trained, we replace the MAP-parameterized output layer with a Bayesian linear regressor that captures uncertainty in the weights. See Section 3.1.2 for a more elaborate explanation of this choice.


[Figure 1: line plot of seconds per iteration versus number of observations, comparing DNGO (Neural Network) and Spearmint (GP).]

Figure 1. A comparison of the time per suggested experiment for our method compared to the state-of-the-art GP-based approach (Snoek et al., 2014) on the six-dimensional Hartmann function. We ran each algorithm on the same 32-core system with 80GB of RAM five times and plot the mean and standard deviation.

The predictive mean µ(x; Θ) and variance σ²(x; Θ) of the model are then given by

$$\mu(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \Theta) = \mathbf{m}^\top \boldsymbol{\phi}(\mathbf{x}) + \eta(\mathbf{x})\,, \qquad (4)$$

$$\sigma^2(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \Theta) = \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{K}^{-1} \boldsymbol{\phi}(\mathbf{x}) + \alpha^2\,, \qquad (5)$$

where

$$\mathbf{m} = \beta\, \mathbf{K}^{-1} \boldsymbol{\Phi}^\top \mathbf{y} \;\in\; \mathbb{R}^{D}\,, \qquad (6)$$

$$\mathbf{K} = \beta\, \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \mathbf{I}\alpha^2 \;\in\; \mathbb{R}^{D \times D}\,. \qquad (7)$$

Here, η(x) is a prior mean function, which is described in Section 3.1.3. In addition, α, β ∈ Θ are regression model hyperparameters. We integrate out α and β using slice sampling (Neal, 2000), according to the methodology of Snoek et al. (2012), over the marginal likelihood, which is given by

$$\log p(\mathbf{y} \mid \mathbf{X}, \alpha, \beta) = -\frac{1}{2}\log\bigl[(2\pi\alpha)^{N} D\,\beta\, |\mathbf{K}|\bigr] - \frac{1}{2\alpha}\,(\mathbf{y} - \boldsymbol{\Phi}\mathbf{m})^{2} - \frac{1}{2\beta}\,\mathbf{m}^\top \mathbf{m}\,. \qquad (8)$$

It is clear that the computational bottleneck of this procedure is the inversion of K. However, note that the size of this matrix grows with the output dimensionality D, rather than the number of observations N as in the GP case. This allows us to scale to significantly more observations than with the GP, as demonstrated in Figure 1.
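
A minimal sketch of the Bayesian linear output layer, following Equations 4-7 as written above. It assumes the basis features Φ have already been produced by the trained network, and it treats α and β as fixed scalars rather than slice-sampling them under Equation 8 as the paper does.

```python
import numpy as np

class BayesianLinearBasisRegression:
    """Bayesian linear regression on fixed basis features (Equations 4-7).

    phi(x) is assumed to be the D-dimensional output of the last hidden layer
    of the trained basis network; alpha and beta are treated as given here,
    whereas the paper integrates them out with slice sampling (Equation 8).
    """

    def __init__(self, alpha, beta, prior_mean=lambda x: 0.0):
        self.alpha, self.beta = alpha, beta
        self.prior_mean = prior_mean  # eta(x), e.g. the quadratic prior of Section 3.1.3

    def fit(self, Phi, y):
        D = Phi.shape[1]
        self.K = self.beta * Phi.T @ Phi + self.alpha**2 * np.eye(D)   # Eq. 7
        self.m = self.beta * np.linalg.solve(self.K, Phi.T @ y)        # Eq. 6
        return self

    def predict(self, Phi_star, X_star):
        # Predictive mean (Eq. 4) and variance (Eq. 5) at new inputs.
        eta = np.array([self.prior_mean(x) for x in X_star])
        mean = Phi_star @ self.m + eta
        var = np.einsum('nd,dn->n', Phi_star,
                        np.linalg.solve(self.K, Phi_star.T)) + self.alpha**2
        return mean, var
```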

3.1. Model details

Here we elaborate on various choices made in the design of the model.

3.1.1. Network architecture

A natural concern with the use of deep networks is that they often require significant effort to tune and tailor to specific problems. One can consider adjusting the architecture and tuning the hyperparameters of the neural network as itself a difficult hyperparameter optimization problem. An additional challenge is that we aim to create an approach that generalizes across optimization problems. We found that design decisions such as the type of activation function used significantly altered the performance of the Bayesian optimization routine. For example, in Figure 2 we see that the commonly used rectified linear (ReLU) function leads to very poor estimates of uncertainty, which causes the Bayesian optimization routine to explore excessively. Since the bounded tanh function results in smooth functions with realistic variance, we use this nonlinearity in this work; however, if the smoothness assumption needs to be relaxed, a combination of rectified linear functions with a tanh function only on the last layer can also be used in order to bound the basis.

In order to tune any remaining hyperparameters, such as the width of the hidden layers and the amount of regularization, we used GP-based Bayesian optimization. For each of one to four layers we ran Bayesian optimization using the Spearmint (Snoek et al., 2014) package to minimize the average relative loss on a series of benchmark global optimization problems. We tuned a global learning rate, momentum, layer sizes, L2 normalization penalties for each set of weights and dropout rates (Hinton et al., 2012) for each layer. Interestingly, the optimal configuration featured no dropout and very modest L2 normalization. We suspect that dropout, despite having an approximate correction term, causes noise in the predicted mean, resulting in a loss of precision. The optimizer instead preferred to restrict capacity via narrow layer width. Namely, the optimal architecture is a deep and narrow network with 3 hidden layers and approximately 50 hidden units per layer. We use the same architecture throughout all of our empirical evaluation, and this architecture is illustrated in Figure 2(d).
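
For reference, a bare-bones NumPy sketch of a basis network of this shape (three tanh hidden layers of 50 units). The initialization scale and interface are assumptions, and the MAP training by backpropagation with SGD and momentum is omitted.

```python
import numpy as np

def init_basis_net(input_dim, hidden=(50, 50, 50), seed=0):
    """Initialize a small tanh MLP of the kind described above (sizes assumed)."""
    rng = np.random.RandomState(seed)
    sizes = [input_dim] + list(hidden)
    params = [(rng.randn(m, n) * np.sqrt(1.0 / m), np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]
    # A linear output layer is also fit during the MAP training phase.
    w_out, b_out = rng.randn(hidden[-1]) * 0.01, 0.0
    return params, (w_out, b_out)

def basis_features(params, X):
    """phi(x): activations of the last tanh hidden layer, used as the basis."""
    H = X
    for W, b in params:
        H = np.tanh(H @ W + b)
    return H

def map_output(params, out, X):
    """Prediction of the MAP-trained net before the output layer is replaced."""
    w_out, b_out = out
    return basis_features(params, X) @ w_out + b_out
```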

3.1.2. Marginal likelihood versus MAP estimate

The standard empirical Bayesian approach to adaptive basis regression is to maximize the marginal likelihood with respect to the parameters of the basis (see


[Figure 2: predictive mean and standard-deviation envelopes on a 1-D toy problem for (a) TanH units, (b) ReLU + TanH units, and (c) ReLU units; panel (d) shows the DNGO architecture, with tanh hidden layers mapping input x to output y.]

Figure 2. A comparison of the predictive mean and uncertainty learned by the model when using 2(a) only tanh, 2(c) only rectified linear (ReLU) activation functions, or 2(b) ReLUs but a tanh on the last hidden layer. The shaded regions correspond to standard deviation envelopes around the mean. The choice of activation function significantly modifies the basis functions learned by the model. Although the ReLU, which is the standard for deep neural networks, is highly flexible, its unbounded activation leads to extremely large uncertainty estimates. Subfigure 2(d) illustrates the overall architecture of the DNGO model. Dashed lines correspond to weights that are marginalized.

Equation 8), thus taking the model uncertainty into account. However, in the context of our method, this requires evaluating the gradient of the marginal likelihood, which requires inverting a D × D matrix on each update of stochastic gradient descent. As this makes the optimization of the net significantly slower, we take a pragmatic approach and optimize the basis using a point estimate and apply the Bayesian linear regression layer post-hoc. We found that both approaches gave qualitatively and empirically similar results, and as such we in practice employ the more efficient one.

3.1.3. Quadratic Prior

One of the advantages of Bayesian optimization is that it provides natural avenues for incorporating prior information about the objective function and search space. For example, when choosing the boundaries of the search space, a typical assumption has been that the optimal solution lies somewhere in the interior of the input space. However, by the curse of dimensionality, most of the volume of the space lies very close to its boundaries.

As such, we select a mean function η(x) (see Equation 4) to reflect our subjective prior beliefs that the function is coarsely approximated by a convex quadratic function centered in the bounded search region, i.e.,

$$\eta(\mathbf{x}) = \lambda + (\mathbf{x} - \mathbf{c})^\top \boldsymbol{\Lambda} (\mathbf{x} - \mathbf{c})\,, \qquad (9)$$

where c is the center of the quadratic, λ is an offset and Λ a diagonal scaling matrix. We place a Gaussian prior with mean 0.5 (the center of the unit hypercube) on c, horseshoe (Carvalho et al., 2009) priors on the diagonal elements Λ_kk, ∀k ∈ {1, . . . , K}, and integrate out Λ, λ and c using slice sampling over the marginal likelihood.

The horseshoe is a so-called one-group prior for inducing sparsity and is a somewhat unusual choice for the weights of a regression model. Here we choose it because it 1) has support only on the positive reals, leading to convex functions, and 2) has a large spike at zero with a heavy tail, resulting in strong shrinkage for small values while preserving large ones. This last effect is important for handling model misspecification, as it allows the quadratic effect to disappear and become a simple offset if necessary.

3.2. Incorporating input space constraints

Many problems of interest have complex, possibly unknown bounds, or exhibit undefined behavior in some regions of the input space. These regions can be characterized as constraints on the search space. Recent work (Gelbart et al., 2014; Snoek, 2013; Gramacy & Lee, 2010) has developed approaches for modeling unknown constraints in GP-based Bayesian optimization by learning a constraint classifier and then discounting expected improvement by the probability of constraint violation.

More specifically, define c_n ∈ {0, 1} to be a binary indicator of the validity of input x_n. Also, denote the sets of valid and invalid inputs as V = {(x_n, y_n) | c_n = 1} and I = {(x_n, y_n) | c_n = 0}, respectively. Note that D = V ∪ I. Lastly, let Ψ be the collection of constraint hyperparameters.


Experiment            # Evals   SMAC             TPE              Spearmint         DNGO
Branin (0.398)        200       0.655 ± 0.27     0.526 ± 0.13     0.398 ± 0.00      0.399 ± 0.002
Hartmann6 (-3.322)    200       -2.977 ± 0.11    -2.823 ± 0.18    -3.3166 ± 0.02    -3.289 ± 0.002
Logistic Regression   100       8.6 ± 0.9        8.2 ± 0.6        6.88 ± 0.0        6.89 ± 0.04
LDA (On grid)         50        1269.6 ± 2.9     1271.5 ± 3.5     1266.2 ± 0.1      1266.2 ± 0.0
SVM (On grid)         100       24.1 ± 0.1       24.2 ± 0.0       24.1 ± 0.1        24.1 ± 0.1

Table 1. Evaluation of DNGO on global optimization benchmark problems versus scalable (TPE, SMAC) and non-scalable (Spearmint) Bayesian optimization methods. All problems are minimization problems. For each problem, each method was run 10 times to produce error bars.

The modified expected improvement function can be written as

$$a_{\mathrm{CEI}}(\mathbf{x}; \mathcal{D}, \Theta, \Psi) = a_{\mathrm{EI}}(\mathbf{x}; \mathcal{V}, \Theta)\; \mathrm{P}\bigl[c = 1 \mid \mathbf{x}, \mathcal{I}, \Psi\bigr]\,.$$

In this work, to model the constraint surface, we similarly replace the Gaussian process with the adaptive basis model, integrating out the output layer weights:

$$\mathrm{P}\bigl[c = 1 \mid \mathbf{x}, \mathcal{D}, \Psi\bigr] = \int_{\mathbf{w}} \mathrm{P}\bigl[c = 1 \mid \mathbf{x}, \mathcal{D}, \mathbf{w}, \Psi\bigr]\; \mathrm{P}(\mathbf{w}; \Psi)\, d\mathbf{w}\,. \qquad (10)$$

In this case, we use a Laplace approximation to the posterior. For noisy constraints we perform Bayesian logistic regression, using a logistic likelihood function for P[c = 1 | x, V ∪ I, w, Ψ]. For noiseless constraints, we replace the logistic function with a step function.
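
The constraint handling above reduces to weighting expected improvement by the modeled probability that a candidate is valid. A sketch of that weighting is shown below; the probability p_valid is assumed to come from some constraint classifier such as the Bayesian logistic regression of Equation 10, which is not implemented here.

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(mu, sigma, f_best_valid, p_valid, eps=1e-12):
    """Constraint-weighted EI: a_CEI(x) = a_EI(x; valid data) * P[c = 1 | x].

    mu, sigma    -- predictive mean/std of the objective model fit on valid points only
    f_best_valid -- lowest objective value observed among the valid points
    p_valid      -- modeled probability of constraint satisfaction at each candidate,
                    e.g. from Bayesian logistic regression on the basis (Equation 10)
    """
    sigma = np.maximum(sigma, eps)
    gamma = (f_best_valid - mu) / sigma
    ei = sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
    return ei * np.asarray(p_valid)
```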

3.3. Parallel Bayesian Optimization

Obtaining a closed form expression for the joint acquisition function across multiple inputs is intractable in general (Ginsbourger & Riche, 2010). However, a successful Monte Carlo strategy for parallelizing Bayesian optimization was developed in Snoek et al. (2012). The idea is to marginalize over the possible outcomes of currently running experiments when making a decision about a new experiment to run. Following this strategy, we use the posterior predictive distribution given by Equations 4 and 5 to generate a set of fantasy outcomes for each running experiment, which we then use to augment the existing dataset. By averaging over sets of fantasies, we can perform approximate marginalization when computing EI for a candidate point. We note that this same idea works with the constraint network, where instead of computing marginalized EI, we would compute the marginalized probability of violating a constraint.

To that end, given currently running jobs with inputs {x_j}_{j=1}^{J}, the marginalized acquisition function a_MCEI(·; D, Θ, Ψ) is given by

$$a_{\mathrm{MCEI}}(\mathbf{x}; \mathcal{D}, \{\mathbf{x}_j\}_{j=1}^{J}, \Theta, \Psi) = \int a_{\mathrm{CEI}}\bigl(\mathbf{x}; \mathcal{D} \cup \{(\mathbf{x}_j, y_j)\}_{j=1}^{J}, \Theta, \Psi\bigr)\; \mathrm{P}\bigl[\{c_j, y_j\}_{j=1}^{J} \mid \mathcal{D}, \{\mathbf{x}_j\}_{j=1}^{J}\bigr]\; dy_1 \cdots dy_J\, dc_1 \cdots dc_J\,. \qquad (11)$$

When this strategy is applied to a GP, the cost of computing EI for a candidate point becomes cubic in the size of the augmented dataset. This restricts both the number of running experiments that can be tolerated, as well as the number of fantasy sets used for marginalization. With DNGO it is possible to scale both of these up to accommodate a much higher degree of parallelism.

Finally, following the approach of Snoek (2013), we integrate out the hyperparameters of the model to obtain our final integrated acquisition function. For each iteration of the optimization routine we pick the next input, x*, to evaluate according to

$$\mathbf{x}^{*} = \operatorname*{argmax}_{\mathbf{x}}\; a_{\mathrm{MCEI}}\bigl(\mathbf{x}; \mathcal{D}, \{\mathbf{x}_j\}_{j=1}^{J}\bigr)\,, \qquad (12)$$

where

$$a_{\mathrm{MCEI}}\bigl(\mathbf{x}; \mathcal{D}, \{\mathbf{x}_j\}_{j=1}^{J}\bigr) = \int a_{\mathrm{MCEI}}\bigl(\mathbf{x}; \mathcal{D}, \{\mathbf{x}_j\}_{j=1}^{J}, \Theta, \Psi\bigr)\; d\Theta\, d\Psi\,. \qquad (13)$$

4. Experiments

4.1. HPOLib Benchmarks

In the literature, there exist several other methods for model-based optimization. Among these, the most popular variants in machine learning are the random forest-based SMAC procedure (Hutter et al., 2011) and the tree Parzen estimator (TPE) (Bergstra et al., 2011). These are faster to fit than a Gaussian process and scale more gracefully with large datasets, but


(a) “A person riding a wave in the ocean.” (b) “A bird sitting on top of a field.” (c) “A horse is riding a horse.”

Figure 3. Sample test images and generated captions from the best LBL model on the COCO 2014 dataset. The first two captions sensibly describe the contents of their respective images, while the third is offensively inaccurate.

this comes at the cost of a more heuristic treatment of uncertainty. By contrast, DNGO provides a balance between scalability and the Bayesian marginalization of model parameters and hyperparameters.

To demonstrate the effectiveness of our approach, we compare DNGO to these scalable model-based optimization variants, as well as the input-warped Gaussian process method of Snoek et al. (2014), on the benchmark set of continuous problems from the HPOLib package (Eggensperger et al., 2013). As Table 1 shows, DNGO significantly outperforms SMAC and TPE, and is competitive with the Gaussian process approach. This shows that, despite vast improvements in scalability, DNGO retains the statistical efficiency of the Gaussian process method in terms of the number of evaluations required to find the minimum.

4.2. Image Caption Generation

In this experiment, we explore the effectiveness of DNGO on a practical and expensive problem where highly parallel evaluation is necessary to make progress in a reasonable amount of time. We consider the task of image caption generation using multi-modal neural language models. Specifically, we optimize the hyperparameters of the log-bilinear model (LBL) from Kiros et al. (2014) to optimize the BLEU score of a validation set from the recently released COCO dataset (Lin et al., 2014). From our experiments, each evaluation of this model took an average of 26.6 hours.

We optimize learning parameters such as learning rate, momentum and batch size; regularization parameters like dropout and weight decay for word and image representations; and architectural parameters such as the context size, whether to use the additive or multiplicative version, the size of the word embeddings and the

Method                Test BLEU
Human Expert LBL      24.3
Regularized LSTM      24.3
Soft-Attention LSTM   24.3
10 LSTM ensemble      24.4
Hard-Attention LSTM   25.0
Single LBL            25.1
2 LBL ensemble        25.9
3 LBL ensemble        26.7

Table 2. Image caption generation results using BLEU-4 on the Microsoft COCO 2014 test set. Regularized and ensembled LSTM results are reported in Zaremba et al. (2015). The baseline LBL tuned by a human expert and the Soft and Hard Attention models are reported in Xu et al. (2015). Using a DNGO-optimized LBL model achieves the state-of-the-art results on this task. We see that ensembling our top models resulting from the optimization further improves results significantly. We noticed that there were distinct multiple local optima in the hyperparameter space, which may explain the dramatic improvement from ensembling a small number of models.

multi-modal representation size.¹ The final parameter is the number of factors, which is only relevant for the multiplicative model. This adds an interesting challenge, since it is only relevant for half of the hyperparameter space. This gives a total of 11 hyperparameters. Even though this number seems small, this problem offers a number of challenges which render its optimization quite difficult. For example, in order to not lose any generality, we choose broad box constraints for the hyperparameters; this, however, renders most of the volume of the model space infeasible.

¹ Details are provided in the supplementary material.


In addition, quite a few of the hyperparameters are categorical, which introduces severe non-stationarities in the objective surface.

Nevertheless, one of the advantages of a scalable method is the ability to highly parallelize hyperparameter optimization. In this way, high quality settings can be found after only a few sequential steps. To test DNGO in this scenario, we optimize the log-bilinear model with up to 800 parallel evaluations.

Running between 300 and 800 experiments in parallel (determined by cluster availability), we proposed and evaluated approximately 1800 experiments (the equivalent of almost 2000 CPU days) in less than one week. Using the BLEU-4 metric, we optimized the validation set performance, and the best LBL model found by DNGO outperforms recently proposed models using LSTM recurrent neural networks (Hochreiter & Schmidhuber, 1997; Zaremba et al., 2015; Xu et al., 2015) on the test set. This is remarkable, as the LBL is a relatively simple approach. Ensembling this top model with the second and third best (under the validation metric) LBL models resulted in a test-set BLEU score² of 26.7, significantly outperforming the LSTM-based approaches. We noticed that there were distinct multiple local optima in the hyperparameter space, which may explain the dramatic improvement from ensembling a small number of models. We show some qualitative examples of generated captions on test images in Figure 3.

4.3. Deep Convolutional Neural Networks

Finally, we use DNGO on a pair of highly competitive deep learning visual object recognition benchmark problems. We tune the hyperparameters of a deep convolutional neural network on the CIFAR-10 and CIFAR-100 datasets. Our approach is to establish a single, generic architecture, and specialize it to various tasks via individualized hyperparameter tuning. As such, for both datasets, we employed the same architecture, which can be found in Table 3.

For this architecture, we tuned the momentum, learning rate, L2 weight decay coefficients, dropout rates, standard deviations of the random i.i.d. Gaussian weight initializations, and corruption bounds for various data augmentations: global perturbations of hue, saturation and value, random scalings, input pixel dropout and random horizontal reflections.

² We have verified that our BLEU score evaluation is consistent across reported results. We will incorporate other baseline results when the COCO evaluation server is running. We used a beam search decoding for our test predictions with the LBL model.

Layer type         # Filters   Window   Stride
Convolution        96          3 × 3
Convolution        96          3 × 3
Max pooling                    3 × 3    2
Convolution        192         3 × 3
Convolution        192         3 × 3
Convolution        192         3 × 3
Max pooling                    3 × 3    2
Convolution        192         3 × 3
Convolution        192         1 × 1
Convolution        10/100      1 × 1
Global averaging               6 × 6
Softmax

Table 3. Our convolutional neural network architecture. This architecture was chosen to be maximally generic. Each convolution layer is followed by a ReLU nonlinearity.

We optimized these over a validation set of 10,000 examples drawn from the training set, running each network for 200 epochs. See Figure 4 for a visualization of the hyperparameter tuning procedure.

We performed the optimization on a cluster of Intel® Xeon Phi™ coprocessors, with 40 jobs running in parallel. We ran our experiments using a kernel library that has been highly optimized for efficient computation on the Intel® Xeon Phi™ coprocessor. We are going to release this library for public use.

For the optimal hyperparameter configuration found, we then ran a final experiment for 350 epochs on the entire training set, and report its result.

Method               CIFAR-10   CIFAR-100
Maxout               9.38%      38.57%
DropConnect          9.32%      N/A
Network in network   8.81%      35.68%
Deeply supervised    7.97%      34.57%
Tuned CNN            6.37%      27.4%
Human error          6.0%       N/A

Table 4. We use our algorithm to optimize validation set error as a function of various hyperparameters of a convolutional neural network. We report the test errors of the models with the optimal hyperparameter configurations, as compared to current state-of-the-art results.


[Figure 4: planar histogram of CIFAR-100 validation error versus iteration number, with the current best validation error traced in black.]

Figure 4. Validation errors on CIFAR-100 corresponding to different hyperparameter configurations as evaluated over time. These are represented as a planar histogram, where the shade of each bin indicates the total count within it. The current best validation error discovered is traced in black. This projection touches on the exploration-versus-exploitation paradigm of Bayesian optimization, in which the algorithm trades off between treading unexplored parts of the space, and zooming in on parts which have already shown promise.

Following this, our optimal models for CIFAR-10 and CIFAR-100 achieved test errors of 6.37% and 27.4% respectively. These, to our knowledge, represent the state-of-the-art for these datasets. For perspective, the human manual classification error on CIFAR-10 is 6.0%. A comparison to current state-of-the-art results (Goodfellow et al., 2013; Wan et al., 2013; Lin et al., 2013; Lee et al., 2014) can be found in Table 4.

A comprehensive overview of the setup, the tuning and the optimum configuration can be found in the supplementary material.

5. Conclusion

In this paper, we introduced deep networks for global optimization, or DNGO, which enables efficient optimization of noisy, expensive black-box functions. While this model maintains desirable properties of the GP, such as tractability and principled management of uncertainty, it greatly improves its scalability from cubic to linear as a function of the number of observations. We demonstrate that while this model allows efficient computation, its performance is nevertheless competitive with existing approaches.

While adaptive basis regression with neural networks provides one approach to the enhancement of scalability, other models may also present promise. One interesting line of work, for example, would be to introduce a similar methodology by instead employing the sparse Gaussian process as the underlying probabilistic model (Snelson & Ghahramani, 2005; Titsias, 2009; Hensman et al., 2013).

6. Acknowledgements

This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This work used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. We would like to acknowledge the NERSC systems staff, in particular Helen He and Harvey Wasserman, for providing us with generous access to the Babbage Xeon Phi testbed.

The image caption generation computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University. We would like to acknowledge the FASRC staff, and in particular James Cuff, for providing generous access to Odyssey.

Jasper Snoek is a fellow in the Harvard Center for Research on Computation and Society. Kevin Swersky is the recipient of an Ontario Graduate Scholarship (OGS). This work was partially funded by NSF IIS-1421780, the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institute for Advanced Research (CIFAR).

References

Bardenet, Remi, Brendel, Matyas, Kegl, Balazs, and Sebag, Michele. Collaborative hyperparameter tuning. In International Conference on Machine Learning, 2013.

Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.

Bergstra, James S., Bardenet, Remi, Bengio, Yoshua, and Kegl, Balazs. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, 2011.

Brochu, Eric, Brochu, Tyson, and de Freitas, Nando. A Bayesian interactive optimization approach to procedural animation design. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2010.

Bull, Adam D. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, (3–4):2879–2904, 2011.


Buntine, Wray L. and Weigend, Andreas S. Bayesian back-propagation. Complex Systems, 5(6):603–643, 1991.

Calandra, Roberto, Peters, Jan, Rasmussen, Carl Edward, and Deisenroth, Marc Peter. Manifold Gaussian processes for regression. arXiv preprint arXiv:1402.5876, 2014a.

Calandra, Roberto, Peters, Jan, Seyfarth, Andre, and Deisenroth, Marc P. An experimental evaluation of Bayesian optimization on bipedal locomotion. In International Conference on Robotics and Automation, 2014b.

Carvalho, Carlos M., Polson, Nicholas G., and Scott, James G. Handling sparsity via the horseshoe. In AISTATS, 2009.

De Freitas, Joao FG. Bayesian methods for neural networks. PhD thesis, Citeseer, 2003.

de Freitas, Nando, Smola, Alex J., and Zoghi, Masrour. Exponential regret bounds for Gaussian process bandits with deterministic observations. In International Conference on Machine Learning, 2012.

Djolonga, Josip, Krause, Andreas, and Cevher, Volkan. High dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems, 2013.

Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., and Leyton-Brown, K. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In Advances in Neural Information Processing Systems Workshop on Bayesian Optimization in Theory and Practice, 2013.

Feurer, M., Springenberg, T., and Hutter, F. Initializing Bayesian hyperparameter optimization via meta-learning. In AAAI Conference on Artificial Intelligence, 2015.

Garnett, Roman, Osborne, Michael A., and Roberts, Stephen J. Bayesian optimization for sensor set selection. In International Conference on Information Processing in Sensor Networks, 2010.

Gelbart, Michael A., Snoek, Jasper, and Adams, Ryan P. Bayesian optimization with unknown constraints. In Uncertainty in Artificial Intelligence, 2014.

Ginsbourger, David and Riche, Rodolphe Le. Dealing with asynchronicity in parallel Gaussian process based global optimization. http://hal.archives-ouvertes.fr/hal-00507632, 2010.

Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron C., and Bengio, Yoshua. Maxout networks. In International Conference on Machine Learning, 2013.

Gramacy, Robert B. and Lee, Herbert K. H. Optimization under unknown constraints, 2010. arXiv:1004.4027 [stat.ME].

Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.

Hinton, G. E. and van Camp, D. Keeping neural networks simple by minimizing the description length of the weights. In ACM Conference on Computational Learning Theory, 1993.

Hinton, Geoffrey E. and Salakhutdinov, Ruslan. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in Neural Information Processing Systems, pp. 1249–1256, 2008.

Hinton, Geoffrey E., Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Hochreiter, Sepp and Schmidhuber, Jurgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hoffman, Matthew, Brochu, Eric, and de Freitas, Nando. Portfolio allocation for Bayesian optimization. In Uncertainty in Artificial Intelligence, 2011.

Hutter, Frank, Hoos, Holger H., and Leyton-Brown, Kevin. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization 5, 2011.

Jones, Donald R. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21, 2001.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Multimodal neural language models. In International Conference on Machine Learning, 2014.

Krause, Andreas and Ong, Cheng Soon. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, 2011.

Kushner, H. J. A new method for locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86, 1964.

Lazaro-Gredilla, Miguel and Figueiras-Vidal, Anibal R. Marginalized neural network mixtures for large-scale regression. Neural Networks, IEEE Transactions on, 21(8):1345–1351, 2010.

Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick, Zhang, Zhengyou, and Tu, Zhuowen. Deeply supervised nets. In Deep Learning and Representation Learning Workshop, NIPS, 2014.

Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. CoRR, abs/1312.4400, 2013. URL http://arxiv.org/abs/1312.4400.

Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollar, Piotr, and Zitnick, C. Lawrence. Microsoft COCO: Common objects in context. In ECCV 2014, pp. 740–755. Springer, 2014.


Lizotte, Dan. Practical Bayesian Optimization. PhD thesis, University of Alberta, Edmonton, Alberta, 2008.

MacKay, David J. C. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

Mahendran, Nimalan, Wang, Ziyu, Hamze, Firas, and de Freitas, Nando. Adaptive MCMC with Bayesian optimization. In AISTATS, 2012.

Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.

Mockus, Jonas, Tiesis, Vytautas, and Zilinskas, Antanas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2, 1978.

Neal, Radford. Slice sampling. Annals of Statistics, 31:705–767, 2000.

Neal, Radford M. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

Osborne, Michael A., Garnett, Roman, and Roberts, Stephen J. Gaussian processes for global optimization. In Learning and Intelligent Optimization, 2009.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic back-propagation and variational inference in deep latent Gaussian models. In International Conference on Machine Learning, 2014.

Snelson, Edward and Ghahramani, Zoubin. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pp. 1257–1264, 2005.

Snoek, Jasper. Bayesian Optimization and Semiparametric Models with Applications to Assistive Technology. PhD thesis, University of Toronto, Toronto, Canada, 2013.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.

Snoek, Jasper, Swersky, Kevin, Zemel, Richard S., and Adams, Ryan P. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning, 2014.

Srinivas, Niranjan, Krause, Andreas, Kakade, Sham, and Seeger, Matthias. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, 2010.

Swersky, Kevin, Snoek, Jasper, and Adams, Ryan Prescott. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, 2013.

Titsias, Michalis K. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 567–574, 2009.

Wan, Li, Zeiler, Matthew D., Zhang, Sixin, LeCun, Yann, and Fergus, Rob. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, 2013.

Wang, Ziyu, Zoghi, Masrour, Hutter, Frank, Matheson, David, and de Freitas, Nando. Bayesian optimization in high dimensions via random embeddings. In IJCAI, 2013.

Williams, Christopher K. I. Computing with infinite networks. In Advances in Neural Information Processing Systems, 1996.

Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun, Courville, Aaron, Salakhutdinov, Ruslan, Zemel, Richard, and Bengio, Yoshua. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.

Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2015.


A. Convolutional neural network experiment specifications

In this section we elaborate on the details of the network training and the meta-optimization. In the following subsections we elaborate on the hyperparametrization scheme. The priors on the hyperparameters as well as their optimal configurations for the two datasets can be found in Table 5.

A.1. Data augmentation

We corrupt each input in a number of ways. Below we describe our parametrization of these corruptions.

HSV We shift the hue, saturation and value fields of each input by global constants b_H ∼ U(−B_H, B_H), b_S ∼ U(−B_S, B_S), b_V ∼ U(−B_V, B_V). Similarly, we globally stretch the saturation and value fields by global constants a_S ∼ U(1/(1+A_S), 1+A_S), a_V ∼ U(1/(1+A_V), 1+A_V).

Scalings Each input is scaled by some factor s ∼ U(1/(1+S), 1+S).

Translations We crop each input to size 27 × 27, where the window is chosen randomly and uniformly.

Horizontal reflections Each input is reflected horizontally with a probability of 0.5.

Pixel dropout Each input element is dropped independently and identically with some random probability D_0.
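
To make the parametrization above concrete, the sketch below draws one set of corruption constants; the bound names mirror the hyperparameters of Table 5, and applying the HSV shifts, crops and reflections to an actual image is left to an image-processing library.

```python
import numpy as np

def sample_corruption(rng, B_H, B_S, B_V, A_S, A_V, S, D0):
    """Draw one set of corruption constants as parametrized in Section A.1."""
    stretch = lambda A: rng.uniform(1.0 / (1.0 + A), 1.0 + A)
    return {
        "hue_shift": rng.uniform(-B_H, B_H),   # b_H
        "sat_shift": rng.uniform(-B_S, B_S),   # b_S
        "val_shift": rng.uniform(-B_V, B_V),   # b_V
        "sat_scale": stretch(A_S),             # a_S
        "val_scale": stretch(A_V),             # a_V
        "scale": stretch(S),                   # global rescaling factor s
        "flip": rng.rand() < 0.5,              # horizontal reflection
        "pixel_dropout_prob": D0,              # per-element dropout probability
    }

# Example usage with the CIFAR-10 optima from Table 5:
# rng = np.random.RandomState(0)
# corruption = sample_corruption(rng, 31.992, 0.10546, 0.24140,
#                                0.31640, 0.13671, 0.24140, 0.19921)
```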

A.2. Initialization and training procedure

We initialize the weights of each convolution layer m with i.i.d. zero-mean Gaussians with standard deviation σ/√F_m, where F_m is the number of parameters per filter for that layer. We chose this parametrization to produce activations whose variances are invariant to filter dimensionality. We use the same standard deviation for all layers but the input, for which we dedicate its own hyperparameter σ_I, as it oftentimes varies in scale from deeper layers in the network.

We train the model using the standard stochastic gradient descent and momentum optimizer. We use a minibatch size of 128, and tune the momentum and learning rate, which we parametrize as 1 − 0.1^M and 0.1^L respectively. We anneal the learning rate by a factor of 0.1 at epochs 130 and 190. We terminate the training after 200 epochs.
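
A small illustrative sketch of how the tuned quantities M and L map to the optimizer settings described above, including the annealing schedule; the function itself is hypothetical and only mirrors the parametrization in the text.

```python
def sgd_settings(M, L, epoch):
    """Map the tuned hyperparameters M and L to optimizer settings (Section A.2).

    momentum      = 1 - 0.1**M   (M in [0.5, 2] per Table 5)
    learning rate = 0.1**L       (L in [1, 4]), annealed by 0.1 at epochs 130 and 190
    """
    momentum = 1.0 - 0.1 ** M
    lr = 0.1 ** L
    if epoch >= 130:
        lr *= 0.1
    if epoch >= 190:
        lr *= 0.1
    return momentum, lr
```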

We regularize the weights of all layers with weight decay coefficient W. We apply dropout on the outputs of the max pooling layers, and tune these rates D_1, D_2 separately.

A.3. Testing procedure

We evaluate the performance of the learned model by averaging its log-probability predictions on 100 samples drawn from the input corruption distribution, with masks drawn from the unit dropout distribution.

B. Multimodal neural language model hyperparameters

B.1. Description of the hyperparameters

We optimize a total of 11 hyperparameters of the log-bilinear model (LBL). Below we explain what these hyperparameters refer to.

Model The LBL model has two variants: an additive model, where the image features are incorporated via an additive bias term, and a multiplicative model, which uses a factored weight tensor to control the interaction between modalities.

Context size The goal of the LBL is to predict the next word given a sequence of words. The context size dictates the number of words in this sequence.

Learning rate, momentum, batch size These are optimization parameters used during stochastic gradient learning of the LBL model parameters. The optimization over learning rate is carried out in log-space, but the proposed learning rate is exponentiated before being passed to the training procedure.

Hidden layer size Controls the size of the joint hidden representation for words and images.

Embedding size Words are represented by feature embeddings rather than one-hot vectors. This is the dimensionality of the embedding.

Dropout A regularization parameter that determines the amount of dropout to be added to the hidden layer.

Context decay, Word decay L2 regularization on the input and output weights respectively. Like the learning rate, these are optimized in log-space as they vary over several orders of magnitude.

Factors The rank of the weight tensor. Only relevant for the multiplicative model.


Hyperparameter                   Notation   Support of prior   CIFAR-10 Optimum   CIFAR-100 Optimum
Momentum                         M          [0.5, 2]           1.6242             1.3339
Learning rate                    L          [1, 4]             2.7773             2.1205
Initialization deviation         σ_I        [0.5, 1.5]         0.83359            1.5570
Input initialization deviation   σ          [0.01, 1]          0.025370           0.13556
Hue shift                        B_H        [0, 45]            31.992             19.282
Saturation scale                 A_S        [0, 0.5]           0.31640            0.30780
Saturation shift                 B_S        [0, 0.5]           0.10546            0.14695
Value scale                      A_V        [0, 0.5]           0.13671            0.13668
Value shift                      B_V        [0, 0.5]           0.24140            0.010960
Pixel dropout                    D_0        [0, 0.3]           0.19921            0.00056598
Scaling                          S          [0, 0.3]           0.24140            0.12463
L2 weight decay                  W          [2, 5]             4.2734             3.1133
Dropout 1                        D_1        [0, 0.7]           0.082031           0.081494
Dropout 2                        D_2        [0, 0.7]           0.67265            0.38364

Table 5. Specification of the hyperparametrization scheme, and optimal hyperparameter configurations found.

Hyperparameter      Support of prior             Notes                       COCO Optimum
Model               {additive, multiplicative}                               additive
Context size        [3, 25]                                                  5
Learning rate       [0.001, 10]                  Log-space                   0.43193
Momentum            [0, 0.9]                                                 0.23269
Batch size          [20, 200]                                                40
Hidden layer size   [100, 2000]                                              441
Embedding size      {50, 100, 200}                                           100
Dropout             [0, 0.7]                                                 0.14847
Word decay          [10^-9, 10^-3]               Log-space                   2.98456 × 10^-7
Context decay       [10^-9, 10^-3]               Log-space                   1.09181 × 10^-8
Factors             [50, 200]                    Multiplicative model only   -

Table 6. Specification of the hyperparametrization scheme, and optimal hyperparameter configurations found for the multimodal neural language model. For parameters marked log-space, the log is given to the Bayesian optimization routine and the result is exponentiated before being passed into the multimodal neural language model for training. Square brackets denote a range of parameters, while curly braces denote a set of options.