Batch Bayesian Optimization
by
Nathan Hunt
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2020
© Massachusetts Institute of Technology 2020. All rights reserved.
Author: Department of Electrical Engineering and Computer Science, January 10, 2020

Certified by: David K. Gifford, Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students
Batch Bayesian Optimization
by
Nathan Hunt
Submitted to the Department of Electrical Engineering and Computer Science on January 10, 2020, in partial fulfillment of the requirements for the degree of Master of Science
Abstract
Bayesian optimization is a useful technique for maximizing expensive, unknown functions that employs an acquisition function to determine what unseen input point to query next. In many real-world applications, batches of input points can be queried simultaneously for only a small marginal cost compared to querying a single point. Most classical acquisition functions cannot be used for batch acquisition, and thus batch acquisition strategies are required. Several such strategies have been developed in the past decade. We review and compare batch acquisition strategies in a variety of settings to assist practitioners in selecting appropriate batch acquisition functions and facilitate further research in this area.
Thesis Supervisor: David K. Gifford
Title: Professor of Electrical Engineering and Computer Science
Acknowledgments
As I began graduate school, I expected it to be easier in some ways than my un-
dergraduate education. However, I soon found being a graduate student to be much
more taxing as I immediately dealt with the need to find funding to even be able to
continue. After this was resolved, I struggled to find good research questions and at
times felt demoralized when even approaches I had had hope for failed to yield useful
results. I'm thus all the more grateful for the support that I've received during these
years.
I am greatly indebted to my advisor, Dave Gifford, for the financial, academic, and
moral support that he's given me throughout this work. Though the most stressful
time of my week was often when I was preparing to meet with him, and worried about
the slow or nonexistent progress I had made since the last week, Dave always showed
great patience, never demeaned my work, and provided useful discussions on what
to do next. I'm very grateful that he made finishing this thesis possible. I'm also
grateful for the other members of the Gifford lab who I've been able to work with,
especially Sid Jain for his help developing new Bayesian optimization methods.
I'm also thankful to Marzyeh Ghassemi and Pete Szolovits for their mentoring
in my research during my last year of undergrad. In Pete's group, I felt like a real
researcher, not just an undergrad. Marzyeh's persistent optimism raised my spirits
through many setbacks, and she convinced me that I could do real research worth
sharing. I doubt I would have made it into grad school without their support. I greatly
appreciate the interactions I was able to have with the other members of Pete's group
as well, especially Harini Suresh, Tristan Naumann, and Matthew McDermott.
I'm grateful for my siblings, mother, and other family members for their support
during these years and for sharing their lives with me. I'm also grateful for my friends,
especially for reminding me how much there is to life outside of earning degrees.
I am especially grateful for my wife, Rachel Mok, and our children Sammy and
Grace. I'm thankful I can come home to hugs, singing, reading stories, building,
and being the voice of every stuffed animal, for how they bring me back to the
greater purpose of my life. I appreciate that Rachel listens, at least most of the time,
when I want to talk about papers, algorithms, code, and what shapes have infinite
symmetries. I'm grateful for her patience with the imperfect partner that I am and
the journey that we undertake together. I'm thankful that she believes in my dreams,
sometimes more than I do, and sacrifices so we might achieve them. The love, time,
and effort that she devotes to our family is one of my greatest blessings.
Finally, I owe a debt of gratitude for the divine assistance that has carried me
through these years.
Contents

1 Introduction
  1.1 Related Work

2 Background
  2.1 Notation
  2.2 Models
    2.2.1 Gaussian Processes
    2.2.2 Other Models
  2.3 Sequential Acquisition Functions
    2.3.1 Probability of Improvement
    2.3.2 Expected Improvement
    2.3.3 Upper Confidence Bound
    2.3.4 Thompson Sampling
    2.3.5 Other

3 Batch Acquisition Functions
  3.1 q-points Expected Improvement (qEI)
    3.1.1 Kriging Believer (KB)
    3.1.2 Constant Liar (CL)
  3.2 Local Penalization (LP)
  3.3 Thompson Sampling
  3.4 Batch Upper Confidence Bound (BUCB)
  3.5 Upper Confidence Bound with Pure Exploration (UCB-PE)
  3.6 Distance Exploration (DE)
  3.7 Budgeted Batch Bayesian Optimization (B3O)
  3.8 k-means Batch Bayesian Optimization (KMBBO)

4 Experiments
  4.1 Objective Functions
    4.1.1 Test Functions
    4.1.2 Dataset Maximization Tasks
    4.1.3 Simulations
  4.2 Baselines
  4.3 Implementation Details
  4.4 Metrics

5 Results
  5.1 Low-dimensional, Continuous Objectives
  5.2 Dataset Maximization Objectives
  5.3 High-dimensional, Continuous Objectives

6 Conclusion

A Test Functions

B Extended Results
List of Figures

5-1 Median average gap metric for synthetic objective functions.
5-2 Median average gap metric for discrete objective functions.
5-3 Median average gap metric for high-dimensional objective functions.
A-1 The Gramacy function.
A-2 The Branin function.
A-3 The Alpine2 function.
A-4 The Bohachevsky function.
A-5 The Goldstein function.
B-1 Abalone
B-2 Alpine2
B-3 Bohachevsky
B-4 Branin
B-5 DNA Binding: ARX
B-6 DNA Binding: BCL6
B-7 DNA Binding: CRX
B-8 DNA Binding: EGR2
B-9 DNA Binding: ESX1
B-10 Goldstein
B-11 Gramacy
B-12 Hartmann3
B-13 Hartmann6
B-14 Robot Push
B-15 Rover Trajectory
List of Tables

4.1 Overview of the objective functions. x_i denotes the i-th dimension of the input point x. Some of these functions have been negated so that they are maximization problems when traditionally they would be minimizations. Alpine2 is defined for any number of dimensions, but we use it in 2 dimensions. Formulas that are excessively long are deferred to the appendix.

5.1 Gap metric performance of random baselines averaged over all objective functions of each class. AF = acquisition function.

5.2 Gap metric performance of sequential acquisition functions averaged over all objective functions of each class. AF = acquisition function, NN = neural network ensemble, NNM = neural network ensemble with maximizing overall diversity method.

5.3 Gap metric performance of batch acquisition functions averaged over all objective functions of each class. AF = acquisition function, NN = neural network ensemble, NNM = neural network ensemble with maximizing overall diversity method.
List of Algorithms

1 Bayesian Optimization
2 Kriging Believer
3 Constant Liar
4 Local Penalization
5 Thompson Sampling
6 Upper Confidence Bound with Pure Exploration
7 Distance Exploration
8 Budgeted Batch Bayesian Optimization
9 k-means Batch Bayesian Optimization
Chapter 1
Introduction
Most research in Bayesian optimization has been in the sequential setting where
the objective function is queried at a single point at a time. While this work has
been useful for many applications, there are also many domains where functions of
interest cannot reasonably be optimized sequentially. Instead, batches of points need
to be queried at once. Most sequential Bayesian optimization techniques are not
directly applicable in the batch setting. However, research interest in batch Bayesian
optimization has grown recently, and many batch acquisition functions have been
proposed in the past decade. Not all of these, though, have been compared with
each other. Here we introduce and compare all major batch Bayesian optimization
acquisition functions. In Bayesian optimization in particular, because the functions to
be optimized are expensive to evaluate, a practitioner is likely unable to try multiple
acquisition functions to determine which is best for their task. A review is thus useful
both to those hoping to apply batch Bayesian optimization in their work as well as
to facilitate further research.
Our contributions are as follows:
1. We describe many of the existing acquisition functions for batch Bayesian optimization. Having this information in one source is especially useful for comparing the different strategies used for batch Bayesian optimization and considering which would apply best in a given setting or what new approaches could be attempted next.
2. We test several batch acquisition functions on a variety of tasks and discuss
their relative performance to help understand when certain functions will be
most useful.
3. We provide a modular, extensible software package which implements the ac-
quisition functions used in our experiments.
In section 1.1 we discuss existing summaries of Bayesian optimization and highlight the contributions of this review. Chapter 2 provides an introduction to Bayesian optimization, defining notation, describing the Gaussian process and other models, and presenting multiple sequential acquisition functions. Chapter 3 reviews the batch acquisition functions studied in this work. Chapter 4 describes the experiments conducted, including the various objective functions and implementation details. Chapter 5 showcases and discusses the results of these experiments. Chapter 6 concludes, summarizing the findings of this work and suggesting areas for future research in batch acquisition functions.
1.1 Related Work
Three well-known reviews of Bayesian optimization have already been written in the past couple of decades. [29] examined existing Bayesian optimization approaches, including analysis of their strengths and weaknesses, by applying them to one- and two-dimensional test functions. [4] reviewed a few sequential acquisition functions, showcased a couple of applications of Bayesian optimization, and provided general advice for practitioners. They mention briefly that work in batch Bayesian optimization has begun. [46] review several sequential acquisition functions, including more recent ones such as entropy search, GP-hedge, and entropy search portfolio.
They also discuss practical considerations in Bayesian optimization such as approxi-
mation methods for Gaussian processes and methods of optimizing utility functions.
Batch Bayesian optimization receives a subsection in this paper where the constant
liar and Kriging believer strategies, which will be explained later, are described and a
few other approaches are mentioned by reference. Any of these papers, and especially
the latest one, can be consulted for a more in-depth review of sequential Bayesian
optimization. This review sets itself apart by focusing specifically on batch Bayesian
optimization, especially the acquisition functions necessary for this.
[44] provide a review of probabilistic models, most involving neural networks, for
contextual bandit problems. They exclusively use the Thompson sampling acquisition
function, which can be used in a sequential or batch setting. These methods could be
applied to Bayesian optimization tasks as well. In order to look more fully at different
acquisition functions, we restrict ourselves to just three models: exact Gaussian pro-
cesses, one approximation method for Gaussian processes, and one model involving
neural networks.
Chapter 2
Background
The field of Bayesian optimization is concerned with finding an optimum (which
we will always call a maximum; any minimization problem can easily be converted
to a maximization problem through negation) of a function f which has no known
analytic form or gradients and is costly to evaluate. For example, f may represent
a physical or biological system where no model exists. As such, the optimization of f should use as few evaluations as possible. Additionally,
observations of f may be noisy. These constraints preclude traditional optimization
techniques. Bayesian optimization approaches this task by learning a distribution
over functions based on previously observed data and using this to determine which
points to query f at next. This decision must balance exploration (querying points
with higher uncertainty) and exploitation (querying points that are expected to have
high values). There are two main components to a Bayesian optimization method: a
model, which determines the family of distributions over functions, and an acquisition
function, which determines how new query points are selected. A general algorithm
for Bayesian optimization is given in Algorithm 1. After describing notation, we
will discuss specific models, including what creating and updating them entails, and
sequential acquisition functions.
Algorithm 1: Bayesian Optimization
Input: f: function to be optimized
       T: number of iterations
       D: initial dataset {(x_1, y_1), ..., (x_n, y_n)}
model = create_model(D)
for t in 1, ..., T:
    x = acquisition_function(model, D)
    y = f(x)
    model.update(x, y)
    D = D ∪ {(x, y)}
return argmax_{x ∈ X} model.expected_value(x)
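To make the control flow concrete, here is a minimal runnable sketch of Algorithm 1 in Python. The helper names (`create_model`, `update_model`, `acquisition`) simply mirror the pseudocode; the toy "model" and the random-search acquisition in the demo are placeholders for illustration only, not any method discussed in this thesis.

```python
import numpy as np

def bayesian_optimization(f, acquisition, create_model, update_model, D, T):
    """Generic Bayesian-optimization loop mirroring Algorithm 1.
    D is a list of (x, y) pairs; T is the number of iterations."""
    model = create_model(D)
    for t in range(T):
        x = acquisition(model, D)          # choose the next query point
        y = f(x)                           # evaluate the expensive objective
        model = update_model(model, x, y)  # condition the model on (x, y)
        D.append((x, y))
    # Return the incumbent: the best observed point.
    return max(D, key=lambda pair: pair[1])

# Toy demo: maximize f(x) = -(x - 0.3)^2 on [0, 1] with a trivial "model"
# (the data itself) and a random-search acquisition as a stand-in.
rng = np.random.default_rng(0)
f = lambda x: -(x - 0.3) ** 2
best_x, best_y = bayesian_optimization(
    f,
    acquisition=lambda model, D: rng.uniform(0.0, 1.0),
    create_model=lambda D: list(D),
    update_model=lambda model, x, y: model + [(x, y)],
    D=[(0.0, f(0.0))],
    T=50,
)
```

Real implementations differ only in the model and acquisition plugged into this loop, which is why the two components can be studied independently.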
2.1 Notation
In order to discuss Bayesian optimization in greater detail, we establish the following
notation.
• f is the objective function. Noise may be added to observations of it (see below).

• X: the input space of f. The output space is R, or some subset thereof, so that f : X → R. X is usually a box-constrained subset of R^d, where d is the dimensionality of X.

• y = f(x) + ε: observations of f at x take this form, where ε is noise that, generally, is assumed to be zero-mean and independent of x. We make these assumptions as well, so E[f(x) + ε] = f(x). The most common case in the literature is that ε is normally distributed.

• x* = argmax_{x ∈ X} f(x) is the optimal point.

• y* = E[f(x*) + ε] = f(x*) is the value of the optimal point.

• D = {(x_1, y_1), ..., (x_n, y_n)} is the previously observed dataset. It is populated with new data points as they are acquired. We will also sometimes use X = {x_1, ..., x_n} and Y = {y_1, ..., y_n}.

• f̂ = p(f | D) is the current distribution over functions. Note that this involves conditioning on all known data. f̂(x) is the distribution over function values at the particular input location x. Bayesian optimization seeks to improve f̂ as an approximation of f by choosing the best input points at which to observe f.

• μ(x), σ²(x) are the mean and variance, respectively, of f̂(x).

• x̂* = argmax_{x ∈ X} E[f̂(x)] is the expected location of x* under f̂.

• ŷ* = E[f̂(x̂*)] is the expected value of x̂*.

• a(f̂, X) is an acquisition function which selects a point x ∈ X to query f at next. Acquisition functions are often written as α(x) in the literature, and the acquired point is then argmax_{x ∈ X} α(x). Not all acquisition functions can easily be written this way, though, so we use a more general notation. We think of functions which score inputs as utility functions instead of acquisition functions; the acquisition function is then a wrapper around the utility function which includes the maximization: a(f̂, X) = argmax_{x ∈ X} u(f̂, x), where, for example, we could have u(f̂, x) = E[f̂(x)]. Some acquisition functions may also make use of D, but this is less common, so we omit it from the notation for brevity unless it is required.

• T is the number of iterations of Bayesian optimization to run and t is the index for the current iteration.

• b is the batch size when we are acquiring multiple points at once.

• x^(b) is the set of points in the current batch. We will use this for batch acquisition functions which generate a batch point by point.
2.2 Models
The model in Bayesian optimization determines, given the current dataset, what
distribution we have over possible functions that could have generated that data. We
want to have a flexible model so that it can fit many possible objective functions.
It should also provide reasonable uncertainty estimates so that we can acquire new
points intelligently.
2.2.1 Gaussian Processes
The most commonly used and studied model for Bayesian optimization is the Gaus-
sian process (GP). A GP is an infinite-dimensional generalization of a Gaussian dis-
tribution with the property that any finite subset of points in its input space is jointly
Gaussian. GPs are defined by their mean and covariance/kernel functions, μ(x) and
k(x, x'), respectively.
Given observations X, Y, the posterior distribution of a Gaussian process at a
point x is
f̂(x) ~ N(m(x), σ²(x))   (2.1)

m(x) = μ(x) + k(X, x)^T [K + σ_ε² I]^{-1} (Y - μ(X))   (2.2)

σ²(x) = k(x, x) - k(X, x)^T [K + σ_ε² I]^{-1} k(X, x)   (2.3)

where K_ij = k(x_i, x_j), k(X, x) = (k(x, x_1), ..., k(x, x_n)), and σ_ε² is the variance
of the noise (assumed additive Gaussian). Intuitively, the predicted mean value for a
new point is our prior belief about the mean at that location (p(x)) plus a posterior
correction term that accounts for the correlation between the new point and all points
with known values and the difference between the expected and actual values at those
points. The case for the variance is similar. Note that the variance can only decrease
from the prior value as more data is observed and that the variance does not depend
on Y or p. In addition to marginal posterior distributions at single points, a GP can
give us the joint posterior distribution at any number of points. See [43] for a full
treatment of GPs from which this material is drawn.
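The posterior mean and variance above amount to a few lines of linear algebra. Below is an illustrative numpy sketch assuming a zero prior mean and an RBF kernel with unit output scale; a production implementation would instead use a Cholesky factorization for stability and speed.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X, Y, Xnew, noise_var=1e-6, lengthscale=1.0):
    """Posterior mean and variance at Xnew (Eqs. 2.2-2.3, zero prior mean)."""
    K = rbf_kernel(X, X, lengthscale) + noise_var * np.eye(len(X))
    k_star = rbf_kernel(X, Xnew, lengthscale)   # k(X, x) for each new point
    alpha = np.linalg.solve(K, Y)               # [K + noise_var I]^{-1} Y
    mean = k_star.T @ alpha
    v = np.linalg.solve(K, k_star)
    var = 1.0 - np.sum(k_star * v, axis=0)      # k(x, x) = 1 for this kernel
    return mean, var

X = np.array([[0.0], [1.0], [2.0]])
Y = np.sin(X).ravel()
mean, var = gp_posterior(X, Y, np.array([[1.0], [5.0]]))
# At an observed point the mean is near sin(1) and the variance near 0;
# far from the data the posterior reverts to the prior (mean ~0, variance ~1).
```

This also illustrates the remark above: the variance at the observed point has shrunk from the prior, and the variance computation never touches Y.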
Mean Functions The mean function is commonly μ(x) = c for some constant c, which is often 0. Away from the observed data, the mean value predicted by the GP will tend to μ(x), so this may be set to the safest "default" value.
Kernel Functions Perhaps the most well-known kernel function is the radial basis
function (RBF), also known as the squared exponential (SE) kernel. This takes the
form
k_SE(x, x') = exp(-||x - x'||²₂ / (2ℓ²))

where ℓ is a lengthscale hyperparameter. The kernel function determines important properties of the possible functions. For example, using an RBF kernel leads to a distribution over infinitely-differentiable functions. This may be an unrealistic as-
sumption in practice. Another popular family of kernels, the Matérn kernels, addresses this issue. Their general form is

k_Matern(x, x') = (2^{1-ν} / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ)

where r = ||x - x'||₂, Γ is the gamma function, and K_ν is a modified Bessel function. The Matérn kernel is ⌊ν⌋-times differentiable. Common values for ν are 5/2 and 3/2, for which the kernel is much simpler. For ν = 5/2, it is

k_Matern52(x, x') = (1 + √5 r / ℓ + 5r² / (3ℓ²)) exp(-√5 r / ℓ)

Generally ν is chosen ahead of time and not modified based on the data, though ℓ may be. There may also be an output-scale parameter σ_f², multiplied at the front of either of these kernel functions.
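Both kernels vectorize naturally in numpy. The sketch below is illustrative (the names `k_se` and `k_matern52` are ours), with the lengthscale and optional output scale exposed as hyperparameters as described above.

```python
import numpy as np

def k_se(X1, X2, lengthscale=1.0, outputscale=1.0):
    """Squared-exponential (RBF) kernel: s * exp(-r^2 / (2 l^2))."""
    r = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return outputscale * np.exp(-0.5 * (r / lengthscale) ** 2)

def k_matern52(X1, X2, lengthscale=1.0, outputscale=1.0):
    """Matern-5/2 kernel: s * (1 + sqrt5 r/l + 5 r^2/(3 l^2)) exp(-sqrt5 r/l)."""
    r = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1) / lengthscale
    return outputscale * (1.0 + np.sqrt(5) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5) * r)

X = np.array([[0.0], [1.0]])
K_se, K_m = k_se(X, X), k_matern52(X, X)
# Both kernels equal the output scale at zero distance and decay with distance.
```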
Training A Gaussian process is a non-parametric model; to add in newly observed
datapoints, one only needs to update K with a new row and column and use the
updated X, Y when computing posterior distributions at new points. In practice, it
is often useful to estimate hyperparameter values (such as lengthscales) based on the
data. A couple of approaches for this are 1) learn parameters by gradient descent
with the loss function being the marginal log-likelihood of the observed data or 2)
set a prior for the hyperparameters and sample from the posterior via MCMC. 1) is
more commonly used because 2) requires the extra computation of averaging across
the predictions of multiple GPs (with distinct hyperparameters), though [47] found 2)
yielded better results. The hyperparameters may be re-estimated or resampled every
acquisition, only once a certain number of points have been acquired, or whenever an
observation is added which was unlikely based on the current model.
2.2.2 Other Models
Though Gaussian processes are the most commonly used models for Bayesian op-
timization, some other models have been used as well with different strengths and
weaknesses. Because of the matrix inversions in Equations 2.2 and 2.3, exact inference in Gaussian processes takes O(n³) time when n datapoints have been observed.
This quickly becomes intractable as the dataset grows larger. Additionally, Gaussian
processes may be less preferable when the input data points have features which, indi-
vidually, are not very informative. For example, if the inputs are images, it is difficult
to express their similarity (and thus covariance) in terms of elementwise differences
between their pixels; images can have essentially the same content but different pixel
values in corresponding locations (e.g. due to translation). Other models may thus
be more scalable or better suited to particular types of data. However, much research
has also gone into approximations to make GPs scale better [56, 42, 15] or adaptations
to increase their ability to extract useful features from raw data [10, 57].
Bayesian Neural Networks In a Bayesian neural network (BNN), a prior is placed
on the network parameters (weights) and then (approximate) posterior inference is
done given a training dataset. It is common to use independent normal distributions for each parameter. This leads to only twice as many parameters as a standard neural network, whereas the more expressive alternative of a full multivariate Gaussian
distribution would square the number of parameters. Given the prior weight distribu-
tions p(w) and a training dataset D, we would ideally like to compute the posterior as p(w|D) = p(D|w) p(w) / ∫ p(D|w) p(w) dw. This is intractable in practice, so approximate inference
techniques are used instead. One common approach is stochastic variational inference
where the parameters of a set of distributions are optimized through gradient descent
to minimize their divergence to the true posterior distributions [19, 3]. Other ap-
proaches to training Bayesian neural networks include probabilistic backpropagation
[21] and approximations using dropout [14].
Unlike with GPs, it is not clear exactly how one should "add" a point to a BNN.
This requires training on the new point, likely in combination with previous points
to avoid degradation at other areas of the input space. However, there isn't a clear
way to determine how much additional training ought to be done. This training
may take place after every acquisition or every few acquisitions. For a BNN, the
hyperparameters of the model are likely kept fixed throughout Bayesian optimization
because the parameters themselves already need continual updating. One weakness
of a BNN compared with a GP is the larger number of such hyperparameters (batch
size, learning rate, number and size of layers, types of activation functions, etc.) and
the larger number of parameters requiring training. Because of the generally limited
amount of training data, simple architectures with only one or two hidden layers tend
to be used. Previous work in Bayesian optimization using BNNs includes [48], [50],
and [23].
Ensembles Ensembles of models may also be used to get a distribution over the func-
tion values at each point. This can be done by parametrizing a normal distribution
for each point with the empirical mean and variance of the predictions from each
ensemble member. Random forests (ensembles of regression trees) have been used for
Bayesian optimization and are especially useful for their ability to deal with condi-
tional input spaces where the value in one input dimension affects whether the values
in certain other dimensions matter at all [25].
Recently, ensembles of neural networks have also been used as a probabilistic
model [35]. To the best of our knowledge, these have not previously been explored
for Bayesian optimization. We consider this as a simple alternative model for cases
where exact GPs are intractable. Such ensembles face the same difficulties as BNNs
do, though, in determining how much training is required to "add" points to the model
and many hyperparameter choices.
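The ensemble-as-probabilistic-model idea can be sketched in a few lines: each member makes a point prediction, and the empirical mean and variance parametrize a normal distribution. The three lambda "members" below are hypothetical stand-ins for trained networks or trees.

```python
import numpy as np

def ensemble_posterior(members, x):
    """Parametrize a normal distribution at x with the empirical mean and
    (unbiased) variance of the ensemble members' point predictions."""
    preds = np.array([m(x) for m in members])
    return preds.mean(), preds.var(ddof=1)

# Hypothetical ensemble of three regressors that disagree at x = 0.5.
members = [lambda x: x, lambda x: x + 0.2, lambda x: x - 0.2]
mu, var = ensemble_posterior(members, 0.5)  # mu = 0.5, var = 0.04
```

The resulting (mu, var) pairs can then be fed to any of the sequential utility functions in the next section exactly as GP posterior moments would be.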
2.3 Sequential Acquisition Functions
A variety of different sequential acquisition functions have been developed for Bayesian
optimization which differ in how they balance exploration vs exploitation as well as
in other properties such as computational complexity. Because existing review papers
already discuss these, we only review those sequential acquisition functions that will
be used in the batch acquisition functions discussed later. Some others are mentioned
briefly by reference, though, as a starting point for the interested reader.
2.3.1 Probability of Improvement
The maximum probability of improvement (PI) utility function, suggested by [34], is
u_PI(x) = P(f̂(x) > y*)

As the name suggests, maximizing u_PI selects the point which has the maximum
probability of improving on the current best value. If the model is a GP, this has the
closed form
u_PI-GP(x) = Φ((μ(x) - y*) / σ(x))

where Φ is the standard normal cumulative distribution function. As PI is fairly exploitative, it is often implemented in practice as

u_PI'(x) = P(f̂(x) > y* + ε)

for some constant ε. [34] recommended using a predefined schedule for ε which starts
higher to favor exploration and then decreases to favor exploitation, though [36] found
that such a schedule yielded no improvement over a constant value for a suite of test
functions.
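Assuming a Gaussian posterior at each candidate point (as with a GP), the closed form is a one-liner. In this illustrative sketch, `y_best` plays the role of y* and `eps` the exploration constant ε; the names are ours, not from any library.

```python
from scipy.stats import norm

def pi_utility(mean, std, y_best, eps=0.0):
    """Closed-form probability of improvement: Phi((mu - y* - eps) / sigma)."""
    return norm.cdf((mean - y_best - eps) / std)

p_even = pi_utility(1.0, 1.0, 1.0)              # mean equals the incumbent: PI = 0.5
p_better = pi_utility(2.0, 1.0, 1.0)            # mean exceeds the incumbent: PI > 0.5
p_explore = pi_utility(2.0, 1.0, 1.0, eps=0.5)  # a positive eps tempers exploitation
```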
2.3.2 Expected Improvement
One issue with PI is that it doesn't take into account the magnitude of the improve-
ment, only the probability that any improvement is made. Thus using PI might
acquire points that are very likely to have a very slightly larger value. We can define
the improvement of a point as I(x) = max(f(x) - y*, 0). The expected improvement
(EI) utility function, proposed by [37], thus uses both the probability and magnitude
of improvement:
u_EI(x) = E_f̂[I(x)] = E_{x' ~ f̂(x)}[max(x' - y*, 0)]
As with PI, this can be generalized to
u_EI'(x) = E_{x' ~ f̂(x)}[max(x' - y* - ε, 0)]

where ε, as in PI, lets one adjust the exploration/exploitation tradeoff. For a GP,
this has the closed form
u_EI'-GP(x) = (μ(x) - y* - ε) Φ((μ(x) - y* - ε) / σ(x)) + σ(x) φ((μ(x) - y* - ε) / σ(x))

where φ is the standard normal probability density function.
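The standard GP closed form for EI is equally direct to implement. As with the PI sketch, `y_best` and `eps` are our illustrative names for the incumbent value and the exploration constant.

```python
from scipy.stats import norm

def ei_utility(mean, std, y_best, eps=0.0):
    """Closed-form expected improvement for a Gaussian posterior N(mean, std^2)."""
    z = (mean - y_best - eps) / std
    return (mean - y_best - eps) * norm.cdf(z) + std * norm.pdf(z)

e0 = ei_utility(0.0, 1.0, 0.0)  # mean at the incumbent: EI = phi(0) ~ 0.3989
e1 = ei_utility(1.0, 1.0, 0.0)  # a more promising mean gives larger EI
```

Note that, unlike PI, EI is nonzero even when the mean sits below the incumbent, provided the posterior standard deviation is large; this is how it rewards magnitude as well as probability of improvement.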
2.3.3 Upper Confidence Bound
[9] proposed the sequential design for optimization (SDO) utility function
u_SDO(x) = μ(x) + b σ(x)
for some constant b (they used b = 2 or b = 2.5). [51] later determined a prin-
cipled schedule for the constant which allowed them to prove convergence of their
utility function, which they called the Gaussian process upper confidence bound, for
commonly-used kernels. The utility function is
u_UCB(x) = μ(x) + √β_t σ(x).
See [51] for details on setting β_t and the associated regret bounds. In practice, one
may also use a fixed value for the constant though the theoretical guarantees no longer
hold.
2.3.4 Thompson Sampling
[52] introduced Thompson sampling (TS) which can be used as a sequential or batch
acquisition function. TS is a stochastic acquisition function where the probability
of acquiring a point is the likelihood of that point being the maximum. This can
be done by sampling a function from the posterior distribution of the model and
then acquiring the point which maximizes that sampled function. For a BNN, this
would involve sampling a set of weights and then finding an input with maximum
value as given by the (non-Bayesian) neural network parametrized by those weights.
For a GP, a sampled function would be infinite-dimensional, so approximations are
required to use TS. Using Bochner's theorem about the duality between a kernel and
its Fourier transform, we can sample from the spectral density of a kernel to get
a finite-dimensional approximation of a sampled function and find the input which
maximizes this. Another approach is to sample from the posterior distribution of the
Gaussian process at some finite set of points and return that point which has the
maximum sampled value.
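The last approach described, sampling the GP posterior jointly at a finite set of candidate points and acquiring the argmax, can be sketched as follows. The posterior mean and covariance in the demo are made-up values for illustration; in practice they would come from Equations 2.2 and 2.3.

```python
import numpy as np

def thompson_acquire(post_mean, post_cov, candidates, rng):
    """Draw one joint sample from the GP posterior over a finite candidate
    set and acquire the candidate where the sampled function is largest."""
    sample = rng.multivariate_normal(post_mean, post_cov)
    idx = int(np.argmax(sample))
    return candidates[idx], idx

rng = np.random.default_rng(0)
post_mean = np.array([0.0, 1.0, 0.2])  # hypothetical posterior at 3 candidates
post_cov = 1e-6 * np.eye(3)            # near-zero uncertainty, for illustration
x_next, idx = thompson_acquire(post_mean, post_cov, np.array([-1.0, 0.0, 1.0]), rng)
# With near-zero variance the sample tracks the posterior mean, so idx == 1.
```

With realistic posterior variances the acquired index varies from call to call, which is precisely the stochasticity that makes TS usable for batch acquisition.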
2.3.5 Other
The entropy search family of acquisition functions - including predictive entropy
search [22], output-space entropy search [24], and max-value entropy search [54] -
use information theoretic approaches to non-myopically acquire points. Approxima-
tions are generally required to implement these in practice. GP-hedge is a weighted
combination of acquisition functions where the weights depend on how well each
acquisition function does [5].
Chapter 3
Batch Acquisition Functions
Many batch acquisition functions can be grouped into a few categories based on their
overall strategy. We define six categories:
1. value-estimators: these define some function g which is used to estimate function
values for points in the batch. A batch is acquired by repeatedly maximizing
any sequential utility function u to get x = argmax_{x ∈ X} u(x) and then updating
the model with {x, g(x)}. Once the true objective function values for the batch
are acquired, the estimates g(x) are replaced by the observed f(x).
2. explorers: these acquire an initial point using a sequential acquisition function
and then fill the rest of the batch with points that are selected purely to explore
the input space.
3. stochastic: a stochastic acquisition function can yield multiple, distinct points,
so it may be used in the batch setting without modification. An example of this
is multiple applications of Thompson Sampling.
4. penalizers: these apply a penalty directly to the utility function (instead of
updating the model with an estimated value) that discourages acquisition of
points that are too similar.
5. mode-finders: these acquire points at modes of a utility function.
6. others: acquisition functions that use another strategy.
3.1 q-points Expected Improvement (qEI)
The qEI acquisition function aims to acquire a batch of q points which jointly max-
imize the EI utility. Many papers have proposed different methods of computing or
approximating qEI. The idea of qEI was first proposed in [45] as q-step EI but was
not used or developed any further. [16, 17] further developed qEI in two overlapping
papers. In particular, they developed an analytic form for 2-EI; a Monte Carlo
(MC) method for estimating qEI in the general case; and two classes of heuristics,
Kriging believer (KB) and constant liar (CL), which can be used to more efficiently
approximate qEI. KB and CL are discussed in Sections 3.1.1 and 3.1.2. Later, other
papers suggested different strategies of maximizing qEI using adaptive MC approaches
[26, 27]. [13] determined an unbiased estimator of the gradient of qEI and used this
to do stochastic gradient ascent to find a batch of points that maximizes qEI locally.
They used multiple random starts to attempt to find a global maximizer. Finally, [7]
developed a method for maximizing qEI analytically that works well for small values
of q (e.g. q < 10). They also proposed a new variant of CL.
We will focus on the initial work of [16, 17]. As in other places, we will use b for
the batch size except in the established name qEI. The desired batch for qEI is

x^{(b)*} = argmax_{x^{(b)} ∈ X^b} qEI(x^{(b)})
         = argmax_{x^{(b)} ∈ X^b} E_{f(x^{(b)}) | D} [max(max(f(x_1) - y*, 0), ..., max(f(x_b) - y*, 0))]
         = argmax_{x^{(b)} ∈ X^b} E_{f(x^{(b)}) | D} [max(max(f(x_1), ..., f(x_b)) - y*, 0)].    (3.1)
Intuitively, qEI is high for a batch of points that have high 1-EI and are not
too close to each other; this could correspond to selecting multiple local maxima of
1-EI or selecting multiple points located around some maximum [17]. Selecting a
local maximum and points that are too near it will not give a high qEI because the
nearby points will have a lower EI and be highly correlated with the local maximum,
giving a smaller probability that they increase the qEI than, e.g., points with the
same expected value but further away (and thus less correlated). [17] provide an
analytical form for 2-EI (see their Equation 18), but they note that the complexity
of qEI, including evaluating q-dimensional Gaussian CDFs, seems to necessitate
approximation by numerical integration anyway, so Monte Carlo sampling approaches
make sense. This can be done by sampling from the posterior distribution at x^{(b)} and
then evaluating the average improvement over the samples.
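As a concrete illustration, this Monte Carlo estimate can be sketched as follows. This is a minimal example with our own names (not the authors' code); `mu` and `cov` stand for the joint GP posterior over the batch.

```python
import numpy as np

def qei_mc(mu, cov, y_best, n_samples=10_000, rng=None):
    """Monte Carlo estimate of qEI for a batch whose joint posterior is
    N(mu, cov): average improvement of the best sampled value over y_best."""
    rng = rng or np.random.default_rng(0)
    samples = rng.multivariate_normal(mu, cov, size=n_samples)  # (n_samples, q)
    improvement = np.maximum(samples.max(axis=1) - y_best, 0.0)
    return improvement.mean()

# toy 2-point batch: the first point is clearly above the incumbent y_best = 0
mu = np.array([1.0, 0.0])
cov = np.array([[0.1, 0.02], [0.02, 0.1]])
val = qei_mc(mu, cov, y_best=0.0)
assert 0.9 < val < 1.2  # close to the dominant point's expected improvement
```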
As both the batch size and the dimensionality of the input space grow, Monte
Carlo approaches can become intractable. Thus heuristic approaches have also been
developed which aim to approximate qEI in a more computationally tractable fash-
ion by doing sequential, rather than joint, optimization to select the batch. These
heuristics take advantage of the fact that, as you select points for the batch sequentially,
you always know both the location of the previous points and the distribution
of function values at those points. A first approach might be, when selecting x_2, to
integrate over all possible values of f(x_1), weighted by their probabilities. However,
this quickly leads to the same issues of integrating high-dimensional Gaussians that
computing qEI exactly has. Thus [16, 17] propose the Kriging believer and constant
liar heuristic acquisition functions.
3.1.1 Kriging Believer (KB)
Rather than integrating over all possible values of the points selected so far for the
batch, the KB approach acts as if the expected value of each previous point was
the true value (thus believing what the Kriging, or GP regression, model expected).
The model is then updated with these imagined observations (without re-estimating
the hyperparameters). Once the batch is acquired, the imagined observations are
replaced in the dataset with the true observations. This acquisition function is shown
in Algorithm 2.
3.1.2 Constant Liar (CL)
The CL heuristic selects some value L based on the observations seen so far (Y) and
uses this value as the imagined observation for each point while building a batch.
Algorithm 2: Kriging Believer
Input: f: function to be optimized
       T: number of iterations
       D: initial dataset {(x_1, y_1), ..., (x_n, y_n)}
       b: batch size

model = create_model(D)
for t in 1, ..., T:
    for i in 1, ..., b:
        x_i = argmax_{x ∈ X} u_EI(x)
        model.update(x_i, μ(x_i))
    y_1:b = f(x_1:b)
    D = D ∪ {(x_1, y_1), ..., (x_b, y_b)}
    model.replace_data(D)    // replace the imagined labels
return argmax_{x ∈ X} model.expected_value(x)
[17] considered three settings for L: max(Y), mean(Y), and min(Y). The terms CL-
max and CL-min can lead to confusion because their interpretation differs depending
on whether one is trying to maximize or minimize the objective function (recall that
we always maximize, though some Bayesian optimization works minimize). Thus we
will refer to these as CL-optimistic (CLO) and CL-pessimistic (CLP) where being
optimistic means selecting L = max(Y) if you are trying to maximize the objective
function or L = min(Y) if the objective is to be minimized and similarly for CLP. CL-
mean will be referred to as CLA (CL-average) to avoid confusion with the acronym
CLM for CL-mix from [7]. CLP will promote more exploration because the points
which appeared good have been assigned a very poor label, reducing the El in that
region of the input space. The more optimistic the value of L, the less explorative
the batch will be. This acquisition function is shown in Algorithm 3; it differs from
Algorithm 2 only in computing and using the lie value L instead of μ(x).
3.2 Local Penalization (LP)
The acquisition function proposed by [18] takes a utility function, such as UCB or
EI, and acquires multiple points from it by reducing the utility (penalizing) around
points that have already been added to the batch. The modified utility thus takes
Algorithm 3: Constant Liar
Input: f: function to be optimized
       T: number of iterations
       D: initial dataset {(x_1, y_1), ..., (x_n, y_n)}
       b: batch size
       Lg: lie-generating function (e.g. max)

model = create_model(D)
for t in 1, ..., T:
    L = Lg(Y)
    for i in 1, ..., b:
        x_i = argmax_{x ∈ X} u_EI(x)
        model.update(x_i, L)
    y_1:b = f(x_1:b)
    D = D ∪ {(x_1, y_1), ..., (x_b, y_b)}
    model.replace_data(D)    // replace the imagined labels
return argmax_{x ∈ X} model.expected_value(x)
the form

u'(x) = u(x) ∏_{j=1}^{k} φ(x, x_j)

where k points are in x^{(b)} and φ(x, x_j) are penalization functions centered at x_j
such that 0 ≤ φ(x, x_j) ≤ 1 and which are non-decreasing in ||x_j - x||. In order
to determine a reasonable amount of penalization, it is assumed that f is Lipschitz
continuous with Lipschitz constant L. We then consider the ball centered at a point
x_j with radius r_j, denoted by B_{r_j}(x_j) = {x ∈ X : ||x_j - x|| ≤ r_j}. If we choose
r_j = (y* - μ(x_j)) / L, then x* ∉ B_{r_j}(x_j), or else L would not be a Lipschitz constant for f. This
gives rise to the penalty function φ(x, x_j) = 1 - P(x ∈ B_{r_j}(x_j)), which penalizes
points that are too close to x_j to be (likely to be) the optimal point. For a Gaussian
process model, this has the closed form φ(x, x_j) = (1/2) erfc(-z) where erfc is the
complementary error function and z = (L||x_j - x|| - y* + μ(x_j)) / sqrt(2σ²(x_j)). y* and L are generally not
known in practice (though y* might sometimes be known even though x* is not), so
L is estimated as L̂ = max_{x ∈ X} ||μ_∇(x)|| where μ_∇(x) = ∂k(x, X)/∂x [K + σ²I]^{-1} y is the
expected gradient of the GP at x, and y* may be estimated
as max_i y_i. This acquisition function is shown in Algorithm 4. To prevent vanishing or
exploding gradients when multiplying the utility by many penalty functions, a log
Algorithm 4: Local Penalization
Input: f: function to be optimized
       T: number of iterations
       D: initial dataset {(x_1, y_1), ..., (x_n, y_n)}
       b: batch size
       u: base utility function (assumed positive)

model = create_model(D)
for t in 1, ..., T:
    for i in 1, ..., b:
        x_i = argmax_{x ∈ X} u(x)
        u(·) = u(·) + log(φ(·, x_i))    // add a penalizer centered at x_i
    y_1:b = f(x_1:b)
    D = D ∪ {(x_1, y_1), ..., (x_b, y_b)}
return argmax_{x ∈ X} model.expected_value(x)
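The closed-form penalizer described above can be sketched directly. This is a minimal illustration with our own function and parameter names, not the authors' implementation:

```python
import numpy as np
from scipy.special import erfc

def lp_penalty(x, x_j, mu_j, sigma2_j, L, y_star):
    """Local-penalization factor phi(x, x_j) in [0, 1]: close to 0 near the
    batch point x_j, approaching 1 far away (assuming Lipschitz constant L)."""
    z = (L * np.linalg.norm(x - x_j) - y_star + mu_j) / np.sqrt(2.0 * sigma2_j)
    return 0.5 * erfc(-z)

# penalty is much stronger (smaller) near x_j than far from it
x_j = np.zeros(2)
near = lp_penalty(np.array([0.01, 0.0]), x_j, mu_j=0.0, sigma2_j=0.1, L=1.0, y_star=1.0)
far = lp_penalty(np.array([5.0, 0.0]), x_j, mu_j=0.0, sigma2_j=0.1, L=1.0, y_star=1.0)
assert 0.0 <= near < far <= 1.0
```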
3.3 Thompson Sampling
[23, 31] both propose to use Thompson sampling as a batch acquisition function. As
mentioned earlier, Thompson sampling applies trivially in the batch setting because
it is a stochastic function, so it can provide multiple points to evaluate. In the
discrete case, it's possible that the same point could be sampled multiple times. If
the objective function has much noise, it could be worth querying points multiple
times. Otherwise, one could generate points using Thompson sampling until a batch
of unique points of the desired size is reached. For the continuous case, duplicate
points are not an issue, though an extension of TS could require that the points in
the batch are not too close to each other.
[31] provide regret bounds for sequential, synchronous batch, and asynchronous
batch TS and compare these three applications. For simplicity, we focus on the
sequential setting in Algorithm 5.
Algorithm 5: Thompson Sampling
Input: f: function to be optimized
       T: number of iterations
       D: initial dataset {(x_1, y_1), ..., (x_n, y_n)}
       b: batch size

model = create_model(D)
for t in 1, ..., T:
    for i in 1, ..., b:
        f̃ = model.sample_function()
        x_i = argmax_{x ∈ X} f̃(x)
    y_1:b = f(x_1:b)
    D = D ∪ {(x_1, y_1), ..., (x_b, y_b)}
return argmax_{x ∈ X} model.expected_value(x)
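For a discrete candidate set, batch Thompson sampling reduces to taking the argmax of b independent posterior samples. A minimal sketch with scikit-learn (the data and names here are ours, chosen for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (8, 1))
y = np.sin(3 * X).ravel()
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6).fit(X, y)

# batch of b points: each is the argmax of one joint posterior sample
# over a discrete candidate set (no separate batch machinery needed)
cands = np.linspace(-1, 1, 200).reshape(-1, 1)
b = 3
samples = gp.sample_y(cands, n_samples=b, random_state=1)  # (200, b)
batch = cands[samples.argmax(axis=0)]
assert batch.shape == (b, 1)
```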
3.4 Batch Upper Confidence Bound (BUCB)
Recall that the formula for calculating the variance of a Gaussian process at a given
point is

σ²_n(x) = k(x, x) - k(X, x)^T [K + σ²I]^{-1} k(X, x).
We note again that the variance does not depend on the observed values y; only
the locations, x, of the observations matter. This is the insight behind the BUCB
acquisition function of [11]. When acquiring with BUCB, the first point in a batch
is acquired with sequential UCB. For all subsequent points, the variance of the GP
is updated based on the location of the previous point added to the batch; the mean
function is kept constant. The decreased variance around points already in the batch
means that subsequently added points won't be too close to previously added ones.
[11] prove that BUCB also converges when a particular schedule is used for β_t and
when an initial set of points is sampled appropriately.
In practice, especially when using pre-existing software libraries, it may be simpler
to add a new point to the GP than to update just the variance. Adding a new point
whose observed value is the same as the expected value for that point causes no
change in the predictive mean of the GP, so this technique can be used to add new
points that only update the variance. Thus BUCB is just a Kriging believer strategy
applied with UCB as the base utility function and with a particular initial sampling
technique and schedule for β_t to ensure convergence in the batch setting. We do not
provide an additional algorithm for BUCB because the KB algorithm may be used.
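This mean-preserving update is easy to verify numerically. The sketch below is our own example (not from the BUCB paper), using scikit-learn with fixed kernel hyperparameters: it adds a fantasy observation at the predicted mean and checks that the posterior mean is unchanged while the variance shrinks.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(6, 1))
y = np.sin(3 * X).ravel()

# optimizer=None holds the kernel hyperparameters fixed, matching the
# "no re-estimation" step of the heuristic
kernel = Matern(nu=2.5, length_scale=0.3)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-6, optimizer=None).fit(X, y)

x_new = np.array([[0.25]])            # point just "added to the batch"
mu_new = gp.predict(x_new)            # its predicted mean, used as a fantasy label

X2 = np.vstack([X, x_new])
y2 = np.concatenate([y, mu_new])
gp2 = GaussianProcessRegressor(kernel=kernel, alpha=1e-6, optimizer=None).fit(X2, y2)

grid = np.linspace(-1, 1, 50).reshape(-1, 1)
m1, s1 = gp.predict(grid, return_std=True)
m2, s2 = gp2.predict(grid, return_std=True)

assert np.allclose(m1, m2, atol=1e-6)  # predictive mean unchanged
assert np.all(s2 <= s1 + 1e-9)         # variance only shrinks
```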
3.5 Upper Confidence Bound with Pure Exploration
(UCB-PE)
[8] developed the Gaussian process upper confidence bound with pure exploration
(UCB-PE) acquisition function. This is a batch variant of UCB which, as the name
implies, uses pure exploration to fill the batch. The first point is still acquired
by maximizing the UCB utility function, but all subsequent points in the batch are
acquired by sequentially choosing points with maximal variance. After each point is
added to the batch, the variance of the GP is updated just as in BUCB. This is a
greedy approximation to choosing the set of points which give maximal information
gain. There are two more nuanced points to how UCB-PE works. First, to prevent
the acquisition function from exploring in regions that are very unlikely to yield high-
value points, the variance-maximizing points are only selected from a constrained
region. The constraint is that the UCB at the point must be greater than the highest
lower confidence bound (LCB) of any point. Second, in order to bound the regret of
the optimization and prove convergence, the β_t values are increased relative to the
schedule for UCB; see [8] for details. Algorithm 6 shows this method.
3.6 Distance Exploration (DE)
In the general setting for Bayesian optimization, the objective function f is expensive
to evaluate. Because it's expected that most of the time will be spent querying f,
there is not a great deal of emphasis placed on making the acquisition function
efficient, though it must be reasonably tractable. [40] break from that trend with this
acquisition function and instead focus on acquiring batches quickly enough for use
on less costly objective functions. This fills a middle ground between the expensive
Algorithm 6: Upper Confidence Bound with Pure Exploration
Input: f: function to be optimized
       T: number of iterations
       D: initial dataset {(x_1, y_1), ..., (x_n, y_n)}
       b: batch size

model = create_model(D)
for t in 1, ..., T:
    x_1 = argmax_{x ∈ X} u_UCB(x)
    ŷ = max_{x ∈ X} μ(x) - sqrt(β_t) σ(x)                  // maximum lower confidence bound
    R = {x ∈ X : μ(x) + 2 sqrt(β_{t+1}) σ(x) ≥ ŷ}          // relevant region
    for i in 2, ..., b:
        x_i = argmax_{x ∈ R} σ(x)
        // we only need to update the variance, but, as mentioned
        // for BUCB, this update does that and may be easier in practice
        model.add_point(x_i, μ(x_i))
    y_1:b = f(x_1:b)
    D = D ∪ {(x_1, y_1), ..., (x_b, y_b)}
    model.replace_data(D)    // replace the imagined labels
return argmax_{x ∈ X} model.expected_value(x)
functions typically optimized by Bayesian optimization and the cheap functions op-
timized using other global optimization techniques like DIRECT [30] or evolutionary
algorithms.
DE is similar to UCB-PE but selects its exploratory points in a more efficient,
if not as principled, way. The first point is still selected using the UCB acquisition
function. All subsequent points in the batch are chosen to be as far as possible
from the closest point in the dataset or the current batch. The first point is thus
selected to be optimistically the best point while the others use a space-filling design
to gain knowledge that will improve the model in the subsequent rounds. To make
this approach even faster, a discretization scheme is used. At the start of Bayesian
optimization, a space-filling set of possible points is chosen (e.g. a Sobol sequence
[49]). The exploratory points are only chosen from this set, removing the need for
optimization (e.g. by gradient ascent) in a continuous input space. Algorithm 7
shows this acquisition function.
Algorithm 7: Distance Exploration
Input: f: function to be optimized
       T: number of iterations
       D: initial dataset {(x_1, y_1), ..., (x_n, y_n)}
       b: batch size

model = create_model(D)
// Sample enough points for all iterations using some space-filling design
S = sample_sobol(bT)
for t in 1, ..., T:
    x_1 = argmax_{x ∈ X} u_UCB(x)
    for i in 2, ..., b:
        x_i = argmax_{x ∈ S} (min_{x' ∈ X ∪ x^{(b)}} ||x - x'||)
    y_1:b = f(x_1:b)
    D = D ∪ {(x_1, y_1), ..., (x_b, y_b)}
return argmax_{x ∈ X} model.expected_value(x)
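The greedy distance-based fill step can be sketched as follows. All names here are hypothetical; this is a minimal illustration using SciPy's Sobol sampler, not the authors' code:

```python
import numpy as np
from scipy.stats import qmc

def fill_batch_by_distance(X_data, x_first, candidates, b):
    """Greedily pick b-1 exploratory points, each as far as possible
    from the closest already-chosen point (dataset + current batch)."""
    batch = [x_first]
    chosen = np.vstack([X_data, x_first])
    for _ in range(b - 1):
        # distance from every candidate to its nearest chosen point
        d = np.min(np.linalg.norm(candidates[:, None, :] - chosen[None, :, :],
                                  axis=-1), axis=1)
        idx = np.argmax(d)
        batch.append(candidates[idx])
        chosen = np.vstack([chosen, candidates[idx]])
    return np.array(batch)

# usage sketch: Sobol candidates in [0, 1]^2
cands = qmc.Sobol(d=2, scramble=True, seed=0).random(64)
X_data = np.random.default_rng(1).random((5, 2))
x_first = np.array([0.5, 0.5])    # would come from maximizing UCB
batch = fill_batch_by_distance(X_data, x_first, cands, b=4)
assert batch.shape == (4, 2)
```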
3.7 Budgeted Batch Bayesian Optimization (B3O)
While most batch acquisition functions use a predefined batch size, B3O, developed
by [39], acquires batches of variable size. The intuition is that using a fixed batch
size may force the acquisition function to add points that are not expected to be
very useful merely for the sake of filling the batch. When the cost of querying the
objective function at a full batch of points is only slightly larger than the cost for a
partial batch, a fixed size makes sense. If the marginal cost of each point in the batch
is meaningful, though, then a variable batch size can conserve resources while still
providing a speedup over a fully-sequential acquisition function.
The goal with B3O is to acquire one point from each mode of some utility function.
There are two steps used to find these modes: 1) draw samples from the utility
function with sampling probability proportional to utility and 2) fit an infinite Gaus-
sian mixture model (IGMM) to these samples. The samples are drawn using slice
sampling. To make this more efficient in practice, batches of samples are drawn at
once. Slice sampling requires that the score for each point be non-negative (an un-
normalized probability distribution). Because this is not necessarily the case for a
utility function (e.g. UCB), a first optimization is done to find the smallest utility
value. This can be subtracted from all utility scores to make them non-negative.
After collecting a batch of samples from the utility function, an IGMM is fit to these
samples. A Dirichlet process prior is placed on the number of components in the
mixture and variational inference is used to fit the mixture to the samples. Finally,
the acquired points for the batch are the means of the Gaussian distributions in the
mixture model. This approach is summarized in Algorithm 8.
Algorithm 8: Budgeted Batch Bayesian Optimization
Input: f: function to be optimized
       T: number of iterations
       D: initial dataset {(x_1, y_1), ..., (x_n, y_n)}
       u: sequential utility function    // the authors found UCB worked best
       n_s: number of samples to draw

model = create_model(D)
for t in 1, ..., T:
    S = slice_sample(u, n_s)
    μ^GM_1:k, Σ^GM_1:k = fit_igmm(S)
    x_1:k = {μ^GM_1, ..., μ^GM_k}    // IGMM's means; k is variable
    y_1:k = f(x_1:k)
    D = D ∪ {(x_1, y_1), ..., (x_k, y_k)}
return argmax_{x ∈ X} model.expected_value(x)
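The IGMM step can be approximated with scikit-learn's variational Dirichlet-process mixture. This is a sketch with our own names and synthetic samples standing in for the slice-sampling output, not the authors' implementation; the 0.05 weight threshold is an assumption:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# hypothetical utility samples S (in practice, from slice sampling over u)
rng = np.random.default_rng(0)
S = np.vstack([rng.normal(-2.0, 0.3, (150, 1)),
               rng.normal(1.5, 0.2, (100, 1))])

# Dirichlet-process mixture fit variationally; components with negligible
# weight are effectively pruned, so the resulting batch size k is variable
igmm = BayesianGaussianMixture(
    n_components=10,    # truncation level, not the final k
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(S)

batch = igmm.means_[igmm.weights_ > 0.05]  # acquired points: surviving means
assert 1 <= len(batch) <= 10
```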
3.8 k-means Batch Bayesian Optimization (KMBBO)
[20] proposed KMBBO, which is a fixed-batch-size variant of B3O. KMBBO also
uses slice sampling to generate samples from X proportional to their utility, but
instead of fitting a Gaussian mixture with a variable number of means, they ensure
a constant batch size by clustering the samples using the k-means algorithm. The k
cluster centers are the points to be acquired. KMBBO would be preferred over B3O
in settings where smaller batches would lead to unused resources. For example, if the
cost of electricity is negligible and no other computations need to be done, a computer
with b cores, each capable of running an experiment, could gather more information
by always running b experiments. Other settings, such as biological experiments, may
also have negligible costs for querying more points up until b points are being queried.
The KMBBO acquisition function is shown in Algorithm 9.
Algorithm 9: k-means Batch Bayesian Optimization
Input: f: function to be optimized
       T: number of iterations
       D: initial dataset {(x_1, y_1), ..., (x_n, y_n)}
       b: batch size
       u: sequential utility function    // the authors found UCB worked best
       n_s: number of samples to draw

model = create_model(D)
for t in 1, ..., T:
    S = slice_sample(u, n_s)
    μ^KM_1:b = fit_k_means(b, S)
    x_1:b = {μ^KM_1, ..., μ^KM_b}    // cluster centers
    y_1:b = f(x_1:b)
    D = D ∪ {(x_1, y_1), ..., (x_b, y_b)}
return argmax_{x ∈ X} model.expected_value(x)
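The clustering step is a direct application of k-means. A minimal sketch with synthetic samples standing in for the slice-sampling output (names are ours):

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical utility samples S (as in B3O); fixed batch size b
rng = np.random.default_rng(0)
S = rng.random((200, 2))
b = 4

# the b cluster centers form the batch, guaranteeing a constant batch size
centers = KMeans(n_clusters=b, n_init=10, random_state=0).fit(S).cluster_centers_
assert centers.shape == (b, 2)
```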
Chapter 4
Experiments
4.1 Objective Functions
We test on a variety of objective functions from the Bayesian optimization literature
in order to better understand which types of objective functions are amenable to
optimization by different acquisition functions.
The objective functions are listed in Table 4.1 and described in the following
subsections.
4.1.1 Test Functions
We use a handful of common "test" functions which are low-dimensional and where we
actually know the analytic form. This makes it easier, especially for the functions that
we can visualize directly, to understand what each acquisition function is doing and see
where they fail. These are the Alpine2, Bohachevsky, Branin, Gramacy, Goldstein,
Hartmann3, and Hartmann6 functions. Those which are 1- or 2-dimensional are
plotted in Appendix A.
4.1.2 Dataset Maximization Tasks
Because of the expensive nature of applications where Bayesian optimization is used,
it is not always possible to evaluate an objective function multiple times to compare
Name         Description                                                                          Dimension
Gramacy      f(x) = -(sin(10πx)/(2x) + (x - 1)^4)                                                 1
Alpine2      f(x) = ∏_i sqrt(x_i) sin(x_i)                                                        2
Bohachevsky  f(x) = -(0.7 + x_0^2 + 2x_1^2 - 0.3 cos(3πx_0) - 0.4 cos(4πx_1))                     2
Branin       f(x) = -((x_1 - (5.1/4π^2)x_0^2 + (5/π)x_0 - 6)^2 + 10(1 - 1/8π) cos(x_0) + 10)      2
Goldstein    Appendix A                                                                           2
Hartmann3    Appendix A                                                                           3
Hartmann6    Appendix A                                                                           6
Abalone      Predict age of abalone from features.                                                8
DNA Binding  Identify strongly-binding DNA sequences.                                             32
Robot Push   Identify parameters for 2 robot hands pushing 2 objects.                             14
Trajectory   Identify trajectory points for minimal-cost path.                                    60

Table 4.1: Overview of the objective functions. x_i denotes the i-th dimension of
the input point x. Some of these functions have been negated so that they are
maximization problems when traditionally they would be minimizations. Alpine2 is
defined for any number of dimensions, but we use it in 2 dimensions. Formulas that
are excessively long are deferred to the appendix.
different methods. Given an existing set of inputs and outputs from a function, we
can simulate optimizing the function by attempting to find the input among this set
with the largest corresponding output. As usual, a random subset of these points are
sampled to begin Bayesian optimization. At each subsequent step, only the points in
the dataset are considered for acquisitions. This lets us evaluate many methods on
the same task without the expense of running many physical experiments.
We consider a couple of different dataset maximization tasks in this work.
Abalone  The Abalone dataset consists of 8 features (such as diameter, height, and
weight) of a few thousand abalone (sea snail) organisms as well as the age of each [38].
This dataset has been used previously in Bayesian optimization either as a prediction
task [28], where the hyperparameters of a model that predicts age from the features
are tuned using Bayesian optimization, or directly as a dataset maximization task [8].
We adopt the latter setting. Maximizing the function consists of identifying those
features of an abalone which lead (in the given dataset) to an organism with the
greatest age.
DNA-binding Proteins We use the DNA-binding experiments from [1] as a set of
objective functions. These experiments measure how strongly different proteins (tran-
scription factors) bind to given 8-basepair DNA segments. There are 38 transcription
factors in this dataset; we sample five of them to use here (each can be considered
a separate objective function). These experiments have evaluated all approximately
32,000 possible inputs, so our methods are able to acquire any possible point in this
space. We represent each basepair as a vector of length four which has zeros at all
positions except a one at the index corresponding to the basepair. DNA sequences
are represented as a 32-dimensional vector which is the concatenation of the 8 one-hot
vectors for each basepair.
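This encoding can be sketched as follows (our own helper; the A/C/G/T index ordering is an assumption for illustration):

```python
import numpy as np

BASES = "ACGT"  # assumed basepair-to-index ordering

def one_hot_dna(seq):
    """Encode an 8-basepair DNA sequence as a 32-dim concatenated one-hot vector."""
    v = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        v[i, BASES.index(base)] = 1.0
    return v.ravel()

x = one_hot_dna("ACGTACGT")
assert x.shape == (32,) and x.sum() == 8  # one 1 per basepair
```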
4.1.3 Simulations
Robot Pushing This is a task defined on a simulated robot introduced by [55]. The
robot is composed of two hands and the goal is to push two objects into a specific
location. The input parameters specify things such as the location, rotation, and
velocity of the robot hands.
Rover Trajectory This task was introduced in [53] and emulates a 2D rover navi-
gation task. The goal is to define a trajectory from a start location to a target location
by giving the coordinates of 60 points along that trajectory (to which a BSpline is fit
to determine the intermediate locations). A predetermined cost function is integrated
across the trajectory to determine the reward.
4.2 Baselines
For comparison, we use two random acquisition functions as baselines. The first
selects points uniformly at random from the input space (in the case of a finite input
space, this is uniform among the points not yet acquired). The second selects all points
at once using Latin hypercube sampling with multi-dimensional uniformity (LHS-
MDU) and then returns the next b points each time a batch is to be acquired (selecting
b points using LHS-MDU separately for each iteration would yield fundamentally
different results) [12].
4.3 Implementation Details
Here we list several practical details for the experiments that we run.
Utility Maximization  We use the limited-memory Broyden-Fletcher-Goldfarb-
Shanno algorithm with bounds (L-BFGS-B; [6]), as implemented in scikit-learn
[41], to optimize the utility functions as well as for optimizations like finding the
maximum LCB for UCB-PE. We start by selecting 1,000 points uniformly at
random from the input space. Whichever point has the highest function value is
used as the starting point of the optimization. We allow at most 1,000 iterations
for the local optimization to converge. We run local ascent from 10 starting points
(all but the first sampled uniformly at random) and return the point with the best
score. If constraints, apart from box constraints, are given for the optimization, we
use sequential least squares programming (SLSQP; [33]) instead but keep all other
parameters the same.
Initialization We sample the initial points either uniformly at random except with-
out replacement, if the input space is finite, or using LHS-MDU if the input space is
continuous and bounded. We set the number of random points to be five times the
dimension of the input space (e.g. 10 points for a 2D function).
Normalization  We scale each dimension of the input space to be in [-1, 1]. For
the outputs, we normalize them to be y' = (y - μ_y)/σ_y where μ_y and σ_y are the mean and
standard deviation of the observed outputs. μ_y and σ_y are updated at the start of
each new batch.
Model Settings For the exact GP, we use a constant mean function where the
constant is learned from an initial value of 0. The covariance kernel is a Matern
kernel with ν = 2.5 multiplied by a learned output scale σ_o, initialized to 1. We use
separate lengthscale parameters for each dimension of the input space, each initialized
to 0.3. We set the noise of the likelihood to be 0.0001 and don't train this value (all
of the objectives we use are noiseless, but a small amount of noise in the likelihood
stabilizes the training). In total, there are d + 2 parameters to learn. We train
the parameters by maximizing the marginal log-likelihood of the data using gradient
descent with the Adam optimizer with an initial learning rate of 0.01 [32].
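A rough scikit-learn analogue of this model is sketched below. This is our own approximation: scikit-learn optimizes the marginal likelihood with L-BFGS rather than Adam and does not expose a learned constant mean, so it only mirrors the kernel and noise settings described above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

d = 2
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (10, d))
y = np.sin(X).sum(axis=1)

# Matern-5/2 kernel with per-dimension lengthscales (init 0.3) times a
# learned output scale (init 1); fixed likelihood noise of 1e-4
kernel = ConstantKernel(1.0) * Matern(length_scale=[0.3] * d, nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-4, normalize_y=True).fit(X, y)

mu, std = gp.predict(X, return_std=True)
assert mu.shape == (10,) and np.all(std >= 0)
```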
4.4 Metrics
For Bayesian optimization, it primarily matters how good the best point is at the
end of the optimization. One common metric is thus the simple regret r_s(t) = y* -
max_{x ∈ X_t} f(x) where X_t are the points acquired up to iteration t; this measures the
goodness of the best acquired point. Another common metric is inference regret
r_i(t) = y* - f(argmax_{x ∈ X} μ_t(x)) where μ_t is the expected value of f(x) at iteration t.
The maximization required for r_s is easy, so this metric can be computed freely; the
optimization needed to compute r_i is more challenging, especially in high dimensions
(this corresponds to finding the point that maximizes the expected reward).
In this work, we use another metric that, like r_s, only requires a maximization
over acquired points so that it is easy to compute the metric at all timesteps. This
is the "gap" metric r_g(t) = (max_{x ∈ X_t} f(x) - y_0) / (y* - y_0) where y_0 is the value of the best point
from the initial randomly acquired points [2]. r_g has the nice property that it is
bounded between 0 (no improvement yet over the random points) and 1 (optimal
value achieved), making it easier to compare methods across objective functions and
without explicitly identifying the optimal value.
Though we do care most about the best point at the end of the optimization, this
makes our evaluation dependent on the specific number of acquisition steps we've
chosen to use. If we had used half as many steps instead, different methods may have
performed better. Because of this, we also look at the average gap r_g^avg = (1/T) Σ_t r_g(t)
to see how a method would have performed, on average, for any smaller number of
acquisition steps. Note that r_g^avg ∈ [0, 1] still, so we can still easily compare across
objective functions.
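The gap metric is simple to compute from the acquired values (a small helper with our own names):

```python
def gap(y_acquired, y0, y_star):
    """Gap metric: fraction of the possible improvement over the best
    initial point y0 that has been achieved so far (optimum y_star)."""
    return (max(y_acquired) - y0) / (y_star - y0)

# halfway between the best initial point and the optimum gives a gap of 0.5
ys = [0.5, 0.75]
assert gap(ys, y0=0.5, y_star=1.0) == 0.5
```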
Chapter 5
Results
We discuss the results for the objective functions and highlight which acquisition
functions perform well for certain models and types of objective functions. Plots
showing the results at each acquisition are included in Appendix B. We also show
the performance of each model, acquisition function pair averaged over each class
of objective functions in the tables below. Table 5.1 shows the performance of the
random baselines. Note that the LHS and Sobol methods only work in continuous
spaces and so can't be evaluated on the dataset maximization tasks. The LHS method
was unable to scale to the simulation tasks. If a method has no performance on a task,
we fill that entry with a zero in these tables. Table 5.2 shows the performance of the
sequential acquisition functions as another baseline. These aren't evaluated on the
simulation tasks due to the much larger number of total acquisitions there. Finally,
Table 5.3 shows the performance of the batch acquisition functions. The results in
the tables will be discussed more in the remainder of this section.
AF        Dataset Maximization   Simulations   Test Functions
Uniform   0.23                   0.356         0.647
Sobol     0.00                   0.390         0.696
LHS-MDU   0.00                   0.000         0.666

Table 5.1: Gap metric performance of random baselines averaged over all objective
functions of each class. AF = acquisition function.
                 Dataset Maximization          Test Functions
AF               GP     NN     NNM             GP     NN     NNM
EI               0.870  0.840  0.815           0.937  0.715  0.764
ExpectedReward   0.789  0.897  0.890           0.861  0.665  0.657
PI               0.745  0.553  0.540           0.845  0.800  0.694
UCB              0.843  0.874  0.882           0.903  0.681  0.666

Table 5.2: Gap metric performance of sequential acquisition functions averaged over
all objective functions of each class. AF = acquisition function, NN = neural network
ensemble, NNM = neural network ensemble with maximizing overall diversity method.
                     Dataset Maximization     Simulations            Test Functions
AF                   GP     NN     NNM        GP    NN     NNM       GP     NN     NNM
B3O                  0.368  0.491  0.451      0.0   0.021  0.013     0.578  0.507  0.468
BUCB                 0.787  0.889  0.891      0.0   0.000  0.000     0.879  0.383  0.402
ConstantLiar-O       0.766  0.849  0.828      0.0   0.506  0.540     0.948  0.594  0.550
ConstantLiar-P       0.854  0.830  0.834      0.0   0.477  0.471     0.932  0.652  0.539
DistanceExploration  0.639  0.736  0.754      0.0   0.372  0.425     0.885  0.765  0.771
KMBBO                0.350  0.678  0.711      0.0   0.599  0.559     0.861  0.837  0.804
KrigingBeliever      0.848  0.850  0.855      0.0   0.478  0.525     0.893  0.605  0.548
LocalPenalization    0.482  0.838  0.846      0.0   0.499  0.441     0.661  0.610  0.606
UCBPE                0.744  0.860  0.873      0.0   0.000  0.000     0.854  0.485  0.494

Table 5.3: Gap metric performance of batch acquisition functions averaged over all
objective functions of each class. AF = acquisition function, NN = neural network
ensemble, NNM = neural network ensemble with maximizing overall diversity method.
5.1 Low-dimensional, Continuous Objectives
These are the synthetic objective functions (Alpine2, Bohachevsky, Branin, Goldstein,
Gramacy, Hartmann3, and Hartmann6). Though low-dimensional, some of these
objective functions are still difficult to optimize. For example, the Gramacy function
has varying lengthscales across its input domain, so an acquisition function which
does not explore sufficiently will have difficulty reaching the optimal value. For other
functions, such as Branin, some of the points that are randomly acquired at the start
of Bayesian optimization will have values close to the optimal value. Because of how
the gap metric works, with the best initial value becoming the baseline, getting a
high gap score is still non-trivial. The overall results for each objective function are
shown in Figure 5-1 and the "Test Functions" columns of the preceding tables.
We see that the GP models are much stronger on the test functions; they outper-
form the neural network models for every acquisition function. This suggests either
that a GP is a better model for low-dimensional, continuous functions or that these
particular functions, often used as tests in Bayesian optimization, are more suited
to the assumptions of GPs. With multiple acquisition functions, the neural network
ensembles actually perform worse than the random acquisition baselines.
Among the acquisition functions, the two variants of Constant Liar both perform
very well with GP models, with the optimistic version even outperforming the best
sequential acquisition function. The KMBBO method works best with the NN models
and achieves performance that is not far behind most of the acquisition functions with
a GP model.
5.2 Dataset Maximization Objectives
These are the DNA-binding objectives and the Abalone function. The cardinality of
the spaces is small enough that it is possible to maximize utility functions through
exhaustive evaluation on the entire set of inputs. This is important because it means
that acquisition functions are not penalized for being more difficult to maximize.
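With a finite input set, this exhaustive maximization reduces to an argmax over per-candidate utility scores; a minimal sketch with a UCB-style utility (the array names and values are illustrative):

```python
import numpy as np

def acquire_exhaustively(mean, std, kappa=2.0):
    """Score every candidate input with a UCB-style utility,
    mean + kappa * std, and return the index of the best candidate.
    No inner optimization routine is needed over a finite set."""
    return int(np.argmax(mean + kappa * std))

# Toy posterior statistics over four candidate inputs.
mean = np.array([0.1, 0.5, 0.4, 0.2])
std = np.array([0.3, 0.0, 0.2, 0.1])
best = acquire_exhaustively(mean, std)  # utilities: 0.7, 0.5, 0.8, 0.4
```

Because every candidate is scored, no acquisition function is disadvantaged by being hard to optimize with gradient- or sampling-based inner loops.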
Figure 5-1: Median average gap metric for the synthetic objective functions.

The overall results for each objective function are shown in Figure 5-2 and the dataset maximization columns of the preceding tables.

Figure 5-2: Median average gap metric for the dataset maximization objective functions.

The best results here are achieved with the NN models, though for a couple of acquisition functions the GP model is close behind. One possibility is that, because the acquisition functions can be maximized exhaustively over these finite input sets, even the sequential acquisition functions perform well.
Figure 5-3: Median average gap metric for the simulation objective functions.

5.3 Simulation Objectives

These are the Rover Trajectory and Robot Push objectives. Because of the high dimensionality of these functions and the large number of samples required by these objectives, the only GP models considered here, exact GPs, were unable to scale to them. We note again that there are many approximate variants of the GP model which can scale to larger datasets, but it was not tractable to consider more models in the present study given the large number of acquisition functions tested. The median average gap values for these high-dimensional objective functions from all completed experiments are shown in Figure 5-3 and the "Simulations" columns of the preceding tables. There is no clear set of acquisition functions which dominate on these objectives, but the best performances are achieved by the NN ensemble with maximizing overall diversity (NNM). KMBBO yields the best results of the acquisition functions considered here, though it was among the worst acquisition functions for the dataset maximization objectives. The optimistic variant of Constant Liar performs well across both models.
Chapter 6
Conclusion
We have reviewed much of the work in batch Bayesian optimization, especially the
design of acquisition functions, which has taken place in the past decade. A few
distinct batch acquisition strategies become clear in this comparison, such as
a focus on exploration, acquiring modes of a utility function, or penalizing the utility
near points already in the batch. Some of these functions build a batch sequentially
and are thus easily adapted to the asynchronous setting while others acquire a batch
jointly. We have highlighted the combinations of models and acquisition functions
that perform well for various objective functions. Overall, GPs seem the strongest
model for lower-dimensional functions, at least in continuous spaces. The NN ensem-
bles performed well at dataset maximization, however, suggesting that they may be
better in discrete spaces (as is the case for the DNA binding datasets) or when only
a finite number of points are under consideration so that utility functions are easily
optimized. The best acquisition function varies with the objective function
and the model, though some do stand out, such as the optimistic variant of Constant
Liar, which, when combined with a good model for the set of objective functions, was
usually among the best acquisition functions.
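As a point of reference for the Constant Liar strategy highlighted above, the loop below sketches its sequential batch construction: after each pick the model is refit as if the picked point had returned a fixed "lie" value, which pushes later picks elsewhere. In the optimistic variant the lie is the best value observed so far. The `fit_model` and `utility` callables here are illustrative placeholders, not the implementation used in these experiments.

```python
import numpy as np

def constant_liar_batch(observed_X, observed_y, candidates,
                        fit_model, utility, batch_size):
    """Build a batch of candidate indices sequentially using Constant Liar.
    Optimistic variant: the lie is the best value observed so far."""
    X, y = list(observed_X), list(observed_y)
    lie = max(y)
    remaining = list(range(len(candidates)))
    batch = []
    for _ in range(batch_size):
        model = fit_model(X, y)
        scores = utility(model, candidates[remaining])
        pick = remaining[int(np.argmax(scores))]
        batch.append(pick)
        remaining.remove(pick)
        X.append(candidates[pick])  # hallucinated observation at the lie value
        y.append(lie)
    return batch

# Toy demo: the "model" just stores the observed inputs, and the utility
# rewards distance from already-(pseudo-)observed points, so the batch
# spreads out instead of piling onto one region.
candidates = np.array([[0.0], [1.0], [2.0], [3.0]])
fit_model = lambda X, y: np.array(X)
utility = lambda model, cands: np.array(
    [min(abs(c[0] - x[0]) for x in model) for c in cands])
batch = constant_liar_batch([[0.0]], [0.5], candidates, fit_model, utility, 2)
```

Because the batch is built one point at a time, the same loop adapts directly to the asynchronous setting noted above: a pending query is simply treated as a hallucinated observation at the lie value.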
Appendix A
Test Functions
Test functions whose equations are too long for inclusion in the main text are presented below. All 1- and 2-dimensional test functions are also plotted.
Goldstein
f(x) = -\bigl(1 + (x_0 + x_1 + 1)^2 \,(19 - 14x_0 + 3x_0^2 - 14x_1 + 6x_0 x_1 + 3x_1^2)\bigr) \cdot \bigl(30 + (2x_0 - 3x_1)^2 \,(18 - 32x_0 + 12x_0^2 + 48x_1 - 36x_0 x_1 + 27x_1^2)\bigr)
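This can be transcribed directly; the extremum noted in the comment is the well-known value for the Goldstein-Price function under this negated (maximization) convention:

```python
def goldstein(x0: float, x1: float) -> float:
    """Negated Goldstein-Price function; its maximum is -3 at (0, -1)."""
    term1 = 1 + (x0 + x1 + 1) ** 2 * (
        19 - 14 * x0 + 3 * x0 ** 2 - 14 * x1 + 6 * x0 * x1 + 3 * x1 ** 2)
    term2 = 30 + (2 * x0 - 3 * x1) ** 2 * (
        18 - 32 * x0 + 12 * x0 ** 2 + 48 * x1 - 36 * x0 * x1 + 27 * x1 ** 2)
    return -term1 * term2
```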
Hartmann3

f(x) = -\sum_{i=1}^{4} a_i \exp\left(-\sum_{j=1}^{3} A_{ij}\,(x_j - P_{ij})^2\right)

a = [1.0, 1.2, 3.0, 3.2]^T

A = \begin{bmatrix} 3.0 & 10 & 30 \\ 0.1 & 10 & 35 \\ 3.0 & 10 & 30 \\ 0.1 & 10 & 35 \end{bmatrix}, \qquad
P = 10^{-4} \begin{bmatrix} 3689 & 1170 & 2673 \\ 4699 & 4387 & 7470 \\ 1091 & 8732 & 5547 \\ 381 & 5743 & 8828 \end{bmatrix}

Hartmann6

f(x) = -\sum_{i=1}^{4} a_i \exp\left(-\sum_{j=1}^{6} A_{ij}\,(x_j - P_{ij})^2\right)

a = [1.0, 1.2, 3.0, 3.2]^T

A = \begin{bmatrix} 10 & 3 & 17 & 3.5 & 1.7 & 8 \\ 0.05 & 10 & 17 & 0.1 & 8 & 14 \\ 3 & 3.5 & 1.7 & 10 & 17 & 8 \\ 17 & 8 & 0.05 & 10 & 0.1 & 14 \end{bmatrix}

P = 10^{-4} \begin{bmatrix} 1312 & 1696 & 5569 & 124 & 8283 & 5886 \\ 2329 & 4135 & 8307 & 3736 & 1004 & 9991 \\ 2348 & 1451 & 3522 & 2883 & 3047 & 6650 \\ 4047 & 8828 & 8732 & 5743 & 1091 & 381 \end{bmatrix}
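The Hartmann3 definition can be transcribed with its canonical constants; the extremum quoted in the comment is the widely documented value for this function, stated here for reference rather than derived in the thesis:

```python
import math

# Canonical Hartmann3 constants (P already scaled by 1e-4).
A = [[3.0, 10, 30], [0.1, 10, 35], [3.0, 10, 30], [0.1, 10, 35]]
P = [[0.3689, 0.1170, 0.2673],
     [0.4699, 0.4387, 0.7470],
     [0.1091, 0.8732, 0.5547],
     [0.0381, 0.5743, 0.8828]]
a = [1.0, 1.2, 3.0, 3.2]

def hartmann3(x):
    """Hartmann3 as defined above; attains roughly -3.86278 at
    x ~ (0.1146, 0.5556, 0.8525)."""
    return -sum(
        a[i] * math.exp(-sum(A[i][j] * (x[j] - P[i][j]) ** 2
                             for j in range(3)))
        for i in range(4))
```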
Figure A-1: The Gramacy function.
Figure A-2: The Branin function.
Figure A-3: The Alpine2 function.
Figure A-4: The Bohachevsky function.
Figure A-5: The Goldstein function.
Appendix B
Extended Results
Here we show the values of the gap metric at each acquisition for all combinations
of objective function, model, acquisition function, and batch size tested. In the main
text we primarily compared the acquisition functions on their results after a certain
number of points had been acquired (assuming constant batch sizes). These plots
enable comparisons based on different possible ending points (e.g. which pair of
acquisition function and model would have performed best on Gramacy if we had
only acquired 20 points?).
Note that the x-axes in these plots are the acquisition number, not necessarily the
total number of acquired points so far. To compare a method with a batch size of 1
to one with a batch size of 10, you would compare step t of the first method with step
t/10 of the second. The sets of plots are ordered alphabetically by objective function.
For clarity, and given the large number of lines in many of the plots, no error bars or
regions are shown. Each line is the median across all repeats.
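The step-count alignment described above can be made explicit; the function names are ours, for illustration only:

```python
def total_points(step, batch_size, n_initial=0):
    """Total number of objective evaluations after `step` batch
    acquisitions, on top of any randomly acquired initial points."""
    return n_initial + step * batch_size

def comparable_step(step_a, batch_size_a, batch_size_b):
    """The step of method B at which it has acquired as many points as
    method A has at step `step_a` (assumes the ratio divides evenly)."""
    return step_a * batch_size_a // batch_size_b
```

For example, step 40 of a batch-size-1 method lines up with step 4 of a batch-size-10 method, since both have acquired 40 points.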
Each figure shows the gap metric at each acquisition for batch sizes 1 and 10 (batch size 1000 for the two simulation objectives).

Figure B-1: Abalone
Figure B-2: Alpine2
Figure B-3: Bohachevsky
Figure B-4: Branin
Figure B-6: DNA Binding: BCL6
Figure B-8: DNA Binding: EGR2
Figure B-10: Goldstein
Figure B-11: Gramacy
Figure B-12: Hartmann3
Figure B-13: Hartmann6
Figure B-14: Robot Push
Figure B-15: Rover Trajectory
Bibliography
[1] Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 351(6280):1450-1454, 2016.
[2] Russell R Barton. Minimization algorithms for functions with random noise. American Journal of Mathematical and Management Sciences, 4(1-2):109-138, 1984.
[3] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.
[4] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
[5] Eric Brochu, Matthew W Hoffman, and Nando de Freitas. Portfolio allocation for Bayesian optimization. arXiv preprint arXiv:1009.5419, 2010.
[6] Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190-1208, 1995.
[7] Clément Chevalier and David Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection. In International Conference on Learning and Intelligent Optimization, pages 59-69. Springer, 2013.
[8] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225-240. Springer, 2013.
[9] Dennis D Cox and Susan John. A statistical method for global optimization. In [Proceedings] 1992 IEEE International Conference on Systems, Man, and Cybernetics, pages 1241-1246. IEEE, 1992.
[10] Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207-215, 2013.
[11] Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873-3923, 2014.
[12] Jared L Deutsch and Clayton V Deutsch. Latin hypercube sampling with multidimensional uniformity. Journal of Statistical Planning and Inference, 142(3):763-772, 2012.
[13] Peter I Frazier and Scott C Clark. Parallel global optimization using an improved multi-points expected improvement criterion. 2012.
[14] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050-1059, 2016.
[15] Jacob R Gardner, Geoff Pleiss, Ruihan Wu, Kilian Q Weinberger, and Andrew Gordon Wilson. Product kernel interpolation for scalable Gaussian processes. arXiv preprint arXiv:1802.08903, 2018.
[16] David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. A multi-points criterion for deterministic parallel global optimization based on Gaussian processes. <hal-00260579>, 2008.
[17] David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pages 131-162. Springer, 2010.
[18] Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch Bayesian optimization via local penalization. In Artificial Intelligence and Statistics, pages 648-657, 2016.
[19] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348-2356, 2011.
[20] Matthew Groves and Edward O. Pyzer-Knapp. Efficient and scalable batch Bayesian optimization using k-means. arXiv preprint arXiv:1806.01159, 2018.
[21] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861-1869, 2015.
[22] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918-926, 2014.
[23] José Miguel Hernández-Lobato, James Requeima, Edward O. Pyzer-Knapp, and Alán Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1470-1479. JMLR.org, 2017.
[24] Matthew W Hoffman and Zoubin Ghahramani. Output-space predictive entropy search for flexible global optimization.
[25] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507-523. Springer, 2011.
[26] Janis Janusevskis, Rodolphe Le Riche, and David Ginsbourger. Parallel expected improvements for global optimization: summary, bounds and speed-up. Technical report, August 2011.
[27] Janis Janusevskis, Rodolphe Le Riche, David Ginsbourger, and Ramunas Girdziusas. Expected improvements for the asynchronous parallel global optimization of expensive functions: Potentials and challenges. In Learning and Intelligent Optimization, pages 413-418. Springer, 2012.
[28] Dipti Jasrasaria and Edward O. Pyzer-Knapp. Dynamic control of explore/exploit trade-off in Bayesian optimization. In Science and Information Conference, pages 1-15. Springer, 2018.
[29] Donald R Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345-383, 2001.
[30] Donald R Jones, Cary D Perttunen, and Bruce E Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157-181, 1993.
[31] Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 133-142, 2018.
[32] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[33] Donald H. Kraft. A software package for sequential quadratic programming. 1988.
[34] Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97-106, 1964.
[35] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402-6413, 2017.
[36] Daniel James Lizotte. Practical Bayesian optimization. University of Alberta, 2008.
[37] Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129):2, 1978.
[38] Warwick J Nash and Tasmania Marine Research Laboratories. The population biology of abalone (Haliotis species) in Tasmania. 1, Blacklip abalone (H. rubra) from the north coast and the islands of Bass Strait. 1994.
[39] V. Nguyen, S. Rana, S. K. Gupta, C. Li, and S. Venkatesh. Budgeted batch Bayesian optimization. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1107-1112, Dec 2016.
[40] Vu Nguyen, Sunil Gupta, Santu Rana, Cheng Li, and Svetha Venkatesh. Practical batch Bayesian optimization for less expensive functions. arXiv preprint arXiv:1811.01466, 2018.
[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[42] Geoff Pleiss, Jacob R Gardner, Kilian Q Weinberger, and Andrew Gordon Wilson. Constant-time predictive distributions for Gaussian processes. arXiv preprint arXiv:1803.06058, 2018.
[43] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[44] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
[45] Matthias Schonlau. Computer Experiments and Global Optimization. PhD thesis, Waterloo, Ont., Canada, 1997. AAINQ22234.
[46] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016.
[47] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959, 2012.
[48] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171-2180, 2015.
[49] I. M. Sobol. The distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7:86-112, 1967.
[50] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, pages 4134-4142, 2016.
[51] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
[52] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.
[53] Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. arXiv preprint arXiv:1706.01445, 2018.
[54] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. arXiv preprint arXiv:1703.01968, 2017.
[55] Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional Bayesian optimization via structural kernel learning. arXiv preprint arXiv:1703.01973, 2017.
[56] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775-1784, 2015.
[57] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370-378, 2016.