Journal of Computational Physics 425 (2021) 109901

Contents lists available at ScienceDirect

Journal of Computational Physics

www.elsevier.com/locate/jcp

Bayesian optimization with output-weighted optimal sampling

Antoine Blanchard, Themistoklis Sapsis ∗

Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, United States of America


Article history: Received 6 June 2020; Received in revised form 2 October 2020; Accepted 3 October 2020; Available online 8 October 2020

Keywords: Bayesian optimization; Optimal sampling; Extreme events

In Bayesian optimization, accounting for the importance of the output relative to the input is a crucial yet challenging exercise, as it can considerably improve the final result but often involves inaccurate and cumbersome entropy estimations. We approach the problem from the perspective of importance-sampling theory, and advocate the use of the likelihood ratio to guide the search algorithm towards regions of the input space where the objective function to minimize assumes abnormally small values. The likelihood ratio acts as a sampling weight and can be computed at each iteration without severely deteriorating the overall efficiency of the algorithm. In particular, it can be approximated in a way that makes the approach tractable in high dimensions. The “likelihood-weighted” acquisition functions introduced in this work are found to outperform their unweighted counterparts in a number of applications.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

Most optimization problems encountered in practical applications involve objective functions whose level of complexity prohibits use of classical optimization algorithms such as grid search, random search, and gradient-based methods (including Newton methods and conjugate gradient methods) [1,2]. In fact, so little is known about the internal workings of those systems that practitioners have no choice but to treat them as “black boxes”, for which a high evaluation cost adds to the issue of opacity, making optimization a daunting endeavor.

To solve such optimization problems while keeping the number of black-box evaluations at a minimum, one possibility is to use an iterative approach. In this area, Bayesian optimization has received a great deal of attention because of its ability to a) incorporate prior belief one may have about the black-box objective function, and b) explore the input space carefully, compromising between exploration and exploitation before each function evaluation [3,4].

A key component in Bayesian optimization lies in the choice of acquisition function, which is the ultimate decider as it commands where to next query the objective function. Acquisition functions come in many shapes and forms, ranging from traditional improvement-based [5,6,1] and optimistic policies [7,8] to more recent information-based strategies [9,2,10]. The latter differ from the former in that they account for the importance of the output relative to the input (usually through the entropy of the posterior distribution over the unknown minimizer), leading to significant gains when the objective function is noisy and multi-modal.

Despite superior empirical performance, information-based acquisition functions suffer from several shortcomings, including slow evaluation caused by heavy sampling requirements, laborious implementation, intractability in high dimensions, and limited choice of Gaussian process kernels [10–12]. The only exception of which we are aware is the fast information-theoretic Bayesian optimization (FITBO) algorithm [13], in which efficiency and flexibility come at the cost of a decline in performance, the latter being just “on par” with traditional improvement-based and optimistic policies.

* Corresponding author. E-mail addresses: [email protected] (A. Blanchard), [email protected] (T. Sapsis).

https://doi.org/10.1016/j.jcp.2020.109901
0021-9991/© 2020 Elsevier Inc. All rights reserved.

While it is clear that incorporating information about the output space in an acquisition function has significant merit, doing so while eliminating the above limitations calls for a different approach. Inspired by the theory of importance sampling [14], we propose to equip acquisition functions with the likelihood ratio, a quantity that accounts for the importance of the output relative to the input and which appears frequently in uncertainty quantification and experimental design problems [15,16]. The significance of the proposed approach is manifold:

1. The likelihood ratio acts as a probabilistic sampling weight and guides the algorithm towards regions of the input space where the objective function assumes abnormal values, which is beneficial in cases where the minima of the objective function correspond to rare and extreme output values.

2. In addition to output information, the likelihood ratio encapsulates any prior knowledge one may have about the distribution of the input, which has significant implications in problems related to uncertainty quantification and experimental design where the input space often comes equipped with a non-uniform probability distribution.

3. The likelihood ratio can be satisfactorily approximated with a Gaussian mixture model, making it possible to compute any “likelihood-weighted” acquisition function (and its gradients) analytically for a range of Gaussian process kernels and therefore allowing for the possibility of the input space being high-dimensional.

Before going further, a word of caution is in order regarding the alleged advantages of the proposed approach. Anyone engaged in solving black-box optimization problems must be mindful of the no-free-lunch theorem which states that “if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems” [17]. Our approach is no exception to the rule. It is expected to provide an advantage in situations where a) Bayesian optimization is a suitable candidate, and b) the global minimum of the objective function is “extreme”, that is, it is separated from the remainder of the optimization landscape by several standard deviations. In all other situations, the proposed approach may provide no advantage at all, and even perform worse than other algorithms such as genetic programming or random-restart hill climbing.

2. Formulation of the problem

2.1. A brief review of Bayesian optimization

We consider the problem of finding a global minimizer of a function f : Rd −→ R over a compact set X ⊂ Rd . This is usually written as

min_{x ∈ X} f(x). (1)

We assume that the objective function f is unknown and therefore treat it as a black box. In words, this means that f has no simple closed form, and neither do its gradients. This allows for the possibility of f being nonlinear, non-convex and multi-peak, although we do require that f be Lipschitz continuous to avoid pathological cases [3]. The objective function can, however, be evaluated at any arbitrary query point x ∈ X , each evaluation producing a potentially noise-corrupted output y ∈R. In this work, we model uncertainty in observations with additive Gaussian noise:

y = f(x) + ε, ε ∼ N(0, σε²). (2)

In practice, the function f can be a machine-learning algorithm (with x the hyper-parameters), a large-scale computer sim-ulation of a physical system (with x the physical parameters), or a field experiment (with x the experimental parameters). Evaluating f can thus be very costly, so to solve the minimization problem (1) each query point must be selected very meticulously.

Bayesian optimization is a sequential approach which, starting from an initial dataset of input–output pairs, iteratively probes the input space and, with each point visited, attempts to construct a surrogate model for the objective function. At each iteration, the “best next point” to visit is determined by minimizing an acquisition function a : Rd −→ R which serves as a guide for the algorithm as it explores the input space. After a specified number of iterations, the algorithm uses the surrogate model it has constructed to make a final recommendation for what it believes is the true minimizer of the objective function (Algorithm 1).

The two key issues in Bayesian optimization are the choice of surrogate model f and the choice of acquisition function a. The former is important because it represents our belief about what the objective function looks like given the data collected by the algorithm; the latter is important because it guides the algorithm in its exploration of the input space. For Bayesian optimization to provide any sort of advantage over a brute-force approach, the costs of constructing the surrogate model and optimizing the acquisition function must be small compared to that of evaluating the black-box function f.

Algorithm 1 Bayesian optimization.
Input: Number of initial points ninit, number of iterations niter
Initialize: Surrogate model f on initial dataset D0 = {xi, yi}, i = 1, ..., ninit
for n = 0 to niter do
    Select best next point xn+1 by minimizing the acquisition function:
        xn+1 = arg min_{x ∈ X} a(x; f, Dn)
    Evaluate objective function f at xn+1 and record yn+1
    Augment dataset: Dn+1 = Dn ∪ {xn+1, yn+1}
    Update surrogate model
end for
Return: Final recommendation from surrogate model:
    x∗ = arg min_{x ∈ X} f(x)
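The loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `acquisition` and `update_surrogate` are hypothetical stand-ins for the acquisition criterion and the GP refit, and minimization over a finite candidate set replaces the continuous arg min.

```python
import numpy as np

def bayes_opt_loop(f, acquisition, update_surrogate, x_init, y_init, n_iter, candidates):
    """Skeleton of Algorithm 1: pick the candidate minimizing the acquisition,
    evaluate the black box there, refit the surrogate, and repeat."""
    X, y = list(x_init), list(y_init)
    model = update_surrogate(np.array(X), np.array(y))
    for _ in range(n_iter):
        scores = [acquisition(x, model) for x in candidates]
        x_next = candidates[int(np.argmin(scores))]   # best next point
        y_next = f(x_next)                            # costly black-box evaluation
        X.append(x_next)
        y.append(y_next)
        model = update_surrogate(np.array(X), np.array(y))
    i_best = int(np.argmin(y))
    return X[i_best], y[i_best]                       # final recommendation
```

In practice the surrogate would be a Gaussian process and the acquisition one of the criteria of Section 2.3; any callables with the same signatures will do for the skeleton.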

2.2. Model selection

Many different surrogate models have been used in Bayesian optimization, with various levels of success [4]. In this work, we use a non-parametric Bayesian approach based on Gaussian process (GP) regression [18]. This choice is appropriate because Gaussian processes a) are agnostic to the details of the black box, and b) provide a way to quantify uncertainty associated with noisy observations [19–21].

A Gaussian process f(x) is completely specified by its mean function m(x) and covariance function k(x, x′). For a dataset D of input–output pairs (written in matrix form as {X, y}) and a Gaussian process with constant mean m0, the random process f(x) conditioned on D follows a normal distribution with posterior mean and variance

μ(x) = m0 + k(x,X) K⁻¹ (y − m0), (3a)

σ²(x) = k(x,x) − k(x,X) K⁻¹ k(X,x), (3b)

respectively, where K = k(X,X) + σε² I. Equation (3a) can be used to predict the value of the surrogate model at any point x, and (3b) to quantify uncertainty in prediction at that point [18].

In GP regression, the choice of covariance function is crucial, and in what follows we use the radial-basis-function (RBF) kernel with automatic relevance determination (ARD):

k(x,x′) = σf² exp[−(x − x′)ᵀ Θ⁻¹ (x − x′)/2], (4)

where Θ is a diagonal matrix containing the lengthscales for each dimension. Other choices of kernel are possible (e.g., linear kernel or Matérn kernel), but we will see that the RBF kernel has several desirable properties that will come in handy when designing our algorithm.

Exact inference in Gaussian process regression requires inverting the matrix K, typically at each iteration. This is usually done by Cholesky decomposition, whose cost scales like O(n³), with n the number of observations [18]. (A cost of O(n²) can be achieved if the parameters of the covariance function are fixed.) Although an O(n³) scaling may seem daunting, it is important to note that in Bayesian optimization the number of observations (i.e., function evaluations) rarely exceeds a few dozen (or perhaps a few hundred), as an unreasonably large number of observations would defeat the whole purpose of the algorithm.
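Equations (3a)-(3b) with the ARD kernel (4) can be sketched directly in NumPy. This is an illustrative implementation, not the authors' code; it assumes the convention Θ = diag(ℓ²) for the lengthscale matrix and a scalar lengthscale broadcast across dimensions by default.

```python
import numpy as np

def rbf_ard(X1, X2, sigma_f, ell):
    """ARD RBF kernel, Eq. (4), with Theta = diag(ell**2)."""
    d = (X1[:, None, :] - X2[None, :, :]) / ell
    return sigma_f ** 2 * np.exp(-0.5 * np.sum(d ** 2, axis=-1))

def gp_posterior(x, X, y, m0=0.0, sigma_f=1.0, ell=0.2, noise=1e-3):
    """Posterior mean and variance of the GP surrogate, Eqs. (3a)-(3b),
    with K = k(X,X) + noise * I."""
    K = rbf_ard(X, X, sigma_f, ell) + noise * np.eye(len(X))
    k_xX = rbf_ard(x, X, sigma_f, ell)
    mu = m0 + k_xX @ np.linalg.solve(K, y - m0)                     # Eq. (3a)
    var = rbf_ard(x, x, sigma_f, ell).diagonal() \
        - np.sum(k_xX * np.linalg.solve(K, k_xX.T).T, axis=1)       # Eq. (3b)
    return mu, var
```

Far from the data the posterior reverts to the prior (mean m0, variance σf²), while at observed points the mean interpolates the data up to the noise level.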

2.3. Acquisition functions for Bayesian optimization

The acquisition function is at the core of the Bayesian optimization algorithm, as it solely determines the points at which to query the objective function. The role of the acquisition function is to find a compromise between exploration (i.e., visiting regions where uncertainty is high) and exploitation (i.e., visiting regions where the surrogate model predicts small values). In this work, we consider the following three classical acquisition functions.

Probability of Improvement (PI). Given the current best observation y∗, PI attempts to maximize

aPI(x) = Φ(λ(x)), (5)

where λ(x) = [y∗ − μ(x) − ξ]/σ(x), Φ is the cumulative distribution function of the standard normal distribution, and ξ ≥ 0 is a user-defined parameter that controls the trade-off between exploration and exploitation [1].


Expected Improvement (EI). Arguably the most popular acquisition function for Bayesian optimization, EI improves on PI in that it also accounts for how much improvement a point can potentially yield [1]:

aEI(x) = σ(x) [λ(x)Φ(λ(x)) + φ(λ(x))], (6)

where φ is the probability density function (pdf) of the standard normal distribution.

Lower Confidence Bound (LCB). In LCB, the best next point is selected based on the bandit strategy of Srinivas et al. [7]:

aLCB(x) = μ(x) − κσ(x), (7)

where κ ≥ 0 is a user-specified parameter that balances exploration (large κ) and exploitation (small κ).

The popularity of PI, EI and LCB can be largely explained by the facts that a) implementation is straightforward, b) evaluation is inexpensive, and c) gradients are readily accessible, opening the door to gradient-based optimizers. The combination of these features makes the search for the best next point considerably more efficient than otherwise.
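Given the posterior mean μ(x) and standard deviation σ(x), the three classical criteria (5)-(7) reduce to a few lines. The sketch below keeps the paper's conventions for ξ and κ; PI and EI are to be maximized, LCB minimized (cf. Table 1).

```python
import numpy as np
from scipy.stats import norm

def acq_pi(mu, sigma, y_best, xi=0.01):
    """Probability of Improvement, Eq. (5): to be maximized."""
    lam = (y_best - mu - xi) / sigma
    return norm.cdf(lam)

def acq_ei(mu, sigma, y_best, xi=0.01):
    """Expected Improvement, Eq. (6): to be maximized."""
    lam = (y_best - mu - xi) / sigma
    return sigma * (lam * norm.cdf(lam) + norm.pdf(lam))

def acq_lcb(mu, sigma, kappa=1.0):
    """Lower Confidence Bound, Eq. (7): to be minimized."""
    return mu - kappa * sigma
```

All three accept scalars or NumPy arrays, so they can be evaluated on a grid of query points in one call.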

It is important to note that the rationale behind LCB is to “repurpose” a purely explorative acquisition function which reduces uncertainty globally, σ(x), into one suitable for Bayesian optimization which is more aggressive towards minima. This is done by appending μ(x) and introducing the trade-off parameter κ , as done in (7). Following the same logic, we repurpose the Integrated Variance Reduction (IVR) of Sacks et al. [22],

aIVR(x) = (1/σ²(x)) ∫ cov²(x,x′) dx′, (8)

into

aIVR-BO(x) = μ(x) − κ aIVR(x), (9)

where the suffix “BO” in “IVR-BO” stands for “Bayesian optimization”. In the above, cov(x,x′) denotes the posterior covariance between x and x′. The formula for IVR involves an integral over the input space, but for the RBF kernel this integral can be computed analytically, and the same is true of its gradients (Appendix A). Therefore, IVR-BO retains the three important features discussed earlier for PI, EI and LCB.
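For kernels without a closed-form integral, the integral in (8) can be approximated by Monte Carlo over the (rescaled) unit hypercube. This is a hedged sketch, not the analytic route of Appendix A; `post_var` and `post_cov` are hypothetical callables returning the posterior variance at x and the posterior covariances between x and a batch of points x′.

```python
import numpy as np

def acq_ivr_mc(x, post_var, post_cov, n_mc=2000, d=1, seed=0):
    """Monte Carlo estimate of Eq. (8): average cov^2(x, x') over uniform
    samples x' in [0,1]^d, divided by the posterior variance at x."""
    rng = np.random.default_rng(seed)
    Xp = rng.random((n_mc, d))        # uniform samples over the unit hypercube
    c2 = post_cov(x, Xp) ** 2
    return c2.mean() / post_var(x)
```

The analytic expressions of Appendix A should be preferred in practice; the Monte Carlo form is mainly useful for sanity checks.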

3. Bayesian optimization with output-weighted optimal sampling

In this section, we introduce a number of novel acquisition functions for Bayesian optimization, each having the following features: a) they leverage previously collected information by assigning more weight to regions of the search space where the objective function assumes abnormally small values; b) they incorporate a prior px over the search space, allowing the algorithm to focus on regions of potential relevance; and c) their computational complexity does not negatively affect the overall efficiency of the sequential algorithm.

3.1. Likelihood-weighted acquisition functions

To construct these acquisition functions, we proceed in two steps. First, we introduce a purely explorative acquisition function that focuses on abnormal output values without any distinction being made between maxima and minima. Then, we repurpose this acquisition function as was done for LCB and IVR-BO. The repurposed criterion is suitable for Bayesian optimization as it specifically targets abnormally small output values (i.e., extreme minima).

When the objective function assumes rare and extreme (i.e., abnormally large or small) output values, the conditional pdf of the output p f|x is heavy-tailed. (A pdf is heavy-tailed when at least one of its tails is not exponentially bounded.) Heavy-tailed distributions are the manifestation of high-impact events occurring with low probability, and as a result they commonly arise in the study of risk [23] and extreme events [24].

One possible strategy is for the sequential algorithm to use the pdf of the GP mean pμ as a proxy for p f and select the best next point so that uncertainty in pμ is most reduced. The latter can be quantified by

aL(x) = ∫ |log pμ+(y) − log pμ−(y)| dy, (10)

where μ±(x′;x) denotes the upper and lower confidence bounds at x′ had the data point {x, μ(x)} been collected, that is, μ±(x′;x) = μ(x′) ± σ²(x′;x). The use of logarithms in (10) places extra emphasis on the pdf tails where extreme minima and maxima “live” [25].

The above metric enjoys attractive convergence properties [25] but is cumbersome to compute (not to mention optimize) and intractable in high dimensions. Therefore, we show that aL(x) is bounded above (up to a multiplicative constant) by

aB(x) = ∫ σ²(x′;x) px(x′)/pμ(μ(x′)) dx′. (11)


Equation (11) is a significant improvement over (10) from the standpoint of reducing complexity. More importantly, it reveals an unexpected connection between the metric aL (whose primary focus is the reduction of uncertainty in pdf tails) and the IVR acquisition function in (8). Indeed, it only takes a few lines to show that aB (x) is strictly equivalent to

aIVR-LW(x) = (1/σ²(x)) ∫ cov²(x,x′) px(x′)/pμ(μ(x′)) dx′, (12)

which is clearly a cousin of (8), with the ratio px(x)/pμ(μ(x)) playing the role of a sampling weight (Appendix B). The suffix “LW” in “IVR-LW” stands for “likelihood-weighted”.
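The Monte Carlo sketch given earlier for (8) extends to (12) by weighting each sample x′ with the likelihood ratio. As before, this is an illustration rather than the analytic route of Appendix C; `post_var`, `post_cov` and `w` are hypothetical callables, with `w` returning px(x′)/pμ(μ(x′)).

```python
import numpy as np

def acq_ivr_lw_mc(x, post_var, post_cov, w, n_mc=2000, d=1, seed=0):
    """Monte Carlo estimate of Eq. (12): same as IVR, Eq. (8), but each
    sample x' is weighted by the likelihood ratio w(x')."""
    rng = np.random.default_rng(seed)
    Xp = rng.random((n_mc, d))
    return (post_cov(x, Xp) ** 2 * w(Xp)).mean() / post_var(x)
```

With w ≡ 1 the estimator reduces to the unweighted IVR estimate, which makes the relationship between (8) and (12) explicit.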

With (12) in hand, two remarks are in order. First, we can establish the convergence of (12) by recognizing that there exists a constant M > 0 such that for all x ∈Rd ,

0 ≤ px(x)/pμ(μ(x)) ≤ M. (13)

Therefore, we have that 0 ≤ aIVR-LW(x) ≤ MaIVR(x); and since aIVR goes to zero in the limit of infinitely many observations, so does aIVR-LW . Second, it is important to note that (12) is a purely explorative acquisition function which does not discriminate between extreme minima (i.e., heavy left tail) and extreme maxima (i.e., heavy right tail).

To make IVR-LW suitable for Bayesian optimization, we repurpose (12) as

aIVR-LWBO(x) = μ(x) − κ aIVR-LW(x), (14)

which specifically targets extreme minima, consistent with (1). By the same logic, we also introduce

aLCB-LW(x) = μ(x) − κσ(x) px(x)/pμ(μ(x)) (15)

as the likelihood-weighted counterpart to LCB. Here, it is natural to incorporate the ratio px(x)/pμ(μ(x)) in a product with the explorative term σ(x). This is in the same spirit as (11) where that ratio naturally appears as a sampling weight for the posterior variance.

3.2. The role of the likelihood ratio

In the importance-sampling literature, the ratio

w(x) = px(x)/pμ(μ(x)) (16)

is referred to as the likelihood ratio [14]. (In that context, the distributions px and pμ are referred to as the “nominal distribution” and “importance distribution”, respectively.) The likelihood ratio is important in cases where some points are more important than others in determining the value of the output. For points with similar probability of being observed “in the wild” (i.e., same px), the likelihood ratio assigns more weight to those that have a large impact on the magnitude of output (i.e., small pμ). For points with similar impact on the output (i.e., same pμ), it promotes those with higher probability of occurrence (i.e., large px). In other words, the likelihood ratio favors points for which the magnitude of the output is unusually large over points associated with frequent, average output values.

In Bayesian optimization, the likelihood ratio can be beneficial in at least two ways. First, it should improve performance in situations in which the global minimum of the objective function “lives” in the (heavy) left tail of the output pdf p f . Second, the likelihood ratio makes it possible to distill any prior knowledge one may have about the distribution of the input. For “vanilla” Bayesian optimization, it is natural to use a uniform prior for px because in general any point x is as good as any other. But if the optimization problem arises in the context of uncertainty quantification of a physical system (where generally the input space comes equipped with a non-uniform probability distribution), or if one has prior beliefs about where the global minimizer may be located, then use of a non-trivial px has the potential to considerably improve performance of the search algorithm.

As far as we know, use of a prior on the input space is virtually unheard of in Bayesian optimization, the only exceptions being the works of Bergstra et al. [26] and Oliveira et al. [27]. There, a prior is placed on the input space in order to account for localization noise, i.e., the error in estimating the query location x with good accuracy. This is quite different from our approach, in which the query locations are assumed to be known with exactitude and the input prior is used to highlight certain regions of the search space before the search is initiated. We also note that in GP regression, prior beliefs about the objective function are generally encoded in the covariance function, whose selection is a delicate matter even for the experienced practitioner. Using a prior on the input space may be viewed as a simple way of encoding structure without having to resort to convoluted GP kernels.


Fig. 1. Contour plots of the objective function (left panel), the likelihood ratio with uniform px (center panel), and the GMM approximation (right panel). We used nGMM = 2 for the 2-D Ackley function and nGMM = 4 for the 2-D Michalewicz function.

3.3. Approximation of the likelihood ratio

We must ensure that introduction of the likelihood ratio does not compromise our ability to compute the acquisition functions and their gradients efficiently. We first note that to evaluate the likelihood ratio, we must estimate the conditional pdf of the posterior mean pμ , typically at each iteration. This is done by computing μ(x) for a large number of input points and applying kernel density estimation (KDE) to the resulting samples. Fortunately, KDE is to be performed in the one-dimensional output space, allowing use of fast FFT-based algorithms which scale linearly with the number of samples [28].
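The per-iteration weight computation described above can be sketched with SciPy's Gaussian KDE. (The paper uses a fast FFT-based estimator [28]; `gaussian_kde` is a convenient stand-in.) Here `mu_fn` and `px_fn` are hypothetical callables for the posterior mean and the input pdf, evaluated on a batch of input samples.

```python
import numpy as np
from scipy.stats import gaussian_kde

def likelihood_ratio(mu_fn, px_fn, X_samples):
    """Compute w(x) = p_x(x) / p_mu(mu(x)), Eq. (16), with p_mu estimated
    by kernel density estimation in the one-dimensional output space."""
    mu_vals = mu_fn(X_samples)            # posterior mean at many input points
    p_mu = gaussian_kde(mu_vals)          # 1-D KDE of the posterior-mean samples
    return px_fn(X_samples) / p_mu(mu_vals)
```

Because the KDE is one-dimensional regardless of the input dimension d, this step stays cheap even for high-dimensional search spaces.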

Yet, the issue remains that in IVR-LW(BO), the likelihood ratio appears in an integral over the input space. To avoid resorting to Monte Carlo integration, we approximate w(x) with a Gaussian mixture model (GMM):

w(x) ≈ ∑_{i=1}^{nGMM} αi N(x; ωi, Σi). (17)

The GMM approximation has two advantages. First, when combined with the RBF kernel, the problematic integrals and their gradients become analytic (Appendix C). This is important because it makes the approach tractable in high dimensions, a situation with which sampling-based acquisition functions are known to struggle. Second, the GMM approximation imposes no restriction on the nature of px and pμ . This is an improvement over the approach of Sapsis [29] whose approximation of 1/pμ by a second-order polynomial led to the requirement that px be Gaussian or uniform.

The number of Gaussian mixtures to be used in (17) is at the discretion of the user. In this work, nGMM is kept constant throughout the search, although the option exists to modify it “on the fly”, either according to a predefined schedule or by selecting the value of nGMM that minimizes the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) at each iteration [30]. We note that the optimal value of nGMM might be unreasonably large for very high-dimensional systems. Should that happen, we recommend prescribing a threshold value for nGMM not to exceed. The resulting mixture model will be suboptimal, but this is a small price to pay to alleviate the curse of dimensionality given that the extreme regions need not be localized with pinpoint accuracy. (A coarse representation of these regions suffices.)
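One way to realize the fit in (17), not necessarily the authors' exact procedure, is to resample input points with probability proportional to w(x) and fit a scikit-learn Gaussian mixture to the resampled set; the mixture then concentrates where the likelihood ratio is large.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_w_gmm(X, w, n_gmm=2, n_resample=5000, seed=0):
    """Approximate the likelihood ratio w(x) by a Gaussian mixture, Eq. (17):
    resample inputs with weights proportional to w, then fit an
    n_gmm-component GMM to the resampled points."""
    rng = np.random.default_rng(seed)
    p = w / w.sum()
    idx = rng.choice(len(X), size=n_resample, p=p, replace=True)
    return GaussianMixture(n_components=n_gmm, random_state=0).fit(X[idx])
```

As noted above, a coarse representation suffices: the mixture need only highlight the extreme regions, not resolve them with pinpoint accuracy.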

We illustrate the benefits of using the likelihood ratio in the 2-D Ackley and Michalewicz functions, two test functions notoriously challenging for optimization algorithms. (Analytical expressions are given in Appendix D.) For these functions, Fig. 1 makes it visually clear that the likelihood ratio gives more emphasis to the region where the global minimum is located (i.e., the center region for the Ackley function, and the slightly off-center region for the Michalewicz function). Figure 1 also shows that w(x) can be approximated satisfactorily with a small number of Gaussian mixtures, a key prerequisite for preserving algorithm efficiency.

We summarize our algorithm for computation of likelihood-weighted acquisition functions in Algorithm 2. Up to now we have adhered to conventional notation at the expense of making it clear whether the acquisition functions should be minimized or maximized. The summary presented in Table 1 should dissipate any ambiguity.

Algorithm 2 Likelihood-weighted acquisition function.
Input: Posterior mean μ(x), input pdf px, functional form for LW acquisition function a(x; w(x)), number of Gaussian mixtures nGMM
do
    Sample posterior mean μ(x)
    Estimate pμ by KDE
    Compute w(x) = px(x)/pμ(μ(x))
    wGMM(x) ← Fit GMM to w(x)
end do
Return: a(x; wGMM(x)) and gradients in analytic form


Table 1
Summary of the acquisition functions considered in this work.

Acquisition function      Equation       Rule
PI                        (5)            maximize
EI                        (6)            maximize
LCB(-LW)                  (7), (15)      minimize
IVR(-LW)                  (8), (12)      maximize
IVR-BO, IVR-LWBO          (9), (14)      minimize

4. Results

4.1. Experimental protocol

To demonstrate the benefits of the likelihood ratio, we conduct a series of numerical experiments with the acquisition functions introduced in Section 3. Specifically, we compare EI, PI, LCB(-LW), IVR(-BO), and IVR-LW(BO). We include IVR and IVR-LW in this list because they provide insight about the behavior of IVR-BO and IVR-LWBO, respectively, in the limit of large κ .

For each example considered, we indicate the number of Gaussian mixtures used in the GMM approximation. We use ξ = 0.01 for EI and PI, and κ = 1 for LCB, IVR and their respective variants. (Although not considered in this work, use of a schedule for κ has the potential to favorably affect the algorithm [7].) We do not set the noise variance σε², but rather let the algorithm learn it from data.

When the location of the global minimum xtrue and corresponding function value ytrue are known (as in Section 4.2), we report the simple regret

r(n) = min_{k∈[0,n]} f(x∗k) − ytrue, (18)

where x∗k denotes the optimizer recommendation at iteration k, and n is the index of the current iteration [12]. Because several of the functions considered hereafter are highly multimodal and oscillatory, we also report the distance

ℓ(n) = min_{k∈[0,n]} ‖xtrue − x∗k‖2, (19)

as in Ru et al. [13]. (When the function has multiple global minimizers, we compute ℓ for each minimizer and report the smallest value.) When the global minimizers of the objective function are not known a priori (as in Section 4.3), we use (18) with ytrue = 0, which we complement with the observation regret

ro(n) = min_{yi ∈ Dn} yi. (20)
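The three metrics (18)-(20) are running minima and can be computed directly from the recommendation and observation histories. A sketch, with hypothetical history arguments:

```python
import numpy as np

def simple_regret(recommendations, f, y_true):
    """Eq. (18): running minimum of f at the recommendations, minus y_true."""
    vals = [f(x) for x in recommendations]
    return np.minimum.accumulate(vals) - y_true

def distance_to_minimizer(recommendations, x_true):
    """Eq. (19): smallest Euclidean distance to the true minimizer so far."""
    d = [np.linalg.norm(np.asarray(x_true) - np.asarray(x)) for x in recommendations]
    return np.minimum.accumulate(d)

def observation_regret(y_observed):
    """Eq. (20): running minimum of the observed outputs."""
    return np.minimum.accumulate(y_observed)
```

All three return one value per iteration, so they can be plotted directly against the iteration index as in Fig. 2.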

For each example considered, we run 100 Bayesian experiments, each differing in the choice of initial points, and report the median for the metrics introduced above. (The shaded error bands indicate a quarter of the median absolute deviation.) Experiments were performed on a computer cluster equipped with 40 Intel Xeon E5-2630v4 processors clocking at 2.2 GHz. Our code is available on GitHub.1

4.2. Benchmark of test functions

We begin with a benchmark of six test functions commonly used to evaluate the performance of optimization algorithms (Appendix D). The Branin and 6-D Hartmann functions are moderately difficult, while the other four are particularly challenging: the Ackley function features numerous local minima found in a nearly flat region surrounding the global minimum; the Bukin function is non-differentiable, with a steep ridge filled with local minima; and the Michalewicz functions have steep valleys and ridges, with a global minimum accounting for a tiny fraction of the search space.

For each test function, the input space is rescaled to the unit hypercube so as to facilitate training of the GP hyper-parameters. The noise variance is specified as σ_ε² = 10^−3 and appropriately rescaled to account for the variance of the objective function. We use n_init = 3 for the 2-D functions and n_init = 10 otherwise. In each case the initial points are selected by Latin hypercube sampling (LHS).
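A minimal sketch of this initialization using SciPy's quasi-Monte Carlo module (the two-dimensional domain below is illustrative, not tied to a specific benchmark):

```python
import numpy as np
from scipy.stats import qmc

rng_seed = 0
bounds = np.array([[-5.0, 10.0], [0.0, 15.0]])  # illustrative 2-D search domain

# Draw n_init = 3 points by Latin hypercube sampling on the unit hypercube,
# then map them to the physical domain.
sampler = qmc.LatinHypercube(d=2, seed=rng_seed)
unit_pts = sampler.random(n=3)
init_pts = qmc.scale(unit_pts, bounds[:, 0], bounds[:, 1])
```

Training the GP on the unit hypercube (i.e., on `unit_pts`) keeps the lengthscale hyper-parameters well scaled regardless of the physical units of the problem.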

For uniform p_x and n_GMM = 2, Fig. 2 shows that the LW acquisition functions systematically and substantially outperform their unweighted counterparts in identifying the location and objective value of the global minimum. The only exception is the Branin function, for which the likelihood ratio appears to have no positive effect. For a possible explanation, consider that the output pdf of the Branin function has a very light left tail; by contrast, the other objective functions

1 https://github.com/ablancha/gpsearch.


Fig. 2. Performance of EI, PI, IVR(-BO), IVR-LW(BO) and LCB(-LW) for six benchmark test functions.

have heavy left tails (Fig. D.5). This observation is consistent with the derivation in Section 3.1 in which it was shown that the LW acquisition functions primarily target abnormally small output values, strongly suggesting that a heavy left tail is a prerequisite for LW acquisition functions to be competitive.

Figure 2 confirms that the likelihood ratio allows the algorithm to explore the search space more efficiently by focusing on regions where the magnitude of the objective function is thought to be unusually large. For a clear manifestation of this, consider the left panel in Fig. 2(c), where the three LW acquisition functions dramatically outclass the competition, a consequence of the fact that the likelihood ratio helps the algorithm target the ridge of the Bukin function much more rapidly and thoroughly than otherwise.

Figures 2(e) and 2(f) demonstrate the superiority of our approach in high dimensions. Overall, Fig. 2 suggests that the best-performing acquisition functions are LCB-LW and IVR-LWBO, solidifying the utility of the likelihood ratio in Bayesian optimization. Figure 2 also makes clear that there is no “warm-up” period for the LW acquisition functions. That the latter “zero in” much faster than the competition is important, since the power of Bayesian optimization lies in keeping the number of black-box evaluations at a minimum.


Fig. 3. Dynamical system with extreme events.

We have investigated how computation of the likelihood ratio affects algorithm efficiency (Appendix E). We have found that the benign increase in runtime associated with Algorithm 2 a) can be largely mitigated if sampling of the posterior mean μ(x) is done frugally, and b) is inconsequential when each black-box query takes hours or days, as in many practical applications.

4.3. Computation of precursors for extreme events in dynamical systems

For a real-world application, we consider the problem of predicting the occurrence of extreme events in dynamical systems. This is a topic worthy of investigation because extreme events (e.g., earthquakes, road accidents, wildfires) have the potential to cause significant damage to people, infrastructure, and nature [24,31,32]. From a prediction standpoint, the central issue is to identify precursors, i.e., those states of the system which are most likely to lead to an extreme event in the near future. But searching for precursors is no easy task, because extreme events often arise in highly complex dynamical systems, a difficulty compounded by their low frequency of occurrence. To combat this, one can use Bayesian optimization to parsimoniously probe the state space of the system and thus identify “dangerous” regions with as little data as possible.

Formally, the dynamical system is treated as a black box, which assigns to an initial condition x_0 a measure of dangerousness, e.g.,

F : R^d → R,  x_0 ↦ max_{t ∈ [0,τ]} G(S_t(x_0)).   (21)

Here, t denotes the time variable, S_t the flow map of the system (i.e., the dynamics of the black box), G : R^d → R the observable of interest, and τ the time horizon over which prediction is to be performed. In words, F records the maximum value attained by the observable G during the time interval [0, τ] given initial condition x_0. The role of Bayesian optimization is to search for those x_0 that give rise to large values of F, indicative of an extreme event occurring within the next τ time units.

In practice, the algorithm does not explore the whole space of initial conditions, as this would allow sampling of unrealistic initial conditions. Instead, one may view extreme events as excursions from a “background” attractor [33], for which a low-dimensional representation can be constructed by principal component analysis (PCA). Searching for precursors in the PCA subspace is recommended when the dimension d is unfathomably large (e.g., when the state vector arises from discretizing a partial differential equation). Another advantage is that the PCA subspace comes equipped with the Gaussian prior p_x(x) = N(x; 0, Λ), where the diagonal matrix Λ contains the PCA eigenvalues.
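As an illustration of this construction, the sketch below (with a stand-in snapshot matrix; not the authors' implementation) builds the PCA subspace and the Gaussian prior N(0, Λ) from trajectory data:

```python
import numpy as np

# Stand-in snapshot matrix: rows are states sampled along a long trajectory.
# A real application would collect these from the dynamical system itself.
rng = np.random.default_rng(0)
snapshots = rng.standard_normal((2000, 3)) * np.array([2.0, 1.0, 0.1])

# PCA via SVD of the centered snapshot matrix.
mean = snapshots.mean(axis=0)
U, s, Vt = np.linalg.svd(snapshots - mean, full_matrices=False)

n_pc = 2                                   # keep the leading two components
modes = Vt[:n_pc]                          # rows span the PCA subspace
Lambda = np.diag(s[:n_pc] ** 2 / (len(snapshots) - 1))  # PCA eigenvalues

def lift(xi):
    """Map reduced coordinates xi to a full-space initial condition x0."""
    return mean + xi @ modes
```

The search then runs over the reduced coordinates ξ, with prior N(0, Λ), and `lift` supplies the full-space initial condition passed to the black box.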

We consider the dynamical system introduced by Farazmand and Sapsis [34] in the context of extreme-event prediction in turbulent flow. The governing equations are given by

ẋ = αx + ωy + αx² + 2ωxy + z²,   (22a)

ẏ = −ωx + αy − ωx² + 2αxy,   (22b)

ż = −λz − (λ + β)xz,   (22c)

with parameters α = 0.01, ω = 2π , and λ = β = 0.1. (This is an example of a Shilnikov system operated backward in time [35].) Figure 3(a) shows that the system features successive “cycles” during which a trajectory initiated close to the origin spirals away towards the point (−1, 0, 0), only to find itself swiftly repelled from the z = 0 plane. After some hovering about, the trajectory ultimately heads back to the origin, and the cycle repeats itself. Here, extreme events correspond to bursts in the z coordinate, as shown in Fig. 3(b).
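As a sketch of the black box (21) for this system, one may integrate (22) numerically and record the maximum of the z coordinate over the prediction horizon (parameter values from the text; the solver tolerances and sampling grid are illustrative choices):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Parameters of system (22).
alpha, omega, lam, beta = 0.01, 2.0 * np.pi, 0.1, 0.1

def rhs(t, u):
    """Right-hand side of (22a)-(22c)."""
    x, y, z = u
    return [alpha * x + omega * y + alpha * x**2 + 2.0 * omega * x * y + z**2,
            -omega * x + alpha * y - omega * x**2 + 2.0 * alpha * x * y,
            -lam * z - (lam + beta) * x * z]

def F(x0, tau=50.0):
    """Black box (21): max of the observable G = e_z^T over [0, tau]."""
    t_eval = np.linspace(0.0, tau, 2001)   # dense grid for the running maximum
    sol = solve_ivp(rhs, (0.0, tau), x0, t_eval=t_eval, rtol=1e-8, atol=1e-10)
    return float(np.max(sol.y[2]))
```

Bayesian optimization is then applied to −F over the (low-dimensional) space of initial conditions.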

Fig. 4. Performance of EI, PI, IVR(-BO), IVR-LW(BO) and LCB(-LW) for computation of extreme-event precursor.

To identify precursors for these bursts, we apply Bayesian optimization to the function −F with observable G = e_z^T. The background attractor is approximated with the leading two principal components, which roughly span the (x, y) plane. We use τ = 50 for the prediction horizon, σ_ε² = 10^−3 for the noise variance, and two Gaussian mixtures for the GMM fit. The search algorithm is initialized with three points sampled from an LHS design. To avoid the sampling of unrealistic initial conditions, we require that x_0 lie no further than four PCA standard deviations in any direction.

Figure 4 shows that here again the LW acquisition functions trump their unweighted cousins by a significant margin. The two best-performing acquisition functions are again LCB-LW and IVR-LWBO, with the former predicting an objective value (about 0.98 in Fig. 4) within a few percent of the observation (about 0.95 in Fig. 3(b)). We note that use of the Gaussian prior for p_x may lead to even greater gains when the dynamical system features multiple dangerous regions, with some more “exotic” than others. We also note that uncertainty in observations (encapsulated in σ_ε²) may be interpreted as feedback from unresolved scales in a turbulent flow, or imperfections in the experimental apparatus. In an experiment, the question of localization noise (i.e., the error in estimating or realizing a particular state x_0) is also highly relevant.

5. Conclusions

We have investigated the effect of the likelihood ratio, a quantity that accounts for the importance of the output of an objective function relative to its input, on the performance of Bayesian optimization algorithms. We have shown that use of the likelihood ratio in an acquisition function can dramatically improve algorithm efficiency, with faster convergence seen in a number of synthetic test functions. We have proposed an approximation of the likelihood ratio as a superposition of Gaussian mixtures to make the approach tractable in high dimensions. We have successfully applied the proposed method to the problem of extreme-event prediction in complex dynamical systems.

While in principle the proposed approach can be applied to any optimization problem, it is expected to provide the greatest gains in situations where Bayesian optimization is a reasonable candidate and the global minimum of the objective function is several standard deviations away from the remainder of the optimization landscape. Applications of potential interest to the practitioner include prediction of extreme events and identification of associated precursors in turbulent flow [36], active control of a turbulent jet for enhanced mixing [37], optimal path planning of autonomous vehicles for anomaly detection in environment exploration [38], and hyper-parameter training of deep-learning algorithms [39].

CRediT authorship contribution statement

Antoine Blanchard: Conceptualization, Data curation, Methodology, Software, Writing - original draft. Themistoklis Sapsis: Conceptualization, Funding acquisition, Methodology, Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors acknowledge support from the Army Research Office (Grant No. W911NF-17-1-0306) and the 2020 MathWorks Faculty Research Innovation Fellowship.


Appendix A. Analytical expressions for the IVR acquisition function with RBF kernel

We first rewrite the formula for IVR using the GP expression for the posterior covariance:

σ²(x) a_IVR(x) = ∫ [k(x, x′) − k(x, X) K⁻¹ k(X, x′)]² dx′   (A.1a)

= ∫ k(x, x′) k(x′, x) dx′
  + k(x, X) K⁻¹ [∫ k(X, x′) k(x′, X) dx′] K⁻¹ k(X, x)
  − 2 k(x, X) K⁻¹ ∫ k(X, x′) k(x′, x) dx′.   (A.1b)

If we introduce

k̄(x_1, x_2) = ∫ k(x_1, x′) k(x′, x_2) dx′,   (A.2)

then (A.1b) can be rewritten as

σ²(x) a_IVR(x) = k̄(x, x) + k(x, X) K⁻¹ [k̄(X, X) K⁻¹ k(X, x) − 2 k̄(X, x)].   (A.3)

This shows that to compute IVR and its gradients, we only need a mechanism to compute (A.2) and its gradients, regardless of the choice of GP kernel.

For the RBF kernel

k(x, x′; Θ) = σ_f² exp[−(x − x′)ᵀ Θ⁻¹ (x − x′)/2],   (A.4)

we have

k̄(x_1, x_2) = σ_f² π^{d/2} |Θ|^{1/2} k(x_1, x_2; 2Θ)   (A.5a)

and

dk̄/dx_1 = −k̄(x_1, x_2) (x_1 − x_2)ᵀ (2Θ)⁻¹.   (A.5b)

For further details, we refer the reader to McHutchon [40].
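The closed form (A.5a) is straightforward to verify numerically; the one-dimensional sketch below (with illustrative parameter values) compares it against direct quadrature of (A.2):

```python
import numpy as np
from scipy.integrate import quad

# Illustrative RBF parameters; theta plays the role of the (scalar) matrix Theta.
sigma_f2, theta = 1.3, 0.7

def k(a, b, th):
    """RBF kernel (A.4) in one dimension."""
    return sigma_f2 * np.exp(-(a - b) ** 2 / (2.0 * th))

x1, x2 = 0.4, -0.9

# Direct quadrature of (A.2) versus the closed form (A.5a) with d = 1.
numeric, _ = quad(lambda xp: k(x1, xp, theta) * k(xp, x2, theta),
                  -np.inf, np.inf)
analytic = sigma_f2 * np.sqrt(np.pi * theta) * k(x1, x2, 2.0 * theta)
```

The two values agree to quadrature accuracy, confirming the Gaussian convolution identity behind (A.5a).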

Appendix B. Mathematical derivation of IVR-LW

We begin with Theorem 1 of Mohamad and Sapsis [25] which states that for small enough σ(x),

aL(x) ≈∫Y

1

pμ(y)

∣∣∣∣ d

dyE[σ 2(x′;x) · 1μ(x′)=y]

∣∣∣∣ dy, (B.1)

where E is the expectation with respect to px , and Y is the domain over which the pdf pμ is defined. Standard inequalities [41] allow us to bound (B.1) as follows:∫

Y

∣∣∣∣ 1

pμ(y)

d

dyE[σ 2(x′;x) · 1μ(x′)=y]

∣∣∣∣ dy ≤ K

∫Y

1

pμ(y)E[σ 2(x′;x) · 1μ(x′)=y]dy, (B.2)

where K is a positive constant. But we note that

E[σ²(x′; x) · 1_{μ(x′)=y}] = ∫ σ²(x′; x) · 1_{μ(x′)=y} p_x(x′) dx′   (B.3a)

= ∫_{μ(x′)=y} σ²(x′; x) p_x(x′) dx′.   (B.3b)

Therefore,

∫_Y (1/p_μ(y)) E[σ²(x′; x) · 1_{μ(x′)=y}] dy = ∫_{μ(x′)∈Y} σ²(x′; x) (p_x(x′)/p_μ(μ(x′))) dx′.   (B.4)


In practice, the domain of integration in (B.4) is replaced with the support of the input pdf p_x:

a_B(x) = ∫ σ²(x′; x) (p_x(x′)/p_μ(μ(x′))) dx′.   (B.5)

Finally, it should be clear that the optimization problem

min_{x ∈ X} ∫ σ²(x′; x) (p_x(x′)/p_μ(μ(x′))) dx′   (B.6)

is strictly equivalent to

max_{x ∈ X} ∫ [σ²(x′) − σ²(x′; x)] (p_x(x′)/p_μ(μ(x′))) dx′,   (B.7)

since the term involving σ²(x′) in (B.7) does not depend on the optimization variable x. We can then rewrite the difference of variances using the trick of Gramacy and Lee [42]. We thus obtain

max_{x ∈ X} (1/σ²(x)) ∫ cov²(x, x′) (p_x(x′)/p_μ(μ(x′))) dx′,   (B.8)

which concludes the derivation of IVR-LW.

Appendix C. Analytical expressions for the IVR-LW acquisition function with RBF kernel

With the likelihood ratio approximated by a GMM, the IVR-LW acquisition function becomes

a_IVR-LW(x) ≈ (1/σ²(x)) Σ_{i=1}^{n_GMM} β_i a_i(x),   (C.1)

where each a_i is given by

a_i(x) = ∫ cov²(x, x′) N(x′; ω_i, Σ_i) dx′.   (C.2)

Using the formula for the posterior covariance, we get

a_i(x) = ∫ [k(x, x′) − k(x, X) K⁻¹ k(X, x′)]² N(x′; ω_i, Σ_i) dx′   (C.3a)

= ∫ k(x, x′) k(x′, x) N(x′; ω_i, Σ_i) dx′
  + k(x, X) K⁻¹ [∫ k(X, x′) k(x′, X) N(x′; ω_i, Σ_i) dx′] K⁻¹ k(X, x)
  − 2 k(x, X) K⁻¹ ∫ k(X, x′) k(x′, x) N(x′; ω_i, Σ_i) dx′   (C.3b)

= k̄_i(x, x) + k(x, X) K⁻¹ [k̄_i(X, X) K⁻¹ k(X, x) − 2 k̄_i(X, x)],   (C.3c)

where we have defined

k̄_i(x_1, x_2) = ∫ k(x_1, x′) k(x′, x_2) N(x′; ω_i, Σ_i) dx′.   (C.4)

Therefore, to evaluate a_i and its gradients, we only need a mechanism to compute k̄_i and its gradients. For the RBF kernel (A.4), it is straightforward to show that

k̄_i(x_1, x_2) = |2Σ_i Θ⁻¹ + I|^{−1/2} k(x_1, x_2; 2Θ) k((x_1 + x_2)/2, ω_i; Θ/2 + Σ_i)   (C.5a)

and

dk̄_i/dx_1 = k̄_i(x_1, x_2) { −x_1ᵀ Θ⁻¹ + (1/2) [ω_iᵀ + (x_1 + x_2)ᵀ Θ⁻¹ Σ_i] (Σ_i + Θ/2)⁻¹ }.   (C.5b)

For further details, we refer the reader to McHutchon [40].
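The closed form (C.5a) can likewise be checked numerically; the one-dimensional sketch below (with illustrative parameter values) compares it against direct quadrature of (C.4):

```python
import numpy as np
from scipy.integrate import quad

# Illustrative parameters: RBF kernel (sigma_f^2, Theta) and one GMM
# component with mean omega_i and variance Sigma_i.
sigma_f2, theta = 1.0, 0.5
w_i, s_i = 0.3, 0.2

def k(a, b, th):
    """RBF kernel (A.4) in one dimension."""
    return sigma_f2 * np.exp(-(a - b) ** 2 / (2.0 * th))

def gauss(x, m, v):
    """Univariate Gaussian density N(x; m, v)."""
    return np.exp(-(x - m) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

x1, x2 = 0.7, -0.4

# Direct quadrature of (C.4) versus the closed form (C.5a) with d = 1.
numeric, _ = quad(lambda xp: k(x1, xp, theta) * k(xp, x2, theta)
                  * gauss(xp, w_i, s_i), -np.inf, np.inf)
analytic = ((2.0 * s_i / theta + 1.0) ** -0.5 * k(x1, x2, 2.0 * theta)
            * k(0.5 * (x1 + x2), w_i, 0.5 * theta + s_i))
```

As s_i → 0 the Gaussian weight collapses to a point mass at ω_i and the formula reduces to the product k(x_1, ω_i) k(ω_i, x_2), a useful sanity check.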


Fig. D.5. Conditional pdf with uniform input for the six benchmark test functions considered in this work.

Appendix D. Analytical expressions for benchmark test functions

The analytical expressions for the synthetic test functions considered in Section 4.2 are given below. Further details may be found in Surjanovic and Bingham [43] and Jamil and Yang [44]. For these test functions, Figure D.5 shows the conditional pdf of the output for uniformly distributed input. We used 10^5 samples for the 2-D functions, and 10^6 samples for the 6-D Hartmann and 10-D Michalewicz functions.

Ackley function:

f(x) = −a exp[−b √((1/d) Σ_{i=1}^d x_i²)] − exp[(1/d) Σ_{i=1}^d cos(c x_i)] + a + exp(1),   (D.1)

where d denotes the dimensionality of the function, and a = 20, b = 0.2, and c = 2π.

Branin function:

f(x) = a(x_2 − b x_1² + c x_1 − r)² + s(1 − t) cos(x_1) + s,   (D.2)

where a = 1, b = 5.1/(4π²), c = 5/π, r = 6, s = 10, and t = 1/(8π).


Bukin function:

f(x) = 100 √|x_2 − 0.01 x_1²| + 0.01 |x_1 + 10|.   (D.3)

Michalewicz function:

f(x) = −Σ_{i=1}^d sin(x_i) sin^{2m}(i x_i²/π),   (D.4)

where d denotes the dimensionality of the function, and m controls the steepness of the valleys and ridges. In this work we use m = 10, making optimization quite challenging.

6-D Hartmann function:

f(x) = −Σ_{i=1}^4 a_i exp[−Σ_{j=1}^6 A_ij (x_j − P_ij)²],   (D.5)

where

a = [1  1.2  3  3.2]ᵀ,   (D.6a)

A = [ 10    3    17    3.5   1.7   8
      0.05  10   17    0.1   8     14
      3     3.5  1.7   10    17    8
      17    8    0.05  10    0.1   14 ],   (D.6b)

P = [ 0.1312  0.1696  0.5569  0.0124  0.8283  0.5886
      0.2329  0.4135  0.8307  0.3736  0.1004  0.9991
      0.2348  0.1451  0.3522  0.2883  0.3047  0.6650
      0.4047  0.8828  0.8732  0.5743  0.1091  0.0381 ].   (D.6c)
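For reference, a few of these test functions can be sketched and sanity-checked at their known global minima (a sketch, not the authors' implementation; the Branin constant c = 5/π follows the standard definition):

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """Ackley function (D.1); global minimum f(0) = 0 in any dimension."""
    x = np.asarray(x, dtype=float)
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

def branin(x, a=1.0, b=5.1 / (4.0 * np.pi**2), c=5.0 / np.pi, r=6.0,
           s=10.0, t=1.0 / (8.0 * np.pi)):
    """Branin function (D.2); global minimum ~0.397887 at (-pi, 12.275)."""
    x1, x2 = x
    return a * (x2 - b * x1**2 + c * x1 - r) ** 2 + s * (1.0 - t) * np.cos(x1) + s

def bukin(x):
    """Bukin function (D.3); global minimum 0 at (-10, 1)."""
    x1, x2 = x
    return 100.0 * np.sqrt(abs(x2 - 0.01 * x1**2)) + 0.01 * abs(x1 + 10.0)
```

Checking each implementation at its documented minimizer is a cheap guard against sign and constant errors before running a full benchmark.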

Appendix E. Comparison of runtime for likelihood-weighted and unweighted acquisition functions

To investigate how computation of the likelihood ratio affects the overall efficiency of the algorithm, we proceed as follows. For the Ackley function, we consider an initial dataset composed of ten LHS input–output pairs and, for the eight acquisition functions considered in Section 4.2, record the time required to perform a single iteration of the Bayesian algorithm. During that time interval, the following operations are performed: computation of the likelihood ratio (for LW acquisition functions only), minimization of the acquisition function over the input space, query of the objective function at the best next point, and update of the surrogate model.

Since the focus is on the likelihood ratio, we investigate the effect of the following parameters on runtime: the number of samples n_samples drawn from the posterior mean used in KDE, the number of Gaussian mixtures n_GMM used in the GMM fit, and the dimensionality d of the objective function. The baseline case uses n_samples = 10^6, n_GMM = 2, and d = 2. (These are the parameters used to generate the results in Fig. 2(a).) For each parameter, we perform 100 experiments (each with a different LHS initialization, followed by a single Bayesian iteration) and report the median runtime (in seconds) for EI, PI, IVR(-BO), IVR-LW(BO), and LCB(-LW).

Figure E.6 shows that computation of the likelihood ratio has a mildly adverse effect on algorithm efficiency, with the main culprit being the sampling of the posterior mean. The left panel in Fig. E.6 shows a relatively strong dependence of the runtime on n_samples. We have verified that for the LW acquisition functions, very little time is spent in the KDE part of the algorithm; as discussed in Section 3.3, one-dimensional FFT-based KDE scales linearly with the number of samples. Instead, most of the iteration time is spent generating the samples of the posterior mean μ(x). Little can be done to avoid this issue, except for using a reasonably small number of samples. For n_samples ≤ 10^5, the left panel in Fig. E.6 shows that the runtime for LW acquisition functions is not dramatically larger than that for unweighted acquisition functions. We note that for relatively low dimensions, reducing the number of samples has virtually no effect on the accuracy of the KDE. For n_samples = 10^6, the center and right panels in Fig. E.6 suggest that runtime scales linearly with the number of Gaussian mixtures and with dimensionality, which is unproblematic from the standpoint of efficiency.
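The binned, FFT-based KDE idea referenced above can be sketched as follows (a simplified stand-in, not the authors' implementation): binning the samples costs O(n_samples), and the Gaussian smoothing is a fixed-size FFT on the grid, independent of the number of samples.

```python
import numpy as np

def fft_kde(samples, grid_min, grid_max, n_grid=1024, bandwidth=0.1):
    """Binned 1-D KDE: histogram the samples, then smooth with a Gaussian
    kernel applied in Fourier space (periodic grid; tails assumed negligible)."""
    counts, edges = np.histogram(samples, bins=n_grid,
                                 range=(grid_min, grid_max), density=True)
    dx = edges[1] - edges[0]
    u = np.fft.fftfreq(n_grid, d=dx)                # frequencies in cycles/unit
    kernel_ft = np.exp(-2.0 * (np.pi * bandwidth * u) ** 2)  # FT of N(0, h^2)
    pdf = np.fft.ifft(np.fft.fft(counts) * kernel_ft).real
    return 0.5 * (edges[:-1] + edges[1:]), np.maximum(pdf, 0.0)

# Example: density of 10^5 standard normal samples.
x = np.random.default_rng(0).standard_normal(100_000)
grid, pdf = fft_kde(x, -6.0, 6.0, bandwidth=0.05)
```

Since the kernel's zero-frequency component is 1, the smoothing preserves the total mass of the histogram, so the estimate integrates to (approximately) one.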

As a final word, we note that the runtimes for LW acquisition functions remain on the same order of magnitude as those for unweighted acquisition functions. In practical applications, evaluation of the black-box objective function may take days or even weeks, so whether computation and optimization of the acquisition function takes one or ten seconds is inconsequential.


Fig. E.6. Effect of nsamples , nGMM , and d, on single-iteration runtime for the Ackley function.

References

[1] D.R. Jones, M. Schonlau, W.J. Welch, Efficient global optimization of expensive black-box functions, J. Glob. Optim. 13 (1998) 455–492.
[2] P. Hennig, C.J. Schuler, Entropy search for information-efficient global optimization, J. Mach. Learn. Res. 13 (2012) 1809–1837.
[3] E. Brochu, V.M. Cora, N. De Freitas, A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning, Technical Report TR-2009-023, Department of Computer Science, University of British Columbia, 2009.
[4] B. Shahriari, K. Swersky, Z. Wang, R.P. Adams, N. De Freitas, Taking the human out of the loop: a review of Bayesian optimization, Proc. IEEE 104 (2015) 148–175.
[5] H.J. Kushner, A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise, J. Fluids Eng. 86 (1964) 97–106.
[6] J. Mockus, V. Tiesis, A. Zilinskas, The application of Bayesian methods for seeking the extremum, Towards Glob. Optim. 2 (1978) 117–129.
[7] N. Srinivas, A. Krause, S. Kakade, M. Seeger, Gaussian process optimization in the bandit setting: no regret and experimental design, in: Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 1015–1022.
[8] E. Kaufmann, O. Cappé, A. Garivier, On Bayesian upper confidence bounds for bandit problems, in: Artificial Intelligence and Statistics, 2012, pp. 592–600.
[9] J. Villemonteix, E. Vazquez, E. Walter, An informational approach to the global optimization of expensive-to-evaluate functions, J. Glob. Optim. 44 (2009) 509–534.
[10] J.M. Hernández-Lobato, M.W. Hoffman, Z. Ghahramani, Predictive entropy search for efficient global optimization of black-box functions, in: Advances in Neural Information Processing Systems, 2014, pp. 918–926.
[11] M.W. Hoffman, Z. Ghahramani, Output-space predictive entropy search for flexible global optimization, in: NIPS Workshop on Bayesian Optimization, 2015, pp. 1–5.
[12] Z. Wang, S. Jegelka, Max-value entropy search for efficient Bayesian optimization, in: Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 3627–3635.
[13] B. Ru, M.A. Osborne, M. Mcleod, D. Granziol, Fast information-theoretic Bayesian optimisation, in: Proceedings of Machine Learning Research, vol. 80, 2018, pp. 4384–4392.
[14] A.B. Owen, Monte Carlo theory, methods and examples, published online at https://statweb.stanford.edu/~owen/mc, 2013.
[15] K. Chaloner, I. Verdinelli, Bayesian experimental design: a review, Stat. Sci. 10 (1995) 273–304.
[16] X. Huan, Y.M. Marzouk, Simulation-based optimal Bayesian experimental design for nonlinear systems, J. Comput. Phys. 232 (2013) 288–317.
[17] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1997) 67–82.
[18] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
[19] M. Raissi, P. Perdikaris, G.E. Karniadakis, Inferring solutions of differential equations using noisy multi-fidelity data, J. Comput. Phys. 335 (2017) 736–746.
[20] M. Raissi, P. Perdikaris, G.E. Karniadakis, Machine learning of linear differential equations using Gaussian processes, J. Comput. Phys. 348 (2017) 683–693.
[21] G. Pang, P. Perdikaris, W. Cai, G.E. Karniadakis, Discovering variable fractional orders of advection–dispersion equations from field data using multi-fidelity Bayesian optimization, J. Comput. Phys. 348 (2017) 694–714.
[22] J. Sacks, W.J. Welch, T.J. Mitchell, H.P. Wynn, Design and analysis of computer experiments, Stat. Sci. 4 (1989) 409–423.
[23] P. Embrechts, C. Klüppelberg, T. Mikosch, Modeling Extremal Events for Insurance and Finance, Springer Verlag, New York, 2013.
[24] S. Albeverio, V. Jentsch, H. Kantz, Extreme Events in Nature and Society, Springer Verlag, New York, 2006.
[25] M.A. Mohamad, T.P. Sapsis, Sequential sampling strategy for extreme event statistics in nonlinear dynamical systems, Proc. Natl. Acad. Sci. 115 (2018) 11138–11143.
[26] J.S. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: Advances in Neural Information Processing Systems, 2011, pp. 2546–2554.
[27] R. Oliveira, L. Ott, F. Ramos, Bayesian optimisation under uncertain inputs, in: Proceedings of Machine Learning Research, vol. 89, 2019, pp. 1177–1184.
[28] J. Fan, J.S. Marron, Fast implementations of nonparametric curve estimators, J. Comput. Graph. Stat. 3 (1994) 35–56.
[29] T.P. Sapsis, Output-weighted optimal sampling for Bayesian regression and rare event statistics using few samples, Proc. R. Soc. A 476 (2020) 20190834.
[30] J. VanderPlas, Python Data Science Handbook: Essential Tools for Working with Data, O'Reilly Media, 2016.
[31] J. Li, J. Li, D. Xiu, An efficient surrogate-based method for computing rare failure probability, J. Comput. Phys. 230 (2011) 8683–8697.
[32] M. Farazmand, T.P. Sapsis, Reduced-order prediction of rogue waves in two-dimensional deep-water waves, J. Comput. Phys. 340 (2017) 418–434.
[33] M. Farazmand, T.P. Sapsis, A variational approach to probing extreme events in turbulent dynamical systems, Sci. Adv. 3 (2017) e1701533.
[34] M. Farazmand, T.P. Sapsis, Dynamical indicators for the prediction of bursting phenomena in high-dimensional systems, Phys. Rev. E 94 (2016) 032212.
[35] S. Wiggins, Global Bifurcations and Chaos: Analytical Methods, Springer Verlag, New York, 1988.
[36] T. Lestang, F. Bouchet, E. Lévêque, Numerical study of extreme mechanical force exerted by a turbulent flow on a bluff body by direct and rare-event sampling techniques, J. Fluid Mech. 895 (2020) A19.


[37] Y. Zhou, D. Fan, B. Zhang, R. Li, B.R. Noack, Artificial intelligence control of a turbulent jet, J. Fluid Mech. 897 (2020) A27.
[38] A. Blanchard, T. Sapsis, Informative path planning for anomaly detection in environment exploration and monitoring, arXiv:2005.10040, 2020.
[39] J. Snoek, H. Larochelle, R.P. Adams, Practical Bayesian optimization of machine learning algorithms, in: Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.
[40] A. McHutchon, Differentiating Gaussian processes, http://mlg.eng.cam.ac.uk/mchutchon/DifferentiatingGPs.pdf, 2013.
[41] M.K. Kwong, A. Zettl, Norm Inequalities for Derivatives and Differences, Springer Verlag, New York, 2006.
[42] R.B. Gramacy, H.K. Lee, Adaptive design and analysis of supercomputer experiments, Technometrics 51 (2009) 130–145.
[43] S. Surjanovic, D. Bingham, Virtual library of simulation experiments: test functions and datasets, http://www.sfu.ca/~ssurjano, 2013.
[44] M. Jamil, X.-S. Yang, A literature survey of benchmark functions for global optimisation problems, Int. J. Math. Model. Numer. Optim. 4 (2013) 150–194.
