
Multivariate Gaussian Process Regression for Derivative Portfolio Modeling: Application to CVA

Matthew F. Dixon∗

Department of Applied Mathematics, Illinois Institute of Technology

and

Stéphane Crépey†

LaMME, Univ Evry, CNRS, Université Paris-Saclay, 91037, Evry, France

December 4, 2018

Abstract

Modeling counterparty risk is computationally challenging because it requires the simultaneous evaluation of all the trades with each counterparty under both market and credit risk. We present a multi-Gaussian process regression for estimating portfolio risk, which is well suited for OTC derivative portfolios, in particular CVA computation. Our spatio-temporal modeling approach avoids nested MC simulation or simulation and regression of cash flows by learning a 'kernel pricing layer'. The pricing layer is flexible: we model the joint posterior of the derivatives as a Gaussian over function space, with the spatial covariance structure imposed only on the risk factors. Monte-Carlo (MC) simulation is then used to simulate the dynamics of the risk factors. Our approach quantifies the uncertainty in portfolio valuation and CVA arising from the Gaussian process approximation. Numerical experiments demonstrate the accuracy and convergence properties of our approach for CVA computations.

∗Matthew Dixon is an Assistant Professor in the Department of Applied Mathematics, Illinois Institute of Technology, Chicago. E-mail: [email protected].
†Stéphane Crépey is a Professor in the Department of Mathematics, University of Evry, Paris. E-mail: [email protected].
Acknowledgement: The authors are grateful to Marc Chataigner and Areski Cousin for insights and feedback.


1 Overview

Since the global financial crisis of 2007-2008, banks have been subject to much stricter regulation and conservative capital and liquidity requirements. The pricing, valuation and management of over-the-counter (OTC) derivatives have been substantially revised to more robustly capture counterparty credit risk. Pricing and accounting now include valuation adjustments collectively known as xVAs (Abbas-Turki et al., 2018; Kenyon and Green, 2014; Crépey et al., 2014). Since the xVA market risk must be hedged, xVA sensitivities, such as delta and vega, are also computed.

The BCBS pointed out that two thirds of total credit losses during the 2007-2009 crisis were CVA losses, yet this risk was not capitalized under Basel II. A CVA regulatory framework has been in place since the initial phase of the Basel III framework in December 2010.

Modeling counterparty risk is computationally challenging because it requires the evaluation of all the trades with each counterparty under market and credit simulation. In practice, CVA computation requires pricing an option for each counterparty portfolio under simulated market moves, with counterparty default modeled separately. The sensitivities of the CVA, with respect to the hundreds or even thousands of market risk buckets, are required for hedging.

The main source of computational complexity in CVA computation arises from full model reevaluation of portfolio holdings (including path dependent or early exercise options) in numerous future dynamic scenarios. Cash flow simulation and regression schemes à la Longstaff and Schwartz (2001), known as Least Squares Monte Carlo (LSMC) methods, are commonly used to estimate the portfolio price and Greeks over all paths, but this comes with limited error control. The computational complexity is exacerbated for the computation of CVA Expected Shortfall and VaR, since a nested Monte-Carlo is typically needed (see (Barrera et al., 2017) for a comparison between nested Monte-Carlo and cash flow simulation and regression schemes). There has also been much progress towards real-time CVA sensitivity estimation using adjoint algorithmic differentiation to reduce the computational work (Giles and Glasserman, 2005; Capriotti et al., 2011; Capriotti, 2011; Antonov et al., 2018). However, overall, algorithmic differentiation is still very challenging to implement at the level of a banking derivative portfolio.

In this paper, we depart from the LSMC approach by turning to the use of Gaussian process (GP) interpolation on the mark-to-market (MtM) cube, i.e. the prices of spanning instruments at future time points and scenarios. Most 'vanilla' xVAs can be deduced from the MtM cube by integration of the resulting expected exposure profiles against suitable credit and funding curves. Our approach consists in simulating the market risk factors forward in time and then interpolating the option surface at each time step from a set of option-model-generated reference derivative prices. Such an approach is predicated on the notion that a GP model can interpolate reliably, quickly and accurately, providing fast prices and the associated (differentiated) Greeks. By comparison, standard interpolation techniques such as cubic spline basis functions perform poorly if


the knot points are not uniform, and require many knot points to resolve a non-smooth function. Another classical interpolation scheme uses Chebyshev nodes and weights, recently made popular in computational finance by Gaß et al. (2017), but the node locations are deterministic and fixed, which entails stringent curse-of-dimensionality issues.

An alternative approach, which we also partially consider, is GP estimation of the MtM cube from historical data, without the use of gridded model reference prices. Such a 'model-free' approach then relies on GPs for both interpolation and extrapolation; the latter is much more challenging for GPs. Thus, as a counterpoint to the benchmark nested MC approach, there are three different regression (including interpolation) approaches to CVA simulation: LSMC, regression to model prices, and regression to market prices; we consider the latter two by using GPs.

Gaussian process regression, or simply Gaussian Processes (GPs), is a Bayesian kernel learning method which has demonstrated much success in spatio-temporal applications outside of finance. Their adoption in financial modeling is more recent, typically under the name of 'kriging' (see e.g. (Liu and Staum, 2010), (Ludkovski, 2018), or (Cousin et al., 2016)). We refer the reader to (Rasmussen and Williams, 2005) for an excellent general introduction to GPs. In addition to a number of favorable statistical and mathematical properties, such as universality (Micchelli et al., 2006), the implementation support infrastructure is mature, being provided by GPyTorch, scikit-learn, Edward, STAN and other open source machine learning packages. Spiegeleer et al. (2018) already noted, in the general context of derivative pricing, that many of the calculations for pricing a wide array of complex instruments are often similar. Furthermore, the market conditions affecting OTC derivatives may often vary between observations in only a few variables, such as interest rates. For fast derivative pricing and Greeking, Spiegeleer et al. (2018) propose offline learning of the pricing function through Gaussian process regression. Specifically, the authors configure the training set over a grid and then use the GP to interpolate at the test points. The advantage of this approach, compared to training on historical option prices, is the ability to estimate option prices over a wider domain than typically observed from historical prices. Such GP estimates depend on option pricing models, rather than just market data - somewhat counter to the motivation for adopting machine learning, but not a singular point in computational finance applications (see e.g. Hernandez (2017), Spiegeleer et al. (2018), or E et al. (2017)).

Spiegeleer et al. (2018) demonstrate the speed-up of GPs relative to Monte-Carlo methods, with a tolerable accuracy loss, applied to pricing and Greek estimation under a Heston model, in addition to approximating the implied volatility surface. The increased expressibility of GPs compared to cubic spline interpolation, a popular numerical approximation technique useful for fast point estimation, is also demonstrated.

However, the applications shown in (Spiegeleer et al., 2018) are limited to single instrument pricing and do not consider risk modeling aspects. In particular, their study is limited to univariate GPs (i.e. with a single response),


without consideration of multivariate GPs (a.k.a. multi-GPs).

This paper presents a multivariate generalization of GPs for learning the posterior distribution of a portfolio value prediction1. Multi-GPs learn the joint posterior distribution of each derivative price in the portfolio, given a training set of risk factors and derivative prices (for each given time to maturity of interest). In a single-response GP setting, individual GPs are used to model the posterior of each predicted derivative price under the assumption that the derivative prices are independent, conditional on the training data and test input. Given that the derivatives may either share common underlyings, or the underlyings are different but correlated, this assumption is clearly violated in practice. By contrast, multi-GPs directly model the uncertainty in the prediction of a vector of derivative prices (responses) with spatial covariance matrices specified by kernel functions. Thus the amount of error in a portfolio value prediction, at any point in space and time, can be better modeled using multi-GPs than single-GPs. Multi-GPs do not, however, provide any methodological improvement in the estimation of the mean.

The need for 'uncertainty' in the prediction is the primary practical motivation for using GPs, as opposed to frequentist machine learning techniques such as support vector machines or neural networks, which provide point estimates. In practice, a high uncertainty in a prediction might result in a GP model estimate being rejected in favor of either retraining the model or even using full derivative model repricing.

Our usage of 'uncertainty' in this paper primarily refers to the interpolation error estimate. However, GPs can also be used for uncertainty quantification in the sense of quantifying model risk, which we also demonstrate in this paper. Model risk is, in particular, an important and widely open xVA issue.

Outline Our goal is to develop a methodology and provide numerical evidence in favor of using GPs to estimate the CVA of a derivative portfolio. The use of multi-GPs compared to single-GPs provides a consistent approach to aggregating uncertainty in point estimates over a portfolio, accounting for the joint posterior over the options in the portfolio.

Our approach is based on training to model rather than training to data, due to limitations of OTC derivative historical data. However, if sufficient historical data is available, we emphasize that the methodology presented here can just as easily train to data, as demonstrated in Section 6.5.

This paper begins by reviewing GPs in the simpler setting of a single response, providing the minimal necessary terminology for the remainder of the paper. Section 3 introduces a multi-response generalization of GPs and demonstrates its application to the prediction of a call and put portfolio. Section 4 develops the approach for portfolio risk modeling, introducing a transition density function with a view towards Monte-Carlo simulation of the risk factors. Section 5 reviews the formulation of a CVA model which uses our Monte Carlo

1Throughout this paper, we will refer to 'prediction' as out-of-sample point estimation. For the avoidance of doubt, the test point need not be in the future as the terminology suggests.


multi-GP approach. Numerical experiments demonstrating the accuracy and convergence properties of the approach are presented in Section 6. Section 7 concludes. Section A provides Python code excerpts illustrating the key features of our MC-GP approach. These examples, together with additional ones, are provided in the github repository: https://github.com/mfrdixon/GP-CVA.

2 Gaussian Processes

Statistical inference involves learning a function Y = f(X) of the data, (X, Y) := {(x_i, y_i) | i = 1, ..., n}. The idea of Gaussian processes (GPs) is to place a prior directly on the space of functions, without parameterizing2 f(X) (MacKay, 1997). The GP is hence a Bayesian nonparametric model that generalizes Gaussian distributions from finite dimensional vector spaces to infinite dimensional function spaces.

Before describing GPs in more detail, it is instructive to contrast GPs with classical financial modeling. In a Black-Scholes framework, noise is modeled as a Gaussian distribution in a vector space and the linear diffusion of asset prices is modeled with multivariate geometric Brownian motion (GBM). Under the risk neutral measure, the implied drift and covariance of the GBM can be calibrated to observed pairs of asset and option prices. It is well known that, since derivative prices are not generated by the Black-Scholes model, the calibrated parameters violate the assumption of spatial-temporal homogeneity.

GPs do not assume a data generation process and learn a parameterized covariance function of the input through maximum likelihood estimation over all input and output pairs. GPs learn the priors over the output space without necessarily knowing the functional form of the map between input and output. So, for example, if the data consists of observed pairs of asset and option prices, then the GP learns the functional relationship between them. If the option prices are generated by an option pricing model, then the GP will learn the relationship between the input variables and the model option prices, without knowledge of the model.

GPs are an example of a more general class of supervised machine learning techniques referred to as 'kernel learning', which model the covariance matrix from a set of parametrized kernels over the input. The approach can consequently be referred to as 'model-free' if the data is learned without relying on an option pricing and asset dynamics model. However, in this paper, we will mainly train our GPs on model simulated data.

The basic theory of prediction with Gaussian processes dates back at least as far as the time series work of Wiener [1949] and Kolmogorov [1941] in the 1940s (Whittle and Sargent, 1983). Examples of applying GPs to financial time series prediction are presented in (Roberts et al., 2013). The same authors helpfully note that AR(p) processes are discrete time equivalents of GP models with a certain class of covariance functions, known as Matérn covariance functions.

2This is in contrast to nonlinear regressions commonly used in finance, which attempt to parameterize a non-linear function with a set of weights.


Hence, GPs can be viewed as a Bayesian nonparametric generalization of well known econometrics techniques.

GPs are not new in portfolio risk modeling; da Barrosa et al. (2016) present a GP method for optimizing financial asset portfolios which allows for approximating the risk surface. Other examples of GPs include meta-modeling for expected shortfall through nested simulation (Liu and Staum, 2010), where GPs are used to infer portfolio values in a scenario based on inner-level simulation of nearby scenarios. This significantly reduces the required computational effort by avoiding inner-level simulation in every scenario and naturally takes account of the variance that arises from inner-level simulation.

Spiegeleer et al. (2018) demonstrate how GPs can be applied to many classical problems in derivative pricing, with speed-ups of several orders of magnitude through pricing function estimation. GPs are found to be much more accurate than the spline fitting techniques commonly used in derivative modeling. An attractive advantage of GPs over spline fitting is that the reference points need not be uniformly distributed. Examples demonstrate the pricing of American options and the pricing of exotic options under models beyond the Black-Scholes setting.

2.1 Preliminaries

More formally, we say that a random function f is drawn from a GP with a mean function µ and a covariance kernel k, f ∼ GP(µ, k), if for any vector of inputs [x_1, x_2, ..., x_n], the corresponding vector of function values is Gaussian:

\[
[f(\mathbf{x}_1), f(\mathbf{x}_2), \dots, f(\mathbf{x}_n)] \sim \mathcal{N}(\boldsymbol{\mu}, K_{X,X}),
\]

with mean µ, such that µ_i = µ(x_i), and covariance matrix K_{X,X} satisfying (K_{X,X})_{ij} = k(x_i, x_j). Without loss of generality, we follow the convention in the literature of assuming µ = 0.

GPs can be seen as distributions over the reproducing kernel Hilbert space (RKHS) of functions which is uniquely defined by the kernel function k (Schölkopf and Smola, 2001). GPs with RBF kernels are known to be universal approximators, with prior support to within an arbitrarily small epsilon band of any continuous function (Micchelli et al., 2006).

Assuming additive Gaussian noise, y | x ∼ N(f(x), σ²), and a GP prior on f(x), given training inputs x ∈ X and training targets y ∈ Y, the predictive distribution of the GP evaluated at an arbitrary test point x_* ∈ X_* is:

\[
f_* \mid X, Y, \mathbf{x}_* \sim \mathcal{N}(\mathbb{E}[f_* \mid X, Y, \mathbf{x}_*], \mathbb{V}[f_* \mid X, Y, \mathbf{x}_*]), \tag{1}
\]

where the moments of the posterior over X_* are

\[
\begin{aligned}
\mathbb{E}[f_* \mid X, Y, X_*] &= \mu_{X_*} + K_{X_*,X}\,[K_{X,X} + \sigma^2 I]^{-1} Y, \\
\mathbb{V}[f_* \mid X, Y, X_*] &= K_{X_*,X_*} - K_{X_*,X}\,[K_{X,X} + \sigma^2 I]^{-1} K_{X,X_*}.
\end{aligned} \tag{2}
\]

Here, K_{X_*,X}, K_{X,X_*}, K_{X,X}, and K_{X_*,X_*} are matrices that consist of the kernel, k : R^p × R^p → R, evaluated at the corresponding points X and X_*, and µ_{X_*} is the mean function evaluated on the test inputs X_*.
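For concreteness, the following is a minimal NumPy sketch of the predictive moments (2) with an RBF kernel and zero prior mean; the function and variable names (rbf_kernel, gp_posterior, ell, sigma) are our own illustrative choices, not code from the paper's repository.

```python
# Minimal sketch of the GP predictive equations (1)-(2) with an RBF kernel.
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 ell^2)), evaluated pairwise."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq / ell**2)

def gp_posterior(X_train, y_train, X_test, ell=1.0, sigma=0.1):
    K = rbf_kernel(X_train, X_train, ell)            # K_{X,X}
    K_s = rbf_kernel(X_test, X_train, ell)           # K_{X*,X}
    K_ss = rbf_kernel(X_test, X_test, ell)           # K_{X*,X*}
    L = np.linalg.cholesky(K + sigma**2 * np.eye(len(X_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha                               # Eq. (2), posterior mean (mu = 0)
    v = np.linalg.solve(L, K_s.T)
    var = K_ss - v.T @ v                             # Eq. (2), posterior covariance
    return mean, var
```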


2.2 Hyper-parameter tuning

GPs are fit to the data by optimizing the evidence - the marginal probability of the data given the model - with respect to the learned kernel hyperparameters.

The evidence has the form (see e.g. (Murphy, 2012, Section 15.2.4, p. 523)):

\[
\log p(y \mid x, \lambda) = -\frac{1}{2}\left[ y^\top (K + \sigma^2 I)^{-1} y + \log \det(K + \sigma^2 I) \right] - \frac{n}{2} \log 2\pi, \tag{3}
\]

where we use the shorthand K for K_{X,X}, and K implicitly depends on the kernel hyperparameters λ = [ℓ, σ], with ℓ the length-scale of the Radial Basis Function (RBF) kernel:

\[
\mathrm{cov}(f(\mathbf{x}), f(\mathbf{x}')) = k(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{1}{2\ell^2} \|\mathbf{x} - \mathbf{x}'\|^2 \right). \tag{4}
\]

This objective function consists of a model fit and a complexity penalty term that results in an automatic Occam's razor for realizable functions (Rasmussen and Ghahramani, 2001). By optimizing the evidence with respect to the kernel hyperparameters, we effectively learn the structure of the space of functional relationships between the inputs and the targets:

\[
\lambda^* = \arg\max_{\lambda}\, \log p(y \mid x, \lambda). \tag{5}
\]

The gradient of the log likelihood is given analytically:

\[
\partial_\lambda \log p(y \mid x, \lambda) = \mathrm{tr}\!\left( \left(\alpha\alpha^\top - (K + \sigma^2 I)^{-1}\right)\, \partial_\lambda (K + \sigma^2 I)^{-1} \right), \tag{6}
\]

where α := (K + σ²I)⁻¹y and

\[
\partial_\ell (K + \sigma^2 I)^{-1} = -(K + \sigma^2 I)^{-2}\, \partial_\ell K, \tag{7}
\]
\[
\partial_\sigma (K + \sigma^2 I)^{-1} = -2\sigma\, (K + \sigma^2 I)^{-2}, \tag{8}
\]

and

\[
\partial_\ell k(\mathbf{x}, \mathbf{x}') = \ell^{-3}\, \|\mathbf{x} - \mathbf{x}'\|^2\, k(\mathbf{x}, \mathbf{x}'). \tag{9}
\]
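As an illustration of the evidence maximization (5), the sketch below fits the RBF length-scale and the noise level with scikit-learn, one of the packages mentioned in Section 1; the toy training data is hypothetical.

```python
# Fit kernel hyperparameters by maximizing the log marginal likelihood, Eq. (5).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

X = np.random.uniform(80, 120, size=(100, 1))     # e.g. underlying prices
y = np.maximum(X[:, 0] - 100.0, 0.0)              # toy payoff-like targets

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gp.fit(X, y)                                      # optimizes Eq. (3) over lambda = [ell, sigma]
print(gp.kernel_)                                 # fitted length-scale and noise level
print(gp.log_marginal_likelihood_value_)
```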

2.3 Computational properties

Training time, required for maximizing (5) numerically, scales poorly with the number of observations n. This complexity stems from the need to solve linear systems and compute log determinants involving an n × n symmetric positive definite covariance matrix K. This task is commonly performed by computing the Cholesky decomposition of K, incurring O(n³) complexity. Prediction, however, is faster and can be performed in O(n²) with a matrix-vector multiplication for each test point; hence the primary motivation for using GPs is real-time risk estimation performance.


Massively Scalable Gaussian Processes Note that fast massively scalable Gaussian processes (MSGP) (Gardner et al., 2018) are a significant extension of the basic kernel interpolation framework described above. The core idea of the framework is to improve scalability by combining GPs with 'inducing point methods'. Using structured kernel interpolation (SKI), a small set of m inducing points are carefully selected from the original training points. The covariance matrix has Kronecker and Toeplitz structure, which is exploited by the FFT. Finally, the output over the original input points is interpolated from the output at the inducing points. The interpolation complexity scales linearly with the dimensionality p of the input data by expressing the kernel interpolation as a product of 1D kernels. Overall, SKI gives O(pn + pm log m) training complexity and O(1) prediction time per test point. See Section B for further details. In this paper, we primarily use the basic interpolation approach for accuracy evaluation. However, for completeness, Section 6.3 shows the scaling of the time taken to train and predict with MSGPs.

3 Multi-response Gaussian Processes

A multivariate Gaussian process is a collection of random vector-valued variables, any finite number of which have a matrix-variate Gaussian distribution. We define a multivariate Gaussian process as follows.

Definition 3.0.1 (MGP). f is a multivariate Gaussian process on R^p with vector-valued mean function µ : R^p → R^d, kernel k : R^p × R^p → R and positive semi-definite parameter covariance matrix Ω ∈ R^{d×d} if the vectorization of any finite collection of vector-valued variables has a joint multivariate Gaussian distribution,

\[
\mathrm{vec}([f(\mathbf{x}_1), \dots, f(\mathbf{x}_n)]) \sim \mathcal{N}(\mathrm{vec}(M), \Sigma \otimes \Omega),
\]

where f, µ ∈ R^d are column vectors whose components are the functions {f_i}_{i=1}^d and {µ_i}_{i=1}^d respectively, and ⊗ is the Kronecker product. Furthermore, M ∈ R^{d×n} with M_{ij} = µ_j(x_i), and Σ ∈ R^{n×n} with Σ_{ij} = k(x_i, x_j). Sometimes Σ is called the column covariance matrix while Ω is the row covariance matrix. We denote f ∼ MGP(µ, k, Ω).

3.1 Multivariate Gaussian process regression

Following Chen et al. (2017), given n pairs of observations {(x_i, y_i)}_{i=1}^n, x_i ∈ R^p, y_i ∈ R^d, we assume the following model:

\[
f \sim \mathcal{MGP}(\boldsymbol{\mu}, k', \Omega), \qquad \mathbf{y}_i = f(\mathbf{x}_i), \quad i \in \{1, \dots, n\},
\]

where k' = k(x_i, x_j) + δ_{ij} σ_n², and σ_n² is the variance of the additive Gaussian noise.


By the definition of a multivariate Gaussian process, the vectorization of the collection of functions [f(x_1), ..., f(x_n)] follows a multivariate Gaussian distribution

\[
\mathrm{vec}([f(\mathbf{x}_1), \dots, f(\mathbf{x}_n)]) \sim \mathcal{N}(0, K' \otimes \Omega),
\]

where K' is the n × n covariance matrix whose (i, j)-th element is [K']_{ij} = k'(x_i, x_j). See Section 3.2 for further details of prediction with the multi-GP model.

In the next section, we shall consider the general application of GPs to portfolio value estimation and market risk modeling. The scope of the methodology is therefore more general than CVA modeling.

3.2 Prediction with Multi-GPs

To predict a new variable f_* = [f_{*1}, ..., f_{*m}] at the test locations X_* = [x_{n+1}, ..., x_{n+m}], the joint distribution of the training observations Y = [y_1, ..., y_n] and the predictive targets f_* is given by

\[
\begin{bmatrix} Y \\ f_* \end{bmatrix} \sim \mathcal{MN}\left( 0,\ \begin{bmatrix} K'(X, X) & K'(X_*, X)^\top \\ K'(X_*, X) & K'(X_*, X_*) \end{bmatrix},\ \Omega \right), \tag{10}
\]

where K'(X, X) is an n × n matrix whose (i, j)-th element is [K'(X, X)]_{ij} = k'(x_i, x_j), K'(X_*, X) is an m × n matrix whose (i, j)-th element is [K'(X_*, X)]_{ij} = k'(x_{n+i}, x_j), and K'(X_*, X_*) is an m × m matrix whose (i, j)-th element is [K'(X_*, X_*)]_{ij} = k'(x_{n+i}, x_{n+j}). Thus, taking advantage of the conditional distribution of the multivariate Gaussian process, the predictive distribution is:

\[
p(\mathrm{vec}(f_*) \mid X, Y, X_*) = \mathcal{N}(\mathrm{vec}(\hat{M}), \hat{\Sigma} \otimes \hat{\Omega}), \tag{11}
\]

where

\[
\hat{M} = K'(X_*, X)^\top K'(X, X)^{-1} Y, \tag{12}
\]
\[
\hat{\Sigma} = K'(X_*, X_*) - K'(X_*, X)^\top K'(X, X)^{-1} K'(X_*, X), \tag{13}
\]
\[
\hat{\Omega} = \Omega. \tag{14}
\]

Additionally, the expectation and the covariance are obtained as

\[
\mathbb{E}[f_* \mid X, Y, X_*] = \hat{M}, \tag{15}
\]
\[
\mathrm{cov}(\mathrm{vec}(f_*) \mid X, Y, X_*) = \hat{\Sigma} \otimes \hat{\Omega}. \tag{16}
\]

The hyperparameters and the elements of the covariance matrix Ω are found by minimizing the negative log marginal likelihood of the observations:

\[
\mathcal{L}(Y \mid X, \lambda, \Omega) = \frac{nd}{2}\ln(2\pi) + \frac{d}{2}\ln|K'| + \frac{n}{2}\ln|\Omega| + \frac{1}{2}\mathrm{tr}\!\left((K')^{-1} Y \Omega^{-1} Y^\top\right). \tag{17}
\]

Further details of the multi-GP are given in (Chen et al., 2017). The computational remarks made in Section 2.3 also apply here, with the additional comment that the training and prediction times also scale linearly with the number of output dimensions d.
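A minimal NumPy sketch of the multi-GP predictive moments (12)-(13) is shown below, storing K'(X_*, X) as an m × n array so that the matrix products line up; all names are illustrative.

```python
# Sketch of the multi-GP predictive moments, Eqs. (12)-(16).
import numpy as np

def multi_gp_predict(K_train, K_cross, K_test, Y):
    """K_train = K'(X,X) (n x n), K_cross = K'(X*,X) (m x n),
    K_test = K'(X*,X*) (m x m), Y (n x d, one column per derivative).
    Returns M_hat (m x d) and Sigma_hat (m x m); the full predictive
    covariance is cov(vec(f*)) = kron(Sigma_hat, Omega), Eq. (16)."""
    A = np.linalg.solve(K_train, Y)            # K'(X,X)^{-1} Y
    M_hat = K_cross @ A                        # Eq. (12)
    B = np.linalg.solve(K_train, K_cross.T)    # K'(X,X)^{-1} K'(X*,X)^T
    Sigma_hat = K_test - K_cross @ B           # Eq. (13)
    return M_hat, Sigma_hat
```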


4 Portfolio Value and Market Risk Estimation

The value of a portfolio of financial derivative contracts can be expressed as a linear combination of the components of f, the 'kernel pricing' functions, on a set of underlying risk factors x:

\[
\mathrm{MtM}(\mathbf{x}) = \sum_{i=1}^{N} w_i f_i(\mathbf{x}). \tag{18}
\]

We estimate the moments of the predictive distribution p(MtM_* | X, Y, X_*), where MtM_* := MtM(X_*):

\[
\mathbb{E}[\mathrm{MtM}_* \mid X, Y, X_*] = \mathbf{w}^\top \hat{M}, \tag{19}
\]
\[
\mathrm{cov}(\mathrm{MtM}_* \mid X, Y, X_*) = \mathbf{w}^\top (\hat{\Sigma} \otimes \hat{\Omega})\, \mathbf{w} - \mathbf{w}^\top (\hat{M} \otimes \hat{M})\, \mathbf{w}, \tag{20}
\]

where

\[
\hat{M} = K'(X_*, X)^\top K'(X, X)^{-1} Y, \tag{21}
\]
\[
\hat{\Sigma} = K'(X_*, X_*) - K'(X_*, X)^\top K'(X, X)^{-1} K'(X_*, X). \tag{22}
\]

We therefore have an expression for estimating the value of a portfolio, given the underlying risk factors, which accounts for the dependence between the financial derivative contracts. In general, the financial derivative contracts in a portfolio share common risk factors, and the risk factors are correlated.
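Assuming the Kronecker structure of Equation 16, contracting the predictive covariance with the weight vector gives the portfolio moments; the closed form (w'Ωw)Σ̂ below is our reading of the contraction in Equations 19-20, and all names are illustrative.

```python
# Sketch: aggregate the multi-GP predictive moments into portfolio moments.
# M_hat (m x d) and Sigma_hat (m x m) come from multi_gp_predict above;
# Omega (d x d) is the fitted row covariance; w (d,) holds position weights.
import numpy as np

def portfolio_moments(M_hat, Sigma_hat, Omega, w):
    mean = M_hat @ w                    # E[MtM*] at each of the m test points
    # cov(vec(f*)) = kron(Sigma_hat, Omega); contracting both asset
    # dimensions with w yields the m x m portfolio covariance
    # (w' Omega w) * Sigma_hat.
    cov = (w @ Omega @ w) * Sigma_hat
    return mean, cov
```

For the two-option example of Figure 1 below, w = np.array([2.0, -1.0]) encodes the two long calls and the short put.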

The integral of the marginal distribution of MtM over x_* ∈ X_* gives

\[
p(\mathrm{MtM} \mid X, Y) = \int p(\mathrm{MtM} \mid X, Y, \mathbf{x}_*)\, p(\mathbf{x}_*)\, d\mathbf{x}_*, \tag{23}
\]

where p(x_*) is the prior over x_* and MtM is now a scalar value, depending on the training set, and not a function of x_*.

Example The above concepts are illustrated in Figure 1 for a portfolio holding two long positions in a call option struck at 130 (left) and a short position in a put option struck at 70 (center), where S_0 = 100. For ease of exposition, the time to maturity of each option is the same and assumed fixed here (2 years). In this example, there is one risk factor which is common to both options - the underlying instrument S. To illustrate the uncertainty band under interpolation, each GP (with an RBF kernel) has been trained to (Black-Scholes) model prices as a function of S on just five training points, without additive noise. Typically, one would use hundreds of training points.

The multi-GP subsequently estimates the price of the options at a number of test points. Some of these test points have been chosen to coincide with the training set, and others are not in the set. The test points which are also in the training set are observed to exhibit a zero-width 95% confidence band, whereas test points far from observed points exhibit a wide band. The value of the portfolio at the training and test points is shown in the right-hand graph. Note


that the uncertainty in the point estimates is an aggregate of the uncertainty in the point estimate of each option price and the cross-terms in the covariance matrix in Equations 19 and 20. If, instead, single GPs were used separately for the put and the call price, then the uncertainty in the point estimate would neglect the cross-terms in the covariance matrix.

(a) call price (b) put price (c) portfolio price

Figure 1: Using a set of just five training points, the predicted mean (red line) and variance of the posterior are estimated from Equations 19 and 20 over all S_* for the call option (left), put option (center) and portfolio (right), for the case when no additive noise is used for training. The gray shaded envelope represents the 95% confidence interval about the mean of the posterior. The exact result, using the Black-Scholes pricing formula, is given by the black line. Note that the time to maturity of the options is fixed to two years.

4.1 Discussion

Our approach learns a kernel representation of the joint posterior distribution over the estimated derivative prices in a portfolio. This posterior is used in conjunction with a parameterized covariance function over the input space. We emphasize that our approach does not fit a (parameterized) covariance function to the derivative prices, only to the risk factors. Imposing covariance structure on the derivative prices would raise the difficult question of how to attain consistency between the derivative covariance structure and the underlying covariance structure.

The GP model, as illustrated here, is purely spatial (i.e. there is no temporal component) and it is entirely financial-model based (i.e. trained from simulated data). Specifically, the training set of the GP is a grid of risk factors and corresponding model option prices. We then estimate the option price at a test point, not necessarily in the training set, and evaluate the mean and covariance of the posterior. Kernel learning is sufficiently flexible to allow for the function to be non-smooth, as observed at, say, the maturity of the option.

The above example has no time dependency - we only learned a snapshot in time of an option surface, at a fixed time to maturity of two years. In Section 6, we consider learning option prices as a function of underlying prices, volatility and time to maturity - fixing the time to maturity for each GP and then stepping backward in time to give a sequence of GPs.

We also clarify that the underlying risk factor dynamics are kept separate


from the GP model. This has the practical advantage that the same approach and software can be used to learn any risk factor dynamics, since GPs are general and, in our setup, pricing model agnostic. In using an option pricing model to train the GP, we tacitly assume a data generation process for the risk factor dynamics. For example, in Section 6, we shall train a GP from a Heston model and evaluate the CVA by simulating the price and volatility under Heston dynamics using Monte-Carlo.

Once the 'pricing kernel layer' - consisting of a kernel representation of the prices of all options in the portfolio - has been learned, there is no need to evaluate derivative prices with a numerical pricing formula. Hence the practical utility of our multi-GP approach is the ability to quickly predict new option prices and, hence, portfolio values, together with an error estimate which accounts for the covariance of the derivative prices over the test points.

Moreover, the weights of the portfolio can change, as the pricing kernel layer allows for dynamic weights. Thus the predictive distribution of the portfolio remains valid even when the portfolio composition changes (e.g. in the context of trade incremental xVA computations, see (Albanese et al., 2018, Section 5)). Note that, if a new derivative is added to the portfolio, we need not necessarily retrain all the GPs - the mean posterior estimate of the portfolio value remains valid. However, the kernels must be relearned to update the covariance estimate. By construction, a derivative can be subtracted from the portfolio by simply setting its weight to zero - no retraining is required.

In principle, a financial 'model-free' alternative approach could be formulated by only using past observed risk factors and option prices at different maturities. In estimating the surface of the option price, it is difficult for the GP to decouple the effects of each observed variable on the observed option prices, e.g. fixing price and implied volatility, and varying only time to maturity. Moreover, many OTC derivatives do not have comparable exchange traded instruments and can be illiquid. Hence we have chosen to pursue an option model based approach. However, we evaluate the potential for a model free approach in Section 6.5.

4.2 Portfolio Risk

In this section we combine our spatial kernel option pricing layer with a temporal model for the risk factors. We hence arrive at a spatio-temporal model for portfolio risk which accounts for the joint uncertainty in the point estimation of the financial derivative contracts.

Under a Markovian stochastic process (X_t)_{t≥0}, the marginal distribution of the portfolio value MtM_{t+h} at time t + h, given X_t = x, satisfies

\[
p(\mathrm{MtM}_{t+h} \mid X, Y, X_t = \mathbf{x}) = \int p(\mathrm{MtM}(\mathbf{x}_*) \mid X, Y, \mathbf{x}_*)\, p(X_{t+h} = \mathbf{x}_* \mid X_t = \mathbf{x})\, d\mathbf{x}_*, \tag{24}
\]

where the multivariate transition density function p(X_{t+h} | X_t) for (X_t)_{t≥0} is determined by a diffusion model or estimated from historical data. Typically


p(X_{t+h} = x_* | X_t = x) is not known in closed form and must be estimated with Monte-Carlo simulation. Hence, our approach combines Monte-Carlo simulation with GP pricing to estimate the portfolio risk. We refer to this approach as MC-GP.

The distribution of the future portfolio value depends on the uncertainty from the point distribution p(MtM_* | X, Y, x_*). Note that if X_{t+h} = x_* ∈ X, then the uncertainty in the estimate MtM_* is zero.

Note that if the risk factor is not observable, or the risk manager simply seeks to express uncertainty p(X_t) in the current risk factor value (where p(X_t) refers to a prior distribution on X_t), then the more general form of Equation 24 can be used:

\[
p(\mathrm{MtM}_{t+h} \mid X, Y) = \int\!\!\int p(\mathrm{MtM}(\mathbf{x}_*) \mid X, Y, \mathbf{x}_*)\, p(X_{t+h} = \mathbf{x}_* \mid X_t = \mathbf{x})\, p(X_t = \mathbf{x})\, d\mathbf{x}_*\, d\mathbf{x}. \tag{25}
\]

In Section 6, we use Equation 24 to estimate the expected future exposure of a portfolio and the associated kernel approximation error for CVA estimation. However, our kernel approach described above is general and valid for any portfolio risk measure, such as VaR and Expected Shortfall, and for techniques such as stress testing.
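As a concrete illustration of Equation 24, the sketch below samples the transition density with a plain GBM step (an illustrative choice; Section 6 uses Heston dynamics) and evaluates the GP posterior mean at each simulated risk factor. The gp_mean_fn argument stands for any fitted GP mean, e.g. from the sketches above; all names are ours.

```python
# MC-GP sketch of Eq. (24): simulate X_{t+h} | X_t = x under GBM, then
# evaluate the GP portfolio posterior at each simulated risk factor.
import numpy as np

def mc_gp_exposure(gp_mean_fn, S0=100.0, mu=0.0, vol=0.2, h=1.0,
                   n_paths=10000, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(n_paths)
    S_h = S0 * np.exp((mu - 0.5 * vol**2) * h + vol * np.sqrt(h) * Z)
    mtm = gp_mean_fn(S_h[:, None])    # GP posterior mean at each sample
    return mtm                        # samples from p(MtM_{t+h} | X, Y, X_t)
```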

4.3 Computational aspects

We reiterate that the benefit of using GPs is primarily fast real-time computation. If the option depends on several variables, then n = ∏_i n_i, where n_i is the number of grid points per variable. Single GPs involved in the MtM cube computation are independent and can thus be trained in parallel over a grid of compute nodes such as a GPU or CPU. Also, although there are many models, the number of input variables per model is typically small, and hence the training set consists of relatively few observations. The training of multi-GPs is more challenging since it involves, in principle, fitting to all instruments in the portfolio. In practice, we identify the subsets of derivatives sharing common risk factors and fit a multi-GP to each subset. The computational overhead is justified by more accurate uncertainty estimates.

Note that although each kernel matrix K_{X,X} is n × n, we only store the n-vector α in (6) for each option, which brings reduced memory requirements.

Online learning If the option pricing model is recalibrated intra-day, then the corresponding GP model should be retrained - online learning techniques permit performing this incrementally (Pillonetto et al., 2010). To enable online learning, the training data should be augmented with the constant model parameters. Each time the parameters are updated, a new observation (x', y') is generated from the option model prices under the new parameterization. The posterior at a test point x_* is then updated with the new training point by 'online learning':


\[
p(f_* \mid X, Y, \mathbf{x}', \mathbf{y}', \mathbf{x}_*) = \frac{p(\mathbf{x}', \mathbf{y}' \mid f_*)\, p(f_* \mid X, Y, \mathbf{x}_*)}{\int_{f_*} p(\mathbf{x}', \mathbf{y}' \mid f_*)\, p(f_* \mid X, Y, \mathbf{x}_*)\, df_*}, \tag{26}
\]

where the previous posterior p(f_* | X, Y, x_*) becomes the prior in the update. The key idea is that the GP 'shadows' the pricing model as it is recalibrated. Specifically, the GP learns over time as the model parameters (which are an input to the GP) are updated through pricing model recalibration.
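A naive sketch of this update is shown below: in practice, Equation 26 can be implemented incrementally via rank-one updates, as in Pillonetto et al. (2010); the plain refit shown here is the simplest, non-incremental variant, and all names are illustrative.

```python
# Online-learning sketch: when the pricing model is recalibrated, append the
# new observation (x', y') -- whose inputs include the updated model
# parameters -- and refresh the posterior by refitting. Incremental
# (rank-one) updates avoid the full O(n^3) refit.
import numpy as np

def online_update(X, Y, x_new, y_new, fit_fn):
    X_aug = np.vstack([X, x_new[None, :]])
    Y_aug = np.append(Y, y_new)
    return fit_fn(X_aug, Y_aug)   # e.g. gp.fit from the scikit-learn sketch
```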

Many small versus one holistic GP We suggest, in passing, an alternative configuration of GPs for risk. Instead of fitting a GP to each derivative in the portfolio, as suggested above, a single GP could be fit to the portfolio value. The perceived benefit of this approach is its simplicity - there would only be one GP per portfolio. Note, however, that such a holistic, portfolio-wide GP could not be used for Greek estimation unless the portfolio surface is learned over a high dimensional input space of all risk factors. The training time would then be prohibitive for larger portfolios. Moreover, if the weights of the portfolio are changed, then the GP must be re-trained. We mention these pros and cons so that the most suitable approach can be assessed for each risk application.

5 CVA

As an example of a portfolio risk application, we consider the estimation of counterparty credit risk on a client portfolio. The expected loss to the bank associated with the counterparty defaulting is given by the (unilateral) CVA. Taking expectations with respect to the pricing measure associated with a numeraire N_t at time t, the expected loss from client default is given by

\[
\mathrm{CVA}_0 = (1 - R)\, \mathbb{E} \int_0^T \mathrm{MtM}_t^+ \, N_t^{-1}\, \delta_\tau(dt), \tag{27}
\]

where δ_τ is a Dirac measure at the client default time τ and R is the client recovery rate.

Assuming τ is endowed with a stochastic intensity process γ and a basic immersion setup between a reference filtration and the filtration progressively enlarged with τ, we have

\[
\mathrm{CVA}_0 = (1 - R)\, \mathbb{E} \int_0^T \mathrm{MtM}_t^+ \, N_t^{-1}\, e^{-\int_0^t \gamma_s\, ds}\, \gamma_t\, dt. \tag{28}
\]

Moreover, under Markovian specifications, MtM_t is a deterministic function of time and suitable risk factors X_t, i.e. MtM_t = MtM(t, X_t); likewise, in the case of intensity models, γ_t = γ(t, X_t). Factors common to MtM and γ allow modeling wrong way risk3, i.e. the risk of adverse dependence between the risk of default and the market exposure.

3Or, at least, soft wrong way risk, by contrast with hard wrong way risk that may arise in models with common jump specifications (see (Crépey and Song, 2016)).


In the special case where the default is independent of the portfolio value expressed in numeraire units, the expression (27) simplifies to

\[
\mathrm{CVA}_0 = (1 - R) \int_0^T \mathbb{E}[\mathrm{MtM}_t^+ N_t^{-1}]\, p(t)\, dt, \tag{29}
\]

where p(t) is the probability density function of τ. The probabilities ∆p_i = P(t_i ≤ τ < t_{i+1}) can be bootstrapped from the CDS curve of the client (or some proxy if such a curve is not directly available).

To compute the CVA numerically based on (29) in the independent case, a set of n dates t_1, ..., t_n = T is chosen over which to evaluate the expected positive exposure E[MtM_t^+ N_t^{-1}]. In stochastic default intensity models, one can likewise evaluate E[MtM_t^+ N_t^{-1} e^{-∫_0^t γ_s ds} γ_t] and compute the CVA based on (28), or simulate τ and compute the CVA based on (27).

In any case, the portfolio ingredients (contingent claims such as options) are priced with respect to the same numeraire and pricing measure as the ones that appear in the CVA formulas. Note that the portfolio weights w_i in (18) are then all 0 or 1, in line with the trade incremental feature of xVA computations (cf. (Albanese et al., 2018, Section 5)).

5.1 GP Regression estimation of CVA

First we consider the independent case (29), which entails the following Monte-Carlo estimate of the CVA over M paths, along which the market risk factors are sampled:

\[
\mathrm{CVA}_M = \frac{(1 - R)}{M} \sum_{j=1}^{M} \sum_{i=1}^{n} \mathrm{MtM}(t_i, \mathbf{X}^{(j)}_{t_i})^+ \, (N^{(j)}_{t_i})^{-1}\, \Delta p_i, \tag{30}
\]

where the exact portfolio value MtM(t_i, X^{(j)}_{t_i})^+ is evaluated at the simulated risk factor X^{(j)}_{t_i} in path j at time t_i.

Then we replace the exact derivative prices with the mean of the posterior function conditioned on the simulated market risk factors X_{t_i}:

\[
\widehat{\mathrm{CVA}}_M = \frac{(1 - R)}{M} \sum_{j=1}^{M} \sum_{i=1}^{n} \left( \mathbb{E}[\mathrm{MtM}_* \mid X, Y, \mathbf{x}_* = \mathbf{X}^{(j)}_{t_i}] \right)^{+} (N^{(j)}_{t_i})^{-1}\, \Delta p_i, \tag{31}
\]

and the following GP error estimate, based on the covariance of the posterior of MtM_*, evaluated over each sample path:

\[
\varepsilon_M = \frac{(1 - R)}{M} \sum_{j=1}^{M} \sum_{i=1}^{n} \mathbb{1}_{\{\mathbb{E}[\mathrm{MtM}_* \mid \cdots] > 0\}}\, \mathrm{cov}(\mathrm{MtM}_* \mid X, Y, \mathbf{x}_* = \mathbf{X}^{(j)}_{t_i})\, (N^{(j)}_{t_i})^{-1}\, \Delta p_i. \tag{32}
\]

The above approximation uses Gaussian process regression to estimate the potential future exposure of the portfolio. We note that the pricing models


are still fitted to model generated data, assuming a data generation process for the risk factors. However, we have used machine learning to learn the derivative exposure as a function of the underlying and other parameters, including (by slicing in time) the time to maturity. In this way, we avoid nested Monte-Carlo simulations, which are computationally demanding, and payoff regression LSMC schemes, which come without error control (cf. (Barrera et al., 2017)). Moreover, the multi-GP regression provides an estimate of the amount of error in the point estimation of the portfolio value.
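The estimator (31) reduces to a weighted average over the simulated exposure cube; below is a minimal NumPy sketch, with the error estimate (32) following analogously by substituting the posterior covariance for the exposure on positive-mean samples. Array names are illustrative.

```python
# Sketch of Eq. (31): mtm_mean[j, i] = E[MtM* | X, Y, x* = X^{(j)}_{t_i}],
# disc[j, i] = (N^{(j)}_{t_i})^{-1}, dp[i] = Delta p_i from the CDS curve.
import numpy as np

def cva_mc_gp(mtm_mean, disc, dp, R=0.4):
    exposure = np.maximum(mtm_mean, 0.0)   # positive part of the GP mean
    return (1.0 - R) * np.mean(np.sum(exposure * disc * dp[None, :], axis=1))
```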

6 Numerical Experiments

In the following example, we use MC-GP simulation to estimate the CVA from Equations 31 and 32. All performances are based on a 2.2 GHz Intel Core i7 laptop. The portfolio holds a long position in both a European call and a put option struck on the same underlying, with K = 100. We assume that the underlying follows Heston dynamics:

\[
\frac{dS_t}{S_t} = \mu\, dt + \sqrt{V_t}\, dW^1_t, \tag{33}
\]
\[
dV_t = \kappa(\theta - V_t)\, dt + \sigma \sqrt{V_t}\, dW^2_t, \tag{34}
\]
\[
d\langle W^1, W^2 \rangle_t = \rho\, dt, \tag{35}
\]

where the notation and the fixed parameter values used for the experiments are given in Table 1, with µ = r_0. We use a Fourier Cosine method (Fang and Oosterlee, 2008) to generate the European Heston option price training and testing data for the GP. We also use this method as a benchmark for the GP Greeks, obtained by differentiating the kernel function.

Parameter description    Symbol    Value
Mean reversion rate      κ         0.1
Mean reversion level     θ         0.15
Vol. of vol.             σ         0.1
Risk free rate           r_0       0.002
Strike                   K         100
Maturity                 T         2.0
Correlation              ρ         −0.9

Table 1: This table shows the values of the parameters for the Heston dynamics and the terms of the European call and put option contracts.
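For reference, a full-truncation Euler scheme consistent with the parameters of Tables 1 and 2 might look as follows; the truncation choice is ours, as the paper only specifies an Euler time stepper.

```python
# Illustrative full-truncation Euler scheme for the Heston dynamics (33)-(35).
import numpy as np

def heston_euler(S0=100.0, V0=0.1, mu=0.002, kappa=0.1, theta=0.15,
                 sigma=0.1, rho=-0.9, T=2.0, n_steps=100, n_paths=1000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(n_paths, S0)
    V = np.full(n_paths, V0)
    for _ in range(n_steps):
        Z1 = rng.standard_normal(n_paths)
        Z2 = rho * Z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n_paths)
        Vp = np.maximum(V, 0.0)                       # full truncation of V
        S *= np.exp((mu - 0.5 * Vp) * dt + np.sqrt(Vp * dt) * Z1)
        V = V + kappa * (theta - Vp) * dt + sigma * np.sqrt(Vp * dt) * Z2
    return S, V
```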

For the corresponding intervals used in the CVA estimate, we simultaneously fit a multi-GP to both gridded call and put prices over price and volatility, keeping the time to maturity fixed. Figure 2 shows the gridded call (top) and put (bottom) price surfaces at various times to maturity, together with the GP


estimate. Within each column in the figure, the same GP model has been simultaneously fitted to both the call and put price surfaces over a 30 × 30 grid Ω_h ⊂ Ω := [0, 1] × [0, 1] of prices and volatilities4, fixing the time to maturity. The scaling to the unit domain is not essential; however, we observed superior numerical stability when scaling.

Across the columns, corresponding to different times to maturity, different GP models have been fitted. The GP is then evaluated out-of-sample over a 40 × 40 grid Ω_{h'} ⊂ Ω, so that many of the test samples are new to the model. This is repeated over the various times to maturity corresponding to the MtM exposure simulation times t_i. The option model and the GP model are observed to produce very similar results.

Table 1 lists the values of the parameters for the Heston dynamics and the terms of the European call and put option contracts used in our numerical experiments. Tables 2 and 3 show the values for the Euler time stepper used for simulating the Heston dynamics and for the credit risk model.

(a) Call: T − t = 1.0 (b) Call: T − t = 0.5 (c) Call: T − t = 0.1
(d) Put: T − t = 1.0 (e) Put: T − t = 0.5 (f) Put: T − t = 0.1

Figure 2: This figure compares the gridded Heston model call (top) and put (bottom) price surfaces at various times to maturity with the GP estimate. The GP estimate is observed to be practically identical (slightly below in the first five panels and slightly above in the last one). Within each column in the figure, the same GP model has been simultaneously fitted to both the Heston model call and put price surfaces over a 30 × 30 grid of prices and volatilities, fixing the time to maturity. Across the columns, corresponding to different times to maturity, different GP models have been fitted. The GP is then evaluated out-of-sample over a 40 × 40 grid, so that many of the test samples are new to the model. This is repeated over various times to maturity.

Figure 3 compares the full-MC (left) and MC-GP estimates of the expected positive exposure of the portfolio over time. The error in the MC-GP estimate

4Note that the plot uses the original coordinates and not the re-scaled coordinates.


Parameter description    Symbol    Value
Number of simulations    M         1000
Number of time steps     ns        100
Initial stock price      S_0       100
Initial variance         V_0       0.1

Table 2: This table shows the values for the Euler time stepper used for market risk factor simulation.

Parameter description      Symbol    Value
Constant hazard rate       λ         0.1
Number of time steps t_i   n         10
Recovery rate              R         0.4

Table 3: This table shows the parameters of the reduced form credit risk model used for estimating the CVA in our numerical experiments.

and the 95% uncertainty band, exclusive of the MC sampling error, are also shown against time (right).

Figure 4 shows how the error in the MC-GP CVA estimate, versus MC with full portfolio evaluation, decays with the number of training samples used for each GP model. The 95% confidence band of the MC-GP prediction, exclusive of the MC sampling error, is also shown. Note that while the number of training samples is varied, the 40 × 40 testing set remains fixed during the experiment.

Figure 3: (Left) Full-MC and MC-GP estimates of the expected positive exposure of the portfolio over time. The two graphs are practically indistinguishable, with one graph superimposed over the other. (Right) The error in the MC-GP estimate and the 95% uncertainty band (exclusive of the MC sampling error), shown against time.

6.1 CVA VaR

In this section, we demonstrate the application of GPs to the estimation of the Value-at-Risk (VaR) of the one year incremental CVA. The purpose of the calculation is to estimate, at a given confidence level, the extent to which the CVA will


Figure 4: This figure compares the GP component of the error in the MC-GP CVA estimate versus MC (with full portfolio evaluation), against the number of training samples used for each GP model. The 95% confidence band of the GP prediction is also shown, centered about the MC-GP CVA estimation error.

increase over the next year. For this purpose, we estimate the distribution of the incremental CVA over one year, i.e. of the random variable (CVA_1 − CVA_0).

In order to illustrate CVA VaR estimation using both credit and market simulation, we introduce a dynamic default intensity model (Bielecki et al., 2011):

\[
\lambda(S) = \lambda_0 \left( \frac{S_0}{S} \right)^{\lambda_1}, \tag{36}
\]

where λ_0 = 0.02 and λ_1 either equals 1.2 (strong interaction between equity and credit) or 0 (no interaction between equity and credit). Additionally, we impose the constraint

\[
\mathbb{E}\, e^{-\int_0^T \lambda(S_t)\, dt} = \mathbb{P}(\tau > T), \tag{37}
\]

so that the model parameters (λ_0, λ_1) match a given target value, extracted from the client CDS curve. See Appendix C for further details.
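One way to enforce the constraint (37) is to fix λ_1 and solve for λ_0 by root finding over simulated paths; the sketch below (zero-drift GBM, root finding via SciPy's brentq) is our illustrative calibration, not the paper's code.

```python
# Illustrative calibration of lambda_0 in the intensity model (36) under the
# survival constraint (37), by Monte Carlo over GBM paths and root finding.
import numpy as np
from scipy.optimize import brentq

def calibrate_lambda0(target_survival=0.05, lam1=1.2, S0=100.0, vol=0.2,
                      T=2.0, n_steps=100, n_paths=10000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    Z = rng.standard_normal((n_paths, n_steps))
    logS = np.log(S0) + np.cumsum(-0.5 * vol**2 * dt + vol * np.sqrt(dt) * Z,
                                  axis=1)
    ratio = (S0 / np.exp(logS))**lam1          # (S0 / S_t)^{lambda_1} per step
    def gap(lam0):                             # E[exp(-int lambda(S_t) dt)] - target
        return np.mean(np.exp(-lam0 * np.sum(ratio, axis=1) * dt)) - target_survival
    return brentq(gap, 1e-6, 10.0)
```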

For simplicity we assume a GBM asset price process (S_t)_{t≥0}, and model the (pre-default) CVA process such that

\[
\mathbb{1}_{\{t < \tau\}}\, \mathrm{CVA}(t, S_t) = \mathbb{1}_{\{t < \tau\}}\, \mathbb{E}[\mathbb{1}_{\{\tau < T\}}\, \mathrm{MtM}(\tau, S_\tau) \mid S_t,\, t < \tau], \tag{38}
\]

where N = 1, i.e. we use zero interest rates and the related risk-neutral pricing measure with expectation E, and MtM is the value of a portfolio long two BS calls and short one BS put, with strikes 130 and 70 respectively. Each option has a maturity of two years.


Figure 5 (left) shows the one year incremental CVA distribution, as estimated with a full MC (i.e. using the Black-Scholes formula) and with the MC-GP method with error bounds. In order to isolate the effect of the GP approximation, we use identical random numbers for each method.

The MC-GP and MC Analytic graphs are practically indistinguishable, with one graph superimposed over the other. Note that the reason for the sharp approximation is three-fold: (i) the dimension (in the sense of the number of risk factors) is only 1; (ii) the statistical experiment has been configured as an interpolation problem, with many of the gridded training points close to the gridded test points; and (iii) the training sample size of 200 is relatively large for approximating smooth surfaces (with no outliers). The center plot shows the mean error and GP uncertainty in the CVA estimation against the training size. The right hand plot shows the distribution of λ(S_t) at various times over the simulation horizon for a fixed λ_1 = 1.2.

Figure 5: (Left) The one year incremental CVA distribution, as estimated by MC with full repricing using Black-Scholes formulas versus the MC-GP method with error bounds. In order to isolate the effect of the GP approximation, we use identical random numbers for each method. (Center) The mean error and GP uncertainty in the CVA estimation against the training size. (Right) The distribution of λ(S_t) at various times over the simulation horizon for a fixed λ_1 = 1.2.

Figure 6 shows (left) the distribution of the 99% VaR of (CVA_1 − CVA_0) under a chi-squared prior on the parameter λ_1 in Equation 36. The corresponding prior on λ_0 which satisfies the constraint (37) with P(τ > 2) = 0.05 is also shown (center). The MC-GP and MC with analytic option prices (right) are observed to be practically identical under the same random numbers.

6.2 Derivatives

The GP provides analytic derivatives with respect to the input variables:

\[
\partial_{X_*} \mathbb{E}[f_* \mid X, Y, X_*] = \partial_{X_*} \mu_{X_*} + \partial_{X_*} K_{X_*,X}\, \alpha, \tag{39}
\]

where ∂_{X_*} K_{X_*,X} = (1/ℓ²)(X − X_*) K_{X_*,X} and we recall that α = [K_{X,X} + σ²I]⁻¹y (in the numerical experiments we set µ = 0). Second order sensitivities are obtained by differentiating once more with respect to X_*.

Note that α is already calculated at training time (for pricing) by Cholesky matrix factorization of [K_{X,X} + σ²I] with O(n³) complexity, so there is no


Figure 6: This figure shows the distribution of the 99% VaR of (CVA_1 − CVA_0) under a chi-squared prior (left) on λ_1 in Equation 36. (Center) The corresponding prior on λ_0 which satisfies the constraint (37) with P(τ > 2) = 0.05. (Right) The MC-GP and MC with analytic option prices are observed to be practically identical under the same random numbers.

significant computational overhead from Greeking. Once the GP has learned the derivative prices, Equation 39 is used to evaluate the first order Greeks with respect to the input variables over the test set. Example source code illustrating the implementation of this calculation is given in Section A.2.
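The paper's own excerpt is in Section A.2; for orientation, a one-dimensional sketch of Equation 39 with an RBF kernel is shown below, reusing α from training. Names are illustrative.

```python
# Sketch of Eq. (39) for a 1-D input (underlying price S): the GP delta
# follows by differentiating the RBF kernel; alpha = [K + sigma^2 I]^{-1} y.
import numpy as np

def gp_delta(X_train, alpha, X_test, ell=1.0):
    K_s = np.exp(-0.5 * (X_test[:, None] - X_train[None, :])**2 / ell**2)
    dK = (X_train[None, :] - X_test[:, None]) / ell**2 * K_s  # d/dX* K_{X*,X}
    return dK @ alpha                                         # Eq. (39), mu = 0
```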

Figure 7 shows (left) the GP estimate of the call option's delta ∆ := ∂C/∂S and (right) vega ν := ∂C/∂σ, having trained on the underlying prices, implied volatility and BS option model prices. For the avoidance of doubt, the model is not trained on the BS Greeks. For comparison, the BS delta and vega are also shown in the figure. In each case, the two graphs are practically indistinguishable, with one graph superimposed over the other.

Figure 7: This figure shows (left) the GP estimate of the call option's delta ∆ := ∂C/∂S and (right) vega ν := ∂C/∂σ, having trained on the underlying prices, implied volatility and BS option model prices. For the avoidance of doubt, the model is not trained on the BS Greeks. For comparison, the BS delta and vega are also shown. In each case, the two graphs are practically indistinguishable, with one graph superimposed over the other.


6.3 Scalability with MSGPs

Figure 8 shows the increase in MSGP training and prediction time against the number of training points. Fixing the number of inducing points to 30, we increase the number of training points over a 1D training set. Note that training and testing times can be improved with CUDA on a GPU, but this is not evaluated here.

Setting the number of SGD iterations to 1000, we observe an approximate 1.4x increase in training time for a 10x increase in the training sample size. We observe an approximate 2x increase in prediction time for a 10x increase in the training sample size. This arises from an increase in the cache computation time, since we call the prediction each time after training.

Figure 8: (Left) The elapsed wall-clock time for training is shown against the number of training points. (Right) The elapsed wall-clock time for prediction of a single point is shown against the number of testing points (cache computation time).

6.4 Mesh-Free GPs

The above numerical examples have trained and tested GPs on uniform grids. This approach suffers from the curse of dimensionality, as the number of training points grows exponentially with the dimensionality of the data. This is why, in order to estimate the MtM cube, we advocate divide-and-conquer, i.e. the use of numerous low-dimensional GPs run in parallel on specific asset classes.

However, the use of a fixed grid is by no means necessary. We show here how GPs can exhibit favorable approximation properties with a relatively small number of simulated reference points (cf. also (Gramacy and Apley, 2015)).

Figure 9 shows the predicted Heston call prices using (left) 50 and (right) 100 simulated training points, indicated by '+'s, drawn from a uniform random distribution. The Heston call option is struck at K = 100 with a maturity of T = 2 years.
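As a concrete illustration, the following minimal sketch reproduces the spirit of this experiment with scikit-learn. Here bsformula from the repository's BlackScholes module stands in for the Heston pricer used in Figure 9; this substitution is an assumption made purely to keep the sketch self-contained.

import numpy as np
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import RBF
from BlackScholes import *

K, r, T, sigma, lb, ub = 100.0, 0.0002, 2.0, 0.4, 0.001, 300.0
price = lambda x: bsformula(1, lb + (ub - lb) * x, K, r, T, sigma, 0)[0]

np.random.seed(0)
for n in [50, 100]:
    # simulated reference points drawn uniformly at random (no grid)
    x_train = np.random.uniform(0.0, 1.0, n).reshape(-1, 1)
    y_train = price(x_train).ravel()
    x_test = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
    gp = gaussian_process.GaussianProcessRegressor(
        kernel=RBF(length_scale=1.0, length_scale_bounds=(0.01, 10000.0)),
        n_restarts_optimizer=10)
    gp.fit(x_train, y_train)
    mse = np.mean((gp.predict(x_test) - price(x_test).ravel()) ** 2)
    print(n, 'training points, out-of-sample MSE:', mse)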

Figure 10 (left) shows the convergence of the (GP) MSE of the prediction as a function of the number of simulated 1D training points. Fixing the number of simulated points to 100, but increasing the dimensionality of each observation point, Figure 10 (right) shows the wall-clock time for training a GP with SKI.


Figure 9: This figure shows the predicted Heston call prices using (left) 50 and (right) 100 simulated training points, indicated by '+'s, drawn from a uniform random distribution.

Note that the number of SGD iterations has been fixed to 1000.

Figure 10: (Left) The convergence of the MSE of the prediction is shown basedon the number of simulated 1D training points. (Right) Fixing the number ofsimulated points to 100, but increasing the dimensionality d of each observationpoint, the figure shows the wall-clock time for training a GP with SKI.

6.5 Model-free price estimation

In this section, we estimate equity option prices from historical observationsof underlying price, time-to-maturity, strike, implied volatility, option type andoption prices. In our dataset5 each option chain6 is observed over four snapshotsin time (observation days). For each chain, we separate calls and puts andconstruct a training set from the moneyness S/K, implied volatility, time-to-maturity and option price using three of the snapshots (approximately 1300observations). The most recent snapshot is reserved for testing.
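A minimal sketch of this data preparation step is given below. The file name and column names (type, underlying, strike, snapshot, implied_vol, time_to_maturity, price) are hypothetical placeholders for the actual dataset layout.

import pandas as pd

chain = pd.read_csv('option_chain.csv')   # hypothetical file of chain snapshots
calls = chain[chain['type'] == 'call'].copy()
calls['moneyness'] = calls['underlying'] / calls['strike']

train = calls[calls['snapshot'] < 3]      # first three observation days
test = calls[calls['snapshot'] == 3]      # most recent snapshot held out

features = ['moneyness', 'implied_vol', 'time_to_maturity']
X_train, y_train = train[features].values, train['price'].values
X_test, y_test = test[features].values, test['price'].values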

One of the challenges of training GPs arises when test points must be extrapolated. Wilson and Adams (2013) propose the use of spectral mixture kernels to improve the capacity of GPs to extrapolate.

5 The dataset has been downloaded from https://mamamomama.org on September 20th, 2018.

6 A set of European calls and puts with different moneynesses and times-to-maturity.


The results shown here use a spectral mixture kernel provided in GPyTorch.
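A minimal sketch of this kernel substitution follows, replacing the RBF data kernel of Section A.1 with GPyTorch's spectral mixture kernel. It is written single-task for brevity, whereas the experiment uses the multi-task setup of Section A.1, and the keyword for the number of mixture components has varied across GPyTorch releases (n_mixtures in early versions, num_mixtures later), so the snippet is indicative rather than version-exact.

import torch
import gpytorch

class SpectralMixtureGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(SpectralMixtureGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.SpectralMixtureKernel(n_mixtures=4)
        # initialize the mixture parameters from the empirical spectrum of the data
        self.covar_module.initialize_from_data(train_x, train_y)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.random_variables.GaussianRandomVariable(mean_x, covar_x)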

Figure 11 (left) compares the GP estimate of the call prices (blue), having trained on the joint observations of moneyness, maturity and volatility, with the observed out-of-sample call prices (red). The training data is shown with gray points; note that the volatility is not shown in the figure. Figure 11 (right) compares the error in the GP estimate, with and without volatility as an input variable, against the observed call prices in the test set as a function of moneyness for a fixed maturity (2 years). The figure shows the importance of including volatility as an input variable. In particular, the uncertainty in the GP estimate is observed to be large if the volatility is excluded.

The above exercise also shows the difference between uncertainty in point estimation and model risk: one should not overstate the message of the red curve and of the related GP error in Figure 11 (right) by concluding that the "true" price would lie within the gray band at a 95% confidence level. This would only be true if all the explanatory factors had been included in the first place.

Figure 11: This figure compares (left) the GP estimate of the call prices (blue), having trained on the joint observations of moneyness, maturity and volatility, with the observed out-of-sample call prices (red). The training data is shown with gray points. Note that the volatility is not shown in the figure. (Right) The error in the GP estimate, with and without volatility as an input variable, is compared with the observed call prices in the test set against moneyness for a fixed maturity (2 years).


7 Conclusion

This paper introduces a Gaussian process interpolation and Monte Carlo simulation (MC-GP) approach for the fast evaluation of derivative portfolios and their risk. The approach is demonstrated by estimating the CVA on a simple portfolio, with numerical studies of the accuracy and convergence of MC-GP estimates. The primary advantage of kernel learning over full repricing is computational: there is no need to call expensive derivative pricing or Greeking functions for risk point estimates once the kernels have been learned. The kernels permit a closed form approximation for the sensitivity of the portfolio risk to the risk factors, and the approach preserves the flexibility to rebalance the portfolio. However, the advantage is more than just computational. The risk estimation approach is Bayesian: the uncertainty in a point estimate, at inputs which the model hasn't seen in the training data, is quantified and can be factored into the risk estimate. Additionally, derivatives of the pricing kernel layer are given analytically and hence avoid the use of numerical differentiation.

A Code Excerpts

All code excerpts are extracted from the GitHub repository https://github.com/mfrdixon/GP-CVA.

A.1 Multi-GP Prediction with GPyTorch

This Python 3 code excerpt, using GPyTorch, illustrates how to train a MC-GP for predicting the value of a toy portfolio containing a call and a put option (priced under Black-Scholes). We used the Adam update rule for SGD. See Section 3 for details of the MultitaskGPModel model and Example-1-MGP-BS-Pricing.ipynb to run the implementation.

import torch
import gpytorch
import numpy as np
import matplotlib.pyplot as plt

from numpy import *
from BlackScholes import *
from torch import optim
from gpytorch.kernels import RBFKernel, MultitaskKernel
from gpytorch.means import ConstantMean, MultitaskMean
from gpytorch.likelihoods import MultitaskGaussianLikelihood
from gpytorch.random_variables import MultitaskGaussianRandomVariable

# define the multi-GP class
class MultitaskGPModel(gpytorch.models.ExactGP):

    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = MultitaskMean(ConstantMean(), n_tasks=2)
        self.data_covar_module = RBFKernel()
        self.covar_module = MultitaskKernel(self.data_covar_module, n_tasks=2, rank=1)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultitaskGaussianRandomVariable(mean_x, covar_x)

# set BS model parameters
r = 0.0002   # risk-free rate
S = 100      # underlying spot
KC = 130     # call strike
KP = 70      # put strike
sigma = 0.4  # implied volatility
T = 2.0      # time to maturity
lb = 0.001   # lower bound on domain
ub = 300     # upper bound on domain

# define the call and put prices using the BS model
call = lambda x: bsformula(1, lb + (ub - lb) * x, KC, r, T, sigma, 0)[0]
put = lambda x: bsformula(-1, lb + (ub - lb) * x, KP, r, T, sigma, 0)[0]

training_number = 30   # number of training samples
testing_number = 100   # number of testing samples

train_x = torch.linspace(0, 1.0, training_number)
train_y1 = torch.FloatTensor(call(np.array(train_x)))
train_y2 = torch.FloatTensor(put(np.array(train_x)))

# create a train_y which interleaves the two responses
train_y = torch.stack([train_y1, train_y2], -1)

# test data
test_x = torch.linspace(0, 1.0, testing_number)
test_y1 = torch.FloatTensor(call(np.array(test_x)))
test_y2 = torch.FloatTensor(put(np.array(test_x)))
test_y = torch.stack([test_y1, test_y2], -1)

# a Gaussian likelihood is used for regression to give the predictive
# mean + variance and to learn the noise (Equation 17)
likelihood = MultitaskGaussianLikelihood(n_tasks=2)
gp = MultitaskGPModel(train_x, train_y, likelihood)
gp.train()
likelihood.train()

# use the Adam optimizer
optimizer = torch.optim.Adam([
    {'params': gp.parameters()},  # includes GaussianLikelihood parameters
], lr=0.1)

# "loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, gp)

n_iter = 300
for i in range(n_iter):
    optimizer.zero_grad()
    output = gp(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    if (i % 10 == 0):
        print('Iter %d/%d - Loss: %.3f' % (i + 1, n_iter, loss.item()))
    optimizer.step()

# make predictions
gp.eval()
likelihood.eval()
with torch.no_grad():
    y_hat = likelihood(gp(test_x))             # Equation 15
    lower, upper = y_hat.confidence_region()   # Equation 16


A.2 GP Greeks

This Python 3 code excerpt, using scikit-learn, illustrates how to calculate the derivative of the option by differentiating the GP price model. Here x holds gridded underlying prices, so that f_prime is the estimate of the delta; if x instead held gridded volatilities, then f_prime would be the estimate of the vega. See Example-2-GP-BS-Derivatives.ipynb to run the implementation.

import scipy as sp
import matplotlib.pyplot as plt
import numpy as np

from BlackScholes import *
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# set BS model parameters
r = 0.0002   # risk-free rate
S = 100      # underlying spot
KC = 130     # call strike
KP = 70      # put strike
sigma = 0.4  # implied volatility
T = 2.0      # time to maturity
lb = 0.001   # lower bound on domain
ub = 300     # upper bound on domain

call = lambda x, y: bsformula(1, lb + (ub - lb) * x, KC, r, T, y, 0)[0]
put = lambda x, y: bsformula(-1, lb + (ub - lb) * x, KP, r, T, y, 0)[0]

training_number = 30
testing_number = 100

x_train = np.array(np.linspace(0.01, 1.0, training_number),
                   dtype='float32').reshape(training_number, 1)
x_test = np.array(np.linspace(0.01, 1.0, testing_number),
                  dtype='float32').reshape(testing_number, 1)

y_train = []
for idx in range(len(x_train)):
    y_train.append(call(x_train[idx], sigma))
y_train = np.array(y_train)

sk_kernel = RBF(length_scale=1.0, length_scale_bounds=(0.01, 10000.0))
gp = gaussian_process.GaussianProcessRegressor(kernel=sk_kernel, n_restarts_optimizer=20)
gp.fit(x_train, y_train)
y_pred, sigma_hat = gp.predict(x_test, return_std=True)

# recover the fitted length-scale and precompute alpha = K_y^{-1} y
l = gp.kernel_.length_scale
rbf = gaussian_process.kernels.RBF(length_scale=l)

Kernel = rbf(x_train, x_train)
K_y = Kernel + np.eye(training_number) * 1e-8
L = sp.linalg.cho_factor(K_y)
alpha_p = sp.linalg.cho_solve(L, y_train)

k_s = rbf(x_test, x_train)

# differentiate the RBF kernel with respect to the test inputs
k_s_prime = np.zeros([len(x_test), len(x_train)])
for i in range(len(x_test)):
    for j in range(len(x_train)):
        k_s_prime[i, j] = (1.0 / l ** 2) * (x_train[j] - x_test[i]) * k_s[i, j]

# calculate the gradient of the posterior mean using Equation 39
# (the division by (ub - lb) undoes the rescaling of the input domain)
f_prime = np.dot(k_s_prime, alpha_p) / (ub - lb)

# show the error between the BS delta and the GP delta
delta = lambda x, y: bsformula(1, lb + (ub - lb) * x, KC, r, T, y, 0)[1]
delta(x_test, sigma) - f_prime

A.3 MC-GP CVA

This Python 3 code excerpt, using scikit-learn, illustrates how to simulate the CVA of a portfolio using MC-GP. The implementation assumes BS pricing with a dynamic default intensity model given by Equation 36. See Example-3-MC-GPA-BS-CVA.ipynb for further details of the implementation.

import numpy as np
import scipy as sp
from BlackScholes import *
from scipy import *

from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import WhiteKernel, ConstantKernel, RBF

# BS model parameters
S0 = 100
KC = 130
KP = 70
r = 0.0002
sigma = 0.4
T = 2.0
nt = 11      # number of points on the time grid
lb = 0.01    # lower bound on domain
ub = 400     # upper bound on domain

timegrid = np.array(np.linspace(0.0, T, nt), dtype='float32').reshape(nt, 1)

# specify the portfolio
portfolio = {}
portfolio['call'] = {}
portfolio['put'] = {}
portfolio['call']['price'] = lambda x, y: bsformula(1, lb + (ub - lb) * x, KC, r, y, sigma, 0)[0]
portfolio['put']['price'] = lambda x, y: bsformula(-1, lb + (ub - lb) * x, KP, r, y, sigma, 0)[0]
portfolio['call']['weight'] = 2.0
portfolio['put']['weight'] = -1.0

# prepare training and test data
training_number = 100
testing_number = 50
x_train = np.array(np.linspace(0.0, 1.0, training_number), dtype='float32').reshape(training_number, 1)
x_test = np.array(np.linspace(0.0, 1.0, testing_number), dtype='float32').reshape(testing_number, 1)

# train and predict over the portfolio, one GP per instrument and time point
# (trainGPs and predictGPs are helpers provided in the accompanying notebook)
for key in portfolio.keys():
    portfolio[key]['GPs'] = trainGPs(x_train, portfolio[key]['price'], timegrid)
    portfolio[key]['y_tests'], portfolio[key]['preds'], portfolio[key]['sigmas'] = \
        predictGPs(x_test, portfolio[key]['price'], portfolio[key]['GPs'], timegrid)

# MC-CVA
n_sim_dt = 100  # number of Euler time steps

# simulate underlying GBM dynamics
stride = n_sim_dt // (nt - 1)
idx = np.arange(0, n_sim_dt + 1, stride, dtype=int)

M = 1000    # number of sample paths

seed = 777  # random seed
np.random.seed(seed)

pi_tilde = np.zeros(((nt - 1), M), dtype='float32')      # GP portfolio value
pi = np.zeros(((nt - 1), M), dtype='float32')            # BS portfolio value
pi_tilde_var = np.zeros(((nt - 1), M), dtype='float32')  # GP portfolio variance
lmbdas = np.zeros(((nt - 1), M), dtype='float32')        # hazard rates

gamma_0 = 0.02  # parameters of the default intensity model
gamma_1 = 1.2

for m in range(M):
    t, S = gbm(S0, r, sigma, T, n_sim_dt)
    i = 1
    for time in timegrid[1:]:
        S_ = S[idx[i]]  # simulated S
        # avoid the simulated S breaching the boundaries of the domain
        if (S_ < lb):
            S_ = lb
        if (S_ > ub):
            S_ = ub
        # default intensity model
        lmbdas[i - 1, m] = gamma_0 * (S0 / S_) ** gamma_1
        pred_ = 0
        v_ = 0
        var_ = 0
        for key in portfolio.keys():
            pred, std = portfolio[key]['GPs'][i].predict(
                np.array([(S_ - lb) / (ub - lb)]).reshape(1, -1), return_std=True)
            pred_ += portfolio[key]['weight'] * np.maximum(pred, 0)
            var_ += (portfolio[key]['weight'] * std) ** 2
            if key == 'call':
                v_ += portfolio[key]['weight'] * np.maximum(bsformula(1, S_, KC, r, time, sigma, 0)[0], 0)
            else:
                v_ += portfolio[key]['weight'] * np.maximum(bsformula(-1, S_, KP, r, time, sigma, 0)[0], 0)
        pi_tilde[i - 1, m] = np.maximum(pred_, 0)
        pi[i - 1, m] = np.maximum(v_, 0)
        pi_tilde_var[i - 1, m] = var_
        i += 1

# compute default probabilities
hazard_rate = np.mean(lmbdas)
recovery = 0.4
dPD = np.exp(-hazard_rate * (timegrid[:-1])) - np.exp(-hazard_rate * (timegrid[1:]))

# one year CVA distribution
i = 5
EPE_tilde = 0
EPE = 0
t = 1.0
for time in timegrid[6:]:
    EPE_tilde += dPD[i, :] * pi_tilde[i, :] * np.exp(r * (time - t))
    EPE += dPD[i, :] * pi[i, :] * np.exp(r * (time - t))
    i += 1

CVA_1_yr_tilde = (1 - recovery) * EPE_tilde * np.exp(-r * t)
CVA_1_yr = (1 - recovery) * EPE * np.exp(-r * t)

# CVA_0
i = 0
EPE_tilde = 0
EPE = 0
EPE_tilde_up = 0
EPE_tilde_down = 0

for time in timegrid[1:]:
    EPE_tilde += np.mean(dPD[i, :] * pi_tilde[i, :]) * np.exp(r * (time - t))
    EPE_tilde_up += np.mean(dPD[i, :] * pi_tilde[i, :] + 2 * np.sqrt(np.mean(pi_tilde_var[i, :]))) * np.exp(r * (time - t))
    EPE_tilde_down += np.mean(dPD[i, :] * pi_tilde[i, :] - 2 * np.sqrt(np.mean(pi_tilde_var[i, :]))) * np.exp(r * (time - t))
    EPE += np.mean(dPD[i, :] * pi[i, :]) * np.exp(r * (time - t))
    i += 1

CVA_tilde = (1 - recovery) * EPE_tilde
CVA_tilde_up = (1 - recovery) * EPE_tilde_up
CVA_tilde_down = (1 - recovery) * EPE_tilde_down
CVA = (1 - recovery) * EPE

A.4 MC-GP CVA VaR

This Python 3 code excerpt, using scikit-learn, illustrates how to simulate the CVA-VaR of a portfolio using MC-GP. The implementation assumes BS pricing with a dynamic default intensity model given by Equation 36. Note that, for conciseness, this excerpt should only be run after running the previous excerpt. See Example-4-MC-GPA-BS-CVA-VaR.ipynb for further details.

import scipy as sp
import scipy.optimize
from scipy import stats

np.random.seed(seed)
J = 100  # number of outer simulation paths for prior sampling

pi_tilde = np.zeros(((nt - 1), M, J), dtype='float32')
pi = np.zeros(((nt - 1), M, J), dtype='float32')
pi_tilde_var = np.zeros(((nt - 1), M, J), dtype='float32')
S = np.zeros(((nt - 1), M, J), dtype='float32')
lmbdas = np.zeros(((nt - 1), M, J), dtype='float32')
gamma_0 = np.zeros(J, dtype='float32')

# sample gamma_1 from the prior distribution using non-central chi-squared variates
gamma_1 = (1.2 + 1.0 * np.random.randn(J)) ** 2

for j in range(J):
    eps = np.random.randn(M)
    i = 1
    print(j)  # print the loop counter for progress tracking
    for time in timegrid[1:]:
        # direct sampling of the GBM marginal law (for speed)
        S_ = S0 * np.exp((r - 0.5 * sigma ** 2) * time + sigma * np.sqrt(time) * eps)
        # avoid the simulated S breaching the boundaries of the domain
        if np.any(S_ < lb):
            S_[S_ < lb] = lb
        if np.any(S_ > ub):
            S_[S_ > ub] = ub
        pred_ = 0
        v_ = 0
        var_ = 0
        for key in portfolio.keys():
            pred, std = portfolio[key]['GPs'][i].predict(
                np.array([(S_ - lb) / (ub - lb)]).reshape(-1, 1), return_std=True)
            pred_ += portfolio[key]['weight'] * np.maximum(pred, 0)
            var_ += (portfolio[key]['weight'] * std) ** 2
            if key == 'call':
                v_ += portfolio[key]['weight'] * np.maximum(bsformula(1, S_, KC, r, time, sigma, 0)[0], 0)
            else:
                v_ += portfolio[key]['weight'] * np.maximum(bsformula(-1, S_, KP, r, time, sigma, 0)[0], 0)
        pi_tilde[i - 1, :, j] = np.maximum(pred_, 0).flatten()
        pi[i - 1, :, j] = np.maximum(v_, 0).flatten()
        pi_tilde_var[i - 1, :, j] = var_
        S[i - 1, :, j] = S_
        i += 1

    # solve for gamma_0 given gamma_1 (cf. Equation 45)
    x = np.exp((S0 / S[:, :, j]) ** gamma_1[j])  # ln x_ij = (S0/S)^gamma_1
    dt = timegrid[1] - timegrid[0]
    # target default probability (assumed to be estimated from the credit spread)
    p = 0.05
    f = lambda y: np.abs(np.mean(np.prod(x ** (-y * dt), axis=0)) - p)
    res = sp.optimize.basinhopping(f, 0.1, niter=10)
    ii = 1
    while (abs(res.fun) > 1e-3):
        res = sp.optimize.basinhopping(f, 0.1, niter=100 * ii)
        ii *= 2
    gamma_0[j] = res.x
    print(gamma_0[j], gamma_1[j], f(gamma_0[j]), res.fun)
    for i in range(1, len(timegrid)):
        lmbdas[i - 1, :, j] = gamma_0[j] * (S0 / S[i - 1, :, j]) ** gamma_1[j]

# estimate default probabilities
hazard_rate = np.mean(lmbdas)
dPD = np.zeros(((nt - 1), M, J), dtype='float32')
recovery = 0.4
for j in range(J):
    dPD[:, :, j] = np.exp(-lmbdas[:, :, j] * (timegrid[:-1])) - np.exp(-lmbdas[:, :, j] * (timegrid[1:]))

# one year CVA distribution
EPE_tilde = np.zeros((M, J), dtype='float32')
EPE_tilde_up = np.zeros((M, J), dtype='float32')
EPE_tilde_down = np.zeros((M, J), dtype='float32')
EPE = np.zeros((M, J), dtype='float32')
CVA_1_yr_tilde = np.zeros((M, J), dtype='float32')
CVA_1_yr_tilde_up = np.zeros((M, J), dtype='float32')
CVA_1_yr_tilde_down = np.zeros((M, J), dtype='float32')
CVA_1_yr = np.zeros((M, J), dtype='float32')
t = 1.0
for j in range(J):
    i = 5
    for time in timegrid[6:]:
        EPE_tilde[:, j] += dPD[i, :, j] * pi_tilde[i, :, j] * np.exp(r * (time - t))
        EPE_tilde_up[:, j] += dPD[i, :, j] * (pi_tilde[i, :, j] + 2 * np.sqrt(pi_tilde_var[i, :, j])) * np.exp(r * (time - t))
        EPE_tilde_down[:, j] += dPD[i, :, j] * (pi_tilde[i, :, j] - 2 * np.sqrt(pi_tilde_var[i, :, j])) * np.exp(r * (time - t))
        EPE[:, j] += dPD[i, :, j] * pi[i, :, j] * np.exp(r * (time - t))
        i += 1
    CVA_1_yr_tilde[:, j] = (1 - recovery) * EPE_tilde[:, j] * np.exp(-r * t)
    CVA_1_yr_tilde_up[:, j] = (1 - recovery) * EPE_tilde_up[:, j] * np.exp(-r * t)
    CVA_1_yr_tilde_down[:, j] = (1 - recovery) * EPE_tilde_down[:, j] * np.exp(-r * t)
    CVA_1_yr[:, j] = (1 - recovery) * EPE[:, j] * np.exp(-r * t)

# CVA_0
EPE_tilde = np.zeros(J)
EPE = np.zeros(J)
EPE_tilde_up = np.zeros(J)
EPE_tilde_down = np.zeros(J)

for j in range(J):
    i = 0
    for time in timegrid[1:]:
        EPE_tilde[j] += np.mean(dPD[i, :, j] * pi_tilde[i, :, j]) * np.exp(r * (time - t))
        EPE_tilde_up[j] += np.mean(dPD[i, :, j] * pi_tilde[i, :, j] + 2 * np.sqrt(np.mean(pi_tilde_var[i, :, j]))) * np.exp(r * (time - t))
        EPE_tilde_down[j] += np.mean(dPD[i, :, j] * pi_tilde[i, :, j] - 2 * np.sqrt(np.mean(pi_tilde_var[i, :, j]))) * np.exp(r * (time - t))
        EPE[j] += np.mean(dPD[i, :, j] * pi[i, :, j]) * np.exp(r * (time - t))
        i += 1

CVA_tilde = (1 - recovery) * EPE_tilde
CVA_tilde_up = (1 - recovery) * EPE_tilde_up
CVA_tilde_down = (1 - recovery) * EPE_tilde_down
CVA = (1 - recovery) * EPE

B Structured kernel interpolation (SKI)

Given a set of $m$ inducing points, the $n \times m$ cross-covariance matrix $K_{X,U}$ between the training inputs $X$ and the inducing points $U$ can be approximated as $K_{X,U} \approx W_X K_{U,U}$, using a (potentially sparse) $n \times m$ matrix of interpolation weights $W_X$. This allows us to approximate $K_{X,Z}$ for an arbitrary set of inputs $Z$ as $K_{X,Z} \approx K_{X,U} W_Z^\top$. For any given kernel function $K$ and a set of inducing points $U$, the structured kernel interpolation (SKI) procedure (Gardner et al., 2018) gives rise to the following approximate kernel:

$$K_{\mathrm{SKI}}(x, z) = W_x K_{U,U} W_z^\top, \tag{40}$$

which allows us to approximate $K_{X,X} \approx W_X K_{U,U} W_X^\top$. Gardner et al. (2018) note that standard inducing point approaches, such as subset of regressors (SoR) or fully independent training conditional (FITC), can be reinterpreted from the SKI perspective. Importantly, the efficiency of SKI-based MSGP methods comes, first, from a clever choice of inducing points that allows the algebraic structure of $K_{U,U}$ to be exploited and, second, from the use of very sparse local interpolation matrices. In practice, local cubic interpolation is used.
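The following minimal 1D sketch illustrates the approximation $K_{X,X} \approx W_X K_{U,U} W_X^\top$; it uses linear rather than cubic interpolation weights, an assumption made only to keep the example short.

import numpy as np

def rbf(a, b, ell=0.2):
    # dense RBF kernel matrix between two sets of 1D inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

m, n = 50, 7
U = np.linspace(0, 1, m)                  # regularly spaced inducing points
X = np.random.RandomState(0).rand(n)      # arbitrary training inputs

# sparse-style interpolation weights: each row of W has only two nonzeros
W = np.zeros((n, m))
h = U[1] - U[0]
j = np.minimum((X / h).astype(int), m - 2)
w = (X - U[j]) / h
W[np.arange(n), j] = 1 - w
W[np.arange(n), j + 1] = w

K_ski = W @ rbf(U, U) @ W.T
print(np.abs(K_ski - rbf(X, X)).max())    # small interpolation error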

B.1 Kernel approximations

If the inducing points $U$ form a regularly spaced $P$-dimensional grid, and we use a stationary product kernel (e.g., the RBF kernel), then $K_{U,U}$ decomposes as a Kronecker product of Toeplitz matrices:

$$K_{U,U} = T_1 \otimes T_2 \otimes \cdots \otimes T_P. \tag{41}$$

The Kronecker structure allows the eigendecomposition of $K_{U,U}$ to be computed by separately decomposing $T_1, \dots, T_P$, each of which is much smaller than $K_{U,U}$. Further, to eigendecompose a Toeplitz matrix efficiently, it can be approximated by a circulant matrix (Gardner et al. (2018) explored 5 different approximation methods known in the numerical analysis literature), which is eigendecomposed by simply applying the discrete Fourier transform (DFT) to its first column. Therefore, an approximate eigendecomposition of each $T_1, \dots, T_P$ is computed via the fast Fourier transform (FFT) and requires only $O(m \log m)$ time.
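The following minimal numpy sketch verifies the fact used above, namely that a circulant matrix is diagonalized by the DFT, so its eigenvalues are obtained by applying the FFT to its first column.

import numpy as np
from scipy.linalg import circulant

c = np.array([4.0, 1.0, 0.5, 0.25, 0.5, 1.0])  # first column (symmetric, as for a kernel)
C = circulant(c)

eig_fft = np.fft.fft(c)             # O(m log m) eigenvalues via the FFT
eig_dense = np.linalg.eigvals(C)    # O(m^3) dense eigendecomposition

print(np.allclose(np.sort(eig_fft.real), np.sort(eig_dense.real)))  # True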

B.2 Structure exploiting inference

To perform inference, we need to solve $(K_{\mathrm{SKI}} + \sigma^2 I)^{-1} \mathbf{y}$; kernel learning requires evaluating $\log \det(K_{\mathrm{SKI}} + \sigma^2 I)$. The first task can be accomplished by an iterative scheme, linear conjugate gradients, which depends only on matrix-vector multiplications with $(K_{\mathrm{SKI}} + \sigma^2 I)$. The second is performed by exploiting the Kronecker and Toeplitz structure of $K_{U,U}$ to compute an approximate eigendecomposition, as described above.
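A minimal sketch of the first task follows: solving $(K + \sigma^2 I)^{-1}\mathbf{y}$ with conjugate gradients, supplying only a matrix-vector product. A dense RBF kernel matrix is used purely for illustration; in SKI the matvec would instead apply $W_X K_{U,U} W_X^\top v$. (Depending on the SciPy version, the tolerance keyword of cg is tol or rtol.)

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 500
X = np.linspace(0, 1, n)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.1 ** 2)  # RBF kernel matrix
sigma2 = 1e-2
y = np.sin(10 * X)

# only a matvec is exposed to the solver, never the factored matrix
mv = lambda v: K @ v + sigma2 * v
A = LinearOperator((n, n), matvec=mv)
alpha, info = cg(A, y, tol=1e-6)
print(info, np.allclose(K @ alpha + sigma2 * alpha, y, atol=1e-4))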

C Dynamic Default Intensity Model Constraint

Recall that the parameters $(\lambda_0, \lambda_1)$ of the dynamic intensity model (36) must satisfy the constraint

$$\mathbb{E}\, e^{-\int_0^T \lambda(S_t)\,dt} = \mathbb{P}(\tau > T), \tag{42}$$

where the right-hand side is a given target value extracted from the client CDS curve.

Under Monte Carlo simulation with time stepping, we impose the asymptotically equivalent constraint

$$\frac{1}{M} \sum_{j=1}^{M} e^{-\sum_{i=1}^{n} \lambda(S^{(j)}_{t_i})\,\Delta t} = \mathbb{P}(\tau > T). \tag{43}$$

For any non-negative choice of $\lambda_1$ there is a corresponding $\lambda_0$ which satisfies the constraint in Equation 43. Setting $\ln x_{ij} := \big(S_0 / S^{(j)}_{t_i}\big)^{\lambda_1}$ gives

$$\frac{1}{M} \sum_{j=1}^{M} \prod_{i=1}^{n} x_{ij}^{-\lambda_0 \Delta t} = \mathbb{P}(\tau > T), \tag{44}$$

and so we solve the non-linear expression in $\lambda_0$:

$$f(\lambda_0; \lambda_1) := \frac{1}{M} \sum_{j=1}^{M} \prod_{i=1}^{n} x_{ij}^{-\lambda_0 \Delta t} - \mathbb{P}(\tau > T) = 0. \tag{45}$$

Note that if $\lambda_1$ is given by a prior distribution, then for the random samples $\{\lambda_1^{(i)}\}_{i=1}^{N}$ drawn from the prior, there exists a corresponding $\{\lambda_0^{(i)}\}_{i=1}^{N}$ given by $f(\lambda_0^{(i)}; \lambda_1^{(i)}) = 0$, for all $i$.


References

Abbas-Turki, L. A., S. Crépey, and B. Diallo (2018). XVA principles, nested Monte Carlo strategies, and GPU optimizations. International Journal of Theoretical and Applied Finance 21, 1850030. Forthcoming (DOI: 10.1142/S0219024918500309).

Albanese, C., M. Chataigner, and S. Crépey (2018). Wealth transfers, indifference pricing, and XVA compression schemes. In Y. Jiao (Ed.), From Probability to Finance: Lecture Notes of the BICMR Summer School on Financial Mathematics, Mathematical Lectures from Peking University Series. Springer. Forthcoming.

Antonov, A., S. Issakov, A. McClelland, and S. Mechkov (2018). Pathwise XVA Greeks for early-exercise products. Risk Magazine (January).

Barrera, D., S. Crépey, B. Diallo, G. Fort, E. Gobet, and U. Stazhynski (2017). Stochastic approximation schemes for economic capital and risk margin computations. hal-01710394.

Bielecki, T. R., S. Crépey, M. Jeanblanc, and M. Rutkowski (2011). Convertible bonds in a defaultable diffusion model. In A. Kohatsu-Higa, N. Privault, and S.-J. Sheu (Eds.), Stochastic Analysis with Financial Applications, Basel, pp. 255–298. Springer Basel.

Capriotti, L. (2011). Fast Greeks by algorithmic differentiation. Journal of Computational Finance 14(3), 3–35.

Capriotti, L., J. Lee, and M. Peacock (2011). Real-time counterparty credit risk management in Monte Carlo. Risk 24(6).

Chen, Z., B. Wang, and A. N. Gorban (2017, March). Multivariate Gaussian and Student-t process regression for multi-output prediction. ArXiv e-prints.

Cousin, A., H. Maatouk, and D. Rullière (2016). Kriging of financial term structures. European Journal of Operational Research 255, 631–648.

Crépey, S. and S. Song (2016). Counterparty risk and funding: Immersion and beyond. Finance and Stochastics 20(4), 901–930.

Crépey, S., T. Bielecki, and D. Brigo (2014). Counterparty Risk and Funding. New York: Chapman and Hall/CRC.

da Barrosa, M. R., A. V. Salles, and C. de Oliveira Ribeiro (2016). Portfolio optimization through kriging methods. Applied Economics 48(50), 4894–4905.

E, W., J. Han, and A. Jentzen (2017). Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. arXiv:1706.04702.

Fang, F. and C. W. Oosterlee (2008). A novel pricing method for European options based on Fourier-cosine series expansions. SIAM Journal on Scientific Computing.

Gardner, J., G. Pleiss, R. Wu, K. Weinberger, and A. Wilson (2018). Product kernel interpolation for scalable Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 1407–1416.

Gaß, M., K. Glau, and M. Mair (2017). Magic points in finance: Empirical integration for parametric option pricing. SIAM Journal on Financial Mathematics 8, 766–803.

Giles, M. and P. Glasserman (2005). Smoking adjoints: fast evaluation of Greeks in Monte Carlo calculations. Technical report.

Gramacy, R. and D. Apley (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics 24(2), 561–578.

Hernandez, A. (2017). Model calibration with neural networks. Risk Magazine (June 1-5). Preprint version available at SSRN.2812140, code available at https://github.com/Andres-Hernandez/CalibrationNN.

Kenyon, C. and A. Green (2014). Efficient XVA management: Pricing, hedging, and attribution using trade-level regression and global conditioning. Papers, arXiv.org.

Liu, M. and J. Staum (2010). Stochastic kriging for efficient nested simulation of expected shortfall. Journal of Risk 12(3), 3–27.

Longstaff, F. A. and E. S. Schwartz (2001). Valuing American options by simulation: A simple least-squares approach. The Review of Financial Studies 14(1), 113–147.

Ludkovski, M. (2018). Kriging metamodels and experimental design for Bermudan option pricing. Journal of Computational Finance 22(1), 37–77.

MacKay, D. J. (1997). Gaussian processes: a replacement for supervised neural networks?

Micchelli, C. A., Y. Xu, and H. Zhang (2006, December). Universal kernels. Journal of Machine Learning Research 7, 2651–2667.

Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.

Pillonetto, G., F. Dinuzzo, and G. De Nicolao (2010, February). Bayesian online multitask learning of Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(2), 193–205.

Rasmussen, C. E. and Z. Ghahramani (2001). Occam's razor. In Advances in Neural Information Processing Systems 13, pp. 294–300. MIT Press.

Rasmussen, C. E. and C. K. I. Williams (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.

Roberts, S., M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain (2013). Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 371(1984).

Schölkopf, B. and A. J. Smola (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press.

Spiegeleer, J. D., D. B. Madan, S. Reyners, and W. Schoutens (2018). Machine learning for quantitative finance: fast derivative pricing, hedging and fitting. Quantitative Finance, 1–9.

Whittle, P. and T. J. Sargent (1983). Prediction and Regulation by Linear Least-Square Methods (NED - New edition ed.). University of Minnesota Press.

Wilson, A. G. and R. P. Adams (2013). Gaussian process covariance kernels for pattern discovery and extrapolation. CoRR abs/1302.4245.
