Conditional Copula Inference and Efficient Approximate MCMC
by
Evgeny Levi
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Statistical Sciences, University of Toronto
© Copyright 2019 by Evgeny Levi
Abstract
Conditional Copula Inference and Efficient Approximate MCMC
Evgeny Levi
Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto
2019
This thesis consists of two main parts. The first part focuses on parametric conditional cop-
ula models that allow the copula parameters to vary with a set of covariates according to an
unknown calibration function. Flexible Bayesian inference for the calibration function of a
bivariate conditional copula is introduced. The prior distribution over the set of smooth cal-
ibration functions is built using a sparse Gaussian Process prior for the Single Index Model.
The estimation of parameters from the marginal distributions and the calibration function
is done jointly via Markov Chain Monte Carlo sampling from the full posterior distribution.
A new Conditional Cross Validated Pseudo-Marginal criterion is used to perform copula se-
lection and is modified using a permutation-based procedure to assess data support for the
simplifying assumption.
The first part concludes with methods for establishing data support for the simplifying as-
sumption in a bivariate conditional copula model. After splitting the observed data into
training and test sets, the proposed method uses a flexible Bayesian model fit to the training
data to define tests based on randomization and standard asymptotic theory. I discuss
theoretical justification for the method and implementations in alternative models of interest:
Gaussian, logistic and quantile regressions. The performance is studied via simulated data.
The second part of the thesis focuses on approximate Bayesian methods. Approximate
Bayesian Computation (ABC) and Bayesian Synthetic Likelihood (BSL) are popular
simulation-based methods for sampling from the posterior distribution when the likelihood is
not tractable but simulations for each parameter value are easily available. However, these
methods can be computationally inefficient since a large number of pseudo-data simulations
is required. I propose to use perturbed MCMC versions of the ABC and BSL algorithms and
attempt to significantly accelerate these samplers. The main idea of the proposed strategy is
to utilize past samples with a k-Nearest-Neighbour approach for likelihood approximation. This
general method works for both ABC and BSL and greatly reduces the computational cost and
number of required simulations for these samplers. Performance and computational advantages
are examined via a series of simulation examples. The second part concludes with theoretical
justifications and convergence properties of the proposed strategies.
Acknowledgements
First, I would like to express my sincere gratitude to my supervisor, Radu Craiu, for his
guidance, patience and generous financial support. I am sure this work would not have been
accomplished without his help; I am very lucky to have him as my supervisor.
I would like to thank my committee members, Stanislav Volgushev and Daniel Roy, for
comments and advice that greatly improved this thesis. Special gratitude goes to Stanislav
Volgushev for great discussions about time series topics and the simplifying assumption
problem, which now constitutes a large portion of my thesis. Special appreciation goes to
Jeffrey Rosenthal for his insightful comments about MCMC theory.
I wish to thank Nancy Reid, David Brenner, Radford Neal, Jerry Brunner, Fang Yao and
Keith Knight for the courses I have taken, which greatly increased my interest in and
comprehension of statistical science.
I would also like to truly thank the staff members in the Statistical Sciences department,
Andrea Carter, Christine Bulguryemez, Annette Courtemanche, Angela Fleury and Dermot
Whelan who were always available to help at every step of my graduate studies.
Contents
I Conditional Copula and Simplifying Assumption Testing 1
1 Introduction 2
1.1 Conditional Copulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Brief review of Markov Chain Monte Carlo (MCMC) . . . . . . . . . . . . . 6
1.2.1 Metropolis Hastings algorithm . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Bayesian Inference and Gaussian Processes . . . . . . . . . . . . . . . . . . . 9
1.4 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Simplifying Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Bayesian Conditional Copula using Gaussian Processes 16
2.1 GP-SIM for Conditional copula . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Computational Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Performance of the algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Simulation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 Proof of concept based on one Replicate . . . . . . . . . . . . . . . . 24
2.3.4 Multiple Replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Model Selection 32
3.1 Conditional CVML criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Simulation results with CVML, CCVML and WAIC criteria . . . . . . . . . 33
3.2.1 One replicate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 Multiple replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Additional Simulation Results Based On Multiple Replicates . . . . . . . . . 35
4 Simplifying Assumption 38
4.1 Interesting Connection between Model Misspecification and the Simplifying
Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 A Permutation-based CVML to Detect Data Support for the Simplified As-
sumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Two other methods for Detecting Data Support for SA . . . . . . . . . . . . 43
4.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5 Theoretical justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.6 Extensions to other models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6.1 Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.6.3 Quantile Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Data Analysis 65
5.1 Red Wine Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Analysis and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
II Approximated Bayesian Methods 69
6 Introduction 70
6.1 The Need of Simulation Based Methods . . . . . . . . . . . . . . . . . . . . . 70
6.2 Approximate Bayesian Computation (ABC) . . . . . . . . . . . . . . . . . . 71
6.3 Bayesian Synthetic Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4 Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7 Approximated ABC (AABC) 79
7.1 Computational Inefficiency of ABC . . . . . . . . . . . . . . . . . . . . . . . 79
7.2 Approximated ABC-MCMC (AABC-MCMC) . . . . . . . . . . . . . . . . . 79
8 Approximated BSL (ABSL) 83
8.1 Computational Inefficiency of BSL . . . . . . . . . . . . . . . . . . . . . . . 83
8.2 Approximated Bayesian Synthetic Likelihood (ABSL) . . . . . . . . . . . . . 83
9 Simulations 86
9.1 Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.2 Measures for Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
9.3 Moving Average Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.4 Ricker’s Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.5 Stochastic Volatility with Gaussian emissions . . . . . . . . . . . . . . . . . . 97
9.6 Stochastic Volatility with α-Stable errors . . . . . . . . . . . . . . . . . . . . 101
10 Data Analysis 106
10.1 Dow-Jones log-returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
10.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
11 Theoretical Justifications 109
11.1 Preliminary Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
11.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
11.3 Proofs of theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
III Final Remarks 120
12 Conclusions and Future Work 121
List of Tables
2.1 Copula density functions for each copula family. . . . . . . . . . . . . . . . 21
2.2 Parameter’s range, Inverse-link functions and the functional relationship be-
tween Kendall’s τ and the copula parameter. . . . . . . . . . . . . . . . . . 22
2.3 Estimated √Bias², √IVar and √IMSE of Kendall’s τ for each Scenario and
Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Estimated √Bias², √IVar and √IMSE of E(Y1|y2, x) for each Scenario and
Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Estimated √Bias², √IVar and √IMSE of Kendall’s τ and E(Y1|y2, x) for GP-
SIM and Additive models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 CVML, CCVML and WAIC values for each Scenario and Model. . . . . . . 34
3.2 The percentage of correct decisions for each selection criterion when comparing
the correct Clayton model with a non-constant calibration with all the other
models: Frank model with non-constant calibration, Gaussian model with
non-constant calibration, Clayton model with constant calibration. . . . . . 34
3.3 The percentage of correct decisions for each selection criterion when comparing
the correct Clayton model with a constant calibration with three models:
Clayton, Frank and Gaussian, all of them assuming a GP-SIM calibration. . 35
3.4 The percentage of correct decisions for each selection criterion when comparing
the correct additive model with GP-SIM with non-constant calibration. . . 35
3.5 Estimated √Bias², √IVar and √IMSE of Kendall’s τ for each Scenario and
Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Estimated √Bias², √IVar and √IMSE of E(Y1|y2, x) for each Scenario and
Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7 The percentage of correct decisions for each selection criterion when comparing
the correct Clayton model with a non-constant calibration with all the other
models: Frank model with non-constant calibration, Gaussian model with
non-constant calibration, Clayton model with constant calibration. . . . . . 37
4.1 Missed covariate: CVML, CCVML and WAIC criteria values for the model in which the
conditional copula depends on one covariate and the model in which it is constant. . . . .
4.2 The percentage of correct decisions for each selection criterion and scenarios.
GP-SIM and SA were fitted with Clayton copula, sample size is 1500. . . . 43
4.3 The percentage of correct decisions for each selection criterion and scenario.
Predicted CVML and CCVML values based on n1 = 1000 training and n2 =
500 test data, respectively. The calculation of EV is based on a random sample
of 500 permutations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Method 1: A permutation-based procedure for assessing data support in favour
of SA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Method 2: A Chi-square test for assessing data support in favour of SA . . . 45
4.6 Simulation Results: Generic, proportion of rejection of SA for each scenario,
sample size and generic criteria. . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.7 Simulation Results: Proposed method, proportion of rejection of SA for each
scenario, sample size, number of bins (K) and method. . . . . . . . . . . . . 48
4.8 Simulation Results for Regression: Generic, proportion of rejections of SA for
each scenario, sample size and generic criteria. . . . . . . . . . . . . . . . . . 54
4.9 Simulation Results for Regression: Proposed methods, proportion of rejections
of SA for each scenario, sample size and number of bins. . . . . . . . . . . . 55
4.10 Simulation Results for Logistic Regression: Generic, proportion of rejections
of SA for each scenario, sample size and generic criteria. . . . . . . . . . . . 59
4.11 Simulation Results for Logistic Regression: Proposed methods, proportion of
rejections of SA for each scenario, sample size and number of bins. . . . . . . 59
4.12 Simulation Results for Quantile Regression: Generic, proportion of rejections
of SA for each scenario, sample size, τ and generic criteria. . . . . . . . . . . 63
4.13 Simulation Results for Quantile Regression: Proposed methods, proportion of
rejections of SA for each scenario, sample size, τ and number of bins. . . . . 63
5.1 Red Wine data: CVML, CCVML and WAIC criteria values for different models. 66
5.2 Wine data: Posterior means and quantiles of β. . . . . . . . . . . . . . . . . 67
5.3 Wine data: CVML, CCVML and WAIC criteria values for variable selection
in conditional copula. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
9.1 Simulation Results (MA model): Average Difference in mean, Difference in
covariance, Total variation, square roots of Bias, Variance and MSE, Effec-
tive sample size and Effective sample size per CPU time, for every sampling
algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2 Simulation Results (Ricker’s model): Average Difference in mean, Difference
in covariance, Total variation, square roots of Bias, variance and MSE, Effec-
tive sample size and Effective sample size per CPU time, for every sampling
algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.3 Simulation Results (SV model): Average Difference in mean, Difference in
covariance, Total variation, square roots of Bias, variance and MSE, Effec-
tive sample size and Effective sample size per CPU time, for every sampling
algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.4 Simulation Results (SV α-Stable model): Average Difference in mean, Differ-
ence in covariance, Total variation, square roots of Bias, variance and MSE,
Effective sample size and Effective sample size per CPU time, for every sam-
pling algorithm. In DIM, DIC and TV, samplers are compared to SMC. . . 105
10.1 Dow Jones log return stochastic volatility: 95% credible intervals and posterior
averages for 4 parameters for two proposed samplers (AABC-U and ABSL-U). 107
List of Figures
2.1 Sc1: Clayton copula, Gelman-Rubin MCMC diagnostic for beta and two
variances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Sc1: Trace-plots, ACFs and histograms of parameters based on MCMC sam-
ples generated under the true Clayton family. . . . . . . . . . . . . . . . . . 25
2.3 Sc1: Estimation of marginal means. The leftmost 2 columns show the accu-
racy for predicting E(Y1) and the rightmost 2 columns show the results for
predicting E(Y2). The black and green lines represent the true and estimated
relationships, respectively. The red lines are the limits of the pointwise 95%
credible intervals obtained under the true Clayton family. . . . . . . . . . . 26
2.4 Sc1: Estimation of Kendall’s τ dependence surface. The true surface (left
panel) is very similar to the estimated one (right panel). . . . . . . . . . . . 26
2.5 Sc1: Estimation of Kendall’s τ one-dimensional projections when x1 = 0.2 or 0.8
(top panels) and when x2 = 0.2 or 0.8 (bottom panels). The black and green
lines represent the true and estimated relationships, respectively. The red
lines are the limits of the pointwise 95% credible intervals obtained under the
true Clayton family. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 Sc1: Histogram of predicted Kendall’s τ values obtained under the true Clay-
ton copula. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Sc1: Histogram of predicted τs with Gaussian copula model. . . . . . . . . 28
2.8 Sc3: Estimation of Kendall’s τ one-dimensional projections for each coor-
dinate fixing all other coordinates at 0.5 levels. The black and green lines
represent the true and estimated relationships, respectively. The red lines are
the limits of the pointwise 95% credible intervals obtained under the true
Clayton family. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Estimation of Kendall’s τ as a function of x1 when only first covariate is
used in estimation. The dotted black and solid green lines represent the true
and estimated relationships, respectively. The red lines are the limits of the
pointwise 95% credible intervals obtained under the true Clayton family. 39
5.1 Wine Data: Pairwise scatterplots of all the variables in the analyzed data. . 66
5.2 Wine Data: Slices of predicted Kendall’s τ as function of covariates. Red
curves represent 95% credible intervals. . . . . . . . . . . . . . . . . . . . . 68
5.3 Wine Data: Plots of ‘fixed acidity’(blue) and ‘density’(red) (linearly trans-
formed to fit on one plot) against covariates. . . . . . . . . . . . . . . . . . 68
9.1 MA2 model: AABC-U Sampler. Each row corresponds to parameters θ1 (top
row) and θ2 (bottom row) and shows in order from left to right: Trace-plot,
Histogram and Auto-correlation function. Red lines represent true parameter
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.2 MA2 model: ABSL-U Sampler. Each row corresponds to parameters θ1 (top
row) and θ2 (bottom row) and shows in order from left to right: Trace-plot,
Histogram and Auto-correlation function. Red lines represent true parameter
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.3 MA2 model: ABC-RW Sampler. Each row corresponds to parameters θ1 (top
row) and θ2 (bottom row) and shows in order from left to right: Trace-plot,
Histogram and Auto-correlation function. Red lines represent true parameter
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.4 MA model: Estimated densities for each component. First row compares
Exact, SMC, ABC-RW and AABC-U samplers. Second row compares Exact,
BSL-IS and ABSL-U. Columns correspond to parameter’s components, from
left to right: θ1 and θ2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.5 Ricker’s model: AABC-U Sampler. Each row corresponds to parameters θ1
(top row), θ2 (middle row) and θ3 (bottom row) and shows in order from
left to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . 94
9.6 Ricker’s model: ABSL-U Sampler. Each row corresponds to parameters θ1
(top row), θ2 (middle row) and θ3 (bottom row) and shows in order from
left to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . 95
9.7 Ricker’s model: ABC-RW Sampler. Each row corresponds to parameters θ1
(top row), θ2 (middle row) and θ3 (bottom row) and shows in order from
left to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . . 95
9.8 Ricker’s model: Estimated densities for each component. First row compares
Exact, SMC, ABC-RW and AABC-U samplers. Second row compares Exact,
BSL-RW and ABSL-U. Columns correspond to parameter’s components, from
left to right: θ1, θ2 and θ3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.9 SV model: AABC-U Sampler. Each row corresponds to parameters θ1 (top
row), θ2 (middle row) and θ3 (bottom row) and shows in order from left
to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . 99
9.10 SV model: ABSL-U Sampler. Each row corresponds to parameters θ1 (top
row), θ2 (middle row) and θ3 (bottom row) and shows in order from left
to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . 99
9.11 SV model: ABC-RW Sampler. Each row corresponds to parameters θ1 (top
row), θ2 (middle row) and θ3 (bottom row) and shows in order from left
to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . 100
9.12 SV model: Estimated densities for each component. First row compares Ex-
act, SMC, ABC-RW and AABC-U samplers. Second row compares Exact,
BSL-IS and ABSL-U. Columns correspond to parameter’s components, from
left to right: θ1, θ2 and θ3. . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
9.13 SV α-Stable model: AABC-U Sampler. Each row corresponds to parameters
θ1 (top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and
shows in order from left to right: Trace-plot, Histogram and Auto-correlation
function. Red lines represent true parameter values. . . . . . . . . . . . . . 103
9.14 SV α-Stable model: ABSL-U Sampler. Each row corresponds to parameters θ1
(top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and
shows in order from left to right: Trace-plot, Histogram and Auto-correlation
function. Red lines represent true parameter values. . . . . . . . . . . . . . . 104
9.15 SV α-Stable model: ABC-RW Sampler. Each row corresponds to parameters θ1
(top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and
shows in order from left to right: Trace-plot, Histogram and Auto-correlation
function. Red lines represent true parameter values. . . . . . . . . . . . . . . 104
9.16 SV α-Stable model: Estimated densities for each component. First row com-
pares SMC, ABC-RW and AABC-U samplers. Second row compares SMC,
BSL-IS and ABSL-U. Columns correspond to parameter’s components, from
left to right: θ1, θ2, θ3 and θ4. . . . . . . . . . . . . . . . . . . . . . . . . . . 105
10.1 Dow Jones daily transformed log return for a period of Jan 2010 - Dec 2018. 107
10.2 Dow Jones log returns: AABC-U Sampler. Every column corresponds to a
particular parameter component from left to right: θ1, θ2, θ3, θ4 and shows
trace-plot on top and histogram on bottom. . . . . . . . . . . . . . . . . . . 108
10.3 Dow Jones log returns: ABSL-U Sampler. Every column corresponds to a
particular parameter component from left to right: θ1, θ2, θ3, θ4 and shows
trace-plot on top and histogram on bottom. . . . . . . . . . . . . . . . . . . 108
Part I
Conditional Copula and Simplifying
Assumption Testing
Chapter 1
Introduction
1.1 Conditional Copulas
A copula is a mathematical concept often used to model the joint distribution of several
random variables. Copulas are useful for modelling the dependence structure in data
when there is interest in separating it from the marginal models, or when none of the existing
multivariate distributions is suitable. The applications of copula models permeate a number
of fields where the simultaneous study of dependent variables is of interest, e.g. [43], [70], [36]
and [52]. For continuous multivariate distributions, the elegant result of [86] guarantees the
existence and uniqueness of the copula C : [0, 1]p → [0, 1] that links the marginal cumulative
distribution functions (cdf) and the joint cdf. Specifically,
H(Y1, . . . , Yp) = C(F1(Y1), . . . , Fp(Yp)), (1.1)
where H is the joint cdf and Fi is the marginal cdf of variable Yi, 1 ≤ i ≤ p. For
statistical modelling it is also useful to note that a p-dimensional copula C and
marginal continuous cdfs Fi(yi), i = 1, . . . , p, are building blocks for a valid p-dimensional
cdf, C(F1(y1), . . . , Fp(yp)), with ith marginal cdf equal to Fi(yi), thus providing much-needed
flexibility in modelling multivariate distributions. For example, with this construction we can
create a random vector with "Gaussian" dependence structure (i.e. copula) but arbitrary marginal
distributions. To find the Gaussian copula we extract it from the joint Gaussian distribution
since, from (1.1), we immediately get

C(U1, . . . , Up) = H(F1^{-1}(U1), . . . , Fp^{-1}(Up)), (1.2)

which gives us a way to obtain the copula function of any joint distribution.
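The construction in (1.1)-(1.2) is easy to demonstrate numerically. The sketch below (illustrative only; Python with NumPy/SciPy is an assumption, not something the thesis prescribes) samples from a bivariate Gaussian copula and attaches exponential and Student-t margins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n = 0.7, 10000

# Draw from a bivariate Gaussian with correlation rho ...
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)

# ... and push through the N(0,1) cdf: (U1, U2) now follows the Gaussian copula (1.2)
u = stats.norm.cdf(z)

# Attach arbitrary margins via inverse cdfs, as in (1.1)
y1 = stats.expon.ppf(u[:, 0], scale=2.0)   # Exponential margin
y2 = stats.t.ppf(u[:, 1], df=4)            # Student-t(4) margin

# The "Gaussian" dependence survives the monotone marginal transforms:
# Kendall's tau of a Gaussian copula is (2/pi) * arcsin(rho), about 0.49 here
tau, _ = stats.kendalltau(y1, y2)
print(round(tau, 2))
```

Any continuous margins work here; only the `ppf` step changes.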
The focus of this part of the thesis is on copula models used in a regression setting in which
covariate values are expected to influence the responses Y1, . . . , Yp through the marginal mod-
els and the interdependence between them through the copula. The extension to conditional
distributions via the conditional copula was used by [53] and subsequently formalized by [70]
so that
H(Y1, . . . , Yp|X) = CX(F1|X(Y1|X), . . . , Fp|X(Yp|X)), (1.3)
where X ∈ Rq is a vector of conditioning variables, CX is the conditional copula that
may change with X and Fi|X is the conditional cdf of Yi given X for 1 ≤ i ≤ p. A
parametric model for the conditional copula assumes CX = Cθ(X) belongs to a parametric
family of copulas and only the parameter θ ∈ Θ varies as a function of X. Throughout
this chapter uppercase letters identify random variables, while their realizations are denoted
using lowercase. We also assume that there exists a known one-to-one function g : Θ → R
such that θ(X) = g^{-1}(η(X)), with the calibration function η : Rq → R being the inferential focus.
Moreover, it is sometimes convenient to parametrize a copula family in terms of the Kendall's tau
correlation τ(X), which takes values in [−1, 1] and is in one-to-one correspondence with θ(X)
for one-parameter copula families. Thus there is a known one-to-one function g′(·) such that
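For one-parameter families the maps g and g′ are available in closed form. A small sketch (the Clayton family with θ ∈ (0, ∞), τ = θ/(θ + 2), and a log link are assumed here for concreteness; the thesis catalogues its actual inverse links in Table 2.2):

```python
import numpy as np

def theta_from_eta(eta):
    # inverse link g^{-1}: calibration eta -> Clayton parameter theta (log link assumed)
    return np.exp(eta)

def tau_from_theta(theta):
    # Kendall's tau for the Clayton family: tau = theta / (theta + 2)
    return theta / (theta + 2.0)

def eta_from_tau(tau):
    # the composed map g'(tau): tau -> theta -> eta, valid for tau in (0, 1)
    return np.log(2.0 * tau / (1.0 - tau))

eta = 1.2
tau = tau_from_theta(theta_from_eta(eta))
print(round(tau, 3))                          # tau implied by eta = 1.2 (about 0.624)
assert abs(eta_from_tau(tau) - eta) < 1e-12   # the round trip recovers eta
```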
There are a number of reasons one is interested in estimating the conditional copula. First,
in regression models with multivariate responses, which is the main focus of this chapter,
one may want to determine how the dependence structure among the components of the
response varies with the covariates. Second, the copula model will ultimately impact the
performance of model-based prediction. For instance, for a bivariate response, (Y1, Y2), in
which one component is predicted given the other, the conditional density of Y1, given X = x
and Y2 = y2, takes the form
h(y1|y2, x) = f(y1|x)cθ(x)(F1|x(y1|x), F2|x(y2|x)), (1.4)
where cθ(x) is the density of the conditional copula Cθ(x) and f(y1|x) is the marginal condi-
tional density of y1 given X = x. Hence, in addition to the information contained in the
marginal model, equation (1.4) also uses for prediction the information contained in the other
responses.
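Once a copula family is fixed, the conditional density (1.4) can be evaluated directly. A minimal sketch (a Gaussian copula with fixed ρ and standard normal margins are assumed purely for illustration, so the covariate dependence is suppressed):

```python
import numpy as np
from scipy import stats

def gaussian_copula_density(u, v, rho):
    # c(u, v) = phi2(z1, z2; rho) / (phi(z1) phi(z2)), with z = Phi^{-1}(u)
    z1, z2 = stats.norm.ppf(u), stats.norm.ppf(v)
    joint = stats.multivariate_normal.pdf(np.column_stack([z1, z2]),
                                          cov=[[1.0, rho], [rho, 1.0]])
    return joint / (stats.norm.pdf(z1) * stats.norm.pdf(z2))

def h_cond(y1, y2, rho):
    # equation (1.4): h(y1 | y2) = f(y1) * c_rho(F1(y1), F2(y2)),
    # here with standard normal margins f = phi, F = Phi
    return stats.norm.pdf(y1) * gaussian_copula_density(stats.norm.cdf(y1),
                                                        stats.norm.cdf(y2), rho)

# Sanity check: as a function of y1, h(. | y2) is a proper density (integrates to 1);
# with these margins it reduces to the usual N(rho * y2, 1 - rho^2) conditional.
grid = np.linspace(-8.0, 8.0, 4001)
vals = h_cond(grid, np.full_like(grid, 0.5), rho=0.6)
mass = np.sum(vals) * (grid[1] - grid[0])
print(round(mass, 3))  # close to 1.0
```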
Third, when specifying a general multivariate distribution, the conditional copula is an
essential ingredient. For instance, if U1, U2, U3 are three Uniform(0, 1) variables, when
applying a vine decomposition using bivariate copulas [e.g., 21] their joint density is
c(u1, u2, u3) = c12(u1, u2) c23(u2, u3) c_{θ(u2)}(P(U1 ≤ u1|u2), P(U3 ≤ u3|u2) | u2),
where cij is the density of the copula between variables Ui and Uj and cθ(u2) is the density of
the conditional copula of U1, U3|U2 = u2. Finally, a conditional copula with predictor values
X ∈ Rq in which η(X) is constant, may exhibit non-constant patterns when some of the
components of X are not included in the model. This point will be revisited in section 4.1.
When estimation for the conditional copula model is contemplated, one must consider
that there are multiple sources of error and each will have an impact on the model. Even in
the simple case in which the estimation of the marginals and copula suffer from errors that
depend only on X = x, one obtains via a Taylor expansion:

c_{θ(x)+δ3(x)}(F1|x(y1|x) + δ1(x), F2|x(y2|x) + δ2(x)) = c_{θ(x)}(F1|x(y1|x), F2|x(y2|x)) (1.5)
+ c^{(1,0,0)}_{θ(x)}(F1|x(y1|x), F2|x(y2|x)) δ1(x) (1.6)
+ c^{(0,1,0)}_{θ(x)}(F1|x(y1|x), F2|x(y2|x)) δ2(x) (1.7)
+ c^{(0,0,1)}_{θ(x)}(F1|x(y1|x), F2|x(y2|x)) δ3(x) + O(‖δ(x)‖²), (1.8)

where c^{(1,0,0)}, c^{(0,1,0)} and c^{(0,0,1)} are the partial derivatives of c_z(x, y) with respect to x, y and z,
respectively, and δi(x), 1 ≤ i ≤ 3, denote various estimation error terms due to model
misspecification, e.g. δ3(x) is the error in estimation of the copula parameter at a given
covariate value x. The right-hand side of equation (1.5) is the correct joint likelihood,
while (1.6)-(1.8) show the biases incurred due to errors in estimating the first and second
marginal conditional cdfs and the copula calibration function, respectively. It becomes
apparent that in order to keep the estimation error low, one must consider flexible models
for the marginals and the copula.
In practice we observe data that consist of n independent triplets D = {(xi, y1i, y2i), i =
1, . . . , n}, where yji ∈ R, j = 1, 2, and xi ∈ Rq. Denote y1 = (y11, . . . , y1n), y2 =
(y21, . . . , y2n), and let x ∈ Rn×q be the matrix with ith row equal to xi^T. Then, using (1.3),
the density of Y1 and Y2 given x is

p(y1, y2|x, ω) = ∏_{i=1}^{n} f1(y1i|ω, xi) f2(y2i|ω, xi) c_{θ(xi)}(F1(y1i|ω, xi), F2(y2i|ω, xi)), (1.9)
where fj, Fj are the density and cdf of Yj, respectively, and ω denotes all the parameters
and latent variables in the joint and marginal models. The copula density function is
denoted by c and depends on X through the unknown function θ(X) = g^{-1}(η(X)).
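On the log scale the likelihood (1.9) is a sum of marginal and copula contributions. A sketch of its evaluation (standard normal margins, a Clayton copula and the illustrative calibration η(x) = 0.5 + sin(2πx) with a log link are assumptions for this sketch, not the thesis's model):

```python
import numpy as np
from scipy import stats

def clayton_logdensity(u, v, theta):
    # log Clayton copula density: log(1+theta) - (1+theta)(log u + log v)
    #                             - (2 + 1/theta) log(u^-theta + v^-theta - 1)
    return (np.log1p(theta) - (1.0 + theta) * (np.log(u) + np.log(v))
            - (2.0 + 1.0 / theta) * np.log(u ** -theta + v ** -theta - 1.0))

def log_likelihood(y1, y2, x, eta):
    # log of (1.9): marginal log-densities plus the copula term, with
    # theta(x_i) = exp(eta(x_i)) varying across observations
    theta = np.exp(eta(x))
    u, v = stats.norm.cdf(y1), stats.norm.cdf(y2)
    return np.sum(stats.norm.logpdf(y1) + stats.norm.logpdf(y2)
                  + clayton_logdensity(u, v, theta))

rng = np.random.default_rng(1)
x = rng.uniform(size=200)
y1, y2 = rng.standard_normal(200), rng.standard_normal(200)
ll = log_likelihood(y1, y2, x, eta=lambda t: 0.5 + np.sin(2.0 * np.pi * t))
print(np.isfinite(ll))  # True
```

A Bayesian fit adds log-priors for ω and explores the resulting posterior, e.g. by MCMC.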
Depending on the strength of assumptions we are willing to make about η(X), a number of
possible approaches are available. The most direct is to assume a known parametric form for
the calibration function, e.g. constant or linear, and estimate the corresponding parameters
by maximum likelihood estimation [37]. This approach relies on knowledge about the shape
of the calibration function which, in practice, can be unrealistic. A more flexible approach
uses non-parametric methods [3, 90] and estimates the calibration function using smoothing
methods. Recently, we have seen a number of developments using nonparametric Bayesian
techniques for estimating a multivariate copula using an infinite mixture of Gaussian copulas
[95], or via flexible Dirichlet process priors [96, 67]. The infinite mixture approach in [95]
was extended to estimate any conditional copula with a univariate covariate by [22], while
an alternative Bayesian approach based on a flexible cubic spline model for the calibration
functions was built by [20]. For multivariate covariates, [80], [15] and [49] avoid the curse
of dimensionality that appears even for moderate values of q, say q ≥ 5, by specifying
an additive model structure for the calibration function. Few alternatives to the additive
structure exist. One exception is [42] who used a sparse Gaussian Process (GP) prior for
estimating the calibration function and subsequently used the same construction for vine
copula estimation in [58]. However, when the dimension of the predictor space is even
moderately large the curse of dimensionality prevails and it is expected that the q-dimensional
GP used for calibration estimation will not capture important patterns for sample sizes that
are not very large. Moreover, the full efficiency of the method proposed in [42] is difficult
to assess since their model is built with uniform marginals, which in a general setup is
equivalent to assuming exact knowledge about the marginal distributions. In fact, when
the marginal distributions are estimated it is of paramount importance to account for the
resulting variance inflation due to error propagation in the copula estimation as reflected
by equations (1.5)-(1.8). The Bayesian model in which joint and marginal components are
simultaneously considered will appropriately handle error propagation as long as it is possible
to study the full posterior distribution of all the parameters in the model, be they involved
in the marginals or copula specification.
1.2 Brief review of Markov Chain Monte Carlo (MCMC)
Since we implement MCMC algorithms for both parts of this thesis, a brief introduction is
given here.
We start by assuming that we need to sample from some distribution with density π(x),
where x ∈ X ⊆ Rq. We do not have access to the closed form of this density; only the
unnormalized density π̃(x) can be evaluated, where

π(x) = π̃(x)/C,

and C is some unknown normalization constant. This situation is typical in Bayesian statis-
tics, where the posterior distribution of parameters θ given observed data y is

π(θ|y) = p(θ)f(y|θ) / ∫ p(θ)f(y|θ)dθ,

where p(θ) is the prior and f(y|θ) is the model density. In this setting π̃(θ) = p(θ)f(y|θ)
can be easily computed, while the normalization constant is generally not known. In these
problems, where the posterior cannot be found in closed form, the objective is to draw
samples θ(t), t = 1, . . . , M, from the posterior and then use them to approximate quantities
of interest. By the Strong Law of Large Numbers:
(1/M) ∑_{t=1}^{M} g(θ(t)) → E[g(θ)] almost surely,   (1.10)
for any measurable function g(·). When the dimension q is moderate or large it becomes
increasingly difficult to draw independent samples from π; therefore, MCMC algorithms aim
to simulate dependent samples by constructing a Markov chain with stationary distribution
π, see [19, 14].
First we define a Markov Chain:
Definition 1.2.1. Let (X, F) be a state space with Borel σ-field F. A stochastic process
X(0), X(1), . . . , X(M), . . . is a Markov chain if:
P(X(t) ∈ A | X(0), . . . , X(t−1)) = P(X(t) ∈ A | X(t−1)) for all A ∈ F.
We can think about this process as ordered in time and future values depend only on
the present and not on the past. In MCMC theory we usually assume that these conditional
probabilities are homogeneous:
Definition 1.2.2. A Markov chain is homogeneous if P(X(t) ∈ A|X(t−1)) is the same for all
t = 1, 2, . . ..
That is, the conditional probability does not change with time t. Therefore the joint distri-
bution of the random vector (X(0), . . . , X(M)) is
P (x(0), . . . , x(M)) = P (x(0))× P (x(1)|x(0))× · · · × P (x(M)|x(M−1)),
so it is fully specified by the initial distribution P(x(0)) and the transition probability (or
kernel), which we denote by P(x, dy), so that P(X(1) ∈ A|X(0) = x) = ∫_A P(x, dy). It
is also convenient to define PM(x, ·) = P(X(M) ∈ ·|X(0) = x), the conditional distribution
of the Markov chain after M steps.
A very important concept in MCMC is the stationary (or invariant) distribution of a Markov
chain, which is defined as:
Definition 1.2.3. A distribution π(x) on X is called invariant for a Markov chain on (X, F)
if ∫ π(dx)P(x, A) = π(A) for all A ∈ F.
This simply means that if, for example, X(0) has distribution π(x), then X(1) and all
subsequent random variables have the same distribution. Note that if we set X(0) ∼ π(x),
where π(x) is invariant, then the homogeneous Markov chain becomes strongly stationary.
In MCMC we actually face the reverse problem: we are given π(x), from which we want to
sample, and the first step is to construct a Markov chain (with an appropriate transition
kernel) for which π(x) is invariant. Once we have it, if we start the chain by sampling from
π(x), then by stationarity all X(1), X(2), . . . follow the target distribution and the goal
is achieved. Of course, if we could simulate X(0) from the target distribution, we would not
need the Markov chain at all; the main question is therefore whether, if we sample from a
different distribution at step 0, the distribution of X(M) converges to π(x) as M increases.
Under mild assumptions this convergence result is true. First we define the total variation
distance between
two measures ν1 and ν2 as
‖ν1(·) − ν2(·)‖TV = sup_{A} |ν1(A) − ν2(A)|.
The main convergence result is:
Theorem 1.2.1. If a Markov chain defined on (X, F) with transition kernel P(x, ·) and
invariant measure π(·) is φ-irreducible and aperiodic, then for π-almost every x ∈ X:

lim_{M→∞} ‖PM(x, ·) − π(·)‖TV = 0.
The importance of this theorem is that we can start the Markov chain at almost any value
x(0), run the chain (for which π(x) is stationary), and after a large number of iterations
expect samples from the target distribution. The result depends on two assumptions that are
usually satisfied in practice; see [64] for details:
Definition 1.2.4. A Markov chain is φ-irreducible if there exists a non-zero σ-finite measure
φ on X such that for all A ∈ F with φ(A) > 0 and for all x ∈ X, there exists a positive
integer M such that PM(x, A) > 0.
Definition 1.2.5. A Markov chain with invariant distribution π(x) is aperiodic if there do
not exist disjoint subsets A1, . . . , Ad ∈ F, d ≥ 2, with P(x, Ai+1) = 1 for all x ∈ Ai
(1 ≤ i ≤ d − 1) and P(x, A1) = 1 for all x ∈ Ad.
1.2.1 Metropolis Hastings algorithm
Given the unnormalized density π̃(x) of the target distribution π(x), to apply the MCMC
theory we need to find a transition kernel P(x, ·) such that the target is invariant. Metropolis–
Hastings [63, 41] is probably the most frequently used algorithm to construct such transition
kernels. The main idea is, at iteration t, to sample a proposal x∗ from some distribution
q(·|X(t−1) = x), which may depend on the previous state, calculate an appropriate acceptance
probability α(x, x∗), and then accept x∗ with this probability, see Algorithm 1. Notice that
Algorithm 1 Metropolis–Hastings
1: Given initial x(0) and required number of samples M.
2: for t = 1, . . . , M do
3: Set x = x(t−1).
4: Simulate x∗ ∼ q(·|x), where q(·|x) is some density.
5: Calculate α(x, x∗) = min(1, [π(x∗)q(x|x∗)]/[π(x)q(x∗|x)]) = min(1, [π̃(x∗)q(x|x∗)]/[π̃(x)q(x∗|x)]).
6: Simulate u ∼ U(0, 1).
7: if u ≤ α(x, x∗) then
8: Accept: X(t) = x∗.
9: else
10: Reject: X(t) = x.
11: end if
12: end for
α(x, x∗) depends on the ratio of π̃(x∗) to π̃(x), and therefore this algorithm can be
implemented when the normalization constant C is unknown.
It is easy to see that the transition kernel for this algorithm is:
P (x, dx∗) = α(x, x∗)q(x∗|x)dx∗ + r(x)δx(dx∗),
where r(x) = 1 − ∫ α(x, x∗)q(x∗|x)dx∗ and δx(·) is a point mass at x. It can be shown
that this transition kernel preserves the target distribution π(x).
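For illustration, Algorithm 1 with a symmetric random-walk proposal, for which q(x∗|x) = q(x|x∗) cancels in α, can be sketched in Python; the function and variable names below are our own and purely illustrative:

```python
import numpy as np

def metropolis_hastings(log_pi_tilde, x0, M, step=1.0, seed=None):
    """Random-walk Metropolis sampler for an unnormalized log-density log_pi_tilde."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    log_p = log_pi_tilde(x)
    chain = np.empty((M, x.size))
    for t in range(M):
        x_star = x + step * rng.standard_normal(x.size)   # proposal from q(.|x)
        log_p_star = log_pi_tilde(x_star)
        # alpha = min(1, pi_tilde(x*)/pi_tilde(x)) since the proposal is symmetric
        if np.log(rng.uniform()) <= log_p_star - log_p:
            x, log_p = x_star, log_p_star                 # accept the proposal
        chain[t] = x                                      # on rejection keep current state
    return chain

# target: a standard normal density known only up to the constant C
chain = metropolis_hastings(lambda x: -0.5 * np.sum(x ** 2), x0=3.0, M=20000, seed=1)
```

Working on the log scale avoids underflow of π̃; comparing log u with log π̃(x∗) − log π̃(x) is equivalent to accepting with probability α(x, x∗).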
1.3 Bayesian Inference and Gaussian Processes
Assume we observe n independent realizations, y1, . . . , yn, of a random variable Y ∈ R
and that each observation yi corresponds to a covariate measurement xi ∈ Rq. Henceforth,
we assume that x1, . . . , xn are fixed by design. The distribution of Yi has a known form
and depends on xi through some unknown function f and parameter σ so that the joint
distribution of the data is
P(y|x1, . . . , xn, σ) = P(y|f(x1), . . . , f(xn), σ) = ∏_{i=1}^{n} P(yi|f(xi), σ).   (1.11)
Usually, the main inferential goal is to estimate the unknown smooth function f : Rq → R,
while σ is a nuisance parameter. If we let x = (x1, . . . , xn)T denote the n covariate values,
then a Gaussian Process (GP) prior on the function f implies
f = (f(x1), f(x2), . . . , f(xn))T ∼ N (0, K(x,x; w)), (1.12)
where N (µ,Σ) denotes a normal distribution with mean µ and variance matrix Σ and K is a
variance matrix which depends on x and additional parameters w. Here we use the squared
exponential kernel to model the matrix K(x,x; w), i.e. its (i, j) element is
k(xi, xj; w) = e^{w0} exp[ − ∑_{s=1}^{q} (xis − xjs)² / e^{ws} ],   (1.13)
where xis is the sth coordinate value for ith covariate measurement xi. The unknown param-
eters w = (w0, . . . , wq) that determine the strength of dependence in (1.13) are inferred from
the data. Of interest is predicting the values of the nonlinear predictor at new observations
x∗ = (x∗1, . . . , x∗m)T , which we denote f∗ = (f(x∗1), . . . , f(x∗m))T . In the case in which the
covariate dimension, q, is moderately large, an accurate estimation of f∗ will require a large
sample size, n. Unfortunately, this desideratum is hindered by the computational cost of
fitting a GP model when n is large. For example, if Yi∼N (f(xi), σ2) then equations (1.12)
and (1.11) yield a joint Gaussian distribution of Y = (Y1, . . . , Yn) and f∗. If y = (y1, . . . , yn)
denotes the observed response, then the conditional distribution of f∗|Y = y is N(µ∗,Σ∗)
where
µ∗ = K(x∗,x; w)[K(x,x; w) + σ2In]−1y, (1.14)
Σ∗ = K(x∗,x∗; w)−K(x∗,x; w)[K(x,x; w) + σ2In]−1K(x,x∗; w), (1.15)
and K(x∗,x∗; w), K(x∗,x; w) and K(x,x∗; w) have their elements defined using (1.13).
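To make equations (1.13)-(1.15) concrete, the following Python sketch (with our own, illustrative naming) computes the predictive mean and covariance under the Gaussian sampling model:

```python
import numpy as np

def sq_exp_kernel(X1, X2, w):
    """Squared-exponential kernel (1.13): e^{w0} * exp(-sum_s (x_is - x_js)^2 / e^{w_s})."""
    d2 = (((X1[:, None, :] - X2[None, :, :]) ** 2) / np.exp(w[1:])).sum(-1)
    return np.exp(w[0]) * np.exp(-d2)

def gp_predict(X, y, X_star, w, sigma2):
    """Predictive mean (1.14) and covariance (1.15) of f* given the responses y."""
    n = len(X)
    Kinv = np.linalg.inv(sq_exp_kernel(X, X, w) + sigma2 * np.eye(n))
    Ks = sq_exp_kernel(X_star, X, w)
    mu = Ks @ Kinv @ y
    Sigma = sq_exp_kernel(X_star, X_star, w) - Ks @ Kinv @ Ks.T
    return mu, Sigma
```

A quick sanity check: with a very small noise variance, the predictive mean at the training inputs essentially interpolates y and the predictive variances are near zero.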
With the Gaussian sampling model it is clear from (1.14) and (1.15) that the MCMC sam-
pling of the posterior requires at each iteration the calculation and inversion of the matrix
K(x,x; w) + σ2In ∈ Rn×n which becomes prohibitive when n is large. To make GP models
applicable for larger data we refer to the literature on sparse GP [74, 87, 66] in which it is
assumed that learning about f can be achieved using a smaller sample of m latent variables,
called inducing variables, which may be a subsample of the original data or can be built
using other considerations as further discussed. The intuitive idea is to use the inducing
variables to channel the information contained in the covariate values x = {x1, . . . , xn}. We
denote the inducing inputs as x̃ = (x̃1, . . . , x̃m)T ∈ Rm×q and by K(x, x̃; w) ∈ Rn×m the matrix

K(x, x̃; w) =
[ k(x1, x̃1; w)  · · ·  k(x1, x̃m; w) ]
[      ...        . . .       ...     ]
[ k(xn, x̃1; w)  · · ·  k(xn, x̃m; w) ] ,   (1.16)
where k(xi, xj; w) is defined as in (1.13). The ratio m/n influences the trade-off between
computational efficiency and statistical efficiency, as a smaller m will favour the former
and a larger m will ensure no significant loss of the latter. If the function values for the
inducing points are defined as f̃ = (f(x̃1), . . . , f(x̃m))T, then the joint density of the response
vector Y, the latent variable f̃ and the parameter w can be expressed only in terms of the
m-dimensional vector f̃, since

P(y, f̃, w|x, x̃) = P(y|A(x, x̃; w)f̃) N(f̃; 0, K(x̃, x̃; w)) p(w),   (1.17)
where N (x;µ,Σ) is the normal density with mean µ and covariance Σ, p(w) is the prior
probability for the parameters w, and

A(x, x̃; w) = K(x, x̃; w) K(x̃, x̃; w)−1.   (1.18)

The form of P(y|A(x, x̃; w)f̃) is derived under the assumption that f = A(x, x̃; w)f̃ and
depends on the form of the sampling model P(y|f, σ); e.g., when the latter is N(f, σ2In) we
obtain P(y|A(x, x̃; w)f̃) = N(A(x, x̃; w)f̃, σ2In).
The posterior distribution π(f̃, w|y, x) is not tractable, but sampling from it is much
less expensive since K(x, x̃; w) ∈ Rn×m and K(x̃, x̃; w) ∈ Rm×m. While the inducing inputs
x̃ can be selected from the samples collected, we will use an alternative approach in which
we group the observed covariate values x into m clusters and choose the cluster-specific
covariate averages as x̃1, . . . , x̃m. For instance, given a specific value k, one can use the
k-means algorithm [12] to classify x into k clusters and estimate the cluster means with an
iterative method. Intuitively, it makes sense to have more inducing points in regions that
exhibit more variation in covariate values.
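As an illustration of this clustering step, a minimal k-means routine for selecting inducing inputs might look as follows; the function name and the deterministic initialization are our own choices, not prescriptions from the text:

```python
import numpy as np

def kmeans_inducing(X, m, iters=50):
    """Choose m inducing inputs as k-means cluster centers of the covariates X (n x q)."""
    # simple deterministic initialization: m points spread across the (index-ordered) data
    centers = X[np.linspace(0, len(X) - 1, m).astype(int)].copy()
    for _ in range(iters):
        # assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # move each non-empty center to the mean of its cluster
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers
```

In practice a library implementation with a more careful initialization would be preferable; this sketch only illustrates the assign-then-average iteration.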
Given a new test point x∗, we are interested in the corresponding posterior predictive distri-
bution of f∗ = f(x∗):

P(f∗|x∗, x, y) = ∫ P(f∗|f̃, w, x∗, x̃) P(f̃, w|x, y) df̃ dw.   (1.19)

In general, the integral involved in (1.19) cannot be calculated in closed form, but we can use
posterior draws (f̃, w)(t), t = 1, . . . , M, given x, y, to approximate the distribution of
f∗|x∗, x, y by the samples

(f∗)(t) = A(x∗, x̃; w(t))f̃(t), t = 1, · · · , M.

Statistical inference can be built on these samples.
Finally, in order to reduce the dimensionality of the parameter space, we assume that

f(xi) = f(xTi β),   (1.20)

and we set f̃ = (f(z1), . . . , f(zm))T, where z1, . . . , zm ∈ R are inducing inputs, f : R → R
is the unknown function of interest and β ∈ Rq is normalized, i.e. ‖β‖ = 1. Note that
without the normalization the parameter β is not identifiable. Here {z1, . . . , zm} play
the same role as {x̃1, . . . , x̃m} in the general sparse GP: they allow much faster sampling
of the posterior latent variables and should be spread over the range of {xT1β, . . . , xTnβ}. In the next
chapter we show how to choose the positions of these inducing inputs. The single index
model (SIM) defined by (1.20), coupled with the sparse GP approach (henceforth denoted
GP-SIM), has the advantage of casting the original problem of estimating a general function
in q dimensions from n observations into the estimation of the q-dimensional parameter
vector β and of the one-dimensional map f based on m ≪ n inducing points. The GP-SIM
approach has been successfully applied to mean regression problems [17, 39] and quantile
regression [44]. It can be used for large covariate dimensions and is much more flexible than
a simple linear model.
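The mechanics of GP-SIM — project the covariates onto β, place inducing points on the one-dimensional index scale, and evaluate the calibration through the matrix A of (1.18) — can be sketched as follows (illustrative naming; for covariates scaled to [0, 1] and ‖β‖ = 1 the index xTβ lies in [−√q, √q], the grid used below):

```python
import numpy as np

def gp_sim_values(X, beta, f_tilde, w, z):
    """Evaluate f(x_i^T beta) at all rows of X through inducing inputs z on the
    index scale, using f = A(x^T beta, z; w) f_tilde."""
    s = X @ beta                                       # single-index projection
    k = lambda a, b: np.exp(w[0]) * np.exp(-(a[:, None] - b[None, :]) ** 2 / np.exp(w[1]))
    A = k(s, z) @ np.linalg.inv(k(z, z) + 1e-6 * np.eye(len(z)))  # jittered inverse
    return A @ f_tilde

# m equally spaced inducing inputs on [-sqrt(q), sqrt(q)]
q, m = 1, 20
z = np.linspace(-np.sqrt(q), np.sqrt(q), m)
```

When a projected value xTi β coincides with an inducing input, A simply reads off the corresponding entry of f̃ (up to the jitter), which gives an easy correctness check.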
1.4 Model Selection
The conditional copula model involves two types of selection. First, one needs to choose the
copula family from a set of possible candidates. Second, it is often of interest to determine
whether a parametric simple form for the calibration is supported by the data. For instance,
a constant calibration function indicates that the dependence structure does not vary with
the covariates, a conclusion that may be of scientific interest in some applications. Let ω(t)
denote the vector of parameters and latent variables drawn at step t from the posterior
corresponding to model M. We consider two measures of fit that can be estimated from
the MCMC samples ω(t), t = 1, . . . , M. As mentioned before, the observed data set is
denoted by D = {(y1i, y2i, xi)}ni=1.
Cross-Validated Pseudo Marginal Likelihood
The cross-validated pseudo marginal likelihood (CVML) [33, 40] measures the average (over
parameter values) predictive power of model M via
CVML(M) = ∑_{i=1}^{n} log P(y1i, y2i|D−i, M),   (1.21)
where D−i is the data set from which the ith observation has been removed. An estimate of
(1.21) can be obtained using posterior draws for all the parameters and latent variables in
the model. Specifically, if the latter are denoted by ω, then
E[ P(y1i, y2i|ω, M)−1 ] = P(y1i, y2i|D−i, M)−1,   (1.22)
where the expectation is with respect to the conditional distribution of ω given the full data
D and the model M. Based on the posterior samples we can estimate the CVML as
CVMLest(M) = − ∑_{i=1}^{n} log( (1/M) ∑_{t=1}^{M} P(y1i, y2i|ω(t), M)−1 ).   (1.23)
The model with the largest CVML is selected.
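Computed naively, the inverse likelihoods in (1.23) can underflow, so in practice the estimate is assembled on the log scale. A minimal sketch (with our own helper names), taking an M × n matrix of pointwise log-likelihoods loglik[t, i] = log P(y1i, y2i|ω(t), M):

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log of the mean of exp(a) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return (m + np.log(np.mean(np.exp(a - m), axis=axis, keepdims=True))).squeeze(axis)

def cvml_estimate(loglik):
    """CVML estimate (1.23): -sum_i log( (1/M) sum_t P(y_i | omega^(t))^{-1} )."""
    return -np.sum(log_mean_exp(-loglik, axis=0))
```

When the pointwise log-likelihood is constant in t, the estimate reduces to the plug-in log-likelihood, which serves as a quick check.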
Watanabe-Akaike Information Criterion
The Watanabe-Akaike Information Criterion [WAIC, 91] is an information-based criterion
that is closely related to the CVML, as discussed in [92], [35] and [89].
The WAIC is defined as

WAIC(M) = −2 fit(M) + 2 p(M),   (1.24)

where the model fitness is

fit(M) = ∑_{i=1}^{n} log E[ P(y1i, y2i|ω, M) ]   (1.25)

and the penalty is

p(M) = ∑_{i=1}^{n} Var[ log P(y1i, y2i|ω, M) ].   (1.26)
The expectation in (1.25) and the variance in (1.26) are with respect to the conditional
distribution of ω given the data and can be computed using Monte Carlo samples from π.
For instance, the Monte Carlo estimate of the fit is
fit(M) = ∑_{i=1}^{n} log( (1/M) ∑_{t=1}^{M} P(y1i, y2i|ω(t), M) ),   (1.27)
and p(M) can be estimated similarly using the posterior samples. The model with the
smallest WAIC is preferred.
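Both (1.27) and the penalty (1.26) are simple functions of the same M × n matrix of pointwise log-likelihoods used for the CVML; a hedged sketch (our own naming):

```python
import numpy as np

def waic(loglik):
    """WAIC (1.24) from an M x n matrix loglik[t, i] = log P(y_i | omega^(t))."""
    m = loglik.max(axis=0)
    # fit (1.25)/(1.27): sum_i log of the posterior mean likelihood, computed stably
    fit = np.sum(m + np.log(np.mean(np.exp(loglik - m), axis=0)))
    # penalty (1.26): sum over observations of the posterior variance of the log-likelihood
    p = np.sum(loglik.var(axis=0))
    return -2 * fit + 2 * p
```

With a degenerate posterior (constant log-likelihood across draws) the penalty vanishes and the WAIC equals −2 times the log-likelihood, as expected.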
1.5 Simplifying Assumption
A substantial reduction of the parameter space is achieved under the so-called simplifying
assumption (SA), which posits Cθ(X) = C (as in (1.3)), i.e. that the conditional copula is
constant [38, 21]. The SA condition can significantly simplify vine copula estimation [for
example, see 1], but it is known to lead to bias when it is wrongly assumed [2]. Therefore,
for conditional copula models it is of practical interest to assess whether or not the data
support the SA. A first step towards a formal test for SA can be found in [4]. The reader is referred
to [23] for an excellent review of work on SA, and ideas for future developments.
If the calibration function and marginal distributions are modeled parametrically, e.g.

η(X) = α0 + ∑_{j=1}^{K} αj Ψj(X),

where the Ψj(X) are some basis functions and the unknown parameters are estimated by
maximum likelihood estimation (MLE) [37], then we can utilize standard asymptotic theory
to test α1 = α2 = . . . = αK = 0 via a canonical likelihood ratio test. However this approach
relies on knowledge of the shape of the calibration function, which is unrealistic in practice. A number
of research contributions address this issue for frequentist analyses, e.g. [4], [38], [23], [48].
Moreover, even if the calibration form is guessed correctly, misspecified marginals can lead
to wrong conclusions about the calibration behavior, as noted in Chapter 4 and [57].
Our contribution belongs within the Bayesian paradigm, following the general philosophy
expounded also in [49]. In this setting, it was observed in [20] that generic model selection
criteria tend to choose a more complex model even when SA holds.
1.6 Plan
In Chapter 2 we consider Bayesian joint analysis of the marginal and copula models using
flexible GP models. Our emphasis is placed on the estimation of the calibration function
η(X) which is assumed to have a GP prior that is evaluated at βTX for some normalized
β, thus coupling the GP-prior construct with the single index model (SIM) of [17] and
[39]. The GP-SIM is more flexible than a canonical linear model and computationally more
manageable than a full GP with q variables. The proposed model can be used for large
covariate dimension q and for large samples. Both marginal means will be fitted using
sparse GP approaches so that large data sets can be computationally manageable. The
dimension reduction of the SIM approach has been noted also by [30] who used two-stage
semiparametric methods to estimate the calibration function. In contrast to [30], we use a
Bayesian approach and estimate marginals and copula parameters jointly. So far, GP-SIM’s
have been used mostly in regression settings where the algorithm of [39] can be used to
efficiently sample the posterior distribution. However, the GP-SIM model for conditional
copulas involves a non-Gaussian likelihood which requires important modifications of their
algorithm.
A second contribution of this work (Chapters 3 and 4) deals with model selection issues that
are particularly relevant for the conditional copula construction. Of particular importance
are the choice of copula family and determining whether the simplifying assumption (SA) is
supported by the data. For the former task we develop a conditional cross-validated marginal
likelihood (CCVML) criterion and also examine its relation with the Watanabe Information
Criterion [91], while for determining the data support for SA we construct a permutation-
based variant of the CVML that shows good performance in our numerical experiments.
We then identify an important link between SA and missing covariates in the conditional
copula model. To our knowledge, this connection has not been reported elsewhere. Finally,
we propose two other procedures for testing the SA that rely on splitting the data into
two sets, fitting a flexible model on the first and examining the predictions this model makes
on the second. We then divide the data in the second (test) set into “bins” by the order of
the predicted values. To check whether the distribution in each bin is the same, we use a
permutation or chi-square test. We show with theoretical arguments that this procedure
attains the required probability of Type I error, and we support the arguments with
simulation results. We then extend these ideas to other models, and show that generic tests
may not be reliable when the complexity of the model is data-driven. A merit of the proposed
methods is their quite general applicability, but this comes, unsurprisingly, at the expense of power.
In order to investigate whether the trade-off is reasonable we design a simulation study and
present its conclusions.
We close this part by applying the proposed methods to a real-world problem, analyzing the
Wine data set in Chapter 5.
Chapter 2
Bayesian Conditional Copula using
Gaussian Processes
2.1 GP-SIM for Conditional copula
We consider a bivariate response variable (Y1, Y2) ∈ R2 together with covariate measurement
X ∈ Rq. Hence, the data D = {(y1i, y2i, xi), i = 1 . . . n} consist of triplets (y1i, y2i, xi)
where y1i, y2i ∈ R and xi ∈ Rq. For notational convenience, let y1 = (y11, . . . , y1n)T ,
y2 = (y21, . . . , y2n)T and x = (x1, . . . , xn)T . We assume that the marginal distribution
of Yj (j = 1, 2) is Gaussian with mean fj(X) and constant variance σ2j . If we let Yj =
(Yj1, . . . , Yjn)T , j = 1, 2, and fj = (fj(x1), . . . , fj(xn))T we can compactly write:
Yj ∼ N (fj, σ2j In) j = 1, 2. (2.1)
Generally, it is difficult to discern whether the copula structure varies with covariates or not,
so we consider a conditional copula to account for the more general situation. Therefore,
the likelihood function is
L(ω) = ∏_{i=1}^{n} (1/σ1) φ((y1i − f1i)/σ1) (1/σ2) φ((y2i − f2i)/σ2)
× c_{θ(xi)}( Φ((y1i − f1i)/σ1), Φ((y2i − f2i)/σ2) ),   (2.2)
where c denotes a parametric copula density function, ω denotes all the parameters in the
model, while Φ and φ are the cumulative distribution function and density function of a
standard normal distribution, respectively. The parameter of the copula depends on the unknown
function θ(xi) = g−1(f(xi)), where f is assumed to take the form given in (1.20) and g is
a known invertible link function that allows an unrestricted parameter space for f . Note
that the form of the GP-SIM model used for estimating the copula parameter is invariant
to non-linear transformations. This implies that the formulation of the model is the same
whether we directly estimate the copula parameter, θ(X), Kendall’s τ(X), or other mea-
sures of dependence. However, this is not true if we use an additive model for θ(X), since
additivity is not preserved by non-linear transformations.
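To make (2.2) concrete, a sketch of its log-likelihood for a generic copula family is given below; the helper names are our own, and the copula log-density is passed in as a function so that any family can be plugged in:

```python
import math

import numpy as np

# standard normal CDF, vectorized via the error function
_Phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def copula_loglik(y1, y2, f1, f2, s1, s2, log_copula_density, theta):
    """Log of the likelihood (2.2): two Gaussian marginal terms plus the copula term.

    log_copula_density(u1, u2, theta) must return log c_theta(u1, u2) elementwise."""
    z1, z2 = (y1 - f1) / s1, (y2 - f2) / s2
    log_phi = lambda z: -0.5 * z ** 2 - 0.5 * math.log(2.0 * math.pi)
    marginals = np.sum(log_phi(z1) - math.log(s1) + log_phi(z2) - math.log(s2))
    copula = np.sum(log_copula_density(_Phi(z1), _Phi(z2), theta))
    return marginals + copula
```

With the independence copula (c ≡ 1) the copula term vanishes and the log-likelihood reduces to the sum of the two marginal Gaussian log-likelihoods, a useful sanity check.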
The GP-SIM is fully specified once we assign the GP priors to f1, f2, f and the parametric
priors for the remaining parameters, as follows:
f1 ∼ GP(w1), f2 ∼ GP(w2), f ∼ GP(w),
w1 ∼ N(0, 5Iq+1), w2 ∼ N(0, 5Iq+1), w ∼ N(0, 5I2),
β ∼ U(Sq−1), σ1² ∼ IG(0.1, 0.1), σ2² ∼ IG(0.1, 0.1).   (2.3)
The GP(w) is a Gaussian Process prior with mean zero, squared exponential kernel with
parameters w, U(Sq−1) is a uniform distribution on the surface of the q-dimensional unit
sphere and IG(α, β) denotes the inverse gamma distribution. The above prior for w captures
very wiggly functions for small values of w and almost constant functions for large values of
w. The prior for the marginal variances is vague and would be conjugate in the absence of
the copula term. In our experience, the results are not sensitive to the choice of hyperparameter values.
Because the focus of our work is on inference for the copula, we allow f1 and f2 to be
evaluated on Rq while f is on R. In order to avoid computational problems that affect
the GP-based inference when the sample size is large, the inference will rely on the Sparse
GP method that was described in the previous section. Suppose x̃1 are the m1 inducing inputs
for function f1, x̃2 are the m2 inducing inputs for function f2 and z are the m inducing inputs for
function f . The number of inducing inputs m1, m2 and m can all be different, but in our
applications we will choose their values equal and significantly smaller than the sample size,
n. The choice is motivated by imperative computational time restrictions, given the large
number of numerical simulations we perform to investigate empirically the performance of
the approach in terms of estimation and model selection. In practice, the analyst should
ideally use the largest number of inducing points the computing environment can support.
As suggested earlier, we define x̃1 and x̃2 as the centers of m1 and m2 clusters of x. If m1 = m2
then the inducing inputs are the same. We cannot use the same strategy for z, since then
we would need the centers of the clusters of the variable xTβ, which are unknown. If we
assume that each covariate xis lies between 0 and 1 (this can be achieved easily by subtracting
the minimum value and dividing by the range), then by the Cauchy–Schwarz inequality we
obtain

‖xTi β‖ ≤ √(‖xi‖²‖β‖²) ≤ √q   ∀ xi, β.

Hence we can choose z to be m equally spaced points in the interval [−√q, √q].
Let f̃1 be f1 evaluated at x̃1, f̃2 be f2 evaluated at x̃2 and f̃ be f evaluated at z. Then
the joint density of the observed data and parameters is proportional to:
P(y1, y2, f̃1, f̃2, f̃, w1, w2, w, σ1², σ2², β | x, x̃1, x̃2, z) ∝ pN(y1; f1, σ1²In) pN(y2; f2, σ2²In)
× ∏_{i=1}^{n} c_{g−1(fi)}( Φ((y1i − f1i)/σ1), Φ((y2i − f2i)/σ2) ) × pN(f̃1; 0, K(x̃1, x̃1; w1))
× pN(f̃2; 0, K(x̃2, x̃2; w2)) pN(f̃; 0, K(z, z; w)) pN(w1; 0, 5Iq+1)
× pN(w2; 0, 5Iq+1) pN(w; 0, 5I2) pIG(σ1²; 0.1, 0.1) pIG(σ2²; 0.1, 0.1),   (2.4)
where f1 = A(x, x̃1; w1)f̃1, f2 = A(x, x̃2; w2)f̃2, f = A(xTβ, z; w)f̃, and pN and pIG are
the multivariate normal and inverse gamma densities, respectively. Although here we adopt
a full GP prior for the marginal models, the approach can easily be adapted to consider
GP-SIM models for the marginals too.
The contribution of the conditional copula model to the joint likelihood breaks the
tractability of the posterior conditional densities and complicates the design of an MCMC
algorithm that can sample efficiently from the posterior distribution. The conditional
joint posterior distribution of the latent variables (f) and parameters (w) given the observed
data D does not have a tractable form and its study will require the use of Markov Chain
Monte Carlo (MCMC) sampling methods. Specifically, we use Random Walk Metropolis
(RWM) within Gibbs sampling for w [19, 78, 7] while for f we will use the elliptical slice
sampling [65] that has been designed specifically for GP-based models and does not require
tuning of free parameters.
2.2 Computational Algorithm
Inference is based on the posterior distribution π(ω|D, x̃1, x̃2, z), where
ω = (f̃1, f̃2, f̃, w1, w2, w, σ1², σ2², β) ∈ Rk represents the vector of parameters and latent vari-
ables in the model, with k = 3m + 3q + 7. Since the posterior is not mathematically tractable,
its properties will be explored via Markov chain Monte Carlo (MCMC) sampling. In this
section we provide the detailed steps of the MCMC sampler designed to sample from π.
The general form of the algorithm falls within the class of Metropolis-within-Gibbs (MwG)
samplers in which we update in turn each component of the chain by sampling from its
conditional distribution, given all the other components. The presence of the copula in the
likelihood breaks the usual conditional conjugacy of the GP models so none of the compo-
nents have conditional distributions that can be sampled directly.
Suppose we are interested in sampling a target π(ω). A generic MwG sampler proceeds
as follows:

Step I Initialize the chain at ω1(1), ω2(1), . . . , ωk(1).

Step R At iteration t + 1 run iteratively the following steps for each j = 1, . . . , k:

1. Sample ω∗j ∼ qj(·|ωj(t), ω−j(t+1;t)), where ω−j(t+1;t) = (ω1(t+1), . . . , ωj−1(t+1), ωj+1(t), . . . , ωk(t)) is
the most recent state of the chain with the first j − 1 components updated already
(hence the supraindex t + 1), the jth component removed and the remaining k − j
components having the values determined at iteration t (hence the supraindex t).

2. Compute
r = min{ 1, [π(ω1(t+1), . . . , ωj−1(t+1), ω∗j, ωj+1(t), . . . , ωk(t)) qj(ωj(t)|ω∗j, ω−j(t+1;t))]
/ [π(ω1(t+1), . . . , ωj−1(t+1), ωj(t), ωj+1(t), . . . , ωk(t)) qj(ω∗j|ωj(t), ω−j(t+1;t))] }.

3. With probability r accept the proposal and set ωj(t+1) = ω∗j; otherwise, with probability
1 − r, reject it and set ωj(t+1) = ωj(t).
The proposal density qj(·|·) corresponds to the transition kernel used for the jth compo-
nent. Our algorithm uses a number of proposals corresponding to Random Walk Metropolis-
within-Gibbs (RWMwG), Independent Metropolis-within-Gibbs (IMwG) and Elliptical
Slice Sampling within Gibbs (SSwG) moves.
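As a toy illustration of the MwG scheme, the following sketch uses symmetric random-walk updates for every component (so the qj ratio cancels in r); all names are our own:

```python
import numpy as np

def metropolis_within_gibbs(log_pi, omega0, M, steps, seed=None):
    """Generic Metropolis-within-Gibbs: update each coordinate in turn with a
    symmetric random-walk proposal, so r = min(1, pi(proposal)/pi(current))."""
    rng = np.random.default_rng(seed)
    omega = np.asarray(omega0, dtype=float).copy()
    k = omega.size
    chain = np.empty((M, k))
    lp = log_pi(omega)
    for t in range(M):
        for j in range(k):                       # cycle through the k components
            prop = omega.copy()
            prop[j] += steps[j] * rng.standard_normal()
            lp_prop = log_pi(prop)
            if np.log(rng.uniform()) <= lp_prop - lp:
                omega, lp = prop, lp_prop        # accept the component update
        chain[t] = omega
    return chain
```

In the actual sampler each component uses its own tailored proposal (RWMwG, IMwG or SSwG), but the accept/reject skeleton is the same.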
At the t+ 1 step we use the following proposals to update the chain:
wj: Use a RWM transition kernel: w∗ ∼ N (w(t)j , cwjIq+1). The constant cwj is chosen so
that the acceptance rate is about 30%, j = 1, 2.
w: Use the RWM: w∗ ∼ N (w(t), cwI2). The constant cw is chosen so that the acceptance
rate is about 30%.
σj²: Without the copula, the conditional posterior distribution of σj² would be IG(0.1 +
n/2, 0.1 + (yj − Aj f̃j(t))T(yj − Aj f̃j(t))/2), where Aj = A(x, x̃j; wj(t+1)), for j = 1, 2. We
use this distribution to build an independent Metropolis (IM) type of transition
for σj², j = 1, 2. The acceptance rate is usually in the range [0.25, 0.60] and the
chain mixes better than it would under a RWM.
β: Since β is normalized, we use a RWM on the unit sphere based on the von Mises–Fisher
distribution (henceforth denoted VMF). The VMF distribution has two parameters: µ
(normalized to have norm one), which represents the mean direction, and κ, the concen-
tration parameter. A larger κ implies that the distribution is more concentrated
around µ. The density is symmetric in µ and its argument, and is proportional to

fVMF(x; µ, κ) ∝ exp(κ xTµ).

The proposals are generated using β∗ ∼ VMF(β(t), κ), where κ is chosen so that the
acceptance rate is around 30%.
f ’s: For fj, j = 1, 2 and f we use the elliptical slice sampling proposed by [65] which does not
require the tuning of simulation parameters. Although not needed in our examples, we
note that if the chain’s mixing is sluggish, one can improve it using the parallelization
strategy proposed by [68].
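For intuition, a simple symmetric alternative to the VMF proposal for β — perturb with isotropic Gaussian noise and renormalize to the sphere — can be sketched as follows; this is our own illustrative substitute, not the proposal used in the thesis:

```python
import numpy as np

def sphere_rw_proposal(beta, scale, rng):
    """Symmetric random walk on the unit sphere: Gaussian perturbation followed
    by renormalization. The induced proposal density depends only on the angle
    between the current and proposed points, so it cancels in the MH ratio."""
    prop = beta + scale * rng.standard_normal(beta.size)
    return prop / np.linalg.norm(prop)
```

Like the VMF proposal, the scale (playing the role of 1/κ) is tuned so that the acceptance rate lands near the desired value.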
In our experience the efficiency of the algorithm benefits from initial values that are not too
far from the posterior mode. Therefore we propose to first roughly estimate the parameters
in two independent regressions for y1 and y2 to get (f̃1, w1, σ1²)(1) and (f̃2, w2, σ2²)(1), and
then run another MCMC that fixes the marginals and samples only (f̃, w), which yields
(f̃, w)(1). These three short chains (100-200 iterations each) provide good initial values for the
joint MCMC sampler. This simple approach shortens the time it would take for the original
chain to find the regions of high mass under the posterior. We have also found that the
chain's mixing is accelerated when initial values for w are small, thus allowing for more
variation in the calibration function.
Remark: In our numerical experiments, we will fit the GP-SIM model to data with constant
calibration, i.e., with true values βi = 0 for all 1 ≤ i ≤ q. The constraint ‖β‖ = 1
forbids sampling null values for all the components of β simultaneously, and instead the
MCMC draws for β's components are spread randomly over the support. However, the shape
of the calibration function is correctly recovered, since the sampled values for the second
component of w were large, reflecting the perfect dependence between f(xTi β) and f(xTj β)
for any 1 ≤ i ≠ j ≤ n. This led to difficulties in identifying the SA, as discussed below, and
compelled us to develop a new SA identification procedure that is described in Section 4.2.
2.3 Performance of the algorithms
2.3.1 Simulations
The purpose of the simulation study is to assess empirically: 1) the performance of the
estimation method under the correct and misspecified models, as well as 2) the ability of
the model selection criteria to identify the correct copula structure, i.e. the copula family
and the parametric form of the calibration function. For the former aim we compute the
integrated mean squared error for various quantities of interest, including Kendall's τ. In order
to facilitate the assessment of the estimation performance across different copula families,
we estimate the calibration function on the Kendall’s τ scale. The latter is given by
τ(X) = 4( ∫∫ C(u1, u2|X) c(u1, u2|X) du1 du2 ) − 1.
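Since the display above equals 4E[C(U1, U2)] − 1 with (U1, U2) ∼ C, it can be checked by Monte Carlo. A sketch for the Clayton family with θ > 0 (sampling by conditional inversion; the function names are ours):

```python
import numpy as np

def clayton_sample(theta, n, seed=None):
    """Sample (U1, U2) from a Clayton copula (theta > 0) by conditional inversion."""
    rng = np.random.default_rng(seed)
    u1, v = rng.uniform(size=n), rng.uniform(size=n)
    # invert the conditional CDF C_{2|1}(u2 | u1) = v
    u2 = ((v ** (-theta / (1 + theta)) - 1) * u1 ** (-theta) + 1) ** (-1 / theta)
    return u1, u2

def kendall_tau_mc(theta, n=50000, seed=None):
    """Monte Carlo estimate of tau = 4 E[C(U1, U2)] - 1 for the Clayton copula."""
    u1, u2 = clayton_sample(theta, n, seed)
    C = (u1 ** (-theta) + u2 ** (-theta) - 1) ** (-1 / theta)  # Clayton copula CDF
    return 4 * C.mean() - 1
```

For Clayton, τ = θ/(θ + 2) in closed form (Table 2.2), so the estimate can be validated directly, e.g. θ = 2 gives τ = 0.5.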
We will compare 3 copulas: Clayton, Frank and Gaussian under the general GP-SIM model
and the Clayton with a constant calibration function. To fit the model with a constant copula,
we still use MCMC, but instead of f̃, f, w and β in the calibration we have a constant scalar
copula parameter θ. The RWMwG transition is used to sample θ, while the proposal distributions
for the marginals' parameters and latent variables remain the same. Table 2.1 shows the
copula density functions (as functions of the parameter θ) for each copula family.
inverse-link functions g−1 used for calibration, the functional relationship between Kendall’s
τ and copula parameters and parameter ranges for every copula family used in this thesis.
Table 2.1: Copula density functions for each copula family.

Copula     c(u1, u2|θ)
Clayton    (1+θ)(u1 u2)^(−1−θ) / A^(1/θ+2), where A = u1^(−θ) + u2^(−θ) − 1
Frank      θ(1 − e^(−θ)) e^(−θ(u1+u2)) [(1 − e^(−θ)) − (1 − e^(−θu1))(1 − e^(−θu2))]^(−2)
Gaussian   (1/√(1−θ²)) exp( −[θ²(y1² + y2²) − 2θ y1 y2] / [2(1−θ²)] ), where yj = Φ^(−1)(uj)
T (v df)   [2π √(1−θ²) dv(y1) dv(y2)]^(−1) [1 + (y1² + y2² − 2θ y1 y2)/(v(1−θ²))]^(−v/2−1),
           where yj = t_v^(−1)(uj) and dv(y) is the univariate t density with v df
Gumbel     [A/(u1 u2)] (y1 + y2)^(−2+2/θ) (ln(u1) ln(u2))^(θ−1) [1 + (θ−1)(y1 + y2)^(−1/θ)],
           where yj = (−ln(uj))^θ and A = exp(−(y1 + y2)^(1/θ))
Table 2.2: Parameter range, inverse-link function and the functional relationship between
Kendall's τ and the copula parameter.

Copula       Range of parameter (θ)   Inverse-link function         Kendall's τ formula
Clayton      (−1, ∞) \ {0}            θ = exp(f) − 1                τ = θ/(θ+2)
Frank        (−∞, ∞) \ {0}            θ = f                         No closed form
Gaussian, T  (−1, 1)                  θ = (exp(f)−1)/(exp(f)+1)     τ = (2/π) arcsin(θ)
Gumbel       (1, ∞)                   θ = exp(f) + 1                τ = 1 − 1/θ
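The mappings in Table 2.2 are simple enough to state in code. The following sketch (the function names are ours, not from the thesis implementation) converts a calibration value f to the copula parameter θ, and θ to Kendall's τ, for the families with a closed form:

```python
import math

def theta_from_f(f, family):
    # Inverse-link g^{-1}: calibration value f -> copula parameter theta (Table 2.2).
    if family == "clayton":
        return math.exp(f) - 1.0                            # theta in (-1, inf) \ {0}
    if family == "frank":
        return f                                            # identity link
    if family in ("gaussian", "t"):
        return (math.exp(f) - 1.0) / (math.exp(f) + 1.0)    # theta in (-1, 1)
    if family == "gumbel":
        return math.exp(f) + 1.0                            # theta in (1, inf)
    raise ValueError(family)

def tau_from_theta(theta, family):
    # theta -> Kendall's tau (Table 2.2); Frank has no closed form.
    if family == "clayton":
        return theta / (theta + 2.0)
    if family in ("gaussian", "t"):
        return 2.0 / math.pi * math.asin(theta)
    if family == "gumbel":
        return 1.0 - 1.0 / theta
    raise ValueError("no closed form for this family")
```
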
In addition to Kendall's τ, we also use the conditional mean of Y1 given y2 and x for
assessing the estimation. Such conditional means can be useful in prediction when one of
the responses is more expensive to measure than the other. The calculation is mathematically
straightforward:

E(Y1|Y2 = y2, x) = f1(x) + σ1 ∫₀¹ Φ⁻¹(z) c_{θ(x)}( z, Φ((y2 − f2(x))/σ2) ) dz.   (2.5)

The integral in (2.5) is usually not tractable, but it can easily be estimated via numerical
integration since it is one-dimensional and defined on the closed interval [0, 1].
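As a rough illustration of this one-dimensional quadrature (a sketch of ours, not the thesis code, assuming a Clayton copula with θ > 0 and the notation of (2.5)):

```python
import math
from statistics import NormalDist

def clayton_density(u1, u2, theta):
    # Clayton copula density (Table 2.1), valid for theta > 0.
    A = u1 ** (-theta) + u2 ** (-theta) - 1.0
    return (1.0 + theta) * (u1 * u2) ** (-1.0 - theta) * A ** (-1.0 / theta - 2.0)

def cond_mean_y1(y2, f1, f2, s1, s2, theta, n_grid=4000):
    # Midpoint-rule approximation of the integral in (2.5); the
    # midpoints avoid the singular endpoints z = 0 and z = 1.
    nd = NormalDist()
    u2 = nd.cdf((y2 - f2) / s2)
    h = 1.0 / n_grid
    integral = sum(nd.inv_cdf((k + 0.5) * h)
                   * clayton_density((k + 0.5) * h, u2, theta) * h
                   for k in range(n_grid))
    return f1 + s1 * integral
```

For a positively dependent Clayton copula, the conditional mean should increase with y2, which provides a quick sanity check on the quadrature.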
2.3.2 Simulation Details
We generate samples of size n = 400 from each of the following six scenarios using the Clayton
copula. The covariates are generated independently from the Uniform(0, 1) distribution. The
covariate dimension q is 10 in Scenario 3 and 2 in all other scenarios.
Sc1: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
     f2(x) = 0.6 sin(3x1 + 5x2),
     τ(x) = 0.7 + 0.15 sin(15xᵀβ),
     β = (1, 3)ᵀ/√10, σ1 = σ2 = 0.2

Sc2: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
     f2(x) = 0.6 sin(3x1 + 5x2),
     τ(x) = 0.3 sin(5xᵀβ),
     β = (1, 3)ᵀ/√10, σ1 = σ2 = 0.2

Sc3: f1(x) = cos(xᵀβ),
     f2(x) = sin(xᵀβ),
     τ(x) = 0.7 + 0.20 sin(5xᵀβ),
     β = (1, 10, −3, 6, 1, −6, 3, 7, −1, −5)ᵀ/√267, σ1 = σ2 = 0.2

Sc4: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
     f2(x) = 0.6 sin(3x1 + 5x2),
     τ(x) = 0.5,
     σ1 = σ2 = 0.2

Sc5: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
     f2(x) = 0.6 sin(3x1 + 5x2),
     η(x) = 1 + 0.7 sin(3x1³) − 0.5 cos(6x2²),
     σ1 = σ2 = 0.2

Sc6: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
     f2(x) = 0.6 sin(3x1 + 5x2),
     η(x) = 1 + 0.7x1 − 0.5x2²,
     σ1 = σ2 = 0.2
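As an illustrative sketch (ours, not the thesis code), data with an Sc4-type structure can be generated by sampling the Clayton copula via its conditional inverse; the function names are hypothetical:

```python
import math
import random
from statistics import NormalDist

def clayton_pair(theta, rng):
    # Conditional-inverse sampler for the Clayton copula (theta > 0):
    # draw u1 ~ U(0,1), then invert the conditional cdf C(u2 | u1).
    u1, w = rng.random(), rng.random()
    u2 = (u1 ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u1, u2

def simulate_sc4_like(n, rng):
    # Gaussian marginals with the sine mean functions of Sc4 and a
    # constant Kendall's tau = 0.5, i.e. Clayton theta = 2*tau/(1-tau) = 2.
    nd = NormalDist()
    theta, s1, s2 = 2.0, 0.2, 0.2
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        f1 = 0.6 * math.sin(5 * x1) - 0.9 * math.sin(2 * x2)
        f2 = 0.6 * math.sin(3 * x1 + 5 * x2)
        u1, u2 = clayton_pair(theta, rng)
        data.append((x1, x2, f1 + s1 * nd.inv_cdf(u1), f2 + s2 * nd.inv_cdf(u2)))
    return data
```

The empirical Kendall's τ of the copula draws should be close to 0.5, which is a quick check that the conditional inverse was derived correctly.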
Sc1 and Sc2 have calibration functions for which the SIM model is true for Kendall's τ
and, consequently, also for the copula parameter. Sc1 corresponds to large dependence
(τ greater than 0.5) while Sc2 has small dependence (τ between −0.3 and 0.3). Sc3
also has a SIM form for the calibration function, but the covariate dimension is q = 10, so this
scenario is important for evaluating how well the algorithms scale with dimension. Sc4
corresponds to covariate-free dependence (τ = 0.5) and allows us to verify the power
to detect simple parametric forms for the calibration. Scenarios Sc5 and Sc6 do not have
a SIM form, but have additive calibration functions [as in 80]. They are used to evaluate the
effect of model misspecification on the inference. Note that Sc6 has an almost-SIM calibration
when x2 ∈ [0, 1]. From our experiments we found that with m = 30 inducing points
for the marginal and calibration sparse GPs we obtain a reasonable CPU time,
allowing us to perform the desired number of replications while still capturing the
general form of the estimated functions. On average, one MCMC iteration (n = 400) with
GP-SIM calibration takes 0.02 seconds, and one iteration with constant calibration (and GPs for
the marginals) takes 0.015 seconds. The MCMC samplers were run for 20,000 iterations for all
scenarios.
The first half of the MCMC sample is discarded as burn-in and the second half is used
for inference. As noted earlier, starting values were found by running two GP regressions
separately to estimate the marginal parameters, and one MCMC sampler was run to
estimate the calibration parameters; all three samplers were run for only 100 iterations.
2.3.3 Proof of concept based on one Replicate
In the absence of computable convergence bounds, we used the Gelman-Rubin [34] diag-
nostic statistics to decide the length of the chain’s run. To illustrate using Sc1, we ran
10 independent MCMC chains, each for 20, 000 iterations, that were started from different
initial values. The trace plots for the potential scale reduction factor (PSRF), computed up
to 10,000 iterations, for β, σ1² and σ2² are displayed in Figure 2.1.

Figure 2.1: Sc1: Clayton copula, Gelman-Rubin MCMC diagnostic (median and 97.5% quantile of the shrink factor) for Beta 1, Beta 2, Sigma Squared 1 and Sigma Squared 2.

The plots show that the multivariate PSRF after 10,000 iterations is 1.1. The subsequent 10,000 samples were used
for inference.
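A minimal sketch of the Gelman-Rubin PSRF for a scalar parameter, using the standard between/within-chain decomposition (the function name is ours):

```python
import statistics

def psrf(chains):
    # Potential scale reduction factor; `chains` is a list of
    # equal-length lists of samples for one scalar parameter.
    m = len(chains)
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)      # between-chain
    W = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    var_plus = (n - 1) / n * W + B / n                            # pooled variance estimate
    return (var_plus / W) ** 0.5
```

Chains started from overdispersed initial values that have mixed should give a PSRF close to 1, while chains stuck around different modes give values well above 1.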
Parameter Estimation
The simulation results show that Sc1 and Sc2 performed similarly. Since the calibration
function in Sc1 is more complicated, for the sake of reducing the chapter's length we present
results only for that scenario. The trace-plots, autocorrelation functions and histograms of the
posterior samples of β, σ1² and σ2² are shown in Figure 2.2, when the fitted copula belongs
to the correct Clayton family (the horizontal solid red line is the true value). Next we
Figure 2.2: Sc1: Trace-plots, ACFs and histograms of parameters based on MCMC samples generated under the true Clayton family.
show predictions for the marginal means with 95% credible intervals. Since these are two-dimensional
surfaces, we estimate 'slices' at the values 0.2 and 0.8: we first fix
x1 = 0.2 and then x1 = 0.8, and similarly for x2. The results are in Figure 2.3 (black is the truth,
green is the estimate, red lines are the credible intervals).
One of the inferential goals is the prediction of the calibration function or, equivalently,
the Kendall's τ function. In this case we are dealing with only two covariates, so their joint
effect can be visualized via the Kendall's τ surface. In Figure 2.4 we show the true calibration
surface in the left panel and the fitted one in the right. The accuracy is remarkable, and it is
hard to see major differences between the two panels. Since the visual comparison
of the three-dimensional true and fitted surfaces may be misleading, as with the conditional
marginal means we estimate one-dimensional slices at the values 0.2 and 0.8; the results,
shown in Figure 2.5, confirm the accuracy of the fit.
The predictive power of the model was assessed by fixing 4 covariate points and estimating
the corresponding Kendall's τ values: τ(0.2, 0.2), τ(0.2, 0.8), τ(0.8, 0.2), τ(0.8, 0.8). At each
MCMC iteration these predictions are calculated and histograms (Figure 2.6) are constructed
(red lines mark the true values of τ). The same estimates are presented in Figure 2.7 when the
Gaussian copula is used for inference. One can notice that the estimates are biased in this
instance, thus emphasizing the importance of identifying the right copula family. Similar
Figure 2.3: Sc1: Estimation of marginal means. The leftmost 2 columns show the accuracy for predicting E(Y1) and the rightmost 2 columns show the results for predicting E(Y2). The black and green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family.
Figure 2.4: Sc1: Estimation of Kendall's τ dependence surface. The true surface (left panel) is very similar to the estimated one (right panel).
patterns have been observed when using the Frank copula.
We also show how well the algorithm estimates the calibration function when the covariate dimension
is large. Figure 2.8 shows one-dimensional slices of the Kendall's τ function for Sc3, estimated
with the Clayton GP-SIM model. Each plot is produced by varying one coordinate from
Figure 2.5: Sc1: Estimation of Kendall's τ one-dimensional projections when x1 = 0.2 or 0.8 (top panels) and when x2 = 0.2 or 0.8 (bottom panels). The black and green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family.
Figure 2.6: Sc1: Histogram of predicted Kendall's τ values obtained under the true Clayton copula.
Figure 2.7: Sc1: Histogram of predicted τs with the Gaussian copula model.
0 to 1 while fixing all other coordinates at 0.5. We observe that even in this case the
estimated curves are very close to the true Kendall's τ function.
2.3.4 Multiple Replicates
So far, the results reported were based on a single implementation of the method. In order
to facilitate interpretation, we perform 50 independent replications under each of the six
scenarios described previously.
The MCMC sampler was run for 20, 000 iterations for all scenarios. As before, the first
half of iterations was ignored as a burn-in period. For each data set, 4 estimations were
done with Clayton, Frank, Gaussian and constant Clayton copulas. For Sc5 and Sc6 we
also fitted the Clayton copula with an additive model for the calibration function, as in
[80]. The marginal distributions models have the general GP form throughout the section.
In order to produce overall measures of fit, we report the integrated squared Bias (IBias2),
Variance (IVar) and mean squared error (IMSE) of Kendall’s τ evaluated at covariates x =
(x1, . . . , xn)T . The calculation requires finding points estimates for τr(xi) for 1 ≤ r ≤ R
independently replicated analyses and each i = 1, . . . , n. The formulas for IBias2, IVar and
28
Figure 2.8: Sc3: Estimation of Kendall's τ one-dimensional projections for each coordinate, fixing all other coordinates at the 0.5 level. The black and green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family.
IMSE are given by:

IBias² = (1/n) Σ_{i=1}^n [ (1/R) Σ_{r=1}^R τr(xi) − τ(xi) ]²,
IVar = (1/n) Σ_{i=1}^n Var_r( τr(xi) ),
IMSE = IBias² + IVar.   (2.6)
We will apply these concepts not only to Kendall's τ but also to E(Y1|Y2 = y2, X = x) for
different combinations (x, y2).
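Given an R × n array of replicate estimates, the decomposition in (2.6) is a short computation; the following sketch uses our own (hypothetical) names:

```python
import statistics

def integrated_errors(tau_hat, tau_true):
    # tau_hat[r][i]: estimate of tau(x_i) from replicate r (R x n);
    # tau_true[i]: the true tau(x_i).  Implements (2.6).
    R, n = len(tau_hat), len(tau_true)
    ibias2 = statistics.fmean(
        (statistics.fmean(tau_hat[r][i] for r in range(R)) - tau_true[i]) ** 2
        for i in range(n))
    ivar = statistics.fmean(
        statistics.variance([tau_hat[r][i] for r in range(R)])
        for i in range(n))
    return ibias2, ivar, ibias2 + ivar
```
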
Estimation
IBias2, IVar and IMSE for each scenario and each model are shown in Table 2.3 (bold values
show smallest IMSE for each scenario). Note that the smallest IMSE is produced when
fitting the correct model and copula family. The Clayton model with GP-SIM calibration
Table 2.3: Estimated √IBias², √IVar and √IMSE of Kendall's τ for each Scenario and Model.

           Clayton                      Frank                        Gaussian                     Clayton Constant
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc1        0.0393   0.0575  0.0697     0.0357   0.0657  0.0748     0.0679   0.0734  0.1        0.1046   0.0208  0.1066
Sc2        0.0492   0.0665  0.0827     0.0695   0.1     0.1218     0.0509   0.0692  0.0859     0.2314   0.0242  0.2327
Sc3        0.0327   0.0744  0.0813     0.041    0.0858  0.0951     0.0846   0.1069  0.1363     0.123    0.0134  0.1237
Sc4        0.0061   0.0355  0.036      0.0133   0.0584  0.0599     0.0205   0.0493  0.0534     0.0016   0.0258  0.0258
Sc5        0.0723   0.0777  0.1061     0.0703   0.0881  0.1127     0.0842   0.0857  0.1202     0.1589   0.024   0.1607
Sc6        0.0147   0.0384  0.0411     0.0175   0.05    0.0529     0.0338   0.0559  0.0654     0.0849   0.021   0.0874
has the smallest IMSE in all scenarios with the exception of Sc4. We note that models with
constant calibration have much smaller IVar than models with GP-SIM, but have much larger
IBias and, consequently, IMSE. Not surprisingly, for Sc4 the Clayton copula model with
constant calibration yields the smallest IMSE. For each simulated data set and each model,
E(Y1|Y2 = y2, x) was estimated. For all scenarios except Sc3 we let each of x1, x2 take
values in the set {0.2, 0.4, 0.6, 0.8} and y2 in {−0.6, −0.2, 0.2, 0.6}, making a total of 64 com-
binations. For Sc3 we let y2 take values in {−0.5, 0.0, 0.5, 1.0}, while x can take 33 values
scattered in [0, 1]¹⁰, making a total of 132 combinations. The results are presented in Ta-
ble 2.4 and largely mimic the patterns found in Table 2.3, thus showing that the predictive
power of the model and the accuracy of dependence estimation are linked.
Table 2.4: Estimated √IBias², √IVar and √IMSE of E(Y1|y2, x) for each Scenario and Model.

           Clayton                      Frank                        Gaussian                     Clayton Constant
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc1        0.0231   0.0531  0.0579     0.1264   0.0322  0.1304     0.1434   0.0557  0.1539     0.0416   0.0579  0.0713
Sc2        0.0293   0.0464  0.0549     0.0802   0.0475  0.0932     0.1098   0.0593  0.1247     0.1213   0.0407  0.128
Sc3        0.0364   0.0707  0.0795     0.214    0.0363  0.217      0.1042   0.0708  0.1259     0.0483   0.0572  0.0749
Sc4        0.0174   0.042   0.0454     0.1023   0.0325  0.1074     0.1379   0.0449  0.145      0.0179   0.041   0.0447
Sc5        0.0144   0.0413  0.0437     0.0909   0.0347  0.0973     0.14     0.051   0.149      0.0355   0.04    0.0534
Sc6        0.0202   0.0456  0.0498     0.1046   0.0298  0.1087     0.1367   0.0448  0.1439     0.0237   0.0442  0.0501
The results for scenarios Sc5 and Sc6, in which the true calibration has an additive
form, are shown in Table 2.5. Shown are the global measures of fit for Kendall's τ and
E(Y1|Y2 = y2, x) when the true Clayton copula is coupled with the GP-SIM and the additive
model for representing the calibration function. It is not surprising that GP-SIM outperforms
the additive model under Sc6, since the calibration function is not far from having a SIM form
in this case (due to 0 ≤ u − u² ≤ 1/4 for any u ∈ [0, 1]). This is not observed in Sc5, where
GP-SIM performs worse for Kendall's τ estimation than the true additive model.
Table 2.5: Estimated √IBias², √IVar and √IMSE of Kendall's τ and E(Y1|y2, x) for the GP-SIM and Additive models.

Kendall's τ
           Clayton GP-SIM               Clayton Additive
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc5        0.0723   0.0777  0.1061     0.0573   0.0516  0.0771
Sc6        0.0147   0.0384  0.0411     0.0063   0.0458  0.0462

E(Y1|y2, x)
           Clayton GP-SIM               Clayton Additive
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc5        0.0144   0.0413  0.0437     0.0207   0.0428  0.0475
Sc6        0.0202   0.0456  0.0498     0.0236   0.0483  0.0538
Chapter 3
Model Selection
3.1 Conditional CVML criterion
The conditional copula construction is particularly useful in predicting one response given
the other ones. We exploit this feature by computing the predictive distribution of one
response given the rest of the data. The resulting conditional CVML (CCVML) is computed
from P(y1i|y2i, D−i) and P(y2i|y1i, D−i) via

CCVML(M) = (1/2) { Σ_{i=1}^n log[P(y1i|y2i, D−i, M)] + Σ_{i=1}^n log[P(y2i|y1i, D−i, M)] }.   (3.1)
Note that when the marginal distributions are uniform, CCVML is the same as CVML. One
can easily show that

E[ P(y1i|y2i, ω, M)⁻¹ ] = E[ P(y2i|ω, M) / P(y1i, y2i|ω, M) ] = P(y1i|y2i, D−i, M)⁻¹,
E[ P(y2i|y1i, ω, M)⁻¹ ] = E[ P(y1i|ω, M) / P(y1i, y2i|ω, M) ] = P(y2i|y1i, D−i, M)⁻¹.   (3.2)
Based on (3.2), CCVML is estimated from MCMC samples using

CCVML_est(M) = −(1/2) Σ_{i=1}^n { log[ (1/M) Σ_{t=1}^M P(y2i|ω^(t), M) / P(y1i, y2i|ω^(t), M) ]
               + log[ (1/M) Σ_{t=1}^M P(y1i|ω^(t), M) / P(y1i, y2i|ω^(t), M) ] }.   (3.3)
In [92] it was demonstrated that CVML and WAIC are asymptotically equivalent, so that
CVML(M) ≈ WAIC(M)/(−2) for a large sample size n. This connection can be extended
to CCVML using the following two conditional WAICs:

CWAIC1(M) = −2 Σ_{i=1}^n log E[P(y1i|y2i, ω, M)] + 2 Σ_{i=1}^n Var[log P(y1i|y2i, ω, M)],   (3.4)
CWAIC2(M) = −2 Σ_{i=1}^n log E[P(y2i|y1i, ω, M)] + 2 Σ_{i=1}^n Var[log P(y2i|y1i, ω, M)],   (3.5)

where the expectation and variance are with respect to the conditional distribution of ω given
the observed data. An argument that directly follows the one in [89] shows that CCVML
and (1/2){CWAIC1 + CWAIC2} are also asymptotically equivalent.
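Assuming the per-sample densities P(y1i, y2i|ω^(t)), P(y1i|ω^(t)) and P(y2i|ω^(t)) have been stored during MCMC, the estimator (3.3) is a few lines of code; this sketch uses our own (hypothetical) array layout:

```python
import math

def ccvml_estimate(joint, m1, m2):
    # Monte Carlo estimator (3.3):
    #   joint[t][i] = P(y1i, y2i | w^(t)),
    #   m1[t][i]    = P(y1i | w^(t)),
    #   m2[t][i]    = P(y2i | w^(t)).
    M, n = len(joint), len(joint[0])
    total = 0.0
    for i in range(n):
        inv1 = sum(m2[t][i] / joint[t][i] for t in range(M)) / M  # ~ P(y1i|y2i, D_-i)^-1
        inv2 = sum(m1[t][i] / joint[t][i] for t in range(M)) / M  # ~ P(y2i|y1i, D_-i)^-1
        total += math.log(inv1) + math.log(inv2)
    return -0.5 * total
```

When the densities are constant across posterior draws, the estimator reduces to the exact conditional log predictive probabilities, which gives a simple consistency check.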
3.2 Simulation results with CVML, CCVML and WAIC
criteria
In this section we focus on the accuracy of CVML, CCVML and WAIC in selecting the
correct model. All results below are based on the same simulations described in Section 2.3.2.
3.2.1 One replicate
Table 3.1 shows the values for each scenario and model for one replicate. Bold values indicate
largest CVML/CCVML and smallest WAIC values for each scenario. Observe that for Sc1,
Sc2, Sc3, Sc5, Sc6, all three criteria point to the Clayton family, while for Sc4 they indicate
the Clayton family with a constant calibration. We note that the correct copula is selected
even when the generative calibration model is additive (Sc5, Sc6).
3.2.2 Multiple replicates
We now show how well CVML, CCVML and WAIC perform in choosing the correct model
under data set replication. For selecting between different copula families, or for checking whether
the dependence is covariate-free, we simply pick the model with the largest CVML/CCVML or the smallest
WAIC. Table 3.2 shows how often the Clayton model is selected over the other models using CVML,
CCVML and WAIC for Sc1, Sc2, Sc3, Sc5 and Sc6. Similarly, Table 3.3 shows how often
Clayton-constant is selected over the other models for Sc4.
We can conclude that all selection measures perform quite similarly across scenarios.
Also, the numerical study shows that the choice of a copula family is considerably more
Table 3.1: CVML, CCVML and WAIC values for each Scenario and Model.

Scenario   Model           CVML   CCVML   WAIC
Sc1        Clayton          532     458   −1065
Sc1        Frank            422     365    −844
Sc1        Gaussian         397     326    −801
Sc1        Clayton-Const    503     433   −1007
Sc2        Clayton          166     103    −333
Sc2        Frank            144      82    −289
Sc2        Gaussian         146      84    −293
Sc2        Clayton-Const    121      60    −243
Sc3        Clayton          613     536   −1237
Sc3        Frank            562     491   −1126
Sc3        Gaussian         494     417   −1002
Sc3        Clayton-Const    537     462   −1076
Sc4        Clayton          322     254    −644
Sc4        Frank            277     209    −549
Sc4        Gaussian         276     207    −547
Sc4        Clayton-Const    323     255    −647
Sc5        Clayton          324     277    −650
Sc5        Frank            256     216    −513
Sc5        Gaussian         260     214    −520
Sc5        Clayton-Const    299     257    −600
Sc6        Clayton          286     242    −573
Sc6        Frank            216     179    −432
Sc6        Gaussian         205     165    −410
Sc6        Clayton-Const    283     238    −567
Table 3.2: The percentage of correct decisions for each selection criterion when comparing the correct Clayton model with a non-constant calibration against all the other models: the Frank model with non-constant calibration, the Gaussian model with non-constant calibration, and the Clayton model with constant calibration.

           Frank                      Gaussian                   Clayton Constant
Scenario   CVML   CCVML   WAIC       CVML   CCVML   WAIC        CVML   CCVML   WAIC
Sc1        100%   100%    100%       100%   100%    100%        94%    94%     94%
Sc2        100%   100%    100%       100%   100%    100%        100%   100%    100%
Sc3        98%    96%     98%        100%   100%    100%        100%   98%     100%
Sc5        100%   100%    100%       100%   100%    100%        100%   100%    100%
Sc6        100%   100%    100%       100%   100%    100%        98%    100%    98%
accurate than correctly determining that the calibration function is constant. The latter
difficulty has been reported elsewhere [e.g., 20]. In part, this is due to the fact that the models
are flexible enough to capture the constant calibration and produce estimates that mislead a
cross-validation-based method. In Section 4.2 we return to this problem and develop a new
permutation-based procedure that exhibits drastically improved performance in numerical
experiments. Since Sc5 and Sc6 were simulated with the Clayton additive calibration, we
show how often the Clayton Additive model is selected over Clayton GP-SIM using different
Table 3.3: The percentage of correct decisions for each selection criterion when comparing the correct Clayton model with a constant calibration against three models: Clayton, Frank and Gaussian, all of them assuming a GP-SIM calibration.

           Clayton                    Frank                      Gaussian
Scenario   CVML   CCVML   WAIC       CVML   CCVML   WAIC        CVML   CCVML   WAIC
Sc4        58%    62%     58%        100%   100%    100%        100%   100%    100%
criteria (Table 3.4). The poor performance for Sc6 is not that surprising, since the additive
calibration in this scenario has an almost-SIM form.

Table 3.4: The percentage of correct decisions for each selection criterion when comparing the correct additive model against GP-SIM with non-constant calibration.

           Clayton GP-SIM
Scenario   CVML   CCVML   WAIC
Sc5        92%    94%     90%
Sc6        30%    34%     28%
3.3 Additional Simulation Results Based On Multiple Replicates
In addition to the simulations shown in Sections 2.3.2 and 3.2.2, we also simulated and analyzed
50 independent replicates from each of the following scenarios:

Sc1b: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
      f2(x) = 0.6 sin(3x1 + 5x2),
      τ(x) = 0.7 + 0.15 sin(15xᵀβ),
      β = (1, 3)ᵀ/√10, σ1 = σ2 = 0.2, n = 1000

Sc7:  f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
      f2(x) = 0.6 sin(3x1 + 5x2),
      τ(x) = 0.7 + 0.15 sin(2xᵀβ),
      β = (2, −3)ᵀ, σ1 = σ2 = 0.2, n = 400

Sc1b is exactly the same as Sc1, the only difference being that the sample size is 1000 instead of 400.
In Sc7 we do not assume that the generating β is normalized.
Tables 3.5 and 3.6 show IBias², IVar and IMSE for each scenario (including Sc1 for
comparison) and each model, for the estimation of Kendall's τ and E(Y1|y2, x), respectively.
Table 3.5: Estimated √IBias², √IVar and √IMSE of Kendall's τ for each Scenario and Model.

           Clayton                      Frank                        Gaussian                     Clayton Constant
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc1        0.0393   0.0575  0.0697     0.0357   0.0657  0.0748     0.0679   0.0734  0.1000     0.1046   0.0208  0.1066
Sc1b       0.0308   0.0526  0.0610     0.0353   0.0569  0.067      0.0693   0.0624  0.0933     0.1093   0.0148  0.1102
Sc7        0.0266   0.0527  0.0591     0.0422   0.0673  0.0794     0.0636   0.0743  0.0978     0.1050   0.0162  0.1063
Table 3.6: Estimated √IBias², √IVar and √IMSE of E(Y1|y2, x) for each Scenario and Model.

           Clayton                      Frank                        Gaussian                     Clayton Constant
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc1        0.0231   0.0531  0.0579     0.1264   0.0322  0.1304     0.1434   0.0557  0.1539     0.0416   0.0579  0.0713
Sc1b       0.0140   0.0362  0.0389     0.1250   0.0238  0.1272     0.1457   0.0412  0.1514     0.036    0.0473  0.0594
Sc7        0.0154   0.0466  0.0491     0.1228   0.0315  0.1268     0.1300   0.0559  0.1415     0.0437   0.0494  0.0659
First we note that Clayton with GP-SIM calibration produces the smallest IMSE in all scenarios.
Another important observation is that the IMSEs for Kendall's τ and conditional response
prediction are smaller for Sc1b than for Sc1, which indicates that as the sample size
increases the model produces more accurate predictions. Results for Sc7 are similar to Sc1,
so even when the true generating β in the SIM is not normalized we still obtain acceptable
predictions for each test value x; the posterior for β simply converges to the normalized
vector (2, −3)ᵀ/√13.
Table 3.7 shows how often the Clayton model (with non-constant calibration) is selected over
the other models using CVML, CCVML and WAIC. Again we notice that these criteria
perform well in distinguishing between copula families. Also, in Sc1b all criteria select the true
non-constant Clayton at a higher rate than in Sc1, which is probably due to the larger sample
size.
Table 3.7: The percentage of correct decisions for each selection criterion when comparing the correct Clayton model with a non-constant calibration against all the other models: the Frank model with non-constant calibration, the Gaussian model with non-constant calibration, and the Clayton model with constant calibration.

           Frank                      Gaussian                   Clayton Constant
Scenario   CVML   CCVML   WAIC       CVML   CCVML   WAIC        CVML   CCVML   WAIC
Sc1        100%   100%    100%       100%   100%    100%        94%    94%     94%
Sc1b       100%   100%    100%       100%   100%    100%        98%    98%     98%
Sc7        100%   100%    100%       100%   100%    100%        98%    94%     98%
Chapter 4
Simplifying Assumption
4.1 Interesting Connection between Model Misspecification and the Simplifying Assumption
Understanding whether the data support the SA or not is usually important for the subject
matter analysis, since a dependence structure that does not depend on the covariates can be
of scientific interest. The SA also has a serious impact on the statistical analysis, because
it has the potential to greatly simplify the estimation of the copula. There is, however, an
interesting connection between model misspecification and the SA which, as far as we know, has
not been reported elsewhere.
To illustrate the point, consider a random sampling design setting with two independent
random variables X1, X2 serving as covariates in a Clayton copula model in which the SA is
satisfied, the sample size is n = 1500, and

f1(x) = 0.6 sin(5x1 + x2),
f2(x) = 0.6 sin(x1 + 5x2),
τ(x) = 0.5,
σ1 = σ2 = 0.2.
When we fit a GP-SIM model with the correct Clayton copula family, but with the X2
covariate omitted from both the marginal and copula models, the estimated Kendall's τ(x1)
exhibits a clear non-constant shape, as seen in Figure 4.1. The CVML, CCVML and WAIC
criteria, whose values are shown in Table 4.1, unanimously vote for a non-constant calibration
function.
Figure 4.1: Estimation of Kendall's τ as a function of x1 when only the first covariate is used in estimation. The dotted black and solid green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family.
Table 4.1: Missed covariate: CVML, CCVML and WAIC criterion values for the model in which the conditional copula depends on one covariate and for the model in which it is constant.

Variables   CVML   CCVML   WAIC
X1          −508   −174    1017
Constant    −570   −232    1140
While one may expect a non-constant pattern when the two covariates are dependent,
this residual effect of X1 on the copula may be surprising when X1 and X2 are independent.
We can gain some understanding by considering a simplified example in which
Yi|X1, X2 ∼ N(fi(X1, X2), 1) for i = 1, 2, and Cov(Y1, Y2|X1, X2) = Corr(Y1, Y2|X1, X2) = ρ,
hence constant in X1 and X2. For marginal models that include only X1, yielding
residuals Wi = Yi − E[Yi|X1] for i = 1, 2, we are interested in explaining the non-constant
dependence between Cov(W1, W2|X1) and X1. Standard properties of covariance
and conditional expectation are used to obtain

Cov(W1, W2|X1) = Cov(Y1, Y2|X1),   (4.1)
and

Cov(Y1, Y2|X1) = E[Cov(Y1, Y2|X1, X2)] + Cov(E[Y1|X1, X2], E[Y2|X1, X2])
               = ρ + Cov(f1(X1, X2), f2(X1, X2)),   (4.2)

where the covariance in (4.2) is with respect to the marginal distribution of X2. Hence it is
apparent that the conditional covariance Cov(W1, W2|X1) will generally not be constant in
X1. Note that if the true means have an additive form, i.e. fi(X1, X2) = fi(X1) + fi(X2) for
i = 1, 2, then the covariances in (4.1) are indeed constant in X1, but the estimated value
of Cov(Y1, Y2|X1) will be biased. Although here we have focused on the covariance as a measure
of dependence, the argument extends to copula parameters or Kendall's τ, but the
calculations are more involved.
In conclusion, a violation of the SA may be due to the omission of important covariates
from the model. This phenomenon, along with the knowledge that it is in general difficult
to measure all the variables with a potential effect on the dependence pattern, suggests that
a non-constant copula is a prudent choice.
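The covariance term in (4.2) can be evaluated numerically for the mean functions of this example. The following sketch (ours, using a midpoint approximation over X2 ~ Uniform(0,1)) shows that the induced covariance varies markedly with x1:

```python
import math

def cov_f1_f2_given_x1(x1, n_grid=20000):
    # Midpoint approximation of Cov(f1(X1,X2), f2(X1,X2) | X1 = x1)
    # with X2 ~ Uniform(0,1), for the mean functions of this example:
    # f1(x) = 0.6 sin(5 x1 + x2), f2(x) = 0.6 sin(x1 + 5 x2).
    h = 1.0 / n_grid
    e1 = e2 = e12 = 0.0
    for k in range(n_grid):
        x2 = (k + 0.5) * h
        a = 0.6 * math.sin(5 * x1 + x2)
        b = 0.6 * math.sin(x1 + 5 * x2)
        e1 += a * h
        e2 += b * h
        e12 += a * b * h
    return e12 - e1 * e2
```

With these mean functions, the induced covariance is negative at x1 = 0 and positive at x1 = 0.5, even though ρ itself is constant; this is exactly the residual dependence pattern picked up in Figure 4.1.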
4.2 A Permutation-based CVML to Detect Data Support for the Simplified Assumption
In this section we modify the CVML and CCVML criteria to identify data support for the SA
after the copula family has been selected.
As was shown in Section 3.2.2, the CVML, CCVML and WAIC criteria yield good results
in identifying the correct copula family, but do not perform well in recognizing that the true
calibration is constant. In other words, they have a large probability of Type I error. This is in
line with [20], who also noted that traditional Bayesian model selection criteria, e.g. the
Deviance Information Criterion (DIC) of [88], tend to prefer the more complex calibration
model over a simple model with constant calibration even when the latter is actually correct.
In addition to the simulations presented in the previous section, we add here that when the
marginal distributions are estimated, the performance of the existing criteria worsens. To
illustrate, we have simulated 50 replicates of sample size 1500 from Sc1, Sc4 and Sc5 using
the Clayton copula. Each sample is fitted with the general model introduced here and with a
constant Clayton copula, while the marginals are estimated using a general GP. Table 4.2 shows
the proportion of correct decisions for the three scenarios and the various selection criteria. These
results show that even for a large sample size, the proportion of correct decisions for Sc4, i.e.
when SA holds, is quite low. One explanation is that the general model does a good
job of capturing the constant trend of the calibration function and yields predictions that
are not far from those produced by the simpler (and correct) model. The modified
CVML we propose is inspired by two desiderata: i) to separate the set of observations used
for prediction from the set of observations used for fitting the model, and ii) to amplify the
impact of the copula-induced errors in the CCVML calculation. The former reduces the
implicit bias one gets when the same data are used for estimation and testing, while the latter
is expected to increase the power to identify SA.
For i) we randomly partition the data into a training set $\mathcal{D} = \{y_{1i}, y_{2i}, x_i\}_{i=1,\ldots,n_1}$ and a
test set $\mathcal{D}^* = \{y^*_{1i}, y^*_{2i}, x^*_i\}_{i=1,\ldots,n_2}$. In our numerical experiments we have kept two thirds of the
observations in the training set. In order to achieve ii) we note that permuting the response
indexes will not affect the copula term if SA is indeed satisfied, but will perturb the prediction
when SA is not satisfied. However, one must implement this idea cautiously, since
the permutation $\lambda : \{1, \ldots, n_2\} \rightarrow \{1, \ldots, n_2\}$ will affect the marginal model fit, regardless
of the SA status, as $y_{j\lambda(i)}$ will be paired with $x_i$, for all $j = 1, 2$. Below we describe the
permutation-based CVML criterion that combines i) and ii).
Assume that the fitted GP-SIM model yields posterior samples from the conditional distri-
bution of latent variables and parameters, $\omega^{(t)} \sim \pi(\omega \mid \mathcal{D})$, $t = 1, \ldots, M$. Then we define the
observed-data criterion as the predictive log probability of the test cases, which can be easily
estimated from the posterior samples as follows:
\[
\begin{aligned}
\mathrm{CVML}_{\mathrm{obs}} &= \sum_{i=1}^{n_2} \log P(y^*_{1i}, y^*_{2i} \mid \mathcal{D}, x^*_i) \approx \sum_{i=1}^{n_2} \log\left\{\frac{1}{M}\sum_{t=1}^{M} P(y^*_{1i}, y^*_{2i} \mid \omega^{(t)}, x^*_i)\right\} \\
&= \sum_{i=1}^{n_2} \log\Bigg\{\frac{1}{M}\sum_{t=1}^{M} \frac{1}{\sigma^{(t)}_1}\,\phi\!\left(\frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma^{(t)}_1}\right) \frac{1}{\sigma^{(t)}_2}\,\phi\!\left(\frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma^{(t)}_2}\right) \\
&\qquad\qquad \times\, c_{\theta^{*(t)}_i}\!\left[\Phi\!\left(\frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma^{(t)}_1}\right), \Phi\!\left(\frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma^{(t)}_2}\right)\right]\Bigg\},
\end{aligned}
\]
where $f^{*(t)}_{1i}$, $f^{*(t)}_{2i}$, $\theta^{*(t)}_i$ are the predicted values for the test cases produced by the GP-SIM
model. Consider J permutations of {1, . . . , n2}, denoted $\lambda_1, \ldots, \lambda_J$, and compute the
J permuted CVMLs as:
\[
\mathrm{CVML}_{j} = \sum_{i=1}^{n_2} \log\Bigg\{\frac{1}{M}\sum_{t=1}^{M} \frac{1}{\sigma^{(t)}_1}\,\phi\!\left(\frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma^{(t)}_1}\right) \frac{1}{\sigma^{(t)}_2}\,\phi\!\left(\frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma^{(t)}_2}\right) \times c_{\theta^{*(t)}_{\lambda_j(i)}}\!\left[\Phi\!\left(\frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma^{(t)}_1}\right), \Phi\!\left(\frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma^{(t)}_2}\right)\right]\Bigg\}. \quad (4.3)
\]
Note that CVMLobs differs from CVMLj only in the values of the copula parameters: while
the former uses $\theta(x^*_i)$, the latter uses $\theta(x^*_{\lambda_j(i)})$ for the dependence between $y^*_{1i}$ and
$y^*_{2i}$. If the calibration is constant then CVMLobs and CVMLj should be similar, hence we define
the evidence
\[
\mathrm{EV} = 2 \times \min\left\{ \frac{1}{J}\sum_{j=1}^{J} \mathbf{1}_{\{\mathrm{CVML}_{\mathrm{obs}} < \mathrm{CVML}_j\}},\; \frac{1}{J}\sum_{j=1}^{J} \mathbf{1}_{\{\mathrm{CVML}_{\mathrm{obs}} > \mathrm{CVML}_j\}} \right\}. \quad (4.4)
\]
Under the null model with constant calibration and known marginals, and if we assume
that CVMLobs and {CVMLj : 1 ≤ j ≤ J} are iid, then each term inside the min
function in (4.4) has a Uniform(0, 1) limiting distribution as J → ∞. In that case it
follows that P(EV < 0.05) = 0.05. In practice, the ideal situation just described is merely
an approximation, since the {CVMLj : 1 ≤ j ≤ J} are not independent and we compute EV
using a fixed number of permutations. Nevertheless, the ideal setup can be used to build
our decision rule: when EV > 0.05 the data support SA, and otherwise they do not.
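As a concrete sketch, the evidence in (4.4) is straightforward to compute once the observed and permuted CVML values are available; the function and the numbers below are illustrative, not part of the thesis code:

```python
import numpy as np

def evidence(cvml_obs, cvml_perm):
    """Two-sided permutation evidence as in (4.4): twice the smaller of the two
    tail proportions of the permuted CVML values around the observed CVML."""
    cvml_perm = np.asarray(cvml_perm, dtype=float)
    below = np.mean(cvml_obs < cvml_perm)  # fraction of permuted values above the observed one
    above = np.mean(cvml_obs > cvml_perm)  # fraction of permuted values below the observed one
    return 2.0 * min(below, above)

# An observed CVML in the bulk of the permuted values gives EV near 1 (data
# support SA); one in either extreme tail drives EV towards 0.
perm = [-101.0, -100.5, -100.0, -99.5, -99.0]
ev = evidence(-100.2, perm)
```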
A similar rule can be built using the CCVML criterion. For instance, its value for the test
data is
\[
\mathrm{CCVML}_{\mathrm{obs}} = \frac{1}{2}\sum_{i=1}^{n_2} \log P(y^*_{1i} \mid \mathcal{D}, x^*_i, y^*_{2i}) + \frac{1}{2}\sum_{i=1}^{n_2} \log P(y^*_{2i} \mid \mathcal{D}, x^*_i, y^*_{1i}). \quad (4.5)
\]
The permutation-based version of (4.5) can be obtained using the same principle as in (4.3)
thus leading to the counterpart of (4.4) for CCVML.
Table 4.3 shows the proportion of correct decisions using the proposed methods with 1000
and 500 samples in the training and test sets, respectively, and J = 500 permutations. The results,
especially those for Sc4, clearly show an important improvement in the rate of making the
correct selection, with only a slight decrease in the power to detect non-constant calibrations.
We also notice that CVML and CCVML performed similarly.
Table 4.2: The percentage of correct decisions for each selection criterion and scenario. GP-SIM and SA were fitted with the Clayton copula; sample size is 1500.

Scenario   CVML   CCVML   WAIC
Sc1        100%   100%    100%
Sc4         74%    78%     74%
Sc5        100%   100%    100%

Table 4.3: The percentage of correct decisions for each selection criterion and scenario. Predicted CVML and CCVML values based on n1 = 1000 training and n2 = 500 test data, respectively. The calculation of EV is based on a random sample of 500 permutations.

Scenario   CVML   CCVML
Sc1         98%    96%
Sc4         92%    90%
Sc5        100%   100%
4.3 Two Other Methods for Detecting Data Support for SA
The permutation-based CVML and CCVML perform much better in identifying a constant
copula than the generic CVML and WAIC. However, they lack theoretical justification and cannot
guarantee a specific probability of Type I error. Therefore, in this section we develop further the idea
of splitting the whole data set into training and test sets, and propose two other SA
testing procedures that can also be applied to other models. We propose to use
properties that are invariant to the group of permutations when SA holds. In the first
stage we randomly divide the data D into training and test sets, D1 and D2, with sample sizes n1 and
n2, respectively. The full model defined by (1.9) is fitted on D1, and we denote by
$\omega^{(t)}$ the t-th draw sampled from the posterior. For the ith item in D2, we compute point esti-
mates $\bar\eta_i$ and $\bar U_i = (\bar U_{1i}, \bar U_{2i})$, where $U_{ji} = F_j(y_{ji} \mid \omega_j, x_i)$, $j = 1, 2$, $i = 1, \ldots, n_2$, and $\omega_j$ are the
parameters and latent variables related to marginal distribution j. The marginal
parameter estimates are obtained from the training-data posterior draws. For instance,
if the marginal models are $Y_{1i} \sim \mathcal{N}(f_1(x_i), \sigma_1^2)$ and $Y_{2i} \sim \mathcal{N}(f_2(x_i), \sigma_2^2)$, then each
MCMC sample $\omega^{(t)}$ leads to estimates $f_1^t(x_i), f_2^t(x_i), \sigma_1^t, \sigma_2^t, \eta^t(x_i)$. Then $\bar U_i = (\bar U_{1i}, \bar U_{2i})$ are
obtained using
\[
(\bar U_{1i}, \bar U_{2i}) = \left(\overline{\Phi\big((y_{1i} - f_1(x_i))/\sigma_1\big)},\; \overline{\Phi\big((y_{2i} - f_2(x_i))/\sigma_2\big)}\right),
\]
where the overline $\bar a$ signifies the average of the Monte Carlo draws $a^t$.
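As a small illustration of this averaging, suppose the posterior draws for one test case are stored as arrays; all names and numerical values below are hypothetical, for a Gaussian marginal:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical posterior draws for one test case: M draws of the marginal mean
# f1(x_i) and of the standard deviation sigma1, under a Gaussian marginal model.
M = 1000
f1_draws = rng.normal(1.0, 0.05, size=M)       # f1^t(x_i)
sig1_draws = np.abs(rng.normal(0.2, 0.01, M))  # sigma1^t
y1 = 1.1                                       # observed response y_{1i}

# U-bar_{1i}: average over the draws of Phi((y_{1i} - f1^t(x_i)) / sigma1^t)
u1_bar = np.mean(norm.cdf((y1 - f1_draws) / sig1_draws))
```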
Given the vector of calibration function evaluations at the test points, $\bar\eta = (\bar\eta_1, \ldots, \bar\eta_{n_2})$,
and a partition $\min(\bar\eta) = a_1 < \ldots < a_{K+1} = \max(\bar\eta)$ of the range of $\bar\eta$ into K disjoint
intervals, define the set of observations in D2 that yield calibration function values between
$a_k$ and $a_{k+1}$, $B_k = \{i : a_k \le \bar\eta_i < a_{k+1}\}$, $k = 1, \ldots, K$. We choose the partition such that each
"bin" Bk has approximately the same number of elements, n2/K. Under SA, the bin-specific
estimates of various measures of dependence, e.g. Kendall's τ or Spearman's ρ, computed
from the samples $\bar U_i$, are invariant to permutations, or swaps, across bins. Based on this
observation, we consider the procedure described in Table 4.4 for identifying data support for SA.

A1 Compute the kth bin-specific Kendall's tau $\hat\tau_k$ from $\{\bar U_i : i \in B_k\}$, k = 1, . . . , K.

A2 Compute the observed statistic $T^{\mathrm{obs}} = \mathrm{SD}_k(\hat\tau_k)$, where $\mathrm{SD}_k(a_k)$ denotes the standard deviation of the sequence $\{a_k\}$ over the index k. Note that if SA holds, we expect the observed statistic to be close to zero.

A3 Consider J permutations $\lambda_j : \{1, \ldots, n_2\} \rightarrow \{1, \ldots, n_2\}$. For each permutation $\lambda_j$:

A3.1 Compute $\hat\tau_{jk} = \hat\tau(\{\bar U_i : \lambda_j(i) \in B_k\})$, k = 1, . . . , K.

A3.2 Compute the test statistic $T_j = \mathrm{SD}_k(\hat\tau_{jk})$. Note that if SA holds, we expect $T_j$ to be close to $T^{\mathrm{obs}}$.

A4 We consider that there is support in favour of SA at significance level α if $T^{\mathrm{obs}}$ is smaller than the (1 − α)-th empirical quantile calculated from the sample {Tj : 1 ≤ j ≤ J}.

Table 4.4: Method 1: A permutation-based procedure for assessing data support in favour of SA

The distribution of the resulting test statistics obtained in Method 1 is determined
empirically, via permutations. Alternatively, one can rely on the asymptotic properties of the
bin-specific dependence parameter estimates and construct a Chi-square test. Specifically,
suppose the bin-specific Pearson correlations $\hat\rho_k$ are computed from the samples $\{\bar U_i : i \in B_k\}$,
for all k = 1, . . . , K. Let $\hat\rho = (\hat\rho_1, \ldots, \hat\rho_K)^T$, and let $n = n_2/K$ be the number of points in each
bin. It is known that $\hat\rho_k$ is asymptotically normally distributed for each k, so that
\[
\sqrt{n}(\hat\rho_k - \rho_k) \xrightarrow{d} \mathcal{N}\big(0, (1 - \rho_k^2)^2\big),
\]
where $\rho_k$ is the true correlation in bin k. If we assume that the $\{\hat\rho_k : k = 1, \ldots, K\}$ are
independent, and set $\rho = (\rho_1, \ldots, \rho_K)^T$ and $\Sigma = \mathrm{diag}\big((1-\rho_1^2)^2, \ldots, (1-\rho_K^2)^2\big)$, then we
have:
\[
\sqrt{n}(\hat\rho - \rho) \xrightarrow{d} \mathcal{N}(0, \Sigma).
\]
B1 Compute the bin-specific Pearson correlation $\hat\rho_k$ from the samples $\{\bar U_i : i \in B_k\}$, for all k = 1, . . . , K. Let $\hat\rho = (\hat\rho_1, \ldots, \hat\rho_K)^T$, and n = n2/K, the number of points in each bin.

B2 Define $\hat\Sigma = \mathrm{diag}\big((1-\hat\rho_1^2)^2, \ldots, (1-\hat\rho_K^2)^2\big)$ and $A \in \mathbb{R}^{(K-1)\times K}$ as in (4.6); then under SA we have that ρ1 = . . . = ρK and
\[
n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho) \xrightarrow{d} \chi^2_{K-1}.
\]
Compute $T^{\mathrm{obs}} = n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho)$.

B3 Compute the p-value $= P(\chi^2_{K-1} > T^{\mathrm{obs}})$ and reject SA if p-value < α.

Table 4.5: Method 2: A Chi-square test for assessing data support in favour of SA
In order to combine evidence across bins, we define the matrix $A \in \mathbb{R}^{(K-1)\times K}$ as
\[
A = \begin{pmatrix}
1 & -1 & 0 & \cdots & 0 \\
0 & 1 & -1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & -1
\end{pmatrix}. \quad (4.6)
\]
Since under the null hypothesis SA holds, one gets ρ1 = . . . = ρK, implying
\[
n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho) \xrightarrow{d} \chi^2_{K-1}.
\]
Method 2, with its steps detailed in Table 4.5, relies on the ideas above to test SA.
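The computations in Table 4.5 amount to a few lines of linear algebra. The sketch below is illustrative only: the bin construction is assumed done, the function name is hypothetical, and seeded Gaussian pairs stand in for the pseudo-observations:

```python
import numpy as np
from scipy.stats import chi2

def chisq_sa_test(bins):
    """bins: list of K arrays of shape (n, 2) holding (U1, U2) pairs, one per bin.
    Returns the statistic T_obs and its chi-square p-value (Method 2)."""
    K = len(bins)
    n = bins[0].shape[0]
    rho = np.array([np.corrcoef(b[:, 0], b[:, 1])[0, 1] for b in bins])
    Sigma = np.diag((1.0 - rho**2) ** 2)
    # Successive-difference contrast matrix A of shape (K-1, K), as in (4.6).
    A = np.eye(K - 1, K) - np.eye(K - 1, K, k=1)
    d = A @ rho
    T = n * d @ np.linalg.inv(A @ Sigma @ A.T) @ d
    return T, chi2.sf(T, K - 1)

rng = np.random.default_rng(1)
n = 2000

def sample(rho):  # bivariate normal pairs with correlation rho
    return rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)

# Equal correlations (SA): moderate statistic; unequal correlations: tiny p-value.
T_eq, p_eq = chisq_sa_test([sample(0.5), sample(0.5)])
T_ne, p_ne = chisq_sa_test([sample(0.0), sample(0.9)])
```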
Method 1 evaluates the p-value using a randomization procedure [56], while the second
is based on the asymptotic normal theory of Pearson correlations. To get reliable results
it is essential to assign test observations to the "correct" bins, which happens when the calibration
predictions are as close as possible to the true unknown values, i.e. $\bar\eta(x_i) \approx \eta(x_i)$. The latter
depends heavily on the estimation procedure and on the sample size of the training set. Therefore
it is advisable to apply very flexible models for the calibration function estimation and to have
enough data points in the training set. We immediately see a tradeoff: the more observations
are assigned to D1, the better the calibration predictions will be, at the expense of
decreasing power due to a smaller sample size in D2. For our simulations we have used
n1 ≈ 0.5n and n2 ≈ 0.5n, and K ∈ {2, 3}.

To get the intuition behind the proposed methods, consider an idealized example where
marginals are uniform, the true calibration is known and we have access to "infinite" data;
moreover, we focus on the situation with only 2 bins. Note that if SA is true then the correlation
in each bin is the same, and any permutation should yield the same or a very similar correlation.
On the other hand, if SA is not satisfied, and assuming for simplicity that the calibration takes
only 2 values, then, since observations are assigned to bins by their calibration values, bin 1 and
bin 2 will contain pairs following distributions π1(u1, u2) and π2(u1, u2) with corresponding
correlations ρ1 < ρ2, respectively. Note that after a random permutation, the pairs in bins 1 and 2 will
follow the mixture distributions λπ1(u1, u2) + (1 − λ)π2(u1, u2) and (1 − λ)π1(u1, u2) + λπ2(u1, u2),
respectively, with λ ∈ (0, 1). It follows that the correlations of the permuted data
in bins 1 and 2 are λρ1 + (1 − λ)ρ2 and (1 − λ)ρ1 + λρ2. Observe that each of these correlations lies
between ρ1 and ρ2, which implies that the absolute difference between the two correlations
after any permutation must be less than ρ2 − ρ1, which is our observed test statistic. Of
course, with real, finite data the observed statistic is not such an obvious outlier, but it should
lie somewhere in the tail of the distribution of the permuted statistics. In other words, this example illustrates
that if SA is not satisfied then our proposed methods should reject this hypothesis (at least
for a large enough sample size).
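This two-bin intuition, and the core of Method 1 (Table 4.4), can be checked numerically. The sketch below is illustrative only: the names are not from the thesis code, scipy's `kendalltau` is used for the bin-wise estimates, and seeded Gaussian pairs stand in for the pseudo-observations:

```python
import numpy as np
from scipy.stats import kendalltau

def method1_pvalue(U, eta_hat, K=2, J=200, rng=None):
    """Permutation test of Table 4.4: sort the pseudo-observations U (n x 2) by
    the predicted calibration eta_hat, split them into K equal bins, and compare
    the spread of the bin-wise Kendall taus with its permutation distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    U = np.asarray(U)[np.argsort(eta_hat)]
    bins = np.array_split(np.arange(len(U)), K)

    def spread(V):  # SD over bins of the bin-specific Kendall taus
        taus = [kendalltau(V[b, 0], V[b, 1])[0] for b in bins]
        return float(np.std(taus))

    T_obs = spread(U)
    T_perm = [spread(U[rng.permutation(len(U))]) for _ in range(J)]
    return T_obs, float(np.mean(np.array(T_perm) >= T_obs))  # permutation p-value

rng = np.random.default_rng(2)
n = 400
eta = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])  # two calibration levels

def gauss_pairs(rho, m):
    return rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=m)

U_sa = gauss_pairs(0.5, n)                                              # SA holds
U_no = np.vstack([gauss_pairs(0.0, n // 2), gauss_pairs(0.9, n // 2)])  # SA violated
T_sa, p_sa = method1_pvalue(U_sa, eta, rng=np.random.default_rng(3))
T_no, p_no = method1_pvalue(U_no, eta, rng=np.random.default_rng(3))
```

When SA is violated, the observed spread sits in the far tail of the permutation distribution and the p-value is small, exactly as the mixture argument above predicts.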
4.4 Simulations
In this section we present the performance of the proposed methods and comparisons with
generic CVML and WAIC criteria on simulated data sets. Different functional forms of
calibration function, sample sizes and magnitude of deviation from SA will be explored.
Simulation details
We generate samples of sizes n = 500 and n = 1000 from the 3 scenarios described below.
For all scenarios the Clayton copula is used to model dependence between the responses,
while the covariates are independently sampled from U[0, 1]. For all scenarios the covariate
dimension is q = 2. The marginal conditional distributions Y1|X and Y2|X are modeled as Gaussian
with constant variances $\sigma_1^2, \sigma_2^2$ and conditional means f1(X), f2(X), respectively. The model
parameters must be estimated jointly with the calibration function η(X). For convenience
we parametrize the calibration on the Kendall's tau τ(X) scale.
Sc1 $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$,
    $f_2(x) = 0.6\sin(3x_1 + 5x_2)$,
    $\tau(x) = 0.5$, $\sigma_1 = \sigma_2 = 0.2$.

Sc2 $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$,
    $f_2(x) = 0.6\sin(3x_1 + 5x_2)$,
    $\tau(x) = \delta + \gamma \times \sin(10x^T\beta)$,
    $\beta = (1, 3)^T/\sqrt{10}$, $\sigma_1 = \sigma_2 = 0.2$.

Sc3 $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$,
    $f_2(x) = 0.6\sin(3x_1 + 5x_2)$,
    $\tau(x) = \delta + \gamma \times 2(x_1 + \cos(6x_2) - 0.45)/3$,
    $\sigma_1 = \sigma_2 = 0.2$.
Sc1 corresponds to SA, since Kendall's τ is independent of the covariate level. The calibration
function in Sc2 has a single index form, while in Sc3 it has an additive structure on the τ scale
(generally not additive on the η scale); these simulations are useful for evaluating performance
under model misspecification. We note that τ in Sc2 and Sc3 depends on the parameters
δ (average correlation strength) and γ (deviation from SA), which in this study take values
δ ∈ {0.25, 0.75} and γ ∈ {0.1, 0.2}, respectively.
Simulation results
For each sample size and scenario we have repeated the analysis using 250 independently
replicated data sets. For each data set, the GP-SIM model suggested in Chapter 2 ([57]) is fitted.
Such Bayesian non-parametric models are much more flexible than parametric ones and can
effectively capture various patterns. The inference is based on 5000 MCMC samples for all
scenarios, as the chains were run for 10000 iterations with 5000 samples discarded as burn-in.
The number of inducing inputs was set to 30 for all GPs. For generic SA testing, GP-SIM
fitting is done on the whole data set and the posterior draws are used to estimate CVML,
CCVML and WAIC. Since the proposed methods require data splitting, we first randomly
divide the data equally into training and test sets. We fit GP-SIM on the training set and
then use the obtained posterior draws to construct point estimates of F1(y1i|xi), F2(y2i|xi)
and η(xi) for every observation in the test set. In Method 1 we used 500 permutations.
Table 4.6 shows the percentage of SA rejections for generic Bayesian selection criteria. The
presented results clearly illustrate that generic methods have difficulties identifying SA. This
leads to a loss of statistical efficiency since a complex model is selected over a much simpler
one. In the context of CVML or CCVML it can be explained by observing that both
Table 4.6: Simulation Results: Generic, proportion of rejections of SA for each scenario, sample size and generic criterion.

                          n = 500                   n = 1000
Scenario                  CVML   CCVML  WAIC        CVML   CCVML  WAIC
Sc1                       33.3%  31.1%  34.7%       38.2%  37.3%  37.8%
Sc2(δ=0.75, γ=0.1)        99.1%  98.7%  99.1%       100%   100%   100%
Sc2(δ=0.75, γ=0.2)        100%   100%   100%        100%   100%   100%
Sc2(δ=0.25, γ=0.1)        80.1%  84.4%  80.1%       99.1%  100%   99.1%
Sc2(δ=0.25, γ=0.2)        100%   100%   100%        100%   100%   100%
Sc3(δ=0.75, γ=0.1)        76.9%  73.3%  77.8%       85.7%  82.2%  85.8%
Sc3(δ=0.75, γ=0.2)        99.1%  97.3%  99.1%       99.1%  97.8%  99.1%
Sc3(δ=0.25, γ=0.1)        54.7%  56.4%  55.6%       65.3%  68.4%  64.9%
Sc3(δ=0.25, γ=0.2)        89.8%  92.0%  91.1%       99.6%  100%   99.6%
these measures do not penalize for the complexity of the model. Therefore a flexible calibration
function (as in GP-SIM) provides a fit similar to that of the reduced model and, as a consequence,
has similar predictive power. In addition, the SA may be of interest in itself in certain
applications, e.g. stock exchange modelling, where it is useful to determine whether the
dependence structure between different stock prices depends on other factors.

Table 4.7: Simulation Results: Proposed methods, proportion of rejections of SA for each scenario, sample size, number of bins (K) and method.

                        Permutation test                  χ2 test
                    n = 500        n = 1000         n = 500        n = 1000
Scenario            K=2    K=3    K=2    K=3        K=2    K=3    K=2    K=3
Sc1                 4.9%   6.2%   3.5%   5.3%       9.7%   11.1%  10.7%  13.7%
Sc2(δ=0.75, γ=0.1)  90.2%  80.4%  99.6%  99.1%      94.7%  94.2%  99.6%  99.1%
Sc2(δ=0.75, γ=0.2)  100%   100%   100%   100%       100%   100%   100%   100%
Sc2(δ=0.25, γ=0.1)  25.8%  18.7%  55.1%  47.1%      30.2%  21.8%  58.7%  53.8%
Sc2(δ=0.25, γ=0.2)  91.6%  84.9%  99.6%  99.6%      92.4%  91.1%  99.6%  99.6%
Sc3(δ=0.75, γ=0.1)  28.0%  24.0%  57.3%  52.9%      41.3%  45.8%  72.4%  72.9%
Sc3(δ=0.75, γ=0.2)  88.4%  85.8%  98.7%  98.7%      94.2%  92.0%  100%   99.1%
Sc3(δ=0.25, γ=0.1)  8.0%   7.5%   11.1%  10.7%      9.8%   10.7%  15.1%  12.9%
Sc3(δ=0.25, γ=0.2)  19.6%  18.2%  63.6%  60.9%      24.9%  23.6%  70.2%  69.3%

The simulations summarized in Table 4.7 show that the proposed methods (setting α = 0.05)
have a much smaller probability of Type I error, which varies around the threshold of 0.05.
It must be pointed out, however, that under SA the performance of the χ2 test worsens with the
number of bins K. This is not surprising: as K increases, the number of observations
in each bin goes down and the normal approximation for the distribution of the Pearson correlation
becomes tenuous, while the permutation-based test is more robust to small samples. The
performance of both methods improves with sample size. We also notice a loss of power
between Scenarios 2 and 3, which is due to model misspecification, since in the latter case
the generative model is different from the postulated one. All methods break down when
the departure from SA is not large, e.g. γ = 0.1. Although not desirable, this has limited
impact in practice since, in our experience, in this case the predictions produced by either
model are very similar.
4.5 Theoretical justification
In this section we prove that under canonical assumptions, the probability of Type I error
for Method 2 in Section 4.3 converges to α when SA is true.
Suppose we have independent samples from K populations (groups),
$(u^1_{1i}, u^1_{2i})_{i=1}^{n_1} \sim (U^1_1, U^1_2), \ldots, (u^K_{1i}, u^K_{2i})_{i=1}^{n_K} \sim (U^K_1, U^K_2)$;
the goal is to test ρ1 = . . . = ρK (here ρ is the Pearson correlation).
To simplify notation, we assume $n_1 = \cdots = n_K = n$. Let $\hat\rho = (\hat\rho_1, \ldots, \hat\rho_K)$ be the vector
of sample correlations, $\hat\Sigma = \mathrm{diag}\big((1-\hat\rho_1^2)^2, \ldots, (1-\hat\rho_K^2)^2\big)$, and let the $(K-1) \times K$ matrix A be as
defined in Section 4.3; then canonical asymptotic results imply that if ρ1 = . . . = ρK, then as
n → ∞,
\[
T = n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho) \xrightarrow{d} \chi^2_{K-1}. \quad (4.7)
\]
Based on the model fitted on D1, we define estimates of F1(y1i|xi) and F2(y2i|xi) by
$U = \{\bar U_i = (\bar F_1(y_{1i}|x_i), \bar F_2(y_{2i}|x_i))\}_{i=1}^{n_2}$. Note that U depends on D1 and X (the covariates in the test set).
Given a fixed number of bins K and assuming, without loss of generality, equal sample sizes
$n = n_2/K$ in each bin, we define a test statistic T(U) as in (4.7) with $\hat\rho_j$ estimated from
$\{\bar U_{(j-1)n+1}, \ldots, \bar U_{jn}\}$, for 1 ≤ j ≤ K.
Note that in Method 2, test cases are assigned to "bins" based on the value of the predicted
calibration function $\bar\eta(x_i)$, which is not taken into account in the generic definition of the test
statistic T(U) above. To close this gap, we introduce a permutation λ∗ : {1, . . . , n2} →
{1, . . . , n2} that "sorts" U from the smallest $\bar\eta(x)$ value to the largest, i.e. $U_{\lambda^*} = \{\bar U_{\lambda^*(i)}\}_{i=1}^{n_2}$ with
$\bar\eta(x_{\lambda^*(1)}) < \bar\eta(x_{\lambda^*(2)}) < \cdots < \bar\eta(x_{\lambda^*(n_2)})$. Hence, the test statistic in Method 2 has the form $T(U_{\lambda^*})$
as in (4.7), but in this case the test cases with the smallest predicted calibrations are assigned to the
first group, or bin, and those with the largest calibrations to the Kth group/bin. Finally, define a test
function φ with specified significance level α to test SA:
\[
\varphi(U \mid \lambda^*) =
\begin{cases}
1 & \text{if } T(U_{\lambda^*}) > \chi^2_{K-1}(1-\alpha), \\
0 & \text{if } T(U_{\lambda^*}) \le \chi^2_{K-1}(1-\alpha).
\end{cases} \quad (4.8)
\]
Intuitively, if SA is false then we would expect $T(U_{\lambda^*})$ to be larger than the critical value
$\chi^2_{K-1}(1-\alpha)$.
The goal is to show that this procedure has probability of Type I error equal to α, which
equals the expectation of the test function:
\[
P(\text{Type I error}) = \int \varphi(U \mid \lambda^*)\, P(\lambda^* \mid \mathcal{D}_1, X)\, P(U \mid \mathcal{D}_1, X)\, P(\mathcal{D}_1)\, P(X)\, dU\, d\mathcal{D}_1\, dX\, d\lambda^*. \quad (4.9)
\]
Note that λ∗ does not depend on U because of the splitting of the data into training and test sets. Also,
usually P(λ∗|D1, X) is just a point mass at some particular permutation. In general the
above integral cannot be evaluated; however, if we assume that, for all test cases,
\[
\bar F_1(y_{1i} \mid x_i) \xrightarrow{p} F_1(y_{1i} \mid x_i) \quad \text{as } n \to \infty \ \forall i,
\]
\[
\bar F_2(y_{2i} \mid x_i) \xrightarrow{p} F_2(y_{2i} \mid x_i) \quad \text{as } n \to \infty \ \forall i, \quad (4.10)
\]
then under SA and as n → ∞, $P(U \mid \mathcal{D}_1, X) = P(U) \approx \prod_{i=1}^{n_2} c(u_{1i}, u_{2i})$, where c is a copula
density, and the expectation becomes:
\[
\begin{aligned}
P(\text{Type I error}) &= \int \varphi(U \mid \lambda^*)\, P(\lambda^* \mid \mathcal{D}_1, X)\, P(U)\, P(\mathcal{D}_1)\, P(X)\, dU\, d\mathcal{D}_1\, dX\, d\lambda^* \\
&= \int \left( \int \varphi(U \mid \lambda^*)\, P(U)\, dU \right) P(\lambda^* \mid \mathcal{D}_1, X)\, P(\mathcal{D}_1)\, P(X)\, d\mathcal{D}_1\, dX\, d\lambda^* = \alpha, \quad (4.11)
\end{aligned}
\]
since, if SA is true, $\int \varphi(U \mid \lambda^*)\, P(U)\, dU = \alpha$ for any λ∗. Therefore, if the marginal CDF predictions
for the test cases are consistent, then this procedure has the required probability of Type I error
for a sufficiently large sample size.
4.6 Extensions to other models
The proposed idea of dividing the data into training and test subsets, splitting the observations
in the test set into bins, and then using a test to check for distributional differences between bins
can be extended to other models. For example, one can use a similar construction in mul-
tiple mean regression, logistic or quantile regression problems to check SA. The proposed
approaches for assessing SA can be particularly useful when f(X) (the conditional mean or
quantile) is assumed to have a complex form, and flexible models such as generalized addi-
tive models, additive tree structures, or non-parametric (including Bayesian non-parametric)
methods are utilized. Simulations we conducted suggest that generic testing procedures yield
large Type I error probabilities for non-parametrically fitted models, a problem that is attenu-
ated using the permutation-based ideas described in this chapter. Below we describe how to
adapt Methods 1 and 2 to regression problems and conduct a series of simulations to com-
pare the performances (probability of Type I error and power) of the proposed algorithms with
standard testing procedures used in the literature. Also note that, in contrast to the Bayesian view
adopted for the conditional copula problems, we use the frequentist paradigm for the regression
problems below.
4.6.1 Multiple Regression
Multiple regression with Gaussian errors is used extensively in applied statistics for its sim-
plicity and theoretical properties. Here we assume that the errors are identically distributed
(no heteroskedasticity). It is frequently of interest to test whether all the predictors (or
covariates) contribute to the prediction of the response. We therefore define the "full" model as
the model that contains all the relevant covariates, while the "reduced" or SA model contains only
an intercept. For linear multiple regression we could use the global F test [28], with known
distribution under SA, but as will be shown, when more general models for the conditional mean
are fit, the generic F test no longer exhibits the correct significance level.
For this set of simulations we assume that $y_i = f(X_i) + \varepsilon_i$ for $i = 1, \cdots, n$, with $\varepsilon_i$
independent and identically distributed. We generate samples of sizes n = 500 and n = 1000
from the 5 scenarios described below. For all scenarios the covariates are independently sampled
from U[0, 3] with covariate dimension q = 3.
Sc1 $f(x) = 1$; $\varepsilon_i \sim \mathcal{N}(0, 9)$.

Sc2 $f(x) = \gamma \times (x_1^2 + 3x_2 - 2x_3^3 + 6)/3 + 1$; $\varepsilon_i \sim \mathcal{N}(0, 9)$.

Sc3 $f(x) = \gamma \times (x_1x_2x_3 + x_1x_2^2 - 7.7)/3 + 1$; $\varepsilon_i \sim \mathcal{N}(0, 9)$.

Sc4 $f(x) = 1$; $\varepsilon_i \sim \mathrm{Cauchy}(0, 1)$.

Sc5 $f(x) = \gamma \times (x_1x_2x_3 + x_1x_2^2 - 7.7) + 1$; $\varepsilon_i \sim \mathrm{Cauchy}(0, 1)$.
Sc1 and Sc4 correspond to SA, as the conditional means do not depend on the covariates. Sc2
and Sc3 represent nonlinear models with Gaussian errors; the former has an additive structure,
while the latter has interactions between covariates. Note that both depend on the parameter γ,
which controls the deviation from SA. Sc4 and Sc5 include errors from the Cauchy distribution, which has
much heavier tails than the Gaussian. These scenarios are useful for evaluating the performance of
the testing algorithms when the assumption of normality is violated.
To apply the proposed testing procedures to this regression problem we do the following:
divide the whole data set into training and test sets (half to each set, so that $n_1 = n_2 = \lfloor n/2 \rfloor$)
and fit a flexible model on the training set; here we use a Generalized Additive Model (GAM)
with cubic splines for each component (the penalty is also estimated in the same fit). With the
estimated parameters, form predictions $\bar f(x)$ of the function f on the test set (analogous to the
estimated calibration function in the conditional copula problem) and, based on $\bar f(x)$, split the
test set into K bins so that the number of points in each bin is $n = n_2/K$. Here we can implement
either the Chi-square or the permutation approach. For the Chi-square test (Method 2), let
\[
\vec{y} = (\bar y_1, \cdots, \bar y_K)^T,
\]
where $\bar y_k$ is the average of the responses in bin $k \in \{1, \cdots, K\}$. Then, for regression with Gaussian
noise, it follows from standard normal theory that under SA:
\[
n(A\vec{y})^T (A\Sigma A^T)^{-1} (A\vec{y}) \sim \chi^2_{K-1}, \quad (4.12)
\]
where A is as in (4.6) and $\Sigma = \mathrm{diag}(\sigma^2, \cdots, \sigma^2)$, with $\sigma^2$ estimated from the GAM fit on the
training set. Note that this result is approximately true for non-Gaussian noise (with finite
variance) and sufficiently large n by the Central Limit Theorem, and it can be used to assess
evidence against SA. The permutation test (Method 1) is constructed by first finding the observed
test statistic $T^{\mathrm{obs}} = \bar y_{(K)} - \bar y_{(1)}$ (largest minus smallest bin average), and then, by permuting the
responses, computing the proportion of permutations whose permuted test statistics $T_j$, $j = 1, \cdots, J$,
are greater than $T^{\mathrm{obs}}$. This proportion is an estimate of the p-value, and if it is less than a
pre-specified α, SA is rejected. For all scenarios we set the significance level α = 0.05.
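The permutation version just described can be sketched as follows. This is an illustration only: the names are hypothetical and the predicted conditional mean is stood in for by the covariate itself rather than an actual GAM fit:

```python
import numpy as np

def regression_perm_test(y, f_hat, K=3, J=500, rng=None):
    """Method 1 for regression: bin the test responses by the predicted
    conditional mean f_hat, use T = largest minus smallest bin average, and
    compare T with its permutation distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = np.asarray(y, dtype=float)[np.argsort(f_hat)]
    bins = np.array_split(np.arange(len(y)), K)

    def stat(yy):  # range of the bin averages
        means = [yy[b].mean() for b in bins]
        return max(means) - min(means)

    T_obs = stat(y)
    T_perm = [stat(y[rng.permutation(len(y))]) for _ in range(J)]
    return float(np.mean(np.array(T_perm) >= T_obs))  # estimated p-value

rng = np.random.default_rng(4)
x = rng.uniform(0, 3, 600)
y_signal = 2.0 * x + rng.normal(0, 1, 600)            # SA violated: mean depends on x
p = regression_perm_test(y_signal, f_hat=x, rng=rng)  # x stands in for a fitted GAM prediction
```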
We compare these methods with the following generic approaches to SA testing. Since we
focus on a non-linear conditional mean, the standard procedure is to fit a flexible non-parametric
model (on the whole data set), find SSEfull (the residual sum of squares) and the degrees of freedom
DFfull, and compare them with SSEred and DFred = 1 of the model with only an intercept. This
is an instance of the partial F statistic, which has an exact F distribution under SA if the fitted
model is linear in its parameters:
\[
F^* = \frac{(\mathrm{SSE}_{\mathrm{red}} - \mathrm{SSE}_{\mathrm{full}})/(\mathrm{DF}_{\mathrm{full}} - \mathrm{DF}_{\mathrm{red}})}{\mathrm{SSE}_{\mathrm{full}}/(n - \mathrm{DF}_{\mathrm{full}})} \sim F(\mathrm{DF}_{\mathrm{full}} - \mathrm{DF}_{\mathrm{red}},\, n - \mathrm{DF}_{\mathrm{full}}), \quad (4.13)
\]
where F(a, b) represents the F-distribution with a and b degrees of freedom for the numerator and
denominator, respectively. We denote this approach by (F-test). Note that the above exact
distribution may not be valid when a non-parametric model such as a Generalized Additive model,
Random Forest, Support Vector machine or Gaussian process is fitted. That is why we also consider
a less realistic test (F-exact), where the observed test statistic F∗ is compared to the exact
critical value (corresponding to level α), which is estimated by repeatedly (M times)
generating the response vector Y under SA (keeping the covariates fixed), each time fitting the "full"
and "reduced" models, and calculating the test statistic F∗. Once we have the
approximate distribution of the test statistic F∗ under SA, we can estimate the critical value by
taking the empirical (1 − α) quantile. For the simulations below we set M = 200; note that this
approach may not be feasible in practice when the dimensionality of the data is large and/or
fitting the full model is computationally costly. We also show the performances of the AIC and BIC
criteria, where, after fitting the "full" and "reduced" models, the model with the smallest criterion
value is selected. Since we assume that the errors are identically distributed, it follows that in
these examples SA implies independence of the covariates and the response; we therefore introduce a
Bootstrap test for independence, which again is usually not feasible in real-world problems because of its
computational cost. For this test, given pairs $\{y_i, x_i\}_{i=1}^n$, we fit the "full" model and
calculate some measure $T^{\mathrm{obs}}$; then we consider J permutations $\lambda_j : \{1, \cdots, n\} \rightarrow \{1, \cdots, n\}$,
$j = 1, \cdots, J$. For each j we fit the "full" model to $\{y_{\lambda_j(i)}, x_i\}_{i=1}^n$ and calculate $T_j$; finally, we reject
SA (or independence) if $T^{\mathrm{obs}}$ is greater than the (1 − α) quantile of $\{T_j\}_{j=1}^J$. For the discrepancy
measure T we consider −SSEfull (BOOT-SSE) and F∗ (BOOT-F) as in Eq. (4.13); note that
these measures are "large" when SA is false. Similar to the (F-exact) test, these tests require
many estimations of the complicated (or full) model; for all simulations we set J = 100.
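For reference, the partial F statistic in (4.13) and its nominal p-value reduce to a few lines given the two fits; the function name and the SSE and degree-of-freedom values below are made up for illustration:

```python
from scipy.stats import f as f_dist

def partial_f(sse_red, sse_full, df_red, df_full, n):
    """F* of Eq. (4.13) and its p-value under the exact F distribution."""
    num = (sse_red - sse_full) / (df_full - df_red)
    den = sse_full / (n - df_full)
    F_star = num / den
    pval = f_dist.sf(F_star, df_full - df_red, n - df_full)
    return F_star, pval

# Made-up SSEs: a large drop from the reduced to the full model yields a large
# F* and a small p-value, i.e. evidence against SA.
F_star, pval = partial_f(sse_red=900.0, sse_full=450.0, df_red=1, df_full=10, n=500)
```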
We set the "full" model to be a GAM with cubic splines for each covariate and estimation of
the degree of smoothness. For better comparison, in addition to fitting the flexible GAM we
also fit additive multiple regressions with polynomial degrees of 1, 3 and 10 for each predictor.
The SA is tested with the standard global F-test in (4.13).
For every scenario we generate 500 sets of responses (keeping the covariates fixed) and test the
simplifying assumption using all of the described procedures with significance level α = 0.05.
Results:
Table 4.8: Simulation Results for Regression: Generic, proportion of rejections of SA for each scenario, sample size and generic criterion.

                                               Bootstrap         Poly, F-test
Scenario      F-test  F-exact  AIC     BIC     -SSE   F-test    Deg=1   Deg=3   Deg=10

n = 500
Sc1           29.8%   3.6%     29.8%   0.0%    5.0%   5.6%      5.8%    4.2%    3.0%
Sc2(γ=0.05)   56.4%   20.6%    59.2%   0.2%    9.4%   24.6%     30.2%   21.0%   10.8%
Sc2(γ=0.11)   98.6%   91.8%    99.4%   22.0%   55.6%  92.0%     94.8%   89.8%   66.2%
Sc3(γ=0.1)    56.4%   28.8%    63.2%   0.6%    8.2%   25.6%     34.8%   19.6%   11.2%
Sc3(γ=0.2)    95.6%   83.8%    95.6%   15.4%   35.0%  78.6%     89.0%   75.6%   47.2%
Sc4           27.2%   6.8%     28.4%   0.0%    5.4%   4.2%      2.6%    7.2%    8.0%
Sc5(γ=0.1)    27.8%   16.0%    31.8%   0.2%    7.4%   9.0%      8.6%    10.2%   9.8%
Sc5(γ=0.9)    81.2%   76.8%    84.2%   52.4%   60.8%  73.2%     75.2%   71.0%   65.0%

n = 1000
Sc1           27.2%   3.8%     26.6%   0.0%    2.8%   4.8%      3.8%    4.0%    5.0%
Sc2(γ=0.05)   76.4%   46.6%    81.4%   1.0%    17.8%  45.8%     55.0%   43.0%   23.6%
Sc2(γ=0.11)   100.0%  99.6%    100.0%  61.6%   96.2%  99.4%     100.0%  100.0%  97.4%
Sc3(γ=0.1)    75.0%   53.2%    79.4%   2.0%    18.0%  43.8%     56.8%   38.0%   20.4%
Sc3(γ=0.2)    100.0%  99.2%    100.0%  39.6%   78.0%  97.8%     99.8%   98.0%   84.2%
Sc4           22.4%   3.0%     23.6%   0.0%    4.4%   4.6%      1.6%    5.4%    10.2%
Sc5(γ=0.1)    26.8%   7.0%     33.4%   0.2%    5.0%   8.0%      9.2%    8.4%    8.2%
Sc5(γ=0.9)    84.0%   76.8%    85.6%   53.4%   62.4%  77.6%     79.0%   75.6%   67.4%
Tables 4.8 and 4.9 show the proportion of SA rejections (out of 500 replicates, at signif-
icance level α = 0.05) for the generic and proposed tests, respectively. In general we should select
a procedure that attains 5% rejections for Sc1 and Sc4 and has the largest proportion of
rejections for the other scenarios (i.e., the highest power). First, we immediately notice that the generic
F-test and AIC have a much larger probability of Type I error than the expected 5%; it is
therefore misleading to compare their power with that of the other tests. BIC has no rejections under SA,
but its power is very small, indicating that the penalty BIC uses is too large in
this example. Global F-tests for polynomial models have an acceptable probability of Type I
Table 4.9: Simulation Results for Regression: Proposed methods, proportion of rejections of SA for each scenario, sample size and number of bins.

                 Permutation test                χ2 test
             n = 500      n = 1000          n = 500      n = 1000
Scenario     K=2    K=3   K=2    K=3        K=2    K=3   K=2    K=3
Sc1          4.6%   4.8%  5.0%   5.0%       4.4%   4.8%  6.2%   5.4%
Sc2(γ=0.05)  11.2%  10.8% 16.8%  16.4%      12.8%  13.2% 17.4%  17.2%
Sc2(γ=0.11)  44.4%  46.0% 85.4%  87.8%      48.0%  51.6% 86.6%  90.4%
Sc3(γ=0.1)   9.6%   8.4%  16.2%  14.8%      9.2%   9.2%  17.2%  15.6%
Sc3(γ=0.2)   34.0%  32.0% 65.2%  66.0%      36.4%  34.6% 67.8%  68.8%
Sc4          4.8%   4.4%  2.4%   5.6%       26.8%  30.4% 28.6%  31.8%
Sc5(γ=0.1)   8.8%   6.4%  8.4%   10.2%      28.0%  32.0% 29.2%  34.0%
Sc5(γ=0.9)   76.0%  78.4% 78.0%  82.6%      69.2%  72.2% 73.4%  76.6%
error (as predicted by the theory); however, their power depends heavily on the polynomial
order, and in real problems it is usually impossible to choose the correct order for each compo-
nent. F-exact and the BOOT tests for independence have the required probability of Type I
error, with the largest power produced by F-exact and BOOT-F. Note that for F-exact we
simulate (under SA) from the exact distribution (normal for Sc1–Sc3 and Cauchy(0, 1) for
Sc4 and Sc5), which is usually unknown in practice.
Focusing now on Table 4.9, we first note that the performances of Method 1 and Method 2 are
very similar, except for Sc4 and Sc5, where the χ2 test should not be used since the averages in
each bin do not follow normal distributions in these scenarios. The number of bins K does not
exhibit a strong relationship with performance either. When SA is true (Sc1 and Sc4),
the probability of Type I error for the proposed methods is around 5%, as expected. Again, the
χ2 test should not be applied for Sc4, as the Cauchy distribution has undefined mean and
variance. It is also evident that the power of these tests increases with the sample size. For
Sc2 and Sc3, Methods 1 and 2 clearly underperform F-exact and BOOT-F in power, but
show power similar to BOOT-SSE. When Cauchy errors are used in Sc5, Method 1 has
power similar to F-exact and BOOT-F.
It should not be surprising that Methods 1 and 2 generally have smaller power than the compu-
tationally expensive bootstrap tests, since only half of the sample (the test set) is used to decide
the rejection of SA.
4.6.2 Logistic Regression
Logistic regression is probably the most frequently used statistical tool for analyzing the
dependence of a binary response variable on a set of covariates. Suppose we
observe independent pairs {yi, xi}ni=1 with the following generative process:

    πi = 1 / (1 + exp(−f(Xi))),   i = 1, · · · , n,
    Yi ∼ Bern(πi),                i = 1, · · · , n,        (4.14)
where Bern(p) denotes a Bernoulli random variable with probability of success p. The main
objective is to estimate the unknown function f(X) and to check whether f(X) is independent
of X and therefore constant (SA). As in multiple regression, we can apply the proposed
approaches for assessing data support for SA to logistic regression. As before, we define a
flexible, typically non-parametric, model as the "full" model and the model with only an
intercept as the "reduced" model.
Consider the following four simulation scenarios:
Sc1  f(x) = 0;  x ∈ R^2,
Sc2  f(x) = γ × (x1 + x2^2 − 0.34),
Sc3  f(x) = 0;  x ∈ R^10,
Sc4  f(x) = γ × (sin(2x1) + cos(x2) + 2x3x4 − 2 cos(x5 + 2x6 − x7) + x8^2 + x9^3 − 2x10^2 + 0.12).
We generate samples of sizes n = 500 and n = 1000 from these four scenarios; the covariates
are independently sampled from U[−1, 1], with covariate dimension q = 2 for the first two
scenarios and q = 10 for Sc3 and Sc4. Sc1 and Sc3 correspond to SA, as the probability of
Y = 1 is set to 0.5 and does not depend on the covariates. In Sc2 the function f(X) has an
additive structure, while Sc4 represents a nonlinear, non-additive dependence of the probability
on the predictors. Note that Sc2 and Sc4 have an additional parameter γ which controls the
deviation from SA.
To apply the proposed testing procedures to this logistic problem we proceed as follows: divide
the data set into training and test sets (half to each, so that n1 = n2 = ⌊n/2⌋); fit a
flexible model on the training set, here a Generalized Additive Model (GAM) with
cubic splines for each component; with the estimated parameters, compute the prediction f̂(x)
of the function f on the test set (analogous to the estimated calibration function in conditional
copula problems); and, based on f̂(x), split the test set into K bins so that the number of
points in each bin is n2/K.
Here we can implement either the Chi-square (Method 2) or the permutation (Method 1)
approach.
For the Chi-square test, let

    ~p = (p1, · · · , pK)T,

where pk is the sample proportion (average of the responses) in bin k ∈ {1, · · · , K}. If n is
large enough, then by the Central Limit Theorem the following approximate distributional
result holds under SA:

    n (A~p)T (AΣAT)−1 (A~p) ·∼ χ²_{K−1},        (4.15)

where A is as in Eq (4.6) and Σ = diag(p0(1 − p0), · · · , p0(1 − p0)), with p0 estimated as
the sample proportion of all the responses in the test set. This result can be used to assess
evidence against SA.
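As an illustration, the statistic (4.15) can be sketched in a few lines of Python. This is only a sketch under assumptions the text leaves open here: the contrast matrix A of Eq (4.6) is taken to be a successive-difference matrix, the scaling n is taken as the common bin size, and all function names are ours.

```python
import numpy as np
from scipy.stats import chi2

def binned_chi2_test(f_hat, y_test, K):
    """Method 2 sketch: order test points by the predicted f_hat, split them
    into K (nearly) equal bins, and test equality of bin proportions under SA."""
    n2 = len(y_test)
    bins = np.array_split(np.argsort(f_hat), K)         # ~n2/K points per bin
    p = np.array([y_test[idx].mean() for idx in bins])  # bin proportions
    p0 = y_test.mean()                                  # pooled proportion under SA
    nb = n2 // K                                        # common bin size
    A = np.eye(K - 1, K) - np.eye(K - 1, K, k=1)        # successive-difference contrasts
    Sigma = p0 * (1 - p0) * np.eye(K)
    Ap = A @ p
    stat = nb * Ap @ np.linalg.solve(A @ Sigma @ A.T, Ap)
    return stat, chi2.sf(stat, df=K - 1)                # statistic and p-value
```

Under a strong dependence of the success probability on `f_hat`, the returned p-value is small; under SA it is approximately uniform.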
The permutation test is performed by first computing the observed test statistic T_obs = p(K) − p(1)
(largest minus smallest sample proportion) and then, by permuting the responses, computing the
proportion of permutations with test statistics Tj, j = 1, · · · , J, greater than T_obs (here J is the
number of permutations). This proportion is an estimate of the p-value, and if it is less than α,
SA is rejected. For all scenarios we set the significance level to α = 0.05.
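The permutation version (Method 1) admits an equally short sketch; variable and function names are of our own choosing.

```python
import numpy as np

def permutation_sa_test(f_hat, y_test, K, J=499, seed=0):
    """Method 1 sketch: T = largest minus smallest bin proportion; the null
    distribution under SA is obtained by permuting the test responses."""
    rng = np.random.default_rng(seed)
    bins = np.array_split(np.argsort(f_hat), K)   # bins defined by f_hat

    def spread(y):
        props = [y[idx].mean() for idx in bins]
        return max(props) - min(props)            # T = p_(K) - p_(1)

    t_obs = spread(y_test)
    t_perm = [spread(rng.permutation(y_test)) for _ in range(J)]
    p_value = np.mean([t >= t_obs for t in t_perm])  # estimated p-value
    return t_obs, p_value
```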
We compare these algorithms with the following generic approaches to SA testing. Since we
focus on a non-linear function f(X), the standard procedure is to fit a flexible "full" model
(on the whole data set), find DEVfull (the deviance of this model) and its degrees of freedom
DFfull, and compare them with DEVred and DFred of the model with only an intercept; see,
for example, [27]. Here the deviance is defined as −2 times the log-likelihood, and the deviance
of the "full" model must be smaller than that of the "reduced" one. This is an example of the
likelihood ratio test, which has an approximate χ² distribution under SA:

    T* = DEVred − DEVfull ·∼ χ²_{DFfull−DFred}.        (4.16)
We denote this approach by T-test. Note that the above result is generally used for
nested models in which the full model has a parametric form; as will be illustrated later, it may
not be valid when a non-parametric model such as a Generalized Additive model, Random Forest
or Gaussian process is fitted for f(X). That is why we also consider a less realistic test,
T-exact, in which the observed test statistic T* is compared with the exact critical value
(corresponding to level α). The critical value is estimated by repeatedly (M times) generating
responses Y under SA (keeping the covariates fixed), each time fitting the "full" and "reduced"
models and calculating the test statistic T*. Once we obtain an approximate distribution of
T* under SA, we estimate the critical value as its (1 − α) quantile.
For the simulations below we set M = 200; note that this approach may not be feasible in
practice when the dimensionality of the data is large and the "full" model is computationally
expensive to fit. We also show the performance of the AIC and BIC criteria: after fitting the
"full" and "reduced" models, the model with the smallest criterion value is selected. Since
we assume that the Yi are independent, SA implies independence in this example; we therefore
also introduce a generic bootstrap test for independence, which again may not be feasible in
real-world problems. For this test, given pairs {yi, xi}ni=1, we fit the "full" model and
calculate a measure T_obs; we then consider J permutations λj : {1, · · · , n} → {1, · · · , n},
j = 1, · · · , J. For each j we fit the same model to {yλj(i), xi}ni=1 and calculate Tj; finally,
we reject SA (or independence) if T_obs is greater than the (1 − α) quantile of {Tj}Jj=1. For
the discrepancy measure T we consider LogLikfull (BOOT-LL) and T* (BOOT-T) as in Eq (4.16);
note that these measures are "large" when SA is false. As with the T-exact test, these
independence tests require many fits of the "full" model; for all simulations we set J = 100.
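The generic permutation bootstrap can be written as a reusable wrapper. Purely for illustration, the discrepancy below is the −SSE of an ordinary least squares fit (the BOOT-SSE measure from the regression section) rather than the logistic log-likelihood, and all names are ours.

```python
import numpy as np

def bootstrap_independence_test(x, y, discrepancy, J=100, seed=0):
    """Refit the 'full' model on permuted responses J times and compare the
    observed discrepancy with the permutation distribution."""
    rng = np.random.default_rng(seed)
    t_obs = discrepancy(x, y)
    t_perm = np.array([discrepancy(x, rng.permutation(y)) for _ in range(J)])
    return t_obs, np.mean(t_perm >= t_obs)        # statistic, p-value estimate

def neg_sse(x, y):
    """-SSE of an OLS fit; 'large' (close to zero) when y depends on x."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return -np.sum((y - X @ beta) ** 2)
```

Any model fit that returns a scalar discrepancy (e.g., a GAM log-likelihood) can be plugged in as `discrepancy`, at the cost of J refits.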
Here we set the "full" model to be a GAM with cubic splines and estimation of the smoothing
parameter for each predictor. For better comparison, in addition to fitting the flexible GAM
we also fit simple additive models with polynomial degrees 1, 3 and 10 for every covariate;
for these, SA is tested with the standard global T-test as in (4.16).
For every scenario we generate 500 sets of responses (keeping the covariates fixed) and test the
simplifying assumption using all of the described procedures at significance level α = 0.05.
Results:
Table 4.10 presents the proportion of rejections for the generic testing procedures described
above. First we examine the probability of Type I error by looking at Sc1 and Sc3. The
standard likelihood ratio test (T-test) has a very large rejection rate for these scenarios; it
even exceeds 74% for Sc3. Note that the error increases with the dimensionality of the
covariates. AIC also has a much larger probability of Type I error than the expected 5%. BIC,
on the other hand, produces a very small rejection rate under SA, but its power for the other
scenarios is the lowest of all the criteria. The likelihood ratio test works quite well for the
polynomial models (except for degree 10), as predicted by the theory. However, as in the
regression example, the power of the polynomial models depends significantly on the degree,
and choosing the correct degree for each
Table 4.10: Simulation Results for Logistic Regression: Generic, proportion of rejections of
SA for each scenario, sample size and generic criteria.

                                                  Bootstrap         Poly, T-test
Scenario         T-test   T-exact  AIC     BIC    LogLik  T-stat  deg=1  deg=3  deg=10
n = 500
Sc1              24.4%    5.6%     28.2%   0.4%   5.8%    5.8%    4.4%   4.8%   7.0%
Sc2(γ = 0.1)     23.8%    5.0%     33.2%   0.0%   5.8%    5.8%    6.8%   7.2%   4.8%
Sc2(γ = 0.55)    96.4%    44.0%    98.4%   23.2%  43.2%   43.2%   85.0%  82.2%  58.0%
Sc3              74.4%    7.0%     26.8%   0.0%   4.0%    4.0%    5.6%   6.4%   24.2%
Sc4(γ = 0.1)     79.8%    9.2%     30.8%   0.0%   7.0%    7.0%    6.2%   7.4%   25.4%
Sc4(γ = 0.33)    98.6%    32.4%    88.6%   0.0%   26.0%   26.0%   40.8%  68.2%  67.6%
n = 1000
Sc1              25.0%    5.6%     29.8%   0.0%   3.8%    3.8%    5.6%   6.2%   4.4%
Sc2(γ = 0.1)     32.6%    8.0%     40.8%   0.6%   7.6%    7.6%    12.8%  10.4%  9.8%
Sc2(γ = 0.55)    99.8%    95.0%    100.0%  67.8%  93.6%   93.6%   99.6%  99.6%  96.0%
Sc3              73.2%    5.8%     22.2%   0.0%   6.0%    6.0%    4.8%   5.4%   12.6%
Sc4(γ = 0.1)     84.0%    11.2%    44.4%   0.0%   11.0%   11.0%   13.6%  16.0%  20.4%
Sc4(γ = 0.33)    100.0%   68.0%    99.2%   0.0%   70.0%   70.0%   79.0%  94.6%  80.6%
Table 4.11: Simulation Results for Logistic Regression: Proposed methods, proportion of
rejections of SA for each scenario, sample size and number of bins.

                   Permutation test               χ² test
                   n = 500       n = 1000       n = 500       n = 1000
Scenario           K=2    K=3    K=2    K=3    K=2    K=3    K=2    K=3
Sc1                7.4%   5.0%   7.8%   6.8%   4.4%   4.2%   6.2%   6.4%
Sc2(γ = 0.1)       9.0%   5.2%   9.2%   6.4%   6.4%   4.0%   8.8%   6.6%
Sc2(γ = 0.55)     44.8%  35.8%  86.6%  84.8%  38.8%  35.6%  85.0%  85.6%
Sc3                7.4%   7.0%   7.2%   5.2%   6.2%   7.2%   6.6%   4.8%
Sc4(γ = 0.1)       6.4%   5.4%   8.2%   8.2%   4.4%   5.8%   7.4%   7.8%
Sc4(γ = 0.33)     19.2%  16.0%  51.6%  46.2%  16.2%  17.0%  48.6%  46.8%
predictor can be a very challenging task in real problems. T-exact and the bootstrap procedures
perform very similarly, with correct probability of Type I error and the largest power.
Next we focus on Table 4.11, which shows the proportion of rejections of the proposed SA tests
for all scenarios and sample sizes. First we notice that under SA (Sc1 and Sc3) the
probability of rejection for Methods 1 and 2 is around 5%, as required, and the performances
of the two tests are quite similar. The power of both methods for Sc2 is comparable with
that produced by T-exact, BOOT-LL and BOOT-T. For Sc4, with its large covariate
dimension, Methods 1 and 2 show smaller power than the best generic procedures.
Based on all these observations we can conclude that T-exact and the bootstrap procedures
generally work quite well, but they are computationally expensive and, for complicated
fitted models, may not even be feasible. For such problems the proposed SA assessment can be
appropriate, as the flexible model must be fitted only once, on the training set, which contains
only half of the original sample.
4.6.3 Quantile Regression
Suppose, similarly to multiple regression, we observe independent pairs {yi, xi}ni=1 with yi ∈ R
and xi ∈ Rq. Given a quantile τ ∈ (0, 1), the objective of quantile regression is to estimate
Qτ(Y|X) = fτ(X), the τ-quantile of the response Y as a function of the covariates [50]. For
example, if τ = 0.5 then quantile regression aims to approximate the conditional median of
the response.
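The estimation target can be made concrete with the check (pinball) loss ρτ(u) = u(τ − 1{u < 0}), whose minimizer over constants is a sample τ-quantile. A minimal numerical sketch (function names are ours):

```python
import numpy as np

def check_loss(u, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def quantile_by_minimization(y, tau):
    """The constant q minimizing sum_i rho_tau(y_i - q) is a sample
    tau-quantile; the objective is piecewise linear in q, so a minimizer
    can be found among the observed values themselves."""
    losses = [np.sum(check_loss(y - q, tau)) for q in y]
    return y[int(np.argmin(losses))]
```

Replacing the constant q with a linear function of covariates gives exactly the regression problem in (4.17).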
Generally it is assumed that the conditional quantile has a linear (in the parameters) form:
Qτ (Y |x) = fτ (x) = β0 + β1x1 + · · ·+ βqxq. (4.17)
The unknown parameters ~β are estimated by minimizing an appropriate loss function, a problem
that can be reformulated as a linear program. Another important question is whether the
conditional τ-quantile depends on the covariates X or not, which is exactly SA in this context. In
this section we show that the proposed testing techniques can be used in quantile regression to
assess data support for SA. In the previous sections we fitted non-parametric models such
as GAMs to the data; for quantile regression there are no such built-in functions, so for
the "full" model we define an additive model with 3 B-splines for each covariate (no complexity
penalty). This model can be considered linear, since fτ(X) is a linear function of the unknown
parameters. As before, the "reduced" model is a quantile regression with only an intercept β0,
and the main objective is to check whether the "reduced" model is adequate for the observed
data. For the simulations we assume that Yi = f(Xi) + εi, i = 1, · · · , n, with the εi independent
and identically distributed. To compare the testing procedures we consider the following two
scenarios:
Sc1  f(x) = 0;  ε ∼ 5χ²_1,
Sc2  f(x) = γ × (x1^2 + 3x2 − 2x3^3 + 6)/3 + 1;  ε ∼ 5χ²_1.
Here χ²_1 denotes a Chi-square distribution with 1 degree of freedom. We generate samples of
sizes n = 500 and n = 1000 from these two scenarios; the covariates are independently sampled
from U[0, 3], with covariate dimension q = 3 for both scenarios. Sc1 corresponds to SA for all τ,
as the distribution of Y does not depend on the covariates. In Sc2 the function f(X) has an
additive structure; note that, as a consequence, the conditional quantile also has an additive
form for any τ. As in the previous sections, Sc2 depends on a parameter γ which controls the
deviation from SA.
To apply the proposed SA testing procedure to quantile regression we proceed as follows
(fixing τ): divide the data set into training and test sets (half to each, so that
n1 = n2 = ⌊n/2⌋); fit a flexible τ-quantile regression model on the training set, here an
additive model with 3 B-splines for each component (equivalently, a cubic polynomial);
with the estimated parameters, compute the prediction f̂τ(x) of the function fτ on the test set
(analogous to the estimated calibration function in conditional copula problems); and, based on
f̂τ(x), split the test set into K bins so that the number of points in each bin is n2/K. Here we
can implement either the Chi-square (Method 2) or the permutation (Method 1) approach.
For the Chi-square test we proceed as follows: first find q0τ, the "global" τ-quantile of the
responses in the whole test set. Then construct a 2 × K table with entries Tik, where T1k is
the number of responses in bin k that are larger than (or equal to) q0τ, and T2k is the number
of responses in bin k that are smaller than q0τ. Note that if SA is true then the rows and
columns of the table must be independent, so Pearson's χ² test for independence can be
applied, which in this case has an asymptotic χ²_{K−1} distribution. If the observed test
statistic

    Σ_{i,k} (Tik − Eik)² / Eik

is larger than the appropriate critical value of χ²_{K−1}, we reject SA (here Eik is the expected
count under independence); we use this test for Method 2.
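The 2 × K contingency-table version can be sketched as follows (illustrative names; Pearson's statistic with K − 1 degrees of freedom, as in the text):

```python
import numpy as np
from scipy.stats import chi2

def quantile_chi2_test(f_hat, y_test, tau, K):
    """Method 2 sketch for quantile regression: bin test points by f_hat,
    cross-classify responses as >= or < the global tau-quantile q0, and
    apply Pearson's chi-square test for independence of the 2 x K table."""
    q0 = np.quantile(y_test, tau)                  # "global" tau-quantile
    bins = np.array_split(np.argsort(f_hat), K)
    T = np.array([[np.sum(y_test[idx] >= q0) for idx in bins],
                  [np.sum(y_test[idx] < q0) for idx in bins]])
    E = T.sum(1, keepdims=True) * T.sum(0, keepdims=True) / T.sum()
    stat = np.sum((T - E) ** 2 / E)                # Pearson statistic
    return stat, chi2.sf(stat, df=K - 1)
```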
The permutation test is performed by first finding the observed test statistic T_obs = q(K) − q(1)
(largest minus smallest quantile), where qk is the τ-quantile of the responses in bin k,
k = 1, · · · , K. Then, by permuting the responses, we compute the proportion of permutations
with test statistic Tj, j = 1, · · · , J, greater than T_obs. This proportion is the estimate of the
p-value, and if it is less than α, SA is rejected. For all scenarios we set the significance level
to α = 0.05.
We compare these algorithms with the following generic approaches to SA testing. Let the
degrees of freedom of the "full" and "reduced" models be denoted by DFfull and DFred,
respectively. Similarly, if available, let the deviances of the two models be DEVfull and DEVred.
A generic SA test is based on an ANOVA decomposition (the "anova.rq" function in R),
which calculates a test statistic F* that should follow an F-distribution with
(DFfull − DFred, n − DFfull) degrees of freedom if SA is true. We denote this approach by
F-test. As will be shown, this test has a much larger probability of Type I error than expected.
That is why we also consider F-exact, where the observed test statistic F* is compared with the
exact critical value (corresponding to level α), which is estimated by repeatedly (M times)
generating responses Y under SA (keeping the covariates fixed), each time fitting the "full" and
"reduced" models and calculating the test statistic F* (from the ANOVA procedure). Once we
obtain an approximate distribution of the test statistic under SA, we estimate the critical value
as its (1 − α) quantile. For the simulations below we set M = 200. We also show the
performance of the AIC criterion: after fitting the "full" model and the "reduced" model with
only an intercept, the model with the smallest criterion value is selected. A likelihood ratio
test is also available in R ("lrtest"), which calculates the difference between the deviances as
in (4.16) and compares it with critical values of χ²_{DFfull−DFred}. Since in this example SA
for any τ implies independence of Y and X, we again introduce a generic bootstrap test for
independence. For this test, given pairs {yi, xi}ni=1, we fit the "full" model and calculate a
measure T_obs; we then consider J permutations λj : {1, · · · , n} → {1, · · · , n}, j = 1, · · · , J.
For each j we fit the "full" model to {yλj(i), xi}ni=1 and calculate Tj; finally, we reject SA (or
independence) if T_obs is greater than the (1 − α) quantile of {Tj}Jj=1. For the discrepancy
measure T we consider the negative of the sum of squared residuals, −SSEfull, and the sum of
the squared estimated coefficients (without the intercept), T = Σ_{i=1}^q βi². Note that these
measures are "large" when SA is false. As with F-exact, these tests require many fits of
complicated models; for all simulations we set J = 100.
For every scenario we generate 500 sets of responses (keeping the covariates fixed) and test the
simplifying assumption using all of the described procedures at significance level α = 0.05.
We also check SA for different quantile values τ ∈ {0.1, 0.5, 0.9}. As mentioned previously,
for the "full" model we fit an additive model with cubic polynomials for each component
(no smoothness penalty).
Results:
Table 4.12 shows the proportion of rejections (out of 500 replicated data sets) for the generic SA
tests for the different scenarios, sample sizes and quantile values τ ∈ {0.1, 0.5, 0.9}. Note that in
this example we use a simple polynomial of order 3 for every covariate (not a non-parametric
fit), so the generic F-test would be expected to attain 5% rejections under Sc1.
However, this does not happen: for all τ the probability of Type I error is larger than 15%,
Table 4.12: Simulation Results for Quantile Regression: Generic, proportion of rejections
of SA for each scenario, sample size, τ and generic criteria.

                                 n = 500                                            n = 1000
                                              Bootstrap                                          Bootstrap
Scenario        F-test  F-exact Lik Ratio AIC    -SSE   Σβi²    F-test  F-exact Lik Ratio AIC    -SSE   Σβi²
τ = 0.1
Sc1             36.0%   5.0%    0.0%      0.0%   6.2%   5.8%    36.4%   6.0%    0.0%      0.0%   4.8%   5.2%
Sc2(γ = 0.05)   100.0%  99.4%   98.2%     97.6%  8.6%   1.6%    100.0%  100.0%  100.0%    100.0% 9.6%   10.0%
Sc2(γ = 0.11)   100.0%  100.0%  100.0%    100.0% 17.6%  0.0%    100.0%  100.0%  100.0%    100.0% 38.2%  2.6%
τ = 0.5
Sc1             15.6%   5.0%    14.0%     11.0%  5.2%   5.8%    10.8%   4.8%    10.8%     8.6%   4.4%   4.2%
Sc2(γ = 0.05)   26.0%   10.4%   20.2%     15.6%  8.0%   5.6%    26.6%   16.4%   28.0%     23.4%  9.6%   7.8%
Sc2(γ = 0.11)   44.8%   19.0%   50.2%     42.4%  17.2%  7.2%    71.4%   52.0%   81.8%     78.6%  37.8%  8.6%
τ = 0.9
Sc1             44.2%   3.8%    93.4%     92.8%  5.6%   4.4%    29.2%   3.4%    95.0%     94.2%  4.6%   4.8%
Sc2(γ = 0.05)   42.0%   3.6%    94.6%     93.4%  8.4%   4.6%    32.0%   4.6%    95.6%     94.0%  12.0%  4.2%
Sc2(γ = 0.11)   45.0%   3.6%    93.8%     93.0%  17.8%  4.2%    30.8%   6.6%    96.6%     94.8%  39.4%  5.2%
Table 4.13: Simulation Results for Quantile Regression: Proposed methods, proportion of
rejections of SA for each scenario, sample size, τ and number of bins.

                    Permutation test                 χ² test
                    n = 500        n = 1000        n = 500        n = 1000
Scenario            K=2    K=3     K=2    K=3     K=2    K=3     K=2    K=3
τ = 0.1
Sc1                 4.4%   4.6%    5.8%   5.2%    4.2%   5.0%    2.2%   4.2%
Sc2(γ = 0.05)      98.4%  98.2%  100.0% 100.0%   96.4%  97.4%  100.0% 100.0%
Sc2(γ = 0.11)     100.0% 100.0%  100.0% 100.0%  100.0% 100.0%  100.0% 100.0%
τ = 0.5
Sc1                 4.0%   3.4%    5.2%   5.2%    3.8%   4.2%    4.8%   3.8%
Sc2(γ = 0.05)       8.2%   8.0%    6.6%   8.2%    6.2%   7.6%    4.8%   8.0%
Sc2(γ = 0.11)      11.8%  10.0%   22.4%  22.2%    9.6%  10.6%   15.8%  18.0%
τ = 0.9
Sc1                 4.4%   5.8%    6.2%   4.2%    3.6%   4.2%    2.4%   4.8%
Sc2(γ = 0.05)       4.4%   4.8%    6.6%   4.0%    3.2%   5.2%    3.6%   5.0%
Sc2(γ = 0.11)       5.4%   3.4%    4.8%   6.4%    3.0%   3.8%    2.0%   4.8%
which is certainly surprising. The likelihood ratio test and AIC also behave very strangely,
as their rejection rate under SA changes from 0 to 93% as τ increases from 0.1 to 0.9. This
can be explained by the observation that both of these measures require a likelihood function,
which is not assumed to have a particular form when fitting a quantile regression. Hence we
can conclude that using these built-in functions can lead to significant errors. As with the
mean and logistic regressions, F-exact and the bootstrap procedures have the required 5%
significance level. Another observation is that BOOT-SSE has much larger power than
BOOT-Σβi². The quantile value τ plays a very important role in this example. Note that for
Sc2 the power of F-exact is around 100% when τ = 0.1 and declines to about 4% when τ = 0.9.
The BOOT-SSE method, on the other hand, attains a power that is quite small but very stable
as τ changes. Note that for quantile regression, obtaining critical values for F-exact requires
simulating from the correct model under SA; this is generally not feasible, since in real-world
problems we do not know the actual distribution that generated the data.
Table 4.13 shows the proportion of rejections for the bin-based permutation and Pearson χ²
tests. The probability of Type I error for both methods is around 5% for every τ level. Again,
the number of bins K does not affect the performance, and Method 1 works slightly better here
than Method 2. When τ = 0.1, Methods 1 and 2 have power very similar to the best generic
test, F-exact (and much better than BOOT-SSE). If τ = 0.9, Methods 1 and 2 again have
power similar to F-exact, but both have much less power than BOOT-SSE. When the
conditional median (τ = 0.5) is estimated, the proposed methods have lower power than both
F-exact and BOOT-SSE.
Based on these observations we can conclude that, for quantile regression, the SA assessment
depends crucially on the τ level; no single method always works best. The two novel SA
testing procedures are much faster and more computationally manageable than the F-exact or
bootstrap tests and work very well for small quantile values; as τ increases, however,
BOOT-SSE should be implemented. Of course, this holds only in this example, where the
noise has a χ²_1 distribution, which is skewed to the right; for other scenarios we may observe
a different dependence on τ.
Chapter 5
Data Analysis
5.1 Red Wine Data
We consider the data of [18], consisting of various physicochemical tests of 1599 red variants
of the Portuguese "Vinho Verde" wine. Acidity and density are properties closely associated
with the quality of the wine and the grape, respectively. Of interest here is the dependence
pattern between 'fixed acidity' (Yfa) and 'density' (Yde) and how it changes with the values of
the other variables: 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur
dioxide', 'total sulfur dioxide', 'pH', 'sulphates' and 'alcohol', denoted
Xva, Xca, Xrs, Xch, Xfs, Xts, Xph, Xsu, Xal, respectively. Figure 5.1 shows pairwise scatter-
plots of all the original variables (responses and covariates). The response variables are linearly
transformed to have mean 0 and standard deviation 1; similarly, the covariates are trans-
formed to lie between 0 and 1.
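The two transformations are the usual standardization and min-max scaling; for concreteness, a sketch (function names are ours):

```python
import numpy as np

def standardize(y):
    """Linearly transform to mean 0 and standard deviation 1 (responses)."""
    return (y - y.mean()) / y.std()

def to_unit_interval(x):
    """Linearly transform to the [0, 1] range (covariates)."""
    return (x - x.min()) / (x.max() - x.min())
```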
5.2 Analysis and results
To select an appropriate copula family, we fit GP-SIM with 'Clayton', 'Frank', 'Gaussian',
'Gumbel' and 'T-3' (Student-t with 3 degrees of freedom) dependencies. For each model the
MCMC was run for 20,000 iterations with a 10,000-iteration burn-in period. We used 30 inducing
inputs for the marginals and the calibration function estimation (m1 = m2 = m = 30). The
resulting CVML, CCVML and WAIC values are shown in Table 5.1. All model selection measures
indicate that among the candidate copula families the most suitable one is the Gaussian.
The GP-SIM coefficients (β) fitted under the Gaussian copula family are shown in Table 5.2.
Figure 5.1: Wine Data: Pairwise scatterplots of all the variables in the analyzed data.
Table 5.1: Red Wine data: CVML, CCVML and WAIC criteria values for different models.

           Clayton  Frank  Gaussian  Gumbel  T-3
CVML       -1858    -1816  -1788     -1829   -1810
CCVML      -582     -547   -522      -558    -534
WAIC       3713     3634   3572      3656    3621
The credible intervals suggest that not all covariates may be needed to model the dependence
between the responses. For example, 'residual sugar' and 'chlorides' seem not to affect the
calibration function, so we consider a model in which they are omitted from the conditional
copula model. In all models, we include all the covariates in the marginal distributions.
For comparison, we have also fitted Gaussian GP-SIM models with only one covariate each,
and with no covariates at all (constant). The computational algorithm to fit GP-SIM when
the conditional copula depends on only one variable is very similar to the one described
above; the main difference is that there is no β variable and the inducing inputs (for the
calibration function) are evenly spread on [0, 1]. The testing results are shown in Table 5.3.
Based on the selection criteria we conclude that all nine covariates are required to
explain the dependence structure of the two responses. Figure 5.2 shows one-dimensional slices
of the Kendall's τ calibration curve with 95% credible intervals as a function of the covariates.
The plots
Table 5.2: Wine data: Posterior means and quantiles of β.

Variable  Posterior Mean  95% Credible Interval
Xva        0.274          [0.154, 0.389]
Xca       −0.336          [−0.413, −0.254]
Xrs       −0.076          [−0.278, 0.271]
Xch        0.060          [−0.246, 0.259]
Xfs        0.276          [0.106, 0.410]
Xts        0.402          [0.248, 0.608]
Xph        0.155          [0.054, 0.286]
Xsu        0.501          [0.342, 0.601]
Xal        0.463          [0.382, 0.517]
Table 5.3: Wine data: CVML, CCVML and WAIC criteria values for variable selection in
the conditional copula.

Variables                              CVML   CCVML  WAIC
ALL                                    -1788  -522   3572
Xva, Xca, Xfs, Xts, Xph, Xsu, Xal      -1805  -532   3608
Xva                                    -1823  -552   3646
Xca                                    -1815  -541   3629
Xrs                                    -1849  -582   3698
Xch                                    -1842  -578   3688
Xfs                                    -1852  -584   3705
Xts                                    -1851  -583   3700
Xph                                    -1816  -557   3633
Xsu                                    -1841  -571   3682
Xal                                    -1847  -577   3697
Constant                               -1849  -584   3700
are constructed by varying one predictor while fixing all others at their mid-range values.
They clearly demonstrate that, with the covariates fixed at their mid-range values, the
conditional correlation between 'fixed acidity' and 'density' increases with 'volatile acidity',
'free sulfur dioxide', 'total sulfur dioxide', 'pH', 'sulphates' and 'alcohol', and decreases with
the level of 'citric acid'. These relationships can influence the preparation method of the wine.
To illustrate the difficulty one would have in gauging the complex evolution of the dependence
between the two responses as a function of the covariates, Figure 5.3 plots the response
variables together as they vary with each covariate. It is clear that the model manages to
identify a pattern that would be very difficult to distinguish without the help of
Figure 5.2: Wine Data: Slices of predicted Kendall's τ as a function of each covariate
('volatile.acidity', 'citric.acid', 'residual.sugar', 'chlorides', 'free.sulfur.dioxide',
'total.sulfur.dioxide', 'pH', 'sulphates', 'alcohol'); each panel plots Kendall's τ against one
covariate. Red curves represent 95% credible intervals.
Figure 5.3: Wine Data: Plots of 'fixed acidity' (blue) and 'density' (red) (linearly transformed
to fit on one plot) against each covariate ('volatile.acidity', 'citric.acid', 'residual.sugar',
'chlorides', 'free.sulfur.dioxide', 'total.sulfur.dioxide', 'pH', 'sulphates', 'alcohol').
a flexible mathematical model.
Part II
Approximated Bayesian Methods
Chapter 6
Introduction
6.1 The Need for Simulation-Based Methods
When data y0 ∈ X^n are observed and the sampling distribution has density function f(y0|θ),
indexed by a parameter θ ∈ R^q, Bayesian inference for functions of θ relies on the characteristics
of the posterior distribution:

    π(θ|y0) = p(θ)f(y0|θ) / ∫_{R^q} p(θ)f(y0|θ) dθ ∝ p(θ)f(y0|θ),        (6.1)

where p(θ) denotes the prior distribution.
Since the early 1990s, Bayesian statisticians have been able to operate largely free of
computation-induced constraints thanks to the rapid development of Markov chain Monte Carlo
(MCMC) sampling methods [see, for example, 19, for a recent review]. This class of methods
allows one to produce samples from π in (6.1) despite its often intractable denominator.
While traditional MCMC samplers such as Metropolis-Hastings or Hamiltonian MCMC [see
14, and references therein] can draw from distributions with unknown normalizing con-
stants, they rely on a closed form for the unnormalized posterior, i.e.
p(θ)f(y0|θ) (as discussed in Chapter 1).
Larger data sets should yield answers to more complex problems. The latter can be tackled
statistically using increasingly complex models, for which the sampling distribution is often
no longer available in closed form. In these complex settings, a much weaker assumption often
holds, namely that, for any θ ∈ R^q, draws y ∼ f(y|θ) can be generated. As motivation
for the simulation-based methods, consider for example the Hidden Markov Model:

    X0 ∼ P(x0),
    Xi | xi−1 ∼ P(Xi | xi−1, θ),   i = 1, . . . , n,
    Yi | xi ∼ P(Yi | xi, θ),       i = 1, . . . , n.        (6.2)
Note that, except for examples with Gaussian transition and emission distributions, the marginal
distribution P(y1, · · · , yn|θ) cannot be calculated in closed form. It is possible to treat the
hidden random variables Xi as auxiliary and sample them along with the parameters using
Particle MCMC (PMCMC) [8] or ensemble MCMC [82], but this becomes increasingly difficult
as n increases. Moreover, for some financial time series models [69] (e.g., Stochastic Volatility
models for log returns) the α-Stable distribution may be useful for modelling the transition
and/or emission probabilities. The challenge is that stable distributions do not have a
closed-form density but can be sampled from, and hence particle and ensemble MCMC are not
feasible. Other widely used examples where the likelihood function cannot be expressed
analytically include various network models [51] and Markov random fields [79].
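Even when the marginal likelihood of (6.2) is intractable, forward simulation is straightforward. A sketch with an illustrative Gaussian AR(1) latent state; the parameterization θ = (φ, σx, σy) and the function name are our own:

```python
import numpy as np

def simulate_hmm(theta, n, seed=0):
    """Draw x_0, then alternate transition and emission draws as in (6.2)."""
    phi, sig_x, sig_y = theta
    rng = np.random.default_rng(seed)
    x = np.empty(n + 1)
    y = np.empty(n)
    x[0] = rng.normal()                               # X_0 ~ P(x_0)
    for i in range(1, n + 1):
        x[i] = phi * x[i - 1] + sig_x * rng.normal()  # X_i | x_{i-1}, theta
        y[i - 1] = x[i] + sig_y * rng.normal()        # Y_i | x_i, theta
    return y
```

Draws of this kind are exactly what the simulation-based methods below (ABC and BSL) consume in place of likelihood evaluations.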
In the absence of a tractable likelihood function, statisticians have turned to approximate
methods to perform Bayesian inference. Here we consider two alternative approaches that
have been proposed and have gained momentum recently: Approximate Bayesian Computa-
tion (ABC) [60, 10, 83, 25] and Bayesian Synthetic Likelihood (BSL) [94, 26, 73]. These
algorithms are based only on pseudo-data simulations from f(y|θ) and do not require a
tractable form of the likelihood. Both algorithms are effective when combined with
Markov chain Monte Carlo sampling schemes to produce samples from an approximation of
π, and both share a potential need for intense computational effort at each update. In the
next sections we describe in detail existing methods for Approximate Bayesian Computation
and Bayesian Synthetic Likelihood.
6.2 Approximate Bayesian Computation (ABC)
For models with intractable or computationally expensive likelihood evaluations, simulation-
based algorithms such as ABC are frequently the only choice for inference. In its simplest
form, ABC is a reject/accept sampler. Suppose we observe data y0; given a user-defined
summary statistic S(y) ∈ R^p, Accept/Reject ABC repeatedly samples parameters ζ* from the
prior, each time simulates pseudo-data y* ∼ f(y|ζ*), and then compares S(y*) with S(y0).
If the two are equal, the generated ζ* is accepted; otherwise it is rejected (see Algorithm 2).

Algorithm 2 Accept/Reject ABC
 1: Given observed y0 and required number of samples M.
 2: for t = 1, · · · , M do
 3:   Match = FALSE
 4:   while Not Match do
 5:     ζ* ∼ p(θ)
 6:     y ∼ f(y|ζ*)
 7:     if S(y) = S(y0) then
 8:       θ(t) = ζ*
 9:       Match = TRUE
10:     end if
11:   end while
12: end for

We emphasize that a closed-form equation for the likelihood is not needed, only the ability to
generate from f(y|θ) for any θ. If S(y) is a sufficient
statistic and Pr(S(y) = S(y0)) > 0, then the algorithm yields samples from the true posterior π(θ|y0). Alas, the level of complexity of the models for which ABC is needed makes it unlikely that these two conditions hold. In order to implement ABC under more realistic
assumptions, a (small) constant ε is chosen and ζ∗ is accepted whenever d(S(y), S(y0)) < ε,
where d(S(y), S(y0)) is a user-defined distance function. If we denote the distribution of
accepted draws by πε(θ|S(y0)), then we obtain

lim_{ε↓0} πε(θ|S(y0)) = π(θ|S(y0)). (6.3)
In light of (6.3) one would like to take S(y) = y, but if the sample size of y0 is large, the curse of dimensionality leads to Pr(d(y,y0) < ε) ≈ 0. Thus, obtaining even a moderate number of samples using ABC can be unattainable in this case. Unless S is sufficient, some information about θ is lost, so much attention has been devoted to finding appropriate low-dimensional summary statistics [see, for example, 29, 72]. We assume that the summary statistic function S(·) is given. While Accept/Reject ABC can be used to sample from the posterior distribution of θ, the computational cost is prohibitively large when we require closeness between pseudo and observed data, i.e. d(S(y), S(y0)) < ε. This imposes constraints on the size of the threshold ε, which ends up being selected based on the available computational power and time rather than on other factors such as precision.
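In code, the tolerance-based Accept/Reject ABC described above can be sketched as follows; the function names and the toy Gaussian-mean model at the bottom are our own illustrations, not part of the thesis.

```python
import numpy as np

def abc_accept_reject(y0, prior_sample, simulate, summary, dist, eps, M, rng):
    """Tolerance-based Accept/Reject ABC: draw zeta* from the prior, simulate
    pseudo-data y* ~ f(y | zeta*), and keep zeta* when d(S(y*), S(y0)) < eps."""
    s0 = summary(y0)
    kept = []
    while len(kept) < M:
        theta = prior_sample(rng)          # zeta* ~ p(theta)
        y = simulate(theta, rng)           # y* ~ f(y | zeta*)
        if dist(summary(y), s0) < eps:     # accept when statistics are close
            kept.append(theta)
    return np.array(kept)

# Toy illustration (an assumption for this sketch): y ~ N(theta, 1) with a
# uniform prior on (-5, 5) and the sample mean as summary statistic.
rng = np.random.default_rng(1)
y0 = rng.normal(2.0, 1.0, size=100)
draws = abc_accept_reject(
    y0,
    prior_sample=lambda g: g.uniform(-5, 5),
    simulate=lambda th, g: g.normal(th, 1.0, size=100),
    summary=lambda y: y.mean(),
    dist=lambda a, b: abs(a - b),
    eps=0.3, M=200, rng=rng)
```

Shrinking eps concentrates the accepted draws near the posterior but lowers the acceptance rate, which is exactly the trade-off discussed above.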
Under weak or no prior information about the parameters of the model, the prior and the posterior may be misaligned, i.e. their regions of mass concentration do not overlap. Hence, parameter values drawn from the prior will rarely be retained, making the algorithm very inefficient.

Algorithm 3 ABC-MCMC
1: Given y0, ε > 0 and required number of samples M.
2: Find an initial θ(0) with simulated y such that d(S(y), S(y0)) < ε.
3: for t = 1, ..., M do
4:   Generate ζ∗ ∼ q(·|θ(t−1)).
5:   Simulate y∗ ∼ f(y|ζ∗) and let δ∗ = d(S(y∗), S(y0)).
6:   Calculate α = min( 1, [1{δ∗<ε} p(ζ∗) q(θ(t−1)|ζ∗)] / [p(θ(t−1)) q(ζ∗|θ(t−1))] ).
7:   Generate independent U ∼ U(0, 1).
8:   if U ≤ α then
9:     θ(t) = ζ∗.
10:  else
11:    θ(t) = θ(t−1).
12:  end if
13: end for

Algorithm 3 presents the ABC-MCMC algorithm of [61], which avoids sampling from the prior and instead relies on a chain with a Metropolis-Hastings (MH) transition kernel, with
state space {(θ,y) ∈ Rq×X n}, proposal distribution q(ζ|θ)×f(y|ζ) and target distribution
πε(θ,y|y0) ∝ p(θ)f(y|θ)1{δ(y0,y)<ε}, (6.4)
where δ(y0,y) = d(S(y), S(y0)). Note that the goal is the marginal distribution for θ which
is:
πε(θ|y0) = ∫ πε(θ,y|y0) dy ∝ ∫ p(θ) f(y|θ) 1{δ(y0,y)<ε} dy = p(θ) P(δ(y0,y) < ε|θ). (6.5)
Therefore, if we knew the conditional probability P(δ(y0,y) < ε|θ) for every θ, we could run an MH algorithm to sample from the approximate target given in (6.5). Other MCMC designs suitable for ABC can be found in [55] and [13]. Sequential Monte Carlo (SMC) methods have also been successfully used for ABC (we denote this ABC-SMC) [85, 31]. ABC-SMC requires a specified decreasing sequence ε0 > · · · > εJ. This method uses the Particle MCMC design [8], in which samples are updated as the target distribution evolves with ε. More specifically, it starts by sampling θ0^(1), ..., θ0^(M) from πε0(θ|y0) using Accept/Reject ABC. Subsequently, at time t + 1 the samples available at time t are sequentially updated so that their distribution is
πεt+1(θ|y0) [see 55, for a complete description of the SMC-MCMC]. The advantage of this method is not only that it starts from a large ε but also that it generates independent draws. A comprehensive coverage of computational techniques for ABC can be found in [84] and references therein. The ABC-MCMC algorithm proposed by [54] approximates P(δ(y0,y) < ε|θ) by J^{-1} Σ_{j=1}^J 1{δ(y0,yj)<ε}, where J ≥ 1 and each yj is simulated from f(y|θ). Note that this estimator is unbiased and hence the stationary distribution of θ is πε(θ|y0), as a consequence of pseudo-marginal MCMC theory [9]. Clearly, when the probability P(δ < ε|θ) is very small, this method requires simulating a large number of δs (or, equivalently, ys) in order to move to a new state. Note that even when the proposed parameter θ is near the "true" unknown parameter θ0, the simulated δ(y,y0) given θ can be greater than ε due to the randomness of the conditional distribution δ|θ; in this case the chain can become sluggish to the point of being impractical. We also note a general lack of guidelines concerning the selection of ε, which is unfortunate as the performance of ABC sampling depends heavily on its value.
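The Monte Carlo estimator of P(δ(y0, y) < ε|θ) used by [54] can be sketched as below (the function name, interface, and the toy uniform model in the test are ours); the estimate replaces the intractable probability in the MH acceptance ratio.

```python
import numpy as np

def acceptance_prob_hat(theta, s0, simulate, summary, dist, eps, J, rng):
    """Unbiased estimate of P(d(S(y), S(y0)) < eps | theta): the average of
    J indicator variables, each computed from a fresh pseudo-data set."""
    hits = sum(dist(summary(simulate(theta, rng)), s0) < eps for _ in range(J))
    return hits / J
```

Because the estimator is unbiased, substituting it for the true probability in the acceptance ratio leaves the stationary distribution πε(θ|y0) unchanged, at the cost of J simulations per proposal.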
Notice that the choice of proposal distribution q(·|θ) can dramatically influence the performance of ABC-MCMC. To make a fair comparison between different methods, we revise the ABC-MCMC algorithm by introducing a decreasing sequence ε0 > · · · > εJ (J is the number of "steps"), similar to ABC-SMC, and by "learning" the transition kernel during burn-in, as in Algorithm 4. The main difference is that during a burn-in period of length B the ε-sequence starts at a higher value (which makes finding initial θ(0) values much more feasible) and gradually decreases, while the proposal distribution is adapted over the same period. The adaptation of the proposal takes place only during the burn-in period. For independent MH sampling the generic proposal is Gaussian, N(·|µ̂, Σ̂), with the constant c set to 2 or 3; for the random walk sampler the standard transition kernel is N(·|θ(t−1), Σ̂) with c = 2.38²/q, which is proven to be optimal for a Gaussian posterior [75, 76].
All the algorithms discussed so far rely on numerous generations of pseudo-data, which can be computationally costly; some attempts to reduce the computational cost of ABC are made in [93] and [45]. These approaches are based on learning the dependence between δ and θ. Therefore, instead of simulating many statistics for each proposed θ, the accelerated algorithm captures information from all simulated pairs through this functional form. Flexible regression models are used to model these unknown functional relationships, and the performance depends on the signal-to-noise ratio and on the ability to capture patterns that can be highly complex.
Algorithm 4 ABC-MCMC modified (ABC-MCMC-M)
1: Given y0, sequence ε0 > · · · > εJ, constant c, burn-in period B and required number of samples M.
2: Define ε = ε0.
3: Find an initial θ(0) with simulated y such that d(S(y), S(y0)) < ε.
4: Let µ be the expectation of the prior distribution and Σ̂ = cΣ, where Σ is the covariance of the prior p(θ).
5: Define b = ⌊B/J⌋ and the sequence (a1, ..., aJ) = (b, 2b, ..., Jb).
6: for t = 1, ..., M do
7:   if t = aj for some j = 1, ..., J then
8:     Set ε = εj.
9:     Set µ to the mean of {θ(t')}, t' = 1, ..., (aj − 1), and Σ̂ = cΣ, where Σ is now the covariance of {θ(t')}, t' = 1, ..., (aj − 1).
10:  end if
11:  Generate ζ∗ ∼ q(·|θ(t−1), µ, Σ̂).
12:  Simulate y∗ ∼ f(y|ζ∗) and let δ∗ = d(S(y∗), S(y0)).
13:  Calculate α = min( 1, [1{δ∗<ε} p(ζ∗) q(θ(t−1)|ζ∗, µ, Σ̂)] / [p(θ(t−1)) q(ζ∗|θ(t−1), µ, Σ̂)] ).
14:  Generate independent U ∼ U(0, 1).
15:  if U ≤ α then
16:    θ(t) = ζ∗.
17:  else
18:    θ(t) = θ(t−1).
19:  end if
20: end for
6.3 Bayesian Synthetic Likelihood
As an alternative to ABC, which requires tuning of ε and selection of a distance function d(·, ·), [94] tackled the intractability of the sampling distribution by assuming that the conditional
distribution for the statistic S(y) given θ is Gaussian with mean µθ and covariance matrix
Σθ. The Synthetic Likelihood (SL) procedure assigns to each θ the likelihood SL(θ) =
N (s0;µθ,Σθ), where N (x;µ,Σ) denotes the density of a normal with mean µ and covariance
Σ, and s0 = S(y0). SL can be used for maximum likelihood estimation as in [94] or within the
Bayesian paradigm as proposed by [26] and [73]. In Bayesian Synthetic Likelihood (BSL) [73]
propose to implement a Metropolis-Hastings sampler that has π(θ|s0) ∝ p(θ)N (s0;µθ,Σθ)
as stationary distribution. It is clear that direct implementation is not possible as the
conditional mean and covariance matrix are unknown. However, both can be estimated
based on m statistics (s1, · · · , sm) sampled from their conditional distribution given θ. More
precisely, after simulating yi ∼ f(y|θ) and setting si = S(yi), i = 1, · · · ,m, estimate
µ̂θ = (1/m) Σ_{i=1}^m si,
Σ̂θ = (1/(m − 1)) Σ_{i=1}^m (si − µ̂θ)(si − µ̂θ)^T, (6.6)

so that the synthetic likelihood is

SL(θ|y0) = N(S(y0)| µ̂θ, Σ̂θ). (6.7)
The pseudo-code in Algorithm 5 shows the steps involved in the BSL-MCMC sampler.
Since each Metropolis-Hastings step requires calculating the likelihood ratios between two
SLs calculated at different θs one can anticipate the heavy computational load involved in
running the chain for thousands of iterations, especially if sampling each y is expensive. Note that even though these estimates of the conditional mean and covariance are unbiased, the estimated value of the Gaussian likelihood is biased, and therefore pseudo-marginal MCMC theory is not applicable here. [73] present an unbiased Gaussian likelihood estimator and show empirically that biased and unbiased estimates generally perform similarly. They also remark that the procedure is very robust to the number of simulations m, demonstrating empirically that values of m between 50 and 200 produce similar results.
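A minimal numpy sketch of the synthetic likelihood (6.6)–(6.7), with the Gaussian log-density evaluated directly; the function name and the toy Gaussian model in the test are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def synthetic_loglik(theta, s0, simulate, summary, m, rng):
    """Gaussian synthetic log-likelihood: estimate the conditional mean and
    covariance of S(y) | theta from m simulated statistics, as in (6.6),
    then evaluate log N(s0; mu_hat, Sigma_hat) as in (6.7)."""
    S = np.array([summary(simulate(theta, rng)) for _ in range(m)])
    mu = S.mean(axis=0)
    Sigma = np.cov(S, rowvar=False)        # denominator m - 1, as in (6.6)
    diff = np.asarray(s0) - mu
    _, logdet = np.linalg.slogdet(Sigma)
    p = len(diff)
    return -0.5 * (p * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))
```

Each MH step in BSL-MCMC evaluates this quantity twice, once at the proposal and once at the current state, which is the source of the heavy computational load noted above.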
Algorithm 5 Bayesian Synthetic Likelihood (BSL-MCMC)
1: Given s0 = S(y0), number of simulations m and required number of samples M.
2: Get initial θ(0); estimate µ̂θ(0), Σ̂θ(0) by simulating m statistics given θ(0).
3: Define ĥ(θ(0)) = N(s0; µ̂θ(0), Σ̂θ(0)).
4: for t = 1, ..., M do
5:   Generate ζ∗ ∼ q(·|θ(t−1)).
6:   Estimate µ̂ζ∗, Σ̂ζ∗ by simulating m statistics given ζ∗.
7:   Calculate ĥ(ζ∗) = N(s0; µ̂ζ∗, Σ̂ζ∗).
8:   Calculate α = min( 1, [p(ζ∗) ĥ(ζ∗) q(θ(t−1)|ζ∗)] / [p(θ(t−1)) ĥ(θ(t−1)) q(ζ∗|θ(t−1))] ).
9:   Generate independent U ∼ U(0, 1).
10:  if U ≤ α then
11:    Set θ(t) = ζ∗.
12:  else
13:    Set θ(t) = θ(t−1).
14:  end if
15: end for

The normality assumption for the summary statistics is certainly a strong one and may not hold in practice. Following up on this, [6] relaxed the joint Gaussian assumption to a Gaussian copula with non-parametric marginal distribution estimates (NON-PAR BSL), which includes the joint Gaussian as a special case but is much more flexible. The estimation is based, as in the BSL framework, on m pseudo-data samples simulated for each θ.
6.4 Plan
It is evident that both ABC and BSL are computationally costly and require an enormous number of pseudo-data simulations to run even a moderately sized MCMC chain. Accelerating these algorithms is especially important for very large data sets or for time-consuming pseudo-data simulations or summary statistic calculations. We propose to speed up these methods by storing past simulated draws and using them to approximate the unknown likelihood. While this drastically reduces the computation time, it raises the need to control the approximation error introduced when modifying the original transition kernel. The objective is to approximate P(δ < ε|θ) and N(s0; µθ, Σθ) for any θ at every MCMC iteration, using the past simulated pairs (θ, δ) and (θ, s) for ABC and BSL respectively. The K-Nearest-Neighbor (kNN) method is used as a non-parametric estimation tool for the quantities described above. The main advantage of kNN is that it is uniformly strongly consistent, which guarantees that, for a large enough chain history, we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC.
The structure of this part is as follows. In Chapter 7 we describe the accelerated MCMC algorithms for ABC. In Chapter 8 we extend the proposed method to BSL. The practical impact of these algorithms is evaluated via simulations in Chapter 9, and a data analysis involving the Stochastic Volatility model (with α-stable errors) applied to the time series of daily log returns of the Dow Jones index between Jan 1, 2010 and Dec 31, 2017 is presented in Chapter 10. The theoretical analysis showing control of the perturbation errors in total variation norm is presented in Chapter 11.
Chapter 7
Approximated ABC (AABC)
7.1 Computational Inefficiency of ABC
The problem that we tackle in this thesis is the computational burden of standard ABC procedures such as ABC-MCMC. As pointed out in the previous chapter, ABC algorithms require a large number of pseudo-data simulations when the threshold ε is small, which results in low computational efficiency. Let θ0 be the parameter that generated the observed data. If P(δ < ε|θ0) is small, then even when the proposed state ζ∗ is close to θ0 there is a high probability that the generated δ∗ will be greater than ε and therefore rejected. Thus many samples are rejected not because they are in the tail of the (approximate) posterior but simply due to the variability of δ conditional on θ. Note that all ABC methods completely ignore past simulation results; we think, however, that these results provide essential information that can significantly accelerate the algorithm. This observation is the basis of the proposed method.
7.2 Approximated ABC-MCMC (AABC-MCMC)
In this section we describe a novel ABC-MCMC sampler that utilizes past pseudo-data simulations and significantly improves the performance of the chain. We mentioned in the last chapter that the objective of ABC-MCMC (given a threshold ε) is to sample from the distribution with support Θ:
πε(θ|y0) ∝ p(θ)P (δ(y0,y) < ε|θ), (7.1)
where δ(y0,y) = d(S(y), S(y0)) with y ∼ f(y|θ). If h(θ) = P(δ(y0,y) < ε|θ) were known for every θ, then we could run an exact MH-MCMC chain with invariant distribution proportional to p(θ)h(θ). Since this function of θ is generally unknown, it is estimated by the indicator 1{δ(y0,y)<ε}, which is an unbiased estimator. Suppose now that at iteration t + 1 we have stored N − 1 past simulations ZN−1 = {ζn, δn}, n = 1, ..., N − 1, where ζ denotes a proposal generated independently of the MCMC (otherwise the Markovian property of the chain would be violated). Given two new independent proposals ζ∗, ζ̄∗ ∼ q(·|θ(t)), the first is used for the chain update and the second is used to enrich the "history". We then generate one δ∗ by first simulating y∗ (given ζ̄∗), then calculating the statistic s∗ = S(y∗), and finally computing the discrepancy between s∗ and s0 = S(y0), δ∗ = d(s∗, s0). We then combine the past samples ZN−1 with the new pair (ζ̄∗, δ∗), ZN = ZN−1 ∪ {(ζ̄∗, δ∗)}, and estimate h(ζ∗) as follows:

ĥ(ζ∗) = Σ_{n=1}^N WNn(ζ∗) 1{δn<ε} / Σ_{n=1}^N WNn(ζ∗). (7.2)
Here the weight function WNn(ζ∗) = WN(ζn, ζ∗) = W(‖ζn − ζ∗‖) depends on the Euclidean distance between ζn and ζ∗, assigning more weight to the closest pairs. We discuss several choices of the function W(·) below.
In other words, a non-parametric estimate of h(ζ) is produced for each ζ based on the previous simulations ZN. Notice that if ζ∗ has a close neighbor whose discrepancy is less than ε, then the estimate ĥ(ζ∗) will not be zero and there is a chance of moving to a different state. Intuitively, this is expected to stabilize the acceptance probability of the chain and to perform better than standard ABC-MCMC. Since the proposed weighted estimate is no longer an unbiased estimator of h(θ), a new theoretical evaluation is needed to study the effect of perturbing the transition kernel on the statistical analysis. Central to the algorithm's utility is the ability to control the total variation distance between the desired distribution of interest given in (7.1) and the modified chain's target. As will be shown in Chapter 11, we rely on three assumptions to ensure that the chain approximately samples from (7.1): 1) compactness of Θ; 2) uniform ergodicity of the chain using the true h; and 3) uniform convergence in probability of ĥ(θ) to h(θ) as N → ∞.
The K-Nearest-Neighbor (kNN) regression approach [32, 11] has the property of uniform consistency [16]; we therefore employ this technique for ĥ(θ). Here we define K = g(N) (for the simulations we use g(·) = √·) and let λ : {1, ..., N} → {1, ..., N} be a permutation of indices that sorts {ζn} ∈ ZN = {ζn, δn}, n = 1, ..., N, from closest to ζ∗ to furthest.

Algorithm 6 Approximated ABC-MCMC (AABC-MCMC)
1: Given y0 with summary statistics s0, sequence ε0 > · · · > εJ, constant c, burn-in period B, required number of samples M, and initial simulations ZN = {ζn, δn}, n = 1, ..., N, with ζn ∼ p(ζ), yn ∼ f(·|ζn) and δn = d(S(yn), s0).
2: Define ε = ε0.
3: Find an initial θ(0) with simulated y such that d(S(y), s0) < ε.
4: Let µ be the expectation of the prior distribution and Σ̂ = cΣ, where Σ is the covariance of the prior p(θ).
5: Define b = ⌊B/J⌋ and the sequence (a1, ..., aJ) = (b, 2b, ..., Jb).
6: for t = 1, ..., M do
7:   if t = aj for some j = 1, ..., J then
8:     Set ε = εj.
9:     Set µ to the mean of {θ(t')}, t' = 1, ..., (aj − 1), and Σ̂ = cΣ, where Σ is now the covariance of {θ(t')}, t' = 1, ..., (aj − 1).
10:  end if
11:  Generate ζ∗, ζ̄∗ iid∼ N(·; µ, Σ̂).
12:  Simulate y∗ ∼ f(·|ζ̄∗) and let δ∗ = d(S(y∗), s0).
13:  Add the simulated pair of parameter and discrepancy to the past set: ZN = ZN−1 ∪ {ζ̄∗, δ∗} and set N = N + 1.
14:  ĥ(ζ∗) = Σ_{n=1}^N WNn(ζ∗) 1{δn<ε} / Σ_{n=1}^N WNn(ζ∗).
15:  ĥ(θ(t−1)) = Σ_{n=1}^N WNn(θ(t−1)) 1{δn<ε} / Σ_{n=1}^N WNn(θ(t−1)).
16:  Calculate α = min( 1, [p(ζ∗) ĥ(ζ∗) N(θ(t−1); µ, Σ̂)] / [p(θ(t−1)) ĥ(θ(t−1)) N(ζ∗; µ, Σ̂)] ).
17:  Generate independent U ∼ U(0, 1).
18:  if U ≤ α then
19:    θ(t) = ζ∗.
20:  else
21:    θ(t) = θ(t−1).
22:  end if
23: end for

Suppose that after the permutation the past set ZN is rearranged by distance to ζ∗, so that (ζ1, δ1) has the smallest ‖ζ1 − ζ∗‖ while (ζN, δN) has the largest distance ‖ζN − ζ∗‖. Then kNN sets WNn(ζ∗) to zero for all n > K; for n ≤ K there are several weight choices, and we focus on two:
(U) The uniform kNN with WNn(ζ∗) = 1 for all n ≤ K;
(L) The linear kNN with WNn(ζ∗) = W (‖ζn − ζ∗‖) = 1− ‖ζn − ζ∗‖/‖ζK − ζ∗‖ for n ≤ K
so that the weight decreases from 1 to 0 as n increases from 1 to K.
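The weighted estimate (7.2) with the two kNN weighting schemes can be sketched as follows (the function name and interface are ours):

```python
import numpy as np

def h_hat(zeta, past_zeta, past_delta, eps, weights="uniform"):
    """kNN estimate of h(zeta) = P(delta < eps | zeta) from the stored history
    {(zeta_n, delta_n)}; K = floor(sqrt(N)), uniform (U) or linear (L) weights."""
    N = len(past_zeta)
    K = max(1, int(np.sqrt(N)))
    d = np.linalg.norm(past_zeta - zeta, axis=1)   # Euclidean distances
    idx = np.argsort(d)[:K]                        # indices of K nearest neighbours
    if weights == "uniform":
        w = np.ones(K)
    else:
        dK = d[idx][-1]                            # distance to the K-th neighbour
        w = 1.0 - d[idx] / dK if dK > 0 else np.ones(K)
        if w.sum() == 0:                           # guard for the K = 1 corner case
            w = np.ones(K)
    ind = (past_delta[idx] < eps).astype(float)
    return float(np.sum(w * ind) / np.sum(w))
```

As described above, this estimate is non-zero as soon as any close neighbour has discrepancy below ε, which stabilizes the acceptance probability relative to the single-indicator estimate.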
Moreover, kNN theoretical arguments generally require independent pairs {ζn, δn}, n = 1, ..., N; therefore for the proposal distribution we apply an independent sampler, so that q(·|θ(t)) = q(·). As in Algorithm 4, we allow the ε-sequence to decrease gradually during the burn-in period. In all our simulations we assume that the proposal is Gaussian, which of course can be changed to any other appropriate distribution (with positive density on Θ) suited to the particular problem. The entire procedure is outlined in Algorithm 6.
To conclude, at the end of a simulation of size M the MCMC samples are {θ(1), ..., θ(M)} and the history used for updating the chain is {(ζ1, δ1), ..., (ζM, δM)}. The two sequences are independent of one another, i.e. for any N > 0, the elements in ZN are independent of the chain's history up to time N.
Note also that ĥ(θ) is estimated in both the numerator and denominator of the acceptance probability at every iteration, so even for the current state this value is recalculated rather than borrowed from the previous iteration. This procedure generally improves the mixing of the chain and is theoretically justified, as will be shown in Chapter 11. The constant c controls the variability of the proposal: if it is too small, the MCMC will not explore the posterior effectively; if it is too large, there will be many rejections, as the proposed values will frequently fall in the tails of the posterior. For all our simulations and the real data example we use c = 1.5, which was found empirically to be quite satisfactory.
Chapter 8
Approximated BSL (ABSL)
8.1 Computational Inefficiency of BSL
Similar to ABC, BSL is computationally costly and requires many pseudo-data simulations to run even moderately long chains, since it generates m pseudo-data sets at every iteration. This m cannot be small, since otherwise the estimates of the conditional mean µθ and covariance Σθ will not be accurate, especially when the dimension p of the summary statistic is moderate or large. To accelerate BSL-MCMC we propose to store and utilize past simulations of (ζ, s) to approximate the conditional mean and covariance for every proposed parameter ζ∗, making the whole procedure computationally faster. Instead of simulating m pseudo-data sets, only one is simulated and used in combination with the past simulations. The approach can trivially be extended to the NONPAR-BSL algorithm, but we do not pursue this further. The K-Nearest-Neighbor (kNN) method is used as a non-parametric estimation tool for the quantities described above.
8.2 Approximated Bayesian Synthetic Likelihood (ABSL)
Setting s0 = S(y0) and assuming conditional normality for this statistic, the objective is to sample from
π(θ|s0) ∝ p(θ)N (s0;µθ,Σθ). (8.1)
During the MCMC run, the proposal ζ∗ is generated from q(·), and the history ZN is enriched using ζ̄∗ ∼ q(·), y∗ ∼ f(·|ζ̄∗) and s∗ = S(y∗) (this can trivially be extended to more than one pseudo-data set generation). Thus all proposals with their associated statistics are stored in ZN = {ζn, sn}, n = 1, ..., N; note that, similar to AABC-MCMC, this "history" set is independent of
the chain states. Then for any ζ, the conditional mean and covariance of the statistics vector are estimated from past samples as weighted averages:

µ̂ζ = Σ_{n=1}^N WNn(ζ) sn / Σ_{n=1}^N WNn(ζ),
Σ̂ζ = Σ_{n=1}^N WNn(ζ)(sn − µ̂ζ)(sn − µ̂ζ)^T / Σ_{n=1}^N WNn(ζ). (8.2)

Again the weights are functions of the distance between the proposed value and the parameter values from the past, WNn(ζ) = W(‖ζ − ζn‖); we use the simple Euclidean distance.

Algorithm 7 Approximated Bayesian Synthetic Likelihood (ABSL)
1: Given s0 = S(y0), constant c, burn-in period B, number of adaptation points J during burn-in, required number of samples M, and initial pseudo-data simulations ZN = {ζn, sn}, n = 1, ..., N, with ζn ∼ p(ζ), yn ∼ f(·|ζn) and sn = S(yn).
2: Get initial θ(0).
3: Let µ be the expectation of the prior distribution and Σ̂ = cΣ, where Σ is the covariance of the prior p(θ).
4: Define b = ⌊B/J⌋ and the sequence (a1, ..., aJ) = (b, 2b, ..., Jb).
5: for t = 1, ..., M do
6:   if t = aj for some j = 1, ..., J then
7:     Set µ to the mean of {θ(t')}, t' = 1, ..., (aj − 1), and Σ̂ = cΣ, where Σ is now the covariance of {θ(t')}, t' = 1, ..., (aj − 1).
8:   end if
9:   Generate ζ∗, ζ̄∗ iid∼ N(·; µ, Σ̂).
10:  Simulate y∗ ∼ f(·|ζ̄∗) and let s∗ = S(y∗).
11:  Add the simulated pair of parameter and statistic to the past set: ZN = ZN−1 ∪ {ζ̄∗, s∗} and set N = N + 1.
12:  Calculate µ̂ζ∗ and Σ̂ζ∗ using (8.2) with ζ = ζ∗.
13:  Calculate µ̂θ(t−1) and Σ̂θ(t−1) using (8.2) with ζ = θ(t−1).
14:  ĥ(ζ∗) = N(s0; µ̂ζ∗, Σ̂ζ∗).
15:  ĥ(θ(t−1)) = N(s0; µ̂θ(t−1), Σ̂θ(t−1)).
16:  Calculate α = min( 1, [p(ζ∗) ĥ(ζ∗) N(θ(t−1); µ, Σ̂)] / [p(θ(t−1)) ĥ(θ(t−1)) N(ζ∗; µ, Σ̂)] ).
17:  Generate independent U ∼ U(0, 1).
18:  if U ≤ α then
19:    θ(t) = ζ∗.
20:  else
21:    θ(t) = θ(t−1).
22:  end if
23: end for

To get appropriate
convergence properties we use the kNN approach to calculate the weights WNn, where only the K = √N values closest to ζ are used in the calculation of the conditional mean and covariance. As described in the previous chapter, uniform and linear weights are available: the former assigns equal weight to all K values, the latter has linearly decreasing weights. The advantage of this method is that using the past to estimate the conditional mean and covariance matrix can significantly speed up the whole procedure, since there is no longer a need to generate many data sets y at every step, as was done in the original BSL; the hope is that these estimates are still close enough to the true values. Applying these estimates within MCMC, we assume that the proposal distribution is independent of the previous state, so that q(·|θ(t)) = N(·; µ, Σ) for some fixed µ and Σ (of course, any other distribution with non-zero density on the support of the exact posterior can be used). As will be shown in Chapter 11, we need the assumption that the pairs in the past set ZN are independent, which is satisfied when the proposal is independent and when both accepted and rejected simulations are saved. We will also show that if Θ is compact, then under minor assumptions on the expectation and covariance of s|θ the proposed algorithm exhibits good convergence properties and we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC.
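The weighted kNN estimates (8.2) can be sketched as follows (names are ours; the returned moments would then be plugged into the Gaussian density N(s0; ·, ·)):

```python
import numpy as np

def knn_moments(zeta, past_zeta, past_s, weights="uniform"):
    """kNN estimates of the conditional mean and covariance of S(y) | zeta
    from the stored pairs {(zeta_n, s_n)}; K = floor(sqrt(N))."""
    N = len(past_zeta)
    K = max(2, int(np.sqrt(N)))
    d = np.linalg.norm(past_zeta - zeta, axis=1)   # Euclidean distances
    idx = np.argsort(d)[:K]                        # K nearest neighbours
    if weights == "uniform":
        w = np.ones(K)
    else:                                          # linear: 1 at the nearest, 0 at the K-th
        dK = d[idx][-1]
        w = 1.0 - d[idx] / dK if dK > 0 else np.ones(K)
        if w.sum() == 0:
            w = np.ones(K)
    sK = past_s[idx]
    mu = (w[:, None] * sK).sum(axis=0) / w.sum()
    resid = sK - mu
    Sigma = np.einsum('n,ni,nj->ij', w, resid, resid) / w.sum()
    return mu, Sigma
```

Only one pseudo-data set is simulated per iteration; the remaining information comes from the stored history, which is what makes each ABSL step much cheaper than a BSL step with m simulations.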
To get a rough idea of the transition kernel, we propose to learn it during the burn-in period with J adaptation points; Algorithm 7 outlines the entire ABSL method. At the end of a simulation of size M the sequence of MCMC samples is θ(0), ..., θ(M), while the sequence of proposals is ζ1, ..., ζM. For the simulations below we set c = 1.5 and J = 15, to be consistent with the AABC-MCMC, ABC-MCMC-M and ABC-SMC procedures.
Chapter 9
Simulations
9.1 Details
In this section we carry out a series of simulations to compare the performance of the various algorithms. We analyze a simple Moving Average model of lag 2, Ricker's model, a Stochastic Volatility model with Gaussian emission noise and, finally, a Stochastic Volatility model with α-Stable errors. For all these models, simulating pseudo-data for any parameter value is simple and computationally fast, yet standard estimation methods can be quite challenging (especially for the last three models); it is therefore tempting to implement ABC- and BSL-type algorithms for inference in these examples.
For the ABC samplers, before running an MCMC chain we estimate the initial and final thresholds ε0 and ε15 (15 equal steps on the log scale were used for all models) and the matrix A used in the discrepancy calculation δ = d(S(y), S(y0)) = (S(y) − S(y0))^T A (S(y) − S(y0)). To estimate A, we first set it to the identity matrix and generate 500 pairs {ζi, yi}, i = 1, ..., 500, from p(ζ)f(y|ζ). Next we calculate the discrepancies {ζi, δi} with δi = d(S(yi), S(y0)) and pick the ζ∗ with the smallest discrepancy. Finally, we generate 100 pseudo-data sets (y1, ..., y100) from f(y|ζ∗), compute the corresponding summary statistics (s1, ..., s100), and set A to be the inverse of the covariance matrix of (s1, ..., s100). This procedure (with an updated A) is repeated several times; at the final stage we calculate (δ1, ..., δ100) and set ε0 to be the 5% quantile of these discrepancies. The numbers of simulations, 500 and 100, were chosen for computational convenience and are not driven by any theoretical arguments. To estimate the final threshold ε15 we use the Random Walk version of Algorithm 4 with M = B = 5000 and initial threshold ε0. We add one modification by setting εj, j = 1, ..., 15, equal to the 1% quantile of the discrepancies δ generated between adaptation points aj−1 and aj; the final threshold is set to ε15. Intuitively, the ε-sequence should decrease as the chain states move closer to the "true" parameter. Note that this chain is only used to approximate the final ε and cannot be used to study the properties of the approximate posterior. Intermediate values ε1, ..., ε14 are then computed as equidistant points on the natural log scale between ε0 and ε15.
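The discrepancy with the estimated scaling matrix A and the log-scale ε-sequence can be sketched as follows (function names are ours; the full iterative re-estimation of A is described above):

```python
import numpy as np

def estimate_A(stats):
    """A = inverse of the covariance matrix of summary statistics simulated
    at the current best candidate parameter."""
    return np.linalg.inv(np.cov(np.asarray(stats), rowvar=False))

def discrepancy(s, s0, A):
    """delta = (S(y) - S(y0))^T A (S(y) - S(y0))."""
    diff = np.asarray(s) - np.asarray(s0)
    return float(diff @ A @ diff)

def log_eps_schedule(eps0, epsJ, J):
    """J + 1 thresholds, equidistant on the natural log scale."""
    return np.exp(np.linspace(np.log(eps0), np.log(epsJ), J + 1))
```

With A set to an inverse covariance, the discrepancy is a Mahalanobis-type distance, which puts the components of the summary statistic on a comparable scale.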
We compare the following algorithms:
(SMC) Standard Sequential Monte Carlo for ABC;
(ABC-RW) The modified ABC-MCMC algorithm which updates ε and the random walk Metropolis
transition kernel during burn-in;
(ABC-IS) The modified ABC-MCMC algorithm which updates ε and the Independent Metropolis
transition kernel during burn-in;
(BSL-RW) Modified BSL where it adapts the random walk Metropolis transition kernel during
burn-in;
(BSL-IS) Modified BSL where it adapts the independent Metropolis transition kernel during
burn-in;
(AABC-U) Approximated ABC-MCMC with independent proposals and uniform (U) weights;
(AABC-L) Approximated ABC-MCMC with independent proposals and linear (L) weights;
(ABSL-U) Approximated BSL-MCMC with independent proposals and uniform (U) weights;
(ABSL-L) Approximated BSL-MCMC with independent proposals and linear (L) weights.
(Exact) When likelihood is computable, posterior samples were generated using MCMC.
For SMC, 500 particles were used; the total number of iterations for ABC-RW, ABC-IS, AABC-U, AABC-L, ABSL-U and ABSL-L is 50,000, with 10,000 for burn-in. Since BSL-RW and BSL-IS are much more computationally expensive, their total number of iterations was fixed at 10,000, with 2,000 for burn-in and 50 simulations of y for every proposed ζ∗ (i.e., m = 50). The Exact chain was run for 5,000 iterations with 2,000 for burn-in. It must be pointed out that all approximate samplers are based on the same summary statistics, the same discrepancy function and the same ε-sequence, so that they all start with the same initial conditions.
9.2 Measures for Comparisons
For more reliable results we compare the sampling algorithms under data set replications; in this study we set the number of replicates to R = 100, so that for each model 100 data sets were generated and each one was analyzed with the sampling methods described above. Assorted statistics and measures were calculated for every model and data set. Let θ(t)_rs denote the posterior samples from replicate r = 1, ..., R, iteration t = 1, ..., M and parameter component s = 1, ..., q, and similarly let θ̃(t)_rs denote the posterior samples from an exact chain (all draws are taken after the burn-in period). We also let θ^true_s denote the true parameter that generated the data. Moreover, let D_rs(x) and D̃_rs(x) be the estimated density functions at replicate r = 1, ..., R and component s = 1, ..., q for the approximate and exact chains respectively. Then the following quantities are defined:
Diff in mean (DIM) = Mean_{r,s}( |Mean_t(θ(t)_rs) − Mean_t(θ̃(t)_rs)| ),
Diff in covariance (DIC) = Mean_{r,s}( |Cov_t(θ(t)_rs) − Cov_t(θ̃(t)_rs)| ),
Total Variation (TV) = Mean_{r,s}( 0.5 ∫ |D_rs(x) − D̃_rs(x)| dx ),
Bias² = Mean_s( (Mean_{t,r}(θ(t)_rs) − θ^true_s)² ),
Var = Mean_s( Var_r( Mean_t(θ(t)_rs) ) ),
MSE = Bias² + Var,
where Mean_t(a_st) is defined as the average of {a_st} over the index t, and Var_t(a_st) and Cov_t(a_st) similarly denote the variance and covariance. The first three measures are useful in determining how close the posterior draws from the different samplers are to the draws generated by the exact chain (when it is available). On the other hand, the last three are standard quantities that measure how close, in mean square, the posterior means are to the true parameters that generated the data. To study the efficiency of the proposed algorithms we also need to take into account the CPU time that it takes to run a chain, as well as auto-correlation properties.
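For instance, the total variation measure above can be approximated from the two sets of posterior draws using histogram density estimates on a common grid (our own sketch; the thesis does not prescribe a particular density estimator):

```python
import numpy as np

def tv_from_samples(x, y, bins=50):
    """Approximate 0.5 * int |D(x) - D~(x)| dx between two posterior sample
    sets via histogram density estimates on a shared set of bin edges."""
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    edges = np.linspace(lo, hi, bins + 1)
    px, _ = np.histogram(x, bins=edges, density=True)
    py, _ = np.histogram(y, bins=edges, density=True)
    width = edges[1] - edges[0]
    return 0.5 * np.sum(np.abs(px - py)) * width
```

The result lies between 0 (identical densities) and 1 (disjoint supports), matching the usual normalization of the total variation distance.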
Define the auto-correlation time (ACT) for each parameter component and replicate of samples θ(t)_rs as

ACT_rs = 1 + 2 Σ_{a=1}^∞ ρ_a(θ(t)_rs),

where ρ_a is the auto-correlation coefficient at lag a. In practice we sum over all lags up to the first negative correlation. Letting M − B be the number of chain iterations (after burn-in) and CPU_r the total CPU time needed to run the whole chain in replicate r, we introduce the Effective Sample Size (ESS) and the Effective Sample Size per CPU (ESS/CPU) as

ESS = Mean_{rs}((M − B)/ACT_rs),
ESS/CPU = Mean_{rs}((M − B)/(ACT_rs · CPU_r)). (9.1)
Note that these indicators are averaged over parameter components and replicates. ESS can intuitively be thought of as the approximate number of "independent" samples out of M − B: the higher the ESS, the more efficient the sampling algorithm. When ESS is combined with CPU time (ESS/CPU), it provides a powerful indicator of an MCMC's efficiency. Generally, the sampler with the highest ESS/CPU is preferred, as it produces the largest number of "independent" draws per unit time.
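The ACT truncation rule and the resulting per-chain ESS can be sketched as follows (our own implementation of the rule stated above):

```python
import numpy as np

def autocorr_time(chain):
    """ACT = 1 + 2 * sum of lag autocorrelations, truncated at the first
    negative coefficient (the practical rule described above)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    # acf[a] is the lag-a sample autocorrelation; acf[0] == 1 by construction
    acf = np.correlate(x, x, mode="full")[n - 1:] / (n * x.var())
    act = 1.0
    for a in range(1, n):
        if acf[a] < 0:
            break
        act += 2.0 * acf[a]
    return act

def ess(chain):
    """Effective sample size of a single chain, (M - B) / ACT as in (9.1)."""
    return len(chain) / autocorr_time(chain)
```

For an i.i.d. chain the ACT is close to 1 and the ESS close to the chain length, while strongly autocorrelated chains give a much smaller ESS.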
9.3 Moving Average Model
A popular toy example for checking the performance of ABC and BSL techniques is the MA2 model:

z_i iid∼ N(0, 1), i = −1, 0, 1, ..., n,
y_i = z_i + θ1 z_{i−1} + θ2 z_{i−2}, i = 1, ..., n. (9.2)
The data are represented by the sequence y = {y1, · · · , yn}. It is well known that Yi follow
a stationary distribution for any θ1, θ2, but there are conditions required for identifiability.
Hence, we impose a uniform prior on the following set:
θ1 + θ2 > −1,  θ1 − θ2 < 1,  −2 < θ1 < 2,  −1 < θ2 < 2. (9.3)
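For concreteness, simulating from (9.2) and computing the sample variance and autocovariances used as summary statistics can be sketched as (our own code):

```python
import numpy as np

def simulate_ma2(theta1, theta2, n, rng):
    """Generate y_i = z_i + theta1*z_{i-1} + theta2*z_{i-2}, z_i ~ N(0,1)."""
    z = rng.normal(size=n + 2)            # z_{-1}, z_0, z_1, ..., z_n
    return z[2:] + theta1 * z[1:-1] + theta2 * z[:-2]

def summary_ma2(y):
    """S(y) = (gamma0, gamma1, gamma2): sample variance and the sample
    autocovariances at lags 1 and 2."""
    n = len(y)
    yc = y - y.mean()
    return np.array([yc @ yc / n,
                     yc[1:] @ yc[:-1] / n,
                     yc[2:] @ yc[:-2] / n])
```

For θ1 = θ2 = 0.6 the theoretical values are γ0 = 1 + θ1² + θ2² = 1.72, γ1 = θ1 + θ1θ2 = 0.96 and γ2 = θ2 = 0.6, which the sample statistics approach as n grows.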
The joint distribution of y is easily seen to be multivariate Gaussian with mean 0, diagonal variances 1 + θ1² + θ2², covariances θ1 + θ1θ2 and θ2 at lags 1 and 2 respectively, and zero at other lags. In this case, (Exact) sampling is feasible. For the simulations we set {θ1 = 0.6, θ2 = 0.6}, n = 200, and define the summary statistics S(y) = (γ0(y), γ1(y), γ2(y)) as the sample variance and covariances at lags 1 and 2. First we show results based on one
replicate. Figure 9.1 shows the trace plots, histograms and auto-correlation functions esti-
89
mated from posterior draws for parameters θ1 and θ2 for the AABC-U sampler. Note that
only post burn-in samples are shown. Similarly Figure 9.2 displays behavior of ABSL-U
Figure 9.1: MA2 model: AABC-U Sampler. Each row corresponds to parameters θ1 (toprow) and θ2 (bottom row) and shows in order from left to right: Trace-plot, Histogram andAuto-correlation function. Red lines represent true parameter values.
Figure 9.2: MA2 model: ABSL-U Sampler. Each row corresponds to parameters θ1 (toprow) and θ2 (bottom row) and shows in order from left to right: Trace-plot, Histogram andAuto-correlation function. Red lines represent true parameter values.
Figure 9.3: MA2 model: ABC-RW Sampler. Each row corresponds to parameters θ1 (toprow) and θ2 (bottom row) and shows in order from left to right: Trace-plot, Histogram andAuto-correlation function. Red lines represent true parameter values.
sampler. These algorithms can be compared to the standard ABC-RW method in Figure 9.3.
In the interest of keeping the exposition within reasonable limits, we do not report the performance
of the remaining algorithms; we note that AABC-L behaves similarly to AABC-U, ABSL-L to
ABSL-U, and ABC-IS is generally less efficient than ABC-RW. From these plots it
is evident that the proposed AABC-U and ABSL-U mix much better than ABC-RW.
The auto-correlation functions for these two methods take quite small values because they use
an independent proposal, whereas the random walk proposal depends on the
current state.
To see how close the draws from the approximate samplers are to the draws from the exact
chain, we plot estimated densities in Figure 9.4. The left and right plots refer to θ1 and
θ2, respectively. The two upper plots compare the estimated density of the exact MCMC sampler
with the ABC-based ones (SMC, ABC-RW and AABC-U), while the two lower plots compare
the exact sampler with the Synthetic Likelihood based methods (BSL-IS and ABSL-U). All
approximate samplers' draws deviate from the exact samples; however, the posterior distribution of
AABC-U is very similar to those of SMC and ABC-RW, and likewise the distribution produced by ABSL-U
is very close to that of BSL-IS. This observation holds for both components, θ1 and θ2. The
difference between the approximate posterior distributions produced by simulation-based methods
and the exact posterior is probably due to the choice of summary statistic, which does not
Figure 9.4: MA model: Estimated densities for each component. First row compares Exact,SMC, ABC-RW and AABC-U samplers. Second row compares Exact, BSL-IS and ABSL-U.Columns correspond to parameter’s components, from left to right: θ1 and θ2.
capture the information about the parameters in the most effective way.
To study the accuracy, precision and efficiency of the proposed samplers we perform a simulation
study in which 100 data sets are generated and all samplers are run on every data set. The
results are summarized in Table 9.1.

Table 9.1: Simulation Results (MA model): Average Difference in mean, Difference in
covariance, Total variation, square roots of Bias, Variance and MSE, Effective sample size
and Effective sample size per CPU time, for every sampling algorithm.

Sampler   DIM    DIC     TV     √Bias²  √Var   √MSE   ESS    ESS/CPU
SMC       0.082  0.0045  0.418  0.014   0.115  0.116  471    0.505
ABC-RW    0.088  0.0063  0.466  0.016   0.123  0.124  23     0.231
ABC-IS    0.084  0.0067  0.455  0.016   0.115  0.116  44     0.389
AABC-U    0.083  0.0071  0.444  0.018   0.116  0.117  3446   6.215
AABC-L    0.080  0.0067  0.438  0.017   0.112  0.113  2820   5.107
BSL-RW    0.082  0.0070  0.438  0.015   0.114  0.115  252    0.282
BSL-IS    0.081  0.0070  0.436  0.015   0.114  0.115  841    0.923
ABSL-U    0.081  0.0095  0.443  0.017   0.114  0.115  3950   5.584
ABSL-L    0.082  0.0078  0.441  0.015   0.114  0.115  4165   6.030
(DIM, DIC and TV measure the difference from the exact sampler; √Bias², √Var and √MSE the
difference from the true parameter.)

Examining this table we immediately note that the ESS/CPU measure is much larger for the
proposed algorithms than for the standard methods. The
improvement is very substantial: for example, ESS/CPU for AABC-U is 12 times larger than
for the best standard ABC procedure, SMC. Similar results hold for Bayesian Synthetic
Likelihood. The main reason for this efficiency is the use of past draws to make the decision
about accepting or rejecting a proposal. The improvement in efficiency is of no use if the resulting
posterior distributions are very different from the exact one. It is therefore essential to examine
the DIM, DIC, TV and MSE quantities, which measure how close the posterior draws are to
samples generated by the exact MCMC; for all of these, the smaller the value, the better
the sampler. We see that all these measures for AABC-U and AABC-L are very similar to those of
SMC, ABC-RW and ABC-IS, and frequently outperform them; the same holds for the BSL approach.
Another observation is that the approximate algorithms with uniform and linear weights
generally perform very similarly.
9.4 Ricker’s Model
Ricker's model is analyzed very frequently to test Synthetic Likelihood procedures [94, 73].
It is a particular instance of a hidden Markov model:

x_{−49} = 1;  z_i ∼iid N(0, exp(θ2)²),  i = −48, · · · , n,
x_i = exp(exp(θ1)) x_{i−1} exp(−x_{i−1} + z_i),  i = −48, · · · , n,
y_i ∼ Pois(exp(θ3) x_i),  i = −48, · · · , n,    (9.4)
where Pois(λ) is the Poisson distribution with mean parameter λ and n = 100. Only the sequence
y = (y_1, · · · , y_n) is observed; the first 50 values are discarded as burn-in. Note that all parameters
θ = (θ1, θ2, θ3) are unrestricted; the prior is given as (each parameter independent):

θ1 ∼ N(0, 1),
θ2 ∼ Unif(−2.3, 0),
θ3 ∼ N(0, 4).    (9.5)
We restrict the range of θ2 because all algorithms become unstable for θ2 outside this interval. Note
that the marginal distribution of y is not available in closed form, but the transition distribution
of the hidden variables X_i|x_{i−1} and the emission probabilities Y_i|x_i are known, and hence we can
run Particle MCMC (PMCMC) [8] or Ensemble MCMC [82] to sample from the posterior
distribution π(θ|y_0). Here we utilize Particle MCMC with 100 particles. As
suggested in [94] we set θ0 = (log(3.8), 0.9, 2.3) and define the summary statistic S(y) as the
14-dimensional vector whose components are:

(C1) #{i : y_i = 0},
(C2) The average of y, ȳ,
(C3–C7) Sample auto-correlations at lags 1 through 5,
(C8–C11) Coefficients β0, β1, β2, β3 of the cubic regression
(y_i − y_{i−1}) = β0 + β1 y_i + β2 y_i² + β3 y_i³ + ε_i,  i = 2, . . . , n,
(C12–C14) Coefficients β0, β1, β2 of the quadratic regression
y_i^{0.3} = β0 + β1 y_{i−1}^{0.3} + β2 y_{i−1}^{0.6} + ε_i,  i = 2, . . . , n.
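The data-generating step in (9.4) can be sketched as follows (the function names are mine, not from the thesis; numpy's `Generator.poisson` supplies the Poisson draws, and only a few of the 14 summary components are illustrated):

```python
import numpy as np

def simulate_ricker(theta, n=100, burn=50, rng=None):
    """Generate y_1..y_n from Ricker's model (9.4); the first `burn`
    hidden-state steps are discarded, as in the text."""
    rng = rng or np.random.default_rng()
    t1, t2, t3 = theta
    r, sigma, phi = np.exp(np.exp(t1)), np.exp(t2), np.exp(t3)
    x = 1.0                              # x_{-49} = 1
    ys = []
    for i in range(burn + n):
        x = r * x * np.exp(-x + sigma * rng.normal())
        if i >= burn:
            ys.append(rng.poisson(phi * x))
    return np.array(ys)

def ricker_summaries(y):
    """A few of the 14 components: zero count (C1), mean (C2),
    lag-1 auto-correlation (part of C3:C7)."""
    yc = y - y.mean()
    return np.array([np.sum(y == 0), y.mean(),
                     np.sum(yc[1:] * yc[:-1]) / np.sum(yc**2)])
```

Regenerating such pseudo-data for every proposed θ is the expensive step that the approximate samplers in this chapter try to amortize.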
Figures 9.5, 9.6 and 9.7 show trace-plots, histograms and the ACF for the AABC-U,
ABSL-U and ABC-RW samplers for each component (red lines correspond to the true
parameters). We show ABC-RW instead of ABC-IS here since the latter has much worse
Figure 9.5: Ricker’s model: AABC-U Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
performance for this model. The main observation is that the mixing of AABC-U is much
better than that of ABC-RW, with smaller auto-correlation values. ABSL-U has higher auto-
correlations than AABC-U but still performs quite well. To see how close the draws from
Figure 9.6: Ricker’s model: ABSL-U Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.7: Ricker’s model: ABC-RW Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
simulation-based algorithms are to the draws from the exact chain, we plot estimated densities
in Figure 9.8. The three upper plots (one per parameter component) compare the estimated
density of the exact PMCMC sampler (with 100 particles) with the ABC-based ones
(SMC, ABC-RW and AABC-U); the three lower plots compare the exact sampler with the Synthetic
Likelihood based methods (BSL-RW and ABSL-U). Here we have chosen BSL-RW over BSL-IS
since it has better overall performance in this model. Observe that the ABC-based samplers
Figure 9.8: Ricker’s model: Estimated densities for each component. First row comparesExact, SMC, ABC-RW and AABC-U samplers. Second row compares Exact, BSL-RW andABSL-U. Columns correspond to parameter’s components, from left to right: θ1, θ2 and θ3.
(SMC, ABC-RW and AABC-U) have very similar estimated densities, and the densities of the Synthetic
Likelihood methods are also similar to one another. For the second component there is a fairly large
difference between the exact and approximate posteriors, which may be due to non-informative summary
statistics.
A more general study, in which results are averaged over 100 independent replicates, is
shown in Table 9.2. Again, the proposed strategies clearly outperform the others in overall efficiency
(ESS/CPU). For instance, AABC-U is about 10 times more efficient than standard SMC and
ABSL-U is 6 times more efficient than BSL-RW. At the same time, DIM, DIC, TV and MSE
are generally smaller for the approximate methods than for the standard ones. It is therefore evident
that, for this model, the improvement in sampler efficiency (the number of independent draws
per unit of CPU time) does not decrease the accuracy and precision of the posterior moments.
Table 9.2: Simulation Results (Ricker's model): Average Difference in mean, Difference in
covariance, Total variation, square roots of Bias, Variance and MSE, Effective sample size
and Effective sample size per CPU time, for every sampling algorithm.

Sampler   DIM    DIC     TV     √Bias²  √Var   √MSE   ESS    ESS/CPU
SMC       0.152  0.0177  0.378  0.086   0.201  0.219  472    0.521
ABC-RW    0.135  0.0201  0.389  0.059   0.180  0.189  87     0.199
ABC-IS    0.139  0.0215  0.485  0.063   0.195  0.205  47     0.099
AABC-U    0.147  0.0279  0.402  0.076   0.190  0.204  3563   4.390
AABC-L    0.141  0.0258  0.392  0.070   0.189  0.201  4206   5.193
BSL-RW    0.129  0.0080  0.382  0.038   0.206  0.209  131    0.030
BSL-IS    0.122  0.0082  0.455  0.022   0.197  0.198  33     0.007
ABSL-U    0.103  0.0054  0.377  0.023   0.170  0.171  284    0.180
ABSL-L    0.106  0.0051  0.382  0.012   0.173  0.173  207    0.135
(DIM, DIC and TV measure the difference from the exact sampler; √Bias², √Var and √MSE the
difference from the true parameter.)
9.5 Stochastic Volatility with Gaussian emissions
When analyzing stationary time series one frequently observes periods of high and periods of low
volatility. This phenomenon is called volatility clustering; see for example
[59]. One way to model such behaviour is through a Stochastic Volatility (SV) model,
where the variances of the observed time series depend on hidden states that themselves form
a stationary time series. Consider the following model, which depends on three parameters
(θ1, θ2, θ3):

x_1 ∼ N(0, 1/(1 − θ1²));  v_i ∼iid N(0, 1);  w_i ∼iid N(0, 1),  i = 1, · · · , n,
x_i = θ1 x_{i−1} + v_i,  i = 2, · · · , n,
y_i = √(exp(θ2 + exp(θ3) x_i)) w_i,  i = 1, · · · , n.    (9.6)
Only y = (y_1, · · · , y_n) is observed, while (x_1, · · · , x_n) are hidden states. The first parameter,
θ1, must be between −1 and 1 and controls the auto-correlation of the hidden states; θ2 and θ3 are
unrestricted and determine how the hidden states influence the variability of the observed series.
Note that for fixed hidden states the distribution of the observed variable is normal, which
might not be appropriate in some examples. We introduce the following priors, independently
for each parameter:

θ1 ∼ Unif(0, 1),
θ2 ∼ N(0, 1),
θ3 ∼ N(0, 1).    (9.7)
We set the true parameters to (θ1 = 0.95, θ2 = −2, θ3 = −1) and the length of the time series
to n = 500. Since the marginal distribution of y is not known in closed form, a standard MCMC
strategy cannot be implemented; we use Particle MCMC (PMCMC) as the exact sampling
scheme. Since pseudo-data sets can easily be generated for every parameter value, the SV model
is a good example for demonstrating the performance of the algorithms considered here. For
summary statistics we use a 7-dimensional vector whose components are:

(C1) #{i : y_i² > quantile(y_0², 0.99)},
(C2) The average of y²,
(C3) The standard deviation of y²,
(C4) The sum of the first 5 auto-correlations of y²,
(C5) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.1)}}_{i=1}^n,
(C6) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.5)}}_{i=1}^n,
(C7) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.9)}}_{i=1}^n.
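A minimal sketch of generating pseudo-data from (9.6) and computing the seven summaries above (the function names are my own, not from the thesis):

```python
import numpy as np

def simulate_sv(theta, n=500, rng=None):
    """Generate y_1..y_n from the Gaussian SV model (9.6)."""
    rng = rng or np.random.default_rng()
    t1, t2, t3 = theta
    x = np.empty(n)
    x[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - t1**2))   # stationary start
    for i in range(1, n):
        x[i] = t1 * x[i - 1] + rng.normal()
    return np.sqrt(np.exp(t2 + np.exp(t3) * x)) * rng.normal(size=n)

def sum_first_acfs(x, k=5):
    """Sum of the first k sample auto-correlations of x."""
    xc = x - x.mean()
    denom = np.sum(xc**2)
    return sum(np.sum(xc[a:] * xc[:-a]) / denom for a in range(1, k + 1))

def sv_summaries(y, y0):
    """The 7 components (C1)-(C7); y0 is the observed series used in (C1)."""
    y2 = y**2
    stats = [np.sum(y2 > np.quantile(y0**2, 0.99)), y2.mean(), y2.std(),
             sum_first_acfs(y2)]
    for tau in (0.1, 0.5, 0.9):
        stats.append(sum_first_acfs((y2 < np.quantile(y2, tau)).astype(float)))
    return np.array(stats)
```

Because the parameters only affect the variability of y, all summaries are functions of y², mirroring the discussion below.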
Here quantile(y, τ) is defined as the τ-quantile of the sequence y. As was shown in [81] and [24],
the auto-correlations of such indicators (at different quantiles) can be very useful in characterizing
a time series, which is why we have added (C5), (C6) and (C7) to the summary
statistic. We focus on y² and its auto-correlations since the model parameters only affect the
variability of y (the auto-correlation of y is zero at any lag). Figures 9.9, 9.10 and 9.11 show
trace-plots, histograms and the ACF for the AABC-U, ABSL-U and ABC-RW samplers,
respectively, for each component (red lines correspond to the true parameters). The major
observation is that the mixing of AABC-U is much better than that of ABC-RW, with smaller auto-
correlation values. ABSL-U has higher auto-correlations than AABC-U but still performs
well. In Figure 9.12 we compare the sample-based kernel smoothing density estimates (for the
synthetic likelihood approach we show BSL-IS rather than BSL-RW). We note that all samples obtained from the approximate
Figure 9.9: SV model: AABC-U Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.10: SV model: ABSL-U Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
algorithms are close to the exact posterior (produced using PMCMC with 100 particles). Generally,
all ABC-based samplers perform similarly; on the other hand, ABSL-U performs worse than
the generic BSL-IS in this run, as its density is shifted away from the exact posterior for θ1 and θ3.
Figure 9.11: SV model: ABC-RW Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.12: SV model: Estimated densities for each component. First row compares Exact,SMC, ABC-RW and AABC-U samplers. Second row compares Exact, BSL-IS and ABSL-U.Columns correspond to parameter’s components, from left to right: θ1, θ2 and θ3.
To draw more general conclusions, Table 9.3 shows results averaged over 100 data replicates.
Again we note that the proposed algorithms outperform the benchmark samplers by a factor of
about 8 in ESS/CPU. Moreover, AABC-U and AABC-L have very similar or smaller values
Table 9.3: Simulation Results (SV model): Average Difference in mean, Difference in
covariance, Total variation, square roots of Bias, Variance and MSE, Effective sample size
and Effective sample size per CPU time, for every sampling algorithm.

Sampler   DIM    DIC     TV     √Bias²  √Var   √MSE   ESS    ESS/CPU
SMC       0.232  0.0428  0.417  0.187   0.255  0.316  471    0.336
ABC-RW    0.210  0.0396  0.459  0.228   0.255  0.342  31     0.097
ABC-IS    0.179  0.0439  0.460  0.196   0.219  0.294  30     0.090
AABC-U    0.194  0.0447  0.424  0.212   0.217  0.304  1793   2.445
AABC-L    0.189  0.0441  0.420  0.211   0.235  0.316  1659   2.253
BSL-RW    0.200  0.0360  0.411  0.175   0.227  0.287  131    0.043
BSL-IS    0.195  0.0362  0.404  0.175   0.225  0.285  346    0.113
ABSL-U    0.229  0.0422  0.551  0.184   0.241  0.303  871    0.822
ABSL-L    0.231  0.0410  0.548  0.197   0.240  0.311  843    0.817
(DIM, DIC and TV measure the difference from the exact sampler; √Bias², √Var and √MSE the
difference from the true parameter.)
for DIM, TV and MSE, which demonstrates that these samplers are much more efficient
than the standard methods while producing parameter estimates that are as accurate as (or more
accurate than) those of the generic algorithms.
ABSL-U and ABSL-L, on the other hand, did not perform well for this model: TV and MSE
for these samplers are roughly 10% larger than for the generic ones.
9.6 Stochastic Volatility with α-Stable errors
As pointed out in the previous section, the standard SV model assumes that, conditional
on the hidden states, the observed variables have a normal distribution, which is a strong assumption.
Financial time series frequently exhibit large sudden drops that are very unlikely under
Gaussianity. It has therefore been suggested to use heavy-tailed distributions (instead of Gaussian)
to model financial data. We consider the family of α-Stable distributions, Stab(α, β),
with two parameters α ∈ (0, 2] (stability parameter) and β ∈ [−1, 1] (skewness parameter). Two
special cases are α = 1 and α = 2, which correspond to the Cauchy and Gaussian distributions
respectively; note that for α < 2 the distribution has infinite variance. We define the
following SV model with α-Stable errors, with four parameters (θ1, θ2, θ3, θ4):

x_1 ∼ N(0, 1/(1 − θ1²));  v_i ∼iid N(0, 1);  w_i ∼iid Stab(θ4, −1),  i = 1, · · · , n,
x_i = θ1 x_{i−1} + v_i,  i = 2, · · · , n,
y_i = √(exp(θ2 + exp(θ3) x_i)) w_i,  i = 1, · · · , n.    (9.8)
This model is very similar to the simple SV model, the only difference being that the emission
errors follow an α-Stable distribution with unknown stability parameter and fixed skewness of −1.
We prefer a negatively skewed emission distribution in order to model large negative financial
returns. As in the previous simulation example, θ2 and θ3 are unrestricted. The prior distribution
for this model is (independently for each parameter):

θ1 ∼ Unif(0, 1),
θ2 ∼ N(0, 1),
θ3 ∼ N(0, 1),
θ4 ∼ Unif(1.5, 2).    (9.9)
We set the true parameters to (θ1 = 0.95, θ2 = −2, θ3 = −1, θ4 = 1.8) and the length of the
time series to n = 500. The major challenge with this model is that there is no closed-form
density for α-Stable distributions. Hence, most MCMC samplers, including PMCMC and
ensemble MCMC, cannot be used to sample from the posterior. However, sampling from this
family of distributions is feasible, which makes it particularly amenable to simulation-based
methods like ABC and BSL. For summary statistics we use a 7-dimensional vector whose
components are:

(C1) #{i : y_i² > quantile(y_0², 0.99)},
(C2) The average of y²,
(C3) The standard deviation of y²,
(C4) The sum of the first 5 auto-correlations of y²,
(C5) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.1)}}_{i=1}^n,
(C6) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.5)}}_{i=1}^n,
(C7) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.9)}}_{i=1}^n.
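Pseudo-data generation only requires draws from Stab(θ4, −1), which are available without a closed-form density. The following sketch uses the Chambers–Mallows–Stuck transform (this implementation and the function names are mine, not from the thesis; `scipy.stats.levy_stable` would be an alternative):

```python
import numpy as np

def rstable(alpha, beta, size, rng):
    """Chambers-Mallows-Stuck draws from a standardized Stab(alpha, beta);
    this form is valid for alpha != 1 (the relevant case here, alpha > 1.5)."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    b = np.arctan(beta * np.tan(np.pi * alpha / 2)) / alpha
    s = (1 + beta**2 * np.tan(np.pi * alpha / 2)**2) ** (1 / (2 * alpha))
    return (s * np.sin(alpha * (u + b)) / np.cos(u) ** (1 / alpha)
            * (np.cos(u - alpha * (u + b)) / w) ** ((1 - alpha) / alpha))

def simulate_sv_stable(theta, n=500, rng=None):
    """Generate y_1..y_n from the alpha-Stable SV model (9.8)."""
    rng = rng or np.random.default_rng()
    t1, t2, t3, t4 = theta
    x = np.empty(n)
    x[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - t1**2))
    for i in range(1, n):
        x[i] = t1 * x[i - 1] + rng.normal()
    return np.sqrt(np.exp(t2 + np.exp(t3) * x)) * rstable(t4, -1.0, n, rng)
```

At α = 2 the transform reduces to a Gaussian with variance 2 (the usual convention for the standardized stable law), which provides a convenient sanity check.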
Figures 9.13, 9.14 and 9.15 show trace-plots, histograms and the ACF for the AABC-U,
ABSL-U and ABC-RW samplers, respectively, for each component (red lines correspond to
the true parameters). As in the previous examples, the mixing of AABC-U and ABSL-U is
Figure 9.13: SV α-Stable model: AABC-U Sampler. Each row corresponds to parameters θ1
(top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and shows in orderfrom left to right: Trace-plot, Histogram and Auto-correlation function. Red lines representtrue parameter values.
much better than that of ABC-RW. Since exact sampling is not feasible in this example, we
compare the samplers to SMC (instead of exact samples); the estimated densities are plotted
in Figure 9.16. Here we have chosen BSL-IS over BSL-RW because it has better overall
performance in this model. Generally, all simulation-based samplers have similar densities in
this example.
For more general conclusions, Table 9.4 shows results averaged over 100 data replicates.
Here, to calculate DIM, DIC and TV, the samplers are compared to SMC since exact draws
cannot be obtained. As in previous examples, the ESS/CPU values for AABC-U, AABC-L, ABSL-
U and ABSL-L are roughly 8 times larger than those of the benchmark algorithms. For this example,
looking at DIM, DIC and TV may be misleading since the approximate samplers are compared
to another approximate sampler. The MSE measure is much more informative; it is very similar
across the ABC-based and BSL-based algorithms. We can therefore conclude that the proposed
samplers perform very well in this example.
Figure 9.14: SV α-Stable model: ABSL-U Sampler. Each row corresponds to parameters θ1
(top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and shows in orderfrom left to right: Trace-plot, Histogram and Auto-correlation function. Red lines representtrue parameter values.
Figure 9.15: SV α-Stable model: ABC-RW Sampler. Each row corresponds to parameters θ1
(top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and shows in orderfrom left to right: Trace-plot, Histogram and Auto-correlation function. Red lines representtrue parameter values.
Figure 9.16: SV α-Stable model: Estimated densities for each component. First row com-pares SMC, ABC-RW and AABC-U samplers. Second row compares SMC, BSL-IS andABSL-U. Columns correspond to parameter’s components, from left to right: θ1, θ2, θ3 andθ4.
Table 9.4: Simulation Results (SV α-Stable model): Average Difference in mean, Differencein covariance, Total variation, square roots of Bias, variance and MSE, Effective sample sizeand Effective sample size per CPU time, for every sampling algorithm. In DIM, DIC andTV, samplers are compared to SMC.
Sampler   DIM    DIC     TV     √Bias²  √Var   √MSE   ESS    ESS/CPU
SMC       0.000  0.0000  0.000  0.221   0.201  0.299  468    0.267
ABC-RW    0.078  0.0126  0.205  0.248   0.198  0.317  24     0.069
ABC-IS    0.082  0.0151  0.306  0.232   0.221  0.320  26     0.071
AABC-U    0.069  0.0124  0.170  0.250   0.183  0.310  1303   1.617
AABC-L    0.069  0.0132  0.161  0.246   0.181  0.305  1256   1.546
BSL-RW    0.044  0.0116  0.122  0.225   0.181  0.289  123    0.037
BSL-IS    0.045  0.0103  0.125  0.226   0.177  0.287  285    0.084
ABSL-U    0.063  0.0133  0.228  0.225   0.181  0.289  832    0.735
ABSL-L    0.061  0.0140  0.230  0.236   0.183  0.299  757    0.671
(DIM, DIC and TV measure the difference from SMC; √Bias², √Var and √MSE the difference from
the true parameter.)
Chapter 10
Data Analysis
10.1 Dow-Jones log-returns
As a real-world example we consider Dow-Jones index daily log returns from January 1, 2010
until December 31, 2018. The data were downloaded from the Yahoo Finance1 website. Given a
time series of prices P_i, i = 1, · · · , n, log returns are calculated as:

r_i = log(P_i) − log(P_{i−1}),  i = 2, · · · , n.
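The return computation and the standardization described next (centering to mean zero and scaling by 200) can be sketched as follows (the function name is my own):

```python
import numpy as np

def transform_returns(prices, scale=200.0):
    """Log returns, centered to mean zero and multiplied by `scale`
    so that absolute values are not too small."""
    r = np.diff(np.log(np.asarray(prices, dtype=float)))
    return scale * (r - r.mean())
```

Applied to n daily closing prices this yields n − 1 transformed returns with mean exactly zero by construction.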
The resulting time series has length 2262. To make the log returns more suitable for analysis,
we standardize r_t by subtracting its mean and then multiply each return by 200, so that
absolute values are not too small; Figure 10.1 shows the transformed returns. This time series
(y_0) has mean zero by construction, and its auto-correlations and partial auto-correlations are
insignificant at every lag; however, the variances are clearly correlated, with periods
of low and high variability. To analyze its properties we therefore apply the Stochastic Volatility
model with α-Stable errors described in the previous chapter. Since the likelihood does not exist
in closed form for this class of models, simulation-based methods are probably the only available
tools for inference.
10.2 Analysis
The evolution of time series described by equation (9.8) (note that skewed parameter of
Stable distribution is fixed at value of −1) and parameters’ prior as in equation (9.9). To
1https://ca.finance.yahoo.com/
Figure 10.1: Dow Jones daily transformed log return for the period Jan 2010 – Dec 2018.
estimate the posterior distribution we run the AABC-U and ABSL-U samplers. The summary statistic
for both methods is the 7-dimensional vector of Section 9.6. Each chain was run for 100
thousand iterations, with the last 80 thousand used for inference. Figures 10.2 and 10.3 show
trace-plots and histograms for the AABC-U and ABSL-U samplers, respectively, for each param-
eter. We observe that, as in the simulation results, the mixing of AABC-U is generally
better than that of ABSL-U. The posterior draws of ABSL-U for the first three components are
uni-modal, symmetric and bell-shaped, very similar to Gaussian distributions, which is
perhaps unsurprising given the Gaussian form of the synthetic likelihood. Table 10.1 reports
the posterior means and 95% credible intervals
for every parameter and both samplers. AABC-U and ABSL-U produce similar results.
Table 10.1: Dow Jones log return stochastic volatility: 95% credible intervals and posterior
averages for the 4 parameters under the two proposed samplers (AABC-U and ABSL-U).

                           AABC-U                              ABSL-U
Parameter   2.5% Quantile  Average  97.5% Quantile   2.5% Quantile  Average  97.5% Quantile
θ1               0.787      0.899       0.990             0.775      0.856       0.959
θ2              -0.411     -0.147       0.112            -0.369     -0.092       0.222
θ3              -1.405     -0.790      -0.304            -1.858     -0.841      -0.206
θ4               1.758      1.916       1.997             1.721      1.909       1.996
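The entries of Table 10.1 are simple functionals of the retained draws; a minimal sketch, with synthetic draws standing in for the actual chains:

```python
import numpy as np

def posterior_summary(draws):
    """2.5% quantile, mean, and 97.5% quantile of a 1-D array of MCMC draws."""
    lo, hi = np.percentile(draws, [2.5, 97.5])
    return lo, draws.mean(), hi

# Synthetic draws standing in for the 80,000 retained posterior samples.
rng = np.random.default_rng(0)
draws = rng.normal(0.9, 0.05, size=80_000)
lo, mean, hi = posterior_summary(draws)
```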
We see that the estimated correlation between adjacent variables in the hidden layer of the stochastic
Figure 10.2: Dow Jones log returns: AABC-U sampler. Each column corresponds to one parameter component (left to right: θ1, θ2, θ3, θ4) and shows the trace-plot on top and the histogram on the bottom.
Figure 10.3: Dow Jones log returns: ABSL-U sampler. Each column corresponds to one parameter component (left to right: θ1, θ2, θ3, θ4) and shows the trace-plot on top and the histogram on the bottom.
volatility model is about 0.9, and the estimated index of the α-stable emission noise is 1.91,
which can produce more extreme values than standard Gaussian noise would predict. Moreover,
since 0 lies inside the credible interval for θ2, this parameter appears negligible. Overall,
this example shows that the proposed samplers AABC-U and ABSL-U can be implemented
successfully for real-world data problems.
Chapter 11
Theoretical Justifications
In this chapter we show that our novel approximate ABC-MCMC and BSL samplers with
independent proposals are ergodic in the long run. In other words, we show that as the
number of MCMC iterations increases, the marginal distribution of {θ(t)} converges in total
variation to the appropriate posterior distribution, and that sample averages converge to the
true expectations.
11.1 Preliminary Theorems
We start by reviewing our notation. Let p(θ) and q(θ) represent the prior and proposal distri-
butions for θ ∈ Θ, respectively. For AABC we define the function h(θ) = P(δ < ε | θ), where
δ = δ(y, y0) and y ∼ f(y|θ). Then, given a proposed ζ∗, the acceptance probability is

a(θ, ζ∗) = min(1, α(θ, ζ∗)),   α(θ, ζ∗) = [p(ζ∗) q(θ) h(ζ∗)] / [p(θ) q(ζ∗) h(θ)].   (11.1)
This MH procedure defines an exact transition kernel, which we call P(·, ·). Since h(θ) is not
available in closed form, we estimate it using a k-nearest-neighbour (kNN) approach.
Let ZN = {ζn, 1{δn<ε}}, n = 1, · · · , N, represent N independent samples from q(ζ)P(1{δ<ε}|ζ) for AABC.
In practice, ZN contains the past simulated samples saved before the Nth iteration. Given θ
and ζ∗, we apply kNN to approximate h(θ) and h(ζ∗) by computing locally weighted averages
of 1{δn<ε} over the ζn that are close to θ or ζ∗. We denote this estimate by h(θ; ZN), and the
probability of proposal acceptance for the perturbed algorithm (more on perturbed MCMC
can be found in [77, 71, 46]) is:
a(θ, ζ∗; ZN) = min(1, α(θ, ζ∗; ZN)),   α(θ, ζ∗; ZN) = [p(ζ∗) q(θ) h(ζ∗; ZN)] / [p(θ) q(ζ∗) h(θ; ZN)].   (11.2)
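A minimal sketch of the kNN estimate h(θ; ZN) entering (11.2), assuming uniform weights over the K nearest stored proposals; the acceptance surface used to generate the indicators below is hypothetical.

```python
import numpy as np

def knn_h_estimate(theta, zetas, indicators, K):
    """k-nearest-neighbour estimate of h(theta) = P(delta < eps | theta):
    uniform-weight average of the stored indicators 1{delta_n < eps}
    over the K stored proposals zeta_n closest to theta (a sketch)."""
    dists = np.linalg.norm(zetas - theta, axis=1)
    nearest = np.argsort(dists)[:K]
    return indicators[nearest].mean()

rng = np.random.default_rng(0)
N = 400
zetas = rng.uniform(-1, 1, size=(N, 2))            # stored past proposals
true_h = lambda z: 0.5 + 0.4 * z[:, 0]             # hypothetical acceptance surface
indicators = (rng.uniform(size=N) < true_h(zetas)).astype(float)

h_hat = knn_h_estimate(np.array([0.0, 0.0]), zetas, indicators, K=int(np.sqrt(N)))
```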
The approximate transition kernel is PN(·, ·) = E_{ZN}[PN(·, ·; ZN)]. The goal is to show that
as N → ∞ the distance between this transition kernel and the exact one converges to zero,
where the distance is defined as

‖PN − P‖ = sup_θ ‖PN(θ, ·) − P(θ, ·)‖TV,   (11.3)

with ‖·‖TV denoting the total variation distance between two measures. First we show
that, under a strong consistency assumption on h(θ; ZN), the perturbed kernel converges to the
exact one.
Theorem 11.1.1. Suppose Θ is compact, sup_θ |h(θ; ZN) − h(θ)| → 0 with probability 1, and
h(θ) > 0 for all θ ∈ Θ. Then for any ε > 0 there exists C such that ‖PN − P‖ < ε for all
N > C.
Next, let Pε = {PN : ‖PN − P‖ < ε} be the collection of perturbed kernels within distance ε
of the exact kernel. For illustration, consider the case where the auxiliary set ZN grows
with the number of iterations; then at each iteration a new kernel PN ∈ Pε is used in the
chain. We want to show that this procedure results in an ergodic chain with appropriate
convergence guarantees. For most of the results presented below we rely on the work of [47] on
convergence properties of perturbed kernels.
To obtain useful convergence results we make an additional Doeblin-condition assump-
tion on the exact kernel P:

Definition 11.1.1 (Doeblin Condition). A kernel P satisfies the Doeblin condition if there exists 0 < α < 1 such that

sup_{(θ,ζ∗)∈Θ×Θ} ‖P(θ, ·) − P(ζ∗, ·)‖TV < 1 − α.

We also choose ε < α/2 and set α∗ = α − 2ε > 0, which by Remark 2.1 in [47]
guarantees that every member of Pε satisfies the Doeblin condition with constant α∗ and has a
unique invariant measure. Thus we make the following three assumptions:
(A1) The exact transition kernel P satisfies the Doeblin condition,

(A2) every PN ∈ Pε satisfies ‖PN − P‖ < ε,

(A3) ε < min(α/2, (1− α)/2).
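For a finite state space the Doeblin condition can be verified directly by computing the largest total variation distance between rows of the transition matrix; a small sketch with an illustrative kernel:

```python
import numpy as np

def doeblin_gap(P):
    """Maximum over state pairs of the total variation distance between rows
    of transition matrix P; the Doeblin condition holds iff this is < 1."""
    n = P.shape[0]
    return max(0.5 * np.abs(P[i] - P[j]).sum()
               for i in range(n) for j in range(n))

# A strictly positive kernel on 3 states satisfies the Doeblin condition.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
gap = doeblin_gap(P)   # sup_{i,j} ||P(i,.) - P(j,.)||_TV
```

Here the gap is 0.3, so the condition holds with any α < 0.7.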
Now, let µ be the invariant measure of the exact kernel P, and let the perturbed chain
θ(0), θ(1), · · · , θ(t) be a Markov chain with θ(0) ∼ ν = µ0. Denote the marginal distribution of
θ(t) by µt = νP0P1 · · · Pt, t = 1, 2, · · · , with each Pt ∈ Pε and with P0 the
identity transition (for convenience). First we examine the total variation distance
between µ and the average measure (1/M) Σ_{t=0}^{M−1} µt, in other words

‖µ − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt‖TV,   where P0 = I.   (11.4)
Then we have the following important convergence result:

Theorem 11.1.2. Suppose the exact kernel P satisfies (A1), every member of Pε satisfies (A2), and
ε is chosen to satisfy (A3). Let ν be any probability measure on (Θ, F0); then

‖µ − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt‖TV ≤ (1 − (1 − α)^M)‖µ − ν‖TV / (Mα) − ε(1 − (1 − α)^M) / (Mα²) + ε/α,   (11.5)

which implies that this difference can be made arbitrarily small for sufficiently large M and small
enough ε.
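The right-hand side of (11.5) is easy to evaluate numerically; the sketch below illustrates that the bound decreases as M grows, taking ‖µ − ν‖TV = 1 as the worst case.

```python
def tv_bound(M, alpha, eps, tv0=1.0):
    """Right-hand side of (11.5) with ||mu - nu||_TV = tv0 (a sketch)."""
    g = 1.0 - (1.0 - alpha) ** M
    return g * tv0 / (M * alpha) - eps * g / (M * alpha ** 2) + eps / alpha

b_small = tv_bound(M=100, alpha=0.2, eps=0.01)      # short chain
b_large = tv_bound(M=10_000, alpha=0.2, eps=0.01)   # long chain
```

As M grows the bound approaches ε/α, the irreducible perturbation error.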
Next we focus on the mean squared error (MSE)

E[(µf − (1/M) Σ_{t=0}^{M−1} f(θ(t)))²],

where f is a bounded function and µf = E_µ[f(θ)]. The main objective is to find an upper
bound on this MSE when perturbed MCMC is used and to see how it depends on the number of
samples M. To obtain the main result we introduce the following lemma:
Lemma 11.1.3. Suppose θ(0) ∼ ν, with ν any distribution, and let µt = νP1 · · · Pt be the marginal
distribution of θ(t), t = 1, 2, · · · , with Pt ∈ Pε and ε satisfying (A2) and (A3), respectively.
Moreover, let f(θ) and g(θ) be bounded functions with |f| = sup_θ f(θ) and |g| = sup_θ g(θ);
then

cov(f(θ(j)), g(θ(k))) ≤ 8|f||g|(1 − α∗)^{|k−j|}.
For the proofs we will also use the following two theorems: the first concerns the strong
uniform consistency of kNN estimators, the second the uniform ergodicity of the Metropolis-
Hastings algorithm with an independent proposal.
Theorem 11.1.4 (Uniform consistency of kNN - [16]). Given independent pairs {ζn, δn}, n = 1, · · · , N, let Θ
be the support of the distribution of ζ, h(ζ) = E(δ|ζ), and hN(ζ) = Σ_{j=1}^{N} WNj δj the kNN
estimator, where the indices j are permuted so that the distances between ζj and ζ increase from
smallest to largest. Suppose the weights WNj satisfy

(i) Σ_{j=1}^{N} WNj = 1,

(ii) WNj = 0 for j > K, where K = K(N) with K → ∞ and K/N → 0,

(iii) sup_N K max_j WNj < ∞.

If

(i) Θ is compact,

(ii) h(ζ) is a continuous function,

(iii) Var(δ|ζ) is bounded,

(iv) K(N) satisfies (K/√N) log(N) → ∞,

then sup_{ζ∈Θ} |hN(ζ) − h(ζ)| → 0 with probability 1.
Note that both the uniform and the linear weights satisfy the above assumptions on WNj.
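The two weight schemes can be sketched directly, with assertions mirroring conditions (i)-(iii) above; K = √N matches the choice made later in the ergodicity results.

```python
import numpy as np

def uniform_weights(N, K):
    """W_Nj = 1/K for the K nearest neighbours, 0 otherwise."""
    w = np.zeros(N)
    w[:K] = 1.0 / K
    return w

def linear_weights(N, K):
    """Linearly decaying weights K, K-1, ..., 1 over the K nearest
    neighbours, normalized to sum to one."""
    w = np.zeros(N)
    raw = K - np.arange(K)
    w[:K] = raw / raw.sum()
    return w

N = 100
K = int(np.sqrt(N))
wu, wl = uniform_weights(N, K), linear_weights(N, K)
```

For both schemes, K·max_j W_Nj stays bounded (by 1 and by 2K/(K+1) < 2, respectively), so condition (iii) holds uniformly in N.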
Theorem 11.1.5 (Independent Metropolis sampler - [62]). Suppose θ(t) is an MH Markov chain
with invariant distribution π(θ), independent proposal q(θ) and acceptance probabilities

a(θ, ζ∗) = min(1, [π(ζ∗) q(θ)] / [π(θ) q(ζ∗)]).

If there exists β > 0 such that q(θ)/π(θ) > β for all θ ∈ Θ, then the algorithm is uniformly
ergodic, so that ‖P^n(θ, ·) − π‖TV < (1 − β)^n, where P^n(θ, ·) is the conditional distribution of θ(n)
given θ(0) = θ.
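A minimal independence Metropolis sampler illustrating Theorem 11.1.5; here the target is standard normal and the heavier-tailed N(0, 2²) proposal keeps q(θ)/π(θ) bounded below by 1/2 (the target and names are illustrative, not the thesis model).

```python
import numpy as np

def independent_mh(log_target, sampler, log_proposal, n_iter, theta0, rng):
    """Metropolis-Hastings with an independent proposal q (a sketch).
    Acceptance ratio: [pi(z) q(theta)] / [pi(theta) q(z)]."""
    theta, chain = theta0, []
    for _ in range(n_iter):
        z = sampler(rng)
        log_a = (log_target(z) + log_proposal(theta)
                 - log_target(theta) - log_proposal(z))
        if np.log(rng.uniform()) < min(0.0, log_a):
            theta = z
        chain.append(theta)
    return np.array(chain)

rng = np.random.default_rng(0)
# Target N(0,1); proposal N(0,4): q/pi = 0.5 * exp(3x^2/8) >= 1/2 everywhere.
chain = independent_mh(lambda x: -0.5 * x ** 2,
                       lambda r: r.normal(0.0, 2.0),
                       lambda x: -0.5 * (x / 2.0) ** 2,
                       5000, 0.0, rng)
```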
11.2 Main Results
The next important convergence result follows (similar to Theorem 2.5 of [47]):
Theorem 11.2.1 (Approximation of MSE). Suppose P, Pε and ε satisfy (A1), (A2) and
(A3), respectively. Let µ represent the invariant measure of P, let f(θ) be a bounded function,
and let θ(0) ∼ ν. Then

E[(µf − (1/M) Σ_{t=0}^{M−1} f(θ(t)))²]
≤ 4|f|² [(1 − (1 − α)^M)/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α]²
+ 8|f|² [1/M + (2/(α∗)²)(((1 − α∗)^{M+1} − (1 − α∗))/M² + ((1 − α∗) − (1 − α∗)²)/M)].   (11.6)

In other words, this expectation can be made arbitrarily small for sufficiently large M and
small enough ε.
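The bound (11.6) can likewise be evaluated numerically; the sketch below shows it shrinking as M increases, with |f| = 1.

```python
def mse_bound(M, alpha, alpha_star, eps, f_sup=1.0):
    """Right-hand side of (11.6) for |f| = f_sup (a sketch)."""
    g = 1.0 - (1.0 - alpha) ** M
    bias = g / (M * alpha) - eps * g / (M * alpha ** 2) + eps / alpha
    r = 1.0 - alpha_star
    var = 1.0 / M + (2.0 / alpha_star ** 2) * (
        (r ** (M + 1) - r) / M ** 2 + (r - r ** 2) / M)
    return 4.0 * f_sup ** 2 * bias ** 2 + 8.0 * f_sup ** 2 * var

b1 = mse_bound(M=100, alpha=0.2, alpha_star=0.1, eps=0.01)
b2 = mse_bound(M=100_000, alpha=0.2, alpha_star=0.1, eps=0.01)
```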
Based on these theorems we can now obtain convergence results for the AABC and ABSL
algorithms.
Theorem 11.2.2 (Ergodicity of AABC). Consider the proposed AABC sampler (with thresh-
old ε). Let p(θ) represent the prior measure on Θ and ZN the simulated pairs {ζn, 1{δn<ε}}, n = 1, · · · , N (ζn ∼ q(ζ)),
under the following assumptions:

(B1) Θ is a compact set.

(B2) q(θ) > 0 is a continuous density of the independent proposal distribution.

(B3) p(θ) > 0 is a continuous density of the prior distribution.

(B4) h(θ) = P(δ < ε|θ) > 0 is a continuous function of θ.

(B5) In the kNN estimation, K(N) = √N with uniform or linear weights.

Then for sufficiently large N (number of past simulations) and M (number of chain itera-
tions), (A1)-(A3) are satisfied and the error bounds of Theorems 11.1.2 and 11.2.1 follow.
Corollary 11.2.2.1 (Ergodicity of ABSL). Consider the proposed ABSL algorithm. Let p(θ)
represent the prior measure on Θ, h(θ) = N(s0; µθ, Σθ), and ZN the simulated pairs {ζn, sn}, n = 1, · · · , N (ζn ∼
q(ζ), with sn the summary statistic), under the following assumptions:

(B1) Θ is a compact set.

(B2) q(θ) > 0 is a continuous density of the independent proposal distribution.

(B3) p(θ) > 0 is a continuous density of the prior distribution.

(B4) h(θ) is a continuous function of θ.

(B5) |Σθ| > a0, where Σθ = Var(s|θ), for every θ ∈ Θ.

(B6) E[sj|θ] and E[sj sk|θ] are continuous functions of θ for every 1 ≤ j, k ≤ p, with sj
the jth component of the summary statistic s.

(B7) Var[sj|θ] and Var[sj sk|θ] are bounded functions.

(B8) In the kNN estimation, K(N) = √N with uniform or linear weights.

Then for sufficiently large N (number of past simulations) and M (number of chain itera-
tions), (A1)-(A3) are satisfied and the error bounds of Theorems 11.1.2 and 11.2.1 follow.
11.3 Proofs of Theorems
Proof. [Proof of Theorem 11.1.1] Note that sup_θ |h(θ; ZN) − h(θ)| → 0 w.p. 1 implies that,
for all θ and ζ∗ in Θ,

h(θ; ZN) →p h(θ),   h(ζ∗; ZN) →p h(ζ∗),

and therefore, by Slutsky's theorem,

h(ζ∗; ZN)/h(θ; ZN) →p h(ζ∗)/h(θ)

for all (θ, ζ∗) in Θ × Θ. Hence

α(θ, ζ∗; ZN) = [p(ζ∗) q(θ) h(ζ∗; ZN)] / [p(θ) q(ζ∗) h(θ; ZN)] →p [p(ζ∗) q(θ) h(ζ∗)] / [p(θ) q(ζ∗) h(θ)] = α(θ, ζ∗).

Since min(1, x) is a continuous function, the continuous mapping theorem implies that

a(θ, ζ∗; ZN) = min(1, α(θ, ζ∗; ZN)) →p min(1, α(θ, ζ∗)) = a(θ, ζ∗).
Note that this is not just point-wise convergence but uniform convergence in probability, so
that one C works for all (θ, ζ∗). That is, for any δ > 0 and ε > 0 there exists C
such that for all N > C and all (θ, ζ∗), P(|a(θ, ζ∗; ZN) − a(θ, ζ∗)| > δ) < ε.
Another important observation is that (fixing θ and ζ∗ and writing a = a(θ, ζ∗) and
aN = a(θ, ζ∗; ZN) for convenience)

E_{ZN}(|a − aN|) = ∫ |a − aN| dF(ZN) = ∫_{|a−aN|<δ} |a − aN| dF(ZN) + ∫_{|a−aN|≥δ} |a − aN| dF(ZN)
≤ δ + ∫_{|a−aN|≥δ} dF(ZN) ≤ δ + ε,   (11.7)

since |a − aN| ≤ 1, using the definition of convergence in probability. This inequality
shows that the expected value can be made arbitrarily small by taking N large enough;
moreover, the result is uniform, so one N works for all θ and ζ∗.
Next we focus on the distance between the two transition kernels; this discussion is similar to
the proof of Corollary 2.3 in [5]. Observe that (using independent proposals)

P(θ, dζ∗) = q(ζ∗) a(θ, ζ∗) dζ∗ + δθ(dζ∗) r(θ),
PN(θ, dζ∗) = ∫ q(ζ∗) a(θ, ζ∗; ZN) dF(ZN) dζ∗ + δθ(dζ∗) rN(θ),

where r(θ) = 1 − ∫ q(ζ∗) a(θ, ζ∗) dζ∗ and rN(θ) = 1 − ∫∫ q(ζ∗) a(θ, ζ∗; ZN) dζ∗ dF(ZN). Fix θ ∈ Θ,
and note that the total variation distance between two probability distributions with densities
π1 and π2 can also be written as

‖π1 − π2‖TV = 0.5 ∫ |π1(θ) − π2(θ)| dθ.
Therefore

P(θ, dζ∗) − PN(θ, dζ∗) = ∫ q(ζ∗)(a(θ, ζ∗) − a(θ, ζ∗; ZN)) dF(ZN) dζ∗
− δθ(dζ∗) ∫∫ q(t)(a(θ, t) − a(θ, t; ZN)) dF(ZN) dt,   (11.8)

and it follows that

‖P(θ, ·) − PN(θ, ·)‖TV = 0.5 { ∫ |∫ q(ζ∗)(a(θ, ζ∗) − a(θ, ζ∗; ZN)) dF(ZN)| dζ∗
+ |∫∫ q(t)(a(θ, t) − a(θ, t; ZN)) dF(ZN) dt| }
≤ ∫∫ q(t) |a(θ, t) − a(θ, t; ZN)| dF(ZN) dt ≤ δ + ε   (11.9)

for any ε > 0 and δ > 0 and large enough N, by (11.7). Since this holds for any θ ∈ Θ,
we finally obtain the main result:

sup_θ ‖PN(θ, ·) − P(θ, ·)‖TV ≤ δ + ε.   (11.10)
Proof. [Proof of Theorem 11.1.2] We generally follow the proof of Theorem 2.4 in [47]. First
observe that

νP0 · · · PM − µP^M = (ν − µ)P^M + Σ_{t=0}^{M−1} νP0 · · · Pt (Pt+1 − P) P^{M−t−1}.

By assumptions (A2) and (A3) we get

‖νP0 · · · Pt Pt+1 − νP0 · · · Pt P‖TV ≤ ε

and

‖νP0 · · · Pt Pt+1 P^{M−t−1} − νP0 · · · Pt P P^{M−t−1}‖TV ≤ ε(1 − α)^{M−t−1}.

Using these results, the triangle inequality and the formula for the sum of a finite geometric series,
we establish that

‖νP0 · · · PM − µP^M‖TV ≤ ‖µP^M − νP^M‖TV + Σ_{t=0}^{M−1} ‖νP0 · · · Pt Pt+1 P^{M−t−1} − νP0 · · · Pt P P^{M−t−1}‖TV
≤ (1 − α)^M ‖µ − ν‖TV + ε Σ_{t=0}^{M−1} (1 − α)^{M−t−1}
= (1 − α)^M ‖µ − ν‖TV + ε (1 − (1 − α)^M)/α.   (11.11)
Finally we get the main result using the fact that µ is invariant for P (again using the sum of a
finite geometric series):

‖µ − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt‖TV = ‖(1/M) Σ_{t=0}^{M−1} µP^t − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt‖TV
≤ (1/M) Σ_{t=0}^{M−1} ‖µP^t − νP0 · · · Pt‖TV
≤ (1/M) Σ_{t=0}^{M−1} [(1 − α)^t ‖µ − ν‖TV + ε (1 − (1 − α)^t)/α]
= (1 − (1 − α)^M)‖µ − ν‖TV/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α.   (11.12)
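The last equality above rests on the finite geometric sum (1/M) Σ_{t=0}^{M−1} (1 − α)^t = (1 − (1 − α)^M)/(Mα), which is easy to confirm numerically:

```python
# Numerical check of the Cesaro average of a finite geometric series.
alpha, M = 0.3, 50
lhs = sum((1.0 - alpha) ** t for t in range(M)) / M
rhs = (1.0 - (1.0 - alpha) ** M) / (M * alpha)
```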
Proof. [Proof of Lemma 11.1.3] Without loss of generality assume k > j, and define

f̃(θ(j)) = f(θ(j)) − µj f,   g̃(θ(k)) = g(θ(k)) − µk g,

so that E[f̃(θ(j))] = E[g̃(θ(k))] = 0. Then we get

cov(f(θ(j)), g(θ(k))) = E[f̃(θ(j)) g̃(θ(k))] = E[E[f̃(θ(j)) g̃(θ(k)) | θ(j)]]
= E[f̃(θ(j)) E[g̃(θ(k)) | θ(j)]] = E_{θ(j)}[f̃(θ(j)) δ_{θ(j)} Pj+1 · · · Pk g̃],   (11.13)

where δθ is the point mass at θ, and in our notation δ_{θ(j)} Pj+1 · · · Pk is the conditional
distribution of θ(k) given the value of θ(j).

Using the general observation that for any two measures ν1 and ν2 and any bounded function
f the following inequality holds,

|ν1 f − ν2 f| ≤ 2|f| ‖ν1 − ν2‖TV,   (11.14)

we find that

|δ_{θ(j)} Pj+1 · · · Pk g̃| = |δ_{θ(j)} Pj+1 · · · Pk g̃ − 0| = |δ_{θ(j)} Pj+1 · · · Pk g̃ − µk g̃|
= |δ_{θ(j)} Pj+1 · · · Pk g̃ − µj Pj+1 · · · Pk g̃| ≤ 2|g̃| ‖δ_{θ(j)} Pj+1 · · · Pk − µj Pj+1 · · · Pk‖TV
≤ 2|g̃| (1 − α∗)^{|k−j|},   (11.15)

and note that this holds for any θ(j) ∈ Θ. Returning to (11.13) we get

cov(f(θ(j)), g(θ(k))) ≤ 2|f̃||g̃| (1 − α∗)^{|k−j|}.   (11.16)

Finally, by the triangle inequality, |f̃| ≤ 2|f| and similarly |g̃| ≤ 2|g|. The
desired result follows immediately.
Proof. [Proof of Theorem 11.2.1] Using our standard notation νP0 · · · Pt f = E[f(θ(t))],
Theorem 11.1.2, Lemma 11.1.3 and standard results for double sums of geometric series, we get

E[(µf − (1/M) Σ_{t=0}^{M−1} f(θ(t)))²]
= E[(µf − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt f + (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt f − (1/M) Σ_{t=0}^{M−1} f(θ(t)))²]
= (µf − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt f)² + E[((1/M) Σ_{t=0}^{M−1} νP0 · · · Pt f − (1/M) Σ_{t=0}^{M−1} f(θ(t)))²]
≤ [2|f| ((1 − (1 − α)^M)‖µ − ν‖TV/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α)]²
+ (1/M²) Σ_{j=0}^{M−1} Σ_{t=0}^{M−1} cov(f(θ(j)), f(θ(t)))
≤ 4|f|² [(1 − (1 − α)^M)/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α]²
+ (8|f|²/M²) Σ_{j=0}^{M−1} Σ_{t=0}^{M−1} (1 − α∗)^{|t−j|}
= 4|f|² [(1 − (1 − α)^M)/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α]²
+ 8|f|² [1/M + (2/(α∗)²)(((1 − α∗)^{M+1} − (1 − α∗))/M² + ((1 − α∗) − (1 − α∗)²)/M)],   (11.17)

obtaining the desired result.
Proof. [Proof of Theorem 11.2.2] First, by (B1)-(B4), Theorem 11.1.5 guarantees uniform
ergodicity of the exact chain P with β = min_{θ∈Θ} q(θ)/(p(θ)h(θ)/c), where c is the normalizing
constant of the posterior. Note that β > 0, since Θ is compact and the ratio is continuous and
never zero; therefore P also satisfies the Doeblin condition. Next, from (B1), (B4) and (B5),
Theorem 11.1.4 implies that sup_{θ∈Θ} |h(θ; ZN) − h(θ)| → 0 with probability 1. Hence, by
Theorem 11.1.1, the perturbed kernel PN can be made arbitrarily close to the exact kernel P for
sufficiently large N; in particular, the total variation distance between PN and P decreases to
zero as N increases. Finally, the assumptions of Theorems 11.1.2 and 11.2.1 follow.
Proof. [Proof of Corollary 11.2.2.1] First, by (B1)-(B5), Theorem 11.1.5 guarantees uniform
ergodicity of the exact chain P with β = min_{θ∈Θ} q(θ)/(p(θ)h(θ)/c), where c is the normalizing
constant of the posterior. Note that β > 0, since Θ is compact and the ratio is continuous and
never zero; therefore P satisfies the Doeblin condition. Next, from (B1) and (B6)-(B8),
Theorem 11.1.4 implies that sup_{θ∈Θ} |h(θ; ZN) − h(θ)| → 0 with probability 1. Hence, by
Theorem 11.1.1, the perturbed kernel PN can be made arbitrarily close to the exact kernel P for
sufficiently large N; in particular, the total variation distance between PN and P decreases to
zero as N increases. Finally, the assumptions of Theorems 11.1.2 and 11.2.1 follow.
Part III
Final Remarks
Chapter 12
Conclusions and Future Work
We started this thesis by looking at bivariate conditional copulas. The inclusion of a dynamic
copula in the model comes with a significant computational price, which can be justified by the
need to explore the dependence structure or by the resulting improvement in the predictive
accuracy of the model. We have proposed a Bayesian procedure to estimate the calibration
function of a conditional copula model jointly with the marginal distributions. In our attempt
to move away from an additive model hypothesis, we considered sparse Gaussian process priors
used in conjunction with a single index model. The resulting procedure reduces the
dimensionality of the parameter space and can be used for a moderate number of covariates.
The simplifying assumption (SA) is often adopted as a way to bypass the need to estimate a
conditional copula model. However, even if the SA is true when conditioning on the true set of
covariates, we showed that if one or more covariates are not included in the fitted model, then
the SA is violated. We have introduced selection criteria to help choose the copula family
from a set of candidates and to gauge data support in favour of the simplifying assumption.
While the former task seems to be achieved by all criteria considered, the latter is a
particularly difficult problem, and we are encouraged by the good performance exhibited by our
permutation-based version of the cross-validated marginal likelihood criterion. Its theoretical
properties are the focus of our ongoing work, and we plan to extend its use to identifying the
set of covariates that do not influence the calibration function.
As a natural continuation of the first project, we then proposed two methods to check data
support for the simplifying assumption in conditional bivariate copula problems as well as in
various other models. Both are based on splitting the data into training and test sets,
partitioning the test set into bins using predicted calibration values, and finally using a
randomization or χ² test to detect whether the distribution in each bin is the same. We showed
empirically that under the SA the probability of Type I error is controlled, while generic
Bayesian model selection methods fail to provide reliable results. When the generative process
does not satisfy the SA, the two methods also perform quite well, exhibiting large power.
Similar results are obtained for mean, logistic and quantile regressions. There is still some
uncertainty about what proportion of the data should be assigned to the training set and what
proportion to the test set. It was also assumed that the sample size in each bin is the same;
however, in some problems power can be increased by varying the bin sizes. These questions
should be investigated further.
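The bin-based check described above can be sketched as follows; the Kruskal-Wallis statistic here is only a stand-in for the randomization and χ² tests developed in the thesis, and all data are synthetic.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Hypothetical test-set quantities (e.g. copula-based scores) and the
# predicted calibration values used to assign observations to bins.
scores = rng.normal(size=300)
calib = rng.uniform(size=300)

# Partition the test set into four equal-count bins by predicted calibration.
edges = np.quantile(calib, [0.0, 0.25, 0.5, 0.75, 1.0])
bins = np.digitize(calib, edges[1:-1])
groups = [scores[bins == b] for b in range(4)]

# Under the SA the bin distributions should coincide; here a Kruskal-Wallis
# test plays the role of the randomization / chi-square tests in the text.
stat, pval = kruskal(*groups)
```

A small p-value would indicate that the bin distributions differ, i.e. evidence against the simplifying assumption.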
Lastly, we focused on simulation-based algorithms and proposed to speed up generic ABC-
MCMC and BSL algorithms by storing and re-using past simulations. This approach sig-
nificantly speeds up the computation and can be very useful for models where simulating a
pseudo data set is computationally expensive or where a large number of MCMC iterations is
required. We presented theoretical arguments and the assumptions necessary for the convergence
of the perturbed Markov chain. The performance of these strategies was ex-
amined via a series of simulations under different models. All simulation summaries show
that the proposed methods significantly improve the mixing and efficiency of the chain while
producing parameter estimates as accurate and precise as those of the generic samplers.
One obvious drawback is that, due to the curse of dimensionality, the kNN estimator may not
produce good results when the parameter dimension q is moderate or large. It is therefore of
great interest to modify the proposed algorithms and extend them to larger dimensions.
Bibliography
[1] Kjersti Aas, Claudia Czado, Arnoldo Frigessi, and Henrik Bakken. Pair-copula con-
structions of multiple dependence. Insurance Mathematics & Economics, 44(2):182–198,
April 2009.
[2] E.F. Acar, C Genest, and Johanna Neslehova. Beyond simplified pair-copula construc-
tions. Journal of Multivariate Analysis, 110:74–90, 2012.
[3] Elif F Acar, Radu V Craiu, and Fang Yao. Dependence calibration in conditional
copulas: A nonparametric approach. Biometrics, 67(2):445–453, 2011.
[4] Elif F Acar, Radu V Craiu, Fang Yao, et al. Statistical testing of covariate effects in
conditional copula models. Electronic Journal of Statistics, 7:2822–2850, 2013.
[5] Pierre Alquier, Nial Friel, Richard Everitt, and Aidan Boland. Noisy monte carlo: Con-
vergence of markov chains with approximate transition kernels. Statistics and Comput-
ing, 26(1-2):29–47, 2016.
[6] Ziwen An, David J Nott, and Christopher Drovandi. Robust bayesian synthetic likeli-
hood via a semi-parametric approach. arXiv preprint arXiv:1809.05800, 2018.
[7] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An
introduction to MCMC for machine learning. Machine learning, 50(1-2):5–43, 2003.
[8] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle markov chain
monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 72(3):269–342, 2010.
[9] Christophe Andrieu, Gareth O Roberts, et al. The pseudo-marginal approach for effi-
cient monte carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
[10] Meïli Baragatti and Pierre Pudlo. An overview on approximate Bayesian computation.
In ESAIM: Proceedings, volume 44, pages 291–299. EDP Sciences, 2014.
[11] Gerard Biau and Luc Devroye. Lectures on the nearest neighbor method. Springer, 2015.
[12] Christopher M Bishop. Pattern recognition and machine learning. Springer-Verlag New
York Inc., 2006.
[13] Luke Bornn, Natesh Pillai, Aaron Smith, and Dawn Woodard. One pseudo-
sample is enough in approximate bayesian computation mcmc. Biometrika, 99(1):1–10,
2014.
[14] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of markov
chain monte carlo. CRC press, 2011.
[15] V. Chavez-Demoulin and T. Vatter. Generalized additive models for conditional copulas.
J. Multivariate Anal., 141:147–167, 2015.
[16] Philip E Cheng. Strong consistency of nearest neighbor regression function estimators.
Journal of Multivariate Analysis, 15(1):63–72, 1984.
[17] Taeryon Choi, Jian Q Shi, and Bo Wang. A Gaussian process regression approach to a
single-index model. Journal of Nonparametric Statistics, 23(1):21–36, 2011.
[18] Paulo Cortez, Antonio Cerdeira, Fernando Almeida, Telmo Matos, and Jose Reis. Mod-
eling wine preferences by data mining from physicochemical properties. Decision Support
Systems, 47(4):547–553, 2009.
[19] Radu V Craiu and Jeffrey S Rosenthal. Bayesian computation via markov chain monte
carlo. Annual Review of Statistics and Its Application, 1:179–201, 2014.
[20] Radu V. Craiu and Avideh Sabeti. In mixed company: Bayesian inference for bivariate
conditional copula models with discrete and continuous outcomes. Journal of Multi-
variate Analysis, 110:106–120, 2012.
[21] Claudia Czado. Pair-copula constructions of multivariate copulas. Copula theory and
its applications, pages 93–109, 2010.
[22] Luciana Dalla Valle, Fabrizio Leisen, and Luca Rossini. Bayesian non-parametric con-
ditional copula estimation of twin data. Journal of the Royal Statistical Society: Series
C (Applied Statistics), 2017.
[23] Alexis Derumigny and Jean-David Fermanian. About tests of the “simplifying” assump-
tion for conditional copulas. Dependence Modeling, 5(1):154–197, 2017.
[24] Holger Dette, Marc Hallin, Tobias Kley, Stanislav Volgushev, et al. Of copulas, quan-
tiles, ranks and spectra: An L1-approach to spectral analysis. Bernoulli, 21(2):781–
831, 2015.
[25] Christopher C Drovandi. Abc and indirect inference. In Handbook of Approximate
Bayesian Computation, pages 179–209. Chapman and Hall/CRC, 2018.
[26] Christopher C Drovandi, Clara Grazian, Kerrie Mengersen, and Christian Robert. Ap-
proximating the likelihood in abc. In Handbook of Approximate Bayesian Computation,
pages 321–368. Chapman and Hall/CRC, 2018.
[27] Julian J Faraway. Extending the linear model with R: generalized linear, mixed effects
and nonparametric regression models. Chapman and Hall/CRC, 2016.
[28] Julian J Faraway. Linear models with R. Chapman and Hall/CRC, 2016.
[29] Paul Fearnhead and Dennis Prangle. Constructing summary statistics for approximate
bayesian computation: semi-automatic approximate bayesian computation. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 74(3):419–474, 2012.
[30] J.-D. Fermanian and O. Lopez. Single-index copulae. ArXiv preprint: 1512.07621, 2015.
[31] Sarah Filippi, Chris P Barnes, Julien Cornebise, and Michael PH Stumpf. On optimality
of kernels for approximate bayesian computation using sequential monte carlo. Statistical
applications in genetics and molecular biology, 12(1):87–107, 2013.
[32] Evelyn Fix and J. L. Hodges. Discriminatory analysis, nonparametric discrimination:
Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph
Field, Texas, 1951.
[33] Seymour Geisser and William F Eddy. A predictive approach to model selection. Journal
of the American Statistical Association, 74(365):153–160, 1979.
[34] A. Gelman and D. B. Rubin. Inference from iterative simulation using multiple se-
quences. Statistical Science, Vol. 7(4):457–472, 1992.
[35] Andrew Gelman, Jessica Hwang, and Aki Vehtari. Understanding predictive information
criteria for bayesian models. Statistics and Computing, 24(6):997–1016, 2014.
[36] Christian Genest and Anne-Catherine Favre. Everything you always wanted to know
about copula modeling but were afraid to ask. Journal of Hydrologic Engineering,
12(4):347–368, Jul-Aug 2007.
[37] Christian Genest, Kilani Ghoudi, and L-P Rivest. A semiparametric estimation pro-
cedure of dependence parameters in multivariate families of distributions. Biometrika,
82(3):543–552, 1995.
[38] Irene Gijbels, Marek Omelka, and Noel Veraverbeke. Estimation of a copula when
a covariate affects only marginal distributions. Scandinavian Journal of Statistics,
42(4):1109–1126, 2015.
[39] Robert B Gramacy and Heng Lian. Gaussian process single-index models as emulators
for computer experiments. Technometrics, 54(1):30–41, 2012.
[40] T. Hanson, A. Branscum, and W. Johnson. Predictive comparison of joint longitudinal-
survival modelling: a case study illustrating competing approaches. Lifetime Data
Analysis, 17:2–28, 2011.
[41] W Keith Hastings. Monte carlo sampling methods using markov chains and their ap-
plications. Biometrika, 57:97–109, 1970.
[42] Jose Miguel Hernandez-Lobato, James R Lloyd, and Daniel Hernandez-Lobato. Gaus-
sian process conditional copulas with applications to financial time series. In Advances
in Neural Information Processing Systems, pages 1736–1744, 2013.
[43] Philip Hougaard. Analysis of Multivariate Survival Data. Statistics for Biology and
Health. Springer-Verlag, New York, 2000.
[44] Yuao Hu, Robert B Gramacy, and Heng Lian. Bayesian quantile regression for single-
index models. Statistics and Computing, 23(4):437–454, 2013.
[45] Marko Jarvenpaa, Michael U Gutmann, Arijus Pleska, Aki Vehtari, Pekka Marttinen,
et al. Efficient acquisition rules for model-based approximate bayesian computation.
Bayesian Analysis, 2018.
[46] James E Johndrow and Jonathan C Mattingly. Error bounds for approximations of
markov chains used in bayesian sampling. arXiv preprint arXiv:1711.05382, 2017.
[47] James E Johndrow, Jonathan C Mattingly, Sayan Mukherjee, and David Dun-
son. Optimal approximating markov chains for bayesian inference. arXiv preprint
arXiv:1508.03387, 2015.
[48] Matthias Killiches, Daniel Kraus, and Claudia Czado. Examination and visualisation
of the simplifying assumption for vine copulas in three dimensions. Australian & New
Zealand Journal of Statistics, 59(1):95–117, 2017.
[49] N. Klein and T. Kneib. Simultaneous inference in structured additive conditional copula
regression models: a unifying Bayesian approach. Stat. Comput., pages 1–20, 2015.
[50] Roger Koenker and Kevin F Hallock. Quantile regression. Journal of economic perspec-
tives, 15(4):143–156, 2001.
[51] Eric D Kolaczyk and Gabor Csardi. Statistical analysis of network data with R, vol-
ume 65. Springer, 2014.
[52] Lajmi Lakhal, Louis-Paul Rivest, and Belkacem Abdous. Estimating survival and as-
sociation in a semicompeting risks model. Biometrics, 64(1):180–188, March 2008.
[53] Philippe Lambert and F Vandenhende. A copula-based model for multivariate non-
normal longitudinal data: analysis of a dose titration safety study on a new antidepres-
sant. Statist. Medicine, 21:3197–3217, 2002.
[54] A Lee, C Andrieu, and A Doucet. Discussion of constructing summary statistics for ap-
proximate bayesian computation: semi-automatic approximate bayesian computation.
JR Stat. Soc. Ser. B Stat. Methodol, 74(3):449–450, 2012.
[55] Anthony Lee. On the choice of mcmc kernels for approximate bayesian computation
with smc samplers. In Proceedings of the 2012 Winter Simulation Conference (WSC),
pages 1–12. IEEE, 2012.
[56] Erich L Lehmann and Joseph P Romano. Testing statistical hypotheses. Springer Science
& Business Media, 2006.
[57] Evgeny Levi and Radu V Craiu. Bayesian inference for conditional copulas using Gaussian process single index models. Computational Statistics & Data Analysis, 122:115–134, 2018.
[58] David Lopez-Paz, Jose Miguel Hernandez-Lobato, and Zoubin Ghahramani. Gaussian
process vine copulas for multivariate dependence. In Proceedings of the 30th Inter-
national Conference on Machine Learning, volume 28, pages 10–18, Atlanta, Georgia,
USA, 2013. JMLR: W&CP.
[59] Thomas Lux and Michele Marchesi. Volatility clustering in financial markets: a microsimulation of interacting agents. International Journal of Theoretical and Applied Finance, 3(04):675–702, 2000.
[60] Jean-Michel Marin, Pierre Pudlo, Christian P Robert, and Robin J Ryder. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180, 2012.
[61] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, 2003.
[62] Kerrie L Mengersen and Richard L Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24(1):101–121, 1996.
[63] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
[64] Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer
Science & Business Media, 2012.
[65] Iain Murray, Ryan Prescott Adams, and David JC MacKay. Elliptical slice sampling.
In International Conference on Artificial Intelligence and Statistics, 2010.
[66] Andrew Naish-Guzman and Sean Holden. The generalized FITC approximation. In
Advances in Neural Information Processing Systems, pages 1057–1064, 2007.
[67] Shaoyang Ning and Neil Shephard. A nonparametric Bayesian approach to copula estimation. arXiv preprint arXiv:1702.07089, 2017.
[68] Robert Nishihara, Iain Murray, and Ryan P Adams. Parallel MCMC with generalized elliptical slice sampling. Journal of Machine Learning Research, 15(1):2087–2112, 2014.
[69] John P Nolan. Modeling financial data with stable distributions. In Handbook of heavy
tailed distributions in finance, pages 105–130. Elsevier, 2003.
[70] Andrew J Patton. Modelling asymmetric exchange rate dependence. International Economic Review, 47(2):527–556, 2006.
[71] Natesh S Pillai and Aaron Smith. Ergodicity of approximate MCMC chains with applications to large data sets. arXiv preprint arXiv:1405.0182, 2014.
[72] Dennis Prangle. Summary statistics in approximate Bayesian computation. arXiv preprint arXiv:1512.05633, 2015.
[73] Leah F Price, Christopher C Drovandi, Anthony Lee, and David J Nott. Bayesian
synthetic likelihood. Journal of Computational and Graphical Statistics, 27(1):1–11,
2018.
[74] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.
[75] Gareth O Roberts, Andrew Gelman, and Walter R Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997.
[76] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.
[77] Gareth O Roberts, Jeffrey S Rosenthal, and Peter O Schwartz. Convergence properties of perturbed Markov chains. Journal of Applied Probability, 35(1):1–11, 1998.
[78] Jeffrey S Rosenthal. Markov chain Monte Carlo algorithms: Theory and practice. In Monte Carlo and Quasi-Monte Carlo Methods 2008, pages 157–169. Springer, 2009.
[79] Håvard Rue and Leonhard Held. Gaussian Markov random fields: theory and applications. Chapman and Hall/CRC, 2005.
[80] Avideh Sabeti, Mian Wei, and Radu V Craiu. Additive models for conditional copulas.
Stat, 3(1):300–312, 2014.
[81] Thilo A Schmitt, Rudi Schäfer, Holger Dette, and Thomas Guhr. Quantile correlations: Uncovering temporal dependencies in financial time series. International Journal of Theoretical and Applied Finance, 18(07):1550044, 2015.
[82] Alexander Y Shestopaloff and Radford M Neal. MCMC for non-linear state space models using ensembles of latent sequences. arXiv preprint arXiv:1305.0320, 2013.
[83] SA Sisson, Y Fan, and MA Beaumont. Overview of approximate Bayesian computation. arXiv preprint arXiv:1802.09720, 2018.
[84] Scott A Sisson, Yanan Fan, and Mark Beaumont. Handbook of Approximate Bayesian
Computation. Chapman and Hall/CRC, 2018.
[85] Scott A Sisson, Yanan Fan, and Mark M Tanaka. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2007.
[86] A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8:229–231, 1959.
[87] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2005.
[88] David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika van der Linde.
Bayesian measures of model complexity and fit (with discussion). Journal of the Royal
Statistical Society, Series B, 64:583–639(57), 2002.
[89] Aki Vehtari, Andrew Gelman, and Jonah Gabry. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413–1432, 2017.
[90] Noël Veraverbeke, Marek Omelka, and Irène Gijbels. Estimation of a conditional copula and association measures. Scandinavian Journal of Statistics, 38:766–780, 2011.
[91] Sumio Watanabe. Asymptotic equivalence of Bayes cross validation and widely applica-
ble information criterion in singular learning theory. The Journal of Machine Learning
Research, 11:3571–3594, 2010.
[92] Sumio Watanabe. A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14(Mar):867–897, 2013.
[93] Richard D Wilkinson. Accelerating ABC methods using Gaussian processes. arXiv preprint arXiv:1401.1436, 2014.
[94] Simon N Wood. Statistical inference for noisy nonlinear ecological dynamic systems.
Nature, 466(7310):1102, 2010.
[95] Juan Wu, Xue Wang, and Stephen G Walker. Bayesian nonparametric inference for
a multivariate copula function. Methodology and Computing in Applied Probability,
16(3):747–763, 2014.
[96] Juan Wu, Xue Wang, and Stephen G Walker. Bayesian nonparametric estimation of a
copula. Journal of Statistical Computation and Simulation, 85(1):103–116, 2015.