Conditional Copula Inference and Efficient Approximate MCMC
by
Evgeny Levi
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Statistical Sciences, University of Toronto
© Copyright 2019 by Evgeny Levi
Abstract
Conditional Copula Inference and Efficient Approximate MCMC
Evgeny Levi
Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto
2019
This thesis consists of two main parts. The first part focuses on parametric conditional cop-
ula models that allow the copula parameters to vary with a set of covariates according to an
unknown calibration function. Flexible Bayesian inference for the calibration function of a
bivariate conditional copula is introduced. The prior distribution over the set of smooth cal-
ibration functions is built using a sparse Gaussian Process prior for the Single Index Model.
The estimation of parameters from the marginal distributions and the calibration function
is done jointly via Markov Chain Monte Carlo sampling from the full posterior distribution.
A new Conditional Cross Validated Pseudo-Marginal criterion is used to perform copula se-
lection and is modified using a permutation-based procedure to assess data support for the
simplifying assumption.
The first part concludes with methods for establishing data support for the simplifying as-
sumption in a bivariate conditional copula model. After splitting the observed data into
training and test sets, the proposed method uses a flexible Bayesian model fit to the training
data to define tests based on randomization and standard asymptotic theory. I discuss
theoretical justification for the method and implementations in alternative models of interest:
Gaussian, logistic and quantile regressions. The performance is studied via simulated data.
The second part of the thesis focuses on approximate Bayesian methods. Approximate
Bayesian Computation (ABC) and Bayesian Synthetic Likelihood (BSL) are popular
simulation-based methods for sampling from the posterior distribution when the likelihood is
not tractable but simulations for each parameter value are easily available. However, these
methods can be computationally inefficient since a large number of pseudo-data simulations
is required. I propose to use perturbed MCMC versions of the ABC and BSL algorithms and
attempt to significantly accelerate these samplers. The main idea of the proposed strategy is
to utilize past samples with a k-Nearest-Neighbour approach for likelihood approximation. This
general method works for both ABC and BSL and greatly reduces the computational cost and
number of required simulations for these samplers. Performance and computational advantages
are examined via a series of simulation examples. The second part concludes with theoretical
justifications and convergence properties of the proposed strategies.
Acknowledgements
First, I would like to express my sincere gratitude to my supervisor, Radu Craiu, for his
guidance, patience and generous financial support. I am sure this work would not have been
accomplished without his help; I am very lucky to have him as my supervisor.
I would like to thank my committee members, Stanislav Volgushev and Daniel Roy, for
comments and advice that greatly improved this thesis. Special gratitude goes to Stanislav
Volgushev for great discussions about time series topics and the simplifying assumption
problem, which now constitutes a large portion of my thesis. Special appreciation goes to
Jeffrey Rosenthal for his insightful comments about MCMC theory.
I wish to thank Nancy Reid, David Brenner, Radford Neal, Jerry Brunner, Fang Yao and
Keith Knight for the courses I have taken, which greatly increased my interest in and
comprehension of statistical science.
I would also like to truly thank the staff members in the Statistical Sciences department,
Andrea Carter, Christine Bulguryemez, Annette Courtemanche, Angela Fleury and Dermot
Whelan who were always available to help at every step of my graduate studies.
Contents
I Conditional Copula and Simplifying Assumption Testing 1
1 Introduction 2
1.1 Conditional Copulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Brief review of Markov Chain Monte Carlo (MCMC) . . . . . . . . . . . . . 6
1.2.1 Metropolis Hastings algorithm . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Bayesian Inference and Gaussian Processes . . . . . . . . . . . . . . . . . . . 9
1.4 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Simplifying Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Bayesian Conditional Copula using Gaussian Processes 16
2.1 GP-SIM for Conditional copula . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Computational Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Performance of the algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Simulation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 Proof of concept based on one Replicate . . . . . . . . . . . . . . . . 24
2.3.4 Multiple Replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Model Selection 32
3.1 Conditional CVML criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Simulation results with CVML, CCVML and WAIC criteria . . . . . . . . . 33
3.2.1 One replicate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 Multiple replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Additional Simulation Results Based On Multiple Replicates . . . . . . . . . 35
4 Simplifying Assumption 38
4.1 Interesting Connection between Model Misspecification and the Simplifying
Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 A Permutation-based CVML to Detect Data Support for the Simplified As-
sumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Two other methods for Detecting Data Support for SA . . . . . . . . . . . . 43
4.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5 Theoretical justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.6 Extensions to other models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6.1 Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.6.3 Quantile Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Data Analysis 65
5.1 Red Wine Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Analysis and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
II Approximated Bayesian Methods 69
6 Introduction 70
6.1 The Need of Simulation Based Methods . . . . . . . . . . . . . . . . . . . . . 70
6.2 Approximate Bayesian Computation (ABC) . . . . . . . . . . . . . . . . . . 71
6.3 Bayesian Synthetic Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4 Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7 Approximated ABC (AABC) 79
7.1 Computational Inefficiency of ABC . . . . . . . . . . . . . . . . . . . . . . . 79
7.2 Approximated ABC-MCMC (AABC-MCMC) . . . . . . . . . . . . . . . . . 79
8 Approximated BSL (ABSL) 83
8.1 Computational Inefficiency of BSL . . . . . . . . . . . . . . . . . . . . . . . 83
8.2 Approximated Bayesian Synthetic Likelihood (ABSL) . . . . . . . . . . . . . 83
9 Simulations 86
9.1 Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.2 Measures for Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
9.3 Moving Average Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.4 Ricker’s Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.5 Stochastic Volatility with Gaussian emissions . . . . . . . . . . . . . . . . . . 97
9.6 Stochastic Volatility with α-Stable errors . . . . . . . . . . . . . . . . . . . . 101
10 Data Analysis 106
10.1 Dow-Jones log-returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
10.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
11 Theoretical Justifications 109
11.1 Preliminary Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
11.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
11.3 Proofs of theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
III Final Remarks 120
12 Conclusions and Future Work 121
List of Tables
2.1 Copula density functions for each copula family. . . . . . . . . . . . . . . . 21
2.2 Parameter’s range, Inverse-link functions and the functional relationship be-
tween Kendall’s τ and the copula parameter. . . . . . . . . . . . . . . . . . 22
2.3 Estimated √Bias², √IVar and √IMSE of Kendall’s τ for each Scenario and
Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Estimated √Bias², √IVar and √IMSE of E(Y1|y2, x) for each Scenario and
Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Estimated √Bias², √IVar and √IMSE of Kendall’s τ and E(Y1|y2, x) for GP-
SIM and Additive models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 CVML, CCVML and WAIC values for each Scenario and Model. . . . . . . 34
3.2 The percentage of correct decisions for each selection criterion when comparing
the correct Clayton model with a non-constant calibration with all the other
models: Frank model with non-constant calibration, Gaussian model with
non-constant calibration, Clayton model with constant calibration. . . . . . 34
3.3 The percentage of correct decisions for each selection criterion when comparing
the correct Clayton model with a constant calibration with three models:
Clayton, Frank and Gaussian, all of them assuming a GP-SIM calibration. . 35
3.4 The percentage of correct decisions for each selection criterion when comparing
the correct additive model with GP-SIM with non-constant calibration. . . 35
3.5 Estimated √Bias², √IVar and √IMSE of Kendall’s τ for each Scenario and
Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Estimated √Bias², √IVar and √IMSE of E(Y1|y2, x) for each Scenario and
Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7 The percentage of correct decisions for each selection criterion when comparing
the correct Clayton model with a non-constant calibration with all the other
models: Frank model with non-constant calibration, Gaussian model with
non-constant calibration, Clayton model with constant calibration. . . . . . 37
4.1 Missed covariate: CVML, CCVML and WAIC criteria values for the model in which the
conditional copula depends on one covariate and the model in which it is constant. . . . .
4.2 The percentage of correct decisions for each selection criterion and scenarios.
GP-SIM and SA were fitted with Clayton copula, sample size is 1500. . . . 43
4.3 The percentage of correct decisions for each selection criterion and scenario.
Predicted CVML and CCVML values based on n1 = 1000 training and n2 =
500 test data, respectively. The calculation of EV is based on a random sample
of 500 permutations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Method 1: A permutation-based procedure for assessing data support in favour
of SA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Method 2: A Chi-square test for assessing data support in favour of SA . . . 45
4.6 Simulation Results: Generic, proportion of rejection of SA for each scenario,
sample size and generic criteria. . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.7 Simulation Results: Proposed method, proportion of rejection of SA for each
scenario, sample size, number of bins (K) and method. . . . . . . . . . . . . 48
4.8 Simulation Results for Regression: Generic, proportion of rejections of SA for
each scenario, sample size and generic criteria. . . . . . . . . . . . . . . . . . 54
4.9 Simulation Results for Regression: Proposed methods, proportion of rejections
of SA for each scenario, sample size and number of bins. . . . . . . . . . . . 55
4.10 Simulation Results for Logistic Regression: Generic, proportion of rejections
of SA for each scenario, sample size and generic criteria. . . . . . . . . . . . 59
4.11 Simulation Results for Logistic Regression: Proposed methods, proportion of
rejections of SA for each scenario, sample size and number of bins. . . . . . . 59
4.12 Simulation Results for Quantile Regression: Generic, proportion of rejections
of SA for each scenario, sample size, τ and generic criteria. . . . . . . . . . . 63
4.13 Simulation Results for Quantile Regression: Proposed methods, proportion of
rejections of SA for each scenario, sample size, τ and number of bins. . . . . 63
5.1 Red Wine data: CVML, CCVML and WAIC criteria values for different models. 66
5.2 Wine data: Posterior means and quantiles of β. . . . . . . . . . . . . . . . . 67
5.3 Wine data: CVML, CCVML and WAIC criteria values for variable selection
in conditional copula. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
9.1 Simulation Results (MA model): Average Difference in mean, Difference in
covariance, Total variation, square roots of Bias, Variance and MSE, Effec-
tive sample size and Effective sample size per CPU time, for every sampling
algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.2 Simulation Results (Ricker’s model): Average Difference in mean, Difference
in covariance, Total variation, square roots of Bias, variance and MSE, Effec-
tive sample size and Effective sample size per CPU time, for every sampling
algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.3 Simulation Results (SV model): Average Difference in mean, Difference in
covariance, Total variation, square roots of Bias, variance and MSE, Effec-
tive sample size and Effective sample size per CPU time, for every sampling
algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.4 Simulation Results (SV α-Stable model): Average Difference in mean, Differ-
ence in covariance, Total variation, square roots of Bias, variance and MSE,
Effective sample size and Effective sample size per CPU time, for every sam-
pling algorithm. In DIM, DIC and TV, samplers are compared to SMC. . . 105
10.1 Dow Jones log return stochastic volatility: 95% credible intervals and posterior
averages for 4 parameters for two proposed samplers (AABC-U and ABSL-U). 107
List of Figures
2.1 Sc1: Clayton copula, Gelman-Rubin MCMC diagnostic for beta and two
variances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Sc1: Trace-plots, ACFs and histograms of parameters based on MCMC sam-
ples generated under the true Clayton family. . . . . . . . . . . . . . . . . . 25
2.3 Sc1: Estimation of marginal means. The leftmost 2 columns show the accu-
racy for predicting E(Y1) and the rightmost 2 columns show the results for
predicting E(Y2). The black and green lines represent the true and estimated
relationships, respectively. The red lines are the limits of the pointwise 95%
credible intervals obtained under the true Clayton family. . . . . . . . . . . 26
2.4 Sc1: Estimation of Kendall’s τ dependence surface. The true surface (left
panel) is very similar to the estimated one (right panel). . . . . . . . . . . . 26
2.5 Sc1: Estimation of Kendall’s τ one-dimensional projections when x1 = 0.2 or 0.8
(top panels) and when x2 = 0.2 or 0.8 (bottom panels). The black and green
lines represent the true and estimated relationships, respectively. The red
lines are the limits of the pointwise 95% credible intervals obtained under the
true Clayton family. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 Sc1: Histogram of predicted Kendall’s τ values obtained under the true Clay-
ton copula. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Sc1: Histogram of predicted τs with Gaussian copula model. . . . . . . . . 28
2.8 Sc3: Estimation of Kendall’s τ one-dimensional projections for each coor-
dinate fixing all other coordinates at 0.5 levels. The black and green lines
represent the true and estimated relationships, respectively. The red lines are
the limits of the pointwise 95% credible intervals obtained under the true
Clayton family. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Estimation of Kendall’s τ as a function of x1 when only first covariate is
used in estimation. The dotted black and solid green lines represent the true
and estimated relationships, respectively. The red lines are the limits of the
pointwise 95% credible intervals obtained under the true Clayton family. 39
5.1 Wine Data: Pairwise scatterplots of all the variables in the analyzed data. . 66
5.2 Wine Data: Slices of predicted Kendall’s τ as function of covariates. Red
curves represent 95% credible intervals. . . . . . . . . . . . . . . . . . . . . 68
5.3 Wine Data: Plots of ‘fixed acidity’(blue) and ‘density’(red) (linearly trans-
formed to fit on one plot) against covariates. . . . . . . . . . . . . . . . . . 68
9.1 MA2 model: AABC-U Sampler. Each row corresponds to parameters θ1 (top
row) and θ2 (bottom row) and shows in order from left to right: Trace-plot,
Histogram and Auto-correlation function. Red lines represent true parameter
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.2 MA2 model: ABSL-U Sampler. Each row corresponds to parameters θ1 (top
row) and θ2 (bottom row) and shows in order from left to right: Trace-plot,
Histogram and Auto-correlation function. Red lines represent true parameter
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.3 MA2 model: ABC-RW Sampler. Each row corresponds to parameters θ1 (top
row) and θ2 (bottom row) and shows in order from left to right: Trace-plot,
Histogram and Auto-correlation function. Red lines represent true parameter
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.4 MA model: Estimated densities for each component. First row compares
Exact, SMC, ABC-RW and AABC-U samplers. Second row compares Exact,
BSL-IS and ABSL-U. Columns correspond to parameter’s components, from
left to right: θ1 and θ2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.5 Ricker’s model: AABC-U Sampler. Each row corresponds to parameters θ1
(top row), θ2 (middle row) and θ3 (bottom row) and shows in order from
left to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . 94
9.6 Ricker’s model: ABSL-U Sampler. Each row corresponds to parameters θ1
(top row), θ2 (middle row) and θ3 (bottom row) and shows in order from
left to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . 95
9.7 Ricker’s model: ABC-RW Sampler. Each row corresponds to parameters θ1
(top row), θ2 (middle row) and θ3 (bottom row) and shows in order from
left to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . . 95
9.8 Ricker’s model: Estimated densities for each component. First row compares
Exact, SMC, ABC-RW and AABC-U samplers. Second row compares Exact,
BSL-RW and ABSL-U. Columns correspond to parameter’s components, from
left to right: θ1, θ2 and θ3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.9 SV model: AABC-U Sampler. Each row corresponds to parameters θ1 (top
row), θ2 (middle row) and θ3 (bottom row) and shows in order from left
to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . 99
9.10 SV model: ABSL-U Sampler. Each row corresponds to parameters θ1 (top
row), θ2 (middle row) and θ3 (bottom row) and shows in order from left
to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . 99
9.11 SV model: ABC-RW Sampler. Each row corresponds to parameters θ1 (top
row), θ2 (middle row) and θ3 (bottom row) and shows in order from left
to right: Trace-plot, Histogram and Auto-correlation function. Red lines
represent true parameter values. . . . . . . . . . . . . . . . . . . . . . . . . 100
9.12 SV model: Estimated densities for each component. First row compares Ex-
act, SMC, ABC-RW and AABC-U samplers. Second row compares Exact,
BSL-IS and ABSL-U. Columns correspond to parameter’s components, from
left to right: θ1, θ2 and θ3. . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
9.13 SV α-Stable model: AABC-U Sampler. Each row corresponds to parameters
θ1 (top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and
shows in order from left to right: Trace-plot, Histogram and Auto-correlation
function. Red lines represent true parameter values. . . . . . . . . . . . . . 103
9.14 SV α-Stable model: ABSL-U Sampler. Each row corresponds to parameters θ1
(top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and
shows in order from left to right: Trace-plot, Histogram and Auto-correlation
function. Red lines represent true parameter values. . . . . . . . . . . . . . . 104
9.15 SV α-Stable model: ABC-RW Sampler. Each row corresponds to parameters θ1
(top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and
shows in order from left to right: Trace-plot, Histogram and Auto-correlation
function. Red lines represent true parameter values. . . . . . . . . . . . . . . 104
9.16 SV α-Stable model: Estimated densities for each component. First row com-
pares SMC, ABC-RW and AABC-U samplers. Second row compares SMC,
BSL-IS and ABSL-U. Columns correspond to parameter’s components, from
left to right: θ1, θ2, θ3 and θ4. . . . . . . . . . . . . . . . . . . . . . . . . . . 105
10.1 Dow Jones daily transformed log return for a period of Jan 2010 - Dec 2018. 107
10.2 Dow Jones log returns: AABC-U Sampler. Every column corresponds to a
particular parameter component from left to right: θ1, θ2, θ3, θ4 and shows
trace-plot on top and histogram on bottom. . . . . . . . . . . . . . . . . . . 108
10.3 Dow Jones log returns: ABSL-U Sampler. Every column corresponds to a
particular parameter component from left to right: θ1, θ2, θ3, θ4 and shows
trace-plot on top and histogram on bottom. . . . . . . . . . . . . . . . . . . 108
Part I
Conditional Copula and Simplifying
Assumption Testing
Chapter 1
Introduction
1.1 Conditional Copulas
A copula is a mathematical concept often used to model the joint distribution of several
random variables. Copulas are useful for modelling the dependence structure in data
when there is interest in separating it from the marginal models, or when none of the existing
multivariate distributions is suitable. The applications of copula models permeate a number
of fields where the simultaneous study of dependent variables is of interest, e.g. [43], [70], [36]
and [52]. For continuous multivariate distributions, the elegant result of [86] guarantees the
existence and uniqueness of the copula C : [0, 1]p → [0, 1] that links the marginal cumulative
distribution functions (cdf) and the joint cdf. Specifically,
H(Y1, . . . , Yp) = C(F1(Y1), . . . , Fp(Yp)), (1.1)
where H is the joint cdf and Fi is the marginal cdf of variable Yi, 1 ≤ i ≤ p. For
statistical modelling it is also useful to note that a p-dimensional copula C and
marginal continuous cdfs Fi(yi), i = 1, . . . , p, are building blocks for a valid p-dimensional
cdf, C(F1(y1), . . . , Fp(yp)), with ith marginal cdf equal to Fi(yi), thus providing much-needed
flexibility in modelling multivariate distributions. For example, with this construction we can
create a random vector with "Gaussian" dependence structure (i.e. copula) but arbitrary marginal
distributions. To find the Gaussian copula we extract it from the joint Gaussian distribution
since, from (1.1), we immediately get

C(U1, . . . , Up) = H(F1^{-1}(U1), . . . , Fp^{-1}(Up)), (1.2)

which gives us a way to obtain the copula function of any joint distribution.
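The construction in (1.1)-(1.2) is easy to demonstrate numerically. The sketch below (illustrative only; Python with NumPy/SciPy is an assumption, not something the thesis prescribes) samples from a bivariate Gaussian copula and attaches exponential and Student-t margins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n = 0.7, 10000

# Draw from a bivariate Gaussian with correlation rho ...
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)

# ... and push through the N(0,1) cdf: (U1, U2) now follows the Gaussian copula (1.2)
u = stats.norm.cdf(z)

# Attach arbitrary margins via inverse cdfs, as in (1.1)
y1 = stats.expon.ppf(u[:, 0], scale=2.0)   # Exponential margin
y2 = stats.t.ppf(u[:, 1], df=4)            # Student-t(4) margin

# The "Gaussian" dependence survives the monotone marginal transforms:
# Kendall's tau of a Gaussian copula is (2/pi) * arcsin(rho), about 0.49 here
tau, _ = stats.kendalltau(y1, y2)
print(round(tau, 2))
```

Any continuous margins work here; only the `ppf` step changes.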
The focus of this part of the thesis is on copula models used in a regression setting in which
covariate values are expected to influence the responses Y1, . . . , Yp through the marginal mod-
els and the interdependence between them through the copula. The extension to conditional
distributions via the conditional copula was used by [53] and subsequently formalized by [70]
so that
H(Y1, . . . , Yp|X) = CX(F1|X(Y1|X), . . . , Fp|X(Yp|X)), (1.3)
where X ∈ Rq is a vector of conditioning variables, CX is the conditional copula that
may change with X and Fi|X is the conditional cdf of Yi given X for 1 ≤ i ≤ p. A
parametric model for the conditional copula assumes CX = Cθ(X) belongs to a parametric
family of copulas and only the parameter θ ∈ Θ varies as a function of X. Throughout
this chapter uppercase letters identify random variables, while their realizations are denoted
using lowercase. We also assume that there exists a known one-to-one function g : Θ → R
such that θ(X) = g^{-1}(η(X)), with the calibration function η : Rq → R being the inferential focus.
Moreover, it is sometimes convenient to parametrize a copula family in terms of the Kendall's tau
correlation τ(X), which takes values in [−1, 1] and is in one-to-one correspondence with θ(X)
for one-parameter copula families. Thus there is a known one-to-one function g′(·) such that
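For one-parameter families the maps g and g′ are available in closed form. A small sketch (the Clayton family with θ ∈ (0, ∞), τ = θ/(θ + 2), and a log link are assumed here for concreteness; the thesis catalogues its actual inverse links in Table 2.2):

```python
import numpy as np

def theta_from_eta(eta):
    # inverse link g^{-1}: calibration eta -> Clayton parameter theta (log link assumed)
    return np.exp(eta)

def tau_from_theta(theta):
    # Kendall's tau for the Clayton family: tau = theta / (theta + 2)
    return theta / (theta + 2.0)

def eta_from_tau(tau):
    # the composed map g'(tau): tau -> theta -> eta, valid for tau in (0, 1)
    return np.log(2.0 * tau / (1.0 - tau))

eta = 1.2
tau = tau_from_theta(theta_from_eta(eta))
print(round(tau, 3))                          # tau implied by eta = 1.2 (about 0.624)
assert abs(eta_from_tau(tau) - eta) < 1e-12   # the round trip recovers eta
```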
There are a number of reasons one is interested in estimating the conditional copula. First,
in regression models with multivariate responses, which is the main focus of this chapter,
one may want to determine how the dependence structure among the components of the
response varies with the covariates. Second, the copula model will ultimately impact the
performance of model-based prediction. For instance, for a bivariate response, (Y1, Y2), in
which one component is predicted given the other, the conditional density of Y1, given X = x
and Y2 = y2, takes the form
h(y1|y2, x) = f(y1|x)cθ(x)(F1|x(y1|x), F2|x(y2|x)), (1.4)
where cθ(x) is the density of the conditional copula Cθ(x) and f(y1|x) is the marginal condi-
tional density of y1 given X = x. Hence, in addition to the information contained in the
marginal model, equation (1.4) also uses for prediction the information contained in the other
responses.
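Once a copula family is fixed, the conditional density (1.4) can be evaluated directly. A minimal sketch (a Gaussian copula with fixed ρ and standard normal margins are assumed purely for illustration, so the covariate dependence is suppressed):

```python
import numpy as np
from scipy import stats

def gaussian_copula_density(u, v, rho):
    # c(u, v) = phi2(z1, z2; rho) / (phi(z1) phi(z2)), with z = Phi^{-1}(u)
    z1, z2 = stats.norm.ppf(u), stats.norm.ppf(v)
    joint = stats.multivariate_normal.pdf(np.column_stack([z1, z2]),
                                          cov=[[1.0, rho], [rho, 1.0]])
    return joint / (stats.norm.pdf(z1) * stats.norm.pdf(z2))

def h_cond(y1, y2, rho):
    # equation (1.4): h(y1 | y2) = f(y1) * c_rho(F1(y1), F2(y2)),
    # here with standard normal margins f = phi, F = Phi
    return stats.norm.pdf(y1) * gaussian_copula_density(stats.norm.cdf(y1),
                                                        stats.norm.cdf(y2), rho)

# Sanity check: as a function of y1, h(. | y2) is a proper density (integrates to 1);
# with these margins it reduces to the usual N(rho * y2, 1 - rho^2) conditional.
grid = np.linspace(-8.0, 8.0, 4001)
vals = h_cond(grid, np.full_like(grid, 0.5), rho=0.6)
mass = np.sum(vals) * (grid[1] - grid[0])
print(round(mass, 3))  # close to 1.0
```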
Third, when specifying a general multivariate distribution, the conditional copula is an
essential ingredient. For instance, if U1, U2, U3 are three Uniform(0, 1) variables, when
applying a vine decomposition using bivariate copulas [e.g., 21] their joint density is
c(u1, u2, u3) = c12(u1, u2) c23(u2, u3) c_{θ(u2)}(P(U1 ≤ u1|u2), P(U3 ≤ u3|u2) | u2),
where cij is the density of the copula between variables Ui and Uj and cθ(u2) is the density of
the conditional copula of U1, U3|U2 = u2. Finally, a conditional copula with predictor values
X ∈ Rq in which η(X) is constant, may exhibit non-constant patterns when some of the
components of X are not included in the model. This point will be revisited in section 4.1.
When estimation for the conditional copula model is contemplated, one must consider
that there are multiple sources of error and each will have an impact on the model. Even in
the simple case in which the estimation of the marginals and copula suffer from errors that
depend only on X = x, one obtains via a Taylor expansion:

c_{θ(x)+δ3(x)}(F1|x(y1|x) + δ1(x), F2|x(y2|x) + δ2(x)) = c_{θ(x)}(F1|x(y1|x), F2|x(y2|x)) (1.5)
+ c^{(1,0,0)}_{θ(x)}(F1|x(y1|x), F2|x(y2|x)) δ1(x) (1.6)
+ c^{(0,1,0)}_{θ(x)}(F1|x(y1|x), F2|x(y2|x)) δ2(x) (1.7)
+ c^{(0,0,1)}_{θ(x)}(F1|x(y1|x), F2|x(y2|x)) δ3(x) + O(‖δ(x)‖²), (1.8)

where c^{(1,0,0)}, c^{(0,1,0)} and c^{(0,0,1)} are the partial derivatives of c_z(x, y) with respect to x, y and z,
respectively, and δi(x), 1 ≤ i ≤ 3, denote various estimation error terms due to model
misspecification, e.g. δ3(x) is the error in estimation of the copula parameter at a given
covariate value x. The right-hand side of equation (1.5) is the correct joint likelihood,
while (1.6)-(1.8) show the biases incurred due to errors in estimating the first and second
marginal conditional cdfs and the copula calibration function, respectively. It becomes
apparent that in order to keep the estimation error low, one must consider flexible models
for the marginals and the copula.
In practice we observe data that consist of n independent triplets D = {(xi, y1i, y2i), i =
1, . . . , n}, where yji ∈ R, j = 1, 2, and xi ∈ Rq. Denote y1 = (y11, . . . , y1n), y2 =
(y21, . . . , y2n), and let x ∈ Rn×q be the matrix with ith row equal to xi^T. Then, using (1.3),
the density of Y1 and Y2 given x is

p(y1, y2|x, ω) = ∏_{i=1}^{n} f1(y1i|ω, xi) f2(y2i|ω, xi) c_{θ(xi)}(F1(y1i|ω, xi), F2(y2i|ω, xi)), (1.9)
where fj, Fj are the density and cdf of Yj, respectively, and ω denotes all the parameters
and latent variables in the joint and marginal models. The copula density function is
denoted by c and depends on X through the unknown function θ(X) = g^{-1}(η(X)).
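On the log scale the likelihood (1.9) is a sum of marginal and copula contributions. A sketch of its evaluation (standard normal margins, a Clayton copula and the illustrative calibration η(x) = 0.5 + sin(2πx) with a log link are assumptions for this sketch, not the thesis's model):

```python
import numpy as np
from scipy import stats

def clayton_logdensity(u, v, theta):
    # log Clayton copula density: log(1+theta) - (1+theta)(log u + log v)
    #                             - (2 + 1/theta) log(u^-theta + v^-theta - 1)
    return (np.log1p(theta) - (1.0 + theta) * (np.log(u) + np.log(v))
            - (2.0 + 1.0 / theta) * np.log(u ** -theta + v ** -theta - 1.0))

def log_likelihood(y1, y2, x, eta):
    # log of (1.9): marginal log-densities plus the copula term, with
    # theta(x_i) = exp(eta(x_i)) varying across observations
    theta = np.exp(eta(x))
    u, v = stats.norm.cdf(y1), stats.norm.cdf(y2)
    return np.sum(stats.norm.logpdf(y1) + stats.norm.logpdf(y2)
                  + clayton_logdensity(u, v, theta))

rng = np.random.default_rng(1)
x = rng.uniform(size=200)
y1, y2 = rng.standard_normal(200), rng.standard_normal(200)
ll = log_likelihood(y1, y2, x, eta=lambda t: 0.5 + np.sin(2.0 * np.pi * t))
print(np.isfinite(ll))  # True
```

A Bayesian fit adds log-priors for ω and explores the resulting posterior, e.g. by MCMC.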
Depending on the strength of assumptions we are willing to make about η(X), a number of
possible approaches are available. The most direct is to assume a known parametric form for
the calibration function, e.g. constant or linear, and estimate the corresponding parameters
by maximum likelihood estimation [37]. This approach relies on knowledge about the shape
of the calibration function which, in practice, can be unrealistic. A more flexible approach
uses non-parametric methods [3, 90] and estimates the calibration function using smoothing
methods. Recently, we have seen a number of developments using nonparametric Bayesian
techniques for estimating a multivariate copula using an infinite mixture of Gaussian copulas
[95], or via flexible Dirichlet process priors [96, 67]. The infinite mixture approach in [95]
was extended to estimate any conditional copula with a univariate covariate by [22], while
an alternative Bayesian approach based on a flexible cubic spline model for the calibration
functions was built by [20]. For multivariate covariates, [80], [15] and [49] avoid the curse
of dimensionality that appears even for moderate values of q, say q ≥ 5, by specifying
an additive model structure for the calibration function. Few alternatives to the additive
structure exist. One exception is [42] who used a sparse Gaussian Process (GP) prior for
estimating the calibration function and subsequently used the same construction for vine
copula estimation in [58]. However, when the dimension of the predictor space is even
moderately large the curse of dimensionality prevails and it is expected that the q-dimensional
GP used for calibration estimation will not capture important patterns for sample sizes that
are not very large. Moreover, the full efficiency of the method proposed in [42] is difficult
to assess since their model is built with uniform marginals, which in a general setup is
equivalent to assuming exact knowledge about the marginal distributions. In fact, when
the marginal distributions are estimated it is of paramount importance to account for the
resulting variance inflation due to error propagation in the copula estimation as reflected
by equations (1.5)-(1.8). The Bayesian model in which joint and marginal components are
simultaneously considered will appropriately handle error propagation as long as it is possible
to study the full posterior distribution of all the parameters in the model, be they involved
in the marginals or copula specification.
1.2 Brief review of Markov Chain Monte Carlo (MCMC)
Since we implement MCMC algorithms for both parts of this thesis, a brief introduction is
given here.
We start by assuming that we need to sample from some distribution with density π(x),
where x ∈ X ⊆ Rq. We do not have access to the closed form of this density; only the
unnormalized density π̃(x) can be evaluated, where

π(x) = π̃(x)/C,

and C is some unknown normalization constant. This situation is typical in Bayesian statis-
tics, where the posterior distribution of parameters θ given observed data y is

π(θ|y) = p(θ)f(y|θ) / ∫ p(θ)f(y|θ)dθ,

where p(θ) is the prior and f(y|θ) is the model density. In this setting π̃(θ) = p(θ)f(y|θ)
can be easily computed, while the normalization constant is generally not known. In these
problems, where the posterior cannot be found in closed form, the objective is to draw
samples θ(t), t = 1, . . . , M, from the posterior and then use them to approximate quantities
of interest. By the Strong Law of Large Numbers:
(1/M) ∑_{t=1}^{M} g(θ(t)) → E[g(θ)] almost surely,   (1.10)
for any measurable function g(·). When the dimension q is moderate or large it becomes
increasingly difficult to draw independent samples from π; therefore, MCMC algorithms aim
to simulate dependent samples by constructing a Markov chain with stationary distribution
π, see [19, 14].
First we define a Markov Chain:
Definition 1.2.1. Let (X, F) be a state space with Borel σ-field F. A stochastic process
X(0), X(1), . . . , X(M), . . . is a Markov chain if:
P(X(t) ∈ A | X(0), . . . , X(t−1)) = P(X(t) ∈ A | X(t−1)) for all A ∈ F.
We can think about this process as ordered in time and future values depend only on
the present and not on the past. In MCMC theory we usually assume that these conditional
probabilities are homogeneous:
Definition 1.2.2. A Markov chain is homogeneous if P(X(t) ∈ A|X(t−1)) is the same for all
t = 1, 2, . . ..
That is, the conditional probability does not change with time t. Therefore the joint distri-
bution of the random vector (X(0), . . . , X(M)) is
P (x(0), . . . , x(M)) = P (x(0))× P (x(1)|x(0))× · · · × P (x(M)|x(M−1)),
so it is fully specified by the initial distribution P(x(0)) and the transition probability (or
kernel), which we denote by P(x, dy), so that P(X(1) ∈ A|X(0) = x) = ∫_A P(x, dy). It
is also convenient to define PM(x, ·) = P(X(M) ∈ ·|X(0) = x), the conditional distribution
of the Markov chain after M steps.
A very important concept in MCMC is the stationary (or invariant) distribution of a Markov
chain, which is defined as:
Definition 1.2.3. A distribution π(x) on X is called invariant for a Markov chain on (X, F)
if ∫ π(dx)P(x, A) = π(A) for all A ∈ F.
This simply means that if, for example, X(0) has distribution π(x), then X(1) and all
subsequent random variables have the same distribution. Note that if we set X(0) ∼ π(x),
where π(x) is invariant, then the homogeneous Markov chain becomes strongly stationary.
In MCMC we actually face the reverse problem: we are given π(x), from which we want to
sample, and the first step is to construct a Markov chain (with an appropriate transition
kernel) for which π(x) is invariant. Once we have it, if we start the chain by sampling from
π(x), then by stationarity all X(1), X(2), . . . follow the target distribution and the goal
is achieved. Of course, if we could simulate X(0) from the target distribution, we would not
need the Markov chain at all; the main question is therefore whether, if we sample from a
different distribution at step 0, the distribution of X(M) converges to π(x) as M increases.
Under mild assumptions this convergence result is true. First we define the total variation
distance between
two measures ν1 and ν2 as
‖ν1(·) − ν2(·)‖TV = sup_{A} |ν1(A) − ν2(A)|.
The main convergence result is:
Theorem 1.2.1. If a Markov chain defined on (X, F) with transition kernel P(x, ·) and
invariant measure π(·) is φ-irreducible and aperiodic, then for π-almost every x ∈ X:

lim_{M→∞} ‖PM(x, ·) − π(·)‖TV = 0.
The importance of this theorem is that we can start the Markov chain at almost any value
x(0), run the chain (for which π(x) is stationary), and after a large number of iterations
expect samples from the target distribution. The result depends on two assumptions that are
usually satisfied in practice; see [64] for details:
Definition 1.2.4. A Markov chain is φ-irreducible if there exists a non-zero σ-finite measure
φ on X such that for all A ∈ F with φ(A) > 0 and for all x ∈ X, there exists a positive
integer M such that PM(x, A) > 0.
Definition 1.2.5. A Markov chain with invariant distribution π(x) is aperiodic if there do
not exist disjoint subsets A1, . . . , Ad ∈ F, d ≥ 2, with P(x, Ai+1) = 1 for all x ∈ Ai
(1 ≤ i ≤ d − 1) and P(x, A1) = 1 for all x ∈ Ad.
1.2.1 Metropolis Hastings algorithm
Given the unnormalized density π̃(x) of the target distribution π(x), to apply the MCMC
theory we need to find a transition kernel P(x, ·) such that the target is invariant. Metropolis–
Hastings [63, 41] is probably the most frequently used algorithm to construct such transition
kernels. The main idea is, at iteration t, to sample a proposal x∗ from some distribution
q(·|X(t−1) = x), which may depend on the previous state, calculate an appropriate acceptance
probability α(x, x∗), and then accept x∗ with this probability, see Algorithm 1. Notice that
Algorithm 1 Metropolis–Hastings
1: Given initial x(0) and required number of samples M.
2: for t = 1, . . . , M do
3: Set x = x(t−1).
4: Simulate x∗ ∼ q(·|x), where q(·|x) is some density.
5: Calculate α(x, x∗) = min(1, [π(x∗)q(x|x∗)]/[π(x)q(x∗|x)]) = min(1, [π̃(x∗)q(x|x∗)]/[π̃(x)q(x∗|x)]).
6: Simulate u ∼ U(0, 1).
7: if u ≤ α(x, x∗) then
8: Accept: X(t) = x∗.
9: else
10: Reject: X(t) = x.
11: end if
12: end for
α(x, x∗) depends on the ratio of π̃(x∗) to π̃(x), and therefore this algorithm can be
implemented when the normalization constant C is unknown.
It is easy to see that the transition kernel for this algorithm is:
P (x, dx∗) = α(x, x∗)q(x∗|x)dx∗ + r(x)δx(dx∗),
where r(x) = 1 − ∫ α(x, x∗)q(x∗|x)dx∗ and δx(·) is a point mass at x. It can be shown
that this transition kernel preserves the target distribution π(x).
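For illustration, Algorithm 1 with a symmetric random-walk proposal, for which q(x∗|x) = q(x|x∗) cancels in α, can be sketched in Python; the function and variable names below are our own and purely illustrative:

```python
import numpy as np

def metropolis_hastings(log_pi_tilde, x0, M, step=1.0, seed=None):
    """Random-walk Metropolis sampler for an unnormalized log-density log_pi_tilde."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    log_p = log_pi_tilde(x)
    chain = np.empty((M, x.size))
    for t in range(M):
        x_star = x + step * rng.standard_normal(x.size)   # proposal from q(.|x)
        log_p_star = log_pi_tilde(x_star)
        # alpha = min(1, pi_tilde(x*)/pi_tilde(x)) since the proposal is symmetric
        if np.log(rng.uniform()) <= log_p_star - log_p:
            x, log_p = x_star, log_p_star                 # accept the proposal
        chain[t] = x                                      # on rejection keep current state
    return chain

# target: a standard normal density known only up to the constant C
chain = metropolis_hastings(lambda x: -0.5 * np.sum(x ** 2), x0=3.0, M=20000, seed=1)
```

Working on the log scale avoids underflow of π̃; comparing log u with log π̃(x∗) − log π̃(x) is equivalent to accepting with probability α(x, x∗).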
1.3 Bayesian Inference and Gaussian Processes
Assume we observe n independent realizations, y1, . . . , yn, of a random variable Y ∈ R
and that each observation yi corresponds to a covariate measurement xi ∈ Rq. Henceforth,
we assume that x1, . . . , xn are fixed by design. The distribution of Yi has a known form
and depends on xi through some unknown function f and parameter σ so that the joint
distribution of the data is
P(y|x1, . . . , xn, σ) = P(y|f(x1), . . . , f(xn), σ) = ∏_{i=1}^{n} P(yi|f(xi), σ).   (1.11)
Usually, the main inferential goal is to estimate the unknown smooth function f : Rq → R,
while σ is a nuisance parameter. If we let x = (x1, . . . , xn)T denote the n covariate values,
then a Gaussian Process (GP) prior on the function f implies
f = (f(x1), f(x2), . . . , f(xn))T ∼ N (0, K(x,x; w)), (1.12)
where N (µ,Σ) denotes a normal distribution with mean µ and variance matrix Σ and K is a
variance matrix which depends on x and additional parameters w. Here we use the squared
exponential kernel to model the matrix K(x,x; w), i.e. its (i, j) element is
k(xi, xj; w) = e^{w0} exp[ − ∑_{s=1}^{q} (xis − xjs)² / e^{ws} ],   (1.13)
where xis is the sth coordinate value for ith covariate measurement xi. The unknown param-
eters w = (w0, . . . , wq) that determine the strength of dependence in (1.13) are inferred from
the data. Of interest is predicting the values of the nonlinear predictor at new observations
x∗ = (x∗1, . . . , x∗m)T , which we denote f∗ = (f(x∗1), . . . , f(x∗m))T . In the case in which the
covariate dimension, q, is moderately large, an accurate estimation of f∗ will require a large
sample size, n. Unfortunately, this desideratum is hindered by the computational cost of
fitting a GP model when n is large. For example, if Yi∼N (f(xi), σ2) then equations (1.12)
and (1.11) yield a joint Gaussian distribution of Y = (Y1, . . . , Yn) and f∗. If y = (y1, . . . , yn)
denotes the observed response, then the conditional distribution of f∗|Y = y is N(µ∗,Σ∗)
where
µ∗ = K(x∗,x; w)[K(x,x; w) + σ2In]−1y, (1.14)
Σ∗ = K(x∗,x∗; w)−K(x∗,x; w)[K(x,x; w) + σ2In]−1K(x,x∗; w), (1.15)
and K(x∗,x∗; w), K(x∗,x; w) and K(x,x∗; w) have their elements defined using (1.13).
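To make equations (1.13)-(1.15) concrete, the following Python sketch (with our own, illustrative naming) computes the predictive mean and covariance under the Gaussian sampling model:

```python
import numpy as np

def sq_exp_kernel(X1, X2, w):
    """Squared-exponential kernel (1.13): e^{w0} * exp(-sum_s (x_is - x_js)^2 / e^{w_s})."""
    d2 = (((X1[:, None, :] - X2[None, :, :]) ** 2) / np.exp(w[1:])).sum(-1)
    return np.exp(w[0]) * np.exp(-d2)

def gp_predict(X, y, X_star, w, sigma2):
    """Predictive mean (1.14) and covariance (1.15) of f* given the responses y."""
    n = len(X)
    Kinv = np.linalg.inv(sq_exp_kernel(X, X, w) + sigma2 * np.eye(n))
    Ks = sq_exp_kernel(X_star, X, w)
    mu = Ks @ Kinv @ y
    Sigma = sq_exp_kernel(X_star, X_star, w) - Ks @ Kinv @ Ks.T
    return mu, Sigma
```

A quick sanity check: with a very small noise variance, the predictive mean at the training inputs essentially interpolates y and the predictive variances are near zero.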
With the Gaussian sampling model it is clear from (1.14) and (1.15) that the MCMC sam-
pling of the posterior requires at each iteration the calculation and inversion of the matrix
K(x,x; w) + σ2In ∈ Rn×n which becomes prohibitive when n is large. To make GP models
applicable for larger data we refer to the literature on sparse GP [74, 87, 66] in which it is
assumed that learning about f can be achieved using a smaller sample of m latent variables,
called inducing variables, which may be a subsample of the original data or can be built
using other considerations as further discussed. The intuitive idea is to use the inducing
variables to channel the information contained in the covariate values x = {x1, . . . , xn}. We
denote the inducing inputs as x̃ = (x̃1, . . . , x̃m)T ∈ Rm×q and by K(x, x̃; w) ∈ Rn×m the matrix

K(x, x̃; w) =
[ k(x1, x̃1; w)  · · ·  k(x1, x̃m; w) ]
[      ...        . . .       ...     ]
[ k(xn, x̃1; w)  · · ·  k(xn, x̃m; w) ] ,   (1.16)
where k(xi, xj; w) is defined as in (1.13). The ratio m/n influences the trade-off between
computational efficiency and statistical efficiency, as a smaller m will favour the former
and a larger m will ensure no significant loss of the latter. If the function values for the
inducing points are defined as f̃ = (f(x̃1), . . . , f(x̃m))T, then the joint density of the response
vector Y, the latent variable f̃ and the parameter w can be expressed only in terms of the
m-dimensional vector f̃, since

P(y, f̃, w|x, x̃) = P(y|A(x, x̃; w)f̃) N(f̃; 0, K(x̃, x̃; w)) p(w),   (1.17)
where N (x;µ,Σ) is the normal density with mean µ and covariance Σ, p(w) is the prior
probability for the parameters w, and

A(x, x̃; w) = K(x, x̃; w) K(x̃, x̃; w)−1.   (1.18)

The form of P(y|A(x, x̃; w)f̃) is derived under the assumption that f = A(x, x̃; w)f̃ and
depends on the form of the sampling model P(y|f, σ); e.g., when the latter is N(f, σ2In) we
obtain P(y|A(x, x̃; w)f̃) = N(A(x, x̃; w)f̃, σ2In).
The posterior distribution π(f̃, w|y, x) is not tractable, but sampling from it is much
less expensive since K(x, x̃; w) ∈ Rn×m and K(x̃, x̃; w) ∈ Rm×m. While the inducing inputs
x̃ can be selected from the samples collected, we will use an alternative approach in which
we group the observed covariate values x into m clusters and choose the cluster-specific
covariate averages as x̃1, . . . , x̃m. For instance, given a specific value k, one can use the
k-means algorithm [12] to classify x into k clusters and estimate the cluster means with an
iterative method. Intuitively, it makes sense to have more inducing points in regions that
exhibit more variation in covariate values.
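As an illustration of this clustering step, a minimal k-means routine for selecting inducing inputs might look as follows; the function name and the deterministic initialization are our own choices, not prescriptions from the text:

```python
import numpy as np

def kmeans_inducing(X, m, iters=50):
    """Choose m inducing inputs as k-means cluster centers of the covariates X (n x q)."""
    # simple deterministic initialization: m points spread across the (index-ordered) data
    centers = X[np.linspace(0, len(X) - 1, m).astype(int)].copy()
    for _ in range(iters):
        # assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # move each non-empty center to the mean of its cluster
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers
```

In practice a library implementation with a more careful initialization would be preferable; this sketch only illustrates the assign-then-average iteration.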
Given a new test point x∗, we are interested in the corresponding posterior predictive distri-
bution of f∗ = f(x∗):

P(f∗|x∗, x, y) = ∫ P(f∗|f̃, w, x∗, x̃) P(f̃, w|x, y) df̃ dw.   (1.19)

In general, the integral involved in (1.19) cannot be calculated in closed form, but we can use
posterior draws (f̃, w)(t), t = 1, . . . , M, given x, y, to approximate the distribution of
f∗|x∗, x, y by the samples

(f∗)(t) = A(x∗, x̃; w(t))f̃(t), t = 1, · · · , M.

Statistical inference can be built on these samples.
Finally, in order to reduce the dimensionality of the parameter space, we assume that

f(xi) = f(xTi β),   (1.20)

and we set f̃ = (f(z1), . . . , f(zm))T, where z1, . . . , zm ∈ R are inducing inputs, f : R → R
is the unknown function of interest and β ∈ Rq is normalized, i.e. ‖β‖ = 1. Note that
without the normalization the parameter β is not identifiable. Here {z1, . . . , zm} play
the same role as {x̃1, . . . , x̃m} in the general sparse GP: they allow much faster sampling
of the posterior latent variables and should be spread over the range of {xT1β, . . . , xTnβ}. In the next
chapter we show how to choose the positions of these inducing inputs. The single index
model (SIM) defined by (1.20), coupled with the sparse GP approach (henceforth denoted
GP-SIM), has the advantage of casting the original problem of estimating a general function
in q dimensions from n observations into the estimation of the q-dimensional parameter
vector β and of the one-dimensional map f based on m ≪ n inducing points. The GP-SIM
approach has been successfully applied to mean regression problems [17, 39] and quantile
regression [44]. It can be used for large covariate dimensions and is much more flexible than
a simple linear model.
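The mechanics of GP-SIM — project the covariates onto β, place inducing points on the one-dimensional index scale, and evaluate the calibration through the matrix A of (1.18) — can be sketched as follows (illustrative naming; for covariates scaled to [0, 1] and ‖β‖ = 1 the index xTβ lies in [−√q, √q], the grid used below):

```python
import numpy as np

def gp_sim_values(X, beta, f_tilde, w, z):
    """Evaluate f(x_i^T beta) at all rows of X through inducing inputs z on the
    index scale, using f = A(x^T beta, z; w) f_tilde."""
    s = X @ beta                                       # single-index projection
    k = lambda a, b: np.exp(w[0]) * np.exp(-(a[:, None] - b[None, :]) ** 2 / np.exp(w[1]))
    A = k(s, z) @ np.linalg.inv(k(z, z) + 1e-6 * np.eye(len(z)))  # jittered inverse
    return A @ f_tilde

# m equally spaced inducing inputs on [-sqrt(q), sqrt(q)]
q, m = 1, 20
z = np.linspace(-np.sqrt(q), np.sqrt(q), m)
```

When a projected value xTi β coincides with an inducing input, A simply reads off the corresponding entry of f̃ (up to the jitter), which gives an easy correctness check.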
1.4 Model Selection
The conditional copula model involves two types of selection. First, one needs to choose the
copula family from a set of possible candidates. Second, it is often of interest to determine
whether a parametric simple form for the calibration is supported by the data. For instance,
a constant calibration function indicates that the dependence structure does not vary with
the covariates, a conclusion that may be of scientific interest in some applications. Let ω(t)
denote the vector of parameters and latent variables drawn at step t from the posterior
corresponding to model M. We consider two measures of fit that can be estimated from
the MCMC samples ω(t), t = 1, . . . , M. As mentioned before, the observed data set is
denoted by D = {(y1i, y2i, xi)}ni=1.
Cross-Validated Pseudo Marginal Likelihood
The cross-validated pseudo marginal likelihood (CVML) [33, 40] measures the average (over
parameter values) predictive power of model M via
CVML(M) = ∑_{i=1}^{n} log P(y1i, y2i|D−i, M),   (1.21)
where D−i is the data set from which the ith observation has been removed. An estimate of
(1.21) can be obtained using posterior draws for all the parameters and latent variables in
the model. Specifically, if the latter are denoted by ω, then
E[ P(y1i, y2i|ω, M)−1 ] = P(y1i, y2i|D−i, M)−1,   (1.22)
where the expectation is with respect to the conditional distribution of ω given the full data
D and the model M. Based on the posterior samples we can estimate the CVML as
CVMLest(M) = − ∑_{i=1}^{n} log( (1/M) ∑_{t=1}^{M} P(y1i, y2i|ω(t), M)−1 ).   (1.23)
The model with the largest CVML is selected.
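Computed naively, the inverse likelihoods in (1.23) can underflow, so in practice the estimate is assembled on the log scale. A minimal sketch (with our own helper names), taking an M × n matrix of pointwise log-likelihoods loglik[t, i] = log P(y1i, y2i|ω(t), M):

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log of the mean of exp(a) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return (m + np.log(np.mean(np.exp(a - m), axis=axis, keepdims=True))).squeeze(axis)

def cvml_estimate(loglik):
    """CVML estimate (1.23): -sum_i log( (1/M) sum_t P(y_i | omega^(t))^{-1} )."""
    return -np.sum(log_mean_exp(-loglik, axis=0))
```

When the pointwise log-likelihood is constant in t, the estimate reduces to the plug-in log-likelihood, which serves as a quick check.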
Watanabe-Akaike Information Criterion
The Watanabe-Akaike Information Criterion [WAIC, 91] is an information-based criterion
that is closely related to the CVML, as discussed in [92], [35] and [89].
The WAIC is defined as

WAIC(M) = −2 fit(M) + 2 p(M),   (1.24)

where the model fitness is

fit(M) = ∑_{i=1}^{n} log E[ P(y1i, y2i|ω, M) ]   (1.25)

and the penalty is

p(M) = ∑_{i=1}^{n} Var[ log P(y1i, y2i|ω, M) ].   (1.26)
The expectation in (1.25) and the variance in (1.26) are with respect to the conditional
distribution of ω given the data and can be computed using Monte Carlo samples from π.
For instance, the Monte Carlo estimate of the fit is
fit(M) = ∑_{i=1}^{n} log( (1/M) ∑_{t=1}^{M} P(y1i, y2i|ω(t), M) ),   (1.27)
and p(M) can be estimated similarly using the posterior samples. The model with the
smallest WAIC is preferred.
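Both (1.27) and the penalty (1.26) are simple functions of the same M × n matrix of pointwise log-likelihoods used for the CVML; a hedged sketch (our own naming):

```python
import numpy as np

def waic(loglik):
    """WAIC (1.24) from an M x n matrix loglik[t, i] = log P(y_i | omega^(t))."""
    m = loglik.max(axis=0)
    # fit (1.25)/(1.27): sum_i log of the posterior mean likelihood, computed stably
    fit = np.sum(m + np.log(np.mean(np.exp(loglik - m), axis=0)))
    # penalty (1.26): sum over observations of the posterior variance of the log-likelihood
    p = np.sum(loglik.var(axis=0))
    return -2 * fit + 2 * p
```

With a degenerate posterior (constant log-likelihood across draws) the penalty vanishes and the WAIC equals −2 times the log-likelihood, as expected.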
1.5 Simplifying Assumption
A substantial reduction of the parameter space is achieved under the so-called simplifying
assumption (SA), which posits Cθ(X) = C (as in (1.3)), i.e. that the conditional copula is
constant [38, 21]. The SA condition can significantly simplify vine copula estimation [for
example, see 1], but it is known to lead to bias when it is wrongly assumed [2]. Therefore,
for conditional copula models it is of practical interest to assess whether or not the data
support the SA. A first step towards a formal test for SA can be found in [4]. The reader is referred
to [23] for an excellent review of work on SA, and ideas for future developments.
If the calibration function and marginal distributions are modeled parametrically, e.g.

η(X) = α0 + ∑_{j=1}^{K} αj Ψj(X),

where the Ψj(X) are some basis functions and the unknown parameters are estimated by
maximum likelihood estimation (MLE) [37], then we can utilize standard asymptotic theory
to test α1 = α2 = . . . = αK = 0 via a canonical likelihood ratio test. However this approach
relies on knowledge of the shape of the calibration function, which is unrealistic in practice. A number
of research contributions address this issue for frequentist analyses, e.g. [4], [38], [23], [48].
Moreover, even if the calibration form is guessed correctly, misspecified marginals can lead
to wrong conclusions about the calibration behavior, as noted in Chapter 4 and [57].
Our contribution belongs within the Bayesian paradigm, following the general philosophy
expounded also in [49]. In this setting, it was observed in [20] that generic model selection
criteria tend to choose a more complex model even when SA holds.
1.6 Plan
In Chapter 2 we consider Bayesian joint analysis of the marginal and copula models using
flexible GP models. Our emphasis is placed on the estimation of the calibration function
η(X) which is assumed to have a GP prior that is evaluated at βTX for some normalized
β, thus coupling the GP-prior construct with the single index model (SIM) of [17] and
[39]. The GP-SIM is more flexible than a canonical linear model and computationally more
manageable than a full GP with q variables. The proposed model can be used for large
covariate dimension q and for large samples. Both marginal means will be fitted using
sparse GP approaches so that large data sets can be computationally manageable. The
dimension reduction of the SIM approach has been noted also by [30] who used two-stage
semiparametric methods to estimate the calibration function. In contrast to [30], we use a
Bayesian approach and estimate marginals and copula parameters jointly. So far, GP-SIM’s
have been used mostly in regression settings where the algorithm of [39] can be used to
efficiently sample the posterior distribution. However, the GP-SIM model for conditional
copulas involves a non-Gaussian likelihood which requires important modifications of their
algorithm.
A second contribution of this work (Chapters 3 and 4) deals with model selection issues that
are particularly relevant for the conditional copula construction. Of particular importance
are the choice of copula family and determining whether the simplifying assumption (SA) is
supported by the data. For the former task we develop a conditional cross-validated marginal
likelihood (CCVML) criterion and also examine its relation with the Watanabe Information
Criterion [91], while for determining the data support for SA we construct a permutation-
based variant of the CVML that shows good performance in our numerical experiments.
We then identify an important link between SA and missing covariates in the conditional
copula model. To our knowledge, this connection has not been reported elsewhere. Finally,
we propose two other procedures for testing the SA that rely on splitting the data into
two sets, fitting a flexible model on the first and examining the predictions this model makes
on the second. We then divide the data in the second (test) set into “bins” by the order of
the predicted values. To check whether the distribution in each bin is the same, we use a
permutation or chi-square test. We show with theoretical arguments that this procedure
attains the required probability of Type I error, and we support the arguments with
simulation results. We then extend these ideas to other models, and show that generic tests
may not be reliable when the complexity of the model is data-driven. A merit of the proposed
methods is their quite general applicability, but this comes, unsurprisingly, at the expense of power.
In order to investigate whether the trade-off is reasonable we design a simulation study and
present its conclusions.
We close this part by applying the proposed methods to a real-world problem, analyzing the
Wine data set in Chapter 5.
Chapter 2
Bayesian Conditional Copula using
Gaussian Processes
2.1 GP-SIM for Conditional copula
We consider a bivariate response variable (Y1, Y2) ∈ R2 together with covariate measurement
X ∈ Rq. Hence, the data D = {(y1i, y2i, xi), i = 1 . . . n} consist of triplets (y1i, y2i, xi)
where y1i, y2i ∈ R and xi ∈ Rq. For notational convenience, let y1 = (y11, . . . , y1n)T ,
y2 = (y21, . . . , y2n)T and x = (x1, . . . , xn)T . We assume that the marginal distribution
of Yj (j = 1, 2) is Gaussian with mean fj(X) and constant variance σ2j . If we let Yj =
(Yj1, . . . , Yjn)T , j = 1, 2, and fj = (fj(x1), . . . , fj(xn))T we can compactly write:
Yj ∼ N (fj, σ2j In) j = 1, 2. (2.1)
Generally, it is difficult to discern whether the copula structure varies with covariates or not,
so we consider a conditional copula to account for the more general situation. Therefore,
the likelihood function is
L(ω) = ∏_{i=1}^{n} (1/σ1) φ((y1i − f1i)/σ1) (1/σ2) φ((y2i − f2i)/σ2)
× c_{θ(xi)}( Φ((y1i − f1i)/σ1), Φ((y2i − f2i)/σ2) ),   (2.2)
where c denotes a parametric copula density function, ω denotes all the parameters in the
model, while Φ and φ are the cumulative distribution function and density function of a
standard normal distribution, respectively. The parameter of the copula depends on the unknown
function θ(xi) = g−1(f(xi)), where f is assumed to take the form given in (1.20) and g is
a known invertible link function that allows an unrestricted parameter space for f . Note
that the form of the GP-SIM model used for estimating the copula parameter is invariant
to non-linear transformations. This implies that the formulation of the model is the same
whether we directly estimate the copula parameter, θ(X), Kendall’s τ(X), or other mea-
sures of dependence. However, this is not true if we use an additive model for θ(X), since
additivity is not preserved by non-linear transformations.
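To make (2.2) concrete, a sketch of its log-likelihood for a generic copula family is given below; the helper names are our own, and the copula log-density is passed in as a function so that any family can be plugged in:

```python
import math

import numpy as np

# standard normal CDF, vectorized via the error function
_Phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def copula_loglik(y1, y2, f1, f2, s1, s2, log_copula_density, theta):
    """Log of the likelihood (2.2): two Gaussian marginal terms plus the copula term.

    log_copula_density(u1, u2, theta) must return log c_theta(u1, u2) elementwise."""
    z1, z2 = (y1 - f1) / s1, (y2 - f2) / s2
    log_phi = lambda z: -0.5 * z ** 2 - 0.5 * math.log(2.0 * math.pi)
    marginals = np.sum(log_phi(z1) - math.log(s1) + log_phi(z2) - math.log(s2))
    copula = np.sum(log_copula_density(_Phi(z1), _Phi(z2), theta))
    return marginals + copula
```

With the independence copula (c ≡ 1) the copula term vanishes and the log-likelihood reduces to the sum of the two marginal Gaussian log-likelihoods, a useful sanity check.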
The GP-SIM is fully specified once we assign the GP priors to f1, f2, f and the parametric
priors for the remaining parameters, as follows:
f1 ∼ GP(w1), f2 ∼ GP(w2), f ∼ GP(w),
w1 ∼ N(0, 5Iq+1), w2 ∼ N(0, 5Iq+1), w ∼ N(0, 5I2),
β ∼ U(Sq−1), σ1² ∼ IG(0.1, 0.1), σ2² ∼ IG(0.1, 0.1).   (2.3)
The GP(w) is a Gaussian Process prior with mean zero, squared exponential kernel with
parameters w, U(Sq−1) is a uniform distribution on the surface of the q-dimensional unit
sphere and IG(α, β) denotes the inverse gamma distribution. The above prior for w captures
very wiggly functions for small values of w and almost constant functions for large values of
w. The prior for the marginal variances is vague and would be conjugate in the absence of
the copula term. In our experience, the results are not sensitive to the choice of hyperparameter values.
Because the focus of our work is on inference for the copula, we allow f1 and f2 to be
evaluated on Rq while f is on R. In order to avoid computational problems that affect
the GP-based inference when the sample size is large, the inference will rely on the Sparse
GP method that was described in the previous section. Suppose x̃1 are the m1 inducing inputs
for function f1, x̃2 are the m2 inducing inputs for function f2 and z are the m inducing inputs for
function f . The number of inducing inputs m1, m2 and m can all be different, but in our
applications we will choose their values equal and significantly smaller than the sample size,
n. The choice is motivated by imperative computational time restrictions, given the large
number of numerical simulations we perform to investigate empirically the performance of
the approach in terms of estimation and model selection. In practice, the analyst should
ideally use the largest number of inducing points the computing environment can support.
As suggested earlier, we define x̃1 and x̃2 as the centers of m1 and m2 clusters of x. If m1 = m2
then the inducing inputs are the same. We cannot use the same strategy for z, since then
we would need the centers of the clusters of the variable xTβ, which are unknown. If we
assume that each covariate xis lies between 0 and 1 (this can be achieved easily by subtracting
the minimum value and dividing by the range), then by the Cauchy–Schwarz inequality we
obtain

‖xTi β‖ ≤ √(‖xi‖²‖β‖²) ≤ √q   ∀ xi, β.

Hence we can choose z to be m equally spaced points in the interval [−√q, √q].
Let f̃1 be f1 evaluated at x̃1, f̃2 be f2 evaluated at x̃2 and f̃ be f evaluated at z. Then
the joint density of the observed data and parameters is proportional to:
P(y1, y2, f̃1, f̃2, f̃, w1, w2, w, σ1², σ2², β | x, x̃1, x̃2, z) ∝ pN(y1; f1, σ1²In) pN(y2; f2, σ2²In)
× ∏_{i=1}^{n} c_{g−1(fi)}( Φ((y1i − f1i)/σ1), Φ((y2i − f2i)/σ2) ) × pN(f̃1; 0, K(x̃1, x̃1; w1))
× pN(f̃2; 0, K(x̃2, x̃2; w2)) pN(f̃; 0, K(z, z; w)) pN(w1; 0, 5Iq+1)
× pN(w2; 0, 5Iq+1) pN(w; 0, 5I2) pIG(σ1²; 0.1, 0.1) pIG(σ2²; 0.1, 0.1),   (2.4)
where f1 = A(x, x̃1; w1)f̃1, f2 = A(x, x̃2; w2)f̃2, f = A(xTβ, z; w)f̃, and pN and pIG are
the multivariate normal and inverse gamma densities, respectively. Although here we adopt
a full GP prior for the marginal models, the approach can easily be adapted to consider
GP-SIM models for the marginals too.
The contribution of the conditional copula model to the joint likelihood breaks the
tractability of the posterior conditional densities and complicates the design of an MCMC
algorithm that can sample efficiently from the posterior distribution. The conditional
joint posterior distribution of the latent variables (f) and parameters (w) given the observed
data D does not have a tractable form and its study will require the use of Markov Chain
Monte Carlo (MCMC) sampling methods. Specifically, we use Random Walk Metropolis
(RWM) within Gibbs sampling for w [19, 78, 7] while for f we will use the elliptical slice
sampling [65] that has been designed specifically for GP-based models and does not require
tuning of free parameters.
2.2 Computational Algorithm
Inference is based on the posterior distribution π(ω|D, x̃1, x̃2, z), where
ω = (f̃1, f̃2, f̃, w1, w2, w, σ1², σ2², β) ∈ Rk represents the vector of parameters and latent vari-
ables in the model, with k = 3m + 3q + 7. Since the posterior is not mathematically tractable,
its properties will be explored via Markov chain Monte Carlo (MCMC) sampling. In this
section we provide the detailed steps of the MCMC sampler designed to sample from π.
The general form of the algorithm falls within the class of Metropolis-within-Gibbs (MwG)
samplers in which we update in turn each component of the chain by sampling from its
conditional distribution, given all the other components. The presence of the copula in the
likelihood breaks the usual conditional conjugacy of the GP models so none of the compo-
nents have conditional distributions that can be sampled directly.
Suppose we are interested in sampling a target π(ω). A generic MwG sampler proceeds
as follows:

Step I Initialize the chain at ω1(1), ω2(1), . . . , ωk(1).

Step R At iteration t + 1 run iteratively the following steps for each j = 1, . . . , k:

1. Sample ω∗j ∼ qj(·|ωj(t), ω−j(t+1;t)), where ω−j(t+1;t) = (ω1(t+1), . . . , ωj−1(t+1), ωj+1(t), . . . , ωk(t)) is
the most recent state of the chain with the first j − 1 components updated already
(hence the supraindex t + 1), the jth component removed and the remaining k − j
components having the values determined at iteration t (hence the supraindex t).

2. Compute
r = min{ 1, [π(ω1(t+1), . . . , ωj−1(t+1), ω∗j, ωj+1(t), . . . , ωk(t)) qj(ωj(t)|ω∗j, ω−j(t+1;t))]
/ [π(ω1(t+1), . . . , ωj−1(t+1), ωj(t), ωj+1(t), . . . , ωk(t)) qj(ω∗j|ωj(t), ω−j(t+1;t))] }.

3. With probability r accept the proposal and set ωj(t+1) = ω∗j; otherwise, with probability
1 − r, reject it and set ωj(t+1) = ωj(t).
The proposal density qj(·|·) corresponds to the transition kernel used for the jth compo-
nent. Our algorithm uses a number of proposals corresponding to Random Walk Metropolis-
within-Gibbs (RWMwG), Independent Metropolis-within-Gibbs (IMwG) and Elliptical
Slice Sampling within Gibbs (SSwG) moves.
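As a toy illustration of the MwG scheme, the following sketch uses symmetric random-walk updates for every component (so the qj ratio cancels in r); all names are our own:

```python
import numpy as np

def metropolis_within_gibbs(log_pi, omega0, M, steps, seed=None):
    """Generic Metropolis-within-Gibbs: update each coordinate in turn with a
    symmetric random-walk proposal, so r = min(1, pi(proposal)/pi(current))."""
    rng = np.random.default_rng(seed)
    omega = np.asarray(omega0, dtype=float).copy()
    k = omega.size
    chain = np.empty((M, k))
    lp = log_pi(omega)
    for t in range(M):
        for j in range(k):                       # cycle through the k components
            prop = omega.copy()
            prop[j] += steps[j] * rng.standard_normal()
            lp_prop = log_pi(prop)
            if np.log(rng.uniform()) <= lp_prop - lp:
                omega, lp = prop, lp_prop        # accept the component update
        chain[t] = omega
    return chain
```

In the actual sampler each component uses its own tailored proposal (RWMwG, IMwG or SSwG), but the accept/reject skeleton is the same.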
At the t+ 1 step we use the following proposals to update the chain:
wj: Use a RWM transition kernel: w∗ ∼ N (w(t)j , cwjIq+1). The constant cwj is chosen so
that the acceptance rate is about 30%, j = 1, 2.
w: Use the RWM: w∗ ∼ N (w(t), cwI2). The constant cw is chosen so that the acceptance
rate is about 30%.
σj²: Without the copula, the conditional posterior distribution of σj² would be IG(0.1 +
n/2, 0.1 + (yj − Aj f̃j(t))T(yj − Aj f̃j(t))/2), where Aj = A(x, x̃j; wj(t+1)), for j = 1, 2. We
use this distribution to build an independent Metropolis (IM) type of transition
for σj², j = 1, 2. The acceptance rate is usually in the range [0.25, 0.60] and the
chain mixes better than it would under a RWM.
β: Since β is normalized, we use a RWM on the unit sphere based on the von Mises–Fisher
distribution (henceforth denoted VMF). The VMF distribution has two parameters: µ
(normalized to have norm one), which represents the mean direction, and κ, the concen-
tration parameter. A larger κ implies that the distribution is more concentrated
around µ. The density is symmetric in µ and its argument, and is proportional to

fVMF(x; µ, κ) ∝ exp(κ xTµ).

The proposals are generated using β∗ ∼ VMF(β(t), κ), where κ is chosen so that the
acceptance rate is around 30%.
f ’s: For fj, j = 1, 2 and f we use the elliptical slice sampling proposed by [65] which does not
require the tuning of simulation parameters. Although not needed in our examples, we
note that if the chain’s mixing is sluggish, one can improve it using the parallelization
strategy proposed by [68].
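For intuition, a simple symmetric alternative to the VMF proposal for β — perturb with isotropic Gaussian noise and renormalize to the sphere — can be sketched as follows; this is our own illustrative substitute, not the proposal used in the thesis:

```python
import numpy as np

def sphere_rw_proposal(beta, scale, rng):
    """Symmetric random walk on the unit sphere: Gaussian perturbation followed
    by renormalization. The induced proposal density depends only on the angle
    between the current and proposed points, so it cancels in the MH ratio."""
    prop = beta + scale * rng.standard_normal(beta.size)
    return prop / np.linalg.norm(prop)
```

Like the VMF proposal, the scale (playing the role of 1/κ) is tuned so that the acceptance rate lands near the desired value.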
In our experience the efficiency of the algorithm benefits from initial values that are not too
far from the posterior mode. Therefore we propose to first roughly estimate the parameters
in two independent regressions for y1 and y2 to get (f̃1, w1, σ1²)(1) and (f̃2, w2, σ2²)(1), and
then run another MCMC that fixes the marginals and samples only (f̃, w), which yields
(f̃, w)(1). These three short chains (100-200 iterations each) provide good initial values for the
joint MCMC sampler. This simple approach shortens the time it would take for the original
chain to find the regions of high mass under the posterior. We have also found that the
chain's mixing is accelerated when initial values for w are small, thus allowing for more
variation in the calibration function.
Remark: In our numerical experiments, we will fit the GP-SIM model to data with constant
calibration, i.e., with true values βi = 0 for all 1 ≤ i ≤ q. The constraint ‖β‖ = 1
forbids sampling null values for all the components of β simultaneously, and instead the
MCMC draws for β's components are spread randomly over the support. However, the shape
of the calibration function is correctly recovered, since the sampled values for the second
component of w were large, reflecting the perfect dependence between f(xTi β) and f(xTj β)
for any 1 ≤ i ≠ j ≤ n. This led to difficulties in identifying the SA, as discussed below, and
compelled us to develop a new SA identification procedure that is described in Section 4.2.
2.3 Performance of the algorithms
2.3.1 Simulations
The purpose of the simulation study is to assess empirically: 1) the performance of the
estimation method under the correct and misspecified models, as well as 2) the ability of
the model selection criteria to identify the correct copula structure, i.e. the copula family
and the parametric form of the calibration function. For the former aim we compute the
integrated mean squared error for various quantities of interest, including Kendall's τ. In order
to facilitate the assessment of the estimation performance across different copula families,
we estimate the calibration function on the Kendall’s τ scale. The latter is given by
τ(X) = 4( ∫∫ C(u1, u2|X) c(u1, u2|X) du1 du2 ) − 1.
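Since the display above equals 4E[C(U1, U2)] − 1 with (U1, U2) ∼ C, it can be checked by Monte Carlo. A sketch for the Clayton family with θ > 0 (sampling by conditional inversion; the function names are ours):

```python
import numpy as np

def clayton_sample(theta, n, seed=None):
    """Sample (U1, U2) from a Clayton copula (theta > 0) by conditional inversion."""
    rng = np.random.default_rng(seed)
    u1, v = rng.uniform(size=n), rng.uniform(size=n)
    # invert the conditional CDF C_{2|1}(u2 | u1) = v
    u2 = ((v ** (-theta / (1 + theta)) - 1) * u1 ** (-theta) + 1) ** (-1 / theta)
    return u1, u2

def kendall_tau_mc(theta, n=50000, seed=None):
    """Monte Carlo estimate of tau = 4 E[C(U1, U2)] - 1 for the Clayton copula."""
    u1, u2 = clayton_sample(theta, n, seed)
    C = (u1 ** (-theta) + u2 ** (-theta) - 1) ** (-1 / theta)  # Clayton copula CDF
    return 4 * C.mean() - 1
```

For Clayton, τ = θ/(θ + 2) in closed form (Table 2.2), so the estimate can be validated directly, e.g. θ = 2 gives τ = 0.5.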
We will compare 3 copulas: Clayton, Frank and Gaussian under the general GP-SIM model
and the Clayton with a constant calibration function. To fit the model with a constant copula,
we still use MCMC, but instead of f̃, f, w and β in the calibration we have a constant scalar
copula parameter θ. The RWMwG transition is used to sample θ, while the proposal distributions
for the marginals' parameters and latent variables remain the same. Table 2.1 shows the
copula density functions (as functions of the parameter θ) for each copula family.
inverse-link functions g−1 used for calibration, the functional relationship between Kendall’s
τ and copula parameters and parameter ranges for every copula family used in this thesis.
Table 2.1: Copula density functions for each copula family.

Copula     c(u1, u2|θ)
Clayton    (1+θ)(u1 u2)^(−1−θ) / A^(1/θ+2), where A = u1^(−θ) + u2^(−θ) − 1
Frank      θ(1 − e^(−θ)) e^(−θ(u1+u2)) [(1 − e^(−θ)) − (1 − e^(−θu1))(1 − e^(−θu2))]^(−2)
Gaussian   (1/√(1−θ²)) exp( −[θ²(y1² + y2²) − 2θ y1 y2] / [2(1−θ²)] ), where yj = Φ^(−1)(uj)
T (v df)   [2π √(1−θ²) dv(y1) dv(y2)]^(−1) [1 + (y1² + y2² − 2θ y1 y2)/(v(1−θ²))]^(−v/2−1),
           where yj = t_v^(−1)(uj) and dv(y) is the univariate t density with v df
Gumbel     [A/(u1 u2)] (y1 + y2)^(−2+2/θ) (ln(u1) ln(u2))^(θ−1) [1 + (θ−1)(y1 + y2)^(−1/θ)],
           where yj = (−ln(uj))^θ and A = exp(−(y1 + y2)^(1/θ))
Table 2.2: Parameter range, inverse-link function and the functional relationship between
Kendall's τ and the copula parameter.

Copula       Range of parameter (θ)   Inverse-link function         Kendall's τ formula
Clayton      (−1, ∞) \ {0}            θ = exp(f) − 1                τ = θ/(θ+2)
Frank        (−∞, ∞) \ {0}            θ = f                         No closed form
Gaussian, T  (−1, 1)                  θ = (exp(f)−1)/(exp(f)+1)     τ = (2/π) arcsin(θ)
Gumbel       (1, ∞)                   θ = exp(f) + 1                τ = 1 − 1/θ
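The mappings in Table 2.2 are simple enough to state in code. The following sketch (the function names are ours, not from the thesis implementation) converts a calibration value f to the copula parameter θ, and θ to Kendall's τ, for the families with a closed form:

```python
import math

def theta_from_f(f, family):
    # Inverse-link g^{-1}: calibration value f -> copula parameter theta (Table 2.2).
    if family == "clayton":
        return math.exp(f) - 1.0                            # theta in (-1, inf) \ {0}
    if family == "frank":
        return f                                            # identity link
    if family in ("gaussian", "t"):
        return (math.exp(f) - 1.0) / (math.exp(f) + 1.0)    # theta in (-1, 1)
    if family == "gumbel":
        return math.exp(f) + 1.0                            # theta in (1, inf)
    raise ValueError(family)

def tau_from_theta(theta, family):
    # theta -> Kendall's tau (Table 2.2); Frank has no closed form.
    if family == "clayton":
        return theta / (theta + 2.0)
    if family in ("gaussian", "t"):
        return 2.0 / math.pi * math.asin(theta)
    if family == "gumbel":
        return 1.0 - 1.0 / theta
    raise ValueError("no closed form for this family")
```
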
In addition to Kendall's τ, we also use the conditional mean of Y1 given y2 and x for
assessing the estimation. Such conditional means can be useful in prediction when one of
the responses is more expensive to measure than the other. The calculation is mathematically
straightforward:

E(Y1|Y2 = y2, x) = f1(x) + σ1 ∫₀¹ Φ⁻¹(z) c_{θ(x)}( z, Φ((y2 − f2(x))/σ2) ) dz.   (2.5)

The integral in (2.5) is usually not tractable, but it can easily be estimated via numerical
integration since it is one-dimensional and defined on the closed interval [0, 1].
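As a rough illustration of this one-dimensional quadrature (a sketch of ours, not the thesis code, assuming a Clayton copula with θ > 0 and the notation of (2.5)):

```python
import math
from statistics import NormalDist

def clayton_density(u1, u2, theta):
    # Clayton copula density (Table 2.1), valid for theta > 0.
    A = u1 ** (-theta) + u2 ** (-theta) - 1.0
    return (1.0 + theta) * (u1 * u2) ** (-1.0 - theta) * A ** (-1.0 / theta - 2.0)

def cond_mean_y1(y2, f1, f2, s1, s2, theta, n_grid=4000):
    # Midpoint-rule approximation of the integral in (2.5); the
    # midpoints avoid the singular endpoints z = 0 and z = 1.
    nd = NormalDist()
    u2 = nd.cdf((y2 - f2) / s2)
    h = 1.0 / n_grid
    integral = sum(nd.inv_cdf((k + 0.5) * h)
                   * clayton_density((k + 0.5) * h, u2, theta) * h
                   for k in range(n_grid))
    return f1 + s1 * integral
```

For a positively dependent Clayton copula, the conditional mean should increase with y2, which provides a quick sanity check on the quadrature.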
2.3.2 Simulation Details
We generate samples of size n = 400 from each of the following six scenarios using the Clayton
copula. The covariates are generated independently from the Uniform(0, 1) distribution. The
covariate dimension q is 10 in Scenario 3 and 2 in all other scenarios.
Sc1: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
     f2(x) = 0.6 sin(3x1 + 5x2),
     τ(x) = 0.7 + 0.15 sin(15xᵀβ),
     β = (1, 3)ᵀ/√10, σ1 = σ2 = 0.2

Sc2: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
     f2(x) = 0.6 sin(3x1 + 5x2),
     τ(x) = 0.3 sin(5xᵀβ),
     β = (1, 3)ᵀ/√10, σ1 = σ2 = 0.2

Sc3: f1(x) = cos(xᵀβ),
     f2(x) = sin(xᵀβ),
     τ(x) = 0.7 + 0.20 sin(5xᵀβ),
     β = (1, 10, −3, 6, 1, −6, 3, 7, −1, −5)ᵀ/√267, σ1 = σ2 = 0.2

Sc4: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
     f2(x) = 0.6 sin(3x1 + 5x2),
     τ(x) = 0.5,
     σ1 = σ2 = 0.2

Sc5: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
     f2(x) = 0.6 sin(3x1 + 5x2),
     η(x) = 1 + 0.7 sin(3x1³) − 0.5 cos(6x2²),
     σ1 = σ2 = 0.2

Sc6: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
     f2(x) = 0.6 sin(3x1 + 5x2),
     η(x) = 1 + 0.7x1 − 0.5x2²,
     σ1 = σ2 = 0.2
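As an illustrative sketch (ours, not the thesis code), data with an Sc4-type structure can be generated by sampling the Clayton copula via its conditional inverse; the function names are hypothetical:

```python
import math
import random
from statistics import NormalDist

def clayton_pair(theta, rng):
    # Conditional-inverse sampler for the Clayton copula (theta > 0):
    # draw u1 ~ U(0,1), then invert the conditional cdf C(u2 | u1).
    u1, w = rng.random(), rng.random()
    u2 = (u1 ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u1, u2

def simulate_sc4_like(n, rng):
    # Gaussian marginals with the sine mean functions of Sc4 and a
    # constant Kendall's tau = 0.5, i.e. Clayton theta = 2*tau/(1-tau) = 2.
    nd = NormalDist()
    theta, s1, s2 = 2.0, 0.2, 0.2
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        f1 = 0.6 * math.sin(5 * x1) - 0.9 * math.sin(2 * x2)
        f2 = 0.6 * math.sin(3 * x1 + 5 * x2)
        u1, u2 = clayton_pair(theta, rng)
        data.append((x1, x2, f1 + s1 * nd.inv_cdf(u1), f2 + s2 * nd.inv_cdf(u2)))
    return data
```

The empirical Kendall's τ of the copula draws should be close to 0.5, which is a quick check that the conditional inverse was derived correctly.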
Sc1 and Sc2 have calibration functions for which the SIM model is true for Kendall's τ
and, consequently, also for the copula parameter. Sc1 corresponds to large dependence
(τ greater than 0.5) while Sc2 has small dependence (τ between −0.3 and 0.3). Sc3
also has a SIM form for the calibration function, but the covariate dimension is q = 10, so this
scenario is important for evaluating how well the algorithms scale with dimension. Sc4
corresponds to covariate-free dependence (τ = 0.5) and allows us to verify the power
to detect simple parametric forms for the calibration. Scenarios Sc5 and Sc6 do not have
a SIM form, but have additive calibration functions [as in 80]. They are used to evaluate the
effect of model misspecification on the inference. Note that Sc6 has an almost-SIM calibration
when x2 ∈ [0, 1]. From our experiments we found that with m = 30 inducing points
for the marginal and calibration sparse GPs we obtain a reasonable CPU time,
allowing us to perform the desired number of replications while still capturing the
general form of the estimated functions. On average, one MCMC iteration (n = 400) with
GP-SIM calibration takes 0.02 seconds, and one iteration with constant calibration (and GPs for
the marginals) takes 0.015 seconds. The MCMC samplers were run for 20,000 iterations for all
scenarios.
The first half of the MCMC sample is discarded as burn-in and the second half is used
for inference. As noted earlier, starting values were found by running two GP regressions
separately to estimate the marginal parameters, and one MCMC sampler was run to
estimate the calibration parameters; all three samplers were run for only 100 iterations.
2.3.3 Proof of concept based on one Replicate
In the absence of computable convergence bounds, we used the Gelman-Rubin [34] diag-
nostic statistics to decide the length of the chain’s run. To illustrate using Sc1, we ran
10 independent MCMC chains, each for 20, 000 iterations, that were started from different
initial values. The trace plots for the potential scale reduction factor (PSRF), computed up
to 10,000 iterations, for β, σ1² and σ2² are displayed in Figure 2.1.

Figure 2.1: Sc1: Clayton copula, Gelman-Rubin MCMC diagnostic (median and 97.5% quantile of the shrink factor) for Beta 1, Beta 2, Sigma Squared 1 and Sigma Squared 2.

The plots show that the multivariate PSRF after 10,000 iterations is 1.1. The subsequent 10,000 samples were used
for inference.
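A minimal sketch of the Gelman-Rubin PSRF for a scalar parameter, using the standard between/within-chain decomposition (the function name is ours):

```python
import statistics

def psrf(chains):
    # Potential scale reduction factor; `chains` is a list of
    # equal-length lists of samples for one scalar parameter.
    m = len(chains)
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)      # between-chain
    W = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    var_plus = (n - 1) / n * W + B / n                            # pooled variance estimate
    return (var_plus / W) ** 0.5
```

Chains started from overdispersed initial values that have mixed should give a PSRF close to 1, while chains stuck around different modes give values well above 1.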
Parameter Estimation
The simulation results show that Sc1 and Sc2 performed similarly. Since the calibration
function in Sc1 is more complicated, for the sake of reducing the chapter's length we present
results only for that scenario. The trace-plots, autocorrelation functions and histograms of the
posterior samples of β, σ1² and σ2² are shown in Figure 2.2, when the fitted copula belongs
to the correct Clayton family (the horizontal solid red line is the true value). Next we
Figure 2.2: Sc1: Trace-plots, ACFs and histograms of parameters based on MCMC samples generated under the true Clayton family.
show predictions for the marginal means with 95% credible intervals. Since these are two-dimensional
surfaces, we estimate 'slices' at the values 0.2 and 0.8: we first fix
x1 = 0.2 and then x1 = 0.8, and similarly for x2. The results are in Figure 2.3 (black is the truth,
green is the estimate, red lines are the credible intervals).
One of the inferential goals is the prediction of the calibration function or, equivalently,
the Kendall's τ function. In this case we are dealing with only two covariates, so their joint
effect can be visualized via the Kendall's τ surface. In Figure 2.4 we show the true calibration
surface in the left panel and the fitted one in the right. The accuracy is remarkable, and it is
hard to see major differences between the two panels. Since the visual comparison
of the three-dimensional true and fitted surfaces may be misleading, as with the conditional
marginal means we estimate one-dimensional slices at the values 0.2 and 0.8; the results,
shown in Figure 2.5, confirm the accuracy of the fit.
The predictive power of the model was assessed by fixing 4 covariate points and estimating
the corresponding Kendall's τ values: τ(0.2, 0.2), τ(0.2, 0.8), τ(0.8, 0.2), τ(0.8, 0.8). At each
MCMC iteration these predictions are calculated and histograms (Figure 2.6) are constructed
(red lines mark the true values of τ). The same estimates are presented in Figure 2.7 when the
Gaussian copula is used for inference. One can notice that the estimates are biased in this
instance, thus emphasizing the importance of identifying the right copula family. Similar
Figure 2.3: Sc1: Estimation of marginal means. The leftmost 2 columns show the accuracy for predicting E(Y1) and the rightmost 2 columns show the results for predicting E(Y2). The black and green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family.
Figure 2.4: Sc1: Estimation of Kendall's τ dependence surface. The true surface (left panel) is very similar to the estimated one (right panel).
patterns have been observed when using the Frank copula.
We also show how well the algorithm estimates the calibration function when the covariate dimension
is large. Figure 2.8 shows one-dimensional slices of the Kendall's τ function for Sc3, estimated
with the Clayton GP-SIM model. Each plot is produced by varying one coordinate from
Figure 2.5: Sc1: Estimation of Kendall's τ one-dimensional projections when x1 = 0.2 or 0.8 (top panels) and when x2 = 0.2 or 0.8 (bottom panels). The black and green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family.
Figure 2.6: Sc1: Histogram of predicted Kendall's τ values obtained under the true Clayton copula.
Figure 2.7: Sc1: Histogram of predicted τs with the Gaussian copula model.
0 to 1 while fixing all other coordinates at 0.5. We observe that even in this case the
estimated curves are very close to the true Kendall's τ function.
2.3.4 Multiple Replicates
So far, the results reported were based on a single implementation of the method. In order
to facilitate interpretation, we perform 50 independent replications under each of the six
scenarios described previously.
The MCMC sampler was run for 20, 000 iterations for all scenarios. As before, the first
half of iterations was ignored as a burn-in period. For each data set, 4 estimations were
done with Clayton, Frank, Gaussian and constant Clayton copulas. For Sc5 and Sc6 we
also fitted the Clayton copula with an additive model for the calibration function, as in
[80]. The marginal distributions models have the general GP form throughout the section.
In order to produce overall measures of fit, we report the integrated squared Bias (IBias2),
Variance (IVar) and mean squared error (IMSE) of Kendall’s τ evaluated at covariates x =
(x1, . . . , xn)T . The calculation requires finding points estimates for τr(xi) for 1 ≤ r ≤ R
independently replicated analyses and each i = 1, . . . , n. The formulas for IBias2, IVar and
28
Figure 2.8: Sc3: Estimation of Kendall's τ one-dimensional projections for each coordinate, fixing all other coordinates at the 0.5 level. The black and green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family.
IMSE are given by:

IBias² = (1/n) Σ_{i=1}^n [ (1/R) Σ_{r=1}^R τr(xi) − τ(xi) ]²,
IVar = (1/n) Σ_{i=1}^n Var_r( τr(xi) ),
IMSE = IBias² + IVar.   (2.6)
We will apply these concepts not only to Kendall's τ but also to E(Y1|Y2 = y2, X = x) for
different combinations (x, y2).
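Given an R × n array of replicate estimates, the decomposition in (2.6) is a short computation; the following sketch uses our own (hypothetical) names:

```python
import statistics

def integrated_errors(tau_hat, tau_true):
    # tau_hat[r][i]: estimate of tau(x_i) from replicate r (R x n);
    # tau_true[i]: the true tau(x_i).  Implements (2.6).
    R, n = len(tau_hat), len(tau_true)
    ibias2 = statistics.fmean(
        (statistics.fmean(tau_hat[r][i] for r in range(R)) - tau_true[i]) ** 2
        for i in range(n))
    ivar = statistics.fmean(
        statistics.variance([tau_hat[r][i] for r in range(R)])
        for i in range(n))
    return ibias2, ivar, ibias2 + ivar
```
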
Estimation
IBias2, IVar and IMSE for each scenario and each model are shown in Table 2.3 (bold values
show smallest IMSE for each scenario). Note that the smallest IMSE is produced when
fitting the correct model and copula family. The Clayton model with GP-SIM calibration
Table 2.3: Estimated √IBias², √IVar and √IMSE of Kendall's τ for each Scenario and Model.

           Clayton                      Frank                        Gaussian                     Clayton Constant
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc1        0.0393   0.0575  0.0697     0.0357   0.0657  0.0748     0.0679   0.0734  0.1        0.1046   0.0208  0.1066
Sc2        0.0492   0.0665  0.0827     0.0695   0.1     0.1218     0.0509   0.0692  0.0859     0.2314   0.0242  0.2327
Sc3        0.0327   0.0744  0.0813     0.041    0.0858  0.0951     0.0846   0.1069  0.1363     0.123    0.0134  0.1237
Sc4        0.0061   0.0355  0.036      0.0133   0.0584  0.0599     0.0205   0.0493  0.0534     0.0016   0.0258  0.0258
Sc5        0.0723   0.0777  0.1061     0.0703   0.0881  0.1127     0.0842   0.0857  0.1202     0.1589   0.024   0.1607
Sc6        0.0147   0.0384  0.0411     0.0175   0.05    0.0529     0.0338   0.0559  0.0654     0.0849   0.021   0.0874
has the smallest IMSE in all scenarios with the exception of Sc4. We note that models with
constant calibration have much smaller IVar than models with GP-SIM, but have much larger
IBias and, consequently, IMSE. Not surprisingly, for Sc4 the Clayton copula model with
constant calibration yields the smallest IMSE. For each simulated data set and each model,
E(Y1|Y2 = y2, x) was estimated. For all scenarios except Sc3 we let each of x1, x2 take
values in the set {0.2, 0.4, 0.6, 0.8} and y2 in {−0.6, −0.2, 0.2, 0.6}, making a total of 64 com-
binations. For Sc3 we let y2 take values in {−0.5, 0.0, 0.5, 1.0}, while x can take 33 values
scattered in [0, 1]¹⁰, making a total of 132 combinations. The results are presented in Ta-
ble 2.4 and largely mimic the patterns found in Table 2.3, thus showing that the predictive
power of the model and the accuracy of dependence estimation are linked.
Table 2.4: Estimated √IBias², √IVar and √IMSE of E(Y1|y2, x) for each Scenario and Model.

           Clayton                      Frank                        Gaussian                     Clayton Constant
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc1        0.0231   0.0531  0.0579     0.1264   0.0322  0.1304     0.1434   0.0557  0.1539     0.0416   0.0579  0.0713
Sc2        0.0293   0.0464  0.0549     0.0802   0.0475  0.0932     0.1098   0.0593  0.1247     0.1213   0.0407  0.128
Sc3        0.0364   0.0707  0.0795     0.214    0.0363  0.217      0.1042   0.0708  0.1259     0.0483   0.0572  0.0749
Sc4        0.0174   0.042   0.0454     0.1023   0.0325  0.1074     0.1379   0.0449  0.145      0.0179   0.041   0.0447
Sc5        0.0144   0.0413  0.0437     0.0909   0.0347  0.0973     0.14     0.051   0.149      0.0355   0.04    0.0534
Sc6        0.0202   0.0456  0.0498     0.1046   0.0298  0.1087     0.1367   0.0448  0.1439     0.0237   0.0442  0.0501
The results for scenarios Sc5 and Sc6, in which the true calibration has an additive
form, are shown in Table 2.5. Shown are the global measures of fit for Kendall's τ and
E(Y1|Y2 = y2, x) when the true Clayton copula is coupled with the GP-SIM and the additive
model for representing the calibration function. It is not surprising that GP-SIM outperforms
the additive model under Sc6, since the calibration function is not far from having a SIM form
in this case (due to 0 ≤ u − u² ≤ 1/4 for any u ∈ [0, 1]). This is not observed in Sc5, where
GP-SIM performs worse for Kendall's τ estimation than the true additive model.
Table 2.5: Estimated √IBias², √IVar and √IMSE of Kendall's τ and E(Y1|y2, x) for the GP-SIM and Additive models.

Kendall's τ
           Clayton GP-SIM               Clayton Additive
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc5        0.0723   0.0777  0.1061     0.0573   0.0516  0.0771
Sc6        0.0147   0.0384  0.0411     0.0063   0.0458  0.0462

E(Y1|y2, x)
           Clayton GP-SIM               Clayton Additive
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc5        0.0144   0.0413  0.0437     0.0207   0.0428  0.0475
Sc6        0.0202   0.0456  0.0498     0.0236   0.0483  0.0538
Chapter 3
Model Selection
3.1 Conditional CVML criterion
The conditional copula construction is particularly useful in predicting one response given
the other ones. We exploit this feature by computing the predictive distribution of one
response given the rest of the data. The resulting conditional CVML (CCVML) is computed
from P(y1i|y2i, D−i) and P(y2i|y1i, D−i) via

CCVML(M) = (1/2) { Σ_{i=1}^n log[P(y1i|y2i, D−i, M)] + Σ_{i=1}^n log[P(y2i|y1i, D−i, M)] }.   (3.1)
Note that when the marginal distributions are uniform, CCVML is the same as CVML. One
can easily show that

E[ P(y1i|y2i, ω, M)⁻¹ ] = E[ P(y2i|ω, M) / P(y1i, y2i|ω, M) ] = P(y1i|y2i, D−i, M)⁻¹,
E[ P(y2i|y1i, ω, M)⁻¹ ] = E[ P(y1i|ω, M) / P(y1i, y2i|ω, M) ] = P(y2i|y1i, D−i, M)⁻¹.   (3.2)
Based on (3.2), CCVML is estimated from MCMC samples using

CCVML_est(M) = −(1/2) Σ_{i=1}^n { log[ (1/M) Σ_{t=1}^M P(y2i|ω^(t), M) / P(y1i, y2i|ω^(t), M) ]
               + log[ (1/M) Σ_{t=1}^M P(y1i|ω^(t), M) / P(y1i, y2i|ω^(t), M) ] }.   (3.3)
In [92] it was demonstrated that CVML and WAIC are asymptotically equivalent, so that
CVML(M) ≈ WAIC(M)/(−2) for a large sample size n. This connection can be extended
to CCVML using the following two conditional WAICs:

CWAIC1(M) = −2 Σ_{i=1}^n log E[P(y1i|y2i, ω, M)] + 2 Σ_{i=1}^n Var[log P(y1i|y2i, ω, M)],   (3.4)
CWAIC2(M) = −2 Σ_{i=1}^n log E[P(y2i|y1i, ω, M)] + 2 Σ_{i=1}^n Var[log P(y2i|y1i, ω, M)],   (3.5)

where the expectation and variance are with respect to the conditional distribution of ω given
the observed data. An argument that directly follows the one in [89] shows that CCVML
and (1/2){CWAIC1 + CWAIC2} are also asymptotically equivalent.
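Assuming the per-sample densities P(y1i, y2i|ω^(t)), P(y1i|ω^(t)) and P(y2i|ω^(t)) have been stored during MCMC, the estimator (3.3) is a few lines of code; this sketch uses our own (hypothetical) array layout:

```python
import math

def ccvml_estimate(joint, m1, m2):
    # Monte Carlo estimator (3.3):
    #   joint[t][i] = P(y1i, y2i | w^(t)),
    #   m1[t][i]    = P(y1i | w^(t)),
    #   m2[t][i]    = P(y2i | w^(t)).
    M, n = len(joint), len(joint[0])
    total = 0.0
    for i in range(n):
        inv1 = sum(m2[t][i] / joint[t][i] for t in range(M)) / M  # ~ P(y1i|y2i, D_-i)^-1
        inv2 = sum(m1[t][i] / joint[t][i] for t in range(M)) / M  # ~ P(y2i|y1i, D_-i)^-1
        total += math.log(inv1) + math.log(inv2)
    return -0.5 * total
```

When the densities are constant across posterior draws, the estimator reduces to the exact conditional log predictive probabilities, which gives a simple consistency check.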
3.2 Simulation results with CVML, CCVML and WAIC
criteria
In this section we focus on the accuracy of CVML, CCVML and WAIC in selecting the
correct model. All results below are based on the same simulations described in Section 2.3.2.
3.2.1 One replicate
Table 3.1 shows the values for each scenario and model for one replicate. Bold values indicate
largest CVML/CCVML and smallest WAIC values for each scenario. Observe that for Sc1,
Sc2, Sc3, Sc5, Sc6, all three criteria point to the Clayton family, while for Sc4 they indicate
the Clayton family with a constant calibration. We note that the correct copula is selected
even when the generative calibration model is additive (Sc5, Sc6).
3.2.2 Multiple replicates
We now show how well CVML, CCVML and WAIC perform in choosing the correct model
under data set replication. For selecting between different copula families, or for checking whether
the dependence is covariate-free, we simply pick the model with the largest CVML/CCVML or the smallest
WAIC. Table 3.2 shows how often the Clayton model is selected over the other models using CVML,
CCVML and WAIC for Sc1, Sc2, Sc3, Sc5 and Sc6. Similarly, Table 3.3 shows how often
Clayton-constant is selected over the other models for Sc4.
We can conclude that all selection measures perform quite similarly across scenarios.
Also, the numerical study shows that the choice of a copula family is considerably more
Table 3.1: CVML, CCVML and WAIC values for each Scenario and Model.

Scenario   Model           CVML   CCVML   WAIC
Sc1        Clayton          532     458   −1065
Sc1        Frank            422     365    −844
Sc1        Gaussian         397     326    −801
Sc1        Clayton-Const    503     433   −1007
Sc2        Clayton          166     103    −333
Sc2        Frank            144      82    −289
Sc2        Gaussian         146      84    −293
Sc2        Clayton-Const    121      60    −243
Sc3        Clayton          613     536   −1237
Sc3        Frank            562     491   −1126
Sc3        Gaussian         494     417   −1002
Sc3        Clayton-Const    537     462   −1076
Sc4        Clayton          322     254    −644
Sc4        Frank            277     209    −549
Sc4        Gaussian         276     207    −547
Sc4        Clayton-Const    323     255    −647
Sc5        Clayton          324     277    −650
Sc5        Frank            256     216    −513
Sc5        Gaussian         260     214    −520
Sc5        Clayton-Const    299     257    −600
Sc6        Clayton          286     242    −573
Sc6        Frank            216     179    −432
Sc6        Gaussian         205     165    −410
Sc6        Clayton-Const    283     238    −567
Table 3.2: The percentage of correct decisions for each selection criterion when comparing the correct Clayton model with a non-constant calibration against all the other models: the Frank model with non-constant calibration, the Gaussian model with non-constant calibration, and the Clayton model with constant calibration.

           Frank                      Gaussian                   Clayton Constant
Scenario   CVML   CCVML   WAIC       CVML   CCVML   WAIC        CVML   CCVML   WAIC
Sc1        100%   100%    100%       100%   100%    100%        94%    94%     94%
Sc2        100%   100%    100%       100%   100%    100%        100%   100%    100%
Sc3        98%    96%     98%        100%   100%    100%        100%   98%     100%
Sc5        100%   100%    100%       100%   100%    100%        100%   100%    100%
Sc6        100%   100%    100%       100%   100%    100%        98%    100%    98%
accurate than correctly determining that the calibration function is constant. The latter
difficulty has been reported elsewhere [e.g., 20]. In part, this is due to the fact that the models
are flexible enough to capture the constant calibration and produce estimates that mislead a
cross-validation-based method. In Section 4.2 we return to this problem and develop a new
permutation-based procedure that exhibits drastically improved performance in numerical
experiments. Since Sc5 and Sc6 were simulated with the Clayton additive calibration, we
show how often the Clayton Additive model is selected over Clayton GP-SIM using different
Table 3.3: The percentage of correct decisions for each selection criterion when comparing the correct Clayton model with a constant calibration against three models: Clayton, Frank and Gaussian, all of them assuming a GP-SIM calibration.

           Clayton                    Frank                      Gaussian
Scenario   CVML   CCVML   WAIC       CVML   CCVML   WAIC        CVML   CCVML   WAIC
Sc4        58%    62%     58%        100%   100%    100%        100%   100%    100%
criteria (Table 3.4). The poor performance for Sc6 is not that surprising, since the additive
calibration in this scenario has an almost-SIM form.

Table 3.4: The percentage of correct decisions for each selection criterion when comparing the correct additive model against GP-SIM with non-constant calibration.

           Clayton GP-SIM
Scenario   CVML   CCVML   WAIC
Sc5        92%    94%     90%
Sc6        30%    34%     28%
3.3 Additional Simulation Results Based On Multiple Replicates
In addition to the simulations shown in Sections 2.3.2 and 3.2.2, we also simulated and analyzed
50 independent replicates from each of the following scenarios:

Sc1b: f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
      f2(x) = 0.6 sin(3x1 + 5x2),
      τ(x) = 0.7 + 0.15 sin(15xᵀβ),
      β = (1, 3)ᵀ/√10, σ1 = σ2 = 0.2, n = 1000

Sc7:  f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
      f2(x) = 0.6 sin(3x1 + 5x2),
      τ(x) = 0.7 + 0.15 sin(2xᵀβ),
      β = (2, −3)ᵀ, σ1 = σ2 = 0.2, n = 400

Sc1b is exactly the same as Sc1, the only difference being that the sample size is 1000 instead of 400.
In Sc7 we do not assume that the generating β is normalized.
Tables 3.5 and 3.6 show IBias², IVar and IMSE for each scenario (including Sc1 for
comparison) and each model, for the estimation of Kendall's τ and E(Y1|y2, x), respectively.
Table 3.5: Estimated √IBias², √IVar and √IMSE of Kendall's τ for each Scenario and Model.

           Clayton                      Frank                        Gaussian                     Clayton Constant
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc1        0.0393   0.0575  0.0697     0.0357   0.0657  0.0748     0.0679   0.0734  0.1000     0.1046   0.0208  0.1066
Sc1b       0.0308   0.0526  0.0610     0.0353   0.0569  0.067      0.0693   0.0624  0.0933     0.1093   0.0148  0.1102
Sc7        0.0266   0.0527  0.0591     0.0422   0.0673  0.0794     0.0636   0.0743  0.0978     0.1050   0.0162  0.1063
Table 3.6: Estimated √IBias², √IVar and √IMSE of E(Y1|y2, x) for each Scenario and Model.

           Clayton                      Frank                        Gaussian                     Clayton Constant
Scenario   √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE      √IBias²  √IVar   √IMSE
Sc1        0.0231   0.0531  0.0579     0.1264   0.0322  0.1304     0.1434   0.0557  0.1539     0.0416   0.0579  0.0713
Sc1b       0.0140   0.0362  0.0389     0.1250   0.0238  0.1272     0.1457   0.0412  0.1514     0.036    0.0473  0.0594
Sc7        0.0154   0.0466  0.0491     0.1228   0.0315  0.1268     0.1300   0.0559  0.1415     0.0437   0.0494  0.0659
First we note that Clayton with GP-SIM calibration produces the smallest IMSE in all scenarios.
Another important observation is that the IMSEs for Kendall's τ and conditional response
prediction are smaller for Sc1b than for Sc1, which indicates that as the sample size
increases the model produces more accurate predictions. Results for Sc7 are similar to Sc1,
so even when the true generating β in the SIM is not normalized we still obtain acceptable
predictions for each test value x; the posterior for β simply converges to the normalized
vector (2, −3)ᵀ/√13.
Table 3.7 shows how often the Clayton model (with non-constant calibration) is selected over
the other models using CVML, CCVML and WAIC. Again we notice that these criteria
perform well in distinguishing between copula families. Also, in Sc1b all criteria select the true
non-constant Clayton at a higher rate than in Sc1, which is probably due to the larger sample
size.
Table 3.7: The percentage of correct decisions for each selection criterion when comparing the correct Clayton model with a non-constant calibration against all the other models: the Frank model with non-constant calibration, the Gaussian model with non-constant calibration, and the Clayton model with constant calibration.

           Frank                      Gaussian                   Clayton Constant
Scenario   CVML   CCVML   WAIC       CVML   CCVML   WAIC        CVML   CCVML   WAIC
Sc1        100%   100%    100%       100%   100%    100%        94%    94%     94%
Sc1b       100%   100%    100%       100%   100%    100%        98%    98%     98%
Sc7        100%   100%    100%       100%   100%    100%        98%    94%     98%
Chapter 4
Simplifying Assumption
4.1 Interesting Connection between Model Misspecification and the Simplifying Assumption
Understanding whether the data support the SA or not is usually important for the subject
matter analysis, since a dependence structure that does not depend on the covariates can be
of scientific interest. The SA also has a serious impact on the statistical analysis, because
it has the potential to greatly simplify the estimation of the copula. There is, however, an
interesting connection between model misspecification and the SA which, as far as we know, has
not been reported elsewhere.
To illustrate the point, consider a random sampling design setting with two independent
random variables X1, X2 serving as covariates in a Clayton copula model in which the SA is
satisfied, the sample size is n = 1500, and

f1(x) = 0.6 sin(5x1 + x2),
f2(x) = 0.6 sin(x1 + 5x2),
τ(x) = 0.5,
σ1 = σ2 = 0.2.
When we fit a GP-SIM model with the correct Clayton copula family, but with the X2
covariate omitted from both the marginal and copula models, the estimated Kendall's τ(x1)
exhibits a clear non-constant shape, as seen in Figure 4.1. The CVML, CCVML and WAIC
criteria, whose values are shown in Table 4.1, unanimously vote for a non-constant calibration
function.
Figure 4.1: Estimation of Kendall's τ as a function of x1 when only the first covariate is used in estimation. The dotted black and solid green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family.
Table 4.1: Missed covariate: CVML, CCVML and WAIC criterion values for the model in which the conditional copula depends on one covariate and for the model in which it is constant.

Variables   CVML   CCVML   WAIC
X1          −508   −174    1017
Constant    −570   −232    1140
While one may expect a non-constant pattern when the two covariates are dependent,
this residual effect of X1 on the copula may be surprising when X1 and X2 are independent.
We can gain some understanding by considering a simplified example in which
Yi|X1, X2 ∼ N(fi(X1, X2), 1) for i = 1, 2, and Cov(Y1, Y2|X1, X2) = Corr(Y1, Y2|X1, X2) = ρ,
hence constant in X1 and X2. For marginal models that include only X1, yielding
residuals Wi = Yi − E[Yi|X1] for i = 1, 2, we are interested in explaining the non-constant
dependence between Cov(W1, W2|X1) and X1. Standard properties of covariance
and conditional expectation are used to obtain

Cov(W1, W2|X1) = Cov(Y1, Y2|X1),   (4.1)
and

Cov(Y1, Y2|X1) = E[Cov(Y1, Y2|X1, X2)] + Cov(E[Y1|X1, X2], E[Y2|X1, X2])
               = ρ + Cov(f1(X1, X2), f2(X1, X2)),   (4.2)

where the covariance in (4.2) is with respect to the marginal distribution of X2. Hence it is
apparent that the conditional covariance Cov(W1, W2|X1) will generally not be constant in
X1. Note that if the true means have an additive form, i.e. fi(X1, X2) = fi(X1) + fi(X2) for
i = 1, 2, then the covariances in (4.1) are indeed constant in X1, but the estimated value
of Cov(Y1, Y2|X1) will be biased. Although here we have focused on the covariance as a measure
of dependence, the argument extends to copula parameters or Kendall's τ, but the
calculations are more involved.
In conclusion, a violation of the SA may be due to the omission of important covariates
from the model. This phenomenon, along with the knowledge that it is in general difficult
to measure all the variables with a potential effect on the dependence pattern, suggests that
a non-constant copula is a prudent choice.
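The covariance term in (4.2) can be evaluated numerically for the mean functions of this example. The following sketch (ours, using a midpoint approximation over X2 ~ Uniform(0,1)) shows that the induced covariance varies markedly with x1:

```python
import math

def cov_f1_f2_given_x1(x1, n_grid=20000):
    # Midpoint approximation of Cov(f1(X1,X2), f2(X1,X2) | X1 = x1)
    # with X2 ~ Uniform(0,1), for the mean functions of this example:
    # f1(x) = 0.6 sin(5 x1 + x2), f2(x) = 0.6 sin(x1 + 5 x2).
    h = 1.0 / n_grid
    e1 = e2 = e12 = 0.0
    for k in range(n_grid):
        x2 = (k + 0.5) * h
        a = 0.6 * math.sin(5 * x1 + x2)
        b = 0.6 * math.sin(x1 + 5 * x2)
        e1 += a * h
        e2 += b * h
        e12 += a * b * h
    return e12 - e1 * e2
```

With these mean functions, the induced covariance is negative at x1 = 0 and positive at x1 = 0.5, even though ρ itself is constant; this is exactly the residual dependence pattern picked up in Figure 4.1.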
4.2 A Permutation-based CVML to Detect Data Support for the Simplified Assumption
In this section we modify the CVML and CCVML criteria to identify data support for the SA
after the copula family has been selected.
As was shown in Section 3.2.2, the CVML, CCVML and WAIC criteria yield good results
in identifying the correct copula family, but do not perform well in recognizing that the true
calibration is constant. In other words, they have a large probability of Type I error. This is in
line with [20], who also noted that traditional Bayesian model selection criteria, e.g. the
Deviance Information Criterion (DIC) of [88], tend to prefer the more complex calibration
model over a simple model with constant calibration even when the latter is actually correct.
In addition to the simulations presented in the previous section, we add here that when the
marginal distributions are estimated, the performance of the existing criteria worsens. To
illustrate, we have simulated 50 replicates of sample size 1500 from Sc1, Sc4 and Sc5 using
the Clayton copula. Each sample is fitted with the general model introduced here and with a
constant Clayton copula, while the marginals are estimated using a general GP. Table 4.2 shows
the proportion of correct decisions for the three scenarios and the various selection criteria. These
results show that even for a large sample size, the proportion of correct decisions for Sc4, i.e.
when SA holds, is quite low. One explanation is that the general model does a good
job of capturing the constant trend of the calibration function and yields predictions that
are not far from those produced by the simpler (and correct) model. The modified
CVML we propose is inspired by two desiderata: i) to separate the set of observations used
for prediction from the set of observations used for fitting the model, and ii) to amplify the
impact of the copula-induced errors in the CCVML calculation. The former reduces the
implicit bias one gets when the same data are used for estimation and testing, while the latter
is expected to increase the power to identify SA.
For i) we randomly partition the data into a training set $\mathcal{D} = \{y_{1i}, y_{2i}, x_i\}_{i=1,\ldots,n_1}$ and a
test set $\mathcal{D}^* = \{y^*_{1i}, y^*_{2i}, x^*_i\}_{i=1,\ldots,n_2}$. In our numerical experiments we have kept two thirds of the
observations in the training set. In order to achieve ii) we note that permuting the response
indexes will not affect the copula term if SA is indeed satisfied, but will perturb the prediction
when SA is not satisfied. However, one must implement this idea cautiously, since
the permutation $\lambda : \{1, \ldots, n_2\} \rightarrow \{1, \ldots, n_2\}$ will affect the marginal model fit, regardless
of the SA status, as $y_{j\lambda(i)}$ will be paired with $x_i$, for all $j = 1, 2$. Below we describe the
permutation-based CVML criterion that combines i) and ii).
Assume that the fitted GP-SIM model yields posterior samples from the conditional distri-
bution of latent variables and parameters, $\omega^{(t)} \sim \pi(\omega \mid \mathcal{D})$, $t = 1, \ldots, M$. Then we define the
observed-data criterion as the predictive log probability of the test cases, which can be easily
estimated from the posterior samples as follows:
\[
\begin{aligned}
\mathrm{CVML}_{\mathrm{obs}} &= \sum_{i=1}^{n_2} \log P(y^*_{1i}, y^*_{2i} \mid \mathcal{D}, x^*_i) \approx \sum_{i=1}^{n_2} \log\left\{\frac{1}{M}\sum_{t=1}^{M} P(y^*_{1i}, y^*_{2i} \mid \omega^{(t)}, x^*_i)\right\} \\
&= \sum_{i=1}^{n_2} \log\Bigg\{\frac{1}{M}\sum_{t=1}^{M} \frac{1}{\sigma^{(t)}_1}\,\phi\!\left(\frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma^{(t)}_1}\right) \frac{1}{\sigma^{(t)}_2}\,\phi\!\left(\frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma^{(t)}_2}\right) \\
&\qquad\qquad \times\, c_{\theta^{*(t)}_i}\!\left[\Phi\!\left(\frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma^{(t)}_1}\right), \Phi\!\left(\frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma^{(t)}_2}\right)\right]\Bigg\},
\end{aligned}
\]
where $f^{*(t)}_{1i}$, $f^{*(t)}_{2i}$, $\theta^{*(t)}_i$ are the predicted values for the test cases produced by the GP-SIM
model. Consider J permutations of {1, . . . , n2}, denoted $\lambda_1, \ldots, \lambda_J$, and compute the
J permuted CVMLs as:
\[
\mathrm{CVML}_{j} = \sum_{i=1}^{n_2} \log\Bigg\{\frac{1}{M}\sum_{t=1}^{M} \frac{1}{\sigma^{(t)}_1}\,\phi\!\left(\frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma^{(t)}_1}\right) \frac{1}{\sigma^{(t)}_2}\,\phi\!\left(\frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma^{(t)}_2}\right) \times c_{\theta^{*(t)}_{\lambda_j(i)}}\!\left[\Phi\!\left(\frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma^{(t)}_1}\right), \Phi\!\left(\frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma^{(t)}_2}\right)\right]\Bigg\}. \quad (4.3)
\]
Note that CVMLobs differs from CVMLj only in the values of the copula parameters: while
the former uses $\theta(x^*_i)$, the latter uses $\theta(x^*_{\lambda_j(i)})$ for the dependence between $y^*_{1i}$ and
$y^*_{2i}$. If the calibration is constant then CVMLobs and CVMLj should be similar, hence we define
the evidence
\[
\mathrm{EV} = 2 \times \min\left\{ \frac{1}{J}\sum_{j=1}^{J} \mathbf{1}_{\{\mathrm{CVML}_{\mathrm{obs}} < \mathrm{CVML}_j\}},\; \frac{1}{J}\sum_{j=1}^{J} \mathbf{1}_{\{\mathrm{CVML}_{\mathrm{obs}} > \mathrm{CVML}_j\}} \right\}. \quad (4.4)
\]
Under the null model with constant calibration and known marginals, and if we assume
that CVMLobs and {CVMLj : 1 ≤ j ≤ J} are iid, then each term inside the min
function in (4.4) has a Uniform(0, 1) limiting distribution as J → ∞. In that case it
follows that P(EV < 0.05) = 0.05. In practice, the ideal situation just described is merely
an approximation, since the {CVMLj : 1 ≤ j ≤ J} are not independent and we compute EV
using a fixed number of permutations. Nevertheless, the ideal setup can be used to build
our decision rule: when EV > 0.05 the data support SA, and otherwise they do not.
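As a concrete sketch, the evidence in (4.4) is straightforward to compute once the observed and permuted CVML values are available; the function and the numbers below are illustrative, not part of the thesis code:

```python
import numpy as np

def evidence(cvml_obs, cvml_perm):
    """Two-sided permutation evidence as in (4.4): twice the smaller of the two
    tail proportions of the permuted CVML values around the observed CVML."""
    cvml_perm = np.asarray(cvml_perm, dtype=float)
    below = np.mean(cvml_obs < cvml_perm)  # fraction of permuted values above the observed one
    above = np.mean(cvml_obs > cvml_perm)  # fraction of permuted values below the observed one
    return 2.0 * min(below, above)

# An observed CVML in the bulk of the permuted values gives EV near 1 (data
# support SA); one in either extreme tail drives EV towards 0.
perm = [-101.0, -100.5, -100.0, -99.5, -99.0]
ev = evidence(-100.2, perm)
```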
A similar rule can be built using the CCVML criterion. For instance, its value for the test
data is
\[
\mathrm{CCVML}_{\mathrm{obs}} = \frac{1}{2}\sum_{i=1}^{n_2} \log P(y^*_{1i} \mid \mathcal{D}, x^*_i, y^*_{2i}) + \frac{1}{2}\sum_{i=1}^{n_2} \log P(y^*_{2i} \mid \mathcal{D}, x^*_i, y^*_{1i}). \quad (4.5)
\]
The permutation-based version of (4.5) can be obtained using the same principle as in (4.3)
thus leading to the counterpart of (4.4) for CCVML.
Table 4.3 shows the proportion of correct decisions using the proposed methods with 1000
and 500 samples in the training and test sets, respectively, and J = 500 permutations. The results,
especially those for Sc4, clearly show an important improvement in the rate of making the
correct selection, with only a slight decrease in the power to detect non-constant calibrations.
We also notice that CVML and CCVML performed similarly.
Table 4.2: The percentage of correct decisions for each selection criterion and scenario. GP-SIM and SA were fitted with the Clayton copula; sample size is 1500.

Scenario   CVML   CCVML   WAIC
Sc1        100%   100%    100%
Sc4         74%    78%     74%
Sc5        100%   100%    100%

Table 4.3: The percentage of correct decisions for each selection criterion and scenario. Predicted CVML and CCVML values based on n1 = 1000 training and n2 = 500 test data, respectively. The calculation of EV is based on a random sample of 500 permutations.

Scenario   CVML   CCVML
Sc1         98%    96%
Sc4         92%    90%
Sc5        100%   100%
4.3 Two Other Methods for Detecting Data Support for SA
The permutation-based CVML and CCVML perform much better in identifying a constant
copula than the generic CVML and WAIC. However, they lack theoretical justification and cannot
guarantee a specific probability of Type I error. Therefore, in this section we develop further the idea
of splitting the whole data set into training and test sets, and propose two other SA
testing procedures that can also be applied to other models. We propose to use
properties that are invariant to the group of permutations when SA holds. In the first
stage we randomly divide the data D into training and test sets, D1 and D2, with sample sizes n1 and
n2, respectively. The full model defined by (1.9) is fitted on D1, and we denote by
$\omega^{(t)}$ the t-th draw sampled from the posterior. For the ith item in D2, we compute point esti-
mates $\bar\eta_i$ and $\bar U_i = (\bar U_{1i}, \bar U_{2i})$, where $U_{ji} = F_j(y_{ji} \mid \omega_j, x_i)$, $j = 1, 2$, $i = 1, \ldots, n_2$, and $\omega_j$ are the
parameters and latent variables related to marginal distribution j. The marginal
parameter estimates are obtained from the training-data posterior draws. For instance,
if the marginal models are $Y_{1i} \sim \mathcal{N}(f_1(x_i), \sigma_1^2)$ and $Y_{2i} \sim \mathcal{N}(f_2(x_i), \sigma_2^2)$, then each
MCMC sample $\omega^{(t)}$ leads to estimates $f_1^t(x_i), f_2^t(x_i), \sigma_1^t, \sigma_2^t, \eta^t(x_i)$. Then $\bar U_i = (\bar U_{1i}, \bar U_{2i})$ are
obtained using
\[
(\bar U_{1i}, \bar U_{2i}) = \left(\overline{\Phi\big((y_{1i} - f_1(x_i))/\sigma_1\big)},\; \overline{\Phi\big((y_{2i} - f_2(x_i))/\sigma_2\big)}\right),
\]
where the overline $\bar a$ signifies the average of the Monte Carlo draws $a^t$.
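As a small illustration of this averaging, suppose the posterior draws for one test case are stored as arrays; all names and numerical values below are hypothetical, for a Gaussian marginal:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical posterior draws for one test case: M draws of the marginal mean
# f1(x_i) and of the standard deviation sigma1, under a Gaussian marginal model.
M = 1000
f1_draws = rng.normal(1.0, 0.05, size=M)       # f1^t(x_i)
sig1_draws = np.abs(rng.normal(0.2, 0.01, M))  # sigma1^t
y1 = 1.1                                       # observed response y_{1i}

# U-bar_{1i}: average over the draws of Phi((y_{1i} - f1^t(x_i)) / sigma1^t)
u1_bar = np.mean(norm.cdf((y1 - f1_draws) / sig1_draws))
```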
Given the vector of calibration function evaluations at the test points, $\bar\eta = (\bar\eta_1, \ldots, \bar\eta_{n_2})$,
and a partition $\min(\bar\eta) = a_1 < \ldots < a_{K+1} = \max(\bar\eta)$ of the range of $\bar\eta$ into K disjoint
intervals, define the set of observations in D2 that yield calibration function values between
$a_k$ and $a_{k+1}$, $B_k = \{i : a_k \le \bar\eta_i < a_{k+1}\}$, $k = 1, \ldots, K$. We choose the partition such that each
"bin" Bk has approximately the same number of elements, n2/K. Under SA, the bin-specific
estimates of various measures of dependence, e.g. Kendall's τ or Spearman's ρ, computed
from the samples $\bar U_i$, are invariant to permutations, or swaps, across bins. Based on this
observation, we consider the procedure described in Table 4.4 for identifying data support for SA.

A1 Compute the kth bin-specific Kendall's tau $\hat\tau_k$ from $\{\bar U_i : i \in B_k\}$, k = 1, . . . , K.

A2 Compute the observed statistic $T^{\mathrm{obs}} = \mathrm{SD}_k(\hat\tau_k)$, where $\mathrm{SD}_k(a_k)$ denotes the standard deviation of the sequence $\{a_k\}$ over the index k. Note that if SA holds, we expect the observed statistic to be close to zero.

A3 Consider J permutations $\lambda_j : \{1, \ldots, n_2\} \rightarrow \{1, \ldots, n_2\}$. For each permutation $\lambda_j$:

A3.1 Compute $\hat\tau_{jk} = \hat\tau(\{\bar U_i : \lambda_j(i) \in B_k\})$, k = 1, . . . , K.

A3.2 Compute the test statistic $T_j = \mathrm{SD}_k(\hat\tau_{jk})$. Note that if SA holds, we expect $T_j$ to be close to $T^{\mathrm{obs}}$.

A4 We consider that there is support in favour of SA at significance level α if $T^{\mathrm{obs}}$ is smaller than the (1 − α)-th empirical quantile calculated from the sample {Tj : 1 ≤ j ≤ J}.

Table 4.4: Method 1: A permutation-based procedure for assessing data support in favour of SA

The distribution of the resulting test statistics obtained in Method 1 is determined
empirically, via permutations. Alternatively, one can rely on the asymptotic properties of the
bin-specific dependence parameter estimates and construct a Chi-square test. Specifically,
suppose the bin-specific Pearson correlations $\hat\rho_k$ are computed from the samples $\{\bar U_i : i \in B_k\}$,
for all k = 1, . . . , K. Let $\hat\rho = (\hat\rho_1, \ldots, \hat\rho_K)^T$, and let $n = n_2/K$ be the number of points in each
bin. It is known that $\hat\rho_k$ is asymptotically normally distributed for each k, so that
\[
\sqrt{n}(\hat\rho_k - \rho_k) \xrightarrow{d} \mathcal{N}\big(0, (1 - \rho_k^2)^2\big),
\]
where $\rho_k$ is the true correlation in bin k. If we assume that the $\{\hat\rho_k : k = 1, \ldots, K\}$ are
independent, and set $\rho = (\rho_1, \ldots, \rho_K)^T$ and $\Sigma = \mathrm{diag}\big((1-\rho_1^2)^2, \ldots, (1-\rho_K^2)^2\big)$, then we
have:
\[
\sqrt{n}(\hat\rho - \rho) \xrightarrow{d} \mathcal{N}(0, \Sigma).
\]
B1 Compute the bin-specific Pearson correlation $\hat\rho_k$ from the samples $\{\bar U_i : i \in B_k\}$, for all k = 1, . . . , K. Let $\hat\rho = (\hat\rho_1, \ldots, \hat\rho_K)^T$, and n = n2/K, the number of points in each bin.

B2 Define $\hat\Sigma = \mathrm{diag}\big((1-\hat\rho_1^2)^2, \ldots, (1-\hat\rho_K^2)^2\big)$ and $A \in \mathbb{R}^{(K-1)\times K}$ as in (4.6); then under SA we have that ρ1 = . . . = ρK and
\[
n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho) \xrightarrow{d} \chi^2_{K-1}.
\]
Compute $T^{\mathrm{obs}} = n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho)$.

B3 Compute the p-value $= P(\chi^2_{K-1} > T^{\mathrm{obs}})$ and reject SA if p-value < α.

Table 4.5: Method 2: A Chi-square test for assessing data support in favour of SA
In order to combine evidence across bins, we define the matrix $A \in \mathbb{R}^{(K-1)\times K}$ as
\[
A = \begin{pmatrix}
1 & -1 & 0 & \cdots & 0 \\
0 & 1 & -1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & -1
\end{pmatrix}. \quad (4.6)
\]
Since under the null hypothesis SA holds, one gets ρ1 = . . . = ρK, implying
\[
n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho) \xrightarrow{d} \chi^2_{K-1}.
\]
Method 2, with its steps detailed in Table 4.5, relies on the ideas above to test SA.
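The computations in Table 4.5 amount to a few lines of linear algebra. The sketch below is illustrative only: the bin construction is assumed done, the function name is hypothetical, and seeded Gaussian pairs stand in for the pseudo-observations:

```python
import numpy as np
from scipy.stats import chi2

def chisq_sa_test(bins):
    """bins: list of K arrays of shape (n, 2) holding (U1, U2) pairs, one per bin.
    Returns the statistic T_obs and its chi-square p-value (Method 2)."""
    K = len(bins)
    n = bins[0].shape[0]
    rho = np.array([np.corrcoef(b[:, 0], b[:, 1])[0, 1] for b in bins])
    Sigma = np.diag((1.0 - rho**2) ** 2)
    # Successive-difference contrast matrix A of shape (K-1, K), as in (4.6).
    A = np.eye(K - 1, K) - np.eye(K - 1, K, k=1)
    d = A @ rho
    T = n * d @ np.linalg.inv(A @ Sigma @ A.T) @ d
    return T, chi2.sf(T, K - 1)

rng = np.random.default_rng(1)
n = 2000

def sample(rho):  # bivariate normal pairs with correlation rho
    return rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)

# Equal correlations (SA): moderate statistic; unequal correlations: tiny p-value.
T_eq, p_eq = chisq_sa_test([sample(0.5), sample(0.5)])
T_ne, p_ne = chisq_sa_test([sample(0.0), sample(0.9)])
```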
Method 1 evaluates the p-value using a randomization procedure [56], while the second
is based on the asymptotic normal theory of Pearson correlations. To get reliable results
it is essential to assign test observations to the "correct" bins, which happens when the calibration
predictions are as close as possible to the true unknown values, i.e. $\bar\eta(x_i) \approx \eta(x_i)$. The latter
depends heavily on the estimation procedure and on the sample size of the training set. Therefore
it is advisable to apply very flexible models for the calibration function estimation and to have
enough data points in the training set. We immediately see a tradeoff: the more observations
are assigned to D1, the better the calibration predictions will be, at the expense of
decreasing power due to a smaller sample size in D2. For our simulations we have used
n1 ≈ 0.5n and n2 ≈ 0.5n, and K ∈ {2, 3}.

To get the intuition behind the proposed methods, consider an idealized example where
marginals are uniform, the true calibration is known and we have access to "infinite" data;
moreover, we focus on the situation with only 2 bins. Note that if SA is true then the correlation
in each bin is the same, and any permutation should yield the same or a very similar correlation.
On the other hand, if SA is not satisfied, and assuming for simplicity that the calibration takes
only 2 values, then, since observations are assigned to bins by their calibration values, bin 1 and
bin 2 will contain pairs following distributions π1(u1, u2) and π2(u1, u2) with corresponding
correlations ρ1 < ρ2, respectively. Note that after a random permutation, the pairs in bins 1 and 2 will
follow the mixture distributions λπ1(u1, u2) + (1 − λ)π2(u1, u2) and (1 − λ)π1(u1, u2) + λπ2(u1, u2),
respectively, with λ ∈ (0, 1). It follows that the correlations of the permuted data
in bins 1 and 2 are λρ1 + (1 − λ)ρ2 and (1 − λ)ρ1 + λρ2. Observe that each of these correlations lies
between ρ1 and ρ2, which implies that the absolute difference between the two correlations
after any permutation must be less than ρ2 − ρ1, which is our observed test statistic. Of
course, with real, finite data the observed statistic is not such an obvious outlier, but it should
lie somewhere in the tail of the distribution of the permuted statistics. In other words, this example illustrates
that if SA is not satisfied then our proposed methods should reject this hypothesis (at least
for a large enough sample size).
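This two-bin intuition, and the core of Method 1 (Table 4.4), can be checked numerically. The sketch below is illustrative only: the names are not from the thesis code, scipy's `kendalltau` is used for the bin-wise estimates, and seeded Gaussian pairs stand in for the pseudo-observations:

```python
import numpy as np
from scipy.stats import kendalltau

def method1_pvalue(U, eta_hat, K=2, J=200, rng=None):
    """Permutation test of Table 4.4: sort the pseudo-observations U (n x 2) by
    the predicted calibration eta_hat, split them into K equal bins, and compare
    the spread of the bin-wise Kendall taus with its permutation distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    U = np.asarray(U)[np.argsort(eta_hat)]
    bins = np.array_split(np.arange(len(U)), K)

    def spread(V):  # SD over bins of the bin-specific Kendall taus
        taus = [kendalltau(V[b, 0], V[b, 1])[0] for b in bins]
        return float(np.std(taus))

    T_obs = spread(U)
    T_perm = [spread(U[rng.permutation(len(U))]) for _ in range(J)]
    return T_obs, float(np.mean(np.array(T_perm) >= T_obs))  # permutation p-value

rng = np.random.default_rng(2)
n = 400
eta = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])  # two calibration levels

def gauss_pairs(rho, m):
    return rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=m)

U_sa = gauss_pairs(0.5, n)                                              # SA holds
U_no = np.vstack([gauss_pairs(0.0, n // 2), gauss_pairs(0.9, n // 2)])  # SA violated
T_sa, p_sa = method1_pvalue(U_sa, eta, rng=np.random.default_rng(3))
T_no, p_no = method1_pvalue(U_no, eta, rng=np.random.default_rng(3))
```

When SA is violated, the observed spread sits in the far tail of the permutation distribution and the p-value is small, exactly as the mixture argument above predicts.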
4.4 Simulations
In this section we present the performance of the proposed methods and comparisons with
generic CVML and WAIC criteria on simulated data sets. Different functional forms of
calibration function, sample sizes and magnitude of deviation from SA will be explored.
Simulation details
We generate samples of sizes n = 500 and n = 1000 from the 3 scenarios described below.
For all scenarios the Clayton copula is used to model dependence between the responses,
while the covariates are independently sampled from U[0, 1]. For all scenarios the covariate
dimension is q = 2. The marginal conditional distributions Y1|X and Y2|X are modeled as Gaussian
with constant variances $\sigma_1^2, \sigma_2^2$ and conditional means f1(X), f2(X), respectively. The model
parameters must be estimated jointly with the calibration function η(X). For convenience
we parametrize the calibration on the Kendall's tau τ(X) scale.
Sc1 $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$,
    $f_2(x) = 0.6\sin(3x_1 + 5x_2)$,
    $\tau(x) = 0.5$, $\sigma_1 = \sigma_2 = 0.2$.

Sc2 $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$,
    $f_2(x) = 0.6\sin(3x_1 + 5x_2)$,
    $\tau(x) = \delta + \gamma \times \sin(10x^T\beta)$,
    $\beta = (1, 3)^T/\sqrt{10}$, $\sigma_1 = \sigma_2 = 0.2$.

Sc3 $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$,
    $f_2(x) = 0.6\sin(3x_1 + 5x_2)$,
    $\tau(x) = \delta + \gamma \times 2(x_1 + \cos(6x_2) - 0.45)/3$,
    $\sigma_1 = \sigma_2 = 0.2$.
Sc1 corresponds to SA, since Kendall's τ is independent of the covariate level. The calibration
function in Sc2 has a single index form, while in Sc3 it has an additive structure on the τ scale
(generally not additive on the η scale); these simulations are useful for evaluating performance
under model misspecification. We note that τ in Sc2 and Sc3 depends on the parameters
δ (average correlation strength) and γ (deviation from SA), which in this study take values
δ ∈ {0.25, 0.75} and γ ∈ {0.1, 0.2}, respectively.
Simulation results
For each sample size and scenario we have repeated the analysis using 250 independently
replicated data sets. For each data set, the GP-SIM model suggested in Chapter 2 ([57]) is fitted.
Such Bayesian non-parametric models are much more flexible than parametric ones and can
effectively capture various patterns. The inference is based on 5000 MCMC samples for all
scenarios, as the chains were run for 10000 iterations with 5000 samples discarded as burn-in.
The number of inducing inputs was set to 30 for all GPs. For generic SA testing, GP-SIM
fitting is done on the whole data set and the posterior draws are used to estimate CVML,
CCVML and WAIC. Since the proposed methods require data splitting, we first randomly
divide the data equally into training and test sets. We fit GP-SIM on the training set and
then use the obtained posterior draws to construct point estimates of F1(y1i|xi), F2(y2i|xi)
and η(xi) for every observation in the test set. In Method 1 we used 500 permutations.
Table 4.6 shows the percentage of SA rejections for generic Bayesian selection criteria. The
presented results clearly illustrate that generic methods have difficulties identifying SA. This
leads to a loss of statistical efficiency since a complex model is selected over a much simpler
one. In the context of CVML or CCVML it can be explained by observing that both
Table 4.6: Simulation Results: Generic, proportion of rejections of SA for each scenario, sample size and generic criterion.

                          n = 500                   n = 1000
Scenario                  CVML   CCVML  WAIC        CVML   CCVML  WAIC
Sc1                       33.3%  31.1%  34.7%       38.2%  37.3%  37.8%
Sc2(δ=0.75, γ=0.1)        99.1%  98.7%  99.1%       100%   100%   100%
Sc2(δ=0.75, γ=0.2)        100%   100%   100%        100%   100%   100%
Sc2(δ=0.25, γ=0.1)        80.1%  84.4%  80.1%       99.1%  100%   99.1%
Sc2(δ=0.25, γ=0.2)        100%   100%   100%        100%   100%   100%
Sc3(δ=0.75, γ=0.1)        76.9%  73.3%  77.8%       85.7%  82.2%  85.8%
Sc3(δ=0.75, γ=0.2)        99.1%  97.3%  99.1%       99.1%  97.8%  99.1%
Sc3(δ=0.25, γ=0.1)        54.7%  56.4%  55.6%       65.3%  68.4%  64.9%
Sc3(δ=0.25, γ=0.2)        89.8%  92.0%  91.1%       99.6%  100%   99.6%
these measures do not penalize for the complexity of the model. Therefore a flexible calibration
function (as in GP-SIM) provides a fit similar to that of the reduced model and, as a consequence,
has similar predictive power. In addition, the SA may be of interest in itself in certain
applications, e.g. stock exchange modelling, where it is useful to determine whether the
dependence structure between different stock prices depends on other factors.

Table 4.7: Simulation Results: Proposed methods, proportion of rejections of SA for each scenario, sample size, number of bins (K) and method.

                        Permutation test                  χ2 test
                    n = 500        n = 1000         n = 500        n = 1000
Scenario            K=2    K=3    K=2    K=3        K=2    K=3    K=2    K=3
Sc1                 4.9%   6.2%   3.5%   5.3%       9.7%   11.1%  10.7%  13.7%
Sc2(δ=0.75, γ=0.1)  90.2%  80.4%  99.6%  99.1%      94.7%  94.2%  99.6%  99.1%
Sc2(δ=0.75, γ=0.2)  100%   100%   100%   100%       100%   100%   100%   100%
Sc2(δ=0.25, γ=0.1)  25.8%  18.7%  55.1%  47.1%      30.2%  21.8%  58.7%  53.8%
Sc2(δ=0.25, γ=0.2)  91.6%  84.9%  99.6%  99.6%      92.4%  91.1%  99.6%  99.6%
Sc3(δ=0.75, γ=0.1)  28.0%  24.0%  57.3%  52.9%      41.3%  45.8%  72.4%  72.9%
Sc3(δ=0.75, γ=0.2)  88.4%  85.8%  98.7%  98.7%      94.2%  92.0%  100%   99.1%
Sc3(δ=0.25, γ=0.1)  8.0%   7.5%   11.1%  10.7%      9.8%   10.7%  15.1%  12.9%
Sc3(δ=0.25, γ=0.2)  19.6%  18.2%  63.6%  60.9%      24.9%  23.6%  70.2%  69.3%

The simulations summarized in Table 4.7 show that the proposed methods (setting α = 0.05)
have a much smaller probability of Type I error, which varies around the threshold of 0.05.
It must be pointed out, however, that under SA the performance of the χ2 test worsens with the
number of bins K. This is not surprising: as K increases, the number of observations
in each bin goes down and the normal approximation for the distribution of the Pearson correlation
becomes tenuous, while the permutation-based test is more robust to small samples. The
performance of both methods improves with sample size. We also notice a loss of power
between Scenarios 2 and 3, which is due to model misspecification, since in the latter case
the generative model is different from the postulated one. All methods break down when
the departure from SA is not large, e.g. γ = 0.1. Although not desirable, this has limited
impact in practice since, in our experience, in this case the predictions produced by either
model are very similar.
4.5 Theoretical justification
In this section we prove that under canonical assumptions, the probability of Type I error
for Method 2 in Section 4.3 converges to α when SA is true.
Suppose we have independent samples from K populations (groups),
$(u^1_{1i}, u^1_{2i})_{i=1}^{n_1} \sim (U^1_1, U^1_2), \ldots, (u^K_{1i}, u^K_{2i})_{i=1}^{n_K} \sim (U^K_1, U^K_2)$;
the goal is to test ρ1 = . . . = ρK (here ρ is the Pearson correlation).
To simplify notation, we assume $n_1 = \cdots = n_K = n$. Let $\hat\rho = (\hat\rho_1, \ldots, \hat\rho_K)$ be the vector
of sample correlations, $\hat\Sigma = \mathrm{diag}\big((1-\hat\rho_1^2)^2, \ldots, (1-\hat\rho_K^2)^2\big)$, and let the $(K-1) \times K$ matrix A be as
defined in Section 4.3; then canonical asymptotic results imply that if ρ1 = . . . = ρK, then as
n → ∞,
\[
T = n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho) \xrightarrow{d} \chi^2_{K-1}. \quad (4.7)
\]
Based on the model fitted on D1, we define estimates of F1(y1i|xi) and F2(y2i|xi) by
$U = \{\bar U_i = (\bar F_1(y_{1i}|x_i), \bar F_2(y_{2i}|x_i))\}_{i=1}^{n_2}$. Note that U depends on D1 and X (the covariates in the test set).
Given a fixed number of bins K and assuming, without loss of generality, equal sample sizes
$n = n_2/K$ in each bin, we define a test statistic T(U) as in (4.7) with $\hat\rho_j$ estimated from
$\{\bar U_{(j-1)n+1}, \ldots, \bar U_{jn}\}$, for 1 ≤ j ≤ K.
Note that in Method 2, test cases are assigned to "bins" based on the value of the predicted
calibration function $\bar\eta(x_i)$, which is not taken into account in the generic definition of the test
statistic T(U) above. To close this gap, we introduce a permutation λ∗ : {1, . . . , n2} →
{1, . . . , n2} that "sorts" U from the smallest $\bar\eta(x)$ value to the largest, i.e. $U_{\lambda^*} = \{\bar U_{\lambda^*(i)}\}_{i=1}^{n_2}$ with
$\bar\eta(x_{\lambda^*(1)}) < \bar\eta(x_{\lambda^*(2)}) < \cdots < \bar\eta(x_{\lambda^*(n_2)})$. Hence, the test statistic in Method 2 has the form $T(U_{\lambda^*})$
as in (4.7), but in this case the test cases with the smallest predicted calibrations are assigned to the
first group, or bin, and those with the largest calibrations to the Kth group/bin. Finally, define a test
function φ with specified significance level α to test SA:
\[
\varphi(U \mid \lambda^*) =
\begin{cases}
1 & \text{if } T(U_{\lambda^*}) > \chi^2_{K-1}(1-\alpha), \\
0 & \text{if } T(U_{\lambda^*}) \le \chi^2_{K-1}(1-\alpha).
\end{cases} \quad (4.8)
\]
Intuitively, if SA is false then we would expect $T(U_{\lambda^*})$ to be larger than the critical value
$\chi^2_{K-1}(1-\alpha)$.
The goal is to show that this procedure has probability of Type I error equal to α, which
equals the expectation of the test function:
\[
P(\text{Type I error}) = \int \varphi(U \mid \lambda^*)\, P(\lambda^* \mid \mathcal{D}_1, X)\, P(U \mid \mathcal{D}_1, X)\, P(\mathcal{D}_1)\, P(X)\, dU\, d\mathcal{D}_1\, dX\, d\lambda^*. \quad (4.9)
\]
Note that λ∗ does not depend on U because of the splitting of the data into training and test sets. Also,
usually P(λ∗|D1, X) is just a point mass at some particular permutation. In general the
above integral cannot be evaluated; however, if we assume that, for all test cases,
\[
\bar F_1(y_{1i} \mid x_i) \xrightarrow{p} F_1(y_{1i} \mid x_i) \quad \text{as } n \to \infty \ \forall i,
\]
\[
\bar F_2(y_{2i} \mid x_i) \xrightarrow{p} F_2(y_{2i} \mid x_i) \quad \text{as } n \to \infty \ \forall i, \quad (4.10)
\]
then under SA and as n → ∞, $P(U \mid \mathcal{D}_1, X) = P(U) \approx \prod_{i=1}^{n_2} c(u_{1i}, u_{2i})$, where c is a copula
density, and the expectation becomes:
\[
\begin{aligned}
P(\text{Type I error}) &= \int \varphi(U \mid \lambda^*)\, P(\lambda^* \mid \mathcal{D}_1, X)\, P(U)\, P(\mathcal{D}_1)\, P(X)\, dU\, d\mathcal{D}_1\, dX\, d\lambda^* \\
&= \int \left( \int \varphi(U \mid \lambda^*)\, P(U)\, dU \right) P(\lambda^* \mid \mathcal{D}_1, X)\, P(\mathcal{D}_1)\, P(X)\, d\mathcal{D}_1\, dX\, d\lambda^* = \alpha, \quad (4.11)
\end{aligned}
\]
since, if SA is true, $\int \varphi(U \mid \lambda^*)\, P(U)\, dU = \alpha$ for any λ∗. Therefore, if the marginal CDF predictions
for the test cases are consistent, then this procedure has the required probability of Type I error
for a sufficiently large sample size.
4.6 Extensions to other models
The proposed idea of dividing the data into training and test subsets, splitting the observations
in the test set into bins, and then using a test to check for distributional differences between bins
can be extended to other models. For example, one can use a similar construction in mul-
tiple mean regression, logistic or quantile regression problems to check SA. The proposed
approaches for assessing SA can be particularly useful when f(X) (the conditional mean or
quantile) is assumed to have a complex form, and flexible models such as generalized addi-
tive models, additive tree structures, or non-parametric (including Bayesian non-parametric)
methods are utilized. Simulations we conducted suggest that generic testing procedures yield
large Type I error probabilities for non-parametrically fitted models, a problem that is attenu-
ated using the permutation-based ideas described in this chapter. Below we describe how to
adapt Methods 1 and 2 to regression problems and conduct a series of simulations to com-
pare the performances (probability of Type I error and power) of the proposed algorithms with
standard testing procedures used in the literature. Also note that, in contrast to the Bayesian view
adopted for the conditional copula problems, we use the frequentist paradigm for the regression
problems below.
4.6.1 Multiple Regression
Multiple regression with Gaussian errors is used extensively in applied statistics for its sim-
plicity and theoretical properties. Here we assume that the errors are identically distributed
(no heteroskedasticity). It is frequently of interest to test whether all the predictors (or
covariates) contribute to the prediction of the response. We therefore define the "full" model as
the model that contains all the relevant covariates, while the "reduced" or SA model contains only
an intercept. For linear multiple regression we could use the global F test [28], with known
distribution under SA, but as will be shown, when more general models for the conditional mean
are fit, the generic F test no longer exhibits the correct significance level.
For this set of simulations we assume that $y_i = f(X_i) + \varepsilon_i$ for $i = 1, \cdots, n$, with $\varepsilon_i$
independent and identically distributed. We generate samples of sizes n = 500 and n = 1000
from the 5 scenarios described below. For all scenarios the covariates are independently sampled
from U[0, 3] with covariate dimension q = 3.
Sc1 $f(x) = 1$; $\varepsilon_i \sim \mathcal{N}(0, 9)$.

Sc2 $f(x) = \gamma \times (x_1^2 + 3x_2 - 2x_3^3 + 6)/3 + 1$; $\varepsilon_i \sim \mathcal{N}(0, 9)$.

Sc3 $f(x) = \gamma \times (x_1x_2x_3 + x_1x_2^2 - 7.7)/3 + 1$; $\varepsilon_i \sim \mathcal{N}(0, 9)$.

Sc4 $f(x) = 1$; $\varepsilon_i \sim \mathrm{Cauchy}(0, 1)$.

Sc5 $f(x) = \gamma \times (x_1x_2x_3 + x_1x_2^2 - 7.7) + 1$; $\varepsilon_i \sim \mathrm{Cauchy}(0, 1)$.
Sc1 and Sc4 correspond to SA, as the conditional means do not depend on the covariates. Sc2
and Sc3 represent nonlinear models with Gaussian errors; the former has an additive structure,
while the latter has interactions between covariates. Note that both depend on the parameter γ,
which controls the deviation from SA. Sc4 and Sc5 include errors from the Cauchy distribution, which has
much heavier tails than the Gaussian. These scenarios are useful for evaluating the performance of
the testing algorithms when the assumption of normality is violated.
To apply the proposed testing procedures to this regression problem we do the following:
divide the whole data set into training and test sets (half to each set, so that $n_1 = n_2 = \lfloor n/2 \rfloor$)
and fit a flexible model on the training set; here we use a Generalized Additive Model (GAM)
with cubic splines for each component (the penalty is also estimated in the same fit). With the
estimated parameters, form predictions $\bar f(x)$ of the function f on the test set (analogous to the
estimated calibration function in the conditional copula problem) and, based on $\bar f(x)$, split the
test set into K bins so that the number of points in each bin is $n = n_2/K$. Here we can implement
either the Chi-square or the permutation approach. For the Chi-square test (Method 2), let
\[
\vec{y} = (\bar y_1, \cdots, \bar y_K)^T,
\]
where $\bar y_k$ is the average of the responses in bin $k \in \{1, \cdots, K\}$. Then, for regression with Gaussian
noise, it follows from standard normal theory that under SA:
\[
n(A\vec{y})^T (A\Sigma A^T)^{-1} (A\vec{y}) \sim \chi^2_{K-1}, \quad (4.12)
\]
where A is as in (4.6) and $\Sigma = \mathrm{diag}(\sigma^2, \cdots, \sigma^2)$, with $\sigma^2$ estimated from the GAM fit on the
training set. Note that this result is approximately true for non-Gaussian noise (with finite
variance) and sufficiently large n by the Central Limit Theorem, and it can be used to assess
evidence against SA. The permutation test (Method 1) is constructed by first finding the observed
test statistic $T^{\mathrm{obs}} = \bar y_{(K)} - \bar y_{(1)}$ (largest minus smallest bin average), and then, by permuting the
responses, computing the proportion of permutations whose permuted test statistics $T_j$, $j = 1, \cdots, J$,
are greater than $T^{\mathrm{obs}}$. This proportion is an estimate of the p-value, and if it is less than a
pre-specified α, SA is rejected. For all scenarios we set the significance level α = 0.05.
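The permutation version just described can be sketched as follows. This is an illustration only: the names are hypothetical and the predicted conditional mean is stood in for by the covariate itself rather than an actual GAM fit:

```python
import numpy as np

def regression_perm_test(y, f_hat, K=3, J=500, rng=None):
    """Method 1 for regression: bin the test responses by the predicted
    conditional mean f_hat, use T = largest minus smallest bin average, and
    compare T with its permutation distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = np.asarray(y, dtype=float)[np.argsort(f_hat)]
    bins = np.array_split(np.arange(len(y)), K)

    def stat(yy):  # range of the bin averages
        means = [yy[b].mean() for b in bins]
        return max(means) - min(means)

    T_obs = stat(y)
    T_perm = [stat(y[rng.permutation(len(y))]) for _ in range(J)]
    return float(np.mean(np.array(T_perm) >= T_obs))  # estimated p-value

rng = np.random.default_rng(4)
x = rng.uniform(0, 3, 600)
y_signal = 2.0 * x + rng.normal(0, 1, 600)            # SA violated: mean depends on x
p = regression_perm_test(y_signal, f_hat=x, rng=rng)  # x stands in for a fitted GAM prediction
```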
We compare these methods with the following generic approaches to SA testing. Since we
focus on a non-linear conditional mean, the standard procedure is to fit a flexible non-parametric
model (on the whole data set), find SSEfull (the residual sum of squares) and the degrees of freedom
DFfull, and compare them with SSEred and DFred = 1 of the model with only an intercept. This
is an instance of the partial F statistic, which has an exact F distribution under SA if the fitted
model is linear in its parameters:
\[
F^* = \frac{(\mathrm{SSE}_{\mathrm{red}} - \mathrm{SSE}_{\mathrm{full}})/(\mathrm{DF}_{\mathrm{full}} - \mathrm{DF}_{\mathrm{red}})}{\mathrm{SSE}_{\mathrm{full}}/(n - \mathrm{DF}_{\mathrm{full}})} \sim F(\mathrm{DF}_{\mathrm{full}} - \mathrm{DF}_{\mathrm{red}},\, n - \mathrm{DF}_{\mathrm{full}}), \quad (4.13)
\]
where F(a, b) represents the F-distribution with a and b degrees of freedom for the numerator and
denominator, respectively. We denote this approach by (F-test). Note that the above exact
distribution may not be valid when a non-parametric model such as a Generalized Additive model,
Random Forest, Support Vector machine or Gaussian process is fitted. That is why we also consider
a less realistic test (F-exact), where the observed test statistic F∗ is compared to the exact
critical value (corresponding to level α), which is estimated by repeatedly (M times)
generating the response vector Y under SA (keeping the covariates fixed), each time fitting the "full"
and "reduced" models, and calculating the test statistic F∗. Once we have the
approximate distribution of the test statistic F∗ under SA, we can estimate the critical value by
taking the empirical (1 − α) quantile. For the simulations below we set M = 200; note that this
approach may not be feasible in practice when the dimensionality of the data is large and/or
fitting the full model is computationally costly. We also show the performances of the AIC and BIC
criteria, where, after fitting the "full" and "reduced" models, the model with the smallest criterion
value is selected. Since we assume that the errors are identically distributed, it follows that in
these examples SA implies independence of the covariates and the response; we therefore introduce a
Bootstrap test for independence, which again is usually not feasible in real-world problems because of its
computational cost. For this test, given pairs $\{y_i, x_i\}_{i=1}^n$, we fit the "full" model and
calculate some measure $T^{\mathrm{obs}}$; then we consider J permutations $\lambda_j : \{1, \cdots, n\} \rightarrow \{1, \cdots, n\}$,
$j = 1, \cdots, J$. For each j we fit the "full" model to $\{y_{\lambda_j(i)}, x_i\}_{i=1}^n$ and calculate $T_j$; finally, we reject
SA (or independence) if $T^{\mathrm{obs}}$ is greater than the (1 − α) quantile of $\{T_j\}_{j=1}^J$. For the discrepancy
measure T we consider −SSEfull (BOOT-SSE) and F∗ (BOOT-F) as in Eq. (4.13); note that
these measures are "large" when SA is false. Similar to the (F-exact) test, these tests require
many estimations of the complicated (or full) model; for all simulations we set J = 100.
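For reference, the partial F statistic in (4.13) and its nominal p-value reduce to a few lines given the two fits; the function name and the SSE and degree-of-freedom values below are made up for illustration:

```python
from scipy.stats import f as f_dist

def partial_f(sse_red, sse_full, df_red, df_full, n):
    """F* of Eq. (4.13) and its p-value under the exact F distribution."""
    num = (sse_red - sse_full) / (df_full - df_red)
    den = sse_full / (n - df_full)
    F_star = num / den
    pval = f_dist.sf(F_star, df_full - df_red, n - df_full)
    return F_star, pval

# Made-up SSEs: a large drop from the reduced to the full model yields a large
# F* and a small p-value, i.e. evidence against SA.
F_star, pval = partial_f(sse_red=900.0, sse_full=450.0, df_red=1, df_full=10, n=500)
```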
We set the "full" model to be a GAM with cubic splines for each covariate and estimation of
the degree of smoothness. For better comparison, in addition to fitting the flexible GAM we
also fit additive multiple regressions with polynomial degrees of 1, 3 and 10 for each predictor.
The SA is tested with the standard global F-test in (4.13).
For every scenario we generate 500 sets of responses (keeping the covariates fixed) and test the
simplifying assumption using all of the described procedures with significance level α = 0.05.
Results:
Table 4.8: Simulation Results for Regression: Generic, proportion of rejections of SA for each scenario, sample size and generic criterion.

                                               Bootstrap         Poly, F-test
Scenario      F-test  F-exact  AIC     BIC     -SSE   F-test    Deg=1   Deg=3   Deg=10

n = 500
Sc1           29.8%   3.6%     29.8%   0.0%    5.0%   5.6%      5.8%    4.2%    3.0%
Sc2(γ=0.05)   56.4%   20.6%    59.2%   0.2%    9.4%   24.6%     30.2%   21.0%   10.8%
Sc2(γ=0.11)   98.6%   91.8%    99.4%   22.0%   55.6%  92.0%     94.8%   89.8%   66.2%
Sc3(γ=0.1)    56.4%   28.8%    63.2%   0.6%    8.2%   25.6%     34.8%   19.6%   11.2%
Sc3(γ=0.2)    95.6%   83.8%    95.6%   15.4%   35.0%  78.6%     89.0%   75.6%   47.2%
Sc4           27.2%   6.8%     28.4%   0.0%    5.4%   4.2%      2.6%    7.2%    8.0%
Sc5(γ=0.1)    27.8%   16.0%    31.8%   0.2%    7.4%   9.0%      8.6%    10.2%   9.8%
Sc5(γ=0.9)    81.2%   76.8%    84.2%   52.4%   60.8%  73.2%     75.2%   71.0%   65.0%

n = 1000
Sc1           27.2%   3.8%     26.6%   0.0%    2.8%   4.8%      3.8%    4.0%    5.0%
Sc2(γ=0.05)   76.4%   46.6%    81.4%   1.0%    17.8%  45.8%     55.0%   43.0%   23.6%
Sc2(γ=0.11)   100.0%  99.6%    100.0%  61.6%   96.2%  99.4%     100.0%  100.0%  97.4%
Sc3(γ=0.1)    75.0%   53.2%    79.4%   2.0%    18.0%  43.8%     56.8%   38.0%   20.4%
Sc3(γ=0.2)    100.0%  99.2%    100.0%  39.6%   78.0%  97.8%     99.8%   98.0%   84.2%
Sc4           22.4%   3.0%     23.6%   0.0%    4.4%   4.6%      1.6%    5.4%    10.2%
Sc5(γ=0.1)    26.8%   7.0%     33.4%   0.2%    5.0%   8.0%      9.2%    8.4%    8.2%
Sc5(γ=0.9)    84.0%   76.8%    85.6%   53.4%   62.4%  77.6%     79.0%   75.6%   67.4%
Tables 4.8 and 4.9 show the proportion of SA rejections (out of 500 replicates, at signif-
icance level α = 0.05) for the generic and proposed tests, respectively. In general we should select
a procedure that attains 5% rejections for Sc1 and Sc4 and has the largest proportion of
rejections for the other scenarios (i.e., the highest power). First, we immediately notice that the generic
F-test and AIC have a much larger probability of Type I error than the expected 5%; it is
therefore misleading to compare their power with that of the other tests. BIC has no rejections under SA,
but its power is very small, indicating that the penalty BIC uses is too large in
this example. Global F-tests for polynomial models have an acceptable probability of Type I
Table 4.9: Simulation Results for Regression: Proposed methods, proportion of rejections of SA for each scenario, sample size and number of bins.

                 Permutation test                χ2 test
             n = 500      n = 1000          n = 500      n = 1000
Scenario     K=2    K=3   K=2    K=3        K=2    K=3   K=2    K=3
Sc1          4.6%   4.8%  5.0%   5.0%       4.4%   4.8%  6.2%   5.4%
Sc2(γ=0.05)  11.2%  10.8% 16.8%  16.4%      12.8%  13.2% 17.4%  17.2%
Sc2(γ=0.11)  44.4%  46.0% 85.4%  87.8%      48.0%  51.6% 86.6%  90.4%
Sc3(γ=0.1)   9.6%   8.4%  16.2%  14.8%      9.2%   9.2%  17.2%  15.6%
Sc3(γ=0.2)   34.0%  32.0% 65.2%  66.0%      36.4%  34.6% 67.8%  68.8%
Sc4          4.8%   4.4%  2.4%   5.6%       26.8%  30.4% 28.6%  31.8%
Sc5(γ=0.1)   8.8%   6.4%  8.4%   10.2%      28.0%  32.0% 29.2%  34.0%
Sc5(γ=0.9)   76.0%  78.4% 78.0%  82.6%      69.2%  72.2% 73.4%  76.6%
error (as predicted by the theory); however, their power depends heavily on the polynomial
order, and in real problems it is usually impossible to choose the correct order for each compo-
nent. F-exact and the BOOT tests for independence have the required probability of Type I
error, with the largest power produced by F-exact and BOOT-F. Note that for F-exact we
simulate (under SA) from the exact distribution (normal for Sc1–Sc3 and Cauchy(0, 1) for
Sc4 and Sc5), which is usually unknown in practice.
Focusing now on Table 4.9, we first note that the performances of Method 1 and Method 2 are
very similar, except for Sc4 and Sc5, where the χ2 test should not be used since the averages in
each bin do not follow normal distributions in these scenarios. The number of bins K does not
exhibit a strong relationship with performance either. When SA is true (Sc1 and Sc4),
the probability of Type I error for the proposed methods is around 5%, as expected. Again, the
χ2 test should not be applied for Sc4, as the Cauchy distribution has undefined mean and
variance. It is also evident that the power of these tests increases with the sample size. For
Sc2 and Sc3, Methods 1 and 2 clearly underperform F-exact and BOOT-F in power, but
show power similar to BOOT-SSE. When Cauchy errors are used in Sc5, Method 1 has
power similar to F-exact and BOOT-F.
It should not be surprising that Methods 1 and 2 generally have smaller power than the compu-
tationally expensive bootstrap tests, since only half of the sample (the test set) is used to decide
the rejection of SA.
4.6.2 Logistic Regression
Logistic regression is probably the most frequently used statistical tool for analyzing the
dependence of a binary response variable on a set of covariates. Suppose we
observe independent pairs {yi, xi}ni=1 with the following generative process:

    πi = 1 / (1 + exp(−f(Xi))),   i = 1, · · · , n,
    Yi ∼ Bern(πi),                i = 1, · · · , n,        (4.14)
where Bern(p) denotes a Bernoulli random variable with probability of success p. The main
objective is to estimate the unknown function f(X) and to check whether f(X) is independent
of X and therefore constant (SA). As in multiple regression, we can apply the proposed
approaches for assessing data support for SA to logistic regression. As before, we define a
flexible, typically non-parametric, model as the "full" model and the model with only an
intercept as the "reduced" model.
Consider the following four simulation scenarios:
Sc1  f(x) = 0;  x ∈ R^2,
Sc2  f(x) = γ × (x1 + x2^2 − 0.34),
Sc3  f(x) = 0;  x ∈ R^10,
Sc4  f(x) = γ × (sin(2x1) + cos(x2) + 2x3x4 − 2 cos(x5 + 2x6 − x7) + x8^2 + x9^3 − 2x10^2 + 0.12).
We generate samples of sizes n = 500 and n = 1000 from these four scenarios; the covariates
are independently sampled from U[−1, 1], with covariate dimension q = 2 for the first two
scenarios and q = 10 for Sc3 and Sc4. Sc1 and Sc3 correspond to SA, as the probability of
Y = 1 is set to 0.5 and does not depend on the covariates. In Sc2 the function f(X) has an
additive structure, while Sc4 represents a nonlinear, non-additive dependence of the probability
on the predictors. Note that Sc2 and Sc4 have an additional parameter γ which controls the
deviation from SA.
To apply the proposed testing procedures to this logistic problem we proceed as follows: divide
the data set into training and test sets (half to each, so that n1 = n2 = ⌊n/2⌋); fit a
flexible model on the training set, here a Generalized Additive Model (GAM) with
cubic splines for each component; with the estimated parameters, compute the prediction f̂(x)
of the function f on the test set (analogous to the estimated calibration function in conditional
copula problems); and, based on f̂(x), split the test set into K bins so that the number of
points in each bin is n2/K.
Here we can implement either the Chi-square (Method 2) or the permutation (Method 1)
approach.
For the Chi-square test, let

    ~p = (p1, · · · , pK)T,

where pk is the sample proportion (average of the responses) in bin k ∈ {1, · · · , K}. If n is
large enough, then by the Central Limit Theorem the following approximate distributional
result holds under SA:

    n (A~p)T (AΣAT)−1 (A~p) ·∼ χ²_{K−1},        (4.15)

where A is as in Eq (4.6) and Σ = diag(p0(1 − p0), · · · , p0(1 − p0)), with p0 estimated as
the sample proportion of all the responses in the test set. This result can be used to assess
evidence against SA.
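As an illustration, the statistic (4.15) can be sketched in a few lines of Python. This is only a sketch under assumptions the text leaves open here: the contrast matrix A of Eq (4.6) is taken to be a successive-difference matrix, the scaling n is taken as the common bin size, and all function names are ours.

```python
import numpy as np
from scipy.stats import chi2

def binned_chi2_test(f_hat, y_test, K):
    """Method 2 sketch: order test points by the predicted f_hat, split them
    into K (nearly) equal bins, and test equality of bin proportions under SA."""
    n2 = len(y_test)
    bins = np.array_split(np.argsort(f_hat), K)         # ~n2/K points per bin
    p = np.array([y_test[idx].mean() for idx in bins])  # bin proportions
    p0 = y_test.mean()                                  # pooled proportion under SA
    nb = n2 // K                                        # common bin size
    A = np.eye(K - 1, K) - np.eye(K - 1, K, k=1)        # successive-difference contrasts
    Sigma = p0 * (1 - p0) * np.eye(K)
    Ap = A @ p
    stat = nb * Ap @ np.linalg.solve(A @ Sigma @ A.T, Ap)
    return stat, chi2.sf(stat, df=K - 1)                # statistic and p-value
```

Under a strong dependence of the success probability on `f_hat`, the returned p-value is small; under SA it is approximately uniform.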
The permutation test is performed by first computing the observed test statistic T_obs = p(K) − p(1)
(largest minus smallest sample proportion) and then, by permuting the responses, computing the
proportion of permutations with test statistics Tj, j = 1, · · · , J, greater than T_obs (here J is the
number of permutations). This proportion is an estimate of the p-value, and if it is less than α,
SA is rejected. For all scenarios we set the significance level to α = 0.05.
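The permutation version (Method 1) admits an equally short sketch; variable and function names are of our own choosing.

```python
import numpy as np

def permutation_sa_test(f_hat, y_test, K, J=499, seed=0):
    """Method 1 sketch: T = largest minus smallest bin proportion; the null
    distribution under SA is obtained by permuting the test responses."""
    rng = np.random.default_rng(seed)
    bins = np.array_split(np.argsort(f_hat), K)   # bins defined by f_hat

    def spread(y):
        props = [y[idx].mean() for idx in bins]
        return max(props) - min(props)            # T = p_(K) - p_(1)

    t_obs = spread(y_test)
    t_perm = [spread(rng.permutation(y_test)) for _ in range(J)]
    p_value = np.mean([t >= t_obs for t in t_perm])  # estimated p-value
    return t_obs, p_value
```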
We compare these algorithms with the following generic approaches to SA testing. Since we
focus on a non-linear function f(X), the standard procedure is to fit a flexible "full" model
(on the whole data set), find DEVfull (the deviance of this model) and its degrees of freedom
DFfull, and compare them with DEVred and DFred of the model with only an intercept; see,
for example, [27]. Here the deviance is defined as −2 times the log-likelihood, and the deviance
of the "full" model must be smaller than that of the "reduced" one. This is an example of the
likelihood ratio test, which has an approximate χ² distribution under SA:

    T* = DEVred − DEVfull ·∼ χ²_{DFfull−DFred}.        (4.16)
We denote this approach by T-test. Note that the above result is generally used for
nested models in which the full model has a parametric form; as will be illustrated later, it may
not be valid when a non-parametric model such as a Generalized Additive model, Random Forest
or Gaussian process is fitted for f(X). That is why we also consider a less realistic test,
T-exact, in which the observed test statistic T* is compared with the exact critical value
(corresponding to level α). The critical value is estimated by repeatedly (M times) generating
responses Y under SA (keeping the covariates fixed), each time fitting the "full" and "reduced"
models and calculating the test statistic T*. Once we obtain an approximate distribution of
T* under SA, we estimate the critical value as its (1 − α) quantile.
For the simulations below we set M = 200; note that this approach may not be feasible in
practice when the dimensionality of the data is large and the "full" model is computationally
expensive to fit. We also show the performance of the AIC and BIC criteria: after fitting the
"full" and "reduced" models, the model with the smallest criterion value is selected. Since
we assume that the Yi are independent, SA implies independence in this example; we therefore
also introduce a generic bootstrap test for independence, which again may not be feasible in
real-world problems. For this test, given pairs {yi, xi}ni=1, we fit the "full" model and
calculate a measure T_obs; we then consider J permutations λj : {1, · · · , n} → {1, · · · , n},
j = 1, · · · , J. For each j we fit the same model to {yλj(i), xi}ni=1 and calculate Tj; finally,
we reject SA (or independence) if T_obs is greater than the (1 − α) quantile of {Tj}Jj=1. For
the discrepancy measure T we consider LogLikfull (BOOT-LL) and T* (BOOT-T) as in Eq (4.16);
note that these measures are "large" when SA is false. As with the T-exact test, these
independence tests require many fits of the "full" model; for all simulations we set J = 100.
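The generic permutation bootstrap can be written as a reusable wrapper. Purely for illustration, the discrepancy below is the −SSE of an ordinary least squares fit (the BOOT-SSE measure from the regression section) rather than the logistic log-likelihood, and all names are ours.

```python
import numpy as np

def bootstrap_independence_test(x, y, discrepancy, J=100, seed=0):
    """Refit the 'full' model on permuted responses J times and compare the
    observed discrepancy with the permutation distribution."""
    rng = np.random.default_rng(seed)
    t_obs = discrepancy(x, y)
    t_perm = np.array([discrepancy(x, rng.permutation(y)) for _ in range(J)])
    return t_obs, np.mean(t_perm >= t_obs)        # statistic, p-value estimate

def neg_sse(x, y):
    """-SSE of an OLS fit; 'large' (close to zero) when y depends on x."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return -np.sum((y - X @ beta) ** 2)
```

Any model fit that returns a scalar discrepancy (e.g., a GAM log-likelihood) can be plugged in as `discrepancy`, at the cost of J refits.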
Here we set the "full" model to be a GAM with cubic splines and estimation of the smoothing
parameter for each predictor. For better comparison, in addition to fitting the flexible GAM
we also fit simple additive models with polynomial degrees 1, 3 and 10 for every covariate;
for these, SA is tested with the standard global T-test as in (4.16).
For every scenario we generate 500 sets of responses (keeping the covariates fixed) and test the
simplifying assumption using all of the described procedures at significance level α = 0.05.
Results:
Table 4.10 presents the proportion of rejections for the generic testing procedures described
above. First we examine the probability of Type I error by looking at Sc1 and Sc3. The
standard likelihood ratio test (T-test) has a very large rejection rate for these scenarios; it
even exceeds 74% for Sc3. Note that the error increases with the dimensionality of the
covariates. AIC also has a much larger probability of Type I error than the expected 5%. BIC,
on the other hand, produces a very small rejection rate under SA, but its power for the other
scenarios is the lowest of all the criteria. The likelihood ratio test works quite well for the
polynomial models (except for degree 10), as predicted by the theory. However, as in the
regression example, the power of the polynomial models depends significantly on the degree,
and choosing the correct degree for each
Table 4.10: Simulation Results for Logistic Regression: Generic, proportion of rejections of
SA for each scenario, sample size and generic criteria.

                                                  Bootstrap         Poly, T-test
Scenario         T-test   T-exact  AIC     BIC    LogLik  T-stat  deg=1  deg=3  deg=10
n = 500
Sc1              24.4%    5.6%     28.2%   0.4%   5.8%    5.8%    4.4%   4.8%   7.0%
Sc2(γ = 0.1)     23.8%    5.0%     33.2%   0.0%   5.8%    5.8%    6.8%   7.2%   4.8%
Sc2(γ = 0.55)    96.4%    44.0%    98.4%   23.2%  43.2%   43.2%   85.0%  82.2%  58.0%
Sc3              74.4%    7.0%     26.8%   0.0%   4.0%    4.0%    5.6%   6.4%   24.2%
Sc4(γ = 0.1)     79.8%    9.2%     30.8%   0.0%   7.0%    7.0%    6.2%   7.4%   25.4%
Sc4(γ = 0.33)    98.6%    32.4%    88.6%   0.0%   26.0%   26.0%   40.8%  68.2%  67.6%
n = 1000
Sc1              25.0%    5.6%     29.8%   0.0%   3.8%    3.8%    5.6%   6.2%   4.4%
Sc2(γ = 0.1)     32.6%    8.0%     40.8%   0.6%   7.6%    7.6%    12.8%  10.4%  9.8%
Sc2(γ = 0.55)    99.8%    95.0%    100.0%  67.8%  93.6%   93.6%   99.6%  99.6%  96.0%
Sc3              73.2%    5.8%     22.2%   0.0%   6.0%    6.0%    4.8%   5.4%   12.6%
Sc4(γ = 0.1)     84.0%    11.2%    44.4%   0.0%   11.0%   11.0%   13.6%  16.0%  20.4%
Sc4(γ = 0.33)    100.0%   68.0%    99.2%   0.0%   70.0%   70.0%   79.0%  94.6%  80.6%
Table 4.11: Simulation Results for Logistic Regression: Proposed methods, proportion of
rejections of SA for each scenario, sample size and number of bins.

                   Permutation test               χ² test
                   n = 500       n = 1000       n = 500       n = 1000
Scenario           K=2    K=3    K=2    K=3    K=2    K=3    K=2    K=3
Sc1                7.4%   5.0%   7.8%   6.8%   4.4%   4.2%   6.2%   6.4%
Sc2(γ = 0.1)       9.0%   5.2%   9.2%   6.4%   6.4%   4.0%   8.8%   6.6%
Sc2(γ = 0.55)     44.8%  35.8%  86.6%  84.8%  38.8%  35.6%  85.0%  85.6%
Sc3                7.4%   7.0%   7.2%   5.2%   6.2%   7.2%   6.6%   4.8%
Sc4(γ = 0.1)       6.4%   5.4%   8.2%   8.2%   4.4%   5.8%   7.4%   7.8%
Sc4(γ = 0.33)     19.2%  16.0%  51.6%  46.2%  16.2%  17.0%  48.6%  46.8%
predictor can be a very challenging task in real problems. T-exact and the bootstrap procedures
perform very similarly, with correct probability of Type I error and the largest power.
Next we focus on Table 4.11, which shows the proportion of rejections of the proposed SA tests
for all scenarios and sample sizes. First we notice that under SA (Sc1 and Sc3) the
probability of rejection for Methods 1 and 2 is around 5%, as required, and the performances
of the two tests are quite similar. The power of both methods for Sc2 is comparable with
that produced by T-exact, BOOT-LL and BOOT-T. For Sc4, with its large covariate
dimension, Methods 1 and 2 show smaller power than the best generic procedures.
Based on all these observations we can conclude that T-exact and the bootstrap procedures
generally work quite well, but they are computationally expensive and, for complicated
fitted models, may not even be feasible. For such problems the proposed SA assessment can be
appropriate, as the flexible model must be fitted only once, on the training set, which contains
only half of the original sample.
4.6.3 Quantile Regression
Suppose, similarly to multiple regression, we observe independent pairs {yi, xi}ni=1 with yi ∈ R
and xi ∈ Rq. Given a quantile τ ∈ (0, 1), the objective of quantile regression is to estimate
Qτ(Y|X) = fτ(X), the τ-quantile of the response Y as a function of the covariates [50]. For
example, if τ = 0.5 then quantile regression aims to approximate the conditional median of
the response.
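The estimation target can be made concrete with the check (pinball) loss ρτ(u) = u(τ − 1{u < 0}), whose minimizer over constants is a sample τ-quantile. A minimal numerical sketch (function names are ours):

```python
import numpy as np

def check_loss(u, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def quantile_by_minimization(y, tau):
    """The constant q minimizing sum_i rho_tau(y_i - q) is a sample
    tau-quantile; the objective is piecewise linear in q, so a minimizer
    can be found among the observed values themselves."""
    losses = [np.sum(check_loss(y - q, tau)) for q in y]
    return y[int(np.argmin(losses))]
```

Replacing the constant q with a linear function of covariates gives exactly the regression problem in (4.17).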
Generally it is assumed that the conditional quantile has a linear (in the parameters) form:
Qτ (Y |x) = fτ (x) = β0 + β1x1 + · · ·+ βqxq. (4.17)
The unknown parameters ~β are estimated by minimizing an appropriate loss function, a problem
that can be reformulated as a linear program. Another important question is whether the
conditional τ-quantile depends on the covariates X or not, which is exactly SA in this context. In
this section we show that the proposed testing techniques can be used in quantile regression to
assess data support for SA. In the previous sections we fitted non-parametric models such
as GAMs to the data; for quantile regression there are no such built-in functions, so for
the "full" model we define an additive model with 3 B-splines for each covariate (no complexity
penalty). This model can be considered linear, since fτ(X) is a linear function of the unknown
parameters. As before, the "reduced" model is a quantile regression with only an intercept β0,
and the main objective is to check whether the "reduced" model is adequate for the observed
data. For the simulations we assume that Yi = f(Xi) + εi, i = 1, · · · , n, with the εi independent
and identically distributed. To compare the testing procedures we consider the following two
scenarios:
Sc1  f(x) = 0;  ε ∼ 5χ²_1,
Sc2  f(x) = γ × (x1^2 + 3x2 − 2x3^3 + 6)/3 + 1;  ε ∼ 5χ²_1.
Here χ²_1 denotes a Chi-square distribution with 1 degree of freedom. We generate samples of
sizes n = 500 and n = 1000 from these two scenarios; the covariates are independently sampled
from U[0, 3], with covariate dimension q = 3 for both scenarios. Sc1 corresponds to SA for all τ,
as the distribution of Y does not depend on the covariates. In Sc2 the function f(X) has an
additive structure; note that, as a consequence, the conditional quantile also has an additive
form for any τ. As in the previous sections, Sc2 depends on a parameter γ which controls the
deviation from SA.
To apply the proposed SA testing procedure to quantile regression we proceed as follows
(fixing τ): divide the data set into training and test sets (half to each, so that
n1 = n2 = ⌊n/2⌋); fit a flexible τ-quantile regression model on the training set, here an
additive model with 3 B-splines for each component (equivalently, a cubic polynomial);
with the estimated parameters, compute the prediction f̂τ(x) of the function fτ on the test set
(analogous to the estimated calibration function in conditional copula problems); and, based on
f̂τ(x), split the test set into K bins so that the number of points in each bin is n2/K. Here we
can implement either the Chi-square (Method 2) or the permutation (Method 1) approach.
For the Chi-square test we proceed as follows: first find q0τ, the "global" τ-quantile of the
responses in the whole test set. Then construct a 2 × K table with entries Tik, where T1k is
the number of responses in bin k that are larger than (or equal to) q0τ, and T2k is the number
of responses in bin k that are smaller than q0τ. Note that if SA is true then the rows and
columns of the table must be independent, so Pearson's χ² test for independence can be
applied, which in this case has an asymptotic χ²_{K−1} distribution. If the observed test
statistic

    Σ_{i,k} (Tik − Eik)² / Eik

is larger than the appropriate critical value of χ²_{K−1}, we reject SA (here Eik is the expected
count under independence); we use this test for Method 2.
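The 2 × K contingency-table version can be sketched as follows (illustrative names; Pearson's statistic with K − 1 degrees of freedom, as in the text):

```python
import numpy as np
from scipy.stats import chi2

def quantile_chi2_test(f_hat, y_test, tau, K):
    """Method 2 sketch for quantile regression: bin test points by f_hat,
    cross-classify responses as >= or < the global tau-quantile q0, and
    apply Pearson's chi-square test for independence of the 2 x K table."""
    q0 = np.quantile(y_test, tau)                  # "global" tau-quantile
    bins = np.array_split(np.argsort(f_hat), K)
    T = np.array([[np.sum(y_test[idx] >= q0) for idx in bins],
                  [np.sum(y_test[idx] < q0) for idx in bins]])
    E = T.sum(1, keepdims=True) * T.sum(0, keepdims=True) / T.sum()
    stat = np.sum((T - E) ** 2 / E)                # Pearson statistic
    return stat, chi2.sf(stat, df=K - 1)
```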
The permutation test is performed by first finding the observed test statistic T_obs = q(K) − q(1)
(largest minus smallest quantile), where qk is the τ-quantile of the responses in bin k,
k = 1, · · · , K. Then, by permuting the responses, we compute the proportion of permutations
with test statistic Tj, j = 1, · · · , J, greater than T_obs. This proportion is the estimate of the
p-value, and if it is less than α, SA is rejected. For all scenarios we set the significance level
to α = 0.05.
We compare these algorithms with the following generic approaches to SA testing. Let the
degrees of freedom of the "full" and "reduced" models be denoted by DFfull and DFred,
respectively. Similarly, if available, let the deviances of the two models be DEVfull and DEVred.
A generic SA test is based on an ANOVA decomposition (the "anova.rq" function in R),
which calculates a test statistic F* that should follow an F-distribution with
(DFfull − DFred, n − DFfull) degrees of freedom if SA is true. We denote this approach by
F-test. As will be shown, this test has a much larger probability of Type I error than expected.
That is why we also consider F-exact, where the observed test statistic F* is compared with the
exact critical value (corresponding to level α), which is estimated by repeatedly (M times)
generating responses Y under SA (keeping the covariates fixed), each time fitting the "full" and
"reduced" models and calculating the test statistic F* (from the ANOVA procedure). Once we
obtain an approximate distribution of the test statistic under SA, we estimate the critical value
as its (1 − α) quantile. For the simulations below we set M = 200. We also show the
performance of the AIC criterion: after fitting the "full" model and the "reduced" model with
only an intercept, the model with the smallest criterion value is selected. A likelihood ratio
test is also available in R ("lrtest"), which calculates the difference between the deviances as
in (4.16) and compares it with critical values of χ²_{DFfull−DFred}. Since in this example SA
for any τ implies independence of Y and X, we again introduce a generic bootstrap test for
independence. For this test, given pairs {yi, xi}ni=1, we fit the "full" model and calculate a
measure T_obs; we then consider J permutations λj : {1, · · · , n} → {1, · · · , n}, j = 1, · · · , J.
For each j we fit the "full" model to {yλj(i), xi}ni=1 and calculate Tj; finally, we reject SA (or
independence) if T_obs is greater than the (1 − α) quantile of {Tj}Jj=1. For the discrepancy
measure T we consider the negative of the sum of squared residuals, −SSEfull, and the sum of
the squared estimated coefficients (without the intercept), T = Σ_{i=1}^q βi². Note that these
measures are "large" when SA is false. As with F-exact, these tests require many fits of
complicated models; for all simulations we set J = 100.
For every scenario we generate 500 sets of responses (keeping the covariates fixed) and test the
simplifying assumption using all of the described procedures at significance level α = 0.05.
We also check SA for different quantile values τ ∈ {0.1, 0.5, 0.9}. As mentioned previously,
for the "full" model we fit an additive model with cubic polynomials for each component
(no smoothness penalty).
Results:
Table 4.12 shows the proportion of rejections (out of 500 replicated data sets) for the generic SA
tests for the different scenarios, sample sizes and quantile values τ ∈ {0.1, 0.5, 0.9}. Note that in
this example we use a simple polynomial of order 3 for every covariate (not a non-parametric
fit), so the generic F-test would be expected to attain 5% rejections under Sc1.
However, this does not happen: for all τ the probability of Type I error is larger than 15%,
Table 4.12: Simulation Results for Quantile Regression: Generic, proportion of rejections
of SA for each scenario, sample size, τ and generic criteria.

                                 n = 500                                            n = 1000
                                              Bootstrap                                          Bootstrap
Scenario        F-test  F-exact Lik Ratio AIC    -SSE   Σβi²    F-test  F-exact Lik Ratio AIC    -SSE   Σβi²
τ = 0.1
Sc1             36.0%   5.0%    0.0%      0.0%   6.2%   5.8%    36.4%   6.0%    0.0%      0.0%   4.8%   5.2%
Sc2(γ = 0.05)   100.0%  99.4%   98.2%     97.6%  8.6%   1.6%    100.0%  100.0%  100.0%    100.0% 9.6%   10.0%
Sc2(γ = 0.11)   100.0%  100.0%  100.0%    100.0% 17.6%  0.0%    100.0%  100.0%  100.0%    100.0% 38.2%  2.6%
τ = 0.5
Sc1             15.6%   5.0%    14.0%     11.0%  5.2%   5.8%    10.8%   4.8%    10.8%     8.6%   4.4%   4.2%
Sc2(γ = 0.05)   26.0%   10.4%   20.2%     15.6%  8.0%   5.6%    26.6%   16.4%   28.0%     23.4%  9.6%   7.8%
Sc2(γ = 0.11)   44.8%   19.0%   50.2%     42.4%  17.2%  7.2%    71.4%   52.0%   81.8%     78.6%  37.8%  8.6%
τ = 0.9
Sc1             44.2%   3.8%    93.4%     92.8%  5.6%   4.4%    29.2%   3.4%    95.0%     94.2%  4.6%   4.8%
Sc2(γ = 0.05)   42.0%   3.6%    94.6%     93.4%  8.4%   4.6%    32.0%   4.6%    95.6%     94.0%  12.0%  4.2%
Sc2(γ = 0.11)   45.0%   3.6%    93.8%     93.0%  17.8%  4.2%    30.8%   6.6%    96.6%     94.8%  39.4%  5.2%
Table 4.13: Simulation Results for Quantile Regression: Proposed methods, proportion of
rejections of SA for each scenario, sample size, τ and number of bins.

                    Permutation test                 χ² test
                    n = 500        n = 1000        n = 500        n = 1000
Scenario            K=2    K=3     K=2    K=3     K=2    K=3     K=2    K=3
τ = 0.1
Sc1                 4.4%   4.6%    5.8%   5.2%    4.2%   5.0%    2.2%   4.2%
Sc2(γ = 0.05)      98.4%  98.2%  100.0% 100.0%   96.4%  97.4%  100.0% 100.0%
Sc2(γ = 0.11)     100.0% 100.0%  100.0% 100.0%  100.0% 100.0%  100.0% 100.0%
τ = 0.5
Sc1                 4.0%   3.4%    5.2%   5.2%    3.8%   4.2%    4.8%   3.8%
Sc2(γ = 0.05)       8.2%   8.0%    6.6%   8.2%    6.2%   7.6%    4.8%   8.0%
Sc2(γ = 0.11)      11.8%  10.0%   22.4%  22.2%    9.6%  10.6%   15.8%  18.0%
τ = 0.9
Sc1                 4.4%   5.8%    6.2%   4.2%    3.6%   4.2%    2.4%   4.8%
Sc2(γ = 0.05)       4.4%   4.8%    6.6%   4.0%    3.2%   5.2%    3.6%   5.0%
Sc2(γ = 0.11)       5.4%   3.4%    4.8%   6.4%    3.0%   3.8%    2.0%   4.8%
which is certainly surprising. The likelihood ratio test and AIC also behave very strangely,
as their rejection rate under SA changes from 0 to 93% as τ increases from 0.1 to 0.9. This
can be explained by the observation that both of these measures require a likelihood function,
which is not assumed to have a particular form when fitting a quantile regression. Hence we
can conclude that using these built-in functions can lead to significant errors. As with the
mean and logistic regressions, F-exact and the bootstrap procedures have the required 5%
significance level. Another observation is that BOOT-SSE has much larger power than
BOOT-Σβi². The quantile value τ plays a very important role in this example. Note that for
Sc2 the power of F-exact is around 100% when τ = 0.1 and declines to about 4% when τ = 0.9.
The BOOT-SSE method, on the other hand, attains a power that is quite small but very stable
as τ changes. Note that for quantile regression, obtaining critical values for F-exact requires
simulating from the correct model under SA; this is generally not feasible, since in real-world
problems we do not know the actual distribution that generated the data.
Table 4.13 shows the proportion of rejections for the bin-based permutation and Pearson χ²
tests. The probability of Type I error for both methods is around 5% for every τ level. Again,
the number of bins K does not affect the performance, and Method 1 works slightly better here
than Method 2. When τ = 0.1, Methods 1 and 2 have power very similar to the best generic
test, F-exact (and much better than BOOT-SSE). If τ = 0.9, Methods 1 and 2 again have
power similar to F-exact, but both have much less power than BOOT-SSE. When the
conditional median (τ = 0.5) is estimated, the proposed methods have lower power than both
F-exact and BOOT-SSE.
Based on these observations we can conclude that, for quantile regression, the SA assessment
depends crucially on the τ level; no single method always works best. The two novel SA
testing procedures are much faster and more computationally manageable than the F-exact or
bootstrap tests and work very well for small quantile values; as τ increases, however,
BOOT-SSE should be implemented. Of course, this holds only in this example, where the
noise has a χ²_1 distribution, which is skewed to the right; for other scenarios we may observe
a different dependence on τ.
Chapter 5
Data Analysis
5.1 Red Wine Data
We consider the data of [18], consisting of various physicochemical tests of 1599 red variants
of the Portuguese "Vinho Verde" wine. Acidity and density are properties closely associated
with the quality of the wine and the grape, respectively. Of interest here is the dependence
pattern between 'fixed acidity' (Yfa) and 'density' (Yde) and how it changes with the values of
the other variables: 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur
dioxide', 'total sulfur dioxide', 'pH', 'sulphates' and 'alcohol', denoted
Xva, Xca, Xrs, Xch, Xfs, Xts, Xph, Xsu, Xal, respectively. Figure 5.1 shows pairwise scatter-
plots of all the original variables (responses and covariates). The response variables are linearly
transformed to have mean 0 and standard deviation 1; similarly, the covariates are trans-
formed to lie between 0 and 1.
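The two transformations are the usual standardization and min-max scaling; for concreteness, a sketch (function names are ours):

```python
import numpy as np

def standardize(y):
    """Linearly transform to mean 0 and standard deviation 1 (responses)."""
    return (y - y.mean()) / y.std()

def to_unit_interval(x):
    """Linearly transform to the [0, 1] range (covariates)."""
    return (x - x.min()) / (x.max() - x.min())
```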
5.2 Analysis and results
To select an appropriate copula family, we fit GP-SIM with 'Clayton', 'Frank', 'Gaussian',
'Gumbel' and 'T-3' (Student-t with 3 degrees of freedom) dependencies. For each model the
MCMC was run for 20,000 iterations with a 10,000-iteration burn-in period. We used 30 inducing
inputs for the marginals and the calibration function estimation (m1 = m2 = m = 30). The
resulting CVML, CCVML and WAIC values are shown in Table 5.1. All model selection measures
indicate that among the candidate copula families the most suitable one is the Gaussian.
The GP-SIM coefficients (β) fitted under the Gaussian copula family are shown in Table 5.2.
Figure 5.1: Wine Data: Pairwise scatterplots of all the variables in the analyzed data.
Table 5.1: Red Wine data: CVML, CCVML and WAIC criteria values for different models.

           Clayton  Frank  Gaussian  Gumbel  T-3
CVML       -1858    -1816  -1788     -1829   -1810
CCVML      -582     -547   -522      -558    -534
WAIC       3713     3634   3572      3656    3621
The credible intervals suggest that not all covariates may be needed to model the dependence
between the responses. For example, 'residual sugar' and 'chlorides' seem not to affect the
calibration function, so we consider a model in which they are omitted from the conditional
copula model. In all models, we include all the covariates in the marginal distributions.
For comparison, we have also fitted Gaussian GP-SIM models with only one covariate each,
and with no covariates at all (constant). The computational algorithm to fit GP-SIM when
the conditional copula depends on only one variable is very similar to the one described
above; the main difference is that there is no β variable and the inducing inputs (for the
calibration function) are evenly spread on [0, 1]. The testing results are shown in Table 5.3.
Based on the selection criteria we conclude that all nine covariates are required to
explain the dependence structure of the two responses. Figure 5.2 shows one-dimensional slices
of the Kendall's τ calibration curve with 95% credible intervals as a function of the covariates.
The plots
Table 5.2: Wine data: Posterior means and quantiles of β.

Variable  Posterior Mean  95% Credible Interval
Xva        0.274          [0.154, 0.389]
Xca       −0.336          [−0.413, −0.254]
Xrs       −0.076          [−0.278, 0.271]
Xch        0.060          [−0.246, 0.259]
Xfs        0.276          [0.106, 0.410]
Xts        0.402          [0.248, 0.608]
Xph        0.155          [0.054, 0.286]
Xsu        0.501          [0.342, 0.601]
Xal        0.463          [0.382, 0.517]
Table 5.3: Wine data: CVML, CCVML and WAIC criteria values for variable selection in
the conditional copula.

Variables                              CVML   CCVML  WAIC
ALL                                    -1788  -522   3572
Xva, Xca, Xfs, Xts, Xph, Xsu, Xal      -1805  -532   3608
Xva                                    -1823  -552   3646
Xca                                    -1815  -541   3629
Xrs                                    -1849  -582   3698
Xch                                    -1842  -578   3688
Xfs                                    -1852  -584   3705
Xts                                    -1851  -583   3700
Xph                                    -1816  -557   3633
Xsu                                    -1841  -571   3682
Xal                                    -1847  -577   3697
Constant                               -1849  -584   3700
are constructed by varying one predictor while fixing all others at their mid-range values.
They clearly demonstrate that, with the covariates fixed at their mid-range values, the
conditional correlation between 'fixed acidity' and 'density' increases with 'volatile acidity',
'free sulfur dioxide', 'total sulfur dioxide', 'pH', 'sulphates' and 'alcohol', and decreases with
the level of 'citric acid'. These relationships can influence the preparation method of the wine.
To illustrate the difficulty one would have in gauging the complex evolution of the dependence
between the two responses as a function of the covariates, Figure 5.3 plots the response
variables together as they vary with each covariate. It is clear that the model manages to
identify a pattern that would be very difficult to distinguish without the help of
Figure 5.2: Wine Data: Slices of predicted Kendall's τ as a function of each covariate
('volatile.acidity', 'citric.acid', 'residual.sugar', 'chlorides', 'free.sulfur.dioxide',
'total.sulfur.dioxide', 'pH', 'sulphates', 'alcohol'); each panel plots Kendall's τ against one
covariate. Red curves represent 95% credible intervals.
Figure 5.3: Wine Data: Plots of 'fixed acidity' (blue) and 'density' (red) (linearly transformed
to fit on one plot) against each covariate ('volatile.acidity', 'citric.acid', 'residual.sugar',
'chlorides', 'free.sulfur.dioxide', 'total.sulfur.dioxide', 'pH', 'sulphates', 'alcohol').
a flexible mathematical model.
Part II
Approximated Bayesian Methods
Chapter 6
Introduction
6.1 The Need for Simulation-Based Methods
When data y0 ∈ X^n are observed and the sampling distribution has density function f(y0|θ),
indexed by a parameter θ ∈ R^q, Bayesian inference for functions of θ relies on the characteristics
of the posterior distribution:

    π(θ|y0) = p(θ)f(y0|θ) / ∫_{R^q} p(θ)f(y0|θ) dθ ∝ p(θ)f(y0|θ),        (6.1)

where p(θ) denotes the prior distribution.
Since the early 1990s, Bayesian statisticians have been able to operate largely free of
computation-induced constraints thanks to the rapid development of Markov chain Monte Carlo
(MCMC) sampling methods [see, for example, 19, for a recent review]. This class of methods
allows one to produce samples from π in (6.1) despite its often intractable denominator.
While traditional MCMC samplers such as Metropolis-Hastings or Hamiltonian MCMC [see
14, and references therein] can draw from distributions with unknown normalizing con-
stants, they rely on a closed form for the unnormalized posterior, i.e.
p(θ)f(y0|θ) (as discussed in Chapter 1).
Larger data sets should yield answers to more complex problems. The latter can be tackled
statistically using increasingly complex models, for which the sampling distribution is often
no longer available in closed form. In these complex settings, a much weaker assumption often
holds, namely that, for any θ ∈ R^q, draws y ∼ f(y|θ) can be generated. As motivation
for the simulation-based methods, consider for example the Hidden Markov Model:

    X0 ∼ P(x0),
    Xi | xi−1 ∼ P(Xi | xi−1, θ),   i = 1, . . . , n,
    Yi | xi ∼ P(Yi | xi, θ),       i = 1, . . . , n.        (6.2)
Note that, except for examples with Gaussian transition and emission distributions, the marginal
distribution P(y1, · · · , yn|θ) cannot be calculated in closed form. It is possible to treat the
hidden random variables Xi as auxiliary and sample them along with the parameters using
Particle MCMC (PMCMC) [8] or ensemble MCMC [82], but this becomes increasingly difficult
as n increases. Moreover, for some financial time series models [69] (e.g., Stochastic Volatility
models for log returns) the α-Stable distribution may be useful for modelling the transition
and/or emission probabilities. The challenge is that stable distributions do not have a
closed-form density but can be sampled from, and hence particle and ensemble MCMC are not
feasible. Other widely used examples where the likelihood function cannot be expressed
analytically include various network models [51] and Markov random fields [79].
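Even when the marginal likelihood of (6.2) is intractable, forward simulation is straightforward. A sketch with an illustrative Gaussian AR(1) latent state; the parameterization θ = (φ, σx, σy) and the function name are our own:

```python
import numpy as np

def simulate_hmm(theta, n, seed=0):
    """Draw x_0, then alternate transition and emission draws as in (6.2)."""
    phi, sig_x, sig_y = theta
    rng = np.random.default_rng(seed)
    x = np.empty(n + 1)
    y = np.empty(n)
    x[0] = rng.normal()                               # X_0 ~ P(x_0)
    for i in range(1, n + 1):
        x[i] = phi * x[i - 1] + sig_x * rng.normal()  # X_i | x_{i-1}, theta
        y[i - 1] = x[i] + sig_y * rng.normal()        # Y_i | x_i, theta
    return y
```

Draws of this kind are exactly what the simulation-based methods below (ABC and BSL) consume in place of likelihood evaluations.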
In the absence of a tractable likelihood function, statisticians have turned to approximate
methods to perform Bayesian inference. Here we consider two alternative approaches that
have been proposed and have gained momentum recently: Approximate Bayesian Computa-
tion (ABC) [60, 10, 83, 25] and Bayesian Synthetic Likelihood (BSL) [94, 26, 73]. These
algorithms are based only on pseudo-data simulations from f(y|θ) and do not require a
tractable form of the likelihood. Both algorithms are effective when combined with
Markov chain Monte Carlo sampling schemes to produce samples from an approximation of
π, and both share a potential need for intense computational effort at each update. In the
next sections we describe in detail existing methods for Approximate Bayesian Computation
and Bayesian Synthetic Likelihood.
6.2 Approximate Bayesian Computation (ABC)
For models with intractable or computationally expensive likelihood evaluations, simulation-
based algorithms such as ABC are frequently the only choice for inference. In its simplest
form, ABC is a reject/accept sampler. Suppose we observe data y0; given a user-defined
summary statistic S(y) ∈ R^p, Accept/Reject ABC repeatedly samples parameters ζ* from the
prior, each time simulates pseudo-data y* ∼ f(y|ζ*), and then compares S(y*) with S(y0).
If the two are equal, the generated ζ* is accepted; otherwise it is rejected (see Algorithm 2).

Algorithm 2 Accept/Reject ABC
 1: Given observed y0 and required number of samples M.
 2: for t = 1, · · · , M do
 3:   Match = FALSE
 4:   while Not Match do
 5:     ζ* ∼ p(θ)
 6:     y ∼ f(y|ζ*)
 7:     if S(y) = S(y0) then
 8:       θ(t) = ζ*
 9:       Match = TRUE
10:     end if
11:   end while
12: end for

We emphasize that a closed-form equation for the likelihood is not needed, only the ability to
generate from f(y|θ) for any θ. If S(y) is a sufficient
statistic and Pr(S(y) = S(y0)) > 0, then the algorithm yields samples from the true posterior π(θ|y0). Alas, the level of complexity of the models for which ABC is needed makes it unlikely that these two conditions hold. In order to implement ABC under more realistic
assumptions, a (small) constant ε is chosen and ζ∗ is accepted whenever d(S(y), S(y0)) < ε,
where d(S(y), S(y0)) is a user-defined distance function. If we denote the distribution of
accepted draws by πε(θ|S(y0)), then we obtain

lim_{ε↓0} πε(θ|S(y0)) = π(θ|S(y0)). (6.3)
In light of (6.3) one would like to take S(y) = y, but if the sample size of y0 is large, the curse of dimensionality leads to Pr(d(y,y0) < ε) ≈ 0. Thus, obtaining even a moderate number of samples using ABC can be unattainable in this case. Unless S is sufficient, some information about θ is lost, so much attention has been devoted to finding appropriate low-dimensional summary statistics [see, for example, 29, 72]. We assume that the summary statistic function S(·) is given. While Accept/Reject ABC can be used to sample from the posterior distribution of θ, the computational cost is prohibitively large when we require closeness between pseudo and observed data, i.e. d(S(y), S(y0)) < ε. This imposes constraints on the size of the threshold ε, which ends up being selected based on the available computational power and time rather than on other factors such as precision.
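In code, the tolerance-based Accept/Reject ABC described above can be sketched as follows; the function names and the toy Gaussian-mean model at the bottom are our own illustrations, not part of the thesis.

```python
import numpy as np

def abc_accept_reject(y0, prior_sample, simulate, summary, dist, eps, M, rng):
    """Tolerance-based Accept/Reject ABC: draw zeta* from the prior, simulate
    pseudo-data y* ~ f(y | zeta*), and keep zeta* when d(S(y*), S(y0)) < eps."""
    s0 = summary(y0)
    kept = []
    while len(kept) < M:
        theta = prior_sample(rng)          # zeta* ~ p(theta)
        y = simulate(theta, rng)           # y* ~ f(y | zeta*)
        if dist(summary(y), s0) < eps:     # accept when statistics are close
            kept.append(theta)
    return np.array(kept)

# Toy illustration (an assumption for this sketch): y ~ N(theta, 1) with a
# uniform prior on (-5, 5) and the sample mean as summary statistic.
rng = np.random.default_rng(1)
y0 = rng.normal(2.0, 1.0, size=100)
draws = abc_accept_reject(
    y0,
    prior_sample=lambda g: g.uniform(-5, 5),
    simulate=lambda th, g: g.normal(th, 1.0, size=100),
    summary=lambda y: y.mean(),
    dist=lambda a, b: abs(a - b),
    eps=0.3, M=200, rng=rng)
```

Shrinking eps concentrates the accepted draws near the posterior but lowers the acceptance rate, which is exactly the trade-off discussed above.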
Under weak or no prior information about the parameters of the model, the prior and the posterior may be misaligned, i.e. their regions of mass concentration do not overlap. Hence, parameter values drawn from the prior will rarely be retained, making the algorithm very inefficient.

Algorithm 3 ABC-MCMC
1: Given y0, ε > 0 and required number of samples M.
2: Find an initial θ(0) with simulated y such that d(S(y), S(y0)) < ε.
3: for t = 1, ..., M do
4:   Generate ζ∗ ∼ q(·|θ(t−1)).
5:   Simulate y∗ ∼ f(y|ζ∗) and let δ∗ = d(S(y∗), S(y0)).
6:   Calculate α = min( 1, [1{δ∗<ε} p(ζ∗) q(θ(t−1)|ζ∗)] / [p(θ(t−1)) q(ζ∗|θ(t−1))] ).
7:   Generate independent U ∼ U(0, 1).
8:   if U ≤ α then
9:     θ(t) = ζ∗.
10:  else
11:    θ(t) = θ(t−1).
12:  end if
13: end for

Algorithm 3 presents the ABC-MCMC algorithm of [61], which avoids sampling from the prior and instead relies on a chain with a Metropolis-Hastings (MH) transition kernel, with
state space {(θ,y) ∈ Rq×X n}, proposal distribution q(ζ|θ)×f(y|ζ) and target distribution
πε(θ,y|y0) ∝ p(θ)f(y|θ)1{δ(y0,y)<ε}, (6.4)
where δ(y0,y) = d(S(y), S(y0)). Note that the goal is the marginal distribution for θ which
is:
πε(θ|y0) = ∫ πε(θ,y|y0) dy ∝ ∫ p(θ) f(y|θ) 1{δ(y0,y)<ε} dy = p(θ) P(δ(y0,y) < ε|θ). (6.5)
Therefore, if we knew the conditional probability P(δ(y0,y) < ε|θ) for every θ, we could run an MH algorithm to sample from the approximate target given in (6.5). Other MCMC designs suitable for ABC can be found in [55] and [13]. Sequential Monte Carlo (SMC) methods have also been successfully used for ABC (we denote this ABC-SMC) [85, 31]. ABC-SMC requires a specified decreasing sequence ε0 > · · · > εJ. This method uses the Particle MCMC design [8], in which samples are updated as the target distribution evolves with ε. More specifically, it starts by sampling θ0^(1), ..., θ0^(M) from πε0(θ|y0) using Accept/Reject ABC. Subsequently, at time t + 1 the samples available at time t are sequentially updated so that their distribution is
πεt+1(θ|y0) [see 55, for a complete description of the SMC-MCMC]. The advantage of this method is not only that it starts from a large ε but also that it generates independent draws. A comprehensive coverage of computational techniques for ABC can be found in [84] and references therein. The ABC-MCMC algorithm proposed by [54] approximates P(δ(y0,y) < ε|θ) by J^{-1} Σ_{j=1}^J 1{δ(y0,yj)<ε}, where J ≥ 1 and each yj is simulated from f(y|θ). Note that this estimator is unbiased and hence the stationary distribution of θ is πε(θ|y0), as a consequence of pseudo-marginal MCMC theory [9]. Clearly, when the probability P(δ < ε|θ) is very small, this method requires simulating a large number of δs (or, equivalently, ys) in order to move to a new state. Note that even when the proposed parameter θ is near the "true" unknown parameter θ0, the simulated δ(y,y0) given θ can be greater than ε due to the randomness of the conditional distribution δ|θ; in this case the chain can become sluggish to the point of being impractical. We also note a general lack of guidelines concerning the selection of ε, which is unfortunate as the performance of ABC sampling depends heavily on its value.
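The Monte Carlo estimator of P(δ(y0, y) < ε|θ) used by [54] can be sketched as below (the function name, interface, and the toy uniform model in the test are ours); the estimate replaces the intractable probability in the MH acceptance ratio.

```python
import numpy as np

def acceptance_prob_hat(theta, s0, simulate, summary, dist, eps, J, rng):
    """Unbiased estimate of P(d(S(y), S(y0)) < eps | theta): the average of
    J indicator variables, each computed from a fresh pseudo-data set."""
    hits = sum(dist(summary(simulate(theta, rng)), s0) < eps for _ in range(J))
    return hits / J
```

Because the estimator is unbiased, substituting it for the true probability in the acceptance ratio leaves the stationary distribution πε(θ|y0) unchanged, at the cost of J simulations per proposal.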
Notice that the choice of proposal distribution q(·|θ) can dramatically influence the performance of ABC-MCMC. To make a fair comparison between different methods, we revise the ABC-MCMC algorithm by introducing a decreasing sequence ε0 > · · · > εJ (J is the number of "steps"), similar to ABC-SMC, and by "learning" the transition kernel during burn-in, as in Algorithm 4. The main difference is that during a burn-in period of length B the ε-sequence starts at a higher value (which makes finding initial θ(0) values much more feasible) and gradually decreases, while the proposal distribution is adapted over the same period. The adaptation of the proposal takes place only during the burn-in period. For independent MH sampling the generic proposal is Gaussian, N(·|µ̂, Σ̂), with the constant c set to 2 or 3; for the random walk sampler the standard transition kernel is N(·|θ(t−1), Σ̂) with c = 2.38²/q, which is proven to be optimal for a Gaussian posterior [75, 76].
All the algorithms discussed so far rely on numerous generations of pseudo-data, which can be computationally costly; some attempts to reduce the computational cost of ABC are made in [93] and [45]. These approaches are based on learning the dependence between δ and θ. Therefore, instead of simulating many statistics for each proposed θ, the accelerated algorithm captures information from all simulated pairs through this functional form. Flexible regression models are used to model these unknown functional relationships, and the performance depends on the signal-to-noise ratio and on the ability to capture patterns that can be highly complex.
Algorithm 4 ABC-MCMC modified (ABC-MCMC-M)
1: Given y0, sequence ε0 > · · · > εJ, constant c, burn-in period B and required number of samples M.
2: Define ε = ε0.
3: Find an initial θ(0) with simulated y such that d(S(y), S(y0)) < ε.
4: Let µ be the expectation of the prior distribution and Σ̂ = cΣ, where Σ is the covariance of the prior p(θ).
5: Define b = ⌊B/J⌋ and the sequence (a1, ..., aJ) = (b, 2b, ..., Jb).
6: for t = 1, ..., M do
7:   if t = aj for some j = 1, ..., J then
8:     Set ε = εj.
9:     Set µ to the mean of {θ(t')}, t' = 1, ..., (aj − 1), and Σ̂ = cΣ, where Σ is now the covariance of {θ(t')}, t' = 1, ..., (aj − 1).
10:  end if
11:  Generate ζ∗ ∼ q(·|θ(t−1), µ, Σ̂).
12:  Simulate y∗ ∼ f(y|ζ∗) and let δ∗ = d(S(y∗), S(y0)).
13:  Calculate α = min( 1, [1{δ∗<ε} p(ζ∗) q(θ(t−1)|ζ∗, µ, Σ̂)] / [p(θ(t−1)) q(ζ∗|θ(t−1), µ, Σ̂)] ).
14:  Generate independent U ∼ U(0, 1).
15:  if U ≤ α then
16:    θ(t) = ζ∗.
17:  else
18:    θ(t) = θ(t−1).
19:  end if
20: end for
6.3 Bayesian Synthetic Likelihood
As an alternative to ABC, which requires tuning of ε and selection of a distance function d(·, ·), [94] tackled the intractability of the sampling distribution by assuming that the conditional
distribution for the statistic S(y) given θ is Gaussian with mean µθ and covariance matrix
Σθ. The Synthetic Likelihood (SL) procedure assigns to each θ the likelihood SL(θ) =
N (s0;µθ,Σθ), where N (x;µ,Σ) denotes the density of a normal with mean µ and covariance
Σ, and s0 = S(y0). SL can be used for maximum likelihood estimation as in [94] or within the
Bayesian paradigm as proposed by [26] and [73]. In Bayesian Synthetic Likelihood (BSL) [73]
propose to implement a Metropolis-Hastings sampler that has π(θ|s0) ∝ p(θ)N (s0;µθ,Σθ)
as stationary distribution. It is clear that direct implementation is not possible as the
conditional mean and covariance matrix are unknown. However, both can be estimated
based on m statistics (s1, · · · , sm) sampled from their conditional distribution given θ. More
precisely, after simulating yi ∼ f(y|θ) and setting si = S(yi), i = 1, · · · ,m, estimate
µ̂θ = (1/m) Σ_{i=1}^m si,
Σ̂θ = (1/(m − 1)) Σ_{i=1}^m (si − µ̂θ)(si − µ̂θ)^T, (6.6)

so that the synthetic likelihood is

SL(θ|y0) = N(S(y0)| µ̂θ, Σ̂θ). (6.7)
The pseudo-code in Algorithm 5 shows the steps involved in the BSL-MCMC sampler.
Since each Metropolis-Hastings step requires calculating the likelihood ratios between two
SLs calculated at different θs one can anticipate the heavy computational load involved in
running the chain for thousands of iterations, especially if sampling each y is expensive. Note that even though these estimates of the conditional mean and covariance are unbiased, the estimated value of the Gaussian likelihood is biased, and therefore pseudo-marginal MCMC theory is not applicable here. [73] present an unbiased Gaussian likelihood estimator and show empirically that biased and unbiased estimates generally perform similarly. They also remark that the procedure is very robust to the number of simulations m, demonstrating empirically that values of m between 50 and 200 produce similar results.
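A minimal numpy sketch of the synthetic likelihood (6.6)–(6.7), with the Gaussian log-density evaluated directly; the function name and the toy Gaussian model in the test are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def synthetic_loglik(theta, s0, simulate, summary, m, rng):
    """Gaussian synthetic log-likelihood: estimate the conditional mean and
    covariance of S(y) | theta from m simulated statistics, as in (6.6),
    then evaluate log N(s0; mu_hat, Sigma_hat) as in (6.7)."""
    S = np.array([summary(simulate(theta, rng)) for _ in range(m)])
    mu = S.mean(axis=0)
    Sigma = np.cov(S, rowvar=False)        # denominator m - 1, as in (6.6)
    diff = np.asarray(s0) - mu
    _, logdet = np.linalg.slogdet(Sigma)
    p = len(diff)
    return -0.5 * (p * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))
```

Each MH step in BSL-MCMC evaluates this quantity twice, once at the proposal and once at the current state, which is the source of the heavy computational load noted above.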
Algorithm 5 Bayesian Synthetic Likelihood (BSL-MCMC)
1: Given s0 = S(y0), number of simulations m and required number of samples M.
2: Get initial θ(0); estimate µ̂θ(0), Σ̂θ(0) by simulating m statistics given θ(0).
3: Define ĥ(θ(0)) = N(s0; µ̂θ(0), Σ̂θ(0)).
4: for t = 1, ..., M do
5:   Generate ζ∗ ∼ q(·|θ(t−1)).
6:   Estimate µ̂ζ∗, Σ̂ζ∗ by simulating m statistics given ζ∗.
7:   Calculate ĥ(ζ∗) = N(s0; µ̂ζ∗, Σ̂ζ∗).
8:   Calculate α = min( 1, [p(ζ∗) ĥ(ζ∗) q(θ(t−1)|ζ∗)] / [p(θ(t−1)) ĥ(θ(t−1)) q(ζ∗|θ(t−1))] ).
9:   Generate independent U ∼ U(0, 1).
10:  if U ≤ α then
11:    Set θ(t) = ζ∗.
12:  else
13:    Set θ(t) = θ(t−1).
14:  end if
15: end for

The normality assumption for the summary statistics is certainly a strong one and may not hold in practice. Following up on this, [6] relaxed the joint Gaussian assumption to a Gaussian copula with non-parametric marginal distribution estimates (NON-PAR BSL), which includes the joint Gaussian as a special case but is much more flexible. The estimation is based, as in the BSL framework, on m pseudo-data samples simulated for each θ.
6.4 Plan
It is evident that both ABC and BSL are computationally costly and require an enormous number of pseudo-data simulations to run even a moderately sized MCMC chain. Accelerating these algorithms is especially important for very large data sets or for time-consuming pseudo-data simulations or summary statistic calculations. We propose to speed up these methods by storing past simulated draws and using them to approximate the unknown likelihood. While this drastically reduces the computation time, it raises the need to control the approximation error introduced when modifying the original transition kernel. The objective is to approximate P(δ < ε|θ) and N(s0; µθ, Σθ) for any θ at every MCMC iteration, using the past simulated pairs (θ, δ) and (θ, s) for ABC and BSL respectively. The K-Nearest-Neighbor (kNN) method is used as a non-parametric estimation tool for the quantities described above. The main advantage of kNN is that it is uniformly strongly consistent, which guarantees that, for a large enough chain history, we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC.
The structure of this part is as follows. In Chapter 7 we describe the accelerated MCMC algorithms for ABC. In Chapter 8 we extend the proposed method to BSL. The practical impact of these algorithms is evaluated via simulations in Chapter 9, and a data analysis involving the Stochastic Volatility model (with α-stable errors) applied to the time series of daily log returns of the Dow Jones index between Jan 1, 2010 and Dec 31, 2017 is presented in Chapter 10. The theoretical analysis showing control of the perturbation errors in total variation norm is presented in Chapter 11.
Chapter 7
Approximated ABC (AABC)
7.1 Computational Inefficiency of ABC
The problem that we tackle in this thesis is the computational burden of standard ABC procedures such as ABC-MCMC. As pointed out in the previous chapter, ABC algorithms require a large number of pseudo-data simulations when the threshold ε is small, which results in low computational efficiency. Let θ0 be the parameter that generated the observed data. If P(δ < ε|θ0) is small, then even when the proposed state ζ∗ is close to θ0 there is a high probability that the generated δ∗ will be greater than ε and therefore rejected. Thus many samples are rejected not because they are in the tail of the (approximate) posterior but simply due to the variability of δ conditional on θ. Note that all ABC methods completely ignore past simulation results; we think, however, that these results provide essential information that can significantly accelerate the algorithm. This observation is the basis of the proposed method.
7.2 Approximated ABC-MCMC (AABC-MCMC)
In this section we describe a novel ABC-MCMC sampler that utilizes past pseudo-data simulations and significantly improves the performance of the chain. We mentioned in the last chapter that the objective of ABC-MCMC (given a threshold ε) is to sample from the distribution with support Θ:
πε(θ|y0) ∝ p(θ)P (δ(y0,y) < ε|θ), (7.1)
where δ(y0,y) = d(S(y), S(y0)) with y ∼ f(y|θ). If h(θ) = P(δ(y0,y) < ε|θ) were known for every θ, then we could run an exact MH-MCMC chain with invariant distribution proportional to p(θ)h(θ). Since this function of θ is generally unknown, it is estimated by the indicator 1{δ(y0,y)<ε}, which is an unbiased estimator. Suppose now that at iteration t + 1 we have stored N − 1 past simulations ZN−1 = {ζn, δn}, n = 1, ..., N − 1, where ζ denotes a proposal generated independently of the MCMC (otherwise the Markovian property of the chain would be violated). Given two new independent proposals ζ∗, ζ̄∗ ∼ q(·|θ(t)), the first is used for the chain update and the second is used to enrich the "history". We then generate one δ∗ by first simulating y∗ (given ζ̄∗), then calculating the statistic s∗ = S(y∗), and finally computing the discrepancy between s∗ and s0 = S(y0), δ∗ = d(s∗, s0). We then combine the past samples ZN−1 with the new pair (ζ̄∗, δ∗), ZN = ZN−1 ∪ {(ζ̄∗, δ∗)}, and estimate h(ζ∗) as follows:

ĥ(ζ∗) = Σ_{n=1}^N WNn(ζ∗) 1{δn<ε} / Σ_{n=1}^N WNn(ζ∗). (7.2)
Here the weight function WNn(ζ∗) = WN(ζn, ζ∗) = W(‖ζn − ζ∗‖) depends on the Euclidean distance between ζn and ζ∗, assigning more weight to the closest pairs. We discuss several choices of the function W(·) below.
In other words, a non-parametric estimate of h(ζ) is produced for each ζ based on the previous simulations ZN. Notice that if ζ∗ has a close neighbor whose discrepancy is less than ε, then the estimate ĥ(ζ∗) will not be zero and there is a chance of moving to a different state. Intuitively, this is expected to stabilize the acceptance probability of the chain and to perform better than standard ABC-MCMC. Since the proposed weighted estimate is no longer an unbiased estimator of h(θ), a new theoretical evaluation is needed to study the effect of perturbing the transition kernel on the statistical analysis. Central to the algorithm's utility is the ability to control the total variation distance between the desired distribution of interest given in (7.1) and the modified chain's target. As will be shown in Chapter 11, we rely on three assumptions to ensure that the chain approximately samples from (7.1): 1) compactness of Θ; 2) uniform ergodicity of the chain using the true h; and 3) uniform convergence in probability of ĥ(θ) to h(θ) as N → ∞.
The K-Nearest-Neighbor (kNN) regression approach [32, 11] has the property of uniform consistency [16]; we therefore employ this technique for ĥ(θ). Here we define K = g(N) (for the simulations we use g(·) = √·) and let λ : {1, ..., N} → {1, ..., N} be a permutation of indices that sorts {ζn} ∈ ZN = {ζn, δn}, n = 1, ..., N, from closest to ζ∗ to furthest.

Algorithm 6 Approximated ABC-MCMC (AABC-MCMC)
1: Given y0 with summary statistics s0, sequence ε0 > · · · > εJ, constant c, burn-in period B, required number of samples M, and initial simulations ZN = {ζn, δn}, n = 1, ..., N, with ζn ∼ p(ζ), yn ∼ f(·|ζn) and δn = d(S(yn), s0).
2: Define ε = ε0.
3: Find an initial θ(0) with simulated y such that d(S(y), s0) < ε.
4: Let µ be the expectation of the prior distribution and Σ̂ = cΣ, where Σ is the covariance of the prior p(θ).
5: Define b = ⌊B/J⌋ and the sequence (a1, ..., aJ) = (b, 2b, ..., Jb).
6: for t = 1, ..., M do
7:   if t = aj for some j = 1, ..., J then
8:     Set ε = εj.
9:     Set µ to the mean of {θ(t')}, t' = 1, ..., (aj − 1), and Σ̂ = cΣ, where Σ is now the covariance of {θ(t')}, t' = 1, ..., (aj − 1).
10:  end if
11:  Generate ζ∗, ζ̄∗ iid∼ N(·; µ, Σ̂).
12:  Simulate y∗ ∼ f(·|ζ̄∗) and let δ∗ = d(S(y∗), s0).
13:  Add the simulated pair of parameter and discrepancy to the past set: ZN = ZN−1 ∪ {ζ̄∗, δ∗} and set N = N + 1.
14:  ĥ(ζ∗) = Σ_{n=1}^N WNn(ζ∗) 1{δn<ε} / Σ_{n=1}^N WNn(ζ∗).
15:  ĥ(θ(t−1)) = Σ_{n=1}^N WNn(θ(t−1)) 1{δn<ε} / Σ_{n=1}^N WNn(θ(t−1)).
16:  Calculate α = min( 1, [p(ζ∗) ĥ(ζ∗) N(θ(t−1); µ, Σ̂)] / [p(θ(t−1)) ĥ(θ(t−1)) N(ζ∗; µ, Σ̂)] ).
17:  Generate independent U ∼ U(0, 1).
18:  if U ≤ α then
19:    θ(t) = ζ∗.
20:  else
21:    θ(t) = θ(t−1).
22:  end if
23: end for

Suppose that after the permutation the past set ZN is rearranged by distance to ζ∗, so that (ζ1, δ1) has the smallest ‖ζ1 − ζ∗‖ while (ζN, δN) has the largest distance ‖ζN − ζ∗‖. Then kNN sets WNn(ζ∗) to zero for all n > K; for n ≤ K there are several weight choices, and we focus on two:
(U) The uniform kNN with WNn(ζ∗) = 1 for all n ≤ K;
(L) The linear kNN with WNn(ζ∗) = W (‖ζn − ζ∗‖) = 1− ‖ζn − ζ∗‖/‖ζK − ζ∗‖ for n ≤ K
so that the weight decreases from 1 to 0 as n increases from 1 to K.
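The weighted estimate (7.2) with the two kNN weighting schemes can be sketched as follows (the function name and interface are ours):

```python
import numpy as np

def h_hat(zeta, past_zeta, past_delta, eps, weights="uniform"):
    """kNN estimate of h(zeta) = P(delta < eps | zeta) from the stored history
    {(zeta_n, delta_n)}; K = floor(sqrt(N)), uniform (U) or linear (L) weights."""
    N = len(past_zeta)
    K = max(1, int(np.sqrt(N)))
    d = np.linalg.norm(past_zeta - zeta, axis=1)   # Euclidean distances
    idx = np.argsort(d)[:K]                        # indices of K nearest neighbours
    if weights == "uniform":
        w = np.ones(K)
    else:
        dK = d[idx][-1]                            # distance to the K-th neighbour
        w = 1.0 - d[idx] / dK if dK > 0 else np.ones(K)
        if w.sum() == 0:                           # guard for the K = 1 corner case
            w = np.ones(K)
    ind = (past_delta[idx] < eps).astype(float)
    return float(np.sum(w * ind) / np.sum(w))
```

As described above, this estimate is non-zero as soon as any close neighbour has discrepancy below ε, which stabilizes the acceptance probability relative to the single-indicator estimate.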
Moreover, kNN theoretical arguments generally require independent pairs {ζn, δn}, n = 1, ..., N; therefore for the proposal distribution we apply an independent sampler, so that q(·|θ(t)) = q(·). As in Algorithm 4, we allow the ε-sequence to decrease gradually during the burn-in period. In all our simulations we assume that the proposal is Gaussian, which of course can be changed to any other appropriate distribution (with positive density on Θ) suited to the particular problem. The entire procedure is outlined in Algorithm 6.
To conclude, at the end of a simulation of size M the MCMC samples are {θ(1), ..., θ(M)} and the history used for updating the chain is {(ζ1, δ1), ..., (ζM, δM)}. The two sequences are independent of one another, i.e. for any N > 0, the elements in ZN are independent of the chain's history up to time N.
Note also that ĥ(θ) is estimated in both the numerator and denominator of the acceptance probability at every iteration, so even for the current state this value is recalculated rather than borrowed from the previous iteration. This procedure generally improves the mixing of the chain and is theoretically justified, as will be shown in Chapter 11. The constant c controls the variability of the proposal: if it is too small, the MCMC will not explore the posterior effectively; if it is too large, there will be many rejections, as the proposed values will frequently fall in the tails of the posterior. For all our simulations and the real data example we use c = 1.5, which was found empirically to be quite satisfactory.
Chapter 8
Approximated BSL (ABSL)
8.1 Computational Inefficiency of BSL
Similar to ABC, BSL is computationally costly and requires many pseudo-data simulations to run even moderately long chains, since it generates m pseudo-data sets at every iteration. This m cannot be small, since otherwise the estimates of the conditional mean µθ and covariance Σθ will not be accurate, especially when the dimension p of the summary statistic is moderate or large. To accelerate BSL-MCMC we propose to store and utilize past simulations of (ζ, s) to approximate the conditional mean and covariance for every proposed parameter ζ∗, making the whole procedure computationally faster. Instead of simulating m pseudo-data sets, only one is simulated and used in combination with the past simulations. The approach can trivially be extended to the NONPAR-BSL algorithm, but we do not pursue this further. The K-Nearest-Neighbor (kNN) method is used as a non-parametric estimation tool for the quantities described above.
8.2 Approximated Bayesian Synthetic Likelihood (ABSL)
Setting s0 = S(y0) and assuming conditional normality for this statistic, the objective is to sample from
π(θ|s0) ∝ p(θ)N (s0;µθ,Σθ). (8.1)
During the MCMC run, the proposal ζ∗ is generated from q(·), and the history ZN is enriched using ζ̄∗ ∼ q(·), y∗ ∼ f(·|ζ̄∗) and s∗ = S(y∗) (this can trivially be extended to more than one pseudo-data set generation). Thus all proposals with their associated statistics are stored in ZN = {ζn, sn}, n = 1, ..., N; note that, similar to AABC-MCMC, this "history" set is independent of
the chain states. Then for any ζ, the conditional mean and covariance of the statistics vector are estimated from past samples as weighted averages:

µ̂ζ = Σ_{n=1}^N WNn(ζ) sn / Σ_{n=1}^N WNn(ζ),
Σ̂ζ = Σ_{n=1}^N WNn(ζ)(sn − µ̂ζ)(sn − µ̂ζ)^T / Σ_{n=1}^N WNn(ζ). (8.2)

Again the weights are functions of the distance between the proposed value and the parameter values from the past, WNn(ζ) = W(‖ζ − ζn‖); we use the simple Euclidean distance.

Algorithm 7 Approximated Bayesian Synthetic Likelihood (ABSL)
1: Given s0 = S(y0), constant c, burn-in period B, number of adaptation points J during burn-in, required number of samples M, and initial pseudo-data simulations ZN = {ζn, sn}, n = 1, ..., N, with ζn ∼ p(ζ), yn ∼ f(·|ζn) and sn = S(yn).
2: Get initial θ(0).
3: Let µ be the expectation of the prior distribution and Σ̂ = cΣ, where Σ is the covariance of the prior p(θ).
4: Define b = ⌊B/J⌋ and the sequence (a1, ..., aJ) = (b, 2b, ..., Jb).
5: for t = 1, ..., M do
6:   if t = aj for some j = 1, ..., J then
7:     Set µ to the mean of {θ(t')}, t' = 1, ..., (aj − 1), and Σ̂ = cΣ, where Σ is now the covariance of {θ(t')}, t' = 1, ..., (aj − 1).
8:   end if
9:   Generate ζ∗, ζ̄∗ iid∼ N(·; µ, Σ̂).
10:  Simulate y∗ ∼ f(·|ζ̄∗) and let s∗ = S(y∗).
11:  Add the simulated pair of parameter and statistic to the past set: ZN = ZN−1 ∪ {ζ̄∗, s∗} and set N = N + 1.
12:  Calculate µ̂ζ∗ and Σ̂ζ∗ using (8.2) with ζ = ζ∗.
13:  Calculate µ̂θ(t−1) and Σ̂θ(t−1) using (8.2) with ζ = θ(t−1).
14:  ĥ(ζ∗) = N(s0; µ̂ζ∗, Σ̂ζ∗).
15:  ĥ(θ(t−1)) = N(s0; µ̂θ(t−1), Σ̂θ(t−1)).
16:  Calculate α = min( 1, [p(ζ∗) ĥ(ζ∗) N(θ(t−1); µ, Σ̂)] / [p(θ(t−1)) ĥ(θ(t−1)) N(ζ∗; µ, Σ̂)] ).
17:  Generate independent U ∼ U(0, 1).
18:  if U ≤ α then
19:    θ(t) = ζ∗.
20:  else
21:    θ(t) = θ(t−1).
22:  end if
23: end for

To get appropriate
convergence properties we use the kNN approach to calculate the weights WNn, where only the K = √N values closest to ζ are used in the calculation of the conditional mean and covariance. As described in the previous chapter, uniform and linear weights are available: the former assigns equal weight to all K values, the latter has linearly decreasing weights. The advantage of this method is that using the past to estimate the conditional mean and covariance matrix can significantly speed up the whole procedure, since there is no longer a need to generate many data sets y at every step, as was done in the original BSL; the hope is that these estimates are still close enough to the true values. Applying these estimates within MCMC, we assume that the proposal distribution is independent of the previous state, so that q(·|θ(t)) = N(·; µ, Σ) for some fixed µ and Σ (of course, any other distribution with non-zero density on the support of the exact posterior can be used). As will be shown in Chapter 11, we need the assumption that the pairs in the past set ZN are independent, which is satisfied when the proposal is independent and when both accepted and rejected simulations are saved. We will also show that if Θ is compact, then under minor assumptions on the expectation and covariance of s|θ the proposed algorithm exhibits good convergence properties and we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC.
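The weighted kNN estimates (8.2) can be sketched as follows (names are ours; the returned moments would then be plugged into the Gaussian density N(s0; ·, ·)):

```python
import numpy as np

def knn_moments(zeta, past_zeta, past_s, weights="uniform"):
    """kNN estimates of the conditional mean and covariance of S(y) | zeta
    from the stored pairs {(zeta_n, s_n)}; K = floor(sqrt(N))."""
    N = len(past_zeta)
    K = max(2, int(np.sqrt(N)))
    d = np.linalg.norm(past_zeta - zeta, axis=1)   # Euclidean distances
    idx = np.argsort(d)[:K]                        # K nearest neighbours
    if weights == "uniform":
        w = np.ones(K)
    else:                                          # linear: 1 at the nearest, 0 at the K-th
        dK = d[idx][-1]
        w = 1.0 - d[idx] / dK if dK > 0 else np.ones(K)
        if w.sum() == 0:
            w = np.ones(K)
    sK = past_s[idx]
    mu = (w[:, None] * sK).sum(axis=0) / w.sum()
    resid = sK - mu
    Sigma = np.einsum('n,ni,nj->ij', w, resid, resid) / w.sum()
    return mu, Sigma
```

Only one pseudo-data set is simulated per iteration; the remaining information comes from the stored history, which is what makes each ABSL step much cheaper than a BSL step with m simulations.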
To get a rough idea of the transition kernel, we propose to learn it during the burn-in period with J adaptation points; Algorithm 7 outlines the entire ABSL method. At the end of a simulation of size M the sequence of MCMC samples is θ(0), ..., θ(M), while the sequence of proposals is ζ1, ..., ζM. For the simulations below we set c = 1.5 and J = 15, to be consistent with the AABC-MCMC, ABC-MCMC-M and ABC-SMC procedures.
Chapter 9
Simulations
9.1 Details
In this section we carry out a series of simulations to compare the performance of the various algorithms. We analyze a simple Moving Average model of lag 2, Ricker's model, a Stochastic Volatility model with Gaussian emission noise and, finally, a Stochastic Volatility model with α-Stable errors. For all these models, simulating pseudo-data for any parameter value is simple and computationally fast, yet standard estimation methods can be quite challenging (especially for the last three models); it is therefore tempting to implement ABC- and BSL-type algorithms for inference in these examples.
For the ABC samplers, before running an MCMC chain we estimate the initial and final thresholds ε0 and ε15 (15 equal steps on the log scale were used for all models) and the matrix A used in the discrepancy calculation δ = d(S(y), S(y0)) = (S(y) − S(y0))^T A (S(y) − S(y0)). To estimate A, we first set it to the identity matrix and generate 500 pairs {ζi, yi}, i = 1, ..., 500, from p(ζ)f(y|ζ). Next we calculate the discrepancies {ζi, δi} with δi = d(S(yi), S(y0)) and pick the ζ∗ with the smallest discrepancy. Finally, we generate 100 pseudo-data sets (y1, ..., y100) from f(y|ζ∗), compute the corresponding summary statistics (s1, ..., s100), and set A to be the inverse of the covariance matrix of (s1, ..., s100). This procedure (with an updated A) is repeated several times; at the final stage we calculate (δ1, ..., δ100) and set ε0 to be the 5% quantile of these discrepancies. The numbers of simulations, 500 and 100, were chosen for computational convenience and are not driven by any theoretical arguments. To estimate the final threshold ε15 we use the Random Walk version of Algorithm 4 with M = B = 5000 and initial threshold ε0. We add one modification by setting εj, j = 1, ..., 15, equal to the 1% quantile of the discrepancies δ generated between adaptation points aj−1 and aj; the final threshold is set to ε15. Intuitively, the ε-sequence should decrease as the chain states move closer to the "true" parameter. Note that this chain is only used to approximate the final ε and cannot be used to study the properties of the approximate posterior. Intermediate values ε1, ..., ε14 are then computed as equidistant points on the natural log scale between ε0 and ε15.
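The discrepancy with the estimated scaling matrix A and the log-scale ε-sequence can be sketched as follows (function names are ours; the full iterative re-estimation of A is described above):

```python
import numpy as np

def estimate_A(stats):
    """A = inverse of the covariance matrix of summary statistics simulated
    at the current best candidate parameter."""
    return np.linalg.inv(np.cov(np.asarray(stats), rowvar=False))

def discrepancy(s, s0, A):
    """delta = (S(y) - S(y0))^T A (S(y) - S(y0))."""
    diff = np.asarray(s) - np.asarray(s0)
    return float(diff @ A @ diff)

def log_eps_schedule(eps0, epsJ, J):
    """J + 1 thresholds, equidistant on the natural log scale."""
    return np.exp(np.linspace(np.log(eps0), np.log(epsJ), J + 1))
```

With A set to an inverse covariance, the discrepancy is a Mahalanobis-type distance, which puts the components of the summary statistic on a comparable scale.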
We compare the following algorithms:
(SMC) Standard Sequential Monte Carlo for ABC;
(ABC-RW) The modified ABC-MCMC algorithm which updates ε and the random walk Metropolis
transition kernel during burn-in;
(ABC-IS) The modified ABC-MCMC algorithm which updates ε and the Independent Metropolis
transition kernel during burn-in;
(BSL-RW) Modified BSL where it adapts the random walk Metropolis transition kernel during
burn-in;
(BSL-IS) Modified BSL where it adapts the independent Metropolis transition kernel during
burn-in;
(AABC-U) Approximated ABC-MCMC with independent proposals and uniform (U) weights;
(AABC-L) Approximated ABC-MCMC with independent proposals and linear (L) weights;
(ABSL-U) Approximated BSL-MCMC with independent proposals and uniform (U) weights;
(ABSL-L) Approximated BSL-MCMC with independent proposals and linear (L) weights.
(Exact) When likelihood is computable, posterior samples were generated using MCMC.
For SMC, 500 particles were used; the total number of iterations for ABC-RW, ABC-IS, AABC-U, AABC-L, ABSL-U and ABSL-L is 50,000, with 10,000 for burn-in. Since BSL-RW and BSL-IS are much more computationally expensive, their total number of iterations was fixed at 10,000, with 2,000 for burn-in and 50 simulations of y for every proposed ζ∗ (i.e., m = 50). The Exact chain was run for 5,000 iterations with 2,000 for burn-in. It must be pointed out that all approximate samplers are based on the same summary statistics, the same discrepancy function and the same ε-sequence, so that they all start with the same initial conditions.
9.2 Measures for Comparisons
For more reliable results we compare the sampling algorithms under data set replications; in this study we set the number of replicates to R = 100, so that for each model 100 data sets were generated and each one was analyzed with the sampling methods described above. Assorted statistics and measures were calculated for every model and data set. Let θ(t)_rs denote the posterior samples from replicate r = 1, ..., R, iteration t = 1, ..., M and parameter component s = 1, ..., q, and similarly let θ̃(t)_rs denote the posterior samples from an exact chain (all draws are taken after the burn-in period). We also let θ^true_s denote the true parameter that generated the data. Moreover, let D_rs(x) and D̃_rs(x) be the estimated density functions at replicate r = 1, ..., R and component s = 1, ..., q for the approximate and exact chains respectively. Then the following quantities are defined:
Diff in mean (DIM) = Mean_{r,s}( |Mean_t(θ(t)_rs) − Mean_t(θ̃(t)_rs)| ),
Diff in covariance (DIC) = Mean_{r,s}( |Cov_t(θ(t)_rs) − Cov_t(θ̃(t)_rs)| ),
Total Variation (TV) = Mean_{r,s}( 0.5 ∫ |D_rs(x) − D̃_rs(x)| dx ),
Bias² = Mean_s( (Mean_{t,r}(θ(t)_rs) − θ^true_s)² ),
Var = Mean_s( Var_r( Mean_t(θ(t)_rs) ) ),
MSE = Bias² + Var,
where Mean_t(a_st) is defined as the average of {a_st} over the index t, and Var_t(a_st) and Cov_t(a_st) similarly denote the variance and covariance. The first three measures are useful in determining how close the posterior draws from the different samplers are to the draws generated by the exact chain (when it is available). On the other hand, the last three are standard quantities that measure how close, in mean square, the posterior means are to the true parameters that generated the data. To study the efficiency of the proposed algorithms we also need to take into account the CPU time that it takes to run a chain, as well as auto-correlation properties.
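For instance, the total variation measure above can be approximated from the two sets of posterior draws using histogram density estimates on a common grid (our own sketch; the thesis does not prescribe a particular density estimator):

```python
import numpy as np

def tv_from_samples(x, y, bins=50):
    """Approximate 0.5 * int |D(x) - D~(x)| dx between two posterior sample
    sets via histogram density estimates on a shared set of bin edges."""
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    edges = np.linspace(lo, hi, bins + 1)
    px, _ = np.histogram(x, bins=edges, density=True)
    py, _ = np.histogram(y, bins=edges, density=True)
    width = edges[1] - edges[0]
    return 0.5 * np.sum(np.abs(px - py)) * width
```

The result lies between 0 (identical densities) and 1 (disjoint supports), matching the usual normalization of the total variation distance.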
Define the auto-correlation time (ACT) for each parameter component and replicate of samples θ(t)_rs as

ACT_rs = 1 + 2 Σ_{a=1}^∞ ρ_a(θ(t)_rs),

where ρ_a is the auto-correlation coefficient at lag a. In practice we sum over all lags up to the first negative correlation. Letting M − B be the number of chain iterations (after burn-in) and CPU_r the total CPU time needed to run the whole chain in replicate r, we introduce the Effective Sample Size (ESS) and the Effective Sample Size per CPU (ESS/CPU) as

ESS = Mean_{rs}((M − B)/ACT_rs),
ESS/CPU = Mean_{rs}((M − B)/(ACT_rs · CPU_r)). (9.1)
Note that these indicators are averaged over parameter components and replicates. ESS can intuitively be thought of as the approximate number of "independent" samples out of M − B: the higher the ESS, the more efficient the sampling algorithm. When ESS is combined with CPU time (ESS/CPU), it provides a powerful indicator of an MCMC's efficiency. Generally, the sampler with the highest ESS/CPU is preferred, as it produces the largest number of "independent" draws per unit time.
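The ACT truncation rule and the resulting per-chain ESS can be sketched as follows (our own implementation of the rule stated above):

```python
import numpy as np

def autocorr_time(chain):
    """ACT = 1 + 2 * sum of lag autocorrelations, truncated at the first
    negative coefficient (the practical rule described above)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    # acf[a] is the lag-a sample autocorrelation; acf[0] == 1 by construction
    acf = np.correlate(x, x, mode="full")[n - 1:] / (n * x.var())
    act = 1.0
    for a in range(1, n):
        if acf[a] < 0:
            break
        act += 2.0 * acf[a]
    return act

def ess(chain):
    """Effective sample size of a single chain, (M - B) / ACT as in (9.1)."""
    return len(chain) / autocorr_time(chain)
```

For an i.i.d. chain the ACT is close to 1 and the ESS close to the chain length, while strongly autocorrelated chains give a much smaller ESS.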
9.3 Moving Average Model
A popular toy example for checking the performance of ABC and BSL techniques is the MA2 model:

z_i iid∼ N(0, 1), i = −1, 0, 1, ..., n,
y_i = z_i + θ1 z_{i−1} + θ2 z_{i−2}, i = 1, ..., n. (9.2)
The data are represented by the sequence y = {y1, · · · , yn}. It is well known that Yi follow
a stationary distribution for any θ1, θ2, but there are conditions required for identifiability.
Hence, we impose a uniform prior on the following set:
θ1 + θ2 > −1,  θ1 − θ2 < 1,  −2 < θ1 < 2,  −1 < θ2 < 2. (9.3)
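For concreteness, simulating from (9.2) and computing the sample variance and autocovariances used as summary statistics can be sketched as (our own code):

```python
import numpy as np

def simulate_ma2(theta1, theta2, n, rng):
    """Generate y_i = z_i + theta1*z_{i-1} + theta2*z_{i-2}, z_i ~ N(0,1)."""
    z = rng.normal(size=n + 2)            # z_{-1}, z_0, z_1, ..., z_n
    return z[2:] + theta1 * z[1:-1] + theta2 * z[:-2]

def summary_ma2(y):
    """S(y) = (gamma0, gamma1, gamma2): sample variance and the sample
    autocovariances at lags 1 and 2."""
    n = len(y)
    yc = y - y.mean()
    return np.array([yc @ yc / n,
                     yc[1:] @ yc[:-1] / n,
                     yc[2:] @ yc[:-2] / n])
```

For θ1 = θ2 = 0.6 the theoretical values are γ0 = 1 + θ1² + θ2² = 1.72, γ1 = θ1 + θ1θ2 = 0.96 and γ2 = θ2 = 0.6, which the sample statistics approach as n grows.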
The joint distribution of y is easily seen to be multivariate Gaussian with mean 0, diagonal variances 1 + θ1² + θ2², covariances θ1 + θ1θ2 and θ2 at lags 1 and 2 respectively, and zero at other lags. In this case, (Exact) sampling is feasible. For the simulations we set {θ1 = 0.6, θ2 = 0.6}, n = 200, and define the summary statistics S(y) = (γ0(y), γ1(y), γ2(y)) as the sample variance and covariances at lags 1 and 2. First we show results based on one
replicate. Figure 9.1 shows the trace plots, histograms and auto-correlation functions esti-
89
mated from posterior draws for parameters θ1 and θ2 for the AABC-U sampler. Note that
only post burn-in samples are shown. Similarly Figure 9.2 displays behavior of ABSL-U
Figure 9.1: MA2 model: AABC-U Sampler. Each row corresponds to parameters θ1 (toprow) and θ2 (bottom row) and shows in order from left to right: Trace-plot, Histogram andAuto-correlation function. Red lines represent true parameter values.
Figure 9.2: MA2 model: ABSL-U Sampler. Each row corresponds to parameters θ1 (toprow) and θ2 (bottom row) and shows in order from left to right: Trace-plot, Histogram andAuto-correlation function. Red lines represent true parameter values.
Figure 9.3: MA2 model: ABC-RW Sampler. Each row corresponds to parameters θ1 (toprow) and θ2 (bottom row) and shows in order from left to right: Trace-plot, Histogram andAuto-correlation function. Red lines represent true parameter values.
sampler. These algorithms can be compared to the standard ABC-RW method in Figure 9.3.
In the interest of keeping the exposition within reasonable limits, we do not report the performance
of the remaining algorithms; we note that AABC-L behaves similarly to AABC-U, ABSL-L to
ABSL-U, and ABC-IS is generally less efficient than ABC-RW. From these plots it
is evident that the proposed AABC-U and ABSL-U mix much better than ABC-RW.
The auto-correlation functions for these two methods take quite small values because they use
an independent proposal, whereas the random walk proposal depends on the
current state.
To see how close the draws from the approximate samplers are to the draws from the exact
chain, we plot estimated densities in Figure 9.4. The left and right plots refer to θ1 and
θ2, respectively. The two upper plots compare the estimated density of the exact MCMC sampler
with the ABC-based ones (SMC, ABC-RW and AABC-U), while the two lower plots compare
the exact sampler with the Synthetic Likelihood based methods (BSL-IS and ABSL-U). All
approximate samplers' draws deviate from the exact samples; however, the posterior distribution of
AABC-U is very similar to those of SMC and ABC-RW, and likewise the distribution produced by ABSL-U
is very close to that of BSL-IS. This observation holds for both components, θ1 and θ2. The
difference between the approximate posterior distributions produced by simulation-based methods
and the exact posterior is probably due to the choice of summary statistic, which does not
Figure 9.4: MA model: Estimated densities for each component. First row compares Exact,SMC, ABC-RW and AABC-U samplers. Second row compares Exact, BSL-IS and ABSL-U.Columns correspond to parameter’s components, from left to right: θ1 and θ2.
capture the information about the parameters in the most effective way.
To study the accuracy, precision and efficiency of the proposed samplers we perform a simulation
study in which 100 data sets are generated and all samplers are run on every data set. The
results are summarized in Table 9.1.

Table 9.1: Simulation Results (MA model): Average Difference in mean, Difference in
covariance, Total variation, square roots of Bias, Variance and MSE, Effective sample size
and Effective sample size per CPU time, for every sampling algorithm.

Sampler   DIM    DIC     TV     √Bias²  √Var   √MSE   ESS    ESS/CPU
SMC       0.082  0.0045  0.418  0.014   0.115  0.116  471    0.505
ABC-RW    0.088  0.0063  0.466  0.016   0.123  0.124  23     0.231
ABC-IS    0.084  0.0067  0.455  0.016   0.115  0.116  44     0.389
AABC-U    0.083  0.0071  0.444  0.018   0.116  0.117  3446   6.215
AABC-L    0.080  0.0067  0.438  0.017   0.112  0.113  2820   5.107
BSL-RW    0.082  0.0070  0.438  0.015   0.114  0.115  252    0.282
BSL-IS    0.081  0.0070  0.436  0.015   0.114  0.115  841    0.923
ABSL-U    0.081  0.0095  0.443  0.017   0.114  0.115  3950   5.584
ABSL-L    0.082  0.0078  0.441  0.015   0.114  0.115  4165   6.030
(DIM, DIC and TV measure the difference from the exact sampler; √Bias², √Var and √MSE the
difference from the true parameter.)

Examining this table we immediately note that the ESS/CPU measure is much larger for the
proposed algorithms than for the standard methods. The
improvement is very substantial: for example, ESS/CPU for AABC-U is 12 times larger than
for the best standard ABC procedure, SMC. Similar results hold for Bayesian Synthetic
Likelihood. The main reason for this efficiency is the use of past draws to make the decision
about accepting or rejecting a proposal. The improvement in efficiency is of no use if the resulting
posterior distributions are very different from the exact one. It is therefore essential to examine
the DIM, DIC, TV and MSE quantities, which measure how close the posterior draws are to
samples generated by the exact MCMC; for all of these, the smaller the value, the better
the sampler. We see that all these measures for AABC-U and AABC-L are very similar to those of
SMC, ABC-RW and ABC-IS, and frequently outperform them; the same holds for the BSL approach.
Another observation is that the approximate algorithms with uniform and linear weights
generally perform very similarly.
9.4 Ricker’s Model
Ricker's model is analyzed very frequently to test Synthetic Likelihood procedures [94, 73].
It is a particular instance of a hidden Markov model:

x_{−49} = 1;  z_i ∼iid N(0, exp(θ2)²),  i = −48, · · · , n,
x_i = exp(exp(θ1)) x_{i−1} exp(−x_{i−1} + z_i),  i = −48, · · · , n,
y_i ∼ Pois(exp(θ3) x_i),  i = −48, · · · , n,    (9.4)
where Pois(λ) is the Poisson distribution with mean parameter λ and n = 100. Only the sequence
y = (y_1, · · · , y_n) is observed; the first 50 values are discarded as burn-in. Note that all parameters
θ = (θ1, θ2, θ3) are unrestricted; the prior is given as (each parameter independent):

θ1 ∼ N(0, 1),
θ2 ∼ Unif(−2.3, 0),
θ3 ∼ N(0, 4).    (9.5)
We restrict the range of θ2 because all algorithms become unstable for θ2 outside this interval. Note
that the marginal distribution of y is not available in closed form, but the transition distribution
of the hidden variables X_i|x_{i−1} and the emission probabilities Y_i|x_i are known, and hence we can
run Particle MCMC (PMCMC) [8] or Ensemble MCMC [82] to sample from the posterior
distribution π(θ|y_0). Here we utilize Particle MCMC with 100 particles. As
suggested in [94] we set θ0 = (log(3.8), 0.9, 2.3) and define the summary statistic S(y) as the
14-dimensional vector whose components are:

(C1) #{i : y_i = 0},
(C2) The average of y, ȳ,
(C3–C7) Sample auto-correlations at lags 1 through 5,
(C8–C11) Coefficients β0, β1, β2, β3 of the cubic regression
(y_i − y_{i−1}) = β0 + β1 y_i + β2 y_i² + β3 y_i³ + ε_i,  i = 2, . . . , n,
(C12–C14) Coefficients β0, β1, β2 of the quadratic regression
y_i^{0.3} = β0 + β1 y_{i−1}^{0.3} + β2 y_{i−1}^{0.6} + ε_i,  i = 2, . . . , n.
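The data-generating step in (9.4) can be sketched as follows (the function names are mine, not from the thesis; numpy's `Generator.poisson` supplies the Poisson draws, and only a few of the 14 summary components are illustrated):

```python
import numpy as np

def simulate_ricker(theta, n=100, burn=50, rng=None):
    """Generate y_1..y_n from Ricker's model (9.4); the first `burn`
    hidden-state steps are discarded, as in the text."""
    rng = rng or np.random.default_rng()
    t1, t2, t3 = theta
    r, sigma, phi = np.exp(np.exp(t1)), np.exp(t2), np.exp(t3)
    x = 1.0                              # x_{-49} = 1
    ys = []
    for i in range(burn + n):
        x = r * x * np.exp(-x + sigma * rng.normal())
        if i >= burn:
            ys.append(rng.poisson(phi * x))
    return np.array(ys)

def ricker_summaries(y):
    """A few of the 14 components: zero count (C1), mean (C2),
    lag-1 auto-correlation (part of C3:C7)."""
    yc = y - y.mean()
    return np.array([np.sum(y == 0), y.mean(),
                     np.sum(yc[1:] * yc[:-1]) / np.sum(yc**2)])
```

Regenerating such pseudo-data for every proposed θ is the expensive step that the approximate samplers in this chapter try to amortize.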
Figures 9.5, 9.6 and 9.7 show trace-plots, histograms and the ACF for the AABC-U,
ABSL-U and ABC-RW samplers for each component (red lines correspond to the true
parameters). We show ABC-RW instead of ABC-IS here since the latter has much worse
Figure 9.5: Ricker’s model: AABC-U Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
performance for this model. The main observation is that the mixing of AABC-U is much
better than that of ABC-RW, with smaller auto-correlation values. ABSL-U has higher auto-
correlations than AABC-U but still performs quite well. To see how close the draws from
Figure 9.6: Ricker’s model: ABSL-U Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.7: Ricker’s model: ABC-RW Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
simulation-based algorithms are to the draws from the exact chain, we plot estimated densities
in Figure 9.8. The three upper plots (one per parameter component) compare the estimated
density of the exact PMCMC sampler (with 100 particles) with the ABC-based ones
(SMC, ABC-RW and AABC-U); the three lower plots compare the exact sampler with the Synthetic
Likelihood based methods (BSL-RW and ABSL-U). Here we have chosen BSL-RW over BSL-IS
since it has better overall performance in this model. Observe that the ABC-based samplers
Figure 9.8: Ricker’s model: Estimated densities for each component. First row comparesExact, SMC, ABC-RW and AABC-U samplers. Second row compares Exact, BSL-RW andABSL-U. Columns correspond to parameter’s components, from left to right: θ1, θ2 and θ3.
(SMC, ABC-RW and AABC-U) have very similar estimated densities, and the densities of the Synthetic
Likelihood methods are also similar to one another. For the second component there is a fairly large
difference between the exact and approximate posteriors, which may be due to non-informative summary
statistics.
A more general study, in which results are averaged over 100 independent replicates, is
shown in Table 9.2. Again, the proposed strategies clearly outperform the others in overall efficiency
(ESS/CPU). For instance, AABC-U is about 10 times more efficient than standard SMC and
ABSL-U is 6 times more efficient than BSL-RW. At the same time, DIM, DIC, TV and MSE
are generally smaller for the approximate methods than for the standard ones. It is therefore evident
that, for this model, the improvement in sampler efficiency (the number of independent draws
per unit of CPU time) does not decrease the accuracy and precision of the posterior moments.
Table 9.2: Simulation Results (Ricker's model): Average Difference in mean, Difference in
covariance, Total variation, square roots of Bias, Variance and MSE, Effective sample size
and Effective sample size per CPU time, for every sampling algorithm.

Sampler   DIM    DIC     TV     √Bias²  √Var   √MSE   ESS    ESS/CPU
SMC       0.152  0.0177  0.378  0.086   0.201  0.219  472    0.521
ABC-RW    0.135  0.0201  0.389  0.059   0.180  0.189  87     0.199
ABC-IS    0.139  0.0215  0.485  0.063   0.195  0.205  47     0.099
AABC-U    0.147  0.0279  0.402  0.076   0.190  0.204  3563   4.390
AABC-L    0.141  0.0258  0.392  0.070   0.189  0.201  4206   5.193
BSL-RW    0.129  0.0080  0.382  0.038   0.206  0.209  131    0.030
BSL-IS    0.122  0.0082  0.455  0.022   0.197  0.198  33     0.007
ABSL-U    0.103  0.0054  0.377  0.023   0.170  0.171  284    0.180
ABSL-L    0.106  0.0051  0.382  0.012   0.173  0.173  207    0.135
(DIM, DIC and TV measure the difference from the exact sampler; √Bias², √Var and √MSE the
difference from the true parameter.)
9.5 Stochastic Volatility with Gaussian emissions
When analyzing stationary time series one frequently observes periods of high and periods of low
volatility. This phenomenon is called volatility clustering; see for example
[59]. One way to model such behaviour is through a Stochastic Volatility (SV) model,
where the variances of the observed time series depend on hidden states that themselves form
a stationary time series. Consider the following model, which depends on three parameters
(θ1, θ2, θ3):

x_1 ∼ N(0, 1/(1 − θ1²));  v_i ∼iid N(0, 1);  w_i ∼iid N(0, 1),  i = 1, · · · , n,
x_i = θ1 x_{i−1} + v_i,  i = 2, · · · , n,
y_i = √(exp(θ2 + exp(θ3) x_i)) w_i,  i = 1, · · · , n.    (9.6)
Only y = (y_1, · · · , y_n) is observed, while (x_1, · · · , x_n) are hidden states. The first parameter,
θ1, must be between −1 and 1 and controls the auto-correlation of the hidden states; θ2 and θ3 are
unrestricted and determine how the hidden states influence the variability of the observed series.
Note that for fixed hidden states the distribution of the observed variable is normal, which
might not be appropriate in some examples. We introduce the following priors, independently
for each parameter:

θ1 ∼ Unif(0, 1),
θ2 ∼ N(0, 1),
θ3 ∼ N(0, 1).    (9.7)
We set the true parameters to (θ1 = 0.95, θ2 = −2, θ3 = −1) and the length of the time series
to n = 500. Since the marginal distribution of y is not known in closed form, a standard MCMC
strategy cannot be implemented; we use Particle MCMC (PMCMC) as the exact sampling
scheme. Since pseudo-data sets can easily be generated for every parameter value, the SV model
is a good example for demonstrating the performance of the algorithms considered here. For
summary statistics we use a 7-dimensional vector whose components are:

(C1) #{i : y_i² > quantile(y_0², 0.99)},
(C2) The average of y²,
(C3) The standard deviation of y²,
(C4) The sum of the first 5 auto-correlations of y²,
(C5) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.1)}}_{i=1}^n,
(C6) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.5)}}_{i=1}^n,
(C7) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.9)}}_{i=1}^n.
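A minimal sketch of generating pseudo-data from (9.6) and computing the seven summaries above (the function names are my own, not from the thesis):

```python
import numpy as np

def simulate_sv(theta, n=500, rng=None):
    """Generate y_1..y_n from the Gaussian SV model (9.6)."""
    rng = rng or np.random.default_rng()
    t1, t2, t3 = theta
    x = np.empty(n)
    x[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - t1**2))   # stationary start
    for i in range(1, n):
        x[i] = t1 * x[i - 1] + rng.normal()
    return np.sqrt(np.exp(t2 + np.exp(t3) * x)) * rng.normal(size=n)

def sum_first_acfs(x, k=5):
    """Sum of the first k sample auto-correlations of x."""
    xc = x - x.mean()
    denom = np.sum(xc**2)
    return sum(np.sum(xc[a:] * xc[:-a]) / denom for a in range(1, k + 1))

def sv_summaries(y, y0):
    """The 7 components (C1)-(C7); y0 is the observed series used in (C1)."""
    y2 = y**2
    stats = [np.sum(y2 > np.quantile(y0**2, 0.99)), y2.mean(), y2.std(),
             sum_first_acfs(y2)]
    for tau in (0.1, 0.5, 0.9):
        stats.append(sum_first_acfs((y2 < np.quantile(y2, tau)).astype(float)))
    return np.array(stats)
```

Because the parameters only affect the variability of y, all summaries are functions of y², mirroring the discussion below.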
Here quantile(y, τ) is defined as the τ-quantile of the sequence y. As was shown in [81] and [24],
the auto-correlations of such indicators (at different quantiles) can be very useful in characterizing
a time series, which is why we have added (C5), (C6) and (C7) to the summary
statistic. We focus on y² and its auto-correlations since the model parameters only affect the
variability of y (the auto-correlation of y is zero at any lag). Figures 9.9, 9.10 and 9.11 show
trace-plots, histograms and the ACF for the AABC-U, ABSL-U and ABC-RW samplers,
respectively, for each component (red lines correspond to the true parameters). The major
observation is that the mixing of AABC-U is much better than that of ABC-RW, with smaller auto-
correlation values. ABSL-U has higher auto-correlations than AABC-U but still performs
well. In Figure 9.12 we compare the sample-based kernel smoothing density estimates (for the
synthetic likelihood approach we show BSL-IS rather than BSL-RW). We note that all samples obtained from the approximate
Figure 9.9: SV model: AABC-U Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.10: SV model: ABSL-U Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
algorithms are close to the exact posterior (produced using PMCMC with 100 particles). Generally,
all ABC-based samplers perform similarly; on the other hand, ABSL-U performs worse than
the generic BSL-IS in this run, as its density is shifted away from the exact posterior for θ1 and θ3.
Figure 9.11: SV model: ABC-RW Sampler. Each row corresponds to parameters θ1 (toprow), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot,Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.12: SV model: Estimated densities for each component. First row compares Exact,SMC, ABC-RW and AABC-U samplers. Second row compares Exact, BSL-IS and ABSL-U.Columns correspond to parameter’s components, from left to right: θ1, θ2 and θ3.
To draw more general conclusions, Table 9.3 shows results averaged over 100 data replicates.
Again we note that the proposed algorithms outperform the benchmark samplers by a factor of
about 8 in ESS/CPU. Moreover, AABC-U and AABC-L have very similar or smaller values
Table 9.3: Simulation Results (SV model): Average Difference in mean, Difference in
covariance, Total variation, square roots of Bias, Variance and MSE, Effective sample size
and Effective sample size per CPU time, for every sampling algorithm.

Sampler   DIM    DIC     TV     √Bias²  √Var   √MSE   ESS    ESS/CPU
SMC       0.232  0.0428  0.417  0.187   0.255  0.316  471    0.336
ABC-RW    0.210  0.0396  0.459  0.228   0.255  0.342  31     0.097
ABC-IS    0.179  0.0439  0.460  0.196   0.219  0.294  30     0.090
AABC-U    0.194  0.0447  0.424  0.212   0.217  0.304  1793   2.445
AABC-L    0.189  0.0441  0.420  0.211   0.235  0.316  1659   2.253
BSL-RW    0.200  0.0360  0.411  0.175   0.227  0.287  131    0.043
BSL-IS    0.195  0.0362  0.404  0.175   0.225  0.285  346    0.113
ABSL-U    0.229  0.0422  0.551  0.184   0.241  0.303  871    0.822
ABSL-L    0.231  0.0410  0.548  0.197   0.240  0.311  843    0.817
(DIM, DIC and TV measure the difference from the exact sampler; √Bias², √Var and √MSE the
difference from the true parameter.)
for DIM, TV and MSE, which demonstrates that these samplers are much more efficient
than the standard methods while producing parameter estimates that are as accurate as (or more
accurate than) those of the generic algorithms.
ABSL-U and ABSL-L, on the other hand, did not perform well for this model: TV and MSE
for these samplers are roughly 10% larger than for the generic ones.
9.6 Stochastic Volatility with α-Stable errors
As pointed out in the previous section, the standard SV model assumes that, conditional
on the hidden states, the observed variables have a normal distribution, which is a strong assumption.
Financial time series frequently exhibit large sudden drops that are very unlikely under
Gaussianity. It has therefore been suggested to use heavy-tailed distributions (instead of Gaussian)
to model financial data. We consider the family of α-Stable distributions, Stab(α, β),
with two parameters α ∈ (0, 2] (stability parameter) and β ∈ [−1, 1] (skewness parameter). Two
special cases are α = 1 and α = 2, which correspond to the Cauchy and Gaussian distributions
respectively; note that for α < 2 the distribution has infinite variance. We define the
following SV model with α-Stable errors, with four parameters (θ1, θ2, θ3, θ4):

x_1 ∼ N(0, 1/(1 − θ1²));  v_i ∼iid N(0, 1);  w_i ∼iid Stab(θ4, −1),  i = 1, · · · , n,
x_i = θ1 x_{i−1} + v_i,  i = 2, · · · , n,
y_i = √(exp(θ2 + exp(θ3) x_i)) w_i,  i = 1, · · · , n.    (9.8)
This model is very similar to the simple SV model, the only difference being that the emission
errors follow an α-Stable distribution with unknown stability parameter and fixed skewness of −1.
We prefer a negatively skewed emission distribution in order to model large negative financial
returns. As in the previous simulation example, θ2 and θ3 are unrestricted. The prior distribution
for this model is (independently for each parameter):

θ1 ∼ Unif(0, 1),
θ2 ∼ N(0, 1),
θ3 ∼ N(0, 1),
θ4 ∼ Unif(1.5, 2).    (9.9)
We set the true parameters to (θ1 = 0.95, θ2 = −2, θ3 = −1, θ4 = 1.8) and the length of the
time series to n = 500. The major challenge with this model is that there is no closed-form
density for α-Stable distributions. Hence, most MCMC samplers, including PMCMC and
ensemble MCMC, cannot be used to sample from the posterior. However, sampling from this
family of distributions is feasible, which makes it particularly amenable to simulation-based
methods like ABC and BSL. For summary statistics we use a 7-dimensional vector whose
components are:

(C1) #{i : y_i² > quantile(y_0², 0.99)},
(C2) The average of y²,
(C3) The standard deviation of y²,
(C4) The sum of the first 5 auto-correlations of y²,
(C5) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.1)}}_{i=1}^n,
(C6) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.5)}}_{i=1}^n,
(C7) The sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.9)}}_{i=1}^n.
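Pseudo-data generation only requires draws from Stab(θ4, −1), which are available without a closed-form density. The following sketch uses the Chambers–Mallows–Stuck transform (this implementation and the function names are mine, not from the thesis; `scipy.stats.levy_stable` would be an alternative):

```python
import numpy as np

def rstable(alpha, beta, size, rng):
    """Chambers-Mallows-Stuck draws from a standardized Stab(alpha, beta);
    this form is valid for alpha != 1 (the relevant case here, alpha > 1.5)."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    b = np.arctan(beta * np.tan(np.pi * alpha / 2)) / alpha
    s = (1 + beta**2 * np.tan(np.pi * alpha / 2)**2) ** (1 / (2 * alpha))
    return (s * np.sin(alpha * (u + b)) / np.cos(u) ** (1 / alpha)
            * (np.cos(u - alpha * (u + b)) / w) ** ((1 - alpha) / alpha))

def simulate_sv_stable(theta, n=500, rng=None):
    """Generate y_1..y_n from the alpha-Stable SV model (9.8)."""
    rng = rng or np.random.default_rng()
    t1, t2, t3, t4 = theta
    x = np.empty(n)
    x[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - t1**2))
    for i in range(1, n):
        x[i] = t1 * x[i - 1] + rng.normal()
    return np.sqrt(np.exp(t2 + np.exp(t3) * x)) * rstable(t4, -1.0, n, rng)
```

At α = 2 the transform reduces to a Gaussian with variance 2 (the usual convention for the standardized stable law), which provides a convenient sanity check.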
Figures 9.13, 9.14 and 9.15 show trace-plots, histograms and the ACF for the AABC-U,
ABSL-U and ABC-RW samplers, respectively, for each component (red lines correspond to
the true parameters). As in the previous examples, the mixing of AABC-U and ABSL-U is
Figure 9.13: SV α-Stable model: AABC-U Sampler. Each row corresponds to parameters θ1
(top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and shows in orderfrom left to right: Trace-plot, Histogram and Auto-correlation function. Red lines representtrue parameter values.
much better than that of ABC-RW. Since exact sampling is not feasible in this example, we
compare the samplers to SMC (instead of exact samples); the estimated densities are plotted
in Figure 9.16. Here we have chosen BSL-IS over BSL-RW because it has better overall
performance in this model. Generally, all simulation-based samplers have similar densities in
this example.
For more general conclusions, Table 9.4 shows results averaged over 100 data replicates.
Here, to calculate DIM, DIC and TV, the samplers are compared to SMC since exact draws
cannot be obtained. As in previous examples, the ESS/CPU values for AABC-U, AABC-L, ABSL-
U and ABSL-L are roughly 8 times larger than those of the benchmark algorithms. For this example,
looking at DIM, DIC and TV may be misleading since the approximate samplers are compared
to another approximate sampler. The MSE measure is much more informative; it is very similar
across the ABC-based and BSL-based algorithms. We can therefore conclude that the proposed
samplers perform very well in this example.
Figure 9.14: SV α-Stable model: ABSL-U Sampler. Each row corresponds to parameters θ1
(top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and shows in orderfrom left to right: Trace-plot, Histogram and Auto-correlation function. Red lines representtrue parameter values.
Figure 9.15: SV α-Stable model: ABC-RW Sampler. Each row corresponds to parameters θ1
(top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and shows in orderfrom left to right: Trace-plot, Histogram and Auto-correlation function. Red lines representtrue parameter values.
Figure 9.16: SV α-Stable model: Estimated densities for each component. First row com-pares SMC, ABC-RW and AABC-U samplers. Second row compares SMC, BSL-IS andABSL-U. Columns correspond to parameter’s components, from left to right: θ1, θ2, θ3 andθ4.
Table 9.4: Simulation Results (SV α-Stable model): Average Difference in mean, Differencein covariance, Total variation, square roots of Bias, variance and MSE, Effective sample sizeand Effective sample size per CPU time, for every sampling algorithm. In DIM, DIC andTV, samplers are compared to SMC.
Sampler   DIM    DIC     TV     √Bias²  √Var   √MSE   ESS    ESS/CPU
SMC       0.000  0.0000  0.000  0.221   0.201  0.299  468    0.267
ABC-RW    0.078  0.0126  0.205  0.248   0.198  0.317  24     0.069
ABC-IS    0.082  0.0151  0.306  0.232   0.221  0.320  26     0.071
AABC-U    0.069  0.0124  0.170  0.250   0.183  0.310  1303   1.617
AABC-L    0.069  0.0132  0.161  0.246   0.181  0.305  1256   1.546
BSL-RW    0.044  0.0116  0.122  0.225   0.181  0.289  123    0.037
BSL-IS    0.045  0.0103  0.125  0.226   0.177  0.287  285    0.084
ABSL-U    0.063  0.0133  0.228  0.225   0.181  0.289  832    0.735
ABSL-L    0.061  0.0140  0.230  0.236   0.183  0.299  757    0.671
(DIM, DIC and TV measure the difference from SMC; √Bias², √Var and √MSE the difference from
the true parameter.)
Chapter 10
Data Analysis
10.1 Dow-Jones log-returns
As a real-world example we consider Dow-Jones index daily log returns from January 1, 2010
until December 31, 2018. The data were downloaded from the Yahoo Finance1 website. Given a
time series of prices P_i, i = 1, · · · , n, log returns are calculated as:

r_i = log(P_i) − log(P_{i−1}),  i = 2, · · · , n.
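The return computation and the standardization described next (centering to mean zero and scaling by 200) can be sketched as follows (the function name is my own):

```python
import numpy as np

def transform_returns(prices, scale=200.0):
    """Log returns, centered to mean zero and multiplied by `scale`
    so that absolute values are not too small."""
    r = np.diff(np.log(np.asarray(prices, dtype=float)))
    return scale * (r - r.mean())
```

Applied to n daily closing prices this yields n − 1 transformed returns with mean exactly zero by construction.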
The resulting time series has length 2262. To make the log returns more suitable for analysis,
we standardize r_t by subtracting its mean and then multiply each return by 200, so that
absolute values are not too small; Figure 10.1 shows the transformed returns. This time series
(y_0) has mean zero by construction, and its auto-correlations and partial auto-correlations are
insignificant at every lag; however, the variances are clearly correlated, with periods
of low and high variability. To analyze its properties we therefore apply the Stochastic Volatility
model with α-Stable errors described in the previous chapter. Since the likelihood does not exist
in closed form for this class of models, simulation-based methods are probably the only available
tools for inference.
10.2 Analysis
The evolution of time series described by equation (9.8) (note that skewed parameter of
Stable distribution is fixed at value of −1) and parameters’ prior as in equation (9.9). To
1https://ca.finance.yahoo.com/
Figure 10.1: Dow Jones daily transformed log return for the period Jan 2010 – Dec 2018.
estimate the posterior distribution we run the AABC-U and ABSL-U samplers. The summary statistic
for both methods is the 7-dimensional vector of Section 9.6. Each chain was run for 100
thousand iterations, with the last 80 thousand used for inference. Figures 10.2 and 10.3 show
trace-plots and histograms for the AABC-U and ABSL-U samplers, respectively, for each param-
eter. We observe that, as in the simulation results, the mixing of AABC-U is generally
better than that of ABSL-U. The posterior draws of ABSL-U for the first three components are
uni-modal, symmetric and bell-shaped, very similar to Gaussian distributions, which is
perhaps unsurprising given the Gaussian form of the synthetic likelihood. Table 10.1 reports
the posterior means and 95% credible intervals
for every parameter and both samplers. AABC-U and ABSL-U produce similar results.
Table 10.1: Dow Jones log return stochastic volatility: 95% credible intervals and posterior
averages for the 4 parameters under the two proposed samplers (AABC-U and ABSL-U).

                           AABC-U                              ABSL-U
Parameter   2.5% Quantile  Average  97.5% Quantile   2.5% Quantile  Average  97.5% Quantile
θ1               0.787      0.899       0.990             0.775      0.856       0.959
θ2              -0.411     -0.147       0.112            -0.369     -0.092       0.222
θ3              -1.405     -0.790      -0.304            -1.858     -0.841      -0.206
θ4               1.758      1.916       1.997             1.721      1.909       1.996
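The entries of Table 10.1 are simple functionals of the retained draws; a minimal sketch, with synthetic draws standing in for the actual chains:

```python
import numpy as np

def posterior_summary(draws):
    """2.5% quantile, mean, and 97.5% quantile of a 1-D array of MCMC draws."""
    lo, hi = np.percentile(draws, [2.5, 97.5])
    return lo, draws.mean(), hi

# Synthetic draws standing in for the 80,000 retained posterior samples.
rng = np.random.default_rng(0)
draws = rng.normal(0.9, 0.05, size=80_000)
lo, mean, hi = posterior_summary(draws)
```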
We see that the estimated correlation between adjacent variables in the hidden layer of the stochastic
Figure 10.2: Dow Jones log returns: AABC-U sampler. Each column corresponds to one parameter component (left to right: θ1, θ2, θ3, θ4) and shows the trace-plot on top and the histogram on the bottom.
Figure 10.3: Dow Jones log returns: ABSL-U sampler. Each column corresponds to one parameter component (left to right: θ1, θ2, θ3, θ4) and shows the trace-plot on top and the histogram on the bottom.
volatility model is about 0.9, and the estimated index of the α-stable emission noise is 1.91,
which can produce more extreme values than standard Gaussian noise would predict. Moreover,
since 0 lies inside the credible interval for θ2, this parameter appears negligible. Overall,
this example shows that the proposed samplers AABC-U and ABSL-U can be implemented
successfully for real-world data problems.
Chapter 11
Theoretical Justifications
In this chapter we show that our novel approximate ABC-MCMC and BSL samplers with
independent proposals are ergodic in the long run. In other words, we show that as the
number of MCMC iterations increases, the marginal distribution of {θ(t)} converges in total
variation to the appropriate posterior distribution, and that sample averages converge to the
true expectations.
11.1 Preliminary Theorems
We start by reviewing our notation. Let p(θ) and q(θ) represent the prior and proposal distri-
butions for θ ∈ Θ, respectively. For AABC we define the function h(θ) = P(δ < ε | θ), where
δ = δ(y, y0) and y ∼ f(y|θ). Then, given a proposed ζ∗, the acceptance probability is

a(θ, ζ∗) = min(1, α(θ, ζ∗)),   α(θ, ζ∗) = [p(ζ∗) q(θ) h(ζ∗)] / [p(θ) q(ζ∗) h(θ)].   (11.1)
This MH procedure defines an exact transition kernel, which we call P(·, ·). Since h(θ) is not
available in closed form, we estimate it using a k-nearest-neighbour (kNN) approach.
Let ZN = {ζn, 1{δn<ε}}, n = 1, · · · , N, represent N independent samples from q(ζ)P(1{δ<ε}|ζ) for AABC.
In practice, ZN contains the past simulated samples saved before the Nth iteration. Given θ
and ζ∗, we apply kNN to approximate h(θ) and h(ζ∗) by computing locally weighted averages
of 1{δn<ε} over the ζn that are close to θ or ζ∗. We denote this estimate by h(θ; ZN), and the
probability of proposal acceptance for the perturbed algorithm (more on perturbed MCMC
can be found in [77, 71, 46]) is:
a(θ, ζ∗; ZN) = min(1, α(θ, ζ∗; ZN)),   α(θ, ζ∗; ZN) = [p(ζ∗) q(θ) h(ζ∗; ZN)] / [p(θ) q(ζ∗) h(θ; ZN)].   (11.2)
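A minimal sketch of the kNN estimate h(θ; ZN) entering (11.2), assuming uniform weights over the K nearest stored proposals; the acceptance surface used to generate the indicators below is hypothetical.

```python
import numpy as np

def knn_h_estimate(theta, zetas, indicators, K):
    """k-nearest-neighbour estimate of h(theta) = P(delta < eps | theta):
    uniform-weight average of the stored indicators 1{delta_n < eps}
    over the K stored proposals zeta_n closest to theta (a sketch)."""
    dists = np.linalg.norm(zetas - theta, axis=1)
    nearest = np.argsort(dists)[:K]
    return indicators[nearest].mean()

rng = np.random.default_rng(0)
N = 400
zetas = rng.uniform(-1, 1, size=(N, 2))            # stored past proposals
true_h = lambda z: 0.5 + 0.4 * z[:, 0]             # hypothetical acceptance surface
indicators = (rng.uniform(size=N) < true_h(zetas)).astype(float)

h_hat = knn_h_estimate(np.array([0.0, 0.0]), zetas, indicators, K=int(np.sqrt(N)))
```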
The approximate transition kernel is PN(·, ·) = E_{ZN}[PN(·, ·; ZN)]. The goal is to show that
as N → ∞ the distance between this transition kernel and the exact one converges to zero,
where the distance is defined as

‖PN − P‖ = sup_θ ‖PN(θ, ·) − P(θ, ·)‖TV,   (11.3)

with ‖·‖TV denoting the total variation distance between two measures. First we show
that, under a strong consistency assumption on h(θ; ZN), the perturbed kernel converges to the
exact one.
Theorem 11.1.1. Suppose Θ is compact, sup_θ |h(θ; ZN) − h(θ)| → 0 with probability 1, and
h(θ) > 0 for all θ ∈ Θ. Then for any ε > 0 there exists C such that ‖PN − P‖ < ε for all
N > C.
Next, let Pε = {PN : ‖PN − P‖ < ε} be the collection of perturbed kernels within distance ε
of the exact kernel. For illustration, consider the case where the auxiliary set ZN grows
with the number of iterations; then at each iteration a new kernel PN ∈ Pε is used in the
chain. We want to show that this procedure results in an ergodic chain with appropriate
convergence guarantees. For most of the results presented below we rely on the work of [47] on
convergence properties of perturbed kernels.
To obtain useful convergence results we make an additional Doeblin-condition assump-
tion on the exact kernel P:

Definition 11.1.1 (Doeblin Condition). A kernel P satisfies the Doeblin condition if there exists 0 < α < 1 such that

sup_{(θ,ζ∗)∈Θ×Θ} ‖P(θ, ·) − P(ζ∗, ·)‖TV < 1 − α.

We also choose ε < α/2 and set α∗ = α − 2ε > 0, which by Remark 2.1 in [47]
guarantees that every member of Pε satisfies the Doeblin condition with constant α∗ and has a
unique invariant measure. Thus we make the following three assumptions:
(A1) The exact transition kernel P satisfies the Doeblin condition,

(A2) every PN ∈ Pε satisfies ‖PN − P‖ < ε,

(A3) ε < min(α/2, (1− α)/2).
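For a finite state space the Doeblin condition can be verified directly by computing the largest total variation distance between rows of the transition matrix; a small sketch with an illustrative kernel:

```python
import numpy as np

def doeblin_gap(P):
    """Maximum over state pairs of the total variation distance between rows
    of transition matrix P; the Doeblin condition holds iff this is < 1."""
    n = P.shape[0]
    return max(0.5 * np.abs(P[i] - P[j]).sum()
               for i in range(n) for j in range(n))

# A strictly positive kernel on 3 states satisfies the Doeblin condition.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
gap = doeblin_gap(P)   # sup_{i,j} ||P(i,.) - P(j,.)||_TV
```

Here the gap is 0.3, so the condition holds with any α < 0.7.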
Now, let µ be the invariant measure of the exact kernel P, and let the perturbed chain
θ(0), θ(1), · · · , θ(t) be a Markov chain with θ(0) ∼ ν = µ0. Denote the marginal distribution of
θ(t) by µt = νP0P1 · · · Pt, t = 1, 2, · · · , with each Pt ∈ Pε and with P0 the
identity transition (for convenience). First we examine the total variation distance
between µ and the average measure (1/M) Σ_{t=0}^{M−1} µt, in other words

‖µ − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt‖TV,   where P0 = I.   (11.4)
Then we have the following important convergence result:

Theorem 11.1.2. Suppose the exact kernel P satisfies (A1), every member of Pε satisfies (A2), and
ε is chosen to satisfy (A3). Let ν be any probability measure on (Θ, F0); then

‖µ − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt‖TV ≤ (1 − (1 − α)^M)‖µ − ν‖TV / (Mα) − ε(1 − (1 − α)^M) / (Mα²) + ε/α,   (11.5)

which implies that this difference can be made arbitrarily small for sufficiently large M and small
enough ε.
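The right-hand side of (11.5) is easy to evaluate numerically; the sketch below illustrates that the bound decreases as M grows, taking ‖µ − ν‖TV = 1 as the worst case.

```python
def tv_bound(M, alpha, eps, tv0=1.0):
    """Right-hand side of (11.5) with ||mu - nu||_TV = tv0 (a sketch)."""
    g = 1.0 - (1.0 - alpha) ** M
    return g * tv0 / (M * alpha) - eps * g / (M * alpha ** 2) + eps / alpha

b_small = tv_bound(M=100, alpha=0.2, eps=0.01)      # short chain
b_large = tv_bound(M=10_000, alpha=0.2, eps=0.01)   # long chain
```

As M grows the bound approaches ε/α, the irreducible perturbation error.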
Next we focus on the mean squared error (MSE)

E[(µf − (1/M) Σ_{t=0}^{M−1} f(θ(t)))²],

where f is a bounded function and µf = E_µ[f(θ)]. The main objective is to find an upper
bound on this MSE when perturbed MCMC is used and to see how it depends on the number of
samples M. To obtain the main result we introduce the following lemma:
Lemma 11.1.3. Suppose θ(0) ∼ ν, with ν any distribution, and let µt = νP1 · · · Pt be the marginal
distribution of θ(t), t = 1, 2, · · · , with Pt ∈ Pε and ε satisfying (A2) and (A3), respectively.
Moreover, let f(θ) and g(θ) be bounded functions with |f| = sup_θ f(θ) and |g| = sup_θ g(θ);
then

cov(f(θ(j)), g(θ(k))) ≤ 8|f||g|(1 − α∗)^{|k−j|}.
For the proofs we will also use the following two theorems: the first concerns the strong
uniform consistency of kNN estimators, the second the uniform ergodicity of the Metropolis-
Hastings algorithm with an independent proposal.
Theorem 11.1.4 (Uniform consistency of kNN - [16]). Given independent pairs {ζn, δn}, n = 1, · · · , N, let Θ
be the support of the distribution of ζ, h(ζ) = E(δ|ζ), and hN(ζ) = Σ_{j=1}^{N} WNj δj the kNN
estimator, where the indices j are permuted so that the distances between ζj and ζ increase from
smallest to largest. Suppose the weights WNj satisfy

(i) Σ_{j=1}^{N} WNj = 1,

(ii) WNj = 0 for j > K, where K = K(N) with K → ∞ and K/N → 0,

(iii) sup_N K max_j WNj < ∞.

If

(i) Θ is compact,

(ii) h(ζ) is a continuous function,

(iii) Var(δ|ζ) is bounded,

(iv) K(N) satisfies (K/√N) log(N) → ∞,

then sup_{ζ∈Θ} |hN(ζ) − h(ζ)| → 0 with probability 1.
Note that both the uniform and the linear weights satisfy the above assumptions on WNj.
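The two weight schemes can be sketched directly, with assertions mirroring conditions (i)-(iii) above; K = √N matches the choice made later in the ergodicity results.

```python
import numpy as np

def uniform_weights(N, K):
    """W_Nj = 1/K for the K nearest neighbours, 0 otherwise."""
    w = np.zeros(N)
    w[:K] = 1.0 / K
    return w

def linear_weights(N, K):
    """Linearly decaying weights K, K-1, ..., 1 over the K nearest
    neighbours, normalized to sum to one."""
    w = np.zeros(N)
    raw = K - np.arange(K)
    w[:K] = raw / raw.sum()
    return w

N = 100
K = int(np.sqrt(N))
wu, wl = uniform_weights(N, K), linear_weights(N, K)
```

For both schemes, K·max_j W_Nj stays bounded (by 1 and by 2K/(K+1) < 2, respectively), so condition (iii) holds uniformly in N.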
Theorem 11.1.5 (Independent Metropolis sampler - [62]). Suppose θ(t) is an MH Markov chain
with invariant distribution π(θ), independent proposal q(θ) and acceptance probabilities

a(θ, ζ∗) = min(1, [π(ζ∗) q(θ)] / [π(θ) q(ζ∗)]).

If there exists β > 0 such that q(θ)/π(θ) > β for all θ ∈ Θ, then the algorithm is uniformly
ergodic, so that ‖P^n(θ, ·) − π‖TV < (1 − β)^n, where P^n(θ, ·) is the conditional distribution of θ(n)
given θ(0) = θ.
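A minimal independence Metropolis sampler illustrating Theorem 11.1.5; here the target is standard normal and the heavier-tailed N(0, 2²) proposal keeps q(θ)/π(θ) bounded below by 1/2 (the target and names are illustrative, not the thesis model).

```python
import numpy as np

def independent_mh(log_target, sampler, log_proposal, n_iter, theta0, rng):
    """Metropolis-Hastings with an independent proposal q (a sketch).
    Acceptance ratio: [pi(z) q(theta)] / [pi(theta) q(z)]."""
    theta, chain = theta0, []
    for _ in range(n_iter):
        z = sampler(rng)
        log_a = (log_target(z) + log_proposal(theta)
                 - log_target(theta) - log_proposal(z))
        if np.log(rng.uniform()) < min(0.0, log_a):
            theta = z
        chain.append(theta)
    return np.array(chain)

rng = np.random.default_rng(0)
# Target N(0,1); proposal N(0,4): q/pi = 0.5 * exp(3x^2/8) >= 1/2 everywhere.
chain = independent_mh(lambda x: -0.5 * x ** 2,
                       lambda r: r.normal(0.0, 2.0),
                       lambda x: -0.5 * (x / 2.0) ** 2,
                       5000, 0.0, rng)
```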
11.2 Main Results
The next important convergence result follows (similar to Theorem 2.5 of [47]):
Theorem 11.2.1 (Approximation of MSE). Suppose P, Pε and ε satisfy (A1), (A2) and
(A3), respectively. Let µ represent the invariant measure of P, let f(θ) be a bounded function,
and let θ(0) ∼ ν. Then

E[(µf − (1/M) Σ_{t=0}^{M−1} f(θ(t)))²]
≤ 4|f|² [(1 − (1 − α)^M)/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α]²
+ 8|f|² [1/M + (2/(α∗)²)(((1 − α∗)^{M+1} − (1 − α∗))/M² + ((1 − α∗) − (1 − α∗)²)/M)].   (11.6)

In other words, this expectation can be made arbitrarily small for sufficiently large M and
small enough ε.
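The bound (11.6) can likewise be evaluated numerically; the sketch below shows it shrinking as M increases, with |f| = 1.

```python
def mse_bound(M, alpha, alpha_star, eps, f_sup=1.0):
    """Right-hand side of (11.6) for |f| = f_sup (a sketch)."""
    g = 1.0 - (1.0 - alpha) ** M
    bias = g / (M * alpha) - eps * g / (M * alpha ** 2) + eps / alpha
    r = 1.0 - alpha_star
    var = 1.0 / M + (2.0 / alpha_star ** 2) * (
        (r ** (M + 1) - r) / M ** 2 + (r - r ** 2) / M)
    return 4.0 * f_sup ** 2 * bias ** 2 + 8.0 * f_sup ** 2 * var

b1 = mse_bound(M=100, alpha=0.2, alpha_star=0.1, eps=0.01)
b2 = mse_bound(M=100_000, alpha=0.2, alpha_star=0.1, eps=0.01)
```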
Based on these theorems we can now obtain convergence results for the AABC and ABSL
algorithms.
Theorem 11.2.2 (Ergodicity of AABC). Consider the proposed AABC sampler (with thresh-
old ε). Let p(θ) represent the prior measure on Θ and ZN the simulated pairs {ζn, 1{δn<ε}}, n = 1, · · · , N (ζn ∼ q(ζ)),
under the following assumptions:

(B1) Θ is a compact set.

(B2) q(θ) > 0 is a continuous density of the independent proposal distribution.

(B3) p(θ) > 0 is a continuous density of the prior distribution.

(B4) h(θ) = P(δ < ε|θ) > 0 is a continuous function of θ.

(B5) In the kNN estimation, K(N) = √N with uniform or linear weights.

Then for sufficiently large N (number of past simulations) and M (number of chain itera-
tions), (A1)-(A3) are satisfied and the error bounds of Theorems 11.1.2 and 11.2.1 follow.
Corollary 11.2.2.1 (Ergodicity of ABSL). Consider the proposed ABSL algorithm. Let p(θ)
represent the prior measure on Θ, h(θ) = N(s0; µθ, Σθ), and ZN the simulated pairs {ζn, sn}, n = 1, · · · , N (ζn ∼
q(ζ), with sn the summary statistic), under the following assumptions:

(B1) Θ is a compact set.

(B2) q(θ) > 0 is a continuous density of the independent proposal distribution.

(B3) p(θ) > 0 is a continuous density of the prior distribution.

(B4) h(θ) is a continuous function of θ.

(B5) |Σθ| > a0, where Σθ = Var(s|θ), for every θ ∈ Θ.

(B6) E[sj|θ] and E[sj sk|θ] are continuous functions of θ for every 1 ≤ j, k ≤ p, with sj
the jth component of the summary statistic s.

(B7) Var[sj|θ] and Var[sj sk|θ] are bounded functions.

(B8) In the kNN estimation, K(N) = √N with uniform or linear weights.

Then for sufficiently large N (number of past simulations) and M (number of chain itera-
tions), (A1)-(A3) are satisfied and the error bounds of Theorems 11.1.2 and 11.2.1 follow.
11.3 Proofs of Theorems
Proof. [Proof of Theorem 11.1.1] Note that sup_θ |h(θ; ZN) − h(θ)| → 0 w.p. 1 implies that,
for all θ and ζ∗ in Θ,

h(θ; ZN) →p h(θ),   h(ζ∗; ZN) →p h(ζ∗),

and therefore, by Slutsky's theorem,

h(ζ∗; ZN)/h(θ; ZN) →p h(ζ∗)/h(θ)

for all (θ, ζ∗) in Θ × Θ. Hence

α(θ, ζ∗; ZN) = [p(ζ∗) q(θ) h(ζ∗; ZN)] / [p(θ) q(ζ∗) h(θ; ZN)] →p [p(ζ∗) q(θ) h(ζ∗)] / [p(θ) q(ζ∗) h(θ)] = α(θ, ζ∗).

Since min(1, x) is a continuous function, the continuous mapping theorem implies that

a(θ, ζ∗; ZN) = min(1, α(θ, ζ∗; ZN)) →p min(1, α(θ, ζ∗)) = a(θ, ζ∗).
Note that this is not just point-wise convergence but uniform convergence in probability, so
that one C works for all (θ, ζ∗). That is, for any δ > 0 and ε > 0 there exists C
such that for all N > C and all (θ, ζ∗), P(|a(θ, ζ∗; ZN) − a(θ, ζ∗)| > δ) < ε.
Another important observation is that (fixing θ and ζ∗ and writing a = a(θ, ζ∗) and
aN = a(θ, ζ∗; ZN) for convenience)

E_{ZN}(|a − aN|) = ∫ |a − aN| dF(ZN) = ∫_{|a−aN|<δ} |a − aN| dF(ZN) + ∫_{|a−aN|≥δ} |a − aN| dF(ZN)
≤ δ + ∫_{|a−aN|≥δ} dF(ZN) ≤ δ + ε,   (11.7)

since |a − aN| ≤ 1, using the definition of convergence in probability. This inequality
shows that the expected value can be made arbitrarily small by taking N large enough;
moreover, the result is uniform, so one N works for all θ and ζ∗.
Next we focus on the distance between the two transition kernels; this discussion is similar to
the proof of Corollary 2.3 in [5]. Observe that (using independent proposals)

P(θ, dζ∗) = q(ζ∗) a(θ, ζ∗) dζ∗ + δθ(dζ∗) r(θ),
PN(θ, dζ∗) = ∫ q(ζ∗) a(θ, ζ∗; ZN) dF(ZN) dζ∗ + δθ(dζ∗) rN(θ),

where r(θ) = 1 − ∫ q(ζ∗) a(θ, ζ∗) dζ∗ and rN(θ) = 1 − ∫∫ q(ζ∗) a(θ, ζ∗; ZN) dζ∗ dF(ZN). Fix θ ∈ Θ,
and note that the total variation distance between two probability distributions with densities
π1 and π2 can also be written as

‖π1 − π2‖TV = 0.5 ∫ |π1(θ) − π2(θ)| dθ.
Therefore

P(θ, dζ∗) − PN(θ, dζ∗) = ∫ q(ζ∗)(a(θ, ζ∗) − a(θ, ζ∗; ZN)) dF(ZN) dζ∗
− δθ(dζ∗) ∫∫ q(t)(a(θ, t) − a(θ, t; ZN)) dF(ZN) dt,   (11.8)

and it follows that

‖P(θ, ·) − PN(θ, ·)‖TV = 0.5 { ∫ |∫ q(ζ∗)(a(θ, ζ∗) − a(θ, ζ∗; ZN)) dF(ZN)| dζ∗
+ |∫∫ q(t)(a(θ, t) − a(θ, t; ZN)) dF(ZN) dt| }
≤ ∫∫ q(t) |a(θ, t) − a(θ, t; ZN)| dF(ZN) dt ≤ δ + ε   (11.9)

for any ε > 0 and δ > 0 and large enough N, by (11.7). Since this holds for any θ ∈ Θ,
we finally obtain the main result:

sup_θ ‖PN(θ, ·) − P(θ, ·)‖TV ≤ δ + ε.   (11.10)
Proof. [Proof of Theorem 11.1.2] We generally follow the proof of Theorem 2.4 in [47]. First
observe that

νP0 · · · PM − µP^M = (ν − µ)P^M + Σ_{t=0}^{M−1} νP0 · · · Pt (Pt+1 − P) P^{M−t−1}.

By assumptions (A2) and (A3) we get

‖νP0 · · · Pt Pt+1 − νP0 · · · Pt P‖TV ≤ ε

and

‖νP0 · · · Pt Pt+1 P^{M−t−1} − νP0 · · · Pt P P^{M−t−1}‖TV ≤ ε(1 − α)^{M−t−1}.

Using these results, the triangle inequality and the formula for the sum of a finite geometric series,
we establish that

‖νP0 · · · PM − µP^M‖TV ≤ ‖µP^M − νP^M‖TV + Σ_{t=0}^{M−1} ‖νP0 · · · Pt Pt+1 P^{M−t−1} − νP0 · · · Pt P P^{M−t−1}‖TV
≤ (1 − α)^M ‖µ − ν‖TV + ε Σ_{t=0}^{M−1} (1 − α)^{M−t−1}
= (1 − α)^M ‖µ − ν‖TV + ε (1 − (1 − α)^M)/α.   (11.11)
Finally we get the main result using the fact that µ is invariant for P (again using the sum of a
finite geometric series):

‖µ − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt‖TV = ‖(1/M) Σ_{t=0}^{M−1} µP^t − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt‖TV
≤ (1/M) Σ_{t=0}^{M−1} ‖µP^t − νP0 · · · Pt‖TV
≤ (1/M) Σ_{t=0}^{M−1} [(1 − α)^t ‖µ − ν‖TV + ε (1 − (1 − α)^t)/α]
= (1 − (1 − α)^M)‖µ − ν‖TV/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α.   (11.12)
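The last equality above rests on the finite geometric sum (1/M) Σ_{t=0}^{M−1} (1 − α)^t = (1 − (1 − α)^M)/(Mα), which is easy to confirm numerically:

```python
# Numerical check of the Cesaro average of a finite geometric series.
alpha, M = 0.3, 50
lhs = sum((1.0 - alpha) ** t for t in range(M)) / M
rhs = (1.0 - (1.0 - alpha) ** M) / (M * alpha)
```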
Proof. [Proof of Lemma 11.1.3] Without loss of generality assume k > j, and define

f̃(θ(j)) = f(θ(j)) − µj f,   g̃(θ(k)) = g(θ(k)) − µk g,

so that E[f̃(θ(j))] = E[g̃(θ(k))] = 0. Then we get

cov(f(θ(j)), g(θ(k))) = E[f̃(θ(j)) g̃(θ(k))] = E[E[f̃(θ(j)) g̃(θ(k)) | θ(j)]]
= E[f̃(θ(j)) E[g̃(θ(k)) | θ(j)]] = E_{θ(j)}[f̃(θ(j)) δ_{θ(j)} Pj+1 · · · Pk g̃],   (11.13)

where δθ is the point mass at θ, and in our notation δ_{θ(j)} Pj+1 · · · Pk is the conditional
distribution of θ(k) given the value of θ(j).

Using the general observation that for any two measures ν1 and ν2 and any bounded function
f the following inequality holds,

|ν1 f − ν2 f| ≤ 2|f| ‖ν1 − ν2‖TV,   (11.14)

we find that

|δ_{θ(j)} Pj+1 · · · Pk g̃| = |δ_{θ(j)} Pj+1 · · · Pk g̃ − 0| = |δ_{θ(j)} Pj+1 · · · Pk g̃ − µk g̃|
= |δ_{θ(j)} Pj+1 · · · Pk g̃ − µj Pj+1 · · · Pk g̃| ≤ 2|g̃| ‖δ_{θ(j)} Pj+1 · · · Pk − µj Pj+1 · · · Pk‖TV
≤ 2|g̃| (1 − α∗)^{|k−j|},   (11.15)

and note that this holds for any θ(j) ∈ Θ. Returning to (11.13) we get

cov(f(θ(j)), g(θ(k))) ≤ 2|f̃||g̃| (1 − α∗)^{|k−j|}.   (11.16)

Finally, by the triangle inequality, |f̃| ≤ 2|f| and similarly |g̃| ≤ 2|g|. The
desired result follows immediately.
Proof. [Proof of Theorem 11.2.1] Using our standard notation νP0 · · · Pt f = E[f(θ(t))],
Theorem 11.1.2, Lemma 11.1.3 and standard results for double sums of geometric series, we get

E[(µf − (1/M) Σ_{t=0}^{M−1} f(θ(t)))²]
= E[(µf − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt f + (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt f − (1/M) Σ_{t=0}^{M−1} f(θ(t)))²]
= (µf − (1/M) Σ_{t=0}^{M−1} νP0 · · · Pt f)² + E[((1/M) Σ_{t=0}^{M−1} νP0 · · · Pt f − (1/M) Σ_{t=0}^{M−1} f(θ(t)))²]
≤ [2|f| ((1 − (1 − α)^M)‖µ − ν‖TV/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α)]²
+ (1/M²) Σ_{j=0}^{M−1} Σ_{t=0}^{M−1} cov(f(θ(j)), f(θ(t)))
≤ 4|f|² [(1 − (1 − α)^M)/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α]²
+ (8|f|²/M²) Σ_{j=0}^{M−1} Σ_{t=0}^{M−1} (1 − α∗)^{|t−j|}
= 4|f|² [(1 − (1 − α)^M)/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α]²
+ 8|f|² [1/M + (2/(α∗)²)(((1 − α∗)^{M+1} − (1 − α∗))/M² + ((1 − α∗) − (1 − α∗)²)/M)],   (11.17)

obtaining the desired result.
Proof. [Proof of Theorem 11.2.2] First, by (B1)-(B4), Theorem 11.1.5 guarantees uniform
ergodicity of the exact chain P with β = min_{θ∈Θ} q(θ)/(p(θ)h(θ)/c), where c is the normalizing
constant of the posterior. Note that β > 0, since Θ is compact and the ratio is continuous and
never zero; therefore P also satisfies the Doeblin condition. Next, from (B1), (B4) and (B5),
Theorem 11.1.4 implies that sup_{θ∈Θ} |h(θ; ZN) − h(θ)| → 0 with probability 1. Hence, by
Theorem 11.1.1, the perturbed kernel PN can be made arbitrarily close to the exact kernel P for
sufficiently large N; in particular, the total variation distance between PN and P decreases to
zero as N increases. Finally, the assumptions of Theorems 11.1.2 and 11.2.1 follow.
Proof. [Proof of Corollary 11.2.2.1] First, by (B1)-(B5), Theorem 11.1.5 guarantees uniform
ergodicity of the exact chain P with β = min_{θ∈Θ} q(θ)/(p(θ)h(θ)/c), where c is the normalizing
constant of the posterior. Note that β > 0, since Θ is compact and the ratio is continuous and
never zero; therefore P satisfies the Doeblin condition. Next, from (B1) and (B6)-(B8),
Theorem 11.1.4 implies that sup_{θ∈Θ} |h(θ; ZN) − h(θ)| → 0 with probability 1. Hence, by
Theorem 11.1.1, the perturbed kernel PN can be made arbitrarily close to the exact kernel P for
sufficiently large N; in particular, the total variation distance between PN and P decreases to
zero as N increases. Finally, the assumptions of Theorems 11.1.2 and 11.2.1 follow.
Part III
Final Remarks
Chapter 12
Conclusions and Future Work
We started this thesis by looking at bivariate conditional copulas. The inclusion of a dynamic
copula in the model comes with a significant computational price, which can be justified by the
need to explore the dependence structure or by the resulting improvement in the predictive
accuracy of the model. We have proposed a Bayesian procedure to estimate the calibration
function of a conditional copula model jointly with the marginal distributions. In our attempt
to move away from an additive model hypothesis, we considered sparse Gaussian process priors
used in conjunction with a single index model. The resulting procedure reduces the
dimensionality of the parameter space and can be used for a moderate number of covariates.
The simplifying assumption (SA) is often adopted as a way to bypass the need to estimate a
conditional copula model. However, even if the SA is true when conditioning on the true set of
covariates, we showed that if one or more covariates are not included in the fitted model, then
the SA is violated. We have introduced selection criteria to help choose the copula family
from a set of candidates and to gauge data support in favour of the simplifying assumption.
While the former task seems to be achieved by all criteria considered, the latter is a
particularly difficult problem, and we are encouraged by the good performance exhibited by our
permutation-based version of the cross-validated marginal likelihood criterion. Its theoretical
properties are the focus of our ongoing work, and we plan to extend its use to identifying the
set of covariates that do not influence the calibration function.
As a natural continuation of the first project, we then proposed two methods to check data
support for the simplifying assumption in conditional bivariate copula problems as well as in
various other models. Both are based on splitting the data into training and test sets,
partitioning the test set into bins using predicted calibration values, and finally using a
randomization or χ² test to detect whether the distribution in each bin is the same. We showed
empirically that under the SA the probability of Type I error is controlled, while generic
Bayesian model selection methods fail to provide reliable results. When the generative process
does not satisfy the SA, the two methods also perform quite well, exhibiting large power.
Similar results are obtained for mean, logistic and quantile regressions. There is still some
uncertainty about what proportion of the data should be assigned to the training set and what
proportion to the test set. It was also assumed that the sample size in each bin is the same;
however, in some problems power can be increased by varying the bin sizes. These questions
should be investigated further.
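The bin-based check described above can be sketched as follows; the Kruskal-Wallis statistic here is only a stand-in for the randomization and χ² tests developed in the thesis, and all data are synthetic.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Hypothetical test-set quantities (e.g. copula-based scores) and the
# predicted calibration values used to assign observations to bins.
scores = rng.normal(size=300)
calib = rng.uniform(size=300)

# Partition the test set into four equal-count bins by predicted calibration.
edges = np.quantile(calib, [0.0, 0.25, 0.5, 0.75, 1.0])
bins = np.digitize(calib, edges[1:-1])
groups = [scores[bins == b] for b in range(4)]

# Under the SA the bin distributions should coincide; here a Kruskal-Wallis
# test plays the role of the randomization / chi-square tests in the text.
stat, pval = kruskal(*groups)
```

A small p-value would indicate that the bin distributions differ, i.e. evidence against the simplifying assumption.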
Lastly, we focused on simulation-based algorithms and proposed to speed up generic ABC-
MCMC and BSL algorithms by storing and re-using past simulations. This approach sig-
nificantly speeds up the computation and can be very useful for models where simulating a
pseudo data set is computationally expensive or where a large number of MCMC iterations is
required. We presented theoretical arguments and the assumptions necessary for the convergence
of the perturbed Markov chain. The performance of these strategies was ex-
amined via a series of simulations under different models. All simulation summaries show
that the proposed methods significantly improve the mixing and efficiency of the chain while
producing parameter estimates as accurate and precise as those of the generic samplers.
One obvious drawback is that, due to the curse of dimensionality, the kNN estimator may not
produce good results when the parameter dimension q is moderate or large. It is therefore of
great interest to modify the proposed algorithms and extend them to larger dimensions.
Bibliography
[1] Kjersti Aas, Claudia Czado, Arnoldo Frigessi, and Henrik Bakken. Pair-copula con-
structions of multiple dependence. Insurance Mathematics & Economics, 44(2):182–198,
April 2009.
[2] E.F. Acar, C Genest, and Johanna Neslehova. Beyond simplified pair-copula construc-
tions. Journal of Multivariate Analysis, 110:74–90, 2012.
[3] Elif F Acar, Radu V Craiu, and Fang Yao. Dependence calibration in conditional
copulas: A nonparametric approach. Biometrics, 67(2):445–453, 2011.
[4] Elif F Acar, Radu V Craiu, Fang Yao, et al. Statistical testing of covariate effects in
conditional copula models. Electronic Journal of Statistics, 7:2822–2850, 2013.
[5] Pierre Alquier, Nial Friel, Richard Everitt, and Aidan Boland. Noisy monte carlo: Con-
vergence of markov chains with approximate transition kernels. Statistics and Comput-
ing, 26(1-2):29–47, 2016.
[6] Ziwen An, David J Nott, and Christopher Drovandi. Robust bayesian synthetic likeli-
hood via a semi-parametric approach. arXiv preprint arXiv:1809.05800, 2018.
[7] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An
introduction to MCMC for machine learning. Machine learning, 50(1-2):5–43, 2003.
[8] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle markov chain
monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 72(3):269–342, 2010.
[9] Christophe Andrieu, Gareth O Roberts, et al. The pseudo-marginal approach for effi-
cient monte carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
[10] Meïli Baragatti and Pierre Pudlo. An overview on approximate Bayesian computation.
In ESAIM: Proceedings, volume 44, pages 291–299. EDP Sciences, 2014.
[11] Gerard Biau and Luc Devroye. Lectures on the nearest neighbor method. Springer, 2015.
[12] Christopher M Bishop. Pattern recognition and machine learning. Springer-Verlag New
York Inc., 2006.
[13] Luke Bornn, Natesh Pillai, Aaron Smith, and Dawn Woodard. One pseudo-
sample is enough in approximate bayesian computation mcmc. Biometrika, 99(1):1–10,
2014.
[14] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of markov
chain monte carlo. CRC press, 2011.
[15] V. Chavez-Demoulin and T. Vatter. Generalized additive models for conditional copulas.
J. Multivariate Anal., 141:147–167, 2015.
[16] Philip E Cheng. Strong consistency of nearest neighbor regression function estimators.
Journal of Multivariate Analysis, 15(1):63–72, 1984.
[17] Taeryon Choi, Jian Q Shi, and Bo Wang. A Gaussian process regression approach to a
single-index model. Journal of Nonparametric Statistics, 23(1):21–36, 2011.
[18] Paulo Cortez, Antonio Cerdeira, Fernando Almeida, Telmo Matos, and Jose Reis. Mod-
eling wine preferences by data mining from physicochemical properties. Decision Support
Systems, 47(4):547–553, 2009.
[19] Radu V Craiu and Jeffrey S Rosenthal. Bayesian computation via markov chain monte
carlo. Annual Review of Statistics and Its Application, 1:179–201, 2014.
[20] Radu V. Craiu and Avideh Sabeti. In mixed company: Bayesian inference for bivariate
conditional copula models with discrete and continuous outcomes. Journal of Multi-
variate Analysis, 110:106–120, 2012.
[21] Claudia Czado. Pair-copula constructions of multivariate copulas. Copula theory and
its applications, pages 93–109, 2010.
[22] Luciana Dalla Valle, Fabrizio Leisen, and Luca Rossini. Bayesian non-parametric con-
ditional copula estimation of twin data. Journal of the Royal Statistical Society: Series
C (Applied Statistics), 2017.
[23] Alexis Derumigny and Jean-David Fermanian. About tests of the “simplifying” assump-
tion for conditional copulas. Dependence Modeling, 5(1):154–197, 2017.
[24] Holger Dette, Marc Hallin, Tobias Kley, Stanislav Volgushev, et al. Of copulas, quan-
tiles, ranks and spectra: An L1-approach to spectral analysis. Bernoulli, 21(2):781–
831, 2015.
[25] Christopher C Drovandi. Abc and indirect inference. In Handbook of Approximate
Bayesian Computation, pages 179–209. Chapman and Hall/CRC, 2018.
[26] Christopher C Drovandi, Clara Grazian, Kerrie Mengersen, and Christian Robert. Ap-
proximating the likelihood in abc. In Handbook of Approximate Bayesian Computation,
pages 321–368. Chapman and Hall/CRC, 2018.
[27] Julian J Faraway. Extending the linear model with R: generalized linear, mixed effects
and nonparametric regression models. Chapman and Hall/CRC, 2016.
[28] Julian J Faraway. Linear models with R. Chapman and Hall/CRC, 2016.
[29] Paul Fearnhead and Dennis Prangle. Constructing summary statistics for approximate
bayesian computation: semi-automatic approximate bayesian computation. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 74(3):419–474, 2012.
[30] J.-D. Fermanian and O. Lopez. Single-index copulae. ArXiv preprint: 1512.07621, 2015.
[31] Sarah Filippi, Chris P Barnes, Julien Cornebise, and Michael PH Stumpf. On optimality
of kernels for approximate bayesian computation using sequential monte carlo. Statistical
applications in genetics and molecular biology, 12(1):87–107, 2013.
[32] Evelyn Fix and J. L. Hodges. Discriminatory analysis, nonparametric discrimination:
Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph
Field, Texas, 1951.
[33] Seymour Geisser and William F Eddy. A predictive approach to model selection. Journal
of the American Statistical Association, 74(365):153–160, 1979.
[34] A. Gelman and D. B. Rubin. Inference from iterative simulation using multiple se-
quences. Statistical Science, Vol. 7(4):457–472, 1992.
[35] Andrew Gelman, Jessica Hwang, and Aki Vehtari. Understanding predictive information
criteria for bayesian models. Statistics and Computing, 24(6):997–1016, 2014.
[36] Christian Genest and Anne-Catherine Favre. Everything you always wanted to know
about copula modeling but were afraid to ask. Journal of Hydrologic Engineering,
12(4):347–368, Jul-Aug 2007.
[37] Christian Genest, Kilani Ghoudi, and L-P Rivest. A semiparametric estimation pro-
cedure of dependence parameters in multivariate families of distributions. Biometrika,
82(3):543–552, 1995.
[38] Irene Gijbels, Marek Omelka, and Noel Veraverbeke. Estimation of a copula when
a covariate affects only marginal distributions. Scandinavian Journal of Statistics,
42(4):1109–1126, 2015.
[39] Robert B Gramacy and Heng Lian. Gaussian process single-index models as emulators
for computer experiments. Technometrics, 54(1):30–41, 2012.
[40] T. Hanson, A. Branscum, and W. Johnson. Predictive comparison of joint longitudinal-
survival modelling: a case study illustrating competing approaches. Lifetime Data
Analysis, 17:2–28, 2011.
[41] W Keith Hastings. Monte carlo sampling methods using markov chains and their ap-
plications. Biometrika, 57:97–109, 1970.
[42] Jose Miguel Hernandez-Lobato, James R Lloyd, and Daniel Hernandez-Lobato. Gaus-
sian process conditional copulas with applications to financial time series. In Advances
in Neural Information Processing Systems, pages 1736–1744, 2013.
[43] Philip Hougaard. Analysis of Multivariate Survival Data. Statistics for Biology and
Health. Springer-Verlag, New York, 2000.
[44] Yuao Hu, Robert B Gramacy, and Heng Lian. Bayesian quantile regression for single-
index models. Statistics and Computing, 23(4):437–454, 2013.
[45] Marko Jarvenpaa, Michael U Gutmann, Arijus Pleska, Aki Vehtari, Pekka Marttinen,
et al. Efficient acquisition rules for model-based approximate bayesian computation.
Bayesian Analysis, 2018.
[46] James E Johndrow and Jonathan C Mattingly. Error bounds for approximations of
markov chains used in bayesian sampling. arXiv preprint arXiv:1711.05382, 2017.
[47] James E Johndrow, Jonathan C Mattingly, Sayan Mukherjee, and David Dun-
son. Optimal approximating markov chains for bayesian inference. arXiv preprint
arXiv:1508.03387, 2015.
[48] Matthias Killiches, Daniel Kraus, and Claudia Czado. Examination and visualisation
of the simplifying assumption for vine copulas in three dimensions. Australian & New
Zealand Journal of Statistics, 59(1):95–117, 2017.
[49] N. Klein and T. Kneib. Simultaneous inference in structured additive conditional copula
regression models: a unifying Bayesian approach. Stat. Comput., pages 1–20, 2015.
[50] Roger Koenker and Kevin F Hallock. Quantile regression. Journal of economic perspec-
tives, 15(4):143–156, 2001.
[51] Eric D Kolaczyk and Gabor Csardi. Statistical analysis of network data with R, vol-
ume 65. Springer, 2014.
[52] Lajmi Lakhal, Louis-Paul Rivest, and Belkacem Abdous. Estimating survival and as-
sociation in a semicompeting risks model. Biometrics, 64(1):180–188, March 2008.
[53] Philippe Lambert and F Vandenhende. A copula-based model for multivariate non-
normal longitudinal data: analysis of a dose titration safety study on a new antidepres-
sant. Statist. Medicine, 21:3197–3217, 2002.
[54] A Lee, C Andrieu, and A Doucet. Discussion of constructing summary statistics for ap-
proximate bayesian computation: semi-automatic approximate bayesian computation.
JR Stat. Soc. Ser. B Stat. Methodol, 74(3):449–450, 2012.
[55] Anthony Lee. On the choice of mcmc kernels for approximate bayesian computation
with smc samplers. In Proceedings of the 2012 Winter Simulation Conference (WSC),
pages 1–12. IEEE, 2012.
[56] Erich L Lehmann and Joseph P Romano. Testing statistical hypotheses. Springer Science
& Business Media, 2006.
[57] Evgeny Levi and Radu V Craiu. Bayesian inference for conditional copulas using Gaussian process single index models. Computational Statistics & Data Analysis, 122:115–134, 2018.
[58] David Lopez-Paz, Jose Miguel Hernandez-Lobato, and Zoubin Ghahramani. Gaussian
process vine copulas for multivariate dependence. In Proceedings of the 30th Inter-
national Conference on Machine Learning, volume 28, pages 10–18, Atlanta, Georgia,
USA, 2013. JMLR: W&CP.
[59] Thomas Lux and Michele Marchesi. Volatility clustering in financial markets: a microsimulation of interacting agents. International Journal of Theoretical and Applied Finance, 3(04):675–702, 2000.
[60] Jean-Michel Marin, Pierre Pudlo, Christian P Robert, and Robin J Ryder. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180, 2012.
[61] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, 2003.
[62] Kerrie L Mengersen and Richard L Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24(1):101–121, 1996.
[63] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
[64] Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer
Science & Business Media, 2012.
[65] Iain Murray, Ryan Prescott Adams, and David JC MacKay. Elliptical slice sampling.
In International Conference on Artificial Intelligence and Statistics, 2010.
[66] Andrew Naish-Guzman and Sean Holden. The generalized FITC approximation. In
Advances in Neural Information Processing Systems, pages 1057–1064, 2007.
[67] Shaoyang Ning and Neil Shephard. A nonparametric Bayesian approach to copula estimation. arXiv preprint arXiv:1702.07089, 2017.
[68] Robert Nishihara, Iain Murray, and Ryan P Adams. Parallel MCMC with generalized elliptical slice sampling. Journal of Machine Learning Research, 15(1):2087–2112, 2014.
[69] John P Nolan. Modeling financial data with stable distributions. In Handbook of heavy
tailed distributions in finance, pages 105–130. Elsevier, 2003.
[70] Andrew J Patton. Modelling asymmetric exchange rate dependence. International Economic Review, 47(2):527–556, 2006.
[71] Natesh S Pillai and Aaron Smith. Ergodicity of approximate MCMC chains with applications to large data sets. arXiv preprint arXiv:1405.0182, 2014.
[72] Dennis Prangle. Summary statistics in approximate Bayesian computation. arXiv preprint arXiv:1512.05633, 2015.
[73] Leah F Price, Christopher C Drovandi, Anthony Lee, and David J Nott. Bayesian
synthetic likelihood. Journal of Computational and Graphical Statistics, 27(1):1–11,
2018.
[74] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.
[75] Gareth O Roberts, Andrew Gelman, and Walter R Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997.
[76] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.
[77] Gareth O Roberts, Jeffrey S Rosenthal, and Peter O Schwartz. Convergence properties of perturbed Markov chains. Journal of Applied Probability, 35(1):1–11, 1998.
[78] Jeffrey S Rosenthal. Markov chain Monte Carlo algorithms: Theory and practice. In Monte Carlo and Quasi-Monte Carlo Methods 2008, pages 157–169. Springer, 2009.
[79] Håvard Rue and Leonhard Held. Gaussian Markov random fields: theory and applications. Chapman and Hall/CRC, 2005.
[80] Avideh Sabeti, Mian Wei, and Radu V Craiu. Additive models for conditional copulas.
Stat, 3(1):300–312, 2014.
[81] Thilo A Schmitt, Rudi Schäfer, Holger Dette, and Thomas Guhr. Quantile correlations: Uncovering temporal dependencies in financial time series. International Journal of Theoretical and Applied Finance, 18(07):1550044, 2015.
[82] Alexander Y Shestopaloff and Radford M Neal. MCMC for non-linear state space models using ensembles of latent sequences. arXiv preprint arXiv:1305.0320, 2013.
[83] SA Sisson, Y Fan, and MA Beaumont. Overview of approximate Bayesian computation. arXiv preprint arXiv:1802.09720, 2018.
[84] Scott A Sisson, Yanan Fan, and Mark Beaumont. Handbook of Approximate Bayesian
Computation. Chapman and Hall/CRC, 2018.
[85] Scott A Sisson, Yanan Fan, and Mark M Tanaka. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2007.
[86] A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8:229–231, 1959.
[87] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2005.
[88] David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika van der Linde.
Bayesian measures of model complexity and fit (with discussion). Journal of the Royal
Statistical Society, Series B, 64:583–639(57), 2002.
[89] Aki Vehtari, Andrew Gelman, and Jonah Gabry. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413–1432, 2017.
[90] Noël Veraverbeke, Marek Omelka, and Irène Gijbels. Estimation of a conditional copula and association measures. Scandinavian Journal of Statistics, 38:766–780, 2011.
[91] Sumio Watanabe. Asymptotic equivalence of Bayes cross validation and widely applica-
ble information criterion in singular learning theory. The Journal of Machine Learning
Research, 11:3571–3594, 2010.
[92] Sumio Watanabe. A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14(Mar):867–897, 2013.
[93] Richard D Wilkinson. Accelerating ABC methods using Gaussian processes. arXiv preprint arXiv:1401.1436, 2014.
[94] Simon N Wood. Statistical inference for noisy nonlinear ecological dynamic systems.
Nature, 466(7310):1102, 2010.
[95] Juan Wu, Xue Wang, and Stephen G Walker. Bayesian nonparametric inference for
a multivariate copula function. Methodology and Computing in Applied Probability,
16(3):747–763, 2014.
[96] Juan Wu, Xue Wang, and Stephen G Walker. Bayesian nonparametric estimation of a
copula. Journal of Statistical Computation and Simulation, 85(1):103–116, 2015.