Multivariate Skewed Responses: New Semiparametric Regression
Model and a Bayesian Recourse
Abstract
For many real-life studies with skewed multivariate responses, the level of skewness
and association structure assumptions are essential for evaluating the covariate effects
on the response, and its predictive distribution. We present a novel semiparametric
model leading to a theoretically justifiable semiparametric Bayesian analysis of multi-
variate skewed responses. Similar to multivariate Gaussian densities, this multivariate
model is closed under marginalization, and allows a wide class of multivariate associa-
tions with meaningful physical interpretations of skewness levels and covariate effects
on the marginal density. Compared to existing models, our model enjoys several de-
sirable properties, including scalable Bayesian computing via available software, and
assurance of consistent Bayesian estimates of parameters and the nonparametric error
density under a set of plausible prior assumptions. We illustrate the practical advan-
tages of our methods over existing parametric alternatives via application to a clinical
dataset assessing periodontal disease, and a simulation study.
Keywords: Dirichlet process; Kernel density; Markov Chain Monte Carlo; Periodontal dis-
ease; Skewed error
1
1 Introduction
Particularly in biomedical and econometric studies, we often come across highly skewed
multivariate responses. For example, the preliminary analysis of a periodontal disease (PD)
study of Type-2 diabetic Gullah-speaking African-Americans (Fernandes et al., 2009) clearly
reveals that the two major clinical responses of subject-level PD status (Darby and Walsh,
2014), the mean (average) periodontal pocket depth (PPD) and the mean clinical attachment
level (CAL) within each patient, are highly skewed, with possibly different levels of skewness.
A correct modeling of these two PD bio-markers and their associations with other covariables
are important, because PD has strong links to many types of diseases, including diabetes
(Casanova et al., 2014), heart disease (Mattila et al., 1989) and cancer (Michaud et al., 2008).
A typical multivariate regression analysis that ignores skewness often fails to effectively
describe the effects of patient-level covariates on skewed responses, and may even lead to
wrong models and predictive densities. As an alternative, quantile regression (QR) methods
(Koenker and Bassett Jr, 1978; Koenker, 2005) may be effective, however, their estimation
mostly focus on a single pre-specified quantile, and neither evaluate the levels of skewness
nor estimate the covariate effects on the whole distribution. Furthermore, the methods based
on transforming the multivariate skewed responses to symmetry have difficulties in choosing
a suitable transformation and interpreting the effects of the covariates on the original non-
transformed response.
This paper addresses the major challenges and limitations of existing models and
model-based regression analysis of multivariate skewed response data. First, existing long
list of approaches to handle multivariate skewed responses are mostly restricted to para-
metric models (Azzalini and Dalla Valle, 1996; Azzalini and Capitanio, 2003; Ferreira and
Steel, 2007; Fernandez and Steel, 1996; Genton, 2004), and in particular, center around
2
the parametric family of skewed distributions pioneered by Azzalini (1985). Other variants
of these parametric models include hidden truncation models (Arnold and Beaver, 2000),
multivariate skew-elliptical distributions (Branco and Dey, 2001; Sahu et al., 2003), and
skew-normal/independent distributions (Lachos et al., 2009; Bandyopadhyay et al., 2010,
2012). Unlike these models, our new semiparametric multivariate model in Section 2 has
a skewed nonparametric error distribution. Second, an elegant modeling assures that the
model class is closed under marginalization, that is, any subset of the response vector has the
distribution from the same multivariate class of the original vector, however, with different
parameters (like the property of multivariate normal density). For the majority of existing
parametric multivariate skewed proposals, the marginal densities either do not come from
the same class or the parameters (say, variance and skewness) of the marginal density of each
component depend on the parameters of the rest of the components. This is a major obstacle
for component-wise interpretation of the covariate effects, and particularly for specification
of priors which are typically based on available prior opinion about each component of the
response vector. We advocate a latent variable based model that is closed under marginal-
ization with marginal skewness and variance that do not depend on the parameters of other
components. This allows a meaningful interpretation of component-wise skewness and co-
variate effects. Finally, accommodating a flexible class of multivariate associations (such
as antedependence, toeplitz, or autoregressive structures) via a covariance matrix within a
multivariate skewed framework is a major challenge. A recent variation of the multivariate
skew-normal model (Chang and Zimmerman, 2016) accommodates the antependence asso-
ciation, however, at the cost of a strict condition on the vector of skewness parameters.
Our semiparametric model allows a broader class of multivariate associations (including the
aforementioned structures) within the response vector, without imposing any restriction on
skewness parameters.
3
In recent times, available Bayesian and likelihood based methods to model multivariate
skewed responses include an exclusive use of parametric models, lack of theoretically justi-
fied semiparametric methods, and lack of a computational tool using standard software. To
address these practical concerns, we present a carefully executed Bayesian semiparametric
regression methodology that leads to easy implementation via Markov Chain Monte Carlo
(MCMC) techniques using easily available softwares. Our model eases the prior specifications
by allowing marginal interpretations of the parameters for each component. Also, to the best
of our knowledge, we are the first to present a semiparametric Bayesian method for multi-
variate skewed responses with desirable asymptotic properties under practically reasonable
prior assumptions.
The rest of the paper proceeds as follows. After introducing our semiparametric pro-
posal in Section 2, we present the relevant components of Bayesian inference, such as prior
and likelihood specifications, and MCMC implementation routines in Section 3. In Sections
4 and 5, we assess the finite sample properties (via simulation studies), and the posterior
consistency of our Bayesian estimates, respectively. In Section 6, we apply our model to the
motivating dataset on PD. Finally, we present some concluding remarks in Section 7.
2 Semiparametric Model for Multivariate Skewed Re-
sponse
Let Yi = (Yi1, . . . , Yim) ∈ Rm be a row vector of skewed responses from subject i = 1, . . . , n.
For brevity, we use fixed number of components m per subject, however, our methods and
computational tools presented here can accommodate unequal number of components mi.
4
For the PD study, m = 2 for every subject. Our regression model is
Yij = µij + εij, for j = 1, . . . ,m, (1)
where µij = xijβj and xij = (1, xij1, . . . , xij,p−1) is the row vector of (p − 1) covariates for
Yij (with first term for the intercept), βj is the corresponding (p × 1) vector of regression
parameters, and εi = (εi1, . . . , εim) is the row vector of errors that follows an m-variate
multivariate density fm(ε), with possibly skewed marginal densities.
We specify a semiparametric class of skewed multivariate densities for fm(ε) using
two independent (m × 1) vectors of latent variables, Z1i = (Z1i1, . . . , Z1im)T and Z2i =
(Z2i1, . . . , Z2im)T, where Z1ij and Z2ij have the same unknown nonparametric symmetric
marginal density hj(·) for j = 1, . . . ,m. However, (Z1i1, · · · , Z1im) are jointly independent
with joint density∏m
j=1 hj(z1ij), and Z2i = (z2i1, · · · , z2im) has the m-variate Gaussian Cop-
ula density (Nelsen, 2007)
Cm(Z2i | h,Σρ) = φm[Φ−1
1 {H1(z2i1)}, . . . ,Φ−11 {Hm(z2im)} | Σρ
]×
m∏j=1
hj(z2ij)
φ1
[Φ−1
1 {Hj(z2ij)}](2)
where φm(. | Σρ) is the m-variate Gaussian density with mean 0 and correlation matrix
Σρ with unknown parameter ρ, Φ−11 (.) is the inverse cdf of N(0, 1), Hj(·) is the cdf of the
unknown symmetric density hj(·), and h = (h1, . . . , hm). The copula model of (2) ensures
that Z2ij has the marginal density hj(·) (same as Z1ij). Our multivariate skew semiparametric
(MSS) model for the m-dimensional εi, denoted by εi ∼ MSS(0, h,Σρ, α), is given as εij =
αj(1 + α2j )−1/2|Z1ij| + (1 + α2
j )−1/2Z2ij for j = 1, . . . ,m. This is represented via matrix
notation as
εTi = Aα|Z1i|+ A∗αZ2i , (3)
5
where Aα and A∗α are diagonal matrices with jth diagonal elements αj(1 + α2j )−1/2 and
(1 + α2j )−1/2 respectively for α ∈ Rm. The multivariate semiparametric model of (3) has
the joint distribution of (2) with symmetric marginal density hj(·) only when αj = 0 for
j = 1, · · · ,m. We model the unknown nonparametric symmetric density hj(·) as a scale
mixture of symmetric (around 0) kernel densities K(· | σ), with scale σ as
hj(u) =
∫ ∞0
K(u | σ)dGj(σ), (4)
where Gj(·) is the unknown nonparametric mixing distribution with support in R+ = (0,∞).
A popular choice for K(· | σ) in (4) is N(0, σ2) with symmetric density hj(u) expressed as,
hj(u) =
∫ ∞0
φ1(u | σ)dGj(σ). (5)
Using the result of Feller (1971) (p. 158), when the kernel K(· | σ) is uniform density
with support (−σ, σ), hj(u) in (4) represents the class of all unimodal symmetric around
0 densities. The semiparametric model of (3) is similar to the “skewed shocks” model of
Sahu et al. (2003), except it has a new parametrization of weights (Aα, A∗α), nonparametric
(h1, . . . , hm) and multivariate copula modeling of Z1i (instead of parametric multivariate
elliptical density of (Z1i, Z2i)). Semiparametric multivariate model resulting from (3) is
valid even when no actual latent skewed shock vector |Z1i| is applicable in practice. Later,
we demonstrate that this latent vector representation leads to nice model properties, and
also facilitates implementation.
Replacing the nonparametric (multivariate) specification of latent variables (Z1i, Z2i)
in (3)-(4) with parametric multivariate mean zero normal densities Z1i ∼ Nm(0, D2σ) and
Z2i ∼ Nm(0, DσΣρDσ) with Dσ = diag(σ1, . . . , σm) and positive definite correlation matrix
6
Σρ, we obtain a parametric multivariate skewed response model. We call this a parametric
Gaussian mixture model (PGM), and denote it by εi ∼ PGM(0, α,Σ) to differentiate it from
the multivariate skew normal model of Azzalini and Dalla Valle (1996), and later, Chang
and Zimmerman (2016), among others. For this parametric multivariate model, the skewed
marginal distribution of εij is the univariate skew-normal of Azzalini (1985), denoted by
SN(0, αj, σj).
It is obvious that our latent variable based model in (3) is closed under marginalization
because the marginal density of ε(1)i is from the same semiparametric MSS density indexed
by (α(1),Σ11, h(1)), where the (m×1) dimensional response εi is partitioned as εi = (ε
(1)i , ε
(2)i ),
with corresponding partitions of α = (α(1), α(2)), (h1, · · · , hm) = (h(1), h(2)) and matrix Σ =Σ11 Σ12
Σ12 Σ22
. Unlike our MSS model as well as the parametric model of Sahu et al. (2003),
most of the existing multivariate parametric models are not closed under marginalization.
When εi follows a location 0 multivariate skew-normal (MSN) distribution of Azzalini and
Dalla Valle (1996) denoted by εi ∼ MSN(0, α,Σ), the corresponding fm(εi) is given by
fm(εi1, . . . , εim) = 2φm(εi|Σ)Φ1(εiα), (6)
with the skewness parameter vector α and association matrix Σ. The multivariate model
of (6) and other related models are essentially based on common univariate skewed shock
|Z∗i |, instead of the multivariate Z1i in (3). As a consequence, these models are closed un-
der marginalization only in a restricted sense. Unlike MSS and the multivariate normal
model, parameters (α∗(1),Σ∗(1)) of the marginal skew-normal density of ε
(1)i are different from
(α(1),Σ11), and in fact, these (α∗(1),Σ∗(1)) are even functions of (α(2),Σ22,Σ12). In particu-
lar, the scalar component εij has the univariate skew-normal distribution (Azzalini, 1985)
7
denoted by SN(0; αj, σj) with density
fj(εij) = 2φ1(εij|σj)Φ1(αjεij), (7)
where φ1(.|σ) is the N(0, σ2) density, (Σ)jj = σ2j , and the marginal skewness parameter αj is
αj =αj + (1/σ2
j )ΣT.j(−j)α(−j)
1 + αT(−j)Σ(−j,−j)α(−j) − (1/σ2
j )(Σ.j(−j)α(−j))2, (8)
where Σ.j is the j-th column of the covariance matrix Σ, R(−j) denotes the reduced vector
after deleting j-th element of the vector R, and Q(−j,−k) denotes the reduced matrix after
deleting j-th row and k-th column of matrix Q. It is obvious from (7)-(8) that the marginal
distribution of εij, including its mean, variance and skewness, clearly depends on all the
components of vector α and matrix Σ via αj. The magnitude and even the sign of αj can
be different from those of αj. For example, in the bivariate case of our PD study with εi =
(εi1, εi2) errors for PPD and CAL responses with α = (α1, α2) and Σ =
σ21 ρσ1σ2
ρσ1σ2 σ22
,
the marginal density of εi1 (PPD) is SN(0; α1, σ1) with α1 = {α1 +ρα2σ2/σ1}/{1 +α22σ
22(1−
ρ2)}, where ρ = Corr(εi1, εi2) = σ12/{σ1σ2}. Even the signs of α1 and α1 are different
whenever α21σ1 + ρα1α2σ2 < 0. For Bayesian analysis, in particular, this complicates the
prior specification for, say, (α1, σ1) based on available prior opinion about the marginal
skewness and variability of the first response PPD without using its complex form. This
model also imposes a restriction that εi1 is skewed if α2 6= 0, ρ 6= 0 even when α1 = 0. Using
the Theorem 1 of Chang and Zimmerman (2016), we can show that in general this model
imposes several restrictions on the association structure.
8
In particular, the marginal mean and the covariance of Yi for the MSS model are
E (Yij) = xijβj + {αj/(1 + α2j )
1/2}λj and var (Yi) = Vi, (9)
with diagonal and off-diagonal elements given by (Vi)jj = σ2hj−{α2
j/(1+α2j )}λ2
j and (Vi)jk =
ρjkσhjσhk/{(1 + α2j )(1 + α2
k)}1/2 for all j 6= k ∈ {1, ...,m}, where common expectations
E(|Z1ij|) = E(|Z2ij|) = λj and variances V ar(Z1ij) = V ar(Z2ij) = σ2hj in (9) are taken with
respect to the common marginal density hj(·) of Z1ij and Z2ij given in (4). The correlation
Corr(εij, εik) between a pair (εij, εik) within cluster i is ρjkσhjσhk/[{(1+α2j )σ
2hj−α2
jλ2j}{(1+
α2k)σ
2hk − α2
kλ2k}]1/2.
Figure 1 (right panel) presents the contour plots of several bivariate (m = 2) versions
of the parametric Gaussian mixture densities PGM(0, α,Σ) with (3) for σ1 = σ2 = 1,
ρ = σ12/(σ1σ2) = 0.7 and common α1 = α2 = α0. Note that the common marginal density
of ε1 and ε2 is univariate SN(0;α0, 1). It is obvious from these contour plots that for a fixed
ρ, the shape of the joint density changes noticeably when α0 = 0 (the elliptical contour plot
of the bivariate normal density) to α0 = 2, 5, 10, with the marginal densities becoming more
positively skewed.
To compare the shape of parametric Gaussian mixture model density with the bivariate
skew-normal density of Chang and Zimmerman (2016), Figure 1 (left panel) also presents
the contour plots of the density of (6) for m = 2, ρ = 0.7, σ1 = σ2 = 1 and for the same
set of four values of α0. Like the contour plots of the corresponding parametric bivariate
Gaussian mixture model densities, the shape of the contours for the bivariate skew-normal
densities becomes increasingly non-elliptical as α0 increases, however, even for a moderately
large α0, the shapes of the contours of the bivariate skew-normal are noticeably different
from the contours of the bivariate parametric Gaussian mixture densities. Also, increasing
9
value of α0 does not affect the skewness of the marginal density of the bivariate skew-normal
density as profoundly as it affects the skewness of the marginal density of the parametric
Gaussian mixture density. For the multivariate skew-normal model of (6), there is even a
moderately negative association in the left tail even when the association parameter ρ is 0.7.
This points to the flexibility of our multivariate model in (3), even in the parametric case.
For α1 = α2 = 10, ρ = −0.7 and σ1 = σ2 = 1, Figure 2 compares the marginal density
of the bivariate (that is, m = 2) parametric model of (6) (in left panel) with corresponding
marginal density of the bivariate parametric Gaussian mixture model (right panel) via the
contour plots and density histograms of their corresponding marginal densities. While the
positive skewness of both components of bivariate Gaussian mixture density is evident from
the contour plot in Figure 2(b), the contour plot of bivariate skew-normal density in Figure
2(a) is nearly elliptical. Also, the marginal density histogram and boxplot of bivariate skew-
normal model in Figure 2(c) and Figure 2(e) appear to be approximately symmetric while
those of bivariate Gaussian mixture density in Figure 2(d) and Figure 2(f) demonstrate
substantial marginal skewness as expected for the chosen α as high as 10. This illustrates
the fact that, depending on the other parameters of the model, the marginal skewness of the
multivariate skew-normal model of (6) may be restricted compared to that of our model.
3 Likelihood, Prior Specifications and Implementation
Consider the linear regression model of (1) for n independent multivariate responses Yi
and matrix of covariates Xi =(xTi1, . . . , x
Tim
)from i = 1, . . . , n subjects. We now develop
a semiparametric Bayesian inference procedure for the model with the multivariate error
defined in (3) with kernel mixture of symmetric densities hj(·) in (4). Given observed data
10
(a) α0 = 0 (b) α0 = 0
(c) α0 = 2 (d) α0 = 2
(e) α0 = 5 (f) α0 = 5
(g) α0 = 10 (h) α0 = 10
Figure 1: Contour plots of bivariate version of the parametric Gaussian mixture (PGM) modeldensity (right panel) and bivariate skew-normal densities (left panel) for association parameterρ = 0.7, and common scale parameter σ1 = σ2 = 1 and different values of common skewness α0.
11
(a) (b)
(c) (d)
(e) (f)
Figure 2: Contour plot of bivariate skew-normal density (panel a) with histogram (panel c) andboxplot (panel e) of the corresponding common marginal density; Contour plot of bivariate versionof the parametric Gaussian mixture (PGM) model density (panel b) with histogram (panel d)and boxplot (panel f) of corresponding marginal density. Both bivariate densities use associationparameter ρ = −0.7, common scale parameter σ1 = σ2 = 1 and common α1 = α2 = 10.
12
D = {yi, Xi; i = 1, . . . , n}, the likelihood function of Θ = (β, α, ρ, σ,G) is,
L(Θ | D) =n∏i=1
fm(yi −Xiβ), (10)
where the joint density fm of the error-vector εi = yi − Xiβ has a complicated analytical
expression, because εi is expressed as a function of latent variables in (3). We will later
describe how to handle this likelihood for the purpose of data analysis. After specifying the
likelihood in (10), the priors are specified as follows.
(i) We suggest independent multivariate normal priors π1(β) and π2(α) for parameter
vectors β and α.
(ii) The joint prior distribution for the unknown parameter vector ρ of the paramteric
association (correlation) matrix in the copula density in (2) is π3(ρ). Depending on
the dimension of ρ, this π3(ρ) can be either an univariate or multivariate density.
(iii) The joint prior process π4(G) of (G1, · · · , Gm) is a product of m independent Dirichlet
processes (Ferguson, 1974), DP(G0j, C) for j = 1, · · · ,m, with pre-specified prior mean
G0j(·) of unknown Gj(.) and pre-specified concentration C around G0j. So, specifying
a flexible prior on Gj(·) induces a flexible prior on the nonparametric hj(·) in (4).
Hence, the joint posterior density is
p(Θ | D) ∝ L(Θ | D)π1(β)π2(α)π3(ρ)π4(G) . (11)
For the DP(G0j, C) the prior “guess” (prior mean) G0j of Gj is specified based on the
prior guess h0j(u) =∫∞
0K(u|σ)dG0j(σ) of the marginal density hj(u) of Z1ij in (4). The
concentration (or precision) parameter C is the measure of uncertainty of Gj around G0j;
13
the larger the value of C, the closer the sample-paths of hj(u) are to the prior “guess” h0j(u)
and smaller values of C allow sample-path of hj(u) to be very different from h0j(u).
Even though, the posterior in (11) and even the likelihood in (10) are difficult to
compute analytically, we implement a parametric as well as a semiparametric likelihood
based Bayesian analysis using Markov Chain Monte Carlo (MCMC) via standard statistical
software such as publicly available WinBUGS. For implementation within, say, WinBUGS, we
first need to specify the likelihood of (10) as a standard distribution available in WinBUGS.
For this purpose, the likelihood of (10) is the product of∫Cm(A∗−1
α {yi − Xiβ − Aα|Z1i|} |
h,Σρ)[∏m
j=1 hj(Z1ij)] dZ1i using (3), where the integral is taken over the support Rm of latent
vector Z1i, and hj(·) is the marginal density of independent Z1ij. Also, Cm(· | h,Σρ) is the
multivariate density in (2). For example, when hj(.) is parametric, say, N(0, σ2j ) density, we
implement the likelihood via two following steps: (1) (Wi1, . . . ,Wim) for i = 1, · · · , n are
mean-zero multivariate normal Nm(0,Σρ), where Wij = Φ−11 [Hj{(1 + α2
j )1/2(yij − Xijβ) −
αj|Z1ij|}], Φ−1 is the inverse-cdf of standard normal, and Hj(·) = Hj(·|σj) is the cdf
corresponding to the parametric density hj(·|σj); (2) Z1ij are independent hj(·|σj). The
parametric analysis further needs independent parametric prior distributions σj ∼ Gj(·|γj)
for j = 1, · · · ,m given (possibly unknown) hyperparameter γj with known hyper-prior
πγ(γ1, · · · , γm). For the special case of the parametric Bayesian analysis of model (3) with
a parametric N(0, σ2j ) density φ(·|σj) for Z1ij and Z2ij, the A∗−1
α {yi − Xiβ − Aα|Z1i|} in
(10) is essentially mean zero m-variate normal Nm(0,Ψ) density with covariance matrix
Ψ = DσΣρDσ, and Z1i ∼ Nm(0, D2σ), where Dσ = diag(σ1, ..., σm).
For a semiparametric Bayesian analysis with a particular symmetric density kernel
K(z|σ) for the nonparametric kernel-mixture densities h = (h1, . . . , hm) in (4), we implement
the likelihood contribution via following three steps: (1) (Wi1, . . . ,Wim) for i = 1, · · · , n are
mean-zero multivariate Nm(0,Σρ), where Wij = Φ−11 [K−1
ij {(1+α2j )
1/2(yij−Xijβ)−αj|Z1ij|}],
14
where K−1ij is the inverse-cdf of K(·|σij); (2) Z1ij for j = 1, . . . ,m are independent with
density K(·|σij); (3) σij for j = 1, . . . ,m are independent with nonparametric cdf Gj with
Dirichlet prior DP (G0j, C). The first two above steps of the likelihood for MCMC can
be easily modified to use, say, uniform kernel for K(· | σ) for nonparametric density hj(.)
in (4) by letting A∗−1α {yi −Xiβ − Aα|Z1i|} = ({2Φ1(Wi1)− 1}σi1, ..., {2Φ1(Wim)− 1}σim)T,
where (Wi1, ...,Wim) ∼ Nm(0,Σρ) and Z1ij follows independent uniform density with support
(−σij, σij).
For convenient MCMC implementation, we use the constructive definition of the Dirich-
let Process (Sethuraman, 1994) truncated at a user specified finite number, say, K com-
ponents for prior of Gj. Suppose (δ1, . . . , δK) are generated independently from G0j and
(B1, . . . , BK−1) are generated independently from Beta(1, C), and ω1 = B1, ωh = Bk
∏l<k(1−
Bl) for k = 2, . . . , K − 1 and ωK = 1 −∑K−1
l=1 ωl then, the sample path of Gj is approxi-
mated by Gj =∑K
l=1 ωlIδl when K is large. These likelihood steps and priors facilitates easy
Bayesian semiparametric implementation for either the Gaussian or uniform kernels using
publicly available freewares.
4 Simulation Study
To assess the finite sample properties of our Bayesian estimates, we conduct a small scale
simulation study. We consider N = 500 replications of bivariate response data with sample
size n = 50. The bivariate skewed errors (εi1, εi2) are generated from the parametric Gaussian
mixture version of the model in (3) using α = (α1, α2) = (2, 2), Z1i ∼ N2(0, I) and Z2i ∼
N2(0,Σρ), where Σρ is a (2× 2) correlation matrix with ρ12 = 0.7. The regression structure
for the simulation model is Yij = β0+xij1β1+β2xij2+εij, where (β0, β1, β2) = (2, 0.5, 0.5) and
xij1 and xij2 are sampled independently such that xij2 ∼ N(0, 1) and xij1 = 1 with probability
15
0.5 and xij1 = −1 otherwise, for j = 1, 2. The simulation model has the same regression
parameters (common β) for both components, however, it may not be a realistic assumption
for our motivating study. We fit the semiparametric Gaussian mixture model (SPGM),
where symmetric densities h1(·) and h2(·) of (4) are expressed as unknown scale mixtures
of Gaussian kernel as in (5) with corresponding mixture distribution G1 and G2. We also
fit the parametric Gaussian mixture model described in Section 2, assuming Z1ij ∼ N(0, σ2j )
for j = 1, 2 and Z2i ∼ Nm(0,Σ), with (Σ)11 = σ21, (Σ)22 = σ2
2 and (Σ)12 = ρ12σ1σ2. We use
independent N(0, 1) priors for β0, β1 and β2, and also for the skewness parameters α1 and
α2. For the DP(G01, C) and DP(G02, C) priors corresponding to G1 and G2 respectively,
we assume the prior “guess” G01 and G02 to be a Gamma(2, 1) density and C = 1. For
the parametric Gaussian mixture model, we specify Gamma(2, 1) as the independent prior
distribution for σ21 = σ2
2. To test the performance of our semiparametric methods, we
use the mean squared error MSE(θ) =∑N
k=1(θk − θ)2/N , where θk is the estimate of the
parameter θ via analyzing the simulation replicate k = 1, 2, . . . , N of the dataset. Similarly,
we also use the average bias∑N
k=1(θk−θ)/N and the coverage probability of the 95% interval
estimated for the competing models. For the purpose of comparison, we obtain estimates
for the parameters using the Generalised Estimating Equations (GEE) approach (Liang and
Zeger, 1986) in conjunction with the sandwich variance estimate approach; for linear models,
GEE is equivalent to a mixed model with a sandwich variance estimate. Please note that,
using (9), the resulting marginal intercept for component j = 1, 2 of the simulation model
is β∗0j = β0 − {αj/(1 + α2j )}1/2λj, and thus the GEE estimate of the intercept is biased for
estimating β0. However, GEE will be unbiased for the regression coefficients β1 and β2. We
summarize the results of the simulation study in Table 1.
The simulation study reveals that our semiparametric Gaussian mixture model and
the parametric Gaussian mixture model yield smaller mean squared errors (at least 20% re-
16
Table 1: Simulation Results based on 500 replicates of the data comparing the Mean SquaredError (MSE), Standard Error (SE), Bias, and Coverage Probability (Coverage) for associa-tion parameter ρ12 = 0.7 sample size is n = 50. True parameter values are (β0, β1, β2, α1,α2) = (2, 0.5, 0.5, 2, 2).
SPGM PGM GEEβ0 β1 β2 α1 α2 β0 β1 β2 α1 α2 β0 β1 β2
100×MSE 1.702 0.376 0.432 16.22 29.995 1.461 0.359 0.397 18.07 14.88 56.897 0.479 0.54510×SE 0.948 0.616 0.636 2.361 3.436 0.766 0.600 0.627 2.966 2.776 0.687 0.632 0.64010×Bias 0.903 0.019 -0.123 -3.271 4.273 0.937 0.026 -0.079 -3.054 -2.688 7.511 0.093 -0.045Coverage 0.945 0.95 0.975 0.975 1 0.95 0.96 0.96 1 0.98 0.2 0.91 0.9
duction) and overall better coverage probability for 95% interval estimates when compared
to GEE based regression estimates. Unlike the popular repeated measures analysis method
using GEE, our methods provide estimates of skewness and better (more precise) interval
estimates of the regression parameters. We also see that in spite of the parametric Gaussian
mixture model being the the true simulation model, Bayesian estimates based on our semi-
parametric model are comparable with those from parametric model as far as MSE and bias
are concerned.
5 Posterior Consistency
For any theoretically defensible inference based on a complex semiparametric Bayesian
model, it is important to be assured that the posterior distribution of the parameters become
increasingly concentrated around the true parameter values as the sample size increases. This
necessitates the investigation of the asymptotic properties along with the finite sample prop-
erties. In this section, we provide sufficient conditions for posterior consistency, which is the
utmost important asymptotic property. For the sake of brevity, we present our theory for
the case where the dimension of the regression parameter, β is p = 1. However, our theory
can be extended to the general case where p > 1. We first investigate if the support of the
17
prior is large enough to cover all possible relevant densities. Suppose F denotes a class of all
univariate asymmetric unimodal residual densities and C{(F)m× (0, 1)} denotes a collection
of all m-dimensional residual densities f0 with unknown correlation ρjk ∈ (0, 1) having the
form,
f0(e) =
∫Rm+
2m|J |m∏j=1
h0j(zj)h0j(e∗j)
φ1
[Φ−1{H0j(e∗j)}
]φm [Φ−1{H01(e∗1)...Φ−1{H0m(e∗m)}|Σρ
]dz,
with the j-th marginal density given by,
f0j(ej) =
∫R+
2√
1 + α2jh0j(zj)h0j(e
∗j)dzj j = 1, ...,m
where h0j(·) be the true (unknown) symmetric (around 0) density, H0j(·) is the corresponding
cumulative distribution function and e∗j =√
1 + α2jej − αzj for j = 1, ...,m and
|J | =
∣∣∣∣∣∣∣I 0
A∗−1α −A∗−1
α Aα
∣∣∣∣∣∣∣Let Π denote a prior on C{(F)m×(0, 1)} such that Π = (πF)m⊗π3(ρ), where πF is a prior on
F defined by Gj ∼ DP(G0j, C) for j = 1, ...,m and α ∼ π2(α) and π3(ρ) is a prior on (0, 1).
For ease of exposition, we will assume that α is known. To show that posterior consistency
holds at true density f0, we first define the Kullback-Liebler divergence as KL(f0, f) =∫Rm f0 log(f0/f) and Kullback-Leibler neighborhood of size ε as κε(f0) = {f : KL(f0, f) < ε}.
For any prior Π∗ on the density space F∗, f0 is said to be in the Kullback-Leibler support
of Π∗ denoted by KL(Π∗) if f0 ∈ S, where S = {f0 : Π∗(f : KL(f0, f) < ε)∀ε > 0}. Define
FKL ={f0 ∈ C{(F)m × (0, 1)},
∫f0| log f0| <∞
}. We characterize the Kullback-Leibler Π
in the following lemma with its proof in Appendix A.
18
Lemma 1 1. FKL ⊂ KL(Π) if G0j is defined on support R+ for all j = 1, . . . ,m and G0j
is absolutely continuous with respect to Lebesgue measure and supp(π3(ρ)) = (0, 1).
Suppose, Dn is the observed data with sample size n, and β0 and f0 are the true val-
ues of the regression parameter and the multivariate error density respectively. We now
put forth the sufficient conditions to ensure that as the sample size n increases, the pos-
terior distributions of the parameter β and the error density f are concentrated around a
small neighborhood around their true values. We are essentially interested in making in-
ference on the regression parameter β so we define a strong L1 neighborhood around the
true value, β0 as the set S(β0) = {β : |β − β0| < δ} and weak neighborhood of around f0 as
Wε(f0) ={f ∈ F : |
∫ϕf −
∫ϕf0| < ε
}for a bounded continuous function ϕ. Assume that
U = Wε(f0) × S(β0). The following theorem gives the result on posterior consistency and
the details of the proof are given in the Appendix B.
Theorem 1 Suppose (f0, β0) ∈ FKL × R. Consider a prior Π = (ΠC ⊗ Πβ) on FKL × R.
Then Π{(fm, β) ∈ U c|Dn} → 0 a.s. under the true data generating distribution, Pf0,β0 that
generates data Dn.
Technical details of the proof are given in the Appendix B.
6 Application: GAAD Study
The motivating data example for this paper comes from a clinical study (Fernandes et al.,
2009) initially conducted to explore the association between PD and diabetes level (de-
termined by Hba1c, or ‘glycosylated hemoglobin’) in Type-2 diabetic Gullah-speaking (or
simply Gullah) African-Americans (GAAD). It is well documented that about 5% to 15%
of world population and about 50% U.S. adults (over the age of 35) are susceptible to peri-
19
odontitis (Pourabbas et al., 2005; Oliver et al., 1998), a major cause of progressive bone loss
around the tooth..
The GAAD study aims to evaluate the distribution of PD status of this population,
as well as quantify the effects of subject-level covariates such as age (in years), gender
(1=Female, 0=Male), Body Mass Index or BMI (obese=1 if BMI >= 30, obese=0 if BMI <
30), glycemic level or HbA1c (1= High, 0 = controlled) and smoking status(1 = smoker, 0
= never smoker) on the PD status (Reich and Bandyopadhyay, 2010), as measured by the
PPD and CAL in rounded whole millimeters (Darby and Walsh, 2014). For our analysis,
we select n = 288 patients (subjects) with complete covariate information. About 31% of
the subjects are smokers. The mean age of the subjects is about 55 years with a range from
26-87 years. Female subjects seem to be predominant (about 76%) in our data, which is not
uncommon in Gullah population (Johnson-Spruill et al., 2009). About 68% of subjects are
obese (BMI >= 30) and 59% are with HbA1c = 1 (with blood sugar level higher than 7
percent), an indicator of high glycemic level.
Here, our two highly skewed responses from each subject are mean PPD and mean
CAL, which are averages of corresponding measurements from all sites (6 sites per tooth)
of all teeth of the subject. The underlying statistical goals are to evaluate the regression
functions explaining the covariate-response relationships of this bivariate skewed responses
and understand the relationship between two response. However, the validity of the esti-
mation of the regression function under a standard linear mixed model framework depends
on the usual Gaussian assumption about the responses. After fitting a linear mixed model
to the PD data using the nlme package in R, we present, in Figure 3 [panels (a)-(f)], the
histograms of the mean PPD and mean CAL, the histogram and Q-Q plot of the empirical
Bayes estimates of the random effects, and the histograms of the residuals for two responses.
These plots clearly reveal evidence of high and varying levels of skewness (i.e., departures
20
(a) (b)
(c) (d)
(e) (f)
Figure 3: Analysis of GAAD data using linear mixed model: Histograms of the PeriodontalPocket Depth (PPD) responses (panel a) and Clinical Attachment Level (CAL) responses (panelb); Histogram (panel c) and Q-Q plot (panel d) of the empirical Bayesian estimates of the randomeffects to assess the validity of the assumption of normal random effects density; Histograms ofresiduals for PPD responses (panel e) and CAL responses (panel f) obtained after fitting a linearmixed model.
21
from the Gaussian errors assumption) for the two responses, and also for the random subject
effects. In this section, we illustrate the application of the models and Bayesian method de-
veloped in Sections 2 and 3 of this article. We compare the results from fitting the following
4 competing models for the random errors, which are as follows:
1. Semiparametric Gaussian Mixture model (SPGM), which assumes symmetric densities
h1(·) and h2(·) of (4) are expressed as unknown scale mixtures of Gaussian kernel as
in (5).
2. Parametric Gaussian Mixture model (PGM), which assumes that the latent vectors
Z1i ∼ N2(0, D2σ) and Z2i ∼ N2(0,Σ), where D2
σ = diag(σ21, σ
22) and (Σ)11 = σ2
1, (Σ)22 =
σ22 (Σ)12 = ρσ1σ2.
3. Skew-t model (ST) assumes a Skew-t (Sahu et al., 2003) distribution with ν degrees of
freedom for the errors (εi1, εi2).
4. Bivariate Normal (BVN) model assumes a symmetric N2(0,Σ) distribution for (εi1, εi2),
ignoring the skewness.
We use practically non-informative independent N(0, 100) priors for all the regression
parameters (the components of β1 and β2), independent N(0, 100) priors for the skewness
parameters α1 and α2. We choose independent DP(G0j, C) for j = 1, 2 priors for the unknown
mixing distributions G1 and G2, which correspond to∫∞
0K(u|σ)dG0j(σ) for j = 1, 2. We set
the concentration parameter to C = 1, and assume same Gamma(8, 4) density for the prior
“guess” G01 and G02 of unknown G1 and G2 respectively. The prior guess Gamma(8, 4)
corresponds to the prior opinion that the ranges of Z1i and Z2i can be as narrow as ±2
to as wide as ±6 with high prior probability (with prior expectation of range ±4). This
prior guess for G1 and G2 matches with the prior distribution used for σ1 and σ2 of the
22
parametric Gaussian mixture model, skew-t and bivariate normal models. For the degrees of
freedom ν of the skew-t distribution, we choose Exp(0.1)I[2,∞), which is exponential density
truncated at 2 as the prior. For all four models, we use a uniform (-1,+1) as the prior for
ρ. It is important here to note that our choice of priors, although fairly non-informative,
is for illustrating the implementation of our statistical methods and may not represent the
prior opinion of the clinical investigators. We generated two parallel MCMC chains of size
350,000 and computed the posterior estimates after discarding the first 325,000 iterations.
To guard against the potential problems arising due to autocorrelation among the successive
iterations, we use thinning rate of 25. To assess the model convergence, we use the trace
plots, autocorrelation plots and the Gelman-Rubin R. The MCMC chain for parametric
Gaussian mixture model requires a larger number of iterations to converge as compared to
that of the semiparametric Gaussian mixture model. The GAAD data and the relevant R
and WinBUGS codes for the parametric and semiparametric Bayesian analysis are available
from the authors on request.
We use the Conditional Predictive Ordinate (cpo) statistic, (cpo)i =∫fm(yi −Xiβ |
Θ)p(Θ | D(−i))dΘ (Gelfand et al., 1992), to compare the model performances, where D(−i)
is the reduced data after deleting the ith observation from the full data D, and p(Θ | D(−i))
is the posterior density of Θ given D(−i). These cpoi for i = 1, . . . , n are obtained via the
Monte Carlo method using the MCMC samples from the full posterior distribution p(Θ | D).
The summary statistic using the cpoi is the log of the pseudo-marginal likelihood (lpml),
given by lpml =∑n
i=1 log (cpoi). We also use the Watanabe-Akaike information criterion
(waic) (Watanabe, 2010; Vehtari et al., 2016), considered a state-of-the-art Bayesian model
selection tool. The computations of the waic and lpml are convenient, based on MCMC
samples from the full posterior distribution p(Θ | D).
Models with larger lpml and smaller waic indicate more data support. Table 2
23
presents the computed lpml and waic values for the 3 competing models. These values sug-
gest the semiparametric Gaussian mixture model of (3)-(4) as the most appropriate model
followed by the parametric Gaussian mixture model. Recall that the parametric Gaussian
mixture model is the parametric case of the semiparametric Gaussian mixture model. The
lpml values for the skew-t and bivariate normal (existing parametric) models are -242.112
and -278.57 respectively, both substantially smaller than our new semiparametric and para-
metric models. These numbers suggest that our new modeling strategy is substantially more
appropriate for the GAAD data, compared to the competing parametric models.
Table 2: Comparison of the model selection measures for the semiparametric Gaussian mixture(SPGM), parametric Gaussian mixture (PGM), skew-t (ST), and bivariate normal (BVN) models.Two model selection measures used are the Watanabe-Akaike Information Criterion (waic), andthe log pseudo-marginal likelihood (lpml)
Model WAIC LPMLSPGM 424.7795 -74.644PGM 441.8912 -127.882ST 618.9393 -244.546BVN 654.5964 -300.315
The posterior estimates, standard deviations, and 95% credible intervals (CIs) of model
parameters obtained from fitting the parametric Gaussian and the semiparametric Gaussian
mixture models to the bivariate GAAD data responses (PPD and CAL) are given in Tables 3
and 4, respectively. The corresponding estimates (without the 95% CIs) from the skew-t and
the BVN models are also presented. The CIs for the skewness parameters α1 (corresponding
to PPD) and α2 (corresponding to CAL) do not contain 0 under any of the two selected
best models, and thus present substantial evidence of skewness for both responses. However,
under the skew-t model, the CIs for α1 overlaps with 0 and fails to detect skewness. Positive
values for the posterior mean of the skewness parameters indicate right skewness, as expected
from the histograms of marginal responses shown in Figure 3. The posterior means of the
24
effects of glycemic status (HbA1c) on both PPD and CAL (with the 95% CIs not covering
zero) are positive, implying that higher HbA1c may lead to substantially higher levels of
PD. Among smokers, there is a positive trend (not significant) to have higher PPD, and a
significantly higher CAL, compared to non-smokers. Also, significantly higher PPD and CAL
are observed for males, compared to females from both the semiparametric and parametric
Gaussian mixture models. However, under the bivariate normal and skew-t models, gender
effects do not have enough posterior evidence. On the whole, for most of the parameters, the
posterior standard deviation and associated 95% CIs suggest that our new Bayesian methods
provide more precise estimates for the model parameters. In terms of sensitivity analysis, we
observe that moderate changes in the choice of priors do not affect the parameter estimates
(both magnitude and sign) and the model comparison measures greatly.
Table 3: Posterior estimates of the regression parameters and the skewness parameter α1
corresponding to the response ‘periodontal pocket depth’ (PPD), obtained after fitting thesemiparametric Gaussian mixture (SPGM), parametric Gaussian mixture (PGM), paramet-ric skew-t (ST), and bivariate normal (BVN) models. SD and CI denote the posteriorstandard deviation and 95% credible intervals, respectively.
SPGM PGM ST BVN
Mean SD CI Mean SD CI Mean SD Mean SD
Intercept 2.1169 0.3143 (1.4708,2.6673) 2.1887 0.3881 (1.4807,2.9230) 2.0865 0.6557 2.1169 0.6298Age -0.0038 0.0039 (-0.0107,0.0041) -0.0058 0.0043 (-0.0137,0.0032) -0.0069 0.0076 -0.0098 0.0073Gender -0.2534 0.0891 (-0.4332,-0.0993) -0.2096 0.1219 (-0.4301,-0.0448) -0.1458 0.1861 -0.0811 0.1753BMI -0.1825 0.0843 (-0.3195,0.0018) -0.1378 0.1021 (-0.3769,0.0505) -0.2259 0.1402 -0.1056 0.1645Smoker 0.0599 0.0826 (-0.0993, 0.2219) 0.0746 0.0987 (-0.1270,0.2534) 0.1663 0.1767 0.1485 0.1696HbA1c 0.1789 0.0807 (0.0437,0.3565) 0.1331 0.0634 (0.0156,0.2618) 0.2191 0.1546 0.1701 0.1501α1 2.4224 0.2845 (1.8813,2.9644) 0.6968 0.1176 (0.4709,0.9378) 0.0687 0.0843 – –
7 Conclusion
In this paper, we presented a novel class of semiparametric multivariate stochastic models for
highly skewed error distributions. Our full model based approaches are substantially different
25
Table 4: Posterior estimates of the regression parameters and the skewness parameter α2
corresponding to the response ‘clinical attachment level’ (CAL) obtained after fitting thesemiparametric Gaussian mixture (SPGM), parametric Gaussian mixture (PGM), paramet-ric skew-t (ST), and bivariate normal (BVN) models. SD and CI denote the posteriorstandard deviation and 95% credible intervals, respectively.
SPGM PGM ST BVN
Mean SD CI Mean SD CI Mean SD Mean SD
Intercept 1.1876 0.4131 (0.3360,1.9991) 0.9896 0.4851 (-0.0245,1.8539) 1.4133 0.6813 1.1401 0.6485Age 0.0096 0.0048 (0.0005,0.0193) 0.0081 0.0057 (-0.0025,0.0198) 0.0 050 0.0078 0.0073 0.0075Gender -0.3913 0.0976 (-0.5903,-0.2228) -0.3556 0.1592 (-0.6681,-0.0743) -0.2505 0.1968 -0.3094 0.1872BMI -0.1498 0.1045 (-0.3283,0.0885) -0.1022 0.1229 (-0.3778,0.1188) -0.1936 0.1619 -0.1218 0.1725Smoker 0.2060 0.0925 (0.0166,0.3913) 0.2546 0.1145 (0.0259,0.4605) 0.2835 0.1865 0.3581 0.1813HbA1c 0.2593 0.1055 (0.0782,0.4865) 0.2392 0.0864 (0.0708,0.3981) 0.3005 0.1623 0.3143 0.1573α2 1.2973 0.2231 (0.9077,1.7416) 1.1646 0.1333 (0.9216,1.4275) 0.5294 0.0919 – –
from the estimating equations approaches of Ma et al. (2005), because our method uses full
likelihood and allows estimation and prediction of skewness and association. Also, the meth-
ods of Ma et al. (2005) and others use estimating equations for models with single skewing
function (essentially single skewed “shock” model). The parametric Gaussian mixture model
with skew-normal density for each marginal is a special case of our semiparametric model.
Our class of models can incorporate a large and flexible within-subject association, and the
association structure does not affect the marginal densities. Recently there has been some
confusion about how to model multivariate skewed densities via “single shock model” versus
“multiple shock model”. We show the superiority of the later approach initiated by Sahu
et al. (2003), and extended it to a semiparametric models. Our new semiparametric models
are readily amenable to Bayesian analysis through prior specifications for each component
and they do not involve any complex a priori relationship among parameters of different com-
ponents. Also, our latent variable representation simplifies the MCMC-based computation
and we present a novel trick to implement the likelihood within WinBUGS and other standard
software. The WinBUGS codes for the parametric and semiparametric Bayesian analysis are
26
available from the authors on request.
The Bayesian method described in this paper yields precise estimate of covariate effects
with meaningful physical interpretation. This is the first semiparametric full likelihood
based method for analysis of multivariate skewed response with full theoretical justification
provided via Bayesian asymptotic results. These methods for multivariate skewed responses
are more appropriate compared to quantile regression, when the goal is to understand the
effect of the covariates on the whole distribution (instead of certain quantiles and moments).
However, our methods can also estimate any desired quantile function (details omitted for
brevity). It also allows for seamless prediction, which is often important in biomedical
studies. The analysis of the GAAD data considered in this paper is primarily focused
on modeling the skewed bivariate response, and investigating its cross-sectional association
with a fixed number of subject-level covariates. Certainly, our proposed method can be
easily extended to scenarios with covariates missing-at-random. Furthermore, the methods
and tools discussed in this article can be extended to studies where the dimensions of the
response vector vary over subjects – leading to the popular ‘informative cluster size’ scenario
(Seaman et al., 2014). These will be pursued elsewhere.
References
Arnold, B. C. and Beaver, R. J. (2000), ‘Hidden truncation models’, Sankhya: The Indian
Journal of Statistics, Series A 62, 23–35.
Azzalini, A. (1985), ‘A class of distributions which includes the normal ones’, Scandinavian
journal of statistics 12(2), 171–178.
Azzalini, A. and Capitanio, A. (2003), ‘Distributions generated by perturbation of symme-
27
try with emphasis on a multivariate skew-t distribution’, Journal of the Royal Statistical
Society: Series B (Statistical Methodology) 65(2), 367–389.
Azzalini, A. and Dalla Valle, A. (1996), ‘The multivariate skew-normal distribution’,
Biometrika 83(4), 715–726.
Bandyopadhyay, D., Lachos, V. H., Abanto-Valle, C. A. and Ghosh, P. (2010), ‘Linear
mixed models for skew-normal/independent bivariate responses with an application to
periodontal disease’, Statistics in Medicine 29(25), 2643–2655.
Bandyopadhyay, D., Lachos, V. H., Castro, L. M. and Dey, D. K. (2012), ‘Skew-
normal/independent linear mixed models for censored responses with applications to Hiv
viral loads’, Biometrical Journal 54(3), 405–425.
Branco, M. D. and Dey, D. K. (2001), ‘A general class of multivariate skew-elliptical distri-
butions’, Journal of Multivariate Analysis 79(1), 99–113.
Casanova, L., Hughes, F. and Preshaw, P. (2014), ‘Diabetes and Periodontal Disease: a
two-way relationship’, British Dental Journal 217(8), 433–437.
Chang, S.-C. and Zimmerman, D. L. (2016), ‘Skew-normal antedependence models for
skewed longitudinal data’, Biometrika 103(2), 363–376.
Darby, M. L. and Walsh, M. (2014), Dental Hygiene: Theory and Practice, 4 edn, Elsevier
Health Sciences.
Feller, W. (1971), An Introduction to Probability Theory and Its Applications, Vol. 2, second
edn, Wiley, New York, NY.
Ferguson, T. (1974), ‘Prior distribution on spaces of probability measures’, The Annals of
Statistics 2, 615–629.
28
Fernandes, J., Wiegand, R., Salinas, C., Grossi, S., Sanders, J., Lopes-Virella, M. and Slate,
E. (2009), ‘Periodontal disease status in Gullah African Americans with type 2 diabetes
living in South Carolina’, Journal of Periodontology 80(7), 1062–1068.
Fernandez, C. and Steel, M. F. (1996), ‘On bayesian modelling of fat tails and skewness’,
Available at SSRN 821 .
Ferreira, J. T. and Steel, M. F. (2007), ‘A new class of skewed multivariate distributions
with applications to regression analysis’, Statistica Sinica 17(2), 505–529.
Gelfand, A. E., Dey, D. K. and Chang, H. (1992), Model determination using predictive
distributions with implementation via sampling-based methods, Technical report, DTIC
Document.
Genton, M. G. (2004), Skew-elliptical distributions and their applications: a journey beyond
normality, CRC Press.
Johnson-Spruill, I., Hammond, P., Davis, B., McGee, Z. and Louden, D. (2009), ‘Health
of Gullah families in South Carolina with Type-2 diabetes: Diabetes self-management
analysis from Project SuGar’, The Diabetes Educator 35(1), 117–123.
Koenker, R. (2005), Quantile Regression, number 38 in ‘Econometric Society Monographs’,
Cambridge University Press.
Koenker, R. and Bassett Jr, G. (1978), ‘Regression quantiles’, Econometrica 46(1), 33–50.
Lachos, V. H., Dey, D. K. and Cancho, V. G. (2009), ‘Robust linear mixed models with
skew-normal independent distributions from a bayesian perspective’, Journal of Statistical
Planning and Inference 139(12), 4098–4110.
29
Liang, K.-Y. and Zeger, S. L. (1986), ‘Longitudinal data analysis using generalized linear
models’, Biometrika 73(1), 13–22.
Ma, Y., Genton, M. G. and Tsiatis, A. A. (2005), ‘Locally efficient semiparametric esti-
mators for generalized skew-elliptical distributions’, Journal of the American Statistical
Association 100(471), 980–989.
Mattila, K. J., Nieminen, M. S., Valtonen, V. V., Rasi, V. P., Kesaniemi, Y. A., Syrjala,
S. L., Jungell, P. S., Isoluoma, M., Hietaniemi, K. and Jokinen, M. J. (1989), ‘Association
between dental health and acute myocardial infarction.’, Bmj 298(6676), 779–781.
Michaud, D. S., Liu, Y., Meyer, M., Giovannucci, E. and Joshipura, K. (2008), ‘Periodontal
disease, tooth loss, and cancer risk in male health professionals: a prospective cohort
study’, The lancet oncology 9(6), 550–558.
Nelsen, R. B. (2007), An introduction to copulas, Springer Science & Business Media.
Oliver, R. C., Brown, L. J. and Loe, H. (1998), ‘Periodontal diseases in the United States
population’, Journal of Periodontology 69(2), 269–278.
Pati, D. and Dunson, D. B. (2014), ‘Bayesian nonparametric regression with varying residual
density’, Annals of the Institute of Statistical Mathematics 66(1), 1–31.
Pourabbas, R., Kashefimehr, A., Rahmanpour, N., Babaloo, Z., Kishen, A., Tenenbaum,
H. C., Azarpazhooh, A., Mdala, I., Olsen, I., Haffajee, A. D. et al. (2005), ‘Position paper:
Epidemiology of Periodontal Diseases’, Journal of Periodontology 76(8), 1406–1419.
Reich, B. and Bandyopadhyay, D. (2010), ‘A latent factor model for spatial data with infor-
mative missingness’, Annals of Applied Statistics 4(1), 439–459.
30
Sahu, S. K., Dey, D. K. and Branco, M. D. (2003), ‘A new class of multivariate skew distri-
butions with applications to bayesian regression models’, Canadian Journal of Statistics
31(2), 129–150.
Seaman, S. R., Pavlou, M. and Copas, A. J. (2014), ‘Methods for observed-cluster inference
when cluster size is informative: A review and clarifications’, Biometrics 70(2), 449–456.
Sethuraman, J. (1994), ‘A constructive definition of dirichlet priors’, Statistica Sinica 4, 639–
650.
Tang, Y., Sinha, D., Pati, D., Lipsitz, S. and Lipshultz, S. (2015), ‘Bayesian partial linear
model for skewed longitudinal data.’, Biostatistics (Oxford, England) 16(3), 441–453.
Vehtari, A., Gelman, A. and Gabry, J. (2016), ‘Practical Bayesian model evaluation using
leave-one-out cross-validation and WAIC’, Statistics and Computing .
Watanabe, S. (2010), ‘Asymptotic equivalence of bayes cross validation and widely applicable
information criterion in singular learning theory’, Journal of Machine Learning Research
11, 3571–3594.
Appendix A
Proof of Lemma 1
Given a density f0 ∈ F the idea is to construct a sequence of functions f (M) ∈ C{(F)m ×
(0, 1)} M ≥ 1 such that KL(f0, f(M)) → 0 as M → ∞. Let e = (e1, ..., em), Aα =
diag(α1(1 + α2
1)−1/2, ..., αm(1 + α2m)−1/2
)and A∗α = diag
((1 + α2
1)−1/2, ..., (1 + α2m)−1/2
). We as-
31
sume α1 = α2 = ... = αm = α0 and α0 is known. Define,
f (M)(e) =
∫Rm+
2m|J |m∏j=1
hjM (zj)hjM (e∗j )
φ1
[Φ−1{HjM (e∗j )}
]φm [Φ−1{H1M (e∗1), ...,Φ−1{HmM (e∗m)}| ΣρM
]dz,
where e∗j = (1 + α20)1/2ej − α0zj for j = 1, ...,m, HjM is the cdf of hjM and ρM → ρ as M → ∞,
|J | =
∣∣∣∣∣∣∣∣I 0
A∗−1α −A∗−1
α Aα
∣∣∣∣∣∣∣∣ with the marginal density given by,
fjM (ej) =
∫ ∞0
2(1 + α20)1/2hjM (zj)hjM (e∗j )dzj , j = 1, ...,m.
We now construct hjM . Suppose FjM (·) denotes the cdf of fjM (·). h0j is continuous and symmetric
(around 0) density. Clearly, h0j is increasing R− and decreasing on R+. We define weights as in
Wu and Ghoshal (2008). Suppose t1 > 0 and t2 > 0 such that h0j(t1) = a1 and h0j(t2) = b1, where
0 < b1 < 1 and b1 < a1 < h0j(0). For given M , let M1 and M2 be such that M1M ≤ t1 ≤ M1+1
M and
M2M ≤ t2 ≤
M2+1M . Set
w∗ji =
iM {h0j
(iM
)− h0j
(i+1M
)}, 1 ≤ i < M1,
M1M {h0j
(M1M
)− a1}, i = M1,
M+1M {a1 − h0j
(M+1M
)}, i = M1 + 1,
iM {h0j
(i−1M
)− h0j
(iM
)}, M1 + 1 < i ≤M2,
iM {h0j
(i−1M
)− h0j
(iM
)}, i ≥M1 + 1
We define h∗jM (e) =∑∞
1 w∗jiK(e; iM ), where K(e; θ) = 1
2θ1(−θ≤e≤θ). By the continuity of
h0j , h∗jM (e) converges to h0j(e) pointwise. However,
∑∞1 w∗ji 6= 1 and h∗jM (e) is not a pdf. We
define wji = w∗ji1−
∑M1−11 w∗ji−
∑∞M2+1 w
∗ji∑M2
M1w∗ji
for M1 ≤ i ≤ M2 then∑∞
1 wji = 1. Let hjM (e) =
32
∑∞1 wjiK(e; i
M ), where K(e; θ) = 12θ1(−θ≤e≤θ). Observe that
h∗jM (e)− hjM (e) =
( ∞∑1
w∗ji −∞∑1
wji
)
=
(1−
∑M1−11 wji −
∑∞M2+1wji∑M2
M1w∗ji
− 1
)M2∑M1
w∗jiM
2i
≤
(1−
∑M1−11 wji −
∑∞M2+1wji∑M2
M1w∗ji
− 1
)M2∑M1
w∗ji
M
2M1
=
1−M1−1∑
1
wji −∞∑
M2+1
wji −M2∑M1
w∗ji
M
2M1
=
(1− 1
M
∞∑1
h0j(i/M)− a1
M
)M
2M1→ 0
as M →∞, by definition of Riemann integral. Thus hjM (e) converges to h0j . Let M be large such
that the RHS of the equation is less than a2 Define fjM (e) = 2
b
∫∞0 hjM (zj)hjM (
ej−azjb )dzj , where
a = α√1+α2
and b = 1√1+α2
. Hence,
| log fjM (ej)| = | log
(2
b
)+ log
∫ ∞0
hjM (zj)hjM (e∗j )|
≤ | log
(2
b
)|+ | log
∫ ∞0
hjM (zj)hjM (e∗j )|
Since hjM → h0j pointwise, by construction of hjM we have, c1h0j(zj) ≤ hjM (zj) ≤ c2h0j(zj),
and c1h0j(e∗j ) ≤ hjM (e∗j ) ≤ c2h0j(e
∗j ). Thus we have
c1
∫ ∞0
h0j(zj)h0j(e∗j ) ≤
∫ ∞0
hjM (zj)hjM (e∗j ) ≤ c2
∫ ∞0
h0j(zj)h0j(e∗j ).
Hence we also have | log fjM (ej)| < ∞. If f0 ∈ FKL, we have∫f0j | log f0j | < ∞ for all j. Since
| log h0j(ej)| is h0-integrable, using dominated convergence theorem, we have∫h0j log
h0jhjM→ 0 as
M → ∞. | log fjM (ej)| is bounded above by an integrable function and hence∫f0j log
f0jfjM→ 0
as M → ∞. Below we show that f (M) → f0 pointwise and construct an f0 integrable upper
bound of gM = log f0f (M) . Since, fjM → f0j pointwise for j = 1, ...,m, by Scheffe’s theorem
33
supe∈R|FjM (e) − F0j(e)| → 0 as M → ∞. φ and Φ being continuous in their arguments, we
conclude that fM → f0 pointwise. Observe that
|gM | ≤ | log f0|+ | log f (M)|
and
| log f (M)| ≤ | log(2m|J |)|+ (12)
| log
∫Rm+
m∏j=1
hjM (zj)hjM (e∗j )1
|ΣM |1/2exp
[−1
2H∗′M (e∗)(Σ−1
M − I)H∗M (e∗)
]dz|.
Let
IM = 1|ΣM |1/2
exp[−1
2H∗′M (e∗)(Σ−1
M − I)H∗M (e∗)].
Then
log(IM ) = −1
2log |ΣM | −
1
2H∗′M (e∗)(Σ−1
M − I)H∗M (e∗)
= C1 −1
2(H∗M (e∗)−H∗0 (e∗))
′(Σ−1
M − I)(H∗M (e∗)−H∗0 (e∗))
− 1
2H∗′
0 (e∗)(Σ−1M − Σ−1
0 )H∗0 (e∗)− 1
2H∗′
0 (e∗)(Σ−10 − I)H∗0 (e∗) (13)
We have H∗M = (H∗1M , ...,H∗mM ) and H∗0 = (H∗01, ...,H
∗0m), where H∗jM (e∗j ) = Φ−1{HjM (e∗j )} and
H∗0j(e∗j ) = Φ−1{H0j(e
∗j )} for j = 1, ...,m. Using the Taylor series expansion of H∗jM around H∗0j we
have for ζ ∈ [0, 1]
H∗jM = H∗0j +HjM (e∗j )−H0j(e
∗j )
φ1(H0j(e∗j )) +φ′(ζ){HjM (e∗j )−H0j(e
∗j )}
φ2(ζ)
Hence, |H∗jM − H∗0j | → 0 uniformly in e∗j and the first term in (13) is 0 as M → ∞. As ρM →
ρ0 as M → ∞, the m eigenvalues of Σ−1M converge to the m eigenvalues of Σ−1
0 . Therefore,
12H∗′0 (e∗)(Σ−1
M − Σ−10 )H∗0 (e∗) ≤ maxj |λ0j − λMj |H∗
′0 H
∗0 . Thus, there exists constants C1, C2 > 0
such that
exp{−(C1 + C2| log(I0)|)} ≤ IM ≤ exp{C1 + C2| log(I0)|}.
34
We know that hjM (·)→ h0j(·) for all j as M →∞. Hence
| log f (M)(e)| ≤ | log(2m|J |)|+ | log Ψ|, with
∫f0| log f (M)(e)| <∞.
By dominated convergence theorem, we conclude that∫f0 log f0
f (M) → 0 as M → ∞ and FKL ⊂
KL(Π).
Appendix B
Proof of Theorem 1
Suppose for any two densities g1 and g2, K(g1, g2) =∫R g1(w) log{g1(w)/g2(w)}dw and V (g1, g2) =∫
R g1(w)[log+{g1(w)/g2(w)}]2dw, where log+(u) = max(log(u), 0). Set Ki(f, β) = K(f0i, fβi) and
Vi(f, β) = V (f0i, fβi). The proof of Theorem 1 follows from (Pati and Dunson, 2014) and (Tang
et al., 2015) with minor changes under the condition that there exist test functions {Φn}∞n=1, sets
Θn =Wε(f0)×Θβn, n ≥ 1, and constants C1, C2, c1, c2 > 0 such that
1.∑∞
n=1E∏ni=1 f0i
Φn <∞
2. sup(f,β)∈Ucn∩Θn E∏ni=1 fβi
(1− Φn) ≤ C1e−c1n
3. Πβ(Θcβn) ≤ C2e
−c2n
4. For all δ > 0 and for almost every data sequence {yi, xi}∞i=1,
Π{(f, β) : Ki(f, β) < δ∀i,∑∞
i=1Vi(f,β)i2
<∞} > 0
To verify the above conditions, we construct sieves Θn =Wε(f0)×Θβn, where Θβn = {β : |β| < Tn}
with Tn = O(√n).
35