Download pdf - Multivariate Skewed Responses: New Semiparametric Regression …debdeep/papers/MSMB.pdf · 2017-07-25 · In recent times, available Bayesian and likelihood based methods to model

Multivariate Skewed Responses: New Semiparametric Regression

Model and a Bayesian Recourse

Abstract

For many real-life studies with skewed multivariate responses, the level of skewness

and association structure assumptions are essential for evaluating the covariate effects

on the response, and its predictive distribution. We present a novel semiparametric

model leading to a theoretically justifiable semiparametric Bayesian analysis of multi-

variate skewed responses. Similar to multivariate Gaussian densities, this multivariate

model is closed under marginalization, and allows a wide class of multivariate associa-

tions with meaningful physical interpretations of skewness levels and covariate effects

on the marginal density. Compared to existing models, our model enjoys several de-

sirable properties, including scalable Bayesian computing via available software, and

assurance of consistent Bayesian estimates of parameters and the nonparametric error

density under a set of plausible prior assumptions. We illustrate the practical advan-

tages of our methods over existing parametric alternatives via application to a clinical

dataset assessing periodontal disease, and a simulation study.

Keywords: Dirichlet process; Kernel density; Markov Chain Monte Carlo; Periodontal dis-

ease; Skewed error

1

1 Introduction

Particularly in biomedical and econometric studies, we often come across highly skewed

multivariate responses. For example, the preliminary analysis of a periodontal disease (PD)

study of Type-2 diabetic Gullah-speaking African-Americans (Fernandes et al., 2009) clearly

reveals that the two major clinical responses of subject-level PD status (Darby and Walsh,

2014), the mean (average) periodontal pocket depth (PPD) and the mean clinical attachment

level (CAL) within each patient, are highly skewed, with possibly different levels of skewness.

A correct modeling of these two PD bio-markers and their associations with other covariables

are important, because PD has strong links to many types of diseases, including diabetes

(Casanova et al., 2014), heart disease (Mattila et al., 1989) and cancer (Michaud et al., 2008).

A typical multivariate regression analysis that ignores skewness often fails to effectively

describe the effects of patient-level covariates on skewed responses, and may even lead to

wrong models and predictive densities. As an alternative, quantile regression (QR) methods

(Koenker and Bassett Jr, 1978; Koenker, 2005) may be effective, however, their estimation

mostly focus on a single pre-specified quantile, and neither evaluate the levels of skewness

nor estimate the covariate effects on the whole distribution. Furthermore, the methods based

on transforming the multivariate skewed responses to symmetry have difficulties in choosing

a suitable transformation and interpreting the effects of the covariates on the original non-

transformed response.

This paper addresses the major challenges and limitations of existing models and

model-based regression analysis of multivariate skewed response data. First, existing long

list of approaches to handle multivariate skewed responses are mostly restricted to para-

metric models (Azzalini and Dalla Valle, 1996; Azzalini and Capitanio, 2003; Ferreira and

Steel, 2007; Fernandez and Steel, 1996; Genton, 2004), and in particular, center around

2

the parametric family of skewed distributions pioneered by Azzalini (1985). Other variants

of these parametric models include hidden truncation models (Arnold and Beaver, 2000),

multivariate skew-elliptical distributions (Branco and Dey, 2001; Sahu et al., 2003), and

skew-normal/independent distributions (Lachos et al., 2009; Bandyopadhyay et al., 2010,

2012). Unlike these models, our new semiparametric multivariate model in Section 2 has

a skewed nonparametric error distribution. Second, an elegant modeling assures that the

model class is closed under marginalization, that is, any subset of the response vector has the

distribution from the same multivariate class of the original vector, however, with different

parameters (like the property of multivariate normal density). For the majority of existing

parametric multivariate skewed proposals, the marginal densities either do not come from

the same class or the parameters (say, variance and skewness) of the marginal density of each

component depend on the parameters of the rest of the components. This is a major obstacle

for component-wise interpretation of the covariate effects, and particularly for specification

of priors which are typically based on available prior opinion about each component of the

response vector. We advocate a latent variable based model that is closed under marginal-

ization with marginal skewness and variance that do not depend on the parameters of other

components. This allows a meaningful interpretation of component-wise skewness and co-

variate effects. Finally, accommodating a flexible class of multivariate associations (such

as antedependence, toeplitz, or autoregressive structures) via a covariance matrix within a

multivariate skewed framework is a major challenge. A recent variation of the multivariate

skew-normal model (Chang and Zimmerman, 2016) accommodates the antependence asso-

ciation, however, at the cost of a strict condition on the vector of skewness parameters.

Our semiparametric model allows a broader class of multivariate associations (including the

aforementioned structures) within the response vector, without imposing any restriction on

skewness parameters.

3

In recent times, available Bayesian and likelihood based methods to model multivariate

skewed responses include an exclusive use of parametric models, lack of theoretically justi-

fied semiparametric methods, and lack of a computational tool using standard software. To

address these practical concerns, we present a carefully executed Bayesian semiparametric

regression methodology that leads to easy implementation via Markov Chain Monte Carlo

(MCMC) techniques using easily available softwares. Our model eases the prior specifications

by allowing marginal interpretations of the parameters for each component. Also, to the best

of our knowledge, we are the first to present a semiparametric Bayesian method for multi-

variate skewed responses with desirable asymptotic properties under practically reasonable

prior assumptions.

The rest of the paper proceeds as follows. After introducing our semiparametric pro-

posal in Section 2, we present the relevant components of Bayesian inference, such as prior

and likelihood specifications, and MCMC implementation routines in Section 3. In Sections

4 and 5, we assess the finite sample properties (via simulation studies), and the posterior

consistency of our Bayesian estimates, respectively. In Section 6, we apply our model to the

motivating dataset on PD. Finally, we present some concluding remarks in Section 7.

2 Semiparametric Model for Multivariate Skewed Re-

sponse

Let Yi = (Yi1, . . . , Yim) ∈ Rm be a row vector of skewed responses from subject i = 1, . . . , n.

For brevity, we use fixed number of components m per subject, however, our methods and

computational tools presented here can accommodate unequal number of components mi.

4

For the PD study, m = 2 for every subject. Our regression model is

Yij = µij + εij, for j = 1, . . . ,m, (1)

where µij = xijβj and xij = (1, xij1, . . . , xij,p−1) is the row vector of (p − 1) covariates for

Yij (with first term for the intercept), βj is the corresponding (p × 1) vector of regression

parameters, and εi = (εi1, . . . , εim) is the row vector of errors that follows an m-variate

multivariate density fm(ε), with possibly skewed marginal densities.

We specify a semiparametric class of skewed multivariate densities for fm(ε) using

two independent (m × 1) vectors of latent variables, Z1i = (Z1i1, . . . , Z1im)T and Z2i =

(Z2i1, . . . , Z2im)T, where Z1ij and Z2ij have the same unknown nonparametric symmetric

marginal density hj(·) for j = 1, . . . ,m. However, (Z1i1, · · · , Z1im) are jointly independent

with joint density∏m

j=1 hj(z1ij), and Z2i = (z2i1, · · · , z2im) has the m-variate Gaussian Cop-

ula density (Nelsen, 2007)

Cm(Z2i | h,Σρ) = φm[Φ−1

1 {H1(z2i1)}, . . . ,Φ−11 {Hm(z2im)} | Σρ

]×

m∏j=1

hj(z2ij)

φ1

[Φ−1

1 {Hj(z2ij)}](2)

where φm(. | Σρ) is the m-variate Gaussian density with mean 0 and correlation matrix

Σρ with unknown parameter ρ, Φ−11 (.) is the inverse cdf of N(0, 1), Hj(·) is the cdf of the

unknown symmetric density hj(·), and h = (h1, . . . , hm). The copula model of (2) ensures

that Z2ij has the marginal density hj(·) (same as Z1ij). Our multivariate skew semiparametric

(MSS) model for the m-dimensional εi, denoted by εi ∼ MSS(0, h,Σρ, α), is given as εij =

αj(1 + α2j )−1/2|Z1ij| + (1 + α2

j )−1/2Z2ij for j = 1, . . . ,m. This is represented via matrix

notation as

εTi = Aα|Z1i|+ A∗αZ2i , (3)

5

where Aα and A∗α are diagonal matrices with jth diagonal elements αj(1 + α2j )−1/2 and

(1 + α2j )−1/2 respectively for α ∈ Rm. The multivariate semiparametric model of (3) has

the joint distribution of (2) with symmetric marginal density hj(·) only when αj = 0 for

j = 1, · · · ,m. We model the unknown nonparametric symmetric density hj(·) as a scale

mixture of symmetric (around 0) kernel densities K(· | σ), with scale σ as

hj(u) =

∫ ∞0

K(u | σ)dGj(σ), (4)

where Gj(·) is the unknown nonparametric mixing distribution with support in R+ = (0,∞).

A popular choice for K(· | σ) in (4) is N(0, σ2) with symmetric density hj(u) expressed as,

hj(u) =

∫ ∞0

φ1(u | σ)dGj(σ). (5)

Using the result of Feller (1971) (p. 158), when the kernel K(· | σ) is uniform density

with support (−σ, σ), hj(u) in (4) represents the class of all unimodal symmetric around

0 densities. The semiparametric model of (3) is similar to the “skewed shocks” model of

Sahu et al. (2003), except it has a new parametrization of weights (Aα, A∗α), nonparametric

(h1, . . . , hm) and multivariate copula modeling of Z1i (instead of parametric multivariate

elliptical density of (Z1i, Z2i)). Semiparametric multivariate model resulting from (3) is

valid even when no actual latent skewed shock vector |Z1i| is applicable in practice. Later,

we demonstrate that this latent vector representation leads to nice model properties, and

also facilitates implementation.

Replacing the nonparametric (multivariate) specification of latent variables (Z1i, Z2i)

in (3)-(4) with parametric multivariate mean zero normal densities Z1i ∼ Nm(0, D2σ) and

Z2i ∼ Nm(0, DσΣρDσ) with Dσ = diag(σ1, . . . , σm) and positive definite correlation matrix

6

Σρ, we obtain a parametric multivariate skewed response model. We call this a parametric

Gaussian mixture model (PGM), and denote it by εi ∼ PGM(0, α,Σ) to differentiate it from

the multivariate skew normal model of Azzalini and Dalla Valle (1996), and later, Chang

and Zimmerman (2016), among others. For this parametric multivariate model, the skewed

marginal distribution of εij is the univariate skew-normal of Azzalini (1985), denoted by

SN(0, αj, σj).

It is obvious that our latent variable based model in (3) is closed under marginalization

because the marginal density of ε(1)i is from the same semiparametric MSS density indexed

by (α(1),Σ11, h(1)), where the (m×1) dimensional response εi is partitioned as εi = (ε

(1)i , ε

(2)i ),

with corresponding partitions of α = (α(1), α(2)), (h1, · · · , hm) = (h(1), h(2)) and matrix Σ =Σ11 Σ12

Σ12 Σ22

. Unlike our MSS model as well as the parametric model of Sahu et al. (2003),

most of the existing multivariate parametric models are not closed under marginalization.

When εi follows a location 0 multivariate skew-normal (MSN) distribution of Azzalini and

Dalla Valle (1996) denoted by εi ∼ MSN(0, α,Σ), the corresponding fm(εi) is given by

fm(εi1, . . . , εim) = 2φm(εi|Σ)Φ1(εiα), (6)

with the skewness parameter vector α and association matrix Σ. The multivariate model

of (6) and other related models are essentially based on common univariate skewed shock

|Z∗i |, instead of the multivariate Z1i in (3). As a consequence, these models are closed un-

der marginalization only in a restricted sense. Unlike MSS and the multivariate normal

model, parameters (α∗(1),Σ∗(1)) of the marginal skew-normal density of ε

(1)i are different from

(α(1),Σ11), and in fact, these (α∗(1),Σ∗(1)) are even functions of (α(2),Σ22,Σ12). In particu-

lar, the scalar component εij has the univariate skew-normal distribution (Azzalini, 1985)

7

denoted by SN(0; αj, σj) with density

fj(εij) = 2φ1(εij|σj)Φ1(αjεij), (7)

where φ1(.|σ) is the N(0, σ2) density, (Σ)jj = σ2j , and the marginal skewness parameter αj is

αj =αj + (1/σ2

j )ΣT.j(−j)α(−j)

1 + αT(−j)Σ(−j,−j)α(−j) − (1/σ2

j )(Σ.j(−j)α(−j))2, (8)

where Σ.j is the j-th column of the covariance matrix Σ, R(−j) denotes the reduced vector

after deleting j-th element of the vector R, and Q(−j,−k) denotes the reduced matrix after

deleting j-th row and k-th column of matrix Q. It is obvious from (7)-(8) that the marginal

distribution of εij, including its mean, variance and skewness, clearly depends on all the

components of vector α and matrix Σ via αj. The magnitude and even the sign of αj can

be different from those of αj. For example, in the bivariate case of our PD study with εi =

(εi1, εi2) errors for PPD and CAL responses with α = (α1, α2) and Σ =

σ21 ρσ1σ2

ρσ1σ2 σ22

,

the marginal density of εi1 (PPD) is SN(0; α1, σ1) with α1 = {α1 +ρα2σ2/σ1}/{1 +α22σ

22(1−

ρ2)}, where ρ = Corr(εi1, εi2) = σ12/{σ1σ2}. Even the signs of α1 and α1 are different

whenever α21σ1 + ρα1α2σ2 < 0. For Bayesian analysis, in particular, this complicates the

prior specification for, say, (α1, σ1) based on available prior opinion about the marginal

skewness and variability of the first response PPD without using its complex form. This

model also imposes a restriction that εi1 is skewed if α2 6= 0, ρ 6= 0 even when α1 = 0. Using

the Theorem 1 of Chang and Zimmerman (2016), we can show that in general this model

imposes several restrictions on the association structure.

8

In particular, the marginal mean and the covariance of Yi for the MSS model are

E (Yij) = xijβj + {αj/(1 + α2j )

1/2}λj and var (Yi) = Vi, (9)

with diagonal and off-diagonal elements given by (Vi)jj = σ2hj−{α2

j/(1+α2j )}λ2

j and (Vi)jk =

ρjkσhjσhk/{(1 + α2j )(1 + α2

k)}1/2 for all j 6= k ∈ {1, ...,m}, where common expectations

E(|Z1ij|) = E(|Z2ij|) = λj and variances V ar(Z1ij) = V ar(Z2ij) = σ2hj in (9) are taken with

respect to the common marginal density hj(·) of Z1ij and Z2ij given in (4). The correlation

Corr(εij, εik) between a pair (εij, εik) within cluster i is ρjkσhjσhk/[{(1+α2j )σ

2hj−α2

jλ2j}{(1+

α2k)σ

2hk − α2

kλ2k}]1/2.

Figure 1 (right panel) presents the contour plots of several bivariate (m = 2) versions

of the parametric Gaussian mixture densities PGM(0, α,Σ) with (3) for σ1 = σ2 = 1,

ρ = σ12/(σ1σ2) = 0.7 and common α1 = α2 = α0. Note that the common marginal density

of ε1 and ε2 is univariate SN(0;α0, 1). It is obvious from these contour plots that for a fixed

ρ, the shape of the joint density changes noticeably when α0 = 0 (the elliptical contour plot

of the bivariate normal density) to α0 = 2, 5, 10, with the marginal densities becoming more

positively skewed.

To compare the shape of parametric Gaussian mixture model density with the bivariate

skew-normal density of Chang and Zimmerman (2016), Figure 1 (left panel) also presents

the contour plots of the density of (6) for m = 2, ρ = 0.7, σ1 = σ2 = 1 and for the same

set of four values of α0. Like the contour plots of the corresponding parametric bivariate

Gaussian mixture model densities, the shape of the contours for the bivariate skew-normal

densities becomes increasingly non-elliptical as α0 increases, however, even for a moderately

large α0, the shapes of the contours of the bivariate skew-normal are noticeably different

from the contours of the bivariate parametric Gaussian mixture densities. Also, increasing

9

value of α0 does not affect the skewness of the marginal density of the bivariate skew-normal

density as profoundly as it affects the skewness of the marginal density of the parametric

Gaussian mixture density. For the multivariate skew-normal model of (6), there is even a

moderately negative association in the left tail even when the association parameter ρ is 0.7.

This points to the flexibility of our multivariate model in (3), even in the parametric case.

For α1 = α2 = 10, ρ = −0.7 and σ1 = σ2 = 1, Figure 2 compares the marginal density

of the bivariate (that is, m = 2) parametric model of (6) (in left panel) with corresponding

marginal density of the bivariate parametric Gaussian mixture model (right panel) via the

contour plots and density histograms of their corresponding marginal densities. While the

positive skewness of both components of bivariate Gaussian mixture density is evident from

the contour plot in Figure 2(b), the contour plot of bivariate skew-normal density in Figure

2(a) is nearly elliptical. Also, the marginal density histogram and boxplot of bivariate skew-

normal model in Figure 2(c) and Figure 2(e) appear to be approximately symmetric while

those of bivariate Gaussian mixture density in Figure 2(d) and Figure 2(f) demonstrate

substantial marginal skewness as expected for the chosen α as high as 10. This illustrates

the fact that, depending on the other parameters of the model, the marginal skewness of the

multivariate skew-normal model of (6) may be restricted compared to that of our model.

3 Likelihood, Prior Specifications and Implementation

Consider the linear regression model of (1) for n independent multivariate responses Yi

and matrix of covariates Xi =(xTi1, . . . , x

Tim

)from i = 1, . . . , n subjects. We now develop

a semiparametric Bayesian inference procedure for the model with the multivariate error

defined in (3) with kernel mixture of symmetric densities hj(·) in (4). Given observed data

10

(a) α0 = 0 (b) α0 = 0

(c) α0 = 2 (d) α0 = 2

(e) α0 = 5 (f) α0 = 5

(g) α0 = 10 (h) α0 = 10

Figure 1: Contour plots of bivariate version of the parametric Gaussian mixture (PGM) modeldensity (right panel) and bivariate skew-normal densities (left panel) for association parameterρ = 0.7, and common scale parameter σ1 = σ2 = 1 and different values of common skewness α0.

11

(a) (b)

(c) (d)

(e) (f)

Figure 2: Contour plot of bivariate skew-normal density (panel a) with histogram (panel c) andboxplot (panel e) of the corresponding common marginal density; Contour plot of bivariate versionof the parametric Gaussian mixture (PGM) model density (panel b) with histogram (panel d)and boxplot (panel f) of corresponding marginal density. Both bivariate densities use associationparameter ρ = −0.7, common scale parameter σ1 = σ2 = 1 and common α1 = α2 = 10.

12

D = {yi, Xi; i = 1, . . . , n}, the likelihood function of Θ = (β, α, ρ, σ,G) is,

L(Θ | D) =n∏i=1

fm(yi −Xiβ), (10)

where the joint density fm of the error-vector εi = yi − Xiβ has a complicated analytical

expression, because εi is expressed as a function of latent variables in (3). We will later

describe how to handle this likelihood for the purpose of data analysis. After specifying the

likelihood in (10), the priors are specified as follows.

(i) We suggest independent multivariate normal priors π1(β) and π2(α) for parameter

vectors β and α.

(ii) The joint prior distribution for the unknown parameter vector ρ of the paramteric

association (correlation) matrix in the copula density in (2) is π3(ρ). Depending on

the dimension of ρ, this π3(ρ) can be either an univariate or multivariate density.

(iii) The joint prior process π4(G) of (G1, · · · , Gm) is a product of m independent Dirichlet

processes (Ferguson, 1974), DP(G0j, C) for j = 1, · · · ,m, with pre-specified prior mean

G0j(·) of unknown Gj(.) and pre-specified concentration C around G0j. So, specifying

a flexible prior on Gj(·) induces a flexible prior on the nonparametric hj(·) in (4).

Hence, the joint posterior density is

p(Θ | D) ∝ L(Θ | D)π1(β)π2(α)π3(ρ)π4(G) . (11)

For the DP(G0j, C) the prior “guess” (prior mean) G0j of Gj is specified based on the

prior guess h0j(u) =∫∞

0K(u|σ)dG0j(σ) of the marginal density hj(u) of Z1ij in (4). The

concentration (or precision) parameter C is the measure of uncertainty of Gj around G0j;

13

the larger the value of C, the closer the sample-paths of hj(u) are to the prior “guess” h0j(u)

and smaller values of C allow sample-path of hj(u) to be very different from h0j(u).

Even though, the posterior in (11) and even the likelihood in (10) are difficult to

compute analytically, we implement a parametric as well as a semiparametric likelihood

based Bayesian analysis using Markov Chain Monte Carlo (MCMC) via standard statistical

software such as publicly available WinBUGS. For implementation within, say, WinBUGS, we

first need to specify the likelihood of (10) as a standard distribution available in WinBUGS.

For this purpose, the likelihood of (10) is the product of∫Cm(A∗−1

α {yi − Xiβ − Aα|Z1i|} |

h,Σρ)[∏m

j=1 hj(Z1ij)] dZ1i using (3), where the integral is taken over the support Rm of latent

vector Z1i, and hj(·) is the marginal density of independent Z1ij. Also, Cm(· | h,Σρ) is the

multivariate density in (2). For example, when hj(.) is parametric, say, N(0, σ2j ) density, we

implement the likelihood via two following steps: (1) (Wi1, . . . ,Wim) for i = 1, · · · , n are

mean-zero multivariate normal Nm(0,Σρ), where Wij = Φ−11 [Hj{(1 + α2

j )1/2(yij − Xijβ) −

αj|Z1ij|}], Φ−1 is the inverse-cdf of standard normal, and Hj(·) = Hj(·|σj) is the cdf

corresponding to the parametric density hj(·|σj); (2) Z1ij are independent hj(·|σj). The

parametric analysis further needs independent parametric prior distributions σj ∼ Gj(·|γj)

for j = 1, · · · ,m given (possibly unknown) hyperparameter γj with known hyper-prior

πγ(γ1, · · · , γm). For the special case of the parametric Bayesian analysis of model (3) with

a parametric N(0, σ2j ) density φ(·|σj) for Z1ij and Z2ij, the A∗−1

α {yi − Xiβ − Aα|Z1i|} in

(10) is essentially mean zero m-variate normal Nm(0,Ψ) density with covariance matrix

Ψ = DσΣρDσ, and Z1i ∼ Nm(0, D2σ), where Dσ = diag(σ1, ..., σm).

For a semiparametric Bayesian analysis with a particular symmetric density kernel

K(z|σ) for the nonparametric kernel-mixture densities h = (h1, . . . , hm) in (4), we implement

the likelihood contribution via following three steps: (1) (Wi1, . . . ,Wim) for i = 1, · · · , n are

mean-zero multivariate Nm(0,Σρ), where Wij = Φ−11 [K−1

ij {(1+α2j )

1/2(yij−Xijβ)−αj|Z1ij|}],

14

where K−1ij is the inverse-cdf of K(·|σij); (2) Z1ij for j = 1, . . . ,m are independent with

density K(·|σij); (3) σij for j = 1, . . . ,m are independent with nonparametric cdf Gj with

Dirichlet prior DP (G0j, C). The first two above steps of the likelihood for MCMC can

be easily modified to use, say, uniform kernel for K(· | σ) for nonparametric density hj(.)

in (4) by letting A∗−1α {yi −Xiβ − Aα|Z1i|} = ({2Φ1(Wi1)− 1}σi1, ..., {2Φ1(Wim)− 1}σim)T,

where (Wi1, ...,Wim) ∼ Nm(0,Σρ) and Z1ij follows independent uniform density with support

(−σij, σij).

For convenient MCMC implementation, we use the constructive definition of the Dirich-

let Process (Sethuraman, 1994) truncated at a user specified finite number, say, K com-

ponents for prior of Gj. Suppose (δ1, . . . , δK) are generated independently from G0j and

(B1, . . . , BK−1) are generated independently from Beta(1, C), and ω1 = B1, ωh = Bk

∏l<k(1−

Bl) for k = 2, . . . , K − 1 and ωK = 1 −∑K−1

l=1 ωl then, the sample path of Gj is approxi-

mated by Gj =∑K

l=1 ωlIδl when K is large. These likelihood steps and priors facilitates easy

Bayesian semiparametric implementation for either the Gaussian or uniform kernels using

publicly available freewares.

4 Simulation Study

To assess the finite sample properties of our Bayesian estimates, we conduct a small scale

simulation study. We consider N = 500 replications of bivariate response data with sample

size n = 50. The bivariate skewed errors (εi1, εi2) are generated from the parametric Gaussian

mixture version of the model in (3) using α = (α1, α2) = (2, 2), Z1i ∼ N2(0, I) and Z2i ∼

N2(0,Σρ), where Σρ is a (2× 2) correlation matrix with ρ12 = 0.7. The regression structure

for the simulation model is Yij = β0+xij1β1+β2xij2+εij, where (β0, β1, β2) = (2, 0.5, 0.5) and

xij1 and xij2 are sampled independently such that xij2 ∼ N(0, 1) and xij1 = 1 with probability

15

0.5 and xij1 = −1 otherwise, for j = 1, 2. The simulation model has the same regression

parameters (common β) for both components, however, it may not be a realistic assumption

for our motivating study. We fit the semiparametric Gaussian mixture model (SPGM),

where symmetric densities h1(·) and h2(·) of (4) are expressed as unknown scale mixtures

of Gaussian kernel as in (5) with corresponding mixture distribution G1 and G2. We also

fit the parametric Gaussian mixture model described in Section 2, assuming Z1ij ∼ N(0, σ2j )

for j = 1, 2 and Z2i ∼ Nm(0,Σ), with (Σ)11 = σ21, (Σ)22 = σ2

2 and (Σ)12 = ρ12σ1σ2. We use

independent N(0, 1) priors for β0, β1 and β2, and also for the skewness parameters α1 and

α2. For the DP(G01, C) and DP(G02, C) priors corresponding to G1 and G2 respectively,

we assume the prior “guess” G01 and G02 to be a Gamma(2, 1) density and C = 1. For

the parametric Gaussian mixture model, we specify Gamma(2, 1) as the independent prior

distribution for σ21 = σ2

2. To test the performance of our semiparametric methods, we

use the mean squared error MSE(θ) =∑N

k=1(θk − θ)2/N , where θk is the estimate of the

parameter θ via analyzing the simulation replicate k = 1, 2, . . . , N of the dataset. Similarly,

we also use the average bias∑N

k=1(θk−θ)/N and the coverage probability of the 95% interval

estimated for the competing models. For the purpose of comparison, we obtain estimates

for the parameters using the Generalised Estimating Equations (GEE) approach (Liang and

Zeger, 1986) in conjunction with the sandwich variance estimate approach; for linear models,

GEE is equivalent to a mixed model with a sandwich variance estimate. Please note that,

using (9), the resulting marginal intercept for component j = 1, 2 of the simulation model

is β∗0j = β0 − {αj/(1 + α2j )}1/2λj, and thus the GEE estimate of the intercept is biased for

estimating β0. However, GEE will be unbiased for the regression coefficients β1 and β2. We

summarize the results of the simulation study in Table 1.

The simulation study reveals that our semiparametric Gaussian mixture model and

the parametric Gaussian mixture model yield smaller mean squared errors (at least 20% re-

16

Table 1: Simulation Results based on 500 replicates of the data comparing the Mean SquaredError (MSE), Standard Error (SE), Bias, and Coverage Probability (Coverage) for associa-tion parameter ρ12 = 0.7 sample size is n = 50. True parameter values are (β0, β1, β2, α1,α2) = (2, 0.5, 0.5, 2, 2).

SPGM PGM GEEβ0 β1 β2 α1 α2 β0 β1 β2 α1 α2 β0 β1 β2

100×MSE 1.702 0.376 0.432 16.22 29.995 1.461 0.359 0.397 18.07 14.88 56.897 0.479 0.54510×SE 0.948 0.616 0.636 2.361 3.436 0.766 0.600 0.627 2.966 2.776 0.687 0.632 0.64010×Bias 0.903 0.019 -0.123 -3.271 4.273 0.937 0.026 -0.079 -3.054 -2.688 7.511 0.093 -0.045Coverage 0.945 0.95 0.975 0.975 1 0.95 0.96 0.96 1 0.98 0.2 0.91 0.9

duction) and overall better coverage probability for 95% interval estimates when compared

to GEE based regression estimates. Unlike the popular repeated measures analysis method

using GEE, our methods provide estimates of skewness and better (more precise) interval

estimates of the regression parameters. We also see that in spite of the parametric Gaussian

mixture model being the the true simulation model, Bayesian estimates based on our semi-

parametric model are comparable with those from parametric model as far as MSE and bias

are concerned.

5 Posterior Consistency

For any theoretically defensible inference based on a complex semiparametric Bayesian

model, it is important to be assured that the posterior distribution of the parameters become

increasingly concentrated around the true parameter values as the sample size increases. This

necessitates the investigation of the asymptotic properties along with the finite sample prop-

erties. In this section, we provide sufficient conditions for posterior consistency, which is the

utmost important asymptotic property. For the sake of brevity, we present our theory for

the case where the dimension of the regression parameter, β is p = 1. However, our theory

can be extended to the general case where p > 1. We first investigate if the support of the

17

prior is large enough to cover all possible relevant densities. Suppose F denotes a class of all

univariate asymmetric unimodal residual densities and C{(F)m× (0, 1)} denotes a collection

of all m-dimensional residual densities f0 with unknown correlation ρjk ∈ (0, 1) having the

form,

f0(e) =

∫Rm+

2m|J |m∏j=1

h0j(zj)h0j(e∗j)

φ1

[Φ−1{H0j(e∗j)}

]φm [Φ−1{H01(e∗1)...Φ−1{H0m(e∗m)}|Σρ

]dz,

with the j-th marginal density given by,

f0j(ej) =

∫R+

2√

1 + α2jh0j(zj)h0j(e

∗j)dzj j = 1, ...,m

where h0j(·) be the true (unknown) symmetric (around 0) density, H0j(·) is the corresponding

cumulative distribution function and e∗j =√

1 + α2jej − αzj for j = 1, ...,m and

|J | =

∣∣∣∣∣∣∣I 0

A∗−1α −A∗−1

α Aα

∣∣∣∣∣∣∣Let Π denote a prior on C{(F)m×(0, 1)} such that Π = (πF)m⊗π3(ρ), where πF is a prior on

F defined by Gj ∼ DP(G0j, C) for j = 1, ...,m and α ∼ π2(α) and π3(ρ) is a prior on (0, 1).

For ease of exposition, we will assume that α is known. To show that posterior consistency

holds at true density f0, we first define the Kullback-Liebler divergence as KL(f0, f) =∫Rm f0 log(f0/f) and Kullback-Leibler neighborhood of size ε as κε(f0) = {f : KL(f0, f) < ε}.

For any prior Π∗ on the density space F∗, f0 is said to be in the Kullback-Leibler support

of Π∗ denoted by KL(Π∗) if f0 ∈ S, where S = {f0 : Π∗(f : KL(f0, f) < ε)∀ε > 0}. Define

FKL ={f0 ∈ C{(F)m × (0, 1)},

∫f0| log f0| <∞

}. We characterize the Kullback-Leibler Π

in the following lemma with its proof in Appendix A.

18

Lemma 1 1. FKL ⊂ KL(Π) if G0j is defined on support R+ for all j = 1, . . . ,m and G0j

is absolutely continuous with respect to Lebesgue measure and supp(π3(ρ)) = (0, 1).

Suppose, Dn is the observed data with sample size n, and β0 and f0 are the true val-

ues of the regression parameter and the multivariate error density respectively. We now

put forth the sufficient conditions to ensure that as the sample size n increases, the pos-

terior distributions of the parameter β and the error density f are concentrated around a

small neighborhood around their true values. We are essentially interested in making in-

ference on the regression parameter β so we define a strong L1 neighborhood around the

true value, β0 as the set S(β0) = {β : |β − β0| < δ} and weak neighborhood of around f0 as

Wε(f0) ={f ∈ F : |

∫ϕf −

∫ϕf0| < ε

}for a bounded continuous function ϕ. Assume that

U = Wε(f0) × S(β0). The following theorem gives the result on posterior consistency and

the details of the proof are given in the Appendix B.

Theorem 1 Suppose (f0, β0) ∈ FKL × R. Consider a prior Π = (ΠC ⊗ Πβ) on FKL × R.

Then Π{(fm, β) ∈ U c|Dn} → 0 a.s. under the true data generating distribution, Pf0,β0 that

generates data Dn.

Technical details of the proof are given in the Appendix B.

6 Application: GAAD Study

The motivating data example for this paper comes from a clinical study (Fernandes et al.,

2009) initially conducted to explore the association between PD and diabetes level (de-

termined by Hba1c, or ‘glycosylated hemoglobin’) in Type-2 diabetic Gullah-speaking (or

simply Gullah) African-Americans (GAAD). It is well documented that about 5% to 15%

of world population and about 50% U.S. adults (over the age of 35) are susceptible to peri-

19

odontitis (Pourabbas et al., 2005; Oliver et al., 1998), a major cause of progressive bone loss

around the tooth..

The GAAD study aims to evaluate the distribution of PD status of this population,

as well as quantify the effects of subject-level covariates such as age (in years), gender

(1=Female, 0=Male), Body Mass Index or BMI (obese=1 if BMI >= 30, obese=0 if BMI <

30), glycemic level or HbA1c (1= High, 0 = controlled) and smoking status(1 = smoker, 0

= never smoker) on the PD status (Reich and Bandyopadhyay, 2010), as measured by the

PPD and CAL in rounded whole millimeters (Darby and Walsh, 2014). For our analysis,

we select n = 288 patients (subjects) with complete covariate information. About 31% of

the subjects are smokers. The mean age of the subjects is about 55 years with a range from

26-87 years. Female subjects seem to be predominant (about 76%) in our data, which is not

uncommon in Gullah population (Johnson-Spruill et al., 2009). About 68% of subjects are

obese (BMI >= 30) and 59% are with HbA1c = 1 (with blood sugar level higher than 7

percent), an indicator of high glycemic level.

Here, our two highly skewed responses from each subject are mean PPD and mean

CAL, which are averages of corresponding measurements from all sites (6 sites per tooth)

of all teeth of the subject. The underlying statistical goals are to evaluate the regression

functions explaining the covariate-response relationships of this bivariate skewed responses

and understand the relationship between two response. However, the validity of the esti-

mation of the regression function under a standard linear mixed model framework depends

on the usual Gaussian assumption about the responses. After fitting a linear mixed model

to the PD data using the nlme package in R, we present, in Figure 3 [panels (a)-(f)], the

histograms of the mean PPD and mean CAL, the histogram and Q-Q plot of the empirical

Bayes estimates of the random effects, and the histograms of the residuals for two responses.

These plots clearly reveal evidence of high and varying levels of skewness (i.e., departures

20

(a) (b)

(c) (d)

(e) (f)

Figure 3: Analysis of GAAD data using linear mixed model: Histograms of the PeriodontalPocket Depth (PPD) responses (panel a) and Clinical Attachment Level (CAL) responses (panelb); Histogram (panel c) and Q-Q plot (panel d) of the empirical Bayesian estimates of the randomeffects to assess the validity of the assumption of normal random effects density; Histograms ofresiduals for PPD responses (panel e) and CAL responses (panel f) obtained after fitting a linearmixed model.

21

from the Gaussian errors assumption) for the two responses, and also for the random subject

effects. In this section, we illustrate the application of the models and Bayesian method de-

veloped in Sections 2 and 3 of this article. We compare the results from fitting the following

4 competing models for the random errors, which are as follows:

1. Semiparametric Gaussian Mixture model (SPGM), which assumes symmetric densities

h1(·) and h2(·) of (4) are expressed as unknown scale mixtures of Gaussian kernel as

in (5).

2. Parametric Gaussian Mixture model (PGM), which assumes that the latent vectors

Z1i ∼ N2(0, D2σ) and Z2i ∼ N2(0,Σ), where D2

σ = diag(σ21, σ

22) and (Σ)11 = σ2

1, (Σ)22 =

σ22 (Σ)12 = ρσ1σ2.

3. Skew-t model (ST) assumes a Skew-t (Sahu et al., 2003) distribution with ν degrees of

freedom for the errors (εi1, εi2).

4. Bivariate Normal (BVN) model assumes a symmetric N2(0,Σ) distribution for (εi1, εi2),

ignoring the skewness.

We use practically non-informative independent N(0, 100) priors for all the regression

parameters (the components of β1 and β2), independent N(0, 100) priors for the skewness

parameters α1 and α2. We choose independent DP(G0j, C) for j = 1, 2 priors for the unknown

mixing distributions G1 and G2, which correspond to∫∞

0K(u|σ)dG0j(σ) for j = 1, 2. We set

the concentration parameter to C = 1, and assume same Gamma(8, 4) density for the prior

“guess” G01 and G02 of unknown G1 and G2 respectively. The prior guess Gamma(8, 4)

corresponds to the prior opinion that the ranges of Z1i and Z2i can be as narrow as ±2

to as wide as ±6 with high prior probability (with prior expectation of range ±4). This

prior guess for G1 and G2 matches with the prior distribution used for σ1 and σ2 of the

22

parametric Gaussian mixture model, skew-t and bivariate normal models. For the degrees of

freedom ν of the skew-t distribution, we choose Exp(0.1)I[2,∞), which is exponential density

truncated at 2 as the prior. For all four models, we use a uniform (-1,+1) as the prior for

ρ. It is important here to note that our choice of priors, although fairly non-informative,

is for illustrating the implementation of our statistical methods and may not represent the

prior opinion of the clinical investigators. We generated two parallel MCMC chains of size

350,000 and computed the posterior estimates after discarding the first 325,000 iterations.

To guard against the potential problems arising due to autocorrelation among the successive

iterations, we use thinning rate of 25. To assess the model convergence, we use the trace

plots, autocorrelation plots and the Gelman-Rubin R. The MCMC chain for parametric

Gaussian mixture model requires a larger number of iterations to converge as compared to

that of the semiparametric Gaussian mixture model. The GAAD data and the relevant R

and WinBUGS codes for the parametric and semiparametric Bayesian analysis are available

from the authors on request.

We use the Conditional Predictive Ordinate (cpo) statistic, (cpo)i =∫fm(yi −Xiβ |

Θ)p(Θ | D(−i))dΘ (Gelfand et al., 1992), to compare the model performances, where D(−i)

is the reduced data after deleting the ith observation from the full data D, and p(Θ | D(−i))

is the posterior density of Θ given D(−i). These cpoi for i = 1, . . . , n are obtained via the

Monte Carlo method using the MCMC samples from the full posterior distribution p(Θ | D).

The summary statistic using the cpoi is the log of the pseudo-marginal likelihood (lpml),

given by lpml =∑n

i=1 log (cpoi). We also use the Watanabe-Akaike information criterion

(waic) (Watanabe, 2010; Vehtari et al., 2016), considered a state-of-the-art Bayesian model

selection tool. The computations of the waic and lpml are convenient, based on MCMC

samples from the full posterior distribution p(Θ | D).

Models with larger lpml and smaller waic indicate more data support. Table 2

23

presents the computed lpml and waic values for the 3 competing models. These values sug-

gest the semiparametric Gaussian mixture model of (3)-(4) as the most appropriate model

followed by the parametric Gaussian mixture model. Recall that the parametric Gaussian

mixture model is the parametric case of the semiparametric Gaussian mixture model. The

lpml values for the skew-t and bivariate normal (existing parametric) models are -242.112

and -278.57 respectively, both substantially smaller than our new semiparametric and para-

metric models. These numbers suggest that our new modeling strategy is substantially more

appropriate for the GAAD data, compared to the competing parametric models.

Table 2: Comparison of the model selection measures for the semiparametric Gaussian mixture(SPGM), parametric Gaussian mixture (PGM), skew-t (ST), and bivariate normal (BVN) models.Two model selection measures used are the Watanabe-Akaike Information Criterion (waic), andthe log pseudo-marginal likelihood (lpml)

Model WAIC LPMLSPGM 424.7795 -74.644PGM 441.8912 -127.882ST 618.9393 -244.546BVN 654.5964 -300.315

The posterior estimates, standard deviations, and 95% credible intervals (CIs) of model

parameters obtained from fitting the parametric Gaussian and the semiparametric Gaussian

mixture models to the bivariate GAAD data responses (PPD and CAL) are given in Tables 3

and 4, respectively. The corresponding estimates (without the 95% CIs) from the skew-t and

the BVN models are also presented. The CIs for the skewness parameters α1 (corresponding

to PPD) and α2 (corresponding to CAL) do not contain 0 under any of the two selected

best models, and thus present substantial evidence of skewness for both responses. However,

under the skew-t model, the CIs for α1 overlaps with 0 and fails to detect skewness. Positive

values for the posterior mean of the skewness parameters indicate right skewness, as expected

from the histograms of marginal responses shown in Figure 3. The posterior means of the

24

effects of glycemic status (HbA1c) on both PPD and CAL (with the 95% CIs not covering

zero) are positive, implying that higher HbA1c may lead to substantially higher levels of

PD. Among smokers, there is a positive trend (not significant) to have higher PPD, and a

significantly higher CAL, compared to non-smokers. Also, significantly higher PPD and CAL

are observed for males, compared to females from both the semiparametric and parametric

Gaussian mixture models. However, under the bivariate normal and skew-t models, gender

effects do not have enough posterior evidence. On the whole, for most of the parameters, the

posterior standard deviation and associated 95% CIs suggest that our new Bayesian methods

provide more precise estimates for the model parameters. In terms of sensitivity analysis, we

observe that moderate changes in the choice of priors do not affect the parameter estimates

(both magnitude and sign) and the model comparison measures greatly.

Table 3: Posterior estimates of the regression parameters and the skewness parameter α1

corresponding to the response ‘periodontal pocket depth’ (PPD), obtained after fitting thesemiparametric Gaussian mixture (SPGM), parametric Gaussian mixture (PGM), paramet-ric skew-t (ST), and bivariate normal (BVN) models. SD and CI denote the posteriorstandard deviation and 95% credible intervals, respectively.

SPGM PGM ST BVN

Mean SD CI Mean SD CI Mean SD Mean SD

Intercept 2.1169 0.3143 (1.4708,2.6673) 2.1887 0.3881 (1.4807,2.9230) 2.0865 0.6557 2.1169 0.6298Age -0.0038 0.0039 (-0.0107,0.0041) -0.0058 0.0043 (-0.0137,0.0032) -0.0069 0.0076 -0.0098 0.0073Gender -0.2534 0.0891 (-0.4332,-0.0993) -0.2096 0.1219 (-0.4301,-0.0448) -0.1458 0.1861 -0.0811 0.1753BMI -0.1825 0.0843 (-0.3195,0.0018) -0.1378 0.1021 (-0.3769,0.0505) -0.2259 0.1402 -0.1056 0.1645Smoker 0.0599 0.0826 (-0.0993, 0.2219) 0.0746 0.0987 (-0.1270,0.2534) 0.1663 0.1767 0.1485 0.1696HbA1c 0.1789 0.0807 (0.0437,0.3565) 0.1331 0.0634 (0.0156,0.2618) 0.2191 0.1546 0.1701 0.1501α1 2.4224 0.2845 (1.8813,2.9644) 0.6968 0.1176 (0.4709,0.9378) 0.0687 0.0843 – –

7 Conclusion

In this paper, we presented a novel class of semiparametric multivariate stochastic models for

highly skewed error distributions. Our full model based approaches are substantially different

25

Table 4: Posterior estimates of the regression parameters and the skewness parameter α2

corresponding to the response ‘clinical attachment level’ (CAL) obtained after fitting thesemiparametric Gaussian mixture (SPGM), parametric Gaussian mixture (PGM), paramet-ric skew-t (ST), and bivariate normal (BVN) models. SD and CI denote the posteriorstandard deviation and 95% credible intervals, respectively.

SPGM PGM ST BVN

Mean SD CI Mean SD CI Mean SD Mean SD

Intercept 1.1876 0.4131 (0.3360,1.9991) 0.9896 0.4851 (-0.0245,1.8539) 1.4133 0.6813 1.1401 0.6485Age 0.0096 0.0048 (0.0005,0.0193) 0.0081 0.0057 (-0.0025,0.0198) 0.0 050 0.0078 0.0073 0.0075Gender -0.3913 0.0976 (-0.5903,-0.2228) -0.3556 0.1592 (-0.6681,-0.0743) -0.2505 0.1968 -0.3094 0.1872BMI -0.1498 0.1045 (-0.3283,0.0885) -0.1022 0.1229 (-0.3778,0.1188) -0.1936 0.1619 -0.1218 0.1725Smoker 0.2060 0.0925 (0.0166,0.3913) 0.2546 0.1145 (0.0259,0.4605) 0.2835 0.1865 0.3581 0.1813HbA1c 0.2593 0.1055 (0.0782,0.4865) 0.2392 0.0864 (0.0708,0.3981) 0.3005 0.1623 0.3143 0.1573α2 1.2973 0.2231 (0.9077,1.7416) 1.1646 0.1333 (0.9216,1.4275) 0.5294 0.0919 – –

from the estimating equations approaches of Ma et al. (2005), because our method uses full

likelihood and allows estimation and prediction of skewness and association. Also, the meth-

ods of Ma et al. (2005) and others use estimating equations for models with single skewing

function (essentially single skewed “shock” model). The parametric Gaussian mixture model

with skew-normal density for each marginal is a special case of our semiparametric model.

Our class of models can incorporate a large and flexible within-subject association, and the

association structure does not affect the marginal densities. Recently there has been some

confusion about how to model multivariate skewed densities via “single shock model” versus

“multiple shock model”. We show the superiority of the later approach initiated by Sahu

et al. (2003), and extended it to a semiparametric models. Our new semiparametric models

are readily amenable to Bayesian analysis through prior specifications for each component

and they do not involve any complex a priori relationship among parameters of different com-

ponents. Also, our latent variable representation simplifies the MCMC-based computation

and we present a novel trick to implement the likelihood within WinBUGS and other standard

software. The WinBUGS codes for the parametric and semiparametric Bayesian analysis are

26

available from the authors on request.

The Bayesian method described in this paper yields precise estimate of covariate effects

with meaningful physical interpretation. This is the first semiparametric full likelihood

based method for analysis of multivariate skewed response with full theoretical justification

provided via Bayesian asymptotic results. These methods for multivariate skewed responses

are more appropriate compared to quantile regression, when the goal is to understand the

effect of the covariates on the whole distribution (instead of certain quantiles and moments).

However, our methods can also estimate any desired quantile function (details omitted for

brevity). It also allows for seamless prediction, which is often important in biomedical

studies. The analysis of the GAAD data considered in this paper is primarily focused

on modeling the skewed bivariate response, and investigating its cross-sectional association

with a fixed number of subject-level covariates. Certainly, our proposed method can be

easily extended to scenarios with covariates missing-at-random. Furthermore, the methods

and tools discussed in this article can be extended to studies where the dimensions of the

response vector vary over subjects – leading to the popular ‘informative cluster size’ scenario

(Seaman et al., 2014). These will be pursued elsewhere.

References

Arnold, B. C. and Beaver, R. J. (2000), ‘Hidden truncation models’, Sankhya: The Indian

Journal of Statistics, Series A 62, 23–35.

Azzalini, A. (1985), ‘A class of distributions which includes the normal ones’, Scandinavian

journal of statistics 12(2), 171–178.

Azzalini, A. and Capitanio, A. (2003), ‘Distributions generated by perturbation of symme-

27

try with emphasis on a multivariate skew-t distribution’, Journal of the Royal Statistical

Society: Series B (Statistical Methodology) 65(2), 367–389.

Azzalini, A. and Dalla Valle, A. (1996), ‘The multivariate skew-normal distribution’,

Biometrika 83(4), 715–726.

Bandyopadhyay, D., Lachos, V. H., Abanto-Valle, C. A. and Ghosh, P. (2010), ‘Linear

mixed models for skew-normal/independent bivariate responses with an application to

periodontal disease’, Statistics in Medicine 29(25), 2643–2655.

Bandyopadhyay, D., Lachos, V. H., Castro, L. M. and Dey, D. K. (2012), ‘Skew-

normal/independent linear mixed models for censored responses with applications to Hiv

viral loads’, Biometrical Journal 54(3), 405–425.

Branco, M. D. and Dey, D. K. (2001), ‘A general class of multivariate skew-elliptical distri-

butions’, Journal of Multivariate Analysis 79(1), 99–113.

Casanova, L., Hughes, F. and Preshaw, P. (2014), ‘Diabetes and Periodontal Disease: a

two-way relationship’, British Dental Journal 217(8), 433–437.

Chang, S.-C. and Zimmerman, D. L. (2016), ‘Skew-normal antedependence models for

skewed longitudinal data’, Biometrika 103(2), 363–376.

Darby, M. L. and Walsh, M. (2014), Dental Hygiene: Theory and Practice, 4 edn, Elsevier

Health Sciences.

Feller, W. (1971), An Introduction to Probability Theory and Its Applications, Vol. 2, second

edn, Wiley, New York, NY.

Ferguson, T. (1974), ‘Prior distribution on spaces of probability measures’, The Annals of

Statistics 2, 615–629.

28

Fernandes, J., Wiegand, R., Salinas, C., Grossi, S., Sanders, J., Lopes-Virella, M. and Slate,

E. (2009), ‘Periodontal disease status in Gullah African Americans with type 2 diabetes

living in South Carolina’, Journal of Periodontology 80(7), 1062–1068.

Fernandez, C. and Steel, M. F. (1996), ‘On bayesian modelling of fat tails and skewness’,

Available at SSRN 821 .

Ferreira, J. T. and Steel, M. F. (2007), ‘A new class of skewed multivariate distributions

with applications to regression analysis’, Statistica Sinica 17(2), 505–529.

Gelfand, A. E., Dey, D. K. and Chang, H. (1992), Model determination using predictive

distributions with implementation via sampling-based methods, Technical report, DTIC

Document.

Genton, M. G. (2004), Skew-elliptical distributions and their applications: a journey beyond

normality, CRC Press.

Johnson-Spruill, I., Hammond, P., Davis, B., McGee, Z. and Louden, D. (2009), ‘Health

of Gullah families in South Carolina with Type-2 diabetes: Diabetes self-management

analysis from Project SuGar’, The Diabetes Educator 35(1), 117–123.

Koenker, R. (2005), Quantile Regression, number 38 in ‘Econometric Society Monographs’,

Cambridge University Press.

Koenker, R. and Bassett Jr, G. (1978), ‘Regression quantiles’, Econometrica 46(1), 33–50.

Lachos, V. H., Dey, D. K. and Cancho, V. G. (2009), ‘Robust linear mixed models with

skew-normal independent distributions from a bayesian perspective’, Journal of Statistical

Planning and Inference 139(12), 4098–4110.

29

Liang, K.-Y. and Zeger, S. L. (1986), ‘Longitudinal data analysis using generalized linear

models’, Biometrika 73(1), 13–22.

Ma, Y., Genton, M. G. and Tsiatis, A. A. (2005), ‘Locally efficient semiparametric esti-

mators for generalized skew-elliptical distributions’, Journal of the American Statistical

Association 100(471), 980–989.

Mattila, K. J., Nieminen, M. S., Valtonen, V. V., Rasi, V. P., Kesaniemi, Y. A., Syrjala,

S. L., Jungell, P. S., Isoluoma, M., Hietaniemi, K. and Jokinen, M. J. (1989), ‘Association

between dental health and acute myocardial infarction.’, Bmj 298(6676), 779–781.

Michaud, D. S., Liu, Y., Meyer, M., Giovannucci, E. and Joshipura, K. (2008), ‘Periodontal

disease, tooth loss, and cancer risk in male health professionals: a prospective cohort

study’, The lancet oncology 9(6), 550–558.

Nelsen, R. B. (2007), An introduction to copulas, Springer Science & Business Media.

Oliver, R. C., Brown, L. J. and Loe, H. (1998), ‘Periodontal diseases in the United States

population’, Journal of Periodontology 69(2), 269–278.

Pati, D. and Dunson, D. B. (2014), ‘Bayesian nonparametric regression with varying residual

density’, Annals of the Institute of Statistical Mathematics 66(1), 1–31.

Pourabbas, R., Kashefimehr, A., Rahmanpour, N., Babaloo, Z., Kishen, A., Tenenbaum,

H. C., Azarpazhooh, A., Mdala, I., Olsen, I., Haffajee, A. D. et al. (2005), ‘Position paper:

Epidemiology of Periodontal Diseases’, Journal of Periodontology 76(8), 1406–1419.

Reich, B. and Bandyopadhyay, D. (2010), ‘A latent factor model for spatial data with infor-

mative missingness’, Annals of Applied Statistics 4(1), 439–459.

30

Sahu, S. K., Dey, D. K. and Branco, M. D. (2003), ‘A new class of multivariate skew distri-

butions with applications to bayesian regression models’, Canadian Journal of Statistics

31(2), 129–150.

Seaman, S. R., Pavlou, M. and Copas, A. J. (2014), ‘Methods for observed-cluster inference

when cluster size is informative: A review and clarifications’, Biometrics 70(2), 449–456.

Sethuraman, J. (1994), ‘A constructive definition of dirichlet priors’, Statistica Sinica 4, 639–

650.

Tang, Y., Sinha, D., Pati, D., Lipsitz, S. and Lipshultz, S. (2015), ‘Bayesian partial linear

model for skewed longitudinal data.’, Biostatistics (Oxford, England) 16(3), 441–453.

Vehtari, A., Gelman, A. and Gabry, J. (2016), ‘Practical Bayesian model evaluation using

leave-one-out cross-validation and WAIC’, Statistics and Computing .

Watanabe, S. (2010), ‘Asymptotic equivalence of bayes cross validation and widely applicable

information criterion in singular learning theory’, Journal of Machine Learning Research

11, 3571–3594.

Appendix A

Proof of Lemma 1

Given a density f0 ∈ F the idea is to construct a sequence of functions f (M) ∈ C{(F)m ×

(0, 1)} M ≥ 1 such that KL(f0, f(M)) → 0 as M → ∞. Let e = (e1, ..., em), Aα =

diag(α1(1 + α2

1)−1/2, ..., αm(1 + α2m)−1/2

)and A∗α = diag

((1 + α2

1)−1/2, ..., (1 + α2m)−1/2

). We as-

31

sume α1 = α2 = ... = αm = α0 and α0 is known. Define,

f (M)(e) =

∫Rm+

2m|J |m∏j=1

hjM (zj)hjM (e∗j )

φ1

[Φ−1{HjM (e∗j )}

]φm [Φ−1{H1M (e∗1), ...,Φ−1{HmM (e∗m)}| ΣρM

]dz,

where e∗j = (1 + α20)1/2ej − α0zj for j = 1, ...,m, HjM is the cdf of hjM and ρM → ρ as M → ∞,

|J | =

∣∣∣∣∣∣∣∣I 0

A∗−1α −A∗−1

α Aα

∣∣∣∣∣∣∣∣ with the marginal density given by,

fjM (ej) =

∫ ∞0

2(1 + α20)1/2hjM (zj)hjM (e∗j )dzj , j = 1, ...,m.

We now construct hjM . Suppose FjM (·) denotes the cdf of fjM (·). h0j is continuous and symmetric

(around 0) density. Clearly, h0j is increasing R− and decreasing on R+. We define weights as in

Wu and Ghoshal (2008). Suppose t1 > 0 and t2 > 0 such that h0j(t1) = a1 and h0j(t2) = b1, where

0 < b1 < 1 and b1 < a1 < h0j(0). For given M , let M1 and M2 be such that M1M ≤ t1 ≤ M1+1

M and

M2M ≤ t2 ≤

M2+1M . Set

w∗ji =

iM {h0j

(iM

)− h0j

(i+1M

)}, 1 ≤ i < M1,

M1M {h0j

(M1M

)− a1}, i = M1,

M+1M {a1 − h0j

(M+1M

)}, i = M1 + 1,

iM {h0j

(i−1M

)− h0j

(iM

)}, M1 + 1 < i ≤M2,

iM {h0j

(i−1M

)− h0j

(iM

)}, i ≥M1 + 1

We define h∗jM (e) =∑∞

1 w∗jiK(e; iM ), where K(e; θ) = 1

2θ1(−θ≤e≤θ). By the continuity of

h0j , h∗jM (e) converges to h0j(e) pointwise. However,

∑∞1 w∗ji 6= 1 and h∗jM (e) is not a pdf. We

define wji = w∗ji1−

∑M1−11 w∗ji−

∑∞M2+1 w

∗ji∑M2

M1w∗ji

for M1 ≤ i ≤ M2 then∑∞

1 wji = 1. Let hjM (e) =

32

∑∞1 wjiK(e; i

M ), where K(e; θ) = 12θ1(−θ≤e≤θ). Observe that

h∗jM (e)− hjM (e) =

( ∞∑1

w∗ji −∞∑1

wji

)

=

(1−

∑M1−11 wji −

∑∞M2+1wji∑M2

M1w∗ji

− 1

)M2∑M1

w∗jiM

2i

≤

(1−

∑M1−11 wji −

∑∞M2+1wji∑M2

M1w∗ji

− 1

)M2∑M1

w∗ji

M

2M1

=

1−M1−1∑

1

wji −∞∑

M2+1

wji −M2∑M1

w∗ji

M

2M1

=

(1− 1

M

∞∑1

h0j(i/M)− a1

M

)M

2M1→ 0

as M →∞, by definition of Riemann integral. Thus hjM (e) converges to h0j . Let M be large such

that the RHS of the equation is less than a2 Define fjM (e) = 2

b

∫∞0 hjM (zj)hjM (

ej−azjb )dzj , where

a = α√1+α2

and b = 1√1+α2

. Hence,

| log fjM (ej)| = | log

(2

b

)+ log

∫ ∞0

hjM (zj)hjM (e∗j )|

≤ | log

(2

b

)|+ | log

∫ ∞0

hjM (zj)hjM (e∗j )|

Since hjM → h0j pointwise, by construction of hjM we have, c1h0j(zj) ≤ hjM (zj) ≤ c2h0j(zj),

and c1h0j(e∗j ) ≤ hjM (e∗j ) ≤ c2h0j(e

∗j ). Thus we have

c1

∫ ∞0

h0j(zj)h0j(e∗j ) ≤

∫ ∞0

hjM (zj)hjM (e∗j ) ≤ c2

∫ ∞0

h0j(zj)h0j(e∗j ).

Hence we also have | log fjM (ej)| < ∞. If f0 ∈ FKL, we have∫f0j | log f0j | < ∞ for all j. Since

| log h0j(ej)| is h0-integrable, using dominated convergence theorem, we have∫h0j log

h0jhjM→ 0 as

M → ∞. | log fjM (ej)| is bounded above by an integrable function and hence∫f0j log

f0jfjM→ 0

as M → ∞. Below we show that f (M) → f0 pointwise and construct an f0 integrable upper

bound of gM = log f0f (M) . Since, fjM → f0j pointwise for j = 1, ...,m, by Scheffe’s theorem

33

supe∈R|FjM (e) − F0j(e)| → 0 as M → ∞. φ and Φ being continuous in their arguments, we

conclude that fM → f0 pointwise. Observe that

|gM | ≤ | log f0|+ | log f (M)|

and

| log f (M)| ≤ | log(2m|J |)|+ (12)

| log

∫Rm+

m∏j=1

hjM (zj)hjM (e∗j )1

|ΣM |1/2exp

[−1

2H∗′M (e∗)(Σ−1

M − I)H∗M (e∗)

]dz|.

Let

IM = 1|ΣM |1/2

exp[−1

2H∗′M (e∗)(Σ−1

M − I)H∗M (e∗)].

Then

log(IM ) = −1

2log |ΣM | −

1

2H∗′M (e∗)(Σ−1

M − I)H∗M (e∗)

= C1 −1

2(H∗M (e∗)−H∗0 (e∗))

′(Σ−1

M − I)(H∗M (e∗)−H∗0 (e∗))

− 1

2H∗′

0 (e∗)(Σ−1M − Σ−1

0 )H∗0 (e∗)− 1

2H∗′

0 (e∗)(Σ−10 − I)H∗0 (e∗) (13)

We have H∗M = (H∗1M , ...,H∗mM ) and H∗0 = (H∗01, ...,H

∗0m), where H∗jM (e∗j ) = Φ−1{HjM (e∗j )} and

H∗0j(e∗j ) = Φ−1{H0j(e

∗j )} for j = 1, ...,m. Using the Taylor series expansion of H∗jM around H∗0j we

have for ζ ∈ [0, 1]

H∗jM = H∗0j +HjM (e∗j )−H0j(e

∗j )

φ1(H0j(e∗j )) +φ′(ζ){HjM (e∗j )−H0j(e

∗j )}

φ2(ζ)

Hence, |H∗jM − H∗0j | → 0 uniformly in e∗j and the first term in (13) is 0 as M → ∞. As ρM →

ρ0 as M → ∞, the m eigenvalues of Σ−1M converge to the m eigenvalues of Σ−1

0 . Therefore,

12H∗′0 (e∗)(Σ−1

M − Σ−10 )H∗0 (e∗) ≤ maxj |λ0j − λMj |H∗

′0 H

∗0 . Thus, there exists constants C1, C2 > 0

such that

exp{−(C1 + C2| log(I0)|)} ≤ IM ≤ exp{C1 + C2| log(I0)|}.

34

We know that hjM (·)→ h0j(·) for all j as M →∞. Hence

| log f (M)(e)| ≤ | log(2m|J |)|+ | log Ψ|, with

∫f0| log f (M)(e)| <∞.

By dominated convergence theorem, we conclude that∫f0 log f0

f (M) → 0 as M → ∞ and FKL ⊂

KL(Π).

Appendix B

Proof of Theorem 1

Suppose for any two densities g1 and g2, K(g1, g2) =∫R g1(w) log{g1(w)/g2(w)}dw and V (g1, g2) =∫

R g1(w)[log+{g1(w)/g2(w)}]2dw, where log+(u) = max(log(u), 0). Set Ki(f, β) = K(f0i, fβi) and

Vi(f, β) = V (f0i, fβi). The proof of Theorem 1 follows from (Pati and Dunson, 2014) and (Tang

et al., 2015) with minor changes under the condition that there exist test functions {Φn}∞n=1, sets

Θn =Wε(f0)×Θβn, n ≥ 1, and constants C1, C2, c1, c2 > 0 such that

1.∑∞

n=1E∏ni=1 f0i

Φn <∞

2. sup(f,β)∈Ucn∩Θn E∏ni=1 fβi

(1− Φn) ≤ C1e−c1n

3. Πβ(Θcβn) ≤ C2e

−c2n

4. For all δ > 0 and for almost every data sequence {yi, xi}∞i=1,

Π{(f, β) : Ki(f, β) < δ∀i,∑∞

i=1Vi(f,β)i2

<∞} > 0

To verify the above conditions, we construct sieves Θn =Wε(f0)×Θβn, where Θβn = {β : |β| < Tn}

with Tn = O(√n).

35