32
A Weighted Approach to Zero-Inflated Poisson Regression Models With Missing Data in Covariates. Martin Lukusa. T. 1 ? , Shen-Ming Lee 1 and Chin-Shang Li 2 1. Department of Statistics, Feng Chia University, Taiwan. 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. May 2014 Martin Lukusa. T. 1 ? , Shen-Ming Lee 1 and Chin-Shang Li (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Div Modeling ZIP with incomplete Covariates May 2014 1 / 29

Publications 2004-2013

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

A Weighted Approach to Zero-Inflated PoissonRegression Models With Missing Data in Covariates.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2

1. Department of Statistics, Feng Chia University, Taiwan.

2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA.

May 2014

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 1 / 29

Table of Contents

1 Introduction

2 Estimating ZIP under no missings

3 Estimating ZIP under Missings CovariatesEstimating ZIP under Missings Covariates

4 Large Sample Properties

5 simulation Study

6 Empirical Study

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 2 / 29

Introduction

Background Study

1 Count data cover considerable portion of data used in many fieldsincluding social sciences, medical, industrial, ecology and etc.

2 poisson regression model is the basic and natural model for countdata. See (Cameron and Trivedi, 1998)[1]).

3 Overdispersion problem: In practice, many count data have manyzeroes resulting to an overdispersed distribution or a skeweddistribution See (McCullagh and Nelder(1989)[10])

As Consequences

Assumption of mean=variance does not hold.

Poisson fit becomes questionable.

Need for advanced models able to account for overdispersion.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 3 / 29

Introduction

(Be Cont.)

Potential candidates for handling overdispersion are:

Negative binomial (NB)

Zero-Inflation Poisson (ZIP)

Negative Zero Inflated Binomial (NZIB)

Generalized Poisson Model (GP)

Hurdle models and other Truncated models

Interest: In the presence of overdispersion, we propose to fit countdata using Zero-Inflated Poisson (ZIP) Model.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 4 / 29

Introduction

Literature Review ZIP Model

Under no missings values, ZIP has been intensively visited. Many workshave been done. For instance; Lambert (1992) studied ZIP model withapplication to decfects in manufacturing with use ofExpectation-Maximization (EM) algorithm of Dempster(1977). Jansakuland Hinde (2002)[11] proposed a score test for a ZIP model against aPoisson model. Hall and shen (2010) [6], has proposed a robust ZIPestimator, Li (2010) [9], has studied the lack of fit for parametric test, andLi (2011) [8], proposed a semiparametric score test for ZIP.Under missing values in covariates modelling ZIP, very few authors havefocused on that way. For instance, Chen and Fu(2011)[13] where theyproposed a model selection and Bhavna et al. (2011)[2] conducted amultiple imputation for missing caries data in ZIP models.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 5 / 29

Introduction

General Concepts of Zero-Inflation Poisson (ZIP) Model.

Let Yi = 0, 1, 2, · · · , n be count response variable, suct that Yi is amixture of two distributions. Yi degenerates at mass 0 with probability pior Yi is Poisson(µi ) random with probability 1− pi .Yi density probability function is given by:

P(Yi = y ; p, µ) = p + (1− p)e−µ, if y = 0; (1− p)e−µµy

y !, if y > 0.

Define p = H(u) as the mixing probability where H(u) =exp(u)

1 + exp(u)and

u = γTX . Similarly, µ = exp(u) is the poisson mean where u = βTX , and

where X =(1,XT ,ZT

)T, and here X and Z are univariate covariates.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 6 / 29

Introduction

ZIP can model simulteneously p, µ respectively by logistic and poisson

forms. We seek to estimate θ =(γT ,βT

)T, where γ =

(γ0,γ

T1 ,γ

T2

)Tand β =

(β0,β

T1 ,β

T2

)T. Let now define

P(Yi = 0|Xi ,Zi

)= H(γTXi ) + (1− H(γTXi )) exp(− exp(βTXi ))

= H(γTXi

)H−1

(γTXi + exp(βTXi )

), (1)

P(Yi > 0|Xi ,Zi

)= 1− P

(Yi = 0|Xi ,Zi

). (2)

Note that Zero-Inflated Poisson mean, denoted by E(Yi ) and its variancedenoted by Var(Yi ) of Yi are respectively expressed as follows

E(Yi ) =[1− H

(γTXi

)]exp

(βTXi

), and

Var(Yi ) =(1− H

(γTXi

))exp

(βTXi

){1 + H

(γTXi

)exp

(βTXi

)}.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 7 / 29

Estimating ZIP under no missings

Likelihood function (ZIP)

The likelihood function L(θ) is expressed by

L(θ) =n∏

i=1

[H(γTXi

)H−1

(γTXi + exp(βTXi )

)]I (yi=0)

n∏i=1

[(1− H(γTXi )

)exp(− exp(βTXi )

)[exp(βTXi )]yi

yi !

]I (yi>0)

. (3)

Applying the natural log to L(θ), we obtain the log-likelihood function `(θ) given by

`(θ) =n∑

i=1

[I (Yi = 0)

(log[H(γTXi

)]− log

[H(γTXi + exp(βTXi )

)])]

+n∑

i=1

[I (Yi > 0)

(log(1− H(γTXi )

)+(Yiβ

TXi − exp(βTXi )− logYi !))]

.

(4)

which can be optimized by maximum likelihood estimation method (MLE/EM al.) to find θas in Lambert (1992).

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 8 / 29

Estimating ZIP under no missings

Model Estimation Complete Data(Ideal)

Under no missings, the estimating equation (E.E) based on score functions is given by

Un

(θ)

=1√n

∂`(θ)

∂θT=

1√n

∂`(θ)

∂γ∂`(θ)

∂β

=1√n

n∑i=1

Si(θ), (5)

with Si(θ)

=(STi1

(θ),ST

i2

(θ))T

, where Si1(θ)

and Si2(θ)

are respectively given by

Si1(θ)

= XiH(γTXi + exp(βTXi )

){I (Yi = 0)− HγTXi

H(γTXi + exp(βTXi )

)}, (6)

Si2(θ)

= Xi

{[Yi −

(1− H(γTXi )

)exp(βTXi )

]}+ exp(βTXi )Si1

(θ). (7)

Since, E[Un

(θ)]

= 0, this implies that Un

(θ)

is unbiased (E.E). Solving for Un

(θ)

= 0, we

obtain the maximum likelihood estimator (MLE) θF of θ.Some Optimization Techniques; for instance, NR (Jansakul and Hinde (2002)[11]), EMalgorithm (Lambert(1992)[5]),ITRLS (Li (2010)[9]), MCMC under Bayesian set up.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 9 / 29

Estimating ZIP under Missings Covariates

Missing Data Concepts

Consider the data {Yi , Xi , Zi , Wi , δi }, i = 1, · · · , n.Denote

V =(Z ,W

)T. The missing mechanisms play important role in dealing

with missing problems. Rubin(1976) distinguishes three processes ofmissing as follows:

1 Missing Completely at Random(MCAR) with selection probabilitygiven by

P(δi = 1|Yi ,Xi ,Vi ) = πi ,2 Missing at Random(MAR) where the selection probability is expressed

by

P(δi = 1|Yi ,Xi ,Vi ) = π(Yi ,Vi ),3 Missing at Not Random (MNAR) where the selection probability is

given by

P(δi = 1|Yi ,Xi ,Vi ) = π(Yi ,Xi ,Vi ),

MCAR and MAR refer to ignorable missings while MNAR refers tononignorable missings.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 10 / 29

Estimating ZIP under Missings Covariates Estimating ZIP under Missings Covariates

Complete Cases Analysis (CC Method)

Define δi = 1, when Xi is observed and δi = 0, otherwise.Under MAR,define the selection probability P(δi = 1|Yi ,Xi ,Vi ) = π(Yi ,Vi ).ZIP CC estimating function denoted by UCC ,n

(θ)

is expressed as follows

UCC ,n

(θ)

=1√n

n∑i=1

δiSi(θ)

=1√n

n∑i=1

δi

(Si1(θ)

Si2(θ)) , (8)

where Si1(θ)

and Si2(θ)

are respectively given in expressions (6) and (7).

The solution θCC is obtained when solving UCC ,n

(θ)

= 0.

1 It can be established that under MAR, E[UCC ,n

(θ)]6= 0.

2 Therefore, UCC ,n

(θ)

is biased.

3 CC inference mostly inefficiency due to cases deletion.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 11 / 29

Estimating ZIP under Missings Covariates Estimating ZIP under Missings Covariates

Inverse Probability Weighting (IPW) Method

BackgroundInspired from Horvitz-Thomsom (H-T) estimator (1952)[7], separately,Flanders and Greenland (1991)[12] and Zhao and Lipsitz (1992)[14]extended the H-T estimator and proposed the IPW estimator. Recently,Creemers et al.(2012)[3] proposed general concepts for nonparametricweighting estimating function under MAR. The IPW inversely weights theobserved data with some weights π

(Yi ,Vi

)in order to reduce the bias due

to incomplete cases deletion. This article implements the IPW toZero-Inflated Poisson when π

(Yi ,Vi

)is nonparametric.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 12 / 29

Estimating ZIP under Missings Covariates Estimating ZIP under Missings Covariates

Let δi = 1 if Xi is observed, and δi = 0 elsewhere. Under MAR, the selection probability ofobserving Xi is

P(δi = 1|Yi ,Xi ,Vi ) = π(Yi ,Vi ),

where Vi =(Zi ,Wi

)T. In this article, Zi is always observed and Wi is a surrogate of Xi . Wi is

available and independent of Yi given(Xi ,Zi

). The inverse probability weighting (E.E) ZIP

respectively with parametric and nonparametric π(.) are given by:

UWp,n

(θ,φ

)=

1√n

n∑i=1

δi

π(Yi ,Vi ;φ

)Si(θ) =1√n

n∑i=1

δi

π(Yi ,Vi ;φ

) (Si1(θ)Si2(θ)

)(9)

Uws,n

(θ, π

)=

1√n

n∑i=1

δi

π(Yi ,Vi

)Si(θ) =1√n

n∑i=1

δi

π(Yi ,Vi

) (Si1(θ)Si2(θ)

). (10)

where Si1(θ) and Si2(θ) are respectively given in expressions (6) and (7). We obtainconsistent estimator of θ denoted by θWs using UWs,n

(θ, π

)= 0; similarly, θWp using

UWp,n

(θ,φ

)= 0.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 13 / 29

Estimating ZIP under Missings Covariates Estimating ZIP under Missings Covariates

Weighting procedure

In practice, selection probability π(Yi ,Vi

)is unknown, and needs to be

estimated either parametrically (Rosenbaum and Rubin, 1983) wherelogistic regression is the most used, or nonparametrically (Wang andWang, 1997)[4]. First, parametrically we have:

U(φ) =1√nυi

n∑i=1

{δi − H(φTυi )

}, (11)

where υ = (1,Y ,V ), with (V = Z ,W ), φ = (φ0, φ1, φ2), and where

H(φTυi ) = π(Y ,V ) =exp(φ0 + φ1Yi + φ2Vi )

1 + expφ0 + φ1Yi + φ2Vi. (12)

Once estimated, π(Yi ,Vi

)is replaced in expression (10)

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 14 / 29

Estimating ZIP under Missings Covariates Estimating ZIP under Missings Covariates

Weighting procedure (Be cont.)

Second, the selection probability is nonparametrically estimated. We alsoassume that covariates are categorical. Let v1, v2, · · · , vm denote thedistinct values of Vi , for v ∈

[v1, v2, · · · , vm

]; the nonparametric estimator

of π(Yi ,Vi

)is given by the following expression:

π(y , v) =

∑ni=1 δi I

(Yi = y ,Vi = v

)∑ni=1 I

(Yi = y ,Vi = v

) , (13)

where V =(Z ,W

)T, and I

(.)

is indicator function. Once estimated,π(Yi ,Vi

)is replaced in expression (10). Next on , we focused only case

where selection probability is nonparametric.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 15 / 29

Large Sample Properties

Theorem (1)

Under some regularity conditions, it follows that as n→∞, θF is a consistent estimatorof θ and

√n(θF − θ) is asymptotically normally distributed with mean 0 and covariance

matrix ∆F , where

∆F = H−1F (θ)QF (θ)[H−1

F (θ)]T = QF (θ)

since QF (θ) = E[Si(θ)[Si(θ)]T]and HF (θ) = E

(−∂Si(θ)

∂θT

)are equal.

Theorem (2)

Under some regularity conditions, it follows that as n→∞, θWs is a consistentestimator of θ and

√n(θWs − θ) is asymptotically normally distributed with mean 0 and

variance matrix ∆Ws where,

∆Ws = [HF

(θ)]−1

[J(θ, π

)− [J∗

(θ, π

)− C ∗(θ, π)]][HT

F

(θ)]−1

where respectively we have

J(θ, π) = E

[Si (θ)ST

i (θ)

π(Yi ,Vi )

], J∗(θ, π) = E

[S∗i (θ)[Si

∗(θ)]T

π(Yi ,Vi )

],

C ∗(θ, π) = E

[S∗i (θ)[Si

∗(θ)]T], and S∗

i (θ) = E

[Si (θ)|Yi ,Vi

].

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 16 / 29

simulation Study

Settings A simulation study is conducted to investigate the finite sampleproperties performance under different estimator. We consider thefollowing

θF which is the complete data estimator of θ.

θCC which is the complete cases estimator of θ.

θWt which is the true weight IPW estimator of θ.

θWs which is the semiparametric IPW estimator of θ.

We chose the sample size n to be equal to 500 and 1000 observations, andtotal number of simulations was set to 1000 replications. We consideredcovariates to be univariate and categorical. For each senario, we computethe bias, the asymptotic standard error (ASE), the standarddeviation(SD), and the 95% confidence interval coverage probability (CP).We considered covariates X and Z such that X were generated fromuniform distribution U(−1, 2) and Z is binomial B(n, p) where p is theprobability of success equals to 0.5. We generated a binary surrogatevariable W such that W = 1 if X ≥ 0 and W = 0 otherwise.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 17 / 29

simulation Study

cont. The count response (ZIP) was generated as followsY = p ∗ 0 + (1− p) ∗ pois(λ) here p is the mixing proportion and λ thepoisson mean. They were respectively expressed asp = H(γ1 + γ2 ∗ X + γ3 ∗ Z ) and λ = exp(−β1 − β2 ∗ X − β3 ∗ Z ), with

θ =(γT ,βT

)T, where γ =

(θ1,θ2,θ3

)T,β =

(θ1,θ2,θ3

)Tsuch that

θ = (−1,−1, 0.5, 1, 0.7, 1). To account for the missing covariates in X ,where δi = 1 if X is observed and δi = 0 otherwise, we used the selectionprobability defined by logistic regression such asH(−η1 − η2Y ∗ I (Y < k∗)− η3(1− I (Y < k∗) + η4Z + η5W ), where thecut points was k∗ = 6. We considered two situations as follows:

1 When δ was generated using the selection probability withη = c(−1, 0.3, 10,−0.5, 1), the missng rate (mrate) was 33%.

2 When δ was generated using the selection probability withη = c(−1, .5, 1, .7, 1), the missng rate (mrate) was 68%.

Results of simulations are found into table1, table2 and table3.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 18 / 29

simulation Study

Table1: Simulation

Table: Result1 :: 500 obs , 1000 draws and mrate 0.33 .

par estimate θF θCC θWt θWs

γ2 Bias -0.0258 -0.5708 -0.0627 -0.0558SD 0.1575 0.2895 0.2927 0.2338

ASE 0.1527 0.2768 0.2770 0.2088CP 0.9470 0.4630 0.9430 0.9040

β2 Bias -0.0008 -0.0393 -0.0012 -0.0075SD 0.0260 0.0279 0.0317 0.0284

ASE 0.0261 0.0280 0.0299 0.0262CP 0.9570 0.7020 0.9310 0.9260

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 19 / 29

simulation Study

Table2: Simulation

Table: Result2 :: 500 obs , 1000 draws and mrate 0.67 .

par estimate θF θCC θWt θWs

γ2 Bias -0.0258 -0.5606 -0.0626 -0.0558SD 0.1575 0.2897 0.2927 0.2339

ASE 0.1527 0.2769 0.2772 0.2088CP 0.9470 0.4780 0.9440 0.9060

β2 Bias -0.0008 -0.0388 -0.0012 -0.0075SD 0.0260 0.0281 0.0317 0.0285

ASE 0.0261 0.0282 0.0300 0.0263CP 0.9570 0.7180 0.9340 0.9280

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 20 / 29

simulation Study

Table3: Simulation

Table: Result3: 1000 obs , 1000 draws and mrate 0.33 .

par estimate θF θCC θWt θWs

γ2 Bias -0.0065 -0.5368 -0.0222 -0.0177SD 0.1086 0.1964 0.1976 0.1628

ASE 0.1068 0.1905 0.1921 0.1500CP 0.9440 0.1620 0.9470 0.9330

β2 Bias 0.0000 -0.0389 0.0003 -0.0035SD 0.0187 0.0190 0.0214 0.0195

ASE 0.0184 0.0197 0.0213 0.0188CP 0.9450 0.5050 0.9550 0.9380

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 21 / 29

Empirical Study

Motorcycle Violations Speed Regulations (Taiwan Survey)

Period and Space

1 The survey was conducted during the year 2007 accross Taiwan.2 source: Ministry of transport (ROC).3 The total number of respondents (observations) was 7386.4 Most of respondents were from Taipei, Taoyun, Gaoshong, and Hualien. Only few

respondents from the other remaining cities and counties have been recorded.

Variables

1 Y.Violation , count variable :: gives the number of time the respondent has been foundguilty of violating speed regulations.

2 X2.Age Age, categorical variable where:: (1) < 18 years old, (2) 18− 19 years old, (3)20− 29 years old, (4) 30− 39 years old, (5) 40− 49 years old, and (6) >= 50 years old.

3 X3 Education Levels, categorical variable where (1) under high school, (2) high school,(3) colleage and above.

4 X7 Kilometer (Km), continuous variable where (1) < 1000 Km, (2) 1,000-2,999 Km, (3)3000-9999 Km, and over (4) 10,000 Km in a year.

5 X8 Engine power, categorical variable where (1) under 50cc. (2) 50 - 249cc. (3)250-549cc (4) over 550cc.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 22 / 29

Empirical Study

Table: Variable Descriptions .

Variable Definition Possible Val N OBS Missing I % II %

N.Violations(Y) number of violations in a year 1-7 7386 7386 0 91% are 0 9% > 0Distance(X) Km covered per year 1-4 7386 6262 1124 15% missing 85% obsX.Dummy 2 1, 000 ≤ X < 2, 999 Km 0-1 7386 6262 1124 25% of X=2X.Dummy3 3, 000 ≤ X < 9, 999 Km 0-1 7386 6262 1124 12.8% of X=3X.Dummy3 ≥ 10000 Km 0-1 7386 6262 1124 12.8% of X=4Engine (Z) ≤ 250cc 0-1 7386 7386 0 88% has ≤250cc 12%Age (W) ≤ 40years old 0-1 7386 7386 0 72.5% has ≤40 17.5%

1 X.Dummy1 X < 1,000 Km is considered as baseline.2 W :: Age (X2) is chosen as the most potential the surrogate.3 The correlation (X, W) is 0.10, not so enough.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 23 / 29

Empirical Study

Output 1. Histogram for Count response variable

Figure: Awesome figure

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 24 / 29

Empirical Study

Diagnostic MAR

δ = 1 if X 6= NA and 0 otherwise (Probability of missingness). TestingMAR, we fit logistic regression π(δ = 1|observed data).

π(Y ,Z ,W ) = Pr(δ = 1|Y ,Z ,W ) = H(αTU),

where U = (1,Y T ,ZT ,W T ). The selection probability is given by

π(Y ,Z ,W ) = H(0.854631 + 0.051814Y + 0.047927Z − 0.071278W )

with coefficients from the below table.

Coef. Estimate Std.Err Pr(> |z |)α0 0.854631 0.005223 < 2e-16α1 0.051814 0.008466 9.81e-10α2 0.047927 0.013212 0.000288α3 -0.071278 0.009311 2.18e-14

From the above results, the probability of missing is strongly depending onobserved data. That support MAR assumption. Note here that α isnuissance parameter.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 25 / 29

Empirical Study

Output 2. Regression ZIP

Table: Outputs Regression ZIP model

Par. Complete Cases Semiparametric IPW

Coef. θCC Std.Err Pr(> |z |) θWs Std.Err Pr(> |z |)

γ0 2.6854 0.1723 0.0000 2.8181 0.1790 0.0000γ1 -0.5747 0.2014 0.0043 -0.5766 0.2059 0.0051γ2 -0.8672 0.2001 0.0000 -0.8564 0.2077 0.0000γ3 -1.0602 0.2165 0.0000 -1.0359 0.2277 0.0000γ4 -2.1884 0.1694 0.0000 -2.1705 0.1647 0.0000

β0 -0.2976 0.1466 0.0423 -0.3190 0.1659 0.0546

β1 0.0960 0.1608 0.5505 0.0988 0.1790 0.5812

β2 0.1530 0.1592 0.3365 0.1576 0.1811 0.3840

β3 0.3802 0.1705 0.0258 0.4100 0.1961 0.0366

β4 0.2033 0.1093 0.0627 0.2235 0.1118 0.0456

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 26 / 29

Empirical Study

Where Θ = (γ, β), and where γ = (γ0, γ1, γ2), and β = (β0, β1, β2) arerespectively the logistic and poisson components. The mixing proportionand the poisson mean for complete cases and inverse probability weightingare respectively expressed byPCC = H(2.6854− 0.5747D1− 0.8672D2− 1.0602D3− 2.1884Z ) andµCC = exp(−0.2976 + 0.0960D1 + 0.1530D2 + 0.3802D3− 2.1884Z ),PWs = H(2.8181− 0.5766D1 + 0.0988D2 + 1.0359D3− 2.1705Z ) andµWs = exp(−0.3190 + 0.0988D1 + 0.1576D2 + 0.4100D3 + 0.2235Z ).The predicted function for random Y can be put in the following way.

Y ∼

{deg(0), with probability P,

Poisson(µ), with probability 1− P.

where P = PCC or PWs and µ = µCC or µWs .

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 27 / 29

Empirical Study

Discussion

1 The histogram has a skew on the left especially at point 0, showing aZero Inflated poisson distribution with( 91%) of zeroes.

2 The empirical study shows that our proposed method thesemiparametric inverse probability weighting for zero inflated poissonprovides acceptable results.

3 In addition, IPW performance is challenged by CC method. That isdue to the fact that the surrogate variable (W) has little contributionto (X).The correlation (X, W) is 0.10, and very big proportion ofzeros tends to support more for logit component than poisson.

4 The CC method performance is due to the percent of missing (15 %only) and big sample size.

As future work, other methods recommended in literature such as multipleimputations parametric/nonparametric and empirical multiple imputationcan be used to see how ZIP-IPW performs comparely to those others.Find other surrogate variables having strong relation with variable havingmissing.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 28 / 29

Empirical Study

Thank you for your Attention

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 29 / 29

Empirical Study

Cameron, A.C and Trividi, P.K.Regression Analysis of Count Data”.Cambridge University Press, thirteenth edition, 1998.

Bhavna,T.P, Preisser,J.S, Stearn,S.C, and Rosier,R.G.Multiple imputation of dental caries data using a zero inflated poissonregression models.J public Health Dent., 1:71–78, 2011.

Creemers,A, Aerts, M, Hens, N, and Molenberghs, G.A nonparametric approach to weighted estimating equations forregression analysis with missing covariates.Computational Statistics and Data Analysis, 56:100–113, 2012.

Wang, C.Y, Wang,S, Zhao,L.P, and Ou, S.T.Weighted semiparametric estimation in regression with missingcovariates data.Journal of American Statistical Association, 92:512–525, 1997.

Lambert, D.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 29 / 29

Empirical Study

Zero-inflated poisson regression, with an application to defects inmanufacturing.Technometrics, 34:1–14, 1992.

Hall, D.B and Shen,J.Robust estimation for zero-inflated poisson regression.Scandinavian Journal of Statistics, 37:237–252, 2010.

D.G Horvitz and D.J Thomson.A generalization of sampling without replacement.Journal of The American Statistics Association, 47:663–685, 1952.

Chin-Shang , Li.Score test for semiparametric zero-inflated poisson models.International Journal of Statistics and Probability, 1:1–7, 2012.

Chin-Shang ,Li.A lack-of-fit test for parametric zero-inflated poisson zero-inflatedpoisson models.Computational and Simulation, 81:1081–1098, 2010.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 29 / 29

Empirical Study

McCullagh,P and Nelder,J.Generalized linear models.1989.

Jansakul, N and Hinde,J.P.Score test for zero-inflated poisson models.Computational Statistics and Data Analysis, 40:75–96, 2002.

Flanders, W.D. and Greenland, S.Analytic methods for two-stage casecontrol studies and other stratifieddesigns.Statistics in Medecine, 10:739–747, 1991.

Chen, X.D and Fu,Y.Z.Model selection for zero-inflatated regression with missing covariates.Computational Statistics and Data Analysis, 55:765–773, 2011.

Zhao,L.P and Lipsitz,S.Design and analysis of two-stage studies.Journal of stat. in Medecine, 11:769–782, 1992.

Martin Lukusa. T.1?

, Shen-Ming Lee1 and Chin-Shang Li2 (1. Department of Statistics, Feng Chia University, Taiwan. [ 2pt] 2.Division of Biostatistics, Department of Public Health Sciences , UC, Davis, USA. [-2pt] )Modeling ZIP with incomplete Covariates May 2014 29 / 29