
RACE 616 Advance Statistical Analysis in Medical Research

Poisson regression

Assoc. Prof. Dr. Ammarin Thakkinstian

Master of Science Program in Medical Epidemiology and

Doctor of Philosophy Program in Clinical Epidemiology

Section for Clinical Epidemiology & Biostatistics

Faculty of Medicine Ramathibodi Hospital

Mahidol University

http://www.ceb-rama.org/

Academic Year 2016

Semester 2

CONTENTS

1. LOG-LINEAR MODEL

2. POISSON REGRESSION

3. ESTIMATION OF RISK RATIO FOR FOLLOW UP STUDY

3.1 Using Poisson regression

3.2 Using binary regression

3.3 Using logit regression

4. CAPTURE-RECAPTURE

Assignment II (20%)

OBJECTIVES

Students should be able to

• Analyze categorical data using log-linear and Poisson models

• Construct a Poisson regression model

• Estimate the rate ratio, test coefficients, and interpret results

• Apply a log-linear or Poisson model in capture-recapture analysis

REFERENCES

1. Agresti A. Categorical data analysis. 2nd edition. New York: John Wiley & Sons Inc 2002.

2. Kleinbaum DG, Kupper LL, Muller KE, and Nizam A. Applied regression analysis and other multivariable methods. 3rd edition. Washington: Duxbury Press 1998; 687-709.

3. Zelterman D. Models for discrete data. Revised edition. Oxford: Oxford University Press 2006.

4. Cummings P. Methods for estimating adjusted risk ratios. The STATA Journal 2009; 9: 175-196.

5. Cummings P. Estimating adjusted risk ratios for matched and unmatched data: An update. The STATA Journal 2011; 11: 290-298.

6. Chao A, Tsay PK, Lin SH, Shau WY, Chao DY. The applications of capture-recapture models to epidemiological data. Stat Med 2001 Oct 30;20(20): 3123-57.

7. Hook EB, Regal RR. Internal validity analysis: a method for adjusting capture-recapture estimates of prevalence. Am J Epidemiol 1995 Nov 1;142(9 Suppl): S48-52.

READING SECTION Appendix I: Methods for estimating adjusted risk ratio

ASSIGNMENT II (20%) p.24, due: Feb 2, 2017


1. LOG-LINEAR MODEL

In the analysis of categorical data, it is sometimes the case that none of the variables has been designated in advance as the response (outcome) variable. The aim is then to assess associations or interactions among the variables simultaneously, while cause-and-effect relationships are still unclear. A log-linear model is usually applied for this purpose; it models the mean cell counts of a multi-dimensional contingency table (e.g., I x J, I x J x K, I x J x K x L) of the variables of interest. The log-linear model is the basis of other models for categorical data (e.g., logistic regression, the multi-logit model, binary regression, and Poisson regression), but those models specify the outcome variable in advance.

The simplest log-linear model is the one for a 2x2 contingency table, as displayed in Table 1 (from Agresti A 2002 & Zelterman D 2006).

Table 1. Notation for a 2x2 table

               Y
X          1       2      Total
1         n11     n12     n1+
2         n21     n22     n2+
Total     n+1     n+2     n++

Note that a subscript + refers to summation over that index.

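Let $\pi_{ij}$ be the probability that an observation falls in row i and column j, which can be estimated as

$$\hat\pi_{ij} = \frac{n_{ij}}{n_{++}}, \quad \text{i.e.,} \quad \hat\pi_{11} = \frac{n_{11}}{n_{++}}, \quad \hat\pi_{12} = \frac{n_{12}}{n_{++}}, \quad \hat\pi_{21} = \frac{n_{21}}{n_{++}}, \quad \hat\pi_{22} = \frac{n_{22}}{n_{++}}$$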

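The marginal probabilities are estimated as

$$\hat\pi_{1+} = \frac{n_{1+}}{n_{++}}, \quad \hat\pi_{2+} = \frac{n_{2+}}{n_{++}}, \quad \hat\pi_{+1} = \frac{n_{+1}}{n_{++}}, \quad \hat\pi_{+2} = \frac{n_{+2}}{n_{++}}$$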

The row variable (I levels) and the column variable (J levels) are said to be independent if each joint probability is equal to the product of the corresponding marginal probabilities:

$$\pi_{ij} = \pi_{i+}\,\pi_{+j} \quad \text{for } i = 1,\dots,I;\ j = 1,\dots,J$$

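Under independence, the expected cell count ($\mu_{ij}$, also written $m_{ij}$) is equal to

$$\mu_{ij} = n_{++}\,\pi_{i+}\,\pi_{+j}$$

OR

$$\mu_{ij} = \mu\,\alpha_i\,\beta_j$$

if $\mu = n_{++}$, $\alpha_i$ = marginal probability of row i, and $\beta_j$ = marginal probability of column j.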

This equation can be re-written as

$$\log\mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y$$

where

$$\lambda = \log\mu, \quad \lambda_i^X = \log\alpha_i, \quad \lambda_j^Y = \log\beta_j$$

In this log-linear model, $\log\mu_{ij}$ is an additive function of the effects of X and Y; the model is therefore called the independence model for X (row i) and Y (column j).

If X and Y are dependent (i.e., the two factors are associated), the log-linear model is

$$\log\mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$$

This model is analogous to an ANOVA model with main effects of factors A and B and an interaction AB. It is also called a saturated model, i.e., all possible effects are included. The interaction terms $\lambda_{ij}^{XY}$ determine the log odds ratio of the 2x2 table. For instance,

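$$\log(OR) = \log\frac{\mu_{11}\mu_{22}}{\mu_{12}\mu_{21}} = \log\mu_{11} + \log\mu_{22} - \log\mu_{12} - \log\mu_{21}$$

$$= (\lambda + \lambda_1^X + \lambda_1^Y + \lambda_{11}^{XY}) + (\lambda + \lambda_2^X + \lambda_2^Y + \lambda_{22}^{XY}) - (\lambda + \lambda_1^X + \lambda_2^Y + \lambda_{12}^{XY}) - (\lambda + \lambda_2^X + \lambda_1^Y + \lambda_{21}^{XY})$$

$$= \lambda_{11}^{XY} + \lambda_{22}^{XY} - \lambda_{12}^{XY} - \lambda_{21}^{XY}$$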

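Example 1. Investigators wanted to assess the association between wearing a helmet and sex (in other words, whether the proportion of persons wearing a helmet differs between males and females). The data are displayed in Table 2.

Table 2. Numbers of persons wearing and not wearing a helmet, by sex

              Helmet (observed)            Fitted value         Log fitted value
Sex         Yes       No     Total        Yes         No          Yes     No
Male      1,862   17,473    19,335    1,871.4   17,463.6          7.5    9.8
Female      602    5,521     6,123      592.6    5,530.4          6.4    8.6
Total     2,464   22,994    25,458    2,464     22,994

The independence (additive) log-linear model is:

$$\log\mu_{ij} = \lambda + \lambda_i^S + \lambda_j^H$$

The parameters $\lambda_i^S$ and $\lambda_j^H$ are the logs of the estimated marginal probabilities ($\lambda_i^S = \log\hat\pi_{i+}$, $\lambda_j^H = \log\hat\pi_{+j}$), which are calculated as:

$$\hat\pi_{1+} = \frac{n_{1+}}{n_{++}} = \frac{19{,}335}{25{,}458} = 0.760 \qquad \hat\pi_{2+} = \frac{n_{2+}}{n_{++}} = \frac{6{,}123}{25{,}458} = 0.240$$

$$\hat\pi_{+1} = \frac{n_{+1}}{n_{++}} = \frac{2{,}464}{25{,}458} = 0.097 \qquad \hat\pi_{+2} = \frac{n_{+2}}{n_{++}} = \frac{22{,}994}{25{,}458} = 0.903$$

The expected counts $\hat\mu_{ij}$ are estimated as:

$$\hat\mu_{11} = n_{++}\hat\pi_{1+}\hat\pi_{+1} = 25{,}458 \times 0.760 \times 0.097$$

$$\hat\mu_{12} = n_{++}\hat\pi_{1+}\hat\pi_{+2} = 25{,}458 \times 0.760 \times 0.903$$

$$\hat\mu_{21} = n_{++}\hat\pi_{2+}\hat\pi_{+1} = 25{,}458 \times 0.240 \times 0.097$$

$$\hat\mu_{22} = n_{++}\hat\pi_{2+}\hat\pi_{+2} = 25{,}458 \times 0.240 \times 0.903$$

Evaluated with unrounded probabilities, these products give the fitted values in Table 2.

If wearing a helmet is regarded as the response variable, a binary variable (yes/no), the logit of the probability of wearing a helmet versus not wearing one is obtained as: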

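$$\operatorname{logit}(\hat\pi_{1|i}) = \log\frac{\hat\pi_{1|i}}{1 - \hat\pi_{1|i}} = \log\frac{\hat\mu_{i1}}{\hat\mu_{i2}} = \log\hat\mu_{i1} - \log\hat\mu_{i2}$$

$$= (\lambda + \lambda_i^X + \lambda_1^Y) - (\lambda + \lambda_i^X + \lambda_2^Y) = \lambda_1^Y - \lambda_2^Y$$

where $\hat\pi_{1|i}$ is the estimated probability that Y = 1 given X = i.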

That is, the logit of the response variable Y does not depend on the level of X. The log-linear model is thus the basis of the logit model, in which one variable is pre-specified as the binary response variable.
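The independence fit can be checked by hand. The sketch below is a cross-check in Python (not part of the course's STATA workflow); the function name `independence_fit` is made up for illustration. It computes the expected counts as row total x column total / grand total from the observed counts in Table 2, and then the logit of wearing a helmet within each sex from those fitted counts.

```python
import math

# Observed counts from Table 2 (rows: male, female; columns: helmet yes, no)
observed = [[1862, 17473],
            [602, 5521]]

def independence_fit(table):
    """Expected counts n_i+ * n_+j / n_++ for a two-way table (hypothetical helper)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

fitted = independence_fit(observed)

# Logit of wearing a helmet within each sex, computed from the fitted counts
logits = [math.log(row[0] / row[1]) for row in fitted]

for label, row, logit in zip(("male", "female"), fitted, logits):
    print(f"{label:6s} fitted yes={row[0]:9.1f} no={row[1]:9.1f} logit={logit:.4f}")
```

Under independence the two logits coincide, which is the numerical counterpart of the statement that the logit of Y does not depend on the level of X.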

The log-linear model can be fitted in STATA. Fitting the model is similar in concept to ANOVA: one category of each variable is taken as the reference group (here the last level of X and of Y), as follows:

. tabi 1862 17473 \ 602 5521, exact replace

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |     1,862     17,473 |    19,335
         2 |       602      5,521 |     6,123
-----------+----------------------+----------
     Total |     2,464     22,994 |    25,458

           Fisher's exact =                 0.638
   1-sided Fisher's exact =                 0.329

. ren row sex
. ren col hel

* Independence model
. loglin pop sex hel, fit(sex, hel) anova

Variable sex = A
Variable hel = B
Margins fit: sex, hel
Note: Anova-like constraints are assumed. The last level of each variable
(and all interactions with it) will be dropped from estimation. The variable
codings are constrained to sum to zero, so the last level will equal -1 times
the sum of the other levels.

Poisson regression                              Number of obs   =          4
                                                LR chi2(2)      =   26306.15
                                                Prob > chi2     =     0.0000
Log likelihood = -19.940884                     Pseudo R2       =     0.9985

------------------------------------------------------------------------------
         pop |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          A1 |   .5749324   .0073321    78.41   0.000     .5605617     .589303
          B1 |  -1.116724   .0105987  -105.36   0.000    -1.137497    -1.09595
       _cons |   8.076219   .0112611   717.18   0.000     8.054148    8.098291
------------------------------------------------------------------------------

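The fitted model is

$$\log\hat\mu_{ij} = 8.08 + 0.575\,S_i - 1.117\,H_j$$

where, under the sum-to-zero coding, $S_i$ = 1 for males and -1 for females, and $H_j$ = 1 for helmet wearers and -1 for non-wearers.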

* Estimate the expected (fitted) count m11 (male, wearing helmet)
disp _b[_cons]+_b[A1]+_b[B1]          /* log(m11) */
disp exp(_b[_cons]+_b[A1]+_b[B1])     /* m11 */
1871.374

* Estimate the expected count m12 (male, not wearing helmet)
disp _b[_cons]+_b[A1]-_b[B1]          /* log(m12) */
disp exp(_b[_cons]+_b[A1]-_b[B1])     /* m12 */
17463.626

* Estimate the expected count m21 (female, wearing helmet)
disp _b[_cons]-_b[A1]+_b[B1]          /* log(m21) */
disp exp(_b[_cons]-_b[A1]+_b[B1])     /* m21 */
592.625

* Estimate the expected count m22 (female, not wearing helmet)
disp _b[_cons]-_b[A1]-_b[B1]          /* log(m22) */
disp exp(_b[_cons]-_b[A1]-_b[B1])     /* m22 */
5530.374
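The same four fitted counts can be reproduced outside STATA from the three reported coefficients. The sketch below is a cross-check in Python, with the coefficients copied from the loglin output above; the dictionary `codes` is an illustrative name for the +1/-1 effect codes implied by the sum-to-zero constraint.

```python
import math

# Coefficients copied from the loglin output above (sum-to-zero coding)
cons, a1, b1 = 8.076219, 0.5749324, -1.116724

# Effect codes: +1 for the first level, -1 for the last level of each variable
codes = {"male": 1, "female": -1, "helmet": 1, "no helmet": -1}

fitted = {}
for sex in ("male", "female"):
    for hel in ("helmet", "no helmet"):
        log_mu = cons + a1 * codes[sex] + b1 * codes[hel]
        fitted[(sex, hel)] = math.exp(log_mu)

for cell, mu in fitted.items():
    print(cell, round(mu, 1))
```

Because the last level is coded -1 rather than 0, each effect-coded coefficient is half of the corresponding coefficient from a model that uses the last level as a reference category (2 x 0.5749324 = 1.149865 and 2 x -1.116724 = -2.233448).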

2. POISSON REGRESSION

A Poisson regression model can be regarded as a log-linear model in which either no response variable is specified, or an outcome variable is specified that is a count (discrete events) assumed to follow a Poisson distribution. If the outcome is not specified, the equation is the same as the log-linear model described above. Fitting the helmet example with the poisson command gives the output below:

poisson pop ib(2).sex ib(2).hel, nolog

Poisson regression Number of obs = 4

LR chi2(2) = 26306.15

Prob > chi2 = 0.0000

Log likelihood = -19.940884 Pseudo R2 = 0.9985

------------------------------------------------------------------------------

pop | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

1.sex | 1.149865 .0146642 78.41 0.000 1.121123 1.178606

1.hel | -2.233447 .0211975 -105.36 0.000 -2.274994 -2.191901

_cons | 8.618011 .0129433 665.83 0.000 8.592642 8.643379

------------------------------------------------------------------------------

We can use the post-estimation commands ‘predict’ or ‘lincom’ to estimate the expected numbers for each sex-helmet combination, as follows:

predict exp_pop

(option n assumed; predicted number of events)


. list

+------------------------------+

| sex hel pop exp_pop |

|------------------------------|

1. | 1 1 1862 1871.374 |

2. | 1 2 17473 17463.63 |

3. | 2 1 602 592.626 |

4. | 2 2 5521 5530.374 |

+------------------------------+

Estimation using lincom

notes: exp(male & wearing helmet )

. lincom _cons+ 1.sex+1.hel, eform

( 1) [pop]1.sex + [pop]1.hel + [pop]_cons = 0

------------------------------------------------------------------------------

pop | exp(b) Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

(1) | 1871.374 38.2733 368.40 0.000 1797.843 1947.912

------------------------------------------------------------------------------

. notes: male & no-hel

. lincom _cons+ 1.sex , eform

( 1) [pop]1.sex + [pop]_cons = 0

------------------------------------------------------------------------------

pop | exp(b) Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

(1) | 17463.63 130.6028 1306.12 0.000 17209.52 17721.49

. notes: female & hel

. lincom _cons +1.hel, eform

 ( 1)  [pop]1.hel + [pop]_cons = 0

------------------------------------------------------------------------------
         pop |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    592.626   13.64176   277.36   0.000     566.4828    619.9757
------------------------------------------------------------------------------

. notes: female & non-hel

. lincom _cons , eform

 ( 1)  [pop]_cons = 0

------------------------------------------------------------------------------
         pop |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   5530.374   71.58104   665.83   0.000     5391.842    5672.465
------------------------------------------------------------------------------

To assess the association between sex and helmet use, an interaction between the two variables must be fitted in the Poisson model, as follows:

** estimate the log likelihood for the independence model
. qui poisson pop ib(2).sex ib(2).hel
. estimates store A
. poisson pop ib(2).sex##ib(2).hel, nolog

Poisson regression                              Number of obs   =          4
                                                LR chi2(3)      =   26306.37
                                                Prob > chi2     =     0.0000
Log likelihood = -19.833152                     Pseudo R2       =     0.9985

------------------------------------------------------------------------------
         pop |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       1.sex |   1.152098   .0154388    74.62   0.000     1.121838    1.182357
       1.hel |  -2.216057   .0429215   -51.63   0.000    -2.300181   -2.131932
             |
     sex#hel |
        1 1  |  -.0229488   .0493614    -0.46   0.642    -.1196953    .0737977
             |
       _cons |   8.616314   .0134583   640.22   0.000     8.589936    8.642692
------------------------------------------------------------------------------

. estimates store B
. lrtest B A

Likelihood-ratio test                                 LR chi2(1)  =      0.22
(Assumption: A nested in B)                           Prob > chi2 =    0.6425

The coefficient of sex#hel is -0.023 (p = 0.642), which suggests no association between the two variables. These results agree with the logit model shown below.

. char hel[omit] 2

. xi: logit i.hel i.sex [freq=pop]
i.hel             _Ihel_1-2           (naturally coded; _Ihel_2 omitted)
i.sex             _Isex_1-2           (naturally coded; _Isex_1 omitted)

Logistic regression                             Number of obs   =      25458
                                                LR chi2(1)      =       0.22
                                                Prob > chi2     =     0.6425
Log likelihood = -8094.6473                     Pseudo R2       =     0.0000

------------------------------------------------------------------------------
     _Ihel_1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Isex_2 |   .0229488   .0493614     0.46   0.642    -.0737977    .1196953
       _cons |  -2.239006    .024378   -91.85   0.000    -2.286786   -2.191226
------------------------------------------------------------------------------
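The agreement between the interaction coefficient, the logit coefficient, and the likelihood-ratio test can be verified directly from the four observed counts. The sketch below is a cross-check in Python; it uses the usual large-sample (Woolf) standard error of a log odds ratio and the identity P(X > x) = erfc(sqrt(x/2)) for a chi-square variable with 1 degree of freedom. The two log likelihoods are copied from the outputs above.

```python
import math

# Observed counts: (male, helmet), (male, no helmet), (female, helmet), (female, no helmet)
a, b, c, d = 1862, 17473, 602, 5521

log_or = math.log((a * d) / (b * c))              # log odds ratio, male vs female
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Woolf standard error

# Likelihood-ratio statistic from the two reported log likelihoods
ll_independence, ll_saturated = -19.940884, -19.833152
lr_chi2 = 2 * (ll_saturated - ll_independence)

# Upper-tail probability of a chi-square with 1 df
p_value = math.erfc(math.sqrt(lr_chi2 / 2))

print(f"log OR = {log_or:.4f}, SE = {se_log_or:.4f}")
print(f"LR chi2(1) = {lr_chi2:.2f}, p = {p_value:.4f}")
```

The log odds ratio equals the sex#hel coefficient of the saturated Poisson model, and it has the opposite sign in the logit output only because that model compares females with males.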

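Recall the log-linear equation

$$\mu_{ij} = \mu\,\alpha_i\,\beta_j$$

If one variable from the right side of the log-linear equation (e.g., $\alpha_i$) is assigned as the outcome and thus moved to the left side, the equation becomes the Poisson regression equation, as follows:

Let $\ln y_{ij}$ denote the log of the expected count, $E(Y_{ij})$, for $i = 1, 2, 3, \dots, k$ and $j = 0, 1$, where

$$\ln y_{ij} = \beta_0 + \sum_i \beta_i X_i + e_{ij}$$

$$\ln y_{ij} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + e_{ij}$$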

Here log(yij) is the log of the number of outcome events and xi is an independent variable. The outcome event is typically rare, and/or the population is very large compared with the number of events of interest. Examples are the number of new cases of lung cancer in the general population of a country, the number of rabies infections per year, the number of deaths from head injuries per year, and the number of myocardial infarctions within 5 years among patients with diabetes. The link function of Poisson regression is the log link. Since the response variable is a discrete quantity, one might think of modeling it by ordinary least squares (e.g., linear regression). However, least squares does not work well with this type of data because:

- The data are skewed, whereas least squares inference assumes normally distributed errors

- The variance of a Poisson variable increases with its mean, and hence usually with X, whereas least squares assumes a constant variance across values of X

- The data are non-negative integers (Yij ≥ 0), whereas linear regression fitted by least squares will sometimes give negative predictions!

Poisson regression does not suffer from the problems described above for linear regression. The data required for analysis can be summary or raw data. The layout for summary data is the total number of subjects and the number of events for each level of the group variable, as displayed below:

Table 3. Number of deaths from trauma by age group

Age group     Population (N)     No. of deaths
≤ 24                  15,860             1,046
25 – 34                8,079               838
35 – 44                5,778               541
45 – 54                3,147               357
55 – 64                1,825               221
65 – 74                  907               147
75 – 84                  230                44
≥ 85                      32                 6

For these data, the outcome variable is death and the independent variable is age group. The mortality rate for each age group is estimated as:

$$I_i = \frac{n_i}{N_i}$$

$$I_{age \le 24} = \frac{1{,}046}{15{,}860} = 0.066 = 66/1{,}000/\text{year}$$

The incidence rate can be a cumulative incidence if the number of persons in the population is observed, or an incidence density if person-time is observed. The number of deaths is assumed to follow a Poisson distribution if deaths occur independently and randomly over time and the number of deaths is very small compared with the whole population. The Poisson distribution is as follows:

$$P(Y; \lambda) = \frac{\lambda^{Y} e^{-\lambda}}{Y!}, \quad Y = 0, 1, 2, 3, \dots, \infty$$

$$E(Y) = \lambda, \quad Var(Y) = \lambda$$

Theoretically, a Poisson variable can take any non-negative integer value, and the probability of Y changes as the mean λ changes. The special characteristic of the Poisson distribution is that its mean equals its variance (i.e., λ). For the incidence rate above, the variance and the 95% confidence interval can be estimated as:

$$I_{age \le 24} = 0.066/\text{year}$$

$$Var(I_{age \le 24}) = \frac{I_{age \le 24}}{N}$$

$$SE = \sqrt{\frac{0.066}{15{,}860}} = 2/1{,}000$$

Thus

$$95\%\ CI = \lambda \pm Z_{\alpha/2}\,SE = 66 \pm 1.96 \times 2\ /1{,}000/\text{year} = (62, 70)$$
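The arithmetic of this confidence interval can be sketched in a few lines. The example below is a cross-check in Python using the age ≤ 24 row of Table 3; it relies on the Poisson assumption that the variance of the rate is I/N, and the variable names are illustrative.

```python
import math

deaths, population = 1046, 15860        # age <= 24 row of Table 3

rate = deaths / population              # incidence rate per person per year
se = math.sqrt(rate / population)       # Var(I) = I / N under the Poisson assumption
z = 1.96                                # normal quantile for a 95% interval

lower, upper = rate - z * se, rate + z * se

# Express the rate and its limits per 1,000 persons per year
per_1000 = [round(1000 * x) for x in (rate, lower, upper)]
print(per_1000)
```

The standard error is about 0.002, i.e., 2 per 1,000, which gives the interval of 62 to 70 deaths per 1,000 per year around the point estimate of 66.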

If death is the outcome of interest, a countable and rare event, it can be modeled on independent variables that may affect the number of events, as in other regression models:

$$\ln y_{ij} = b_0 + \sum_i b_i\,agegr_i + e_{ij}$$

The interpretation of $b_i$ is described below, with age ≤ 24 as the reference group.

$$\ln y_{age \le 24} = b_0$$

$$\ln y_{age\,25-34} = b_0 + b_1$$

$$\ln y_{age\,25-34} - \ln y_{age \le 24} = b_0 + b_1 - b_0 = b_1$$

$$\ln\frac{y_{age\,25-34}}{y_{age \le 24}} = b_1$$

$$\frac{y_{age\,25-34}}{y_{age \le 24}} = e^{b_1} = RR_1$$

That is, using a Poisson regression model we can estimate the rate ratio (RR) for any age group (or other exposure group) compared with the reference group as the exponential of the corresponding coefficient. Fitting the Poisson model in STATA is straightforward once the data are in the required format: summary data giving the number of deaths (outcome) and the total number of subjects (N) for each level of the group variable (age group in this example). The data layout consists of at least 3 variables, agegr, death, and N, where agegr identifies the group, and death and N are the number of deaths and the population size (or person-time, if observed) in that age group. This format is known as summary data. If only raw (individual patient) data are available, the model can be fitted directly, as for other models (e.g., logistic and Cox models). Alternatively, raw data can be transformed to summary data using the “contract” command in STATA, as below. Summary data require much less storage, and the model is fitted faster.

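use "raw data for table 3.dta"

* group age into the categories of Table 3
gen agegr = recode(age, 24, 34, 44, 54, 64, 74, 84, 85)
lab define agegr 24 "<=24" 34 "25-34" 44 "35-44" 54 "45-54" 64 "55-64" 74 "65-74" 84 "75-84" 85 ">=85"
lab value agegr agegr

* collapse to one record per age group and death status
contract agegr death
drop if death == .

* N = total subjects per age group; keep only the death counts
egen N = sum(_freq), by(agegr)
keep if death == 1
drop death
ren _freq death
list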

. xi: poisson death i.agegr , exposure(N)
i.agegr           _Iagegr_24-85       (naturally coded; _Iagegr_24 omitted)

Poisson regression                              Number of obs   =          8
                                                LR chi2(7)      =     229.09
                                                Prob > chi2     =     0.0000
Log likelihood = -28.278512                     Pseudo R2       =     0.8020

------------------------------------------------------------------------------
       death |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iagegr_34 |   .4528216    .046361     9.77   0.000     .3619557    .5436874
  _Iagegr_44 |   .3504332   .0529571     6.62   0.000     .2466393    .4542272
  _Iagegr_54 |   .5423577   .0612955     8.85   0.000     .4222207    .6624948
  _Iagegr_64 |   .6076543   .0740332     8.21   0.000     .4625519    .7527566
  _Iagegr_74 |    .899117   .0880837    10.21   0.000      .726476    1.071758
  _Iagegr_84 |   1.064937   .1538938     6.92   0.000     .7633109    1.366563
  _Iagegr_85 |    1.04485   .4094175     2.55   0.011     .2424069    1.847294
       _cons |  -2.718827   .0309196   -87.93   0.000    -2.779428   -2.658226
           N | (exposure)
------------------------------------------------------------------------------

The estimated coefficients (βs) suggest that the risk of death tends to increase with age. The estimated β for age 25-34 is 0.45 and the RR = exp(0.45) = 1.57, i.e., subjects aged 25-34 years have about 1.6 times the death rate of subjects aged ≤ 24 years. RRs can be displayed in STATA by adding the option “irr” to the poisson command.

. poisson, irr

Poisson regression                              Number of obs   =          8
                                                LR chi2(7)      =     229.09
                                                Prob > chi2     =     0.0000
Log likelihood = -28.278512                     Pseudo R2       =     0.8020

------------------------------------------------------------------------------
       death |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iagegr_34 |   1.572744   .0729139     9.77   0.000     1.436135    1.722346
  _Iagegr_44 |   1.419682   .0751822     6.62   0.000     1.279717    1.574956
  _Iagegr_54 |   1.720058   .1054319     8.85   0.000     1.525345    1.939625
  _Iagegr_64 |   1.836119   .1359337     8.21   0.000     1.588122    2.122844
  _Iagegr_74 |   2.457432   .2164598    10.21   0.000     2.067781    2.920509
  _Iagegr_84 |   2.900657    .446393     6.92   0.000     2.145368     3.92185
  _Iagegr_85 |   2.842973   1.163963     2.55   0.011     1.274313    6.342633
           N | (exposure)
------------------------------------------------------------------------------
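With a single categorical covariate the Poisson model is saturated in the group rates, so the reported IRRs are simply ratios of the crude rates in Table 3. The sketch below is a cross-check in Python (the dictionary names are illustrative) that recomputes the crude rates, the rate ratios against age ≤ 24, and their logs.

```python
import math

# Table 3: age group -> (population N, number of deaths)
table3 = {
    "<=24":  (15860, 1046),
    "25-34": (8079, 838),
    "35-44": (5778, 541),
    "45-54": (3147, 357),
    "55-64": (1825, 221),
    "65-74": (907, 147),
    "75-84": (230, 44),
    ">=85":  (32, 6),
}

rates = {group: deaths / n for group, (n, deaths) in table3.items()}
reference = rates["<=24"]

rate_ratios = {group: rate / reference for group, rate in rates.items()}
log_rate_ratios = {group: math.log(rr) for group, rr in rate_ratios.items()}

for group in table3:
    print(f"{group:6s} rate={rates[group]:.4f} "
          f"RR={rate_ratios[group]:.3f} log RR={log_rate_ratios[group]:.3f}")
```

The log of the reference rate corresponds to _cons, and each log rate ratio corresponds to the coefficient of the matching age-group indicator.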


When the data are already summarized, as in Table 3, they can be entered into STATA directly, as below:

tabi 1046 15860 \ 838 8079 \ 541 5778 \ 357 3147 \ 221 1825 \ 147 907 \ 44 230 \ 6 32, replace
contract row col [freq=pop]
list
egen N=sum(_freq), by(row)
list
drop if col ==2
drop col
list
ren row agegr
ren _freq death
ren N pop
list
lab define agegr 1"<=24" 2"25-34" 3"35-44" 4"45-54" 5"55-64" 6"65-74" 7"75-84" 8">=85", modify
lab value agegr agegr
list agegr death pop
gen ir = death/pop
list

. poisson death i.agegr, exp(pop) nolog

Poisson regression                              Number of obs   =          8
                                                LR chi2(7)      =     191.43
                                                Prob > chi2     =     0.0000
Log likelihood = -28.278512                     Pseudo R2       =     0.7719

------------------------------------------------------------------------------
       death |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       agegr |
         34  |   .4179985    .046361     9.02   0.000     .3271326    .5088643
         44  |   .3247983   .0529571     6.13   0.000     .2210043    .4285922
         54  |   .4987706   .0612955     8.14   0.000     .3786335    .6189076
         64  |    .557216   .0740332     7.53   0.000     .4121136    .7023183
         74  |   .8127801   .0880837     9.23   0.000     .6401391     .985421
         84  |   .9537568   .1538938     6.20   0.000     .6521305    1.255383
         85  |   .9368685   .4094175     2.29   0.022      .134425    1.739312
             |
       _cons |  -2.782695   .0309196   -90.00   0.000    -2.843297   -2.722094
       ln(N) |          1  (exposure)
------------------------------------------------------------------------------

. margins agegr, predict(ir)

Adjusted predictions                            Number of obs   =          8
Model VCE    : OIM
Expression   : Predicted incidence rate, predict(ir)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       agegr |
         24  |   .0618715    .001913    32.34   0.000      .058122     .065621
         34  |   .0939778   .0032464    28.95   0.000      .087615    .1003406
         44  |   .0856148   .0036809    23.26   0.000     .0784004    .0928292
         54  |   .1018836   .0053922    18.89   0.000     .0913149    .1124522
         64  |   .1080156   .0072659    14.87   0.000     .0937747    .1222566
         74  |   .1394687   .0115032    12.12   0.000     .1169229    .1620145
         84  |   .1605839   .0242089     6.63   0.000     .1131353    .2080326
         85  |   .1578947   .0644603     2.45   0.014      .031555    .2842345
------------------------------------------------------------------------------

* ir already holds the crude rate, so the predicted rate needs a new name
predict ir_hat, ir
table agegr, c(mean ir_hat)

Note that in this output the exposure is the row total of the table entered with tabi, i.e., deaths plus the Table 3 population (for example, 1,046 + 15,860 = 16,906 for age ≤ 24), so the rates and coefficients are slightly lower than those from the raw-data analysis above. To reproduce the Table 3 rates exactly, enter the number of survivors (N minus deaths) as the second column of tabi.


3. ESTIMATION OF RISK RATIO FOR FOLLOW UP STUDY

For a randomized controlled trial (RCT) or a cohort study, the treatment or exposure effect can be estimated using either the odds ratio (OR) or the risk ratio (RR). The estimated OR is close to the estimated RR if the event/disease of interest is rare. If the event/disease is common, however, the OR is further from 1 than the RR, and the treatment or exposure effect is thus overestimated. Analyses often need to adjust for confounding factors, in a cohort study or in an RCT where randomization has not balanced the groups, and multiple logistic regression is often used for this purpose. To estimate the treatment effect as an RR more accurately, several methods have been proposed (see Appendix I), but only three are covered in this module. The first is Poisson regression, a generalized linear model with a log link function and a Poisson distribution. The second is a generalized linear model with a log link but a binomial (not Poisson) distribution, known as log-binomial or binomial log-linear regression. The last applies a logit equation, estimates the marginal probability of the event in each group of the variable, and then takes the ratio of these marginal probabilities. The data in Table 4 come from a cohort study of breast cancer (Appendix I). The investigators would like to estimate the risk of death by stage and by estrogen receptor level, but the effect of stage on death might be confounded by estrogen receptor level, or vice versa.

Table 4. Frequency of death by estrogen receptor level and stage

Estrogen receptor level     Stage     Total     Death
Low                             1        12         2
High                            1        55         5
Low                             2        22         9
High                            2        74        17
Low                             3        14        12
High                            3        15         9
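Before any model is fitted, the gap between the OR and the RR for a common outcome can be seen in the crude (unadjusted) figures of Table 4. The sketch below is a cross-check in Python; `risk_and_odds` is a made-up helper name, and the results ignore confounding by stage, so they are not the adjusted estimates discussed in this section.

```python
# Table 4: (estrogen receptor level, stage) -> (total, deaths)
table4 = {
    ("low", 1): (12, 2),   ("high", 1): (55, 5),
    ("low", 2): (22, 9),   ("high", 2): (74, 17),
    ("low", 3): (14, 12),  ("high", 3): (15, 9),
}

def risk_and_odds(level):
    """Crude risk and odds of death for one receptor level (hypothetical helper)."""
    total = sum(n for (er, _), (n, _) in table4.items() if er == level)
    deaths = sum(d for (er, _), (_, d) in table4.items() if er == level)
    risk = deaths / total
    return risk, risk / (1 - risk)

risk_low, odds_low = risk_and_odds("low")
risk_high, odds_high = risk_and_odds("high")

crude_rr = risk_low / risk_high
crude_or = odds_low / odds_high
overall_risk = sum(d for _, d in table4.values()) / sum(n for n, _ in table4.values())

print(f"overall risk = {overall_risk:.3f}")
print(f"crude RR = {crude_rr:.2f}, crude OR = {crude_or:.2f}")
```

With an overall risk of death of about 28%, the crude OR (about 3.4) is clearly further from 1 than the crude RR (about 2.2).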


3.1 Using Poisson regression

We use the breast cancer data provided by the STATA Journal for illustration. The analysis using Poisson regression can be done as follows:

use "Breast cancer IPD.dta"

* estimate the death rate and its CI using the Poisson and binomial distributions
. ci death , poisson

                                                         -- Poisson  Exact --
    Variable |   Exposure        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       death |        192      .28125    .0382733        .2112837    .3669702

. ci death     /* binomial distribution, computes the standard normal CI by default */

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       death |        192      .28125    .0325326        .2170807    .3454193

. poisson death i.stage i.er2, irr nolog     /* report IRR rather than log(rate) */

Poisson regression                              Number of obs   =        192
                                                LR chi2(3)      =      26.71
                                                Prob > chi2     =     0.0000
Log likelihood = -109.14601                     Pseudo R2       =     0.1090

------------------------------------------------------------------------------
       death |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       stage |
          2  |   2.520742   1.074375     2.17   0.030     1.093288    5.811955
          3  |   5.913372   2.645148     3.97   0.000     2.460814    14.20992
             |
         er2 |
        low  |   1.630775   .4688634     1.70   0.089     .9282513    2.864987
       _cons |   .0938724   .0361805    -6.14   0.000     .0441028    .1998065
------------------------------------------------------------------------------

However, the model-based standard errors of these coefficients may be biased (too large), because the outcome is binary and the event of interest (death) is not rare (28%), so the Poisson variance assumption does not hold. To relax this assumption, a robust variance estimator should be applied.

. poisson death i.stage i.er2, irr vce(robust) nolog

Poisson regression                              Number of obs   =        192
                                                Wald chi2(3)    =      53.61
                                                Prob > chi2     =     0.0000
Log pseudolikelihood = -109.14601               Pseudo R2       =     0.1090

------------------------------------------------------------------------------
             |               Robust
       death |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       stage |
          2  |   2.520742   .9937819     2.35   0.019      1.16399    5.458932
          3  |   5.913372    2.28568     4.60   0.000     2.772187    12.61386
             |
         er2 |
        low  |   1.630775   .3480542     2.29   0.022     1.073305    2.477792
       _cons |   .0938724   .0338979    -6.55   0.000     .0462555    .1905077
------------------------------------------------------------------------------
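Why the Poisson variance is too large for a binary outcome can be illustrated with the overall death proportion (54 deaths among 192 patients). The sketch below is a cross-check in Python, not STATA output: a Poisson variable has variance equal to its mean p, whereas a 0/1 variable has variance p(1 - p), which is smaller whenever p is not close to zero. The n - 1 divisor is an assumption chosen to mirror a sample standard deviation divided by the square root of n.

```python
import math

n, deaths = 192, 54
p = deaths / n                                   # overall proportion of deaths

se_poisson = math.sqrt(p / n)                    # variance = mean under Poisson
se_binomial = math.sqrt(p * (1 - p) / (n - 1))   # sample SD of a 0/1 variable / sqrt(n)

ratio = se_poisson / se_binomial
print(f"p = {p:.5f}")
print(f"Poisson SE = {se_poisson:.7f}, binomial SE = {se_binomial:.7f}, ratio = {ratio:.3f}")
```

The Poisson standard error is roughly 18% larger here, which is the same direction as the difference between the model-based and robust standard errors in the two regression outputs.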


3.2 Using binary regression

Another model that can be applied to data and a study design like this is binary (log-binomial) regression, in which the outcome variable has a binomial distribution and a log link function relates the outcome to the independent variables. This can be performed in a few ways in STATA, for example as below (Cummings P, 2011):

. binreg death i.stage i.er2, nolog rr

Generalized linear models                         No. of obs      =        192
Optimization     : MQL Fisher scoring             Residual df     =        188
                   (IRLS EIM)                     Scale parameter =          1
Deviance         =  185.8541388                   (1/df) Deviance =   .9885858
Pearson          =  190.2164034                   (1/df) Pearson  =   1.011789

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u)                   [Log]

                                                  BIC             =   -802.555

------------------------------------------------------------------------------
             |                 EIM
       death | Risk Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       stage |
          2  |   2.538158   .9976133     2.37   0.018     1.174781    5.483782
          3  |   5.868047   2.259727     4.60   0.000     2.758699    12.48196
             |
         er2 |
        low  |   1.558326   .3067419     2.25   0.024     1.059515    2.291972
       _cons |   .0951663   .0342763    -6.53   0.000      .046979    .1927803
------------------------------------------------------------------------------

3.3 Using logit regression

Another way to model an outcome that is not rare is to apply a logit equation. However, this estimates an odds ratio rather than a risk ratio. If the study design is a cohort study or a randomized controlled trial, we prefer to report the effect of a prognostic factor or treatment intervention as a risk ratio rather than an odds ratio. In addition, the odds ratio overestimates the effect if the outcome is common. After fitting the logit model, it is therefore necessary to estimate the marginal probabilities and then use these estimates to calculate the risk ratio, as below:

. logistic death ib(1).er2 i.stage , nolog

Logistic regression                             Number of obs   =        192
                                                LR chi2(3)      =      42.27
                                                Prob > chi2     =     0.0000
Log likelihood = -92.939847                     Pseudo R2       =     0.1853

------------------------------------------------------------------------------
       death | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       0.er2 |   2.508065   .9916923     2.33   0.020     1.155507    5.443836
             |
       stage |
          2  |   3.109772    1.44851     2.44   0.015     1.248087    7.748406
          3  |    18.8389   11.03231     5.01   0.000     5.978343    59.36498
             |
       _cons |   .0937695   .0393847    -5.64   0.000     .0411665    .2135893
------------------------------------------------------------------------------

The odds ratios were 2.5 for estrogen receptor level and as high as 18.8 for stage 3 versus stage 1. Let’s estimate the marginal probability of death for each level of the factor:

. margins er2, post coeflegend

Expression   : Pr(death), predict()

------------------------------------------------------------------------------
             |     Margin  Legend
-------------+----------------------------------------------------------------
         er2 |
          0  |   .4008795  _b[0bn.er2]
          1  |   .2392455  _b[1.er2]
------------------------------------------------------------------------------

. nlcom (lnrr: ln(_b[0bn.er2]/_b[1.er2])), post

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lnrr |   .5161706   .2177193     2.37   0.018     .0894487    .9428925
------------------------------------------------------------------------------

. display "risk ratio = " exp(_b[lnrr]) _skip(3) /*
> */ "95% CI = " exp(_b[lnrr]-invnormal(1-.025)*_se[lnrr]) /*
> */ "," exp(_b[lnrr]+invnormal(1-.025)*_se[lnrr])
risk ratio = 1.6755988   95% CI = 1.0935712,2.567397

The death rates (predicted probabilities of death) were 0.24 and 0.40 for receptor positive and negative, respectively. The ratio of death rates can be estimated using the ‘nlcom’ command after marginal estimation, or using the adjrr command as below:

adjrr er2

R1 = 0.4009 (0.0660) 95% CI (0.2716, 0.5302)

R0 = 0.2392 (0.0332) 95% CI (0.1742, 0.3043)

ARR = 1.6756 (0.3648) 95% CI (1.0936, 2.5674)

ARD = 0.1616 (0.0745) 95% CI (0.0155, 0.3077)

p-value (R0 = R1): 0.0301

p-value (ln(R1/R0) = 0): 0.0177

This yielded an RR of 1.68 (95% CI: 1.09, 2.57). As you can see, the estimated RR (1.68) and OR (2.51) differ considerably because death is not a rare outcome in these data.
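The gap between the two measures can be seen directly from the two predicted risks. The sketch below, in Python, converts the risks reported by adjrr into odds. The variable and function names are illustrative. The odds ratio it produces is computed from the marginal risks, so it is not expected to equal the stage-adjusted odds ratio of 2.51 from the logistic model.

```python
# Predicted risks copied from the adjrr output (rounded to four decimals).
risk_er_negative = 0.4009   # R1
risk_er_positive = 0.2392   # R0


def odds(risk):
    """Convert a risk (probability) into odds."""
    return risk / (1 - risk)


risk_ratio = risk_er_negative / risk_er_positive
risk_difference = risk_er_negative - risk_er_positive
odds_ratio = odds(risk_er_negative) / odds(risk_er_positive)

print(f"RR = {risk_ratio:.2f}, RD = {risk_difference:.2f}, OR = {odds_ratio:.2f}")
```

With risks of 0.24 and 0.40 the odds ratio is about 2.1 while the risk ratio is about 1.7. The two measures would approach each other only if both risks were small.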

The risk ratios and standard errors estimated from the binomial log-linear model (or from the logit model with post-estimation margins) are not much different from those of a Poisson regression. The Poisson approach, however, tends to give higher estimates of both quantities than the binary log-linear approach; in particular, the Poisson model without robust variance estimation yields the highest standard errors. The most appropriate method for these data is binary log-linear regression, followed by Poisson regression with robust variance estimation.

4. CAPTURE-RECAPTURE

The capture-recapture method is used to estimate the size of a missing or undercounted population by merging 2 or more databases. For instance, a researcher would like to estimate the rates of adolescent pregnancy and abortion across Thailand using a cross-sectional hospital-based data registry of the Ministry of Public Health. However, rates estimated from the hospital data registry are likely to be underestimated for several reasons. Not all hospitals collaborate with this project, and there may be incorrect coding in the databases of the collaborating hospitals.

Some pregnant adolescents might give birth or have an abortion at home or at a clinic, so their names would not appear in the hospital databases. The hospital-based data on live births may overlap with a few other sources of data, e.g., the Civil Registration Database under the Ministry of Interior and the Health National Survey. The overlapping data are called recapture data, which can be used to estimate or predict the number of missing cases; see Chao 2001 and Hook 1995. Various approaches can be applied to estimate the missing number, but we will focus on applying a log-linear model (or Poisson model). Details of the analysis are illustrated using the example below.

Suppose that we would like to estimate the prevalence of Hepatitis C infection. There are 3 sources of data, i.e., A, B, and C; each variable is coded 1 if a case is present in that data registry and 0 if it is absent. The data therefore form a 2x2x2 table, as described in Table 5. The missing number ‘h’, the number of cases that appear in none of the three sources, is the number that we would like to estimate.

Table 5. Data layout for capture-recapture

Source A    Source B    Source C
                         +      -
   +           +         a      b
               -         c      d
   -           +         e      f
               -         g      h*

We can apply a log-linear model as described above, i.e., a Poisson regression of the cell frequencies on the three source indicators with no separate outcome variable, to estimate the missing number ‘h’. We need to enter the frequency of each cell in Table 5 into STATA as listed below:

list freq A B C

     +------------------+
     | freq   A   B   C |
     |------------------|
  1. |    5   1   1   1 |
  2. |    7   1   1   0 |
  3. |   15   1   0   1 |
  4. |   37   1   0   0 |
  5. |    3   0   1   1 |
     |------------------|
  6. |   14   0   1   0 |
  7. |   20   0   0   1 |
  8. |    .   0   0   0 |
     +------------------+

poisson freq i.A i.B i.C, nolog

Poisson regression Number of obs = 8

LR chi2(3) = 96.00

Prob > chi2 = 0.0000

Log likelihood = -18.717469 Pseudo R2 = 0.7194

------------------------------------------------------------------------------

freq | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

1.A | -.3184537 .1642822 -1.94 0.053 -.6404409 .0035334

1.B | -1.444889 .2064288 -7.00 0.000 -1.849481 -1.040296

1.C | -.9301478 .1800837 -5.17 0.000 -1.283105 -.5771903

_cons | 3.933108 .1245397 31.58 0.000 3.689015 4.177201

------------------------------------------------------------------------------

For this Poisson model, we can use the post-estimation command ‘predict’ to predict the expected number in the missing cell ‘h’, or we can use ‘lincom’ to estimate it. Both suggest that the missing number is 51, which could range from 30 to 88 subjects. The estimated total number of Hepatitis C positive cases is 152 (51+101). Supposing that the population numbers 15,000, the estimated rate is 10 (95% CI: 9, 12) per 1000 subjects.
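The rate and its interval can be reproduced from the count of 152 and the population of 15,000. The sketch below assumes the conventional exact Poisson interval, whose limits are the means at which the observed count sits in a 2.5% tail, and finds those limits by bisection using only the Python standard library. The function names are illustrative.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), with each term computed on the log scale."""
    return sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k + 1)
    )


def bisect(f, lo, hi, tol=1e-9):
    """Root of a monotone function f on [lo, hi], found by bisection."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_poisson_ci(count, alpha=0.05):
    """Exact two-sided interval for a Poisson mean, given an observed count > 0."""
    # Lower limit: P(X >= count) = alpha/2, i.e. P(X <= count - 1) = 1 - alpha/2.
    lower = bisect(lambda lam: poisson_cdf(count - 1, lam) - (1 - alpha / 2),
                   1e-9, count)
    # Upper limit: P(X <= count) = alpha/2.
    upper = bisect(lambda lam: poisson_cdf(count, lam) - alpha / 2,
                   count, 10 * count + 10)
    return lower, upper


observed, predicted_missing, population = 101, 51, 15000
total_cases = observed + predicted_missing
rate = total_cases / population
lower, upper = exact_poisson_ci(total_cases)

print(f"rate per 1000 = {1000 * rate:.1f} "
      f"(95% CI {1000 * lower / population:.1f}, {1000 * upper / population:.1f})")
```

This interval treats the total of 152 as an observed count, so it does not carry over the uncertainty in the predicted 51 missing cases.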

list A B C freq exp

     +-----------------------------+
     | A   B   C   freq   exp_freq |
     |-----------------------------|
  1. | 1   1   1      5   3.440009 |
  2. | 1   1   0      7   8.745244 |
  3. | 1   0   1     15   14.62778 |
  4. | 1   0   0     37   37.18697 |
  5. | 0   1   1      3   4.746958 |
     |-----------------------------|
  6. | 0   1   0     14   12.06779 |
  7. | 0   0   1     20   20.18525 |
  8. | 0   0   0      .   51.31526 |
     +-----------------------------+

lincom _b[_cons] , eform

 ( 1)  [freq]_cons = 0

------------------------------------------------------------------------------
        freq |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   51.31526   14.01225    14.42   0.000     30.04801    87.63496
------------------------------------------------------------------------------

cii 15000 152, poisson

                                                         -- Poisson  Exact --
    Variable |   Exposure        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
             |      15000    .0101333    .0008219        .0085864    .0118784

Estimate using post-estimation command

predict exp_freq       /*expected number of events, default; which is exp(xb)!*/
predict xb, xb         /*ln(exp no.)*/
predict stdp, stdp     /*standard error of xb */
gen ll_xb = xb-1.96*stdp
gen ul_xb = xb+1.96*stdp
gen ll_exp_freq = exp(ll_xb)
gen ul_exp_freq = exp(ul_xb)
list freq exp_freq ll_exp_freq ul_exp_freq xb ll_xb ul_xb , compress

     +------------------------------------------------------------------------+
     | freq   exp_freq   ll_exp~q   ul_exp~q         xb      ll_xb      ul_xb |
     |------------------------------------------------------------------------|
  1. |    5   3.440009   1.834266   6.451441   1.235474   .6066446   1.864304 |
  2. |    7   8.745244   5.640689    13.5585    2.16851   1.730006   2.607014 |
  3. |   15   14.62778   9.879021   21.65922   2.682922   2.290413   3.075431 |
  4. |   37   37.18697   27.45409    50.3703   3.615958   3.312515   3.919402 |
  5. |    3   4.746958   2.976143   7.571413   1.557504   1.090628    2.02438 |
     |------------------------------------------------------------------------|
  6. |   14   12.06779   7.615479   19.12309    2.49054   2.030183   2.950897 |
  7. |   20   20.18525   13.68008   29.78377   3.004952   2.615941   3.393964 |
  8. |    .   51.31526   30.04772   87.63582   3.937988   3.402787    4.47319 |
     +------------------------------------------------------------------------+
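The predicted value for the missing cell can also be reproduced outside Stata. The following is a minimal sketch, in plain Python, that fits the main-effects log-linear model to the seven observed cells by iteratively reweighted least squares and then exponentiates the intercept, since all three indicators are 0 for the missing cell. The function names and the small linear solver are illustrative and not taken from any library. The interval is a Wald interval on the log scale with 1.96, as in the gen commands above.

```python
import math

# Observed cells (A, B, C, freq); the (0, 0, 0) cell is the unknown 'h'.
cells = [
    (1, 1, 1, 5), (1, 1, 0, 7), (1, 0, 1, 15), (1, 0, 0, 37),
    (0, 1, 1, 3), (0, 1, 0, 14), (0, 0, 1, 20),
]
X = [[1.0, float(a), float(b), float(c)] for a, b, c, _ in cells]  # intercept + main effects
y = [float(cell[3]) for cell in cells]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def information(X, mu):
    """Information matrix X' diag(mu) X for a Poisson model with log link."""
    k = len(X[0])
    return [[sum(r[i] * r[j] * m for r, m in zip(X, mu)) for j in range(k)]
            for i in range(k)]


def fit_poisson(X, y, iterations=25):
    """Poisson regression by IRLS, started from mu = y (all counts here are > 0)."""
    k = len(X[0])
    mu = y[:]
    beta = [0.0] * k
    for _ in range(iterations):
        eta = [math.log(m) for m in mu]
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]  # working response
        xtwz = [sum(r[j] * m * zi for r, m, zi in zip(X, mu, z)) for j in range(k)]
        beta = solve(information(X, mu), xtwz)
        mu = [math.exp(sum(b * v for b, v in zip(beta, r))) for r in X]
    return beta, mu


beta, mu = fit_poisson(X, y)

# Variance of the intercept = element [0][0] of the inverse information matrix.
se_cons = math.sqrt(solve(information(X, mu), [1.0, 0.0, 0.0, 0.0])[0])

h = math.exp(beta[0])
h_lower = math.exp(beta[0] - 1.96 * se_cons)
h_upper = math.exp(beta[0] + 1.96 * se_cons)
print(f"h = {h:.2f} (95% CI {h_lower:.2f}, {h_upper:.2f})")
```

The fitted intercept corresponds to the xb value listed for cell 8, and the fitted counts for the seven observed cells sum to the observed total of 101.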

Assignment II (20%)

1. The table below contains the numbers of deaths and the total population by age group and sex. The data concern deaths from accidents in 1999.

a) Estimate the death rate and 95% confidence interval by sex. Is the death rate different between males and females?

b) Estimate the death rate and 95% confidence interval by age group. Do the rates differ across age groups?

c) Fit a Poisson regression model with age group and sex. What does the fitted equation look like?

d) Do age and sex, overall, affect the number of deaths? Perform the tests for age and sex.

e) Create a table reporting these results and interpret it.

                         Gender
Age group        Male                Female
             Deaths        N     Deaths        N
≤ 24            828   11,786        218    4,074
25 – 34         682    6,221        156    1,856
35 – 44         432    4,313        109    1,465
45 – 54         259    2,263         98      884
55 – 64         162    1,280         59      545
65 – 74         103      600         44      307
75 – 84          33      150         11       80
≥ 85              3       20          3       12

2. The data set is a registry of accidents during 2001-2002. Education programs were launched to prevent car accidents, particularly during public or long holidays. We would like to know whether the programs worked. (Use "Assignment II death by age sex year edu.dta".)

a) Plan your analysis and create a summary table that includes the number of deaths and the total population by year, education program, age group, and sex.

b) Estimate the death rate for each education program. Do the death rates differ between education programs?

c) Fit a Poisson regression including all variables in the model. Which variables affect death? Demonstrate the tests.

d) Fit a binary log-linear regression including all variables in the model. Compare and comment on the results of the two models, and decide which model is more appropriate.

e) Write a paragraph of results, along with table(s), to report this analysis, then interpret the results.
