RACE 616 Advance Statistical Analysis in Medical Research
Poisson regression
Assoc. Prof. Dr. Ammarin Thakkinstian
Master of Science Program in Medical Epidemiology and
Doctor of Philosophy Program in Clinical Epidemiology
Section for Clinical Epidemiology & Biostatistics
Faculty of Medicine Ramathibodi Hospital
Mahidol University
http://www.ceb-rama.org/
Academic Year 2016
Semester 2
CONTENTS
1. LOG-LINEAR MODEL
2. POISSON REGRESSION
3. ESTIMATION OF RISK RATIO FOR FOLLOW UP STUDY
3.1 Using Poisson regression
3.2 Using binary regression
3.3 Using logit regression
4. CAPTURE-RECAPTURE
Assignment II (20%)
OBJECTIVES
Students should be able to
• Analyze categorical data using log-linear and Poisson models
• Construct a Poisson regression model
• Estimate the rate ratio, test coefficients, and interpret results
• Apply a log-linear or Poisson model in capture-recapture analysis
REFERENCES
1. Agresti A. Categorical data analysis. 2nd edition. New York: John Wiley & Sons Inc; 2002.
2. Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied regression analysis and other
multivariable methods. 3rd edition. Washington: Duxbury Press; 1998: 687-709.
3. Zelterman D. Models for discrete data. Revised edition. Oxford: Oxford University Press; 2006.
4. Cummings P. Methods for estimating adjusted risk ratios. The Stata Journal 2009; 9: 175-196.
5. Cummings P. Estimating adjusted risk ratios for matched and unmatched data: An update.
The Stata Journal 2011; 11: 290-298.
6. Chao A, Tsay PK, Lin SH, Shau WY, Chao DY. The applications of capture-recapture models
to epidemiological data. Stat Med 2001 Oct 30; 20(20): 3123-57.
7. Hook EB, Regal RR. Internal validity analysis: a method for adjusting capture-recapture
estimates of prevalence. Am J Epidemiol 1995 Nov 1; 142(9 Suppl): S48-52.
READING SECTION
Appendix I: Methods for estimating adjusted risk ratio
ASSIGNMENT II (20%) p.24, due: Feb 2, 2017
1. LOG-LINEAR MODEL
The analysis of categorical data sometimes involves variables that have not been designated in
advance as response (outcome) or explanatory variables. The aim is then to assess associations or
interactions between the variables simultaneously, while cause-and-effect relationships are still
unclear. A log-linear model is usually applied for this purpose; it models the mean cell counts of a
multi-dimensional contingency table (e.g., I x J, I x J x K, I x J x K x L) of the variables of
interest. The log-linear model is the basis of other models for categorical data (e.g., logistic
regression, the multi-logit model, binary regression, and Poisson regression), but those models
specify the outcome variable in advance.
The simplest log-linear model is that for the 2x2 contingency table, as displayed in table 1
(from Agresti 2002 and Zelterman 2006).
Table 1. Notation for a 2x2 table
               Y
X          1       2       Total
1          n11     n12     n1+
2          n21     n22     n2+
Total      n+1     n+2     n++
Note that a subscript + refers to summation over that index.
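Let $\pi_{ij}$ be the probability that an observation falls in row i and column j, which can be
estimated as below.
\[
\pi_{ij} = \frac{n_{ij}}{n_{++}}: \qquad
\pi_{11} = \frac{n_{11}}{n_{++}}, \quad
\pi_{12} = \frac{n_{12}}{n_{++}}, \quad
\pi_{21} = \frac{n_{21}}{n_{++}}, \quad
\pi_{22} = \frac{n_{22}}{n_{++}}
\]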
The marginal probabilities are estimated as
\[
\pi_{1+} = \frac{n_{1+}}{n_{++}}, \quad
\pi_{2+} = \frac{n_{2+}}{n_{++}}, \quad
\pi_{+1} = \frac{n_{+1}}{n_{++}}, \quad
\pi_{+2} = \frac{n_{+2}}{n_{++}}
\]
The row and column variables are said to be independent if every joint probability of row i and
column j is equal to the product of the corresponding marginal probabilities:
\[
\pi_{ij} = \pi_{i+}\,\pi_{+j} \qquad \text{for } i = 1, \ldots, I;\; j = 1, \ldots, J
\]
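The expected cell count $\mu_{ij}$ (also written $m_{ij}$) under independence is equal to
\[
\mu_{ij} = n_{++}\,\pi_{i+}\,\pi_{+j} = \mu\,\alpha_i\,\beta_j
\]
where $\mu = n_{++}$, $\alpha_i = \pi_{i+}$ is the marginal probability of row i, and
$\beta_j = \pi_{+j}$ is the marginal probability of column j.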
This equation can be re-written as
\[
\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y
\]
where
\[
\lambda = \log \mu, \qquad \lambda_i^X = \log \alpha_i, \qquad \lambda_j^Y = \log \beta_j
\]
In this log-linear model, $\log \mu_{ij}$ is an additive function of the effects of X and Y; the
model is therefore said to be the independence model for the X (row i) and Y (column j) variables.
If X and Y are dependent (i.e., the two factors are associated), the log-linear model is
\[
\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}
\]
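This model is analogous to an ANOVA model with main effects of factors A and B and an
interaction AB. It is also called the saturated model, i.e., all possible effects are
included. The interaction terms $\lambda_{ij}^{XY}$ determine the log odds ratio of the 2x2 table. For
instance,
\[
\begin{aligned}
\log(OR) &= \log\frac{\mu_{11}\,\mu_{22}}{\mu_{12}\,\mu_{21}}
          = \log\mu_{11} + \log\mu_{22} - \log\mu_{12} - \log\mu_{21} \\
         &= (\lambda + \lambda_1^X + \lambda_1^Y + \lambda_{11}^{XY})
          + (\lambda + \lambda_2^X + \lambda_2^Y + \lambda_{22}^{XY})
          - (\lambda + \lambda_1^X + \lambda_2^Y + \lambda_{12}^{XY})
          - (\lambda + \lambda_2^X + \lambda_1^Y + \lambda_{21}^{XY}) \\
         &= \lambda_{11}^{XY} + \lambda_{22}^{XY} - \lambda_{12}^{XY} - \lambda_{21}^{XY}
\end{aligned}
\]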
Example 1. Investigators wanted to assess the association between wearing a helmet and gender (in
other words, whether the proportion of persons wearing helmets differs between males and females).
The data are displayed in table 2.
Table 2. Number of persons wearing a helmet, by sex (shown below together with the fitted values)
The independent (additive) log-linear model is:
\[
\log \mu_{ij} = \lambda + \lambda_i^S + \lambda_j^H
\]
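The parameters $\lambda_i^S$ and $\lambda_j^H$ are the logs of the estimated marginal probabilities,
$\lambda_i^S = \log\hat\pi_{i+}$ and $\lambda_j^H = \log\hat\pi_{+j}$, where
\[
\begin{aligned}
\hat\pi_{1+} &= \frac{n_{1+}}{n_{++}} = \frac{19{,}335}{25{,}458} = 0.760 &
\hat\pi_{2+} &= \frac{n_{2+}}{n_{++}} = \frac{6{,}123}{25{,}458} = 0.240 \\
\hat\pi_{+1} &= \frac{n_{+1}}{n_{++}} = \frac{2{,}464}{25{,}458} = 0.097 &
\hat\pi_{+2} &= \frac{n_{+2}}{n_{++}} = \frac{22{,}994}{25{,}458} = 0.903
\end{aligned}
\]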
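The expected $\hat\mu_{ij}$ is estimated as:
\[
\begin{aligned}
\hat\mu_{11} &= n_{++}\,\hat\pi_{1+}\,\hat\pi_{+1} = 25{,}458 \times 0.760 \times 0.097 \\
\hat\mu_{12} &= n_{++}\,\hat\pi_{1+}\,\hat\pi_{+2} = 25{,}458 \times 0.760 \times 0.903 \\
\hat\mu_{21} &= n_{++}\,\hat\pi_{2+}\,\hat\pi_{+1} = 25{,}458 \times 0.240 \times 0.097 \\
\hat\mu_{22} &= n_{++}\,\hat\pi_{2+}\,\hat\pi_{+2} = 25{,}458 \times 0.240 \times 0.903
\end{aligned}
\]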
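The expected counts under independence can be checked by hand. The following is a minimal Python sketch (Python is used here only as a calculator and is not part of the original STATA workflow; the function name `independence_fit` is an illustrative choice). It uses unrounded marginal probabilities, which is why the results agree with the fitted values reported for this example rather than with products of the rounded probabilities.

```python
# Observed helmet counts from table 2: rows = sex (male, female),
# columns = helmet (yes, no).
observed = [[1862, 17473],
            [602, 5521]]

def independence_fit(table):
    """Return expected cell counts n++ * pi_i+ * pi_+j for a two-way table."""
    n_total = sum(sum(row) for row in table)
    row_prob = [sum(row) / n_total for row in table]        # pi_i+
    col_prob = [sum(col) / n_total for col in zip(*table)]  # pi_+j
    return [[n_total * r * c for c in col_prob] for r in row_prob]

fitted = independence_fit(observed)
for label, row in zip(("male", "female"), fitted):
    print(label, [round(mu, 1) for mu in row])
```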
Table 2 (data with fitted values under the independence model)
          Helmet (observed)            Fitted value           Log-fitted value
Sex       Yes      No       Total      Yes        No          Yes     No
Male      1,862    17,473   19,335     1,871.4    17,463.6    7.5     9.8
Female      602     5,521    6,123       592.6     5,530.4    6.4     8.6
Total     2,464    22,994   25,458     2,464      22,994
If we assume that the response variable in this example is wearing a helmet, which is a binary
response (yes or no), the log odds of wearing a helmet versus not wearing a helmet can be
derived as:
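\[
\begin{aligned}
\mathrm{logit}(\hat\pi_i) &= \log\frac{\hat\pi_i}{1 - \hat\pi_i}
   = \log\frac{\hat\mu_{i1}}{\hat\mu_{i2}}
   = \log\hat\mu_{i1} - \log\hat\mu_{i2} \\
  &= (\lambda + \lambda_i^X + \lambda_1^Y) - (\lambda + \lambda_i^X + \lambda_2^Y) \\
  &= \lambda_1^Y - \lambda_2^Y
\end{aligned}
\]
where $\hat\pi_i$ is the estimated probability of response Y = 1 in row i.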
That is, the logit of the response variable Y does not depend on the level of X. The log-linear
model is thus the basis of the logit model with the logit link, in which one variable is
pre-specified to be the binary response variable.
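As a quick numerical illustration of this result, the sketch below computes the log odds of wearing a helmet in each sex from the fitted counts of the independence model reported in this section. It is only a hand check in Python, not part of the STATA analysis; the dictionary layout is an illustrative choice.

```python
import math

# Fitted counts under the independence model, as reported in this section:
# sex -> (helmet yes, helmet no).
fitted = {"male": (1871.374, 17463.626), "female": (592.626, 5530.374)}

# Log odds of wearing a helmet within each sex.
logits = {sex: math.log(yes / no) for sex, (yes, no) in fitted.items()}
for sex, value in logits.items():
    print(sex, round(value, 4))
```

Both rows give the same logit (about -2.233), which is the log of the marginal odds 2,464/22,994.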
The log-linear model can be fitted using STATA. The concept of fitting the model is similar to
ANOVA, in which one category of each variable serves as the reference group (here the last
level of X and of Y), as follows:
. tabi 1862 17473 \ 602 5521, exact replace
| col
row | 1 2 | Total
-----------+----------------------+----------
1 | 1,862 17,473 | 19,335
2 | 602 5,521 | 6,123
-----------+----------------------+----------
Total | 2,464 22,994 | 25,458
Fisher's exact = 0.638
1-sided Fisher's exact = 0.329
. ren row sex
. ren col hel
Independent model
loglin pop sex hel, fit(sex, hel) anova
Variable sex = A
Variable hel = B
Margins fit: sex, hel
Note: Anova-like constraints are assumed. The last level of each
variable (and all interactions with it) will be dropped from
estimation. The variable codings are constrained to sum to zero,
so the last level will equal -1 times the sum of the other levels.
Poisson regression Number of obs = 4
LR chi2(2) = 26306.15
Prob > chi2 = 0.0000
Log likelihood = -19.940884 Pseudo R2 = 0.9985
------------------------------------------------------------------------------
pop | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
A1 | .5749324 .0073321 78.41 0.000 .5605617 .589303
B1 | -1.116724 .0105987 -105.36 0.000 -1.137497 -1.09595
_cons | 8.076219 .0112611 717.18 0.000 8.054148 8.098291
------------------------------------------------------------------------------
\[
\log \hat\mu_{ij} = 8.08 + 0.575\,S_i - 1.117\,H_j
\]
where, under the sum-to-zero coding, $S_i$ = +1 for males and -1 for females, and $H_j$ = +1 for
wearing a helmet and -1 for not wearing a helmet.
*Estimate expected (or fitted) numbers of m11 (male, wearing helmet)
disp _b[_cons]+_b[A1]+_b[B1]
disp exp(_b[_cons]+_b[A1]+_b[B1]) /*exp(m11)*/
1871.374
*Estimate expected numbers of m12 (male, not wearing helmet)
disp _b[_cons]+_b[A1]-_b[B1]
disp exp(_b[_cons]+_b[A1]-_b[B1]) /*exp(m12)*/
17463.626
*Estimate expected numbers of m21 (female, wearing helmet)
disp _b[_cons]-_b[A1]+_b[B1]
disp exp(_b[_cons]-_b[A1]+_b[B1]) /*exp(m21)*/
592.625
*Estimate expected numbers of m22 (female, not wearing helmet)
disp _b[_cons]-_b[A1]-_b[B1]
disp exp(_b[_cons]-_b[A1]-_b[B1]) /*exp(m22)*/
5530.374
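The same arithmetic can be reproduced outside STATA. The sketch below copies the three coefficients from the loglin output and applies the sum-to-zero codes (+1 for the first level, -1 for the last level); the function name `fitted_count` is an illustrative choice, and small differences in the last digits come from the rounding of the printed coefficients.

```python
import math

# Coefficients copied from the loglin output (effect coding, last level = -1).
cons, a1, b1 = 8.076219, 0.5749324, -1.116724

def fitted_count(sex_code, hel_code):
    """Expected count for effect codes: +1 = first level, -1 = last level."""
    return math.exp(cons + a1 * sex_code + b1 * hel_code)

cells = {
    "male, helmet": fitted_count(+1, +1),
    "male, no helmet": fitted_count(+1, -1),
    "female, helmet": fitted_count(-1, +1),
    "female, no helmet": fitted_count(-1, -1),
}
for name, mu in cells.items():
    print(f"{name}: {mu:.1f}")
```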
2. POISSON REGRESSION
A Poisson regression model is a log-linear model that can be fitted either without specifying a
response variable or with a specified outcome variable, in which case the outcome is a count
(discrete event) that follows a Poisson distribution. If the outcome is not specified, the
equation is the same as the log-linear model described above. The helmet example fitted by
Poisson regression gives the output below:
poisson pop ib(2).sex ib(2).hel, nolog
Poisson regression Number of obs = 4
LR chi2(2) = 26306.15
Prob > chi2 = 0.0000
Log likelihood = -19.940884 Pseudo R2 = 0.9985
------------------------------------------------------------------------------
pop | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.sex | 1.149865 .0146642 78.41 0.000 1.121123 1.178606
1.hel | -2.233447 .0211975 -105.36 0.000 -2.274994 -2.191901
_cons | 8.618011 .0129433 665.83 0.000 8.592642 8.643379
------------------------------------------------------------------------------
We can use the post-estimation command ‘predict’ or ‘lincom’ to estimate the expected numbers
for each sex-helmet combination as follows:
predict exp_pop
(option n assumed; predicted number of events)
. list
+------------------------------+
| sex hel pop exp_pop |
|------------------------------|
1. | 1 1 1862 1871.374 |
2. | 1 2 17473 17463.63 |
3. | 2 1 602 592.626 |
4. | 2 2 5521 5530.374 |
+------------------------------+
Estimation using lincom
notes: exp(male & wearing helmet )
. lincom _cons+ 1.sex+1.hel, eform
( 1) [pop]1.sex + [pop]1.hel + [pop]_cons = 0
------------------------------------------------------------------------------
pop | exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 1871.374 38.2733 368.40 0.000 1797.843 1947.912
------------------------------------------------------------------------------
. notes: male & no-hel
. lincom _cons+ 1.sex , eform
( 1) [pop]1.sex + [pop]_cons = 0
------------------------------------------------------------------------------
pop | exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 17463.63 130.6028 1306.12 0.000 17209.52 17721.49
. notes: female & hel
. lincom _cons +1.hel, eform
( 1) [pop]1.hel + [pop]_cons = 0
------------------------------------------------------------------------------
pop | exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 592.626 13.64176 277.36 0.000 566.4828 619.9757
------------------------------------------------------------------------------
. notes: female & non-hel
. lincom _cons , eform
( 1) [pop]_cons = 0
------------------------------------------------------------------------------
pop | exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 5530.374 71.58104 665.83 0.000 5391.842 5672.465
------------------------------------------------------------------------------
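The loglin output (sum-to-zero coding) and the poisson output (reference-group coding, with the last level as reference) describe the same fitted model. A short hand check, assuming only the printed coefficients: with two levels coded +1 and -1, the reference-coded coefficient is twice the effect-coded one, and the reference-coded constant is the effect-coded linear predictor of the reference cell (female, no helmet).

```python
# Effect-coded coefficients from the loglin output.
cons_eff, a1, b1 = 8.076219, 0.5749324, -1.116724

# Reference coding (last level = reference), as used by
# "poisson pop ib(2).sex ib(2).hel".
sex1 = 2 * a1                  # level 1 vs level 2 of sex: (+a1) - (-a1)
hel1 = 2 * b1                  # level 1 vs level 2 of hel: (+b1) - (-b1)
cons_ref = cons_eff - a1 - b1  # linear predictor of the cell coded (-1, -1)

print(round(sex1, 6), round(hel1, 6), round(cons_ref, 6))
```

The three values agree with the 1.sex, 1.hel, and _cons coefficients of the poisson output up to rounding.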
To assess the association between sex and helmet, an interaction between the two variables must
be fitted in the Poisson model as follows:
**estimate LL for the independent model
. qui poisson pop ib(2).sex ib(2).hel
. estimates store A
. poisson pop ib(2).sex##ib(2).hel, nolog
Poisson regression Number of obs = 4
LR chi2(3) = 26306.37
Prob > chi2 = 0.0000
Log likelihood = -19.833152 Pseudo R2 = 0.9985
------------------------------------------------------------------------------
pop | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.sex | 1.152098 .0154388 74.62 0.000 1.121838 1.182357
1.hel | -2.216057 .0429215 -51.63 0.000 -2.300181 -2.131932
|
sex#hel |
1 1 | -.0229488 .0493614 -0.46 0.642 -.1196953 .0737977
|
_cons | 8.616314 .0134583 640.22 0.000 8.589936 8.642692
------------------------------------------------------------------------------
. estimates store B
. lrtest B A
Likelihood-ratio test LR chi2(1) = 0.22
(Assumption: A nested in B) Prob > chi2 = 0.6425
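The likelihood-ratio statistic reported by lrtest can be reproduced from the two log likelihoods printed above. The sketch below is a hand check in Python; it uses the fact that, for 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(x/2)).

```python
import math

ll_independent = -19.940884  # model A: sex + hel
ll_saturated = -19.833152    # model B: sex + hel + sex#hel

# LR statistic: twice the difference in log likelihoods.
lr_chi2 = 2 * (ll_saturated - ll_independent)

# Upper tail of a chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr_chi2 / 2))

print(round(lr_chi2, 2), round(p_value, 4))
```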
The coefficient of sex#hel is -0.023 (p = 0.642), which suggests no association between the
two variables. These results are similar to those of the logit model shown below.
char hel[omit] 2
. xi: logit i.hel i.sex [freq=pop]
i.hel _Ihel_1-2 (naturally coded; _Ihel_2 omitted)
i.sex _Isex_1-2 (naturally coded; _Isex_1 omitted)
Logistic regression Number of obs = 25458
LR chi2(1) = 0.22
Prob > chi2 = 0.6425
Log likelihood = -8094.6473 Pseudo R2 = 0.0000
------------------------------------------------------------------------------
_Ihel_1 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Isex_2 | .0229488 .0493614 0.46 0.642 -.0737977 .1196953
_cons | -2.239006 .024378 -91.85 0.000 -2.286786 -2.191226
------------------------------------------------------------------------------
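Both the interaction coefficient of the saturated Poisson model and the logit coefficient are the log odds ratio of the 2x2 table (with opposite signs, because the two models use different reference groups for sex). This can be verified directly from the observed counts; the standard error formula used below is the usual large-sample (Woolf) formula for a log odds ratio.

```python
import math

# Observed counts: male/helmet, male/no helmet, female/helmet, female/no helmet.
n11, n12, n21, n22 = 1862, 17473, 602, 5521

log_or = math.log((n11 * n22) / (n12 * n21))
se_log_or = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

print(round(log_or, 7), round(se_log_or, 7))
```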
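Recall the log-linear equation
\[
\mu_{ij} = \mu\,\alpha_i\,\beta_j .
\]
If one variable on the right side of the log-linear equation (e.g., $\alpha_i$) is assigned as
the outcome and thus moved to the left side, the equation becomes the Poisson equation as follows.
Let $\ln y_{ij}$ denote the log of the expected count $E(Y_{ij})$, $i = 1, 2, 3, \ldots, k$;
$j = 0, 1$, where
\[
\ln y_{ij} = \beta_0 + \sum_i \beta_i X_i + e_{ij}
\]
that is,
\[
\ln y_{ij} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + e_{ij}
\]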
Here log(yij) is the log of the number of outcome events and xi are the independent variables.
Poisson regression suits outcomes that are quite rare and/or populations that are very large
compared with the number of events of interest. Examples are the number of new cases of lung
cancer in the general population of a country, the number of rabies infections per year, the
number of deaths from head injuries per year, and the number of myocardial infarctions within 5
years among patients with diabetes. The link function of Poisson regression is the log link.
Because the response is a discrete quantitative variable, one might think of modeling it by
ordinary least squares (e.g., linear regression). However, least squares does not work well with
this type of data because:
- The data are skewed, whereas least squares inference assumes normally distributed errors
- The variance of a Poisson variable increases with its mean, so it usually changes with X,
whereas least squares requires a constant variance across the values of X
- The data are non-negative integers (Yij ≥ 0), but a linear regression fitted by the least
squares method will sometimes give a negative prediction!
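The last point can be seen with a small made-up example. The counts below are hypothetical (they are not data from this handout), and the helper `least_squares_line` is an illustrative name; the sketch simply fits a straight line by ordinary least squares and prints the fitted values.

```python
# Hypothetical yearly counts of a rare event that declines over time.
x = [0, 1, 2, 3, 4, 5, 6]
y = [20, 9, 4, 2, 1, 0, 0]

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = least_squares_line(x, y)
predictions = [intercept + slope * a for a in x]
print([round(p, 2) for p in predictions])
```

The fitted line predicts negative counts for the last two time points, which is impossible for count data; a model with a log link always predicts positive means.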
Poisson regression does not suffer from the problems described above for linear regression. The
data required for analysis can be summary or raw data. The layout for summary data consists of
the total number of subjects and the number of events for each level of the group variable, as
displayed below:
Table 3. Number of deaths from trauma by age groups
Age groups    Population (N)    No. of Deaths
≤ 24              15,860            1,046
25 – 34            8,079              838
35 – 44            5,778              541
45 – 54            3,147              357
55 – 64            1,825              221
65 – 74              907              147
75 – 84              230               44
≥ 85                  32                6
For these data, the outcome variable is death and the independent variable is age group. The
mortality rate for each age group is estimated as:
\[
I_i = \frac{n_i}{N_i}, \qquad
I_{age \le 24} = \frac{1{,}046}{15{,}860} = 0.066 = 66/1{,}000/\text{year}
\]
The incidence rate is a cumulative incidence if the number of persons in the population is
observed, or an incidence density if person-time is observed. Its distribution is assumed to be
Poisson if deaths occur independently and randomly over time and the number of deaths is very
small compared with the whole population. The Poisson distribution is as follows:
\[
P(Y; \lambda) = \frac{\lambda^{Y} e^{-\lambda}}{Y!}, \qquad Y = 0, 1, 2, 3, \ldots, \infty
\]
\[
E(Y) = \lambda, \qquad Var(Y) = \lambda
\]
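The equality of the mean and the variance can be checked numerically from the probability function. In the sketch below, the value λ = 3.5 is an arbitrary illustration and the support is truncated at 60, beyond which the probabilities are negligible for this λ.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

lam = 3.5           # arbitrary illustrative mean
support = range(60)  # truncated support; the remaining tail is negligible

mean = sum(y * poisson_pmf(y, lam) for y in support)
variance = sum((y - mean) ** 2 * poisson_pmf(y, lam) for y in support)

print(round(mean, 6), round(variance, 6))
```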
Theoretically, a Poisson variable can take any non-negative integer value, and the probability
of Y changes if the mean λ changes. The special characteristic of the Poisson distribution is
that its mean equals its variance (i.e., λ). For the incidence rate above, we can estimate its
variance as:
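\[
Var(I) = \frac{I}{N}, \qquad SE = \sqrt{\frac{I}{N}}
\]
\[
SE_{age \le 24} = \sqrt{\frac{0.066}{15{,}860}} = 0.002 = 2/1{,}000/\text{year}
\]
Thus
\[
95\%\,CI = \hat\lambda \pm Z_{\alpha/2}\,SE, \quad \text{with } \hat\lambda = I_{age \le 24}:
\qquad 66 \pm 1.96 \times 2 \;/1{,}000/\text{year} = (62, 70)/1{,}000/\text{year}
\]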
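The same calculation can be applied to every age group in table 3. The sketch below is a hand check in Python; the function name `rate_ci` is an illustrative choice and the interval is the normal-approximation interval described above.

```python
import math

# Table 3: (age group, population N, deaths).
table3 = [
    ("<=24", 15860, 1046), ("25-34", 8079, 838), ("35-44", 5778, 541),
    ("45-54", 3147, 357), ("55-64", 1825, 221), ("65-74", 907, 147),
    ("75-84", 230, 44), (">=85", 32, 6),
]

def rate_ci(deaths, population, z=1.96):
    """Rate, SE = sqrt(rate / N), and normal-approximation 95% CI."""
    rate = deaths / population
    se = math.sqrt(rate / population)
    return rate, se, rate - z * se, rate + z * se

for group, population, deaths in table3:
    rate, se, low, high = rate_ci(deaths, population)
    print(f"{group:>6}: {1000 * rate:6.1f} per 1,000 "
          f"(95% CI {1000 * low:.1f}, {1000 * high:.1f})")
```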
If death is assigned as the outcome of interest, which is a countable and rare event, it can be
modeled with other independent variables which may affect or influence the number of events of
interest, similar to other regression models, as follows:
\[
\ln y_{ij} = b_0 + \sum_i b_i\,agegr_i + e_{ij}
\]
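The interpretation of $b_i$ is described below, with age ≤ 24 assigned as the reference group.
\[
\begin{aligned}
\ln y_{age \le 24} &= b_0 \\
\ln y_{age\,25-34} &= b_0 + b_1 \\
\ln y_{age\,25-34} - \ln y_{age \le 24} &= b_0 + b_1 - b_0 = b_1 \\
\ln\frac{y_{age\,25-34}}{y_{age \le 24}} &= b_1 \\
\frac{y_{age\,25-34}}{y_{age \le 24}} &= e^{b_1} = RR_1
\end{aligned}
\]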
That is, using a Poisson regression model we can estimate the relative rate (RR) for any age
group (or other exposure group) compared with the reference group as the exponential of the
corresponding coefficient. Fitting the Poisson equation in STATA is straightforward if the data
are prepared in the format that STATA requires, which is summary data of the number of deaths
(outcome) and the total number of subjects (N) by group variable (i.e., age group in this
example). The data layout consists of at least 3 variables, agegr, death, and N, in which agegr
identifies the group, and death and N are the number of deaths and the population size (or
person-time if observed) in the corresponding age group. This format is known as summary data.
If only raw or individual patient data are available, we can fit the equation directly, as for
other equations (e.g., logistic and Cox equations). Alternatively, raw data can be transformed
to the summary format using the “contract” command in STATA as below. Summary data require
much less space and less computing time when the model is fitted.
use "raw data for table 3.dta"
gen agegr = recode(age, 24, 34, 44, 54, 64, 74, 84, 85)
lab define agegr 24 "<=24" 34 "25-34" 44 "35-44" 54 "45-54" 64 "55-64" 74 "65-74" 84 "75-84" 85 ">=85"
lab value agegr agegr
contract agegr death
drop if death ==.
egen N = sum(_freq), by(agegr)
keep if death ==1
drop death
ren _freq death
list
xi: poisson death i.agegr , exposure(N)
i.agegr _Iagegr_24-85 (naturally coded; _Iagegr_24 omitted)
Poisson regression Number of obs = 8
LR chi2(7) = 229.09
Prob > chi2 = 0.0000
Log likelihood = -28.278512 Pseudo R2 = 0.8020
------------------------------------------------------------------------------
death | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iagegr_34 | .4528216 .046361 9.77 0.000 .3619557 .5436874
_Iagegr_44 | .3504332 .0529571 6.62 0.000 .2466393 .4542272
_Iagegr_54 | .5423577 .0612955 8.85 0.000 .4222207 .6624948
_Iagegr_64 | .6076543 .0740332 8.21 0.000 .4625519 .7527566
_Iagegr_74 | .899117 .0880837 10.21 0.000 .726476 1.071758
_Iagegr_84 | 1.064937 .1538938 6.92 0.000 .7633109 1.366563
_Iagegr_85 | 1.04485 .4094175 2.55 0.011 .2424069 1.847294
_cons | -2.718827 .0309196 -87.93 0.000 -2.779428 -2.658226
N | (exposure)
------------------------------------------------------------------------------
The estimated coefficients (βs) suggest that the risk of death tends to increase with age. The
estimated β for age 25-34 is 0.45 and the RR = exp(0.45) = 1.57, i.e., subjects aged 25-34
years have about 1.6 times the risk of death of subjects aged ≤ 24 years. STATA reports the RR
directly if the option “irr” is added to the poisson command.
poisson, irr
Poisson regression Number of obs = 8
LR chi2(7) = 229.09
Prob > chi2 = 0.0000
Log likelihood = -28.278512 Pseudo R2 = 0.8020
------------------------------------------------------------------------------
death | IRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iagegr_34 | 1.572744 .0729139 9.77 0.000 1.436135 1.722346
_Iagegr_44 | 1.419682 .0751822 6.62 0.000 1.279717 1.574956
_Iagegr_54 | 1.720058 .1054319 8.85 0.000 1.525345 1.939625
_Iagegr_64 | 1.836119 .1359337 8.21 0.000 1.588122 2.122844
_Iagegr_74 | 2.457432 .2164598 10.21 0.000 2.067781 2.920509
_Iagegr_84 | 2.900657 .446393 6.92 0.000 2.145368 3.92185
_Iagegr_85 | 2.842973 1.163963 2.55 0.011 1.274313 6.342633
N | (exposure)
------------------------------------------------------------------------------
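Because age group is the only covariate, the model is saturated and each IRR equals the crude rate ratio from table 3. The sketch below is a hand check in Python (not part of the STATA session); it also computes the standard error of the log rate ratio as sqrt(1/d1 + 1/d0), where d1 and d0 are the deaths in the compared and reference groups, which is the large-sample formula that applies in this saturated case.

```python
import math

# Table 3: age group -> (population N, deaths); "<=24" is the reference group.
table3 = {
    "<=24": (15860, 1046), "25-34": (8079, 838), "35-44": (5778, 541),
    "45-54": (3147, 357), "55-64": (1825, 221), "65-74": (907, 147),
    "75-84": (230, 44), ">=85": (32, 6),
}

ref_pop, ref_deaths = table3["<=24"]
results = {}
for group, (pop, deaths) in table3.items():
    if group == "<=24":
        continue
    rate_ratio = (deaths / pop) / (ref_deaths / ref_pop)
    se_log_rr = math.sqrt(1 / deaths + 1 / ref_deaths)
    results[group] = (rate_ratio, math.log(rate_ratio), se_log_rr)
    print(f"{group}: RR = {rate_ratio:.4f}, ln(RR) = {math.log(rate_ratio):.4f}, "
          f"SE = {se_log_rr:.4f}")
```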
In the case where data is already summarized as displayed in table 3, data can be input into
STATA directly as seen below:
tabi 1046 15860 \ 838 8079\ 541 5778 \ 357 3147 \ 221 1825 \ 147 907 \ 44 230 \ 6 32, replace
contract row col [freq=pop]
list
egen N=sum(_freq), by(row)
list
drop if col ==2
drop col
list
ren row agegr
ren _freq death
ren N pop
list
lab define agegr 1"<=24" 2"25-34" 3"35-44" 4"45-54" 5"55-64" 6"65-74" 7"75-84" 8">=85", modify
lab value agegr agegr
list agegr death pop
gen ir = death/pop
list
poisson death i.agegr, exp(pop) nolog
Poisson regression Number of obs = 8
LR chi2(7) = 191.43
Prob > chi2 = 0.0000
Log likelihood = -28.278512 Pseudo R2 = 0.7719
------------------------------------------------------------------------------
death | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
agegr |
34 | .4179985 .046361 9.02 0.000 .3271326 .5088643
44 | .3247983 .0529571 6.13 0.000 .2210043 .4285922
54 | .4987706 .0612955 8.14 0.000 .3786335 .6189076
64 | .557216 .0740332 7.53 0.000 .4121136 .7023183
74 | .8127801 .0880837 9.23 0.000 .6401391 .985421
84 | .9537568 .1538938 6.20 0.000 .6521305 1.255383
85 | .9368685 .4094175 2.29 0.022 .134425 1.739312
|
_cons | -2.782695 .0309196 -90.00 0.000 -2.843297 -2.722094
ln(N) | 1 (exposure)
------------------------------------------------------------------------------
margins agegr, predict(ir)
Adjusted predictions Number of obs = 8
Model VCE : OIM
Expression : Predicted incidence rate, predict(ir)
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
agegr |
24 | .0618715 .001913 32.34 0.000 .058122 .065621
34 | .0939778 .0032464 28.95 0.000 .087615 .1003406
44 | .0856148 .0036809 23.26 0.000 .0784004 .0928292
54 | .1018836 .0053922 18.89 0.000 .0913149 .1124522
64 | .1080156 .0072659 14.87 0.000 .0937747 .1222566
74 | .1394687 .0115032 12.12 0.000 .1169229 .1620145
84 | .1605839 .0242089 6.63 0.000 .1131353 .2080326
85 | .1578947 .0644603 2.45 0.014 .031555 .2842345
------------------------------------------------------------------------------
* ir already holds the observed rate, so the predicted rate gets a new name
predict ir_hat, ir
table agegr, c(mean ir_hat)
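The estimates from this second run differ slightly from those obtained from the raw data (for example 0.418 instead of 0.453 for age 25-34). A hand check suggests why: the second column given to tabi is the population N, so the total formed by egen is deaths + N rather than N. The sketch below only illustrates that arithmetic; entering N minus deaths as the second column would be one way to obtain denominators that match table 3.

```python
import math

# Age <= 24 and age 25-34 from table 3: (deaths, population N).
deaths_ref, pop_ref = 1046, 15860
deaths_34, pop_34 = 838, 8079

# Rate with the table 3 denominator (N).
rate_table3 = deaths_ref / pop_ref

# Rate with the denominator built in the second run (deaths + N).
rate_second_run = deaths_ref / (deaths_ref + pop_ref)

# Log rate ratio for age 25-34 vs <= 24 under each denominator.
log_rr_table3 = math.log((deaths_34 / pop_34) / rate_table3)
log_rr_second_run = math.log((deaths_34 / (deaths_34 + pop_34)) / rate_second_run)

print(round(rate_table3, 4), round(rate_second_run, 7))
print(round(log_rr_table3, 4), round(log_rr_second_run, 4))
```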
3. ESTIMATION OF RISK RATIO FOR FOLLOW UP STUDY
For a randomized controlled trial (RCT) or a cohort study, the treatment or exposure effect can
be estimated using either the odds ratio (OR) or the risk ratio (RR). The estimated OR is close
to the estimated RR if the event or disease of interest is rare. However, the estimated OR is
further from 1 than the estimated RR if the event or disease is common, and the treatment or
exposure effect is then overestimated. The analysis often needs to adjust for confounding
factors in a cohort study, or in an RCT in which randomization did not balance the groups.
Multiple logistic regression is often used for this purpose. To estimate the treatment effect
more accurately, several methods have been proposed (see appendix I), but only three are covered
in this module. The first method is Poisson regression, which is a generalized linear model with
a log link function and a Poisson distribution. The second method is a generalized linear model
with a log link but a binomial (not Poisson) distribution, which is known as log-binomial or
binomial log-linear regression. The last method fits a logit equation, estimates the marginal
probability of the event for each group of a variable, and then takes the ratio of the marginal
probabilities. The data in table 4 come from a cohort study of breast cancer (Appendix 1). The
investigators would like to estimate the risk of death by stage and estrogen receptor level, but
the effect of stage on death might be confounded by estrogen receptor level, or vice versa.
Table 4. Frequency of death by estrogen receptor and staging
Estrogen receptor level    Stage    Total    Death
Low                        1        12       2
High                       1        55       5
Low                        2        22       9
High                       2        74       17
Low                        3        14       12
High                       3        15       9
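Before fitting any model, the crude (unadjusted) risks in table 4 already show how far the OR can drift from the RR when the event is common. The sketch below collapses table 4 over stage and compares low with high estrogen receptor level; it is a hand check in Python and makes no adjustment for stage.

```python
# Table 4 rows: (estrogen receptor level, stage, total, deaths).
table4 = [
    ("low", 1, 12, 2), ("high", 1, 55, 5),
    ("low", 2, 22, 9), ("high", 2, 74, 17),
    ("low", 3, 14, 12), ("high", 3, 15, 9),
]

def collapse(level):
    """Total subjects and deaths for one receptor level, summed over stage."""
    total = sum(n for er, _, n, _ in table4 if er == level)
    deaths = sum(d for er, _, _, d in table4 if er == level)
    return total, deaths

n_low, d_low = collapse("low")
n_high, d_high = collapse("high")

overall_risk = (d_low + d_high) / (n_low + n_high)
risk_low, risk_high = d_low / n_low, d_high / n_high
crude_rr = risk_low / risk_high
crude_or = (d_low / (n_low - d_low)) / (d_high / (n_high - d_high))

print(round(overall_risk, 5), round(crude_rr, 3), round(crude_or, 3))
```

With an overall risk of death of about 28%, the crude OR (about 3.35) is clearly larger than the crude RR (about 2.23).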
3.1 Using Poisson regression
We use the breast cancer data provided by the STATA Journal for illustration. Analysis using
Poisson regression can be done as follows:
use "Breast cancer IPD.dta"
*estimate death rate and its CI by Poisson & binary distribution
ci death , poisson
-- Poisson Exact --
Variable | Exposure Mean Std. Err. [95% Conf. Interval]
-------------+---------------------------------------------------------------
death | 192 .28125 .0382733 .2112837 .3669702
. ci death /*binomial distribution, compute standard normal CI by default*/
Variable | Obs Mean Std. Err. [95% Conf. Interval]
-------------+---------------------------------------------------------------
death | 192 .28125 .0325326 .2170807 .3454193
poisson death i.stage i.er2, irr nolog /*report irr rather than log(rate)*/
Poisson regression Number of obs = 192
LR chi2(3) = 26.71
Prob > chi2 = 0.0000
Log likelihood = -109.14601 Pseudo R2 = 0.1090
------------------------------------------------------------------------------
death | IRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stage |
2 | 2.520742 1.074375 2.17 0.030 1.093288 5.811955
3 | 5.913372 2.645148 3.97 0.000 2.460814 14.20992
|
er2 |
low | 1.630775 .4688634 1.70 0.089 .9282513 2.864987
_cons | .0938724 .0361805 -6.14 0.000 .0441028 .1998065
------------------------------------------------------------------------------
However, the standard errors of these coefficients may be biased, because the outcome is binary
and the event of interest (death) is not rare (28%), so the Poisson variance assumption does not
hold. To relax this assumption, a robust variance estimator should be applied.
poisson death i.stage i.er2, irr vce(robust) nolog
Poisson regression Number of obs = 192
Wald chi2(3) = 53.61
Prob > chi2 = 0.0000
Log pseudolikelihood = -109.14601 Pseudo R2 = 0.1090
------------------------------------------------------------------------------
| Robust
death | IRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stage |
2 | 2.520742 .9937819 2.35 0.019 1.16399 5.458932
3 | 5.913372 2.28568 4.60 0.000 2.772187 12.61386
|
er2 |
low | 1.630775 .3480542 2.29 0.022 1.073305 2.477792
_cons | .0938724 .0338979 -6.55 0.000 .0462555 .1905077
------------------------------------------------------------------------------
3.2 Using binary regression
Another model that can be applied to data and a study design like this is binary regression,
in which the outcome variable has a binomial distribution and the log link function relates the
outcome to the independent variables. This can be performed in a few ways in STATA, for example
as below (Cummings 2011):
binreg death i.stage i.er2, nolog rr
Generalized linear models No. of obs = 192
Optimization : MQL Fisher scoring Residual df = 188
(IRLS EIM) Scale parameter = 1
Deviance = 185.8541388 (1/df) Deviance = .9885858
Pearson = 190.2164034 (1/df) Pearson = 1.011789
Variance function: V(u) = u*(1-u) [Bernoulli]
Link function : g(u) = ln(u) [Log]
BIC = -802.555
------------------------------------------------------------------------------
| EIM
death | Risk Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stage |
2 | 2.538158 .9976133 2.37 0.018 1.174781 5.483782
3 | 5.868047 2.259727 4.60 0.000 2.758699 12.48196
|
er2 |
low | 1.558326 .3067419 2.25 0.024 1.059515 2.291972
_cons | .0951663 .0342763 -6.53 0.000 .046979 .1927803
------------------------------------------------------------------------------
3.3 Using logit regression
Another way to model an outcome that is not too rare is to apply a logit equation. However,
this estimates the odds ratio rather than the risk ratio. If the study design is a cohort study
or a randomized controlled trial, we prefer to report the effect of a prognostic factor or
treatment intervention as a risk ratio rather than an odds ratio. In addition, the odds ratio
overestimates the effect if the outcome is common. After fitting the logit regression, we
therefore need to estimate marginal probabilities and use these estimates to calculate the risk
ratio as below:
logistic death ib(1).er2 i.stage , nolog
Logistic regression Number of obs = 192
LR chi2(3) = 42.27
Prob > chi2 = 0.0000
Log likelihood = -92.939847 Pseudo R2 = 0.1853
------------------------------------------------------------------------------
death | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
0.er2 | 2.508065 .9916923 2.33 0.020 1.155507 5.443836
|
stage |
2 | 3.109772 1.44851 2.44 0.015 1.248087 7.748406
3 | 18.8389 11.03231 5.01 0.000 5.978343 59.36498
|
_cons | .0937695 .0393847 -5.64 0.000 .0411665 .2135893
------------------------------------------------------------------------------
The odds ratios were 2.5 for estrogen receptor and as high as 18.8 for stage 3 versus stage 1.
Let’s estimate marginal effects for each factor:
margins er2, post coeflegend
Expression : Pr(death), predict()
------------------------------------------------------------------------------
| Margin Legend
-------------+----------------------------------------------------------------
er2 |
0 | .4008795 _b[0bn.er2]
1 | .2392455 _b[1.er2]
------------------------------------------------------------------------------
nlcom (lnrr: ln(_b[0bn.er2]/_b[1.er2])), post
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnrr | .5161706 .2177193 2.37 0.018 .0894487 .9428925
------------------------------------------------------------------------------
. display "risk ratio = " exp(_b[lnrr]) _skip(3) /*
> */ "95% CI = "exp(_b[lnrr]-invnormal(1-.025)*_se[lnrr]) /*
> */ "," exp(_b[lnrr]+invnormal(1-.025)*_se[lnrr])
risk ratio = 1.6755988 95% CI = 1.0935712,2.567397
The estimated risks of death were 0.24 and 0.40 for receptor positive and negative,
respectively. The ratio of these risks can be estimated using the ‘nlcom’ command after the
marginal estimation, or using the adjrr command as below:
adjrr er2
R1 = 0.4009 (0.0660) 95% CI (0.2716, 0.5302)
R0 = 0.2392 (0.0332) 95% CI (0.1742, 0.3043)
ARR = 1.6756 (0.3648) 95% CI (1.0936, 2.5674)
ARD = 0.1616 (0.0745) 95% CI (0.0155, 0.3077)
p-value (R0 = R1): 0.0301
p-value (ln(R1/R0) = 0): 0.0177
21
This yielded an RR of 1.68 (95% CI: 1.09, 2.57). As you can see, the estimated RR (1.68) and
OR (2.51) differ considerably because death is not a rare outcome in these data.
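The arithmetic behind the ‘nlcom’ step can be checked by hand. The following is a minimal Python sketch (Python is used here only as a calculator; it is not part of the Stata session) that takes the two marginal risks and the delta-method standard error of ln(RR) from the output above and rebuilds the risk ratio, its 95% CI, and the risk difference. The variable names are illustrative assumptions.

```python
import math
from statistics import NormalDist

# Marginal risks of death and the standard error of ln(RR),
# copied from the margins / nlcom output above.
risk_er_neg = 0.4008795   # er2 = 0 (receptor negative)
risk_er_pos = 0.2392455   # er2 = 1 (receptor positive)
se_ln_rr = 0.2177193      # delta-method SE reported by nlcom

# Work on the log scale, then exponentiate the limits.
ln_rr = math.log(risk_er_neg / risk_er_pos)
z = NormalDist().inv_cdf(0.975)          # same quantile as invnormal(1-.025)

rr = math.exp(ln_rr)
rr_low = math.exp(ln_rr - z * se_ln_rr)
rr_high = math.exp(ln_rr + z * se_ln_rr)
risk_diff = risk_er_neg - risk_er_pos    # absolute risk difference

print(f"RR = {rr:.4f} (95% CI {rr_low:.4f}, {rr_high:.4f}); RD = {risk_diff:.4f}")
```

Up to rounding, the printed values should agree with the ARR and ARD lines reported by adjrr.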
The estimated risk ratios and standard errors from binary log-linear regression (or from logit
regression with post-estimation margins) do not differ much from those of Poisson regression. The
Poisson approach, however, tends to give slightly higher estimates of both quantities than the
binary log-linear approach; in particular, the Poisson model without robust variance estimation
yields the largest standard errors. The most appropriate method for these data is binary
log-linear regression, followed by Poisson regression with robust variance estimation.
4. CAPTURE-RECAPTURE
The capture-recapture method is used to estimate the size of a missing or undercounted
population by merging two or more databases. For instance, a researcher would like to estimate the
rates of adolescent pregnancy and abortion across Thailand using a cross-sectional hospital-based
data registry of the Ministry of Public Health. However, rates estimated from the hospital
registry are likely to be underestimates for a number of reasons. Not all hospitals collaborate
with this project, and there may be incorrect coding in the databases of the collaborating
hospitals.
In addition, some pregnant adolescents might give birth or have an abortion at home or at a
clinic, so their names would not appear in the hospital databases. The hospital-based data on live
births may overlap with a few other data sources, e.g., the Civil Registration Database under the
Ministry of Interior, and the Health National Survey. The overlapping records are called
re-capture data, which can be used to estimate or predict the number of missing cases; see Chao
2001 and Hook 1995. Various approaches can be applied to estimate the missing number, but we will
focus on applying a log-linear model (or Poisson model). Details of the analysis are illustrated
using the example below.
Suppose that we would like to estimate the prevalence of Hepatitis C infection. There are 3
sources of data, i.e., A, B, and C; each variable is coded 1 if the subject is present in that
data registry and 0 if absent. Thus, the data form a 2x2x2 layout, as described in Table 5. The
missing number ‘h’, i.e., the number of subjects absent from all three sources, is the number that
we would like to estimate.
Table 5. Data layout for capture-recapture
                          Source C
Source A    Source B      +      -
   +           +          a      b
   +           -          c      d
   -           +          e      f
   -           -          g      h*
We can apply a log-linear model as described above, or a Poisson regression in which the cell
frequency is the dependent variable (there is no separate outcome variable), to estimate the
missing number ‘h’. We need to input the data for each cell in Table 5 into STATA as listed below:
list freq A B C

     +------------------+
     | freq   A   B   C |
     |------------------|
  1. |    5   1   1   1 |
  2. |    7   1   1   0 |
  3. |   15   1   0   1 |
  4. |   37   1   0   0 |
  5. |    3   0   1   1 |
     |------------------|
  6. |   14   0   1   0 |
  7. |   20   0   0   1 |
  8. |    .   0   0   0 |
     +------------------+

poisson freq i.A i.B i.C, nolog
Poisson regression Number of obs = 8
LR chi2(3) = 96.00
Prob > chi2 = 0.0000
Log likelihood = -18.717469 Pseudo R2 = 0.7194
------------------------------------------------------------------------------
freq | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.A | -.3184537 .1642822 -1.94 0.053 -.6404409 .0035334
1.B | -1.444889 .2064288 -7.00 0.000 -1.849481 -1.040296
1.C | -.9301478 .1800837 -5.17 0.000 -1.283105 -.5771903
_cons | 3.933108 .1245397 31.58 0.000 3.689015 4.177201
------------------------------------------------------------------------------
After fitting this Poisson model, we can use the post-estimation command ‘predict’ to predict the
expected number in the missing cell ‘h’, or we can use ‘lincom’ to estimate it from the constant
term. The results suggest that the missing number is 51, with a 95% CI from 30 to 88 subjects.
The estimated total number of Hepatitis C positive cases is 152 (51+101). Suppose that the
population size is 15,000; the estimated prevalence is then 10 (95% CI: 9, 12) per 1,000 subjects.
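To make the estimation of ‘h’ concrete, here is a minimal Python sketch (standard library only; not the Stata implementation) that fits the same main-effects Poisson model to the seven observed cells by Newton-Raphson and then takes exp(intercept) as the expected count for the cell with A = B = C = 0. The helper names `solve` and `score_and_info` are assumptions introduced for illustration.

```python
import math
from statistics import NormalDist

# Observed cells (A, B, C, freq); the (0, 0, 0) cell is unobserved.
cells = [(1, 1, 1, 5), (1, 1, 0, 7), (1, 0, 1, 15), (1, 0, 0, 37),
         (0, 1, 1, 3), (0, 1, 0, 14), (0, 0, 1, 20)]
X = [[1.0, a, b, c] for a, b, c, _ in cells]   # intercept + main effects
y = [freq for *_, freq in cells]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def score_and_info(beta):
    """Score vector and information matrix of the Poisson log-likelihood."""
    p = len(beta)
    mu = [math.exp(sum(bj * xj for bj, xj in zip(beta, row))) for row in X]
    score = [sum(row[j] * (yi - mi) for row, yi, mi in zip(X, y, mu))
             for j in range(p)]
    info = [[sum(row[j] * row[k] * mi for row, mi in zip(X, mu))
             for k in range(p)] for j in range(p)]
    return score, info

# Newton-Raphson, starting from an intercept-only model.
beta = [math.log(sum(y) / len(y)), 0.0, 0.0, 0.0]
for _ in range(100):
    score, info = score_and_info(beta)
    step = solve(info, score)
    beta = [b + s for b, s in zip(beta, step)]
    if max(abs(s) for s in step) < 1e-12:
        break

# Variance of the intercept = first diagonal element of the inverse information.
_, info = score_and_info(beta)
se_cons = math.sqrt(solve(info, [1.0, 0.0, 0.0, 0.0])[0])
z = NormalDist().inv_cdf(0.975)

h_hat = math.exp(beta[0])                  # expected count in the missing cell
h_low = math.exp(beta[0] - z * se_cons)
h_high = math.exp(beta[0] + z * se_cons)
total_hat = sum(y) + h_hat                 # observed 101 + estimated missing

print(f"h = {h_hat:.2f} (95% CI {h_low:.2f}, {h_high:.2f}); total = {total_hat:.1f}")
```

The confidence interval is built on the log scale and then exponentiated, which is why it is asymmetric around the point estimate.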
list A B C freq exp

     +-----------------------------+
     | A   B   C   freq   exp_freq |
     |-----------------------------|
  1. | 1   1   1      5   3.440009 |
  2. | 1   1   0      7   8.745244 |
  3. | 1   0   1     15   14.62778 |
  4. | 1   0   0     37   37.18697 |
  5. | 0   1   1      3   4.746958 |
     |-----------------------------|
  6. | 0   1   0     14   12.06779 |
  7. | 0   0   1     20   20.18525 |
  8. | 0   0   0      .   51.31526 |
     +-----------------------------+

lincom _b[_cons] , eform

 ( 1)  [freq]_cons = 0
------------------------------------------------------------------------------
        freq |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   51.31526   14.01225    14.42   0.000     30.04801    87.63496
------------------------------------------------------------------------------

cii 15000 152, poisson

                                                         -- Poisson Exact --
    Variable |   Exposure        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
             |      15000    .0101333    .0008219        .0085864    .0118784

Estimate using post-estimation command

predict exp_freq       /*expected number of events, default; which is exp(xb)!*/
predict xb, xb         /*ln(exp no.)*/
predict stdp, stdp     /*standard error of xb */
gen ll_xb = xb-1.96*stdp
gen ul_xb = xb+1.96*stdp
gen ll_exp_freq = exp(ll_xb)
gen ul_exp_freq = exp(ul_xb)
list freq exp_freq ll_exp_freq ul_exp_freq xb ll_xb ul_xb , compress

     +------------------------------------------------------------------------+
     | freq   exp_freq   ll_exp~q   ul_exp~q         xb      ll_xb      ul_xb |
     |------------------------------------------------------------------------|
  1. |    5   3.440009   1.834266   6.451441   1.235474   .6066446   1.864304 |
  2. |    7   8.745244   5.640689    13.5585    2.16851   1.730006   2.607014 |
  3. |   15   14.62778   9.879021   21.65922   2.682922   2.290413   3.075431 |
  4. |   37   37.18697   27.45409    50.3703   3.615958   3.312515   3.919402 |
  5. |    3   4.746958   2.976143   7.571413   1.557504   1.090628    2.02438 |
     |------------------------------------------------------------------------|
  6. |   14   12.06779   7.615479   19.12309    2.49054   2.030183   2.950897 |
  7. |   20   20.18525   13.68008   29.78377   3.004952   2.615941   3.393964 |
  8. |    .   51.31526   30.04772   87.63582   3.937988   3.402787    4.47319 |
     +------------------------------------------------------------------------+
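The exact Poisson interval reported by ‘cii’ can also be rebuilt from first principles. The sketch below is a plain-Python illustration, not Stata's own algorithm. It assumes the interval is the usual exact one, whose lower limit is the mean at which P(X >= 152) = 0.025 and whose upper limit is the mean at which P(X <= 152) = 0.025, and it finds both by bisection. The function names are hypothetical.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing the pmf term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)

def bisect(f, lo, hi, iters=200):
    """Root of a monotone function f on [lo, hi], assuming a sign change."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == (f_lo > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_poisson_ci(count, alpha=0.05):
    """Exact two-sided CI for a Poisson mean, given one observed count."""
    # Lower limit: upper-tail probability P(X >= count) equals alpha/2.
    lower = bisect(lambda lam: (1 - poisson_cdf(count - 1, lam)) - alpha / 2,
                   1e-9, count)
    # Upper limit: lower-tail probability P(X <= count) equals alpha/2.
    upper = bisect(lambda lam: poisson_cdf(count, lam) - alpha / 2,
                   count, count + 10 * math.sqrt(count) + 10)
    return lower, upper

cases, population = 152, 15000
low, high = exact_poisson_ci(cases)
rate = cases / population
rate_low, rate_high = low / population, high / population

print(f"prevalence per 1,000 = {1000 * rate:.1f} "
      f"(95% CI {1000 * rate_low:.1f}, {1000 * rate_high:.1f})")
```

Dividing the limits for the count by the population of 15,000 puts them on the same scale as the Mean column in the cii output.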
Assignment II (20%)
1. The table below contains the numbers of deaths and the total population by age group and
sex. The data concern deaths from accidents in 1999.
a) Estimate the death rate and 95% confidence interval by sex. Does the death rate differ
between males and females?
b) Estimate the death rate and 95% confidence interval by age group. Do the death rates differ
across age groups?
c) Fit a Poisson regression model with age group and sex. What does the fitted equation
look like?
d) Do age and sex, overall, affect the number of deaths? Perform the tests for age and sex.
e) Create a table reporting the results and interpret it.
                          Gender
Age group         Male              Female
              Deaths       N    Deaths       N
≤ 24             828  11,786       218   4,074
25 – 34          682   6,221       156   1,856
35 – 44          432   4,313       109   1,465
45 – 54          259   2,263        98     884
55 – 64          162   1,280        59     545
65 – 74          103     600        44     307
75 – 84           33     150        11      80
≥ 85               3      20         3      12
2. The data set is a registry of accidents during 2001-2002. Education programs were launched in
order to prevent car accidents, particularly during public or long holidays. We would like to know
whether the programs worked. (Use "Assignment II death by age sex year edu.dta".)
a) Plan your analysis and create a summary table that includes the number of deaths and
the total population by year, education program, age group, and sex.
b) Estimate the death rate for each education program. Do the death rates differ between
education programs?
c) Fit a Poisson regression including all variables in the model. Which variables
affect death? Demonstrate the tests.
d) Fit a binary log-linear regression including all variables in the model. Compare and
comment on the results of the two models, and decide which model is more
appropriate.
e) Write a paragraph of results, along with table(s), to report this analysis, then interpret
the results.