
Confidence Intervals About Discrete-Time Survival/Cumulative Incidence Estimates

Using the Delta Method

Alexis Dinno with Jong-Sung Kim

In presentation to Quantitative Interest Group, Brown Bag Lunch Series

School of Community Health

March 1, 2010 [email protected] (503) 725-3076

Let me say right up front… This talk is based on ongoing work I am doing under the tutelage of Professor Jong-Sung Kim. This presentation marks the first time I have attempted to communicate these ideas in depth to an audience, so feedback is welcome.

Event history analysis

Answering the questions of whether and when an event will happen in a population at risk. Also goes by names such as ‘survival analysis’ or ‘failure time model.’

Common concepts:
• study time
• beginning of time
• event occurrence
• right censoring
• baseline model

(Image courtesy of Wikimedia commons: http://commons.wikimedia.org/wiki/File:Km_plot.jpg)

A good reference for this material… Singer, J. D. and Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford University Press, New York, NY. Also covers Cox continuous time proportional hazards event history models, and multilevel growth curve modeling of continuous change over time.

Discrete time event history analysis 1: the basic model

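Time in discrete time event history analysis is, ahem, discrete. The basic model is:

$\operatorname{logit}(h_t) = \alpha_1 d_1 + \dots + \alpha_T d_T + \beta_1 X_1 + \dots + \beta_p X_p$

In the above there are T discrete time periods represented (by the period indicators d1, …, dT), and p predictor variables. The model expresses the conditional log-odds of event occurrence as a linear function of time t (specified by αt) and the predictors (Xs) multiplied by their slope parameters (βs).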

DTEHA 2: the data i

Discrete time event history models employ person-period data formats wherein observations in each discrete period are nested within individuals. Event occurrence is represented as a nominal variable coded as: no event = 0; and event = 1. Post-event observations are (right) censored. Right censoring is modeled by the absence of observations corresponding to the censored periods.

DTEHA 3: the data ii (the person-period data set)

ID  d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 female support Y
01   1  0  0  0  0  0  0  0  0   0   0   0      1       0 1
02   1  0  0  0  0  0  0  0  0   0   0   0      0       1 0
02   0  1  0  0  0  0  0  0  0   0   0   0      0       1 0
02   0  0  1  0  0  0  0  0  0   0   0   0      0       1 1
03   1  0  0  0  0  0  0  0  0   0   0   0      0       1 0
03   0  1  0  0  0  0  0  0  0   0   0   0      0       1 0
03   0  0  1  0  0  0  0  0  0   0   0   0      0       1 0
04   1  0  0  0  0  0  0  0  0   0   0   0      1       1 0
04   0  1  0  0  0  0  0  0  0   0   0   0      1       1 1

(Adapted from Singer, J. and Willett, J. (1993). It’s about time: Using discrete-time survival analysis to study duration and the timing of events. Journal of Educational and Behavioral Statistics, 18(2):155–195.)
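As a minimal sketch of how a person-period data set like this could be built from one record per person, the Python below expands a person's last observed period and event status into rows. The function name `to_person_period` and its arguments are hypothetical, chosen for illustration only; they are not part of any package mentioned in this talk.

```python
def to_person_period(person_id, last_period, event, n_periods=12, **covariates):
    """Expand one person-level record into person-period rows.

    last_period: the last period in which the person was observed.
    event: True if the event occurred in last_period, False if the
           person was right censored after it.
    """
    rows = []
    for t in range(1, last_period + 1):
        row = {"ID": person_id}
        # Period indicators d1..dT: exactly one equals 1 in each row.
        for k in range(1, n_periods + 1):
            row[f"d{k}"] = 1 if k == t else 0
        row.update(covariates)
        # Y is 1 only in the period in which the event occurs.
        row["Y"] = 1 if (event and t == last_period) else 0
        rows.append(row)
    return rows


# Person 02 has the event in period 3; person 03 is censored after period 3.
rows_02 = to_person_period("02", last_period=3, event=True, female=0, support=1)
rows_03 = to_person_period("03", last_period=3, event=False, female=0, support=1)
```

Censored periods simply produce no rows, which is how right censoring is represented in this format.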

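DTEHA 4: the conditional hazard function

The discrete time hazard function ht describes the risk of event occurrence at time t, conditional on the predictors, for a randomly selected individual who has not experienced an event before time t. The estimated discrete time conditional hazard is thus:

$\hat{h}_t = \dfrac{e^{\hat{\alpha}_t + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p}}{1 + e^{\hat{\alpha}_t + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p}}$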
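The conditional hazard can be sketched in a few lines of Python as the inverse logit of the linear predictor. The parameter values below are made up for illustration, and the function name `hazard` is an assumption rather than anything from the software discussed later.

```python
import math


def hazard(alpha_t, betas=(), xs=()):
    """Discrete-time hazard: inverse logit of alpha_t + sum(beta_j * x_j)."""
    eta = alpha_t + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical estimates: alpha_t = -2.0 and a slope of 0.5 for one predictor.
h_baseline = hazard(-2.0)                       # all predictors at zero
h_exposed = hazard(-2.0, betas=(0.5,), xs=(1,))  # predictor equal to one
```

With a positive slope, the hazard for the group with the predictor equal to one is higher than the baseline hazard in the same period.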

[Figure: Baseline discrete time hazard of ever smoking vs. year]

DTEHA 5: the survival function

The survival function St describes the proportion of the study population for whom an event has not occurred by time t since the study’s beginning of time, conditional on the predictors. It is thus the product of the complements of the discrete time conditional hazards up to time t:

$S_t = \prod_{i=1}^{t} (1 - h_i)$

The estimated survival function at time t is thus:

$\hat{S}_t = \prod_{i=1}^{t} (1 - \hat{h}_i)$

[Figure: Baseline survival of ever smoking vs. year]

DTEHA 6: cumulative incidence*

We can reframe St in terms of cumulative incidence, which is the proportion of the study population for whom an event has occurred by time t since the study’s beginning of time, conditional on the predictors. Cumulative incidence is simply 1 – St, and we note that the variances of St and 1 – St are identical.

* Professor Kim points out that the term ‘cumulative incidence’ is sometimes used with a different meaning in competing risks event history models. The definition I use above is consistent with how epidemiologists conceive of risk: the proportion experiencing an event in a specific period of time.
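To make the relationship between hazard, survival and cumulative incidence concrete, here is a small Python sketch using made-up hazards for three periods. The name `survival_curve` is hypothetical.

```python
def survival_curve(hazards):
    """Return S_1..S_T, where S_t is the product of (1 - h_i) for i <= t."""
    curve, surv = [], 1.0
    for h in hazards:
        surv *= 1.0 - h
        curve.append(surv)
    return curve


hazards = [0.10, 0.20, 0.25]          # hypothetical period hazards
surv = survival_curve(hazards)         # approximately 0.90, 0.72, 0.54
cum_inc = [1.0 - s for s in surv]      # approximately 0.10, 0.28, 0.46
```

Because cumulative incidence is a constant minus survival, any standard error attached to one applies unchanged to the other.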

[Figure: Baseline cumulative incidence of ever smoking vs. year]

[Figure: Cumulative incidence of ever smoking by sex. Legend: males, females]

[Figure: Baseline cumulative incidence of ever smoking vs. age. Legend (age cohort): 12, 13, 14, 15, 16 and 17 in ‘97]

DTEHA 7: time-varying predictors

One advantage offered by discrete time event history models is the ability to permit the value of predictors to vary across time. For example, respondent educational attainment (grade in primary or secondary school, then degrees completed) might be expected to increase over time, particularly in a study population of adolescents followed for years.

ID  d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 female education Y
01   1  0  0  0  0  0  0  0  0   0   0   0      1        10 0
01   0  1  0  0  0  0  0  0  0   0   0   0      1        11 0
01   0  0  1  0  0  0  0  0  0   0   0   0      1        12 0
01   0  0  0  1  0  0  0  0  0   0   0   0      1        14 0
01   0  0  0  0  1  0  0  0  0   0   0   0      1        14 0
01   0  0  0  0  0  1  0  0  0   0   0   0      1        14 1

DTEHA 8: parsimonious representations of time

Another kind of flexibility is often termed parsimonious representation of time. Time in the DTEHA framework need not be modeled as discrete periods, but could be modeled with a constant effect of time (1), as a linear effect of time (2), as some polynomial function of time (3), or even as a hybrid incorporating continuous and discrete functions of time (4).

Specifications with fewer time parameters offer increased statistical power, an advantage that is especially realized when including interaction terms between predictors and time in order to explicitly allow the effects of variables to vary across time.
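One way to picture these alternative specifications is as different sets of time columns in the design matrix. The Python sketch below is illustrative only: the function name `time_terms`, the choice of a quadratic for the polynomial case, and the particular hybrid shown (a linear trend plus a separate indicator for the first period) are all assumptions, not the specifications used in this talk.

```python
def time_terms(t, n_periods, spec):
    """Return the time columns of the design matrix for period t."""
    if spec == "discrete":      # one indicator per period: n_periods parameters
        return [1 if k == t else 0 for k in range(1, n_periods + 1)]
    if spec == "constant":      # a single intercept: 1 parameter
        return [1]
    if spec == "linear":        # intercept plus slope on t: 2 parameters
        return [1, t]
    if spec == "quadratic":     # one possible polynomial: 3 parameters
        return [1, t, t ** 2]
    if spec == "hybrid":        # hypothetical mix: linear trend + first-period indicator
        return [1, t, 1 if t == 1 else 0]
    raise ValueError(f"unknown time specification: {spec}")


n_time_params = {s: len(time_terms(1, 12, s))
                 for s in ("discrete", "constant", "linear", "quadratic", "hybrid")}
```

With 12 periods the fully discrete specification spends 12 parameters on time, while the linear one spends 2, which is where the gain in power for time interactions comes from.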

The problem: no CIs for conditional hazard or survival!

To effectively interpret and communicate the results of DTEHA models we would like to be able to infer whether the differences between different groups are meaningful. It would be extremely useful to put confidence intervals about the survival/cumulative incidence function values, and to graph the corresponding boundaries of the survival or cumulative incidence curves. In particular, this would facilitate rapid ‘eyeball’ hypothesis testing over the duration of study time, between an arbitrary number of groups.

Caveat: the usual concerns about multiple comparisons still apply.




A naïve approach to CIs

Typically discrete time event history models are estimated using maximum likelihood techniques, such as those available in the logistic regression routines of major statistical software packages. These routines estimate the standard errors of the parameter estimates (i.e. the sβ terms), and confidence intervals (CIs) for the parameters could be normally approximated:

$\hat{\beta} \pm z\, s_{\beta}$

where z is the standard normal critical value for the chosen confidence level (e.g. 1.96 for 95%).

One approach would perform the same transform on the sβ terms as is used with the β terms to estimate ĥt, Ŝt or 1 – Ŝt, in order to obtain the $\sigma_{\hat{h}_t}$ or $\sigma_{\hat{S}_t}$ needed for CIs (remember $\sigma_{\hat{S}_t} = \sigma_{1-\hat{S}_t}$). However, this approach produces inaccurate results.

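Accurate normal approximations for CIs

In order to normally approximate confidence intervals for ĥt, Ŝt or 1 – Ŝt with any accuracy, as below, we would need $\sigma_{\hat{h}_t}$ and $\sigma_{\hat{S}_t}$:

$\hat{h}_t \pm z\,\sigma_{\hat{h}_t}$, $\hat{S}_t \pm z\,\sigma_{\hat{S}_t}$ and $(1 - \hat{S}_t) \pm z\,\sigma_{\hat{S}_t}$

where z is the standard normal critical value for the chosen confidence level. Unfortunately, there are no commonly accepted estimators of the variance terms $\sigma^2_{\hat{h}_t}$ and $\sigma^2_{\hat{S}_t}$ that underlie these standard errors.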

The delta method: the univariate case

The problem is that $E[g(X)] = g[E(X)]$ only when $g(X)$ is a linear function. But the variance of, for example, X involves second order terms, and is therefore not a linear function. Therefore $\sigma^2_{g(X)} \neq g(\sigma^2_X)$. Fortunately, we can approximate $\sigma^2_{g(X)}$ given $\sigma^2_X$ using the (univariate) delta method (here using a first-order Taylor expansion about $\mu_X = E(X)$) as follows:

$\sigma^2_{g(X)} \approx [g'(\mu_X)]^2 \, \sigma^2_X$

More accurate approximations could be produced with a 2nd-order TSE:

$g(X) \approx g(\mu_X) + g'(\mu_X)(X - \mu_X) + \tfrac{1}{2} g''(\mu_X)(X - \mu_X)^2$
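A quick numerical check can make the first-order approximation tangible. The Python sketch below compares the delta-method standard deviation of a logistic transform with a Monte Carlo estimate, and also shows what happens when the variance itself is pushed through g. The estimate of -2.0 and standard error of 0.1 are invented for illustration, as are the function names.

```python
import math
import random


def logistic(x):
    return math.exp(x) / (1.0 + math.exp(x))


def delta_variance(g, mu, var_x, eps=1e-5):
    """First-order delta method: Var[g(X)] is roughly g'(mu)^2 * Var[X]."""
    slope = (g(mu + eps) - g(mu - eps)) / (2.0 * eps)  # central difference
    return slope ** 2 * var_x


mu, sd = -2.0, 0.1   # hypothetical estimate and its standard error
delta_sd = math.sqrt(delta_variance(logistic, mu, sd ** 2))

# Monte Carlo check: draw X ~ Normal(mu, sd) and transform each draw.
rng = random.Random(2010)
draws = [logistic(rng.gauss(mu, sd)) for _ in range(100_000)]
mean = sum(draws) / len(draws)
mc_sd = math.sqrt(sum((d - mean) ** 2 for d in draws) / (len(draws) - 1))

# Transforming the variance directly lands on the wrong scale entirely.
g_of_variance = logistic(sd ** 2)
```

Here `delta_sd` and `mc_sd` should agree closely, while `g_of_variance` is slightly above 0.5, far larger than the variance of about 0.0001 it is meant to stand in for.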

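An estimator of $\sigma^2_{\hat{h}_t}$ for unconditional models

If $g(X) = \dfrac{e^X}{1 + e^X}$ (the anti-logit or logistic function), then, per the quotient rule, $g'(X) = \dfrac{e^X}{(1 + e^X)^2}$, and we obtain $\sigma^2_{\hat{h}_t}$ as shown:

$\sigma^2_{\hat{h}_t} \approx \left[ \dfrac{e^{\hat{\alpha}_t}}{(1 + e^{\hat{\alpha}_t})^2} \right]^2 s^2_{\alpha_t} = \left[ \hat{h}_t (1 - \hat{h}_t) \right]^2 s^2_{\alpha_t}$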
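As a rough illustration for the unconditional logit-link case, the Python below turns a period parameter and its standard error into a hazard, a delta-method standard error, and a normal-approximation interval. The input values and the name `hazard_with_se` are hypothetical; the formula used is the first-order one, with the derivative of the logistic function equal to h(1 - h).

```python
import math


def hazard_with_se(alpha_t, se_alpha_t):
    """Return (h_t, se_h_t) using the first-order delta method."""
    h = 1.0 / (1.0 + math.exp(-alpha_t))
    # Derivative of the logistic function at alpha_t is h * (1 - h).
    se_h = h * (1.0 - h) * se_alpha_t
    return h, se_h


h, se_h = hazard_with_se(-2.0, 0.1)      # hypothetical estimate and SE
z = 1.96                                  # 95% normal critical value
ci_h = (h - z * se_h, h + z * se_h)
```

Note that a symmetric normal interval can stray outside [0, 1] when the hazard is near a boundary, so the result should be read with that caveat.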

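An estimator of $\sigma^2_{\hat{S}_t}$ for unconditional models step 1

$\ln(\hat{S}_t) = \sum_{i=1}^{t} \ln(1 - \hat{h}_i)$ is more tractable, and if $g(X) = 1 - X$ (so that $\sigma^2_{1-\hat{h}_i} = \sigma^2_{\hat{h}_i}$), we first estimate $\sigma^2_{\ln(\hat{S}_t)}$, treating the period-specific hazards as independent:

$\sigma^2_{\ln(\hat{S}_t)} \approx \sum_{i=1}^{t} \dfrac{\sigma^2_{\hat{h}_i}}{(1 - \hat{h}_i)^2}$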

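An estimator of $\sigma^2_{\hat{S}_t}$ for unconditional models step 2

We then consider $g(X) = e^X$ to obtain $\sigma^2_{\hat{S}_t}$:

$\sigma^2_{\hat{S}_t} \approx \hat{S}_t^2 \, \sigma^2_{\ln(\hat{S}_t)}$

substituting $\hat{h}_i = e^{\hat{\alpha}_i} / (1 + e^{\hat{\alpha}_i})$ and $\hat{S}_t = \prod_{i=1}^{t} (1 + e^{\hat{\alpha}_i})^{-1}$ to obtain an expression using only sα and α.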
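The two steps can be chained in a short Python sketch. It assumes an unconditional logit-link model and independence between the period-specific hazards, under which the first-order variance of ln(S_t) works out to the running sum of (h_i × s_αi)². That simplification is my own working from the steps described here, not a formula taken from the slides, and the inputs and function name are hypothetical.

```python
import math


def survival_with_se(alphas, se_alphas):
    """Return [(S_t, se_S_t), ...] for t = 1..T.

    Assumes independent period hazards, so that
    Var[ln S_t] is roughly the sum over i <= t of (h_i * se_alpha_i)^2.
    """
    results, surv, var_ln_surv = [], 1.0, 0.0
    for alpha, se in zip(alphas, se_alphas):
        h = 1.0 / (1.0 + math.exp(-alpha))
        surv *= 1.0 - h
        var_ln_surv += (h * se) ** 2
        # g(X) = e^X step: se(S_t) is roughly S_t * se(ln S_t).
        results.append((surv, surv * math.sqrt(var_ln_surv)))
    return results


alphas = [-2.0, -1.5, -1.0]        # hypothetical baseline logit hazards
se_alphas = [0.10, 0.12, 0.15]     # hypothetical standard errors
curve = survival_with_se(alphas, se_alphas)

z = 1.96
# Cumulative incidence is 1 - S_t and shares the standard error of S_t.
ci_cum_inc = [(1 - s - z * se, 1 - s + z * se) for s, se in curve]
```

For the first period the result reduces to the hazard standard error, since S_1 = 1 - h_1, which is a useful sanity check on any implementation.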

The delta method: the multivariate case

The delta method in the univariate case:

$\sigma^2_{g(X)} \approx [g'(\mu_X)]^2 \, \sigma^2_X$

can be extended to a multivariate case (1st-order TSE):

$\sigma^2_{g(Z)} \approx V \Sigma V^{\top}$

where: Z is a 1 by p + 1 row vector of dt, X1t, …, Xpt; V is a 1 by p + 1 vector of partial derivatives of g(Z) evaluated at the p + 1 by 1 mean vector µ; and Σ is the p + 1 by p + 1 variance-covariance matrix of Z.
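The quadratic form V Σ V′ is easy to compute directly. The Python sketch below does so for a logistic hazard with one predictor. It makes a simplifying assumption that differs in presentation from the slide: the random quantities are taken to be the parameter estimates (the period parameter and one slope), Σ is their estimated variance-covariance matrix, and the observed predictor value is treated as fixed. All numbers and names are invented for illustration.

```python
import math


def quad_form(v, sigma):
    """Compute v * Sigma * v' for a vector v and a square matrix Sigma."""
    n = len(v)
    return sum(v[i] * sigma[i][j] * v[j] for i in range(n) for j in range(n))


def conditional_hazard_var(alpha_t, betas, xs, sigma):
    """Return (h_t, var_h_t) by the first-order multivariate delta method."""
    eta = alpha_t + sum(b * x for b, x in zip(betas, xs))
    h = 1.0 / (1.0 + math.exp(-eta))
    # Gradient of the logistic hazard with respect to (alpha_t, beta_1..beta_p).
    grad = [h * (1.0 - h)] + [h * (1.0 - h) * x for x in xs]
    return h, quad_form(grad, sigma)


# Hypothetical covariance matrix of (alpha_t, beta_1).
sigma = [[0.010, -0.002],
         [-0.002, 0.040]]
h_x1, var_x1 = conditional_hazard_var(-2.0, [0.5], [1], sigma)
h_x0, var_x0 = conditional_hazard_var(-2.0, [0.5], [0], sigma)
```

When the predictor is zero the result collapses to the univariate formula, because only the variance of the period parameter contributes.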

Extending estimator of $\sigma^2_{\hat{h}_t}$ to conditional models

Extending estimator of $\sigma^2_{\hat{S}_t}$ to conditional models

Again assuming independence for $\sigma^2_{\ln(\hat{S}_t)}$, the form of $\sigma^2_{\hat{S}_t}$ follows:

where $B$ is a row vector of estimated parameters $\hat{\beta}_1, \dots, \hat{\beta}_p$ and $X_i$ is a column vector of observed variables $X_{1i}, \dots, X_{pi}$ indexed by i = 1, …, t.

Application: initiation of smoking in adolescents

The DTEHA models presented employ National Longitudinal Survey of Youth 1997 data, which:
• include detailed annual self reports of 30-day smoking behavior;
• include a wealth of socio-demographic information, including characteristics of the parenting and household environment;
• span the ages at which the vast majority of regular smokers initiate and progress to their established smoking careers;
• include state and local geocodes permitting linkage to state and local tobacco control policies.

[Figure: Baseline cumulative incidence of ever smoking vs. year]




[Figure: Cumulative incidence of ever smoking by age & white. Groups: white ♂, white ♀, non-white ♂, non-white ♀]

Future development: multilevel approach

As previously mentioned, interacting predictors with model specifications of time permits estimation of time-varying effects. However, within the DTEHA framework described so far this would likely lead to incorrect estimates of parameter variances, since we would not be accounting for the clustering of observations within periods of observation and within individuals. DTEHA has already been extended to a multilevel modeling framework to accommodate precisely these issues, and it remains for Professor Kim and me to generalize our estimators of $\sigma^2_{\hat{h}_t}$ and $\sigma^2_{\hat{S}_t}$ to the multilevel case.

Future development: complementary log-log link

The logit link function in the DTEHA models discussed so far assumes proportional odds across all discrete times. However, DTEHA models can also be estimated using a complementary log-log link function to produce models under an assumption of proportional hazards across all discrete times. We will derive estimators corresponding to the ones we have presented for such models.
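For readers less familiar with the two links, the Python sketch below shows each link and its inverse side by side, so the same linear predictor can be mapped to a hazard under either assumption. The function names are my own, and the linear predictor value is arbitrary.

```python
import math


def logit(h):
    return math.log(h / (1.0 - h))


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def cloglog(h):
    """Complementary log-log link: log(-log(1 - h))."""
    return math.log(-math.log(1.0 - h))


def inv_cloglog(eta):
    """Inverse complementary log-log: 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


eta = -2.0                          # arbitrary linear predictor
h_logit = inv_logit(eta)            # hazard under the logit link
h_cloglog = inv_cloglog(eta)        # hazard under the cloglog link
```

The two hazards are close when the linear predictor is strongly negative (rare events) and diverge as the hazard grows.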

Software

The dthaz package available for Stata (type ‘findit dthaz’) currently performs DTEHA and provides utilities for managing data sets. Within the next two months the package will be updated to include the standard error estimates of the conditional hazard and survival functions (for logit link models), with corresponding graphic capabilities. An R package is planned.

Appreciation

I am indebted to Zhiguo Li at the University of Michigan for directing my attention to the delta method. Professor Jong-Sung Kim has guided my steps in this process: the credit is his, and the mistakes are mine.

Thank you

Questions and discussion?

DTEHA: miscellany

• Left censoring
• Lack of event occurrence at time t
