55
Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Embed Size (px)

Citation preview

Page 1: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Lecture 9:Marginal Logistic Regression

Model and GEE(Chapter 8)

Page 2: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Marginal Logistic Regression Model and GEE

Marginal models are suitable to estimate population average parameters

• For example, in the Indonesian study, a marginal model can be used to address questions such as:– What is the prevalence of respiratory infection in

children as a function of age?– Is the prevalence of respiratory infection greater

in the sub-population of children with vitamin A deficiency?

– How does the association of vitamin A deficiency and respiratory infection change with age?

• The scientific objective is to characterize and contrast populations of children.

Page 3: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Marginal Models for Binary Responses:Logistic Regression

Model for the Mean

Model for the Association

Marginal odds ratio,

Page 4: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Marginal Odds Ratio

a greater value indicates positive association

Two possible specifications:

Degree of association is the same for all pairs of observations from the same subject

Degree of association is inversely proportional to the time between observations from the same subject

Page 5: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Parameter Interpretation in Logistic Regression: ICHS Study

Marginal Logistic Regression

Page 6: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Parameter Interpretation in Logistic Regression: ICHS Study

Logistic Regression with Random Effects

Page 7: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Parameter Interpretation in Logistic Regression: ICHS Study

Transition Logistic Regression Model

Page 8: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Parameter Interpretation in Logistic Regression: ICHS Study

Transition Logistic Regression Model (cont’d)

Page 9: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Maximum Likelihood Estimation of β in GLM: Cross-sectional Data

• If Yi is binary or a count, we specify the likelihood function and estimate the parameters of interest using Maximum Likelihood Estimation

Page 10: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Maximum Likelihood Estimation of β in GLM: Cross-sectional Data

• For example, if Y is binary, i.e.:

…we estimate β0 and β1 by maximizing

Page 11: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Maximum Likelihood Estimation of β in GLM: Cross-sectional Data

• For example, if Y is a count, i.e.:

…we estimate β0 and β1 by maximizing

Page 12: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Maximum Likelihood Estimation of β in GLM: Cross-sectional Data

In general, we have:

is called the score equation. Solutions to the score equation are not available in closed form, and so require an iterative procedure called iterative weighted least squares (IWLS) algorithm…

Solving the score equation is equivalent to maximizing the likelihood function.

Page 13: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Maximum Likelihood Estimation of β in GLM: Cross-sectional Data

Main ideas of IWLS

1. i = EYi, vi = var(Yi) = v(i)

2. Choose to make close to yi on average

3. Weight yi by vi-1

Page 14: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

GEE Estimation of β in GLM: Longitudinal Data

In the case of a linear regression model with the assumption of normality, the extension from ordinary linear regression to longitudinal problems was facilitated by thinking about a multivariate normal distribution.

By specifying a model for the mean E[Yi] and the model for the covariance matrix Vi, we can fully specify the multivariate normal distribution:

and use MLE.

Page 15: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

GEE Estimation of β in GLM: Longitudinal Data

Unfortunately, if the elements of Yi are counts or binary response, we cannot naturally extend the Bernoulli or Poisson distributions to take into account of correlation. Multivariate extensions of these distributions are quite complex (except for biostat students!).

The main impediments with binary and count data are:

1. There are not multivariate generalizations of the necessary probability distributions

2. Population-average and subject-specific approaches do not lead to the same model for the mean response

Page 16: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

GEE(applies only to marginal models)

• Under a GEE approach, we forget about trying to specify a model for the whole multivariate distribution of a data vector. Instead, the idea is to just model the mean response E[Yi] and the covariance matrix Vi of a data vector as in the normal case.

• GEE is based on the concept of “estimating equations” and provides a very general approach for analyzing correlated responses that can be discrete or continuous.

Page 17: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

GEE

• The idea behind GEE is to generalize and extend the usual likelihood equations for a GLM with a univariate response by incorporating the covariance matrix of the vector of responses Y

• For the case of linear models, the Generalized Least Square (GLS) estimator for the vector of regression coefficients is a special case of the GEE approach

Page 18: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

GEE

• In the absence of a convenient likelihood to work with, it is sensible to estimate β by solving the following multivariate equation:

where

Note: with continuous data, the estimate from this score equation reduces to the MLE

Page 19: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

GEE (cont’d)• The method of generalized estimating equations

provides consistent estimates for the mean parameter when a model for the correlation may not be reliably specified.

• is a multivariate generalization of the score equation used to maximize the likelihood function under a GLM

Page 20: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

GLM for Longitudinal Data (GEE)

In summary:

1. For GEE models, we specify a GLM for the mean response

2. independence, completely unstructured

3. The estimates of β and their standard errors will be consistent (i.e. unbiased for large sample size).

4. If the specification of Vi is correct, then the GEE solution is the maximum likelihood estimate.

Page 21: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

GEE

One important property of the GLM family is that the score function depends only on the mean and variance of Yi. Therefore the estimating equation:

can be used to estimate the regression coefficients for any choices of link and variance functions, whether or not they correspond to a particular member of the exponential family.

is the generalized estimating equation.

Page 22: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

GEE Properties

• is nearly efficient relative to the maximum likelihood estimate of , provided that var(Yi) has been reasonably approximated.

• GEE is the maximum likelihood score equation for multivariate Gaussian data, and for binary data, when var(Yi) is correctly specified.

• is consistent for , even when var(Yi) is incorrectly specified.

Page 23: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

What we need to specify for implementing GEE

E[Yij ] ijg(ij ) ij X ij

' Var (Yij ) v(ij )Corr(Yi)

Model for the mean

Known variance function

Working correlation matrix: model for the pairwise correlations among the responses

Page 24: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Working covariance matrix

Vi Ai1/ 2corr(Yi)Ai

1/ 2

Ai diag(v(ij ))V is called the working covariance matrix to distinguish it from the true underlying covariance of Y

Page 25: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

GEE

ˆ

/][

0)(

)()(

1

1

1

1

kijjki

ii

N

iii

ii

N

iiii

D

yVD

yVy

minimize

GEE equations

Solution of the GEE equation

Page 26: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Properties of GEE Estimates

•The GEE estimator is consistent whether or not the within-subject associations/correlations have been correctly modelled.

•That is, for the GEE estimator to provide a valid estimate of the true β, we only require that the model for the mean response has been correctly specified.

Page 27: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Asymptotic distribution of the GEE estimator

N

iiiiii

N

iiii

DVYVDM

DVDB

MBB

E

1

11

1

1

11

)cov(

)ˆcov(

]ˆ[

True covariance matrix

•In large samples, the GEE estimator is multivariate normal

Page 28: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Sandwich estimate of )ˆcov(

iiiiii

N

iii

i

N

iii

DVYYVDM

DVDB

BMB

ˆˆˆˆˆˆˆ

ˆˆˆˆ

ˆˆˆ)ˆ(v̂co

1

1

1

1

1

11

bread

meat

Consistent estimate of the true covariance matrix of Y

Page 29: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Link to stata command xtgee for continuous data

substitute into GEE equations, to get…

• xtgee, identity link, corr(exch)

• Use Weighted Least Square for

iiii XDXu ,)(

11

1

1

1

11

1

)()ˆcov(

)()(ˆ

ii

N

ii

ii

N

iiii

N

ii

XVX

yVXXVX

)ˆ(Cov

Page 30: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

• xtgee, identity link, corr(exch), robust

• Use Sandwich Estimator for

11

1

11

1

11

1

1

1

11

1

))()(()()ˆcov(

)()(ˆ

ii

N

iiiiii

N

iiii

N

ii

ii

N

iiii

N

ii

XVXXVyCovVXXVX

yVXXVX

)ˆ(Cov

Link to stata command xtgee for continuous data (cont’d)

Page 31: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Link to stata commands xtgee for binary data

• Substitute into GEE equation. But no closed-form solution, so need iterative procedure.

• Difference between using robust or not, is analogous to continuous data.

• xtgee,logit link, corr(exch)• xtgee, logit link, corr(exch), robust

2

2

)1(

][

1)(

1

11

K

kkijk

K

kkijk

K

kkijk

i

i

X

X

ijk

k

X

k

ijjki

X

X

i

e

eXeD

e

e

Page 32: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Using GEE

Page 33: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Bottom Line

If the scientific focus is on the regression coefficients β:

1. Focus on modeling the mean structure

2. Use a reasonable approximation of the covariance structure

3. Check the inferences for β by comparing β’s robust standard errors with respect to different covariance assumptions

4. If the β’s standard errors differ substantially, a more careful treatment of the covariance model might be necessary.

Page 34: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Maximum Likelihood Estimation for Binary Data

Page 35: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Example: 2x2 crossover trial

Data from the 2x2 crossover trial on cerebrovascular deficiency adapted from Jones and Kenward, where treatment A and B are active drug and placebo, respectively: the outcome indicates whether an electrocardiogram was judged abnormal (0) or normal (1).

Page 36: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Example: 2x2 crossover trial (cont’d)

Goal: To compare the effect of an active drug (A) and a placebo (B) on cerebrovascular deficiency

• 34 patients received A followed by B• 33 patients received B followed by A• Yij = 1 if normal electrocardiogram reading

At period 1:

Page 37: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Example: 2x2 crossover trial (cont’d)

Calculate MLE of odds ratios separately for period 1 and period 2.

Odds ratio of being normal for the active drug versus the placebo is:

This estimate is larger than 1, and therefore indicates that the active drug produces a higher proportion of normal readings.

However, the estimate is not statistically significant.

Should we compare (???) the data for Periods 1 and 2?

Page 38: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Example: 2x2 crossover trial (cont’d)

This approach has several limitations:

1. Ignore the carry-over effect, i.e. the effect of the treatment at period 1 might influence the response at period 2 (treatment x period interaction)

2. Two responses for the same subject are likely to be correlated

3. In fact, the odds ratio

is estimated to be

So, let’s use GEE to estimate a population average odds ratio, taking into account within-subject correlation

Page 39: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Example: 2x2 crossover trial (cont’d)

GEE Approach

• We combine data from both periods.

• We can analyze a 2x2 crossover trial as a longitudinal study with ni = n = 2 and m = 67.

Page 40: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Example: 2x2 crossover trial (cont’d)

GEE Approach (cont’d)

• Fit a logistic regression model:

Page 41: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

exp(0.57)-1 = 0.77

Population average odds of a normal reading are estimated to be 77% higher when using the drugs as compared to the placebo

exp(3.56) = 35

Subjects with normal responses at the first visit have odds of normal reading at the next visit that are almost 35 times higher than those whose first response was abnormal

Page 42: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

In summary

1. Model 1 includes the treatment x period interaction (little support from the data), and estimates marginal odds ratio by GEE

2. Model 2 drops the period x treatment interaction and estimates the marginal odds ratio by GEE

3. Model 3 assumes that the marginal odds ratio is 1, here = 0.56, with standard error 0.38 (much larger than under Models 1 and 2)

Note: If we fit Model 3, but using robust standard errors, then we obtain similar results to the GEE approach.

Page 43: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

•The prevalance of respiratory infections in six consecutive quarters reveals a positive seasonal trend with a summer maximum

•The prevalence of xerophthalmia also indicates some seasonality with a winter maximum

Page 44: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

•275 children in Indonesia were examined for up to six consecutive quarters for the presence of respiratory infections (i=1,…,m=275; j=1,…,6 visits).

•Goals of the analysis

•Determine whether prevalence of respiratory infection is higher among children who suffer from xerophthlamia (an ocular manifestation of chronic vitamin A deficiency)?

•Estimate the change of respiratory infection with age.

•Consider seasonality as a potential confounder.

Example: Respiratory Infections

Page 45: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Model 1: First visit only

•Look only at the data from the first visit

•Fit a logistic regression model of respiratory infection on xerophthalmia and age, adjusting for other covariates

•We find a strong non-linear cross-sectional age effect on the prevalence of respiratory infection

•Cross-sectional analysis suggests that the prevalence of respiratory infection increases from age 12 months and reaches its peak at age 20 months before starting to decline

Cross-Sectional Analysis

Page 46: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Model 2: All visits + controlling for seasonality

•Look at data from all visits

•Fit a logistic regression model of respiratory infection on xerophthalmia and age, adjust for other covariates

•We still find a strong non-linear cross-sectional age effect on prevalence of respiratory infection

•The age coefficient in Model 2 can be interpreted as weighted averages of the cross-sectional age coefficients for each visit.

Cross-Sectional Analysis

Page 47: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Longitudinal Analysis

Here we want to distinguish the contributions of cross-sectional and longitudinal information to the estimated relationship of respiratory infection and age.

Page 48: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Longitudinal Analysis

Model 3: Separate CS from LDA

•Separate differences among sub-populations of children at different ages and a fixed time (CS) from changes in children over time (LD)

Page 49: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Longitudinal Analysis

Model 3: Separate CS from LDA

Page 50: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Longitudinal Analysis

Model 4: Separates CS from LDA + controlling for seasonality

Page 51: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Summary of Results•Pattern of convex relationship between age and the risk of respiratory infection appears to coincide with the pattern of seasonality.

•If we include harmonic terms, then the longitudinal parameters (corresponding to the follow-up in the table) are not statistically significant.

•The longitudinal information (i.e. variation over time of respiratory infections versus variations over time of age) is highly confounded by seasonality.

•Therefore, in the presence of a strong seasonal signal, we can learn little about the effects of aging from data collected over an 18-month period if we restrict our attention to longitudinal information.

•However, much can be learned by comparing children at different ages so long as we can assume that there are not cohort effects confounding the inferences about age.

•BE CAREFUL – in longitudinal analysis, always look for time-varying confounders!

Page 52: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Table 8.7. Logistic regressions of the prevalence of respiratory function on age and xerophthalmia adjusting for gender, season, and height for age.

Model 1 and 2 estimate cross-sectional effects; Models 3 and 4 distinguish cross-sectional from longitudinal effects.

Models 2-4 are fitted using the alternating logistic regression implementation of GEE.

CS,

1st visit onlyCS, All visits LD LD

βc

βL

Page 53: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Model 2

The association between RI and Xero. is positive, although not statistically significant at the 5% level.

Page 54: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

Prevalence of respiratory infection increases from age 12 months and reaches its peak at 20 months before starting to decline.

The age coefficients in Model 2 can be interpreted as a weighted average of the cross-sectional age coefficients from each visit.

Page 55: Lecture 9: Marginal Logistic Regression Model and GEE (Chapter 8)

The risk of RI declines in the first 7 to 8 months of follow-up, before rising later in life.

(ageij – ageik)