
If we use a logistic model, we do not have the problem of suggesting risks greater than 1 or less than 0 for some values of X:

E[1{outcome = 1}] = exp(a + bX) / [1 + exp(a + bX)]

The logistic model is a linear model, on a different scale than the linear risk model:

log( Pr(outcome=1) / [1 – Pr(outcome=1)] ) = a + bX
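A minimal sketch of the bounded-risk property (the intercept and slope values here are hypothetical, chosen only for illustration):

```python
import math

def logistic_risk(x, a=-2.0, b=0.8):
    """Risk from the logistic model: exp(a + bX) / (1 + exp(a + bX))."""
    eta = a + b * x
    return math.exp(eta) / (1.0 + math.exp(eta))

# Unlike the linear risk model a + bX, the logistic risk stays strictly
# inside (0, 1) for any value of X.
risks = [logistic_risk(x) for x in range(-10, 11)]
print(all(0.0 < p < 1.0 for p in risks))  # True
```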

Ordinal outcomes:

Ungrouped Poisson Regression

Typically the model is applied to aggregated units of observation (groups or strata) for which total counts, total units of observation, and group-level covariates are recorded

Collapsing covariates into group-level covariates can introduce bias and lose information

Ungrouped Poisson regression methods have been developed that use individual, time-varying covariate information to estimate the effects of covariates on rates

Estimates similar to those from proportional hazards models
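A small sketch of the underlying rate estimate (the person-time and event counts are hypothetical). Under a constant-rate Poisson model, the maximum-likelihood rate is total events divided by total person-time, whether the records are kept as individuals or collapsed into a stratum:

```python
# Hypothetical cohort records: (person-time at risk, number of events).
records = [(2.5, 0), (4.0, 1), (1.2, 0), (3.3, 2), (5.0, 1)]

total_events = sum(d for _, d in records)
total_pt = sum(t for t, _ in records)

# ML estimate of the rate under a constant-rate Poisson model:
# the same whether data are ungrouped (individual records) or
# grouped (one stratum with the totals).
rate_hat = total_events / total_pt
print(round(rate_hat, 4))  # 0.25
```

The covariate modeling described above generalizes this by letting the log rate depend linearly on (possibly time-varying) covariates.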

Ungrouped Poisson Regression

Loomis et al (2005) Poisson regression analysis of ungrouped data. Occup Environ Med 62:325-329.


Survival Analyses

– Survival analysis is a set of statistical techniques whose goal is to predict the time of (or time until) an event
– The dependent variable is the time to occurrence of a specific event (te) from some time (t0)
– An event is a qualitative change in some attribute

– People who do not have the event during the follow-up are said to be censored

– The independent variable may be a treatment, a clinical characteristic, a demographic characteristic, or some other predictor of survival

– Examples include different treatments, high/low blood pressure, or treating hospital

– Examples
– Predicting the life expectancy of a group:
– Do smokers die younger than non-smokers?
– The event is death
– Measure the efficacy of a treatment
– Does AZT delay the onset of Pneumocystis carinii pneumonia in people who are HIV positive?
– The event is pneumonia diagnosis

– Measure the role of multiple predictors on the time until an outcome:

– Can you predict time to conception, using lots of predictors, in couples undergoing fertility treatment?

– The event is conception

Survival Analyses

– Examples (continued)
– Survival analysis deals with special problems in each of the previous studies
– The smoking study
– The subjects most likely began smoking at different ages and you need to account for the various years at risk
– Some subjects may disappear or die from trauma
– AZT study
– The exact date of exposure to the virus is rarely known
– Time periods may exist where participants left for a different clinical trial and then returned
– Pregnancy study
– Some couples may never get pregnant

Survival Analyses

Censoring

If you create a timeline, a number of events will occur
– Uncensored observation: The event under observation occurs and their time truly reflects the survival time
– Censored observation: The event under observation does not occur and their time reflects a minimum survival time
– Random censoring: Loss to follow-up for random reasons
– Right censoring: People who never have the event
– Interval censoring: People who have missing data across a chunk of time
– Subject dropped out and then rejoins
– Left censoring: People who are lacking a good start time
– Informative censoring: Some people may leave the study for reasons that relate to failure

Survival function
S(t) = P(T > t): probability of surviving at least to time t

Hazard function

h(t) = lim_{Δt→0} P(t ≤ T < t+Δt | T ≥ t) / Δt

– Interpretation
– The hazard function h(t) gives the instantaneous potential per unit time for the event to occur, given that the individual has survived up to time t.
– The survivor function focuses on the individual not failing, but the hazard focuses on failing.
– The “| T ≥ t” is the given (conditioning) part of the formula
– You need to have survived to that moment of time

Survival and Hazard Functions

Kaplan-Meier Estimation of the Survival Function
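A minimal sketch of the Kaplan-Meier estimator on a tiny hypothetical dataset. At each observed event time, S(t) is multiplied by (1 - deaths / number at risk); censored subjects simply leave the risk set:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.
    times: follow-up times; events: 1 if the event occurred, 0 if censored.
    Returns a list of (event_time, S(t)) pairs."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])  # sort by follow-up time
    at_risk = n
    surv = 1.0
    curve = []
    i = 0
    while i < n:
        t = times[order[i]]
        deaths = 0
        removed = 0
        # Handle ties: everyone with the same time leaves the risk set together.
        while i < n and times[order[i]] == t:
            deaths += events[order[i]]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= (1.0 - deaths / at_risk)  # S(t) = prod over event times of (1 - d_j / n_j)
            curve.append((t, surv))
        at_risk -= removed
    return curve

# Toy data (hypothetical): subjects at times 2 and 4 are censored.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 0, 1])
print(curve)  # [(1, 0.8), (3, 0.533...), (5, 0.0)]
```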

Cox Proportional Hazards model

Analogous to other regressions such as linear or logistic regression
* Based upon the hazard function

h(t | X) = h0(t) exp(a + bX)

* h(t | X=1) / h(t | X=0) = exp(b)

exp(b): the hazard ratio for a unit increase in X

* Assume the hazard ratio is unchanged with respect to time

Gordon et al (1984) Coronary risk factors and exercise test performance in asymptomatic hypercholesterolemic men: Application of proportional hazards analysis. Am J Epidemiol 120(2):210-214

Model Specification, Fitting, and Selection

Specification: What is the functional FORM of the relationship between Y and X

E[Y | X0, X1] = a + b0*X0 + b1*X1

Fitting: Using data to estimate the various constants in the generic functional form of a model.

Selection:You may be able to specify several reasonable models. Then the task is to select which of the models to emphasize or report. You may use the data to help select a model:

* specify several models
* fit the models to your data
* examine the quality of the fit

-- what this may be depends on how you plan to use the model and the method used to fit the model

* Select a “good” model

It may be difficult to interpret or trust p-values or effect estimates from models chosen by a selection procedure

Model Fitting

There are many methods to fit statistical models. Consider a simple model:

Y = a + bX + e
E[Y|X] = a + bX
need to estimate “a” and “b” from data

Least Squares:
Find a’, b’ to minimize: sum( (yi – a’ – b’xi)^2 )
relatively fast
does not depend on distribution of errors (e)

Maximum Likelihood:
assume a distribution for the errors (e)
find a’, b’ to maximize: product( likelihood(yi – a’ – b’xi) )
equivalently, find a’, b’ so that: sum( derivative of log of likelihood() ) = 0

Estimating Equations:
define a function similar to sum( derivative of log of likelihood() ) and find a’, b’ to set it equal to 0
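A minimal sketch of the least-squares fit above, using the closed-form solution b = cov(x, y) / var(x), a = mean(y) - b * mean(x). The toy data are hypothetical and noise-free so the fit recovers a = 1, b = 2 exactly:

```python
def least_squares(xs, ys):
    """Fit Y = a + bX by minimizing sum((yi - a - b*xi)^2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx              # slope estimate
    a = my - b * mx            # intercept estimate
    return a, b

# Hypothetical, noise-free data generated from a = 1, b = 2.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a_hat, b_hat = least_squares(xs, ys)
print(a_hat, b_hat)  # 1.0 2.0
```

When the errors are assumed normal, maximum likelihood gives the same a’ and b’ as least squares, which is why the two methods are often interchangeable for linear models.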

Model Selection

Models that fit data well will have

lower values for sum( (yi – a’ – b’xi)^2 ):
residual sums of squares (RSS)
sum of squared errors (SSE)

orgreater likelihoods or log-likelihoods

relative to models that fit the data less well.

Adding covariates to a model will lower the RSS or increase the log-likelihood

Adding irrelevant covariates will improve the model fit to the data, but probably not by much, while decreasing the ability of the model to describe a replicate dataset or the population from which data are collected. This is the over-fitting problem.

Model selection is a tradeoff between fitting the data well and over-fitting the data.

Likelihood ratio tests can help determine if additional covariates help (fit data) more than they hurt (over-fitting)
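A sketch of the likelihood-ratio idea for nested linear models with normal errors, where 2*(logLik_big - logLik_small) reduces to n * log(RSS_small / RSS_big). The RSS values and sample size here are hypothetical:

```python
import math

def lrt_statistic(rss_small, rss_big, n):
    """Likelihood-ratio statistic for nested Gaussian linear models:
    2*(logLik_big - logLik_small) = n * log(RSS_small / RSS_big)."""
    return n * math.log(rss_small / rss_big)

# Hypothetical fits: adding one covariate drops RSS from 120 to 118 (n = 50).
stat = lrt_statistic(120.0, 118.0, 50)

# Compare with the chi-square(1 df) 5% critical value, 3.84: a statistic
# below it suggests the extra covariate improves fit no more than
# over-fitting noise typically would.
print(stat < 3.84)  # True
```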

Model Critiques

The ultimate test of a model’s worth may be using it to make predictionsabout a new dataset (not the one used to fit the model). With new data, the quality of the predictions can be assessed.

This may not be possible, but there are methods to approximate the ideal confirmation study for a model:

* cross validation
fit the model using some of the data and assess predictive ability on the remaining data

* Bootstrapping
from a dataset with n observations, draw n observations with replacement to get a “new” dataset. Analyze that dataset. Draw another “new” dataset and analyze it. Assess how similar the analyses are.
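A minimal bootstrap sketch with a hypothetical sample, where the repeated "analysis" is just the sample mean:

```python
import random

random.seed(1)  # reproducible draws

data = [2.1, 3.5, 1.8, 4.2, 2.9, 3.1, 2.4, 3.8]  # hypothetical sample
n = len(data)

# Draw B "new" datasets of size n with replacement and analyze each one.
B = 1000
boot_means = []
for _ in range(B):
    resample = [random.choice(data) for _ in range(n)]
    boot_means.append(sum(resample) / n)

# The spread of the bootstrap analyses approximates the sampling
# variability of the original analysis (here, a percentile interval).
boot_means.sort()
ci = (boot_means[int(0.025 * B)], boot_means[int(0.975 * B)])
print(ci)
```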


Intervention effects and regression

Intervention effects:

E[ Y | set(X=x1), Z=z] - E[ Y | set(X=x0), Z=z]
E[ Y | set(X=x1), Z=z] / E[ Y | set(X=x0), Z=z]

where the expectation is over the target population

In practice, what we can calculate with standard regression analysis is:

Ave(Y | X=x1, Z=z) - Ave(Y | X=x0, Z=z')
Ave(Y | X=x1, Z=z) / Ave(Y | X=x0, Z=z')

or equivalently:

E[ Y | X=x1, Z=z] - E[Y | X=x0, Z=z’]E[ Y | X=x1, Z=z] / E[Y | X=x0, Z=z’]

where the expectation is over the sample


Intervention effects and regression

If we want to use the regression association measures as estimates of the potential intervention effects, we need to assume:

E[ Y | X=x, Z=z] = E[ Y | set(X=x), Z=z]

No Confounding Assumption

“no residual confounding of X and Y given Z”

There are some methods we can use to push harder to remove residual confounding than with basic regression:

regularization
treatment models, propensity scores, IPW
both of the above: double robust estimates

Regression standardization

E[ Y | X=x, Z=z]
different values of Z correspond to different strata in which you may consider the Y~X association

You can define an overall measure of the Y~X association by taking a weighted average over the different strata or levels of Z, resulting in a marginal or population-averaged effect:

EW[Y | X=x] = Σ{z in Z}( w(z) * E[Y | X=x, Z=z] )

Different choices for weights w(z):
w(z) = proportion of Z=z in the source population...
or in a different target population
or in a standard population
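A minimal standardization sketch, with hypothetical stratum-specific means and weights:

```python
# Stratum-specific means E[Y | X=1, Z=z] (hypothetical numbers) and
# weights w(z) = proportion of each stratum in the chosen target population.
strata_means = {"z0": 0.30, "z1": 0.50, "z2": 0.80}
weights = {"z0": 0.5, "z1": 0.3, "z2": 0.2}  # must sum to 1

# EW[Y | X=x] = sum over z of w(z) * E[Y | X=x, Z=z]
standardized = sum(weights[z] * strata_means[z] for z in strata_means)
print(round(standardized, 3))  # 0.46
```

Changing the weights to a different target or standard population changes the standardized value without refitting the stratum-specific model.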

Exposure Scores

An outcome regression model describes the expected value of the outcome Y given the treatment or exposure of interest, X, and other covariates or confounders Z:

E[ Y | X=x, Z=z]

We could make a model to describe the expected value of X given other covariates Z:

E[ X | Z=z]

Once the second model is fit, we can calculate the probability that each subject in the study should have received a particular exposure, say X=1: Pr(X=1 | Z=z)

Exposure Scores

If we have a dichotomous exposure of interest (X= 1 or X=0), then Pr(X=1 | Z=z) would be called the propensity score.

If we include the propensity score as a covariate in the outcome model, then we would effectively be stratifying the analysis by the probability of exposure, so confounding would be broken

Alternatively, for subjects with X=1, we could calculate pi1 = Pr(X=1 | Z=z) and give them weights 1/pi1. For subjects with X=0, we could calculate pi0 = Pr(X=0 | Z=z) and give them weights 1/pi0. If we then fit a weighted regression model, we would have a model that further breaks confounding by accounting for the population distribution of exposure.
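A minimal inverse-probability-weighting sketch. The propensity scores below are hypothetical values, standing in for fitted probabilities from a model for X given Z:

```python
# Each record: (x, y, p) where p = Pr(X=1 | Z=z) is a fitted propensity
# score (hypothetical values, as if from a logistic model for X ~ Z).
records = [
    (1, 5.0, 0.8), (1, 6.0, 0.5), (1, 7.0, 0.4),
    (0, 3.0, 0.8), (0, 4.0, 0.5), (0, 2.0, 0.2),
]

def ipw_mean(recs, x_level):
    """Weighted outcome mean: exposed get weight 1/Pr(X=1|Z),
    unexposed get weight 1/Pr(X=0|Z) = 1/(1 - Pr(X=1|Z))."""
    num = den = 0.0
    for x, y, p in recs:
        if x != x_level:
            continue
        w = 1.0 / p if x == 1 else 1.0 / (1.0 - p)
        num += w * y
        den += w
    return num / den

# Weighted difference in means, an IPW estimate of the exposure effect.
effect = ipw_mean(records, 1) - ipw_mean(records, 0)
print(round(effect, 3))  # 3.126
```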

If we combine these methods with standardization, we get a “double robust” estimator of the confounding-free effect of interest

Ecological Studies

• Sample units are groups or regions rather than individuals
- use aggregate measures of exposure and outcome
-- rates, proportions, regional averages, representative values
- if using spatial regions, expect spatial correlations
-- use analysis methods that do not require independent observations
-- GEEs, Random Effects, Hierarchical Models
- different sized regions or groups have different data quality or completeness
-- small regions: sparse measurements


Group-level and individual-level trends may differ.
Ecological fallacy: not appreciating this distinction.

Ecological Bias

• Confounding by group• Effect modification by group• Plus, all of the opportunities for bias as in individual level studies

Mitigation strategy: use small and well-defined groups that are homogeneous with respect to exposures

Generalized Linear Models

A broad class of models (including linear, logistic, and Poisson regression):

The distribution of the outcome Y has a special form

“Exponential dispersion family”

There is a linear model for a transformed version of the expected value of Y, a “mean function”:

g(E[Y|X]) = Xβ

where g() is a “link function”

The variance of Y can be expressed as a function of the expected value of Y:

Var(Y|X) = V( g^-1(Xβ) )

There are general methods to solve many forms of these models and extensions of these models

Methods For Non-IndependentObservations

Generalized Estimating Equations
Extensions of Generalized Linear Models, where you assume that the OBSERVATIONS have a particular correlation structure

Random Effect Models

(1) Y = a + bX + e : standard linear model, common intercept, common slope

(2) Y = ai + bX + e : standard linear model, a different intercept for each group i

(3) Y = a + gi + bX + e : each group i has its own intercept offset gi, where the gi are drawn from a normal distribution with mean 0

(3) is more flexible than (1) and may have many fewer parameters than (2)
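A minimal simulation of model (3), with hypothetical parameter values. Each group's intercept offset g_i is one draw from a normal distribution, so the 20 group intercepts are described by a single variance parameter rather than 20 free parameters:

```python
import random

random.seed(2)  # reproducible simulation

# Simulate model (3): Y = a + g_i + b*X + e, with group offsets
# g_i ~ Normal(0, sigma_g). All parameter values are hypothetical.
a, b = 1.0, 2.0
sigma_g, sigma_e = 0.5, 0.1
n_groups, n_per_group = 20, 30

g = [random.gauss(0.0, sigma_g) for _ in range(n_groups)]

data = []
for i in range(n_groups):
    for _ in range(n_per_group):
        x = random.uniform(0, 1)
        y = a + g[i] + b * x + random.gauss(0.0, sigma_e)
        data.append((i, x, y))

# The group offsets average out near 0, consistent with their mean-0
# normal distribution.
mean_g = sum(g) / n_groups
print(round(mean_g, 3))
```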