If we use a logistic model, we do not have the problem of suggesting risks greater than 1 or less than 0 for some values of X:

E[1{outcome = 1}] = exp(a + bX) / [1 + exp(a + bX)]

The logistic model is a linear model, on a different scale than the linear risk model:

log( Pr(outcome=1) / [1 – Pr(outcome=1)] ) = a + bX
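A minimal sketch of this point, with hypothetical coefficients a = -1 and b = 0.5: the inverse-logit transform keeps fitted risks strictly between 0 and 1, while nothing constrains the linear risk model.

```python
import math

def inv_logit(a, b, x):
    """Inverse-logit: maps the linear predictor a + b*x into (0, 1)."""
    return math.exp(a + b * x) / (1.0 + math.exp(a + b * x))

def linear_risk(a, b, x):
    """Linear risk model: nothing keeps a + b*x inside [0, 1]."""
    return a + b * x

a, b = -1.0, 0.5
print(linear_risk(a, b, 10))  # 4.0 -- not a valid risk
print(inv_logit(a, b, 10))    # ~0.982 -- a valid probability
```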
Ungrouped Poisson Regression
Typically the model is applied to aggregated units of observation (groups or strata) for which total counts, total units of observation, and group-level covariates are recorded
Collapsing covariates into group-level covariates can introduce bias and lose information
Ungrouped Poisson regression methods have been developed that use individual, time-varying covariate information to estimate the effects of covariates on rates
Estimates similar to those from proportional hazards models
Ungrouped Poisson Regression
Loomis et al (2005) Poisson regression analysis of ungrouped data. Occup Environ Med 62:325-329.
Survival Analyses
– Survival analysis is a set of statistical techniques whose goal is to predict the time of (or time until) an event
– The dependent variable is the time to occurrence of a specific event (te) from some starting time (t0)
– An event is a qualitative change in some attribute
– People who do not have the event during follow-up are said to be censored
– The independent variable may be a treatment, a clinical characteristic, a demographic characteristic, or some other predictor of survival
– Examples include different treatments, high/low blood pressure, or treating hospital
– Examples
– Predicting the life expectancy of a group:
– Do smokers die younger than non-smokers?
– The event is death
– Measuring the efficacy of a treatment:
– Does AZT delay the onset of Pneumocystis carinii pneumonia in people who are HIV positive?
– The event is pneumonia diagnosis
– Measuring the role of multiple predictors on the time until an outcome:
– Can you predict time to conception, using many predictors, in couples undergoing fertility treatment?
– The event is conception
Survival Analyses
– Examples (continued)
– Survival analysis deals with special problems in each of the previous studies
– The smoking study
– The subjects most likely began smoking at different ages, and you need to account for the various years at risk
– Some subjects may disappear or die from trauma
– AZT study
– The exact date of exposure to the virus is rarely known
– Time periods may exist where participants left for a different clinical trial and then returned
– Pregnancy study
– Some couples may never get pregnant
Survival Analyses
Censoring
If you create a timeline, a number of events will occur:
– Uncensored observation: The event under observation occurs, and the observed time truly reflects the survival time
– Censored observation: The event under observation does not occur, and the observed time reflects a minimum survival time
– Random censoring: Loss to follow-up for random reasons
– Right censoring: People who never have the event during follow-up
– Interval censoring: People who have missing data across a chunk of time
– Subject dropped out and then rejoined
– Left censoring: People who lack a good start time
– Informative censoring: Some people may leave the study for reasons that relate to failure
Survival function
S(t) = P(T > t): probability of surviving at least to time t

Hazard function
h(t) = lim(Δt → 0) P(t ≤ T < t + Δt | T ≥ t) / Δt
– Interpretation
– The hazard function h(t) gives the instantaneous potential per unit time for the event to occur, given that the individual has survived up to time t.
– The survivor function focuses on the individual not failing, but the hazard focuses on failing.
– The “| T ≥ t” is the “given” part of the formula
– You need to have survived to that moment in time
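A minimal sketch of these two functions on a small, hypothetical set of uncensored event times (the data and the discrete-time version of the hazard are illustrative assumptions, not the continuous-time definition above):

```python
# Hypothetical uncensored event times (e.g., months to the event)
times = [2, 3, 3, 5, 5, 5, 8, 10]
n = len(times)

def S(t):
    """Empirical survival function: P(T > t)."""
    return sum(1 for ti in times if ti > t) / n

def discrete_hazard(t):
    """Discrete-time hazard: P(T = t | T >= t), the risk of the
    event at time t among those still at risk just before t."""
    at_risk = sum(1 for ti in times if ti >= t)
    events = sum(1 for ti in times if ti == t)
    return events / at_risk if at_risk else 0.0

print(S(4))                # 0.625: 5 of 8 survive beyond t = 4
print(discrete_hazard(5))  # 0.6: 3 of the 5 at risk at t = 5 fail there
```

Note how the hazard conditions on being at risk (the "| T ≥ t" part), while the survival function does not.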
Survival and Hazard Functions
Cox Proportional Hazards model
Analogous to other regressions such as linear or logistic regression
* Based upon the hazard function:

h(t | X) = h0(t) exp(a + bX)

* h(t | X=1) / h(t | X=0) = exp(b)
exp(b): the hazard ratio for a unit increase in X

* Assumes the hazard ratio is unchanged with respect to time
Gordon et al (1984) Coronary risk factors and exercise test performance in asymptomatic hypercholesterolemic men: Application of proportional hazards analysis. Am J Epidemiol 120(2):210-214.
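The proportional-hazards assumption can be illustrated numerically. This sketch uses a hypothetical baseline hazard h0(t) = 0.1t and a hypothetical coefficient b = 0.7; the point is that the ratio exp(b) does not depend on t.

```python
import math

def h0(t):
    """Hypothetical baseline hazard (illustrative assumption)."""
    return 0.1 * t

def hazard(t, x, b=0.7):
    """Proportional hazards: h(t | X) = h0(t) * exp(b * x)."""
    return h0(t) * math.exp(b * x)

# The hazard ratio h(t|X=1) / h(t|X=0) = exp(b) at every time t:
for t in (1.0, 2.0, 5.0):
    print(round(hazard(t, 1) / hazard(t, 0), 4))  # exp(0.7) ~= 2.0138 each time
```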
Model Specification, Fitting, and Selection

Specification: What is the functional FORM of the relationship between Y and X?

E[Y | X0, X1] = a + b0*X0 + b1*X1
Fitting: Using data to estimate the various constants in the generic functional form of a model.

Selection: You may be able to specify several reasonable models. Then the task is to select which of the models to emphasize or report. You may use the data to help select a model:
* specify several models
* fit the models to your data
* examine the quality of the fit
-- what this means depends on how you plan to use the model and the method used to fit the model
* select a “good” model

It may be difficult to interpret or trust p-values or effect estimates from models chosen by a selection procedure.
Model Fitting
There are many methods to fit statistical models. Consider a simple model:

Y = a + bX + e
E[Y | X] = a + bX

We need to estimate “a” and “b” from data.

Least Squares:
find a', b' to minimize: sum( (yi – a' – b'xi)^2 )
relatively fast
does not depend on the distribution of the errors (e)

Maximum Likelihood:
assume a distribution for the errors (e)
find a', b' to maximize: product( likelihood(yi – a' – b'xi) )
equivalently, find a', b' so that: sum( derivative of log-likelihood() ) = 0

Estimating Equations:
define a function similar to sum( derivative of log-likelihood() ) and find a', b' that set it equal to 0
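The least-squares approach has a closed-form solution for a single covariate, sketched here on hypothetical data (roughly following y = 2x):

```python
# Hypothetical data roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

def fit_least_squares(xs, ys):
    """Closed-form least squares for Y = a + b*X + e:
    b' = cov(x, y) / var(x), a' = mean(y) - b' * mean(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

a, b = fit_least_squares(xs, ys)
print(round(a, 2), round(b, 2))  # 0.14 1.96
```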
Model Selection

Models that fit the data well will have lower values for sum( (yi – a' – b'xi)^2 ), the residual sum of squares (RSS), also called the sum of squared errors (SSE), or greater likelihoods or log-likelihoods, relative to models that fit the data less well.
Adding covariates to a model will lower the RSS or increase the log-likelihood
Adding irrelevant covariates will improve the model fit to the data, but probably not by much, while decreasing the ability of the model to describe a replicate dataset or the population from which data are collected. This is the over-fitting problem.
Model selection is a tradeoff between fitting the data well and over-fitting the data.
Likelihood ratio tests can help determine if additional covariates help (fit data) more than they hurt (over-fitting)
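As an illustration of that test, twice the log-likelihood gain from the richer model is compared against a chi-square cutoff (the log-likelihood values below are hypothetical):

```python
def likelihood_ratio_stat(loglik_reduced, loglik_full):
    """Twice the log-likelihood gain from the richer model."""
    return 2.0 * (loglik_full - loglik_reduced)

# Hypothetical fitted log-likelihoods:
ll_reduced = -120.4  # without the extra covariate
ll_full = -117.9     # with it

stat = likelihood_ratio_stat(ll_reduced, ll_full)
# For 1 added covariate (1 df), the 5% chi-square cutoff is 3.84:
print(round(stat, 2), stat > 3.84)  # 5.0 True
```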
Model Critiques
The ultimate test of a model’s worth may be using it to make predictions about a new dataset (not the one used to fit the model). With new data, the quality of the predictions can be assessed.

This may not be possible, but there are methods to approximate the ideal confirmation study for a model:
* Cross-validation
fit the model using some of the data and assess predictive ability on the remaining data
* Bootstrapping
from a dataset with n observations, draw n observations with replacement to get a “new” dataset. Analyze that dataset. Draw another “new” dataset and analyze it. Assess how similar the analyses are.
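A minimal bootstrap sketch in exactly this sense (the data, resample count, and choice of the mean as the analysis are illustrative assumptions):

```python
import random

random.seed(1)
data = [3.1, 4.5, 2.8, 5.0, 3.9, 4.2, 3.5, 4.8, 2.9, 4.1]  # hypothetical sample
n = len(data)

def bootstrap_means(data, n_resamples=1000):
    """Draw n observations with replacement, analyze each 'new'
    dataset (here: take its mean), and collect the results."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in range(n)]
        means.append(sum(resample) / n)
    return means

means = sorted(bootstrap_means(data))
# How similar are the analyses? One summary is a 95% percentile
# interval of the resampled means.
print(round(means[25], 2), round(means[975], 2))
```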
Intervention effects and regression
Intervention effects:
E[ Y | set(X=x1), Z=z] - E[ Y | set(X=x0), Z=z]
E[ Y | set(X=x1), Z=z] / E[ Y | set(X=x0), Z=z]
where the expectation is over the target population
In practice, what we can calculate with standard regression analysis is:

Ave(Y | X=x1, Z=z) - Ave(Y | X=x0, Z=z')
Ave(Y | X=x1, Z=z) / Ave(Y | X=x0, Z=z')

or equivalently:

E[ Y | X=x1, Z=z] - E[ Y | X=x0, Z=z']
E[ Y | X=x1, Z=z] / E[ Y | X=x0, Z=z']
where the expectation is over the sample
Intervention effects and regression
If we want to use the regression association measures as estimates of the potential intervention effects, we need to assume:
E[ Y | X=x, Z=z] = E[ Y | set(X=x), Z=z]
No Confounding Assumption
“no residual confounding of X and Y given Z"
There are some methods we can use to push harder to remove residual confounding than with basic regression:
* regularization
* treatment models, propensity scores, IPW
* both of the above: double robust estimates
Regression standardization
E[ Y | X=x, Z=z]
Different values of Z correspond to different strata in which you may consider the Y~X association.

You can define an overall measure of the Y~X association by taking a weighted average over the different strata or levels of Z, resulting in a marginal or population-averaged effect:

EW[Y | X=x] = Σ{z in Z}( w(z) * E[Y | X=x, Z=z] )

Different choices for the weights w(z):
w(z) = proportion with Z=z in the source population
or in a different target population
or in a standard population
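A sketch of the weighted average EW[Y | X=x], with hypothetical stratum-specific means and hypothetical target-population weights:

```python
# Hypothetical stratum-specific estimates E[Y | X=x, Z=z]
stratum_means = {        # z -> {x: E[Y | X=x, Z=z]}
    "z0": {0: 0.10, 1: 0.20},
    "z1": {0: 0.30, 1: 0.50},
}
weights = {"z0": 0.6, "z1": 0.4}  # w(z): stratum shares in the target population

def standardized_mean(x):
    """EW[Y | X=x] = sum over z of w(z) * E[Y | X=x, Z=z]."""
    return sum(weights[z] * stratum_means[z][x] for z in weights)

# Marginal (population-averaged) effect of X on Y:
effect = standardized_mean(1) - standardized_mean(0)
print(round(standardized_mean(1), 2), round(standardized_mean(0), 2), round(effect, 2))
```

Changing the weights to a different target or standard population changes the marginal effect without refitting the stratum-specific model.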
Exposure Scores
An outcome regression model describes the expected value of the outcome Y given the treatment or exposure of interest, X, and other covariates or confounders Z:

E[ Y | X=x, Z=z]

We could also make a model to describe the expected value of X given the other covariates Z:

E[ X | Z=z]

Once this second model is fit, we can calculate the probability that each subject in the study should have received a particular exposure, say X=1: Pr(X=1 | Z=z)
Exposure Scores
If we have a dichotomous exposure of interest (X=1 or X=0), then Pr(X=1 | Z=z) would be called the propensity score.

If we include the propensity score as a covariate in the outcome model, then we would effectively be stratifying the analysis by the probability of exposure, so confounding would be broken.

Alternatively, for subjects with X=1, we could calculate pi1 = Pr(X=1 | Z=z) and give them weights 1/pi1. For subjects with X=0, we could calculate pi0 = Pr(X=0 | Z=z) and give them weights 1/pi0. If we then fit a weighted regression model, we would have a model that further breaks confounding by accounting for the population distribution of exposure.

If we combine these methods with standardization, we get a “double robust” estimator of the confounding-free effect of interest.
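A sketch of the inverse-probability weights described above, using a hypothetical fitted propensity model Pr(X=1 | Z=z) with two strata of Z:

```python
# Hypothetical fitted propensity model: Pr(X = 1 | Z = z)
propensity = {"z0": 0.3, "z1": 0.7}

def ipw_weight(z, x):
    """Inverse-probability weight: 1/Pr(X=1|Z=z) for exposed
    subjects (x=1), 1/Pr(X=0|Z=z) for unexposed (x=0)."""
    p1 = propensity[z]
    return 1.0 / p1 if x == 1 else 1.0 / (1.0 - p1)

# Subjects whose exposure was unlikely given Z get large weights:
for z, x in [("z0", 1), ("z0", 0), ("z1", 1), ("z1", 0)]:
    print(z, x, round(ipw_weight(z, x), 2))
```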
Ecological Studies
• Sample units are groups or regions rather than individuals
- use aggregate measures of exposure and outcome
-- rates, proportions, regional averages, representative values
- if using spatial regions, expect spatial correlations
-- use analysis methods that do not require independent observations
-- GEEs, random effects, hierarchical models
- different sized regions or groups have different data quality or completeness
-- small regions: sparse measurements
Ecological Studies
Group-level and individual-level trends may differ.
Ecological Fallacy: not appreciating this.
Ecological Bias
• Confounding by group• Effect modification by group• Plus, all of the opportunities for bias as in individual level studies
Mitigation strategy: use small and well-defined groups that are homogeneous with respect to exposures
Generalized Linear Models
A broad class of models (including linear, logistic, and Poisson regression):

The distribution of the outcome Y has a special form, the “exponential dispersion family”.

There is a linear model for a transformed version of the expected value of Y (a “mean function”):
g( E[Y|X] ) = Xβ
where g() is a “link function”.

The variance of Y can be expressed as a function of the expected value of Y:
Var(Y|X) = V( g⁻¹(Xβ) )
There are general methods to solve many forms of these models and extensions of these models
Methods For Non-IndependentObservations
Generalized Estimating Equations
Extensions of generalized linear models, where you assume that the OBSERVATIONS have a particular correlation structure.

Random Effect Models
(1) Y = a + bX + e
: standard linear model, common intercept, common slope
(2) Y = ai + bX + e
: standard linear model, a different intercept for each group i
(3) Y = a + gi + bX + e
: each group i has its own intercept offset gi, and those offsets are drawn from a normal distribution with mean 0

(3) is more flexible than (1) and may have many fewer parameters than (2).
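A small simulation sketch of model (3), treating gi as a group-level offset drawn from a normal distribution with mean 0 (all coefficients, group counts, and variances here are hypothetical choices):

```python
import random

random.seed(2)

# Simulate model (3): Y = a + gi + b*X + e, where the group
# offsets gi are drawn from a normal distribution with mean 0.
# All coefficients and variances here are hypothetical.
a, b = 1.0, 2.0
n_groups = 5
group_offsets = [random.gauss(0.0, 0.5) for _ in range(n_groups)]

def simulate(i, x):
    """One observation from group i at covariate value x."""
    return a + group_offsets[i] + b * x + random.gauss(0.0, 0.1)

# Every group shares the slope b, but each has its own intercept
# a + gi; model (3) needs only a variance parameter for g, not a
# separate intercept per group as in model (2).
for i in range(n_groups):
    print(i, round(a + group_offsets[i], 2))
```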