Analysis of correlated data (AS09) EPM304 Advanced Statistical Methods in Epidemiology Course: PG Diploma/ MSc Epidemiology This document contains a copy of the study material located within the computer assisted learning (CAL) session. If you have any questions regarding this document or your course, please contact DLsupport via [email protected] . Important note: this document does not replace the CAL material found on your module CDROM. When studying this session, please ensure you work through the CDROM material first. This document can then be used for revision purposes to refer back to specific sessions. These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale or further copying. © London School of Hygiene & Tropical Medicine September 2013 v2.0



Section 1: Analysis of correlated data

Aim

To learn how to analyse correlated observations, i.e. data where observations are not independent, using robust standard errors, generalised estimating equations and random effects models.

Objectives

By the end of this session you will be able to:

  • Describe the effect of intra-cluster correlation
  • Explain why usual methods are not valid
  • Explain why parameter estimates need to be adjusted when data are correlated
  • Use 3 approaches to obtain parameter estimates allowing for correlated data.

This session should take you between 1.5 and 2.5 hours to complete.

Section 2: Planning your study

In this session you will learn how to obtain appropriate parameter estimates and confidence intervals for data that are in some way correlated. So far in the course you have mostly applied methods that assume the data are statistically independent, e.g. logistic regression, Poisson regression, Cox regression. With correlated data the assumption of independence is invalid.

To begin you will look at why the usual methods are not appropriate. You will then look at 3 different approaches to obtain estimates accounting for the fact that observations are correlated.

During the session you will fit logistic and Poisson regression models to correlated data using 3 different approaches, so you may wish to review the appropriate sessions. Since initial concepts are based on likelihood estimation you could also review the likelihood session.

  • Likelihood: SM05
  • Logistic regression: SM07, SM08, SM09
  • Poisson regression: SM11, AS05

Interaction: Hyperlinks: SM05, SM07, SM08, SM09, SM10: each session opens in a new window.

2.1: Planning your study

To illustrate the ideas in this session, you will use a study carried out in Zambia on the impact of HIV on the infectiousness of patients with pulmonary TB. Details are given below.

Details

This was a study of the impact of HIV on the infectiousness of patients with pulmonary TB. It was based on 70 pulmonary TB patients in Zambia, 42 of whom were HIV +ve and 28 of whom were HIV -ve. The aim of the study was to determine whether HIV +ve patients were more or less likely to transmit M. tuberculosis infection to their household contacts. 307 household contacts (181 contacts of HIV +ve index cases) were traced.

Section 3: When are data correlated?

The classical statistical techniques (Mantel-Haenszel) and regression models (logistic regression, Poisson regression, Cox regression) that you have previously used all share a common assumption about the data. Can you remember what this assumption is?

Interaction: Button: clouds picture (pop up box appears, text and interaction appears below):

The assumption made with the classical analyses and generalised linear regression models we have looked at so far is that individual observations are statistically independent of each other. In other words, there is no correlation between individuals.

Can you think of an example where the assumption of statistical independence is not valid?

Interaction: Button: clouds picture (text appears below):

In a cluster randomised trial, groups of individuals, rather than individuals themselves, are randomised to receive a particular intervention. Individuals in the same cluster may, on average, be more similar to each other than to individuals in other clusters. Therefore the assumption of independence can no longer be made.

Why may individuals in clusters be more similar? Some reasons are given below:

  • Individuals in a community tend to behave or respond more like other people in the same community than others in a different community.
  • Individuals may have a level of exposure more like others in the same community than individuals in a different community.
  • An infected individual is more likely to transmit their infection to an individual in the same community than to an individual in a different community.

3.1: When are data correlated?

To illustrate clustering (and therefore correlation) within a population, consider the diagram below. The different coloured circles represent different characteristics for individuals. So individuals with the same colour have similar characteristics. We could select 2 groups randomly from this population. Click 'show' to do this.

Interaction: Button: Show (text appears and diagram changes as shown below):

Notice how each group contains a mixture of individuals of a certain type.

3.2: When are data correlated?

In certain situations, groups of individuals with similar characteristics will be clustered, as shown below. The diagram now represents 2 communities (divided by the line down the middle). We can choose a random sample from each community. Click 'show' to do this.

Interaction: Button: Show (text appears and diagram changes as shown below):

Now, because of the clustering within communities, individuals in the same community are more similar to each other than they are to individuals in a different community. That is, individuals in the same community are correlated with each other.

3.3: When are data correlated?

The tabs below give some other examples of clustering.

Interaction: Tabs: Example 1: Cohort studies which follow a group of individuals over a period of time, making repeated outcome measurements on each individual. The repeated outcome measurements on a particular individual are likely to be correlated with each other.

Interaction: Tabs: Example 2: Ophthalmic studies in which outcome measurements are made on both eyes of each individual in the study. The outcome measurements on the left and right eyes of a particular individual are likely to be correlated with each other.

Interaction: Tabs: Example 3: Family or household studies in which several members of each family/household are studied. Outcome measurements on members of the same household may well be correlated with each other.

3.4: When are data correlated?

The boxes below list a number of situations where there may be clustering (on the left) and some reasons for this clustering (on the right). Can you match up each situation to the corresponding reason? Select the reason for clustering from the dropdown box to the right of each situation.

Situations: Families, Private health clinics, Residential areas, School

Reasons for clustering: Financial, Social, Genetic, Ability

Interaction: Hotspot: Families:

  • Incorrect Response Financial (text appears on bottom right-hand side): There may well be financial clustering within families, since the members of a family are likely to share the same income. However, for this exercise there is a more fundamental answer. Please try again.
  • Incorrect Response Ability (text appears on bottom right-hand side): There may be clustering with respect to ability within families, but for this exercise there is a more fundamental answer. Please try again.
  • Correct Response Genetic (text appears on bottom right-hand side): That's right, you would expect genetic clustering to occur within families.
  • Incorrect Response Social (text appears on bottom right-hand side): There may well be social clustering within families, since the members of a family are likely to belong to the same social group. However, for this exercise there is a more fundamental answer. Please try again.

Interaction: Hotspot: School:

  • Incorrect Response Financial (text appears on bottom right-hand side): There may be financial clustering in schools, for example if it is a private school. However, for this exercise there is another reason for clustering that is more specific to schools. Please try again.
  • Correct Response Ability (text appears on bottom right-hand side): That's right, clustering on the basis of ability can happen within schools, for example in selective schools.
  • Incorrect Response Genetic (text appears on bottom right-hand side): No, there is no reason why there should be genetic clustering within schools. Please try again.
  • Incorrect Response Social (text appears on bottom right-hand side): There may be social clustering in schools, especially where a school's catchment area corresponds to a specific residential area. However, for this exercise there is another reason for clustering that is more specific to schools. Please try again.

Interaction: Hotspot: Residential Areas:

  • Incorrect Response Financial (text appears on bottom right-hand side): There may be financial clustering within a given residential area, but more often you would think of this as social clustering. Please try again.
  • Incorrect Response Ability (text appears on bottom right-hand side): No, you would not necessarily expect clustering with respect to ability to occur within a given residential area. Please try again.
  • Incorrect Response Genetic (text appears on bottom right-hand side): No, you would not necessarily expect genetic clustering to occur within a given residential area. Please try again.
  • Correct Response Social (text appears on bottom right-hand side): That's right, if samples are taken from within a certain residential area, social clustering may occur, because a given residential area will often be home to a particular social class.

Interaction: Hotspot: Private health clinics:

  • Correct Response Financial (text appears on bottom right-hand side): That's right, if samples are taken from within private health clinics, financial clustering will occur, because only the relatively rich can afford private health care.
  • Incorrect Response Ability (text appears on bottom right-hand side): No, there is no reason why clustering with respect to ability should occur within private health clinics. Please try again.
  • Incorrect Response Genetic (text appears on bottom right-hand side): No, there is no reason why genetic clustering should occur within private health clinics. Please try again.
  • Incorrect Response Social (text appears on bottom right-hand side): There may be social clustering within private health clinics, but there is a more fundamental reason for such clustering. Please try again.

3.5: When are data correlated?

If the lack of independence of observations is not taken into account, the main problem in the analysis will be incorrect standard errors. In general, the estimated standard errors will be too small. This leads to confidence intervals that are too narrow and P-values that are too small. In other words, if the data are analysed as though they were independent, the inference may be incorrect.

Interaction: Button: Why:

The data provided by two individuals who are similar (e.g. from the same household) are less informative about the general study population than data from two individuals from different households. So, if we assume independence, we think we have more information than we really do have and so the standard errors are too small. Therefore, the P-values will provide stronger evidence than is really the case.

3.6: When are data correlated?

The correlation induced, for whatever reason, between individuals in a community can be measured by the intra-cluster correlation (also called within-cluster correlation).

ρ ('rho') = intra-cluster correlation coefficient

ρ = 0 means that responses of individuals within the same cluster are no more alike than those of individuals from different clusters.

ρ = 1 means that all responses of individuals within the same cluster are identical.
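To see why positive intra-cluster correlation makes standard errors too small when it is ignored, it may help to work through the simplest case. The sketch below assumes two observations Y_1 and Y_2 from the same cluster with a common variance σ² and correlation ρ. The symbols Y_1, Y_2 and σ² are not part of the session material and are introduced only for illustration.

```latex
\operatorname{Var}\!\left(\frac{Y_1 + Y_2}{2}\right)
  = \frac{1}{4}\left(\sigma^2 + \sigma^2 + 2\rho\,\sigma^2\right)
  = \frac{\sigma^2}{2}\,(1 + \rho)
```

With ρ = 0 this is σ²/2, the usual result for two independent observations. With ρ = 1 it is σ², so the pair carries no more information than a single observation. An analysis that assumes ρ = 0 when in fact ρ > 0 therefore understates the variance.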

3.7: When are data correlated?

An alternative way of thinking about intra-cluster correlation is in terms of between-cluster variation. Intra-cluster correlation means that individuals are more similar to others in the same cluster than to individuals in other clusters. This happens if and only if there are differences between clusters. So intra-cluster correlation and between-cluster variation are two ways of measuring the same phenomenon. If observations are correlated within clusters, then there is variation between clusters.

between-cluster variation ⇔ intra-cluster correlation

3.8: When are data correlated?

So, observations on individuals within the same cluster are correlated. This requires modifications in the statistical methods applied.

Section 4: When are data correlated? Example

Before you look at how to deal with correlated observations, consider this example which illustrates the problem. A study carried out in Zambia looked at the impact of HIV on the infectiousness of patients with pulmonary TB. All contacts underwent a Mantoux test. An induration of diameter ≥ 5 mm was considered positive.

Information was recorded on a number of household level variables, including:

  • HIV status of TB index case
  • crowding

and a number of individual contact level variables, including:

  • age of contact
  • degree of intimacy of contact

The mean number of contacts per index case was:

  • HIV -ve: 4.5 (range 1-11) [126 contacts / 28 HIV -ve index TB cases]
  • HIV +ve: 4.3 (range 1-13) [181 contacts / 42 HIV +ve index TB cases]

4.1: When are data correlated? Example

If some index cases are more infectious than others, or household members share previous exposures to TB, then the outcome (i.e. the result of the Mantoux test) should show some correlation within households. You are going to look at the effect of 3 different approaches to the analysis of correlated data to test this hypothesis:

1. Robust standard errors
2. Generalised estimating equations
3. Random effects models (also called multi-level modelling)

4.2: When are data correlated? Example

The table below shows the distribution of contacts, by outcome of the Mantoux test and HIV status of the index TB case.

                                HIV status of index case
Mantoux test status         Negative         Positive         Total
of household contact        n      %         n      %         n      %
Negative                    36     28.6      87     48.1      123    40.1
Positive                    90     71.4      94     51.9      184    59.9
Total                       126    100.0     181    100.0     307    100.0

From this table, what can you say about the prevalence of tuberculin positivity and odds of tuberculin positivity?

Interaction: Button: clouds picture (pop up box appears):

Overall, 60% (184 / 307) of household contacts were tuberculin-positive. Odds = 184 / 123 = 1.50

The prevalence of tuberculin positivity appears lower among the contacts of HIV +ve index cases (52%) than among contacts of HIV -ve index cases (71%). The respective odds of tuberculin positivity are 1.08 (94 / 87) for contacts of HIV +ve index cases and 2.50 (90 / 36) for contacts of HIV -ve index cases.

4.3: When are data correlated? Example

First, we analyse the data, ignoring any within-household clustering. We can estimate the odds ratio and calculate a 95% CI for the odds ratio using classical methods. Click the 'swap' button to see the classical results.
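The prevalences and odds quoted for the 4.2 table can be checked directly from the four cell counts. The following is a minimal sketch in Python; the function and variable names are illustrative and do not come from the session.

```python
# Cell counts from the 4.2 table: Mantoux result of contacts,
# grouped by the HIV status of the index TB case.
counts = {
    "HIV -ve index": {"negative": 36, "positive": 90},
    "HIV +ve index": {"negative": 87, "positive": 94},
}


def prevalence_and_odds(negative, positive):
    """Return (prevalence, odds) of tuberculin positivity for one group."""
    total = negative + positive
    return positive / total, positive / negative


for group, cells in counts.items():
    prev, odds = prevalence_and_odds(cells["negative"], cells["positive"])
    print(f"{group}: prevalence = {prev:.1%}, odds = {odds:.2f}")

# Pool the two groups to get the overall figures.
all_negative = sum(cells["negative"] for cells in counts.values())
all_positive = sum(cells["positive"] for cells in counts.values())
overall_prev, overall_odds = prevalence_and_odds(all_negative, all_positive)
print(f"Overall: prevalence = {overall_prev:.1%}, odds = {overall_odds:.2f}")
```

These calculations treat every contact as a separate observation, so they describe the table but say nothing yet about clustering within households.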

Interaction: Button: Swap (table from previous page changes to the following):

Classical analysis

Odds ratio    χ²       P > χ²    95% confidence interval
0.43          11.72    0.0006    0.26 to 0.71
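The odds ratio and test statistic in the classical analysis can be reproduced from the 2x2 table. The sketch below assumes the test statistic is the Mantel-Haenszel chi-squared with the (n - 1) multiplier; the session does not say which formula was used, but this form gives 11.72. The confidence interval is not reproduced because the method behind it is not stated.

```python
import math

# 2x2 table of contacts from 4.2, by HIV status of the index case.
a, b = 94, 87   # HIV +ve index case: Mantoux positive, Mantoux negative
c, d = 90, 36   # HIV -ve index case: Mantoux positive, Mantoux negative
n = a + b + c + d

# Odds ratio for tuberculin positivity, HIV +ve versus HIV -ve index case.
odds_ratio = (a * d) / (b * c)

# Assumed form of the test: Mantel-Haenszel chi-squared on 1 degree of freedom.
chi2 = (n - 1) * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Upper tail area of a chi-squared(1) variable, using its link to the
# standard normal distribution: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"OR = {odds_ratio:.2f}, chi2 = {chi2:.2f}, P = {p_value:.4f}")
```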

How would you interpret the model estimates?

Interaction: Button: clouds picture (pop up box appears):

The results suggest that the odds of being tuberculin positive among household contacts of HIV +ve TB cases are around half (0.43) those among household contacts of HIV -ve index cases. Ignoring the clustering, we would conclude that there was strong evidence that this association was not due to chance (P = 0.0006). However, if there is clustering within households, this conclusion could be wrong.

4.4: When are data correlated? Example

The methods of estimation you have used so far make the assumption that observations are independent of each other. In EPM202, you saw that these estimation methods are based on likelihood. You can write down a likelihood in terms of the risk, log odds, rate or log rate parameter by multiplying together the probabilities of each individual observation, as follows (writing π for the parameter, for example the risk):

$\pi \times \pi \times (1 - \pi) \times \dots \times (1 - \pi)$

Click here to review this from session SM05.

Interaction: Hyperlink: review this:

A point estimate of the parameter of interest is obtained by choosing the parameter value that maximises the likelihood. An approximate confidence interval for the parameter estimate can be found using the quadratic approximation to the log likelihood.

Click here to review this from session SM05.

Interaction: Hyperlink: review this:

In other words, likelihood was used to derive the best estimate of the parameter and its standard error. Standard errors like this are sometimes called "model-based" standard errors; these are what you will be most familiar with from this course.

Section 5: Robust standard errors

One useful approach to derive standard errors that allow for the clustering is to use what are called robust standard errors. While model-based standard errors are based on predicted variability, robust standard errors are based on observed variability. Using robust standard errors, you can then obtain appropriate confidence intervals and P-values.

5.1: Robust standard errors

Robust standard errors are based on the sum of the squared residuals:

$\text{Variance} \propto \sum_i r_i^2$

The $r_i$ terms are the residuals. The residuals are the difference between the outcome observed and the outcome predicted by the model.

When observations are independent, the summation is performed on the individual-level residuals. If data are "clustered", then cluster-level residuals are calculated and summed over the clusters.

Note: This does not make any assumptions about independence within clusters but does assume that there is independence between clusters.

5.2: Robust standard errors

Let's look at how robust standard errors work in an example. Consider again the study of tuberculin positivity and contact with an index case with or without HIV. The variable HIV1 indicates contacts whose index case is HIV +ve. Click the 'swap' button to see the estimates on an odds ratio scale.

Interaction: Button: Swap (the table changes from the second to the first below):

Estimates from logistic model (ignoring clusters) on an odds ratio scale

Mantoux     Odds ratio    Standard err.   z         P > |z|    95% conf. interval
HIV1        0.43          0.106760        -3.396    0.001      0.27 to 0.70

Estimates from logistic model (ignoring clusters) on a log scale

Mantoux     Coefficient   Standard err.   z         P > |z|    95% conf. interval
HIV1        -0.838904     0.247025        -3.396    0.001      -1.323065 to -0.354744
Constant     0.916291     0.197203         4.646    < 0.001     0.529781 to  1.302801

Log likelihood = -200.70621
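Because this model has a single binary exposure, the entries in the log-scale table can be checked by hand from the 2x2 table in 4.2. The sketch below relies on two standard results for this special case: the maximum likelihood estimates equal the observed log odds and log odds ratio, and the model-based standard error of the log odds ratio is the square root of the sum of the reciprocal cell counts. The variable names are illustrative. The log odds ratio comes out negative because the odds ratio is below 1.

```python
import math

# Contacts by HIV status of the index case.
pos_hiv, neg_hiv = 94, 87   # index case HIV +ve: Mantoux positive, negative
pos_ref, neg_ref = 90, 36   # index case HIV -ve (baseline group)

# Maximum likelihood estimates for a logistic model with one binary exposure.
constant = math.log(pos_ref / neg_ref)          # baseline log odds
coef = math.log(pos_hiv / neg_hiv) - constant   # log odds ratio for HIV1

# Model-based standard errors (independence assumed).
se_coef = math.sqrt(1 / pos_hiv + 1 / neg_hiv + 1 / pos_ref + 1 / neg_ref)
se_const = math.sqrt(1 / pos_ref + 1 / neg_ref)

z = coef / se_coef
ci = (coef - 1.96 * se_coef, coef + 1.96 * se_coef)


def group_log_likelihood(positive, negative):
    """Binomial log likelihood of one group at its fitted probability."""
    p = positive / (positive + negative)
    return positive * math.log(p) + negative * math.log(1 - p)


log_lik = (group_log_likelihood(pos_hiv, neg_hiv)
           + group_log_likelihood(pos_ref, neg_ref))

print(f"HIV1: coef = {coef:.6f}, SE = {se_coef:.6f}, z = {z:.3f}")
print(f"Constant: coef = {constant:.6f}, SE = {se_const:.6f}")
print(f"95% CI for log(OR): {ci[0]:.3f} to {ci[1]:.3f}")
print(f"Log likelihood = {log_lik:.5f}")
```

None of these calculations uses household membership, which is why the results match an analysis that ignores clustering.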

The table above shows estimates of a logistic regression model. This model ignores any household clustering. Click below for further explanation of the model.

Interaction: Button: Explanation (pop up box appears):

This model is similar to many models we have fitted before. All the calculations are performed assuming that all observations are independent. The parameter estimate of the log(OR) for the effect of HIV (-0.8389) is obtained by maximising the log likelihood. The standard error of this estimate (0.247025) is obtained through the quadratic approximation to the log likelihood.

How does this estimate compare to the one we obtained earlier using classical methods?

Interaction: Button: clouds picture (pop up box appears):

On the odds ratio scale we obtain the same estimate (0.432) and a similar confidence interval to those we obtained using classical methods.

5.3: Robust standard errors

To allow for correlation within households you need to use robust standard errors that are calculated using residuals at the cluster level. The table below shows this. Use the 'swap' button to see the previous model, with robust standard errors calculated at the individual level.

Notice that the estimate for the effect of HIV on tuberculin positivity is the same as before. This is because the only difference is in the calculation of the standard errors.

Interaction: Button: Swap (the table changes from the second to the first below):

Estimates from a logistic regression model with robust standard errors calculated at the individual level

Mantoux     Coefficient   Standard err.   z         P > |z|    95% confidence interval
HIV1        -0.838904     0.247429        -3.390    0.001      -1.323855 to -0.353953
Constant     0.916291     0.197525         4.639    < 0.001     0.529150 to  1.303432

Log likelihood = -200.70621
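The individual-level robust standard error for HIV1 can be reproduced from the 2x2 table, which may help to show what "based on observed variability" means in practice. The sketch below uses a "sandwich" calculation: the model-based variance (the "bread", built from predicted variability) is wrapped around a term built from squared residuals (the "meat"). The names `bread` and `meat` and the small-sample factor n / (n - 1) are assumptions made for this illustration; the session does not state which correction the software applied, but this factor reproduces the tabulated value. The household-level version cannot be reproduced here because household identifiers are not given in this document.

```python
import math

# Individual records summarised as (HIV1, Mantoux positive, number of contacts).
cells = [(1, 1, 94), (1, 0, 87), (0, 1, 90), (0, 0, 36)]
n_obs = sum(count for _, _, count in cells)

# Fitted probabilities from the logistic model with HIV1 as the only covariate.
# For this model they equal the observed proportions in each group.
fitted = {1: 94 / 181, 0: 90 / 126}


def inverse_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def multiply_2x2(p, q):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


information = [[0.0, 0.0], [0.0, 0.0]]  # predicted variability, p(1 - p)
meat = [[0.0, 0.0], [0.0, 0.0]]         # observed variability, residual squared
for hiv, outcome, count in cells:
    x = (1.0, float(hiv))               # covariates: (constant, HIV1)
    p = fitted[hiv]
    residual = outcome - p
    for i in range(2):
        for j in range(2):
            information[i][j] += count * x[i] * x[j] * p * (1 - p)
            meat[i][j] += count * x[i] * x[j] * residual ** 2

bread = inverse_2x2(information)
sandwich = multiply_2x2(multiply_2x2(bread, meat), bread)

model_se = math.sqrt(bread[1][1])
# Assumed small-sample correction n / (n - 1).
robust_se = math.sqrt(n_obs / (n_obs - 1) * sandwich[1][1])

print(f"Model-based SE of log(OR): {model_se:.6f}")
print(f"Robust SE (individual level): {robust_se:.6f}")
```

With individual-level residuals the robust and model-based standard errors are almost identical in this example. It is only when residuals are summed within households before squaring that the standard error grows to reflect clustering.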

Estimates from a logistic regression model with robust standard errors calculated at the household level

Mantoux     Coefficient   Standard err.   z         P > |z|    95% confidence interval
HIV1        -0.838904     0.332274        -2.525    0.012      -1.490150 to -0.187658
Constant     0.916291     0.266485         3.438    0.001       0.393989 to  1.438593

Log likelihood = -200.70621

Interaction: Tabs: Question 1: How do the standard errors compare to the two previous models and how will this affect the inference made about the effect of HIV contact on tuberculin positivity?

Interaction: Button: clouds picture (pop up box appears):

The standard errors are now quite a lot bigger. The standard error of log(OR) has increased from 0.25 to 0.33. As a consequence, the P-value for the null hypothesis, that there is no association between the HIV status of the index case and the odds of tuberculin positivity in household contacts, has also got bigger.

Interaction: Tabs: Question 2: Converting back to the odds ratio scale, we obtain OR = exp(-0.839) = 0.43 (95% CI: 0.23, 0.83); P = 0.012. So what can you conclude about the effect of HIV contact from this analysis?

Interaction: Button: clouds picture (pop up box appears):

After adjusting for clustering within households (using robust standard errors) it appears that there is still evidence of an association between HIV contact and tuberculin positivity. The odds of tuberculin positivity for contacts of HIV +ve TB cases are approximately half those for contacts of HIV -ve TB cases.

5.4: Robust standard errors

An important point to note is that the log-likelihood shown in the output (-200.7) has been identical for both analyses:

  • Logistic regression with robust standard errors based on individual-level residuals
  • Logistic regression with robust standard errors based on cluster-level residuals

Why should this be?

Interaction: Button: clouds picture (pop up box appears):

Initially, standard maximum likelihood estimation is performed in each analysis, and it is only afterwards that the standard errors for the parameter estimates are computed using the robust approach. The log-likelihood does not take account of the clustering. Therefore you cannot use the log likelihood from the "robust" analysis to perform a likelihood ratio test that takes correlations in the data into account.

5.5: Robust standard errors Summary

The tabs opposite summarise the robust standard error approach in the analysis of correlated data.

Interaction: Tabs: 1: The robust standard error method uses the standard maximum likelihood approach to obtain parameter estimates, ignoring any correlations in the data. Therefore, robust standard errors do not affect the parameter estimate.

Interaction: Tabs: 2: Instead of using the quadratic approximation to the log-likelihood to obtain a standard error, this method calculates robust standard errors using household-level (or cluster-level) residuals to take account of correlations between individuals in the same household (or cluster).

Interaction: Tabs: 3: The log-likelihood does not take account of clustering. Likelihood ratio tests are not valid, since they ignore any correlations in the data.

Interaction: Tabs: 4: Robust standard errors will be correct providing our model is correct, and we have a reasonable number of clusters, say 30 or more.
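The conversion to the odds ratio scale quoted in Question 2 of 5.3 can be checked from the coefficient and its household-level robust standard error. This is a minimal sketch: the coefficient is entered with its negative sign (the odds ratio is below 1), and the P-value uses the standard normal distribution for the z statistic. The variable names are illustrative.

```python
import math

coef = -0.838904       # log(OR) for HIV1 from the logistic model
robust_se = 0.332274   # robust standard error calculated at household level

z_crit = 1.959964      # 97.5th percentile of the standard normal distribution

odds_ratio = math.exp(coef)
ci_low = math.exp(coef - z_crit * robust_se)
ci_high = math.exp(coef + z_crit * robust_se)

# Two-sided P-value for the Wald z statistic.
z = coef / robust_se
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"OR = {odds_ratio:.2f} (95% CI: {ci_low:.2f}, {ci_high:.2f}); "
      f"z = {z:.3f}, P = {p_value:.3f}")
```

The point estimate is unchanged from the analysis that ignores clustering. Only the interval and P-value differ, because only the standard error has been replaced.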

Section 6: Generalised estimating equations

One weakness of the robust standard error approach is that it ignores clustering when calculating the effect estimates (e.g. the odds ratio); it is only the standard errors that are adjusted. This means that, for the calculation of the effect estimate, the same weight is given to an individual in a household with many individuals as to an individual who is the only contact in a household.

Interaction: Button: More (text appears below):

To account for clustering, do you think the weight for individuals in a household with many individuals should be lower or higher than for individuals in households with only one contact?

Interaction: Button: Lower: Correct Response: That's correct, if there is within-household correlation, relatively less weight should be given to each individual in the household with many individuals than to the individual who is an only contact. This is because the many individuals share the same household information.

Interaction: Button: Higher: Incorrect Response: In fact, if there is within-household correlation, relatively less weight should be given to each individual in the household with many individuals than to the individual who is an only contact. This is because the many individuals share the same household information.

6.1: Generalised estimating equations

Generalised estimating equations (GEE) use robust standard errors, but also take account of correlations when estimating the measure of effect, e.g. the odds ratio. Therefore, this method gives different weights to individuals, depending on how many individuals are in the household.

When using GEEs, you must think about how the observations in a data set are likely to be correlated with each other. The three standard options for this are given opposite.

Interaction: Tabs: Independence: This choice implies that you don't think the data are correlated. If you don't think the data are correlated, you probably don't need to be using GEE.

Interaction: Tabs: Exchangeable: This choice implies that within a "cluster", e.g. a household, any two observations are equally correlated, but that there is no correlation between observations from different "clusters". This is a common choice.

Interaction: Tabs: Autocorrelation: This choice is useful for measures repeated over time, e.g. repeated measurements on the same individual such as episodes of diarrhoea. Repeated measurements on an individual are likely to be most strongly correlated when they are made a short time apart. The greater the time interval between two measurements, the smaller the correlation is likely to be.

6.2: Generalised estimating equations

Let's first look at a GEE analysis that makes the working assumption that all observations are independent. The table below shows estimates from such a model.

Estimates from GEE analysis with non-robust standard errors

Mantoux     Coefficient   Standard err.   z         P > |z|    95% confidence interval
HIV1        -0.838904     0.247025        -3.396    0.001      -1.323065 to -0.354744
Constant     0.916291     0.197203         4.646    < 0.001     0.529781 to  1.302801
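The three working correlation options described in 6.1 can be pictured as correlation matrices for the observations within one cluster. The sketch below is illustrative only: the function names, the value ρ = 0.3 and the choice of a first-order autoregressive form (correlation ρ raised to the power of the time gap) for the autocorrelation option are assumptions, not taken from the session. It also shows why an exchangeable structure gives less weight to each contact in a large household. When the exposure is the same for everyone in the household, as with the HIV status of the index case, each contact's contribution is scaled by 1 / (1 + (n - 1)ρ), where n is the household size.

```python
def independence(n):
    """Working correlation matrix with no correlation between observations."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def exchangeable(n, rho):
    """Any two observations in the cluster have the same correlation rho."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def autoregressive(n, rho):
    """Assumed AR(1) form: correlation decays as the time gap |i - j| grows."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def relative_weight(n, rho):
    """Weight per individual in a cluster of size n, relative to a sole contact.

    Every row of the exchangeable matrix sums to 1 + (n - 1) * rho, so the
    inverse matrix applied to a vector of ones gives 1 / (1 + (n - 1) * rho).
    """
    row_sum = sum(exchangeable(n, rho)[0])
    return 1.0 / row_sum


RHO = 0.3  # hypothetical within-household correlation

for row in exchangeable(3, RHO):
    print(row)

for size in (1, 2, 5, 10):
    print(f"household size {size}: weight per contact = "
          f"{relative_weight(size, RHO):.3f}")
```

With ρ = 0 every contact has weight 1 whatever the household size, which corresponds to the independence option.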

You should notice that the parameter estimates and standard errors in the table above are identical to those we obtained using the standard likelihood approach using a logistic regression model. This is because the model:

a) assumes independence
b) does not use robust standard errors

6.3: Generalised estimating equations

The results below were obtained from a GEE analysis with robust standard errors. Notice the parameter estimate for the log(OR) is the same as before. This analysis takes no account of the within-household correlation when estimating the effect of exposure variables.

Estimates from GEE analysis with robust standard errors

Mantoux     Coefficient   Standard err.   z         P > |z|    95% confidence interval
HIV1        -0.838904     0.332274        -2.525    0.012      -1.490150 to -0.187658
Constant     0.916291     0.266485         3.438    0.001       0.393989 to  1.438593

Click below to view the model that allows for correlation within households, which you saw on the previous page.

Interaction: Button: Swap (table changes to the following):

Estimates from a logistic regression model with robust standard errors, adjusted for clustering

Mantoux     Coefficient   Standard err.   z         P > |z|    95% confidence interval
Ihiv_2      -0.838904     0.332274        -2.525    0.012      -1.490150 to -0.187658
_cons        0.916291     0.266485         3.438    0.001       0.393989 to  1.438593

How does the standard error for the log(OR) in the GEE analysis with robust standard errors compare to the earlier model?

Interaction: Button: clouds picture (pop up box appears):

The standard error obtained from this analysis is identical to that obtained using the robust standard error approach that allows for correlation within households. The GEE analysis automatically adjusts the standard errors to take account of the within-household correlation. However, it has taken no account of the within-household correlation when obtaining the parameter estimate (the log odds ratio).

6.4: Generalised estimating equations

You can also fit a model using GEE that assumes exchangeable correlations within households, i.e. it accounts for within-household correlation in the estimation of the parameter estimate, e.g. the log(OR).

How do the estimates compare to the previous model without correlation?

Interaction: Button: clouds picture (pop up box appears):

Notice that the standard errors are similar in magnitude to those of the previous analysis, but that now the parameter estimate of the log odds ratio has changed from -0.8389 to -0.9683.

Estimates from GEE analysis with robust standard errors, accounting for household correlation

Mantoux     Coefficient   Standard err.   z         P > |z|    95% confidence interval
HIV1        -0.968266     0.327340        -2.958    0.003      -1.609840 to -0.326692
Constant     1.010946     0.261726         3.863    < 0.001     0.497972 to  1.523920

Do you know how this analysis accounts for the within-household correlation in the estimate of the log(OR)?

Interaction: Button: clouds picture (pop up box appears):

This analysis takes account of within-household correlation when estimating the log odds ratio, i.e. it gives relatively less weight to contacts in large households. On an odds ratio scale, the model estimates are: OR = 0.38 (95% CI: 0.20, 0.72)

6.5: Generalised estimating equations Summary

The main aspects of GEE analysis are:

1. GEE can include robust standard errors.
2. You need to specify how you think the data are correlated. The usual choice is 'exchangeable'.
3. If an exchangeable correlation is specified, point estimates, e.g. odds ratio, rate ratio, are adjusted for correlations in the data.
4. Within a GEE analysis likelihood ratio tests are not valid.

Section 7: Random effects models

Robust standard errors and generalised estimating equations are two practical approaches to dealing with correlated observations. However, they are not based on a full (probability) model for the data. Therefore statisticians usually prefer to use another approach.

The third approach is to use random effects models, also known as multilevel models. Random effects models include the variation between clusters explicitly in the likelihood and therefore take account of intra-cluster correlations.

7.1: Random effects models

Suppose for a moment that, in our study of tuberculin positivity in household contacts, all our contacts lived in different households and could be considered to be independent.

Interaction: Button: Show (text appears below):

Assuming independence, individuals' contributions to the likelihood can be multiplied together to produce the full likelihood. We then maximise the full likelihood and find quadratic approximations for it. In a simple logistic regression model, for individual j, the log odds of positivity is given by:

log(odds)_j = (baseline log odds) + log(OR_HIV) × (HIV status)_j

where (HIV status)_j is an indicator variable which takes the value 0 if the index case for household contact j is HIV -ve and 1 if the index case for household contact j is HIV +ve.

7.2: Random effects models

In reality, several individuals will live in the same household and will therefore be exposed to the same index case and share other potential risk factors, i.e. they are clustered!

Interaction: Button: Show (text appears below):
In this situation, the random effects model states that for individual j in household i the log odds of tuberculin positivity are given by:

log(odds)_ij = (baseline log odds) + log(OR_HIV) × (HIV status)_ij + u_i

7.3: Random effects models

The additional term in this model is the last term, u_i. Each household is allowed its own value of u_i. The log odds of all individuals in the household are shifted by this amount. This makes them similar to each other to some extent (within-household correlation), and different, to some extent, from individuals in other households (between-household variation). In the example, the u_i would reflect both the infectivity of the particular tuberculosis case in the household, and any other past shared household exposure to TB.

7.4: Random effects models

The random effects model assumes that the household effects are drawn from a probability distribution, hence they are "random" (rather than fixed) effects.
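The model in 7.2 and 7.3 can be made concrete by simulating from it. The following is a minimal sketch, assuming normally distributed household effects and purely illustrative parameter values (none of the numbers below come from the tuberculin study). The function names are hypothetical.

```python
import math
import random
from statistics import pvariance

def simulate(n_households, size, baseline, log_or_hiv, sigma_u, seed=1):
    """Simulate (household, hiv_status, positive) records from
    log(odds)_ij = baseline + log_or_hiv * (HIV status)_ij + u_i."""
    rng = random.Random(seed)
    records = []
    for i in range(n_households):
        hiv_status = i % 2                 # cluster-level exposure: same for the whole household
        u_i = rng.gauss(0.0, sigma_u)      # household effect, shared by every member
        log_odds = baseline + log_or_hiv * hiv_status + u_i
        p = 1.0 / (1.0 + math.exp(-log_odds))
        for _ in range(size):
            records.append((i, hiv_status, int(rng.random() < p)))
    return records

def variance_of_household_proportions(records, hiv_status):
    """Variance, across households, of the proportion of positive contacts."""
    totals = {}
    for household, hiv, positive in records:
        if hiv == hiv_status:
            n, k = totals.get(household, (0, 0))
            totals[household] = (n + 1, k + positive)
    return pvariance([k / n for n, k in totals.values()])

# Illustrative (hypothetical) values: 2000 households of 5 contacts each.
independent = simulate(2000, 5, baseline=1.0, log_or_hiv=-1.0, sigma_u=0.0)
clustered = simulate(2000, 5, baseline=1.0, log_or_hiv=-1.0, sigma_u=1.0)

var_independent = variance_of_household_proportions(independent, hiv_status=0)
var_clustered = variance_of_household_proportions(clustered, hiv_status=0)
print(var_independent, var_clustered)
```

With `sigma_u = 0` households differ only by chance. With `sigma_u > 0` the shared u_i makes members of the same household more alike, so the household proportions vary more between households.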

For logistic regression models we usually assume that the u_i are normally distributed, with mean 0 and variance σ_u². The only extra parameter that has to be estimated is σ_u, rather than a specific value of u_i for each household.

7.5: Random effects models

Estimates from a random effects model for the effect of HIV status on tuberculin positivity are shown below.

Estimates from a random effects model

Mantoux     Coefficient   Standard err.   z        P > |z|   95% confidence interval
HIV1        -1.148913     0.393322        -2.921   0.003     -1.919810, -0.378017
Constant     1.198623     0.316097         3.792   0.000      0.579085,  1.818162
log(σ_u²)   -0.034562     0.496553        -0.070   0.945     -1.007788,  0.938665
σ_u          0.982868     0.244023                            0.604173,  1.598926
ρ            0.491360     0.124101                            0.267413,  0.718830
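The last three rows of the table are different expressions of the same quantity. The tabulated values are consistent with σ_u = exp(log(σ_u²)/2) and with ρ = σ_u²/(σ_u² + 1). The second relationship is inferred from the numbers rather than stated in the session, and other software may define the within-cluster variance differently (for example as π²/3 for a logistic model), which would give a different ρ. A short check in Python, taking the log(σ_u²) estimate as -0.034562 with interval -1.007788 to 0.938665:

```python
import math

# log(sigma_u^2) row of the random effects table (estimate, lower, upper).
ln_sigma2_u, ln_lower, ln_upper = -0.034562, -1.007788, 0.938665

# sigma_u is the square root of exp(log(sigma_u^2)).
sigma_u = math.exp(ln_sigma2_u / 2)
sigma_u_lower = math.exp(ln_lower / 2)
sigma_u_upper = math.exp(ln_upper / 2)

# Assumed relationship (inferred from the table, not stated in the session).
rho = sigma_u ** 2 / (sigma_u ** 2 + 1)

print(f"sigma_u = {sigma_u:.6f} ({sigma_u_lower:.6f}, {sigma_u_upper:.6f})")
print(f"rho = {rho:.6f}")
```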

Log likelihood = -194.26

The model gives an estimate of the log odds ratio for the effect of HIV (-1.1489) and its standard error (0.39). Converting back to the odds ratio scale we obtain OR = 0.32 (95% CI: 0.15, 0.69).

7.6: Random effects models

Now consider the other estimates given in the model, σ_u and ρ.

Interaction: Tabs: σ_u:
With the random effects model an estimate of σ_u is given (0.98). This is a measure of how much u_i varies between households. If there is no clustering within households, then σ_u will equal 0.

Interaction: Tabs: ρ:
ρ is a measure of the within-cluster correlation. It is also called the intra-class correlation coefficient.
Its value depends on the relative size of within and between household variation.
If there is no clustering within households, then ρ will equal 0.
The closer ρ is to 1, the greater is the clustering within households.
For the tuberculin data set, ρ = 0.49, which is quite large and indicates considerable within-household correlation.

7.7: Random effects models

In a random effects model the likelihood is fully specified and all results are derived from the likelihood, so it is valid to perform a likelihood ratio test. Using this we can test the null hypothesis of no within-household clustering, versus the alternative of some within-household clustering, by testing the null hypothesis that ρ = 0. Click below to see the likelihood ratio test.

Interaction: Button: Show (text appears below):
The likelihood ratio test of the null hypothesis that ρ = 0 is obtained by comparing the log-likelihood of the original logistic regression model (in which σ_u and ρ are 0) with the log-likelihood for the random effects model.

Log-likelihood (original) = -200.71
Log-likelihood (random effects) = -194.26
LRS = 2 × (200.71 - 194.26) = 12.9

The resulting P-value for this is P = 0.0003. What can you conclude from this test?

Interaction: Button: clouds picture (pop up box appears):
The result of this test indicates strong evidence of within-household clustering in this model.

7.8: Random effects models

The likelihood for the random effects logistic regression model contains a mixture of normal (for the additional variation) and binomial (for the individual outcome data) distributions, and so it is very complicated. Parameter estimates are obtained using numerical approximations, and the reliability of these approximations should be checked.
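The likelihood ratio test in 7.7 needs only the two log-likelihoods, -200.71 and -194.26. A minimal sketch in Python, assuming a chi-squared reference distribution with 1 degree of freedom (an assumption, chosen because it reproduces the P-value quoted in 7.7); the helper `chi2_sf_1df` is hypothetical and uses only the standard library:

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

loglik_logistic = -200.71        # ordinary logistic regression (no household effect)
loglik_random_effects = -194.26  # random effects logistic regression

# Likelihood ratio statistic: twice the gain in log-likelihood.
lrs = 2.0 * (loglik_random_effects - loglik_logistic)
p_value = chi2_sf_1df(lrs)
print(f"LRS = {lrs:.1f}, P = {p_value:.4f}")
```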

Checking the reliability of the approximations is especially important when ρ is large (> 0.25, say), or the number of observations per cluster (individuals per household in our example) is large (> 20, say).

Interaction: Button: clouds picture (pop up box appears):
In the study of tuberculin positivity, a check suggests that the approximations may not be reliable! You will see how to check the reliability of the approximations in Practical 9. As a result it may be safer to use the results from the GEE analysis, even though this approach is less satisfying from a statistical point of view.

7.9: Random effects models

The problem with approximations applies particularly to random effects logistic regression models. The combination of binomial and normal distributions is especially problematic. The problem does not arise when the outcome is continuous (normally distributed) with normally distributed random effects. In this case we are combining normal distributions together and we can solve these equations without needing to use approximations.

Interaction: Button: More (text appears below):
The same is true for random effects Poisson regression, if we assume the random effects follow a Gamma distribution. The combination of Poisson and Gamma distributions produces another distribution, the negative binomial, and again we can solve these equations without resorting to approximations. This is appropriate for dealing with overdispersion (click here to review this from AS05).

Interaction: Hyperlink: review this: A window opens with AS05

However, if we specify in our model that the Poisson random effects are normally distributed, then we run into the same problems we face with random effects logistic regression, and the reliability of the estimates should be checked.

7.10: Random effects models Summary

1. A random effects model specifies the form of the between-cluster variation and includes it in the likelihood.

2. The point estimates, standard errors and log-likelihood obtained from a random effects model all take account of the clustering (assuming that the random effects distribution is correctly specified).
3. Likelihood ratio tests are valid.
4. Estimates of the between-cluster variation and intra-cluster correlation are obtained.
5. There needs to be a reasonable number of "clusters" in the dataset for the method to be reliable.
6. When performing random effects logistic regression analysis, the reliability of the estimates should be checked, especially when ρ is large.

Section 8: Comparison of different approaches

Click on each of the analyses listed below to see the results obtained for the effect of HIV on tuberculin positivity in contacts of the index TB case.
1. Ordinary logistic regression
2. Logistic regression with robust standard error adjusted for clustering
3. Logistic regression with GEE, exchangeable correlation matrix and robust standard errors
4. Random effects logistic regression

Interaction: Hyperlink: Ordinary logistic regression (text appears below):
Ordinary logistic regression analysis fails to take account of the possibility that individuals living in the same household are likely to be more similar to each other than to individuals in other households. This approach is invalid in this dataset.

Results
Odds ratio for HIV (95% CI)   Standard error of log odds ratio   P-value
0.43 (0.27, 0.70)             0.25                               0.001

Interaction: Hyperlink: Logistic regression with robust standard error adjusted for clustering (text appears below):
The use of robust standard errors improves an ordinary logistic regression analysis by taking account of possible clustering when computing the standard errors, but it ignores clustering when estimating the odds ratio. Therefore, the same estimate is obtained for the odds ratio but the standard error is larger.

This approach is valid, but not optimal.

Results
Odds ratio for HIV (95% CI)   Standard error of log odds ratio   P-value
0.43 (0.23, 0.83)             0.33                               0.01

Interaction: Hyperlink: Logistic regression with GEE, exchangeable correlation matrix and robust standard errors (text appears below):
A GEE analysis with robust standard errors and an exchangeable correlation matrix is a further improvement, since it takes account of clustering when estimating the odds ratio. This approach is valid, but somewhat unsatisfying from a statistical perspective.

Results

Odds ratio for HIV (95% CI)   Standard error of log odds ratio   P-value
0.38 (0.20, 0.72)             0.33                               0.003

Interaction: Hyperlink: Random effects logistic regression (text appears below):
A random effects model uses a different approach to deal with clustering. The clustering is incorporated explicitly in the likelihood to obtain estimates of the odds ratio and its standard error. This approach is valid. When fitting random effects logistic regression models, the reliability of the estimates should be checked.

Results
Odds ratio for HIV (95% CI)   Standard error of log odds ratio   P-value
0.32 (0.15, 0.69)             0.39                               0.003

8.1: Comparison of different approaches

It is preferable to use a method that takes into account the fact that data provided by two individuals who are similar (e.g. from the same household) are less informative about the general study population than data from two individuals from different households. That is, it is preferable to use a method that takes into account that data provided by two individuals in a large cluster (e.g. a large household) are less informative about the general population than data provided by two individuals in a small cluster (e.g. a small household).

GEE and random effects models both take the above into account, but the analysis using robust standard errors alone does not.

The odds ratio estimates produced by GEE and the random effects model are different. This is because they actually estimate slightly different things. This is explained on the tabs below.

Interaction: Tabs: GEE:
The odds ratio estimated by GEE is often called the population average odds ratio. This represents:

(odds of the average household contact of an HIV +ve person being tuberculin positive) / (odds of the average household contact of an HIV -ve person being tuberculin positive)

Interaction: Tabs: Random effects:
The odds ratio estimated from a random effects model is often called the cluster-specific odds ratio. This represents:

(an individual's odds of being tuberculin positive if their index case is HIV +ve) / (the same individual's odds of being tuberculin positive if their index case is HIV -ve)
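The difference between the two odds ratios can be illustrated numerically. The sketch below takes the random effects estimates from 7.5 as if they were the true parameter values (an assumption made only for illustration) and averages each probability over the household effects u_i by simple numerical integration. It then forms the odds ratio of those averaged probabilities. The function names and the midpoint rule are illustrative choices, not the method used by GEE software.

```python
import math

def expit(x):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def population_average_probability(log_odds, sigma_u, n_points=2001, width=8.0):
    """Average expit(log_odds + u) over u ~ Normal(0, sigma_u^2) by the midpoint rule."""
    if sigma_u == 0:
        return expit(log_odds)
    lower = -width * sigma_u
    step = 2 * width * sigma_u / n_points
    total = 0.0
    for k in range(n_points):
        u = lower + (k + 0.5) * step
        density = math.exp(-0.5 * (u / sigma_u) ** 2) / (sigma_u * math.sqrt(2 * math.pi))
        total += expit(log_odds + u) * density * step
    return total

def odds(p):
    return p / (1.0 - p)

# Random effects estimates from 7.5, treated here as known values.
baseline, log_or_hiv, sigma_u = 1.198623, -1.148913, 0.982868

cluster_specific_or = math.exp(log_or_hiv)
p_hiv_negative = population_average_probability(baseline, sigma_u)
p_hiv_positive = population_average_probability(baseline + log_or_hiv, sigma_u)
population_average_or = odds(p_hiv_positive) / odds(p_hiv_negative)

print(f"cluster-specific OR   = {cluster_specific_or:.2f}")
print(f"population-average OR = {population_average_or:.2f}")
```

Under these assumptions the population-average odds ratio lies closer to 1 than the cluster-specific odds ratio. Setting `sigma_u` to 0 makes the two coincide.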

8.2: Comparison of different approaches

The two measures will only be the same if there is no random effect. What do we mean by this?

Interaction: Button: clouds picture (text appears below):
If there is no within-household clustering then there will be no random effect. When there are random effects (clustering), the estimate from the random effects model (also called the cluster-specific estimate) will be more extreme, i.e. further from the null value of 1.

GEE model estimates: Odds ratio for HIV = 0.38 (95% CI: 0.20, 0.72)

Random effects model estimates: Odds ratio for HIV = 0.32 (95% CI: 0.15, 0.69)

8.3: Comparison of different approaches

Which is the measure of choice?

Because the random-effects OR estimates the effect of the risk factor at the level of the individual, it is arguable that it is preferable to the "population-average" estimate given by GEE. However, in most circumstances, there will be little difference between the cluster-specific and population-average results. In such situations it does not matter which estimate is used.

The difference between the two methods will only be large if there is substantial between-cluster variation and, in at least a proportion of clusters, the outcome is common. In such circumstances, the odds ratio will not approximate the risk/rate ratio, and in communicating your results to a wider audience the biggest problem you are likely to face is explaining how to interpret an odds ratio, whether it be "population average" or "cluster specific".

Section 9: Summary

This is the end of AS09. When you are happy with the material covered here please move on to session AS10. The main points of this session will appear below as you click on the relevant title.

The problem with correlated data
When datasets contain observations that are correlated (clustered), such as repeated outcome measurements on individuals, or outcome measurements on several individuals in the same household, then standard methods of data analysis are invalid, and produce confidence intervals that are too narrow and P-values that are too small.

Robust standard errors
Robust standard errors can be obtained using residuals calculated at the cluster level, to account for correlation. The clustering in the data is not taken into account when estimating parameters, which is a disadvantage of this method. However, an advantage of this method is that it always works: you do not get problems of model convergence such as sometimes occur when using GEE or random effects models. Another advantage is that it is a simple and intuitive approach.

Generalised estimating equations
GEEs incorporate robust standard errors and also take correlation into account when estimating parameters, e.g. log(OR). GEE is a pragmatic, relatively simple approach. However, sometimes there are convergence problems with the model when using GEE, in which case the method using robust standard errors alone (and not taking into account the within-cluster correlation when estimating parameters), or a random-effects model, should be used.

Random effects models
Random effects models take account of the clustering within the data when estimating parameters and their standard errors. They are more satisfactory from a statistical point of view than either the method using robust standard errors alone or GEE, because they specify a full probability model to explain the data. They work well for Poisson models when the between-cluster variation is assumed to follow a gamma distribution, and for quantitative outcome data when the between-cluster variation is assumed to follow a normal distribution. So, in general, you should use a random-effects model when you have correlated data and you are modelling rates or a quantitative outcome. However, for logistic regression (where the outcome variable is binary), a random effects model may run into computational problems (the model may not converge, or the approximations used in the model-fitting may not be reliable). When this happens, it is preferable to use GEE or the method using robust standard errors alone.

Constant exposure
These methods for analysing correlated data are needed when the exposure variable is a "cluster-level" variable, i.e. it takes the same value for all individuals in a cluster. When all exposure variables are "individual-level" variables (i.e. their value varies among individuals in the same cluster), and if clusters are quite large, then we may instead be able to take account of the between-cluster variation by stratifying the analysis on cluster and using regression analysis in the usual way.
