Australian Council for Educational Research Analysing and Understanding Learning Assessment for Evidence-based Policy Making Correlation and Regression Bangkok, 14-18, Sept. 2015


Australian Council for Educational Research

Analysing and Understanding Learning Assessment for Evidence-based Policy Making

Correlation and Regression

Bangkok, 14-18 Sept. 2015


Correlation

• The strength of a mutual relation between 2 (or more) things

• You need to know 2 things about each unit of analysis
– student (e.g. maths and reading performance)
– school (e.g. funding level and mean reading performance)
– country (e.g. mean performance in 2010 and in 2013)

• No assumption about the direction of the relationship

• Correlation is simply standardised covariance, i.e., covariance divided by the product of the standard deviations of the variables.
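The definition above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the five pairs of scores are hypothetical and are not taken from the workshop data.

```python
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance with the N - 1 divisor."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_r(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical maths and reading scores for five students.
maths = [420, 480, 500, 530, 570]
reading = [410, 470, 520, 510, 590]

r = pearson_r(maths, reading)
print(round(r, 3))
```

Swapping the two arguments gives the same value, which reflects the point that correlation assumes no direction.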


Formulas

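• Variances:

$$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$

• Standard deviation:

$$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

• Covariances:

$$\mathrm{cov}(x, y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$$

• Correlation (Pearson’s r):

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$$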


A note on sample vs population estimators

• Sample variances:

$$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$$

• Sample covariances:

$$\mathrm{cov}(x, y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}$$

• An estimate of variance computed with divisor $N$ from a sample is biased: it underestimates the true variance

• It needs a correction factor of $\frac{N}{N-1}$ to produce an unbiased estimate

Presenter
Presentation Notes
Conventional formulas are as follows for variance and covariance. But these are for the population. If applied on a sample, they produce a biased estimate.
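A small sketch of the correction factor, assuming Python's standard `statistics` module, where `pvariance` divides by N and `variance` divides by N - 1. The eight scores are hypothetical.

```python
from statistics import pvariance, variance

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
n = len(scores)

biased = pvariance(scores)          # divisor N
unbiased = variance(scores)         # divisor N - 1
corrected = biased * n / (n - 1)    # apply the N / (N - 1) correction factor

print(biased, round(unbiased, 3), round(corrected, 3))
```

Multiplying the divisor-N estimate by N / (N - 1) reproduces the divisor-(N - 1) estimate, and the difference shrinks as N grows.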

Type of correlation

• The correlation coefficient to use depends on the level of measurement of the variables

Ordinal – ranks, Likert scales, ordered categories
• Spearman correlation (ρ), Kendall’s tau (τ)

Interval/Ratio – metric scales, measures of magnitude
• Pearson correlation (r)
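One way to see the link between the two families: Spearman's ρ is Pearson's r computed on ranks. The sketch below assumes average ranks for ties and uses hypothetical data; it is an illustration rather than a replacement for a statistics package.

```python
from statistics import mean

def ranks(values):
    """Rank from 1 (smallest), giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman_rho(xs, ys):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical Likert responses (1-5) and a monotone but non-linear outcome.
likert = [1, 2, 3, 4, 5]
outcome = [1, 4, 9, 16, 100]

print(round(pearson_r(likert, outcome), 3), round(spearman_rho(likert, outcome), 3))
```

Because the outcome rises monotonically but not linearly, the rank-based coefficient is 1 while Pearson's r is lower.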


Things to remember

• Independence – are the two values independent of each other?

• Linearity – is the relationship between the two values linear?

• Normality – are the two values distributed normally? (if not, non-parametric correlation should be used)


Correlation values

0 = no relationship
1.0 = perfect positive relationship
-1.0 = perfect negative relationship
0.1 = weak relationship (if significant)
0.3 = moderate relationship (if significant)
0.5 = strong relationship (if significant)

Presenter
Presentation Notes
If significant – if the correlation is significantly different from zero

Strong correlation

r = .80


Perfect correlations

r = 1 r = -1

Presenter
Presentation Notes
These are not common in surveys.

Moderate correlation

r = .36

Presenter
Presentation Notes
“Cloud” of dots less well defined

No correlation

r = .06

Presenter
Presentation Notes
No correlation between maths and reading – if you find this in your results it might indicate a problem

Correlation vs Regression

• Correlation is not directional. The degree of association goes both ways.

• Correlation is not appropriate if the substantive meaning of X being associated with Y is different from Y being associated with X. For example, Height and Weight.

• Not appropriate when one of the variables is being manipulated, or being used to explain the other. Use regression instead.


Practical exercises

• Be careful about spurious correlations. Just because two variables correlate highly does not mean there is a valid relationship between them.

• Correlation is not causation.

• With large enough data, anything can be significantly correlated with something.
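The last point can be illustrated with the usual t statistic for testing whether a correlation differs from zero, t = r·√((n − 2)/(1 − r²)). The sketch below holds a very weak correlation fixed (r = .05, a hypothetical value) and varies only the sample size; the 1.96 threshold mentioned afterwards is the large-sample 5% critical value, used here as an approximation.

```python
from math import sqrt

def t_for_r(r, n):
    """t statistic for testing H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Same weak correlation, increasingly large samples.
for n in (30, 300, 3000, 30000):
    print(n, round(t_for_r(0.05, n), 2))
```

With 30 cases the statistic is far below 1.96, while with 3000 cases the same r = .05 exceeds it, even though the relationship is no stronger.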


Regression

• Also describes a relationship between 2 things (or more), but assumes a direction

• Explain one variable with one (or more) other variable(s)
– How well does SES predict performance?


Regression – cont.

• Two main statistics
– Size of the effect or slope
– Strength of the effect or explained variance


The General Idea

Simple regression considers the relation between a single explanatory variable and a response variable


Line of best fit (OLS)

Presenter
Presentation Notes
Students have an estimate of ESCS (SES) and a score on a reading test

Size of the effect

1 unit

50 = slope

Presenter
Presentation Notes
The best-fitting line is the line that minimises the sum of squared vertical distances between each of those dots and the line. This line can be steep or flat. To express how steep the line is: when you increase the value on your x-axis by one unit (one SD on ESCS), how much does your reading variable increase? In this case it goes up by 50. That means the slope of this line is 50. Of course this value depends on the unit of your vertical scale. In this case the mean of the reading scale is 500 and the standard deviation is 100, so this goes up half an SD in reading.

Size of the effect – cont.

1 unit

25 = slope

Presenter
Presentation Notes
In this example the slope is less steep. If we go up one SD on ESCS, the line only goes up by 25 reading points. Any given difference of one SD in ESCS is associated with an increase of 25 points on the reading scale (1/4 of an SD). The steeper the line, the larger the effect of ESCS on reading. The unit of measurement of the variable on the x-axis changes the slope of the line. We call the variable on the y-axis our dependent variable, and the variable on the x-axis our independent variable. An increase in the variable on the y-axis (in this case reading) is treated as dependent upon an increase in the variable on the x-axis (ESCS). In other words, we are trying to explain reading using ESCS. Note that we cannot infer causation from the types of surveys we use: we could only determine causation by manipulating the presumed cause, using control and experimental groups.
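A minimal sketch of how a least-squares slope is obtained and how it depends on the unit of the x variable. The data are hypothetical and constructed so that reading rises exactly 25 points per ESCS unit; `ols_line` is an illustrative helper, not an SPSS function.

```python
from statistics import mean

def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data: ESCS in SD units, reading on a scale with mean 500.
escs = [-2, -1, 0, 1, 2]
reading = [450, 475, 500, 525, 550]

intercept, slope = ols_line(escs, reading)

# Express ESCS in tenths of a unit: same data, different x unit.
_, slope_tenths = ols_line([x * 10 for x in escs], reading)

print(intercept, slope, slope_tenths)
```

Rescaling x by a factor of 10 divides the slope by 10, although the relationship itself has not changed.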

The R²

The proportion of the total sample variance that is not explained by the regression will be:

$$\frac{\text{Residual sum of squares}}{\text{Total sum of squares}}$$

Therefore, the proportion of the variance in the dependent variable that is explained by the independent variable (R²) will be:

$$R^2 = 1 - \frac{\text{Residual sum of squares}}{\text{Total sum of squares}}$$

Presenter
Presentation Notes
Students have an estimate of ESCS (SES) and a score on a reading test
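The R² formula can be sketched directly from its definition. The example below fits a simple least-squares line and then computes one minus the ratio of residual to total sum of squares; the five data points are hypothetical.

```python
from statistics import mean

def r_squared(xs, ys):
    """1 - residual sum of squares / total sum of squares for a simple regression."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    total_ss = sum((y - my) ** 2 for y in ys)
    return 1 - residual_ss / total_ss

escs = [-2, -1, 0, 1, 2]              # hypothetical values
reading = [440, 490, 480, 540, 550]   # hypothetical values

r2 = r_squared(escs, reading)
print(round(r2, 3))
```

For these values the residual sum of squares is 910 and the total is 8200, so R² is about 0.89.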

Strength of the effect

For example, if the residual variance is a small proportion of the total variance

R² = 1 – (162.5/1250)
R² = 0.87

87 % of the variation in reading is explained by ESCS

Presenter
Presentation Notes
The other thing you look at with regression is the strength. If we look at the best-fitting line in this case, and the distance of each of these dots from the line, we see that all of the dots are quite close to the line. This means that the relationship is quite strong. You can look at it another way. If you take a student with an ESCS value of -1, their predicted reading performance (from the line) is 450. However, the real reading performance of kids with an ESCS value of -1 ranges between approximately 430 and 480. Remember this to compare with the next slide. The strength is related to the correlation. In fact, in simple regression the strength is just the correlation squared. In this case we had a very strong correlation (.93). If we square that, the value we get is the proportion of variation in reading that is explained by ESCS. For all of the students that have reading scores, 87% of the variation in their performance is explained by ESCS. Only 13% of the variation in kids’ reading performance is not explained by ESCS. Note that this is not very realistic. We have computed the CI here: we are 95% sure that the true slope is between 47 and 51.

Strength – cont.

For example, if the residual variance is a large proportion of the total variance

R² = 1 – (1075/1250)
R² = 0.14

Only 14% of the variation in reading is explained by ESCS

Presenter
Presentation Notes
Here we have a wider scatter plot (but the slope of the line is the same as on the previous slide). If we again look at all the distances between each dot and the best-fitting line, we see that these distances are much larger than on the previous slide. Also, if we take a student with a value of -1 on ESCS, we would estimate the student’s reading performance at just over 400. But in reality, students with an ESCS of -1 have a very wide range of reading scores: the best student has a reading performance of 800, the weakest student of just under 200. What you can already tell from this is that ESCS explains reading a little bit, but not as much as in the previous example. In this case the correlation of this cloud was .37 (moderate correlation) and R-squared is .14. This is much more realistic than the previous example. In other words, 86% of the variation in reading is explained by other things (not ESCS). The other thing you can see here is that, even though the slope is the same as on the previous slide, we’re much less certain about it: the CI is from 34 to 106. The less variance you explain, the larger the standard error on your slope estimate.
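As a worked check, the arithmetic behind the two strength examples can be written out. The variances (162.5, 1075, 1250) and correlations (.93, .37) are the figures given on the slides and in the notes; the only assumption is that the reported correlations are rounded to two decimals, which explains the small gap in the first case.

$$R^2 = 1 - \frac{162.5}{1250} = 1 - 0.13 = 0.87, \qquad 0.93^2 \approx 0.865$$

$$R^2 = 1 - \frac{1075}{1250} = 1 - 0.86 = 0.14, \qquad 0.37^2 \approx 0.137$$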

Multiple Regression

Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y

The intent is to look at the independent effect of each variable while “adjusting out” the influence of potential confounders

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publi


Regression Modeling

• A simple regression model (one independent variable) fits a regression line in 2-dimensional space

• A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space

• This concept can be extended indefinitely but visualisation is no longer possible for >3 variables.

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

residual

Presenter
Presentation Notes
What SPSS does is fit a regression line (or plane) such that the sum of the squared residuals is minimised

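Multiple Regression Model

Again, estimates for the multiple slope coefficients are derived by minimizing ∑residuals² to derive this multiple regression model:

$$\hat{y} = \alpha + \beta_1 x_1 + \beta_2 x_2$$

Again, the standard error of the regression is based on the ∑residuals² over all n observations, with n – k – 1 degrees of freedom for k explanatory variables:

$$s = \sqrt{\frac{\sum \text{residuals}^2}{n - k - 1}}$$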

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publi


Multiple Regression Model

• Intercept α predicts where the regression plane crosses the Y axis

• Slope for variable X1 (β1) predicts the change in Y per unit X1 holding X2 constant

• The slope for variable X2 (β2) predicts the change in Y per unit X2 holding X1 constant

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.
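To make the intercept and the two slopes concrete, here is a small sketch that estimates them by least squares with plain Python. It solves the normal equations by Gaussian elimination; the helper names `solve` and `fit` and the data are hypothetical, and the data are generated exactly from y = 1 + 2·x1 + 3·x2 so the estimates can be checked by eye. In practice a statistics package would be used instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit(x1, x2, y):
    """Least-squares estimates (alpha, beta1, beta2) for y = alpha + beta1*x1 + beta2*x2."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 2, 1, 3, 0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

alpha, beta1, beta2 = fit(x1, x2, y)
print(round(alpha, 6), round(beta1, 6), round(beta2, 6))
```

Each slope is recovered as the change in y per unit of its own variable with the other variable held constant.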


Main purpose of regression analysis

• Prediction
– Developing a prediction model based on a set of predictor/independent variables. This purpose also allows for the evaluation of the predictive powers between different models as well as different sets of predictors within a model.

• Explanation
– Validating or confirming an existing prediction model using new data. This purpose also allows for the assessment of the relationship between predictor and outcome variables.

Presenter
Presentation Notes
We will utilise regression for both purposes in the exercise.

Regression works provided assumptions are met

• Linearity
– Check using partial regression plots (PLOTS Produce all partial plots)

• Uniform variance (homoscedasticity)
– Check by plotting residuals against the predicted value (PLOTS Y:ZRESID, X:ZPRED)
– For ANOVA, check using Levene’s test for homogeneity of variance (EXPLORE PLOTS Spread vs Level)

• Independence of error terms
– Check by plotting residuals against a sequencing variable (PLOTS Produce all partial plots)

• Normality of the residuals
– Check using Normal P-P plots of the residuals (PLOTS Normal probability plot)


Sample size

• Thorough method: a priori power analysis
– Compute sample sizes for given effect sizes, alpha levels, and power values (G*Power 3: http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/)

• Fast method (but less thorough): rules of thumb, where k is the number of predictors
– For R² significance testing: 50 + 8k
– For b-values significance testing: 104 + k
– For both, use the larger number

Presenter
Presentation Notes
This shows that according to the rule of thumb, a reasonable minimum sample size is 105 (104+1)
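The two rules of thumb are simple enough to put in a helper. The function name `minimum_n` is hypothetical, and k is taken to be the number of predictors.

```python
def minimum_n(k):
    """Larger of the two rule-of-thumb sample sizes for k predictors."""
    for_r2 = 50 + 8 * k       # testing the overall R-squared
    for_slopes = 104 + k      # testing individual b-values
    return max(for_r2, for_slopes)

for k in (1, 2, 5, 8, 10):
    print(k, minimum_n(k))
```

For one predictor this gives 105, matching the note above; from eight predictors onward the R² rule becomes the larger of the two.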

Multicollinearity

y = b0 + b1x1
y = b0 + b1x1 + b2x2, but if x2 = x1 + 3:
y = b0 + b1x1 + b2(x1 + 3)
y = b0 + b1x1 + b2x1 + 3b2
y = (b0 + 3b2) + (b1 + b2)x1

Checking for multicollinearity

For overall multicollinearity: VIF > 10; Tolerance < 0.10.
For individual variables: identify Condition Index > 15, then check the Variance Proportions of each coefficient > .90.

Presenter
Presentation Notes
In the example above, only the sum b1 + b2 is determined by the data, so b1 and b2 cannot be interpreted separately.
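A sketch of the tolerance and VIF checks for the two-predictor case, where tolerance is one minus the R² from regressing one predictor on the other and VIF is its reciprocal. With only two predictors that R² is simply their squared correlation, which is the simplification assumed here. The function names and the "unrelated" predictor are hypothetical.

```python
from statistics import mean

def r_squared_between(xs, ys):
    """Squared correlation between two predictors."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

def tolerance_and_vif(x1, x2):
    """Tolerance = 1 - R^2 of one predictor on the other; VIF = 1 / tolerance."""
    tolerance = 1 - r_squared_between(x1, x2)
    vif = float("inf") if tolerance < 1e-12 else 1 / tolerance
    return tolerance, vif

x1 = [1, 2, 3, 4, 5]
shifted = [x + 3 for x in x1]   # x2 = x1 + 3, as in the example above
unrelated = [2, 5, 1, 4, 3]     # hypothetical second predictor

print(tolerance_and_vif(x1, shifted))
print(tolerance_and_vif(x1, unrelated))
```

The shifted predictor has zero tolerance and an unbounded VIF, whereas the unrelated predictor stays far from the VIF > 10 and Tolerance < 0.10 cut-offs.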

Influential values

• Influential values are outliers that have substantial effect on the regression line.

Source: Field, A. (2005). Discovering statistics using SPSS. (2nd ed). London: Sage.


When does linear regression modelling become inappropriate?

• When the dependent variable is dichotomous or polytomous (use Logistic Regression).

• When data are sequential over time and variables are ‘auto correlated’ (use Time Series Analysis).

• When context effects need to be analysed and slopes are different across higher level units (use Multi-level Analysis).


Application: Illustrative Example

Childhood respiratory health survey.

• Binary explanatory variable (SMOKE) is coded 0 for non-smoker and 1 for smoker

• Response variable Forced Expiratory Volume (FEV) is measured in liters/second (lung capacity)

• Regress FEV on SMOKE; least squares regression line: ŷ = 2.566 + 0.711x

• The mean FEV in nonsmokers is 2.566

• The mean FEV in smokers is 3.277

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Presenter
Presentation Notes
FEV can be thought of as lung capacity

Example, cont.

• ŷ = 2.566 + 0.711x

• Intercept (2.566) = the mean FEV of group 0

• Slope = the mean difference in FEV (because x is 0,1): 3.277 − 2.566 = 0.711

• tstat = 6.464 with 652 df, p < .01 (b1 is significant)

• The 95% CI for slope is 0.495 to 0.927

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.
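The claim that a 0/1 predictor gives an intercept equal to the mean of group 0 and a slope equal to the difference in group means can be checked on a toy data set. The five FEV values below are hypothetical, not the survey data, and `ols_line` is an illustrative helper.

```python
from statistics import mean

def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical FEV values; smoke is coded 0 = non-smoker, 1 = smoker.
smoke = [0, 0, 0, 1, 1]
fev = [2.0, 2.5, 3.0, 3.0, 3.5]

intercept, slope = ols_line(smoke, fev)
mean_nonsmokers = mean([f for s, f in zip(smoke, fev) if s == 0])
mean_smokers = mean([f for s, f in zip(smoke, fev) if s == 1])

print(round(intercept, 3), round(slope, 3))
print(mean_nonsmokers, mean_smokers - mean_nonsmokers)
```

Both lines print the same pair of numbers (2.5 and 0.75), so the regression is just another way of expressing the comparison of two group means.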


Smoking increases lung capacity?

• Children who smoked had higher mean FEV

• How can this be true given what we know about the deleterious respiratory effects of smoking?

• ANS: Smokers were older than the nonsmokers

• AGE confounded the relationship between SMOKE and FEV

• A multiple regression model can be used to adjust for AGE in this situation

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.


Extending the analysis: Multiple regression

The multiple regression model is:
FEV = 0.367 + −.209(SMOKE) + .231(AGE)

SPSS output for our example: [Figure: SPSS coefficients output with the intercept a and the slopes b1 and b2 labelled]

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.


Multiple Regression Coefficients, cont.

• The slope coefficient for SMOKE is −.209, suggesting that smokers have .209 less FEV on average compared to non-smokers (after adjusting for age)

• The slope coefficient for AGE is .231, suggesting that each year of age is associated with an increase of .231 FEV units on average (after adjusting for SMOKE)

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.


Inference About the Coefficients

Coefficients

Model 1      B       Std. Error   Beta     t        Sig.
(Constant)   .367    .081                  4.511    .000
smoke        -.209   .081         -.072    -2.588   .010
age          .231    .008         .786     28.176   .000

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient. Dependent variable: fev.

Inferential statistics are calculated for each regression coefficient. For example, in testing

H0: β1 = 0 (SMOKE coefficient controlling for AGE)
tstat = −2.588 and P = 0.010
df = n – k – 1 = 654 – 2 – 1 = 651

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.


Inference About the Coefficients

The 95% confidence interval for this slope of SMOKE controlling for AGE is −0.368 to −0.050.

Coefficients: 95% Confidence Interval for B

Model 1      Lower Bound   Upper Bound
(Constant)   .207          .527
smoke        -.368         -.050
age          .215          .247

Dependent variable: fev.

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publi
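The reported test statistic and interval for SMOKE can be approximately reproduced from the rounded B and Std. Error in the coefficients table. This sketch assumes a normal critical value (about 1.96) in place of the t critical value with 651 df, which is a close approximation at this sample size; small differences from the SPSS output come from rounding of the inputs.

```python
from statistics import NormalDist

b, se = -0.209, 0.081             # SMOKE row: B and Std. Error (rounded)

t_stat = b / se                   # close to the reported -2.588
z = NormalDist().inv_cdf(0.975)   # about 1.96, standing in for t with 651 df
lower, upper = b - z * se, b + z * se

print(round(t_stat, 2), round(lower, 3), round(upper, 3))
```

The interval excludes zero, which is consistent with the significant t-test for the SMOKE coefficient.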


Assessing the significance of the model

• R Square (R²) – represents the proportion of variance in the outcome variable that is accounted for by the predictors in the model. For example, if for our previous model R² = .23, then 23% of the variance in FEV is accounted for by smoking status and age.

• Adjusted R² – compensates for the inflation of R² due to overfitting. Useful for comparing the amount of variance explained across several models.

• Standard error of the estimate – measure of accuracy of the predictions. For example, if the SE of the estimate = 0.35 for our previous model:

FEV = 0.367 + −.209(SMOKE) + .231(AGE)

then the predicted FEV for a non-smoker aged 12 years is

FEV = 3.139 +/- (t x 0.35)
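The prediction can be reproduced by plugging values into the equation. In the sketch below the coefficients and the standard error of the estimate (0.35) are the figures from the slide; the multiplier t = 1.96 is an assumption (a large-sample 95% value), and the function name is hypothetical.

```python
def predicted_fev(smoke, age):
    """Prediction equation from the slide: FEV = 0.367 - 0.209*SMOKE + 0.231*AGE."""
    return 0.367 - 0.209 * smoke + 0.231 * age

prediction = predicted_fev(smoke=0, age=12)   # non-smoker aged 12

t = 1.96                                      # assumed multiplier, not given on the slide
margin = t * 0.35                             # 0.35 = SE of the estimate from the slide
low, high = prediction - margin, prediction + margin

print(round(prediction, 3), round(low, 3), round(high, 3))
```

Under that assumed multiplier the predicted value of 3.139 comes with a range of roughly 2.45 to 3.83 for an individual child.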


Assessing the significance of the model

Hierarchical models

Suppose
Model 1: FEV = 0.367 + −.209(SMOKE) + .231(AGE), R² = .23
Model 2: FEV = 0.367 + −.209(SMOKE) + .231(AGE) + .04(GENDER), R² = .29

What is the amount of unique variance explained by gender above and beyond that explained by smoking status and age?

[Figure: two diagrams, one showing FEV with AGE and SMOKE, the other showing FEV with AGE, GENDER and SMOKE]
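The question can be answered with the change in R² between the two models, and that change can be tested with the usual F statistic for nested models. In the sketch below the R² values are the supposed figures from the slide; the sample size of 654 is borrowed from the earlier degrees-of-freedom calculation as an assumption, and the function names are hypothetical.

```python
def r2_change(r2_reduced, r2_full):
    """Unique variance explained by the predictors added in the fuller model."""
    return r2_full - r2_reduced

def f_change(r2_reduced, r2_full, n, k_full, added):
    """F statistic for the increase in R-squared when `added` predictors are entered."""
    numerator = (r2_full - r2_reduced) / added
    denominator = (1 - r2_full) / (n - k_full - 1)
    return numerator / denominator

delta = r2_change(0.23, 0.29)
f = f_change(0.23, 0.29, n=654, k_full=3, added=1)   # df = (1, 650)

print(round(delta, 2), round(f, 1))
```

Gender would account for an additional 6% of the variance in FEV beyond smoking status and age, and with these assumed inputs the F statistic is about 55.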


Hierarchical regression in SPSS


Dummy Variables

More than two levels

For categorical variables with k categories, use k–1 dummy variables

Ex. SMOKE2 has three levels, initially coded
0 = non-smoker
1 = former smoker
2 = current smoker

Use k – 1 = 3 – 1 = 2 dummy variables to code this information like this:

Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.
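The coding table itself is not reproduced in this text, so the sketch below shows one common scheme as an assumption: the non-smoker level is the reference category and each remaining level gets its own 0/1 indicator. The function name and the indicator names `former` and `current` are hypothetical.

```python
def dummy_code(smoke2):
    """Map the three-level SMOKE2 code to two 0/1 indicators (reference: non-smoker)."""
    former = 1 if smoke2 == 1 else 0
    current = 1 if smoke2 == 2 else 0
    return former, current

for level, label in [(0, "non-smoker"), (1, "former smoker"), (2, "current smoker")]:
    print(level, label, dummy_code(level))
```

The reference category is identified by both indicators being 0, which is why k categories need only k – 1 dummy variables.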


Use of standardised coefficients

• Often thought to be ‘easier’ to interpret.

• Standardisation depends on variances of independent variables.

• Unstandardised coefficient can be translated directly.

• Unstandardised coefficients cannot always be compared if different units are used for the variables.
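A short sketch of how the two kinds of coefficient relate, using the standard conversion beta = b × SD(x) / SD(y). The data and the slope of 0.24 per year are hypothetical; the point is only that changing the unit of x changes b but leaves beta unchanged.

```python
from statistics import stdev

def standardised_beta(b, xs, ys):
    """Convert an unstandardised slope b into a beta weight: b * SD(x) / SD(y)."""
    return b * stdev(xs) / stdev(ys)

age_years = [8, 10, 12, 14, 16]           # hypothetical ages
fev = [2.0, 2.6, 2.9, 3.6, 3.9]           # hypothetical FEV values
b_per_year = 0.24                         # hypothetical slope per year of age

age_months = [a * 12 for a in age_years]  # same ages in a different unit
b_per_month = b_per_year / 12             # the slope shrinks with the unit

beta_years = standardised_beta(b_per_year, age_years, fev)
beta_months = standardised_beta(b_per_month, age_months, fev)

print(round(beta_years, 3), round(beta_months, 3))
```

The unstandardised slope is 0.24 per year or 0.02 per month, but the standardised coefficient is identical in both cases, and it would change if the spread of ages in the sample changed.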


Finding the best regression model

• The set of predictors must be chosen based on theory

• Avoid the “whatever sticks to the wall” approach.

• The grouping of predictors and the ordering of entry will matter.

• Selecting the “best” final model can sometimes be a judgment call.

Presenter
Presentation Notes
So “whatever sticks to the wall” is ok as long as nobody knows?

How to judge whether a model is good?

• Explained variance proportion as measured by R²

• Size of regression coefficients.

• Significance tests (F-test for model, t-tests for parameters)

• Inclusion of all relevant variables (Theory!)

• Is method appropriate?


The six steps to interpreting results

1. Look at the prediction equation to see an estimate of the relationship.

2. Refer to the standard error of the estimate (in the appropriate model) when making predictions for individuals.

3. Refer to the standard errors of the coefficients (in the most complete model) to see how much you can trust the estimates of the effects of the explanatory variables.

4. Look at the significance levels of the t-ratios to see how strong the evidence is in support of including each of the explanatory variables in the model.

5. Use the coefficient of determination (R²) to measure the potential explanatory power of the model.

6. Compare the beta-weights of the explanatory variables in order to rank them in order of explanatory importance.


Notes on interpreting the results

• Prediction is NOT causation.

• In inferring causation, there has to be at least temporal precedence, but temporal precedence alone is still not sufficient.

• Avoid extrapolating the prediction equation beyond the data range.

• Always consider the standard errors and the confidence intervals of the parameter estimates.

• The magnitude of the coefficient of determination (R²), in terms of explanatory power, is a judgment call.


Practice exercises!

Study: Mathematics Beliefs and Achievement of Elementary School Students in Japan and the United States: Results From the Third International Mathematics and Science Study (TIMSS). House, J. D., 2006

• Interpret the parameter estimates

• Interpret the statistical significance of the predictors

• Make substantive interpretation about the findings


Extensions: Regression

Multiple regression considers the relation between a set of explanatory variables and a response or outcome variable

Independent predictor (x1)

Outcome (y)

Independent predictor (x2)


Moderating effect

Moderated regression

When the independent variable does not affect the outcome directly but rather affects the relationship between the predictor and the outcome.

Independent predictor (x1)

Outcome (y)

Independent variable (x2)


Moderating effect

Simple Moderating effect

When a categorical independent variable affects the relationship between the predictor and the outcome.

[Figure: Y plotted against X for categories C1, C2 and C3]


Moderating effects

y = actual scaled score in the Multidimensional Perfectionism Scale (Hewitt & Flett)

Categorical moderator Continuous moderator

Presenter
Presentation Notes
EDI- eating disorder inventory

Types of moderators (Sharma et al., 1981)

                                      Related to predictor and/or outcome   Not related to predictor and/or outcome
No interaction with predictor         Independent predictor                 Homologizer
Interaction with predictor variable   Quasi-moderator                       Pure moderator

Homologizer variables affect the strength (rather than the form) of the relationship between predictor and outcome (Zedeck, 1971)


Testing Moderation

• Moderation effects are also known as interaction effects.

• Interaction terms are product terms of the moderator and the relevant predictor (the variable that the moderator interacts with)
– Y = b0 + b1x1 + b2x2 + b3m
– Interaction term = x1*m = i1

• Choosing the moderator and the relevant predictor must have theoretical support. For example, it is possible that the moderator interacts with x2 instead (i.e., x2*m = i1).

• Testing for the interaction effect necessitates the inclusion of the interaction term/s in the regression equation:
– Y = b0 + b1x1 + b2x2 + b3m + b4i1
– And test H0: b4 = 0
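To see why b4 carries the moderation effect, the equation with the interaction term can be regrouped around x1. This is a short algebraic sketch using the same symbols as the equations above, with i1 written out as the product x1·m.

$$Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 m + b_4 (x_1 m) = b_0 + (b_1 + b_4 m)\,x_1 + b_2 x_2 + b_3 m$$

The slope of x1 is therefore b1 + b4·m, so it changes with the value of the moderator. If b4 = 0, the slope of x1 is b1 at every value of m, which is what the test of H0 examines.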


Mediating effect

Mediated regression

When the independent predictor does not affect the outcome directly but affects it through an intermediary variable (the mediator).

Independent predictor (x1)

Outcome (y)

Intermediary predictor (x2)


Mediation vs Moderation

Mediators explain why or how an independent variable X causes the outcome Y while a moderator variable affects the magnitude and direction of the relationship between X and Y (Saunders, 1956).

These two approaches can be combined for more complex analyses:

• Moderated mediation

• Mediated moderation


Checklists

• Moderation
– Collinearity between predictor and moderator (especially true for quasi-moderators).
– Unequal variances between groups based on the moderator.
– Reliability of measures (measurement errors are magnified when creating the product terms).

• Mediation
– Theoretical assumptions on the mediator
– Rationale for selecting the mediator
– Significance and type (full/partial) of the mediation effect.
– Implied causation (i.e., directional paths).