REGRESSION
Statistics for Language Research, Fall 2016
12/6/2016
Chen Fang (陈芳), Department of English, East China Normal University (华东师大英语系)


OVERVIEW
Regression
  The regression line
  Intercept and slope estimates in a regression line
    Unstandardized
    Standardized
  Accuracy of prediction
  Handling outliers
  Hypothesis testing
  Prediction as a purpose
  SPSS
  Extension


REGRESSION
Regression is the process of finding the best-fitting line using the data we have.
This line is most often used to predict or estimate a score on one variable (Y) from another (X).
  Predict future occurrences
  Estimate a mean value
Examples


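PREDICTING A SCORE
Suppose we know nothing except that the mean psychological symptom score of students at ECNU is 90 (based on measures such as the Hopkins Symptom Checklist), and that a particular student comes from ECNU.
With no other information, our best estimate of that student's symptom score is the mean, 90.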


PREDICTING A SCORE
Suppose we have some extra information, such as the stress level of ECNU students, and we know that stress level is related to the number of symptoms. We can then give a more precise estimate of a student's symptom score, given that we know his or her stress level.

E.g., if I tell you that the student's stress score is 42, your estimated symptom score may be 105.

E.g., if I tell you that the student's stress score is 18, your estimated symptom score may be 92.


FIND THE LINE
Finding the line is the same as finding the a and b that define it.


THE REGRESSION LINE

Think back to your high school math class… the equation for a straight line is:

$$\hat{Y} = a + bX$$

$\hat{Y}$: predicted value of Y
$a$: intercept
$b$: slope of the regression line
$X$: value of X

$a$ and $b$ are our regression coefficients.


THE BEST-FITTING LINE: VISUAL HELP
The best-fitting line minimizes the errors of prediction.
The difference between the observed Y and the predicted Ŷ is called the residual; it is written (Y − Ŷ).
Ordinary least squares estimation: choose the line that minimizes the sum of the squared residuals.
Draw on the board.
No matter how steep the best-fitting line is, it always passes through one point, namely $(\bar{X}, \bar{Y})$.
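The least-squares idea can be checked numerically. The sketch below is only an illustration: the five (stress, symptom) pairs are invented, and the helper names are assumptions rather than anything from the slides. It fits the line, confirms that nudging the intercept or slope makes the sum of squared residuals larger, and confirms that the line passes through the point of means.

```python
# Hypothetical (stress, symptom) pairs, made up purely for illustration.
xs = [10, 18, 25, 30, 42]
ys = [85, 92, 93, 99, 105]


def mean(values):
    return sum(values) / len(values)


def sum_squared_residuals(a, b):
    """Sum of (Y - Y_hat)^2 for the line Y_hat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


x_bar, y_bar = mean(xs), mean(ys)

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

best = sum_squared_residuals(a, b)

# Shifting or tilting the line away from (a, b) only increases the squared error.
for da, db in [(1, 0), (-1, 0), (0, 0.1), (0, -0.1)]:
    assert sum_squared_residuals(a + da, b + db) > best

# The fitted line passes through the point (mean of X, mean of Y).
passes_through_means = abs((a + b * x_bar) - y_bar) < 1e-9

print(f"a = {a:.3f}, b = {b:.3f}, SS of residuals = {best:.3f}")
print("passes through the point of means:", passes_through_means)
```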


THE REGRESSION LINE
The slope of the best-fitting line we just described, and then its intercept, can be estimated as below. This can be proven, but we will not do that here.

$$b = \frac{\mathrm{cov}_{XY}}{s_X^2} = r\,\frac{s_Y}{s_X}$$

$$a = \bar{Y} - b\bar{X} = \frac{\sum Y - b\sum X}{N}$$
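As a quick check that the two forms of each formula agree, here is a minimal sketch using only the Python standard library. The data are invented for illustration and the variable names are assumptions; the covariance and standard deviations use N − 1 in the denominator.

```python
from math import sqrt

# Hypothetical (stress, symptom) pairs, invented for illustration only.
xs = [10, 18, 25, 30, 42]
ys = [85, 92, 93, 99, 105]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sample covariance and standard deviations (N - 1 in the denominator).
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
r = cov_xy / (s_x * s_y)

# Slope, computed two equivalent ways.
b_from_cov = cov_xy / s_x ** 2
b_from_r = r * s_y / s_x

# Intercept, computed two equivalent ways.
a_from_means = y_bar - b_from_cov * x_bar
a_from_sums = (sum(ys) - b_from_cov * sum(xs)) / n

print(f"b = {b_from_cov:.4f} (cov / variance) or {b_from_r:.4f} (r * s_Y / s_X)")
print(f"a = {a_from_means:.4f} (from means) or {a_from_sums:.4f} (from sums)")
```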


STANDARDIZED DATA
So what do we do if our data are standardized, that is, both X and Y are z-scores?
The mean of X in this case is ____.
The mean of Y in this case is also ____.
The standard deviations of X and Y are both ____.
Centering

Think now about the equations we used to estimate our regression coefficients, and plug in the values above:

$$b' = r\,\frac{s_Y}{s_X} = r\,\frac{1}{1} = r$$

$$a' = \bar{Y} - b'\bar{X} = 0 - b'(0) = 0$$

$$\hat{Y} = a' + b'X = 0 + b'X = 0 + \beta X$$

SPSS calls this beta, the standardized coefficient.
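The same result can be seen numerically. In this sketch (invented data, hypothetical helper names), both variables are converted to z-scores and the regression is refitted; the slope comes out equal to r and the intercept equal to 0, up to floating-point rounding.

```python
from math import sqrt

# Hypothetical (stress, symptom) pairs, invented for illustration only.
xs = [10, 18, 25, 30, 42]
ys = [85, 92, 93, 99, 105]


def z_scores(values):
    """Convert raw scores to z-scores using the sample standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / sd for v in values]


def slope_and_intercept(x_values, y_values):
    """Least-squares slope b and intercept a for Y_hat = a + b*X."""
    n = len(x_values)
    x_bar = sum(x_values) / n
    y_bar = sum(y_values) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
         / sum((x - x_bar) ** 2 for x in x_values))
    return b, y_bar - b * x_bar


def correlation(x_values, y_values):
    """Pearson correlation r."""
    n = len(x_values)
    x_bar = sum(x_values) / n
    y_bar = sum(y_values) / n
    sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
    ss_x = sum((x - x_bar) ** 2 for x in x_values)
    ss_y = sum((y - y_bar) ** 2 for y in y_values)
    return sp / sqrt(ss_x * ss_y)


# Regress standardized Y on standardized X.
beta, a_std = slope_and_intercept(z_scores(xs), z_scores(ys))
r = correlation(xs, ys)

print(f"standardized slope (beta) = {beta:.4f}, r = {r:.4f}")
print(f"standardized intercept = {a_std:.4f}")
```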


INTERPRETATION OF STANDARDIZED SLOPE COEFFICIENT (BETA)

$$\hat{Y} = 0 + \beta X$$

This changes our interpretation a bit…
Now a one-unit increase in X or Y corresponds to an increase of one standard deviation.
If β = 0.75, an increase of one standard deviation in X is associated with an increase of ¾ of a standard deviation in Y.


HYPOTHESIS TESTING

Regression has two coefficients (slope and intercept).
We still use a t-test to test each of them.
In SPSS, each t-test has a p-value attached to it. As usual, if p < alpha (usually 0.05), we reject the null hypothesis and conclude that the coefficient is significantly different from 0.


JUST FOR YOUR INFORMATION
t-test for the correlation r. You can look up the critical t value, but we will leave this to SPSS.

$$t = \frac{r}{s_r} = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}, \qquad df = N - 2$$

t-test for the slope coefficient b. It follows a t distribution with N − 2 degrees of freedom.

$$t = \frac{b}{s_b} = \frac{b}{\dfrac{s_{Y-\hat{Y}}}{s_X\sqrt{N-1}}} = \frac{b\,s_X\sqrt{N-1}}{s_{Y-\hat{Y}}} = \frac{b\,s_X\sqrt{N-1}}{s_Y\sqrt{(1-r^2)\dfrac{N-1}{N-2}}}$$

You can use either equation to double-check the t statistic in the following example.
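In simple regression the t statistic for r and the t statistic for b are the same number, which is why either equation can be used as a check. The sketch below illustrates this with invented data; it computes t from r as r·√(N − 2)/√(1 − r²) and t from b as b divided by its standard error, where the standard error of b is the standard error of estimate divided by s_X·√(N − 1). It does not compute a p-value; comparing t with a critical value is left to SPSS or a t table.

```python
from math import sqrt

# Hypothetical (stress, symptom) pairs, invented for illustration only.
xs = [10, 18, 25, 30, 42]
ys = [85, 92, 93, 99, 105]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
ss_x = sum((x - x_bar) ** 2 for x in xs)
ss_y = sum((y - y_bar) ** 2 for y in ys)
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sp_xy / sqrt(ss_x * ss_y)
b = sp_xy / ss_x
a = y_bar - b * x_bar
s_x = sqrt(ss_x / (n - 1))

# t statistic computed from the correlation.
t_from_r = r * sqrt(n - 2) / sqrt(1 - r ** 2)

# t statistic computed from the slope and its standard error.
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
see = sqrt(ss_error / (n - 2))       # standard error of estimate
s_b = see / (s_x * sqrt(n - 1))      # standard error of the slope
t_from_b = b / s_b

df = n - 2
print(f"t from r = {t_from_r:.4f}, t from b = {t_from_b:.4f}, df = {df}")
```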


SPSS DEMO

Stress and Symptom data online. Predicting symptom scores from level of stress. Can you write the equation for this best-fitting line?

Annotations on the SPSS output:
Intercept
Slope
Unstandardized slope coefficient
Standardized slope coefficient: equals r (simple regression only).
The confidence interval does not include 0: yes, significant.
The p-value is smaller than .05: significant. Reject the null hypothesis and conclude the slope is different from 0.
We don't care about this.


PREDICTED VS. ACTUAL SCORE

For a score of X = 25 we can predict a score of about 93.5 on Y. (How do we find the value?)

We can see how the actual values of Y that occurred with a score of X = 25 compare with our prediction.
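The prediction is found by plugging X into the regression equation Ŷ = a + bX. The sketch below shows the mechanics only. The intercept and slope are assumed values chosen so that the prediction lands near 93.5, and the "actual" scores are invented; the real coefficients have to be read from the SPSS output.

```python
# Assumed coefficients for illustration; read the real a and b from SPSS.
a = 73.89    # intercept (hypothetical)
b = 0.7831   # slope (hypothetical)


def predict(x):
    """Predicted Y for a given X on the line Y_hat = a + b*X."""
    return a + b * x


y_hat = predict(25)

# Invented symptom scores for students whose stress score was 25.
actual_scores = [88, 91, 95, 101]
residuals = [y - y_hat for y in actual_scores]

print(f"Predicted Y at X = 25: {y_hat:.1f}")
for y, e in zip(actual_scores, residuals):
    print(f"actual {y}, residual (Y - Y_hat) {e:+.1f}")
```

Each residual is the error of prediction for one student: positive when the actual score is above the line, negative when it is below.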


STANDARD ERROR OF ESTIMATE
(Y − Ŷ) is the error of prediction, or the residual.
Remember the sum of squared errors, or SSerror? It is $\sum (Y - \hat{Y})^2$.
The standard error of estimate (SEE) is defined as

$$s_{Y-\hat{Y}} = s_{Y \cdot X} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}} = \sqrt{\frac{SS_{error}}{N - 2}}$$

This equation, like the standard deviation, describes how far, on average, our predicted scores are from the actual scores.

SEE can be more easily estimated with:

$$s_{Y-\hat{Y}} = s_Y\sqrt{(1 - r^2)\,\frac{N - 1}{N - 2}}$$
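The two routes to the SEE give the same value, which the following sketch illustrates with invented data. The first route sums the squared residuals and divides by N − 2; the second uses only s_Y, r and N. Variable names are assumptions made for the example.

```python
from math import sqrt

# Hypothetical (stress, symptom) pairs, invented for illustration only.
xs = [10, 18, 25, 30, 42]
ys = [85, 92, 93, 99, 105]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
ss_x = sum((x - x_bar) ** 2 for x in xs)
ss_y = sum((y - y_bar) ** 2 for y in ys)
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sp_xy / ss_x
a = y_bar - b * x_bar
r = sp_xy / sqrt(ss_x * ss_y)
s_y = sqrt(ss_y / (n - 1))

# Route 1: definition, from the residuals.
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
see_from_residuals = sqrt(ss_error / (n - 2))

# Route 2: shortcut, from s_Y, r and N.
see_from_r = s_y * sqrt((1 - r ** 2) * (n - 1) / (n - 2))

print(f"SEE from residuals = {see_from_residuals:.4f}")
print(f"SEE from shortcut  = {see_from_r:.4f}")
```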


SEE AS A MEASURE OF THE ACCURACY OF PREDICTION
SEE is one way we can assess how well our regression equation is working.
Predicting with a predictor variable is better than predicting without one, but it is not without error.
If the regression equation works well, the predicted values of Y should be very close to the real values.

The smaller the SEE, the better the equation predicts.


R² AS A MEASURE OF ACCURACY OF PREDICTION

$$r^2 = \frac{SS_{\hat{Y}}}{SS_Y} = \frac{SS_Y - SS_{error}}{SS_Y}$$

In this equation, $SS_Y$ is the total variability in Y, which is composed of two parts:
a. The part of the variability in Y that is associated with X, $SS_{\hat{Y}}$. Learn to use the following phrases for it:
   that can be explained by X
   that can be predicted by X
   that is accounted for by X
   that is attributable to X
b. The part of the variability in Y that is independent of X, which is $SS_{error}$.
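The partition of the total sum of squares can be verified directly. This sketch uses invented data and assumed variable names; it computes SS_Y, SS_Ŷ and SS_error separately, then checks that the two parts add up to the total and that SS_Ŷ / SS_Y equals r².

```python
from math import sqrt

# Hypothetical (stress, symptom) pairs, invented for illustration only.
xs = [10, 18, 25, 30, 42]
ys = [85, 92, 93, 99, 105]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
ss_x = sum((x - x_bar) ** 2 for x in xs)
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sp_xy / ss_x
a = y_bar - b * x_bar
predictions = [a + b * x for x in xs]

# Total variability in Y, and its two parts.
ss_y = sum((y - y_bar) ** 2 for y in ys)
ss_y_hat = sum((p - y_bar) ** 2 for p in predictions)            # associated with X
ss_error = sum((y - p) ** 2 for y, p in zip(ys, predictions))    # independent of X

r = sp_xy / sqrt(ss_x * ss_y)
r_squared = ss_y_hat / ss_y

print(f"SS_Y = {ss_y:.3f} = SS_Y-hat {ss_y_hat:.3f} + SS_error {ss_error:.3f}")
print(f"r squared = {r_squared:.4f} (r = {r:.4f})")
```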


CALCULATE THE SSY-HAT

[Annotated SPSS output. Values legible on the slide: $SS_Y = 702.410$, $SS_{\hat{Y}} = 173.105$, and a third figure printed as "256."]


CALCULATE THE SSERROR

$$SS_{error} = SS_Y - SS_{\hat{Y}} = 702.410 - 173.105 = 529.305$$


PREDICTABLE VARIABILITY & R²

The higher the r², the better the predictors are working (the more of the variance in Y is explained by the predictors).

If a correlation (r) is found to be 0.8, we calculate r² = 0.8² = 0.64.

How do we interpret this?
This means that 64% of the variance in Y can be explained by the variability in X.
You can use the phrases listed earlier (explained by, predicted by, accounted for by, attributable to) interchangeably.
Remember, this does NOT mean that 64% of Y is caused by X.


SPSS DEMO
Predicting course quality from grades.
Interpret R².
Based on the regression equation, what will the overall rating be if the student's expected grade is 3.1?


WHAT AFFECTS REGRESSION ANALYSIS? REMEMBER THIS?

4 major factors affect the calculation of correlations:
1. Nonlinearity of relationship
2. Restriction of range (or variance) of X and/or Y
3. Use of heterogeneous sub-samples
4. Outliers


USE SPSS
Sleep and mood: this is fictional data.
Create a scatter plot to check the assumption of a linear relationship. If the relationship is not linear, a transformation of the data might be necessary.

Check for outliers. Delete outliers.

Read the output. Draw your conclusion. Interpret the results. Evaluate model fit.


EXTENSION: COMPARING T-TEST AND REGRESSION


SOME EXTENSIONS OF REGRESSION
Multiple Regression

This method involves the prediction of Y from several independent variables (IVs).
There is no single regression line in this case.
We calculate a single a (intercept) for the multiple regression, but each IV has its own b (slope) estimated.

These slopes are not exactly the slope of a line; they act more like weights.

We can use a t-test to test whether each of these weights helps predict Y. Again, SPSS helps.

Our r² will not decrease as we add more predictors, and it usually increases: we can explain more and more of the variance in Y by adding different predictors.
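To make the idea of one intercept plus one weight per predictor concrete, here is a rough sketch that fits a multiple regression by solving the least-squares normal equations in plain Python. Everything in it is an assumption made for illustration: the data are invented, `sleep` is a hypothetical second predictor, and `solve` and `fit` are helper names, not SPSS functions. It compares r² with one predictor and with two.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(n):
            if i != col:
                factor = M[i][col] / M[col][col]
                M[i] = [v - factor * p for v, p in zip(M[i], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit(columns, ys):
    """Least-squares fit of ys on the predictor columns plus an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept a,
    and the remaining entries are the slopes b, one per predictor.
    """
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(ys))]
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(rows, ys)) for i in range(k)]
    coefs = solve(xtx, xty)
    y_bar = sum(ys) / len(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_error = sum((y - sum(c * v for c, v in zip(coefs, row))) ** 2
                   for row, y in zip(rows, ys))
    return coefs, 1 - ss_error / ss_total


# Invented data: two predictors and one outcome.
stress = [10, 18, 25, 30, 42, 35]
sleep = [8, 7, 7, 5, 4, 6]
symptoms = [85, 92, 93, 99, 105, 98]

coefs_one, r2_one = fit([stress], symptoms)
coefs_two, r2_two = fit([stress, sleep], symptoms)

print(f"one predictor:  a = {coefs_one[0]:.3f}, b = {coefs_one[1]:.3f}, r2 = {r2_one:.4f}")
print(f"two predictors: a = {coefs_two[0]:.3f}, "
      f"b1 = {coefs_two[1]:.3f}, b2 = {coefs_two[2]:.3f}, r2 = {r2_two:.4f}")
```

Because the one-predictor model is a special case of the two-predictor model (with the second weight set to 0), the second r² cannot be smaller than the first.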


Chapter 15 contains an introduction to multiple regression.


EXTRA
Testing the difference between independent and nonindependent rs.

PIP (proportional improvement in prediction)

Homogeneity of variance in arrays

Errors of prediction as a function of r:
  r = .2: SEE is reduced by only 2%; the standard error is still 98% of what it would be when predicting without X.
  r = .95: SEE is reduced by about two-thirds.
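These figures follow from the shortcut formula for the SEE: for large N the factor (N − 1)/(N − 2) is close to 1, so the SEE is roughly s_Y·√(1 − r²). The sketch below evaluates that fraction for the two correlations mentioned; the function name is an assumption, and the large-N approximation is the only thing being illustrated.

```python
from math import sqrt


def remaining_error_fraction(r):
    """Approximate ratio of SEE to s_Y for large N: sqrt(1 - r^2)."""
    return sqrt(1 - r ** 2)


for r in (0.2, 0.95):
    fraction = remaining_error_fraction(r)
    print(f"r = {r:.2f}: SEE is about {fraction:.0%} of s_Y, "
          f"a reduction of about {1 - fraction:.0%}")
```

With r = .2 the fraction is about 0.98, and with r = .95 it is about 0.31, which matches the 2% and roughly two-thirds reductions.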
