
Page 1: Regression analysis

Regression analysis

Linear regression
Logistic regression

Page 2: Regression analysis


Relationship and association

Page 3: Regression analysis

Straight line

[Figure: scatter plot of BMI against hip circumference (cm) with the fitted line BMI = 1000 - 0.0008·X; the annotation marks a 1 cm step along the hip axis against the corresponding -0.0008 step in BMI.]

Y = b_0 + b_1·X

b_1 = (Y_2 - Y_1) / (X_2 - X_1)   (the slope)

b_0 = intersection with the Y axis   (the intercept)

BMI = b_0 + b_1·HIP
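
To make the slope and intercept concrete, here is a minimal Python sketch that recovers b_1 and b_0 from two points on a line; the point values are invented for illustration:

    # Slope and intercept of the line through two points (invented values).
    x1, y1 = 95.0, 24.0    # hypothetical (hip, BMI) point
    x2, y2 = 105.0, 27.5   # a hypothetical second point

    b1 = (y2 - y1) / (x2 - x1)   # slope: change in Y per unit of X
    b0 = y1 - b1 * x1            # intercept: value of Y at X = 0

    print(f"Y = {b0:.2f} + {b1:.2f} * X")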

Page 4: Regression analysis


Best straight line?

Page 5: Regression analysis

Best straight line!

[Figure: scatter plot with a candidate line; the vertical distance from a data point such as (X_1, Y_1) to the line is its residual.]

e_1 = Y_1 - Ŷ_1

The best line is the one that minimizes the squared residuals summed over all N points:

Σ_{i=1}^{N} e_i² = Σ_{i=1}^{N} (Y_i - Ŷ_i)²

Least square estimation
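
Outside SPSS, least squares estimation is a one-liner; a minimal numpy sketch, with hypothetical hip/BMI arrays standing in for the slide's data:

    import numpy as np

    hip = np.array([94.0, 97.0, 99.0, 102.0, 106.0])   # hypothetical X
    bmi = np.array([19.5, 21.0, 22.8, 23.1, 25.4])     # hypothetical Y

    # np.polyfit(..., deg=1) minimizes the sum of squared residuals.
    b1, b0 = np.polyfit(hip, bmi, deg=1)

    residuals = bmi - (b0 + b1 * hip)
    print("b0 =", b0, "b1 =", b1)
    print("sum of squared residuals:", np.sum(residuals ** 2))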

Page 6: Regression analysis

Simple linear regression

1. Is the association linear?

[Figure: scatter plot used to judge whether the association is linear.]

Page 7: Regression analysis

Simple linear regression

1. Is the association linear?
2. Describe the association: what are b_0 and b_1?

BMI = -12.6 kg/m² + 0.35 kg/m³·Hip

b_1 = Σ(X_i - X̄)(Y_i - Ȳ) / Σ(X_i - X̄)²

X̄ = ΣX_i / n

b_0 = Ȳ - b_1·X̄
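
The same estimates follow directly from the formulas above; a sketch reusing the hypothetical arrays from the previous example:

    import numpy as np

    hip = np.array([94.0, 97.0, 99.0, 102.0, 106.0])   # hypothetical X
    bmi = np.array([19.5, 21.0, 22.8, 23.1, 25.4])     # hypothetical Y

    x_bar = hip.sum() / len(hip)   # X̄ = ΣX_i / n
    y_bar = bmi.mean()

    # b1 = Σ(X_i - X̄)(Y_i - Ȳ) / Σ(X_i - X̄)²
    b1 = np.sum((hip - x_bar) * (bmi - y_bar)) / np.sum((hip - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar        # b0 = Ȳ - b1·X̄

    print(f"BMI = {b0:.2f} + {b1:.2f} * Hip")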

Page 8: Regression analysis

Simple linear regression

1. Is the association linear?
2. Describe the association
3. Is the slope significantly different from 0? Help SPSS!!!

Coefficients(a)

                  Unstandardized Coefficients   Standardized Coefficients
Model             B          Std. Error         Beta                        t        Sig.
1   (Constant)    -12.581    2.331                                          -5.396   .000
    Hip              .345     .023              .565                        15.266   .000

a. Dependent Variable: BMI
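
The equivalent of this SPSS coefficient table (B, standard error, t and Sig.) can be reproduced with the statsmodels package; a sketch assuming the hypothetical hip/bmi arrays hold the data:

    import numpy as np
    import statsmodels.api as sm

    hip = np.array([94.0, 97.0, 99.0, 102.0, 106.0])   # hypothetical data
    bmi = np.array([19.5, 21.0, 22.8, 23.1, 25.4])

    X = sm.add_constant(hip)          # adds the intercept column
    model = sm.OLS(bmi, X).fit()

    # The summary lists coefficient, std. error, t and p for (Constant) and Hip.
    print(model.summary())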

Page 9: Regression analysis

Simple linear regression

1. Is the association linear?
2. Describe the association
3. Is the slope significantly different from 0?
4. How good is the fit?

How far are the data points from the line on average?

r = Σ(X_i - X̄)(Y_i - Ȳ) / √( Σ(X_i - X̄)² · Σ(Y_i - Ȳ)² )

-1 ≤ r ≤ 1
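
A quick way to check r: the direct formula agrees with numpy's built-in; the x/y arrays are again hypothetical:

    import numpy as np

    x = np.array([94.0, 97.0, 99.0, 102.0, 106.0])   # hypothetical
    y = np.array([19.5, 21.0, 22.8, 23.1, 25.4])

    # Direct implementation of the formula above.
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
    r = num / den

    assert np.isclose(r, np.corrcoef(x, y)[0, 1])    # same as numpy's built-in
    print("r =", r, "r^2 =", r ** 2)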

Page 10: Regression analysis

The Correlation Coefficient, r

[Figure: four example scatter plots with r = 0, r = 1, r = 0.7 and r = -0.5.]

Page 11: Regression analysis

r² – Goodness of fit

How much of the variation can be explained by the model?

[Figure: four example fits with r² = 0, r² = 1, r² = 0.5 and r² = 0.2.]

Page 12: Regression analysis

Multiple linear regression

Could waist measure describe some of the variation in BMI?

BMI = 1.3 kg/m² + 0.42 kg/m³·Waist

Or even better:

BMI = b_0 + b_1·HIP + b_2·WST

BMI = -12.2 + 0.25·HIP + 0.17·WST
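
A sketch of the two-predictor fit with numpy's least squares solver; the hip, waist and bmi arrays are hypothetical:

    import numpy as np

    hip   = np.array([94.0, 97.0, 99.0, 102.0, 106.0])   # hypothetical
    waist = np.array([78.0, 82.0, 85.0, 88.0, 95.0])     # hypothetical
    bmi   = np.array([19.5, 21.0, 22.8, 23.1, 25.4])

    # Design matrix with an intercept column: BMI = b0 + b1*HIP + b2*WST
    X = np.column_stack([np.ones_like(hip), hip, waist])
    (b0, b1, b2), *_ = np.linalg.lstsq(X, bmi, rcond=None)

    print(f"BMI = {b0:.2f} + {b1:.2f}*HIP + {b2:.2f}*WST")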

Page 13: Regression analysis

Multiple linear regression

If Y is linearly dependent on more than one independent variable:

Y_j = β_0 + β_1·X_1j + β_2·X_2j

β_0 is the intercept: the value of Y when X_1 and X_2 = 0.
β_1 and β_2 are termed partial regression coefficients.
β_1 expresses the change of Y for one unit of X_1 when X_2 is kept constant.

[Figure: a regression plane in three dimensions.]

Page 14: Regression analysis

Multiple linear regression – residual error and estimations

As the collected data are not expected to fall exactly in a plane, an error term must be added:

Y_j = β_0 + β_1·X_1j + β_2·X_2j + ε_j

The error terms sum to zero.

Estimating the dependent factor and the population parameters:

Ŷ_j = a + b_1·X_1j + b_2·X_2j

[Figure: data points scattered around the fitted regression plane in three dimensions.]

Page 15: Regression analysis

Multiple linear regression – general equations

In general, a finite number (m) of independent variables may be used to estimate the hyperplane:

Y_j = β_0 + Σ_{i=1}^{m} β_i·X_ij + ε_j

The number of sample points must be at least two more than the number of variables.

Page 16: Regression analysis

Multiple linear regression – collinearity

Adding age: adj. R² = 0.352
Adding thigh: adj. R² = 0.352?

Coefficients(a)

                                                                95.0% Confidence Interval for B
Model            B        Std. Error   Beta     t        Sig.   Lower Bound   Upper Bound
1   (Constant)   -9.001   2.449                 -3.676   .000   -13.813       -4.190
    Waist          .168    .043        .201      3.923   .000      .084         .252
    Hip            .252    .031        .411      8.012   .000      .190         .313
    Age           -.064    .018       -.126     -3.492   .001     -.101        -.028

a. Dependent Variable: BMI

Coefficients(a)

                                                                95.0% Confidence Interval for B
Model            B        Std. Error   Beta     t        Sig.   Lower Bound   Upper Bound
1   (Constant)    3.581   1.784                  2.007   .045      .075        7.086
    Waist          .168    .043        .201      3.923   .000      .084         .252
    Age           -.064    .018       -.126     -3.492   .001     -.101        -.028
    Thigh          .252    .031        .411      8.012   .000      .190         .313

a. Dependent Variable: BMI
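
The twin tables above are the signature of collinearity: thigh carries the same information as hip, so the fit cannot tell them apart. A common diagnostic is the variance inflation factor (VIF); a sketch with statsmodels on hypothetical predictors:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Hypothetical predictors; in the slides these would be waist, hip and age.
    waist = np.array([78.0, 82.0, 85.0, 88.0, 95.0, 91.0])
    hip   = np.array([94.0, 97.0, 99.0, 102.0, 106.0, 104.0])
    age   = np.array([34.0, 41.0, 29.0, 50.0, 45.0, 38.0])

    X = sm.add_constant(np.column_stack([waist, hip, age]))
    for i, name in zip(range(1, X.shape[1]), ["waist", "hip", "age"]):
        # VIF = 1 / (1 - R²) from regressing predictor i on the others;
        # values far above ~10 flag a nearly redundant predictor.
        print(name, variance_inflation_factor(X, i))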

Page 17: Regression analysis

Assumptions

1. Dependent variable must be metric continuous.
2. Independent variables must be continuous or ordinal.
3. Linear relationship between the dependent and all independent variables.
4. Residuals must have a constant spread.
5. Residuals are normally distributed.
6. Independent variables are not perfectly correlated with each other.

Page 18: Regression analysis

Multiple linear regression in SPSS

Page 19: Regression analysis

Multiple linear regression in SPSS

Page 20: Regression analysis

Non-parametric correlation


Page 21: Regression analysis

Ranked Correlation

Kendall's τ and Spearman's r_s

Correlation lies between -1 and 1, where -1 indicates perfect inverse correlation, 0 indicates no correlation, and 1 indicates perfect correlation.

Pearson is the correlation method for normal data. Remember the assumptions:

1. Dependent variable must be metric continuous.
2. Independent variables must be continuous or ordinal.
3. Linear relationship between the dependent and all independent variables.
4. Residuals must have a constant spread.
5. Residuals are normally distributed.

Page 22: Regression analysis

Kendall's τ – An example

Page 23: Regression analysis

Kendall's τ – An example

S = P - Q

τ = S / ( n(n-1) / 2 )
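
In Python, Kendall's τ is available in scipy; a sketch on two hypothetical rank variables:

    from scipy.stats import kendalltau

    # Hypothetical paired observations (e.g., two raters' rankings).
    a = [1, 3, 2, 5, 4, 6, 8, 7, 9, 10]
    b = [2, 1, 4, 3, 6, 5, 7, 9, 8, 10]

    # In the absence of ties this equals (P - Q) / (n(n-1)/2).
    tau, p_value = kendalltau(a, b)
    print("tau =", tau, "p =", p_value)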

Page 24: Regression analysis

Spearman – the same example

d²:  1  4  9  1  1  1  9  9  1  16

r_s = 1 - 6·Σd² / (n³ - n) = 1 - (6 · 52) / (10³ - 10) = 0.6848
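
The slide's arithmetic is easy to verify from the listed d² values:

    # Squared rank differences from the slide (n = 10 pairs).
    d2 = [1, 4, 9, 1, 1, 1, 9, 9, 1, 16]
    n = len(d2)

    r_s = 1 - 6 * sum(d2) / (n ** 3 - n)   # r_s = 1 - 6·Σd² / (n³ - n)
    print(round(r_s, 4))                   # 0.6848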

Page 25: Regression analysis

Correlation in SPSS

Page 26: Regression analysis

Correlation in SPSS

Correlations

                              a        b
a    Pearson Correlation      1        .685*
     Sig. (2-tailed)                   .029
     N                        10       10
b    Pearson Correlation      .685*    1
     Sig. (2-tailed)          .029
     N                        10       10

*. Correlation is significant at the 0.05 level (2-tailed).

Correlations

                                                  a        b
Kendall's tau_b   a   Correlation Coefficient     1.000    .511*
                      Sig. (2-tailed)             .        .040
                      N                           10       10
                  b   Correlation Coefficient     .511*    1.000
                      Sig. (2-tailed)             .040     .
                      N                           10       10
Spearman's rho    a   Correlation Coefficient     1.000    .685*
                      Sig. (2-tailed)             .        .029
                      N                           10       10
                  b   Correlation Coefficient     .685*    1.000
                      Sig. (2-tailed)             .029     .
                      N                           10       10

*. Correlation is significant at the 0.05 level (2-tailed).

Page 27: Regression analysis

Logistic regression


Page 28: Regression analysis

Logistic Regression

• What if the dependent variable is categorical, and especially binary?

• Use some interpolation method?

• Linear regression cannot help us.

Page 29: Regression analysis

The sigmoidal curve

p = 1 / (1 + e^(-z))

z = β_0 + β_1·x_1 + ... + β_n·x_n

[Figure: sigmoidal curve of p against x for β_0 = 0, β_1 = 1.]
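
A minimal Python sketch of the sigmoidal (logistic) curve; it tabulates p over a few x values for β_0 = 0 and β_1 = 1:

    import numpy as np

    def sigmoid(x, b0=0.0, b1=1.0):
        """Logistic curve p = 1 / (1 + exp(-z)) with z = b0 + b1*x."""
        z = b0 + b1 * x
        return 1.0 / (1.0 + np.exp(-z))

    for x in (-6, -2, 0, 2, 6):
        print(x, round(sigmoid(x), 3))   # rises from ~0 to ~1, with p = 0.5 at x = 0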

Page 30: Regression analysis

The sigmoidal curve

• The intercept basically just 'scales' the input variable.

p = 1 / (1 + e^(-z))

z = β_0 + β_1·x_1 + ... + β_n·x_n

[Figure: sigmoidal curves for β_1 = 1 with β_0 = 0, β_0 = 2 and β_0 = -2.]

Page 31: Regression analysis

The sigmoidal curve

• The intercept basically just 'scales' the input variable.

• Large regression coefficient → the risk factor strongly influences the probability.

p = 1 / (1 + e^(-z))

z = β_0 + β_1·x_1 + ... + β_n·x_n

[Figure: sigmoidal curves for β_0 = 0 with β_1 = 1, β_1 = 2 and β_1 = 0.5.]

Page 32: Regression analysis

The sigmoidal curve

• The intercept basically just 'scales' the input variable.

• Large regression coefficient → the risk factor strongly influences the probability.

• Positive regression coefficient → the risk factor increases the probability.

• Logistic regression uses maximum likelihood estimation, not least squares estimation.

p = 1 / (1 + e^(-z))

z = β_0 + β_1·x_1 + ... + β_n·x_n

[Figure: sigmoidal curves for β_0 = 0 with β_1 = 1 and β_1 = -1.]

Page 33: Regression analysis

Does age influence the diagnosis? Continuous independent variable

Variables in the Equation

                                                                      95% C.I. for EXP(B)
                      B        S.E.    Wald      df   Sig.   Exp(B)   Lower    Upper
Step 1a   Age          .109    .010    108.745   1    .000   1.115    1.092    1.138
          Constant   -4.213    .423     99.097   1    .000    .015

a. Variable(s) entered on step 1: Age.

p = 1 / (1 + e^(-z)),  z = B_0 + B_1·age
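
Plugging the table's estimates into the model gives the predicted probability at any age; a sketch (the choice of ages is arbitrary):

    import math

    B0, B1 = -4.213, 0.109   # Constant and Age coefficients from the SPSS table

    def p_malignant(age):
        z = B0 + B1 * age
        return 1.0 / (1.0 + math.exp(-z))

    for age in (20, 40, 60):
        print(age, round(p_malignant(age), 3))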

Page 34: Regression analysis

Does previous intake of OCP influence the diagnosis? Categorical independent variable

Variables in the Equation

                                                                    95% C.I. for EXP(B)
                      B       S.E.    Wald    df   Sig.   Exp(B)    Lower    Upper
Step 1a   OCP(1)     -.311    .180    2.979   1    .084    .733     .515     1.043
          Constant    .233    .123    3.583   1    .058   1.263

a. Variable(s) entered on step 1: OCP.

p = 1 / (1 + e^(-z)),  z = B_0 + B_1·OCP

If OCP(1) = 1:  p(Y = 1) = 1 / (1 + e^(-(0.233 - 0.311))) = 0.4805
If OCP(1) = 0:  p(Y = 1) = 1 / (1 + e^(-0.233)) = 0.5580

Page 35: Regression analysis

Odds ratio

odds = p / (1 - p) = e^z

odds ratio = e^(B_0 + B_1) / e^(B_0) = e^(B_1) = e^(-0.311) = 0.7327
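
Both probabilities and the odds ratio can be verified in a few lines using the B values from the table:

    import math

    B0, B1 = 0.233, -0.311   # Constant and OCP(1) from the SPSS table

    p1 = 1 / (1 + math.exp(-(B0 + B1)))   # OCP(1) = 1 -> 0.4805
    p0 = 1 / (1 + math.exp(-B0))          # OCP(1) = 0 -> 0.5580

    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    print(round(p1, 4), round(p0, 4), round(odds_ratio, 4))  # odds ratio = e^B1 = 0.7327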

Page 36: Regression analysis

Multiple logistic regression

Variables in the Equation

                                                                      95% C.I. for EXP(B)
                      B        S.E.    Wald      df   Sig.   Exp(B)   Lower    Upper
Step 1a   Age          .123    .011    115.343   1    .000   1.131    1.106    1.157
          BMI          .083    .019     18.732   1    .000   1.087    1.046    1.128
          OCP          .528    .219      5.808   1    .016   1.695    1.104    2.603
          Constant   -6.974    .762     83.777   1    .000    .001

a. Variable(s) entered on step 1: Age, BMI, OCP.

p = 1 / (1 + e^(-z)),  z = B_0 + B_1·OCP + B_2·age + B_3·BMI

Page 37: Regression analysis

Predicting the diagnosis by logistic regression

What is the probability that the tumor of a 50-year-old woman who has been using OCP and has a BMI of 26 is malignant?

z = -6.974 + 0.123·50 + 0.083·26 + 0.528·1 = 1.862
p = 1 / (1 + e^(-1.862)) = 0.866

(Coefficients taken from the "Variables in the Equation" table on page 36.)
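
A few lines verify the prediction from the table's coefficients:

    import math

    # Coefficients from the SPSS table: Constant, Age, BMI, OCP.
    B0, B_age, B_bmi, B_ocp = -6.974, 0.123, 0.083, 0.528

    z = B0 + B_age * 50 + B_bmi * 26 + B_ocp * 1
    p = 1 / (1 + math.exp(-z))
    print(round(z, 3), round(p, 3))   # z = 1.862, p ≈ 0.866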

Page 38: Regression analysis


Logistic regression in SPSS

Page 39: Regression analysis


Logistic regression in SPSS