What you've always wanted to know about logistic regression analysis, but were afraid to ask... February 1, 2010. Gerrit Rooks, Sociology of Innovation, Innovation Sciences & Industrial Engineering. Phone: 5509. Email: [email protected]

What you've always wanted to know about logistic regression analysis, but were afraid to ask


Page 1: What you've always wanted to know about logistic regression analysis, but were afraid to ask

What you've always wanted to know about logistic regression analysis, but were afraid to ask...

February 1, 2010

Gerrit Rooks
Sociology of Innovation
Innovation Sciences & Industrial Engineering
Phone: 5509
Email: [email protected]

Page 2: What you've always wanted to know about logistic regression analysis, but were afraid to ask

This Lecture

• Why logistic regression analysis?
• The logistic regression model
• Estimation
• Goodness of fit
• An example

Page 3: What you've always wanted to know about logistic regression analysis, but were afraid to ask

What's the difference between 'normal' regression and logistic regression?

Regression analysis:
– Relate one or more independent (predictor) variables to a dependent (outcome) variable

Page 4: What you've always wanted to know about logistic regression analysis, but were afraid to ask

What's the difference between 'normal' regression and logistic regression?

• Often you will be confronted with outcome variables that are dichotomous:
  – success vs. failure
  – employed vs. unemployed
  – promoted or not
  – sick or healthy
  – pass or fail an exam

Page 5: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Example: Relationship between hours studied for exam and success

Hours   # Failed exam   # Passed exam   Total # students   Prob. pass exam
28      4               2               6                  .33
29      3               2               5                  .40
30      2               7               9                  .78
31      2               7               9                  .78
32      4               16              20                 .80
33      1               14              15                 .93

Page 6: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Linear regression analysis: Why is this wrong?
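One reason a straight line is a poor model here is that it does not respect the 0 to 1 range of a probability. The sketch below fits an ordinary least-squares line to the six pass proportions from the hours-studied table and extrapolates slightly beyond the observed hours. The unweighted fit on grouped proportions and the function names are assumptions made for illustration, not the analysis shown on the slide.

```python
# Hours studied and observed proportion passing, from the hours-studied table.
hours = [28, 29, 30, 31, 32, 33]
prob_pass = [0.33, 0.40, 0.78, 0.78, 0.80, 0.93]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (unweighted, illustrative)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(hours, prob_pass)


def linear_prediction(h):
    """Predicted 'probability' of passing according to the straight line."""
    return intercept + slope * h


# Just outside the observed range the line leaves the interval [0, 1].
for h in (24, 34):
    print(h, round(linear_prediction(h), 2))
```

With these data the line predicts a value above 1 at 34 hours and below 0 at 24 hours, neither of which is a valid probability.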

Page 7: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Logistic Regression: The better alternative

Page 8: What you've always wanted to know about logistic regression analysis, but were afraid to ask


Page 9: What you've always wanted to know about logistic regression analysis, but were afraid to ask

The logistic regression equation: predicting probabilities

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

P(Y): predicted probability (always between 0 and 1)
b_0 + b_1 X_1: similar to regression analysis
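As a minimal sketch of how the equation turns a linear predictor into a probability, the function below evaluates it for one predictor. The function name is hypothetical, and the coefficients are borrowed from the study-hours example later in this lecture purely for illustration.

```python
import math


def logistic_probability(b0, b1, x):
    """P(Y) = 1 / (1 + e^-(b0 + b1 * x)) for a single predictor x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Illustrative coefficients (taken from the study-hours example in this lecture).
b0, b1 = -6.44, 0.39
for x in (0, 10, 20, 40):
    print(x, round(logistic_probability(b0, b1, x), 3))
```

Whatever value the linear part b0 + b1 * x takes, the result stays strictly between 0 and 1.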

Page 10: What you've always wanted to know about logistic regression analysis, but were afraid to ask

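The Logistic function: Sometimes authors rearrange the model

$$P(Y) = \frac{e^{(b_0 + b_1 X_1)}}{1 + e^{(b_0 + b_1 X_1)}} = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

or also

$$\ln\left(\frac{p(y=1)}{1 - p(y=1)}\right) = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n$$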
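The rearranged forms describe the same model. A short check, writing z = b_0 + b_1 X_1 (z is shorthand introduced here, not notation from the slides):

```latex
P(Y) = \frac{1}{1 + e^{-z}}
     = \frac{e^{z}}{e^{z}\,(1 + e^{-z})}
     = \frac{e^{z}}{1 + e^{z}}

\frac{P(Y)}{1 - P(Y)}
     = \frac{e^{z} / (1 + e^{z})}{1 / (1 + e^{z})}
     = e^{z}
\quad\Longrightarrow\quad
\ln\left(\frac{P(Y)}{1 - P(Y)}\right) = z = b_0 + b_1 X_1
```

So the log of the odds is a linear function of the predictors, which is why the last form looks like an ordinary regression equation.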

Page 11: What you've always wanted to know about logistic regression analysis, but were afraid to ask

How do we estimate coefficients? Maximum-likelihood estimation

• Parameters are estimated by 'fitting' models, based on the available predictors, to the observed data

• The chosen model fits the data best, i.e. is closest to the data

• Fit is determined by the so-called log likelihood statistic

Page 12: What you've always wanted to know about logistic regression analysis, but were afraid to ask

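Maximum likelihood estimation: The log-likelihood statistic

$$LL = \sum_{i=1}^{N} \{ Y_i \ln(P(Y_i)) + (1 - Y_i) \ln[1 - P(Y_i)] \}$$

LL is never positive. Large values of LL in absolute terms (far below 0) indicate poor fit of the model; values close to 0 indicate good fit.

HOWEVER, THIS STATISTIC CANNOT BE USED TO EVALUATE THE FIT OF A SINGLE MODEL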
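As a small worked illustration of the formula, each observation contributes one term to the sum. The probabilities .85 and .53 are taken from the study-hours example later in this lecture; the rounding is mine.

```latex
Y_i = 1,\; P(Y_i) = .85:\quad 1 \cdot \ln(.85) + 0 \cdot \ln(1 - .85) = \ln(.85) \approx -0.16

Y_i = 0,\; P(Y_i) = .53:\quad 0 \cdot \ln(.53) + 1 \cdot \ln(1 - .53) = \ln(.47) \approx -0.755
```

An observation the model predicts well contributes a value close to 0; one it predicts poorly contributes a larger negative value.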

Page 13: What you've always wanted to know about logistic regression analysis, but were afraid to ask

An example to illustrate maximum likelihood and the log likelihood statistic

Suppose we know hours spent studying and the outcome of an exam

Quantity of Study Hours   Outcome
3                         0
34                        1
17                        0
6                         0
12                        0
15                        1
26                        1
29                        1

Page 14: What you've always wanted to know about logistic regression analysis, but were afraid to ask

In ML different values for the parameters are 'tried'

Let's look at two possibilities: 1. b0 = 0 & b1 = 0.05; 2. b0 = -6.44 & b1 = 0.39

Model 1:
$$P(Y) = \frac{1}{1 + e^{-(0.05 X_1)}}$$

Model 2:
$$P(Y) = \frac{1}{1 + e^{-(-6.44 + 0.39 X_1)}}$$

Quantity of Study Hours   Outcome   Predicted probability     Predicted probability
                                    (b0 = 0; b1 = 0.05)       (b0 = -6.44; b1 = 0.39)
3                         0         .53                       .01
34                        1         .85                       .99
17                        0         .71                       .53
6                         0         .57                       .02
12                        0         .65                       .14
15                        1         .68                       .34
26                        1         .79                       .97
29                        1         .81                       .99

Page 15: What you've always wanted to know about logistic regression analysis, but were afraid to ask

We are now able to calculate the log likelihood statistic

$$LL = \sum_{i=1}^{N} \{ Y_i \ln(P(Y_i)) + (1 - Y_i) \ln[1 - P(Y_i)] \}$$

Quantity of Study Hours   Outcome   Predicted probability   LL
                                    (b0 = 0; b1 = 0.05)     (b0 = 0; b1 = 0.05)
3                         0         .53                     -.75
34                        1         .85                     -.16
17                        0         .71                     -1.24
6                         0         .57                     -.84
12                        0         .65                     -1.05
15                        1         .68                     -.39
26                        1         .79                     -.24
29                        1         .81                     -.21

Page 16: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Two models and their log likelihood statistic

Outcome   Pr                     LL                     Pr                         LL
          (b0 = 0; b1 = 0.05)    (b0 = 0; b1 = 0.05)    (b0 = -6.44; b1 = 0.39)    (b0 = -6.44; b1 = 0.39)
0         .53                    -.75                   .01                        -.01
1         .85                    -.16                   .99                        -.01
0         .71                    -1.24                  .53                        -.75
0         .57                    -.84                   .02                        -.02
0         .65                    -1.05                  .14                        -.15
1         .68                    -.39                   .34                        -1.08
1         .79                    -.24                   .97                        -.03
1         .81                    -.21                   .99                        -.01
∑                                -4.88                                             -2.07

Based on a clever algorithm the model with the best fit (LL closest to 0) is chosen
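The comparison of the two candidate models can be sketched in a few lines. The code recomputes the predicted probabilities and the log likelihood for both parameter pairs on the eight study-hours observations; the function names are hypothetical, and nothing here is the 'clever algorithm' itself, only the evaluation of two fixed candidates.

```python
import math

# Study hours and exam outcome (1 = passed) from the example data.
hours = [3, 34, 17, 6, 12, 15, 26, 29]
passed = [0, 1, 0, 0, 0, 1, 1, 1]


def predicted_probability(b0, b1, x):
    """Logistic model with one predictor."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def log_likelihood(b0, b1, xs, ys):
    """LL = sum of Y*ln(P) + (1 - Y)*ln(1 - P) over all observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predicted_probability(b0, b1, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


ll_model_1 = log_likelihood(0.0, 0.05, hours, passed)
ll_model_2 = log_likelihood(-6.44, 0.39, hours, passed)

print("model 1:", round(ll_model_1, 2))
print("model 2:", round(ll_model_2, 2))
print("better fit:", "model 2" if ll_model_2 > ll_model_1 else "model 1")
```

The results land close to the -4.88 and -2.07 in the table. Small differences are expected because the table sums values computed from probabilities already rounded to two decimals, while the code keeps full precision.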

Page 17: What you've always wanted to know about logistic regression analysis, but were afraid to ask

After estimation: How do I determine significance?

$$P(Y) = \frac{1}{1 + e^{-(-6.44 + 0.39 \cdot \text{studyhours})}}$$

• Obviously SPSS does all the work for you

• How to interpret output of SPSS

• Two major issues
  1. Overall model fit
     – Between model comparisons
     – Pseudo R-square
     – Predictive accuracy / classification test
  2. Coefficients
     – Wald test
     – Likelihood ratio test
     – Odds ratios

Page 18: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Model fit: Between model comparison

$$\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$$

LL(New): model fit full model
LL(baseline): model fit reduced model

The log-likelihood ratio test statistic can be used to test the fit of a model

The test statistic has a chi-square distribution

Page 19: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Model fit

$$\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$$

The log-likelihood ratio test statistic can be used to test the fit of a model

Model fit full model:
$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

Model fit reduced model:
$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$
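For the study-hours example the test can be sketched as follows. Two assumptions are made loudly here: the coefficients b0 = -6.44 and b1 = 0.39 are treated as the maximum-likelihood estimates of the full model, and the reduced model is fitted by setting its constant to the log odds of the overall pass rate. The p-value shortcut used is only valid for one degree of freedom.

```python
import math

hours = [3, 34, 17, 6, 12, 15, 26, 29]
passed = [0, 1, 0, 0, 0, 1, 1, 1]


def log_likelihood(b0, b1, xs, ys):
    """Log likelihood of a one-predictor logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Full model: coefficients from the lecture, assumed to be the ML estimates.
ll_new = log_likelihood(-6.44, 0.39, hours, passed)

# Reduced model: constant only, so every student gets the overall pass rate.
pass_rate = sum(passed) / len(passed)
b0_baseline = math.log(pass_rate / (1.0 - pass_rate))
ll_baseline = log_likelihood(b0_baseline, 0.0, hours, passed)

chi_square = 2.0 * (ll_new - ll_baseline)
df = 1  # one extra estimated parameter (b1)

# Upper tail of a chi-square distribution with df = 1 only.
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print(round(chi_square, 2), df, round(p_value, 4))
```

Under these assumptions the chi-square is about 7 with 1 degree of freedom, so adding study hours improves the fit significantly at the conventional .05 level.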

Page 20: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Between model comparison

• Estimate a null model
  • Baseline model
• Estimate an improved model
  • This model contains more variables
• Assess the difference in -2LL between the models
  • This difference follows a chi-square distribution
  • degrees of freedom = # estimated parameters in proposed model – # estimated parameters in null model

$$\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$$

Model fit full model:
$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2)}}$$

Model fit reduced model:
$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

Page 21: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Overall model fit: R and R2

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = \frac{\text{SS due to regression}}{\text{Total SS}}$$

R2 in multiple regression is a measure of the variance explained by the model

Page 22: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Overall model fit: pseudo R2

Just like in multiple regression, logit R2 ranges from 0.0 to 1.0
– Cox and Snell
  • cannot theoretically reach 1
– Nagelkerke
  • adjusted so that it can reach 1

$$R^2_{\text{LOGIT}} = \frac{-2LL(\text{Model})}{-2LL(\text{Original})}$$

LL(Original): log-likelihood of the model before any predictors were entered
LL(Model): log-likelihood of the model that you want to test

NOTE: R2 in logistic regression tends to be (even) smaller than in multiple regression
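The two named measures can be reproduced from quantities that SPSS reports. The sketch below uses the standard definitions of the Cox and Snell and Nagelkerke R2 (these definitions are not spelled out on the slides) together with the -2 log likelihood of 48.662, the model chi-square of 54.977 and N = 75 from the penalty-kick output later in this lecture.

```python
import math

n = 75                     # number of cases in the penalty-kick data
neg2ll_model = 48.662      # -2LL of the model with predictors (SPSS Model Summary)
model_chi_square = 54.977  # drop in -2LL relative to the constant-only model

# -2LL of the constant-only model follows from the two numbers above.
neg2ll_original = neg2ll_model + model_chi_square

# Cox and Snell: 1 - (L_original / L_model)^(2/n)
cox_snell = 1.0 - math.exp(-model_chi_square / n)

# Largest value Cox and Snell can take for these data: 1 - L_original^(2/n)
cox_snell_max = 1.0 - math.exp(-neg2ll_original / n)

# Nagelkerke rescales Cox and Snell so that the maximum becomes 1.
nagelkerke = cox_snell / cox_snell_max

print(round(cox_snell, 3), round(nagelkerke, 3))
```

Both values agree with the .520 and .694 in the SPSS Model Summary, and the rescaling shows why Nagelkerke is always at least as large as Cox and Snell.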

Page 23: What you've always wanted to know about logistic regression analysis, but were afraid to ask

What is a small or large R and R2? Strength of correlation

Small 0.10 to 0.29

Medium 0.30 to 0.49

Large 0.50 to 1.00

Page 24: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Overall model fit: Classification table

How well does the model predict outcomes?

SPSS output: Classification Table (a)

                                       Predicted: Result of Penalty Kick
Observed                               Missed Penalty   Scored Penalty   Percentage Correct
Step 1  Result of Penalty Kick
          Missed Penalty               30               5                85.7
          Scored Penalty               7                33               82.5
        Overall Percentage                                               84.0

a. The cut value is .500

This means that we assume that if our model predicts that a player will score with a probability of .51 (above .5), the prediction will be a score (lower than .50 is a miss).
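The percentages in the classification table follow directly from the four cell counts. A minimal sketch, with a hypothetical helper name and the counts copied from the SPSS output:

```python
# counts[observed][predicted], copied from the SPSS classification table.
counts = {
    "missed": {"missed": 30, "scored": 5},
    "scored": {"missed": 7, "scored": 33},
}


def percentage_correct(table):
    """Return the per-outcome and overall percentage of correct predictions."""
    per_outcome = {}
    correct = 0
    total = 0
    for observed, row in table.items():
        row_total = sum(row.values())
        per_outcome[observed] = 100.0 * row[observed] / row_total
        correct += row[observed]
        total += row_total
    return per_outcome, 100.0 * correct / total


per_outcome, overall = percentage_correct(counts)
print({k: round(v, 1) for k, v in per_outcome.items()}, round(overall, 1))
```

The diagonal cells (30 and 33) are the correct predictions; the off-diagonal cells (5 and 7) are the misclassified penalties.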

Page 25: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Testing significance of coefficients: The Wald statistic: not really good

• In linear regression analysis this statistic is used to test significance

• In logistic regression something similar exists

• However, when b is large, the standard error tends to become inflated, so the Wald statistic is underestimated (Type II errors are more likely)

$$\text{Wald} = \frac{b}{SE_b}$$

Wald: t-distribution
b: estimate
SE_b: standard error of estimate

Page 26: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Likelihood ratio test: an alternative way to test significance of a coefficient

$$\chi^2 = 2[LL(\text{With}) - LL(\text{Without})]$$

To avoid Type II errors, for some variables it is best to use the likelihood ratio test

Model with variable:
$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

Model without variable:
$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

Page 27: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Before we go to the example: A recap

• Logistic regression
  – dichotomous outcome
  – logistic function
  – log-likelihood / maximum likelihood

• Model fit
  – likelihood ratio test (compare LL of models)
  – Pseudo R-square
  – Classification table
  – Wald test

Page 28: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Illustration with SPSS

• Penalty kicks data, variables:
  – Scored: outcome variable
    • 0 = penalty missed, and 1 = penalty scored
  – Pswq: degree to which a player worries
  – Previous: percentage of penalties scored by a particular player in their career

Page 29: What you've always wanted to know about logistic regression analysis, but were afraid to ask

SPSS OUTPUT: Logistic Regression

Case Processing Summary (tells you something about the number of observations and missings)

Unweighted Cases (a)                         N     Percent
Selected Cases     Included in Analysis      75    100.0
                   Missing Cases             0     .0
                   Total                     75    100.0
Unselected Cases                             0     .0
Total                                        75    100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding

Original Value    Internal Value
Missed Penalty    0
Scored Penalty    1

Page 30: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Block 0: Beginning Block

These tables are based on the empty model, i.e. only the constant in the model:

$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

Classification Table (a, b)

                                       Predicted: Result of Penalty Kick
Observed                               Missed Penalty   Scored Penalty   Percentage Correct
Step 0  Result of Penalty Kick
          Missed Penalty               0                35               .0
          Scored Penalty               0                40               100.0
        Overall Percentage                                               53.3

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                   B      S.E.   Wald   df   Sig.   Exp(B)
Step 0  Constant   .134   .231   .333   1    .564   1.143

Variables not in the Equation (these variables will be entered in the model later on)

                                      Score    df   Sig.
Step 0  Variables   previous          34.109   1    .000
                    pswq              34.193   1    .000
        Overall Statistics            41.558   2    .000

Page 31: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Block 1: Method = Enter

$$\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$$

Omnibus Tests of Model Coefficients

                 Chi-square   df   Sig.
Step 1  Step     54.977       2    .000
        Block    54.977       2    .000
        Model    54.977       2    .000

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      48.662 (a)          .520                   .694

a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Annotations on the output:
• Block is useful to check significance of individual coefficients, see Field
• New model
• this is the test statistic
• after dividing by -2
• Note: Nagelkerke is larger than Cox
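The model chi-square in the Omnibus Tests table can be checked by hand. The sketch below rebuilds the baseline log likelihood from the 40 scored and 35 missed penalties in the Block 0 classification table, takes the -2 log likelihood of 48.662 for the new model from the Model Summary, and forms the difference. The closed-form p-value used here is specific to 2 degrees of freedom; variable names are hypothetical.

```python
import math

n_scored = 40          # observed scored penalties (Block 0 classification table)
n_missed = 35          # observed missed penalties
neg2ll_new = 48.662    # -2 Log likelihood of the model with previous and pswq

# Constant-only model: every kick gets the overall proportion scored.
n = n_scored + n_missed
p_scored = n_scored / n
ll_baseline = n_scored * math.log(p_scored) + n_missed * math.log(1.0 - p_scored)
neg2ll_baseline = -2.0 * ll_baseline

# Chi-square = 2[LL(New) - LL(baseline)] = (-2LL baseline) - (-2LL new)
chi_square = neg2ll_baseline - neg2ll_new
df = 2  # two added predictors: previous and pswq

# Upper tail of a chi-square distribution with df = 2 only.
p_value = math.exp(-chi_square / 2.0)

print(round(neg2ll_baseline, 3), round(chi_square, 3), p_value)
```

The difference matches the reported chi-square of 54.977 to within rounding, and the p-value is far below .001, which SPSS displays as .000.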

Page 32: What you've always wanted to know about logistic regression analysis, but were afraid to ask

Block 1: Method = Enter (Continued)

Variables in the Equation

                       B       S.E.    Wald    df   Sig.   Exp(B)
Step 1 (a)  previous   .065    .022    8.609   1    .003   1.067
            pswq       -.230   .080    8.309   1    .004   .794
            Constant   1.280   1.670   .588    1    .443   3.598

a. Variable(s) entered on step 1: previous, pswq.

B: estimates
S.E.: standard error estimates
Sig.: significance based on Wald statistic
Exp(B): change in odds

Classification Table (a)

                                       Predicted: Result of Penalty Kick
Observed                               Missed Penalty   Scored Penalty   Percentage Correct
Step 1  Result of Penalty Kick
          Missed Penalty               30               5                85.7
          Scored Penalty               7                33               82.5
        Overall Percentage                                               84.0

a. The cut value is .500

Predictive accuracy has improved (was 53%)
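The Wald, Sig. and Exp(B) columns can be approximated from B and S.E. alone. The sketch below assumes, as is standard for SPSS logistic regression output, that the Wald column is the squared ratio (B / S.E.)^2 evaluated against a chi-square distribution with 1 degree of freedom, and that Exp(B) is e raised to the power B. Function names are hypothetical.

```python
import math

# (B, S.E.) as displayed in the Variables in the Equation table.
coefficients = {
    "previous": (0.065, 0.022),
    "pswq": (-0.230, 0.080),
    "Constant": (1.280, 1.670),
}


def wald_chi_square(b, se):
    """Squared ratio of the estimate to its standard error."""
    return (b / se) ** 2


def p_value_df1(chi_square):
    """Upper tail of a chi-square distribution with df = 1 only."""
    return math.erfc(math.sqrt(chi_square / 2.0))


def odds_ratio(b):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(b)


for name, (b, se) in coefficients.items():
    wald = wald_chi_square(b, se)
    print(name, round(wald, 2), round(p_value_df1(wald), 3), round(odds_ratio(b), 3))
```

The results are close to, but not identical with, the SPSS columns, because the displayed B and S.E. are rounded to three decimals. Read as odds ratios: each extra percentage point of previously scored penalties multiplies the odds of scoring by about 1.07, and each extra point of worry multiplies them by about 0.79.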

Page 33: What you've always wanted to know about logistic regression analysis, but were afraid to ask

How is the classification table constructed?

$$\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}$$

Classification Table (a)

                                       Predicted: Result of Penalty Kick
Observed                               Missed Penalty   Scored Penalty   Percentage Correct
Step 1  Result of Penalty Kick
          Missed Penalty               30               5                85.7
          Scored Penalty               7                33               82.5
        Overall Percentage                                               84.0

a. The cut value is .500

The two off-diagonal cells (5 and 7): oops, wrong prediction

Page 34: What you've always wanted to know about logistic regression analysis, but were afraid to ask

How is the classification table constructed?

$$\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}$$

pswq   previous   scored   Predict. prob.
18     56         1        .68
17     35         1        .41
20     45         0        .40
10     42         0        .85
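The predicted probabilities for these four players, and the resulting classification at the cut value of .5, can be recomputed as a sketch. The coefficients are the rounded estimates from the SPSS output, so the probabilities may differ from the slide in the second decimal; the function names are hypothetical.

```python
import math

# Rounded estimates from the Variables in the Equation table.
B_CONSTANT = 1.280
B_PREVIOUS = 0.065
B_PSWQ = -0.230
CUT_VALUE = 0.5

# (pswq, previous, scored) for the four example players.
players = [(18, 56, 1), (17, 35, 1), (20, 45, 0), (10, 42, 0)]


def predicted_probability(pswq, previous):
    """Predicted probability of scoring under the fitted logistic model."""
    z = B_CONSTANT + B_PREVIOUS * previous + B_PSWQ * pswq
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, cut_value=CUT_VALUE):
    """Predict 1 (scored) above the cut value, otherwise 0 (missed)."""
    return 1 if probability > cut_value else 0


for pswq, previous, scored in players:
    p = predicted_probability(pswq, previous)
    predicted = classify(p)
    verdict = "correct" if predicted == scored else "wrong"
    print(pswq, previous, scored, round(p, 2), predicted, verdict)
```

Two of the four players are classified correctly. The second player scored despite a predicted probability below .5, and the fourth missed despite a probability above .5; cases like these fill the off-diagonal cells of the classification table.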

Page 35: What you've always wanted to know about logistic regression analysis, but were afraid to ask

How is the classification table constructed?

pswq   previous   scored   Predict. prob.   predicted
18     56         1        .68              1
17     35         1        .41              0
20     45         0        .40              0
10     42         0        .85              1

Classification Table (a)

                                       Predicted: Result of Penalty Kick
Observed                               Missed Penalty   Scored Penalty   Percentage Correct
Step 1  Result of Penalty Kick
          Missed Penalty               30               5                85.7
          Scored Penalty               7                33               82.5
        Overall Percentage                                               84.0

a. The cut value is .500