What you've always wanted to know about logistic regression analysis, but were afraid to ask


February 1, 2010

Gerrit Rooks

Sociology of Innovation

Innovation Sciences & Industrial Engineering

Phone: 5509

email: [email protected]

This Lecture

• Why logistic regression analysis?
• The logistic regression model
• Estimation
• Goodness of fit
• An example


What's the difference between 'normal' regression and logistic regression?

Regression analysis:
– Relate one or more independent (predictor) variables to a dependent (outcome) variable


• Often you will be confronted with outcome variables that are dichotomous:
  – success vs failure
  – employed vs unemployed
  – promoted or not
  – sick or healthy
  – pass or fail an exam


Example

Relationship between hours studied for exam and success

Hours   # Failed exam   # Passed exam   Total # students   Prob. pass exam
28      4               2               6                  .33
29      3               2               5                  .40
30      2               7               9                  .78
31      2               7               9                  .78
32      4               16              20                 .80
33      1               14              15                 .93


Linear regression analysis

Why is this wrong?
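One reason can be sketched with the exam table above. The snippet assumes the straight line is fitted by ordinary least squares to the individual pass/fail outcomes (0 = failed, 1 = passed); the slide does not say how its line was fitted, so treat this as an illustration only. A fitted straight line is not restricted to the range 0 to 1, so it can return "probabilities" below 0 or above 1 for study hours outside the observed range.

```python
# Grouped exam data from the table above: (hours, number failed, number passed).
groups = [(28, 4, 2), (29, 3, 2), (30, 2, 7), (31, 2, 7), (32, 4, 16), (33, 1, 14)]

# Expand to one (hours, outcome) pair per student: 0 = failed, 1 = passed.
data = []
for hours, failed, passed in groups:
    data += [(hours, 0)] * failed + [(hours, 1)] * passed

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Ordinary least squares with a single predictor (assumed fitting method).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
sxx = sum((x - mean_x) ** 2 for x, _ in data)
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def linear_prediction(hours):
    """Predicted 'probability' of passing under the straight-line model."""
    return intercept + slope * hours


# Hours 20 and 40 lie outside the observed range of 28 to 33.
for hours in (20, 30, 40):
    print(hours, round(linear_prediction(hours), 2))
```

The predictions for 20 and 40 hours fall outside the 0 to 1 range, which is not possible for a probability.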


Logistic Regression

The better alternative


The logistic regression equation

Predicting probabilities

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

• P(Y): predicted probability (always between 0 and 1)
• b_0 + b_1 X_1: similar to regression analysis
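A minimal sketch of the equation, to show that the predicted probability stays between 0 and 1 however extreme the predictor is. The coefficient values below are hypothetical and chosen only for illustration; they do not come from the lecture.

```python
import math


def predicted_probability(x, b0, b1):
    """P(Y) = 1 / (1 + e^-(b0 + b1*x)) for a single predictor x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Hypothetical coefficients, for illustration only.
b0, b1 = -6.0, 0.4

xs = (-50, 0, 15, 50)
probabilities = [predicted_probability(x, b0, b1) for x in xs]
for x, p in zip(xs, probabilities):
    print(x, p)
```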


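The logistic function

Sometimes authors rearrange the model

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}} = \frac{e^{(b_0 + b_1 X_1)}}{1 + e^{(b_0 + b_1 X_1)}}$$

or also

$$\ln\left(\frac{p(y=1)}{1 - p(y=1)}\right) = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n$$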
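The rearranged forms describe the same model. As a quick numerical check (a sketch, with `z` standing for the linear part b0 + b1*X1 and arbitrary test values for it), both probability forms give the same result, and the log of the odds returns the linear part:

```python
import math


def p_form1(z):
    """1 / (1 + e^-z)"""
    return 1.0 / (1.0 + math.exp(-z))


def p_form2(z):
    """e^z / (1 + e^z)"""
    return math.exp(z) / (1.0 + math.exp(z))


def log_odds(p):
    """ln(p / (1 - p))"""
    return math.log(p / (1.0 - p))


# Arbitrary values of the linear part, for illustration only.
z_values = (-3.0, -0.5, 0.0, 1.2, 4.0)
for z in z_values:
    p = p_form1(z)
    print(z, round(p, 4), round(p_form2(z), 4), round(log_odds(p), 4))
```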


How do we estimate coefficients?

Maximum-likelihood estimation

• Parameters are estimated by 'fitting' models, based on the available predictors, to the observed data

• The chosen model fits the data best, i.e. is closest to the data

• Fit is determined by the so-called log likelihood statistic


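Maximum likelihood estimation

The log-likelihood statistic

$$LL = \sum_{i=1}^{N} \{ Y_i \ln(P(Y_i)) + (1 - Y_i) \ln[1 - P(Y_i)] \}$$

LL is never positive. Values of LL far below 0 (large in absolute value) indicate poor fit of the model; values close to 0 indicate good fit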

HOWEVER, THIS STATISTIC CANNOT BE USED TO EVALUATE THE FIT OF A SINGLE MODEL


An example to illustrate maximum likelihood and the log likelihood statistic

Suppose we know hours spent studying and the outcome of an exam

Quantity of Study Hours   Outcome
3                         0
34                        1
17                        0
6                         0
12                        0
15                        1
26                        1
29                        1


In ML different values for the parameters are 'tried'

Let's look at two possibilities: (1) b0 = 0 & b1 = 0.05; (2) b0 = -6.44 & b1 = 0.39

Model 1:

$$P(Y) = \frac{1}{1 + e^{-(0.05 X)}}$$

Model 2:

$$P(Y) = \frac{1}{1 + e^{-(-6.44 + 0.39 X)}}$$

Quantity of Study Hours   Outcome   Predicted probability (b0=0; b1=0.05)   Predicted probability (b0=-6.44; b1=0.39)
3                         0         .53                                     .01
34                        1         .85                                     .99
17                        0         .71                                     .53
6                         0         .57                                     .02
12                        0         .65                                     .14
15                        1         .68                                     .34
26                        1         .79                                     .97
29                        1         .81                                     .99
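The predicted probabilities for the two candidate parameter sets can be recomputed as follows. This is a sketch using the study-hours data and the rounded coefficients from the slides; because those coefficients are rounded, a few recomputed values differ from the slide's by 0.01 to 0.02.

```python
import math

hours = [3, 34, 17, 6, 12, 15, 26, 29]
outcome = [0, 1, 0, 0, 0, 1, 1, 1]


def predicted_probability(x, b0, b1):
    """P(Y) = 1 / (1 + e^-(b0 + b1*x))"""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# The two candidate (b0, b1) pairs tried on the slide.
candidates = {"model 1": (0.0, 0.05), "model 2": (-6.44, 0.39)}

predictions = {
    name: [predicted_probability(x, b0, b1) for x in hours]
    for name, (b0, b1) in candidates.items()
}

for i, (x, y) in enumerate(zip(hours, outcome)):
    p1 = predictions["model 1"][i]
    p2 = predictions["model 2"][i]
    print(f"{x:3d} {y} {p1:.2f} {p2:.2f}")
```

Model 2 assigns low probabilities to most failures and high probabilities to most passes, whereas model 1 predicts a probability above .5 for every student.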



We are now able to calculate the log likelihood statistic

$$LL = \sum_{i=1}^{N} \{ Y_i \ln(P(Y_i)) + (1 - Y_i) \ln[1 - P(Y_i)] \}$$

Quantity of Study Hours   Outcome   Predicted probability (b0=0; b1=0.05)   LL (b0=0; b1=0.05)
3                         0         .53                                     -.75
34                        1         .85                                     -.16
17                        0         .71                                     -1.24
6                         0         .57                                     -.84
12                        0         .65                                     -1.05
15                        1         .68                                     -.39
26                        1         .79                                     -.24
29                        1         .81                                     -.21
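The per-student contributions and their sum can be reproduced with a few lines. This sketch takes the outcomes and the rounded predicted probabilities of the first model (b0 = 0, b1 = 0.05) directly from the slide; the function name `log_likelihood` is just a label chosen here.

```python
import math

outcome = [0, 1, 0, 0, 0, 1, 1, 1]
# Predicted probabilities for b0 = 0, b1 = 0.05, as printed on the slide.
prob_model1 = [0.53, 0.85, 0.71, 0.57, 0.65, 0.68, 0.79, 0.81]


def contribution(y, p):
    """One student's term: Y*ln(P) + (1 - Y)*ln(1 - P)."""
    return y * math.log(p) + (1 - y) * math.log(1 - p)


def log_likelihood(outcomes, probabilities):
    """Sum of the individual contributions."""
    return sum(contribution(y, p) for y, p in zip(outcomes, probabilities))


for y, p in zip(outcome, prob_model1):
    print(y, p, round(contribution(y, p), 2))

ll_model1 = log_likelihood(outcome, prob_model1)
print("LL =", round(ll_model1, 2))
```

A student who passed contributes ln(P), a student who failed contributes ln(1 - P), so confident wrong predictions are penalised most.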


Two models and their log likelihood statistic

Outcome   Pr (b0=0; b1=0.05)   LL (b0=0; b1=0.05)   Pr (b0=-6.44; b1=0.39)   LL (b0=-6.44; b1=0.39)
0         .53                  -.75                 .01                      -.01
1         .85                  -.16                 .99                      -.01
0         .71                  -1.24                .53                      -.75
0         .57                  -.84                 .02                      -.02
0         .65                  -1.05                .14                      -.15
1         .68                  -.39                 .34                      -1.08
1         .79                  -.24                 .97                      -.03
1         .81                  -.21                 .99                      -.01
∑                              -4.88                                         -2.07

Based on a clever algorithm the model with the best fit (LL closest to 0) is chosen
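The slide does not name the algorithm. A common choice for logistic regression is Newton-Raphson iteration, and the sketch below assumes that method (with a simple step-halving safeguard added here); it is not claimed to be what SPSS does in every detail. It searches for the b0 and b1 that bring LL as close to 0 as possible for the study-hours data.

```python
import math

hours = [3, 34, 17, 6, 12, 15, 26, 29]
outcome = [0, 1, 0, 0, 0, 1, 1, 1]


def softplus(z):
    """Numerically stable ln(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_likelihood(b0, b1):
    # Y*ln(P) + (1-Y)*ln(1-P) simplifies to Y*z - ln(1 + e^z), z = b0 + b1*x.
    return sum(y * (b0 + b1 * x) - softplus(b0 + b1 * x)
               for x, y in zip(hours, outcome))


def gradient_and_hessian(b0, b1):
    """First derivatives of LL and the entries of the 2x2 matrix X'WX."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(hours, outcome):
        z = b0 + b1 * x
        p = math.exp(z - softplus(z))      # P(Y), computed stably
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1), (h00, h01, h11)


def fit(max_iter=50, tol=1e-10):
    b0 = b1 = 0.0                          # starting values
    for _ in range(max_iter):
        (g0, g1), (h00, h01, h11) = gradient_and_hessian(b0, b1)
        det = h00 * h11 - h01 * h01
        # Newton step: solve (X'WX) * step = gradient for the 2x2 case.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        # Halve the step while it would clearly lower the log-likelihood.
        t = 1.0
        current = log_likelihood(b0, b1)
        while log_likelihood(b0 + t * s0, b1 + t * s1) < current - 1e-12 and t > 1e-8:
            t /= 2.0
        b0, b1 = b0 + t * s0, b1 + t * s1
        if max(abs(t * s0), abs(t * s1)) < tol:
            break
    return b0, b1


b0_hat, b1_hat = fit()
print("b0 =", round(b0_hat, 2), "b1 =", round(b1_hat, 2))
print("LL =", round(log_likelihood(b0_hat, b1_hat), 2))
```

Each iteration 'tries' a new pair of parameter values, and the search stops when the estimates no longer change appreciably.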


After estimation

How do I determine significance?

• Obviously SPSS does all the work for you

• How to interpret output of SPSS

• Two major issues
  1. Overall model fit
     – Between model comparisons
     – Pseudo R-square
     – Predictive accuracy / classification test
  2. Coefficients
     – Wald test
     – Likelihood ratio test
     – Odds ratios

$$P(Y) = \frac{1}{1 + e^{-(-6.44 + 0.39 \cdot \text{studyhours})}}$$


Model fit: Between model comparison

The log-likelihood ratio test statistic can be used to test the fit of a model

$$\chi^2 = 2[LL(\text{New}) - LL(\text{baseline})]$$

• LL(New): model fit full model
• LL(baseline): model fit reduced model

The test statistic has a chi-square distribution


Model fit

The log-likelihood ratio test statistic can be used to test the fit of a model

Model fit full model:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

Model fit reduced model:

$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

Between model comparison

• Estimate a null model
  • Baseline model
• Estimate an improved model
  • This model contains more variables
• Assess the difference in -2LL between the models
  • This difference follows a chi-square distribution
  • degrees of freedom = # estimated parameters in proposed model - # estimated parameters in null model

Model fit full model:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2)}}$$

Model fit reduced model:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$
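The steps of a between-model comparison can be sketched with the study-hours example from earlier in the lecture. The assumptions here: the null model contains only a constant, the proposed model uses the slide's rounded coefficients (b0 = -6.44, b1 = 0.39), and the p-value uses the closed-form upper tail of a chi-square distribution with 1 degree of freedom.

```python
import math

hours = [3, 34, 17, 6, 12, 15, 26, 29]
outcome = [0, 1, 0, 0, 0, 1, 1, 1]


def log_likelihood(b0, b1):
    total = 0.0
    for x, y in zip(hours, outcome):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Null (baseline) model: constant only. With 4 passes out of 8 students the
# maximum-likelihood constant is ln(4/4) = 0, so every P(Y) equals 0.5.
ll_null = log_likelihood(0.0, 0.0)

# Proposed model: constant plus study hours (rounded slide coefficients).
ll_new = log_likelihood(-6.44, 0.39)

# Difference in -2LL between the models.
chi_square = (-2 * ll_null) - (-2 * ll_new)

# Degrees of freedom: parameters in proposed model minus parameters in null model.
df = 2 - 1

# For df = 1 the chi-square upper-tail probability equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print("chi-square =", round(chi_square, 2), "df =", df, "p =", round(p_value, 4))
```

A small p-value indicates that adding study hours improves the fit significantly compared with the constant-only model.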


Overall model fit

R and R2

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$$

• Numerator: SS due to regression
• Denominator: Total SS

R2 in multiple regression is a measure of the variance explained by the model


Overall model fit

Pseudo R2

Just like in multiple regression, logit R2 ranges from 0.0 to 1.0

$$R^2_{\text{LOGIT}} = \frac{-2LL(\text{Model})}{-2LL(\text{Original})}$$

• LL(Original): log-likelihood of model before any predictors were entered
• LL(Model): log-likelihood of the model that you want to test

– Cox and Snell
  • cannot theoretically reach 1
– Nagelkerke
  • adjusted so that it can reach 1

NOTE: R2 in logistic regression tends to be (even) smaller than in multiple regression


What is a small or large R and R2?

Strength of correlation

Small    0.10 to 0.29
Medium   0.30 to 0.49
Large    0.50 to 1.00


Overall model fit

Classification table

How well does the model predict outcomes?

SPSS output:

Classification Table(a)

                                                  Predicted
                                                  Result of Penalty Kick
Observed                                          Missed Penalty   Scored Penalty   Percentage Correct
Step 1   Result of Penalty Kick   Missed Penalty  30               5                85.7
                                  Scored Penalty  7                33               82.5
         Overall Percentage                                                         84.0
a. The cut value is .500

This means that we assume that if our model predicts that a player will score with a probability of .51 (above .5) the prediction will be a score (lower than .50 is a miss).
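The percentages in the classification table follow directly from the four cell counts. A small sketch of that arithmetic (the dictionary layout and function name are choices made here, not SPSS terminology):

```python
# Counts from the classification table: outer key = observed, inner key = predicted.
table = {
    "missed": {"missed": 30, "scored": 5},
    "scored": {"missed": 7, "scored": 33},
}


def percentage_correct(observed):
    """Share of cases with this observed outcome that were predicted correctly."""
    row = table[observed]
    return 100.0 * row[observed] / sum(row.values())


total = sum(sum(row.values()) for row in table.values())
correct = sum(table[outcome][outcome] for outcome in table)
overall = 100.0 * correct / total

print("missed:", round(percentage_correct("missed"), 1))
print("scored:", round(percentage_correct("scored"), 1))
print("overall:", round(overall, 1))
```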


Testing significance of coefficients

The Wald statistic: not really good

• In linear regression analysis this statistic is used to test significance
• In logistic regression something similar exists
• However, when b is large, the standard error tends to become inflated, so the Wald statistic is underestimated (Type II errors are more likely)

$$\text{Wald} = \frac{b}{SE_b}$$

• Wald: t-distribution
• b: estimate
• SE_b: standard error of estimate
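As an illustration, the ratio can be computed from the estimates and standard errors reported in the SPSS output later in this lecture (penalty kicks example). Two points in this sketch are not on the slide and are stated here as assumptions: the Wald column that SPSS prints is the squared ratio, evaluated as a chi-square with 1 degree of freedom, and the p-value below uses the large-sample normal approximation for b / SE. Small differences from the printed values arise because B and S.E. are rounded.

```python
import math

# (B, S.E., Wald, Sig.) as printed in the SPSS "Variables in the Equation" table.
coefficients = {
    "previous": (0.065, 0.022, 8.609, 0.003),
    "pswq": (-0.230, 0.080, 8.309, 0.004),
    "Constant": (1.280, 1.670, 0.588, 0.443),
}

results = {}
for name, (b, se, wald_spss, sig_spss) in coefficients.items():
    ratio = b / se                       # the ratio shown on the slide
    wald = ratio ** 2                    # squared ratio, as in the SPSS Wald column
    # Two-sided p-value under the normal approximation (assumption).
    p_value = math.erfc(abs(ratio) / math.sqrt(2))
    results[name] = (ratio, wald, p_value)
    print(f"{name:9s} b/SE = {ratio:6.3f}  (b/SE)^2 = {wald:6.3f}  p = {p_value:.3f}")
```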


Likelihood ratio test

An alternative way to test significance of a coefficient

$$\chi^2 = 2[LL(\text{With}) - LL(\text{Without})]$$

Model with variable:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

Model without variable:

$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

To avoid Type II errors, for some variables it is best to use the likelihood ratio test


Before we go to the example

A recap

• Logistic regression
  – dichotomous outcome
  – logistic function
  – log-likelihood / maximum likelihood
• Model fit
  – likelihood ratio test (compare LL of models)
  – Pseudo R-square
  – Classification table
  – Wald test


Illustration with SPSS

• Penalty kicks data, variables:
  – Scored: outcome variable
    • 0 = penalty missed, and 1 = penalty scored
  – Pswq: degree to which a player worries
  – Previous: percentage of penalties scored by a particular player in their career


SPSS OUTPUT Logistic Regression

Case Processing Summary

Unweighted Cases(a)                         N     Percent
Selected Cases     Included in Analysis     75    100.0
                   Missing Cases            0     .0
                   Total                    75    100.0
Unselected Cases                            0     .0
Total                                       75    100.0
a. If weight is in effect, see classification table for the total number of cases.

Tells you something about the number of observations and missing values

Dependent Variable Encoding

Original Value    Internal Value
Missed Penalty    0
Scored Penalty    1


Block 0: Beginning Block

This table is based on the empty model, i.e. only the constant in the model

$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

Classification Table(a,b)

                                                  Predicted
                                                  Result of Penalty Kick
Observed                                          Missed Penalty   Scored Penalty   Percentage Correct
Step 0   Result of Penalty Kick   Missed Penalty  0                35               .0
                                  Scored Penalty  0                40               100.0
         Overall Percentage                                                         53.3
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                    B      S.E.   Wald   df   Sig.   Exp(B)
Step 0   Constant   .134   .231   .333   1    .564   1.143

Variables not in the Equation

                                     Score    df   Sig.
Step 0   Variables      previous     34.109   1    .000
                        pswq         34.193   1    .000
         Overall Statistics          41.558   2    .000

These variables will be entered in the model later on


Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step     54.977       2    .000
         Block    54.977       2    .000
         Model    54.977       2    .000

• Chi-square: this is the test statistic
• Block is useful to check significance of individual coefficients, see Field

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      48.662(a)           .520                   .694
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

• -2 Log likelihood: new model; the log-likelihood LL is obtained after dividing by -2
• Note: Nagelkerke is larger than Cox & Snell
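The two pseudo R-square values can be reproduced from the numbers in this output. The sketch below assumes the standard definitions of the Cox & Snell and Nagelkerke measures (the slides do not spell them out) and recovers the baseline -2LL by adding the model chi-square to the -2LL of the new model.

```python
import math

n = 75                    # cases included in the analysis
neg2ll_model = 48.662     # -2 Log likelihood of the new model (Model Summary)
chi_square = 54.977       # Omnibus test: (-2LL baseline) - (-2LL new model)
neg2ll_baseline = neg2ll_model + chi_square

# Log-likelihood of the new model: divide -2LL by -2.
ll_model = neg2ll_model / -2

# Cox & Snell: 1 - exp(-(difference in -2LL) / n); its maximum is below 1.
cox_snell = 1 - math.exp(-chi_square / n)

# Nagelkerke: Cox & Snell divided by its maximum attainable value.
max_cox_snell = 1 - math.exp(-neg2ll_baseline / n)
nagelkerke = cox_snell / max_cox_snell

# For df = 2 the chi-square upper-tail probability is exp(-x / 2).
p_value = math.exp(-chi_square / 2)

print("LL new model =", round(ll_model, 3))
print("Cox & Snell  =", round(cox_snell, 3))
print("Nagelkerke   =", round(nagelkerke, 3))
print("p-value      =", p_value)
```

Because Nagelkerke divides by a number smaller than 1, it is always at least as large as Cox & Snell.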



Block 1: Method = Enter (Continued)

Variables in the Equation

                        B       S.E.    Wald    df   Sig.   Exp(B)
Step 1(a)   previous    .065    .022    8.609   1    .003   1.067
            pswq        -.230   .080    8.309   1    .004   .794
            Constant    1.280   1.670   .588    1    .443   3.598
a. Variable(s) entered on step 1: previous, pswq.

• B: estimates
• S.E.: standard error estimates
• Sig.: significance based on Wald statistic
• Exp(B): change in odds

Classification Table

                                                  Predicted
                                                  Result of Penalty Kick
Observed                                          Missed Penalty   Scored Penalty   Percentage Correct
Step 1   Result of Penalty Kick   Missed Penalty  30               5                85.7
                                  Scored Penalty  7                33               82.5
         Overall Percentage                                                         84.0

Predictive accuracy has improved (was 53%)
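The Exp(B) column is the exponentiated coefficient, e^B, which is the factor by which the odds of scoring are multiplied when the predictor increases by one unit. A short sketch using the rounded B values from the table (so the results can differ from the printed Exp(B) in the third decimal):

```python
import math

# B as printed in the "Variables in the Equation" table.
b = {"previous": 0.065, "pswq": -0.230, "Constant": 1.280}
# Exp(B) as printed, for comparison.
exp_b_spss = {"previous": 1.067, "pswq": 0.794, "Constant": 3.598}

odds_ratio = {name: math.exp(value) for name, value in b.items()}

for name in b:
    change = 100 * (odds_ratio[name] - 1)   # percentage change in the odds
    print(f"{name:9s} e^B = {odds_ratio[name]:.3f} "
          f"(SPSS: {exp_b_spss[name]:.3f}), change in odds = {change:+.1f}%")

# Effect of a larger step: ten percentage points more in 'previous'.
ten_point_factor = math.exp(10 * b["previous"])
print("odds multiplier for +10 on previous:", round(ten_point_factor, 2))
```

A value above 1 (previous) means the odds of scoring increase with the predictor; a value below 1 (pswq) means they decrease.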



How is the classification table constructed?

$$\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}$$

In the Step 1 classification table, the two off-diagonal cells are wrong predictions ("oops"): 5 missed penalties were predicted as scored, and 7 scored penalties were predicted as missed.



$$\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}$$

pswq   previous   scored   Predict. prob.
18     56         1        .68
17     35         1        .41
20     45         0        .40
10     42         0        .85
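The predicted probabilities for these four players can be recomputed from the fitted equation. This sketch uses the rounded coefficients from the SPSS table, so a result may differ from the slide by 0.01.

```python
import math


def predicted_probability(previous, pswq):
    """Pred. P(Y) = 1 / (1 + e^-(1.28 + 0.065*previous - 0.230*pswq))"""
    z = 1.28 + 0.065 * previous - 0.230 * pswq
    return 1.0 / (1.0 + math.exp(-z))


# (pswq, previous, scored, probability printed on the slide)
players = [
    (18, 56, 1, 0.68),
    (17, 35, 1, 0.41),
    (20, 45, 0, 0.40),
    (10, 42, 0, 0.85),
]

computed = [predicted_probability(previous, pswq)
            for pswq, previous, _, _ in players]

for (pswq, previous, scored, slide), p in zip(players, computed):
    print(pswq, previous, scored, round(p, 2), slide)
```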



pswq   previous   scored   Predict. prob.   predicted
18     56         1        .68              1
17     35         1        .41              0
20     45         0        .40              0
10     42         0        .85              1
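Turning predicted probabilities into predicted outcomes, and then into the cells of a classification table, can be sketched as follows. The four cases and the cut value of .5 come from the slides; the variable names are choices made here.

```python
CUT_VALUE = 0.5

# (scored, predicted probability) for the four players shown above.
cases = [(1, 0.68), (1, 0.41), (0, 0.40), (0, 0.85)]

# Cell counts keyed by (observed, predicted).
counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
predictions = []
for scored, probability in cases:
    # Above the cut value the prediction is a score (1), otherwise a miss (0).
    predicted = 1 if probability > CUT_VALUE else 0
    predictions.append(predicted)
    counts[(scored, predicted)] += 1

correct = counts[(0, 0)] + counts[(1, 1)]
print("predicted:", predictions)
print("counts (observed, predicted):", counts)
print("percentage correct:", 100.0 * correct / len(cases))
```

Applying the same rule to all 75 players yields the counts in the Step 1 classification table.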



