
7/30/2019 Logit powerpoint

1/48

Introduction to Logistic Regression

John Whitehead

Department of Economics

Appalachian State University


2/48

Outline

Introduction and Description

Some Potential Problems and Solutions

Writing Up the Results


3/48

Introduction and Description

Why use logistic regression?

Estimation by maximum likelihood

Interpreting coefficients

Hypothesis testing

Evaluating the performance of the model


4/48

Why use logistic regression?

There are many important research topics for which the dependent variable is "limited."

For example: voting, morbidity or mortality, and participation data are not continuous or normally distributed.

Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).


5/48

The Linear Probability Model

In the OLS regression:

$Y = \alpha + \beta X + e$, where $Y \in \{0, 1\}$

The error terms are heteroskedastic

$e$ is not normally distributed because $Y$ takes on only two values

The predicted probabilities can be greater than 1 or less than 0


6/48

An Example: Hurricane Evacuations

Q: EVAC

Did you evacuate your home to go someplace safer before Hurricane Dennis (Floyd) hit?

1 YES
2 NO
3 DON'T KNOW
4 REFUSED


7/48

The Data

EVAC  PETS  MOBLHOME  TENURE  EDUC
0     1     0         16      16
0     1     0         26      12
0     1     1         11      13
1     1     1         1       10
1     0     0         5       12
0     0     0         34      12
0     0     0         3       14
0     1     0         3       16
0     1     0         10      12
0     0     0         2       18
0     0     0         2       12
0     1     0         25      16
1     1     1         20      12


8/48

OLS Results

Dependent Variable: EVAC

Variable      B        t-value
(Constant)    0.190    2.121
PETS         -0.137   -5.296
MOBLHOME      0.337    8.963
TENURE       -0.003   -2.973
EDUC          0.003    0.424
FLOYD         0.198    8.147

R2            0.145
F-stat        36.010


9/48

Problems:

Predicted values outside the 0,1 range

Descriptive Statistics

                                 N      Minimum   Maximum   Mean
Unstandardized Predicted Value   1070   -.08498   .76027    .2429907
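The out-of-range problem can be reproduced from the OLS coefficients reported on the "OLS Results" slide. The sketch below plugs one household into the fitted linear probability model; the household's covariate values are assumptions chosen for illustration, not a row from the survey.

```python
# OLS coefficients from the "OLS Results" slide (dependent variable: EVAC).
OLS_COEF = {
    "const": 0.190,
    "PETS": -0.137,
    "MOBLHOME": 0.337,
    "TENURE": -0.003,
    "EDUC": 0.003,
    "FLOYD": 0.198,
}

def lpm_predict(household):
    """Linear probability model prediction: a weighted sum with no bounds."""
    return OLS_COEF["const"] + sum(OLS_COEF[k] * v for k, v in household.items())

# Hypothetical household (assumed values): pet owner, not in a mobile home,
# 60 years of tenure, 12 years of education, surveyed about Dennis.
long_tenure = {"PETS": 1, "MOBLHOME": 0, "TENURE": 60, "EDUC": 12, "FLOYD": 0}

p_hat = lpm_predict(long_tenure)
print(f"predicted 'probability' = {p_hat:.3f}")  # negative, so not a valid probability
```

Nothing in the linear form keeps the prediction inside the unit interval, which is why the minimum predicted value in the table above falls below zero.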


10/48

Heteroskedasticity

[Figure: unstandardized residual (vertical axis, -20 to 10) plotted against TENURE (horizontal axis, 0 to 100)]

Park Test

Dependent Variable: LNESQ

Variable      B        t-stat
(Constant)   -2.34    -15.99
LNTNSQ       -0.20    -6.19


11/48

The Logistic Regression Model

The "logit" model solves these problems:

$\ln[p/(1-p)] = \alpha + \beta X + e$

$p$ is the probability that the event $Y$ occurs, $p(Y=1)$

$p/(1-p)$ is the odds of the event

$\ln[p/(1-p)]$ is the log odds, or "logit"


12/48

More:

The logistic distribution constrains the estimated probabilities to lie between 0 and 1.

The estimated probability is:

$p = 1/[1 + \exp(-\alpha - \beta X)]$

If you let $\alpha + \beta X = 0$, then $p = .50$

As $\alpha + \beta X$ gets really big, $p$ approaches 1

As $\alpha + \beta X$ gets really small (large and negative), $p$ approaches 0
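The three limiting cases above can be checked numerically. This is a minimal sketch; the function name `logistic` and the grid of index values are illustrative choices, with `z` standing for the linear index alpha + beta * X.

```python
import math

def logistic(z):
    """Estimated probability p = 1 / (1 + exp(-z)), where z = alpha + beta * X."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate p over a range of index values to see the bounds at work.
for z in (-10, -2, 0, 2, 10):
    print(f"z = {z:>3}: p = {logistic(z):.5f}")
```

However large or small the index becomes, the result stays strictly between 0 and 1, unlike the linear probability model.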


13/48


14/48

Comparing LP and Logit Models

[Figure: the LP Model and the Logit Model plotted against a vertical axis marked at 0 and 1]


15/48

Maximum Likelihood Estimation (MLE)

MLE is a statistical method for estimating the coefficients of a model.

The likelihood function (L) measures the probability of observing the particular set of dependent variable values ($p_1, p_2, \ldots, p_n$) that occur in the sample:

$L = \mathrm{Prob}(p_1 \cdot p_2 \cdots p_n)$

The higher the L, the higher the probability of observing the ps in the sample.


16/48

MLE involves finding the coefficients ($\alpha$, $\beta$) that make the log of the likelihood function (LL < 0) as large as possible.

Or, equivalently, finding the coefficients that make -2 times the log of the likelihood function (-2LL) as small as possible.

The maximum likelihood estimates solve the following condition:

$\sum_{i=1}^{n} \{Y_i - p(Y_i = 1)\} X_i = 0$

summed over all observations, $i = 1, \ldots, n$
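One common way to solve this condition is Newton-Raphson iteration. The sketch below is an illustration under stated assumptions, not the routine SPSS uses: it handles a single regressor plus a constant, the function names are hypothetical, and the data are made up for the example rather than taken from the hurricane survey.

```python
import math

def fit_logit(x, y, iterations=25):
    """Newton-Raphson for ln[p/(1-p)] = alpha + beta * x with one regressor."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(alpha + beta * xi))) for xi in x]
        # Score (gradient of LL): sum of (Y - p) and sum of (Y - p) * X.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix entries (negative Hessian of LL), weights p(1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: coefficients += inverse(information) applied to the score.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

def log_likelihood(x, y, alpha, beta):
    """LL = sum of Y * ln(p) + (1 - Y) * ln(1 - p); always negative."""
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return ll

# Made-up data (an assumption for illustration only).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

alpha, beta = fit_logit(x, y)
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
print(f"LL = {log_likelihood(x, y, alpha, beta):.4f}")
print(f"-2LL = {-2 * log_likelihood(x, y, alpha, beta):.4f}")
```

At the returned coefficients both score sums are numerically zero, which is the condition stated above, and the log likelihood is higher (less negative) than at the starting values.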


17/48

Interpreting Coefficients

Since:

$\ln[p/(1-p)] = \alpha + \beta X + e$

The slope coefficient ($\beta$) is interpreted as the rate of change in the "log odds" as X changes, which is not very useful.

Since:

$p = 1/[1 + \exp(-\alpha - \beta X)]$

The marginal effect of a change in X on the probability is:

$\partial p / \partial X = f(\alpha + \beta X)\,\beta$

where $f(\cdot)$ is the logistic density, so that $\partial p / \partial X = \beta\, p\,(1-p)$.
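Because the marginal effect depends on where the household sits on the logistic curve, it has to be evaluated at specific covariate values. The sketch below uses the logit coefficients from the SPSS output slide in this deck; the household's covariate values and the function names are assumptions made for illustration.

```python
import math

# Logit coefficients (B) from the SPSS output slide in this deck.
COEF = {"const": -0.916, "PETS": -0.6593, "MOBLHOME": 1.5583,
        "TENURE": -0.0198, "EDUC": 0.0501}

def prob(household):
    """p = 1 / (1 + exp(-(alpha + sum of beta * X)))."""
    z = COEF["const"] + sum(COEF[k] * v for k, v in household.items())
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(household, var):
    """dp/dX = beta * p * (1 - p), evaluated at the household's covariates."""
    p = prob(household)
    return COEF[var] * p * (1.0 - p)

# Hypothetical household (assumed values, not a survey row).
household = {"PETS": 1, "MOBLHOME": 0, "TENURE": 10, "EDUC": 12}

print(f"p = {prob(household):.4f}")
print(f"dp/dTENURE = {marginal_effect(household, 'TENURE'):.5f}")
```

The same coefficient produces a smaller marginal effect for households whose predicted probability is close to 0 or 1, since p(1 - p) peaks at p = .50.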


18/48

An interpretation of the logit coefficient which is usually more intuitive is the "odds ratio"

Since:

$p/(1-p) = \exp(\alpha + \beta X)$

$\exp(\beta)$ is the effect of the independent variable on the "odds ratio": a one-unit increase in $X$ multiplies the odds $p/(1-p)$ by $\exp(\beta)$


19/48

From SPSS Output:

Variable     B         Exp(B)   1/Exp(B)
PETS        -0.6593    0.5172   1.933
MOBLHOME     1.5583    4.7508
TENURE      -0.0198    0.9804   1.020
EDUC         0.0501    1.0514
Constant    -0.916

Households without pets are 1.933 times more likely (in odds terms) to evacuate than those with pets.
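The Exp(B) and 1/Exp(B) columns follow directly from the B column. A minimal sketch of the arithmetic, using the coefficients as printed (small differences in the last digit can arise because the printed B values are rounded):

```python
import math

# Logit coefficients (B) as printed in the SPSS output above.
B = {"PETS": -0.6593, "MOBLHOME": 1.5583, "TENURE": -0.0198, "EDUC": 0.0501}

# Exp(B): the factor by which the odds change for a one-unit increase in X.
odds_ratios = {name: math.exp(b) for name, b in B.items()}

for name, ratio in odds_ratios.items():
    line = f"{name:<9} Exp(B) = {ratio:.4f}"
    if ratio < 1:
        # For negative coefficients the inverse is often easier to read.
        line += f"   1/Exp(B) = {1 / ratio:.3f}"
    print(line)
```

An Exp(B) below 1 means the variable lowers the odds of evacuating, which is why the inverse is reported for PETS and TENURE.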


20/48

Hypothesis Testing

The Wald statistic for the $\beta$ coefficient is:

$\text{Wald} = [\beta / \text{s.e.}_B]^2$

which is distributed chi-square with 1 degree of freedom.

The "Partial R" (in SPSS output) is

$R = \{(\text{Wald} - 2) / [-2LL(\alpha)]\}^{1/2}$


21/48

An Example:

Variable    B         S.E.     Wald      R         Sig      t-value
PETS       -0.6593    0.2012   10.732   -0.1127    0.0011   -3.28
MOBLHOME    1.5583    0.2874   29.39     0.1996    0.0000    5.42
TENURE     -0.0198    0.008    6.1238   -0.0775    0.0133   -2.48
EDUC        0.0501    0.0468   1.1483    0.0000    0.2839    1.07
Constant   -0.916     0.69     1.7624    1         0.1843   -1.33
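The Wald and R columns can be recomputed from B, S.E., and the beginning -2LL (687.35714 in the SPSS printout in this deck). In the sketch below, the sign convention (R takes the sign of B) and the zero floor (R is 0 when Wald is below 2) are inferred from the example table rather than stated in the formula, so treat them as assumptions.

```python
import math

BEGINNING_NEG2LL = 687.35714  # -2LL of the constant-only model (SPSS printout)

def wald(b, se):
    """Wald statistic: the squared ratio of a coefficient to its standard error."""
    return (b / se) ** 2

def partial_r(b, wald_stat, neg2ll_null=BEGINNING_NEG2LL):
    """Partial R; signed like B and floored at 0 (conventions assumed from the table)."""
    if wald_stat < 2:
        return 0.0
    return math.copysign(math.sqrt((wald_stat - 2) / neg2ll_null), b)

# (B, S.E., Wald as printed) for each variable in the example table.
rows = {
    "PETS": (-0.6593, 0.2012, 10.732),
    "MOBLHOME": (1.5583, 0.2874, 29.39),
    "TENURE": (-0.0198, 0.008, 6.1238),
    "EDUC": (0.0501, 0.0468, 1.1483),
}

for name, (b, se, printed_wald) in rows.items():
    print(f"{name:<9} Wald = {wald(b, se):7.3f}  t = {b / se:6.2f}  "
          f"R = {partial_r(b, printed_wald):7.4f}")
```

The recomputed Wald values differ slightly from the printed ones because the printed B and S.E. are rounded; the t-value column is simply B divided by S.E.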


22/48

Evaluating the Performance of the Model

There are several statistics which can be used for comparing alternative models or evaluating the performance of a single model:

Model Chi-Square

Percent Correct Predictions

Pseudo-$R^2$


23/48

Model Chi-Square

The model likelihood ratio (LR) statistic is

$LR[i] = -2[LL(\alpha) - LL(\alpha, \beta)]$

{Or, as you are reading SPSS printout:

LR[i] = [-2LL (of beginning model)] - [-2LL (of ending model)]}

The LR statistic is distributed chi-square with i degrees of freedom, where i is the number of independent variables

Use the Model Chi-Square statistic to determine if the overall model is statistically significant.


24/48


Beginning Block Number 1. Method: Enter

-2 Log Likelihood 687.35714

Variable(s) Entered on Step Number

1..  PETS      PETS
     MOBLHOME  MOBLHOME
     TENURE    TENURE
     EDUC      EDUC

Estimation terminated at iteration number 3 because
Log Likelihood decreased by less than .01 percent.

-2 Log Likelihood 641.842

         Chi-Square   df   Sign.
Model    45.515       4    0.0000
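The Model chi-square in this printout is the beginning -2LL minus the ending -2LL, and its significance is the upper-tail chi-square probability. The sketch below reproduces both with the standard library only; `chi2_sf` is a hypothetical helper written for this example (it uses the standard recurrence for the regularized incomplete gamma function at integer degrees of freedom), not an SPSS or library function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer df > x)."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

beginning_neg2ll = 687.35714  # constant-only model
ending_neg2ll = 641.842       # model with PETS, MOBLHOME, TENURE, EDUC
df = 4                        # number of independent variables

lr = beginning_neg2ll - ending_neg2ll
p_value = chi2_sf(lr, df)

print(f"Model chi-square = {lr:.3f}, df = {df}, p = {p_value:.2e}")
```

The p-value is far below .0001, which is why SPSS prints the significance as 0.0000.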


25/48

Percent Correct Predictions

The "Percent Correct Predictions" statistic assumes that if the estimated p is greater than or equal to .5 then the event is expected to occur, and not to occur otherwise.

By assigning these probabilities 0s and 1s and comparing these to the actual 0s and 1s, the % correct Yes, % correct No, and overall % correct scores are calculated.


26/48


                Predicted
Observed        0       1       % Correct
0               328     24      93.18%
1               139     44      24.04%
Overall                         69.53%
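The three percentages follow from the four cell counts. A minimal sketch of the arithmetic, with the .5 cutoff rule included for completeness (the function and variable names are illustrative):

```python
def classify(p, cutoff=0.5):
    """Predict the event (1) when the estimated probability reaches the cutoff."""
    return 1 if p >= cutoff else 0

# Cell counts from the classification table: table[observed][predicted].
table = {0: {0: 328, 1: 24}, 1: {0: 139, 1: 44}}

def percent_correct(table):
    """Return (% correct No, % correct Yes, overall % correct)."""
    pct_no = 100.0 * table[0][0] / sum(table[0].values())
    pct_yes = 100.0 * table[1][1] / sum(table[1].values())
    total = sum(sum(row.values()) for row in table.values())
    overall = 100.0 * (table[0][0] + table[1][1]) / total
    return pct_no, pct_yes, overall

pct_no, pct_yes, overall = percent_correct(table)
print(f"% correct No  = {pct_no:.2f}")
print(f"% correct Yes = {pct_yes:.2f}")
print(f"Overall       = {overall:.2f}")
```

The overall score is driven mostly by the non-evacuees here: the model classifies most households as 0, so it does well on observed 0s and poorly on observed 1s.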


27/48

Pseudo-$R^2$

One pseudo-$R^2$ statistic is McFadden's $R^2$:

McFadden's $R^2 = 1 - [LL(\alpha, \beta) / LL(\alpha)]$

{$= 1 - [-2LL(\alpha, \beta) / -2LL(\alpha)]$ (from SPSS printout)}

where the $R^2$ is a scalar measure which varies between 0 and (somewhat close to) 1, much like the $R^2$ in a LP model.


28/48


Beginning -2 LL           687.36
Ending -2 LL              641.84
Ending/Beginning          0.9338
McF. R2 = 1 - E./B.       0.0662
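The calculation in this table is a single ratio. A short sketch, with an illustrative function name:

```python
def mcfadden_r2(beginning_neg2ll, ending_neg2ll):
    """McFadden's R2 = 1 - [-2LL(ending model) / -2LL(beginning model)]."""
    return 1.0 - ending_neg2ll / beginning_neg2ll

ratio = 641.84 / 687.36
r2 = mcfadden_r2(687.36, 641.84)

print(f"Ending/Beginning = {ratio:.4f}")
print(f"McFadden's R2    = {r2:.4f}")
```

Because the factors of -2 cancel, the same value results from the raw log likelihoods.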


29/48

Some potential problems and solutions

Omitted Variable Bias

Irrelevant Variable Bias

Functional Form

Multicollinearity

Structural Breaks


30/48

Omitted Variable Bias

Omitted variable(s) can result in bias in the coefficient estimates. To test for omitted variables you can conduct a likelihood ratio test:

LR[q] = {[-2LL(constrained model, i=k-q)] - [-2LL(unconstrained model, i=k)]}

where LR is distributed chi-square with q degrees of freedom, with q = 1 or more omitted variables

{This test is conducted automatically by SPSS if you specify "blocks" of independent variables}


31/48

An Example:

Variable     B         Wald      Sig
PETS        -0.699     10.968    0.001
MOBLHOME     1.570     29.412    0.000
TENURE      -0.020     5.993     0.014
EDUC         0.049     1.079     0.299
CHILD        0.009     0.011     0.917
WHITE        0.186     0.422     0.516
FEMALE       0.018     0.008     0.928
Constant    -1.049     2.073     0.150

Beginning -2 LL   687.36
Ending -2 LL      641.41


32/48

Constructing the LR Test

Ending -2 LL Partial Model   641.84
Ending -2 LL Full Model      641.41
Block Chi-Square             0.43
DF                           3
Critical Value               11.345

Since the chi-squared value is less than the critical value, the set of coefficients is not statistically significant. The full model is not an improvement over the partial model.
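The block test is a difference of two ending -2LL values compared against a chi-square critical value. A minimal sketch using the numbers from this slide (the function name is illustrative; the critical value is the one shown on the slide for 3 degrees of freedom):

```python
def block_lr(neg2ll_constrained, neg2ll_unconstrained):
    """LR[q] = [-2LL(constrained model)] - [-2LL(unconstrained model)]."""
    return neg2ll_constrained - neg2ll_unconstrained

partial_neg2ll = 641.84   # PETS, MOBLHOME, TENURE, EDUC
full_neg2ll = 641.41      # adds CHILD, WHITE, FEMALE
q = 3                     # number of added variables = degrees of freedom
critical_value = 11.345   # from the slide

lr = block_lr(partial_neg2ll, full_neg2ll)
significant = lr > critical_value

print(f"Block chi-square = {lr:.2f}, df = {q}, significant: {significant}")
```

A statistic this far below the critical value indicates that the three added variables, taken together, do not improve the fit.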


33/48

Irrelevant Variable Bias

The inclusion of irrelevant variable(s) can result in poor model fit.

You can consult your Wald statistics or conduct a likelihood ratio test.


34/48

Functional Form

Errors in functional form can result in biased coefficient estimates and poor model fit.

You should try different functional forms by logging the independent variables, adding squared terms, etc.

Then consult the Wald statistics and model chi-square statistics to determine which model performs best.


35/48

Multicollinearity

The presence of multicollinearity will not lead to biased coefficients.

But the standard errors of the coefficients will be inflated.

If a variable which you think should be statistically significant is not, consult the correlation coefficients.

If two variables have a correlation coefficient greater than .6, .7, .8, etc., then try dropping the least theoretically important of the two.


36/48

Structural Breaks

You may have structural breaks in your data. Pooling the data imposes the restriction that an independent variable has the same effect on the dependent variable for different groups of data when the opposite may be true.

You can conduct a likelihood ratio test:

LR[i+1] = -2LL(pooled model) - [-2LL(sample 1) + -2LL(sample 2)]

where samples 1 and 2 are pooled, and i is the number of independent variables.


37/48

An Example

Is the evacuation behavior from Hurricanes Dennis and Floyd statistically equivalent?

                    Floyd     Dennis    Pooled
Variable            B         B         B
PETS               -0.66     -1.20     -0.79
MOBLHOME            1.56      2.00      1.62
TENURE             -0.02     -0.02     -0.02
EDUC                0.05     -0.04      0.02
Constant           -0.92     -0.78     -0.97

Beginning -2 LL     687.36    440.87    1186.64
Ending -2 LL        641.84    382.84    1095.26
Model Chi-Square    45.52     58.02     91.37


38/48

Constructing the LR Test

                  Floyd     Dennis    Pooled
Ending -2 LL      641.84    382.84    1095.26

Chi-Square        70.58     [Pooled - (Floyd + Dennis)]
DF                5
Critical Value    15.086    p = .01

Since the chi-squared value is greater than the critical value, the set of coefficients are statistically different. The pooled model is inappropriate.
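The structural break statistic can be reproduced from the three ending -2LL values. The sketch below is a minimal illustration with a hypothetical function name; it computes the statistic and its degrees of freedom (i + 1, counting the constant) and leaves the comparison with a chi-square table to the reader.

```python
def structural_break_lr(neg2ll_pooled, neg2ll_samples):
    """LR[i+1] = -2LL(pooled model) - [sum of -2LL over the separate samples]."""
    return neg2ll_pooled - sum(neg2ll_samples)

n_independent_vars = 4          # PETS, MOBLHOME, TENURE, EDUC
floyd, dennis, pooled = 641.84, 382.84, 1095.26

lr = structural_break_lr(pooled, [floyd, dennis])
df = n_independent_vars + 1     # slopes plus the constant

# 70.58 is far above any conventional .01 critical value for 5 df (about 15.09).
print(f"Chi-square = {lr:.2f}, df = {df}")
```

The test asks whether letting every coefficient differ between the two storms lowers -2LL by more than chance would allow; here it clearly does.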


39/48

What should you do?

Try adding a dummy variable:

FLOYD = 1 if Floyd, 0 if Dennis

Variable     B        Wald     Sig
PETS        -0.85     27.20    0.000
MOBLHOME     1.75     65.67    0.000
TENURE      -0.02     8.34     0.004
EDUC         0.02     0.27     0.606
FLOYD        1.26     59.08    0.000
Constant    -1.68     8.71     0.003


40/48

Writing Up Results

Present descriptive statistics in a table

Make it clear that the dependent variable is discrete (0, 1) and not continuous and that you will use logistic regression.

Logistic regression is a standard statistical procedure so you don't (necessarily) need to write out the formula for it. You also (usually) don't need to justify that you are using Logit instead of the LP model or Probit (similar to logit but based on the normal distribution [the tails are less fat]).


41/48


"The dependent variable which measures the willingness to evacuate is EVAC. EVAC is equal to 1 if the respondent evacuated their home during Hurricanes Floyd and Dennis and 0 otherwise. The logistic regression model is used to estimate the factors which influence evacuation behavior."


42/48

In the heading, state your dependent variable (dependent variable = EVAC) and that these are "logistic regression results."

Present coefficient estimates, t-statistics (or Wald, whichever you prefer), and (at least the) model chi-square statistic for overall model fit

If you are comparing several model specifications you should also present the % correct predictions and/or Pseudo-$R^2$ statistics to evaluate model performance

If you are comparing models with hypotheses about different blocks of coefficients or testing for structural breaks in the data, you could present the ending log-likelihood values.

Organize your regression results in a table:


43/48


Table 2. Logistic Regression Results

Dependent Variable = EVAC

Variable             B         B/S.E.
PETS                -0.6593    -3.28
MOBLHOME             1.5583     5.42
TENURE              -0.0198    -2.48
EDUC                 0.0501     1.07
Constant            -0.916     -1.33

Model Chi-Squared    45.515


44/48

When describing the statistics in the tables, point out the highlights for the reader.

What are the statistically significant variables?

"The results from Model 1 indicate that coastal residents behave according to risk theory. The coefficient on the MOBLHOME variable is positive and statistically significant at the p < .01 level (t-value = 5.42). Mobile home residents are 4.75 times more likely to evacuate.


45/48

Is the overall model statistically significant?

The overall model is significant at the .01 level according to the Model chi-square statistic. The model predicts 69.5% of the responses correctly. The McFadden's R2 is .066."


46/48

Which model is preferred?

"Model 2 includes three additional independent variables. According to the likelihood ratio test statistic, the partial model is superior to the full model in terms of overall model fit. The block chi-square statistic is not statistically significant at the .01 level (critical value = 11.35 [df=3]). The coefficients on the children, gender, and race variables are not statistically significant at standard levels."


47/48

Also

You usually don't need to discuss the magnitude of the coefficients, just the sign (+ or -) and statistical significance.

If your audience is unfamiliar with the extensions (beyond SPSS or SAS printouts) to logistic regression, discuss the calculation of the statistics in an appendix or footnote or provide a citation.

Always state the degrees of freedom for your likelihood-ratio (chi-square) test.


48/48

References

http://personal.ecu.edu/whiteheadj/data/logit/

http://personal.ecu.edu/whiteheadj/data/logit/logitpap.htm

E-mail: [email protected]