Page 1

Unit 4b: Fitting the Logistic Model to Data

© Andrew Ho, Harvard Graduate School of Education Unit 4b – Slide 1

http://xkcd.com/953/

Page 2

• Building the Logistic Regression Model
• Likelihood Ratio Chi-Square and Pseudo-R²
• Interpreting Logistic Regression Model Coefficients
• Probabilities, Odds, Log Odds, and Log Odds Ratios


Course Roadmap: Unit 4b

Multiple Regression Analysis (MRA): Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ

• Do your residuals meet the required assumptions? Test for residual normality, and use influence statistics to detect atypical data points.
• If your residuals are not independent, replace OLS by GLS regression analysis, use individual growth modeling, or specify a multilevel model. If time is a predictor, you need discrete-time survival analysis.
• If your outcome vs. predictor relationship is non-linear, transform the outcome or predictor, or use non-linear regression analysis.
• If you have more predictors than you can deal with: create taxonomies of fitted models and compare them; form composites of the indicators of any common construct; conduct a Principal Components Analysis; use Cluster Analysis; or use Factor Analysis (EFA or CFA?).
• If your outcome is categorical, you need to use binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome). ← Today's Topic Area

Page 3

The Bivariate Distribution of HOME on HUBSAL

[Figure: scatterplot of "Is Woman a Homemaker?" (0 = In Labor Force, 1 = Homemaker) against Husband's Annual Salary (in $1,000), 0–50.]

RQ: In 1976, were married Canadian women who had children at home and husbands with higher salaries more likely to work at home rather than joining the labor force (when compared to their married peers with no children at home and husbands who earn less)?

Page 4

The Logistic Regression Model

P(HOME = 1) = 1 / (1 + e^−(β₀ + β₁·HUBSAL))

This will be our statistical model for relating a categorical outcome to predictors. We will fit it to data using Nonlinear Regression Analysis. We consider the non-linear Logistic Regression Model for representing the hypothesized population relationship between the dichotomous outcome, HOME, and predictors.

The outcome being modeled is the underlying probability that the value of the outcome HOME equals 1.

Parameter β₁ determines the slope of the curve, but is not equal to it (in fact, the slope is different at every point on the curve).

Parameter β₀ determines the intercept of the curve, but is not equal to it.

Page 5

Building the Logistic Regression Model: The Unconditional Model

To gain our footing, we can fit an unconditional logistic model:

. logit HOME

Iteration 0:   log likelihood = -263.22441
Iteration 1:   log likelihood = -263.22441

Logistic regression                             Number of obs   =        434
                                                LR chi2(0)      =       0.00
                                                Prob > chi2     =          .
Log likelihood = -263.22441                     Pseudo R2       =     0.0000

        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .8715548   .1052638     8.28   0.000     .6652415    1.077868

This should look familiar: it is our unconditional percentage of women who are homemakers in our sample:

. summarize HOME

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        HOME |        434    .7050691     .456538          0          1

We recall from multilevel modeling that we wish to maximize our likelihood, “maximum likelihood.”

Because the likelihood is a product of many, many small probabilities, we maximize the sum of log-likelihoods instead, making a negative number as close to zero as possible.

Later, we’ll use the difference in -2*loglikelihoods (the deviance) in a statistical test to compare models.
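As a quick check, we can reproduce the unconditional log likelihood by hand from the sample proportion (a minimal sketch, assuming HOME is in memory):

* Reproduce the unconditional log likelihood from the sample proportion
. quietly summarize HOME
. display r(N) * (r(mean)*ln(r(mean)) + (1 - r(mean))*ln(1 - r(mean)))
* ≈ -263.22441, matching the log likelihood in the logit output above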

Page 6

Building the Logistic Regression Model

. logit HOME HUBSAL

Iteration 0:   log likelihood = -263.22441
Iteration 1:   log likelihood = -252.20292
Iteration 2:   log likelihood = -252.02492
Iteration 3:   log likelihood = -252.02479
Iteration 4:   log likelihood = -252.02479

Logistic regression                             Number of obs   =        434
                                                LR chi2(1)      =      22.40
                                                Prob > chi2     =     0.0000
Log likelihood = -252.02479                     Pseudo R2       =     0.0425

        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      HUBSAL |   .0808408   .0184165     4.39   0.000     .0447451    .1169364
       _cons |  -.2371923   .2626906    -0.90   0.367    -.7520565    .2776718

Our fitted model: P̂(HOME = 1) = 1 / (1 + e^−(−.237 + .081·HUBSAL))

Before we interpret these coefficients directly, it is generally easiest to visualize the fitted model graphically.

We notice that our log likelihood is more positive than before (a better fit, from -263 to -252), but it took a bit longer to converge (increased complexity given the predictor).

We can show that the deviance (−2 × the log likelihood) decreases from 526 to 504.
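The conversion from log likelihoods to deviances is just arithmetic:

* Converting log likelihoods to deviances (-2 × log likelihood)
. display -2 * (-263.22441)    // 526.44882, the null (unconditional) deviance
. display -2 * (-252.02479)    // 504.04958, the deviance with HUBSAL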

Page 7

Graphical Interpretation of the Logistic Regression Model

P̂(HOME = 1) = 1 / (1 + e^−(−.237 + .081·HUBSAL))

[Figure: two panels plotting "Is Woman a Homemaker?" against Husband's Annual Salary (in $1,000), 0–50, comparing local polynomial, linear, and logistic fits to the data.]
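A sketch of how such an overlay might be drawn in Stata (assuming HOME and HUBSAL are in memory; the plotted function uses the rounded fitted coefficients above):

* Overlay local polynomial, linear, and logistic fits on the raw data
. twoway (scatter HOME HUBSAL) ///
         (lpoly HOME HUBSAL) ///
         (lfit HOME HUBSAL) ///
         (function y = invlogit(-.237 + .081*x), range(0 50)), ///
         ytitle("Is Woman a Homemaker?") xtitle("Husband's Annual Salary (in $1,000)")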

Page 8

The Likelihood Ratio Chi-Square

. logit HOME

Iteration 0:   log likelihood = -263.22441
Iteration 1:   log likelihood = -263.22441

Logistic regression                             Number of obs   =        434
                                                LR chi2(0)      =       0.00
                                                Prob > chi2     =          .
Log likelihood = -263.22441                     Pseudo R2       =     0.0000

        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .8715548   .1052638     8.28   0.000     .6652415    1.077868

Our Log Likelihood from our baseline model, with no predictors, is -263.22.

Deviance = -2*loglikelihood = 526.44

. logit HOME HUBSAL

Iteration 0:   log likelihood = -263.22441
Iteration 1:   log likelihood = -252.20292
Iteration 2:   log likelihood = -252.02492
Iteration 3:   log likelihood = -252.02479
Iteration 4:   log likelihood = -252.02479

Logistic regression                             Number of obs   =        434
                                                LR chi2(1)      =      22.40
                                                Prob > chi2     =     0.0000
Log likelihood = -252.02479                     Pseudo R2       =     0.0425

        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      HUBSAL |   .0808408   .0184165     4.39   0.000     .0447451    .1169364
       _cons |  -.2371923   .2626906    -0.90   0.367    -.7520565    .2776718

Our Log Likelihood from our 1-predictor model is -252.02. The loglikelihood of the data is less negative (more likely) given the model parameter estimates.

Deviance = -2*loglikelihood = 504.04. The deviance has dropped (and will always drop).

The drop in deviance is 526.45 − 504.05 = 22.40. This drop in deviance is chi-square distributed. A difference in logs is a log of a ratio, hence "likelihood ratio chi-square." Degrees of freedom are equal to the difference in the number of terms in the models (in this case, from 0 to 1 predictors, so 1 degree of freedom). Because we are comparing this to the baseline model, this is an omnibus test of the null hypothesis that all coefficients are 0, which we can reject: χ²(1) = 22.40, p < .0001. We can generalize the likelihood-ratio test to compare any nested models.
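Stata can run the same nested-model comparison with its lrtest command (a sketch, assuming the data are in memory):

* Likelihood-ratio test comparing the null and one-predictor models
. quietly logit HOME
. estimates store null
. quietly logit HOME HUBSAL
. estimates store full
. lrtest full null    // LR chi2(1) = 22.40, Prob > chi2 = 0.0000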

Page 9

. logit HOME HUBSAL

Iteration 0:   log likelihood = -263.22441
Iteration 1:   log likelihood = -252.20292
Iteration 2:   log likelihood = -252.02492
Iteration 3:   log likelihood = -252.02479
Iteration 4:   log likelihood = -252.02479

Logistic regression                             Number of obs   =        434
                                                LR chi2(1)      =      22.40
                                                Prob > chi2     =     0.0000
Log likelihood = -252.02479                     Pseudo R2       =     0.0425

        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      HUBSAL |   .0808408   .0184165     4.39   0.000     .0447451    .1169364
       _cons |  -.2371923   .2626906    -0.90   0.367    -.7520565    .2776718

The Pseudo-R² Statistic

[Figure: fitted logistic curve of "Is Woman a Homemaker?" against Husband's Annual Salary (in $1,000), 0–50.]

We could calculate our usual R² statistic: one minus the sum of squared residuals over the original variation of Y.

This statistic isn't that meaningful when Y is constrained to be dichotomous, as you can see from the graph.

Instead, we define another statistic, the "Pseudo-R²," as the proportional reduction in deviance from the unconditional model, or the proportional increase in log likelihood over the unconditional model: Pseudo-R² = (526.45 − 504.05) / 526.45 = .0425.

4.25% of the unconditional model deviance has been accounted for by the predictors.
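Because logit stores both log likelihoods, the same Pseudo-R² can be recovered from the stored results (e(ll) and e(ll_0) are Stata's names for the fitted and null log likelihoods):

* McFadden's pseudo-R2 from stored estimation results
. quietly logit HOME HUBSAL
. display 1 - e(ll)/e(ll_0)    // .0425, matching the Pseudo R2 reported above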

Page 10

Interpreting Model Results Graphically, Formulaically

[Figure: fitted logistic curve of "Is Woman a Homemaker?" against Husband's Annual Salary (in $1,000), 0–50.]

We have emphasized graphical interpretation of results throughout this course, particularly for interactions and nonlinear relationships.

We can always pick a handful of prototypical points and describe model implications.

Husband's income in       Estimated probability that
1976 Canadian Dollars     the wife is a homemaker
$10,000                   64%
$20,000                   80%
$30,000                   90%
$40,000                   95%
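One way to generate such prototypical probabilities in Stata is with margins after the fit, or by hand with invlogit() (a sketch using the fitted coefficients above):

* Fitted probabilities at prototypical salaries
. quietly logit HOME HUBSAL
. margins, at(HUBSAL = (10 20 30 40))
. display invlogit(-.2371923 + .0808408*10)    // .639, about 64%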

Page 11

Interpreting Logistic Model Parameter Estimates – Interpreting Sign

[Figure: fitted logistic curve of "Is Woman a Homemaker?" against Husband's Annual Salary (in $1,000), 0–50.]

. logit HOME HUBSAL, nolog

Logistic regression                             Number of obs   =        434
                                                LR chi2(1)      =      22.40
                                                Prob > chi2     =     0.0000
Log likelihood = -252.02479                     Pseudo R2       =     0.0425

        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      HUBSAL |   .0808408   .0184165     4.39   0.000     .0447451    .1169364
       _cons |  -.2371923   .2626906    -0.90   0.367    -.7520565    .2776718

But we must also be able to interpret parameter estimates directly, given their prominent placement in tables.

Direct interpretation of parameter estimates can be difficult with interactions and nonlinear relationships. It’s certainly difficult here.

Positive coefficients imply that positive increments in X predict greater probabilities that Y = 1, if all else in the model can be held constant.

A positive constant implies that P(Y = 1) > .5 when all X = 0.

Page 12

Probability, Odds, and Log-Odds: Formulaically

Object: Is it an Easter Egg? (0 = no; 1 = yes):  1 0 1 0 1 1 0 1 0 1

Probability of picking an Easter Egg at random, p:
    p = 6/10 = .6

Odds of picking an Easter Egg (vs. not an Easter Egg), p/(1 − p):
    p/(1 − p) = .6/.4 = 6/4 = 3/2 = 1.5

Log-odds of picking an Easter Egg (vs. not an Easter Egg), logₑ(p/(1 − p)):
    logₑ(p/(1 − p)) = logₑ(1.5) = .405
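The same arithmetic, using Stata as a calculator:

* Probability -> odds -> log-odds for the Easter Egg example (p = .6)
. display .6/(1 - .6)    // 1.5, the odds
. display ln(1.5)        // .405, the log-odds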

Page 13

One issue with probabilities is that their range of admissible values is restricted, to falling between 0 and 1. This was one of our clues that a linear model would be inappropriate.

The logit transformation stretches the probability scale, facilitating a linear relationship.

Probability, Odds, and Log-Odds: By Range

Quantity               Formula                Theoretical Range
                                              Minimum     Maximum
Probability            p                      0           1
Odds                   p / (1 − p)            0           +∞
Log(Odds) or "logit"   logₑ( p / (1 − p) )    −∞          +∞

Notice that a log-odds transformation of a probability leads to a scale with an unrestricted range.

Page 14

From Probabilities to Odds

Odds = p / (1 − p): How much more likely is it that Y = 1 than that Y = 0?

Percentage   Probability   Odds (as a fraction)   Odds (as a decimal)
10%          0.10          1/9                    0.11
25%          0.25          1/3                    0.33
50%          0.50          1/1                    1
75%          0.75          3/1                    3
90%          0.90          9/1                    9

Page 15

From Probabilities to Log-Odds (Logits)

Logit = logₑ( p / (1 − p) )

Percentage Probability Logits

10% 0.10 -2.2

25% 0.25 -1.1

50% 0.50 0

75% 0.75 1.1

90% 0.90 2.2

Try to remember that a logit of ±1 is a probability of around 25% or 75%, and a logit of ±2 is a probability of around 10% or 90%.

Note that the logit transformation stretches extreme probabilities and compresses central probabilities.
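Stata's built-in logit() and invlogit() functions perform these conversions directly:

* Probability -> logit, and logit -> probability
. display logit(.75)       // 1.0986, close to the 1.1 in the table
. display invlogit(1.1)    // .7503, back to about 75%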

Page 16

The Logistic Function as the Inverse of the Logit Function

If logit(p) = log( p / (1 − p) ) = x, then what is p?

Recall that our logistic regression model is: P(HOME = 1) = 1 / (1 + e^−(β₀ + β₁·HUBSAL)).

If P(HOME = 1) is like p, and β₀ + β₁·HUBSAL is like x, then we can re-express the logistic regression model in terms of logits.

If p = 1 / (1 + e^−x), then log( p / (1 − p) ) = x.

Our logistic regression model is a linear model for the log-odds that HOME = 1: log( P(HOME = 1) / (1 − P(HOME = 1)) ) = β₀ + β₁·HUBSAL.

. logit HOME

Iteration 0:   log likelihood = -263.22441
Iteration 1:   log likelihood = -263.22441

Logistic regression                             Number of obs   =        434
                                                LR chi2(0)      =       0.00
                                                Prob > chi2     =          .
Log likelihood = -263.22441                     Pseudo R2       =     0.0000

        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .8715548   .1052638     8.28   0.000     .6652415    1.077868

Revisiting our baseline, unconditional model, we can interpret our constant, 0.87, on the logit scale.

We recall that a logit of 1 is a probability around 75%, so we aren't surprised that invlogit(0.8716) = e^.8716 / (1 + e^.8716) = .705, our unconditional sample proportion.
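As a check in Stata:

* The inverse logit of the constant recovers the sample proportion
. display invlogit(.8715548)    // .7050691, the sample mean of HOME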

Page 17

A Linear Model for the Log-Odds that HOME = 1

[Figure: left panel, fitted probability that HOME = 1 against Husband's Annual Salary (in $1,000), 0–50 (an S-shaped logistic curve); right panel, the logit-transformed fitted probability that HOME = 1 against the same salary axis (a straight line, roughly 0 to 4 logits).]

General Relationship: log( P(Y = 1) / (1 − P(Y = 1)) ) = β₀ + β₁X

Our Model: log( P(HOME = 1) / (1 − P(HOME = 1)) ) = −.237 + .081·HUBSAL

Page 18

. logit HOME HUBSAL, nolog

Logistic regression                             Number of obs   =        434
                                                LR chi2(1)      =      22.40
                                                Prob > chi2     =     0.0000
Log likelihood = -252.02479                     Pseudo R2       =     0.0425

        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      HUBSAL |   .0808408   .0184165     4.39   0.000     .0447451    .1169364
       _cons |  -.2371923   .2626906    -0.90   0.367    -.7520565    .2776718

Interpreting Coefficients in Terms of Logits (Log-Odds)

Our constant term is the estimated log-odds that HOME = 1 (the woman is a homemaker) when all X = 0 (when the husband's salary is 0).

We remember that a logit of 0 is a probability of 50%, and a logit of -1 is 25%.

If we want the exact probability: invlogit(−.237) = 1 / (1 + e^.237) ≈ .44.

For two observations that differ by 1 unit on X, β̂₁ is the estimated difference in their log-odds that Y = 1.

For two women whose husband’s salaries differ by $1000, their estimated difference in the log-odds that they are homemakers is .081.

Recall that shifting logits from 0 to 1 takes you from 50% to around 75%, and from 1 to 2 from around 75% to around 90%.

Page 19

Interpreting Model Results in Terms of Odds

[Figure: fitted logistic curve of "Is Woman a Homemaker?" against Husband's Annual Salary (in $1,000), 0–50.]

Let’s take a look at the fitted odds that the woman is a homemaker when her husband’s annual salary is $10K.

Odds = p / (1 − p) = e^(β₀ + β₁X); in this case, e^(−.237 + .081×10) = e^.571 = 1.77.

When the husband earns $10K/year, the fitted odds that the woman is a homemaker is 1.77 to 1.

When the husband earns $10K/year, for every woman in the workforce, we estimate that 1.77 are homemakers.

When the husband earns $10K/year, the estimated probability that the woman is a homemaker is 1.77 times the estimated probability that the woman works outside the home.
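The fitted odds come straight from exponentiating the linear predictor; in Stata:

* Fitted odds at HUBSAL = 10, using the fitted coefficients above
. display exp(-.2371923 + .0808408*10)    // 1.77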

Page 20

Interpreting Model Results in Terms of Odds Ratios

[Figure: fitted logistic curve of "Is Woman a Homemaker?" against Husband's Annual Salary (in $1,000), 0–50.]

Husband's income in       Estimated probability that   Estimated odds that the   Estimated
1976 Canadian Dollars     the wife is a homemaker      wife is a homemaker       Odds Ratio
$10,000                   64%                          1.77                      2.248
$20,000                   80%                          3.99                      2.248
$30,000                   90%                          8.96                      2.248
$40,000                   95%                          20.15

We can calculate the ratio of odds at regular intervals: How much greater are the odds that a wife is a homemaker when the husband’s salary is $20,000 vs. $10,000? This odds ratio is 3.99/1.77=2.248.

p / (1 − p) = .64/.36 = 1.77          p / (1 − p) = .80/.20 = 3.99

This is not a typo! Successive odds ratios are constant!

Page 21

From Log-Odds to Odds Ratios

[Figure: fitted logistic curve of "Is Woman a Homemaker?" against Husband's Annual Salary (in $1,000), 0–50.]

Let's say that log( p / (1 − p) ) = β₀ + β₁X.

Let's try to add 1 to X and see what happens: log( p′ / (1 − p′) ) = β₀ + β₁(X + 1).

Subtracting the first equation from the second: log( p′ / (1 − p′) ) − log( p / (1 − p) ) = β₁.

And, since the difference in logs is a log of a ratio, the ratio of the two odds is e^β₁. Thus, exponentiating the slope, e^β₁, gives a constant "odds ratio," the multiplicative factor by which odds increment for a unit increment in X.
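In Stata, exponentiating the slope by hand, or refitting with the or option, gives the same odds ratio (a sketch, assuming the data are in memory):

* Odds ratio per $1,000 increment in HUBSAL
. display exp(.0808408)        // 1.084, the fitted odds ratio per $1,000
. logit HOME HUBSAL, or nolog  // reports exp(b) as an odds ratio in the table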

Page 22

. logit HOME HUBSAL, nolog

Logistic regression                             Number of obs   =        434
                                                LR chi2(1)      =      22.40
                                                Prob > chi2     =     0.0000
Log likelihood = -252.02479                     Pseudo R2       =     0.0425

        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      HUBSAL |   .0808408   .0184165     4.39   0.000     .0447451    .1169364
       _cons |  -.2371923   .2626906    -0.90   0.367    -.7520565    .2776718

Four Ways to Interpret Slope Coefficients in a Logistic Regression Model

Odds Ratios: Two wives whose husbands' 1976 salaries differ by $1,000 have fitted odds of being a homemaker that differ by a factor of e^.081 = 1.084, or a predicted 8.4% increment in fitted odds for a unit increment in HUBSAL.

Pick Prototypical Odds: Estimated odds of being a homemaker across prototypical husband's income levels:

Husband's income in       Estimated probability that   Estimated odds that the   Estimated
1976 Canadian Dollars     the wife is a homemaker      wife is a homemaker       Odds Ratio
$10,000                   64%                          1.77                      2.248
$20,000                   80%                          3.99                      2.248
$30,000                   90%                          8.96                      2.248
$40,000                   95%                          20.15

Log-Odds/Logits: Two women whose husbands' 1976 salaries differ by $1,000 differ by .081 in their fitted log-odds of being a homemaker.

Pick Prototypical Probabilities: Estimated probabilities of being a homemaker across prototypical husband's income levels:

Husband's income in       Estimated probability that
1976 Canadian Dollars     the wife is a homemaker
$10,000                   64%
$20,000                   80%
$30,000                   90%
$40,000                   95%