Linear Probability Models and Big Data: Prediction, Inference and Selection Bias

Suneel Chatla, Galit Shmueli
Institute of Service Science, National Tsing Hua University, Taiwan
Outline

- Introduction to binary outcome models
- Motivation: rare use of LPM
- Study goals
  o Estimation and inference
  o Classification
  o Selection bias
- Simulation study
- eBay data (in the paper)
- Conclusions
Binary outcome models

- Logit: P(Y = 1 | X) = exp(Xβ) / (1 + exp(Xβ))
- Probit: P(Y = 1 | X) = Φ(Xβ), where Φ is the standard normal cdf
- LPM: P(Y = 1 | X) = Xβ, estimated by OLS regression

The purpose of binary-outcome regression models:
- Inference and estimation
- Selection bias
- Prediction (classification)
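As a quick sketch (not part of the slides), the three models' probability functions can be written as plain Python; the index value `xb = 0.3` is a made-up example:

```python
import math

def logit_prob(xb):
    """Logit: P(Y=1|X) = exp(Xb) / (1 + exp(Xb))."""
    return 1.0 / (1.0 + math.exp(-xb))

def probit_prob(xb):
    """Probit: P(Y=1|X) = Phi(Xb), the standard normal cdf."""
    return 0.5 * (1.0 + math.erf(xb / math.sqrt(2.0)))

def lpm_prob(xb):
    """LPM: P(Y=1|X) = Xb directly (may fall outside [0, 1])."""
    return xb

xb = 0.3  # hypothetical linear index X*beta
print(logit_prob(xb), probit_prob(xb), lpm_prob(xb))
```

Note that only the LPM "probability" is unbounded, which is the source of the unbounded-predictions criticism below.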
Summary of IS literature (MISQ, JAIS, ISR and MS: 2000–2016)

- Inference and estimation: 60
- Selection bias: 31
- Classification and prediction: 5

Only 8 used LPM; 3 are from this year alone.
"Implementing a campaign fixed effects model with multinomial logit is challenging due to the incidental parameter problem, so we opt to employ LPM ..." – Burtch et al. (2016)

"The LPM is simple for both estimation and inference. LPM is fast and it allows for a reasonably accurate approximation of true preferences." – Schlereth & Skiera (2016)
Statisticians don't like LPM. Econometricians love LPM. Researchers rarely use LPM.

WHY?
Criticisms: comparison of the three models in terms of their theoretical properties

Criticism                    | Logit | Probit | LPM
Non-normal errors            |   ✔   |   ✔    |  ✖
Non-constant error variance  |   ✔   |   ✔    |  ✖
Unbounded predictions        |   ✔   |   ✔    |  ✖
Functional form              |   ✔   |   ✔    |  ✖
Advantages: comparison in terms of practical issues

Practical issue        | Logit | Probit | LPM
Convergence issues     |   ✖   |   ✖    |  ✔
Incidental parameters  |   ✖   |   ✖    |  ✔
Easier interpretation  |   ✔   |   ✖    |  ✔
Computational speed    |   ✖   |   ✖    |  ✔
The questions that matter to researchers: how do logit, probit and LPM compare on

- Inference & estimation
- Classification
- Selection bias
Inference and estimation

- Consistency
- Marginal effects
Latent framework

Z = 1 if Y* > 0, and Z = 0 otherwise,

where Y* is a latent continuous variable (not observable).
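A minimal simulation sketch of this latent mechanism, assuming standard logistic errors and a made-up index `xb = 0.8`: drawing Z = 1{Y* > 0} many times should reproduce the logit probability.

```python
import math
import random

random.seed(0)

def latent_draw(xb):
    """Draw Z = 1{Y* > 0} with Y* = Xb + eps, eps ~ standard logistic."""
    u = random.random()
    eps = math.log(u / (1.0 - u))  # inverse-cdf sampling of the logistic
    return 1 if xb + eps > 0 else 0

xb = 0.8  # hypothetical linear index
n = 200_000
freq = sum(latent_draw(xb) for _ in range(n)) / n
implied = 1.0 / (1.0 + math.exp(-xb))  # logit probability for the same index
print(freq, implied)  # the two should be close
```

Swapping the logistic error for N(0, 1) or Uniform errors yields the probit and LPM probabilities instead, as the next slide states.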
The error distribution of the latent variable determines the model:

- Logistic(0, 1) errors → logit model
- N(0, 1) errors → probit model
- Uniform(0, 1) errors → linear probability model

The MLEs of both logit and probit are consistent: β̂ →p β.

LPM estimates are proportionally and directionally consistent (Billinger, 2012): β̂_lpm →p kβ.

[Plot: as n grows, β̂ converges to β while β̂_lpm converges to kβ]
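The proportional-consistency claim can be illustrated with a small simulation sketch (values assumed for illustration): fitting an LPM by OLS to data generated from a logit model recovers a slope near kβ for some 0 < k < 1, not β itself.

```python
import math
import random

random.seed(1)

def ols_slope(xs, ys):
    """Simple-regression OLS slope: cov(x, y) / var(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

beta = 1.0  # true latent-index slope (assumed for the sketch)
n = 100_000
xs = [random.uniform(-2, 2) for _ in range(n)]
# binary outcomes from the logit model with slope beta
ys = [1 if random.random() < 1.0 / (1.0 + math.exp(-beta * x)) else 0
      for x in xs]

b_lpm = ols_slope(xs, ys)
print(b_lpm)  # roughly k * beta with 0 < k < 1, not beta itself
```

The direction and relative magnitudes of coefficients are preserved, which is why inference with LPM still works.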
Marginal effects for interpreting effect size

- LPM: ME for x_j = ∂P(Y=1|X)/∂x_j = β_j — easy interpretation
- Logit: ME for x_j = β_j Λ(Xβ)(1 − Λ(Xβ)) — no direct interpretation of β_j
- Probit: ME for x_j = β_j φ(Xβ) — no direct interpretation of β_j

(Λ is the logistic cdf and φ is the standard normal density.)
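These three formulas can be sketched directly; the coefficient and index values below are hypothetical:

```python
import math

def lpm_me(beta_j):
    """LPM marginal effect: constant, equal to the coefficient."""
    return beta_j

def logit_me(beta_j, xb):
    """Logit ME: beta_j * Lambda(xb) * (1 - Lambda(xb))."""
    p = 1.0 / (1.0 + math.exp(-xb))
    return beta_j * p * (1.0 - p)

def probit_me(beta_j, xb):
    """Probit ME: beta_j * phi(xb), the standard normal density."""
    phi = math.exp(-0.5 * xb * xb) / math.sqrt(2.0 * math.pi)
    return beta_j * phi

# hypothetical coefficient and index values
print(lpm_me(0.2), logit_me(0.8, 0.0), probit_me(0.5, 0.0))
```

Unlike the LPM, the logit and probit marginal effects depend on where Xβ is evaluated, which is why their raw coefficients have no direct effect-size interpretation.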
Simulation study

- Sample sizes {50, 500, 50000}
- Error distributions {Logistic, Normal, Uniform}
- 100 bootstrap samples
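A sketch of the bootstrap step of this design, using made-up LPM data (true slope 0.3) and 100 resamples:

```python
import random

random.seed(2)

def ols_slope(xs, ys):
    """Simple-regression OLS slope: cov(x, y) / var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# hypothetical binary-outcome data generated from a true LPM with slope 0.3
xs = [random.uniform(-1, 1) for _ in range(500)]
ys = [1 if random.random() < 0.5 + 0.3 * x else 0 for x in xs]

# 100 bootstrap resamples of the LPM slope, as in the study design
data = list(zip(xs, ys))
slopes = []
for _ in range(100):
    boot = random.choices(data, k=len(data))  # resample rows with replacement
    bx, by = zip(*boot)
    slopes.append(ols_slope(bx, by))

mean_slope = sum(slopes) / len(slopes)
sd = (sum((s - mean_slope) ** 2 for s in slopes) / (len(slopes) - 1)) ** 0.5
print(mean_slope, sd)  # bootstrap estimate and standard error of the slope
```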
Comparison of standard models

Coefficient | True | Logit | Probit | LPM
Intercept   |  0   |  0    |  0     |  0.5
β1          |  1   |  0.99 |  1     |  0.47
β2          | -1   | -1    | -1.01  | -0.43
β3          |  0.5 |  0.5  |  0.5   |  0.21
β4          | -0.5 | -0.5  | -0.5   | -0.21

(LPM slopes ≈ kβ with k = 0.4)
Comparison of significance

[Plot of bootstrap estimates: 1.02, −1.07, 0.52, −0.52]

Coefficient significance and non-significance results are identical across the three models.
Comparison of marginal effects

The distributions of marginal effects are identical across the three models.
Classification and prediction

- Predictions beyond [0, 1]
Is trimming appropriate?

- Replace predictions above 1 with 0.99 or 0.999
- Replace predictions below 0 with 0.001 or 0.0001

[Plot: trimmed LPM predictions against logit predictions]
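Trimming is a one-line clip; the bounds below use the 0.001/0.999 choice from the slide, and the example predictions are made up:

```python
def trim(p, lo=0.001, hi=0.999):
    """Trim an LPM prediction into (0, 1); bounds are one suggested choice."""
    return max(lo, min(hi, p))

preds = [-0.2, 0.4, 1.3]  # raw LPM predictions, some outside [0, 1]
print([trim(p) for p in preds])  # -> [0.001, 0.4, 0.999]
```

Trimming changes only the out-of-range values, so rankings and 0.5-threshold classifications are unaffected.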
Classification

Classification accuracies are identical across the three models.
Selection Bias
Quasi-experiments

Like randomized experimental designs, quasi-experiments test causal hypotheses, but they lack random assignment.

Treatment assignment:
- Assigned by experimenter
- Self-selection
Two-stage (2SLS) methods

Stage 1: selection model for T. Stage 2: outcome model for Y, with an adjustment term from Stage 1.

- Probit first stage (Heckman, 1977): E[T | X] = Φ(Xγ); adjustment IMR = φ(Xγ) / Φ(Xγ)
- LPM first stage (Olsen, 1980): E[T | X] = Xγ; adjustment λ = Xγ − 1

Olsen's adjustment is simpler.
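The two adjustment terms can be sketched directly; `xg = 0.6` is a hypothetical first-stage index value:

```python
import math

def imr(xg):
    """Heckman's inverse Mills ratio: phi(Xg) / Phi(Xg) (probit first stage)."""
    phi = math.exp(-0.5 * xg * xg) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(xg / math.sqrt(2.0)))
    return phi / Phi

def olsen_lambda(xg):
    """Olsen's adjustment for an LPM first stage: lambda = Xg - 1."""
    return xg - 1.0

xg = 0.6  # hypothetical first-stage index X*gamma
print(imr(xg), olsen_lambda(xg))
```

Olsen's term is linear in Xγ, which is what makes it simpler — and also what creates the multicollinearity issue noted in the bottom line when the two stages share predictors.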
Outcome model coefficients (bootstrap)

Both Heckman's and Olsen's methods perform similarly to the MLE.
Bottom line

Inference and estimation:
- Use LPM with a large sample; otherwise logit/probit is preferable
- With small-sample LPM, use robust standard errors

Classification:
- Use LPM if the goal is classification or ranking
- Trim predicted probabilities
- If probabilities are needed, logit/probit is preferable

Selection bias:
- Use LPM if the sample is large
- If both selection and outcome models have the same predictors, LPM suffers from multicollinearity
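The robust-standard-error recommendation can be illustrated for a single-predictor LPM with an HC0-style variance; the toy data below are made up:

```python
import math

def lpm_slope_with_robust_se(xs, ys):
    """OLS slope plus heteroskedasticity-robust (HC0) standard error.

    The LPM error variance Xb(1 - Xb) is inherently non-constant,
    hence the robust-SE recommendation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    # HC0: squared residuals weighted by squared x-deviations
    se2 = sum(((x - mx) ** 2) * (e ** 2) for x, e in zip(xs, resid)) / sxx ** 2
    return b1, math.sqrt(se2)

# toy binary-outcome data (hypothetical)
b1, se = lpm_slope_with_robust_se([0, 1, 2, 3], [0, 0, 1, 1])
print(b1, se)
```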
Thank you!
Suneel Chatla and Galit Shmueli (2016), "An Extensive Examination of Linear Regression Models with a Binary Outcome Variable," Journal of the Association for Information Systems (accepted).