
Chapter 9: Inferential Statistics of Discrete-Choice Models



Econometrics Master’s Course: Methods Chapter 9: Inferential Statistics of Discrete-Choice Models


- 9.1 Maximum-Likelihood Estimation
- 9.2 Estimation Errors: Variance-Covariance Matrix
  - 9.2.1 Example 1: SP Survey in the Audience
  - 9.2.2 Example 2: RP Survey in the Audience
- 9.3 Significance Tests
- 9.4 Goodness-of-Fit Measures


9.1 Maximum-Likelihood Estimation: the likelihood function

- Maximum-likelihood (ML) estimation is applicable to general stochastic models where the probabilities depend on a parameter vector β.

- The goal is to maximize the likelihood function L(β), i.e., the probability that the model predicts all data points (y_n, x_n), n = 1, ..., N:

  L(\beta) = P\bigl(y_1(\beta) = y_1, \ldots, y_N(\beta) = y_N\bigr),

  where y_n = y(x_n) gives the model estimate for x_n.

- For continuous endogenous variables, the likelihood function is given by the multi-dimensional probability density at the data points:

  L(\beta) = f_{y_1(\beta), \ldots, y_N(\beta)}(y_1, \ldots, y_N)

? Verify that the density formulation is equivalent to the probability definition by requiring the model estimations to be in small intervals around the data instead of hitting the data exactly.

! The multi-dimensional probability density f(·) is defined such that dP = f_{y_1, \ldots, y_N}(y) \, d^N y. Keeping d^N y small and constant, dP and thus P is maximized if and only if f(·) is maximized.


Maximum-likelihood estimation

- The ML method maximizes the likelihood function:

  \hat\beta = \arg\max_\beta L(\beta)

- Equivalently, and often numerically preferable, one maximizes the log-likelihood:

  \hat\beta = \arg\max_\beta \tilde L(\beta), \qquad \tilde L(\beta) = \ln L(\beta)

? Why does it not matter whether one maximizes the likelihood or the log-likelihood?

! As a probability or probability density, L > 0, and the log function is defined and strictly monotonically increasing on this range. Since (i) for such a function, x > y \Leftrightarrow f(x) > f(y), and (ii) the arg-max is based on this inequality relation, the argument of the maximum remains unchanged.


Application 1: Regression models

Besides OLS, the ML method can also be used to estimate regression models. Does it give the same result, at least if the statistical Gauß-Markov conditions are satisfied?

  L(\beta)
  \overset{\varepsilon_n \text{ indep.}}{=} \prod_{n=1}^N f_n(y_n)
  \overset{\varepsilon_n \sim \text{i.i.d. } N(0, \sigma^2)}{=}
  \prod_{n=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\left[ -\frac{(y_n - \beta' x_n)^2}{2\sigma^2} \right],

  \tilde L(\beta) = \sum_{n=1}^N \ln f_n(y_n)
  = \sum_{n=1}^N \left\{ -\frac{1}{2}\left(\ln 2\pi + \ln\sigma^2\right)
  - \frac{(y_n - \beta' x_n)^2}{2\sigma^2} \right\}
  = -\frac{N}{2}\left(\ln 2\pi + \ln\sigma^2\right)
  - \frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta)

Except for irrelevant additive and multiplicative constants, this is the negative of the SSE objective function of the OLS method and therefore leads to the same estimator!

? Why is it possible to express L(β) as a product?

! Since the random terms \varepsilon_n \sim \text{i.i.d. } N(0, \sigma^2); in particular, they are independent of each other.
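The ML–OLS equivalence can be checked numerically. A minimal Python sketch (the data points and the assumed known error variance are made up for illustration, not taken from the lecture): it maximizes the Gaussian log-likelihood of the no-intercept model y_n = β x_n + ε_n by a brute-force grid scan and compares the result with the OLS closed form.

```python
import math

# Made-up data points and an assumed known error variance (illustration only)
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
sigma2 = 0.04

def log_likelihood(beta):
    # Gaussian log-likelihood of the no-intercept model y_n = beta * x_n + eps_n
    return sum(-0.5 * math.log(2.0 * math.pi * sigma2)
               - (yn - beta * xn) ** 2 / (2.0 * sigma2)
               for xn, yn in zip(x, y))

# OLS closed form for the no-intercept model: beta = sum(x*y) / sum(x^2)
beta_ols = sum(xn * yn for xn, yn in zip(x, y)) / sum(xn * xn for xn in x)

# Crude ML maximization by scanning a fine grid (enough for one parameter)
grid = [i / 10000.0 for i in range(-20000, 20001)]
beta_ml = max(grid, key=log_likelihood)
print(beta_ols, beta_ml)
```

Both estimates agree up to the grid resolution, illustrating that the maximizer of the Gaussian likelihood is the OLS estimator.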


Application 2: Discrete-choice models

- Probability to predict the chosen alternative i_n for a single decision n:

  P(Y_n = y_n) = P(Y_{n1} = y_{n1}, \ldots, Y_{nI} = y_{nI})
  = \prod_{i=1}^I \left[ P_{ni}(\beta) \right]^{y_{ni}} = P_{n i_n}(\beta)

  (this relies on the exclusivity/completeness of the choice set A_n and on independent RUs)

- Probability to predict all the decisions correctly, assuming independent decisions:

  L(\beta) = P(Y_1(\beta) = y_1, \ldots, Y_N(\beta) = y_N)
  = \prod_{n=1}^N \prod_{i=1}^I \left[ P_{ni}(\beta) \right]^{y_{ni}}

- ML estimation:

  \hat\beta = \arg\max_\beta \tilde L(\beta), \qquad
  \tilde L(\beta) = \sum_{n=1}^N \sum_{i=1}^I y_{ni} \ln P_{ni}(\beta)
  = \sum_{n=1}^N \ln P_{n i_n}(\beta)
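Because exactly one y_{ni} per decision equals 1, the double sum collapses to one log-probability per decision. A small self-contained sketch (the utilities and choices are toy values, not the lecture's data) computes \tilde L for a multinomial logit:

```python
import math

def mnl_probs(V):
    # Multinomial-logit choice probabilities from a utility vector V
    expV = [math.exp(v) for v in V]
    s = sum(expV)
    return [e / s for e in expV]

def log_likelihood(utilities, chosen):
    # tilde L = sum_n ln P_{n, i_n}, with utilities already evaluated at beta
    return sum(math.log(mnl_probs(V)[i]) for V, i in zip(utilities, chosen))

# Toy example: three decisions, two alternatives (hypothetical utilities)
utilities = [[0.5, 0.0], [-0.2, 0.0], [1.0, 0.0]]
chosen = [0, 1, 0]
print(log_likelihood(utilities, chosen))
```

For a single decision with equal utilities the contribution is ln(1/2), as expected for a 50:50 choice.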


Question

? Show that, in deriving the main ML result \tilde L = \sum_n \sum_i y_{ni} \ln P_{ni}, the random utilities need not be uncorrelated between alternatives, only between choices.

! Because of the exclusivity/completeness requirement for the alternatives, exactly one alternative is chosen per decision, so it is enough to maximize the corresponding probability (which, of course, depends on possible correlations).


Estimating models with only ACs

If there are no exogenous variables, we are left with just the alternative-specific constants (ACs) reflecting that people prefer certain alternatives over others for unknown reasons:

  V_{ni} = \sum_{m=1}^{I-1} \beta_m \delta_{mi},
  \quad\text{i.e.,}\quad V_{ni} = \beta_i \text{ if } i \ne I,
  \qquad V_{nI} = 0

This AC-only model will be the “reference case” when estimating the model quality, e.g., by the likelihood-ratio index.

? Show that the estimated model gives probabilities P_{ni} = P_i that are equal to the observed choice fractions N_i/N. (Hint: Lagrange multipliers to satisfy \sum_i P_i = 1)

! We have \tilde L(P) = \sum_n \ln P_{i_n} = \sum_i N_i \ln P_i; maximize under the constraint \sum_i P_i = 1:

  \frac{d}{dP_i} \left( \tilde L(P) - \lambda \Bigl( \sum_i P_i - 1 \Bigr) \right)
  \overset{!}{=} 0
  \;\Rightarrow\; \frac{N_i}{P_i} = \lambda
  \;\Rightarrow\; P_i \propto N_i

? Based on this result P_i = N_i/N, give the parameters for the AC-only MNL and for the binary i.i.d. Probit model.

! Logit: P_i/P_I = N_i/N_I = \exp(\beta_i), i.e., \hat\beta_i = \ln(N_i/N_I) (note that I is the reference alternative without an AC). Binary Probit: with the scale normalized such that P_1 = \Phi(\beta_1), one obtains \hat\beta_1 = \Phi^{-1}(N_1/N).
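The Lagrange result can be verified numerically: setting β_i = ln(N_i/N_I) in an AC-only MNL reproduces the observed shares exactly. A sketch with hypothetical counts (not the lecture's data):

```python
import math

counts = [17, 16, 10]       # hypothetical N_i for I = 3 alternatives
N = sum(counts)

# AC-only MNL: beta_i = ln(N_i / N_I); the reference alternative I gets beta = 0
betas = [math.log(c / counts[-1]) for c in counts]
expV = [math.exp(b) for b in betas]
probs = [e / sum(expV) for e in expV]

shares = [c / N for c in counts]
print(probs, shares)
```

The modeled probabilities and the observed choice fractions coincide up to floating-point rounding.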


Exercise: simple binomial model with an AC and travel time

  V_{ni} = \beta_1 \delta_{i1} + \beta_2 T_{ni}

Choice set | T_ped = T_1 [min] | T_bike = T_2 [min] | # chosen 1 | # chosen 2
-----------|-------------------|--------------------|------------|-----------
1          | 15                | 30                 | 3          | 2
2          | 10                | 15                 | 2          | 3
3          | 20                | 20                 | 1          | 4
4          | 30                | 25                 | 1          | 4
5          | 30                | 20                 | 0          | 5
6          | 60                | 30                 | 0          | 5


I: Graphical solution

  V_{ni} = \beta_1 \delta_{i1} + \beta_2 T_{ni}

[Figure: log-likelihood contour plots for the Logit and Probit specifications]

Logit: \tilde L = -12, \quad \hat\beta_1 = -1.3, \quad \hat\beta_2 = -0.14; \quad AC in minutes: -\beta_1/\beta_2 = -9 min

Probit: \tilde L = -12, \quad \hat\beta_1 = -1.1, \quad \hat\beta_2 = -0.12; \quad AC in minutes: -\beta_1/\beta_2 = -9 min


II: Numerical solution

- Generally, we have a nonlinear optimization problem.
- For parameter-linear utilities, we know for the MNL that a maximum exists and is unique.
- Standard methods of nonlinear optimization are applicable:
  - Newton's and quasi-Newton methods: fast but may be unstable
  - Gradient/steepest-descent methods: slow but reliable
  - Broyden-Fletcher-Goldfarb-Shanno (BFGS) or Levenberg-Marquardt algorithms combining gradient and Newton methods; such methods are used in many software packages
  - Genetic algorithms if the objective-function landscape is complicated (nonlinear utilities)
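As an illustration of the Newton approach, the simple binomial exercise above can be estimated in pure Python. This is a sketch, not the lecture's implementation: it uses the equivalent parameterization P(ped) = sigmoid(β₁ + β₂ ΔT) with ΔT = T₁ − T₂, which follows from V_{n1} − V_{n2} = β₁ + β₂(T_{n1} − T_{n2}).

```python
import math

# Choice-set data of the exercise: (T1 - T2 in min, # chose ped, # chose bike)
data = [(-15, 3, 2), (-5, 2, 3), (0, 1, 4), (5, 1, 4), (10, 0, 5), (30, 0, 5)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b1, b2):
    # Binary logit: P(alt 1) = sigmoid(b1 + b2 * (T1 - T2))
    ll = 0.0
    for dT, n1, n2 in data:
        p1 = sigmoid(b1 + b2 * dT)
        ll += n1 * math.log(p1) + n2 * math.log(1.0 - p1)
    return ll

def newton_step(b1, b2):
    # Analytic gradient and (negative definite) Hessian of the log-likelihood
    g1 = g2 = h11 = h12 = h22 = 0.0
    for dT, n1, n2 in data:
        p = sigmoid(b1 + b2 * dT)
        w = (n1 + n2) * p * (1.0 - p)
        g1 += n1 - (n1 + n2) * p
        g2 += (n1 - (n1 + n2) * p) * dT
        h11 -= w
        h12 -= w * dT
        h22 -= w * dT * dT
    # Newton update: solve the 2x2 system H * delta = -gradient
    det = h11 * h22 - h12 * h12
    d1 = (-g1 * h22 + g2 * h12) / det
    d2 = (g1 * h12 - g2 * h11) / det
    return b1 + d1, b2 + d2

b1 = b2 = 0.0
for _ in range(30):
    b1, b2 = newton_step(b1, b2)

print(b1, b2, log_likelihood(b1, b2))
```

For this data set the iteration should settle near the graphical reading of the Logit solution (β₁ ≈ −1.3, β₂ ≈ −0.14, \tilde L ≈ −12).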


Special case: estimating the MNL

The special structure of the MNL with parameter-linear utilities, V_{ni} = \sum_m \beta_m x_{mni}, allows for an intuitive formulation of the estimation problem:

  The observed and modeled sums of the factors x_m for a given parameter m should be the same:

  X_m^{\text{MNL}} = X_m^{\text{data}}, \qquad
  \sum_{n,i} x_{mni} \, P_{ni}(\hat\beta)
  = \sum_{n,i} x_{mni} \, y_{ni}
  = \sum_n x_{m n i_n}


Example: four factors, two alternatives

MNL model, V_{ni} = \beta_1 T_{ni} + \beta_2 C_{ni} + \beta_3 g_n \delta_{i1} + \beta_4 \delta_{i1}, with the gender dummy g_n = 1 for women and g_n = 0 otherwise:

- X_1 = T: total travel time for the chosen alternatives:

  T^{\text{MNL}} = \sum_{n,i} P_{ni}(\beta) \, T_{ni}, \qquad
  T^{\text{data}} = \sum_{n,i} y_{ni} T_{ni} = \sum_n T_{n i_n}

- X_2 = C: total money spent by the decision makers:

  C^{\text{MNL}} = \sum_{n,i} P_{ni}(\beta) \, C_{ni}, \qquad
  C^{\text{data}} = \sum_{n,i} y_{ni} C_{ni} = \sum_n C_{n i_n}

- X_3 = N_{1,\text{w}}: number of women choosing alternative 1:

  N_{1,\text{w}}^{\text{MNL}} = \sum_n P_{n1}(\beta) \, g_n, \qquad
  N_{1,\text{w}}^{\text{data}} = \sum_n y_{n1} g_n

- X_4 = N_1: total number of persons choosing alternative 1:

  N_1^{\text{MNL}} = \sum_n P_{n1}(\beta), \qquad
  N_1^{\text{data}} = \sum_n y_{n1}


9.2 Estimation Errors: Variance-Covariance Matrix

Since the log-likelihood is maximized at \hat\beta, we have

  \frac{\partial \tilde L}{\partial \beta} \bigg|_{\hat\beta} = 0
  \;\Rightarrow\;
  \tilde L(\beta) \approx \tilde L_{\max} + \frac{1}{2} \Delta\beta' \, H \, \Delta\beta,
  \qquad \Delta\beta = \beta - \hat\beta

with the (negative definite) Hessian H_{lm} = \frac{\partial^2 \tilde L(\beta)}{\partial\beta_l \, \partial\beta_m} \Big|_{\beta = \hat\beta}.

Compare L(\beta) near its maximum with the density f(x) of the general multivariate normal distribution with variance-covariance matrix \Sigma:

  L(\beta) = L_{\max} \exp\left( \frac{1}{2} \Delta\beta' \, H \, \Delta\beta \right),
  \qquad
  f(x) = \left( (2\pi)^M \det\Sigma \right)^{-1/2}
  \exp\left( -\frac{1}{2} x' \Sigma^{-1} x \right)

Identify \Delta\beta with x, the sought-after variance-covariance matrix V with \Sigma (so that -H corresponds to \Sigma^{-1}), and assume the asymptotic limit (higher than quadratic terms in \tilde L(\beta) negligible):

  V = \mathrm{Cov}(\hat\beta)
  = E\left[ (\hat\beta - \beta)(\hat\beta - \beta)' \right]
  \approx -H^{-1}(\hat\beta)
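The relation V ≈ −H⁻¹ can be illustrated with a one-parameter AC-only binary logit, where the analytic Hessian at the MLE is −N p(1−p). In this sketch (hypothetical counts, not the lecture's data) a finite-difference second derivative stands in for the analytic Hessian:

```python
import math

N1, N2 = 20, 32      # hypothetical choice counts for two alternatives
N = N1 + N2

def ll(beta):
    # AC-only binary logit: P1 = sigmoid(beta); tilde L = N1 ln P1 + N2 ln(1 - P1)
    p = 1.0 / (1.0 + math.exp(-beta))
    return N1 * math.log(p) + N2 * math.log(1.0 - p)

beta_hat = math.log(N1 / N2)   # the MLE reproduces the observed share

# Finite-difference second derivative (Hessian) at the maximum
h = 1e-5
H = (ll(beta_hat + h) - 2.0 * ll(beta_hat) + ll(beta_hat - h)) / h**2

var_beta = -1.0 / H            # V = -H^{-1}
se_beta = math.sqrt(var_beta)
print(beta_hat, se_beta)
```

The numerical variance agrees with the closed form 1/(N p(1−p)) evaluated at p = N₁/N.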


Fisher’s information matrix

The variance-covariance matrix is related to Fisher's information matrix I:

  I = V^{-1} = -H, \qquad
  I_{lm} = -\frac{\partial^2 \tilde L(\beta)}{\partial\beta_l \, \partial\beta_m}

- Roughly speaking, information is missing uncertainty: the higher the main components of I, the lower the main components of V.
- Cramér-Rao inequality: a lower bound for the variance-covariance matrix is the inverse of Fisher's information matrix ⇒ the ML estimator is asymptotically efficient.
- Comparison with the OLS estimator V^{\text{OLS}} = 2\sigma^2 (H^{\text{SSE}})^{-1} of regression models:

  I = -H = H^{\text{SSE}}/(2\sigma^2) = X'X/\sigma^2

  The negative Hessian of \tilde L(\beta) is proportional to the Hessian of the regression SSE objective S(\beta).


9.2.1 Example 1 from the past lecture: SP survey in the audience, WS 18/19 (red: bad weather, W = 1)

Choice set | Alt. 1: Ped | Alt. 2: Bike | Alt. 3: PT/Car | # Alt 1 | # Alt 2 | # Alt 3
-----------|-------------|--------------|----------------|---------|---------|--------
1          | 30 min      | 20 min       | 20 min + 0 €   | 1       | 3       | 7
2          | 30 min      | 20 min       | 20 min + 2 €   | 2       | 9       | 2
3          | 30 min      | 20 min       | 20 min + 1 €   | 1       | 5       | 7
4          | 30 min      | 20 min       | 30 min + 0 €   | 2       | 9       | 3
5          | 50 min      | 20 min       | 30 min + 0 €   | 0       | 9       | 4
6          | 50 min      | 30 min       | 30 min + 0 €   | 0       | 3       | 9
7          | 50 min      | 40 min       | 30 min + 0 €   | 0       | 2       | 10
8          | 180 min     | 60 min       | 60 min + 2 €   | 0       | 4       | 11
9          | 180 min     | 40 min       | 60 min + 2 €   | 0       | 9       | 6
10         | 180 min     | 40 min       | 60 min + 2 €   | 0       | 1       | 14
11         | 12 min      | 8 min        | 10 min + 0 €   | 3       | 5       | 6
12         | 12 min      | 8 min        | 10 min + 1 €   | 5       | 7       | 2


Model specification for Model 1 of the past lecture

  V_i = \beta_0 \delta_{i1} + \beta_1 \delta_{i2} + \beta_2 K_i + \beta_3 T_i

  \hat\beta_0 = -0.95 \pm 0.37, \quad
  \hat\beta_1 = -0.28 \pm 0.24, \quad
  \hat\beta_2 = +0.17 \pm 0.19, \quad
  \hat\beta_3 = -0.04 \pm 0.02

  \frac{\beta_0}{-\beta_3} = -22.4 \text{ min}, \qquad
  \frac{\beta_1}{-\beta_3} = -6.6 \text{ min}, \qquad
  \frac{60\beta_3}{\beta_2} = -15 \text{ €/h}

AIC = 275, BIC = 303, \rho^2 = 0.200, \bar\rho^2 = 0.177


Likelihood and log-likelihood function for varying cost (β_2) and time (β_3) sensitivities

  V_i = \beta_0 \delta_{i1} + \beta_1 \delta_{i2} + \beta_2 K + \beta_3 T

[Figure: likelihood function L(\beta_2, \beta_3 \,|\, \beta_0, \beta_1) and log-likelihood function \tilde L(\beta_2, \beta_3 \,|\, \beta_0, \beta_1)]


Log-likelihood function in parameter space

  V_i = \beta_0 \delta_{i1} + \beta_1 \delta_{i2} + \beta_2 K + \beta_3 T + \beta_4 W \delta_{i3}


9.2.2 Example 2: RP Survey in the Audience

Distance classes for the trip home to university (cumulated till 2018); weather: good

Distance  | Class center | Alt. 1: ped | Alt. 2: bike | Alt. 3: PT | Alt. 4: car
----------|--------------|-------------|--------------|------------|------------
0-1 km    | 0.5 km       | 17          | 16           | 10         | 0
1-2 km    | 1.5 km       | 9           | 23           | 20         | 2
2-5 km    | 3.5 km       | 2           | 27           | 55         | 4
5-10 km   | 7.5 km       | 0           | 7            | 42         | 7
10-20 km  | 12.5 km      | 0           | 0            | 18         | 7


Revealed Choice: fit quality

  V_1 = \beta_1 + \beta_4 r, \quad
  V_2 = \beta_2 + \beta_5 r, \quad
  V_3 = \beta_3 + \beta_6 r, \quad
  V_4 = 0

  \hat\beta_1 = 4.1 \pm 0.6, \quad
  \hat\beta_2 = 3.6 \pm 0.5, \quad
  \hat\beta_3 = 3.0 \pm 0.5, \quad
  \hat\beta_4 = -1.43 \pm 0.26, \quad
  \hat\beta_5 = -0.48 \pm 0.08, \quad
  \hat\beta_6 = -0.14 \pm 0.05


Revealed Choice: modal split as a function of distance

  V_1 = \beta_1 + \beta_4 r, \quad
  V_2 = \beta_2 + \beta_5 r, \quad
  V_3 = \beta_3 + \beta_6 r, \quad
  V_4 = 0

(same parameter estimates as on the previous slide)


Likelihood and log-likelihood as functions of (\beta_1, \beta_2)

  V_i = \sum_{m=1}^{3} \beta_m \delta_{mi} + \sum_{m=1}^{3} \beta_{m+3} \, r \, \delta_{mi}

[Figure: likelihood function L(\beta_1, \beta_2, \beta_3, \ldots) and log-likelihood function \tilde L(\beta_1, \beta_2, \beta_3, \ldots)]


Log-likelihood: sections through parameter space

  V_i = \sum_{m=1}^{3} \beta_m \delta_{mi} + \sum_{m=1}^{3} \beta_{m+3} \, r \, \delta_{mi}


9.3 Significance Tests: Parametric Tests

- The parameter test procedures are exactly the same as those of regression models. Because we only consider the asymptotic limit, the test statistic is always Gaussian.

- Confidence interval of a parameter \beta_m:

  CI_\alpha(\beta_m) = [\hat\beta_m - \Delta_\alpha, \; \hat\beta_m + \Delta_\alpha],
  \qquad \Delta_\alpha = z_{1-\alpha/2} \sqrt{V_{mm}}

- Test of a parameter \beta_j for H_0: \beta_j = \beta_{j0}, \ge \beta_{j0}, or \le \beta_{j0}:

  T = \frac{\hat\beta_j - \beta_{j0}}{\sqrt{V_{jj}}} \sim N(0, 1) \;\big|\; H_0^*

- p-values for H_0: \beta_j = \beta_{j0}, \ge \beta_{j0}, or \le \beta_{j0}, respectively:

  p_= = 2\bigl(1 - \Phi(|t_{\text{data}}|)\bigr), \qquad
  p_\le = 1 - \Phi(t_{\text{data}}), \qquad
  p_\ge = \Phi(t_{\text{data}})

- As in regression, a factor of 4 more data halves the error.
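These formulas are easy to evaluate with the standard normal CDF expressed via the error function; no statistics library is needed. A sketch using the β₃ estimate of Example 9.2.1 as input (reading the ± value of −0.04 ± 0.02 as the standard error, an assumption about the slide's notation):

```python
import math

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# beta_3 of Example 9.2.1, with the +/- value taken as the standard error
beta_m, se = -0.04, 0.02
z975 = 1.959964                      # z_{1 - alpha/2} for alpha = 5 %
ci = (beta_m - z975 * se, beta_m + z975 * se)

# Two-sided p-value for H0: beta_m = 0
t = beta_m / se
p_two_sided = 2.0 * (1.0 - Phi(abs(t)))
print(ci, p_two_sided)
```

With t = −2, the two-sided p-value is about 0.046, so this parameter is (barely) significant at the 5 % level.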


Significance tests II: Likelihood-ratio (LR) test

As in regression (F-test), one sometimes wants to test null hypotheses fixing several parameters simultaneously to given values, i.e., H_0 corresponds to a restricted model.

- H_0: the restricted model with some fixed parameters and M_r remaining parameters describes the data as well as the full model with M parameters.

- Test statistic:

  \lambda_{\text{LR}} = 2 \ln \frac{L(\hat\beta)}{L_r(\hat\beta^r)}
  = 2 \left[ \tilde L(\hat\beta) - \tilde L_r(\hat\beta^r) \right]
  \sim \chi^2(M - M_r) \text{ if } H_0 \text{ holds}

- Data realization: calibrate both the full and the restricted model and evaluate \lambda_{\text{LR}}^{\text{data}}.

- Result: reject H_0 at level \alpha based on the 1-\alpha quantile:

  \lambda_{\text{LR}}^{\text{data}} > \chi^2_{1-\alpha, \, M-M_r};
  \qquad p\text{-value: } p = 1 - F_{\chi^2(M-M_r)}\left( \lambda_{\text{LR}}^{\text{data}} \right)


Example: Mode choice for the route to this lecture

Distance class n  | Distance r_n | i = 1 (ped/bike) | i = 2 (PT/car)
------------------|--------------|------------------|---------------
n = 1: 0-1 km     | 0.5 km       | 7                | 1
n = 2: 1-2 km     | 1.5 km       | 6                | 4
n = 3: 2-5 km     | 3.5 km       | 6                | 12
n = 4: 5-10 km    | 7.5 km       | 1                | 10
n = 5: 10-20 km   | 15.0 km      | 0                | 5

  V_{n1}(\beta_1, \beta_2) = \beta_1 r_n + \beta_2, \qquad
  V_{n2}(\beta_1, \beta_2) = 0

- \beta_1: difference in distance sensitivity (utility/km) for choosing ped/bike over PT/car (expected < 0)
- \beta_2: utility difference of ped/bike over PT/car at zero distance (expected > 0)

Do the data allow us to distinguish this model from the trivial model V_{ni} = 0?


LR test for the corresponding Logit models

- H_0: the trivial model V_{ni} = 0 describes the data as well as the full model V_{ni}(\beta_1, \beta_2) = (\beta_1 r_n + \beta_2) \, \delta_{i1}

- Test statistic: \lambda_{\text{LR}} = 2\left[ \tilde L(\hat\beta_1, \hat\beta_2) - \tilde L(0, 0) \right] \sim \chi^2(2) \;|\; H_0

- Data realization (1 \tilde L-unit per contour): \lambda_{\text{LR}}^{\text{data}} = 2(-26.5 + 35.5) = 18

- Decision: rejection range \lambda_{\text{LR}} > \chi^2_{2, \, 0.95} = 5.99 \;\Rightarrow\; H_0 rejected.
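The numbers of this example can be checked directly; for two degrees of freedom the χ² survival function is simply exp(−x/2), so no statistics library is needed (a sketch, using the log-likelihood values read off the contour plot):

```python
import math

def chi2_sf_2dof(x):
    # Survival function 1 - F of the chi^2 distribution with 2 degrees of freedom
    return math.exp(-x / 2.0)

ll_full = -26.5     # log-likelihood of the full model (read off the contour plot)
ll_null = -35.5     # log-likelihood of the trivial model V = 0

lam = 2.0 * (ll_full - ll_null)   # LR statistic, ~ chi^2(2) under H0
crit = 5.99                       # chi^2_{2, 0.95}
p_value = chi2_sf_2dof(lam)
print(lam, lam > crit, p_value)
```

The statistic λ = 18 far exceeds the critical value 5.99, with a p-value of exp(−9) ≈ 1.2e-4.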


Fit quality of the full model

? What would be the modeled ped/bike modal split for the null model V_{ni} = 0?

! 50:50

? Read off from the \tilde L contour plot the parameter of the AC-only model V_{ni} = \beta_2 \delta_{i1} and give the modeled modal split.

! \hat\beta_2 = \ln(P_1/P_2) \approx -0.5, consistent with P_1/P_2 = e^{\beta_2} \approx N_1/N_2 = 20/32.

? Motivate the negative correlation between the parameter errors.

! It at least ensures that, in case of correlated errors, about the same fraction chooses alternative 2 as for the calibrated model.


9.4 Goodness-of-Fit Measures

- The parameter tests for equality and the LR test are related to significance: is the more complicated of two nested models significantly better in describing the data?
- This can be used to find the best model using the top-down ansatz: make it as simple as possible, but not simpler!
- Problem: for very big samples, nearly any new parameter becomes significant, and the top-down ansatz fails.
- More importantly: significance/LR tests cannot give evidence for missing but relevant factors.
- A further problem: we cannot compare non-nested models.
- Finally, in reality one is often interested in effect strength (difference in fit and validation quality), not significance.

⇒ We need measures of absolute fit quality.


Information-based goodness-of-fit (GoF) measures

- Akaike's information criterion (here in the finite-sample-corrected form):

  \text{AIC} = -2\tilde L + \frac{2MN}{N - (M + 1)}

- Bayesian information criterion:

  \text{BIC} = -2\tilde L + M \ln N

  (N: number of decisions; M: number of parameters)

- Both criteria give the needed additional information (in bit) to obtain the actual micro-data from the model's prediction, including an over-fitting penalty: the lower, the better.
- Both the AIC and BIC are equivalent to the corresponding GoF measures of regression.
- The BIC focuses more on parsimonious models (low M).
- For nested models satisfying the null hypothesis of the LR test and N \gg M, the expected AIC is the same (verify!). However, since the AIC is an absolute measure, it allows comparing non-nested models.
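Both criteria as defined above are one-liners. A sketch with hypothetical placeholder values for the log-likelihood, M, and N (not the lecture's example):

```python
import math

def aic(ll, M, N):
    # Corrected AIC as defined on this slide: -2 L + 2 M N / (N - (M + 1))
    return -2.0 * ll + 2.0 * M * N / (N - (M + 1))

def bic(ll, M, N):
    # BIC: -2 L + M ln N
    return -2.0 * ll + M * math.log(N)

# Hypothetical values: log-likelihood -120 with M = 4 parameters, N = 140 decisions
print(aic(-120.0, 4, 140), bic(-120.0, 4, 140))
```

For moderate-to-large N the BIC penalty M ln N exceeds the AIC penalty, which is why the BIC favors more parsimonious models.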


GoF measures corresponding to the coefficient of determination R^2 of linear models (\tilde L_0: log-likelihood of the estimated AC-only or trivial model)

- LR-index resp. McFadden's R^2:

  \rho^2 = 1 - \frac{\tilde L}{\tilde L_0}

- Adjusted LR-index/McFadden's R^2:

  \bar\rho^2 = 1 - \frac{\tilde L - M}{\tilde L_0}

- The LR-index \rho^2 and the adjusted LR-index \bar\rho^2 correspond to the coefficient of determination R^2 and the adjusted coefficient \bar R^2 of regression models, respectively: the higher, the better.
- In contrast to regression models, even the best-fitting model has \rho^2 and \bar\rho^2 values far from 1. Values as low as 0.3 may characterize a good model (see Example 9.2.1), while R^2 = 0.3 means a really bad fit for a regression model.
- An over-fitted model with M parameters fitting N = M decisions reaches the “ideal” LR-index value \rho^2 = 1, while \bar\rho^2 is near zero.


Questions on GoF metrics

? Discuss the model to be tested, the AC-only model, and the trivial model in the context of weather forecasts.

! Full forecast info; info from a climate table only; 50:50.

? Give the log-likelihood of the AC-only and trivial models if there are I alternatives and N_i decisions for alternative i (total number of decisions N = \sum_{i=1}^I N_i).

! Trivial model: P_{ni} = 1/I, \quad \tilde L = \sum_n \ln P_{i_n} = \sum_i N_i \ln P_i = -N \ln I;
  AC-only model: P_{ni} = N_i/N, \quad \tilde L = \sum_i N_i \ln(N_i/N) = \sum_i N_i \ln N_i - N \ln N

? Consider a binary choice situation where the N/2 persons with short trips chose the pedestrian/bike option with a probability of 3/4, and the PT/car option with 1/4. The other N/2 persons with long trips had the reverse modal split with a ped/bike usage of only 25 %. What would be the LR-index for the “perfect” model exactly reproducing the observed 3:1 and 1:3 modal splits for the short and long trips, respectively?

! \rho^2 = 1 - \frac{0.75 \ln 0.75 + 0.25 \ln 0.25}{-\ln 2} \approx 0.19; even this perfect model stays far below 1.
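The final question can be answered in two lines; the result is independent of N since both log-likelihoods scale linearly with it (a sketch):

```python
import math

# "Perfect" model reproducing the 3:1 and 1:3 modal splits of the question
N = 1000                         # arbitrary; rho^2 is independent of N here
ll_perfect = N * (0.75 * math.log(0.75) + 0.25 * math.log(0.25))
ll_trivial = -N * math.log(2.0)  # trivial 50:50 model

rho2 = 1.0 - ll_perfect / ll_trivial
print(rho2)
```

The result, ρ² ≈ 0.19, illustrates the slide's point that even a model reproducing the true choice probabilities exactly yields an LR-index far from 1.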