
EPSE 682: Multivariate Analysis

Ed Kroc

University of British Columbia

ed.kroc@ubc.ca

January 28, 2020


Towards the generalized linear model: logistic regression

If your response data is binary, then using ordinary linear regression is going to be a bad idea: you will get predicted/fitted values that make no sense!

Moreover, when Y is binary, it is really only Pr(Y = 1) that we care about (see next slides). But probabilities have to live on [0, 1].


If a random quantity Y is binary, then it is necessarily a Bernoulli random variable:

Y ∼ Bern(π), where π := Pr(Y = 1).

Notice that π ∈ [0, 1].

Notice that π completely characterizes the distribution of Y; e.g.

Pr(Y = 0) = 1 − Pr(Y = 1) = 1 − π.

E(Y) = π,  Var(Y) = π(1 − π).

This is in stark contrast to a normal random variable, where there are two free parameters: μ ∈ ℝ that determines the mean, and σ ∈ ℝ₊ that (independently) determines the variance.
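A minimal numerical sketch of these Bernoulli facts (assuming scipy is available; π = 0.3 is just an illustration):

```python
# Check the Bernoulli pmf, mean, and variance facts above.
from scipy.stats import bernoulli

pi = 0.3
print(bernoulli.pmf(0, pi))  # Pr(Y = 0) = 1 - pi = 0.7
print(bernoulli.mean(pi))    # E(Y) = pi = 0.3
print(bernoulli.var(pi))     # Var(Y) = pi * (1 - pi) = 0.21
```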


The sum of independent and identically distributed (i.i.d.) Bernoulli random variables is called a Binomial random variable:

Y = Σ_{i=1}^{m} Y_i ∼ Bin(m, π), where Y_i ∼ Bern(π).

Such a random variable naturally counts the number of “successful” Bernoulli trials.

Notice: m ∈ ℕ, π ∈ [0, 1].

Can easily show:

E(Y) = mπ,  Var(Y) = mπ(1 − π).

Notice: both parameters jointly determine the mean and variance; contrast with a normal random variable.
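A quick simulation sketch of these moment formulas (assuming numpy; m = 20 and π = 0.4 are arbitrary illustration values):

```python
# Sums of m i.i.d. Bernoulli(pi) draws should have mean ~ m*pi and
# variance ~ m*pi*(1 - pi).
import numpy as np

rng = np.random.default_rng(0)
m, pi = 20, 0.4
draws = rng.binomial(1, pi, size=(100_000, m)).sum(axis=1)  # 100k Binomial draws

print(draws.mean(), m * pi)            # ~8.0 vs 8.0
print(draws.var(), m * pi * (1 - pi))  # ~4.8 vs 4.8
```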


The associated probability mass function for Binomial Y is:

Pr(Y = y) = (m choose y) π^y (1 − π)^{m−y}

Suppose we observe sample data on binary Y and predictors X := {X_1, . . . , X_p}.

Given any particular values for the predictors, our sample data would yield a certain number of “successes” (Y = 1) and a certain number of “failures” (Y = 0).

Assuming a Binomial model for Y (given the values of any set of predictors X), we could then write

π := Pr(Y = 1 | X) = Σ_{j=0}^{p} β_j X_j


Therefore, the likelihood function associated to this set of sample data (size n) is:

ℓ(π | x_1, . . . , x_p, y) = Π_{i=1}^{n} (m_i choose y_i) π^{y_i} (1 − π)^{m_i − y_i}

There is a problem with this formulation though: recall that we set

π = Σ_{j=0}^{p} β_j X_j

Recall that probabilities always live on [0, 1]. But those predictors X_j can be anything at all!
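A small simulated sketch of the problem (variable names hypothetical): fitting a least-squares line to a binary response happily produces fitted “probabilities” outside [0, 1].

```python
# OLS on a binary response: the fitted values escape [0, 1].
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0, 2, size=200)
y = (x + rng.normal(0, 1, size=200) > 0).astype(float)  # binary response

X = np.column_stack([np.ones_like(x), x])     # intercept + predictor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
fitted = X @ beta

print(f"min fitted: {fitted.min():.3f}, max fitted: {fitted.max():.3f}")
# Typically prints values below 0 and above 1: nonsensical as probabilities.
```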


There are many different functions that will transform probabilities (numbers between 0 and 1) into numbers that span all of ℝ. The most commonly used are:

Logit or inverse logistic function:

g_1(π) = log(π / (1 − π))

Probit or inverse Normal function:

g_2(π) = Φ^{−1}(π)   (inverse CDF of the standard Normal)

Complementary log-log function:

g_3(π) = log(−log(1 − π))

Log-log function:

g_4(π) = −log(−log(π))
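A sketch evaluating all four transformations on a grid of probabilities (assuming scipy; norm.ppf is the standard Normal inverse CDF):

```python
import numpy as np
from scipy.stats import norm

pi = np.array([0.05, 0.25, 0.50, 0.75, 0.95])

logit   = np.log(pi / (1 - pi))     # g1
probit  = norm.ppf(pi)              # g2
cloglog = np.log(-np.log(1 - pi))   # g3
loglog  = -np.log(-np.log(pi))      # g4

for row in zip(pi, logit, probit, cloglog, loglog):
    print("pi=%.2f  logit=%+.3f  probit=%+.3f  cloglog=%+.3f  loglog=%+.3f" % row)
```

Each column spans negative and positive values even though π stays inside (0, 1).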


Logistic regression

The standard logistic regression model is:

log(π / (1 − π)) = Σ_{j=0}^{p} β_j X_j,

where π := Pr(Y = 1 | X).

It’s often of interest to solve this explicitly for the conditional “success” probability π:

π = exp(Σ_{j=0}^{p} β_j X_j) / (1 + exp(Σ_{j=0}^{p} β_j X_j)) = 1 / (1 + exp(−Σ_{j=0}^{p} β_j X_j))

This gives us an explicit expression for Pr(Y = 1 | X).

Note: a function of the form exp(x) / (1 + exp(x)) is called a logistic function.
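A tiny sketch, assuming scipy: scipy.special.expit implements exactly this logistic function, and it numerically inverts the logit transform.

```python
import numpy as np
from scipy.special import expit, logit

x = np.linspace(-5, 5, 11)
pi = expit(x)                     # maps all of R into (0, 1)
assert np.allclose(logit(pi), x)  # logit undoes expit

print(pi.min(), pi.max())         # stays strictly inside (0, 1)
```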


Logistic regression: MLE

Recall that we already have a likelihood function for this model:

ℓ(π | x_1, . . . , x_p, y) = Π_{i=1}^{n} (m_i choose y_i) π^{y_i} (1 − π)^{m_i − y_i}

One can then use this function to derive the MLEs of the model coefficients, the β_j’s, just as in ordinary linear regression.

In general though, the MLE solutions won’t have nice closed-form expressions like we had in linear regression, but they can be very well approximated by a variety of numerical techniques (more on this later).
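A sketch of the numerical idea on simulated data (not the course's own code; the true coefficients are chosen for illustration): write down the negative Bernoulli log-likelihood (the m_i = 1 case) and hand it to a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = rng.binomial(1, expit(-0.5 + 1.2 * x))  # true (beta0, beta1) = (-0.5, 1.2)

def neg_log_lik(beta):
    eta = beta[0] + beta[1] * x              # linear predictor
    # log-likelihood: sum_i [ y_i*eta_i - log(1 + exp(eta_i)) ], written stably
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

fit = minimize(neg_log_lik, x0=np.zeros(2))  # quasi-Newton search from (0, 0)
print(fit.x)                                 # MLEs, close to (-0.5, 1.2)
```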


Logistic regression: interpretation of model coefficients

Recall: the model coefficients in ordinary linear regression were often easy to interpret. For example, the classic interpretation that β_1 tells you how much the average value of Y should change if X changes by 1 unit in the simple model (see HW1 also):

Y = β_0 + β_1 X + ε.

A particularly nice feature of logistic regression is that the model coefficients are also relatively easy to interpret; i.e. they have some inherent, stand-alone meaning.


Logistic regression: odds and odds ratios

Simply exponentiating the standard logistic regression model, we find:

π / (1 − π) = e^{β_0 + β_1 X_1 + · · · + β_p X_p}

The left-hand side of the equation is just the odds of observing Y = 1, conditional on X:

odds(Y = 1 | X) = Pr(Y = 1 | X) / Pr(Y = 0 | X)

Odds are not the same thing as probability (they can be any non-negative value), but encode similar information. For example:

The probability of flipping heads on a fair coin is 1/2.

The odds of flipping heads on a fair coin are (1/2)/(1/2) = 1, or we often say are “1 to 1.”


The probability of drawing an ace from a standard deck of 52 cards is 4/52 = 1/13.

The odds of drawing an ace from a standard deck of 52 cards are (4/52)/(48/52) = 1/12, or we often say are “1 to 12.”


Generally speaking, odds(Y = 1) ≈ Pr(Y = 1) only when Pr(Y = 1) is small. This fact sometimes has clinical importance, e.g. when modelling the incidence rate of a rare disease.
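A one-loop sketch of that approximation:

```python
# odds ~ probability only when the probability is small
for p in (0.01, 0.05, 0.20, 0.50, 0.80):
    odds = p / (1 - p)
    print(f"Pr = {p:.2f}  odds = {odds:.4f}  odds/Pr = {odds / p:.3f}")
```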


Let’s look at a simple logistic regression model on one predictor:

odds(Y = 1 | X = x) = e^{β_0 + β_1 x}

Now consider the odds ratio:

odds(Y = 1 | X = x) / odds(Y = 1 | X = x − 1) = e^{β_0 + β_1 x} / e^{β_0 + β_1 (x−1)} = e^{β_1}

Thus, the quantity e^{β_1} tells us how much we would expect the odds of conditional “success” to change multiplicatively when X changes by 1 unit.

For instance, if β_1 = 0 (corresponding to no estimated linear effect of X on the logit of Y), then the odds ratio equals 1; i.e. changing X doesn’t seem to change the expected odds of success in Y at all.


If β_1 is a large positive number, say β_1 = 10 (corresponding to a large, positive estimated linear effect of X on the logit of Y), then the odds ratio equals about 22,000; i.e. increasing X by a single unit greatly increases the expected odds of success in Y.

Similarly, if β_1 is a large negative number, say β_1 = −8 (corresponding to a large, negative estimated linear effect of X on the logit of Y), then the odds ratio equals about 0.00033; i.e. increasing X by a single unit greatly decreases the expected odds of success in Y.
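A sketch of this interpretation in practice (simulated data, assuming statsmodels is installed; the true β_1 = 0.7 is illustrative only): exponentiate the fitted slope to get the estimated odds ratio.

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(7)
x = rng.normal(size=1000)
y = rng.binomial(1, expit(0.2 + 0.7 * x))  # true beta1 = 0.7

X = sm.add_constant(x)                     # intercept + predictor
res = sm.Logit(y, X).fit(disp=0)

beta1 = res.params[1]
print(f"beta1 = {beta1:.3f}, odds ratio = {np.exp(beta1):.3f}")
# A 1-unit increase in x multiplies the odds of Y = 1 by about exp(0.7) ~ 2.
```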


Equivalently, one can talk about the log-odds ratio rather than the ordinary odds ratio:

β_1 = log[ odds(Y = 1 | X = x) / odds(Y = 1 | X = x − 1) ]
    = log[odds(Y = 1 | X = x)] − log[odds(Y = 1 | X = x − 1)]

The analogous interpretations hold (since the logarithm is a monotone increasing function), but now we speak about increases or decreases in log-odds rather than just odds.

You will see different people use either scale to interpret their model coefficients. Regardless, it is always about describing whether a change in a predictor X corresponds to an increase or decrease in the odds/probability of a success in Y. [Notice: the odds can increase if and only if the probability of success increases.]


Note: all this readily extends to multiple predictors.


Generalized linear models (GLMs)

Both ordinary linear regression and logistic regression are special cases of a much more general and flexible modelling framework: generalized linear models.

Note: the basic GLM framework always assumes that data points are independently sampled; e.g. random sampling.

GLMs are specified in two pieces:

(1) Specify a probability distribution for the data process; this will generate a likelihood function for any set of sample data.

(2) Specify a link function, g(E(Y | X)), that relates the mean of the response process to a (fixed) linear combination of the predictors.

We then use the likelihood function to estimate the conditional mean of the response process (conditional on the predictors).
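In software these two pieces are usually what you literally pass in. A minimal sketch, assuming statsmodels (simulated data; the family fixes the likelihood, and each family carries a default link):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))  # a binary response

X = sm.add_constant(x)  # linear predictor will be beta0 + beta1*x

# Piece (1): Binomial likelihood.  Piece (2): logit link (Binomial's default).
model = sm.GLM(y, X, family=sm.families.Binomial())
print(model.fit().params)
```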


Ordinary linear regression as GLM

For the ordinary linear regression model, we first wrote

Y = Σ_{j=0}^{p} β_j X_j + ε,

where, as before, we define X_0 = 1, and ε ∼ N(0, σ²).

Recall that an equivalent formulation of this model was to say that the probability distribution of the response Y (conditional on the predictors) is Normal with mean Σ_{j=0}^{p} β_j X_j and variance σ²:

Y ∼ N( Σ_{j=0}^{p} β_j X_j , σ² ).

This is equivalent to Step 1 of specifying a GLM; i.e. we have proposed a probability distribution for our data process.


Note:

Y ∼ N( Σ_{j=0}^{p} β_j X_j , σ² )

is sloppy (but fairly universal) notation. It would be far more accurate to write:

Y | X ∼ N( Σ_{j=0}^{p} β_j X_j , σ² ),

because we have defined a distribution for Y conditional on X. This is not the same thing as the unconditional distribution of Y (regardless of X).

What this means is that you do not need the response variable to be normally distributed in ordinary linear regression (we already knew this), but we do need the response variable to be normally distributed for every set of fixed values of the predictors X.


As we saw last time, the probability specification

Y | X ∼ N( Σ_{j=0}^{p} β_j X_j , σ² )

generates a corresponding likelihood function for any set of sample data:

ℓ(β_0, . . . , β_p, σ | x_1, . . . , x_p, y) = Π_{i=1}^{n} (1 / (√(2π) σ)) exp( −(y_i − Σ_{j=0}^{p} β_j x_{j,i})² / (2σ²) )

Notation: remember that bold-face denotes a vector. Since every one of our n sample units generates an observation for x_1, . . . , x_p and y, I have just written

x_1 := {x_{1,1}, x_{1,2}, . . . , x_{1,n}}
x_2 := {x_{2,1}, x_{2,2}, . . . , x_{2,n}}, etc.
y := {y_1, y_2, . . . , y_n}


Moreover, the above probability specification implicitly defines the link function relating the mean of the response process to the linear combination of predictors. For ordinary linear regression, the link function is just the identity function:

g(E(Y | X)) = E(Y | X)

Recall that we wrote this out explicitly in Lecture 2: if

Y | X ∼ N( Σ_{j=0}^{p} β_j X_j , σ² ),

then

E(Y | X) = Σ_{j=0}^{p} β_j X_j


Thus, ordinary linear regression corresponds to a GLM with a Gaussian (i.e. normal) likelihood and the identity link.
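A sketch of this equivalence, assuming statsmodels (simulated data): a Gaussian-family GLM with its default identity link reproduces the OLS coefficient estimates.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
glm = sm.GLM(y, X, family=sm.families.Gaussian()).fit()  # identity link default

print(np.allclose(ols.params, glm.params))  # True: identical estimates
```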

Note that a very special feature of Gaussian GLMs is that the variance of the responses Y for any fixed X is totally independent of the (conditional) mean of the responses. This is a very helpful feature that explains the popularity of Gaussian models.

As we will see though, such a modelling assumption is often unreasonable; i.e. we often expect the variance of responses about a mean to actually depend on the value of the mean in some way.

Moreover, some data structures require such dependency.


Logistic regression as GLM

Recall the standard logistic regression model is:

log(π / (1 − π)) = Σ_{j=0}^{p} β_j X_j,

where π := Pr(Y = 1 | X).

Here, the link function is

g(E(Y | X)) = logit(E(Y | X)).

Why? Recall that for Bernoulli (0 or 1) responses, Pr(Y = 1 | X) = E(Y | X).

Thus,

logit(E(Y | X)) = log( E(Y | X) / (1 − E(Y | X)) ) = log( Pr(Y = 1 | X) / (1 − Pr(Y = 1 | X)) )


Since each Y conditional on X is assumed Binomial(m, π), we have the likelihood function:

ℓ(π | x_1, . . . , x_p, y) = Π_{i=1}^{n} (m_i choose y_i) π^{y_i} (1 − π)^{m_i − y_i}

Recall that usually m_i = 1 always, but it is possible that you have multiple observed responses for the same values of your predictors (especially true when your predictors are categorical).

Thus, logistic regression corresponds to a GLM with a Binomial likelihood and the logit link.

Note also that for such a model, the mean and variance of the conditional response process Y are inherently dependent, since for Bernoulli Y we have

Var(Y | X) = π(1 − π) = E(Y | X) · (1 − E(Y | X)).


Residuals of a GLM

Now it should be clearer why we do not need to write down an explicit error term for our GLM: it is implicitly defined once we specify the link function.

For Gaussian regression with the identity link (ordinary linear regression), the model is totally defined via

Y | X ∼ N( Σ_{j=0}^{p} β_j X_j , σ² ).

This says that the response data are normally distributed about the conditional mean E(Y | X) = Σ_{j=0}^{p} β_j X_j with constant variance σ².

Therefore, the individual model errors (theoretical residuals) are:

ε = Y − E(Y | X) = Y − Σ_{j=0}^{p} β_j X_j = Y − Ŷ


This matches exactly with the traditional construal of ordinary linear regression:

Note: assuming Y | X ∼ N( Σ_{j=0}^{p} β_j X_j , σ² ) implies ε ∼ N(0, σ²), or more precisely, implies ε | X ∼ N(0, σ²).


Similarly, for Binomial/Bernoulli regression with the logit link (logistic regression), the individual model errors will simply be the deviations of the observed responses, Y, from the conditional mean of the response, E(Y | X) = Pr(Y = 1 | X):

ε = Y − E(Y | X) = Y − 1 / (1 + exp(−Σ_{j=0}^{p} β_j X_j))

Note: because we are modelling

g(E(Y | X)) = logit(E(Y | X)) = Σ_{j=0}^{p} β_j X_j,

when we solved this equation for E(Y | X) we derived the expit function:

E(Y | X) = 1 / (1 + exp(−Σ_{j=0}^{p} β_j X_j))


Because the response process is binary, these model errors are also binary for any fixed values of the predictors X. This looks very different than ordinary linear regression!

Fix a particular set of values for the predictors X. Then our model says that our best guess for the (conditional) mean of the response (i.e. the conditional probability that Y = 1) is:

E(Y | X) = 1 / (1 + exp(−Σ_{j=0}^{p} β_j X_j))

But since Y can only equal 0 or 1, the (conditional) model error can only assume one of two values:

0 − 1 / (1 + exp(−Σ_{j=0}^{p} β_j X_j))   or   1 − 1 / (1 + exp(−Σ_{j=0}^{p} β_j X_j))


Thus, the error term is a transformed Bernoulli random variable.

Note: this directly parallels ordinary linear regression (assumed Gaussian response), where the error term was a transformed Gaussian random variable.

We will return to diagnosing model fit by examining residuals soon.


Some important and common GLMs

Multinomial regression: categorical response with more than two categories (usually AVOID at all costs).

Poisson regression, ZIP regression, negative binomial regression: count response data.

Gamma regression, lognormal regression, inverse Gaussian regression: continuous, but skewed response data.

Beta regression, ZOIB regression: proportion/percentage (0–100%) response data.

...and many others.

We will talk about most of these at least a little.


Multinomial regression

As a general rule, multinomial regression should be avoided at all costs, unless you are working with at least 1000s of observations.

It often relies on suspicious assumptions, is notoriously difficult to perform diagnostics with, and suffers from extremely low power.

But you still occasionally see it used (poorly) in small data settings, so let’s just briefly introduce it.


The response data Y are now categorical on more than two categories, so take values in {0, 1, 2, . . . , k}.

Just as in regression with Binomial (0/1) data, if we can model the probabilities of Y falling into any of its k + 1 categories, then we know everything about the random phenomenon.

Now though, there are k probabilities that we need to model:

π_1 := Pr(Y = 1 | X)
π_2 := Pr(Y = 2 | X)
. . .
π_k := Pr(Y = k | X)

Note that if we knew these, then we would automatically know

π_0 := Pr(Y = 0 | X) = 1 − π_1 − π_2 − · · · − π_k


The most typical way you see this modelled is via multinomial logistic regression, another kind of GLM.

Remember: 2 pieces to define a GLM:
(1) a likelihood for the data, and
(2) a link function relating the mean of the response (conditional on the predictors) to a linear combination of the predictors.

Here, we assume a Multinomial model for the response data (sample size n).

This is the natural generalization of a Binomial phenomenon to more than two categories; in particular, it assumes that each sample observation of Y ∈ {0, 1, . . . , k} is i.i.d. (independent and identically distributed).

Thus, while somewhat flexible, this does not characterize all possible probability distributions on the theoretical sample space.


To make this easier to write down, suppose our n sample observations generate z_0 counts of Y = 0, z_1 counts of Y = 1, . . . , and z_k counts of Y = k. Then:

ℓ(π_0, π_1, . . . , π_k | x_1, . . . , x_p, y) = (n choose z_0, z_1, . . . , z_k) π_0^{z_0} π_1^{z_1} · · · π_k^{z_k},

where (n choose z_0, z_1, . . . , z_k) is a multinomial coefficient that simply counts the number of ways to simultaneously select z_0 objects of type 0, z_1 objects of type 1, . . . , and z_k objects of type k from n objects. [Think: how many ways could you arrange 2 green balls, 3 red balls, and 1 blue ball in a straight line?]

Notice: if you sub in k = 1, this model reduces back to the Binomial model.
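A sketch of both objects, assuming scipy (the probabilities (0.5, 0.3, 0.2) are arbitrary illustration values): the multinomial coefficient for the balls example above, and the multinomial pmf that appears in the likelihood.

```python
from math import factorial
from scipy.stats import multinomial

# 2 green, 3 red, 1 blue in a line: 6! / (2! 3! 1!) orderings
print(factorial(6) // (factorial(2) * factorial(3) * factorial(1)))  # 60

# Multinomial pmf: counts (z0, z1, z2) = (2, 3, 1) in n = 6 trials
print(multinomial.pmf([2, 3, 1], n=6, p=[0.5, 0.3, 0.2]))
```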


Multinomial logistic regression

Standard multinomial regression now requires you to declare a reference category; it then posits a link function for the mean number of outcomes in each of the k other categories relative to that reference.

In Binomial regression, the choice was arbitrary (i.e. either “0” or “1” could be declared a Bernoulli “success”), but now we have options; the decision is made according to the research problem.

We will declare Y = 0 as the reference category. Then multinomial logistic regression posits:

log( π_i / π_0 ) = Σ_{j=0}^{p} β_{i,j} X_j,   1 ≤ i ≤ k

Notice that we are now modelling k different logit responses, each with their own set of coefficients that capture the linear associations between the predictors and logit response.
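A sketch of fitting such a model on simulated data, assuming statsmodels (MNLogit takes the lowest category, here Y = 0, as the reference): the fit returns one coefficient column per non-reference category.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x = rng.normal(size=600)
# a 3-category response whose category probabilities depend on x (softmax)
eta = np.column_stack([np.zeros_like(x), 0.8 * x, -0.5 + 0.3 * x])
p = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=pi) for pi in p])

X = sm.add_constant(x)
res = sm.MNLogit(y, X).fit(disp=0)
print(res.params)  # one (beta_{i,0}, beta_{i,1}) column per category i = 1, 2
```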


Everything now proceeds as with Binomial logistic regression, except we would no longer talk about the “odds of success” increasing or decreasing with values of X (estimated via the model coefficients).

Instead, now we would talk about the “odds of being in category i relative to the reference category 0”. E.g. odds of voting NDP vs. Conservative, Green vs. Conservative, and Liberal vs. Conservative.

One major problem with this kind of model is that it assumes independence of irrelevant alternatives (IIA). That is, removing one category from consideration should not change the odds of responding in another category.

This assumption is often unreasonable in practice. E.g. if there were no Green candidates, it is very likely that more people would vote NDP and/or Liberal vs. Conservative, thus changing these odds.

Some other multinomial regression models relax this assumption, butit is very difficult to do well.
