
Lecture 6 STK3100 - Categorical responses

22. September 2014

Plan for lecture:

1. GLM for binary and binomial data

2. Link functions

3. Parameter interpretation in logistic regression

4. Parameter interpretation with other link functions

5. Goodness-of-fit: Hosmer-Lemeshow-test

6. ROC curves

7. Overdispersion


Binomial responses

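• Assume $Y_i \sim \mathrm{Bin}(n_i, \pi_i)$, independent

• The data belong to the exponential family with pmf

$$f(y; \theta_i, \phi_i) = \binom{n_i}{y} \pi_i^{y} (1-\pi_i)^{n_i-y} = c(y)\exp(y\theta_i - a(\theta_i))$$

• $\theta_i = \log(\pi_i/(1-\pi_i))$

• $a(\theta_i) = n_i \log(1+\exp(\theta_i))$

• dispersion parameter $\phi_i = 1$ (known) and $c(y) = \binom{n_i}{y}$

• $E[Y_i] = a'(\theta_i) = n_i \frac{\exp(\theta_i)}{1+\exp(\theta_i)} = n_i\pi_i = \mu_i$

• $\mathrm{Var}[Y_i] = \phi_i a''(\theta_i) = n_i \frac{\exp(\theta_i)}{(1+\exp(\theta_i))^2} = n_i\pi_i(1-\pi_i)$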


Binomial or binary responses?

• Assume $Y_i \sim \mathrm{Bin}(n_i, \pi_i)$, independent

• The data can also be represented as

$$Y_{i,j} = \begin{cases} 1 & \text{for } j = 1, \ldots, Y_i \\ 0 & \text{for } j = Y_i + 1, \ldots, n_i \end{cases}$$

• which gives us binary data

• Note: if $Y_{i,j} \sim \mathrm{Bin}(1, \pi_i)$ but the $Y_{i,j}$ are dependent within group $i$, their sum is not binomial

• Positive dependence gives overdispersion

• Grouping, and then taking overdispersion into account, may be a way to handle such data


Binary responses or grouped data?

• $Y_i \sim \mathrm{Bin}(n_i, \pi_i)$, $i = 1, \ldots, k$, or

• $Y_{i'} \sim \mathrm{Bin}(1, \pi_{i'})$, $i' = 1, \ldots, n' = \sum_{i=1}^{k} n_i$

Estimation is equivalent for both representations

AIC comparisons between models are also equivalent

The deviance goodness-of-fit test is different!

• $\Delta \sim \chi^2_{n-q}$, with $n$ observations and $q$ parameters

• $n = k$ for grouped data

• $n' = \sum_{i=1}^{k} n_i$ for binary data

• To trust the deviance goodness-of-fit test, we require $Y_i \sim \mathrm{Bin}(n_i, \pi_i)$ with $n_i\pi_i > 5$ and $n_i(1-\pi_i) > 5$


Ex: Beetles

> dim(beetle)

[1] 8 3

> glm(cbind(Dode,Ant-Dode)~Dose,family=binomial,data=beetle)

Coefficients:

(Intercept) Dose

-60.72 34.27

Degrees of Freedom: 7 Total (i.e. Null); 6 Residual

Null Deviance: 284.2

Residual Deviance: 11.23 AIC: 41.43

> dim(beetle2)

[1] 481 2

> glm(Dode~Dose,family=binomial,data=beetle2)

Coefficients:

(Intercept) Dose

-60.72 34.27



Residual Deviance: 372.5 AIC: 376.5
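The equivalence of the grouped and the binary representation can be checked directly. The sketch below assumes that `beetle` holds the classic Bliss beetle mortality counts (8 dose groups, 481 beetles in total); the numbers are typed in from that data set and are not taken from the slides. `beetle2` is rebuilt by expanding each group to one row per beetle.

```r
# Beetle mortality counts (assumed to be the classic Bliss data behind `beetle`)
beetle <- data.frame(
  Dose = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839),
  Ant  = c(59, 60, 62, 56, 63, 59, 62, 60),  # number of beetles exposed
  Dode = c(6, 13, 18, 28, 52, 53, 61, 60)    # number of beetles killed
)

# Expand to one row per beetle: Dode = 1 for the killed, 0 for the survivors
beetle2 <- data.frame(
  Dose = rep(beetle$Dose, times = beetle$Ant),
  Dode = unlist(Map(function(d, n) rep(c(1, 0), c(d, n - d)),
                    beetle$Dode, beetle$Ant))
)

grouped <- glm(cbind(Dode, Ant - Dode) ~ Dose, family = binomial, data = beetle)
binary  <- glm(Dode ~ Dose, family = binomial, data = beetle2)

rbind(grouped = coef(grouped), binary = coef(binary))            # same estimates
c(grouped = deviance(grouped), binary = deviance(binary))        # different deviances
c(grouped = df.residual(grouped), binary = df.residual(binary))  # different df
```

The coefficients agree, while the residual deviance and its degrees of freedom depend on the representation, which is why the deviance goodness-of-fit test should only be read off the grouped fit.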

GLM for binomial or binary ($n_i = 1$) responses

• Independent $Y_i$ with success probability $\pi_i$

• Linear predictor $\eta_i = \beta^T x_i$

• Link function $g(\pi_i) = \eta_i$

The logit link function is the most common:

$$g(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \mathrm{logit}(\pi_i)$$

which gives

$$\pi_i = \frac{\exp(\eta_i)}{1+\exp(\eta_i)} = g^{-1}(\eta_i)$$

This is the canonical link function, i.e. the canonical parameter is $\theta_i = \eta_i$

The logit link yields logistic regression


Requirements for link function for binomial data

$g(\cdot)$ should

• be smooth (differentiable)

• be strictly monotone (increasing)

• take values over all real numbers

• $g((0,1)) = \mathbb{R}$, or equivalently $g^{-1}(\mathbb{R}) = (0,1)$

• $g^{-1}(\eta)$ is then the cumulative distribution function (CDF) of a continuous distribution on $\mathbb{R}$

The logit link satisfies these requirements. $g^{-1}(\eta)$ is the CDF of the "standard" logistic distribution, with density

$$\frac{\exp(\eta)}{(1+\exp(\eta))^2}$$


CDF and pdf in "standard" logistic distribution

[Figure, two panels for x from -6 to 6. Left: cumulative logistic distribution, F(x). Right: density of the logistic distribution, f(x).]

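The pdf is symmetric around $x = 0$, hence the expectation is 0

The variance is

$$\int_{-\infty}^{\infty} x^2 \frac{\exp(x)}{(1+\exp(x))^2}\,dx = \frac{\pi^2}{3} \approx 3.29,$$

so the standard deviation is $\pi/\sqrt{3} = 1.8137992$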
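A quick numerical check of the variance, as a minimal sketch using base R only (`dlogis` is the density of the standard logistic distribution):

```r
# Second moment of the standard logistic distribution (its mean is 0)
v <- integrate(function(x) x^2 * dlogis(x), -Inf, Inf)$value
c(variance = v, pi2_over_3 = pi^2 / 3, sd = sqrt(v))
```

The value 1.8137992 is the square root of the integral, i.e. the standard deviation.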


Probit link: Inverse of CDF for standard normal

$$g(\pi) = \Phi^{-1}(\pi)$$

where $\Phi(y) = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}} \exp(-\tfrac{1}{2}x^2)\,dx$

• Since the pdf of the standard normal distribution is also symmetric around 0, the probit link often gives results comparable to those from logistic regression

• However, the logistic distribution has heavier tails than the normal, and in some situations the probit link may be better


CDF and pdf for logit and probit

[Figure, two panels for x from -6 to 6. Left: cumulative distribution functions, F(x). Right: densities, f(x). Curves: logistic and probit (scaled).]


Comparing estimates from logit and probit

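$$E[Y_i] = g^{-1}(\eta_i) \approx g^{-1}(0) + (g^{-1})'(0)\,\eta_i = \begin{cases} 0.5 + 0.25\,\eta_i^{l} & \text{logit} \\ 0.5 + \Phi'(0)\,\eta_i^{p} & \text{probit} \end{cases}$$

I.e., for $\eta_i \approx 0$, since $\Phi'(0) = 1/\sqrt{2\pi}$,

$$\eta_i^{l} \approx \frac{\Phi'(0)}{0.25}\,\eta_i^{p} = \sqrt{8/\pi}\;\eta_i^{p} \approx 1.6\,\eta_i^{p}$$

or

$$\beta_j^{l} \approx 1.6\,\beta_j^{p}$$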
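The scale factor is the ratio of the slopes of the two inverse links at 0. A minimal sketch in base R (`dlogis` and `dnorm` are the logistic and normal densities, i.e. the derivatives of the CDFs):

```r
slope_logit  <- dlogis(0)   # derivative of the logistic CDF at 0, equals 0.25
slope_probit <- dnorm(0)    # derivative of the normal CDF at 0, equals 1/sqrt(2*pi)
c(ratio = slope_probit / slope_logit, closed_form = sqrt(8 / pi))
```

The factor is a local approximation around $\eta = 0$, so ratios of fitted coefficients need not match it exactly.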


R-output beetles: Logit vs. Probit

> logfit<-glm(cbind(Dode,Ant-Dode)~Dose,binomial(link=logit),beetle)

> profit<-glm(cbind(Dode,Ant-Dode)~Dose,binomial(link=probit),beetle)

> logfit

Coefficients:

(Intercept) Dose

-60.72 34.27




> profit

Coefficients:

(Intercept) Dose

-34.94 19.73




> logfit$coef/profit$coef

(Intercept) Dose

1.737999 1.737147

Akaike information criterion (AIC)

$$\mathrm{AIC} = -2\hat{l} + 2q$$

• $q$ = number of parameters in the model

• $\hat{l}$ is the maximized log-likelihood under the model

• AIC is used for model selection

• The model with the lowest AIC is the best according to this criterion
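The definition can be verified against R's `AIC()`. The sketch below uses the built-in `mtcars` data purely for illustration; that data set is not part of the lecture.

```r
# Built-in data, used only for illustration: transmission type against car weight
fit <- glm(am ~ wt, family = binomial, data = mtcars)

l_hat <- as.numeric(logLik(fit))  # maximized log-likelihood
q     <- length(coef(fit))        # number of estimated parameters
c(by_hand = -2 * l_hat + 2 * q, AIC = AIC(fit))
```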


R-output beetles: Logit

> summary(logfit)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.5941 -0.3944 0.8329 1.2592 1.5940

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -60.717 5.181 -11.72 <2e-16 ***

Dose 34.270 2.912 11.77 <2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 284.202 on 7 degrees of freedom

Residual deviance: 11.232 on 6 degrees of freedom

AIC: 41.43


R-output beetles: Probit

> summary(profit)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.5714 -0.4703 0.7501 1.0632 1.3449

Coefficients:


(Intercept) -34.935 2.648 -13.19 <2e-16 ***

Dose 19.728 1.487 13.27 <2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)



AIC: 40.318

Number of Fisher Scoring iterations: 4

clog-log-link based on the Gumbel distribution

The link $\eta_i = g(\pi_i) = \log(-\log(1-\pi_i))$ is called the "complementary log-log link"

Its inverse is given by

$$\pi_i = 1 - \exp(-\exp(\eta_i)) = F(\eta_i)$$

which is the CDF of the (standardized) Gumbel distribution

Properties:

• not symmetric

• light tail towards $+\infty$

• tail like that of the logistic distribution towards $-\infty$

• expectation $= -\gamma \approx -0.58$, where $\gamma \approx 0.5772$ is Euler's constant

• variance $\pi^2/6 \approx 1.645$
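The moments can be checked numerically from the density $f(x) = F'(x) = \exp(x - \exp(x))$. This is a minimal sketch in base R; `digamma(1)` equals minus Euler's constant.

```r
# Density of F(x) = 1 - exp(-exp(x)), the inverse clog-log link
f <- function(x) exp(x - exp(x))

total <- integrate(f, -Inf, Inf)$value
m     <- integrate(function(x) x * f(x), -Inf, Inf)$value
v     <- integrate(function(x) (x - m)^2 * f(x), -Inf, Inf)$value
c(total = total, mean = m, minus_euler = digamma(1),
  variance = v, pi2_over_6 = pi^2 / 6)
```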

CDF and pdf in Gumbel distribution

[Figure, two panels for x from -4 to 4. Left: cumulative distribution function of the Gumbel distribution, F(x). Right: Gumbel density, f(x).]


R-output beetles: Clog-log

> clogfit<-glm(cbind(Dode,Ant-Dode)~Dose,binomial(link=cloglog),beetle)

> summary(clogfit)

Coefficients:

(Intercept) -39.572 3.240 -12.21 <2e-16 ***

Dose 22.041 1.799 12.25 <2e-16 ***



AIC: 33.644

Number of Fisher Scoring iterations: 4

> logfit$coef/clogfit$coef

(Intercept) Dose

1.534342 1.554832


Comparing link functions by AIC

> AIC(logfit,profit,clogfit)

df AIC

logfit 2 41.43027

profit 2 40.31780

clogfit 2 33.64448

The clog-log link gives the lowest AIC. Since all three models have the same number of parameters, it also gives the highest log-likelihood, i.e. the best fit


Fitted probabilities for beetle data

with logit link and clog-log link:

[Figure: fitted probability curves. x-axis: dose (log_10), from 1.70 to 1.85; y-axis: proportion of dead beetles, from 0 to 1. Curves: logistic and cloglog.]

The clog-log link fits the observed proportions better than the logit link, with residual deviance 3.45 for clog-log and 11.23 for logit


Including a second-order term of dose

> form = cbind(Dode,Ant-Dode)~Dose+I(Dose^2)

> logfit2<-glm(form,binomial(link=logit),beetle)

> profit2<-glm(form,binomial(link=probit),beetle)

> clogfit2<-glm(form,binomial(link=cloglog),beetle)

> AIC(clogfit,logfit2,profit2,clogfit2)

df AIC

clogfit 2 33.64448

logfit2 3 35.39294

profit2 3 35.29647

clogfit2 3 35.60866


Fitted probabilities for beetle data

including also models with quadratic terms of Dose

[Figure: fitted probability curves. x-axis: dose (log_10), from 1.70 to 1.85; y-axis: proportion of dead beetles, from 0 to 1. Curves: logistic, cloglog, and logistic with second-order term.]

clog-log link: the quadratic term yields residual deviance 3.19, compared to 3.45 with only the linear term


Interpretation of parameters in logistic regression

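The odds for an event with probability $\pi$ are defined as

$$\mathrm{Odds} = \frac{\pi}{1-\pi}$$

In logistic regression, with $\eta = \beta^T x$, the odds are

$$\mathrm{Odds} = \frac{\frac{\exp(\eta)}{1+\exp(\eta)}}{1 - \frac{\exp(\eta)}{1+\exp(\eta)}} = \frac{\frac{\exp(\eta)}{1+\exp(\eta)}}{\frac{1}{1+\exp(\eta)}} = \exp(\eta)$$

i.e.

$$\eta = \log \mathrm{Odds}$$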


Interpretation of parameters in logistic regression:

Odds-ratio

• Let $x'_k = x_k$ for $k \neq j$ and $x'_j = x_j + 1$, i.e. $x' - x = (0, \ldots, 0, 1, 0, \ldots, 0)$

• The ratio between the odds for explanatory variables $x'$ and $x$ is called the odds-ratio (with $\pi' = e^{\eta'}/(1+e^{\eta'})$ and $\eta' = \beta^T x'$):

$$\mathrm{OR}_j = \frac{\pi'/(1-\pi')}{\pi/(1-\pi)} = \frac{\mathrm{Odds}'}{\mathrm{Odds}} = \exp(\eta' - \eta) = \exp(\beta^T(x' - x)) = \exp(\beta_j)$$

or

$$\beta_j = \log(\mathrm{OR}_j)$$

• i.e. the regression coefficients are log-odds-ratios, or relative changes in odds on the log scale
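The identity $\mathrm{OR}_j = \exp(\beta_j)$ can be confirmed from fitted probabilities. The sketch below again uses the built-in `mtcars` data only as a stand-in; the covariate values 3 and 4 are arbitrary points one unit apart.

```r
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Fitted probabilities at wt = x and at wt = x + 1
p    <- predict(fit, newdata = data.frame(wt = c(3, 4)), type = "response")
odds <- p / (1 - p)

c(odds_ratio = unname(odds[2] / odds[1]),
  exp_beta   = unname(exp(coef(fit)["wt"])))
```

The odds ratio is the same whichever starting value of the covariate is used, which is what makes $\exp(\beta_j)$ a convenient summary.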

Odds-ratio ≈ Relative Risk (RR) when the probabilities are small

• Relative risk is defined as the ratio between two probabilities:

$$\mathrm{RR} = \frac{\pi'}{\pi}$$

• When both $\pi$ and $\pi'$ are small, $1-\pi \approx 1$ and $1-\pi' \approx 1$. Therefore,

$$\mathrm{OR} = \frac{\pi'}{\pi} \cdot \frac{1-\pi}{1-\pi'} \approx \frac{\pi'}{\pi} = \mathrm{RR}$$

• I.e., when the probabilities are small, $\exp(\beta_j)$ expresses approximately the relative change in probability when $x_j$ is increased by one unit


The approximation OR ≈ RR

Columns give π, rows give π′.

               Relative risk               Odds-ratio
π          0.01   0.05   0.10   0.20    0.01    0.05   0.10   0.20
π′ = 0.01     1    0.2    0.1   0.05    1.00    0.19   0.09   0.04
π′ = 0.05     5    1.0    0.5   0.25    5.21    1.00   0.47   0.21
π′ = 0.10    10    2.0    1.0   0.50   11.00    2.11   1.00   0.44
π′ = 0.20    20    4.0    2.0   1.00   24.75    4.75   2.25   1.00
π′ = 0.30    30    6.0    3.0   1.50   42.43    8.14   3.86   1.71
π′ = 0.40    40    8.0    4.0   2.00   66.00   12.67   6.00   2.67
π′ = 0.50    50   10.0    5.0   2.50   99.00   19.00   9.00   4.00
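The table can be reproduced in a few lines of base R. The names `p`, `p_new`, and `odds` are introduced here for illustration.

```r
p     <- c(0.01, 0.05, 0.10, 0.20)                    # baseline probabilities (columns)
p_new <- c(0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50)  # comparison probabilities (rows)

odds <- function(prob) prob / (1 - prob)
RR <- outer(p_new, p, function(a, b) a / b)
OR <- outer(p_new, p, function(a, b) odds(a) / odds(b))
dimnames(RR) <- dimnames(OR) <- list(p_new, p)

round(RR, 2)
round(OR, 2)
```

The two matrices agree closely in the upper-left corner, where both probabilities are small, and drift apart as either probability grows.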


Interpretation of parameters with clog-log-link

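$$\pi = 1 - \exp(-\exp(\beta^T x))$$

or

$$\eta = \beta^T x = \log(-\log(1-\pi))$$

If $\pi$ is small, then $-\log(1-\pi) \approx \pi$ (Taylor), which gives

$$\eta \approx \log(\pi) \Leftrightarrow \pi \approx \exp(\eta)$$

and thus

$$\mathrm{RR}_j = \frac{\pi'}{\pi} \approx \exp(\beta_j)$$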

Ex: Mortality by Wilms' tumor

444 dead, 3471 survivors

> glm(d~unfav+factor(stg),family=binomial(link=logit),data=nwts)$coef

(Intercept) unfav factor(stg)2 factor(stg)3 factor(stg)4

-3.2415851 1.9927784 0.6957588 1.0305140 1.7935930

> glm(d~unfav+factor(stg),family=binomial(link=cloglog),data=nwts)$coef

(Intercept) unfav factor(stg)2 factor(stg)3 factor(stg)4

-3.2240445 1.7404373 0.6591325 0.9664677 1.6147868


Interpretation of parameters with probit link

Sometimes we may have continuous responses, for instance normally distributed, $Y_{i0} \sim N(\beta^T x_i, \sigma^2)$, but still prefer to study

$$Y_i = \begin{cases} 1 & \text{if } Y_{i0} < \gamma = \text{threshold value} \\ 0 & \text{if not} \end{cases}$$

Ex: $Y_{i0}$ = birth weight

$$Y_i = \begin{cases} 1 & \text{if } Y_{i0} < 2800 \text{ gram} \\ 0 & \text{if not} \end{cases}$$

Ex: Psychometric measurements, $Y_{i0}$ = score on a depression scale

$$Y_i = \begin{cases} 1 & \text{if } Y_{i0} < \text{threshold value} \\ 0 & \text{if not} \end{cases}$$

Underlying scale

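$$Y_i = \begin{cases} 1 & \text{if } Y_{i0} < \gamma = \text{threshold value} \\ 0 & \text{if not} \end{cases}$$

[Figure: density of the underlying variable Y0. y-axis: density, from 0 to 0.4.]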


Probit, cont.

Why binary response?

• Tradition of doing table analysis

• The direct score $Y_{i0}$ may have a skewed distribution

• The direct score may not be registered; we only imagine that an underlying scale exists ("latent" variable)

The relation between

• $Y_{i0} \sim N(\beta^T x_i, \sigma^2)$

• $Y_i = I(Y_{i0} \le \gamma)$

is given by

$$\pi_i = P(Y_i = 1) = P(Y_{i0} \le \gamma) = \Phi\left(\frac{\gamma}{\sigma} - \left(\frac{\beta}{\sigma}\right)^T x_i\right)$$


Relationship between parameters on probit and underlying scale

$E[Y_{i0}] = \beta^T x_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$ is equivalent to the linear predictor on the probit scale

$$\Phi^{-1}(\pi_i) = \alpha_0 + \alpha_1 x_{i1} + \cdots + \alpha_p x_{ip}$$

where

• $\alpha_0 = \frac{\gamma - \beta_0}{\sigma}$

• $\alpha_j = -\frac{\beta_j}{\sigma}$ for $j = 1, \ldots, p$

Note: The standard deviation $\sigma$ on the underlying scale cannot be identified by the probit analysis
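The relationship can be illustrated by simulation. In this sketch all parameter values (`beta0`, `beta1`, `sigma`, `gamma`) are hypothetical; a latent normal response is dichotomized at the threshold and a probit model is fitted to the binary outcome.

```r
set.seed(1)
n <- 5000
beta0 <- 1; beta1 <- 2; sigma <- 1.5; gamma <- 1.5   # hypothetical values

x  <- rnorm(n)
y0 <- beta0 + beta1 * x + rnorm(n, sd = sigma)  # latent continuous response
y  <- as.numeric(y0 < gamma)                    # observed binary response

alpha_hat  <- coef(glm(y ~ x, family = binomial(link = "probit")))
alpha_true <- c((gamma - beta0) / sigma, -beta1 / sigma)
rbind(estimated = alpha_hat, implied = alpha_true)
```

Only the ratios $(\gamma - \beta_0)/\sigma$ and $\beta_j/\sigma$ are recovered; rescaling `beta0 - gamma`, `beta1`, and `sigma` by a common factor would give the same binary data distribution.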


Ex: Birth weight and gestational age

> summary(lm(vekt~svlengde+sex))

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -1447.24 784.26 -1.845 0.0791 .

svlengde 120.89 20.46 5.908 7.28e-06 ***

sex -163.04 72.81 -2.239 0.0361 *

---

Residual standard error: 177.1 on 21 degrees of freedom

Multiple R-Squared: 0.64, Adjusted R-squared: 0.6057

F-statistic: 18.67 on 2 and 21 DF, p-value: 2.194e-05

Here the estimate of $\sigma$ is 177.1 (vekt = birth weight, svlengde = gestational age).


Ex: Birth weight and gestational age cont.

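Define $Y_i = 1$ if birth weight is less than 2800 gram

> lavvekt<-1 * (vekt<2800)

> table(lavvekt)

0 1

17 7

> glm(lavvekt~svlengde+sex,family=binomial(link=probit))$coef

(Intercept) svlengde sex

24.1550285 -0.6801164 0.7522067

> -lm(vekt~svlengde+sex)$coef/177.1

(Intercept) svlengde sex

8.1718986 -0.6826331 0.9206059

The slopes of the probit fit are approximately those implied by the linear regression:

$$\alpha_j \approx -\frac{\beta_j}{\sigma}$$

For the intercept, $\alpha_0 = (\gamma - \beta_0)/\sigma$, so $\gamma/\sigma = 2800/177.1 \approx 15.81$ must be added: $8.17 + 15.81 \approx 23.98$, close to the probit estimate 24.16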


Goodness of fit tests for binomial data

• If $Y_i \sim \mathrm{Bin}(n_i, \pi_i)$ and (a) $n_i\pi_i > 5$ and (b) $n_i(1-\pi_i) > 5$ for $i = 1, \ldots, N$, we have approximately

Residual deviance: $D = 2(\tilde{l} - \hat{l}) \sim \chi^2_{N-p}$

Pearson chi-square: $X^2 = \sum_{i=1}^{N} \frac{(Y_i - n_i\hat\pi_i)^2}{n_i\hat\pi_i(1-\hat\pi_i)} \sim \chi^2_{N-p}$

• $\tilde{l}$ is the log-likelihood in the saturated model

• $\hat{l}$ is the log-likelihood for the fitted model with $p$ parameters, and

• $\hat\pi_i$ are the estimated probabilities

• If $D$ and $X^2$ are much larger than $N - p$, it indicates that the model fit is bad

• However, the $Y_i$ are often binary, and then conditions (a) and (b) are not fulfilled

Two strategies for goodness of fit tests with binary data

• With categorical explanatory variables: Aggregate to

binomial data

• Aggregation cannot be used if there are many categorical variables with many levels, or if there are continuous variables.

In that case the Hosmer-Lemeshow test can be used instead


Aggregation

• Count number of individuals within each combination of

the categorical variables

• Count the number of $Y_i = 1$ within each combination

• Fit a GLM on the aggregated data

• The model is OK if $D$ and $X^2$ are small compared to $\chi^2_{N-p}$, where $N$ is the number of combinations of the categorical variables

• Requires that the expected numbers of successes and failures in each group are $> 5$


Ex: Aggregation on Wilms' tumor data

> table(nwts$unfav)

0 1

3476 439

> table(nwts$stg)

1 2 3 4

1543 993 906 473

> nwts2 = aggregate(nwts$d,by=list(nwts$unfav,nwts$stg),FUN=table)

Group.1 Group.2 x.0 x.1

1 0 1 1371 59

2 1 1 93 20

3 0 2 809 65

4 1 2 77 42

5 0 3 697 72

6 1 3 72 65

7 0 4 329 74

8 1 4 23 47

> nwts2 = data.frame(unfav=nwts2$Group.1,stg=nwts2$Group.2,

n=nwts2$x[,1]+nwts2$x[,2],d=nwts2$x[,2])


Ex: Aggregation on Wilms' tumor data

> glmfit = glm(cbind(d,n-d)~as.factor(unfav)+as.factor(stg),data=nwts2,family=binomial)

> glmfit

(Intercept) unfavaggr factor(stgaggr)2 factor(stgaggr)3 factor(stgaggr)4

-3.2416 1.9928 0.6958 1.0305 1.7936




> X2<-sum(residuals(glmfit,type="pearson")^2)

> X2

[1] 3.259168


Ex: Aggregation on Wilms' tumor data cont.

• The model seems to be OK, since the residual deviance $D = 3.33$ and the Pearson chi-square $X^2 = 3.26$ are small compared to the residual degrees of freedom, $df = 3$

• Are the expected numbers of successes and failures $> 5$? We compute these:

> round(nwts2$n * glmfit$fit,2)

1 2 3 4 5 6 7 8

53.81 63.55 75.95 76.70 25.19 43.45 61.05 44.30

> round(nwts2$n * (1-glmfit$fit),2)

1 2 3 4 5 6 7 8

1376.19 810.45 693.05 326.30 87.81 75.55 75.95 25.70
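The whole aggregated analysis can be rerun from the eight rows printed in the `aggregate()` output, without access to the individual-level `nwts` data. The sketch below types those counts in by hand (the column name `alive` is introduced here) and turns $D$ and $X^2$ into p-values.

```r
# Aggregated counts as printed in the aggregate() output above
nwts2 <- data.frame(
  unfav = rep(c(0, 1), times = 4),
  stg   = rep(1:4, each = 2),
  alive = c(1371, 93, 809, 77, 697, 72, 329, 23),
  d     = c(59, 20, 65, 42, 72, 65, 74, 47)
)
nwts2$n <- nwts2$alive + nwts2$d

fit <- glm(cbind(d, n - d) ~ factor(unfav) + factor(stg),
           family = binomial, data = nwts2)

D  <- deviance(fit)
X2 <- sum(residuals(fit, type = "pearson")^2)
df <- df.residual(fit)
c(D = D, X2 = X2, df = df,
  p_deviance = pchisq(D, df, lower.tail = FALSE),
  p_pearson  = pchisq(X2, df, lower.tail = FALSE))

# Expected successes and failures; the chi-square approximation wants all above 5
expected <- c(nwts2$n * fitted(fit), nwts2$n * (1 - fitted(fit)))
range(expected)
```

Large p-values here mean that the data give no evidence against the model.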


Hosmer-Lemeshow test

• Fit the GLM

• Order the individuals by fitted probabilities $\hat\pi_{(1)} \le \hat\pi_{(2)} \le \cdots \le \hat\pi_{(n)}$

• Either divide them into $G$ groups according to the ordering, with equally many individuals in each group ("C statistic"),

• or divide the interval from $\hat\pi_{(1)}$ to $\hat\pi_{(n)}$ into $G$ intervals ("H statistic")

• Compute the average $\bar\pi_g$ of the $\hat\pi_{(i)}$ in group $g = 1, 2, \ldots, G$

• Compute the number of observations $n_g$ and successes $Y_g$ in group $g$

• Compute the Hosmer-Lemeshow statistic $X^2_{hl} = \sum_{g=1}^{G} \frac{(Y_g - n_g\bar\pi_g)^2}{n_g\bar\pi_g(1-\bar\pi_g)}$

• Under the null hypothesis (model is OK) we have approximately $X^2_{hl} \sim \chi^2_{G-2}$

Ex: Hosmer-Lemeshow test on Wilms' tumor data

> glmfit<-glm(d~unfav+factor(stg)+yr.regis+age,data=nwts,family=binomial)

> library(MKmisc)

> HLgof.test(glmfit$fit,nwts$d)

$C

Hosmer-Lemeshow C statistic

data: glmfit$fit and nwts$d

X-squared = 3.4823, df = 8, p-value = 0.9006

$H

Hosmer-Lemeshow H statistic

data: glmfit$fit and nwts$d

X-squared = 6.6996, df = 8, p-value = 0.5694


Ex: Hosmer-Lemeshow test on Wilms' tumor data cont.

> glmfit<-glm(d~unfav+factor(stg)+yr.regis+age,family=binomial)

> kuttoff<-sort(glmfit$fit)[c(round(length(d) * (1:10)/10))]

> gr<-rep(1,length(d))

> for (i in 1:9) gr<-gr+(glmfit$fit>kuttoff[i])

> table(gr)

1 2 3 4 5 6 7 8 9 10

392 392 391 392 392 390 391 392 392 391

> ngr<-as.numeric(table(gr))

> ngr

[1] 392 392 391 392 392 390 391 392 392 391

> dgr<-numeric(0)

> for (i in 1:10) dgr[i]<-sum(d[gr==i])

> dgr

[1] 10 14 16 26 20 28 36 48 79 167

> pigr<-numeric(0)

> for (i in 1:10) pigr[i]<-mean(glmfit$fit[gr==i])

> round(pigr,3)

[1] 0.024 0.032 0.040 0.049 0.061 0.076 0.095 0.128 0.202 0.427

> X2HL<-sum((dgr-ngr * pigr)^2/(ngr * pigr * (1-pigr)))

> X2HL

[1] 3.482061

> 1-pchisq(X2HL,8)

[1] 0.9005774

Sensitivity and specificity

• Classification:

• Predict an event ($Y_i = 1$) if $\hat\pi_i > \gamma$, where $\gamma$ is a threshold value

• Predict no event if $\hat\pi_i \le \gamma$

• Count number of correct classifications in the data set

• Sensitivity: Proportion of correct predictions when true

Yi = 1

• Specificity: Proportion of correct predictions when true

Yi = 0

• We want high values for both sensitivity and specificity

• For a given method, we can choose the threshold value $\gamma$ to give a good balance in a specific classification situation

ROC curves

• For evaluating and comparing models, we can vary the threshold $\gamma$ and plot a Receiver Operating Characteristic curve (ROC curve) with sensitivity on the y-axis and (1 − specificity) on the x-axis

• Can also compute the area under curve (AUC)

• AUC=1 if perfect classification

• AUC=0.5 if random classification
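Sensitivity, specificity, the ROC curve, and the AUC can all be computed in base R. The sketch below uses simulated data from a hypothetical logistic model; the helper names `sens` and `spec` are introduced here, and no ROC package is assumed.

```r
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))   # simulated binary outcome (hypothetical model)
p_hat <- fitted(glm(y ~ x, family = binomial))

# Sensitivity and specificity when predicting an event if p_hat > threshold
sens <- function(thr) mean(p_hat[y == 1] > thr)
spec <- function(thr) mean(p_hat[y == 0] <= thr)
c(sensitivity = sens(0.5), specificity = spec(0.5))

# ROC curve: sweep the threshold from above the largest to below the smallest fitted value
thr <- c(Inf, sort(unique(p_hat), decreasing = TRUE), -Inf)
tpr <- sapply(thr, sens)
fpr <- 1 - sapply(thr, spec)
plot(fpr, tpr, type = "l", xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)   # reference line for random classification

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

The AUC equals the proportion of (event, non-event) pairs in which the event has the higher fitted probability, which gives a useful cross-check.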


ROC for predicting bycatch of fish

• Shrimp fishery in Barents Sea: Predict if one can expect to

catch more than 0.8 juvenile cod per kg shrimps caught

• If yes, the fishing area is temporarily closed

[Figure: ROC curve, both axes from 0 to 1. y-axis: probability of correct prediction if observed > 0.8. Curves: model predictor and no predictability.]


Over dispersion in “binomial” data

• With independent, binary data there is never overdispersion ($\mathrm{Var}(Y_i) = \pi_i(1-\pi_i)$)

• If independent, binary data $Y_{ij}$ with the same $\pi_i$ are aggregated to $Y_i = \sum_{j=1}^{n_i} Y_{ij}$, then $Y_i \sim \mathrm{Bin}(n_i, \pi_i)$, $\mathrm{Var}(Y_i) = n_i\pi_i(1-\pi_i)$ and there is no overdispersion

• However, overdispersion occurs if the outcomes of the individual trials are positively correlated. Then $\mathrm{Var}(Y_i) > n_i\pi_i(1-\pi_i)$

• Possibility 1: Quasi-likelihood

• Possibility 2: Mixed model

• Possibility 3: Beta-binomial distribution


Over dispersion in “binomial” data - Quasi-likelihood

• Specify the mean structure by link function and linear predictor

• Specify the variance structure

• Possibility 1: $\mathrm{Var}(Y_i) = \phi\, n_i\pi_i(1-\pi_i)$

• Possibility 2: $\mathrm{Var}(Y_i) = (1 + \rho(n_i - 1))\, n_i\pi_i(1-\pi_i)$

• Fit the model. (Not sure if Possibility 2 is implemented in R)
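Possibility 1 corresponds to the `quasibinomial` family in R. The sketch below simulates overdispersed grouped data (the simulation setup is hypothetical) and shows that the dispersion estimate is the Pearson chi-square divided by the residual degrees of freedom.

```r
set.seed(3)
k <- 40
n <- rep(20, k)
x <- seq(-1, 1, length.out = k)
# Simulated overdispersed counts: the success probability varies randomly between groups
p <- plogis(0.5 * x + rnorm(k, sd = 0.8))
y <- rbinom(k, n, p)

fit_bin   <- glm(cbind(y, n - y) ~ x, family = binomial)
fit_quasi <- glm(cbind(y, n - y) ~ x, family = quasibinomial)

# Dispersion estimate: Pearson chi-square divided by residual degrees of freedom
phi_hat <- sum(residuals(fit_bin, type = "pearson")^2) / df.residual(fit_bin)
c(phi_hat = phi_hat, from_summary = summary(fit_quasi)$dispersion)

# Same coefficients, standard errors inflated by sqrt(phi_hat)
se_bin   <- coef(summary(fit_bin))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]
cbind(binomial = se_bin, quasi = se_quasi)
```

The point estimates are unchanged; only the uncertainty statements are widened.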


Random $\pi$ or beta-binomial response

• Mixed model: $\pi_i$ random with expectation $\pi^*_i$

• If $\pi_i$ is random and beta distributed (continuous between 0 and 1), $Y_i$ becomes beta-binomial

• Then $\mathrm{Var}(Y_i) = (1 + \rho(n_i - 1))\, n_i\pi^*_i(1-\pi^*_i)$

• Can be estimated in R by the betabin function from the aod library
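The variance formula can be checked by simulation in base R, without the aod package. The beta parameters `a` and `b` are hypothetical; for a Beta(a, b) distribution the implied correlation is $\rho = 1/(a + b + 1)$.

```r
set.seed(4)
a <- 2; b <- 6; n <- 10          # hypothetical beta parameters and number of trials
p_mean <- a / (a + b)            # expectation of pi_i
rho    <- 1 / (a + b + 1)        # intra-group correlation implied by the beta distribution

p_i <- rbeta(1e5, a, b)          # random success probability for each group
y   <- rbinom(1e5, n, p_i)       # beta-binomial counts

var_formula <- (1 + rho * (n - 1)) * n * p_mean * (1 - p_mean)
c(simulated = var(y),
  formula   = var_formula,
  binomial  = n * p_mean * (1 - p_mean))
```

With these values the beta-binomial variance is twice the binomial variance, even though the mean is the same.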

