Part II:
Linear methods for regression
and classification
(S. Haneuse; Biostat/Stat 572)
Reading
• HTF: Chapters 3 and 4
• RWC: Chapters 2.1-5, and 10.1-6
Linear methods
• From the previous example, we see that any given problem can be
addressed using a whole spectrum of methods
• Define the spectrum of methods in terms of
" structure
" complexity
" parametric assumptions
• How will they differ?
" numerically the answers will differ; the predictions themselves
" ability to approximate the true underlying mechanism; bias
" performance under repeated sampling; variability/efficiency
" generalizability; predictive performance in an independent test set
• Linear methods exist somewhere close to the end of the spectrum which
imposes a fair amount of structure
" many approaches we’ll discuss are direct generalizations of these
methods
• Methods for continuous outcome data are generally referred to as
regression
• Methods for categorical outcome data are generally referred to as
classification
Continuous outcomes: regression
Basic set-up
• Assume we have a vector of inputs, X = (X1, . . . , Xp), and would like to
predict Y
• The linear regression model has the form
Y = XT β + ε
" β = (β1, . . . , βp)
" E[ε] = 0
" V[ε] = σ2
• Typically, assume model has an intercept
Interpretation
• Consider the interpretation of the jth element of β
• Obtained by isolating the coefficient in the regression model
" let X−j denote the X-vector with the jth element removed
" compare two levels of Xj , holding X−j constant at, say, x−j
βj = E[Y | Xj = x + 1, X−j = x−j] − E[Y | Xj = x, X−j = x−j]
• Interpretation is, therefore,
Holding all other components of the model constant, βj is the
difference in conditional expectation of Y comparing two populations
whose Xj values differ by one unit
" comparison, or contrast, is independent of the baseline value of Xj
Estimation
• Observe a training data set consisting of n independent observations
(yi, xi), i = 1, . . . , n.
• Without any further assumptions, estimation typically proceeds by least
squares
" estimates of β obtained by minimization of the residual sum of squares
RSS(β) = (y − Xβ)T (y − Xβ)
" y is a p × 1 column matrix with entries yi
" X is an n × p matrix with row entries xi
• As long as XT X is non-singular, or equivalently X is of full rank, the
solution is given as
β = (XT X)−1XT y
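• As a quick numerical check of this closed form, here is a minimal sketch in R
using simulated data (the design matrix, coefficients, and sample size are
arbitrary choices)
## Minimal sketch: closed-form least squares on simulated data
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))        ## intercept plus two covariates
beta <- c(2, 1, -0.5)
y <- X %*% beta + rnorm(n)
betaHat <- solve(t(X) %*% X, t(X) %*% y) ## (X'X)^{-1} X'y
coef(lm(y ~ X - 1))                      ## agrees with lm() up to numerical error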
The ‘hat’ matrix
• The predicted/fitted values at the training sample points are given by
ŷ = Xβ̂ = X(X^T X)^{-1} X^T y = Hy
" H is called the ‘hat’ matrix since it converts y into ŷ
• Explicitly, the ith fitted value can easily be shown to equal
ŷi = Hi1 y1 + . . . + Hii yi + . . . + Hin yn
" Hij represents the ‘influence’ of the jth observation on ŷi
• More generally, H plays a very important role in regression diagnostics
" studentized residuals, leverage, Cook’s distance
Degrees of freedom
• Intuitively, ‘degrees of freedom’ refers to the number of estimated
parameters, p
• Consider the trace of the hat matrix
tr(H) = tr{X(XT X)−1XT }
= tr{XT X(XT X)−1} = tr(Ip) = p
• Hence, for the linear model the trace of the hat matrix coincides with the
number of parameters
• Calculation can be applied to any linear estimator
• Can use this as a general purpose definition for the degrees of freedom
" useful in more complex settings where the number of parameters may
not be not obvious
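• For example, a minimal sketch of the hat matrix and its trace on simulated
data (the design is arbitrary; with p = 3 columns the trace should equal 3)
## Sketch: the trace of the hat matrix equals the number of parameters
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))
y <- X %*% c(2, 1, -0.5) + rnorm(n)
H <- X %*% solve(t(X) %*% X) %*% t(X)    ## the 'hat' matrix
yHat <- H %*% y                          ## same as fitted(lm(y ~ X - 1))
sum(diag(H))                             ## tr(H) = p = 3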
• Remember the form of the k-nearest-neighbor prediction
ŷi = (1/k) Σ_{xj ∈ Nk(xi)} yj = Σ_{j=1}^{n} πj yj
" where πj = 1/k if xj ∈ Nk(xi), and 0 otherwise
• So the k-nearest-neighbor estimate is linear and can be written as
ŷ = Ly
" L is an n × n matrix
" each row consists of 0’s, with 1/k in the positions corresponding to the
k nearest neighbors
" tr(L) = n/k, the effective degrees of freedom
Gauss-Markov Theorem
• Suppose the random variables X and Y satisfy the following conditions:
Y = XT β + ε
E[ε] = 0
V[ε] = σ2
Then the least squares estimator of β has the smallest variance among all
linear unbiased estimators
" linear in the sense that the they are linear function of the outcome
vector, y
• Consider the prediction of y0 = x0^T β
• The least squares estimator of y0 is of the form
ŷ0 = x0^T β̂ = x0^T (X^T X)^{-1} X^T y
" this is a linear function of the outcome vector: c0^T y
• Assuming the linear model is correct, ŷ0 is unbiased since
E[ŷ0] = E[x0^T (X^T X)^{-1} X^T y] = x0^T (X^T X)^{-1} X^T Xβ = x0^T β
• The Gauss-Markov theorem states that for any other linear estimator
θ̂ = c^T y that is unbiased for x0^T β,
V[x0^T β̂] ≤ V[c^T y]
• Another way of looking at the theorem is in terms of MSE
MSE = Variance + Bias^2
• The Gauss-Markov theorem tells us that the least squares estimator attains
minimum MSE among those linear estimators with no bias
• We have already seen that MSE is intimately related to prediction
" under L2 loss for the linear model, MSE arises as a component of the
expected predictive error (EPE)
• If MSE is a criterion you are interested in, the restriction to the class of
unbiased linear estimators may not be wise
" in optimizing MSE, we will need to seek a balance between bias and
variance
" in particular, we may be willing to trade a little bias for gains in terms
of variance
Model selection and coefficient shrinkage
• In many prediction situations there are a large number of inputs, X
• While it may be the case that f(X) = XT β appropriately describes the
underlying mechanisms, it is always the case that we have a finite training
sample size, n
• In such settings, least squares estimation may not be satisfying for a
couple of reasons
• Prediction accuracy:
" least squares estimates may have low bias, but in ‘small’-sample
settings can exhibit large variability
" we could sacrifice a little bias to reduce variation and achieve better
overall predictive accuracy
• Interpretation:
" with a large number of predictors, it may be hard conceptualize
‘holding everything else constant’
" may be desirable to restrict attention to a smaller subset of variables
which exhibit the strongest effects
King County Birth data
• As an example, let’s consider the King county birth dataset, which
contains information on n=2,500 births from 2001
• The key outcome variable of interest is birth weight
" Birth weight ranges from 255g to 5,175g
" 5.1% of babies (127) were born at low birth weight (< 2,500g)
• A total of 16 potential predictor variables are available for investigation
[Figure 1: King county 2001 birth weight data; distribution of birth weight, grams]
[Figure 2: King county 2001 birth weight data continued; birth weight (grams) plotted against mother’s age (years), cigarettes smoked per day, alcoholic drinks per week, and highest grade completed]
• Complete variable listing
"gender" M = male, F = female baby
"plural" 1 = singleton, 2 = twin, 3 = triplet
"age" mother’s age in years
"race" race categories (for mother)
"parity" number of previous live born infants
"married" Y = yes, N = no
"bwt" birth weight in grams
"smokeN" number of cigarettes smoked per day during pregnancy
"drinkN" number of alcoholic drinks per week during pregnancy
"firstep" 1 = participant in program; 0 = did not participate
"welfare" 1 = participant in public assistance program; 0 = did not
"smoker" Y = yes, N = no, U = unknown
"drinker" Y = yes, N = no, U = unknown
"wpre" mother’s weight in pounds prior to pregnancy
"wgain" mother’s weight gain in pounds during pregnancy
"education" highest grade completed (add 12 + 1 / year of college)
"gestation" weeks from last menses to birth of child
• Rather than attempting to fit and report a model which includes all the
potential predictors, we can consider two strategies
" subset selection
" shrinkage methods
Subset selection
• Here we retain only a subset of variables
" the remaining variables essentially have their β coefficients set to zero
• Various strategies exist for ‘choosing’ the variables to keep (or throw out)
" best subset selection
" stepwise strategies
Best subsets regression
• Suppose X consists of p components; X1, . . . , Xp
• For each k ∈ {1, . . . , p}, find the subset of k variables which results in the
smallest residual sums of squares
" other criteria include Mallow’s Cp, R2 and adjusted R2
• Can quickly become computationally intensive when p gets large
• In R, code is implemented in the leaps package
##
library(leaps)
weight <- read.table("birthweight.txt")
names(weight) <- c("gender", "plural", "age", "race", "parity", "married",
"bwt", "smokeN", "drinkN", "firststep", "welfare",
"smoker", "drinker", "wpre", "wgain", "education",
"gestation")
## Model with only the intercept
##
fit0 <- lm(bwt ~ 1, data=weight)
## Perform best subsets analysis
##
## maxModel: a model which includes all the variables you wish to
## entertain
## nvmax: the maximum number of variables for the subset selection
## nbest: the number of best models to return for any given
## subset size, k
##
maxModel <- as.formula(bwt ~ gender + age + race + parity + married
+ smokeN + drinkN + firststep + welfare
+ smoker + drinker + wpre + wgain
+ education + gestation)
bestSub <- summary(regsubsets(maxModel, data=weight, nvmax=12, nbest=10))
##
names(bestSub)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
## ’results’ contains the subset size, k, and the residual sum of squares
##
results <- c(0, sum((weight$bwt- fitted(fit0))^2))
results <- rbind(results,
cbind(apply(bestSub$which, 1, sum)-1, bestSub$rss))
##
minRSS <- tapply(results[,2], results[,1], FUN=min)
[Figure: residual sum of squares versus subset size, k = 0, . . . , 12, for the best subsets analysis]
• The best-subset curve is necessarily decreasing, so it cannot be used as a
criterion for choosing k
• Typically choose a model which minimizes an estimate of the EPE
" Mallow’s Cp, AIC, BIC, cross-validation
Stepwise procedures
• Instead of performing an exhaustive enumeration for each value of k, we
can search for a ‘good path’
• Forward selection
" start with an ‘intercept-only’ model and build up the model
• Backward selection
" start with an ‘full’ model and reduce the model
Shrinkage methods
• Rather than retaining some variables and discarding the rest, an alternative is
to keep all the variables but impose restrictions on the size of the coefficients
" the point estimates for β are subject to bias
" the results often don’t suffer as much in terms of variability
• Ridge regression imposes an L2-type penalty
" the solution is given by
β̂_ridge = argmin_β RSS(β), subject to the constraint Σ_{j=1}^{p} βj^2 ≤ s
" the value of s influences how large the components of β can get
• An alternative way of writing the problem is
β̂_ridge = argmin_β { RSS(β) + λ Σ_{j=1}^{p} βj^2 }
• Here, λ ≥ 0 controls the amount of shrinkage
" when λ = 0, we are performing ordinary least squares estimation
" for large λ, minimizing the penalized RSS requires the components of
β to be small
" there is a one-to-one relationship between λ and s
• Minimization yields the solution
β̂_ridge = (X^T X + λI)^{-1} X^T y
• Could allow λ to be a vector, and ensure no shrinkage among certain
coefficients
• The solution is linear, and we can therefore obtain the effective degrees of
freedom as
df(λ) = tr{X(X^T X + λI)^{-1} X^T}
" depends on the complexity/smoothing parameter λ
• Ridge regression for the linear model is implemented in R
##
library(MASS)
##
ridgeFit <- lm.ridge(bwt ~ ., data=weight, lambda=c(0, 100, 1000, 10000))
## Output:
##
0 100 1000 10000
genderM 67.247 64.496 47.026 12.611
age 19.756 19.532 17.125 7.662
raceblack -16.062 -16.175 -14.413 -5.620
...
[Figure: ridge regression coefficient estimates for genderM, age, smokerY, and drinkerY as a function of the smoothing parameter, λ = 0 to 25,000]
• Relationship between λ and df(λ) is non-linear
• For these data, there is a dramatic reduction in ‘complexity’ of the
model up to about λ = 1,000
[Figure: smoothing parameter, λ, versus effective degrees of freedom, df(λ)]
[Figure: ridge regression coefficient estimates for genderM, age, smokerY, and drinkerY as a function of the effective degrees of freedom, df(λ)]
• Ridge regression can also be derived from a Bayesian perspective
• Suppose
Yi | Xi = xi ∼ Normal(xi^T β, σ^2)
and that we adopt independent Normal priors on the components of β
βj ∼ Normal(0, τ^2), j = 1, . . . , p
• The marginal posterior for β is given by
π(β | y, X, σ) ∝ Likelihood × Prior
= Π_{i=1}^{n} exp{ −(yi − xi^T β)^2 / 2σ^2 } × Π_{j=1}^{p} exp{ −βj^2 / 2τ^2 }
= exp{ −(1/2σ^2) Σ_{i=1}^{n} (yi − xi^T β)^2 − (1/2τ^2) Σ_{j=1}^{p} βj^2 }
• Taking the log and multiplying by −2σ^2, we find that the negative
log-posterior density for β is equivalent to the objective function in ridge
regression:
Σ_{i=1}^{n} (yi − xi^T β)^2 + λ Σ_{j=1}^{p} βj^2, with λ = σ^2/τ^2
• Hence β̂_ridge corresponds to the posterior mode in the above set-up
" also the mean, since we are assuming Normal distributions everywhere
• An alternative to ridge regression is the lasso, which imposes an L1-type
penalty and can be written as
β̂_lasso = argmin_β RSS(β), subject to the constraint Σ_{j=1}^{p} |βj| ≤ t
" the alternative constraint makes the solution nonlinear in y
" it requires a quadratic programming solution
• Several R libraries implement the lasso
" lasso2
Model complexity
• In each of the approaches we have seen, there is an additional parameter
which controls the ‘complexity’ of the model
" best subset selection: k
" ridge regression: s or λ
" the lasso: t
• The choice can have a dramatic impact on the results
• Question: How can we allow the specific value to be determined on the basis
of the data?
• The results presented here are based on 10-fold cross validation
" we’ll go over this in more detail down the road
" briefly explained here in terms of ridge regression
10-fold Cross-validation
• Initially, split the full data into two parts
" training data: n = 1,900
" test data: n = 600
• Further split the training data into 10 equal parts
" train(1)
"...
" train(10)
• For each l = 1, . . ., 10
i. fit the ridge regression model to the remaining nine-tenths of the training
data, yielding f̂_ridge^{(−l)}( · ; λ)
ii. evaluate the prediction error in the remaining one-tenth
" here we are using L2 loss
L(train(l); λ) = Σ_{i=1}^{190} { yi^{(l)} − f̂_ridge^{(−l)}(train(l)_i; λ) }^2
• Calculate the average prediction error across the 10 sets of results
CV(λ) = (1/10) Σ_{l=1}^{10} L(train(l); λ)
" an estimate of EPE based on the training data and for a given value of λ
• Repeat this for every value of λ
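• A minimal sketch of the whole procedure (the fold assignment, grid of λ
values, and use of the closed-form ridge solution rather than lm.ridge are
all illustrative simplifications)
## Sketch: 10-fold cross-validation for ridge regression on simulated data
set.seed(1)
n <- 1900; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
folds <- sample(rep(1:10, length.out = n))
lambdas <- c(0, 1, 10, 100, 1000)
cvError <- sapply(lambdas, function(lambda) {
  perFold <- sapply(1:10, function(l) {
    train <- folds != l
    b <- solve(t(X[train, ]) %*% X[train, ] + lambda * diag(p),
               t(X[train, ]) %*% y[train])
    sum((y[!train] - X[!train, ] %*% b)^2)   ## L2 loss on the held-out tenth
  })
  mean(perFold)                              ## CV(lambda)
})
lambdas[which.min(cvError)]                  ## value of lambda minimizing CV(lambda)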
• Plots of CV(k) and CV(df(λ)) for best subsets and ridge regression
[Figure: CV(k) versus subset size, k = 1, . . . , 16, for best subsets regression, and CV(df(λ)) versus effective degrees of freedom, df(λ) = 12, . . . , 19, for ridge regression]
• Best subsets regression:
" let k∗ denote the value of k which minimizes CV(K)
" obtain the best subset regression fixing the number of variables at k∗
• Ridge regression:
" let λ∗ denote the value of λ which minimizes CV(λ), or CV(df(λ))
" fit a ridge regression to the full training data, with λ∗
• The estimated test error in each case is the error obtained by applying the
resulting fits to the test data
" n = 900 data set aside earlier
• Test error results
Method             Complexity parameter   Training error   Test error
Least squares                             3610956          1521808
Best subset        k∗ = 11                4037468          1038386
Ridge regression   df(λ∗) = 13.44         3835416          1468410
• Note, in absolute terms, the test error values don’t mean much
" the scale depends on the choice of the loss function, as well as on the data
" relative differences are meaningful though
Categorical outcomes: classification
• Here we consider problems where the outcome of interest is categorical
• For example, in the King county birth weight data scientific focus may rest
on the prediction of whether or not a baby will have low birth weight
" typically defined as weight < 2,500g
" 127 babies out of 2,500 had low birth weight
• From a notational perspective, it is useful to define the outcome random
variable G
• Denote the set of values G can take on as the discrete set G
" set consisting of K classes: G1, . . . ,GK
• The general approach is to establish decision boundaries which provide the
means to classify observations into one of the K classes
" the boundaries split the X-space
• Boundaries are determined by a set of discriminant functions:
δk(x), k = 1, . . . , K
" classify the observation, x, as one of the elements of G on the basis of
which δk(x) is largest
• Here we are going to focus on decision boundaries which are linear
" δk(x) are linear in x
" or, some monotone transformation of δk(x) is linear in x
Decision theory
• In Part I we saw the 0-1 loss function for a categorical outcome variable
L(G, f(X)) = 1 if f(X) ≠ G, and 0 if f(X) = G
• For a categorical variable, the loss function can be represented by a
K × K matrix L
" the L(k, l) element is the cost of classifying an observation as class l when
the truth is class k
" L is zero on the diagonal and (typically) nonnegative everywhere else
" for a binary outcome, 0-1 loss gives
L = [ 0 1 ; 1 0 ]
• The expected prediction error is again given by
EPE = E_{X,G}[ L(G, Ĝ(X)) ]
= E_X [ Σ_{k=1}^{K} L(Gk, Ĝ(X)) Pr(Gk | X = x) ]
• Typically, we condition on X and minimize pointwise, so that the solution
is given by
Ĝ(x) = argmin_{g ∈ G} Σ_{k=1}^{K} L(Gk, g) Pr(Gk | X = x)
• Under 0-1 loss, this simplifies to
Ĝ(x) = argmin_{g ∈ G} [1 − Pr(g | X = x)]
" the solution is therefore to pick the element of G which gives the
largest Pr(Gk | X = x)
" i.e. pick the most probable outcome
Linear regression for categorical outcomes
• In the toy example, we explored the use of linear regression as an approach
for predicting a 0/1 outcome variable
• For a K-level categorical outcome, we can adapt the approach by defining
a set of K indicator variables
Yk = 1 if G = Gk, and 0 otherwise, for k = 1, . . . , K
• The training data then form an n × K outcome matrix, Y
" consisting of 1’s and 0’s
" each row contains a single 1
• Fitting a linear regression model to each column of Y yields
Ŷ = X(X^T X)^{-1} X^T Y
" note there is a coefficient vector for each Yk
• Each row of Ŷ consists of K fitted values
(f̂1(xi), . . . , f̂K(xi)), i = 1, . . . , n
• Assign a class by choosing the component with the largest value
Ĝ(xi) = argmax_{Gk ∈ G} f̂k(xi)
• This may be reasonable in many situations, especially since linear
regression attempts to estimate the conditional expectation and
E[Yk | X = x] = Pr(Gk | X = x)
" from the decision theory, linear regression’s goal is a sensible one
• As long as the model contains an intercept, results from linear regression
satisfy
Σ_{k=1}^{K} f̂k(xi) = 1
• However, some of the f̂k(xi) can be negative or greater than 1
" due to the rigid nature of linear regression
• Within the linear regression framework, we can get around this by
introducing transformations of the X-space
" quadratic terms and two-way interactions
" more general polynomial terms
" more general basis functions
• With finite n, the gains in terms of bias generally result in a cost in terms
of variance
Linear discriminant analysis
• Decision theory told us that, under 0-1 loss, the optimal classification
scheme involves the use of the class probabilities
Pr(Gk| X = x)
• Given these quantities, the decision boundary between classes Gk and Gl
can be expressed in terms of values in the X-space for which the class
probabilities are equal
{ x ∈ X : Pr(G = Gk | X = x) = Pr(G = Gl | X = x) }
or, equivalently,
{ x ∈ X : Pr(G = Gk | X = x) / Pr(G = Gl | X = x) = 1 }
• Let fk(x) denote the class-conditional density of the p-vector X in class
G = Gk
• Further, let πk denote the prior probability of class k, with Σ_k πk = 1
• Bayes theorem then gives us
Pr(G = Gk | X = x) = fk(x) πk / Σ_{l=1}^{K} fl(x) πl
• In terms of ‘choosing’ between classes, the denominator is constant and we
can see that the fk(x) provide almost as much information as the
Pr(G = Gk | X = x)
Linear Discriminant Analysis
• Suppose we assume the class densities to be multivariate Normal, with a
common variance-covariance matrix, Σ:
fk(x) = (2π)^{-p/2} |Σ|^{-1/2} exp{ −(1/2)(x − µk)^T Σ^{-1} (x − µk) }
• Given this form, and taking logs, the decision boundary between classes Gk
and Gl reduces to
log{ Pr(G = Gk | X = x) / Pr(G = Gl | X = x) }
= x^T Σ^{-1}(µk − µl) − (1/2)(µk + µl)^T Σ^{-1}(µk − µl) + log{πk / πl}
• Consequently, the log-odds function is linear in x
" holds true for any (k, l) pair
" decision boundary solutions are therefore linear
• Working somewhat backwards, we see that the linear discriminant
functions
δk(x) = x^T Σ^{-1} µk − (1/2) µk^T Σ^{-1} µk + log{πk}
are sufficient to determine the decision rule, with
Ĝ(x) = argmax_{Gk ∈ G} δk(x)
Estimation
• In practice, we need to estimate the parameters of the multivariate Normal
distributions using the training data
π̂k = nk / n
µ̂k = (1/nk) Σ_{gi = Gk} xi
Σ̂ = (1/(n − K)) Σ_{k=1}^{K} Σ_{gi = Gk} (xi − µ̂k)(xi − µ̂k)^T
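• A minimal sketch of these plug-in estimates on simulated two-class data
(class means and sample size are arbitrary; compare with MASS::lda)
## Sketch: plug-in parameter estimates for LDA
set.seed(1)
n <- 200; K <- 2
g <- sample(1:K, n, replace = TRUE)
mu <- list(c(0, 0), c(2, 1))
x <- t(sapply(g, function(k) mu[[k]] + rnorm(2)))
piHat <- table(g) / n                                    ## class proportions
muHat <- lapply(1:K, function(k) colMeans(x[g == k, ]))  ## class means
SigmaHat <- Reduce(`+`, lapply(1:K, function(k) {
  xc <- sweep(x[g == k, ], 2, muHat[[k]])                ## center within class
  t(xc) %*% xc
})) / (n - K)                                            ## pooled covariance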
Quadratic Discriminant Analysis
• We can relax the assumption of a common variance-covariance matrix
across the K classes
• When forming the log-odds
log{ Pr(G = Gk | X = x) / Pr(G = Gl | X = x) }
the convenient cancellations no longer occur
• This results in a series of quadratic discriminant functions
δk(x) = −(1/2) log|Σk| − (1/2)(x − µk)^T Σk^{-1} (x − µk) + log{πk}
" these are no longer linear in x
• The decision boundary between classes Gk and Gl is now defined by a
quadratic equation
{x ∈ X : δk(x) = δl(x)}
Implementation
• In R, both LDA and QDA are implemented in the MASS library
" lda()
" qda()
Logistic regression
• In LDA, the linearity of the decision boundaries arises from the use of
multivariate normal distributions (and the common variance-covariance
assumption)
" Bayes theorem ensures that the estimates of Pr(G = Gk| X = x) add
up to 1.0
" no guarantee they remain in [0,1]
• Logistic regression attempts to model the decision boundaries directly as
linear functions of X
" ensure the Pr(G = Gk| X = x) add up to 1.0
" ensure they remain in [0,1]
• For K classes, the model has the form
log{ Pr(G = G1 | X = x) / Pr(G = GK | X = x) } = x^T β1
log{ Pr(G = G2 | X = x) / Pr(G = GK | X = x) } = x^T β2
...
log{ Pr(G = G_{K−1} | X = x) / Pr(G = GK | X = x) } = x^T β_{K−1}
" here, class K is the referent group
" the K − 1 sets of parameters reflect the constraint that the probabilities
add up to 1.0
• A simple calculation reveals that
Pr(G = Gk | X = x) = exp{x^T βk} / (1 + Σ_{l=1}^{K−1} exp{x^T βl}), k = 1, . . . , K − 1
Pr(G = GK | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp{x^T βl})
" Pr(G = Gk | X = x) ∈ [0,1] for each k
" sum adds up to 1.0
Estimation
• Given the Pr(G = Gk| X = x), the multinomial distribution arises as a
natural distribution on which to base the likelihood
• Maximum likelihood estimation proceeds via iteratively re-weighted least
squares
" Fisher’s method of scoring
" Newton-Raphson with the Hessian replaced by its expected value
• In the binary case, the details of the algorithm simplify
• If we re-code g = 1/2 to y = 0/1, and let
µ ≡ µ(x; β) = E[Y | X = x, β] = exp{x^T β} / (1 + exp{x^T β})
• For a given starting value of β, at each step of the algorithm, update the
current value of β via
β_new ← β_old + (X^T W X)^{-1} X^T (y − µ)
= (X^T W X)^{-1} X^T W (Xβ_old + W^{-1}(y − µ))
= (X^T W X)^{-1} X^T W z
" z is referred to as the adjusted response
" the weight matrix W is diagonal, with entries
Wii = µ(xi; β_old)(1 − µ(xi; β_old)) = exp{xi^T β_old} / [1 + exp{xi^T β_old}]^2
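• A minimal sketch of this update loop for binary logistic regression (simulated
data; the starting value, tolerance, and iteration cap are arbitrary, and the
result is compared against glm with a binomial family)
## Sketch: iteratively re-weighted least squares for binary logistic regression
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1)))
beta <- rep(0, ncol(X))                       ## starting value
for (iter in 1:25) {
  mu <- plogis(X %*% beta)
  W  <- diag(as.vector(mu * (1 - mu)))        ## diagonal weight matrix
  z  <- X %*% beta + solve(W) %*% (y - mu)    ## adjusted response
  betaNew <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)
  if (max(abs(betaNew - beta)) < 1e-8) break
  beta <- betaNew
}
cbind(irls = as.vector(beta), glm = coef(glm(y ~ X - 1, family = binomial)))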
The hat matrix and degrees of freedom
• In logistic regression, and GLMs in general, the fitted values ŷ are nonlinear
in y
• Hence there is no matrix H such that ŷ = Hy
• We can define a hat matrix by linearization, using the following analogy
with linear regression
• For the linear model, note that
ŷ − E[y] = H{y − E[y]}
" since HE[y] = X(X^T X)^{-1} X^T Xβ = Xβ
• For a generalized linear model, we can define the hat matrix H(β) to be
the matrix such that
µ(X; β̂) − µ(X; β) ≈ H(β){y − E[y]}
• A definition which satisfies this is to set
H(β) = WX(X^T WX)^{-1}X^T
" this uses a Taylor approximation of the update step in the IWLS algorithm
" remember, the W matrix depends on β
• It is relatively straightforward to show that
tr(H(β)) = p
so that, again, using the trace of the newly defined hat matrix serves as a
reasonable definition of the degrees of freedom
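• A minimal sketch of this calculation for a binary logistic model (simulated
data; W is evaluated at the fitted coefficients)
## Sketch: hat matrix and effective degrees of freedom for a logistic GLM
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1)))
fit <- glm(y ~ X - 1, family = binomial)
muHat <- fitted(fit)
W <- diag(muHat * (1 - muHat))
H <- W %*% X %*% solve(t(X) %*% W %*% X) %*% t(X)
sum(diag(H))                                  ## equals p = 3, the number of parameters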
Implementation
• glm() for fitting binary logistic regression models
• multinom() in the nnet library for fitting multinomial logistic regression
models
King county birth data
• Define a new outcome of ‘low birth weight’
" binary 0/1 variable
• Compare the overall results obtained by LDA and logistic regression
• Consider using all 16 predictors
" could apply model selection and shrinkage methods here
• To evaluate the expected prediction error, we can use cross-validation with
K = N
" leave-one-out cross-validation
" full code is available online
##
library(MASS)
## Cycle through each observation
##
results <- matrix(NA, nrow=2500, ncol=2)
for(i in 1:2500)
{
## Fit a linear discriminant analysis
##
fitLDA <- lda(low ~ ., data=weight[-i, -c(2,7)])
Yhat <- predict(fitLDA, newdata=weight[i, -c(2,7)])$class
results[i,1] <- (Yhat != weight[i, -c(2,7)]$low)
## fit a logistic regression model
##
fitGLM <- glm(low ~ ., data=weight[-i, -c(2,7)], family=binomial)
Yhat <- predict(fitGLM,
newdata=weight[i, -c(2,7)], type="response") > 0.5
results[i,2] <- (Yhat != weight[i, -c(2,7)]$low)
}
##
apply(results, 2, mean)
• Results based on 0-1 loss
Method                Estimated prediction error
LDA                   4.28%
Logistic regression   4.24%
• LDA and logistic regression don’t misclassify the same observations, though
                               Logistic regression
                               Correct    Misclassified
LDA    Correct                 2,380      13
       Misclassified           14         93
Comparison with LDA
• In the development of LDA, we saw the decision boundary between classes
k and K to be of the form
log{ Pr(G = Gk | X = x) / Pr(G = GK | X = x) }
= x^T Σ^{-1}(µk − µK) − (1/2)(µk + µK)^T Σ^{-1}(µk − µK) + log{πk / πK}
= α0k + x^T αk
• Hence, the log-odds have the same form as that of logistic regression
" the two approaches attempt to estimate the same quantities
" the major difference is in how they estimate these quantities
• In LDA, by virtue of making the assumption of multivariate Normality for
X within each class, we are estimating the full joint distribution of (G, X).
• Make assumptions about the conditional distribution X | G and the
marginal distribution G
X | G = Gk ∼ MVNp(µk,Σ), k = 1, . . . , K
P (G = Gk) = πk
" use Bayes theorem to inform us on G|X
" there is a resulting induced marginal distribution for X which depends
on the log-odds parameters and hence plays a role in estimation
• If these assumptions are true, then LDA will be the best we can do
" LDA estimates will be ‘maximum likelihood’
• In logistic regression, we only focus on the conditional distribution G|X
" maximize a conditional likelihood based on the multinomial distribution
• No additional assumptions on the marginal distribution of X
" actually ignored in the estimation of the model
" think of the marginal density as being estimated in a fully
nonparametric way
• In the event that the underlying densities are multivariate Normal, there
will be a loss of efficiency when using logistic regression
" typically not large, and at most 30%
" reasonable trade-off to possible misspecification
Separating hyperplanes
• LDA and logistic regression both estimate linear decision boundaries by
adopting structure on the log-odds scale
" exploit all the data via some global structure
• An alternative class of methods attempts to directly separate observations
into the various possible classes
" focus on the points themselves
" viewed as being more local in nature
• Consider the data in the figure below
• Included is the least squares solution to the problem
" Y = -1/1
" the boundary line is given by { x ∈ R^2 : x^T β̂ = 0 }
• We see two points are misclassified
• Surprising, since it’s not hard to see that one could draw a line separating
the two groups
[Figure: scatterplot of the two classes in (X1, X2), with the least squares decision boundary; two points are misclassified]
• We refer to such data as being separable
" there exists at least one separating hyperplane
" in this example there actually are infinitely many such planes
" two are shown here with dashed lines
[Figure: the same data, with two of the infinitely many separating hyperplanes shown as dashed lines]
Perceptron learning algorithm
• The perceptron algorithm finds a separating hyperplane by minimizing the
distance between the decision boundary and misclassified points
• For Y = -1/1, and for a given β, a response is misclassified if either
yi = −1 and xi^T β > 0, or yi = 1 and xi^T β < 0
• Let M denote the set of misclassified observations
• The goal is to minimize
D(β) = − Σ_{i ∈ M} yi (xi^T β)
• Taking derivatives, the update simply cycles through each element of M
via
β_new ← β_old + ρ yi xi
" ρ is a tuning parameter
" the algorithm uses stochastic gradient descent, visiting each
observation in sequence
" rather than computing the sum of the derivatives
• The form of the algorithm suggests that there is a greater local focus on
points ‘close’ to the decision boundary
" algorithm updates on the basis of elements of M
" may be more robust to model misspecification than, say, LDA or
logistic regression which impose more global structure
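• A minimal sketch of the algorithm on separable simulated data (the learning
rate ρ, initialization, and maximum number of passes are arbitrary choices)
## Sketch: perceptron learning on separable two-class data, y coded -1/1
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 2), n, 2)
y <- ifelse(x[, 1] + x[, 2] > 0, 1, -1)      ## separable by construction
X <- cbind(1, x)                             ## include an intercept column
beta <- rep(0, 3); rho <- 0.1
for (pass in 1:100) {
  anyMiss <- FALSE
  for (i in 1:n) {
    if (y[i] * sum(X[i, ] * beta) <= 0) {    ## visit each observation in sequence
      beta <- beta + rho * y[i] * X[i, ]     ## update using the misclassified point
      anyMiss <- TRUE
    }
  }
  if (!anyMiss) break                        ## stop once nothing is misclassified
}
beta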
• The algorithm suffers from a series of drawbacks however
" when the data are separable, there are many solutions and which one
you get depends on the starting values
" it can sometimes take a long time to converge
" when the data are not separable there is no solution, so that the
algorithm will not converge and can cycle
• One solution to the first problem is to incorporate an additional constraint
into the optimization problem
" choose the boundary which maximizes the margin between the two
classes in the training data
• In most situations, however, it is unlikely to be the case that the classes
are perfectly separable
" support vector machines provide a solution to this problem
" we’ll see these later