Part II:
Linear methods for regression
and classification
(S. Haneuse; Biostat/Stat 572)
Reading
• HTF: Chapters 3 and 4
• RWC: Chapters 2.1-5, and 10.1-6
Linear methods
• From the previous example, we see that any given problem can be
addressed using a whole spectrum of methods
• Define the spectrum of methods in terms of
" structure
" complexity
" parametric assumptions
• How will they differ?
" numerically the answers will differ; the predictions themselves
" ability to approximate the true underlying mechanism; bias
" performance under repeated sampling; variability/efficiency
" generalizability; predictive performance in an independent test set
• Linear methods exist somewhere close to the end of the spectrum which
imposes a fair amount of structure
" many approaches we’ll discuss are direct generalizations of these
methods
• Methods for continuous outcome data are generally referred to as
regression
• Methods for categorical outcome data are generally referred to as
classification
Continuous outcomes: regression
Basic set-up
• Assume we have a vector of inputs, X = (X1, . . . , Xp), and would like to
predict Y
• The linear regression model has the form
Y = XT β + ε
" β = (β1, . . . , βp)
" E[ε] = 0
" V[ε] = σ2
• Typically, assume model has an intercept
Interpretation
• Consider the interpretation of the jth element of β
• Obtained by isolating the coefficient in the regression model
" let X−j denote the X-vector with the jth element removed
" compare two levels of Xj , holding X−j constant at, say, x−j
βj = E[Y | Xj = x + 1, X−j = x−j] − E[Y | Xj = x, X−j = x−j]
• Interpretation is, therefore,
Holding all other components of the model constant, βj is the
difference in conditional expectation of Y comparing two populations
whose Xj values differ by one unit
" comparison, or contrast, is independent of the baseline value of Xj
Estimation
• Observe a training data set consisting of n independent observations
(yi, xi), i = 1, . . . , n.
• Without any further assumptions, estimation typically proceeds by least
squares
" estimates of β obtained by minimization of the residual sum of squares
RSS(β) = (y − Xβ)T (y − Xβ)
" y is a p × 1 column matrix with entries yi
" X is an n × p matrix with row entries xi
• As long as XT X is non-singular, or equivalently X is of full rank, the
solution is given as
β = (XT X)−1XT y
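• As a quick numerical check of this closed form, here is a minimal sketch in R
using simulated data (the design matrix, coefficients, and sample size are
arbitrary choices)
## Minimal sketch: closed-form least squares on simulated data
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))        ## intercept plus two covariates
beta <- c(2, 1, -0.5)
y <- X %*% beta + rnorm(n)
betaHat <- solve(t(X) %*% X, t(X) %*% y) ## (X'X)^{-1} X'y
coef(lm(y ~ X - 1))                      ## agrees with lm() up to numerical error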
The ‘hat’ matrix
• The predicted/fitted values at the training sample points are given by
ŷ = Xβ̂ = X(X^T X)^{-1} X^T y = Hy
" H is called the ‘hat’ matrix since it converts y into ŷ
• Explicitly, the ith fitted value can easily be shown to equal
ŷi = Hi1 y1 + . . . + Hii yi + . . . + Hin yn
" Hij represents the ‘influence’ of the jth observation on ŷi
• More generally, H plays a very important role in regression diagnostics
" studentized residuals, leverage, Cook’s distance
Degrees of freedom
• Intuitively, ‘degrees of freedom’ refers to the number of estimated
parameters, p
• Consider the trace of the hat matrix
tr(H) = tr{X(XT X)−1XT }
= tr{XT X(XT X)−1} = tr(Ip) = p
• Hence, for the linear model the trace of the hat matrix coincides with the
number of parameters
• Calculation can be applied to any linear estimator
• Can use this as a general purpose definition for the degrees of freedom
" useful in more complex settings where the number of parameters may
not be not obvious
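• For example, a minimal sketch of the hat matrix and its trace on simulated
data (the design is arbitrary; with p = 3 columns the trace should equal 3)
## Sketch: the trace of the hat matrix equals the number of parameters
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))
y <- X %*% c(2, 1, -0.5) + rnorm(n)
H <- X %*% solve(t(X) %*% X) %*% t(X)    ## the 'hat' matrix
yHat <- H %*% y                          ## same as fitted(lm(y ~ X - 1))
sum(diag(H))                             ## tr(H) = p = 3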
• Remember the form of the k-nearest-neighbor prediction
ŷi = (1/k) Σ_{xj ∈ Nk(xi)} yj = Σ_{j=1}^{n} πj yj
" where πj = 1/k if xj ∈ Nk(xi), and 0 otherwise
• So the k-nearest-neighbor estimate is linear and can be written as
ŷ = Ly
" L is an n × n matrix
" each row consists of 0’s, with 1/k in the positions corresponding to the
k nearest neighbors
" tr(L) = n/k, the effective degrees of freedom
Gauss-Markov Theorem
• Suppose the random variables X and Y satisfy the following conditions:
Y = XT β + ε
E[ε] = 0
V[ε] = σ2
Then the least squares estimator of β has the smallest variance among all
linear unbiased estimators
" linear in the sense that the they are linear function of the outcome
vector, y
• Consider the prediction of y0 = x0^T β
• The least squares estimator of y0 is of the form
ŷ0 = x0^T β̂ = x0^T (X^T X)^{-1} X^T y
" this is a linear function of the outcome vector: c0^T y
• Assuming the linear model is correct, ŷ0 is unbiased since
E[ŷ0] = E[x0^T (X^T X)^{-1} X^T y] = x0^T (X^T X)^{-1} X^T Xβ = x0^T β
• The Gauss-Markov theorem states that for any other linear estimator
θ̂ = c^T y that is unbiased for x0^T β,
V[x0^T β̂] ≤ V[c^T y]
• Another way of looking at the theorem is in terms of MSE
MSE = Variance + Bias^2
• The Gauss-Markov theorem tells us that the least squares estimator attains
minimum MSE among those linear estimators with no bias
• We have already seen that MSE is intimately related to prediction
" under L2 loss for the linear model, MSE arises as a component of the
expected predictive error (EPE)
• If MSE is a criterion you are interested in, the restriction to the class of
unbiased linear estimators may not be wise
" in optimizing MSE, we will need to seek a balance between bias and
variance
" in particular, we may be willing to trade a little bias for gains in terms
of variance
Model selection and coefficient shrinkage
• In many prediction situations there are a large number of inputs, X
• While it may be the case that f(X) = XT β appropriately describes the
underlying mechanisms, it is always the case that we have a finite training
sample size, n
• In such settings, least squares estimation may not be satisfying for a
couple of reasons
• Prediction accuracy:
" least squares estimates may have low bias, but in ‘small’-sample
settings can exhibit large variability
" we could sacrifice a little bias to reduce variation and achieve better
overall predictive accuracy
• Interpretation:
" with a large number of predictors, it may be hard conceptualize
‘holding everything else constant’
" may be desirable to restrict attention to a smaller subset of variables
which exhibit the strongest effects
King County Birth data
• As an example, let’s consider the King county birth dataset, which
contains information on n=2,500 births from 2001
• The key outcome variable of interest is birth weight
" Birth weight ranges from 255g to 5,175g
" 5.1% of babies (127) were born at low birth weight (< 2,500g)
• A total of 16 potential predictor variables are available for investigation
[Figure 1: King county 2001 birth weight data; distribution of birth weight, grams]
[Figure 2: King county 2001 birth weight data continued; birth weight (grams) plotted against mother’s age (years), cigarettes smoked per day, alcoholic drinks per week, and highest grade completed]
• Complete variable listing
"gender" M = male, F = female baby
"plural" 1 = singleton, 2 = twin, 3 = triplet
"age" mother’s age in years
"race" race categories (for mother)
"parity" number of previous live born infants
"married" Y = yes, N = no
"bwt" birth weight in grams
"smokeN" number of cigarettes smoked per day during pregnancy
"drinkN" number of alcoholic drinks per week during pregnancy
"firstep" 1 = participant in program; 0 = did not participate
"welfare" 1 = participant in public assistance program; 0 = did not
"smoker" Y = yes, N = no, U = unknown
"drinker" Y = yes, N = no, U = unknown
"wpre" mother’s weight in pounds prior to pregnancy
"wgain" mother’s weight gain in pounds during pregnancy
"education" highest grade completed (add 12 + 1 / year of college)
"gestation" weeks from last menses to birth of child
• Rather than attempting to fit and report a model which includes all the
potential predictors, we can consider two strategies
" subset selection
" shrinkage methods
Subset selection
• Here we retain only a subset of variables
" the remaining variables essentially have their β coefficients set to zero
• Various strategies exist for ‘choosing’ the variables to keep (or throw out)
" best subset selection
" stepwise strategies
Best subsets regression
• Suppose X consists of p components; X1, . . . , Xp
• For each k ∈ {1, . . . , p}, find the subset of k variables which results in the
smallest residual sums of squares
" other criteria include Mallow’s Cp, R2 and adjusted R2
• Can quickly become computationally intensive when p gets large
• In R, code is implemented in the leaps package
##
library(leaps)
weight <- read.table("birthweight.txt")
names(weight) <- c("gender", "plural", "age", "race", "parity", "married",
"bwt", "smokeN", "drinkN", "firststep", "welfare",
"smoker", "drinker", "wpre", "wgain", "education",
"gestation")
## Model with only the intercept
##
fit0 <- lm(bwt ~ 1, data=weight)
## Perform best subsets analysis
##
## maxModel: a model which includes all the variables you wish to
## entertain
## nvmax: the maximum number of variables for the subset selection
## nbest: the number of best models to return for any given
## subset size, k
##
maxModel <- as.formula(bwt ~ gender + age + race + parity + married
+ smokeN + drinkN + firststep + welfare
+ smoker + drinker + wpre + wgain
+ education + gestation)
bestSub <- summary(regsubsets(maxModel, data=weight, nvmax=12, nbest=10))
##
names(bestSub)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
## ’results’ contains the subset size, k, and the residual sum of squares
##
results <- c(0, sum((weight$bwt- fitted(fit0))^2))
results <- rbind(results,
cbind(apply(bestSub$which, 1, sum)-1, bestSub$rss))
##
minRSS <- tapply(results[,2], results[,1], FUN=min)
[Figure: residual sum of squares versus subset size, k = 0, . . . , 12, for the best subsets analysis]
• The best-subset curve is necessarily decreasing, so it cannot be used as a
criterion for choosing k
• Typically choose a model which minimizes an estimate of the EPE
" Mallow’s Cp, AIC, BIC, cross-validation
Stepwise procedures
• Instead of performing an exhaustive enumeration for each value of k, we
can search for a ‘good path’
• Forward selection
" start with an ‘intercept-only’ model and build up the model
• Backward selection
" start with an ‘full’ model and reduce the model
Shrinkage methods
• Rather than retaining some variables and discarding the rest, an alternative is
to keep all the variables but impose restrictions on the size of the coefficients
" the point estimates for β are subject to bias
" the results often don’t suffer as much in terms of variability
• Ridge regression imposes an L2-type penalty
" the solution is given by
β̂_ridge = argmin_β RSS(β), subject to the constraint Σ_{j=1}^{p} βj^2 ≤ s
" the value of s influences how large the components of β can get
• An alternative way of writing the problem is
β̂_ridge = argmin_β { RSS(β) + λ Σ_{j=1}^{p} βj^2 }
• Here, λ ≥ 0 controls the amount of shrinkage
" when λ = 0, we are performing ordinary least squares estimation
" for large λ, minimizing the penalized RSS requires the components of
β to be small
" there is a one-to-one relationship between λ and s
• Minimization yields the solution
β̂_ridge = (X^T X + λI)^{-1} X^T y
• Could allow λ to be a vector, and ensure no shrinkage among certain
coefficients
• The solution is linear, and we can therefore obtain the effective degrees of
freedom as
df(λ) = tr{X(X^T X + λI)^{-1} X^T}
" depends on the complexity/smoothing parameter λ
• Ridge regression for the linear model is implemented in R
##
library(MASS)
##
ridgeFit <- lm.ridge(bwt ~ ., data=weight, lambda=c(0, 100, 1000, 10000))
## Output:
##
0 100 1000 10000
genderM 67.247 64.496 47.026 12.611
age 19.756 19.532 17.125 7.662
raceblack -16.062 -16.175 -14.413 -5.620
...
[Figure: ridge regression coefficient estimates for genderM, age, smokerY, and drinkerY as a function of the smoothing parameter, λ = 0 to 25,000]
• Relationship between λ and df(λ) is non-linear
• For these data, there is a dramatic reduction in ‘complexity’ of the
model up to about λ = 1,000
[Figure: smoothing parameter, λ, versus effective degrees of freedom, df(λ)]
[Figure: ridge regression coefficient estimates for genderM, age, smokerY, and drinkerY as a function of the effective degrees of freedom, df(λ)]
• Ridge regression can also be derived from a Bayesian perspective
• Suppose
Yi | Xi = xi ∼ Normal(xi^T β, σ^2)
and that we adopt independent Normal priors on the components of β
βj ∼ Normal(0, τ^2), j = 1, . . . , p
• The marginal posterior for β is given by
π(β | y, X, σ) ∝ Likelihood × Prior
= Π_{i=1}^{n} exp{ −(yi − xi^T β)^2 / 2σ^2 } × Π_{j=1}^{p} exp{ −βj^2 / 2τ^2 }
= exp{ −(1/2σ^2) Σ_{i=1}^{n} (yi − xi^T β)^2 − (1/2τ^2) Σ_{j=1}^{p} βj^2 }
• Taking the log and multiplying by −2σ^2, we find that the negative
log-posterior density for β is equivalent to the objective function in ridge
regression:
Σ_{i=1}^{n} (yi − xi^T β)^2 + λ Σ_{j=1}^{p} βj^2, with λ = σ^2/τ^2
• Hence β̂_ridge corresponds to the posterior mode in the above set-up
" also the mean, since we are assuming Normal distributions everywhere
• An alternative to ridge regression is the lasso, which imposes an L1-type
penalty and can be written as
β̂_lasso = argmin_β RSS(β), subject to the constraint Σ_{j=1}^{p} |βj| ≤ t
" the alternative constraint makes the solution nonlinear in y
" it requires a quadratic programming solution
• Several R libraries implement the lasso
" lasso2
Model complexity
• In each of the approaches we have seen, there is an additional parameter
which controls the ‘complexity’ of the model
" best subset selection: k
" ridge regression: s or λ
" the lasso: t
• The choice can have a dramatic impact on the results
• Question: How can we allow the specific value to be determined on the basis
of the data?
• The results presented here are based on 10-fold cross validation
" we’ll go over this in more detail down the road
" briefly explained here in terms of ridge regression
10-fold Cross-validation
• Initially, split the full data into two parts
" training data: n = 1,900
" test data: n = 600
• Further split the training data into 10 equal parts
" train(1)
"...
" train(10)
• For each l = 1, . . ., 10
i. fit the ridge regression model to the remaining nine-tenths of the training
data, yielding f̂_ridge^{(−l)}( · ; λ)
ii. evaluate the prediction error in the remaining one-tenth
" here we are using L2 loss
L(train(l); λ) = Σ_{i=1}^{190} { yi^{(l)} − f̂_ridge^{(−l)}(train(l)_i; λ) }^2
• Calculate the average prediction error across the 10 sets of results
CV(λ) = (1/10) Σ_{l=1}^{10} L(train(l); λ)
" an estimate of EPE based on the training data and for a given value of λ
• Repeat this for every value of λ
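• A minimal sketch of the whole procedure (the fold assignment, grid of λ
values, and use of the closed-form ridge solution rather than lm.ridge are
all illustrative simplifications)
## Sketch: 10-fold cross-validation for ridge regression on simulated data
set.seed(1)
n <- 1900; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
folds <- sample(rep(1:10, length.out = n))
lambdas <- c(0, 1, 10, 100, 1000)
cvError <- sapply(lambdas, function(lambda) {
  perFold <- sapply(1:10, function(l) {
    train <- folds != l
    b <- solve(t(X[train, ]) %*% X[train, ] + lambda * diag(p),
               t(X[train, ]) %*% y[train])
    sum((y[!train] - X[!train, ] %*% b)^2)   ## L2 loss on the held-out tenth
  })
  mean(perFold)                              ## CV(lambda)
})
lambdas[which.min(cvError)]                  ## value of lambda minimizing CV(lambda)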
• Plots of CV(k) and CV(df(λ)) for best subsets and ridge regression
[Figure: CV(k) versus subset size, k = 1, . . . , 16, for best subsets regression, and CV(df(λ)) versus effective degrees of freedom, df(λ) = 12, . . . , 19, for ridge regression]
• Best subsets regression:
" let k∗ denote the value of k which minimizes CV(K)
" obtain the best subset regression fixing the number of variables at k∗
• Ridge regression:
" let λ∗ denote the value of λ which minimizes CV(λ), or CV(df(λ))
" fit a ridge regression to the full training data, with λ∗
• The estimated test error in each case is the error obtained by applying the
resulting fits to the test data
" n = 900 data set aside earlier
• Test error results
Method             Complexity parameter   Training error   Test error
Least squares                             3610956          1521808
Best subset        k∗ = 11                4037468          1038386
Ridge regression   df(λ∗) = 13.44         3835416          1468410
• Note, in absolute terms, the test error values don’t mean much
" the scale depends on the choice of the loss function, as well as on the data
" relative differences are meaningful though
Categorical outcomes: classification
• Here we consider problems where the outcome of interest is categorical
• For example, in the King county birth weight data scientific focus may rest
on the prediction of whether or not a baby will have low birth weight
" typically defined as weight < 2,500g
" 127 babies out of 2,500 had low birth weight
• From a notational perspective, it is useful to define the outcome random
variable G
• Denote the set of values G can take on as the discrete set G
" set consisting of K classes: G1, . . . ,GK
• The general approach is to establish decision boundaries which provide the
means to classify observations into one of the K classes
" the boundaries split the X-space
• Boundaries are determined by a set of discriminant functions:
δk(x), k = 1, . . . , K
" classify the observation, x, as one of the elements of G on the basis of
which δk(x) is largest
• Here we are going to focus on decision boundaries which are linear
" δk(x) are linear in x
" or, some monotone transformation of δk(x) is linear in x
Decision theory
• In Part I we saw the 0-1 loss function for a categorical outcome variable
L(G, f(X)) = 1 if f(X) ≠ G, and 0 if f(X) = G
• For a categorical variable, the loss function can be represented by a
K × K matrix L
" the L(k, l) element is the cost of classifying an observation as class l when
the truth is class k
" L is zero on the diagonal and (typically) nonnegative everywhere else
" for a binary outcome, 0-1 loss gives
L = [ 0 1 ; 1 0 ]
• The expected prediction error is again given by
EPE = E_{X,G}[ L(G, Ĝ(X)) ]
= E_X [ Σ_{k=1}^{K} L(Gk, Ĝ(X)) Pr(Gk | X = x) ]
• Typically, we condition on X and minimize pointwise, so that the solution
is given by
Ĝ(x) = argmin_{g ∈ G} Σ_{k=1}^{K} L(Gk, g) Pr(Gk | X = x)
• Under 0-1 loss, this simplifies to
Ĝ(x) = argmin_{g ∈ G} [1 − Pr(g | X = x)]
" the solution is therefore to pick the element of G which gives the
largest Pr(Gk | X = x)
" i.e. pick the most probable outcome
Linear regression for categorical outcomes
• In the toy example, we explored the use of linear regression as an approach
for predicting a 0/1 outcome variable
• For a K-level categorical outcome, we can adapt the approach by defining
a set of K indicator variables
Yk = 1 if G = Gk, and 0 otherwise, for k = 1, . . . , K
• The training data then form an n × K outcome matrix, Y
" consisting of 1’s and 0’s
" each row contains a single 1
• Fitting a linear regression model to each column of Y yields
Ŷ = X(X^T X)^{-1} X^T Y
" note there is a coefficient vector for each Yk
• Each row of Ŷ consists of K fitted values
(f̂1(xi), . . . , f̂K(xi)), i = 1, . . . , n
• Assign a class by choosing the component with the largest value
Ĝ(xi) = argmax_{Gk ∈ G} f̂k(xi)
• This may be reasonable in many situations, especially since linear
regression attempts to estimate the conditional expectation and
E[Yk | X = x] = Pr(Gk | X = x)
" from the decision theory, linear regression’s goal is a sensible one
• As long as the model contains an intercept, results from linear regression
satisfy
Σ_{k=1}^{K} f̂k(xi) = 1
• However, some of the f̂k(xi) can be negative or greater than 1
" due to the rigid nature of linear regression
• Within the linear regression framework, we can get around this by
introducing transformations of the X-space
" quadratic terms and two-way interactions
" more general polynomial terms
" more general basis functions
• With finite n, the gains in terms of bias generally result in a cost in terms
of variance
Linear discriminant analysis
• Decision theory told us that, under 0-1 loss, the optimal classification
scheme involves the use of the class probabilities
Pr(Gk| X = x)
• Given these quantities, the decision boundary between classes Gk and Gl
can be expressed in terms of values in the X-space for which the class
probabilities are equal
{ x ∈ X : Pr(G = Gk | X = x) = Pr(G = Gl | X = x) }
or, equivalently,
{ x ∈ X : Pr(G = Gk | X = x) / Pr(G = Gl | X = x) = 1 }
• Let fk(x) denote the class-conditional density of the p-vector X in class
G = Gk
• Further, let πk denote the prior probability of class k, with Σ_k πk = 1
• Bayes theorem then gives us
Pr(G = Gk | X = x) = fk(x) πk / Σ_{l=1}^{K} fl(x) πl
• In terms of ‘choosing’ between classes, the denominator is constant and we
can see that the fk(x) provide almost as much information as the
Pr(G = Gk | X = x)
Linear Discriminant Analysis
• Suppose we assume the class densities to be multivariate Normal, with a
common variance-covariance matrix, Σ:
fk(x) = (2π)^{-p/2} |Σ|^{-1/2} exp{ −(1/2)(x − µk)^T Σ^{-1} (x − µk) }
• Given this form, and taking logs, the decision boundary between classes Gk
and Gl reduces to
log{ Pr(G = Gk | X = x) / Pr(G = Gl | X = x) }
= x^T Σ^{-1}(µk − µl) − (1/2)(µk + µl)^T Σ^{-1}(µk − µl) + log{πk / πl}
• Consequently, the log-odds function is linear in x
" holds true for any (k, l) pair
" decision boundary solutions are therefore linear
• Working somewhat backwards, we see that the linear discriminant
functions
δk(x) = x^T Σ^{-1} µk − (1/2) µk^T Σ^{-1} µk + log{πk}
are sufficient to determine the decision rule, with
Ĝ(x) = argmax_{Gk ∈ G} δk(x)
Estimation
• In practice, we need to estimate the parameters of the multivariate Normal
distributions using the training data
π̂k = nk / n
µ̂k = (1/nk) Σ_{gi = Gk} xi
Σ̂ = (1/(n − K)) Σ_{k=1}^{K} Σ_{gi = Gk} (xi − µ̂k)(xi − µ̂k)^T
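• A minimal sketch of these plug-in estimates on simulated two-class data
(class means and sample size are arbitrary; compare with MASS::lda)
## Sketch: plug-in parameter estimates for LDA
set.seed(1)
n <- 200; K <- 2
g <- sample(1:K, n, replace = TRUE)
mu <- list(c(0, 0), c(2, 1))
x <- t(sapply(g, function(k) mu[[k]] + rnorm(2)))
piHat <- table(g) / n                                    ## class proportions
muHat <- lapply(1:K, function(k) colMeans(x[g == k, ]))  ## class means
SigmaHat <- Reduce(`+`, lapply(1:K, function(k) {
  xc <- sweep(x[g == k, ], 2, muHat[[k]])                ## center within class
  t(xc) %*% xc
})) / (n - K)                                            ## pooled covariance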
Quadratic Discriminant Analysis
• We can relax the assumption of a common variance-covariance matrix
across the K classes
• When forming the log-odds
log{ Pr(G = Gk | X = x) / Pr(G = Gl | X = x) }
the convenient cancellations no longer occur
• This results in a series of quadratic discriminant functions
δk(x) = −(1/2) log|Σk| − (1/2)(x − µk)^T Σk^{-1} (x − µk) + log{πk}
" these are no longer linear in x
• The decision boundary between classes Gk and Gl is now defined by a
quadratic equation
{x ∈ X : δk(x) = δl(x)}
Implementation
• In R, both LDA and QDA are implemented in the MASS library
" lda()
" qda()
Logistic regression
• In LDA, the linearity of the decision boundaries arises from the use of
multivariate normal distributions (and the common variance-covariance
assumption)
" Bayes theorem ensures that the estimates of Pr(G = Gk| X = x) add
up to 1.0
" no guarantee they remain in [0,1]
• Logistic regression attempts to model the decision boundaries directly as
linear functions of X
" ensure the Pr(G = Gk| X = x) add up to 1.0
" ensure they remain in [0,1]
• For K classes, the model has the form
log{ Pr(G = G1 | X = x) / Pr(G = GK | X = x) } = x^T β1
log{ Pr(G = G2 | X = x) / Pr(G = GK | X = x) } = x^T β2
...
log{ Pr(G = G_{K−1} | X = x) / Pr(G = GK | X = x) } = x^T β_{K−1}
" here, class K is the referent group
" the K − 1 sets of parameters reflect the constraint that the probabilities
add up to 1.0
• A simple calculation reveals that
Pr(G = Gk | X = x) = exp{x^T βk} / (1 + Σ_{l=1}^{K−1} exp{x^T βl}), k = 1, . . . , K − 1
Pr(G = GK | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp{x^T βl})
" Pr(G = Gk | X = x) ∈ [0,1] for each k
" sum adds up to 1.0
Estimation
• Given the Pr(G = Gk| X = x), the multinomial distribution arises as a
natural distribution on which to base the likelihood
• Maximum likelihood estimation proceeds via iteratively re-weighted least
squares
" Fisher’s method of scoring
" Newton-Raphson with the Hessian replaced by its expected value
• In the binary case, the details of the algorithm simplify
• If we re-code g = 1/2 to y = 0/1, and let
µ ≡ µ(x; β) = E[Y | X = x, β] = exp{x^T β} / (1 + exp{x^T β})
• For a given starting value of β, at each step of the algorithm, update the
current value of β via
β_new ← β_old + (X^T W X)^{-1} X^T (y − µ)
= (X^T W X)^{-1} X^T W (Xβ_old + W^{-1}(y − µ))
= (X^T W X)^{-1} X^T W z
" z is referred to as the adjusted response
" the weight matrix W is diagonal, with entries
Wii = µ(xi; β_old)(1 − µ(xi; β_old)) = exp{xi^T β_old} / [1 + exp{xi^T β_old}]^2
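• A minimal sketch of this update loop for binary logistic regression (simulated
data; the starting value, tolerance, and iteration cap are arbitrary, and the
result is compared against glm with a binomial family)
## Sketch: iteratively re-weighted least squares for binary logistic regression
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1)))
beta <- rep(0, ncol(X))                       ## starting value
for (iter in 1:25) {
  mu <- plogis(X %*% beta)
  W  <- diag(as.vector(mu * (1 - mu)))        ## diagonal weight matrix
  z  <- X %*% beta + solve(W) %*% (y - mu)    ## adjusted response
  betaNew <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)
  if (max(abs(betaNew - beta)) < 1e-8) break
  beta <- betaNew
}
cbind(irls = as.vector(beta), glm = coef(glm(y ~ X - 1, family = binomial)))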
The hat matrix and degrees of freedom
• In logistic regression, and GLMs in general, the fitted values ŷ are nonlinear
in y
• Hence there is no matrix H such that ŷ = Hy
• We can define a hat matrix by linearization, using the following analogy
with linear regression
• For the linear model, note that
ŷ − E[y] = H{y − E[y]}
" since HE[y] = X(X^T X)^{-1} X^T Xβ = Xβ
• For a generalized linear model, we can define the hat matrix H(β) to be
the matrix such that
µ(X; β̂) − µ(X; β) ≈ H(β){y − E[y]}
• A definition which satisfies this is to set
H(β) = WX(X^T WX)^{-1}X^T
" this uses a Taylor approximation of the update step in the IWLS algorithm
" remember, the W matrix depends on β
• It is relatively straightforward to show that
tr(H(β)) = p
so that, again, using the trace of the newly defined hat matrix serves as a
reasonable definition of the degrees of freedom
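• A minimal sketch of this calculation for a binary logistic model (simulated
data; W is evaluated at the fitted coefficients)
## Sketch: hat matrix and effective degrees of freedom for a logistic GLM
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1)))
fit <- glm(y ~ X - 1, family = binomial)
muHat <- fitted(fit)
W <- diag(muHat * (1 - muHat))
H <- W %*% X %*% solve(t(X) %*% W %*% X) %*% t(X)
sum(diag(H))                                  ## equals p = 3, the number of parameters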
Implementation
• glm() for fitting binary logistic regression models
• multinom() in the nnet library for fitting multinomial logistic regression
models
King county birth data
• Define a new outcome of ‘low birth weight’
" binary 0/1 variable
• Compare the overall results obtained by LDA and logistic regression
• Consider using all 16 predictors
" could apply model selection and shrinkage methods here
• To evaluate the expected prediction error, we can use cross-validation with
K = N
" leave-one-out cross-validation
" full code is available online
##
library(MASS)
## Cycle through each observation
##
results <- matrix(NA, nrow=2500, ncol=2)
for(i in 1:2500)
{
## Fit a linear discriminant analysis
##
fitLDA <- lda(low ~ ., data=weight[-i, -c(2,7)])
Yhat <- predict(fitLDA, newdata=weight[i, -c(2,7)])$class
results[i,1] <- (Yhat != weight[i, -c(2,7)]$low)
## fit a logistic regression model
##
fitGLM <- glm(low ~ ., data=weight[-i, -c(2,7)], family=binomial)
Yhat <- predict(fitGLM,
newdata=weight[i, -c(2,7)], type="response") > 0.5
results[i,2] <- (Yhat != weight[i, -c(2,7)]$low)
}
##
apply(results, 2, mean)
• Results based on 0-1 loss
Method                Estimated prediction error
LDA                   4.28%
Logistic regression   4.24%
• LDA and logistic regression don’t misclassify the same observations, though
                               Logistic regression
                               Correct    Misclassified
LDA    Correct                 2,380      13
       Misclassified           14         93
Comparison with LDA
• In the development of LDA, we saw the decision boundary between classes
k and K to be of the form
log{ Pr(G = Gk | X = x) / Pr(G = GK | X = x) }
= x^T Σ^{-1}(µk − µK) − (1/2)(µk + µK)^T Σ^{-1}(µk − µK) + log{πk / πK}
= α0k + x^T αk
• Hence, the log-odds have the same form as that of logistic regression
" the two approaches attempt to estimate the same quantities
" the major difference is in how they estimate these quantities
• In LDA, by virtue of making the assumption of multivariate Normality for
X within each class, we are estimating the full joint distribution of (G, X).
• Make assumptions about the conditional distribution X | G and the
marginal distribution G
X | G = Gk ∼ MVNp(µk,Σ), k = 1, . . . , K
P (G = Gk) = πk
" use Bayes theorem to inform us on G|X
" there is a resulting induced marginal distribution for X which depends
on the log-odds parameters and hence plays a role in estimation
• If these assumptions are true, then LDA will be the best we can do
" LDA estimates will be ‘maximum likelihood’
• In logistic regression, we only focus on the conditional distribution G|X
" maximize a conditional likelihood based on the multinomial distribution
• No additional assumptions on the marginal distribution of X
" actually ignored in the estimation of the model
" think of the marginal density as being estimated in a fully
nonparametric way
• In the event that the underlying densities are multivariate Normal, there
will be a loss of efficiency when using logistic regression
" typically not large, and at most 30%
" reasonable trade-off to possible misspecification
Separating hyperplanes
• LDA and logistic regression both estimate linear decision boundaries by
adopting structure on the log-odds scale
" exploit all the data via some global structure
• An alternative class of methods attempts to directly separate observations
into the various possible classes
" focus on the points themselves
" viewed as being more local in nature
• Consider the data in the figure below
• Included is the least squares solution to the problem
" Y = -1/1
" the boundary line is given by { x ∈ R^2 : x^T β̂ = 0 }
• We see two points are misclassified
• Surprising, since it’s not hard to see that one could draw a line separating
the two groups
[Figure: scatterplot of the two classes in (X1, X2), with the least squares decision boundary; two points are misclassified]
• We refer to such data as being separable
" there exists at least one separating hyperplane
" in this example there actually are infinitely many such planes
" two are shown here with dashed lines
[Figure: the same data, with two of the infinitely many separating hyperplanes shown as dashed lines]
Perceptron learning algorithm
• The perceptron algorithm finds a separating hyperplane by minimizing the
distance between the decision boundary and misclassified points
• For Y = -1/1, and for a given β, a response is misclassified if either
yi = −1 and xi^T β > 0, or yi = 1 and xi^T β < 0
• Let M denote the set of misclassified observations
• The goal is to minimize
D(β) = − Σ_{i ∈ M} yi (xi^T β)
• Taking derivatives, the update simply cycles through each element of M
via
β_new ← β_old + ρ yi xi
" ρ is a tuning parameter
" the algorithm uses stochastic gradient descent, visiting each
observation in sequence
" rather than computing the sum of the derivatives
• The form of the algorithm suggests that there is a greater local focus on
points ‘close’ to the decision boundary
" algorithm updates on the basis of elements of M
" may be more robust to model misspecification than, say, LDA or
logistic regression which impose more global structure
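• A minimal sketch of the algorithm on separable simulated data (the learning
rate ρ, initialization, and maximum number of passes are arbitrary choices)
## Sketch: perceptron learning on separable two-class data, y coded -1/1
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 2), n, 2)
y <- ifelse(x[, 1] + x[, 2] > 0, 1, -1)      ## separable by construction
X <- cbind(1, x)                             ## include an intercept column
beta <- rep(0, 3); rho <- 0.1
for (pass in 1:100) {
  anyMiss <- FALSE
  for (i in 1:n) {
    if (y[i] * sum(X[i, ] * beta) <= 0) {    ## visit each observation in sequence
      beta <- beta + rho * y[i] * X[i, ]     ## update using the misclassified point
      anyMiss <- TRUE
    }
  }
  if (!anyMiss) break                        ## stop once nothing is misclassified
}
beta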
• The algorithm suffers from a series of drawbacks however
" when the data are separable, there are many solutions and which one
you get depends on the starting values
" it can sometimes take a long time to converge
" when the data are not separable there is no solution, so that the
algorithm will not converge and can cycle
• One solution to the first problem is to incorporate an additional constraint
into the optimization problem
" choose the boundary which maximizes the margin between the two
classes in the training data
• In most situations, however, it is unlikely to be the case that the classes
are perfectly separable
" support vector machines provide a solution to this problem
" we’ll see these later