April 6, 2010 Generalized Linear Models 2010 LISA Short Course Series Mark Seiss, Dept. of Statistics




Page 1:

April 6, 2010

Generalized Linear Models
2010 LISA Short Course Series

Mark Seiss, Dept. of Statistics

Page 2:

Presentation Outline

1. Introduction to Generalized Linear Models
2. Binary Response Data – Logistic Regression Model
3. Teaching Method Example
4. Count Response Data – Poisson Regression Model
5. Mining Example
6. Open Discussion

Page 3:

Reference Material

• Categorical Data Analysis – Alan Agresti
• Contemporary Statistical Models for Plant and Soil Sciences – Oliver Schabenberger and F.J. Pierce

Presentation and Data from Examples
• www.lisa.stat.vt.edu

Page 4:

Generalized Linear Models

• Generalized linear models (GLMs) extend ordinary regression to non-normal response distributions.
• The response distribution must come from the exponential family of distributions
  • Includes Normal, Bernoulli, Binomial, Poisson, Gamma, etc.
• 3 Components
  • Random – Identifies the response Y and its probability distribution
  • Systematic – Explanatory variables in a linear predictor function (Xβ)
  • Link function – Invertible function g(.) that links the mean of the response (E[Yi] = μi) to the systematic component

Page 5:

Generalized Linear Models

• Model

  g(μi) = Σj βj xij,  for i = 1 to n, j = 1 to p

• Equivalently,

  μi = g⁻¹( Σj βj xij )

Page 6:

Generalized Linear Models

• Why do we use GLMs?
  • Linear regression assumes that the response is distributed normally
  • GLMs allow us to analyze the linear relationship between predictor variables and the mean of the response variable when it is not reasonable to assume the data are distributed normally

Page 7:

Generalized Linear Models

• Connection between GLMs and Multiple Linear Regression
  • Multiple linear regression is a special case of the GLM
  • Response is normally distributed with variance σ²
  • Identity link function: g(μi) = μi = xiᵀβ

Page 8:

Generalized Linear Models

• Predictor Variables
  • Two types: continuous and categorical
  • Continuous predictor variables
    • Examples – time, grade point average, test score, etc.
    • Coded with one parameter – βixi
  • Categorical predictor variables
    • Examples – sex, political affiliation, marital status, etc.
    • The actual value assigned to a category is not important
    • Ex) Sex – Male/Female, M/F, 1/2, 0/1, etc.
    • Coded differently than continuous variables

Page 9:

Generalized Linear Models

• Predictor Variables cont.
  • Consider a categorical predictor variable with L categories
  • One category is selected as the reference category
    • Assignment of the reference category is arbitrary
    • Some suggest assigning the category with the most observations
  • The variable is represented by L−1 dummy variables
    • Ensures model identifiability

Page 10:

Generalized Linear Models

• Predictor Variables cont.
  • Two types of coding
  • Dummy coding (used in R)
    • xk = 1 if the predictor variable is equal to category k, 0 otherwise
    • xk = 0 for all k if the predictor variable equals the reference category
  • Effect coding (used in JMP)
    • xk = 1 if the predictor variable is equal to category k, 0 otherwise
    • xk = −1 for all k if the predictor variable equals the reference category
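The two coding schemes can be sketched in a few lines of Python. This is a minimal illustration only; the category names and the choice of reference level are hypothetical.

```python
# Sketch: dummy vs. effect coding for a categorical predictor with
# L = 3 hypothetical categories ("A", "B", "C"), using "C" as the
# reference category, so the variable is represented by L - 1 = 2 columns.

def dummy_code(value, levels, reference):
    """One indicator column per non-reference level (R-style dummy coding)."""
    return [1 if value == lvl else 0 for lvl in levels if lvl != reference]

def effect_code(value, levels, reference):
    """Like dummy coding, but the reference category is coded -1 (JMP-style)."""
    if value == reference:
        return [-1] * (len(levels) - 1)
    return [1 if value == lvl else 0 for lvl in levels if lvl != reference]

levels = ["A", "B", "C"]

print(dummy_code("A", levels, "C"))   # [1, 0]
print(dummy_code("C", levels, "C"))   # [0, 0]
print(effect_code("C", levels, "C"))  # [-1, -1]
```

Either coding gives an identifiable model; the two differ only in how the reference category enters the design matrix, which changes the interpretation (and, as on Page 41, the arithmetic) of the fitted coefficients.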

Page 11:

Generalized Linear Models

• Model Evaluation – −2 Log Likelihood
  • Specified by the random component of the GLM
  • For independent observations, the likelihood is the product of the probability distribution functions of the observations
  • −2 log likelihood is −2 times the log of the likelihood function
  • −2 log likelihood is used due to its distributional properties – chi-square

  −2 Log L = −2 log( Π_{i=1}^{n} f(yi) ) = −2 Σ_{i=1}^{n} log f(yi)
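As a concrete illustration, the −2 log likelihood for a handful of independent Bernoulli observations can be computed directly from this definition. The responses and fitted probabilities below are made up for the example.

```python
import math

# Sketch: -2 log likelihood for independent Bernoulli observations,
# computed as -2 times the sum of log probability mass values.
# Data and fitted probabilities are invented for illustration.

def neg2_log_lik(y, p):
    """-2 log L for Bernoulli responses y with fitted probabilities p."""
    return -2 * sum(math.log(pi if yi == 1 else 1 - pi)
                    for yi, pi in zip(y, p))

y = [1, 0, 1, 1]          # observed binary responses
p = [0.8, 0.3, 0.6, 0.9]  # fitted P(Y = 1) for each observation

print(round(neg2_log_lik(y, p), 4))  # ≈ 2.392
```

A better-fitting set of probabilities gives a smaller −2 log L, which is what makes it usable as a comparison criterion across models.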

Page 12:

Generalized Linear Models

• Saturated Model
  • Contains a separate indicator parameter for each observation
  • Perfect fit: μi = yi
  • Not useful since there is no data reduction
    • i.e. the number of parameters equals the number of observations
  • Maximum achievable log likelihood (minimum −2 Log L) – baseline for comparison to other model fits

Page 13:

Generalized Linear Models

• Deviance
  • Let L(β|y) = maximum of the log likelihood for the model
        L(y|y) = maximum of the log likelihood for the saturated model
  • Deviance: D(β) = −2 [ L(β|y) − L(y|y) ]

Page 14:

Generalized Linear Models

• Deviance cont.

  −2 Log L(y|y)  ≤  −2 Log L(μ̂|y)  ≤  −2 Log L(μ̂₀|y)

  D̂ = deviance of the fitted model
  D₀ = deviance of the null (intercept-only) model

  Model chi-square = D₀ − D̂

Page 15:

Generalized Linear Models

• Deviance cont.
  • Lack-of-fit test
    • Likelihood ratio statistic for testing the null hypothesis that the model fits as well as the saturated model
    • Has an asymptotic chi-squared distribution with N − p degrees of freedom, where p is the number of parameters in the model
  • Also allows for the comparison of one model to another using the likelihood ratio test

Page 16:

Generalized Linear Models

• Nested Models
  • Model 1 – model with p predictor variables {X1, X2, …, Xp} and vector of fitted values μ1
  • Model 2 – model with q < p predictor variables {X1, X2, …, Xq} and vector of fitted values μ2
  • Model 2 is nested within Model 1 if all predictor variables found in Model 2 are included in Model 1
    • i.e. the set of predictor variables in Model 2 is a subset of the set of predictor variables in Model 1

Page 17:

Generalized Linear Models

• Nested Models
  • Model 2 is a special case of Model 1 – all the coefficients corresponding to Xq+1, Xq+2, …, Xp are equal to zero:

    g(μ) = β0 + β1X1 + … + βqXq + 0·Xq+1 + 0·Xq+2 + … + 0·Xp

Page 18:

Generalized Linear Models

• Likelihood Ratio Test
  • Null hypothesis for nested models: the predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit
  • Alternative hypothesis for nested models: the predictor variables in Model 1 that are not found in Model 2 are significant to the model fit

Page 19:

Generalized Linear Models

• Likelihood Ratio Test
  • Likelihood ratio statistic = −2L(y, μ2) − (−2L(y, μ1)) = D(y, μ2) − D(y, μ1)
    • The difference of the deviances of the two models
  • Always D(y, μ2) ≥ D(y, μ1), which implies LRT ≥ 0
  • The LRT statistic is distributed chi-squared with p − q degrees of freedom
  • Later, the likelihood ratio test will be used to test the significance of variables in logistic and Poisson regression models

Page 20:

Generalized Linear Models

• Theoretical Example of Likelihood Ratio Test
  • 3 predictor variables – 1 continuous (X1), 1 categorical with 4 categories (X2, X3, X4), 1 categorical with 2 categories (X5)
  • Model 1 – predictor variables {X1, X2, X3, X4, X5}
  • Model 2 – predictor variables {X1, X5}
  • Null hypothesis – the variable with 4 categories is not significant to the model (β2 = β3 = β4 = 0)
  • Alternate hypothesis – the variable with 4 categories is significant

Page 21:

Generalized Linear Models

• Theoretical Example of Likelihood Ratio Test cont.
  • Likelihood ratio statistic = D(y, μ2) − D(y, μ1)
    • The difference of the deviance statistics from the two models
    • Equivalently, the difference of the −2 Log L values from the two models
  • Chi-squared distribution with 5 − 2 = 3 degrees of freedom
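To turn such a statistic into a p-value you evaluate the chi-squared tail probability on 3 degrees of freedom. For odd degrees of freedom the chi-squared CDF has a closed form; the LRT value below is invented purely to show the arithmetic.

```python
import math

# Sketch: p-value for a likelihood ratio statistic on 3 degrees of freedom.
# For df = 3 the chi-squared CDF reduces to the standard identity
#   CDF(x) = erf(sqrt(x/2)) - sqrt(2x/pi) * exp(-x/2)

def chi2_cdf_3df(x):
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

lrt = 9.21  # hypothetical D(y, mu2) - D(y, mu1)
p_value = 1 - chi2_cdf_3df(lrt)
print(round(p_value, 4))
```

A p-value below the chosen significance level would lead us to keep the 4-category variable in the model.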

Page 22:

Generalized Linear Models

• Model Selection
  • 2 goals: complex enough to fit the data well; simple enough to interpret and not overfit the data
  • Study the effect of each predictor on the response Y
    • Continuous predictor – graph Y versus X
    • Discrete predictor – contingency table of the mean of Y (μy) versus the categories of X
  • Unbalanced data – few responses of one type
    • Guideline – at least 10 outcomes of each type for each term in the model
    • Example – Y = 1 for only 30 observations out of 1000: the model should contain no more than 3 terms

Page 23:

Generalized Linear Models

• Model Selection cont.
  • Multicollinearity
    • Correlations among predictors resulting in an increase in variance
    • Reduces the significance value of the variable
    • Occurs when several predictor variables are used in the model
    • Affects the sign, size, and significance of parameter estimates
  • Determining model fit
    • Other criteria besides significance tests (i.e. the likelihood ratio test) can be used to select a model

Page 24:

Generalized Linear Models

• Model Selection cont.
  • Determining model fit cont.
  • Akaike Information Criterion (AIC)
    – Penalizes the model for having many parameters
    – AIC = Deviance + 2p, where p is the number of parameters in the model
  • Bayesian Information Criterion (BIC)
    – BIC = −2 Log L + ln(n)·p, where p is the number of parameters in the model and n is the number of observations
    – Also known as the Schwarz Information Criterion (SIC)
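The two criteria above are simple arithmetic once a model has been fit. A minimal sketch, with a made-up −2 log likelihood, parameter count, and sample size:

```python
import math

# Sketch: AIC and BIC computed from a fitted model's -2 log likelihood,
# following the definitions on this slide. (The slide writes AIC with the
# deviance; using -2 log L instead shifts both models by the same constant,
# so comparisons are unaffected.) The numbers are invented for illustration.

def aic(neg2_log_lik, p):
    return neg2_log_lik + 2 * p

def bic(neg2_log_lik, p, n):
    return neg2_log_lik + math.log(n) * p

neg2ll = 120.0  # hypothetical -2 log L of a fitted model
p, n = 4, 100   # 4 parameters, 100 observations

print(aic(neg2ll, p))                # 128.0
print(round(bic(neg2ll, p, n), 3))   # BIC penalizes harder once ln(n) > 2
```

Lower values are better for both; BIC's ln(n) penalty grows with the sample size, so it tends to prefer smaller models than AIC.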

Page 25:

Generalized Linear Models

• Model Selection cont.
  • Selection algorithms
    • Best subset – tests all combinations of predictor variables to find the best subset
    • Algorithmic – forward, backward, and stepwise procedures

Page 26:

Generalized Linear Models

• Stepwise Selection
  • Idea: combination of forward and backward selection – a forward step, then a backward step
  • Step One: Fit each predictor variable as a single predictor and determine fit
  • Step Two: Select the variable that produces the best fit and add it to the model
  • Step Three: Add each remaining predictor variable one at a time to the model and determine fit
  • Step Four: Select the variable that produces the best fit and add it to the model

Page 27:

Generalized Linear Models

• Stepwise Selection cont.
  • Step Five: Delete each variable in the model one at a time and determine fit
  • Step Six: Remove the variable that produces the best fit when deleted
  • Step Seven: Return to Step Two
    • Loop until no added or deleted variable improves the fit
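The loop above can be sketched as a compact variant in which each iteration considers both adding and dropping a variable and takes the single best move. The `score` function here is a stand-in for a real fit criterion such as AIC (lower is better), and the score table is entirely made up.

```python
# Sketch of stepwise selection with a hypothetical fit criterion.
# SCORES maps a set of included variables to a made-up score; a real
# implementation would fit a GLM and return its AIC instead.

SCORES = {
    frozenset(): 100.0,
    frozenset({"x1"}): 80.0,
    frozenset({"x2"}): 90.0,
    frozenset({"x1", "x2"}): 70.0,
}

def score(model):
    return SCORES[frozenset(model)]  # hypothetical criterion, lower is better

def stepwise(candidates):
    model = set()
    while True:
        best, best_score = None, score(model)
        # Forward step: try adding each variable not yet in the model.
        for v in candidates - model:
            s = score(model | {v})
            if s < best_score:
                best, best_score = ("add", v), s
        # Backward step: try deleting each variable in the model.
        for v in model:
            s = score(model - {v})
            if s < best_score:
                best, best_score = ("drop", v), s
        if best is None:          # no move improves the fit: stop
            return model
        action, v = best
        model = model | {v} if action == "add" else model - {v}

print(sorted(stepwise({"x1", "x2"})))  # ['x1', 'x2']
```

Because every accepted move strictly improves the score, the loop is guaranteed to terminate.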

Page 28:

Generalized Linear Models

• Outlier Detection
  • Studentized residual plots and deviance residual plots
    • Plot against predicted values
    • Looking for “sore thumbs” – values much larger than those for other observations

Page 29:

Generalized Linear Models

• Summary
  • Setup of the generalized linear model
  • Continuous and categorical predictor variables
  • Log likelihood
  • Deviance and likelihood ratio test
    • Test lack of fit of the model
    • Test the significance of a predictor variable or set of predictor variables in the model
  • Model selection
  • Outlier detection

Page 30:

Generalized Linear Models

• Questions/Comments

Page 31:

Logistic Regression

• Consider a binary response variable
  • Variable with two outcomes
  • One outcome represented by a 1 and the other represented by a 0
  • Examples:
    • Does the person have a disease? Yes or No
    • Outcome of a baseball game? Win or loss

Page 32:

Logistic Regression

• Teaching Method Data Set
  • Found in Aldrich and Nelson (Sage Publications, 1984)
  • A researcher would like to examine the effect of a new teaching method – Personalized System of Instruction (PSI)
  • The response variable is whether the student received an A in a statistics class (1 = yes, 0 = no)
  • Other data collected:
    • GPA of the student
    • Score on a test of entering knowledge of statistics (TUCE)

Page 33:

Logistic Regression

• Consider the linear probability model

  π(xi) = P(Yi = 0 | xi) = xiᵀβ

  where yi = response for observation i
        xi = 1×(p+1) vector of covariates for observation i
        p = number of covariates

Page 34:

Logistic Regression

• GLM with binomial random component and identity link g(μ) = μ
• Issues:
  • π(xi) can take on values less than 0 or greater than 1
  • Predicted probabilities for some subjects fall outside of the [0, 1] range

Page 35:

Logistic Regression

• Consider the logistic regression model

  π(xi) = P(Yi = 0 | xi) = exp(xiᵀβ) / (1 + exp(xiᵀβ))

  logit(π(xi)) = log[ π(xi) / (1 − π(xi)) ] = xiᵀβ

• GLM with binomial random component and logit link g(μ) = logit(μ)
• Range of values for π(xi) is 0 to 1
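The reason the logit link fixes the out-of-range problem of the linear probability model is that its inverse maps any linear-predictor value to a probability strictly between 0 and 1. A minimal numerical check:

```python
import math

# Sketch: the inverse logit (logistic) function maps any value of the
# linear predictor eta = x'beta in (-inf, inf) into (0, 1).

def inv_logit(eta):
    return math.exp(eta) / (1 + math.exp(eta))

print(round(inv_logit(-5.0), 4))  # 0.0067
print(round(inv_logit(0.0), 4))   # 0.5
print(round(inv_logit(5.0), 4))   # 0.9933
```

Large negative linear predictors give probabilities near 0, large positive ones give probabilities near 1, and 0 maps to exactly 0.5.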

Page 36:

Logistic Regression

• Interpretation of Coefficient β – Odds Ratio
  • The odds ratio is a statistic that measures the odds of an event compared to the odds of another event
  • Say the probability of Event 1 is π1 and the probability of Event 2 is π2. Then the odds ratio of Event 1 to Event 2 is:

    Odds_Ratio = Odds(1) / Odds(2) = [ π1 / (1 − π1) ] / [ π2 / (1 − π2) ]

Page 37:

Logistic Regression

• Interpretation of Coefficient β – Odds Ratio cont.
  • Values of the odds ratio range from 0 to infinity
    • A value between 0 and 1 indicates the odds of Event 2 are greater
    • A value between 1 and infinity indicates the odds of Event 1 are greater
    • A value equal to 1 indicates the events are equally likely

Page 38:

Logistic Regression

• Interpretation of Coefficient β – Odds Ratio cont.
  • Link to logistic regression:

    Log(Odds_Ratio) = log[ π1/(1 − π1) ] − log[ π2/(1 − π2) ] = logit(π1) − logit(π2)

  • Thus the odds ratio between the two events is

    Odds_Ratio = exp{ logit(π1) − logit(π2) }

  • Note: one should take caution when interpreting parameter estimates
    • Multicollinearity can change the sign, size, and significance of parameters

Page 39:

Logistic Regression

• Interpretation of Coefficient β – Odds Ratio cont.
  • Consider Event 1: Y = 0 given X + 1, and Event 2: Y = 0 given X
  • From our logistic regression model:

    Log(Odds_Ratio) = logit(P(Y = 0 | X + 1)) − logit(P(Y = 0 | X)) = β(X + 1) − βX = β

  • Thus the ratio of the odds of Y = 0 at X + 1 to the odds at X is

    Odds_Ratio = exp(β)

Page 40:

Logistic Regression

• Interpretation for a Continuous Predictor Variable
  • Consider the following JMP output:

    Parameter Estimates
    Term        Estimate      Std Error   ChiSquare   Prob>ChiSq
    Intercept   11.8320025    4.7161552   6.29        0.0121*
    GPA         -2.8261125    1.262941    5.01        0.0252*
    TUCE        -0.0951577    0.1415542   0.45        0.5014
    PSI[0]      1.18934379    0.5322821   4.99        0.0255*

  • Interpretation of the parameter estimate:
    • Exp{−2.8261125} = 0.0592 = odds ratio between the odds at x + 1 and the odds at x, for all x
    • The ratio of the odds of NOT getting an A between a person with a 3.0 GPA and a person with a 2.0 GPA is 0.0592; in other words, the odds for the person with the 3.0 are 0.0592 times the odds for the person with the 2.0
    • Equivalently, the odds of getting an A for a person with a 3.0 GPA are 1/0.0592 = 16.8919 times the odds of getting an A for a person with a 2.0 GPA
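The arithmetic behind this interpretation is just exponentiation of the fitted coefficient, as a quick check confirms:

```python
import math

# Check of the odds-ratio arithmetic above: exponentiating the GPA
# coefficient from the slide's JMP output gives the multiplicative change
# in the odds of Y = 0 (not getting an A) per one-unit increase in GPA.

beta_gpa = -2.8261125
odds_ratio = math.exp(beta_gpa)

print(round(odds_ratio, 4))      # ≈ 0.0592, odds multiplier per GPA point
print(round(1 / odds_ratio, 4))  # reciprocal: odds ratio for getting an A
```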

Page 41:

Logistic Regression

• Interpretation for a Categorical Predictor Variable
  • Consider the following JMP output:

    Parameter Estimates
    Term        Estimate      Std Error   ChiSquare   Prob>ChiSq
    Intercept   11.8320025    4.7161552   6.29        0.0121*
    GPA         -2.8261125    1.262941    5.01        0.0252*
    TUCE        -0.0951577    0.1415542   0.45        0.5014
    PSI[0]      1.18934379    0.5322821   4.99        0.0255*

  • Interpretation of the parameter estimate:
    • Exp{2 × 1.1893} = 10.79 = odds ratio between the odds of NOT getting an A for a student who was not subject to the teaching method and the odds of NOT getting an A for a student who was subject to the teaching method (the factor of 2 reflects JMP's effect coding, under which the two levels differ by 2 in the coded variable)
    • The odds of getting an A without the teaching method are 1/10.79 ≈ 0.0927 times the odds of getting an A with the teaching method

Page 42:

Logistic Regression

• ROC Curve
  • Receiver Operating Characteristic curve
  • Sensitivity – proportion of positive cases (Y = 1) that were classified as positive by the model: P(ŷ = 1 | y = 1)
  • Specificity – proportion of negative cases (Y = 0) that were classified as negative by the model: P(ŷ = 0 | y = 0)

Page 43:

Logistic Regression

• ROC Curve cont.
  • Cutoff value – a selected probability: all cases with predicted probabilities above the cutoff are classified as positive (Y = 1), and all cases with predicted probabilities below the cutoff are classified as negative (Y = 0)
    • A 0.5 cutoff is commonly used
  • ROC curve – plot of the sensitivity versus one minus the specificity for various cutoff values
    • False positives (1 − specificity) on the x-axis and true positives (sensitivity) on the y-axis

Page 44:

Logistic Regression

• ROC Curve cont.
  • Measure the area under the ROC curve
    • Poor fit – area under the ROC curve approximately equal to 0.5
    • Good fit – area under the ROC curve approximately equal to 1.0
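The area under the ROC curve also has a direct probabilistic reading: it is the chance that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. That interpretation gives a simple way to compute it, sketched here on made-up data:

```python
# Sketch: AUC computed from its probabilistic interpretation -- the
# proportion of (positive, negative) pairs in which the positive case
# gets the higher predicted probability (ties count one half).
# Labels and predicted probabilities below are invented for illustration.

def roc_auc(labels, probs):
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
probs  = [0.9, 0.7, 0.4, 0.3, 0.5, 0.6]

print(roc_auc(labels, probs))  # 8/9 ≈ 0.8889
```

A model that ranks every positive above every negative scores 1.0; random ranking scores about 0.5, matching the poor-fit/good-fit benchmarks above.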

Page 45:

Logistic Regression

• Teaching Method Example

Page 46:

Logistic Regression

• Summary
  • Introduction to the logistic regression model
  • Interpretation of the parameter estimates β – odds ratio
  • ROC curves
  • Teaching method example

Page 47:

Logistic Regression

• Questions/Comments

Page 48:

Poisson Regression

• Consider a count response variable
  • The response variable is the number of occurrences in a given time frame
  • Outcomes equal to 0, 1, 2, …
  • Examples:
    • Number of penalties during a football game
    • Number of customers who shop at a store on a given day
    • Number of car accidents at an intersection

Page 49:

Poisson Regression

• Mining Data Set
  • Found in Myers (1990)
  • The response of interest is the number of fractures that occur in upper seam mines in the coal fields of the Appalachian region of western Virginia
  • Want to determine whether the number of fractures is a function of the material in the land and the mining area
  • Four possible regressors:
    • Inner burden thickness
    • Percent extraction of the lower, previously mined seam
    • Lower seam height
    • Years the mine has been open

Page 50:

Poisson Regression

• Mining Data Set cont.
  • Coal mine seam

Page 51:

Poisson Regression

• Mining Data Set cont.
  • Coal mine upper and lower seams
  • Prevalence of overburden fracturing may lead to collapse

Page 52:

Poisson Regression

• Consider the model

  E[Yi] = μi = xiᵀβ

  where Yi = response for observation i
        xi = 1×(p+1) vector of covariates for observation i
        p = number of covariates
        μi = expected number of events given xi

• GLM with Poisson random component and identity link g(μ) = μ
• Issue: predicted values range from −∞ to +∞

Page 53:

Poisson Regression

• Consider the Poisson log-linear model

  log(μi) = xiᵀβ

  μi = E[Yi | xi] = exp(xiᵀβ)

• GLM with Poisson random component and log link g(μ) = log(μ)
• Predicted response values fall between 0 and +∞
• In the case of a single predictor, an increase of one unit in x multiplies μ by exp(β)

Page 54:

Poisson Regression

• Continuous Predictor Variable
  • Consider the JMP output:

    Term            Estimate   Std Error  L-R ChiSquare  Prob>ChiSq  Lower CL   Upper CL
    Intercept       -3.59309   1.0256877  14.113702      0.0002*     -5.69524   -1.660388
    Thickness       -0.001407  0.0008358  3.166542       0.0752      -0.003162  0.0001349
    Pct_Extraction  0.0623458  0.0122863  31.951118      <.0001*     0.0392379  0.0875323
    Height          -0.00208   0.0050662  0.174671       0.6760      -0.012874  0.0070806
    Age             -0.030813  0.0162649  3.8944386      0.0484*     -0.064181  -0.000202

  • Interpretation of the parameter estimate:
    • Exp{−0.030813} = 0.9697 = multiplicative effect on the expected number of fractures for an increase of 1 in the years the mine has been open
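Because the log link makes effects multiplicative, the factor for a k-unit change is the one-unit factor raised to the k-th power. A quick check of the slide's arithmetic, using the Age coefficient from the output above:

```python
import math

# Check of the multiplicative-effect arithmetic: exponentiating the Age
# coefficient gives the factor by which the expected fracture count
# changes per additional year the mine has been open.

beta_age = -0.030813
effect_1yr = math.exp(beta_age)
effect_5yr = math.exp(5 * beta_age)  # same as effect_1yr ** 5

print(round(effect_1yr, 4))  # ≈ 0.9697
print(round(effect_5yr, 4))  # ≈ 0.8572
```

So each extra year open is associated with about a 3% reduction in the expected number of fractures, holding the other regressors fixed.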

Page 55:

Poisson Regression

• Overdispersion for Poisson Regression Models
  • More variability in the response than the model allows
  • For Yi ~ Poisson(λi), E[Yi] = Var[Yi] = λi
  • Overdispersion: the variance of the response is much larger than the mean
  • Consequences:
    • Parameter estimates are still consistent
    • Standard errors are inconsistent
  • Detection: D(β)/(n − p)
    • Large if overdispersion is present
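The detection statistic above is easy to compute once the fitted means are in hand, using the standard Poisson deviance. The counts and fitted means below are made up for illustration; values of the statistic well above 1 point toward overdispersion.

```python
import math

# Sketch: the dispersion statistic D / (n - p) for a Poisson model, using
# the standard Poisson deviance
#   D = 2 * sum( y*log(y/mu) - (y - mu) ),  with y*log(y/mu) = 0 when y = 0.

def poisson_deviance(y, mu):
    d = 0.0
    for yi, mi in zip(y, mu):
        if yi > 0:
            d += yi * math.log(yi / mi)
        d -= (yi - mi)
    return 2 * d

y  = [0, 2, 5, 1, 9, 3]               # hypothetical observed counts
mu = [1.0, 2.5, 3.0, 1.5, 4.0, 3.0]   # hypothetical fitted means
p  = 2                                # hypothetical number of parameters

dispersion = poisson_deviance(y, mu) / (len(y) - p)
print(round(dispersion, 3))  # ≈ 2, suggesting overdispersion
```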

Page 56:

Poisson Regression

• Overdispersion for Poisson Regression Models cont.
  • Remedies
    1. Change the linear predictor – xᵀβ
       – Add or subtract regressors, transform regressors, add interaction terms, etc.
    2. Change the link function – g(xᵀβ)
    3. Change the random component
       – Use the negative binomial distribution

Page 57:

Poisson Regression

• Mining Example

Page 58:

Poisson Regression

• Summary
  • Introduction to the Poisson regression model
  • Interpretation of β
  • Overdispersion
  • Mining example

Page 59:

Poisson Regression

• Questions/Comments

Page 60:

Generalized Linear Models

• Open Discussion