
Stat 342 - Wk 12: Advanced regression and model building.

Mixed effect models – crash course

Mixed effect models (proc glm)

Logistic regression – crash course

Binary responses (proc logistic)

Ordinal and multinomial responses (proc logistic)


Last week, we examined complex models with proc glm and model selection with proc glmselect.

This week, we're going to introduce three major expansions to our library of regression tools.

1. Mixed effect models (proc glm, 'random' statement)

2. Logistic regression (proc logistic)

3. Maximum likelihood estimation (proc genmod)


Crash course on MIXED EFFECT MODELS

In previous lectures, we have used categorical variables like 'number of cylinders in an engine' or 'shelf that a brand of cereal is found on'.

We have treated every categorical variable in the same way, as if these were the only categories that were possibly of interest. In short, we have treated each of these variables like 'fixed effects' – things that only ever take on the specific categorical values that we observe in the data.


But there are two kinds of interpretations of categorical variables: fixed effect (standard, traditional), and RANDOM EFFECTS.

The distinction between a fixed and random effect factor often depends on what would happen if the experiment were to be repeated.


If the experiment were to be repeated, would the same levels be chosen (fixed effects), or would a new set of levels be chosen (random effects)?

The distinction can also be made by the scope of inference for the experiment. Are we interested in the effects of the levels that actually occurred in the experiment (fixed effects), or do we wish to generalize to a large population of levels from which we happened to choose a few for this experiment (random effects)?


Example 1: An experiment on the effects of soil compression on subsequent tree growth. Suppose that the experimenter obtained seedlings from several different seed sources.

This experiment could be viewed as having two factors (categorical variables) – the level of soil compression, and the seed source.


The soil compression is something under our control, and we're interested specifically in the levels of compaction given. THIS IS A FIXED EFFECT

By treating soil compression as a fixed effect, SAS (and R, and other programs) will return something like this:

Category             Estimate   Std. Err.   T       P-value
No Compression        0.00000   --          --      --
Light Compression     3.4235    1.512       2.264   .012
Medium Compression    6.1290    2.623       2.337   .010
Heavy Compression    11.907     1.997       5.962   < .0001


In short, we are given standard errors and hypothesis tests for each category compared to the baseline category.

We have a lot of details about these categories in particular, but the model can't be applied to additional categories.


The seed sources are not completely under our control. If we were to run this experiment again, we could end up with different batches of seeds.

Also, for an experiment like this, it doesn't make sense to apply the results to only those batches of seeds, but to ALL seeds of this species as a whole.

Therefore, it makes the most sense to treat seed source as a random effect. In essence, the categories we chose were randomly selected from a large pool of possible categories.


The model summary for a random effect might look like this... We have the change in the response mean (the 'effect') for each seed batch, but we don't have any way to tell if any particular batch is statistically significantly different from any other.

Category       Estimate   Std. Err.   T    P-value
Seed batch 1    0.00000   --          --   --
Seed batch 2    3.4235    NA          NA   NA
Seed batch 3   -2.908     NA          NA   NA
Seed batch 4   -7.124     NA          NA   NA


No p-values? Why deal with random effects at all?

We can find the amount of variance explained by the different seed batches in general, but can't isolate it to any given batch.

It lets us find the size of the fixed effects, like soil compression, while controlling for any additional effect from the different batches.


Also, we can add any other seed batch to the model by comparing its mean response to the baseline.

One final advantage of using random effects is that they are 'cheaper'. For fixed effects, each category after the first needs its own dummy variable and 'costs' us a degree of freedom.

For random effects, we can add new categories without losing degrees of freedom. New categories don't contribute to other complexity-based problems like collinearity (high VIF).
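To put numbers on it (illustrative figures, not from the seed data): a factor with 30 levels treated as a fixed effect needs 29 dummy variables, and so costs 29 degrees of freedom; treated as a random effect, it is summarized by a single variance component no matter how many levels there are.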


This makes random effects an ideal treatment for categorical variables with many (e.g. hundreds or thousands of) categories.

A regression (or other statistical) model that incorporates random effects is called a MIXED EFFECT MODEL. (Mixed because most reasonable models have some fixed effects or numeric predictors as well)

Random effects notes source: Chapter 101 – Random effects, by Carl Schwarz

http://people.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDF/ChapterRandom.pdf


Ready to start mixing in some randomness?


PROC GLM is capable of incorporating random effects into models. We'll demonstrate this on the batting dataset.

proc import datafile='...Batting.csv'
    out=batting dbms=csv replace;
    delimiter=',';
    getnames=yes;
run;

data batting_600AB;
    set batting;
    batavg = H / AB;
    so_rate = SO / AB;
    if AB >= 600 then output;
run;


First, start with a regression model, like the rate of strikeouts over the years.

proc glm data=batting_600AB;
    model so_rate = yearID / solution;
run;


There is a lot of variation unexplained...


...like that from individual batters.


To add a random effect, use the RANDOM statement.

proc glm data=batting_600AB;
    class playerID;
    model so_rate = yearID playerID;
    random playerID / test;
run;

/* The order is ALWAYS class --> model --> random */


Tons of rainbows, what does it mean?!


Logistic Regression

Logistic regression is a variant of regression that is used when the response of interest isn't a continuous variable like strikeout rate or horsepower.

Instead, it's used when the response is a binary yes or no. It's used...

- In medical science (will this treatment work or not?)
- In e-mail filters (is this message spam or not?)
- In banking (will this borrower pay back their loan or not?)


It works by predicting the log-odds of the 'yes' response. Log-odds is related to probability, but it doesn't have the same [0-1] limitations that probability does.
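For example, a probability of p = 0.8 corresponds to odds of 0.8 / (1 - 0.8) = 4 and log-odds of ln(4) ≈ 1.39; p = 0.5 gives log-odds of exactly 0; and as p approaches 0 or 1, the log-odds run off to minus or plus infinity, so the scale is unbounded in both directions.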


We're going to use proc logistic to predict the level of contact that people in different neighbourhoods and housing situations have with their neighbours in the city of Copenhagen.

The two possible responses are 'low contact' and 'high contact'.

This is done with the Copenhagen datalines code you were given.


There are only 72 data lines, but these represent more than 1600 houses.

housing: the type of housing they had (1=tower blocks, 2=apartments, 3=atrium houses and 4=terraced houses),

influence: their feeling of influence on apartment management (1=low, 2=medium, 3=high),

contact: their degree of contact with neighbours (1=low, 2=high), and

satisfaction: their satisfaction with housing conditions (1=low, 2=medium, 3=high).

n: The number of houses in these categories
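If you don't have the handout handy, the datalines step presumably looks something like this sketch. The variable names match the descriptions above, but the rows shown are illustrative, not the real counts (the real file has 72 of them):

data copenhagen;
    input housing $ influence $ contact $ satisfaction $ n;
    datalines;
tower low low low 21
tower low low medium 21
tower low low high 28
;
run;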


First, some preliminary analysis with proc freq and proc means.

Notice the () in the TABLES statement, which means "make a two-way table of contact with each of the variables inside the ()".

proc freq data=copenhagen;
    tables contact*(housing influence satisfaction) / nocol norow nopercent;
    weight n;
run;


Also notice the WEIGHT statement. This tells SAS that each row doesn't just represent one observation, it represents n observations.

Every modelling PROC has a WEIGHT statement or something like it.


Without the WEIGHT statement, the crosstabs look like this.

It's just a count of the rows with each combination.


But when you have a proper WEIGHT statement, the crosstabs show the number of OBSERVATIONS in each cell.


In PROC FREQ and PROC GENMOD, the statement that indicates the number of observations is the WEIGHT statement. For these procedures, the weight doesn't have to be a whole number.

In the ANOVA, GLM, and GLMSELECT procedures, there is the FREQ statement instead, and its value does have to be an integer. A FREQ variable also affects things like standard errors, which tend to get smaller as sample sizes increase; a WEIGHT variable doesn't have the same effect.


In PROC LOGISTIC and PROC TTEST, there is both a FREQ statement to indicate the number of identical observations in a row, and a WEIGHT statement to indicate how important each row is.
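A minimal sketch of the difference in PROC LOGISTIC (the variable w here is hypothetical; the copenhagen data only carries the count n, so the WEIGHT line is left commented out):

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;            /* each row stands for n identical observations */
    /* weight w; */    /* w would give each row a relative importance  */
    model contact = housing influence satisfaction;
run;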


The basic syntax of PROC LOGISTIC follows the same patterns as the GLM and GLMSELECT procedures.

However, random effects don't work with the LOGISTIC proc.

proc logistic data=copenhagen;
    class <categorical predictors>;
    model <response> = <explanatory> / <options>;
    freq <varname>;
    output out=<dataset> <var=newname>;
run;


For predicting the level of neighbour contact:

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact = housing influence satisfaction;
run;


Here, everything is done predicting the chance of 'high'.


... which was decided by SAS, and it may not have been the category that we wanted to have as our 'yes' category.

To change this, define the 'yes' category that you want with the EVENT= option in the MODEL statement.

model contact(event = 'low') = housing influence satisfaction;
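In context, this is the same call as before with only the MODEL line changed:

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact(event='low') = housing influence satisfaction;
run;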


Significance levels are the same; the estimates are 'reversed' (they flip sign).


Don't confuse logistic models with logical models.


Model selection is done with the SELECTION option in the MODEL statement after the slash.

The DETAILS option tells SAS to report on the entire model selection process, not just the end result.

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact = housing influence satisfaction / selection=stepwise details;
run;


The ODDSRATIO statement shows the odds ratio (and the confidence interval of the odds ratio) of each category compared to the baseline for your selected variable.

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact = housing influence satisfaction;
    oddsratio housing;
run;


There is more than one link function: the link is the function used to convert probability, which is bounded in [0,1], into something that is unbounded.

This matters because, in its basic mechanics, logistic regression is doing something very similar to linear regression, and linear regression depends on the response variable being continuous and able, theoretically, to take any value.


The default link function is the 'logit' link, which is the one we use to put log-odds in the place of probability.

One common alternative is the 'probit' link, which uses the CDF of the normal distribution instead. The theory behind why one link is selected over any other link is graduate-level theory (in Generalized Linear Models), so for now I recommend using the default 'logit' most of the time.
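Concretely, writing p for the probability of the 'yes' response, the two links are:

    logit(p)  = ln( p / (1 - p) )
    probit(p) = Phi^{-1}(p),   where Phi is the CDF of the standard normal

Both map probabilities in (0,1) onto the entire real line.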


If there are numerical issues (e.g. failure to converge, nonsense summary data) with the logit link, you can treat probit as an 'alternate mode' of logistic regression, which may have better luck.

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact = housing influence satisfaction / link=probit;
run;


Why deal with link functions at all? Because there's another procedure called proc probit, which is like proc logistic, but is older with fewer features.

To avoid having to learn an outdated proc, if you ever have to use a probit link instead of a logit link, then just use the LINK option in the MODEL statement of PROC LOGISTIC.


We can also get additional summary data, such as Nagelkerke's R-squared (a logistic version of the regular r-squared) and the confidence limits of the odds ratios, with the rsquare and clodds options, respectively.

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact = housing influence satisfaction / rsquare clodds=wald;
run;


But what if there are more than two levels?


If you are trying to make predictions about a categorical response with more than two levels, there's one thing you have to ask before going any further.

Do the categories I wish to predict form a natural ordering (e.g. None, Low, Medium, High, Extreme), or are they just nominal, unordered categories (e.g. Cat, Dog, Dragon, Capybara)?


If the data is ordered, you can use proc logistic to conduct an ORDINAL LOGISTIC REGRESSION.

Just code your categories as integers {1,2,...,k} and use those coded categories as your response.

data copenhagen;
    set copenhagen;
    sat_lvl = 1;
    if satisfaction = 'medium' then sat_lvl = 2;
    if satisfaction = 'high' then sat_lvl = 3;
run;


The logistic procedure will understand that each integer value is an ordered category.

proc logistic data=copenhagen;
    class housing influence contact;
    freq n;
    model sat_lvl = housing influence contact / rsquare;
    oddsratio housing;
run;


With ordinal responses, all the effect sizes refer to the log-odds of any given response being in the 'next category up'.

Each response category after the first has its own intercept.
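Written out, this is the cumulative-logit (proportional odds) form, which is what proc logistic fits for an ordered response: for cutpoints j = 1, ..., k-1,

    logit( P(response <= j) ) = alpha_j + x'beta

with one intercept alpha_j per cutpoint but a single shared slope vector beta across all of them.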


If there is no natural ordering to the categories, you can use the generalized logit to do a logistic regression on a response with several unordered categories.

To do this, use the LINK option and set it to 'glogit'.

proc logistic data=copenhagen;
    class contact influence satisfaction;
    freq n;
    model housing = contact influence satisfaction / link=glogit;
run;


The results show the effect of each variable on the log-odds of an observation having the listed response (compared to the 'baseline' response).
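Written out: for each non-baseline category j of the response, the generalized logit model fits its own intercept and slopes,

    ln( P(response = j) / P(response = baseline) ) = alpha_j + x'beta_j

so every category except the baseline gets a separate set of coefficients, rather than the single shared set of the ordinal model.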
