
Stat 342 - Wk 12: Advanced regression and model building.

Mixed effect models – crash course

Mixed effect models (proc glm)

Logistic regression – crash course

Binary responses (proc logistic)

Ordinal and multinomial responses (proc logistic)


Last week, we examined complex models with proc glm and model selection with proc glmselect.

This week, we're going to introduce three major expansions to our library of regression tools.

1. Mixed effect models (proc glm, 'random' statement)

2. Logistic regression (proc logistic)

3. Maximum likelihood estimation (proc genmod)


Crash course on MIXED EFFECT MODELS

In previous lectures, we have used categorical variables like 'number of cylinders in an engine' or 'shelf that a brand of cereal is found on'.

We have treated every categorical variable in the same way, as if these were the only categories that were possibly of interest. In short, we have treated each of these variables like 'fixed effects' – things that only ever take on the specific categorical values that we observe in the data.


But there are two kinds of interpretations of categorical variables: fixed effect (standard, traditional), and RANDOM EFFECTS.

The distinction between a fixed and random effect factor often depends on what would happen if the experiment were to be repeated.


If the experiment were to be repeated, would the same levels be chosen (fixed effects), or would a new set of levels be chosen (random effects)?

The distinction can also be made by the scope of inference for the experiment. Are we interested in the effects of the levels that actually occurred in the experiment (fixed effects), or do we wish to generalize to a large population of levels from which we happened to choose a few for this experiment (random effects)?


Example 1: An experiment on the effects of soil compression on subsequent tree growth. Suppose that the experimenter obtained seedlings from several different seed sources.

This experiment could be viewed as having two factors (categorical variables) – the level of soil compression, and the seed source.


The soil compression is something under our control, and we're interested specifically in the levels of compaction given. THIS IS A FIXED EFFECT

By treating soil compression as a fixed effect, SAS (and R, and other programs) will return something like this:

Category             Estimate   Std. Err.   T       P-value
No Compression        0.00000   --          --      --
Light Compression     3.4235    1.512       2.264   .012
Medium Compression    6.1290    2.623       2.337   .010
Heavy Compression    11.907     1.997       5.962   < .0001


In short, we are given standard errors and hypothesis tests for each category compared to the baseline category.

We have a lot of details about these categories in particular, but the model can't be applied to additional categories.


The seed sources are not completely under our control. If we were to run this experiment again, we could end up with different batches of seeds.

Also, for an experiment like this, it doesn't make sense to apply the results to only those batches of seeds, but to ALL seeds of this species as a whole.

Therefore, it makes the most sense to treat seed source as a random effect. In essence, the categories we chose were randomly selected from a large pool of possible categories.


The model summary for a random effect might look like this... We have the change in the response mean (the 'effect') for each seed batch, but we don't have any way to tell if any particular batch is statistically significantly different from any other.

Category       Estimate   Std. Err.   T    P-value
Seed batch 1    0.00000   --          --   --
Seed batch 2    3.4235    NA          NA   NA
Seed batch 3   -2.908     NA          NA   NA
Seed batch 4   -7.124     NA          NA   NA


No p-values? Why deal with random effects at all?

We can find the amount of variance explained by the different seed batches in general, but can't isolate it to any given batch.

It lets us find the size of the fixed effects, like soil compression, while controlling for any additional effect from the different batches.


Also, we can add any other seed batch to the model by comparing its mean response to the baseline.

One final advantage of using random effects is that they are 'cheaper'. For fixed effects, each category after the first needs its own dummy variable and 'costs' us a degree of freedom.

For random effects, we can add new categories without losing degrees of freedom. New categories don't contribute to other complexity-based problems like collinearity (high VIF).
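To put numbers on it (illustrative figures, not from the seed data): a factor with 30 levels treated as a fixed effect needs 29 dummy variables, and so costs 29 degrees of freedom; treated as a random effect, it is summarized by a single variance component no matter how many levels there are.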


This makes random effects an ideal treatment for categorical variables with many (e.g. hundreds or thousands of) categories.

A regression (or other statistical) model that incorporates random effects is called a MIXED EFFECT MODEL. (Mixed because most reasonable models have some fixed effects or numeric predictors as well)

Random effects notes source: Chapter 101 – Random effects, by Carl Schwarz

http://people.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDF/ChapterRandom.pdf


Ready to start mixing in some randomness?


PROC GLM is capable of incorporating random effects into models. We'll demonstrate this on the batting dataset.

proc import datafile='...Batting.csv'
    out=batting dbms=csv replace;
    delimiter=',';
    getnames=yes;
run;

data batting_600AB;
    set batting;
    batavg = H / AB;
    so_rate = SO / AB;
    if AB >= 600 then output;
run;


First, start with a regression model, like the rate of strikeouts over the years.

proc glm data=batting_600AB;
    model so_rate = yearID / solution;
run;


There is a lot of variation unexplained...


...like that from individual batters.


To add a random effect, use the RANDOM statement.

proc glm data=batting_600AB;
    class playerID;
    model so_rate = yearID playerID;
    random playerID / test;
run;

/* The order is ALWAYS class --> model --> random */


Tons of rainbows, what does it mean?!


Logistic Regression

Logistic regression is a variant of regression that is used when the response of interest isn't a continuous variable like strikeout rate or horsepower.

Instead, it's used when the response is a binary yes or no. It's used...

- In medical science (will this treatment work or not?)
- In e-mail filters (is this message spam or not?)
- In banking (will this borrower pay back their loan or not?)


It works by predicting the log-odds of the 'yes' response. Log-odds is related to probability, but it doesn't have the same [0-1] limitations that probability does.
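For example, a probability of p = 0.8 corresponds to odds of 0.8 / (1 - 0.8) = 4 and log-odds of ln(4) ≈ 1.39; p = 0.5 gives log-odds of exactly 0; and as p approaches 0 or 1, the log-odds run off to minus or plus infinity, so the scale is unbounded in both directions.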


We're going to use proc logistic to predict the level of contact that people in different neighbourhoods and housing situations have with their neighbours in the city of Copenhagen.

The two possible responses are 'low contact' and 'high contact'.

This is done with the Copenhagen datalines code you were given.


There are only 72 data lines, but these represent more than 1600 houses.

housing: the type of housing they had (1=tower blocks, 2=apartments, 3=atrium houses and 4=terraced houses),

influence: their feeling of influence on apartment management (1=low, 2=medium, 3=high),

contact: their degree of contact with neighbours (1=low, 2=high), and

satisfaction: their satisfaction with housing conditions (1=low, 2=medium, 3=high).

n: The number of houses in these categories
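If you don't have the handout handy, the datalines step presumably looks something like this sketch. The variable names match the descriptions above, but the rows shown are illustrative, not the real counts (the real file has 72 of them):

data copenhagen;
    input housing $ influence $ contact $ satisfaction $ n;
    datalines;
tower low low low 21
tower low low medium 21
tower low low high 28
;
run;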


First, some preliminary analysis with proc freq and proc means.

Notice the () in the TABLES statement, which means "make a two-way table of contact with each of the variables inside the ()".

proc freq data=copenhagen;
    tables contact*(housing influence satisfaction) / nocol norow nopercent;
    weight n;
run;


Also notice the WEIGHT statement. This tells SAS that each row doesn't just represent one observation, it represents n observations.

Every modelling PROC has a WEIGHT statement or something like it.


Without the WEIGHT statement, the crosstabs look like this.

It's just a count of the rows with each combination.


But when you have a proper WEIGHT statement, the crosstabs show the number of OBSERVATIONS in each cell.


In PROC FREQ and PROC GENMOD, the statement that indicates the number of observations is the WEIGHT statement. For these procedures, the weight doesn't have to be a whole number.

In the ANOVA, GLM, and GLMSELECT procedures, there is the FREQ statement instead, and its value does have to be an integer. A FREQ variable also affects things like standard errors, which tend to get smaller as sample sizes increase; a WEIGHT variable doesn't have the same effect.


In PROC LOGISTIC and PROC TTEST, there is both a FREQ statement to indicate the number of identical observations in a row, and a WEIGHT statement to indicate how important each row is.
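A minimal sketch of the difference in PROC LOGISTIC (the variable w here is hypothetical; the copenhagen data only carries the count n, so the WEIGHT line is left commented out):

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;            /* each row stands for n identical observations */
    /* weight w; */    /* w would give each row a relative importance  */
    model contact = housing influence satisfaction;
run;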


The basic syntax of PROC LOGISTIC follows the same patterns as the GLM and GLMSELECT procedures.

However, random effects don't work with the LOGISTIC proc.

proc logistic data=copenhagen;
    class <categorical predictors>;
    model <response> = <explanatory> / <options>;
    freq <varname>;
    output out=<dataset> <var=newname>;
run;


For predicting the level of neighbour contact:

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact = housing influence satisfaction;
run;


Here, everything is done predicting the chance of 'high'.


... which was decided by SAS, and it may not have been the category that we wanted to have as our 'yes' category.

To change this, define the 'yes' category that you want with the EVENT= option in the MODEL statement.

model contact(event = 'low') = housing influence satisfaction;
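In context, this is the same call as before with only the MODEL line changed:

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact(event='low') = housing influence satisfaction;
run;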


Significance levels are the same; the estimates are 'reversed' (they flip sign).


Don't confuse logistic models with logical models.


Model selection is done with the SELECTION option in the MODEL statement after the slash.

The DETAILS option tells SAS to report on the entire model selection process, not just the end result.

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact = housing influence satisfaction / selection=stepwise details;
run;


The ODDSRATIO statement shows the odds ratio (and the confidence interval of the odds ratio) of each category compared to the baseline for your selected variable.

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact = housing influence satisfaction;
    oddsratio housing;
run;


There is more than one link function: the link is the function used to convert probability, which is bounded in [0,1], into something that is unbounded.

This matters because, in its basic mechanics, logistic regression is doing something very similar to linear regression, and linear regression depends on the response variable being continuous and able, theoretically, to take any value.


The default link function is the 'logit' link, which is the one we use to put log-odds in the place of probability.

One common alternative is the 'probit' link, which uses the CDF of the normal distribution instead. The theory behind why one link is selected over any other link is graduate-level theory (in Generalized Linear Models), so for now I recommend using the default 'logit' most of the time.
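Concretely, writing p for the probability of the 'yes' response, the two links are:

    logit(p)  = ln( p / (1 - p) )
    probit(p) = Phi^{-1}(p),   where Phi is the CDF of the standard normal

Both map probabilities in (0,1) onto the entire real line.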


If there are numerical issues (e.g. failure to converge, nonsense summary data) with the logit link, you can treat probit as an 'alternate mode' of logistic regression, which may have better luck.

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact = housing influence satisfaction / link=probit;
run;


Why deal with link functions at all? Because there's another procedure called proc probit, which is like proc logistic, but is older with fewer features.

To avoid having to learn an outdated proc, if you ever have to use a probit link instead of a logit link, then just use the LINK option in the MODEL statement of PROC LOGISTIC.


We can also get additional summary data, such as Nagelkerke's R-squared (a logistic version of the regular r-squared) and the confidence limits of the odds ratios, with the rsquare and clodds options, respectively.

proc logistic data=copenhagen;
    class housing influence satisfaction;
    freq n;
    model contact = housing influence satisfaction / rsquare clodds=wald;
run;


But what if there are more than two levels?


If you are trying to make predictions about a categorical response with more than two levels, there's one thing you have to ask before going any further.

Do the categories I wish to predict form a natural ordering (e.g. None, Low, Medium, High, Extreme), or are they just nominal, unordered categories (e.g. Cat, Dog, Dragon, Capybara)?


If the data is ordered, you can use proc logistic to conduct an ORDINAL LOGISTIC REGRESSION.

Just code your categories as integers {1,2,...,k} and use those coded categories as your response.

data copenhagen;
    set copenhagen;
    sat_lvl = 1;
    if satisfaction = 'medium' then sat_lvl = 2;
    if satisfaction = 'high' then sat_lvl = 3;
run;


The logistic procedure will understand that each integer value is an ordered category.

proc logistic data=copenhagen;
    class housing influence contact;
    freq n;
    model sat_lvl = housing influence contact / rsquare;
    oddsratio housing;
run;


With ordinal responses, all the effect sizes refer to the log-odds of any given response being in the 'next category up'.

Each response category after the first has its own intercept.
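Written out, this is the cumulative-logit (proportional odds) form, which is what proc logistic fits for an ordered response: for cutpoints j = 1, ..., k-1,

    logit( P(response <= j) ) = alpha_j + x'beta

with one intercept alpha_j per cutpoint but a single shared slope vector beta across all of them.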


If there is no natural ordering to the categories, you can use the generalized logit to do a logistic regression on a response with several unordered categories.

To do this, use the LINK option and set it to 'glogit'.

proc logistic data=copenhagen;
    class contact influence satisfaction;
    freq n;
    model housing = contact influence satisfaction / link=glogit;
run;


The results show the effect of each variable on the log-odds of an observation having the listed response (compared to the 'baseline' response).
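Written out: for each non-baseline category j of the response, the generalized logit model fits its own intercept and slopes,

    ln( P(response = j) / P(response = baseline) ) = alpha_j + x'beta_j

so every category except the baseline gets a separate set of coefficients, rather than the single shared set of the ordinal model.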
