Why Dummy Variable Makes You SMART and How to Do It SEXY


Paper 74902

Guanghui (Brian) Sun, Rick Hansen Institute, Department of Orthopaedics, University of British Columbia, Vancouver, BC

ABSTRACT

When a categorical variable serves as an independent variable in a regression analysis, should you include it directly in the model with the CLASS statement, or instead include multiple dummy variables converted from its levels? In theory the two approaches yield equivalent estimates. The CLASS statement offers various parameterization options, but its availability and power differ across SAS regression procedures. The dummy variable approach therefore remains appealing for the consistency, flexibility, and transparency of its parameterization. In practice it also has advantages for ODS statistical graphics output, Boolean calculations on variables, and model identification for better fit. However, coding dummies by hand can be tedious and painful. This paper introduces two approaches to automatic dummy variable generation that make dummy variables easy and efficient to apply.

INTRODUCTION

In regression analysis, a dummy variable (also known as an indicator variable, or just a dummy) takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. When a categorical variable is considered for a model, a parameterization step, which converts the categorical variable into dummy variables, is required to create the design matrix. This parameterization can be done either with the CLASS statement or by coding the dummy variables yourself.

Take the following model as an example,

Model: Length of Stay (LOS) in Hospital = Age + Injury Level (high, medium, low)

CLASS Statement Approach

proc glm data=patient;
   class injury_level;
   model LOS=age injury_level;
run;
quit;

Dummy Variable Approach without CLASS Statement

proc glm data=patient;
   model LOS=age injury_high injury_medium;
run;
quit;
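The dummy variables named in the MODEL statement must already exist in the dataset. A minimal sketch of hand-coding them follows, assuming injury_level is a character variable holding the values high, medium, and low; the output dataset name patient_dummy_manual is hypothetical.

```sas
/* Sketch: hand-code two indicators, leaving 'low' as the reference level.   */
/* Assumes INJURY_LEVEL is character with values high / medium / low.        */
data patient_dummy_manual;
   set patient;
   /* A SAS comparison evaluates to 1 (true) or 0 (false). */
   injury_high   = (lowcase(injury_level) = 'high');
   injury_medium = (lowcase(injury_level) = 'medium');
run;
```

Note that a record with a missing injury_level gets 0 for both dummies in this sketch and is therefore silently treated as the reference level, so missing values may need separate handling.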

In theory the two approaches are equivalent in terms of estimation results and produce almost identical SAS output (the fitted values agree, and the individual coefficients match when both approaches use the same reference level). The CLASS statement offers various parameterization options. However, its availability and power differ across SAS regression procedures, as Table 1 shows.

The CLASS statement in the LOGISTIC, GENMOD, and GLMSELECT procedures is very powerful. However, using that power correctly requires a thorough understanding of the CLASS statement syntax. For more details, please refer to Michelle Pritchard and David Pasta’s 2004 paper: "Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC."

The CLASS statement is still very limited, or even unavailable, in some very popular SAS procedures, such as GLM and REG. For beginners who feel overwhelmed by the volume of syntax documentation, and for power modelers who need more freedom to tailor the parameterization and design matrix, the dummy variable approach remains appealing for its flexibility, transparency, and consistency across all SAS procedures. As a matter of fact, using dummy variables is also rewarding in practice in several respects, such as ODS statistical graphics output, Boolean calculations on variables, and model identification for better fit.


SAS Procedure REG GLM MIXED GLIMMIX GLMSELECT GENMOD LOGISTIC

CLASS Statement Availability √ √ √ √ √ √

CLASS Statement Options

Parameterization Methods

EFFECT √ √ √ √ √ √

GLM √ √ √

ORDINAL √ √ √

POLYNOMIAL √ √ √

REFERENCE √ √ √

ORTHEFFECT √ √ √

ORTHORDINAL √ √ √

ORTHOTHERM √ √ √

ORTHPOLY √ √ √

ORTHREF √ √ √

Level Sorting √ √ √ √ √ √

Reference Level Selection √ √ √

Table 1. Summary of CLASS statement in SAS regression procedures.

USING DUMMY VARIABLES IS SMART

Although choosing different baselines for categorical independent variables seems only to result in differences in the model intercept, a study (2007) by Wissmann, Toutenburg, and Shalabh shows that the choice of reference category can affect the degree of multicollinearity. In addition, choosing a reasonable reference is also critical for model interpretation and application.

TAKE FULL CONTROL OF THE BASELINE

Unfortunately, the CLASS statement approach is constrained in how the reference level of a categorical variable can be chosen. In the GLM procedure, the reference is automatically the last level in alphabetical sort order. In other procedures, such as LOGISTIC, the reference category can be specified on the CLASS statement.

CLASS statement Approach

proc glm data=patient;
   class injury_level;
   model LOS=age injury_level;
run;
quit;

Injury_level has three values: high, medium, and low. Sorted alphabetically they are high, low, medium, so the GLM procedure selects medium as the reference, which is awkward for interpretation.

Dummy Variable Approach without CLASS statement

proc glm data=patient;
   model LOS=age injury_high injury_medium;
run;
quit;

By excluding the Injury_low dummy, the low injury level is set as the reference level.

proc glm data=patient;
   model LOS=age injury_high;
run;
quit;

By excluding both the Injury_medium and Injury_low dummies, the combined low/medium injury level is set as the reference level.


As shown above, the dummy variable approach is very flexible in defining the reference. Any dummy variable excluded from the model automatically becomes part of the reference.
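To make the role of the reference explicit, the first dummy variable model can be written out level by level. The coefficient symbols below are notation introduced here for illustration; they do not come from SAS output.

```latex
\begin{align*}
E[\mathrm{LOS}] &= \beta_0 + \beta_1\,\mathrm{age}
                 + \beta_2\,\mathrm{injury\_high}
                 + \beta_3\,\mathrm{injury\_medium} \\[4pt]
\text{low (reference):}\quad E[\mathrm{LOS}] &= \beta_0 + \beta_1\,\mathrm{age} \\
\text{medium:}\quad E[\mathrm{LOS}] &= (\beta_0 + \beta_3) + \beta_1\,\mathrm{age} \\
\text{high:}\quad E[\mathrm{LOS}] &= (\beta_0 + \beta_2) + \beta_1\,\mathrm{age}
\end{align*}
```

Here the coefficients on the two dummies are shifts relative to the low level. Dropping injury_medium as well amounts to fixing its coefficient at zero, which pools low and medium into a single baseline.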

As an extension of this discussion, the dummy variable approach also offers more flexibility in model selection procedures, such as stepwise selection or AIC-based selection. The following example illustrates the point.

Suppose we are interested in understanding the relationship of smoking or not (binary variable) with age (continuous), education and other covariates (X1-X5). Education is a categorical variable with 7 levels: 1) Primary School or Below, 2) Middle School, 3) High School, 4) College Diploma or Certificate, 5) Bachelor Degree, 6) Master Degree, 7) Ph.D Degree.

Model: Smoking = Age + Education + X1 + X2 + X3 + X4 + X5

CLASS Statement Approach (Stepwise Selection)

proc logistic data=patient;
   class education(param=ref ref='Primary School or Below');
   model Smoking(event='Yes')=age education x1-x5 / selection=stepwise;
run;

Stepwise Selection Result: Smoking = Age + Education + X1 + X5

Dummy Variable Approach without CLASS Statement (Stepwise Selection)

Model: Smoking = Age + Edu_midschool + Edu_higschool + Edu_college + Edu_bache + Edu_master + Edu_phd + X1 + X2 + X3 + X4 + X5

proc logistic data=patient;
   model Smoking(event='Yes')=age edu_midschool edu_higschool edu_college edu_bache edu_master edu_phd x1-x5 / selection=stepwise;
run;

Stepwise Selection Result: Smoking = Age + edu_bache + edu_master + edu_phd + X1 + X5

With the CLASS statement, the model selection algorithm includes or excludes the education variable as a whole. The dummy variable approach, by contrast, lets the algorithm include or exclude each education level separately, because the education variable is broken into multiple binary dummy variables.

With the dummy variable approach, the stepwise selection result implies that only degree-level education (bachelor's and above) has a significant effect on smoking; the levels below the bachelor's degree are automatically merged into the estimation baseline. The final model using dummy variables is more parsimonious: education contributes three parameters instead of six.

TAKE FULL ADVANTAGE OF PROC REG WITH ODS STATISTICAL GRAPHICS

ODS statistical graphics was experimental in SAS 9.1 and is production in SAS 9.2, having been totally reworked and expanded. In SAS 9.2, if you include the ODS GRAPHICS statement before the REG procedure, the procedure produces a set of diagnostic plots, including a scatter plot of the input data overlaid with the fitted regression line, confidence band, and prediction limits, and a plot of the residuals versus the regressor in the model, as shown in Figure 1.

ods graphics on;
proc reg data=patient;
   model LOS=age injury_high injury_medium;
run;
quit;


Figure 1. Diagnosis plots of Proc REG by ODS graphics

However, this graphics feature is so far not available in the other procedures (such as the GLM procedure), and the CLASS statement cannot be used in the REG procedure. If you want both categorical variables in the model and diagnostic plots in the output, the dummy variable approach is probably the only choice.

EASILY RECODE VARIABLE WITH BOOLEAN OPERATION

Figure 2. Diagram of Boolean operations

By nature, a dummy variable is a logical proposition with a value of true (1) or false (0), which makes it easy to recode variables and create calculated variables with Boolean operations.

Example:

A: patients with pneumonia as a complication after surgery
   dummy variable representation: pneumonia_dummy=1 (without pneumonia: pneumonia_dummy=0)

B: patients with neuropathic pain as a complication after surgery
   dummy variable representation: pain_dummy=1 (without pain: pain_dummy=0)

Variable Creation by Boolean Operations:

A AND B: patients with both pneumonia and pain after surgery
   Pneumonia_pain_both=pneumonia_dummy*pain_dummy
   or Pneumonia_pain_both=min(pneumonia_dummy,pain_dummy)

A OR B: patients with at least one of the two complications (pneumonia or pain) after surgery
   Pneumonia_pain_either=(pneumonia_dummy+pain_dummy > 0)
   or Pneumonia_pain_either=max(pneumonia_dummy,pain_dummy)

A XOR B: patients with exactly one of the two complications (one but not both) after surgery
   Pneumonia_pain_xor=pneumonia_dummy+pain_dummy-2*pneumonia_dummy*pain_dummy
   or Pneumonia_pain_xor=max(pneumonia_dummy,pain_dummy)-min(pneumonia_dummy,pain_dummy)
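The Boolean recodes can be checked against a small truth table. The sketch below builds the four possible combinations of the two dummies with DATALINES and applies the arithmetic forms from the example; the dataset name boolean_check is hypothetical.

```sas
/* Truth table for the Boolean recodes; variable names follow the example. */
data boolean_check;
   input pneumonia_dummy pain_dummy;
   pneumonia_pain_both   = pneumonia_dummy * pain_dummy;          /* A AND B */
   pneumonia_pain_either = (pneumonia_dummy + pain_dummy > 0);    /* A OR B  */
   pneumonia_pain_xor    = pneumonia_dummy + pain_dummy
                           - 2 * pneumonia_dummy * pain_dummy;    /* A XOR B */
   datalines;
0 0
0 1
1 0
1 1
;
run;

proc print data=boolean_check noobs;
run;
```

One design point worth noting: the arithmetic forms and the MIN/MAX forms agree on 0/1 input but can disagree when a dummy is missing, because SAS arithmetic propagates missing values while MIN and MAX ignore them. Handling missing values explicitly before recoding avoids the ambiguity.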

MODEL NON-LINEAR RELATIONSHIP IN A LINEAR WAY

A typical question is how to model seasonality.

Figure 3. Non-linear pattern of seasonality

As shown in Figure 3, the season variable is numeric, but the relationship between season and the response is non-linear. Creating season dummy variables captures this relationship more accurately than applying a complex mathematical transformation to the variable. In a nutshell, the dummy variables model a non-linear relationship in a linear form.
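A minimal sketch of the season dummies follows. The dataset name sales, the response y, and the coding of season as the integers 1 to 4 are all assumptions made for illustration; the paper does not specify them.

```sas
/* Hypothetical input: dataset SALES with numeric SEASON (1-4) and response Y. */
data sales_dummy;
   set sales;
   /* Season 1 is left out and serves as the reference level. */
   season_2 = (season = 2);
   season_3 = (season = 3);
   season_4 = (season = 4);
   /* Keep records with unknown season out of the reference group. */
   if missing(season) then call missing(season_2, season_3, season_4);
run;

proc reg data=sales_dummy;
   model y = season_2 season_3 season_4;
run;
quit;
```

Each dummy gets its own coefficient, so the fitted seasonal means are free to rise and fall in any pattern, whereas entering season as a single numeric regressor would force a straight line through the four seasons.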

CAPTURE STRUCTURAL CHANGE IN STATISTICAL MODELING

Dummy variables are also widely used to capture structural change in many kinds of research. A typical area is incident analysis, which evaluates the structural change before and after a specific event. The simple way is to introduce an event dummy variable (before=0, after=1) into the model to capture the offset effect of the event on the response. A more sophisticated method is poly-line fitting, or the piecewise linear approach, which uses dummy variables to capture both the offset effect and the change in slope caused by the event. An example follows:

Model 1: Linear fitting
   Years Lived After Injury = Age at Injury + Other Covariates

Model 2: Polynomial fitting
   Years Lived After Injury = Age at Injury + (Age at Injury)**2 + Other Covariates

Model 3: Poly-line (Piecewise Linear) Fitting
   Years Lived After Injury = Age at Injury + Age_50 + Age_50_inter + Other Covariates

   Age_50 = 1 if age at injury <= 50
            0 otherwise

   Age_50_inter = Age_50*(Age at Injury)
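Writing Model 3 out on each side of the breakpoint shows how the two dummy-based terms capture the offset and the slope change. The symbols below are notation introduced here for illustration, with a standing for age at injury, D for Age_50, and the other covariates omitted.

```latex
\begin{align*}
y &= \beta_0 + \beta_1 a + \beta_2 D + \beta_3 D\,a,
  \qquad D = \begin{cases} 1 & a \le 50 \\ 0 & \text{otherwise} \end{cases} \\[4pt]
a \le 50:\quad y &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,a \\
a > 50:\quad y &= \beta_0 + \beta_1 a
\end{align*}
```

The coefficient on Age_50 shifts the intercept and the coefficient on Age_50_inter shifts the slope for patients injured at age 50 or younger. As written, nothing forces the two line segments to meet at age 50.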


Figure 4. Fitting comparison of 3 modeling methods

As Figure 4 shows, the poly-line (piecewise linear) model fits better than the other two models in terms of goodness of fit.
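A minimal sketch of fitting the poly-line model is shown below. The variable names years_lived and age_at_injury and the dataset name patient_pw are hypothetical, and the other covariates are omitted for brevity.

```sas
/* Hypothetical names: dataset PATIENT with YEARS_LIVED and AGE_AT_INJURY. */
data patient_pw;
   set patient;
   /* Guard against missing age: SAS treats a missing numeric value as     */
   /* smaller than any number, so (. <= 50) would otherwise evaluate to 1. */
   if not missing(age_at_injury) then do;
      age_50       = (age_at_injury <= 50);
      age_50_inter = age_50 * age_at_injury;
   end;
run;

proc reg data=patient_pw;
   model years_lived = age_at_injury age_50 age_50_inter;
run;
quit;
```

Records with a missing age keep missing values for both constructed variables and are dropped from the fit by PROC REG, rather than being miscounted as patients aged 50 or younger.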

For additional suggestions, see Pasta (2005) “Parameterizing models to test the hypotheses you want: Coding indicator variables and modified continuous variables.”

HOW TO DO DUMMY VARIABLE SEXY

Creating dummy variables manually with IF-THEN-ELSE coding is tedious and time-consuming. We need smarter ways to generate them. Two automatic approaches are introduced here.

GLMMOD PROCEDURE

The GLMMOD procedure constructs the design matrix for a general linear model and, in doing so, automatically codes the categorical variables in the dataset. Consider a patient dataset with length of stay in hospital (LOS), age at injury, and injury level (high, medium, low) as a categorical variable, as Table 2 shows:

age    Injury level    LOS
36     high            208
42     low             120
21     medium          178
…      …               …

Table 2. Patient dataset

Run the following code:

proc glmmod data=patient OUTDESIGN=patient_dummy;
   class injury_level;
   model LOS=age injury_level;
run;

Figure 4 annotation (adjusted R-squared): linear fit 0.52; polynomial fit 0.63; poly-line fit 0.75.

The OUTDESIGN= option writes the design matrix to an output dataset in which the Injury_level variable is converted into three dummy variables, one for each level (high, medium, and low).

LOS    Intercept    age    injury_level high    injury_level low    injury_level medium
208    1            36     1                    0                   0
120    1            42     0                    1                   0
178    1            21     0                    0                   1
…      …            …      …                    …                   …

Table 3. Output dataset by OUTDESIGN= option

Note that the header of Table 3 shows the variable labels, not the variable names. The properties of the output design table give the actual names:

Column Name    Label
LOS
Col1           Intercept
Col2           age
Col3           injury_level high
Col4           injury_level low
Col5           injury_level medium

Table 4. Variable name and label in output dataset by OUTDESIGN= option

All the independent variables, including the dummy variables and the intercept, are named Col1-ColN.
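Because the generic Col names are hard to read in a MODEL statement, it may help to rename them before modeling. The sketch below assumes the output dataset has exactly the layout of Table 4; the new variable names and the dataset name patient_model are hypothetical.

```sas
/* Assumes PATIENT_DUMMY has the layout of Table 4:                 */
/* Col1 = intercept, Col2 = age, Col3-Col5 = injury level dummies.  */
data patient_model;
   set patient_dummy(rename=(col2=age
                             col3=injury_high
                             col4=injury_low
                             col5=injury_medium));
   drop col1;   /* the regression procedure supplies its own intercept */
run;

proc reg data=patient_model;
   model LOS=age injury_high injury_medium;   /* low is the reference */
run;
quit;
```

All three dummies are available after the rename, so the reference level is chosen simply by deciding which one to leave out of the MODEL statement.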

%AUTO_DUMMY_VARIABLE

The SAS macro I developed automates dummy variable generation with more functionality, such as controlling the maximum number of levels, coding multiple-options text, and coding a dummy for missing values. The syntax of the macro is as follows:

%Auto_Dummy_Variable ( tablename=, variablename=, outputtable=, MissingDummy=, MaxLevel=, delimiter= );

   tablename=      Input table name
   variablename=   Input variable name, or a batch mode: _ALL_ (all variables), _NUMERIC_ (all numeric variables), _CHAR_ (all character variables)
   outputtable=    Output table name
   MissingDummy=   Whether to code missing values as a dummy variable: Y (yes) or N (no, the default)
   MaxLevel=       Maximum number of levels considered for coding dummy variables; the default is 10
   delimiter=      Delimiter for coding a multiple-options text variable

Figure 5. Syntax of %Auto_Dummy_Variable


Key features of %AUTO_DUMMY_VARIABLE:

o Automatically detecting all levels/values of categorical/numerical variable, and creating dummy variables accordingly.

o Automatically naming and labeling the dummy variables.

o Batch mode: Automatically making dummy variables for all variables or all character/numerical variables in a table.

o Automatically excluding ID variables from dummy variable creation by limiting the maximum number of levels considered for coding dummy variables.

o Missing values: automatically creating a dummy variable for missing values on request.

o With a delimiter specified, automatically coding dummy variables for a multiple-options text variable.

Example: Complications="Pneumonia | Neuropathic Pain" yields Complications_Pneumonia=1 and Complications_NeuroPain=1

Examples of Use:

o create dummy variables for one variable:

%Auto_Dummy_Variable (tablename=patient, variablename=gender, outputtable=patient);

o create dummy variables for all variables in a table:

%Auto_Dummy_Variable (tablename=patient, variablename=_All_, outputtable=patient);

Note: If the number of distinct values of a variable is greater than MaxLevel (10 by default), the variable is automatically excluded from dummy variable generation.

o create dummy variables for a multiple-options text variable:

%Auto_Dummy_Variable (tablename=patient, variablename=complications, outputtable=patient, delimiter=|);
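The optional parameters can be combined in one call. The sketch below uses the parameter names as they are defined in the appendix macro (outputtable=, MissingDummy=, MaxLevel=); the output dataset name patient_dv is hypothetical, and the macro must have been compiled in the session first.

```sas
/* Code injury_level, add a dummy for missing values, and skip the     */
/* variable if it has more than 5 distinct levels.                     */
%Auto_Dummy_Variable(tablename=patient, variablename=injury_level,
                     outputtable=patient_dv, MissingDummy=Y, MaxLevel=5);

/* Inspect the generated variables. Per the appendix code, names start */
/* with DV_ followed by the source variable name and a level number.   */
proc contents data=patient_dv(keep=DV_:) varnum;
run;
```

Based on the naming logic in the appendix, this call should produce variables such as DV_injury_level1 plus DV_injury_level_Missing, each labeled with the level it represents, so the PROC CONTENTS listing serves as the lookup table between dummy names and levels.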

CONCLUSION

The enhanced CLASS statement in some SAS regression procedures offers various options for parameterizing categorical variables. However, the dummy variable approach remains appealing for its consistency, flexibility, and transparency, as long as we have a smart and efficient way to apply it. In practice it also has benefits for ODS statistical graphics output, Boolean calculations on variables, and model identification for better fit. The GLMMOD procedure is a shortcut for automating dummy variable creation. The SAS macro is more powerful, with more controls and functionality.

REFERENCES

1. Suits, Daniel B. (1984), “Dummy Variables: Mechanics v. Interpretation,” Review of Economics and Statistics, Vol. 66, No. 1 (February), 177-180


2. Pritchard, Michelle L. and Pasta, David J. (2004), “Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC,” Proceedings of the Twenty-Ninth Annual SAS Users Group International Conference, Paper 194-29

3. Pasta, David J. (2005), “Parameterizing models to test the hypotheses you want: coding indicator variables and modified continuous variables,” Proceedings of the Thirtieth Annual SAS Users Group International Conference, 212-30

4. Martin, Kathryn (2010), “Dummy Coding for Dummies,” Proceedings of the Western Users of SAS Software Conference 2010

5. Wissmann, M., Toutenburg, H. and Shalabh (2007), “Role of Categorical Variables in Multicollinearity in Linear Regression Model,” Technical Report No. 8, Department of Statistics, University of Munich, Munich, Germany

6. Carpenter, Art (2004), “Carpenter’s Complete Guide to the SAS Macro Language,” 2nd edition

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Guanghui (Brian) Sun
Rick Hansen Institute / Department of Orthopaedics, University of British Columbia
6400 - 818 West 10th Avenue
Blusson Spinal Cord Centre
Vancouver, British Columbia, V5Z 1M9
Email: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.


APPENDIX: THE SAS CODE FOR %AUTO_DUMMY_VARIABLE

%macro Auto_Dummy_Variable(tablename=,variablename=,outputtable=,MissingDummy=N,
                           MaxLevel=10,delimiter=);
   %local _N_Vars_ var_name variabletype _Value_Raw_List_ dsid _N_Var_values_
          missing_dummy_name missing_dummy_label MissingMark i kkk;

   /*create variable list for automatically generating dummy variables*/
   proc contents data=&tablename.(keep=&variablename.) noprint varnum
                 out=_var_list_(keep=name rename=(name=VariableName));
   run;

   %if &tablename.^=&outputtable. %then %do;
      data &outputtable.;
         set &tablename.;
      run;
   %end;

   %let _N_Vars_=0;
   %let fid=%sysfunc(open(work._var_list_));
   %let _N_Vars_=%sysfunc(attrn(&fid,NOBS));
   %let fid=%sysfunc(close(&fid));

   %if &_N_Vars_>0 %then %do;
      Data _null_;
         set _var_list_;
         call symputx('_Var_name_'||left(put(_n_,8.)),VariableName);
      run;

      %do kkk=1 %to &_N_Vars_.;
         %let Var_name=&&_Var_name_&kkk.;

         /*get variable type*/
         %let dsid=%sysfunc(open(&tablename.,i));
         %let variabletype=%sysfunc(vartype(&dsid,%sysfunc(varnum(&dsid,&var_name.))));
         %let rc=%sysfunc(close(&dsid));
         %let _Value_Raw_List_=&tablename.;

         /*split the multiple options to get values*/
         %if %length(&delimiter.)^=0 %then %do;
            %let _Value_Raw_List_=_zzz_;
            data _zzz_(drop=_iii_);
               set &tablename.;
               _iii_=1;
               do until(scan(&var_name.,_iii_,"&delimiter.")='');
                  _vvv_=scan(&var_name.,_iii_,"&delimiter.");
                  output;
                  _iii_=_iii_+1;
               end;
               drop &var_name.;
               rename _vvv_=&var_name.;
            run;
         %end;

         /*create list of distinct values*/
         proc sql noprint;
            create table _Value_distinct_List_ as
            select distinct &var_name. as Variable_Value label="Variable Value"
            from &_Value_Raw_List_.;
         quit;

         /*create dummy variable name and label*/
         data _Value_distinct_List_;
            format variable_name $32.;
            set _Value_distinct_List_;
            format dummy_name $32.;
            format dummy_label $500.;
            label Variable_name="Variable Name"
                  dummy_name="Dummy Variable Name"
                  dummy_label="Dummy Variable Label";
            %if &variabletype.=C %then %do;
               variable_value=propcase(variable_value);
            %end;
            variable_name="&var_name.";
            dummy_name=cats("DV_",substr("&var_name.",1,min(24,length("&var_name."))),_N_);
            dummy_label=cats("Dummy Variable||","&var_name.",":",Variable_Value);
            where Variable_Value is not null;
         run;

         %let missing_dummy_name=DV_%substr(&var_name.,1,%sysfunc(min(21,%length(&var_name.))))_Missing;
         %let missing_dummy_label=Dummy Variable||&var_name.: Missing Value;

         /*create macro matrix*/
         %let _N_Var_values_=0;
         %let fid=%sysfunc(open(work._value_distinct_list_));
         %let _N_Var_values_=%sysfunc(attrn(&fid,NOBS));
         %let fid=%sysfunc(close(&fid));
         %put N of levels = &_N_Var_values_.;
         %put Max number of levels considered for coding dummy variables = &maxlevel.;
         %if "&MaxLevel"="" %then %let Maxlevel=1000;

         %if &_N_Var_values_.>0 and &_N_Var_values_.<=&maxlevel. %then %do;
            Data _null_;
               set _value_distinct_list_;
               %if &variabletype.=N %then %do;
                  call symputx('_var_value_'||left(put(_n_,8.)),Variable_Value);
               %end;
               %if &variabletype.=C %then %do;
                  call symputx('_var_value_'||left(put(_n_,8.)),cats('"', Variable_Value, '"'));
               %end;
               call symputx('_dummyname_'||left(put(_n_,8.)),dummy_name);
               call symputx('_dummylabel_'||left(put(_n_,8.)),dummy_label);
            run;

            /*set Missing Mark*/
            %if &variabletype.=N %then %let MissingMark=.;
            %if &variabletype.=C %then %let MissingMark="";

            /*Start to generate dummy variables*/
            Data &outputtable.;
               set &outputtable.;
               _var_temp_=&var_name.;
               %if &variabletype.=C %then %do;
                  _var_temp_=propcase(_var_temp_);
               %end;

               /*generate non-missing dummy variables*/
               if _var_temp_^= &MissingMark. then do;
                  /*if the variable is multiple options text variable*/
                  %if %length(&delimiter.)^=0 %then %do;
                     %do i=1 %to &_N_Var_values_;
                        &&_dummyname_&i.=0;
                        label &&_dummyname_&i.="&&_dummylabel_&i.";
                     %end;
                     _iii_=1;
                     do until(scan(_var_temp_,_iii_,"&delimiter.")='');
                        _jjj_=scan(_var_temp_,_iii_,"&delimiter.");
                        %if &variabletype.=C %then %do;
                           _jjj_=propcase(_jjj_);
                        %end;
                        %do i=1 %to &_N_Var_values_;
                           if _jjj_=&&_var_value_&i then &&_dummyname_&i.=1;
                        %end;
                        _iii_=_iii_+1;
                     end;
                     drop _iii_ _jjj_;
                  %end;
                  /*if the variable is not multiple options text variable*/
                  %else %do i=1 %to &_N_Var_values_;
                     if _var_temp_=&&_var_value_&i then &&_dummyname_&i.=1;
                     Else &&_dummyname_&i.=0;
                  %end;
               end;

               /*generate missing dummy variable*/
               %if &MissingDummy=Y %then %do;
                  if _var_temp_= &MissingMark. then do;
                     %do i=1 %to &_N_Var_values_;
                        if &&_dummyname_&i.=. then &&_dummyname_&i.=0;
                     %end;
                     &missing_dummy_name=1;
                  end;
                  else &missing_dummy_name=0;
                  label &missing_dummy_name="&missing_dummy_label";
               %end;
               drop _var_temp_;
            run;
         %end;

         proc datasets library=work nolist;
            delete _Value_distinct_List_
            %if %length(&delimiter.)^=0 %then %do;
               _zzz_
            %end;
            ;
         quit;
      %end;
   %end;

   proc datasets library=work nolist;
      delete _var_list_;
   quit;
%mend Auto_Dummy_Variable;
