
Binary (or Dummy) Variables

Multiple Regression Analysis with Qualitative Information

Why should we care about qualitative variables?

Need a method to incorporate qualitative information because:

- Not all information can be easily quantified.

- Effect of belonging to a certain group or category, e.g. gender, location, status, occupation; beneficiary of a program / policy.

- Ordinal variables: e.g. answers to scaled questions, etc.

- The effect of some quantitative variable might differ between different groups/categories: e.g. returns to education differ between ethnic groups (see interaction terms, next class).

- Interest in the determinants of belonging to a certain group/category: e.g. determinants of being poor (see linear probability model, next class).

Dummy variables

Independent variable with value 0/1, used to bring qualitative information into regression analysis.

Model with only one dummy included:

$y = \beta_0 + \delta_0 \cdot dummy + u$

$\delta_0$: indicates the difference in the mean value of y between the 2 categories.

If dummy=0 => $E(y) = \beta_0$

If dummy=1 => $E(y) = \beta_0 + \delta_0$

So we can tell whether the outcome differs between the two groups by looking at the significance of $\delta_0$ (= comparison-of-means test).
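The equivalence between the one-dummy regression and a comparison of means can be checked numerically. The following is a minimal Python sketch rather than the Stata workflow used in these slides: the data are hypothetical toy values (not taken from engin.dta) and `ols_single_regressor` is an illustrative helper name.

```python
from statistics import mean


def ols_single_regressor(x, y):
    """Return (intercept, slope) of the OLS fit y = b0 + d0*x + u."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical toy data: dummy = 1 for one group, 0 for the other.
dummy = [0, 0, 0, 1, 1, 1, 1]
y = [9.2, 9.5, 9.4, 9.9, 10.1, 9.8, 9.7]

b0, d0 = ols_single_regressor(dummy, y)
mean_0 = mean([yi for di, yi in zip(dummy, y) if di == 0])
mean_1 = mean([yi for di, yi in zip(dummy, y) if di == 1])

print(f"intercept = {b0:.4f}   mean of y when dummy=0 = {mean_0:.4f}")
print(f"slope     = {d0:.4f}   difference in means    = {mean_1 - mean_0:.4f}")
```

The intercept reproduces the mean of the dummy=0 group and the slope reproduces the difference in group means, which is why the t-test on the dummy coefficient acts as a comparison-of-means test.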

Example: Determinants of starting wage for Thai engineers

Source: engin.dta, Wooldridge. Data from engineers in Thailand during 1998.

. bcuse engin.dta

Contains data from http://fmwww.bc.edu/ec-p/data/wooldridge/engin.dta
  obs:           403
 vars:            17                          24 May 2002 12:43
 size:        14,105 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
male            byte   %8.0g                  =1 if male
educ            byte   %8.0g                  highest grade completed
wage            long   %12.0g                 monthly salary, Thai baht
swage           long   %12.0g                 starting wage
exper           byte   %8.0g                  years on current job
pexper          byte   %8.0g                  previous experience
lwage           float  %9.0g                  log(wage)
expersq         int    %9.0g                  exper^2
highgrad        byte   %8.0g                  =1 if high school graduate
college         byte   %8.0g                  =1 if college graduate
grad            byte   %8.0g                  =1 if some graduate school
polytech        byte   %8.0g                  =1 if a polytech
highdrop        byte   %8.0g                  =1 if no high school degree
lswage          float  %9.0g                  log(swage)
pexpersq        int    %9.0g                  pexper^2
mleeduc         byte   %9.0g                  male*educ
mleeduc0        byte   %9.0g                  male*(educ - 14)

Comparison of means test

. reg lswage male

      Source |       SS       df       MS              Number of obs =     403
       Model |  18.6993431     1  18.6993431           F(  1,   401) =  234.37
    Residual |  31.9945726   401  .079786964           Prob > F      =  0.0000
       Total |  50.6939157   402  .126104268           R-squared     =  0.3689
                                                       Adj R-squared =  0.3673
                                                       Root MSE      =  .28247

      lswage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4315182   .0281872    15.31   0.000      .376105    .4869313
       _cons |   9.450113   .0204922   461.16   0.000     9.409828    9.490399

The starting wage of a male engineer is on average approximately 43% higher than the starting wage of a female engineer. The difference is statistically significant at the 1% level.

Question: Is this due to gender discrimination?

=> We could check whether it is related to observed characteristics.

Dummy variables as intercept shifters

Model:

$y = \beta_0 + \delta_0 \cdot dummy + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$

Similar interpretation as before, BUT:

- remember that the variable can only take 2 values: 0 or 1;

- the interpretation is relative to the benchmark, i.e. the non-specified group, e.g. if the dummy is female, the benchmark is male.

Ex:

$\log(swage) = \beta_0 + \delta_0\, male + \beta_1\, educ + \beta_2\, exper + u$

where male=1 if the individual is male, =0 if female.

$100 \cdot \delta_0$ is the approximate % difference in starting salary between male and female individuals with the same amount of education and past experience.

Check the t-stat to see if this difference is significant.

Interpreting dummy coefficient when dependent variable is log(y)

$\log(wage) = \beta_0 + \delta_0\, male + u$

The coefficient on a dummy variable, when multiplied by 100, is interpreted as the approximate percentage difference in y, holding all other factors fixed.

Being male increases wage by about $100 \cdot \delta_0$ per cent.

More exactly, the effect is given by $(\exp(\delta_0) - 1) \cdot 100$ per cent.

. reg lswage male educ exper

      lswage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .2221869   .0248217     8.95   0.000     .1733891    .2709846
        educ |   .0752772   .0044725    16.83   0.000     .0664846    .0840698
       exper |  -.0099646   .0061564    -1.62   0.106    -.0220676    .0021384
       _cons |   8.661703   .1028831    84.19   0.000     8.459443    8.863964

The predicted difference between the starting wage of male and female engineers with equal years of education and past experience is approximately 22%, and it is significant at the 1% level (the more precise prediction is about 25%: exp(0.222)-1 ≈ 0.249). Hence, while the difference is smaller than in the comparison-of-means test, it is still large and highly significant. This could be interpreted as evidence of discrimination, if it can be argued that there are no important omitted variables that could bias the result. What about graphical interpretation?
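To see how far the 100·coefficient approximation is from the exact percentage difference, the arithmetic can be sketched in Python. The two coefficients are the estimates on male from the two lswage regressions in this example; the function names are illustrative only.

```python
import math


def approx_pct(coef):
    """Approximate % difference: 100 * coefficient."""
    return 100 * coef


def exact_pct(coef):
    """Exact % difference implied by a log outcome: 100 * (exp(coef) - 1)."""
    return 100 * (math.exp(coef) - 1)


# Coefficients on male from the two regressions of lswage.
estimates = {
    "male only": 0.4315182,
    "male, educ, exper": 0.2221869,
}

for label, coef in estimates.items():
    print(f"{label:>18}: approx {approx_pct(coef):.1f}%, exact {exact_pct(coef):.1f}%")
```

The gap between the two numbers grows with the size of the coefficient, so the approximation is less reliable for the comparison-of-means estimate than for the regression with controls.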

Example: Effect of ethnic background and smoking

on birth weight

Use the bwght2.dta dataset and estimate the following equation:

$bwght = \beta_0 + \delta_0\, mwhte + \beta_1\, cigs + u$

mwhte=1 if mother is Caucasian

mwhte=0 if mother is not Caucasian

$\delta_0$: difference in birth weight of the child between white and non-white mothers smoking the same number of cigarettes per day.

. reg bwght mwhte cigs

      Source |       SS       df       MS              Number of obs =    1722
       Model |  5604414.27     2  2802207.14           F(  2,  1719) =    8.69
    Residual |   554064402  1719   322317.86           Prob > F      =  0.0002
       Total |   559668816  1721  325199.777           R-squared     =  0.0100
                                                       Adj R-squared =  0.0089
                                                       Root MSE      =  567.73

       bwght |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       mwhte |   95.34956   43.31733     2.20   0.028     10.38933    180.3098
        cigs |  -11.80881   3.244517    -3.64   0.000    -18.17242    -5.44519
       _cons |   3337.464   40.79848    81.80   0.000     3257.444    3417.483

Children of white mothers are predicted to have a 95 gram higher birth weight than children of non-white mothers who smoked the same number of cigarettes during pregnancy. The coefficient is significantly different from 0 at the 5% level. The predicted birth weight for a child of a non-white mother who did not smoke during pregnancy is 3337 grams. What about causality? (very important for policy evaluation)

Dummy variables for multiple categories

Distinguishing g categories can be done by including g-1 dummy variables along with the intercept, to avoid the dummy variable trap.

Note: if you do not omit one group, Stata will do it.

Interpretation: relative to the omitted category (need to omit 1 category to avoid perfect collinearity).

eststo clear

eststo:reg bwght mwhte mblck cigs

eststo:reg bwght moth mblck cigs

esttab,r2

                      (1)             (2)
                    bwght           bwght
mwhte               150.7*
                   (2.52)
mblck               109.6          -41.12
                   (1.34)         (-0.69)
cigs               -11.84***       -11.84***
                  (-3.65)         (-3.65)
moth                               -150.7*
                                  (-2.52)
_cons              3282.1***       3432.8***
                  (56.64)        (228.87)
N                    1722            1722
R-sq                0.011           0.011
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
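Changing the omitted category only re-expresses the same group differences. As a rough check, the Python sketch below rebuilds the predicted birth weight at cigs = 0 for each group from both columns of the esttab table, using the rounded coefficients as printed. The dictionary and function names are illustrative, and the sketch assumes that mwhte, mblck and moth are mutually exclusive and exhaustive categories.

```python
# Rounded coefficients as printed in the esttab table.
col1 = {"_cons": 3282.1, "mwhte": 150.7, "mblck": 109.6}   # omitted group: other
col2 = {"_cons": 3432.8, "moth": -150.7, "mblck": -41.12}  # omitted group: white


def predicted_at_zero_cigs(coefs, dummy=None):
    """Predicted bwght for cigs = 0; dummy=None selects the omitted group."""
    return coefs["_cons"] + (coefs[dummy] if dummy else 0.0)


groups_col1 = {
    "white": predicted_at_zero_cigs(col1, "mwhte"),
    "black": predicted_at_zero_cigs(col1, "mblck"),
    "other": predicted_at_zero_cigs(col1),
}
groups_col2 = {
    "white": predicted_at_zero_cigs(col2),
    "black": predicted_at_zero_cigs(col2, "mblck"),
    "other": predicted_at_zero_cigs(col2, "moth"),
}

for group in ("white", "black", "other"):
    print(f"{group:>5}: column 1 = {groups_col1[group]:.1f}, "
          f"column 2 = {groups_col2[group]:.1f}")
```

Both columns imply the same predicted values for every group up to rounding; only the reference point of the coefficients changes.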

Ordinal variables

Qualitative ratings can be converted into dummy variables.

How can we model the relationship between happiness with marriage on a 1-5 scale (happy) and the number of affairs (naffairs)?

A one-unit increase from 1 to 2 might not have the same effect as a one-unit increase from 4 to 5, so the linearity assumption does not seem reasonable.

Create 4 dummy variables: hap1, hap2, hap3, hap4.

hap1 = 1 if happy = 1

hap1 = 0 if happy ≠ 1

hap2 = 1 if happy = 2

hap2 = 0 if happy ≠ 2 ...

Estimate:

$naffairs = \beta_0 + \delta_1\, hap1 + \delta_2\, hap2 + \delta_3\, hap3 + \delta_4\, hap4 + \beta_1\, yrsmarr + u$

Example: Determinants of number of extra-marital affairs (source: affairs.dta)

. d naffairs ratemarr yrsmarr

              storage  display     value
variable name   type   format      label      variable label
naffairs        byte   %9.0g                  number of affairs within last year
ratemarr        byte   %9.0g                  5 = vry hap marr, 4 = hap than avg, 3 = avg, 2 = smewht unhap, 1 = vry unhap
yrsmarr         float  %9.0g                  years married

. ge happy=ratemarr

. reg naffairs happy yrsmarr

      Source |       SS       df       MS              Number of obs =     601
       Model |  608.178688     2  304.089344           F(  2,   598) =   30.71
    Residual |  5920.90284   598  9.90117532           Prob > F      =  0.0000
       Total |  6529.08153   600  10.8818026           R-squared     =  0.0931
                                                       Adj R-squared =  0.0901
                                                       Root MSE      =  3.1466

    naffairs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       happy |  -.7439478    .120047    -6.20   0.000    -.9797128   -.5081827
     yrsmarr |   .0748148   .0237706     3.15   0.002     .0281307    .1214989
       _cons |   3.769133   .5671483     6.65   0.000     2.655289    4.882978

Interpretation: an increase of 1 on the happiness scale

decreases the predicted number of affairs per year by

0.74. This is not very intuitive, hence it is better to define

separate dummies for each category.

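. tab happy,ge(hap)

      happy |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         16        2.66        2.66
          2 |         66       10.98       13.64
          3 |         93       15.47       29.12
          4 |        194       32.28       61.40
          5 |        232       38.60      100.00
------------+-----------------------------------
      Total |        601      100.00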

The one-line tab command with the ge() option is highly valued by lazy and efficient people.

Or, you could do:

ge hap1=happy==1 if happy<.

(and likewise for hap2 to hap5). This one is more transparent but you have to write more lines.

Finally, you can always find a way to write more lines and get the same result. It is just a matter of preferences.

ge hap1=1

replace hap1=0 if happy!=1

ge hap2=1

replace hap2=0 if happy!=2

ge hap3=1

replace hap3=0 if happy!=3

ge hap4=1

replace hap4=0 if happy!=4

ge hap5=1

replace hap5=0 if happy!=5
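The same idea of turning an ordinal rating into one indicator per category can be sketched outside Stata. The Python below is only an illustration: the ratings are hypothetical (not taken from affairs.dta), `make_dummies` is a made-up helper, and it names each dummy after the level value, which coincides with the hap1 to hap5 naming only because the levels are 1 to 5. Missing values are not handled.

```python
def make_dummies(values, prefix):
    """Build one 0/1 indicator list per distinct level of an ordinal variable."""
    levels = sorted(set(values))
    return {f"{prefix}{level}": [int(v == level) for v in values] for level in levels}


# Hypothetical ratings on the 1-5 scale.
happy = [5, 4, 1, 3, 5, 2, 4]
dummies = make_dummies(happy, "hap")

# Every observation falls in exactly one category, so the g dummies sum to 1
# in each row. That is why only g-1 of them can enter next to an intercept.
row_sums = [sum(column[i] for column in dummies.values()) for i in range(len(happy))]

for name, column in dummies.items():
    print(name, column)
print("row sums:", row_sums)
```

Because the row sums equal the constant term for every observation, including all five dummies together with an intercept would produce perfect collinearity, which is the dummy variable trap.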

. reg naffairs hap1 hap2 hap3 hap4 yrsmarr

    naffairs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hap1 |   2.681823   .8202996     3.27   0.001     1.070788    4.292858
        hap2 |   2.933748   .4435974     6.61   0.000     2.062541    3.804955
        hap3 |   .5400974   .3868409     1.40   0.163    -.2196423    1.299837
        hap4 |   .3941927   .3097408     1.27   0.204    -.2141256    1.002511
     yrsmarr |   .0768243   .0237727     3.23   0.001     .0301358    .1235129
       _cons |   .2232698   .2561997     0.87   0.384    -.2798958    .7264354

Interpretation: People who are very unhappy in their marriage are predicted to have on average

2.68 more affairs per year (than which group??), controlling for the years of marriage. People

who are somewhat unhappy in their marriage are predicted to have 2.93 more affairs, ceteris

paribus. Both effects are statistically significant at the 1% level. People who are average happy,

or more than average happy with their marriage are not predicted to have more affairs than

people who are very happy in their marriage. => Given these findings, could you think about

redefining different categories in a way that seems more sensible?

. test (hap1=hap2) (hap3==0) (hap4=0)

 ( 1)  hap1 - hap2 = 0
 ( 2)  hap3 = 0
 ( 3)  hap4 = 0

       F(  3,   595) =    0.90
            Prob > F =    0.4417

•Answer: We use an F-test to justify such a re-categorization.

We fail to reject that hap1 and hap2 have the same impact

on naffairs, and that hap3 and hap4 do not matter. So we

choose to create a dummy to distinguish those who are very

unhappy or somewhat unhappy, from everybody else

. ge unhappyd = hap1==1 | hap2==1

. ta unhappyd

. reg naffairs unhappyd yrsmarr

    naffairs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    unhappyd |   2.621681   .3761948     6.97   0.000     1.882858    3.360505
     yrsmarr |    .083804   .0231971     3.61   0.000     .0382463    .1293617
       _cons |   .4128829   .2271662     1.82   0.070    -.0332576    .8590233

Question: how do you test the null hyp. that

happiness of marriage has no impact on

number of affairs?

. test hap1 hap2 hap3 hap4

 ( 1)  hap1 = 0
 ( 2)  hap2 = 0
 ( 3)  hap3 = 0
 ( 4)  hap4 = 0

       F(  4,   595) =   12.81
            Prob > F =    0.0000

A binary dependent variable:

The Linear Probability Model

What if the dependent variable is a dummy?

Ex: you want to investigate the determinants of higher

education, use of illegal drugs, etc.

Assuming the zero conditional mean (ZCM) assumption holds, we have:

$E(y|x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$

As y only takes the values 0 and 1, E(y|x)=P(y=1|x).

Linear Probability Model: the response probability is linear in the parameters of the model:

$\Delta P(y=1|x) = \beta_j \Delta x_j$

Linear Probability Model

In short:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_k x_k$

$\hat{y}$: predicted probability that y = 1 (“success”), i.e. the estimate of P(y = 1).

$\hat{\beta}_j$: cannot be interpreted as the change in y given a one-unit increase in xj. Instead: predicted change in the probability of success when xj increases by one unit, keeping everything else constant.

$\hat{\beta}_0$: predicted probability of success when each xj equals 0.

Ex: determinants of having an affair.

Source: affairs.dta

. reg affair smerel vryrel yrsmarr edu

      affair |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smerel |  -.1592786   .0388917    -4.10   0.000      -.23566   -.0828972
      vryrel |   -.163037    .056419    -2.89   0.004    -.2738411   -.0522328
     yrsmarr |   .0138661   .0031789     4.36   0.000     .0076228    .0201093
        educ |   .0011737   .0072041     0.16   0.871    -.0129747    .0153221
       _cons |   .1865604   .1203043     1.55   0.121    -.0497115    .4228323

Interpretation:

The intercept shows us that the predicted probability of having an affair for an individual who is slightly/not at all/anti religious, has been married for 0 years (impossible in the data) and has 0 years of education is about 19%.

The predicted probability of having an affair increases by 1.4 percentage points for each additional year of marriage.

Note that this implies that the predicted probability of having an affair for an individual who has been married for 20 years is about 28 (=20*0.014) percentage points higher than for someone who has been married for 0 years, other factors being held fixed.

Limitations of the LPM

Predicted probability of success may be outside the [0,1]

range, for some combination of values of the independent

variables.

. reg affair smerel vryrel yrsmarr age

      affair |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smerel |  -.1588698   .0387371    -4.10   0.000    -.2349475   -.0827921
      vryrel |  -.1535096   .0564881    -2.72   0.007    -.2644495   -.0425697
     yrsmarr |   .0209761   .0049415     4.24   0.000     .0112712    .0306811
         age |  -.0055315   .0029614    -1.87   0.062    -.0113475    .0002845
       _cons |   .3258566   .0720615     4.52   0.000     .1843313     .467382

. predict prob_affair
(option xb assumed; fitted values)

. su prob

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
 prob_affair |       601     .249584    .1038671  -.0627137   .4634911

Limitations of the LPM (2)

What does it mean that P(affair)<0?

. list yrsmarr age vryrel smerel if prob<0

       yrsmarr   age   vryrel   smerel
288.       1.5    42        1        0
433.       1.5    37        0        1
451.      .125    42        0        1

Related limitation: a probability cannot be linearly related to the independent variables for all their possible values. Here $\hat{\beta}_{yrsmarr} = 0.02$, meaning that one more year of marriage has the same predicted impact when going from 0 to 1 as when going from 20 to 21, which may not be the case.
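The negative fitted values can be reproduced by hand from the coefficients of the regression of affair on smerel, vryrel, yrsmarr and age. The Python below is a sketch of that arithmetic using the printed (rounded) coefficients and the three listed observations; `fitted_value` is an illustrative name.

```python
# Coefficients from: reg affair smerel vryrel yrsmarr age
COEF = {
    "_cons": 0.3258566,
    "smerel": -0.1588698,
    "vryrel": -0.1535096,
    "yrsmarr": 0.0209761,
    "age": -0.0055315,
}


def fitted_value(smerel, vryrel, yrsmarr, age):
    """Linear prediction of the LPM, i.e. the predicted P(affair = 1)."""
    return (COEF["_cons"] + COEF["smerel"] * smerel + COEF["vryrel"] * vryrel
            + COEF["yrsmarr"] * yrsmarr + COEF["age"] * age)


# The three observations listed with prob < 0 (keys are observation numbers).
negative_cases = {
    288: {"yrsmarr": 1.5, "age": 42, "vryrel": 1, "smerel": 0},
    433: {"yrsmarr": 1.5, "age": 37, "vryrel": 0, "smerel": 1},
    451: {"yrsmarr": 0.125, "age": 42, "vryrel": 0, "smerel": 1},
}

fitted = {obs: fitted_value(**row) for obs, row in negative_cases.items()}
for obs, value in fitted.items():
    print(f"obs {obs}: fitted probability = {value:.4f}")
```

All three cases combine a short marriage, a relatively high age and a religious category, which pushes the linear prediction below zero; observation 451 corresponds to the minimum reported by the summary of prob_affair.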

Limitations of the LPM (3)

Violation of the homoskedasticity assumption:

$Var(y|x) = p(x)\,(1 - p(x))$

where p(x) is the probability of success.

The first four assumptions are OK => no bias, but we need to be cautious with the standard errors of the coefficients, and thus with t and F-tests.

Usually, OLS analysis of the LPM is accepted in applied work.
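Because the conditional variance depends on p(x), it changes whenever the predicted probability changes. A small Python sketch of the formula, evaluated at the mean and maximum fitted probabilities reported in the summary of prob_affair (the function name is illustrative):

```python
def lpm_conditional_variance(p):
    """Var(y|x) for a binary y with success probability p = p(x)."""
    return p * (1 - p)


# Mean and maximum fitted probabilities from the summary of prob_affair.
for p in (0.249584, 0.4634911):
    print(f"p(x) = {p:.4f}  ->  Var(y|x) = {lpm_conditional_variance(p):.4f}")
```

The variance is largest at p(x) = 0.5 and shrinks towards 0 as p(x) approaches 0 or 1, so the error variance cannot be constant across observations unless all slope coefficients are zero.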

Interaction terms

What if we expect the effect of one variable to

depend on the magnitude of another variable?

Ex:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 * x_2) + u$

What is the effect of x1 on y?

$\frac{\Delta y}{\Delta x_1} = \beta_1 + \beta_3 x_2$

Interpretation is always for a particular value of the variable with which x1 is interacted (x2).

Always introduce both variables and the interaction term in the regression.

Interaction between 2 binary variables

Does the effect of ethnicity on education depend on location?

Model without interaction:

$y = \beta_0 + \beta_1\, black + \beta_2\, south + other\ factors$

=> Allow for interaction between black and south:

$y = \beta_0 + \beta_1\, black + \beta_2\, south + \beta_3\, black * south + other\ factors$

Interpretation of coefficients?

Compute the expected value of y for each possible case described by the binary variables (for a given value of other factors).

Compare the expected values: let the value of other factors = 0

Case of black=0 & south=0 => β0

Case of black=0 & south=1 => β0+β2

Case of black=1 & south=0 => β0+β1

Case of black=1 & south=1 => β0+β1+β2+β3

Predicted difference between:

black from south and black from north: β2+β3

black from north and non-black from south: β1-β2

black from south and non-black from north: β1+β2+ β3

Can also define separate dummies for each

category

same point estimates, but more intuitive interpretation.

can test whether differences are statistically significant.

Example: Determinants of education level young

men in 1980 (age 28-38). Source: wage2.dta

. gen blsouth = black*south

. regress educ black south blsouth sibs

        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -1.106547   .3364179    -3.29   0.001    -1.766773   -.4463207
       south |   -.377183    .162062    -2.33   0.020    -.6952325   -.0591334
     blsouth |   .7366762   .4325432     1.70   0.089    -.1121976     1.58555
        sibs |  -.1956228   .0315413    -6.20   0.000    -.2575232   -.1337225
       _cons |   14.25463   .1227144   116.16   0.000      14.0138    14.49546

Interpretation

Non-black men without siblings who live in the north are predicted to have 14.25 (β0) years of education. Black men without siblings who live in the north are predicted to have 13.15 years of education (β0+β1 = 14.255-1.107). Non-black men without siblings who live in the south are predicted to have 13.88 years of education (β0+β2 = 14.255-0.377). And black men without siblings who live in the south are predicted to have 13.51 years of education (β0+β1+β2+β3 = 14.255-1.107-0.377+0.737).

The difference between black and non-black in the north (β1) is significant

at the 1% level.

The difference between non-black in the north and the south (β2) is

significant at the 5% level.

The difference between black and non-black in the south (β1+β3),

compared to the difference between black and non-black in the north (β1),

i.e. β3, is only significant at the 10% level.

In order to make interpretation easier we can redefine the variables

in the following way:

. gen blnorth = 0

. replace blnorth = 1 if black==1 & south==0
(44 real changes made)

. gen nblsouth = 0

. replace nblsouth = 1 if black==0 & south==1
(243 real changes made)

. regress educ nblsouth blsouth blnorth sibs

        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    nblsouth |   -.377183    .162062    -2.33   0.020    -.6952325   -.0591334
     blsouth |  -.7470537    .267403    -2.79   0.005    -1.271837   -.2222705
     blnorth |  -1.106547   .3364179    -3.29   0.001    -1.766773   -.4463207
        sibs |  -.1956228   .0315413    -6.20   0.000    -.2575232   -.1337225
       _cons |   14.25463   .1227144   116.16   0.000      14.0138    14.49546

Magnitude of coefficient leads to same interpretation as before. We can also see that the difference

between black in south and non-black in the north is significant at the 1% level. Note that in order to

test whether the difference between black in the south and black in the north is statistically significant,

we would need to re-estimate the equation with one of these categories as omitted category (or use

“test”).
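The link between the interaction specification and the separate-category specification can be verified with a few lines of arithmetic. The Python sketch below uses the coefficients of the regression of educ on black, south, blsouth and sibs, evaluated at sibs = 0, and recovers the coefficients that the separate-dummy regression reports; the variable and function names are illustrative.

```python
# Coefficients from: regress educ black south blsouth sibs (with sibs = 0).
b0, b1, b2, b3 = 14.25463, -1.106547, -0.377183, 0.7366762


def predicted_educ(black, south):
    """Predicted years of education for a man without siblings."""
    return b0 + b1 * black + b2 * south + b3 * black * south


cells = {(black, south): predicted_educ(black, south)
         for black in (0, 1) for south in (0, 1)}

# With separate category dummies, each coefficient is the difference between
# that cell and the omitted cell (non-black, north).
base = cells[(0, 0)]
blnorth = cells[(1, 0)] - base
nblsouth = cells[(0, 1)] - base
blsouth = cells[(1, 1)] - base

for (black, south), value in cells.items():
    print(f"black={black}, south={south}: predicted educ = {value:.2f}")
print(f"blnorth = {blnorth:.4f}, nblsouth = {nblsouth:.4f}, blsouth = {blsouth:.4f}")
```

The implied blsouth coefficient equals β1+β2+β3, which matches the estimate in the regression with separate category dummies, so the two specifications carry the same information.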

Interaction between binary variable and a

continuous variable

Allowing for different slopes: is the return to education the same for

men and women, allowing for a constant wage differential between

men and women? (See graphical explanation)

Estimate a model allowing for different intercept and slope effects of education

on wage. Include experience, tenure and their quadratics. Conclude.

Does the effect of number of siblings differ between ethnic groups? => Allow for interaction between black and sibs:

$y = \beta_0 + \beta_1\, black + \beta_2\, south + \beta_3\, black * south + \beta_4\, sibs + \beta_5\, sibs * black + other\ factors$

Interpretation of coefficients:

Effect of sibs when black = 0: β4

Effect of sibs when black = 1: β4+β5

. gen blsibs = black*sibs

. regress educ black south blsouth sibs blsibs

        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -1.232071   .4540722    -2.71   0.007    -2.123197   -.3409449
       south |  -.3777887   .1621411    -2.33   0.020    -.6959939   -.0595835
     blsouth |   .7305857   .4329891     1.69   0.092    -.1191644    1.580336
        sibs |  -.2029534   .0362297    -5.60   0.000    -.2740549   -.1318519
      blsibs |   .0303679   .0737401     0.41   0.681    -.1143485    .1750843
       _cons |    14.2744    .131814   108.29   0.000     14.01571    14.53308

Interpretation: the coefficient on the interaction between black and sibs is not significant. Hence we do not find support for the hypothesis that the effect of the number of siblings differs between ethnic groups.

Interaction between 2 continuous variables

$educ = \beta_0 + \beta_1\, black + \beta_2\, south + \beta_3\, black * south + \beta_4\, IQ + \beta_5\, sibs + \beta_6\, sibs * IQ + u$

What does β6 mean? What does a t-test on β6 capture?

- Whether there is a differential effect of IQ on education depending on the number of siblings.

- But also whether there is a differential effect of siblings on education depending on the IQ.

What is the effect of an increase of IQ score of 1?

$\frac{d\, educ}{d\, IQ} = \beta_4 + \beta_6\, sibs$

. ge sibsIQ=sibs*IQ

. regress educ black south blsouth IQ sibs sibsIQ

        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -.0438851   .3037981    -0.14   0.885    -.6400961     .552326
       south |   -.059337    .143882    -0.41   0.680    -.3417087    .2230347
     blsouth |   .4711314     .38146     1.24   0.217    -.2774928    1.219756
          IQ |   .0899512   .0068738    13.09   0.000     .0764613    .1034412
        sibs |   .3959715   .1596179     2.48   0.013     .0827175    .7092254
      sibsIQ |  -.0052664   .0016448    -3.20   0.001    -.0084943   -.0020385
       _cons |   4.697746   .7135252     6.58   0.000     3.297436    6.098056

Interpretation: IQ has a significant positive effect on education, but the effect is slightly smaller for people with a lot of siblings. This could indicate that as one has more siblings, other concerns (such as income) might start having a larger impact on education levels, and therefore might decrease the effect of IQ. Looking at the coefficients of sibs and sibsIQ, the interaction effect also implies that siblings have a positive effect on education for people with very low IQ levels (below about 75 = -β5/β6 = 0.3960/0.005266), but a negative effect for higher IQ levels.
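The two marginal effects implied by the interaction, and the IQ level at which the effect of siblings changes sign, can be computed directly from the estimates. The Python below is a sketch using the printed coefficients on IQ, sibs and sibsIQ; the function names and the example values of sibs and IQ are illustrative choices.

```python
# Coefficients from: regress educ black south blsouth IQ sibs sibsIQ
b_iq, b_sibs, b_sibsiq = 0.0899512, 0.3959715, -0.0052664


def effect_of_iq(sibs):
    """Predicted change in educ for one more IQ point, given sibs."""
    return b_iq + b_sibsiq * sibs


def effect_of_sibs(iq):
    """Predicted change in educ for one more sibling, given IQ."""
    return b_sibs + b_sibsiq * iq


# IQ level at which the effect of one more sibling is exactly zero.
iq_threshold = -b_sibs / b_sibsiq

for sibs in (0, 2, 5):
    print(f"sibs = {sibs}: effect of one IQ point = {effect_of_iq(sibs):.4f}")
for iq in (60, 100, 120):
    print(f"IQ = {iq}: effect of one more sibling = {effect_of_sibs(iq):.4f}")
print(f"effect of sibs changes sign at IQ = {iq_threshold:.1f}")
```

The effect of IQ stays positive over any realistic number of siblings, while the effect of siblings is positive only below the threshold and negative above it.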

Example Linear Probability Model with interaction

terms: Correlates of the Probability of arrests

Interpretation first column:

- Intercept: Predicted probability of being arrested at least once in 1986 for a non-

hispanic, non-black, unemployed man that was previously arrested, but does not have

any prior convictions, and has not served in prison, was 38%.

- Dummy variables for race: Keeping everything else constant, the predicted

probability of being arrested was almost 10 percentage points higher for a hispanic

man than for a non-hispanic, non-black man; and 17 percentage points higher for a

black man than for a non-hispanic, non-black man. These differences are significant

at the 1% level. Note however that these do not necessarily point to discrimination.

(~omitted variables).

- Other variables: The average sentence length and the total time in prison do not

have a significant effect on the probability of being arrested. All other variables are

however significant at the 1% level: men who got convicted after a prior arrest, are

15 percentage points less likely to have been arrested in 1986, keeping everything

else constant. Each month spent in prison in 1986 decreases the likelihood of being

arrested by a predicted 2.4 percentage points, ceteris paribus (note that this leads to

implausible predictions for some men who have been in prison the whole year). And

each quarter of employment is predicted to decrease the likelihood of being arrested by 3.8 percentage points, ceteris paribus.

- Note that for the interpretation one should think about the possible values the independent variables can take.

Interpretation second column:

In order to see whether employment has a different effect depending

on the ethnic groups, interaction effects were introduced (e.g. one

might want to do this to figure out to whom to target employment

services).

- Each quarter of employment decreases the predicted likelihood of being arrested by 2.9 percentage points for a non-Hispanic, non-black man, and by 6.3 percentage points for a Hispanic man, ceteris paribus (6.3 = 2.9+3.4). Note that the insignificant coefficient on the interaction term of employment with black indicates that we cannot reject that the effect of employment is the same for black men as for non-black, non-Hispanic men.

- Also, note the robustness of the estimated coefficients for the other variables.
