Overview of Regression Analysis

Conditional Mean

We all know what a mean or average is. E.g., the mean annual earnings for 25-44 year old working males is $54,648 (March 2010).

We are also often interested in how this mean differs by other individual characteristics. E.g. How do the mean earnings differ between black and non-black workers?

Mean earnings for working non-black males ages 25-44 = $56,614

Mean earnings for working black males ages 25-44 = $39,380

These are known as Conditional Means (the mean conditioned on some other characteristic, in this case race).

So, without controlling for anything else, 25-44 year old black working males earn on average $17,234 less annually, or about 30% less, than similarly aged non-black working males.
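The raw gap quoted above is simple arithmetic on the two conditional means. A minimal sketch in Python, using only the two figures given in the text (the variable names are illustrative):

```python
# Conditional means quoted above (March 2010, working males ages 25-44).
mean_non_black = 56_614
mean_black = 39_380

gap = mean_non_black - mean_black        # dollar gap between the two means
relative_gap = gap / mean_non_black      # gap as a share of the non-black mean

print(f"gap = ${gap:,}")
print(f"relative gap = {relative_gap:.1%}")
```

Note that the percentage is taken relative to the non-black mean, which is what yields roughly 30%.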

Conditional Means

When testing a theory though, we often want to know how much of a given mean difference can be attributed to a particular observable variable, after controlling for other observable differences.

For example, we also know that earnings are closely tied to schooling, and there is a significant racial gap in schooling, so we might want to know how large the racial earnings gap is net of racial differences in years of schooling (i.e., controlling for schooling).

Conditional Means

One way to do this is to calculate even more complicated conditional means.

E.g.,

Non-black males between 25-44 without a high school degree = $25,278
Black males between 25-44 without a high school degree = $22,275

Non-black males between 25-44 with a high school degree = $39,922
Black males between 25-44 with a high school degree = $32,670

Non-black males between 25-44 with a college degree = $80,295
Black males between 25-44 with a college degree = $61,136

Conditional Means

Then, we can find how much less black workers earn than white workers, after controlling for education, via the following weighted mean formula:

\[ \sum_{i} \frac{n_{b,i}}{n_{b}} \left( \text{earnings}_{b,i} - \text{earnings}_{w,i} \right) \]

where i corresponds to the three education categories, n_{b,i}/n_{b} corresponds to the fraction of black male workers in education category i, earnings_{b,i} corresponds to the mean earnings for black workers in education category i, and earnings_{w,i} corresponds to the mean earnings for white workers in education category i.

Doing so, we find that according to the above conditional mean calculations, black male workers earn about $11,064, or 11,064/54,648 = 20 percent, less than white male workers with similar education characteristics.

So conditioning on years of education explains about 33% of the racial earnings gap ([0.30 - 0.20]/0.30 = 0.33).
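The weighted mean calculation can be sketched as follows. The conditional means are the ones listed earlier, but the education shares of black workers are not reported in these slides, so the `share_b` values below are hypothetical placeholders; the printed result is therefore only illustrative and will not reproduce the $11,064 figure.

```python
# Conditional mean earnings by education category (figures listed earlier).
earnings_w = {"no_hs": 25_278, "hs": 39_922, "college": 80_295}  # non-black
earnings_b = {"no_hs": 22_275, "hs": 32_670, "college": 61_136}  # black

# HYPOTHETICAL shares of black male workers in each category (n_b,i / n_b).
# Placeholders for illustration only; the source does not report them.
share_b = {"no_hs": 0.2, "hs": 0.5, "college": 0.3}


def weighted_gap(shares, earn_b, earn_w):
    """Sum over categories of share_i * (earnings_b,i - earnings_w,i)."""
    return sum(shares[i] * (earn_b[i] - earn_w[i]) for i in shares)


gap = weighted_gap(share_b, earnings_b, earnings_w)
print(f"education-adjusted gap = {gap:,.0f}")  # negative: black workers earn less
```

Because the weights sum to one, the adjusted gap always lies between the smallest and largest within-category gaps.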

Conditional Means

It can be quite cumbersome to compute all these conditional means, though, especially if we start adding more categories for education (e.g., only up to 10th grade, only up to 11th grade, only up to 12th grade, 1 year of college, 2 years of college, 3 years of college, etc.).

Moreover, what if we are also interested in the impact of another year of schooling on earnings, after controlling for race? That would require a whole new set of calculations.

Regression

This is why a regression model is often a simpler way to describe conditional means.

earnings_i = α + β1*black_i + β2*yrs of school_i + e_i

Often, the left-hand-side variable is called the “dependent” variable, and the right-hand-side variables are called “control” variables or “regressors” (or sometimes “independent” variables, but I don’t like that).

α is known as the intercept, the β’s are (slope) coefficients, and e_i is the “residual.”

Estimating a regression amounts to finding the intercept and slope coefficients that minimize the sum of the squared e_i terms across the sample (i.e., finding the best “fit”).

So the intercept and coefficients essentially account for the variation in the dependent variable (earnings) that is common across all people with respect to the control variables, while the residual is the individual-specific variation, or how each individual differs from the average.

Graphically?
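The least-squares idea can be sketched in plain Python. This is only an illustration under stated assumptions: `ols` is a hypothetical helper that solves the normal equations by Gauss-Jordan elimination, the sample is made up (it is not the survey data), and the earnings are generated from illustrative coefficients with zero residuals so the fit can be checked by eye. Real work would use a statistics package.

```python
def ols(X, y):
    """Least-squares coefficients for y ~ X (each row of X starts with a 1)."""
    k = len(X[0])
    # Augmented normal equations: [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up sample of (black indicator, years of schooling). NOT the survey data.
sample = [(0, 10), (0, 12), (0, 14), (0, 16), (1, 10), (1, 12), (1, 14), (1, 16)]
# Earnings built from illustrative coefficients with every residual e_i = 0.
y = [-70_003 - 10_381 * b + 8_888 * s for b, s in sample]
X = [[1, b, s] for b, s in sample]

alpha, beta1, beta2 = ols(X, y)
print(alpha, beta1, beta2)
```

Since the made-up data contain no residual variation, the recovered intercept and slopes match the coefficients used to generate them, up to floating-point error.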

Regression

[Figure: Earnings plotted against Yrs of Schooling; intercepts labeled α and α+β1, slope labeled β2]

Regression

When I estimate this model I get:

earnings_i = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i + e_i

Or

Earnings regression
Characteristic      Coefficient
black               -10,381***
                    (1,126)
yrs of educ         8,888***
                    (138)
intercept           -70,003***
                    (1,968)
R-squared           0.154

Regression

If we simply take the coefficients, this can be referred to as our estimated linear conditional expectation function.

E[earnings_i] = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i

Computing the equation for particular characteristics gives the “expected,” or mean, earnings for a person with those characteristics.

So for a non-black worker with 12 years of schooling, expected earnings are:

-70,003 - 10,381*0 + 8,888*12 = $36,653

(This compares to $39,922 when we computed it directly using actual conditional means.)
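The same plug-in calculation can be written as a small function. This is a minimal sketch using the estimated coefficients quoted above; the function name `expected_earnings` is illustrative.

```python
def expected_earnings(black, yrs_school):
    """Estimated linear conditional expectation function from the text."""
    return -70_003 - 10_381 * black + 8_888 * yrs_school


non_black_12 = expected_earnings(black=0, yrs_school=12)
black_12 = expected_earnings(black=1, yrs_school=12)

print(non_black_12)  # 36653
print(black_12)      # 26272
```

Changing the arguments gives the expected earnings for any other combination of the two characteristics.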

Regression

Often we are simply interested in the coefficients, i.e., how each right-hand-side variable is associated with the dependent variable.

Interpreting the coefficients (i.e., “Betas”): consider a generic linear function y = 100 + 0.5*x1 - 12*x2

How do we determine the change in y associated with a one unit change in x2, holding everything else constant?

Now suppose we want to know the expected change in earnings due to a one year increase in schooling, holding all other variables constant. Recall our estimated linear conditional expectation function:

E[earnings_i] = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i

Regression

So, if one is interested in the expected change in a dependent variable associated with a one unit change in one of the control variables, simply take the derivative of the estimated conditional expectation function with respect to that control variable.

So, given our estimate, earnings are “expected” to increase by $8,888 for each additional year of schooling.
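For a linear function, the derivative with respect to a variable equals the change from a one unit increase in that variable with everything else held fixed. A minimal sketch that checks this for both the generic function and the earnings equation (the function names are illustrative):

```python
def y(x1, x2):
    # Generic linear function from the text.
    return 100 + 0.5 * x1 - 12 * x2


def expected_earnings(black, yrs_school):
    # Estimated linear conditional expectation function from the text.
    return -70_003 - 10_381 * black + 8_888 * yrs_school


# Raise x2 by one unit while holding x1 fixed.
dy_dx2 = y(3, 5) - y(3, 4)

# Raise schooling by one year while holding race fixed.
d_earn_school = expected_earnings(0, 13) - expected_earnings(0, 12)

print(dy_dx2)         # -12.0
print(d_earn_school)  # 8888
```

Because the function is linear, the answer does not depend on the starting values chosen for the variables.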

Regression

Consider again our estimated linear conditional expectation function for earnings:

E[earnings_i] = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i

Taking the derivative with respect to the “black” indicator variable we get -10,381.

This means that, holding everything else equal (i.e., years of education), on average, black workers earn $10,381 less than white workers (i.e., expected earnings for black workers are $10,381 less than expected earnings for white workers with similar education).

This is similar to the $11,064 conditional pay differential we computed before, but is still a little different. Why?

Regression

Often, when we run regressions, we aren’t really interested in “point estimates” (i.e., specific coefficient estimates), but rather in using these estimates to test hypotheses.

For example, what if we are really interested in whether black workers have a different return to an additional year of schooling than white workers?

How could we test this?

Regression

What if I added in an “interaction” term between schooling and race?

E[earnings_i] = α + β1*black_i + β2*yrs of school_i + β3*black_i*yrs of school_i

Doing this estimation I get:

E[earnings_i] = -47,011 + 1381*black_i + 7,321*yrs of school_i - 982*black_i*yrs of school_i

How do we interpret these coefficients?

What is the average impact of another year of schooling on a white worker’s earnings?

What is the average impact of another year of schooling on a black worker’s earnings?

So how do we test whether the return to an additional year of schooling is different for black workers than for white workers?
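One way to read the interaction model is to compute the one-year return to schooling separately for each group. A minimal sketch using the coefficients from the estimated equation above (the function names are illustrative):

```python
# Coefficients from the estimated interaction model quoted above.
ALPHA, B_BLACK, B_SCHOOL, B_INTERACT = -47_011, 1_381, 7_321, -982


def expected_earnings(black, yrs_school):
    return (ALPHA + B_BLACK * black + B_SCHOOL * yrs_school
            + B_INTERACT * black * yrs_school)


def return_to_schooling(black):
    """Change in expected earnings from one more year of schooling."""
    return expected_earnings(black, 13) - expected_earnings(black, 12)


white_return = return_to_schooling(black=0)
black_return = return_to_schooling(black=1)

print(white_return)                 # 7321
print(black_return)                 # 6339
print(black_return - white_return)  # -982, the interaction coefficient
```

The difference between the two returns is exactly the coefficient on the interaction term, so testing whether the returns differ amounts to testing whether that coefficient differs from zero.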

Regression

Precision/Significance of estimates:

Consider again the previous estimates.

What we are testing is whether the coefficient of interest is “significantly” different from zero (i.e., how likely is it that we would have gotten an estimate this large by chance even if the true value was really equal to zero?).

To test a hypothesis, we must compare the size of the coefficient to its standard error.

A good rule of thumb is that an estimate is significant when the absolute magnitude of the coefficient is close to or above twice its standard error.

So what will generally impact whether an estimate is significant?

Earnings regression
Characteristic      Coefficient
black               1380
                    (6748)
yrs of educ         7320***
                    (117)
black*yrs educ      -982**
                    (496)
intercept           -47,011***
                    (1630)
R-squared           0.15
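The rule of thumb can be checked against the table by dividing each coefficient by its standard error. A minimal sketch (the helper name `ratio_to_se` is illustrative; the numbers are the ones in the table):

```python
def ratio_to_se(coef, se):
    """Coefficient divided by its standard error (the t-ratio)."""
    return coef / se


estimates = {
    "black": (1380, 6748),
    "yrs of educ": (7320, 117),
    "black*yrs educ": (-982, 496),
}

ratios = {name: ratio_to_se(c, se) for name, (c, se) in estimates.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.2f}")
```

The ratio for the black indicator is far below two in absolute value, the ratio for years of education is far above it, and the ratio for the interaction term is just under two, which is consistent with the stars shown in the table.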

Specification form

Often when doing regressions, researchers will use the natural log of earnings rather than simply earnings as the dependent variable:

ln(earnings_i) = α + β1*black_i + β2*schooling_i + e_i

This is done for two reasons:

1. This specification often “fits” the data better, as the log transformation makes a variable with a highly skewed distribution closer to a normal distribution, which generally helps the regression fit.

2. The coefficients can be roughly interpreted as the percentage change in the dependent variable associated with a unit change in the corresponding control variable (i.e., a semi-elasticity), rather than how the level of the dependent variable changes given a unit change in the corresponding control variable.

Specification form

Earnings regression
                    Coefficients
Characteristic      (1) earnings    (2) ln earnings
black               -10,381***      -0.23***
                    (1,126)         (0.02)
yrs of educ         8,888***        0.15***
                    (138)           (0.002)
intercept           -70,003***      8.39***
                    (1,968)         (0.034)
R-squared           0.15            0.16
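Reading a log-specification coefficient as a percentage change is an approximation that works best for small coefficients. A minimal sketch of the exact conversion, 100*(exp(β) - 1), applied to the two coefficients in column (2); the function name `pct_change` is illustrative:

```python
import math


def pct_change(beta):
    """Exact percent change in y implied by a coefficient in a ln(y) regression."""
    return 100 * (math.exp(beta) - 1)


black_pct = pct_change(-0.23)   # rough reading: -23 percent
school_pct = pct_change(0.15)   # rough reading: +15 percent per year

print(f"black: {black_pct:.1f}%")
print(f"one more year of schooling: {school_pct:.1f}%")
```

The exact figures are close to, but not the same as, the rough readings of -23 and +15 percent, and the discrepancy grows as coefficients get larger in absolute value.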

Omitted variables

If we are really interested in the wage gap between black workers and white workers after conditioning on years of education, what are we missing from the basic specification that might obscure the answer we are really looking for?

E[ln(earnings_i)] = α + β1*black_i + β2*schooling_i

Omitted variables

E[ln(earnings_i)] = α + β1*black_i + β2*Hispanic_i + β3*schooling_i

What will this likely do to the coefficient on the black indicator?

Earnings regression (ln earnings)
                    Coefficients
Characteristic      (1)         (2)
black               -0.23***    -0.27***
                    (0.02)      (0.02)
Hispanic                        -0.18***
                                (0.016)
yrs educ            0.15***     0.15***
                    (0.002)     (0.002)
intercept           8.39***     8.51***
                    (0.034)     (0.036)
R-squared           0.16        0.17

Omitted variables

What about other things like age and region? These things are surely associated with earnings, therefore don’t they need to be included?

Earnings regression (ln earnings)
                    Coefficients
Characteristic      (1)         (2)         (3)
black               -0.23***    -0.27***    -0.27***
                    (0.02)      (0.02)      (0.02)
Hispanic                        -0.18***    -0.17***
                                (0.016)     (0.016)
yrs educ            0.15***     0.15***     0.14***
                    (0.002)     (0.002)     (0.002)
age                                         0.03***
                                            (0.001)
midwest                                     -0.10***
                                            (0.017)
south                                       -0.02
                                            (0.02)
west                                        -0.035**
                                            (0.017)
intercept           8.39***     8.51***     7.57***
                    (0.034)     (0.036)     (0.050)
R-squared           0.16        0.17        0.19

Omitted variables

In the end, it is not necessary to control for every possible thing that can affect the dependent (y, or left-hand-side) variable.

What to control for depends on your question of interest.

Robustness: a finding is said to be relatively robust if the basic qualitative finding is unchanged by the inclusion of further variables, by adding more interaction terms (i.e., the combination of two existing variables, such as the term black*years of school), or by changes in specification form (e.g., log transformation of the dependent variable).

Selection

Be wary of making causal inferences from significant correlations.

In particular, there are often issues of sample selection/endogeneity/omitted variables.

Many characteristics are often the products of choice (often called endogenous characteristics).

In such cases it is hard to identify how the outcome of interest depends on that endogenous characteristic, versus other unobserved/omitted characteristics that determined that choice.

Consider the Brooklyn Bridge “effect” on wages.

Selection

More specifically, what if we wanted to estimate the effect of being in a gang on individual criminality, or the effect of being married on criminality?

Suppose we estimated

E[y] = α + β1*Gang + β2*x2 + β3*x3

or

E[y] = α + β1*Marriage + β2*x2 + β3*x3

Will this tell us what we want to know?

What if we further controlled for income, neighborhood, education, and lots and lots of other stuff in the other x’s?

Selection

Table 1 - Pooled Cross-Sectional Poisson Regression Model Results

                      Individual Characteristic Variable of Interest (i.e., "Treatment")
                      In Gang        Married        Smoked
Dependent             Coefficient    Coefficient    Coefficient
Variable              (SE)           (SE)           (SE)
Total Crimes          2.383          -1.120         1.518
                      (0.095)***     (0.201)***     (0.089)***
Property Crimes       2.933          -1.450         1.464
                      (0.143)***     (0.340)***     (0.146)***
Assaults              2.628          -0.790         1.123
                      (0.106)***     (0.310)**      (0.107)***
Drug Sales            1.811          -1.197         1.818
                      (0.121)***     (0.298)***     (0.120)***

Selection

In general, we are often interested in estimating the expected effect of increasing some variable x1 on some outcome variable y.

But it is often the case that x1 isn’t randomly determined for each person; rather, it is chosen (gang status, marriage).

Moreover, people who choose different amounts of x1 might be expected to have different values of y even if they didn’t choose a different x1.

Essentially, there is some unobservable variable z that may impact both an individual’s expected value of y and his expected value of x1.

When we estimate E[y] = α + β1*x1 + β2*x2 + β3*x3, β1 reflects both the impact of x1 on y and the impact of the unobserved z on y (since z affects x1). A basic regression can’t separately identify these two mechanisms.
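The problem can be illustrated with a tiny made-up example in which x1 has no effect on y at all, yet a naive comparison suggests one. Everything here is hypothetical: the records are invented, and z is treated as visible only so the within-group comparison can be shown.

```python
# Made-up records of (z, x1, y). By construction y = 10 + 5*z, so x1 has
# NO effect on y; people with z = 1 are simply more likely to choose x1 = 1.
records = [(1, 1, 15)] * 3 + [(1, 0, 15)] + [(0, 1, 10)] + [(0, 0, 10)] * 3


def mean_y(rows):
    return sum(y for _, _, y in rows) / len(rows)


def diff_by_x1(rows):
    """Mean y for those with x1 = 1 minus mean y for those with x1 = 0."""
    chose = [r for r in rows if r[1] == 1]
    did_not = [r for r in rows if r[1] == 0]
    return mean_y(chose) - mean_y(did_not)


naive = diff_by_x1(records)
within_z = [diff_by_x1([r for r in records if r[0] == z]) for z in (0, 1)]

print(naive)     # 2.5, even though x1 does nothing
print(within_z)  # [0.0, 0.0] once z is held fixed
```

In practice z is unobserved, so the within-z comparison is not available, which is why the naive estimate cannot be corrected simply by adding controls.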

Selection

One way to handle selection is to use what is referred to as an Instrument or Instrumental Variable.

The idea is to find something that is essentially random, or at least not a choice made by the individuals in the sample, that impacts the individual’s value of x1.

Consider Job Corps.

We want to know the impact of participating in Job Corps (x1) on earnings (y).

Problem: it isn’t random who participates in the program (i.e., who gets x1 = 1).

Consider the following simple model:

Selection

Two types of people, A’s and B’s, are eligible for Job Corps.

Type    Job Corps    No Job Corps
A’s     $40,000      $34,000
B’s     $30,000      $31,000

Suppose that, if given the chance, only A’s would enroll, not B’s (which everyone knows).

So the true impact of the program on those who would participate is $6,000.

The researcher wants to uncover this effect, but doesn’t know the above information and can’t observe each person’s type.

If Job Corps offered access to all eligibles, what would the estimated impact be if one just compared Job Corps participants to non-participants?
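The naive comparison can be worked through with the numbers in the table above. A minimal sketch, assuming (as the slide does) that all A’s and no B’s enroll once everyone has access; the variable names are illustrative:

```python
# Earnings from the table above: (with Job Corps, without Job Corps).
earn = {"A": (40_000, 34_000), "B": (30_000, 31_000)}

# With open access, every participant is an A and every non-participant is a B.
mean_participants = earn["A"][0]
mean_non_participants = earn["B"][1]

naive_estimate = mean_participants - mean_non_participants
true_effect_on_participants = earn["A"][0] - earn["A"][1]
selection_bias = naive_estimate - true_effect_on_participants

print(naive_estimate)               # 9000
print(true_effect_on_participants)  # 6000
print(selection_bias)               # 3000
```

The bias equals the gap between what A’s and B’s would earn without the program, which has nothing to do with the program itself.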

Selection

Two types of people, A’s and B’s, are eligible for Job Corps.

Type    Job Corps    No Job Corps
A’s     $40,000      $34,000
B’s     $31,000      $30,000

Now suppose the researcher randomized access to who could enroll (this access can be called an Instrumental Variable).

Amongst winners, half would enroll (A’s) and half wouldn’t (B’s). Average earnings would be 0.5*40,000 + 0.5*30,000 = $35,000.

Amongst losers, none would enroll. Average earnings would be 0.5*34,000 + 0.5*30,000 = $32,000.

Comparing winners vs. losers, estimated impact?

Comparing winner participants vs. losers, estimated impact?

Instrumental Variable (IV), estimated impact?
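The three comparisons can be sketched with the numbers on this slide. This is a minimal illustration, assuming equal shares of A’s and B’s as stated above, and forming the IV estimate as the winners-minus-losers difference in earnings divided by the winners-minus-losers difference in enrollment rates (often called the Wald estimator). The names are illustrative.

```python
# Earnings from the table above: (with Job Corps, without Job Corps).
earn = {"A": (40_000, 34_000), "B": (31_000, 30_000)}
share = {"A": 0.5, "B": 0.5}        # half of eligibles are each type
enrolls = {"A": True, "B": False}   # only A's enroll when given access


def mean_earnings(has_access):
    """Average earnings in a group that does or does not have access."""
    total = 0.0
    for t, (with_jc, without_jc) in earn.items():
        participates = has_access and enrolls[t]
        total += share[t] * (with_jc if participates else without_jc)
    return total


winners = mean_earnings(True)    # 35,000
losers = mean_earnings(False)    # 32,000

winners_vs_losers = winners - losers
participants_vs_losers = earn["A"][0] - losers
take_up_winners = sum(share[t] for t in earn if enrolls[t])
take_up_losers = 0.0
iv_estimate = winners_vs_losers / (take_up_winners - take_up_losers)

print(winners_vs_losers)       # 3000.0
print(participants_vs_losers)  # 8000.0
print(iv_estimate)             # 6000.0
```

Only the IV estimate recovers the $6,000 effect on those who would participate; the winners-vs-losers comparison is diluted by non-enrollees, and the participants-vs-losers comparison is inflated by selection.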

Selection

In the Job Corps study we read, randomization into “treatment” and “control” was explicit, but it was effectively an IV.

Often researchers can’t randomize explicitly, so they get creative.

Look for “natural experiments” that effectively do the randomization.

Ask: what could impact an individual’s realization of x1, but should not be at all correlated with an individual’s expected outcome y?

Summary

In summary,

The coefficient on a given variable tells you the expected change in the outcome of interest associated with a one unit change in that variable, after controlling for all of the other included characteristics.

Little credence should be given to imprecisely estimated coefficients (i.e., those with standard errors large enough that they are not statistically different from zero), especially when hypothesis testing.

One of the key details of a paper is the “empirical strategy” it uses to deal with selection effects. Much of this class will be spent discussing the various empirical strategies authors use in the papers we read.

In the end, use your empirical intuition: can this data really answer the question of interest?