EC320 FINAL STUDY GUIDE:
1. Section 1: Econometrics
   a. Purposes b. Process c. Model d. Properties of variables
2. Section 2: Review of statistics
   a. Random variables b. PDFs c. CDFs d. Expected value e. Variance f. Correlation coefficient
3. Section 3: Simple regression analysis
   a. Assumptions b. Statistical inference c. OLS method d. Properties of estimators e. Deriving regression coefficients (simple model) f. Residual g. Variances (precision) of regression coefficients h. Standard errors of regression coefficients i. Sums of squares j. R2 k. Random/nonrandom components l. Types of data
4. Section 4: Hypothesis testing
   a. Formation of null b. One- vs two-tailed tests c. Testing hypotheses relating to regression coefficients d. Confidence intervals e. Type 1/2 errors f. P-value g. F-tests h. Interpreting Stata tables i. ANOVA table
5. Section 5: Multiple regression analysis
   a. Definition b. Assumptions c. Deriving a fitted model d. Efficiency e. Precision f. Standard errors of coefficients g. Multicollinearity h. Further analysis of variance i. Adjusted/corrected R2 j. Hedonic pricing (prediction)
6. Section 6: Nonlinear models & transformation of variables
   a. Linearity in var/par b. Linearizing functions nonlinear in variables c. Linearizing functions nonlinear in parameters d. Log model e. Semilog models f. Economies of scale g. Linear/semilog/loglog table h. Disturbance terms i. Comparing linear & logarithmic specifications j. Making RSS comparable: Box & Cox k. Models with quadratic and interactive variables l. Quadratic model m. Interactive explanatory variables n. Ramsey's RESET test o. Nonlinear regression
7. Section 7: Dummy variables
   a. Definition b. Reasons for use c. Combining multiple functions into one d. Standard errors & hypothesis testing of dummies e. More elaborate dummies f. Fitting individual implicit costs g. Changing reference categories h. Dummy trap i. Multiple sets of dummies j. Slope dummy (interactive) k. F-testing dummies l. Dummies in log/semilog m. Chow test n. Dummies vs Chow
8. Section 8: Specification of regression variables
   a. Consequences (table) b. Omitted variable bias c. Invalidation of statistical tests d. Proxy variables e. Testing a linear restriction f. F-testing a restriction g. T-testing a restriction h. Multiple restrictions i. Zero restrictions
Some notes: While there will obviously be repeated information from my previous guides, I did my best to alter it in a way that would possibly make it more understandable (as I came to understand the topics better over time), less repetitive, and more relevant to the course. We have built on topics such as Z-tests, and our methods for working through processes (such as testing hypotheses) have changed.
This is not a definitive list, although with this many pages I clearly put in everything that I could think of as relevant. I do want to note that I didn't cover Chapter 6 (Section 8) as well as I could have, and I apologize. However, I spent a lot of time compiling the rest of the information, and I hope that you find it helpful.
Other study tips: Review the HW. Even if you can’t practice the actual Stata work, practice the
general information that it covers by hand.
Review old midterms and finals. So far, we’ve seen pretty similar tests from the past this term. I wanted to point out a few topics on the test that seem to be covered (especially on True/False) on
every single test:
Properties of R2 (even the obscure ones). Know when, why, and by how much R2 changes in specific situations
F-tests and degrees of freedom
Sums of squares
Variances
Correlations
Differences between simple and multiple regression assumptions and processes
Estimators: finding them, properties, etc.
OLS methods
R2! It seems to come up on every test
Errors (probabilities, definitions, etc.)
Properties of the null hypothesis
Be sure to make a good note sheet. Try to analyze the trends of test questions in the past, and figure out what to expect on this.
Good luck! Feel free to email me or send a message directly through Notehall with questions, rather
than leaving a nasty review if I made a small mistake. If you liked the guide, please rate it!
Section 1: Introduction
Econometrics: the application of statistical and mathematic methods to the analysis of economic data, with a
purpose of giving empirical content to economic theories and verifying/refuting them (Maddala)
Purposes of econometrics
1. Put empirical content to theory
2. Conduct hypothesis tests
3. Forecast
Process of econometrics
Problem: What are you doing?
Theory: Identify a theorized relationship, figure out what you need from the theory to address your
problem
Represent this need in an econometric model
Collect data
Estimate the model
o Measure the accuracy of the estimated model with specification tests/diagnostics
Answer the problem by describing it, testing your hypothesis, and forecast
Econometric model: Yi = β1 + β2X2i + β3X3i + … + βKXKi + ui
Y: dependent/explained/caused/endogenous variable
β's: unknown but constant parameters (estimated as b’s)
X’s: known and observed variables. Independent/explanatory/causal/exogenous. Variables that
explain Y
u: A random error to capture unknown influences on Y
Properties of variables
Y is caused by X’s, and X’s are not caused by Y
Y’s and X’s are measured w/o error
The X’s are not correlated with u’s
Section 2: Review of statistics
Random Variables (RVs): any variable whose value can’t be predicted exactly
Discrete RVs have a specific set of possible values
o EX: The outcomes of rolling a die are specific integers. There is no possibility of getting a 1.5
Continuous RVs can take on any value in a given range (infinite # of possibilities)
o EX: The temp in a room can fall at any value between 55 and 85 degrees. This means that it
could be 60.74274 or 55.1123 degrees
o The chance of getting a specific value of X in this case is 0, because the prob would equal 1/∞
Probability distributions: pdfs are formulas that give the prob of getting different values of the RV
Prob density FNs (pdfs) show this graphically
Uniform distributions: the chances of getting specific values of the RV X are equal for all outcomes.
If a pdf is distributed uniformly, the chances of getting a specific X value is constant across a range,
and is zero outside of the range
This should be review, so I’m not going to cover it extensively
This can be calculated geometrically (base × height = 1) or with integrals.
Cumulative distributions: a function derived from the pdf
Gives the prob of X being less than or equal to some constant
Remember that the prob of X taking a specific value is zero if it’s
continuous, and the prob of X taking a value between -∞ and ∞ is 1
The base * height approach works for uniform dists, but you’ll need integrals for cdfs
To derive a cdf:
o Take the integral of the pdf: ∫ from 0 to c of (1/4) dX = [X/4] from 0 to c = c/4, where c is any given X value, and 1/4 is the height of the associated pdf
o The height of the cdf F(X) at a given X value gives the chance of getting a value between -∞ and X
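A quick numeric check of this derivation in Python (the uniform range 0 to 4 and the pdf height of 1/4 come from the example above; the snippet itself is just an illustration):

    # Integrate the uniform pdf from 0 to c and compare against c/4
    from scipy.integrate import quad

    pdf = lambda x: 1 / 4            # height of the uniform pdf on [0, 4]
    for c in (1.0, 2.5, 4.0):
        area, _ = quad(pdf, 0, c)    # numerical integral of the pdf from 0 to c
        print(c, area, c / 4)        # the integral matches c/4 each time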
Expected values (EV): the EV of an RV is often described as its population mean
EV of a discrete RV: the weighted avg of all outcomes times their associated probs
E(X) = Σ xi·pi = x1p1 + … + xnpn
o If g(X) is any FN of X, then E{g(X)} = Σ g(Xi)pi = g(X1)p1 + … + g(XN)pN
EV of a continuous RV: E(X) = ∫ from -∞ to ∞ of X·f(X) dX, where f(X) is the pdf
Population variance (var): measures the spread of a prob dist
Var(X) = σX² = E[(X - μ)²]
= (X1 - μ)²p1 + … + (XN - μ)²pN = Σ(Xi - μ)²pi
Popvar of an RV can also be written: σX² = E(X²) - μX²
In the regression model, all of the random variation in Y comes from u, so var(Y) = var(u)
Sample variance: an estimator of σX² calculated from a sample of N observations: sX² = (1/(N-1)) Σ(Xi - X̄)²
Population covariance (cov): measures whether or not X & Y move together
cov(X, Y) = σXY = E[(Xi - μX)(Yi - μY)]
Unlike the correlation coefficient, cov is not bounded between -1 and 1; its size depends on the units of X & Y
Negative cov: X & Y move in opposite directions
If two variables are independent, cov = 0 (the converse does not have to hold)
Sample cov: SXY = (1/(n-1)) Σ(Xi - X̄)(Yi - Ȳ)
Population correlation coefficient: ρXY = σXY / √(σX²·σY²)
cov(X,Y) is unsatisfactory bc it depends on the units of measurement of X & Y
The pop corr coef is better because it is not subject to changes in the units of measure
Ranges btwn -1 and 1, w/ neg numbers showing a negative correlation, pos showing positive
If equal to zero, X & Y are uncorrelated (independence implies zero correlation, but zero correlation does not guarantee independence)
Sample corr coef: rXY = Σ(Xi - X̄)(Yi - Ȳ) / √(Σ(Xi - X̄)²·Σ(Yi - Ȳ)²)
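A small Python sketch of these sample formulas on made-up numbers (the X and Y values here are hypothetical; any data would do):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    n = len(X)

    s2_X = np.sum((X - X.mean())**2) / (n - 1)                  # sample variance of X
    S_XY = np.sum((X - X.mean()) * (Y - Y.mean())) / (n - 1)    # sample covariance
    r_XY = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sqrt(
        np.sum((X - X.mean())**2) * np.sum((Y - Y.mean())**2))  # sample correlation
    print(s2_X, S_XY, r_XY)   # r_XY is unit-free and always lies between -1 and 1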
Section 3: Simple Regression analysis
Assumptions of the classical model for simple regressions (one X value)
* Note that these assumptions will be referred to by number later in the study
guide
1) The model is linear in parameters (the β's) and correctly specified
o Linear: Y = β1 + β2X + u
o Not linear: Y = β1X^β2 + u
2) There is some variation in the regressor in the sample, and it is measured without error
o While b’s and Y’s may be estimated, X is a true value
o A population that is unique and varied could be referred to as "heterogeneous"; one with little variation is "homogenous"
o A homogenous pop would NOT provide data that would allow us to extrapolate to low & high values of X
3) The disturbance term (u) has an expectation of zero
E(ui) = 0 for all i
4) The disturbance term is homoscedastic (its values have a constant pop variance)
σui² = σu² for all i
If assumption 4 is not satisfied, the OLS reg coefs will be inefficient
5) The values of u have independent dists (independence assumption)
o This means that u is not subject to autocorrelation (no systematic association btwn ui & uj)
o This implies that the cov of ui and uj is equal to zero for i ≠ j
6) ui has a normal distribution
Assumptions 3-6 might be referred to as properties of errors
Statistical inference: is defined as “drawing conclusions based on data” to describe the population
While your sample won’t describe the entire population perfectly, estimating an econometric model
allows you to get a good guess of population characteristics
Note that a fitted model fits the sample data better than it fits the population. This is by design.
Ordinary least squares method: best way to obtain estimators. The Gauss-Markov theorem states that OLS
is BLUE:
Best (smallest variance)
Linear (combinations of the Yi)
Unbiased [E(bj)=βj]
Estimator of regression parameters
Estimator properties:
Since we can never know their true values, we estimate β1 and β2 with b1 and b2
1) Unbiasedness: we want the EV of the estimator to be equal to the pop characteristic.
a. E(b1)=β1
b. See p118 for proof
2) Efficiency: want pdf to be as concentrated as possible around the mean (want pop var to be as
small as possible)
3) Consistency: an estimator is said to be consistent if it has a prob limit, so that its dist closes into a spike as N grows. The spike is located at the true value of the characteristic you are trying to estimate. (This collapsing behavior is a law-of-large-numbers result; the central limit theorem is the companion result describing the estimator's distribution approaching a normal shape.)
Deriving estimators of regression coefs (simple model)
We know that b1 = Ȳ - b2X̄ (the intercept) and b2 = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)² (the slope)
We've already covered this extensively, but if you want further proof check out pages 85 through 92
Define the residual for each observation. The residual is the vertical distance between the actual and fitted values of Y
ei = Yi - Ŷi
Our goal is to minimize the residuals
We do this by calculating RSS, the residual sum of squares:
RSS = Σ(Yi - Ŷi)² = Σ ei² = e1² + e2² + … + eN²
To get the best fit, we want to minimize RSS
First-order conditions for a minimum: take the partial derivatives of RSS with respect to b1 and b2 and set them equal to zero
o ∂RSS/∂b1 = 0 and ∂RSS/∂b2 = 0
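A minimal Python sketch of the resulting estimators on hypothetical data:

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)  # slope
    b1 = Y.mean() - b2 * X.mean()                                             # intercept
    e = Y - (b1 + b2 * X)                                                     # residuals
    print(b1, b2, np.sum(e**2))                                               # estimates and RSS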
Variances (precision) of the reg coefficients
σb1² = σu² × (1/N + X̄² / Σ(Xi - X̄)²)
σb2² = σu² / Σ(Xi - X̄)²
o MSD(X) = (1/N) Σ(Xi - X̄)²: the size of the variations in X around its mean
Standard error of the reg coefficients:
In reality, you can't calculate the popvars of b1 or b2, bc σu² is unknown. However, we can find an estimator of it with su² (derivation on p130)
se(b1) = √( su² × [1/N + X̄² / Σ(Xi - X̄)²] )
se(b2) = √( su² / Σ(Xi - X̄)² )
SE only gives a general guide to the likely accuracy of a reg coefficient
o SE gives you an idea of the narrowness of the estimator’s pdf
o However, it does not tell you whether your estimate comes from the middle of the pdf, or one
of the tails
With a greater σu², the sample variance of the residuals is likely to be higher
o This means the SEs of the coefficients will be higher
o This reflects the risk of getting inaccurate estimators
Sum of squares (SS)
ESS is the variation explained by the model
o "Explained" SS
o ESS = Σ(Ŷi - Ȳ)²
o Fitted value minus sample mean
RSS is the variation not explained by the model
o "Unexplained" SS
o RSS = Σ ei² = Σ(Yi - Ŷi)²
o True value minus fitted value
TSS = ESS + RSS
o TSS = Σ(Yi - Ȳ)²
o True value minus sample mean of Y
Measuring fit: R²:
Coefficient of determination; a widely used measure of fit
R² = ESS/TSS = Σ(Ŷi - Ȳ)² / Σ(Yi - Ȳ)² = 1 - RSS/TSS = 1 - Σ(Yi - Ŷi)² / Σ(Yi - Ȳ)²
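A short sketch confirming the decomposition on the same hypothetical data as the OLS sketch above (ESS/TSS and 1 - RSS/TSS give the same R²):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
    b1 = Y.mean() - b2 * X.mean()
    Y_hat = b1 + b2 * X

    ESS = np.sum((Y_hat - Y.mean())**2)   # explained sum of squares
    RSS = np.sum((Y - Y_hat)**2)          # residual (unexplained) sum of squares
    TSS = np.sum((Y - Y.mean())**2)       # total sum of squares (= ESS + RSS)
    print(ESS / TSS, 1 - RSS / TSS)       # both expressions give R^2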
R2 is constrained to values between 0 and 1
Its value can be thought of as the percent of data explained by the model
Adding variables to the model (increasing K) will not decrease R2
The addition of variables to a model will typically increase R2, but often by negligible amounts
o R2 will increase as K goes up, until K = N (at which point the model fits the sample perfectly and R² = 1)
Added observations can decrease R2. If you add an observation that’s an outlier in the population (very
different from the rest), it can contribute to the fitted model being less accurate
Note that in the case of multiple reg models, it is impossible to measure each explanatory variable’s
contribution to the overall R2
I noticed that the properties of R2 (even some obscure ones) have shown up in the T/F questions in
every test (old and current). Make sure you are very comfortable answering questions about it
Random/nonrandom components: With the model we're using, Y depends on the non-random X according to Yi = β1 + β2Xi + ui, and we fit it with Ŷi = b1 + b2Xi
Stochastic (random) components: random, unpredicted variation
Nonrandom component of Y: β1 & β2 may be unknown, but they are fixed constants
Random component: u
To decompose an estimator into fixed & random components, see p116
Types of data
Cross-sectional data: observations relating to units of observation at one moment in time (units could
be ppl, households, companies, countries, etc)
Time series data: repeated observations through time on the same subjects (ex: quarterly GDP)
Panel data: basically a hybrid of the two above, repeated observations on the same elements through
time
Section 4: Hypothesis testing
Forming a null hypothesis
If you can, always specify the null as the thing that you don’t believe (Straw person’s principle),
and the HA as the thing you want to prove to be true
We believe the null until the departure of X̄ from µ0 is so great that we cannot accept it
Closed parameters: you must account for all possible values of the test stat in H0 & H1.
o Closed: H0: β1 ≥ 0, H1: β1 < 0
o Not closed: H0: β1 = 0, H1: β1 < 0
Note that this null/alternative pair does not account for positive values of β1
One vs two tailed tests
Two-tail test: testing whether your parameter has a specific value or not. EX: HA: β2 ≠ 3
One-tail: you wish to prove that the true parameter lies 1) below or 2) above a specific value
o EX: HA: β2 > 0
o tcrit will be lower in the one-sided test, making it easier to prove your belief
Testing hypotheses relating to reg coefficients:
We originally looked at hypotheses through the z-test, but now we've realized that there are too many factors at play to use such a simple test (ex: df)
o In the case of simple regression coefficients, we test hypotheses with the t-distribution
o The t dist is symmetric and bell shaped, but has heavier tails. This means that it's more likely to produce values that are very far from the true mean
As before, reject the null if the difference between b2 and the hypothetical value β2⁰ is too great.
o To find the critical value, refer to the t table, and find where the significance level (top of the table) intersects the degrees of freedom (DF) (left of table)
DF = N - K: sample size (N) minus the number of parameters estimated (K)
K is equal to 2 in a simple regression, since we estimate 2 parameters (β1 & β2)
o The difference is measured in terms of standard errors:
o t = (b2 - β2⁰) / se(b2)
o Where β2⁰ is the value stated in the null
o Reject if |t| > tcrit
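A sketch of the test arithmetic with hypothetical numbers (b2, se(b2), the null value, and N are all made up), pulling tcrit from the t distribution instead of a printed table:

    from scipy.stats import t as t_dist

    b2, se_b2, beta2_0 = 2.5, 0.25, 2.0   # hypothetical estimate, SE, and null value
    N, K = 30, 2                          # hypothetical sample size; K = 2 in a simple reg

    t_stat = (b2 - beta2_0) / se_b2               # departure measured in standard errors
    t_crit = t_dist.ppf(1 - 0.05 / 2, df=N - K)   # two-tailed critical value, 5% level
    print(t_stat, t_crit, abs(t_stat) > t_crit)   # reject H0 if |t| > tcrit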
The t-test helps you to determine whether X has some effect on Y, rather than a specific effect
Confidence intervals: the range of hypothetical values of the parameter that would not be rejected by the test
Sometimes different confidence intervals will produce different results.
Percentage level measures how sure you are with your result. This means that at the 5% level you’re
“pretty sure” of the result, but at the 1% level, you’re “VERY sure”
If the t stat is VERY high, try testing it at the 0.1% level (99.9% confident). This reduces the risk of a
Type 1 Error
True model: Yi = β1 + β2Xi + ui
Fitted model: Ŷi = b1 + b2Xi
The reg coefficient b2 is incompatible with the hypothetical value β2⁰ if either
(b2 - β2⁰)/se(b2) > tcrit or (b2 - β2⁰)/se(b2) < -tcrit. In either case, reject the hypothetical value
A hypothetical β2⁰ must satisfy this double inequality to be compatible with the data (two-tail):
b2 - se(b2)·tcrit ≤ β2⁰ ≤ b2 + se(b2)·tcrit
Ex: 1.999 ≤ β2 ≤ 2.911. In this case, we would reject hypothetical values below 1.999 or above 2.911
Error types
Type 1 error: the null is rejected when it's actually true
o Think of a type 1 error as the conviction of an innocent person
o Prb(type 1) is the size of the rejection region. In a 5% test, a true null is rejected 5% of the time
Type 2 error: the null isn't rejected, but it's false
o Think of a type 2 error as letting a guilty person go free
o Prb(type 2) is beta. Find the probability of falling below your rejection region's t or z value, then subtract it from 1 (ex: z = 1, prb(type 2) = 1 - 0.8413 = 0.1587)
To see type 1/2 errors graphically, look at problem 3 on the winter 2010 MT2, and IV(d) on this year's MT2
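The z-table arithmetic in the example above can be reproduced with the standard normal CDF; a one-line sketch:

    from scipy.stats import norm

    print(1 - norm.cdf(1.0))   # ~0.1587, the probability beyond z = 1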
P-value: an alternative approach to reporting the significance of reg coefficients
Notation in stata table: 𝑷 > |𝒕|
P is the prb of obtaining the corresponding t stat as a matter of chance (if null is true)
A p-value less than 0.01 means that the prb is less than 1%, which means that the null would be
rejected at the 1% level
The p-value approach tells you more than the 5%/1% approach, bc it gives the exact prb of a type 1
error if the null were true
F-tests:
Even if there is no relationship between Y & X, in any sample there may appear to be one
Even when β2 = 0, the sample R² will exceed zero except by coincidence, purely through chance correlation
F-tests test whether R² is reflecting a true relationship, or if it's just a coincidence
In other words, F-tests give you a critical value for R², which gives a cutoff point for when you can declare that X causes Y (at a given significance level)
o H0: β2 = 0 (no relationship between Y & X)
F = [ESS/(K-1)] / [RSS/(N-K)] = [R²/(K-1)] / [(1-R²)/(N-K)]
o In a simple regression (K = 2) this reduces to F = R² / [(1-R²)/(N-2)]
o where K = # of parameters in the fitted reg model
The critical value of F gives a cutoff point (just like t), at which point you can conclude that the
variables are correlated
o To find the critical value of F, refer to the corresponding table which matches your confidence
interval
o Find where the df(num) and df(denom) intersect. This gives you Fcrit
o If your calculated value of F is greater than Fcrit, you can reject the null
o For our purposes we only need to find Fcrit that lies to one side of the distribution (think of it as
a one-tailed test)
Only in simple regressions:
o The F-test and the t-test share the same H0: β2 = 0 and HA: β2 ≠ 0
o Fcrit equals the squared tcrit of a two-tailed test (and F = t²)
o This is only true in simple models! F & t play very different roles in multi reg models
o Proof on P147
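A sketch of the F calculation with hypothetical values of R², N, and K, pulling Fcrit from the F distribution rather than a table:

    from scipy.stats import f as f_dist

    R2, N, K = 0.45, 30, 2                                # hypothetical values
    F = (R2 / (K - 1)) / ((1 - R2) / (N - K))             # F-stat computed from R^2
    F_crit = f_dist.ppf(1 - 0.05, dfn=K - 1, dfd=N - K)   # one-sided 5% critical value
    print(F, F_crit, F > F_crit)                          # reject H0: beta2 = 0 if F > Fcrit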
* Note that the ANOVA table below was moved here from the top left portion of the Stata output to make room
ANOVA table (Stata): analysis of variance

Source      SS    df    MS = SS/df
Model       ESS   K-1   ESS/(K-1)
Residual    RSS   N-K   RSS/(N-K)
Total       TSS   N-1   TSS/(N-1)

K = number of parameters estimated (including the intercept)
K-1 = number of RHS (right-hand side) variables (not including the intercept)
N = number of observations
Section 5: Multiple Regression Analysis
Definition:
Multiple reg models are 3D. There are multiple independent variables
EX: True model: Yi = β1 + β2X2i + β3X3i + ui
Our objective is to discriminate between the effects of each variable on Y
o We do this by varying the X of interest, while holding the others constant
o EX: if PAY-hat = -12 + 2.2S + 0.7EXP
o -12 implies that someone with no experience and no education would have to pay $12 per hour to work. This is not realistic (solution later)
o The coef on S implies an extra $2.20/hr for each year of education completed
o The coef on EXP implies an extra $0.70/hr for each year of experience
Assumptions for multiple reg coefficients
1) Model is linear in parameters and correctly specified
(Yi = β1 + β2X2i + β3X3i + ui)
2) There is no exact linear relationship btwn regressors in the sample
a) This is the only assumption that differs between the simple & multi reg models
b) See section on multicollinearity
3) Disturbance term has zero expectation
i) E(ui) = 0 for all i
4) u is homoscedastic (constant popvar)
i) σui² = σu² for all i
5) ui is distributed independently of uj for all j ≠ i
6) Disturbance term has a normal dist (needed for the usual t & F tests to be exact)
Deriving a fitted model w/ RSS for multiple reg models
We still use RSS to measure goodness of fit
RSS = Σ ei²
ei = Yi - Ŷi = Yi - b1 - b2X2i - b3X3i
Thus, RSS = Σ(Yi - b1 - b2X2i - b3X3i)²
To get the first-order conditions for a minimum, take the partial derivatives with respect to b1, b2, and b3:
∂RSS/∂b1 = -2 Σ(Yi - b1 - b2X2i - b3X3i) = 0
∂RSS/∂b2 = -2 Σ X2i(Yi - b1 - b2X2i - b3X3i) = 0
∂RSS/∂b3 = -2 Σ X3i(Yi - b1 - b2X2i - b3X3i) = 0
Solving: b1 = Ȳ - b2X̄2 - b3X̄3, and
b2 = [Σ(X2i - X̄2)(Yi - Ȳ)·Σ(X3i - X̄3)² - Σ(X3i - X̄3)(Yi - Ȳ)·Σ(X2i - X̄2)(X3i - X̄3)] / [Σ(X2i - X̄2)²·Σ(X3i - X̄3)² - (Σ(X2i - X̄2)(X3i - X̄3))²]
b3: switch all of the X2's with X3's and vice versa in the b2 formula above
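For reference, these first-order conditions are the "normal equations", and stacking a column of ones alongside X2 and X3 lets you solve all of them at once with matrix algebra. A minimal sketch on hypothetical data (just a compact way of solving the same equations, not the course's Stata workflow):

    import numpy as np

    X2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    X3 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    Y = np.array([4.1, 4.9, 9.2, 9.8, 14.9, 15.2])

    X = np.column_stack([np.ones_like(X2), X2, X3])   # design matrix: intercept, X2, X3
    b = np.linalg.solve(X.T @ X, X.T @ Y)             # solves the normal equations (X'X)b = X'Y
    print(b)                                          # b1, b2, b3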
Efficiency
Gauss-Markov theorem proves that OLS yields the most efficient linear estimators of the parameters
(lowest possible variance)
Applies to multi & simple reg models
Precision (multireg)
True model: Yi = β1 + β2X2i + β3X3i + ui
Fitted model: Ŷ = b1 + b2X2i + b3X3i
Popvar of b2:
σb2² = [σu² / Σ(X2i - X̄2)²] × [1 / (1 - r²X2X3)]
σu²: pop variance of u
rX2X3: corr btwn X2 & X3
Replace Σ(X2i - X̄2)² with Σ(X3i - X̄3)² to get the pop variance of b3
Standard errors of coefs (multiple reg models)
se(b2) is √(estimated var(b2)): the SE of a coef is an estimate of its standard deviation
SE(b2) = √( [su² / Σ(X2i - X̄2)²] × [1 / (1 - r²X2X3)] )
Multicollinearity (multico):
The greater rX2X3 is, the greater σb2² is.
Greater risk of picking a coefficient that doesn't represent the population relationship
If N & MSD(X) are large, and var(u) is small, you could still get good estimates
Multico must be caused by a combination of high corr and one of the other components being unhelpful
Multico is an issue of degree, not a yes/no question
Multico is most common in time series data, since the same subjects are analyzed over time
Overcoming multico:
There are 4 factors responsible for the variances of the reg coefs:
1) σu²  2) N  3) MSD(X)  4) r²X2X3
Direct methods attempt to improve the conditions that are responsible for the variances of the reg coefs.
1) Reducing 𝝈𝒖2 :
a. u is the joint effect of all of the variables influencing Y
b. If you remember an important omitted variable (thereby contributing to u), adding it
back will reduce the popvar of u & the coefs
c. Adding more variables typically improves the model fit, but the effect is usually
insignificant
d. If additional variables are correlated with the old ones, SE could increase
2) Increase N: In time series data, take surveys in shorter intervals (quarterly instead of annually)
3) Increase MSD(X) in the survey design phase
a. Stratify sample (ex: all social statuses represented in housing survey)
4) In design phase, obtain a heterogeneous sample rather than a more homogenous one (easier
said than done)
Indirect methods
1) If correlated variables measure a similar concept, it might make sense to combine them into an
overall index
a. Ex: Vocabulary and grammar should be correlated
2) Drop some correlated variables that have insignificant coefs
a. You risk introducing bias if said variable really should be in the model
3) Use extraneous info concerning the coef of one of the variables
a. Ex: using an estimate found in a cross-sectional study in a time series model
b. See pg 174-175 for an extended example
4) Use a theoretical restriction (a hypothetical relationship among the params of a reg model)
a. EX: SPEAKING = β1 + β2X2 + β3X3 + u
b. Assume that β2 (grammar skills) is equally important to β3 (vocabulary) when determining overall public speaking skills
i. β2 = β3
c. Now, SPEAKING = β1 + β2(X2 + X3) + u
d. See example on p174 for more
Further Analysis of Variance
Using an F-test to see if joint marginal contribution of a group of variables is significant (when adding
more)
Given model: Y = β1 + β2X2 + … + βKXK + u, with K params
o Where ESS = ESSK
Next, add M-K variables and fit the model:
o Y = β1 + β2X2 + … + βKXK + βK+1XK+1 + … + βMXM + u
With ESS = ESSM
You now have explained an additional SS equal to ESSM - ESSK, using up an additional M-K df
Is the increase due to chance? Test it.
F-test verbally: F = (improvement in fit ÷ extra df used up) / (RSS remaining ÷ df remaining)
Since RSSM = TSS - ESSM, and RSSK = TSS - ESSK,
the appropriate F-stat is: F(M-K, N-M) = [(RSSK - RSSM)/(M-K)] / [RSSM/(N-M)]
Under the H0 that the additional variables contribute nothing to the original equation:
o H0: βK+1 = βK+2 = … = βM = 0
o The F-stat is distributed with M-K and N-M df
o See the table below. The upper half gives the ANOVA for the explanatory power of the original K-1 variables. The lower half gives it for the joint marginal contribution of the new variables
Source                      Sum of Squares (SS)        df    SS/df                 F-stat
Explained by original vars  ESSK                       K-1   ESSK/(K-1)            [ESSK/(K-1)] / [RSSK/(N-K)]
Residual                    RSSK = TSS - ESSK          N-K   RSSK/(N-K)
Explained by new variables  ESSM - ESSK = RSSK - RSSM  M-K   (RSSK - RSSM)/(M-K)   [(RSSK - RSSM)/(M-K)] / [RSSM/(N-M)]
Residual                    RSSM = TSS - ESSM          N-M   RSSM/(N-M)
Adjusted/corrected R² (R̄²):
Remember how R² can't decrease with additional variables? Adj. R² compensates for this by imposing a penalty for increasing K:
R̄² = 1 - (1 - R²)·(N-1)/(N-K) = [(N-1)/(N-K)]·R² - (K-1)/(N-K) = R² - [(K-1)/(N-K)]·(1 - R²)
It can be shown that the addition of a new variable to a model will cause adj R² to go up only if the absolute value of its t-stat is greater than 1
This means that if adj R² increases when K goes up, it doesn't necessarily mean that the coef of the new variable is significantly different from zero (|t| > 1 is a weaker standard than the usual critical values)
So an increase in adj R² does not by itself mean that the fit has genuinely improved
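A tiny sketch of the penalty at work, with hypothetical values of R², N, and K:

    def adj_r2(R2, N, K):
        # adjusted R^2 = 1 - (1 - R^2)(N - 1)/(N - K)
        return 1 - (1 - R2) * (N - 1) / (N - K)

    print(adj_r2(0.45, 30, 2), adj_r2(0.45, 30, 5))   # same R^2, larger K -> lower adj R^2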
Hedonic pricing: an example of prediction
assumes that the value of a good is determined by the combined values of its components
Hedonic pricing model: Pi = β1 + Σj βjXji + ui
o Fitted model: P̂ = b1 + Σj bjXji
Suppose a new variety of the good comes out with characteristics {X2*, X3*, …, XK*}
It's natural to predict that the price of the new variety should be given by:
P̂* = b1 + Σj bjXj*
Start by assuming that the good only has one relevant characteristic and we have fitted the simple reg model P̂i = b1 + b2Xi
Because the new variety of the good has the characteristic X = X*: P̂* = b1 + b2X*
Prediction error (PE): difference btwn the actual price (P*) and the predicted price (P̂*)
o PE = P* - P̂*
Assume the model applies to the new good, so the actual price is P* = β1 + β2X* + u*
PE = P* - P̂* = (β1 + β2X* + u*) - (b1 + b2X*)
E(PE) = E(β1 + β2X* + u*) - E(b1 + b2X*)
o = β1 + β2X* + E(u*) - E(b1) - X*E(b2)
o = β1 + β2X* - β1 - X*β2
o E(PE) = 0
The prediction is unbiased: on average the prediction error is zero (this does not mean any single prediction is exact)
This assumes that the reg model assumptions are met
Popvar of the PE
o σPE² = σu² × {1 + 1/N + (X* - X̄)² / Σ(Xi - X̄)²}
Obvious implications: the further the value of X* is from the sample mean, the larger the popvar. Also, var(PE) goes down as N goes up
Confidence interval for the actual outcome P*
o P̂* - (tcrit × se(PE)) < P* < P̂* + (tcrit × se(PE))
Chapter 4: Nonlinear Models & Transformation of variables
Linearity in variables/parameters
A model like Y = β1 + β2X2 + β3X3 + u is:
Linear in variables (lin in var): every term consists of a straightforward variable times a parameter
Linear in parameters (lin in par): every term consists of a straightforward parameter times a variable
Linearizing functions that are nonlin in vars:
Given a model that is nonlinear in a variable, for example Y = β1 + β2X² + u, define a new variable for the nonlinear term (Z = X², giving Y = β1 + β2Z + u)
This transformation is only cosmetic, but it gives us a function that is linear in var & par
Linearizing functions that are nonlin in params
e = β1 + β2(1/g) + u
Defining Z = 1/g, it's now linear in both categories:
e = β1 + β2Z + u
Logarithmic/loglinear model: Y = β1X^β2
- This model is nonlinear in parameters AND variables
When you see a function that looks like this, you can immediately say that the elasticity WRT (w/ respect to) X is constant and equal to β2
Regardless of how Y & X are related mathematically, or their definitions, the elasticity of Y WRT X is the proportional (%) change in Y for a given proportional change in X:
elasticity = (dY/dX) / (Y/X)
EX: if Y is demand for a commodity, and X is income, this defines the income elasticity of demand for that good
Rewrite: elasticity = (dY/dX) ÷ (Y/X), i.e. the marginal function divided by the average function. In the demand example, this could be seen as the marginal propensity to consume divided by the avg propensity to consume
o Example: If the relationship btwn Y & X takes the form Y = β1X^β2, then dY/dX = β1β2X^(β2-1)
o Therefore, elasticity = β1β2X^(β2-1) ÷ (β1X^β2 / X) = β2
Semilog models
Common functional form: Y = β1e^(β2X), where β2 is the proportional change in Y per unit change in X.
This is shown by differentiating:
o dY/dX = β2·β1e^(β2X) = β2Y
Therefore,
o (dY/dX) / Y = β2
Note that this same function can be made linear in params by logging both sides: logY = logβ1 + β2X
Only the left side is logarithmic in variables (logβ1 is just a constant parameter), still making it a semilog model
Economies of scale
In class, there have been a few questions involving “economies of scale”
The Economist defines economies of scale as factors that cause the average cost of producing
something to fall as the volume of output increases
o Example: cell phone providers. Smaller companies are unable to compete due to the expensive
infrastructure that the market requires (like service towers)
o However, these seemingly large costs are negligible for very large firms. Through their ability to
make large investments, they are able to exploit the fact that not everyone is able to produce
as cheaply as them
In economies of scale, LR average costs go down as Q
goes up (opposite in perfectly competitive ones)
This is most commonly found in markets where fixed
costs are significant.
With higher output, these fixed costs are insignificant to
larger firms. If a small firm is a part of this industry, they
would be unable to afford the production level required
to stay competitive
In the question on Midterm 2, the function was given as LN(cost) = β1 + β2 ln(Q)+u, with a null
hypothesis being that it was a competitive market. You believed that economies of scale were present,
estimating a slope of 0.6
As can be seen in the table below, the slope coefficient of this log model (β2) tells you the effect of a
1% increase in quantity on the percentage change in costs
On the test question, you found a slope of 0.6, implying that increasing Q by 1% would be
accompanied by a 0.6% increase in cost
This would imply that economies of scale are present, since in a perfectly competitive market, the
increase in cost would be equal to the change in output
o This would mean that under the null (economies of scale NOT present), β2 ≥1, and in the
alternative (economies of scale ARE present), β2 < 1
Note: LN and LOG are used interchangeably in this context; both refer to the natural logarithm, which is what econometrics texts normally mean by "log"
Overview:
Linear model:   Y = β1 + β2X           β2: effect of a unit ∆X on Y
Semilog A:      Y = β1 + β2·logX       β2: effect of a 1% ∆X on the level of Y
Semilog B:      logY = β1 + β2X        β2: effect of a unit ∆X on %∆Y
Log/Log:        logY = β1 + β2·logX    β2: effect of a 1% ∆X on %∆Y
Anything w/ a log is a percent change
Anything w/o a log is a unit change
Disturbance terms: we've been ignoring them so far
u needs to be an additive term (+u) that satisfies the conditions of the reg model. If this is untrue, least squares reg will not have the usual properties, making tests invalid
EX: In the transformations earlier in this chapter (like defining Z = 1/g), the disturbance was additive both before and after the transformation. No problems
But what happens when we start with a model like Y = β1X^β2?
After taking logs, the reg model is logY = logβ1 + β2·logX + u when u is included. Therefore, the original model should be rewritten as Y = β1X^β2·v, where logv = u
v modifies the model by increasing/decreasing it by a random proportion, not an amount
o If v = 1, then the random factor is 0 (logv = log1 = 0, and multiplying by 1 doesn't change the model)
This shows that to obtain an additive disturbance term in the logged model, we need to start with a multiplicative one in the original equation
If the term were additive in the original equation (Y = β1X^β2 + u), then taking logs would be impossible, as log(β1X^β2 + u) can't be split into separate terms mathematically. You would have to use a nonlinear technique (later)
Comparing linear and logarithmic specifications
Should you model a relationship with a linear or nonlinear function?
o If nonlin, what kind?
Sometimes looking at the scatter plot can tell you whether it's linear or not, but not always
The problem with choosing btwn 2 models is that RSS and R2 can't be compared between different functional forms of Y
o But, if R2 is much larger in one function, choose that one
If they're both similar, you can scale the observations of Y so that the RSS in the lin/log models are directly comparable (Box & Cox)
Making RSS comparable between a linear and log model (Box & Cox):
1. Calculate the geometric mean of the Y values in the sample. This is equal to the exponential of the mean of logY: GM(Y) = exp((1/N) Σ logYi)
2. Scale the observations on Y by dividing by this figure: Yi* = Yi / GM(Y), where Yi* is the scaled value in observation i
3. Regress the linear model using Y* instead of Y, and reg the log model using logY* instead of logY, leaving the models otherwise unchanged
The RSS's are now comparable, and the lower value indicates the better fit
Do not use this method to find coefs. This is solely for deciding the preferred model
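A minimal sketch of the three steps on hypothetical Y values:

    import numpy as np

    Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    geo_mean = np.exp(np.mean(np.log(Y)))   # step 1: geometric mean = exp(mean of log Y)
    Y_star = Y / geo_mean                   # step 2: scale the observations
    # step 3: rerun the linear model with Y_star and the log model with log(Y_star),
    # leaving everything else unchanged; the lower RSS then indicates the better fit
    print(geo_mean, Y_star, np.log(Y_star))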
Models w/ quadratic and interactive variables:
Quadratic: Y = β1 + β2X2 + β3X2² + u; interactive: Y = β1 + β2X2 + β3X3 + β4X2X3 + u
These models can be fitted w/ OLS w/o modification
Interpretation of coefs:
Quad: can't follow the normal rule of holding other variables fixed to measure another's effect on Y, because it's not possible to change X2 while keeping X2² fixed
This is also the case in the interactive model, since X2 also appears in X2X3
Quadratic model:
Differentiate the quad model: dY/dX2 = β2 + 2β3X2. This is the change in Y per unit change in X2
Viewed this way, the impact of a ∆X2 on Y (β2 + 2β3X2) changes with X2
This means that β2 has a different interpretation than in the ordinary model (Y = β1 + β2X2 + u), where β2 is the unqualified effect of a unit change in X2 on Y
In the quad model, β2 is the effect of a unit change in X2 on Y for the special case where X2 = 0
o For nonzero values of X2, the coef will be different
β3 also has a special interpretation. Rewriting the model as Y = β1 + (β2 + β3X2)X2 + u:
o β3 is the rate of change (RoC) of the coef of X2 per unit change in X2
Only β1 has a conventional interpretation. It is the value of Y (apart from the random component) when X2 = 0
There’s another problem. We’ve seen that the intercept of a regression usually doesn’t have a sensible
meaning if X2 =0 is outside the data range
o Note that in this scatter, the quadratic model predicts a
wage of $15 with zero schooling. This is not realistic
o The linear model is also unrealistic, because it predicts
that someone with zero schooling would have to PAY $10
an hour to work
o The Log model makes the most sense in this example
Why do we stop at quadratics? Why not a cubic? Or one with
even more coefs?
o Quadratics are justified due to the concepts of diminishing marginal returns (parabolic shape)
o As higher order terms are added, the fit will be improved slightly, but it will be sample specific
Economic theory rarely justifies higher order polys
Interactive explanatory variables
Ex: Y = β1 + β2X2 + β3X3 + β4X2X3 + u
This is lin in par, but not in var
To properly interpret the coefs, rewrite as: Y = β1 + (β2 + β4X3)X2 + β3X3 + u
Makes it explicit that the marginal effect of X2 on Y (β2 + β4X3) depends on X3
o Special interpretation: β2 is the marginal effect of X2 on Y when X3 = 0
Rewrite: Y = β1 + β2X2 + (β3 + β4X2)X3 + u: this shows that the marginal effect of X3 on Y (holding X2 constant) is (β3 + β4X2), and that β3 may be seen as the marginal effect of X3 on Y when X2 = 0
o If X3 = 0 is a long way outside of the range of X3 in the sample, the interpretation of the estimate of β2 as an estimate of the marginal effect of X2 when X3 = 0 should be treated with caution.
o Sometimes the estimate will be completely implausible, like giving a literal explanation of the y-intercept of a model
o We just ran into a similar problem with the interpretation of β2 in the quadratic specification
It's often of interest to compare estimates of the effect of X2 & X3 on Y in models excluding and including the interactive term
o Changes in the meanings of β2 & β3 make this difficult
Solution: rescale X2 & X3 so that they're measured as deviations from their sample means:
X2* = X2 - X̄2
X3* = X3 - X̄3
Subbing these into the original model for X2 and X3:
Y = β1* + β2*X2* + β3*X3* + β4X2*X3* + u
Where β1* = β1 + β2X̄2 + β3X̄3 + β4X̄2X̄3, β2* = β2 + β4X̄3, and β3* = β3 + β4X̄2
Now, the coefs of X2* & X3* give the marginal effects of their variables when the other is held at its sample mean
Rewrite: Y = β1* + (β2* + β4X3*)X2* + β3*X3* + u
o It can be seen that β2* gives the marginal effect of X2* (and therefore X2) when X3 is at its sample mean (X3* = 0)
o β3* is interpreted in a similar fashion
RESET test (Ramsey): tests for nonlinearity
Adding quad terms of X’s, and interactive terms to the specification is one way of investigating the
possibility of nonlinearity in Y
If there are a lot of explanatory variables in the model, we might want to have some sort of evidence
of nonlinearity before spending too much time manipulating them
Ramsey's RESET test of functional misspecification is intended to provide a simple indicator
o Run the reg in its original form, save the fitted values of the depvar (Ŷ)
By definition, Ŷ is a linear combination of the X variables, so Ŷ² is a linear combination of their squares & interactions
o If Ŷ² is added to the reg specification, it should pick up quadratic or interactive nonlinearity, without necessarily being highly correlated with any X variables (and consuming only one DF)
o If the t-stat of the coef of Ŷ² is significant, some kind of nonlin is likely to be present
o This does not tell you WHICH kind of nonlin is present, and it may fail to detect other types of nonlin
o However, it's easy to implement and potentially helpful
o In principle, we could include higher powers of Ŷ, but most don't think this is worthwhile
Nonlinear Regression: You believe that Y depends on X according to a function that is nonlinear in parameters, for example Y = β1 + β2X^β3 + u, and you want to obtain estimates of the betas, given data on Y & X
Note that this cannot be transformed to obtain a linear relationship, so it's not possible to apply the regular reg procedure
But, we can still use the process of minimizing RSS to estimate params
This nonlinear regression algorithm is a simple method that uses the principle of RSS minimization
1) Guess plausible values for the params
2) Calculate the predicted values of Y from the data on X using these values as the params
3) Calculate residuals for each observation and find RSS
4) Make small changes in one or more of your estimates of the params
5) Calculate the new predicted values of Y, residuals, and RSS
6) If the new RSS is smaller than the original, your new estimates of the parameters are better.
Take them as your new starting point
7) Repeat steps 4, 5, and 6 again and again until you are unable to reduce RSS any further
8) Conclude that you have minimized RSS, and describe the final estimates of the params as the
least squares estimates
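A minimal Python sketch of this algorithm, assuming a hypothetical model Y = b1 + b2·X^b3 and made-up data (a simple step-shrinking search; dedicated software uses smarter update rules, but the principle is the same):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([3.1, 4.4, 5.5, 6.3, 7.1])

    def rss(p):
        b1, b2, b3 = p
        return np.sum((Y - (b1 + b2 * X**b3))**2)    # steps 2-3: predictions, residuals, RSS

    params = np.array([1.0, 2.0, 0.5])               # step 1: plausible starting guesses
    step = 0.1
    while step > 1e-6:
        improved = False
        for i in range(3):                           # step 4: small change to one param at a time
            for delta in (step, -step):
                trial = params.copy()
                trial[i] += delta
                if rss(trial) < rss(params):         # steps 5-6: keep estimates that lower RSS
                    params = trial
                    improved = True
        if not improved:                             # step 7: repeat until RSS can't be reduced
            step /= 2
    print(params, rss(params))                       # step 8: the least squares estimates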
CH5: Dummy Variables
Dummy variables (DVs): used for categorical data as opposed to numerical
Assigns a 0 or 1 for true/false
o Dummies are treated just like ordinary variables, despite the fact that they only have 2 possible
values
Better than having to use more than one reg model within sample
The example of types of schooling in Shanghai that persists throughout Chapter 5 is great for
explaining dummies, and will be used here
General cost FN (for population, no DVs yet): COST = β1 + β2N + u
o β2 is the MC (marg cost) or slope: it gives the change in cost as N (# students) goes up
o β1 is the FC (fixed cost) or intercept because it doesn't vary with the # of students
o We could denote COST as the cost of regular schools, and COST' as the cost of occupational (OCC) schools, but using dummies makes this much easier
Combining multiple qualitative measurements into one function (using dummies)
EX: COST = β1 + δ·OCC + β2N + u, where OCC is a dummy. If the school is occupational, OCC = 1. If not, OCC = 0, negating its associated δ (delta) value
δ is the associated increase in overhead when an OCC school is being looked at (changes the intercept, but not the slope. This will be addressed later)
From the Stata output of the data we get fitted values for δ, b1, and b2, and therefore a fitted equation for COST
Setting OCC equal to 0 and 1, respectively, we can obtain the implicit cost functions for the two types of schools:
The intercept implies an annual overhead cost of -34,000 Yuan for regular schools. The negative value is not at all realistic, and you should immediately realize that the model is misspec'd (doesn't pass the "laugh test")
The N-coef of 331 implies a constant slope (MC) in both categories (fixed later)
Standard errors & hypothesis testing of dummies
Perform t-test on the DV, with H0: 𝛿 = 0, 𝑎𝑛𝑑 𝐻𝐴: 𝛿 ≠ 0
H0: no difference in overhead costs
If the t stat is greater than tcrit, conclude that the special schools are significantly more expensive
than regular ones
Standard errors are usually given in the outputs. Make sure they are included in the appropriate t-
tests
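A sketch of how a 0/1 dummy enters the regression like any other variable, on made-up numbers (the data here are hypothetical, not the book's Shanghai sample):

    import numpy as np

    N_students = np.array([100.0, 200.0, 300.0, 150.0, 250.0, 350.0])
    OCC = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])              # 1 = occupational, 0 = regular
    COST = np.array([60.0, 95.0, 130.0, 110.0, 145.0, 180.0])   # thousands of Yuan, made up

    X = np.column_stack([np.ones(6), OCC, N_students])
    b1, d, b2 = np.linalg.lstsq(X, COST, rcond=None)[0]
    print(b1, d, b2)
    # Regular schools:      COST-hat = b1 + b2*N          (OCC = 0 kills the d term)
    # Occupational schools: COST-hat = (b1 + d) + b2*N    (intercept shifts by d)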
More elaborate dummies: extension to more than 2 categories and multiple sets of DVs
Now we are going to make models with 4 possible categories of schools: general (GEN), technical (TECH), skilled worker (WORKER), and vocational (VOC)
Pick one reference category (refcat) to which the basic equation applies. You should always start with the dominant or most normal category (unless you have good reason to do otherwise)
In the school example, general schools are picked as the first refcat
New model: COST = β1 + δT·TECH + δW·WORKER + δV·VOC + β2N + u
o Where each δ is the extra overhead required by that type of special school in addition to the overhead of general ones, and TECH/WORKER/VOC are dummies which will equal 1 when true and zero otherwise
o Two dummies cannot equal 1 simultaneously in this model. At most one will equal 1, the rest zero. This will change later when we use 2 separate qualitative characteristics of individual schools (with the RES dummy)
Do not make a DV for the reference category. This is why it's called the omitted category. Note that GEN is the reference category in the case above
Regression results (from book):
o Notice that TECH schools require an additional 154,000 Yuan of overhead over general schools
o Overhead: costs not related to the level of output or labor
o MC of each student: 343 Yuan
o Notice that the general school has a negative number again (despite being measured against itself). Something's wrong w/ the model
Finding individual implicit costs of each type of school
Changing the reference category (refcat): replace the dummy variable and its associated parameter (δ) with one for your previous omitted category
EX: changing the reference category from GEN to WORKER: drop the WORKER dummy, add a GEN dummy, and leave the rest of the equation alone
When the refcat changes:
o R2, the coefs for the other (non-dummy) variables, their t-stats, and the F-stat for the whole equation all stay the same
o The dummy coefs, their standard errors, and the interpretations of their t-tests are the things that change
DV trap:
Happens if you include a DV for the refcat
Even if it were possible to calculate the coefs, you couldn't interpret them. Dummies change the intercept, and there would be no definition of the base-level intercept (b1)
In general models, there's actually an X1 variable that we have ignored
Ex: Yi = β1X1i + β2X2i + ui
o X1i is equal to 1 for every observation, so it simply provides the intercept β1
Suppose there are M dummy categories, and you define DVs D1, …, DM
Since exactly one DV will equal one and the rest zero, the sum of the DVs will always be 1
The intercept is the product of β1 and this special variable that is equal to 1 in all observations
o This means that for all observations, the sum of the DVs is equal to this special variable
As a consequence, this model is subject to a special case of exact multico, preventing the calculation of the coefs
Multiple sets of DVs
EX: COST = β1 + δ·OCC + ε·RES + β2N + u
This model proposes that schools of any type will have a higher cost if they're in a residential area. Note that we're just using RES/non-RES OCC and GEN schools for simplicity
o ε: extra cost of residential schools
o Affects the intercept
o The reference category now has 2 dimensions, one for each qualitative characteristic
o Note that it is assumed (and intuitive) that the increase in cost for a residential location will be the same for OCC & regular schools
o Regression output:
This implies that OCC schools cost 110,000 more Yuan, and residential schools as a whole cost 58,000 more
The 4 combos of OCC and RES can be broken into 4 individual implicit cost functions by setting each dummy to 0 or 1:
* I didn't include the other 2 because they're pretty easy to find, and I think this process has been pretty well-covered. Check pg 238 if you need clarification
Slope DV (interactive):
Drops the assumption that the slope of the reg is the same for each category of qualitative variables, since it would be unrealistic for MC to be fixed across school types
Slope DV: N·OCC
Model: COST = β1 + δ·OCC + β2N + λ(N·OCC) + u
Effect: allows the coef of N for OCC schools to be λ greater than that of regular schools
Setting OCC equal to zero gets rid of the OCC terms (the extra intercept and the steeper slope) and gives you the original cost function for reg schools: COST = β1 + β2N + u
Setting OCC equal to 1 makes N·OCC = N: COST = (β1 + δ) + (β2 + λ)N + u
λ is the incremental marginal cost associated w/ OCC schools, just like δ is the incremental overhead cost
The addition of these variables has allowed the OCC and regular schools to have their own slopes (MC) and intercepts (FC) while remaining part of the same function
If the Y-int is negative for a data set with no negative values, it's probably misspec'd. Likely due to an MC which was a compromise between the slopes of regular and OCC schools
T & F tests can still be used
F-tests of dummies
The joint explanatory power of the intercept and slope dummies can be tested with the normal F-test, comparing RSS when the dummies are included and excluded
H0: δ = λ = 0
HA: at least one of δ, λ ≠ 0
o F = (reduction in RSS from adding DVs ÷ cost in df) / (RSS including DVs ÷ df remaining after DVs added)
Cost in df: the number of additional params estimated (the # of DVs)
DVs in log/semilog forms
* I don't think that these will be covered, but there was about a half page on it (230) and knowing how to manipulate logs is important in other areas
In a semilog model with a dummy, logY = β1 + β2X + δD + u, so Y = e^(β1 + β2X) · e^(δD) · e^u
o The term e^(δD) multiplies Y by e^0 (equal to 1) when D = 0 (reference category) and by e^δ when D = 1 (other category)
o If δ is small, e^δ ≈ (1 + δ). This means that Y is a proportion δ larger in the other category than in the reference category
o If δ is not small, the proportional difference is (e^δ - 1)
Chow Test: a type of F-test for 2+ subsamples
Should you run 2 separate regs on subsamples A & B, or one together as P? (pooled)
Since the subsample regs already minimized RSSA & RSSB, pooling them will produce a regression that doesn't fit as well (usually)
This means that RSSA ≤ RSSAP and RSSB ≤ RSSBP, where RSSAP and RSSBP are the pooled regression's sums of squared residuals over the A and B observations
o Therefore, RSSA + RSSB ≤ RSSP, where RSSP (the total sum of squared residuals in the pooled reg) is equal to the sum of RSSAP and RSSBP
In general, there will be an improvement (RSSP - RSSA - RSSB) when the sample is split up
However, there’s a price to pay when splitting it. K extra df have been used up, since instead of K
params for the pooled regression, there are now 2K params
After breaking up the sample, we are still left with (RSSA+RSSB) (unexplained) sum of squares of the
residuals, and N-2K df remaining
We can now test whether the improved fit due to splitting the sample is significant w/ a special F-test known as a Chow test
o F = [(RSSP - RSSA - RSSB) ÷ K] / [(RSSA + RSSB) ÷ (N - 2K)], i.e. (improvement in fit ÷ extra df used up) / (RSS remaining ÷ df remaining)
o Which is distributed w/ K and (N - 2K) df under the null (no improvement in fit)
o If Fstat > Fcrit, use the separate regressions (because the improvement is significant)
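The Chow arithmetic with hypothetical RSS values:

    from scipy.stats import f as f_dist

    RSS_P, RSS_A, RSS_B = 120.0, 50.0, 45.0   # hypothetical pooled and subsample RSS
    N, K = 60, 3                              # total observations; params per regression

    F = ((RSS_P - RSS_A - RSS_B) / K) / ((RSS_A + RSS_B) / (N - 2 * K))
    F_crit = f_dist.ppf(0.95, dfn=K, dfd=N - 2 * K)
    print(F, F_crit, F > F_crit)              # if F > Fcrit, run the separate regressions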
DV method vs Chow test
Chow is quick. You just run 3 regs and calculate test stats
o However, it doesn’t tell you how the functions differ (if they do)
DV gives you more info bc you can perform t-tests on individual dummy coefs. This may show where
the FNs differ (if they do)
o DV takes longer bc you have to define a DV for each intercept and each slope coef
Chapter 6: Specification of Regression Variables
Model Specification: If we know exactly which variables need to be included in a relationship, the goal is
only to calculate estimates of their coefs, confidence intervals for these estimates, and so on. However, we
can never know if we specified the model correctly. We might be leaving out variables that should be included,
or including ones that should not
1. If you leave out a variable that should be included, the reg estimates will generally (not always) be
biased. This makes the SE of the coefs and corresponding t-tests generally invalid
2. If you include a variable that should not be included, the coefs are generally (not always) inefficient,
but not biased
Consequences of variable specification
If the true model is Y = β1 + β2X2 + u:
o Fitting Ŷ = b1 + b2X2: correct specification, no problems
o Fitting Ŷ = b1 + b2X2 + b3X3: coefs are unbiased (in general) but inefficient; SEs are valid (in general)
If the true model is Y = β1 + β2X2 + β3X3 + u:
o Fitting Ŷ = b1 + b2X2: β3X3 is left out, so coefs are biased (in general) and SEs invalid
o Fitting Ŷ = b1 + b2X2 + b3X3: correct specification, no problems
Effect of omitting a relevant variable:
Suppose that Y depends on X2 and X3 according to Y = β1 + β2X2 + β3X3 + u
But you are unaware of the importance of X3. You think the model should be the simple reg Y = β1 + β2X2 + u
You then calculate b2 using the simple-regression expression b2 = Σ(X2i - X̄2)(Yi - Ȳ) / Σ(X2i - X̄2)²
Instead of the multiple-regression expression that controls for X3
b2 is unbiased only if E(b2) = β2
*proof can be found on p253
If the true model really is the multiple regression, the simple-regression b2 is subject to omitted variable bias
If X3 is omitted from the reg model, X2 will appear to have a double effect
It will have a direct effect and a proxy effect by mimicking the effects of X3
The direction of the bias will depend on the signs of β3 and Σ(X2i - X̄2)(X3i - X̄3) (which is the numerator of the sample correlation btwn X2 and X3, rX2X3; the denominator of the corr coef is always positive)
o If β3 and the correlation are pos, the bias will be positive, and b2 will tend to overestimate β2
o The direction of the bias could just as easily be negative. It depends on the sign of the true coef of the omitted variable, and on the sign of the correlation btwn the included and omitted variables
Invalidation of statistical tests
Omitting a variable that should be included makes the SEs of the coefs and the test stats invalid (generally)
Effect of including a redundant variable:
Suppose that the true model is the simple regression Y = β1 + β2X2 + u, but you think it's a multiple reg model. You estimate b2 with the multiple-regression formula (which also controls for X3) instead of the simple-regression formula
Generally, adding a redundant variable doesn't cause bias, it just causes inefficient estimation
However, you could just rewrite the true model as Y = β1 + β2X2 + 0·X3 + u
o If you regress Y on X2 and X3, b2 will be an unbiased estimator of β2, and b3 will be an unbiased estimator of zero (as long as the reg model assumptions hold)
Proxy Variables
Used when you are unable to get data on a variable that you think should be included, or if it’s too
difficult to measure
o Ex: Intelligence is vaguely defined, practically impossible to measure definitively
In this case, it’s usually a good idea to use a proxy variable to stand in for the missing variables
(instead of simply dropping it)
o EX: Since socioeconomic status (SES) isn’t easily measurable, use income to stand in
2 good reasons to use a proxy:
o Leaving the variable out can cause reg to suffer from omitted variable bias, making statistical
tests invalid
o The results from your proxy reg may indirectly shed light on the influence of the missing
variable
Testing a linear restriction
Linear restrictions: the parameters conform to a simple linear equation
o EX: β2 = β3 or β2 + β3 = 1
Nonlinear restriction: β2 = β3β4
These procedures only relate to linear restrictions
F-test of a linear restriction
Run the reg in restricted and unrestricted forms, denoting RSS as RRSS for the restricted model, and URSS for the unrestricted one
The restriction makes it harder to fit the model, so RRSS ≥ URSS, generally being greater
We want to test if the improvement in fit when going from the restricted to the unrestricted model is significant
If it is, the restriction should be rejected
o As in chapter 3, F = (improvement in fit ÷ extra df used up) / (RSS remaining ÷ df remaining)
In this case, the improvement is RRSS - URSS, one additional df is used up in the unrestricted model (bc there's one more parameter to estimate), and the RSS remaining is URSS
o The F-stat in this case is F(1, N-K) = (RRSS - URSS) / [URSS/(N-K)] (under H0: restriction is valid)
F is distributed with 1 & N-K df
o First argument for the distribution of the F-stat: we are testing ONE restriction
o Second: df in the unrestricted model
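The same arithmetic sketched with hypothetical RRSS/URSS values:

    from scipy.stats import f as f_dist

    RRSS, URSS = 85.0, 80.0   # hypothetical restricted and unrestricted RSS
    N, K = 40, 4              # hypothetical sample size; K = params in unrestricted model

    F = (RRSS - URSS) / (URSS / (N - K))          # F(1, N-K) under H0: restriction valid
    F_crit = f_dist.ppf(0.95, dfn=1, dfd=N - K)
    print(F, F_crit, F > F_crit)                  # reject the restriction if F > Fcrit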
T-test of a linear restriction
Suppose your hypothetical restriction is Σ λjβj = α
o Where α is a scalar. Define θ = Σ λjβj - α and reparameterize the model so that θ appears as the coef of one of its variables
o A t-test of H0: θ = 0 is then effectively a t-test of H0: Σ λjβj = α, hence the restriction
o Basically, we're seeing if we need to include the reparameterizing term (θ)
o Example on p273
o If the estimate of θ is not significantly different from zero, we can drop the term and use the restricted version
o If it is significantly different, we can't drop it
Multiple restrictions: an F-test can be applied to test whether several restrictions are valid simultaneously
Suppose there are P restrictions
Let URSS be the RSS for the fully unrestricted model
Let RRSS be the RSS for the model where all P restrictions have been imposed
o The test stat is now F(P, N-K) = [(RRSS - URSS) ÷ P] / [URSS ÷ (N-K)], where K is the number of params in the original unrestricted version
o The t-stat can only be used to test single restrictions in isolation (one at a time)
Zero restrictions
A zero restriction means that a particular param is hypothesized to be zero
Since it's a single restriction in isolation, use the t-test
o Special case, no need for reparam
The testing of multiple zero restrictions is a special case of testing multiple restrictions
o The test of the joint explanatory power of a group of explanatory variables can be thought of in this way
o The F-stat for the equation as a whole can also be thought of in this way
Here the unrestricted model is Y = β1 + β2X2 + … + βKXK + u
The restricted model is Y = β1 + u
o Since all of the slope coefs are hypothesized to be zero
If this model is fitted,
o the OLS estimate of β1 is Ȳ
o The residual in observation i is Yi - Ȳ
o RRSS is Σ(Yi - Ȳ)², which is the TSS for Y
o The F-stat is now F(K-1, N-K) = [UESS/(K-1)] / [URSS/(N-K)]
o Where URSS & UESS are the RSS & ESS for the original unrestricted model