
Econometric Forecasting with Linear Regression: A Brief Introduction

I. Fundamental Concepts

A. Data (variables). Can be in three forms:

1. Interval – There is a common scale to measure the variable, so equal differences mean the same thing anywhere on the scale. When the scale also has a true zero, a value of two is actually twice a value of one. Examples: % of vote, degrees Fahrenheit (no true zero), number killed, duration of regime, number of soldiers, GDP

2. Ordinal – There is a rank-ordering to the variable, so 2 > 1, but the scale varies so that 2 is not exactly twice one. Examples: Yes/No variables, how close a bill is to passage (no houses, one house, both houses, signature), war outcomes (win, lose, or draw)

3. Nominal – There are numbers, but they are completely arbitrary. Examples: country codes, leader names, strategy choices, apples and oranges.

B. Dependent Variable: What you are trying to predict

1. Examples include % of the two-party Presidential vote, % of seats held by Democrats, war/non-war, political (in)stability, etc.

2. Easiest to have a continuous (interval) DV, but techniques exist for all three types

C. Independent Variables: What variables predict the DV

1. Can be either interval or ordinal. So…

2. Transform nominal into ordinal. Example: Is this country the US? A nominal variable (USA) becomes an ordinal one (Yes or No).

3. Again, examples in syllabus
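The nominal-to-ordinal transformation can be sketched in a few lines of Python. This is only an illustration: the country codes and the helper name `to_indicator` are assumptions, not part of the original slides. The result is what is usually called an indicator (or dummy) variable.

```python
# Hypothetical country codes; the values are illustrative only.
countries = ["USA", "FRA", "USA", "JPN", "BRA"]

def to_indicator(values, target):
    """Return 1 where a nominal value equals `target` (Yes), else 0 (No)."""
    return [1 if value == target else 0 for value in values]

# "Is this country the US?" as a Yes/No (1/0) variable.
is_usa = to_indicator(countries, "USA")
print(is_usa)  # [1, 0, 1, 0, 0]
```

The 1/0 column can then be used as an independent variable in a regression, while the raw country codes cannot.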

D. Correlation

1. Positive (or direct) correlation: The values of the IV and DV move up and down together (poverty and crime, CO2 and global temperature, drug addiction and prostitution, geographic proximity and conflict)

2. Negative (or inverse): The values of the IV and DV move in opposite directions (alcohol and coordination, democracy and interstate conflict, war and development)

3. Conditional: Direction depends on the value of some other variable

4. Correlation ≠ Causation: Coincidence and Omitted Variables

E. Example: Forecasting Political Stability with Five Variables

[Figure: diagram with three labels: Dependent Variable, Independent Variables, Statistical Relationships]

II. Modeling Relationships

A. Simplest tool: the scatterplot or scatter diagram. Example from medicine:

Example

A researcher believes that there is a linear relationship between the BMI (kg/m2) of pregnant mothers and the birth weight (BW, in kg) of their newborns.

The following data set provides information on 15 pregnant mothers who were contacted for this study:

BMI (kg/m2)    Birth weight (kg)
20             2.7
30             2.9
50             3.4
45             3.0
10             2.2
30             3.1
40             3.3
25             2.3
50             3.5
20             2.5
10             1.5
55             3.8
60             3.7
50             3.1
35             2.8

Scatter Diagrams / Scatterplots

• A scatter diagram plots bivariate observations (X, Y)
• BMI (the IV) is X and birth weight (the DV) is Y
◦ Y is the dependent variable (Dependent goes Down the side)
◦ X is the independent variable (goes across the graph)

[Figure: Scatter diagram of BMI and Birthweight; x-axis from 0 to 70, y-axis from 0 to 4]
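Before fitting any line, the strength of the linear association in the BMI example can be summarized with a correlation coefficient. The sketch below is an illustration rather than part of the original slides: it uses the 15 (BMI, birth weight) pairs from the example and a hand-written Pearson correlation; the function name `pearson_r` is an assumption.

```python
import math

# The 15 (BMI, birth weight) pairs from the example.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
birth_weight = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5,
                1.5, 3.8, 3.7, 3.1, 2.8]

def pearson_r(xs, ys):
    """Pearson correlation: covariation of X and Y scaled to lie in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(bmi, birth_weight)
print(f"r = {r:.2f}")  # positive: BMI and birth weight move together
```

A value near +1 corresponds to points hugging an upward-sloping line, a value near -1 to a downward-sloping line, and a value near 0 to no linear relationship.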

[Figure: example from politics]

B. Interpreting Scatterplots

People tend to mentally fit a line or curve to describe the shape of the scatterplot. Examples:

[Figure: Linear Correlation; panels of Y against X showing strong relationships and weak relationships]

[Figure: Linear (lack of) Correlation; panels of Y against X showing no relationship]

[Figure: Curvilinear Correlation; panels of Y against X showing linear relationships and curvilinear relationships]

C. What does the line mean?

1. Intended to simplify the relationship. The line is ultimately an estimate, usually known to be wrong (but close enough to be useful)

2. The line is probabilistic, not deterministic – otherwise it would pass perfectly through every point on the scatterplot

3. This is the key difference between predicting politics and predicting planetary orbits. Kepler’s equations are deterministic, but econometric models are probabilistic

D. Problem: How do we draw the “right” line?

Sample scatterplot:

[Figure: sample scatterplot of Y against X, both axes from 0 to 60]

Thinking Challenge

How would you draw a line through the points? How do you determine which line ‘fits best’?

[Figures: the same scatterplot of Y against X, repeated across seven “Thinking Challenge” slides]

E. Solution: Regression

Regression = using an equation to find the line (or curve) that most closely fits the data

1. Linear Regression Model

a. Relationship Between Variables Is a Linear Function

$Y = \beta_0 + \beta_1 X + \varepsilon$

• $Y$: Dependent Variable
• $X$: Independent (Explanatory or Control) Variable
• $\beta_1$: Coefficient of X, or Slope
• $\beta_0$: Constant, or Y-Intercept
• $\varepsilon$: Random Error

b. Linear Equations

Does this equation look a bit familiar? It should…

[Figure: graph of the line Y = mX + b, with m = slope (change in Y / change in X) and b = Y-intercept; caption: High School Teacher]

c. Quick math review

• As you remember from high school math, the basic equation of a line is given by y = mx + b, where m is the slope and b is the y-intercept
• One definition of m is that for every one-unit increase in x, there is an m-unit increase in y
• One definition of b is the value of y when x is equal to zero

[Figure: the line y = 1.5x + 4; x-axis from 0 to 12, y-axis from 0 to 20]

Sample Scatterplot

• Look at the data in this picture
• Does there seem to be a correlation (linear relationship) in the data?
• Is the data perfectly linear?
• Could we fit a line to this data?

[Figure: sample scatterplot; x-axis from 0 to 12, y-axis from 0 to 25]

2. What is linear regression?

• Linear regression tries to find the best line (curve) to fit the data
• The equation of the line is y = 1.5x + 4
• The method of finding the best line (curve) is least squares, which minimizes the sum of the squared vertical distances from the line to each of the points

[Figure: the sample scatterplot with the fitted line y = 1.5x + 4; x-axis from 0 to 12, y-axis from 0 to 25]

3. Ordinary Least Squares (OLS): The most common form of linear regression

a. Find the values of b (the intercept and slope) that minimize the squared vertical distances from the line to each of the points. This is the same as minimizing the sum of the squared errors ei².

b. Why minimize squared errors?
• ‘Best fit’ means the differences between the actual Y values and the predicted Y values are a minimum
• But positive differences offset negative ones! (Errors of 10 and -10 add to zero.)
• Squaring the errors solves the problem: 10 * 10 = 100 and -10 * -10 also = 100.

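c. Least Squares Graphically: Predicted Values of Y vs. Actual Values of Y

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

[Figure: scatterplot of Y against X with the fitted line; the vertical gaps between the points and the line are marked as errors $\varepsilon_1$ through $\varepsilon_4$]

For each observation i, the equation is merely an estimate, not the actual value. There are errors ($\varepsilon_i$), and the line minimizes the sum $\varepsilon_1^2 + \varepsilon_2^2 + \varepsilon_3^2 + \varepsilon_4^2 + \varepsilon_5^2 + \cdots$, and so on.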
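For a single independent variable, the least-squares line has a well-known closed form: the slope is the covariation of X and Y divided by the variation of X, and the intercept makes the line pass through the point of means. The sketch below applies it to the BMI and birth weight example; the function name `ols_fit` is an assumption, and this is an illustration rather than the procedure any particular statistics package uses.

```python
# The 15 (BMI, birth weight) pairs from the example.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
birth_weight = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5,
                1.5, 3.8, 3.7, 3.1, 2.8]

def ols_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

b0, b1 = ols_fit(bmi, birth_weight)

# Errors: actual Y minus predicted Y, one per observation.
residuals = [y - (b0 + b1 * x) for x, y in zip(bmi, birth_weight)]
sse = sum(e ** 2 for e in residuals)

print(f"birth weight = {b0:.3f} + {b1:.4f} * BMI")
print(f"sum of errors         = {sum(residuals):.6f}")  # offsets to about 0
print(f"sum of squared errors = {sse:.3f}")
```

The raw errors cancel out to roughly zero, which is why they cannot be used to rank candidate lines; the squared errors do not cancel, and any other intercept or slope gives a larger sum.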

d. Recap: Interpreting the Linear Regression Formula

Regression Formula: Y = a + bX, Y = α + βX, Y = α + β1X1, Y = β0 + β1X1, etc. are all the same formula!

• Y = the predicted value of the dependent variable (its estimated mean given X)
• a (or alpha: α, or beta-zero: β0) = the Y intercept, or the value of Y when X = 0 (constant)
• b (or beta: β) = the regression coefficient, the slope of the regression line, or the amount of change produced in Y by a unit change in X
◦ Positive sign of regression coefficient: positive direction of association
◦ Negative sign of regression coefficient: negative direction of association
• X = the value of the independent variable

Example

What is:
◦ Y?
◦ X?
◦ β1?
◦ β0?

e. Multivariate Linear Regression

Typical formula: Y = β0 + β1X1 + β2X2 + β3X3, etc.

• The DV and the constant haven’t changed
• But now there are several independent variables
• Each IV has its own coefficient. So the first X may be positively related to Y, while the others might be negatively related to Y.
• Could plot the effect of any one independent variable on Y as a line, but can no longer plot the whole equation since there are now as many dimensions as there are independent variables (plus one, for Y).
• Multivariate regression is best interpreted by consulting tables of coefficients, evaluating the effect of each X separately (i.e. all else being equal)

F. Other statistics generated by linear regression

1. R²: Proportion of the variation in the dependent variable (Y) that is explained by the independent variable (X)

R² = Explained variation / Total variation

• Ranges between 0 (no reduction in error) and 1 (no errors remain – the model perfectly predicts the dependent variable)
• R² is a comparative measure – it compares the amount of error made by the linear regression to the amount of error made by guessing the mean (average) value of Y for every case (e.g. Y = 12 for every case)
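The comparison between the regression line and guessing the mean can be written out directly as R² = 1 − (squared error of the line) / (squared error of the mean). The sketch below uses made-up numbers for education level and weekly Internet hours; the data and the function names are assumptions for illustration only.

```python
# Hypothetical data: education level (0-4) and Internet use (hours per week).
education = [0, 1, 2, 3, 4]
hours = [3, 8, 9, 14, 16]

def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(ys, predictions):
    """1 minus (error of the predictions / error of guessing the mean)."""
    mean_y = sum(ys) / len(ys)
    error_of_predictions = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    error_of_mean_guess = sum((y - mean_y) ** 2 for y in ys)
    return 1 - error_of_predictions / error_of_mean_guess

b0, b1 = ols_fit(education, hours)
line_predictions = [b0 + b1 * x for x in education]
mean_hours = sum(hours) / len(hours)

r2 = r_squared(hours, line_predictions)
r2_mean_guess = r_squared(hours, [mean_hours] * len(hours))

print(f"R^2 of the regression line: {r2:.3f}")
print(f"R^2 of guessing the mean:   {r2_mean_guess:.3f}")  # 0 by construction
```

Guessing the mean for every case scores exactly 0, so R² reads as the share of that baseline error the regression removes.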

Example: Regression vs. guessing the mean

R² compares how much variation there is when you know X (i.e. how well your line fits the data) to how much variation there is when you don’t know X (which means you just assume Y is constant at its mean).

[Figure: First the regression… Y and Predicted Y plotted against X (education level, 0 to 4); y-axis: Internet use, hours per week, 0 to 16]

[Figure: …and now the variance without regression]

[Figure: Simple graphical example, good fit; y plotted against x1, fitted line y = 1.9599x + 0.2823, R² = 0.9369]

[Figure: Simple graphical example, poorer fit; y plotted against x1, fitted line y = 1.9696x + 0.5683, R² = 0.811]

As it turns out…

2. Statistical Significance

a. Statistical significance of the regression model uses one of a number of indicators (χ², for example). No need to understand the indicator to interpret it. Look for a “p value” associated with the indicator.

b. Statistical significance of each regression coefficient (β1, for example). Also measured by a p value.

c. Key is to find p and see if p < .05 (in the social sciences). If yes, the result is statistically significant. If no, it is not statistically significant.

d. Finding and interpreting p

• The p value is the probability that random data (i.e. no real relationship with Y) would have coincidentally given you an association at least this strong. Hence, lower values of p are “better.”
• Authors sometimes say “significant at the .001 level.” This means p < .001. There may or may not be a table of p values for coefficients – authors frequently use asterisks to highlight coefficients at a given level of significance.
• If the model is not significant, the author has failed to discover a significant correlation between the model’s predicted values of Y and the actual values of Y.
• If a coefficient is not significant, then the author has failed to discover a significant correlation between that particular X and Y.
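The definition of p as "the chance that random data would produce an association this strong" can be simulated directly. The sketch below is a permutation test, which is not the test regression software usually reports (that is typically a t or F test); it is swapped in here only because it mirrors the definition. It shuffles the birth weights from the earlier BMI example so that any real link to BMI is destroyed, then counts how often the shuffled data look as strongly correlated as the real data. The function names, the number of shuffles, and the seed are assumptions.

```python
import math
import random

# The 15 (BMI, birth weight) pairs from the earlier example.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
birth_weight = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5,
                1.5, 3.8, 3.7, 3.1, 2.8]

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Share of random shufflings of Y whose |r| is at least the observed |r|."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_strong = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real link between X and Y
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_strong += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (at_least_as_strong + 1) / (n_shuffles + 1)

p = permutation_p_value(bmi, birth_weight)
print(f"p is approximately {p:.4f}")
print("statistically significant" if p < 0.05 else "not statistically significant")
```

The estimate can never fall below 1 / (n_shuffles + 1), so more shuffles are needed to distinguish very small p values from one another.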

e. Two common mistakes

• “p < .6 so the relationship is statistically insignificant, and therefore I conclude that X doesn’t affect Y” – Not true, because p could be .001. All we know is that it is less than .6. In other words, absence of evidence is not evidence of absence. Indeed, when the number of cases is very small, all of the p values – even for real relationships – are likely to be too large to make the coefficients statistically significant.

• “p < .000001 so the relationship between X and Y is very strong” – Not true, because the p value for any coefficient (no matter how tiny) becomes smaller as the number of cases increases. With millions of cases, just about every relationship is “statistically significant,” but many are substantively trivial.

3. Substantive significance

This depends on what you are looking for!

• What units are X and Y measured in?
• Does the coefficient mean that small increases in X lead to large increases in Y? If statistically significant, this is also substantively significant
• Does the coefficient mean that large increases in X only produce trivial changes in Y? Then regardless of statistical significance, the relationship is substantively uninteresting
• This is a qualitative judgment based on your needs, but it takes into account the numbers

Example

• Research hypothesis: The level of economic development has a positive effect on civil liberties in countries of the world
• Dependent variable: civil liberties
◦ Interval-ratio
• Independent variable: GDP per capita ($1000)
◦ Measure of the level of economic development
◦ Interval-ratio

Example

• Regression coefficient (beta) = .257
• Substantive significance
◦ An increase of $1000 in GDP per capita increases the civil liberties score by .257.
◦ On a 5-point scale, this is interesting. On a 1000-point scale it would not be interesting.
• Statistical significance: p < .001. Statistically significant at the .001 or .1% level
• R² = .525: GDP per capita explains 52.5% of the variation in civil liberties
• Research hypothesis: was not falsified by bivariate regression analysis (i.e. was consistent with the regression)
• The level of economic development has a positive and statistically significant effect on civil liberties
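To see why the scale of Y matters for substantive significance, the arithmetic can be sketched as follows. The coefficient .257 comes from the example; the GDP changes and the 5-point and 1000-point scale widths are illustrative assumptions, as are the function names.

```python
# Coefficient from the example: change in civil liberties score
# per $1000 increase in GDP per capita.
BETA = 0.257

def predicted_change(beta, delta_x):
    """Predicted change in Y for a change of delta_x units in X."""
    return beta * delta_x

def share_of_scale(change, scale_width):
    """Express a change in Y as a fraction of the full range of Y."""
    return change / scale_width

# Hypothetical GDP changes, in thousands of dollars.
for delta in (1, 5, 10):
    change = predicted_change(BETA, delta)
    print(f"+${delta}k GDP per capita -> {change:.3f} points; "
          f"{share_of_scale(change, 5):.1%} of a 5-point scale, "
          f"{share_of_scale(change, 1000):.3%} of a 1000-point scale")
```

The same coefficient covers a visible share of a short scale and a negligible share of a long one, which is the judgment the bullet about scales is asking the reader to make.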

4. Confidence Intervals

• Linear regression predicts best near the mean values of X. Extreme values of X (low or high) are associated with greater error when predicting Y.
• Solution: Confidence intervals. A 95% confidence interval is the range in which 95% of observations of Y at a given value of X are expected to fall, given the uncertainty in the estimated coefficient of X. (Strictly, an interval for individual observations is called a prediction interval.)
• Example: Polls with “margins of error” (typically 95% confidence intervals)
• Another example:

III. Extrapolation

Also known as “time series analysis.”

A. Simplest form: $Y_t = Y_{t-1} + \alpha$

◦ Y is the DV, t is time, and α is a constant
◦ If $Y_{t-1}$ is 100 and α is 1, then Y will be 101, 102, 103, etc. as time passes
◦ Note that this is simply a rearranged linear regression equation. The DV is predicted by previous values of the DV (which fill in as the IVs in the model)

B. More common form

Form: $Y_t = \beta Y_{t-1} + \alpha$

• β is the multiplicative relationship between $Y_{t-1}$ and $Y_t$
• So if β = 1 (and α = 0), then Y never changes over time.
◦ If β > 1 then Y increases over time
◦ If β < 1 then Y diminishes over time
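Both forms can be stepped forward in a short loop, which makes the effect of β easy to see. In this sketch the starting value of 100, the number of steps, and the function name `extrapolate` are assumptions chosen for illustration.

```python
def extrapolate(y_start, alpha, beta=1.0, steps=3):
    """Apply Y_t = beta * Y_{t-1} + alpha repeatedly; return the new values."""
    values = []
    y = y_start
    for _ in range(steps):
        y = beta * y + alpha
        values.append(y)
    return values

# Simplest form (beta = 1): Y grows by alpha each period.
print(extrapolate(100, alpha=1))            # [101.0, 102.0, 103.0]

# More common form with alpha = 0, to isolate the effect of beta.
print(extrapolate(100, alpha=0, beta=1.1))  # beta > 1: Y increases
print(extrapolate(100, alpha=0, beta=0.5))  # beta < 1: Y diminishes
```

With β between 0 and 1 and a nonzero α, the series neither explodes nor vanishes; it settles toward α / (1 − β).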

C. Why use time series analysis?

1. Time’s arrow: Since cause must precede effect, time series analysis can be used to rule out the possibility that Y causes X

2. Autocorrelation: Sometimes we need to address the correlation of a variable with itself over time. Example: to predict the defense budget, the first thing to know is that it’s usually similar to last year’s budget. Then one can add IVs that might cause it to increase or decrease.

3. Omitted variable bias: Failing to “control” for a relevant IV (one that may correlate with both X and Y) can generate “false positives” – statistically significant relationships between variables that are causally unrelated (example: high correlation between Vietnam vets and supermarkets)

IV. Forecasting with Econometric Models

A. Is the relationship causal? Difficult to know for sure…

1. Possibility of coincidence: Addressed by requiring models to be statistically significant. Chance remains, but is low.

2. Sources of bias:

a. Y causes X. That is, perhaps the researcher has reversed the DV and IV. Use time-series analysis to rule this out.

b. Faulty data – But only if the data is biased in some manner that makes X and Y correlate. Random noise is already accounted for. Example of bias = serial autocorrelation, or correlation across time. Many things (kids and dogs) grow larger over time. But the height of your kid does not cause your dog to get bigger!

c. Omitted variables – suppose Z causes X and Z causes Y. Then X and Y will appear to be causally related when in fact they are merely correlated. Adding Z to the model would reveal that X has no independent effect on Y.

B. Extrapolating

1. Requires either
a. The ability to forecast the IVs themselves, or
b. A model that forecasts Y(t) from IVs in t-1, t-10, etc.

2. Long-term forecasting models are rare. Why?

C. Process

1. Find a linear regression (OLS) that forecasts something
2. Find the future values of X
3. Plug these into the equation
4. Multiply each X with its corresponding β (order of operations)
5. Add it all together. Don’t forget the intercept.
6. Presto! You have a forecast for Y!
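The process above amounts to a few lines of arithmetic. In this sketch the intercept, coefficients, and future X values are all made up for illustration, and `forecast` is a hypothetical helper name; in practice the coefficients would come from a published or estimated regression table.

```python
def forecast(intercept, coefficients, future_x):
    """Forecast Y as intercept + sum of (coefficient * future X value)."""
    if len(coefficients) != len(future_x):
        raise ValueError("need exactly one future X value per coefficient")
    # Multiply each X by its own coefficient first, then add everything up.
    return intercept + sum(b * x for b, x in zip(coefficients, future_x))

# Hypothetical model: Y = 2.0 + 0.5*X1 - 1.5*X2 + 3.0*X3
intercept = 2.0
coefficients = [0.5, -1.5, 3.0]

# Hypothetical future values of X1, X2, X3.
future_x = [10.0, 2.0, 1.0]

y_forecast = forecast(intercept, coefficients, future_x)
print(y_forecast)  # 2.0 + 5.0 - 3.0 + 3.0 = 7.0
```

The length check guards against the most common slip in this process: pairing a coefficient with the wrong variable or leaving one out.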
