nancy-ryan
Class 10 (and some of 9)
Line fitting
OLS regression
Discussion of time series and panel models
Project check-in
Statistical Significance
• Research and null hypotheses
– A hypothesis states the relationship between two variables.
– The null hypothesis states that there is NO (or only a random) relationship between the two variables.
• H: Democracies trade more with each other than with non-democracies.
• H0: Status as a democracy is not related to trade volume.
– You are testing whether you can reject H0, not whether you can accept H.
Types of Error
Decision based on sample    State of nature: H0 true      State of nature: H0 untrue
Reject H0                   Type 1 error (false alarm)    Correct
Do not reject H0            Correct                       Type 2 error
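Alpha level
α = .05 means you accept a 5% chance of committing a Type 1 error, i.e. of rejecting the null hypothesis when it is actually true. It does not mean there is a 95% chance that a decision to reject the null hypothesis is correct; that probability also depends on how often the null hypothesis is true and on the power of the test.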
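One way to see what the alpha level controls is a small simulation. The sketch below is illustrative only: it assumes a two-sided z-test with a known standard deviation of 1 and a critical value of 1.96 (the usual cutoff for alpha = .05), and the function name, sample size, and seed are made up for the example. H0 is true in every trial, so every rejection is a false alarm.

```python
import math
import random


def type1_error_rate(critical_z=1.96, n=30, trials=10_000, seed=0):
    """Share of trials in which a true H0 (population mean = 0) is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H0 really holds.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        z = mean / (1.0 / math.sqrt(n))  # standard error with known sigma = 1
        if abs(z) > critical_z:
            rejections += 1  # Type 1 error: rejecting a true H0
    return rejections / trials


rate = type1_error_rate()
print(f"Share of false alarms: {rate:.3f}")
```

The printed share should land close to .05, which is all the alpha level promises: how often you raise a false alarm when there is nothing to find.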
Causality
• In establishing causality there is a dependent variable, which you are trying to explain, and one or more independent variables that are assumed to be factors in the variation of the dependent variable.
• You need a logical model to “explain” this relationship or causality
Thinking in Models (again)
• What is a model?
– Explains which elements relate to each other and how.
– Describing relationships in a model
• Covariation – the variables move together
– Direct or positive (they move in the same direction)
– Inverse or negative (they move in opposite directions)
– Nonlinear
• False or spurious
– Control (confounding) variables
• Are you looking for the best model or testing someone else’s?
Developing models
• Where does a model come from?
– From your own assessment and observation of the problem, or from talking to others.
– From the literature.
• Elements others include or consider important
• Definitions of these elements
• Descriptions of the “expected” relationships among variables
• Results and explanations
• Sources and strategies for data
• Suggestions of models or variations to be tested in the future
Types of Models
1. Schematic
[Schematic diagram with boxes labeled Capital, Labor, and Econ Growth]
2. Symbolic
a) Economic growth is a function of changes to the amount of capital (K) and changes to the amount of labor (L).
b) G = f(K, L)
The basic linear model (equation)
You can express many relationships as the linear equation:
y = a + bx, where
• y is the dependent variable
• x is the independent variable
• a is a constant
• b is the slope of the line
• For every increase of 1 in x, y changes by an amount equal to b
A perfectly linear relationship is one where each change in x results in exactly the same change in y, e.g. a strict ad valorem tariff.
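As a quick illustration of y = a + bx, the sketch below treats the duty owed under an ad valorem tariff as a linear function of import value. The 10% rate and the import values are hypothetical, chosen only for the example.

```python
def linear(x, a, b):
    """Return y = a + b*x."""
    return a + b * x


# Hypothetical 10% ad valorem tariff: duty = 0 + 0.10 * import value.
TARIFF_RATE = 0.10
import_values = (100, 101, 102)
duties = [linear(value, a=0.0, b=TARIFF_RATE) for value in import_values]

# Each increase of 1 in x changes y by the slope b.
steps = [later - earlier for earlier, later in zip(duties, duties[1:])]
print(duties)
print(steps)
```

Every step equals the slope (up to floating-point rounding), which is what "perfectly linear" means here.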
Line Fitting
Other relationships may not be so exact.
Weight is only to some degree a function of height.
If you take a sample of actual heights and weights, you might see something like the graph below.
[Figure: scatter plot of Weight (axis from 100 to 220) against Height (axis from 60 to 75)]
Source: http://www.tennessee.gov/tacir/Fiscal%20Capacity/Workshop/Regression%20Analysis%20Handout%20(Methodology%20Part%201).ppt
Line Fitting (cont.)
The line is the “average” relationship described by the equation:
y = a + bx + e
The difference between the line and any individual observation is the error (e).
The observations that contributed to this analysis were all for heights between 5’ and 6’4”. You cannot extrapolate the results to heights outside of those observed. The regression results are only valid for the range of actual observations.
[Figure: the Weight against Height scatter plot again, axes as before]
Regression
Regression is the method by which we find the line that best fits the observations, i.e. has the lowest error.
Since the line describes the mean of the effects of the independent variables, the sum of the actual errors will be zero.
If you add up all of the values of the dependent variable and all of the values predicted by the model, the two sums are the same: the sum of the negative errors (for points below the line) exactly offsets the sum of the positive errors (for points above the line).
Because the raw errors always sum to zero, their sum cannot tell you how well the line fits, so regression must find another way to measure the scale of the error. An Ordinary Least Squares (OLS) regression finds the line that results in the lowest sum of squared errors.
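The argument above can be checked on a small example. The sketch below uses made-up height and weight pairs (not the data behind the graph) and the standard closed-form OLS formulas for one independent variable. It also builds a second, "tilted" line through the same mean point to show that its errors sum to zero as well, while its sum of squared errors is larger.

```python
# Made-up (height, weight) observations, for illustration only.
heights = [62, 64, 66, 68, 70, 72, 74]
weights = [120, 136, 141, 158, 163, 181, 190]


def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar
    return a, b


def errors(xs, ys, a, b):
    """Observed value minus the value predicted by the line."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


a, b = ols_fit(heights, weights)
ols_errors = errors(heights, weights, a, b)
ols_sse = sum(e ** 2 for e in ols_errors)

# A different line through the same mean point: errors still sum to ~0.
x_bar = sum(heights) / len(heights)
y_bar = sum(weights) / len(weights)
tilted_b = b + 0.5
tilted_a = y_bar - tilted_b * x_bar
tilted_errors = errors(heights, weights, tilted_a, tilted_b)
tilted_sse = sum(e ** 2 for e in tilted_errors)

print(f"OLS line:    sum of errors {sum(ols_errors):.6f}, SSE {ols_sse:.2f}")
print(f"Tilted line: sum of errors {sum(tilted_errors):.6f}, SSE {tilted_sse:.2f}")
```

Both sums of errors are zero up to rounding, so only the squared errors distinguish the better-fitting line.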
Multiple Regression
• What if we have multiple factors contributing to a result or a prediction?
– For example, basic economic theory suggests that capital and labor contribute to economic growth.
– It is hard to “see” how these two factors contribute to growth.
The multiple regression equation
Each of these factors has a separate relationship with economic growth. The equation that describes a multiple regression relationship is:
y = a + b1L + b2K + e
This equation separates each individual independent variable from the rest, allowing each to have its own coefficient describing its relationship to the dependent variable. If Labor and Capital have the same coefficient, then a one-unit change in either is associated with the same change in economic growth.
In a statistics software program you will enter your dependent variable first and then your independent variables.
You will need to make sure the data and the variables conform to the assumptions of the model.
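To make the equation concrete, here is a minimal sketch of a multiple regression solved through the normal equations in plain Python. It is not how a statistics package does it internally, and the data are made up: growth is generated from 1 + 2L + 3K with no error term, so the fit should recover those coefficients. The helper names `solve` and `ols_multiple` are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols_multiple(rows, ys):
    """rows: one tuple of independent variables per observation.

    Returns [a, b1, b2, ...] minimizing the sum of squared errors.
    """
    design = [[1.0, *row] for row in rows]  # leading 1.0 is for the constant a
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data: growth = 1 + 2*L + 3*K, no error term.
labor = [1, 2, 3, 4, 5, 6]
capital = [2, 1, 4, 3, 6, 5]
growth = [1 + 2 * l + 3 * k for l, k in zip(labor, capital)]

a, b1, b2 = ols_multiple(list(zip(labor, capital)), growth)
print(f"a = {a:.3f}, b1 (Labor) = {b1:.3f}, b2 (Capital) = {b2:.3f}")
```

Each coefficient describes the change in growth for a one-unit change in its own variable while the other is held fixed.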
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.803316
R Square             0.645316
Adjusted R Square    0.639305
Standard Error       1.788119
Observations         121
ANOVA
              df     SS          MS          F           Significance F
Regression    2      686.446     343.223     107.3455    2.76E-27
Residual      118    377.2895    3.197368
Total         120    1063.735
                     Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept            -15.0184        2.064116          -7.27595    4.12E-11    -19.1059     -10.9309     -19.1059       -10.9309
ln pop2001           0.656415        0.106162          6.183167    9.31E-09    0.446186     0.866643     0.446186       0.866643
ln GDP per capita    1.490166        0.104154          14.30738    1.53E-27    1.283913     1.696418     1.283913       1.696418
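Several numbers in this output follow from the others by simple arithmetic. The sketch below recomputes them from the sums of squares, degrees of freedom, and coefficients shown in the table, using the standard textbook formulas. The variable names are made up for the example; the inputs are copied from the output above.

```python
import math

# Inputs copied from the summary output.
n_obs = 121
k = 2  # number of independent variables
ss_regression = 686.446
ss_residual = 377.2895
ss_total = 1063.735

# R Square: share of total variation explained by the regression.
r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)

# Adjusted R Square penalizes for the number of independent variables.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k - 1)

# Mean squares, F statistic, and the standard error of the regression.
ms_regression = ss_regression / k
ms_residual = ss_residual / (n_obs - k - 1)
f_stat = ms_regression / ms_residual
standard_error = math.sqrt(ms_residual)

# t Stat for a coefficient is the coefficient over its standard error.
t_ln_pop = 0.656415 / 0.106162

print(f"R Square          {r_squared:.6f}")
print(f"Multiple R        {multiple_r:.6f}")
print(f"Adjusted R Square {adj_r_squared:.6f}")
print(f"F                 {f_stat:.4f}")
print(f"Standard Error    {standard_error:.6f}")
print(f"t Stat ln pop2001 {t_ln_pop:.4f}")
```

The recomputed values should agree with the table up to rounding of the reported inputs.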
How good is the model
The R² value
• tells you what proportion of the variance in the dependent variable is explained by the model. An R² of .68, for example, means that 68% of the variance in the observed values of the dependent variable is explained by the model, and 32% of that variance remains unexplained in the error term.
Returning to the model of economic growth… Is explaining 50% of the variance good enough?
How much should you explain?
• Random error need not be a problem.
– There is always error; a larger R-square is not a goal in and of itself.
• Some error is due to latent variables that cannot be observed.
– There may be additional variables that can be logically assumed to measure these causes of variation indirectly in some way.
– But even if they empirically appear to “explain” the variation within the regression model, variables should not be added unless there appears to be a logical way in which they might explain variation in the dependent variable.
Statistical Significance
• Each independent variable has a “p-value” or significance level in the results. Sometimes it is explicitly given; sometimes only the test statistic is given, from which significance can be derived.
• The p-value is a probability. It tells you how likely it would be to see a coefficient at least this far from zero purely by chance if the independent variable in fact had no relationship with the dependent variable. Rejecting the null hypothesis in that situation would be a type I error.
• A p-value of .05 means that, if there were no real relationship, a coefficient this large would emerge randomly only 5% of the time. It does not mean there is a 95% chance that the relationship is real.
• It is generally accepted practice to consider variables with a p-value of less than .1 as significant (.05 is a common, stricter cutoff), though the only basis for these cutoffs is convention.
Direction and Size
Look at the signs of the b coefficients. Do they have the expected signs?
– Your model and hypothesis should give you an expectation of the direction of each independent variable’s influence.
• Is the effect large or small?
– Even if it is significant and in the right direction, does a change in the independent variable yield a large or small change in the dependent variable?
F-Test
There is also a significance level for the model as a whole.
• The F-test or “Significance F” value in Excel measures how likely it would be to see a fit at least this good purely by chance if none of the independent variables were actually related to the dependent variable. As with the p-value, the lower the Significance F value, the stronger the evidence that the model describes a real relationship.
Other Errors or Problems
• Multicollinearity
• Omitted Variables
• Endogeneity
• Other
Presenting Regression Results