wajidshahsw1
7/24/2019 Intro econometrics
Regression
Basic regression
Assumptions of regression
Violations of regression: multicollinearity, heteroscedasticity
How to remove these violations
Concept of dummy variables
Introduction
The study of the dependence of one variable (the dependent variable) on one or more other variables (the explanatory variables).
In regression, we deal with random (or stochastic) variables.
Dependent variable: explained, predicted, regressand, response, endogenous, outcome, controlled variable.
Explanatory variable: independent, predictor, regressor, stimulus, exogenous, covariate, control variable.
The dependent variable is plotted on the vertical axis and the independent variable on the horizontal axis.
1.6 TERMINOLOGY AND NOTATION
In the literature the terms dependent variable and explanatory variable are described variously. A representative list is:
REGRESSION VERSUS CORRELATION
In correlation analysis, the primary objective is to measure the strength or degree of linear association between two variables. The correlation coefficient measures this strength of (linear) association.
The value of the coefficient of correlation varies between -1 and +1.
In regression analysis, we are not primarily interested in such a measure. Instead, we try to estimate or predict the average value of one variable on the basis of the fixed values of other variables.
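As an illustration of the correlation coefficient, here is a minimal Python sketch that computes the sample correlation from its definition. The function name `pearson_r` and the data are hypothetical and chosen only to show the two extreme values and one intermediate case.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: an exact increasing line, an exact decreasing line,
# and a noisy increasing relationship.
x = [1, 2, 3, 4, 5]
perfect_positive = [2 * v + 1 for v in x]
perfect_negative = [10 - 3 * v for v in x]
noisy = [2.1, 3.9, 6.2, 7.8, 10.1]

r_pos = pearson_r(x, perfect_positive)
r_neg = pearson_r(x, perfect_negative)
r_noisy = pearson_r(x, noisy)
print(r_pos, r_neg, r_noisy)
```

The coefficient only describes the strength of the linear association; it gives no equation for predicting one variable from the other, which is the task of regression.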
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X, with panels r = -1, r = -.6, r = 0, r = +.3, r = +1, and r = 0]
From: Statistics for Managers Using Microsoft® Excel 4th Edition, 2
In regression, we are dealing with random variables.
The term random is a synonym for the term stochastic. A random or stochastic variable is a variable that can take on any set of values, positive or negative, with a given probability.
The dependent variable is assumed to be statistical, random, or stochastic, that is, to have a probability distribution. The explanatory variables, on the other hand, are assumed to have fixed values (in repeated sampling).
Types of Data
Time Series
Cross-section
Pooled (Panel)
Time Series
A set of observations on the values that a variable takes at different times, collected at regular time intervals: daily, weekly, monthly, quarterly, annually, quinquennially (every 5 years), or decennially (every 10 years).
Time series analysis is based on the assumption of stationarity, which means that the mean and variance of the series do not vary systematically over time.
Cross Sectional
Data on one or more variables collected at the same point in time, such as the Census every 10 years (last time, in 1).
Or data on cotton production and cotton prices for the 4 provinces in the union for 10 and 11. For each year the data on the provinces are cross-sectional data.
Cross-sectional data has the problem of heterogeneity (a combination of very large and very small values), for example when collecting data on Punjab and Balochistan: Punjab is the most populous province and Balochistan is the largest province by area.
For example, Punjab produces huge amounts of eggs and Balochistan produces very little. When we include such heterogeneous units in a statistical analysis, the size or scale effect must be taken into account so as not to mix apples with oranges.
Pooled Data
The combination of cross-sectional and time series data.
May be in the form of panel, longitudinal, or micropanel data.
Data on cotton production and cotton prices for the 4 provinces in Pakistan for 10 and 11. For each year the data on the provinces are cross-sectional data, and for both years taken together they become pooled data.
Introduction
Panel data (also known as longitudinal or cross-sectional time-series data) is a dataset in which the behavior of entities is observed across time.
These entities could be states, companies, individuals, countries, etc.
How to Organize Panel Data
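One common layout is the long format, in which each row holds one entity at one point in time. The Python sketch below is only an illustration under that assumption: the entity labels, years, and the variables `y` and `x` are hypothetical.

```python
from collections import defaultdict

# Hypothetical long-format panel: one row per (entity, year) pair.
panel = [
    {"entity": "B", "year": 2002, "y": 8.1, "x": 1.1},
    {"entity": "A", "year": 2001, "y": 10.0, "x": 1.5},
    {"entity": "B", "year": 2001, "y": 7.4, "x": 0.9},
    {"entity": "A", "year": 2002, "y": 11.2, "x": 1.7},
]

# Sort by entity, then by time, so each entity's rows are contiguous.
panel.sort(key=lambda row: (row["entity"], row["year"]))

# Grouping by entity gives one time series per entity.
by_entity = defaultdict(list)
for row in panel:
    by_entity[row["entity"]].append(row)

# Grouping by year gives one cross-section per period.
by_year = defaultdict(list)
for row in panel:
    by_year[row["year"]].append(row)

entities = sorted(by_entity)
years = sorted(by_year)
# A panel is balanced when every entity is observed in every period.
balanced = all(len(rows) == len(years) for rows in by_entity.values())
print(entities, years, balanced)
```

Read across the rows of one entity, the data form a time series; read across the entities within one year, they form a cross-section.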
Two-Variable Regression
Regression analysis is largely concerned with estimating and/or predicting the (population) mean value of the dependent variable on the basis of the known or fixed values of the explanatory variable(s).
Bivariate or Two-Variable Regression: regression in which the dependent variable (the regressand) is related to a single explanatory variable (the regressor).
The simple linear regression model is given as
$y_i = \beta_1 + \beta_2 x_i + u_i$
where
$y_i$ is the i-th observation on the dependent variable,
$x_i$ is the i-th observation on the independent variable,
$\beta_1$ is the intercept parameter,
$\beta_2$ is the slope parameter and represents the change in $y_i$ for a unit change in $x_i$,
$u_i$ is the i-th error term.
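To make the intercept and slope concrete, the following Python sketch estimates them by ordinary least squares for a single regressor, using the textbook closed-form expressions. The function name `ols_fit` and the data are hypothetical; this is an illustration, not the procedure used in the slides.

```python
def ols_fit(x, y):
    """Least squares intercept, slope, and residuals for one regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                    # estimate of the slope parameter
    intercept = mean_y - slope * mean_x  # estimate of the intercept parameter
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, residuals


# Hypothetical data lying close to a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 5.9, 8.1, 9.7]

beta1, beta2, residuals = ols_fit(x, y)
print(beta1, beta2, residuals)
```

The residuals are the sample counterparts of the error term; with an intercept in the model they sum to zero.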
Linearity
Linearity in the Variables
The first meaning of linearity is that Y is a linear function of $X_i$; the regression curve in this case is a straight line. But
$Y = \beta_1 + \beta_2 X_i^2$ is not a linear function
$Y = \beta_1 + \beta_2 X_i$ is a linear function
Linearity in the Parameters
The second interpretation of linearity is that Y is a linear function of the parameters, the $\beta$s; it may or may not be linear in the variable X.
$Y = \beta_1 + \beta_2 X_i^2$ is a linear function
$Y = \beta_1 + \beta_2 X_i$ is a linear function
Thus $Y = \beta_1 + \beta_2 X_i^2$ is a linear (in the parameters) regression model.
Error Term
We can express the deviation of an individual $Y_i$ around its expected value as $u_i = Y_i - E(Y \mid X_i)$.
Technically, $u_i$ is known as the stochastic disturbance or stochastic error term.
The stochastic disturbance term is a proxy for all the omitted or neglected variables that may affect Y but are not included in the regression model.
The stochastic specification has the advantage that it clearly shows that there are other variables, besides those included in the regression model, that affect Y.
Residual term
Why Error Term?
The disturbance term $u_i$ stands in for all the variables that are omitted from the model but that collectively affect Y. Why don't we introduce them into the model explicitly? The reasons are many:
1. Vagueness of theory: The theory, if any, determining the behavior of Y may be, and often is, incomplete. We might be ignorant or unsure about the other variables affecting Y.
2. Unavailability of data: Lack of quantitative information about these variables, e.g., information on family wealth generally is not available.
3. Core variables versus peripheral variables: Assume that besides income X1, the number of children per family X2, sex X3, religion X4, education X5, and geographical region X6 also affect consumption expenditure. But the joint influence of all or some of these variables may be so small that it does not pay to introduce them into the model explicitly. One hopes
Why Error Term?
4. Intrinsic randomness in human behavior: Even if we succeed in introducing all the relevant variables into the model, there is bound to be some "intrinsic" randomness in individual Y's that cannot be explained no matter how hard we try. The disturbances, the u's, may very well reflect this intrinsic randomness.
5. Poor proxy variables: Since data on these variables are not directly observable, in practice we use proxy variables, which may not be truly representative.
6. Principle of parsimony: We would like to keep our regression model as simple as possible. If we can explain the behavior of Y "substantially" with two or three explanatory variables and if our theory is not strong enough to suggest what other variables might be included, why introduce more variables? Let $u_i$ represent all other variables.
The Population Linear Regression Model
Assumptions
1. $u_i$ is a random variable; it is normally distributed with mean zero and variance $\sigma^2$, i.e. $u_i \sim N(0, \sigma^2)$.
The constant variance assumption is known as homoscedasticity.
Assumptions
2. The disturbance terms are independent of each other.
3. The explanatory variable is non-stochastic and assumed to be measured without error.
Assumptions
4. The explanatory variables are not perfectly linearly correlated.
Assumptions
Properties of least squares estimates
The OLS estimators are linear functions of the actual observations $y_i$.
The least squares estimates are unbiased estimates of the parameters $\beta_1$ and $\beta_2$.
Assumptions
Variance of
Here N is the total number of parameters estimated in the regression line.
Assumptions
Autocorrelation
Econometric problems
Regression Diagnostics
Session 3:
What shall we learn?
At the end of this session, we shall be able to:
Find and remove influential observations
Check for homogeneity of variance, multicollinearity, and model specification
Influential data
Why can a single influential observation be a concern for a researcher?
Unusual observations include:
Outliers: an observation with a large residual
Leverage: an observation with an extreme predictor value, which can exert extreme influence on the fitted regression
Outliers
Scatter plot
Summary statistics: check whether the gap between the minimum and maximum is unusually large
Cook's D is used to identify both outliers and influential observations at the same time.
How to find unusual data
We might start examining data with:
Summary statistics
Graphs
Numerical tests
Summary Statistics
Do you see any problem with any variable?
Diagnostic tests
See the standard deviation
See the max and min values
Finding Unusual Data: Graphs
graph matrix debt tax profit tang varincome
Estimate the regression equation
Statistical Tests
We can use studentized residuals as a first means of identifying outliers.
After estimating OLS, the studentized residuals can be predicted with
predict r, rstudent
We should pay attention to studentized residuals that exceed +2 or -2, get even more concerned about residuals that exceed +2.5 or -2.5, and yet more concerned about residuals that exceed +3 or -3.
How to identify r greater than 2
We can use the list command with the if option
list [variables] if abs(r) > 2
abs() returns the absolute value
We can drop outliers with the drop command; no variable list is given because drop with if removes whole observations
drop if abs(r) > 2
Influential observation
To identify observations that have greater influence on the dependent variable, we can use the leverage option of predict after OLS
predict lev, leverage
Generally, a point with leverage greater than (2k+2)/n should be carefully examined. Here k is the number of predictors and n is the number of observations.
(2k+2)/n = (2*3 + 2)/51 ≈ 0.157
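The cut-off rule can be sketched in Python for the special case of one predictor, where the leverage of observation i has the closed form 1/n + (x_i - mean)^2 / sum of squared deviations. The function name `leverage` and the data are hypothetical; with several predictors the leverages come from the diagonal of the hat matrix instead.

```python
def leverage(x):
    """Leverage of each observation in a regression on one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Hypothetical predictor values; the last one lies far from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

k = 1                      # number of predictors
n = len(x)                 # number of observations
cutoff = (2 * k + 2) / n   # rule of thumb from the text

lev = leverage(x)
# Positions (0-based) of observations whose leverage exceeds the cut-off.
flagged = [i for i, h in enumerate(lev) if h > cutoff]
print(cutoff, flagged)
```

Leverage depends only on the predictor values, so a point can have high leverage without having a large residual.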
How to identify influential observations
We can use the list command with the if option
list [variables] if lev > value
We can drop influential observations with the drop command
drop if lev > value
Exercise: ..
Load the file and estimate the regression equation
regress .
Now check:
For outliers
For influential data
Using both graphical and numerical tests
Can we check for residuals and influence at the same time?
Cook's D combines information on the residual and leverage.
The lowest value that Cook's D can assume is zero, and the higher the Cook's D is, the more influential the point.
The conventional cut-off point is 4/n.
Cook's D test
predict d, cooksd
list [variables] d if d>4/n
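To show how Cook's D combines the residual and the leverage, here is a minimal Python sketch for a regression on one predictor. It assumes the standard definition D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2), with p = 2 estimated parameters and s^2 the residual variance; the function name `cooks_distance` and the data are hypothetical.

```python
def cooks_distance(x, y):
    """Cook's D for each observation in a regression on one predictor."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance
    lev = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [(e * e / (p * s2)) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]


# Hypothetical data: the last point has an extreme x and lies off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 9.0, 12.0]

d = cooks_distance(x, y)
n = len(x)
# Positions (0-based) of observations above the conventional 4/n cut-off.
flagged = [i for i, di in enumerate(d) if di > 4 / n]
print(flagged)
```

Only the last observation is flagged here: it has both high leverage and a non-trivial residual, which is exactly the combination Cook's D is designed to pick up.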
More Graphical options
After OLS, we can use avplots
An avplot (added-variable plot) is an attractive graphical method for presenting multiple influential points on a predictor.
What we are looking for in an avplot are those points that can exert substantial change on the regression line.
2. Checking homoscedasticity or HETEROSCEDASTICITY
When the variance of the residuals is not constant, there is heteroscedasticity; when the variance is constant, it is known as homoscedasticity.
Heteroscedasticity mostly occurs in cross-sectional data. It can be detected by several graphical or non-graphical methods.
When heteroscedasticity is present, the hypothesis tests are invalid because the standard errors are biased, and so are the values of the t and F statistics; hence we cannot analyze them correctly in the presence of heteroscedasticity. (Feng Li, Department of Statistics, Stockholm University)
2. Checking homoscedasticity or HETEROSCEDASTICITY
One of the main assumptions for ordinary least squares regression is the homogeneity of variance of the residuals.
If the model is well-fitted, there should be no pattern to the residuals plotted against the fitted values.
Heteroscedasticity often arises as a result of using cross-sectional data.
We can use graphical commands or statistical commands
Graphical: rvfplot
Statistical: Breusch-Pagan test and White's test
estat hettest is the Breusch-Pagan test.
estat imtest is White's test.
Both test the null hypothesis that the variance of the residuals is homogeneous.
If p-value is
2. Checking homoscedasticity or HETEROSCEDASTICITY
Heteroscedasticity can also occur when the model is not specified correctly; another reason may be when there are a limited number of dependent variables or if the reliability of an independent variable is somehow linked with two or more dependent variables (Hayes and Cai, 2
2. Checking homoscedasticity or HETEROSCEDASTICITY
Graphically, when there are deviations from the central line it means there is a problem of heteroscedasticity.
There are several tests for detecting heteroscedasticity; one of them used in the research is the Breusch-Pagan test.
This test checks the null hypothesis that the variance is constant.
The estat hettest command is used in Stata to check for heteroscedasticity.
If the p-value is very small, we reject the null hypothesis in favor of the alternative hypothesis, which means the variance is not constant and there is heteroscedasticity.
2. Checking homoscedasticity or HETEROSCEDASTICITY
The robust option is then used to control heteroscedasticity, outliers and other influential variables.
????
xtreg [dependent variable] [independent variables], fe robust
fe specifies that the fixed effect model has been selected, and the robust option is used in order to control heteroscedasticity, outliers and other influential variables that are present in the data.
MULTICOLLINEARITY
The term multicollinearity was first used by Powel Ciompa in 11
MULTICOLLINEARITY
"The name given to the general problem which arises when some or all of the explanatory variables in a relation are so highly correlated one with another that it becomes very difficult, if not impossible, to disentangle their separate influences and obtain a reasonably precise estimate of their relative effects."
As multicollinearity increases, the key concern is a sharp increase in the standard errors of the coefficients, which reduces the reliability of the model. The values of the t-statistics become smaller when multicollinearity is higher, which makes it difficult to accept the alternative hypothesis.
Multicollinearity
What happens when two or more variables are highly correlated?
When there is a perfect linear relationship among the predictors, the estimates for a regression model cannot be uniquely computed.
The primary concern is that as the degree of multicollinearity increases, the standard errors for the coefficients can get wildly inflated.
MULTICOLLINEARITY
It commonly results in misleading and confusing conclusions. One of the reasons for multicollinearity might be the use of inappropriate dummy variables.
Tests for MULTICOLLINEARITY
Calculate Correlation Coefficient
The easiest way to detect multicollinearity is by calculating the correlation between pairs of independent variables. If the correlation is 1 or -1, the researcher should remove one of the two correlated variables from the model.
A scatter diagram between independent variables will give some indication about the multicollinearity issue.
Variance Inflation Factor (VIF)
In Stata, the variance inflation factor (VIF) is used to compute the degree of multicollinearity among the variables. One could simply enter the command "vif" in Stata after running regression analysis on the data. If the value of VIF is greater than or equal to 10, there is a problem of multicollinearity in the data.
VIF can also be computed as:
$\mathrm{VIF} = \dfrac{1}{1 - R^2}$
where $R^2$ is the coefficient of determination of the model in which that predictor is regressed on the other predictors.
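The VIF formula can be sketched in Python for the simplest case of two predictors, where the R-squared from regressing one predictor on the other equals their squared correlation. The function names and the data are hypothetical; with more predictors the auxiliary R-squared comes from a multiple regression.

```python
def r_squared(x, z):
    """R-squared from regressing z on x (with intercept), one regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    szz = sum((b - mean_z) ** 2 for b in z)
    sxz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    return sxz ** 2 / (sxx * szz)


def vif(r2):
    """Variance inflation factor for a given auxiliary R-squared."""
    return 1.0 / (1.0 - r2)


# Hypothetical predictors: x2 is almost a multiple of x1, x3 is unrelated.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
x3 = [5, 3, 6, 2, 7, 4]

vif_collinear = vif(r_squared(x1, x2))
vif_unrelated = vif(r_squared(x1, x3))
print(vif_collinear, vif_unrelated)
```

The nearly collinear pair gives a VIF far above the threshold of 10, while the unrelated pair gives a VIF close to its minimum value of 1.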
How to Deal with Multi-Collinearity
Drop one of the two variables which are linearly correlated with one another.
Which variable should be dropped? (decision is based on
Checking Normality of Residuals
After we run a regression analysis, we can use the predict command to create residuals and then use commands to check the normality both graphically and numerically.
Graphical: commands such as kdensity, qnorm and pnorm check the normality of the residuals.
Numerical: iqr and swilk
After regression, we then use the predict command to generate residuals.
predict [residual variable], resid
Checking Normality of Residuals
Below we use the kdensity command to produce a kernel density plot with the normal option requesting that a normal density be overlaid on the plot. kdensity stands for kernel density estimate. It can be thought of as a histogram with narrow bins and a moving average.
kdensity [residual variable], normal
Checking Normality of Residuals
The pnorm command graphs a standardized normal probability (P-P) plot while qnorm plots the quantiles of a variable against the quantiles of a normal distribution.
pnorm is sensitive to non-normality in the middle range of data and qnorm is sensitive to non-normality near the tails.
If the points in both plots lie close to the reference line, we can accept that the residuals are close to a normal distribution.
pnorm [residual variable]
qnorm [residual variable]
Checking Normality of Residuals
There are also numerical tests for testing normality.
iqr stands for inter-quartile range and assumes the symmetry of the distribution.
Severe outliers consist of those points that are either 3 inter-quartile ranges below the first quartile or 3 inter-quartile ranges above the third quartile. The presence of any severe outliers should be sufficient evidence to reject normality at a 5% significance level.
Mild outliers are common in samples of any size. In our case, we don't have any severe outliers and the distribution seems fairly symmetric. The residuals have an approximately normal distribution.
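The severe-outlier rule described above can be sketched in Python with the standard library. The function name `severe_outliers` and the residual values are hypothetical, and the quartiles come from `statistics.quantiles`, whose quartile convention may differ slightly from the one Stata's iqr command uses.

```python
import statistics


def severe_outliers(values, multiplier=3.0):
    """Values more than `multiplier` inter-quartile ranges outside Q1 or Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low = q1 - multiplier * iqr    # lower fence
    high = q3 + multiplier * iqr   # upper fence
    return [v for v in values if v < low or v > high]


# Hypothetical residuals: one value lies far above the others.
residuals = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.2, 0.4, 0.9, 1.1, 9.5]
clean = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.2, 0.4, 0.9, 1.1, 1.3]

severe = severe_outliers(residuals)
severe_clean = severe_outliers(clean)
print(severe, severe_clean)
```

Because the fences are built from quartiles, a single extreme residual barely moves them, which is why this check is reasonably stable in the presence of the very outliers it is looking for.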
Checking Normality of Residuals
Another test available is the swilk test, which performs the Shapiro-Wilk test for normality.
The p-value is based on the assumption that the distribution is normal (null hypothesis).
If p-value is more than