Intro econometrics

7/24/2019 Intro econometrics

1/64

Regression


2/64

Regression.

Assumptions of regression

Violations of regression Multicollineraty

Basic

Heterioscedicity

. How to remove these violations.

Concept of dummy variables


3/64

ntroduction

!he study of the dependence of one variable"dependent variable# on one or more variables"e$planatory variables#

n regression% we deal with random "or stochastic#

variables. &ependent variable' e$plained% predicted% regressand%

response% endogenous% outcome% controlled variable.

($planatory Variable'ndependent% predictor%

regressor% stimulus% e$ogenous% covariate% controlvariable.

&ependent variable is plotted on vertical a$is andindependent variable is plotted on hori)ontal a$is.


4/64

*.+ !(RM,--/0 A,&,-!A!-,

n the literature the terms dependent variable andexplanatory variable are described variously. Arepresentative list is'


5/64

R(/R(11-, V(R121 C-RR(A!-,

n correlation analysis% the primary ob3ective is tomeasure the strength or degree of linearassociation between two variables. !hecoe4cient% measures this strength of "linear#

association. !he value of coe4cient of correlation varies

between 5* and 6*.

n regression analysis% we are not primarily

interested in such a measure. nstead% we try toestimate or predict the average value of onevariable on the basis of the 7$ed values of othervariables.


6/64

1catter 8lots of &ata with VariousCorrelation Coe4cients

Y

X

Y

X

Y

X

Y

X

Y

X

r = -1 r = -.6 r = 0

r = +.3r = +1

Y

X

r = 0rom' 1tatistics for Managers 2sing Microsoft9 ($cel :th (dition% ;


7/64

n regression% we are dealing with randomvariable.

!he term random is a synonym for the term

stochastic. A random or stochastic variable is avariable that can ta=e on any set of values%positive or negative% with a given probability.

!he dependent variable is assumed to be

statistical% random% or stochastic% that is% tohave a probability distribution. !he e$planatoryvariables% on the other hand% are assumed tohave 7$ed values "in repeated sampling#.


8/64

!ypes of &ata

!ime 1eries

Cross5section

8ooled "8anel#


9/64

!ime 1eries

A set of observations on the values that avariable ta=es at di>erent times% collectedat regular time intervals "daily% wee=ly%

monthly% ?uarterly% annually%?uin?uennially "every @ years#%decennially "every *< years#

!ime series is based on the assumption of

1tationarity. hich means that its meanand variance do not vary systematicallyover time.


10/64

Cross 1ectional

&ata on one or more variables collected at the same point intime% such as the Censes every *< years "last time% in *#.

-r data on cotton production and Cotton prices for the :provinces in the union for *< and **. Dor each year thedata on the @< states are cross5sectional data.

Cross5sectional data has the problem of heterogeneity"combination of very large or very small values#

for e$ample% collecting data of 8un3ab and Balochistan as8un3ab is the biggest populous province and Balochistan isthe biggest geographical province.

Dor e$ample% 8un3ab produces huge amounts of eggs andBalochistan produces very little. hen we include suchheterogeneous units in a statistical analysis% the si)e or scalee>ect must be ta=en into account so as not to mi$ appleswith oranges.


11/64

8ooled &ata

!he combination of cross5sectional andtime series data

May be in a form of 8anel% longitudinal or

micropanel data &ata on cotton production and Cotton

prices for the : provinces in 8a=istan for*< and **. Dor each year the data onthe @< states are cross5sectional data. Andfor both years% it became 8ooled data.


12/64

ntroduction

8anel data is also =nown aslongitudinal or cross5sectional time5series data#

is a dataset in which the behavior ofentities are observed across time.

!hese entities could be states%companies% individuals% countries%etc.


13/64

How to -rgani)e 8anel &ata


14/64

!wo5Variable Regression

Regression analysis is largely concernedwith estimating andEor predicting the"population# mean value of the dependent

variable on the basis of the =nown or 7$edvalues of the e$planatory variable"s#.

Bivariate or !wo5Variable

Regression in which the dependent variable

"the regressand# is related to a singlee$planatory variable "the regressor#.


15/64

!he simple linear regression model isgiven as

is the i5th dependent variable%

is the i5th independent variable.

iy


16/64

is the intercept parameter%

is called slope parameter andrepresent change in for unitchange in.

is i5th error term.


17/64

inearity

inearity in the Variables

!he 7rst meaning of linearity is that the Y is a linear function ofXi, the regression curve in this case is a straight line. But

Y = 1+ 2X2i is not a linear function

Y = 1+ 2Xi is a linear function

inearity in the 8arameters

!he second interpretation of linearity is Y is a linear function ofthe parameters, the s! it may or may not be linear in the

variable F. Y = 1+ 2X2i is a linear function

Y = 1+ 2Xi is a linear function

is a linear "in the parameter# regression model.


18/64


19/64


20/64

(rror !erm

e can e$press the deviation of an individual Yi

around its e$pected value

!echnically% ui is no#n as the stochastic

disturbance or stochastic error term.

!he stochastic disturbance term is a proxyfor allthe omitted or ne&lected variables that may a>ectY but are not included in the regression model.

ut the stochastic speci(cation has the advantagethat it clearly shows that there are other variablesincluded in the regression model.

Residual term


21/64

hy (rror !ermI

!he disturbance term ui shows allomittedvariables fromthe model but that collectively a>ect Y. hy dont #eintroduce theminto the model e$plicitlyI !he reasons aremany'

*. -a&ueness of theory /he theory, if any, determinin& thebehavior of Y may be% and often is% incomplete. 0e mi&ht beignorant or unsure about the other variables a>ecting Y.

;. navailability of data ac= of ?uantitative informationabout these variables% e.g.% information on family wealthgenerally is not available.

J. ore variables versus peripheral variablesAssume thatbesides incomeX1, the number of children per family X2, sex

X3, reli&ion X4, education X5, and &eo&raphical re&ion X6also

a7ect consumption e$penditure. But the 3oint inKuence of allor some of these variables may be so small and it does notpay to introduce them into the model e$plicitly. -ne hopes


22/64

hy (rror !ermI

:. *ntrinsic randomness in human behavior (ven if we succeed inintroducing all the relevant variables into the model% there isbound to be some Lintrinsic randomness in individual 0s thatcannot be e$plained no matter how hard we try. !he disturbances%the us, may very well reKect this intrinsic randomness.

@. 8oor proxy variables ut since data on these variables are notdirectly observable% in practice we use pro$y variables% which maynot be true representative.

+. 8rinciple of parsimony #e #ould lie to =eep our regressionmodel as simple as possible. f we can e$plain the behavior of Y

$substantially% #ith t#o or three explanatory variables and if ourtheory is not strong enough to suggest what other variables mightbe included% why introduce more variablesI et uirepresent all

other variables.


23/64

;J

The Population Linear Regression Model


24/64

Assumptions

*. is random variable %it has normaldistributed with mean )ero andvariance

i.e.

!he constant variance assumption is=nown as homoscedasticty .


25/64

Assumptions

;.

!he disturbance terms areindependent of each other.

J.

!he e$planatory variable is non5stochastic and assumed withouterror.

.


26/64

Assumptions

:. !he e$planatory variables are notperfectly linear correlated.


27/64

Assumptions

8roperties of least s?uares estimates

-1 estimators are the linearfunction of actual observation .

!he least s?uares estimate are theunbiased estimates of


28/64

Assumptions

Variance of

here N is the total number ofparameter estimated from in theregression line.


29/64

Assumptions


30/64

Autocorrelation

(conometric problems


31/64

Regression &iagnostics

1ession J'


32/64

hat shall we learnI

At the end of this session% we shall beable to'

Dind and remove inKuential observations

Chec= for homogeneity%

multicollinearity% model speci7cation


33/64

nKuential data

hy a single inKuential observationcan be a concern for a researcherI

2nusual observations include' -utliers' an observation with large

residual

everage' e$treme inKuence of anobservation on the dependent variable


34/64

-utliers

1catter plot

1ummary statistic if the gap betweenminimum and ma$ is unusuallygreater

Coo=s5& test is used to remove bothoutlier and inKuential variable at

same time...


35/64

How to 7nd unseal data

e might start e$amining data with'

1ummary statistics

/raphs

,umerical tests


36/64

1ummary 1tatistics

&o you see any problem with anyvariableI


37/64

&iagnostic tests

1ee the standard deviation

1ee the ma$ and min values


38/64

Dinding 2nusual &ata' /raphs

graph matri$ debt ta$ pro7t tang

varincome


39/64

(stimate the regression e?uation


40/64

1tatistical !ests

e can use studenti)ed residualsas a 7rst means foridentifying outliers

After estimating -1% residuals can be predicted with

predict r, rstudent

e should pay attention to studenti)ed residuals thate$ceed 6; or 5;% and get even more concerned aboutresiduals that e$ceed 6;.@ or 5;.@ and even yet moreconcerned about residuals that e$ceed 6J or 5J


41/64

How to identify r greater than ;

e can use list comand with if option

list [variables] if abs(r) > 2

Abs is used for absolute values

e can drop outliers with dropcommand

drop [variables] if abs(r) > 2


42/64

nKuential observation

!o identify observation that have greater inKuence onthe dependent variable% we can use levera&e functionafter -1

predict lev, leverage

/enerally% a point with leverage greater than";=6;#En should be carefully e$amined. Here = is the

number of predictors and n is the number ofobservations.

'2+2)9n :::::'2;3 + 2)951:::::


43/64

How to identify inKuentialobservations

e can use list command with ifoption

list [variables] if lev> value

e can drop inKuential observations

with drop command

drop [variables] if lev > value


44/64

($ercise' ..

oad the 7le and estimate the regression e?uation

regress .

,ow chec='

Dor outliers

Dor inKuential data

2sing both graphical and numerical tests


45/64

Can we chec= for residuals andinKuence at the same time

Coo=s& combines information onthe residual and leverage.

!he lowest value that Coo=Os &can assume is )ero% and thehigher the Coo=Os & is% the more

inKuential the point.

!he convention cut5o> point is

:En


46/64

Coo=s& test

predict d, cooksd

list [variables] d if d>4/n


47/64

More /raphical options

After -1% we can use avplots

An avplot is an attractive graphicmethod to present multiple inKuentialpoints on a predictor.

hat we are loo=ing for in an avplot are

those points that can e$ert substantialchange to the regression line.


48/64

2. Checking homoscedasticity orHETEOSCE!ST"C"T#

hen variance of the residuals is not constant% so itmeans there is heteroscedasticity. hile when thevariance is constant% so it is =nown ashomoscedasticity.

Heteroscedasticity mostly occurs in cross5sectionaldata. t can be detected by several graphical or non5graphical methods.

hen we detect heteroscedasticity the hypothesistests are invalidbecause the standard errors are

biased so are the values of ! and D statistics% hencewe cannot analy)e them correctly in the presence ofheteroscedasticity. "Deng i. &epartment of 1tatistics%1toc=holm 2niversity#


49/64

2. Checking homoscedasticityor HETEOSCE!ST"C"T#

-ne of the main assumptions for theordinary least s?uares regression is thehomogeneity of variance of the residuals.

f the model is well57tted% there should beno pattern to the residuals plotted againstthe 7tted values.

!he hetroscedasticity is as a result ofcross5sectional data


50/64

e can use graphical command or statistical commands

/raphical$ rvfplot

1tatistical ' Breusch58agan test And hiteOs test estat hettest is the Breusch58agan test.

estat imtest is the hites test.

t test the null hypothesis that the variance of the residuals ishomogenous.

f p5value is P


51/64


Heteroscedasticity can also occur when the modelis not speci7ed correctlyS

other reason may be when there are a limitednumber of dependent variables or if reliability of

independent variable is somehow lin=ed with two ormore dependent variables "Hayes and Cai% ;


52/64


/raphically when there are deviations from the centralline it means there is a problem of heteroscedasticity.

!here are several tests for detecting heteroscedasticityone of them used in the research is Breusch58agantest.

!his test chec=s the null hypothesis and also veri7esthat variance is constant.

estat Hettest command is used in stata to chec=heteroscedasticity.

f p5value is so small then we will accept alternative

hypothesis and re3ect null hypothesis% which meansvariance is not constant and there is heteroscedasticity.

h ki h d i i


53/64


Robust command is then used to controlheteroscedasticity% outliers and other inKuentialvariables.

IIII

$treg dependent variable independentvariables% fe robust

fe is used to specify that 7$ed e>ect has beenselected as a model and robust command isutili)ed in order to control heteroscedasticity%outliers and other inKuential variables that arepresent in the data.


54/64

M%LT"COLL"&E!R"T#

!he term multicollinearity was 7rst used by8owel Ciompa in **


55/64

M%LT"COLL"&E!R"T#

L!he name given to general problem which ariseswhere some or all of the e$planatory variables inrelation and are so highly correlated one with anotherthat it becomes very di4cult% if not impossible to

disentangle is their separate inKuence and obtain areasonably precise estimate of their relative e>ects.

As multicollinearity e$pands then the =ey concern is asudden boost in standard errors for the coe4cients%due to which reliability of the model decreases. !he

values of t5statistics become smaller incase of highermulticollinearity due to which it is di4cult to acceptalternative hypothesis.


56/64

Multicollinearity

hat happens when two or more variables arehighly correlatedI

hen there is a perfect linear relationshipamong the predictors% the estimates for aregression model cannot be uni?uely computed.

!he primary concern is that as the degree ofmulticollinearity increases the standard errorsfor the coe4cients can get wildly inKated.


57/64

M%LT"COLL"&E!R"T#

t commonly results in misleadingand confusing conclusions. -ne ofthe reasons for multicollinearity

might be the use of inappropriatedummy variable.

Tests 'or


58/64

Tests 'orM%LT"COLL"&E!R"T#

Calculate Correlation Coe4cient

!he easiest way to detect multicollinearity is by calculatingcorrelation between pairs of independent variables. f correlationis * or 5* so the researcher should then remove one of the twocorrelated variables from the sample.

1catter diagram between independent variables will give some

indication about the multicollinearity issue.Variance nKation Dactor "VD#

hile in case of 1!A!A variance inKation factor "VD# is used tocompute the amountEdegree of multicollinearity among thevariables. -ne could simply enter the command Lvif in 1!A!A%

after running regression analysis on the data. f the value of VD isgreater than or e?ual to *< then it means there is the problem ofmulticollinearity in the data.

VD can also be computed as'

here as is coe4cient of determination of model.


59/64

How to &eal with Multi5Collinearity

&rop one of the two variables whichare linear correlated with oneanother.

hich variable to be droppedIIIII"decision is based on

Chec=ing ,ormality of


60/64

Chec=ing ,ormality ofResiduals

After we run a regression analysis% we can usethe predict command to create residuals andthen use commands to chec= the normalityboth graphically and numerically.

(raphical$such as kdensity) *norm andpnormto chec= the normality of the residuals.

&umerical$i*r and s+ilk

After regression% we then use the predictcommand to generate residuals.

predict ,residual -ariale/) resid



61/64


Below we use the =density commandto produce a =ernel density plot withthe normal option re?uesting that a

normal density be overlaid on the plot. =density stands for =ernel density

estimate. t can be thought of as a

histogram with narrow bins andmoving average.

kdensity ,residual -ariale/) normal



62/64


!he pnorm command graphs a standardi)ednormal probability "858# plot while ?norm plotsthe ?uantiles of a variable against the ?uantilesof a normal distribution.

pnorm is sensitive to non5normality in the middlerange of data and ?norm is sensitive to non5normality near the tails.

e can accept that the residuals are close to a

normal distribution.pnorm ,residual -ariale/

*norm ,residual -ariale/



63/64


!here are also numerical tests for testingnormality.

i?r stands for inter5?uartile range and assumesthe symmetry of the distribution.

1evere outliers consist of those points that areeither J inter5?uartileranges below the 7rst?uartile or J inter5?uartile5ranges above the third?uartile. !he presence of any severe outliersshould be su4cient evidence to re3ect normalityat a @Q signi7cance level.

Mild outliers are common in samples of any si)e.n our case% we donOt have any severe outliersand the distribution seems fairly symmetric. !he

residuals have an appro$imately normal



64/64


Another test available is the swil= testwhich performs the 1hapiro5il= testfor normality.

!he p5value is based on theassumption that the distribution isnormal ",ull Hypothesis#.

f p5value is more than

Documents

Intro econometrics