Intro econometrics



    Regression


    Regression

    Basics of regression

    Assumptions of regression

    Violations of regression: multicollinearity, heteroscedasticity

    How to remove these violations

    Concept of dummy variables


    Introduction

    The study of the dependence of one variable (the dependent variable) on one or more other variables (the explanatory variables).

    In regression, we deal with random (or stochastic) variables.

    Dependent variable: also called the explained, predicted, response, endogenous, outcome, or controlled variable; the regressand.

    Explanatory variable: also called the independent, predictor, stimulus, exogenous, or control variable; the regressor; a covariate.

    The dependent variable is plotted on the vertical axis and the independent variable on the horizontal axis.


    1.6 TERMINOLOGY AND NOTATION

    In the literature the terms dependent variable and explanatory variable are described variously. A representative list is:


    REGRESSION VERSUS CORRELATION

    In correlation analysis, the primary objective is to measure the strength or degree of linear association between two variables. The correlation coefficient measures this strength of (linear) association. Its value varies between -1 and +1.

    In regression analysis, we are not primarily interested in such a measure. Instead, we try to estimate or predict the average value of one variable on the basis of the fixed values of other variables.
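    A minimal Stata sketch of the distinction, assuming two hypothetical variables y and x:

    correlate y x     // correlation: a symmetric measure of linear association, between -1 and +1
    regress   y x     // regression: estimates the average value of y for given values of x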


    Scatter Plots of Data with Various Correlation Coefficients

    [Scatter plots of Y against X for r = -1, r = -0.6, r = 0, r = +0.3, r = +1, and r = 0. From: Statistics for Managers Using Microsoft® Excel, 4th Edition.]


    In regression, we are dealing with random variables.

    The term random is a synonym for the term stochastic. A random or stochastic variable is a variable that can take on any set of values, positive or negative, with a given probability.

    The dependent variable is assumed to be statistical, random, or stochastic, that is, to have a probability distribution. The explanatory variables, on the other hand, are assumed to have fixed values (in repeated sampling).


    Types of Data

    Time series

    Cross-section

    Pooled (panel)


    Time Series

    A set of observations on the values that a variable takes at different times, collected at regular time intervals: daily, weekly, monthly, quarterly, annually, quinquennially (every 5 years), or decennially (every 10 years).

    Time-series analysis is based on the assumption of stationarity, which means that the series' mean and variance do not vary systematically over time.


    Cross-Sectional Data

    Data on one or more variables collected at the same point in time, such as the census carried out every 10 years.

    Or data on cotton production and cotton prices for the 4 provinces for 2010 and 2011: for each year, the data on the 4 provinces are cross-sectional data.

    Cross-sectional data have the problem of heterogeneity (a combination of very large and very small values),

    for example, data on Punjab and Balochistan: Punjab is the most populous province and Balochistan is the largest province by area.

    For example, Punjab produces huge quantities of eggs and Balochistan produces very little. When we include such heterogeneous units in a statistical analysis, the size or scale effect must be taken into account so as not to mix apples with oranges.


    Pooled Data

    The combination of cross-sectional and time-series data.

    It may take the form of panel, longitudinal, or micropanel data.

    Data on cotton production and cotton prices for the 4 provinces of Pakistan for 2010 and 2011: for each year, the data on the 4 provinces are cross-sectional data, and taken together for both years they become pooled data.


    Introduction

    Panel data (also known as longitudinal or cross-sectional time-series data) is a dataset in which the behavior of entities is observed across time.

    These entities could be states, companies, individuals, countries, etc.


    How to Organize Panel Data
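    As a minimal sketch (the variable names province, year, cotton_prod, and cotton_price are assumptions), panel data are usually organized in long format, one row per entity-year, and then declared in Stata with xtset:

    * Hypothetical long-format panel: one row per province-year
    * province      year   cotton_prod   cotton_price
    * Punjab        2010   ...           ...
    * Punjab        2011   ...           ...
    * Balochistan   2010   ...           ...
    encode province, gen(id)   // create a numeric entity identifier from the string variable
    xtset id year              // declare the panel (entity and time) structure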


    Two-Variable Regression

    Regression analysis is largely concerned with estimating and/or predicting the (population) mean value of the dependent variable on the basis of the known or fixed values of the explanatory variable(s).

    Bivariate or two-variable regression: regression in which the dependent variable (the regressand) is related to a single explanatory variable (the regressor).


    The simple linear regression model is given as

    Yi = β1 + β2 Xi + ui ,

    where Yi is the i-th value of the dependent variable and Xi is the i-th value of the independent variable.


    β1 is the intercept parameter;

    β2 is the slope parameter and represents the change in Y for a unit change in X;

    ui is the i-th error term.


    Linearity

    Linearity in the Variables

    The first meaning of linearity is that Y is a linear function of Xi; the regression curve in this case is a straight line. Thus

    Y = β1 + β2 Xi²  is not a linear (in the variable) function, while

    Y = β1 + β2 Xi   is a linear function.

    Linearity in the Parameters

    The second interpretation of linearity is that Y is a linear function of the parameters, the βs; it may or may not be linear in the variable X. Under this interpretation,

    Y = β1 + β2 Xi²  is a linear (in the parameters) function, and

    Y = β1 + β2 Xi   is also a linear function.

    E(Y | Xi) = β1 + β2 Xi²  is a linear (in the parameters) regression model.
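    For contrast, a standard counterexample (not shown on the slide): a model such as E(Y | Xi) = β1 + β2² Xi is nonlinear in the parameters, because β2 enters as a square, and so it lies outside the linear regression framework even though it is linear in Xi.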


    Error Term

    We can express the deviation of an individual Yi around its expected value as ui = Yi − E(Y | Xi), that is, Yi = E(Y | Xi) + ui.

    Technically, ui is known as the stochastic disturbance or stochastic error term.

    The stochastic disturbance term is a proxy for all the omitted or neglected variables that may affect Y but are not included in the regression model.

    But the stochastic specification has the advantage that it clearly shows that there are other variables, besides those included in the regression model, that affect Y.

    The residual term, ûi = Yi − Ŷi, is the sample counterpart of the disturbance ui.


    Why an Error Term?

    The disturbance term ui stands in for all variables omitted from the model that collectively affect Y. Why don't we introduce them into the model explicitly? The reasons are many:

    1. Vagueness of theory: The theory, if any, determining the behavior of Y may be, and often is, incomplete. We might be ignorant or unsure about the other variables affecting Y.

    2. Unavailability of data: lack of quantitative information about these variables; e.g., information on family wealth generally is not available.

    3. Core variables versus peripheral variables: Assume that besides income X1, the number of children per family X2, sex X3, religion X4, education X5, and geographical region X6 also affect consumption expenditure. But the joint influence of all or some of these variables may be so small that it does not pay to introduce them into the model explicitly. One hopes that their combined effect can be treated as a random variable ui.


    Why an Error Term?

    4. Intrinsic randomness in human behavior: Even if we succeed in introducing all the relevant variables into the model, there is bound to be some "intrinsic" randomness in individual Ys that cannot be explained no matter how hard we try. The disturbances, the us, may very well reflect this intrinsic randomness.

    5. Poor proxy variables: Since data on some of these variables are not directly observable, in practice we use proxy variables, which may not be true representatives.

    6. Principle of parsimony: We would like to keep our regression model as simple as possible. If we can explain the behavior of Y "substantially" with two or three explanatory variables and if our theory is not strong enough to suggest what other variables might be included, why introduce more variables? Let ui represent all other variables.


    The Population Linear Regression Model


    Assumptions

    1. ui is a random variable; it is normally distributed with mean zero and constant variance σ², i.e. ui ~ N(0, σ²).

    The constant-variance assumption is known as homoscedasticity.


    Assumptions

    2. The disturbance terms are independent of each other: Cov(ui, uj) = 0 for i ≠ j.

    3. The explanatory variable is non-stochastic and is assumed to be measured without error.


    Assumptions

    4. The explanatory variables are not perfectly linearly correlated.


    Properties of Least Squares Estimates

    The OLS estimators are linear functions of the actual observations.

    The least squares estimates are unbiased estimates of the parameters β1 and β2.


    Variance of the Estimators

    The variances of the estimators depend on the error variance σ², which is estimated by σ̂² = Σûi² / (n − k), where k is the total number of parameters estimated in the regression.
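    For reference, a textbook supplement (assuming the two-variable model Yi = β1 + β2 Xi + ui with homoscedastic errors), the standard OLS expressions are:

    var(β̂2) = σ² / Σ(Xi − X̄)²

    var(β̂1) = σ² ΣXi² / [ n Σ(Xi − X̄)² ]

    σ̂² = Σûi² / (n − 2)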




    Autocorrelation

    Econometric problems


    Regression Diagnostics

    Session 3


    What shall we learn?

    At the end of this session, we shall be able to:

    Find and remove influential observations

    Check for homogeneity of variance, multicollinearity, and model specification


    Influential Data

    Why can a single influential observation be a concern for a researcher?

    Unusual observations include:

    Outliers: observations with large residuals

    Leverage: an observation with an extreme value on a predictor, which can exert a strong influence on the estimated regression


    Outliers

    Scatter plot

    Summary statistics: if the gap between the minimum and the maximum is unusually large

    Cook's D is used to examine outliers and influential observations at the same time.


    How to Find Unusual Data

    We might start examining the data with:

    Summary statistics

    Graphs

    Numerical tests


    Summary Statistics

    Do you see any problem with any variable?


    Diagnostic Tests

    See the standard deviation

    See the max and min values


    Finding Unusual Data: Graphs

    graph matrix debt tax profit tang varincome


    Estimate the regression equation


    Statistical Tests

    We can use studentized residuals as a first means of identifying outliers.

    After estimating by OLS, studentized residuals can be obtained with

    predict r, rstudent

    We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and get even more concerned about residuals that exceed +3 or -3.


    How to identify residuals greater than 2 in absolute value

    We can use the list command with the if option:

    list [variables] if abs(r) > 2

    abs() returns absolute values.

    We can drop the outlying observations with the drop command:

    drop if abs(r) > 2
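    A minimal sketch of the whole sequence, using Stata's built-in auto data as a stand-in for the course dataset:

    sysuse auto, clear
    regress price mpg weight
    predict r, rstudent                 // studentized residuals
    list make price r if abs(r) > 2     // inspect candidate outliers before dropping anything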


    Influential Observations

    To identify observations that have a greater influence on the regression estimates, we can use the leverage function after OLS:

    predict lev, leverage

    Generally, a point with leverage greater than (2k + 2)/n should be carefully examined, where k is the number of predictors and n is the number of observations.

    For example, with k = 3 predictors and n = 51 observations, the cutoff is (2×3 + 2)/51 ≈ 0.16.


    How to identify influential observations

    We can use the list command with the if option:

    list [variables] if lev > value

    We can drop influential observations with the drop command:

    drop if lev > value


    Exercise: ...

    Load the file and estimate the regression equation:

    regress ...

    Now check:

    For outliers

    For influential data

    Using both graphical and numerical tests (a possible approach is sketched below)
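    A possible solution sketch, again using the built-in auto data since the exercise file is not specified (the variable choices are illustrative):

    sysuse auto, clear
    regress price mpg weight foreign
    predict r, rstudent                  // numerical check for outliers
    predict lev, leverage                // numerical check for influence
    list make r lev if abs(r) > 2 | lev > (2*3 + 2)/_N
    avplots                              // graphical check for influential points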


    Can we check for residuals and influence at the same time?

    Cook's D combines information on the residual and the leverage.

    The lowest value that Cook's D can assume is zero, and the higher the Cook's D, the more influential the point.

    The conventional cut-off point is 4/n.


    Cook's D test

    predict d, cooksd

    list [variables] d if d > 4/_N

    (In Stata, _N holds the number of observations, so 4/_N implements the 4/n cutoff.)
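    A concrete usage sketch with the same hypothetical auto-data regression:

    regress price mpg weight
    predict d, cooksd
    list make price d if d > 4/_N    // observations above the 4/n cutoff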


    More Graphical Options

    After OLS, we can use avplots.

    An avplot (added-variable plot) is an attractive graphical method for presenting multiple influential points on a predictor.

    What we are looking for in an avplot are those points that can exert a substantial change on the regression line.


    2. Checking homoscedasticity or HETEROSCEDASTICITY

    When the variance of the residuals is not constant, there is heteroscedasticity; when the variance is constant, it is known as homoscedasticity.

    Heteroscedasticity mostly occurs in cross-sectional data. It can be detected by several graphical or non-graphical methods.

    When we detect heteroscedasticity, the hypothesis tests are invalid because the standard errors are biased, and so are the values of the t and F statistics; hence we cannot interpret them correctly in the presence of heteroscedasticity. (Feng Li, Department of Statistics, Stockholm University)


    2. Checking homoscedasticity or HETEROSCEDASTICITY

    One of the main assumptions of ordinary least squares regression is the homogeneity of the variance of the residuals.

    If the model is well fitted, there should be no pattern in the residuals plotted against the fitted values.

    Heteroscedasticity often arises in cross-sectional data.


    We can use a graphical command or statistical commands.

    Graphical: rvfplot

    Statistical: the Breusch-Pagan test and White's test.

    estat hettest performs the Breusch-Pagan test, and estat imtest, white performs White's test.

    These test the null hypothesis that the variance of the residuals is homogeneous.

    If the p-value is below the chosen significance level (e.g. 0.05), we reject the null hypothesis of constant variance.
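    A minimal sketch of both the graphical and the statistical checks after a hypothetical regression:

    regress price mpg weight
    rvfplot                  // residuals vs. fitted values; look for a fan or funnel shape
    estat hettest            // Breusch-Pagan test; H0: constant variance
    estat imtest, white      // White's test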


    2. Checking homoscedasticity or HETEROSCEDASTICITY

    Heteroscedasticity can also occur when the model is not specified correctly;

    another reason may be a limited number of dependent variables, or when the reliability of an independent variable is somehow linked with two or more dependent variables (Hayes and Cai, 2007).


    2. Checking homoscedasticity or HETEROSCEDASTICITY

    Graphically, when there are deviations from the central line, there is a problem of heteroscedasticity.

    There are several tests for detecting heteroscedasticity; the one used in this research is the Breusch-Pagan test.

    This test checks the null hypothesis that the variance is constant.

    The estat hettest command is used in Stata to check for heteroscedasticity.

    If the p-value is small, we reject the null hypothesis and accept the alternative, which means the variance is not constant and there is heteroscedasticity.


    2. Checking homoscedasticity or HETEROSCEDASTICITY

    The robust option is then used to control for heteroscedasticity, outliers, and other influential observations:

    xtreg [dependent variable] [independent variables], fe robust

    fe specifies that a fixed-effects model has been selected, and the robust option is used so that the standard errors remain valid in the presence of heteroscedasticity, outliers, and other influential observations in the data.
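    A minimal sketch, assuming a panel identified by id and year and hypothetical regressors:

    xtset id year
    xtreg debt tax profit, fe vce(robust)   // fixed effects with heteroskedasticity-robust standard errors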


    MULTICOLLINEARITY

    The term multicollinearity was first used by Powel Ciompa.


    MULTICOLLINEARITY

    "The name given to the general problem which arises when some or all of the explanatory variables in a relation are so highly correlated with one another that it becomes very difficult, if not impossible, to disentangle their separate influences and obtain a reasonably precise estimate of their relative effects."

    As multicollinearity increases, the key concern is a sudden boost in the standard errors of the coefficients, which reduces the reliability of the model. The t-statistics become smaller in the case of higher multicollinearity, making it difficult to reject the null hypothesis.


    Multicollinearity

    What happens when two or more variables are highly correlated?

    When there is a perfect linear relationship among the predictors, the estimates for a regression model cannot be uniquely computed.

    The primary concern is that as the degree of multicollinearity increases, the standard errors for the coefficients can become wildly inflated.


    M%LT"COLL"&E!R"T#

    t commonly results in misleadingand confusing conclusions. -ne ofthe reasons for multicollinearity

    might be the use of inappropriatedummy variable.

    Tests 'or


    Tests for MULTICOLLINEARITY

    Calculate the correlation coefficient

    The easiest way to detect multicollinearity is to calculate the correlation between pairs of independent variables. If a correlation is 1 or -1, the researcher should remove one of the two correlated variables from the model.

    A scatter diagram between independent variables will also give some indication of a multicollinearity problem.

    Variance Inflation Factor (VIF)

    In Stata, the variance inflation factor (VIF) is used to measure the degree of multicollinearity among the variables. One can simply enter the command "vif" after running the regression analysis on the data. If the value of VIF is greater than or equal to 10, there is a problem of multicollinearity in the data.

    VIF can also be computed as VIF = 1 / (1 − R²),

    where R² is the coefficient of determination from regressing the explanatory variable in question on the remaining explanatory variables.
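    A minimal sketch (the regression and variable names are hypothetical):

    regress debt tax profit tang
    estat vif                    // VIF >= 10 suggests a multicollinearity problem
    correlate tax profit tang    // pairwise correlations among the regressors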


    How to Deal with Multicollinearity

    Drop one of the two variables that are linearly correlated with one another.

    Which variable should be dropped? The decision is based on ...


    Checking Normality of Residuals

    After we run a regression analysis, we can use the predict command to create the residuals and then use further commands to check their normality both graphically and numerically.

    Graphical: kdensity, qnorm, and pnorm to check the normality of the residuals.

    Numerical: iqr and swilk.

    After the regression, we use the predict command to generate the residuals:

    predict [residual variable], resid


    Checking Normality of Residuals

    Below we use the kdensity command to produce a kernel density plot, with the normal option requesting that a normal density be overlaid on the plot.

    kdensity stands for kernel density estimate. It can be thought of as a histogram with narrow bins and a moving average.

    kdensity [residual variable], normal


    Checking Normality of Residuals

    The pnorm command graphs a standardized normal probability (P-P) plot, while qnorm plots the quantiles of a variable against the quantiles of a normal distribution.

    pnorm is sensitive to non-normality in the middle range of the data, and qnorm is sensitive to non-normality near the tails.

    If neither plot shows marked departures, we can accept that the residuals are close to a normal distribution.

    pnorm [residual variable]

    qnorm [residual variable]


    Checking Normality of Residuals

    There are also numerical tests of normality.

    iqr stands for inter-quartile range and assumes symmetry of the distribution.

    Severe outliers consist of those points that are either 3 inter-quartile ranges below the first quartile or 3 inter-quartile ranges above the third quartile. The presence of any severe outliers should be sufficient evidence to reject normality at a 5% significance level.

    Mild outliers are common in samples of any size. In our case, we don't have any severe outliers and the distribution seems fairly symmetric, so the residuals have an approximately normal distribution.


    Checking Normality of Residuals

    Another test available is the swilk test, which performs the Shapiro-Wilk test for normality.

    The p-value is based on the assumption that the distribution is normal (the null hypothesis).

    If the p-value is greater than the significance level (e.g. 0.05), we fail to reject the null hypothesis and conclude that the residuals are approximately normally distributed.
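    A minimal end-to-end sketch of the normality checks (the regression and variable names are hypothetical):

    regress price mpg weight
    predict e, resid        // residuals
    kdensity e, normal      // kernel density with a normal overlay
    pnorm e                 // sensitive to the middle of the distribution
    qnorm e                 // sensitive to the tails
    swilk e                 // Shapiro-Wilk test; H0: the residuals are normal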