[Cover graphics: pie chart of Percent vs. Type (pizza toppings); boxplot, histogram, empirical CDF, and normal probability plot of Listing (mean 369,687, st. dev. 156,865, N = 51, AD = 0.994, p-value = 0.012); scatterplot and marginal plot of Listing vs. IncomePC]

Statistical Inference and Regression Analysis: GB.3302.30


Statistical Inference and Regression Analysis: GB.3302.30
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics

Inference and Regression: Perfect Collinearity

Perfect Multicollinearity
If X does not have full rank, then at least one column can be written as a linear combination of the other columns. X′X does not have full rank and cannot be inverted, so b cannot be computed.

Multicollinearity

Enhanced Monet Area Effect Model: Height and Width Effects
log(Price) = β1 + β2 log Area + β3 log AspectRatio + β4 log Height + β5 Signature + ε    (Aspect Ratio = Height/Width)

Short Rank X


X1 = 1, X2 = logArea, X3 = logAspect, X4 = logHeight, X5 = Signature

X2 = logH + logW
X3 = logH - logW
X4 = logH
X5 = Signature
x2 + x3 - 2x4 = (logH + logW) + (logH - logW) - 2 logH = 0,
so X4 = ½X2 + ½X3. The linear combination is c = [0, 1, 1, -2, 0].
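As a quick illustration (not from the slides; hypothetical data, numpy only), the following sketch builds an X matrix with exactly this structure and confirms that the columns are linearly dependent, so X′X cannot be inverted:

import numpy as np

rng = np.random.default_rng(0)
n = 100
logH = rng.normal(1.0, 0.5, n)            # log height
logW = rng.normal(1.0, 0.5, n)            # log width
sig = rng.integers(0, 2, n)               # signature dummy
X = np.column_stack([np.ones(n),          # X1 = 1
                     logH + logW,         # X2 = logArea
                     logH - logW,         # X3 = logAspect
                     logH,                # X4 = logHeight
                     sig])                # X5 = Signature

c = np.array([0.0, 1.0, 1.0, -2.0, 0.0])
print(np.allclose(X @ c, 0))              # True: exact linear dependence Xc = 0
print(np.linalg.matrix_rank(X))           # 4, not 5, so X'X is singular
print(np.linalg.cond(X.T @ X))            # enormous condition number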

Inference and Regression: Least Squares Fit

Minimizing e′e
b minimizes e′e = (y - Xb)′(y - Xb). Any other coefficient vector has a larger sum of squares. (Least squares is least squares.)
A quick proof: let d = any vector other than b, and let u = y - Xd. Then
u′u = (y - Xd)′(y - Xd) = [y - Xb - X(d - b)]′[y - Xb - X(d - b)] = [e - X(d - b)]′[e - X(d - b)].
Expanding, and using X′e = 0, u′u = e′e + (d - b)′X′X(d - b) > e′e.
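A small numerical check of this result (hypothetical data, numpy only): perturbing the least squares coefficients in any direction increases the sum of squared residuals.

import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares coefficients
e = y - X @ b
d = b + np.array([0.1, -0.2, 0.3])         # any other coefficient vector
u = y - X @ d
print(e @ e, u @ u)                        # u'u = e'e + (d-b)'X'X(d-b) > e'e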

Dropping a Variable
An important special case: compare the results we get with and without a variable z in the equation, in addition to the other variables in X. Two results follow from the previous result:

1. Dropping a variable(s) cannot improve the fit, that is, cannot reduce the sum of squares. The relevant d is (*, *, …, *, 0), i.e., some vector that has a zero in a particular place.

2. Adding a variable(s) cannot degrade the fit, that is, cannot increase the sum of squares. Compare the sum of squares when there is a zero in that location to the case where the vector is unrestricted; just reverse the cases above.

The Fit of the Regression
Variation: in the context of the model, we speak of the variation of a variable as movement of that variable, usually associated with (though not necessarily caused by) movement of another variable.

Decomposing the Variation of y
Total Sum of Squares = Regression Sum of Squares (SSR) + Residual Sum of Squares (SSE)

Decomposing the Variation

A Fit Measure
R² = SSR / Total SS = 1 - SSE / Total SS

(Very important result.) R² is bounded by zero and one if and only if (a) there is a constant term in X, and (b) the line is computed by linear least squares.

Understanding R²
R² = the squared correlation between y and the prediction of y given by the regression.
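A sketch of that equivalence (hypothetical data, numpy only): with a constant term in the regression, R² computed from the sums of squares equals the squared correlation between y and the fitted values.

import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b
r2_fit = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
r2_corr = np.corrcoef(y, yhat)[0, 1] ** 2
print(r2_fit, r2_corr)                     # identical up to rounding error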

Regression Results
-----------------------------------------------------------------------------
Ordinary least squares regression
LHS=BOX      Mean                =   20.72065
             Standard deviation  =   17.49244
             No. of observations =         62     DegFreedom    Mean square
Regression   Sum of Squares      =    9203.46          2        4601.72954
Residual     Sum of Squares      =    9461.66         59         160.36711
Total        Sum of Squares      =    18665.1         61         305.98555
             Standard error of e =   12.66361     Root MSE        12.35344
Fit          R-squared           =     .49308     R-bar squared     .47590
Model test   F[ 2, 59]           =   28.69497     Prob F > F*       .00000
--------+--------------------------------------------------------------------
        |                 Standard           Prob.       95% Confidence
     BOX| Coefficient       Error       t   |t|>T*          Interval
--------+--------------------------------------------------------------------
Constant|  -12.0721**      5.30813   -2.27   .0266      -22.4758   -1.6684
CNTWAIT3|   53.9033***    12.29513    4.38   .0000       29.8053   78.0013
  BUDGET|    .12740***      .04492    2.84   .0062        .03936    .21544
--------+--------------------------------------------------------------------

Adding Variables
R² never falls when a variable z is added to the regression. A useful general result:

Adding Variables to a Model
What is the effect of adding PN, PD, PS, YEAR to the model (one at a time)?
----------------------------------------------------------------------
Ordinary least squares regression
LHS=G        Mean                =  226.09444
             Standard deviation  =   50.59182
             Number of observs.  =         36
Model size   Parameters          =          3
             Degrees of freedom  =         33
Residuals    Sum of squares      = 1472.79834
Fit          R-squared           =     .98356
             Adjusted R-squared  =     .98256
Model test   F[ 2, 33] (prob)    = 987.1 (.0000)
Effects of additional variables on the regression below:
Variable   Coefficient   New R-sqrd   Chg.R-sqrd   Partial-Rsq   Partial F
PD          -26.0499       .9867        .0031        .1880         7.411
PN          -15.1726       .9878        .0043        .2594        11.209
PS           -8.2171       .9890        .0055        .3320        15.904
YEAR         -2.1958       .9861        .0025        .1549         5.864
--------+-------------------------------------------------------------
Variable|  Coefficient   Standard Error   t-ratio   P[|T|>t]   Mean of X
--------+-------------------------------------------------------------
Constant|  -79.7535***       8.67255       -9.196     .0000
      PG|  -15.1224***       1.88034       -8.042     .0000      2.31661
       Y|     .03692***       .00132       28.022     .0000      9232.86
--------+-------------------------------------------------------------

Adjusted R-Squared
Adjusted R² ("adjusted for degrees of freedom").

It includes a penalty for variables that don't add much fit, and it can fall when a variable is added to the equation.

Regression Results (same regression as shown above): R-squared = .49308, R-bar squared = .47590.

Adjusted R-Squared
As we will discover when we study regression with more than one variable, a researcher can increase R² just by adding variables to a model, even if those variables do not really explain y or have any real relationship to it at all. To have a fit measure that accounts for this, adjusted R² is a number that increases with the fit but decreases with the number of variables.
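For reference, the usual degrees-of-freedom adjustment (with K = the number of estimated coefficients, including the constant) is
Adjusted R² = 1 - (1 - R²)(N - 1)/(N - K).
With N = 62, K = 3, and R² = .49308 from the output above, this gives 1 - (.50692)(61/59) = .47590, which matches the reported R-bar squared.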

Notes About Adjusted R²

Inference and Regression: Transformed Data

Linear Transformations of Data
Change units of measurement by dividing every observation: e.g., $ to millions of $ (see the internet buzz regression) by dividing Box by 1,000,000.
Change the meaning of variables:
x = (x1 = nominal interest = i, x2 = inflation = dp, x3 = GDP)
z = (x1 - x2 = real interest = i - dp, x2 = inflation = dp, x3 = GDP)
Change the theory of art appreciation:
x = (x1 = logHeight, x2 = logWidth, x3 = signature)
z = (x1 - x2 = logAspectRatio, x2 = logHeight, x3 = signature)

(Linearly) Transformed Data
How does a linear transformation affect the results of least squares? Z = XP for a K×K nonsingular P. (Each variable in Z is a combination of the variables in X.)
Based on X, b = (X′X)⁻¹X′y. You can show (just multiply it out) that the coefficients when y is regressed on Z are c = P⁻¹b.
The fitted values are Zc = XPP⁻¹b = Xb. The same!
The residuals from using Z are y - Zc = y - Xb (we just proved this). The same!
The sum of squared residuals must be identical, since y - Xb = e = y - Zc.
R² must also be identical, since R² = 1 - e′e/(same total SS).
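A numerical sketch of the invariance result (hypothetical data, numpy only; P is any nonsingular K×K matrix):

import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)
P = np.array([[1.0, 0.0, 0.0],
              [0.3, 1.0, 0.0],
              [0.0, 0.2, 1.0]])            # nonsingular K x K transformation
Z = X @ P

b = np.linalg.lstsq(X, y, rcond=None)[0]
c = np.linalg.lstsq(Z, y, rcond=None)[0]
print(np.allclose(c, np.linalg.solve(P, b)))   # c = P^{-1} b
print(np.allclose(Z @ c, X @ b))               # same fitted values, so same residuals and R^2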

Principal Components
Z = XC
Fewer columns than X
Includes as much of the variation of X as possible
Columns of Z are orthogonal

Why do we do this?
Collinearity
To combine variables of ambiguous identity, such as test scores as measures of ability
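A minimal principal components sketch using the singular value decomposition (hypothetical data; this is one standard way to construct Z = XC and is not necessarily the computation behind the slides):

import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

Xc = X - X.mean(axis=0)                    # center the columns
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
C = Vt.T[:, :2]                            # weights for the first two components
Z = Xc @ C                                 # fewer columns than X
print(np.round(Z.T @ Z, 6))                # off-diagonals ~ 0: columns of Z are orthogonal
print(s**2 / np.sum(s**2))                 # share of the variation of X captured by each component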

+----------------------------------------------------+
| Ordinary least squares regression                   |
| LHS=LOGBOX    Mean                =   16.47993      |
|               Standard deviation  =   .9429722      |
|               Number of observs.  =         62      |
| Residuals     Sum of squares      =   20.54972      |
|               Standard error of e =   .6475971      |
| Fit           R-squared           =   .6211405      |
|               Adjusted R-squared  =   .5283586      |
+----------------------------------------------------+
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |t-ratio |P[|T|>t]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant|  12.5388***        .98766      12.695    .0000             |
|LOGBUDGT|   .23193           .18346       1.264    .2122     3.71468 |
|STARPOWR|   .00175           .01303        .135    .8935     18.0316 |
|SEQUEL  |   .43480           .29668       1.466    .1492      .14516 |
|MPRATING|  -.26265*          .14179      -1.852    .0700     2.96774 |
|ACTION  |  -.83091***        .29297      -2.836    .0066      .22581 |
|COMEDY  |  -.03344           .23626       -.142    .8880      .32258 |
|ANIMATED|  -.82655**         .38407      -2.152    .0363      .09677 |
|HORROR  |   .33094           .36318        .911    .3666      .09677 |
|  4 INTERNET BUZZ VARIABLES                                          |
|LOGADCT |   .29451**         .13146       2.240    .0296     8.16947 |
|LOGCMSON|   .05950           .12633        .471    .6397     3.60648 |
|LOGFNDGO|   .02322           .11460        .203    .8403     5.95764 |
|CNTWAIT3|  2.59489***        .90981       2.852    .0063      .48242 |
+--------+------------------------------------------------------------+

+----------------------------------------------------+
| Ordinary least squares regression                   |
| LHS=LOGBOX    Mean                =   16.47993      |
|               Standard deviation  =   .9429722      |
|               Number of observs.  =         62      |
| Residuals     Sum of squares      =   25.36721      |
|               Standard error of e =   .6984489      |
| Fit           R-squared           =   .5323241      |
|               Adjusted R-squared  =   .4513802      |
+----------------------------------------------------+
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |t-ratio |P[|T|>t]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant|  11.9602***        .91818      13.026    .0000             |
|LOGBUDGT|   .38159**         .18711       2.039    .0465     3.71468 |
|STARPOWR|   .01303           .01315        .991    .3263     18.0316 |
|SEQUEL  |   .33147           .28492       1.163    .2500      .14516 |
|MPRATING|  -.21185           .13975      -1.516    .1356     2.96774 |
|ACTION  |  -.81404**         .30760      -2.646    .0107      .22581 |
|COMEDY  |   .04048           .25367        .160    .8738      .32258 |
|ANIMATED|  -.80183*          .40776      -1.966    .0546      .09677 |
|HORROR  |   .47454           .38629       1.228    .2248      .09677 |
|PCBUZZ  |   .39704***        .08575       4.630    .0000     9.19362 |
+--------+------------------------------------------------------------+

Inference and Regression: Model Building and Functional Form

Using Logs

Time Trends in Regression
y = α + β1·x + β2·t + ε;  β2 is the period-to-period increase not explained by anything else.
log y = α + β1·log x + β2·t + ε (not log t, just t);  100·β2 is the period-to-period % increase not explained by anything else.


U.S. Gasoline Market: Price and Income Elasticities; Downward Trend in Gasoline Usage

Application: Health Care Data
German Health Care Usage Data: there are altogether 27,326 observations on German households, 1984-1994.

DOCTOR   = 1(number of doctor visits > 0)
HOSPITAL = 1(number of hospital visits > 0)
HSAT     = health satisfaction, coded 0 (low) to 10 (high)
DOCVIS   = number of doctor visits in last three months
HOSPVIS  = number of hospital visits in last calendar year
PUBLIC   = insured in public health insurance = 1; otherwise = 0
ADDON    = insured by add-on insurance = 1; otherwise = 0
INCOME   = household nominal monthly net income in German marks / 10000
HHKIDS   = children under age 16 in the household = 1; otherwise = 0
EDUC     = years of schooling
FEMALE   = 1(female headed household)
AGE      = age in years
MARRIED  = marital status

Dummy Variable
D = 0 in one case and 1 in the other.
Y = a + bX + cD + e
When D = 0, E[Y|X] = a + bX
When D = 1, E[Y|X] = (a + c) + bX
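A small sketch of the intercept-shift interpretation (hypothetical data, numpy only): the coefficient on D estimates the gap c between the two groups' regression lines.

import numpy as np

rng = np.random.default_rng(5)
n = 300
x = rng.normal(size=n)
d = rng.integers(0, 2, n)                  # dummy variable: 0 or 1
y = 2.0 + 1.5 * x + 3.0 * d + rng.normal(size=n)

X = np.column_stack([np.ones(n), x, d])
a_hat, b_hat, c_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(a_hat, b_hat, c_hat)                 # intercept a for D=0, slope b, shift c (intercept a+c for D=1)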

A Conspiracy Theory for Art Sales at Auction

Sotheby's and Christie's, from 1995 to about 2000, conspired on commission rates.

If the Theory is Correct

Sold from 1995 to 2000 vs. sold before 1995 or after 2000

Evidence: Two Dummy Variables
Signature and Conspiracy Effects

The statistical evidence seems to be consistent with the theory.

Set of Dummy Variables
Usually, Z = Type = 1, 2, …, K.
Y = a + bX + d1·1(Type=1) + d2·1(Type=2) + … + dK·1(Type=K)

A Set of Dummy Variables
A complete set of dummy variables divides the sample into groups. Fit the regression with group effects. You need to drop one (any one) of the variables to compute the regression. (Avoid the dummy variable trap.)

Group Effects in Teacher Ratings

Rankings of 132 U.S. Liberal Arts Colleges

Reputation = β0 + β1 Religious + β2 GenderEcon + β3 EconFac + β4 North + β5 South + β6 Midwest + β7 West + ε
(Nancy Burnett: Journal of Economic Education, 1998)

Minitab does not like this model.

Too many dummy variables cause perfect multicollinearity.
If we use all four region dummies:
Reputation = a + bn + ε if North
Reputation = a + bm + ε if Midwest
Reputation = a + bs + ε if South
Reputation = a + bw + ε if West
Only three are needed, so Minitab dropped West:
Reputation = a + bn + ε if North
Reputation = a + bm + ε if Midwest
Reputation = a + bs + ε if South
Reputation = a + ε if West

Unordered Categorical Variables

House price data (fictitious):
Type 1 = Split level
Type 2 = Ranch
Type 3 = Colonial
Type 4 = Tudor
Use 3 dummy variables for this kind of data (not all 4). Using the variable STYLE itself in the model makes no sense; you could change the numbering scale any way you like. 1, 2, 3, 4 are just labels.
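A sketch of the recoding with pandas (hypothetical data; drop_first=True drops the first listed category, here Split level, so it becomes the omitted base group and the dummy variable trap is avoided):

import pandas as pd

styles = pd.Series(
    pd.Categorical(["Split level", "Ranch", "Colonial", "Tudor", "Ranch"],
                   categories=["Split level", "Ranch", "Colonial", "Tudor"]),
    name="STYLE")
dummies = pd.get_dummies(styles, prefix="TYPE", drop_first=True)
print(dummies)     # 3 dummy columns (Ranch, Colonial, Tudor); Split level is the base category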

Transform Style to Types

Hedonic House Price Regression

Each of these is relative to a Split Level, since that is the omitted category. E.g., the price of a Ranch house is $74,369 less than a Split Level of the same size with the same number of bedrooms.


We used McDonald's per capita.

More Movie Madness
McDonald's and Movies (Craig, Douglas, Greene: International Journal of Marketing)

Log Foreign Box Office(movie, country, year) = α + β1·LogBox(movie, US, year) + β2·LogPCIncome + β4·LogMacsPC + GenreEffect + CountryEffect + ε

Movie Madness Data (n = 2,198)

Macs and Movies: Countries and Some of the Data
Code  Country     Pop (mm)   Per cap Income   # of McDonald's   Language
1     Argentina       37         12090              173         Spanish
2     Chile           15          9110               70         Spanish
3     Spain           39         19180              300         Spanish
4     Mexico          98          8810              270         Spanish
5     Germany         82         25010             1152         German
6     Austria          8         26310              159         German
7     Australia       19         25370              680         English
8     UK              60         23550             1152         English
Genres (MPAA): 1=Drama, 2=Romance, 3=Comedy, 4=Action, 5=Fantasy, 6=Adventure, 7=Family, 8=Animated, 9=Thriller, 10=Mystery, 11=Science Fiction, 12=Horror, 13=Crime

CRIME is the left-out genre. AUSTRIA is the left-out country. Australia and the UK were left out for other reasons (an algebraic problem with only 8 countries).

Functional Form: Quadratic
Y = a + b1X + b2X² + e

dE[Y|X]/dX = b1 + 2b2X
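A short sketch of how the marginal effect changes with X under the quadratic specification (b1 and b2 are hypothetical coefficients, used only to evaluate the derivative formula):

import numpy as np

b1, b2 = 2.0, -0.05                        # hypothetical coefficients in Y = a + b1*X + b2*X^2 + e
X_values = np.array([5.0, 10.0, 20.0, 40.0])
marginal_effect = b1 + 2 * b2 * X_values   # dE[Y|X]/dX = b1 + 2*b2*X
print(dict(zip(X_values.tolist(), marginal_effect.tolist())))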


Interaction Effect
Y = a + b1X + b2Z + b3X·Z + e
E.g., the benefit of a year of education depends on how old one is:
log(Income) = a + b1·Ed + b2·Ed² + b3·Ed·Age + e
d log(Income)/dEd = b1 + 2b2·Ed + b3·Age
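A sketch of the education-age interaction calculation. The coefficient values below are hypothetical, chosen only so that the results echo the 6.8% and 7.2% figures quoted on the next slide; they are not the estimates behind those figures.

b1, b2, b3 = 0.048, 0.0005, 0.0002         # hypothetical coefficients
ed = 16                                    # years of education at which the effect is evaluated
for age in (20, 40):
    effect = b1 + 2 * b2 * ed + b3 * age   # d log(Income)/dEd = b1 + 2*b2*Ed + b3*Age
    print(age, round(effect, 3))           # 0.068 at age 20, 0.072 at age 40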

The effect of an additional year of education increases from about 6.8% at age 20 to 7.2% at age 40.

Statistics and Data Analysis: Properties of Least Squares

Terms of Art
Estimates and estimators
Properties of an estimator: the sampling distribution
Finite sample properties, as opposed to asymptotic or large-sample properties

Least Squares

Deriving the Properties of b
b = (X′X)⁻¹X′y = β + (X′X)⁻¹X′ε. So, b = the parameter vector plus a linear combination of the disturbances, each times a vector.

Therefore, b is a vector of random variables. We analyze it as such. We do the analysis conditional on an X, then show that the results do not depend on the particular X in hand, so the result must be general, i.e., independent of X.

Unbiasedness of b

Left Out Variable Bias
A crucial result about specification: two sets of variables in the regression, X1 and X2.

y = X1β1 + X2β2 + ε

What if the regression is computed without the second set of variables? What is the expectation of the "short" regression estimator?

b1 = (X1′X1)⁻¹X1′y

The Left Out Variable Formula
E[b1] = β1 + (X1′X1)⁻¹X1′X2β2

The (truly) short regression estimator is biased. Application:

Quantity = β1·Price + β2·Income + ε

If you regress Quantity on Price and leave out Income, what do you get?

Application: Left Out Variable
Leave out Income. What do you get?

In time series data, β1 < 0 and β2 > 0 (usually), and Cov[Price, Income] > 0. So the short regression will overestimate the price coefficient.
Simple regression of G on a constant and PG: the price coefficient should be negative.
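A simulation sketch of that sign prediction (fully synthetic data: β1 < 0, β2 > 0, and Price positively correlated with Income):

import numpy as np

rng = np.random.default_rng(6)
n = 5000
income = rng.normal(100, 10, n)
price = 0.05 * income + rng.normal(0, 1, n)              # Cov[Price, Income] > 0
qty = -2.0 * price + 0.1 * income + rng.normal(0, 1, n)  # beta1 < 0, beta2 > 0

X_long = np.column_stack([np.ones(n), price, income])
X_short = np.column_stack([np.ones(n), price])
b_long = np.linalg.lstsq(X_long, qty, rcond=None)[0]
b_short = np.linalg.lstsq(X_short, qty, rcond=None)[0]
print(b_long[1])    # close to the true -2.0
print(b_short[1])   # biased upward: noticeably less negative than -2.0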

Estimated Demand Equation
Shouldn't the Price Coefficient be Negative?

Multiple Regression of G on Y and PG. The Theory Works!
----------------------------------------------------------------------
Ordinary least squares regression
LHS=G        Mean                =  226.09444
             Standard deviation  =   50.59182
             Number of observs.  =         36
Model size   Parameters          =          3
             Degrees of freedom  =         33
Residuals    Sum of squares      = 1472.79834
             Standard error of e =    6.68059
Fit          R-squared           =     .98356
             Adjusted R-squared  =     .98256
Model test   F[ 2, 33] (prob)    = 987.1 (.0000)
--------+-------------------------------------------------------------
Variable|  Coefficient   Standard Error   t-ratio   P[|T|>t]   Mean of X
--------+-------------------------------------------------------------
Constant|  -79.7535***       8.67255       -9.196     .0000
       Y|     .03692***       .00132       28.022     .0000      9232.86
      PG|  -15.1224***       1.88034       -8.042     .0000      2.31661
--------+-------------------------------------------------------------

Specification Errors (1)
Omitting relevant variables: suppose the correct model is y = X1β1 + X2β2 + ε, i.e., two sets of variables. Compute least squares omitting X2. Some easily proved results:

Var[b1] is smaller than Var[b1.2]. You get a smaller variance when you omit X2. (One interpretation: omitting X2 amounts to using the extra information β2 = 0. Even if the information is wrong (see the next result), it reduces the variance. This is an important result.)

Specification Errors (2)
Including superfluous variables: just reverse the results.

Including superfluous variables increases variance. (The cost of not using information.)

It does not cause a bias, because if the variables in X2 are truly superfluous, then β2 = 0, so E[b1.2] = β1.

Inference and Regression: Estimating Var[b|X]

Variance of the Least Squares Estimator

Gauss-Markov Theorem
A theorem of Gauss and Markov: least squares is the minimum variance linear unbiased estimator.
1. Linear estimator
2. Unbiased: E[b|X] = β
Comparing positive definite matrices: Var[c|X] - Var[b|X] is nonnegative definite for any other linear and unbiased estimator c.

True Variance of b|X

Estimating σ²
Using the residuals instead of the disturbances. The natural estimator: e′e/N as a sample surrogate for ε′ε/N.
Imperfect observation of εi: εi = ei + (β - b)′xi.
Downward bias of e′e/N. We obtain the result E[e′e|X] = (N - K)σ².

Expectation of e′e

Expected Value of e′e:

Estimating σ²
The unbiased estimator is s² = e′e/(N - K).

N-K = Degrees of freedom correction

Var[b|X]
Estimating the covariance matrix for b|X: the true covariance matrix is σ²(X′X)⁻¹. The natural estimator is s²(X′X)⁻¹. Standard errors of the individual coefficients are the square roots of the diagonal elements.
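A sketch of the whole computation (hypothetical data, numpy only):

import numpy as np

rng = np.random.default_rng(7)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)      # least squares coefficients
e = y - X @ b
s2 = e @ e / (n - k)                       # s^2 = e'e/(N-K), unbiased for sigma^2
cov_b = s2 * np.linalg.inv(X.T @ X)        # estimated Var[b|X] = s^2 (X'X)^{-1}
std_err = np.sqrt(np.diag(cov_b))          # standard errors of the coefficients
print(b)
print(std_err)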

X′X, (X′X)⁻¹, and s²(X′X)⁻¹ [numerical matrices displayed on the slide]

Regression Results
----------------------------------------------------------------------
Ordinary least squares regression
LHS=G        Mean                =  226.09444
             Standard deviation  =   50.59182
             Number of observs.  =         36
Model size   Parameters          =          7
             Degrees of freedom  =         29
Residuals    Sum of squares      =  778.70227
             Standard error of e =    5.18187
--------+-------------------------------------------------------------
Variable|  Coefficient   Standard Error   t-ratio   P[|T|>t]   Mean of X
--------+-------------------------------------------------------------
Constant|   -7.73975        49.95915        -.155     .8780
      PG|  -15.3008***       2.42171       -6.318     .0000      2.31661
       Y|     .02365***       .00779        3.037     .0050      9232.86
   TREND|    4.14359**        1.91513        2.164     .0389      17.5000
     PNC|    15.4387         15.21899        1.014     .3188      1.67078
     PUC|   -5.63438          5.02666       -1.121     .2715      2.34364
     PPT|  -12.4378**         5.20697       -2.389     .0236      2.74486
--------+-------------------------------------------------------------
Create  ; trend = year - 1960 $
Namelist; x = one, pg, y, trend, pnc, puc, ppt $
Regress ; lhs = g ; rhs = x $

Inference and Regression: Not Perfect Collinearity

Variance Inflation and Multicollinearity
When variables are highly but not perfectly correlated, least squares is difficult to compute accurately, and the variances of the least squares slopes become very large.
Variance inflation factors: for each xk, VIF(k) = 1/[1 - R²(k)], where R²(k) is the R² in the regression of xk on all the other x variables in the data matrix.
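A sketch of the VIF computation, regressing each xk on the other regressors (hypothetical data, numpy only; the constant is included in each auxiliary regression but gets no VIF of its own):

import numpy as np

def vif(X):
    # VIF(k) = 1 / (1 - R2(k)), where R2(k) comes from regressing x_k on the other columns
    n, K = X.shape
    out = []
    for k in range(K):
        xk = X[:, k]
        others = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        bk = np.linalg.lstsq(others, xk, rcond=None)[0]
        resid = xk - others @ bk
        r2 = 1 - resid @ resid / np.sum((xk - xk.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(8)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)      # nearly collinear with x1
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))  # huge VIFs for x1 and x2, VIF near 1 for x3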

NIST Statistical Reference Data Sets: Accuracy Tests

The Filipelli Problem

VIF for X10: R² = .99999999999999630, VIF = .27294543196184830D+15

Other software: Minitab reports the correct answer; Stata drops X10.

Accurate and Inaccurate Computation of Filipelli Results
Accurate computation requires not actually computing (X′X)⁻¹. We (and others) use the QR method. See the text for details.

Inference and Regression: Testing Hypotheses

Testing Hypotheses

Hypothesis Testing: Criteria

The F Statistic has an F Distribution

Nonnormality or Large N
The denominator of F converges to 1. The numerator converges to chi-squared[J]/J. Rely on the law of large numbers for the denominator and the CLT for the numerator: J·F → chi-squared[J]. Use critical values from the chi-squared distribution.

Significance of the Regression: R*² = 0

Table of 95% Critical Values for F

(The regression with PCBUZZ, repeated from above: R-squared = .5323241; the regression with the four separate internet buzz variables: R-squared = .6211405.)
F = [(.6211405 - .5323241)/3] / [(1 - .6211405)/(62 - 13)] = 3.829;  F* = 2.84
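A quick check of that arithmetic, with the R² values taken from the two regression outputs above (J = 3 restrictions, N - K = 62 - 13 = 49):

r2_unrestricted, r2_restricted = .6211405, .5323241
J, N, K = 3, 62, 13
F = ((r2_unrestricted - r2_restricted) / J) / ((1 - r2_unrestricted) / (N - K))
print(round(F, 3))    # about 3.83, which exceeds the 5% critical value F* = 2.84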

Inference and Regression: A Case Study

Mega Deals for Stars
A capital budgeting computation: costs and benefits.
Certainty: costs. Uncertainty: benefits. Long term: need for discounting.

Baseball Story: A Huge Sports Contract
Alex Rodriguez was hired by the Texas Rangers for something like $25 million per year in 2000.
Costs: the salary, plus and minus some fine tuning of the numbers.
Benefits: more fans in the stands.
How to determine if the benefits exceed the costs? Use a regression model.

The Texas Deal for Alex Rodriguez
2001 Signing Bonus = $10M
2001   21
2002   21
2003   21
2004   21
2005   25
2006   25
2007   27
2008   27
2009   27
2010   27
Total: $252M ???

The Real Deal
Year    Salary   Bonus   Deferral
2001      21       2     5 to 2011
2002      21       2     4 to 2012
2003      21       2     3 to 2013
2004      21       2     4 to 2014
2005      25       2     4 to 2015
2006      25             4 to 2016
2007      27             3 to 2017
2008      27             3 to 2018
2009      27             3 to 2019
2010      27             5 to 2020
Deferrals accrue interest of 3% per year.

Costs
Insurance: about 10% of the contract per year.
(Taxes: about 40% of the contract.)
Some additional costs in revenue sharing revenues from the league (anticipated, about 17.5% of marginal benefits; uncertain).
Interest on deferred salary: $150,000 in the first year, well over $1,000,000 in 2010.
(Reduction) The $3M it would cost to have a different shortstop (Nomar Garciaparra).

PDV of the Costs
Using an 8% discount factor (the one they used) and accounting for all costs: roughly $21M to $28M in each year from 2001 to 2010, then the deferred payments from 2010 to 2020.
Total costs: about $165 million in 2001 (present discounted value).

Benefits
More fans in the seats: gate, parking, merchandise.
Increased chance at the playoffs and World Series.
Sponsorships.
(Loss to revenue sharing.)
Franchise value.

How Many New Fans?
Projected 8 more wins per year. What is the relationship between wins and attendance? It is not known precisely; there are many empirical studies (see the Journal of Sports Economics). Use a regression model to find out.

Baseball Data
31 teams, 17 years (fewer years for 6 teams):
Winning percentage: Wins = 162 × percentage
Rank
Average attendance: Attendance = 81 × Average
Average team salary
Number of All-Stars
Manager years of experience
Percent of team that is rookies
Lineup changes
Mean player experience
Dummy variable for change in manager

Baseball Data (Panel Data)

A Dynamic Equation


About 220,000 fans

Marginal Value of One More Win

The Regression Model

Marginal Value of One Win

Marginal Value of an A-Rod
8 games × 63,734 fans ≈ 509,878 fans
509,878 fans × ($18 per ticket + $2.50 parking etc. + $1.80 stuff (hats, bobble-head dolls, …))
≈ $11.3 million per year!!!!! It's not close. (The marginal cost is at least $16.5M / year.)
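A check of that back-of-the-envelope arithmetic, using the figures quoted on the slide:

wins, fans_per_win = 8, 63_734
new_fans = wins * fans_per_win                       # roughly 510,000 new fans
per_fan = 18.00 + 2.50 + 1.80                        # ticket + parking + merchandise, per fan
print(new_fans, new_fans * per_fan / 1e6)            # about $11.4M, in line with the slide's rough $11.3M per year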
