
Page 1

DSCI 5180: Introduction to the Business Decision Process

Spring 2013 – Dr. Nick Evangelopoulos

Lectures 5-6: Simple Regression Analysis (Ch. 3)

Page 2

Chapter 3: Simple Regression Analysis
(Part 2)

Terry Dielman, Applied Regression Analysis:
A Second Course in Business and Economic Statistics, fourth edition

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Page 3

3.4 Assessing the Fit of the Regression Line

In some problems, it may not be possible to find a good predictor of the y values.

We know the least squares procedure finds the best possible fit, but that does not guarantee good predictive power.

In this section we discuss some methods for summarizing the quality of the fit.

Page 4

3.4.1 The ANOVA Table

Let us start by looking at the amount of variation in the y values. The variation about the mean is

$$\sum_{i=1}^{n} (y_i - \bar{y})^2$$

which we will call SST, the total sum of squares.

Text equations (3.14) and (3.15) show how this can be split up into two parts.

Page 5

Partitioning SST

SST can be split into two pieces, which are the previously introduced SSE and a new quantity, SSR, the regression sum of squares.

SST = SSR + SSE

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
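As a rough numerical check (not from the textbook), the partition can be verified with a few lines of NumPy. The x and y arrays below are hypothetical data chosen only for illustration:

import numpy as np

# Hypothetical data, for illustration only (not the textbook's example data)
x = np.array([12, 20, 28, 40, 55, 68], dtype=float)
y = np.array([20, 31, 35, 44, 58, 70], dtype=float)

# Least squares estimates of the intercept and slope
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares
sse = np.sum((y - y_hat) ** 2)         # error (unexplained) sum of squares

assert np.isclose(sst, ssr + sse)      # the partition SST = SSR + SSE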

Page 6

Explained and Unexplained Variation

We know that SSE is the sum of all the squared residuals, which represent lack of fit in the observations.

We call this the unexplained variation in the sample.

Because SSR contains the remainder of the variation in the sample, it is thus the variation explained by the regression equation.

Page 7

The ANOVA Table

Most statistics packages organize these quantities in an ANalysis Of VAriance table.

Source      DF   SS    MS    F
Regression   1   SSR   MSR   MSR/MSE
Residual    n-2  SSE   MSE
Total       n-1  SST
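Continuing the hypothetical NumPy sketch from above, the remaining table entries follow directly from SSR and SSE:

n = len(y)
msr = ssr / 1        # regression mean square (1 df for a single predictor)
mse = sse / (n - 2)  # residual mean square (n - 2 df)
f_ratio = msr / mse  # F statistic reported in the ANOVA table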

Page 8

3.4.2 The Coefficient of Determination

If we had an exact relationship between y and x, then SSE would be zero and SSR = SST.

Since that does not happen often, it is convenient to use the ratio of SSR to SST as a measure of how close we get to the exact relationship.

This ratio is called the Coefficient of Determination, or R².

Page 9

R²

R² = SSR/SST is a fraction between 0 and 1.

In an exact model, R² would be 1. Most of the time we multiply by 100 and report it as a percentage.

Thus, R² is the percentage of the variation in the sample of y values that is explained by the regression equation.
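In the same sketch, R² is just the ratio of the two sums of squares computed earlier:

r_squared = ssr / sst                    # coefficient of determination
print(f"R-Sq = {100 * r_squared:.1f}%")  # reported as a percentage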

Page 10

Correlation Coefficient

Some programs also report the square root of R² as the correlation between the y and ŷ values.

When there is only a single predictor variable, as here, R² is just the square of the correlation between y and x.

Page 11

3.4.3 The F Test

An additional measure of fit is provided by the F statistic, which is the ratio of MSR to MSE.

This can be used as another way to test the hypothesis that β1 = 0.

This test is not very important in simple regression because it is redundant with the t test on the slope.

In multiple regression (next chapter) it is much more important.

Page 12

F Test Setup

The hypotheses are:

H0: β1 = 0
Ha: β1 ≠ 0

The F ratio has 1 numerator degree of freedom and n - 2 denominator degrees of freedom.

A critical value for the test is selected from that distribution, and H0 is rejected if the computed F ratio exceeds the critical value.
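A sketch of this decision rule, assuming SciPy is available and reusing n and f_ratio from the earlier hypothetical sketch:

from scipy import stats

alpha = 0.05
f_crit = stats.f.ppf(1 - alpha, 1, n - 2)  # critical value from the F(1, n-2) distribution
p_value = stats.f.sf(f_ratio, 1, n - 2)    # upper-tail p-value for the computed ratio
reject_h0 = f_ratio > f_crit               # reject H0: beta1 = 0 when True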

Page 13

Example 3.8 Pricing Communications Nodes (continued)

Below we see the portion of the Minitab output that lists the statistics we have just discussed.

S = 4307 R-Sq = 88.7% R-Sq(adj) = 87.8%

Analysis of Variance

Source          DF          SS          MS      F      P
Regression       1  1751268376  1751268376  94.41  0.000
Residual Error  12   222594146    18549512
Total           13  1973862521

Page 14

R² and F

R² = SSR/SST = 1751268376/1973862521 = .8872, or 88.7%

F = MSR/MSE = 1751268376/18549512 = 94.41

From the F(1,12) distribution, the critical value at a 5% significance level is 4.75. Since 94.41 exceeds 4.75, H0: β1 = 0 is rejected.

Page 15

3.5 Prediction or Forecasting With a Simple Linear Regression Equation

Suppose we are interested in predicting the cost of a new communications node that has 40 ports.

If this size of project is something we would see often, we might be interested in estimating the average cost of all projects with 40 ports.

If it is something we expect to see only once, we would be interested in predicting the cost of the individual project.

Page 16

3.5.1 Estimating the Conditional Mean of y Given x

At xm = 40 ports, the quantity we are estimating is

$$\mu_{y|x=40} = \beta_0 + \beta_1 (40)$$

Our best guess of this is just the point on the regression line:

$$\hat{y}_m = b_0 + b_1 (40)$$
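In code the point estimate is a one-liner; b0 and b1 here are the hypothetical least squares estimates from the earlier sketch, not the fitted values from Example 3.8:

x_m = 40.0               # number of ports at which the conditional mean is estimated
y_hat_m = b0 + b1 * x_m  # point estimate of the mean of y at x = 40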

Page 17

Standard Error of the Mean

We will want to make an interval estimate, so we need some kind of standard error.

Because our point estimate is a function of the random variables b0 and b1, their standard errors figure into our computation.

The result is:

$$S_m = S_e \sqrt{\frac{1}{n} + \frac{(x_m - \bar{x})^2}{(n-1)s_x^2}}$$
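A sketch of the computation, continuing the hypothetical example; Se is the regression standard error (the square root of MSE), and the denominator (n-1)sx² is just the sum of squared deviations of x:

s_e = np.sqrt(mse)                      # regression standard error
sum_sq_x = np.sum((x - x.mean()) ** 2)  # equals (n - 1) * s_x**2
s_m = s_e * np.sqrt(1.0 / n + (x_m - x.mean()) ** 2 / sum_sq_x)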

Page 18

Where Are We Most Accurate?

For estimating the mean at the point xm, the standard error is Sm. If you examine the formula

$$S_m = S_e \sqrt{\frac{1}{n} + \frac{(x_m - \bar{x})^2}{(n-1)s_x^2}}$$

you can see that the second term will be zero if we predict at the mean value of x.

That makes sense: it says you do your best prediction right in the center of your data.

Page 19

Interval Estimate

For estimating the conditional mean of y that occurs at xm we use:

$$\hat{y}_m \pm t_{n-2} S_m$$

We call this a confidence interval for the mean value of y at xm.
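A sketch of the interval, with SciPy supplying the t critical value for n - 2 degrees of freedom and the quantities from the earlier sketch:

from scipy import stats

t_crit = stats.t.ppf(0.975, df=n - 2)  # two-sided 95% critical value
ci_lower = y_hat_m - t_crit * s_m
ci_upper = y_hat_m + t_crit * s_m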

Page 20

Hypothesis Test

We could also perform a hypothesis test about the conditional mean.

The hypothesis would be:

H0: μy|x=40 = (some value)

and we would construct a t ratio from the point estimate and standard error.

Page 21

3.5.2 Predicting an Individual Value of y Given x

If we are trying to say something about an individual value of y, it is a little bit harder.

We not only have to first estimate the conditional mean, but we also have to tack on an allowance for y being above or below its mean.

We use the same point estimate, but our standard error is larger.

Page 22

Prediction Standard Error

It can be shown that the prediction standard error is

$$S_p = S_e \sqrt{1 + \frac{1}{n} + \frac{(x_m - \bar{x})^2}{(n-1)s_x^2}}$$

This looks a lot like the previous one but has an additional term under the square root sign.

The relationship is:

$$S_p^2 = S_m^2 + S_e^2$$
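Continuing the hypothetical sketch, both the prediction standard error and the stated relationship can be checked numerically:

s_p = s_e * np.sqrt(1.0 + 1.0 / n + (x_m - x.mean()) ** 2 / sum_sq_x)
assert np.isclose(s_p ** 2, s_m ** 2 + s_e ** 2)  # Sp**2 = Sm**2 + Se**2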

Page 23

Predictive Inference

Although we could be interested in a hypothesis test, the most common type of predictive inference is a prediction interval.

The interval is just like the one for the conditional mean, except that Sp is used in the computation.

Page 24

Example 3.10 Pricing Communications Nodes (one last time)

What do we get when there are 40 ports?

Many statistics packages have a way for you to do the prediction. Here is Minitab's output:

Predicted Values for New Observations

New Obs    Fit  SE Fit        95.0% CI         95.0% PI
      1  42600    1178  (40035, 45166)  (32872, 52329)

Values of Predictors for New Observations

New Obs  NUMPORTS
      1      40.0

Page 25

From the Output

ŷm = 42600    Sm = 1178

Confidence interval: 40035 to 45166
  computed: 42600 ± 2.179(1178)

Prediction interval: 32872 to 52329
  computed: 42600 ± 2.179(????)

The output does not list Sp.
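Sp can still be recovered from the output in two ways, using S = 4307 and Sm = 1178 from Example 3.8 and the t value 2.179 with 12 degrees of freedom:

import math

t_12 = 2.179
s_p_from_pi = (52329 - 42600) / t_12              # half-width of the PI divided by t, about 4465
s_p_from_relation = math.sqrt(1178**2 + 4307**2)  # sqrt(Sm**2 + Se**2), also about 4465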

Page 26

Interpretations

For all projects with 40 ports, we are 95% sure that the average cost is between $40,035 and $45,166.

We are 95% sure that any individual project will have a cost between $32,872 and $52,329.

Page 27

3.5.3 Assessing Quality of Prediction

We use the model's R² as a measure of fit ability, but this may overestimate the model's ability to predict.

The reason is that R² is optimized by the least squares procedure for the data in our sample.

It is not necessarily optimal for data outside our sample, which is what we are predicting.

Page 28

Data Splitting

We can split the data into two pieces. Use the first part to obtain the equation, then use that equation to predict the data in the second part.

By comparing the actual y values in the second part to their corresponding predicted values, you get an idea of how well you predict data that is not in the "fit" sample.

The biggest drawback to this is that it won't work too well unless we have a lot of data. To be really reliable we should have at least 25 to 30 observations in both samples.
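A minimal data-splitting sketch using the small hypothetical arrays from earlier (a real application would want far more observations, as noted above):

split = len(x) // 2
x_fit, y_fit = x[:split], y[:split]  # "fit" sample
x_new, y_new = x[split:], y[split:]  # hold-out sample

b1_fit = np.sum((x_fit - x_fit.mean()) * (y_fit - y_fit.mean())) / np.sum((x_fit - x_fit.mean()) ** 2)
b0_fit = y_fit.mean() - b1_fit * x_fit.mean()
pred_errors = y_new - (b0_fit + b1_fit * x_new)  # out-of-sample prediction errors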

Page 29

The PRESS Statistic

Suppose you temporarily deleted observation i from the data set, fit a new equation, then used it to predict the yi value.

Because the new equation did not use any information from this data point, we get a clearer picture of the model's ability to predict it.

The sum of these squared prediction errors is the PRESS statistic.

Page 30

Prediction R²

It sounds like a lot of work to do by hand, but most statistics packages will do it for you.

You can then compute an R²-like measure called the prediction R²:

$$R^2_{\text{PRED}} = 1 - \frac{\text{PRESS}}{\text{SST}}$$
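A leave-one-out sketch of PRESS and the prediction R², again using the small hypothetical data set from earlier rather than the textbook's:

press = 0.0
for i in range(n):
    keep = np.arange(n) != i  # drop observation i
    xi, yi = x[keep], y[keep]
    b1_i = np.sum((xi - xi.mean()) * (yi - yi.mean())) / np.sum((xi - xi.mean()) ** 2)
    b0_i = yi.mean() - b1_i * xi.mean()
    press += (y[i] - (b0_i + b1_i * x[i])) ** 2  # squared error predicting the held-out point

r2_pred = 1.0 - press / sst  # prediction R-squared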

Page 31

In Our Example

For the communications node data we have been using, SSE = 222594146, SST = 1973862521, and R² = 88.7%.

Minitab reports that PRESS = 345066019.

Our prediction R²:

1 - (345066019/1973862521) = 1 - .175 = .825, or 82.5%

Although there is a little loss, it implies we still have good prediction ability.

Page 32

3.6 Fitting a Linear Trend Model to Time-Series Data

Data gathered on different units at the same point in time are called cross-sectional data.

Data gathered on a single unit (person, firm, etc.) over a sequence of time periods are called time-series data.

With this type of data, the primary goal is often building a model that can forecast the future.

Page 33

Time Series Models

There are many types of models that attempt to identify patterns of behavior in a time series in order to extrapolate it into the future.

Some of these will be examined in Chapter 11, but here we will just employ a simple linear trend model.

Page 34

The Linear Trend Model

We assume the series displays a steady upward or downward behavior over time that can be described by

$$y_t = \beta_0 + \beta_1 t + e_t$$

where t is the time index (t = 1 for the first observation, t = 2 for the second, and so forth).

The forecast for this model is quite simple:

$$\hat{y}_T = b_0 + b_1 T$$

You just insert the appropriate value for T into the regression equation.
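A sketch of fitting and forecasting a linear trend; the short quarterly series below is made up for illustration and is not the ABX data:

import numpy as np

sales = np.array([205, 198, 221, 230, 236, 229, 248, 260], dtype=float)  # hypothetical series
t = np.arange(1, len(sales) + 1, dtype=float)                            # time index t = 1..n

b1_t = np.sum((t - t.mean()) * (sales - sales.mean())) / np.sum((t - t.mean()) ** 2)
b0_t = sales.mean() - b1_t * t.mean()

T = len(sales) + 1          # first out-of-sample period
forecast = b0_t + b1_t * T  # plug T into the fitted trend equation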

Page 35

Example 3.11 ABX Company Sales

The ABX Company sells winter sports merchandise including skates and skis. The quarterly sales (in $1000s) from first quarter 1994 through fourth quarter 2003 are graphed on the next slide.

The time-series plot shows a strong upward trend. There are also some seasonal fluctuations which will be addressed in Chapter 7.

Page 36

ABX Company Sales

[Time-series plot: quarterly SALES ($1000s) on the vertical axis against the quarter Index (1 to 40) on the horizontal axis]

Page 37

Obtaining the Trend Equation

We first need to create the time index variable, which is equal to 1 for first quarter 1994 and 40 for fourth quarter 2003.

Once this is created we can obtain the trend equation by linear regression.

Page 38

Trend Line Estimation

The regression equation is
SALES = 199 + 2.56 TIME

Predictor      Coef  SE Coef      T      P
Constant    199.017    5.128  38.81  0.000
TIME         2.5559   0.2180  11.73  0.000

S = 15.91 R-Sq = 78.3% R-Sq(adj) = 77.8%

Analysis of Variance

Source          DF     SS     MS       F      P
Regression       1  34818  34818  137.50  0.000
Residual Error  38   9622    253
Total           39  44440

Page 39

The Slope Coefficient

The slope in the equation is 2.5559. This implies that over this 10-year period, we saw an average growth in sales of about $2,556 per quarter.

The hypothesis test on the slope has a t value of 11.73, so the slope is indeed significantly greater than zero.

Page 40

Forecasts For 2004

Forecasts for 2004 can be obtained by evaluating the equation at t = 41, 42, 43 and 44.

For example, sales in the fourth quarter are forecast as:
SALES = 199.017 + 2.5559(44) = 311.48

A graph of the data, the estimated trend and the forecasts is shown on the next slide.
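All four forecasts can be reproduced with the full-precision coefficients from the Minitab output (Constant = 199.017, TIME = 2.5559):

b0_abx, b1_abx = 199.017, 2.5559
forecasts_2004 = {t: round(b0_abx + b1_abx * t, 2) for t in (41, 42, 43, 44)}
# forecasts_2004[44] is 311.48, matching the fourth-quarter figure above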

Page 41

Data, Trend (——) and Forecast (------)

[Plot of SALES against TIME (0 to 45) showing the observed data, the fitted trend line, and the 2004 forecasts]

Page 42

3.7 Some Cautions in Interpreting Regression Results

Two common mistakes that are made when using regression analysis are:

1. That x causes y to happen, and

2. That you can use the equation to predict y for any value of x.

Page 43

3.7.1 Association Versus Causality

If you have a model with a high R², it does not automatically mean that a change in x causes y to change in a very predictable way.

It could be just the opposite: that y causes x to change. A high correlation goes both ways.

It could also be that both y and x are changing in response to a third variable that we don't know about.

Page 44

The Third Factor

One example of this third factor is the price and gasoline mileage of automobiles. As price increases, there is a sharp drop in mpg. This is caused by size: larger cars cost more and get less mileage.

Another is mortality rate in a country versus the percentage of homes with television. As TV ownership increases, mortality rate drops. This is probably due to better economic conditions improving quality of life and simultaneously allowing for greater ownership.

Page 45

3.7.2 Forecasting Outside the Range of the Explanatory Variable

When we have a model with a high R², it means we know a good deal about the relationship of y and x for the range of x values in our study.

Think of our communications node example, where the number of ports ranged from 12 to 68. Does our model even hold if we wanted to price a massive project of 200 ports?

Page 46

An Extrapolation Penalty

Recall that our prediction intervals were always narrowest when we predicted right in the middle of our data set.

As we go farther and farther outside the range of our data, the interval gets wider and wider, implying we know less and less about what is going on.

Page 47

DSCI 5180
Decision Making

HW 3 – Hypothesis Testing in Regression