Chapter Topics Types of Regression Models Determining the Simple Linear Regression Equation Measures of Variation in Regression and Correlation Assumptions

Chapter Topics• Types of Regression Models

• Determining the Simple Linear Regression Equation

• Measures of Variation in Regression and Correlation

• Assumptions of Regression and Correlation

• Residual Analysis and the Durbin-Watson Statistic

• Estimation of Predicted Values

• Correlation - Measuring the Strength of the Association

Purpose of Regression and Correlation Analysis

• Regression Analysis is Used Primarily for Prediction

A statistical model used to predict the values of a dependent or response variable based on values of at least one independent or explanatory variable

Correlation Analysis is Used to Measure Strength of the Association Between Numerical Variables

Types of Regression Models

Positive Linear Relationship

Negative Linear Relationship

Relationship NOT Linear

No Relationship

Simple Linear Regression Model

iii XY 10

Y intercept

Slope

• The Straight Line that Best Fit the Data

• Relationship Between Variables Is a Linear Function

Random Error

Dependent (Response) Variable

Independent (Explanatory) Variable

Sample Linear Regression Model

ii XbbY 10

Yi

= Predicted Value of Y for observation i

Xi = Value of X for observation i

b0 = Sample Y - intercept used as estimate ofthe population 0

b1 = Sample Slope used as estimate of the population 1

Simple Linear Regression Equation: Example

You wish to examine the relationship between the square footage of produce stores and its annual sales. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best

Annual Store Square Sales

Feet ($000)

1 1,726 3,681

2 1,542 3,395

3 2,816 6,653

4 5,555 9,543

5 1,292 3,318

6 2,208 5,563

7 1,313 3,760

Equation for the Best Straight Line

i

ii

X..

XbbY

4871415163610

From Excel Printout:

CoefficientsIntercept 1636.414726X Variable 1 1.486633657

Graph of the Best Straight Line

0

2000

4000

6000

8000

10000

12000

0 1000 2000 3000 4000 5000 6000

Square Feet

An

nu

al

Sa

les

($00

0)

Y i = 1636.415 +1.487X i

Interpreting the Results

Yi = 1636.415 +1.487Xi

The slope of 1.487 means for each increase of one unit in X, the Y is estimated to increase 1.487units.

For each increase of 1 square foot in the size of the store, the model predicts that the expected annual sales are estimated to increase by $1487.

Measures of Variation:The Sum of Squares

SST = Total Sum of Squares

•measures the variation of the Yi values around their mean Y

SSR = Regression Sum of Squares

•explained variation attributable to the relationship between X and Y

SSE = Error Sum of Squares

•variation attributable to factors other than the relationship between X and Y

_

Measures of Variation: The Sum of Squares

Xi

Y i = b0

+ b1X i

Y

X

Y

SST = (Yi - Y)2

SSE =(Yi - Yi )2

SSR = (Yi - Y)2

_

_

_

df SSRegression 1 30380456.12Residual 5 1871199.595Total 6 32251655.71

The Sum of Squares: Example

Excel Output for Produce Stores

SSR SSE SST

The Coefficient of Determination

SSR regression sum of squares

SST total sum of squaresr2 = =

Measures the proportion of variation that is explained by the independent variable X in the regression model

Coefficients of Determination (r2) and Correlation (r)

r2 = 1, r2 = 1,

r2 = .8, r2 = 0,Y

Yi = b0 + b1Xi

X

^

YYi = b0 + b1Xi

X

^Y

Yi = b0 + b1Xi

X

^

Y

Yi = b0 + b1Xi

X

^

r = +1 r = -1

r = +0.9 r = 0

Standard Error of Estimate

2

n

SSESyx

21

2

n

)YY(n

iii

=

The standard deviation of the variation of observations around the regression line

Regression StatisticsMultiple R 0.9705572R Square 0.94198129Adjusted R Square 0.93037754Standard Error 611.751517Observations 7

Measures of Variation: Example

Excel Output for Produce Stores

r2 = .94 Syx94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage

Linear Regression Assumptions

• 1. Normality– Y Values Are Normally Distributed For Each X– Probability Distribution of Error is Normal

• 2. Homoscedasticity (Constant Variance)

• 3. Independence of Errors

For Linear Models

Variation of Errors Around the Regression Line

X1

X2

X

Y

f(e)y values are normally distributed

around the regression line.

For each x value, the “spread” or variance around the regression

line is the same.

Regression Line

Residual Analysis• Purposes

– Examine Linearity – Evaluate violations of assumptions

• Graphical Analysis of Residuals– Plot residuals Vs. Xi values

• Difference between actual Yi & predicted Yi– Studentized residuals:

• Allows consideration for the magnitude of the residuals

Residual Analysis for Linearity

Not Linear Linear

X

e e

X

Residual Analysis for Homoscedasticity

Heteroscedasticity Homoscedasticity

Using Standardized Residuals

SR

X

SR

X

The Durbin-Watson Statistic•Used when data is collected over time to detect

autocorrelation (Residuals in one time period are related to residuals in another period)

•Measures Violation of independence assumption

n

ii

n

iii

e

)ee(D

1

2

2

21 Should be close to 2.

If not, examine the model for autocorrelation.

Residual Analysis for Independence

Not Independent Independent

X

SR

X

SR

Inferences about the Slope: t Test• t Test for a Population Slope

Is a Linear Relationship Between X & Y ?

1

11

bS

bt

•Test Statistic:

n

ii

YXb

)XX(

SS

1

21

and df = n - 2

•Null and Alternative Hypotheses

H0: 1 = 0 (No Linear Relationship) H1: 1 0 (Linear Relationship)

Where

Example: Produce Stores

Data for 7 Stores: Regression Model Obtained:

The slope of this model is 1.487.

Is there a linear relationship between the square footage of a store and its annual sales?

Annual

Store Square Sales Feet ($000)

1 1,726 3,681

2 1,542 3,395

3 2,816 6,653

4 5,555 9,543

5 1,292 3,318

6 2,208 5,563

7 1,313 3,760

Yi = 1636.415 +1.487Xi

t Stat P-valueIntercept 3.6244333 0.0151488X Variable 1 9.009944 0.0002812

• H0: 1 = 0

• H1: 1 0

.05•df 7 - 2 = 7•Critical Value(s):

Test Statistic:

Decision:

Conclusion:

There is evidence of a relationship.t

0 2.5706-2.5706

.025

Reject Reject

.025

From Excel Printout

Reject H0

Inferences about the Slope: t Test

Inferences about the Slope: Confidence Interval Example

Confidence Interval Estimate of the Slopeb1 tn-2 1bS

Excel Printout for Produce Stores

At 95% level of Confidence The confidence Interval for the slope is (1.062, 1.911). Does not include 0.

Conclusion: There is a significant linear relationship between annual sales and the size of the store.

Lower 95% Upper 95%Intercept 475.810926 2797.01853X Variable 11.06249037 1.91077694

Estimation of Predicted Values

Confidence Interval Estimate for XY

The Mean of Y given a particular Xi

n

ii

iyxni

)XX(

)XX(

nStY

1

2

2

21

t value from table with df=n-2

Standard error of the estimate

Size of interval vary according to distance away from mean, X.

Estimation of Predicted Values

Confidence Interval Estimate for Individual Response Yi at a Particular Xi

n

ii

iyxni

)XX(

)XX(

nStY

1

2

2

21

1

Addition of this 1 increased width of interval from that for the mean Y

Interval Estimates for Different Values of X

X

Y

X

Confidence Interval for a individual Yi

A Given X

Confidence Interval for the mean of Y

Y i = b0 + b1X i

_

Example: Produce Stores

Yi = 1636.415 +1.487Xi

Data for 7 Stores:

Regression Model Obtained:

Predict the annual sales for a store with

2000 square feet.

Annual Store Square Sales

Feet ($000)

1 1,726 3,681

2 1,542 3,395

3 2,816 6,653

4 5,555 9,543

5 1,292 3,318

6 2,208 5,563

7 1,313 3,760

Estimation of Predicted Values: Example

Confidence Interval Estimate for Individual Y

Find the 95% confidence interval for the average annual sales for stores of 2,000 square feet

n

ii

iyxni

)XX(

)XX(

nStY

1

2

2

21

Predicted Sales Yi = 1636.415 +1.487Xi = 4610.45 ($000)

X = 2350.29 SYX = 611.75 tn-2 = t5 = 2.5706

= 4610.45 980.97

Confidence interval for mean Y

Estimation of Predicted Values: Example

Confidence Interval Estimate for XY

Find the 95% confidence interval for annual sales of one particular store of 2,000 square feet

Predicted Sales Yi = 1636.415 +1.487Xi = 4610.45 ($000)

X = 2350.29 SYX = 611.75 tn-2 = t5 = 2.5706

= 4610.45 1853.45

Confidence interval for individual Y

n

ii

iyxni

)XX(

)XX(

nStY

1

2

2

21

1

Correlation: Measuring the Strength of Association

• Answer ‘How Strong Is the Linear Relationship Between 2 Variables?’

• Coefficient of Correlation Used– Population correlation coefficient denoted

(‘Rho’)– Values range from -1 to +1– Measures degree of association

• Is the Square Root of the Coefficient of Determination

Test of Coefficient of Correlation

• Tests If There Is a Linear Relationship Between 2 Numerical Variables

• Same Conclusion as Testing Population Slope 1

• Hypotheses – H0: = 0 (No Correlation)

– H1: 0 (Correlation)

Documents

Chapter Topics Types of Regression Models Determining the Simple Linear Regression Equation Measures of Variation in Regression and Correlation Assumptions