
Chapters 11 & 12: Linear and Multiple Regression with Minitab


Chapter 11: Simple Linear Regression and Correlation

IE 609

Correlation

The Relation between Two Sets of Measures

The Relation between Two Sets of Measures

• Construct a scatter diagram for the following data:

The Relation between Two Sets of Measures

• Plot Results

The Relation between Two Sets of Measures

• You might have reversed the axes so that the vertical dimension represented the midterm grade and the horizontal dimension, the final grade.

• When one measure may be used to predict another, it is customary to represent the predictor on the horizontal dimension (the x-axis).

The Relation between Two Sets of Measures

• Linear or Straight Line Relationship


The Relation between Two Sets of Measures

• Other relationships

The Relation between Two Sets of Measures

• Which of the diagrams represents the stronger relationship?

The Relation between Two Sets of Measures

• Which of the diagrams represents the stronger relationship?

Simple Linear Regression

y = α + βx

yi = a + bxi + εi

Simple Linear Regression: Minitab Data Entry

Table 11.1, pg 393

Simple Linear Regression: Calc > Column Statistics


Simple Linear Regression: Calc > Calculator (Create Formula, Store Variable: Residual)

Simple Linear Regression


Simple Linear Regression: Graph > Probability Plot

Simple Linear Regression: Residuals appear Normally Distributed

Linear Regression: Simple Structure

yi = ŷ + εi

ŷ → sample mean = 34.0606 (Minitab*)

εi = yi − ŷ (Minitab “Residual”)

Sample variance of y = (10.7)²

* Mean of Demand, y (%) = 34.0606

Linear Regression and Correlation: Simple Structure

Question: Is the sample mean of Demand the correct value to use for ŷ?

yi = ŷ + εi

– Although it might seem to be a trivial question, you might ask why the sample mean (ȳ) is the correct value to use for ŷ.
– Since the purpose of the model is to accurately describe the yi, we would expect the model to deliver small errors (that is, εi). But how should we go about making the errors small?


Linear Regression: Simple Structure

Question: Is the sample mean of Demand the correct value to use for ŷ?

yi = ŷ + εi

– A logical choice is to pick ŷ, which might be different from the sample mean, so that the error variance calculated from the εi = yi − ŷ is minimized:

s² = Σεi²/(n − 1)

Linear Regression: Simple Structure

The calculus operation that delivers this solution is the minimization of Σεi² = Σ(yi − ŷ)².

This is called the method of least squares because the method minimizes the error sum of squares.

Linear Regression: Simple Structure

Now consider the scatter diagram below: y appears to increase linearly with respect to x.

There might be an underlying causal relationship between x and y of the form:

y = α + βx

Linear Regression: Simple Structure

y = α + βx

• The parameters α and β are the y-axis intercept and slope, respectively.

• Since we typically have sample data and not the complete population of (x, y) observations, we cannot expect to determine α and β exactly; they will have to be estimated from the sample data. Our model is of the form

yi = a + bxi + εi

Linear Regression: Simple Structure

• Then for any choice of a and b the εi may be determined from

εi = (yi − ŷi) = yi − (a + bxi)

• These errors or discrepancies εi are also called the model residuals.

Linear Regression: Simple Structure

• Although this equation allows us to calculate the εi for a given (xi, yi) data set once a and b are specified, there are still an infinite number of a and b values that could be used in the model:

εi = (yi − ŷi) = yi − (a + bxi)

Clearly the choice of a and b that provides the best fit to the data should make the εi, or some function of them, small. Although many conditions can be stated to define best-fit lines by minimizing the εi, by far the most frequently used condition to define the best-fit line is the one that minimizes Σεi².


Linear Regression and Correlation: Simple Structure

• That is, the best-fit line for the (x, y) data is called the linear least squares regression line, which corresponds to the choice of a and b that minimizes Σεi².

• The calculus solution to this problem is given by the simultaneous solution to the two equations:

∂(Σεi²)/∂a = 0 and ∂(Σεi²)/∂b = 0

• The method of fitting a line to (xi, yi) data using this solution is called linear regression.

Linear Regression: Simple Structure

• The error variance for linear least squares regression is given by

sε² = Σεi²/(n − 2)

where n is the number of (xi, yi) observations and sε is called the standard error of the model.

• The equation has n − 2 in the denominator because two degrees of freedom are consumed by the calculation of the regression coefficients a and b from the experimental data.

Linear Regression: Simple Structure

• Think of the error variance sε² in the regression problem in the same way as you think of the sample variance s² used to quantify the amount of variation in simple measurement data.

• Whereas the sample variance characterizes the scatter of observations about a single value, the error variance in the regression problem characterizes the distribution of values about the line ŷi = a + bxi.

• sε² and s² are close cousins; they are both measures of the errors associated with different models for different kinds of data.

REGRESSION COEFFICIENTS

• With the condition to determine the a and b values that provide the best-fit line for the (xi, yi) data, namely the minimization of Σεi², we proceed to determine a and b in a more rigorous manner.

REGRESSION COEFFICIENTS

Determining the unique values of a and b

REGRESSION COEFFICIENTS

• The calculus method that determines the unique values of a and b that minimize Σεi² requires that we solve the simultaneous equations:

∂(Σεi²)/∂a = 0 and ∂(Σεi²)/∂b = 0


REGRESSION COEFFICIENTS

• From these equations the resulting values of a and b are best expressed in terms of sums of squares:

SSx = Σ(xi − x̄)²

SSxy = Σ(xi − x̄)(yi − ȳ)

SSy = Σ(yi − ȳ)²

b = SSxy/SSx

a = ȳ − b·x̄

REGRESSION COEFFICIENTS

• SSx and SSy are just the sums of squares required to determine the variances of the x and y values:

SSx = Σ(xi − x̄)²

SSy = Σ(yi − ȳ)²

REGRESSION COEFFICIENTS

• Similarly, using the sum of squares notation, we can write the error sum of squares for the regression as

SSε = Σεi² = SSy − b·SSxy

• and the standard error as:

sε = √(SSε/(n − 2))

REGRESSION COEFFICIENTS

• Another important implication of the equations

b = SSxy/SSx and a = ȳ − b·x̄

is that the point (x̄, ȳ) falls on the best-fit line. This is just a consequence of the way the sums of squares are calculated.

REGRESSION COEFFICIENTS

ŷ = a + bx, with b = SSxy/SSx and a = ȳ − b·x̄

sε² = SSE/(n − 2)

LINEAR REGRESSION ASSUMPTIONS

Stat > Regression > Fitted Line Plot


LINEAR REGRESSION ASSUMPTIONS


LINEAR REGRESSION ASSUMPTIONS

• A valid linear regression model requires that five conditions be satisfied (a residual-diagnostics sketch follows the list):

1. The values of x are determined without error.

2. The εi are normally distributed with mean με = 0 for all values of x.

3. The distribution of the εi has constant variance σε² for all values of x within the range of experimentation (that is, homoscedasticity).

4. The εi are independent of each other.

5. The linear model provides a good fit to the data.

HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS

Hypothetical distributions for α and β

[Figure: hypothetical sampling distributions centered at α0 and β0]

HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS

• The values of the intercept and slope, a and b, found with the equations

b = SSxy/SSx and a = ȳ − b·x̄

are actually estimates for the true parameters α and β.

HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS

Hypothetical distributions for α and β: both of these distributions follow Student's t distribution with degrees of freedom equal to the error degrees of freedom.

[Figure: t distributions centered at α0 and β0]

HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS

• Although linear regression analysis will always return a and b values, it's possible that one or both of these values could be statistically insignificant. We require a formal method of testing α and β to see if they are different from zero. Hypotheses for these tests are:

H0: α = 0 versus H1: α ≠ 0

H0: β = 0 versus H1: β ≠ 0

To perform these tests we need some idea of the amount of variability present in the estimates of α and β.


HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS

• Estimates of the variances σa² and σb² are given by:

sa² = sε²·(1/n + x̄²/SSx)

sb² = sε²/SSx

HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS

• The hypothesis tests can be performed using one-sample t tests with dfε = n − 2 degrees of freedom and the t statistics:

t = a/sa and t = b/sb

HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS

• The (1 − α)100% confidence intervals for α and β are determined from

P(a − tα/2·sa < α < a + tα/2·sa) = 1 − α

P(b − tα/2·sb < β < b + tα/2·sb) = 1 − α

with n − 2 degrees of freedom.

HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS

• It is very important to realize that the variances of a and b as given are proportional to the standard error of the fit, sε. This means that if there are any uncontrolled variables in the experiment that cause the standard error to increase, there will be a corresponding increase in the standard deviations of the regression coefficients. This could make the regression coefficients disappear into the noise.

• Always keep in mind that the model's ability to predict the regression coefficients is dependent on the size of the standard error. Take care to remove, control, or account for extraneous variation so that you get the best predictions from your models with the least effort.

CONFIDENCE LIMITS FOR THE REGRESSION LINE

• The true slope and intercept of a regression line are not exactly known.

• The (1 − α)100% confidence interval for the regression line is given by:

ŷ ± tα/2·sε·√(1/n + (x0 − x̄)²/SSx)

CONFIDENCE LIMITS FOR THE REGRESSION LINE

Stat > Regression > Fitted Line Plot menu. You will have to select Display Confidence Bands in the Options menu to add the confidence limits to the fitted line plot.


PREDICTION LIMITS FOR THE OBSERVED VALUES

• The prediction interval provides prediction bounds for individual observations. The width of the prediction interval combines the uncertainty of the position of the true line, as described by the confidence interval, with the scatter of points about the line as measured by the standard error:

ŷ ± tα/2·sε·√(1 + 1/n + (x0 − x̄)²/SSx)

where tα/2 has dfε = n − 2 degrees of freedom.

PREDICTION LIMITS FOR THE OBSERVED VALUES

Stat > Regression > Fitted Line Plot menu. You will have to select Display Prediction Bands in the Options menu.

CORRELATION

COEFFICIENT OF DETERMINATION (r²)

CORRELATION COEFFICIENT (r)

CORRELATION: Coefficient of Determination r²

• A comprehensive statistic is required to measure the fraction of the total variation in the response y that is explained by the regression model.

• The total variation in y taken relative to ȳ is given by SSy = Σ(yi − ȳ)², but SSy is partitioned into two terms: one that accounts for the amount of variation explained by the straight-line model, given by SSregression, and another that accounts for the unexplained error variation, given by:

SSε = Σ(yi − ŷi)²

CORRELATION: Coefficient of Determination r²

• The three quantities are related by:

SSy = SSregression + SSε

• Consequently, the fraction of SSy explained by the model is:

r² = SSregression/SSy = 1 − SSε/SSy

where r² is called the coefficient of determination.

CORRELATION COEFFICIENT (r)

• The correlation coefficient r is given by the square root of the coefficient of determination r² with an appropriate plus or minus sign.

• If two measures have a linear relationship, it is possible to describe how strong the relationship is by means of a statistic called a correlation coefficient.

• The symbol for the correlation coefficient is r.

• The symbol for the corresponding population parameter is ρ (the Greek letter "rho").


CORRELATION COEFFICIENT (r)

• Pearson product-moment correlation

CORRELATION COEFFICIENT (r)

• The basic formula for the correlation coefficient is:

r = SSxy/√(SSx·SSy)

PEARSON'S PRODUCT-MOMENT CORRELATION COEFFICIENT (r)

• Given a set of data (Example 11.10, pg 435), find r:

x      y       x      y
0.414  29186   0.548  67095
0.383  29266   0.581  85156
0.399  26215   0.557  69571
0.402  30162   0.550  84160
0.442  38867   0.531  73466
0.422  37831   0.550  78610
0.466  44576   0.556  67657
0.500  46097   0.523  74017
0.514  59698   0.602  87291
0.530  67705   0.569  86836
0.569  66088   0.544  82540
0.558  78486   0.557  81699
0.577  89869   0.530  82096
0.572  77369   0.547  75657
0.548  67095   0.585  80490

CORRELATION: The Coefficient of Determination r²

• The coefficient of determination finds numerous applications in regression and multiple regression problems.

• Since SSregression is bounded by 0 ≤ SSregression ≤ SSy, there are corresponding bounds on the coefficient of determination given by 0 ≤ r² ≤ 1.

• When r² = 0 the regression model has little value because very little of the variation in y is attributable to its dependence on x. When r² = 1 the regression model almost completely explains all of the variation in the response; that is, x almost perfectly predicts y.

• We're usually hoping for r² = 1, but this rarely happens.

Confidence Interval for the Coefficient of Determination r²

• The coefficient of determination r² is a statistic that represents the proportion of the total variation in the values of the variable Y that can be accounted for, or explained by, a linear relationship with the random variable X.

• A different data set of (x, y) values will give a different value of r². The quantity that such r² values estimate is the true population coefficient of determination ρ², which is a parameter.

Confidence Interval for the Correlation Coefficient (r)

• When the distribution of the regression model residuals is normal with constant variance, the distribution of r is complicated, but the distribution of:

Z = ½·ln((1 + r)/(1 − r))

is approximately normal with mean:

μZ = ½·ln((1 + ρ)/(1 − ρ))

and standard deviation:

σZ = 1/√(n − 3)

• The transformation of r into Z is called Fisher's Z transformation.


Confidence Interval for the Correlation Coefficient (r)

• This information can be used to construct a confidence interval for the unknown parameter μZ from the statistic r and the sample size n. The confidence interval is:

P(Z − zα/2/√(n − 3) < μZ < Z + zα/2/√(n − 3)) = 1 − α

LINEAR REGRESSION WITH MINITAB

• MINITAB provides two basic functions for performing linear regression:

1. The Stat > Regression > Fitted Line Plot menu is the best place to start to evaluate the quality of the fitted function. It includes a scatter plot of the (x, y) data with the superimposed fitted line, a full ANOVA table, and an abbreviated table of regression coefficients.

LINEAR REGRESSION WITH MINITAB

2. Stat > Regression > Regression menu

The first part is a table of the regression coefficients and the corresponding standard deviations, t values, and p values. The second part is the ANOVA table, which summarizes the statistics required to determine the regression coefficients and the summary statistics like r, r², r²adj, and sε.

There is a p value reported for the slope of the regression line in the table of regression coefficients and another p value reported in the ANOVA table for the ANOVA F test. These two p values are numerically identical, and not just by coincidence: there is a special relationship between the t and F distributions when the F distribution has one numerator degree of freedom.

LINEAR REGRESSION WITH MINITAB

Stat > Regression > Regression menu

POLYNOMIAL MODELS

ŷ = a + b1x + b2x² + … + bpx^p

POLYNOMIAL MODELS

• The general form of a polynomial model is:

ŷ = a + b1x + b2x² + … + bpx^p

where the polynomial is said to be of order p. The regression coefficients a, b1, …, bp are determined using the same algorithm that was used for the simple linear model; the error sum of squares is simultaneously minimized with respect to the regression coefficients. The family of equations that must be solved to determine the regression coefficients is nightmarish, but most of the good statistical software packages have this capability.


POLYNOMIAL MODELS

• Although high-order polynomial models can fit the (x, y) data very well, they should be of the lowest order possible that accurately represents the relationship between y and x. There are no clear guidelines on what order might be necessary, but watch the significance (that is, the p values) of the various regression coefficients to confirm that all of the terms are contributing to the model. Polynomial models must also be hierarchical; that is, a model of order p must contain all possible lower-order terms.

POLYNOMIAL MODELS

• Because of their complexity, it's important to summarize the performance of polynomial models using r²adjusted instead of r². In some cases, when there are relatively few error degrees of freedom after fitting a large polynomial model, the r² value could be misleadingly large, whereas r²adjusted will be much lower but more representative of the true performance of the model.

POLYNOMIAL MODELS

• Fit the following data with an appropriate model and use scatter plots and residuals diagnostic plots to check for lack of fit.

POLYNOMIAL MODELS

• Solution: scatter plots and residuals diagnostic plots

POLYNOMIAL MODELS

• Solution: SCATTER PLOT


POLYNOMIAL MODELS

• Solution: SCATTER PLOT



POLYNOMIAL MODELS

• Solution: Residuals diagnostic plots

POLYNOMIAL MODELS

• Solution (quadratic): create an x² column

POLYNOMIAL MODELS

• Solution: Quadratic Model


POLYNOMIAL MODELS

• Solution: Quadratic Model

Stat > Regression > Fitted Line Plot (x, y), Quadratic

Multiple Regression


Multiple Regression

• When a response has n quantitative predictors, such as y(x1, x2, …, xn), the model for y must be created by multiple regression. In multiple regression each predictive term in the model has its own regression coefficient. The simplest multiple regression model contains a linear term for each predictor:

ŷ = a + b1x1 + b2x2 + … + bnxn

Multiple Regression

• This equation has the same basic structure as the polynomial model and, in fact, the two models are fitted and analyzed in much the same way. Where the worksheet to fit the polynomial model requires one column for each power of x, the worksheet to fit the multiple regression model requires n columns to account for each of the n predictors. The same regression methods are used to analyze both problems.

Multiple Regression

• Frequently, the simple linear model does not fit the data and a more complex model is required. The terms that must be added to the model to achieve a good fit might involve interactions, quadratic terms, or terms of even higher order. Such models have the basic form:

ŷ = a + Σ bixi + Σ bijxixj + Σ biixi² + …

Multiple Regression

• PROBLEM

A real-estate executive would like to be able to predict the cost of a house in a housing development on the basis of the number of bedrooms and bathrooms in the house.

• Selling Price Table (in thousands of dollars)


Multiple Regression

• The following first-order model is assumed to connect the selling price of the home with the number of bedrooms and the number of baths. The dependent variable is represented by y and the independent variables are x1, the number of bedrooms, and x2, the number of baths:

y = β0 + β1x1 + β2x2 + ε

Multiple Regression

• MINITAB SOLUTION

• Stat > Regression > Regression.


Multiple Regression

• MINITAB Output: Stat > Regression > Regression

Multiple Regression


Multiple Regression

• Problem

The following table contains data from a blood pressure study. The data were collected on a group of middle-aged men. Systolic is the systolic blood pressure; Age is the age of the individual; Weight is the weight in pounds; Parents indicates whether the individual's parents had high blood pressure: 0 means neither parent has high blood pressure, 1 means one parent has high blood pressure, and 2 means both mother and father have high blood pressure; Med is the number of hours per month that the individual meditates; and TypeA is a measure of the degree to which the individual exhibits type A personality behavior, as determined from a form that the person fills out. Systolic is the dependent variable and the other five variables are the independent variables.

Multiple Regression

Blood pressure study on fifty middle-aged men.


Multiple Regression

• Model

y = systolic
x1 = age
x2 = weight
x3 = parents
x4 = med
x5 = TypeA

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ε

Multiple Regression

• MINITAB SOLUTION: Stat > Regression > Regression

Multiple Regression

• MINITAB SOLUTION: Stat > Regression > Regression

The five hypothesis tests suggest Weight and Type A should be kept and the other three variables thrown out.

Multiple Regression: Checking the Overall Utility of a Model

• Purpose: check whether the model is useful and control your α value.

Rather than conduct a large group of t tests on the betas and increase the probability of making a type 1 error, make one test and know that α = 0.05. The F test is such a test. It is contained in the analysis of variance associated with the analysis. The F test tests the following hypotheses associated with the blood pressure model:

H0: β1 = β2 = β3 = β4 = β5 = 0
H1: at least one βi ≠ 0

Multiple Regression

• MINITAB SOLUTION: Stat > Regression > Regression

Interpretation:

As is seen, F = 18.50 with a p-value of 0.000, and the null hypothesis should be rejected; the conclusion is that at least one βi ≠ 0. This F test says that the model is useful in predicting systolic blood pressure.

END