
Part 12: Linear Regression

Statistics and Data Analysis

Professor William Greene

Stern School of Business

IOMS Department

Department of Economics



Linear Regression

Covariation (and vs. causality)
Examining covariation
  Descriptive: Relationship between variables
  Predictive: Use values of one variable to predict another.
  Control: Should a firm increase R&D?
  Understanding: What is the elasticity of demand for our product? (Should we raise our price?)
The regression relationship


Covariation of Home Prices with Other Factors

What explains the pattern? Is the distribution of average listing prices random?


[Figure: Scatterplot of Listing vs IncomePC. x-axis: IncomePC, 15000 to 32500; y-axis: Listing, 100000 to 900000]


Regression

Modeling and understanding covariation
“Change in y” is associated with “change in x”
  How do we know this?
  What can we infer from the observation?
Causality and covariation
  See http://en.wikipedia.org/wiki/Causality, especially “Probabilistic Causation,” about halfway down the article.


Covariation – Education and Life Expectancy

[Figure: Scatterplot of DALE vs EDUC, with points grouped by OECD (0 or 1). x-axis: EDUC, 0 to 12; y-axis: DALE, 20 to 80]

Causality? Covariation? Does more education make people live longer? A hidden driver of both? (GDPC)

Menu path: Graph > Scatterplots > With Groups. The categorical variable is OECD.


Useful Description(?)

Scatter plot of box office revenues vs. number of “Can’t Wait To See It” votes on Fandango for 62 movies. What do we learn from the figure? Is the “relationship” convincing? Valid? (Real?)


More Movie Madness

[Figure: Scatterplot of Overseas vs Domestic. x-axis: Domestic, 0 to 600; y-axis: Overseas, 0 to 1400]

[Figure: Scatterplot of Overseas vs Domestic. x-axis: Domestic, 0 to 500; y-axis: Overseas, 0 to 700]

Did domestic box office success help to predict foreign box office success?

Panel labels: 499 biggest movies up to 2003; 500 biggest movies up to 2003

Note the influence of an outlier.

Movies.mtp


Average Box Office by Internet Buzz Index

[Figure: each plotted marker = Average Box Office for Buzz in Interval]


Covariation

Is there a conditional expectation?

The data suggest that the average of Box Office increases as Buzz increases.

Average Box Office = f(Buzz) is the “Regression of Box Office on Buzz”
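One way to look at a conditional expectation numerically is to group the observations by intervals of Buzz and average Box Office within each interval. The sketch below does this in Python. The data values, the interval width, and the function name `conditional_means` are hypothetical placeholders, not the movie data behind the slides.

```python
from collections import defaultdict

def conditional_means(x, y, width):
    """Average y within intervals of x of the given width.

    Returns a dict mapping each interval's lower edge to the mean of y
    for the observations whose x falls in [edge, edge + width).
    """
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        edge = (xi // width) * width  # lower edge of the interval containing xi
        groups[edge].append(yi)
    return {edge: sum(vals) / len(vals) for edge, vals in sorted(groups.items())}

# Hypothetical (made-up) data: a buzz index and box office revenue.
buzz = [1, 2, 3, 4, 5, 6, 7, 8]
box_office = [10.0, 14.0, 12.0, 20.0, 22.0, 30.0, 28.0, 40.0]

means = conditional_means(buzz, box_office, width=2)
for edge, avg in means.items():
    print(f"Buzz in [{edge}, {edge + 2}): average box office = {avg:.1f}")
```

If the interval averages rise as the lower edge rises, the data are consistent with an increasing f(Buzz); the regression line is a smooth, linear summary of the same pattern.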


Is There Really a Relationship?

Box Office is obviously not exactly equal to f(Buzz) for any function f. But the two do appear to be “related,” perhaps statistically, that is, stochastically. There is a covariance, and the linear regression summarizes it.

A predictor would be Box Office = a + b Buzz. Is b really > 0? What would be implied by b > 0?


Using Regression to Predict

[Figure: Fitted line plot with regression line and 95% PI (prediction interval). x-axis: Domestic, 0 to 600; y-axis: Overseas, 0 to 1250]

Regression of Foreign Box Office on Domestic: Overseas = 6.693 + 1.051 Domestic

S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%

Predictor: Overseas = a + b Domestic. The prediction will not be perfect. We construct a range of “uncertainty.”

Menu path: Stat > Regression > Fitted Line Plot. Options: Display Prediction Interval.

The equation would not predict Titanic.
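As a rough illustration of prediction with a range of uncertainty, the sketch below plugs a value of Domestic into the fitted equation and attaches a band of plus or minus 1.96 S. The coefficients and S come from the slide; the domestic value of 200 and the function names are hypothetical. The band is a simplifying assumption: it ignores the uncertainty in the estimated a and b, so it is somewhat narrower than the prediction interval drawn in the plot.

```python
# Coefficients and S are taken from the fitted line plot on the slide.
A = 6.693      # intercept
B = 1.051      # slope
S = 73.0041    # standard error of the regression

def predict_overseas(domestic):
    """Point prediction from Overseas = a + b * Domestic."""
    return A + B * domestic

def rough_interval(domestic, z=1.96):
    """Rough 95% band: prediction +/- z * S.

    Assumption: this ignores the extra width that comes from estimating
    a and b, so it is narrower than the exact prediction interval,
    most noticeably far from the mean of Domestic.
    """
    center = predict_overseas(domestic)
    return center - z * S, center + z * S

domestic = 200.0  # hypothetical domestic box office, in the slide's units
low, high = rough_interval(domestic)
print(f"Predicted overseas: {predict_overseas(domestic):.1f}")
print(f"Rough 95% range: {low:.1f} to {high:.1f}")
```

The width of the band relative to the prediction itself shows why the slide says the prediction will not be perfect.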


Effect of an Outlier is to Twist the Regression Line

[Figure: Regression of Foreign Box Office on Domestic, without Titanic. x-axis: DomesticBox, 0 to 500; y-axis: ForeignBox, 0 to 700]

ForeignBox = 20.78 + 0.9202 DomesticBox

S = 66.9303, R-Sq = 47.4%, R-Sq(adj) = 47.3%

[Figure: Regression of Foreign Box Office on Domestic, with Titanic. x-axis: Domestic, 0 to 600; y-axis: Overseas, 0 to 1400]

Overseas = 6.693 + 1.051 Domestic

S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%

Without Titanic, slope = 0.9202

With Titanic, slope = 1.051


Least Squares Regression


[Figure: a line with the intercept a and the slope b labeled]

How to compute the y intercept, a, and the slope, b, in y = a + bx.


Fitting a Line to a Set of Points

[Figure: Scatterplot of PerCapitaG vs Income. x-axis: Income, 21000 to 27000; y-axis: PerCapitaG, 5.6 to 6.4]

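Choose a and b to minimize the sum of squared residuals (Gauss’s method of least squares):

$$SS = \sum_{i=1}^{N} [y_i - a - b x_i]^2 = \sum_{i=1}^{N} [y_i - (a + b x_i)]^2 = \sum_{i=1}^{N} e_i^2$$

Residuals:

$$e_i = y_i - (a + b x_i) = y_i - \hat{y}_i$$

Predictions:

$$\hat{y}_i = a + b x_i$$

[Figure labels: Y_i, X_i, and the predictions a + b x_i]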
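The sum of squared residuals can be evaluated for any candidate line, which makes the idea of "choosing a and b to minimize SS" concrete. The sketch below compares two candidate lines on a small made-up data set; the data and the function name `sum_of_squares` are hypothetical.

```python
def sum_of_squares(a, b, x, y):
    """SS = sum of squared residuals e_i = y_i - (a + b * x_i)."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Two candidate lines. The second uses the least squares values for these data.
ss_guess = sum_of_squares(0.0, 2.0, x, y)
ss_least_squares = sum_of_squares(0.05, 1.99, x, y)

print(f"SS for a = 0.00, b = 2.00: {ss_guess:.4f}")
print(f"SS for a = 0.05, b = 1.99: {ss_least_squares:.4f}")
```

The first line looks like a sensible guess by eye, yet the least squares line still attains a smaller SS; no other choice of a and b does better on these data.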


Computing the Least Squares Parameters a and b

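4 numbers are needed:

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = 20.721, \qquad \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 0.48242$$

$$\mathrm{Var}(x) = s_x^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2 = 0.02453$$

$$\mathrm{Cov}(x,y) = s_{xy} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) = 1.784$$

$$b = \frac{s_{xy}}{s_x^2} = \frac{1.784}{0.02453} = 72.7181$$

$$a = \bar{y} - b\bar{x} = 20.721 - (72.7181)(0.48242) = -14.36$$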
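The arithmetic on this slide can be repeated directly from the four summary numbers. The sketch below uses the values as printed; the variable names are illustrative. Because the printed inputs are rounded, the slope computed here comes out near 72.73 rather than 72.7181. A plausible explanation, stated here as an assumption, is that the slide used unrounded values of the variance and covariance.

```python
# The four summary numbers as printed (rounded) on the slide.
y_bar = 20.721
x_bar = 0.48242
var_x = 0.02453   # s_x^2
cov_xy = 1.784    # s_xy

b = cov_xy / var_x      # slope: covariance divided by variance of x
a = y_bar - b * x_bar   # intercept: forces the line through (x_bar, y_bar)

print(f"b = {b:.4f}")
print(f"a = {a:.2f}")
```

The intercept still rounds to the slide's value, since the small difference in b is multiplied by an x mean of less than 0.5.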


Least Squares Uses Calculus

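Scaling the sum of squares by 1/(N-1) does not change the minimizing values of a and b:

$$SS = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - a - b x_i)^2$$

$$\frac{\partial SS}{\partial a} = \frac{1}{N-1}\sum_{i=1}^{N} \frac{\partial (y_i - a - b x_i)^2}{\partial a} = \frac{1}{N-1}\sum_{i=1}^{N} 2(y_i - a - b x_i)(-1) = 0$$

$$\frac{\partial SS}{\partial b} = \frac{1}{N-1}\sum_{i=1}^{N} \frac{\partial (y_i - a - b x_i)^2}{\partial b} = \frac{1}{N-1}\sum_{i=1}^{N} 2(y_i - a - b x_i)(-x_i) = 0$$

The solution is

$$a = \bar{y} - b\bar{x} \quad \text{where} \quad b = \frac{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$$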
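The two first-order conditions can be checked numerically: at the least squares solution the residuals sum to zero, and the residuals multiplied by x also sum to zero. The sketch below computes a and b from the closed-form solution on a small made-up data set and evaluates both sums. The data and the function name `least_squares` are hypothetical, and the 1/(N-1) factors are omitted because they cancel in the ratio for b.

```python
def least_squares(x, y):
    """Return (a, b) from b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(x, y)
residuals = [yi - a - b * xi for xi, yi in zip(x, y)]

# The two first-order conditions: both sums should be zero up to rounding error.
cond_a = sum(residuals)
cond_b = sum(e * xi for e, xi in zip(residuals, x))
print(f"a = {a:.4f}, b = {b:.4f}")
print(f"sum of residuals = {cond_a:.2e}")
print(f"sum of residuals times x = {cond_b:.2e}")
```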


b Measures Covariation

Predictor Box Office = a + b Buzz.

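$$b = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}$$

Note the numerator of b is the covariance of x and y. If Cov(x,y) = 0, then b = 0.

Also, since the correlation is

$$r_{xy} = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)}\sqrt{\mathrm{Var}(y)}} = \frac{s_{xy}}{s_x s_y},$$

$$b = \frac{s_y}{s_x} \times \text{Correlation of } x \text{ and } y.$$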
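The two expressions for the slope, covariance over variance and correlation times the ratio of standard deviations, can be compared on any data set. The sketch below does so on made-up numbers; the data and the helper name `sample_stats` are hypothetical.

```python
import math

def sample_stats(x, y):
    """Return (s_x, s_y, s_xy) using the N - 1 divisor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return math.sqrt(var_x), math.sqrt(var_y), s_xy

# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 4.0, 5.0, 8.0]
y = [3.0, 2.0, 6.0, 5.0, 9.0]

s_x, s_y, s_xy = sample_stats(x, y)
b_from_cov = s_xy / s_x ** 2   # b = Cov(x, y) / Var(x)
r = s_xy / (s_x * s_y)         # correlation of x and y
b_from_r = r * s_y / s_x       # b = (s_y / s_x) * r

print(f"b from covariance:  {b_from_cov:.4f}")
print(f"b from correlation: {b_from_r:.4f}")
```

Both routes give the same slope, so b and r always share a sign, and b = 0 exactly when r = 0.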


Is There Really a Statistically Valid Relationship?

We reframe the question.

If b = 0, then there is no (linear) relationship. How can we find out if the regression relationship is just a fluke due to a particular observed set of points? To be studied later in the course.

BoxOffice = a + b Cntwait3. Is b really > 0?


Interpreting the Function

[Figure: Fitted line plot, DALE = 35.16 + 3.611 EDUC, with the intercept a and the slope b marked. x-axis: EDUC, 0 to 12; y-axis: DALE, 20 to 80]

S = 7.87034, R-Sq = 59.2%, R-Sq(adj) = 59.0%

a = the life expectancy associated with 0 years of education. No country has 0 average years of education. The regression only applies in the range of experience.

b = the increase in life expectancy associated with each additional year of average education.

The range of experience (education)
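The interpretation of b can be seen by evaluating the fitted line at two education levels one year apart. The sketch below uses the coefficients from the fitted line plot; the education values 6 and 7 and the function name `predicted_dale` are hypothetical choices for illustration.

```python
A = 35.16   # intercept from the fitted line plot
B = 3.611   # slope from the fitted line plot

def predicted_dale(educ):
    """Fitted value from DALE = a + b * EDUC."""
    return A + B * educ

# Hypothetical education levels chosen for illustration.
for educ in (6.0, 7.0):
    print(f"EDUC = {educ}: predicted DALE = {predicted_dale(educ):.2f}")

# One more year of average education shifts the fitted value by b.
difference = predicted_dale(7.0) - predicted_dale(6.0)
print(f"Difference for one additional year: {difference:.3f}")
```

Evaluating the function at EDUC = 0 returns a, but as noted above that is an extrapolation outside the range of experience.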


Covariation and Causality

[Figure: Fitted line plot, DALE = 35.16 + 3.611 EDUC. x-axis: EDUC, 0 to 12; y-axis: DALE, 20 to 80]

S = 7.87034, R-Sq = 59.2%, R-Sq(adj) = 59.0%

Does more education make you live longer (on average)?


Causality?

Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86.

Ht.  Inc.    Ht.  Inc.    Ht.  Inc.
70   2990    68   2910    75   3150
67   2870    66   2840    68   2860
69   2950    71   3180    69   2930
70   3140    68   3020    76   3210
65   2790    73   3220    71   3180
73   3230    73   3370    66   2670
64   2880    70   3180    69   3050
70   3140    71   3340    65   2750
69   3000    69   2970    67   2960
73   3170    73   3240    70   3050

Estimated Income = -451 + 50.2 Height

Correlation = 0.84 (!)
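The estimated equation and the correlation can be recomputed from the 30 height and income pairs listed on the slide. The sketch below applies the least squares formulas to those pairs; the variable names are illustrative, and the reading of the flattened table as ten rows of three (height, income) pairs is an assumption about the original layout.

```python
import math

# Height (inches) and income ($/month) pairs as listed on the slide.
data = [
    (70, 2990), (68, 2910), (75, 3150),
    (67, 2870), (66, 2840), (68, 2860),
    (69, 2950), (71, 3180), (69, 2930),
    (70, 3140), (68, 3020), (76, 3210),
    (65, 2790), (73, 3220), (71, 3180),
    (73, 3230), (73, 3370), (66, 2670),
    (64, 2880), (70, 3180), (69, 3050),
    (70, 3140), (71, 3340), (65, 2750),
    (69, 3000), (69, 2970), (67, 2960),
    (73, 3170), (73, 3240), (70, 3050),
]

n = len(data)
h_bar = sum(h for h, _ in data) / n
inc_bar = sum(inc for _, inc in data) / n

# Sums of squares and cross products around the means.
s_hh = sum((h - h_bar) ** 2 for h, _ in data)
s_ii = sum((inc - inc_bar) ** 2 for _, inc in data)
s_hi = sum((h - h_bar) * (inc - inc_bar) for h, inc in data)

b = s_hi / s_hh                     # slope
a = inc_bar - b * h_bar             # intercept
r = s_hi / math.sqrt(s_hh * s_ii)   # correlation

print(f"Estimated Income = {a:.0f} + {b:.1f} Height")
print(f"Correlation = {r:.2f}")
```

A strong correlation here says nothing by itself about whether height causes income, which is the point of the slide's question mark.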


Using Regression to Predict

[Figure: Fitted line plot with regression line and 95% PI (prediction interval). x-axis: Domestic, 0 to 600; y-axis: Overseas, 0 to 1250]

Regression of Foreign Box Office on Domestic: Overseas = 6.693 + 1.051 Domestic

S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%


Summary

Using scatter plots to examine data
The linear regression
  Description
  Predict
  Control
  Understand
Linear regression computation
  Computation of slope and constant term
  Prediction
Covariation vs. Causality
Interpretation of the regression line as a conditional expectation