53
BUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear Regression Data visualization, conditional distributions, correlation, and least squares regression Max H. Farrell The University of Chicago Booth School of Business

BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

  • Upload
    tranbao

  • View
    226

  • Download
    2

Embed Size (px)

Citation preview

Page 1: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

BUS 41100 Applied Regression Analysis

Week 1: Introduction, Simple Linear

Regression

Data visualization, conditional distributions, correlation,and least squares regression

Max H. FarrellThe University of Chicago Booth School of Business

Page 2: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

The basic problem

Available data on two or morevariables

Formulate amodel topredict orestimate avalue of interest

Use estimateto make a(business)decision

1

Page 3: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Regression: What is it?

I Simply: The most widely used statistical tool for

understanding relationships among variables

I A conceptually simple method for investigating

relationships between one or more factors and an

outcome of interest

I The relationship is expressed in the form of an equation

or a model connecting the outcome to the factors

2

Page 4: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Regression in business

I Optimal portfolio choice:

- Predict the future joint distribution of asset returns

- Construct an optimal portfolio (choose weights)

I Determining price and marketing strategy:

- Estimate the effect of price and advertisement on sales

- Decide what is optimal price and ad campaign

I Credit scoring model:

- Predict the future probability of default using known

characteristics of borrower

- Decide whether or not to lend (and if so, how much)

3

Page 5: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Regression in everything

Straight prediction questions:

I What price should I charge for my car?

I What will the interest rates be next month?

I Will this person like that movie?

Explanation and understanding:

I Does your income increase if you get an MBA?

I Will tax incentives change purchasing behavior?

I Is my advertising campaign working?

4

Page 6: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Data Visualization

Example: pickup truck prices on Craigslist

We have 4 dimensions to consider.

> data <- read.csv("pickup.csv")

> names(data)

[1] "year" "miles" "price" "make"

A simple summary is

> summary(data)

year miles price make

Min. :1978 Min. : 1500 Min. : 1200 Dodge:10

1st Qu.:1996 1st Qu.: 70958 1st Qu.: 4099 Ford :12

Median :2000 Median : 96800 Median : 5625 GMC :24

Mean :1999 Mean :101233 Mean : 7910

3rd Qu.:2003 3rd Qu.:130375 3rd Qu.: 9725

Max. :2008 Max. :215000 Max. :23950 5

Page 7: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

First, the simple histogram (for each continuous variable).

> par(mfrow=c(1,3))

> hist(data$year)

> hist(data$miles)

> hist(data$price)

Histogram of data$year

data$year

Fre

quen

cy

1975 1980 1985 1990 1995 2000 2005 2010

05

1015

Histogram of data$miles

data$miles

Fre

quen

cy

0 50000 100000 150000 200000 250000

05

1015

Histogram of data$price

data$price

Fre

quen

cy

0 5000 10000 15000 20000 25000

05

1015

20Data is “binned” and plotted bar height is the count in each bin.

6

Page 8: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

We can use scatterplots to compare two dimensions.

> par(mfrow=c(1,2))

> plot(data$year, data$price, pch=20)

> plot(data$miles, data$price, pch=20)

●●

●●

●●

●●

●●

●●

●●

1980 1990 2000

5000

1500

0

data$year

data

$pric

e

●●

●●

●●

●●

●●

●●

●●

0 50000 150000

5000

1500

0

data$miles

data

$pric

e

7

Page 9: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Add color to see another dimension.

> par(mfrow=c(1,2))

> plot(data$year, data$price, pch=20, col=data$make)

> legend("topleft", levels(data$make), fill=1:3)

> plot(data$miles, data$price, pch=20, col=data$make)

●●

●●

●●

●●

●●

●●

●●

1980 1990 2000

5000

1500

0

data$year

data

$pric

e

DodgeFordGMC

●●

●●

●●

●●

●●

●●

●●

0 50000 150000

5000

1500

0

data$miles

data

$pric

e

8

Page 10: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Boxplots are also super useful.

> year_boxplot <- factor(1*(year<1995) + 2*(1995<=year & year<2000)

+ 3*(2000<=year & year<2005) + 4*(2005<=year & year<2009),

labels=c("<1995", "’95-’99", "2000-’04", "’05-’09"))

> boxplot(price ~ make, ylab="Price ($)", main="Make")

> boxplot(price ~ year_boxplot, ylab="Price ($)", main="Year")

Dodge Ford GMC

5000

1500

0

Make

Pric

e ($

)

●●

<1995 '95−'99 2000−'04 '05−'0950

0015

000

Year

Pric

e ($

)

The box is the Interquartile Range (IQR; i.e., 25th to 75th %), with the

median in bold. The whiskers extend to the most extreme point which is

no more than 1.5 times the IQR width from the box. 9

Page 11: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Regression is what we’re really here for.

> plot(data$year, data$price, pch=20, col=data$make)

> abline(lm(price ~ year),lwd=1.5)

●●

●●

●●

●●

●●

●●

●●

1980 1990 2000

5000

1500

0

data$year

data

$pric

e

DodgeFordGMC

●●

●●

●●

●●

●●

●●

●●

0 50000 150000

5000

1500

0data$miles

data

$pric

e

I Fit a line through the points, but how?

I lm stands for linear model

I Rest of the course: formalize and explore this idea 10

Page 12: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Conditional distributions

Regression models are really all about modeling the

conditional distribution of Y given X.

Why are conditional distributions important?

We want to develop models for forecasting. What we are

doing is exploiting the information in the conditional

distribution of Y given X.

The conditional distribution is obtained by “slicing” the point

cloud in the scatterplot to obtain the distribution of Y

conditional on various ranges of X values.

11

Page 13: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Conditional v. marginal distribution

Consider a regression of house price on size:

● ●

●●

●●

● ●

●●

●●

●●

● ●

●●

●● ●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

● ●

● ●●

●●

● ●

●● ●

● ●

●●

0.5 1.0 1.5 2.0 2.5 3.0 3.5

100

200

300

400

size

price

marg 1−1.5 1.5−2 2−2.5 2.5−3 3−3.5

100

200

300

400

price

{“slice” of data

conditionaldistribution

of price given3 < size < 3.5

regressionline

marginaldistribution

of price

12

Page 14: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Key observations from these plots:

I Conditional distributions answer the forecasting problem:

if I know that a house is between 1 and 1.5 1000 sq.ft.,

then the conditional distribution (second boxplot) gives

me a point forecast (the mean) and prediction interval.

I The conditional means (medians) seem to line up along

the regression line.

I The conditional distributions have much smaller

dispersion than the marginal distribution.

13

Page 15: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

This suggests two general points:

I If X has no forecasting power, then the marginal and

conditionals will be the same.

I If X has some forecasting information, then conditional

means will be different than the marginal or overall mean

and the conditional standard deviation of Y given X will

be less than the marginal standard deviation of Y .

14

Page 16: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Intuition from an example where X has no predictive power.

marg 0 1 2 3 4

100

200

300

400

price

● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●●

●●

●●

●● ●

●●

●●

0 1 2 3 4

100

200

300

400

# stops

pric

e

House price v.number of stopsigns (Y) within atwo-block radiusof a house (X)

See that in this case the marginal andconditionals are not all that different

15

Page 17: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Predicting house prices

Problem:

I Predict market price based on observed characteristics

Solution:

I Look at property sales data where we know the price and

some observed characteristics.

I Build a decision rule that predicts price as a function of

the observed characteristics.

=⇒ We have to define the variables of interest and develop a

specific quantitative measure of these variables

16

Page 18: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

What characteristics do we use?

I Many factors or variables affect the price of a house

I size of houseI number of bathsI garage, air conditioning, etc.I size of landI location

I Easy to quantify price and size but what about other

variables such as location, aesthetics, workmanship, etc?

17

Page 19: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

To keep things super simple, let’s focus only on size of the

house.

The value that we seek to predict is called the

dependent (or output) variable, and we denote this as

I Y = price of house (e.g. thousands of dollars)

The variable that we use to guide prediction is the

explanatory (or input) variable, and this is labelled

I X = size of house (e.g. thousands of square feet)

18

Page 20: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

What do the data look like?

> size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,

+ 1.8,2,2.4,2.5,2.7,3.2,3.5)

> price <- c(70,83,74,93,89,58,85,114,

+ 95,100,138,111,124,161,172)

> plot(size, price, pch=20)

1.0 1.5 2.0 2.5 3.0 3.5

6080

100

120

140

160

size

pric

e

19

Page 21: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Appears to be a linear relationship between price and size:

I as size goes up, price goes up.

Fitting a line by the “eyeball” method:

> abline(35, 40, col="red")

1.0 1.5 2.0 2.5 3.0 3.5

6080

100

120

140

160

size

pric

e

20

Page 22: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Recall that the equation of a line is:

Y = b0 + b1X

where b0 is the intercept and b1 is the slope.

In the house price example

I our “eyeball” line has b0 = 35, b1 = 40.

I predict the price of a house when we know only size

I just read the value off the line that we’ve drawn.

I The intercept value is in units of Y ($1,000).

I The slope is in units of Y per units of X

($1,000/1,000 sq ft).21

Page 23: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Recall how the slope (b1) and intercept (b0) work together

graphically.

Y

X

b0

2 1

b1

Y = b0 + b1X

22

Page 24: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

What is a good line?

Can we do better than the eyeball method?

We desire a strategy for estimating the slope and intercept

parameters in the model Y = b0 + b1X.

That involves

I choosing a criteria, i.e., quantifying how good a line is

I and matching that with a solution, i.e., finding the best

line subject to that criteria.

23

Page 25: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Although there are lots of ways to choose a criteria

I only a small handful lead to solutions that are “easy” to

compute,

I and which have nice statistical properties (more later).

Most reasonable criteria involve measuring the amount by

which the fitted value obtained from the line differs from the

observed value of the response value(s) in the data.

This amount is called the residual.

I Good lines produce small residuals.

I Good lines produce accurate predictions.

24

Page 26: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Yi

Xi

Ŷi ei = Yi – Ŷi = Residual i

The line is our predictions or fitted values: Yi = b0 + b1Xi.

The residual ei is the discrepancy between the fitted Yi and

observed Yi values.

I Note that we can write Yi = Yi + (Yi − Yi) = Yi + ei.

25

Page 27: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Least squares

A reasonable goal is to minimize the size of all residuals:

I If they were all zero we would have a perfect line.

I Trade-off between moving closer to some points and at

the same time moving away from other points.

Since some residuals are positive and some are negative, we

need one more ingredient.

I |ei| treats positives and negatives equally.

I So does e2i , which is easier to work with mathematically.

Least squares chooses b0 and b1 to minimize∑n

i=1 e2i .

26

Page 28: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Yi

Xi

positive residuals negative residuals

Choose the line to minimize the sum of the squares of the residuals,

n∑i=1

e2i =n∑

i=1

(Yi − Yi)2 =n∑

i=1

(Yi − [b0 + b1Xi])2.

27

Page 29: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

R’s lm command provides a least squares fit.

> reg <- lm(price ~ size)

> reg

Call:

lm(formula = price ~ size)

Coefficients:

(Intercept) size

38.88 35.39

I lm stands for “linear model”; it’ll be our workhorse

28

Page 30: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

> abline(reg, col="green")

> legend("bottomright", c("eyeball", "LS"),

+ col=c("red", "green"), lty=1)

1.0 1.5 2.0 2.5 3.0 3.5

6080

100

120

140

160

size

pric

e

eyeballLS

I The least squares line is different than our eyeballed line

I . . . but why do we like it better?

29

Page 31: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Properties of the least squares fit

Developing techniques for model validation and criticism

requires a deeper understanding of the least squares line.

The fitted values (Yi) and “residuals” (ei) obtained from the

least squares line have some special properties.

I From now on “obtained from the least squares line” will

be implied (and therefore not repeated) whenever we talk

about Yi and ei.

Let’s look at the housing data analysis to figure out what

some of these properties are . . .

. . . but first, review covariance and correlation.

30

Page 32: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Correlation and covariance

Cov(X,Y ) = E [(X − E[X])(Y − E[Y ])]

0.0 0.2 0.4 0.6 0.8 1.0

-10

12

34

x

y

E[Y]

E[X]

X and Y vary with each other around their means.

31

Page 33: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

corr(X,Y ) = Cov(X,Y )/σXσY

-3 -2 -1 0 1 2 3

-3-2

-10

12

3corr = 1

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

corr = .5

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

corr = .8

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

corr = -.8

32

Page 34: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Correlation only measures linear relationships:

I corr(X, Y ) = 0 does not mean the variables are unrelated!

-3 -2 -1 0 1 2

-8-6

-4-2

0

corr = 0.01

0 5 10 15 20

05

1015

20

corr = 0.72

Also be careful with influential observations.

33

Page 35: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Least squares properties

1. The fitted values are perfectly correlated with the inputs.

> plot(size, reg$fitted, pch=20, xlab="X",

+ ylab="Fitted Values")

> text(x=3, y=80, col=2, cex=1.5,

+ paste("corr(y.hat, x) =", cor(size, reg$fitted)))

●●

1.0 1.5 2.0 2.5 3.0 3.5

8010

012

014

016

0

X

Fitt

ed V

alue

s

corr(y.hat, x) = 1

34

Page 36: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

2. The residuals have zero correlation with inputs, i.e.

“stripped of all linearity”.

> plot(size, reg$fitted-price, pch=20, xlab="X", ylab="Residuals")

> text(x=3.1, y=26, col=2, cex=1.5,

+ paste("corr(e, x) =", round(cor(size, reg$fitted-price),2)))

> text(x=3.1, y=19, col=4, cex=1.5,

+ paste("mean(e) =", round(mean(reg$fitted-price),0)))

> abline(h=0, col=8, lty=2)

● ●

1.0 1.5 2.0 2.5 3.0 3.5

−20

−10

010

2030

X

Res

idua

ls

corr(e, x) = 0mean(e) = 0

35

Page 37: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Intuition for the relationship between Y , e, and X?

I Lets consider some “crazy” alternative line:

1.0 1.5 2.0 2.5 3.0 3.5

6080

100

120

140

160

X

Y

LS line: 38.9 + 35.4 X

Crazy line: 10 + 50 X

36

Page 38: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

This is a bad fit! We are underestimating the value of small

houses and overestimating the value of big houses.

1.0 1.5 2.0 2.5 3.0 3.5

−20

−10

010

2030

X

Cra

zy R

esid

uals

corr(e, x) = −0.7mean(e) = 1.8

I Clearly, we have left some predictive ability on the table!

37

Page 39: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

As long as the correlation between e and X is non-zero, we

could always adjust our prediction rule to do better:

minn∑

i=1

e2i equivalent to corr(e,X)=0 &1

n

n∑i=1

ei =0

We need to exploit all of the (linear!) predictive power in the

X values and put this into Y ,

I leaving no “Xness” in the residuals.

In Summary: Y = Y + e where:

I Y is “made from X”; corr(X, Y ) = 1;

I e is unrelated to X; corr(X, e) = 0.

38

Page 40: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

What is a good line?Statistics version!

In a happy coincidence, the least squares line makes good

statistical sense too. To see why, we need a model.

Simple Linear Regression (SLR) Model:

Y = β0 + β1X + ε

What’s important right now? (more next week)

I It is a model, so we are assuming this relationship holds

for some fixed, but unknown values of β0, β1.

I The error ε is independent of X, mean zero.

39

Page 41: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

The model Y = β0 + β1X + ε is something we assume but

don’t know. Can we estimate it?

cov(X, Y ) = cov(X, β0 + β1X + ε)

= cov(X, β1X)

= β1var(X)

Re-write this to get: β1 =cov(X, Y )

var(X)= corr(X, Y )

σyσx

.

I Interpretation: β1 is correlation in units Y per units X.

But still unknown!

I Replace with analogues from the data:

b1 =sample cov(X, Y )

sample var(X)= rxy

sysx 40

Page 42: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

To summarize:

R’s lm(Y ∼ X) function

I finds the coefficients b0 and b1 characterizing the “least

squares” line Y = b0 + b1X.

I That is it minimizes∑n

i=1(Yi − Yi)2 =∑n

i=1 e2i .

I Equivalent to: corr(e,X)=0 & 1n

∑ni=1 ei =0

The least squares formulas are (cf slide 40)

b1 = rxysysx

and b0 = Y − b1X.

41

Page 43: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Example: wage data

Data:

I Greenberg and Kosters, 1970, Rand Corporation.

I 39 demographic groupings of 6000 households with themale head earning less than $15,000 annually in 1966.

Problem:

I Estimate the relationship between pay and labor supply.

I Use this information to influence social policy decisionsand the debate on a guaranteed national wage.

Possible solution:

I Fit a linear model for the effect of pay on labor supply forthe working poor.

42

Page 44: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Read in the data, and change to X and Y for convenience.

> D <- read.csv("wages.csv")

> Y <- D$HRS

> X <- D$RATE

Use correlation to slope the line

> b1 <- cor(X,Y)*sd(Y)/sd(X)

> b1

[1] 80.93679

Make sure that X and Y are on the line.

> b0 <- mean(Y) - mean(X)*b1

> b0

[1] 1913.009

43

Page 45: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

> plot(X,Y, xlab="rate", ylab="hours")

> abline(b0, b1, col="red")

●● ●

●●

● ●

●●

1.5 2.0 2.5 3.0 3.5

2000

2100

2200

rate

hour

s

44

Page 46: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

The residual errors are defined as ei = Yi − (b0 + b1Xi).

> e <- Y - (b0 + b1*X)

> plot(X, e, xlab="rate", ylab="residuals", pch=20)

> abline(h=0, col=8, lwd=2)

● ●

●●

●●

●●

●●●

1.5 2.0 2.5 3.0 3.5

−50

050

100

rate

resi

dual

s

I What does this plot of ei versus Xi say about the fit?

45

Page 47: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

How does labor supply change with pay?

We’ve estimated the linear model:

hours = 1913 + 81× rate + e

It increases: Every dollar extra per hour leads to an increase of

81 expected annual hours worked.

What would this mean for the debate (at the time of the

study) on national wage/negative income tax?

46

Page 48: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Questions remain ...

1. Does this apply today?

2. Does the same hold for those who make more money?

3. How good is this estimate?

4. Is pay rate the only variable of importance?

5. Do higher wages cause more work?

47

Page 49: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Steps in a regression analysis

1. State the problem

2. Data collection

3. Model fitting & estimation (this class)

3.1 Model specification (linear? logistic?)

3.2 Select potentially relevant variables

3.3 Model fitting (least squares)

3.4 Model validation and criticism

3.5 Back to 3.1? Back to 2?

4. Answering the posed questions

But that oversimplifies a bit;

I it is more iterative, and can be more art than science.48

Page 50: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Course Overview

W1: Introduction, The simple linear regression (SLR) model

W2: Inference for SLR

W3: Multiple linear regression (MLR)

W4: Regression diagnostics and data transformations

W5: MLR: Causal inference, model building I

W6: Midterm

W7: MLR: model building II, data mining

W8: Generalized Linear Models, classification

W9: Time series models and autoregression

W10: Advanced topics in dependence (if time)

W11: Final exam and Project due

49

Page 51: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Administrative details

Syllabus

I Lots of info! Please check before you ask. Pretty please.

Slides

I Not a stand-alone resource for learning

Course Website

I All class material is already postedI HW solutions posted after due date

Piazza Q&A

I Post all questions here!I Student & Instructor Answers

More help

I Office hoursI See syllabus for: books, R help, etc 50

Page 52: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Your work

Turned-in work: clear, concise, and on message

I Fewer plots usually betterI Results and analysis, not output/code

Homework

I Not exam practice! Not similar at allI Reinforce & extend ideas, challenge youI Open-ended analysis

Exams

I Narrower scopeI Test core concepts/abilitiesI Look at sample exams to get a sense of style

Project: Your glimpse at real life! 51

Page 53: BUS 41100 Applied Regression Analysis Week 1: …faculty.chicagobooth.edu/max.farrell/bus41100/week1-slides.pdfBUS 41100 Applied Regression Analysis Week 1: Introduction, Simple Linear

Glossary of symbols

I X = input, explanatory variable, or covariate.

I Y = output, dependent variable, or response.

I sxy is covariance and rxy is the correlation, sx and sy are

standard deviation of X and Y respectively

I rxy = sxysxsy

.

I b0 = least squares estimate of the intercept

I b1 = least squares estimate of the slope

I Y is the fitted value b0 + b1X

I ei is the residual Yi − Yi.

52