Linear Regression

Simple Linear Regression

Using one variable to …

1) explain the variability of another variable

2) predict the value of another variable

Both are accomplished with the line that best fits a scatterplot.


Recall -- Definitions

• Response (dependent) variable
  – the variable whose variability is being explained or whose values are predicted
  – plotted on the y-axis
• Explanatory (independent, predictor) variable
  – used to explain variability or make predictions
  – plotted on the x-axis

Review -- Line Characteristics

1. What is the most common equation of a line?

2. What does the slope tell us?

3. What does the intercept tell us?
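For reference (answers implied by later slides rather than stated here): the most common equation of a line is y = mx + b, where the slope m is the change in y for each one-unit increase in x, and the intercept b is the value of y when x = 0. Regression output usually writes the same line as Y = intercept + slope·X, e.g. Y = 1.41 - 0.00124X later in these slides.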


Finding the Best-Fit Line: Candidate Lines

[Scatterplot of Y versus X with several candidate lines drawn through the points]

We need an objective criterion


Finding the Best-Fit Line: Definition -- Predicted Y (ŷ)

• The y-coordinate of the point on the line that corresponds to the observed x value

[Plot showing an observed point (x, y) and the corresponding point (x, ŷ) on the line]

• Plug the value of x into the equation of the line to get ŷ


Finding the Best-Fit Line: Definition -- Residual

[Scatterplot of Y versus X with a fitted line and vertical segments from each observed point to the line]

Residual = Observed y - Predicted ŷ


Finding the Best-Fit Line: Minimize the Sum of Residuals?

[Scatterplot of Y versus X with a candidate line and residuals above and below it]

Minimizing the plain sum of residuals does not work: positive and negative residuals cancel, so very different lines can have a residual sum near zero.


Finding the Best-Fit Line: Minimize the Sum of Squared Residuals?

• RSS = sum of squared residuals
• The best-fit line is the line, out of all possible lines, that minimizes the RSS
• Should the RSS be computed for all lines? No -- the minimizing slope and intercept can be computed directly (see the R sketch below):

slope = r (s_y / s_x)
intercept = ȳ - slope · x̄

where r is the correlation between x and y, s_x and s_y are their standard deviations, and x̄ and ȳ are their means.
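A minimal R check of these formulas, using a small made-up data set (the numbers are illustrative, not from the slides); base R's lm() finds the same line by minimizing the RSS internally, so the RSS never has to be computed line-by-line:

  # illustrative data (not from the slides)
  x <- c(85, 92, 100, 108, 115, 120)
  y <- c(88, 95, 99, 107, 118, 125)

  # slope and intercept from the summary-statistic formulas
  slope <- cor(x, y) * sd(y) / sd(x)
  intercept <- mean(y) - slope * mean(x)

  # the same line from R's built-in least-squares fit
  fit <- lm(y ~ x)
  coef(fit)   # matches c(intercept, slope)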


So …

• It is important to understand
  – where the equation of the line comes from
  – how to interpret the line
• It is not important to compute the best-fit line "by hand"


Example -- Rabbit Metabolic Rate

Katzner et al. (1997; J. Wildl. Man. 78:1053-1062) examined the metabolic rate of pygmy rabbits (Brachylagus idahoensis) in the laboratory. In particular, they wanted to determine if the variability in resting metabolic rate (ml O2 g⁻¹ h⁻¹) at 20°C could be adequately explained by body mass (g).

1. What is the response variable?
   – Resting metabolic rate
2. What is the explanatory variable?
   – Body mass


Example -- Rabbit Metabolic Rate

[Scatterplot of Metabolic Rate versus Mass with the fitted line Y = 1.41 - 0.00124X and R-Sq = 55.4%]

3. In terms of the variables of the problem, what is the equation of the best-fit line?
   – MetRate = 1.41 - 0.00124·Mass


Example -- Rabbit Metabolic Rate

[Same scatterplot of Metabolic Rate versus Mass with the fitted line Y = 1.41 - 0.00124X and R-Sq = 55.4%]

4. In terms of the variables of the problem, interpret the value of the slope.
   – For each additional gram of mass, the metabolic rate decreases by 0.00124 ml O2 g⁻¹ h⁻¹, on average.


Example -- Rabbit Metabolic Rate

[Same scatterplot of Metabolic Rate versus Mass with the fitted line Y = 1.41 - 0.00124X and R-Sq = 55.4%]

5. In terms of the variables of the problem, interpret the value of the y-intercept.
   – Rabbits with no mass would have a metabolic rate of 1.41 ml O2 g⁻¹ h⁻¹, on average -- an extrapolation far outside the observed masses, so the intercept has no direct biological meaning here.


Example -- Rabbit Metabolic Rate

[Same scatterplot of Metabolic Rate versus Mass with the fitted line Y = 1.41 - 0.00124X and R-Sq = 55.4%; the points (450, 0.85), (425, 0.82), and (425, 0.88) are marked]

6. What is the predicted metabolic rate for a mass of 450 g?
7. What is the predicted metabolic rate for a mass of 600 g?
8. What is the residual for a mass of 425 g and a metabolic rate of 0.82 ml O2 g⁻¹ h⁻¹?

(These three questions are worked in the sketch below.)
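A quick check of these three questions in R, using the fitted equation as printed on the slide (the coefficients are rounded, so answers read from the plot may differ slightly in the last digit):

  # best-fit line reported on the slide
  met_rate <- function(mass) 1.41 - 0.00124 * mass

  met_rate(450)          # predicted rate at 450 g: about 0.85
  met_rate(600)          # about 0.67; note 600 g is beyond the plotted masses, so this is an extrapolation
  0.82 - met_rate(425)   # residual = observed - predicted: about 0.82 - 0.88 = -0.06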


One More Regression Statistic

• r² = coefficient of determination
• r² = the proportion of the total variability in the response variable that is explained by knowing the value of the explanatory variable


Visualizing r²

[Scatterplot of Weight versus Height partitioning the total variability in y into the variability explained by the line and the variability remaining]

r² = Variability Explained / Total Variability in y
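A minimal R sketch of this partition, using made-up height/weight numbers (illustrative only); the ratio of explained to total variability equals the R-squared reported by summary():

  # illustrative data (not from the slides)
  height <- c(60, 63, 66, 68, 70, 72, 75)
  weight <- c(115, 128, 140, 152, 158, 170, 185)

  fit <- lm(weight ~ height)

  total     <- sum((weight - mean(weight))^2)   # total variability in y
  remaining <- sum(resid(fit)^2)                # variability remaining (RSS)
  explained <- total - remaining                # variability explained by the line

  explained / total          # r-squared as a proportion
  summary(fit)$r.squared     # same value from the fitted model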


Characteristics of r²

• What range of values can r² be?
  – 0 ≤ r² ≤ 1
• Which relationship is stronger -- r² = 0.5 or 0.9?
• Which relationship gives "better" predictions -- r² = 0.5 or 0.9?


Example -- Rabbit Metabolic Rate

[Same scatterplot of Metabolic Rate versus Mass with the fitted line Y = 1.41 - 0.00124X and R-Sq = 55.4%]

9. What proportion of the variability in metabolic rate is explained by knowing mass?
   – r² = 0.554
10. What is the correlation between metabolic rate and mass?
   – r = -(0.554)^0.5 = -0.744 (negative because the slope of the line is negative)

Simple Linear Regression in R

• Examine the handout:
  – lm()
  – rSquared()
  – fitPlot()
  – predict()
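rSquared() and fitPlot() appear to be helper functions from the course's add-on package rather than base R, so follow the handout for their exact usage; a rough base-R sketch of the same workflow, assuming a data frame named rabbits with columns MetRate and Mass (hypothetical names):

  # fit the simple linear regression
  fit <- lm(MetRate ~ Mass, data = rabbits)

  summary(fit)                            # coefficients and R-squared (in place of rSquared())
  plot(MetRate ~ Mass, data = rabbits)    # scatterplot (in place of fitPlot())
  abline(fit)                             # add the best-fit line

  # predicted metabolic rate at a mass of 450 g
  predict(fit, data.frame(Mass = 450))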


Regression is the Most Used and Most Abused Statistical Technique

• Assumptions (a diagnostic sketch follows this list):
  – A line adequately models the data
  – Homoscedasticity -- the same scatter of points along the entire line
  – Residuals at any given value of the explanatory variable are normally distributed
  – Residuals at any given value of the explanatory variable are independent
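A minimal base-R sketch for checking the first three assumptions with residual plots, assuming a fitted model object fit as above (independence usually has to be judged from how the data were collected rather than from a plot):

  # residuals vs. fitted values: look for curvature (a line does not fit)
  # or a funnel shape (heteroscedasticity)
  plot(fitted(fit), resid(fit))
  abline(h = 0)

  # histogram and normal Q-Q plot of residuals: check approximate normality
  hist(resid(fit))
  qqnorm(resid(fit)); qqline(resid(fit))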


A Line Models the Data

[Four example scatterplots illustrating data that are and are not adequately described by a line]


Homoscedasticity

[Three example scatterplots illustrating the assumption of equal scatter along the entire line]


r² doesn't depend on x because of homoscedasticity

[Scatterplot of Weight versus Height showing that the split of total variability in y into explained and remaining variability is the same across the range of x]


Other Problems

• Outliers
  – a problem because the model does not fit that point
  – may or may not be removed
• Influential Points (see the sketch below)
  – a point that would markedly change the line if it were removed
  – typically an outlier in the x direction
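A rough base-R sketch for flagging candidate outliers and influential points in a fitted model fit; the cutoffs in the comments are common rules of thumb, not fixed standards:

  rstandard(fit)        # standardized residuals; |value| > 2 suggests a possible outlier
  hatvalues(fit)        # leverage; large values are outliers in the x direction
  cooks.distance(fit)   # influence; large values mark points that would move the line

  which(abs(rstandard(fit)) > 2)                        # candidate outliers
  which(cooks.distance(fit) > 4 / length(resid(fit)))   # a commonly used influence cutoff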