146 36 line_fitting

MATH& 146

Lesson 36

Section 5.1

Line Fitting


Scatterplots

A scatterplot is a useful summary of a pair of

numerical variables. It gives a good visual picture

of the relationship between the two variables and

aids the interpretation of the correlation coefficient

and regression model.


Roles for the Variables

It is important to determine which of the two numerical variables goes on the x-axis and which on the y-axis.

When the roles are clear, the explanatory variable goes on the x-axis, and the response variable goes on the y-axis.


Example 1

Pick out which variable you think should be the

explanatory (x) and which variable should be the

response (y).

You have data on the circumference of oak trees

(measured 12 inches from the ground) and their

age (in years). An oak tree in the park has a

circumference of 36 inches, and you wish to know

approximately how old it is.


Looking at Scatterplots

When describing a scatterplot, there are four

things to consider.

a) Direction

b) Form

c) Strength

d) Outliers


Direction

Positive correlation: The two variables generally

increase together or decrease together.

Negative correlation: As one variable increases,

the other generally decreases.

No correlation: The direction is either not clear or

not linear.
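As a rough illustration of direction, the sign of the correlation coefficient can be computed by hand. The sketch below is only an example: the data are made up, and the names `pearson_r`, `direction`, and the `cutoff` value are hypothetical choices, not part of the lesson.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient of paired numerical data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def direction(r, cutoff=0.1):
    """Label the direction; the cutoff is an arbitrary illustrative threshold."""
    if r > cutoff:
        return "positive"
    if r < -cutoff:
        return "negative"
    return "no clear linear direction"

# Made-up data, for illustration only.
hours = [1, 2, 3, 4, 5]
rising = [52, 60, 61, 70, 78]      # tends to increase with hours
falling = [90, 85, 70, 66, 50]     # tends to decrease with hours
u_x = [-2, -1, 0, 1, 2]
u_y = [4, 1, 0, 1, 4]              # clear pattern, but not linear

print(direction(pearson_r(hours, rising)))
print(direction(pearson_r(hours, falling)))
print(direction(pearson_r(u_x, u_y)))
```

The U-shaped data show why "no correlation" does not mean "no relationship": the pattern is clear, but it has no linear direction.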


Form

Linear: The graph appears to stretch out in a

generally consistent, straight form.

Nonlinear: The graph appears to show a curve

(up and down, for example).

[Figure: two example scatterplots, labeled Linear and Nonlinear]

Form

In this course, we will only consider fitting linear

models to our data. If our data show a nonlinear

form, then more advanced techniques should be

used.


Strength

Strong: The points appear to follow a single

stream, whether straight, curved, or bending all

over the place.

Weak: The points appear as a vague cloud with

no discernible pattern or trend.

[Figure: example scatterplots labeled Strong, Moderate, Weak, and None]

Outliers

Look for the unexpected. Outliers can offer valuable insight, but they can also completely distort your analysis. Fortunately, potential outliers are usually easy to spot on the graph.


Example 2

Describe the direction, form, strength, and any unusual

features in the scatterplot.


Example 3

Below is a scatterplot of a person's performance IQ

plotted against their brain size. Describe the direction,

form, strength, and any unusual features in the scatterplot.


Example 4

Below is a scatterplot of the maximum wind speed (in

mph) plotted against the central pressure (in mb) for

163 hurricanes that have hit the United States since

1851. Describe the direction, form, strength, and any

unusual features in the scatterplot.


Example 5

Each of the 2025 elementary schools in Ohio is

plotted below, comparing the passing rate on the

state's fourth-grade reading proficiency test with the

school's poverty level. Describe the direction, form,

strength, and any unusual features in the scatterplot.


Linear Regression

Once we have examined the scatterplot, the next step

would be to quantify the relationship between the

explanatory and response variables.

When we try to use a line to model that relationship,

we call it linear regression.


Linear Regression

Linear regression assumes that the relationship

between two variables, x and y, can be modeled

by a straight line:

$$y = \beta_0 + \beta_1 x$$

where β0 and β1 represent two model parameters

for the intercept and slope.

These parameters are estimated using data, and

we write their point estimates as b0 and b1.

Linear Regression

It is rare for all of the data to fall on a straight line, as

seen in the three scatterplots below. In each case, we

will have some uncertainty regarding our estimates of

the model parameters, β0 and β1. For instance, we

might wonder, should we move the line up or down a

little, or should we tilt it more or less?
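One common way to settle on a single line is least squares, which picks the intercept and slope that minimize the sum of squared vertical distances from the points to the line. The lesson has not named a fitting method at this point, so treat the sketch below as an assumption; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares point estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: forces the line through the point (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up points scattered near a straight line, for illustration only.
xs = [0, 1, 2, 3, 4]
ys = [2.1, 4.9, 8.2, 10.8, 14.0]

b0, b1 = fit_line(xs, ys)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

A different sample from the same population would give slightly different values of b0 and b1, which is the uncertainty described above.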


Brushtail Possums

The graph below shows a scatterplot for the head

length and total length of 104 brushtail possums from

Australia. Each point represents a single possum from

the data. (The highlighted point represents a possum

with head length 94.1 mm and total length 89 cm.)


Brushtail Possums

The head and total length variables are positively

associated. While the relationship is not perfectly

linear, it could be helpful to partially explain the

connection between these variables with a straight

line.


Brushtail Possums

We want to describe the relationship between the

head length and total length variables in the

possum data set using a line. In this example, we

will use the total length as the predictor variable, x,

to predict a possum's head length, y.


Parts of a Linear Equation

There are many linear models we could use, some

better than others. The "best fitting" model has the

equation

$$\hat{y} = 41 + 0.6x.$$


So where are these numbers coming from? The

first number, 41, is the y-intercept. You can

estimate it by extending the graph to include x = 0

and looking to see where the line crosses the y-axis.

[Figure: scatterplot with the line $\hat{y} = 41 + 0.6x$; the y-intercept, 41, is labeled]


The second number, 0.6, is the slope. You can

estimate it by approximating the rise and run from

one point on the line to the next. The slope will be

the steepness of the line, rise over run.

[Figure: scatterplot with the line $\hat{y} = 41 + 0.6x$; Rise ≈ 6 and Run ≈ 10 are marked, so the slope is about 6/10 = 0.6]


We can use this line to discuss properties of possums.

For instance, the equation predicts a possum with a

total length of 80 cm will have a head length of

$$\hat{y} = 41 + 0.6x = 41 + 0.6(80) = 89 \text{ mm}.$$

That is, our prediction is that a possum with a total

length of 80 cm will have a head length of about 89

mm.

The symbol $\hat{y}$ is read as "y hat."


The "hat" on y is used to signify that this is a

prediction.

This prediction may be viewed as an average: the

equation predicts that possums with a total length of

80 cm will have an average head length of 89 mm.

Absent further information about an 80 cm possum,

the prediction for head length that uses the average is

a reasonable estimate.
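The prediction step is just a matter of plugging a total length into the fitted line. A minimal sketch, using the rounded coefficients 41 and 0.6 from this lesson; the function name `predict_head_length_mm` is a hypothetical choice for illustration.

```python
def predict_head_length_mm(total_length_cm, b0=41.0, b1=0.6):
    """Predicted (average) head length in mm for a given total length in cm.

    b0 and b1 are the rounded intercept and slope quoted in the lesson.
    """
    return b0 + b1 * total_length_cm

# Prediction for a possum with a total length of 80 cm.
y_hat = predict_head_length_mm(80)
print(round(y_hat, 1), "mm")
```

The returned value is best read as the average head length for possums of that total length, not as a guarantee about any single possum.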


Residuals

We can use residuals to check the strength of our

model. Residuals are the leftover variation in the

data after accounting for the model fit:

Residual = Data – Fit

$$e = y - \hat{y}$$

Residuals

Every observation will have a residual.

Observations above the prediction line will have

positive residuals (under predictions).

Observations below the line have negative residuals

(over predictions).
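The sign rule can be checked with a short calculation. The sketch below assumes the rounded line ŷ = 41 + 0.6x from this lesson and uses the highlighted possum (total length 89 cm, head length 94.1 mm); the function names are hypothetical.

```python
def predict(x, b0=41.0, b1=0.6):
    """Fitted value from the lesson's rounded line, y-hat = 41 + 0.6x."""
    return b0 + b1 * x

def residual(x, y):
    """Residual = Data - Fit, that is, e = y - y-hat."""
    return y - predict(x)

def describe(e):
    """Say whether the model under- or overpredicted this observation."""
    if e > 0:
        return "underprediction (point above the line)"
    if e < 0:
        return "overprediction (point below the line)"
    return "exact (point on the line)"

# Highlighted possum: total length 89 cm (x), head length 94.1 mm (y).
e = residual(89, 94.1)
print(round(e, 2), describe(e))
```

Here the fitted value is about 94.4 mm, slightly more than the observed 94.1 mm, so the residual is slightly negative.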


Example 6

The linear fit shown below is given as

$$\hat{y} = 41 + 0.6x.$$

Based on this line, formally compute the residual of the

observation (77.0, 85.3). This observation is denoted

by "×" on the plot.

Example 7

If a model underestimates an observation, will the

residual be positive or negative? What about if it

overestimates the observation?


Residual Plots

Residuals are helpful in evaluating how well a

linear model fits a data set. We often display them

in a residual plot such as the one below.

[Figure: two panels, Scatterplot and Residual Plot]

Residual Plots

The residuals are plotted at their original horizontal

locations but with the residual on the vertical. For

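instance, the point (85.0, 98.6), marked "+" on the plot, has a residual of 7.45,

so in the residual plot, it is placed at (85.0, 7.45). (The value 7.45 appears to come from a fit with less-rounded coefficients; the rounded line $\hat{y} = 41 + 0.6x$ gives 98.6 − 92.0 = 6.6.)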

[Figure: two panels, Scatterplot and Residual Plot]

Residual Plots

Creating a residual plot is like tipping the scatterplot

over so the regression line is horizontal.
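The "tipping over" idea can be sketched as a coordinate change: each point keeps its x-value, and its y-value is replaced by its residual. The example assumes the rounded line ŷ = 41 + 0.6x; apart from the highlighted possum (89 cm, 94.1 mm), the points are made up, and `residual_points` is a hypothetical helper name.

```python
def residual_points(points, b0=41.0, b1=0.6):
    """Map each (x, y) to (x, y - y_hat), the coordinates used in a residual plot."""
    return [(x, y - (b0 + b1 * x)) for x, y in points]

# (total length in cm, head length in mm).
# The first point is the highlighted possum from the lesson;
# the remaining points are made up for illustration.
points = [
    (89.0, 94.1),
    (80.0, 89.0),   # lies exactly on the line, so its residual is about 0
    (75.0, 88.5),
    (92.0, 93.0),
]

for x, e in residual_points(points):
    print(f"({x:.1f}, {e:+.2f})")
```

Any point on the fitted line lands at height 0 in the residual plot, which is why the regression line appears as a horizontal line there.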

[Figure: two panels, Scatterplot and Residual Plot]

Example 8

One purpose of residual plots is to identify

characteristics or patterns still apparent in data after

fitting a model. The figure below shows three

scatterplots with linear models in the first row and

residual plots in the second row. Can you identify any

patterns remaining in the residuals?
