greg-kent
MATH& 146
Lesson 36
Section 5.1
Line Fitting
Scatterplots
A scatterplot is a useful summary of data on two
numerical variables. It gives a good visual picture
of the relationship between the two variables, and
aids the interpretation of the correlation coefficient
and regression model.
Roles for the Variables
It is important to determine which of the two numerical variables goes on the x-axis and which on the y-axis.
When the roles are clear, the explanatory variable goes on the x-axis, and the response variable goes on the y-axis.
Example 1
Pick out which variable you think should be the
explanatory (x) and which variable should be the
response (y).
You have data on the circumference of oak trees
(measured 12 inches from the ground) and their
age (in years). An oak tree in the park has a
circumference of 36 inches, and you wish to know
approximately how old it is.
Looking at Scatterplots
When describing a scatterplot, there are four
things to consider.
a) Direction
b) Form
c) Strength
d) Outliers
Direction
Positive correlation: As one variable increases,
the other generally increases as well.
Negative correlation: As one variable increases,
the other generally decreases.
No correlation: The direction is either not clear or
not linear.
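The direction can also be checked numerically. Below is a minimal Python sketch, assuming the Pearson correlation coefficient r as the measure (its formula is not given in this lesson) and using made-up data; the variable names are hypothetical. A positive r goes with a positive direction, and a negative r with a negative direction.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired numerical data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable does not vary")
    return sxy / sqrt(sxx * syy)

# Made-up data for illustration only (not from the lesson).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]   # tends to rise as hours rise
errors = [30, 26, 21, 19, 12]   # tends to fall as hours rise

print(correlation(hours, scores) > 0)  # True: positive direction
print(correlation(hours, errors) < 0)  # True: negative direction
```

The sign of r only describes direction; form and strength still need to be judged from the plot.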
Form
Linear: The graph appears to stretch out in a
generally consistent, straight form.
Nonlinear: The graph appears to show a curve
(up and down, for example).
[Figure: two scatterplots, labeled Linear and Nonlinear.]
Form
In this course, we will only consider fitting linear
models to our data. If our data show a nonlinear
form, then more advanced techniques should be
used.
Strength
Strong: The points appear to follow a single
stream, whether straight, curved, or bending all
over the place.
Weak: The points appear as a vague cloud with
no discernible pattern or trend.
[Figure: scatterplots labeled Strong, Moderate, Weak, and None.]
Outliers
Look for the unexpected. Outliers can offer valuable insight, but can also completely distort your analysis. Fortunately, potential outliers can be easily identified with the graph.
Example 2
Describe the direction, form, strength, and any unusual
features in the scatterplot.
Example 3
Below is a scatterplot of a person's performance IQ
plotted against their brain size. Describe the direction,
form, strength, and any unusual features in the
scatterplot.
Example 4
Below is a scatterplot of the maximum wind speed (in
mph) plotted against the central pressure (in mb) for
163 hurricanes that have hit the United States since
1851. Describe the direction, form, strength, and any
unusual features in the scatterplot.
Example 5
Each of the 2025 elementary schools in Ohio is
plotted below, comparing the passing rate on the
state's fourth-grade reading proficiency test with the
school's poverty level. Describe the direction, form,
strength, and any unusual features in the scatterplot.
Linear Regression
Once we have examined the scatterplot, the next step
is to quantify the relationship between the
explanatory and response variables.
When we try to use a line to model that relationship,
we call it linear regression.
Linear Regression
Linear regression assumes that the relationship
between two variables, x and y, can be modeled
by a straight line:
$$y = \beta_0 + \beta_1 x$$
where β0 and β1 represent two model parameters
for the intercept and slope.
These parameters are estimated using data, and
we write their point estimates as b0 and b1.
Linear Regression
It is rare for all of the data to fall on a straight line, as
seen in the three scatterplots below. In each case, we
will have some uncertainty regarding our estimates of
the model parameters, β0 and β1. For instance, we
might wonder, should we move the line up or down a
little, or should we tilt it more or less?
Brushtail Possums
The graph below shows a scatterplot for the head
length and total length of 104 brushtail possums from
Australia. Each point represents a single possum from
the data. (The highlighted point represents a possum
with head length 94.1 mm and total length 89 cm.)
Brushtail Possums
The head and total length variables are positively
associated. While the relationship is not perfectly
linear, it could be helpful to partially explain the
connection between these variables with a straight
line.
Brushtail Possums
We want to describe the relationship between the
head length and total length variables in the
possum data set using a line. In this example, we
will use the total length as the predictor variable, x,
to predict a possum's head length, y.
Parts of a Linear Equation
There are many linear models we could use, some
better than others. The "best fitting" model has the
equation
$$\hat{y} = 41 + 0.6x.$$
Parts of a Linear Equation
So where are these numbers coming from? The
first number, 41, is the y-intercept. You can
estimate it by extending the graph to include x = 0
and looking to see where the line crosses the y-axis.
[Figure: the line $\hat{y} = 41 + 0.6x$, with the y-intercept 41 labeled.]
Parts of a Linear Equation
The second number, 0.6, is the slope. You can
estimate it by approximating the rise and run from
one point on the line to another. The slope is
the steepness of the line: rise over run.
[Figure: the line $\hat{y} = 41 + 0.6x$, annotated with Rise ≈ 6 and Run ≈ 10.]
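The rise-over-run estimate can be sketched in a few lines of Python. The two points below are hypothetical readings taken off the line (they are not data from the possum set), and the function name is an assumption made for illustration.

```python
def slope_from_points(p1, p2):
    """Slope of the line through two points: rise over run."""
    (x1, y1), (x2, y2) = p1, p2
    rise = y2 - y1
    run = x2 - x1
    if run == 0:
        raise ValueError("run is zero: a vertical line has no defined slope")
    return rise / run

# Hypothetical readings from the line: (total length in cm, head length in mm).
point_a = (80, 89)
point_b = (90, 95)

slope = slope_from_points(point_a, point_b)
print(slope)  # 0.6, from a rise of 6 over a run of 10
```

Any two distinct points on the same line give the same slope, so the choice of points only affects how easy the values are to read off the graph.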
Parts of a Linear Equation
We can use this line to discuss properties of possums.
For instance, the equation predicts a possum with a
total length of 80 cm will have a head length of
$$\hat{y} = 41 + 0.6x = 41 + 0.6 \times 80 = 89 \text{ mm}.$$
(The symbol $\hat{y}$ is read as "y hat".)
That is, our prediction is that a possum with a total
length of 80 cm will have a head length of about 89
mm.
Parts of a Linear Equation
The "hat" on y is used to signify that this is a
prediction.
This prediction may be viewed as an average: the
equation predicts that possums with a total length of
80 cm will have an average head length of 89 mm.
Absent further information about an 80 cm possum,
the prediction for head length that uses the average is
a reasonable estimate.
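The prediction above can be reproduced with a short Python sketch. It assumes the rounded line from this lesson (intercept 41, slope 0.6); the constant and function names are hypothetical.

```python
B0 = 41.0  # intercept of the line used in this lesson (mm)
B1 = 0.6   # slope of the line used in this lesson (mm of head length per cm of total length)

def predict_head_length(total_length_cm):
    """Predicted (average) head length in mm for a given total length in cm."""
    return B0 + B1 * total_length_cm

prediction = predict_head_length(80)
print(f"{prediction:.0f} mm")  # 89 mm
```

The returned value is best read as the average head length for possums of that total length, not as an exact value for any one possum.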
Residuals
We can use residuals to check the strength of our
model. Residuals are the leftover variation in the
data after accounting for the model fit:
Residual = Data – Fit
$$e = y - \hat{y}$$
Residuals
Every observation will have a residual.
Observations above the regression line have
positive residuals (the line underpredicts them).
Observations below the line have negative residuals
(the line overpredicts them).
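As a small illustration, the Python sketch below computes a residual as Data minus Fit for the highlighted possum from earlier in the lesson (total length 89 cm, head length 94.1 mm). It assumes the rounded line with intercept 41 and slope 0.6; the function names are hypothetical.

```python
B0 = 41.0  # intercept of the line used in this lesson
B1 = 0.6   # slope of the line used in this lesson

def predict(total_length_cm):
    """Fitted head length (mm) for a given total length (cm)."""
    return B0 + B1 * total_length_cm

def residual(total_length_cm, head_length_mm):
    """Residual = Data - Fit, that is, e = y - y_hat."""
    return head_length_mm - predict(total_length_cm)

def describe(e):
    """Translate the sign of a residual into words."""
    if e > 0:
        return "above the line (underprediction)"
    if e < 0:
        return "below the line (overprediction)"
    return "on the line"

e = residual(89, 94.1)
print(f"{e:.1f}", describe(e))  # -0.3 below the line (overprediction)
```

Here the fitted value is 41 + 0.6 × 89 = 94.4 mm, so the observed 94.1 mm sits slightly below the line.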
Example 6
The linear fit shown below is given as
$$\hat{y} = 41 + 0.6x.$$
Based on this line, formally compute the residual of the
observation (77.0, 85.3). This observation is denoted
by "×" on the plot.
Example 7
If a model underestimates an observation, will the
residual be positive or negative? What about if it
overestimates the observation?
Residual Plots
Residuals are helpful in evaluating how well a
linear model fits a data set. We often display them
in a residual plot such as the one below.
[Figure: scatterplot and residual plot.]
Residual Plots
The residuals are plotted at their original horizontal
locations but with the residual on the vertical. For
instance, the point (85.0, 98.6), marked "+", has a residual of 7.45,
so in the residual plot, it is placed at (85.0, 7.45).
[Figure: scatterplot and residual plot.]
Residual Plots
Creating a residual plot is like tipping the scatterplot
over so the regression line is horizontal.
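The step from scatterplot to residual plot can be sketched in Python: each point keeps its horizontal position, and its vertical value is replaced by its residual. The sketch assumes the rounded line with intercept 41 and slope 0.6, and the observations are made up for illustration; the function name is hypothetical.

```python
def residual_plot_points(points, b0=41.0, b1=0.6):
    """Map each (x, y) to (x, y - y_hat), keeping x unchanged."""
    return [(x, y - (b0 + b1 * x)) for x, y in points]

# Made-up observations: (total length in cm, head length in mm).
observations = [(80, 90.0), (85, 91.0), (90, 96.5)]

for x, e in residual_plot_points(observations):
    print(f"x = {x}, residual = {e:+.1f}")

# Points that lie exactly on the line land on the horizontal axis (residual 0),
# which is why the regression line appears horizontal in a residual plot.
on_line = [(80, 89.0), (90, 95.0)]
print(all(abs(e) < 1e-9 for _, e in residual_plot_points(on_line)))  # True
```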
[Figure: scatterplot and the corresponding residual plot.]
Example 8
One purpose of residual plots is to identify
characteristics or patterns still apparent in data after
fitting a model. The figure below shows three
scatterplots with linear models in the first row and
residual plots in the second row. Can you identify any
patterns remaining in the residuals?