The Practice of Statistics, Third Edition
Chapter 3: Examining Relationships
Daniel S. Yates
Copyright © 2008 by W. H. Freeman & Company
3.1 – Scatterplots and Correlation
What are some situations in which we might want to examine a relationship between two variables?
• Height & Heart Attacks
• Weight & Blood Pressure
• Hours studying & test scores
• What else?
In this chapter we will deal with relationships between quantitative variables; the next chapter will deal more with categorical variables.
Explanatory vs. Response
The response variable is our dependent variable (traditionally y)
The explanatory variable is our independent variable (traditionally x)
Explanatory or Response?
Which is the explanatory and which is the response variable?
Jim wants to know how the mean 2005 SAT Math and Verbal scores in the 50 states are related to each
other. He doesn't think that either score explains or causes the other.
Julie looks at some data. She asks, “Can I predict a state's mean 2005 SAT Math score if I know its mean
2005 SAT Verbal score?”
Explanatory and Response Variables
When we deal with cause and effect, there is always a definite response variable and explanatory variable.
But calling one variable response and one variable explanatory doesn't necessarily mean that one causes change in the other.
When analyzing several-variable data, the same principles apply…
Data Analysis Toolbox
To answer a statistical question of interest involving one or more data sets, proceed as follows.
• DATA: Organize and examine the data. Answer the key questions.
• GRAPHS: Construct appropriate graphical displays.
• NUMERICAL SUMMARIES: Calculate relevant summary statistics.
• INTERPRETATION: Look for overall patterns and deviations.
When the overall pattern is regular, use a mathematical model to describe it.
W5HW
Scatterplots
Let's say we wanted to examine the relationship between the percent of a state's high school seniors who took the SAT exam in 2005 and the mean SAT Math score in that state that year. A scatterplot is an effective way to graphically represent our data.
But first, what is the explanatory variable and what is the response variable in this situation?
Scatterplots
Once we decide on the response and explanatory variables, we can create a scatterplot.
[Figure: scatterplot with axes labeled "explanatory variable" and "response variable"]
Scatterplot Tips
• Plot the explanatory variable on the horizontal axis. If there is no explanatory-response distinction, either variable can go on the horizontal axis.
• Label both axes!
• Scale the horizontal and vertical axes. The intervals on each axis must be uniform (but the two axes do not have to use the same scale).
• If you are given a grid, try to adopt a scale so that your plot uses the whole grid. Make your plot large enough so that the details can be easily seen.
Note: there is no outlier rule for bivariate data (like 1.5 × IQR). We must use the definition of an outlier.
Overall Pattern
Direction: positive, negative (or none)
Form: linear or nonlinear
Strength: how closely the points follow the form (r-value)
Positive vs. Negative
Interpreting Scatterplots
•Direction?
•Form?
•Strength?
•Outliers?
Adding Categorical Data
The Mean SAT Math scores and percent of high
school seniors who take the test, by state, with the southern states
highlighted.
Is the South different?
Measuring Linear Association: Correlation
Linear relations are important because, when we discuss the relationship between two quantitative variables, a straight line is a simple pattern that is quite common.
A strong linear relationship has points that lie close to a straight line.
A weak linear relationship has points that are widely scattered about a line.
Correlation
[Figure: two scatterplots, one labeled "strong association" and one labeled "weak association"]
• Our eyes are not good measures of how strong a linear relationship is...
A numerical measure along with a graph gives the linear association an exact value.
Facts about Correlation
• Correlation makes no distinction between explanatory and response variables.
• r doesn't change when we change the units of measurement of x, y, or both.
• r is positive when the association is positive and is negative when the association is negative.
• The correlation r is always a number between -1 and 1. Values of r near 0 indicate a very weak linear relationship. The strength of the linear relationship increases as r moves away from 0 toward either -1 or 1.
• Patterns closer to a straight line have correlations closer to 1 or -1
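The facts above can be checked numerically. The following is a minimal sketch, assuming the usual textbook formula for r (the average product of standardized x and y values, with n − 1 in the denominator); the height and weight data are hypothetical and only serve as an illustration.

```python
import math

def correlation(xs, ys):
    """Correlation r as the average product of standardized values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: heights in inches and weights in pounds.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [110, 120, 140, 155, 160, 180]

r = correlation(heights_in, weights_lb)

# Swapping explanatory and response variables leaves r unchanged.
r_swapped = correlation(weights_lb, heights_in)

# Changing units (inches to cm, pounds to kg) leaves r unchanged.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]
r_metric = correlation(heights_cm, weights_kg)

print(r, r_swapped, r_metric)
```

All three printed values should agree up to floating-point rounding, and each lies between -1 and 1.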
Cautionary Notes about Correlation
• Correlation requires that both variables be quantitative.
• Correlation does not describe curved relationships, no matter how strong they are.
• Like the mean and standard deviation, the correlation is not resistant; r is strongly affected by a few outlying observations.
• Correlation is not a complete summary of two-variable data. You should give the means and standard deviations of both x and y along with the correlation.
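Two of these cautions are easy to see with small made-up data sets. The sketch below uses hypothetical values: a perfect curved relationship (y = x²) whose correlation is 0, and a perfectly linear data set whose correlation drops sharply once a single outlying point is added.

```python
import math

def correlation(xs, ys):
    """Correlation r computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfect curved relationship: every point lies exactly on y = x^2.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curved = correlation(curve_x, curve_y)

# A perfect straight line, then the same data with one outlier added.
line_x = [1, 2, 3, 4, 5]
line_y = [2, 4, 6, 8, 10]
r_line = correlation(line_x, line_y)
r_outlier = correlation(line_x + [6], line_y + [0])

print(r_curved, r_line, r_outlier)
```

Here r says nothing about the strong curved pattern, and one added point is enough to pull r from 1 down to a weak positive value.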
Cautionary Notes about Correlation
• Many data sets can have the same r value but have completely different linear relationships
• ALWAYS PLOT YOUR DATA!!!
Correlation applet
3.2 – Least-Squares Regression
When a scatterplot shows a linear relationship, we would like to summarize the overall pattern by drawing a line on the scatterplot.
Regression line – describes how a response variable y changes as an explanatory variable x changes.
• Regression requires explanatory and response variables.
• The y-intercept does not always make sense in context.
• The slope represents the predicted (average) change in y for each one-unit increase in x.
• You must be very specific when interpreting the slope and y-intercept.
Regression Lines
• Once we have our regression line, we can use it to predict responses.
• Extrapolation – using the line for predictions outside the range of values of the explanatory variable
– Such predictions are often not accurate
That’s one big rat!!!
Some data were collected on the weight of a male white laboratory rat following its birth. A scatterplot of the weight (in grams) and time since birth (in weeks) shows a fairly strong positive linear relationship. The linear regression equation models the data fairly well.
weight = 100 + 40(time)
a) Interpret the slope in the context of this setting.
b) Interpret the y-intercept in this setting.
c) Would you be willing to use this line to predict the rat’s weight at age 2 years? (There are 454 grams in a pound.)
a) slope: For every one-week increase in age, the rat's weight is predicted to increase by an average of 40 grams.
b) y-intercept: The predicted birth weight of this male rat is 100 grams.
c) No, this would be extrapolation. At 2 years (104 weeks), the line predicts a weight of approximately 4,260 grams, or 9.4 lbs. This is what a medium-sized cat weighs!
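The arithmetic behind answer (c) can be sketched in a few lines of Python. The function name and constants are illustrative; the equation and the 454 grams per pound conversion come from the problem, and 2 years is taken as 104 weeks.

```python
GRAMS_PER_POUND = 454
WEEKS_PER_YEAR = 52

def predicted_weight_grams(weeks):
    """Regression line from the example: weight = 100 + 40(time)."""
    return 100 + 40 * weeks

# Age 2 years is far outside the weeks over which the data were collected,
# so this prediction is an extrapolation.
weeks = 2 * WEEKS_PER_YEAR
grams = predicted_weight_grams(weeks)
pounds = grams / GRAMS_PER_POUND

print(f"{weeks} weeks -> {grams} g, about {pounds:.1f} lb")
```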
The Least-Squares Regression Line (LSRL)
• In most cases, no line will pass exactly through all of the points in a scatterplot
• Our eyes are not a good judge of the best line
• Because we use the line to predict y from x, the prediction errors are errors in y, the vertical direction
• A good regression line makes the vertical distances of the points from the line as small as possible
The Least-Squares Regression Line (LSRL)
Does fidgeting keep you slim?
Refer to today’s handout for scatterplot
Use your calculator and program CORR to find the equation of the LSRL for the NEA and Fat gain data.
Using Program CORR
$$b = r\,\frac{s_y}{s_x} = -0.7786\left(\frac{1.1389}{257.66}\right) = -0.00344$$
$$a = \bar{y} - b\bar{x} = 2.3875 - (-0.00344)(324.8) = 3.505$$
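The same slope and intercept can be reproduced from the summary statistics with a short Python sketch. The variable names are illustrative; the numbers are the ones reported for the NEA and fat gain data, and the slope is negative, in agreement with the fitted equation for fat gain.

```python
# Summary statistics for the NEA (x) and fat gain (y) data.
r = -0.7786        # correlation
s_x = 257.66       # standard deviation of NEA change
s_y = 1.1389       # standard deviation of fat gain
mean_x = 324.8     # mean NEA change
mean_y = 2.3875    # mean fat gain

# Slope: b = r * (s_y / s_x)
b = r * s_y / s_x

# Intercept: a = y-bar - b * x-bar
a = mean_y - b * mean_x

print(f"fat gain = {a:.3f} + ({b:.5f})(NEA change)")
```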
NEA and Fat Gain
Fat gain = 3.505 – 0.00344(NEA change)
$\hat{y} = 3.505 - 0.00344x$
NEA and Fat Gain
Make the errors in predicting y as small as possible by minimizing the sum of the squares of the vertical distances of the data points from the line.
Understanding the LSRL
How well does the line fit the data?
• Two ways:
– Residual plot
– Coefficient of determination, $r^2$
• Residual – difference between the observed value of the response and the predicted value: residual = $y - \hat{y}$
Residuals show how much error there is in the LSRL's predictions.
Residuals
x: 1 2 3 4 5
y: 2 4 6 8 15
Find LSRL: $\hat{y} = -2 + 3x$
$\hat{y}$: 1 4 7 10 13
Residuals: 1 0 -1 -2 2
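This small example can be worked in Python as a check. The sketch below computes the slope as the sum of cross-products divided by the sum of squared x deviations, which is algebraically the same as b = r(s_y / s_x); the helper name `least_squares_line` is an illustrative choice, not a calculator command.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 15]

a, b = least_squares_line(xs, ys)

# Predicted values and residuals (observed minus predicted).
predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

print(a, b)
print(predicted)
print(residuals)
```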
Residual Plot
• Plot (x, residual)
• The residual plot should show no obvious pattern.
– Curved: linear not a good fit
– Fanning: predictions will be less accurate for larger/smaller x
• Residuals should be small
Residuals
• Need to be small…but what’s small enough?
• Standard deviation of the residuals
– Used to measure the typical prediction error
$$s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{10}{3}} = 1.83$$
Predictions are typically off by about 1.83.
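As a quick check of this value, the standard deviation of the residuals for the five-point example (residuals 1, 0, -1, -2, 2) can be computed directly. This is a minimal sketch that divides by n − 2, as in the formula for s.

```python
import math

# Residuals from the five-point example.
residuals = [1, 0, -1, -2, 2]
n = len(residuals)

# Sum of squared residuals.
sse = sum(e ** 2 for e in residuals)

# Standard deviation of the residuals: divide by n - 2, then take the root.
s = math.sqrt(sse / (n - 2))

print(sse, round(s, 2))
```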
Residuals for NEA & Fat Gain
x: -94 -57 -29 135 143 151 245 355
$y - \hat{y}$: .37 -.70 .095 -.34 .187 .61 -.26 -.98
x: 392 473 486 535 571 580 620 690
$y - \hat{y}$: 1.64 -.18 -.23 .54 -.54 -1.11 .93 -.03
In calculator: L3 = Y1(L1) gives all predicted values
L4 = L2 – L3 (actual – predicted)
Residuals for NEA & Fat Gain
• Make scatterplot of residuals: L1, L4
1 var stats: L4
Sres = 0.71
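The calculator steps can be mirrored in a short Python sketch. It takes the sixteen residuals for the NEA and fat gain data, reading the third one as .095 (an assumption: that reading is the one consistent with the sum of squared residuals 7.663 used in the $r^2$ calculation). It then compares two divisors: n − 1, which is what 1-Var Stats uses for Sx, and n − 2, which is the divisor in the formula for the standard deviation of the residuals. The two results differ slightly.

```python
import math
import statistics

# Residuals (actual - predicted) for the 16 subjects, in the order of x.
residuals = [0.37, -0.70, 0.095, -0.34, 0.187, 0.61, -0.26, -0.98,
             1.64, -0.18, -0.23, 0.54, -0.54, -1.11, 0.93, -0.03]
n = len(residuals)

# What 1-Var Stats reports as Sx: the sample standard deviation (n - 1).
s_one_var = statistics.stdev(residuals)

# Standard deviation of the residuals from the formula with n - 2.
sse = sum(e ** 2 for e in residuals)
s_regression = math.sqrt(sse / (n - 2))

print(round(s_one_var, 2), round(s_regression, 2))
```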
Residual Plots
Scattered…no real pattern. A line is a good model.
Curved pattern. A line may not be the best model.
Residual Plots
Fanning…more spread for larger values of x. Prediction will be less accurate when x is large.
HW: pg. 220 #39, 40
Using $r^2$ to determine how well the line fits the data
• $r^2$: coefficient of determination, the proportion of variation in y explained by the LSRL
• How well the LSRL does at predicting values of the response
• How much better is the LSRL at predicting responses than if we just used $\bar{y}$ as our prediction?
We know that the LSRL minimizes the sum of the squared residuals…
• Compare the sum of squared residuals of the LSRL to the sum of squared residuals of $\bar{y}$:
$$\frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$$
• Use NEA and Fat Gain data. $\bar{y} = 2.3875$
Create a new list of $y - \bar{y}$ and use 1 var stats to find $\sum x^2$, which gives $\sum (y - \bar{y})^2$.
• This gives us the proportion of how much error there is in the LSRL model with respect to the error in the mean model:
$$\frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = \frac{7.663}{19.4575} = .3938$$
• How can we use this to determine how much better the LSRL is ($r^2$)?
• $r^2$ = 1 – .3938 = .6062
So what does this .6062 mean?
• 60.6% of the variation in fat gain is explained by the LSRL relating fat gain and non-exercise activity.
• The other 39.4% is individual variation that is not explained by this linear relationship
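The calculation of $r^2$ from the two sums of squares can be sketched in Python. The variable names are illustrative; the sums of squares (7.663 and 19.4575) and the correlation (-0.7786) are the values reported for the NEA and fat gain data. Squaring r directly should give nearly the same answer, up to rounding in the reported figures.

```python
# Sums of squares for the NEA and fat gain data.
sse = 7.663      # sum of squared residuals about the LSRL
sst = 19.4575    # sum of squared deviations about the mean of y

# Proportion of variation NOT explained by the line.
unexplained = sse / sst

# Coefficient of determination.
r_squared = 1 - unexplained

# Cross-check against the square of the correlation.
r = -0.7786
r_squared_from_r = r ** 2

print(round(unexplained, 3), round(r_squared, 3), round(r_squared_from_r, 3))
```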
• If all the points lie on the LSRL, then $\sum (y - \hat{y})^2 = 0$ and $r^2$ = 1
– All of the variation in y is explained by the linear relationship with x
• Worst case scenario: $\sum (y - \hat{y})^2 = \sum (y - \bar{y})^2$ and $r^2$ = 0
– 0% is explained by the line
When reporting a regression, always give $r^2$ to show how successful the line was in explaining the response.
Facts about LSRL
1) Distinction between explanatory and response is essential. (Will get a different line if they are reversed)
2) Close connection between correlation and slope: $b = r\,\frac{s_y}{s_x}$
3) LSRL always passes through $(\bar{x}, \bar{y})$
4) r describes the strength of the straight-line relationship
$r^2$ is the proportion of variation in y that is explained by the least-squares regression of y on x
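Facts 1 through 3 can be illustrated with the five-point data set from the residuals example (x: 1 to 5, y: 2, 4, 6, 8, 15). The sketch below is one way to check them; the helper function `fit` is an illustrative name, not part of any library.

```python
import math

def fit(xs, ys):
    """Least-squares line for predicting ys from xs: returns (a, b, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx
    a = mean_y - b * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 15]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Fact 1: regressing y on x and x on y gives two different lines.
a_yx, b_yx, r = fit(xs, ys)      # y-hat = a_yx + b_yx * x
a_xy, b_xy, _ = fit(ys, xs)      # x-hat = a_xy + b_xy * y

# Fact 2: the slope equals r * (s_y / s_x).
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
slope_from_r = r * s_y / s_x

# Fact 3: the line passes through (x-bar, y-bar).
y_at_mean_x = a_yx + b_yx * mean_x

print(a_yx, b_yx, a_xy, b_xy, slope_from_r, y_at_mean_x)
```

Solving the second line for y gives a slope of 1 / b_xy, about 3.33, which is not the slope 3 of the first line. This is why the choice of explanatory variable matters.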