Building a Model
Least-Squares Regression (Section 3.3)
Why Create a Model?
There are two reasons to create a mathematical model for a set of bivariate data:
• To predict the response value for a new individual.
• To find the “average” response value for any explanatory value.
Which Model is “Best”?
Since we want to use our model to predict response values for given explanatory values, we will define “best” as the model with the smallest error. (We define the “error,” or residual, as the vertical distance from an observed value to the prediction line.)
• Residual = Observed – Predicted
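The residual definition above can be sketched in a few lines of Python (the prediction line and the data point here are made up purely for illustration):

```python
# Residual = Observed - Predicted, for a hypothetical prediction line
# yhat = 2 + 0.5x. All numbers below are invented for illustration.

def residual(observed_y, x, b0, b1):
    """Vertical distance from an observed point to the line yhat = b0 + b1*x."""
    predicted_y = b0 + b1 * x
    return observed_y - predicted_y

# Observed point (x=10, y=8); the line predicts 2 + 0.5*10 = 7.
print(residual(8, 10, b0=2, b1=0.5))  # 1.0
```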
When the variables show a linear relationship, we find that the line of “best” fit is the Least-Squares Regression Line.
Why is it called the “Least-Squares Regression Line”? Consider our data set from the “Dream Team.” Notice that our line is an “average” line and that it does not intersect each piece of data.
[Scatter plot: 1992 Dream Team, Points vs. Minutes; Points = 0.615(Minutes) – 0.81; r^2 = 0.68]
This means that our predictions will have some error associated with them.
So Why is it “Best”?
If we find the vertical distance from each actual data point to our prediction line, we can find the amount of error. But if we try to add these errors together, we will find they sum to zero, since our line is an “average” line. We can avoid that sum of zero by squaring each of those errors and then finding the sum.
Smallest “Sum of Squared Error”
We find that the line called the Least-Squares Regression Line has the smallest sum of squared error. This indicates that this model will be the line that does the best job of predicting.
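As a sketch of why we square before summing, the snippet below (with made-up data) fits a least-squares line, shows the raw errors canceling to essentially zero, and checks that a nearby alternative line has a larger sum of squared errors:

```python
import numpy as np

# Hypothetical bivariate data, invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.9])

# Least-squares fit; for degree 1, np.polyfit returns [slope, intercept].
b1, b0 = np.polyfit(x, y, 1)

# The raw errors cancel around an "average" line...
residuals = y - (b0 + b1 * x)
print(abs(residuals.sum()) < 1e-6)   # True

# ...so we square them before summing.
sse = (residuals ** 2).sum()

# Any other line, e.g. the same slope with a shifted intercept,
# has a larger sum of squared errors.
other_sse = ((y - ((b0 + 0.1) + b1 * x)) ** 2).sum()
print(sse < other_sse)               # True
```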
[Scatter plot: 1992 Dream Team, Points vs. Minutes; Points = 0.615(Minutes) – 0.81; r^2 = 0.68; sum of squares = 508.1]
Equation of the LSRL
The LSRL can be found using the means, standard deviations, and the correlation between our explanatory and response variables.
ŷ = b0 + b1x
Where:
ŷ (y-hat) = predicted response value
b0 = y-intercept
b1 = slope
x = explanatory variable value
Calculating the LSRL using summary statistics
When all you have is the summary statistics, we can use the following equation to calculate the line:
ŷ = b0 + b1x
where b1 and b0 can be found using:
b1 = r(sy/sx)
b0 = ȳ – b1x̄
Finding the LSRL
So with the summary statistics for both minutes and points, we can find the line of “best” fit for predicting the number of points we can expect, on average, for a given number of minutes played.
x̄ = 50   ȳ = 29.9167   sx = 16.0850   sy = 11.9959   r = .8240
b1 = r(sy/sx) = .8240(11.9959/16.0850) = .6145
b0 = ȳ – b1x̄ = 29.9167 – .6145(50) = –.8107
predicted points = –.8107 + .6145(minutes)
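The arithmetic above can be checked with a few lines of Python, using the summary statistics from the slide:

```python
# LSRL from summary statistics (values taken from the slide):
# xbar = 50 minutes, ybar = 29.9167 points,
# s_x = 16.0850, s_y = 11.9959, r = .8240
xbar, ybar = 50, 29.9167
s_x, s_y = 16.0850, 11.9959
r = 0.8240

b1 = r * s_y / s_x       # slope
b0 = ybar - b1 * xbar    # y-intercept

print(round(b1, 4))  # approximately 0.6145
print(round(b0, 4))  # approximately -0.8107
```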
Describing b0 in context
b0 = the y-intercept: the y-intercept is the value of the response variable when our explanatory variable is zero. Sometimes this has meaning in context, and sometimes it has only a mathematical meaning.
predicted points = –.8107 + .6145(minutes)
b0 = –.8107: this would mean that if a player spent no minutes on the court, he would score –.8107 points on average. Since it is impossible to score negative points, we can conclude that the y-intercept has no meaning in this situation.
Describing b1 in context
b1 = the slope: the slope of the regression line tells us what change in the response variable we expect, on average, for an increase of 1 in the explanatory variable.
Since b1 = .6145, we can conclude that for each additional minute spent on the court, a player would, on average, score approximately .6145 more points. Or, equivalently, for each additional 10 minutes, he would score on average an additional 6.145 points.
predicted points = –.8107 + .6145(minutes)
Finding the LSRL with raw data
We can find the LSRL using technology: either our TI calculators or statistical software. The program called “StatCrunch” is a web-based statistical program that provides statistical calculations and plots. Its output is very similar to most statistical programs.
Least-Squares Regression Output
Simple linear regression results:
Dependent Variable: Points
Independent Variable: Minutes
Points = -0.81066996 + 0.6145467(Minutes)
Sample size: 12
R (correlation coefficient) = 0.824
R-sq = 0.6790264
Estimate of error standard deviation: 7.127934

Parameter estimates:
Parameter   Estimate      Std. Err.    DF   T-Stat        P-Value
Intercept   -0.81066996   6.9903164    10   -0.11597042   0.91
Slope       0.6145467     0.13361223   10   4.5994797     0.001

Analysis of variance table for regression model:
Source   DF   SS          MS          F-stat      P-value
Model    1    1074.8423   1074.8423   21.155212   0.001
Error    10   508.07443   50.80744
Total    11   1582.9166
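A similar table can be produced in Python with `scipy.stats.linregress`. The slide does not reproduce the raw 12-player data, so the minutes/points values below are invented for illustration only; with the real data, the slope, intercept, and r would match the StatCrunch output above.

```python
import numpy as np
from scipy import stats

# Hypothetical minutes/points data (NOT the actual Dream Team values).
minutes = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70])
points  = np.array([ 5,  8, 10, 14, 17, 20, 24, 26, 30, 33, 36, 42])

res = stats.linregress(minutes, points)
print(f"Points = {res.intercept:.4f} + {res.slope:.4f}(Minutes)")
print(f"r = {res.rvalue:.3f}, r-sq = {res.rvalue**2:.4f}")
print(f"slope std err = {res.stderr:.4f}, p-value = {res.pvalue:.4g}")
```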
TI Tips for LSRL
To find the LSRL on a TI-83/84 calculator, first enter the data into the list editor of the calculator. These can be either named lists or the built-in lists.
From the home screen: STAT → CALC → 8:LinReg(a+bx)
• The arguments for this command simply tell the calculator where the explanatory and response values are located.
Press ENTER. Notice that in addition to the values for the y-intercept and slope, the correlation coefficient, r, is also given.
Is a linear model appropriate?
We now know how to create a linear model, but how do we know that this type of model is the appropriate one? To answer this question, we look at 3 things:
• Does a scatterplot of the data appear linear?
• How strong is the linear relationship, as measured by the correlation coefficient, “r”?
• What does a graph of the residuals (errors in prediction) look like?
Checking for Linearity
As we can see from the scatterplot, the relationship appears fairly linear. The correlation coefficient for the linear relationship is .824. Even though both of these things indicate a linear model, we must check a graph of the residuals to make sure the errors associated with a linear model aren’t systematic in some way.
[Scatter plot: 1992 Dream Team, Points vs. Minutes]
Residuals
We can look at a graph of the number of minutes (x-values) vs. the errors produced by the LSRL. If there is no pattern present, we can use a linear model to represent the relationship. However, if a pattern is present (like any of the graphs at the right), we should investigate other possible models.
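A residual check can be sketched as follows. The data here are hypothetical; a deliberately curved (quadratic) relationship is used so that the residuals show the parabolic pattern warned about below:

```python
import numpy as np

# Hypothetical, clearly nonlinear data: y = x^2.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = x ** 2

# Fit a straight line anyway, then inspect residuals vs. x.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# The residuals are positive at both ends and negative in the middle:
# a parabolic pattern, so a linear model is not appropriate here.
print(np.sign(residuals))
```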
[Residual plot: a parabolic shape indicates the data is not linear]
[Residual plot: a “trig”-looking pattern indicates “auto-correlation”]
[Residual plot: an increase or decrease in variation is called a megaphone effect]
Dream Team residuals
Notice that there does not appear to be any pattern to the residuals of the least-squares regression line between the number of minutes spent on the court and the number of points scored. This would indicate that a linear model is appropriate.
How Good is our Model?
Although a linear model may be appropriate, we can also evaluate how much of the differences in our response variable can be explained by the differences in the explanatory variable. The statistic that gives this information is r^2. This statistic helps us to measure the contribution of our explanatory variable in predicting our response variable.
How Good is our Dream Team Model?
Remember, from both our StatCrunch output and our calculator output, we found that r^2 = .68. Approximately 68% of the differences in the number of points scored by the players can be explained by the differences in the number of minutes the players spent on the court. An alternative way to say this same thing:
Approximately 68% of the differences in the number of points scored by the players can be explained by the least-squares regression of points on minutes.
So, how good is it?
Well, it may help to know how r^2 is calculated. Yes, r^2 is the square of the correlation coefficient r; however, it is useful to see it in a different light. Remember that our goal is to find a model that helps us to predict the response variable, in this case points scored.
Understanding r^2
One way to describe the number of points scored by players is to simply give the average number of points scored. Notice that this line, as with our regression line, has some error associated with it.
[Scatter plot: 1992 Dream Team, Points vs. Minutes, with the horizontal line Points = mean points]
The error about the mean is found by finding the vertical distance from each data point to the line represented by the average response value. To avoid a sum of zero, we again square each of these distances, then find the sum, called the sum of squares total: SST. For our example, we find that SST = 1582.9167.
Remember that our LSRL also measured the vertical distances from each data point to the prediction line and found the line that minimizes this sum; this is the sum of squared error: SSE. For our example, we find that SSE = 508.0744.
Now if the explanatory variable we have chosen really does NOT help us in predicting our response variable, then the sum of squares total (SST) will be very close to the sum of squares error (SSE). The difference between these two is the amount of the variation in the response variable that can be explained by the regression line of y on x. (Sometimes this is referred to as the sum of squares regression (SSR) or sum of squares model (SSM).)
r^2 represents the ratio of the sum of squares regression (model) to the sum of squares total. It is the proportion of variability that is explained by the least-squares regression of y on x. Notice that our regression output gives us the ability to calculate r^2 directly from the error measurements:
r^2 = (SST – SSE)/SST = 1074.8423/1582.9166 = .68
Source   DF   SS          MS          F-stat      P-value
Model    1    1074.8423   1074.8423   21.155212   0.001
Error    10   508.07443   50.80744
Total    11   1582.9166
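The computation above can be replayed directly from the sums of squares in the ANOVA table:

```python
# r-squared from the sums of squares (values from the ANOVA table on the slide).
sst = 1582.9166   # total sum of squares, about the mean of points
sse = 508.07443   # sum of squared errors about the LSRL
ssr = sst - sse   # sum of squares explained by the regression ("Model" row)

r_sq = ssr / sst
print(round(r_sq, 4))  # approximately 0.679
```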
Interpreting r^2
When r^2 is close to zero, this indicates that the variable we have chosen to use as a predictor does not contribute much; in other words, it would be just as valuable to use the mean of our response variable. As r^2 gets closer to 1, this indicates that the explanatory variable is contributing much more to our predictions, and our regression model will be more useful for predictions than just reporting a mean. Some models include more than one explanatory variable; this type of model is called multiple linear regression, and we’ll leave the study of these models for another course.
Additional Resources
Against All Odds: http://www.learner.org/resources/series65.html
• Video #7: Models for Growth
The Practice of Statistics (YMM), pg. 137-151
The Practice of Statistics (YMS), pg. 149-165
What you learned:
• Why we create a model
• Which model is “best” and why
• Finding the LSRL using summary stats
• Using technology to find the LSRL
• Describing the y-intercept and slope in context
• Determining if a LSRL is appropriate
• How “good” is our model?