Regression
Learning Objectives
By the end of this lecture, you should be able to:
– Describe what is meant by regression. Be able to describe correlation and the relationship between the two.
– Generate a regression model both by using the calculator to calculate b0 and b1, and by using statistical software such as SPSS.
– Describe why extrapolation can lead to misleading conclusions.
– Working with categorical variables: selecting the best tool for the job when comparing categories.
Linear Regression - Overview
• Once we have convinced ourselves that a linear relationship does in fact exist between two variables, and that the relationship is causal (more on this later), we have a terrific tool for making predictions.
– For example, we can predict the blood alcohol level based on the number of beers consumed.
• We can predict by using the graph line itself, but it can be difficult to estimate exactly where on the axis a line falls. An even more useful way would be to take that line and turn it into a formula. This technique is called ‘regression’.
• For example: Once we have a formula, we can simply plug in the number of beers and the formula tells us the predicted BAC level.
Developing a more precise model
• If you were asked for the blood alcohol level (BAC) for 2.5 beers, you would have to estimate both the location of 2.5 on the x-axis and the quantity of BAC on the y-axis. Your prediction would be imprecise.
– Answer = 0.028 ????
• So the next step is to generate a formula from our model which will give us more precise predictions. Ultimately, we will come up with the model:
BAC = -0.013 + 0.018 * num_beers
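As a minimal sketch of what "plugging in" a value looks like, the model can be written as a small Python function. The function name `predict_bac` is only an illustrative choice; the coefficients are the ones quoted in the model above.

```python
def predict_bac(num_beers, b0=-0.013, b1=0.018):
    """Predicted BAC from the lecture's model: BAC = b0 + b1 * num_beers."""
    return b0 + b1 * num_beers

# Prediction for 2.5 beers, a value that is awkward to read off the graph
print(round(predict_bac(2.5), 3))
```

With these coefficients the formula gives about 0.032 for 2.5 beers, compared with the eyeballed estimate of 0.028.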
Is regression the appropriate tool for the job?
• Reminder: A key point of this course is to recognize when you can and can NOT use a statistical tool.
• This is one of those times: It is VERY important that you recognize when a regression model is NOT an appropriate tool.
• Before doing a regression analysis, the following must ALL be true:
1. The relationship is linear
2. ‘r’ is not very weak (i.e. is not too close to 0)
3. The relationship is ‘causal’ (important – but later…)
Summary on using correlation to build a model
• The purpose of all of this (taking data, graphing it, looking for correlation, and generating a regression line) is to generate a model (a formula) that allows us to infer information about the population and/or to make predictions.
– E.g.: If we give someone 6.5 beers, what do we think their BAC is likely to be?
Steps
1. Obtain data.
– e.g. Do a study where you take a group of people, determine how many beers they drank, and then measure their BAC.
2. Graph that data on a scatterplot.
3. If you believe there is a correlation, draw a regression line (we’ll use software for this step).
4. From that regression line, generate a formula (a model).
The Regression Model
When dealing with “simple linear regression” (the only regression model we will deal with in this course), the formula generated from the model will be in the form ‘y = mx + b’ that many of you probably encountered in high school. The only difference is that we will use more “statistically appropriate” letters and symbols. You will need to know these (sorry)!
Different people use different symbols. In this course, we will use:
• b0 to refer to the intercept (what you probably called ‘b’).
• b1 to refer to the slope (what you probably called ‘m’).
$\hat{y} = b_0 + b_1 x$
The Regression Model
• b0 refers to the intercept (what you probably called ‘b’). The intercept is where the regression line crosses the y-axis.
• On this graph it is about -0.013.
• b1 refers to the slope (what you probably called ‘m’). The slope refers to the ‘angle’ of the line: how much the predicted y changes for each one-unit increase in x.
$\hat{y} = b_0 + b_1 x$
# Beers vs BAC – The regression model
Let’s take the generic model and apply it to our # beers vs. BAC study:
$\hat{y} = b_0 + b_1 x$
BAC = b0 + b1 * # of beers
The trick is to find out what the values for ‘b0’ and ‘b1’ are.
How to calculate b0 and b1
(Good news: It’s actually pretty easy!)
First we calculate the slope of the line, b1:
$b_1 = r \frac{s_y}{s_x}$
r is the correlation.
sy is the standard deviation of the response variable y.
sx is the standard deviation of the explanatory variable x.
Once we know the slope (b1), we can easily calculate the y-intercept (b0):
$b_0 = \bar{y} - b_1 \bar{x}$
where $\bar{x}$ and $\bar{y}$ are the sample means of the x and y variables.
Important: You WILL be asked to do these calculations. And I hope you agree the calculations themselves are quite easy. In addition, I will give you the formulas on a cheat-sheet during your exams. HOWEVER: The key is for you to recognize when they can (and can NOT!) be used.
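The calculation can be sketched in a few lines of Python. This is only an illustration: the helper name `fit_line` and the toy data are made up for this example (they are NOT the beer study data), and r is computed from its definition using sample standard deviations.

```python
import statistics

def fit_line(x, y):
    """Return (b0, b1) using b1 = r * sy / sx and b0 = ybar - b1 * xbar."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)  # sample standard deviations
    n = len(x)
    # Correlation r: sum of products of standardized values, divided by n - 1
    r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y for xi, yi in zip(x, y)) / (n - 1)
    b1 = r * s_y / s_x          # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

# Hypothetical toy data (NOT from the lecture's beer study)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(f"y-hat = {b0:.2f} + {b1:.2f} * x")
```

Note that the function computes the slope first and only then the intercept, mirroring the order of the hand calculation.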
Variable names
FYI, not all calculators and software use the same variables! For example, some use:
$\hat{y} = a + bx$
And some use:
$\hat{y} = ax + b$
Make sure you know the variables used by YOUR software/calculator before you answer homework or exam questions!
What’s up with the hat??
$\hat{y} = b_0 + b_1 x$
Gas Consumption^ = b0 + b1 * Heating
The hat (^) is a symbol that tells us that this result is a predicted value as calculated using the regression line model, as opposed to a value that comes from the original data (observed data).
For example, look at the (tiny) purple dot for x=24. This dot was one of our original datapoints that says that on a 24 degree day, the average gas consumption was about 6.4. So 6.4 is the observed result from our data. However, the regression model predicts a value somewhere around 5.6.
Similarly, for x=26, y=5.3 but y^=6.0.
Again, these are symbols I want you to be comfortable with.
What’s up with the hat??
$\hat{y} = b_0 + b_1 x$
Gas Consumption^ = b0 + b1 * Heating
So if for a heating value of 24, I say: ‘Gas Consumption’ = 6.4, then I am saying that this particular result came from the observed data. (That is, data that was collected somewhere along the way).
If, however, for a heating value of 24, I say:
‘Gas Consumption^’ = 5.6, then I am saying that this particular result is predicted from a regression model. (I.e. As opposed to an observed value that was collected somewhere).
Example using SPSS
• Let’s use the software to generate a regression model for the beer / blood alcohol level example discussed earlier.
• In SPSS, open beer_bac.sav (you can find this file in the datasets on the class webpage).
• To generate the graph: Graphs >> Legacy Dialogs >> Scatter/Dot >> Simple Scatter. Click ‘Define’.
– Remember to always place your explanatory variable (in this case the number of beers variable) on the x-axis and your response variable (in this case, the bac variable) on the y-axis. You can click on the variable and click the arrow to move it into the appropriate field. Click ‘OK’.
– Also remember that it is very important that you do not confuse the explanatory vs response variables!
Example using SPSS contd
• A new window will open showing your scatterplot and some additional information.
• Generate Regression line: use the chart editor (double click on the plot) and choose the icon for ‘Add Fit Line’.
– You will see a ‘Properties’ window open up. Choose ‘Linear’. Then close the Properties window.
• To calculate Parameters: SPSS will also calculate b0 and b1 for you.
• Close the ‘Chart Editor’ window and return to the output window.
• Click: Analyze >> Regression >> Linear. “Dependent” refers to the response variable. “Independent” refers to the explanatory variable. (Recall that these terms are a bit flawed, but as I mentioned earlier, they are still sometimes used). Click ‘OK’.
• Under ‘Coefficients’: the first value under the ‘B’ column is b0 (the intercept). The value below b0 is b1 (slope).
– We will talk about the ‘model summary’ table later.
Note the R2 value of 0.8 that SPSS provides with the graph. If you take the square root of this value, you get the magnitude of ‘r’, which is about 0.89. The square root alone does not tell you the sign; because the regression line slopes upward, r is positive. From this we can say that we have a pretty strong, positive correlation between # of beers and BAC level.
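A minimal sketch of that conversion in Python, assuming the R2 of 0.8 and the slope of 0.018 reported for the beer / BAC model. The variable names are illustrative only.

```python
import math

r_squared = 0.8   # R2 value reported by SPSS in the lecture
slope = 0.018     # b1 from the lecture's model; it is positive

# sqrt gives only the magnitude of r; the sign is copied from the slope
r = math.copysign(math.sqrt(r_squared), slope)
print(round(r, 2))
```

Had the slope been negative, the same R2 would correspond to an r of about -0.89.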
The graph generated by SPSS:
SPSS’ Coefficients table tells us b0 and b1
Regression Model:
BAC^ = -0.013 + 0.018 * num_beers
where b0 (y-intercept) = -0.013 and b1 (slope) = 0.018
Example of Minitab output
Example of Excel output
(Figures: in each output, callouts mark the intercept, the slope, and R2; one also marks r.)
Correlation and Regression
Be sure that you are clear on the definition of each. I will probably have a question on your exam(s) that asks you to define correlation and regression. Or I may ask you to explain the difference between the two. The definitions are below:
Correlation quantifies the strength and direction of a relationship between two (quantitative) variables.
Regression describes the variation in the response variable (y) given a change in the explanatory variable (x).
Correlation v.s. Regression restated
• Correlation is a single number that tells you about the strength and direction of the relationship.
– It in no way helps you predict a specific value for ‘y’ given an ‘x’.
• Regression is the process of generating a model to allow you to make predictions.
Making predictions
The equation that we have derived using our regression formula allows us to predict y for a given value of x.
Nobody in the study drank 6.5 beers, but by finding the value of BAC from the regression line for x = 6.5 we would predict a blood alcohol content of 0.104 mg/ml.
$\hat{y} = -0.013 + 0.018 \times 6.5$
$\hat{y} = 0.104$ mg/ml
Year Powerboats (in 1000’s) Dead Manatees
1977 447 13
1978 460 21
1979 481 24
1980 498 16
1981 513 24
1982 512 20
1983 526 15
1984 559 34
1985 585 33
1986 614 33
1987 645 39
1988 675 43
1989 711 50
1990 719 47
There is a positive linear relationship between the number of powerboats registered (in 1000’s) and the number of manatee deaths.
From the regression line, we generate the equation:
$\hat{y} = 0.125x - 41.4$
Thus if we were to limit the number of powerboat registrations to 500,000, what could we expect for the number of manatee deaths?
$\hat{y} = 0.125(500) - 41.4$
$\hat{y} = 62.5 - 41.4 = 21.1$
Roughly 21 manatees.
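For those who want to check the manatee equation without SPSS, here is a small Python sketch that fits the line to the table above using the least-squares formulas (which are equivalent to b1 = r * sy / sx and b0 = ybar - b1 * xbar). The variable names are illustrative; the data are copied from the table.

```python
# Data copied from the powerboat / manatee table in this lecture
powerboats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719]  # in 1000's
manatees = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47]

n = len(powerboats)
x_bar = sum(powerboats) / n
y_bar = sum(manatees) / n

# Least-squares slope and intercept
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(powerboats, manatees))
s_xx = sum((x - x_bar) ** 2 for x in powerboats)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

prediction = b0 + b1 * 500  # x = 500 means 500,000 registrations
print(f"slope = {b1:.3f}, intercept = {b0:.1f}, prediction at x = 500: {prediction:.1f}")
```

The fitted coefficients should come out close to the 0.125 and -41.4 quoted above. The prediction can differ slightly in the last digit from the hand calculation because the hand calculation uses rounded coefficients.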
Extrapolation
Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line.
This can be an extremely misleading thing to do, as seen here.
[Figure: two panels, each with an axis labeled ‘Height in Inches’ and a ‘!!!’ callout.]
Another example of extrapolation
• In this example, there is a strong linear relationship between the time and the temperature. As the time progresses, the temperature keeps dropping. The danger of extrapolating, of course, comes from the fact that while the observations IN THIS RANGE are linear, the graph does level off at a later point and then begins sloping upwards. So, you could do a regression analysis on this particular period, but you could not extrapolate the results to a date before 11/21 or beyond 1/23.
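One way to protect yourself against accidental extrapolation is to have the prediction step check the x-range of the original data. The sketch below is only an illustration: `predict_with_range_check` is a hypothetical helper, and the coefficients and range (447 to 719 thousand powerboats) are taken from the manatee example in this lecture.

```python
import warnings

def predict_with_range_check(x, b0, b1, x_min, x_max):
    """Predict y-hat = b0 + b1 * x, warning if x lies outside [x_min, x_max]."""
    if x < x_min or x > x_max:
        warnings.warn(
            f"x = {x} is outside the observed range [{x_min}, {x_max}]; "
            "this prediction is an extrapolation and may be misleading."
        )
    return b0 + b1 * x

# x = 500 is inside the observed range, so no warning is raised
y_hat = predict_with_range_check(500, b0=-41.4, b1=0.125, x_min=447, x_max=719)
print(round(y_hat, 1))
```

Calling the same function with, say, x = 1000 still returns a number, but it also raises a warning, which is the point: the formula will happily produce a value even where the model has no data behind it.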
The y intercept
Taken by itself, the y-intercept is often meaningless. In fact, it is sometimes not even a possible value. For example, the y-intercept in our beer / BAC model (-0.013) tells us that at 0 beers, we have a negative blood alcohol content, which makes no sense…
But the intercept is necessary for determining the regression model.
[Figure callout: y-intercept shows negative blood alcohol.]
Categorical variables in scatterplots
Sometimes, even data that is purely quantitative is best divided up into multiple categories. If we neglect to do so, we risk drawing entirely false conclusions.
What may look like a positive linear
relationship is in fact a series of
unrelated negative linear associations.
Plotting different habitats in different
colors allows us to make that important
distinction.
Had we neglected to do so, we would
have likely drawn the straight line
(shown) and incorrectly concluded that
there is a positive linear relationship.
Key Point
• If one of your variables can be divided into categories, plot each datapoint using a different symbol or color depending on its category.
• Another option is simply to use a separate graph for each category. Still, it is often more helpful to keep the two plots on the same chart if doing so allows you to observe differences between the categories.
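The effect described above can be reproduced with a few made-up numbers. In this sketch the three categories and their values are entirely hypothetical (they are NOT the habitat data from the slide), and `pearson_r` is a small helper written from the definition of correlation.

```python
def pearson_r(x, y):
    """Pearson correlation computed from its definition."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / (s_xx * s_yy) ** 0.5

# Hypothetical data: three made-up categories, each with a negative trend
groups = {
    "A": ([1, 2, 3], [5, 4, 3]),
    "B": ([4, 5, 6], [8, 7, 6]),
    "C": ([7, 8, 9], [11, 10, 9]),
}

# Pool all points together, ignoring the category
all_x = [v for xs, _ in groups.values() for v in xs]
all_y = [v for _, ys in groups.values() for v in ys]
pooled_r = pearson_r(all_x, all_y)

# Correlation within each category separately
group_r = {name: pearson_r(xs, ys) for name, (xs, ys) in groups.items()}

print("pooled r:", round(pooled_r, 2))
for name, r in group_r.items():
    print(f"category {name}: r = {round(r, 2)}")
```

Pooled together, the points show a positive correlation, yet within every category the correlation is negative. Plotting each category with its own color or symbol is what makes this visible.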
Comparison of men and women
racing records over time.
Each group shows a very strong
negative linear relationship that
would not be apparent without the
gender categorization.
Relationship between lean body mass
and metabolic rate in men and women.
Both men and women follow the same
positive linear trend, but women show a
stronger association. As a group, males
typically have larger values for both
variables.
Categorical explanatory variables
So far, we’ve drawn our scatterplots using quantitative variables (even when we broke them up into different categories). When the explanatory variable is categorical, a scatterplot might not be your best choice. However, there are ways of comparing different categories side by side.
Level of Education (categorical) vs
Income (quantitative response).
Comparing 5 different categories on
a single graph.
Boxplots are a great choice for this
kind of comparison.
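A basic boxplot is built from a five-number summary of each category, so the comparison can be sketched numerically as well. The incomes below are hypothetical values invented for illustration (NOT real survey data), only three categories are shown to keep it short, and the ‘inclusive’ quartile method is an assumption; different software packages may compute quartiles slightly differently.

```python
import statistics

# Hypothetical incomes (in $1000s) by education level; NOT real survey data
income_by_education = {
    "High school": [22, 28, 30, 35, 41],
    "Bachelor's": [35, 44, 52, 58, 70],
    "Graduate": [48, 60, 66, 75, 95],
}

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the numbers behind a basic boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

summaries = {level: five_number_summary(v) for level, v in income_by_education.items()}
for level, summary in summaries.items():
    print(level, summary)
```

Lining up these summaries category by category is exactly what side-by-side boxplots do visually.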