LINEAR REGRESSION
Ryan Sain, Ph.D.
Regression Introduced
Regression is about prediction: predicting an unknown point based on observations (or measurements).
Example: widgets sold based on advertising.
We can explore known relationships, and we can explore unknown relationships.
The Variables
Outcome variable: the thing we are predicting (e.g., number of widgets sold).
Predictor variable (simple regression): the one variable that you know about (e.g., advertising dollars).
Predictor variables (multiple regression): two or more known variables.
In short, we predict values of a dependent variable (the outcome) using one or more independent variables (the predictors).
The model
Any prediction follows the basic formula: outcome_i = (model) + error_i.
In regression our model contains two things:
the slope of the line that best fits the measured data (b1), and
the intercept of the line at the Y axis (b0).
So our model is: Y_i = (b0 + b1*X_i) + error_i.
Do you recognize this equation? The model is simply a line.
So how do we calculate this line? The method of least squares:
we want the line that is closest to all the data points.
Residuals = deviations (the vertical distances of the actual data points from the line).
Square these residuals to get rid of negatives, then sum them.
The least-squares line is the one that makes this sum as small as possible.
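The least-squares recipe above can be sketched in a few lines of Python; the advertising/widget numbers here are invented purely for illustration:

```python
# Toy data: advertising dollars (predictor X) and widgets sold (outcome Y).
# These values are invented for the example.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope b1: covariance of X and Y divided by the variance of X.
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
# Intercept b0: the least-squares line always passes through (mean_x, mean_y).
b0 = mean_y - b1 * mean_x

# Sum of squared residuals -- the quantity least squares minimizes.
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
```

These closed-form expressions give the b0 and b1 that make ss_res as small as possible; any other line through the same data leaves a larger residual sum of squares.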
How well does this line fit? No line is perfect (there are always residuals).
If our line is a good one, it should be better than a basic line (significantly so).
We compare our line to a basic model, using the same kind of sum:
total deviation = Σ(observed − model)².
The basic model is essentially the mean, and the mean is an awful predictor:
no matter how much you spend on adverts, it predicts that the sales of your widgets stay the same.
Fitness continued
SSt = total sum of squared differences (using the mean as the model).
SSr = residual sum of squares (using our best-fit model); this represents the inaccuracy that remains.
SSm (model sum of squares) = SSt − SSr.
If SSm is large, our model is a big improvement over the simple (mean) model.
Proportion of improvement: R² = SSm / SSt,
the proportion (×100, the percentage) of variation in the outcome that can be explained by our model.
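These sums of squares can be sketched directly in Python. The data are invented for illustration, and the least-squares fit is recomputed here so the snippet stands on its own:

```python
# Invented toy data: advertising (X) and widgets sold (Y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

# SSt: squared differences from the basic model (the mean).
ss_t = sum((yi - mean_y) ** 2 for yi in y)
# SSr: squared residuals from the best-fit line.
ss_r = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
# SSm: the improvement of the line over the mean.
ss_m = ss_t - ss_r
r2 = ss_m / ss_t   # proportion of outcome variation the model explains
```

On this toy data the line explains nearly all of the variation, so r2 is close to 1; a useless predictor would leave r2 near 0.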
More fitness
You can assess fit using an F test as well. F is simply systematic variance / unsystematic variance.
In regression that means the improvement due to the model (SSm, systematic) against the difference between the model and the observed data (SSr, unsystematic).
But we need to look at mean squares, because an F test uses the average sums of squares. So we divide each SS by its degrees of freedom:
for SSm, the number of predictors in the model;
for SSr, the number of observations minus the number of parameters being estimated (the number of beta coefficients, intercept included).
F = MSm / MSr
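The mean squares and the F ratio, sketched on invented data (the fit is repeated so this runs on its own):

```python
# Invented toy data: advertising (X) and widgets sold (Y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x
ss_t = sum((yi - mean_y) ** 2 for yi in y)
ss_r = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_m = ss_t - ss_r

p = 1                  # predictors in the model (simple regression)
df_m = p               # degrees of freedom for SSm
df_r = n - p - 1       # observations minus estimated parameters (b0 and b1)
ms_m = ss_m / df_m     # mean square for the model
ms_r = ss_r / df_r     # mean square for the residuals
f_stat = ms_m / ms_r   # systematic / unsystematic variance
```

A large F (relative to the F distribution with df_m and df_r degrees of freedom) says the model's improvement over the mean is bigger than its leftover error.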
Individual Predictors
The coefficient b is essentially the gradient (slope) of the line.
If the predictor is not valuable, it will predict no change in the outcome as the predictor changes; this is b = 0, which is exactly what the mean model does.
If the predictor is valuable, then its b will be significantly different from 0.
Individual Predictors cont.
To test whether b is different from 0 we use a t-test: we compare how big the b value is relative to the amount of error in that estimate.
For that we use the standard error of the b value:
t = (b_observed − b_expected) / SE_b
Since the expected value is 0 (no change), we simply divide the observed b value by its standard error to get the t score.
Degrees of freedom: N − p − 1 (p = number of predictors).
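A sketch of this t-test on invented data. The standard-error formula used here, SE_b = sqrt(MSr / Σ(x − mean_x)²), is the textbook one for simple regression rather than something stated on the slide; note that with one predictor, t² equals the model's F:

```python
import math

# Invented toy data: advertising (X) and widgets sold (Y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)   # spread of the predictor
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
b0 = mean_y - b1 * mean_x

p = 1
ss_r = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ms_r = ss_r / (n - p - 1)        # residual mean square

se_b = math.sqrt(ms_r / sxx)     # standard error of the slope
t = (b1 - 0) / se_b              # b_expected = 0 under the null hypothesis
df = n - p - 1                   # degrees of freedom: N - p - 1
```

A big t relative to the t distribution with df degrees of freedom means the slope is significantly different from 0, i.e., the predictor is doing real work.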