LINEAR REGRESSION
Ryan Sain, Ph.D.
Regression Introduced
Regression is about prediction: predicting an unknown point based on observations (or measurements).
Example: widgets sold based on advertising.
We can explore known relationships, and we can explore unknown relationships.
The Variables
Outcome variable: the thing we are predicting (e.g., number of widgets sold).
Predictor variable (simple regression): the one variable that you know about (e.g., advertising dollars).
Predictor variables (multiple regression): two or more known variables.
In short, we predict values of a dependent variable (the outcome) using one or more independent variables (the predictors).
The model
Any prediction follows the basic formula: outcome_i = (model) + error_i.
In regression our model contains two things:
the slope of the line that best fits the measured data (b1), and
the intercept of the line at the Y axis (b0).
So our model is: Y_i = (b0 + b1*X_i) + error_i.
Do you recognize this equation? The model is simply a line.
So how do we calculate this line? The method of least squares:
we want the line that is closest to all the data points.
Residuals = deviations (the vertical distances of the actual data points from the line).
Square these residuals to get rid of negatives, then sum them.
The least-squares line is the one that makes this sum as small as possible.
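The least-squares recipe above can be sketched in a few lines of Python; the advertising/widget numbers here are invented purely for illustration:

```python
# Toy data: advertising dollars (predictor X) and widgets sold (outcome Y).
# These values are invented for the example.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope b1: covariance of X and Y divided by the variance of X.
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
# Intercept b0: the least-squares line always passes through (mean_x, mean_y).
b0 = mean_y - b1 * mean_x

# Sum of squared residuals -- the quantity least squares minimizes.
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
```

These closed-form expressions give the b0 and b1 that make ss_res as small as possible; any other line through the same data leaves a larger residual sum of squares.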
How well does this line fit? No line is perfect (there are always residuals).
If our line is a good one, it should be better than a basic line (significantly so).
We compare our line to a basic model, using the same kind of sum:
total deviation = Σ(observed − model)².
The basic model is essentially the mean, and the mean is an awful predictor:
no matter how much you spend on adverts, it predicts that the sales of your widgets stay the same.
Fitness continued
SSt = total sum of squared differences (using the mean as the model).
SSr = residual sum of squares (using our best-fit model); this represents the inaccuracy that remains.
SSm (model sum of squares) = SSt − SSr.
If SSm is large, our model is a big improvement over the simple (mean) model.
Proportion of improvement: R² = SSm / SSt,
the proportion (×100, the percentage) of variation in the outcome that can be explained by our model.
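These sums of squares can be sketched directly in Python. The data are invented for illustration, and the least-squares fit is recomputed here so the snippet stands on its own:

```python
# Invented toy data: advertising (X) and widgets sold (Y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

# SSt: squared differences from the basic model (the mean).
ss_t = sum((yi - mean_y) ** 2 for yi in y)
# SSr: squared residuals from the best-fit line.
ss_r = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
# SSm: the improvement of the line over the mean.
ss_m = ss_t - ss_r
r2 = ss_m / ss_t   # proportion of outcome variation the model explains
```

On this toy data the line explains nearly all of the variation, so r2 is close to 1; a useless predictor would leave r2 near 0.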
More fitness
You can assess fit using an F test as well. F is simply systematic variance / unsystematic variance.
In regression that means the improvement due to the model (SSm, systematic) against the difference between the model and the observed data (SSr, unsystematic).
But we need to look at mean squares, because an F test uses the average sums of squares. So we divide each SS by its degrees of freedom:
for SSm, the number of predictors in the model;
for SSr, the number of observations minus the number of parameters being estimated (the number of beta coefficients, intercept included).
F = MSm / MSr
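The mean squares and the F ratio, sketched on invented data (the fit is repeated so this runs on its own):

```python
# Invented toy data: advertising (X) and widgets sold (Y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x
ss_t = sum((yi - mean_y) ** 2 for yi in y)
ss_r = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_m = ss_t - ss_r

p = 1                  # predictors in the model (simple regression)
df_m = p               # degrees of freedom for SSm
df_r = n - p - 1       # observations minus estimated parameters (b0 and b1)
ms_m = ss_m / df_m     # mean square for the model
ms_r = ss_r / df_r     # mean square for the residuals
f_stat = ms_m / ms_r   # systematic / unsystematic variance
```

A large F (relative to the F distribution with df_m and df_r degrees of freedom) says the model's improvement over the mean is bigger than its leftover error.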
Individual Predictors
The coefficient b is essentially the gradient (slope) of the line.
If the predictor is not valuable, it will predict no change in the outcome as the predictor changes; this is b = 0, which is exactly what the mean model does.
If the predictor is valuable, then its b will be significantly different from 0.
Individual Predictors cont.
To test whether b is different from 0 we use a t-test: we compare how big the b value is relative to the amount of error in that estimate.
For that we use the standard error of the b value:
t = (b_observed − b_expected) / SE_b
Since the expected value is 0 (no change), we simply divide the observed b value by its standard error to get the t score.
Degrees of freedom: N − p − 1 (p = number of predictors).
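A sketch of this t-test on invented data. The standard-error formula used here, SE_b = sqrt(MSr / Σ(x − mean_x)²), is the textbook one for simple regression rather than something stated on the slide; note that with one predictor, t² equals the model's F:

```python
import math

# Invented toy data: advertising (X) and widgets sold (Y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)   # spread of the predictor
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
b0 = mean_y - b1 * mean_x

p = 1
ss_r = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ms_r = ss_r / (n - p - 1)        # residual mean square

se_b = math.sqrt(ms_r / sxx)     # standard error of the slope
t = (b1 - 0) / se_b              # b_expected = 0 under the null hypothesis
df = n - p - 1                   # degrees of freedom: N - p - 1
```

A big t relative to the t distribution with df degrees of freedom means the slope is significantly different from 0, i.e., the predictor is doing real work.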