Warsaw Summer School 2015, OSU Study Abroad Program Regression

Linear Relationship

The line = a mathematical function that can be expressed through the formula Y = a + bX, where Y & X are our variables.

Y, the dependent variable, is expressed as a linear function of the independent (explanatory) variable X.

Linear Relationship

The constant a = value of Y at the point in which the line Y = a + bX intersects the Y-axis (also called the intercept).

The slope b equals the change in Y for a one-unit increase in X (a one-unit increase in X corresponds to a change of b units in Y). The slope describes the rate of change in Y-values as X increases.

Verbal interpretation of the slope of the line: “Rise over run”: the rise divided by the run (the change in the vertical distance divided by the change in the horizontal distance).

Cartesian Coordinate System

Variables X, Y and their linear function:

The formula Y = a + bX expresses the dependent (response) variable Y as a linear function of the independent (explanatory) variable X. The formula maps out a straight-line graph with slope b and Y-intercept a.

Basics

Linear Relationship: Y = a + bX

The constant a is the value of Y when X = 0. For X = 0 we have: Y = a + b*0 = a

The constant a is the value of Y where the line Y = a + bX intersects the Y-axis.

The slope b equals the change in Y for a one-unit increase in X. This means that a one-unit increase in X corresponds to a change of b units in Y. Thus, the slope describes the rate of change in the Y-values as X increases. Generally, for any point (X, Y) on the line with X ≠ 0,

b = (Y - a) / X
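As a small illustration of the intercept and the slope, here is a Python sketch. The values a = 2.0 and b = 0.5 and the helper name `line` are hypothetical, chosen only for this example.

```python
def line(x, a, b):
    """Return Y = a + b*X for intercept a and slope b."""
    return a + b * x

a, b = 2.0, 0.5  # hypothetical intercept and slope

# The intercept is the value of Y at X = 0.
print(line(0, a, b))  # 2.0

# A one-unit increase in X changes Y by b units.
print(line(4, a, b) - line(3, a, b))  # 0.5

# Rise over run between two points on the line recovers the slope.
x1, x2 = 1, 5
rise = line(x2, a, b) - line(x1, a, b)
run = x2 - x1
slope = rise / run
print(slope)  # 0.5

# For any point on the line with X != 0, b = (Y - a) / X.
print((line(4, a, b) - a) / 4)  # 0.5
```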

Model vs Reality

The function Y = a + bX is a model.

In reality, the observed points do not all fall on one line.

The Scatter-gram and the Least Squares Method

The graphical plot of observed values (X,Y) is called a

- scatter-gram

- scatter-diagram

- scatter-plot.

A regression function is a function that describes how the expected value of the dependent (response) variable Y changes according to the values of an independent (explanatory) variable X.

Regression

This expected value is estimated by a linear function:

• Ŷ = a + bX

Ŷ = predicted value for the dependent variable, Y

a = the intercept (the value of Y when X = 0)

b = the regression coefficient (the slope), indicating the amount of change in Y given a unit change in X

X = the independent variable

Regression

Ŷ = a + bX

b = [Σ(X - X̄)(Y - Ȳ)] / Σ(X - X̄)²

a = Ȳ - b*X̄

where X̄ and Ȳ are the means of X and Y.
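The estimates can be sketched in Python as follows. The five data points are made up for illustration, and the function name `least_squares` is an assumption, not part of the course materials. The slope is the sum of products divided by the sum of squares for X, and the intercept is the mean of Y minus b times the mean of X.

```python
def least_squares(xs, ys):
    """Return (a, b) for the prediction equation Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations and sum of squares for X.
    sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = sp / ss_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, not taken from the course.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
print(round(a, 4), round(b, 4))  # 2.2 0.6
```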

Method of Least Squares

The prediction errors, called residuals, are defined as the differences between observed and predicted values of Y:

E = Y - (a + bX) = Y - Ŷ

The regression line minimizes the sum of squared errors: SSE = Σ(Y - Ŷ)²

Method of Least Squares

The method of least squares provides the prediction equation Ŷ = a + bX having the minimal value of SSE.

The least squares estimates a and b are the values determining the prediction equation for which the sum of squared errors SSE is a minimum.
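One way to see the minimum property is to compare SSE for the least squares line with SSE for a few other lines. The sketch below uses made-up data for which the least squares estimates are a = 2.2 and b = 0.6. The function name `sse` and the candidate offsets are assumptions chosen for the example, and checking a handful of candidates is an illustration, not a proof.

```python
def sse(xs, ys, a, b):
    """Sum of squared errors for the line Y-hat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, not taken from the course.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# SSE for the least squares line of these data (a = 2.2, b = 0.6).
best = sse(xs, ys, 2.2, 0.6)

# SSE for nearby lines with a shifted intercept and/or slope.
others = [sse(xs, ys, 2.2 + da, 0.6 + db)
          for da in (-0.5, 0.0, 0.5)
          for db in (-0.2, 0.0, 0.2)
          if (da, db) != (0.0, 0.0)]

print(round(best, 4))      # 2.4
print(min(others) > best)  # True
```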

Covariance

In regression analysis we ask: to what extent can we predict Y knowing our variable X? Prediction means that values of X and Y go together, or co-vary.

Covariation is measured by the sum of products, or SP:

• SP = Σ(X - X̄)(Y - Ȳ)

Dividing SP by n - 1 gives the sample covariance.

Sum of squares for X:

• SSx = Σ(X - X̄)²

Note that in the regression equation of Y on X:

• Ŷ = bX + a

• b = SP / SSx
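The relationship between SP, SSx and the slope can be checked numerically. The sketch below uses made-up data. Because SP and SSx are both divided by n - 1 to obtain the sample covariance and the sample variance of X, the slope is also the covariance divided by the variance of X.

```python
# Hypothetical data, not taken from the course.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sum of products and sum of squares for X.
sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_x = sum((x - x_bar) ** 2 for x in xs)

# Dividing by n - 1 gives the sample covariance and the sample variance.
cov_xy = sp / (n - 1)
var_x = ss_x / (n - 1)

b = sp / ss_x
print(sp, ss_x, b)                      # 6.0 10.0 0.6
print(abs(cov_xy / var_x - b) < 1e-12)  # True
```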

Interpretation of b

The slope of the line, b, has the verbal interpretation “rise over run”-- that is, the rise divided by the run. This means that the change in the vertical distance is divided by the change in the horizontal distance.

The steeper the hill, the larger the slope. You go “up” more rapidly than you go over. The line can also have a negative slope.

When the slope is negative, you are going “downhill” rather than “uphill.”

• b > 0, positive relationship

• b < 0, negative relationship

• b = 0, no linear relationship

Unstandardized and standardized coefficients

If both variables, IV and DV, are expressed in z-scores, the constant a equals zero.

We then obtain Beta coefficients, which tell us how many standard deviations the DV changes when the IV changes by one standard deviation.
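A short Python sketch of standardization, using made-up data and hypothetical helper names (`sd`, `z_scores`, `least_squares`). With a single IV, the Beta coefficient equals b multiplied by the ratio of the standard deviations of X and Y, which is also Pearson's r.

```python
import math

def sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def z_scores(values):
    """Express values in standard deviation units around their mean."""
    mean = sum(values) / len(values)
    s = sd(values)
    return [(v - mean) / s for v in values]

def least_squares(xs, ys):
    """Return (a, b) for Y-hat = a + b*X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = sp / ss_x
    return y_bar - b * x_bar, b

# Hypothetical data, not taken from the course.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
a_std, beta = least_squares(z_scores(xs), z_scores(ys))

print(abs(a_std) < 1e-12)                       # True: the constant vanishes
print(round(beta, 4))                           # 0.7746
print(abs(beta - b * sd(xs) / sd(ys)) < 1e-12)  # True
```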

Two and more IVs

Ŷ = a + b1X1 + b2X2

Ŷ = β1X1 + β2X2

Ŷ = a + b1X1 + b2X2 + ... + bk-1Xk-1 + bkXk

Ŷ = β1X1 + β2X2 + ... + βk-1Xk-1 + βkXk

Coefficients and variables

The estimated parameters b1, b2, ..., bk are partial regression coefficients. They differ from the regression coefficients for the bivariate relationships between Y and each explanatory variable.
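The difference between a partial and a bivariate coefficient can be seen in a small example with two IVs. The sketch below is only an illustration: the data are constructed so that Y = 1 + 2*X1 + 3*X2 exactly, and the two-predictor solution is written out from the least squares normal equations for centered variables. The helper name `centered_sum` is hypothetical.

```python
def centered_sum(us, vs):
    """Sum of products of deviations of us and vs from their means."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    return sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))

# Hypothetical data: y was built as 1 + 2*x1 + 3*x2, and x1, x2 are correlated.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [9, 8, 19, 18, 26]
n = len(y)

s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)

# Partial regression coefficients from the two normal equations.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / det
b2 = (s2y * s11 - s1y * s12) / det
a = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n

# Bivariate slope of Y on X1 alone, ignoring X2.
b1_bivariate = s1y / s11

print(a, b1, b2)     # 1.0 2.0 3.0
print(b1_bivariate)  # 4.4
```

Here the bivariate slope (4.4) is larger than the partial coefficient (2.0) because X1 also picks up part of the effect of the correlated X2.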

Three criteria for choosing the number of independent (explanatory) variables:

• (1) Theory

• (2) Parsimony

• (3) Sample size

R2

Coefficient of determination (explained variance) for two variables

• r2 = [SS(total) - SS(error)] / SS(total)

• Stata provides a value of the coefficient of determination, R2:

• R2 = [SS(total) - SS(error)] / SS(total)

Sum of squares

R2 is the proportion of variance in Y explained by X1, X2, ..., Xk.

Therefore, 1 - R2 is the proportion of unexplained variance.
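For a bivariate example, the coefficient of determination can be computed directly from the two sums of squares. The data below are made up, and the variable names are chosen only for this sketch.

```python
# Hypothetical data, not taken from the course.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares estimates.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# Total and error sums of squares.
ss_total = sum((y - y_bar) ** 2 for y in ys)
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r2 = (ss_total - ss_error) / ss_total
print(round(r2, 4))      # 0.6  (explained proportion)
print(round(1 - r2, 4))  # 0.4  (unexplained proportion)
```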

Adjusted R-square

• Adjusted R-square is a modification of R-square that adjusts for the number of terms in a model. R-square never decreases when a new term is added to a model, but adjusted R-square increases only if the new term improves the model more than would be expected by chance.

Sum of Squares

The Regression SUM of SQUARES is defined:

SS(regression) = SS(total) – SS(error)

Mean square

The Regression MEAN SQUARE

MSS(regression) = SS(regression) / df-v

df-v = k, where k is the number of independent variables

The MEAN SQUARE ERROR

MSS(error) = SS(error) / df-t

df-t = n - (k + 1), where n is the number of cases and k is the number of independent variables.

F

The null hypothesis

H0: b1 = b2 = … = bk = 0

• F = MSS(regression) / MSS(error)

The sampling distribution of this statistic is the F-distribution
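The F statistic can be assembled from the sums of squares and their degrees of freedom. The following Python sketch uses hypothetical values (SS(total) = 6.0, SS(error) = 2.4, n = 5 cases, k = 1 independent variable), and the function name `f_statistic` is an assumption. Looking up the p-value in the F-distribution is left out.

```python
def f_statistic(ss_total, ss_error, n, k):
    """F = MSS(regression) / MSS(error) for k IVs and n cases."""
    ss_regression = ss_total - ss_error
    ms_regression = ss_regression / k    # df = k
    ms_error = ss_error / (n - (k + 1))  # df = n - (k + 1)
    return ms_regression / ms_error

# Hypothetical sums of squares for a bivariate regression with 5 cases.
f = f_statistic(ss_total=6.0, ss_error=2.4, n=5, k=1)
print(round(f, 4))  # 4.5
```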

t

The test of H0: b = 0 evaluates whether Y and X are statistically dependent, ignoring other variables.

We use the t statistic:

• t = b / σb, where σb is the standard error of b

In bivariate regression, the standard error is estimated as:

• σb = √[SS(error) / (n - 2)] / √SSx, where SSx = Σ(X - X̄)²
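A minimal Python sketch of the t statistic for a bivariate regression, using made-up data. It assumes the usual estimate of the standard error of b: the square root of SS(error) / (n - 2), divided by the square root of the sum of squares for X. In the bivariate case, t squared equals the F statistic of the regression.

```python
import math

# Hypothetical data, not taken from the course.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares estimates.
ss_x = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
a = y_bar - b * x_bar

# Residual sum of squares and the standard error of b.
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
se_b = math.sqrt(ss_error / (n - 2)) / math.sqrt(ss_x)

t = b / se_b
print(round(se_b, 4), round(t, 4))  # 0.2828 2.1213
print(round(t ** 2, 4))             # 4.5
```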

ANOVA

ANALYSIS OF VARIANCE

• How much of the variance is explained by values of the nominal variable?

• Total sum of squared deviations from the mean:

• SS(total) = Σ [X – XM (total)]²

ANOVA

The between-group variation represents the squared deviations of every group mean from the total mean, summed over all cases:

• SS(between) = Σ [XM (group) – XM (total)]²

The within-group sum of squares is the sum of squared deviations of every raw score from its group mean:

• SS(within) = Σ [X – XM (group)]²

ANOVA

Mean Squares:

• MSS(between) = SS(between) / df(between)

where df(between) = k – 1

• MSS(within) = SS(within) / df(within)

where df(within) = N – k

Here k is the number of groups and N is the total number of cases.

F

F-statistic

• F = MSS(between) / MSS(within)

• The larger the F-value, the greater the impact of a group on the dependent variable.
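The one-way ANOVA quantities can be computed step by step. The Python sketch below uses three made-up groups of three scores each, and the group labels and variable names are hypothetical. Each group mean is counted once per case, so the between-group sum is weighted by group size.

```python
# Hypothetical scores for three groups of a nominal variable.
groups = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}

scores = [x for g in groups.values() for x in g]
n_total = len(scores)  # N, total number of cases
k = len(groups)        # number of groups
grand_mean = sum(scores) / n_total

# Between-group: squared deviation of each group mean, once per case.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())

# Within-group: squared deviation of each score from its group mean.
ss_within = sum((x - sum(g) / len(g)) ** 2
                for g in groups.values() for x in g)

ss_total = sum((x - grand_mean) ** 2 for x in scores)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f = ms_between / ms_within

print(ss_between, ss_within, ss_total)  # 54.0 6.0 60.0
print(f)                                # 27.0
```

Note that SS(between) + SS(within) = SS(total) in this example.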

F

Compare:

• F = MSS(between) / MSS(within)

• F = MSS(regression) / MSS(error) (Regression ANOVA)

Stata

ANOVA

• Source - Model, Residual, and Total. The Total variance is partitioned into the variance which can be explained by the independent variables (Model) and the variance which is not explained by the independent variables (Residual, sometimes called Error). 

• SS - Sum of Squares associated with the three sources of variance, Total, Model and Residual.

• df - Degrees of freedom associated with the sources of variance. The total variance has N-1 degrees of freedom. The model degrees of freedom equal the number of estimated coefficients, including the intercept, minus 1. The Residual degrees of freedom is the DF total minus the DF model.

• MS - Mean Squares, the Sum of Squares divided by their respective DF. 

Regression

• Number of obs - The number of observations used in the regression analysis.

• F - The F-statistic is the Mean Square Model divided by the Mean Square Residual. The numbers in parentheses are the Model and Residual degrees of freedom.

• Prob > F - This is the p-value associated with the above F-statistic.  It is used in testing the null hypothesis that all of the model coefficients are 0.

• R-squared - R-Squared is the proportion of variance in the dependent variable which can be explained by the independent variables. 

• Adj R-squared - This is an adjustment of the R-squared that penalizes the addition of extraneous predictors to the model. Adjusted R-squared is computed using the formula 1 - (1 - Rsq)(N - 1) / (N - k - 1), where k is the number of predictors.

• Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Mean Square Error).
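The last two quantities can be reproduced by hand. The following Python sketch is not Stata code, and the function names and input values (R-squared = 0.6, SS residual = 2.4, N = 5, k = 1) are hypothetical. It computes adjusted R-squared as 1 - (1 - Rsq)(N - 1) / (N - k - 1) and Root MSE as the square root of the Mean Square Residual.

```python
import math

def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def root_mse(ss_residual, n, k):
    """Square root of the Mean Square Residual (residual df = n - k - 1)."""
    return math.sqrt(ss_residual / (n - k - 1))

# Hypothetical values for a regression with 5 cases and 1 predictor.
print(round(adjusted_r2(0.6, 5, 1), 4))  # 0.4667
print(round(root_mse(2.4, 5, 1), 4))     # 0.8944
```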
