Multiple Regression Analysis and Stepwise Regression

# Regression


History: The earliest form of regression was the method of least squares, which was published by Legendre in 1805 and by Gauss in 1809.

The term "regression" was used by the British biometrician Sir Francis Galton (1822-1911) to describe a biological phenomenon.

Galton's work on the inherited characteristics of sweet peas led to the initial conception of linear regression.

Introduction: Regression is a statistical technique for investigating and modeling the relationship between variables.

Applications of regression are numerous and occur in almost every field, including engineering, the physical and the social sciences, and the biological sciences.

Usually, the investigator seeks to ascertain the causal effect of one variable upon another—the effect of a price increase upon demand, for example, or the effect of changes in the money supply upon the inflation rate.

Definition: Regression is the measure of the average relationship between two or more variables in terms of the original units of the data. It is among the most widely used statistical techniques in the social sciences, and it is also widely used in the biological and physical sciences. The regression equation is

$$y = a + bx$$

with slope $b$ and intercept $a$ given by

$$b = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2}, \qquad a = \frac{\sum Y - b\sum X}{N}$$
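
As a minimal sketch of the slope and intercept formulas above, the following Python function computes a and b from the raw sums. The function name `fit_line` and the five data points are hypothetical and chosen only for illustration; the points lie exactly on y = 1 + 2x so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (a, b) for y = a + b*x using the raw-sum formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: b = (N*SumXY - SumX*SumY) / (N*SumX2 - (SumX)^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = (SumY - b*SumX) / N
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(a, b)  # 1.0 2.0
```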

Review of simple linear regression: A simple linear regression is carried out to estimate the relationship between a dependent variable, Y, and a single explanatory variable, X, given a set of data that includes observations of both variables for a particular population.

• For example: a real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)

• A random sample of 10 houses is selected

Dependent variable (Y) = house price

Independent variable (X) = square feet

Simple Linear Regression Model

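$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where $Y_i$ is the dependent variable, $\beta_0$ is the population Y intercept, $\beta_1$ is the population slope coefficient, $X_i$ is the independent variable, and $\varepsilon_i$ is the random error term. $\beta_0 + \beta_1 X_i$ is the linear component and $\varepsilon_i$ is the random error component.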

The simple linear regression equation provides an estimate of the population regression line:

$$\hat{Y}_i = b_0 + b_1 X_i$$

where $\hat{Y}_i$ is the estimated (or predicted) Y value for observation i, $b_0$ is the estimate of the regression intercept, $b_1$ is the estimate of the regression slope, and $X_i$ is the value of X for observation i.

The individual random error terms $e_i$ have a mean of zero.

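Prediction equation is given by:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

Estimation of coefficients:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Where

$$S_{xx} = \sum_i (x_i - \bar{x})^2, \qquad S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$$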

Measures of Variation

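Total variation is made up of two parts:

$$SST = SSR + SSE$$

where SST is the Total Sum of Squares, SSR is the Regression Sum of Squares, and SSE is the Error Sum of Squares:

$$SST = \sum (Y_i - \bar{Y})^2, \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2, \qquad SSE = \sum (Y_i - \hat{Y}_i)^2$$

where:

$\bar{Y}$ = Average value of the dependent variable

$Y_i$ = Observed values of the dependent variable

$\hat{Y}_i$ = Predicted value of Y for the given $X_i$ value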

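[Figure: Measures of Variation. Plot of Y against X showing, for an observation at $X_i$, the deviations behind SST = Σ(Y_i − Ȳ)², SSE = Σ(Y_i − Ŷ_i)², and SSR = Σ(Ŷ_i − Ȳ)² relative to the fitted line and the mean Ȳ.]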

Coefficient of Determination, r²

• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable

• The coefficient of determination is also called r-squared and is denoted as $r^2$

$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

note: $0 \le r^2 \le 1$
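
The review above can be tied together in a short Python sketch that estimates the slope and intercept from S_xy and S_xx, then splits the total variation into SSR and SSE and forms r². The function name `variation` and the five data points are hypothetical; they are not the house-price sample mentioned earlier.

```python
def variation(xs, ys):
    """Fit y = b0 + b1*x by least squares and return the sums of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx              # estimated slope
    b0 = y_bar - b1 * x_bar       # estimated intercept
    y_hat = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)                # total
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # error
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression
    return {"b0": b0, "b1": b1, "sst": sst, "sse": sse, "ssr": ssr,
            "r2": ssr / sst}


# Hypothetical data, for illustration only
result = variation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(result)
```

For these points the fit is b0 = 2.2 and b1 = 0.6, with SST = 6, SSR = 3.6, SSE = 2.4, so r² = 0.6 (up to floating-point rounding).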

Multiple linear regression

Introduction: The general purpose of multiple regression (the term was first used by Pearson, 1908) is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable.

Definition: A regression model that relates a response variable to two or more explanatory (regressor) variables by fitting a linear equation to observed data is called a multiple regression model. Every set of values of the independent variables x is associated with a value of the dependent variable y. Suppose that the yield in pounds of conversion in a chemical process depends on temperature and catalyst concentration. A multiple regression model that might describe the relationship is

y=β0+β1x1+β2x2+ε

where y denotes the yield, x1 denotes the temperature, and x2 denotes the catalyst concentration. This is a multiple linear regression model with two regressor variables. The term linear is used because the equation is a linear function of the unknown parameters β0, β1, and β2; ε is the error term.

The parameter β1 indicates the expected change in the response (y) per unit change in x1 when x2 is held constant. Similarly, β2 measures the expected change in y per unit change in x2 when x1 is held constant.

In general, the response y may be related to k regressor (or predictor) variables. The model

y = β0 + β1x1 + β2x2 + … + βkxk + ε

is a multiple linear regression model with k regressors. The parameters βj, j = 0, 1, …, k, are called regression coefficients.

The parameter βj represents the expected change in the response (y) per unit change in xj when all of the remaining regressor variables xi (i ≠ j) are held constant. For this reason the parameters βj, j = 1, …, k, are often called partial regression coefficients.
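
To make the idea of partial regression coefficients concrete, here is a small sketch that fits the two-regressor yield model by least squares with NumPy. The temperature and catalyst values are hypothetical, and the response is generated without noise from assumed coefficients (β0 = 10, β1 = 2, β2 = 5) so that the fit can be checked against known values.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 10 + 2*x1 + 5*x2
x1 = np.array([80.0, 90.0, 100.0, 110.0, 120.0, 130.0])  # temperature
x2 = np.array([1.0, 2.0, 1.5, 2.5, 1.0, 3.0])            # catalyst concentration
y = 10.0 + 2.0 * x1 + 5.0 * x2                           # yield

# Design matrix: a column of ones for the intercept, then the regressors
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta


def predict(temperature, catalyst):
    """Fitted yield for one (temperature, catalyst) pair."""
    return b0 + b1 * temperature + b2 * catalyst


# Raising x1 by one unit while holding x2 fixed changes the fit by b1
delta = predict(101.0, 2.0) - predict(100.0, 2.0)
print(beta, delta)
```

The last two lines illustrate the interpretation given above: with x2 held constant, a one-unit change in x1 moves the predicted response by the coefficient of x1.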

Assumptions of Regression

• For any given set of values of x1, x2, …, xk, the random error ε has a probability distribution with the following properties:

1. Mean equal to 0

2. Variance equal to σ²

3. Normal distribution

4. Random errors are independent

Regression Analysis: Model Building

• General Linear Model

• Determining When to Add or Delete Variables

• Analysis of a Larger Problem

• Multiple Regression Approach to Analysis of Variance

General Linear Model

Models in which the parameters (β0, β1, . . . , βp) all have exponents of one are called linear models.

• First-Order Model with One Predictor Variable

y = β0 + β1x1 + ε

Variable Selection Procedures

• Stepwise Regression

• Forward Selection

• Backward Elimination

These procedures are iterative: one independent variable at a time is added or deleted, based on the F statistic.

Variable Selection Procedures

• F Test

• To test whether the addition of x2 to a model involving x1 (or the deletion of x2 from a model involving x1 and x2) is statistically significant:

$$F = \frac{\left(SSE(\text{reduced}) - SSE(\text{full})\right) / \text{number of extra terms}}{MSE(\text{full})}$$

• The overall regression F statistic is F0 = MSR / MSRes, where MSR = SSR / k and k is the number of regressors.

The p-value corresponding to the F statistic is the criterion used to determine if a variable should be added or deleted.
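
As a worked illustration of the partial F statistic, the following sketch divides the drop in SSE per extra term by the mean square error of the full model. The function name `partial_f`, its parameter names, and the numbers in the example are hypothetical; `df_full` is assumed to be the residual degrees of freedom of the full model (n minus the number of fitted parameters).

```python
def partial_f(sse_reduced, sse_full, extra_terms, df_full):
    """Partial F statistic for comparing a reduced model with a full model."""
    mse_full = sse_full / df_full                            # MSE(full)
    return ((sse_reduced - sse_full) / extra_terms) / mse_full


# Hypothetical sums of squares: adding one term lowers SSE from 120 to 80,
# and the full model has 20 residual degrees of freedom.
f_value = partial_f(sse_reduced=120.0, sse_full=80.0, extra_terms=1, df_full=20)
print(f_value)  # 10.0
```

Here MSE(full) = 80 / 20 = 4 and the reduction in SSE is 40, so F = 40 / 4 = 10. The statistic would then be compared with an F distribution on (1, 20) degrees of freedom.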

Forward Selection

• This procedure is similar to stepwise regression, but does not permit a variable to be deleted once it has entered.

• This forward-selection procedure starts with no independent variables.

• It adds variables one at a time as long as a significant reduction in the error sum of squares (SSE) can be achieved.
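
The forward-selection steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: it uses a fixed F-to-enter threshold (`f_in`, a hypothetical default of 4.0) in place of a p-value, and the data are simulated so that only columns 0 and 2 truly affect the response.

```python
import numpy as np


def sse(X, y, cols):
    """Error sum of squares for y regressed on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def forward_selection(X, y, f_in=4.0):
    """Add one variable at a time while the best partial F exceeds f_in."""
    n, k = X.shape
    selected = []                       # start with no independent variables
    current = sse(X, y, selected)
    while len(selected) < k:
        df = n - len(selected) - 2      # residual df after adding one variable
        if df <= 0:
            break
        best = None
        for c in range(k):
            if c in selected:
                continue
            new = sse(X, y, selected + [c])
            f = (current - new) / (new / df)
            if best is None or f > best[0]:
                best = (f, c, new)
        if best[0] < f_in:              # no significant reduction in SSE
            break
        selected.append(best[1])
        current = best[2]
    return selected


# Simulated (hypothetical) data: y depends on columns 0 and 2 only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=100)
selected = forward_selection(X, y)
print(selected)
```

Once a column is appended to `selected` it is never removed, which is the property that distinguishes forward selection from the stepwise procedure.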

Backward Elimination

• This procedure begins with a model that includes all the independent variables the modeler wants considered.

• It then attempts to delete one variable at a time by determining whether the least significant variable currently in the model can be removed because its p-value is greater than the user-specified or default value.

• Once a variable has been removed from the model it cannot re-enter at a subsequent step.
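
A minimal sketch of backward elimination, under the same assumptions as any F-threshold variant: a fixed F-to-remove cutoff (`f_out`, a hypothetical default of 4.0) stands in for the p-value rule, and the data are simulated so that column 1 is pure noise.

```python
import numpy as np


def sse(X, y, cols):
    """Error sum of squares for y regressed on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def backward_elimination(X, y, f_out=4.0):
    """Start from all variables; drop the least significant while F < f_out."""
    n, k = X.shape
    kept = list(range(k))               # begin with every candidate variable
    while kept:
        full = sse(X, y, kept)
        df_full = n - len(kept) - 1     # residual df of the current model
        scores = []
        for c in kept:
            rest = [v for v in kept if v != c]
            f = (sse(X, y, rest) - full) / (full / df_full)
            scores.append((f, c))
        f_min, c_min = min(scores)      # least significant variable
        if f_min >= f_out:
            break                       # every remaining variable is significant
        kept.remove(c_min)              # removed variables never re-enter
    return kept


# Simulated (hypothetical) data: y depends on columns 0 and 2 only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=100)
kept = backward_elimination(X, y)
print(kept)
```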

Stepwise regression:

A procedure of simultaneous forward and backward selection is also available.

In a stepwise regression, predictor variables are entered into the regression equation one at a time based upon statistical criteria. At each step in the analysis, the predictor variable that contributes the most to the prediction equation, in terms of increasing the multiple correlation R, is entered next. This process is continued only if additional variables add something statistically significant to the regression equation.

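Variables are chosen in the following manner:

$$\text{delete } x_i \text{ if } \frac{\hat{\beta}_i^2}{\hat{\sigma}^2 \left[(Z'Z)^{-1}\right]_{ii}} < F_{out} = F_{1,\,n-r-1}(p_{out})$$

$$\text{enter } x_j \text{ if } \frac{(n-r-2)\, c_{jq}^2}{c_{jj} c_{qq} - c_{jq}^2} > F_{in} = F_{1,\,n-r-2}(p_{in})$$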

Here p_in and p_out are specified by the user. The stepwise procedure terminates when either of the following happens:

1. No variable can be entered or deleted according to the above criteria. This includes the case where all regressors have been entered and none can be deleted.

2. The procedure dictates that the same regressor be entered and deleted in successive operations.

The stepwise selection procedure is an attempt to insert variables in turn until the regression equation is satisfactory.

When additional predictor variables no longer add anything statistically meaningful to the regression equation, the analysis stops. Thus, not all predictor variables may enter the equation in stepwise regression.
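
The stepwise procedure described above alternates an entry step with a removal step. The sketch below is one possible reading of it, not the deck's exact algorithm: it uses fixed F thresholds (`f_in` and `f_out`, hypothetical values, with `f_in` at least as large as `f_out` so that a variable is not removed immediately after entering) instead of the p_in and p_out levels, and a `max_steps` guard in place of the cycle check.

```python
import numpy as np


def sse(X, y, cols):
    """Error sum of squares for y regressed on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def stepwise(X, y, f_in=4.0, f_out=3.9, max_steps=100):
    """Alternate forward (enter) and backward (delete) steps until no change."""
    n, k = X.shape
    selected = []
    for _ in range(max_steps):
        changed = False
        # Entry step: add the candidate with the largest partial F, if > f_in
        candidates = [c for c in range(k) if c not in selected]
        df = n - len(selected) - 2
        if candidates and df > 0:
            current = sse(X, y, selected)
            scores = []
            for c in candidates:
                new = sse(X, y, selected + [c])
                scores.append(((current - new) / (new / df), c))
            f_best, c_best = max(scores)
            if f_best > f_in:
                selected.append(c_best)
                changed = True
        # Removal step: drop the variable with the smallest partial F, if < f_out
        if selected:
            full = sse(X, y, selected)
            df_full = n - len(selected) - 1
            scores = []
            for c in selected:
                rest = [v for v in selected if v != c]
                scores.append(((sse(X, y, rest) - full) / (full / df_full), c))
            f_min, c_min = min(scores)
            if f_min < f_out:
                selected.remove(c_min)
                changed = True
        if not changed:                 # nothing entered or deleted: stop
            break
    return sorted(selected)


# Simulated (hypothetical) data: y depends on columns 0 and 2 only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=100)
chosen = stepwise(X, y)
print(chosen)
```

Unlike forward selection, a variable that entered early can be dropped later if other variables make it redundant, which is why not every predictor necessarily ends up in the final equation.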

There are a number of multiple regression variants. Stepwise is usually a good choice, though one can instead enter all variables simultaneously. Similarly, one can enter all of the variables simultaneously and gradually eliminate predictors one by one if elimination does little to change the overall prediction. Of the procedures described earlier, the stepwise procedure is usually the preferred one.

Uses of Regression Analysis:

1. Regression analysis helps in establishing a functional relationship between two or more variables.

2. Since most of the problems of economic analysis are based on cause and effect relationships, the regression analysis is a highly valuable tool in economic and business research.

3. Regression analysis predicts the values of dependent variables from the values of independent variables.

4. We can calculate the coefficient of correlation (r) and the coefficient of determination (R²) with the help of regression coefficients.

ANOVA TABLE:

| Source | Degrees of freedom | Sum of squares | Mean square | p-value of F |
|---|---|---|---|---|
| Regression | 2 | 5550.8166 | 2775.4083 | 4.7 × 10⁻¹⁶ |
| Residual | 22 | 233.7260 | 10.6239 | |
| Total | 24 | 5784.5426 | | |

R² = 0.9596, Adjusted R² = 0.9559
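
The ANOVA figures above can be cross-checked with a few lines of arithmetic. The sketch below takes only the sums of squares and degrees of freedom from the table and recomputes the mean squares, the F ratio, R², and adjusted R²; the variable names are illustrative.

```python
# Values taken from the ANOVA table
ssr, df_reg = 5550.8166, 2      # regression sum of squares and its df
sse, df_res = 233.7260, 22      # residual (error) sum of squares and its df

sst = ssr + sse                 # total sum of squares
msr = ssr / df_reg              # mean square for regression
mse = sse / df_res              # mean square for residual
f_stat = msr / mse              # F ratio

n = df_reg + df_res + 1         # number of observations (total df + 1)
r2 = ssr / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - df_reg - 1)

print(round(sst, 4), round(msr, 4), round(mse, 4))
print(round(f_stat, 2), round(r2, 4), round(adj_r2, 4))
```

The recomputed values agree with the table (SST = 5784.5426, R² = 0.9596, adjusted R² = 0.9559) and give an F ratio of about 261.24, which is consistent with the very small p-value reported.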

Scatter plot for cases and distance