Regn & Marketing Research

8/9/2019 Regn & Marketing Research


Correlation and Regression: Explaining Association and Causation

by Tuhin Chattopadhyay


Slide 1 Application Areas: Correlation

1. Correlation and regression are generally performed together. Correlation analysis measures the degree of association between two sets of quantitative data. The correlation coefficient expresses this association. Its value ranges from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).

2. For example, how are sales of product A correlated with sales of product B? Or, how is advertising expenditure correlated with other promotional expenditure? Or, are daily ice cream sales correlated with daily maximum temperature?

3. Correlation does not necessarily mean there is a causal effect. Given any two strings of numbers, there will be some correlation between them. It does not imply that one variable is causing a change in another, or is dependent upon another.

4. In many applications, correlation is followed by regression analysis.


Slide 2 Application Areas: Regression

1. The main objective of regression analysis is to explain the variation in one variable (called the dependent variable), based on the variation in one or more other variables (called the independent variables).

2. Typical applications include explaining variations in sales of a product based on advertising expenses, or number of sales people, or number of sales offices, or on all the above variables.

3. If only one independent variable is used to explain the variation in the dependent variable, the model is known as a simple regression.

4. If multiple independent variables are used to explain the variation in a dependent variable, it is called a multiple regression model.

5. Even though the form of the regression equation could be either linear or non-linear, we will limit our discussion to linear (straight line) models.

6. As seen from the preceding discussion, the major application of regression analysis in marketing is in the area of sales forecasting, based on some independent (or explanatory) variables. This does not mean that regression analysis is the only technique used in sales forecasting. There are a variety of quantitative and qualitative methods used in sales forecasting, and regression is only one of the better known (and often used) quantitative techniques.


Slide 3 Methods

There are basically two approaches to regression:

- A hit and trial approach.
- A pre-conceived approach.

Hit and trial Approach

In the hit and trial approach, we collect data on a large number of independent variables and then try to fit a regression model using stepwise regression, entering one variable into the regression equation at a time. The general (linear) regression model is of the type

Y = a + b1x1 + b2x2 + ... + bnxn

where Y is the dependent variable and x1, x2, x3, ..., xn are the independent variables expected to be related to Y and expected to explain or predict Y. b1, b2, b3, ..., bn are the coefficients of the respective independent variables, which will be determined from the input data.

Pre-conceived Approach

The pre-conceived approach assumes the researcher knows reasonably well which variables explain Y, and the model is pre-conceived, say, with 3 independent variables x1, x2, x3. Therefore, not too much experimentation is done. The main objective is to find out whether the pre-conceived model is good or not. The equation is of the same form as earlier.
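The general linear model used by both approaches can be written compactly, with the fitting criterion made explicit. This is a sketch that assumes the coefficients are estimated by ordinary least squares, the usual default in regression packages; the slides do not name the estimation method. The error term e_i, the observation index i and the count N of observations are notation introduced here.

```latex
Y_i = a + b_1 x_{i1} + b_2 x_{i2} + \dots + b_n x_{in} + e_i, \qquad i = 1, \dots, N

(\hat{a}, \hat{b}_1, \dots, \hat{b}_n)
  = \arg\min_{a,\, b_1, \dots, b_n}
    \sum_{i=1}^{N} \Bigl( Y_i - a - \sum_{j=1}^{n} b_j x_{ij} \Bigr)^{2}
```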


Slide 4 Data

1. Input data on y and each of the x variables is required to do a regression analysis. This data is input into a computer package to perform the regression analysis.

2. The output consists of the b coefficients for all the independent variables in the model. The output also gives the results of a t test for the significance of each variable in the model, and the results of the F test for the model as a whole.

3. Assuming the model is statistically significant at the desired confidence level (usually 90% or 95% for typical applications in the marketing area), the coefficient of determination, or R², of the model is an important part of the output. The R² value is the percentage (or proportion) of the total variance in y explained by all the independent variables in the regression equation.
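For reference, the three outputs named above have standard textbook definitions, given here as a sketch. The symbols k (number of independent variables), N (number of observations), SSE, SST and SE are notation assumed here and do not appear on the slides.

```latex
R^2 = 1 - \frac{SSE}{SST}, \qquad
SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \qquad
SST = \sum_{i=1}^{N} (y_i - \bar{y})^2

F_{(k,\; N-k-1)} = \frac{R^2 / k}{(1 - R^2) / (N - k - 1)}, \qquad
t_j = \frac{b_j}{SE(b_j)} \quad \text{with } N - k - 1 \text{ degrees of freedom}
```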


Slide 5 Recommended usage

1. For exploratory research, the hit-and-trial approach may be used. But for serious decision-making, there has to be a-priori knowledge of the variables which are likely to affect y, and only such variables should be used in the regression analysis.

2. Unless the model itself is significant at the desired confidence level (as evidenced by the F test results printed out for the model), the R² value should not be interpreted.

3. The variables used (both independent and dependent) are assumed to be either interval scaled or ratio scaled. Nominally scaled variables can also be used as independent variables in a regression model, with dummy variable coding.

4. If the dependent variable happens to be a nominally scaled one, discriminant analysis should be the technique used instead of regression.
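Dummy variable coding, mentioned in point 3, can be illustrated with a minimal sketch. The `dummy_code` helper and the `regions` variable are hypothetical and are not part of the worked example in these slides. One category is chosen as the baseline and each remaining category gets its own 0/1 column.

```python
import numpy as np


def dummy_code(labels, baseline):
    """Return (column_names, 0/1 matrix) with one column per non-baseline category."""
    categories = sorted(set(labels) - {baseline})
    matrix = np.array(
        [[1 if label == cat else 0 for cat in categories] for label in labels]
    )
    return categories, matrix


# Hypothetical nominal variable: the sales region of each territory.
regions = ["North", "South", "West", "South", "North"]
names, dummies = dummy_code(regions, baseline="North")

print(names)    # column names of the dummy variables
print(dummies)  # one row per territory, one 0/1 column per non-baseline region
```

The baseline category is left out so that the dummy columns are not perfectly collinear with the intercept.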


Slide 6 Worked Example: Problem

1. A manufacturer and marketer of electric motors would like to build a regression model consisting of five or six independent variables, to predict sales. Past data has been collected for 15 sales territories, on Sales and six different independent variables. Build a regression model and recommend whether or not it should be used by the company.

2. We will assume that the data are for a particular year, in different sales territories in which the company operates, and the variables on which data are collected are as follows:

Dependent Variable

Y = sales in Rs. lakhs in the territory

Independent Variables

X1 = market potential in the territory (in Rs. lakhs).
X2 = No. of dealers of the company in the territory.
X3 = No. of salespeople in the territory.
X4 = Index of competitor activity in the territory on a 5 point scale (1 = low, 5 = high level of activity by competitors).
X5 = No. of service people in the territory.
X6 = No. of existing customers in the territory.


Slide 7

Input Data:

The data set, consisting of 15 observations, is given in Fig. 1.

Fig. 1

Data file: REGDATA1.STA (15 cases with 7 variables)

Case  SALES  POTENTL  DEALERS  PEOPLE  COMPET  SERVICE  CUSTOM
   1      5       25        1       6       5        2      20
   2     60      150       12      30       4        5      50
   3     20       45        5      15       3        2      25
   4     11       30        2      10       3        2      20
   5     45       75       12      20       2        4      30
   6      6       10        3       8       2        3      16
   7     15       29        5      18       4        5      30
   8     22       43        7      16       3        6      40
   9     29       70        4      15       2        5      39
  10      3       40        1       6       5        2       5
  11     16       40        4      11       4        2      17
  12      8       25        2       9       3        3      10
  13     18       32        7      14       3        4      31
  14     23       73       10      10       4        3      43
  15     81      150       15      35       4        7      70


Slide 8

Correlation

First, let us look at the correlations of all the variables with each other. The correlation table (output from the computer for the Pearson Correlation procedure) is shown in Fig. 2. The values in the correlation table are standardised, and range from -1 to +1.

Fig. 2: Correlations Table

STAT. MULTIPLE REGRESS.
Correlations (regdata1.sta)

Variable  POTENTL  DEALERS  PEOPLE  COMPET  SERVICE  CUSTOM  SALES
POTENTL      1.00      .84     .88     .14      .61     .83    .94
DEALERS       .84     1.00     .85    -.08      .68     .86    .91
PEOPLE        .88      .85    1.00    -.04      .79     .85    .95
COMPET        .14     -.08    -.04    1.00     -.18    -.01   -.05
SERVICE       .61      .68     .79    -.18     1.00     .82    .73
CUSTOM        .83      .86     .85    -.01      .82    1.00    .88
SALES         .94      .91     .95    -.05      .73     .88   1.00
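The Pearson correlations in Fig. 2 can be recomputed from the Fig. 1 data. The sketch below assumes NumPy, which is a choice made here; the slides use a statistics package. The array layout (one row per territory, columns in the Fig. 1 order) is also an assumption of this example.

```python
import numpy as np

names = ["SALES", "POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]

# Fig. 1 data: one row per sales territory, columns in the order of `names`.
data = np.array([
    [5, 25, 1, 6, 5, 2, 20],
    [60, 150, 12, 30, 4, 5, 50],
    [20, 45, 5, 15, 3, 2, 25],
    [11, 30, 2, 10, 3, 2, 20],
    [45, 75, 12, 20, 2, 4, 30],
    [6, 10, 3, 8, 2, 3, 16],
    [15, 29, 5, 18, 4, 5, 30],
    [22, 43, 7, 16, 3, 6, 40],
    [29, 70, 4, 15, 2, 5, 39],
    [3, 40, 1, 6, 5, 2, 5],
    [16, 40, 4, 11, 4, 2, 17],
    [8, 25, 2, 9, 3, 3, 10],
    [18, 32, 7, 14, 3, 4, 31],
    [23, 73, 10, 10, 4, 3, 43],
    [81, 150, 15, 35, 4, 7, 70],
], dtype=float)

# rowvar=False treats each column as a variable, giving a 7 x 7 matrix.
corr = np.corrcoef(data, rowvar=False)

# Correlation of every variable with SALES (column 0), rounded as in Fig. 2.
for j, name in enumerate(names):
    print(f"{name:8s} {corr[0, j]: .2f}")
```

Here SALES is the first variable rather than the last, so the SALES row of `corr` corresponds to the last column of Fig. 2.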


Slide 9

1. Looking at the last column of the table, we find that except for COMPET (index of competitor activity), all other variables are highly correlated (ranging from .73 to .95) with Sales.

2. This means we may have chosen a fairly good set of independent variables (No. of Dealers, Sales Potential, No. of Customers, No. of Service People, No. of Sales People) to try and correlate with Sales.

3. Only the Index of Competitor Activity does not appear to be strongly correlated (correlation coefficient is -.05) with Sales. But we must remember that the correlations in Fig. 2 are one-to-one correlations of each variable with the other. So we may still want to do a multiple regression with an independent variable showing low correlation with a dependent variable, because in the presence of other variables, this independent variable may become a good predictor of the dependent variable.

Slide 9 contd...

4. The other point to be noted in the correlation table is whether independent variables are highly correlated with each other. If they are, as in Fig. 2, this may indicate that they are not independent of each other, and we may be able to use only 1 or 2 of them to predict the dependent variable.

5. As we will see later, our regression ends up eliminating some of the independent variables, because not all six of them are required. Some of them, being correlated with other variables, do not add any value to the regression model.

6. We now move on to the regression analysis of the same data.


Slide 10

Regression

We will first run the regression model of the following form, by entering all the 6 'x' variables in the model:

Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6    (Equation 1)

and determine the values of a, b1, b2, b3, b4, b5, and b6.

Regression Output:

The results (output) of this regression model are in Fig. 4 in table form.

Column 4 of the table, titled B, lists all the coefficients for the model. According to this,

a (intercept) = -3.17298
b1 = .22685
b2 = .81938
b3 = 1.09104
b4 = -1.89270
b5 = -0.54925
b6 = 0.06594
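Equation 1 can be fitted to the Fig. 1 data with an ordinary least-squares solver. This is a minimal sketch assuming NumPy's `numpy.linalg.lstsq`; the slides use a statistics package, so small numerical differences from the printed coefficients are possible. The printed values can be compared against column B of Fig. 4.

```python
import numpy as np

# Fig. 1 data. Columns: SALES, POTENTL, DEALERS, PEOPLE, COMPET, SERVICE, CUSTOM.
data = np.array([
    [5, 25, 1, 6, 5, 2, 20],
    [60, 150, 12, 30, 4, 5, 50],
    [20, 45, 5, 15, 3, 2, 25],
    [11, 30, 2, 10, 3, 2, 20],
    [45, 75, 12, 20, 2, 4, 30],
    [6, 10, 3, 8, 2, 3, 16],
    [15, 29, 5, 18, 4, 5, 30],
    [22, 43, 7, 16, 3, 6, 40],
    [29, 70, 4, 15, 2, 5, 39],
    [3, 40, 1, 6, 5, 2, 5],
    [16, 40, 4, 11, 4, 2, 17],
    [8, 25, 2, 9, 3, 3, 10],
    [18, 32, 7, 14, 3, 4, 31],
    [23, 73, 10, 10, 4, 3, 43],
    [81, 150, 15, 35, 4, 7, 70],
], dtype=float)

y = data[:, 0]                                        # SALES
X = np.column_stack([np.ones(len(y)), data[:, 1:]])   # intercept column + x1..x6

# Least-squares solution of X @ coef = y; coef = [a, b1, ..., b6].
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ coef
sse = float(np.sum((y - fitted) ** 2))     # unexplained variation
sst = float(np.sum((y - y.mean()) ** 2))   # total variation
r2 = 1 - sse / sst

for label, value in zip(["a", "b1", "b2", "b3", "b4", "b5", "b6"], coef):
    print(f"{label:2s} = {value: .5f}")
print(f"R-squared = {r2:.4f}")
```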



Slide 12

The R² value is 0.977, from the top of Fig. 4. From Fig. 4, we also note that the t tests for significance of the individual independent variables indicate that, at the significance level of 0.10 (equivalent to a confidence level of 90%), only POTENTL and PEOPLE are statistically significant in the model. The other 4 independent variables are individually not significant.

Fig. 4 MULTIPLE REGRESSION RESULTS:

All independent variables were entered in one block
Dependent Variable: SALES
Multiple R: .988531605
Multiple R-Square: .977194734
Adjusted R-Square: .960090784
Number of cases: 15
F(6, 8) = 57.13269 p < .000004
Standard Error of Estimate: 4.391024067
Intercept: -3.172982117 Std.Error: 5.813394 t(8) = -.5458 p < .600084


Slide 12 contd...

STAT. MULTIPLE REGRESS.
Regression Summary for Dependent Variable: SALES
R = .98853160  R² = .97719473  Adjusted R² = .96009078
F(6,8) = 57.133  p < .00000  Std.Error of Estimate: 4.3910
N = 15

Variable    BETA       St.Err. of BETA   B          St.Err. of B   t(8)       p-level
Intercept                                -3.1729    5.813394       -.54581    .600084
POTENTL     .439073    .144411           .22685     .074611        3.04044    .016052
DEALERS     .164315    .126591           .81938     .631266        1.29800    .230457
PEOPLE      .413967    .158646           1.09104    .418122        2.60937    .031161
COMPET      -.084871   .060074           -1.89270   1.339712       -1.41276   .195427
SERVICE     -.040806   .116511           -.54925    1.568233       -.35024    .735204
CUSTOM      .050490    .149302           .06594     .095002        .33817     .743935
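The summary figures in Fig. 4 are tied together by standard identities, which gives a quick way to sanity-check the output. The sketch below recomputes the adjusted R², the F statistic and one t value from the printed R², coefficient and standard error. The variable names are chosen here for illustration.

```python
# Values copied from Fig. 4.
r2 = 0.977194734   # Multiple R-Square
n = 15             # number of cases (territories)
k = 6              # number of independent variables

# Adjusted R-squared penalises R-squared for the number of predictors used.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F statistic with (k, n - k - 1) = (6, 8) degrees of freedom.
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

# t value of a coefficient = coefficient / its standard error (POTENTL row).
t_potentl = 0.22685 / 0.074611

print(f"Adjusted R-squared: {adj_r2:.6f}")
print(f"F(6, 8):            {f_stat:.4f}")
print(f"t for POTENTL:      {t_potentl:.4f}")
```

These agree with the Adjusted R-Square, F(6, 8) and t(8) entries printed in Fig. 4 up to rounding.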


Slide 13

However, ignoring the significance of individual variables for now, we shall use the model as it is, and try to apply it for decision making.

The real use of the regression model would be to try and predict sales in Rs. lakhs, given all the independent variable values.

The equation we have obtained means, in effect, that sales will increase in a territory if the potential increases, or if the number of dealers increases, or if the number of salespeople increases, or if the level of competitor activity decreases, or if the number of service people decreases, or if the number of existing customers increases.

The estimated increase in sales for every unit increase or decrease in these variables is given by the coefficients of the respective variables. For instance, if the number of sales people is increased by 1, sales are estimated to increase by Rs. 1.09 lakhs, if all other variables are unchanged. Similarly, if 1 more dealer is added, sales are expected to increase by Rs. 0.82 lakh, if other variables are held constant.

Slide 13 contd...

There is one coefficient, that of the SERVICE variable, which does not make much intuitive sense. If we increase the number of service people, sales are estimated to decrease, according to the -0.55 coefficient of the variable "No. of Service People" (SERVICE).

But if we look at the individual variable t tests, we find that the coefficient of the variable SERVICE is statistically not significant (p-level 0.735204 from Fig. 4). Therefore, the coefficient for SERVICE is not to be used in interpreting the regression, as it may lead to wrong conclusions.

Strictly speaking, only two variables, potential (POTENTL) and No. of sales people (PEOPLE), are statistically significant at the 90 percent confidence level, since their p-levels are less than 0.10. One should therefore only look at the relationship of sales with one of these variables, or both these variables.

Slide 14 Making Predictions/Sales Forecasts

Given the levels of X1, X2, X3, X4, X5, and X6 for a particular territory, we can use the regression model for prediction of sales.

Before we do that, we have the option of redoing the regression model so that the variables that are not statistically significant are minimized or eliminated.

We can follow either the Forward Stepwise Regression method, or the Backward Stepwise Regression method, to try and eliminate the 'insignificant' variables from the full regression model containing all six independent variables.

Forward Stepwise Regression

For example, we could ask the computer for a Forward Stepwise Regression model, in which case the algorithm adds one independent variable at a time, starting with the one which explains most of the variation in sales (Y), and adding one more X variable to it, rechecking the model to see that both variables form a good model, then adding a third variable if it still adds to the explanation of Y, and so on. Fig. 5 shows the result of running a forward stepwise regression, which ends up with only 4 out of 6 independent variables remaining in the regression model.
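The forward stepwise procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the statistics package's algorithm. The entry rule (a partial F statistic compared with a threshold `f_to_enter`, set to 1.0 here) and the helper names `sse` and `forward_stepwise` are choices made for this example, so the selected set need not match Fig. 5.

```python
import numpy as np

NAMES = ["POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]

# Fig. 1 data. Column 0 is SALES, columns 1..6 follow the order of NAMES.
DATA = np.array([
    [5, 25, 1, 6, 5, 2, 20], [60, 150, 12, 30, 4, 5, 50],
    [20, 45, 5, 15, 3, 2, 25], [11, 30, 2, 10, 3, 2, 20],
    [45, 75, 12, 20, 2, 4, 30], [6, 10, 3, 8, 2, 3, 16],
    [15, 29, 5, 18, 4, 5, 30], [22, 43, 7, 16, 3, 6, 40],
    [29, 70, 4, 15, 2, 5, 39], [3, 40, 1, 6, 5, 2, 5],
    [16, 40, 4, 11, 4, 2, 17], [8, 25, 2, 9, 3, 3, 10],
    [18, 32, 7, 14, 3, 4, 31], [23, 73, 10, 10, 4, 3, 43],
    [81, 150, 15, 35, 4, 7, 70],
], dtype=float)


def sse(columns, y):
    """Residual sum of squares of a least-squares fit (with intercept)."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


def forward_stepwise(data, names, f_to_enter=1.0):
    """Add one variable at a time while its partial F reaches f_to_enter."""
    y = data[:, 0]
    n = len(y)
    selected = []                 # indices into `names`, in order of entry
    current_sse = sse([], y)      # intercept-only model
    while len(selected) < len(names):
        # Find the candidate that reduces the residual variation the most.
        best_j, best_sse = None, None
        for j in range(len(names)):
            if j in selected:
                continue
            cols = [data[:, i + 1] for i in selected + [j]]
            cand_sse = sse(cols, y)
            if best_sse is None or cand_sse < best_sse:
                best_j, best_sse = j, cand_sse
        # Partial F test for the improvement brought by the best candidate.
        df_resid = n - (len(selected) + 1) - 1
        f_stat = (current_sse - best_sse) / (best_sse / df_resid)
        if f_stat < f_to_enter:
            break
        selected.append(best_j)
        current_sse = best_sse
    return [names[j] for j in selected]


print(forward_stepwise(DATA, NAMES))
```

A higher `f_to_enter` makes the procedure stricter and typically leaves fewer variables in the model.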


Slide 17

The R² for the model has dropped only slightly, to 0.9599, the F-test for the model is highly significant, and both the independent variables, POTENTL and PEOPLE, are significant at the 90% confidence level (p-levels of .002037 and .000728 from the last column, Fig. 6).

If we were to decide to use this model for prediction, we would only require data to be collected on the number of sales people (PEOPLE) and the sales potential (POTENTL) in a given territory. We could form the equation using the Intercept and coefficients from column B in Fig. 6, as follows:

Sales = -10.6164 + .2433 (POTENTL) + 1.4244 (PEOPLE)    (Equation 3)

Thus, if potential in a territory were Rs. 50 lakhs, and the territory had 6 salespeople, then expected sales, using the above equation, would be

= -10.6164 + .2433(50) + 1.4244(6)

= Rs. 10.095 lakhs.

Similarly, we could use this model to make predictions regarding sales in any territory for which Potential and No. of Sales People were known.
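The two-variable model of Equation 3 and the worked prediction can be reproduced with a short sketch. It assumes NumPy least squares on the SALES, POTENTL and PEOPLE columns of Fig. 1. The function name `predict_sales` is hypothetical, and its default coefficients are the rounded values printed in Equation 3.

```python
import numpy as np

# SALES, POTENTL and PEOPLE columns of the Fig. 1 data (15 territories).
sales = np.array([5, 60, 20, 11, 45, 6, 15, 22, 29, 3, 16, 8, 18, 23, 81], dtype=float)
potentl = np.array([25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150], dtype=float)
people = np.array([6, 30, 15, 10, 20, 8, 18, 16, 15, 6, 11, 9, 14, 10, 35], dtype=float)

# Least-squares fit of SALES on an intercept, POTENTL and PEOPLE.
X = np.column_stack([np.ones(len(sales)), potentl, people])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print("intercept, b(POTENTL), b(PEOPLE):", np.round(coef, 4))


def predict_sales(potential, salespeople, a=-10.6164, b_pot=0.2433, b_people=1.4244):
    """Predicted sales in Rs. lakhs, using the coefficients of Equation 3."""
    return a + b_pot * potential + b_people * salespeople


# Territory with a potential of Rs. 50 lakhs and 6 salespeople.
print(round(predict_sales(50, 6), 3))
```

The fitted coefficients can be compared with Equation 3, and the final line repeats the slide's arithmetic for the example territory.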


Slide 18

Additional comments

1. As we can see from the example discussed, regression analysis is a very simple (particularly on a computer) and useful technique to predict one metric dependent variable based on a set of metric independent variables. Its use, however, gets more complex if, for instance, the independent variables are nominally scaled into two (dichotomous) or more (polytomous) categories.

2. It is also a good idea to define the range of all independent variables used for constructing the regression model. For prediction of Y values, only those X values which fall within or close to this range (used earlier in the model construction stage) must be used, for the predictions to be effective.

3. Finally, we have assumed that a linear model is the only option available to us. That is not the only choice. A regression model could be of any non-linear variety, and some of these could be more suitable for particular cases.


Slide 18 contd.

4. Generally, in the case of a simple regression model, a look at the plot of Y against X tells us whether the linear (straight line) approach is best or not. But in a multiple regression, this visual plot may not indicate the best kind of model, as there are many independent variables, and a plot in 2 dimensions is not possible.

5. In this particular example, we have not used any macroeconomic variables, but in industrial marketing, we may use industry or macroeconomic variables in a regression model. For example, to forecast sales of steel, we may use as independent variables the growth rate of a country's GDP, the new construction starts, and the growth rate of the automobile industry.
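Comment 2, on keeping predictions within the range of the data used to build the model, can be sketched as a simple check. The ranges below are the minimum and maximum of POTENTL and PEOPLE in Fig. 1. The helper name `out_of_range` and the `slack` parameter (a tolerance expressed as a fraction of each range's width, standing in for "close to this range") are assumptions of this example.

```python
# Observed (minimum, maximum) of each predictor in the Fig. 1 data.
OBSERVED_RANGE = {"POTENTL": (10, 150), "PEOPLE": (6, 35)}


def out_of_range(values, ranges=OBSERVED_RANGE, slack=0.0):
    """Return the names of predictors whose value lies outside the observed range.

    `slack` widens each range by that fraction of its width on both sides.
    """
    flagged = []
    for name, value in values.items():
        low, high = ranges[name]
        margin = slack * (high - low)
        if value < low - margin or value > high + margin:
            flagged.append(name)
    return flagged


# A territory inside the observed ranges, and one far beyond them.
print(out_of_range({"POTENTL": 50, "PEOPLE": 6}))
print(out_of_range({"POTENTL": 200, "PEOPLE": 6}))
```

A non-empty result signals that the regression equation is being extrapolated and the prediction should be treated with caution.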
