8/9/2019 Regn & Marketing Research
1/23
Correlation and Regression:
Explaining Association and Causation
by Tuhin Chattopadhyay
Slide 1 Application Areas: Correlation
1. Correlation and regression are generally performed together. Correlation analysis measures the degree of association between two sets of quantitative data. The correlation coefficient expresses this association on a scale from -1 (perfect negative correlation) through 0 (no linear correlation) to +1 (perfect positive correlation).
2. For example, how are sales of product A correlated with sales of product B? Or, how is advertising expenditure correlated with other promotional expenditure? Or, are daily ice cream sales correlated with daily maximum temperature?
3. Correlation does not necessarily mean there is a causal effect. Given any two series of numbers, there will be some correlation between them. It does not imply that one variable is causing a change in another, or is dependent upon another.
4. In many applications, correlation analysis is followed by regression analysis.
Slide 2 Application Areas: Regression
1. The main objective of regression analysis is to explain the variation in one variable (called the dependent variable), based on the variation in one or more other variables (called the independent variables).
2. Typical applications include explaining variations in sales of a product based on advertising expenses, or number of salespeople, or number of sales offices, or on all of these variables.
3. If there is only one dependent variable and one independent variable is used to explain the variation in it, then the model is known as a simple regression.
4. If multiple independent variables are used to explain the variation in a dependent variable, it is called a multiple regression model.
5. Even though the form of the regression equation could be either linear or non-linear, we will limit our discussion to linear (straight line) models.
6. As seen from the preceding discussion, the major application of regression analysis in marketing is in the area of sales forecasting, based on some independent (or explanatory) variables. This does not mean that regression analysis is the only technique used in sales forecasting. There are a variety of quantitative and qualitative methods used in sales forecasting, and regression is only one of the better known (and often used) quantitative techniques.
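As a minimal sketch of a simple regression (one dependent and one independent variable), the least-squares intercept and slope can be computed directly from the data. The advertising and sales figures below are hypothetical, chosen only for illustration, and the function name is our own.

```python
def simple_regression(x, y):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising expense (x) and sales (y)
advertising = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

a, b = simple_regression(advertising, sales)
print(f"sales = {a:.2f} + {b:.2f} * advertising")
```

A multiple regression follows the same least-squares idea with more than one x variable, which in practice is left to a statistical package.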
Slide 3 Methods
There are basically two approaches to regression:
• A hit and trial approach.
• A pre-conceived approach.
Hit and trial Approach
In the hit and trial approach, we collect data on a large number of independent variables and then try to fit a regression model using a stepwise procedure, entering one variable into the regression equation at a time.
The general (linear) regression model is of the form
Y = a + b1x1 + b2x2 + ... + bnxn
where Y is the dependent variable and x1, x2, x3, ..., xn are the independent variables expected to be related to Y and expected to explain or predict Y. a is the intercept, and b1, b2, b3, ..., bn are the coefficients of the respective independent variables, which will be determined from the input data.
Pre-conceived Approach
The pre-conceived approach assumes the researcher knows reasonably well which variables explain Y, and the model is pre-conceived, say, with 3 independent variables x1, x2, x3. Therefore, not too much experimentation is done. The main objective is to find out whether the pre-conceived model is good or not. The equation is of the same form as earlier.
Slide 4
Data
1. Input data on y and each of the x variables is required to do a regression analysis. This data is input into a computer package to perform the regression analysis.
2. The output consists of the b coefficients for all the independent variables in the model. The output also gives you the results of a t test for the significance of each variable in the model, and the results of the F test for the model on the whole.
3. Assuming the model is statistically significant at the desired confidence level (usually 90 or 95% for typical applications in the marketing area), the coefficient of determination or R² of the model is an important part of the output. The R² value is the percentage (or proportion) of the total variance in y explained by all the independent variables in the regression equation.
Slide 5 Recommended usage
1. It is recommended that for exploratory research, the hit-and-trial approach may be used. But for serious decision-making, there has to be a-priori knowledge of the variables which are likely to affect y, and only such variables should be used in the regression analysis.
2. It is also recommended that unless the model is itself significant at the desired confidence level (as evidenced by the F test results printed out for the model), the R² value should not be interpreted.
3. The variables used (both independent and dependent) are assumed to be either interval scaled or ratio scaled. Nominally scaled variables can also be used as independent variables in a regression model, with dummy variable coding.
4. If the dependent variable happens to be a nominally scaled one, discriminant analysis should be the technique used instead of regression.
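Dummy variable coding (point 3 above) can be sketched as follows. This is one common convention, assumed here for illustration: a nominal variable with k categories becomes k-1 columns of 0/1 values, with one category left out as the baseline. The region names and the function name are hypothetical.

```python
def dummy_code(values):
    """Code a nominal variable as k-1 dummy (0/1) columns.

    The first category in sorted order is treated as the baseline
    and is represented by all zeros. Returns (column_names, rows).
    """
    categories = sorted(set(values))
    kept = categories[1:]  # drop the baseline category
    rows = [[1 if value == category else 0 for category in kept]
            for value in values]
    return kept, rows


# Hypothetical nominal variable: sales region of four territories
regions = ["North", "South", "East", "North"]
names, coded = dummy_code(regions)

print(names)
for region, row in zip(regions, coded):
    print(region, row)
```

The resulting 0/1 columns can then be entered into the regression like any other independent variable; each coefficient is read as the difference from the baseline category.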
Slide 6 Worked Example: Problem
1. A manufacturer and marketer of electric motors would like to build a regression model consisting of five or six independent variables, to predict sales. Past data has been collected for 15 sales territories, on Sales and six different independent variables. Build a regression model and recommend whether or not it should be used by the company.
2. We will assume that the data are for a particular year, in different sales territories in which the company operates, and the variables on which data are collected are as follows:
Dependent Variable
Y = sales in Rs. lakhs in the territory
Independent Variables
X1 = market potential in the territory (in Rs. lakhs).
X2 = No. of dealers of the company in the territory.
X3 = No. of salespeople in the territory.
X4 = Index of competitor activity in the territory on a 5 point scale (1 = low, 5 = high level of activity by competitors).
X5 = No. of service people in the territory.
X6 = No. of existing customers in the territory.
Slide 7
Input data:
The data set, consisting of 15 observations, is given in Fig. 1.
Fig. 1
Data file: REGDATA1.STA (15 cases with 7 variables)
Case  SALES  POTENTL  DEALERS  PEOPLE  COMPET  SERVICE  CUSTOM
   1      5       25        1       6       5        2      20
   2     60      150       12      30       4        5      50
   3     20       45        5      15       3        2      25
   4     11       30        2      10       3        2      20
   5     45       75       12      20       2        4      30
   6      6       10        3       8       2        3      16
   7     15       29        5      18       4        5      30
   8     22       43        7      16       3        6      40
   9     29       70        4      15       2        5      39
  10      3       40        1       6       5        2       5
  11     16       40        4      11       4        2      17
  12      8       25        2       9       3        3      10
  13     18       32        7      14       3        4      31
  14     23       73       10      10       4        3      43
  15     81      150       15      35       4        7      70
Slide 8
Correlation
First, let us look at the correlations of all the variables with each other. The correlation table (output from the computer for the Pearson Correlation procedure) is shown in Fig. 2. The values in the correlation table are standardised, and range from -1 to +1.
Fig. 2: Correlations Table
STAT. MULTIPLE REGRESS.
Correlations (regdata1.sta)
Variable  POTENTL  DEALERS  PEOPLE  COMPET  SERVICE  CUSTOM  SALES
POTENTL      1.00      .84     .88     .14      .61     .83    .94
DEALERS       .84     1.00     .85    -.08      .68     .86    .91
PEOPLE        .88      .85    1.00    -.04      .79     .85    .95
COMPET        .14     -.08    -.04    1.00     -.18    -.01   -.05
SERVICE       .61      .68     .79    -.18     1.00     .82    .73
CUSTOM        .83      .86     .85    -.01      .82    1.00    .88
SALES         .94      .91     .95    -.05      .73     .88   1.00
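The SALES column of the correlation table can be reproduced with a short sketch. It assumes the 15 cases of Fig. 1 as read from the slide and uses the standard Pearson formula; the variable and function names are our own, and a statistical package would normally do this step.

```python
from math import sqrt

# Fig. 1 data, one tuple per territory:
# (SALES, POTENTL, DEALERS, PEOPLE, COMPET, SERVICE, CUSTOM)
DATA = [
    (5, 25, 1, 6, 5, 2, 20), (60, 150, 12, 30, 4, 5, 50),
    (20, 45, 5, 15, 3, 2, 25), (11, 30, 2, 10, 3, 2, 20),
    (45, 75, 12, 20, 2, 4, 30), (6, 10, 3, 8, 2, 3, 16),
    (15, 29, 5, 18, 4, 5, 30), (22, 43, 7, 16, 3, 6, 40),
    (29, 70, 4, 15, 2, 5, 39), (3, 40, 1, 6, 5, 2, 5),
    (16, 40, 4, 11, 4, 2, 17), (8, 25, 2, 9, 3, 3, 10),
    (18, 32, 7, 14, 3, 4, 31), (23, 73, 10, 10, 4, 3, 43),
    (81, 150, 15, 35, 4, 7, 70),
]
NAMES = ["SALES", "POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Split the rows into one list per variable, then correlate each with SALES
columns = {name: [row[i] for row in DATA] for i, name in enumerate(NAMES)}
with_sales = {name: pearson(columns[name], columns["SALES"])
              for name in NAMES[1:]}

for name, r in with_sales.items():
    print(f"{name:8s} {r:6.2f}")
```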
Slide 9
1. Looking at the last column of the table, we find that except for COMPET (index of competitor activity), all other variables are highly correlated (ranging from .73 to .95) with Sales.
2. This means we may have chosen a fairly good set of independent variables (No. of Dealers, Sales Potential, No. of Customers, No. of Service People, No. of Sales People) to try and correlate with Sales.
3. Only the Index of Competitor Activity does not appear to be strongly correlated (correlation coefficient is -.05) with Sales. But we must remember that these correlations in Fig. 2 are one-to-one correlations of each variable with the other. So we may still want to do a multiple regression with an independent variable showing low correlation with a dependent variable, because in the presence of other variables, this independent variable may become a good predictor of the dependent variable.
Slide 9 contd...
4. The other point to be noted in the correlation table is whether independent variables are highly correlated with each other. If they are, as in Fig. 2, this may indicate that they are not independent of each other, and we may be able to use only 1 or 2 of them to predict the dependent variable.
5. As we will see later, our regression ends up eliminating some of the independent variables, because not all six of them are required. Some of them, being correlated with other variables, do not add any value to the regression model.
6. We now move on to the regression analysis of the same data.
Slide 10
Regression
We will first run the regression model of the following form, by entering all the 6 'x' variables in the model:
Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ... Equation 1
and determine the values of a, b1, b2, b3, b4, b5 and b6.
Regression Output:
The results (output) of this regression model are in Fig. 4 in table form.
Column 4 of the table, titled B, lists all the coefficients for the model. According to this,
a (intercept) = -3.17298
b1 = .22685
b2 = .81938
b3 = 1.09104
b4 = -1.89270
b5 = -0.54925
b6 = 0.06594
Slide 12
The R² value is 0.977, from the top of Fig. 4. From Fig. 4, we also note that t tests for significance of individual independent variables indicate that at the significance level of 0.10 (equivalent to a confidence level of 90%), only POTENTL and PEOPLE are statistically significant in the model. The other 4 independent variables are individually not significant.
Fig. 4 MULTIPLE REGRESSION RESULTS:
All independent variables were entered in one block
Dependent Variable: SALES
Multiple R: .988531605
Multiple R-Square: .977194734
Adjusted R-Square: .960090784
Number of cases: 15
F(6, 8) = 57.13269 p < .000004
Standard Error of Estimate: 4.391024067
Intercept: -3.172982117
Std.Error: 5.813394 t(8) = -.5458 p < .600084
Slide 12 contd...
STAT. MULTIPLE REGRESS.
Regression Summary for Dependent Variable: SALES
R = .98853160  R² = .97719473  Adjusted R² = .96009078
F(6,8) = 57.133  p < .00000  Std. Error of Estimate: 4.3910
N = 15
Variable   BETA     St.Err. of BETA  B         St.Err. of B  t(8)      p-level
Intercept                            -3.1729   5.813394      -.54581   .600084
POTENTL    .439073  .144411          .22685    .074611       3.04044   .016052
DEALERS    .164315  .126591          .81938    .631266       1.29800   .230457
PEOPLE     .413967  .158646          1.09104   .418122       2.60937   .031161
COMPET     .084871  .060074          -1.89270  1.339712      -1.41276  .195427
SERVICE    .040806  .116511          -.54925   1.568233      -.35024   .735204
CUSTOM     .050490  .149302          .06594    .095002       .33817    .743935
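The summary statistics in Fig. 4 are linked by standard formulas, which the following sketch checks against the reported figures. With n cases and k independent variables, F = (R²/k) / ((1 - R²)/(n - k - 1)), adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), and each t value is the coefficient B divided by its standard error. The function names are our own.

```python
def f_statistic(r2, n, k):
    """Overall F statistic with (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


def adjusted_r2(r2, n, k):
    """R-squared adjusted for the number of independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Values reported in Fig. 4: R-square, number of cases, number of x variables
r2, n, k = 0.977194734, 15, 6

f_value = f_statistic(r2, n, k)
adj = adjusted_r2(r2, n, k)

# t value for POTENTL: B divided by St.Err. of B (both from Fig. 4)
t_potentl = 0.22685 / 0.074611

print(f"F({k}, {n - k - 1}) = {f_value:.3f}")
print(f"Adjusted R-square = {adj:.6f}")
print(f"t for POTENTL = {t_potentl:.3f}")
```

The results agree with the F(6, 8), Adjusted R-Square and t(8) entries printed in Fig. 4, up to rounding.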
Slide 13
However, ignoring the significance of individual variables for now, we shall use the model as it is, and try to apply it for decision making.
The real use of the regression model would be to try and predict sales in Rs. lakhs, given all the independent variable values.
The equation we have obtained means, in effect, that sales will increase in a territory if the potential increases, if the number of dealers increases, if the number of salespeople increases, if the level of competitor activity decreases, if the number of service people decreases, or if the number of existing customers increases.
The estimated change in sales for every unit increase or decrease in these variables is given by the coefficients of the respective variables. For instance, if the number of salespeople is increased by 1, sales are estimated to increase by Rs. 1.09 lakhs, if all other variables are unchanged. Similarly, if 1 more dealer is added, sales are expected to increase by Rs. 0.82 lakh, if other variables are held constant.
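The "all other variables unchanged" reading of a coefficient can be illustrated with a small sketch. It plugs the Equation 1 coefficients from column B of Fig. 4 into a prediction function (a name of our own) and compares territory 1 of Fig. 1 with the same territory given one extra salesperson.

```python
# Coefficients of Equation 1, as reported in column B of Fig. 4
INTERCEPT = -3.17298
COEF = {
    "POTENTL": 0.22685,
    "DEALERS": 0.81938,
    "PEOPLE": 1.09104,
    "COMPET": -1.89270,
    "SERVICE": -0.54925,
    "CUSTOM": 0.06594,
}


def predict_sales(territory):
    """Predicted sales (Rs. lakhs) from the full six-variable model."""
    return INTERCEPT + sum(COEF[name] * territory[name] for name in COEF)


# Territory 1 from Fig. 1
base = {"POTENTL": 25, "DEALERS": 1, "PEOPLE": 6,
        "COMPET": 5, "SERVICE": 2, "CUSTOM": 20}

# Same territory with one more salesperson, everything else unchanged
more_people = dict(base, PEOPLE=base["PEOPLE"] + 1)

effect = predict_sales(more_people) - predict_sales(base)
print(f"Predicted sales, territory 1: {predict_sales(base):.2f}")
print(f"Effect of one more salesperson: {effect:.2f}")
```

The difference between the two predictions equals the PEOPLE coefficient, which is exactly what "an increase of 1.09 if all other variables are unchanged" means.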
Slide 13 contd...
There is one coefficient, that of the SERVICE variable, which does not make much intuitive sense. If we increase the number of service people, sales are estimated to decrease, according to the -0.55 coefficient of the variable "No. of Service People" (SERVICE).
But if we look at the individual variable t tests, we find that the coefficient of the variable SERVICE is not statistically significant (p-level 0.735204 from Fig. 4). Therefore, the coefficient for SERVICE is not to be used in interpreting the regression, as it may lead to wrong conclusions.
Strictly speaking, only two variables, potential (POTENTL) and No. of salespeople (PEOPLE), are statistically significant at the 90 percent confidence level, since their p-levels are less than 0.10. One should therefore only look at the relationship of sales with one of these variables, or both these variables.
Slide 14 Making Predictions/Sales Forecasts
Given the levels of X1, X2, X3, X4, X5, and X6 for a particular territory, we can use the regression model for prediction of sales.
Before we do that, we have the option of redoing the regression model so that the variables not statistically significant are minimized or eliminated.
We can follow either the Forward Stepwise Regression method, or the Backward Stepwise Regression method, to try and eliminate the 'insignificant' variables from the full regression model containing all six independent variables.
Forward Stepwise Regression
For example, we could ask the computer for a Forward Stepwise Regression model, in which case the algorithm adds one independent variable at a time, starting with the one which explains most of the variation in sales (Y), and adding one more X variable to it, rechecking the model to see that both variables form a good model, then adding a third variable if it still adds to the explanation of Y, and so on. Fig. 5 shows the result of running a forward stepwise regression, which ends up with only 4 out of 6 independent variables remaining in the regression model.
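The forward stepwise idea can be sketched in a few lines. This is not the statistical package's algorithm: packages usually decide entry with an F-to-enter or p-value rule, whereas this sketch assumes a simpler stopping rule (stop when the best remaining variable raises R² by less than `min_gain`, a threshold of our own choosing). Its selection may therefore differ from Fig. 5. The data are the 15 cases of Fig. 1.

```python
# Fig. 1 data: (SALES, POTENTL, DEALERS, PEOPLE, COMPET, SERVICE, CUSTOM)
ROWS = [
    (5, 25, 1, 6, 5, 2, 20), (60, 150, 12, 30, 4, 5, 50),
    (20, 45, 5, 15, 3, 2, 25), (11, 30, 2, 10, 3, 2, 20),
    (45, 75, 12, 20, 2, 4, 30), (6, 10, 3, 8, 2, 3, 16),
    (15, 29, 5, 18, 4, 5, 30), (22, 43, 7, 16, 3, 6, 40),
    (29, 70, 4, 15, 2, 5, 39), (3, 40, 1, 6, 5, 2, 5),
    (16, 40, 4, 11, 4, 2, 17), (8, 25, 2, 9, 3, 3, 10),
    (18, 32, 7, 14, 3, 4, 31), (23, 73, 10, 10, 4, 3, 43),
    (81, 150, 15, 35, 4, 7, 70),
]
NAMES = ["POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(cols):
    """Fit SALES on the given columns (plus intercept); return R-squared."""
    y = [row[0] for row in ROWS]
    X = [[1.0] + [float(row[c]) for c in cols] for row in ROWS]
    k = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_stepwise(min_gain=0.005):
    """Add the variable that raises R-squared most, until the gain is small."""
    selected, history, best_r2 = [], [], 0.0
    remaining = list(range(1, 7))  # column indices of the six x variables
    while remaining:
        r2, col = max((r_squared(selected + [c]), c) for c in remaining)
        if r2 - best_r2 < min_gain:
            break
        selected.append(col)
        remaining.remove(col)
        best_r2 = r2
        history.append((NAMES[col - 1], r2))
    return history


history = forward_stepwise()
for step, (name, r2) in enumerate(history, start=1):
    print(f"Step {step}: add {name:8s} R-square = {r2:.4f}")
```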
Slide 17
The R² for the model has dropped only slightly, to 0.9599, the F-test for the model is highly significant, and both the independent variables POTENTL and PEOPLE are significant at the 90% confidence level (p-levels of .002037 and .000728 from the last column, Fig. 6).
If we were to decide to use this model for prediction, we would only require data to be collected on the number of salespeople (PEOPLE) and the sales potential (POTENTL) in a given territory. We could form the equation using the Intercept and coefficients from column B in Fig. 6, as follows:
Sales = -10.6164 + .2433 (POTENTL) + 1.4244 (PEOPLE) ... Equation 3
Thus, if potential in a territory were to be Rs. 50 lakhs, and the territory had 6 salespeople, then expected sales, using the above equation, would be
= -10.6164 + .2433(50) + 1.4244(6)
= Rs. 10.095 lakhs.
Similarly, we could use this model to make predictions regarding sales in any territory for which Potential and No. of Sales People were known.
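Equation 3 and the worked prediction can be checked with a short sketch. It fits sales on POTENTL and PEOPLE by ordinary least squares, using the 15 cases of Fig. 1 and the closed-form solution for two predictors; the function and variable names are our own.

```python
# (SALES, POTENTL, PEOPLE) for the 15 territories in Fig. 1
DATA = [
    (5, 25, 6), (60, 150, 30), (20, 45, 15), (11, 30, 10), (45, 75, 20),
    (6, 10, 8), (15, 29, 18), (22, 43, 16), (29, 70, 15), (3, 40, 6),
    (16, 40, 11), (8, 25, 9), (18, 32, 14), (23, 73, 10), (81, 150, 35),
]


def fit_two_predictors(rows):
    """Least-squares fit of y = a + b1*x1 + b2*x2; returns (a, b1, b2)."""
    n = len(rows)
    mean_y = sum(r[0] for r in rows) / n
    mean_1 = sum(r[1] for r in rows) / n
    mean_2 = sum(r[2] for r in rows) / n
    # Sums of squares and cross-products around the means
    s11 = sum((r[1] - mean_1) ** 2 for r in rows)
    s22 = sum((r[2] - mean_2) ** 2 for r in rows)
    s12 = sum((r[1] - mean_1) * (r[2] - mean_2) for r in rows)
    s1y = sum((r[1] - mean_1) * (r[0] - mean_y) for r in rows)
    s2y = sum((r[2] - mean_2) * (r[0] - mean_y) for r in rows)
    # Solve the 2x2 normal equations by Cramer's rule
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = mean_y - b1 * mean_1 - b2 * mean_2
    return a, b1, b2


a, b_potentl, b_people = fit_two_predictors(DATA)
print(f"Sales = {a:.4f} + {b_potentl:.4f} (POTENTL) + {b_people:.4f} (PEOPLE)")

# Territory with potential of Rs. 50 lakhs and 6 salespeople
predicted = a + b_potentl * 50 + b_people * 6
print(f"Predicted sales: {predicted:.2f} lakhs")
```

Small differences from the figure of 10.095 on the slide come from rounding the coefficients to four decimals before substituting.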
Slide 18
Additional comments
1. As we can see from the example discussed, regression analysis is a very simple (particularly on a computer) and useful technique to predict one metric dependent variable based on a set of metric independent variables. Its use, however, gets more complex if, for instance, the independent variables are nominally scaled into two (dichotomous) or more (polytomous) categories.
2. It is also a good idea to define the range of all independent variables used for constructing the regression model. For prediction of Y values, only those X values which fall within or close to this range (used earlier in the model construction stage) must be used, for the predictions to be effective.
3. Finally, we have assumed that a linear model is the only option available to us. That is not the only choice. A regression model could be of any non-linear variety, and some of these could be more suitable for particular cases.
Slide 18 contd.
4. Generally, in the case of a simple regression model, a look at the plot of Y against X tells us whether the linear (straight line) approach is best or not. But in a multiple regression, such a visual plot may not indicate the best kind of model, as there are many independent variables, and a plot in 2 dimensions is not possible.
5. In this particular example, we have not used any macroeconomic variables, but in industrial marketing, we may use industry or macroeconomic variables of that type in a regression model. For example, to forecast sales of steel, we may use as independent variables the growth rate of a country's GDP, the new construction starts, and the growth rate of the automobile industry.
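Comment 2 above, on keeping predictions within the range of the data used to build the model, can be turned into a simple check. The ranges below are the minimum and maximum of POTENTL and PEOPLE in Fig. 1. The 10% margin used to interpret "close to this range" is an assumption of ours, as is the function name.

```python
# Observed (min, max) of the two predictors in the Fig. 1 data
TRAINING_RANGE = {"POTENTL": (10, 150), "PEOPLE": (6, 35)}


def out_of_range(inputs, ranges=TRAINING_RANGE, tolerance=0.10):
    """Return the names of inputs lying outside the modelled range.

    A value is flagged when it falls more than `tolerance` (a fraction
    of the range width) below the minimum or above the maximum.
    """
    flagged = []
    for name, value in inputs.items():
        low, high = ranges[name]
        margin = tolerance * (high - low)
        if value < low - margin or value > high + margin:
            flagged.append(name)
    return flagged


# The worked example (potential 50, 6 salespeople) is inside the range
print(out_of_range({"POTENTL": 50, "PEOPLE": 6}))

# A territory with potential of 400 is far outside the data used
print(out_of_range({"POTENTL": 400, "PEOPLE": 20}))
```

A forecast for any territory that the check flags should be treated with caution, since the fitted line has not been tested in that region.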