Regn & Marketing Research

  • 8/9/2019 Regn & Marketing Research

    1/23

    Correlation and Regression:
    Explaining Association and Causation

    by Tuhin Chattopadhyay

    Slide 1 Application Areas: Correlation

    1. Correlation and regression are generally performed together. Correlation analysis measures the degree of association between two sets of quantitative data. The correlation coefficient quantifies this association. It ranges from -1 (perfect negative correlation) through 0 (no linear correlation) to +1 (perfect positive correlation).

    2. For example, how are sales of product A correlated with sales of product B? Or, how is advertising expenditure correlated with other promotional expenditure? Or, are daily ice cream sales correlated with the daily maximum temperature?

    3. Correlation does not necessarily imply a causal effect. Given any two series of numbers, there will almost always be some non-zero correlation between them. That does not mean that one variable causes a change in the other, or depends on the other.

    4. In many applications, correlation analysis is followed by regression analysis.
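    As a minimal sketch of how the correlation coefficient described above is computed, the function below implements the usual Pearson formula in plain Python. The temperature and sales figures are hypothetical values made up for illustration and are not taken from these slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data (not from the slides).
temperature = [24, 27, 30, 33, 36]            # daily maximum, deg C
ice_cream_sales = [110, 135, 160, 185, 210]   # rises exactly linearly with temperature
umbrella_sales = [50, 44, 38, 32, 26]         # falls exactly linearly with temperature

print(pearson_r(temperature, ice_cream_sales))
print(pearson_r(temperature, umbrella_sales))
```

    Because both made-up series are exact straight-line functions of temperature, the two results sit at the extremes of the scale (+1 and -1, up to floating-point error). Real data would fall somewhere in between.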

    Slide 2 Application Areas: Regression

    1. The main objective of regression analysis is to explain the variation in one variable (called the dependent variable), based on the variation in one or more other variables (called the independent variables).

    2. Typical applications include explaining variations in sales of a product based on advertising expenses, or number of salespeople, or number of sales offices, or on all of these variables.

    3. If there is only one dependent variable and one independent variable is used to explain the variation in it, the model is known as a simple regression.

    4. If multiple independent variables are used to explain the variation in a dependent variable, it is called a multiple regression model.

    5. Even though the form of the regression equation could be either linear or non-linear, we will limit our discussion to linear (straight line) models.

    6. As seen from the preceding discussion, the major application of regression analysis in marketing is in sales forecasting, based on some independent (or explanatory) variables. This does not mean that regression analysis is the only technique used in sales forecasting. There are a variety of quantitative and qualitative methods used in sales forecasting, and regression is only one of the better known (and often used) quantitative techniques.
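    The simple regression case in point 3 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `fit_simple_regression` and the advertising and sales figures are hypothetical and do not come from the slides.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope: change in y per unit change in x
    a = mean_y - b * mean_x  # intercept: fitted y when x = 0
    return a, b

# Hypothetical data: advertising spend (x) and sales (y), both in Rs. lakhs.
advertising = [1, 2, 3, 4, 5]
sales = [12, 15, 21, 24, 28]

a, b = fit_simple_regression(advertising, sales)
print(f"sales = {a:.2f} + {b:.2f} * advertising")
```

    The slope b is the estimated change in the dependent variable for a one-unit change in the independent variable, which is the same interpretation used for the multiple regression coefficients later in the deck.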

    Slide 3 Methods

    There are basically two approaches to regression:
    • A hit and trial approach.
    • A pre-conceived approach.

    Hit and trial Approach

    In the hit and trial approach we collect data on a large number of independent variables and then try to fit a regression model using stepwise regression, entering one variable into the regression equation at a time. The general (linear) regression model is of the type

    Y = a + b1x1 + b2x2 + ... + bnxn

    where Y is the dependent variable and x1, x2, x3, ..., xn are the independent variables expected to be related to Y and expected to explain or predict Y. b1, b2, b3, ..., bn are the coefficients of the respective independent variables, which will be determined from the input data.

    Pre-conceived Approach

    The pre-conceived approach assumes the researcher knows reasonably well which variables explain Y, so the model is pre-conceived, say, with 3 independent variables x1, x2, x3. Therefore, not too much experimentation is done. The main objective is to find out whether the pre-conceived model is good or not. The equation is of the same form as earlier.

    Slide 4

    Data

    1. Input data on y and each of the x variables is required to do a regression analysis. This data is input into a computer package to perform the regression analysis.

    2. The output consists of the b coefficients for all the independent variables in the model. The output also gives the results of a t test for the significance of each variable in the model, and the results of the F test for the model as a whole.

    3. Assuming the model is statistically significant at the desired confidence level (usually 90 or 95% for typical applications in the marketing area), the coefficient of determination, or R², of the model is an important part of the output. The R² value is the proportion (or percentage) of the total variance in y explained by all the independent variables in the regression equation.
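    For reference, the two quantities described above can be written out as follows. The notation is an assumption made here and is not taken from the slides: y_i is an observed value, \hat{y}_i the value fitted by the regression, \bar{y} the mean of y, b_j the coefficient of the j-th independent variable, and SE(b_j) its standard error.

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad
t_j = \frac{b_j}{\mathrm{SE}(b_j)}
```

    R² compares the unexplained variation (numerator) with the total variation in y (denominator), and each t value is the coefficient measured in units of its own standard error.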

    Slide 5 Recommended usage

    1. For exploratory research, the hit-and-trial approach may be used. But for serious decision-making, there has to be a-priori knowledge of the variables which are likely to affect y, and only such variables should be used in the regression analysis.

    2. Unless the model itself is significant at the desired confidence level (as evidenced by the F test results printed out for the model), the R² value should not be interpreted.

    3. The variables used (both independent and dependent) are assumed to be either interval scaled or ratio scaled. Nominally scaled variables can also be used as independent variables in a regression model, with dummy variable coding.

    4. If the dependent variable happens to be nominally scaled, discriminant analysis should be used instead of regression.
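    Dummy variable coding, mentioned in point 3, can be sketched as follows. The function `dummy_code` and the region labels are hypothetical and used only for illustration. One category is left out as the baseline because a full set of 0/1 columns would be perfectly collinear with the intercept.

```python
def dummy_code(values, baseline=None):
    """Code a nominal variable as k-1 columns of 0/1, omitting a baseline category."""
    categories = sorted(set(values))
    if baseline is None:
        baseline = categories[0]
    if baseline not in categories:
        raise ValueError("baseline must be one of the observed categories")
    kept = [c for c in categories if c != baseline]
    # One row per observation, one 0/1 column per non-baseline category.
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

# Hypothetical nominal variable: the region each sales territory belongs to.
region = ["North", "South", "East", "South", "North"]
columns, coded = dummy_code(region, baseline="North")
print(columns)
for row in coded:
    print(row)
```

    Each coefficient estimated for a dummy column is then read as the difference in y between that category and the baseline, with the other variables held constant.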

    Slide 6 Worked Example: Problem

    1. A manufacturer and marketer of electric motors would like to build a regression model consisting of five or six independent variables, to predict sales. Past data has been collected for 15 sales territories, on Sales and six different independent variables. Build a regression model and recommend whether or not it should be used by the company.

    2. We will assume that the data are for a particular year, in the different sales territories in which the company operates, and that the variables on which data are collected are as follows:

    Dependent Variable

    Y = sales in Rs. lakhs in the territory

    Independent Variables

    X1 = market potential in the territory (in Rs. lakhs).
    X2 = No. of dealers of the company in the territory.
    X3 = No. of salespeople in the territory.
    X4 = Index of competitor activity in the territory on a 5 point scale (1 = low, 5 = high level of activity by competitors).
    X5 = No. of service people in the territory.
    X6 = No. of existing customers in the territory.

    Slide 7

    Input Data:

    The data set, consisting of 15 observations, is given in Fig. 1.

    Fig. 1

    Data File: REGDATA1.STA (15 cases with 7 variables)

          1       2        3        4       5       6        7
          SALES   POTENTL  DEALERS  PEOPLE  COMPET  SERVICE  CUSTOM
    1     5       25       1        6       5       2        20
    2     60      150      12       30      4       5        50
    3     20      45       5        15      3       2        25
    4     11      30       2        10      3       2        20
    5     45      75       12       20      2       4        30
    6     6       10       3        8       2       3        16
    7     15      29       5        18      4       5        30
    8     22      43       7        16      3       6        40
    9     29      70       4        15      2       5        39
    10    3       40       1        6       5       2        5
    11    16      40       4        11      4       2        17
    12    8       25       2        9       3       3        10
    13    18      32       7        14      3       4        31
    14    23      73       10       10      4       3        43
    15    81      150      15       35      4       7        70

    Slide 8

    Correlation

    First, let us look at the correlations of all the variables with each other. The correlation table (output from the computer for the Pearson Correlation procedure) is shown in Fig. 2. The values in the correlation table are standardised, and range from -1 to +1, with 0 indicating no linear correlation.

    Fig. 2: Correlations Table

    STAT. MULTIPLE REGRESS.
    Correlations (regdata1.sta)

    Variable  POTENTL  DEALERS  PEOPLE   COMPET   SERVICE  CUSTOM   SALES
    POTENTL   1.00     .84      .88      .14      .61      .83      .94
    DEALERS   .84      1.00     .85      -.08     .68      .86      .91
    PEOPLE    .88      .85      1.00     -.04     .79      .85      .95
    COMPET    .14      -.08     -.04     1.00     -.18     -.01     -.05
    SERVICE   .61      .68      .79      -.18     1.00     .82      .73
    CUSTOM    .83      .86      .85      -.01     .82      1.00     .88
    SALES     .94      .91      .95      -.05     .73      .88      1.00
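    The last column of Fig. 2 (the correlation of each variable with SALES) can be recomputed directly from the 15 cases in Fig. 1. The sketch below does this in plain Python. The names `NAMES`, `DATA`, `column` and `pearson_r` are introduced here for illustration, and the rows are transcribed from Fig. 1.

```python
import math

# Fig. 1 data, one tuple per sales territory, in the column order of NAMES.
NAMES = ["SALES", "POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]
DATA = [
    (5, 25, 1, 6, 5, 2, 20),
    (60, 150, 12, 30, 4, 5, 50),
    (20, 45, 5, 15, 3, 2, 25),
    (11, 30, 2, 10, 3, 2, 20),
    (45, 75, 12, 20, 2, 4, 30),
    (6, 10, 3, 8, 2, 3, 16),
    (15, 29, 5, 18, 4, 5, 30),
    (22, 43, 7, 16, 3, 6, 40),
    (29, 70, 4, 15, 2, 5, 39),
    (3, 40, 1, 6, 5, 2, 5),
    (16, 40, 4, 11, 4, 2, 17),
    (8, 25, 2, 9, 3, 3, 10),
    (18, 32, 7, 14, 3, 4, 31),
    (23, 73, 10, 10, 4, 3, 43),
    (81, 150, 15, 35, 4, 7, 70),
]

def column(name):
    """All 15 values of one variable."""
    idx = NAMES.index(name)
    return [row[idx] for row in DATA]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Correlation of every independent variable with SALES.
sales_corr = {name: pearson_r(column(name), column("SALES")) for name in NAMES[1:]}
for name, r in sales_corr.items():
    print(f"{name:8s} {r:6.2f}")
```

    Rounded to two decimals, the printed values agree with the SALES column of Fig. 2: .94, .91, .95, -.05, .73 and .88.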

    Slide 9

    1. Looking at the last column of the table, we find that except for COMPET (index of competitor activity), all other variables are highly correlated (ranging from .73 to .95) with Sales.

    2. This means we may have chosen a fairly good set of independent variables (No. of Dealers, Sales Potential, No. of Customers, No. of Service People, No. of Sales People) to try and correlate with Sales.

    3. Only the Index of Competitor Activity does not appear to be strongly correlated (correlation coefficient is -.05) with Sales. But we must remember that the correlations in Fig. 2 are one-to-one (pairwise) correlations of each variable with each other variable. So we may still want to run a multiple regression that includes an independent variable showing low correlation with the dependent variable, because in the presence of other variables, this independent variable may become a good predictor of the dependent variable.

    Slide 9 contd...

    4. The other point to be noted in the correlation table is whether independent variables are highly correlated with each other. If they are, as in Fig. 2, this may indicate that they are not independent of each other, and we may be able to use only 1 or 2 of them to predict the dependent variable.

    5. As we will see later, our regression ends up eliminating some of the independent variables, because not all six of them are required. Some of them, being correlated with other variables, do not add any value to the regression model.

    6. We now move on to the regression analysis of the same data.

    Slide 10

    Regression

    We will first run the regression model of the following form, by entering all 6 'x' variables in the model:

    Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 ..... Equation 1

    and determine the values of a, b1, b2, b3, b4, b5, and b6.

    Regression Output:

    The results (output) of this regression model are shown in Fig. 4 in table form.

    Column 4 of the table, titled B, lists all the coefficients for the model. According to this,

    a (intercept) = -3.17298
    b1 = .22685
    b2 = .81938
    b3 = 1.09104
    b4 = -1.89270
    b5 = -0.54925
    b6 = 0.06594
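    The slides obtain these coefficients from a statistics package. As a rough sketch of what such a package does, the code below fits Equation 1 by ordinary least squares, solving the normal equations with Gaussian elimination in plain Python. The helper names `solve` and `ols` are introduced here for illustration, the rows are transcribed from Fig. 1, and this is not the package's own algorithm.

```python
# Fig. 1 data, one tuple per sales territory, in the column order of NAMES.
NAMES = ["SALES", "POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]
DATA = [
    (5, 25, 1, 6, 5, 2, 20), (60, 150, 12, 30, 4, 5, 50),
    (20, 45, 5, 15, 3, 2, 25), (11, 30, 2, 10, 3, 2, 20),
    (45, 75, 12, 20, 2, 4, 30), (6, 10, 3, 8, 2, 3, 16),
    (15, 29, 5, 18, 4, 5, 30), (22, 43, 7, 16, 3, 6, 40),
    (29, 70, 4, 15, 2, 5, 39), (3, 40, 1, 6, 5, 2, 5),
    (16, 40, 4, 11, 4, 2, 17), (8, 25, 2, 9, 3, 3, 10),
    (18, 32, 7, 14, 3, 4, 31), (23, 73, 10, 10, 4, 3, 43),
    (81, 150, 15, 35, 4, 7, 70),
]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(predictors):
    """Least-squares fit of SALES on the named predictors; returns (coefficients, R-squared)."""
    idx = [NAMES.index(p) for p in predictors]
    X = [[1.0] + [float(row[i]) for i in idx] for row in DATA]  # leading 1 = intercept
    y = [float(row[0]) for row in DATA]
    k = len(X[0])
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1.0 - ss_res / ss_tot

coef, r2 = ols(NAMES[1:])  # all six independent variables, as in Equation 1
for label, value in zip(["a"] + NAMES[1:], coef):
    print(f"{label:8s} {value:10.5f}")
print(f"R-squared {r2:.4f}")
```

    The printed intercept and coefficients can be compared with the values of a and b1 to b6 listed above. The same `ols` function accepts any subset of the variables, for example `ols(["POTENTL", "PEOPLE"])`.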

    Slide 12

    The R² value is 0.977, from the top of Fig. 4. From Fig. 4, we also note that t tests for significance of individual independent variables indicate that at the significance level of 0.10 (equivalent to a confidence level of 90%), only POTENTL and PEOPLE are statistically significant in the model. The other 4 independent variables are individually not significant.

    Fig. 4 MULTIPLE REGRESSION RESULTS:

    All independent variables were entered in one block
    Dependent Variable: SALES
    Multiple R: .988531605
    Multiple R-Square: .977194734
    Adjusted R-Square: .960090784
    Number of cases: 15
    F(6, 8) = 57.13269 p < .000004
    Standard Error of Estimate: 4.391024067
    Intercept: -3.172982117 Std.Error: 5.813394 t(8) = -.5458 p < .600084
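    The summary figures in Fig. 4 are linked by standard formulas: for a model with an intercept, n cases and k independent variables, the overall F statistic and the adjusted R² both follow from R². The short check below applies those formulas to the reported values. The function names are introduced here for illustration.

```python
def f_statistic(r_squared, n_cases, n_predictors):
    """Overall F statistic of a regression with an intercept, computed from R-squared."""
    df_model = n_predictors
    df_resid = n_cases - n_predictors - 1
    return (r_squared / df_model) / ((1.0 - r_squared) / df_resid)

def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """R-squared penalised for the number of predictors in the model."""
    df_resid = n_cases - n_predictors - 1
    return 1.0 - (1.0 - r_squared) * (n_cases - 1) / df_resid

R2, N, K = 0.977194734, 15, 6  # Multiple R-Square, cases and predictors from Fig. 4

print(f"F({K}, {N - K - 1}) = {f_statistic(R2, N, K):.4f}")
print(f"Adjusted R-squared = {adjusted_r_squared(R2, N, K):.6f}")
```

    Both results agree with the F(6, 8) and Adjusted R-Square lines of Fig. 4 to the precision shown there.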

    Slide 12 contd...

    STAT. MULTIPLE REGRESS.
    Regression Summary for Dependent Variable: SALES
    R = .98853160  R² = .97719473  Adjusted R² = .96009078
    F(6,8) = 57.133  p < .00000  Std. Error of Estimate: 4.3910
    N = 15

               BETA      St. Err.  B         St. Err.  t(8)      p-level
                         of BETA             of B
    Intercept                      -3.1729   5.813394  -.54581   .600084
    POTENTL    .439073   .144411   .22685    .074611   3.04044   .016052
    DEALERS    .164315   .126591   .81938    .631266   1.29800   .230457
    PEOPLE     .413967   .158646   1.09104   .418122   2.60937   .031161
    COMPET     -.084871  .060074   -1.89270  1.339712  -1.41276  .195427
    SERVICE    -.040806  .116511   -.54925   1.568233  -.35024   .735204
    CUSTOM     .050490   .149302   .06594    .195002   .33817    .743935
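    The decision rule applied to this table (a variable is significant when its p-level is below the chosen significance level) can be sketched as follows. The B coefficients and p-levels are copied from Fig. 4, and the function name `significant` is introduced here for illustration.

```python
# (variable, B coefficient, p-level) as reported in Fig. 4.
FIG4 = [
    ("POTENTL", 0.22685, 0.016052),
    ("DEALERS", 0.81938, 0.230457),
    ("PEOPLE", 1.09104, 0.031161),
    ("COMPET", -1.89270, 0.195427),
    ("SERVICE", -0.54925, 0.735204),
    ("CUSTOM", 0.06594, 0.743935),
]

def significant(rows, alpha=0.10):
    """Names of the variables whose p-level is below the significance level alpha."""
    return [name for name, _b, p in rows if p < alpha]

print(significant(FIG4))               # 90% confidence level
print(significant(FIG4, alpha=0.05))   # 95% confidence level
```

    At both the 90% and the 95% confidence level only POTENTL and PEOPLE pass the test, which is the conclusion drawn on Slide 12.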

    Slide 13

    However, ignoring the significance of individual variables for now, we shall use the model as it is, and try to apply it for decision making.

    The real use of the regression model would be to try and predict sales in Rs. lakhs, given all the independent variable values.

    The equation we have obtained means, in effect, that sales will increase in a territory if the potential increases, if the number of dealers increases, if the number of salespeople increases, if the level of competitor activity decreases, if the number of service people decreases, or if the number of existing customers increases.

    The estimated change in sales for every unit increase or decrease in these variables is given by the coefficients of the respective variables. For instance, if the number of salespeople is increased by 1, sales are estimated to increase by Rs. 1.09 lakhs, if all other variables are unchanged. Similarly, if 1 more dealer is added, sales are expected to increase by Rs. 0.82 lakh, if other variables are held constant.
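    The reading of a coefficient as "change in sales per unit change, other variables held constant" can be checked numerically. The sketch below uses the coefficients of Equation 1 as listed on Slide 10. The territory values in `base` are hypothetical and chosen only for illustration.

```python
# Intercept and coefficients of Equation 1, from column B of Fig. 4.
INTERCEPT = -3.17298
COEFFS = {
    "POTENTL": 0.22685,
    "DEALERS": 0.81938,
    "PEOPLE": 1.09104,
    "COMPET": -1.89270,
    "SERVICE": -0.54925,
    "CUSTOM": 0.06594,
}

def predict_sales(territory):
    """Predicted sales (Rs. lakhs) from the full six-variable model."""
    return INTERCEPT + sum(COEFFS[name] * territory[name] for name in COEFFS)

# Hypothetical territory (illustrative values, not a row of Fig. 1).
base = {"POTENTL": 50, "DEALERS": 5, "PEOPLE": 12, "COMPET": 3, "SERVICE": 3, "CUSTOM": 25}
one_more_salesperson = dict(base, PEOPLE=base["PEOPLE"] + 1)

change = predict_sales(one_more_salesperson) - predict_sales(base)
print(f"Predicted sales: {predict_sales(base):.2f} lakhs")
print(f"Change from one extra salesperson: {change:.2f} lakhs")
```

    Whatever values are chosen for the other five variables, the difference between the two predictions equals the PEOPLE coefficient, 1.09.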

    Slide 13 contd...

    There is one coefficient, that of the SERVICE variable, which does not make much intuitive sense. If we increase the number of service people, sales are estimated to decrease, according to the -0.55 coefficient of the variable "No. of Service People" (SERVICE).

    But if we look at the individual variable t tests, we find that the coefficient of the variable SERVICE is statistically not significant (p-level 0.735204 from Fig. 4). Therefore, the coefficient for SERVICE is not to be used in interpreting the regression, as it may lead to wrong conclusions.

    Strictly speaking, only two variables, potential (POTENTL) and No. of salespeople (PEOPLE), are statistically significant at the 90 percent confidence level, since their p-levels are less than 0.10. One should therefore only look at the relationship of sales with one of these variables, or both these variables.

    Slide 14 Making Predictions/Sales Forecasts

    Given the levels of X1, X2, X3, X4, X5, and X6 for a particular territory, we can use the regression model for prediction of sales.

    Before we do that, we have the option of redoing the regression model so that the variables not statistically significant are minimized or eliminated.

    We can follow either the Forward Stepwise Regression method, or the Backward Stepwise Regression method, to try and eliminate the 'insignificant' variables from the full regression model containing all six independent variables.

    Forward Stepwise Regression

    For example, we could ask the computer for a Forward Stepwise Regression model, in which case the algorithm adds one independent variable at a time, starting with the one which explains most of the variation in sales (Y). It then adds one more X variable, rechecks the model to see that both variables form a good model, then adds a third variable if it still adds to the explanation of Y, and so on. Fig. 5 shows the result of running a forward stepwise regression, which ends up with only 4 out of 6 independent variables remaining in the regression model.
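    The idea of forward stepwise selection can be sketched as a greedy loop over the Fig. 1 data. This is an illustration under loudly stated assumptions, not the procedure behind Fig. 5: the stopping rule used here (stop when the best remaining variable raises R² by less than `min_gain`) is a simplification made up for this sketch, whereas statistical packages commonly use an F-to-enter or p-value criterion, so the variables retained need not match Fig. 5.

```python
# Fig. 1 data, one tuple per sales territory, in the column order of NAMES.
NAMES = ["SALES", "POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]
DATA = [
    (5, 25, 1, 6, 5, 2, 20), (60, 150, 12, 30, 4, 5, 50),
    (20, 45, 5, 15, 3, 2, 25), (11, 30, 2, 10, 3, 2, 20),
    (45, 75, 12, 20, 2, 4, 30), (6, 10, 3, 8, 2, 3, 16),
    (15, 29, 5, 18, 4, 5, 30), (22, 43, 7, 16, 3, 6, 40),
    (29, 70, 4, 15, 2, 5, 39), (3, 40, 1, 6, 5, 2, 5),
    (16, 40, 4, 11, 4, 2, 17), (8, 25, 2, 9, 3, 3, 10),
    (18, 32, 7, 14, 3, 4, 31), (23, 73, 10, 10, 4, 3, 43),
    (81, 150, 15, 35, 4, 7, 70),
]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(predictors):
    """R-squared of the least-squares fit of SALES on the named predictors."""
    idx = [NAMES.index(p) for p in predictors]
    X = [[1.0] + [float(row[i]) for i in idx] for row in DATA]
    y = [float(row[0]) for row in DATA]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    coef = solve(xtx, xty)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2 for r, yi in zip(X, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def forward_stepwise(candidates, min_gain=0.01):
    """Greedy forward selection; min_gain is a hypothetical stopping threshold."""
    selected, best_r2, history = [], 0.0, []
    remaining = list(candidates)
    while remaining:
        # Try each remaining variable and keep the one giving the highest R-squared.
        trial_r2, trial = max((r_squared(selected + [c]), c) for c in remaining)
        if trial_r2 - best_r2 < min_gain:
            break
        selected.append(trial)
        remaining.remove(trial)
        best_r2 = trial_r2
        history.append((trial, trial_r2))
    return history

history = forward_stepwise(NAMES[1:])
for name, r2 in history:
    print(f"+ {name:8s} R-squared = {r2:.4f}")
```

    The first variable entered is PEOPLE, the variable with the highest correlation with SALES in Fig. 2. Changing `min_gain` changes how many variables are retained, which is why the entry criterion matters when comparing against package output.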

    Slide 17

    The R² for the model has dropped only slightly, to 0.9599, the F-test for the model is highly significant, and both the independent variables POTENTL and PEOPLE are significant at the 90% confidence level (p-levels of .002037 and .000728 from the last column, Fig. 6).

    If we were to decide to use this model for prediction, we would only require data to be collected on the number of salespeople (PEOPLE) and the sales potential (POTENTL) in a given territory. We could form the equation using the Intercept and coefficients from column B in Fig. 6, as follows:

    Sales = -10.6164 + .2433 (POTENTL) + 1.4244 (PEOPLE) ..... Equation 3

    Thus, if potential in a territory were to be Rs. 50 lakhs, and the territory had 6 salespeople, then expected sales, using the above equation, would be

    = -10.6164 + .2433(50) + 1.4244(6)
    = 10.095 lakhs.

    Similarly, we could use this model to make predictions regarding sales in any territory for which Potential and No. of Sales People were known.
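    Equation 3 is simple enough to wrap in a small function. The sketch below reproduces the worked example (potential of Rs. 50 lakhs, 6 salespeople) and, for comparison, applies the same equation to three of the territories in Fig. 1. The function name `predict_sales` is introduced here for illustration.

```python
def predict_sales(potential, salespeople):
    """Equation 3: predicted sales (Rs. lakhs) from POTENTL and PEOPLE."""
    return -10.6164 + 0.2433 * potential + 1.4244 * salespeople

# Worked example from the slide.
print(f"{predict_sales(50, 6):.3f} lakhs")

# (case number, POTENTL, PEOPLE, actual SALES) for three territories in Fig. 1.
cases = [(1, 25, 6, 5), (2, 150, 30, 60), (15, 150, 35, 81)]
for case, potential, people, actual in cases:
    predicted = predict_sales(potential, people)
    print(f"case {case:2d}: predicted {predicted:6.2f}, actual {actual}")
```

    The first line prints 10.095 lakhs, matching the hand calculation. The gap between predicted and actual sales for the Fig. 1 territories is a reminder that the model explains most, but not all, of the variation in sales.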

    Slide 18

    Additional comments

    1. As we can see from the example discussed, regression analysis is a very simple (particularly on a computer) and useful technique to predict one metric dependent variable based on a set of metric independent variables. Its use, however, gets more complex if, for instance, the independent variables are nominally scaled into two (dichotomous) or more (polytomous) categories.

    2. It is also a good idea to define the range of all independent variables used for constructing the regression model. For prediction of Y values, only those X values which fall within or close to this range (used earlier in the model construction stage) should be used, for the predictions to be reliable.

    3. Finally, we have assumed that a linear model is the only option available to us. That is not the only choice. A regression model could be of any non-linear variety, and some of these could be more suitable for particular cases.

    Slide 18 contd.

    4. Generally, in the case of a simple regression model, a look at the plot of Y against X tells us whether the linear (straight line) approach is best or not. But in a multiple regression, such a visual plot may not indicate the best kind of model, as there are many independent variables, and a single plot in 2 dimensions is not possible.

    5. In this particular example, we have not used any macroeconomic variables, but in industrial marketing, we may use industry or macroeconomic variables in a regression model. For example, to forecast sales of steel, we may use as independent variables the growth rate of a country's GDP, the new construction starts, and the growth rate of the automobile industry.
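    Comment 2 above (predict only for X values within or close to the range used to build the model) can be turned into a simple guard. In the sketch below the ranges are the minimum and maximum of POTENTL and PEOPLE over the 15 territories of Fig. 1. The 10% `tolerance` that stands in for "close to" is an arbitrary illustrative choice, not a rule from the slides.

```python
# Minimum and maximum of each variable over the 15 territories in Fig. 1.
TRAINING_RANGE = {"POTENTL": (10, 150), "PEOPLE": (6, 35)}

def out_of_range(inputs, ranges=TRAINING_RANGE, tolerance=0.10):
    """Return the variables whose value lies outside the (widened) training range.

    Each range is widened on both sides by `tolerance` times its width, so that
    values "close to" the range are still accepted.
    """
    flagged = []
    for name, value in inputs.items():
        low, high = ranges[name]
        margin = tolerance * (high - low)
        if not (low - margin <= value <= high + margin):
            flagged.append(name)
    return flagged

print(out_of_range({"POTENTL": 50, "PEOPLE": 6}))     # inside both ranges
print(out_of_range({"POTENTL": 400, "PEOPLE": 20}))   # potential far above 150
```

    A prediction for the second territory would be an extrapolation well beyond the data used to fit the model, so it should be treated with caution.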