Multivariate Data Analysis: Regression, Cluster and Factor Analysis on SPSS


Aditya Banerjee (86), Amlan Anurag (90), Apoorva Jain (94), Boris Babu Joseph (98)

Regression Equation

Y = 0.243×X6 - 0.286×X7 + 0.248×X9 + 0.127×X11 + 0.546×X12 + 0.227×X20 + 0.2×X21 - 2.010
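For readers working outside SPSS, the following is a minimal Python sketch of fitting the same kind of multiple regression with statsmodels; the file name (survey.csv) and the column labels (X6 ... X21, Csat) are assumptions for illustration, not the report's actual data set.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("survey.csv")                       # hypothetical data file
predictors = ["X6", "X7", "X9", "X11", "X12", "X20", "X21"]
X = sm.add_constant(df[predictors])                  # adds the intercept term
y = df["Csat"]                                       # customer satisfaction

model = sm.OLS(y, X).fit()
print(model.params)      # unstandardised coefficients, analogous to the equation above
print(model.summary())   # also reports R-square, adjusted R-square and Durbin-Watson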

Product Line has the least effect on customer satisfaction (CSAT), so it should be looked at last when increasing efforts.

Salesforce Image has the most effect on CSAT, so it should be looked at first when increasing efforts.

Existence of Homoscedasticity: All errors have constant variance. This is tested by looking at scatter plots of each independent variable against the dependent variable. We see that X6, X12 and X20 show mild heteroscedasticity, but this magnitude can be ignored.
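As a complement to eyeballing the scatter plots, a numerical check such as the Breusch-Pagan test can be run on the residuals. This is a sketch only, assuming the model and X objects from the regression sketch above.

from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3f}")   # p > 0.05: no strong evidence of heteroscedasticity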

Functional Form of Regression is Linear: The highest power in the equation is 1, i.e. when plotted, the regression equation is a straight line.

Normality of Errors: All errors are normally distributed. As can be seen, there is only one outlier when looking at the errors.
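A quick numerical complement to this visual check is a Shapiro-Wilk test on the residuals; this sketch again assumes the model object from the regression sketch above.

from scipy import stats

shapiro_stat, shapiro_p = stats.shapiro(model.resid)
print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")   # p > 0.05: residuals look approximately normal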

No Multicollinearity: There should be no strong linear dependence between the independent variables. This is checked by looking at Tolerance and VIF. Tolerance is the proportion of a predictor's variance that is not explained by the other independent variables, and VIF is its reciprocal (VIF = 1/Tolerance); a Tolerance below 0.1, or equivalently a VIF above 10, signals problematic multicollinearity.
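A sketch of computing Tolerance and VIF in Python follows, assuming the design matrix X (with constant) from the regression sketch above.

from statsmodels.stats.outliers_influence import variance_inflation_factor

for i, name in enumerate(X.columns):
    if name == "const":
        continue                                     # skip the intercept column
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, Tolerance = {1 / vif:.2f}")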

No Autocorrelation: This is checked by looking at the Durbin-Watson statistic. Values close to 2 indicate no autocorrelation, so a value of 2.3 is acceptable.
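The statistic itself is easy to reproduce outside SPSS; this sketch assumes the model object from the regression sketch above.

from statsmodels.stats.stattools import durbin_watson

print(f"Durbin-Watson: {durbin_watson(model.resid):.2f}")   # values near 2 indicate little autocorrelation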

The R² is 0.835 and the Adjusted R² is 0.822, i.e. the model explains roughly 82% of the variance in customer satisfaction after adjusting for the number of predictors, which indicates a robust model.

The SEE (standard error of the estimate) is also low at 0.5027, which is advisable.

When efforts are being made to increase CSAT, the bulk of those efforts should be directed towards X12, which has the largest coefficient.

E-commerce activities show a coefficient of -0.268, which suggests that an increase in e-commerce activity is not, by itself, contributing to higher consumer satisfaction. Hence, work needs to be done there in the form of discounts or other offers that can be put online.

The highest correlation seen is between the variables Cost Control and Cash and Financial Management, at 0.496, which is not very strong.
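This can be checked from the correlation matrix; a sketch in Python follows, assuming the factor-analysis variables sit in a DataFrame called factors_df (the name is an assumption).

import numpy as np

corr = factors_df.corr()
off_diag = corr.mask(np.eye(len(corr), dtype=bool))   # blank out the 1.0 diagonal
print(off_diag.abs().max().max())                     # highest pairwise correlation (0.496 in the report)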

To determine the number of factors we applied the condition eigenvalue > 1. This gave us four factors. But as we can see, four factors explain only 58% of the variance, which is below our agreeable limit. We can also see that after 4 factors, each additional factor explains a very small amount of variation. Hence we set 5 factors a priori and run the analysis again, the result of which can be seen below.
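For reference, a Python sketch of the same eigenvalue criterion and the five-factor re-run, assuming the third-party factor_analyzer package and the factors_df DataFrame mentioned above (both assumptions):

from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer(rotation=None)
fa.fit(factors_df)
eigenvalues, _ = fa.get_eigenvalues()
print(sum(ev > 1 for ev in eigenvalues))      # Kaiser criterion: count of eigenvalues above 1

fa5 = FactorAnalyzer(n_factors=5, rotation="varimax")   # five factors fixed a priori
fa5.fit(factors_df)
print(fa5.loadings_)                          # rotated factor loading matrix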

We can see in the factor matrix box that Factor 1 has a high correlation with variables 4, 7, 10 and 11; Factor 2 with variables 3 and 5; Factor 3 with variable 6; and Factor 4 with variables 8 and 9, while Factor 5 does not have a high correlation with any of the variables. We can also see that variables 1 and 2 do not have a strong correlation with any of the factors. Hence, on rotation of the matrix a more equitable distribution of variation can be seen, though the total variance explained remains the same. After rotation, Factor 1 shows a high correlation with variables 7, 10 and 11; Factor 2 with variables 1 and 3; Factor 3 with variables 2 and 4; and Factor 4 with variable 8. Variable 6 does not correlate with any of the factors, so we can take it as a separate factor.

Taking the correlation of the variables with their factors, we have given the following labels to the five factors extracted:

1. Cost management
2. Product service
3. Pricing of machinery
4. Marketing
5. Employee productivity

DATA CLEANING

We have treated the missing values in the Likert-scale (1-7) variables: values which were shown to be higher than 7 were replaced with the mean of the given variable. This produced a whole new set of variables for the subsequent analysis. This was done using the Transform menu:

TRANSFORM > REPLACE MISSING VALUES

Select Series mean.

CHANGE CAPTURED: values of 9 were changed to the mean value of that particular variable.
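The same cleaning step can be sketched in pandas; the DataFrame name df and the Likert column names x1 to x11 are assumptions for illustration.

import pandas as pd

likert_cols = [f"x{i}" for i in range(1, 12)]
for col in likert_cols:
    valid = df[col].where(df[col].between(1, 7))      # treat out-of-range codes such as 9 as missing
    df[col + "_clean"] = valid.fillna(valid.mean())   # replace them with the variable's mean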

FACTOR ANALYSIS

Multicollinearity occurs when two or more predictor variables are highly correlated; small changes in the data might then lead to large jumps in the estimated coefficients.

To address the issue of multicollinearity, we have run factor analysis.

With a KMO value above 0.6, the sample is adequate for factoring and the issue of multicollinearity is addressed.

ANALYZE > DIMENSION REDUCTION > FACTOR

Multicollinearity check completed.
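A sketch of the KMO check in Python, assuming the third-party factor_analyzer package and a cleaned DataFrame named df_clean (both assumptions):

from factor_analyzer.factor_analyzer import calculate_kmo

kmo_per_item, kmo_overall = calculate_kmo(df_clean)
print(f"Overall KMO: {kmo_overall:.2f}")   # above 0.6 is treated as adequate in this report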

FACTOR ANALYSIS

Awareness, Attitude and Preference combined into the first factor, which can be classified as Consumer Attitude, as it captures what may influence consumers and how their perception is built.

Purchase and Loyalty combined into the second factor, which can be considered Consumer Loyalty, as these variables reflect how the consumer feels about the brand and holds it above others in comparison.

CLUSTERING

The highest change in the agglomeration coefficient was noticed from Stage 40 to Stage 41, which means that agglomeration had to stop at that point.

N = 45

No. of clusters = 45 - 40 = 5
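A sketch of the same stopping rule using SciPy's hierarchical clustering, assuming the cleaned DataFrame df_clean of clustering variables (an assumption):

from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(df_clean.values, method="ward")      # agglomeration schedule (one row per merge stage)
jumps = Z[1:, 2] - Z[:-1, 2]                     # change in the agglomeration coefficient at each stage
print(int(jumps.argmax()) + 1)                   # 1-indexed stage just before the biggest jump (stop merging here)

labels = fcluster(Z, t=5, criterion="maxclust")  # cut the dendrogram into five clusters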

PROFILING AND INTERPRETATION

Gender & Usage

An ANOVA test was run to check whether the cluster classification differed significantly when based on gender or usage patterns.

It was found that no significant associations were present.
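A sketch of such a one-way ANOVA across clusters in Python, assuming df_clean carries a cluster label column and a usage column (both column names are assumptions):

from scipy.stats import f_oneway

groups = [grp["usage"].values for _, grp in df_clean.groupby("cluster")]
f_stat, p_value = f_oneway(*groups)
print(f"p-value: {p_value:.3f}")   # p > 0.05: clusters do not differ significantly on usage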

K-MEANS VS HIERARCHICAL CLUSTERING

It was found that there were major differences in the number of cases/respondents that each cluster received under the two methods.

Although the number of clusters is the same, the mean values of the various variables also differ across the two methods because the cluster memberships change.

Number of cases in each cluster:

Cluster 1: 15
Cluster 2: 12
Cluster 3: 5
Cluster 4: 5
Cluster 5: 8
Valid: 45
Missing: 0

Hierarchical Method

K-Means Method

Recommended
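For completeness, a sketch of running both methods side by side in scikit-learn and comparing the resulting cluster sizes, assuming the df_clean DataFrame used above (an assumption):

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

hier_labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(df_clean.values)
km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(df_clean.values)

# Compare how many respondents each method places in each cluster.
print(np.bincount(hier_labels), np.bincount(km_labels))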