DIY Driver Analysis Webinar slides

TIM BOCK PRESENTS

Examples in Q5 (beta)

If you have any questions, enter them into the Questions field. Questions will be answered at the end. If we do not have time to get to your question, we will email you.

We will email you a link to the slides and data.

Get a free one-month trial of Q from www.q-researchsoftware.com.

DIY Driver Analysis

Overview
• Objectives of (key) driver analysis
• Overview of techniques
• 13 assumptions that need to be checked when doing QA for driver analysis (there are workarounds for most of these)

1: There are 15 or fewer predictors (if using Shapley)
2: The outcome variable is monotonically increasing
3: The outcome variable is numeric (if using Shapley)
4: The predictor variables are numeric or binary
5: People do not differ in their needs/wants (segmentation)
6: The causal model is plausible
7: There is no multicollinearity/correlations between predictors (if using GLMs)
8: There are no unexpected correlations between the predictors and the outcome variable
9: The signs of the importance scores are correct
10: The predictor variables have no missing values
11: There are no outliers/influential data points
12: There is no serial correlation (aka autocorrelation)
13: The residuals have constant variance (i.e., no heteroscedasticity in a model with a linear outcome variable)

The basic objective of (key) driver analysis

The basic objective: work out the relative importance of a series of predictor variables in predicting an outcome variable. For example:
• NPS: comfort vs customer service vs price.
• Customer satisfaction: wait time vs staff friendliness vs comfort.
• Brand preference: modernity vs friendliness vs youthfulness.

What driver analysis is not: predictive analysis (e.g., predicting sales, customer churn). However, you can use driver analysis to make strategic predictions (e.g., if I improve, say, fun, then preference will increase).

What the data looks like

Likelihood to recommend   This brand is fun   This brand is exciting   This brand is youthful
6                         1                   1                        1
9                         0                   1                        0
7                         0                   0                        0
6                         1                   1                        1
9                         0                   1                        0
7                         0                   0                        1
7                         0                   0                        0

This data shows 7 observations.
1 outcome variable: Likelihood to recommend.
Predictor variables: the three brand association columns. (Typically there will be more than 3.)

Case study 1: Technology

Outcome variable(s)
Likelihood to recommend:
• Apple
• Microsoft
• IBM
• Google
• Intel
• Hewlett-Packard
• Sony
• Dell
• Yahoo
• Nokia
• Samsung
• LG
• Panasonic

Predictor variable(s)
Brand associations:
• Fun
• Worth what you pay for
• Innovative
• Good customer service
• Stylish
• Easy-to-use
• High quality
• High performance
• Low prices

The data (stacked)

From: one row per respondent

      Likelihood to recommend     This brand is fun           [label missing]
ID    Apple  Microsoft  IBM       Apple  Microsoft  IBM       Apple  Microsoft  IBM
1     6      9          7         1      0          0         1      1          0
2     8      7          7         1      0          0         1      0          0
3     0      9          8         0      1          0         0      0          0
4     0      0          0         0      0          0         0      0          0

To: one row per brand per respondent

ID    Brand       Likelihood to recommend   This brand is fun   [label missing]
1     Apple       6                         1                   1
1     Microsoft   9                         0                   1
1     IBM         7                         0                   0
2     Apple       6                         1                   1
2     Microsoft   9                         0                   1
2     IBM         7                         0                   0
3     Apple       6                         1                   1
3     Microsoft   9                         0                   1
3     IBM         7                         0                   0
4     Apple       6                         1                   1
4     Microsoft   9                         0                   1
4     IBM         7                         0                   0

Tips for stacking
• Get an SPSS .SAV data file. If you do not have an SPSS file:
  • Import your data the usual way
  • Tools > Save Data as SPSS/CSV and Save as type: SPSS
  • Re-import
• Tools > Stack SPSS .sav Data File
• Set the labels for the stacking variable (in Q: observation) in Value Attributes
• Delete any None of these data (e.g., brand associations where respondents were able to select None of these)
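For readers who want to see what stacking does to the data, the reshaping can be sketched in base R. This is only an illustration of the wide-to-stacked idea, not what Q's Tools > Stack SPSS .sav Data File does internally. The data frame and its column names (recommend_*, fun_*) are hypothetical; the values are the first two respondents from the wide illustration.

```r
# Wide data: one row per respondent (hypothetical column names).
wide = data.frame(
  id = c(1, 2),
  recommend_Apple = c(6, 8), recommend_Microsoft = c(9, 7), recommend_IBM = c(7, 7),
  fun_Apple = c(1, 1), fun_Microsoft = c(0, 0), fun_IBM = c(0, 0)
)
brands = c("Apple", "Microsoft", "IBM")

# Stack: build one block of rows per brand, then bind the blocks together.
stacked = do.call(rbind, lapply(brands, function(brand) {
  data.frame(
    id = wide$id,
    brand = brand,
    recommend = wide[[paste0("recommend_", brand)]],
    fun = wide[[paste0("fun_", brand)]]
  )
}))

# Sort so that each respondent's rows sit together: one row per brand per respondent.
stacked = stacked[order(stacked$id, match(stacked$brand, brands)), ]
rownames(stacked) = NULL
stacked
```

Each respondent now contributes three rows, so a single outcome column (recommend) can be modelled against a single column per attribute (fun).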

Case study 2: Cola brand attitude

Outcome variable(s)
Hate/Dislike/Neither/Like/Love/Don’t know:
• Coke Zero
• Coke
• Diet Coke
• Diet Pepsi
• Pepsi Max
• Pepsi

34 Predictor variable(s)
If the brand was a person, what would its personality be?
Brand associations:
• Beautiful
• Carefree
• Charming
• Confident
• Down-to-earth
• Feminine
• Fun
• Health-conscious
• Hip
• Honest
• Humorous
• Imaginative
• Individualistic
• Innocent
• Intelligent
• Masculine
• Older
• Open to new experiences
• Outdoorsy
• Rebellious
• Reckless
• Reliable
• Sexy
• Sleepy
• Tough
• Traditional
• Trying to be cool
• Unconventional
• Up-to-date
• Upper-class
• Urban
• Weight-conscious
• Wholesome
• Youthful

Example output: Importance scores

Key drivers of cola preference

Example output: Performance-Importance Chart (aka Quad Chart)

Example output: Correspondence Analysis with Importance



Standard “best practice” recommendation for driver analysis:

The average improvement in R² that a predictor makes across all possible models (aka “Shapley”)

LMG: Lindeman, Merenda, Gold (1980)
= Kruskal: Kruskal (1987)
= Dominance Analysis: Budescu (1993)
= Shapley / Shapley Value: Lipovetsky and Conklin (2001)
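To make "the average improvement in R² across all possible models" concrete, here is a minimal base R sketch. It is an illustration only, not Q's implementation: the toy data, the variable names, and the helper functions r_squared and shapley_importance are all made up for this example. Each predictor's score is the weighted average of the R² gain from adding it to every subset of the other predictors, using the standard Shapley weights.

```r
# Toy data (made-up values): 8 observations, 1 outcome, 3 binary predictors.
d = data.frame(
  y  = c(6, 9, 7, 6, 9, 7, 7, 8),
  x1 = c(1, 0, 0, 1, 0, 0, 0, 1),
  x2 = c(1, 1, 0, 1, 1, 0, 0, 1),
  x3 = c(1, 0, 0, 1, 0, 1, 0, 0)
)

# R-squared of a linear regression of the outcome on a set of predictors.
r_squared = function(data, outcome, predictors) {
  if (length(predictors) == 0) return(0)  # the empty model explains nothing
  fit = lm(reformulate(predictors, response = outcome), data = data)
  summary(fit)$r.squared
}

# Shapley importance: weighted average gain in R-squared from adding each
# predictor to every possible subset of the remaining predictors.
shapley_importance = function(data, outcome, predictors) {
  p = length(predictors)
  sapply(predictors, function(target) {
    others = setdiff(predictors, target)
    total = 0
    for (k in 0:(p - 1)) {
      subsets = if (k == 0) list(character(0)) else combn(others, k, simplify = FALSE)
      weight = factorial(k) * factorial(p - k - 1) / factorial(p)
      for (s in subsets) {
        gain = r_squared(data, outcome, c(s, target)) - r_squared(data, outcome, s)
        total = total + weight * gain
      }
    }
    total
  })
}

predictors = c("x1", "x2", "x3")
importance = shapley_importance(d, "y", predictors)
importance
sum(importance)  # equals the R-squared of the model containing all predictors
```

The scores add up to the full model's R², which is why they are often reported as percentage shares of explained variance.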

Much too hard: Bespoke models (best practice; e.g., Bayesian multilevel model)
Too hard: GLMs (e.g., linear regression)
Too soft: Bivariate metrics (e.g., correlations, Jaccard coefficients)
Just right: Shapley, Relative Importance Analysis

What makes bespoke models and GLMs too hard?

To estimate an OK bespoke model, you need to have a few weeks, and know lots of things, including:
• Joint interpretation of parameter estimates, the predictor covariance matrix, and the parameter covariance matrix
• Conditional effects
• Multicollinearity
• Confounding (e.g., suppressor effects)
• Estimation (ML, Bayesian)
• Specification of informative priors
• Specification of random effects

To understand importance in a GLM (e.g., linear regression), you need to know quite a lot about:
• Joint interpretation of parameter estimates, the predictor covariance matrix, and the parameter covariance matrix
• Conditional effects
• Multicollinearity
• Confounding (e.g., suppressor effects)

Shapley and similar methods allow us to be less careful when interpreting results

Bespoke models & GLMs
Proportional Marginal Variance Decomposition
Shapley with coefficient adjustment: Lipovetsky and Conklin (2001)
Random Forest (for importance analysis)
Kruskal’s squared partial correlation (called Kruskal in Q)
Relative Importance Analysis (AKA Relative Weight): Johnson (2000)
Shapley

Shapley case study
• Open Initial.Q. This already contains the cola data.
• File > Data Sets > Add to Project > From File > Stacked Technology
• Create > Regression > Driver (Importance) Analysis > Shapley
• Dependent variable: Q3. Likelihood to recommend [Stacked Technology]
• Independent variables: Q4 variables from Stacked Technology
• No when asked about confidence intervals (clicking Yes is OK as well)
• Note that High Quality is the most important, with a score of 18.2
• Right-click: Reference name: shapley

Instructions for the case studies

Everything I demonstrate in this webinar is described on a slide like this. The rest of them are hidden in this deck, but you can get them if you download the slides. So, there is no need to take detailed notes.

1: There are 15 or fewer predictors (if using Shapley)

Issue
Shapley Regression cannot be computed with more than 15 predictors in Q (it reverts to Relative Importance Analysis).
If you run Shapley Regression in the R package relaimpo with this many variables, you get a crash.

Options (ranked from best to worst)
• Relative Importance Analysis
  Comment: Tends to give almost identical results to Shapley regression, and is much, much faster to compute.
• Shapley using only a subset of models (e.g., all predictors, leave 1 predictor out, 2 predictors, 1 predictor)
  Comments: There is no evidence that this method produces accurate estimates of Shapley importance. This is likely to be less similar to a proper Shapley than using Relative Importance Analysis.
• Dimension reduction (e.g., PCA)
  Comment: Hard to interpret.
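For orientation, here is a sketch of how relative weights are commonly computed following Johnson (2000). It is not Q's implementation, which may differ in details such as sign handling, normalization, and missing data. The toy data and the function name relative_weights are made up for this example. The idea is to replace the correlated predictors with their closest orthogonal counterparts, regress the outcome on those, and then map the explained variance back to the original predictors.

```r
# Toy data (made-up values): 8 observations, 1 outcome, 3 binary predictors.
d = data.frame(
  y  = c(6, 9, 7, 6, 9, 7, 7, 8),
  x1 = c(1, 0, 0, 1, 0, 0, 0, 1),
  x2 = c(1, 1, 0, 1, 1, 0, 0, 1),
  x3 = c(1, 0, 0, 1, 0, 1, 0, 0)
)

relative_weights = function(data, outcome, predictors) {
  X = as.matrix(data[predictors])
  r_xx = cor(X)                    # correlations between predictors
  r_xy = cor(X, data[[outcome]])   # correlations of predictors with the outcome
  eig = eigen(r_xx, symmetric = TRUE)
  # Symmetric square root of r_xx: correlations between the original
  # predictors and their orthogonal counterparts.
  lambda = eig$vectors %*%
    diag(sqrt(eig$values), nrow = length(predictors)) %*%
    t(eig$vectors)
  # Standardized coefficients of the outcome on the orthogonal variables.
  beta = solve(lambda, r_xy)
  # Map the explained variance back to the original predictors.
  raw = (lambda^2) %*% (beta^2)
  setNames(as.vector(raw), predictors)
}

weights = relative_weights(d, "y", c("x1", "x2", "x3"))
weights
round(100 * weights / sum(weights), 1)  # as percentage shares
```

Like Shapley scores, the raw weights sum to the R² of the full linear regression, but they need one eigendecomposition rather than a regression for every subset of predictors.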

1: There are 15 or fewer predictors (if using Shapley)
• With the cola study, we have 34 variables, and Shapley would take an impractically long time to compute, so it is not an option and we have to use Relative Importance Analysis.
• We can use the technology data set, which only has 9 predictors, to explore how similar the techniques are.
• Create > Regression > Linear Regression
  • Reference name: relative.importance
  • Select variables
  • Output: Relative importance analysis
  • Check Automatic
  Note that High Quality is again most important
• Right-click: Add R Output: comparison = cbind(shapley = shapley[-10], "Relative Importance" = relative.importance$relative.importance$importance)
• Calculate
• Change shapley to shapley[-10]
• Calculate
• Right-click: Add R Output: correlation = cor(comparison)
• Increase number of decimal places. Note the correlation is 0.999
• Rename output: Correlation
• Insert > Charts > Visualization > Labeled Scatterplot
  • Table: comparison
  • Automatic
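To see why the number of predictors matters so much for Shapley: every subset of the predictors defines a model, so p predictors mean 2^p regressions. A quick calculation in R, using the predictor counts from the two case studies and Q's limit of 15 (the vector names are just labels for this example):

```r
# Number of predictors in each scenario.
predictors = c(technology = 9, q_shapley_limit = 15, cola = 34)

# One model per subset of predictors, including the empty model.
models = 2^predictors
format(models, big.mark = ",", scientific = FALSE)
```

Nine predictors need 512 models and fifteen need 32,768, but 34 predictors would need more than 17 billion.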

2: The outcome variable is monotonically increasing

Issue
All the standard driver analysis algorithms assume that the outcome variable contains categories ordered from lowest to highest, with higher categories believed to be associated with greater levels of preference.

Test
This is usually best checked by creating a summary table.

Options (not mutually exclusive)
• Set Don’t Knows to missing
• Merge categories
  Comments:
  • Do this when there are categories that have ambiguous orderings (e.g., OK and Good).
  • The more categories you merge, the less significant the results will be.
• Recode the data in some meaningful way (e.g., reverse the scale, or recode Likelihood to recommend as NPS)
  Comment: The specific values tend to make little difference, so using a recoding that is easy to explain to stakeholders, such as NPS, is often desirable.
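As an illustration of recoding, here is the usual NPS recoding of a 0 to 10 likelihood-to-recommend rating, sketched in base R: detractors (0 to 6) become -100, passives (7 to 8) become 0, and promoters (9 to 10) become 100. The ratings are made up for this example, with NA standing in for a Don’t Know that has been set to missing.

```r
# Made-up ratings; NA represents a Don't Know that was set to missing.
likelihood = c(6, 9, 7, 10, 0, 8, NA)

# Recode: 9-10 -> 100 (promoter), 7-8 -> 0 (passive), 0-6 -> -100 (detractor).
nps = ifelse(likelihood >= 9, 100, ifelse(likelihood >= 7, 0, -100))
nps

# The mean of the recoded variable is the Net Promoter Score.
mean(nps, na.rm = TRUE)
```

The recoded variable is still ordered from lowest to highest, so it can be used directly as the outcome variable.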

2: The outcome variable is monotonically increasing
• Right-click: Add Table
• Blue drop-down: Likelihood to recommend [Stacked Technology.sav]
  Other than the NET, which is ignored by analyses, the order is correct.
• Click on table of outcome variables in Technology. Note that there is no problem.
• Click on Brand preference. Note that here we have a Don’t Know.
• Right-click on Don’t Know and press Remove. This sets it as missing values.
• Click on two grey outputs (just setting up a later analysis)

3: The outcome variable is numeric (if using Shapley)

Issue
Shapley assumes that the outcome variable is numeric (theoretically, it can deal with non-numeric outcome variables, but for more than about 10 or so variables, it is impractical).

Options
• Use limited dependent variable versions of Relative Importance Analysis (e.g., Ordered Logit)
  Comments:
  • The less numeric the variable, the better this option is.
  • This approach is also preferable because it can take non-linear relationships into account automatically.
• Ignore the problem and use Shapley.
  Comment: Where the variable is close to being numeric, there is probably little lost by this approach.

3: Numeric outcome
• Select relative.importance
• Type: Ordered logit
• Select chart
  We can see that the correlation between the models is smaller than before. This is because correctly modeling the data type has a bigger impact on our results than the difference between Shapley and Relative Importance Analysis.
• Note that the changes are pretty trivial.

4: The predictor variables are numeric or binary

Issue
Both Shapley and Relative Importance Analysis assume that the predictor variables are numeric or binary.

Options
• Set Don’t Knows to missing
  Comment: This can be problematic, as the missing values may not be missing at random. This is discussed later.
• Merge categories
  Comments:
  • Do this when there are categories that have ambiguous orderings (e.g., OK and Good).
  • The more categories you merge, the less significant the results will be.
• Recode the data in some meaningful way (midpoint recoding)
• Use a bespoke or Generalized Linear Model (GLM), with dummy variables and/or splines, computing importance as the difference between the smallest and largest effect sizes for each variable.
  Comment: In theory this is the best approach to dealing with non-numeric data, but it requires quite a lot to get right and, when interpreting the data, the sampling error of the categorical and spline effects will make them hard to compare.

5: People do not differ in their needs/wants (segmentation)

Issue
Traditional driver analysis techniques assume that people have the same needs/wants, and apply these consistently from situation to situation.

How to test
• Compare by brand
• Compare by other data
• Latent class analysis

Options
• Estimate an appropriate bespoke model (e.g., latent class analysis) and then estimate the driver analysis models within each segment
  Comment: In Q: In a non-stacked data file, set up the data as an Experiment, and use Create > Segment > Latent Class Analysis
• Form segments by judgment, and estimate separate relative importance analyses for each segment.
• Ignore the problem, interpreting results as “average” effects
  Comment: Rightly or wrongly, this is how 99.9%* of all modelling is done.
  * Made-up number

5: People do not differ in their needs/wants (segmentation)
• Select the Shapley Importance for Q3… table and press Duplicate
• Brown drop-down: Brand [Stacked Technology.sav]
• Note that this table is showing differences in importance scores by segment. It is not showing differences in perceptions.
• The colors and arrows are telling us that there are differences by brand. Look at Fun. This is significantly lower for Intel, HP, and Dell, but really important in the evaluation of Google.
• What do we do now? Let’s return to the previous slide and look at the options.

6: The causal model is plausible

Issue
All driver analysis techniques assume that the analysis is a plausible explanation of the causal relationship between the predictor variables and the outcome variable. This assumption is never true.

How to test
Common sense. Four common examples are shown on the next slides.

Options (not mutually exclusive)
• Build a bespoke model
  Comment: This is usually too hard
• Include all the relevant (non-outcome) variables and cross your fingers (if you have not collected the data, you cannot magic it into existence)
  Comment: Rightly or wrongly, this is how 99.9%* of all modelling is done
  * Made-up number

Example causality problem: Omitted variable bias

If we fail to include a relevant predictor variable, and that variable is correlated with the predictor variables that we do include, the estimates of importance will be wrong. If your R-square is less than 0.9, you may have this problem (a typical R-square is closer to 0.2 than 0.9).

[Diagram: Predictor 1, Predictor 2, Predictor 3, Predictor 4 (e.g., price), and Outcome 1, with the assumed predictor variables marked. Arrows denote the true causal relationship.]
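A small made-up example shows the mechanism. Below, the outcome truly depends on two correlated predictors (effects of 1 and 2, with no noise, to keep the arithmetic transparent). When the second predictor is left out, the first one absorbs part of its effect. All names and values are hypothetical.

```r
# Made-up data: x2 is correlated with x1, and both drive the outcome.
x1 = c(1, 2, 3, 4, 5, 6)
x2 = c(1, 1, 2, 2, 3, 3)
y  = x1 + 2 * x2        # true effects: 1 for x1 and 2 for x2

full  = lm(y ~ x1 + x2) # both relevant predictors included
short = lm(y ~ x1)      # x2 omitted

coef(full)["x1"]        # recovers the true effect of 1
coef(short)["x1"]       # roughly 1.91: inflated by the omitted x2
```

Any importance measure built on the short model would overstate x1, and nothing in the short model's own output reveals that this has happened.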

Example causality problem: Outcome variable included as a predictor

If we include a predictor variable that is really an outcome variable, the estimates of importance will be wrong.

[Diagram: Predictor 1, Predictor 2, Predictor 3, Outcome 2 (e.g., Satisfaction), and Outcome 1 (e.g., NPS). Arrows denote the true causal relationship.]

6: Outcome variable a predictor
• Click on relative.importance
• Worth what paid for is clearly an example of this problem, so should be deleted.
• Remove Worth what paid for from the model
• Click on Shapley Importance for Q3. Likelihood to recommend
• Click on Comparisons. Note:
  • The warning
  • The R-square is back in the model.
• Change in the R code -10 to -9


Example causality problem: Backdoor path

If a backdoor path exists from the predictors to the outcome variable, the estimates of importance will be wrong (spurious).

[Diagram: Predictor 1 (e.g., price perception), Predictor 2 (e.g., quality), Predictor 3 (e.g., packaging), Variable (e.g., Attitude), and Outcome 1 (e.g., NPS).]


Example causality problem: Functional form

If we have the wrong functional form (i.e., assumed equation), the estimates of importance will be wrong.

Assumed functional form
Outcome = Predictor 1 + Predictor 2 + Predictor 3

True functional form
Outcome = Predictor 1 × Predictor 2 + Predictor 3

7: There is no multicollinearity/correlations between predictors (if using GLMs, e.g., linear regression)

Issue
The bigger the correlations between predictors, the more difficult it is to accurately interpret estimates from traditional GLMs (e.g., linear regression)

Test
1. Inspect the Variance Inflation Factors (VIF) or Generalized Variance Inflation Factors (GVIF). Q automatically computes these and warns you if they are high.
2. Inspect the coefficients. Do they make sense?
3. Look at the correlations.

Options
• Take all the relevant theory into account when interpreting the results.
  Comment: This requires a strong technical and intuitive understanding of the underlying maths. Even if you possess that understanding, it is really difficult to explain to clients (particularly if it is a tracking study and they are seeing results fluctuate from period to period).
• Use Shapley or Relative Importance Analysis.
  Comment: These techniques are designed to address this problem. They are not perfect, but they are easier to interpret than linear regression and other GLMs when predictor variables are correlated.
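Q computes VIFs for you, but the calculation is simple enough to sketch in base R: regress each predictor on all the other predictors, and the VIF is 1 / (1 - R²) from that regression. The data and the helper function vif below are made up for this example. x1 and x2 are strongly correlated, while x3 is uncorrelated with both.

```r
# Made-up predictors: x1 and x2 move together; x3 is unrelated to both.
d = data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(2, 1, 4, 3, 6, 5, 8, 7),
  x3 = c(1, 0, 1, 0, 0, 1, 0, 1)
)

# VIF for each predictor: 1 / (1 - R-squared) from regressing it on the others.
vif = function(data, predictors) {
  sapply(predictors, function(target) {
    others = setdiff(predictors, target)
    fit = lm(reformulate(others, response = target), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs = vif(d, names(d))
round(vifs, 2)
```

A VIF of 1 means a predictor is unrelated to the others (x3 here); the larger the VIF, the less the regression can separate that predictor's effect from the effects of the rest (x1 and x2 here).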

7: There is no multicollinearity
• Click on relative.importance
• Uncheck Automatic
• Change Output back to Summary. Now it is showing an ordered logit regression.
• Change Type to Linear.
• Note that we have a negative effect for Stylish. That is, the model says that if we hold everything else constant, an improvement in Stylish will reduce likelihood to recommend. This doesn’t seem to make sense.
• Create > Correlation > Correlation Matrix. Select:
  • Q3. Likelihood to recommend
  • Q4.
• Note that there is a moderate correlation between Stylish and Likelihood to recommend. And, Stylish is correlated with everything else.
• My intuition is that this is all some weird and uninteresting quirk in the data. But, in truth, I don’t know.
• Conclusion: we should be using Shapley or Relative Importance Analysis. They are designed to remove such headaches.
• Output: Relative Importance Analysis

8: There are no unexpected correlations between the predictors and the outcome variable

Issue
When people interpret importance scores, they assume that higher means better. This assumption is not always right.

Test
Correlate each predictor variable with the outcome variable

Options (ranked from best to worst)
• Investigate the data to make sense of the unexpected relationships.
• Remove problematic variables from the analysis.
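The test can be sketched in base R using the seven illustrative observations from the "What the data looks like" slide. The column names are shortened for this example, and the data is only the slide's illustration, not either case study.

```r
# The seven illustrative observations from the "What the data looks like" slide.
d = data.frame(
  recommend = c(6, 9, 7, 6, 9, 7, 7),
  fun       = c(1, 0, 0, 1, 0, 0, 0),
  exciting  = c(1, 1, 0, 1, 1, 0, 0),
  youthful  = c(1, 0, 0, 1, 0, 1, 0)
)
predictors = c("fun", "exciting", "youthful")

# Correlate each predictor with the outcome variable.
correlations = sapply(d[predictors], function(x) cor(x, d$recommend))
round(correlations, 2)

# Flag predictors whose correlation has an unexpected (negative) sign.
unexpected = names(correlations)[correlations < 0]
unexpected
```

In this tiny illustration, fun and youthful are negatively correlated with likelihood to recommend, so they would be the ones to investigate before interpreting any importance scores.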

8: There are no unexpected correlations
• Click back on the correlation matrix. As we discussed earlier, these all look fine.
• Now, let’s look at the cola correlations. I computed them before, but they will need to re-run because we removed the Don’t Know category from the Outcome Variable.
• Click on cola.correlations
• You can see we have some negative correlations in the first column, which shows the correlations with the outcome variable.

9: The signs of the importance scores are correct

Issue
The underlying Shapley and Relative Importance Analysis algorithms always compute positive importance scores. However, the true effect of a predictor can be negative, resulting in people misinterpreting the results.

Test
Compute a GLM (e.g., linear regression). Any negative coefficients warrant investigation. For this reason, Q automatically does this and puts the signs of the multiple regression coefficients onto the driver analysis outputs (both Shapley and Relative Importance Analysis).
If the correlation is also negative, it means that the effect is negative. If positive, it suggests that the multiple regression is picking up a non-interesting artefact.

Recommendation
If all the effects should be positive, select the Absolute importance scores option. Otherwise, manually change the results when reporting.

9: The signs of the importance scores are correct
• Click on relative.importance
• Stylish is negative. As discussed, it is because of the multiple regression. This is really just a reminder that we need to be careful. We have already looked at this, and we know that the correlation is positive, so we can ignore this.
• Check Absolute importance scores
• Go to comparison, and replace shapley[-9] with abs(shapley[-9])

10: The predictor variables have no missing values

Issue
There are missing values of predictor variables (e.g., some attributes were not collected for some respondents, or there were “don’t know” responses)

Options
• Create a bespoke model that appropriately models the process(es) that cause the values to be missing.
  Comment: This is really hard!
• Multiple imputation of missing values
  Comment: If using Relative Importance Analysis, set Missing Data to Multiple Imputation
• Leave out observations with missing values from the analysis (i.e., complete case analysis)
  Comments: This implicitly assumes that the data is Missing Completely At Random (MCAR; i.e., other than that some variables have more missing values than others, there is no pattern of any kind in the missing data). Test this assumption using Automate > Browse Online Library > Missing Data > Little’s MCAR Test

11: There are no outliers/unusual data points

Issue
A few outliers/unusual observations can skew the results of importance analysis.

Test
• Hat/influence scores
• Standardized residuals
• Cook’s distance

Options (ranked from best to worst)
• Inspect each unusual observation, and understand if it is an error or not
  Comment: Difficult/time consuming
• Filter out all the unusual observations, and check to see if the model has changed. If it has changed, and the number of unusual observations is small, use the new model.
• Ignore the problem
  Comment: This is, by far, the most common approach.
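The three tests are available in base R for any linear regression, which makes for a quick sketch of the idea. The data below is made up: ten points that lie close to a straight line, except the last one. The "filter and refit" step mirrors the second option above, and it is only an illustration, not how Q flags unusual observations.

```r
# Made-up data: the 10th observation is far from the trend of the others.
x = 1:10
y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 5)
fit = lm(y ~ x)

# The three diagnostics named on the slide.
diagnostics = data.frame(
  hat            = hatvalues(fit),      # leverage (hat/influence scores)
  std_residual   = rstandard(fit),      # standardized residuals
  cooks_distance = cooks.distance(fit)  # Cook's distance
)
round(diagnostics, 2)

# Filter out the most influential observation and check whether the model changes.
most_influential = unname(which.max(diagnostics$cooks_distance))
keep = !(seq_along(y) %in% most_influential)
refit = lm(y ~ x, subset = keep)
rbind(all_data = coef(fit), filtered = coef(refit))
```

Here a single observation changes the slope substantially, which is the situation where the filtered model is worth using.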

11: There are no outliers/unusual data points
• Click on 3 more (warnings)
• Note that we have a warning about Unusual Observations. Let’s explore this further.
• Create > Regression > Diagnostic > Plot > Influence Index
• Uncheck Automatic (so that it does not update when we later change the model)
• You can see the various unusual observations have been marked: 379, 382, 295, and so on
• Go to Notes tab and copy code
• Variables and Questions – Stacked Technology.sav
• Right-click on a row: Insert > Variable(s) > R Variable:
  • !(1:4056 %in% c(295, 1744, 1749, 2764, 3420, 246, 2364, 2829, 2830, 4029, 295, 2380, 2764, 3049, 3050))
  • Name: Unusual observations removed
• Check the F in the Tags column. This makes the variable available as a filter.
• Go back to Output: relative.importance
• Apply the filter to the table

12: There is no serial correlation (aka autocorrelation)

Issue
The standard tests for the significance of a predictor assume that there is no serial correlation/autocorrelation (a particular type of pattern in the residuals).
Whenever you stack data you are highly likely to have this problem.

Test
Regression > Diagnostic > Serial Correlation (Durbin-Watson)

Options
• Create a bespoke model that addresses the serial correlation (e.g., a random effects model if the serial correlation is due to repeated measures, or a time series model if it is measured over time)
  Comment: This is a lot of work.
• Don’t report statistical test results (i.e., p-values).
  Comment: The importance scores will be OK. The significance tests will be misleading to an unknown extent.
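For reference, the Durbin-Watson statistic itself is easy to compute from the residuals: it is the sum of squared differences between successive residuals, divided by the sum of squared residuals. The sketch below uses made-up data and computes only the statistic, not the significance test that Q reports. The function name durbin_watson is just a label for this example.

```r
# Durbin-Watson statistic: values near 2 suggest no first-order serial
# correlation; values well below 2 suggest positive serial correlation.
durbin_watson = function(fit) {
  e = residuals(fit)
  sum(diff(e)^2) / sum(e^2)
}

# Made-up data in which neighbouring observations share the same disturbance,
# as happens when stacked rows from the same respondent sit next to each other.
x = 1:10
shared = c(1, 1, 1, 1, 1, -1, -1, -1, -1, -1)
y = x + shared

fit = lm(y ~ x)
dw = durbin_watson(fit)
dw
```

Because the statistic depends on the order of the rows, stacked data, where each respondent's rows are adjacent, tends to produce values below 2.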

12: There is no serial correlation (aka autocorrelation)
• Regression > Diagnostic > Serial Correlation (Durbin-Watson)
• These are significant. As mentioned, this is nearly always the case with stacked data.

13: The residuals have constant variance (i.e., no heteroscedasticity in a model with a linear outcome variable)

Issue
The standard tests for the significance of a predictor in a linear model assume that the variance of the residuals is constant. This is rarely the case in driver analysis, as usually the data is from a bounded scale (e.g., if it is a rating out of 10, it is impossible for a value to be observed that is greater than 10).

Test
Displayr automatically performs the Breusch-Pagan test when Type = Linear

Options
• Use a more appropriate model (e.g., ordered logit)
  Comments: This is not possible with Shapley. These models make other, hopefully less problematic, assumptions (beyond the scope of this webinar).
• Use robust standard errors
  Comments: This is not possible with Shapley. In Q: check Robust standard error

13: The residuals have constant variance
• Click on relative.importance. Note that there is a warning.
• Change the Type to Ordered Logit

Creating the donut chart (earlier in the presentation)
• The first step is to create a new table containing the cola importance scores. Add a new R Output, with code:
    ColaImportance = cola.importance$relative.importance$importance
    # Removing "Performance: " from the labels
    names(ColaImportance) = flipFormat::TidyLabels(names(ColaImportance))
    ColaImportance
• Then, sort the cola importance scores in descending order. Add a new R Output, with code:
    SortedColaImportance = sort(ColaImportance, decreasing = TRUE)
• Create > Charts > Visualization > Donut Chart
  • Select SortedColaImportance as the Table
  • Data value suffix: %
  • Data label decimal places: 1
  • Change the name in the report tree to Donut
  • Check Automatic (at the top)

Creating the performance-importance chart (earlier in the presentation)
• Right-click on Donut in the Report tree and select Add Table
• In the blue drop-down menu, select Performance
• Click on Brand preference, at the top of the Report tree
• Add another table and select Brand [Stacked Cola Brand Associations.sav] in the blue drop-down menu
• Right-click on the 17% next to Diet Coke and select Create Filter
• Make sure that Apply filter to the current table is not selected, and press OK
• Click the Performance table, and apply the Diet Coke filter (using the Filter drop-down, at the bottom-left of the screen)
• Right-click on Performance and select Reference Name and change this to performance
• Right-click on the Performance table in the Report tree and select Add R Output, entering the code:
    PerformanceImportance = cbind("Importance (%)" = ColaImportance, "Diet Coke Brand Associations (%)" = table.Performance[-nrow(table.Performance)])
• Create > Charts > Visualization > Labeled Scatterplot
• Set Table to PerformanceImportance
• Rename to Performance Importance Chart
• Press Calculate

Creating the correspondence analysis chart (earlier in the presentation)
• Right-click on Performance Importance Chart in the Report tree and select Add Table
• In the blue drop-down menu, select Performance
• In the brown drop-down select Brand [Stacked Cola Brand Associations.sav]
• Right-click on None of these and select Remove
• Create > Dimension Reduction > Correspondence Analysis of a Table
  • Table: Performance by Brand
  • Output: Bubble Chart
  • Bubble sizes: ColaImportance
  • Legend title: Importance


Q&A Session

Type questions into the Questions field in GoToWebinar.

If we do not get to your question during the webinar, we will write back via email.
