aubrey-townsend
MKT 700 Business Intelligence and Decision Models
Algorithms and Customer Profiling (1)
Classification and Prediction
Classification: Unsupervised Learning
Predicting: Supervised Learning
SPSS Direct Marketing
                        Classification            Predictive
Unsupervised Learning   RFM                       NA
                        Cluster analysis
                        Postal Code Responses
Supervised Learning     Customer Profiling        Propensity to buy
SPSS Analysis
                        Classification            Predictive
Unsupervised Learning   Hierarchical Cluster      NA
                        Two-Step Cluster
                        K-Means Cluster
Supervised Learning     Classification Trees      Linear Regression
                        - CHAID                   Logistic Regression
                        - CART                    Artificial Neural Nets
Major Algorithms
                        Classification            Predictive
Unsupervised Learning   Euclidean Distance        NA
                        Log Likelihood
Supervised Learning     Chi-square Statistics     Log Likelihood
                        Log Likelihood            F-Statistics (ANOVA)
                        GINI Impurity Index
                        F-Statistics (ANOVA)
Nominal: Chi-square, Log Likelihood
Continuous: F-Statistics, Log Likelihood
Euclidean Distance
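Euclidean Distance for Continuous Variables
Pythagorean distance (two dimensions): $d = \sqrt{a^2 + b^2}$
Euclidean space (three dimensions): $d = \sqrt{a^2 + b^2 + c^2}$
Euclidean distance (any number of dimensions): $d = \left[\sum_i d_i^2\right]^{1/2}$
(Used in cluster analysis with continuous variables.)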
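The distance formula above can be sketched in a few lines of Python. This is a minimal illustration, not the routine SPSS uses; the function name `euclidean_distance` and the sample points are assumptions made for the example.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square the difference on each dimension, sum, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical points, chosen only so the arithmetic is easy to check by hand.
print(euclidean_distance((0, 0), (3, 4)))        # two dimensions (3-4-5 triangle)
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # three dimensions
```

In cluster analysis the variables are usually standardized first, since a variable measured on a large scale would otherwise dominate the distance.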
Pearson’s Chi-Square
Contingency Table
North South East West Tot.
Yes 68 75 57 79 279
No 32 45 33 31 141
Tot. 100 120 90 110 420
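Observed and Theoretical Frequencies
Each cell shows the observed count followed by the theoretical (expected) count in parentheses.
       North     South     East      West      Tot.
Yes    68 (66)   75 (80)   57 (60)   79 (73)   279 (66%)
No     32 (34)   45 (40)   33 (30)   31 (37)   141 (34%)
Tot.   100       120       90        110       420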
Chi-Square: $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
Obs.   fo   fe   fo-fe   (fo-fe)²   (fo-fe)²/fe
1,1    68   66    2       4         .0606
1,2    75   80   -5      25         .3125
1,3    57   60   -3       9         .1500
1,4    79   73    6      36         .4932
2,1    32   34   -2       4         .1176
2,2    45   40    5      25         .6250
2,3    33   30    3       9         .3000
2,4    31   37   -6      36         .9730
X² = 3.032
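Statistical Inference
DF: (4 columns - 1) × (2 rows - 1) = 3
Critical values for DF = 3: 6.251 (α = .10), 7.815 (α = .05)
X² = 3.032 is below both critical values, so the difference between regions is not significant.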
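The calculation above can be reproduced with a short Python sketch. The function names are assumptions made for illustration. The slide rounds the expected frequencies to whole numbers, so the sketch computes the statistic twice: once with the slide's rounded values and once with expected frequencies derived from the row and column totals.

```python
def chi_square(observed, expected):
    """Pearson chi-square: sum of (fo - fe)^2 / fe over all cells."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

def expected_from_margins(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def flatten(table):
    return [cell for row in table for cell in row]

observed = [[68, 75, 57, 79],   # Yes
            [32, 45, 33, 31]]   # No
slide_expected = [[66, 80, 60, 73],
                  [34, 40, 30, 37]]  # rounded values shown on the slide

x2_slide = chi_square(flatten(observed), flatten(slide_expected))
x2_exact = chi_square(flatten(observed), flatten(expected_from_margins(observed)))
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"X2 with rounded expected counts: {x2_slide:.3f}")
print(f"X2 with unrounded expected counts: {x2_exact:.3f}")
print(f"DF: {df}")
```

Both versions stay below the .10 critical value of 6.251, so the rounding does not change the conclusion.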
Log Likelihood Chi-Square
Log likelihood is based on probability distributions rather than contingency (frequency) tables.
It applies to both categorical and continuous variables, whereas chi-square requires continuous variables to be discretized.
Contingency Table (Observed Frequencies)
Cluster 1 Cluster 2 Total
Male 10 30 40
Contingency Table (Expected Frequencies)
Each cell shows the observed count followed by the expected count in parentheses.
       Cluster 1   Cluster 2   Total
Male   10 (20)     30 (20)     40 (40)
Chi-Square: $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
Obs.   fo   fe   fo-fe   (fo-fe)²   (fo-fe)²/fe
1,1    10   20   -10     100        5.00
1,2    30   20    10     100        5.00
X² = 10.00
p < 0.05; DF = 1; Critical value = 3.84
Log Likelihood Distance & Probability
Male          Cluster 1   Cluster 2
O             10          30
E             20          20
O/E           .50         1.50
Ln(O/E)       -.693       .405
O × Ln(O/E)   -6.93       12.164
2∑O×Ln(O/E) = 2 × (-6.93 + 12.164) = 10.46
p < 0.05; critical value = 3.84
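The log-likelihood calculation above can be sketched as follows. The function name `log_likelihood_chi_square` is an assumption made for illustration, and the sketch covers only the single-row frequency example on the slide, not the distance measure used by two-step clustering.

```python
import math

def log_likelihood_chi_square(observed, expected):
    """Log-likelihood chi-square: 2 * sum of O * ln(O / E) over all cells."""
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

observed = [10, 30]   # males in Cluster 1 and Cluster 2
expected = [20, 20]   # an even split of the 40 males

g = log_likelihood_chi_square(observed, expected)
print(f"Log-likelihood chi-square: {g:.3f}")
print("Significant at .05 (critical value 3.84):", g > 3.84)
```

The result is close to the Pearson value of 10.00 for the same table, which is typical when the counts are not very small.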
Variance, ANOVA, and F Statistics
F-Statistics
For metric or continuous variables.
Compares the explained variance (in the model) with the unexplained variance (errors).
Variance
VALUE   MEAN   SQUARED DIFFERENCE
20      43.6   557
34      43.6   92.16
34      43.6   92.16
38      43.6   31.36
38      43.6   31.36
40      43.6   12.96
41      43.6   6.76
41      43.6   6.76
41      43.6   6.76
42      43.6   2.56
43      43.6   0.36
47      43.6   11.56
47      43.6   11.56
48      43.6   19.36
49      43.6   29.16
49      43.6   29.16
55      43.6   130
55      43.6   130
55      43.6   130
55      43.6   130
COUNT = 20    SS = 1461
MEAN = 43.6   DF = 19
              VAR = 76.88
              SD = 8.768
SS is the Sum of Squares; DF = N - 1; VAR = SS/DF; SD = √VAR
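The variance table above can be checked with a few lines of Python. This is a minimal sketch using the 20 values from the slide; the variable names are chosen for illustration.

```python
import math

values = [20, 34, 34, 38, 38, 40, 41, 41, 41, 42,
          43, 47, 47, 48, 49, 49, 55, 55, 55, 55]

n = len(values)
mean = sum(values) / n
ss = sum((x - mean) ** 2 for x in values)  # sum of squared differences from the mean
df = n - 1                                 # degrees of freedom
var = ss / df                              # sample variance
sd = math.sqrt(var)                        # standard deviation

print(f"COUNT={n} MEAN={mean:.1f} SS={ss:.1f} DF={df} VAR={var:.2f} SD={sd:.3f}")
```

The standard library's `statistics.variance` and `statistics.stdev` use the same N - 1 divisor and should give matching results.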
ANOVA
Two groups: t-test.
Three or more group comparisons: are errors (discrepancies between observations and the overall mean) explained by group membership or by some other (random) effect?
One-way ANOVA
Grand mean = 5.042

Observations
Obs.          Group 1   Group 2   Group 3
1             6         8         3
2             5         9         2
3             4         7         1
4             5         8         3
5             4         9         2
6             6         7         1
7             5         8         3
8             4         9         2
Group means   4.875     8.125     2.125

Squared deviations from the group mean, (X-Mean)²
Obs.          Group 1   Group 2   Group 3
1             1.266     0.016     0.766
2             0.016     0.766     0.016
3             0.766     1.266     1.266
4             0.016     0.016     0.766
5             0.766     0.766     0.016
6             1.266     1.266     1.266
7             0.016     0.016     0.766
8             0.766     0.766     0.016
Sum           4.875     4.875     4.875

Squared deviations from the grand mean, (X-Mean)²
Obs.          Group 1   Group 2   Group 3
1             0.918     8.752     4.168
2             0.002     15.668    9.252
3             1.085     3.835     16.335
4             0.002     8.752     4.168
5             1.085     15.668    9.252
6             0.918     3.835     16.335
7             0.002     8.752     4.168
8             1.085     15.668    9.252

SS Within = 14.625
Total SS = 158.958
F = MSS(Between) / MSS(Within)
          Within Groups   Between Groups   Total Errors
SS        14.625          144.333          158.958
DF        24-3=21         3-1=2            24-1=23
Mean SS   0.696           72.167           6.911
F = Between Groups Mean SS / Within Groups Mean SS = 72.167 / 0.696 = 103.624; p-value < .05
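The decomposition above can be reproduced with a short Python sketch. The 24 observations are read back from the garbled slide table, which is an assumption, although their sums (39, 65, 17) match the group totals reported in the summary output. The variable names are chosen for illustration.

```python
groups = {
    "Group 1": [6, 5, 4, 5, 4, 6, 5, 4],
    "Group 2": [8, 9, 7, 8, 9, 7, 8, 9],
    "Group 3": [3, 2, 1, 3, 2, 1, 3, 2],
}

def mean(xs):
    return sum(xs) / len(xs)

all_values = [x for g in groups.values() for x in g]
grand_mean = mean(all_values)

# Within groups: deviations of each observation from its own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
# Between groups: deviations of each group mean from the grand mean, weighted by group size.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
# Total: deviations of each observation from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS within={ss_within:.3f} between={ss_between:.3f} total={ss_total:.3f}")
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

The within and between sums of squares add up to the total, which is the identity the F ratio relies on.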
ONEWAY (Excel or SPSS)
Anova: Single Factor
SUMMARY
Groups    Count   Sum   Average   Variance
Group 1   8       39    4.875     0.696
Group 2   8       65    8.125     0.696
Group 3   8       17    2.125     0.696

ANOVA
Source of Variation   SS        df   MS       F         P-value     F crit
Between Groups        144.333   2    72.167   103.624   1.318E-11   3.467
Within Groups         14.625    21   0.696
Total                 158.958   23
Profiling
Customer Profiling: Documenting or Describing
Who is likely to buy or not respond?
Who is likely to buy what product or service?
Who is in danger of lapsing?
CHAID or CART
CHAID: Chi-Square Automatic Interaction Detector
  Based on Chi-Square
  All variables discretized
  Dependent variable: nominal
CART: Classification and Regression Tree
  Variables can be discrete or continuous
  Based on GINI or F-Test
  Dependent variable: nominal or continuous
Use of Decision Trees
Classify observations from a target binary or nominal variable: Segmentation
Predictive response analysis from a target numerical variable: Behaviour
Decision support rules: Processing
Decision Tree
Example: dmdata.sav
Underlying Theory: X²
CHAID Algorithm: Selecting Variables
Example: Regions (4), Gender (3, including Missing), Age (6, including Missing)
For each variable, collapse categories to maximize the chi-square test of independence. Ex: Region (N, S, E, W, *) becomes (WSE, N*)
Select the most significant variable
Go to next branch … and next level
Stop growing if … estimated X² < theoretical X²
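The "select the most significant variable" step can be sketched as below. This is a simplified illustration under stated assumptions, not the full CHAID procedure: it skips category merging and any p-value adjustment, and it only ranks candidate predictors by the p-value of a chi-square test of independence. The region counts come from the contingency table earlier in the deck; the gender counts are hypothetical, and all function names are assumptions.

```python
import math

def chi_square_independence(table):
    """Return (chi-square, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            x2 += (fo - fe) ** 2 / fe
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # df = 1
    # Step the degrees of freedom up by 2 until df is reached.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Rows are the target (Yes, No); columns are the predictor's categories.
candidates = {
    "region": [[68, 75, 57, 79], [32, 45, 33, 31]],  # from the deck
    "gender": [[150, 129], [50, 91]],                # hypothetical counts
}

alpha = 0.05
p_values = {name: chi2_sf(*chi_square_independence(t))
            for name, t in candidates.items()}
best = min(p_values, key=p_values.get)
split_on = best if p_values[best] < alpha else None  # stop growing if nothing is significant

for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
print("Split on:", split_on)
```

Comparing p-values rather than raw chi-square values matters here because the candidates have different degrees of freedom.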
CART (Nominal Target)
Nominal targets: GINI (Impurity Reduction or Entropy)
Based on the squared probability of node membership.
Gini = 0 when targets are perfectly classified.
Gini Index: $G = 1 - \sum_i p_i^2$
Example: Prob: Bus = 0.4, Car = 0.3, Train = 0.3
Gini = 1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660
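The Gini example above can be checked with a one-line function. This is a minimal sketch; the function name `gini_index` is an assumption made for illustration.

```python
def gini_index(probabilities):
    """Gini impurity: 1 minus the sum of squared class probabilities in a node."""
    return 1.0 - sum(p * p for p in probabilities)

print(round(gini_index([0.4, 0.3, 0.3]), 3))  # Bus, Car, Train example from the slide
print(gini_index([1.0, 0.0, 0.0]))            # a perfectly classified node
```

A tree-growing routine would compare the impurity of a parent node with the weighted impurity of its child nodes and keep the split that reduces it most.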
CART (Metric Target)
Continuous variables: Variance Reduction (F-test)
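Variance reduction for a continuous target can be sketched as a search over candidate cut points on one predictor. This is a simplified illustration, not the CART implementation in SPSS: the data are hypothetical, the function names are assumptions, and only a single predictor and a single split are considered.

```python
def sum_of_squares(xs):
    """Sum of squared deviations from the mean (zero for an empty list)."""
    if not xs:
        return 0.0
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def best_split(x, y):
    """Find the cut point on x that most reduces the sum of squares of y."""
    parent_ss = sum_of_squares(y)
    distinct = sorted(set(x))
    best_threshold, best_reduction = None, 0.0
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        reduction = parent_ss - (sum_of_squares(left) + sum_of_squares(right))
        if reduction > best_reduction:
            best_threshold, best_reduction = threshold, reduction
    return best_threshold, best_reduction

# Hypothetical data: x = number of past purchases, y = amount spent.
x = [1, 2, 3, 4, 5, 6]
y = [10, 12, 11, 30, 32, 31]

threshold, reduction = best_split(x, y)
print(f"Best split: x < {threshold}, sum-of-squares reduction = {reduction:.1f}")
```

The reduction equals the between-groups sum of squares from the ANOVA decomposition, which is why the slide pairs variance reduction with the F-test.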
Comparative Advantages (From Wikipedia)
Simple to understand and interpret
Requires little data preparation
Able to handle both numerical and categorical data
Uses a white box model easily explained by Boolean logic
Possible to validate a model using statistical tests
Robust