#pawcon@pythongeert
Opening Keynote:
Le Mariage Parfait: Combining Logit and Ensemble Modeling for Increased Customer Churn Detection
PAW London Oct 29 – 30, 2014
Geert Verstraeten, PhD
Program Chair, PAW London
prediction
our domain is about predicting the future, but we’re not the community that uses a crystal ball
prediction
instead, by knowing what lies behind us, we make an estimated guess about what’s ahead of us
prediction
example 1: predicting the stock market (keynote by John Elder, PAW London 2013)
prediction
example 2: predicting player injuries (Leicester Tigers rugby team, PAW London 2012)
prediction
example 3: predicting breakdown of the sewerage network in Wales (PAW London 2012)
prediction
in many cases, such as targeted marketing and credit scoring, decision makers are playing a game of Guess Who
predictive analytics can help decision makers win the game
big data
we add business value based on analysis of all kinds of data, including big data
[Gartner hype cycle: technology trigger → peak of inflated expectations → trough of disillusionment → plateau of productivity]
hype cycle
while big data is at the peak of the hype cycle, predictive analytics is in the plateau of productivity
tools
there are a lot of tools out there, i.e. the algorithms that we use to predict; we'll only explain the basic tools used in our study
people who listen to Radiohead are claimed to be more intelligent than those who listen to Beyoncé
[scatter plot: average SAT score (x-axis, 1300–1600) vs. Radiohead chart rank (y-axis, 1–10) for schools including MIT, Harvard College, Yale University, University of Chicago, Wellesley College, University of Southern California and Princeton University; data from musicthatmakesyoudumb.virgil.gr]
linear regression explained using some data from this study
linear regression
Y = α + β · X                      (Y = rank Radiohead, X = SAT score)
Y = α + β₁ · X₁ + β₂ · X₂ + ...
linear regression explained using some data from this study
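That straight-line fit can be sketched in a few lines of plain Python; the (SAT score, rank) pairs below are made up for illustration, not the actual data from the study.

```python
# Ordinary least squares fit of Y = alpha + beta * X,
# via the closed-form solution for simple linear regression.
def fit_ols(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    return alpha, beta

# illustrative pairs: average SAT score (X) vs. Radiohead chart rank (Y)
sat = [1300, 1400, 1500, 1600]
rank = [9, 7, 4, 2]
alpha, beta = fit_ols(sat, rank)   # beta is negative: higher SAT, better rank
```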
[the same scatter plot of SAT score vs. Radiohead rank, shown again to motivate logistic regression]
logistic regression explained using some data from this study
[scatter plot: average SAT score (x-axis, 1300–1600) vs. probability that Radiohead is in the school's top 3 (y-axis, 0–1), with an S-shaped logistic curve through the points]
logistic regression explained using some data from this study
logistic regression
p = 1 / (1 + e^−(α + β · X))                (p = probability Radiohead in top 3, X = SAT score)
p = 1 / (1 + e^−(α + β₁ · X₁ + β₂ · X₂ + ...))
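The logistic link turns any score α + β · X into a probability between 0 and 1. A minimal sketch, with illustrative coefficients rather than fitted ones:

```python
import math

def logistic(x, alpha, beta):
    # p = 1 / (1 + e^-(alpha + beta * x)), always strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

# at x = -alpha / beta the linear part is zero, so p is exactly 0.5;
# the curve rises towards 1 as alpha + beta * x grows
p_mid = logistic(1450, alpha=-29.0, beta=0.02)
p_high = logistic(1600, alpha=-29.0, beta=0.02)
```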
[scatter plot: average SAT score (x-axis, 1300–1600) vs. number of undergraduates (y-axis, 0–18,000) for the same schools plus University of Pennsylvania, partitioned into regions labelled 0%, 50% and 75%]
decision trees explained using some data from this study
decision tree
[decision tree: root node 50%; first split on SAT score at 1,450 into nodes of 25% and 75%; one branch split again on #undergraduates at 9,000 into leaves of 0% and 50%]
decision trees explained using some data from this study
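A tree like this is just nested if-tests. In the sketch below, which of the two SAT branches is split again on school size is an assumption; the leaf values are the proportions shown on the slide.

```python
def tree_predict(sat, n_undergrads):
    # first split: average SAT score at 1,450
    if sat >= 1450:
        return 0.75                 # high-SAT leaf
    # second split (assumed here to be on the low-SAT branch):
    # number of undergraduates at 9,000
    if n_undergrads >= 9000:
        return 0.50
    return 0.0
```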
multiplicity
this is multiplicity photography
multiplicity
in predictive analytics, multiplicity of good models means that several models often deliver comparable results
all models are wrong, but some are useful
George Box
simplicity is theultimate sophistication
Leonardo Da Vinci
multiplicity
ensemble techniques
youtube.com/watch?v=gBnvGS4u3F0
explaining ensembles through a video with penguins and a whale
bagging
original data:
ID  Target
1   1
2   0
3   1
4   0
5   0

bootstrap sample 1:
ID  Target
5   0
5   0
3   1
1   1
1   1

bootstrap sample 2:
ID  Target
2   0
4   0
1   1
4   0
1   1

explaining ensembles without a video with penguins and a whale
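Bootstrap samples like those two are drawn with replacement and have the same size as the original data, so some IDs repeat and others drop out. A minimal sketch:

```python
import random

def bootstrap(ids, rng):
    # draw len(ids) observations with replacement
    return [rng.choice(ids) for _ in ids]

rng = random.Random(42)
original = [1, 2, 3, 4, 5]
samples = [bootstrap(original, rng) for _ in range(2)]
# each sample keeps the original size but may repeat or omit IDs;
# one model is then fitted per sample and their outputs combined
```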
random forests
[figure: several decision trees side by side, each grown on a different bootstrap sample]
random forests are an ensemble of trees
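Each tree in the forest is grown on its own bootstrap sample (with a random subset of candidate features per split), and the forest's prediction is simply the average of the trees' predictions. The toy "trees" below are stand-ins for fitted models:

```python
def forest_predict(trees, x):
    # average the predicted probabilities of all trees in the ensemble
    return sum(tree(x) for tree in trees) / len(trees)

# three toy trees, each pretending to return a churn probability
toy_trees = [lambda x: 0.75, lambda x: 0.50, lambda x: 0.25]
p = forest_predict(toy_trees, x=None)
```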
today
Véronique Van Vlasselaer
Fraud Detection
Phil Winters
Internet of Things
Ajay Kelkar
Cross-selling
Arthur Von Eschen
Cheating Detection
Duncan Ross
Data for Good
the program of PAW London 2014, Wednesday Oct 29
tomorrow
Dieter Debels
Buying flows
Lukas Vermeer
Personalised Recommendations
Gergely Kalmár
Online Behavior
Aaron Ferguson
Malicious Use of Anonymizers
Tom Khabaza
Time-series Data Mining
the program of PAW London 2014, Thursday Oct 30
keynotes
David Hand
The Improbability Principle
John Elder
The Peril of Vast Search
Dean Abbott
The Revolution in Retail Customer Intelligence
keynotes of PAW London 2014
Geert Verstraeten
@pythongeert
pythonpredictions.com
pawcon.com/london
my personal mariage parfait
retention
retention is an important marketing problem for many organizations and industries
predicting churn
churn prediction is a common application where the number of churners is typically small compared to the number of customers that stay
predicting churn
oversampling typically does not work well
predicting churn
mixed results for undersampling
our study
can we combine
undersampling + bagging + logistic regression
and improve on logistic regression?
our study
the idea was to apply bagging, but in a way that always keeps all churners and draws only a proportion of non-churners in each sample
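A sketch of that sampling scheme, with illustrative names throughout: every bag keeps all churners and draws a fresh random subset of non-churners; one logit model would then be fitted per bag and their scores averaged.

```python
import random

def u_bags(churners, non_churners, n_bags, n_keep, seed=0):
    # each bag: ALL churners plus a random sample of n_keep non-churners
    rng = random.Random(seed)
    return [churners + rng.sample(non_churners, n_keep)
            for _ in range(n_bags)]

churners = [f"c{i}" for i in range(5)]       # the rare class, kept intact
non_churners = [f"n{i}" for i in range(95)]
# n_keep = len(churners) gives balanced bags, as in the study's setup
bags = u_bags(churners, non_churners, n_bags=100, n_keep=5)
```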
Kristof Coussement
@KCoussement
ieseg.fr
the mariage parfait of Kristof Coussement, professor of marketing in France, who performed the research
the prize
the prize was needed to convince Kristof to spend his time on this study – 'a research bet'
11 real-life European churn datasets in retail, telco and financial services
variables: 15 – 236
customers: 876 – 74,122
incidence: 1.8% – 47%
our study
A Bagging-Based Undersampling Strategy for Classification: A Customer Churn Prediction Application – IFORS 2014 Barcelona
we compared
• logistic regression
• logit + bagging
• logit + undersampling
• logit + undersampling + bagging
• random forests
our study
A Bagging-Based Undersampling Strategy for Classification: A Customer Churn Prediction Application – IFORS 2014 Barcelona
‘u-bagging’
• logistic regression with a correlation-based feature selection filter
• bagging uses 100 samples
• undersampling: balanced sampling
• 5 x 2-fold cross-validation, evaluated by AUC
• Wilcoxon signed-rank test & Friedman test with Bonferroni–Dunn post-hoc test
details
A Bagging-Based Undersampling Strategy for Classification: A Customer Churn Prediction Application – IFORS 2014 Barcelona
AUC
the probability that a randomly selected churner will be ranked higher than a randomly selected non-churner
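That definition can be computed directly by comparing every churner/non-churner score pair, with ties counting as half; the scores below are made up:

```python
from itertools import product

def auc(churner_scores, stayer_scores):
    # probability that a random churner outscores a random non-churner
    pairs = list(product(churner_scores, stayer_scores))
    wins = sum(1.0 if c > s else 0.5 if c == s else 0.0 for c, s in pairs)
    return wins / len(pairs)

a = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])   # 8 of 9 pairs ranked correctly
```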
results
[bar chart: AUC increase (y-axis, −0.005 to 0.025) per dataset (x-axis, 1 – 11)]
bagging increases performance significantly when compared to logistic regression
results
[bar chart: AUC increase (y-axis, −0.025 to 0.005) per dataset (x-axis, 1 – 11)]
undersampling significantly decreases performance when compared to logistic regression
results
[bar chart: AUC increase (y-axis, 0 to 0.02) per dataset (x-axis, 1 – 11)]
u-bagging increases performance significantly when compared to logistic regression
results
[bar chart: AUC increase (y-axis, −0.015 to 0.01) per dataset (x-axis, 1 – 11)]
u-bagging is competitive with bagging
results
[bar chart: AUC increase (y-axis, −0.08 to 0.06) per dataset (x-axis, 1 – 11)]
u-bagging is competitive with random forests
results
u-bagging has the top average ranking on the 11 real-life churn datasets

average ranking (1 = best):
u-bagging            1.6
bagging              2.2
logistic regression  3.4
random forests       3.5
undersampling        4.4
results
u-bagging is much more efficient than bagging when incidence is low

[chart: number of observations processed per method]
logistic regression            100
bagging                        10,000
undersampling (5% incidence)   10
u-bagging (5% incidence)       1,000
afterthoughts
a more efficient approach to bagging
undersampling vs oversampling
afterthoughts
multiplicity of good models
simplicity is the ultimate sophistication
parallel processing in big data