Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute


Outline of this lecture

Datamining for ICT & Economics, 2

Part 1: Intelligent Decisions in Direct Mailing

Part 2: Brand Choice using Ensemble Methods

Part 3: Ensemble techniques for Choice Problems, especially Churn

Part 1: Intelligent Decisions in Direct Mailing

Rob Potharst, Uzay Kaymak, Wim Pijls Erasmus University Rotterdam

Faculty of Economics, Dept. of Computer Science

Jedid-Jah Jonker, SCP and Nanda Piersma, HES

Datamining for ICT & Economics, part 1: Direct Mailing


Outline

• Decision problems in direct mailing

• The charity organization case

• Target selection
– models: logreg, CHAID, neural networks, association rules, fuzzy modelling

• The frequency problem
– models: MDP, reinforcement learning

(italic: CI methods)


Classical literature

• Optimal mailing policies: Bitran & Mondschein (1996), Mailing Decisions in the Catalog Sales Industry

• Target selection: Bult & Wansbeek (1995), Optimal Selection for Direct Mail


This part of the lecture is based on:

• R. Potharst, U. Kaymak & W. Pijls (2001), Neural Networks for Target Selection in Direct Marketing

• W. Pijls, R. Potharst & U. Kaymak (2001), Pattern-based Target Selection Applied to Fund Raising

• U. Kaymak (2001), Fuzzy Target Selection using RFM variables

• J.J. Jonker, N. Piersma & R. Potharst (2002), Direct Mailing Decisions for a Dutch Fundraiser

http://www.few.eur.nl/few/people/potharst/


Thanks to:

• Jedid-Jah Jonker (Soc.Cult.Planb., Den Haag)
• Uzay Kaymak (Erasmus University, R’dam)
• Nanda Piersma (HES, A’dam)
• Wim Pijls (Erasmus University, R’dam)
• an anonymous charity organization


Decisions in direct mailing

• Target Selection: To which addresses are we going to send the next mailing?

• Frequency: How often are we going to send a mailing to each separate address?

• Inventory Size: How many items of each product should we have in stock?

• etc.


Charity case

• A large Dutch charity organization

• Goal: to stimulate social and scientific research on a common disease

• More than 700 000 supporters

• Annual budget larger than 15M euro

• Multiple mailing campaigns a year, asking for donations


Database

• Information about over 700 000 supporters
• About 675 000 considered for mailings
• Supporter’s donation history is traced after first-ever donation (cumulative database)
• Recorded data (about 0.5 GB)
– mailing dates
– donation amount
– donation time
– administrative data


Target selection

• Problem from (direct) marketing

• Generation of profiles (models) of customers who could be interested in a product

• Models built by analyzing data from similar (previous) campaigns

• Classification problem
– separate positive cases from negative cases and determine their characteristics


Target selection cycle

[Diagram: cycle with the nodes product, conceptualization, test campaign, data gathering, model, target selection, customers and purchase.]



Charity donations

• Charity organizations have supporters who donate money for the good cause

• Invite supporters to donate through several mailings per year

• Charity organizations may have different strategies for mailing supporters

• Select those supporters who are likely to donate in a particular mailing


Target selection for supporters

[Diagram: cycle with the nodes supporters, data gathering (past donation behavior), model, target selection and more donations.]


Target selection models

• Segmentation based, e.g. CHAID
– divide customer base into disjoint segments
– select most promising segments
– segments assumed to be homogeneous

• Scoring based, e.g. logistic regression
– score each customer in the customer base
– select customers with highest scores
– individual approach


Gain chart

[Figure: gain chart. x-axis: fraction selected (0 to 1); y-axis: fraction of responders (0 to 1); curves: ideal, typical, random. Annotations at 20% selected: G_t(20) = 40/20 = 2.0 and G_e(20) = 30/20 = 1.5.]


Hit probability chart

[Figure: hit probability chart. x-axis: fraction selected (0 to 1); y-axis: response fraction (0.3 to 1); curves: ideal, typical, random.]
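Both charts can be computed from a list of model scores and observed responses. The sketch below is only an illustration under assumptions: the function names and the ten-supporter toy data are hypothetical, and the gain curve is read as "fraction of all responders captured" against "fraction of supporters selected", with the hit probability as the response fraction inside the selection.

```python
def gain_curve(scores, responded):
    """Return (fraction selected, fraction of responders) points of a gain chart."""
    # Rank supporters from highest to lowest model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total = sum(responded)
    points, hits = [(0.0, 0.0)], 0
    for rank, i in enumerate(order, start=1):
        hits += responded[i]
        points.append((rank / len(order), hits / total))
    return points

def hit_probability(scores, responded, fraction):
    """Response fraction among the top `fraction` of supporters by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = max(1, round(fraction * len(order)))
    return sum(responded[i] for i in order[:n]) / n

# Hypothetical toy data: ten supporters, four of whom responded.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
curve = gain_curve(scores, responded)
```

In this toy example, selecting the top 20% captures half of the responders, so the gain at 20% is 0.5 / 0.2 = 2.5 times what a random selection would give.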


Data sources

• External databases: rental list
– maintained by specialized companies
– household-specific information
– demographic information at ZIP code level

• Internal databases: house list
– maintained by the company itself
– traces purchase history of customer
– most reliable and relevant information about the customer


RFM variables

• Recency
How recent was the last purchase?
E.g. number of days since last purchase

• Frequency
How frequent are the purchases?
E.g. fraction of responded mailings

• Monetary value
How much has the customer spent?
E.g. average spending per mailing
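As an illustration, the three example variables can be derived from one supporter's mailing history. This is a minimal sketch under assumptions: the record layout (a list of mailing date and donated amount pairs, with amount 0 for no response) and the function name `rfm` are hypothetical, and the mailing date is used as a stand-in for the purchase date.

```python
from datetime import date

def rfm(mailings, today):
    """mailings: list of (mailing_date, amount); amount is 0 when there was no response."""
    responses = [(d, a) for d, a in mailings if a > 0]
    # Recency: days since the last responded mailing (None if never responded).
    recency = (today - max(d for d, _ in responses)).days if responses else None
    # Frequency: fraction of mailings that were responded to.
    frequency = len(responses) / len(mailings)
    # Monetary value: average amount per mailing received.
    monetary = sum(a for _, a in mailings) / len(mailings)
    return recency, frequency, monetary

history = [(date(1999, 1, 10), 10.0), (date(1999, 4, 10), 0.0),
           (date(1999, 7, 10), 20.0), (date(1999, 10, 10), 0.0)]
r, f, m = rfm(history, today=date(1999, 12, 31))
```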


Feature selection

• RFM variables
– often appropriate to capture specifics of customers
– relatively small number of variables
– not suitable for identifying new or future prospects

• feature selection (and sometimes reduction) still needed to select most relevant variables


Why neural networks?

• Neural networks can hopefully be used for building good target selection models that can predict likely charity supporters successfully

• Performance might be better than segmentation models like CHAID, and scoring methods like logistic regression



Feature selection

• R1=Number of weeks since last response

• R2=Number of months since first-ever donation

• F1=Fraction of responded mailings

• F2=Response time for last response

• M1=Average donated amount per mailing

• M2=Last donated amount

• M3=Average donation per year

Data preparation

• Data set selection
– which previous mailing to use for modeling?
– influence of mailing strategy
– select most recent full mailings (1998, 1999)

• Data set size
– about 5000 randomly selected supporters
– independent training and test sets
– training set 1998: 4057 samples
– test set 1998: 4080 samples
– training set 1999: 4111 samples
– test set 1999: 4131 samples


Feedforward neural network

[Diagram: input layer, hidden layer, output layer; activation labels: logistic, linear]

• 7 inputs
• 1 hidden layer
• 4 hidden neurons
• 1 output
• normalized inputs and outputs
• initial weights random in (-0.1, 0.1)
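A forward pass through a network of this shape might look as follows. This is a sketch, not the network used in the study: it assumes the logistic activation belongs to the hidden layer and the linear one to the output neuron, it adds bias weights, and all names (`init_network`, `forward`) are hypothetical. Training by backpropagation is left out.

```python
import math
import random

def init_network(n_in=7, n_hidden=4, seed=0):
    rng = random.Random(seed)
    u = lambda: rng.uniform(-0.1, 0.1)   # initial weights random in (-0.1, 0.1)
    # One weight vector per hidden neuron; index 0 holds the (assumed) bias weight.
    hidden = [[u() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output = [u() for _ in range(n_hidden + 1)]
    return hidden, output

def forward(network, x):
    hidden, output = network
    # Hidden layer: logistic activation of a weighted sum of the inputs.
    h = [1.0 / (1.0 + math.exp(-(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))))
         for w in hidden]
    # Output neuron: linear combination, used as the score of a supporter.
    return output[0] + sum(wi * hi for wi, hi in zip(output[1:], h))

net = init_network()
score = forward(net, [0.5] * 7)   # seven normalized RFM-style inputs
```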


Results on 1999 data set

[Figure: gain chart. x-axis: fraction selected; y-axis: fraction of responders; curves: ideal, nn trained on 1998 data, nn trained on 1999 data, random.]


Results on 1999 data set

[Figure: hit probability chart. x-axis: fraction selected (0 to 0.5); y-axis: response fraction (0.4 to 0.8); curves: nn trained on 1998 data, nn trained on 1999 data.]


NN vs. logistic regression

[Figure: gain chart. x-axis: fraction selected; y-axis: fraction responded; curves: ideal, neural network, logistic regression, random.]

Training set 1998, test set 1999


NN vs. logistic regression

[Figure: hit probability chart. x-axis: fraction selected (0 to 0.5); y-axis: response fraction (0.4 to 0.8); curves: neural network, logistic regression.]


Neural network vs. CHAID

[Figure: gain chart. x-axis: fraction selected; y-axis: fraction of responders; curves: ideal, neural network, CHAID, random.]



Conclusions

• Neural networks can be used to build target selection models successfully

• They outperform segmentation methods like CHAID, but performance is comparable to statistical regression methods

• There is evidence that a neural network model can be used for target selection in multiple mailing campaigns


Why patterns/association rules?

                    Scoring methods   Segmentation methods
Response rate              +                  +/-
Interpretability           -                   +

Question: Is it possible to have + , + ?

Answer: this study! = pattern-based


Patterns and their support

a b c resp
1 1 3 1
1 1 3 1
1 2 1 0
1 2 1 0
1 2 3 1
1 2 3 0
2 1 3 1
2 1 3 1
3 1 2 0
4 2 1 0
4 2 1 1
4 2 3 0
4 2 3 0

pattern                support   #responses
b = 2, c = 3              4          1
c = 2                     1          0
a = 1, b = 1, c = 3       2          2


Definitions

• a pattern is a set of attribute/value combinations

• a record R is a supporter of a pattern P if all attr/val combinations of P match those of R
– Example: (3,1,2) is a supporter of ( b = 1, c = 2 )

• the support of a pattern P is the number of supporters of P
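The definitions can be checked against the thirteen example records of the slide "Patterns and their support". The sketch below is an illustration only; the dictionary representation of a pattern and the names `matches` and `support_and_responses` are hypothetical choices.

```python
RECORDS = [  # (a, b, c, resp): the thirteen records of the example table
    (1, 1, 3, 1), (1, 1, 3, 1), (1, 2, 1, 0), (1, 2, 1, 0), (1, 2, 3, 1),
    (1, 2, 3, 0), (2, 1, 3, 1), (2, 1, 3, 1), (3, 1, 2, 0), (4, 2, 1, 0),
    (4, 2, 1, 1), (4, 2, 3, 0), (4, 2, 3, 0),
]
ATTRS = {"a": 0, "b": 1, "c": 2}

def matches(pattern, record):
    """A record supports a pattern if every attribute/value pair of the pattern matches."""
    return all(record[ATTRS[attr]] == value for attr, value in pattern.items())

def support_and_responses(pattern, records):
    supporters = [r for r in records if matches(pattern, r)]
    return len(supporters), sum(r[3] for r in supporters)

print(support_and_responses({"b": 2, "c": 3}, RECORDS))
```

The printed pair is the support and the number of responses of the pattern ( b = 2, c = 3 ), which can be compared with the first row of the pattern table.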


Frequent patterns

• Given a minimum support minsup, a pattern P is said to be frequent if

support( P ) ≥ minsup

• The set of frequent patterns can be represented by a trie

• An algorithm for finding frequent itemsets (like Apriori by Agrawal et al.) can also be used to find frequent patterns


The trie of frequent patterns

Support and response counts

With response rates


Selecting the target group

a b c   mrr (%)
1 1 2   100
3 1 2    80
2 2 2   100
3 1 3   100
1 2 1    50
3 2 3    37.5
4 2 2    25
3 2 1    25
3 1 3   100
4 1 2    80

The first record (1,1,2) matches the following frequent patterns:

( a = 1 ) => resp. rate = 50 %

( b = 1 ) => resp. rate = 80 %

( a = 1, b = 2 ) => resp. rate = 100 % => max (mrr)

Target group:

a b c
1 1 2
2 2 2
3 1 3
3 1 3
3 1 2

Target group:


PatSelect

Input: a set of records

Output: a subset of size n: the target group

1. For all records R in the given set do:

• let S be the set of all frequent patterns that match R

• let mrr( R ) = max { resp.rate ( P ) | P in S }

2. Sort all records according to decreasing mrr

3. Select the topmost n records
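The three steps can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: frequent patterns are enumerated by brute force rather than with Apriori or a trie, records are plain tuples, and the names `frequent_patterns` and `pat_select` as well as the toy data are hypothetical.

```python
from itertools import combinations

def frequent_patterns(train, n_attrs, minsup):
    """train: list of (attribute_tuple, resp). Returns {pattern: response rate}."""
    stats = {}
    for attrs, resp in train:
        for size in range(1, n_attrs + 1):
            for idx in combinations(range(n_attrs), size):
                # A pattern is a tuple of (attribute index, value) pairs.
                pattern = tuple((i, attrs[i]) for i in idx)
                sup, hits = stats.get(pattern, (0, 0))
                stats[pattern] = (sup + 1, hits + resp)
    # Keep only the frequent patterns and store their response rate.
    return {p: hits / sup for p, (sup, hits) in stats.items() if sup >= minsup}

def pat_select(records, rates, n):
    def mrr(rec):
        # Step 1: maximum response rate over all frequent patterns matching rec.
        matching = [rate for p, rate in rates.items()
                    if all(rec[i] == v for i, v in p)]
        return max(matching, default=0.0)
    # Steps 2 and 3: sort by decreasing mrr and keep the topmost n records.
    return sorted(records, key=mrr, reverse=True)[:n]

train = [((1, 1), 1), ((1, 1), 1), ((1, 2), 0), ((2, 2), 0), ((2, 1), 1), ((2, 2), 0)]
rates = frequent_patterns(train, n_attrs=2, minsup=2)
target = pat_select([(2, 2), (1, 1), (1, 2)], rates, n=1)
```

Because every selected record carries the pattern that produced its mrr, the reason for its inclusion in the target group can be reported alongside the selection.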



Fund raising application

• Dutch charity organization

• more than 700 000 supporters

• 26 mailing campaigns (dates, targets, responses)

• spread over six years (‘94 - ‘99)

• database of over 400 MB



Research questions

1) How to select a target group with as high a response rate as possible, on the basis of history data

2) How to select a target group with as high a total amount donated as possible, again on the basis of history data

This study: question 1.



RFM features

R1: # weeks since last response

R2: # months since first donation

F1: fraction of mailings supporter has responded to

F2: median response time of supporter

M1: etc.


Model construction

• Choose only full mailing campaigns 98/99

• random split:
– training set 50 %
– test set 50 %

• resulting datasets:
– tr98, tr99
– test98, test99
– each somewhat less than 200 000 cases!!


Results ’99, trained on ’98 data

Results ’99, trained on ’99 data

Comparison

• Neither a pure scoring, nor a pure segmentation method

• not segments, since patterns can be overlapping!

• many patterns => many different scores => performance comparable with scoring methods

• but also:


Interpretability

high, since each supporter’s presence in the target group can be explained by its inclusion in a pattern with high response rate!!!


Conclusions

• New method based on patterns and association rule algorithms with following characteristics:
– response rate high
– interpretability high

• interesting method, especially for large databases


Why fuzzy?

Advantages of fuzzy target selection models in marketing

• prediction power larger than conventional statistical models

• large degree of transparency due to the linguistic rules that can be derived from data


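Fuzzy target selection

• FCM clustering in feature product space

• Average response rate per cluster:
$r_i = \frac{\sum_{k=1}^{N} u_{ik}\, r_k}{\sum_{k=1}^{N} u_{ik}}, \quad r_k \in \{0, 1\}$

• Score per customer:
$s_k = \frac{\sum_{i=1}^{C} u_{ik}\, r_i}{\sum_{i=1}^{C} u_{ik}}$

• Customer segmentation:
$u^*_{ik} = 1$ if $u_{ik} = \max_{1 \le j \le C} u_{jk}$, and $0$ otherwise

• Rule derivation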
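The scoring step can be illustrated numerically. The sketch assumes the formulas on this slide read as follows: the response rate of cluster i is the membership-weighted average of the 0/1 responses, and the score of customer k is the membership-weighted average of the cluster response rates. The membership matrix below is a hypothetical example, not output of a real clustering.

```python
def cluster_response_rates(U, r):
    """U[i][k]: membership of customer k in cluster i; r[k] in {0, 1}: response."""
    return [sum(u * rk for u, rk in zip(row, r)) / sum(row) for row in U]

def customer_scores(U, rates):
    n_clusters, n_customers = len(U), len(U[0])
    return [sum(U[i][k] * rates[i] for i in range(n_clusters)) /
            sum(U[i][k] for i in range(n_clusters))
            for k in range(n_customers)]

# Two clusters, four customers; the first two customers responded.
U = [[0.9, 0.8, 0.2, 0.1],
     [0.1, 0.2, 0.8, 0.9]]
r = [1, 1, 0, 0]
rates = cluster_response_rates(U, r)
scores = customer_scores(U, rates)
```

Customers are then ranked by score, exactly as in any other scoring-based target selection model.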


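Fuzzy clustering

Partition data into overlapping sets based on similarity amongst patterns

Given the data:
$\mathbf{x}_k = [x_{1k}, x_{2k}, \ldots, x_{nk}]^T, \quad k = 1, \ldots, N$

Find the fuzzy partition matrix:
$U = \begin{bmatrix} u_{11} & \cdots & u_{1N} \\ \vdots & \ddots & \vdots \\ u_{C1} & \cdots & u_{CN} \end{bmatrix}$

and the cluster centres:
$V = \{\mathbf{v}_1, \ldots, \mathbf{v}_C\}, \quad \mathbf{v}_i \in \mathbb{R}^n$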


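Fuzzy clustering

Minimize objective function

$J(X; U, V) = \sum_{i=1}^{C} \sum_{k=1}^{N} u_{ik}^{m}\, d^2(\mathbf{x}_k, \mathbf{v}_i)$

subject to

$u_{ik} \in [0, 1], \quad i = 1, \ldots, C, \quad k = 1, \ldots, N$ (membership degree)

$\sum_{i=1}^{C} u_{ik} = 1, \quad k = 1, \ldots, N$ (total membership)

$0 < \sum_{k=1}^{N} u_{ik} < N, \quad i = 1, \ldots, C$ (no cluster empty)

$m \in (1, \infty)$ is the fuzziness parameter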
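One common way to minimize this objective is the alternating update scheme of fuzzy c-means: recompute memberships from the current centres, then recompute centres from the memberships. The slides do not spell out the optimization, so the sketch below is an assumption about the method, using squared Euclidean distance; the function names and the one-dimensional toy data are hypothetical.

```python
def fcm(X, centres, m=2.0, iters=50):
    """Fuzzy c-means with squared Euclidean distance; returns (U, centres)."""
    for _ in range(iters):
        # d2[i][k]: squared distance between centre i and point k (floored to avoid 0).
        d2 = [[max(sum((a - b) ** 2 for a, b in zip(x, v)), 1e-12) for x in X]
              for v in centres]
        # Membership update: u_ik = 1 / sum_j (d2_ik / d2_jk)^(1/(m-1)).
        U = [[1.0 / sum((d2[i][k] / d2[j][k]) ** (1.0 / (m - 1))
                        for j in range(len(centres)))
              for k in range(len(X))] for i in range(len(centres))]
        # Centre update: mean of the data weighted by u_ik^m.
        centres = [[sum(U[i][k] ** m * X[k][d] for k in range(len(X))) /
                    sum(U[i][k] ** m for k in range(len(X)))
                    for d in range(len(X[0]))] for i in range(len(centres))]
    return U, centres

def objective(X, U, centres, m=2.0):
    """Value of J for a given partition matrix and set of centres."""
    return sum(U[i][k] ** m * sum((a - b) ** 2 for a, b in zip(X[k], centres[i]))
               for i in range(len(centres)) for k in range(len(X)))

X = [[0.0], [0.1], [0.9], [1.0]]
U, V = fcm(X, centres=[[0.2], [0.8]])
```

By construction the memberships of each data point sum to one, which is the "total membership" constraint above.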


Feature selection

• R1=Number of weeks since last response
• R2=Number of months since first-ever donation
• F1=Fraction of responded mailings
• F2=Response time for last response (median)
• M1=Average donated amount per mailing
• M2=Last donated amount
• M3=Average donation per year


Feature reduction

• Use logistic regression to build a target selection model

• Use only features whose corresponding weights deviate significantly from zero

• Selected features
– Number of weeks since last response (TIMELR)
– Number of months since first-ever donation (TIMECL)
– Fraction of responded mailings (FRQRES)


Feature reduction

[Figure: hit probability chart on training data. x-axis: fraction selected; y-axis: response fraction; curves: 7 variables, 3 variables.]


Fuzzy scoring model

[Figure: hit probability chart on evaluation data. x-axis: fraction selected; y-axis: response fraction; curves: fuzzy clustering (40 clusters), logistic regression.]


Fuzzy segmentation model

[Figure: hit probability chart on evaluation data. x-axis: fraction selected; y-axis: response fraction; curves: classification (40 segments), logistic regression.]


Linguistic rules

[Figure: membership functions for timelr, timecl and frqres. x-axis: normalized domain (-1 to 2); y-axis: membership (0 to 1). Linguistic terms: Very short, Short, Long, Very long for timelr and timecl; Rare, Infrequent, Frequent, Often for frqres.]

domains are normalized

• If TIMELR is very short and TIMECL is not very short and FRQRES is often then response rate is 0.81

• If TIMELR is short and TIMECL is very long and FRQRES is often then response rate is 0.75

• If TIMELR is very short and TIMECL is very short and FRQRES is often then response rate is 0.65

• If TIMELR is short and TIMECL is short or long and FRQRES is often then response rate is 0.60


Conclusions

• Fuzzy target selection with RFM features
– Transparent models for target selection with good prediction power
– Product space fuzzy clustering
– Accuracy surpasses statistical models
– Transparency by linguistic rules

• Future research
– estimation of uncertainty bounds of the model
– modeling of donation amounts



Frequency problem

• How many mails should I send to this client this year?

• Model as a Markov Decision Process (MDP)

• Theory is based on Markov Chains

• Start with small introduction to MC’s


Markov Chain

[Diagram: a system jumping between the states 1, 2 and 3 over the stages t0, t1, t2, t3, ….]

• System with m states (here m = 3)

• from one stage to the next, the state jumps from state j to state k with probability p(j,k)

• p(j,k) is called transition probability


Transition matrix

$P = \begin{bmatrix} p(1,1) & \cdots & p(1,m) \\ \vdots & \ddots & \vdots \\ p(m,1) & \cdots & p(m,m) \end{bmatrix}$

p(j,k) = Pr{ end up in k after 1 step | start in j }

$p^{(2)}(j,k)$ = Pr{ end up in k after 2 steps | start in j } $= \sum_{i=1}^{m} p(j,i)\, p(i,k)$


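Stationary distribution

• $PP = P^2$ = transition matrix after two steps

• $P^n$ = transition matrix after n steps

• if $n \to \infty$, then $P^n \to Q$ (for a regular, i.e. irreducible and aperiodic, chain) with

$Q = \begin{bmatrix} q_1 & \cdots & q_m \\ q_1 & \cdots & q_m \\ \vdots & & \vdots \\ q_1 & \cdots & q_m \end{bmatrix}$

• $\{q_1, \ldots, q_m\}$ is the stationary distribution

• property: Q Q = Q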


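Example

$P = \begin{bmatrix} 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \end{bmatrix}, \qquad Q = \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix}$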
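The example can be verified numerically by raising P to a high power. The helper names below are hypothetical; the sketch uses plain lists so that it needs no external library.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def matrix_power(P, n):
    # Start from the identity matrix and multiply by P n times.
    result = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        result = matmul(result, P)
    return result

P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
P2 = matrix_power(P, 2)    # two-step transition probabilities
Q = matrix_power(P, 50)    # numerically indistinguishable from the limit
```

Every row of Q is the same vector, the stationary distribution, so after many steps the starting state no longer matters.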


Markov Decision Process

[Diagram: a system with 3 states (1, 2, 3) evolving over the stages t0, t1, t2, t3, …; an agent with 2 actions acts on the system and receives the rewards r0, r1, r2, r3, ….]

S = { states }

A = { actions }

S and A finite


Transition, reward matrix, policy

• transition matrix: p(j, k, a) = probability of ending in state k after action a has been performed on state j

• matrix has m × m × |A| elements

• p: S × S × A → [0,1]

• reward matrix: r(s,a) = the payoff the agent gets by taking action a in state s

• r: S × A → ℝ

• policy: π(s) = the action that should be taken when the system is in state s

• π: S → A


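Optimal policy

$\pi^* = \arg\max_{\pi} E(r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots)$

γ = discount factor, denotes how heavily future rewards should weigh

0 < γ < 1

Exactly, for a system that starts in state s and follows policy π:

$r_0 = r(s, \pi(s))$

$r_1 = \sum_{i=1}^{m} r(i, \pi(i))\, p(s, i, \pi(s))$

etc.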
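Value iteration is one standard way to compute such an optimal policy when the transition and reward matrices are known. The sketch below is a generic illustration, not the decision support system of the lecture; the two-state example (a lapsed and an active donor, mail or no mail) and all its numbers are hypothetical.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-9):
    """P[a][j][k]: probability of moving from j to k under action a; R[j][a]: reward."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    while True:
        # Expected discounted return of taking action a in state j, then acting optimally.
        Q = [[R[j][a] + gamma * sum(P[a][j][k] * V[k] for k in range(n_states))
              for a in range(n_actions)] for j in range(n_states)]
        V_new = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            policy = [q.index(max(q)) for q in Q]
            return policy, V_new
        V = V_new

# Hypothetical example: state 0 = lapsed donor, state 1 = active donor;
# action 0 = send no mailing, action 1 = send a mailing.
P = [
    [[1.0, 0.0], [0.5, 0.5]],   # action 0
    [[0.7, 0.3], [0.1, 0.9]],   # action 1
]
R = [[0.0, -1.0],               # lapsed donor: no mailing, mailing
     [2.0, 4.0]]                # active donor: no mailing, mailing
policy, V = value_iteration(P, R)
```

In this toy setting mailing a lapsed donor costs money in the current period, but the chance of reactivating the donor makes it worthwhile once future rewards are taken into account.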



The frequency problem

• Stages: the planning periods (years)

• Actions: A = {0,1,2,3,4} the number of mailings to be sent

• System: client

• Agent: firm

• States: the RFM-profile of the client



States

• g = gift size, in categories: 0,1,…,5

• m = number of mailings: 0,1,…,4

• r = number of responses: 0,1,…,4

• Example: s = (3,4,1) means the client received 4 mailings, responded to 1, with a fair-sized gift

• Number of states is 55 (not 6 x 5 x 5 since not all combinations are possible)
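The count of 55 can be reproduced by enumeration if two feasibility rules are assumed: a client cannot respond more often than the number of mailings received, and the gift size category is 0 exactly when there was no response. These rules are an assumption that happens to yield the stated number; the slides do not list them.

```python
states = [(g, m, r)
          for g in range(6)        # gift size category 0..5
          for m in range(5)        # number of mailings 0..4
          for r in range(5)        # number of responses 0..4
          if r <= m and (g == 0) == (r == 0)]

n_actions = 5                      # A = {0, 1, 2, 3, 4}
n_transition_elements = len(states) ** 2 * n_actions
print(len(states), n_transition_elements)
```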


Transition matrix

• Contains 55 x 55 x 5 = 15125 elements

• A transition probability must be zero when the number of mailings (m) in the destination state differs from the action (a)

• All other probabilities must be estimated from historical data!


Reward matrix

• r(s,a) = the expected total amount donated in this period by a client in state s, given the number of mailings received (a), minus the cost of these a mailings

• Also to be estimated from historical data


Scenarios + demo

• Optimal policy calculation: linear programming, value iteration, policy iteration

• Scenarios: give extra weight to some states

• Demonstration of the prototype decision support system we developed with the charity organization



No historic data available?

• Transition and reward matrix cannot be estimated!

• Learn the optimal policy by reinforcement learning (without knowing P and R)

• Will be our next project together with Uzay Kaymak and Michiel van Wezel

Part 2: Brand Choice using Ensemble Methods

Rob Potharst, Michiel van Rijthoven and Michiel van Wezel

Erasmus University Rotterdam, Econometric Institute

Outline

• Ensemble methods: bagging, boosting, stacking, etc.
– What? Why? How?

• Brand choice
– classical statistical models
– neural network models
– ensemble methods


References

On ensemble methods:

• Hastie, Tibshirani & Friedman: The Elements of Statistical Learning, Springer-Verlag, 2001

On the brand choice application:

• R. Potharst, M. van Rijthoven and M. van Wezel, "Modeling Brand Choice using Boosted and Stacked Neural Networks", In: Kevin E. Voges et al. (Ed.), Business Applications and Computational Intelligence, to be published in 2005 by Idea Group Inc.

• M. van Wezel and R. Potharst, "Brand Choice, Bagging, Boosting, Bias and Variance". Technical Report, sept '04, submitted for publication.

• Vroomen, B., Franses, P.H. & van Nierop, E., (2004). Modeling consideration sets and brand choice using artificial neural networks. European Journal of Operational Research, 154, 206-217.


Ensemble? What does it mean?

• an ensemble is a group, instead of an individual

• two know more than one, and more than two may know even more

• examples:
– from real life: voting, committees, weathermen
– from computational intelligence: a set of neural network / decision tree models


How can you combine models?

Depends on your error loss function, but in general:

• for classification problems: voting!

• for regression problems: averaging!
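The two combination rules are simple enough to write down directly. The function names are hypothetical; with a tie in the vote, `Counter.most_common` returns the label that was counted first.

```python
from collections import Counter
from statistics import mean

def vote(predictions):
    """Classification: the class label predicted by most ensemble members."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Regression: the mean of the predictions of the ensemble members."""
    return mean(predictions)

label = vote(["yes", "no", "yes"])
value = average([1.0, 2.0, 3.0])
```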


Example


Does it really work?

• Hm.. it seems a bit simple…
• How could we check whether it actually works?
• Try it on a simple problem
• Credit scoring dataset from the internet (UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html)
• Want a loan? Are you creditworthy: will you pay back, yes/no?


Credit scoring example

[Figure: % correct plotted against size of ensemble.]



What is a (base) classifier?

something that takes an input vector and assigns an (estimated) class label to it:


What kind of classifiers?

We will use classifiers that learn to do their job from a training set of examples (or instances or patterns):

$T = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$

How to train such a classifier is the subject of Machine Learning, a field very close to Computational Intelligence.


Classifiers: neural networks


Universal Approximation

• Neural networks can implement arbitrarily shaped boundaries between classes.

• This is called the universal approximation theorem for feedforward neural networks with 1 hidden layer (among others Barron, 1993)

• By adding hidden nodes (and training examples!) one can get as close to the real boundaries as one wishes.



Decision boundaries of neural network


Decision boundaries of traditional model (multinomial logit)


Classifiers: decision trees

from: Mitchell, Machine learning, 1997


Decision tree: example


Problem 1: overfitting


Problem 2: Instability

• That is: the model is very dependent on the specific training set you have; take one out and…


Instability of a model

• Let us study the instability of a model in the setting of a regression problem:

[Formula: y is generated as a sine-based function of x plus noise.]

• 5 training sets of 200 examples each
• 5 linear models (blue)
• 5 neural networks (green)
• dotted lines: the average models!



Prediction error

• If we use a squared error loss function the prediction error is:

$PE(f_T)(\mathbf{x}) = E_{T,y}\,(f_T(\mathbf{x}) - y)^2$

• Define the average model of a classifier as

$f_A(\mathbf{x}) = E_T\, f_T(\mathbf{x})$

• And let f* be the “real” underlying model ( = E(y|x))


Bias-variance decomposition

• Then we can derive the following formula:

$PE(f_T)(\mathbf{x}) = \sigma^2_{noise} + (f_A(\mathbf{x}) - f^*(\mathbf{x}))^2 + E_T\,(f_T(\mathbf{x}) - f_A(\mathbf{x}))^2$

• Or: Prediction error = irreducible error + bias² + variance

• So: if we try to approximate the average model, we get variance 0. This is the idea behind ensemble methods!


Bagging

• abbreviation for bootstrap aggregating
• Breiman (1994)
• create "bootstrapped" datasets by randomly drawing from the original dataset, with replacement!
• the bootstrapped datasets have the same size as the original dataset
• build a model for each bootstrapped dataset
• combine these models by averaging or voting
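The recipe translates almost line by line into code. The sketch below is only an illustration: the base learner `fit_threshold` is a hypothetical one-dimensional threshold classifier standing in for a tree or neural network, and the data are made up.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    # Draw len(data) examples from the original dataset, with replacement.
    return [rng.choice(data) for _ in data]

def bagging(data, fit, n_models=25, seed=0):
    rng = random.Random(seed)
    # Build one model per bootstrapped dataset.
    models = [fit(bootstrap(data, rng)) for _ in range(n_models)]
    def predict(x):
        # Combine the models by voting.
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict

def fit_threshold(sample):
    """Hypothetical base learner: cut halfway between the two class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == -1]
    if not pos or not neg:               # bootstrap sample with a single class
        label = 1 if pos else -1
        return lambda x: label
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    cut = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos > mean_neg else -1
    return lambda x: sign if x > cut else -sign

data = [(0.1, -1), (0.2, -1), (0.3, -1), (0.4, -1),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
predict = bagging(data, fit_threshold)
```

For a regression problem the vote in `predict` would be replaced by an average of the model outputs.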



Boosting

• Adaboost (Freund & Schapire, 1996)
• Each example in the training set has a weight attached to it; initial weights w1 = 1/N
• Generate a sequence of models:
• build model M1 on the training set with weights w1
• examples misclassified by M1 get a higher weight
• etc.



Adaboost (= adaptive boosting) for 2 classes: -1, +1

1. Initialize the boosting weights: $w_i = \frac{1}{N}$ for i = 1,…, N

2. For m = 1 to M perform each of the following:

a) Train model $F_m(x)$ on T with weights $w_i$

b) Compute $err_m = \sum_{i=1}^{N} w_i\, I(y_i \neq F_m(x_i))$

c) Compute $\alpha_m = \log\left(\frac{1 - err_m}{err_m}\right)$

d) Redefine the weights: $w_i \leftarrow w_i \exp(\alpha_m\, I(y_i \neq F_m(x_i)))$

e) Normalize the weights: $w_i \leftarrow \frac{w_i}{\sum_{k=1}^{N} w_k}$

3. Output the final combined model: $O(x) = \mathrm{sgn}\left(\sum_{m=1}^{M} \alpha_m F_m(x)\right)$
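The algorithm can be sketched with decision stumps as base models. This is an illustration under assumptions: the error is the weighted misclassification rate, the model weight is log((1 - err) / err), the stump learner `fit_stump` and the six-point dataset are hypothetical, and a small guard is added against a zero error.

```python
import math

def fit_stump(X, y, w):
    """Weighted decision stump on 1-D inputs: returns (threshold, sign)."""
    best = None
    for t in sorted(set(X)):
        for s in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w) if (s if xi >= t else -s) != yi)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best[1], best[2]

def adaboost(X, y, M=3):
    N = len(X)
    w = [1.0 / N] * N                                     # step 1
    ensemble = []
    for _ in range(M):                                    # step 2
        t, s = fit_stump(X, y, w)                         # a) train on weighted data
        pred = [s if xi >= t else -s for xi in X]
        err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)    # b)
        err = min(max(err, 1e-10), 1 - 1e-10)             # guard against log(0)
        alpha = math.log((1 - err) / err)                 # c)
        w = [wi * math.exp(alpha * (p != yi))             # d) raise weights of errors
             for wi, p, yi in zip(w, pred, y)]
        total = sum(w)
        w = [wi / total for wi in w]                      # e) normalize
        ensemble.append((alpha, t, s))
    def predict(x):                                       # step 3
        score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
        return 1 if score >= 0 else -1
    return predict

X = [1, 2, 3, 4, 5, 6]
y = [1, 1, -1, -1, 1, 1]     # no single stump classifies this correctly
predict = adaboost(X, y, M=3)
```

No single stump can separate the middle interval from both ends, but the weighted combination of three stumps can, which is the point of boosting weak base models.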


Stacking

• two levels of learning:
– first level: train several models on training set
– second level: again train the combination of these models

• not a fixed voting scheme for combining the models, but: learn an optimal combination method from the data


Brand Choice

• classical topic in marketing
• a product has k brands
• consumer/household wants to buy product
• which brand does he pick?
• given:
– household characteristics (income, etc.)
– product factors (price, etc.)
– situational factors (product on display, etc.)


Modeling brand choice

• classical statistical models (multinomial logit, conditional logit, etc): linear

• neural network models: nonlinear

• a model by Vroomen et al., 2004: neural networks with built-in so-called consideration sets


the Vroomen model

• as many hidden nodes as there are brands
• three types of variables:
– X: household characteristics, e.g. size, income
– Z: brand characteristics, e.g. price level, promotion, advertising
– W: choice-specific characteristics, e.g. observed price at purchase occasion


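the Vroomen model

[Equations: the final-choice term FC_j has the multinomial logit form exp(·) / Σ_{m=1}^{J} exp(·), where the argument for brand j is an intercept plus a linear combination of the consideration-set terms CS_k, k = 1, …, J, and of the choice-specific variables W_qj, q = 1, …, Q; the consideration-set term is CS_j = F(·), with as argument an intercept plus a linear combination of the household characteristics X_i, i = 1, …, I, and the brand characteristics Z_pj, p = 1, …, P.]

$F(x) = \frac{1}{1 + e^{-x}}$ = logistic function

the Vroomen model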


Dataset

• scanner data: 3055 purchases of liquid detergent of 6 brands (part of ERIM database)

• 400 households
• 4 X variables (volume, non-det, size, time)
• 4 Z variables (price, feature, display, recency)
• 1 W variable (again price)


Experiments

1. split 400 households randomly into three groups (tr 200, va 100, te 100)
2. use backprop for Vroomen, test on va + te
3. use 25 iterations of the boosting algorithm, test combined model on va + te
4. use stacking on the 25 models, use va to find coefficients, test on te

Repeat this whole cycle 10 times!


Results

3 to 4 % gain in predictive performance


Conclusions

• By using ensemble methods we can increase the predictive performance

• on the other hand: because we get combined models they are harder to interpret

• future work: interpreting the combined model!

Part 3: On the Use of Ensemble Techniques for Modeling Choice Problems in Marketing, especially Churn

by Aurélie Lemmens, Rob Potharst, Michiel van Wezel (Erasmus University Rotterdam) and Christophe Croux (Catholic University Leuven)

Characteristics of Ensemble Techniques

• Developed in statistics / datamining / machine learning communities

• Not yet applied to marketing problems (as far as we know)
• High potential for choice problems such as brand choice and churn
• Successfully applied to other fields like fraud detection, text categorization, chemometrics
• Especially successful with respect to predictive power, which can be directly translated into money
• Easy to apply



How do Ensemble methods work?

1. Develop a number of so-called base models for a problem

Could be any model: dt, nn, logit, …

2. Combine these base models into a final choice model

Combination can be done with: voting, weighted voting, …


Existing Ensemble methods

• Bagging (= Bootstrap Aggregating): Breiman, 1996

• Boosting:
– Adaboost: Freund & Schapire, 1996
– Stochastic gradient boosting: Friedman, 2002

• Stacking: Wolpert, 1992


Based on 4 recent papers

[1] “Bagging and Boosting Classification Trees to Predict Churn”, to appear in Journal of Marketing Research, 2006 (by Lemmens and Croux)

[2] “Bagging a Stacked Classifier”, appeared in 2005 (by Croux, Joossens and Lemmens)

[3] “Modeling Brand Choice using Boosted and Stacked Neural Networks”, appeared in 2006 (by Potharst, van Rijthoven and van Wezel)

[4] “Improved Customer Choice Predictions using Ensemble Methods”, submitted to European J Oper Res (by van Wezel and Potharst)


Ensemble techniques used

paper   bagging   boosting   stacking
[1]        X         X
[2]        X                     X
[3]                  X           X
[4]        X         X



Base learners used

paper DT NN LDA LR

[1] X

[2] X X X X

[3] X

[4] X


Marketing problems considered

paper   churn   brand choice
[1]       X
[2]
[3]                  X
[4]                  X



Data sets

paper Company / sector

[1] US wireless telecom company

[2] 12 benchmark datasets from machine learning

[3] Scanner data for six brands of liquid detergent

[4] Scanner data for ketchup / peanut butter brands


In depth: paper [1], “Bagging and Boosting Classification Trees to Predict Churn” (by Lemmens and Croux)


The Context

– The 2002 Churn Tournament organised by the Teradata Center for CRM at Duke University

– Churn means defecting from a company, i.e. taking one's business elsewhere

– Customer database from an anonymous U.S. wireless telecom company

– Challenge: predicting churn in order to develop targeted retention strategies (Bolton et al. 2000, Ganesh et al. 2000, Shaffer and Zhang 2002)

– Details can be found in Neslin et al. (2004)


The Context (cont’d)

– The US Wireless Telecom market (2004)
• 182.1 million subscribers
• Leader in market share: Cingular Wireless
– 26.9% of total market volume
– turnover US$19.4 billion / net income US$201 million
• Other major players: AT&T, Verizon, Sprint and Nextel
• Mergers & Acquisitions: Cingular with AT&T Wireless & Sprint with Nextel


The Context (cont’d)

– Churn
• High churn rates: 2.6% a month
• Causes: increased competition, lack of differentiation, market saturation
• Cost: $300 to $700 to replace a lost customer, in terms of sales support, marketing, advertising, etc.
• Targeted retention strategies


Formulation of the Churn Problem

• Churn as a classification issue:

Classify a customer i characterized by K variables xi = (xi1, xi2, …, xiK) as

– Churner yi = +1

– Non-churner yi = -1

• Churn is the binary response variable to predict: yi = f(xi)

Choice of the binary choice model f( . )?


Classification Models in Marketing

• Simple binary logit choice model (e.g. Andrews et al. 2002)

• Models allowing for the heterogeneity in consumers’ response:
– Finite mixture model (e.g. Wedel and Kamakura 2000)
– Hierarchical Bayes model (e.g. Yang and Allenby 2003)

• Non-parametric choice models:
– Decision trees, neural nets (e.g. Thieme et al. 2000; West et al. 1997)
– Bagging (Breiman 1996), Boosting (Freund and Schapire 1996), Stochastic gradient boosting (Friedman 2002)


Bagging, boosting and stochastic gradient boosting: mostly ignored in the marketing literature

S.G.B. (stochastic gradient boosting) won the Tournament (Cardell, from Salford Systems)

Decision Trees for Churn

Example:

[Figure: decision tree with splits on change in consumption (< 0.5 vs. ≥ 0.5), customer care calls (< 3 vs. ≥ 3), age (< 26, 26 to 54, ≥ 55) and handset price (< $150 vs. ≥ $150); the leaves predict churn Yes or No.]




Bagging and Boosting

• Machine Learning Algorithms

• Principle: classifier aggregation (Breiman, 1996)

• Tree-based method (e.g. Currim et al. 1988)

• Bagging: Bootstrap AGGregatING


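[Diagram: from the calibration sample Z = {(xi, yi)}, i = 1, …, N, random samples $Z^*_1$, $Z^*_2$, … are drawn; on each of them a base classifier (e.g. a tree) is estimated, giving $\hat{f}^*_1(x)$, $\hat{f}^*_2(x)$, ….]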


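Aggregating bootstrap samples

[Diagram: the score functions $\hat{f}^*_1(x), \hat{f}^*_2(x), \hat{f}^*_3(x), \ldots, \hat{f}^*_B(x)$ are combined.]

Churn propensity score: $\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^*_b(x)$

Churn classification: $\hat{c}_{bag}(x) = \mathrm{sign}(\hat{f}_{bag}(x))$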


Bagging

• Let the calibration sample be Z = {(x1,y1), …, (xi,yi), …, (xN,yN)}

• B bootstrap samples $Z^*_b$, b = 1, 2, …, B

• From each $Z^*_b$, a base classifier (e.g. tree) is estimated, giving B score functions: $\hat{f}^*_1(x), \ldots, \hat{f}^*_b(x), \ldots, \hat{f}^*_B(x)$

• The final classifier is obtained by averaging the scores: $\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^*_b(x)$

• The classification rule is carried out via $\hat{c}_{bag}(x) = \mathrm{sign}(\hat{f}_{bag}(x))$


Stochastic Gradient Boosting

• Winner of the Teradata Churn Modeling Tournament (Cardell, Golovnya and Steinberg, Salford Systems).

• Data adaptively resampled

• Previously misclassified observations: weights increased

• Previously well-classified observations: weights decreased


Data

[Diagram: customers observed over time, split into a calibration sample and a hold-out validation sample.]

• Calibration sample: balanced sample, N = 51,306, equal proportion of churners = 50%

• Validation hold-out sample: proportional sample, N = 100,462, real-life proportion of churners = 1.8%

• In both samples: predictors Xi = (x1, …, x46) and response yi = +1 or yi = -1

• Behavioral predictors, e.g. the average monthly minutes of use

• Company interaction variables, e.g. mean unrounded minutes of customer care calls

• Customer demographics, e.g. the number of adults in the household


Research Questions

• Do bagging (and boosting) provide better results than other benchmarks?
– What are the financial gains to be expected from this improvement?
– What are the more relevant churn drivers or triggers that marketers could watch for?

• How to correct estimated scores obtained from a balanced calibration sample, when predicting rare events like churn?


Comparing Error Rates…

Model*                          Validated Error Rate**
Binary Logit Model                     0.400
Bagging (tree-based)                   0.374
Stochastic Gradient Boosting           0.460

* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample


140

Bias due to Balanced Sampling

• Overestimation of the number of churners

• Several bias correction methods exist (see e.g. Cosslett 1993; Donkers et al. 2003; Franses and Paap 2001, pp. 73-75; Imbens and Lancaster 1996; King and Zeng 2001a,b; Scott and Wild 1997).

• However, most are dedicated to traditional models (e.g. logit). We discuss two corrections for bagging and boosting.


141

The Bias Correction Methods

• The weighting correction: Based on marketers’ prior beliefs about the churn rate, i.e. the proportion of churners among their customers, we attach weights to observations of a balanced calibration sample.

• The intercept correction: Take a non-zero cut-off value $\tau_B$ such that the proportion of predicted churners in the calibration sample equals the actual a priori proportion of churners.
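Both corrections can be sketched in Python. This is a sketch only: the function names are hypothetical, and the weight formula (population share divided by sample share, per class) is a common choice assumed here, since the slide does not state which weights were used.

```python
def balanced_sample_weights(labels, prior_churn_rate):
    """Weighting correction (assumed form): weight = population share / sample share."""
    share = sum(1 for y in labels if y == 1) / len(labels)
    w_churn = prior_churn_rate / share
    w_stay = (1.0 - prior_churn_rate) / (1.0 - share)
    return [w_churn if y == 1 else w_stay for y in labels]

def intercept_cutoff(scores, prior_churn_rate):
    """Intercept correction: cut-off tau such that the share of calibration
    scores above tau is (about) the prior churn rate."""
    ranked = sorted(scores, reverse=True)
    n_churners = round(prior_churn_rate * len(ranked))
    if n_churners <= 0:
        return ranked[0]            # no score lies strictly above the maximum
    if n_churners >= len(ranked):
        return ranked[-1] - 1.0     # every score lies above this value
    # Midpoint between the last predicted churner and the first non-churner
    return (ranked[n_churners - 1] + ranked[n_churners]) / 2.0

# Hypothetical balanced calibration sample: 50 churners, 50 non-churners
labels = [1] * 50 + [-1] * 50
weights = balanced_sample_weights(labels, prior_churn_rate=0.018)
weighted_churn_share = (sum(w for w, y in zip(weights, labels) if y == 1)
                        / sum(weights))

# Hypothetical calibration scores, evenly spread between -0.5 and 0.49
scores = [i / 100 - 0.5 for i in range(100)]
tau = intercept_cutoff(scores, prior_churn_rate=0.02)
n_predicted_churners = sum(1 for s in scores if s > tau)
print(weighted_churn_share, tau, n_predicted_churners)
```

With tied scores around the cut-off, the predicted share can only approximate the prior churn rate; the midpoint rule above does not try to break ties.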


142

Bagging

• Let the calibration sample be $Z = \{(x_1, y_1), \dots, (x_i, y_i), \dots, (x_N, y_N)\}$

• B bootstrap samples $Z^*_b$, $b = 1, 2, \dots, B$

• From each $Z^*_b$, a base classifier (e.g. tree) is estimated, giving B score functions: $\hat f^*_1(x), \dots, \hat f^*_b(x), \dots, \hat f^*_B(x)$

• The final classifier is obtained by averaging the scores: $\hat f_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^*_b(x)$

• The classification rule is carried out via $\hat c_{bag}(x) = \mathrm{sign}(\hat f_{bag}(x) - \tau_B)$


143

Assessing the Best Bias Correction…

Validated Error Rates**, by bias correction

Model*                  No correction   Intercept   Weighting
Binary logit model      0.400           0.035       0.018
Bagging (tree-based)    0.374           0.034       0.025
S.G. boosting           0.460           0.034       0.018



144

The Top-Decile Lift

• Focuses on the most critical group of customers regarding their churn risk: ideal segment for targeting a retention marketing campaign

• The top 10% riskiest customers

$\text{Top-decile lift} = \hat\pi_{10\%} / \hat\pi$

– With $\hat\pi_{10\%}$ = the proportion of churners in this risky segment

– And $\hat\pi$ = the proportion of churners in the whole validation set

[Figure: customers ranked by risk to churn, with the top 10% marked]
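As an illustration, the top-decile lift (churner share among the 10% highest-scored customers, divided by the churner share in the whole validation set) can be computed as follows. The function name and the toy scores are hypothetical; labels are assumed to be coded +1 for churners and -1 for non-churners.

```python
def top_decile_lift(scores, labels):
    """Churner share in the 10% riskiest customers over the overall churner share."""
    n = len(scores)
    n_top = max(1, n // 10)
    # Indices of customers ordered from highest to lowest predicted churn score
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = order[:n_top]
    pi_top = sum(1 for i in top if labels[i] == 1) / n_top
    pi_all = sum(1 for y in labels if y == 1) / n
    return pi_top / pi_all

# Toy validation set (hypothetical): 20 customers, 4 churners,
# one of them among the two highest-scored customers
scores = [1.0 - 0.05 * i for i in range(20)]
labels = [1, -1, 1, -1, -1, 1, -1, -1, -1, -1,
          1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
lift = top_decile_lift(scores, labels)
print(lift)
```

A lift of 1 corresponds to random selection; in this toy case the top decile contains 50% churners against 20% overall.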


145

Financial Gains: Neslin et al. (2004)

– N : customer base of the company

– α : percentage of targeted customers (here, 10%)

– ΔTop decile : increase in top-decile lift

– γ : success rate of the incentive among the churners

– LVC : lifetime value of a customer (Gupta, Lehmann and Stuart 2004)

– δ : incentive cost per customer

– ψ : success rate of the incentive among the non-churners.

$\text{Gain} = N \alpha \{\gamma\, LVC + \delta(\psi - \gamma)\}\, \hat\pi\, \Delta\text{Top decile}$

146

Top-Decile Lift with Intercept Correction

[Figure: top-decile lift* (y-axis, 1.6 to 2.6) against the number of iterations (x-axis, 0 to 100) for Bagging, Stochastic Gradient Boosting and the Binary Logit Model; annotation: +26%]

* Model estimated on the balanced sample, and lift computed on the validation sample.


147

Validated** Top-Decile Lift

Model*                          No / Intercept correction   Weighting correction
Binary logit model              1.775                       1.764
Bagging (tree-based)            2.246                       1.549
Stochastic gradient boosting    2.290                       1.632



148

Financial Gains

If we consider

– N : customer base of 5,000,000 customers

– α : 10% of targeted customers

– γ : 30% success rate of the incentive among the churners

– LVC : $2,500 lifetime value of a customer

– δ : $50 incentive cost per customer

– ψ : 50% success rate of the incentive among the non-churners

$\text{Gain} = N \alpha \{\gamma\, LVC + \delta(\psi - \gamma)\}\, \hat\pi\, \Delta\text{Top decile}$


149

Financial Gains

Additional financial gains that we may expect from a retention marketing campaign which would be targeted using the scores predicted by the bagging instead of the logit model:

ΔTop decile : 0.471 (= 2.246 – 1.775)

Gain = + $ 3,214,800

Additional financial gains that we may expect from a retention marketing campaign which would be targeted using the scores predicted by the bagging instead of a random selection:

ΔTop decile : 1.246 (= 2.246 – 1.000)

Gain = + $ 8,550,000
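The arithmetic behind these two figures can be checked with a short sketch. It assumes the gain formula Gain = N · α · {γ · LVC + δ(ψ − γ)} · π̂ · ΔTop decile, and it assumes π̂ = 0.018, the real-life churn proportion reported for the validation sample; the function and variable names are hypothetical.

```python
def financial_gain(n_customers, alpha, gamma, lvc, delta, psi, pi_hat, delta_lift):
    """Gain = N * alpha * {gamma*LVC + delta*(psi - gamma)} * pi_hat * delta_lift."""
    value_per_extra_churner = gamma * lvc + delta * (psi - gamma)
    return n_customers * alpha * value_per_extra_churner * pi_hat * delta_lift

params = dict(
    n_customers=5_000_000,  # N
    alpha=0.10,             # share of customers targeted
    gamma=0.30,             # incentive success rate among churners
    lvc=2_500,              # lifetime value of a customer, in $
    delta=50,               # incentive cost per customer, in $
    psi=0.50,               # incentive success rate among non-churners
    pi_hat=0.018,           # assumption: churn proportion in the validation set
)

# Lift differences rounded to two decimals (0.47 and 1.25)
gain_vs_logit = financial_gain(delta_lift=0.47, **params)
gain_vs_random = financial_gain(delta_lift=1.25, **params)
print(round(gain_vs_logit), round(gain_vs_random))  # 3214800 8550000
```

Under these assumptions the stated gains are reproduced when the lift differences are rounded to 0.47 and 1.25; with the unrounded differences 0.471 and 1.246, the same formula gives about $3,221,640 and $8,522,640.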


150

Most Important Churn Triggers

Bagging

151

Partial Dependence Plots

Bagging

[Figure, two panels: probability to churn (y-axis, 48 to 62) against change in monthly min. of use (x-axis, -1000 to 2000); probability to churn (y-axis, 44 to 56) against equipment days (x-axis, 0 to 1500)]


152

Partial Dependence Plot

[Figure: probability to churn (y-axis ticks 49, 50, 51)]


153

Conclusions: Main Findings

1. Bagging and S.G. boosting are substantially better classifiers than the binary logit choice model

– Improvement of 26% for the top-decile lift,

– Good diagnostic measures offering face validity,

– Interesting insights about potential churn drivers,

– Bagging is conceptually simple and easy to implement.

2. Intercept correction constitutes an appropriate bias correction for bagging when using a balanced sampling scheme.

154

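Appendix: From Profit to Financial Gain

$\text{Profit}_1 = N \alpha \{\gamma\, \hat\pi_1\, LVC - \delta[\gamma\, \hat\pi_1 + \psi(1 - \hat\pi_1)] - c\}$

– $\gamma\, \hat\pi_1\, LVC$ : LVC of a churner who does not churn

– $\delta[\gamma\, \hat\pi_1 + \psi(1 - \hat\pi_1)]$ : incentive cost for the churners retained + non-churners targeted

– $c$ : contact cost

$\text{Gain} = \text{Profit}_1 - \text{Profit}_2 = N \alpha \{\gamma\, LVC + \delta(\psi - \gamma)\}(\hat\pi_1 - \hat\pi_2) = N \alpha \{\gamma\, LVC + \delta(\psi - \gamma)\}\, \hat\pi\, \Delta\text{Top decile}$

with $\text{Top decile}_1 = \hat\pi_1 / \hat\pi$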
