Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Outline of this lecture

Datamining for ICT & Economics

Part 1: Intelligent Decisions in Direct Mailing

Part 2: Brand Choice using Ensemble Methods

Part 3: Ensemble techniques for Choice Problems, especially Churn

Part 1: Intelligent Decisions in Direct Mailing

Rob Potharst, Uzay Kaymak, Wim Pijls

Erasmus University Rotterdam

Faculty of Economics, Dept. of Computer Science

Jedid-Jah Jonker, SCP and Nanda Piersma, HES

Datamining for ICT & Economics, part 1: Direct Mailing

Outline

• Decision problems in direct mailing

• The charity organization case

• Target selection
  – models: logreg, CHAID, neural networks, association rules, fuzzy modelling

• The frequency problem
  – models: MDP, reinforcement learning

(italic: CI methods)

Classical literature

• Optimal mailing policies: Bitran & Mondschein (1996), Mailing Decisions in the Catalog Sales Industry

• On target selection: Bult & Wansbeek (1995), Optimal Selection for Direct Mail

This part of the lecture is based on:

• R. Potharst, U. Kaymak & W. Pijls (2001), Neural Networks for Target Selection in Direct Marketing

• W. Pijls, R. Potharst & U. Kaymak (2001), Pattern-based Target Selection Applied to Fund Raising

• U. Kaymak (2001), Fuzzy Target Selection using RFM variables

• J.J. Jonker, N. Piersma & R. Potharst (2002), Direct Mailing Decisions for a Dutch Fundraiser

http://www.few.eur.nl/few/people/potharst/

Thanks to:

• Jedid-Jah Jonker (Soc.Cult.Planb., Den Haag)
• Uzay Kaymak (Erasmus University, R’dam)
• Nanda Piersma (HES, A’dam)
• Wim Pijls (Erasmus University, R’dam)
• an anonymous charity organization

Decisions in direct mailing

• Target Selection: To which addresses are we going to send the next mailing?

• Frequency: How often are we going to send a mailing to each separate address?

• Inventory Size: How many items of each product should we have in stock?

• etc.

Charity case

• A large Dutch charity organization

• Goal: to stimulate social and scientific research on a frequent disease

• More than 700 000 supporters

• Annual budget larger than 15M euro

• Multiple mailing campaigns a year, asking for donations

Database

• Information about over 700 000 supporters
• About 675 000 considered for mailings
• Supporter’s donation history is traced after first-ever donation (cumulative database)
• Recorded data (about 0.5 GB)
  – mailing dates
  – donation amount
  – donation time
  – administrative data

Target selection

• Problem from (direct) marketing

• Generation of profiles (models) of customers who could be interested in a product

• Models built by analyzing data from similar (previous) campaigns

• Classification problem
  – separate positive cases from negative cases and determine their characteristics

Target selection cycle

[Figure: cycle diagram with the elements customers, conceptualization, product, test campaign, data gathering, model, target selection, purchase.]

Charity donations

• Charity organizations have supporters who donate money for the good cause

• Invite supporters to donate through several mailings per year

• Charity organizations may have different strategies for mailing supporters

• Select those supporters who are likely to donate in a particular mailing

Target selection for supporters

[Figure: cycle diagram with the elements supporters, data gathering (past donation behavior), model, target selection, more donations.]

Target selection models

• Segmentation based, e.g. CHAID
  – divide customer base into disjoint segments
  – select most promising segments
  – segments assumed to be homogeneous

• Scoring based, e.g. logistic regression
  – score each customer in the customer base
  – select customers with highest scores
  – individual approach

Gain chart

[Figure: gain chart. x-axis: fraction selected (0 to 1); y-axis: fraction of responders (0 to 1); curves: ideal, typical, random.]

$G_t(20) = \frac{40}{20} = 2.0$

$G_e(20) = \frac{30}{20} = 1.5$
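As a rough illustration of how a gain chart can be computed, the sketch below sorts supporters by model score and accumulates the fraction of responders captured. The function name `gain_curve` and the toy scores and responses are hypothetical and do not come from the charity data.

```python
import numpy as np

def gain_curve(scores, responded):
    """Return (fraction selected, fraction of responders) after sorting by score."""
    scores = np.asarray(scores, dtype=float)
    responded = np.asarray(responded, dtype=float)
    order = np.argsort(-scores, kind="stable")   # highest score first
    hits = np.cumsum(responded[order])           # responders captured so far
    n = len(scores)
    fraction_selected = np.arange(1, n + 1) / n
    fraction_of_responders = hits / responded.sum()
    return fraction_selected, fraction_of_responders

# Hypothetical toy data: 10 supporters, 4 of whom responded.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
responded = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
x, y = gain_curve(scores, responded)
```

Dividing the cumulative hits by the number of supporters selected so far, instead of by the total number of responders, would give the response fraction used in a hit probability chart.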

Hit probability chart

[Figure: hit probability chart. x-axis: fraction selected (0 to 1); y-axis: response fraction (0.3 to 1); curves: ideal, typical, random.]

Data sources

• External databases: rental list
  – maintained by specialized companies
  – household-specific information
  – demographic information at ZIP code level

• Internal databases: house list
  – maintained by the company itself
  – traces purchase history of customer
  – most reliable and relevant information about the customer

RFM variables

• Recency
  How recent was the last purchase?
  E.g. number of days since last purchase

• Frequency
  How frequent are the purchases?
  E.g. fraction of responded mailings

• Monetary value
  How much has the customer spent?
  E.g. average spending per mailing

Feature selection

• RFM variables
  – often appropriate to capture specifics of customers
  – relatively small number of variables
  – not suitable for identifying new or future prospects

• feature selection (and sometimes reduction) still needed to select most relevant variables

Why neural networks?

• Neural networks can hopefully be used for building good target selection models that can predict likely charity supporters successfully

• Performance might be better than segmentation models like CHAID, and scoring methods like logistic regression

Feature selection

• R1=Number of weeks since last response

• R2=Number of months since first-ever donation

• F1=Fraction of responded mailings

• F2=Response time for last response

• M1=Average donated amount per mailing

• M2=Last donated amount

• M3=Average donation per year

Data preparation

• Data set selection
  – which previous mailing to use for modeling?
  – influence of mailing strategy
  – select most recent full mailings (1998, 1999)

• Data set size
  – about 5000 randomly selected supporters
  – independent training and test sets
  – training set 1998 - 4057 samples
  – test set 1998 - 4080 samples
  – training set 1999 - 4111 samples
  – test set 1999 - 4131 samples

Feedforward neural network

[Figure: network with input layer, hidden layer and output layer; activation labels: logistic, linear.]

• 7 inputs
• 1 hidden layer
• 4 hidden neurons
• 1 output
• normalized inputs and outputs
• initial weights random in (-0.1, 0.1)
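A minimal sketch of the forward pass of such a 7-4-1 network is given below. It assumes that the hidden neurons use the logistic activation and the output neuron is linear, and it adds bias terms; neither detail is confirmed by the slide. All names (`W1`, `b1`, `w2`, `b2`, `forward`) are hypothetical, and training is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

N_INPUTS, N_HIDDEN = 7, 4

# Initial weights drawn uniformly from the interval (-0.1, 0.1), as on the slide.
W1 = rng.uniform(-0.1, 0.1, size=(N_HIDDEN, N_INPUTS))  # input -> hidden
b1 = rng.uniform(-0.1, 0.1, size=N_HIDDEN)              # hidden biases (assumed)
w2 = rng.uniform(-0.1, 0.1, size=N_HIDDEN)              # hidden -> output
b2 = rng.uniform(-0.1, 0.1)                             # output bias (assumed)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Score one supporter described by 7 normalized RFM features."""
    hidden = logistic(W1 @ x + b1)   # assumed: logistic hidden layer
    return float(w2 @ hidden + b2)   # assumed: linear output neuron

x = rng.uniform(0.0, 1.0, size=N_INPUTS)  # hypothetical normalized input
score = forward(x)
```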

Results on 1999 data set

[Figure: gain chart. x-axis: fraction selected (0 to 1); y-axis: fraction of responders (0 to 1); curves: ideal, nn trained on 1998 data, nn trained on 1999 data, random.]

Results on 1999 data set

[Figure: hit probability chart. x-axis: fraction selected (0 to 0.5); y-axis: response fraction (0.4 to 0.8); curves: nn trained on 1998 data, nn trained on 1999 data.]

NN vs. logistic regression

Training set 1998, test set 1999

[Figure: gain chart. x-axis: fraction selected (0 to 1); y-axis: fraction responded (0 to 1); curves: ideal, neural network, logistic regression, random.]

NN vs. logistic regression

Training set 1998, test set 1999

[Figure: hit probability chart. x-axis: fraction selected (0 to 0.5); y-axis: response fraction (0.4 to 0.8); curves: neural network, logistic regression.]

Neural network vs. CHAID

Training set 1998, test set 1998

[Figure: gain chart. x-axis: fraction selected (0 to 1); y-axis: fraction of responders (0 to 1); curves: ideal, neural network, CHAID, random.]

Conclusions

• Neural networks can be used to build target selection models successfully

• They outperform segmentation methods like CHAID, but performance is comparable to statistical regression methods

• There is evidence that a neural network model can be used for target selection in multiple mailing campaigns

Why patterns/association rules?

                   Scoring methods   Segmentation methods
Response rate      +                 +/-
Interpretability   -                 +

Question: Is it possible to have + , + ?

Answer: this study! = pattern-based

Patterns and their support

a  b  c  resp
1  1  3  1
1  1  3  1
1  2  1  0
1  2  1  0
1  2  3  1
1  2  3  0
2  1  3  1
2  1  3  1
3  1  2  0
4  2  1  0
4  2  1  1
4  2  3  0
4  2  3  0

pattern               support   #responses
b = 2, c = 3          4         1
c = 2                 1         0
a = 1, b = 1, c = 3   2         2

Definitions

• a pattern is a set of attribute/value combinations

• a record R is a supporter of a pattern P if all attr/val combinations of P match those of R
  – Example: (3,1,2) is a supporter of ( b = 1, c = 2 )

• the support of a pattern P is the number of supporters of P
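The definitions can be checked with a short sketch that counts supporters and responses for a pattern. The records are those of the "Patterns and their support" slide; representing a pattern as a dictionary and the helper names `matches` and `support_and_responses` are illustrative choices, not part of the original study.

```python
# Records from the "Patterns and their support" slide: (a, b, c, resp).
RECORDS = [
    (1, 1, 3, 1), (1, 1, 3, 1), (1, 2, 1, 0), (1, 2, 1, 0),
    (1, 2, 3, 1), (1, 2, 3, 0), (2, 1, 3, 1), (2, 1, 3, 1),
    (3, 1, 2, 0), (4, 2, 1, 0), (4, 2, 1, 1), (4, 2, 3, 0),
    (4, 2, 3, 0),
]
ATTRS = ("a", "b", "c")

def matches(record, pattern):
    """True if every attribute/value pair of the pattern occurs in the record."""
    values = dict(zip(ATTRS, record))
    return all(values[attr] == val for attr, val in pattern.items())

def support_and_responses(records, pattern):
    """Return (support, number of responses) of a pattern."""
    supporters = [r for r in records if matches(r[:3], pattern)]
    return len(supporters), sum(r[3] for r in supporters)

print(support_and_responses(RECORDS, {"b": 2, "c": 3}))  # (4, 1)
```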

Frequent patterns

• Given a minimum support minsup a pattern P is said to be frequent if

support( P ) ≥ minsup

• The set of frequent patterns can be represented by a trie

• An algorithm for finding frequent itemsets (like Apriori by Agrawal et al.) can also be used to find frequent patterns
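For a small table, the frequent patterns can be found by plain enumeration. The sketch below is a brute-force search over all attribute/value combinations, not Apriori itself and not the trie-based implementation of the study; the function names and the choice minsup = 4 are assumptions for illustration. The records are the attribute values from the "Patterns and their support" slide.

```python
from itertools import combinations, product

ATTRS = ("a", "b", "c")
RECORDS = [
    (1, 1, 3), (1, 1, 3), (1, 2, 1), (1, 2, 1), (1, 2, 3), (1, 2, 3),
    (2, 1, 3), (2, 1, 3), (3, 1, 2), (4, 2, 1), (4, 2, 1), (4, 2, 3),
    (4, 2, 3),
]

def support(records, pattern):
    """Number of records matching every (attribute, value) pair of the pattern."""
    return sum(
        all(rec[ATTRS.index(attr)] == val for attr, val in pattern)
        for rec in records
    )

def frequent_patterns(records, minsup):
    """Brute force: test every attribute/value combination against minsup."""
    domains = [sorted({rec[i] for rec in records}) for i in range(len(ATTRS))]
    result = {}
    for size in range(1, len(ATTRS) + 1):
        for idxs in combinations(range(len(ATTRS)), size):
            for vals in product(*(domains[i] for i in idxs)):
                pattern = tuple((ATTRS[i], v) for i, v in zip(idxs, vals))
                s = support(records, pattern)
                if s >= minsup:
                    result[pattern] = s
    return result

frequent = frequent_patterns(RECORDS, minsup=4)
```

Apriori avoids most of this enumeration by extending only patterns whose sub-patterns are already known to be frequent, which matters for databases of realistic size.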

The trie of frequent patterns

Support and response counts

With response rates

Selecting the target group

a  b  c  mrr (%)
1  1  2  100
3  1  2  80
2  2  2  100
3  1  3  100
1  2  1  50
3  2  3  37.5
4  2  2  25
3  2  1  25
3  1  3  100
4  1  2  80

The first record (1,1,2) matches the following freq.patterns:

( a = 1 ) => resp. rate = 50 %

( b = 1 ) => resp. rate = 80 %

( a = 1, b = 2 ) => resp. rate = 100 % => max (mrr)

Target group:

a  b  c
1  1  2
2  2  2
3  1  3
3  1  3
3  1  2

PatSelect

Input: a set of records

Output: a subset of size n: the target group

1. For all records R in the given set do:

• let F(R) be the set of all frequent patterns that match R

• let mrr( R ) = max { resp.rate ( P ) | P in F(R) }

2. Sort all records according to decreasing mrr

3. Select the topmost n records
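The three steps of PatSelect can be sketched as follows. The pattern response rates and the records in the usage example are hypothetical, and giving a record that matches no frequent pattern an mrr of 0 is an assumption the slide does not state.

```python
ATTRS = ("a", "b", "c")

def matches(record, pattern):
    return all(record[ATTRS.index(attr)] == val for attr, val in pattern)

def pat_select(records, pattern_rates, n):
    """Rank records by maximum response rate (mrr) and keep the top n."""
    def mrr(record):
        rates = [rate for pat, rate in pattern_rates.items()
                 if matches(record, pat)]
        return max(rates, default=0.0)   # assumption: no matching pattern -> 0
    ranked = sorted(records, key=mrr, reverse=True)  # step 2: decreasing mrr
    return ranked[:n]                                # step 3: topmost n

# Hypothetical frequent patterns with their response rates.
PATTERN_RATES = {
    (("a", 1),): 0.50,
    (("b", 1),): 0.80,
    (("a", 3), ("c", 3)): 1.00,
}
records = [(1, 2, 1), (3, 1, 2), (3, 1, 3), (4, 2, 2)]
target = pat_select(records, PATTERN_RATES, 2)
print(target)  # [(3, 1, 3), (3, 1, 2)]
```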

Fund raising application

• Dutch charity organization

• more than 700 000 supporters

• 26 mailing campaigns (dates, targets, responses)

• spread over six years (‘94 - ‘99)

• database of over 400 MB

Research questions

1) How to select a target group with as high a response rate as possible, on the basis of history data

2) How to select a target group with as high a total amount donated as possible, again on the basis of history data

This study: question 1.

RFM features

R1: # weeks since last response

R2: # months since first donation

F1: fraction of mailings supporter has responded to

F2: median response time of supporter

M1: etc.

Model construction

• Choose only full mailing campaigns 98/99

• random split:
  – training set 50 %
  – test set 50 %

• resulting datasets:
  – tr98, tr99
  – test98, test99
  – each somewhat less than 200 000 cases!!

Results '99, trained on '98 data

Results '99, trained on '99 data

Comparison

• Neither a pure scoring, nor a pure segmentation method

• not segments, since patterns can be overlapping!

• many patterns => many different scores => performance comparable with scoring methods

• but also:

Interpretability

high, since each supporter’s presence in the target group can be explained by its inclusion in a pattern with high response rate!!!

Conclusions

• New method based on patterns and association rule algorithms with following characteristics:
  – response rate high
  – interpretability high

• interesting method, especially for large databases

Why fuzzy?

Advantages of fuzzy target selection models in marketing

• prediction power larger than conventional statistical models

• large degree of transparency due to the linguistic rules that can be derived from data

Fuzzy target selection

• FCM clustering in feature product space

• Average response rate per cluster

$$\bar{r}_i = \frac{\sum_{k=1}^{N} u_{ik}\, r_k}{\sum_{k=1}^{N} u_{ik}}, \qquad r_k \in \{0, 1\}$$

• Score per customer

$$s_k = \frac{\sum_{i=1}^{C} u_{ik}\, \bar{r}_i}{\sum_{i=1}^{C} u_{ik}}$$

• Customer segmentation

$$u^{*}_{ik} = \begin{cases} 1, & u_{ik} = \max_{1 \le j \le C} u_{jk} \\ 0, & \text{otherwise} \end{cases}$$

• Rule derivation
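To make the scoring step concrete, the sketch below computes the membership-weighted response rate of each cluster and from it a score per customer, given a fuzzy partition matrix U (C clusters by N customers) and binary responses r. The 2-by-4 matrix and the responses are made-up numbers, and the function names are hypothetical.

```python
import numpy as np

def cluster_response_rates(U, r):
    """Membership-weighted average response per cluster; U is (C, N), r is (N,)."""
    return (U @ r) / U.sum(axis=1)

def customer_scores(U, rates):
    """Membership-weighted average of the cluster response rates per customer."""
    return (rates @ U) / U.sum(axis=0)

def hard_segments(U):
    """Assign each customer to the cluster with the largest membership."""
    return np.argmax(U, axis=0)

# Hypothetical fuzzy partition of 4 customers over 2 clusters.
U = np.array([[0.9, 0.8, 0.2, 0.1],
              [0.1, 0.2, 0.8, 0.9]])
r = np.array([1, 1, 0, 1])            # responded (1) or not (0)

rates = cluster_response_rates(U, r)  # approximately [0.9, 0.6]
scores = customer_scores(U, rates)
segments = hard_segments(U)
```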

Fuzzy clustering

Partition data into overlapping sets based on similarity amongst patterns

Given the data

$$\mathbf{x}_k = [x_{1k}, x_{2k}, \dots, x_{nk}]^T \in \mathbb{R}^n, \quad k = 1, \dots, N$$

Find the fuzzy partition matrix:

$$U = \begin{bmatrix} u_{11} & \cdots & u_{1N} \\ \vdots & \ddots & \vdots \\ u_{C1} & \cdots & u_{CN} \end{bmatrix}$$

and the cluster centres:

$$V = \{\mathbf{v}_1, \dots, \mathbf{v}_C\}, \quad \mathbf{v}_i \in \mathbb{R}^n$$

Fuzzy clustering

Minimize objective function

$$J(X, U, V) = \sum_{i=1}^{C} \sum_{k=1}^{N} u_{ik}^{m}\, d^{2}(\mathbf{x}_k, \mathbf{v}_i)$$

subject to

$$0 \le u_{ik} \le 1, \quad i = 1, \dots, C, \; k = 1, \dots, N \qquad \text{(membership degree)}$$

$$\sum_{i=1}^{C} u_{ik} = 1, \quad k = 1, \dots, N \qquad \text{(total membership)}$$

$$0 < \sum_{k=1}^{N} u_{ik} < N, \quad i = 1, \dots, C \qquad \text{(no cluster empty)}$$

$m \in (1, \infty)$ is the fuzziness parameter
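The objective above is commonly minimized by alternating between updating the cluster centres and updating the memberships. The sketch below shows the standard fuzzy c-means iteration under the assumption that d is the Euclidean distance; it uses a fixed number of iterations instead of a convergence test, and the toy data and function name are hypothetical. It is not the implementation used in the study.

```python
import numpy as np

def fuzzy_c_means(X, C, m=2.0, n_iter=100, seed=0):
    """Standard FCM; X is (N, n). Returns U of shape (C, N) and V of shape (C, n)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((C, N))
    U /= U.sum(axis=0)                       # total membership per point is 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)             # centres
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # squared distances
        d2 = np.maximum(d2, 1e-12)           # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)            # membership update
    return U, V

# Hypothetical toy data: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]])
U, V = fuzzy_c_means(X, C=2)
labels = U.argmax(axis=0)
```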

Feature selection

• R1=Number of weeks since last response
• R2=Number of months since first-ever donation
• F1=Fraction of responded mailings
• F2=Response time for last response (median)
• M1=Average donated amount per mailing
• M2=Last donated amount
• M3=Average donation per year

Feature reduction

• Use logistic regression to build a target selection model

• Use only features whose corresponding weights deviate significantly from zero

• Selected features
  – Number of weeks since last response (TIMELR)
  – Number of months since first-ever donation (TIMECL)
  – Fraction of responded mailings (FRQRES)

Feature reduction

[Figure: hit probability chart on training data. x-axis: fraction selected (0 to 1); y-axis: response fraction (0.3 to 1); curves: 7 variables, 3 variables.]

Fuzzy scoring model

[Figure: hit probability chart on evaluation data, 40 clusters. x-axis: fraction selected (0 to 1); y-axis: response fraction (0.3 to 1); curves: fuzzy clustering, logistic regression.]

Fuzzy segmentation model

[Figure: hit probability chart on evaluation data, 40 segments. x-axis: fraction selected (0 to 1); y-axis: response fraction (0.3 to 1); curves: classification, logistic regression.]

Linguistic rules

[Figure: membership functions for timelr, timecl and frqres. x-axis: normalized domain (-1 to 2); y-axis: membership (0 to 1). Linguistic terms for timelr and timecl: Very short, Short, Long, Very long; for frqres: Rare, Infrequent, Frequent, Often.]

domains are normalized

• If TIMELR is very short and TIMECL is not very short and FRQRES is often then response rate is 0.81

• If TIMELR is short and TIMECL is very long and FRQRES is often then response rate is 0.75

• If TIMELR is very short and TIMECL is very short and FRQRES is often then response rate is 0.65

• If TIMELR is short and TIMECL is short or long and FRQRES is often then response rate is 0.60

Conclusions

• Fuzzy target selection with RFM features
  – Transparent models for target selection with good prediction power
  – Product space fuzzy clustering
  – Accuracy surpasses statistical models
  – Transparency by linguistic rules

• Future research
  – estimation of uncertainty bounds of the model
  – modeling of donation amounts

Frequency problem

• How many mails should I send to this client this year?

• Model as a Markov Decision Process (MDP)

• Theory is based on Markov Chains

• Start with small introduction to MC’s

Markov Chain

[Figure: a system with states 1, 2, 3 that jumps between states at the successive stages t0, t1, t2, t3, …]

• System with m states (here m = 3)

• from one stage to the next, the state jumps from state j to state k with probability p(j,k)

• p(j,k) is called transition probability

Transition matrix

$$P = \begin{pmatrix} p(1,1) & \cdots & p(1,m) \\ \vdots & \ddots & \vdots \\ p(m,1) & \cdots & p(m,m) \end{pmatrix}$$

p(j,k) = Pr{ end up in k after 1 step | start in j }

$$p^{(2)}(j,k) = \Pr\{\text{end up in } k \text{ after 2 steps} \mid \text{start in } j\} = \sum_{i=1}^{m} p(j,i)\, p(i,k)$$

Stationary distribution

• $P P = P^2$ = transition matrix after two steps

• $P^n$ = transition matrix after n steps

• if $n \to \infty$, $P^n \to Q$ with

$$Q = \begin{pmatrix} q_1 & \cdots & q_m \\ q_1 & \cdots & q_m \\ \vdots & & \vdots \\ q_1 & \cdots & q_m \end{pmatrix}$$

• $\{q_1, \dots, q_m\}$ is stationary distribution

• property: $Q Q = Q$

Example

$$P = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \end{pmatrix}, \qquad Q = \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$$
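The example can be verified numerically: multiplying P by itself gives the two-step transition probabilities, and a high power of P comes close to Q. The sketch uses NumPy; the exponent 50 is an arbitrary choice.

```python
import numpy as np

# Transition matrix of the example: from each state, jump to either
# of the other two states with probability 1/2.
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

P2 = P @ P                            # two-step transition probabilities
Pn = np.linalg.matrix_power(P, 50)    # approximates Q for large n

print(P2[0])   # [0.5  0.25 0.25]
print(Pn[0])   # each entry is close to 1/3
```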

Markov Decision Process

[Figure: a system with 3 states, observed at stages t0, t1, t2, t3, …, interacting with an agent that has 2 actions; rewards r0, r1, r2, r3, … are received at the successive stages.]

S = { states }

A = { actions }

S and A finite

Transition, reward matrix, policy

• transition matrix: p(j, k, a) = probability of ending in state k after action a has been performed on state j

• matrix has $m \times m \times |A|$ elements

• $p: S \times S \times A \to [0,1]$

• reward matrix: r(s,a) = the payoff the agent gets by taking action a in state s

• $r: S \times A \to \mathbb{R}$

• policy: $\pi(s)$ = the action that should be taken when the system is in state s

• $\pi: S \to A$

Optimal policy

$$\pi^{*}(s) = \arg\max_{\pi} E(r_0 + \gamma r_1 + \gamma^2 r_2 + \dots)$$

$\gamma$ = discount factor, denotes how heavily future rewards should weigh

$0 < \gamma < 1$

Exactly:

$$r_0 = r(s, \pi(s))$$

$$r_1 = \sum_{i=1}^{m} r(i, \pi(i)) \cdot p(s, i, \pi(s))$$

etc.
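One standard way to compute a policy that maximizes the expected discounted reward is value iteration. The sketch below applies it to a made-up MDP with 2 states and 2 actions; the array layout `p[j, k, a]` and `r[s, a]` follows the notation of the lecture, while the numbers, the tolerance and the function name are assumptions for illustration.

```python
import numpy as np

def value_iteration(p, r, gamma=0.9, tol=1e-10):
    """p[j, k, a]: transition probabilities; r[s, a]: rewards.
    Returns (policy, values) for the discounted problem."""
    v = np.zeros(p.shape[0])
    while True:
        # q[s, a] = r(s, a) + gamma * sum_k p(s, k, a) * v(k)
        q = r + gamma * np.einsum("jka,k->ja", p, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return q.argmax(axis=1), v_new
        v = v_new

# Hypothetical MDP: action 0 stays in the current state, action 1 switches.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = 1.0
p[0, 1, 1] = 1.0
p[1, 1, 0] = 1.0
p[1, 0, 1] = 1.0
r = np.array([[0.0, -1.0],   # state 0: staying pays 0, switching costs 1
              [2.0,  0.0]])  # state 1: staying pays 2, switching pays 0

policy, values = value_iteration(p, r)
print(policy)  # [1 0]: leave state 0, stay in state 1
```

In this toy problem the agent accepts a one-off cost of 1 to reach the state that pays 2 per period, because with a discount factor of 0.9 the future rewards outweigh the cost.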

The frequency problem

• Stages: the planning periods (years)

• Actions: A = {0,1,2,3,4} the number of mailings to be sent

• System: client

• Agent: firm

• States: the RFM-profile of the client

States

• g = gift size, in categories: 0,1,…,5

• m = number of mailings: 0,1,…,4

• r = number of responses: 0,1,…,4

• Example: s = (3,4,1) means the client received 4 mailings, responded to 1, with a fair-sized gift

• Number of states is 55 (not 6 x 5 x 5 since not all combinations are possible)
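The count of 55 can be reproduced by enumeration under two feasibility rules. These rules are assumptions, since the slide does not spell them out: a client cannot respond more often than mailings were received (r ≤ m), and gift category 0 occurs exactly when there was no response.

```python
# Enumerate states s = (g, m, r): gift-size category, mailings, responses.
# Assumed feasibility rules (not stated on the slide):
#   1. r <= m: no more responses than mailings received
#   2. g == 0 exactly when r == 0: no response means no gift
states = [
    (g, m, r)
    for g in range(6)   # gift size categories 0..5
    for m in range(5)   # number of mailings 0..4
    for r in range(5)   # number of responses 0..4
    if r <= m and (g == 0) == (r == 0)
]

print(len(states))  # 55
```

The example state (3,4,1) from the slide satisfies both rules.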

Transition matrix

• Contains 55 x 55 x 5 = 15125 elements

• A transition probability must be zero whenever the number of mailings (m) in the destination state differs from the action (a)

• All other probabilities must be estimated from historical data!

Reward matrix

• r(s,a) = the expected total amount donated in this period by a client in state s, given the number of mailings received (a), minus the cost of those a mailings

• Also to be estimated from historical data

Scenarios + demo

• Optimal policy calculation: linear programming, value iteration, policy iteration

• Scenarios: give extra weight to some states

• Demonstration of the prototype decision support system we developed with the charity organization

No historic data available?

• Transition and reward matrix cannot be estimated!

• Learn the optimal policy by reinforcement learning (without knowing P and R)

• Will be our next project together with Uzay Kaymak and Michiel van Wezel

Part 2: Brand Choice using Ensemble Methods

Rob Potharst, Michiel van Rijthoven and Michiel van Wezel

Erasmus University Rotterdam

Econometric Institute

Datamining for ICT & Economics, part 2: Brand Choice

Outline

• Ensemble methods: bagging, boosting, stacking, etc.
  – What? Why? How?

• Brand choice
  – classical statistical models
  – neural network models
  – ensemble methods

References

On ensemble methods:

• Hastie, Tibshirani & Friedman: The elements of Statistical Learning, Springer Verlag, 2001

On the brand choice application:

• R. Potharst, M. van Rijthoven and M. van Wezel, "Modeling Brand Choice using Boosted and Stacked Neural Networks", In: Kevin E. Voges et al. (Ed.), Business Applications and Computational Intelligence, to be published in 2005 by Idea Group Inc.

• M. van Wezel and R. Potharst, "Brand Choice, Bagging, Boosting, Bias and Variance". Technical Report, sept '04, submitted for publication.

• Vroomen, B., Franses, P.H. & van Nierop, E., (2004). Modeling consideration sets and brand choice using artificial neural networks. European Journal of Operational Research, 154, 206-217.

Ensemble? What does it mean?

• an ensemble is a group, instead of an individual

• two know more than one, and more than two maybe even more

• examples:
  – from real life: voting, committee, weather men
  – from computational intelligence: a set of neural network / decision tree models

How can you combine models?

Depends on your error loss function, but in general:

• for classification problems: voting!

• for regression problems: averaging!
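Both combination rules fit in a few lines. In the sketch below, the function names are hypothetical, and a tie in the vote is resolved in favour of the label that was seen first, which is a choice of this illustration and not something the lecture prescribes.

```python
from collections import Counter
from statistics import mean

def vote(labels):
    """Classification: return the class label predicted by most models."""
    return Counter(labels).most_common(1)[0][0]

def average(predictions):
    """Regression: return the mean of the models' numeric predictions."""
    return mean(predictions)

print(vote(["A", "B", "A"]))     # A
print(average([1.0, 2.0, 3.0]))  # 2.0
```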

Example

Does it really work?

• Hm.. it seems a bit simple…
• How could we check that it possibly works?
• Try it on a simple problem
• Credit scoring dataset from internet (UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html)
• Want a loan? Are you credit worthy: will you pay back yes/no?

Credit scoring example

[Figure: % correct versus size of ensemble.]

What is a (base) classifier?

something that takes an input vector and assigns an (estimated) class label to it:

Page 85: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

85 Datamining for ICT & Economics, part 2: Brand Choice

What kind of classifiers?

We will use classifiers that learn to do their job from a training set of examples (or instances or patterns):

$$T = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$$

How to train such a classifier is the subject of Machine Learning, a field very close to Computational Intelligence.

Page 86: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

86 Datamining for ICT & Economics, part 2: Brand Choice

Classifiers: neural networks

Page 87: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

87 Datamining for ICT & Economics, part 2: Brand Choice

Universal Approximation

• Neural networks can implement arbitrarily shaped boundaries between classes.

• This is called the universal approximation theorem for feedforward neural networks with one hidden layer (among others Barron, 1993)

• By adding hidden nodes (and training examples!) one can get as close to the real boundaries as one wishes.

Page 88: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

88 Datamining for ICT & Economics, part 2: Brand Choice

Page 89: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

89 Datamining for ICT & Economics, part 2: Brand Choice

Decision boundaries of neural network

Page 90: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

90 Datamining for ICT & Economics, part 2: Brand Choice

Decision boundaries of traditional model (multinomial logit)

Page 91: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

91 Datamining for ICT & Economics, part 2: Brand Choice

Classifiers: decision trees

from: Mitchell, Machine learning, 1997

Page 92: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

92 Datamining for ICT & Economics, part 2: Brand Choice

Decision tree: example

Page 93: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

93 Datamining for ICT & Economics, part 2: Brand Choice

Problem 1: overfitting

Page 94: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

94 Datamining for ICT & Economics, part 2: Brand Choice

Problem 2: Instability

• That is: the model is very dependent on the specific training set you have; take one out and…

Page 95: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

95 Datamining for ICT & Economics, part 2: Brand Choice

Instability of a model

• Let us study the instability of a model in the setting of a regression problem:

• 5 training sets of 200 examples each
• 5 linear models (blue)
• 5 neural networks (green)
• dotted lines: the average models!

[Formula: the data-generating process, with y a function of x involving a sine term, plus noise]

Page 96: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

96 Datamining for ICT & Economics, part 2: Brand Choice

Page 97: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

97 Datamining for ICT & Economics, part 2: Brand Choice

Prediction error

• If we use a squared error loss function, the prediction error is:

$$PE(f_T) = E_{T,\mathbf{x},y}\big(f_T(\mathbf{x}) - y\big)^2$$

• Define the average model of a classifier as

$$f_A(\mathbf{x}) = E_T\, f_T(\mathbf{x})$$

• And let f* be the “real” underlying model ( = E(y|x))

Page 98: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

98 Datamining for ICT & Economics, part 2: Brand Choice

Bias-variance decomposition

• Then we can derive the following formula:

$$PE(f_T) = \sigma^2_{noise} + E_{\mathbf{x}}\big(f^*(\mathbf{x}) - f_A(\mathbf{x})\big)^2 + E_{T,\mathbf{x}}\big(f_T(\mathbf{x}) - f_A(\mathbf{x})\big)^2$$

• Or: prediction error = irreducible error + bias² + variance

• So: if we try to approximate the average model, we get variance 0. This is the idea behind ensemble methods!
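The decomposition can be checked exactly on a toy case. The sketch below works at a single input point x and assumes finitely many, equally likely training sets and noise values with mean zero, so that every expectation is a plain average. The function name `decomposition` and the numbers are illustrative only.

```python
def decomposition(model_preds, f_star, noise_values):
    """Return (PE, noise variance, bias^2, variance) at one input point.

    model_preds  : f_T(x) for each equally likely training set T
    f_star       : the value f*(x) = E(y|x) of the underlying model
    noise_values : equally likely noise values, assumed to have mean zero
    """
    n_t, n_e = len(model_preds), len(noise_values)
    f_avg = sum(model_preds) / n_t                      # average model f_A(x)
    noise_var = sum(e * e for e in noise_values) / n_e  # irreducible error
    bias_sq = (f_star - f_avg) ** 2
    variance = sum((f - f_avg) ** 2 for f in model_preds) / n_t
    # Prediction error: average squared error over all (T, noise) pairs
    pe = sum((f - (f_star + e)) ** 2
             for f in model_preds for e in noise_values) / (n_t * n_e)
    return pe, noise_var, bias_sq, variance


pe, noise_var, bias_sq, variance = decomposition([1.0, 2.0, 3.0], 2.5, [-0.5, 0.5])
print(pe, noise_var + bias_sq + variance)  # the two numbers agree
```

Replacing the three models by their average (so that `model_preds` holds the same value three times) leaves the noise and bias terms unchanged and sets the variance term to zero.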

Page 99: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

99 Datamining for ICT & Economics, part 2: Brand Choice

Bagging

• abbreviation for bootstrap aggregating
• Breiman (1994)
• create "bootstrapped" datasets by randomly drawing from the original dataset, with replacement!
• the bootstrapped datasets have the same size as the original dataset
• build a model for each bootstrapped dataset
• combine these models by averaging or voting
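The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the base learner `fit_mean` is a deliberately trivial stand-in (it predicts the mean target of its sample), and all function names are hypothetical. Any learner that maps a sample to a prediction function could be passed as `fit`.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) examples from data, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def bag(data, fit, n_models, seed=0):
    """Fit one model per bootstrapped dataset and return the list of models."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagged_predict(models, x):
    """Combine the models by averaging (regression case)."""
    return sum(m(x) for m in models) / len(models)


def fit_mean(sample):
    """Toy base learner: always predict the mean target of its sample."""
    mean_y = sum(y for _, y in sample) / len(sample)
    return lambda x: mean_y


data = [(i, float(i)) for i in range(10)]   # (x, y) pairs
models = bag(data, fit_mean, n_models=25)
print(bagged_predict(models, 3))
```

For a classification problem the last step would use a majority vote over the class labels instead of the average.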

Page 100: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

100 Datamining for ICT & Economics, part 2: Brand Choice

Page 101: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

101 Datamining for ICT & Economics, part 2: Brand Choice

Boosting

• Adaboost (Freund & Schapire, 1996)
• Each example in the training set has a weight attached to it; initial weights w1 = 1/N
• Generate a sequence of models:
• build model M1 on the training set with weights w1
• examples misclassified by M1 get a higher weight
• etc.

Page 102: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

102 Datamining for ICT & Economics, part 2: Brand Choice

Page 103: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

103 Datamining for ICT & Economics, part 2: Brand Choice

Adaboost (= adaptive boosting) for 2 classes: -1, +1

1. Initialize the boosting weights: $w_i = 1/N$ for i = 1,…, N

2. For m = 1 to M perform each of the following:

a) Train model $F_m(x)$ on T with weights $w_i$

b) Compute $err_m = \sum_{i=1}^{N} w_i \, I\big(y_i \neq F_m(x_i)\big)$

Page 104: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

104 Datamining for ICT & Economics, part 2: Brand Choice

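c) Compute $\alpha_m = \log\left(\dfrac{1 - err_m}{err_m}\right)$

d) Redefine the weights: $w_i \leftarrow w_i \exp\big(\alpha_m \, I(y_i \neq F_m(x_i))\big)$

e) Normalize the weights: $w_i \leftarrow w_i \Big/ \sum_{k=1}^{N} w_k$

3. Output the final combined model: $O(x) = \mathrm{sgn}\left(\sum_{m=1}^{M} \alpha_m F_m(x)\right)$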
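A compact sketch of the Adaboost steps for two classes follows. It assumes one-dimensional inputs and uses weighted decision stumps as base models; the slides do not prescribe a base learner, so that choice, the function names `fit_stump` and `adaboost`, and the small guard that keeps the error away from 0 and 1 are all assumptions of this example.

```python
import math


def fit_stump(xs, ys, w):
    """Weighted decision stump: predict `pol` if x < thr, else -pol."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for thr in thresholds:
        for pol in (+1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (pol if x < thr else -pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    _, thr, pol = best
    return lambda x: pol if x < thr else -pol


def adaboost(xs, ys, n_rounds):
    """Return the combined classifier O(x) after n_rounds of boosting."""
    n = len(xs)
    w = [1.0 / n] * n                                   # step 1
    models, alphas = [], []
    for _ in range(n_rounds):                           # step 2
        f = fit_stump(xs, ys, w)                        # a) train with weights
        miss = [f(x) != y for x, y in zip(xs, ys)]
        err = sum(wi for wi, m in zip(w, miss) if m)    # b) weighted error
        err = min(max(err, 1e-10), 1 - 1e-10)           # guard (not on the slide)
        alpha = math.log((1 - err) / err)               # c)
        w = [wi * math.exp(alpha) if m else wi for wi, m in zip(w, miss)]  # d)
        total = sum(w)
        w = [wi / total for wi in w]                    # e) normalize
        models.append(f)
        alphas.append(alpha)

    def combined(x):                                    # step 3
        score = sum(a * f(x) for a, f in zip(alphas, models))
        return 1 if score >= 0 else -1

    return combined


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(xs, ys, n_rounds=3)
print([model(x) for x in xs])
```

No single stump can separate this toy dataset, but the weighted vote of three stumps classifies every training example correctly.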

Page 105: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

105 Datamining for ICT & Economics, part 2: Brand Choice

Stacking

• two levels of learning:
– first level: train several models on the training set
– second level: again train the combination of these models

• not a fixed voting scheme for combining the models, but: learn an optimal combination method from the data

Page 106: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

106 Datamining for ICT & Economics, part 2: Brand Choice

Brand Choice

• classical topic in marketing
• a product has k brands
• consumer/household wants to buy product
• which brand does he pick?
• given:
– household characteristics (income, etc.)
– product factors (price, etc.)
– situational factors (product on display, etc.)

Page 107: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

107 Datamining for ICT & Economics, part 2: Brand Choice

Modeling brand choice

• classical statistical models (multinomial logit, conditional logit, etc): linear

• neural network models: nonlinear

• a model by Vroomen et al., 2004: neural networks with built-in so-called consideration sets

Page 108: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

108 Datamining for ICT & Economics, part 2: Brand Choice

the Vroomen model

• as many hidden nodes as there are brands
• three types of variables:
– X: household characteristics, e.g. size, income
– Z: brand characteristics, e.g. price level, promotion, advertising
– W: choice-specific characteristics, e.g. observed price at purchase occasion

Page 109: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

109 Datamining for ICT & Economics, part 2: Brand Choice

the Vroomen model

Page 110: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

110 Datamining for ICT & Economics, part 2: Brand Choice

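the Vroomen model

$$FC_j = \frac{\exp\left(\gamma_{0j} + \sum_{k=1}^{J} \gamma_{kj}\, CS_k + \sum_{q=1}^{Q} \delta_{qj}\, W_{qj}\right)}{\sum_{m=1}^{J} \exp\left(\gamma_{0m} + \sum_{k=1}^{J} \gamma_{km}\, CS_k + \sum_{q=1}^{Q} \delta_{qm}\, W_{qm}\right)}$$

$$CS_j = F\left(\alpha_{0j} + \sum_{i=1}^{I} \alpha_{ij}\, X_i + \sum_{p=1}^{P} \beta_{pj}\, Z_{pj}\right)$$

$$F(x) = \frac{1}{1 + e^{-x}} \quad \text{(logistic function)}$$

Here $CS_j$ is the output of the hidden node for brand j, $FC_j$ is the output node for brand j, and $\alpha$, $\beta$, $\gamma$, $\delta$ denote the network weights.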

Page 111: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

111 Datamining for ICT & Economics, part 2: Brand Choice

Dataset

• scanner data: 3055 purchases of liquid detergent of 6 brands (part of ERIM database)

• 400 households
• 4 X variables (volume, non-det, size, time)
• 4 Z variables (price, feature, display, recency)
• 1 W variable (again price)

Page 112: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

112 Datamining for ICT & Economics, part 2: Brand Choice

Experiments

1. split 400 households randomly into three groups (tr 200, va 100, te 100)
2. use backprop for Vroomen, test on va + te
3. use 25 iterations of the boosting algorithm, test the combined model on va + te
4. use stacking on the 25 models, use va to find the coefficients, test on te

Repeat this whole cycle 10 times!

Page 113: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

113 Datamining for ICT & Economics, part 2: Brand Choice

Results

3 to 4 % gain in predictive performance

Page 114: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

114 Datamining for ICT & Economics, part 2: Brand Choice

Conclusions

• By using ensemble methods we can increase the predictive performance

• on the other hand: because we get combined models they are harder to interpret

• future work: interpreting the combined model!

Page 115: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Part 3
On the Use of Ensemble Techniques for Modeling Choice Problems in Marketing, especially Churn

by Aurélie Lemmens, Rob Potharst, Michiel van Wezel (Erasmus University Rotterdam) and Christophe Croux (Catholic University Leuven)

Page 116: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

116

Characteristics of Ensemble Techniques

• Developed in statistics / datamining / machine learning communities

• Not yet applied to marketing problems (as far as we know)
• High potential for choice problems such as brand choice and churn
• Successfully applied to other fields like fraud detection, text categorization, chemometrics
• Especially successful with respect to predictive power, which can be directly translated into money
• Easy to apply

Page 117: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

117

How do Ensemble methods work?

1. Develop a number of so-called base models for a problem

Could be any model: dt, nn, logit, …

2. Combine these base models into a final choice model

Combination can be done with: voting, weighted voting, …

Page 118: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

118

Existing Ensemble methods

• Bagging (= Bootstrap Aggregating) Breiman, 1996

• Boosting:
– Adaboost, Freund & Schapire, 1996
– Stochastic gradient boosting, Friedman, 2002

• Stacking Wolpert, 1992

Page 119: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

119

Based on 4 recent papers

[1] “Bagging and Boosting Classification Trees to Predict Churn”, to appear in Journal of Marketing Research, 2006 (by Lemmens and Croux)

[2] “Bagging a Stacked Classifier”, appeared in 2005 (by Croux, Joossens and Lemmens)

[3] “Modeling Brand Choice using Boosted and Stacked Neural Networks”, appeared in 2006 (by Potharst, van Rijthoven and van Wezel)

[4] “Improved Customer Choice Predictions using Ensemble Methods”, submitted to European J Oper Res (by van Wezel and Potharst)

Page 120: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

120

Ensemble techniques used

paper bagging boosting stacking

[1] X X

[2] X X

[3] X X

[4] X X

Page 121: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

121

Base learners used

paper DT NN LDA LR

[1] X

[2] X X X X

[3] X

[4] X

Page 122: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

122

Marketing problems considered

paper churn brand choice

[1] X

[2]

[3] X

[4] X

Page 123: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

123

Data sets

paper Company / sector

[1] US wireless telecom company

[2] 12 benchmark datasets from machine learning

[3] Scanner data for six brands of liquid detergent

[4] Scanner data for ketchup / peanut butter brands

Page 124: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

124

Based on 4 recent papers

In depth: [1] “Bagging and Boosting Classification Trees to Predict Churn”, to appear in Journal of Marketing Research, 2006 (by Lemmens and Croux)

Page 125: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

125

The Context

– The 2002 Churn Tournament organised by the Teradata Center for CRM at Duke University

– Churn means defecting from a company, i.e. a customer taking his business elsewhere

– Customer database from an anonymous U.S. wireless telecom company

– Challenge: predicting churn in order to elaborate targeted retention strategies (Bolton et al. 2000, Ganesh et al. 2000, Shaffer and Zhang 2002)

– Details can be found in Neslin et al. (2004)

Page 126: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

126

The Context (cont’d)

– The US Wireless Telecom market (2004)

• 182.1 million subscribers
• Leader in market share: Cingular Wireless
– 26.9% of total market volume
– turnover US$19.4 billion / net income US$201 million
• Other major players: AT&T, Verizon, Sprint and Nextel
• Mergers & Acquisitions: Cingular with AT&T Wireless & Sprint with Nextel

Page 127: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

127

The Context (cont’d)

– Churn

• High churn rates: 2.6% a month
• Causes: increased competition, lack of differentiation, market saturation
• Cost: $300 to $700 to replace a lost customer in terms of sales support, marketing, advertising, etc.
• Targeted retention strategies

Page 128: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

128

Formulation of the Churn Problem

• Churn as a classification issue:

Classify a customer i characterized by K variables $x_i = (x_{i1}, x_{i2}, \ldots, x_{iK})$ as

– Churner: $y_i = +1$
– Non-churner: $y_i = -1$

• Churn is the binary response variable to predict: $y_i = f(x_i)$

• Choice of the binary choice model f(·)?

Page 129: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

129

Classification Models in Marketing

• Simple binary logit choice model (e.g. Andrews et al. 2002)

• Models allowing for the heterogeneity in consumers’ response:
– Finite mixture model (e.g. Wedel and Kamakura 2000)
– Hierarchical Bayes model (e.g. Yang and Allenby 2003)

• Non-parametric choice models:
– Decision trees, neural nets (e.g. Thieme et al. 2000; West et al. 1997)
– Bagging (Breiman 1996), Boosting (Freund and Schapire 1996), Stochastic gradient boosting (Friedman 2002)

Page 130: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

130

Classification Models in Marketing (cont’d)

• Bagging (Breiman 1996), Boosting (Freund and Schapire 1996), Stochastic gradient boosting (Friedman 2002): mostly ignored in the marketing literature

• S.G.B. won the Tournament (Cardell, from Salford Systems)

Page 131: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

131

Decision Trees for Churn

[Figure: example decision tree for churn, with splits on change in consumption (< 0.5 / ≥ 0.5), customer care calls (< 3 / ≥ 3), age (< 26 / between 26 and 55 / ≥ 55) and handset price (< $150 / ≥ $150), and leaves labelled Yes / No]

Page 132: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

132

Bagging and Boosting

• Machine Learning Algorithms

• Principle: classifier aggregation (Breiman, 1996)

• Tree-based method (e.g. Currim et al. 1988)

• Bagging: Bootstrap AGGregatING

Page 133: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

133

Calibration sample $Z = \{(x_i, y_i)\}$, i = 1, …, N

Random sample $Z_1^*$ → base classifier (e.g. tree) $\hat{f}_1^*(x)$

Random sample $Z_2^*$ → base classifier (e.g. tree) $\hat{f}_2^*(x)$

Page 134: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

134

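Aggregating bootstrap samples

$\hat{f}_1^*(x),\ \hat{f}_2^*(x),\ \hat{f}_3^*(x),\ \ldots,\ \hat{f}_B^*(x)$

Churn propensity score:

$$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(x)$$

Churn classification:

$$\hat{c}_{bag}(x) = \mathrm{sign}\big(\hat{f}_{bag}(x)\big)$$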

Page 135: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

135

Bagging

• Let the calibration sample be $Z = \{(x_1,y_1), \ldots, (x_i,y_i), \ldots, (x_N,y_N)\}$

• B bootstrap samples $Z_b^*$, b = 1, 2, …, B

• From each $Z_b^*$, a base classifier (e.g. tree) is estimated, giving B score functions: $\hat{f}_1^*(x), \ldots, \hat{f}_b^*(x), \ldots, \hat{f}_B^*(x)$

• The final classifier is obtained by averaging the scores: $\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(x)$

• The classification rule is carried out via $\hat{c}_{bag}(x) = \mathrm{sign}\big(\hat{f}_{bag}(x)\big)$

Page 136: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

136

Stochastic Gradient Boosting

• Winner of the Teradata Churn Modeling Tournament (Cardell, Golovnya and Steinberg, Salford Systems).

• Data adaptively resampled:
– Previously misclassified observations: weights increased
– Previously well-classified observations: weights decreased

Page 137: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

137

Data

• Calibration sample: balanced sample, N = 51,306, equal proportion of churners = 50% ($y_i = +1$ versus $y_i = -1$)

• Validation hold-out sample: proportional sample, N = 100,462, real-life proportion of churners = 1.8%

• For each customer: $X_i = (x_1, \ldots, x_{46})$ and $y_i$

• Predictors:
– Behavioral predictors, e.g. the average monthly minutes of use
– Company interaction variables, e.g. mean unrounded minutes of customer care calls
– Customer demographics, e.g. the number of adults in the household

Page 138: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

138

Research Questions

• Do bagging (and boosting) provide better results than other benchmarks?
– What are the financial gains to be expected from this improvement?
– What are the more relevant churn drivers or triggers that marketers could watch for?

• How to correct estimated scores obtained from a balanced calibration sample, when predicting rare events like churn?

Page 139: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

139

Comparing Error Rates…

Model*                          Validated Error Rate**
Binary Logit Model              0.400
Bagging (tree-based)            0.374
Stochastic Gradient Boosting    0.460

* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample

Page 140: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

140

Bias due to Balanced Sampling

• Overestimation of the number of churners

• Several bias correction methods exist (see e.g. Cosslett 1993; Donkers et al. 2003; Franses and Paap 2001, p.73-75; Imbens and Lancaster 1996; King and Zeng 2001a,b; Scott and Wild 1997).

• However, most are dedicated to traditional models (e.g. logit). We discuss two corrections for bagging and boosting.

Page 141: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

141

The Bias Correction Methods

• The weighting correction:
Based on marketers’ prior beliefs about the churn rate, i.e. the proportion of churners among their customers, we attach weights to observations of a balanced calibration sample.

• The intercept correction:
Take a non-zero cut-off value $\tau_B$ such that the proportion of predicted churners in the calibration sample equals the actual a priori proportion of churners.
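The intercept correction can be illustrated with a short sketch. It assumes that a higher score means a higher churn risk and that a customer is classified as a churner when the score lies strictly above the cut-off; the function names `intercept_cutoff` and `classify` are hypothetical, and with tied scores the predicted proportion only approximates the prior churn rate.

```python
def intercept_cutoff(scores, churn_rate):
    """Return a cut-off tau such that about churn_rate of the scores exceed it."""
    ranked = sorted(scores, reverse=True)
    n_churn = round(churn_rate * len(ranked))   # number of predicted churners
    if n_churn >= len(ranked):
        return float("-inf")                    # everybody is a predicted churner
    return ranked[n_churn]                      # first score outside the top group


def classify(score, tau):
    """+1 = predicted churner, -1 = predicted non-churner."""
    return 1 if score > tau else -1


# Hypothetical bagged scores on a calibration sample, prior churn rate of 20%
scores = [0.9, 0.5, 0.1, -0.2, -0.4, -0.6, -0.7, -0.8, -0.9, 0.3]
tau = intercept_cutoff(scores, churn_rate=0.2)
print(tau, sum(classify(s, tau) == 1 for s in scores))
```

With a cut-off of zero, four of these ten customers would be labelled churners; the corrected cut-off labels two, in line with the assumed 20% rate.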

Page 142: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

142

Bagging

• Let the calibration sample be $Z = \{(x_1,y_1), \ldots, (x_i,y_i), \ldots, (x_N,y_N)\}$

• B bootstrap samples $Z_b^*$, b = 1, 2, …, B

• From each $Z_b^*$, a base classifier (e.g. tree) is estimated, giving B score functions: $\hat{f}_1^*(x), \ldots, \hat{f}_b^*(x), \ldots, \hat{f}_B^*(x)$

• The final classifier is obtained by averaging the scores: $\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(x)$

• The classification rule is carried out via $\hat{c}_{bag}(x) = \mathrm{sign}\big(\hat{f}_{bag}(x) - \tau_B\big)$

Page 143: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

143

Assessing the Best Bias Correction…

Validated Error Rates**, by bias correction

Model*                    No correction    Intercept    Weighting
Binary logit model        0.400            0.035        0.018
Bagging (tree-based)      0.374            0.034        0.025
S.G. boosting             0.460            0.034        0.018

* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample

Page 144: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

144

The Top-Decile Lift

• Focuses on the most critical group of customers regarding their churn risk: ideal segment for targeting a retention marketing campaign

• The top 10% riskiest customers:

$$\text{Top-decile lift} = \frac{\hat{\pi}_{10\%}}{\hat{\pi}}$$

– with $\hat{\pi}_{10\%}$ = the proportion of churners in this risky segment
– and $\hat{\pi}$ = the proportion of churners in the whole validation set

[Figure: customers ranked by risk to churn, with the top 10% marked]
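As a worked illustration, the top-decile lift can be computed from predicted scores and observed labels as sketched below. The function name `top_decile_lift` is hypothetical, labels are assumed to be +1 for churners and -1 for non-churners, and the top decile is taken as the 10% of customers with the highest scores (rounded down, with at least one customer).

```python
def top_decile_lift(scores, labels):
    """Churn rate among the 10% highest-scored customers over the overall rate."""
    n = len(scores)
    k = max(1, n // 10)                                  # size of the top decile
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    rate_top = sum(1 for _, y in ranked[:k] if y == 1) / k
    rate_all = sum(1 for y in labels if y == 1) / n
    return rate_top / rate_all


# 20 hypothetical customers, 4 churners, 2 of them ranked on top
scores = list(range(20, 0, -1))
labels = [1, 1] + [-1] * 16 + [1, 1]
print(top_decile_lift(scores, labels))
```

A lift of 1 corresponds to targeting customers at random; the higher the lift, the more churners are concentrated in the targeted segment.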

Page 145: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

145

Financial Gains: Neslin et al. (2004)

– N : customer base of the company

– α : percentage of targeted customers (here, 10%)

– ΔTop decile : increase in top-decile lift

– γ : success rate of the incentive among the churners

– LVC : lifetime value of a customer (Gupta, Lehmann and Stuart 2004)

– δ : incentive cost per customer

– ψ : success rate of the incentive among the non-churners.

$$\text{Gain} = N\,\alpha\,\{\gamma\, LVC + \delta\,(\psi - \gamma)\}\;\hat{\pi}\;\Delta\text{Top decile}$$

Page 146: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

146

Top-Decile Lift with Intercept Correction

[Figure: top-decile lift* (y-axis, 1.6 to 2.6) against the number of iterations (x-axis, 0 to 100) for Bagging, Stochastic Gradient Boosting and the Binary Logit Model; annotation: +26%]

* Model estimated on the balanced sample, and lift computed on the validation sample.

Page 147: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

147

Validated** Top-Decile Lift

Model*                          No / Intercept correction    Weighting correction
Binary logit model              1.775                        1.764
Bagging (tree-based)            2.246                        1.549
Stochastic gradient boosting    2.290                        1.632

* Model estimated on the balanced calibration sample
** Top-decile lift computed on the hold-out proportional validation sample

Page 148: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

148

Financial Gains

If we consider

– N : customer base of 5,000,000 customers

– α : 10% of targeted customers

– γ : 30% success rate of the incentive among the churners

– LVC : $2,500 lifetime value of a customer

– δ : $50 incentive cost per customer

– ψ : 50% success rate of the incentive among the non-churners

$$\text{Gain} = N\,\alpha\,\{\gamma\, LVC + \delta\,(\psi - \gamma)\}\;\hat{\pi}\;\Delta\text{Top decile}$$

Page 149: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

149

Financial Gains

Additional financial gains that we may expect from a retention marketing campaign which would be targeted using the scores predicted by the bagging instead of the logit model:

ΔTop decile : 0.471 (= 2.246 – 1.775)

Gain = + $ 3,214,800

Additional financial gains that we may expect from a retention marketing campaign which would be targeted using the scores predicted by the bagging instead of a random selection:

ΔTop decile : 1.246 (= 2.246 – 1.000)

Gain = + $ 8,550,000
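The two gains can be reproduced numerically. The sketch below assumes the gain formula Gain = N α {γ LVC + δ (ψ − γ)} π̂ ΔTop decile, as reconstructed from the appendix, with π̂ = 1.8% taken from the proportional validation sample. The function name `financial_gain` is hypothetical. Under these assumptions the reported dollar amounts are matched when the lift differences are rounded to 0.47 and 1.25; whether the original computation rounded in this way is not stated on the slides.

```python
def financial_gain(n, alpha, delta_lift, pi_hat, gamma, lvc, delta, psi):
    """Gain = N * alpha * {gamma*LVC + delta*(psi - gamma)} * pi_hat * delta_lift."""
    per_churner = gamma * lvc + delta * (psi - gamma)
    return n * alpha * per_churner * pi_hat * delta_lift


# Parameter values from the Financial Gains slide; pi_hat is an assumption
params = dict(n=5_000_000, alpha=0.10, pi_hat=0.018,
              gamma=0.30, lvc=2500, delta=50, psi=0.50)

# Bagging instead of the logit model (lift difference 0.471, rounded to 0.47)
print(round(financial_gain(delta_lift=0.47, **params)))
# Bagging instead of a random selection (lift difference 1.246, rounded to 1.25)
print(round(financial_gain(delta_lift=1.25, **params)))
```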

Page 150: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

150

Most Important Churn Triggers

Bagging

Page 151: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

151

Partial Dependence Plots

[Figure, bagging: probability to churn (y-axis, 48 to 62) against change in monthly minutes of use (x-axis, -1000 to 2000)]

[Figure, bagging: probability to churn (y-axis, 44 to 56) against equipment days (x-axis, 0 to 1500)]

Page 152: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

152

Partial Dependence Plot

[Figure: partial dependence plot; y-axis: probability to churn (49 to 51)]

Page 153: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

Datamining for ICT & Economics, part 3: Churn

153

Conclusions: Main Findings

1. Bagging and S.G. boosting are substantially better classifiers than the binary logit choice model
– Improvement of 26% for the top-decile lift,
– Good diagnostic measures offering face validity,
– Interesting insights about potential churn drivers,
– Bagging is conceptually simple and easy to implement.

2. Intercept correction constitutes an appropriate bias correction for bagging when using a balanced sampling scheme.

Page 154: Datamining: Techniques and Applications in Economics Rob Potharst Econometric Institute

154

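Appendix: From Profit to Financial Gain

$$\text{Profit}_1 = N\,\alpha\,\{\gamma\,\hat{\pi}_1\, LVC - \delta\,(\gamma\,\hat{\pi}_1 + \psi\,(1 - \hat{\pi}_1)) - c\}$$

– $\gamma\,\hat{\pi}_1\, LVC$: LVC of a churner who does not churn
– $\delta\,(\gamma\,\hat{\pi}_1 + \psi\,(1 - \hat{\pi}_1))$: incentive cost for the churners retained + non-churners targeted
– $c$: contact cost

$$\text{Gain} = \text{Profit}_1 - \text{Profit}_2 = N\,\alpha\,\{\gamma\, LVC + \delta\,(\psi - \gamma)\}\,(\hat{\pi}_1 - \hat{\pi}_2) = N\,\alpha\,\{\gamma\, LVC + \delta\,(\psi - \gamma)\}\;\hat{\pi}\;\Delta_{1-2}\text{Top decile}$$

with $\text{Top decile}_1 = \hat{\pi}_1 / \hat{\pi}$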