
An Intelligent System for Customer Targeting: A Data Mining Approach ?

YongSeog Kim a, W. Nick Street b

a Business Information Systems, Utah State University, Logan, UT 84322, USA
b Management Sciences, University of Iowa, Iowa City, IA 52242, USA

Abstract

We propose a data mining approach for market managers that uses artificial neural networks (ANNs) guided by genetic algorithms (GAs). Our predictive model allows the selection of an optimal target point where expected profit from direct mailing is maximized. Our approach also produces models that are easier to interpret by using a smaller number of predictive features. Through sensitivity analysis, we also show that our chosen model significantly outperforms the baseline algorithms in terms of hit rate and expected net profit at key target points.

Key words: Customer targeting; data mining; feature selection; genetic algorithms; neural networks; ensemble.

1 Introduction

The ultimate goal of decision support systems is to provide managers with information that is useful for understanding various managerial aspects of a problem and for choosing the best solution among many alternatives. In this paper, we focus on a very specific decision support system for market managers who want to develop and implement efficient marketing programs by fully utilizing a customer database. This is important because, due to the growing interest in micro-marketing, many firms devote considerable resources to identifying households that may be open to targeted marketing messages.

? Corresponding author: YongSeog Kim, Business Information Systems, Utah State University, Logan, UT 84322. Tel: 1-435-797-2271, Fax: 1-435-797-2351.

Email addresses: [email protected] (YongSeog Kim), [email protected] (W. Nick Street).

Preprint submitted to Elsevier Science 8 February 2007

This becomes even more critical with the easy availability of data warehouses combining demographic, psychographic, and behavioral information.

Both the marketing [1–3] and data mining communities [4–7] have presented various database-based approaches for direct marketing. A good review of how data mining can be integrated into knowledge-based marketing can be found in [8]. Traditionally, the optimal selection of mailing targets has been considered one of the most important factors for direct marketing to be successful. Thus, many models aim to identify as many customers as possible who will respond to a specific solicitation campaign letter, based on each customer's estimated probability of responding to the marketing program.

This problem becomes more complicated when the interpretability of the model is important. For example, in database marketing applications, it is critical for managers to understand the key drivers of consumer response. A predictive model that is essentially a "black box" is not useful for developing comprehensive marketing strategies. At the same time, a rule-based system that consists of too many if-then statements can make it difficult for users to identify the key drivers. Note that these two principal goals, model interpretability and predictive accuracy, can be in conflict.

Another important but often neglected aspect of such models is the decision support function that helps market managers make strategic marketing plans. For example, market managers want to know how many customers should be targeted to maximize the expected net profit, or to increase market share while at least recovering the operational costs of a specific campaign. In order to attain this goal, market managers need a sensitivity analysis that shows how the value of the objective function (e.g., the expected net profit from the campaign) changes as campaign parameters vary (e.g., the campaign scope, measured by the number of customers targeted).

In this paper, we propose a data mining approach to building predictive models that satisfies these requirements efficiently and effectively. First, we show how to build predictive models that combine artificial neural networks (ANNs) [9] with genetic algorithms (GAs) [10] to help market managers identify prospective households. ANNs have been used in other marketing applications such as customer clustering [11,12] and market segmentation [13,14]. We use ANNs to identify optimal campaign targets based on each individual's likelihood of responding positively to the campaign message. This can be done by learning linear or possibly non-linear relationships between the given input variables and the response indicator. We go one step further than this traditional approach: because we are also interested in isolating key determinants of customer response, we select different subsets of variables using GAs and use only those selected variables to train different ANNs.


GAs have become a very powerful tool in finance, economics, accounting, operations research, and other fields as an alternative to hill-climbing search algorithms. This is mainly because such heuristic algorithms might lead to a local optimum, while GAs are more likely to avoid local optima by evaluating multiple solutions simultaneously and adjusting their search bias toward more promising areas. Further, GAs have been shown to have superior performance to other search algorithms on data sets with high dimensionality [15].

Second, we demonstrate through a sensitivity analysis that our approach can be used to determine the scope of a marketing campaign given the marginal revenue per customer and the marginal cost per campaign mailing. This can be a very useful tool for market managers who want to assess the impact of various factors, such as mailing cost and a limited campaign budget, on the outcomes of a marketing campaign.

Finally, we enhance the interpretability of our model by reducing the dimensionality of the data sets. Traditionally, feature extraction algorithms including principal component analysis (PCA) have often been used for this purpose. However, PCA is not appropriate when the ultimate goal is not only to reduce the dimensionality, but also to obtain highly accurate predictive models. This is because PCA does not take into account the relationship between the dependent variable and the other input variables in the process of data reduction. Further, the resulting principal components from PCA can be difficult to interpret when the space of input variables is huge.

Data reduction is performed via feature selection in our approach. Feature selection is defined as the process of choosing a subset of the original predictive variables by eliminating features that are either redundant or possess little predictive information. If we extract as much information as possible from a given data set while using the smallest number of features, we can not only save a great amount of computing time and cost, but also build a model that generalizes better to households not in the test mailing. Feature selection can also significantly improve the comprehensibility of the resulting classifier models. Even a complicated model, such as a neural network, can be more easily understood if constructed from only a few variables.

Our methodology exploits the desirable characteristics of GAs and ANNs to achieve the two principal goals of household targeting at a specific target point: model interpretability and predictive accuracy. A standard GA is used to search through the possible combinations of features. The input features selected by the GA are used to train an ANN. The trained ANN is tested on an evaluation set, and the proposed model is evaluated in terms of two quality measurements: cumulative hit rate (which is maximized) and complexity (which is minimized). We define the cumulative hit rate as the ratio of the number of actual customers identified to the total number of actual customers in a data set. This process is repeated many times as the algorithm searches for a desirable balance between predictive accuracy and model complexity. The result is a highly accurate predictive model that uses only a subset of the original features, thus simplifying the model and reducing the risk of overfitting. It also provides useful information for reducing future data collection costs.

In order to help market managers determine the campaign scope, we run the GA/ANN model repeatedly over different target points to obtain local solutions. A local solution is a predictive feature subset with the highest fitness value at a specific target point. At a target point i, where 0 ≤ i ≤ 100, our GA/ANN model searches for a model that is optimal when the best i% of customers in a new data set are targeted, based on their estimated probability of responding to the marketing campaign. Once we obtain the local solutions, we combine them into an Ensemble, a global solution that is used to choose the best target point. Note that our Ensemble model is different from popular ensemble algorithms such as Bagging [16] and Boosting [17], which combine the predictions of multiple models by voting. Each local solution in our Ensemble model scores and selects prospects at a specific target point independently of the other local solutions. Finally, in order to present the performance of the local solutions and the Ensemble, we use a lift curve that shows the relationship between target points and the corresponding cumulative hit rates.

This paper is organized as follows. In Section 2, we explain GAs for feature selection in detail and motivate the use of a GA to search for the global optimum. In Section 3, we describe the structure of the GA/ANN model and review the feature subset selection procedure. In Section 4, we present experimental results of both the GA/ANN model and a single ANN with the complete set of features. In particular, a global solution is constructed by incorporating the local solutions obtained over various target points. We show that such a model can be used to help market managers determine the best target point where the expected profit is maximized. In Section 5, we review related work on direct marketing from both the marketing and data mining communities. Section 6 concludes the paper and provides suggestions for future research directions.

2 Genetic Algorithms for Feature Selection

A genetic algorithm (GA) is a parallel search procedure that simulates the evolutionary process by applying genetic operators. We provide a simple introduction to a standard GA in this section. More extensive discussions of GAs can be found in [10].

Since [10], various types of GAs have been used for many different applications. However, many variants still share common characteristics. Typically, a GA starts with and maintains a population of chromosomes that correspond to solutions to the problem. A chromosome can be represented by a number of different schemes, including but not limited to strings of bits or vectors of real numbers, with fixed or variable length. A chromosome is typically represented by a fixed-size string and consists of a number of genes, whose values are called alleles. In our approach to feature selection, the representation of an agent consists of D bits, where D is the number of input features, with each bit indicating whether the corresponding feature is selected or not (1 if a feature is selected, 0 otherwise).
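To make this representation concrete, the following is a minimal sketch in Python (the paper itself specifies no implementation language) of a bit-string chromosome and a random initial population; all names are illustrative only.

    import random

    D = 93  # number of input features in our application (see Section 4)

    def random_chromosome(d=D):
        # 1 = feature selected, 0 = feature dropped
        return [random.randint(0, 1) for _ in range(d)]

    # population of 100 candidate feature subsets (our population size)
    population = [random_chromosome() for _ in range(100)]

    # indices of the features selected by the first chromosome
    selected_features = [i for i, bit in enumerate(population[0]) if bit == 1]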

Since a GA typically cannot search the whole space of chromosomes, it gradually limits its focus to more highly fit regions of the search space. One or more fitness values must be assigned to each string in order to guide the GA toward more promising regions. By default, standard GAs are designed for maximization problems with only one objective. Each fitness value can be determined by a fitness function that we want to optimize. In our work, the fitness value for a string is determined by a neural network. Using information from households with an observed response, the ANN is able to learn the typical buying patterns of customers in the dataset. The trained ANN is tested on an evaluation set, and the proposed model is evaluated both on the hit rate (which is maximized) and the complexity (the number of features, which is minimized) of the solution. Finally, these two quality measurements are combined into one and used to evaluate the quality of each string.

Note that combining multiple objectives into one is not recommended in general. However, the main goal of this paper is to propose an intelligent system for customer targeting that can support strategic marketing plans. In order to show the feasibility of such a model, we keep our model as simple as possible by using a standard GA that can consider only one objective. Further, combining multiple objectives is one of the most common practices [18], and our framework can be easily modified to consider multiple objectives separately. For readers who are interested in the customer targeting problem using GAs that consider multiple objectives, we refer to previous work [6,19].

Once fitness values are assigned to each chromosome in the current population, the GA proceeds to the next generation through three genetic operators: reproduction, crossover, and mutation. We describe these three operators as follows (a code sketch follows the list):

Reproduction: This operator determines which strings in the current population survive into the next generation. A fundamental rule is that a string with a higher fitness value has a higher chance of surviving into the next generation. Roulette wheel selection is one frequently used method, in which each string survives in proportion to the ratio of its fitness to the overall population fitness. In our work, we use another well-known method, ranking-based reproduction. In this scheme, the strings are sorted by their fitness value, and a fixed number of the best elements from the current population are copied into the next generation.

Crossover: The crossover (or recombination) operator combines a part of one string with a part of another string and is controlled by a crossover probability Pr(crossover). Typically, it takes two strings called the parents as inputs and returns one or two new ones, the offspring. In this way we hope to combine the good parts of one string with the good parts of another, yielding an even better string after the operation. Many different kinds of crossover operators have been proposed, including single-point, two-point, and uniform crossover [20]. Our crossover operator follows the commonality-based crossover framework, which assumes that features commonly selected by both parents are more likely to lead to offspring with improved performance [21]. In our implementation, it takes two agents, a parent and a random mate, and scans through every bit of the two agents. If it locates a differing bit, it flips a coin to determine the offspring's bit.

Mutation: This operator assigns a new value to a randomly chosen gene and is controlled by a mutation probability Pr(mutation). By introducing new genetic characteristics into the pool of chromosomes, it prevents the gene depletion that might lead a GA to converge to a local optimum. The mutation operator in this paper always randomly selects one bit of each string and flips it.
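As an illustration of the three operators, the sketch below implements ranking-based reproduction, commonality-based crossover, and one-bit mutation for the bit-string representation above. It is a sketch under the stated assumptions, not the authors' code; the helper names are hypothetical.

    import random

    def reproduce(population, fitnesses, n_survivors):
        # Ranking-based reproduction: keep the n_survivors fittest strings.
        ranked = sorted(zip(fitnesses, population), key=lambda p: -p[0])
        return [chrom for _, chrom in ranked[:n_survivors]]

    def crossover(parent, mate):
        # Commonality-based crossover: bits the parents share are inherited;
        # each differing bit is decided by a coin flip.
        return [p if p == m else random.randint(0, 1)
                for p, m in zip(parent, mate)]

    def mutate(chrom):
        # Flip exactly one randomly chosen bit, as described above.
        i = random.randrange(len(chrom))
        chrom[i] = 1 - chrom[i]
        return chrom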

3 GA/ANN Model for Customer Targeting

Our predictive model of household buying behavior is a hybrid of the GA and ANN procedures. In our approach, the GA identifies relevant consumer descriptors that are used by the ANN to forecast consumer choice at a given target point. Our final solution, Ensemble, consists of multiple local solutions, each of which is an optimized solution at a specific target point. In this section, we present the structure of our GA/ANN model and the evaluation criteria used to select an appropriate predictive model.

3.1 Evaluation Metrics

We define two heuristic evaluation metrics, F_complexity and F_accuracy, to evaluate the selected feature subsets. These objectives are combined with equal weight to return one fitness value for each candidate solution.

F_complexity: This objective is aimed at finding parsimonious solutions by minimizing the number of selected features as follows:

    F_complexity = 1 - (d - 1) / (D - 1),    (1)

where d and D represent the dimensionality of the selected feature set and of the full feature set, respectively. Note that at least one feature must be used. Other things being equal, we expect that lower complexity will lead to easier interpretability of solutions as well as better generalization.

F_accuracy: The purpose of this objective is to favor feature sets with higher power to discriminate buyers from non-buyers. In our application, F_accuracy is the same as the cumulative hit rate, the ratio of the number of actual customers identified, AC, to the total number of actual customers, TAC. Note that AC depends on how many customers are targeted. The cumulative hit rate can be written in mathematical form as follows:

    F_accuracy = AC / TAC.    (2)

Often the hit rate, the ratio of the number of actual customers identified to the number of customers targeted, has been used as an alternative measurement. However, it is important that the values of the two evaluation metrics lie in the same range, between 0 and 1. Since the cumulative hit rate always satisfies this requirement once we target at least the smallest proportion that contains all actual buyers (in the data used in this paper, about 6% of customers are actual buyers), we prefer the cumulative hit rate to the hit rate.

    for each target point i, where i = 10, 20, ..., 90
        run GA/ANN, optimizing the top i% of the training set
        select best_i based on the fitness value
        Ensemble(i) = best_i
    endfor

Fig. 1. Pseudo code for Ensemble construction
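To make the combined objective concrete, the following sketch evaluates the equal-weight fitness of a candidate feature subset. It is illustrative only: the scoring function predict_proba is a hypothetical stand-in for the trained ANN, and the 10% default target point is just an example.

    import numpy as np

    def fitness(selected, X, y, predict_proba, target_pct=0.10):
        # selected: NumPy 0/1 array over features; y: 0/1 buyer indicator.
        d, D = int(selected.sum()), selected.size
        f_complexity = 1.0 - (d - 1) / (D - 1)        # Eq. (1); d >= 1
        probs = predict_proba(X[:, selected == 1])    # score with selected features only
        n_target = int(target_pct * len(y))           # size of the targeted group
        top = np.argsort(-probs)[:n_target]           # top i% by estimated probability
        f_accuracy = y[top].sum() / y.sum()           # Eq. (2): AC / TAC
        return 0.5 * f_complexity + 0.5 * f_accuracy  # equal-weight combination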

3.2 Algorithm outline

We show an abstract view of our algorithm in Figure 1. As a first step, our model searches for a set of local solutions, each optimized at a specific target point. A local solution best_i is an optimal solution when the firm targets the best i% of customers based on their estimated probability of responding to a solicitation letter. In this paper, we consider nine target points (10%, 20%, ..., 90%), but more refined target points can be analyzed without the need to modify the structure of our algorithm. The GA/ANN component of our algorithm is used for finding local solutions and is discussed in detail in Section 3.3. Once all the local solutions are found, we combine them into the final model, Ensemble. Each local solution is a neural network model built using the feature subset specific to a target point, and our Ensemble is a collection of such models.

However, in order to estimate the performance of Ensemble on the evaluation data, we cannot simply combine the best estimates of the local solutions over different target points. Rather, the estimate of Ensemble at target point i (i.e., when we target the best i% of customers) is the estimated performance of the local solution best_i optimized at target point i. Ideally, Ensemble should return the highest hit rate over all the target points compared to all the local solutions.

Fig. 2. The structure of the GA/ANN model. The GA searches for a good subset of features and passes it to an ANN. The ANN calculates the "goodness" of each subset and returns two evaluation metrics to the GA.

3.3 Structure of GA/ANN Model

The structure of the GA/ANN model for finding local solutions is shown in Figure 2. First, the GA searches the exponential space of feature subsets and passes one subset of features to an ANN. The ANN extracts predictive information from the subset and learns the patterns. Once the ANN has learned the data patterns, it is evaluated on a data set not used for training and returns two evaluation metrics, F_accuracy and F_complexity, to the GA. The two evaluation metrics are then combined with equal weight into one fitness value. It is important to note that in both the learning and evaluation procedures, the ANN uses only the selected features.

Based on the fitness value, the GA biases its search direction to maximize the combined objective. This routine continues for a fixed number of generations. Among all the models evolved over the generations, we select the best model in terms of the fitness value. Once we choose the best model, we train the ANN using all the training points with the selected features only. The trained model is then used to rank the potential customers (the records in the evaluation set) in descending order by the probability of buying RV insurance (see Section 4), as predicted by the ANN. We finally select the top i% of the prospects in the evaluation set and evaluate the model's accuracy.

In order to provide a reliable estimate for all the candidates, we estimate their fitness with a rigorous estimation procedure, k-fold cross validation. In this procedure, the training data is divided into k non-overlapping groups. We train an ANN using the first k − 1 groups of the training data and test the trained ANN on the k-th group. We repeat this procedure until each of the groups has been used as a test set once. We then take the average of the performance measurements over the k folds. In our experiment, we set k = 2. This is a reasonable compromise considering the computational complexity of systems like ours. Further, an estimate from 2-fold cross validation is likely to be more reliable than an estimate from the common practice of using a single holdout set.
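A minimal sketch of this k-fold fitness estimate (k = 2 in our experiments) might look as follows; train_ann and evaluate are hypothetical stand-ins for the ANN training and the cumulative-hit-rate evaluation described above.

    import numpy as np

    def cv_fitness(X, y, train_ann, evaluate, k=2, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))              # shuffle the training records once
        folds = np.array_split(idx, k)             # k non-overlapping groups
        scores = []
        for i in range(k):
            test = folds[i]                        # the i-th group as test set
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            model = train_ann(X[train], y[train])  # fit on the remaining k-1 groups
            scores.append(evaluate(model, X[test], y[test]))
        return float(np.mean(scores))              # average over the k folds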

4 Application

The new GA/ANN methodology is applied to the prediction of households interested in purchasing an insurance policy for recreational vehicles. To benchmark the new procedure, we contrast the predictive performance of Ensemble with that of a single ANN using the complete set of features. We do not compare our approach to a standard logit regression model, because a logit regression model is a special case of a single ANN with one hidden node.

4.1 Data Description

The data are taken from a solicitation of 9,822 European households to buy insurance for a recreational vehicle. These data, taken from the CoIL 2000 forecasting competition [22], provide an opportunity to assess the properties of the GA/ANN procedure in a customer prospecting application. 1 In our analysis, we use two separate datasets: a training set with 5822 households and an evaluation set with 4000 households. The training data is used to calibrate the model and to estimate the hit rate expected in the evaluation set. Of the 5822 prospects in the training dataset, 348 purchased RV insurance, resulting in a hit rate of 348/5822 = 5.97%. From the manager's perspective, this is the hit rate that would be obtained if solicitations were sent out randomly to consumers in the firm's database.

1 We use a dataset on consumer responses to a solicitation for "caravan" insurance policies. A caravan is similar to a recreational vehicle in the United States. For more information about the CoIL competition and the CoIL datasets, refer to the Web site http://www.dcs.napier.ac.uk/coil/challenge/.


The evaluation data is used to validate the predictive models. Our Ensemble predictive model is designed to return the top i% of customers in the evaluation dataset judged to be most likely to buy RV insurance. The model's predictive accuracy is examined by computing the observed hit rate among the selected households. It is important to understand that only information in the training dataset is used in developing the model. Data in the evaluation dataset is used exclusively for forecasting.

In addition to the observed RV insurance policy choices, each household's record contains 93 additional variables, with information on both socio-demographic characteristics (variables 1-51) and ownership of various types of insurance policies (variables 52-93). Originally, each data set had 85 attributes. We omitted the first feature (customer subtype), mainly because representing it as a 41-bit variable would expand the search space dramatically with little information gain. Further, we can still exploit the information on customer type by recoding the fifth feature (customer main type) as a 10-bit variable, which is coded into ten binary variables (variables 4-13). Details are provided in Table 1.

Feature ID   Feature Description
1            Number of houses owned by residents
2            Average size of households
3            Average age of residents
4-13         Psychographic segment: successful hedonists, driven growers, average family, career loners, living well, cruising seniors, retired and religious, family with grown-ups, conservative families, or farmers
14-17        Proportion of residents with Catholic, Protestant, other, and no religion
18-21        Proportion of residents married, living together, in another relation, and single
22-23        Proportion of households without children and with children
24-26        Proportion of residents with high, medium, and lower education level
27           Proportion of residents in high status
28-32        Proportion of residents who are entrepreneurs, farmers, middle management, skilled laborers, and unskilled laborers
33-37        Proportion of residents in social class A, B1, B2, C, and D
38-39        Proportion of residents who rent their home and own their home
40-42        Proportion of residents who have 1, 2, and no cars
43-44        Proportion of residents with national and private health service
45-50        Proportion of residents whose income level is <$30,000, $30,000-$45,000, $45,000-$75,000, $75,000-$123,000, >$123,000, and average
51           Proportion of residents in purchasing power class
52-72        Scaled contribution to various types of insurance policies such as private third party, third party firms, third party agriculture, car, van, motorcycle/scooter, truck, trailer, tractor, agricultural M/C, moped, life, private accident, family accidents, disability, fire, surfboard, boat, bicycle, property, and social security
73-93        Scaled number of households holding insurance policies for the same categories as in the scaled contribution attributes

Table 1
Household background characteristics

The socio-demographic data are based upon postal code information. That is, all customers living in areas with the same postal code have the same socio-demographic attributes. The insurance firm in this study scales most socio-demographic variables on a 10-point ordinal scale (indicating the relative likelihood that the socio-demographic trait is found in a particular postal code area). This 10-point ordinal scaling includes the variables denoted as "proportions" in Table 1. For the purposes of this study, all these variables were regarded as continuous. The psychographic segment assignments (variables 4-13), however, are household-specific and are binary variables.

In our subsequent discussion, the word feature refers to one of the 93 variables listed in Table 1. For example, the binary variable that determines whether or not a household falls into the "successful hedonist" segment is a single feature. Accordingly, in the feature selection step of the GA/ANN model, the algorithm can choose any possible subset of the 93 variables in developing the predictive model.
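As an aside, the recoding of the 10-category customer main type into ten binary variables (4-13) can be sketched as follows; the assumption that the raw attribute is a 1-based integer code is ours, made for illustration.

    SEGMENTS = ["successful hedonists", "driven growers", "average family",
                "career loners", "living well", "cruising seniors",
                "retired and religious", "family with grown-ups",
                "conservative families", "farmers"]

    def one_hot_main_type(code: int) -> list[int]:
        # Map a 1-based main-type code to the ten binary variables 4-13.
        return [1 if i == code - 1 else 0 for i in range(len(SEGMENTS))]

    # e.g., a "successful hedonist" household:
    assert one_hot_main_type(1) == [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]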

4.2 Experimental results

In our experiment, we first select local solutions over different target points and construct lift curves for each of them. A lift curve shows the percentage of all buyers in the database identified in the group selected for a direct mail solicitation at the given target point. We also construct the lift curve of Ensemble and compare it to those of the local models. Finally, we show how our Ensemble solution can be used to select the best target point where the expected profit is maximized under two different campaign scenarios.

We set the values of the GA parameters as follows: Pr(mutation) = 1.0, Pr(crossover) = 0.8, population = 100, and iterations = 200. We use a three-layer ANN and train it with the standard backpropagation algorithm. We set the number of epochs to 10 and heuristically determine the number of hidden nodes using the formula min(3, √node_in), where node_in represents the number of input nodes.
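For concreteness, the sizing heuristic can be sketched as below; rounding the square root to the nearest integer (and keeping at least one hidden node) is our assumption, since the paper does not state how fractional values are handled.

    import math

    def n_hidden(n_inputs: int) -> int:
        # min(3, sqrt(node_in)) hidden nodes, rounded, with a floor of one.
        return max(1, min(3, round(math.sqrt(n_inputs))))

    # e.g., 6 selected features -> min(3, 2.45) -> 2 hidden nodes
    print(n_hidden(6))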

4.2.1 Analysis of lift curves

We first show the lift curves of the local solutions (shown as dotted lines) and of Ensemble (shown as the thick solid line) in Figure 3. The lift curve of a local solution best_i is constructed as follows. We estimate the probability of buying new insurance for each prospect in the evaluation data with trained ANNs using the subset of features chosen by best_i. After sorting prospects in descending order of the estimated probability, we compute the values of F_accuracy over various target points j, where j = 10, 20, ..., 90%. We define the cumulative hit rate of a model as the value of F_accuracy at a given market point. Recall that best_i is an optimal solution at target point i. The lift curve shows the relationship between a set of market points and their corresponding cumulative hit rates.

Fig. 3. Lift curves of the local and Ensemble solutions (x-axis: proportion of chosen records; y-axis: cumulative hit rate). The lift curve of each local solution is shown as a thin dotted line. The lift curve of Ensemble is constructed by combining the lift curves of the local solutions at various target points and is shown as a thick solid line.
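A sketch of this lift-curve computation, assuming probs holds the ANN's predicted response probabilities and y is a 0/1 buyer indicator for the evaluation records:

    import numpy as np

    def lift_curve(probs, y, target_points=(10, 20, 30, 40, 50, 60, 70, 80, 90)):
        # Cumulative hit rate (Eq. 2) at each target point.
        order = np.argsort(-probs)            # prospects, best first
        y_sorted = np.asarray(y)[order]
        total_buyers = y_sorted.sum()         # TAC
        curve = {}
        for i in target_points:
            n = int(len(y_sorted) * i / 100)  # top i% of records
            curve[i] = y_sorted[:n].sum() / total_buyers   # AC / TAC
        return curve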

The lift curve of Ensemble is constructed by combining the cumulative hit rates of best_i at target points i = 10, 20, ..., 90. 2 Under random sampling, the lift curve is a 45-degree line starting at the origin of the graph. In an ideal case, the lift curve of Ensemble should be the convex hull of the lift curves of the local solutions, because it is a combination of locally optimal solutions. However, it is possible for the GA to get stuck in a local optimum, or for the chosen solution to generalize poorly. Among the nine target points considered in our model, Ensemble shows the best performance at five target points (i = 10, 20, 30, 50, 80) and second best at one target point (i = 70). Compared to random sampling, our Ensemble returns a 3.1 (2.5) times higher cumulative hit rate when we target the best 10% (20%) of customers. For comparison purposes, we implemented an ANN with the complete set of input variables.

2 This is similar to [23,24], in which a convex hull of multiple receiver operating characteristic (ROC) curves is taken as a final model to provide a more robust and accurate model.

Fig. 4. Hit rates of the GA/ANN and single ANN models on the evaluation data (x-axis: proportion of chosen records; y-axis: hit rate (%)). The differences are statistically significant (α = 0.05) at six target points (i = 10, 20, 40, 50, 60, 80).

We show the hit rates of Ensemble and of the single ANN model on the evaluation data in Figure 4. Our Ensemble shows much better performance at target points i = 10, 20 but loses its advantage over the middle target points i = 30, ..., 60. However, our model regains its superior performance at the higher target points i = 70, ..., 90. We partially attribute the lower performance of our model over the middle target points to oversearching [25,26]: a model like ours can find fluke rules that fit the training data well but have low predictive accuracy on new data by overusing the estimate of model accuracy. A recent discussion of oversearching caused by multiple comparison procedures can be found in [27].

Though it is worthwhile to determine why the performance of Ensemble changes over different target points, we leave this issue to future research in order to remain consistent with the main purpose of this paper: proposing an intelligent recommendation system for customer targeting. Further, the single ANN requires all the input variables and provides more of a black-box solution that market managers cannot utilize for making strategic marketing plans.

Local solution     Feature index
best10             26, 39, 52, 55, 67, 90
best20             54, 55, 67, 69, 90
best30             40, 50, 55, 69, 88
best40, best90     55, 88
best50             29, 39, 55, 80, 90, others
best60             8, 30, 55
best70             29, 45, 55
best80             8, 45, 55

Table 2
Chosen features by local solutions

We also show the indices of the features chosen by the local solutions in Table 2. Each local solution clearly highlights different predictive features at the given target point. We partially attribute this finding to the strong correlation among the insurance-related features. However, note that one feature (55, scaled contribution to car policy) is chosen by all local solutions. This makes considerable sense, given the fact that the firm is soliciting households to buy insurance for recreational vehicles.

Other insurance-related features (scaled contribution to and ownership of boat (69, 90) and fire policies (67, 88)) are also chosen by multiple solutions. Demographic variables such as income level (45, 50), home ownership (39), and occupation (farmer, 29) are also selected by a few solutions. In general, the results are in line with marketing science work on customer segmentation, which shows that information about current purchase behavior is most predictive of future choices [28]. We refer readers who are interested in building a potentially valuable profile of likely customers to our previous work [22].


In terms of data reduction, all local solutions choose at most six features, except one that chooses 26 features at target point i = 50. On average, local solutions choose six features, which reduces the data dimensionality by (93 − 6)/93 ≈ 93.5%. This implies that the firm could reduce data collection and storage costs considerably, through reduced storage requirements and reduced labor and data transmission costs.

4.2.2 Sensitivity analysis

In this section, we focus on the decision support functionality of the Ensemble solution. In particular, we want to show how our Ensemble solution can be used to select the best target point where the expected profit is maximized. We consider two different marketing strategies: one targeting customers with a nice-looking brochure at a higher cost per mailing ($7), and the other using a plain letter at a lower cost per mailing ($0.71). 3 For both cases, we make two common assumptions: the marginal revenue per buyer is $70, and the firm has a list of one million prospects to target. We show our experimental results in Figure 5.

Fig. 5. The selection of the best target point and the estimation of the expected profit of a market campaign using Ensemble on the evaluation data (left panels: model selection based on the expected net profit ($) over the proportion of chosen records; right panels: a comparison among the Random, ANN, and GA/ANN campaigns ($1000)). The two figures in the top panel show results when the cost per mailing is $0.71; the two figures in the bottom panel show results when the cost per mailing is $7.

In order to choose the best target point, we compute the expected profits over different target points using the estimated hit rates of Ensemble on the training data. When the cost per mailing is inexpensive ($0.71), our Ensemble returns the maximum expected profit when the best 80% of customers in the evaluation set are targeted. This makes sense because it is wise to target as many customers as possible up to a certain point. Once we determine the target point i, we can compute the expected profits of the three different models when we target the top i% of one million prospects. Though all models return positive profits, the Ensemble solution returns the maximum profit.

When the cost per mailing is expensive ($7), the best target point is i = 10 in terms of maximized profit. This time, random targeting returns a negative profit when we target the best 10% of customers, because of the increased mailing cost. The Ensemble solution returns the highest expected profit, followed by the single ANN solution. Note that a market manager can instead select the target point i = 40 when her main interest is not in maximizing the expected profit but in increasing market share while recovering at least the campaign costs. This is particularly useful when marginal revenue per customer and/or mailing cost information is not known, as in our case.

3 Note that it is not our intention to relate brochure quality to campaign outcomes.
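As a worked illustration of this sensitivity analysis, the sketch below computes the expected net profit at a given target point under the paper's stated assumptions ($70 marginal revenue per buyer, one million prospects, roughly 6% buyers); the cumulative hit rate of 0.31 at the 10% point is taken from the 3.1-times-random figure in Section 4.2.1, and the function name is ours.

    def expected_profit(target_pct, cum_hit_rate, cost_per_mail,
                        n_prospects=1_000_000, buyer_rate=0.06, revenue=70.0):
        mailed = n_prospects * target_pct                  # size of the targeted group
        buyers = n_prospects * buyer_rate * cum_hit_rate   # buyers captured at this depth
        return revenue * buyers - cost_per_mail * mailed

    # Targeting the best 10% with a cumulative hit rate of 0.31 (3.1x random):
    print(expected_profit(0.10, 0.31, cost_per_mail=7.0))    # brochure: ~$602,000
    print(expected_profit(0.10, 0.31, cost_per_mail=0.71))   # plain letter: ~$1,231,000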

5 Related Work

Various multivariate statistical and analytical techniques from the marketing community have been applied to the database marketing problem. Routine mailings to existing customers are typically based upon the RFM (recency, frequency, monetary) approach, which targets households using knowledge of the customer's purchase history [29]. However, the RFM model has major disadvantages, including its applicability to current customers only and redundancy due to the inter-dependency among the RFM variables. Mailings to households with no prior relationship with the firm are based upon analysis of the relationship between demographics and the response to a test mailing of a representative household sample. A brief summary of other traditional models such as Chi-square Automatic Interaction Detection (CHAID), multiple regression methods, discriminant analysis, and gains chart analysis can be found in [30,1].

A more elaborate model for identifying optimal targets was presented by Bult and Wansbeek [1]. Their profit maximization (PM) approach is mainly based on the gains chart model [31] in a single-period campaign. In their model, only those prospects whose expected return is higher than the marginal cost of the mailing are selected for direct mailing. From a comparative experiment on a real data set, they reported improved performance compared to CHAID and the gains chart model. Bitran and Mondschein [32] presented a stochastic model to determine the optimal mailing frequency, assuming that multiple mailings would result in only one response.

Gonul and Shi [2] present a Markov decision model for the optimal mailing policy over an infinite horizon. In their model, two different objectives, from customers (maximizing utility) and from direct mailers (maximizing profit), are optimized simultaneously. Their model utilized the estimated impact of recency and frequency on the customer's response probability, and showed significantly improved performance over a single-period model. However, their recency- and frequency-based model cannot be applied to certain types of commodities (e.g., seasonal goods). Further, their model can be computationally expensive once monetary value is added to the state definition.


Piersma and Jonker [3] studied the problem of optimizing the frequency of direct mailing over a long-term but finite campaign horizon. Their model is simpler than [2] because of the finite campaign horizon considered, but more general than [32] in the sense that they allow multiple responses to multiple mailings. Through comparative experiments on a data set from a Dutch charitable organization, they showed that their model significantly increased customer profitability and reduced wasteful mailings. Other models for direct marketing include a split-hazard model for estimating a physician's propensity to use a new drug in [33], and latent trait and latent class models for cross-selling of financial services in [34] and [35].

Researchers from the machine learning and data mining communities have developed models without making prior assumptions about the data distribution. Problems of effectively profiling users have been studied in [36–38]. Fawcett and Provost [36] presented a model for detecting fraudulent usage of cellular calls. Chan and Stolfo [37] presented a cost-sensitive model for detecting fraudulent usage of credit cards. Noting that the original class distribution can be different from the desired distribution for optimal training, they divided a given data set into subsets with the appropriate class distribution through preliminary experiments. They obtained their final model by combining multiple classifiers trained on different subsets and reported some success. In [38], a predictive profiling model was presented for identifying customers of an Internet service provider who are most likely to stop using the service. In [39], difficulties encountered in the process of applying data mining techniques to direct marketing were discussed.

A more relevant response model for direct mail campaigns can be found in [4], where Bhattacharyya proposed a GA-based approach for developing optimal models at different target points. In his framework, each candidate solution was expressed as a linear combination of the input variables and was evaluated in terms of two criteria, response rate and robustness. He further analyzed the tradeoffs among multiple objectives by varying the weights assigned to the criteria at different target points. He reported significantly improved performance over the traditional logit regression model at the first few target points. Recent works [19,6] studied the same problem in a Pareto optimization framework, where multiple criteria are not combined but considered independently in order to avoid a subjective weighting scheme.

In [5], the profitability condition of a campaign was explicitly formulated as a function of the lift of the model, the uniform campaign cost per mailing, and the marginal revenue per identified positive record. They also provided heuristics to estimate the expected profit from targeting a subset of records without going through a lengthy data mining process. A brief review of evaluation metrics for marketing campaigns can be found in [40]. Chou et al. [41] devised an effective model for identifying prospective insurance buyers when buyer versus non-buyer information is not available. Gersten et al. [42] presented a model for selecting prospects in the automotive industry, where the buying decision takes a long time.

Domingos and Richardson [7] view a market as a social network where each customer has a different network value, i.e., influence on other customers' probability of buying a product. The network value of a customer in their model is computed as the expected profit from additional sales to customers whom she recursively influences to buy. Considering network effects can change the optimal marketing strategy dramatically. For example, it is worth marketing to a customer whose intrinsic value is lower than the cost of marketing when her influence on others' purchasing decisions is strong. From several experiments on publicly available data (www.research.compaq.com/src/eachmovie/), they not only reported superior performance to traditional direct marketing strategies but also profiled good customers to target.

6 Conclusion

In this paper, we presented a novel approach to customer targeting in database marketing. We used a genetic algorithm to search through possible combinations of features and an artificial neural network to score customers. One of the clear strengths of the GA/ANN approach is its ability to construct predictive models that reflect the direct marketer's decision process. In particular, given information on campaign costs and the profit per additional actual customer, we showed that our system not only maximizes the hit rate at a fixed target point but also selects a "best" target point where the expected profit from direct mailing is maximized. Further, our models are made easier to interpret by using a smaller number of features.

In future work we will look into a customer targeting model that assumes a heterogeneous structure of marginal revenue across prospects. The data set we analyzed in this paper does not include critical information such as the monetary value that each customer spent to purchase a caravan insurance policy. It is also reasonable to assume that there are relatively small differences in monetary value among the insurance options available to customers. However, consider the case where a non-profit organization sends solicitation letters to possible donors for charity. For this organization, maximizing the total amount of donated money is more important than identifying as many donors as possible, because, in an extreme case, a single donor can donate more money than all the other donors combined. In this case, value-based customer targeting becomes critical. By considering the monetary value raised from targeted customers as one objective, our GA/ANN model will be able to find an optimal solution with maximized monetary value.


Related to this direction of research, it would be interesting to see whether a bi-level model with two separate procedures, one for estimating donation probability and the other for estimating donation amounts, can do better. It has been claimed that any single-classifier model that learns two parameters is likely to make more errors in learning decision rules than a bi-level model [43].

Another research direction is to investigate whether or not the chosen feature subsets are related to the target points. Our experimental results in Section 4.2.1 show that different feature subsets are chosen at different target points, except for a few common features. We expected a good predictive subset of features to appear in most of the local solutions. We suspect that a certain subset of features can discriminate buyers from non-buyers well at one target point, but not as well as other features at different points. This could also happen because of the strong correlation among the insurance-related features. However, further investigation is warranted to support our speculation.

Acknowledgements

The authors wish to thank Peter van der Putten and Maarten van Someren for making the CoIL data available for this paper. This work is partially supported by NSF grant IIS-99-96044.

References

[1] J. R. Bult, T. Wansbeek, Optimal selection for direct mail, Marketing Science 14 (4) (1995) 378–394.

[2] F. Gonul, M. Z. Shi, Optimal mailing of catalogs: A new methodology using estimable structural dynamic programming models, Management Science 44 (9) (1998) 1249–1262.

[3] N. Piersma, J. Jonker, Determining the direct mailing frequency with dynamic stochastic programming, Tech. Rep. EI2000-34A, Econometric Institute, Erasmus University, Rotterdam, Netherlands (2000).

[4] S. Bhattacharyya, Direct marketing response models using genetic algorithms, in: Proc. of 4th Int'l Conf. on Knowledge Discovery & Data Mining (KDD-98), 1998, pp. 144–148.

[5] G. Piatetsky-Shapiro, B. Masand, Estimating campaign benefits and modeling lift, in: Proc. of 5th ACM SIGKDD Int'l Conf. on Knowledge Discovery & Data Mining (KDD-99), 1999, pp. 185–193.

[6] Y. Kim, W. N. Street, G. J. Russell, F. Menczer, Customer targeting: A neural network approach guided by genetic algorithms, Management Science, under revision.

[7] P. Domingos, M. Richardson, Mining the network value of customers, in: Proc. of 7th ACM SIGKDD Int'l Conf. on Knowledge Discovery & Data Mining (KDD-01), 2001, pp. 57–66.

[8] M. J. Shaw, C. Subramaniam, G. W. Tan, M. E. Welge, Knowledge management and data mining for marketing, Decision Support Systems 31 (1) (2001) 127–137.

[9] M. Riedmiller, Advanced supervised learning in multi-layer perceptrons – from backpropagation to adaptive learning algorithms, International Journal of Computer Standards and Interfaces 16 (5) (1994) 265–278.

[10] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.

[11] I. Gath, A. B. Geva, Unsupervised optimal fuzzy clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (7) (1989) 773–781.

[12] S. C. Ahalt, A. K. Krishnamurthy, P. Chen, D. E. Melton, Competitive learning algorithms for vector quantization, Neural Networks 3 (1990) 277–290.

[13] H. Hruschka, M. Natter, Comparing performance of feedforward neural nets and K-means for market segmentation, European Journal of Operational Research 114 (1999) 346–353.

[14] P. V. S. Balakrishnan, M. C. Cooper, V. S. Jacob, P. A. Lewis, Comparative performance of the FSCL neural net and K-means algorithm for market segmentation, European Journal of Operational Research 93 (10) (1996) 346–357.

[15] M. Kudo, J. Sklansky, Comparison of algorithms that select features for pattern classifiers, Pattern Recognition 33 (2000) 25–41.

[16] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140.

[17] Y. Freund, R. Schapire, Experiments with a new boosting algorithm, in: Proc. of 13th Int'l Conf. on Machine Learning, Bari, Italy, 1996, pp. 148–156.

[18] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, IEEE Intelligent Systems and their Applications 13 (2) (1998) 44–49.

[19] S. Bhattacharyya, Evolutionary algorithms in data mining: Multi-objective performance modeling for direct marketing, in: Proc. of 6th ACM SIGKDD Int'l Conf. on Knowledge Discovery & Data Mining (KDD-00), 2000, pp. 465–473.

[20] S. M. Mahfoud, Niching methods for genetic algorithms, Ph.D. thesis, Department of General Engineering, University of Illinois at Urbana-Champaign, Champaign, IL (1995).

[21] S. Chen, C. Guerra-Salcedo, S. Smith, Non-standard crossover for a standard representation – commonality-based feature subset selection, in: GECCO-99: Proc. of the Genetic and Evolutionary Computation Conference, Morgan Kaufmann, 1999, pp. 129–134.

[22] Y. Kim, W. N. Street, CoIL challenge 2000: Choosing and explaining likely caravan insurance customers, Tech. Rep. 2000-09, Sentient Machine Research and Leiden Institute of Advanced Computer Science, http://www.wi.leidenuniv.nl/~putten/library/cc2000/ (June 2000).

[23] F. Provost, T. Fawcett, Robust classification systems for imprecise environments, in: Proc. of 15th National Conf. on Artificial Intelligence (AAAI-98), 1998, pp. 706–713.

[24] F. Coetzee, E. Glover, S. Lawrence, C. L. Giles, Feature selection in Web applications using ROC inflections, in: Symposium on Applications and the Internet (SAINT), San Diego, CA, 2001, pp. 5–14.

[25] J. R. Quinlan, R. M. Cameron-Jones, Oversearching and layered search in empirical learning, in: Proc. of 14th Int'l Joint Conf. on Artificial Intelligence, Morgan Kaufmann, 1995, pp. 1019–1024.

[26] S. Murthy, S. Salzberg, Lookahead and pathology in decision tree induction, in: C. S. Mellish (Ed.), Proc. of 14th Int'l Joint Conf. on Artificial Intelligence, Morgan Kaufmann, 1995, pp. 1025–1031.

[27] D. D. Jensen, P. R. Cohen, Multiple comparisons in induction algorithms, Machine Learning 38 (3) (2000) 309–338.

[28] P. E. Rossi, R. McCulloch, G. Allenby, The value of household information in target marketing, Marketing Science 15 (3) (1996) 321–340.

[29] J. Schmid, A. Weber, Desktop Database Marketing, NTC Business Books, 1998.

[30] P. E. Green, D. S. Tull, Research for Marketing Decisions, 5th Edition, Prentice-Hall, Englewood Cliffs, New Jersey, 1988.

[31] J. Banslaben, Predictive modeling, in: E. L. Nash (Ed.), The Direct Marketing Handbook, McGraw-Hill, New York, NY, 1992, pp. 626–636.

[32] G. R. Bitran, S. V. Mondschein, Mailing decisions in the catalog sales industry, Management Science 42 (1996) 1364–1381.

[33] W. A. Kamakura, B. S. Kossar, A factor analytic split-hazard model for database marketing, Tech. Rep. 98-12-009, Department of Marketing, University of Iowa, Iowa City, IA (1998).

[34] W. A. Kamakura, S. N. Ramaswami, R. K. Srivastava, Applying latent trait analysis in the evaluation of prospects for cross-selling of financial services, International Journal of Research in Marketing 8 (1991) 329–349.

[35] W. A. Kamakura, F. de Rosa, M. Wedel, J. A. Mazzon, Cross-selling financial services with database marketing, unpublished working paper (2000).

[36] T. Fawcett, F. Provost, Combining data mining and machine learning for effective user profiling, in: Proc. of 2nd ACM SIGKDD Int'l Conf. on Knowledge Discovery & Data Mining (KDD-96), 1996, pp. 8–13.

[37] P. K. Chan, S. J. Stolfo, Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection, in: Proc. of 4th ACM SIGKDD Int'l Conf. on Knowledge Discovery & Data Mining (KDD-98), 1998, pp. 164–168.

[38] N. Raghavan, R. M. Bell, M. Schonlau, Defection detection, in: Proc. of 6th Int'l Conf. on Knowledge Discovery & Data Mining, 2000, pp. 447–456.

[39] C. X. Ling, C. Li, Data mining for direct marketing: Problems and solutions, in: Proc. of 4th ACM SIGKDD Int'l Conf. on Knowledge Discovery & Data Mining (KDD-98), 1998, pp. 73–79.

[40] S. Rosset, E. Neumann, U. Eick, N. Vatnik, I. Idan, Evaluation of prediction models for marketing campaigns, in: Proc. of 7th ACM SIGKDD Int'l Conf. on Knowledge Discovery & Data Mining (KDD-01), 2001, pp. 456–461.

[41] P. B. Chou, E. Grossman, D. Gunopulos, P. Kamesam, Identifying prospective customers, in: Proc. of 6th Int'l Conf. on Knowledge Discovery & Data Mining, 2000, pp. 447–456.

[42] W. Gersten, R. Wirth, D. Arndt, Predictive modeling in automotive direct marketing: Tools, experiences and open issues, in: Proc. of 6th Int'l Conf. on Knowledge Discovery & Data Mining, 2000, pp. 398–406.

[43] B. Zadrozny, C. Elkan, Learning and making decisions when costs and probabilities are both unknown, in: Proc. of 7th ACM SIGKDD Int'l Conf. on Knowledge Discovery & Data Mining (KDD-01), 2001, pp. 204–213.