
International Journal of Forecasting 25 (2009) 441–451. doi:10.1016/j.ijforecast.2008.09.004. www.elsevier.com/locate/ijforecast

Mining the past to determine the future: Problems and possibilities

David J. Hand∗

Department of Mathematics, Imperial College, London, United Kingdom
Institute for Mathematical Sciences, Imperial College, London, United Kingdom

Abstract

Technological advances mean that vast data sets are increasingly common. Such data sets provide us with unparallelled opportunities for modelling and predicting the likely outcome of future events. However, such data sets may also bring with them new challenges and difficulties. An awareness of these, and of the weaknesses as well as the possibilities of these large data sets, is necessary if useful forecasts are to be made. This paper looks at some of these difficulties, using illustrations with applications from various areas.

© 2008 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.

Keywords: Empirical models; Iconic models; Data mining; Model search; Large datasets; Selection bias


It is utterly implausible that a mathematical formula should make the future known to us, and those who think it can would once have believed in witchcraft.

Jacob Bernoulli, in Ars Conjectandi, 1713

1. Introduction

Modern data capture technologies and the capacity for data storage mean that we are experiencing a data deluge. This brings with it both opportunities and challenges. The opportunities arise from the possibility of discerning structures and patterns which would be undetectable with data sets with fewer points or which did not include such a range of variables. The challenges include those of searching through such vast data sets, as well as issues of data quality and apparent structure arising by chance. Such issues are discussed by Hand, Blunt, Kelly and Adams (2000).

DOI of original article: 10.1016/j.ijforecast.2008.11.001.
∗ Corresponding address: Imperial College, Department of Mathematics, South Kensington Campus, London SW7 2AZ, United Kingdom. E-mail address: [email protected].

Forecasting has always been an important statistical problem — indeed, it certainly predates the development of formal data analytic tools. But with the development of formal analytics, highly sophisticated forecasting methods have been developed, with particular tools created for the unique problems of different kinds of domain.

When the two areas come together — forecasting based on large masses of data and using the rapid development tools of data mining — new opportunities are created.


But, as with data mining in general, such opportunities do not come without their caveats. The careless use of any sophisticated tool can lead to misleading conclusions, and data mining is no exception. It is my view that these dangers have been largely overlooked by the data mining community, and, now that the discipline is firmly established, they need to be addressed. In this paper I briefly summarise high level notions of forecasting and data mining, and then look at some of these dangers. I illustrate these points using examples from various domains, though most come from the personal financial services sector, partly because I have considerable experience in that area, and partly because many of the dangers are particularly apparent in that area.

2. Forecasting

Economists joke that steering the economy is like steering a car by looking through the rear view mirror. Of course, one would never steer a car like that. To steer a car, one looks ahead, noting that one is approaching a bend in the road, that there is another vehicle bearing down on one, and that there is a cyclist just ahead on the near side. That is, in steering a car, one sees that certain things lie ahead, which will have to be taken into account. The presumption in this joke is that in steering the economy one cannot see what lies ahead, but, instead, has to try to predict it based on an analysis of past data.

In such a retrospective analysis, one examines configurations of incidents from the past, seeking arrangements which are similar to those of the present, so that one can extrapolate from these past incidents through the present to the future. Sophisticated extrapolations also take into account the uncertainties involved, giving distributions or confidence intervals for likely future values.

The fact is, however, that in steering a car one is making exactly the same kind of retrospective analysis. One observes the car ahead, and, based on one’s previous experience with approaching vehicles, assumes that the vehicle will continue to proceed, in a relatively uniform manner, on the correct side of the road.

The key to this, in the cases of both the economy and the car, is that one’s predictions, one’s forecasts, are based on assumptions of continuity with one’s past experience.

If, in many similar situations in the past, almost all had been followed by a particular event, then one would have considerable confidence that the same thing would happen the next time: the sun rising tomorrow is the classic example. The trick in all of this is quantifying the degree of continuity, and, in a sense, that is what all forecasting is about.

The desire to forecast is universal — one of those things we all wish we could do is know the future. Forecasting has several aspects. One is defining the degree of similarity between the present and the past. Another is determining the range and variability of events which followed these similar past events, and a third is deciding whether one understands enough about the underlying process to adopt a particular model form.

Forecasting also has its limitations. Firstly, there are chaotic limitations. These are fundamental in the sense that they tell us that no matter how much we know about the past and the present, and no matter how accurately we know it, there comes a point in the future at which the minuscule inaccuracies in our present knowledge will have been amplified to render our forecasts wide of the mark. This is nicely illustrated by weather forecasting, where, thanks to vast computational resources and huge statistical models, we can now forecast reasonably accurately perhaps five days ahead, but where extending this significantly further requires dramatically more extensive data and greater computer power.
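As a toy illustration of this point (added here; it is not from the paper), the following Python sketch iterates the logistic map, a standard example of a chaotic system, from two starting values that differ by one part in a billion. The tiny 'measurement error' is amplified until the two trajectories bear no relation to each other.

    # Two logistic-map trajectories starting 1e-9 apart.
    def logistic_map(x0, r=4.0, steps=50):
        xs = [x0]
        for _ in range(steps):
            xs.append(r * xs[-1] * (1.0 - xs[-1]))
        return xs

    a = logistic_map(0.300000000)
    b = logistic_map(0.300000001)   # a tiny 'measurement error' in the initial state

    for t in range(0, 51, 10):
        # the gap grows from about 1e-9 to order one: long-range forecasts are hopeless
        print(f"t={t:2d}   |a - b| = {abs(a[t] - b[t]):.2e}")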

Secondly, there are stochastic limitations. These are the sudden, unpredicted and unpredictable jolts to the system which are often caused by external agencies, or perhaps by inadequacies in the model. A nice recent example of that is the current global financial crisis. I have been asked: could we have seen it coming? The short answer, of course, is that we did: there are many economic forecasters, and at least as many forecasts. Some of these were sufficiently confident of the danger to act on it (some hedge funds did very well out of it). If we combine data mining with forecasting we can always find someone who (on looking back) gave the right forecast. This is the basis for the sure-fire way of making money as a stock market tipster by making a series of multiple different forecasts, and eventually selecting just those potential customers to whom you gave a series which happened to turn out to be correct. It also illustrates the difficulties of making inferences in data mining, when huge data sets and numbers of data configurations are involved.
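The tipster trick is easy to check numerically. In the hypothetical simulation below, 10,000 forecasters each make random up/down calls for ten periods; about ten of them end up with a perfect record purely by chance, and to the customers who received those calls they look like seers.

    import random

    random.seed(1)
    n_forecasters, n_periods = 10_000, 10

    # The market's actual up/down moves over ten periods.
    market = [random.choice([+1, -1]) for _ in range(n_periods)]

    # Each 'tipster' issues a random call for every period.
    perfect = sum(
        1
        for _ in range(n_forecasters)
        if [random.choice([+1, -1]) for _ in range(n_periods)] == market
    )

    # Expected number of perfect records by chance alone is
    # n_forecasters / 2**n_periods, i.e. roughly 10 of them.
    print(perfect)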



3. Data mining

The preface of my book Principles of Data Mining (Hand, Mannila, & Smyth, 2001) opened by defining data mining as ‘the science of extracting useful information from large data sets or databases’. I think that this brief definition is sufficiently broad that it will be non-controversial. However, the opening chapter of the book then included the more detailed definition: ‘the analysis of (often large) observational data sets to find unsuspected relationships and to summarise the data in novel ways that are both understandable and useful to the data owner.’ A comparison of this definition with past and current data mining practice immediately reveals something important: data mining, as a discipline, is developing and changing. Perhaps this is hardly surprising. The discipline is a young one — necessarily so, since it is a child of the computer age — and young things do grow and develop.

In its earliest usages, the term data mining was typically used in a derogatory sense (along with data ‘snooping’, ‘fishing’, ‘trawling’, and so on). It meant the examination of data sets from a large number of angles, fitting a great many models, or looking at a great many subsets of the data. The data set may not have been very large by modern standards (we are talking early days of the computer), but the number of possible data examinations which could be applied was essentially unlimited. The derogatory implication arises from the truism that, in any given data set, if you look hard enough, you are bound to find apparently unusual data configurations.

I think that the two aspects of this early perspective on data mining encapsulated the typical statistician’s perspective around, say, the 1980s — firstly, that it involved an extensive model search, and secondly, the derogatory implication. The interpretation of data mining as being primarily about extensive model searching continues to be pertinent. For example, Hoover and Perez (1999) entitled their paper on model search: ‘Data mining reconsidered: encompassing and the general-to-specific approach to specification search’. However, the derogatory implication has died away as the notion of data mining as being about extracting useful knowledge from large data sets has come to the fore.

This is probably partly as a consequence of the manifest need to do this in many situations. In addition, however, the further perspective that data mining is concerned with seeking information in large data sets has become more important nowadays, concurrently with the growth in numbers, and indeed sizes, of large data sets.

It is my personal view also that the changing population of researchers involved in data mining was a key cause of the improved regard in which data mining is held. In particular, as computers developed, so computer scientists, with backgrounds in database technology and related areas, gradually became more concerned with the analysis of data. Indeed, entire computational disciplines concerned with data analysis grew up, such as machine learning and pattern recognition. My personal (entirely subjective gross generalisation) observation is that (on average!) computer scientists are less conservative than statisticians, so that, where a statistician might prefer to err on the side of caution in sifting through data, a computer scientist might give it a whirl. Furthermore, with a background more solidly in data storage and the properties of existing data sets (through work in databases), in the earlier days of data mining computer scientists were less concerned with notions of inference — that is, of generalising from the configurations found in the database to patterns in data sets yet to be collected. This would have made them less aware of the role of chance and random variation in producing apparently interesting data configurations. For example, if one’s main concern is with the data in the personnel database, it is the people actually described there in which one is interested. Chance, and the personnel characteristics we would observe if we had had another set of employees ‘drawn from the same distribution as that of the actual employees’, is an irrelevant and uninteresting question. My suspicion that, at least in the early days of data mining, computer scientists were less concerned with the inferential issues, and more concerned with the description of actual existing data sets, is not mere speculation: I took part in interdisciplinary debates on such matters in the early 1990s.

If one’s primary aim is simply to summarise or describe particular features of a given data set, then many of the subtle difficulties and problems which arise with inference become irrelevant.


However, inference is central to forecasting, so these problems now become central. I think that the fact that mining large data sets originally sprang from computational rather than statistical roots explains why data miners have so often not appreciated just how important and tough these issues are.

Data mining is also changing in other ways. In particular, the extended definition above qualifies the data sets as ‘observational’. That is, data sets were regarded not as the product of designed experimentation, but were collected (often as a by-product of some other exercise) simply by measuring the world as it is presented. It is true that most data mining is still carried out on such observational data sets. In some domains (astronomy and archaeology, for example), the nature of the domain of study renders experimentation impossible. However, in other domains (business applications and medical research, for example) experimentation is certainly possible, and we are witnessing the collection of very large data sets which have been collected as an integral part of a data mining process. A classic example of this is the experimentation work carried out by the bank Capital One, of which I say more below.

There are also differences in the way data mining is used in different applications. This is true of both of the interpretations noted above: the exploration of a huge model space and the exploration of a huge data space. In scientific applications, for example, one often finds quite sophisticated techniques being applied to examine large data spaces (e.g. modern astronomical databases, or analysing the results of particle physics experiments). ‘Sophisticated’ here means that they require substantial effort to learn and understand. In contrast, in business applications one might find relatively simple and familiar methods (e.g. regression or cluster analysis) being used a huge number of times on different sets of customers or variables. For example, in building scorecards for predicting creditworthiness, large numbers of possible ways of segmenting the population are explored, along with large numbers of possible sets of predictor variables to combine when constructing a logistic regression model (say) in each segment. Of course, these descriptions of what goes on in different disciplines are generalisations, and one can readily find exceptions.

Note also that the terms ‘large’ and ‘a great many’ in all of this are relative: progress in computers has meant that a few years ago a ‘large’ data set might have contained just a thousand data points, whereas nowadays it might contain many millions or even billions.

It is useful, particularly in the context of data mining applications in forecasting, but also more generally, to distinguish between two different kinds of data mining exercises. I call these ‘model building’ and ‘anomaly detection’. Model building is an exercise with which all statisticians are familiar. The aim of modelling is to reduce the data set to a simpler description, which can then illuminate mechanisms or relationships, or can be used for exercises such as prediction or decision making. Time series models for forecasting are a familiar kind of model, but others include market segmentation for characterising likely future behaviour of customers, linear and generalised linear models for predicting outcomes, and so on. In contrast, in anomaly detection, the aim is to look for something unusual: the sudden departure from the norm, the extreme observation, the change in behaviour, etc.

When building models, one can often work with a sample. For example, basic laws of probability tell us that we may well be able to construct a model of sufficient accuracy using a (properly taken) sample of just 1000 data points, in place of the billion in the entire data set. Entire disciplines — survey sampling is an illustration — are built on this truth. In fact, however, there are subtleties. The size of the required sample will depend on both the accuracy one wants to attain and the complexity of the model one wants to build. To characterise a data distribution merely by its mean and variance requires a relatively small sample, but to describe also its skewness and kurtosis, along with other aspects, will require a larger sample.
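A small simulation (my illustration, not the paper's) makes the point about model complexity: with samples of 1000 points from a skewed distribution, the sample mean is estimated to within a few per cent across repeated samples, while the sample kurtosis, which depends on fourth moments, remains far more variable.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 1_000, 200

    means, kurts = [], []
    for _ in range(reps):
        x = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # a skewed 'population'
        means.append(x.mean())
        m2 = ((x - x.mean()) ** 2).mean()
        m4 = ((x - x.mean()) ** 4).mean()
        kurts.append(m4 / m2 ** 2)                       # sample kurtosis

    # The mean is pinned down tightly at n = 1000; the kurtosis is not.
    print("relative spread of the mean:    ", np.std(means) / np.mean(means))
    print("relative spread of the kurtosis:", np.std(kurts) / np.mean(kurts))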

As we push this notion further, so we reach the stage of trying to model very small features of the ‘true distribution’ from which the data arose. That is, we enter the realm of anomaly detection. Typically in data mining, however, this is approached from the other direction. Instead of trying to build a global model which describes the entirety of the underlying distribution, we focus on the data itself, and seek to detect unusual data points, groups of points, relationships between points, high frequency counts, etc.


Having detected such unusual configurations, we can then ask ourselves the inferential question: whether they could easily have arisen by chance.

At the very extreme, when we pose the anomaly question about individual data points — is this data point unusual? — we are forced to examine each and every data point. Sampling is of no potential use here (though sampling may be helpful in constructing a model with which each individual data point is compared — in outlier detection, for example). An example of such a situation would be detecting fraudulent credit card transactions, where sampling is clearly likely to be of little help. One simply has to examine each transaction.
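As a hedged sketch of scoring every record rather than a sample, the code below fits an isolation forest (scikit-learn's IsolationForest, one of many possible anomaly detectors; the transaction features are invented) and assigns an anomaly score to every transaction, flagging the most unusual for review.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)

    # Hypothetical transaction features: amount and hour of day.
    routine = np.column_stack([rng.gamma(2.0, 30.0, 5_000),
                               rng.normal(14.0, 3.0, 5_000)])
    odd = np.array([[5_000.0, 3.0],     # very large amount, middle of the night
                    [4_200.0, 4.0]])
    X = np.vstack([routine, odd])

    model = IsolationForest(random_state=0).fit(X)
    scores = model.score_samples(X)     # lower = more anomalous

    # Every transaction is scored; the most unusual few are flagged for review.
    flagged = np.argsort(scores)[:5]
    print(flagged)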

Increasing numbers of large and very large data sets, along with the development of very fast computers facilitating the rapid exploration of data sets in many different ways, hold out immense potential for extracting meaning from data, and in particular for improved forecasts and predictions. However, such potential power does not come without its risks. I have already mentioned chaotic and stochastic limitations, but one must always be alert for other, rather more mundane potential problems. The next section describes some such.

4. Problems

The combination of large data sets and observational data means that data mining exercises are often at risk of drawing misleading conclusions. In this section I describe just four of these dangers. These problems are certainly not things I alone have detected. Indeed, within the statistics community, they are problems which are well understood. However, the central philosophy of data mining — throw sufficient computer power at a large enough data set and interesting things will be revealed — has meant that they have often been overlooked in data mining exercises. Unfortunately, the solution is to temper the enthusiasm, and to recognise that rather more complex models are necessary. Statisticians generally do not build complicated models simply for fun, but for good reasons.

Problem 1. Selectivity bias.

I noted above that most data mining activities are carried out on observational data. By this I mean that the researcher had no control over what treatments, exposures, or conditions the objects being studied were subjected to.

This is in contrast to experimental studies, where such control is exercised. The risks associated with observational studies are well known. Primarily, because of the lack of control, and in particular the lack of opportunity for random assignment to different ‘treatment’ groups, there is a risk that observed differences between groups of objects may be due to unrecognised factors. For example, in a study aimed at identifying the distribution of different kinds of astronomical objects, dimmer objects are less likely to be detected. Since objects which are further away are likely to be dimmer, there is a relationship between proximity and probability of being detected. Then, because of the finite speed of light, we are observing further objects at an earlier time of their existence, so we obtain a time-distorted picture of the population of star and galaxy types across the universe. Things are then further complicated by interstellar and intergalactic gas and dust clouds, which attenuate radiation. A densely populated region of space may appear sparsely populated merely because the light is not getting through to us. Of course, all of these phenomena are well known to astronomers, and appropriate adjustments are made in astronomical studies, but things are more difficult in situations where the data selection mechanisms are not so well understood, or, even worse, the possibility of such mechanisms is ignored.

A familiar case arises in the retail finance sector, where credit scores are used as the basis on which to make decisions about selling financial products (Hand, 2001a; Hand & Henley, 1997; Rosenberg & Gleit, 1994; Thomas, 2000). In particular, the aim is often to forecast whether an applicant for a product (e.g. a loan, a credit card, a mortgage, car finance, etc.) is likely to default on repayments within two years. A variety of different types of credit score have been developed, but typically they will include information on past behaviour with financial products (e.g. default on previous loans, slowness in making repayments, nature of credit products used in the past, fraction of credit limit reached, etc.), as well as other permitted characteristics which have been found to be predictive of the probability of defaulting (for example, whether or not a homeowner, time with current employer, etc., but not including gender, which is prohibited by law).


The last thirty years has seen considerable development of such models, and some highly sophisticated approaches have been developed. Although a wide variety of statistical and machine learning approaches have been investigated, including, for example, neural networks, random forests, support vector machines, and so on, by far the most popular type of model is a logistic regression tree, or ‘segmented scorecard’, as it is called in the industry. Such a model partitions the population of potential customers into segments (e.g., one might have three segments: those who have previously defaulted, those who have not and who have many existing lines of credit, and those who have not and have few existing lines of credit), and then builds distinct logistic regression models within each segment to predict default, using retrospective data.
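A rough sketch of such a segmented scorecard is given below. The segmentation rule and column names are hypothetical, and a production system would add variable transformations, coarse classification and much else; the point is simply the structure of one logistic regression per segment.

    from sklearn.linear_model import LogisticRegression

    def segment(row):
        """Hypothetical three-way split, loosely following the example above."""
        if row["prior_default"]:
            return "prior_default"
        return "many_lines" if row["n_credit_lines"] >= 4 else "few_lines"

    def fit_segmented_scorecard(applicants, feature_cols, target_col="default"):
        """Fit one logistic regression per segment of a pandas DataFrame of
        past customers; return a dict mapping segment name to model."""
        labels = applicants.apply(segment, axis=1)
        return {
            name: LogisticRegression(max_iter=1000).fit(
                part[feature_cols], part[target_col]
            )
            for name, part in applicants.groupby(labels)
        }

    def predicted_default_probability(models, applicant, feature_cols):
        """Score a single new applicant (a pandas Series) with the right model."""
        model = models[segment(applicant)]
        return model.predict_proba(applicant[feature_cols].to_frame().T)[0, 1]

In use, fit_segmented_scorecard would be applied to the retrospective sample of past customers, and predicted_default_probability to each new applicant.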

Now let us look at this process from the perspectives of the data used to construct the model and the aims of the exercise. To do so, let us step right back to the beginning of the process. Our aim is to make the best predictions we can of the default risk of anyone who applies for the financial product — a loan, say.

In order to construct our predictive model, we need data describing previous customers, some of whom will have defaulted and some not. To obtain such data we started by soliciting applications for the loan. This might have been by direct mailing, via the internet, or by some other means. Amongst those who received the solicitation, many will have failed to reply. Amongst those who did reply, we will have used some earlier scorecard (or maybe even subjective judgement) to decide to whom to offer the loan. Amongst those who were offered the loan, only some would have taken up the offer (the others may have found the terms unattractive, or have changed their mind, etc.). And amongst those who did take the loan, some would unfortunately turn out to be bad risks and will have defaulted.

These various selection processes will have finally produced a population of customers who took the loan, some of whom defaulted and some of whom did not. This gives us a population which we can use to build a model to predict the likely outcome of a new customer, with known characteristics. However, this final population has undergone many selection steps, and might be quite unlike the population of people who apply for a loan.

In particular, just to take one aspect by way of illustration, the population whose outcome we have actually observed consists solely of people who (we originally thought) would be good risks. Because of this, assuming that our initial suspicions were reasonably well-founded, the population whose outcome we observe is likely to be significantly less risky than the overall population of applicants: it will contain a lower proportion of people likely to default.
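The effect is easily demonstrated with simulated data (the figures below are invented for illustration): applicants are generated with a true default probability, an earlier and imperfect scorecard accepts only the apparently good risks, and the default rate in the accepted, observed population comes out far lower than that of the full applicant population.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000

    # Hypothetical applicants: one underlying risk score drives the true
    # default probability; the earlier scorecard sees only a noisy version.
    risk = rng.normal(size=n)
    p_default = 1.0 / (1.0 + np.exp(-(risk - 2.0)))
    defaulted = rng.random(n) < p_default

    old_score = risk + rng.normal(scale=0.5, size=n)   # earlier, imperfect scorecard
    accepted = old_score < 0.0                         # lend only to apparent good risks

    print("default rate among all applicants:      ", defaulted.mean())
    print("default rate among accepted (observed): ", defaulted[accepted].mean())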

Of course, the consumer credit industry is well aware of this problem, and considerable effort has been made to overcome it. This effort goes under the name of ‘reject inference’ (Hand, 2001b; Hand & Henley, 1993), based on the counterfactual notion that one would like to determine the outcome class of those applicants whom one previously rejected for a loan if one had in fact offered them one, so that these outcomes could also be used when constructing the model.

The fact is, however, that the basic problem, as presented above, is an insuperable one: unbiased models cannot be constructed unless additional information is introduced. This extra information might come in various forms, including data about the earlier decision process or assumptions about underlying distributions, or from extra data. In fact, this example is an interesting one because, in contrast to many other data mining situations where such population selectivity arises, the problem has been recognised and has a known solution, at least in principle. The ideal solution is to change it from an observational to an experimental study. This can be done by accepting a random sample of applicants who, one believes, are likely to default, in addition to accepting those that one believes are less likely to do so, so that a scorecard free from the distortions of sample selectivity can be constructed. This does require an enlightened understanding of the principles of experimentation, because it necessarily means selling the financial product to some people who would be regarded as a high risk. In my experience, most banks are uneasy about this. I regard this as a manifestation of a short term perspective: more overall profit can be made by sacrificing some short term gain in the interests of learning more about how customers behave.


Of course, there are exceptions. In particular, Capital One is renowned for its constant experimentation to discover the best products to provide for its customers. It is reported to carry out some 60,000 experiments a year. This immediately produces a very large data set (without even considering the individual responses of the customers within each arm of each experiment). To extract useful information from such a mass of studies, data mining tools are needed. However, at least as far as this paper is concerned, the key aspect of this experimentation is that it includes notions of a willingness to assign some predicted bad customers to the ‘good’ arm. Recognition of this principle has led Capital One to phenomenal success.

This example shows population distortion arising as a result of some prior data selection process, but it can also arise in many other ways. One very popular approach for handling incomplete data is simply to discard any incomplete records. As all statisticians know, this can be a dangerous strategy, since it risks leading to an analysis sample which has a distribution different from that of the complete population. Statisticians have developed a deep understanding of missing data, missing data mechanisms, when valid inferences can be made, and how to adjust for missing data, but data miners very often ignore it. The reject inference problem can, of course, be seen as a missing data problem, since there the higher risk applicants are disproportionately more likely to have been excluded from the sample with known outcomes which is available for analysis.

Problems of selectivity bias are not new. The potential for adversely impacting small data sets is just as great as with large data sets, but perhaps the dangers are less obvious with large data sets. I certainly think that data miners have been slow to address the issue. Moreover, if one is seeking small effects amongst large numbers of data points, the potential for these small effects to be caused by unrecognised influences is considerable.

Problem 2. Out of date data.

For sound pedagogical reasons, much statistics is taught from the batch mode perspective. That is, the analyst (a student, say) is given a set of data (complete, with no missing values, and assumed to be without errors, of course), and is requested to fit a model to it. However, the truth is that many analyses are really conducted within a latent context of a stream of data, of problems, and of questions.

In business, for example, I conjecture that almost all analyses are of this kind (and yet basic business statistics texts do not emphasise it), since businesses generally aim to continue into the future. Perhaps because much of the economic impetus for data mining has come from business needs, there has been considerable interest in what has come to be called streaming data in the data mining community. Such streaming applications are closely tied to forecasting — in most cases, businesses will want to use the information they acquire from an analysis to guide their future decision making. The point is that elaborate and sophisticated models do exist for coping with evolving data and problems (dynamic linear models, for example), but these are seldom applied in day-to-day data mining applications.

To illustrate, I again turn to the retail financial services sector, and credit scorecards.

To build a scorecard, we need both the potentially predictive characteristics (described above) and the outcome (e.g. default or not) of a sample of customers. Clearly, these are customers who were signed up some time ago, since we have had to wait until they have had the opportunity to default. Once again taking a loan as an example, in principle this means waiting until the end of the loan period. Suppose, for illustration, that the loan term is two years. Then, to be certain that someone will not default, we must wait until two years after they took out the loan to determine that their true class is ‘good’. If they do default before the two years are up, then we immediately determine that they are ‘bad’. However, we cannot choose a time less than two years and look at their status, or at least not without more elaborate analysis, or we would risk selectivity problems of the kind described above. This means that the predictor data on customers from whom we build the model relates to a population of customers which is at least two years out of date. This is not a problem if the system is stationary: if the distributions do not change over time (that is, if there is no ‘population drift’), but it can be a serious problem if they do. This point is particularly relevant at the moment, because we have recently had a long period of relatively benign conditions for consumer credit, which was suddenly brought to a crashing end with the ‘credit crunch’. Models built in the benign period may not be relevant to the present, as is demonstrated by the dramatic increase in house repossessions and negative equity in recent months.



One expects the performance of predictive models to degrade over time as populations change. However, the above means that the models should be expected to have degraded before we start: they are (in the example above) already two years out of date before they are even used.

Various approaches have been explored for tackling this problem. They include:

(i) Survival analysis, where one truncates the observation period to less than two years, as suggested above, but makes explicit allowance for the fact that some of the customers may go bad after the observation date (see the sketch after this list). Of course, there is a limit to how short an observation period one can take. At the least, since ‘default’ is often defined in terms of three consecutive months of failure to pay an installment, one may have to wait three months, and this may not yield sufficient defaulters to build a reliable model.

(ii) Dynamic logistic regression models.

(iii) More elaborate models based on the hypothesis that there are characteristic types of customers, some more likely to default than others, and that this trait type is stable for a given customer. This allows the models to be split into two parts, a part relating the customer’s demographic and circumstantial characteristics to their type, and a part relating the type to the default risk. The first part is a very short term model. This part can be designed using old data, but when used it will be based on customer data which is only a few months old, so that it is very up to date. The second part is invariant over time: it can be based on old data, and will still be a valid link between the customer type identified in the first part, and the default risk.
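A minimal sketch of option (i), assuming the lifelines library and invented account data: time on the books is right-censored for accounts that have not (yet) defaulted, and a Kaplan-Meier estimate still recovers the probability of surviving the full two-year term.

    import numpy as np
    from lifelines import KaplanMeierFitter

    rng = np.random.default_rng(0)
    n = 2_000

    # Invented accounts: a true (unseen) time to default, and a shorter
    # observation window, so most accounts are right-censored.
    true_time_to_default = rng.exponential(scale=60.0, size=n)   # months
    months_on_book = rng.uniform(3.0, 24.0, size=n)

    observed_time = np.minimum(true_time_to_default, months_on_book)
    defaulted = true_time_to_default <= months_on_book           # False = censored

    kmf = KaplanMeierFitter()
    kmf.fit(observed_time, event_observed=defaulted)

    # Estimated probability of *not* defaulting within the two-year term,
    # even though most accounts were observed for far less than 24 months.
    print(kmf.predict(24.0))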

However, despite the existence of these more elaborate models, the sector relies on relatively simple non-dynamic models, monitoring their performance until the degradation seems sufficient for them to need to be rebuilt.

In fact, this exposition merely scratches the surface of the difficulties. The aim of building predictive models in this sector, and indeed in business applications in general, is not to see how clever we are at predicting the future, but is often to take some action in the present.

Indeed, and in particular, it is to make some intervention. If, for example, we predict that someone is likely to fall into arrears with repayments, we might well contact them and arrange a revised repayment schedule. This very intervention will change the nature of the data on which the prediction is based, and so will invalidate the model. We have a reactive situation: what we do affects how the customer behaves.

In other domains yet further complications arise. Economic data are subject to revision as time progresses, because, for example, raw data takes time to come in, and more comes in as time passes. This means that the current estimates of things such as inflation, GDP, etc. are likely to be improved upon as they age. As a consequence, rather than the conventional approach of weighting the most recent data most heavily in forecasts, it can be better to weight older data more heavily.

Data miners (or at least those in business practice, rather than those who present ideas at academic data mining conferences) tend to stick to relatively simple methods — for example, repeatedly rebuilding scorecards, as noted above. The ability to do this is clearly another consequence of the computer revolution. One question I would like to raise is whether this is a good thing. As far as changing circumstances are concerned, it means that one’s models can adjust to the current situation, and avoids relying on possibly dubious assumptions which might be made by more advanced methods: survival analysis has to assume some distributional form to extrapolate into the future to decide whether someone is likely to default before the end of the loan term, and the bipartite model solution is making a fundamental assumption about the nature of customers.

Problem 3. Empirical rather than iconic models.

There are two distinct kinds of statistical models, which go under various names, but which here I will term iconic and empirical (Box & Hunter, 1965; Cox, 1990; Hand, 1985, 1994). Iconic models are mathematical representations (‘images’) of (necessarily simplifying) theories describing the phenomenon in question. Thus we might have a physical theory which tells us that objects will accelerate as they fall towards the earth, and we might fit such a model to a set of data.


Conversely, empirical models are based purely on finding convenient or useful summaries of a data set. Many regression models are of this kind: in a particular context there may be no theory saying that a mean response should be a linear combination of a set of predictor covariates, but a regression model may be used nevertheless. The relative balance of iconic and empirical models varies across disciplines and changes over time, and a model can start out as empirical and become iconic as understanding grows.

In general, I believe, iconic models should be expected to yield superior predictions to empirical models. That is provided, of course, that the models are ‘right’, in that they do represent important aspects of (or, perhaps, ‘good approximations to’) the way the system being modelled really behaves. The rationale behind this belief is the fact that models are generally composed of various components, so that one can think of these components as forming a set of basis functions by which to represent the system being measured. An iconic model (with the proviso above) is thus based on a good set of basis functions, which permits a reasonable approximation to the system, without extensive model searches and without the danger of adding superfluous basis functions simply because, by chance, they happen to fit well to the particular (finite) data set at hand. In contrast, an empirical model is either the result of a search over a much wider set of basis functions or is the result of a prior restriction to a particular set (e.g. a linear combination of predictor variables). In general, fitting a model using a smaller set of well-chosen basis functions leads to more accurate estimation — with, again, the proviso above, that this permits a reasonable estimate of the ‘truth’. Without this proviso there is a risk of bias, and hence inaccuracy of a different kind.

Now, the fact is that most predictive data mining models (in commercial applications, at least) are empirical. This is reasonable enough — in most situations there is little theory on which to base an iconic model. However, I think that there is a danger here. I call it the cliff edge effect: a sudden, dramatic deterioration in predictive model performance.

Empirical models are well-matched to the data at hand. They describe the retrospective set of data available for constructing the model very well, and, if carefully built, allow accurate generalisation and prediction of new cases drawn from the same distribution.

However, as we have already seen, in the credit arena, and I would conjecture in most other business applications, population drift (driven by changing economic circumstances, competitive environment, technological progress, etc.) means that the new cases are not drawn from the same distribution. Indeed, we have already seen that, in the case of credit scorecards, the forecasting model is some years out of date before its usage even commences. In such cases, I suspect that iconic models (again with the proviso above) will be more reliable, and less subject to the cliff edge effect.
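A toy simulation of the cliff edge (mine, not the paper's; the data-generating process is invented): a logistic regression fitted under one regime keeps its discrimination on new data from that regime, but degrades sharply as the relationship between a predictor and default drifts.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    def make_data(n, drift=0.0):
        """Invented credit data; `drift` alters how one predictor relates to default."""
        X = rng.normal(size=(n, 2))
        true_score = X[:, 0] + (-0.5 + drift) * X[:, 1]
        y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_score))).astype(int)
        return X, y

    X_train, y_train = make_data(20_000)
    model = LogisticRegression().fit(X_train, y_train)

    # Discrimination is good on data like the training regime, and collapses
    # as the relationships drift away from those the model was built on.
    for drift in (0.0, 1.0, 2.0):
        X_new, y_new = make_data(20_000, drift=drift)
        auc = roc_auc_score(y_new, model.predict_proba(X_new)[:, 1])
        print(f"drift = {drift:.1f}   AUC = {auc:.3f}")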

Of course, constructing good theories on which to base one’s iconic model may not be easy. However, it is possible to step partly towards this ideal. Once again, to illustrate, I turn to the credit scoring domain.

As we have already seen, in this domain, the models are generally empirical. They collate a set of possible predictor variables, measure the associated outcomes (e.g. default or not), and then build the segmented scorecards or whatever, purely on the basis of an empirical analysis identifying associations in the data. The resulting model has the familiar regression form

r = f(x_1, ..., x_p),

where r is the outcome, the x_i are the predictor variables, and f is a function permitting the prediction of the outcome from the predictors. f or its parameters are estimated directly from the retrospective data, which includes observations of both the x_i and r. Note that in such a model no restrictive assumptions are made about the relationships between the x_i. Typically, logistic regression models or logistic regression trees are the chosen form for f in the retail credit industry, and extensive data mining work is used to construct them, including trying different segmentations, different transformations of predictor variables, and different sets of predictor variables.

However, there is an alternative to this. In particular, one can conceptualise ‘creditworthiness’ as a latent variable, a characteristic of the customer. This will be intrinsically unobservable, but will be influenced by the primary characteristics of the customer, and will in turn influence various behavioural characteristics. For example, denoting this latent characteristic by q (for ‘quality’ — see Hand and Crowder (2005)), variables such as age, socio-economic group of parents, education level of parents, and so on, might be regarded as primary characteristics — they influence, but are not influenced by, the creditworthiness of the individual being rated.


In contrast, examples of behavioural characteristics would be arrears history and current account history. We might reasonably regard these as potentially influenced by creditworthiness. Certainly they are of a qualitatively distinct kind from the primary characteristics. Moreover, we might reasonably assume that these behavioural characteristics are conditionally independent, given the q value. That is, any relationships between the behavioural characteristics are induced by their mutual relationship with q. This thus yields a more elaborate multiple-indicator-multiple-cause model, with q being the unobservable latent variable in the middle, which can be estimated (Hand & Crowder, 2005).

The point of this is that it is a step away from the purely empirical models traditionally used in data mining, towards an iconic model form. Here the theory is very weak — merely saying that certain aspects of an individual influence the latent q variable, that others are influenced by it, and that these latter are conditionally independent given q. However, it is a first step.

Problem 4. Measuring performance.

The final problem I want to mention is that of measuring the performance of predictive models, and the closely related issue of the criterion used to choose between models. The starting point is the self-evident truism that different measures lead to different models being chosen. Models for predicting a binary prognosis of hospital patients based on optimising the misclassification rate are likely to be rather different from models chosen on the basis of likelihood. There is nothing deep in this: ranking people by weight is likely to yield an order different from ranking them by height, even though one might expect the two rank orders to be correlated.

Since different performance criteria are likely to yield different orders of merit, it is clearly important to choose a criterion which closely matches the objectives of the analysis, and yet this is often not practised. All too often, too little thought is given to the ultimate objectives of an analysis, and a criterion is adopted by convention.


Sometimes there are sensible practical reasons for avoiding choosing the most appropriate criterion, though the risks are not always appreciated. For example, in predictive classification problems (like the binary prognosis forecasting problem mentioned above), the misclassification rate is often chosen as a performance measure (indeed, in comparative evaluations of such methods by the data mining, machine learning, and statistics communities, by far the most common criterion is the misclassification rate, see Jamain, 2004). However, the misclassification rate is rarely chosen as the criterion to be optimised when determining the model. This is because, being discrete, it is difficult to optimise. Instead, more typically, a measure such as the likelihood is chosen.

One can take this further. Even though the misclassification rate is a common evaluation measure for binary prognosis problems, it is rarely an appropriate measure. More commonly, different kinds of misclassifications carry different relative degrees of severity, and this should be taken into account when choosing a criterion.

Taking this further still, practical experience shows that determining these relative degrees of severity is difficult. This has led to a variety of measures such as the Gini coefficient (or, equivalently, the area under the ROC curve), or partial areas under the ROC curve (e.g. McClish, 1989), which aggregate different relative degrees of severity. These are rather unsatisfactory measures, either because they make latent assumptions about the relative severity, or because they make these explicit, and hence introduce subjectivity. Likelihood, on the other hand, is universal (all researchers, applying the same model to the same data, will obtain the same likelihood). Of course, if one is prepared to believe that the family of models being contemplated includes the ‘true’ model, then any measure of discrepancy between the true model and the fitted model can be defended — and likelihood has many attractive properties. On the other hand, if the assumption is difficult to justify (as must surely be the case in all empirical modelling) then it seems less acceptable. Hand and Vinciotti (2003) explored this issue.
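To see how much the choice of criterion matters, the sketch below (an added illustration using scikit-learn on simulated data) compares a fitted logistic regression with an artificially overconfident version of the same probabilities: the two agree on misclassification rate and, to rounding, on AUC, yet differ sharply on the likelihood-based log loss.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss, roc_auc_score, zero_one_loss
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 3))
    y = (rng.random(10_000) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    p = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    # Same ranking of cases, but probabilities pushed towards 0 and 1.
    p_sharp = np.clip(p**3 / (p**3 + (1 - p)**3), 1e-6, 1 - 1e-6)

    for name, probs in [("calibrated   ", p), ("overconfident", p_sharp)]:
        print(name,
              " error rate:", round(zero_one_loss(y_te, (probs > 0.5).astype(int)), 3),
              " AUC:", round(roc_auc_score(y_te, probs), 3),
              " log loss:", round(log_loss(y_te, probs), 3))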


5. Conclusion

Forecasting is fundamentally an inferential problem. That is, it is not simply a question of summarising data, but is rather a question of generalising from the available data to new data — and in particular to new situations which are likely to arise in the future. In contrast, the early development of data mining by the computer science community put emphasis on the analysis of the data set to hand (e.g. the discovery of ‘frequent itemsets’ in large transaction databases). It is only relatively recently that the inferential nature of many of the problems addressed by data miners has been properly recognised. Inference is a much tougher problem than summarising. It requires careful thought about how the available data arose, so that one can be sure that one has a properly representative data set, permitting the powerful tools of probability and statistics to be properly applied. I suspect that, all too often, certainly in the past and to a large extent in the present, such issues have been overlooked by data miners. It is, perhaps, a tribute to the power and potential of data mining that, despite these dangers, the discipline has gained in importance and reputation.

With this as a background, I believe that data mining is changing. The central importance of inference to many of its concerns is being recognised. Moreover, although many data analyses are based on retrospective observational data originally collected for some other purpose, increasingly we are seeing data mining ideas being applied in experimental settings. This holds the promise of very exciting developments in the future.

Data mining, in commercial practice at least, is often characterised by the extensive fitting of relatively simple models. Moreover, these are almost universally empirical. Empirical models have the strength that they might include a powerful predictor which an ‘expert’ would not have recognised as relevant — they can include things we would never have thought of — to increase the predictive power. However, this is not without its risks. In particular, empirical relationships are susceptible to the cliff edge effect, and the predictive performance may degrade dramatically if relationships alter as the circumstances surrounding new data change. Also, empirical models, while they might lead to effective prediction and forecasting, do not lead to an enhanced understanding of the underlying truths.


Acknowledgements

The author’s work on this paper was partially supported by a Royal Society Wolfson Research Merit Award.

References

Box, G. E. P., & Hunter, W. (1965). The experimental study of physical mechanisms. Technometrics, 7, 57–71.

Cox, D. R. (1990). Role of models in statistical analysis. Statistical Science, 5, 169–174.

Hand, D. J. (1985). Artificial intelligence and psychiatry. Cambridge: Cambridge University Press.

Hand, D. J. (1994). Deconstructing statistical questions (with discussion). Journal of the Royal Statistical Society, Series A, 157, 317–356.

Hand, D. J. (2001a). Modelling consumer credit risk. IMA Journal of Management Mathematics, 12, 139–155.

Hand, D. J. (2001b). Reject inference in credit operations. In E. Mays (Ed.), Handbook of credit scoring (pp. 225–240). Chicago: Glenlake Publishing.

Hand, D. J., Blunt, G., Kelly, M. G., & Adams, N. M. (2000). Data mining for fun and profit. Statistical Science, 15, 111–131.

Hand, D. J., & Crowder, M. J. (2005). Measuring customer quality in retail banking. Statistical Modelling, 5, 145–158.

Hand, D. J., & Henley, W. E. (1993). Can reject inference ever work? IMA Journal of Mathematics Applied in Business and Industry, 5, 45–55.

Hand, D. J., & Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: a review. Journal of the Royal Statistical Society, Series A, 160, 523–541.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, Mass: MIT Press.

Hand, D. J., & Vinciotti, V. (2003). Local versus global models for classification problems: fitting models where it matters. American Statistician, 57, 124–131.

Hoover, K. D., & Perez, S. J. (1999). Data mining reconsidered: encompassing and the general-to-specific approach to specification search. Econometrics Journal, 2, 167–191.

Jamain, A. (2004). A meta-analysis of classification methods. Ph.D. Thesis, Department of Mathematics, Imperial College London.

McClish, D. K. (1989). Analyzing a portion of the ROC curve. Medical Decision Making, 9, 190–195.

Rosenberg, E., & Gleit, A. (1994). Quantitative methods in credit management: a survey. Operations Research, 42, 589–613.

Thomas, L. C. (2000). A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting, 16, 149–172.