Machine Learning and Data Mining

Tom M. Mitchell

Machine learning algorithms enable discovery of important “regularities” in large data sets.

Over the past decade, many organizations have begun to routinely capture huge volumes of historical data describing their operations, products, and customers. At the same time, scientists and engineers in many fields have been capturing increasingly complex experimental data sets, such as gigabytes of functional magnetic resonance imaging (fMRI) data describing brain activity in humans. The field of data mining addresses the question of how best to use this historical data to discover general regularities and improve the process of making decisions.

The increasing interest in data mining, or the use of historical data to discover regularities and improve future decisions, follows from the confluence of several recent trends: the falling cost of large data storage devices and the increasing ease of collecting data over networks; the development of robust and efficient machine learning algorithms to process this data; and the falling cost of computational power, enabling use of computationally intensive methods for data analysis.

The field of data mining, sometimes called “knowledge discovery from databases,” “advanced data analysis,” and machine learning, has already produced practical applications in such areas as analyzing medical outcomes, detecting credit card fraud, predicting customer purchase behavior, predicting the personal interests of Web users, and optimizing manufacturing processes. It has also led to a set of fascinating scientific questions about how computers might automatically learn from past experience.

Prototypical Applications

Figure 1 shows a prototypical example of a data mining problem. Given a set of historical data, we are asked to use it to improve our medical decision making. The data consists of a set of medical records describing 9,714 pregnant women. We want to improve our ability to identify future high-risk pregnancies; specifically, those at high risk of requiring an emergency Cesarean-section delivery. In this database, each pregnant woman is described in terms of 215 distinct features, such as her age, whether she is diabetic, and whether this is her first pregnancy. These features (in the top portion of the figure) together describe the evolution of each pregnancy over time.

The bottom portion of the figure illustrates a typical result of data mining, including one of the rules learned automatically from this data set. This particular rule predicts a 60% risk of emergency C-section for mothers exhibiting a particular combination of three features, out of the 215 possible features. Among the women known to exhibit these three features, the data indicates that 60% have historically given birth by emergency C-section. As summarized at the bottom of the figure, this regularity holds over both the training data used to formulate the rule and a separate set of test data used to verify the reliability of the rule over new data. Physicians may want to consider this rule as a useful factual statement about past patients when weighing treatment for similar new patients.
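To make this train/test check concrete, the following minimal Python sketch applies a rule of this form to held-out records. The record field names and helper functions are hypothetical; only the rule’s three conditions and the accuracy counts come from Figure 1.

```python
def rule_fires(record):
    """True when all three conditions of the Figure 1 rule hold."""
    return (not record["previous_vaginal_delivery"]
            and record["second_trimester_ultrasound"] == "abnormal"
            and record["malpresentation_at_admission"])

def rule_accuracy(records):
    """Among records the rule matches, the fraction that were emergency C-sections."""
    matched = [r for r in records if rule_fires(r)]
    if not matched:
        return None
    return sum(r["emergency_c_section"] for r in matched) / len(matched)

# With the article's counts, the rule matched 41 training records
# (26 positive: 26/41 = .63) and 20 test records (12 positive: 12/20 = .60).
```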

What algorithms can be used to learn rules like the one in the figure? This rule was learned through a symbolic rule-learning algorithm similar to Clark’s and Niblett’s CN2 [3]. Decision-tree learning algorithms, such as Quinlan’s C4.5 [9], are also frequently used to formulate rules of this type. When rules have to be learned from extremely large data sets, specialized algorithms stressing computational efficiency may also be used [1, 4]. Other machine learning algorithms commonly applied to this kind of data mining problem include neural networks [2], inductive logic programming [8], and Bayesian learning algorithms [5]. Mitchell’s 1997 textbook [7] describes a broad range of machine learning algorithms used for data mining, as well as the statistical principles on which they are based.
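As one illustration of the decision-tree family mentioned above, here is a hedged sketch using the modern scikit-learn library (our assumption; the article’s rule came from a CN2-style symbolic learner, not this code). The toy features stand in for the 215 per-patient features.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["age", "diabetic", "first_pregnancy"]  # hypothetical subset
X = [[23, 0, 0], [31, 1, 0], [39, 1, 1], [27, 0, 1]]    # toy patient records
y = [0, 1, 1, 0]                                        # 1 = emergency C-section

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=feature_names))   # the tree as if-then splits
```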


Data:
Patient103, time=1: Age: 23; FirstPregnancy: No; Anemia: No; Diabetes: No; PreviousPrematureBirth: No; Ultrasound: ?; Elective C-Section: ?; Emergency C-Section: ?
Patient103, time=2: Age: 23; FirstPregnancy: No; Anemia: No; Diabetes: Yes; PreviousPrematureBirth: No; Ultrasound: Abnormal; Elective C-Section: ?; Emergency C-Section: ?
Patient103, time=n: Age: 23; FirstPregnancy: No; Anemia: No; Diabetes: No; PreviousPrematureBirth: No; Ultrasound: ?; Elective C-Section: No; Emergency C-Section: Yes

Rules learned:
If No previous vaginal delivery, and Abnormal 2nd Trimester Ultrasound, and Malpresentation at admission
Then Probability of Emergency C-Section is 0.6
Training set accuracy: 26/41 = .63
Test set accuracy: 12/20 = .60

Figure 1. Data mining application. A historical set of 9,714 medical records describes pregnant women over time. The top portion shows a typical patient record (“?” indicates the feature value is unknown). The task for the algorithm is to discover rules that predict which future patients will be at high risk of requiring an emergency C-section delivery. The bottom portion shows one of many rules discovered from this data. Whereas 7% of all pregnant women in the data set received emergency C-sections, the rule identifies a subclass at 60% risk for needing C-sections.

Although machine learning algorithms are central to the data mining process, it is important to note that the process also involves other important steps, including building and maintaining the database, data formatting and cleansing, data visualization and summarization, the use of human expert knowledge to formulate the inputs to the learning algorithm and evaluate the empirical regularities it discovers, and determining how to deploy the results. Thus, data mining bridges many technical areas, including databases, human-computer interaction, statistical analysis, and machine learning algorithms. My focus here is on the role of machine learning algorithms in the data mining process.

The patient-medical-records application example in Figure 1 represents a prototypical data mining problem in which the data consists of a collection of time-series descriptions; we use the data to learn to predict later events in the series (emergency C-sections) based on earlier events (symptoms before delivery). Although I use a medical example to illustrate these ideas, I could have given an analogous example of learning to predict, say, which bank-loan applicants are at high risk of failing to repay their loans (see Figure 2). As shown in this figure, data in such applications typically consists of time-series descriptions of customer bank balances and other demographic information, rather than medical symptoms.

Other data mining applications include predicting customer purchase behavior, customer retention, and the quality of goods produced by a particular manufacturing line (see Figure 3). All are applications for which data mining has been applied successfully and in which further research promises even more effective techniques.

The State of the Art and Beyond

The field of data mining is at an interesting crossroads; we now have a first generation of machine learning algorithms (such as those for learning decision trees, rules, neural networks, Bayesian networks, and logistic regressions) that have been demonstrated to be of significant value in a variety of real-world data mining applications. Dozens of companies around the world now provide commercial implementations of these algorithms (see www.kdnuggets.com), along with efficient interfaces to commercial databases and well-designed user interfaces. But these first-generation algorithms also have significant limitations. They typically assume the data contains only numeric and symbolic features and no text, image features, or raw sensor data. They assume the data has been carefully collected into a single database with a specific data mining task in mind. Furthermore, today’s algorithms tend to be fully automatic and therefore fail to allow guidance from knowledgeable users at key stages in the search for data regularities.

Given these limitations, the strong commercial interest despite them, and the accelerating university research in machine learning and data mining, we might well expect the next decade to produce an order-of-magnitude advance in the state of the art. Such an advance could be motivated by development of new algorithms that accommodate dramatically more diverse sources and types of data, a broader range of automated steps in the data mining process, and mixed-initiative data mining in which human experts collaborate more closely with the computer to form hypotheses and test them against the data.

To illustrate one important research issue, consider again the problem of predicting the risk of an emergency C-section for pregnant women. One key limitation of current data mining methods is that they cannot utilize the full patient record that is today routinely captured in hospital medical records. This is because hospital records for pregnant women often contain sequences of images (such as the ultrasound images taken during pregnancy), other raw instrument data (such as fetal distress monitors), text (such as the notes made by physicians during periodic checkups during pregnancy), and even speech (such as the recording of phone calls), in addition to the numeric and symbolic features in Figure 1.

Figure 2. Typical data and rules for analyzing credit risk.

Data:
Customer 103, time=t0: Years of credit: 9; Loan balance: $2,400; Income: $52k; Own House: Yes; Other delinquent accts: 2; Max billing cycles late: 3; Repay loan?: ?
Customer 103, time=t1: Years of credit: 9; Loan balance: $3,200; Income: ?; Own House: Yes; Other delinquent accts: 2; Max billing cycles late: 4; Repay loan?: ?
Customer 103, time=tn: Years of credit: 9; Loan balance: $4,500; Income: ?; Own House: Yes; Other delinquent accts: 3; Max billing cycles late: 6; Repay loan?: No

Rules learned from synthesized data:
If Other-Delinquent-Accounts > 2, and Number-Delinquent-Billing-Cycles > 1, Then Repay-Loan? = No
If Other-Delinquent-Accounts = 0, and (Income > $30k) OR (Years-of-Credit > 3), Then Repay-Loan? = Yes

Although our first-generation data mining algorithms work well with the numeric and symbolic features, and although some learning algorithms are available for learning to classify images or to classify text, we still lack effective algorithms for learning from data that is represented by a combination of these various media. As a result, the state of the art in medical outcomes analysis is to ignore the image, text, and raw sensor portions of the medical record, or at best to summarize them in an oversimplified form (such as by labeling complex ultrasound images as simply “normal” or “abnormal”).

However, it is natural to expect that if predictions could be based on the full medical record, we should achieve much greater prediction accuracy. Therefore, a topic of considerable research interest is development of algorithms that can learn regularities in rich, mixed-media data. This issue is important in many data mining applications, ranging from mining historical equipment maintenance records, to mining records at customer call centers, to analyzing MRI data on brain activity during different tasks.
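One simple way to approach mixed-media learning today is to encode each medium as numeric features in one joint space. Below is a minimal sketch, assuming the scikit-learn and SciPy libraries, that joins numeric patient features with vectorized free-text physician notes; all field names and records are hypothetical, and image or sensor channels would need their own encoders.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

numeric = np.array([[23, 0], [31, 1], [39, 1], [27, 0]])   # e.g., age, diabetic
notes = ["routine checkup, no concerns",
         "abnormal ultrasound; monitor closely",
         "malpresentation noted at admission",
         "routine visit"]
y = [0, 1, 1, 0]                                           # 1 = emergency C-section

text_matrix = TfidfVectorizer().fit_transform(notes)       # text becomes numeric features
X = hstack([csr_matrix(numeric), text_matrix])             # one joint feature space
model = LogisticRegression().fit(X, y)
```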

The research issue of learning from mixed-media data is just one of many current research issues in data mining. The left-hand side of Figure 4 lists a number of additional research topics in data mining; the right-hand side indicates a variety of applications for which these research issues are important, including:

Optimizing decisions, rather than predictions. The research goal here is to develop machine learning algorithms that go beyond learning to predict likely outcomes, and learn to suggest preemptive actions that achieve the desired outcome. For example, consider again the birth data set mentioned earlier. Although it is clearly helpful to learn to predict which women suffer a high risk of birth complications, it would be even more useful to learn which preemptive actions might help reduce this risk. Similarly, in modeling bank customers, it is one thing to predict which of them might close their accounts and move to new banks; even more useful would be to learn which actions might help retain them before they leave.

This problem of learning which actions achieve a desired outcome, given only previously acquired data, is much more subtle than it may first appear. The difficult issue is that the available data often represents a biased sample that does not correctly represent the underlying causes and effects. For example, whereas the data may show that mothers giving birth at home suffer fewer complications than those giving birth in a hospital, one cannot necessarily conclude that sending a woman home reduces her risk of complications.

The observed regularity might instead be due to the fact that a disproportionate number of high-risk women choose to give birth in a hospital. (A toy simulation following Figure 3 makes this effect concrete.) Thus, the problem of learning to choose actions raises important and basic questions, such as: How can the system learn from biased samples of data? How can the system incorporate conjectures by human experts about the effectiveness of various intervention actions? If successful, this research will allow the application of historical data much more directly to the questions of greatest concern to decision makers.

Figure 3. More data mining problems.

Customer purchase behavior:
Customer 103, time=t0: Sex: M; Age: 53; Income: $50,000; Own House: Yes; MS Products: Word; Computer: 386 PC; Purchase Excel?: ?
Customer 103, time=t1: Sex: M; Age: 53; Income: $50,000; Own House: Yes; MS Products: Word; Computer: Pentium; Purchase Excel?: ?
Customer 103, time=tn: Sex: M; Age: 53; Income: $50,000; Own House: Yes; MS Products: Word; Computer: Pentium; Purchase Excel?: Yes

Customer retention:
Customer 103, time=t0: Sex: M; Age: 53; Income: $50,000; Own House: Yes; Checking: $5,000; Savings: $15,000; Current customer?: Yes
Customer 103, time=t1: Sex: M; Age: 53; Income: $50,000; Own House: Yes; Checking: $20,000; Savings: $0; Current customer?: Yes
Customer 103, time=tn: Sex: M; Age: 53; Income: $50,000; Own House: Yes; Checking: $0; Savings: $0; Current customer?: No

Process optimization:
Product72, time=t0: Stage: mix; Mixing-speed: 60rpm; Viscosity: 1.3; Fat content: 15%; Density: 2.8; Spectral peak: 2800; Product underweight?: ?
Product72, time=t1: Stage: cook; Temperature: 325°; Viscosity: 3.2; Fat content: 12%; Density: 1.1; Spectral peak: 3200; Product underweight?: ?
Product72, time=tn: Stage: cool; Fan-speed: medium; Viscosity: 1.3; Fat content: 12%; Density: 1.2; Spectral peak: 3100; Product underweight?: Yes
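Returning to the home-birth example above, the small self-contained simulation below (all rates are hypothetical) shows how such a biased sample arises: hospital delivery halves every mother’s complication risk here, yet the raw historical comparison makes home births look safer, because high-risk mothers disproportionately choose the hospital.

```python
import random

random.seed(0)
records = []
for _ in range(100_000):
    high_risk = random.random() < 0.3      # 30% of mothers are high-risk
    hospital = random.random() < (0.9 if high_risk else 0.4)  # high-risk prefer hospital
    base = 0.30 if high_risk else 0.05     # risk driven by risk status...
    complication = random.random() < base * (0.5 if hospital else 1.0)  # ...halved by hospital care
    records.append((hospital, complication))

def complication_rate(in_hospital):
    group = [c for h, c in records if h == in_hospital]
    return sum(group) / len(group)

print(f"hospital: {complication_rate(True):.3f}, home: {complication_rate(False):.3f}")
# Roughly 0.086 vs. 0.067: the hospital group shows the higher raw rate
# even though hospital care helps every mother, purely from the biased sample.
```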

Scaling to extremely large data sets. Whereas most learning algorithms perform acceptably on data sets with tens of thousands of training examples, many important data sets are significantly larger. For example, large retail customer databases and Hubble telescope data can easily involve a terabyte or more. To provide reasonably efficient data mining methods for such large data sets requires additional research. Research during the past few years has already produced more efficient algorithms for such problems as learning association rules [1] and efficient visualization of large data sets [6]. Further research in this direction might well lead to even closer integration of machine learning algorithms into database management systems.
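As a concrete, if toy, illustration of the association-rule mining cited above [1], this sketch shows the support-counting step on a handful of market baskets; large-scale miners organize exactly this kind of counting to minimize passes over terabyte-scale databases. The baskets and threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

baskets = [{"milk", "bread"}, {"milk", "diapers"},
           {"milk", "bread", "diapers"}, {"bread", "diapers"}]
min_support = 2                                  # keep itemsets seen in >= 2 baskets

counts = Counter()
for basket in baskets:                           # one scan of the data
    for size in (1, 2):
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1

frequent = {itemset: n for itemset, n in counts.items() if n >= min_support}
print(frequent)   # e.g., ('bread', 'milk') has support 2 of 4 baskets
```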

Active experimentation. Most current data mining systems passively accept a predetermined data set. We need new computer methods that actively generate optimal experiments to obtain additional useful information. For example, when modeling a manufacturing process, it is relatively easy to capture data while the process runs under normal conditions. But this data may lack information about how the process performs under important nonstandard, unpredictable conditions. We need algorithms that propose optimal experiments to collect the most informative data, taking precisely into account the expected benefits, as well as the risks, of the experiment.
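One simple experiment-selection heuristic is uncertainty sampling: among candidate process settings, run the one the current model is least sure about. The sketch below is our illustrative stand-in, not a method from the article; weighing expected benefits against risks, as called for above, would require a richer score. All names and data are hypothetical, and the scikit-learn library is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# settings observed under normal operation: (temperature, mixing speed)
X_seen = np.array([[325, 60], [330, 55], [320, 65], [328, 58]])
y_seen = [0, 0, 1, 0]                      # 1 = product came out underweight

model = LogisticRegression().fit(X_seen, y_seen)

candidates = np.array([[300, 80], [326, 59], [350, 40]])   # experiments we could run
p = model.predict_proba(candidates)[:, 1]
next_run = candidates[np.abs(p - 0.5).argmin()]  # setting the model is least sure about
print("most informative experiment:", next_run)
```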

Learning from multiple databases and the Web. The volume and diversity of data available over the Internet and corporate intranets is large and growing rapidly. Therefore, future data mining methods should be able to use this huge variety of data sources to expand their access to data and to learn useful regularities. For example, one large U.S. equipment manufacturer uses data mining to construct models of the interests and maintenance needs of its corporate customers. In this application, the company mines a database consisting primarily of records of past purchases and the servicing needs of its various customers, with only a few features describing the type of business each customer performs.

Figure 4. Research on basic scientific issues (left) will influence data mining applications in many areas (right).

Scientific Issues, Basic Technologies:
• Learning from mixed-media data, such as numeric, text, image, voice, sensor
• Active experimentation, exploration
• Optimizing decisions, rather than predictions
• Inventing new features to improve accuracy
• Learning from multiple databases and the Web

Applications:
• Medicine
• Manufacturing
• Financial
• Intelligence analysis
• Public policy
• Marketing

But as it turns out, nearly all of these customers have public Web sites that provide considerable information about their current and planned activities. Significant improvement could be expected in the manufacturer’s data mining of strategic information if the data mining algorithms combined this Web-accessible information with the information available in the manufacturer’s own internal database. To achieve this result, however, we need to develop new algorithms that can successfully extract information from Web hypertext. If successful, this line of research could yield several orders of magnitude increase in the variety and currency of data accessible to many data mining applications.
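A deliberately crude sketch of extracting features from Web hypertext, using only Python’s standard library, appears below. The page text and keyword indicators are hypothetical; real information extraction from hypertext is far harder, which is exactly the research problem noted above.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a page, discarding markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

page = ("<html><body><h1>Acme Mining Corp</h1>"
        "<p>We are expanding our open-pit operations next year.</p></body></html>")
extractor = TextExtractor()
extractor.feed(page)
text = " ".join(extractor.chunks).lower()

# crude indicator features to join with the internal purchase database
features = {kw: kw in text for kw in ("expanding", "mining", "bankruptcy")}
print(features)   # {'expanding': True, 'mining': True, 'bankruptcy': False}
```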

Inventing new features to improve prediction accuracy. In many cases, the accuracy of predictions can be improved by inventing a more appropriate set of features to describe the available data. For example, consider the problem of detecting the imminent failure of a piece of equipment based on the time series of sensor data collected from the equipment. Millions of features describing this time series can be generated easily by taking differences, sums, ratios, and averages of primitive sensor readings, along with previously defined features. Given a sufficiently large and long-duration data set, it should be feasible to automatically explore this large space of possible defined features to identify the small fraction of them most useful for future learning. This work could lead to increased accuracy in many prediction problems, such as equipment failure, customer attrition, credit repayment, and medical outcomes.
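The sketch below mechanically generates a few derived features (a difference, a ratio, windowed averages) from a raw sensor series and ranks them by a crude usefulness score, absolute correlation with the failure label. All names, data, and the scoring rule are hypothetical stand-ins for the much larger feature search described above.

```python
import numpy as np

def invent_features(series):
    """Mechanically derive candidate features from one sensor time series."""
    s = np.asarray(series, dtype=float)
    return {
        "mean": s.mean(),
        "last_minus_first": s[-1] - s[0],            # a difference feature
        "max_over_mean": s.max() / s.mean(),         # a ratio feature
        "mean_abs_step": np.abs(np.diff(s)).mean(),  # average change per reading
    }

# toy runs of vibration readings; label 1 = equipment failed soon after
runs = [([1, 1, 2, 9], 1), ([1, 1, 1, 1], 0), ([2, 2, 3, 8], 1), ([2, 2, 2, 3], 0)]
rows = [invent_features(s) for s, _ in runs]
labels = np.array([y for _, y in runs], dtype=float)

for name in rows[0]:
    column = np.array([r[name] for r in rows])
    usefulness = abs(np.corrcoef(column, labels)[0, 1])   # crude relevance score
    print(f"{name:>16}: {usefulness:.2f}")
```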

Active research takes many other directions, including how to provide more useful data visualization tools, how to support mixed-initiative human-machine exploration of large data sets, and how to reduce the effort needed for data warehousing and for combining information from different legacy databases. Still, the interesting fact is that even current first-generation approaches to data mining are being put to routine use by many organizations, producing important gains in many applications.

We might speculate that progress in data mining over the next decade will be driven by three mutually reinforcing trends:

• Development of new machine learning algorithms that learn more accurately, utilize data from dramatically more diverse data sources available over the Internet and intranets, and incorporate more human input as they work;

• Integration of these algorithms into standard database management systems; and

• An increasing awareness of data mining technology within many organizations and an attendant increase in efforts to capture, warehouse, and utilize historical data to support evidence-based decision making.

We can also expect more universities to react to the severe shortage of trained experts in this area by creating new academic programs for students wanting to specialize in data mining. Among the universities recently announcing graduate degree programs in data mining, machine learning, and computational statistics are Carnegie-Mellon University (see www.cs.cmu.edu/~cald), the University of California, Irvine (www.ics.uci.edu/~gcounsel/masterreqs.html), George Mason University (vanish.science.gmu.edu), and the University of Bristol (www.cs.bris.ac.uk/Teaching/Machine-Learning).

References
1. Agrawal, R., Imielinski, T., and Swami, A. Database mining: A performance perspective. IEEE Trans. on Knowl. Data Eng. 5, 6 (1993), 914–925.
2. Chauvin, Y. and Rumelhart, D. Backpropagation: Theory, Architectures, and Applications. Lawrence Erlbaum Associates, Hillsdale, N.J., 1995.
3. Clark, P. and Niblett, T. The CN2 induction algorithm. Mach. Learn. 3, 4 (Mar. 1989), 261–284.
4. Gray, J., Bosworth, A., Layman, A., and Pirahesh, H. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Microsoft Tech. Rep. MSR-TR-95-22, Redmond, Wash., 1995.
5. Heckerman, D., Geiger, D., and Chickering, D. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. 20, 3 (Sept. 1995), 197–243.
6. Faloutsos, C. and Lin, K. FastMap: A fast algorithm for indexing, data mining, and visualization. In Proceedings of ACM SIGMOD (1995), 163–174.
7. Mitchell, T. Machine Learning. McGraw-Hill, New York, 1997.
8. Muggleton, S. Foundations of Inductive Logic Programming. Prentice Hall, Englewood Cliffs, N.J., 1995.
9. Quinlan, J. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, Calif., 1993.

Tom M. Mitchell ([email protected]) is the Fredkin Professor of AI and Learning and director of the Center for Automated Learning and Discovery in the School of Computer Science at Carnegie-Mellon University in Pittsburgh.

This research is supported in part by DARPA under contract F30602-97-1-0215 and by contributions from the Corporate Members of the Center for Automated Learning and Discovery in the School of Computer Science at Carnegie-Mellon University.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

© 1999 ACM 0002-0782/99/1100 $5.00
