Data Mining in Medicine


Data Mining in Medicine: Selected Techniques and Applications

Copyright, 2002 © webAI Group, www.datamining.ro

Author Adrian Giurca




Overview

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.



The nature of Medical Data

The rapidly emerging globalization of data requires standards in terminology, vocabularies and formats to support data sharing, standards for interfaces between different sources of data and for the integration of heterogeneous data (including images), and standards in the design of electronic patient records.




Many environments still lack such standards, which hinders the use of data analysis tools on large global databases, limiting their application to datasets collected for specific diagnostic, screening, prognostic, monitoring, therapy support or other patient management purposes.




Patient records collected for diagnosis and prognosis typically encompass values of anamnestic, clinical and laboratory parameters, as well as results of particular investigations specific to the given task.




Such datasets are characterized by:

incompleteness (missing parameter values),

incorrectness (systematic or random noise in the data),

sparseness (few and/or non-representative patient records available),

inexactness (inappropriate selection of parameters for the given task).



Datasets collected in monitoring (either acute monitoring of a particular patient in an intensive care unit, or discrete monitoring over long periods of time in the case of patients with chronic diseases) have additional characteristics: they involve measurements of a set of parameters at different times, which requires the temporal component to be taken into account in data analysis.



Selected Medical Data Mining Techniques

Current trends in medical decision making show awareness of the need to introduce formal reasoning, as well as intelligent data analysis techniques, in the extraction of knowledge, regularities, trends and representative cases from patient data stored in medical records.




Formal techniques include:

decision theory

symbolic reasoning technology

methods at their intersection, such as probabilistic belief networks




Intelligent data analysis techniques include:

machine learning

clustering

data visualization

interpretation of time-ordered data (derivation and revision of temporal trends and other forms of temporal data abstraction)




Machine learning methods can be classified into three major groups:

inductive learning of symbolic rules (such as induction of rules, decision trees and logic programs)

statistical or pattern-recognition methods (such as k-nearest neighbors or instance-based learning, discriminant analysis and Bayesian classifiers)

artificial neural networks (such as networks with backpropagation learning, Kohonen's self-organizing network and Hopfield's associative memory)




Machine learning methods have been applied to a variety of medical domains in order to improve medical decision making. These include diagnostic and prognostic problems in oncology, liver pathology, neuropsychology and gynaecology. Improved medical diagnosis and prognosis may be achieved through automatic analysis of patient data stored in medical records, i.e. by learning from past experience.




Given patient records with corresponding diagnoses, machine learning methods are able to diagnose new cases. More specifically, suppose E is a set of examples with known classifications.

An example is described by the values of a fixed collection of features (attributes) $A_i$, $i = 1, \ldots, N_{at}$. Each attribute can either have a finite set of values (discrete) or take real numbers as values (continuous).

An individual example $e_j$, $j = 1, \ldots, N_{ex}$, is an n-tuple of values $v_{ik}$ of the attributes $A_i$. Each example is assigned one of $N_{cl}$ possible values of the class variable C (classifications): $c_i$, $i = 1, \ldots, N_{cl}$.




For example, in the domain of early diagnosis of rheumatic diseases, the patient records comprise 16 anamnestic attributes. Some of these are continuous (age, duration of morning stiffness) and some are discrete (e.g. joint pain, which can be arthrotic, arthritic, or not present at all). There are eight possible diagnoses, including:

– degenerative spine diseases

– inflammatory spine diseases

– other inflammatory diseases

– extraarticular rheumatism

– crystal-induced synovitis

– non-specific rheumatic manifestations

– non-rheumatic diseases




To classify (diagnose) new cases, machine learning methods can take different approaches:

– They can construct explicit symbolic rules that generalize the training cases (rule induction and decision tree induction). The induced rules or decision trees can then be used to classify new cases.

– They can store (some of) the training cases for reference (instance-based learning). New cases can then be classified by comparing them to the reference cases.

– They can compute, for a given case to be classified, the conditional probability of each class according to the Bayesian formula and assign the most probable class to the case.



How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

• Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

• Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

• Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.

• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.



Five major elements:

• Extract, transform, and load transaction data onto the data warehouse system.

• Store and manage the data in a multidimensional database system.

• Provide data access to business analysts and information technology professionals.

• Analyze the data with application software.

• Present the data in a useful format, such as a graph or table.



Different levels of analysis

• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID.

• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.

• Rule induction: The extraction of useful if-then rules from data based on statistical significance.

• Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
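The nearest neighbor method described in the list above can be illustrated with a minimal sketch. The function name `knn_classify`, the use of Euclidean distance, majority voting as the "combination of classes", and the patient-like records are all assumptions made for illustration; none of them come from the original slides.

```python
from collections import Counter
import math


def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k nearest historical records.

    `history` is a list of (feature_vector, class_label) pairs.
    Euclidean distance is an assumption; other similarity measures are possible.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort historical records by distance to the query and keep the k closest.
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    # Combine the classes of the neighbours by simple majority vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical records: (age, morning stiffness in minutes) -> diagnosis group
history = [
    ((35, 5), "non-rheumatic"),
    ((42, 10), "non-rheumatic"),
    ((58, 60), "inflammatory"),
    ((63, 90), "inflammatory"),
    ((61, 75), "inflammatory"),
]

prediction = knn_classify(history, (60, 80), k=3)
print(prediction)
```

In practice the attributes would usually be rescaled first, since an attribute with a large numeric range otherwise dominates the distance.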



What technological infrastructure is required?

Today, data mining applications are available on all size systems for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. There are two critical technological drivers:

• Size of the database: the more data being processed and maintained, the more powerful the system required.

• Query complexity: the more complex the queries and the greater the number of queries being processed, the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications of less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude improvements in query time. For example, MPP systems from NCR link hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the largest supercomputers.



Software Design

Algorithms



Decision Tree (I)

The Decision Tree exploration engine helps solve the task of classifying cases into multiple categories. Decision Tree is the fastest algorithm when dealing with large numbers of attributes. The Decision Tree report provides an easily interpreted decision tree diagram and a predicted versus real table.

Problems to Solve:

– Classification of cases into multiple categories

Target Attributes:

– Categorical or Boolean (Yes/No) attribute

Output Format:

– Classification statistics

– Predicted versus Real table (confusion matrix)

– Decision Tree diagram

Optimal Number of Records:

– Minimum of 100 records

– Maximum of 5,000,000 records



Decision Tree (II)

Preprocessing Suggested: Summary Statistics, to deselect attributes that contain too many values to provide any useful insight to the exploration engine.

Underlying Algorithms: Information Gain splitting criteria, Shannon information theory and statistical significance tests.

The Data Used: Decision Tree works on data of any type. The DT algorithm is well suited to analyzing very large databases because it does not require loading all the data into main memory simultaneously. The software takes full advantage of this feature by implementing incremental DT learning with the help of the OLE DB for Data Mining mechanism. The DT algorithm calculation time scales very well (grows only linearly) with an increasing number of data columns. At the same time, it grows more than linearly with the number of data records: as N*log(N), where N is the number of records.

Problems to Solve: The Decision Tree algorithm helps solve the task of classifying cases into multiple categories. In many cases, this is the fastest, as well as the most easily interpreted, machine learning algorithm. The DT algorithm provides intuitive rules for solving a great variety of classification tasks, ranging from predicting buyers/non-buyers in database marketing, to automatically diagnosing patients in medicine, to determining customer attrition causes in banking and insurance.

Target Attribute: The target attribute of a Decision Tree exploration must be of a Boolean (yes/no) or categorical data type.

When to Use This Algorithm: The Decision Tree exploration engine is used for tasks such as classifying records or predicting outcomes. You should use decision trees when your goal is to assign your records to a few broad categories. Decision Trees provide easily understood rules that can help you identify the best fields for further exploration.

The Output: The Decision Tree report starts off by giving measures resulting from the decision tree. These measures are the number of non-terminal nodes, the number of leaves, and the depth of the constructed tree. Next, the report provides classification statistics on the decision tree. After these measures, the predicted versus real table is shown.
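The information gain criterion named under "Underlying Algorithms" can be sketched as follows. This is only an illustration of the standard Shannon-entropy definition, not the engine's actual implementation; the function names and the four-record dataset are hypothetical.

```python
from collections import Counter
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        # Weight each branch by the fraction of records that fall into it.
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical records, invented for illustration only.
rows = [
    {"smoker": "yes", "cough": "yes", "ill": "yes"},
    {"smoker": "yes", "cough": "no", "ill": "yes"},
    {"smoker": "no", "cough": "yes", "ill": "no"},
    {"smoker": "no", "cough": "no", "ill": "no"},
]

gain_smoker = information_gain(rows, "smoker", "ill")
gain_cough = information_gain(rows, "cough", "ill")
print(gain_smoker, gain_cough)
```

A tree builder would choose the attribute with the highest gain at each node; here `smoker` separates the classes perfectly while `cough` carries no information.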



Cluster Analysis

The Cluster engine is used for the automated detection of clusters of records that lie close to each other, in a certain sense, in the space of all variables. Such clusters may represent different situations or target groups, which one might find beneficial to study separately. The Cluster engine places records corresponding to different clusters in separate datasets for further analysis. Cluster analysis proves to be useful for applications ranging from database marketing to quality control.

The use of all attributes makes the Cluster algorithm very useful for beginning data mining: it is an undirected method and does not require the selection of a target attribute.
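The slides do not say which clustering algorithm the Cluster engine uses. As one common way to group records that lie close together, here is a minimal k-means sketch; the choice of k-means, the naive initialization from the first k points, and the sample points are all assumptions for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Very small k-means: returns (clusters, centroids).

    Uses the first k points as initial centroids (a naive, deterministic
    choice made only to keep this sketch reproducible).
    """
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return clusters, centroids


# Hypothetical two-dimensional records.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
clusters, centroids = kmeans(points, k=2)
for members in clusters:
    print(members)
```

Each returned cluster is a separate list, which mirrors the idea of placing records from different clusters in separate datasets for further analysis. No target attribute is needed.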



Fuzzy Logic Classification

The algorithm is used for assigning cases to different classes. As output, this exploration engine not only produces the prediction of the class to which the case belongs, but also provides the symbolic classification rule that was generalized automatically from the training examples. The classifier engine furnishes simpler and more reliable results than systems based purely on decision trees. The prediction accuracy obtained for the testing cases is comparable to the accuracy obtained for the training cases. Again, the statistical significance of the generalized rule is determined rigorously by the classifier engine. Note that the classifier engine can utilize either the SKAT, MLR or neural network prediction method as its driving mechanism.



Linear Regression

The Stepwise Linear Regression algorithm is, to our knowledge, the only system capable of including categorical variables, in addition to numerical and logical variables, in the regression analysis.

MLR discovers linear relations in data, automatically selecting only those independent variables which influence the target variable most. It also pinpoints redundant, mutually correlated independent variables, and includes only their minimal subset in the results.

Linear Regression is based on a very quick and robust calculation algorithm. As with all other exploration engines, a rigorous determination of the significance of the obtained results is performed for each model considered. MLR is the fastest exploration engine and thus can be used as a complementary preprocessing module for the SKAT exploration engine.



Symbolic Knowledge Acquisition Technology (SKAT)

Data mining is one of the most promising modern information technologies. The corporate world has learned to derive new value from data by utilizing various intelligent tools and algorithms designed for the automated discovery of non-trivial, useful, and previously unknown knowledge in raw data.

Which factors influence the future variation of the price of some security shares?

What characteristics of a potential customer of some service make him/her the most probable buyer?

These and numerous other business questions can be successfully addressed by data mining.

The majority of available data mining tools are based on a few well-established technologies for data analysis. Different knowledge discovery methods are best suited to different applications. Among the useful knowledge presentation tasks one can name dependency detection, numerical prediction, explicit relation modeling, and classification rules.

Despite the usefulness of traditional data mining methods in various situations, we choose to concentrate here first on the problems that plague these methods. Then we discuss the solutions to these problems, which become available with the advent of SKAT, a next-generation data mining technology. We outline the reasons, foundations, and commercial implementations of this emerging approach.



Among the various tasks a data mining system is asked to perform, two questions are encountered most frequently:

– Which database fields influence the selected target field?

– Precisely how does the target field depend on other fields in the database?

While there are many successful methods designed to answer the first question, it is far more difficult to answer the second. Why does this happen? Simply, an observation that across a number of cases with close values of all parameters except some parameter X, the target parameter Y varies considerably, implies that Y depends on X. For multi-dimensional dependencies the issue becomes less straightforward, but the basic idea for solving the problem is the same. At the same time, the task of automatically determining an explicit form of the dependence between several variables is significantly more difficult. The solution to this problem cannot be based on similarly simple-minded considerations.



Traditional methods for finding the precise form of a sought relation search for an expression representing the dependence among possible expressions from some fixed class. This idea is exploited in many existing data mining applications. For example, linear regression, one of the most straightforward and popular methods of searching for simple numerical dependencies, selects a solution out of a set of linear formulae approximating the sought dependence. Systems from another popular class of data mining algorithms, decision trees, search for classification rules represented as trees involving simple equalities and inequalities in the nodes, connected by Boolean AND and OR operations.



However, beyond the limits of the narrow classes of dependencies that can be found by these systems, there is an endless sea of dependencies which cannot even be represented in the language used by these systems. For example, assume you are using a decision tree system to analyze data holding the following simple rule: "the most frequent buyers of Post cereal are homemakers of age smaller than the inverse square of their family income multiplied by a certain constant". A traditional system has no means to discover such a rule. Only if one explicitly furnishes the system with the parameter "inverse square of the family income" can the stated rule be found by traditional systems. In other words, one has to guess an approximate form of the solution first, and then the machine does the rest of the job efficiently. While guessing a general form of the solution prior to automated modeling might be a challenging brain twister, it certainly does not make the life of a corporate data analyst much easier.






Case Study: Bayesian Classification



Bayesian Classification: Why?

Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.

Incremental: Each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities.

Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.



Bayesian Theorem: Basics

Let X be a data sample whose class label is unknown.

Let H be a hypothesis that X belongs to class C.

For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.

P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data; it reflects the background knowledge).

P(X): probability that the sample data is observed.

P(X|H): probability of observing the sample X, given that the hypothesis holds.



Bayesian Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$

Informally, this can be written as posterior = likelihood x prior / evidence.

MAP (maximum a posteriori) hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\, P(h)$$

Practical difficulty: requires initial knowledge of many probabilities, and has a significant computational cost.
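As a small numeric illustration of Bayes' theorem and of the MAP choice, the sketch below computes posteriors for two competing hypotheses. The hypothesis names and all probabilities are invented for illustration and do not come from the slides.

```python
def posterior(prior, likelihood):
    """Return P(h | D) for every hypothesis h using Bayes' theorem.

    `prior[h]` is P(h) and `likelihood[h]` is P(D | h).
    """
    # Evidence P(D): total probability of the data over all hypotheses.
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}


# Hypothetical numbers: a rare condition and an imperfect test that came back positive.
prior = {"disease": 0.01, "healthy": 0.99}
likelihood = {"disease": 0.95, "healthy": 0.05}  # P(positive test | h)

post = posterior(prior, likelihood)
h_map = max(post, key=post.get)  # maximum a posteriori hypothesis
print(post, h_map)
```

With these numbers the MAP hypothesis remains "healthy" despite the positive test, because the small prior outweighs the high likelihood. Note that the evidence term does not affect which hypothesis wins, which is why it drops out of the MAP formula.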



Naïve Bayesian Classifier

Each data sample X is represented as a vector $X = (x_1, x_2, \ldots, x_n)$.

There are m classes $C_1, C_2, \ldots, C_m$.

Given an unknown data sample X, the classifier will predict that X belongs to class $C_i$ iff

$$P(C_i \mid X) > P(C_j \mid X) \quad \text{for all } 1 \le j \le m,\ j \ne i$$

By Bayes' theorem, $P(C_i \mid X) = P(X \mid C_i)\, P(C_i) / P(X)$.



Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

The probability of the joint occurrence of, say, two elements $y_1$ and $y_2$, given that the current class is C, is the product of the probabilities of each element taken separately, given the same class: $P([y_1, y_2] \mid C) = P(y_1 \mid C) \cdot P(y_2 \mid C)$.

No dependence relation between attributes.

Greatly reduces the computation cost: only the class distribution needs to be counted.

Once the probability $P(X \mid C_i)$ is known, assign X to the class with maximum $P(X \mid C_i) \cdot P(C_i)$.



Training dataset

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

Class:

C1: buys_computer = yes

C2: buys_computer = no

Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)



Naïve Bayesian Classifier: Example

Compute P(X|Ci) for each class:

P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|Ci):
P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) * P(Ci):
P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007

X belongs to class "buys_computer = yes".
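The worked example above can be reproduced with a short sketch that counts frequencies in the training dataset and multiplies them under the independence assumption. The function names `train` and `score` are illustrative, and no smoothing is applied, so a value never seen with a class yields probability zero.

```python
from collections import Counter, defaultdict

# Training dataset from the slides:
# (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def train(rows):
    """Count class frequencies and (attribute index, value, class) frequencies."""
    class_counts = Counter(r[-1] for r in rows)
    cond_counts = defaultdict(int)
    for r in rows:
        for i, value in enumerate(r[:-1]):
            cond_counts[(i, value, r[-1])] += 1
    return class_counts, cond_counts


def score(sample, class_counts, cond_counts):
    """Return P(X | C) * P(C) for each class C (unnormalized posterior)."""
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        p = n / total  # prior P(C)
        for i, value in enumerate(sample):
            p *= cond_counts[(i, value, cls)] / n  # P(x_k | C)
        scores[cls] = p
    return scores


class_counts, cond_counts = train(ROWS)
X = ("<=30", "medium", "yes", "fair")
scores = score(X, class_counts, cond_counts)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The two scores agree with the rounded values 0.028 and 0.007 in the example, and the predicted class is "yes".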



Naïve Bayesian Classifier: Comments

Advantages:

– Easy to implement

– Good results obtained in most cases

Disadvantages:

– Assumption: class conditional independence, therefore loss of accuracy

– In practice, dependencies exist among variables

– E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)

– Dependencies among these cannot be modeled by a naïve Bayesian classifier

How to deal with these dependencies?

– Bayesian belief networks



Naïve Bayesian Classifier: Example II

Given a training set, we can compute the probabilities:

Outlook        P      N
sunny          2/9    3/5
overcast       4/9    0
rain           3/9    2/5

Temperature    P      N
hot            2/9    2/5
mild           4/9    2/5
cool           3/9    1/5

Humidity       P      N
high           3/9    4/5
normal         6/9    1/5

Windy          P      N
true           3/9    3/5
false          6/9    2/5
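A sketch of how the table above could be used to classify a new case follows. Two things are assumptions not stated on the slide: the class priors P(P) = 9/14 and P(N) = 5/14, inferred from the denominators 9 and 5, and the sample being classified, which is hypothetical.

```python
from fractions import Fraction as F

# Conditional probabilities P(value | class), copied from the table above.
COND = {
    "outlook": {"sunny": (F(2, 9), F(3, 5)),
                "overcast": (F(4, 9), F(0)),
                "rain": (F(3, 9), F(2, 5))},
    "temperature": {"hot": (F(2, 9), F(2, 5)),
                    "mild": (F(4, 9), F(2, 5)),
                    "cool": (F(3, 9), F(1, 5))},
    "humidity": {"high": (F(3, 9), F(4, 5)),
                 "normal": (F(6, 9), F(1, 5))},
    "windy": {"true": (F(3, 9), F(3, 5)),
              "false": (F(6, 9), F(2, 5))},
}

# ASSUMPTION: priors inferred from the denominators (9 cases of P, 5 of N).
PRIOR = {"P": F(9, 14), "N": F(5, 14)}


def score(sample):
    """Return P(X | C) * P(C) for classes P and N as exact fractions."""
    scores = dict(PRIOR)
    for attribute, value in sample.items():
        p_given_p, p_given_n = COND[attribute][value]
        scores["P"] *= p_given_p
        scores["N"] *= p_given_n
    return scores


# Hypothetical unseen sample.
sample = {"outlook": "rain", "temperature": "hot",
          "humidity": "high", "windy": "false"}
scores = score(sample)
prediction = max(scores, key=scores.get)
print({c: float(s) for c, s in scores.items()}, prediction)
```

Under these assumptions the sample is assigned to class N. The zero entry for overcast under N shows why unsmoothed estimates can be harsh: any overcast sample gets a score of exactly zero for N regardless of its other attributes.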



Bayesian Networks

A Bayesian belief network allows a subset of the variables to be conditionally independent.

A graphical model of causal relationships:

– Represents dependency among the variables

– Gives a specification of the joint probability distribution

[Figure: directed graph with nodes X, Y, Z and P]

Nodes: random variables

Links: dependency

X and Y are the parents of Z

Y is the parent of P

No dependency between Z and P

Has no loops or cycles



Bayesian Belief Network: An Example

[Figure: network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay and Dyspnea]

The conditional probability table for the variable LungCancer shows the conditional probability for each possible combination of its parents:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

Bayesian Belief Networks

$$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))$$
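The factorization above can be illustrated on the LungCancer fragment of the example network. The conditional probability table is taken from the slides; the prior probabilities for FamilyHistory and Smoker are hypothetical values chosen only so that the sketch runs, and the two variables are assumed to have no parents.

```python
from itertools import product

# P(LungCancer = True | FamilyHistory, Smoker), from the table above.
P_LC = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# ASSUMPTION: hypothetical priors, not given in the slides.
P_FH = 0.1  # P(FamilyHistory = True)
P_S = 0.3   # P(Smoker = True)


def bernoulli(p_true, value):
    """Probability of a Boolean value, given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(fh, s, lc):
    """P(FH=fh, S=s, LC=lc) as the product of each node given its parents."""
    return (bernoulli(P_FH, fh)
            * bernoulli(P_S, s)
            * bernoulli(P_LC[(fh, s)], lc))


# Marginal P(LungCancer = True): sum the joint over all parent combinations.
p_lc = sum(joint(fh, s, True) for fh, s in product([True, False], repeat=2))
print(joint(True, True, True), p_lc)
```

Only one small table per node is stored, yet any entry of the full joint distribution can be recovered by multiplication.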
