Chapters 8-9. Classification
- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch. 9)
- Techniques to Improve Classification
- Lazy Learners (ch. 9)
- Other known methods: SVM, ANN (ch. 9)
1
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: training data are labeled, indicating classes
  - New instances are classified based on the training set
- Unsupervised learning (clustering)
  - Class labels are unknown
  - Given a set of objects, establish the existence of classes or clusters in the data
2
Prediction: Classification vs. Numeric Prediction
- Classification
  - Predicts categorical class labels
- Numeric prediction
  - Models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit/loan approval
  - Medical diagnosis
  - Fraud detection
  - Web page categorization
3
4
Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as indicated by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown instances
  - Estimate the accuracy of the model
    - Use a testing set independent of the training set; compare predicted class labels with true class labels
    - Compute accuracy (percentage of correctly classified instances)
  - If the accuracy is acceptable, use the model to classify new data
5
Process 1: Model Construction
Training Data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

Classification Algorithms produce the Classifier (Model), e.g.:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
6
Process 2: Using the Model in Prediction
Classifier
Testing Data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
Issues: Data Preparation
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove the irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
7
Chapters 8-9. Classification
8
9
Decision Tree Induction: An Example
Training data set: buys_computer

age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no

Resulting tree (figure): the root tests age?
- age <= 30: test student? (no -> no, yes -> yes)
- age 31..40: yes
- age > 40: test credit rating? (excellent -> no, fair -> yes)
10
Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down (from general to specific), recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Attributes are selected based on a heuristic or statistical measure (e.g., information gain)
- When to stop
  - All examples for a given node belong to the same class (pure), or
  - No remaining attributes to select from (majority voting determines the class label for the node), or
  - No examples left
A recursive sketch of this procedure is given below.
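A minimal sketch of the greedy top-down procedure in Python, assuming the data set is a list of (feature-dict, label) pairs and that an attribute-selection measure is plugged in; all names are illustrative, not from any particular library.

```python
from collections import Counter

def build_tree(examples, attributes, select_attribute, parent_majority=None):
    """Greedy, recursive, divide-and-conquer decision tree induction (sketch).

    examples: list of (features: dict, label) pairs
    attributes: categorical attribute names still available for splitting
    select_attribute: heuristic (e.g., information gain) returning the best attribute
    """
    if not examples:                          # no examples left: use the parent's majority class
        return parent_majority
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                 # pure node: all examples in one class
        return labels[0]
    if not attributes:                        # no attributes left: majority vote
        return majority

    best = select_attribute(examples, attributes)
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # Partition the examples on the chosen attribute and recurse on each partition
    for value in {feat[best] for feat, _ in examples}:
        subset = [(feat, lab) for feat, lab in examples if feat[best] == value]
        node["branches"][value] = build_tree(subset, remaining, select_attribute, majority)
    return node
```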
Random Tree Induction
- Let a be the number of attributes, and let v be the maximum number of values any attribute can take.
- Upper bound on the number of trees?
- Lower bound on the number of trees?
- Random tree induction
  - Randomly choose an attribute for each split
  - Same stopping criteria
- The design of decision trees has been largely influenced by the preference for simplicity.
11
Occam's Razor
- Occam's Razor: rule of parsimony, principle of economy
  - Plurality should not be assumed without necessity
  - Meaning, one should not increase, beyond what is necessary, the number of entities required to explain anything
- Argument: the simplicity of nature and the rarity of simple theories can be used to justify Occam's Razor.
  - First, nature exhibits regularity, and natural phenomena are more often simple than complex. At least, the phenomena humans choose to study tend to have simple explanations.
  - Second, there are far fewer simple hypotheses than complex ones, so there is only a small chance that any simple hypothesis that is wildly incorrect will be consistent with all observations.
- Occam's two razors: the sharp and the blunt (KDD'98), Pedro Domingos
(William of Ockham, 1288-1348)
12
Attribute Selection Measure: Information Gain (ID3/C4.5)
- How to obtain the smallest (shortest) tree?
  - Careful design of the attribute selection step
  - Quinlan pioneered the use of entropy in his ID3 algorithm
- Entropy: in information theory, also called expected information, is a measure of uncertainty
  - Intuition: chaos, molecular disorder, temperature, thermodynamic systems, the universe
  - High entropy = high disorder
13
14
Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
- Expected information (entropy) needed to classify a tuple in D:

  $Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$

  - Entropy is a measure of uncertainty: larger entropy means larger uncertainty
- Information needed to classify D after using A to split D into v partitions (aggregated entropy):

  $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

- Information gained (entropy dropped) by branching on attribute A:

  $Gain(A) = Info(D) - Info_A(D)$
15
Attribute Selection: Information Gain
- Class P: buys_computer = "yes"; Class N: buys_computer = "no"

  $Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

- For the attribute age (using the buys_computer training table shown earlier):

  age   | p_i | n_i | I(p_i, n_i)
  <=30  | 2   | 3   | 0.971
  31…40 | 4   | 0   | 0
  >40   | 3   | 2   | 0.971

  $Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$

  where $\frac{5}{14}I(2,3)$ means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

  $Gain(age) = Info(D) - Info_{age}(D) = 0.246$

- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048 (a small code sketch of this computation follows)
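A sketch, in Python, of the Info/Gain computation above; it assumes the buys_computer table is loaded as a list of dicts, and the function names are illustrative.

```python
import math
from collections import Counter

def info(labels):
    """Entropy Info(D) = -sum p_i log2 p_i over the class distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr, class_attr="buys_computer"):
    """Gain(A) = Info(D) - sum_j (|D_j|/|D|) * Info(D_j), splitting on attr."""
    labels = [r[class_attr] for r in rows]
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr], []).append(r[class_attr])
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info(labels) - info_a

# e.g. with the 14-tuple buys_computer table loaded as `data`:
# info_gain(data, "age") -> about 0.246, matching the slide.
```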
16
Computing Information-Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    - (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split:
  - D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point (see the sketch below)
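A sketch of the split-point search for one continuous attribute, reusing the info() helper from the previous snippet; names and the data representation are the same illustrative assumptions.

```python
def best_split_point(rows, attr, class_attr="buys_computer"):
    """Try the midpoint of every adjacent pair of sorted attr values and return
    the split point that minimizes the expected information requirement."""
    rows = sorted(rows, key=lambda r: r[attr])
    values = [r[attr] for r in rows]
    labels = [r[class_attr] for r in rows]
    best_point, best_info = None, float("inf")
    for i in range(len(values) - 1):
        if values[i] == values[i + 1]:
            continue                                   # no boundary between equal values
        midpoint = (values[i] + values[i + 1]) / 2
        left = [lab for v, lab in zip(values, labels) if v <= midpoint]
        right = [lab for v, lab in zip(values, labels) if v > midpoint]
        expected = (len(left) * info(left) + len(right) * info(right)) / len(rows)
        if expected < best_info:
            best_point, best_info = midpoint, expected
    return best_point, best_info
```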
17
Gain Ratio for Attribute Selection (C4.5)
- Information gain is biased towards attributes with a large number of values
- C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (a normalization of information gain):

  $SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$

- GainRatio(A) = Gain(A) / SplitInfo_A(D)
- Example: gain_ratio(income) = 0.029 / 1.557 = 0.019
- The attribute with the largest gain ratio is selected (a short code sketch follows)
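The same counts give SplitInfo and the gain ratio; a short sketch under the same assumptions and helpers as the earlier snippets.

```python
def split_info(rows, attr):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) log2(|D_j|/|D|)."""
    counts = Counter(r[attr] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(rows, attr, class_attr="buys_computer"):
    si = split_info(rows, attr)
    return info_gain(rows, attr, class_attr) / si if si > 0 else 0.0
```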
18
Gini Index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the gini index gini(D) is defined as

  $gini(D) = 1 - \sum_{j=1}^{n} p_j^2$

  where p_j is the relative frequency of class j in D
- If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  $gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$

- Reduction in impurity:

  $\Delta gini(A) = gini(D) - gini_A(D)$

- The attribute providing the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all possible splitting points for each attribute)
19
Computation of Gini Index
- Ex.: D has 9 tuples in buys_computer = "yes" and 5 in "no":

  $gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$

- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

  $gini_{income \in \{low, medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2)$

  Gini_{low,high} is 0.458; Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}), since it has the lowest Gini index (see the sketch below).
- All attributes are assumed continuous-valued
  - May need other tools, e.g., clustering, to get the possible split values
  - Can be modified for categorical attributes
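A sketch of the Gini computation for a binary split, matching the formulas above and using the same list-of-dicts assumption as the earlier snippets.

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(rows, attr, value_subset, class_attr="buys_computer"):
    """Gini index of splitting D into D1 (attr value in value_subset) and D2 (the rest)."""
    d1 = [r[class_attr] for r in rows if r[attr] in value_subset]
    d2 = [r[class_attr] for r in rows if r[attr] not in value_subset]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# e.g. gini_split(data, "income", {"low", "medium"}) for the slide's example split.
```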
20
Comparing Attribute Selection Measures
- The three measures, in general, return good results, but:
  - Information gain: biased towards multivalued attributes
  - Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  - Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions
21
Other Attribute Selection Measures
- CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
- C-SEP: performs better than information gain and the gini index in certain cases
- G-statistic: has a close approximation to the χ2 distribution
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  - The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
- Multivariate splits (partition based on combinations of multiple variables)
  - CART: finds multivariate splits based on a linear combination of attributes
- Which attribute selection measure is the best?
  - Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- (Figure: training error in blue, generalization error in red)
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
    - Use a set of data (validation set) different from the training data to decide which is the "best pruned tree"
22
23
Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
24
Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why is decision tree induction popular?
  - Relatively faster learning speed (than other classification methods)
  - Convertible to simple and easy-to-understand classification rules
  - Can use SQL queries for accessing databases
  - Comparable classification accuracy with other methods
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
  - Builds an AVC-list (attribute, value, class label)
25
Scalability Framework for RainForest
- Separates the scalability aspects from the criteria that determine the quality of the tree
- Builds an AVC-list: AVC (Attribute, Value, Class_label)
- AVC-set (of an attribute X)
  - Projection of the training data set onto the attribute X and the class label, where counts of the individual class labels are aggregated
- AVC-group (of a node n)
  - Set of AVC-sets of all predictor attributes at the node n
26
Rainforest: Training Set and Its AVC Sets
Training examples: the 14-tuple buys_computer data set shown earlier.

AVC-set on Age:
Age    | Buy_Computer = yes | Buy_Computer = no
<=30   | 2 | 3
31..40 | 4 | 0
>40    | 3 | 2

AVC-set on income:
income | Buy_Computer = yes | Buy_Computer = no
high   | 2 | 2
medium | 4 | 2
low    | 3 | 1

AVC-set on Student:
student | Buy_Computer = yes | Buy_Computer = no
yes     | 6 | 1
no      | 3 | 4

AVC-set on credit_rating:
Credit rating | Buy_Computer = yes | Buy_Computer = no
fair          | 6 | 2
excellent     | 3 | 3
AVC-set on credit_rating
27
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
- Use a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
- Each subset is used to create a tree, resulting in several trees
- These trees are examined and used to construct a new tree T'
  - It turns out that T' is very close to the tree that would be generated using the whole data set together
- Advantage: requires only two scans of the DB; an incremental algorithm
27
Chapters 8-9. Classification
28
Model Evaluation Metrics: Confusion Matrix
- Given m classes, an entry CM_{i,j} in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
- May have extra rows/columns to provide totals

Confusion matrix:

Actual class \ Predicted class | C1                   | ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)

Example of confusion matrix:

Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             | 6954               | 46                | 7000
buy_computer = no              | 412                | 2588              | 3000
Total                          | 7366               | 2634              | 10000
29
Model Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
- Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
- Error rate: 1 - accuracy, or Error rate = (FP + FN) / All
- Class imbalance problem:
  - One class may be rare, e.g., fraud or HIV-positive
  - Significant majority of the negative class and minority of the positive class
  - Sensitivity: true positive recognition rate (recall for +): Sensitivity = TP / P
  - Specificity: true negative recognition rate (recall for -): Specificity = TN / N

A \ P | C  | ¬C |
C     | TP | FN | P
¬C    | FP | TN | N
      | P' | N' | All
30
Model Evaluation Metrics: Precision and Recall, and F-measures
- Precision (exactness): what % of tuples that the classifier (model) labeled as positive are actually positive?
  Precision = TP / (TP + FP)
- Recall (completeness): what % of positive tuples did the classifier (model) label as positive?
  Recall = TP / (TP + FN)
- The perfect score is 1.0
- There is an inverse relationship between precision and recall
- F measure (F1 or F-score): the harmonic mean of precision and recall
- F_β: a weighted measure of precision and recall
  - Assigns β times as much weight to recall as to precision
31
Model Evaluation Metrics: Example
- Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00% (computed in the sketch below)

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
Total                          | 230          | 9770        | 10000 | 96.40 (accuracy)
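The cancer example reduces to a few ratios over the four confusion-matrix cells; a minimal sketch.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, error rate, sensitivity/recall, specificity, precision and F1
    from the cells of a 2x2 confusion matrix."""
    p, n, total = tp + fn, fp + tn, tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / p                       # = sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
    }

# Slide example: classification_metrics(tp=90, fn=210, fp=140, tn=9560)
# -> precision about 0.3913, recall 0.30
```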
32
Evaluating Learning Algorithm:Holdout & Cross-Validation Methods
- Holdout method
  - The given data is randomly partitioned into two independent sets
    - Training set (e.g., 2/3) for model construction
    - Testing set (e.g., 1/3) for accuracy (or another metric) estimation
  - Random sampling: a variation of holdout
    - Repeat holdout k times; accuracy = average of the accuracies obtained
- Cross-validation (k-fold, where k = 10 is most common; see the sketch below)
  - Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  - At the i-th iteration, use D_i as the testing set and the others as the training set
  - Leave-one-out: k folds where k = the number of tuples, for small-sized data
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
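A minimal k-fold cross-validation loop, assuming the data is a list of (x, y) pairs and that a train(...)/predict(...) pair is supplied; purely illustrative, no particular library.

```python
import random

def cross_validate(data, k, train, predict):
    """Plain k-fold cross-validation: average accuracy over k held-out folds."""
    data = data[:]
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]        # k roughly equal-sized folds
    accuracies = []
    for i in range(k):
        test = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train(training)
        correct = sum(1 for x, y in test if predict(model, x) == y)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```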
33
Evaluating Classifier Accuracy: Bootstrap
- Bootstrap
  - Works well with small data sets
  - Samples the given training tuples uniformly with replacement
    - i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
- There are several bootstrap methods; a common one is the .632 bootstrap
  - A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368)
  - Repeat the sampling procedure k times; the overall accuracy of the model is

    $Acc(M) = \sum_{i=1}^{k} \left(0.632 \cdot Acc(M_i)_{test\_set} + 0.368 \cdot Acc(M_i)_{train\_set}\right)$

    (one round of this procedure is sketched below)
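A sketch of one round of the .632 bootstrap, under the same train/predict and (x, y)-pair assumptions as the cross-validation sketch.

```python
import random

def bootstrap_632_round(data, train, predict):
    """Sample d tuples with replacement for training; unpicked tuples form the test set.
    Returns the 0.632/0.368 weighted accuracy for this round."""
    d = len(data)
    picked = [random.randrange(d) for _ in range(d)]
    picked_set = set(picked)
    train_set = [data[i] for i in picked]
    test_set = [row for i, row in enumerate(data) if i not in picked_set]
    model = train(train_set)
    acc = lambda rows: sum(predict(model, x) == y for x, y in rows) / len(rows)
    return 0.632 * acc(test_set) + 0.368 * acc(train_set)
```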
34
Model Selection: ROC Curves
- ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
- Originated from signal detection theory
- Shows the trade-off between the true positive rate and the false positive rate
- The area under the ROC curve is a measure of the accuracy of the model
- Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
- The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
- The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
- The plot also shows a diagonal line
- A model with perfect accuracy will have an area of 1.0
35
Model Selection Issues
- Accuracy
  - Classifier accuracy: predicting class labels
- Speed
  - Time to construct the model (training time)
  - Time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability
  - Understanding and insight provided by the model
  - Model (e.g., decision tree) size or compactness
36
Chapters 8-9. Classification
37
38
Using IF-THEN Rules for Classification
- Represent knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  - Rule antecedent/precondition vs. rule consequent
- Assessment of a rule: coverage and accuracy
  - n_covers = number of tuples covered by R
  - n_correct = number of tuples correctly classified by R
  - coverage(R) = n_covers / |D|
  - accuracy(R) = n_correct / n_covers
- If more than one rule is triggered, conflict resolution is needed
  - Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirements (i.e., with the most attribute tests)
  - Class-based ordering: decreasing order of prevalence or misclassification cost per class
  - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
39
Rule Extraction from Decision Tree
- A root-to-leaf path corresponds to a rule
- Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
- Rules are exhaustive and mutually exclusive
- Example: rule extraction from our buys_computer decision tree (age? at the root, student? and credit rating? below):
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = no
  IF age = old AND credit_rating = fair THEN buys_computer = yes
40
Rule Induction: Sequential Covering Method
- Sequential covering: extracts rules directly from the training data
- Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
- Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
- Steps:
  - Rules are learned one at a time
  - Each time a rule is learned, the covered positive tuples are removed
  - Repeat until a termination condition is met, e.g., no more training examples, or the quality of a generated rule is below a user-specified threshold
- Unlike decision trees, which learn a set of rules simultaneously
41
Sequential Covering Algorithm
while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

(Figure: positive examples, with nested regions covered by Rule 1, Rule 2, and Rule 3)
42
How to Learn One Rule?
- Start with the most general rule possible: condition = empty
- Add new attributes by adopting a greedy depth-first strategy
  - Pick the one that most improves the rule quality
- Rule-quality measures: consider both coverage and accuracy
  - FOIL-gain (in FOIL & RIPPER): assesses the information gained by extending the condition:

    $FOIL\_Gain = pos' \times \left(\log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg}\right)$

  - Favors rules that have high accuracy and cover many positive tuples
- Rule pruning is based on an independent set of test tuples:

  $FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$

  where pos/neg are the numbers of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R. (Both measures are sketched below.)
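A small helper for the two formulas above; pos/neg are counts before extending the rule with a candidate predicate, pos'/neg' the counts afterwards. Names are illustrative.

```python
import math

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))."""
    if pos_new == 0:
        return float("-inf")            # the extended rule covers no positive tuples
    return pos_new * (math.log2(pos_new / (pos_new + neg_new))
                      - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg); prune R if this rises after pruning."""
    return (pos - neg) / (pos + neg)
```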
43
Learn one rule
- To generate a rule:
    while (true)
        find the best predicate p
        if foil-gain(p) > threshold then add p to the current rule
        else break

(Figure: positive and negative examples; the rule is specialized step by step: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5)
Trees and rules
- Most tree learners: divide and conquer
- Most rule learners: separate and conquer, i.e., sequential covering (AQ, CN2, RIPPER, …)
  - Some conquer without separating (RISE, from Domingos, biased towards complex models): rules are learned simultaneously, instance-based
- Decision space, decision boundary
- Both are interpretable classifiers
- Other uses of rule learning: rule extraction, e.g., from ANNs
44
Separate and conquer vs. set cover
- Set covering problem (minimum set cover): one of the most studied combinatorial optimization problems
  - Given a finite ground set X and S1, S2, …, Sm as subsets of X, find I ⊆ {1, …, m} with ∪_{i ∈ I} S_i = X such that |I| is minimized
  - i.e., select as few subsets as possible from a given family such that every element is covered
  - NP-hard
- Greedy algorithm: iteratively pick the subset that covers the maximum number of uncovered elements
  - Achieves a 1 + ln n approximation ratio, essentially the best possible for a polynomial-time algorithm (under standard complexity assumptions)
- Greedy set cover vs. sequential covering
  - Select one subset (learn one rule) at a time
  - Consider uncovered elements (remove covered examples)
  - Iterate until all elements (examples) are covered
- Other related problems: graph coloring, minimum clique partition
45
Chapters 8-9. Classification
46
47
Bayesian Classification: Why?
- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Probability Model for Classifiers
- Let X = (x1, x2, …, xn) be a data sample ("evidence"); the class label is unknown
- The probability model for a classifier is to determine P(C|X), the probability that X belongs to class C given the observed data sample X
  - Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
48
Bayes’ Theorem
- P(C|X): posterior probability
- P(C): prior, the initial probability
  - E.g., one will buy a computer, regardless of age, income, …
- P(X): probability that the sample X is observed
- P(X|C): likelihood, the probability of observing the sample X given that the hypothesis holds
  - E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
- Informally, this can be written as: posterior = prior × likelihood / evidence

  $P(C|\mathbf{X}) = \frac{P(\mathbf{X}|C)\,P(C)}{P(\mathbf{X})}$
49
Maximizing joint probability
- In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the same value is shared by all classes.
- The numerator is the joint probability:

  $P(C)\,P(\mathbf{X}|C) = P(C, \mathbf{X}) = P(C, X_1, X_2, \ldots, X_n)$
50
Maximizing joint probability
Repeatedly applying the definition of conditional probability:

  $P(C)\,P(X_1, X_2, \ldots, X_n | C) = P(C)\,P(X_1|C)\,P(X_2, \ldots, X_n | C, X_1)$
  $= P(C)\,P(X_1|C)\,P(X_2|C, X_1)\,P(X_3, \ldots, X_n | C, X_1, X_2)$
  $= P(C)\,P(X_1|C)\,P(X_2|C, X_1) \cdots P(X_n|C, X_1, X_2, \ldots, X_{n-1})$
51
Naïve Bayes Classifier: Assuming Conditional Independence
Simplifying assumption: features are conditionally independent of each other given the class:

  $P(X_i | C, X_j) = P(X_i | C)$

Then:

  $P(C, X_1, X_2, \ldots, X_n) = P(C)\,P(X_1|C)\,P(X_2|C, X_1) \cdots P(X_n|C, X_1, \ldots, X_{n-1}) = P(C)\,P(X_1|C)\,P(X_2|C) \cdots P(X_n|C)$

This greatly reduces the computation cost: only the class distribution and per-class attribute value counts are needed.
52
53
Naïve Bayes Classifier
- This greatly reduces the computation cost: only counts of the class distribution are needed
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
- If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

  $g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

  and $P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
54
Naïve Bayes Classifier: Training Dataset
- Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
- Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- Training data: the 14-tuple buys_computer data set shown earlier
Naïve Bayes Classifier: Example
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- P(C): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357
- Compute P(X|C) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- P(X|C): P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
- P(C, X) = P(X|C) × P(C):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
- Therefore, X belongs to class "buys_computer = yes" (a code sketch of this computation follows)
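A sketch of the counting that the example above does by hand, assuming the training tuples are dicts with a buys_computer label; no smoothing is applied here (see the Laplacian correction on the next slide).

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, class_attr):
    priors = Counter(r[class_attr] for r in rows)
    cond = defaultdict(Counter)                       # (class, attribute) -> value counts
    for r in rows:
        for a, v in r.items():
            if a != class_attr:
                cond[(r[class_attr], a)][v] += 1
    return priors, cond, len(rows)

def classify(x, priors, cond, n):
    best_class, best_score = None, -1.0
    for c, c_count in priors.items():
        score = c_count / n                           # P(C)
        for a, v in x.items():
            score *= cond[(c, a)][v] / c_count        # P(x_k | C), zero if value unseen
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# With the 14-tuple table loaded as `data`:
# classify({"age": "<=30", "income": "medium", "student": "yes",
#           "credit_rating": "fair"}, *train_naive_bayes(data, "buys_computer"))
# -> "yes", matching the hand computation (0.028 vs 0.007).
```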
55
56
Avoiding the Zero-Probability Problem
- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

  $P(\mathbf{X}|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$

- Suppose a training set has 1000 tuples for class buys_computer = yes, with 0 for income = low, 990 for income = medium, and 10 for income = high
- Use the Laplacian correction (or Laplacian estimator)
  - Add 1 to each case:
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  - The "corrected" probability estimates are close to their "uncorrected" counterparts
57
Naïve Bayes Classifier: Comments
- Advantages
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages
  - Assumption: class conditional independence, therefore loss of accuracy
  - Practically, dependencies exist among variables
    - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - Dependencies among these cannot be modeled by a naïve Bayes classifier
- How to deal with these dependencies? Bayesian belief networks (Chapter 9)
Chapters 8-9. Classification
58
Bayesian Belief Networks
- A Bayesian belief network relieves the conditional independence assumption of naïve Bayes
- A graphical model of causal relationships
  - Represents dependencies among the variables
  - Gives a specification of the joint probability distribution

(Figure: a small network over nodes X, Y, Z, P)
- Nodes: random variables
- Links: dependency
- X and Y are the parents of Z, and Y is the parent of P
- No dependency between Z and P
- Has no loops or cycles
59
Bayesian Belief Network: An Example
(Figure: a network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory (FH) and Smoker (S) are the parents of LungCancer (LC).)

The conditional probability table (CPT) for the variable LungCancer:

    | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
LC  | 0.8     | 0.5      | 0.7      | 0.1
~LC | 0.2     | 0.5      | 0.3      | 0.9

- The CPT shows the conditional probability for each possible combination of values of its parents
- Derivation of the probability of a particular combination of values of X from the CPT:

  $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(x_i))$
60
Training Bayesian Networks
- Several scenarios:
  - Given both the network structure and all variables observable: learn only the CPTs
  - Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  - Unknown structure, all hidden variables: no good algorithms known for this purpose
- Ref.: D. Heckerman, Bayesian networks for data mining
61
Example
- Two events could cause grass to be wet: either the sprinkler is on or it is raining
- The rain has a direct effect on the use of the sprinkler
  - When it rains, the sprinkler is usually not turned on
- The situation can be modeled with a Bayesian network. All three variables have two possible values, T and F.
- The joint probability function is:
  P(G, S, R) = P(G | S, R) P(S | R) P(R)
  where G = Grass wet, S = Sprinkler, and R = Rain
62
Example
- The model can answer questions like "What is the probability that it is raining, given the grass is wet?", i.e., P(R = T | G = T), by summing the joint probability over the unobserved variable S (a sketch follows below).
- The joint probability function is:
  P(G, S, R) = P(G | S, R) P(S | R) P(R)
  where G = Grass wet, S = Sprinkler, and R = Rain
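A sketch of that query by brute-force enumeration over the joint P(G,S,R) = P(G|S,R) P(S|R) P(R). The CPT numbers below are the commonly used illustrative values for this sprinkler example; they are an assumption here, since the slide's tables are not reproduced.

```python
from itertools import product

# Assumed CPTs (typical illustrative values; the slide does not list them)
P_R = {True: 0.2, False: 0.8}
P_S_given_R = {True: {True: 0.01, False: 0.99},     # P(S | R=T)
               False: {True: 0.4, False: 0.6}}      # P(S | R=F)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}   # P(G=T | S, R)

def joint(g, s, r):
    """P(G, S, R) = P(G | S, R) P(S | R) P(R)."""
    pg = P_G_given_SR[(s, r)] if g else 1.0 - P_G_given_SR[(s, r)]
    return pg * P_S_given_R[r][s] * P_R[r]

# P(R=T | G=T): sum the joint over the unobserved sprinkler variable S
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(True, s, r) for s, r in product((True, False), repeat=2))
print(num / den)   # roughly 0.36 with these assumed numbers
```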
63
Chapters 8-9. Classification
64
Ensemble Methods: Increasing the Accuracy
- Ensemble methods
  - Use a combination of models to increase accuracy
  - Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
- Popular ensemble methods
  - Bagging: averaging the prediction over a collection of classifiers
  - Boosting: weighted vote with a collection of classifiers
  - Ensemble: combining a set of heterogeneous classifiers
65
Bagging: Bootstrap Aggregation
- Analogy: diagnosis based on multiple doctors' majority vote
- Training
  - Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
  - A classifier model Mi is learned for each training set Di
- Classification: classify an unknown sample X
  - Each classifier Mi returns its class prediction
  - The bagged classifier M* counts the votes and assigns the class with the most votes to X
- Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
- Accuracy (a compact sketch of the loop follows)
  - Often significantly better than a single classifier derived from D
  - For noisy data: not considerably worse, more robust
  - Proven improved accuracy in prediction
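A compact bagging loop under the same train/predict and (x, y)-pair assumptions as the evaluation sketches; the base learner is pluggable and the names are illustrative.

```python
import random
from collections import Counter

def bagging(data, k, train):
    """Learn k models, each from a bootstrap sample of the training data."""
    d = len(data)
    return [train([data[random.randrange(d)] for _ in range(d)]) for _ in range(k)]

def bagged_predict(models, x, predict):
    """Majority vote over the individual models' predictions."""
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]
```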
66
Boosting
- Analogy: consult several doctors, based on a combination of weighted diagnoses, with weights assigned based on previous diagnosis accuracy
- How boosting works:
  - Weights are assigned to each training tuple
  - A series of k classifiers is iteratively learned
  - After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
  - The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
- The boosting algorithm can be extended for numeric prediction
- Compared with bagging: boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data
67
68
Adaboost (Freund and Schapire, 1997)
- Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
- Initially, all the tuple weights are set the same (1/d)
- Generate k classifiers in k rounds. At round i:
  - Tuples from D are sampled (with replacement) to form a training set Di of the same size
  - Each tuple's chance of being selected is based on its weight
  - A classification model Mi is derived from Di
  - Its error rate is calculated using Di as a test set
  - If a tuple is misclassified, its weight is increased; otherwise it is decreased
- Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi's error rate is the sum of the weights of the misclassified tuples:

  $error(M_i) = \sum_{j=1}^{d} w_j \times err(\mathbf{X}_j)$

- The weight of classifier Mi's vote is

  $\log \frac{1 - error(M_i)}{error(M_i)}$

(A sketch of the weight bookkeeping follows.)
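A sketch of the round-by-round loop described above, under the same (x, y)-pair and train/predict assumptions; the exact reweighting constants vary between AdaBoost variants, so treat this as illustrative rather than the definitive algorithm.

```python
import math
import random

def adaboost(data, k, train, predict):
    """AdaBoost-style loop: reweight tuples each round and weight each
    classifier's vote by log((1 - error) / error)."""
    d = len(data)
    weights = [1.0 / d] * d
    ensemble = []                                            # list of (model, vote_weight)
    for _ in range(k):
        idx = random.choices(range(d), weights=weights, k=d)  # sample tuples by weight
        model = train([data[i] for i in idx])
        miss = [predict(model, x) != y for x, y in data]
        error = sum(w for w, m in zip(weights, miss) if m)    # weighted error rate
        if error == 0 or error >= 0.5:                        # degenerate round: reset weights
            weights = [1.0 / d] * d
            continue
        alpha = math.log((1 - error) / error)
        ensemble.append((model, alpha))
        # Increase weights of misclassified tuples, decrease the rest, then renormalize
        weights = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, miss)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_predict(ensemble, x, predict, classes):
    scores = {c: 0.0 for c in classes}
    for model, alpha in ensemble:
        scores[predict(model, x)] += alpha
    return max(scores, key=scores.get)
```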
Random Forest (Breiman 2001)
- Random forest:
  - Each classifier in the ensemble is a decision tree classifier generated using a random selection of attributes at each node to determine the split
  - During classification, each tree votes and the most popular class is returned
- Two methods to construct a random forest:
  - Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size
  - Forest-RC (random linear combinations): creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)
- Comparable in accuracy to Adaboost, but more robust to errors and outliers
- Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting
69
Classification of Class-Imbalanced Data Sets
- Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spill, fault detection, etc.
- Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
- Typical methods for imbalanced data in 2-class classification:
  - Oversampling: re-sampling of data from the positive class
  - Under-sampling: randomly eliminate tuples from the negative class
  - Threshold-moving: move the decision threshold, t, so that the rare-class tuples are easier to classify, and hence there is less chance of costly false negative errors
  - Ensemble techniques: ensemble multiple classifiers introduced above
- Still difficult for the class imbalance problem on multiclass tasks
70
Chapters 8-9. Classification
71
72
Lazy vs. Eager Learning
- Lazy vs. eager learning
  - Lazy learning (e.g., instance-based learning): simply stores training data (or does only minor processing) and waits until it is given a test tuple
  - Eager learning (the methods discussed above): given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
- Lazy: less time in training but more time in predicting
- Accuracy
  - Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
  - Eager methods must commit to a single hypothesis that covers the entire instance space
73
Lazy Learner: Instance-Based Methods
- Instance-based learning:
  - Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
- Typical approaches
  - k-nearest neighbor approach
    - Instances represented as points in a Euclidean space
  - Locally weighted regression
    - Constructs a local approximation
  - Case-based reasoning
    - Uses symbolic representations and knowledge-based inference
74
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
- The target function could be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

(Figure: a query point xq among positive (+) and negative (-) training examples)
75
Discussion on the k-NN Algorithm
- k-NN for real-valued prediction for a given unknown tuple
  - Returns the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm (sketched below)
  - Weight the contribution of each of the k neighbors according to their distance to the query xq:

    $w \equiv \frac{1}{d(x_q, x_i)^2}$

  - Give greater weight to closer neighbors
- Robust to noisy data by averaging the k nearest neighbors
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  - To overcome it, stretch axes or eliminate the least relevant attributes
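A sketch of distance-weighted k-NN for a real-valued target, using the 1/d^2 weighting above; Euclidean distance over numeric feature vectors is assumed and the names are illustrative.

```python
import math

def knn_predict(train, query, k):
    """Distance-weighted k-NN prediction of a real-valued target.
    train: list of (feature_vector, target_value); query: feature vector."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda xy: dist(xy[0], query))[:k]
    num, den = 0.0, 0.0
    for x, y in neighbors:
        d = dist(x, query)
        if d == 0:
            return y                      # exact match dominates
        w = 1.0 / d ** 2                  # weight w = 1 / d(x_q, x_i)^2
        num += w * y
        den += w
    return num / den
```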
Chapters 8-9. Classification
76
SVM—Support Vector Machines
- A classification method for both linear and nonlinear data
- It uses a nonlinear mapping to transform the original training data into a higher dimension
- In the new dimension, it searches for the linear optimal separating hyperplane (i.e., "decision boundary")
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
- SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)
77
History and Applications
- Vapnik and colleagues (1992): groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s
- Features: training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
- Used both for classification and prediction
- Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
78
General Philosophy
(Figure: two linear separators with their support vectors, one with a small margin and one with a large margin)
79
When Data Is Linearly Separable
(Figure: linearly separable data with several candidate separating hyperplanes and the margin m)
- Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi is its associated class label
- There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
- SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
80
Kernel functions
- Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi)·Φ(Xj)
- Typical kernel functions include the polynomial kernel, the Gaussian radial basis function (RBF) kernel, and the sigmoid kernel
- SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
81
Why Is SVM Effective on High Dimensional Data?
- The complexity of the trained classifier is characterized by the number of support vectors rather than the dimensionality of the data
- The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
- If all other training examples were removed and the training repeated, the same separating hyperplane would be found
- The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
- Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
82
SVM Introduction Literature
- "Statistical Learning Theory" by Vapnik: extremely hard to understand, and it contains many errors too
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998
  - Better than Vapnik's book, but still too hard for an introduction, and the examples are not intuitive
- The book "An Introduction to Support Vector Machines" by N. Cristianini and J. Shawe-Taylor
  - Also hard as an introduction, but the explanation of Mercer's theorem is better than in the above literature
- The neural network book by Haykin
  - Contains one nice chapter of SVM introduction
83
SVM Related Links
- SVM website: http://www.kernel-machines.org/
- Representative implementations
  - LIBSVM: an efficient implementation of SVM; multi-class classification, nu-SVM, one-class SVM, and various interfaces with Java, Python, etc.
  - SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
  - SVM-torch: another implementation, also written in C
84
ANN—Artificial Neural Network(Classification by Backpropagation)
- An artificial neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain. In the accompanying figure, each circular node represents an artificial neuron and an arrow represents a connection from the output of one neuron to the input of another.
- Deep learning: deep neural networks
85
SVM vs. ANN
- SVM
  - Relatively new concept
  - Deterministic
  - Nice generalization properties
  - Hard to learn: learned in batch mode using quadratic programming techniques
  - Using kernels, can learn very complex functions
- ANN
  - Relatively old (but …)
  - Nondeterministic
  - Generalizes well but doesn't have a strong mathematical foundation
  - Can easily be learned in an incremental fashion
  - To learn complex functions, use a multilayer perceptron (not that trivial)
86