Chapters 8-9. Classification
- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch. 9)
- Techniques to Improve Classification
- Lazy Learners (ch. 9)
- Other known methods: SVM, ANN (ch. 9)
1
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: training data are labeled, indicating classes
  - New instances are classified based on the training set
- Unsupervised learning (clustering)
  - Class labels are unknown
  - Given a set of objects, establish the existence of classes or clusters in the data
2
Prediction: Classification vs. Numeric Prediction
- Classification
  - Predicts categorical class labels
- Numeric prediction
  - Models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit/loan approval
  - Medical diagnosis
  - Fraud detection
  - Web page categorization
3
4
Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as indicated by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown instances
  - Estimate the accuracy of the model
    - Use a testing set independent of the training set; compare predicted class labels with true class labels
    - Compute accuracy (percentage of correctly classified instances)
  - If the accuracy is acceptable, use the model to classify new data
5
Process 1: Model Construction
Training Data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

Classification Algorithms produce the Classifier (Model), e.g.:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
6
Process 2: Using the Model in Prediction
Classifier
Testing Data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
Issues: Data Preparation
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove the irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
7
Chapters 8-9. Classification
8
9
Decision Tree Induction: An Example
Training data set: buys_computer

age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no

Resulting tree (figure): the root tests age?
- age <= 30: test student? (no -> no, yes -> yes)
- age 31..40: yes
- age > 40: test credit rating? (excellent -> no, fair -> yes)
10
Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down (from general to specific), recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Attributes are selected based on a heuristic or statistical measure (e.g., information gain)
- When to stop
  - All examples for a given node belong to the same class (pure), or
  - No remaining attributes to select from (majority voting determines the class label for the node), or
  - No examples left
A recursive sketch of this procedure is given below.
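A minimal sketch of the greedy top-down procedure in Python, assuming the data set is a list of (feature-dict, label) pairs and that an attribute-selection measure is plugged in; all names are illustrative, not from any particular library.

```python
from collections import Counter

def build_tree(examples, attributes, select_attribute, parent_majority=None):
    """Greedy, recursive, divide-and-conquer decision tree induction (sketch).

    examples: list of (features: dict, label) pairs
    attributes: categorical attribute names still available for splitting
    select_attribute: heuristic (e.g., information gain) returning the best attribute
    """
    if not examples:                          # no examples left: use the parent's majority class
        return parent_majority
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                 # pure node: all examples in one class
        return labels[0]
    if not attributes:                        # no attributes left: majority vote
        return majority

    best = select_attribute(examples, attributes)
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # Partition the examples on the chosen attribute and recurse on each partition
    for value in {feat[best] for feat, _ in examples}:
        subset = [(feat, lab) for feat, lab in examples if feat[best] == value]
        node["branches"][value] = build_tree(subset, remaining, select_attribute, majority)
    return node
```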
Random Tree Induction
- Let a be the number of attributes, and let v be the maximum number of values any attribute can take.
- Upper bound on the number of trees?
- Lower bound on the number of trees?
- Random tree induction
  - Randomly choose an attribute for each split
  - Same stopping criteria
- The design of decision trees has been largely influenced by the preference for simplicity.
11
Occam's Razor
- Occam's Razor: rule of parsimony, principle of economy
  - Plurality should not be assumed without necessity
  - Meaning, one should not increase, beyond what is necessary, the number of entities required to explain anything
- Argument: the simplicity of nature and the rarity of simple theories can be used to justify Occam's Razor.
  - First, nature exhibits regularity, and natural phenomena are more often simple than complex. At least, the phenomena humans choose to study tend to have simple explanations.
  - Second, there are far fewer simple hypotheses than complex ones, so there is only a small chance that any simple hypothesis that is wildly incorrect will be consistent with all observations.
- Occam's two razors: the sharp and the blunt (KDD'98), Pedro Domingos
(William of Ockham, 1288-1348)
12
Attribute Selection Measure: Information Gain (ID3/C4.5)
- How to obtain the smallest (shortest) tree?
  - Careful design of the attribute selection step
  - Quinlan pioneered the use of entropy in his ID3 algorithm
- Entropy: in information theory, also called expected information, is a measure of uncertainty
  - Intuition: chaos, molecular disorder, temperature, thermodynamic systems, the universe
  - High entropy = high disorder
13
14
Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
- Expected information (entropy) needed to classify a tuple in D:

  $Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$

  - Entropy is a measure of uncertainty: larger entropy means larger uncertainty
- Information needed to classify D after using A to split D into v partitions (aggregated entropy):

  $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

- Information gained (entropy dropped) by branching on attribute A:

  $Gain(A) = Info(D) - Info_A(D)$
15
Attribute Selection: Information Gain
- Class P: buys_computer = "yes"; Class N: buys_computer = "no"

  $Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

- For the attribute age (using the buys_computer training table shown earlier):

  age   | p_i | n_i | I(p_i, n_i)
  <=30  | 2   | 3   | 0.971
  31…40 | 4   | 0   | 0
  >40   | 3   | 2   | 0.971

  $Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$

  where $\frac{5}{14}I(2,3)$ means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

  $Gain(age) = Info(D) - Info_{age}(D) = 0.246$

- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048 (a small code sketch of this computation follows)
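A sketch, in Python, of the Info/Gain computation above; it assumes the buys_computer table is loaded as a list of dicts, and the function names are illustrative.

```python
import math
from collections import Counter

def info(labels):
    """Entropy Info(D) = -sum p_i log2 p_i over the class distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr, class_attr="buys_computer"):
    """Gain(A) = Info(D) - sum_j (|D_j|/|D|) * Info(D_j), splitting on attr."""
    labels = [r[class_attr] for r in rows]
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr], []).append(r[class_attr])
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info(labels) - info_a

# e.g. with the 14-tuple buys_computer table loaded as `data`:
# info_gain(data, "age") -> about 0.246, matching the slide.
```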
16
Computing Information-Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    - (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split:
  - D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point (see the sketch below)
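A sketch of the split-point search for one continuous attribute, reusing the info() helper from the previous snippet; names and the data representation are the same illustrative assumptions.

```python
def best_split_point(rows, attr, class_attr="buys_computer"):
    """Try the midpoint of every adjacent pair of sorted attr values and return
    the split point that minimizes the expected information requirement."""
    rows = sorted(rows, key=lambda r: r[attr])
    values = [r[attr] for r in rows]
    labels = [r[class_attr] for r in rows]
    best_point, best_info = None, float("inf")
    for i in range(len(values) - 1):
        if values[i] == values[i + 1]:
            continue                                   # no boundary between equal values
        midpoint = (values[i] + values[i + 1]) / 2
        left = [lab for v, lab in zip(values, labels) if v <= midpoint]
        right = [lab for v, lab in zip(values, labels) if v > midpoint]
        expected = (len(left) * info(left) + len(right) * info(right)) / len(rows)
        if expected < best_info:
            best_point, best_info = midpoint, expected
    return best_point, best_info
```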
17
Gain Ratio for Attribute Selection (C4.5)
- Information gain is biased towards attributes with a large number of values
- C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (a normalization of information gain):

  $SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$

- GainRatio(A) = Gain(A) / SplitInfo_A(D)
- Example: gain_ratio(income) = 0.029 / 1.557 = 0.019
- The attribute with the largest gain ratio is selected (a short code sketch follows)
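The same counts give SplitInfo and the gain ratio; a short sketch under the same assumptions and helpers as the earlier snippets.

```python
def split_info(rows, attr):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) log2(|D_j|/|D|)."""
    counts = Counter(r[attr] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(rows, attr, class_attr="buys_computer"):
    si = split_info(rows, attr)
    return info_gain(rows, attr, class_attr) / si if si > 0 else 0.0
```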
18
Gini Index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the gini index gini(D) is defined as

  $gini(D) = 1 - \sum_{j=1}^{n} p_j^2$

  where p_j is the relative frequency of class j in D
- If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  $gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$

- Reduction in impurity:

  $\Delta gini(A) = gini(D) - gini_A(D)$

- The attribute providing the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all possible splitting points for each attribute)
19
Computation of Gini Index
- Ex.: D has 9 tuples in buys_computer = "yes" and 5 in "no":

  $gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$

- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

  $gini_{income \in \{low, medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2)$

  Gini_{low,high} is 0.458; Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}), since it has the lowest Gini index (see the sketch below).
- All attributes are assumed continuous-valued
  - May need other tools, e.g., clustering, to get the possible split values
  - Can be modified for categorical attributes
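A sketch of the Gini computation for a binary split, matching the formulas above and using the same list-of-dicts assumption as the earlier snippets.

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(rows, attr, value_subset, class_attr="buys_computer"):
    """Gini index of splitting D into D1 (attr value in value_subset) and D2 (the rest)."""
    d1 = [r[class_attr] for r in rows if r[attr] in value_subset]
    d2 = [r[class_attr] for r in rows if r[attr] not in value_subset]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# e.g. gini_split(data, "income", {"low", "medium"}) for the slide's example split.
```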
20
Comparing Attribute Selection Measures
- The three measures, in general, return good results, but:
  - Information gain: biased towards multivalued attributes
  - Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  - Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions
21
Other Attribute Selection Measures
- CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
- C-SEP: performs better than information gain and the gini index in certain cases
- G-statistic: has a close approximation to the χ2 distribution
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  - The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
- Multivariate splits (partition based on combinations of multiple variables)
  - CART: finds multivariate splits based on a linear combination of attributes
- Which attribute selection measure is the best?
  - Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- (Figure: training error in blue, generalization error in red)
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
    - Use a set of data (validation set) different from the training data to decide which is the "best pruned tree"
22
23
Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
24
Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why is decision tree induction popular?
  - Relatively faster learning speed (than other classification methods)
  - Convertible to simple and easy-to-understand classification rules
  - Can use SQL queries for accessing databases
  - Comparable classification accuracy with other methods
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
  - Builds an AVC-list (attribute, value, class label)
25
Scalability Framework for RainForest
- Separates the scalability aspects from the criteria that determine the quality of the tree
- Builds an AVC-list: AVC (Attribute, Value, Class_label)
- AVC-set (of an attribute X)
  - Projection of the training data set onto the attribute X and the class label, where counts of the individual class labels are aggregated
- AVC-group (of a node n)
  - Set of AVC-sets of all predictor attributes at the node n
26
Rainforest: Training Set and Its AVC Sets
Training examples: the 14-tuple buys_computer data set shown earlier.

AVC-set on Age:
Age    | Buy_Computer = yes | Buy_Computer = no
<=30   | 2 | 3
31..40 | 4 | 0
>40    | 3 | 2

AVC-set on income:
income | Buy_Computer = yes | Buy_Computer = no
high   | 2 | 2
medium | 4 | 2
low    | 3 | 1

AVC-set on Student:
student | Buy_Computer = yes | Buy_Computer = no
yes     | 6 | 1
no      | 3 | 4

AVC-set on credit_rating:
Credit rating | Buy_Computer = yes | Buy_Computer = no
fair          | 6 | 2
excellent     | 3 | 3
AVC-set on credit_rating
27
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
- Use a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
- Each subset is used to create a tree, resulting in several trees
- These trees are examined and used to construct a new tree T'
  - It turns out that T' is very close to the tree that would be generated using the whole data set together
- Advantage: requires only two scans of the DB; an incremental algorithm
27
Chapters 8-9. Classification
28
Model Evaluation Metrics: Confusion Matrix
- Given m classes, an entry CM_{i,j} in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
- May have extra rows/columns to provide totals

Confusion matrix:

Actual class \ Predicted class | C1                   | ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)

Example of confusion matrix:

Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             | 6954               | 46                | 7000
buy_computer = no              | 412                | 2588              | 3000
Total                          | 7366               | 2634              | 10000
29
Model Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
- Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
- Error rate: 1 - accuracy, or Error rate = (FP + FN) / All
- Class imbalance problem:
  - One class may be rare, e.g., fraud or HIV-positive
  - Significant majority of the negative class and minority of the positive class
  - Sensitivity: true positive recognition rate (recall for +): Sensitivity = TP / P
  - Specificity: true negative recognition rate (recall for -): Specificity = TN / N

A \ P | C  | ¬C |
C     | TP | FN | P
¬C    | FP | TN | N
      | P' | N' | All
30
Model Evaluation Metrics: Precision and Recall, and F-measures
- Precision (exactness): what % of tuples that the classifier (model) labeled as positive are actually positive?
  Precision = TP / (TP + FP)
- Recall (completeness): what % of positive tuples did the classifier (model) label as positive?
  Recall = TP / (TP + FN)
- The perfect score is 1.0
- There is an inverse relationship between precision and recall
- F measure (F1 or F-score): the harmonic mean of precision and recall
- F_β: a weighted measure of precision and recall
  - Assigns β times as much weight to recall as to precision
31
Model Evaluation Metrics: Example
- Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00% (computed in the sketch below)

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
Total                          | 230          | 9770        | 10000 | 96.40 (accuracy)
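The cancer example reduces to a few ratios over the four confusion-matrix cells; a minimal sketch.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, error rate, sensitivity/recall, specificity, precision and F1
    from the cells of a 2x2 confusion matrix."""
    p, n, total = tp + fn, fp + tn, tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / p                       # = sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
    }

# Slide example: classification_metrics(tp=90, fn=210, fp=140, tn=9560)
# -> precision about 0.3913, recall 0.30
```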
32
Evaluating Learning Algorithm:Holdout & Cross-Validation Methods
- Holdout method
  - The given data is randomly partitioned into two independent sets
    - Training set (e.g., 2/3) for model construction
    - Testing set (e.g., 1/3) for accuracy (or another metric) estimation
  - Random sampling: a variation of holdout
    - Repeat holdout k times; accuracy = average of the accuracies obtained
- Cross-validation (k-fold, where k = 10 is most common; see the sketch below)
  - Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  - At the i-th iteration, use D_i as the testing set and the others as the training set
  - Leave-one-out: k folds where k = the number of tuples, for small-sized data
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
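A minimal k-fold cross-validation loop, assuming the data is a list of (x, y) pairs and that a train(...)/predict(...) pair is supplied; purely illustrative, no particular library.

```python
import random

def cross_validate(data, k, train, predict):
    """Plain k-fold cross-validation: average accuracy over k held-out folds."""
    data = data[:]
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]        # k roughly equal-sized folds
    accuracies = []
    for i in range(k):
        test = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train(training)
        correct = sum(1 for x, y in test if predict(model, x) == y)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```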
33
Evaluating Classifier Accuracy: Bootstrap
- Bootstrap
  - Works well with small data sets
  - Samples the given training tuples uniformly with replacement
    - i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
- There are several bootstrap methods; a common one is the .632 bootstrap
  - A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368)
  - Repeat the sampling procedure k times; the overall accuracy of the model is

    $Acc(M) = \sum_{i=1}^{k} \left(0.632 \cdot Acc(M_i)_{test\_set} + 0.368 \cdot Acc(M_i)_{train\_set}\right)$

    (one round of this procedure is sketched below)
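A sketch of one round of the .632 bootstrap, under the same train/predict and (x, y)-pair assumptions as the cross-validation sketch.

```python
import random

def bootstrap_632_round(data, train, predict):
    """Sample d tuples with replacement for training; unpicked tuples form the test set.
    Returns the 0.632/0.368 weighted accuracy for this round."""
    d = len(data)
    picked = [random.randrange(d) for _ in range(d)]
    picked_set = set(picked)
    train_set = [data[i] for i in picked]
    test_set = [row for i, row in enumerate(data) if i not in picked_set]
    model = train(train_set)
    acc = lambda rows: sum(predict(model, x) == y for x, y in rows) / len(rows)
    return 0.632 * acc(test_set) + 0.368 * acc(train_set)
```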
34
Model Selection: ROC Curves
- ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
- Originated from signal detection theory
- Shows the trade-off between the true positive rate and the false positive rate
- The area under the ROC curve is a measure of the accuracy of the model
- Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
- The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
- The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
- The plot also shows a diagonal line
- A model with perfect accuracy will have an area of 1.0
35
Model Selection Issues
- Accuracy
  - Classifier accuracy: predicting class labels
- Speed
  - Time to construct the model (training time)
  - Time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability
  - Understanding and insight provided by the model
  - Model (e.g., decision tree) size or compactness
36
Chapters 8-9. Classification
37
38
Using IF-THEN Rules for Classification
- Represent knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  - Rule antecedent/precondition vs. rule consequent
- Assessment of a rule: coverage and accuracy
  - n_covers = number of tuples covered by R
  - n_correct = number of tuples correctly classified by R
  - coverage(R) = n_covers / |D|
  - accuracy(R) = n_correct / n_covers
- If more than one rule is triggered, conflict resolution is needed
  - Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirements (i.e., with the most attribute tests)
  - Class-based ordering: decreasing order of prevalence or misclassification cost per class
  - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
39
Rule Extraction from Decision Tree
- A root-to-leaf path corresponds to a rule
- Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
- Rules are exhaustive and mutually exclusive
- Example: rule extraction from our buys_computer decision tree (age? at the root, student? and credit rating? below):
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = no
  IF age = old AND credit_rating = fair THEN buys_computer = yes
40
Rule Induction: Sequential Covering Method
- Sequential covering: extracts rules directly from the training data
- Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
- Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
- Steps:
  - Rules are learned one at a time
  - Each time a rule is learned, the covered positive tuples are removed
  - Repeat until a termination condition is met, e.g., no more training examples, or the quality of a generated rule is below a user-specified threshold
- Unlike decision trees, which learn a set of rules simultaneously
41
Sequential Covering Algorithm
while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

(Figure: positive examples, with nested regions covered by Rule 1, Rule 2, and Rule 3)
42
How to Learn One Rule?
- Start with the most general rule possible: condition = empty
- Add new attributes by adopting a greedy depth-first strategy
  - Pick the one that most improves the rule quality
- Rule-quality measures: consider both coverage and accuracy
  - FOIL-gain (in FOIL & RIPPER): assesses the information gained by extending the condition:

    $FOIL\_Gain = pos' \times \left(\log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg}\right)$

  - Favors rules that have high accuracy and cover many positive tuples
- Rule pruning is based on an independent set of test tuples:

  $FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$

  where pos/neg are the numbers of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R. (Both measures are sketched below.)
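A small helper for the two formulas above; pos/neg are counts before extending the rule with a candidate predicate, pos'/neg' the counts afterwards. Names are illustrative.

```python
import math

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))."""
    if pos_new == 0:
        return float("-inf")            # the extended rule covers no positive tuples
    return pos_new * (math.log2(pos_new / (pos_new + neg_new))
                      - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg); prune R if this rises after pruning."""
    return (pos - neg) / (pos + neg)
```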
43
Learn one rule
- To generate a rule:
    while (true)
        find the best predicate p
        if foil-gain(p) > threshold then add p to the current rule
        else break

(Figure: positive and negative examples; the rule is specialized step by step: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5)
Trees and rules
- Most tree learners: divide and conquer
- Most rule learners: separate and conquer, i.e., sequential covering (AQ, CN2, RIPPER, …)
  - Some conquer without separating (RISE, from Domingos, biased towards complex models): rules are learned simultaneously, instance-based
- Decision space, decision boundary
- Both are interpretable classifiers
- Other uses of rule learning: rule extraction, e.g., from ANNs
44
Separate and conquer vs. set cover
- Set covering problem (minimum set cover): one of the most studied combinatorial optimization problems
  - Given a finite ground set X and S1, S2, …, Sm as subsets of X, find I ⊆ {1, …, m} with ∪_{i ∈ I} S_i = X such that |I| is minimized
  - i.e., select as few subsets as possible from a given family such that every element is covered
  - NP-hard
- Greedy algorithm: iteratively pick the subset that covers the maximum number of uncovered elements
  - Achieves a 1 + ln n approximation ratio, essentially the best possible for a polynomial-time algorithm (under standard complexity assumptions)
- Greedy set cover vs. sequential covering
  - Select one subset (learn one rule) at a time
  - Consider uncovered elements (remove covered examples)
  - Iterate until all elements (examples) are covered
- Other related problems: graph coloring, minimum clique partition
45
Chapters 8-9. Classification
46
47
Bayesian Classification: Why?
- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Probability Model for Classifiers
- Let X = (x1, x2, …, xn) be a data sample ("evidence"); the class label is unknown
- The probability model for a classifier is to determine P(C|X), the probability that X belongs to class C given the observed data sample X
  - Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
48
Bayes’ Theorem
- P(C|X): posterior probability
- P(C): prior, the initial probability
  - E.g., one will buy a computer, regardless of age, income, …
- P(X): probability that the sample X is observed
- P(X|C): likelihood, the probability of observing the sample X given that the hypothesis holds
  - E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
- Informally, this can be written as: posterior = prior × likelihood / evidence

  $P(C|\mathbf{X}) = \frac{P(\mathbf{X}|C)\,P(C)}{P(\mathbf{X})}$
49
Maximizing joint probability
- In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the same value is shared by all classes.
- The numerator is the joint probability:

  $P(C)\,P(\mathbf{X}|C) = P(C, \mathbf{X}) = P(C, X_1, X_2, \ldots, X_n)$
50
Maximizing joint probability
Repeatedly applying the definition of conditional probability:

  $P(C)\,P(X_1, X_2, \ldots, X_n | C) = P(C)\,P(X_1|C)\,P(X_2, \ldots, X_n | C, X_1)$
  $= P(C)\,P(X_1|C)\,P(X_2|C, X_1)\,P(X_3, \ldots, X_n | C, X_1, X_2)$
  $= P(C)\,P(X_1|C)\,P(X_2|C, X_1) \cdots P(X_n|C, X_1, X_2, \ldots, X_{n-1})$
51
Naïve Bayes Classifier: Assuming Conditional Independence
Simplifying assumption: features are conditionally independent of each other given the class:

  $P(X_i | C, X_j) = P(X_i | C)$

Then:

  $P(C, X_1, X_2, \ldots, X_n) = P(C)\,P(X_1|C)\,P(X_2|C, X_1) \cdots P(X_n|C, X_1, \ldots, X_{n-1}) = P(C)\,P(X_1|C)\,P(X_2|C) \cdots P(X_n|C)$

This greatly reduces the computation cost: only the class distribution and per-class attribute value counts are needed.
52
53
Naïve Bayes Classifier
- This greatly reduces the computation cost: only counts of the class distribution are needed
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
- If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

  $g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

  and $P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
54
Naïve Bayes Classifier: Training Dataset
- Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
- Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- Training data: the 14-tuple buys_computer data set shown earlier
Naïve Bayes Classifier: Example
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- P(C): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357
- Compute P(X|C) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- P(X|C): P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
- P(C, X) = P(X|C) × P(C):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
- Therefore, X belongs to class "buys_computer = yes" (a code sketch of this computation follows)
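A sketch of the counting that the example above does by hand, assuming the training tuples are dicts with a buys_computer label; no smoothing is applied here (see the Laplacian correction on the next slide).

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, class_attr):
    priors = Counter(r[class_attr] for r in rows)
    cond = defaultdict(Counter)                       # (class, attribute) -> value counts
    for r in rows:
        for a, v in r.items():
            if a != class_attr:
                cond[(r[class_attr], a)][v] += 1
    return priors, cond, len(rows)

def classify(x, priors, cond, n):
    best_class, best_score = None, -1.0
    for c, c_count in priors.items():
        score = c_count / n                           # P(C)
        for a, v in x.items():
            score *= cond[(c, a)][v] / c_count        # P(x_k | C), zero if value unseen
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# With the 14-tuple table loaded as `data`:
# classify({"age": "<=30", "income": "medium", "student": "yes",
#           "credit_rating": "fair"}, *train_naive_bayes(data, "buys_computer"))
# -> "yes", matching the hand computation (0.028 vs 0.007).
```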
55
56
Avoiding the Zero-Probability Problem
- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

  $P(\mathbf{X}|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$

- Suppose a training set has 1000 tuples for class buys_computer = yes, with 0 for income = low, 990 for income = medium, and 10 for income = high
- Use the Laplacian correction (or Laplacian estimator)
  - Add 1 to each case:
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  - The "corrected" probability estimates are close to their "uncorrected" counterparts
57
Naïve Bayes Classifier: Comments
- Advantages
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages
  - Assumption: class conditional independence, therefore loss of accuracy
  - Practically, dependencies exist among variables
    - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - Dependencies among these cannot be modeled by a naïve Bayes classifier
- How to deal with these dependencies? Bayesian belief networks (Chapter 9)
Chapters 8-9. Classification
58
Bayesian Belief Networks
- A Bayesian belief network relieves the conditional independence assumption of naïve Bayes
- A graphical model of causal relationships
  - Represents dependencies among the variables
  - Gives a specification of the joint probability distribution

(Figure: a small network over nodes X, Y, Z, P)
- Nodes: random variables
- Links: dependency
- X and Y are the parents of Z, and Y is the parent of P
- No dependency between Z and P
- Has no loops or cycles
59
Bayesian Belief Network: An Example
(Figure: a network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory (FH) and Smoker (S) are the parents of LungCancer (LC).)

The conditional probability table (CPT) for the variable LungCancer:

    | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
LC  | 0.8     | 0.5      | 0.7      | 0.1
~LC | 0.2     | 0.5      | 0.3      | 0.9

- The CPT shows the conditional probability for each possible combination of values of its parents
- Derivation of the probability of a particular combination of values of X from the CPT:

  $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(x_i))$
60
Training Bayesian Networks
- Several scenarios:
  - Given both the network structure and all variables observable: learn only the CPTs
  - Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  - Unknown structure, all hidden variables: no good algorithms known for this purpose
- Ref.: D. Heckerman, Bayesian networks for data mining
61
Example
- Two events could cause grass to be wet: either the sprinkler is on or it is raining
- The rain has a direct effect on the use of the sprinkler
  - When it rains, the sprinkler is usually not turned on
- The situation can be modeled with a Bayesian network. All three variables have two possible values, T and F.
- The joint probability function is:
  P(G, S, R) = P(G | S, R) P(S | R) P(R)
  where G = Grass wet, S = Sprinkler, and R = Rain
62
Example
- The model can answer questions like "What is the probability that it is raining, given the grass is wet?", i.e., P(R = T | G = T), by summing the joint probability over the unobserved variable S (a sketch follows below).
- The joint probability function is:
  P(G, S, R) = P(G | S, R) P(S | R) P(R)
  where G = Grass wet, S = Sprinkler, and R = Rain
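A sketch of that query by brute-force enumeration over the joint P(G,S,R) = P(G|S,R) P(S|R) P(R). The CPT numbers below are the commonly used illustrative values for this sprinkler example; they are an assumption here, since the slide's tables are not reproduced.

```python
from itertools import product

# Assumed CPTs (typical illustrative values; the slide does not list them)
P_R = {True: 0.2, False: 0.8}
P_S_given_R = {True: {True: 0.01, False: 0.99},     # P(S | R=T)
               False: {True: 0.4, False: 0.6}}      # P(S | R=F)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}   # P(G=T | S, R)

def joint(g, s, r):
    """P(G, S, R) = P(G | S, R) P(S | R) P(R)."""
    pg = P_G_given_SR[(s, r)] if g else 1.0 - P_G_given_SR[(s, r)]
    return pg * P_S_given_R[r][s] * P_R[r]

# P(R=T | G=T): sum the joint over the unobserved sprinkler variable S
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(True, s, r) for s, r in product((True, False), repeat=2))
print(num / den)   # roughly 0.36 with these assumed numbers
```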
63
Chapters 8-9. Classification
64
Ensemble Methods: Increasing the Accuracy
- Ensemble methods
  - Use a combination of models to increase accuracy
  - Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
- Popular ensemble methods
  - Bagging: averaging the prediction over a collection of classifiers
  - Boosting: weighted vote with a collection of classifiers
  - Ensemble: combining a set of heterogeneous classifiers
65
Bagging: Bootstrap Aggregation
- Analogy: diagnosis based on multiple doctors' majority vote
- Training
  - Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
  - A classifier model Mi is learned for each training set Di
- Classification: classify an unknown sample X
  - Each classifier Mi returns its class prediction
  - The bagged classifier M* counts the votes and assigns the class with the most votes to X
- Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
- Accuracy (a compact sketch of the loop follows)
  - Often significantly better than a single classifier derived from D
  - For noisy data: not considerably worse, more robust
  - Proven improved accuracy in prediction
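A compact bagging loop under the same train/predict and (x, y)-pair assumptions as the evaluation sketches; the base learner is pluggable and the names are illustrative.

```python
import random
from collections import Counter

def bagging(data, k, train):
    """Learn k models, each from a bootstrap sample of the training data."""
    d = len(data)
    return [train([data[random.randrange(d)] for _ in range(d)]) for _ in range(k)]

def bagged_predict(models, x, predict):
    """Majority vote over the individual models' predictions."""
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]
```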
66
Boosting
- Analogy: consult several doctors, based on a combination of weighted diagnoses, with weights assigned based on previous diagnosis accuracy
- How boosting works:
  - Weights are assigned to each training tuple
  - A series of k classifiers is iteratively learned
  - After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
  - The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
- The boosting algorithm can be extended for numeric prediction
- Compared with bagging: boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data
67
68
Adaboost (Freund and Schapire, 1997)
- Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
- Initially, all the tuple weights are set the same (1/d)
- Generate k classifiers in k rounds. At round i:
  - Tuples from D are sampled (with replacement) to form a training set Di of the same size
  - Each tuple's chance of being selected is based on its weight
  - A classification model Mi is derived from Di
  - Its error rate is calculated using Di as a test set
  - If a tuple is misclassified, its weight is increased; otherwise it is decreased
- Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi's error rate is the sum of the weights of the misclassified tuples:

  $error(M_i) = \sum_{j=1}^{d} w_j \times err(\mathbf{X}_j)$

- The weight of classifier Mi's vote is

  $\log \frac{1 - error(M_i)}{error(M_i)}$

(A sketch of the weight bookkeeping follows.)
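A sketch of the round-by-round loop described above, under the same (x, y)-pair and train/predict assumptions; the exact reweighting constants vary between AdaBoost variants, so treat this as illustrative rather than the definitive algorithm.

```python
import math
import random

def adaboost(data, k, train, predict):
    """AdaBoost-style loop: reweight tuples each round and weight each
    classifier's vote by log((1 - error) / error)."""
    d = len(data)
    weights = [1.0 / d] * d
    ensemble = []                                            # list of (model, vote_weight)
    for _ in range(k):
        idx = random.choices(range(d), weights=weights, k=d)  # sample tuples by weight
        model = train([data[i] for i in idx])
        miss = [predict(model, x) != y for x, y in data]
        error = sum(w for w, m in zip(weights, miss) if m)    # weighted error rate
        if error == 0 or error >= 0.5:                        # degenerate round: reset weights
            weights = [1.0 / d] * d
            continue
        alpha = math.log((1 - error) / error)
        ensemble.append((model, alpha))
        # Increase weights of misclassified tuples, decrease the rest, then renormalize
        weights = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, miss)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_predict(ensemble, x, predict, classes):
    scores = {c: 0.0 for c in classes}
    for model, alpha in ensemble:
        scores[predict(model, x)] += alpha
    return max(scores, key=scores.get)
```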
Random Forest (Breiman 2001)
- Random forest:
  - Each classifier in the ensemble is a decision tree classifier generated using a random selection of attributes at each node to determine the split
  - During classification, each tree votes and the most popular class is returned
- Two methods to construct a random forest:
  - Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size
  - Forest-RC (random linear combinations): creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)
- Comparable in accuracy to Adaboost, but more robust to errors and outliers
- Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting
69
Classification of Class-Imbalanced Data Sets
- Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spill, fault detection, etc.
- Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
- Typical methods for imbalanced data in 2-class classification:
  - Oversampling: re-sampling of data from the positive class
  - Under-sampling: randomly eliminate tuples from the negative class
  - Threshold-moving: move the decision threshold, t, so that the rare-class tuples are easier to classify, and hence there is less chance of costly false negative errors
  - Ensemble techniques: ensemble multiple classifiers introduced above
- Still difficult for the class imbalance problem on multiclass tasks
70
Chapters 8-9. Classification
71
72
Lazy vs. Eager Learning
- Lazy vs. eager learning
  - Lazy learning (e.g., instance-based learning): simply stores training data (or does only minor processing) and waits until it is given a test tuple
  - Eager learning (the methods discussed above): given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
- Lazy: less time in training but more time in predicting
- Accuracy
  - Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
  - Eager methods must commit to a single hypothesis that covers the entire instance space
73
Lazy Learner: Instance-Based Methods
- Instance-based learning:
  - Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
- Typical approaches
  - k-nearest neighbor approach
    - Instances represented as points in a Euclidean space
  - Locally weighted regression
    - Constructs a local approximation
  - Case-based reasoning
    - Uses symbolic representations and knowledge-based inference
74
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
- The target function could be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

(Figure: a query point xq among positive (+) and negative (-) training examples)
75
Discussion on the k-NN Algorithm
- k-NN for real-valued prediction for a given unknown tuple
  - Returns the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm (sketched below)
  - Weight the contribution of each of the k neighbors according to their distance to the query xq:

    $w \equiv \frac{1}{d(x_q, x_i)^2}$

  - Give greater weight to closer neighbors
- Robust to noisy data by averaging the k nearest neighbors
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  - To overcome it, stretch axes or eliminate the least relevant attributes
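A sketch of distance-weighted k-NN for a real-valued target, using the 1/d^2 weighting above; Euclidean distance over numeric feature vectors is assumed and the names are illustrative.

```python
import math

def knn_predict(train, query, k):
    """Distance-weighted k-NN prediction of a real-valued target.
    train: list of (feature_vector, target_value); query: feature vector."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda xy: dist(xy[0], query))[:k]
    num, den = 0.0, 0.0
    for x, y in neighbors:
        d = dist(x, query)
        if d == 0:
            return y                      # exact match dominates
        w = 1.0 / d ** 2                  # weight w = 1 / d(x_q, x_i)^2
        num += w * y
        den += w
    return num / den
```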
Chapters 8-9. Classification
76
SVM—Support Vector Machines
- A classification method for both linear and nonlinear data
- It uses a nonlinear mapping to transform the original training data into a higher dimension
- In the new dimension, it searches for the linear optimal separating hyperplane (i.e., "decision boundary")
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
- SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)
77
History and Applications
- Vapnik and colleagues (1992): groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s
- Features: training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
- Used both for classification and prediction
- Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
78
General Philosophy
(Figure: two linear separators with their support vectors, one with a small margin and one with a large margin)
79
When Data Is Linearly Separable
(Figure: linearly separable data with several candidate separating hyperplanes and the margin m)
- Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi is its associated class label
- There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
- SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
80
Kernel functions
- Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi)·Φ(Xj)
- Typical kernel functions include the polynomial kernel, the Gaussian radial basis function (RBF) kernel, and the sigmoid kernel
- SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
81
Why Is SVM Effective on High Dimensional Data?
- The complexity of the trained classifier is characterized by the number of support vectors rather than the dimensionality of the data
- The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
- If all other training examples were removed and the training repeated, the same separating hyperplane would be found
- The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
- Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
82
SVM Introduction Literature
- "Statistical Learning Theory" by Vapnik: extremely hard to understand, and it contains many errors too
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998
  - Better than Vapnik's book, but still too hard for an introduction, and the examples are not intuitive
- The book "An Introduction to Support Vector Machines" by N. Cristianini and J. Shawe-Taylor
  - Also hard as an introduction, but the explanation of Mercer's theorem is better than in the above literature
- The neural network book by Haykin
  - Contains one nice chapter of SVM introduction
83
SVM Related Links
- SVM website: http://www.kernel-machines.org/
- Representative implementations
  - LIBSVM: an efficient implementation of SVM; multi-class classification, nu-SVM, one-class SVM, and various interfaces with Java, Python, etc.
  - SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
  - SVM-torch: another implementation, also written in C
84
ANN—Artificial Neural Network(Classification by Backpropagation)
- An artificial neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain. In the accompanying figure, each circular node represents an artificial neuron and an arrow represents a connection from the output of one neuron to the input of another.
- Deep learning: deep neural networks
85
SVM vs. ANN
- SVM
  - Relatively new concept
  - Deterministic
  - Nice generalization properties
  - Hard to learn: learned in batch mode using quadratic programming techniques
  - Using kernels, can learn very complex functions
- ANN
  - Relatively old (but …)
  - Nondeterministic
  - Generalizes well but doesn't have a strong mathematical foundation
  - Can easily be learned in an incremental fashion
  - To learn complex functions, use a multilayer perceptron (not that trivial)
86