
Page 1: Classification: Decision Trees

Classification: Decision Trees

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.

Page 2: Last Time

Last Time
• Common decomposition of machine learning, based on differences in inputs and outputs:
– Supervised Learning: Learn a function mapping inputs to outputs using labeled training data (you get instances/examples with both inputs and ground-truth outputs)
– Unsupervised Learning: Learn something about just the data, without any labels (harder!), for example clustering instances that are "similar"
– Reinforcement Learning: Learn how to make decisions given a sparse reward
• ML is an interaction between:
– the input data – input to the function (features/attributes of the data),
– the function (model) you choose, and
– the optimization algorithm you use to explore the space of functions
• We are new at this, so let's start by learning if/then rules for classification!

Page 3: Function Approximation: Problem Setting

Function Approximation: Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = {h | h : X → Y}

Input: Training examples of the unknown target function f:
{⟨x_i, y_i⟩}_{i=1}^n = {⟨x_1, y_1⟩, ..., ⟨x_n, y_n⟩}

Output: Hypothesis h ∈ H that best approximates f

Based on slide by Tom Mitchell

Page 4: Sample Dataset

Sample Dataset (was Tennis played?)
• Columns denote features X_i
• Rows denote labeled instances ⟨x_i, y_i⟩
• Class label denotes whether a tennis game was played

Page 5: Decision Tree

Decision Tree
• A possible decision tree for the data:
• Each internal node: tests one attribute X_i
• Each branch from a node: selects one value for X_i
• Each leaf node: predicts Y

Based on slide by Tom Mitchell

Page 6: Decision Tree

Decision Tree
• A possible decision tree for the data:
• What prediction would we make for ⟨outlook = sunny, temperature = hot, humidity = high, wind = weak⟩?

Based on slide by Tom Mitchell

Page 7: Decision Tree – Decision Boundary

Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label – or a probability distribution over labels

[Figure: decision boundary]

Page 8: Expressiveness

Expressiveness

• Given a particular space of functions, you may not be able to represent everything

• What functions can decision trees represent?

• Decision trees can represent any function of the input attributes!
– Boolean operations (and, or, xor, etc.)? – Yes!
– All boolean functions? – Yes!

Page 9: Stages of (Batch) Machine Learning

Stages of (Batch) Machine Learning
Given: labeled training data X, Y = {⟨x_i, y_i⟩}_{i=1}^n
• Assumes each x_i ∼ D(X) with y_i = f_target(x_i)

Train the model:
model ← classifier.train(X, Y)

Apply the model to new data:
• Given: new unlabeled instance x ∼ D(X)
y_prediction ← model.predict(x)

(Diagram: X, Y → learner → model; x → model → y_prediction)
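As a concrete sketch of this train/predict pattern (assuming scikit-learn is available; the four-example dataset is invented for illustration, not the slides' data):

```python
# Minimal sketch of the batch workflow above, with a decision tree as the learner.
from sklearn.tree import DecisionTreeClassifier

# Labeled training data (X, Y) = {<x_i, y_i>}, invented for illustration (XOR).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # feature vectors x_i
Y = [0, 1, 1, 0]                       # labels y_i = f_target(x_i)

model = DecisionTreeClassifier().fit(X, Y)   # model <- classifier.train(X, Y)

x_new = [[1, 0]]                             # new unlabeled instance x ~ D(X)
y_prediction = model.predict(x_new)          # y_prediction <- model.predict(x)
print(y_prediction)                          # [1]
```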

Page 10: Basic Algorithm for Top-Down Learning of Decision Trees

Basic Algorithm for Top-Down Learning of Decision Trees
[ID3, C4.5 by Quinlan]

node = root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to the leaf nodes.
5. If training examples are perfectly classified, stop. Else, recurse over the new leaf nodes.

How do we choose which attribute is best?
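Below is one compact Python rendering of this loop (a sketch, not Quinlan's implementation). It represents examples as (attribute-dict, label) pairs and takes "best" to mean highest information gain, as developed on the following slides:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for v in set(x[attr] for x, _ in examples):
        subset = [y for x, y in examples if x[attr] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

def id3(examples, attributes):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                 # perfectly classified: stop
        return labels[0]
    if not attributes:                        # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    A = max(attributes, key=lambda a: information_gain(examples, a))   # step 1
    tree = {A: {}}                            # step 2: A is node's decision attribute
    for v in set(x[A] for x, _ in examples):  # step 3: one descendant per value of A
        subset = [(x, y) for x, y in examples if x[A] == v]            # step 4
        tree[A][v] = id3(subset, [a for a in attributes if a != A])    # step 5
    return tree
```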

Page 11: Choosing the Best Attribute

Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples on

• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest number of possible values
– Most-Values: Choose the attribute with the largest number of possible values
– Max-Gain: Choose the attribute that has the largest expected information gain
• i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children

• The ID3 algorithm uses the Max-Gain method of selecting the best attribute

Page 12: Information Gain

Information Gain

Which test is more informative?
• Split over whether Balance exceeds 50K (branches: less or equal 50K / over 50K)
• Split over whether applicant is employed (branches: unemployed / employed)

Based on slide by Pedro Domingos

Page 13: Information Gain

Information Gain
• Impurity/Entropy (informal): measures the level of impurity in a group of examples

Based on slide by Pedro Domingos

Page 14: Impurity

Impurity

Very impure group → Less impure → Minimum impurity

Based on slide by Pedro Domingos

Page 15: Entropy

Entropy: a common way to measure impurity

• Entropy = −Σ_i p_i log₂ p_i
where p_i is the probability of class i. Compute it as the proportion of class i in the set.

• Entropy comes from information theory. The higher the entropy, the more the information content.

What does that mean for learning from examples?

Based on slide by Pedro Domingos

Page 16: 2-Class Cases

2-Class Cases:

Entropy: H(X) = −Σ_{i=1}^{n} P(X = i) log₂ P(X = i)

• What is the entropy of a group in which all examples belong to the same class?
– entropy = −1 · log₂ 1 = 0
Minimum impurity – not a good training set for learning

• What is the entropy of a group with 50% in either class?
– entropy = −0.5 · log₂ 0.5 − 0.5 · log₂ 0.5 = 1
Maximum impurity – a good training set for learning

Based on slide by Pedro Domingos
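A quick numeric check of both cases (a sketch; here `entropy` takes class proportions):

```python
import math

def entropy(probs):
    # H = -sum_i p_i * log2(p_i), with the convention 0 * log2(0) = 0
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))        # all examples in one class -> 0.0 (minimum impurity)
print(entropy([0.5, 0.5]))   # 50/50 split               -> 1.0 (maximum impurity)
```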

Page 17: Sample Entropy

Sample Entropy

Entropy H(X) of a random variable X:
H(X) = −Σ_i P(X = i) log₂ P(X = i)

Specific conditional entropy H(X|Y = v) of X given Y = v:
H(X|Y = v) = −Σ_i P(X = i | Y = v) log₂ P(X = i | Y = v)

Conditional entropy H(X|Y) of X given Y:
H(X|Y) = Σ_v P(Y = v) H(X|Y = v)

Mutual information (aka Information Gain) of X and Y:
I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

Slide by Tom Mitchell
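These definitions translate almost line for line into code. The following sketch estimates each quantity from paired samples of X and Y (the function names are ours):

```python
import math
from collections import Counter

def H(values):
    # Entropy H(X), estimated from a sample of X
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def H_given(xs, ys):
    # Conditional entropy H(X|Y) = sum_v P(Y=v) * H(X|Y=v)
    n = len(ys)
    total = 0.0
    for v, count in Counter(ys).items():
        x_given_v = [x for x, y in zip(xs, ys) if y == v]   # sample of X where Y=v
        total += (count / n) * H(x_given_v)                 # weight by P(Y=v)
    return total

def mutual_information(xs, ys):
    # I(X, Y) = H(X) - H(X|Y), a.k.a. information gain
    return H(xs) - H_given(xs, ys)
```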

Page 18: Information Gain

Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.

Based on slide by Pedro Domingos

Page 19: From Entropy to Information Gain

From Entropy to Information Gain

Entropy H(X) of a random variable X

Specific conditional entropy H(X|Y = v) of X given Y = v

Conditional entropy H(X|Y) of X given Y

Mutual information (aka Information Gain) of X and Y

Slide by Tom Mitchell

Page 20: Information Gain

Information Gain

Information Gain is the mutual information between input attribute A and target variable Y.

Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A:
Gain(S, A) = H_S(Y) − H_S(Y | A)

Slide by Tom Mitchell
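As a self-contained illustration with an invented five-example sample S (attribute values chosen arbitrarily):

```python
import math
from collections import Counter

def H(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

ys = ['yes', 'yes', 'no', 'no', 'yes']            # target variable Y on sample S
a  = ['sunny', 'rain', 'sunny', 'sunny', 'rain']  # input attribute A (invented)

# H_S(Y | A) = sum_v P(A=v) * H(Y | A=v)
h_cond = sum((c / len(a)) * H([y for y, av in zip(ys, a) if av == v])
             for v, c in Counter(a).items())

print(H(ys) - h_cond)   # Gain(S, A) ~= 0.42 bits
```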

Page 21: Calculating Information Gain

Calculating Information Gain

Information Gain = entropy(parent) − [weighted average entropy(children)]

Parent entropy (entire population, 30 instances):
impurity = −(14/30) · log₂(14/30) − (16/30) · log₂(16/30) = 0.996

Child entropy (17 instances):
impurity = −(13/17) · log₂(13/17) − (4/17) · log₂(4/17) = 0.787

Child entropy (13 instances):
impurity = −(1/13) · log₂(1/13) − (12/13) · log₂(12/13) = 0.391

(Weighted) average entropy of children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615

Information Gain = 0.996 − 0.615 = 0.38

Based on slide by Pedro Domingos
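The slide's arithmetic can be reproduced in a few lines (a sketch; `H2` is our helper for two-class entropy from class counts):

```python
import math

def H2(a, b):
    # Two-class entropy from class counts a and b
    n = a + b
    return -sum((c / n) * math.log2(c / n) for c in (a, b) if c > 0)

parent = H2(14, 16)                              # 0.996 (entire population, 30 instances)
left   = H2(13, 4)                               # 0.787 (child with 17 instances)
right  = H2(1, 12)                               # 0.391 (child with 13 instances)
average = (17 / 30) * left + (13 / 30) * right   # 0.615
print(round(parent - average, 2))                # information gain = 0.38
```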

Page 22: Information Gain


Page 23:

Slide by Tom Mitchell

Page 24:

Slide by Tom Mitchell

Page 25: Which Tree Should We Output?

Decision Tree Learning Applet
• http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Which Tree Should We Output?
• ID3 performs heuristic search through the space of decision trees
• It stops at the smallest acceptable tree. Why?

Occam's razor: prefer the simplest hypothesis that fits the data

Slide by Tom Mitchell

Page 26: Preference Bias: Ockham's Razor

Preference bias: Ockham's Razor
• Principle stated by William of Ockham (1285–1347)
– "non sunt multiplicanda entia praeter necessitatem"
– entities are not to be multiplied beyond necessity
– AKA Occam's Razor, Law of Economy, or Law of Parsimony

• Therefore, the smallest decision tree that correctly classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard
• ...So instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small

Idea: The simplest consistent explanation is the best

Page 27: Overfitting in Decision Trees

Overfitting in Decision Trees
• Many kinds of "noise" can occur in the examples:
– Two examples have the same attribute/value pairs, but different classifications
– Some values of attributes are incorrect because of errors in the data acquisition process or the preprocessing phase
– The instance was labeled incorrectly (+ instead of −)

• Also, some attributes are irrelevant to the decision-making process
– e.g., the color of a die is irrelevant to its outcome

Based on slide from M. desJardins & T. Finin

Page 28: Overfitting in Decision Trees

Overfitting in Decision Trees
• Irrelevant attributes can result in overfitting the training example data
– If the hypothesis space has many dimensions (a large number of attributes), we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features

• If we have too little training data, even a reasonable hypothesis space will 'overfit'

Based on slide from M. desJardins & T. Finin

Page 29: Overfitting in Decision Trees

Overfitting in Decision Trees
Consider adding a noisy training example to the following tree:

What would be the effect of adding:
⟨outlook = sunny, temperature = hot, humidity = normal, wind = strong, playTennis = No⟩?

Based on slide by Pedro Domingos

Page 30: Overfitting in Decision Trees

Overfitting in Decision Trees

Slide by Pedro Domingos

Page 31: Overfitting in Decision Trees

Overfitting in Decision Trees

Slide by Pedro Domingos

Page 32: Avoiding Overfitting in Decision Trees

Avoiding Overfitting in Decision Trees
How can we avoid overfitting?
• Stop growing when the data split is not statistically significant
• Acquire more training data
• Remove irrelevant attributes (manual process – not always possible)
• Grow the full tree, then post-prune

How to select the "best" tree:
• Measure performance over the training data
• Measure performance over a separate validation dataset
• Add a complexity penalty to the performance measure (heuristic: simpler is better)

Based on slide by Pedro Domingos

Page 33: Reduced-Error Pruning

Reduced-Error Pruning

Split training data further into training and validation sets

Grow tree based on the training set

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. Greedily remove the node that most improves validation set accuracy

Slide by Pedro Domingos
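One way this procedure might look in code, operating on the nested-dict trees produced by the ID3 sketch earlier (a sketch under those assumptions, not the original implementation):

```python
import copy
from collections import Counter

def predict(tree, x):
    # Walk a dict tree {attr: {value: subtree}}; leaves are labels.
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[x[attr]]
    return tree

def accuracy(tree, examples):
    return sum(predict(tree, x) == y for x, y in examples) / len(examples)

def internal_paths(tree, path=()):
    # Yield the branch path to every internal node (root first).
    if isinstance(tree, dict):
        yield path
        attr, branches = next(iter(tree.items()))
        for v, sub in branches.items():
            yield from internal_paths(sub, path + ((attr, v),))

def pruned_at(tree, path, train):
    # Copy of tree with the subtree at `path` replaced by the majority
    # label of the training examples that reach that node.
    reaching = [y for x, y in train if all(x[a] == v for a, v in path)]
    leaf = Counter(reaching).most_common(1)[0][0]
    if not path:
        return leaf                       # pruning the root: tree becomes a leaf
    new = copy.deepcopy(tree)
    node = new
    for a, v in path[:-1]:
        node = node[a][v]
    a, v = path[-1]
    node[a][v] = leaf
    return new

def reduced_error_prune(tree, train, validation):
    # Do until further pruning is harmful: greedily apply the single prune
    # that most improves (or at least preserves) validation accuracy.
    while isinstance(tree, dict):
        base, best = accuracy(tree, validation), None
        for path in internal_paths(tree):
            candidate = pruned_at(tree, path, train)
            if accuracy(candidate, validation) >= base:   # ties favor smaller trees
                base, best = accuracy(candidate, validation), candidate
        if best is None:
            return tree
        tree = best
    return tree
```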

Page 34: Pruning Decision Trees

Pruning Decision Trees
• Pruning of the decision tree is done by replacing a whole subtree by a leaf node.
• The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf.
• For example:

Training:   Color → red: 1 positive, 0 negative | blue: 0 positive, 2 negative
Validation: Color → red: 1 positive, 3 negative | blue: 1 positive, 1 negative

On the validation set, the subtree gets 2 correct and 4 incorrect. If we had simply predicted the majority class (negative), we would make 2 errors instead of 4.

Based on example from M. desJardins & T. Finin

Page 35: Effect of Reduced-Error Pruning

Effect of Reduced-Error Pruning

On the training data it looks great, but that's not the case for the test (validation) data.

Based on slide by Pedro Domingos

Page 36: Effect of Reduced-Error Pruning

Effect of Reduced-Error Pruning

The tree is pruned back to the red line, where it gives more accurate results on the test data.

Based on slide by Pedro Domingos

Page 37: Summary: Decision Tree Learning

Summary: Decision Tree Learning
• Widely used in practice

• Strengths include:
– Fast and simple to implement
– Can convert to rules
– Handles noisy data

• Weaknesses include:
– Univariate splits/partitioning (using only one attribute at a time) limits the types of possible trees
– Large decision trees may be hard to understand
– Requires fixed-length feature vectors
– Non-incremental (i.e., a batch method)

Page 38: Summary: Decision Tree Learning

Summary: Decision Tree Learning
• Representation: decision trees
• Bias: prefer small decision trees
• Search algorithm: greedy
• Heuristic function: information gain or information content or others
• Overfitting / pruning

Slide by Pedro Domingos