
Classification: Decision Trees

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.

Last Time
• Common decomposition of machine learning, based on differences in inputs and outputs:
– Supervised Learning: Learn a function mapping inputs to outputs using labeled training data (you get instances/examples with both inputs and ground-truth output)
– Unsupervised Learning: Learn something about the data alone, without any labels (harder!), for example clustering instances that are “similar”
– Reinforcement Learning: Learn how to make decisions given a sparse reward

• ML is an interaction between:
– the input data: input to the function (features/attributes of the data)
– the function (model) you choose, and
– the optimization algorithm you use to explore the space of functions

• We are new at this, so let’s start by learning if/then rules for classification!

Function Approximation Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = {h | h : X → Y}

Input: Training examples of unknown target function f: {⟨x_i, y_i⟩}_{i=1}^n = {⟨x_1, y_1⟩, …, ⟨x_n, y_n⟩}

Output: Hypothesis h ∈ H that best approximates f

Based on slide by Tom Mitchell

Sample Dataset (was tennis played?)
• Columns denote features X_i
• Rows denote labeled instances ⟨x_i, y_i⟩
• Class label denotes whether a tennis game was played

Decision Tree
• A possible decision tree for the data:
• Each internal node: test one attribute X_i
• Each branch from a node: selects one value for X_i
• Each leaf node: predict Y

Based on slide by Tom Mitchell

Decision Tree
• A possible decision tree for the data:
• What prediction would we make for <outlook=sunny, temperature=hot, humidity=high, wind=weak>? (See the sketch below.)

Based on slide by Tom Mitchell
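The tree itself appears only in the slide figure, which is not reproduced here. A minimal sketch, assuming the usual PlayTennis tree from Mitchell's textbook (Outlook at the root, Humidity under sunny, Wind under rain), written as if/then rules in Python:

def predict_play_tennis(x):
    # Root test: outlook
    if x["outlook"] == "sunny":
        # Sunny branch tests humidity
        return "no" if x["humidity"] == "high" else "yes"
    if x["outlook"] == "overcast":
        return "yes"
    # Rain branch tests wind
    return "no" if x["wind"] == "strong" else "yes"

# The query from the slide:
print(predict_play_tennis({"outlook": "sunny", "temperature": "hot",
                           "humidity": "high", "wind": "weak"}))   # -> no

Under that (assumed) tree the answer is No: the sunny/high-humidity branch ends in a No leaf, and temperature and wind are never consulted.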

Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label – or a probability distribution over labels

(Figure: decision boundary)

Expressiveness

• Given a particular space of functions, you may not be able to represent everything

• What functions can decision trees represent?

• Decision trees can represent any function of the input attributes!
– Boolean operations (and, or, xor, etc.)? – Yes! (see the XOR sketch below)
– All Boolean functions? – Yes!
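As a small illustration (not from the slides): XOR needs a depth-two tree, splitting on one attribute at the root and on the other attribute in each branch. A sketch in Python:

def xor_tree(x1, x2):
    # Root splits on x1; each branch then splits on x2
    if x1:
        return not x2    # x1 = 1 branch: predict 1 only when x2 = 0
    return x2            # x1 = 0 branch: predict 1 only when x2 = 1

# Truth-table check: matches XOR
print([int(xor_tree(a, b)) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 0]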

Stages of (Batch) Machine Learning
Given: labeled training data X, Y = {⟨x_i, y_i⟩}_{i=1}^n
• Assumes each x_i ∼ D(X) with y_i = f_target(x_i)

Train the model:
model ← classifier.train(X, Y)

Apply the model to new data:
• Given: new unlabeled instance x ∼ D(X)
y_prediction ← model.predict(x)

(Diagram: training data X, Y feed a learner that outputs a model; a new instance x is fed to the model to produce y_prediction)

Basic Algorithm for Top-Down Learning of Decision Trees

[ID3, C4.5 by Quinlan]

node = root of decision tree
Main loop:
1. A ← the “best” decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop. Else, recurse over new leaf nodes.
(A Python sketch of this loop follows below.)
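A minimal recursive sketch of that loop, assuming examples are stored as (feature-dict, label) pairs; best_attribute is a hypothetical helper, filled in later with the Max-Gain / information-gain rule:

from collections import Counter

def build_tree(examples, attributes):
    # examples: list of (feature_dict, label) pairs; attributes: attribute names still available
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                    # perfectly classified: return a leaf
        return labels[0]
    if not attributes:                           # nothing left to split on: majority-label leaf
        return Counter(labels).most_common(1)[0][0]
    A = best_attribute(examples, attributes)     # step 1: pick the "best" attribute
    tree = {A: {}}                               # step 2: A is this node's decision attribute
    for v in {x[A] for x, _ in examples}:        # step 3: one descendant per value of A
        subset = [(x, y) for x, y in examples if x[A] == v]                 # step 4: sort examples
        tree[A][v] = build_tree(subset, [a for a in attributes if a != A])  # step 5: recurse
    return tree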

How do we choose which attribute is best?

Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples on

• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest number of possible values
– Most-Values: Choose the attribute with the largest number of possible values
– Max-Gain: Choose the attribute that has the largest expected information gain
• i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children

• The ID3 algorithm uses the Max-Gain method of selecting the best attribute


Information Gain

Which test is more informative?

(Figure: splitting the applicants over whether Balance exceeds 50K (branches: Less or equal 50K / Over 50K) versus splitting over whether the applicant is Employed (branches: Unemployed / Employed))

Based on slide by Pedro Domingos

Information Gain
• Impurity/Entropy (informal): measures the level of impurity in a group of examples

Based on slide by Pedro Domingos

Impurity
(Figure: three groups of examples: a very impure group, a less impure group, and a group with minimum impurity)

Based on slide by Pedro Domingos

Entropy: a common way to measure impurity

• Entropy = −∑_i p_i log₂ p_i
  where p_i is the probability of class i, computed as the proportion of class i in the set.

• Entropy comes from information theory. The higher the entropy, the more the information content.

What does that mean for learning from examples?

Based on slide by Pedro Domingos

2-Class Cases:

• What is the entropy of a group in which all examples belong to the same class?
– entropy = −1 · log₂ 1 = 0
– minimum impurity: not a good training set for learning

• What is the entropy of a group with 50% in either class?
– entropy = −0.5 · log₂ 0.5 − 0.5 · log₂ 0.5 = 1
– maximum impurity: a good training set for learning

(Both cases are checked numerically in the sketch below.)

Based on slide by Pedro Domingos
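A quick numerical check of those two cases, as a small sketch (the entropy helper is mine, not from the slides; it computes sample entropy from the class proportions):

import math
from collections import Counter

def entropy(labels):
    # Sample entropy: sum over classes of -p_i * log2(p_i), with p_i the class proportion
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["+"] * 10))              # all one class -> 0.0 (minimum impurity)
print(entropy(["+"] * 5 + ["-"] * 5))   # 50/50 split   -> 1.0 (maximum impurity)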

Entropy

• Entropy H(X) of a random variable X:
  H(X) = −∑_{i=1}^{n} P(X = i) log₂ P(X = i)
• Sample entropy: the same quantity computed from a data sample, using the empirical class proportions as P(X = i)
• Specific conditional entropy H(X | Y = v) of X given Y = v
• Conditional entropy H(X | Y) of X given Y
• Mutual information (aka Information Gain) of X and Y

Slide by Tom Mitchell
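The formulas for the last three quantities did not survive extraction; for reference, the standard information-theoretic definitions (my restatement, not copied from the slide, but consistent with how they are used below) are:

\begin{align*}
H(X) &= -\sum_i P(X=i)\,\log_2 P(X=i) \\
H(X \mid Y=v) &= -\sum_i P(X=i \mid Y=v)\,\log_2 P(X=i \mid Y=v) \\
H(X \mid Y) &= \sum_v P(Y=v)\, H(X \mid Y=v) \\
I(X;Y) &= H(X) - H(X \mid Y)
\end{align*}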


Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.

• Information gain tells us how important a given attribute of the feature vectors is.

• We will use it to decide the ordering of attributes in the nodes of a decision tree.

Based on slide by Pedro Domingos

From Entropy to Information Gain


Information Gain

Information Gain is the mutual information between input attribute A and target variable Y.

Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A. (A Max-Gain sketch in code follows below.)

Slide by Tom Mitchell
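A sketch of information gain and Max-Gain attribute selection, building on the entropy helper and the (feature-dict, label) data layout from the earlier sketches (the names are mine):

def information_gain(examples, A):
    # Gain(S, A) = H(Y) - H(Y | A), estimated from the sample
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[A] for x, _ in examples}:
        subset = [y for x, y in examples if x[A] == v]
        remainder += (len(subset) / len(examples)) * entropy(subset)   # weighted child entropy
    return entropy(labels) - remainder

def best_attribute(examples, attributes):
    # Max-Gain: the attribute with the largest information gain (used by build_tree above)
    return max(attributes, key=lambda A: information_gain(examples, A))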


Calculating Information Gain

Information Gain = entropy(parent) − [average entropy(children)]

Entire population (30 instances; 14 of one class, 16 of the other):
parent entropy = −(14/30) · log₂(14/30) − (16/30) · log₂(16/30) = 0.996

First child (17 instances; 13 vs. 4):
child entropy = −(13/17) · log₂(13/17) − (4/17) · log₂(4/17) = 0.787

Second child (13 instances; 1 vs. 12):
child entropy = −(1/13) · log₂(1/13) − (12/13) · log₂(12/13) = 0.391

(Weighted) average entropy of children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615

Information Gain = 0.996 − 0.615 = 0.38

(The arithmetic is verified in the sketch below.)

Based on slide by Pedro Domingos
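A quick check of that arithmetic with the entropy helper defined earlier (the class labels are placeholders):

parent = ["a"] * 14 + ["b"] * 16        # 30 instances
left   = ["a"] * 13 + ["b"] * 4         # 17 instances
right  = ["a"] * 1  + ["b"] * 12        # 13 instances

h_parent   = entropy(parent)
h_children = (17 / 30) * entropy(left) + (13 / 30) * entropy(right)
print(round(h_parent, 3), round(h_children, 3), round(h_parent - h_children, 3))
# prints: 0.997 0.616 0.381  (the slide rounds the intermediate entropies, giving 0.996, 0.615, 0.38)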


Decision Tree Learning Applet

• http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Which Tree Should We Output?

• ID3 performs a heuristic search through the space of decision trees

• It stops at the smallest acceptable tree. Why?

Occam’s razor: prefer the simplest hypothesis that fits the data

Slide by Tom Mitchell

Preference bias: Ockham’s Razor
• Principle stated by William of Ockham (1285-1347)
– “non sunt multiplicanda entia praeter necessitatem”
– entities are not to be multiplied beyond necessity
– AKA Occam’s Razor, Law of Economy, or Law of Parsimony

• Therefore, the smallest decision tree that correctly classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard
• ...So instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small

Idea: The simplest consistent explanation is the best

Overfitting in Decision Trees
• Many kinds of “noise” can occur in the examples:
– Two examples have the same attribute/value pairs, but different classifications
– Some values of attributes are incorrect because of errors in the data acquisition process or the preprocessing phase
– The instance was labeled incorrectly (+ instead of −)

• Also, some attributes are irrelevant to the decision-making process
– e.g., the color of a die is irrelevant to its outcome

Based on slide from M. desJardins & T. Finin

• Irrelevant attributes can result in overfitting the training example data
– If the hypothesis space has many dimensions (a large number of attributes), we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features

• If we have too little training data, even a reasonable hypothesis space will ‘overfit’

Based on slide from M. desJardins & T. Finin

Overfitting in Decision Trees
Consider adding a noisy training example to the following tree:

What would be the effect of adding <outlook=sunny, temperature=hot, humidity=normal, wind=strong, playTennis=No>?

Based on slide by Pedro Domingos

Overfitting in Decision Trees
(figures only)

Slide by Pedro Domingos

Avoiding Overfitting in Decision Trees
How can we avoid overfitting?
• Stop growing when the data split is not statistically significant
• Acquire more training data
• Remove irrelevant attributes (manual process – not always possible)
• Grow the full tree, then post-prune

How to select the “best” tree:
• Measure performance over the training data
• Measure performance over a separate validation data set
• Add a complexity penalty to the performance measure (heuristic: simpler is better)

Based on slide by Pedro Domingos

Reduced-Error Pruning

Split training data further into training and validation sets

Grow the tree based on the training set

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. Greedily remove the node that most improves validation-set accuracy

(A sketch of this loop in code follows below.)

Slide by Pedro Domingos
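A minimal sketch of that loop over the nested-dict trees produced by build_tree above. The representation is an assumption: candidate prunes are (parent, attribute, value, replacement_leaf) tuples, where the leaf would normally be the majority label of the training examples reaching that node (the code that enumerates candidates is not shown):

def predict_tree(tree, x):
    # Walk the nested-dict tree from build_tree; non-dict nodes are leaf labels
    while isinstance(tree, dict):
        A = next(iter(tree))
        tree = tree[A][x[A]]
    return tree

def accuracy(tree, examples):
    return sum(predict_tree(tree, x) == y for x, y in examples) / len(examples)

def reduced_error_prune(tree, valid, candidates):
    while True:
        base = accuracy(tree, valid)
        best = None
        for parent, attr, value, leaf in candidates:
            original = parent[attr][value]
            parent[attr][value] = leaf            # tentatively prune this subtree to a leaf
            acc = accuracy(tree, valid)
            parent[attr][value] = original        # undo the tentative prune
            if acc > base and (best is None or acc > best[0]):
                best = (acc, parent, attr, value, leaf)
        if best is None:                          # further pruning is harmful: stop
            return tree
        _, parent, attr, value, leaf = best       # greedily apply the most helpful prune
        parent[attr][value] = leaf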

Pruning Decision Trees
• Pruning of the decision tree is done by replacing a whole subtree by a leaf node.
• The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf.

• For example, consider a subtree that splits on Color:

Training set:    red branch: 1 positive, 0 negative    blue branch: 0 positive, 2 negative
Validation set:  red branch: 1 positive, 3 negative    blue branch: 1 positive, 1 negative

On the validation set the subtree gets 2 correct and 4 incorrect.
If we had simply predicted the majority class (negative), we would make 2 errors instead of 4.
(These counts are checked in the sketch below.)

Based on example from M. desJardins & T. Finin
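A quick check of those counts, with the validation examples hard-coded from the table above:

valid = [("red", "+")] + [("red", "-")] * 3 + [("blue", "+")] + [("blue", "-")]

subtree = {"red": "+", "blue": "-"}                        # what the training data taught the subtree
errors_subtree = sum(subtree[c] != y for c, y in valid)    # 4 incorrect (2 correct)
errors_leaf    = sum(y != "-" for c, y in valid)           # majority-class leaf: 2 errors
print(errors_subtree, errors_leaf)                         # -> 4 2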

Effect of Reduced-Error Pruning

• On the training data it looks great
• But that’s not the case for the test (validation) data
• The tree is pruned back to the red line, where it gives more accurate results on the test data

Based on slide by Pedro Domingos

Summary: Decision Tree Learning
• Widely used in practice

• Strengths include
– Fast and simple to implement
– Can convert to rules
– Handles noisy data

• Weaknesses include
– Univariate splits/partitioning (using only one attribute at a time) limits the types of possible trees
– Large decision trees may be hard to understand
– Requires fixed-length feature vectors
– Non-incremental (i.e., a batch method)

Summary: Decision Tree Learning
• Representation: decision trees
• Bias: prefer small decision trees
• Search algorithm: greedy
• Heuristic function: information gain or information content or others
• Overfitting / pruning

Slide by Pedro Domingos
