
Classification: Decision Trees

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.

Last Time
• Common decomposition of machine learning, based on differences in inputs and outputs:
– Supervised Learning: Learn a function mapping inputs to outputs using labeled training data (you get instances/examples with both inputs and ground-truth output)
– Unsupervised Learning: Learn something about the data alone, without any labels (harder!), for example clustering instances that are “similar”
– Reinforcement Learning: Learn how to make decisions given a sparse reward

• ML is an interaction between:
– the input data: input to the function (features/attributes of the data)
– the function (model) you choose, and
– the optimization algorithm you use to explore the space of functions

• We are new at this, so let’s start by learning if/then rules for classification!

Function Approximation Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = {h | h : X → Y}

Input: Training examples of unknown target function f: {⟨x_i, y_i⟩}_{i=1}^n = {⟨x_1, y_1⟩, …, ⟨x_n, y_n⟩}

Output: Hypothesis h ∈ H that best approximates f

Based on slide by Tom Mitchell

Sample Dataset (was tennis played?)
• Columns denote features X_i
• Rows denote labeled instances ⟨x_i, y_i⟩
• Class label denotes whether a tennis game was played

Decision Tree
• A possible decision tree for the data:
• Each internal node: test one attribute X_i
• Each branch from a node: selects one value for X_i
• Each leaf node: predict Y

Based on slide by Tom Mitchell

Decision Tree
• A possible decision tree for the data:
• What prediction would we make for <outlook=sunny, temperature=hot, humidity=high, wind=weak>? (See the sketch below.)

Based on slide by Tom Mitchell
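The tree itself appears only in the slide figure, which is not reproduced here. A minimal sketch, assuming the usual PlayTennis tree from Mitchell's textbook (Outlook at the root, Humidity under sunny, Wind under rain), written as if/then rules in Python:

def predict_play_tennis(x):
    # Root test: outlook
    if x["outlook"] == "sunny":
        # Sunny branch tests humidity
        return "no" if x["humidity"] == "high" else "yes"
    if x["outlook"] == "overcast":
        return "yes"
    # Rain branch tests wind
    return "no" if x["wind"] == "strong" else "yes"

# The query from the slide:
print(predict_play_tennis({"outlook": "sunny", "temperature": "hot",
                           "humidity": "high", "wind": "weak"}))   # -> no

Under that (assumed) tree the answer is No: the sunny/high-humidity branch ends in a No leaf, and temperature and wind are never consulted.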

Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label – or a probability distribution over labels

(Figure: decision boundary)

Expressiveness

• Given a particular space of functions, you may not be able to represent everything

• What functions can decision trees represent?

• Decision trees can represent any function of the input attributes!
– Boolean operations (and, or, xor, etc.)? – Yes! (see the XOR sketch below)
– All Boolean functions? – Yes!
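As a small illustration (not from the slides): XOR needs a depth-two tree, splitting on one attribute at the root and on the other attribute in each branch. A sketch in Python:

def xor_tree(x1, x2):
    # Root splits on x1; each branch then splits on x2
    if x1:
        return not x2    # x1 = 1 branch: predict 1 only when x2 = 0
    return x2            # x1 = 0 branch: predict 1 only when x2 = 1

# Truth-table check: matches XOR
print([int(xor_tree(a, b)) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 0]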

Stages of (Batch) Machine Learning
Given: labeled training data X, Y = {⟨x_i, y_i⟩}_{i=1}^n
• Assumes each x_i ∼ D(X) with y_i = f_target(x_i)

Train the model:
model ← classifier.train(X, Y)

Apply the model to new data:
• Given: new unlabeled instance x ∼ D(X)
y_prediction ← model.predict(x)

(Diagram: training data X, Y feed a learner that outputs a model; a new instance x is fed to the model to produce y_prediction)

Basic Algorithm for Top-Down Learning of Decision Trees

[ID3, C4.5 by Quinlan]

node = root of decision tree
Main loop:
1. A ← the “best” decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop. Else, recurse over new leaf nodes.
(A Python sketch of this loop follows below.)
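A minimal recursive sketch of that loop, assuming examples are stored as (feature-dict, label) pairs; best_attribute is a hypothetical helper, filled in later with the Max-Gain / information-gain rule:

from collections import Counter

def build_tree(examples, attributes):
    # examples: list of (feature_dict, label) pairs; attributes: attribute names still available
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                    # perfectly classified: return a leaf
        return labels[0]
    if not attributes:                           # nothing left to split on: majority-label leaf
        return Counter(labels).most_common(1)[0][0]
    A = best_attribute(examples, attributes)     # step 1: pick the "best" attribute
    tree = {A: {}}                               # step 2: A is this node's decision attribute
    for v in {x[A] for x, _ in examples}:        # step 3: one descendant per value of A
        subset = [(x, y) for x, y in examples if x[A] == v]                 # step 4: sort examples
        tree[A][v] = build_tree(subset, [a for a in attributes if a != A])  # step 5: recurse
    return tree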

How do we choose which attribute is best?

Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples on

• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest number of possible values
– Most-Values: Choose the attribute with the largest number of possible values
– Max-Gain: Choose the attribute that has the largest expected information gain
• i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children

• The ID3 algorithm uses the Max-Gain method of selecting the best attribute


Information Gain

Which test is more informative?

(Figure: splitting the applicants over whether Balance exceeds 50K (branches: Less or equal 50K / Over 50K) versus splitting over whether the applicant is Employed (branches: Unemployed / Employed))

Based on slide by Pedro Domingos

Information Gain
• Impurity/Entropy (informal): measures the level of impurity in a group of examples

Based on slide by Pedro Domingos

Impurity
(Figure: three groups of examples: a very impure group, a less impure group, and a group with minimum impurity)

Based on slide by Pedro Domingos

Entropy: a common way to measure impurity

• Entropy = −∑_i p_i log₂ p_i
  where p_i is the probability of class i, computed as the proportion of class i in the set.

• Entropy comes from information theory. The higher the entropy, the more the information content.

What does that mean for learning from examples?

Based on slide by Pedro Domingos

2-Class Cases:

• What is the entropy of a group in which all examples belong to the same class?
– entropy = −1 · log₂ 1 = 0
– minimum impurity: not a good training set for learning

• What is the entropy of a group with 50% in either class?
– entropy = −0.5 · log₂ 0.5 − 0.5 · log₂ 0.5 = 1
– maximum impurity: a good training set for learning

(Both cases are checked numerically in the sketch below.)

Based on slide by Pedro Domingos
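A quick numerical check of those two cases, as a small sketch (the entropy helper is mine, not from the slides; it computes sample entropy from the class proportions):

import math
from collections import Counter

def entropy(labels):
    # Sample entropy: sum over classes of -p_i * log2(p_i), with p_i the class proportion
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["+"] * 10))              # all one class -> 0.0 (minimum impurity)
print(entropy(["+"] * 5 + ["-"] * 5))   # 50/50 split   -> 1.0 (maximum impurity)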

Entropy

• Entropy H(X) of a random variable X:
  H(X) = −∑_{i=1}^{n} P(X = i) log₂ P(X = i)
• Sample entropy: the same quantity computed from a data sample, using the empirical class proportions as P(X = i)
• Specific conditional entropy H(X | Y = v) of X given Y = v
• Conditional entropy H(X | Y) of X given Y
• Mutual information (aka Information Gain) of X and Y

Slide by Tom Mitchell
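The formulas for the last three quantities did not survive extraction; for reference, the standard information-theoretic definitions (my restatement, not copied from the slide, but consistent with how they are used below) are:

\begin{align*}
H(X) &= -\sum_i P(X=i)\,\log_2 P(X=i) \\
H(X \mid Y=v) &= -\sum_i P(X=i \mid Y=v)\,\log_2 P(X=i \mid Y=v) \\
H(X \mid Y) &= \sum_v P(Y=v)\, H(X \mid Y=v) \\
I(X;Y) &= H(X) - H(X \mid Y)
\end{align*}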


Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.

• Information gain tells us how important a given attribute of the feature vectors is.

• We will use it to decide the ordering of attributes in the nodes of a decision tree.

Based on slide by Pedro Domingos

From Entropy to Information Gain


Information Gain

Information Gain is the mutual information between input attribute A and target variable Y.

Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A. (A Max-Gain sketch in code follows below.)

Slide by Tom Mitchell
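A sketch of information gain and Max-Gain attribute selection, building on the entropy helper and the (feature-dict, label) data layout from the earlier sketches (the names are mine):

def information_gain(examples, A):
    # Gain(S, A) = H(Y) - H(Y | A), estimated from the sample
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[A] for x, _ in examples}:
        subset = [y for x, y in examples if x[A] == v]
        remainder += (len(subset) / len(examples)) * entropy(subset)   # weighted child entropy
    return entropy(labels) - remainder

def best_attribute(examples, attributes):
    # Max-Gain: the attribute with the largest information gain (used by build_tree above)
    return max(attributes, key=lambda A: information_gain(examples, A))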


Calculating Information Gain

Information Gain = entropy(parent) − [average entropy(children)]

Entire population (30 instances; 14 of one class, 16 of the other):
parent entropy = −(14/30) · log₂(14/30) − (16/30) · log₂(16/30) = 0.996

First child (17 instances; 13 vs. 4):
child entropy = −(13/17) · log₂(13/17) − (4/17) · log₂(4/17) = 0.787

Second child (13 instances; 1 vs. 12):
child entropy = −(1/13) · log₂(1/13) − (12/13) · log₂(12/13) = 0.391

(Weighted) average entropy of children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615

Information Gain = 0.996 − 0.615 = 0.38

(The arithmetic is verified in the sketch below.)

Based on slide by Pedro Domingos
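A quick check of that arithmetic with the entropy helper defined earlier (the class labels are placeholders):

parent = ["a"] * 14 + ["b"] * 16        # 30 instances
left   = ["a"] * 13 + ["b"] * 4         # 17 instances
right  = ["a"] * 1  + ["b"] * 12        # 13 instances

h_parent   = entropy(parent)
h_children = (17 / 30) * entropy(left) + (13 / 30) * entropy(right)
print(round(h_parent, 3), round(h_children, 3), round(h_parent - h_children, 3))
# prints: 0.997 0.616 0.381  (the slide rounds the intermediate entropies, giving 0.996, 0.615, 0.38)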


Decision Tree Learning Applet

• http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Which Tree Should We Output?

• ID3 performs a heuristic search through the space of decision trees

• It stops at the smallest acceptable tree. Why?

Occam’s razor: prefer the simplest hypothesis that fits the data

Slide by Tom Mitchell

Preference bias: Ockham’s Razor
• Principle stated by William of Ockham (1285-1347)
– “non sunt multiplicanda entia praeter necessitatem”
– entities are not to be multiplied beyond necessity
– AKA Occam’s Razor, Law of Economy, or Law of Parsimony

• Therefore, the smallest decision tree that correctly classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard
• ...So instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small

Idea: The simplest consistent explanation is the best

Overfitting in Decision Trees
• Many kinds of “noise” can occur in the examples:
– Two examples have the same attribute/value pairs, but different classifications
– Some values of attributes are incorrect because of errors in the data acquisition process or the preprocessing phase
– The instance was labeled incorrectly (+ instead of −)

• Also, some attributes are irrelevant to the decision-making process
– e.g., the color of a die is irrelevant to its outcome

Based on slide from M. desJardins & T. Finin

• Irrelevant attributes can result in overfitting the training example data
– If the hypothesis space has many dimensions (a large number of attributes), we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features

• If we have too little training data, even a reasonable hypothesis space will ‘overfit’

Based on slide from M. desJardins & T. Finin

Overfitting in Decision Trees
Consider adding a noisy training example to the following tree:

What would be the effect of adding <outlook=sunny, temperature=hot, humidity=normal, wind=strong, playTennis=No>?

Based on slide by Pedro Domingos

Overfitting in Decision Trees
(figures only)

Slide by Pedro Domingos

Avoiding Overfitting in Decision Trees
How can we avoid overfitting?
• Stop growing when the data split is not statistically significant
• Acquire more training data
• Remove irrelevant attributes (manual process – not always possible)
• Grow the full tree, then post-prune

How to select the “best” tree:
• Measure performance over the training data
• Measure performance over a separate validation data set
• Add a complexity penalty to the performance measure (heuristic: simpler is better)

Based on slide by Pedro Domingos

Reduced-Error Pruning

Split training data further into training and validation sets

Grow the tree based on the training set

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. Greedily remove the node that most improves validation-set accuracy

(A sketch of this loop in code follows below.)

Slide by Pedro Domingos
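A minimal sketch of that loop over the nested-dict trees produced by build_tree above. The representation is an assumption: candidate prunes are (parent, attribute, value, replacement_leaf) tuples, where the leaf would normally be the majority label of the training examples reaching that node (the code that enumerates candidates is not shown):

def predict_tree(tree, x):
    # Walk the nested-dict tree from build_tree; non-dict nodes are leaf labels
    while isinstance(tree, dict):
        A = next(iter(tree))
        tree = tree[A][x[A]]
    return tree

def accuracy(tree, examples):
    return sum(predict_tree(tree, x) == y for x, y in examples) / len(examples)

def reduced_error_prune(tree, valid, candidates):
    while True:
        base = accuracy(tree, valid)
        best = None
        for parent, attr, value, leaf in candidates:
            original = parent[attr][value]
            parent[attr][value] = leaf            # tentatively prune this subtree to a leaf
            acc = accuracy(tree, valid)
            parent[attr][value] = original        # undo the tentative prune
            if acc > base and (best is None or acc > best[0]):
                best = (acc, parent, attr, value, leaf)
        if best is None:                          # further pruning is harmful: stop
            return tree
        _, parent, attr, value, leaf = best       # greedily apply the most helpful prune
        parent[attr][value] = leaf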

Pruning Decision Trees
• Pruning of the decision tree is done by replacing a whole subtree by a leaf node.
• The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf.

• For example, consider a subtree that splits on Color:

Training set:    red branch: 1 positive, 0 negative    blue branch: 0 positive, 2 negative
Validation set:  red branch: 1 positive, 3 negative    blue branch: 1 positive, 1 negative

On the validation set the subtree gets 2 correct and 4 incorrect.
If we had simply predicted the majority class (negative), we would make 2 errors instead of 4.
(These counts are checked in the sketch below.)

Based on example from M. desJardins & T. Finin
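A quick check of those counts, with the validation examples hard-coded from the table above:

valid = [("red", "+")] + [("red", "-")] * 3 + [("blue", "+")] + [("blue", "-")]

subtree = {"red": "+", "blue": "-"}                        # what the training data taught the subtree
errors_subtree = sum(subtree[c] != y for c, y in valid)    # 4 incorrect (2 correct)
errors_leaf    = sum(y != "-" for c, y in valid)           # majority-class leaf: 2 errors
print(errors_subtree, errors_leaf)                         # -> 4 2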

Effect of Reduced-Error Pruning

• On the training data it looks great
• But that’s not the case for the test (validation) data
• The tree is pruned back to the red line, where it gives more accurate results on the test data

Based on slide by Pedro Domingos

Summary: Decision Tree Learning
• Widely used in practice

• Strengths include
– Fast and simple to implement
– Can convert to rules
– Handles noisy data

• Weaknesses include
– Univariate splits/partitioning (using only one attribute at a time) limits the types of possible trees
– Large decision trees may be hard to understand
– Requires fixed-length feature vectors
– Non-incremental (i.e., a batch method)

Summary: Decision Tree Learning
• Representation: decision trees
• Bias: prefer small decision trees
• Search algorithm: greedy
• Heuristic function: information gain or information content or others
• Overfitting / pruning

Slide by Pedro Domingos
