Decision Trees - Penn Engineering · 2019-01-22


Decision Trees

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.

Function Approximation Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = { h | h : X → Y }

Input: Training examples {⟨xi, yi⟩}i=1..n = {⟨x1, y1⟩, ..., ⟨xn, yn⟩} of unknown target function f

Output: Hypothesis h ∈ H that best approximates f

Based on slide by Tom Mitchell
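To make the setting concrete, here is a small illustrative sketch in Python (the names and types are hypothetical, not from the slides): instances are feature dictionaries, labels are strings, and a hypothesis is any function from instances to labels.

    # Hypothetical encoding of the function-approximation setting (illustration only)
    from typing import Callable, Dict

    Instance = Dict[str, str]                   # an element x of X
    Label = str                                 # an element y of Y
    Hypothesis = Callable[[Instance], Label]    # an element h of H, h : X -> Y

    def h_example(x: Instance) -> Label:
        # one candidate hypothesis: play tennis unless the outlook is rainy
        return "No" if x.get("Outlook") == "Rain" else "Yes"

    # training examples {<x_i, y_i>} of the unknown target function f
    training_data = [({"Outlook": "Sunny"}, "Yes"), ({"Outlook": "Rain"}, "No")]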

Sample Dataset
• Columns denote features Xi
• Rows denote labeled instances ⟨xi, yi⟩
• Class label denotes whether a tennis game was played

Decision Tree
• A possible decision tree for the data:
• Each internal node: test one attribute Xi
• Each branch from a node: selects one value for Xi
• Each leaf node: predict Y (or p(Y | x ∈ leaf))

Based on slide by Tom Mitchell

Decision Tree
• A possible decision tree for the data:
• What prediction would we make for <outlook=sunny, temperature=hot, humidity=high, wind=weak>?

Based on slide by Tom Mitchell

Decision Tree
• If features are continuous, internal nodes can test the value of a feature against a threshold


Decision Tree Learning

Problem Setting:
• Set of possible instances X
  – each instance x in X is a feature vector x = <x1, x2, ..., xn>
  – e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
• Unknown target function f : X → Y
  – Y is discrete valued
• Set of function hypotheses H = { h | h : X → Y }
  – each hypothesis h is a decision tree
  – trees sort x to a leaf, which assigns y

Input:
• Training examples {<x(i), y(i)>} of unknown target function f

Output:
• Hypothesis h ∈ H that best approximates target function f

Slide by Tom Mitchell

Stages of (Batch) Machine Learning

Given: labeled training data X, Y = {⟨xi, yi⟩}i=1..n
• Assumes each xi ∼ D(X) with yi = ftarget(xi)

Train the model:
model ← classifier.train(X, Y)

Apply the model to new data:
• Given: new unlabeled instance x ∼ D(X)
yprediction ← model.predict(x)

[Diagram: X, Y → learner → model; x → model → yprediction]
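As a concrete sketch of this batch workflow (an assumption for illustration: scikit-learn's DecisionTreeClassifier stands in for the generic classifier, which the slides do not prescribe):

    # Minimal sketch of the train/predict stages, assuming scikit-learn is available
    from sklearn.tree import DecisionTreeClassifier

    X = [[0, 1], [1, 1], [1, 0], [0, 0]]    # labeled training feature vectors
    Y = [0, 1, 1, 0]                        # labels (here y happens to equal the first feature)

    model = DecisionTreeClassifier()        # the learner
    model.fit(X, Y)                         # model <- classifier.train(X, Y)

    x_new = [[1, 1]]                        # new unlabeled instance x
    y_prediction = model.predict(x_new)     # y_prediction <- model.predict(x)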

Example Application: A Tree to Predict Caesarean Section Risk


Decision Trees

Suppose X = <X1, ..., Xn>, where the Xi are boolean variables.

How would you represent Y = X2 ∧ X5? Y = X2 ∨ X5?

How would you represent Y = X2 ∧ X5 ∨ X3 ∧ X4 ∧ (¬X1)?

Based on example by Tom Mitchell
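As an illustration (a hypothetical hand-built tree, not from the slides), Y = X2 ∧ X5 can be represented by testing X2 at the root and X5 only on the X2 = True branch, while Y = X2 ∨ X5 needs the second test on the X2 = False branch:

    # Boolean functions written as decision trees (nested attribute tests)
    def y_and(x):              # x: dict of boolean attributes X1..Xn
        if x["X2"]:
            return x["X5"]     # X2 = True: the answer is X5
        return False           # X2 = False: X2 AND X5 is False

    def y_or(x):
        if x["X2"]:
            return True        # X2 = True: X2 OR X5 is True
        return x["X5"]         # X2 = False: the answer is X5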

Decision Tree Induced Partition

[Figure: a decision tree over Color (red / green / blue), Shape (round / square), and Size (big / small), with leaves labeled + and −, showing how the tree partitions the examples]

Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label – or a probability distribution over labels

[Figure: decision boundary]

Expressiveness
• Decision trees can represent any boolean function of the input attributes
• In the worst case, the tree will require exponentially many nodes

Truth table row → path to leaf

Expressiveness
Decision trees have a variable-sized hypothesis space
• As the # of nodes (or depth) increases, the hypothesis space grows
  – Depth 1 ("decision stump"): can represent any boolean function of one feature
  – Depth 2: any boolean function of two features; some involving three features (e.g., (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3))
  – etc.

Based on slide by Pedro Domingos
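The depth-2 example can be made concrete with a short sketch (illustrative, not from the slides): (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3) needs only a root test on x1 and one further test per branch.

    # A depth-2 decision tree computing (x1 AND x2) OR (NOT x1 AND NOT x3)
    def h(x1: bool, x2: bool, x3: bool) -> bool:
        if x1:
            return x2          # x1 = True branch: the function reduces to x2
        return not x3          # x1 = False branch: the function reduces to NOT x3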

Another Example: Restaurant Domain (Russell & Norvig)

Model a patron’s decision of whether to wait for a table at a restaurant

~7,000 possible cases

A Decision Tree from Introspection

Is this the best decision tree?

Preference bias: Ockham's Razor
• Principle stated by William of Ockham (1285-1347)
  – "non sunt multiplicanda entia praeter necessitatem"
  – entities are not to be multiplied beyond necessity
  – AKA Occam's Razor, Law of Economy, or Law of Parsimony

• Therefore, the smallest decision tree that correctly classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard
• ...So instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small

Idea: The simplest consistent explanation is the best

Basic Algorithm for Top-Down Induction of Decision Trees

[ID3,C4.5byQuinlan]

node = root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to the leaf nodes.
5. If the training examples are perfectly classified, stop. Else, recurse over the new leaf nodes.

How do we choose which attribute is best?
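A rough Python skeleton of this main loop is sketched below; the attribute-selection step is deliberately left abstract (it is the subject of the next slides), and all names are illustrative rather than taken from ID3/C4.5.

    # Sketch of top-down induction; choose_best is supplied by the caller
    def grow_tree(examples, attributes, choose_best):
        labels = [y for _, y in examples]
        if len(set(labels)) == 1:                   # perfectly classified: stop
            return labels[0]
        if not attributes:                          # no attributes left: majority label
            return max(set(labels), key=labels.count)
        A = choose_best(examples, attributes)       # 1. pick the "best" attribute
        node = {"attribute": A, "children": {}}     # 2. A is the decision attribute for node
        for v in {x[A] for x, _ in examples}:       # 3. one descendant per value of A
            subset = [(x, y) for x, y in examples if x[A] == v]     # 4. sort examples
            rest = [a for a in attributes if a != A]
            node["children"][v] = grow_tree(subset, rest, choose_best)   # 5. recurse
        return node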

Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples on

• Some possibilities are:
  – Random: Select any attribute at random
  – Least-Values: Choose the attribute with the smallest number of possible values
  – Most-Values: Choose the attribute with the largest number of possible values
  – Max-Gain: Choose the attribute that has the largest expected information gain
    • i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children

• The ID3 algorithm uses the Max-Gain method of selecting the best attribute

Choosing an Attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

Which split is more informative: Patrons? or Type?

Based on slide from M. desJardins & T. Finin

ID3-induced Decision Tree

Based on slide from M. desJardins & T. Finin

Compare the Two Decision Trees

Based on slide from M. desJardins & T. Finin


Information Gain

Which test is more informative?

[Figure: two candidate splits – one over whether Balance exceeds 50K (branches: over 50K / less than or equal to 50K), and one over whether the applicant is employed (branches: employed / unemployed)]

Based on slide by Pedro Domingos

Information Gain

Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples

Based on slide by Pedro Domingos

Impurity

[Figure: three groups of examples – a very impure group, a less impure group, and a group with minimum impurity]

Based on slide by Pedro Domingos


Entropy

Entropy H(X) of a random variable X:
H(X) = − Σi P(X = i) log2 P(X = i)   (the sum ranges over the # of possible values for X)

H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)

Why? Information theory:
• The most efficient code assigns −log2 P(X = i) bits to encode the message X = i
• So, the expected number of bits to code one random X is Σi P(X = i) · (−log2 P(X = i))

Entropy: a common way to measure impurity

Slide by Tom Mitchell


Example: Huffman code
• In 1952, MIT student David Huffman devised, in the course of doing a homework assignment, an elegant coding scheme which is optimal in the case where all symbols' probabilities are integral powers of 1/2.
• A Huffman code can be built in the following manner:
  – Rank all symbols in order of probability of occurrence
  – Successively combine the two symbols of the lowest probability to form a new composite symbol; eventually we will build a binary tree where each node is the probability of all nodes beneath it
  – Trace a path to each leaf, noticing the direction at each node

Based on slide from M. desJardins & T. Finin

Huffman code example

Symbol probabilities: A = .125, B = .125, C = .25, D = .5

[Figure: the Huffman tree built by repeatedly merging the two lowest-probability nodes (.125 + .125 = .25, .25 + .25 = .5, .5 + .5 = 1), with 0/1 edge labels giving each symbol's code]

M    code   length   prob    length × prob
A    000    3        0.125   0.375
B    001    3        0.125   0.375
C    01     2        0.250   0.500
D    1      1        0.500   0.500
                     average message length: 1.750

If we use this code for many messages (A, B, C, or D) with this probability distribution, then, over time, the average bits/message should approach 1.75

Based on slide from M. desJardins & T. Finin
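The construction above can be sketched in a few lines of Python (illustrative code, not from the slides), using a heap to repeatedly merge the two lowest-probability symbols; for this A/B/C/D distribution it reproduces the code lengths 3, 3, 2, 1 and the 1.75-bit average.

    # Build Huffman code lengths by merging the two least-probable nodes
    import heapq

    def huffman_code_lengths(probs):
        # heap entries: (probability, tie-breaker, [(symbol, depth), ...])
        heap = [(p, i, [(s, 0)]) for i, (s, p) in enumerate(sorted(probs.items()))]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            p1, _, leaves1 = heapq.heappop(heap)    # two smallest probabilities
            p2, _, leaves2 = heapq.heappop(heap)
            merged = [(s, d + 1) for s, d in leaves1 + leaves2]   # one level deeper
            heapq.heappush(heap, (p1 + p2, next_id, merged))
            next_id += 1
        return dict(heap[0][2])

    probs = {"A": 0.125, "B": 0.125, "C": 0.25, "D": 0.5}
    lengths = huffman_code_lengths(probs)                 # {'A': 3, 'B': 3, 'C': 2, 'D': 1}
    avg = sum(lengths[s] * p for s, p in probs.items())   # 1.75 bits per message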


2-Class Cases:

• What is the entropy of a group in which all examples belong to the same class?
  – entropy = −1 · log2 1 = 0
  – minimum impurity: not a good training set for learning

• What is the entropy of a group with 50% in either class?
  – entropy = −0.5 · log2 0.5 − 0.5 · log2 0.5 = 1
  – maximum impurity: a good training set for learning

Based on slide by Pedro Domingos

Entropy
H(X) = − Σi=1..n P(x = i) log2 P(x = i)
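A one-line Python version of this definition (illustrative, not from the slides) reproduces the two boundary cases above: entropy 0 for a pure group and 1 bit for a 50/50 split.

    # Entropy in bits of a discrete distribution given as a list of probabilities
    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    entropy([1.0])         # 0.0 -> minimum impurity (all examples in one class)
    entropy([0.5, 0.5])    # 1.0 -> maximum impurity for two classes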

Sample Entropy

Sample entropy of a set S: H(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖, where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples

Entropy H(X) of a random variable X:
H(X) = − Σv P(X = v) log2 P(X = v)

Specific conditional entropy H(X | Y = v) of X given Y = v:
H(X | Y = v) = − Σx P(X = x | Y = v) log2 P(X = x | Y = v)

Conditional entropy H(X | Y) of X given Y:
H(X | Y) = Σv P(Y = v) H(X | Y = v)

Mutual information (aka Information Gain) of X and Y:
I(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)

Slide by Tom Mitchell
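These definitions translate directly into code; the sketch below (illustrative names, not from the slides) computes sample entropy, conditional entropy, and mutual information from two lists of observed values.

    # Sample entropy, conditional entropy, and mutual information (information gain)
    import math
    from collections import Counter

    def H(values):
        # empirical entropy of a list of discrete values, in bits
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

    def H_given(X, Y):
        # conditional entropy H(X | Y) = sum_v P(Y = v) * H(X | Y = v)
        n = len(Y)
        total = 0.0
        for v in set(Y):
            X_v = [x for x, y in zip(X, Y) if y == v]   # X restricted to Y = v
            total += (len(X_v) / n) * H(X_v)
        return total

    def information_gain(X, Y):
        # mutual information I(X, Y) = H(X) - H(X | Y)
        return H(X) - H_given(X, Y)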


Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.

Based on slide by Pedro Domingos

From Entropy to Information Gain

Entropy H(X) of a random variable X

Specific conditional entropy H(X | Y = v) of X given Y = v

Conditional entropy H(X | Y) of X given Y

Mutual information (aka Information Gain) of X and Y: I(X, Y) = H(X) − H(X | Y)

Slide by Tom Mitchell


Information Gain

Information Gain is the mutual information between the input attribute A and the target variable Y

Information Gain is the expected reduction in entropy of the target variable Y for data sample S, due to sorting on variable A:
Gain(S, A) = H_S(Y) − H_S(Y | A)

Slide by Tom Mitchell


Calculating Information Gain

Information Gain = entropy(parent) − [weighted average entropy(children)]

Entire population (30 instances):
parent entropy = −(14/30) · log2(14/30) − (16/30) · log2(16/30) = 0.996

First child (17 instances):
child entropy = −(13/17) · log2(13/17) − (4/17) · log2(4/17) = 0.787

Second child (13 instances):
child entropy = −(1/13) · log2(1/13) − (12/13) · log2(12/13) = 0.391

(Weighted) Average Entropy of Children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615

Information Gain = 0.996 − 0.615 = 0.38

Based on slide by Pedro Domingos
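The arithmetic above is easy to double-check in Python (an illustrative verification, not part of the slides), using class counts of 14/16 for the parent and 13/4 and 1/12 for the two children.

    # Verify the worked example: entropies from class counts, then the gain
    import math

    def entropy_from_counts(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    parent   = entropy_from_counts([14, 16])                 # ~0.996
    child_a  = entropy_from_counts([13, 4])                  # ~0.787 (17 instances)
    child_b  = entropy_from_counts([1, 12])                  # ~0.391 (13 instances)
    children = (17 / 30) * child_a + (13 / 30) * child_b     # ~0.615
    gain     = parent - children                             # ~0.38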


Entropy-Based Automatic Decision Tree Construction

Node 1: What feature should be used? What values?

Training Set X:
x1 = (f11, f12, ..., f1m)
x2 = (f21, f22, ..., f2m)
...
xn = (fn1, fn2, ..., fnm)

Quinlan suggested information gain in his ID3 system and later the gain ratio, both based on entropy.

Based on slide by Pedro Domingos


Using Information Gain to Construct a Decision Tree

Choose the attribute A with the highest information gain for the full training set X at the root of the tree.

Construct child nodes for each value of A. Each has an associated subset of vectors in which A has a particular value, e.g. X′ = {x ∈ X | value(A) = v1}.

Repeat recursively – till when?

[Figure: the root tests Attribute A on the full training set X, with branches v1, v2, ..., vk leading to subsets such as X′]

Disadvantage of information gain:
• It prefers attributes with a large number of values that split the data into small, pure subsets
• Quinlan's gain ratio uses normalization to improve this

Based on slide by Pedro Domingos
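For reference, the standard statement of Quinlan's gain ratio from C4.5 (a well-known formula, given here for completeness rather than taken from the slide) divides the information gain by the entropy of the split itself, which penalizes attributes with many values:

    % Gain ratio: normalize information gain by the split information
    \mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)},
    \qquad
    \mathrm{SplitInfo}(S, A) = - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

where S_v is the subset of S for which attribute A has value v.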


Decision Tree Applet

http://webdocs.cs.ualberta.ca/~aixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Decision Tree Learning Applet
• http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Which Tree Should We Output?

• ID3 performs a heuristic search through the space of decision trees
• It stops at the smallest acceptable tree. Why?

Occam's razor: prefer the simplest hypothesis that fits the data

Slide by Tom Mitchell

The ID3 algorithm builds a decision tree, given a set of non-categorical attributes C1, C2, ..., Cn, the class attribute C, and a training set T of records.

function ID3(R: input attributes, C: class attribute, S: training set) returns decision tree;

If S is empty, return a single node with value Failure;

If every example in S has the same value for C, return a single node with that value;

If R is empty, then return a single node with the most frequent of the values of C found in the examples of S;  # causes errors -- improperly classified records

Let D be the attribute with the largest Gain(D, S) among the attributes in R;

Let {dj | j = 1, 2, ..., m} be the values of attribute D;

Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting of records with value dj for attribute D;

Return a tree with root labeled D and arcs labeled d1, ..., dm going to the trees ID3(R−{D}, C, S1), ..., ID3(R−{D}, C, Sm);

Based on slide from M. desJardins & T. Finin
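A compact Python rendering of this pseudocode is sketched below. The data layout is an assumption for illustration: records are dictionaries mapping attribute names to values, and the returned tree is either a label or a nested dictionary keyed by the chosen attribute.

    # Sketch of ID3: R = input attributes, C = class attribute, S = list of record dicts
    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(D, S, C):
        # expected reduction in entropy of C from splitting S on attribute D
        before = entropy([rec[C] for rec in S])
        after = 0.0
        for d in {rec[D] for rec in S}:
            subset = [rec[C] for rec in S if rec[D] == d]
            after += (len(subset) / len(S)) * entropy(subset)
        return before - after

    def id3(R, C, S):
        if not S:                                         # S is empty
            return "Failure"
        labels = [rec[C] for rec in S]
        if len(set(labels)) == 1:                         # every record agrees on C
            return labels[0]
        if not R:                                         # no attributes left
            return Counter(labels).most_common(1)[0][0]   # most frequent value of C
        D = max(R, key=lambda a: gain(a, S, C))           # attribute with largest Gain(D, S)
        children = {d: id3([a for a in R if a != D], C,
                           [rec for rec in S if rec[D] == d])
                    for d in {rec[D] for rec in S}}       # one subtree per value of D
        return {D: children}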

How well does it work?

Many case studies have shown that decision trees are at least as accurate as human experts.
– A study for diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly
– British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms that replaced an earlier rule-based expert system
– Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example

Based on slide from M. desJardins & T. Finin
