
Trees


Tree Structured Classifier

Reference: L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Chapman & Hall, 1984.

A Medical Example (CART)

• Goal: predict high risk patients, i.e., those who will not survive at least 30 days, on the basis of the initial 24-hour data.
• 19 variables are measured during the first 24 hours. These include blood pressure, age, etc.
• A tree structured classification rule is as follows:
  1. Is the minimum systolic blood pressure over the initial 24-hour period > 91? If no: high risk.
  2. If yes: is age > 62.5? If no: low risk.
  3. If yes: is sinus tachycardia present? If yes: high risk; if no: low risk.

Notation

• Denote the feature space by $\mathcal{X}$. The input vector $X \in \mathcal{X}$ contains $p$ features $X_1, X_2, \ldots, X_p$, some of which may be categorical.
• Tree structured classifiers are constructed by repeated splits of subsets of $\mathcal{X}$ into two descendant subsets, beginning with $\mathcal{X}$ itself.
• Definitions: node, terminal node (leaf node), parent node, child node.
• The union of the regions occupied by two child nodes is the region occupied by their parent node.
• Every leaf node is assigned a class; a query is assigned the class of the leaf node it lands in.
• Notation: a node is denoted by $t$, its left child node by $t_L$ and its right child by $t_R$. The collection of all nodes is denoted by $T$, and the collection of all leaf nodes by $\tilde{T}$. A split is denoted by $s$; the set of splits is denoted by $S$.

[Figure: four successive binary splits partition $\mathcal{X}$ into regions $X_1, \ldots, X_8$; each split divides one current region into two.]

The Three Elements

• The construction of a tree involves the following three elements:
  1. The selection of the splits.
  2. The decision when to declare a node terminal or to continue splitting it.
  3. The assignment of each terminal node to a class.
• In particular, we need to decide the following:
  1. A set $Q$ of binary questions of the form $\{\text{Is } X \in A?\}$, $A \subseteq \mathcal{X}$.
  2. A goodness of split criterion $\Phi(s, t)$ that can be evaluated for any split $s$ of any node $t$.
  3. A stop-splitting rule.
  4. A rule for assigning every terminal node to a class.

Standard Set of Questions

• The input vector $X = (X_1, X_2, \ldots, X_p)$ contains features of both categorical and ordered types.
• Each split depends on the value of only a single variable.
• For each ordered variable $X_j$, $Q$ includes all questions of the form $\{\text{Is } X_j \le c?\}$ for all real-valued $c$.
• Since the training data set is finite, there are only finitely many distinct splits that can be generated by the question $\{\text{Is } X_j \le c?\}$.
• If $X_j$ is categorical, taking values in, say, $\{1, 2, \ldots, M\}$, then $Q$ contains all questions of the form $\{\text{Is } X_j \in A?\}$, where $A$ ranges over all subsets of $\{1, 2, \ldots, M\}$.
• The splits for all $p$ variables constitute the standard set of questions; a sketch of how to enumerate them follows this list.
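As a minimal illustration of how finitely many distinct questions arise from a finite training set (function names are my own, not from the book): for an ordered variable the distinct thresholds reduce to midpoints between consecutive sorted unique values, and for a categorical variable to the non-trivial subsets of its levels, keeping one of each complementary pair.

```python
import numpy as np
from itertools import chain, combinations

def ordered_questions(x):
    """Distinct thresholds c for {Is X_j <= c?}: midpoints between
    consecutive sorted unique values of the ordered feature x."""
    v = np.unique(x)
    return (v[:-1] + v[1:]) / 2.0

def categorical_questions(levels):
    """Subsets A for {Is X_j in A?}. A question and its complement
    induce the same split, so keep only subsets containing the first level."""
    levels = sorted(set(levels))
    all_subsets = chain.from_iterable(
        combinations(levels, k) for k in range(1, len(levels)))
    return [set(a) for a in all_subsets if levels[0] in a]
```

For example, `ordered_questions(np.array([3.0, 1.0, 2.0, 2.0]))` yields thresholds 1.5 and 2.5, and `categorical_questions([1, 2, 3])` yields {1}, {1, 2}, {1, 3}: the $2^{M-1} - 1$ distinct splits of three levels.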
Goodness of Split

• The goodness of split is measured by an impurity function defined for each node.

[Figure: scatter plots of two classes (x's and o's); successive splits carve out regions in which one class dominates.]

• Intuitively, we want each leaf node to be pure, that is, dominated by one class.
• Definition: an impurity function is a function $\phi$ defined on the set of all $K$-tuples of numbers $(p_1, \ldots, p_K)$ satisfying $p_j \ge 0$, $j = 1, \ldots, K$, $\sum_j p_j = 1$, with the properties:
  1. $\phi$ achieves its maximum only at the point $(\frac{1}{K}, \frac{1}{K}, \ldots, \frac{1}{K})$.
  2. $\phi$ achieves its minimum only at the points $(1, 0, \ldots, 0)$, $(0, 1, 0, \ldots, 0)$, ..., $(0, 0, \ldots, 0, 1)$.
  3. $\phi$ is a symmetric function of $p_1, \ldots, p_K$: if you permute the $p_j$, $\phi$ remains constant.
• Definition: given an impurity function $\phi$, define the impurity measure $i(t)$ of a node $t$ as
$$i(t) = \phi\big(p(1 \mid t), p(2 \mid t), \ldots, p(K \mid t)\big),$$
where $p(j \mid t)$ is the estimated probability of class $j$ within node $t$.
• The goodness of a split $s$ for node $t$, denoted by $\Phi(s, t)$, is defined by
$$\Phi(s, t) = \Delta i(s, t) = i(t) - p_R\, i(t_R) - p_L\, i(t_L),$$
where $p_L$ and $p_R$ are the proportions of the samples in node $t$ that go to the left node $t_L$ and the right node $t_R$, respectively.
• Define $I(t) = i(t)\, p(t)$, that is, the impurity of node $t$ weighted by the estimated proportion of data that go to node $t$. The impurity of a tree $T$ is defined by
$$I(T) = \sum_{t \in \tilde{T}} I(t) = \sum_{t \in \tilde{T}} i(t)\, p(t).$$
• Note that for any node $t$ the following equations hold:
$$p(t_L) + p(t_R) = p(t), \qquad p_L = p(t_L)/p(t), \quad p_R = p(t_R)/p(t), \qquad p_L + p_R = 1.$$
• Define
$$
\begin{aligned}
\Delta I(s, t) &= I(t) - I(t_L) - I(t_R) \\
&= p(t)\, i(t) - p(t_L)\, i(t_L) - p(t_R)\, i(t_R) \\
&= p(t)\big(i(t) - p_L\, i(t_L) - p_R\, i(t_R)\big) \\
&= p(t)\, \Delta i(s, t).
\end{aligned}
$$
• Possible impurity functions (a code sketch follows this list):
  1. Entropy: $\sum_{j=1}^{K} p_j \log \frac{1}{p_j}$. If $p_j = 0$, use the limit $\lim_{p_j \to 0} p_j \log p_j = 0$.
  2. Misclassification rate: $1 - \max_j p_j$.
  3. Gini index: $\sum_{j=1}^{K} p_j (1 - p_j) = 1 - \sum_{j=1}^{K} p_j^2$.
• The Gini index seems to work best in practice for many problems.
• The twoing rule: at a node $t$, choose the split $s$ that maximizes
$$\frac{p_L\, p_R}{4} \Big[ \sum_j \big| p(j \mid t_L) - p(j \mid t_R) \big| \Big]^2.$$
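A minimal sketch of the three impurity measures and of $\Delta i(s, t)$, assuming integer-coded labels and a boolean mask describing the split (illustrative names, not from the book):

```python
import numpy as np

def impurity(labels, kind="gini"):
    """i(t): impurity of a node, from the labels that fall in it."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                  # p(j|t); strictly positive
    if kind == "gini":
        return 1.0 - np.sum(p ** 2)            # 1 - sum_j p_j^2
    if kind == "entropy":
        return -np.sum(p * np.log(p))          # sum_j p_j log(1/p_j)
    if kind == "misclass":
        return 1.0 - p.max()                   # 1 - max_j p_j
    raise ValueError(f"unknown impurity: {kind}")

def goodness_of_split(labels, go_left, kind="gini"):
    """Phi(s,t) = i(t) - p_L i(t_L) - p_R i(t_R); both children
    are assumed non-empty for a valid split."""
    p_left = go_left.mean()
    return (impurity(labels, kind)
            - p_left * impurity(labels[go_left], kind)
            - (1.0 - p_left) * impurity(labels[~go_left], kind))
```

For instance, `goodness_of_split(y, x[:, j] <= c)` scores the question $\{\text{Is } X_j \le c?\}$ at the current node.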

Estimating Posterior Probabilities of Classes in Each Node

• The total number of samples is $N$, and the number of samples in class $j$, $1 \le j \le K$, is $N_j$.
• The number of samples going to node $t$ is $N(t)$; the number of samples of class $j$ going to node $t$ is $N_j(t)$.
• $\sum_{j=1}^{K} N_j(t) = N(t)$ and $N_j(t_L) + N_j(t_R) = N_j(t)$.
• For a full (balanced) tree, the sum of $N(t)$ over all nodes $t$ at the same level is $N$.
• Denote the prior probability of class $j$ by $\pi_j$. The priors $\pi_j$ can be estimated from the data by $N_j/N$; sometimes priors are given beforehand.
• The estimated probability of a sample in class $j$ going to node $t$ is $p(t \mid j) = N_j(t)/N_j$. Then $p(t_L \mid j) + p(t_R \mid j) = p(t \mid j)$, and for a full tree the sum of $p(t \mid j)$ over all $t$ at the same level is 1.
• The joint probability of a sample being in class $j$ and going to node $t$ is thus
$$p(j, t) = \pi_j\, p(t \mid j) = \pi_j\, N_j(t)/N_j.$$
• The probability of any sample going to node $t$ is
$$p(t) = \sum_{j=1}^{K} p(j, t) = \sum_{j=1}^{K} \pi_j\, N_j(t)/N_j.$$
Note $p(t_L) + p(t_R) = p(t)$.
• The probability of a sample being in class $j$ given that it goes to node $t$ is
$$p(j \mid t) = p(j, t)/p(t), \qquad \sum_{j=1}^{K} p(j \mid t) = 1 \text{ for any } t.$$
• When $\pi_j = N_j/N$, we have the simplifications $p(j \mid t) = N_j(t)/N(t)$, $p(t) = N(t)/N$, and $p(j, t) = N_j(t)/N$. A code sketch of these estimates follows.
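The following sketch turns per-node class counts into the prior-weighted estimates above; when no priors are supplied it defaults to $\pi_j = N_j/N$, which reproduces the simplified formulas (illustrative code, not from the book):

```python
import numpy as np

def node_probabilities(n_jt, n_j, priors=None):
    """Given class counts N_j(t) at node t and class totals N_j,
    return p(t) and the posterior vector p(.|t)."""
    n_jt = np.asarray(n_jt, dtype=float)
    n_j = np.asarray(n_j, dtype=float)
    if priors is None:
        priors = n_j / n_j.sum()          # default: pi_j = N_j / N
    p_joint = priors * n_jt / n_j         # p(j, t) = pi_j N_j(t) / N_j
    p_t = p_joint.sum()                   # p(t) = sum_j p(j, t)
    return p_t, p_joint / p_t             # p(j|t) = p(j, t) / p(t)

# With default priors this reduces to p(t) = N(t)/N, p(j|t) = N_j(t)/N(t):
p_t, post = node_probabilities([3, 1], [30, 10])  # p_t = 0.1, post = [0.75, 0.25]
```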

Stopping Criteria

• A simple criterion: stop splitting a node $t$ when
$$\max_{s \in S} \Delta I(s, t) < \beta,$$
where $\beta$ is a chosen threshold.
• This stopping criterion is unsatisfactory: a node with a small decrease of impurity after one step of splitting may have a large decrease after multiple levels of splits.

Class Assignment Rule

• A class assignment rule assigns a class $j \in \{1, \ldots, K\}$ to every terminal node $t \in \tilde{T}$. The class assigned to node $t \in \tilde{T}$ is denoted by $\kappa(t)$.
• For 0-1 loss, the class assignment rule is
$$\kappa(t) = \arg\max_j p(j \mid t).$$
• The resubstitution estimate $r(t)$ of the probability of misclassification, given that a case falls into node $t$, is
$$r(t) = 1 - \max_j p(j \mid t) = 1 - p(\kappa(t) \mid t).$$
• Denote $R(t) = r(t)\, p(t)$. The resubstitution estimate for the overall misclassification rate $R(T)$ of the tree classifier $T$ is
$$R(T) = \sum_{t \in \tilde{T}} R(t).$$
A sketch of these computations follows.
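A small sketch of the 0-1 loss assignment and the resubstitution error, given the leaf posteriors $p(\cdot \mid t)$ and leaf probabilities $p(t)$ (illustrative, not from the book):

```python
import numpy as np

def resubstitution_error(leaf_posteriors, leaf_p):
    """kappa(t) = argmax_j p(j|t), r(t) = 1 - max_j p(j|t),
    and R(T) = sum over leaves of r(t) p(t)."""
    kappa = [int(np.argmax(p)) for p in leaf_posteriors]
    r = np.array([1.0 - np.max(p) for p in leaf_posteriors])
    return kappa, r, float(r @ np.asarray(leaf_p))
```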

• Proposition: for any split of a node $t$ into $t_L$ and $t_R$,
$$R(t) \ge R(t_L) + R(t_R).$$
Proof: denote $j^* = \kappa(t)$. Then
$$
\begin{aligned}
p(j^* \mid t) &= p(j^*, t_L \mid t) + p(j^*, t_R \mid t) \\
&= p(j^* \mid t_L)\, p(t_L \mid t) + p(j^* \mid t_R)\, p(t_R \mid t) \\
&= p_L\, p(j^* \mid t_L) + p_R\, p(j^* \mid t_R) \\
&\le p_L \max_j p(j \mid t_L) + p_R \max_j p(j \mid t_R).
\end{aligned}
$$
Hence,
$$
\begin{aligned}
r(t) = 1 - p(j^* \mid t) &\ge 1 - \Big[ p_L \max_j p(j \mid t_L) + p_R \max_j p(j \mid t_R) \Big] \\
&= p_L\big(1 - \max_j p(j \mid t_L)\big) + p_R\big(1 - \max_j p(j \mid t_R)\big) \\
&= p_L\, r(t_L) + p_R\, r(t_R).
\end{aligned}
$$
Finally,
$$
\begin{aligned}
R(t) = p(t)\, r(t) &\ge p(t)\, p_L\, r(t_L) + p(t)\, p_R\, r(t_R) \\
&= p(t_L)\, r(t_L) + p(t_R)\, r(t_R) = R(t_L) + R(t_R).
\end{aligned}
$$

Digit Recognition Example (CART)

• The ten digits are shown by different on-off combinations of seven horizontal and vertical lights (a seven-segment display).
• Each digit is represented by a 7-dimensional vector of zeros and ones. The $i$-th sample is $x_i = (x_{i1}, x_{i2}, \ldots, x_{i7})$. If $x_{ij} = 1$, the $j$-th light is on; if $x_{ij} = 0$, the $j$-th light is off.

Digit  x1  x2  x3  x4  x5  x6  x7
  1     0   0   1   0   0   1   0
  2     1   0   1   1   1   0   1
  3     1   0   1   1   0   1   1
  4     0   1   1   1   0   1   0
  5     1   1   0   1   0   1   1
  6     1   1   0   1   1   1   1
  7     1   0   1   0   0   1   0
  8     1   1   1   1   1   1   1
  9     1   1   1   1   0   1   1
  0     1   1   1   0   1   1   1

[Figure: layout of the seven lights x1-x7 on the display.]

• The data for the example are generated by a malfunctioning calculator: each of the seven lights has probability 0.1 of being in the wrong state, independently. (A sketch of this generating process appears after this section.)
• The training set contains 200 samples generated according to the specified distribution.
• A tree structured classifier is applied. The set of questions $Q$ contains: $\{\text{Is } x_j = 0?\}$, $j = 1, 2, \ldots, 7$. The twoing rule is used in splitting, and pruning with cross-validation is used to choose the right sized tree.
• Classification performance:
  1. The error rate estimated using a test set of size 5000 is 0.30.
  2. The error rate estimated by cross-validation using the training set is 0.30.
  3. The resubstitution estimate of the error rate is 0.29.
  4. The Bayes error rate is 0.26, so there is little room for improvement over the tree classifier.
• [Figure: the fitted tree; yes/no branches on questions of the form $x_j = 0$ route each digit to a leaf.] Accidentally, every digit occupies exactly one leaf node. In general, one class may occupy any number of leaf nodes and occasionally no leaf node. $X_6$ and $X_7$ are never used.
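A minimal sketch of the data-generating process, using the segment patterns from the table above (function and variable names are my own):

```python
import numpy as np

# Rows of the table above: segment patterns for digits 1-9, then 0.
SEGMENTS = np.array([
    [0, 0, 1, 0, 0, 1, 0],   # 1
    [1, 0, 1, 1, 1, 0, 1],   # 2
    [1, 0, 1, 1, 0, 1, 1],   # 3
    [0, 1, 1, 1, 0, 1, 0],   # 4
    [1, 1, 0, 1, 0, 1, 1],   # 5
    [1, 1, 0, 1, 1, 1, 1],   # 6
    [1, 0, 1, 0, 0, 1, 0],   # 7
    [1, 1, 1, 1, 1, 1, 1],   # 8
    [1, 1, 1, 1, 0, 1, 1],   # 9
    [1, 1, 1, 0, 1, 1, 1],   # 0
])

def sample_digits(n, flip_prob=0.1, seed=0):
    """Malfunctioning calculator: each light is independently in the
    wrong state with probability flip_prob."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 10, size=n)              # digit index into SEGMENTS
    flips = rng.random((n, 7)) < flip_prob       # which lights misfire
    x = np.bitwise_xor(SEGMENTS[y], flips.astype(int))
    return x, y

X_train, y_train = sample_digits(200)            # training set of size 200
```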

Waveform Example (CART)

• Three functions $h_1(\tau)$, $h_2(\tau)$, $h_3(\tau)$ are shifted versions of each other. Each $h_j$ is specified by an equilateral right-triangle (tent) function, and its values at the integers $\tau = 1, \ldots, 21$ are measured. [Figure: the three triangular waveforms $h_1$, $h_2$, $h_3$ plotted over $\tau = 1, 3, 5, \ldots, 21$.]
• The three classes of waveforms are random convex combinations of two of these waveforms plus independent Gaussian noise. Each sample is a 21-dimensional vector containing the values of the random waveform measured at $\tau = 1, 2, \ldots, 21$.
• To generate a sample in class 1, a random number $u$ uniformly distributed in $[0, 1]$ and 21 random numbers $\epsilon_1, \epsilon_2, \ldots, \epsilon_{21}$ normally distributed with mean zero and variance 1 are generated, and
$$x_j = u\, h_1(j) + (1 - u)\, h_2(j) + \epsilon_j, \quad j = 1, \ldots, 21.$$
• To generate a sample in class 2, repeat the above process to generate $u$ and $\epsilon_1, \ldots, \epsilon_{21}$ and set
$$x_j = u\, h_1(j) + (1 - u)\, h_3(j) + \epsilon_j, \quad j = 1, \ldots, 21.$$
• Class 3 vectors are generated by
$$x_j = u\, h_2(j) + (1 - u)\, h_3(j) + \epsilon_j, \quad j = 1, \ldots, 21.$$
[Figure: example random waveforms from each of the three classes.]
• 300 random samples are generated using prior probabilities $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ for training. (See the generator sketch at the end of this section.)
• Construction of the tree:
  1. The set of questions: $\{\text{Is } x_j \le c?\}$ for $c$ ranging over all real numbers and $j = 1, \ldots, 21$.
  2. The Gini index is used for measuring goodness of split.
  3. The final tree is selected by pruning and cross-validation.
• Results:
  1. The cross-validation estimate of the misclassification rate is 0.29.
  2. The misclassification rate on a separate test set of size 5000 is 0.28.
  3. The Bayes classification rule can be derived; applying this rule to the test set yields a misclassification rate of 0.14.
• [Figure: the final pruned tree for the waveform data, with sample counts at each node.]
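A sketch of the waveform generator. The notes do not spell out the tent parameterization; the version below assumes the common CART specification of height-6 tents peaking at $\tau = 11, 15, 7$ for $h_1, h_2, h_3$, consistent with the shifted triangles in the figure (an assumption, flagged in the comments):

```python
import numpy as np

def tent(center, tau):
    """Height-6 triangular bump centered at `center` (assumed
    parameterization; the notes only show shifted triangle shapes)."""
    return np.maximum(6.0 - np.abs(tau - center), 0.0)

def sample_waveforms(n, seed=0):
    """Random convex combinations of two tents plus N(0,1) noise,
    with equal class priors (1/3, 1/3, 1/3)."""
    rng = np.random.default_rng(seed)
    tau = np.arange(1, 22)                       # tau = 1, ..., 21
    h = [tent(c, tau) for c in (11, 15, 7)]      # h1, h2, h3 (assumed centers)
    pairs = {1: (0, 1), 2: (0, 2), 3: (1, 2)}    # class -> pair of waveforms
    y = rng.integers(1, 4, size=n)
    X = np.empty((n, 21))
    for i in range(n):
        a, b = pairs[y[i]]
        u = rng.random()
        X[i] = u * h[a] + (1 - u) * h[b] + rng.standard_normal(21)
    return X, y

X_train, y_train = sample_waveforms(300)         # training set of size 300
```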