
Page 1: Conditional trees

Conditional Trees
or
Unbiased recursive partitioning: A conditional inference framework

Christoph Molnar
Supervisor: Stephanie Möst

Department of Statistics, LMU

18 December 2012

1 / 36

Page 2: Conditional trees

Overview

Introduction and Motivation

Algorithm for unbiased trees

Conditional inference with permutation tests

Examples

Properties

Summary

2 / 36

Page 3: Conditional trees

CART trees

Model: Y = f(X)

Structure of decision trees: recursive partitioning of the covariable space X
Split optimizes a criterion (Gini, information gain, sum of squares) depending on the scale of Y
Split point search: exhaustive search procedure
Avoid overfitting: early stopping or pruning
Usage: prediction and explanation
Other tree types: ID3, C4.5, CHAID, ...

3 / 36

Page 4: Conditional trees

What are conditional trees?

Special kind of trees
Recursive partitioning with binary splits and early stopping
Constant models in the terminal nodes
Variable selection, early stopping and split point search based on conditional inference
Uses permutation tests for inference
Solves problems of CART trees

4 / 36

Page 5: Conditional trees

Why conditional trees?

Helps to overcome problems of trees:
Overfitting (can be solved with other techniques as well)
Selection bias towards covariables with many possible splits (i.e. numeric or multi-categorial)
Difficult interpretation due to selection bias
Variable selection: no concept of statistical significance
Not all scales of Y and X covered (ID3, C4.5, ...)

5 / 36

Page 6: Conditional trees

Simulation: selection bias

Variable selection is unbiased ⇔ the probability of selecting a covariable that is independent of Y is the same for all independent covariables
The measurement scale of a covariable shouldn't play a role
Simulation illustrating the selection bias (an R sketch follows below):
Y ∼ N(0, 1)
X1 ∼ M(n, 1/2, 1/2)
X2 ∼ M(n, 1/3, 1/3, 1/3)
X3 ∼ M(n, 1/4, 1/4, 1/4, 1/4)

6 / 36
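As an illustration (not part of the original slides), the simulation can be sketched in a few lines of R; the sample size n = 100 and the 1000 replications are assumptions, not values from the talk. It records which covariable rpart uses for the first split, although none of them is related to Y.

library(rpart)

set.seed(1)
n <- 100
first_split <- replicate(1000, {
  dat <- data.frame(
    y  = rnorm(n),                                # Y ~ N(0, 1), independent of all X
    x1 = factor(sample(1:2, n, replace = TRUE)),  # 2 equally likely categories
    x2 = factor(sample(1:3, n, replace = TRUE)),  # 3 categories
    x3 = factor(sample(1:4, n, replace = TRUE))   # 4 categories
  )
  fit <- rpart(y ~ ., data = dat)
  as.character(fit$frame$var[1])                  # splitting variable at the root, "<leaf>" if no split
})
prop.table(table(first_split))                    # relative selection frequencies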

Page 7: Conditional trees

Simulation: results

Selection frequencies for the first split: X1: 0.128, X2: 0.302, X3: 0.556, none: 0.014

[Bar plot of the selection frequencies for X1, X2, X3 and "none"]

Strongly biased towards variables with many possible splits
Example of a tree:

[Fitted rpart tree: root split on x3 = 1,2, second split on x2 = 1,3; terminal node means −0.19, −0.098 and 0.36]

Overfitting! (Note: complexity parameter not cross-validated)
Desirable here: no split at all
Problem source: exhaustive search through all variables and all possible split points
Numeric/multi-categorial covariables have more split options ⇒ multiple comparison problem

7 / 36

Page 8: Conditional trees

Idea of conditional trees

Variable selection and search for the split point ⇒ two steps
Embed all decisions into hypothesis tests
All tests use conditional inference (permutation tests)

8 / 36

Page 9: Conditional trees

Ctree algorithm

1 Stop criterion: Test the global null hypothesis H0 of independence between Y and all Xj, with H0 = ∩_{j=1}^m H0^j and H0^j : D(Y|Xj) = D(Y). If H0 is not rejected ⇒ stop

2 Select the variable Xj* with the strongest association
3 Search the best split point for Xj* and partition the data
4 Repeat steps 1), 2) and 3) for both of the new partitions

9 / 36

Page 10: Conditional trees

How can we test the hypothesis of independence?

Parametric tests depend on distribution assumptions
Problem: unknown conditional distribution
D(Y|X) = D(Y|X1, ..., Xm) = D(Y|f(X1), ..., f(Xm))

Need for a general framework that can handle arbitrary scales

Let the data speak ⇒ permutation tests!

10 / 36

Page 11: Conditional trees

Excursion: permutation tests

11 / 36

Page 12: Conditional trees

Permutation tests: simple example

Possible treatments for a disease: A or B
Numeric measurement (blood value)
Question: Do the blood values differ between treatment A and B? ⇔ µB ≠ µA?
Test statistic: T0 = µA − µB
H0: µA − µB = 0, H1: µA − µB ≠ 0
Distribution unknown ⇒ permutation test

[Dot plot of the measurements y by treatment group A and B]

T0 = µA − µB = 2.06 − 1.2 = 0.86

12 / 36

Page 13: Conditional trees

Permute

Original data:

B   B   B   B   A   A   A   A   B   A
0.5 0.9 1.1 1.2 1.5 1.9 2.0 2.1 2.3 2.8

One possible permutation:

B   B   B   B   A   A   A   A   B   A
2.8 2.3 1.1 1.9 1.2 2.1 1.5 0.5 0.9 2.0

Permute the group labels (A and B) relative to the numeric measurements
Calculate the test statistic T for each permutation
Do this for all possible permutations
Result: distribution of the test statistic, conditional on the sample

13 / 36

Page 14: Conditional trees

P-value and decision

k = #{permutation samples : |µA,perm − µB,perm| > |µA − µB|}
p-value = k / #permutations

p-value < α = 0.05? ⇒ If yes, H0 can be rejected

[Density plot of the permutation distribution of the difference of means per treatment, with the original test statistic marked]

14 / 36

Page 15: Conditional trees

General algorithm for permutation tests

Requirement: under H0, response and covariables are exchangeable
Do the following:

1 Calculate the test statistic T0
2 Calculate the test statistic T for all permutations of the pairs (Y, X)
3 Compute n_extreme: the number of T which are more extreme than T0
4 p-value p = n_extreme / n_permutations
5 Reject H0 if p < α, with significance level α

If the number of possible permutations is too big, draw random permutations in 2) (Monte Carlo sampling); see the R sketch below

15 / 36
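The treatment example from the previous slides can be written down directly in R. This is a minimal Monte Carlo sketch; the 10,000 random permutations are an assumption, while the ten observations are the ones shown on the "Permute" slide.

set.seed(42)
y     <- c(0.5, 0.9, 1.1, 1.2, 1.5, 1.9, 2.0, 2.1, 2.3, 2.8)
group <- c("B", "B", "B", "B", "A", "A", "A", "A", "B", "A")

## observed test statistic T0 = mean(A) - mean(B) = 2.06 - 1.2 = 0.86
t_obs <- mean(y[group == "A"]) - mean(y[group == "B"])

## Monte Carlo permutation distribution: permute the labels, recompute T
t_perm <- replicate(10000, {
  g <- sample(group)
  mean(y[g == "A"]) - mean(y[g == "B"])
})

## two-sided p-value: fraction of permutations at least as extreme as T0
mean(abs(t_perm) >= abs(t_obs))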

Page 16: Conditional trees

Framework by Strasser and Weber

General test statistic:

Tj(Ln, w) = vec( ∑_{i=1}^n wi gj(Xij) h(Yi, (Y1, ..., Yn))^T ) ∈ R^(pj·q)

h is called the influence function, gj is a transformation of Xj
Choose gj and h depending on the scale
It is possible to calculate µ and Σ of T

Standardized test statistic: c(t, µ, Σ) = max_{k=1,...,pq} | (t − µ)_k / √((Σ)_kk) |

Why so complex? ⇒ Covers all cases: multicategorial X or Y, different scales

16 / 36

Page 17: Conditional trees

End of excursion
Let's get back to business

17 / 36

Page 18: Conditional trees

Ctree algorithm with permutation tests

1 Stop criterion: Test the global null hypothesis H0 of independence between Y and all Xj, with H0 = ∩_{j=1}^m H0^j and H0^j : D(Y|Xj) = D(Y) (permutation tests for each Xj). If H0 is not rejected (no significant Xj) ⇒ stop

2 Select the variable Xj* with the strongest association (smallest p-value)

3 Search the best split point for Xj* (maximal test statistic c) and partition the data

4 Repeat steps 1), 2) and 3) for both of the new partitions

(A self-contained R sketch of this recursion follows below.)

18 / 36
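To make the recursion concrete, here is a self-contained R sketch for a numeric response and numeric covariables. It is an illustration only, not the party implementation: the per-covariable permutation test uses the absolute correlation as its statistic (h = Y, g = X), the split statistic is a simple standardized mean difference in the spirit of c, and the Bonferroni correction, B = 999 permutations and the minimum node size of 20 are assumptions.

perm_pvalue <- function(y, x, B = 999) {
  ## permutation test of independence between y and a single covariable x
  t_obs  <- abs(cor(x, y))
  t_perm <- replicate(B, abs(cor(x, sample(y))))
  (1 + sum(t_perm >= t_obs)) / (B + 1)
}

grow <- function(y, X, alpha = 0.05) {
  ## 1) stop criterion: Bonferroni-adjusted permutation p-values for all covariables
  p_adj <- pmin(vapply(X, perm_pvalue, numeric(1), y = y) * ncol(X), 1)
  if (min(p_adj) >= alpha || length(y) < 20)
    return(list(type = "leaf", prediction = mean(y), n = length(y)))

  ## 2) variable selection: smallest adjusted p-value
  j <- which.min(p_adj)

  ## 3) split point search: standardized two-sample statistic over all cut points
  cuts   <- head(sort(unique(X[[j]])), -1)
  c_stat <- vapply(cuts, function(s) {
    a <- y[X[[j]] <= s]; b <- y[X[[j]] > s]
    abs(mean(a) - mean(b)) / sqrt(var(y) * (1 / length(a) + 1 / length(b)))
  }, numeric(1))
  split <- cuts[which.max(c_stat)]

  ## 4) recurse into both partitions
  left <- X[[j]] <= split
  list(type = "node", variable = names(X)[j], split = split,
       left  = grow(y[left],  X[left,  , drop = FALSE], alpha),
       right = grow(y[!left], X[!left, , drop = FALSE], alpha))
}

Called as grow(bodyfat$DEXfat, bodyfat[, c("hipcirc", "waistcirc", "kneebreadth")]) on the data used later, it returns a nested list describing the partition; ctree() in the examples below does the same job properly, with the full Strasser-Weber statistics and arbitrary scales of Y and X.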

Page 19: Conditional trees

Permutation tests for stop criterion

Choose the influence function h for Y
Choose a transformation function g for each Xj
Test each variable Xj separately for association with Y
(H0^j : D(Y|Xj) = D(Y), i.e. variable Xj has no influence on Y)
Global H0 = ∩_{j=1}^m H0^j : no variable has influence on Y
Test the global H0: multiple testing ⇒ adjust α (Bonferroni correction, ...)

19 / 36
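Equivalently one can adjust the individual permutation p-values rather than α; a minimal sketch with made-up p-values:

p <- c(x1 = 0.21, x2 = 0.004, x3 = 0.38)   # per-covariable permutation p-values (illustrative)
p_adj <- p.adjust(p, method = "bonferroni")
any(p_adj < 0.05)                          # global H0 rejected? if FALSE: stop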

Page 20: Conditional trees

Permutation tests for variable selection

Choose the variable with the smallest p-value for the split
Note: switching to a comparison of p-values removes the scaling problem between covariables of different types

20 / 36

Page 21: Conditional trees

Test statistic for best split point

Use a test statistic instead of Gini/SSE for the split point search

Tj^A(Ln, w) = vec( ∑_{i=1}^n wi I(Xji ∈ A) · h(Yi, (Y1, ..., Yn))^T )

Standardized test statistic: c = max_k | (Tj^A − µ)_k / √((Σ)_kk) |

Measures the discrepancy between {Yi | Xji ∈ A} and {Yi | Xji ∉ A}
Calculate c for all possible splits; choose the split point with maximal c
Covers different scales of Y and X

21 / 36

Page 22: Conditional trees

Usage examples with R
- Let's get the party started -

22 / 36

Page 23: Conditional trees

Bodyfat: example for continuous regression

Example: bodyfat data
Predict body fat from anthropometric measurements
Data: measurements of 71 healthy women
Response Y: body fat measured by DXA (numeric)
Covariables X: different body measurements (numeric), for example waist circumference, breadth of the knee, ...

h(Yi) = Yi
g(Xij) = Xij

Tj(Ln, w) = ∑_{i=1}^n wi Xij Yi

c = | (t − µ)/σ | ∝ | ∑_{i: node} Xij Yi − n_node · X̄j · Ȳ | / √( (∑_{i: node} (Yi − Ȳ)²) · (∑_{i: node} (Xij − X̄j)²) )
(the absolute Pearson correlation coefficient)

23 / 36
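Because c is proportional to the absolute Pearson correlation here, the quantity driving the variable selection can be inspected directly; a small check on the bodyfat data used on the next slide (covariable names as they appear in the fitted tree):

data(bodyfat, package = "mboost")
## |cor(Xj, DEXfat)| per covariable; within a node, c differs from this only by a constant factor
sapply(bodyfat[, c("hipcirc", "waistcirc", "kneebreadth")],
       function(x) abs(cor(x, bodyfat$DEXfat)))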

Page 24: Conditional trees

Bodyfat: R-code

library("party")library("rpart")library("rpart.plot")data(bodyfat, package = "mboost")## conditional treecond_tree <- ctree(DEXfat ~ ., data = bodyfat)## normal treeclassic_tree <- rpart(DEXfat ~ ., data = bodyfat)

24 / 36

Page 25: Conditional trees

Bodyfat: conditional tree

plot(cond_tree)

[Plot of the conditional tree: Node 1 splits on hipcirc (p < 0.001) at 108.
For hipcirc ≤ 108: Node 2 splits on anthro3c (p < 0.001) at 3.76;
  for anthro3c ≤ 3.76, Node 3 splits on anthro3c (p = 0.001) at 3.39 into terminal Node 4 (n = 13) and Node 5 (n = 12);
  for anthro3c > 3.76, Node 6 splits on waistcirc (p = 0.003) at 86 into terminal Node 7 (n = 13) and Node 8 (n = 7).
For hipcirc > 108: Node 9 splits on kneebreadth (p = 0.006) at 10.6 into terminal Node 10 (n = 19) and Node 11 (n = 7).
Terminal nodes show boxplots of DEXfat on a scale of roughly 10-60.]

25 / 36

Page 26: Conditional trees

Bodyfat: CART tree

rpart.plot(classic_tree)

[rpart tree plot: splits on waistcir < 88, anthro3c < 3.4, hipcirc < 101 and hipcirc < 110; five terminal nodes with predicted DEXfat values of about 17, 23, 30, 35 and 45.]

⇒ Structurally different trees!

26 / 36

Page 27: Conditional trees

Glaucoma: example for classification

Predict glaucoma (an eye disease) based on laser scanning measurements
Response Y: binary, y ∈ {glaucoma, normal}
Covariables X: different volumes and areas of the eye (all numeric)

h = eJ(Yi) = (1, 0)^T if glaucoma, (0, 1)^T if normal
g(Xij) = Xij

Tj(Ln, w) = vec( ∑_{i=1}^n wi Xij eJ(Yi)^T ) = ( n_glaucoma · X̄j,glaucoma , n_normal · X̄j,normal )^T

c ∝ max | n_group · (X̄j,group − X̄j,node) | ,  group ∈ {glaucoma, normal}

27 / 36
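For a binary response the linear statistic Tj is simply the vector of per-class sums of Xj. A quick check on the GlaucomaM data used on the next slide (vari is one of its covariables):

data("GlaucomaM", package = "ipred")
## Tj for the covariable vari: (sum of vari among glaucoma cases, sum among normals)
with(GlaucomaM, tapply(vari, Class, sum))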

Page 28: Conditional trees

Glaucoma: R-code

library("rpart")library("party")data("GlaucomaM", package = "ipred")cond_tree <- ctree(Class ~ ., data = GlaucomaM)classic_tree <- rpart(Class ~ ., data = GlaucomaM)

28 / 36

Page 29: Conditional trees

Glaucoma: conditional tree

[Plot of the conditional tree: inner nodes with n = 196, 87 and 109; terminal Node 3 (n = 79), Node 4 (n = 8), Node 6 (n = 65) and Node 7 (n = 44), each showing the proportions of glaucoma vs. normal.]

## 1) vari <= 0.059; criterion = 1, statistic = 71.475
##   2) vasg <= 0.066; criterion = 1, statistic = 29.265
##     3)* weights = 79
##   2) vasg > 0.066
##     4)* weights = 8
## 1) vari > 0.059
##   5) tms <= -0.066; criterion = 0.951, statistic = 11.221
##     6)* weights = 65
##   5) tms > -0.066
##     7)* weights = 44

29 / 36

Page 30: Conditional trees

Glaucoma: CART tree

rpart.plot(classic_tree, cex = 1.5)

[rpart tree plot: splits on varg < 0.21, mhcg >= 0.17, vars < 0.064, tms >= −0.066 and eas < 0.45; six terminal nodes, four labelled glaucoma and two labelled normal.]

30 / 36

Page 31: Conditional trees

Appendix: Examples of other scales

Y categorial, X categorial: h = eJ(Yi), g = eK(Xij)
⇒ T is the vectorized contingency table of Xj and Y

[Mosaic plot of Xj vs. Y with Pearson residuals (scale −2.08 to 1.64); p-value = 0.009]

Y and Xj numeric: h = rg(Yi), g = rg(Xij) ⇒ Spearman's rho

Flexible T for different situations: multivariate regression, ordinal regression, censored regression, ...

31 / 36
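Both special cases are one-liners in R; the toy vectors below are made up for illustration, not data from the talk.

## categorial Y and Xj: T is the vectorized contingency table
xj <- factor(c(1, 1, 2, 3, 3, 2))
y  <- factor(c("a", "b", "a", "a", "b", "b"))
as.vector(table(xj, y))

## numeric Y and Xj with rank transformations: Spearman's rho
cor(c(2.3, 1.1, 5.0, 3.2), c(10, 4, 12, 7), method = "spearman")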

Page 32: Conditional trees

Properties

Prediction accuracy: not better than normal trees, but not worse either
Computational considerations: same speed as normal trees
Two possible interpretations of the significance level α:

1. Pre-specified nominal level of the underlying association tests
2. Simple hyperparameter determining the tree size

Low α yields smaller trees

32 / 36
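In the party package the significance level enters through ctree_control(), whose mincriterion argument is 1 − α; a lower α therefore gives a smaller tree. A sketch, reusing the bodyfat data from the examples above:

library("party")
data(bodyfat, package = "mboost")
## alpha = 0.01 instead of the default 0.05 => fewer splits survive the stop criterion
small_tree <- ctree(DEXfat ~ ., data = bodyfat,
                    controls = ctree_control(mincriterion = 0.99))
plot(small_tree)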

Page 33: Conditional trees

Summary conditional trees

Not heuristics, but non-parametric models with a well-defined theoretical background
Suitable for regression with arbitrary scales of Y and X
Unbiased variable selection
No overfitting
Conditional trees are structurally different from trees partitioned with exhaustive search procedures

33 / 36

Page 34: Conditional trees

Literature and Software

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, 2001.

T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674, 2006.

H. Strasser and C. Weber. On the asymptotic theory of permutation statistics. Mathematical Methods of Statistics, 8:220-250, 1999.

R packages:
rpart: Recursive partitioning
rpart.plot: Plot function for rpart
party: A Laboratory for Recursive Partytioning

All available on CRAN

34 / 36

Page 35: Conditional trees

Appendix: Competitors

Other partitioning algorithms in this area:
CHAID: nominal response, χ² test, multiway splits, nominal covariables
GUIDE: continuous response only, p-value from χ² test, categorizes continuous covariables
QUEST: ANOVA F-test for continuous response, χ² test for nominal response, compare on the p-value scale ⇒ reduces selection bias
CRUISE: multiway splits, discriminant analysis in each node, unbiased variable selection

35 / 36

Page 36: Conditional trees

Appendix: Properties of test statistic T

µj = E(Tj(Ln, w) | S(Ln, w)) = vec( ( ∑_{i=1}^n wi gj(Xji) ) E(h | S(Ln, w))^T )

Σj = V(Tj(Ln, w) | S(Ln, w))
   = w./(w. − 1) · V(h | S(Ln, w)) ⊗ ( ∑_i wi gj(Xji) ⊗ wi gj(Xji)^T )
   − 1/(w. − 1) · V(h | S(Ln, w)) ⊗ ( ∑_i wi gj(Xji) ) ⊗ ( ∑_i wi gj(Xji) )^T

w. = ∑_{i=1}^n wi

E(h | S(Ln, w)) = w.^(−1) ∑_i wi h(Yi, (Y1, ..., Yn)) ∈ R^q

V(h | S(Ln, w)) = w.^(−1) ∑_i wi ( h(Yi, (Y1, ..., Yn)) − E(h | S(Ln, w)) ) ( h(Yi, (Y1, ..., Yn)) − E(h | S(Ln, w)) )^T

36 / 36
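A numerical sketch (made-up data; unit weights, q = pj = 1, h(Y) = Y, g(X) = X) of how these formulas yield the standardized statistic c: in this special case c equals √(n − 1) times the absolute Pearson correlation, matching the bodyfat slide.

set.seed(1)
n <- 50
x <- rnorm(n); y <- 0.5 * x + rnorm(n)

T_lin  <- sum(x * y)                           # linear statistic Tj
Eh     <- mean(y)                              # E(h | S)
Vh     <- mean((y - Eh)^2)                     # V(h | S)
mu     <- sum(x) * Eh                          # conditional expectation of Tj
Sigma  <- n / (n - 1) * Vh * sum(x^2) -
          1 / (n - 1) * Vh * sum(x)^2          # conditional variance of Tj
c_stat <- abs(T_lin - mu) / sqrt(Sigma)        # standardized statistic

c_stat / sqrt(n - 1)                           # equals ...
abs(cor(x, y))                                 # ... the absolute Pearson correlation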