
Page 1: Conditional trees

Conditional Trees
or
Unbiased recursive partitioning: A conditional inference framework

Christoph Molnar
Supervisor: Stephanie Möst

Department of Statistics, LMU

18 December 2012

1 / 36

Page 2: Conditional trees

Overview

Introduction and Motivation

Algorithm for unbiased trees

Conditional inference with permutation tests

Examples

Properties

Summary

2 / 36

Page 3: Conditional trees

CART trees

Model: Y = f(X)

Structure of decision trees: recursive partitioning of the covariable space X
Split optimizes a criterion (Gini, information gain, sum of squares) depending on the scale of Y
Split point search: exhaustive search procedure
Avoid overfitting: early stopping or pruning
Usage: prediction and explanation
Other tree types: ID3, C4.5, CHAID, ...

3 / 36

Page 4: Conditional trees

What are conditional trees?

Special kind of trees
Recursive partitioning with binary splits and early stopping
Constant models in the terminal nodes
Variable selection, early stopping and split point search based on conditional inference
Uses permutation tests for inference
Solves problems of CART trees

4 / 36

Page 5: Conditional trees

Why conditional trees?

Helps to overcome problems of trees:
Overfitting (can be solved with other techniques as well)
Selection bias towards covariables with many possible splits (i.e. numeric or multi-categorial)
Difficult interpretation due to selection bias
Variable selection: no concept of statistical significance
Not all scales of Y and X covered (ID3, C4.5, ...)

5 / 36

Page 6: Conditional trees

Simulation: selection bias

Variable selection is unbiased ⇔ the probability of selecting a covariable that is independent of Y is the same for all independent covariables
The measurement scale of a covariable shouldn't play a role
Simulation illustrating the selection bias (an R sketch follows below):
Y ∼ N(0, 1)
X1 ∼ M(n, 1/2, 1/2)
X2 ∼ M(n, 1/3, 1/3, 1/3)
X3 ∼ M(n, 1/4, 1/4, 1/4, 1/4)

6 / 36
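As an illustration (not part of the original slides), the simulation can be sketched in a few lines of R; the sample size n = 100 and the 1000 replications are assumptions, not values from the talk. It records which covariable rpart uses for the first split, although none of them is related to Y.

library(rpart)

set.seed(1)
n <- 100
first_split <- replicate(1000, {
  dat <- data.frame(
    y  = rnorm(n),                                # Y ~ N(0, 1), independent of all X
    x1 = factor(sample(1:2, n, replace = TRUE)),  # 2 equally likely categories
    x2 = factor(sample(1:3, n, replace = TRUE)),  # 3 categories
    x3 = factor(sample(1:4, n, replace = TRUE))   # 4 categories
  )
  fit <- rpart(y ~ ., data = dat)
  as.character(fit$frame$var[1])                  # splitting variable at the root, "<leaf>" if no split
})
prop.table(table(first_split))                    # relative selection frequencies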

Page 7: Conditional trees

Simulation: results

Selection frequencies for the first split: X1: 0.128, X2: 0.302, X3: 0.556, none: 0.014

[Bar plot of the selection frequencies for X1, X2, X3 and "none"]

Strongly biased towards variables with many possible splits
Example of a tree:

[Fitted rpart tree: root split on x3 = 1,2, second split on x2 = 1,3; terminal node means −0.19, −0.098 and 0.36]

Overfitting! (Note: complexity parameter not cross-validated)
Desirable here: no split at all
Problem source: exhaustive search through all variables and all possible split points
Numeric/multi-categorial covariables have more split options ⇒ multiple comparison problem

7 / 36

Page 8: Conditional trees

Idea of conditional trees

Variable selection and search for the split point ⇒ two steps
Embed all decisions into hypothesis tests
All tests use conditional inference (permutation tests)

8 / 36

Page 9: Conditional trees

Ctree algorithm

1 Stop criterion: Test the global null hypothesis H0 of independence between Y and all Xj, with H0 = ∩_{j=1}^m H0^j and H0^j : D(Y|Xj) = D(Y). If H0 is not rejected ⇒ stop

2 Select the variable Xj* with the strongest association
3 Search the best split point for Xj* and partition the data
4 Repeat steps 1), 2) and 3) for both of the new partitions

9 / 36

Page 10: Conditional trees

How can we test the hypothesis of independence?

Parametric tests depend on distribution assumptions
Problem: unknown conditional distribution
D(Y|X) = D(Y|X1, ..., Xm) = D(Y|f(X1), ..., f(Xm))

Need for a general framework that can handle arbitrary scales

Let the data speak ⇒ permutation tests!

10 / 36

Page 11: Conditional trees

Excursion: permutation tests

11 / 36

Page 12: Conditional trees

Permutation tests: simple example

Possible treatments for a disease: A or B
Numeric measurement (blood value)
Question: Do the blood values differ between treatment A and B? ⇔ µB ≠ µA?
Test statistic: T0 = µA − µB
H0: µA − µB = 0, H1: µA − µB ≠ 0
Distribution unknown ⇒ permutation test

[Dot plot of the measurements y by treatment group A and B]

T0 = µA − µB = 2.06 − 1.2 = 0.86

12 / 36

Page 13: Conditional trees

Permute

Original data:

B   B   B   B   A   A   A   A   B   A
0.5 0.9 1.1 1.2 1.5 1.9 2.0 2.1 2.3 2.8

One possible permutation:

B   B   B   B   A   A   A   A   B   A
2.8 2.3 1.1 1.9 1.2 2.1 1.5 0.5 0.9 2.0

Permute the group labels (A and B) relative to the numeric measurements
Calculate the test statistic T for each permutation
Do this for all possible permutations
Result: distribution of the test statistic, conditional on the sample

13 / 36

Page 14: Conditional trees

P-value and decision

k = #{permutation samples : |µA,perm − µB,perm| > |µA − µB|}
p-value = k / #permutations

p-value < α = 0.05? ⇒ If yes, H0 can be rejected

[Density plot of the permutation distribution of the difference of means per treatment, with the original test statistic marked]

14 / 36

Page 15: Conditional trees

General algorithm for permutation tests

Requirement: under H0, response and covariables are exchangeable
Do the following:

1 Calculate the test statistic T0
2 Calculate the test statistic T for all permutations of the pairs (Y, X)
3 Compute n_extreme: the number of T which are more extreme than T0
4 p-value p = n_extreme / n_permutations
5 Reject H0 if p < α, with significance level α

If the number of possible permutations is too big, draw random permutations in 2) (Monte Carlo sampling); see the R sketch below

15 / 36
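The treatment example from the previous slides can be written down directly in R. This is a minimal Monte Carlo sketch; the 10,000 random permutations are an assumption, while the ten observations are the ones shown on the "Permute" slide.

set.seed(42)
y     <- c(0.5, 0.9, 1.1, 1.2, 1.5, 1.9, 2.0, 2.1, 2.3, 2.8)
group <- c("B", "B", "B", "B", "A", "A", "A", "A", "B", "A")

## observed test statistic T0 = mean(A) - mean(B) = 2.06 - 1.2 = 0.86
t_obs <- mean(y[group == "A"]) - mean(y[group == "B"])

## Monte Carlo permutation distribution: permute the labels, recompute T
t_perm <- replicate(10000, {
  g <- sample(group)
  mean(y[g == "A"]) - mean(y[g == "B"])
})

## two-sided p-value: fraction of permutations at least as extreme as T0
mean(abs(t_perm) >= abs(t_obs))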

Page 16: Conditional trees

Framework by Strasser and Weber

General test statistic:

Tj(Ln, w) = vec( ∑_{i=1}^n wi gj(Xij) h(Yi, (Y1, ..., Yn))^T ) ∈ R^(pj·q)

h is called the influence function, gj is a transformation of Xj
Choose gj and h depending on the scale
It is possible to calculate µ and Σ of T

Standardized test statistic: c(t, µ, Σ) = max_{k=1,...,pq} | (t − µ)_k / √((Σ)_kk) |

Why so complex? ⇒ Covers all cases: multicategorial X or Y, different scales

16 / 36

Page 17: Conditional trees

End of excursion
Let's get back to business

17 / 36

Page 18: Conditional trees

Ctree algorithm with permutation tests

1 Stop criterion: Test the global null hypothesis H0 of independence between Y and all Xj, with H0 = ∩_{j=1}^m H0^j and H0^j : D(Y|Xj) = D(Y) (permutation tests for each Xj). If H0 is not rejected (no significant Xj) ⇒ stop

2 Select the variable Xj* with the strongest association (smallest p-value)

3 Search the best split point for Xj* (maximal test statistic c) and partition the data

4 Repeat steps 1), 2) and 3) for both of the new partitions

(A self-contained R sketch of this recursion follows below.)

18 / 36
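To make the recursion concrete, here is a self-contained R sketch for a numeric response and numeric covariables. It is an illustration only, not the party implementation: the per-covariable permutation test uses the absolute correlation as its statistic (h = Y, g = X), the split statistic is a simple standardized mean difference in the spirit of c, and the Bonferroni correction, B = 999 permutations and the minimum node size of 20 are assumptions.

perm_pvalue <- function(y, x, B = 999) {
  ## permutation test of independence between y and a single covariable x
  t_obs  <- abs(cor(x, y))
  t_perm <- replicate(B, abs(cor(x, sample(y))))
  (1 + sum(t_perm >= t_obs)) / (B + 1)
}

grow <- function(y, X, alpha = 0.05) {
  ## 1) stop criterion: Bonferroni-adjusted permutation p-values for all covariables
  p_adj <- pmin(vapply(X, perm_pvalue, numeric(1), y = y) * ncol(X), 1)
  if (min(p_adj) >= alpha || length(y) < 20)
    return(list(type = "leaf", prediction = mean(y), n = length(y)))

  ## 2) variable selection: smallest adjusted p-value
  j <- which.min(p_adj)

  ## 3) split point search: standardized two-sample statistic over all cut points
  cuts   <- head(sort(unique(X[[j]])), -1)
  c_stat <- vapply(cuts, function(s) {
    a <- y[X[[j]] <= s]; b <- y[X[[j]] > s]
    abs(mean(a) - mean(b)) / sqrt(var(y) * (1 / length(a) + 1 / length(b)))
  }, numeric(1))
  split <- cuts[which.max(c_stat)]

  ## 4) recurse into both partitions
  left <- X[[j]] <= split
  list(type = "node", variable = names(X)[j], split = split,
       left  = grow(y[left],  X[left,  , drop = FALSE], alpha),
       right = grow(y[!left], X[!left, , drop = FALSE], alpha))
}

Called as grow(bodyfat$DEXfat, bodyfat[, c("hipcirc", "waistcirc", "kneebreadth")]) on the data used later, it returns a nested list describing the partition; ctree() in the examples below does the same job properly, with the full Strasser-Weber statistics and arbitrary scales of Y and X.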

Page 19: Conditional trees

Permutation tests for stop criterion

Choose the influence function h for Y
Choose a transformation function g for each Xj
Test each variable Xj separately for association with Y
(H0^j : D(Y|Xj) = D(Y), i.e. variable Xj has no influence on Y)
Global H0 = ∩_{j=1}^m H0^j : no variable has influence on Y
Test the global H0: multiple testing ⇒ adjust α (Bonferroni correction, ...)

19 / 36
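Equivalently one can adjust the individual permutation p-values rather than α; a minimal sketch with made-up p-values:

p <- c(x1 = 0.21, x2 = 0.004, x3 = 0.38)   # per-covariable permutation p-values (illustrative)
p_adj <- p.adjust(p, method = "bonferroni")
any(p_adj < 0.05)                          # global H0 rejected? if FALSE: stop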

Page 20: Conditional trees

Permutation tests for variable selection

Choose the variable with the smallest p-value for the split
Note: switching to a comparison of p-values removes the scaling problem between covariables of different types

20 / 36

Page 21: Conditional trees

Test statistic for best split point

Use a test statistic instead of Gini/SSE for the split point search

Tj^A(Ln, w) = vec( ∑_{i=1}^n wi I(Xji ∈ A) · h(Yi, (Y1, ..., Yn))^T )

Standardized test statistic: c = max_k | (Tj^A − µ)_k / √((Σ)_kk) |

Measures the discrepancy between {Yi | Xji ∈ A} and {Yi | Xji ∉ A}
Calculate c for all possible splits; choose the split point with maximal c
Covers different scales of Y and X

21 / 36

Page 22: Conditional trees

Usage examples with R
- Let's get the party started -

22 / 36

Page 23: Conditional trees

Bodyfat: example for continuous regression

Example: bodyfat data
Predict body fat from anthropometric measurements
Data: measurements of 71 healthy women
Response Y: body fat measured by DXA (numeric)
Covariables X: different body measurements (numeric), for example waist circumference, breadth of the knee, ...

h(Yi) = Yi
g(Xij) = Xij

Tj(Ln, w) = ∑_{i=1}^n wi Xij Yi

c = | (t − µ)/σ | ∝ | ∑_{i: node} Xij Yi − n_node · X̄j · Ȳ | / √( (∑_{i: node} (Yi − Ȳ)²) · (∑_{i: node} (Xij − X̄j)²) )
(the absolute Pearson correlation coefficient)

23 / 36
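Because c is proportional to the absolute Pearson correlation here, the quantity driving the variable selection can be inspected directly; a small check on the bodyfat data used on the next slide (covariable names as they appear in the fitted tree):

data(bodyfat, package = "mboost")
## |cor(Xj, DEXfat)| per covariable; within a node, c differs from this only by a constant factor
sapply(bodyfat[, c("hipcirc", "waistcirc", "kneebreadth")],
       function(x) abs(cor(x, bodyfat$DEXfat)))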

Page 24: Conditional trees

Bodyfat: R-code

library("party")library("rpart")library("rpart.plot")data(bodyfat, package = "mboost")## conditional treecond_tree <- ctree(DEXfat ~ ., data = bodyfat)## normal treeclassic_tree <- rpart(DEXfat ~ ., data = bodyfat)

24 / 36

Page 25: Conditional trees

Bodyfat: conditional tree

plot(cond_tree)

[Plot of the conditional tree: Node 1 splits on hipcirc (p < 0.001) at 108.
For hipcirc ≤ 108: Node 2 splits on anthro3c (p < 0.001) at 3.76;
  for anthro3c ≤ 3.76, Node 3 splits on anthro3c (p = 0.001) at 3.39 into terminal Node 4 (n = 13) and Node 5 (n = 12);
  for anthro3c > 3.76, Node 6 splits on waistcirc (p = 0.003) at 86 into terminal Node 7 (n = 13) and Node 8 (n = 7).
For hipcirc > 108: Node 9 splits on kneebreadth (p = 0.006) at 10.6 into terminal Node 10 (n = 19) and Node 11 (n = 7).
Terminal nodes show boxplots of DEXfat on a scale of roughly 10-60.]

25 / 36

Page 26: Conditional trees

Bodyfat: CART tree

rpart.plot(classic_tree)

[rpart tree plot: splits on waistcir < 88, anthro3c < 3.4, hipcirc < 101 and hipcirc < 110; five terminal nodes with predicted DEXfat values of about 17, 23, 30, 35 and 45.]

⇒ Structurally different trees!

26 / 36

Page 27: Conditional trees

Glaucoma: example for classification

Predict glaucoma (an eye disease) based on laser scanning measurements
Response Y: binary, y ∈ {glaucoma, normal}
Covariables X: different volumes and areas of the eye (all numeric)

h = eJ(Yi) = (1, 0)^T if glaucoma, (0, 1)^T if normal
g(Xij) = Xij

Tj(Ln, w) = vec( ∑_{i=1}^n wi Xij eJ(Yi)^T ) = ( n_glaucoma · X̄j,glaucoma , n_normal · X̄j,normal )^T

c ∝ max | n_group · (X̄j,group − X̄j,node) | ,  group ∈ {glaucoma, normal}

27 / 36
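For a binary response the linear statistic Tj is simply the vector of per-class sums of Xj. A quick check on the GlaucomaM data used on the next slide (vari is one of its covariables):

data("GlaucomaM", package = "ipred")
## Tj for the covariable vari: (sum of vari among glaucoma cases, sum among normals)
with(GlaucomaM, tapply(vari, Class, sum))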

Page 28: Conditional trees

Glaucoma: R-code

library("rpart")library("party")data("GlaucomaM", package = "ipred")cond_tree <- ctree(Class ~ ., data = GlaucomaM)classic_tree <- rpart(Class ~ ., data = GlaucomaM)

28 / 36

Page 29: Conditional trees

Glaucoma: conditional tree

[Plot of the conditional tree: inner nodes with n = 196, 87 and 109; terminal Node 3 (n = 79), Node 4 (n = 8), Node 6 (n = 65) and Node 7 (n = 44), each showing the proportions of glaucoma vs. normal.]

## 1) vari <= 0.059; criterion = 1, statistic = 71.475
##   2) vasg <= 0.066; criterion = 1, statistic = 29.265
##     3)* weights = 79
##   2) vasg > 0.066
##     4)* weights = 8
## 1) vari > 0.059
##   5) tms <= -0.066; criterion = 0.951, statistic = 11.221
##     6)* weights = 65
##   5) tms > -0.066
##     7)* weights = 44

29 / 36

Page 30: Conditional trees

Glaucoma: CART tree

rpart.plot(classic_tree, cex = 1.5)

[rpart tree plot: splits on varg < 0.21, mhcg >= 0.17, vars < 0.064, tms >= −0.066 and eas < 0.45; six terminal nodes, four labelled glaucoma and two labelled normal.]

30 / 36

Page 31: Conditional trees

Appendix: Examples of other scales

Y categorial, X categorial: h = eJ(Yi), g = eK(Xij)
⇒ T is the vectorized contingency table of Xj and Y

[Mosaic plot of Xj vs. Y with Pearson residuals (scale −2.08 to 1.64); p-value = 0.009]

Y and Xj numeric: h = rg(Yi), g = rg(Xij) ⇒ Spearman's rho

Flexible T for different situations: multivariate regression, ordinal regression, censored regression, ...

31 / 36
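Both special cases are one-liners in R; the toy vectors below are made up for illustration, not data from the talk.

## categorial Y and Xj: T is the vectorized contingency table
xj <- factor(c(1, 1, 2, 3, 3, 2))
y  <- factor(c("a", "b", "a", "a", "b", "b"))
as.vector(table(xj, y))

## numeric Y and Xj with rank transformations: Spearman's rho
cor(c(2.3, 1.1, 5.0, 3.2), c(10, 4, 12, 7), method = "spearman")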

Page 32: Conditional trees

Properties

Prediction accuracy: not better than normal trees, but not worse either
Computational considerations: same speed as normal trees
Two possible interpretations of the significance level α:

1. Pre-specified nominal level of the underlying association tests
2. Simple hyperparameter determining the tree size

Low α yields smaller trees

32 / 36
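In the party package the significance level enters through ctree_control(), whose mincriterion argument is 1 − α; a lower α therefore gives a smaller tree. A sketch, reusing the bodyfat data from the examples above:

library("party")
data(bodyfat, package = "mboost")
## alpha = 0.01 instead of the default 0.05 => fewer splits survive the stop criterion
small_tree <- ctree(DEXfat ~ ., data = bodyfat,
                    controls = ctree_control(mincriterion = 0.99))
plot(small_tree)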

Page 33: Conditional trees

Summary conditional trees

Not heuristics, but non-parametric models with a well-defined theoretical background
Suitable for regression with arbitrary scales of Y and X
Unbiased variable selection
No overfitting
Conditional trees are structurally different from trees partitioned with exhaustive search procedures

33 / 36

Page 34: Conditional trees

Literature and Software

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, 2001.

T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674, 2006.

H. Strasser and C. Weber. On the asymptotic theory of permutation statistics. Mathematical Methods of Statistics, 8:220-250, 1999.

R packages:
rpart: Recursive partitioning
rpart.plot: Plot function for rpart
party: A Laboratory for Recursive Partytioning

All available on CRAN

34 / 36

Page 35: Conditional trees

Appendix: Competitors

Other partitioning algorithms in this area:
CHAID: nominal response, χ² test, multiway splits, nominal covariables
GUIDE: continuous response only, p-value from χ² test, categorizes continuous covariables
QUEST: ANOVA F-test for continuous response, χ² test for nominal response, compare on the p-value scale ⇒ reduces selection bias
CRUISE: multiway splits, discriminant analysis in each node, unbiased variable selection

35 / 36

Page 36: Conditional trees

Appendix: Properties of test statistic T

µj = E(Tj(Ln, w) | S(Ln, w)) = vec( ( ∑_{i=1}^n wi gj(Xji) ) E(h | S(Ln, w))^T )

Σj = V(Tj(Ln, w) | S(Ln, w))
   = w./(w. − 1) · V(h | S(Ln, w)) ⊗ ( ∑_i wi gj(Xji) ⊗ wi gj(Xji)^T )
   − 1/(w. − 1) · V(h | S(Ln, w)) ⊗ ( ∑_i wi gj(Xji) ) ⊗ ( ∑_i wi gj(Xji) )^T

w. = ∑_{i=1}^n wi

E(h | S(Ln, w)) = w.^(−1) ∑_i wi h(Yi, (Y1, ..., Yn)) ∈ R^q

V(h | S(Ln, w)) = w.^(−1) ∑_i wi ( h(Yi, (Y1, ..., Yn)) − E(h | S(Ln, w)) ) ( h(Yi, (Y1, ..., Yn)) − E(h | S(Ln, w)) )^T

36 / 36
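A numerical sketch (made-up data; unit weights, q = pj = 1, h(Y) = Y, g(X) = X) of how these formulas yield the standardized statistic c: in this special case c equals √(n − 1) times the absolute Pearson correlation, matching the bodyfat slide.

set.seed(1)
n <- 50
x <- rnorm(n); y <- 0.5 * x + rnorm(n)

T_lin  <- sum(x * y)                           # linear statistic Tj
Eh     <- mean(y)                              # E(h | S)
Vh     <- mean((y - Eh)^2)                     # V(h | S)
mu     <- sum(x) * Eh                          # conditional expectation of Tj
Sigma  <- n / (n - 1) * Vh * sum(x^2) -
          1 / (n - 1) * Vh * sum(x)^2          # conditional variance of Tj
c_stat <- abs(T_lin - mu) / sqrt(Sigma)        # standardized statistic

c_stat / sqrt(n - 1)                           # equals ...
abs(cor(x, y))                                 # ... the absolute Pearson correlation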