2014 9-22


The Chow-Liu algorithm based on the MDL with discrete and continuous variables

Joe Suzuki

Osaka University

AIGM 2014, Paris


The Chow-Liu Algorithm

Chow-Liu

$P_{1,\dots,N}$: probability of $X^{(1)},\dots,X^{(N)}$ ($N \ge 1$)
$G = (V, E)$: undirected graph
$E := \{\}$, $V := \{1,\dots,N\}$, $\mathcal{E} := \{\{i,j\} \mid i \ne j,\ i,j \in V\}$

Repeat until $\mathcal{E} = \{\}$:

1. choose $\{i,j\} \in \mathcal{E}$ that maximizes $I(i,j)$
2. remove $\{i,j\}$ from $\mathcal{E}$
3. if no loop is generated, add $\{i,j\}$ to $E$

Mutual information of $X^{(i)}, X^{(j)}$:

$$I(i,j) := \sum_{x^{(i)}} \sum_{x^{(j)}} P_{i,j}(x^{(i)},x^{(j)}) \log \frac{P_{i,j}(x^{(i)},x^{(j)})}{P_i(x^{(i)})\,P_j(x^{(j)})}$$

Tree $E$ s.t. $\sum_{\{i,j\}\in E} I(i,j) \to \max$ $\Longleftrightarrow$ $D(P_{1,\dots,N} \,\|\, Q) \to \min$



Example

$$Q(x^{(1)},x^{(2)},x^{(3)},x^{(4)}) = \frac{P_{1,2}(x^{(1)},x^{(2)})\,P_{1,3}(x^{(1)},x^{(3)})\,P_{1,4}(x^{(1)},x^{(4)})}{P_1(x^{(1)})P_2(x^{(2)}) \cdot P_1(x^{(1)})P_3(x^{(3)}) \cdot P_1(x^{(1)})P_4(x^{(4)})} \cdot P_1(x^{(1)})P_2(x^{(2)})P_3(x^{(3)})P_4(x^{(4)})$$

$$= P(x^{(1)})\,P(x^{(2)}|x^{(1)})\,P(x^{(3)}|x^{(1)})\,P(x^{(4)}|x^{(1)})$$

i         1    1    2    1    2    3
j         2    3    3    4    4    4
I(i,j)    12   10   8    6    4    2

[Figure: four drawings of the graph on nodes 1, 2, 3, 4]
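The greedy loop on the example table can be traced with a short sketch. It assumes the candidate pairs are visited in decreasing order of $I(i,j)$ and uses a union-find structure for the loop check; the function name `chow_liu_edges` is illustrative and not taken from the talk.

```python
def chow_liu_edges(weights):
    """Return the edges kept by the greedy loop, in the order they are added."""
    parent = {}

    def find(v):
        # Root of the component containing v (with path halving).
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Visit candidate pairs from largest to smallest mutual information.
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:              # adding {i, j} creates no loop
            parent[ri] = rj
            tree.append((i, j))
    return tree


# Mutual information values from the example table.
example = {(1, 2): 12, (1, 3): 10, (2, 3): 8, (1, 4): 6, (2, 4): 4, (3, 4): 2}
edges = chow_liu_edges(example)
print(edges)  # [(1, 2), (1, 3), (1, 4)]
```

The pair {2, 3} is skipped because nodes 2 and 3 are already connected through node 1, which gives the star centred at node 1 that matches the factorization of $Q$ in the example.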



Dendroid Distribution

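$X^{(1)},\dots,X^{(N)}$: discrete random variables
$V := \{1,\dots,N\}$
$E \subseteq \{\{i,j\} \mid i \ne j,\ i,j \in V\}$

$$Q(x^{(1)},\dots,x^{(N)} \mid E) = \prod_{\{i,j\}\in E} \frac{P_{i,j}(x^{(i)},x^{(j)})}{P_i(x^{(i)})\,P_j(x^{(j)})} \prod_{i\in V} P_i(x^{(i)}),$$

where $\{P_i(x^{(i)})\}_{i\in V}$ and $\{P_{i,j}(x^{(i)},x^{(j)})\}_{i\ne j}$ are the marginals of $P_{1,\dots,N}(x^{(1)},\dots,x^{(N)})$.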
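As a small numerical check of this factorization, the sketch below builds $Q(\cdot \mid E)$ from the single and pairwise marginals of a joint distribution and confirms that it is a probability distribution whose edge marginals agree with $P$. The joint distribution, the 0-based variable indices, and the helper names `marginal` and `dendroid` are all assumptions made for illustration.

```python
from itertools import product


def marginal(joint, keep):
    """Marginal of `joint` (dict: tuple of values -> prob) on the indices in `keep`."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def dendroid(joint, n_vars, edges):
    """Q(x | E) built from the single and pairwise marginals of `joint`."""
    single = [marginal(joint, (i,)) for i in range(n_vars)]
    pair = {e: marginal(joint, e) for e in edges}
    q = {}
    for x in joint:
        val = 1.0
        for i in range(n_vars):
            val *= single[i][(x[i],)]
        for (i, j) in edges:
            val *= pair[(i, j)][(x[i], x[j])] / (single[i][(x[i],)] * single[j][(x[j],)])
        q[x] = val
    return q


# Hypothetical joint distribution over three binary variables (indices 0, 1, 2).
raw = [3, 1, 2, 2, 1, 4, 1, 2]
joint = {x: w / sum(raw) for x, w in zip(product((0, 1), repeat=3), raw)}

q = dendroid(joint, 3, edges=[(0, 1), (0, 2)])
total = sum(q.values())
print(round(total, 12))
```

For a tree or forest $E$ the product sums to one; for a graph with loops it generally does not, which is why the construction is restricted to forests.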



Contribution

Starting from Data:
Learning rather than Approximation

Approximation starts from the distribution $P_{1,\dots,N}$; learning starts from the data $x^n = \{(x_i^{(1)},\dots,x_i^{(N)})\}_{i=1}^n$.

In any database:
some fields are discrete and others continuous

Joe Suzuki: A Construction of Bayesian Networks from Databases Based on an MDL Principle, UAI 1993

David Edwards et al.: Selecting high-dimensional mixed graphical models using minimal AIC or BIC forests, BMC Bioinformatics 2010

Joe Suzuki: Learning Bayesian network structures when discrete and continuous variables are present, PGM 2014



Maximum Likelihood (ML)

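$\{P_i(x^{(i)})\}_{i\in V}$ and $\{P_{i,j}(x^{(i)},x^{(j)})\}_{i\ne j}$ are obtained from $x^n$ as relative frequencies.

ML estimation of MI:

$$I(i,j) := \sum_{x^{(i)}} \sum_{x^{(j)}} P_{i,j}(x^{(i)},x^{(j)}) \log \frac{P_{i,j}(x^{(i)},x^{(j)})}{P_i(x^{(i)})\,P_j(x^{(j)})}$$

Empirical entropy given $E$ (negative log-likelihood given $E$):

$$H^n(x^n|E) := n\sum_{i\in V} H(i) - n\sum_{\{i,j\}\in E} I(i,j)$$

where $H(i)$ is the empirical entropy of $X^{(i)}$.

ML seeks a tree even if $X^{(1)},\dots,X^{(N)}$ are independent:
the true graph is not obtained even as $n \to \infty$.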
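The plug-in (ML) estimate is never negative, which is why ML keeps adding edges. A minimal sketch, assuming relative-frequency estimates and natural logarithms; the data and the function name `empirical_mi` are made up for illustration.

```python
from collections import Counter
from math import log


def empirical_mi(xs, ys):
    """Plug-in mutual information (in nats) from paired samples."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed pairs of P_ij * log(P_ij / (P_i * P_j)).
    return sum(c / n * log(c * n / (cx[a] * cy[b])) for (a, b), c in cxy.items())


# Hypothetical data: two balanced binary sequences with only a weak association.
xs = [0, 1] * 50
ys = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1] * 10

mi_weak = empirical_mi(xs, ys)
mi_self = empirical_mi(xs, xs)
print(mi_weak, mi_self)
```

Here the weakly associated pair still receives a small positive score (about 0.02 nats), so an ML-based search has no reason to leave the edge out.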



Prior Distribution over Forest (V ,E )

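$p_{ij}$: the prior probability of $X^{(i)} \perp\!\!\!\perp X^{(j)}$

$$\pi(E) := \frac{1}{K}\prod_{\{i,j\}\in E}\frac{1-p_{ij}}{p_{ij}}$$

$$K := \sum_{E'}\prod_{\{i,j\}\in E'}\frac{1-p_{ij}}{p_{ij}}$$

where $E'$ ranges over the forests on $V$.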



Minimum Description Length (Suzuki, UAI-1993)

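$$R(i) = \int P(\{x_k^{(i)}\}_{k=1}^n \mid \theta)\,w(\theta)\,d\theta$$

$$R(i,j) = \int P(\{x_k^{(i)}, x_k^{(j)}\}_{k=1}^n \mid \theta)\,w(\theta)\,d\theta$$

$$R^n(x^n|E) := \prod_{\{i,j\}\in E}\frac{R(i,j)}{R(i)R(j)}\prod_{i\in V}R(i)$$

$L(x^n|E) := -\log R^n(x^n|E)$

Description length:

$$l(x^n) = -\log\pi(E) + L(x^n|E) \to \min$$

Bayesian estimation of MI:

$$J(i,j) := \frac{1}{n}\log\frac{R(i,j)}{R(i)R(j)}$$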



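Expanding with the asymptotic approximation, we find

$k(E)$: number of parameters in $E$
$\alpha^{(i)}$: number of values $X^{(i)}$ takes

$$L(x^n|E) \approx H^n(x^n|E) + \frac{1}{2}k(E)\log n$$

$$l(x^n) \approx H^n(x^n|E) + \frac{1}{2}k(E)\log n - \log\pi(E)$$

$$J(i,j) \approx I(i,j) - \frac{1}{2n}(\alpha^{(i)}-1)(\alpha^{(j)}-1)\log n + \frac{1}{n}\log\frac{1-p_{ij}}{p_{ij}}$$

The last term of $J(i,j)$ is the contribution of the edge $\{i,j\}$ to $\log\pi(E)$, scaled by $1/n$; it vanishes when $p_{ij} = 1/2$.

The order in which edges are chosen differs between $I$ and $J$.

$J(i,j)$ can be negative and then yields a forest, while $I(i,j)$ always yields a tree.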
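To see $J(i,j)$ turn negative, the sketch below evaluates $\frac{1}{n}\log\frac{R(i,j)}{R(i)R(j)}$ exactly. It assumes that $R$ is the Krichevsky-Trofimov mixture (Dirichlet(1/2) weights) and that $p_{ij} = 1/2$, so the prior term drops out; the data and the names `log_kt` and `bayes_mi` are illustrative.

```python
from collections import Counter
from math import lgamma


def log_kt(seq, alphabet_size):
    """log of the Krichevsky-Trofimov mixture probability of `seq` (closed form)."""
    n = len(seq)
    out = lgamma(alphabet_size / 2) - lgamma(n + alphabet_size / 2)
    for c in Counter(seq).values():
        out += lgamma(c + 0.5) - lgamma(0.5)
    return out


def bayes_mi(xs, ys, ax, ay):
    """J = (1/n) log R(i,j) / (R(i) R(j)), with alphabet sizes ax and ay."""
    n = len(xs)
    joint = log_kt(list(zip(xs, ys)), ax * ay)
    return (joint - log_kt(xs, ax) - log_kt(ys, ay)) / n


# Hypothetical data: xs and ys_indep have perfectly balanced joint counts.
xs = [0, 1] * 50
ys_indep = [0, 0, 1, 1] * 25

j_indep = bayes_mi(xs, ys_indep, 2, 2)   # negative: the edge is rejected
j_dep = bayes_mi(xs, xs, 2, 2)           # positive: the edge is kept
print(j_indep, j_dep)
```

With balanced joint counts the plug-in estimate $I$ is exactly zero, while $J$ is strictly negative because of the parameter penalty, so a search that only keeps edges with $J > 0$ returns a forest.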



Universality

Universal measure w.r.t. a finite set $A$:
There exists $R^n$ s.t.

$$\frac{1}{n}\log\frac{P^n(x^n)}{R^n(x^n)} \to 0$$

($x^n \in A^n$) with $P^n$-probability one as $n \to \infty$, for any $P^n$.

$$P(i) = \prod_{k=1}^n P(x_k^{(i)}), \qquad P(i,j) = \prod_{k=1}^n P(x_k^{(i)}, x_k^{(j)})$$

$$\frac{1}{n}\log\frac{P(i)}{R(i)} \to 0, \qquad \frac{1}{n}\log\frac{P(i,j)}{R(i,j)} \to 0$$
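The convergence can be observed numerically. The sketch below assumes the Krichevsky-Trofimov sequential estimate as the universal measure and a deterministic binary sequence whose symbol frequencies match an assumed source $P(1) = 1/4$; both choices, and the names `log_kt_sequential` and `gap`, are for illustration only.

```python
from collections import Counter
from math import log


def log_kt_sequential(seq, alphabet_size):
    """log R^n(x^n), accumulated one symbol at a time."""
    counts = Counter()
    total = 0.0
    for i, s in enumerate(seq):          # i = number of symbols seen so far
        total += log((counts[s] + 0.5) / (i + alphabet_size / 2))
        counts[s] += 1
    return total


def gap(n):
    """(1/n) log P^n(x^n) / R^n(x^n) for a sequence with a quarter of ones."""
    seq = [1 if k % 4 == 0 else 0 for k in range(n)]
    p = {0: 0.75, 1: 0.25}               # assumed true source
    log_p = sum(log(p[s]) for s in seq)
    return (log_p - log_kt_sequential(seq, 2)) / n


g_100 = gap(100)
g_10000 = gap(10000)
print(g_100, g_10000)
```

The numerator of the gap grows only like $\frac{1}{2}\log n$, so dividing by $n$ drives it to zero.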



Consistency

$$Q^n(x^n|E) := \prod_{\{i,j\}\in E}\frac{P(i,j)}{P(i)P(j)}\prod_{i\in V}P(i)$$

With probability 1 as $n \to \infty$, for any $Q^n(\cdot|E)$:

$$\frac{1}{n}\log\frac{Q^n(x^n|E)}{R^n(x^n|E)} \to 0$$

For large $n$:

$$\pi(E_1)Q^n(x^n|E_1) \le \pi(E_2)Q^n(x^n|E_2) \iff \pi(E_1)R^n(x^n|E_1) \le \pi(E_2)R^n(x^n|E_2)$$

A maximum posterior probability forest is obtained for large $n$.



ML vs MDL

                     ML                       MDL
Choice of E          Minimize H^n(x^n|E)      Minimize H^n(x^n|E) + (1/2) k(E) log n − log π(E)
Choice of {i,j}      Maximize I(i,j)          Maximize J(i,j)
Criteria             Fitness of x^n to E      Fitness of x^n to E and simplicity of E
Consistency          No                       Yes


When Density Exists

When density $f$ exists for $X$ (Ryabko, 2009)

$A_0 := \{A\}$; $A_{j+1}$ is a refinement of $A_j$

For each $j$: $x^n = (x_1,\dots,x_n) \in \mathbb{R}^n \mapsto (a_1^{(j)},\dots,a_n^{(j)}) \in A_j^n$

[Figure: the successively refined partitions $A_1$, $A_2$, ..., $A_j$]

$$g_1^n(x^n) = \frac{R_1^n(a_1^{(1)},\dots,a_n^{(1)})}{\lambda(a_1^{(1)})\cdots\lambda(a_n^{(1)})}$$

$$g_2^n(x^n) = \frac{R_2^n(a_1^{(2)},\dots,a_n^{(2)})}{\lambda(a_1^{(2)})\cdots\lambda(a_n^{(2)})}$$

$$g_j^n(x^n) = \frac{R_j^n(a_1^{(j)},\dots,a_n^{(j)})}{\lambda(a_1^{(j)})\cdots\lambda(a_n^{(j)})}$$

$\lambda$: Lebesgue measure (width of interval); $R_j^n$: universal measure w.r.t. $A_j$


When Density Exists

$\sum_j w_j = 1$, $w_j > 0$

$$g^n(x^n) := \sum_{j=1}^{\infty} w_j\, g_j^n(x^n)$$

$f$: density function
$f_j$: density function of level $j$
$f^n(x^n) := f(x_1)\cdots f(x_n)$

Ryabko 2009:
For any $f$ s.t. $D(f\|f_j) \to 0$ ($j \to \infty$),

$$\frac{1}{n}\log\frac{f^n(x^n)}{g^n(x^n)} \to 0$$

as $n \to \infty$.


When Density Does Not Exist

Extensions from Ryabko 2009:

Remove the assumption that a density exists.

Remove the restriction on the density class: "for any $f$ s.t. $D(f\|f_j) \to 0$ ($j \to \infty$)" → "for any $f$"



When density does not exist for $X$ (Suzuki 2011)

$B_1 := \{\{1\}, \{2,3,\dots\}\}$
$B_2 := \{\{1\}, \{2\}, \{3,4,\dots\}\}$
...
$B_k := \{\{1\}, \{2\}, \dots, \{k\}, \{k+1,k+2,\dots\}\}$
...

For each level $k$: $x^n = (x_1,\dots,x_n) \in \mathbb{N}^n \mapsto (b_1^{(k)},\dots,b_n^{(k)}) \in B_k^n$

$$\eta(\{k\}) = \frac{1}{k} - \frac{1}{k+1}$$

$$g_k^n(x^n) := \frac{R_k^n(b_1^{(k)},\dots,b_n^{(k)})}{\eta(b_1^{(k)})\cdots\eta(b_n^{(k)})}$$

$\sum_k \omega_k = 1$, $\omega_k > 0$,

$$g^n(x^n) := \sum_{k=1}^{\infty}\omega_k\, g_k^n(x^n)$$
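The reference measure $\eta$ and the level-$k$ quantization can be checked with exact arithmetic. In the sketch below the mass of the tail cell $\{k+1, k+2, \dots\}$ is taken to be $1/(k+1)$, which follows from telescoping $\sum_{m>k}\left(\frac{1}{m}-\frac{1}{m+1}\right)$; the cell numbering and the names `eta_cell` and `to_cell` are illustrative.

```python
from fractions import Fraction


def eta_cell(cell_index, k):
    """eta of a cell of B_k: indices 1..k are singletons, index k+1 is the tail."""
    if cell_index <= k:
        return Fraction(1, cell_index) - Fraction(1, cell_index + 1)
    return Fraction(1, k + 1)    # eta({k+1, k+2, ...}) by telescoping


def to_cell(x, k):
    """Map x in {1, 2, ...} to the index of its cell in B_k."""
    return x if x <= k else k + 1


k = 3
masses = [eta_cell(c, k) for c in range(1, k + 2)]
print(masses, sum(masses))       # the four cell masses of B_3 sum to 1
print(to_cell(2, k), to_cell(7, k))
```

Because the cell masses sum to one at every level, each $g_k^n$ is a density with respect to $\eta$ in the same way that the histogram densities are with respect to $\lambda$.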



D(f||f_j) → 0 as j → ∞ (1)

$$\int_{1/2}^{1} f(x)\,dx > 0$$

[Figure: histogram levels $C_0$, $C_1$, $C_2$, $C_3$, ... over the $x$-axis from 0 to 1]



D(f||f_j) → 0 as j → ∞ (2)

$$\int_{1}^{\infty} f(x)\,dx > 0$$

[Figure: histogram levels $C_0$, $C_1$, $C_2$, $C_3$, ... over the $x$-axis, with 0 and 1 marked]



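D(f||f_j) → 0 as j → ∞

Universal histogram sequence $\{C_k\}_{k=0}^{\infty}$

[Figure: histogram levels $C_0$, $C_1$, $C_2$, $C_3$, ... over the $x$-axis, with marks labelled by $\mu$ and $\sigma$]

Suzuki 2013:
For any (generalized) density $f$, as $n \to \infty$ with probability 1,

$$\frac{1}{n}\log\frac{f^n(x^n)}{g^n(x^n)} \to 0$$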



Computing gn(xn)

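Input $x^n \in A^n$, output $g^n(x^n)$

1. For each $k = 1,\dots,K$: $g_k^n(x^n) := 0$
2. For each $k = 1,\dots,K$ and each $a \in A_k$: $c_k(a) := 0$
3. For each $i = 1,\dots,n$ and each $k = 1,\dots,K$:
   1. Find $a_i \in A_k$ from $x_i \in A$
   2. $g_k^n(x^n) := g_k^n(x^n) - \log\dfrac{c_k(a_i) + 1/2}{i - 1 + |A_k|/2} + \log \eta_X(a_i)$
   3. $c_k(a_i) := c_k(a_i) + 1$
4. $g^n(x^n) := \dfrac{1}{K}\sum_{k=1}^{K} g_k^n(x^n)$

Step 3.2 works in the log domain: after the loop the accumulator holds $-\log g_k^n(x^n)$, so the average in step 4 is taken over the exponentiated values, i.e. with weights $\omega_k = 1/K$.

Universal measure w.r.t. $A_k$:

$$R_k^n(x^n) = \prod_{i=1}^{n}\frac{c(a_i^{(k)}) + 1/2}{i - 1 + |A_k|/2}$$

where $c(a_i^{(k)})$ counts the occurrences of $a_i^{(k)}$ among the first $i-1$ symbols.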
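The procedure can be sketched in a few lines for one concrete setting. The sketch assumes data in $[0, 1)$, dyadic partitions with $|A_k| = 2^k$ equal cells, the cell width as $\eta_X$, and uniform weights $1/K$; the talk's own implementation is in R, and the name `log_mixture_density` is hypothetical.

```python
from collections import Counter
from math import exp, log


def log_mixture_density(xs, K):
    """log g^n(x^n) for xs in [0, 1), mixing dyadic levels k = 1..K with weights 1/K."""
    neg_log_g = [0.0] * K                      # step 1: one accumulator per level
    counts = [Counter() for _ in range(K)]     # step 2: cell counts per level
    for i, x in enumerate(xs):                 # i = number of samples seen so far
        for k in range(1, K + 1):
            cells = 2 ** k
            a = min(int(x * cells), cells - 1)     # step 3.1: cell of x at level k
            c = counts[k - 1]
            # step 3.2: subtract log of the KT factor, add log of the cell width
            neg_log_g[k - 1] -= log((c[a] + 0.5) / (i + cells / 2))
            neg_log_g[k - 1] += log(1.0 / cells)
            c[a] += 1                              # step 3.3
    # step 4: mix g_k = exp(-neg_log_g[k]) with weights 1/K (log-sum-exp for stability)
    m = max(-v for v in neg_log_g)
    return m + log(sum(exp(-v - m) for v in neg_log_g) / K)


single = log_mixture_density([0.3], K=4)          # one sample: density 1 at every level
repeated = log_mixture_density([0.3] * 50, K=4)   # concentrated data: high density
print(single, repeated)
```

The two nested loops make the cost proportional to $nK$ per variable, which is the factor that appears in the overall complexity.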



Computation: O(nN2K )

Computing $g^n(x^n)$ and $g^n(x^n, y^n)$:
$O(nN^2K)$ ($O(nN^2)$ for the discrete case)

Proportional to $n$ and to $N + N(N-1)/2$

$a_i^{(1)} \mapsto a_i^{(2)} \mapsto \cdots \mapsto a_i^{(K)}$: binary search, proportional to $K$

$g^n(x^n, y^n)$ can be obtained by $\sum_{k=1}^{K}\omega_k\, g_{k,k}^n(x^n, y^n)$ rather than $\sum_{j=1}^{J}\sum_{k=1}^{K}\omega_{jk}\, g_{jk}^n(x^n, y^n)$.

Computing MI and finding the forest:
$N(N-1)/2$



Bayesian Estimator of Mutual Information

$$J(i,j) = \frac{1}{n}\log\frac{g^n(i,j)}{g^n(i)\,g^n(j)} + \frac{1}{n}\log\frac{1-p_{ij}}{p_{ij}}$$


age height menarche sex igf1 tanner testvol weight

age NA 0.7627465 0.8521553 0.01010264 0.5138440 0.52534862 0.1997714 0.6091554

height NA NA 0.6706380 0.26225428 0.4132932 0.68547041 0.3105466 0.9269808

menarche NA NA NA 0.68786102 0.4919746 0.84283639 0.0000000 0.6456718

sex NA NA NA NA 0.2778511 0.08923994 0.1083901 0.1925525

igf1 NA NA NA NA NA 0.47529101 0.2272998 0.3722551

tanner NA NA NA NA NA NA 0.3796768 0.6420483

testvol NA NA NA NA NA NA NA 0.2409487

weight NA NA NA NA NA NA NA NA



R ISwR package juul2

The juul data frame has 1339 rows and 6 columns. It contains a reference sample of the distribution of insulin-like growth factor (IGF-I), one observation per subject in various ages, with the bulk of the data collected in connection with school physical examinations.

[Figure: forest over the variables weight, height, sex, age, tanner, igf1, menarche, testvol]



Experiments

n                          100      500      1000     2000
J^n(i,j)                   0.90     0.99     1.86     3.15
HSIC                       0.50     9.51     40.28    185.53

(a) N = 4

n                          100      500      1000     2000
perfectly matching rate    0.52     0.60     0.72     0.79
K-L divergence loss        0.0169   0.00303  0.00152  0.000405
execution time (sec)       1.64     12.71    22.45    51.24

(b) N = 4

n                          100      500      1000     2000
perfectly matching rate    0.18     0.31     0.38     0.59
K-L divergence loss        0.0652   0.00800  0.00575  0.00298
execution time (sec)       4.27     24.44    52.5     116.1



Experiments

data.frame         n     N   discrete (d) / continuous (c)   time (sec)
airquality         153   6   (d,d,c,d,d,d)                   10.47
anscombe           51    4   (d,c,c,d)                       3.32
attenu             182   5   (d,c,d,c,c)                     9.64
attitude           30    7   (d,d,d,d,d,d,d)                 4.26
beaver1            114   4   (d,d,c,d)                       2.54
beaver2            100   4   (d,d,c,d)                       2.73
BOD                6     2   (d,c)                           0.11
cars               50    2   (d,d)                           0.80
ChickWeight        578   4   (d,d,d,d)                       13.01
chickwts           71    2   (d,d)                           0.98
CO2                84    5   (d,d,d,d,c)                     3.33
DNase              176   3   (d,c,c)                         2.36
esoph              88    5   (d,d,d,d,d)                     2.12
faithful           272   2   (c,d)                           1.52
Formaldehyde       6     2   (c,c)                           0.18
freeny             39    5   (c,c,c,c,c)                     2.57
Indometh           66    3   (d,c,c)                         0.97
infert             248   8   (d,d,d,d,d,d,d,d)               13.91
InsectSprays       72    2   (d,d)                           0.23
iris               150   5   (c,c,c,c,d)                     6.94
LifeCycleSavings   50    5   (c,c,c,c,c)                     3.1
Loblolly           84    3   (c,d,d)                         1.01
longley            16    7   (c,c,c,c,c,d,c)                 2.26
morley             100   3   (d,d,d)                         1.21
mtcars             32    11  (c,c,c,c,c,c,c,c,c,c,c)         6.73
Orange             35    3   (d,d,d)                         0.5
OrchardSprays      64    4   (d,d,d,d)                       1.09
PlantGrowth        30    2   (c,d)                           0.16
pressure           19    2   (d,c)                           0.22
Puromycin          23    3   (c,d,d)                         0.34
quakes             1000  5   (c,c,c,c,d)                     56.12
sleep              20    3   (c,c,d)                         0.48
stackloss          21    4   (d,d,d,d)                       0.53
swiss              47    6   (c,c,d,d,c,c)                   4.18
Theoph             132   5   (d,c,c,c,c)                     6.94
ToothGrowth        60    4   (d,c,d,c)                       1.11
trees              31    3   (c,d,c)                         0.58
USArrests          50    4   (c,d,d,c)                       1.87
USJudgeRatings     43    12  (c,c,c,c,c,c,c,c,c,c,c,c)       13.66
warpbreaks         54    3   (d,d,d)                         0.27
women              15    2   (d,d)                           0.9


Conclusion

Established Chow-Liu learning based on MDL without assuming the variables to be either discrete or continuous:

Theoretical analysis w.r.t. $n$, $N$, $K$ ($K$: quantization depth)

Realistic computation using R

Insight:

The implementation is not hard

The computation is proportional to $K$

Future work:

Optimal $K$ w.r.t. $n$, $N$

Exponential memory w.r.t. $K$

R package publication

