
Carnegie Mellon

Junction trees

Trees where each node is a set of variables

- Running intersection property: every clique between Ci and Cj contains Ci ∩ Cj

- Ci and Cj are neighbors ⟹ Sij = Ci ∩ Cj is called a separator

- Example:

- Notation: Vij is a set of all variables on the same side of edge i-j as clique Cj:

- V34 = {G,F}, V31 = {A}, V43 = {A,D}
- Encoded independencies: (Vij ⊥ Vji | Sij)

Efficient Principled Learning of Junction Trees
Anton Chechetka and Carlos Guestrin

Carnegie Mellon University

Motivation

Probabilistic graphical models are everywhere
- Medical diagnosis, datacenter performance monitoring, sensor nets, …

Main advantages
- Compact representation of probability distributions
- Exploit structure to speed up inference

But also problems
- Compact representation ≠ tractable inference
- Exact inference is #P-complete in general
- Often still need exponential time even for compact models
- Example: ≤4 neighbors per variable (a constant!), but inference is still hard

- Often we do not even have the structure, only data
- The best structure is NP-complete to find
- Most structure learning algorithms return complex models for which inference is intractable

- Very few structure learning algorithms have global quality guarantees

We address both of these issues! We provide
- an efficient structure learning algorithm
- guaranteed to learn tractable models
- with global guarantees on the results' quality

This work: contributions

- The first polynomial time algorithm with PAC guarantees for learning low-treewidth graphical models with

- guaranteed tractable inference!

- Key theoretical insight: polynomial-time upper bound on conditional mutual information for arbitrarily large sets of variables

- Empirical viability demonstration

Tractability guarantees:
- Inference is exponential in clique size k
- Small cliques ⟹ tractable inference

JTs as approximations

Often exact conditional independence is too strict a requirement
- generalization: conditional mutual information
  I(A, B | C) ≡ H(A | C) - H(A | BC)
- H(·) is conditional entropy
- I(A, B | C) ≥ 0 always
- I(A, B | C) = 0 ⟺ (A ⊥ B | C)
- intuitively: if C is already known, how much new information about A is contained in B?
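The quantity I(A, B | C) = H(A|C) - H(A|BC) can be estimated directly from data via empirical entropies. A minimal sketch, assuming discrete samples stored as tuples; the function names here are illustrative, not from the paper:

```python
from collections import Counter
from math import log2

def cond_entropy(samples, target, given):
    """Empirical H(target | given); target and given are tuples of
    variable indices into each sample tuple."""
    joint = Counter()   # counts of (target-value, given-value) pairs
    cond = Counter()    # counts of given-value alone
    for s in samples:
        t = tuple(s[i] for i in target)
        g = tuple(s[i] for i in given)
        joint[(t, g)] += 1
        cond[g] += 1
    n = len(samples)
    # H(T|G) = -sum_{t,g} p(t,g) log p(t|g)
    return -sum((c / n) * log2(c / cond[g]) for (t, g), c in joint.items())

def cond_mutual_info(samples, A, B, C):
    """Empirical I(A; B | C) = H(A|C) - H(A|BC), always >= 0 in expectation."""
    return cond_entropy(samples, A, tuple(C)) - cond_entropy(samples, A, tuple(B) + tuple(C))
```

For perfectly correlated binary variables this estimator returns 1 bit; for independent uniform ones it returns 0, matching the ⟺ characterization above.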

Goal: find an ε-junction tree with fixed clique size k in polynomial (in |V|) time

[Figure: example junction tree with cliques AB, BC, CD, BE, EF, EG (nodes 1-6) and separators B, C, E; the set V34 is marked]

Approximation quality guarantee:

Theorem [Narasimhan and Bilmes, UAI05]: If for every separator Sij in the junction tree the conditional mutual information satisfies
  I(Vij, Vji | Sij) < ε  (call such a tree an ε-junction tree)
then
  KL(P || Ptree) < |V|ε

Constraint-based learning

Naively:
- for every candidate separator S of size k
- for every X ⊆ V\S
- if I(X, V\(S∪X) | S) < ε, add (S,X) to the “list of useful components” L
- find a JT consistent with L

Complexity (Naïve vs. our work):
- n^k vs. n^k
- O(2^n) vs. O(n^(k+3))
- O(2^n) vs. O(2^(4k+4))
- O(2^n) vs. O(n^(k+2))

Key theoretical result

Efficient upper bound for I(·, · | ·)

Intuition: Suppose a distribution P(V) can be well approximated by a junction tree with clique size k. Then for every set S ⊆ V of size k and A, B ⊆ V of arbitrary size, to check that I(A, B | S) is small, it is enough to check, for all subsets X ⊆ A, Y ⊆ B of size at most k, that I(X, Y | S) is small.

Computation time is reduced from exponential in |V| to polynomial! The set S does not have to relate to the separators of the “true” JT in any way!

Theorem 1: Suppose an ε-JT of treewidth k exists for P(V). Suppose the sets S ⊆ V of size k and A ⊆ V\S of arbitrary size are s.t. for every X ⊆ V\S of size k+1 it holds that
  I(X∩A, X∩(V\(S∪A)) | S) < δ
then
  I(A, V\(S∪A) | S) < |V|(ε+δ)

Complexity: O(n^(k+1)). Polynomial in n, instead of O(exp(n)) for the straightforward computation.
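Theorem 1 reduces the intractable check of I(A, V\(S∪A) | S) to tests over sets of size k+1. A sketch of that premise check, assuming a `cmi(X, Y, S)` conditional mutual information estimator is available (a hypothetical helper, not the paper's code):

```python
from itertools import combinations

def small_sets_bound_holds(V, S, A, k, delta, cmi):
    """Check the Theorem-1 premise: for every X subset of V\\S with
    |X| = k+1, I(X∩A, X\\(A∪S) | S) < delta.  If this returns True,
    the theorem bounds I(A, V\\(S∪A) | S) < |V|(eps + delta)."""
    rest = [v for v in V if v not in S]
    B = [v for v in rest if v not in A]      # the "other side" of A
    for X in combinations(rest, k + 1):
        XA = tuple(v for v in X if v in A)
        XB = tuple(v for v in X if v in B)
        if not XA or not XB:
            continue                          # one side empty: I = 0 trivially
        if cmi(XA, XB, S) >= delta:
            return False                      # a witness of dependence
    return True
```

Only O(n^(k+1)) tests are made, regardless of how large A is, which is the whole point of the theorem.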

[Figure: sets A and B on opposite sides of S; instead of computing I(A,B|S) directly, it suffices to compute I(X,Y|S) for small X ⊆ A, Y ⊆ B]

Finding almost independent subsets

Question: if S is a separator of an ε-JT, which variables are on the same side of S?
- More than one correct answer is possible:
  S={A}: {B}, {C,D} OR {B,C}, {D}
- We will settle for finding one
- This drops the complexity from exponential to polynomial

[Figure: chain junction tree with cliques AB, AC, AD and separators A]

Intuition: Consider the set of variables Q = {B,C,D}. Suppose an ε-JT (e.g. the one above) with separator S={A} exists s.t. some of the variables in Q ({B}) are on the left of S and the remaining ones ({C,D}) on the right. Then a partitioning of Q into X and Y exists s.t. I(X,Y|S) < ε.

[Figure: possible partitionings of Q = {B,C,D}]

If no such split exists, all variables of Q must be on the same side of S.

Alg. 1 (given candidate separator S and threshold δ):
- each variable of V\S starts out as a separate partition
- for every Q ⊆ V\S of size at most k+2 (a fixed size, regardless of how large the partitions grow)
- if min over splits X ⊂ Q of I(X, Q\X | S) > δ
- merge all partitions that have variables in Q

Complexity: O(n^(k+3)). Polynomial in n.
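Alg. 1 can be sketched as a union-find merge driven by the mutual-information tests. A minimal Python rendition, assuming a `cmi(X, Y, S)` estimator; all names are illustrative, not from the paper:

```python
from itertools import combinations

def find_partitions(V, S, k, delta, cmi):
    """Sketch of Alg. 1: partition V\\S into groups that cannot be
    separated by S.  All variables of Q are merged when every split
    of Q has I(X, Q\\X | S) above the threshold delta."""
    rest = [v for v in V if v not in S]
    parent = {v: v for v in rest}            # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]    # path halving
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    for size in range(2, k + 3):             # |Q| <= k+2
        for Q in combinations(rest, size):
            # minimum over all nontrivial splits of Q into X and Q\X
            splits = [cmi(X, tuple(v for v in Q if v not in X), S)
                      for r in range(1, size)
                      for X in combinations(Q, r)]
            if min(splits) > delta:          # no low-information split exists
                for v in Q[1:]:
                    union(Q[0], v)

    comps = {}                               # collect connected components
    for v in rest:
        comps.setdefault(find(v), []).append(v)
    return sorted(comps.values())
```

Since |Q| ≤ k+2 is constant, each test touches only fixed-size sets, giving the stated O(n^(k+3)) overall cost.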

Theorem (result quality): If, after invoking Alg. 1(S, δ), a set U is a connected component, then
- for every Z s.t. I(Z, V\(Z∪S) | S) < δ it holds that U ⊆ Z
  (we never mistakenly put variables together)
- I(U, V\(U∪S) | S) < nkδ
  (incorrect splits are not too bad)

[Figure: example run with δ = 0.25; pairwise I(·,·|S) values 0.4, 0.3 and 0.2 between variables; edges with I above the threshold are tested and their endpoints merged, while the 0.2 edge is too low to merge]

Constructing a junction tree

Using Alg. 1 for every S ⊂ V, obtain a list L of pairs (S,Q) s.t. I(Q, V\(S∪Q) | S) < |V|(ε+δ).

[Figure: example junction tree with cliques AB, BC, CD, BE, EF and the resulting (S,Q) pairs]

Problem: From L, reconstruct a junction tree. This is non-trivial. Complications:
- L may encode more independencies than a single JT encodes
- Several different JTs may be consistent with the independencies in L

Key insight [Arnborg+al: SIAM-JADM 1987, Narasimhan+Bilmes: UAI05]: In a junction tree, components (S,Q) have a recursive decomposition: a clique in the junction tree plus smaller components from L.

Look for such recursive decompositions in L!

DP algorithm (input: list L of pairs (S,Q)):
- sort L in order of increasing |Q|
- mark each (S,Q) ∈ L with |Q| = 1 as positive
- for (S,Q) ∈ L with |Q| ≥ 2, in the sorted order:
  - if ∃ x ∈ Q and (S1,Q1), …, (Sm,Qm) ∈ L s.t.
    - Si ⊆ S∪{x} and each (Si,Qi) is positive
    - Qi ∩ Qj = ∅ for i ≠ j
    - ∪i=1:m Qi = Q\{x}
  - then mark (S,Q) positive and set decomposition(S,Q) = (S1,Q1), …, (Sm,Qm)
- if ∃ S s.t. all (S,Qi) ∈ L are positive, return the corresponding junction tree

Deciding whether such a decomposition exists is NP-complete, so we use a greedy heuristic.

Greedy heuristic for decomposition search:
- initialize the decomposition to empty
- iteratively add pairs (Si,Qi) that do not conflict with those already in the decomposition
- if all variables of Q are covered: success
- may fail even if a decomposition exists
- but we prove that for certain distributions it is guaranteed to work
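The greedy steps above can be sketched as follows, assuming L has already been filtered to positive pairs; the function and argument names are illustrative, not the paper's implementation:

```python
def greedy_decompose(Q, x, S, candidates):
    """Greedily cover Q\\{x} with disjoint positive components (Si, Qi)
    whose separators come from S∪{x}.  Returns the chosen pairs, or None
    if the greedy choice gets stuck (which can happen even when a valid
    decomposition exists)."""
    target = set(Q) - {x}
    allowed_sep = set(S) | {x}
    chosen, covered = [], set()
    for Si, Qi in candidates:
        if not set(Si) <= allowed_sep:
            continue                   # separator must be within S ∪ {x}
        if covered & set(Qi):
            continue                   # conflicts with already-chosen pairs
        if not set(Qi) <= target:
            continue                   # component leaks outside Q\{x}
        chosen.append((Si, Qi))
        covered |= set(Qi)
        if covered == target:
            return chosen              # all variables of Q\{x} covered
    return None
```

The first-fit order is the design choice that makes this polynomial but incomplete; the guarantee for strongly connected distributions compensates for the possible misses.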

[Figure: example recursive decomposition of components with cliques ABEF and ABCD into smaller pieces over variables A, B, C, D, E, F]

Theoretical guarantees

Intuition: if the intra-clique dependencies are strong enough, we are guaranteed to find a well-approximating JT in polynomial time.

Theorem: Suppose a maximal ε-JT of treewidth k exists for P(V) s.t. for every clique C and separator S of that tree it holds that
  min over X ⊂ (C\S) of I(X, C\(S∪X) | S) > (k+3)(ε+δ)
then our algorithm will find a k|V|(ε+δ)-JT for P(V) with probability at least (1−γ), using a number of samples and an amount of time that are both polynomial in n = |V| (and exponential only in the treewidth k); see the paper for the exact bounds.

Corollary: Maximal JTs of fixed treewidth s.t. for every clique C and separator S it holds that
  min over X ⊂ (C\S) of I(X, C\(S∪X) | S) > ε
for fixed ε > 0 are efficiently PAC learnable.

Experimental results

Model quality (log-likelihood on test set). We compare this work with:
- ordering-based search (OBS) [Teyssier+Koller:UAI05]
- Chow-Liu alg. [Chow+Liu:IEEE68]
- Karger-Srebro alg. [Karger+Srebro:SODA01]
- local search
- this work + local search combination (using our algorithm to initialize local search)

Data: Beinlich+al:ECAIM1988 (37 variables, treewidth 4, learned treewidth 3)
Data: Krause+Guestrin:UAI05 (32 variables, treewidth 3)
Data: Desphande+al:VLDB04 (54 variables, treewidth 2)

Future work
- extend to non-maximal junction trees
- heuristics to speed up performance
- use information about edge likelihoods (e.g. from L1-regularized logistic regression) to cut down on computation

Related work

Ref.       Model      Guarantees    Time
[1,2]      tractable  local         poly(n)
[3]        tree       global        O(n^2 log n)
[4]        tree mix   local         O(n^2 log n)
[5]        compact    local         poly(n)
[6]        all        global        exp(n)
[7]        tractable  const-factor  poly(n)
[8]        compact    PAC           poly(n)
[9]        tractable  PAC           exp(n)
this work  tractable  PAC           poly(n)

[1] Bach+Jordan:NIPS-02
[2] Choi+al:UAI-05
[3] Chow+Liu:IEEE-1968
[4] Meila+Jordan:JMLR-01
[5] Teyssier+Koller:UAI-05
[6] Singh+Moore:CMU-CALD-05
[7] Karger+Srebro:SODA-01
[8] Abbeel+al:JMLR-06
[9] Narasimhan+Bilmes:UAI-04