Junction trees
Trees where each node is a set of variables (a clique)
- Running intersection property: every clique on the path between Ci and Cj contains Ci ∩ Cj
- If Ci and Cj are neighbors, Sij = Ci ∩ Cj is called a separator
- Example: [figure]
- Notation: Vij is the set of all variables on the same side of edge i-j as clique Cj:
  V34 = {G,F}, V31 = {A}, V43 = {A,D}
- Encoded independencies: (Vij ⊥ Vji | Sij)
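The running intersection property above can be checked mechanically. A minimal sketch (the function name and clique-tree representation are illustrative, not from the poster):

```python
def running_intersection_holds(cliques, edges):
    """Check the running intersection property of a clique tree:
    for every pair Ci, Cj, every clique on the (unique) path between
    them must contain Ci intersect Cj.
    cliques: list of frozensets; edges: (i, j) index pairs forming a tree."""
    n = len(cliques)
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    def path(i, j):
        # DFS in a tree: the unique path from i to j, endpoints included
        stack, seen = [(i, [i])], {i}
        while stack:
            u, p = stack.pop()
            if u == j:
                return p
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, p + [v]))

    for i in range(n):
        for j in range(i + 1, n):
            need = cliques[i] & cliques[j]
            if any(not need <= cliques[m] for m in path(i, j)):
                return False
    return True
```

On the poster's example tree (cliques {A,B}, {B,C}, {C,D}, {B,E}, {E,F}, {E,G}) the check passes; re-attaching {E,F} under {C,D} breaks it, since {B,E} ∩ {E,F} = {E} is not contained in the cliques on the new path.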
Efficient Principled Learning of Junction Trees
Anton Chechetka and Carlos Guestrin
Carnegie Mellon University
Motivation
Probabilistic graphical models are everywhere
- Medical diagnosis, datacenter performance monitoring, sensor nets, …
Main advantages
- Compact representation of probability distributions
- Exploit structure to speed up inference
But also problems
- Compact representation ≠ tractable inference
  - Exact inference is #P-complete in general
  - Often still need exponential time even for compact models
  - Example: ≤4 neighbors per variable (a constant!), but inference is still hard
- Often we do not even have the structure, only data
  - The best structure is NP-complete to find
  - Most structure-learning algorithms return complex models, where inference is intractable
  - Very few structure-learning algorithms have global quality guarantees
We address both of these issues! We provide
- an efficient structure-learning algorithm
- guaranteed to learn tractable models
- with global guarantees on the quality of the results
This work: contributions
- The first polynomial-time algorithm with PAC guarantees for learning low-treewidth graphical models, with
  - guaranteed tractable inference!
- Key theoretical insight: a polynomial-time upper bound on conditional mutual information for arbitrarily large sets of variables
- Demonstration of empirical viability
Tractability guarantees:
- Inference is exponential in clique size k
- Small cliques ⇒ tractable inference
JTs as approximations
Often exact conditional independence is too strict a requirement
- Generalization: conditional mutual information
  I(A, B | C) = H(A | C) - H(A | B,C)
- H(· | ·) is conditional entropy
- I(A, B | C) ≥ 0 always
- I(A, B | C) = 0 ⟺ (A ⊥ B | C)
- Intuitively: if C is already known, how much new information about A is contained in B?
Goal: find an ε-junction tree with fixed clique size k in polynomial (in |V|) time
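Conditional mutual information can be computed directly from a joint distribution table via the entropy identity I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C). A minimal sketch, with an illustrative function name and table layout (not from the poster):

```python
import numpy as np

def cond_mutual_info(joint, A, B, C):
    """I(A; B | C) for a discrete joint distribution.
    joint: d-dimensional numpy array of probabilities summing to 1,
           one axis per variable.
    A, B, C: disjoint tuples of axis indices."""
    d = joint.ndim

    def marg(keep):
        # marginalize out every axis not in `keep`
        drop = tuple(i for i in range(d) if i not in keep)
        return joint.sum(axis=drop)

    def entropy(keep):
        p = marg(keep).ravel()
        p = p[p > 0]                       # 0 * log 0 = 0 by convention
        return -np.sum(p * np.log2(p))

    # I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)
    return (entropy(tuple(A) + tuple(C)) + entropy(tuple(B) + tuple(C))
            - entropy(tuple(A) + tuple(B) + tuple(C)) - entropy(tuple(C)))
```

For a distribution built as p(a,b,c) = p(c) p(a|c) p(b|c), the result is 0 (A ⊥ B | C), while the unconditional I(A, B) is positive.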
[Figure: example junction tree with cliques 1–6: {A,B}, {C,D}, {B,C}, {B,E}, {E,F}, {E,G}, and separators {B}, {C}, {B}, {E}, {E}; V34 marked on edge 3-4]
Approximation quality guarantee:
Theorem [Narasimhan and Bilmes, UAI05]: If for every separator Sij in the junction tree it holds that the conditional mutual information
  I(Vij, Vji | Sij) < ε    (call such a tree an ε-junction tree)
then
  KL(P || Ptree) < |V|ε
Constraint-based learning
Naïvely:
- for every candidate separator S of size k
  - for every X ⊆ V\S
    - if I(X, V\(S∪X) | S) < ε
      - add (S,X) to the "list of useful components" L
- find a JT consistent with L
Complexity:
              Naïve     Our work
              n^k       n^k
              O(2^n)    O(n^(k+3))
              O(2^n)    O(2^(4k+4))
              O(2^n)    O(n^(k+2))
Key theoretical result: efficient upper bound for I(·, · | ·)
Intuition: suppose a distribution P(V) can be well approximated by a junction tree with clique size k. Then for every set S ⊆ V of size k and A, B ⊆ V of arbitrary size, to check that I(A, B | S) is small, it is enough to check, for all subsets X ⊆ A, Y ⊆ B of size at most k, that I(X, Y | S) is small.
Computation time is reduced from exponential in |V| to polynomial!
The set S does not have to relate to the separators of the "true" JT in any way!
Theorem 1: Suppose an ε-JT of treewidth k exists for P(V). Suppose the set S ⊆ V of size k and A ⊆ V\S of arbitrary size are such that for every X ⊆ V\S of size k+1 it holds that
  I(X∩A, X∩(V\(S∪A)) | S) < δ
then
  I(A, V\(S∪A) | S) < |V|(ε+δ)
Complexity: O(n^(k+1)), polynomial in n, instead of O(exp(n)) for straightforward computation.
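The premise of Theorem 1 can be verified by enumerating only the size-(k+1) subsets. A sketch, assuming a hypothetical CMI oracle `cmi(X, Y, S)` (e.g. an empirical estimate); the function name is illustrative:

```python
from itertools import combinations

def small_subset_bound_holds(V, S, A, cmi, k, delta):
    """Check the premise of Theorem 1: for every X subset of V\S with
    |X| = k+1, I(X intersect A, X intersect B | S) < delta, where
    B = V \ (S union A).  cmi(X, Y, S) is an oracle for I(X; Y | S).
    If this returns True, Theorem 1 bounds I(A, B | S) < |V|*(eps+delta)."""
    rest = [v for v in V if v not in S]
    B = [v for v in rest if v not in A]
    for X in combinations(rest, k + 1):
        XA = tuple(v for v in X if v in A)
        XB = tuple(v for v in X if v in B)
        if not XA or not XB:
            continue                  # X falls on one side: nothing to test
        if cmi(XA, XB, tuple(S)) >= delta:
            return False
    return True
```

Only O(n^(k+1)) subsets are enumerated, matching the complexity stated above, versus the exponentially many subsets a direct check of I(A, B | S) would require.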
[Figure: to bound I(A, B | S), we only need to compute I(X, Y | S) for small subsets X ⊆ A, Y ⊆ B]
Finding almost independent subsets
Question: if S is a separator of an ε-JT, which variables are on the same side of S?
- More than one correct answer is possible:
  [Figure: a junction tree with cliques {A,B}, {A,C}, {A,D} and separator {A}]
  S = {A}: {B}, {C,D} or {B,C}, {D}
- We will settle for finding one
- Drop the complexity from exponential to polynomial
Intuition: consider the set of variables Q = {B,C,D}. Suppose an ε-JT (e.g. above) with separator S = {A} exists s.t. some of the variables in Q ({B}) are on the left of S and the remaining ones ({C,D}) are on the right. Then a partitioning of Q into X and Y exists s.t. I(X, Y | S) < ε.
[Figure: possible partitionings of Q]
If no such split exists, all variables of Q must be on the same side of S.
Alg. 1 (given candidate separator S, threshold δ):
- each variable of V\S starts out as a separate partition
- for every Q ⊆ V\S of size at most k+2 (a fixed size, regardless of |V|)
  - if min over splits of Q into X and Q\X of I(X, Q\X | S) > δ
    - merge all partitions that have variables in Q
Complexity: O(n^(k+3)), polynomial in n.
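Alg. 1 can be sketched with a union-find forest over the variables; as above, `cmi(X, Y, S)` is a hypothetical oracle for I(X; Y | S) and all names are illustrative:

```python
from itertools import combinations

def splits(Q):
    """All ways to split Q into nonempty X and Q \ X."""
    Q = tuple(Q)
    for r in range(1, len(Q)):
        for X in combinations(Q, r):
            yield X, tuple(v for v in Q if v not in X)

def almost_independent_partitions(V, S, cmi, k, delta):
    """Sketch of Alg. 1: start with singleton partitions of V \ S and
    merge every Q (|Q| <= k+2) whose best split still has CMI above delta."""
    rest = [v for v in V if v not in S]
    parent = {v: v for v in rest}          # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    for size in range(2, k + 3):           # |Q| = 2 .. k+2
        for Q in combinations(rest, size):
            best = min(cmi(X, Y, tuple(S)) for X, Y in splits(Q))
            if best > delta:               # Q cannot be split across S
                for v in Q[1:]:
                    parent[find(Q[0])] = find(v)

    groups = {}
    for v in rest:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())
```

With a toy oracle that returns 0 exactly for splits across S = {A} (left side {B}, right side {C,D}), this recovers the partitioning {B} vs. {C,D} from the example above.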
Theorem (quality of results): if after invoking Alg.1(S, δ=ε) a set U is a connected component, then
- for every Z s.t. I(Z, V\(Z∪S) | S) < ε it holds that U ⊆ Z (variables are never mistakenly put together)
- I(U, V\(U∪S) | S) < nkε (incorrect splits are not too bad)
Example: δ = 0.25
[Figure: pairwise I(·, · | S) values 0.4, 0.3, 0.2 on candidate edges; each edge is tested and the variables merged when I exceeds δ; the edge with I = 0.2 is too low, so its endpoints are not merged; merging ends with the final partitioning]
Constructing a junction tree
Using Alg.1 for every S ⊆ V, obtain a list L of pairs (S,Q) s.t. I(Q, V\(S∪Q) | S) < |V|(ε+δ)
Example:
[Figure: junction tree with cliques {A,B}, {C,D}, {B,C}, {B,E}, {E,F}, and a table of separators S with their components Q]
Problem: from L, reconstruct a junction tree. This is non-trivial. Complications:
- L may encode more independencies than a single JT encodes
- Several different JTs may be consistent with the independencies in L
Key insight [Arnborg+al: SIAM-JADM 1987, Narasimhan+Bilmes: UAI05]:
In a junction tree, components (S,Q) have a recursive decomposition: a clique in the junction tree, plus smaller components from L.
Look for such recursive decompositions in L!

DP algorithm (input: list L of pairs (S,Q)):
- sort L in order of increasing |Q|
- mark (S,Q) ∈ L with |Q| = 1 as positive
- for (S,Q) ∈ L, |Q| ≥ 2, in the sorted order
  - if ∃ x ∈ Q and (S1,Q1), …, (Sm,Qm) ∈ L s.t.
    - Si ⊆ S∪{x}, (Si,Qi) is positive
    - Qi ∩ Qj = ∅
    - ∪_{i=1:m} Qi = Q\{x}
  - then mark (S,Q) positive and set decomposition(S,Q) = (S1,Q1), …, (Sm,Qm)
- if ∃ S s.t. all (S,Qi) ∈ L are positive, return the corresponding junction tree

The existence check (∃ x, (S1,Q1), …) is NP-complete to decide; we use a greedy heuristic.
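The DP above can be sketched as follows, with the greedy component selection in place of the exact (NP-complete) existence check; the function name and frozenset representation are illustrative:

```python
def find_junction_tree(L, V):
    """Sketch of the DP over a list L of (S, Q) pairs (frozensets) that
    satisfy the Alg. 1 guarantee.  Marks (S, Q) positive when, for some
    x in Q, positive components (Si, Qi) with Si contained in S+{x} and
    disjoint Qi greedily cover Q \ {x}.  The greedy choice may fail even
    when a decomposition exists."""
    L = sorted(L, key=lambda sq: len(sq[1]))
    positive = set()
    for S, Q in L:
        if len(Q) == 1:
            positive.add((S, Q))
            continue
        for x in Q:
            target = Q - {x}
            covered = frozenset()
            # greedily add positive components that fit inside S + {x}
            for Si, Qi in L:
                if ((Si, Qi) in positive and Si <= S | {x}
                        and Qi <= target and not (Qi & covered)):
                    covered |= Qi
            if covered == target:
                positive.add((S, Q))
                break
    # a junction tree exists if some S has all of its components positive
    for S in {S for S, _ in L}:
        comps = [(Si, Qi) for Si, Qi in L if Si == S]
        if comps and all(c in positive for c in comps) and \
                frozenset().union(*[Qi for _, Qi in comps]) == V - S:
            return S, comps
    return None
```

For example, with L = {({1},{0}), ({2},{3}), ({1},{2,3})} over V = {0,1,2,3}, the pair ({1},{2,3}) decomposes via x = 2 using the positive component ({2},{3}), and S = {1} yields a junction tree.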
Greedy heuristic for decomposition search:
- initialize the decomposition to empty
- iteratively add pairs (Si,Qi) that do not conflict with those already in the decomposition
- if all variables of Q are covered: success
- May fail even if a decomposition exists
- But we prove that for certain distributions it is guaranteed to work
[Figure: example decomposition of components over {A,B,E,F} and {A,B,C,D}, with separators such as {B} and {C}, into smaller components from L]
Theoretical guarantees
Intuition: if the intra-clique dependencies are strong enough, we are guaranteed to find a well-approximating JT in polynomial time.

Theorem: Suppose a maximal ε-JT of treewidth k exists for P(V) s.t. for every clique C and separator S of the tree it holds that
  min_{X ⊂ (C\S)} I(X, C\(S∪X) | S) > (k+3)(ε+δ)
then our algorithm will find a k|V|(ε+δ)-JT for P(V) with probability at least 1−γ, using a number of samples and an amount of time polynomial in n (see the paper for the exact O(·) bounds).

Corollary: Maximal JTs of fixed treewidth s.t. for every clique C and separator S it holds that
  min_{X ⊂ (C\S)} I(X, C\(S∪X) | S) > α
for fixed α > 0 are efficiently PAC learnable.
Experimental results
Model quality (log-likelihood on the test set). We compare this work with:
- ordering-based search (OBS) [Teyssier+Koller:UAI05]
- Chow-Liu alg. [Chow+Liu:IEEE68]
- Karger-Srebro alg. [Karger+Srebro:SODA01]
- local search
- this work + local search combination (using our algorithm to initialize local search)
Data: Beinlich+al:ECAIM-1988 (37 variables, treewidth 4, learned treewidth 3)
Data: Krause+Guestrin:UAI05 (32 variables, treewidth 3)
Data: Deshpande+al:VLDB04 (54 variables, treewidth 2)
Future work
- Extend to non-maximal junction trees
- Heuristics to speed up performance
- Use information about edge likelihoods (e.g. from L1-regularized logistic regression) to cut down on computation
Related work

Ref.       Model      Guarantees    Time
[1,2]      tractable  local         poly(n)
[3]        tree       global        O(n^2 log n)
[4]        tree mix   local         O(n^2 log n)
[5]        compact    local         poly(n)
[6]        all        global        exp(n)
[7]        tractable  const-factor  poly(n)
[8]        compact    PAC           poly(n)
[9]        tractable  PAC           exp(n)
this work  tractable  PAC           poly(n)

[1] Bach+Jordan:NIPS-02
[2] Choi+al:UAI-05
[3] Chow+Liu:IEEE-1968
[4] Meila+Jordan:JMLR-01
[5] Teyssier+Koller:UAI-05
[6] Singh+Moore:CMU-CALD-05
[7] Karger+Srebro:SODA-01
[8] Abbeel+al:JMLR-06
[9] Narasimhan+Bilmes:UAI-04