Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006
Learning Bayesian Networks
23.02.2006
p(x | θ_s, S^h) = ∏_{i=1}^n p(x_i | pa_i, θ_i, S^h)

joint distribution function (left-hand side); local distribution functions (the factors on the right)

S^h is the hypothesis that structure S (minimal) encodes the joint distribution function.
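As a concrete sketch of this factorization, a two-node network X1 → X2 over binary variables can be evaluated as a product of its local distribution functions. The parameter values below are assumptions chosen for illustration, not from the slides:

```python
# Hypothetical binary network S: X1 -> X2, illustrating
# p(x | theta_s, S^h) = prod_i p(x_i | pa_i, theta_i, S^h).

theta1 = 0.6       # assumed p(X1 = true)
theta2_g1 = 0.8    # assumed p(X2 = true | X1 = true)
theta2_gn1 = 0.3   # assumed p(X2 = true | X1 = false)

def local(p_true, value):
    """Local distribution function for a binary node."""
    return p_true if value else 1.0 - p_true

def joint(x1, x2):
    """Joint distribution = product of the local distribution functions."""
    return local(theta1, x1) * local(theta2_g1 if x1 else theta2_gn1, x2)

# Sanity check: the four joint probabilities sum to one.
total = sum(joint(a, b) for a in (True, False) for b in (True, False))
```

Any Bayesian network factors this way; only the set of parents pa_i changes from structure to structure.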
Example

S: X1 → X2,   θ_s = {θ_1, θ_{2|1}, θ_{2|¬1}}

p(x_1, ¬x_2 | θ_s, S^h) = p(x_1 | θ_1, S^h) p(¬x_2 | x_1, θ_{2|1}, S^h) = θ_1 (1 − θ_{2|1})

p(x_1, x_2 | θ_s, S^h) = p(x_1 | θ_1, S^h) p(x_2 | x_1, θ_{2|1}, S^h) = θ_1 θ_{2|1}
Updating or "Learning" Parameters (S^h is certain, θ_s is uncertain)

Let D = {x_1, …, x_m} be a random sample from p(x | θ_s, S^h).

p(θ_s | D, S^h) ∝ p(θ_s | S^h) p(D | θ_s, S^h) = p(θ_s | S^h) ∏_{l=1}^m p(x_l | θ_s, S^h)

Prediction averages over the posterior:

p(x_{m+1} | D, S^h) = ∫ p(x_{m+1} | θ_s, D, S^h) p(θ_s | D, S^h) dθ_s
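For a single binary parameter with a conjugate Beta prior, both the posterior and the predictive integral have closed forms. The prior and sample below are assumptions for illustration:

```python
# Sketch: Beta-Bernoulli conjugate update for one binary parameter theta,
# as in p(theta | D, S^h) ∝ p(theta | S^h) * prod_l p(x_l | theta, S^h).

alpha, beta = 1.0, 1.0         # assumed Beta(1, 1) prior (uniform)
D = [True, True, False, True]  # assumed observed sample

n_true = sum(D)
n_false = len(D) - n_true

# Conjugacy: the posterior is Beta(alpha + n_true, beta + n_false).
post_a, post_b = alpha + n_true, beta + n_false

# The predictive p(x_{m+1} = true | D) integrates out theta
# and equals the posterior mean of a Beta distribution.
p_next_true = post_a / (post_a + post_b)
```

With more data, the posterior concentrates and the predictive approaches the empirical frequency.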
Example

S: X1 → X2,   θ_s = {θ_1, θ_{2|1}, θ_{2|¬1}}

(Figure: plate-style diagram — parameter nodes Θ_1, Θ_{2|1}, Θ_{2|¬1} are shared across samples; each sample (sample 1, sample 2, …) has its own nodes X1, X2.)
Exact computation of p(θ_s|D,S^h)

• No missing data, no hidden variables
• Independent parameters
• Local distribution functions from the exponential family, conjugate priors

p(θ_s | S^h) = ∏_{i=1}^n p(θ_i | S^h)
Parameter Independence

(Figure: the same plate diagram — Θ_1, Θ_{2|1}, Θ_{2|¬1} appear as mutually independent parameter nodes feeding the X1, X2 nodes of every sample.)
No Missing Data

Each parameter can be updated independently:

p(θ_{2|1} | N^1_{2|1}, N^2_{2|1}, S^h) ∝ p(θ_{2|1}) θ_{2|1}^{N^1_{2|1}} (1 − θ_{2|1})^{N^2_{2|1}}
Updating Parameters

S: X1 → X2

Data (D):  X1     X2
x1:        true   false
x2:        false  true
x3:        true   true
x4:        true   true
x5:        false  true
x6:        true   false

p(θ_1 | D) ∝ p(θ_1) θ_1^4 (1 − θ_1)^2
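Counting the slide's six cases reproduces the exponents in these likelihoods; a minimal sketch:

```python
# The slide's data set D, as (X1, X2) pairs.
D = [
    (True,  False),  # x1
    (False, True),   # x2
    (True,  True),   # x3
    (True,  True),   # x4
    (False, True),   # x5
    (True,  False),  # x6
]

# Counts for theta_1 = p(X1 = true): likelihood theta_1^4 (1 - theta_1)^2.
n1_true = sum(1 for x1, _ in D if x1)
n1_false = len(D) - n1_true

# Counts for theta_{2|1}: restrict to the cases with X1 = true.
n2_true_g1 = sum(1 for x1, x2 in D if x1 and x2)
n2_false_g1 = sum(1 for x1, x2 in D if x1 and not x2)
```

These counts are exactly the sufficient statistics that enter the Beta posterior updates on the following slides.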
Updating Parameters

Data (D):  X1     X2
x1:        true   false
x2:        false  true
x3:        true   true
x4:        true   true
x5:        false  true
x6:        true   false

p(θ_{2|1} | D) ∝ p(θ_{2|1}) θ_{2|1}^2 (1 − θ_{2|1})^2
Updating Parameters

Data (D):  X1     X2
x1:        true   false
x2:        false  true
x3:        true   true
x4:        true   true
x5:        false  true
x6:        true   false

p(θ_{2|¬1} | D) ∝ p(θ_{2|¬1}) θ_{2|¬1}^2 (1 − θ_{2|¬1})^0
Conjugate Priors

Beta(θ_1 | α_1)
Beta(θ_{2|1} | α_{2|1})
Beta(θ_{2|¬1} | α_{2|¬1})

(Figure: the plate diagram again, with a Beta prior attached to each parameter node Θ_1, Θ_{2|1}, Θ_{2|¬1}.)
Missing Data

Data:  x1: ?  true   (X1 unobserved, X2 = true)

p(θ_{2|1} | x) = p(θ_{2|1} | x_1, x_2) p(x_1 | x_2) + p(θ_{2|1} | ¬x_1, x_2) p(¬x_1 | x_2)

The two posterior terms are Beta distributions; p(x_1 | x_2) and p(¬x_1 | x_2) are the mixing coefficients.

For real problems, exact calculations are intractable.
Approximate computation of p(θ_s|D,S^h)

• Monte Carlo: Gibbs sampling, importance sampling (e.g., Neal 93)
• Gaussian approximation (e.g., Kass et al. 88)
• MAP: θ̃_s, the maximum of the posterior; as m → ∞,

p(θ_s | D, S^h) → c · p(D | θ̃_s, S^h) e^{−(θ_s − θ̃_s)^t A_s (θ_s − θ̃_s)/2}
Gibbs Sampling
Geman and Geman (1984)

Basic idea: Given X = {x_1, …, x_n} and p(X), estimate E(f(X)) as follows:

• Initialize X somehow
• Repeat
  - For i = 1, …, n: sample each x_i according to p(x_i | X \ x_i)
  - Compute f(X) using current values of X
• Average of f(X) → E(f(X)), regardless of the initial X.
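A minimal Gibbs sampler for a toy joint over two binary variables follows this recipe directly. The joint probabilities and f(X) = x1·x2 are assumptions chosen so the answer is known in closed form:

```python
import random

# Toy Gibbs sampler: estimate E[f(X)] = p(x1=1, x2=1) for an assumed joint.
random.seed(0)

# Assumed joint over (x1, x2).
p = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.2, (0, 0): 0.3}

def cond(i, other):
    """Full conditional p(x_i = 1 | the other variable's current value)."""
    if i == 1:
        num, den = p[(1, other)], p[(1, other)] + p[(0, other)]
    else:
        num, den = p[(other, 1)], p[(other, 1)] + p[(other, 0)]
    return num / den

x1, x2 = 0, 0                 # initialize somehow
total, n_iter = 0, 20000
for _ in range(n_iter):
    x1 = 1 if random.random() < cond(1, x2) else 0   # sample x1 | x2
    x2 = 1 if random.random() < cond(2, x1) else 0   # sample x2 | x1
    total += x1 * x2          # f(X) = x1 * x2

estimate = total / n_iter     # approaches p(1, 1) = 0.4
```

In practice one discards an initial burn-in period; for this tiny chain the bias is negligible.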
Using Gibbs Sampling to Approximate p(θ|D,S^h)

• Initialize θ_s
• Repeat
  - Sample the missing data D_l (given D and θ_s), producing a completed data set D_c
  - Compute p(θ_s | D_c, S^h)
  - Sample θ_s from p(θ_s | D_c, S^h)
• Average of p(θ_s | D_c, S^h) → p(θ_s | D, S^h)
Gaussian Approximation
Tierney and Kadane (1986), Kass et al. (1988)

Basic idea: for large data sets, p(θ_s | D, S^h) will often be approximately Gaussian in θ_s, peaked around its maximum.
Gaussian Approximation

Define g(θ) ≡ log(p(D|θ) p(θ)). Expand g(θ) about its maximum θ̃:

g(θ) ≈ g(θ̃) − ½ (θ − θ̃)^t A (θ − θ̃),   A ≡ −g''(θ̃)

so that

p(D|θ) p(θ) = e^{g(θ)} ≈ p(D|θ̃) p(θ̃) e^{−½ (θ − θ̃)^t A (θ − θ̃)}
Computational Considerations

• Find θ̃:
  - Gradient methods
  - Monte-Carlo methods
  - EM, if p(D|θ_s,S^h) is in the exponential family
• Compute g''(θ̃):
  - Numerical methods (Meng and Rubin 91)
  - Likelihood ratio tests (e.g., Raftery 94)
  - Via inference, if variables discrete (Thiesson 95)
Expectation-Maximization (EM) Algorithm (Dempster et al. (1977))

• Initialize parameters
• Expectation step: compute the expected sufficient statistics, e.g.

  E(N_{12} | θ_s) = ∑_{l=1}^N p(x_1, x_2 | x_l, θ_s)

• Maximization step: choose parameters so as to maximize their posterior probability given the expected sufficient statistics, e.g.

  θ_{2|1} := (α_{12} + E(N_{12} | θ_s) − 1) / (α_{12} + α_{1¬2} + E(N_{12} | θ_s) + E(N_{1¬2} | θ_s) − 2)

• Iterate. Parameters will converge to a local MAP value.
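The two steps can be sketched for the running network S: X1 → X2 with one missing value of X1. The data set, the Beta(2, 2) priors, and the starting point below are all assumptions:

```python
# EM sketch for S: X1 -> X2 with one case missing X1.
# E-step: expected counts, weighting the missing case by p(x1 | x2, theta).
# M-step: MAP of each Beta posterior, (a + N - 1) / (2a + N_total - 2).

data = [(True, True), (True, False), (None, True), (False, True)]  # (x1, x2)
a = 2.0  # assumed pseudo-count of every Beta(a, a) prior

theta1, theta2_g1, theta2_gn1 = 0.5, 0.5, 0.5   # assumed starting point
for _ in range(50):
    c1 = c12 = c1n2 = cn12 = cn1n2 = 0.0
    for x1, x2 in data:
        if x1 is None:
            # Posterior weight that the hidden X1 is true, given X2.
            p1 = theta1 * (theta2_g1 if x2 else 1 - theta2_g1)
            p0 = (1 - theta1) * (theta2_gn1 if x2 else 1 - theta2_gn1)
            w = p1 / (p1 + p0)
        else:
            w = 1.0 if x1 else 0.0
        c1 += w
        if x2:
            c12 += w
            cn12 += 1 - w
        else:
            c1n2 += w
            cn1n2 += 1 - w
    theta1 = (a + c1 - 1) / (2 * a + len(data) - 2)
    theta2_g1 = (a + c12 - 1) / (2 * a + c12 + c1n2 - 2)
    theta2_gn1 = (a + cn12 - 1) / (2 * a + cn12 + cn1n2 - 2)
```

Each iteration increases the posterior, so the parameters settle at a local MAP value.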
Learning Structure (both S^h and θ_s are uncertain)

(Figure: two candidate structures S_a and S_b over X1, X2 — which one?)

model averaging:

p(x_{m+1} | D) = p(x_{m+1} | D, S_a^h) p(S_a^h | D) + p(x_{m+1} | D, S_b^h) p(S_b^h | D)
Learning Structure (both S^h and θ_s are uncertain)

p(S^h | D) ∝ p(S^h) p(D | S^h)

p(D | S^h) = ∫ p(D | θ_s, S^h) p(θ_s | S^h) dθ_s      (marginal likelihood)

Parameters updated as before.
Model Selection

• The number of possible structures for a given domain is more than exponential in the number of variables
• Solution: Use only one or a handful of models
• Necessary components:
  - Search method
  - Scoring method
One Reasonable Score: Posterior Probability of a Structure

p(S^h | D) ∝ p(S^h) p(D | S^h)

p(D | S^h) = ∫ p(D | θ_s, S^h) p(θ_s | S^h) dθ_s

structure prior: p(S^h);  parameter prior: p(θ_s | S^h);  likelihood: p(D | θ_s, S^h)
Relation to cross-validation (Dawid 84)

log p(D | S^h) = ∑_{l=1}^m log p(x_l | x_1, …, x_{l−1}, S^h)
             = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h) + …

Obs! The last term in the sum is related to cross-validation.
Predictive Score
Spiegelhalter et al. (1993)

pred(S^h) = ∑_{l=1}^m log p(y_l | x_l, {x_1, y_1}, …, {x_{l−1}, y_{l−1}}, S^h)

(Figure: Y = disease; X1, X2, …, Xn = symptoms.)
Exact computation of p(D|S^h)

• No missing data, no hidden variables
• Independent parameters
• Local distribution functions from the exponential family, conjugate priors
• (Prior modularity)

p(θ_s | S^h) = ∏_{i=1}^n p(θ_i | S^h)
Prior Modularity

If Xi has the same parents in S1 and S2, then

p(θ_i | S_1^h) = p(θ_i | S_2^h)

That is, the parameter priors are modular: these quantities for Xi depend only on the structure that is local to Xi (i.e., Pai) and not on all of S.
Example: All Local Distributions Multinomial

Local distribution functions:

p(x_i^k | pa_i^j, θ_i, S^h) = θ_{ijk}

Conjugate prior:

p(θ_{ij} | S^h) ∝ ∏_k θ_{ijk}^{α_{ijk} − 1}
Bayesian Dirichlet Score
Cooper and Herskovits (1991)

p(D | S^h) = ∏_{i=1}^n ∏_{j=1}^{q_i} [ Γ(α_{ij}) / Γ(α_{ij} + N_{ij}) ] ∏_{k=1}^{r_i} [ Γ(α_{ijk} + N_{ijk}) / Γ(α_{ijk}) ]

N_{ijk}: # cases where X_i = x_i^k and Pa_i = pa_i^j
r_i: number of states of X_i
q_i: number of instances of the parents of X_i
α_{ij} = ∑_{k=1}^{r_i} α_{ijk},   N_{ij} = ∑_{k=1}^{r_i} N_{ijk}
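The score multiplies many Gamma-function ratios, so it is best computed in log space with `lgamma`. A sketch for the simplest case — one binary variable with no parents (q_i = 1, r_i = 2) — with an assumed uniform Dirichlet prior and the 4-true / 2-false counts used earlier:

```python
import math

def bd_term(alphas, counts):
    """One (i, j) factor of the BD score, in log space:
    Gamma(a)/Gamma(a+N) * prod_k Gamma(a_k + N_k)/Gamma(a_k)."""
    a, N = sum(alphas), sum(counts)
    log_p = math.lgamma(a) - math.lgamma(a + N)
    for a_k, N_k in zip(alphas, counts):
        log_p += math.lgamma(a_k + N_k) - math.lgamma(a_k)
    return log_p

# Assumed Beta(1, 1) prior; counts N_1 = 4 (true), N_2 = 2 (false).
log_ml = bd_term([1.0, 1.0], [4, 2])
```

For a full network the log-score is a sum of such terms over all i and j, which is what makes the score decomposable.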
Approximations for p(D|S^h)

• Monte Carlo
• Gaussian approximation:

log p(D | S^h) = log p(D | θ̃_s, S^h) + log p(θ̃_s | S^h) + (d/2) log(2π) − (1/2) log|A| + O(1/m)

• Bayesian Information Criterion (BIC):

log p(D | S^h) = log p(D | θ̂_s, S^h) − (d/2) log(m) + O(1)
Bayesian Information Criterion (BIC)
Schwarz (1978)

log p(D | S^h) ≈ log p(D | θ̂_s, S^h) − (d/2) log m

• No priors needed
• BIC = − approximation of stochastic complexity (Rissanen 87)
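A minimal numeric sketch of the BIC formula, again for a single binary variable (d = 1 free parameter) with the assumed 4-true / 2-false counts:

```python
import math

# BIC sketch: log p(D|S^h) ≈ log p(D | theta_hat, S^h) - (d/2) log m.
n_true, n_false = 4, 2          # assumed counts
m = n_true + n_false            # sample size
d = 1                           # one free parameter

theta_hat = n_true / m          # maximum-likelihood estimate
loglik = n_true * math.log(theta_hat) + n_false * math.log(1 - theta_hat)
bic = loglik - (d / 2) * math.log(m)
```

Note that, unlike the full marginal likelihood, this needs no prior: only the ML fit and a dimension penalty.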
Another Approximation
Cheeseman & Stutz (1995)

log p(D | S^h) ≈ log p_EM(D | S^h)

p_EM(D | S^h) ≈ ∏_{i=1}^n ∏_{j=1}^{q_i} [ Γ(α_{ij}) / Γ(α_{ij} + E(N_{ij})) ] ∏_{k=1}^{r_i} [ Γ(α_{ijk} + E(N_{ijk})) / Γ(α_{ijk}) ]

• Accuracy (CH96): Gaussian > CS >> BIC
• As efficient as BIC
Problems When There Are Many Possible Structures

• Prior assessment
• Search
Assumptions for Constructing Parameter Priors
Geiger and Heckerman (1995)

Relatively small # of assessments ⇒ parameter priors for all structures in a given domain
Assumptions That Simplify Assessment of Priors

• Parameter independence
• Prior modularity
• (Conjugate priors)
• Marginal likelihood equivalence
Distribution Equivalence

Suppose the local likelihoods are restricted to some family F (e.g., discrete, linear regression). Two network structures for X are distribution equivalent wrt F if they encode the same distributions on X.
Independence and Distribution Equivalence

S1, S2 distribution eqv wrt some F ⇒ S1, S2 independence eqv,
but the converse does not hold for all F. E.g., for sigmoid local likelihoods

p(x_i | pa_i, θ_i, S^h) = 1 / (1 + exp(a_i + ∑_{x_j ∈ pa_i} b_{ij} x_j))

(Figure: two independence-equivalent structures over X1, X2, X3) are not distribution equivalent.
Assumption: Covered-Arc-Reversal Equivalence

Given local likelihoods restricted to F, any two structures for X that differ only by a covered arc reversal are distribution equivalent.

Examples:
- unrestricted discrete
- linear regression (e.g., Shachter and Kenley)

S1, S2 independence eqv ⇒ S1, S2 distribution eqv wrt F
θ_x: Parameters for the Joint Likelihood

p(θ_x | S_c^h),   p(x | θ_x, S_c^h)

(Figure: the six complete structures over {X1, X2, X3}, one per variable ordering: X1 X2 X3, X1 X3 X2, X2 X1 X3, X2 X3 X1, X3 X1 X2, X3 X2 X1.)
θ_x: Discrete Case

p(x | θ_x, S_c^h) = p(x_1, …, x_n | θ_x, S_c^h) = θ_{x_1,…,x_n}

θ_{x_1,…,x_n} = ∏_{i=1}^n θ_{x_i | x_1,…,x_{i−1}}

∂θ_x / ∂θ_{s_c} = ∏_{i=1}^n [ θ_{x_i | x_1,…,x_{i−1}} ]^{(∏_{j=i+1}^n r_j) − 1}
θ_x: Linear-Regression Case

p(x | θ_x, S_c^h) = n(x | µ, W)

µ_i = m_i + ∑_{j=1}^{i−1} b_{ji} µ_j

W(i+1) = [ W(i) + b_{i+1} b_{i+1}^t / v_{i+1}    −b_{i+1} / v_{i+1} ]
         [ −b_{i+1}^t / v_{i+1}                   1 / v_{i+1}       ]
Discrete Case: Dirichlet Priors are Inevitable

Given parameter independence and likelihood equivalence:

p(θ_x | S_c^h) = c · ∏_{x_1,…,x_n} θ_{x_1,…,x_n}^{α·p(x_1,…,x_n | S_c^h) − 1}
Discrete Case

parameter independence + likelihood equivalence ⇓ (for X1 → X2 versus X1 ← X2)

f_1(θ_1) f_{2|1}(θ_{2|1}) f_{2|¬1}(θ_{2|¬1}) / (θ_1(1−θ_1)) = f_2(θ_2) f_{1|2}(θ_{1|2}) f_{1|¬2}(θ_{1|¬2}) / (θ_2(1−θ_2))
Assessments for the BDe Score

• p(θ_x | S_c^h) is Dirichlet
• α — equivalent sample size
• p(X_1, …, X_n | S_c^h) — prior Bayesian network for {X1, …, Xn}
BDe Score
Heckerman, Geiger, and Chickering (1994)

p(D | S^h) = ∏_{i=1}^n ∏_{j=1}^{q_i} [ Γ(α_{ij}) / Γ(α_{ij} + N_{ij}) ] ∏_{k=1}^{r_i} [ Γ(α_{ijk} + N_{ijk}) / Γ(α_{ijk}) ]

α_{ijk} = α · p(X_i = x_i^k, Pa_i = pa_i^j | S_c^h)

"Equivalent structures get equal scores."
Linear-Regression Case: Normal-Wishart Prior is Inevitable

Given parameter independence and likelihood equivalence:

p(θ_x | S_c^h) = nW(θ_x | µ_0, α_µ, W_0, α_W)
Combine Domain Knowledge and Data

Prior network + equivalent sample size (α) + data ⇒ priors for all structures ⇒ learned structure

(Figure: a prior network over X1, …, X4; a data table of cases x1, …, xN over X1, X2, X3, …, Xn; and the learned structure.)
Structure Priors

p(S^h | D) ∝ p(S^h, D) = p(S^h) p(D | S^h)

p(S^h) is the structure prior.
Structure Prior

p(S^h) ∝ κ^δ,   0 < κ < 1

where δ is the number of arcs that differ between S^h and the prior network.
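This prior is easy to compute once structures are encoded as arc sets; a small sketch with assumed arc sets (note that this simple encoding counts a reversed arc as two differences):

```python
# Structure-prior sketch: p(S^h) ∝ kappa^delta, where delta is the number
# of arcs on which S^h differs from the prior network.  Assumed values.

kappa = 0.5
prior_arcs = {(1, 2), (3, 4)}   # assumed prior network: X1->X2, X3->X4

def structure_prior(arcs):
    """Unnormalized prior; symmetric difference counts differing arcs."""
    delta = len(arcs ^ prior_arcs)
    return kappa ** delta

p_same = structure_prior({(1, 2), (3, 4)})   # delta = 0
p_one_off = structure_prior({(1, 2)})        # delta = 1
```

Because the prior factors over arcs, it combines cleanly with decomposable scores during search.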
Search Methods

The score decomposes into local terms:

score(S, D) = ∏_{i=1}^n s(X_i, Pa_i, D)
Special Case: Each Node Has at Most One Parent

score(S^h, D) = ∑_{i=1}^n log s(X_i, Pa_i, D) = ∑_{i=1}^n w(X_i, Pa_i, D) + ∑_{i=1}^n log s(X_i, ∅, D)

w(X_i, X_j, D) ≡ log s(X_i, X_j, D) − log s(X_i, ∅, D)
Each Node Has at Most One Parent

The network with the highest probability is the one for which Σ w(X_i, Pa_i, D) is a maximum.

• Scores without likelihood + prior equivalence: maximum branchings (Edmonds)
• Scores with likelihood + prior equivalence: maximum spanning tree
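In the equivalent-score case w(Xi, Xj, D) is symmetric, so a standard maximum-weight spanning tree (here via a small Prim-style sketch) finds the best one-parent-per-node network. The weights are made-up numbers for illustration:

```python
# Maximum-weight spanning tree over assumed symmetric weights w(Xi, Xj, D).

w = {
    0: {1: 0.9, 2: 0.1},
    1: {0: 0.9, 2: 0.5},
    2: {0: 0.1, 1: 0.5},
}

def max_spanning_tree(w):
    nodes = set(w)
    in_tree = {min(nodes)}      # start from an arbitrary (fixed) node
    edges = []
    while in_tree != nodes:
        # Pick the heaviest edge leaving the current tree.
        i, j = max(((i, j) for i in in_tree for j in nodes - in_tree),
                   key=lambda e: w[e[0]][e[1]])
        edges.append((i, j))
        in_tree.add(j)
    return edges

tree = max_spanning_tree(w)
```

Any orientation of the resulting tree gives a structure with the same (equivalent) score.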
General Case: Each Node Has at Most k Parents

Finding the best structure is NP-hard for k > 1
• Greedy (+/− restarts)
• Best-first search
• Simulated annealing
• Other Monte-Carlo approaches
Local Search

• Initialize the structure (prior structure, empty graph, max spanning tree, or random graph)
• Repeat:
  - Score all possible single changes
  - If any change improves the score, perform the best change
  - Otherwise, return the saved structure
• Extension: multiple restarts
Simulated Annealing (Metropolis)

• Initialize the structure (prior structure, empty graph, max spanning tree, or random graph)
• Repeat until it is time to quit:
  - Pick a random change and compute p = exp(Δscore/T)
  - Perform the change with probability p
• Return the saved structure
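The acceptance rule p = exp(Δscore/T) can be made concrete on a toy problem. Here "structures" are bit tuples marking the presence of three candidate arcs, and the score function is an assumption chosen just to have a known optimum:

```python
import math
import random

# Toy simulated annealing over 3-bit "structures"; score is assumed.
random.seed(1)

def score(s):
    # Assumed score with unique maximum at (1, 0, 0).
    return -((s[0] - 1) ** 2 + s[1] + s[2])

s = (0, 1, 1)        # initial structure
best = s
T = 2.0              # initial temperature
for step in range(500):
    i = random.randrange(3)
    cand = tuple(v ^ 1 if k == i else v for k, v in enumerate(s))  # flip one arc
    delta = score(cand) - score(s)
    # Always accept improvements; accept worsenings with prob exp(delta/T).
    if delta >= 0 or random.random() < math.exp(delta / T):
        s = cand
    if score(s) > score(best):
        best = s
    T *= 0.99        # cooling schedule
```

As T shrinks, worsening moves are accepted less often and the chain settles into a high-scoring structure.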
Evaluation Methodology

A gold standard network generates a random sample (data), possibly corrupted by noise (η); score + search, starting from a prior network, produces the learned network(s).

Two measures of utility of the learned network:
- Cross entropy (gold standard network, learned network)
- Structural difference
(Figure: gold standard network, prior network, data, and learned network; in the learned network, arcs are marked A (added), D (deleted), or R (reversed) relative to the gold standard.)
Search over equivalence classes
Spirtes and Meek (1995)

(Figure: gold standard network, a prior network with no arcs, data of 10,000 cases, and the learned network with one arc marked D.)
Learning with Hidden Variables

Special case of learning with missing data. E.g.: AutoClass (Cheeseman)

(Figure: hidden variable H (unobserved) with children X1, X2, …, Xn (observed).)
Identifying Hidden Variables

• Prior knowledge (e.g., AutoClass)
• Dependency cliques